Topic Modeling News Headlines to Classify Articles

Link to Github repo

Introduction

In this project we apply topic modeling to the Australian Broadcasting Corporation (“ABC”) headlines dataset, which pairs the text and publication date of ~1.1M ABC News headlines published over 2003–2017. The goal is to uncover common topics across headlines without supervision, and then to assign unseen headlines to a topic category (with applications in document indexing/retrieval and content recommendation). We compare several methods: Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA, via scikit‑learn), Latent Semantic Indexing (LSI, the same SVD‑based technique under its information‑retrieval name, via Gensim), and the Hierarchical Dirichlet Process (HDP).


Exploratory Data Analysis

The ABC News dataset spans 1.1M headlines across news, politics, business, sports, opinion, etc. Figure 1 shows a word cloud of the most common tokens—police, new, man, says, govt, court, council, interview, NSW, and Australia—hinting at everyday news: law enforcement, government announcements, courts, and interviews.

Most common words in ABC News Headline dataset

We also inspect headline lengths: headlines average 6.4 words and 40.2 characters, for a corpus of about 7.1M words in total.

Distribution of headline word lengths

Distribution of headline character lengths
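
As a minimal sketch of how such length statistics can be computed, the snippet below uses three made-up headlines in place of the real corpus. The function name `length_stats` and the sample data are illustrative assumptions, not taken from the original notebook.

```python
from statistics import mean

def length_stats(headlines):
    """Return (mean words, mean characters, total words) for a list of headline strings."""
    word_counts = [len(h.split()) for h in headlines]
    char_counts = [len(h) for h in headlines]
    return mean(word_counts), mean(char_counts), sum(word_counts)

# Toy stand-ins for the real headlines (assumption: one plain string per headline).
sample_headlines = [
    "police investigate crash",
    "council approves new plan",
    "man charged over fire",
]
avg_words, avg_chars, total_words = length_stats(sample_headlines)
print(avg_words, avg_chars, total_words)
```

On the full dataset the same per-headline counts would also feed the two histograms above.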

We extract part‑of‑speech (POS) tags with TextBlob to understand grammatical structure. Nouns (NN, NNS), adjectives (JJ), prepositions (IN), and verbs (VB, VBP, VBZ) dominate.

from textblob import TextBlob

# One list of (word, POS tag) pairs per headline; reindexed_data holds the headline strings.
tagged_headlines = [TextBlob(headline).pos_tags for headline in reindexed_data]

POS distribution
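
One way to turn the tagged headlines into a tag distribution is a simple tally. This is a sketch under the assumption that each headline is a list of (word, tag) pairs, which is the shape TextBlob's `pos_tags` returns; the helper name `pos_tag_counts` and the hand-written example data are hypothetical.

```python
from collections import Counter

def pos_tag_counts(tagged_headlines):
    """Count POS tags across headlines; each headline is a list of (word, tag) pairs."""
    return Counter(tag for headline in tagged_headlines for _, tag in headline)

# Hand-written example in the (word, tag) format, not real tagger output.
example_tagged = [
    [("police", "NN"), ("investigate", "VBP"), ("crash", "NN")],
    [("new", "JJ"), ("plan", "NN"), ("for", "IN"), ("schools", "NNS")],
]
tag_counts = pos_tag_counts(example_tagged)
print(tag_counts.most_common(3))
```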

Next, we chart headline counts by year, month, and day (2003–2017). Yearly counts rise from 2004 to 2014 before declining; monthly counts show sharp 50–70% dips in Sep‑2006, Jan‑2015, and Jan‑2016.

Daily counts appear capped at ~250/day in 2003–2011 and ~400/day in 2012–2016, then fall back to ~200–250/day from 2016 onward. Only eight zero‑headline days appear, all before 2009.

Year-, month-, and day-level trends

Seasonality: day‑of‑month counts are fairly flat (with the 31st lower, as expected, since only seven months have 31 days). Weekends see roughly half the headlines of weekdays. The Australian summer months (Dec–Feb) run ~8% lower than the rest of the year.

Seasonality by day / weekday / month
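
The weekday and month groupings can be sketched with the standard library alone. The snippet assumes each headline carries a publication date stored as a `YYYYMMDD` value; that format, the function name, and the example dates are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

def counts_by_weekday_and_month(publish_dates):
    """Count headlines per weekday (0 = Monday ... 6 = Sunday) and per month number."""
    by_weekday, by_month = Counter(), Counter()
    for raw in publish_dates:
        day = datetime.strptime(str(raw), "%Y%m%d")
        by_weekday[day.weekday()] += 1
        by_month[day.month] += 1
    return by_weekday, by_month

# Made-up dates, one entry per headline (assumption: dates are stored as YYYYMMDD).
example_dates = ["20030219", "20030219", "20030222", "20041225"]
by_weekday, by_month = counts_by_weekday_and_month(example_dates)
print(by_weekday, by_month)
```

Dividing each weekday count by the number of such weekdays in the period would give a per-day average, which is the fairer basis for a weekday versus weekend comparison.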


Theoretical Background to Topic Modeling

We’ll keep the math light (see references for deeper dives) and focus on LDA intuition: each document is a distribution over topics, and each topic is a distribution over words; word order is ignored (bag‑of‑words). For K topics, LDA iteratively updates each word’s topic assignment using two quantities:

  • p(w_j | t_k) — how strongly word w_j associates with topic t_k across the corpus (global relevance).
  • p(t_k | d_i) — how prevalent topic t_k is within document d_i (local relevance).

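Their product p(w_j | t_k) · p(t_k | d_i) is proportional to the probability that topic t_k generated word w_j in document d_i. Each word’s topic assignment is redrawn from these values, normalized over the K topics. Iterate until convergence (or a maximum number of iterations). A high‑level sketch: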

LDA steps
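
To make the update concrete, here is a toy collapsed Gibbs sampler, which is one common way to fit LDA and the one that matches the intuition above most directly. It is a sketch only: the libraries used later rely on their own inference routines, and the smoothing constants `alpha` and `beta`, the function name, and the toy documents are assumptions chosen for illustration.

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, n_iters=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler. docs is a list of documents, each a list of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # times word w is assigned to topic k
    topic_total = [0] * n_topics                              # total words assigned to topic k

    def adjust(d, w, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][w] += delta
        topic_total[k] += delta

    # Start from a random topic for every word occurrence.
    assignments = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, assignments[d]):
            adjust(d, w, k, +1)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                adjust(d, w, assignments[d][i], -1)  # forget the current assignment
                weights = [
                    # smoothed p(w | t_k) times smoothed, unnormalized p(t_k | d)
                    (topic_word[k][w] + beta) / (topic_total[k] + beta * vocab_size)
                    * (doc_topic[d][k] + alpha)
                    for k in range(n_topics)
                ]
                new_k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = new_k
                adjust(d, w, new_k, +1)
    return doc_topic, topic_word

# Word ids 0-2 and 3-5 stand for two made-up themes.
toy_docs = [[0, 1, 2, 0], [1, 2, 0], [3, 4, 5, 3], [4, 5, 3]]
doc_topic, topic_word = lda_gibbs(toy_docs, n_topics=2, vocab_size=6)
print(doc_topic)
```

The denominator of p(t_k | d_i) is the same for every topic, so it is dropped from the weights; `rng.choices` normalizes them when sampling.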


Topic Modeling

We compare two preprocessing paths:

  1. scikit‑learn: remove stopwords only (no lemmatization/stemming).
  2. NLTK + Gensim: remove stopwords and lemmatize/stem.

i) LSA using scikit‑learn

Sample 10,000 headlines. Build a document‑term matrix with CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

# tdec is a decorator defined elsewhere in the project (not shown here).
@tdec
def words2vec(data, max_features=40000):
    """Fit a CountVectorizer (English stopwords removed) and return it with the document-term matrix."""
    count_vectorizer = CountVectorizer(stop_words='english', max_features=max_features)
    document_term_matrix = count_vectorizer.fit_transform(data)
    return count_vectorizer, document_term_matrix

# Fixed seed so the 10,000-headline sample is reproducible.
small_text_sample = reindexed_data.sample(n=10000, random_state=0).values
counter_vectorizer, small_document_term_matrix = words2vec(data=small_text_sample, max_features=40000)

Quick check:

# inv_transform_count_vectorizer is a helper defined elsewhere in the project (not shown here).
print('Before preprocessing ', small_text_sample[1])
print('Words converted to vector ', small_document_term_matrix[1])
print('Word vector inverse transformed ', inv_transform_count_vectorizer(counter_vectorizer, small_document_term_matrix[1]))

Fit LSA (TruncatedSVD, 8 topics) and inspect top words per topic:

from sklearn.decomposition import TruncatedSVD

lsa_model = TruncatedSVD(n_components=8)
lsa_topic_matrix = lsa_model.fit_transform(small_document_term_matrix)
lsa_keys = get_keys(lsa_topic_matrix)
lsa_categories, lsa_counts = keys_to_counts(lsa_keys)

top_n_words_lsa = get_top_words(
    n=15, n_topics=8, keys=lsa_keys,
    document_term_matrix=small_document_term_matrix,
    count_vectorizer=counter_vectorizer,
)
for i, words in enumerate(top_n_words_lsa, start=1):
    print(f"Topic {i}: ", words)
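
The helpers `get_keys`, `keys_to_counts`, and `get_top_words` are not shown in this post. The sketch below is one plausible reading of what they do, based only on how they are called above. It works on dense NumPy arrays and takes a plain `vocabulary` list where the original passes the fitted vectorizer, so treat the signatures and bodies as assumptions.

```python
from collections import Counter
import numpy as np

def get_keys(topic_matrix):
    """Index of the highest-weighted topic for each document (row)."""
    return np.asarray(topic_matrix).argmax(axis=1).tolist()

def keys_to_counts(keys):
    """Return (topic categories, number of documents assigned to each)."""
    counts = Counter(keys)
    categories = sorted(counts)
    return categories, [counts[c] for c in categories]

def get_top_words(n, n_topics, keys, document_term_matrix, vocabulary):
    """For each topic, the n most frequent words among documents assigned to it."""
    dtm = np.asarray(document_term_matrix)
    keys = np.asarray(keys)
    top_words = []
    for topic in range(n_topics):
        word_totals = dtm[keys == topic].sum(axis=0)
        best = np.argsort(word_totals)[::-1][:n]
        top_words.append(" ".join(vocabulary[i] for i in best))
    return top_words

# Toy data: 3 documents, 2 topics, 3 vocabulary words.
toy_topic_matrix = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
toy_dtm = [[2, 0, 1], [1, 3, 2], [1, 0, 0]]
toy_vocab = ["police", "court", "fire"]

toy_keys = get_keys(toy_topic_matrix)
toy_categories, toy_counts = keys_to_counts(toy_keys)
toy_top_words = get_top_words(2, 2, toy_keys, toy_dtm, toy_vocab)
print(toy_keys, toy_counts, toy_top_words)
```

With scikit-learn objects, the dense array would come from calling `.toarray()` on the sparse document-term matrix and the vocabulary from the vectorizer's `get_feature_names_out()`.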

Table I. Sklearn LSA predicted topic categories (top words shown in the original post).

Distribution by LSA topic

t‑SNE on the LSA topic weights (LSA yields signed loadings rather than probabilities) shows weak separability, so LSA seems ill‑suited here.

t-SNE of LSA topics
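
For readers who want to reproduce this kind of plot, a t-SNE projection of a documents-by-topics matrix might look like the following sketch. The random matrix stands in for the real topic matrix, and the perplexity value is an arbitrary choice for this toy data, not the setting used in the original analysis.

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-in for a (documents x topics) matrix: 60 rows, 8 topic weights each.
rng = np.random.default_rng(0)
toy_topic_matrix = rng.random((60, 8))

# perplexity must be smaller than the number of rows.
tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=0)
embedding = tsne.fit_transform(toy_topic_matrix)   # one 2-D point per document
dominant_topic = toy_topic_matrix.argmax(axis=1)   # can serve as the point color when plotting
print(embedding.shape)
```

Coloring each point by its dominant topic is what makes cluster separation, or the lack of it, visible.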

ii) LDA using scikit‑learn

Train LDA (8 topics) and review top words:

from sklearn.decomposition import LatentDirichletAllocation

n_topics = 8  # same number of topics as the LSA run above

lda_model_sklearn = LatentDirichletAllocation(n_components=n_topics, learning_method='online', random_state=0)
# Rows are headlines, columns are topic proportions (each row sums to 1).
lda_topic_matrix_sklearn = lda_model_sklearn.fit_transform(small_document_term_matrix)
lda_keys_sklearn = get_keys(lda_topic_matrix_sklearn)
lda_categories_sklearn, lda_counts_sklearn = keys_to_counts(lda_keys_sklearn)

top_n_words_lda_sklearn = get_top_words(
    n=15, n_topics=n_topics, keys=lda_keys_sklearn,
    document_term_matrix=small_document_term_matrix,
    count_vectorizer=counter_vectorizer,
)
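
The stated goal includes assigning unseen headlines to a topic. A minimal, self-contained sketch of that step with scikit-learn is shown here; the six training headlines, the unseen headline, and the choice of two topics are made up for illustration and are far too small to produce meaningful topics.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus standing in for the sampled headlines.
train_headlines = [
    "police charge man over assault",
    "court hears police assault case",
    "man faces court on assault charge",
    "council approves water plan",
    "water plan funding backed by council",
    "council debates new water funding",
]
vectorizer = CountVectorizer(stop_words="english")
train_matrix = vectorizer.fit_transform(train_headlines)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(train_matrix)

# Reuse the fitted vectorizer so the unseen headline maps onto the same vocabulary.
unseen = ["police investigate assault near court"]
topic_distribution = lda.transform(vectorizer.transform(unseen))[0]
predicted_topic = int(topic_distribution.argmax())
print(topic_distribution, predicted_topic)
```

Words in the unseen headline that were never seen during fitting are silently ignored by the vectorizer, so very short or unusual headlines may get a near-uniform topic distribution.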

Distribution by LDA topic

t‑SNE now shows much better cluster separation, so LDA is the better fit. We therefore scale up to 100k headlines.

t-SNE of LDA topics

With 100k headlines, topic prevalence by year:

Correlation heatmap by year

Topic frequencies by year
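
Topic prevalence by year can be derived from two parallel sequences: the publication year of each headline and its assigned topic. The following is a sketch with invented values; the function name and inputs are assumptions rather than the notebook's actual code.

```python
from collections import Counter, defaultdict

def topic_share_by_year(years, keys):
    """Fraction of each year's headlines assigned to each topic."""
    counts = defaultdict(Counter)
    for year, topic in zip(years, keys):
        counts[year][topic] += 1
    return {
        year: {topic: n / sum(topic_counts.values()) for topic, n in topic_counts.items()}
        for year, topic_counts in counts.items()
    }

# Invented example: five headlines, their years, and their dominant topics.
example_years = [2003, 2003, 2003, 2004, 2004]
example_keys = [0, 1, 1, 0, 0]
shares = topic_share_by_year(example_years, example_keys)
print(shares)
```

Using shares rather than raw counts keeps years with more headlines from dominating the comparison.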

iii) LDA using NLTK and Gensim

Preprocess with lemmatization + stemming and build a bag‑of‑words dictionary:

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer

# WordNetLemmatizer needs the WordNet corpus: run nltk.download('wordnet') once.
# Build the stemmer and lemmatizer once instead of once per token.
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def lemmatize_and_stem(text):
    # Lemmatize as a verb, then stem the result.
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def lemmatize_stem_remove_stopwords(text):
    # Tokenize, drop stopwords and tokens of 3 characters or fewer, then normalize.
    return [
        lemmatize_and_stem(token)
        for token in simple_preprocess(text)
        if token not in STOPWORDS and len(token) > 3
    ]

processed_docs = raw_data['headline_text'].map(lemmatize_stem_remove_stopwords)

bowdict = gensim.corpora.Dictionary(processed_docs)
# Drop tokens found in fewer than 15 headlines or in more than 50% of them; keep at most 100,000.
bowdict.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
bow_corpus = [bowdict.doc2bow(doc) for doc in processed_docs]
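
To see what the `filter_extremes` thresholds do, the plain-Python sketch below mimics the filtering on a toy corpus. It follows the documented meaning of `no_below`, `no_above`, and `keep_n`, but it is not Gensim's code: the function name is hypothetical and the alphabetical tie-break is my own choice.

```python
from collections import Counter

def filter_extremes_sketch(tokenized_docs, no_below, no_above, keep_n):
    """Return the tokens a document-frequency filter would keep, most frequent first."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each token at least once.
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    kept = [
        token for token, df in doc_freq.items()
        if df >= no_below and df <= no_above * n_docs
    ]
    kept.sort(key=lambda token: (-doc_freq[token], token))
    return kept[:keep_n]

# Toy corpus of already-stemmed tokens.
toy_corpus = [
    ["polic", "charg"],
    ["polic", "court"],
    ["polic", "fire"],
    ["court", "fire"],
]
kept_tokens = filter_extremes_sketch(toy_corpus, no_below=2, no_above=0.5, keep_n=10)
print(kept_tokens)
```

Here "polic" is dropped for appearing in more than half of the documents and "charg" for appearing in only one, which is the same trade-off the real thresholds make between overly common and overly rare words.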

Fit Gensim LDA (LdaMulticore) and print topics:

lda_model_gensim = gensim.models.LdaMulticore(bow_corpus, num_topics=n_topics, id2word=bowdict, passes=2, workers=2)
for idx, topic in lda_model_gensim.print_topics(-1):
    print(f'Topic: {idx}\nWords: {topic}')

t-SNE of Gensim LDA topics

Distribution by Gensim LDA topic

iv) LSI and HDP using Gensim

Implementations are straightforward:

from gensim.models import LsiModel, HdpModel

bow_vectors = [bowdict.doc2bow(lemmatize_stem_remove_stopwords(doc)) for doc in headlines_raw]
lsi_model = LsiModel(corpus=bow_vectors, num_topics=10, id2word=bowdict)
lsi_model.show_topics(num_topics=8)

hdp_model = HdpModel(corpus=bow_vectors, id2word=bowdict)
hdp_model.show_topics(num_topics=8)

The resulting topics appear lower‑quality than those from LDA for this use case.


Conclusion

Across LDA, LSA, LSI, and HDP (via scikit‑learn, NLTK, and Gensim), scikit‑learn LDA on text with only stopwords removed produced the best‑separated topic clusters in the t‑SNE plots. Introducing lemmatization/stemming yielded tighter local clusters but worse global separation for this corpus. Next steps: try non‑negative matrix factorization and explicit semantic analysis.

Thanks for reading!

Sources