Topic Modeling News Headlines to Classify Articles
Introduction
This project applies topic modeling to the Australian Broadcasting Corporation (“ABC”) headlines dataset, which pairs the text and publication date of ~1.1M ABC News headlines published over 2003–2017. The goal is to uncover common topics across headlines without supervision and then assign unseen headlines to a topic category, with applications in document indexing/retrieval and content recommendation. We compare several methods: Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), Latent Semantic Indexing (LSI, Gensim's name for the same SVD‑based technique as LSA), and Hierarchical Dirichlet Process (HDP).
Exploratory Data Analysis
The ABC News dataset spans 1.1M headlines across news, politics, business, sports, opinion, etc. Figure 1 shows a word cloud of the most common tokens—police, new, man, says, govt, court, council, interview, NSW, and Australia—hinting at everyday news: law enforcement, government announcements, courts, and interviews.
We also inspect headline lengths: headlines average 6.4 words and 40.2 characters, for a corpus of about 7.1M words in total.
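Length statistics of this kind take only a few lines of plain Python. The following is a minimal sketch on three made-up headlines; `headline_length_stats` is a hypothetical helper name, not something from the original notebook.

```python
def headline_length_stats(headlines):
    """Return (mean words per headline, mean characters per headline, total words)."""
    word_counts = [len(h.split()) for h in headlines]
    char_counts = [len(h) for h in headlines]
    n = len(headlines)
    return sum(word_counts) / n, sum(char_counts) / n, sum(word_counts)

# Made-up headlines for illustration only (not taken from the dataset).
sample_headlines = [
    "police investigate city break in",
    "council approves new budget",
    "man charged over assault",
]
mean_words, mean_chars, total_words = headline_length_stats(sample_headlines)
print(f"{mean_words:.1f} words, {mean_chars:.1f} characters, {total_words} words in total")
```

On the real data the same function would be called on the full column of headline strings.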
We extract part‑of‑speech (POS) tags with TextBlob to understand grammatical structure. Nouns (NN, NNS), adjectives (JJ), prepositions (IN), and verbs (VB, VBP, VBZ) dominate.
from textblob import TextBlob
# reindexed_data holds the headline strings; pos_tags yields (word, tag) pairs per headline
tagged_headlines = [TextBlob(headline).pos_tags for headline in reindexed_data]
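To see which tags dominate, the tagged headlines can be tallied with a `Counter`. This is a minimal sketch, assuming `tagged_headlines` is a list of lists of `(word, tag)` pairs as TextBlob returns; the hand-tagged sample below is a stand-in so the snippet runs without TextBlob, and `pos_tag_counts` is a hypothetical helper name.

```python
from collections import Counter

def pos_tag_counts(tagged_headlines):
    """Count POS tags over headlines, each given as a list of (word, tag) pairs."""
    return Counter(tag for headline in tagged_headlines for _, tag in headline)

# Hand-tagged stand-in for TextBlob output, for illustration only.
example_tagged = [
    [("police", "NN"), ("probe", "VBP"), ("fire", "NN")],
    [("new", "JJ"), ("laws", "NNS"), ("in", "IN"), ("nsw", "NN")],
]
tag_counts = pos_tag_counts(example_tagged)
print(tag_counts.most_common(3))
```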
Next, we chart headline counts by year, month, and day (2003–2017). Yearly counts rise 2004–2014 before declining; months show sharp 50–70% dips in Sep‑2006, Jan‑2015, and Jan‑2016.
Daily counts appear capped at ~250/day in 2003–2011, ~400/day in 2012–2016, then back to ~200–250/day in 2016–2018. Only eight zero‑headline days appear, all pre‑2009.
Seasonality: day‑of‑month counts are fairly flat (with the 31st lower, as expected, since only seven months have a 31st). Weekends see roughly half the headlines of weekdays. December through February (summer in Australia) runs ~8% lower than the rest of the year.
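Breakdowns like the weekday/weekend comparison come from grouping headlines by their publication date. Here is a minimal sketch, assuming dates are stored as `YYYYMMDD` values (an assumption about the data layout); `counts_by_weekday` is a hypothetical helper and the dates are made up.

```python
from collections import Counter
from datetime import datetime

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

def counts_by_weekday(publish_dates):
    """Count headlines per weekday from 'YYYYMMDD' dates (strings or ints)."""
    counts = Counter()
    for raw in publish_dates:
        day = datetime.strptime(str(raw), "%Y%m%d")
        counts[WEEKDAY_NAMES[day.weekday()]] += 1  # weekday(): Monday is 0
    return counts

# Made-up publication dates, one per headline, for illustration only.
example_dates = ["20030219", "20030219", "20030222", 20030223]
weekday_counts = counts_by_weekday(example_dates)
print(weekday_counts)
```

Swapping `day.weekday()` for `day.year`, `day.month`, or `day.day` gives the yearly, monthly, and day‑of‑month views.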
Theoretical Background to Topic Modeling
We’ll keep the math light (see references for deeper dives) and focus on LDA intuition: each document is a distribution over topics, and each topic is a distribution over words; word order is ignored (bag‑of‑words). For K topics, LDA iteratively updates word‑topic assignments using:
- p(w_j | t_k) — how strongly word w_j associates with topic t_k across the corpus (global relevance).
- p(t_k | d_i) — how prevalent topic t_k is within document d_i (local relevance).
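Multiplying the two gives a score proportional to p(t_k | w_j, d_i), the probability that word w_j in document d_i belongs to topic t_k; each word’s topic is then reassigned according to these scores. Iterate until convergence (or a maximum number of iterations).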
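The update loop described above can be sketched in plain Python as a toy collapsed Gibbs sampler. This is one common way to fit LDA and is shown only for intuition; scikit‑learn and Gensim use variational Bayes instead. The function name `gibbs_lda`, the smoothing values `alpha` and `beta`, and the toy documents are all assumptions for illustration.

```python
import random

def gibbs_lda(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over documents given as token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]            # words in doc d assigned to topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # times word w is assigned to topic k
    topic_total = [0] * n_topics                           # words assigned to topic k overall
    assignments = []

    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k_old = assignments[d][i]
                # Remove this word's current assignment from the counts.
                doc_topic[d][k_old] -= 1
                topic_word[k_old][wid] -= 1
                topic_total[k_old] -= 1
                # Score each topic: p(w_j | t_k) * p(t_k | d_i), both smoothed.
                weights = [
                    (topic_word[k][wid] + beta) / (topic_total[k] + n_words * beta)
                    * (doc_topic[d][k] + alpha)
                    for k in range(n_topics)
                ]
                k_new = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k_new
                doc_topic[d][k_new] += 1
                topic_word[k_new][wid] += 1
                topic_total[k_new] += 1

    return vocab, doc_topic, topic_word

# Made-up tokenized headlines for illustration only.
toy_docs = [
    ["police", "court", "charged"],
    ["court", "police", "jailed"],
    ["rain", "flood", "storm"],
    ["storm", "rain", "warning"],
]
vocab, doc_topic, topic_word = gibbs_lda(toy_docs, n_topics=2)
for d, counts in enumerate(doc_topic):
    print(f"doc {d}: topic counts {counts}")
```

Each row of `doc_topic` is the unnormalized topic mix of one document, and each row of `topic_word` is the unnormalized word distribution of one topic.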
Topic Modeling
We compare two preprocessing paths:
- scikit‑learn: remove stopwords only (no lemmatization/stemming).
- NLTK + Gensim: remove stopwords and lemmatize/stem.
i) LSA using scikit‑learn
Sample 10,000 headlines. Build a document‑term matrix with CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
# tdec is a decorator defined elsewhere in the project (not shown here)
@tdec
def words2vec(data, max_features=40000):
    # Drop English stopwords and cap the vocabulary at max_features terms
    count_vectorizer = CountVectorizer(stop_words='english', max_features=max_features)
    document_term_matrix = count_vectorizer.fit_transform(data)
    return count_vectorizer, document_term_matrix
# Reproducible random sample of 10,000 headlines
small_text_sample = reindexed_data.sample(n=10000, random_state=0).values
counter_vectorizer, small_document_term_matrix = words2vec(data=small_text_sample, max_features=40000)
Quick check:
# inv_transform_count_vectorizer is a helper defined elsewhere in the project (not shown here)
print('Before preprocessing ', small_text_sample[1])
print('Words converted to vector ', small_document_term_matrix[1])
print('Word vector inverse transformed ', inv_transform_count_vectorizer(counter_vectorizer, small_document_term_matrix[1]))
Fit LSA (TruncatedSVD, 8 topics) and inspect top words per topic:
from sklearn.decomposition import TruncatedSVD
# get_keys, keys_to_counts and get_top_words are helpers defined elsewhere in the project (not shown here)
lsa_model = TruncatedSVD(n_components=8)
lsa_topic_matrix = lsa_model.fit_transform(small_document_term_matrix)
lsa_keys = get_keys(lsa_topic_matrix)
lsa_categories, lsa_counts = keys_to_counts(lsa_keys)
top_n_words_lsa = get_top_words(
    n=15, n_topics=8, keys=lsa_keys,
    document_term_matrix=small_document_term_matrix,
    count_vectorizer=counter_vectorizer,
)
for i, words in enumerate(top_n_words_lsa, start=1):
    print(f"Topic {i}: ", words)
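The post does not show the bodies of `get_keys`, `keys_to_counts`, and `get_top_words`. Going by how they are called, one plausible reading is sketched below. The bodies are assumptions, not the author's code: each document is keyed to its highest‑weighted topic, and a topic's top words are the most frequent terms among the documents keyed to it.

```python
from collections import Counter

import numpy as np

def get_keys(topic_matrix):
    """Return the index of the highest-weighted topic for each document."""
    return np.asarray(topic_matrix).argmax(axis=1).tolist()

def keys_to_counts(keys):
    """Return (topic ids, number of documents per topic), sorted by topic id."""
    counts = Counter(keys)
    categories = sorted(counts)
    return categories, [counts[c] for c in categories]

def get_top_words(n, n_topics, keys, document_term_matrix, count_vectorizer):
    """For each topic, join the n most frequent words among its documents."""
    vocab = np.asarray(count_vectorizer.get_feature_names_out())
    keys = np.asarray(keys)
    top_words = []
    for topic in range(n_topics):
        rows = np.flatnonzero(keys == topic)
        # Sum term counts over the documents keyed to this topic
        # (works for dense arrays and SciPy sparse matrices alike).
        totals = np.asarray(document_term_matrix[rows].sum(axis=0)).ravel()
        best = np.argsort(totals)[::-1][:n]
        top_words.append(" ".join(vocab[i] for i in best if totals[i] > 0))
    return top_words
```

Under this reading, a topic with no documents keyed to it yields an empty string rather than an error.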
Table I. Sklearn LSA predicted topic categories (top words shown in the original post).
t‑SNE on the LSA topic weights (SVD component scores rather than probabilities) shows weak separability → LSA seems ill‑suited here.
ii) LDA using scikit‑learn
Train LDA (8 topics) and review top words:
from sklearn.decomposition import LatentDirichletAllocation
n_topics = 8
lda_model_sklearn = LatentDirichletAllocation(n_components=n_topics, learning_method='online', random_state=0)
# Fit on the same 10,000-headline sample used for LSA
lda_topic_matrix_sklearn = lda_model_sklearn.fit_transform(small_document_term_matrix)
lda_keys_sklearn = get_keys(lda_topic_matrix_sklearn)
lda_categories_sklearn, lda_counts_sklearn = keys_to_counts(lda_keys_sklearn)
top_n_words_lda_sklearn = get_top_words(
    n=15, n_topics=n_topics, keys=lda_keys_sklearn,
    document_term_matrix=small_document_term_matrix,
    count_vectorizer=counter_vectorizer,
)
t‑SNE now shows much better cluster separation → LDA is a better fit. Scale to 100k headlines.
With 100k headlines, topic prevalence by year:
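A per‑year breakdown of this kind can be tallied from each headline's dominant topic. The following is a minimal sketch, assuming a list of publication years aligned with a list of topic keys; `topic_share_by_year` is a hypothetical helper and the values are made up.

```python
from collections import Counter, defaultdict

def topic_share_by_year(years, topic_keys):
    """Return {year: {topic: share of that year's headlines}}."""
    per_year = defaultdict(Counter)
    for year, topic in zip(years, topic_keys):
        per_year[year][topic] += 1
    return {
        year: {topic: count / sum(counts.values()) for topic, count in counts.items()}
        for year, counts in per_year.items()
    }

# Made-up years and dominant-topic keys, one pair per headline.
example_years = [2003, 2003, 2003, 2004, 2004]
example_topics = [0, 0, 1, 1, 1]
shares = topic_share_by_year(example_years, example_topics)
print(shares)
```

Using shares rather than raw counts keeps years comparable even though the number of headlines per year varies.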
iii) LDA using NLTK and Gensim
Preprocess with lemmatization + stemming and build a bag‑of‑words dictionary:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
# Requires the NLTK WordNet data: nltk.download('wordnet')
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()
def lemmatize_and_stem(text):
    # Lemmatize the token as a verb, then stem the result
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))
def lemmatize_stem_remove_stopwords(text):
    # Tokenize, drop stopwords and tokens of 3 characters or fewer, then normalize
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_and_stem(token))
    return result
processed_docs = raw_data['headline_text'].map(lemmatize_stem_remove_stopwords)
bowdict = gensim.corpora.Dictionary(processed_docs)
# Keep tokens found in at least 15 headlines and at most 50% of them, capped at 100,000 tokens
bowdict.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
bow_corpus = [bowdict.doc2bow(doc) for doc in processed_docs]
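To make the effect of `filter_extremes` concrete, the same document‑frequency rule can be written out in plain Python. This is an illustrative re‑implementation of the rule, not Gensim's code; `filter_vocabulary` is a hypothetical name and the documents are made up.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, no_below=15, no_above=0.5, keep_n=100000):
    """Keep tokens by document frequency, following the filter_extremes rule."""
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(tokenized_docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Of the survivors, keep the keep_n tokens with the highest document frequency.
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept[:keep_n]

# Made-up tokenized headlines for illustration only.
example_docs = [
    ["rain", "storm"],
    ["rain", "flood"],
    ["rain", "police"],
    ["police", "court"],
]
print(filter_vocabulary(example_docs, no_below=2, no_above=0.5))
```

In this toy corpus "rain" is dropped for appearing in more than half of the documents, and the tokens seen only once are dropped for being too rare.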
Fit Gensim LDA (LdaMulticore) and print topics:
lda_model_gensim = gensim.models.LdaMulticore(bow_corpus, num_topics=n_topics, id2word=bowdict, passes=2, workers=2)
for idx, topic in lda_model_gensim.print_topics(-1):
    print(f'Topic: {idx}\nWords: {topic}')
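Once the model is fitted, assigning an unseen headline to a category comes down to picking the most probable topic from its topic distribution. Below is a minimal sketch: `dominant_topic` and `unseen_headline` are hypothetical names, and the distribution is made up so the snippet runs without a trained model.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# With the fitted model, the input would come from something like:
#   bow = bowdict.doc2bow(lemmatize_stem_remove_stopwords(unseen_headline))
#   doc_topics = lda_model_gensim.get_document_topics(bow)
# Made-up (topic_id, probability) pairs for illustration only.
example_doc_topics = [(0, 0.05), (3, 0.72), (5, 0.23)]
best_topic, best_prob = dominant_topic(example_doc_topics)
print(f"Assigned to topic {best_topic} with probability {best_prob}")
```

The unseen headline has to go through the same preprocessing and dictionary as the training data, otherwise its tokens will not map to the model's vocabulary.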
iv) LSI and HDP using Gensim
Implementations are straightforward:
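from gensim.models import LsiModel, HdpModel
# headlines_raw is assumed to be the iterable of raw headline strings (e.g. raw_data['headline_text'])
bow_vectors = [bowdict.doc2bow(lemmatize_stem_remove_stopwords(doc)) for doc in headlines_raw]
lsi_model = LsiModel(corpus=bow_vectors, num_topics=10, id2word=bowdict)
lsi_model.show_topics(num_topics=8)
# HDP infers the number of topics from the data, so none is specified
hdp_model = HdpModel(corpus=bow_vectors, id2word=bowdict)
hdp_model.show_topics(num_topics=8)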
Results appear lower‑quality than LDA in this use case.
Conclusion
Across LDA, LSA, LSI, and HDP (via scikit‑learn, NLTK, and Gensim), scikit‑learn LDA on stopword‑only text produced the best‑separated topic clusters (t‑SNE). Introducing lemmatization/stemming yielded tighter local clusters but worse global separation for this corpus. Next steps: try non‑negative matrix factorization and explicit semantic analysis.
Thanks for reading!
Sources
- https://www.mygreatlearning.com/blog/understanding-latent-dirichlet-allocation/
- https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2
- https://iq.opengenus.org/topic-modelling-techniques/
- https://www.kaggle.com/therohk/million-headlines/code
- https://medium.com/ml2vec/topic-modeling-is-an-unsupervised-learning-approach-to-clustering-documents-to-discover-topics-fdfbf30e27df
- https://github.com/susanli2016/NLP-with-Python/blob/master/LDA_news_headlines.ipynb
- https://radimrehurek.com/gensim/corpora/dictionary.html
- https://www.kaggle.com/faressayah/text-analysis-topic-modelling-with-spacy-gensim
- https://towardsdatascience.com/t-sne-clearly-explained-d84c537f53a