WSJ Sentiment, Reader Engagement, and S&P 500 Moves
Link to R Shiny App | Link to GitHub repo
Introduction
Online reader engagement has become a growing focus of news publications seeking to foster reader loyalty and bolster web advertising revenues (Masullo Chen, Ng, Riedl, & Chen, 2020; Farhi, 2007). As the share of U.S. newspaper advertising revenue generated digitally grew from ~17% in 2011 to nearly 35% in 2018 (Pew Research Center, 2018), how newspapers engage with the ~70 million U.S. adults who prefer to get their daily news online has become central (Geiger, 2019). Power users, those who visit a site ≥10 times per month and spend more than an hour there, are especially prized (Olmstead & Rosenstiel, 2011). Commenting behavior is positively correlated with repeat visits and time on site, making comments and inter-user discussion increasingly valuable (Ziegele, Weber, Quiring, & Breiner, 2018).
Opinion pieces and emotionally charged articles can drive engagement by confirming readers' predispositions (Bakir & McStay, 2018), leading to more comments and shares and higher ad revenue. To investigate the relationships between emotionality, subjectivity, polarity, and engagement, I scraped Wall Street Journal (WSJ) articles and asked:
- Question 1: Can a statistically significant relationship be demonstrated between a WSJ article's subjectivity/objectivity and positivity/negativity (as defined by Python sentiment libraries) and the number of reader comments?
- Question 2: Given WSJ's financial readership and prior literature linking media tone/coverage to markets (e.g., Aman, 2013), is there a significant relationship between WSJ sentiment on day t and S&P 500 moves on day t + n, where 0 ≤ n ≤ 1?
Data and Web Scraping
To answer Question 1, I scraped 22,772 full-text WSJ articles published between January 2019 and July 2020 from the WSJ news archives. For each article I captured: article text, headline, sub-headline, date, author, number of comments, and section (rubric). Below is a descriptive table from the R Shiny App showing the variables scraped, followed by a sample of where this text appears on a WSJ page:
I used Selenium because (i) WSJ requires a login, and Selenium can fill the username/password fields by element ID, and (ii) explicit waits (via WebDriverWait) ensure all elements load before scraping. The login flow is shown below:
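In outline, the flow looks something like the following minimal sketch. The login URL and element IDs here are illustrative assumptions, not verified WSJ selectors:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

USERNAME = "you@example.com"   # placeholder credentials
PASSWORD = "********"

driver = webdriver.Chrome()
driver.get("https://accounts.wsj.com/login")   # login URL is an assumption
wait = WebDriverWait(driver, 15)

# Fill the login form by element ID (IDs are illustrative, not verified)
wait.until(EC.presence_of_element_located((By.ID, "username"))).send_keys(USERNAME)
driver.find_element(By.ID, "password").send_keys(PASSWORD)
driver.find_element(By.ID, "basic-login-submit").click()

# Explicit wait: block until article paragraphs are present before scraping
wait.until(EC.presence_of_all_elements_located((By.TAG_NAME, "p")))
```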
The app's EDA tab includes a word cloud of common keywords (a slider controls the word count). Unsurprisingly in an election year, many high-frequency words are political: democrats, political, Trump, Biden, Schumer. More general terms common to national/global coverage, such as public, rule, federal, law, and policy, also appear. A bar chart ranks WSJ sections by average comments per article: Politics and Opinion lead, with Politics drawing more than double the average of the third-ranked section, U.S.
Data Cleaning and Preprocessing
For the S&P regression, paragraph texts from the same date were concatenated into one cell per day (groupby + join). This avoided dropping rows with empty paragraphs and sped up processing. The resulting 232 unique days were inner-joined with daily SPX prices/volumes from Yahoo Finance. Because markets close on weekends and holidays, the merged set contains 158 unique trading days, consistent with the calendar.
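A minimal pandas sketch of this step, assuming an `articles` DataFrame with `date` and `paragraph` columns; the yfinance ticker ^GSPC stands in for however the SPX series was actually pulled:

```python
import pandas as pd
import yfinance as yf   # yfinance is assumed as the Yahoo Finance source

# Concatenate all paragraph texts for the same date into one cell per day
daily_text = (
    articles.dropna(subset=["paragraph"])
            .groupby("date")["paragraph"]
            .apply(" ".join)          # groupby + join: one text cell per day
            .reset_index()
)

# Daily S&P 500 prices and volume; pct_change gives daily % moves
spx = yf.download("^GSPC", start="2019-01-01", end="2020-08-01")
spx["pct_change"] = spx["Close"].pct_change() * 100

# Inner join keeps only trading days, dropping weekends and holidays
daily_text["date"] = pd.to_datetime(daily_text["date"])
merged = daily_text.merge(spx, left_on="date", right_index=True, how="inner")
```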
Python Sentiment Analysis
Two common frameworks were used:
VADER variables
- Negative: float in [0,1] (negativity score)
- Neutral: float in [0,1] (neutrality score)
- Positive: float in [0,1] (positivity score)
- Compound: normalized aggregate of the negative, neutral, and positive scores, in [-1,1]
TextBlob variables
- Polarity: float in [-1,1] (1 = most positive, -1 = most negative)
- Subjectivity: float in [0,1] (1 = most subjective, 0 = most objective)
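Both libraries expose one-call APIs for these scores. Below is a minimal sketch of how the six features might be computed per article; the helper name score_text and the feature names are mine:

```python
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def score_text(text):
    """Return the six sentiment features used in the regressions."""
    vader = analyzer.polarity_scores(text)   # dict: neg, neu, pos, compound
    blob = TextBlob(text).sentiment          # namedtuple: polarity, subjectivity
    return {
        "vader_neg": vader["neg"],
        "vader_neu": vader["neu"],
        "vader_pos": vader["pos"],
        "vader_compound": vader["compound"],
        "tb_polarity": blob.polarity,
        "tb_subjectivity": blob.subjectivity,
    }

print(score_text("Markets rallied sharply on surprisingly strong earnings."))
```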
Visualizations and Data Manipulation
TextBlob. At first glance, article polarity shows little relationship with comment counts; polarity is tightly distributed around its mean. Subjectivity similarly explains little variation, though it is more widely spread, with more outliers.
Many low-comment articles flatten these relationships: 2,497 of 12,329 articles had ≤5 comments. The comment distribution is heavily right-skewed (box/bar plots below). I filtered to >5 comments and trimmed high-end outliers using the IQR rule: Q3 (165) + 1.5 × IQR (147) ≈ 386 comments. The filtered set has 8,578 articles (5–386 comments).
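A short pandas sketch of this filter, assuming a `df` with one row per article and a `comments` column (the post does not say whether the quartiles were computed before or after the >5 cut):

```python
# Drop low-engagement articles first
df = df[df["comments"] > 5]

# IQR fence for high-end outliers: Q3 + 1.5 * IQR (165 + 1.5 * 147 ≈ 386)
q1, q3 = df["comments"].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
filtered = df[df["comments"] <= upper]
```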
On the filtered set, I repeated the visual regressions (seaborn) using both paragraph and headline text scores. The relationships between comments and TextBlob polarity/subjectivity still look largely flat.
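Each of these plots follows the same seaborn pattern; a representative sketch, with column names carried over from the scoring step above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Visual regression of comments on TextBlob polarity (filtered set)
sns.regplot(x="tb_polarity", y="comments", data=filtered,
            scatter_kws={"alpha": 0.2}, line_kws={"color": "red"})
plt.title("Comments vs. TextBlob polarity (filtered set)")
plt.show()
```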
VADER. Overall, positivity vs comments is flat on the full dataset.
A surprisingly strong positive relationship appeared between negativity and comments in the full set:
After filtering, positivity remains flat and the negativity link weakens, suggesting outliers drove much of it. Next, I quantified these relationships with linear regressions.
Results: Simple Linear Regression Analysis
i) Number of Comments
A linear regression of comments on TextBlob polarity, TextBlob subjectivity, VADER positivity, and VADER negativity (full 12,329-article set) yields a very low adjusted R² (0.014) and a non-significant overall F-test (Prob(F) = 0.2045). Interestingly, VADER negativity is individually significant at the 1% level.
Regressing comments on negativity alone gives a similarly low adjusted R² (0.014). Despite the weak predictive power, the simpler model is preferable to the larger one. One hypothesis: highly negative stories (e.g., tragedies) may prompt communal responses that boost comment counts. Further analysis would be needed.
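Both models follow the same statsmodels pattern; a minimal sketch, assuming a `df` with the sentiment columns named as in the scoring step above:

```python
import statsmodels.api as sm

# Full model: comments on all four sentiment scores
X_full = sm.add_constant(df[["tb_polarity", "tb_subjectivity",
                             "vader_pos", "vader_neg"]])
full_model = sm.OLS(df["comments"], X_full).fit()
print(full_model.summary())   # adjusted R², overall F-test, per-coefficient p-values

# Simpler model: comments on VADER negativity alone
X_neg = sm.add_constant(df[["vader_neg"]])
neg_model = sm.OLS(df["comments"], X_neg).fit()
```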
ii) Same-day S&P 500 % Change
Four OLS models regressed same-day SPX % change on (i) TextBlob scores only, (ii) VADER scores only, (iii) both sets, and (iv) both sets plus SPX volume. Adjusted R² values are ~0.01 with high p-values: poor models with low predictive power. Somewhat surprisingly, TextBlob polarity is significant at the 10% level, despite its earlier flat relationship with comments.
iii) Following-day S&P 500 % Change
A model for next-day SPX % change again shows low explanatory power (Adj. R² 0.008, Prob(F) 0.315). VADER compound is significant at the 5% level (β ≈ 0.36), suggesting a 1-unit rise in compound score is associated with a +0.36% SPX move the next day, though the small sample cautions against strong conclusions.
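Pairing day-t sentiment with day-(t+1) returns reduces to a one-row shift; a sketch assuming the `merged` daily frame from the earlier join, and that its consecutive rows are consecutive trading days:

```python
import statsmodels.api as sm

# Shift % change back one row so each day's sentiment lines up with
# the NEXT trading day's return
merged = merged.sort_values("date")
merged["next_day_pct"] = merged["pct_change"].shift(-1)

# Next-day return on VADER compound; missing="drop" discards the final
# row, whose next-day return is undefined
X = sm.add_constant(merged[["vader_compound"]])
next_day_model = sm.OLS(merged["next_day_pct"], X, missing="drop").fit()
print(next_day_model.summary())
```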
Thanks for reading!