Biden deciphered? — A breakdown of 100 senators’ speech patterns
This article explores using machine learning to analyze political speeches, identifying linguistic patterns across party lines. Through techniques like cosine similarity and lemmatization, it examines whether a senator’s party can be predicted from their speech — bridging technology and politics.
Introduction
In the vast expanse of political discourse, every speech, every word carries weight, revealing not just the immediate context but echoing the broader political landscape it originates from. Leveraging the power of machine learning, particularly text analysis techniques, we embarked on a fascinating journey to uncover how closely senators’ speeches align, and if the words they choose betray their party lines.
The Quest for Similarity
Our mission was to determine which of the 99 other senators' speeches most closely resembled those of Senator Biden during the 105th Congress. Through the lens of cosine similarity — a mathematical measure of how similar two documents are — we processed speeches using a series of text preprocessing steps including tokenization, lemmatization, and more. The goal? To reduce noise and zoom in on the meaningful essence of the political discourse.
Why These Steps?
- Lowercasing and Tokenization ensure uniformity and help analyze texts at a granular level.
- Removing Stopwords and Punctuation clears the clutter, focusing our analysis on the significant words that carry the core message.
- Lemmatization dives deeper, considering the context to bring words to their root form, thus preserving the essence of the speech.
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Define a tokenizer preprocessor
def text_preprocesser(text):
    # Convert text to lowercase and tokenize
    tokens = word_tokenize(text.lower())
    # Remove stopwords and punctuation (membership must be checked against
    # each collection separately, not against their boolean "or")
    tokens = [token for token in tokens
              if token not in stop_words and token not in string.punctuation]
    # Lemmatize tokens to their dictionary base form
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    # Filter out tokens shorter than 3 characters
    tokens = [word for word in tokens if len(word) >= 3]
    # Join tokens back into a single preprocessed string
    return ' '.join(tokens)
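Once the speeches are preprocessed, the comparison step boils down to vectorizing each document and measuring the angle between the resulting vectors. The sketch below uses scikit-learn's TF-IDF vectorizer and `cosine_similarity`; the speech strings are toy placeholders, not the actual Senate corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

speeches = [
    "health care reform bill",       # stand-in for Biden's speeches
    "health care legislation vote",  # senator A
    "farm subsidy appropriation",    # senator B
]

# Turn each speech into a TF-IDF weighted vector
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(speeches)

# Similarity of the first document against every other document
scores = cosine_similarity(matrix[0], matrix[1:]).flatten()
closest = scores.argmax()  # index (among the other senators) of the best match
```

Because cosine similarity depends only on the direction of the vectors, not their length, a short speech and a long speech on the same topic can still score as highly similar.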
Lemmatization vs. Stemming: The Turning Point
Our exploration took an intriguing turn when we compared the impact of lemmatization and stemming. Lemmatization, with its context-aware approach, seemed the superior choice for our analysis, aiming to preserve the semantic integrity of the speeches. Stemming, though less computationally demanding, often simplifies words to a point where the original meaning might get lost in translation.
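The difference is easy to see in code. The Porter stemmer (available in NLTK, the same library used in our preprocessing) truncates words by rule, so distinct words can collapse onto the same stump; this is a standard illustrative example, not output from our corpus.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stemming collapses two different words onto the same truncated root,
# so the distinction between them is lost downstream.
print(stemmer.stem("university"))  # univers
print(stemmer.stem("universe"))    # univers

# A lemmatizer, by contrast, maps each word to its dictionary form
# ("universities" -> "university"), keeping the two words distinct.
```

That loss of distinction is exactly why a stemmed corpus can surface a different "closest senator" than a lemmatized one.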
This distinction was not just academic; it shaped our findings. With lemmatization, we found Senator Pat Roberts's speeches to be closest to Biden's, a surprising twist given their party differences. Stemming, on the other hand, highlighted Senator Joe Lieberman, aligning more closely with expectations due to party affiliations.
Predicting Party Lines: A Challenge Unveiled

Venturing further, we applied the Multinomial Naive Bayes classifier to predict senators’ party affiliations from their speeches. Despite our multilayered preprocessing pipeline, the model’s accuracy hovered around 50% — a humbling reminder of the complexity of political language and the subtleties of partisan speech.
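The classification step can be sketched as a bag-of-words pipeline feeding a Multinomial Naive Bayes model. The speeches and party labels below are illustrative stand-ins for the real corpus, which pairs each senator's preprocessed speeches with their affiliation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: speech snippets labeled with party affiliation
speeches = [
    "tax cut small business growth",
    "border security defense spending",
    "climate change health care access",
    "union worker minimum wage",
]
parties = ["R", "R", "D", "D"]

# Vectorize word counts, then fit a Multinomial Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(speeches, parties)

# Predict the party for an unseen snippet
prediction = model.predict(["health care for every worker"])[0]
```

With realistic data the model sees heavy vocabulary overlap between the parties, which is one reason accuracy can sit near chance.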
Refining Our Approach
The quest for accuracy urged us to revisit our text preprocessing strategies. Could incorporating bigrams or sentiment analysis offer a clearer window into the partisan nature of political speeches? Would adjusting our stop words list unmask previously obscured indicators of party affiliation?
The Road Ahead
Our journey through the political lexicon using machine learning has been illuminating, challenging, and, above all, a testament to the nuanced relationship between language and politics. It’s a reminder that in the realm of political speeches, words are more than mere vehicles of communication; they are the threads that weave the intricate tapestry of our political fabric.
As we continue to refine our models and approaches, the potential to uncover deeper insights into political discourse remains vast. The dialogue between machine learning and political analysis is just beginning, promising richer understandings of the words that shape our world.
Notes:
For those inclined to tinker with the code, you're welcome to explore it on GitHub.
About the Project
This blog was written as part of an assignment for the MS in Business Analytics program at CEU's Department of Business & Economics 2023–24. Special thanks to Professor Arieda Muço for supervising the project.