Commencement example
This example uses the Receptiviti API to analyze commencement speeches.
Data¶
We'll start by collecting and processing the speeches.
Collection¶
The speeches used to be provided more directly, but the service hosting them has since shut down.
They are still available in a slightly less convenient form, as the source of a site that displays them: whatrocks.github.io/commencement-db.
First, we can retrieve metadata from a separate repository:
import pandas

speeches = pandas.read_csv(
    "https://raw.githubusercontent.com/whatrocks/markov-commencement-speech"
    "/refs/heads/master/speech_metadata.csv"
)
speeches.iloc[0:5, 1:4]
| | name | school | year |
|---|---|---|---|
| 0 | Aaron Sorkin | Syracuse University | 2012 |
| 1 | Abigail Washburn | Colorado College | 2012 |
| 2 | Adam Savage | Sarah Lawrence College | 2012 |
| 3 | Adrienne Rich | Douglass College | 1977 |
| 4 | Ahmed Zewail | Caltech | 2011 |
One file in the source repository has a colon (:) in its name, which is an invalid character on Windows, so we'll need to pull the files in individually rather than cloning the repository:
import os
import requests

text_dir = "../../../commencement_speeches/"
os.makedirs(text_dir, exist_ok=True)
text_url = (
    "https://raw.githubusercontent.com/whatrocks/commencement-db"
    "/refs/heads/master/src/pages/"
)
for file in speeches["filename"]:
    out_file = text_dir + file.replace(":", "")
    if not os.path.isfile(out_file):
        req = requests.get(
            text_url + file.replace(".txt", "/index.md"), timeout=999
        )
        if req.status_code != 200:
            print(f"failed to retrieve {file}")
            continue
        # drop the frontmatter and write out just the speech text
        with open(out_file, "w", encoding="utf-8") as content:
            content.write("\n".join(req.text.split("\n---")[1:]))
Text Preparation¶
Now we can read in the texts and associate them with their metadata:
import re

# a pattern to convert mid-sentence bullets to sentence breaks
bullet_pattern = re.compile("([a-z]) •")

def read_text(file: str):
    with open(text_dir + file.replace(":", ""), encoding="utf-8") as content:
        text = bullet_pattern.sub("\\1. ", content.read())
    return text

speeches["text"] = [read_text(file) for file in speeches["filename"]]
Load Package¶
If this is your first time using the package, see the Get Started guide to install it and set up your API credentials.
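For instance, credentials can be supplied through environment variables before any requests are made. This is only a sketch, and the RECEPTIVITI_KEY and RECEPTIVITI_SECRET variable names are an assumption here, so refer to the Get Started guide for the exact setup:
import os

# assumed variable names; replace the placeholders with your own credentials
os.environ["RECEPTIVITI_KEY"] = "your-api-key"
os.environ["RECEPTIVITI_SECRET"] = "your-api-secret"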
import receptiviti

# since our texts are from speeches,
# it might make sense to use the spoken norming context
processed = receptiviti.request(
    speeches["text"], version="v2", context="spoken"
)
processed = pandas.concat([speeches.iloc[:, 0:4], processed], axis=1)
processed.iloc[0:5, 7:]
summary.words_per_sentence | summary.sentence_count | summary.six_plus_words | summary.capitals | summary.emojis | summary.emoticons | summary.hashtags | summary.urls | big_5.extraversion | big_5.active | ... | disc_dimensions.people_relationship_emotion_focus | disc_dimensions.task_system_object_focus | disc_dimensions.d_axis | disc_dimensions.i_axis | disc_dimensions.s_axis | disc_dimensions.c_axis | disc_dimensions.d_axis_proportional | disc_dimensions.i_axis_proportional | disc_dimensions.s_axis_proportional | disc_dimensions.c_axis_proportional | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19.522388 | 134 | 0.192278 | 0.029845 | 0 | 0 | 0.000000 | 0 | 64.362931 | 58.626324 | ... | 68.785932 | 48.847076 | 52.559226 | 62.370538 | 54.667259 | 46.067726 | 0.243708 | 0.289201 | 0.253483 | 0.213608 |
1 | 28.587156 | 109 | 0.246149 | 0.031928 | 0 | 0 | 0.000963 | 0 | 55.125763 | 61.658720 | ... | 58.656537 | 40.455881 | 45.894668 | 55.262383 | 53.025679 | 44.037115 | 0.231534 | 0.278793 | 0.267509 | 0.222163 |
2 | 12.072000 | 125 | 0.221339 | 0.034079 | 0 | 0 | 0.000000 | 0 | 58.409255 | 51.114651 | ... | 62.707516 | 52.665746 | 57.492323 | 62.734369 | 48.323395 | 44.285521 | 0.270125 | 0.294755 | 0.227046 | 0.208074 |
3 | 32.389831 | 59 | 0.311879 | 0.011606 | 0 | 0 | 0.000000 | 0 | 59.149444 | 29.627236 | ... | 68.591156 | 45.065873 | 46.459670 | 57.317375 | 59.781553 | 48.457056 | 0.219133 | 0.270345 | 0.281968 | 0.228554 |
4 | 24.326733 | 101 | 0.326007 | 0.023517 | 0 | 0 | 0.000000 | 0 | 72.021225 | 50.677059 | ... | 61.249178 | 55.538689 | 62.837175 | 65.988609 | 42.076374 | 40.066922 | 0.297850 | 0.312788 | 0.199443 | 0.189918 |
5 rows × 216 columns
Full: Analyze Style¶
To get at stylistic uniqueness, we can calculate Language Style Matching (LSM) between each speech and the mean of all speeches: for each function-word category, the match is 1 minus the absolute difference divided by the sum, and these matches are averaged into a single score:
lsm_categories = [
    "liwc15." + c
    for c in [
        "personal_pronouns",
        "impersonal_pronouns",
        "articles",
        "auxiliary_verbs",
        "adverbs",
        "prepositions",
        "conjunctions",
        "negations",
        "quantifiers",
    ]
]
category_means = processed[lsm_categories].agg("mean")
processed["lsm_mean"] = (
    1
    - abs(processed[lsm_categories] - category_means)
    / (processed[lsm_categories] + category_means)
).agg("mean", axis=1)
processed.sort_values("lsm_mean").iloc[0:10][
    ["name", "school", "year", "lsm_mean", "summary.word_count"]
]
| | name | school | year | lsm_mean | summary.word_count |
|---|---|---|---|---|---|
| 99 | Gary Malkowski | Gallaudet University | 2011 | 0.751292 | 2154 |
| 268 | Theodor ‘Dr. Seuss’ Geisel | Lake Forest College | 1977 | 0.763584 | 96 |
| 178 | Makoto Fujimura | Belhaven University | 2011 | 0.820082 | 2569 |
| 81 | Dwight Eisenhower | Penn State | 1955 | 0.831307 | 2518 |
| 100 | George C. Marshall | Harvard University | 1947 | 0.833347 | 1449 |
| 170 | Lewis Lapham | St. John’s College | 2003 | 0.834804 | 3688 |
| 232 | Rev. Joseph L. Levesque | Niagara University | 2007 | 0.845988 | 483 |
| 86 | Edward W. Brooke | Wellesley College | 1969 | 0.847166 | 3083 |
| 117 | Janet Napolitano | Northeastern University | 2014 | 0.853360 | 1526 |
| 289 | Whoopi Goldberg | Savannah College of Art and Design | 2011 | 0.855008 | 1248 |
Here, it is notable that the most stylistically unique speech was delivered in American Sign Language, and the second most stylistically unique speech was a short rhyme.
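As a quick illustration of the matching formula, we can compute the same score between just the first two speeches: for each category, similarity is 1 - |a - b| / (a + b), and the LSM score is the mean across categories.
# LSM between the first two speeches, using the same function-word categories
a = processed.loc[0, lsm_categories].astype(float)
b = processed.loc[1, lsm_categories].astype(float)
pair_lsm = float((1 - abs(a - b) / (a + b)).mean())
print(f"{processed.loc[0, 'name']} vs. {processed.loc[1, 'name']}: {pair_lsm:.3f}")
The pairwise comparison below applies this same calculation to every pair of speeches.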
We might also want to see which speeches are most similar to one another:
import numpy

# calculate all pairwise comparisons
lsm_pairs = processed[lsm_categories].T.corr(
    lambda a, b: numpy.mean(1 - abs(a - b) / (a + b))
)
# set self-matches to 0
numpy.fill_diagonal(lsm_pairs.values, 0)
# identify the closest match to each speech
speeches["match"] = lsm_pairs.idxmax()
best_match = lsm_pairs.max()
# look at the top matches, keeping each pair only once
top_matches = best_match.sort_values(ascending=False).index[:20].astype(int).to_list()
top_match_pairs = pandas.DataFrame(
    {"a": top_matches, "b": speeches["match"][top_matches].to_list()}
)
top_match_pairs = top_match_pairs[
    ~top_match_pairs.apply(
        lambda r: "".join(r.sort_values().astype(str)), 1
    ).duplicated()
]
pandas.concat(
    [
        speeches.iloc[top_match_pairs["a"], 1:4].reset_index(drop=True),
        pandas.DataFrame({"Similarity": best_match[top_match_pairs["a"]]}).reset_index(
            drop=True
        ),
        speeches.iloc[top_match_pairs["b"], 1:4].reset_index(drop=True),
    ],
    axis=1,
)
| | name | school | year | Similarity | name | school | year |
|---|---|---|---|---|---|---|---|
| 0 | Cynthia Enloe | Connecticut College | 2011 | 0.999325 | Howard Gordon | Connecticut College | 2013 |
| 1 | Benjamin Carson Jr. | Niagara University | 2003 | 0.984177 | Jonathon Youshaei | Deerfield High School | 2009 |
| 2 | John Legend | University of Pennsylvania | 2014 | 0.980730 | Sheryl Sandberg | City Colleges of Chicago | 2014 |
| 3 | Ronald Reagan | Eureka College | 1957 | 0.979818 | Arianna Huffington | Sarah Lawrence College | 2011 |
| 4 | Amy Poehler | Harvard University | 2011 | 0.977867 | Sheryl Sandberg | City Colleges of Chicago | 2014 |
| 5 | Melissa Harris-Perry | Wellesley College | 2012 | 0.975869 | Drew Houston | Massachusetts Institute of Technology | 2013 |
| 6 | Alan Alda | Connecticut College | 1980 | 0.975361 | Nora Ephron | Wellesley College | 1996 |
| 7 | Woody Hayes | Ohio State University | 1986 | 0.975117 | James Carville | Hobart and William Smith Colleges | 2013 |
| 8 | Mindy Kaling | Harvard Law School | 2014 | 0.975033 | Stephen Colbert | Wake Forest University | 2015 |
| 9 | Tim Cook | Auburn University | 2010 | 0.975007 | Arianna Huffington | Vassar College | 2015 |
| 10 | Barbara Bush | Wellesley College | 1990 | 0.974678 | Daniel S. Goldin | Massachusetts Institute of Technology | 2001 |
Full: Analyze Content¶
To look at content over time, we might focus on a potentially interesting framework, such as drives:
from statistics import linear_regression
from matplotlib.pyplot import subplots
from matplotlib.style import use

drive_data = processed.filter(regex="year|drives")[processed["year"] > 1980]
trending_drives = (
    drive_data.corrwith(drive_data["year"])
    .abs()
    .sort_values(ascending=False)[1:4]
    .index
)
first_year = drive_data["year"].min()

colors = ["#82C473", "#A378C0", "#616161", "#9F5C61", "#D3D280"]
linestyles = ["-", "--", ":", "-.", (5, (8, 2))]
use(["dark_background", {"figure.facecolor": "#1e2229", "axes.facecolor": "#1e2229"}])

fig, ax = subplots()
ax.set(ylabel="Score", xlabel="Year")
for i, cat in enumerate(trending_drives):
    points = ax.scatter(drive_data["year"], drive_data[cat], color=colors[i])
    beta, intercept = linear_regression(drive_data["year"], drive_data[cat])
    line = ax.axline(
        (first_year, intercept + beta * first_year),
        slope=beta,
        color=colors[i],
        linestyle=linestyles[i],
        label=cat,
    )
legend = ax.legend(loc="upper center")
To better visualize the effects, we might compare aggregated blocks of time:
summary = processed[trending_drives].aggregate(["mean", "std"])
standardized = (processed[trending_drives] - summary.loc["mean"]) / summary.loc["std"]
time_median = int(processed["year"].median())
standardized["Time Period"] = pandas.Categorical(
    processed["year"] >= time_median
).set_categories([f"< {time_median}", f">= {time_median}"], rename=True)
summaries = standardized.groupby("Time Period", observed=True)[trending_drives].agg(
    ["mean", "size"]
)

fig, ax = subplots()
ax.set(ylabel="Score (Scaled)", xlabel="Time Period")
for i, cat in enumerate(trending_drives):
    summary = summaries[cat]
    ax.errorbar(
        summary.index,
        summary["mean"],
        yerr=1 / summary["size"] ** 0.5,
        color=colors[i],
        linestyle=linestyles[i],
        label=cat,
        capsize=6,
    )
legend = ax.legend(loc="upper center")
This suggests that references to risk and reward have increased since the 2000s, while references to power have decreased at a similar rate. (Note that the error bars represent the approximate standard error of each group's mean, which lets you roughly gauge the statistical significance of the differences between means.)
The shift in emphasis from power to risk-reward could reflect that commencement speakers are now focusing more abstractly on the potential benefits and hazards of life after graduation, whereas earlier speakers more narrowly focused on ambition and dominance (perhaps referring to power held by past alumni and projecting the potential for graduates to climb social ladders in the future). You could examine a sample of speeches that show this pattern most dramatically (speeches high in risk-reward and low in power in recent years, and vice versa for pre-2009 speeches) to help determine how these themes have shifted and what specific motives or framing devices seem to have been (de)emphasized.
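As a rough sketch of how such a sample might be pulled (the drive column names are not hard-coded here because the exact labels depend on the framework output, so they are looked up by pattern; adjust the patterns to the columns actually present):
# contrast risk/reward emphasis against power emphasis for each speech
drive_columns = processed.filter(regex="drives").columns
power = processed[[c for c in drive_columns if "power" in c][0]]
risk_reward = processed[[c for c in drive_columns if "risk" in c or "reward" in c]].mean(axis=1)
contrast = risk_reward - power
recent = processed["year"] >= 2009
info = ["name", "school", "year"]

# recent speeches leaning most heavily toward risk/reward over power
print(processed.assign(contrast=contrast)[recent].nlargest(5, "contrast")[info])
# earlier speeches leaning most heavily toward power over risk/reward
print(processed.assign(contrast=contrast)[~recent].nsmallest(5, "contrast")[info])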
Analyze Segments¶
Another thing we might look for is trends within each speech. For instance, are there common emotional trajectories over the course of a speech?
One way to look at this would be to split texts into roughly equal sizes, and score each section:
import nltk
from math import ceil

nltk.download("punkt_tab", quiet=True)

def count_words(text: str):
    return len([token for token in nltk.word_tokenize(text) if token.isalnum()])

def split_text(text: str, bins=3):
    sentences = nltk.sent_tokenize(text)
    bin_size = ceil(count_words(text) / bins) + 1
    text_parts = [[]]
    word_counts = [0] * bins
    current_bin = 0
    for sentence in sentences:
        sentence_size = count_words(sentence)
        if (current_bin + 1) < bins and (
            word_counts[current_bin] + sentence_size
        ) > bin_size:
            text_parts.append([])
            current_bin += 1
        word_counts[current_bin] += sentence_size
        text_parts[current_bin].append(sentence)
    return pandas.DataFrame(
        {
            "text": [" ".join(x) for x in text_parts],
            "segment": pandas.Series(range(bins)) + 1,
            "WC": word_counts,
        }
    )
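To check that the splitter behaves as expected, we could first try it on a single speech; for example:
# segment the first speech and check the resulting word counts
split_text(speeches["text"][0])[["segment", "WC"]]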
segmented_text = []
for i, text in enumerate(speeches["text"]):
    text_parts = split_text(text)
    text_info = speeches.iloc[[i] * 3][["name", "school", "year"]]
    text_info.reset_index(drop=True, inplace=True)
    segmented_text.append(pandas.concat([text_info, text_parts], axis=1))
segmented_text = pandas.concat(segmented_text)
segmented_text.reset_index(drop=True, inplace=True)
segmented_text.iloc[0:9, :6]
| | name | school | year | text | segment | WC |
|---|---|---|---|---|---|---|
| 0 | Aaron Sorkin | Syracuse University | 2012 | Thank you very much. Madam Chancellor, members... | 1 | 856 |
| 1 | Aaron Sorkin | Syracuse University | 2012 | The actor had been offered the lead role in a ... | 2 | 827 |
| 2 | Aaron Sorkin | Syracuse University | 2012 | In that 11 years, I’ve written three televisio... | 3 | 880 |
| 3 | Abigail Washburn | Colorado College | 2012 | Bright morning stars are rising\nBright mornin... | 1 | 995 |
| 4 | Abigail Washburn | Colorado College | 2012 | I’m standing here thinking… yea right, how am ... | 2 | 934 |
| 5 | Abigail Washburn | Colorado College | 2012 | I looked around at the South Carolina party an... | 3 | 1083 |
| 6 | Adam Savage | Sarah Lawrence College | 2012 | To President Lawrence, Chairman Hill, the Boar... | 1 | 471 |
| 7 | Adam Savage | Sarah Lawrence College | 2012 | I decried and derided all of the skills I'd se... | 2 | 461 |
| 8 | Adam Savage | Sarah Lawrence College | 2012 | I'll wager that at some point you'll have the ... | 3 | 519 |
Segments: Process Text¶
Now we can send each segment to the API to be scored:
processed_segments = receptiviti.request(
    segmented_text["text"], version="v2", context="spoken"
)
segmented_text = pandas.concat([segmented_text, processed_segments], axis=1)
segmented_text.iloc[0:9, 8:]
summary.word_count | summary.words_per_sentence | summary.sentence_count | summary.six_plus_words | summary.capitals | summary.emojis | summary.emoticons | summary.hashtags | summary.urls | big_5.extraversion | ... | disc_dimensions.people_relationship_emotion_focus | disc_dimensions.task_system_object_focus | disc_dimensions.d_axis | disc_dimensions.i_axis | disc_dimensions.s_axis | disc_dimensions.c_axis | disc_dimensions.d_axis_proportional | disc_dimensions.i_axis_proportional | disc_dimensions.s_axis_proportional | disc_dimensions.c_axis_proportional | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 873 | 20.785714 | 42 | 0.182131 | 0.025233 | 0 | 0 | 0.000000 | 0 | 66.167710 | ... | 77.517546 | 38.989039 | 49.997553 | 70.498112 | 52.742495 | 37.405196 | 0.237356 | 0.334680 | 0.250388 | 0.177576 |
1 | 852 | 18.127660 | 47 | 0.213615 | 0.034065 | 0 | 0 | 0.000000 | 0 | 40.093982 | ... | 46.868716 | 47.558481 | 45.819695 | 45.486208 | 51.165188 | 51.540311 | 0.236170 | 0.234451 | 0.263723 | 0.265656 |
2 | 891 | 19.800000 | 45 | 0.181818 | 0.030344 | 0 | 0 | 0.000000 | 0 | 72.795664 | ... | 72.966879 | 54.987488 | 55.788997 | 64.265738 | 56.272576 | 48.850144 | 0.247756 | 0.285400 | 0.249903 | 0.216941 |
3 | 1030 | 35.517241 | 29 | 0.273786 | 0.024229 | 0 | 0 | 0.000000 | 0 | 58.482160 | ... | 66.958582 | 47.893402 | 52.189081 | 61.708492 | 53.739374 | 45.449313 | 0.244920 | 0.289594 | 0.252195 | 0.213291 |
4 | 953 | 25.078947 | 38 | 0.230850 | 0.040709 | 0 | 0 | 0.003148 | 0 | 54.971582 | ... | 50.129968 | 38.818173 | 40.352055 | 45.856070 | 53.946434 | 47.471349 | 0.215067 | 0.244402 | 0.287521 | 0.253011 |
5 | 1133 | 26.976190 | 42 | 0.233892 | 0.031886 | 0 | 0 | 0.000000 | 0 | 44.229974 | ... | 55.383952 | 38.426485 | 46.396992 | 55.701445 | 49.353260 | 41.109217 | 0.240947 | 0.289267 | 0.256299 | 0.213487 |
6 | 487 | 15.709677 | 31 | 0.250513 | 0.043437 | 0 | 0 | 0.000000 | 0 | 61.890701 | ... | 54.513837 | 53.597281 | 54.051544 | 54.511747 | 49.798123 | 49.377714 | 0.260190 | 0.262405 | 0.239715 | 0.237691 |
7 | 482 | 12.358974 | 39 | 0.215768 | 0.029583 | 0 | 0 | 0.000000 | 0 | 45.491979 | ... | 55.922241 | 42.280350 | 51.518933 | 59.250205 | 45.624964 | 39.671584 | 0.262764 | 0.302196 | 0.232702 | 0.202338 |
8 | 540 | 9.818182 | 55 | 0.200000 | 0.029501 | 0 | 0 | 0.000000 | 0 | 60.548414 | ... | 65.154960 | 56.856802 | 60.629919 | 64.903709 | 47.989630 | 44.829601 | 0.277669 | 0.297242 | 0.219780 | 0.205308 |
9 rows × 217 columns
Segments: Analyze Scores¶
The SALLEE framework offers measures of emotions, so we might see which categories deviate the most in any of their segments:
# select the narrower SALLEE categories
emotions = segmented_text.filter(regex="^sallee").iloc[:, 6:]
# correlate emotion scores with segment contrasts
# and select the 5 most deviating emotions
most_deviating = emotions[
    pandas.get_dummies(segmented_text["segment"])
    .apply(emotions.corrwith)
    .abs()
    .agg("max", 1)
    .sort_values(ascending=False)[:5]
    .index
]
Now we can look at those categories across segments:
from matplotlib.colors import ListedColormap

segment_data = most_deviating.groupby(segmented_text["segment"])
bars = segment_data.agg("mean").T.plot.bar(
    colormap=ListedColormap(colors[:3]),
    yerr=(segment_data.agg("std") / segment_data.agg("count") ** 0.5).values,
    capsize=3,
    ylabel="Score",
    xlabel="Category",
)
The bar chart displays the original values, which gives the clearest view of how meaningful the differences between segments might be, in addition to their statistical significance (a rough guide to the reliability of the differences, based on the variance within and between segments). From the bar chart, you can immediately see that admiration shows some of the starkest differences between the middle and the early/late segments.
scaled_summaries = (
    (most_deviating - most_deviating.agg("mean")) / most_deviating.agg("std")
).groupby(segmented_text["segment"]).agg(["mean", "size"])

fig, ax = subplots()
ax.set(ylabel="Score (Scaled)", xlabel="Segment")
for i, cat in enumerate(most_deviating):
    summary = scaled_summaries[cat]
    ax.errorbar(
        summary.index.astype(str),
        summary["mean"],
        yerr=1 / summary["size"] ** 0.5,
        color=colors[i],
        linestyle=linestyles[i],
        label=cat,
        capsize=4,
    )
legend = fig.legend(loc="right", bbox_to_anchor=(1.2, .5))
The line chart, on the other hand, shows standardized values, effectively zooming in on the differences between segments. This makes it clearer, for example, that admiration and joy seem to serve as bookends in commencement speeches, peaking early and late, whereas more negative and intense emotions such as anger, disgust, and surprise peak in the middle section.