Commencement example
This example uses the Receptiviti API to analyze commencement speeches.
Data¶
We'll start by collecting and processing the speeches.
Collection¶
The speeches used to be provided more directly, but the service hosting them has since shut down.
They are still available in a slightly less convenient form, as the source of a site that displays them: whatrocks.github.io/commencement-db.
First, we can retrieve metadata from a separate repository:
import pandas

speeches = pandas.read_csv(
    "https://raw.githubusercontent.com/whatrocks/markov-commencement-speech"
    "/refs/heads/master/speech_metadata.csv"
)
speeches.iloc[0:5, 1:4]
| | name | school | year |
|---|---|---|---|
| 0 | Aaron Sorkin | Syracuse University | 2012 |
| 1 | Abigail Washburn | Colorado College | 2012 |
| 2 | Adam Savage | Sarah Lawrence College | 2012 |
| 3 | Adrienne Rich | Douglass College | 1977 |
| 4 | Ahmed Zewail | Caltech | 2011 |
One file in the source repository has a colon (:) in its name, which is an invalid character on Windows, so we'll need to pull the files in individually rather than cloning the repository:
import os
import requests

text_dir = "../../../commencement_speeches/"
os.makedirs(text_dir, exist_ok=True)
text_url = (
    "https://raw.githubusercontent.com/whatrocks/commencement-db"
    "/refs/heads/master/src/pages/"
)
for file in speeches["filename"]:
    out_file = text_dir + file.replace(":", "")
    if not os.path.isfile(out_file):
        req = requests.get(
            text_url + file.replace(".txt", "/index.md"), timeout=999
        )
        if req.status_code != 200:
            print(f"failed to retrieve {file}")
            continue
        # drop the frontmatter and write out just the speech text
        with open(out_file, "w", encoding="utf-8") as content:
            content.write("\n".join(req.text.split("\n---")[1:]))
Text Preparation¶
Now we can read in the texts and associate them with their metadata:
import re

# a pattern to convert mid-sentence bullets to sentence breaks
bullet_pattern = re.compile("([a-z]) •")

def read_text(file: str):
    with open(text_dir + file.replace(":", ""), encoding="utf-8") as content:
        text = bullet_pattern.sub("\\1. ", content.read())
    return text

speeches["text"] = [read_text(file) for file in speeches["filename"]]
Load Package¶
If this is your first time using the package, see the Get Started guide to install it and set up your API credentials.
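For instance, credentials can be supplied through environment variables before any requests are made. This is only a sketch, and the RECEPTIVITI_KEY and RECEPTIVITI_SECRET variable names are an assumption here, so refer to the Get Started guide for the exact setup:
import os

# assumed variable names; replace the placeholders with your own credentials
os.environ["RECEPTIVITI_KEY"] = "your-api-key"
os.environ["RECEPTIVITI_SECRET"] = "your-api-secret"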
import receptiviti

# since our texts are from speeches,
# it might make sense to use the spoken norming context
processed = receptiviti.request(
    speeches["text"], version="v2", context="spoken"
)
processed = pandas.concat([speeches.iloc[:, 0:4], processed], axis=1)
processed.iloc[0:5, 7:]
summary.words_per_sentence | summary.sentence_count | summary.six_plus_words | summary.capitals | summary.emojis | summary.emoticons | summary.hashtags | summary.urls | big_5.extraversion | big_5.active | ... | disc_dimensions.people_relationship_emotion_focus | disc_dimensions.task_system_object_focus | disc_dimensions.d_axis | disc_dimensions.i_axis | disc_dimensions.s_axis | disc_dimensions.c_axis | disc_dimensions.d_axis_proportional | disc_dimensions.i_axis_proportional | disc_dimensions.s_axis_proportional | disc_dimensions.c_axis_proportional | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19.522388 | 134 | 0.192278 | 0.029845 | 0 | 0 | 0.000000 | 0 | 64.362931 | 58.626324 | ... | 68.785932 | 48.847076 | 52.559226 | 62.370538 | 54.667259 | 46.067726 | 0.243708 | 0.289201 | 0.253483 | 0.213608 |
1 | 28.587156 | 109 | 0.246149 | 0.031928 | 0 | 0 | 0.000963 | 0 | 55.125763 | 61.658720 | ... | 58.656537 | 40.455881 | 45.894668 | 55.262383 | 53.025679 | 44.037115 | 0.231534 | 0.278793 | 0.267509 | 0.222163 |
2 | 12.072000 | 125 | 0.221339 | 0.034079 | 0 | 0 | 0.000000 | 0 | 58.409255 | 51.114651 | ... | 62.707516 | 52.665746 | 57.492323 | 62.734369 | 48.323395 | 44.285521 | 0.270125 | 0.294755 | 0.227046 | 0.208074 |
3 | 32.389831 | 59 | 0.311879 | 0.011606 | 0 | 0 | 0.000000 | 0 | 59.149444 | 29.627236 | ... | 68.591156 | 45.065873 | 46.459670 | 57.317375 | 59.781553 | 48.457056 | 0.219133 | 0.270345 | 0.281968 | 0.228554 |
4 | 24.326733 | 101 | 0.326007 | 0.023517 | 0 | 0 | 0.000000 | 0 | 72.021225 | 50.677059 | ... | 61.249178 | 55.538689 | 62.837175 | 65.988609 | 42.076374 | 40.066922 | 0.297850 | 0.312788 | 0.199443 | 0.189918 |
5 rows × 216 columns
Full: Analyze Style¶
To get at stylistic uniqueness, we can calculate Language Style Matching (LSM) between each speech and the mean of all speeches: for each function-word category, the match is 1 minus the absolute difference divided by the sum, and these matches are averaged into a single score:
lsm_categories = [
    "liwc15." + c
    for c in [
        "personal_pronouns",
        "impersonal_pronouns",
        "articles",
        "auxiliary_verbs",
        "adverbs",
        "prepositions",
        "conjunctions",
        "negations",
        "quantifiers",
    ]
]
category_means = processed[lsm_categories].agg("mean")
processed["lsm_mean"] = (
    1
    - abs(processed[lsm_categories] - category_means)
    / (processed[lsm_categories] + category_means)
).agg("mean", axis=1)
processed.sort_values("lsm_mean").iloc[0:10][
    ["name", "school", "year", "lsm_mean", "summary.word_count"]
]
| | name | school | year | lsm_mean | summary.word_count |
|---|---|---|---|---|---|
| 99 | Gary Malkowski | Gallaudet University | 2011 | 0.751292 | 2154 |
| 268 | Theodor ‘Dr. Seuss’ Geisel | Lake Forest College | 1977 | 0.763584 | 96 |
| 178 | Makoto Fujimura | Belhaven University | 2011 | 0.820082 | 2569 |
| 81 | Dwight Eisenhower | Penn State | 1955 | 0.831307 | 2518 |
| 100 | George C. Marshall | Harvard University | 1947 | 0.833347 | 1449 |
| 170 | Lewis Lapham | St. John’s College | 2003 | 0.834804 | 3688 |
| 232 | Rev. Joseph L. Levesque | Niagara University | 2007 | 0.845988 | 483 |
| 86 | Edward W. Brooke | Wellesley College | 1969 | 0.847166 | 3083 |
| 117 | Janet Napolitano | Northeastern University | 2014 | 0.853360 | 1526 |
| 289 | Whoopi Goldberg | Savannah College of Art and Design | 2011 | 0.855008 | 1248 |
Here, it is notable that the most stylistically unique speech was delivered in American Sign Language, and the second most stylistically unique speech was a short rhyme.
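As a quick illustration of the matching formula, we can compute the same score between just the first two speeches: for each category, similarity is 1 - |a - b| / (a + b), and the LSM score is the mean across categories.
# LSM between the first two speeches, using the same function-word categories
a = processed.loc[0, lsm_categories].astype(float)
b = processed.loc[1, lsm_categories].astype(float)
pair_lsm = float((1 - abs(a - b) / (a + b)).mean())
print(f"{processed.loc[0, 'name']} vs. {processed.loc[1, 'name']}: {pair_lsm:.3f}")
The pairwise comparison below applies this same calculation to every pair of speeches.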
We might also want to see which speeches are most similar to one another:
import numpy

# calculate all pairwise comparisons
lsm_pairs = processed[lsm_categories].T.corr(
    lambda a, b: numpy.mean(1 - abs(a - b) / (a + b))
)
# set self-matches to 0
numpy.fill_diagonal(lsm_pairs.values, 0)
# identify the closest match to each speech
speeches["match"] = lsm_pairs.idxmax()
best_match = lsm_pairs.max()
# look at the top matches, keeping each pair only once
top_matches = best_match.sort_values(ascending=False).index[:20].astype(int).to_list()
top_match_pairs = pandas.DataFrame(
    {"a": top_matches, "b": speeches["match"][top_matches].to_list()}
)
top_match_pairs = top_match_pairs[
    ~top_match_pairs.apply(
        lambda r: "".join(r.sort_values().astype(str)), 1
    ).duplicated()
]
pandas.concat(
    [
        speeches.iloc[top_match_pairs["a"], 1:4].reset_index(drop=True),
        pandas.DataFrame({"Similarity": best_match[top_match_pairs["a"]]}).reset_index(
            drop=True
        ),
        speeches.iloc[top_match_pairs["b"], 1:4].reset_index(drop=True),
    ],
    axis=1,
)
| | name | school | year | Similarity | name | school | year |
|---|---|---|---|---|---|---|---|
| 0 | Cynthia Enloe | Connecticut College | 2011 | 0.999325 | Howard Gordon | Connecticut College | 2013 |
| 1 | Benjamin Carson Jr. | Niagara University | 2003 | 0.984177 | Jonathon Youshaei | Deerfield High School | 2009 |
| 2 | John Legend | University of Pennsylvania | 2014 | 0.980730 | Sheryl Sandberg | City Colleges of Chicago | 2014 |
| 3 | Ronald Reagan | Eureka College | 1957 | 0.979818 | Arianna Huffington | Sarah Lawrence College | 2011 |
| 4 | Amy Poehler | Harvard University | 2011 | 0.977867 | Sheryl Sandberg | City Colleges of Chicago | 2014 |
| 5 | Melissa Harris-Perry | Wellesley College | 2012 | 0.975869 | Drew Houston | Massachusetts Institute of Technology | 2013 |
| 6 | Alan Alda | Connecticut College | 1980 | 0.975361 | Nora Ephron | Wellesley College | 1996 |
| 7 | Woody Hayes | Ohio State University | 1986 | 0.975117 | James Carville | Hobart and William Smith Colleges | 2013 |
| 8 | Mindy Kaling | Harvard Law School | 2014 | 0.975033 | Stephen Colbert | Wake Forest University | 2015 |
| 9 | Tim Cook | Auburn University | 2010 | 0.975007 | Arianna Huffington | Vassar College | 2015 |
| 10 | Barbara Bush | Wellesley College | 1990 | 0.974678 | Daniel S. Goldin | Massachusetts Institute of Technology | 2001 |
Full: Analyze Content¶
To look at content over time, we might focus on a potentially interesting framework, such as drives:
from statistics import linear_regression
from matplotlib.pyplot import subplots
from matplotlib.style import use

drive_data = processed.filter(regex="year|drives")[processed["year"] > 1980]
trending_drives = (
    drive_data.corrwith(drive_data["year"])
    .abs()
    .sort_values(ascending=False)[1:4]
    .index
)
first_year = drive_data["year"].min()

colors = ["#82C473", "#A378C0", "#616161", "#9F5C61", "#D3D280"]
linestyles = ["-", "--", ":", "-.", (5, (8, 2))]
use(["dark_background", {"figure.facecolor": "#1e2229", "axes.facecolor": "#1e2229"}])

fig, ax = subplots()
ax.set(ylabel="Score", xlabel="Year")
for i, cat in enumerate(trending_drives):
    points = ax.scatter(drive_data["year"], drive_data[cat], color=colors[i])
    beta, intercept = linear_regression(drive_data["year"], drive_data[cat])
    line = ax.axline(
        (first_year, intercept + beta * first_year),
        slope=beta,
        color=colors[i],
        linestyle=linestyles[i],
        label=cat,
    )
legend = ax.legend(loc="upper center")
To better visualize the effects, we might compare aggregated blocks of time:
summary = processed[trending_drives].aggregate(["mean", "std"])
standardized = (processed[trending_drives] - summary.loc["mean"]) / summary.loc["std"]
time_median = int(processed["year"].median())
standardized["Time Period"] = pandas.Categorical(
    processed["year"] >= time_median
).set_categories([f"< {time_median}", f">= {time_median}"], rename=True)
summaries = standardized.groupby("Time Period", observed=True)[trending_drives].agg(
    ["mean", "size"]
)

fig, ax = subplots()
ax.set(ylabel="Score (Scaled)", xlabel="Time Period")
for i, cat in enumerate(trending_drives):
    summary = summaries[cat]
    ax.errorbar(
        summary.index,
        summary["mean"],
        yerr=1 / summary["size"] ** 0.5,
        color=colors[i],
        linestyle=linestyles[i],
        label=cat,
        capsize=6,
    )
legend = ax.legend(loc="upper center")
This suggests that references to risk and reward have increased since the 2000s, while references to power have decreased at a similar rate. (Note that the error bars represent the approximate standard error of each group's mean, which lets you roughly gauge the statistical significance of the differences between means.)
The shift in emphasis from power to risk-reward could reflect that commencement speakers are now focusing more abstractly on the potential benefits and hazards of life after graduation, whereas earlier speakers more narrowly focused on ambition and dominance (perhaps referring to power held by past alumni and projecting the potential for graduates to climb social ladders in the future). You could examine a sample of speeches that show this pattern most dramatically (speeches high in risk-reward and low in power in recent years, and vice versa for pre-2009 speeches) to help determine how these themes have shifted and what specific motives or framing devices seem to have been (de)emphasized.
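As a rough sketch of how such a sample might be pulled (the drive column names are not hard-coded here because the exact labels depend on the framework output, so they are looked up by pattern; adjust the patterns to the columns actually present):
# contrast risk/reward emphasis against power emphasis for each speech
drive_columns = processed.filter(regex="drives").columns
power = processed[[c for c in drive_columns if "power" in c][0]]
risk_reward = processed[[c for c in drive_columns if "risk" in c or "reward" in c]].mean(axis=1)
contrast = risk_reward - power
recent = processed["year"] >= 2009
info = ["name", "school", "year"]

# recent speeches leaning most heavily toward risk/reward over power
print(processed.assign(contrast=contrast)[recent].nlargest(5, "contrast")[info])
# earlier speeches leaning most heavily toward power over risk/reward
print(processed.assign(contrast=contrast)[~recent].nsmallest(5, "contrast")[info])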
Analyze Segments¶
Another thing we might look for is trends within each speech. For instance, are there common emotional trajectories over the course of a speech?
One way to look at this would be to split texts into roughly equal sizes, and score each section:
import nltk
from math import ceil

nltk.download("punkt_tab", quiet=True)

def count_words(text: str):
    return len([token for token in nltk.word_tokenize(text) if token.isalnum()])

def split_text(text: str, bins=3):
    sentences = nltk.sent_tokenize(text)
    bin_size = ceil(count_words(text) / bins) + 1
    text_parts = [[]]
    word_counts = [0] * bins
    current_bin = 0
    for sentence in sentences:
        sentence_size = count_words(sentence)
        if (current_bin + 1) < bins and (
            word_counts[current_bin] + sentence_size
        ) > bin_size:
            text_parts.append([])
            current_bin += 1
        word_counts[current_bin] += sentence_size
        text_parts[current_bin].append(sentence)
    return pandas.DataFrame(
        {
            "text": [" ".join(x) for x in text_parts],
            "segment": pandas.Series(range(bins)) + 1,
            "WC": word_counts,
        }
    )
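To check that the splitter behaves as expected, we could first try it on a single speech; for example:
# segment the first speech and check the resulting word counts
split_text(speeches["text"][0])[["segment", "WC"]]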
segmented_text = []
for i, text in enumerate(speeches["text"]):
    text_parts = split_text(text)
    text_info = speeches.iloc[[i] * 3][["name", "school", "year"]]
    text_info.reset_index(drop=True, inplace=True)
    segmented_text.append(pandas.concat([text_info, text_parts], axis=1))
segmented_text = pandas.concat(segmented_text)
segmented_text.reset_index(drop=True, inplace=True)
segmented_text.iloc[0:9, :6]
| | name | school | year | text | segment | WC |
|---|---|---|---|---|---|---|
| 0 | Aaron Sorkin | Syracuse University | 2012 | Thank you very much. Madam Chancellor, members... | 1 | 856 |
| 1 | Aaron Sorkin | Syracuse University | 2012 | The actor had been offered the lead role in a ... | 2 | 827 |
| 2 | Aaron Sorkin | Syracuse University | 2012 | In that 11 years, I’ve written three televisio... | 3 | 880 |
| 3 | Abigail Washburn | Colorado College | 2012 | Bright morning stars are rising\nBright mornin... | 1 | 995 |
| 4 | Abigail Washburn | Colorado College | 2012 | I’m standing here thinking… yea right, how am ... | 2 | 934 |
| 5 | Abigail Washburn | Colorado College | 2012 | I looked around at the South Carolina party an... | 3 | 1083 |
| 6 | Adam Savage | Sarah Lawrence College | 2012 | To President Lawrence, Chairman Hill, the Boar... | 1 | 471 |
| 7 | Adam Savage | Sarah Lawrence College | 2012 | I decried and derided all of the skills I'd se... | 2 | 461 |
| 8 | Adam Savage | Sarah Lawrence College | 2012 | I'll wager that at some point you'll have the ... | 3 | 519 |
Segments: Process Text¶
Now we can send each segment to the API to be scored:
processed_segments = receptiviti.request(
    segmented_text["text"], version="v2", context="spoken"
)
segmented_text = pandas.concat([segmented_text, processed_segments], axis=1)
segmented_text.iloc[0:9, 8:]
summary.word_count | summary.words_per_sentence | summary.sentence_count | summary.six_plus_words | summary.capitals | summary.emojis | summary.emoticons | summary.hashtags | summary.urls | big_5.extraversion | ... | disc_dimensions.people_relationship_emotion_focus | disc_dimensions.task_system_object_focus | disc_dimensions.d_axis | disc_dimensions.i_axis | disc_dimensions.s_axis | disc_dimensions.c_axis | disc_dimensions.d_axis_proportional | disc_dimensions.i_axis_proportional | disc_dimensions.s_axis_proportional | disc_dimensions.c_axis_proportional | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 873 | 20.785714 | 42 | 0.182131 | 0.025233 | 0 | 0 | 0.000000 | 0 | 66.167710 | ... | 77.517546 | 38.989039 | 49.997553 | 70.498112 | 52.742495 | 37.405196 | 0.237356 | 0.334680 | 0.250388 | 0.177576 |
1 | 852 | 18.127660 | 47 | 0.213615 | 0.034065 | 0 | 0 | 0.000000 | 0 | 40.093982 | ... | 46.868716 | 47.558481 | 45.819695 | 45.486208 | 51.165188 | 51.540311 | 0.236170 | 0.234451 | 0.263723 | 0.265656 |
2 | 891 | 19.800000 | 45 | 0.181818 | 0.030344 | 0 | 0 | 0.000000 | 0 | 72.795664 | ... | 72.966879 | 54.987488 | 55.788997 | 64.265738 | 56.272576 | 48.850144 | 0.247756 | 0.285400 | 0.249903 | 0.216941 |
3 | 1030 | 35.517241 | 29 | 0.273786 | 0.024229 | 0 | 0 | 0.000000 | 0 | 58.482160 | ... | 66.958582 | 47.893402 | 52.189081 | 61.708492 | 53.739374 | 45.449313 | 0.244920 | 0.289594 | 0.252195 | 0.213291 |
4 | 953 | 25.078947 | 38 | 0.230850 | 0.040709 | 0 | 0 | 0.003148 | 0 | 54.971582 | ... | 50.129968 | 38.818173 | 40.352055 | 45.856070 | 53.946434 | 47.471349 | 0.215067 | 0.244402 | 0.287521 | 0.253011 |
5 | 1133 | 26.976190 | 42 | 0.233892 | 0.031886 | 0 | 0 | 0.000000 | 0 | 44.229974 | ... | 55.383952 | 38.426485 | 46.396992 | 55.701445 | 49.353260 | 41.109217 | 0.240947 | 0.289267 | 0.256299 | 0.213487 |
6 | 487 | 15.709677 | 31 | 0.250513 | 0.043437 | 0 | 0 | 0.000000 | 0 | 61.890701 | ... | 54.513837 | 53.597281 | 54.051544 | 54.511747 | 49.798123 | 49.377714 | 0.260190 | 0.262405 | 0.239715 | 0.237691 |
7 | 482 | 12.358974 | 39 | 0.215768 | 0.029583 | 0 | 0 | 0.000000 | 0 | 45.491979 | ... | 55.922241 | 42.280350 | 51.518933 | 59.250205 | 45.624964 | 39.671584 | 0.262764 | 0.302196 | 0.232702 | 0.202338 |
8 | 540 | 9.818182 | 55 | 0.200000 | 0.029501 | 0 | 0 | 0.000000 | 0 | 60.548414 | ... | 65.154960 | 56.856802 | 60.629919 | 64.903709 | 47.989630 | 44.829601 | 0.277669 | 0.297242 | 0.219780 | 0.205308 |
9 rows × 217 columns
Segments: Analyze Scores¶
The SALLEE framework offers measures of emotions, so we might see which categories deviate the most in any of their segments:
# select the narrower SALLEE categories
emotions = segmented_text.filter(regex="^sallee").iloc[:, 6:]
# correlate emotion scores with segment contrasts
# and select the 5 most deviating emotions
most_deviating = emotions[
    pandas.get_dummies(segmented_text["segment"])
    .apply(emotions.corrwith)
    .abs()
    .agg("max", 1)
    .sort_values(ascending=False)[:5]
    .index
]
Now we can look at those categories across segments:
from matplotlib.colors import ListedColormap

segment_data = most_deviating.groupby(segmented_text["segment"])
bars = segment_data.agg("mean").T.plot.bar(
    colormap=ListedColormap(colors[:3]),
    yerr=(segment_data.agg("std") / segment_data.agg("count") ** 0.5).values,
    capsize=3,
    ylabel="Score",
    xlabel="Category",
)
The bar chart displays the original values, which gives the clearest view of how meaningful the differences between segments might be, in addition to their statistical significance (a rough guide to the reliability of the differences, based on the variance within and between segments). From the bar chart, you can immediately see that admiration shows some of the starkest differences between the middle and the early/late segments.
scaled_summaries = (
    (most_deviating - most_deviating.agg("mean")) / most_deviating.agg("std")
).groupby(segmented_text["segment"]).agg(["mean", "size"])

fig, ax = subplots()
ax.set(ylabel="Score (Scaled)", xlabel="Segment")
for i, cat in enumerate(most_deviating):
    summary = scaled_summaries[cat]
    ax.errorbar(
        summary.index.astype(str),
        summary["mean"],
        yerr=1 / summary["size"] ** 0.5,
        color=colors[i],
        linestyle=linestyles[i],
        label=cat,
        capsize=4,
    )
legend = fig.legend(loc="right", bbox_to_anchor=(1.2, .5))
The line chart, on the other hand, shows standardized values, effectively zooming in on the differences between segments. This makes it clearer, for example, that admiration and joy seem to serve as bookends in commencement speeches, peaking early and late, whereas more negative and intense emotions such as anger, disgust, and surprise peak in the middle section.