# Exercise 3: Comparison and complexity
## Introduction
The hands-on exercise for this week focuses on: 1) comparing texts; 2) measuring the document-level characteristics of text---here, complexity.
In this tutorial, you will learn how to:
* Compare texts using character-based measures of similarity and distance
* Compare texts using term-based measures of similarity and distance
* Calculate the complexity of texts
* Replicate analyses from @schoonvelde_liberals_2019
## Setup
Before proceeding, we'll load the remaining packages we will need for this tutorial.
```{r, echo=F}
library(kableExtra)
```
```{r, message=F}
library(readr) # more informative and easy way to import data
library(quanteda) # includes functions to implement Lexicoder
library(quanteda.textstats) # for estimating similarity and complexity measures
library(stringdist) # for basic character-based distance measures
library(dplyr) #for wrangling data
library(tibble) #for wrangling data
library(ggplot2) #for visualization
```
For this example we'll be using data from the 2017-2018 Theresa May Cabinet in the UK. The data are tweets by members of this cabinet.
We can load the data as follows.
```{r}
tweets <- readRDS("data/comparison-complexity/cabinet_tweets.rds")
```
If you're working on this document from your own computer ("locally") you can download the tweets data in the following way:
```{r, eval = F}
tweets <- readRDS(gzcon(url("https://github.com/cjbarrie/CTA-ED/blob/main/data/comparison-complexity/cabinet_tweets.rds?raw=true")))
```
And we see that the data contain three variables: "username," which is the username of the MP in question; "tweet," which is the text of the given tweet; and "date," which is the date of the tweet in yyyy-mm-dd format.
```{r}
head(tweets)
```
And there are 24 MPs whose tweets we're examining.
```{r}
unique(tweets$username)
length(unique(tweets$username))
```
## Generate document feature matrix
In order to use the `quanteda` package and its accompanying `quanteda.textstats` package, we need to reformat the data into a quanteda "corpus" object. To do this we just need to specify the text we're interested in as well as any associated document-level variables in which we're interested.
We can do this as follows.
```{r}
#make corpus object, specifying tweet as text field
tweets_corpus <- corpus(tweets, text_field = "tweet")
#add in username document-level information
docvars(tweets_corpus, "username") <- tweets$username
tweets_corpus
```
We are now ready to reformat the data into a document feature matrix.
```{r}
dfmat <- dfm(tokens(tweets_corpus,
remove_punct = TRUE)) %>%
dfm_remove(stopwords("english"))
dfmat
```
Note that when we do this we need to have tokenized our corpus object first. We can do this by wrapping the `tokens` function inside the `dfm()` function as above.
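Equivalently, we could tokenize in a separate step and then pass the tokens object to `dfm()`; a minimal sketch of the same pipeline:
```{r, eval = F}
# tokenize first, then build the document feature matrix in a second step
toks <- tokens(tweets_corpus, remove_punct = TRUE)
dfmat <- dfm(toks) %>%
  dfm_remove(stopwords("english"))
```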
So what is this object? Well, the documents here are tweets. And the matrix is a sparse (i.e., mostly zeroes) matrix of counts recording how many times a given word appears in the document (tweet) in question.
The vertical elements (columns) of this matrix are made up of all the words used across all of the tweets combined. Here, it helps to imagine every tweet positioned side by side to understand what's going on.
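To get a feel for the object, we can check its dimensions and most frequent features (a quick sanity check rather than a required step):
```{r, eval = F}
ndoc(dfmat)             # number of documents (tweets)
nfeat(dfmat)            # number of features (unique words)
topfeatures(dfmat, 10)  # ten most frequent words across all tweets
```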
## Compare between MPs
Once we have our data in this format, we are ready to compare between the text produced by members of Theresa May's Cabinet.
Here's an example of the correlations between the combined tweets of five of the MPs.
```{r}
corrmat <- dfmat %>%
dfm_group(groups = username) %>%
textstat_simil(margin = "documents", method = "correlation")
corrmat[1:5,1:5]
```
Note that here we're using the `dfm_group()` function, which allows you to take a document feature matrix and make calculations while grouping by one of the document-level variables we specified above.
There are many different measures of similarity, however, that we might think about using.
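As an aside, the learning goals above also mention character-based measures. The `stringdist` package we loaded computes these directly on raw strings rather than on a document feature matrix; a minimal sketch (the example strings here are invented):
```{r, eval = F}
# Levenshtein (edit) distance between two raw strings
stringdist("Brexit means Brexit", "Brexit means leaving the EU", method = "lv")
# normalized character-based similarity using the same method
stringsim("Brexit means Brexit", "Brexit means leaving the EU", method = "lv")
```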
In the below, we combine four different measures of similarity, and see how they compare to each other across MPs. Note that here we're looking only at the similarity between an MP's tweets and those of then Prime Minister, Theresa May.
## Compare between measures
Let's see what this looks like for one of these measures---cosine similarity.
We first get similarities between the tweets of each MP and those of all other MPs.
```{r}
#estimate similarity, grouping by username
cos_sim <- dfmat %>%
dfm_group(groups = username) %>%
textstat_simil(margin = "documents", method = "cosine") #specify method here as character object
```
But remember we're only interested in how they compare to what Theresa May has been saying.
So we need to take these cosine similarities and retain only those similarity measures corresponding to the text of Theresa May's tweets.
We first convert the `textstat_simil()` output to a matrix.
```{r}
cosmat <- as.matrix(cos_sim) #convert to a matrix
```
And we can see that the 23rd row of this matrix contains the similarity measures with the Theresa May tweets.
We take this row, removing the similarity of Theresa May with herself (which will always equal 1), and convert it to a dataframe object.
```{r}
#generate data frame keeping only the row for Theresa May
cosmatdf <- as.data.frame(cosmat[23, c(1:22, 24)])
```
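Note that hard-coding row 23 works for this particular dataset, but a more robust approach is to look the row up by name. A minimal sketch, assuming the Prime Minister's handle appears in the row names (the exact string, e.g. "theresa_may", will depend on the data):
```{r, eval = F}
pm_handle <- "theresa_may"  # hypothetical handle; check rownames(cosmat) for the exact string
pm_row <- which(rownames(cosmat) == pm_handle)
cosmatdf <- as.data.frame(cosmat[pm_row, -pm_row])
```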
We then rename the cosine similarity column with an appropriate name and convert row names to a column variable so that we have cells containing information on the MP to which the cosine similarity measure refers.
```{r}
#rename column
colnames(cosmatdf) <- "corr_may"
#create column variable from rownames
cosmatdf <- tibble::rownames_to_column(cosmatdf, "username")
```
We now have our data in tidy format, which we can then plot like so.
```{r}
ggplot(cosmatdf) +
geom_point(aes(x=reorder(username, -corr_may), y= corr_may)) +
coord_flip() +
xlab("MP username") +
ylab("Cosine similarity score") +
theme_minimal()
```
Combining these steps into a single `for` loop, we can see how our different similarity measures of interest compare.
```{r}
#specify different similarity measures to explore
methods <- c("correlation", "cosine", "dice", "edice")
#create empty dataframe
testdf_all <- data.frame()
#gen for loop across methods types
for (i in seq_along(methods)) {
#pass method to character string object
sim_method <- methods[[i]]
#estimate similarity, grouping by username
test <- dfmat %>%
dfm_group(groups = username) %>%
textstat_simil(margin = "documents", method = sim_method) #specify method here as character object created above
testm <- as.matrix(test) #convert to a matrix
#generate data frame keeping only the row for Theresa May
testdf <- as.data.frame(testm[23, c(1:22, 24)])
#rename column
colnames(testdf) <- "corr_may"
#create column variable from rownames
testdf <- tibble::rownames_to_column(testdf, "username")
#record method in new column variable
testdf$method <- sim_method
#bind all together
testdf_all <- rbind(testdf_all, testdf)
}
#create variable (for viz only) that is mean of similarity scores for each MP
testdf_all <- testdf_all %>%
group_by(username) %>%
mutate(mean_sim = mean(corr_may))
ggplot(testdf_all) +
geom_point( aes(x=reorder(username, -mean_sim), y= corr_may, color = method)) +
coord_flip() +
xlab("MP username") +
ylab("Similarity score") +
theme_minimal()
```
## Complexity
We now move to document-level measures of text characteristics. And here we will focus on the paper by @schoonvelde_liberals_2019.
We will be using a subset of these data, taken from EU speeches given by four politicians. These are provided by the authors at [https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/S4IZ8K](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/S4IZ8K).
We can load the data as follows.
```{r}
speeches <- readRDS("data/comparison-complexity/speeches.rds")
```
If you're working on this document from your own computer ("locally") you can download the speeches data in the following way:
```{r, eval = F}
speeches <- readRDS(gzcon(url("https://github.com/cjbarrie/CTA-ED/blob/main/data/comparison-complexity/speeches.rds?raw=true")))
```
And we can take a look at what the data contains below.
```{r}
head(speeches)
```
The data contain speeches by four different politicians, each of whom is positioned at a different point on a liberal-conservative scale.
We can then calculate the Flesch-Kincaid readability/complexity score with the `quanteda.textstats` package like so.
```{r}
speeches$flesch.kincaid <- textstat_readability(speeches$text, measure = "Flesch.Kincaid")
# returned as quanteda data.frame with document-level information;
# need just the score:
speeches$flesch.kincaid <- speeches$flesch.kincaid$Flesch.Kincaid
```
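Note that `textstat_readability()` can compute several measures in one call, which will come in handy for the exercises below; a minimal sketch (see `?textstat_readability` for the full list of supported measures):
```{r, eval = F}
# compute several readability measures at once
read_multi <- textstat_readability(speeches$text,
                                   measure = c("Flesch", "Flesch.Kincaid", "FOG"))
head(read_multi)
```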
We want this information aggregated over each of our politicians: Gordon Brown, Jose Zapatero, David Cameron, and Mariano Rajoy. These are recorded in the data under a column called "speaker."
```{r}
#get mean and standard deviation of Flesch-Kincaid, and N of speeches for each speaker
sum_corpus <- speeches %>%
group_by(speaker) %>%
summarise(mean = mean(flesch.kincaid, na.rm=TRUE),
SD=sd(flesch.kincaid, na.rm=TRUE),
N=length(speaker))
# calculate standard errors and confidence intervals
sum_corpus$se <- sum_corpus$SD / sqrt(sum_corpus$N)
sum_corpus$min <- sum_corpus$mean - 1.96*sum_corpus$se
sum_corpus$max <- sum_corpus$mean + 1.96*sum_corpus$se
```
And this gives us data in tidy format that looks like so.
```{r}
sum_corpus
```
Which we can then plot---and we see that our results look like those in Figure 1 of the published article by @schoonvelde_liberals_2019.
```{r}
ggplot(sum_corpus, aes(x=speaker, y=mean)) +
geom_bar(stat="identity") +
geom_errorbar(ymin=sum_corpus$min,ymax=sum_corpus$max, width=.2) +
coord_flip() +
xlab("") +
ylab("Mean Complexity") +
theme_minimal() +
ylim(c(0,20))
```
## Exercises
1. Compute distance measures such as "euclidean" or "manhattan" for the MP tweets as above, comparing tweets by MPs with tweets by the PM, Theresa May.
2. Estimate at least three other complexity measures for the EU speeches as above. Consider how the results compare to the Flesch-Kincaid measure used in the article by @schoonvelde_liberals_2019.
3. (Advanced---optional) Estimate similarity scores between the MP tweets and the PM tweets for each week contained in the data. Plot the results.