---
title: "Linguistic landscapes"
title-block-banner: true
subtitle: "Paris' 13th arrondissement"
author:
  - name: Olivier Caron
    email: olivier.caron@dauphine.psl.eu
    affiliations:
      - name: "Paris Dauphine - PSL"
        city: Paris
        state: France
date: "last-modified"
toc: true
number-sections: true
number-depth: 5
format:
  html:
    theme:
      light: yeti
      dark: darkly
    code-fold: true
    code-summary: "Display code"
    code-tools: true # show/hide all code blocks at once
    code-copy: true  # enable copying code
    grid:
      body-width: 1000px
      margin-width: 100px
    toc: true
    toc-location: left
execute:
  echo: true
  warning: false
  message: false
editor: visual
fig-align: "center"
highlight-style: ayu
css: styles.css
reference-location: margin
---

## Linguistic landscapes

As part of the seminar, we had to wander around the 13th arrondissement of Paris and take photos of text written on walls, displays, buildings, and shops, to get a sense of the neighborhood's multilingualism.

The course is not about NLP, but since that is what I do, I wanted to take the opportunity to try new methods, like extracting text from photos.

The photos I took are in the `images/photos` folder. From a quick search online, tesseract looks like a nice package for exactly this task. So let's try it.

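For a single image, the whole idea fits in one call. Here is a minimal sketch, assuming the tesseract package and its French trained data are available; `example.jpg` is a placeholder name, not one of the actual photos:

```{r}
#| eval: false
# Minimal sketch: OCR one photo with a French engine.
# "example.jpg" is a placeholder file name.
library(tesseract)
tesseract_download("fra")                      # fetch French trained data
fr <- tesseract("fra")                         # build a French OCR engine
ocr("images/photos/example.jpg", engine = fr)  # returns the recognized text
```
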
## Libraries
```{r}
#| label: libraries
#| message: false
library(tidyverse)
library(tesseract)
library(quanteda)
library(magick)
library(quanteda.textstats)
library(quanteda.textplots)
library(reactable)
```
## Extract text from photos
```{r}
#| label: ocr
files <- list.files("images/photos", full.names = TRUE)
sentence <- list()

tesseract_download("fra")   # download French trained data
fr <- tesseract("fra")      # French OCR engine

t1 <- Sys.time()
for (photo in files) {
  cat("Processing photo", basename(photo), "\n")
  text <- magick::image_read(photo) %>%
    tesseract::ocr_data(engine = fr) %>%  # one row per recognized word
    filter(confidence > 50)               # keep reasonably confident words only
  sentence[[basename(photo)]] <- text
}
Sys.time() - t1

all_texts <- bind_rows(sentence, .id = "n_photo")
```

```{r}
#| label: clean-text
all_texts <- all_texts %>%
  filter(word != "|") %>%
  # strip punctuation, digits, and control characters
  mutate(word = str_replace_all(word, "[[:punct:]]|\\d+|[[:cntrl:]]", ""))

# one reconstructed sentence per photo
text_df <- all_texts %>%
  group_by(n_photo) %>%
  summarize(sentence = paste(word, collapse = " "))

reactable(text_df, striped = TRUE)
```

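As a quick sanity check (a side sketch, not part of the original pipeline), we can count how many words survived the confidence filter and the cleaning in each photo:

```{r}
#| eval: false
# How many cleaned words remain per photo?
all_texts %>%
  filter(word != "") %>%
  count(n_photo, name = "n_words") %>%
  arrange(desc(n_words))
```
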
## Co-occurrence network

```{r}
#| label: cooccurrence-network
set.seed(100)
toks <- text_df %>%
  pull(sentence) %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove("|")

fcmat <- fcm(toks, context = "window", tri = FALSE)  # feature co-occurrence matrix
feat <- names(topfeatures(fcmat, 30))                # 30 most frequent features
fcm_select(fcmat, pattern = feat) %>%
  textplot_network(min_freq = 1,
                   vertex_labelsize = 1.5 * rowSums(.) / min(rowSums(.)))
```

## Co-occurrence network without stopwords

```{r}
#| label: cooccurrence-network-nostop
set.seed(100)
toks <- text_df %>%
  pull(sentence) %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(pattern = c(stopwords("french"), "|"), padding = FALSE)

fcmat <- fcm(toks, context = "window", tri = FALSE)
feat <- names(topfeatures(fcmat, 30))
fcm_select(fcmat, pattern = feat) %>%
  textplot_network(min_freq = 1,
                   vertex_labelsize = rowSums(.) / min(rowSums(.)))
```

## Wordcloud
```{r}
#| label: wordcloud
set.seed(10)
# Build the dfm via tokens(); passing remove/remove_punct directly to
# dfm() is deprecated in recent quanteda versions.
dfmat1 <- text_df$sentence %>%
  corpus() %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(c(stopwords("french"), "|")) %>%
  dfm() %>%
  dfm_trim(min_termfreq = 1)

# basic wordcloud
textplot_wordcloud(dfmat1)
```

## Collocations
```{r}
text_tokens <- tokens(text_df$sentence, remove_punct = TRUE)
# extract collocations
text_coll <- textstat_collocations(text_tokens, size = 2, min_count = 1)
# inspect
text_coll[1:6, 1:6]
```
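
The `lambda` and `z` columns measure the strength of association between the two words. As a follow-up sketch (the `z > 2` cutoff is an arbitrary illustration, not part of the original analysis), strongly associated bigrams could be merged into single tokens for later steps:

```{r}
#| eval: false
# Sketch: keep strongly associated bigrams and compound them into
# single tokens; the z > 2 threshold is arbitrary.
strong <- subset(text_coll, z > 2)
text_tokens_comp <- tokens_compound(text_tokens, pattern = strong)
```
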
## Sentiment analysis using DistilCamemBERT-Sentiment
```{python}
#| label: sentiment
from transformers import pipeline

# French sentiment classifier based on DistilCamemBERT
sentiment = pipeline(
    task="text-classification",
    model="cmarkea/distilcamembert-base-sentiment",
    tokenizer="cmarkea/distilcamembert-base-sentiment"
)

# Text recognized from one of the photos: a glass-recycling bin sign
result = sentiment(
    "Ne pas déposer miroirs et vitres verres à boire vaisselle Pour la tranquillité des riverains merci de ne pas jeter vos verres entre heures et heures Attention à ne pas vous coincer les doigts",
    top_k=None
)
result
```
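
Scoring one hard-coded caption is only a proof of concept. A natural next step (a sketch, assuming the document is rendered with the knitr engine so that reticulate exposes R objects under `r.`) would be to score every reconstructed sentence from `text_df`:

```{python}
#| eval: false
# Sketch: score all reconstructed sentences. `r.text_df` is the R data
# frame exposed to Python by reticulate (knitr engine assumed).
sentences = list(r.text_df["sentence"])
results = sentiment(sentences, top_k=None)
for sent, scores in zip(sentences, results):
    best = max(scores, key=lambda s: s["score"])  # most probable label
    print(f"{best['label']:>8}  {best['score']:.2f}  {sent[:60]}")
```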