---
title: "Linguistic landscapes"
title-block-banner: true
subtitle: "Paris' 13th arrondissement"
author:
  - name: Olivier Caron
    email: olivier.caron@dauphine.psl.eu
    affiliations:
      - name: "Paris Dauphine - PSL"
        city: Paris
        state: France
date: "last-modified"
toc: true
number-sections: true
number-depth: 5
format:
  html:
    theme:
      light: yeti
      dark: darkly
    code-fold: true
    code-summary: "Display code"
    code-tools: true # show/hide all code blocks at once
    code-copy: true  # enable copying code
    grid:
      body-width: 1000px
      margin-width: 100px
    toc: true
    toc-location: left
execute:
  echo: true
  warning: false
  message: false
editor: visual
fig-align: "center"
highlight-style: ayu
css: styles.css
reference-location: margin
---

## Linguistic landscapes

As part of the seminar, we had to wander around the 13th arrondissement of Paris and take photos of text written on walls, displays, buildings, and shops, to get a sense of the neighborhood's multilingualism.

The course is not about NLP, but since that is what I do, I wanted to take the opportunity to try new methods, like extracting text from photos.

The photos I took are in the `images/photos` folder. From a quick search online, tesseract looks like a nice package for exactly this task. So let's try it.

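For a single image, the whole idea fits in one call. Here is a minimal sketch, assuming the tesseract package and its French trained data are available; `example.jpg` is a placeholder name, not one of the actual photos:

```{r}
#| eval: false
# Minimal sketch: OCR one photo with a French engine.
# "example.jpg" is a placeholder file name.
library(tesseract)
tesseract_download("fra")                      # fetch French trained data
fr <- tesseract("fra")                         # build a French OCR engine
ocr("images/photos/example.jpg", engine = fr)  # returns the recognized text
```
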
## Libraries
```{r}
#| label: libraries
#| message: false
library(tidyverse)
library(tesseract)
library(quanteda)
library(magick)
library(quanteda.textstats)
library(quanteda.textplots)
library(reactable)
```
## Extract text from photos
```{r}
#| label: ocr
files <- list.files("images/photos", full.names = TRUE)
sentence <- list()

tesseract_download("fra")   # download French trained data
fr <- tesseract("fra")      # French OCR engine

t1 <- Sys.time()
for (photo in files) {
  cat("Processing photo", basename(photo), "\n")
  text <- magick::image_read(photo) %>%
    tesseract::ocr_data(engine = fr) %>%  # one row per recognized word
    filter(confidence > 50)               # keep reasonably confident words only
  sentence[[basename(photo)]] <- text
}
Sys.time() - t1

all_texts <- bind_rows(sentence, .id = "n_photo")
```

```{r}
#| label: clean-text
all_texts <- all_texts %>%
  filter(word != "|") %>%
  # strip punctuation, digits, and control characters
  mutate(word = str_replace_all(word, "[[:punct:]]|\\d+|[[:cntrl:]]", ""))

# one reconstructed sentence per photo
text_df <- all_texts %>%
  group_by(n_photo) %>%
  summarize(sentence = paste(word, collapse = " "))

reactable(text_df, striped = TRUE)
```

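As a quick sanity check (a side sketch, not part of the original pipeline), we can count how many words survived the confidence filter and the cleaning in each photo:

```{r}
#| eval: false
# How many cleaned words remain per photo?
all_texts %>%
  filter(word != "") %>%
  count(n_photo, name = "n_words") %>%
  arrange(desc(n_words))
```
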
## Co-occurrence network

```{r}
#| label: cooccurrence-network
set.seed(100)
toks <- text_df %>%
  pull(sentence) %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove("|")

fcmat <- fcm(toks, context = "window", tri = FALSE)  # feature co-occurrence matrix
feat <- names(topfeatures(fcmat, 30))                # 30 most frequent features
fcm_select(fcmat, pattern = feat) %>%
  textplot_network(min_freq = 1,
                   vertex_labelsize = 1.5 * rowSums(.) / min(rowSums(.)))
```

## Co-occurrence network without stopwords

```{r}
#| label: cooccurrence-network-nostop
set.seed(100)
toks <- text_df %>%
  pull(sentence) %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(pattern = c(stopwords("french"), "|"), padding = FALSE)

fcmat <- fcm(toks, context = "window", tri = FALSE)
feat <- names(topfeatures(fcmat, 30))
fcm_select(fcmat, pattern = feat) %>%
  textplot_network(min_freq = 1,
                   vertex_labelsize = rowSums(.) / min(rowSums(.)))
```

## Wordcloud
```{r}
#| label: wordcloud
set.seed(10)
# Build the dfm via tokens(); passing remove/remove_punct directly to
# dfm() is deprecated in recent quanteda versions.
dfmat1 <- text_df$sentence %>%
  corpus() %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(c(stopwords("french"), "|")) %>%
  dfm() %>%
  dfm_trim(min_termfreq = 1)

# basic wordcloud
textplot_wordcloud(dfmat1)
```

## Collocations
```{r}
text_tokens <- tokens(text_df$sentence, remove_punct = TRUE)
# extract collocations
text_coll <- textstat_collocations(text_tokens, size = 2, min_count = 1)
# inspect
text_coll[1:6, 1:6]
```
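
The `lambda` and `z` columns measure the strength of association between the two words. As a follow-up sketch (the `z > 2` cutoff is an arbitrary illustration, not part of the original analysis), strongly associated bigrams could be merged into single tokens for later steps:

```{r}
#| eval: false
# Sketch: keep strongly associated bigrams and compound them into
# single tokens; the z > 2 threshold is arbitrary.
strong <- subset(text_coll, z > 2)
text_tokens_comp <- tokens_compound(text_tokens, pattern = strong)
```
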
## Sentiment analysis using DistilCamemBERT-Sentiment
```{python}
#| label: sentiment
from transformers import pipeline

# French sentiment classifier based on DistilCamemBERT
sentiment = pipeline(
    task="text-classification",
    model="cmarkea/distilcamembert-base-sentiment",
    tokenizer="cmarkea/distilcamembert-base-sentiment"
)

# Text recognized from one of the photos: a glass-recycling bin sign
result = sentiment(
    "Ne pas déposer miroirs et vitres verres à boire vaisselle Pour la tranquillité des riverains merci de ne pas jeter vos verres entre heures et heures Attention à ne pas vous coincer les doigts",
    top_k=None
)
result
```
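
Scoring one hard-coded caption is only a proof of concept. A natural next step (a sketch, assuming the document is rendered with the knitr engine so that reticulate exposes R objects under `r.`) would be to score every reconstructed sentence from `text_df`:

```{python}
#| eval: false
# Sketch: score all reconstructed sentences. `r.text_df` is the R data
# frame exposed to Python by reticulate (knitr engine assumed).
sentences = list(r.text_df["sentence"])
results = sentiment(sentences, top_k=None)
for sent, scores in zip(sentences, results):
    best = max(scores, key=lambda s: s["score"])  # most probable label
    print(f"{best['label']:>8}  {best['score']:.2f}  {sent[:60]}")
```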