giscience-historic-text-analysis.Rmd

---
title: "Text analysis of accepted full papers at historic GIScience conferences"
author: Daniel Nüst, Opening Reproducible Research (o2r), Institute for Geoinformatics,
  University of Münster
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    toc: yes
    code_download: true
    code_folding: hide
    self_contained: false
    lib_dir: libs
params:
  with_sp: no
---

```{css, echo=FALSE}
pre {
  font-size: 12px;
  overflow-x: auto;
}
pre code {
  word-wrap: normal;
  white-space: pre;
}
```

## Introduction

This document is an exploratory analysis of all accepted full papers and posters at the [GIScience conference series](https://www.giscience.org/).
The analysis is based on the text analysis published in _"Reproducible research and GIScience: an evaluation using AGILE conference papers"_ ([https://doi.org/10.7717/peerj.5072](https://doi.org/10.7717/peerj.5072)).

```{r load_libraries, message=FALSE, warning=FALSE}
library("here")
library("pdftools")
library("stringr")
library("tidyverse")
library("tidytext")
library("wordcloud")
library("RColorBrewer")
library("grid")
library("gridBase")
library("gridExtra")
library("kableExtra")
library("quanteda")

# for deterministic cloud rendering
set.seed(nchar("International Conference on Geographic Information Science"))
```

## Load data

**List of proceedings**

- Proceedings 10th International Conference on Geographic Information Science (GIScience 2018). 2018. Winter, S., Griffin, A., Sester, M. (Eds.), LIPIcs Vol. 114. ISBN 978-3-95977-083-5. http://www.dagstuhl.de/dagpub/978-3-95977-083-5
- Geographic Information Science. 2016. J. A. Miller, D. O’Sullivan, & N. Wiegand (Eds.), Lecture Notes in Computer Science. Springer International Publishing. https://doi.org/10.1007/978-3-319-45738-3
- Geographic Information Science. 2014. M. Duckham, E. Pebesma, K. Stewart, & A. U. Frank (Eds.), Lecture Notes in Computer Science. Springer International Publishing. https://doi.org/10.1007/978-3-319-11593-1
- Geographic Information Science. 2012. N. Xiao, M.-P. Kwan, M. F. Goodchild, & S. Shekhar (Eds.), Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-33024-7
- Geographic Information Science. 2010. Fabrikant, S.I., Reichenbacher, T., Kreveld, M. van, Schlieder, C. (Eds.), Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-15300-6
- Geographic Information Science. 2008. Cova, T.J., Miller, H.J., Beard, K., Frank, A.U., Goodchild, M.F. (Eds.), Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-87473-7
- Geographic Information Science. 2006. Raubal, M., Miller, H.J., Frank, A.U., Goodchild, M.F. (Eds.), Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/11863939
- Geographic Information Science. 2004. Egenhofer, M.J., Freksa, C., Miller, H.J. (Eds.), Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/b101397
- Geographic Information Science. 2002. Egenhofer, M.J., Mark, D.M. (Eds.), Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-45799-2

LNCS proceedings are available at the publisher website: [https://link.springer.com/conference/giscience](https://link.springer.com/conference/giscience).

**Note:** The 2018 proceedings include the short papers in the same document.
For comparability, only the full papers are taken into account for the analysis below.

```{r input_files}
data_path <- here::here("proceedings")
proceedings <- c(
  "2002" = "geographic-information-science-2002.pdf",
  "2004" = "geographic-information-science-2004.pdf",
  "2006" = "geographic-information-science-2006.pdf",
  "2008" = "geographic-information-science-2008.pdf",
  "2010" = "geographic-information-science-2010.pdf",
  "2012" = "10.1007_978-3-642-33024-7.pdf",
  "2014" = "10.1007_978-3-319-11593-1.pdf",
  "2016" = "10.1007_978-3-319-45738-3.pdf",
  "2018" = "lipics-vol114-giscience2018-complete.pdf"
)
proceedings_files <- file.path(data_path, proceedings)
names(proceedings_files) <- names(proceedings)
```

Add the PDFs to a directory called ` `r data_path` ` next to the file `giscience-historic-text-analysis.Rmd` (this file).
The proceedings are not openly available for the years 2012 to 2016.
You can contact the original paper authors and ask for the test dataset to reproduce the full analysis.
Alternatively, you can download the 2018 proceedings from the LIPIcs website (Open Access; [direct PDF link](https://drops.dagstuhl.de/opus/volltexte/lipics-complete/lipics-vol114-giscience2018-complete.pdf)) and conduct the analysis with that subset of the data.
For the analysis below the following input files were used:

```{r list_files}
knitr::kable(tibble(year = names(proceedings), file = proceedings)) %>%
  kableExtra::kable_styling("striped", full_width = FALSE)
```
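
If only some of the proceedings are at hand, for example just the openly available 2018 volume, the list of input files can be reduced to the files that exist on disk. The following is a minimal sketch and is not evaluated when the document is rendered; the directory and file names in it are hypothetical and created only for the demonstration. In the analysis itself, the same subsetting could be applied to `proceedings_files`.

```r
# Hypothetical directory and file names, created only for this demonstration.
demo_dir <- file.path(tempdir(), "proceedings-demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_files <- c(
  "2016" = file.path(demo_dir, "not-downloaded-2016.pdf"),
  "2018" = file.path(demo_dir, "downloaded-2018.pdf")
)

# Simulate that only the 2018 volume was downloaded.
invisible(file.create(demo_files[["2018"]]))

# Keep only the entries whose file exists; the names (years) are preserved.
available_files <- demo_files[file.exists(demo_files)]
names(available_files)
```

Because the years are kept as names, later steps that rely on `names(...)` for the year labels would continue to work with the reduced vector.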

```{r data_download_drive, eval=FALSE}
# Code not evaluated when document is rendered!
dir.create(data_path, showWarnings = FALSE)

library("googledrive")
drive_dir <- googledrive::drive_get("https://drive.google.com/drive/folders/17EUtM_zCx1gQMea1MHN_5XSVrssxv9GA")
drive_dir_contents <- googledrive::drive_ls(drive_dir)
for (i in seq_len(nrow(drive_dir_contents))) {
  current <- drive_dir_contents[i, ]
  if (endsWith(current$name, ".pdf")) {
    googledrive::drive_download(googledrive::as_id(current$id),
                                file.path(data_path, current$name))
  }
}
```

The text is extracted from PDFs and it is processed to create a [tidy](https://www.jstatsoft.org/article/view/v059i10) data structure without [stop words](https://en.wikipedia.org/wiki/Stop_words).
The stop words include specific words, such as `university`, which appears on many pages, abbreviations such as `e.g.`, and terms particular to scientific articles, such as `figure`.
In addition, all numeric literals are removed from the word list.
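
The cleaning steps can be illustrated on a single invented sentence. This is a rough sketch in base R and is not evaluated when the document is rendered: the split on non-alphanumeric characters only approximates `tidytext::unnest_tokens()` (which, for instance, keeps `e.g` as one token), and `toy_stop_words` is a hypothetical mini list, not the lexicon used in the analysis.

```r
# Invented example text; not taken from the proceedings.
page <- "Figure 2 shows the 42 trajectories, e.g. at the University of Example."

# Lowercase and split on anything that is not a letter or digit
# (rough stand-in for tidytext::unnest_tokens()).
tokens <- strsplit(tolower(page), "[^a-z0-9]+")[[1]]
tokens <- tokens[nzchar(tokens)]

# Drop numeric literals: tokens that can be parsed as a number.
no_numbers <- tokens[is.na(suppressWarnings(as.numeric(tokens)))]

# Hypothetical mini stop word list for this example only.
toy_stop_words <- c("the", "of", "at", "e", "g", "figure", "university")
no_stop_words <- no_numbers[!no_numbers %in% toy_stop_words]

# Share of stop words among the non-numeric tokens, in percent.
stop_word_share <- round((1 - length(no_stop_words) / length(no_numbers)) * 100)

no_stop_words
stop_word_share
```

Here 11 non-numeric tokens remain after removing `2` and `42`, and 3 of them survive the stop word filter, so the share of stop words is one minus the share of remaining words.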

```{r load_files, cache=TRUE}
texts <- lapply(proceedings_files, pdftools::pdf_text)

if(params$with_sp) {
  texts[["2018-sp"]] <- texts[["2018"]][c(283:length(texts[["2018"]]))]
  # the short papers are part of the same file as the 2018 full papers
  proceedings_files <- c(proceedings_files, `2018-sp` = proceedings_files[["2018"]])
}

# don't include short papers in 2018 year
texts[["2018"]] <- texts[["2018"]][c(1:282)]

# join the pages of each volume into one string; collapse must be a string, not TRUE
texts <- unlist(lapply(texts, stringr::str_c, collapse = "\n"))

tidy_texts <- tibble::tibble(year = names(texts),
                             path = proceedings_files,
                             text = texts)

# create a table of all words
all_words <- tidy_texts %>%
  dplyr::select(year, text) %>%
  tidytext::unnest_tokens(word, text)

# remove stop words and remove numbers
my_stop_words <- tibble::tibble(
  word = c(
    "et",
    "al",
    "fig",
    "e.g",
    "i.e",
    "http",
    "ing",
    "pp",
    "figure",
    "based",
    "conference",
    "university",
    "table"
  ),
  lexicon = "giscience"
)

all_stop_words <- stop_words %>%
  dplyr::bind_rows(my_stop_words)
suppressWarnings({
  no_numbers <- all_words %>%
    dplyr::filter(is.na(as.numeric(word)))
})

no_stop_words <- no_numbers %>%
  dplyr::anti_join(all_stop_words, by = "word")

total_words <- nrow(no_numbers)
after_cleanup <- nrow(no_stop_words)
```

About `r round((1 - after_cleanup / total_words) * 100)`&nbsp;% of the words are considered stop words.
The following table shows how many non-stop words each conference year has, sorted by number of non-stop words (descending).

```{r stop_words, message=FALSE, warning=FALSE}
nonstopwords_per_year <- no_stop_words %>%
  dplyr::group_by(year) %>%
  dplyr::summarise(words = n()) %>%
  dplyr::arrange(desc(words)) %>%
  dplyr::rename(`non-stop words` = words)

words_per_year <- no_numbers %>%
  dplyr::group_by(year) %>%
  dplyr::summarise(words = n()) %>%
  dplyr::arrange(desc(words)) %>%
  dplyr::rename(`all words` = words)

dplyr::inner_join(nonstopwords_per_year, words_per_year, by = "year") %>%
  dplyr::bind_rows(tibble(year = "Total",
                   `non-stop words` = sum(nonstopwords_per_year$`non-stop words`),
                   `all words` = sum(words_per_year$`all words`))) %>%
  knitr::kable() %>%
  kableExtra::kable_styling("striped", full_width = FALSE) %>%
  kableExtra::row_spec(nrow(nonstopwords_per_year) + 1, bold = TRUE)
```

## Top wordstems and wordstem clouds

```{r params}
# chosen manually
minimum_occurence <- 99
max_words <- 100
```

The following table shows the number of occurrences of the `r max_words` most frequent wordstems across all proceedings.

```{r top_wordstems}
wordstems <- no_stop_words %>%
  dplyr::mutate(wordstem = quanteda::char_wordstem(no_stop_words$word))

countYearsUsingWordstem <- function(the_word) {
  sapply(the_word, function(w) {
    wordstems %>%
      dplyr::filter(wordstem == w) %>%
      dplyr::group_by(year) %>%
      dplyr::count() %>%
      nrow
  })
}

top_wordstems <- wordstems %>%
  dplyr::group_by(wordstem) %>%
  dplyr::tally() %>%
  dplyr::arrange(desc(n)) %>%
  head(n = max_words) %>%
  dplyr::mutate(`years w/ wordstem` = countYearsUsingWordstem(wordstem)) %>%
  tibble::add_column(place = c(1:nrow(.)), .before = 0)

write.csv(top_wordstems, here::here("results/text_analysis_topwordstems.csv"), row.names = FALSE)

top_wordstems %>%
  knitr::kable() %>%
  kableExtra::kable_styling("striped", full_width = FALSE) %>%
  kableExtra::scroll_box(height = "300px")
```

The table above and the clouds below are based on wordstems extracted with a stemming algorithm from package [`quanteda`](https://cran.r-project.org/package=quanteda).
Words must occur at least `r minimum_occurence` times to be included in the cloud.
Each cloud has a maximum of `r max_words` words.
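
The interplay of the two thresholds can be sketched on invented counts: the minimum occurrence filter is applied first, and the result is then capped at the maximum number of words. The sketch uses base R and is not evaluated when the document is rendered; `select_for_cloud`, `toy_counts`, and the numbers in it are hypothetical.

```r
# Invented counts per year and wordstem; not results of the analysis.
toy_counts <- data.frame(
  year = c("2016", "2016", "2016", "2018", "2018"),
  wordstem = c("spatial", "data", "map", "data", "model"),
  n = c(250, 120, 40, 300, 99),
  stringsAsFactors = FALSE
)

# Hypothetical helper: filter by year and minimum count, sort by
# descending count, then keep at most max_words rows.
select_for_cloud <- function(counts, the_year, minimum, max_words) {
  one_year <- counts[counts$year == the_year & counts$n >= minimum, ]
  one_year <- one_year[order(-one_year$n), ]
  head(one_year, n = max_words)
}

# A cap of one word keeps only the most frequent stem of 2016.
select_for_cloud(toy_counts, "2016", minimum = 99, max_words = 1)$wordstem

# A stem with exactly the minimum count (99) is still included.
select_for_cloud(toy_counts, "2018", minimum = 99, max_words = 100)$wordstem
```

A year with few stems above the minimum therefore yields a cloud with fewer words than the cap.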

```{r cloud_wordstems, message=FALSE, warning=FALSE}
cloud_wordstems <- wordstems %>%
  dplyr::group_by(year, wordstem) %>%
  dplyr::tally() %>%
  dplyr::arrange(desc(n))
```

```{r wordclouds_create_plot, message=FALSE, warning=FALSE}
# plot is created to file to fit more words to a specific pixel size
png(filename = here::here("results/text_analysis_wordstemclouds.png"),
    width = 1000,
    height = 1000)

par(mfrow = c(3,3))
for (the_year in names(proceedings)) {
  year_cloud_wordstems <- cloud_wordstems %>%
    dplyr::filter(year == the_year) %>%
    dplyr::filter(n >= minimum_occurence) %>%
    head(n = max_words)
  #cat(str(year_cloud_wordstems))
  
  wordcloud::wordcloud(words = year_cloud_wordstems$wordstem,
                       freq = year_cloud_wordstems$n,
                       min.freq = 1,
                       random.order = FALSE,
                       fixed.asp = FALSE,
                       rot.per = 0,
                       colors = RColorBrewer::brewer.pal(8, "Dark2"))
}
dev.off()

file.copy(from = here::here("results/text_analysis_wordstemclouds.png"),
          to = here::here("docs/text_analysis_wordstemclouds.png"),
          overwrite = TRUE)
```

<!-- path fixed to output in docs/ directory - see Makefile -->
![](text_analysis_wordstemclouds.png)

_`r paste0("Word clouds of full papers per conference year (rowwise, starting top left, from ", head(names(proceedings_files), n = 1), " to ", tail(names(proceedings_files), n = 1), ").")`_

## Reproducible research-related keywordstems in GIScience papers

The following table lists how often wordstems of terms related to reproducible research appear in the proceedings of each year.
The detection matches at word boundaries using the regex anchor `\b`.

- reproduc (`reproduc.*`, i.e. reproducibility, reproducible, reproduce, reproduction)
- replic (`replicat.*`, i.e. replication, replicate)
- repeatab (`repeatab.*`, i.e. repeatability, repeatable)
- software
- (pseudo) code/script(s) [column name _code_]
- algorithm (`algorithm.*`, i.e. algorithms, algorithmic)
- process (`process.*`, i.e. processing, processes, preprocessing)
- data (`data.*`, i.e. dataset(s), database(s))
- result(s) (`results?`)
- repository(ies) (`repositor(y|ies)`)
- collaboration platforms (`git(hub|lab)`)
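
How such word-boundary patterns count occurrences can be tried out on an invented sentence. The sketch below is not evaluated when the document is rendered. It uses a base R helper, `count_matches` (a hypothetical name), as a stand-in for `stringr::str_count()`, and writes the stem patterns with `\\w*`, because a greedy `.*` runs to the end of the line and merges several occurrences on one line into a single match.

```r
# Invented example text; not taken from the proceedings.
toy_text <- "we reproduce results; reproducibility and reproducible code. data, dataset, metadata"

# Hypothetical helper: number of non-overlapping matches of a pattern.
count_matches <- function(text, pattern) {
  lengths(regmatches(text, gregexpr(pattern, text, perl = TRUE)))
}

# Stem followed by word characters only: each occurrence is counted.
count_matches(toy_text, "\\breproduc\\w*\\b")

# "metadata" is not counted, there is no word boundary before "data".
count_matches(toy_text, "\\bdata\\w*\\b")

# Greedy variant: one match that spans the rest of the line.
count_matches(toy_text, "\\breproduc.*\\b")
```

The first call counts `reproduce`, `reproducibility`, and `reproducible` separately, while the greedy variant reports a single match for the same sentence.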

```{r keywords_per_year, warning=FALSE}
tidy_texts_lower <- stringr::str_to_lower(tidy_texts$text)
word_counts <- tibble::tibble(
  year = tidy_texts$year,
  # \\w* (instead of a greedy .*) keeps each match within one word,
  # so several occurrences on the same line are counted separately
  `words` = str_count(tidy_texts_lower, "\\b\\w+\\b"),
  `reproduc..` = str_count(tidy_texts_lower, "\\breproduc\\w*\\b"),
  `replic..` = str_count(tidy_texts_lower, "\\breplicat\\w*\\b"),
  `repeatab..` = str_count(tidy_texts_lower, "\\brepeatab\\w*\\b"),
  `code` = str_count(tidy_texts_lower,
    "(\\bcode\\b|\\bscript\\w*\\b|\\bpseudo code\\b)"),
  software = str_count(tidy_texts_lower, "\\bsoftware\\b"),
  `algorithm(s)` = str_count(tidy_texts_lower, "\\balgorithm\\w*\\b"),
  `(pre)process..` = str_count(tidy_texts_lower,
                "(\\bprocess\\w*\\b|\\bpreprocess\\w*\\b|\\bpre-process\\w*\\b)"),
  `data.*` = str_count(tidy_texts_lower, "\\bdata\\w*\\b"),
  `result(s)` = str_count(tidy_texts_lower, "\\bresults?\\b"),
  `repository/ies` = str_count(tidy_texts_lower, "\\brepositor(y|ies)\\b"),
  #`repos` = str_count(tidy_texts_lower, "\\bzenodo|figshare|osf|dryad\\b"),
  `github/lab` = str_count(tidy_texts_lower, "\\bgit(hub|lab)\\b")
)

# funs() is deprecated in dplyr; pass the summary function directly
word_counts_sums <- rbind(word_counts,
                          word_counts %>% 
                            dplyr::summarise_if(is.numeric, sum) %>%
                            tibble::add_column(year = "Total", .before = 0))

write.csv(word_counts_sums, here::here("results/text_analysis_keywordstems.csv"), row.names = FALSE)

word_counts_sums %>%
  knitr::kable() %>%
  kableExtra::kable_styling("striped", font_size = 10, bootstrap_options = "condensed")  %>%
  kableExtra::row_spec(0, font_size = "x-small", bold = TRUE)  %>%
  kableExtra::row_spec(nrow(word_counts_sums), bold = TRUE)
```

**Note**: The high number for "code" in 2012 is largely due to a single paper about "land use codes".

## Colophon

This document is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).
All contained code is licensed under the [Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/).
This document is versioned in a public [git](https://git-scm.com/) repository, [https://github.com/nuest/reproducible-research-at-giscience](https://github.com/nuest/reproducible-research-at-giscience), and archived on Zenodo at [https://doi.org/10.5281/zenodo.4032875](https://doi.org/10.5281/zenodo.4032875).

**Runtime environment description:**

```{r session_info, echo=FALSE}
devtools::session_info(include_base = TRUE)
```

```{r upload_to_drive, eval=FALSE, include=FALSE}
# upload the HTML and Rmd file to the authoring team's shared folder
library("googledrive")
googledrive::drive_auth(use_oob = TRUE)
googledrive::drive_put("giscience-historic-text-analysis.Rmd", path = "https://drive.google.com/drive/folders/17EUtM_zCx1gQMea1MHN_5XSVrssxv9GA/")
googledrive::drive_put("giscience-historic-text-analysis.html", path = "https://drive.google.com/drive/folders/17EUtM_zCx1gQMea1MHN_5XSVrssxv9GA/")
```