# Week 2: Tokenization and word frequencies

When approaching large-scale quantitative analyses of text, a key task is identifying and capturing the unit of analysis. One of the most commonly used approaches, across diverse analytical contexts, is text tokenization: splitting the text into word units such as unigrams, bigrams, and trigrams.
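To make the idea of word units concrete, here is a minimal sketch of tokenizing a short string and counting unigrams and bigrams. The example sentence, the regular expression, and the helper names `tokenize` and `ngrams` are assumptions chosen for illustration; real projects typically rely on a dedicated text-analysis library with more careful rules.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    This is a deliberately simple rule (runs of letters, digits, and
    apostrophes); it is an illustrative assumption, not a standard.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical example text, invented for this sketch.
text = "The cat sat on the mat. The cat slept."

tokens = tokenize(text)
unigram_counts = Counter(tokens)
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(unigram_counts.most_common(2))
print(bigram_counts.most_common(1))
```

Note that this sketch ignores sentence boundaries, so it produces the bigram `("mat", "the")` across the full stop. Whether such cross-sentence n-grams should count is exactly the kind of decision that shapes what a token count captures.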

The chapters by @manning_introduction_2007, listed below, provide a technical introduction to the task of "querying" text according to different word-based queries. This is a task we will be studying in the hands-on assignment for this week. 

For the seminar discussion, we will focus on some widely cited examples of research in the applied social sciences that employ token-based, or word frequency, analyses of large corpora. The first, by @michel_quantitative_2011, uses the enormous Google Books corpus to measure cultural and linguistic trends. The second, by @bollen_historical_2021, uses the same corpus to demonstrate a more specific change over time: so-called "cognitive distortion." In both examples, we should be attentive to the questions of sampling covered in previous weeks. These questions are central to the exchanges in the short responses and replies to the articles by @michel_quantitative_2011 and @bollen_historical_2021.
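One common way to turn token counts into a trend over time is to divide a term's count in each year by the total number of tokens observed in that year. The sketch below illustrates that normalization on a toy corpus. The documents, years, and the function name `relative_frequency_by_year` are invented for illustration and are not taken from either article.

```python
from collections import Counter

# Hypothetical toy corpus of (year, text) pairs, invented for this sketch.
docs = [
    (1900, "war and peace and war"),
    (1900, "peace at last"),
    (1950, "war war war and more war"),
]


def relative_frequency_by_year(docs, term):
    """Return {year: occurrences of term / total tokens in that year}."""
    term_counts = Counter()
    total_counts = Counter()
    for year, text in docs:
        # Whitespace tokenization is a simplifying assumption.
        tokens = text.lower().split()
        term_counts[year] += tokens.count(term)
        total_counts[year] += len(tokens)
    return {
        year: term_counts[year] / total_counts[year]
        for year in sorted(total_counts)
    }


freq = relative_frequency_by_year(docs, "war")
for year, share in freq.items():
    print(year, round(share, 3))
```

Because the denominator is whatever happens to be in the corpus for each year, a change in the mix of texts sampled can shift these shares even when underlying usage is stable, which is why the sampling questions matter for this kind of measure.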

Questions:

1. Tokenizing and counting: what does this capture?
2. Corpus-based sampling: what biases might threaten inference?
3. If you had to write a critique of either @michel_quantitative_2011 or @bollen_historical_2021, what would it focus on?

**Required reading**:

- @michel_quantitative_2011
  - @schwartz_culturomics_2011
  - @morse-gagne_culturomics_2011
  - @aiden_culturomicsresponse_2011
  
- @bollen_historical_2021
  - @schmidt_uncontrolled_2021
  - @bollen_reply_2021
  
- @manning_introduction_2007 (ch. 2): [https://nlp.stanford.edu/IR-book/information-retrieval-book.html](https://nlp.stanford.edu/IR-book/information-retrieval-book.html)
- @krippendorff_content_2004 (ch. 5)

**Further reading**:

- @rozado_prevalence_2021
- @alshaabi_storywrangler_2021
- @campos_survey_2015
- @greenfield_changing_2013

**Slides**:

- Week 2 [Slides](https://docs.google.com/presentation/d/1EB8l2R3aDnfabpx23qKq-dH-6HehfgDqC9n1Pc90jN8/edit?usp=sharing)