
# Week 1: Retrieving and analyzing text

Our first task when conducting large-scale text analyses is gathering and curating the text itself. This is the focus of the chapters by @manning_introduction_2007 listed below, which introduce different ways in which we can reformat and 'query' text data in order to begin asking questions of it. In computer science and natural language processing, this is often referred to as "information retrieval", and it is the foundation of many search processes, including web search.
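To make the idea of reformatting text for querying more concrete, here is a minimal sketch of one common retrieval structure, an inverted index, together with a simple Boolean AND query. It uses base R only. The three toy documents and the helper names (`tokenize`, `build_index`, `query_and`) are invented for illustration and are not taken from the readings; real systems add much more (normalization, stemming, ranking), so treat this as a sketch under those assumptions rather than a definitive implementation.

```r
# Toy corpus: three hypothetical documents (invented for illustration).
docs <- c(
  d1 = "Caesar praised Brutus",
  d2 = "Brutus killed Caesar",
  d3 = "Calpurnia warned Caesar"
)

# Tokenize: lowercase, then split on anything that is not a letter.
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens[tokens != ""]
}

# Build the inverted index: a named list mapping each term to the
# sorted IDs of the documents that contain it.
build_index <- function(docs) {
  tokens <- lapply(docs, tokenize)
  pairs <- data.frame(
    term = unlist(tokens, use.names = FALSE),
    doc = rep(names(docs), lengths(tokens)),
    stringsAsFactors = FALSE
  )
  lapply(split(pairs$doc, pairs$term), function(ids) sort(unique(ids)))
}

# Boolean AND query: intersect the document lists of all query terms.
# Terms that do not occur in the index match no documents.
query_and <- function(index, terms) {
  postings <- lapply(tolower(terms), function(t) {
    if (is.null(index[[t]])) character(0) else index[[t]]
  })
  Reduce(intersect, postings)
}

index <- build_index(docs)
query_and(index, c("brutus", "caesar"))  # returns "d1" "d2"
```

The design choice to note is that the index is built once, term by term, so that answering a query only requires looking up and intersecting short lists of document IDs instead of rescanning every text.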

The articles by @tatman_gender_2017 and @pechenick_characterizing_2015 will be the focus of our seminar (Q&A). These articles will get us thinking about the fundamentals of text discovery and sampling. When reading the articles we should think about where we are locating our texts, how we are sampling them, what biases might inhere in this sampling process, and what these texts *represent*; i.e., the population or phenomenon of interest about which they might allow us to make inferences.

Questions for seminar:

1. Where do we access text? What do we need to consider when doing so?
2. How do we sample texts?
3. What biases do we need to keep in mind?

**Required reading**:

- @tatman_gender_2017
- @pechenick_characterizing_2015

- @manning_introduction_2007 (chs. 1 and 10): [https://nlp.stanford.edu/IR-book/information-retrieval-book.html](https://nlp.stanford.edu/IR-book/information-retrieval-book.html)
- @krippendorff_content_2004 (ch. 6)

**Further reading**:

- @olteanu_social_2019
- @biber_using_1993
- @barbera_understanding_2015

**Slides**:

- Week 1 [Slides](https://docs.google.com/presentation/d/1TljlFQwyY8xoa5qr5R3EBasc0U-ixoUJx_D0RieJs_o/edit?usp=sharing)