-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathp_untangled_query_rows.qmd
397 lines (249 loc) · 11.5 KB
/
p_untangled_query_rows.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
---
title: 'Querying rows'
---
```{python}
# | echo: false
# Setup
import pandas as pd
pd.options.display.max_rows = 5
```
## Intro
Querying rows is one of the most frequently used operations in data analysis. It allows you to filter your dataset to focus on specific subsets of interest, enabling more targeted and efficient analysis.
In this lesson, we'll explore various techniques to subset rows in pandas.
Let's get started!
## Learning objectives
1. You can use the `query()` method to keep or drop rows from a DataFrame.
2. You can specify conditions using relational operators like greater than (`>`), less than (`<`), equal to (`==`), not equal to (`!=`), and is an element of (`isin()`).
3. You can combine conditions with `&` and `|`.
4. You can negate conditions with `~`.
5. You can use the `isna()` and `notna()` methods.
6. You can query based on string patterns using `str.contains()`.
## The Yaounde COVID-19 dataset
In this lesson, we will again use the data from the COVID-19 serological survey conducted in Yaounde, Cameroon.
You can download the dataset from this link: [yaounde_data.csv](https://raw.githubusercontent.com/the-graph-courses/idap_book/main/data/yaounde_data.csv)
You can find out more about this dataset here: https://www.nature.com/articles/s41467-021-25946-0
Let's load the data into a pandas DataFrame.
```{python}
import pandas as pd
yaounde = pd.read_csv("data/yaounde_data.csv")
# a smaller subset of variables
yao = yaounde[
[
"age",
"sex",
"weight_kg",
"neighborhood",
"occupation",
"symptoms",
"is_smoker",
"is_pregnant",
"igg_result",
"igm_result",
]
]
yao.head()
```
## Introducing `query()`
We can use the `query()` method to keep rows that satisfy a set of conditions. Let's take a look at a simple example. If we want to keep just the male records, we run:
```{python}
yao.query('sex == "Male"')
```
As you can see, the `query()` syntax is quite simple. (It may be a bit surprising to have to put code in quotes, but it is quite readable.)
Note the use of double equals (`==`) instead of single equals (`=`) there. The `==` sign tests for equality, while the single equals sign assigns a value. This is a common source of errors when you are a beginner, so watch out for it.
We can chain `query()` with `shape[0]` to count the number of male respondents.
```{python}
yao.query('sex == "Male"').shape[0]
```
::: {.callout-note title="Reminder"}
The `shape` property returns the number of rows and columns in a DataFrame. The first element, `shape[0]`, is the number of rows, and the second element, `shape[1]`, is the number of columns.
For example:
```{python}
yao.shape
```
```{python}
yao.shape[0] # rows
```
```{python}
yao.shape[1] # columns
```
:::
::: {.callout-note title="Key Point"}
Note that these subsets are not modifying the DataFrame itself. If we want a modified version, we create a new DataFrame to store the subset. For example, below we create a subset of male respondents:
```{python}
yao_male = yao.query('sex == "Male"')
yao_male
```
But for ease of explanation, in the examples below, we are simply printing the result, without storing it in a variable.
:::
::: {.callout-tip title="Practice"}
### Practice Q: Subset for Pregnant Respondents
Subset the `yao` data frame to respondents who were pregnant during the survey (The `is_pregnant` column contains "Yes", "No" or NaN). Assign the result to a new DataFrame called `yao_pregnant`. Then print this new DataFrame. There should be 24 rows.
```{python}
# Your code here
```
:::
## Relational operators
The `==` operator introduced above is an example of a "relational" operator, as it tests the relation between two values. Here is a list of some more of these operators. You will use these often when you are querying rows in your data.
| | |
|:-------------|:------------------------------------|
| **Operator** | **is True if** |
| A == B | A is **equal** to B |
| A != B | A is **not equal** to B |
| A < B | A is **less than** B |
| A <= B | A is **less than or equal** to B |
| A > B | A is **greater than** B |
| A >= B | A is **greater than or equal to** B |
| A.isin([B]) | A **is an element of** B |
Let's see how to use these with `query()`:
```{python}
yao.query('sex == "Female"') # keep rows where `sex` is female
yao.query('sex != "Male"') # keep rows where `sex` is not "Male"
yao.query("age < 6") # keep respondents under 6
yao.query("age >= 70") # keep respondents aged at least 70
# keep respondents whose neighbourhood is "Tsinga" or "Messa"
yao.query('neighborhood.isin(["Tsinga", "Messa"])')
```
::: {.callout-tip title="Practice"}
### Practice Q: Subset for Children
- From `yao`, keep only respondents who were children (under 18). Assign the result to a new DataFrame called `yao_children`. There should be 291 rows.
```{python}
# Your code here
```
:::
::: {.callout-tip title="Practice"}
### Practice Q: Subset for Tsinga and Messa
- With `isin()`, keep only respondents who live in the "Carriere" or "Ekoudou" neighborhoods. Assign the result to a new DataFrame called `yao_carriere_ekoudou`. There should be 426 rows.
```{python}
# Your code here
```
:::
## Accessing external variables in `query()`
The `query()` method allows you to access variables outside the DataFrame using the `@` symbol. This is useful when you want to use dynamic values in your query conditions.
For example, say you have the variable `min_age` that you want to use in your query. You can do this as follows:
```{python}
min_age = 25
# Query using external variables
yao.query('age >= @min_age')
```
This feature is helpful when you need to filter data based on values that may change or are determined at runtime.
::: {.callout-tip title="Practice"}
### Practice Q: Subset for Young Respondents
- From `yao`, keep respondents who are less than or equal to the variable `max_age`, defined below. Assign the result to a new DataFrame called `yao_young`. There should be 590 rows.
```{python}
max_age = 30
# Your code here
```
:::
## Combining conditions with `&` and `|`
We can pass multiple conditions to `query()` using `&` (the "ampersand" symbol) for AND and `|` (the "vertical bar" or "pipe" symbol) for OR.
For example, to keep respondents who are either younger than 18 OR older than 65, we can write:
```{python}
yao.query("age < 18 | age > 65")
```
To keep respondents who are pregnant and are ex-smokers, we write:
```{python}
yao.query('is_pregnant == "Yes" & is_smoker == "Ex-smoker"')
```
To keep all respondents who are pregnant or ex-smokers, we write:
```{python}
yao.query('is_pregnant == "Yes" | is_smoker == "Ex-smoker"')
```
::: {.callout-note title="Side note"}
To get the unique values in a column, you can use the `value_counts()` method.
```{python}
yao.is_smoker.value_counts()
```
:::
::: {.callout-tip title="Practice"}
### Practice Q: Subset for IgG Positive Men
Subset `yao` to only keep men who tested IgG positive. Assign the result to a new DataFrame called `yao_igg_positive_men`. There should be 148 rows after your query. Think carefully about whether to use `&` or `|`.
```{python}
# Your code here
```
:::
## Negating conditions with the `~` operator
To negate conditions in `query()`, we use the `~` operator (pronounced "tilde").
Let's use this to drop respondents who are students:
```{python}
yao.query('~ (occupation == "Student")')
```
Notice that we have to enclose the condition in parentheses.
We can also enclose multiple conditions in parentheses.
Imagine we want to give out a drug, but since it is a strong drug, we don't want children or lightweight (under 30kg) respondents to take it. First, we can write a query to select the children and these light respondents:
```{python}
yao.query("age < 18 | weight_kg < 30")
```
Now to drop these individuals, we can negate the condition with `~`:
```{python}
yao.query("~ (age < 18 | weight_kg < 30)")
```
This could also be written as:
```{python}
yao.query("age >= 18 & weight_kg >= 30")
```
But sometimes negated conditions are easier to read.
::: {.callout-tip title="Practice"}
### Practice Q: Drop Smokers and drop those over 50
We want to avoid giving a drug to older individuals and smokers. From `yao`, drop respondents that are either above 50 or who are smokers. Use `~` to negate the conditions. Assign the result to a new DataFrame called `yao_dropped`. Your output should have 810 rows.
```{python}
# Your code here
```
:::
## `NaN` values
The relational operators introduced so far do not work with null values like `NaN`.
For example, the `is_pregnant` column contains (NA) values for men. To keep the rows with missing `is_pregnant` values, we could try writing:
```{python}
# |eval: false
yao.query("is_pregnant == NaN") # does not work
```
But this will not work. This is because `NaN` is a non-existent value. So the system cannot evaluate whether it is “equal to” or “not equal to” anything.
Instead, we can use the `isna()` method to select rows with missing values:
```{python}
yao.query("is_pregnant.isna()")
```
Or we can select rows that are not missing with `notna()`:
```{python}
yao.query("is_pregnant.notna()")
```
::: {.callout-tip title="Practice"}
### Practice Q: Keep Missing Smoking Status
From the `yao` dataset, keep all the respondents who had NA records for the report of their smoking status.
```{python}
# Your code here
```
:::
## Querying Based on String Patterns
Sometimes, we need to filter our data based on whether a string column contains a certain substring. This is particularly useful when dealing with multi-answer type variables, where responses may contain multiple values separated by delimiters. Let's explore this using the `occupation` column in our dataset.
First, let's take a look at the unique values in the `occupation` column:
```{python}
yao.occupation.value_counts().to_dict()
```
As we can see, some respondents have multiple occupations, separated by "--". To query based on string containment, we can use the `str.contains()` method within our `query()`.
### Basic String Containment
To find all respondents who are students (either solely or in combination with other occupations), we can use:
```{python}
yao.query("occupation.str.contains('Student')")
```
This query will return all rows where the `occupation` column contains the word "Student", regardless of whether it's the only occupation or part of a multiple-occupation entry.
### Negating String Containment
To find respondents who are not students (i.e., their occupation does not contain "Student"), you can use the `~` operator:
```{python}
yao.query("~occupation.str.contains('Student')")
```
### Using `|` with string containment
To find respondents who are students or farmers, we can use:
```{python}
yao.query("occupation.str.contains('Student|Farmer')")
```
::: {.callout-tip title="Practice"}
### Practice Q: Symptoms
The symptoms column contains a list of symptoms that respondents reported.
Query `yao` to find respondents who reported "Cough" or "Fever" as symptoms. Your answer should have 219 rows.
```{python}
# Your code here
```
:::
## Wrap up
Great job! You've learned how to select specific columns and filter rows based on various conditions.
These skills allow you to focus on relevant data and create targeted subsets for analysis.
Next, we'll explore how to modify and transform your data, further expanding your data wrangling toolkit. See you in the next lesson!