Last minute changes to Functions #234

Merged 4 commits on Oct 10, 2024
105 changes: 52 additions & 53 deletions modules/Functions/Functions.Rmd
@@ -11,7 +11,6 @@ library(dplyr)
library(knitr)
library(stringr)
library(tidyr)
library(emo)
library(readr)
opts_chunk$set(comment = "")
```
@@ -192,7 +191,7 @@ loud(word = "hooray!")
<!-- ``` -->


## Functions for tibbles - curly braces{.codesmall}
## Functions for tibbles - curly braces

```{r}
# get means and missing for a specific column
@@ -203,23 +202,32 @@ get_summary <- function(dataset, col_name) {
}
```

Examples:
## Functions for tibbles - example{.codesmall}

```{r}
```{r message = FALSE}
er <- read_csv(file = "https://daseh.org/data/CO_ER_heat_visits.csv")
```

```{r}
get_summary(er, visits)
```

```{r message = FALSE}
yearly_co2 <-
read_csv(file = "https://daseh.org/data/Yearly_CO2_Emissions_1000_tonnes.csv")
```

yearly_co2 <- read_csv(file = "https://daseh.org/data/Yearly_CO2_Emissions_1000_tonnes.csv")
```{r}
get_summary(yearly_co2, `2014`)
```

## Summary

- Simple functions take the form:
- `NEW_FUNCTION <- function(x, y){x + y}`
- Can specify defaults like `function(x = 1, y = 2){x + y}`
-`return` will provide a value as output
- `print` will simply print the value on the screen but not save it
- Can specify defaults like `function(x = 1, y = 2){x + y}`
- `return` will provide a value as output
- Specify a column (from a tibble) inside a function using `{{double curly braces}}`
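
As a hedged illustration of the last bullet, here is a minimal sketch of a tibble function that uses `{{ }}`. The names `toy` and `col_mean` are made up for this example and are not part of the module; the body of the module's own `get_summary()` is not shown in this diff, so this is only one plausible shape for such a function.

```r
library(dplyr)

# Hypothetical toy data (assumption: not one of the course datasets)
toy <- tibble(
  site   = c("a", "b", "c", "d"),
  visits = c(10, NA, 30, 20)
)

# {{ col_name }} lets the caller pass a bare (unquoted) column name
col_mean <- function(dataset, col_name) {
  dataset %>%
    summarize(
      mean      = mean({{ col_name }}, na.rm = TRUE),
      n_missing = sum(is.na({{ col_name }}))
    )
}

col_mean(toy, visits)
```

Without the double curly braces, `summarize()` would look for a column literally named `col_name` instead of the column the caller supplied.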


## Lab Part 1
@@ -245,7 +253,7 @@ sapply(<a vector, list, data frame>, some_function)

Let's apply a function to look at the CO heat-related ER visits dataset.

`r emo::ji("rotating_light")` There are no parentheses on the functions! `r emo::ji("rotating_light")`
🚨There are no parentheses on the functions!🚨

You can also pipe into your function.
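
A small sketch of the "no parentheses" point, using a made-up data frame `df` (an assumption for illustration, not a course dataset): the function is passed by name, and any extra arguments follow it.

```r
library(dplyr)  # loaded only for the %>% pipe

# Hypothetical data for illustration
df <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

# Pass the function itself: `mean`, not `mean()`
col_means <- sapply(df, mean)
col_means

# Extra arguments to the function go after its name
sapply(df, mean, na.rm = TRUE)

# Piping: the data frame becomes the first argument of sapply()
df %>% sapply(mean)
```

Writing `sapply(df, mean())` would instead try to call `mean()` immediately with no input, which fails before `sapply()` ever runs.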

@@ -357,7 +365,6 @@ er %>%
))
```


## Applying functions with `across` from `dplyr`

Using different `tidyselect()` options (e.g., `starts_with()`, `ends_with()`, `contains()`)
@@ -368,20 +375,6 @@ er %>%
summarize(across(contains("cl"), mean, na.rm=T))
```
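
For comparison, a hedged sketch of the same pattern written with an anonymous function. To my knowledge, dplyr 1.1.0 and later deprecate passing extra arguments such as `na.rm` through `across()`'s `...`, so wrapping the call may be the more future-proof form. The tibble `toy` and its columns are invented for this example.

```r
library(dplyr)

# Hypothetical data: two columns whose names contain "cl"
toy <- tibble(
  id   = 1:3,
  cl_a = c(1, NA, 3),
  cl_b = c(4, 5, 6)
)

# The anonymous function carries na.rm itself, so nothing goes through `...`
cl_means <- toy %>%
  summarize(across(contains("cl"), function(x) mean(x, na.rm = TRUE)))

cl_means
```

The selection helper (`contains("cl")`) decides which columns are touched; the function decides what happens to each one.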


<!-- ## Applying functions with `across` from `dplyr`{.codesmall} -->

<!-- `mutate()` across to round across many columns at once! -->

<!-- ```{r} -->
<!-- calenviroscreen %>% -->
<!-- mutate(across( -->
<!-- where(is.numeric), -->
<!-- function(x) round(x, digits = 0) -->
<!-- )) %>% select(7:13) -->
<!-- ``` -->


## Applying functions with `across` from `dplyr` {.smaller}

Combining with `mutate()` - the `replace_na` function
@@ -401,29 +394,15 @@ yearly_co2 %>%
))
```

## GUT CHECK!

<!-- ## Use custom functions within `mutate` and `across` -->
Why use `across()`?

<!-- If your function needs to span more than one line, better to define it first before using inside `mutate()` and `across()`. -->
A. Efficiency - faster and less repetitive

<!-- ```{r} -->
<!-- times1000 <- function(x) x * 1000 -->

<!-- airquality %>% -->
<!-- mutate(across( -->
<!-- everything(), -->
<!-- .fns = times1000 -->
<!-- )) %>% -->
<!-- head(n = 2) -->

<!-- airquality %>% -->
<!-- mutate(across( -->
<!-- everything(), -->
<!-- .fns = function(x) x * 1000 -->
<!-- )) %>% -->
<!-- head(n = 2) -->
<!-- ``` -->
B. Calculate the cross product

C. Connect across datasets

## `purrr` package

@@ -433,22 +412,29 @@ While we won't get into `purrr` too much in this class, its a handy package for

# Multiple Data Frames

## Multiple data frames {.smaller}
## Multiple data frames

Lists help us work with multiple data frames
Lists help us work with multiple tibbles / data frames

```{r}
AQ_list <- list(AQ1 = airquality, AQ2 = airquality, AQ3 = airquality)
str(AQ_list)
df_list <- list(AQ = airquality, er = er, yearly_co2 = yearly_co2)
```

<br>

`select()` from each tibble the numeric columns:

```{r}
df_list <-
df_list %>%
sapply(function(x) select(x, where(is.numeric)))
```

## Multiple data frames: `sapply`
## Multiple data frames: `sapply` {.smaller}

```{r}
AQ_list %>% sapply(class)
AQ_list %>% sapply(nrow)
AQ_list %>% sapply(colMeans, na.rm = TRUE)
df_list %>% sapply(nrow)
df_list %>% sapply(colMeans, na.rm = TRUE)
```
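
One caveat worth sketching: `sapply()` tries to simplify its result, so when every element happens to have the same length it may return a matrix rather than a list of data frames. A hedged alternative, assuming only built-in datasets (the list `dfs` below is made up and differs from the `df_list` in the slides), is to use `lapply()` for the step that must return data frames and keep `sapply()` for the one-value-per-element summaries.

```r
library(dplyr)

# Hypothetical list built from datasets that ship with R
dfs <- list(AQ = airquality, cars = mtcars, iris = iris)

# lapply() always returns a list, so each element stays a data frame
numeric_only <- lapply(dfs, function(x) select(x, where(is.numeric)))

# sapply() is a good fit when each element yields a single value
row_counts <- sapply(numeric_only, nrow)
col_counts <- sapply(numeric_only, ncol)

row_counts
col_counts
```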


@@ -457,7 +443,7 @@ AQ_list %>% sapply(colMeans, na.rm = TRUE)
- Apply your functions with `sapply(<a vector or list>, some_function)`
- Use `across()` to apply functions across multiple columns of data
- Need to use `across` within `summarize()` or `mutate()`
- Can use `sapply` or `purrr` to work with multiple data frames within lists simultaneously
- Can use `sapply` (or `purrr` package) to work with multiple data frames within lists simultaneously


## Lab Part 2
@@ -466,7 +452,20 @@ AQ_list %>% sapply(colMeans, na.rm = TRUE)

💻 [Lab](https://daseh.org/modules/Functions/lab/Functions_Lab.Rmd)

```{r, fig.alt="The End", out.width = "50%", echo = FALSE, fig.align='center'}
📃 [Day 9 Cheatsheet](https://daseh.org/modules/cheatsheets/Day-9.pdf)

📃 [Posit's purrr Cheatsheet](https://rstudio.github.io/cheatsheets/purrr.pdf)

## Research Survey

<br>

https://forms.gle/jVue79CjgoMmbVbg9

<br>
<br>

```{r, fig.alt="The End", out.width = "30%", echo = FALSE, fig.align='center'}
knitr::include_graphics(here::here("images/the-end-g23b994289_1280.jpg"))
```

116 changes: 73 additions & 43 deletions modules/Functions/lab/Functions_Lab_Key.Rmd
@@ -11,29 +11,21 @@ knitr::opts_chunk$set(echo = TRUE)

# Part 1

Load all the libraries we will use in this lab.
Load the `tidyverse` package.

```{r message=FALSE}
library(tidyverse)
```

### 1.1

Create a function that takes one argument, a vector, and returns the sum of the vector and then squares the result. Call it "sum_squared". Test your function on the vector `c(2,7,21,30,90)` - you should get the answer 22500.
Create a function that:

```
# General format
NEW_FUNCTION <- function(x, y) x + y
```
or

```
# General format
NEW_FUNCTION <- function(x, y){
result <- x + y
return(result)
}
```
* Takes one argument, a vector.
* Returns the sum of the vector and then squares the result.
* Call it "sum_squared".
* Test your function on the vector `c(2,7,21,30,90)` - you should get the answer 22500.
* Format is `NEW_FUNCTION <- function(x, y) x + y`

```{r 1.1response}
nums <- c(2, 7, 21, 30, 90)
@@ -50,7 +42,12 @@ sum_squared(x = nums)

### 1.2

Create a function that takes two arguments, (1) a vector and (2) a numeric value. This function tests whether the number (2) is contained within the vector (1). **Hint**: use `%in%`. Call it `has_n`. Test your function on the vector `c(2,7,21,30,90)` and number `21` - you should get the answer TRUE.
Create a function that:

* takes two arguments, (1) a vector and (2) a numeric value.
* This function tests whether the number (2) is contained within the vector (1). **Hint**: use `%in%`.
* Call it `has_n`.
* Test your function on the vector `c(2,7,21,30,90)` and number `21` - you should get the answer TRUE.
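
As a quick reminder of how the hinted operator behaves on its own (a sketch only, not the key's answer), `%in%` returns `TRUE` or `FALSE` depending on whether the left-hand value appears in the right-hand vector:

```r
nums <- c(2, 7, 21, 30, 90)

# Is the value on the left found anywhere in the vector on the right?
found_21 <- 21 %in% nums
found_11 <- 11 %in% nums

found_21
found_11
```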

```{r 1.2response}
nums <- c(2, 7, 21, 30, 90)
@@ -74,11 +71,24 @@ has_n(x = nums)

### P.1

Create a new number `b_num` that is not contained with `nums`. Use your updated `has_n` function with the default value and add `b_num` as the `n` argument when calling the function. What is the outcome?
Create a function for the CalEnviroScreen Data.

* Read in (https://daseh.org/data/CalEnviroScreen_data.csv)
* The function takes an argument for a column name. (use `{{col_name}}`)
* The function creates a ggplot with `{{col_name}}` on the x-axis and `Poverty` on the y-axis.
* Use `geom_point()`
* Test the function using the `Lead` column and `HousingBurden` columns, or other columns of your choice.

```{r P.1response}
b_num <- 11
has_n(x = nums, n = b_num)
ces <- read_csv("https://daseh.org/data/CalEnviroScreen_data.csv")

plot_ces <- function(col_name){
ggplot(data = ces, aes(x = {{col_name}}, y = Poverty)) +
geom_point()
}

plot_ces(Lead)
plot_ces(HousingBurden)
```


@@ -96,7 +106,12 @@ ces <- read_csv("https://daseh.org/data/CalEnviroScreen_data.csv")

### 2.2

We want to get some summary statistics on water contamination. Use `across` inside `summarize` to get the sum total variable containing the string "water" AND ending with "Pctl". **Hint**: use `contains()` AND `ends_with()` to select the right columns inside `across`. Remember that `NA` values can influence calculations.
We want to get some summary statistics on water contamination.

* Use `across` inside `summarize`.
* Choose columns about "water". **Hint**: use `contains("water")` inside `across`.
* Use `mean` as the function inside of `across`.
* Remember that `NA` values can influence calculations.

```
# General format
@@ -110,19 +125,26 @@ data %>%
```{r 2.2response}
ces %>%
summarize(across(
contains("Water") & ends_with("Pctl"),
sum
contains("water"),
mean
))

# Accounting for NA
ces %>%
summarize(across(
contains("Water") & ends_with("Pctl"),
function(x) sum(x, na.rm = T)
contains("water"),
function(x) mean(x, na.rm = T)
))
```

### 2.3

Use `across` and `mutate` to convert all columns containing the word "water" into proportions (i.e., divide that value by 100). **Hint**: use `contains()` to select the right columns within `across()`. Use an anonymous function ("function on the fly") to divide by 100 (`function(x) x / 100`). It will also be easier to check your work if you `select()` columns that match "Pctl".
Convert all columns that are percentiles into proportions.

* Use `across` and `mutate`
* Choose columns that contain "Pctl" in the name. **Hint**: use `contains("Pctl")` inside `across`.
* Use an anonymous function ("function on the fly") to divide by 100 (`function(x) x / 100`).
* Check your work - It will also be easier if you `select(contains("Pctl"))`.

```
# General format
@@ -136,7 +158,7 @@ data %>%
```{r 2.3response}
ces %>%
mutate(across(
contains("water"),
contains("Pctl"),
function(x) x / 100
)) %>%
select(contains("Pctl"))
@@ -149,42 +171,50 @@ ces %>%

Use `across` and `mutate` to convert all columns starting with the string "PM" into a binary variable: TRUE if the value is greater than 10 and FALSE if less than or equal to 10.

- **Hint**: use `starts_with()` to select the columns that start with "PM".
- Use an anonymous function ("function on the fly") to do a logical test if the value is greater than 10.
- A logical test with `mutate` will automatically fill a column with TRUE/FALSE.
* **Hint**: use `starts_with()` to select the columns that start with "PM".
* Use an anonymous function ("function on the fly") to do a logical test if the value is greater than 10.
* A logical test with `mutate` (x > 10) will automatically fill a column with TRUE/FALSE.

```{r P.2response}
ces %>%
mutate(across(
starts_with("PM"),
function(x) x > 10
))
)) %>%
glimpse() # add glimpse to view the changes
```

### P.3

Take your code from previous question and assign it to the variable `ces_dat`.

- Use `filter()` to drop any rows where "Oakland" appears in `ApproxLocation`. Make sure to reassign this to `ces_dat`.
- Create a ggplot boxplot (`geom_boxplot()`) where (1) the x-axis is `Asthma` and (2) the y-axis is `PM2.5`.
- You change the `labs()` layer so that the x-axis is "ER Visits for Asthma: PM2.5 greater than 10"
- Create a ggplot where the x-axis is `Asthma` and the y-axis is `PM2.5`.
- Add a boxplot (`geom_boxplot()`)
- Change the `labs()` layer so that the x-axis is "ER Visits for Asthma: PM2.5 greater than 10"

```{r P.3response}
ces_dat <-
ces %>%
mutate(across(
starts_with("PM"),
function(x) x > 10
)) %>%
filter(ApproxLocation != "Oakland")

ces_boxplot <- function(df) {
ggplot(df) +
geom_boxplot(aes(
x = `Asthma`,
y = `PM2.5`
)) +
))

ggplot(data = ces_dat, aes(x = `Asthma`, y = `PM2.5`)) +
geom_boxplot() +
labs(x = "ER Visits for Asthma: PM2.5 greater than 10")

# Make everything a function if you like!
ces_boxplot <- function() {
ces %>%
mutate(across(
starts_with("PM"),
function(x) x > 10
)) %>%
ggplot(aes(x = `Asthma`, y = `PM2.5`)) +
geom_boxplot() +
labs(x = "ER Visits for Asthma: PM2.5 greater than 10")
}
ces_boxplot(ces_dat)

ces_boxplot()
```