-
Notifications
You must be signed in to change notification settings - Fork 18
/
Copy pathacross_vignette.Rmd
146 lines (120 loc) · 4.67 KB
/
across_vignette.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
---
title: "TidyVerse CREATE"
author: "Kevin Havis"
date: "2024-10-24"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Tidy operations across columns
In this vignette we will walk through a useful function in the `dplyr` package for column-wise operations; `across()`.
```{r}
library(tidyverse)
# Load initial data
df <- as_tibble(trees)
```
`across()` is best paired with `mutate()`. Consider we would like to rank the respective `Girth`, `Height`, and `Volume` of the black cherry trees in the `trees` data.
```{r}
# Without across()
df |>
mutate(
girth_rank = rank(Girth),
height_rank = rank(Height),
vol_rank = rank(Volume)
)
```
This is easy enough but what if we had many columns? Or we did not know how many columns we are going to operate across?
We can use `across()` in conjunction with `mutate()` to perform this operation.
```{r}
# With across()
ranked_trees <- df |>
mutate(
across(
.cols = everything(), # Operate across all columns
.fns = rank, # The function(s) to perform
.names = "{.col}_rank" # How the new columns should be named
)
)
ranked_trees
```
We can also specify what *type* of columns we would like to perform by replacing `everything()` with `starts_with()`. You could also use `is.character` or `is.numerical`, which is useful for things such as string normalization, rounding floats into integers, etc.
```{r}
# Using across for specific columns
df |>
mutate(
across(
.cols = starts_with(c("Height", "Volume")), # Operate on subset of cols
.fns = rank,
.names = "{.col}_rank"
)
)
```
```{r}
# Operate on specific column type
df |>
mutate(
tree = "I'm not a tree..."
) |>
mutate(
across(
.cols = where(is.character), # Select character columns only
.fns = ~ "I'm a tree!" # Replace values with "tree"
)
)
```
```{r}
# Round numeric columns to integers
df |>
mutate(
across(
.cols = where(is.numeric), # Select all numeric columns
.fns = ~ round(.x), # Apply round() to each column
.names = "{.col}_int" # Create new columns with _int suffix
)
)
```
## Extending Tidy Operations Across Columns
In addition to the `across()` function `dplyr` also has the `if_any()` and `if_all()` functions which operate very similarly to `across()`. As demonstrated above, `across()` can be used with the `mutate()` and by extension the `summarise()` and `reframe()` functions. Its behavior does not extend to the `filter()` function however, which is where `if_any()` and `if_all()` can help.
```{r}
# THIS WILL NOT WORK, INCLUDED FOR DEMONSTRATION ONLY
# Uncomment the below code to see an example of the error that occurs when using across() in a filter() statement
#df |>
# filter(
# across(
# .cols = where(is.numeric),
# .fns = ~ .x > 5,
# )
# )
```
The `if_any()` and `if_all()` functions take all the same arguments as `across()` but return a logical vector which allows for filtering. The `if_any()` function will preserve all rows with at least one cell which meets the condition, whereas `if_all()` will only keep rows for which all cells meet the specified condition. Let's look at an example using the tree rankings we generated earlier.
```{r}
ranked_trees |>
filter(
if_any(
.cols = contains("rank"), # Only check the columns with "rank" in their name
.fns = ~ . <= 10 # Fetch the rows which are in the top 10 of ANY measure
)
)
```
Using `if_any()` we are able to get all the trees that are in the top 10 of EITHER `Girth_rank`, `Height_rank`, OR `Volume_rank`. If we perform the exact same operations but substitute `if_all()` for `if_any()` we get a much shorter list containing only the trees which are in the top 10 of ALL ranks.
```{r}
ranked_trees |>
filter(
if_all(
.cols = contains("rank"), # Only check the columns with "rank" in their name
.fns = ~ . <= 10 # Fetch the rows which are in the top 10 of ALL measures
)
)
```
You can still use `if_any()` and `if_all()` within functions like `mutate()` and `summarise()` to perform different operations depending on the values present in a row. In the example below, we add a new column called "size" which has the value "big" if the tree is in the top 10 by all measures, "medium" if it is in the top 10 by at least one measure, and "small" otherwise.
```{r}
ranked_trees |>
mutate(
size = case_when(
if_all(contains("rank"), ~ . <= 10) ~ "big",
if_any(contains("rank"), ~ . <= 10) ~ "medium",
TRUE ~ "small"
)
)
```