---
title: "TidyVerse CREATE"
author: "Kevin Havis"
date: "2024-10-24"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Tidy operations across columns

In this vignette we walk through a useful function in the `dplyr` package for column-wise operations: `across()`.

```{r}
library(tidyverse)

# Load initial data

df <- as_tibble(trees)
```

`across()` pairs naturally with `mutate()`. Suppose we would like to rank the `Girth`, `Height`, and `Volume` of the black cherry trees in the `trees` data.

```{r}
# Without across()
df |> 
  mutate(
    girth_rank = rank(Girth),
    height_rank = rank(Height),
    vol_rank = rank(Volume)
  )
```

This is easy enough, but what if we had many columns, or did not know in advance how many columns we would need to operate on?

We can use `across()` in conjunction with `mutate()` to perform this operation.

```{r}
# With across()
ranked_trees <- df |> 
  mutate(
    across(
      .cols = everything(),    # Operate across all columns
      .fns = rank,             # The function(s) to perform
      .names = "{.col}_rank"   # How the new columns should be named
    )
  )
ranked_trees
```
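
The `.fns` argument also accepts a named list when more than one function should be applied. The chunk below is a minimal sketch of that pattern: the list names `rank` and `pct` are labels chosen only for this illustration, and they fill the `{.fn}` placeholder in `.names`. It rebuilds the tibble from `trees` so that it runs on its own.

```{r}
library(dplyr)

# Apply two ranking functions to every column of `trees` in one call.
# The list names ("rank", "pct") are illustrative labels, not required names;
# they are substituted for {.fn} when the new columns are named.
multi_ranked <- as_tibble(trees) |>
  mutate(
    across(
      .cols = everything(),
      .fns = list(rank = rank, pct = percent_rank),
      .names = "{.col}_{.fn}"
    )
  )
multi_ranked
```

Each original column gains one new column per function, for example `Girth_rank` and `Girth_pct`.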

We can also restrict the operation to a subset of columns. Replacing `everything()` with a name-based helper such as `starts_with()` selects columns by name, while `where(is.character)` or `where(is.numeric)` selects them by *type*, which is useful for things such as string normalization or rounding floats to whole numbers.

```{r}
# Using across for specific columns
df |> 
  mutate(
    across(
      .cols = starts_with(c("Height", "Volume")),    # Operate on subset of cols
      .fns = rank,                       
      .names = "{.col}_rank"            
    )
  )
```

```{r}
# Operate on specific column type
df |> 
  mutate(
    tree = "I'm not a tree..."
  ) |> 
  mutate(
    across(
      .cols = where(is.character),  # Select character columns only
      .fns = ~ "I'm a tree!"        # Replace every value with "I'm a tree!"
    )
  )
```

```{r}
# Round numeric columns to whole numbers (the result is still of type double)

df |> 
  mutate(
    across(
      .cols = where(is.numeric),  # Select all numeric columns
      .fns = ~ round(.x),         # Apply round() to each column
      .names = "{.col}_int"       # Create new columns with _int suffix
    )
  )
```
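
`across()` is not limited to `mutate()`. As a brief sketch, the same `.cols`, `.fns`, and `.names` arguments can be used inside `summarise()` to collapse each selected column to a single value. The `_mean` suffix is a naming choice made for this example.

```{r}
library(dplyr)

# Summarise every numeric column of `trees` with its mean.
# The result is a one-row tibble with one column per input column.
tree_means <- as_tibble(trees) |>
  summarise(
    across(
      .cols = where(is.numeric),
      .fns = mean,
      .names = "{.col}_mean"
    )
  )
tree_means
```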

## Extending Tidy Operations Across Columns

In addition to `across()`, `dplyr` provides `if_any()` and `if_all()`, which operate very similarly. As demonstrated above, `across()` can be used with `mutate()`, and it works the same way in `summarise()` and `reframe()`. It is not meant for `filter()`, however (using `across()` inside `filter()` has been deprecated since dplyr 1.0.8), which is where `if_any()` and `if_all()` can help.

```{r}
# NOT RECOMMENDED, INCLUDED FOR DEMONSTRATION ONLY
# Uncomment the code below to see the error or deprecation warning (depending on
# your dplyr version) raised when across() is used inside filter()

#df |> 
#  filter(
#    across(
#      .cols = where(is.numeric), 
#      .fns = ~ .x > 5,            
#    )
#  )
```

The `if_any()` and `if_all()` functions take the same `.cols` and `.fns` arguments as `across()` but return a logical vector, which is what `filter()` needs. `if_any()` keeps every row with at least one cell that meets the condition, whereas `if_all()` keeps only rows in which all selected cells meet it. Let's look at an example using the tree rankings we generated earlier. Keep in mind that `rank()` assigns rank 1 to the smallest value, so low ranks correspond to the smallest trees.

```{r}
ranked_trees |> 
  filter(
    if_any(
      .cols = contains("rank"),   # Only check the columns with "rank" in their name
      .fns = ~ . <= 10            # Keep rows ranked 10 or lower (smallest values) on ANY measure
    )
  )
```

Using `if_any()`, we get all the trees that rank 10 or lower (that is, among the smallest) on EITHER `Girth_rank`, `Height_rank`, OR `Volume_rank`. If we perform the exact same operation but substitute `if_all()` for `if_any()`, we get a much shorter list containing only the trees that rank 10 or lower on ALL three measures.

```{r}
ranked_trees |> 
  filter(
    if_all(
      .cols = contains("rank"),   # Only check the columns with "rank" in their name
      .fns = ~ . <= 10            # Keep rows ranked 10 or lower (smallest values) on ALL measures
    )
  )
```
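
To make the difference concrete, the two filters can be written out by hand: `if_any()` behaves like joining the per-column tests with `|`, and `if_all()` like joining them with `&`. The sketch below rebuilds the rank columns from `trees` so that it runs on its own; the object names (`ranks`, `any_by_hand`, and so on) are introduced here purely for illustration.

```{r}
library(dplyr)

# Rebuild the rank columns so this chunk is self-contained
ranks <- as_tibble(trees) |>
  mutate(across(everything(), rank, .names = "{.col}_rank"))

# Hand-written equivalent of if_any(): tests combined with |
any_by_hand <- ranks |>
  filter(Girth_rank <= 10 | Height_rank <= 10 | Volume_rank <= 10)

# Hand-written equivalent of if_all(): tests combined with &
all_by_hand <- ranks |>
  filter(Girth_rank <= 10 & Height_rank <= 10 & Volume_rank <= 10)

# The same filters using the helpers
any_helper <- ranks |> filter(if_any(contains("rank"), ~ .x <= 10))
all_helper <- ranks |> filter(if_all(contains("rank"), ~ .x <= 10))

# Compare the hand-written and helper-based results
all.equal(any_by_hand, any_helper)
all.equal(all_by_hand, all_helper)
```

The helper versions scale to any number of selected columns, whereas the hand-written conditions must be edited every time a column is added or removed.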

You can also use `if_any()` and `if_all()` within functions like `mutate()` and `summarise()` to perform different operations depending on the values present in a row. Because `rank()` gives the smallest value rank 1, a rank of 10 or lower marks the smallest trees. In the example below, we add a new column called `size`, which has the value "small" if the tree ranks 10 or lower on all measures, "medium" if it ranks 10 or lower on at least one measure, and "big" otherwise.

```{r}
ranked_trees |> 
  mutate(
    size = case_when(
      if_all(contains("rank"), ~ . <= 10) ~ "small",
      if_any(contains("rank"), ~ . <= 10) ~ "medium",
      TRUE                                ~ "big"
    )
  )
```