---
title: "Group By Vignette"
author: "John Ferrara"
date: "`r Sys.Date()`"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Demonstrating the group by (`group_by()`) functionality in dplyr
The `group_by()` function splits a data frame into groups defined by one or more variables, so that later operations, such as aggregate metrics, are computed separately for each group or category rather than over the whole data frame.
## Pulling in example data (Using NYC Parks Pools Data)
```{r reading_in_data}
library(tidyverse)
pools_df <- read_csv("https://data.cityofnewyork.us/resource/y5rm-wagw.csv")
head(pools_df)
```
## Grouping by a Single Category
Our example data lists the public pools run by NYC Parks. Say we want the total number of pools by community board or by borough. To get those numbers, we combine `group_by()` with `summarize()`: `group_by()` defines the groups, and `summarize()` specifies which aggregate to compute for each group.
```{r single_cat}
# Group by community board and count the rows (pools) in each board
pool_by_cd <- pools_df %>% group_by(communityboard) %>% summarize(pool_count = n())
head(pool_by_cd)
# Group by borough and count the pools in each borough
pool_by_bro <- pools_df %>% group_by(borough) %>% summarize(pool_count = n())
head(pool_by_bro)
```
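`summarize()` is not limited to `n()`; several aggregates can be computed per group in one call. The sketch below uses a small made-up tibble (`toy_pools` and its `length_ft` column are hypothetical and not part of the NYC Parks data) so that it runs without a network connection.

```{r toy_summaries}
library(dplyr)

# Hypothetical toy data: the values are invented for illustration only
toy_pools <- tibble(
  borough   = c("Bronx", "Bronx", "Queens", "Queens", "Queens"),
  length_ft = c(75, 40, 60, 100, 50)
)

# One output row per borough, with three aggregates computed per group
toy_summary <- toy_pools %>%
  group_by(borough) %>%
  summarize(
    pool_count  = n(),              # number of rows in the group
    mean_length = mean(length_ft),  # average of a numeric column
    max_length  = max(length_ft)    # largest value in the group
  )
toy_summary
```

Each argument to `summarize()` becomes one column in the result, alongside the grouping column.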
## Grouping by multiple categories
You can group by multiple columns as well. Let's say we want the total number of pools in each borough, broken down by pool type.
```{r multiple_cat}
# Count the pools for each borough and pool type combination
pool_by_bro_type <- pools_df %>% group_by(borough, pooltype) %>% summarize(pool_count = n())
head(pool_by_bro_type)
```
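After summarizing over several grouping columns, the result is by default still grouped by all but the last one. A minimal sketch of how to inspect and control this, assuming dplyr 1.0.0 or later (which introduced the `.groups` argument) and a made-up `toy_pools` tibble:

```{r toy_groups}
library(dplyr)

# Hypothetical toy data, not taken from the NYC Parks dataset
toy_pools <- tibble(
  borough  = c("Bronx", "Bronx", "Queens", "Queens", "Queens"),
  pooltype = c("Outdoor", "Indoor", "Outdoor", "Outdoor", "Indoor")
)

# Default behavior: the last grouping variable (pooltype) is dropped,
# so the result is still grouped by borough
by_default <- toy_pools %>%
  group_by(borough, pooltype) %>%
  summarize(pool_count = n())
group_vars(by_default)  # "borough"

# .groups = "drop" returns a result with no grouping at all
dropped <- toy_pools %>%
  group_by(borough, pooltype) %>%
  summarize(pool_count = n(), .groups = "drop")
group_vars(dropped)     # character(0)
```

Setting `.groups` explicitly also silences the informational message that `summarize()` prints about the remaining grouping.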
## Remove the grouping
`ungroup()` removes the grouping, so that any further operations are calculated over the whole data frame rather than within each group.
```{r ungroup}
# Mean pool count per borough, averaged across that borough's pool types
pool_by_bro_type_mean <- pool_by_bro_type |>
  group_by(borough) |>
  mutate(pool_count_mean = mean(pool_count)) |>
  ungroup()
head(pool_by_bro_type_mean)
```
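To see what `ungroup()` changes, the sketch below runs the same `mean()` inside `mutate()` once while grouped and once after ungrouping. The `toy_counts` tibble and its values are hypothetical.

```{r toy_ungroup}
library(dplyr)

# Hypothetical pool counts per borough and pool type
toy_counts <- tibble(
  borough    = c("Bronx", "Bronx", "Queens", "Queens"),
  pooltype   = c("Indoor", "Outdoor", "Indoor", "Outdoor"),
  pool_count = c(1, 3, 2, 6)
)

toy_means <- toy_counts %>%
  group_by(borough) %>%
  mutate(borough_mean = mean(pool_count)) %>%  # computed within each borough
  ungroup() %>%
  mutate(overall_mean = mean(pool_count))      # computed over all rows
toy_means
```

Here `borough_mean` is 2 for the Bronx rows and 4 for the Queens rows, while `overall_mean` is 3 on every row, because the second `mutate()` no longer sees any groups.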
## Split groups into new dataframes
You can split a grouped data frame into a list of data frames, one per group, using `dplyr::group_split()`. This is useful when you want to break down very large data into meaningful chunks, such as business units.
```{r split_group}
# Creating a grouped dataframe
pool_by_bro_type <- pools_df %>% group_by(borough,pooltype) %>% summarize(pool_count=n())
head(pool_by_bro_type)
# Splitting it into a list of dataframes
group_split(pool_by_bro_type)
```
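The list returned by `group_split()` is not named after the groups. One possible way to label the pieces is to pair it with `group_keys()`, which returns the group values in the same order. This is a sketch on a hypothetical `toy_pools` tibble, not the only approach.

```{r toy_split}
library(dplyr)

# Hypothetical toy data, not taken from the NYC Parks dataset
toy_pools <- tibble(
  borough  = c("Queens", "Bronx", "Queens", "Bronx", "Queens"),
  pooltype = c("Outdoor", "Indoor", "Outdoor", "Outdoor", "Indoor")
)

grouped <- toy_pools %>% group_by(borough)

pieces <- group_split(grouped)  # list with one tibble per borough
keys   <- group_keys(grouped)   # one row per group, in the same order

# Label each piece with its borough so it can be looked up by name
named_pieces <- setNames(pieces, keys$borough)
named_pieces$Queens
```

With names attached, a single group's data frame can be pulled out directly instead of by position.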
## Alternative calculation of group totals using count()
Grouping and then calling `summarize()` with `n()` is one way to count the observations in each subgroup. An alternative is `count()`, which takes the grouping variables directly as arguments, so no separate `group_by()` step is needed. Compare the code below with the `multiple_cat` chunk in "Grouping by multiple categories" above; note that `count()` names its result column `n` by default.
```{r count}
pools_df %>% count(borough,pooltype)
```
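As a rough check that the two approaches agree, the sketch below builds the same table both ways on a hypothetical `toy_pools` tibble. It uses the `name` and `sort` arguments of `count()` and assumes dplyr 1.0.0 or later for `.groups`.

```{r toy_count}
library(dplyr)

# Hypothetical toy data, not taken from the NYC Parks dataset
toy_pools <- tibble(
  borough  = c("Bronx", "Bronx", "Queens", "Queens", "Queens"),
  pooltype = c("Outdoor", "Indoor", "Outdoor", "Outdoor", "Indoor")
)

# count() with a custom name for the result column
via_count <- toy_pools %>%
  count(borough, pooltype, name = "pool_count")

# The longer group_by() + summarize() form, ungrouped at the end
via_summarize <- toy_pools %>%
  group_by(borough, pooltype) %>%
  summarize(pool_count = n(), .groups = "drop")

# Compare the two results as plain data frames
all.equal(as.data.frame(via_count), as.data.frame(via_summarize))

# sort = TRUE puts the largest groups first
by_borough <- toy_pools %>% count(borough, sort = TRUE)
by_borough
```

Called on an ungrouped data frame, `count()` also returns an ungrouped result, so no `ungroup()` step is needed afterwards.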