-
Notifications
You must be signed in to change notification settings - Fork 18
/
Copy pathamish_tidyverse_vignette.Rmd
85 lines (66 loc) · 3 KB
/
amish_tidyverse_vignette.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
---
title: "Tidyverse Example with Rock Dataset"
author: "Amish Rasheed"
date: "2024-11-11"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
In this example, we’ll use the Tidyverse functions `group_by` and `summarize` to analyze the built-in rock dataset in R. This dataset includes measurements of rock samples, such as area, perimeter, shape, and permeability. We'll group the data by a categorical representation of the shape variable and calculate summary statistics.
# Load Tidyverse
```{r}
library(tidyverse)
```
# Load the built-in rock dataset
```{r}
data("rock")
```
## The rock dataset contains four columns:
**area**: Area of the rock sample in square pixels.
**peri**: Perimeter of the rock sample in pixels.
**shape**: Shape of the rock sample.
**perm**: Permeability of the rock sample.
# Example: Summarize by a rounded 'shape' value
```{r}
rock_summary <- rock %>%
mutate(shape_category = round(shape, 1)) %>% # Round to one decimal place
group_by(shape_category) %>%
summarize(
avg_area = mean(area),
avg_perm = mean(perm),
count = n()
)
# Plot the summarized data
ggplot(rock_summary, aes(x = factor(shape_category), y = avg_area)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(
title = "Average Area by Shape Category",
x = "Shape Category (Rounded to 1 Decimal Place)",
y = "Average Area"
) +
theme_minimal()
```
# Conclusion
In this example, I demonstrated how to use group_by and summarize to compute summary statistics for grouped data using the Tidyverse package. I also created a bar plot to visualize the average area across different shape categories.
# Extension
In this extension I will be showcasing a use of the ungroup() command in group_by statements. In this first example, we can see how the rocks dataset has been grouped by shape category, making the following summary metrics be split for each individual shape category. I.E. the avg_perm for a shape is dependent on it's shape category, givng the average perimeter for shapes in that group.
```{r}
rock_summary_grouped <- rock %>%
mutate(shape_category = round(shape, 1)) %>% # Round to one decimal place
group_by(shape_category) %>%
mutate(avg_area = mean(area)) %>%
mutate(avg_perm = mean(perm))
head(rock_summary_grouped)
```
Now we will call the ungroup() function before we calculate the avg_perm. This means the avg_perm variable is the average perimeter for the whole population, not just ones in a specific shape category. The avg_area is still the average area for the for the specific category since this occurs before the ungroup() function.
```{r}
rock_summary_ungrouped <- rock %>%
mutate(shape_category = round(shape, 1)) %>% # Round to one decimal place
group_by(shape_category) %>%
mutate(avg_area = mean(area)) %>%
ungroup() %>%
mutate(avg_perm = mean(perm))
head(rock_summary_ungrouped)
```