Merge pull request #271 from fhdsl/resources-added
Add resources from JHU Course
avahoffman authored Jan 24, 2025
2 parents 27bc33d + d31c9bd commit fddccc7
Showing 2 changed files with 125 additions and 1 deletion.
15 changes: 14 additions & 1 deletion resources.Rmd
@@ -14,6 +14,7 @@ Here are additional resources to help you on your R journey - either before, dur

<details open><summary> <span style = "color: #5383bb;"> **Help Getting Started**</span></summary><br>

- [Guide to using Slack]( https://slack.com/help/articles/218080037-Getting-started-for-new-Slack-users)
- [R reference card](http://cran.r-project.org/doc/contrib/Short-refcard.pdf)
- [R introductory guide](https://cran.r-project.org/doc/manuals/r-release/R-intro.html)
- [R jargon](https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf)
@@ -71,6 +72,11 @@ Here are additional resources to help you on your R journey - either before, dur
- [Video](https://www.youtube.com/watch?v=Ao9e0cDzMrE) for Mac users who want to see how to move files around (especially from downloads)
- [Extra information about file paths](https://docs.google.com/presentation/d/18u1Vhd3Uq-QprC0btpxS_-Ka-LKVUvncyoqdbGdb-g4/edit?usp=sharing)

**Need extra guidance on wrangling?**

- [Guide on `janitor`](https://hutchdatascience.org/data_snacks/r_snacks/janitor.html)
- [Guide on cleaning complicated names](https://daseh.org/resources/cleaning_names.html)

**Need help with joins?**

- [`full-join()` animation](https://github.com/gadenbuie/tidyexplain/blob/master/images/full-join.gif)
@@ -93,7 +99,13 @@ Here are additional resources to help you on your R journey - either before, dur
- [Modeling 101](https://jhudatascience.org/tidyversecourse/model.html#linear-modeling)
- [Common statistical tests are linear models](https://lindeloev.github.io/tests-as-linear/) (why understanding linear models will get you far!)
- [Interpreting GLM output (e.g., deviance)](https://www.statology.org/null-residual-deviance/)
- [Guide on why `set.seed` can be useful](https://rsample.tidymodels.org/reference/bootstraps.html)

**Want help creating tables?**

- [Guide on making nice tables from stats tests in R](https://www.danieldsjoberg.com/gtsummary/articles/tbl_summary.html)
- [Guide on making custom styled tables in R with the `kableExtra` package](https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html)
- [Guide on using `DT table` to make interactive tables](https://rstudio.github.io/DT/)

</details>

@@ -132,6 +144,7 @@ Here are additional resources to help you on your R journey - either before, dur
(See page 505)
- [R <-> SAS Cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/sas-r.pdf)
- [SAS to R Converter](https://www.codeconvert.ai/sas-to-r-converter)
- [Guide to learning R as a SAS user](https://hutchdatascience.org/data_snacks/r_snacks/sas2r.html)
- You might also find large language models like ChatGPT useful for code conversion. Be sure to check the output because AI makes mistakes!

</details>
@@ -140,7 +153,7 @@ Here are additional resources to help you on your R journey - either before, dur

<details><summary> <span style = "color: #5383bb;"> **Comparison of Python and R**</span></summary><br>

- A helpful [blog post](https://www.ibm.com/cloud/blog/python-vs-r) about the difference between these two languages.
- A helpful [article about the difference between these two languages](https://www.ibm.com/think/topics/python-vs-r).

</details>

111 changes: 111 additions & 0 deletions resources/cleaning_names.Rmd
@@ -0,0 +1,111 @@
---
title: "Cleaning complicated column names"
output:
  html_document:
    css: ../docs/web_styles.css
    toc: true
---


## Cleaning a common pattern from names

Let's say that we already have technically clean names, in that they don't have spaces or punctuation and don't start with a number. However, there is a redundant word ("percent") that we want to remove from (or add to) multiple columns.


First let's load the packages we will need. We will show some functions from `janitor` and the `tidyverse`:

```{r, echo = FALSE}
install.packages("janitor", repos='http://cran.us.r-project.org')
```

```{r}
#install.packages("janitor")
library(tidyverse)
library(janitor)
```

Next let's make some data:

```{r}
data_to_clean <- tibble(State = c("Texas", "Utah", "Maryland", "Ohio"),
                        tax_percent = c(10, 20, 60, 40),
                        literacy_percent = c(70, 80, 80, 75),
                        above_poverty_percent = c(60, 70, 50, 60))
data_to_clean
```


We can use the `rename_with` function of `dplyr` and `str_remove` of `stringr` to remove the pattern "_percent" from each of the column names.

Here the `~` starts a short anonymous function and the `.` stands for the column names passed to it, so `str_remove` is applied to every column name. Wherever it finds the pattern, it removes it; names without the pattern are left unchanged.


```{r}
data_to_clean %>% rename_with(~str_remove(., '_percent'))
```

Nice! That simplified our names very easily!
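
As a small sketch of the "add" direction mentioned at the start of this section, the chunk below first repeats the removal with an explicit `function(x)` in place of the `~` shorthand, then puts the suffix back on the numeric columns only. The object names `small_data`, `removed`, and `added` are made up for this illustration, and the use of `.cols = where(is.numeric)` is an assumption about which columns you would want to rename.

```{r}
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical, smaller version of the data above
small_data <- tibble(State = c("Texas", "Utah"),
                     tax_percent = c(10, 20),
                     literacy_percent = c(70, 80))

# Same removal as before, written with an explicit anonymous function
removed <- small_data %>%
  rename_with(function(x) str_remove(x, "_percent"))

# Add the suffix back, but only to the numeric columns (State is skipped)
added <- removed %>%
  rename_with(~ str_c(., "_percent"), .cols = where(is.numeric))

names(removed)
names(added)
```

Restricting the rename with `.cols` keeps identifier columns such as `State` untouched, which is usually what you want when a suffix only describes the measured values.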

## Cleaning names with numbers and punctuation


We can use regular expression (regex) patterns to remove unwanted characters; see this [regex cheatsheet](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf) for help! We adapted some code from this [source](https://stackoverflow.com/questions/71151470/remove-characters-from-column-names).

First we will make some very messy data:

```{r}
d <- tibble("Year" = 1:5,
            "Info" = 1:5,
            "1. Products" = 1:5,
            "2. Rate" = 1:5,
            "3. Price" = 1:5,
            "29. Other" = 1:5)
d
```

Now we can remove the numbers and punctuation in a similar way as we did before using `rename_with` and `str_remove`, but this time we specify a few things:

- that we want to remove digits with `[:digit:]` (based on the [regex cheatsheet](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf))

- that we want to remove one or more digits in a row with the `+` (based on the [regex cheatsheet](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf))

- that we want to remove a period (which needs two backslashes, `\\`, because a plain `.` matches any character in regex; see the [regex cheatsheet](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf) too!) and a space

Here we go:

```{r}
d %>%
  rename_with(~str_remove(., "[:digit:]+\\. "))
```

Nice, that is better!
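
One thing to watch: the pattern above matches a "number, period, space" run anywhere in a name, not just at the start. The sketch below, using made-up data (`d2`) and made-up result names, shows how anchoring the pattern with `^` limits the removal to a leading prefix. It uses `[0-9]` as a plain stand-in for the digit class, which is an assumption that your names only contain ASCII digits.

```{r}
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical data: the second name has a number in the middle, not at the start
d2 <- tibble("1. Products" = 1:2,
             "Top 10. Items" = 1:2)

# Unanchored: the first match is removed wherever it occurs in the name
unanchored <- d2 %>%
  rename_with(~ str_remove(., "[0-9]+\\. "))

# Anchored with ^: only a prefix at the very start of the name is removed
anchored <- d2 %>%
  rename_with(~ str_remove(., "^[0-9]+\\. "))

names(unanchored)
names(anchored)
```

With the anchored pattern, "Top 10. Items" keeps its number, while "1. Products" still becomes "Products".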

## Using values of a specific row for column names

First let's make some messy data that is missing values in the first row and has possibly better column names in the second row. We adapted code from this [source](https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html#remove_constant-columns).

This can often happen when we read in data.

```{r}
dirt <- data.frame(X_1 = c(NA, "ID", 1:3),
                   X_2 = c(NA, "Value", 4:6))
dirt
```


The function `row_to_names` from the `janitor` package (not part of the `tidyverse` - so make sure you install and load it!) can be really helpful for this.

We can use the `row_number` argument of `row_to_names` to specify that the column names can be found in the second row.

```{r}
row_to_names(dirt, row_number = 2) # our column names can be found in row 2!
```
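
Because the original columns mixed text and numbers, the values are still stored as character after the names are fixed. A possible follow-up, sketched here with the made-up object names `messy` and `tidy`, is to convert the columns to numeric with `mutate()` and `across()` from `dplyr`. Converting every column is an assumption that fits this toy data; in real data you would likely pick specific columns.

```{r}
library(janitor)
library(dplyr)

# Same shape as the data above, under a hypothetical name
messy <- data.frame(X_1 = c(NA, "ID", 1:3),
                    X_2 = c(NA, "Value", 4:6))

tidy <- messy %>%
  row_to_names(row_number = 2) %>%          # row 2 becomes the names; rows 1-2 are dropped by default
  mutate(across(everything(), as.numeric))  # values were character, so convert them

str(tidy)
```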

