diff --git a/resources.Rmd b/resources.Rmd
index 1942e375..d24bf03b 100644
--- a/resources.Rmd
+++ b/resources.Rmd
@@ -14,6 +14,7 @@ Here are additional resources to help you on your R journey - either before, dur
**Help Getting Started**
+- [Guide to using Slack](https://slack.com/help/articles/218080037-Getting-started-for-new-Slack-users)
 - [R reference card](http://cran.r-project.org/doc/contrib/Short-refcard.pdf)
 - [R introductory guide](https://cran.r-project.org/doc/manuals/r-release/R-intro.html)
 - [R jargon](https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf)
@@ -71,6 +72,11 @@ Here are additional resources to help you on your R journey - either before, dur
 - [Video](https://www.youtube.com/watch?v=Ao9e0cDzMrE) for Mac users who want to see how to move files around (especially from downloads)
 - [Extra information about file paths](https://docs.google.com/presentation/d/18u1Vhd3Uq-QprC0btpxS_-Ka-LKVUvncyoqdbGdb-g4/edit?usp=sharing)
+**Need extra guidance on wrangling?**
+
+- [Guide on `janitor`](https://hutchdatascience.org/data_snacks/r_snacks/janitor.html)
+- [Guide on cleaning complicated names](https://daseh.org/resources/cleaning_names.html)
+
 **Need help with joins?**
 - [`full-join()` animation](https://github.com/gadenbuie/tidyexplain/blob/master/images/full-join.gif)
@@ -93,7 +99,13 @@ Here are additional resources to help you on your R journey - either before, dur
 - [Modeling 101](https://jhudatascience.org/tidyversecourse/model.html#linear-modeling)
 - [Common statistical tests are linear models](https://lindeloev.github.io/tests-as-linear/) (why understanding linear models will get you far!)
 - [Interpreting GLM output (e.g., deviance)](https://www.statology.org/null-residual-deviance/)
+- [Guide on why `set.seed` can be useful](https://rsample.tidymodels.org/reference/bootstraps.html)
+
+**Want help creating tables?**
+- [Guide on making nice tables from stats tests in R](https://www.danieldsjoberg.com/gtsummary/articles/tbl_summary.html)
+- [Guide on making custom styled tables in R with the `kableExtra` package](https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html)
+- [Guide on using the `DT` package to make interactive tables](https://rstudio.github.io/DT/)
@@ -132,6 +144,7 @@ Here are additional resources to help you on your R journey - either before, dur
 (See page 505)
 - [R <-> SAS Cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/sas-r.pdf)
 - [SAS to R Converter](https://www.codeconvert.ai/sas-to-r-converter)
+- [Guide to learning R as a SAS user](https://hutchdatascience.org/data_snacks/r_snacks/sas2r.html)
 - You might also find large language models like ChatGPT useful for code conversion. Be sure to check the output because AI makes mistakes!
@@ -140,7 +153,7 @@ Here are additional resources to help you on your R journey - either before, dur
**Comparison of Python and R**
-- A helpful [blog post](https://www.ibm.com/cloud/blog/python-vs-r) about the difference between these two languages.
+- A helpful [article about the differences between these two languages](https://www.ibm.com/think/topics/python-vs-r).
diff --git a/resources/cleaning_names.Rmd b/resources/cleaning_names.Rmd
new file mode 100644
index 00000000..64b8d012
--- /dev/null
+++ b/resources/cleaning_names.Rmd
@@ -0,0 +1,111 @@
+---
+title: "Cleaning complicated column names"
+output:
+  html_document:
+    css: ../docs/web_styles.css
+    toc: true
+---
+
+
+## Cleaning a common pattern from names
+
+Let's say that we already have technically clean names, in that they don't have spaces or punctuation and don't start with a number. However, there is a redundant word ("percent") that we want to remove from (or add to) multiple columns.
+
+
+First let's load the packages we will need. We will show some functions from `janitor` and the `tidyverse`:
+
+```{r, echo = FALSE}
+install.packages("janitor", repos='http://cran.us.r-project.org')
+```
+
+```{r}
+#install.packages("janitor")
+library(tidyverse)
+library(janitor)
+```
+
+Next let's make some data:
+
+```{r}
+data_to_clean <- tibble(State = c("Texas", "Utah", "Maryland", "Ohio"),
+                        tax_percent = c(10, 20, 60, 40),
+                        literacy_percent = c(70, 80, 80, 75),
+                        above_poverty_percent = c(60, 70, 50, 60))
+data_to_clean
+
+```
+
+
+We can use the `rename_with` function of `dplyr` and `str_remove` of `stringr` to remove the pattern "_percent" from each of the column names.
+
+Here the `~` starts a short anonymous function and the `.` stands for the column names passed to it, so `str_remove` is applied to all the column names. If it finds the pattern in a name, it will remove it.
+
+
+```{r}
+data_to_clean %>% rename_with(~str_remove(., '_percent'))
+
+```
+
+Nice! That simplified our names very easily!
+
+## Cleaning names with numbers and punctuation
+
+
+We can use regex patterns to remove unwanted characters - see this [regex cheatsheet](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf) for help! We adapted some code from this [source](https://stackoverflow.com/questions/71151470/remove-characters-from-column-names).
+
+First we will make some very messy data:
+
+```{r}
+
+d <- tibble("Year" = 1:5,
+            "Info" = 1:5,
+            "1. Products" = 1:5,
+            "2. Rate" = 1:5,
+            "3. Price" = 1:5,
+            "29. Other" = 1:5)
+d
+```
+
+Now we can remove the numbers and punctuation in a similar way as we did before using `rename_with` and `str_remove`, but this time we specify a few things (all based on the [regex cheatsheet](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf)):
+
+- that we want to remove digits with `[:digit:]`
+
+- that we want to remove one or more digits with the `+`
+
+- that we want to remove a period (which needs to be escaped with `\\`, because an unescaped `.` matches any character) and a space
+
+Here we go:
+
+```{r}
+d %>%
+  rename_with(~str_remove(., "[:digit:]+\\. "))
+
+```
+
+Nice, that is better!
+
+## Using values of a specific row for column names
+
+First let's make some messy data that is missing values in the first row and has possibly better column names in the second row. We adapted code from this [source](https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html#remove_constant-columns).
+
+This can often happen when we read in data.
+
+```{r}
+
+dirt <- data.frame(X_1 = c(NA, "ID", 1:3),
+                   X_2 = c(NA, "Value", 4:6))
+
+dirt
+```
+
+
+The function `row_to_names` from the `janitor` package (not part of the `tidyverse` - so make sure you install and load it!) can be really helpful for this.
+
+We can use the `row_number` argument of `row_to_names` to specify that the column names can be found in the second row.
+
+```{r}
+row_to_names(dirt, row_number = 2) # our column names can be found in row 2!
+
+```
+
+
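Possible follow-up for `resources/cleaning_names.Rmd`: the first section mentions adding a common word to several columns as well as removing one, but only removal is shown. Below is a minimal sketch of the adding direction. It assumes a made-up tibble `data_to_label` whose columns lack the suffix; the data and the object names `data_to_label` and `labeled` are hypothetical and not part of this PR. It uses the `.cols` argument of `rename_with()` to leave `State` unchanged.

```r
library(dplyr)
library(tibble)

# Hypothetical data: the "_percent" suffix is missing from the value columns
data_to_label <- tibble(State = c("Texas", "Utah"),
                        tax = c(10, 20),
                        literacy = c(70, 80))

# Append "_percent" to every column name except State
labeled <- data_to_label %>%
  rename_with(~ paste0(., "_percent"), .cols = -State)

names(labeled)
# "State" "tax_percent" "literacy_percent"
```

Restricting the rename with `.cols` avoids renaming identifier columns such as `State`; without it, the function would be applied to every column name.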