We’d like to thank everyone that has contributed to the development of Data Science: A First Introduction. This is an open source textbook that began as a collection of course readings for DSCI 100, a new introductory data science course at the University of British Columbia (UBC). Several faculty members in the UBC Department of Statistics were pivotal in shaping the direction of that course, and as such, contributed greatly to the broad structure and list of topics in this book. We would especially like to thank Matías Salibián-Barrera for his mentorship during the initial development and roll-out of both DSCI 100 and this book. His door was always open when we needed to chat about how to best introduce and teach data science to our first-year students. We would also like to thank Gabriela Cohen Freue for her DSCI 561 (Regression I) teaching materials from the UBC Master of Data Science program, as some of our linear regression figures were inspired by these.
+We would also like to thank all those who contributed to the process of +publishing this book. In particular, we would like to thank all of our reviewers for their feedback and suggestions: +Rohan Alexander, Isabella Ghement, Virgilio Gómez Rubio, Albert Kim, Adam Loy, Maria Prokofieva, Emily Riederer, and Greg Wilson. +The book was improved substantially by their insights. +We would like to give special thanks to Jim Zidek +for his support and encouragement throughout the process, and to +Roger Peng for graciously offering to write the Foreword.
+Finally, we owe a debt of gratitude to all of the students of DSCI 100 over the past +few years. They provided invaluable feedback on the book and worksheets; +they found bugs for us (and stood by very patiently in class while +we frantically fixed those bugs); and they brought a level of enthusiasm to the class +that sustained us during the hard work of creating a new course and writing a textbook. +Our interactions with them taught us how to teach data science, and that learning +is reflected in the content of this book.
Classification I: training & predicting

In previous chapters, we focused solely on descriptive and exploratory data analysis questions. This chapter and the next together serve as our first foray into answering predictive questions about data. In particular, we will focus on classification, i.e., using one or more variables to predict the value of a categorical variable of interest. This chapter will cover the basics of classification, how to preprocess data to make it suitable for use in a classifier, and how to use our observed data to make predictions. The next chapter will focus on how to evaluate how accurate the predictions from our classifier are, as well as how to improve our classifier (where possible) to maximize its accuracy.
+By the end of the chapter, readers will be able to do the following:
Perform K-nearest neighbors classification in R using tidymodels.
Use a recipe to center, scale, balance, and impute data as a preprocessing step.
Combine preprocessing and model training using a workflow.

In many situations, we want to make predictions based on the current situation as well as past experiences. For instance, a doctor may want to diagnose a patient as either diseased or healthy based on their symptoms and the doctor’s past experience with patients; an email provider might want to tag a given email as “spam” or “not spam” based on the email’s text and past email text data; or a credit card company may want to predict whether a purchase is fraudulent based on the current purchase item, amount, and location as well as past purchases. These tasks are all examples of classification, i.e., predicting a categorical class (sometimes called a label) for an observation given its other variables (sometimes called features).
+Generally, a classifier assigns an observation without a known class (e.g., a new patient) +to a class (e.g., diseased or healthy) on the basis of how similar it is to other observations +for which we do know the class (e.g., previous patients with known diseases and +symptoms). These observations with known classes that we use as a basis for +prediction are called a training set; this name comes from the fact that +we use these data to train, or teach, our classifier. Once taught, we can use +the classifier to make predictions on new data for which we do not know the class.
+There are many possible methods that we could use to predict +a categorical class/label for an observation. In this book, we will +focus on the widely used K-nearest neighbors algorithm (Fix and Hodges 1951; Cover and Hart 1967). +In your future studies, you might encounter decision trees, support vector machines (SVMs), +logistic regression, neural networks, and more; see the additional resources +section at the end of the next chapter for where to begin learning more about +these other methods. It is also worth mentioning that there are many +variations on the basic classification problem. For example, +we focus on the setting of binary classification where only two +classes are involved (e.g., a diagnosis of either healthy or diseased), but you may +also run into multiclass classification problems with more than two +categories (e.g., a diagnosis of healthy, bronchitis, pneumonia, or a common cold).
+In this chapter and the next, we will study a data set of +digitized breast cancer image features, +created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian (Street, Wolberg, and Mangasarian 1993). +Each row in the data set represents an +image of a tumor sample, including the diagnosis (benign or malignant) and +several other measurements (nucleus texture, perimeter, area, and more). +Diagnosis for each image was conducted by physicians.
+As with all data analyses, we first need to formulate a precise question that +we want to answer. Here, the question is predictive: can +we use the tumor +image measurements available to us to predict whether a future tumor image +(with unknown diagnosis) shows a benign or malignant tumor? Answering this +question is important because traditional, non-data-driven methods for tumor +diagnosis are quite subjective and dependent upon how skilled and experienced +the diagnosing physician is. Furthermore, benign tumors are not normally +dangerous; the cells stay in the same place, and the tumor stops growing before +it gets very large. By contrast, in malignant tumors, the cells invade the +surrounding tissue and spread into nearby organs, where they can cause serious +damage (Stanford Health Care 2021). +Thus, it is important to quickly and accurately diagnose the tumor type to +guide patient treatment.
Our first step is to load, wrangle, and explore the data using visualizations in order to better understand the data we are working with. We start by loading the tidyverse package needed for our analysis. In this case, the file containing the breast cancer data set is a .csv file with headers. We’ll use the read_csv function with no additional arguments, and then inspect its contents:
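A minimal sketch of this loading step is shown below. The file path data/wdbc.csv is an assumption; only the unscaled file, data/wdbc_unscaled.csv, is named explicitly later in the chapter.

library(tidyverse)

# read the standardized breast cancer data (file path assumed)
cancer <- read_csv("data/wdbc.csv")
cancer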
## # A tibble: 569 × 12
+## ID Class Radius Texture Perimeter Area Smoothness Compactness Concavity
+## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
+## 1 8.42e5 M 1.10 -2.07 1.27 0.984 1.57 3.28 2.65
+## 2 8.43e5 M 1.83 -0.353 1.68 1.91 -0.826 -0.487 -0.0238
+## 3 8.43e7 M 1.58 0.456 1.57 1.56 0.941 1.05 1.36
+## 4 8.43e7 M -0.768 0.254 -0.592 -0.764 3.28 3.40 1.91
+## 5 8.44e7 M 1.75 -1.15 1.78 1.82 0.280 0.539 1.37
+## 6 8.44e5 M -0.476 -0.835 -0.387 -0.505 2.24 1.24 0.866
+## 7 8.44e5 M 1.17 0.161 1.14 1.09 -0.123 0.0882 0.300
+## 8 8.45e7 M -0.118 0.358 -0.0728 -0.219 1.60 1.14 0.0610
+## 9 8.45e5 M -0.320 0.588 -0.184 -0.384 2.20 1.68 1.22
+## 10 8.45e7 M -0.473 1.10 -0.329 -0.509 1.58 2.56 1.74
+## # ℹ 559 more rows
+## # ℹ 3 more variables: Concave_Points <dbl>, Symmetry <dbl>,
+## # Fractal_Dimension <dbl>
Breast tumors can be diagnosed by performing a biopsy, a process where tissue is removed from the body and examined for the presence of disease. Traditionally these procedures were quite invasive; modern methods such as fine needle aspiration, used to collect the present data set, extract only a small amount of tissue and are less invasive. Based on a digital image of each breast tissue sample collected for this data set, ten different variables were measured for each cell nucleus in the image (items 3–12 of the list of variables below), and then the mean for each variable across the nuclei was recorded. As part of the data preparation, these values have been standardized (centered and scaled); we will discuss what this means and why we do it later in this chapter. Each image additionally was given a unique ID and a diagnosis by a physician. Therefore, the total set of variables per image in this data set is:

ID: identification number
Class: the diagnosis (M = malignant or B = benign)
Radius
Texture
Perimeter
Area
Smoothness
Compactness
Concavity
Concave_Points
Symmetry
Fractal_Dimension
Below we use glimpse to preview the data frame. This function can make it easier to inspect the data when we have a lot of columns, as it prints the data such that the columns go down the page (instead of across).
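Assuming the data frame is stored as cancer, as above, the preview below comes from a single call:

glimpse(cancer)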
## Rows: 569
+## Columns: 12
+## $ ID <dbl> 842302, 842517, 84300903, 84348301, 84358402, 843786…
+## $ Class <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M…
+## $ Radius <dbl> 1.0960995, 1.8282120, 1.5784992, -0.7682333, 1.74875…
+## $ Texture <dbl> -2.0715123, -0.3533215, 0.4557859, 0.2535091, -1.150…
+## $ Perimeter <dbl> 1.26881726, 1.68447255, 1.56512598, -0.59216612, 1.7…
+## $ Area <dbl> 0.98350952, 1.90703027, 1.55751319, -0.76379174, 1.8…
+## $ Smoothness <dbl> 1.56708746, -0.82623545, 0.94138212, 3.28066684, 0.2…
+## $ Compactness <dbl> 3.28062806, -0.48664348, 1.05199990, 3.39991742, 0.5…
+## $ Concavity <dbl> 2.65054179, -0.02382489, 1.36227979, 1.91421287, 1.3…
+## $ Concave_Points <dbl> 2.53024886, 0.54766227, 2.03543978, 1.45043113, 1.42…
+## $ Symmetry <dbl> 2.215565542, 0.001391139, 0.938858720, 2.864862154, …
+## $ Fractal_Dimension <dbl> 2.25376381, -0.86788881, -0.39765801, 4.90660199, -0…
From the summary of the data above, we can see that Class is of type character (denoted by <chr>). We can use the distinct function to see all the unique values present in that column. We see that there are two diagnoses: benign, represented by “B”, and malignant, represented by “M”.
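A one-line sketch of that check, using the cancer data frame from above:

cancer |>
  distinct(Class)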
## # A tibble: 2 × 1
+## Class
+## <chr>
+## 1 M
+## 2 B
Since we will be working with Class as a categorical variable, it is a good idea to convert it to a factor type using the as_factor function. We will also improve the readability of our analysis by renaming “M” to “Malignant” and “B” to “Benign” using the fct_recode function. The fct_recode function is used to replace the names of factor values with other names. The arguments of fct_recode are the column that you want to modify, followed by any number of arguments of the form "new name" = "old name" to specify the renaming scheme.
cancer <- cancer |>
  mutate(Class = as_factor(Class)) |>
  mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B"))
glimpse(cancer)
## Rows: 569
+## Columns: 12
+## $ ID <dbl> 842302, 842517, 84300903, 84348301, 84358402, 843786…
+## $ Class <fct> Malignant, Malignant, Malignant, Malignant, Malignan…
+## $ Radius <dbl> 1.0960995, 1.8282120, 1.5784992, -0.7682333, 1.74875…
+## $ Texture <dbl> -2.0715123, -0.3533215, 0.4557859, 0.2535091, -1.150…
+## $ Perimeter <dbl> 1.26881726, 1.68447255, 1.56512598, -0.59216612, 1.7…
+## $ Area <dbl> 0.98350952, 1.90703027, 1.55751319, -0.76379174, 1.8…
+## $ Smoothness <dbl> 1.56708746, -0.82623545, 0.94138212, 3.28066684, 0.2…
+## $ Compactness <dbl> 3.28062806, -0.48664348, 1.05199990, 3.39991742, 0.5…
+## $ Concavity <dbl> 2.65054179, -0.02382489, 1.36227979, 1.91421287, 1.3…
+## $ Concave_Points <dbl> 2.53024886, 0.54766227, 2.03543978, 1.45043113, 1.42…
+## $ Symmetry <dbl> 2.215565542, 0.001391139, 0.938858720, 2.864862154, …
+## $ Fractal_Dimension <dbl> 2.25376381, -0.86788881, -0.39765801, 4.90660199, -0…
+Let’s verify that we have successfully converted the Class
column to a factor variable
+and renamed its values to “Benign” and “Malignant” using the distinct
function once more.
## # A tibble: 2 × 1
+## Class
+## <fct>
+## 1 Malignant
+## 2 Benign
+Before we start doing any modeling, let’s explore our data set. Below we use
+the group_by
, summarize
and n
functions to find the number and percentage
+of benign and malignant tumor observations in our data set. The n
function within
+summarize
, when paired with group_by
, counts the number of observations in each Class
group.
+Then we calculate the percentage in each group by dividing by the total number of observations
+and multiplying by 100. We have 357 (63%) benign and 212 (37%) malignant tumor observations.
num_obs <- nrow(cancer)
+cancer |>
+ group_by(Class) |>
+ summarize(
+ count = n(),
+ percentage = n() / num_obs * 100
+ )
## # A tibble: 2 × 3
+## Class count percentage
+## <fct> <int> <dbl>
+## 1 Malignant 212 37.3
+## 2 Benign 357 62.7
+Next, let’s draw a scatter plot to visualize the relationship between the
+perimeter and concavity variables. Rather than use ggplot's
default palette,
+we select our own colorblind-friendly colors—"darkorange"
+for orange and "steelblue"
for blue—and
+pass them as the values
argument to the scale_color_manual
function.
perim_concav <- cancer |>
+ ggplot(aes(x = Perimeter, y = Concavity, color = Class)) +
+ geom_point(alpha = 0.6) +
+ labs(x = "Perimeter (standardized)",
+ y = "Concavity (standardized)",
+ color = "Diagnosis") +
+ scale_color_manual(values = c("darkorange", "steelblue")) +
+ theme(text = element_text(size = 12))
+perim_concav
+Figure 5.1: Scatter plot of concavity versus perimeter colored by diagnosis label. +
+In Figure 5.1, we can see that malignant observations typically fall in
+the upper right-hand corner of the plot area. By contrast, benign
+observations typically fall in the lower left-hand corner of the plot. In other words,
+benign observations tend to have lower concavity and perimeter values, and malignant
+ones tend to have larger values. Suppose we
+obtain a new observation not in the current data set that has all the variables
+measured except the label (i.e., an image without the physician’s diagnosis
+for the tumor class). We could compute the standardized perimeter and concavity values,
+resulting in values of, say, 1 and 1. Could we use this information to classify
+that observation as benign or malignant? Based on the scatter plot, how might
+you classify that new observation? If the standardized concavity and perimeter
+values are 1 and 1 respectively, the point would lie in the middle of the
+orange cloud of malignant points and thus we could probably classify it as
+malignant. Based on our visualization, it seems like it may be possible
+to make accurate predictions of the Class
variable (i.e., a diagnosis) for
+tumor images with unknown diagnoses.
In order to actually make predictions for new observations in practice, we +will need a classification algorithm. +In this book, we will use the K-nearest neighbors classification algorithm. +To predict the label of a new observation (here, classify it as either benign +or malignant), the K-nearest neighbors classifier generally finds the \(K\) +“nearest” or “most similar” observations in our training set, and then uses +their diagnoses to make a prediction for the new observation’s diagnosis. \(K\) +is a number that we must choose in advance; for now, we will assume that someone has chosen +\(K\) for us. We will cover how to choose \(K\) ourselves in the next chapter.
+To illustrate the concept of K-nearest neighbors classification, we +will walk through an example. Suppose we have a +new observation, with standardized perimeter of 2 and standardized concavity of 4, whose +diagnosis “Class” is unknown. This new observation is depicted by the red, diamond point in +Figure 5.2.
++Figure 5.2: Scatter plot of concavity versus perimeter with new observation represented as a red diamond. +
+Figure 5.3 shows that the nearest point to this new observation is malignant and +located at the coordinates (2.1, 3.6). The idea here is that if a point is close to another in the scatter plot, +then the perimeter and concavity values are similar, and so we may expect that +they would have the same diagnosis.
++Figure 5.3: Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a malignant label. +
+Suppose we have another new observation with standardized perimeter 0.2 and +concavity of 3.3. Looking at the scatter plot in Figure 5.4, how would you +classify this red, diamond observation? The nearest neighbor to this new point is a +benign observation at (0.2, 2.7). +Does this seem like the right prediction to make for this observation? Probably +not, if you consider the other nearby points.
++Figure 5.4: Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label. +
+To improve the prediction we can consider several +neighboring points, say \(K = 3\), that are closest to the new observation +to predict its diagnosis class. Among those 3 closest points, we use the +majority class as our prediction for the new observation. As shown in Figure 5.5, we +see that the diagnoses of 2 of the 3 nearest neighbors to our new observation +are malignant. Therefore we take majority vote and classify our new red, diamond +observation as malignant.
++Figure 5.5: Scatter plot of concavity versus perimeter with three nearest neighbors. +
+Here we chose the \(K=3\) nearest observations, but there is nothing special +about \(K=3\). We could have used \(K=4, 5\) or more (though we may want to choose +an odd number to avoid ties). We will discuss more about choosing \(K\) in the +next chapter.
+We decide which points are the \(K\) “nearest” to our new observation +using the straight-line distance (we will often just refer to this as distance). +Suppose we have two observations \(a\) and \(b\), each having two predictor variables, \(x\) and \(y\). +Denote \(a_x\) and \(a_y\) to be the values of variables \(x\) and \(y\) for observation \(a\); +\(b_x\) and \(b_y\) have similar definitions for observation \(b\). +Then the straight-line distance between observation \(a\) and \(b\) on the x-y plane can +be computed using the following formula:
+\[\mathrm{Distance} = \sqrt{(a_x -b_x)^2 + (a_y - b_y)^2}\]
+To find the \(K\) nearest neighbors to our new observation, we compute the distance
+from that new observation to each observation in our training data, and select the \(K\) observations corresponding to the
+\(K\) smallest distance values. For example, suppose we want to use \(K=5\) neighbors to classify a new
+observation with perimeter of 0 and
+concavity of 3.5, shown as a red diamond in Figure 5.6. Let’s calculate the distances
+between our new point and each of the observations in the training set to find
+the \(K=5\) neighbors that are nearest to our new point.
+You will see in the mutate
step below, we compute the straight-line
+distance using the formula above: we square the differences between the two observations’ perimeter
+and concavity coordinates, add the squared differences, and then take the square root.
+In order to find the \(K=5\) nearest neighbors, we will use the slice_min
function.
+Figure 5.6: Scatter plot of concavity versus perimeter with new observation represented as a red diamond. +
+new_obs_Perimeter <- 0
+new_obs_Concavity <- 3.5
+cancer |>
+ select(ID, Perimeter, Concavity, Class) |>
+ mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 +
+ (Concavity - new_obs_Concavity)^2)) |>
+ slice_min(dist_from_new, n = 5) # take the 5 rows of minimum distance
## # A tibble: 5 × 5
+## ID Perimeter Concavity Class dist_from_new
+## <dbl> <dbl> <dbl> <fct> <dbl>
+## 1 86409 0.241 2.65 Benign 0.881
+## 2 887181 0.750 2.87 Malignant 0.980
+## 3 899667 0.623 2.54 Malignant 1.14
+## 4 907914 0.417 2.31 Malignant 1.26
+## 5 8710441 -1.16 4.04 Benign 1.28
+In Table 5.1 we show in mathematical detail how
+the mutate
step was used to compute the dist_from_new
variable (the
+distance to the new observation) for each of the 5 nearest neighbors in the
+training data.
Table 5.1: Evaluating the distances from the new observation to each of its 5 nearest neighbors.

| Perimeter | Concavity | Distance | Class |
|---|---|---|---|
| 0.24 | 2.65 | \(\sqrt{(0 - 0.24)^2 + (3.5 - 2.65)^2} = 0.88\) | Benign |
| 0.75 | 2.87 | \(\sqrt{(0 - 0.75)^2 + (3.5 - 2.87)^2} = 0.98\) | Malignant |
| 0.62 | 2.54 | \(\sqrt{(0 - 0.62)^2 + (3.5 - 2.54)^2} = 1.14\) | Malignant |
| 0.42 | 2.31 | \(\sqrt{(0 - 0.42)^2 + (3.5 - 2.31)^2} = 1.26\) | Malignant |
| -1.16 | 4.04 | \(\sqrt{(0 - (-1.16))^2 + (3.5 - 4.04)^2} = 1.28\) | Benign |
The result of this computation shows that 3 of the 5 nearest neighbors to our new observation are +malignant; since this is the majority, we classify our new observation as malignant. +These 5 neighbors are circled in Figure 5.7.
++Figure 5.7: Scatter plot of concavity versus perimeter with 5 nearest neighbors circled. +
+Although the above description is directed toward two predictor variables, +exactly the same K-nearest neighbors algorithm applies when you +have a higher number of predictor variables. Each predictor variable may give us new +information to help create our classifier. The only difference is the formula +for the distance between points. Suppose we have \(m\) predictor +variables for two observations \(a\) and \(b\), i.e., +\(a = (a_{1}, a_{2}, \dots, a_{m})\) and +\(b = (b_{1}, b_{2}, \dots, b_{m})\).
+The distance formula becomes
+\[\mathrm{Distance} = \sqrt{(a_{1} -b_{1})^2 + (a_{2} - b_{2})^2 + \dots + (a_{m} - b_{m})^2}.\]
+This formula still corresponds to a straight-line distance, just in a space +with more dimensions. Suppose we want to calculate the distance between a new +observation with a perimeter of 0, concavity of 3.5, and symmetry of 1, and +another observation with a perimeter, concavity, and symmetry of 0.417, 2.31, and +0.837 respectively. We have two observations with three predictor variables: +perimeter, concavity, and symmetry. Previously, when we had two variables, we +added up the squared difference between each of our (two) variables, and then +took the square root. Now we will do the same, except for our three variables. +We calculate the distance as follows
+\[\mathrm{Distance} =\sqrt{(0 - 0.417)^2 + (3.5 - 2.31)^2 + (1 - 0.837)^2} = 1.27.\]
+Let’s calculate the distances between our new observation and each of the +observations in the training set to find the \(K=5\) neighbors when we have these +three predictors.
+new_obs_Perimeter <- 0
+new_obs_Concavity <- 3.5
+new_obs_Symmetry <- 1
+cancer |>
+ select(ID, Perimeter, Concavity, Symmetry, Class) |>
+ mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 +
+ (Concavity - new_obs_Concavity)^2 +
+ (Symmetry - new_obs_Symmetry)^2)) |>
+ slice_min(dist_from_new, n = 5) # take the 5 rows of minimum distance
## # A tibble: 5 × 6
+## ID Perimeter Concavity Symmetry Class dist_from_new
+## <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
+## 1 907914 0.417 2.31 0.837 Malignant 1.27
+## 2 90439701 1.33 2.89 1.10 Malignant 1.47
+## 3 925622 0.470 2.08 1.15 Malignant 1.50
+## 4 859471 -1.37 2.81 1.09 Benign 1.53
+## 5 899667 0.623 2.54 2.06 Malignant 1.56
+Based on \(K=5\) nearest neighbors with these three predictors, we would classify +the new observation as malignant since 4 out of 5 of the nearest neighbors are from the malignant class. +Figure 5.8 shows what the data look like when we visualize them +as a 3-dimensional scatter with lines from the new observation to its five nearest neighbors.
++Figure 5.8: 3D scatter plot of the standardized symmetry, concavity, and perimeter variables. Note that in general we recommend against using 3D visualizations; here we show the data in 3D only to illustrate what higher dimensions and nearest neighbors look like, for learning purposes. +
In order to classify a new observation using a K-nearest neighbors classifier, we have to do the following:

Compute the distance between the new observation and each observation in the training set.
Find the \(K\) training observations nearest to the new observation, i.e., those with the smallest distances.
Classify the new observation based on a majority vote of the classes of those \(K\) nearest neighbors.
K-nearest neighbors with tidymodels
Coding the K-nearest neighbors algorithm in R ourselves can get complicated,
+especially if we want to handle multiple classes, more than two variables,
+or predict the class for multiple new observations. Thankfully, in R,
+the K-nearest neighbors algorithm is
+implemented in the parsnip
R package (Kuhn and Vaughan 2021)
+included in tidymodels
, along with
+many other models
+that you will encounter in this and future chapters of the book. The tidymodels
collection
+provides tools to help make and use models, such as classifiers. Using the packages
+in this collection will help keep our code simple, readable and accurate; the
+less we have to code ourselves, the fewer mistakes we will likely make. We
+start by loading tidymodels
.
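Loading the collection is a single call:

library(tidymodels)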
Let’s walk through how to use tidymodels
to perform K-nearest neighbors classification.
+We will use the cancer
data set from above, with
+perimeter and concavity as predictors and \(K = 5\) neighbors to build our classifier. Then
+we will use the classifier to predict the diagnosis label for a new observation with
+perimeter 0, concavity 3.5, and an unknown diagnosis label. Let’s pick out our two desired
+predictor variables and class label and store them as a new data set named cancer_train
:
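A minimal sketch of this step uses select from tidyverse; the object name cancer_train matches the name used in the surrounding text.

cancer_train <- cancer |>
  select(Class, Perimeter, Concavity)
cancer_train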
## # A tibble: 569 × 3
+## Class Perimeter Concavity
+## <fct> <dbl> <dbl>
+## 1 Malignant 1.27 2.65
+## 2 Malignant 1.68 -0.0238
+## 3 Malignant 1.57 1.36
+## 4 Malignant -0.592 1.91
+## 5 Malignant 1.78 1.37
+## 6 Malignant -0.387 0.866
+## 7 Malignant 1.14 0.300
+## 8 Malignant -0.0728 0.0610
+## 9 Malignant -0.184 1.22
+## 10 Malignant -0.329 1.74
+## # ℹ 559 more rows
+Next, we create a model specification for K-nearest neighbors classification
+by calling the nearest_neighbor
function, specifying that we want to use \(K = 5\) neighbors
+(we will discuss how to choose \(K\) in the next chapter) and that each neighboring point should have the same weight when voting
+(weight_func = "rectangular"
). The weight_func
argument controls
+how neighbors vote when classifying a new observation; by setting it to "rectangular"
,
+each of the \(K\) nearest neighbors gets exactly 1 vote as described above. Other choices,
+which weigh each neighbor’s vote differently, can be found on
+the parsnip
website.
+In the set_engine
argument, we specify which package or system will be used for training
+the model. Here kknn
is the R package we will use for performing K-nearest neighbors classification.
+Finally, we specify that this is a classification problem with the set_mode
function.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
+ set_engine("kknn") |>
+ set_mode("classification")
+knn_spec
## K-Nearest Neighbor Model Specification (classification)
+##
+## Main Arguments:
+## neighbors = 5
+## weight_func = rectangular
+##
+## Computational engine: kknn
+In order to fit the model on the breast cancer data, we need to pass the model specification
+and the data set to the fit
function. We also need to specify what variables to use as predictors
+and what variable to use as the response. Below, the Class ~ Perimeter + Concavity
argument specifies
+that Class
is the response variable (the one we want to predict),
+and both Perimeter
and Concavity
are to be used as the predictors.
We can also use a convenient shorthand syntax using a period, Class ~ .
, to indicate
+that we want to use every variable except Class
as a predictor in the model.
+In this particular setup, since Concavity
and Perimeter
are the only two predictors in the cancer_train
+data frame, Class ~ Perimeter + Concavity
and Class ~ .
are equivalent.
+In general, you can choose individual predictors using the +
symbol, or you can specify to
+use all predictors using the .
symbol.
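A sketch of the fitting step described above, using the shorthand formula; the object name knn_fit matches the name used later in the text.

# fit the K-nearest neighbors model specification to the training data
knn_fit <- knn_spec |>
  fit(Class ~ ., data = cancer_train)
knn_fit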
## parsnip model object
+##
+##
+## Call:
+## kknn::train.kknn(formula = Class ~ ., data = data, ks = min_rows(5, data, 5)
+## , kernel = ~"rectangular")
+##
+## Type of response variable: nominal
+## Minimal misclassification: 0.07557118
+## Best kernel: rectangular
+## Best k: 5
+Here you can see the final trained model summary. It confirms that the computational engine used
+to train the model was kknn::train.kknn
. It also shows the fraction of errors made by
+the K-nearest neighbors model, but we will ignore this for now and discuss it in more detail
+in the next chapter.
+Finally, it shows (somewhat confusingly) that the “best” weight function
+was “rectangular” and “best” setting of \(K\) was 5; but since we specified these earlier,
+R is just repeating those settings to us here. In the next chapter, we will actually
+let R find the value of \(K\) for us.
Finally, we make the prediction on the new observation by calling the predict
function,
+passing both the fit object we just created and the new observation itself. As above,
+when we ran the K-nearest neighbors
+classification algorithm manually, the knn_fit
object classifies the new observation as
+malignant. Note that the predict
function outputs a data frame with a single
+variable named .pred_class
.
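A sketch of that prediction call; new_obs is a hypothetical name for a one-row data frame holding the new observation's standardized perimeter and concavity values.

new_obs <- tibble(Perimeter = 0, Concavity = 3.5)
predict(knn_fit, new_obs)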
## # A tibble: 1 × 1
+## .pred_class
+## <fct>
+## 1 Malignant
+Is this predicted malignant label the actual class for this observation? +Well, we don’t know because we do not have this +observation’s diagnosis— that is what we were trying to predict! The +classifier’s prediction is not necessarily correct, but in the next chapter, we will +learn ways to quantify how accurate we think our predictions are.
+tidymodels
When using K-nearest neighbors classification, the scale of each variable +(i.e., its size and range of values) matters. Since the classifier predicts +classes by identifying observations nearest to it, any variables with +a large scale will have a much larger effect than variables with a small +scale. But just because a variable has a large scale doesn’t mean that it is +more important for making accurate predictions. For example, suppose you have a +data set with two features, salary (in dollars) and years of education, and +you want to predict the corresponding type of job. When we compute the +neighbor distances, a difference of $1000 is huge compared to a difference of +10 years of education. But for our conceptual understanding and answering of +the problem, it’s the opposite; 10 years of education is huge compared to a +difference of $1000 in yearly salary!
+In many other predictive models, the center of each variable (e.g., its mean) +matters as well. For example, if we had a data set with a temperature variable +measured in degrees Kelvin, and the same data set with temperature measured in +degrees Celsius, the two variables would differ by a constant shift of 273 +(even though they contain exactly the same information). Likewise, in our +hypothetical job classification example, we would likely see that the center of +the salary variable is in the tens of thousands, while the center of the years +of education variable is in the single digits. Although this doesn’t affect the +K-nearest neighbors classification algorithm, this large shift can change the +outcome of using many other predictive models.
+To scale and center our data, we need to find
+our variables’ mean (the average, which quantifies the “central” value of a
+set of numbers) and standard deviation (a number quantifying how spread out values are).
+For each observed value of the variable, we subtract the mean (i.e., center the variable)
+and divide by the standard deviation (i.e., scale the variable). When we do this, the data
+is said to be standardized, and all variables in a data set will have a mean of 0
+and a standard deviation of 1. To illustrate the effect that standardization can have on the K-nearest
+neighbors algorithm, we will read in the original, unstandardized Wisconsin breast
+cancer data set; we have been using a standardized version of the data set up
+until now. As before, we will convert the Class
variable to the factor type
+and rename the values to “Malignant” and “Benign.”
+To keep things simple, we will just use the Area
, Smoothness
, and Class
+variables:
unscaled_cancer <- read_csv("data/wdbc_unscaled.csv") |>
+ mutate(Class = as_factor(Class)) |>
+ mutate(Class = fct_recode(Class, "Benign" = "B", "Malignant" = "M")) |>
+ select(Class, Area, Smoothness)
+unscaled_cancer
## # A tibble: 569 × 3
+## Class Area Smoothness
+## <fct> <dbl> <dbl>
+## 1 Malignant 1001 0.118
+## 2 Malignant 1326 0.0847
+## 3 Malignant 1203 0.110
+## 4 Malignant 386. 0.142
+## 5 Malignant 1297 0.100
+## 6 Malignant 477. 0.128
+## 7 Malignant 1040 0.0946
+## 8 Malignant 578. 0.119
+## 9 Malignant 520. 0.127
+## 10 Malignant 476. 0.119
+## # ℹ 559 more rows
+Looking at the unscaled and uncentered data above, you can see that the differences
+between the values for area measurements are much larger than those for
+smoothness. Will this affect
+predictions? In order to find out, we will create a scatter plot of these two
+predictors (colored by diagnosis) for both the unstandardized data we just
+loaded, and the standardized version of that same data. But first, we need to
+standardize the unscaled_cancer
data set with tidymodels
.
In the tidymodels
framework, all data preprocessing happens
+using a recipe
from the recipes
R package (Kuhn and Wickham 2021).
+Here we will initialize a recipe for
+the unscaled_cancer
data above, specifying
+that the Class
variable is the response, and all other variables are predictors:
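A minimal sketch of creating that recipe; the object name uc_recipe matches the name used in the next code block.

uc_recipe <- recipe(Class ~ ., data = unscaled_cancer)
uc_recipe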
##
+## ── Recipe ──────────
+##
+## ── Inputs
+## Number of variables by role
+## outcome: 1
+## predictor: 2
+So far, there is not much in the recipe; just a statement about the number of response variables
+and predictors. Let’s add
+scaling (step_scale
) and
+centering (step_center
) steps for
+all of the predictors so that they each have a mean of 0 and standard deviation of 1.
Note that tidymodels
actually provides step_normalize
, which does both centering and scaling in
+a single recipe step; in this book we will keep step_scale
and step_center
separate
+to emphasize conceptually that there are two steps happening.
+The prep
function finalizes the recipe by using the data (here, unscaled_cancer
)
+to compute anything necessary to run the recipe (in this case, the column means and standard
+deviations):
uc_recipe <- uc_recipe |>
+ step_scale(all_predictors()) |>
+ step_center(all_predictors()) |>
+ prep()
+uc_recipe
##
+## ── Recipe ──────────
+##
+## ── Inputs
+## Number of variables by role
+## outcome: 1
+## predictor: 2
+##
+## ── Training information
+## Training data contained 569 data points and no incomplete rows.
+##
+## ── Operations
+## • Scaling for: Area, Smoothness | Trained
+## • Centering for: Area, Smoothness | Trained
+You can now see that the recipe includes a scaling and centering step for all predictor variables.
+Note that when you add a step to a recipe, you must specify what columns to apply the step to.
+Here we used the all_predictors()
function to specify that each step should be applied to
+all predictor variables. However, there are a number of different arguments one could use here,
+as well as naming particular columns with the same syntax as the select
function.
For example:

all_nominal() and all_numeric(): specify all categorical or all numeric variables
all_predictors() and all_outcomes(): specify all predictor or all response variables
Area, Smoothness: specify both the Area and Smoothness variables
-Class: specify everything except the Class variable

You can find a full set of all the steps and variable selection functions on the recipes reference page.
At this point, we have calculated the required statistics based on the data input into the
+recipe, but the data are not yet scaled and centered. To actually scale and center
+the data, we need to apply the bake
function to the unscaled data.
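A sketch of that step; the object name scaled_cancer is illustrative (the recipe uc_recipe has already been prepped above).

scaled_cancer <- bake(uc_recipe, unscaled_cancer)
scaled_cancer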
## # A tibble: 569 × 3
+## Area Smoothness Class
+## <dbl> <dbl> <fct>
+## 1 0.984 1.57 Malignant
+## 2 1.91 -0.826 Malignant
+## 3 1.56 0.941 Malignant
+## 4 -0.764 3.28 Malignant
+## 5 1.82 0.280 Malignant
+## 6 -0.505 2.24 Malignant
+## 7 1.09 -0.123 Malignant
+## 8 -0.219 1.60 Malignant
+## 9 -0.384 2.20 Malignant
+## 10 -0.509 1.58 Malignant
+## # ℹ 559 more rows
+It may seem redundant that we had to both bake
and prep
to scale and center the data.
+However, we do this in two steps so we can specify a different data set in the bake
step if we want.
+For example, we may want to specify new data that were not part of the training set.
You may wonder why we are doing so much work just to center and
+scale our variables. Can’t we just manually scale and center the Area
and
+Smoothness
variables ourselves before building our K-nearest neighbors model? Well,
+technically yes; but doing so is error-prone. In particular, we might
+accidentally forget to apply the same centering / scaling when making
+predictions, or accidentally apply a different centering / scaling than what
+we used while training. Proper use of a recipe
helps keep our code simple,
+readable, and error-free. Furthermore, note that using prep
and bake
is
+required only when you want to inspect the result of the preprocessing steps
+yourself. You will see further on in Section
+5.8 that tidymodels
provides tools to
+automatically apply prep
and bake
as necessary without additional coding effort.
Figure 5.9 shows the two scatter plots side-by-side—one for unscaled_cancer
and one for
+scaled_cancer
. Each has the same new observation annotated with its \(K=3\) nearest neighbors.
+In the original unstandardized data plot, you can see some odd choices
+for the three nearest neighbors. In particular, the “neighbors” are visually
+well within the cloud of benign observations, and the neighbors are all nearly
+vertically aligned with the new observation (which is why it looks like there
+is only one black line on this plot). Figure 5.10
+shows a close-up of that region on the unstandardized plot. Here the computation of nearest
+neighbors is dominated by the much larger-scale area variable. The plot for standardized data
+on the right in Figure 5.9 shows a much more intuitively reasonable
+selection of nearest neighbors. Thus, standardizing the data can change things
+in an important way when we are using predictive algorithms.
+Standardizing your data should be a part of the preprocessing you do
+before predictive modeling and you should always think carefully about your problem domain and
+whether you need to standardize your data.
+Figure 5.9: Comparison of K = 3 nearest neighbors with unstandardized and standardized data. +
++Figure 5.10: Close-up of three nearest neighbors for unstandardized data. +
+Another potential issue in a data set for a classifier is class imbalance, +i.e., when one label is much more common than another. Since classifiers like +the K-nearest neighbors algorithm use the labels of nearby points to predict +the label of a new point, if there are many more data points with one label +overall, the algorithm is more likely to pick that label in general (even if +the “pattern” of data suggests otherwise). Class imbalance is actually quite a +common and important problem: from rare disease diagnosis to malicious email +detection, there are many cases in which the “important” class to identify +(presence of disease, malicious email) is much rarer than the “unimportant” +class (no disease, normal email).
+To better illustrate the problem, let’s revisit the scaled breast cancer data,
+cancer
; except now we will remove many of the observations of malignant tumors, simulating
+what the data would look like if the cancer was rare. We will do this by
+picking only 3 observations from the malignant group, and keeping all
+of the benign observations.
+We choose these 3 observations using the slice_head
+function, which takes two arguments: a data frame-like object,
+and the number of rows to select from the top (n
).
+We will use the bind_rows
function to glue the two resulting filtered
+data frames back together, and name the result rare_cancer
.
+The new imbalanced data is shown in Figure 5.11.
rare_cancer <- bind_rows(
+ filter(cancer, Class == "Benign"),
+ cancer |> filter(Class == "Malignant") |> slice_head(n = 3)
+ ) |>
+ select(Class, Perimeter, Concavity)
+
+rare_plot <- rare_cancer |>
+ ggplot(aes(x = Perimeter, y = Concavity, color = Class)) +
+ geom_point(alpha = 0.5) +
+ labs(x = "Perimeter (standardized)",
+ y = "Concavity (standardized)",
+ color = "Diagnosis") +
+ scale_color_manual(values = c("darkorange", "steelblue")) +
+ theme(text = element_text(size = 12))
+
+rare_plot
+Figure 5.11: Imbalanced data. +
+Suppose we now decided to use \(K = 7\) in K-nearest neighbors classification. +With only 3 observations of malignant tumors, the classifier +will always predict that the tumor is benign, no matter what its concavity and perimeter +are! This is because in a majority vote of 7 observations, at most 3 will be +malignant (we only have 3 total malignant observations), so at least 4 must be +benign, and the benign vote will always win. For example, Figure 5.12 +shows what happens for a new tumor observation that is quite close to three observations +in the training data that were tagged as malignant.
++Figure 5.12: Imbalanced data with 7 nearest neighbors to a new observation highlighted. +
+Figure 5.13 shows what happens if we set the background color of +each area of the plot to the prediction the K-nearest neighbors +classifier would make for a new observation at that location. We can see that the decision is +always “benign,” corresponding to the blue color.
++Figure 5.13: Imbalanced data with background color indicating the decision of the classifier and the points represent the labeled data. +
+Despite the simplicity of the problem, solving it in a statistically sound manner is actually
+fairly nuanced, and a careful treatment would require a lot more detail and mathematics than we will cover in this textbook.
+For the present purposes, it will suffice to rebalance the data by oversampling the rare class.
+In other words, we will replicate rare observations multiple times in our data set to give them more
+voting power in the K-nearest neighbors algorithm. In order to do this, we will add an oversampling
+step to the earlier uc_recipe
recipe with the step_upsample
function from the themis
R package.
+We show below how to do this, and also
+use the group_by
and summarize
functions to see that our classes are now balanced:
library(themis)
+
+ups_recipe <- recipe(Class ~ ., data = rare_cancer) |>
+ step_upsample(Class, over_ratio = 1, skip = FALSE) |>
+ prep()
+
+ups_recipe
##
+## ── Recipe ──────────
+##
+## ── Inputs
+## Number of variables by role
+## outcome: 1
+## predictor: 2
+##
+## ── Training information
+## Training data contained 360 data points and no incomplete rows.
+##
+## ── Operations
+## • Up-sampling based on: Class | Trained
+upsampled_cancer <- bake(ups_recipe, rare_cancer)
+
+upsampled_cancer |>
+ group_by(Class) |>
+ summarize(n = n())
## # A tibble: 2 × 2
+## Class n
+## <fct> <int>
+## 1 Malignant 357
+## 2 Benign 357
+Now suppose we train our K-nearest neighbors classifier with \(K=7\) on this balanced data. +Figure 5.14 shows what happens now when we set the background color +of each area of our scatter plot to the decision the K-nearest neighbors +classifier would make. We can see that the decision is more reasonable; when the points are close +to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are +closer to the benign tumor observations.
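A sketch of training that classifier on the rebalanced data might look like the following; the object name upsampled_fit is illustrative.

upsampled_fit <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |>
  set_engine("kknn") |>
  set_mode("classification") |>
  fit(Class ~ ., data = upsampled_cancer)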
++Figure 5.14: Upsampled data with background color indicating the decision of the classifier. +
One of the most common issues in real data sets in the wild is missing data, i.e., observations where the values of some of the variables were not recorded. Unfortunately, as common as it is, handling missing data properly is very challenging and generally relies on expert knowledge about the data, setting, and how the data were collected. One typical challenge with missing data is that missing entries can be informative: the very fact that an entry is missing is related to the values of other variables. For example, survey participants from a marginalized group of people may be less likely to respond to certain kinds of questions if they fear that answering honestly will come with negative consequences. In that case, if we were to simply throw away data with missing entries, we would bias the conclusions of the survey by inadvertently removing many members of that group of respondents. So ignoring this issue in real problems can easily lead to misleading analyses, with detrimental impacts. In this book, we will cover only those techniques for dealing with missing entries in situations where missing entries are just “randomly missing”, i.e., where the fact that certain entries are missing isn’t related to anything else about the observation.
+Let’s load and examine a modified subset of the tumor image data +that has a few missing entries:
+missing_cancer <- read_csv("data/wdbc_missing.csv") |>
+ select(Class, Radius, Texture, Perimeter) |>
+ mutate(Class = as_factor(Class)) |>
+ mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B"))
+missing_cancer
## # A tibble: 7 × 4
+## Class Radius Texture Perimeter
+## <fct> <dbl> <dbl> <dbl>
+## 1 Malignant NA NA 1.27
+## 2 Malignant 1.83 -0.353 1.68
+## 3 Malignant 1.58 NA 1.57
+## 4 Malignant -0.768 0.254 -0.592
+## 5 Malignant 1.75 -1.15 1.78
+## 6 Malignant -0.476 -0.835 -0.387
+## 7 Malignant 1.17 0.161 1.14
+Recall that K-nearest neighbors classification makes predictions by computing
+the straight-line distance to nearby training observations, and hence requires
+access to the values of all variables for all observations in the training
+data. So how can we perform K-nearest neighbors classification in the presence
+of missing data? Well, since there are not too many observations with missing
+entries, one option is to simply remove those observations prior to building
+the K-nearest neighbors classifier. We can accomplish this by using the
+drop_na
function from tidyverse
prior to working with the data.
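A sketch of that cleaning step; the object name no_missing_cancer is illustrative.

no_missing_cancer <- missing_cancer |>
  drop_na()
no_missing_cancer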
## # A tibble: 5 × 4
+## Class Radius Texture Perimeter
+## <fct> <dbl> <dbl> <dbl>
+## 1 Malignant 1.83 -0.353 1.68
+## 2 Malignant -0.768 0.254 -0.592
+## 3 Malignant 1.75 -1.15 1.78
+## 4 Malignant -0.476 -0.835 -0.387
+## 5 Malignant 1.17 0.161 1.14
+However, this strategy will not work when many of the rows have missing
+entries, as we may end up throwing away too much data. In this case, another
+possible approach is to impute the missing entries, i.e., fill in synthetic
+values based on the other observations in the data set. One reasonable choice
+is to perform mean imputation, where missing entries are filled in using the
+mean of the present entries in each variable. To perform mean imputation, we
+add the step_impute_mean
+step to the tidymodels
preprocessing recipe.
impute_missing_recipe <- recipe(Class ~ ., data = missing_cancer) |>
+ step_impute_mean(all_predictors()) |>
+ prep()
+impute_missing_recipe
##
+## ── Recipe ──────────
+##
+## ── Inputs
+## Number of variables by role
+## outcome: 1
+## predictor: 3
+##
+## ── Training information
+## Training data contained 7 data points and 2 incomplete rows.
+##
+## ── Operations
+## • Mean imputation for: Radius, Texture, Perimeter | Trained
+To visualize what mean imputation does, let’s just apply the recipe directly to the missing_cancer
+data frame using the bake
function. The imputation step fills in the missing
+entries with the mean values of their corresponding variables.
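That application of the recipe is a single bake call; the object name imputed_cancer is illustrative.

imputed_cancer <- bake(impute_missing_recipe, missing_cancer)
imputed_cancer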
## # A tibble: 7 × 4
+## Radius Texture Perimeter Class
+## <dbl> <dbl> <dbl> <fct>
+## 1 0.847 -0.385 1.27 Malignant
+## 2 1.83 -0.353 1.68 Malignant
+## 3 1.58 -0.385 1.57 Malignant
+## 4 -0.768 0.254 -0.592 Malignant
+## 5 1.75 -1.15 1.78 Malignant
+## 6 -0.476 -0.835 -0.387 Malignant
+## 7 1.17 0.161 1.14 Malignant
+Many other options for missing data imputation can be found in
+the recipes
documentation. However
+you decide to handle missing data in your data analysis, it is always crucial
+to think critically about the setting, how the data were collected, and the
+question you are answering.
Putting it together in a workflow
The tidymodels
package collection also provides the workflow
, a way to
+chain together
+multiple data analysis steps without a lot of otherwise necessary code for
+intermediate steps. To illustrate the whole pipeline, let’s start from scratch
+with the wdbc_unscaled.csv
data. First we will load the data, create a
+model, and specify a recipe for how the data should be preprocessed:
# load the unscaled cancer data
+# and make sure the response variable, Class, is a factor
+unscaled_cancer <- read_csv("data/wdbc_unscaled.csv") |>
+ mutate(Class = as_factor(Class)) |>
+ mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B"))
+
+# create the K-NN model
+knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |>
+ set_engine("kknn") |>
+ set_mode("classification")
+
+# create the centering / scaling recipe
+uc_recipe <- recipe(Class ~ Area + Smoothness, data = unscaled_cancer) |>
+ step_scale(all_predictors()) |>
+ step_center(all_predictors())
Note that each of these steps is exactly the same as earlier, except for one major difference:
+we did not use the select
function to extract the relevant variables from the data frame,
+and instead simply specified the relevant variables to use via the
+formula Class ~ Area + Smoothness
(instead of Class ~ .
) in the recipe.
+You will also notice that we did not call prep()
on the recipe; this is unnecessary when it is
+placed in a workflow.
We will now place these steps in a workflow
using the add_recipe
and add_model
functions,
+and finally we will use the fit
function to run the whole workflow on the unscaled_cancer
data.
+Note another difference from earlier here: we do not include a formula in the fit
function. This
+is again because we included the formula in the recipe, so there is no need to respecify it:
knn_fit <- workflow() |>
+ add_recipe(uc_recipe) |>
+ add_model(knn_spec) |>
+ fit(data = unscaled_cancer)
+
+knn_fit
## ══ Workflow [trained] ══════════
+## Preprocessor: Recipe
+## Model: nearest_neighbor()
+##
+## ── Preprocessor ──────────
+## 2 Recipe Steps
+##
+## • step_scale()
+## • step_center()
+##
+## ── Model ──────────
+##
+## Call:
+## kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(7, data, 5),
+## kernel = ~"rectangular")
+##
+## Type of response variable: nominal
+## Minimal misclassification: 0.112478
+## Best kernel: rectangular
+## Best k: 7
+As before, the fit object lists the function that trains the model as well as the “best” settings
+for the number of neighbors and weight function (for now, these are just the values we chose
+manually when we created knn_spec
above). But now the fit object also includes information about
+the overall workflow, including the centering and scaling preprocessing steps.
+In other words, when we use the predict
function with the knn_fit
object to make a prediction for a new
+observation, it will first apply the same recipe steps to the new observation.
+As an example, we will predict the class label of two new observations:
+one with Area = 500
and Smoothness = 0.075
, and one with Area = 1500
and Smoothness = 0.1
.
new_observation <- tibble(Area = c(500, 1500), Smoothness = c(0.075, 0.1))
+prediction <- predict(knn_fit, new_observation)
+
+prediction
## # A tibble: 2 × 1
+## .pred_class
+## <fct>
+## 1 Benign
+## 2 Malignant
+The classifier predicts that the first observation is benign, while the second is
+malignant. Figure 5.15 visualizes the predictions that this
+trained K-nearest neighbors model will make on a large range of new observations.
+Although you have seen colored prediction map visualizations like this a few times now,
+we have not included the code to generate them, as it is a little bit complicated.
+For the interested reader who wants a learning challenge, we now include it below.
+The basic idea is to create a grid of synthetic new observations using the expand.grid
function,
+predict the label of each, and visualize the predictions with a colored scatter having a very high transparency
+(low alpha
value) and large point radius. See if you can figure out what each line is doing!
++Note: Understanding this code is not required for the remainder of the +textbook. It is included for those readers who would like to use similar +visualizations in their own data analyses.
+
# create the grid of area/smoothness vals, and arrange in a data frame
+are_grid <- seq(min(unscaled_cancer$Area),
+ max(unscaled_cancer$Area),
+ length.out = 100)
+smo_grid <- seq(min(unscaled_cancer$Smoothness),
+ max(unscaled_cancer$Smoothness),
+ length.out = 100)
+asgrid <- as_tibble(expand.grid(Area = are_grid,
+ Smoothness = smo_grid))
+
+# use the fit workflow to make predictions at the grid points
+knnPredGrid <- predict(knn_fit, asgrid)
+
+# bind the predictions as a new column with the grid points
+prediction_table <- bind_cols(knnPredGrid, asgrid) |>
+ rename(Class = .pred_class)
+
+# plot:
+# 1. the colored scatter of the original data
+# 2. the faded colored scatter for the grid points
+wkflw_plot <-
+ ggplot() +
+ geom_point(data = unscaled_cancer,
+ mapping = aes(x = Area,
+ y = Smoothness,
+ color = Class),
+ alpha = 0.75) +
+ geom_point(data = prediction_table,
+ mapping = aes(x = Area,
+ y = Smoothness,
+ color = Class),
+ alpha = 0.02,
+ size = 5) +
+ labs(color = "Diagnosis",
+ x = "Area",
+ y = "Smoothness") +
+ scale_color_manual(values = c("darkorange", "steelblue")) +
+ theme(text = element_text(size = 12))
+
+wkflw_plot
+Figure 5.15: Scatter plot of smoothness versus area where background color indicates the decision of the classifier. +
+Practice exercises for the material covered in this chapter +can be found in the accompanying +worksheets repository +in the “Classification I: training and predicting” row. +You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. +You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” +If you instead decide to download the worksheet and run it on your own machine, +make sure to follow the instructions for computer setup +found in Chapter 13. This will ensure that the automated feedback +and guidance that the worksheets provide will function as intended.
+ +This chapter continues the introduction to predictive modeling through +classification. While the previous chapter covered training and data +preprocessing, this chapter focuses on how to evaluate the performance of +a classifier, as well as how to improve the classifier (where possible) +to maximize its accuracy.
+By the end of the chapter, readers will be able to do the following:
+set.seed
function.Sometimes our classifier might make the wrong prediction. A classifier does not +need to be right 100% of the time to be useful, though we don’t want the +classifier to make too many wrong predictions. How do we measure how “good” our +classifier is? Let’s revisit the +breast cancer images data (Street, Wolberg, and Mangasarian 1993) +and think about how our classifier will be used in practice. A biopsy will be +performed on a new patient’s tumor, the resulting image will be analyzed, +and the classifier will be asked to decide whether the tumor is benign or +malignant. The key word here is new: our classifier is “good” if it provides +accurate predictions on data not seen during training, as this implies that +it has actually learned about the relationship between the predictor variables and response variable, +as opposed to simply memorizing the labels of individual training data examples. +But then, how can we evaluate our classifier without visiting the hospital to collect more +tumor images?
+The trick is to split the data into a training set and test set (Figure 6.1) +and use only the training set when building the classifier. +Then, to evaluate the performance of the classifier, we first set aside the labels from the test set, +and then use the classifier to predict the labels in the test set. If our predictions match the actual +labels for the observations in the test set, then we have some +confidence that our classifier might also accurately predict the class +labels for new observations without known class labels.
+++Note: If there were a golden rule of machine learning, it might be this: +you cannot use the test data to build the model! If you do, the model gets to +“see” the test data in advance, making it look more accurate than it really +is. Imagine how bad it would be to overestimate your classifier’s accuracy +when predicting whether a patient’s tumor is malignant or benign!
+
+Figure 6.1: Splitting the data into training and testing sets. +
+How exactly can we assess how well our predictions match the actual labels for +the observations in the test set? One way we can do this is to calculate the +prediction accuracy. This is the fraction of examples for which the +classifier made the correct prediction. To calculate this, we divide the number +of correct predictions by the number of predictions made. +The process for assessing if our predictions match the actual labels in the +test set is illustrated in Figure 6.2.
+\[\mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}}\]
++Figure 6.2: Process for splitting the data and finding the prediction accuracy. +
+Accuracy is a convenient, general-purpose way to summarize the performance of a classifier with +a single number. But prediction accuracy by itself does not tell the whole +story. In particular, accuracy alone only tells us how often the classifier +makes mistakes in general, but does not tell us anything about the kinds of +mistakes the classifier makes. A more comprehensive view of performance can be +obtained by additionally examining the confusion matrix. The confusion +matrix shows how many test set labels of each type are predicted correctly and +incorrectly, which gives us more detail about the kinds of mistakes the +classifier tends to make. Table 6.1 shows an example +of what a confusion matrix might look like for the tumor image data with +a test set of 65 observations.
+|                     | Actually Malignant | Actually Benign |
+|---------------------|--------------------|-----------------|
+| Predicted Malignant | 1                  | 4               |
+| Predicted Benign    | 3                  | 57              |
In the example in Table 6.1, we see that there was +1 malignant observation that was correctly classified as malignant (top left corner), +and 57 benign observations that were correctly classified as benign (bottom right corner). +However, we can also see that the classifier made some mistakes: +it classified 3 malignant observations as benign, and 4 benign observations as +malignant. The accuracy of this classifier is roughly +89%, given by the formula
+\[\mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}} = \frac{1+57}{1+57+4+3} = 0.892.\]
+But we can also see that the classifier only identified 1 out of 4 total malignant +tumors; in other words, it misclassified 75% of the malignant cases present in the +data set! In this example, misclassifying a malignant tumor is a potentially +disastrous error, since it may lead to a patient who requires treatment not receiving it. +Since we are particularly interested in identifying malignant cases, this +classifier would likely be unacceptable even with an accuracy of 89%.
+Focusing more on one label than the other is common in classification problems.
+In such cases, we typically refer to the label we are more
+interested in identifying as the positive label, and the other as the
+negative label. In the tumor example, we would refer to malignant
+observations as positive, and benign observations as negative. We can then
+use the following terms to talk about the four kinds of prediction that the
+classifier can make, corresponding to the four entries in the confusion matrix:
+true positive: a malignant observation that the classifier correctly predicted as malignant.
+false positive: a benign observation that the classifier incorrectly predicted as malignant.
+true negative: a benign observation that the classifier correctly predicted as benign.
+false negative: a malignant observation that the classifier incorrectly predicted as benign.
+A perfect classifier would have zero false negatives and false positives (and +therefore, 100% accuracy). However, classifiers in practice will almost always +make some errors. So you should think about which kinds of error are most +important in your application, and use the confusion matrix to quantify and +report them. Two commonly used metrics that we can compute using the confusion +matrix are the precision and recall of the classifier. These are often +reported together with accuracy. Precision quantifies how many of the +positive predictions the classifier made were actually positive. Intuitively, +we would like a classifier to have a high precision: for a classifier with +high precision, if the classifier reports that a new observation is positive, +we can trust that the new observation is indeed positive. We can compute the +precision of a classifier using the entries in the confusion matrix, with the +formula
+\[\mathrm{precision} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; predictions}}.\]
+Recall quantifies how many of the positive observations in the test set were +identified as positive. Intuitively, we would like a classifier to have a +high recall: for a classifier with high recall, if there is a positive +observation in the test data, we can trust that the classifier will find it. +We can also compute the recall of the classifier using the entries in the +confusion matrix, with the formula
+\[\mathrm{recall} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; test \; set \; observations}}.\]
+In the example presented in Table 6.1, we have that the precision and recall are
+\[\mathrm{precision} = \frac{1}{1+4} = 0.20, \quad \mathrm{recall} = \frac{1}{1+3} = 0.25.\]
+So even with an accuracy of 89%, the precision and recall of the classifier +were both relatively low. For this data analysis context, recall is +particularly important: if someone has a malignant tumor, we certainly want to +identify it. A recall of just 25% would likely be unacceptable!
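+To make the arithmetic concrete, here is a small sketch that recomputes these quantities directly from the four entries of the example confusion matrix in Table 6.1 (the variable names are ours, chosen only for illustration):
+
+# entries of the example confusion matrix in Table 6.1
+true_positives <- 1    # malignant, predicted malignant
+false_positives <- 4   # benign, predicted malignant
+false_negatives <- 3   # malignant, predicted benign
+true_negatives <- 57   # benign, predicted benign
+
+accuracy <- (true_positives + true_negatives) /
+  (true_positives + true_negatives + false_positives + false_negatives)
+precision <- true_positives / (true_positives + false_positives)
+recall <- true_positives / (true_positives + false_negatives)
+
+c(accuracy = accuracy, precision = precision, recall = recall)
+# accuracy = 0.892, precision = 0.20, recall = 0.25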
+
++Note: It is difficult to achieve both high precision and high recall at
+the same time; models with high precision tend to have low recall and vice
+versa. As an example, we can easily make a classifier that has perfect
+recall: just always guess positive! This classifier will of course find
+every positive observation in the test set, but it will make lots of false
+positive predictions along the way and have low precision. Similarly, we can
+easily make a classifier that has perfect precision: never guess
+positive! This classifier will never incorrectly identify an observation as
+positive, but it will make a lot of false negative predictions along the way.
+In fact, this classifier will have 0% recall! Of course, most real
+classifiers fall somewhere in between these two extremes. But these examples
+serve to show that in settings where one of the classes is of interest (i.e.,
+there is a positive label), there is a trade-off between precision and recall that one has to
+make when designing a classifier.
+
Beginning in this chapter, our data analyses will often involve the use +of randomness. We use randomness any time we need to make a decision in our +analysis that needs to be fair, unbiased, and not influenced by human input. +For example, in this chapter, we need to split +a data set into a training set and test set to evaluate our classifier. We +certainly do not want to choose how to split +the data ourselves by hand, as we want to avoid accidentally influencing the result +of the evaluation. So instead, we let R randomly split the data. +In future chapters we will use randomness +in many other ways, e.g., to help us select a small subset of data from a larger data set, +to pick groupings of data, and more.
+However, the use of randomness runs counter to one of the main
+tenets of good data analysis practice: reproducibility. Recall that a reproducible
+analysis produces the same result each time it is run; if we include randomness
+in the analysis, would we not get a different result each time?
+The trick is that in R—and other programming languages—randomness
+is not actually random! Instead, R uses a random number generator that
+produces a sequence of numbers that
+are completely determined by a
+seed value. Once you set the seed value
+using the set.seed
function, everything after that point may look random,
+but is actually totally reproducible. As long as you pick the same seed
+value, you get the same result!
Let’s use an example to investigate how seeds work in R. Say we want
+to randomly pick 10 numbers from 0 to 9 in R using the sample
function,
+but we want it to be reproducible. Before using the sample function,
+we call set.seed
, and pass it any integer as an argument.
+Here, we pass in the number 1
.
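+In code, the call being described looks something like the following (the object name random_numbers1 and the replace = TRUE argument are inferred from the surrounding text and the output below):
+
+set.seed(1)
+random_numbers1 <- sample(0:9, 10, replace = TRUE)
+random_numbers1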
## [1] 8 3 6 0 1 6 1 2 0 4
+You can see that random_numbers1
is a list of 10 numbers
+from 0 to 9 that, from all appearances, looks random. If
+we run the sample
function again, we will
+get a fresh batch of 10 numbers that also look random.
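+A second call of the same form (again sampling with replacement) produces the next batch:
+
+random_numbers2 <- sample(0:9, 10, replace = TRUE)
+random_numbers2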
## [1] 4 9 5 9 6 8 4 4 8 8
+If we want to force R to produce the same sequences of random numbers,
+we can simply call the set.seed
function again with the same argument
+value.
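+Concretely, resetting the seed and re-running the same two sampling calls (using the _again object names referenced below) looks something like this:
+
+set.seed(1)
+random_numbers1_again <- sample(0:9, 10, replace = TRUE)
+random_numbers1_again
+random_numbers2_again <- sample(0:9, 10, replace = TRUE)
+random_numbers2_again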
## [1] 8 3 6 0 1 6 1 2 0 4
+
+## [1] 4 9 5 9 6 8 4 4 8 8
+Notice that after setting the seed, we get the same two sequences of numbers in the same order. random_numbers1
and random_numbers1_again
produce the same sequence of numbers, and the same can be said about random_numbers2
and random_numbers2_again
. And if we choose
+a different value for the seed—say, 4235—we
+obtain a different sequence of random numbers.
set.seed(4235)
+random_numbers1_different <- sample(0:9, 10, replace = TRUE)
+random_numbers1_different
## [1] 8 3 1 4 6 8 8 4 1 7
+
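+The second sequence under this new seed (stored, we assume, in an analogous random_numbers2_different object) is generated the same way:
+
+random_numbers2_different <- sample(0:9, 10, replace = TRUE)
+random_numbers2_different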
+## [1] 3 7 8 2 8 8 6 3 3 8
+In other words, even though the sequences of numbers that R is generating look +random, they are totally determined when we set a seed value!
+So what does this mean for data analysis? Well, sample
is certainly
+not the only function that uses randomness in R. Many of the functions
+that we use in tidymodels
, tidyverse
, and beyond use randomness—some of them
+without even telling you about it. So at the beginning of every data analysis you
+do, right after loading packages, you should call the set.seed
function and
+pass it an integer that you pick.
+Also note that when R starts up, it creates its own seed to use. So if you do not
+explicitly call the set.seed
function in your code, your results will
+likely not be reproducible.
+And finally, be careful to set the seed only once at the beginning of a data
+analysis. Each time you set the seed, you are inserting your own human input,
+thereby influencing the analysis. If you use set.seed
many times
+throughout your analysis, the randomness that R uses will not look
+as random as it should.
In summary: if you want your analysis to be reproducible, i.e., produce the same result each time you
+run it, make sure to use set.seed
exactly once at the beginning of the analysis.
+Different argument values in set.seed
lead to different patterns of randomness, but as long as
+you pick the same argument value your result will be the same.
+In the remainder of the textbook, we will set the seed once at the beginning of each chapter.
tidymodels
Back to evaluating classifiers now!
+In R, we can use the tidymodels
package not only to perform K-nearest neighbors
+classification, but also to assess how well our classification worked.
+Let’s work through an example of how to use tools from tidymodels
to evaluate a classifier
+using the breast cancer data set from the previous chapter.
+We begin the analysis by loading the packages we require,
+reading in the breast cancer data,
+and then making a quick scatter plot visualization of
+tumor cell concavity versus smoothness colored by diagnosis in Figure 6.3.
+You will also notice that we set the random seed here at the beginning of the analysis
+using the set.seed
function, as described in Section 6.4.
# load packages
+library(tidyverse)
+library(tidymodels)
+
+# set the seed
+set.seed(1)
+
+# load data
+cancer <- read_csv("data/wdbc_unscaled.csv") |>
+ # convert the character Class variable to the factor datatype
+ mutate(Class = as_factor(Class)) |>
+ # rename the factor values to be more readable
+ mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B"))
+
+# create scatter plot of tumor cell concavity versus smoothness,
+# labeling the points by diagnosis class
+perim_concav <- cancer |>
+ ggplot(aes(x = Smoothness, y = Concavity, color = Class)) +
+ geom_point(alpha = 0.5) +
+ labs(color = "Diagnosis") +
+ scale_color_manual(values = c("darkorange", "steelblue")) +
+ theme(text = element_text(size = 12))
+
+perim_concav
+Figure 6.3: Scatter plot of tumor cell concavity versus smoothness colored by diagnosis label. +
+Once we have decided on a predictive question to answer and done some +preliminary exploration, the very next thing to do is to split the data into +the training and test sets. Typically, the training set is between 50% and 95% of +the data, while the test set is the remaining 5% to 50%; the intuition is that +you want to trade off between training an accurate model (by using a larger +training data set) and getting an accurate evaluation of its performance (by +using a larger test data set). Here, we will use 75% of the data for training, +and 25% for testing.
+The initial_split
function from tidymodels
handles the procedure of splitting
+the data for us. It also applies two very important steps when splitting to ensure
+that the accuracy estimates from the test data are reasonable. First, it
+shuffles the data before splitting, which ensures that any ordering present
+in the data does not influence the data that ends up in the training and testing sets.
+Second, it stratifies the data by the class label, to ensure that roughly
+the same proportion of each class ends up in both the training and testing sets. For example,
+in our data set, roughly 63% of the
+observations are from the benign class, and 37% are from the malignant class,
+so initial_split
ensures that roughly 63% of the training data are benign,
+37% of the training data are malignant,
+and the same proportions exist in the testing data.
Let’s use the initial_split
function to create the training and testing sets.
+We will specify that prop = 0.75
so that 75% of our original data set ends up
+in the training set. We will also set the strata
argument to the categorical label variable
+(here, Class
) to ensure that the training and testing subsets contain the
+right proportions of each category of observation.
+The training
and testing
functions then extract the training and testing
+data sets into two separate data frames.
+Note that the initial_split
function uses randomness, but since we set the
+seed earlier in the chapter, the split will be reproducible.
cancer_split <- initial_split(cancer, prop = 0.75, strata = Class)
+cancer_train <- training(cancer_split)
+cancer_test <- testing(cancer_split)
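+We can take a quick look at each of the resulting data frames, for example with glimpse (as referenced in the discussion below):
+
+glimpse(cancer_train)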
## Rows: 426
+## Columns: 12
+## $ ID <dbl> 8510426, 8510653, 8510824, 854941, 85713702, 857155,…
+## $ Class <fct> Benign, Benign, Benign, Benign, Benign, Benign, Beni…
+## $ Radius <dbl> 13.540, 13.080, 9.504, 13.030, 8.196, 12.050, 13.490…
+## $ Texture <dbl> 14.36, 15.71, 12.44, 18.42, 16.84, 14.63, 22.30, 21.…
+## $ Perimeter <dbl> 87.46, 85.63, 60.34, 82.61, 51.71, 78.04, 86.91, 74.…
+## $ Area <dbl> 566.3, 520.0, 273.9, 523.8, 201.9, 449.3, 561.0, 427…
+## $ Smoothness <dbl> 0.09779, 0.10750, 0.10240, 0.08983, 0.08600, 0.10310…
+## $ Compactness <dbl> 0.08129, 0.12700, 0.06492, 0.03766, 0.05943, 0.09092…
+## $ Concavity <dbl> 0.066640, 0.045680, 0.029560, 0.025620, 0.015880, 0.…
+## $ Concave_Points <dbl> 0.047810, 0.031100, 0.020760, 0.029230, 0.005917, 0.…
+## $ Symmetry <dbl> 0.1885, 0.1967, 0.1815, 0.1467, 0.1769, 0.1675, 0.18…
+## $ Fractal_Dimension <dbl> 0.05766, 0.06811, 0.06905, 0.05863, 0.06503, 0.06043…
+
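+and similarly for the test set:
+
+glimpse(cancer_test)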
+## Rows: 143
+## Columns: 12
+## $ ID <dbl> 842517, 84300903, 84501001, 84610002, 848406, 848620…
+## $ Class <fct> Malignant, Malignant, Malignant, Malignant, Malignan…
+## $ Radius <dbl> 20.570, 19.690, 12.460, 15.780, 14.680, 16.130, 19.8…
+## $ Texture <dbl> 17.77, 21.25, 24.04, 17.89, 20.13, 20.68, 22.15, 14.…
+## $ Perimeter <dbl> 132.90, 130.00, 83.97, 103.60, 94.74, 108.10, 130.00…
+## $ Area <dbl> 1326.0, 1203.0, 475.9, 781.0, 684.5, 798.8, 1260.0, …
+## $ Smoothness <dbl> 0.08474, 0.10960, 0.11860, 0.09710, 0.09867, 0.11700…
+## $ Compactness <dbl> 0.07864, 0.15990, 0.23960, 0.12920, 0.07200, 0.20220…
+## $ Concavity <dbl> 0.08690, 0.19740, 0.22730, 0.09954, 0.07395, 0.17220…
+## $ Concave_Points <dbl> 0.070170, 0.127900, 0.085430, 0.066060, 0.052590, 0.…
+## $ Symmetry <dbl> 0.1812, 0.2069, 0.2030, 0.1842, 0.1586, 0.2164, 0.15…
+## $ Fractal_Dimension <dbl> 0.05667, 0.05999, 0.08243, 0.06082, 0.05922, 0.07356…
+We can see from glimpse
in the code above that the training set contains 426
+observations, while the test set contains 143 observations. This corresponds to
+a train / test split of 75% / 25%, as desired. Recall from Chapter 5
+that we use the glimpse
function to view data with a large number of columns,
+as it prints the data such that the columns go down the page (instead of across).
We can use group_by and summarize to find the percentage of malignant and benign classes
+in cancer_train, and we see that about 63% of the training
+data are benign and 37%
+are malignant, indicating that our class proportions were roughly preserved when we split the data.
cancer_proportions <- cancer_train |>
+ group_by(Class) |>
+ summarize(n = n()) |>
+ mutate(percent = 100*n/nrow(cancer_train))
+
+cancer_proportions
## # A tibble: 2 × 3
+## Class n percent
+## <fct> <int> <dbl>
+## 1 Malignant 159 37.3
+## 2 Benign 267 62.7
+As we mentioned in the last chapter, K-nearest neighbors is sensitive to the scale of the predictors, +so we should perform some preprocessing to standardize them. An +additional consideration we need to take when doing this is that we should +create the standardization preprocessor using only the training data. This ensures that +our test data does not influence any aspect of our model training. Once we have +created the standardization preprocessor, we can then apply it separately to both the +training and test data sets.
+Fortunately, the recipe
framework from tidymodels
helps us handle
+this properly. Below we construct and prepare the recipe using only the training
+data (due to data = cancer_train
in the first line).
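+Using smoothness and concavity as the predictors (the same predictors we use for the classifier below), the recipe is constructed as follows:
+
+cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_train) |>
+  step_scale(all_predictors()) |>
+  step_center(all_predictors())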
Now that we have split our original data set into training and test sets, we
+can create our K-nearest neighbors classifier with only the training set using
+the technique we learned in the previous chapter. For now, we will just choose
+the number \(K\) of neighbors to be 3, and use concavity and smoothness as the
+predictors. As before we need to create a model specification, combine
+the model specification and recipe into a workflow, and then finally
+use fit
with the training data cancer_train
to build the classifier.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
+ set_engine("kknn") |>
+ set_mode("classification")
+
+knn_fit <- workflow() |>
+ add_recipe(cancer_recipe) |>
+ add_model(knn_spec) |>
+ fit(data = cancer_train)
+
+knn_fit
## ══ Workflow [trained] ══════════
+## Preprocessor: Recipe
+## Model: nearest_neighbor()
+##
+## ── Preprocessor ──────────
+## 2 Recipe Steps
+##
+## • step_scale()
+## • step_center()
+##
+## ── Model ──────────
+##
+## Call:
+## kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(3, data, 5),
+## kernel = ~"rectangular")
+##
+## Type of response variable: nominal
+## Minimal misclassification: 0.1126761
+## Best kernel: rectangular
+## Best k: 3
+Now that we have a K-nearest neighbors classifier object, we can use it to
+predict the class labels for our test set. We use the bind_cols function to add the
+column of predictions to the original test data, creating the
+cancer_test_predictions data frame. The Class variable contains the actual
+diagnoses, while the .pred_class column contains the predicted diagnoses from the
+classifier.
cancer_test_predictions <- predict(knn_fit, cancer_test) |>
+ bind_cols(cancer_test)
+
+cancer_test_predictions
## # A tibble: 143 × 13
+## .pred_class ID Class Radius Texture Perimeter Area Smoothness
+## <fct> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
+## 1 Benign 842517 Malignant 20.6 17.8 133. 1326 0.0847
+## 2 Malignant 84300903 Malignant 19.7 21.2 130 1203 0.110
+## 3 Malignant 84501001 Malignant 12.5 24.0 84.0 476. 0.119
+## 4 Malignant 84610002 Malignant 15.8 17.9 104. 781 0.0971
+## 5 Benign 848406 Malignant 14.7 20.1 94.7 684. 0.0987
+## 6 Malignant 84862001 Malignant 16.1 20.7 108. 799. 0.117
+## 7 Malignant 849014 Malignant 19.8 22.2 130 1260 0.0983
+## 8 Malignant 8511133 Malignant 15.3 14.3 102. 704. 0.107
+## 9 Malignant 852552 Malignant 16.6 21.4 110 905. 0.112
+## 10 Malignant 853612 Malignant 11.8 18.7 77.9 441. 0.111
+## # ℹ 133 more rows
+## # ℹ 5 more variables: Compactness <dbl>, Concavity <dbl>, Concave_Points <dbl>,
+## # Symmetry <dbl>, Fractal_Dimension <dbl>
+Finally, we can assess our classifier’s performance. First, we will examine
+accuracy. To do this we use the
+metrics
function from tidymodels
,
+specifying the truth
and estimate
arguments:
cancer_test_predictions |>
+ metrics(truth = Class, estimate = .pred_class) |>
+ filter(.metric == "accuracy")
## # A tibble: 1 × 3
+## .metric .estimator .estimate
+## <chr> <chr> <dbl>
+## 1 accuracy binary 0.853
+In the metrics data frame, we filtered the .metric
column since we are
+interested in the accuracy
row. Other entries involve other metrics that
+are beyond the scope of this book. Looking at the value of the .estimate
variable
+shows that the estimated accuracy of the classifier on the test data
+was 85%.
+To compute the precision and recall, we can use the precision
and recall
functions
+from tidymodels
. We first check the order of the
+labels in the Class
variable using the levels
function:
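+For example (here we use pull to extract the Class column as a factor before calling levels):
+
+cancer_test_predictions |>
+  pull(Class) |>
+  levels()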
## [1] "Malignant" "Benign"
+This shows that "Malignant"
is the first level. Therefore we will set
+the truth
and estimate
arguments to Class
and .pred_class
as before,
+but also specify that the “positive” class corresponds to the first factor level via event_level="first"
.
+If the labels were in the other order, we would instead use event_level="second"
.
cancer_test_predictions |>
+ precision(truth = Class, estimate = .pred_class, event_level = "first")
## # A tibble: 1 × 3
+## .metric .estimator .estimate
+## <chr> <chr> <dbl>
+## 1 precision binary 0.767
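+The recall is computed analogously:
+
+cancer_test_predictions |>
+  recall(truth = Class, estimate = .pred_class, event_level = "first")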
+
+## # A tibble: 1 × 3
+## .metric .estimator .estimate
+## <chr> <chr> <dbl>
+## 1 recall binary 0.868
+The output shows that the estimated precision and recall of the classifier on the test data was
+77% and 87%, respectively.
+Finally, we can look at the confusion matrix for the classifier using the conf_mat
function.
confusion <- cancer_test_predictions |>
+ conf_mat(truth = Class, estimate = .pred_class)
+confusion
## Truth
+## Prediction Malignant Benign
+## Malignant 46 14
+## Benign 7 76
+The confusion matrix shows 46 observations were correctly predicted +as malignant, and 76 were correctly predicted as benign. +It also shows that the classifier made some mistakes; in particular, +it classified 7 observations as benign when they were actually malignant, +and 14 observations as malignant when they were actually benign. +Using our formulas from earlier, we see that the accuracy, precision, and recall agree with what R reported.
+\[\mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}} = \frac{46+76}{46+76+14+7} = 0.853\]
+\[\mathrm{precision} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; predictions}} = \frac{46}{46 + 14} = 0.767\]
+\[\mathrm{recall} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; test \; set \; observations}} = \frac{46}{46+7} = 0.868\]
+We now know that the classifier was 85% accurate +on the test data set, and had a precision of 77% and a recall of 87%. +That sounds pretty good! Wait, is it good? Or do we need something higher?
+In general, a good value for accuracy (as well as precision and recall, if applicable) +depends on the application; you must critically analyze your accuracy in the context of the problem +you are solving. For example, if we were building a classifier for a kind of tumor that is benign 99% +of the time, a classifier with 99% accuracy is not terribly impressive (just always guess benign!). +And beyond just accuracy, we need to consider the precision and recall: as mentioned +earlier, the kind of mistake the classifier makes is +important in many applications as well. In the previous example with 99% benign observations, it might be very bad for the +classifier to predict “benign” when the actual class is “malignant” (a false negative), as this +might result in a patient not receiving appropriate medical attention. In other +words, in this context, we need the classifier to have a high recall. On the +other hand, it might be less bad for the classifier to guess “malignant” when +the actual class is “benign” (a false positive), as the patient will then likely see a doctor who +can provide an expert diagnosis. In other words, we are fine with sacrificing +some precision in the interest of achieving high recall. This is why it is +important not only to look at accuracy, but also the confusion matrix.
+However, there is always an easy baseline that you can compare to for any +classification problem: the majority classifier. The majority classifier +always guesses the majority class label from the training data, regardless of +the predictor variables’ values. It helps to give you a sense of +scale when considering accuracies. If the majority classifier obtains a 90% +accuracy on a problem, then you might hope for your K-nearest neighbors +classifier to do better than that. If your classifier provides a significant +improvement upon the majority classifier, this means that at least your method +is extracting some useful information from your predictor variables. Be +careful though: improving on the majority classifier does not necessarily +mean the classifier is working well enough for your application.
+As an example, in the breast cancer data, recall the proportions of benign and malignant +observations in the training data are as follows:
+ +## # A tibble: 2 × 3
+## Class n percent
+## <fct> <int> <dbl>
+## 1 Malignant 159 37.3
+## 2 Benign 267 62.7
+Since the benign class represents the majority of the training data, +the majority classifier would always predict that a new observation +is benign. The estimated accuracy of the majority classifier is usually +fairly close to the majority class proportion in the training data. +In this case, we would suspect that the majority classifier will have +an accuracy of around 63%. +The K-nearest neighbors classifier we built does quite a bit better than this, +with an accuracy of 85%. +This means that from the perspective of accuracy, +the K-nearest neighbors classifier improved quite a bit on the basic +majority classifier. Hooray! But we still need to be cautious; in +this application, it is likely very important not to misdiagnose any malignant tumors to avoid missing +patients who actually need medical care. The confusion matrix above shows +that the classifier does, indeed, misdiagnose a significant number of malignant tumors as benign (7 +out of 53 malignant tumors, or 13%!). +Therefore, even though the accuracy improved upon the majority classifier, +our critical analysis suggests that this classifier may not have appropriate performance +for the application.
+The vast majority of predictive models in statistics and machine learning have +parameters. A parameter +is a number you have to pick in advance that determines +some aspect of how the model behaves. For example, in the K-nearest neighbors +classification algorithm, \(K\) is a parameter that we have to pick +that determines how many neighbors participate in the class vote. +By picking different values of \(K\), we create different classifiers +that make different predictions.
+So then, how do we pick the best value of \(K\), i.e., tune the model? +And is it possible to make this selection in a principled way? In this book, +we will focus on maximizing the accuracy of the classifier. Ideally, +we want somehow to maximize the accuracy of our classifier on data it +hasn’t seen yet. But we cannot use our test data set in the process of building +our model. So we will play the same trick we did before when evaluating +our classifier: we’ll split our training data itself into two subsets, +use one to train the model, and then use the other to evaluate it. +In this section, we will cover the details of this procedure, as well as +how to use it to help you pick a good parameter value for your classifier.
+And remember: don’t touch the test set during the tuning process. Tuning is a part of model training!
+The first step in choosing the parameter \(K\) is to be able to evaluate the +classifier using only the training data. If this is possible, then we can compare +the classifier’s performance for different values of \(K\)—and pick the best—using +only the training data. As suggested at the beginning of this section, we will +accomplish this by splitting the training data, training on one subset, and evaluating +on the other. The subset of training data used for evaluation is often called the validation set.
+There is, however, one key difference from the train/test split +that we performed earlier. In particular, we were forced to make only a single split +of the data. This is because at the end of the day, we have to produce a single classifier. +If we had multiple different splits of the data into training and testing data, +we would produce multiple different classifiers. +But while we are tuning the classifier, we are free to create multiple classifiers +based on multiple splits of the training data, evaluate them, and then choose a parameter +value based on all of the different results. If we just split our overall training +data once, our best parameter choice will depend strongly on whatever data +was lucky enough to end up in the validation set. Perhaps using multiple +different train/validation splits, we’ll get a better estimate of accuracy, +which will lead to a better choice of the number of neighbors \(K\) for the +overall set of training data.
+Let’s investigate this idea in R! In particular, we will generate five different train/validation +splits of our overall training data, train five different K-nearest neighbors +models, and evaluate their accuracy. We will start with just a single +split.
+# create the 75/25 split of the training data into training and validation
+cancer_split <- initial_split(cancer_train, prop = 0.75, strata = Class)
+cancer_subtrain <- training(cancer_split)
+cancer_validation <- testing(cancer_split)
+
+# recreate the standardization recipe from before
+# (since it must be based on the training data)
+cancer_recipe <- recipe(Class ~ Smoothness + Concavity,
+ data = cancer_subtrain) |>
+ step_scale(all_predictors()) |>
+ step_center(all_predictors())
+
+# fit the knn model (we can reuse the old knn_spec model from before)
+knn_fit <- workflow() |>
+ add_recipe(cancer_recipe) |>
+ add_model(knn_spec) |>
+ fit(data = cancer_subtrain)
+
+# get predictions on the validation data
+validation_predicted <- predict(knn_fit, cancer_validation) |>
+ bind_cols(cancer_validation)
+
+# compute the accuracy
+acc <- validation_predicted |>
+ metrics(truth = Class, estimate = .pred_class) |>
+ filter(.metric == "accuracy") |>
+ select(.estimate) |>
+ pull()
+
+acc
## [1] 0.8598131
+The accuracy estimate using this split is 86%.
+Now we repeat the above code 4 more times, which generates 4 more splits.
+This gives us five different shuffles of the data, and therefore five different values for
+accuracy: 86.0%, 89.7%, 88.8%, 86.0%, 86.9%. None of these values is
+necessarily “more correct” than any other; they’re
+just five estimates of the true, underlying accuracy of our classifier built
+using our overall training data. We can combine the estimates by taking their
+average (here 87%) to try to get a single assessment of our
+classifier’s accuracy; this has the effect of reducing the influence of any one
+(un)lucky validation set on the estimate.
+In practice, we don’t use random splits, but rather use a more structured
+splitting procedure so that each observation in the data set is used in a
+validation set only a single time. The name for this strategy is
+cross-validation. In cross-validation, we split our overall training
+data into \(C\) evenly sized chunks. Then, we iteratively use \(1\) chunk as the
+validation set and combine the remaining \(C-1\) chunks
+into the training set.
+This procedure is shown in Figure 6.4.
+Here, \(C=5\) different chunks of the data set are used,
+resulting in 5 different choices for the validation set; we call this
+5-fold cross-validation.
++Figure 6.4: 5-fold cross-validation. +
+To perform 5-fold cross-validation in R with tidymodels
, we use another
+function: vfold_cv
. This function splits our training data into v
folds
+automatically. We set the strata
argument to the categorical label variable
+(here, Class
) to ensure that the training and validation subsets contain the
+right proportions of each category of observation.
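+For example, to create the five folds (stored in the cancer_vfold object that we use below):
+
+cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)
+
+cancer_vfold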
## # 5-fold cross-validation using stratification
+## # A tibble: 5 × 2
+## splits id
+## <list> <chr>
+## 1 <split [340/86]> Fold1
+## 2 <split [340/86]> Fold2
+## 3 <split [341/85]> Fold3
+## 4 <split [341/85]> Fold4
+## 5 <split [342/84]> Fold5
+Then, when we create our data analysis workflow, we use the fit_resamples
function
+instead of the fit
function for training. This runs cross-validation on each
+train/validation split.
# recreate the standardization recipe from before
+# (since it must be based on the training data)
+cancer_recipe <- recipe(Class ~ Smoothness + Concavity,
+ data = cancer_train) |>
+ step_scale(all_predictors()) |>
+ step_center(all_predictors())
+
+# fit the knn model (we can reuse the old knn_spec model from before)
+knn_fit <- workflow() |>
+ add_recipe(cancer_recipe) |>
+ add_model(knn_spec) |>
+ fit_resamples(resamples = cancer_vfold)
+
+knn_fit
## # Resampling results
+## # 5-fold cross-validation using stratification
+## # A tibble: 5 × 4
+## splits id .metrics .notes
+## <list> <chr> <list> <list>
+## 1 <split [340/86]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]>
+## 2 <split [340/86]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]>
+## 3 <split [341/85]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]>
+## 4 <split [341/85]> Fold4 <tibble [2 × 4]> <tibble [0 × 3]>
+## 5 <split [342/84]> Fold5 <tibble [2 × 4]> <tibble [0 × 3]>
+The collect_metrics
function is used to aggregate the mean and standard error
+of the classifier’s validation accuracy across the folds. You will find results
+related to the accuracy in the row with accuracy
listed under the .metric
column.
+You should consider the mean (mean
) to be the estimated accuracy, while the standard
+error (std_err
) is a measure of how uncertain we are in the mean value. A detailed treatment of this
+is beyond the scope of this chapter; but roughly, if your estimated mean is 0.89 and standard
+error is 0.02, you can expect the true average accuracy of the
+classifier to be somewhere roughly between 87% and 91% (although it may
+fall outside this range). You may ignore the other columns in the metrics data frame,
+as they do not provide any additional insight.
+You can also ignore the entire second row with roc_auc
in the .metric
column,
+as it is beyond the scope of this book.
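+Applying collect_metrics to our resampling results gives the following:
+
+knn_fit |>
+  collect_metrics()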
## # A tibble: 2 × 6
+## .metric .estimator mean n std_err .config
+## <chr> <chr> <dbl> <int> <dbl> <chr>
+## 1 accuracy binary 0.890 5 0.0180 Preprocessor1_Model1
+## 2 roc_auc binary 0.925 5 0.0151 Preprocessor1_Model1
+We can choose any number of folds, and typically the more we use the better our +accuracy estimate will be (lower standard error). However, we are limited +by computational power: the +more folds we choose, the more computation it takes, and hence the more time +it takes to run the analysis. So when you do cross-validation, you need to +consider the size of the data, the speed of the algorithm (e.g., K-nearest +neighbors), and the speed of your computer. In practice, this is a +trial-and-error process, but typically \(C\) is chosen to be either 5 or 10. Here +we will try 10-fold cross-validation to see if we get a lower standard error:
+cancer_vfold <- vfold_cv(cancer_train, v = 10, strata = Class)
+
+vfold_metrics <- workflow() |>
+ add_recipe(cancer_recipe) |>
+ add_model(knn_spec) |>
+ fit_resamples(resamples = cancer_vfold) |>
+ collect_metrics()
+
+vfold_metrics
## # A tibble: 2 × 6
+## .metric .estimator mean n std_err .config
+## <chr> <chr> <dbl> <int> <dbl> <chr>
+## 1 accuracy binary 0.890 10 0.0127 Preprocessor1_Model1
+## 2 roc_auc binary 0.913 10 0.0150 Preprocessor1_Model1
+In this case, using 10-fold instead of 5-fold cross validation did reduce the standard error, although +by only an insignificant amount. In fact, due to the randomness in how the data are split, sometimes +you might even end up with a higher standard error when increasing the number of folds! +We can make the reduction in standard error more dramatic by increasing the number of folds +by a large amount. In the following code we show the result when \(C = 50\); +picking such a large number of folds often takes a long time to run in practice, +so we usually stick to 5 or 10.
+cancer_vfold_50 <- vfold_cv(cancer_train, v = 50, strata = Class)
+
+vfold_metrics_50 <- workflow() |>
+ add_recipe(cancer_recipe) |>
+ add_model(knn_spec) |>
+ fit_resamples(resamples = cancer_vfold_50) |>
+ collect_metrics()
+
+vfold_metrics_50
## # A tibble: 2 × 6
+## .metric .estimator mean n std_err .config
+## <chr> <chr> <dbl> <int> <dbl> <chr>
+## 1 accuracy binary 0.884 50 0.00568 Preprocessor1_Model1
+## 2 roc_auc binary 0.926 50 0.0148 Preprocessor1_Model1
+Using 5- and 10-fold cross-validation, we have estimated that the prediction +accuracy of our classifier is somewhere around 89%. +Whether that is good or not +depends entirely on the downstream application of the data analysis. In the +present situation, we are trying to predict a tumor diagnosis, with expensive, +damaging chemo/radiation therapy or patient death as potential consequences of +misprediction. Hence, we might like to +do better than 89% for this application.
+In order to improve our classifier, we have one choice of parameter: the number of
+neighbors, \(K\). Since cross-validation helps us evaluate the accuracy of our
+classifier, we can use cross-validation to calculate an accuracy for each value
+of \(K\) in a reasonable range, and then pick the value of \(K\) that gives us the
+best accuracy. The tidymodels
package collection provides a very simple
+syntax for tuning models: each parameter in the model to be tuned should be specified
+as tune()
in the model specification rather than given a particular value.
knn_spec <- nearest_neighbor(weight_func = "rectangular",
+ neighbors = tune()) |>
+ set_engine("kknn") |>
+ set_mode("classification")
Then instead of using fit
or fit_resamples
, we will use the tune_grid
function
+to fit the model for each value in a range of parameter values.
+In particular, we first create a data frame with a neighbors
+variable that contains the sequence of values of \(K\) to try; below we create the k_vals
+data frame with the neighbors
variable containing values from 1 to 100 (stepping by 5) using
+the seq
function.
+Then we pass that data frame to the grid
argument of tune_grid
.
k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))
+
+knn_results <- workflow() |>
+ add_recipe(cancer_recipe) |>
+ add_model(knn_spec) |>
+ tune_grid(resamples = cancer_vfold, grid = k_vals) |>
+ collect_metrics()
+
+accuracies <- knn_results |>
+ filter(.metric == "accuracy")
+
+accuracies
## # A tibble: 20 × 7
+## neighbors .metric .estimator mean n std_err .config
+## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
+## 1 1 accuracy binary 0.866 10 0.0165 Preprocessor1_Model01
+## 2 6 accuracy binary 0.890 10 0.0153 Preprocessor1_Model02
+## 3 11 accuracy binary 0.887 10 0.0173 Preprocessor1_Model03
+## 4 16 accuracy binary 0.887 10 0.0142 Preprocessor1_Model04
+## 5 21 accuracy binary 0.887 10 0.0143 Preprocessor1_Model05
+## 6 26 accuracy binary 0.887 10 0.0170 Preprocessor1_Model06
+## 7 31 accuracy binary 0.897 10 0.0145 Preprocessor1_Model07
+## 8 36 accuracy binary 0.899 10 0.0144 Preprocessor1_Model08
+## 9 41 accuracy binary 0.892 10 0.0135 Preprocessor1_Model09
+## 10 46 accuracy binary 0.892 10 0.0156 Preprocessor1_Model10
+## 11 51 accuracy binary 0.890 10 0.0155 Preprocessor1_Model11
+## 12 56 accuracy binary 0.873 10 0.0156 Preprocessor1_Model12
+## 13 61 accuracy binary 0.876 10 0.0104 Preprocessor1_Model13
+## 14 66 accuracy binary 0.871 10 0.0139 Preprocessor1_Model14
+## 15 71 accuracy binary 0.876 10 0.0104 Preprocessor1_Model15
+## 16 76 accuracy binary 0.873 10 0.0127 Preprocessor1_Model16
+## 17 81 accuracy binary 0.876 10 0.0135 Preprocessor1_Model17
+## 18 86 accuracy binary 0.873 10 0.0131 Preprocessor1_Model18
+## 19 91 accuracy binary 0.873 10 0.0140 Preprocessor1_Model19
+## 20 96 accuracy binary 0.866 10 0.0126 Preprocessor1_Model20
+We can decide which number of neighbors is best by plotting the accuracy versus \(K\), +as shown in Figure 6.5.
+accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
+ geom_point() +
+ geom_line() +
+ labs(x = "Neighbors", y = "Accuracy Estimate") +
+ theme(text = element_text(size = 12))
+
+accuracy_vs_k
+Figure 6.5: Plot of estimated accuracy versus the number of neighbors. +
+We can also obtain the number of neighbors with the highest accuracy
+programmatically by accessing the neighbors
variable in the accuracies
data
+frame where the mean
variable is highest.
+Note that it is still useful to visualize the results as
+we did above since this provides additional information on how the model
+performance varies.
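+One way to do this (storing the result in the best_k object that we reuse later when retraining the classifier) is:
+
+best_k <- accuracies |>
+  arrange(desc(mean)) |>
+  slice(1) |>
+  pull(neighbors)
+
+best_k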
## [1] 36
+Setting the number of
+neighbors to \(K =\) 36
+provides the highest cross-validation accuracy estimate (89.89%). But there is no exact or perfect answer here;
+any selection between \(K = 30\) and \(60\) would be reasonably justified, as all
+of these differ in classifier accuracy by a small amount. Remember: the
+values you see on this plot are estimates of the true accuracy of our
+classifier. Although the \(K =\) 36 value is higher than the others on this plot,
+that doesn’t mean the classifier is actually more accurate with this parameter
+value! Generally, when selecting \(K\) (and other parameters for other predictive
+models), we are looking for a value where:
+1. we get roughly optimal accuracy, so that our model will likely be accurate;
+2. changing the value to a nearby one (e.g., adding or subtracting a small number) doesn’t decrease accuracy too much, so that our choice is reliable in the presence of uncertainty; and
+3. the cost of training the model is not prohibitively expensive.
+We know that \(K =\) 36 +provides the highest estimated accuracy. Further, Figure 6.5 shows that the estimated accuracy +changes by only a small amount if we increase or decrease \(K\) near \(K =\) 36. +And finally, \(K =\) 36 does not create a prohibitively expensive +computational cost of training. Considering these three points, we would indeed select +\(K =\) 36 for the classifier.
+To build a bit more intuition, what happens if we keep increasing the number of
+neighbors \(K\)? In fact, the accuracy actually starts to decrease!
+Let’s specify a much larger range of values of \(K\) to try in the grid
+argument of tune_grid
. Figure 6.6 shows a plot of estimated accuracy as
+we vary \(K\) from 1 to almost the number of observations in the training set.
k_lots <- tibble(neighbors = seq(from = 1, to = 385, by = 10))
+
+knn_results <- workflow() |>
+ add_recipe(cancer_recipe) |>
+ add_model(knn_spec) |>
+ tune_grid(resamples = cancer_vfold, grid = k_lots) |>
+ collect_metrics()
+
+accuracies_lots <- knn_results |>
+ filter(.metric == "accuracy")
+
+accuracy_vs_k_lots <- ggplot(accuracies_lots, aes(x = neighbors, y = mean)) +
+ geom_point() +
+ geom_line() +
+ labs(x = "Neighbors", y = "Accuracy Estimate") +
+ theme(text = element_text(size = 12))
+
+accuracy_vs_k_lots
+Figure 6.6: Plot of accuracy estimate versus number of neighbors for many K values. +
+Underfitting: What is actually happening to our classifier that causes +this? As we increase the number of neighbors, more and more of the training +observations (and those that are farther and farther away from the point) get a +“say” in what the class of a new observation is. This causes a sort of +“averaging effect” to take place, making the boundary between where our +classifier would predict a tumor to be malignant versus benign to smooth out +and become simpler. If you take this to the extreme, setting \(K\) to the total +training data set size, then the classifier will always predict the same label +regardless of what the new observation looks like. In general, if the model +isn’t influenced enough by the training data, it is said to underfit the +data.
+Overfitting: In contrast, when we decrease the number of neighbors, each +individual data point has a stronger and stronger vote regarding nearby points. +Since the data themselves are noisy, this causes a more “jagged” boundary +corresponding to a less simple model. If you take this case to the extreme, +setting \(K = 1\), then the classifier is essentially just matching each new +observation to its closest neighbor in the training data set. This is just as +problematic as the large \(K\) case, because the classifier becomes unreliable on +new data: if we had a different training set, the predictions would be +completely different. In general, if the model is influenced too much by the +training data, it is said to overfit the data.
++Figure 6.7: Effect of K in overfitting and underfitting. +
+Both overfitting and underfitting are problematic and will lead to a model +that does not generalize well to new data. When fitting a model, we need to strike +a balance between the two. You can see these two effects in Figure +6.7, which shows how the classifier changes as +we set the number of neighbors \(K\) to 1, 7, 20, and 300.
+Now that we have tuned the K-NN classifier and set \(K =\) 36, +we are done building the model and it is time to evaluate the quality of its predictions on the held out +test data, as we did earlier in Section 6.5.5. +We first need to retrain the K-NN classifier +on the entire training data set using the selected number of neighbors.
+cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_train) |>
+ step_scale(all_predictors()) |>
+ step_center(all_predictors())
+
+knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
+ set_engine("kknn") |>
+ set_mode("classification")
+
+knn_fit <- workflow() |>
+ add_recipe(cancer_recipe) |>
+ add_model(knn_spec) |>
+ fit(data = cancer_train)
+
+knn_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
+## Preprocessor: Recipe
+## Model: nearest_neighbor()
+##
+## ── Preprocessor ────────────────────────────────────────────────────────────────
+## 2 Recipe Steps
+##
+## • step_scale()
+## • step_center()
+##
+## ── Model ───────────────────────────────────────────────────────────────────────
+##
+## Call:
+## kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(36, data, 5), kernel = ~"rectangular")
+##
+## Type of response variable: nominal
+## Minimal misclassification: 0.1150235
+## Best kernel: rectangular
+## Best k: 36
+Then to make predictions and assess the estimated accuracy of the best model on the test data, we use the
+predict
and metrics
functions as we did earlier in the chapter. We can then pass those predictions to
+the precision
, recall
, and conf_mat
functions to assess the estimated precision and recall, and print a confusion matrix.
+
cancer_test_predictions <- predict(knn_fit, cancer_test) |>
+ bind_cols(cancer_test)
+
+cancer_test_predictions |>
+ metrics(truth = Class, estimate = .pred_class) |>
+ filter(.metric == "accuracy")
## # A tibble: 1 × 3
+## .metric .estimator .estimate
+## <chr> <chr> <dbl>
+## 1 accuracy binary 0.860
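+For example, the precision:
+
+cancer_test_predictions |>
+  precision(truth = Class, estimate = .pred_class, event_level = "first")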
+
+## # A tibble: 1 × 3
+## .metric .estimator .estimate
+## <chr> <chr> <dbl>
+## 1 precision binary 0.8
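+and the recall:
+
+cancer_test_predictions |>
+  recall(truth = Class, estimate = .pred_class, event_level = "first")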
+
+## # A tibble: 1 × 3
+## .metric .estimator .estimate
+## <chr> <chr> <dbl>
+## 1 recall binary 0.830
+confusion <- cancer_test_predictions |>
+ conf_mat(truth = Class, estimate = .pred_class)
+confusion
## Truth
+## Prediction Malignant Benign
+## Malignant 44 11
+## Benign 9 79
+At first glance, this is a bit surprising: the accuracy of the classifier
+has only changed a small amount despite tuning the number of neighbors! Our first model
+with \(K =\) 3 (before we knew how to tune) had an estimated accuracy of 85%,
+while the tuned model with \(K =\) 36 had an estimated accuracy
+of 86%.
+Upon examining Figure 6.5 again to see the
+cross-validation accuracy estimates for a range of neighbors, this result
+becomes much less surprising. From 1 to around 96 neighbors, the cross-validation
+accuracy estimate varies only by around 3%, with
+each estimate having a standard error around 1%.
+Since the cross-validation accuracy estimates the test set accuracy,
+the fact that the test set accuracy also doesn’t change much is expected.
+Also note that the \(K =\) 3 model had a
+precision of 77% and recall of 87%,
+while the tuned model had
+a precision of 80% and recall of 83%.
+Given that the recall decreased—remember, in this application, recall
+is critical to making sure we find all the patients with malignant tumors—the tuned model may actually be less preferred
+in this setting. In any case, it is important to think critically about the result of tuning. Models tuned to
+maximize accuracy are not necessarily better for a given application.
+Classification algorithms use one or more quantitative variables to predict the +value of another categorical variable. In particular, the K-nearest neighbors algorithm +does this by first finding the \(K\) points in the training data nearest +to the new observation, and then returning the majority class vote from those +training observations. We can tune and evaluate a classifier by splitting the data randomly into a +training and test data set. The training set is used to build the classifier, +and we can tune the classifier (e.g., select the number of neighbors in K-NN) +by maximizing estimated accuracy via cross-validation. After we have tuned the +model we can use the test set to estimate its accuracy. +The overall process is summarized in Figure 6.8.
++Figure 6.8: Overview of K-NN classification. +
+The overall workflow for performing K-nearest neighbors classification using tidymodels
is as follows:
+
+1. Use the initial_split function to split the data into a training and test set. Set the strata argument to the class label variable. Put the test set aside for now.
+2. Use the vfold_cv function to split up the training data for cross-validation.
+3. Create a recipe that specifies the class label and predictors, as well as preprocessing steps for all variables. Pass the training data as the data argument of the recipe.
+4. Create a nearest_neighbors model specification, with neighbors = tune().
+5. Add the recipe and model specification to a workflow(), and use the tune_grid function on the train/validation splits to estimate the classifier accuracy for a range of \(K\) values.
+6. Pick a value of \(K\) that yields a high cross-validation accuracy estimate that does not change much if you change \(K\) to a nearby value.
+7. Make a new model specification for the best parameter value (i.e., \(K\)), and retrain the classifier using the fit function.
+8. Evaluate the estimated accuracy of the classifier on the test set using the predict function.
+Strengths: K-nearest neighbors classification
+Weaknesses: K-nearest neighbors classification
+++Note: This section is not required reading for the remainder of the textbook. It is included for those readers +interested in learning how irrelevant variables can influence the performance of a classifier, and how to +pick a subset of useful variables to include as predictors.
+
Another potentially important part of tuning your classifier is to choose which +variables from your data will be treated as predictor variables. Technically, you can choose +anything from using a single predictor variable to using every variable in your +data; the K-nearest neighbors algorithm accepts any number of +predictors. However, it is not the case that using more predictors always +yields better predictions! In fact, sometimes including irrelevant predictors can +actually negatively affect classifier performance.
+Let’s take a look at an example where K-nearest neighbors performs
+worse when given more predictors to work with. In this example, we modified
+the breast cancer data to have only the Smoothness
, Concavity
, and
+Perimeter
variables from the original data. Then, we added irrelevant
+variables that we created ourselves using a random number generator.
+The irrelevant variables each take a value of 0 or 1 with equal probability for each observation, regardless
+of what value the Class variable takes. In other words, the irrelevant variables have
+no meaningful relationship with the Class
variable.
## # A tibble: 569 × 6
+## Class Smoothness Concavity Perimeter Irrelevant1 Irrelevant2
+## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
+## 1 Malignant 0.118 0.300 123. 1 0
+## 2 Malignant 0.0847 0.0869 133. 0 0
+## 3 Malignant 0.110 0.197 130 0 0
+## 4 Malignant 0.142 0.241 77.6 0 1
+## 5 Malignant 0.100 0.198 135. 0 0
+## 6 Malignant 0.128 0.158 82.6 1 0
+## 7 Malignant 0.0946 0.113 120. 0 1
+## 8 Malignant 0.119 0.0937 90.2 1 0
+## 9 Malignant 0.127 0.186 87.5 0 0
+## 10 Malignant 0.119 0.227 84.0 1 1
+## # ℹ 559 more rows
+Next, we build a sequence of K-NN classifiers that include Smoothness
,
+Concavity
, and Perimeter
as predictor variables, but also increasingly many irrelevant
+variables. In particular, we create 6 data sets with 0, 5, 10, 15, 20, and 40 irrelevant predictors.
+Then we build a model, tuned via 5-fold cross-validation, for each data set.
+Figure 6.9 shows
+the estimated cross-validation accuracy versus the number of irrelevant predictors. As
+we add more irrelevant predictor variables, the estimated accuracy of our
+classifier decreases. This is because the irrelevant variables add a random
+amount to the distance between each pair of observations; the more irrelevant
+variables there are, the more (random) influence they have, and the more they
+corrupt the set of nearest neighbors that vote on the class of the new
+observation to predict.
+Figure 6.9: Effect of inclusion of irrelevant predictors. +
+Although the accuracy decreases as expected, one surprising thing about +Figure 6.9 is that it shows that the method +still outperforms the baseline majority classifier (with about 63% accuracy) +even with 40 irrelevant variables. +How could that be? Figure 6.10 provides the answer: +the tuning procedure for the K-nearest neighbors classifier combats the extra randomness from the irrelevant variables +by increasing the number of neighbors. Of course, because of all the extra noise in the data from the irrelevant +variables, the number of neighbors does not increase smoothly; but the general trend is increasing. +Figure 6.11 corroborates +this evidence; if we fix the number of neighbors to \(K=3\), the accuracy falls off more quickly.
++Figure 6.10: Tuned number of neighbors for varying number of irrelevant predictors. +
++Figure 6.11: Accuracy versus number of irrelevant predictors for tuned and untuned number of neighbors. +
+So then, if it is not ideal to use all of our variables as predictors without consideration, how
+do we choose which variables we should use? A simple method is to rely on your scientific understanding
+of the data to tell you which variables are not likely to be useful predictors. For example, in the cancer
+data that we have been studying, the ID
variable is just a unique identifier for the observation.
+As it is not related to any measured property of the cells, the ID
variable should therefore not be used
+as a predictor. That is, of course, a very clear-cut case. But the decision for the remaining variables
+is less obvious, as all seem like reasonable candidates. It
+is not clear which subset of them will create the best classifier. One could use visualizations and
+other exploratory analyses to try to help understand which variables are potentially relevant, but
+this process is both time-consuming and error-prone when there are many variables to consider.
+Therefore we need a more systematic and programmatic way of choosing variables.
+This is a very difficult problem to solve in
+general, and there are a number of methods that have been developed that apply
+in particular cases of interest. Here we will discuss two basic
+selection methods as an introduction to the topic. See the additional resources at the end of
+this chapter to find out where you can learn more about variable selection, including more advanced methods.
The first idea you might think of for a systematic way to select predictors is to try all possible subsets of predictors and then pick the set that results in the “best” classifier. This procedure is indeed a well-known variable selection method referred to as best subset selection (Beale, Kendall, and Mann 1967; Hocking and Leslie 1967). In particular, you create a separate model for every possible subset of predictors, tune each one using cross-validation, and pick the subset of predictors that gives you the highest estimated accuracy.
+Best subset selection is applicable to any classification method (K-NN or otherwise). +However, it becomes very slow when you have even a moderate +number of predictors to choose from (say, around 10). This is because the number of possible predictor subsets +grows very quickly with the number of predictors, and you have to train the model (itself +a slow process!) for each one. For example, if we have 2 predictors—let’s call +them A and B—then we have 3 variable sets to try: A alone, B alone, and finally A +and B together. If we have 3 predictors—A, B, and C—then we have 7 +to try: A, B, C, AB, BC, AC, and ABC. In general, the number of models +we have to train for \(m\) predictors is \(2^m-1\); in other words, when we +get to 10 predictors we have over one thousand models to train, and +at 20 predictors we have over one million models to train! +So although it is a simple method, best subset selection is usually too computationally +expensive to use in practice.
+Another idea is to iteratively build up a model by adding one predictor variable at a time. This method—known as forward selection (Eforymson 1966; Draper and Smith 1966)—is also widely applicable and fairly straightforward. It involves the following steps:
1. Start with a model that includes no predictors.
2. For each predictor not yet in the model, add it to the current model to form a candidate model, and tune that candidate model using cross-validation.
3. Keep the candidate model with the highest estimated accuracy as the new current model.
4. Repeat steps 2 and 3 until you run out of predictors to add, and then choose the model from the resulting sequence that best balances accuracy and simplicity.
+Say you have \(m\) total predictors to work with. In the first iteration, you have to make +\(m\) candidate models, each with 1 predictor. Then in the second iteration, you have +to make \(m-1\) candidate models, each with 2 predictors (the one you chose before and a new one). +This pattern continues for as many iterations as you want. If you run the method +all the way until you run out of predictors to choose, you will end up training +\(\frac{1}{2}m(m+1)\) separate models. This is a big improvement from the \(2^m-1\) +models that best subset selection requires you to train! For example, while best subset selection requires +training over 1000 candidate models with 10 predictors, forward selection requires training only 55 candidate models. +Therefore we will continue the rest of this section using forward selection.
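For a quick check of these counts, we can compute them directly in R:
m <- 10
2^m - 1          # best subset selection: 1023 candidate models
m * (m + 1) / 2  # forward selection: 55 candidate models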
+++Note: One word of caution before we move on. Every additional model that you train +increases the likelihood that you will get unlucky and stumble +on a model that has a high cross-validation accuracy estimate, but a low true +accuracy on the test data and other future observations. +Since forward selection involves training a lot of models, you run a fairly +high risk of this happening. To keep this risk low, only use forward selection +when you have a large amount of data and a relatively small total number of +predictors. More advanced methods do not suffer from this +problem as much; see the additional resources at the end of this chapter for +where to learn more about advanced predictor selection methods.
+
We now turn to implementing forward selection in R.
+Unfortunately there is no built-in way to do this using the tidymodels framework, so we will have to code it ourselves. First we will use the select function to extract a smaller set of predictors to work with in this illustrative example—Smoothness, Concavity, Perimeter, Irrelevant1, Irrelevant2, and Irrelevant3—as well as the Class variable as the label. We will also extract the column names for the full set of predictors.
cancer_subset <- cancer_irrelevant |>
+ select(Class,
+ Smoothness,
+ Concavity,
+ Perimeter,
+ Irrelevant1,
+ Irrelevant2,
+ Irrelevant3)
+
+names <- colnames(cancer_subset |> select(-Class))
+
+cancer_subset
## # A tibble: 569 × 7
+## Class Smoothness Concavity Perimeter Irrelevant1 Irrelevant2 Irrelevant3
+## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
+## 1 Malignant 0.118 0.300 123. 1 0 1
+## 2 Malignant 0.0847 0.0869 133. 0 0 0
+## 3 Malignant 0.110 0.197 130 0 0 0
+## 4 Malignant 0.142 0.241 77.6 0 1 0
+## 5 Malignant 0.100 0.198 135. 0 0 0
+## 6 Malignant 0.128 0.158 82.6 1 0 1
+## 7 Malignant 0.0946 0.113 120. 0 1 1
+## 8 Malignant 0.119 0.0937 90.2 1 0 0
+## 9 Malignant 0.127 0.186 87.5 0 0 1
+## 10 Malignant 0.119 0.227 84.0 1 1 0
+## # ℹ 559 more rows
+The key idea of the forward selection code is to use the paste function (which concatenates strings separated by spaces) to create a model formula for each subset of predictors for which we want to build a model. The collapse argument tells paste what to put between the items in the list; to make a formula, we need to put a + symbol between each variable. As an example, let’s make a model formula for all the predictors, which should output something like Class ~ Smoothness + Concavity + Perimeter + Irrelevant1 + Irrelevant2 + Irrelevant3:
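A minimal sketch of one way to build this string, using the names vector we created earlier, is:
# join the predictor names with "+" between them, then prepend "Class ~"
paste("Class", "~", paste(names, collapse = "+"))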
## [1] "Class ~ Smoothness+Concavity+Perimeter+Irrelevant1+Irrelevant2+Irrelevant3"
+Finally, we need to write some code that performs the task of sequentially
+finding the best predictor to add to the model.
+If you recall the end of the wrangling chapter, we mentioned
+that sometimes one needs more flexible forms of iteration than what
+we have used earlier, and in these cases one typically resorts to
+a for loop; see the chapter on iteration in R for Data Science (Wickham and Grolemund 2016).
+Here we will use two for loops:
+one over increasing predictor set sizes
+(where you see for (i in 1:length(names))
below),
+and another to check which predictor to add in each round (where you see for (j in 1:length(names))
below).
+For each set of predictors to try, we construct a model formula, pass it into a recipe, build a workflow that tunes a K-NN classifier using 5-fold cross-validation, and finally record the estimated accuracy.
# create an empty tibble to store the results
+accuracies <- tibble(size = integer(),
+ model_string = character(),
+ accuracy = numeric())
+
+# create a model specification
+knn_spec <- nearest_neighbor(weight_func = "rectangular",
+ neighbors = tune()) |>
+ set_engine("kknn") |>
+ set_mode("classification")
+
+# create a 5-fold cross-validation object
+cancer_vfold <- vfold_cv(cancer_subset, v = 5, strata = Class)
+
+# store the total number of predictors
+n_total <- length(names)
+
+# stores selected predictors
+selected <- c()
+
+# for every size from 1 to the total number of predictors
+for (i in 1:n_total) {
+ # for every predictor still not added yet
+ accs <- list()
+ models <- list()
+ for (j in 1:length(names)) {
+ # create a model string for this combination of predictors
+ preds_new <- c(selected, names[[j]])
+ model_string <- paste("Class", "~", paste(preds_new, collapse="+"))
+
+ # create a recipe from the model string
+ cancer_recipe <- recipe(as.formula(model_string),
+ data = cancer_subset) |>
+ step_scale(all_predictors()) |>
+ step_center(all_predictors())
+
+ # tune the K-NN classifier with these predictors,
+ # and collect the accuracy for the best K
+ acc <- workflow() |>
+ add_recipe(cancer_recipe) |>
+ add_model(knn_spec) |>
+ tune_grid(resamples = cancer_vfold, grid = 10) |>
+ collect_metrics() |>
+ filter(.metric == "accuracy") |>
+ summarize(mx = max(mean))
+ acc <- acc$mx |> unlist()
+
+ # add this result to the dataframe
+ accs[[j]] <- acc
+ models[[j]] <- model_string
+ }
+ jstar <- which.max(unlist(accs))
+ accuracies <- accuracies |>
+ add_row(size = i,
+ model_string = models[[jstar]],
+ accuracy = accs[[jstar]])
+ selected <- c(selected, names[[jstar]])
+ names <- names[-jstar]
+}
+accuracies
## # A tibble: 6 × 3
+## size model_string accuracy
+## <int> <chr> <dbl>
+## 1 1 Class ~ Perimeter 0.896
+## 2 2 Class ~ Perimeter+Concavity 0.916
+## 3 3 Class ~ Perimeter+Concavity+Smoothness 0.931
+## 4 4 Class ~ Perimeter+Concavity+Smoothness+Irrelevant1 0.928
+## 5 5 Class ~ Perimeter+Concavity+Smoothness+Irrelevant1+Irrelevant3 0.924
+## 6 6 Class ~ Perimeter+Concavity+Smoothness+Irrelevant1+Irrelevant3… 0.902
+Interesting! The forward selection procedure first added the three meaningful variables Perimeter, Concavity, and Smoothness, followed by the irrelevant variables. Figure 6.12
+visualizes the accuracy versus the number of predictors in the model. You can see that
+as meaningful predictors are added, the estimated accuracy increases substantially; and as you add irrelevant
+variables, the accuracy either exhibits small fluctuations or decreases as the model attempts to tune the number
+of neighbors to account for the extra noise. In order to pick the right model from the sequence, you have
+to balance high accuracy and model simplicity (i.e., having fewer predictors and a lower chance of overfitting). The
+way to find that balance is to look for the elbow
+in Figure 6.12, i.e., the place on the plot where the accuracy stops increasing dramatically and
+levels off or begins to decrease. The elbow in Figure 6.12 appears to occur at the model with
+3 predictors; after that point the accuracy levels off. So here the right trade-off of accuracy and number of predictors
+occurs with 3 variables: Class ~ Perimeter + Concavity + Smoothness
. In other words, we have successfully removed irrelevant
+predictors from the model! It is always worth remembering, however, that what cross-validation gives you
+is an estimate of the true accuracy; you have to use your judgement when looking at this plot to decide
+where the elbow occurs, and whether adding a variable provides a meaningful increase in accuracy.
+Figure 6.12: Estimated accuracy versus the number of predictors for the sequence of models built using forward selection. +
+++Note: Since the choice of which variables to include as predictors is +part of tuning your classifier, you cannot use your test data for this +process!
+
Practice exercises for the material covered in this chapter +can be found in the accompanying +worksheets repository +in the “Classification II: evaluation and tuning” row. +You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. +You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” +If you instead decide to download the worksheet and run it on your own machine, +make sure to follow the instructions for computer setup +found in Chapter 13. This will ensure that the automated feedback +and guidance that the worksheets provide will function as intended.
+tidymodels
website is an excellent
+reference for more details on, and advanced usage of, the functions and
+packages in the past two chapters. Aside from that, it also has a nice
+beginner’s tutorial and an extensive list
+of more advanced examples that you can use
+to continue learning beyond the scope of this book. It’s worth noting that the
+tidymodels
package does a lot more than just classification, and so the
+examples on the website similarly go beyond classification as well. In the next
+two chapters, you’ll learn about another kind of predictive modeling setting,
+so it might be worth visiting the website only after reading through those
+chapters.As part of exploratory data analysis, it is often helpful to see if there are +meaningful subgroups (or clusters) in the data. +This grouping can be used for many purposes, +such as generating new questions or improving predictive analyses. +This chapter provides an introduction to clustering +using the K-means algorithm, +including techniques to choose the number of clusters.
+By the end of the chapter, readers will be able to do the following:
+tidymodels
workflows.Clustering is a data analysis technique +involving separating a data set into subgroups of related data. +For example, we might use clustering to separate a +data set of documents into groups that correspond to topics, a data set of +human genetic information into groups that correspond to ancestral +subpopulations, or a data set of online customers into groups that correspond +to purchasing behaviors. Once the data are separated, we can, for example, +use the subgroups to generate new questions about the data and follow up with a +predictive modeling exercise. In this course, clustering will be used only for +exploratory analysis, i.e., uncovering patterns in the data.
+Note that clustering is a fundamentally different kind of task +than classification or regression. +In particular, both classification and regression are supervised tasks +where there is a response variable (a category label or value), +and we have examples of past data with labels/values +that help us predict those of future data. +By contrast, clustering is an unsupervised task, +as we are trying to understand +and examine the structure of data without any response variable labels +or values to help us. +This approach has both advantages and disadvantages. +Clustering requires no additional annotation or input on the data. +For example, while it would be nearly impossible to annotate +all the articles on Wikipedia with human-made topic labels, +we can cluster the articles without this information +to find groupings corresponding to topics automatically. +However, given that there is no response variable, it is not as easy to evaluate +the “quality” of a clustering. With classification, we can use a test data set +to assess prediction performance. In clustering, there is not a single good +choice for evaluation. In this book, we will use visualization to ascertain the +quality of a clustering, and leave rigorous evaluation for more advanced +courses.
+As in the case of classification, +there are many possible methods that we could use to cluster our observations +to look for subgroups. +In this book, we will focus on the widely used K-means algorithm (Lloyd 1982). +In your future studies, you might encounter hierarchical clustering, +principal component analysis, multidimensional scaling, and more; +see the additional resources section at the end of this chapter +for where to begin learning more about these other methods.
+ +++Note: There are also so-called semisupervised tasks, +where only some of the data come with response variable labels/values, +but the vast majority don’t. +The goal is to try to uncover underlying structure in the data +that allows one to guess the missing labels. +This sort of task is beneficial, for example, +when one has an unlabeled data set that is too large to manually label, +but one is willing to provide a few informative example labels as a “seed” +to guess the labels for all the data.
+
In this chapter we will focus on a data set from
+the palmerpenguins
R package (Horst, Hill, and Gorman 2020). This
+data set was collected by Dr. Kristen Gorman and
+the Palmer Station, Antarctica Long Term Ecological Research Site, and includes
+measurements for adult penguins (Figure 9.1) found near there (Gorman, Williams, and Fraser 2014).
+Our goal will be to use two
+variables—penguin bill and flipper length, both in millimeters—to determine whether
+there are distinct types of penguins in our data.
+Understanding this might help us with species discovery and classification in a data-driven
+way. Note that we have reduced the size of the data set to 18 observations and 2 variables;
+this will help us make clear visualizations that illustrate how clustering works for learning purposes.
+Figure 9.1: A Gentoo penguin. +
+Before we get started, we will load the tidyverse
metapackage
+as well as set a random seed.
+This will ensure we have access to the functions we need
+and that our analysis will be reproducible.
+As we will learn in more detail later in the chapter,
+setting the seed here is important
+because the K-means clustering algorithm uses randomness
+when choosing a starting position for each cluster.
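A minimal sketch of this setup step (the particular seed value here is an arbitrary choice for illustration; any fixed value makes the analysis reproducible):
library(tidyverse)
# fix the random seed so that the random cluster initializations are reproducible
set.seed(1)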
Now we can load and preview the penguins
data.
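A sketch of the loading step; the file path is an assumption for illustration rather than a confirmed location:
# read the penguin bill and flipper length measurements (file path is assumed)
penguins <- read_csv("data/penguins.csv")
penguins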
## # A tibble: 18 × 2
+## bill_length_mm flipper_length_mm
+## <dbl> <dbl>
+## 1 39.2 196
+## 2 36.5 182
+## 3 34.5 187
+## 4 36.7 187
+## 5 38.1 181
+## 6 39.2 190
+## 7 36 195
+## 8 37.8 193
+## 9 46.5 213
+## 10 46.1 215
+## 11 47.8 215
+## 12 45 220
+## 13 49.1 212
+## 14 43.3 208
+## 15 46 195
+## 16 46.7 195
+## 17 52.2 197
+## 18 46.8 189
+We will begin by using a version of the data that we have standardized, penguins_standardized, to illustrate how K-means clustering works (recall standardization from Chapter 5). Later in this chapter, we will return to the original penguins data to see how to include standardization automatically in the clustering pipeline.
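One way to construct penguins_standardized by hand is sketched below; the book may build it differently, and later in the chapter this standardization is handled automatically by a recipe:
# center and scale each variable to mean 0 and standard deviation 1
penguins_standardized <- penguins |>
  mutate(
    bill_length_standardized = (bill_length_mm - mean(bill_length_mm)) / sd(bill_length_mm),
    flipper_length_standardized = (flipper_length_mm - mean(flipper_length_mm)) / sd(flipper_length_mm)
  ) |>
  select(bill_length_standardized, flipper_length_standardized)

penguins_standardized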
## # A tibble: 18 × 2
+## bill_length_standardized flipper_length_standardized
+## <dbl> <dbl>
+## 1 -0.641 -0.190
+## 2 -1.14 -1.33
+## 3 -1.52 -0.922
+## 4 -1.11 -0.922
+## 5 -0.847 -1.41
+## 6 -0.641 -0.678
+## 7 -1.24 -0.271
+## 8 -0.902 -0.434
+## 9 0.720 1.19
+## 10 0.646 1.36
+## 11 0.963 1.36
+## 12 0.440 1.76
+## 13 1.21 1.11
+## 14 0.123 0.786
+## 15 0.627 -0.271
+## 16 0.757 -0.271
+## 17 1.78 -0.108
+## 18 0.776 -0.759
+Next, we can create a scatter plot of this data set to see if we can detect any subtypes or groups in our data.
+ +ggplot(penguins_standardized,
+ aes(x = flipper_length_standardized,
+ y = bill_length_standardized)) +
+ geom_point() +
+ xlab("Flipper Length (standardized)") +
+ ylab("Bill Length (standardized)") +
+ theme(text = element_text(size = 12))
+Figure 9.2: Scatter plot of standardized bill length versus standardized flipper length. +
+Based on the visualization +in Figure 9.2, +we might suspect there are a few subtypes of penguins within our data set. +We can see roughly 3 groups of observations in Figure 9.2, +including:
+Data visualization is a great tool to give us a rough sense of such patterns +when we have a small number of variables. +But if we are to group data—and select the number of groups—as part of +a reproducible analysis, we need something a bit more automated. +Additionally, finding groups via visualization becomes more difficult +as we increase the number of variables we consider when clustering. +The way to rigorously separate the data into groups +is to use a clustering algorithm. +In this chapter, we will focus on the K-means algorithm, +a widely used and often very effective clustering method, +combined with the elbow method +for selecting the number of clusters. +This procedure will separate the data into groups; +Figure 9.3 shows these groups +denoted by colored scatter points.
++Figure 9.3: Scatter plot of standardized bill length versus standardized flipper length with colored groups. +
+What are the labels for these groups? Unfortunately, we don’t have any. K-means, +like almost all clustering algorithms, just outputs meaningless “cluster labels” +that are typically whole numbers: 1, 2, 3, etc. But in a simple case like this, +where we can easily visualize the clusters on a scatter plot, we can give +human-made labels to the groups using their positions on +the plot:
+Once we have made these determinations, we can use them to inform our species +classifications or ask further questions about our data. For example, we might +be interested in understanding the relationship between flipper length and bill +length, and that relationship may differ depending on the type of penguin we +have.
+The K-means algorithm is a procedure that groups data into K clusters. +It starts with an initial clustering of the data, and then iteratively +improves it by making adjustments to the assignment of data +to clusters until it cannot improve any further. But how do we measure +the “quality” of a clustering, and what does it mean to improve it? +In K-means clustering, we measure the quality of a cluster +by its within-cluster sum-of-squared-distances (WSSD). +Computing this involves two steps. +First, we find the cluster centers by computing the mean of each variable +over data points in the cluster. For example, suppose we have a +cluster containing four observations, and we are using two variables, \(x\) and \(y\), to cluster the data. +Then we would compute the coordinates, \(\mu_x\) and \(\mu_y\), of the cluster center via
+\[\mu_x = \frac{1}{4}(x_1+x_2+x_3+x_4) \quad \mu_y = \frac{1}{4}(y_1+y_2+y_3+y_4).\]
+In the first cluster from the example, there are 4 data points. These are shown with their cluster center +(standardized flipper length -0.35, standardized bill length 0.99) highlighted +in Figure 9.4.
+ +
+Figure 9.4: Cluster 1 from the penguins_standardized
data set example. Observations are small blue points, with the cluster center highlighted as a large blue point with a black outline.
+
The second step in computing the WSSD is to add up the squared distance +between each point in the cluster and the cluster center. +We use the straight-line / Euclidean distance formula +that we learned about in Chapter 5. +In the 4-observation cluster example above, +we would compute the WSSD \(S^2\) via
+\[\begin{align*} +S^2 = \left((x_1 - \mu_x)^2 + (y_1 - \mu_y)^2\right) + \left((x_2 - \mu_x)^2 + (y_2 - \mu_y)^2\right) + \\ \left((x_3 - \mu_x)^2 + (y_3 - \mu_y)^2\right) + \left((x_4 - \mu_x)^2 + (y_4 - \mu_y)^2\right). +\end{align*}\]
+These distances are denoted by lines in Figure 9.5 for the first cluster of the penguin data example.
+ +
+Figure 9.5: Cluster 1 from the penguins_standardized
data set example. Observations are small blue points, with the cluster center highlighted as a large blue point with a black outline. The distances from the observations to the cluster center are represented as black lines.
+
The larger the value of \(S^2\), the more spread out the cluster is, since large \(S^2\) means +that points are far from the cluster center. Note, however, that “large” is relative to both the +scale of the variables for clustering and the number of points in the cluster. A cluster where points +are very close to the center might still have a large \(S^2\) if there are many data points in the cluster.
+After we have calculated the WSSD for all the clusters, +we sum them together to get the total WSSD. For our example, +this means adding up all the squared distances for the 18 observations. +These distances are denoted by black lines in +Figure 9.6.
+ +
+Figure 9.6: All clusters from the penguins_standardized
data set example. Observations are small orange, blue, and yellow points with cluster centers denoted by larger points with a black outline. The distances from the observations to each of the respective cluster centers are represented as black lines.
+
Since K-means uses the straight-line distance to measure the quality of a clustering, +it is limited to clustering based on quantitative variables. +However, note that there are variants of the K-means algorithm, +as well as other clustering algorithms entirely, +that use other distance metrics +to allow for non-quantitative data to be clustered. +These are beyond the scope of this book.
+ +We begin the K-means algorithm by picking K, +and randomly assigning a roughly equal number of observations +to each of the K clusters. +An example random initialization is shown in Figure 9.7.
++Figure 9.7: Random initialization of labels. +
+Then K-means consists of two major steps that attempt to minimize the +sum of WSSDs over all the clusters, i.e., the total WSSD:
+These two steps are repeated until the cluster assignments no longer change. +We show what the first four iterations of K-means would look like in +Figure 9.8. There each pair of plots +in each row corresponds to an iteration, +where the left figure in the pair depicts the center update, +and the right figure in the pair depicts the label update (i.e., the reassignment of data to clusters).
+ +
+Figure 9.8: First four iterations of K-means clustering on the penguins_standardized
example data set. Each pair of plots corresponds to an iteration. Within the pair, the first plot depicts the center update, and the second plot depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black.
+
Note that at this point, we can terminate the algorithm since none of the assignments changed +in the fourth iteration; both the centers and labels will remain the same from this point onward.
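For readers who want to see the two steps written out, here is a bare-bones sketch of the iterations on the standardized penguin data. It is for illustration only (it assumes no cluster ever becomes empty); later in the chapter we use the tidyclust implementation instead.
X <- as.matrix(penguins_standardized)
K <- 3

# random initialization: assign a roughly equal number of observations to each cluster
labels <- sample(rep(1:K, length.out = nrow(X)))

repeat {
  # center update: the mean of each variable within each cluster
  centers <- apply(X, 2, function(col) tapply(col, labels, mean))

  # label update: reassign each observation to its nearest cluster center
  dists <- sapply(1:K, function(k) rowSums(sweep(X, 2, centers[k, ])^2))
  new_labels <- apply(dists, 1, which.min)

  # stop once the assignments no longer change
  if (all(new_labels == labels)) break
  labels <- new_labels
}

labels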
+++Note: Is K-means guaranteed to stop at some point, or could it iterate forever? As it turns out, +thankfully, the answer is that K-means is guaranteed to stop after some number of iterations. For the interested reader, the +logic for this has three steps: (1) both the label update and the center update decrease total WSSD in each iteration, +(2) the total WSSD is always greater than or equal to 0, and (3) there are only a finite number of possible +ways to assign the data to clusters. So at some point, the total WSSD must stop decreasing, which means none of the assignments +are changing, and the algorithm terminates.
+
Unlike the classification and regression models we studied in previous chapters, K-means can get “stuck” in a bad solution. +For example, Figure 9.9 illustrates an unlucky random initialization by K-means.
++Figure 9.9: Random initialization of labels. +
+Figure 9.10 shows what the iterations of K-means would look like with the unlucky random initialization shown in Figure 9.9.
+ +
+Figure 9.10: First five iterations of K-means clustering on the penguins_standardized
example data set with a poor random initialization. Each pair of plots corresponds to an iteration. Within the pair, the first plot depicts the center update, and the second plot depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black.
+
This looks like a relatively bad clustering of the data, but K-means cannot improve it. +To solve this problem when clustering data using K-means, we should randomly re-initialize the labels a few times, run K-means for each initialization, +and pick the clustering that has the lowest final total WSSD.
+In order to cluster data using K-means, +we also have to pick the number of clusters, K. +But unlike in classification, we have no response variable +and cannot perform cross-validation with some measure of model prediction error. +Further, if K is chosen too small, then multiple clusters get grouped together; +if K is too large, then clusters get subdivided. +In both cases, we will potentially miss interesting structure in the data. +Figure 9.11 illustrates the impact of K +on K-means clustering of our penguin flipper and bill length data +by showing the different clusterings for K’s ranging from 1 to 9.
++Figure 9.11: Clustering of the penguin data for K clusters ranging from 1 to 9. Cluster centers are indicated by larger points that are outlined in black. +
+If we set K less than 3, then the clustering merges separate groups of data; this causes a large +total WSSD, since the cluster center is not close to any of the data in the cluster. On +the other hand, if we set K greater than 3, the clustering subdivides subgroups of data; this does indeed still +decrease the total WSSD, but by only a diminishing amount. If we plot the total WSSD versus the number of +clusters, we see that the decrease in total WSSD levels off (or forms an “elbow shape”) when we reach roughly +the right number of clusters (Figure 9.12).
++Figure 9.12: Total WSSD for K clusters ranging from 1 to 9. +
+We can perform K-means clustering in R using a tidymodels
workflow similar
+to those in the earlier classification and regression chapters.
+We will begin by loading the tidyclust
library, which contains the necessary
+functionality.
Returning to the original (unstandardized) penguins
data,
+recall that K-means clustering uses straight-line
+distance to decide which points are similar to
+each other. Therefore, the scale of each of the variables in the data
+will influence which cluster data points end up being assigned.
+Variables with a large scale will have a much larger
+effect on deciding cluster assignment than variables with a small scale.
+To address this problem, we need to create a recipe that
+standardizes our data
+before clustering using the step_scale
and step_center
preprocessing steps.
+Standardization will ensure that each variable has a mean
+of 0 and standard deviation of 1 prior to clustering.
+We will designate that all variables are to be used in clustering via
+the model formula ~ .
.
++Note: Recipes were originally designed specifically for predictive data +analysis problems—like classification and regression—not clustering +problems. So the functions in R that we use to construct recipes are a little bit +awkward in the setting of clustering In particular, we will have to treat +“predictors” here as if it meant “variables to be used in clustering”. So the +model formula
+~ .
specifies that all variables are “predictors”, i.e., all variables +should be used for clustering. Similarly, when we use theall_predictors()
function +in the preprocessing steps, we really mean “apply this step to all variables used for +clustering.”
kmeans_recipe <- recipe(~ ., data=penguins) |>
+ step_scale(all_predictors()) |>
+ step_center(all_predictors())
+kmeans_recipe
##
+## ── Recipe ──────────
+##
+## ── Inputs
+## Number of variables by role
+## predictor: 2
+##
+## ── Operations
+## • Scaling for: all_predictors()
+## • Centering for: all_predictors()
+To indicate that we are performing K-means clustering, we will use the k_means
+model specification. We will use the num_clusters
argument to specify the number
+of clusters (here we choose K = 3), and specify that we are using the "stats"
engine.
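A sketch of what that specification presumably looks like, using the tidyclust k_means model function:
kmeans_spec <- k_means(num_clusters = 3) |>
  set_engine("stats")

kmeans_spec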
## K Means Cluster Specification (partition)
+##
+## Main Arguments:
+## num_clusters = 3
+##
+## Computational engine: stats
+To actually run the K-means clustering, we combine the recipe and model
+specification in a workflow, and use the fit
function. Note that the
+K-means algorithm uses a random initialization of assignments; but since
+we set the random seed earlier, the clustering will be reproducible.
kmeans_fit <- workflow() |>
+ add_recipe(kmeans_recipe) |>
+ add_model(kmeans_spec) |>
+ fit(data = penguins)
+
+kmeans_fit
## ══ Workflow [trained] ══════════
+## Preprocessor: Recipe
+## Model: k_means()
+##
+## ── Preprocessor ──────────
+## 2 Recipe Steps
+##
+## • step_scale()
+## • step_center()
+##
+## ── Model ──────────
+## K-means clustering with 3 clusters of sizes 4, 6, 8
+##
+## Cluster means:
+## bill_length_mm flipper_length_mm
+## 1 0.9858721 -0.3524358
+## 2 0.6828058 1.2606357
+## 3 -1.0050404 -0.7692589
+##
+## Clustering vector:
+## [1] 3 3 3 3 3 3 3 3 2 2 2 2 2 2 1 1 1 1
+##
+## Within cluster sum of squares by cluster:
+## [1] 1.098928 1.247042 2.121932
+## (between_SS / total_SS = 86.9 %)
+##
+## Available components:
+##
+## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
+## [6] "betweenss" "size" "iter" "ifault"
+As you can see above, the fit object has a lot of information that can be used to visualize the clusters, pick K, and evaluate the total WSSD. Let’s start by visualizing the clusters as a colored scatter plot! In order to do that, we first need to augment our original data frame with the cluster assignments. We can achieve this using the augment function from tidyclust.
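A sketch of that step, producing the clustered_data data frame used below:
clustered_data <- kmeans_fit |>
  augment(penguins)

clustered_data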
## # A tibble: 18 × 3
+## bill_length_mm flipper_length_mm .pred_cluster
+## <dbl> <dbl> <fct>
+## 1 39.2 196 Cluster_1
+## 2 36.5 182 Cluster_1
+## 3 34.5 187 Cluster_1
+## 4 36.7 187 Cluster_1
+## 5 38.1 181 Cluster_1
+## 6 39.2 190 Cluster_1
+## 7 36 195 Cluster_1
+## 8 37.8 193 Cluster_1
+## 9 46.5 213 Cluster_2
+## 10 46.1 215 Cluster_2
+## 11 47.8 215 Cluster_2
+## 12 45 220 Cluster_2
+## 13 49.1 212 Cluster_2
+## 14 43.3 208 Cluster_2
+## 15 46 195 Cluster_3
+## 16 46.7 195 Cluster_3
+## 17 52.2 197 Cluster_3
+## 18 46.8 189 Cluster_3
+Now that we have the cluster assignments included in the clustered_data
tidy data frame, we can
+visualize them as shown in Figure 9.13.
+Note that we are plotting the un-standardized data here; if we for some reason wanted to
+visualize the standardized data from the recipe, we would need to use the bake
function
+to obtain that first.
cluster_plot <- ggplot(clustered_data,
+ aes(x = flipper_length_mm,
+ y = bill_length_mm,
+ color = .pred_cluster),
+ size = 2) +
+ geom_point() +
+ labs(x = "Flipper Length",
+ y = "Bill Length",
+ color = "Cluster") +
+ scale_color_manual(values = c("steelblue",
+ "darkorange",
+ "goldenrod1")) +
+ theme(text = element_text(size = 12))
+
+cluster_plot
+Figure 9.13: The data colored by the cluster assignments returned by K-means. +
+As mentioned above, we also need to select K by finding where the “elbow” occurs in the plot of total WSSD versus the number of clusters. We can obtain the total WSSD (tot.withinss) from our clustering with 3 clusters using the glance function.
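The call that produces the output below is presumably along these lines, applied to the fitted workflow:
glance(kmeans_fit)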
## # A tibble: 1 × 4
+## totss tot.withinss betweenss iter
+## <dbl> <dbl> <dbl> <int>
+## 1 34 4.47 29.5 2
+To calculate the total WSSD for a variety of Ks, we will create a data frame with a column named num_clusters with rows containing each value of K we want to run K-means with (here, 1 to 9).
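A sketch of constructing that data frame (named penguin_clust_ks, as referenced below):
penguin_clust_ks <- tibble(num_clusters = 1:9)

penguin_clust_ks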
## # A tibble: 9 × 1
+## num_clusters
+## <int>
+## 1 1
+## 2 2
+## 3 3
+## 4 4
+## 5 5
+## 6 6
+## 7 7
+## 8 8
+## 9 9
+Then we construct our model specification again, this time specifying that we want to tune the num_clusters parameter.
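A sketch of the updated specification:
kmeans_spec <- k_means(num_clusters = tune()) |>
  set_engine("stats")

kmeans_spec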
## K Means Cluster Specification (partition)
+##
+## Main Arguments:
+## num_clusters = tune()
+##
+## Computational engine: stats
+We combine the recipe and specification in a workflow, and then
+use the tune_cluster
function to run K-means on each of the different
+settings of num_clusters
. The grid
argument controls which values of
+K we want to try—in this case, the values from 1 to 9 that are
+stored in the penguin_clust_ks
data frame. We set the resamples
+argument to apparent(penguins)
to tell K-means to run on the whole
+data set for each value of num_clusters
. Finally, we collect the results
+using the collect_metrics
function.
kmeans_results <- workflow() |>
+ add_recipe(kmeans_recipe) |>
+ add_model(kmeans_spec) |>
+ tune_cluster(resamples = apparent(penguins), grid = penguin_clust_ks) |>
+ collect_metrics()
+kmeans_results
## # A tibble: 18 × 7
+## num_clusters .metric .estimator mean n std_err .config
+## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
+## 1 1 sse_total standard 34 1 NA Preprocessor1_…
+## 2 1 sse_within_total standard 34 1 NA Preprocessor1_…
+## 3 2 sse_total standard 34 1 NA Preprocessor1_…
+## 4 2 sse_within_total standard 10.9 1 NA Preprocessor1_…
+## 5 3 sse_total standard 34 1 NA Preprocessor1_…
+## 6 3 sse_within_total standard 4.47 1 NA Preprocessor1_…
+## 7 4 sse_total standard 34 1 NA Preprocessor1_…
+## 8 4 sse_within_total standard 3.54 1 NA Preprocessor1_…
+## 9 5 sse_total standard 34 1 NA Preprocessor1_…
+## 10 5 sse_within_total standard 2.23 1 NA Preprocessor1_…
+## 11 6 sse_total standard 34 1 NA Preprocessor1_…
+## 12 6 sse_within_total standard 1.75 1 NA Preprocessor1_…
+## 13 7 sse_total standard 34 1 NA Preprocessor1_…
+## 14 7 sse_within_total standard 2.06 1 NA Preprocessor1_…
+## 15 8 sse_total standard 34 1 NA Preprocessor1_…
+## 16 8 sse_within_total standard 2.46 1 NA Preprocessor1_…
+## 17 9 sse_total standard 34 1 NA Preprocessor1_…
+## 18 9 sse_within_total standard 0.906 1 NA Preprocessor1_…
+The total WSSD results correspond to the mean column when the .metric variable is equal to sse_within_total. We can obtain a tidy data frame with this information using filter and mutate.
kmeans_results <- kmeans_results |>
+ filter(.metric == "sse_within_total") |>
+ mutate(total_WSSD = mean) |>
+ select(num_clusters, total_WSSD)
+kmeans_results
## # A tibble: 9 × 2
+## num_clusters total_WSSD
+## <int> <dbl>
+## 1 1 34
+## 2 2 10.9
+## 3 3 4.47
+## 4 4 3.54
+## 5 5 2.23
+## 6 6 1.75
+## 7 7 2.06
+## 8 8 2.46
+## 9 9 0.906
+Now that we have total_WSSD
and num_clusters
as columns in a data frame, we can make a line plot
+(Figure 9.14) and search for the “elbow” to find which value of K to use.
elbow_plot <- ggplot(kmeans_results, aes(x = num_clusters, y = total_WSSD)) +
+ geom_point() +
+ geom_line() +
+ xlab("K") +
+ ylab("Total within-cluster sum of squares") +
+ scale_x_continuous(breaks = 1:9) +
+ theme(text = element_text(size = 12))
+
+elbow_plot
+Figure 9.14: A plot showing the total WSSD versus the number of clusters. +
+It looks like 3 clusters is the right choice for this data.
+But why is there a “bump” in the total WSSD plot here?
+Shouldn’t total WSSD always decrease as we add more clusters?
+Technically yes, but remember: K-means can get “stuck” in a bad solution.
+Unfortunately, for K = 8 we had an unlucky initialization
+and found a bad clustering!
+We can help prevent finding a bad clustering
+by trying a few different random initializations
+via the nstart
argument in the model specification.
+Here we will try using 10 restarts.
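A sketch of the specification with the engine-specific nstart argument:
kmeans_spec <- k_means(num_clusters = tune()) |>
  set_engine("stats", nstart = 10)

kmeans_spec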
## K Means Cluster Specification (partition)
+##
+## Main Arguments:
+## num_clusters = tune()
+##
+## Engine-Specific Arguments:
+## nstart = 10
+##
+## Computational engine: stats
+Now if we rerun the same workflow with the new model specification,
+K-means clustering will be performed nstart = 10
times for each value of K.
+The collect_metrics
function will then pick the best clustering of the 10 runs for each value of K,
+and report the results for that best clustering.
+Figure 9.15 shows the resulting
+total WSSD plot from using 10 restarts; the bump is gone and the total WSSD decreases as expected.
+The more times we perform K-means clustering,
+the more likely we are to find a good clustering (if one exists).
+What value should you choose for nstart
? The answer is that it depends
+on many factors: the size and characteristics of your data set,
+as well as how powerful your computer is.
+The larger the nstart
value the better from an analysis perspective,
+but there is a trade-off that doing many clusterings
+could take a long time.
+So this is something that needs to be balanced.
kmeans_results <- workflow() |>
+ add_recipe(kmeans_recipe) |>
+ add_model(kmeans_spec) |>
+ tune_cluster(resamples = apparent(penguins), grid = penguin_clust_ks) |>
+ collect_metrics() |>
+ filter(.metric == "sse_within_total") |>
+ mutate(total_WSSD = mean) |>
+ select(num_clusters, total_WSSD)
+
+elbow_plot <- ggplot(kmeans_results, aes(x = num_clusters, y = total_WSSD)) +
+ geom_point() +
+ geom_line() +
+ xlab("K") +
+ ylab("Total within-cluster sum of squares") +
+ scale_x_continuous(breaks = 1:9) +
+ theme(text = element_text(size = 12))
+
+elbow_plot
+Figure 9.15: A plot showing the total WSSD versus the number of clusters when K-means is run with 10 restarts. +
+Practice exercises for the material covered in this chapter +can be found in the accompanying +worksheets repository +in the “Clustering” row. +You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. +You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” +If you instead decide to download the worksheet and run it on your own machine, +make sure to follow the instructions for computer setup +found in Chapter 13. This will ensure that the automated feedback +and guidance that the worksheets provide will function as intended.
+Roger D. Peng
+Johns Hopkins Bloomberg School of Public Health
+2022-01-04
+The field of data science has expanded and grown significantly in recent years, +attracting excitement and interest from many different directions. The demand for introductory +educational materials has grown concurrently with the growth of the field itself, leading to +a proliferation of textbooks, courses, blog posts, and tutorials. This book is an important +contribution to this fast-growing literature, but given the wide availability of materials, a +reader should be inclined to ask, “What is the unique contribution of this book?” In order +to answer that question it is useful to step back for a moment and consider the development +of the field of data science over the past few years.
+When thinking about data science, it is important to consider two questions: “What is +data science?” and “How should one do data science?” The former question is under active +discussion amongst a broad community of researchers and practitioners and there does +not appear to be much consensus to date. However, there seems a general understanding +that data science focuses on the more “active” elements—data wrangling, cleaning, and +analysis—of answering questions with data. These elements are often highly +problem-specific and may seem difficult to generalize across applications. Nevertheless, over time we +have seen some core elements emerge that appear to repeat themselves as useful concepts +across different problems. Given the lack of clear agreement over the definition of data +science, there is a strong need for a book like this one to propose a vision for what the field +is and what the implications are for the activities in which members of the field engage.
+The first important concept addressed by this book is tidy data, which is a format for +tabular data formally introduced to the statistical community in a 2014 paper by Hadley +Wickham. The tidy data organization strategy has proven a powerful abstract concept for +conducting data analysis, in large part because of the vast toolchain implemented in the +Tidyverse collection of R packages. The second key concept is the development of workflows +for reproducible and auditable data analyses. Modern data analyses have only grown in +complexity due to the availability of data and the ease with which we can implement complex +data analysis procedures. Furthermore, these data analyses are often part of +decision-making processes that may have significant impacts on people and communities. Therefore, +there is a critical need to build reproducible analyses that can be studied and repeated by +others in a reliable manner. Statistical methods clearly represent an important element +of data science for building prediction and classification models and for making inferences +about unobserved populations. Finally, because a field can succeed only if it fosters an +active and collaborative community, it has become clear that being fluent in the tools of +collaboration is a core element of data science.
+This book takes these core concepts and focuses on how one can apply them to do data +science in a rigorous manner. Students who learn from this book will be well-versed in +the techniques and principles behind producing reliable evidence from data. This book is +centered around the use of the R programming language within the tidy data framework, +and as such employs the most recent advances in data analysis coding. The use of Jupyter +notebooks for exercises immediately places the student in an environment that encourages +auditability and reproducibility of analyses. The integration of git and GitHub into the +course is a key tool for teaching about collaboration and community, key concepts that are +critical to data science.
+The demand for training in data science continues to increase. The availability of large +quantities of data to answer a variety of questions, the computational power available to +many more people than ever before, and the public awareness of the importance of data for +decision-making have all contributed to the need for high-quality data science work. This +book provides a sophisticated first introduction to the field of data science and provides +a balanced mix of practical skills along with generalizable principles. As we continue to +introduce students to data science and train them to confront an expanding array of data +science problems, they will be well-served by the ideas presented here.
+ +
+Data Science
2024-08-21
+This is the website for Data Science: A First Introduction. +You can read the web version of the book on this site. Click a section in the table of contents +on the left side of the page to navigate to it. If you are on a mobile device, +you may need to open the table of contents first by clicking the menu button on +the top left of the page. You can purchase a PDF or print copy of the book +on the CRC Press website or on Amazon.
+For the python version of the textbook, visit https://python.datasciencebook.ca.
+This book is listed in a number of open educational resource (OER) collections:
+ +This work by Tiffany Timbers, Trevor Campbell, +and Melissa Lee is licensed under +a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
+ +A typical data analysis task in practice is to draw conclusions about some +unknown aspect of a population of interest based on observed data sampled from +that population; we typically do not get data on the entire population. Data +analysis questions regarding how summaries, patterns, trends, or relationships +in a data set extend to the wider population are called inferential +questions. This chapter will start with the fundamental ideas of sampling from +populations and then introduce two common techniques in statistical inference: +point estimation and interval estimation.
+By the end of the chapter, readers will be able to do the following:
+We often need to understand how quantities we observe in a subset +of data relate to the same quantities in the broader population. For example, suppose a +retailer is considering selling iPhone accessories, and they want to estimate +how big the market might be. Additionally, they want to strategize how they can +market their products on North American college and university campuses. This +retailer might formulate the following question:
+What proportion of all undergraduate students in North America own an iPhone?
+In the above question, we are interested in making a conclusion about all +undergraduate students in North America; this is referred to as the population. In +general, the population is the complete collection of individuals or cases we +are interested in studying. Further, in the above question, we are interested +in computing a quantity—the proportion of iPhone owners—based on +the entire population. This proportion is referred to as a population parameter. In +general, a population parameter is a numerical characteristic of the entire +population. To compute this number in the example above, we would need to ask +every single undergraduate in North America whether they own an iPhone. In +practice, directly computing population parameters is often time-consuming and +costly, and sometimes impossible.
+A more practical approach would be to make measurements for a sample, i.e., a +subset of individuals collected from the population. We can then compute a +sample estimate—a numerical characteristic of the sample—that +estimates the population parameter. For example, suppose we randomly selected +ten undergraduate students across North America (the sample) and computed the +proportion of those students who own an iPhone (the sample estimate). In that +case, we might suspect that proportion is a reasonable estimate of the +proportion of students who own an iPhone in the entire population. Figure +10.1 illustrates this process. +In general, the process of using a sample to make a conclusion about the +broader population from which it is taken is referred to as statistical inference. +
++Figure 10.1: The process of using a sample from a broader population to obtain a point estimate of a population parameter. In this case, a sample of 10 individuals yielded 6 who own an iPhone, resulting in an estimated population proportion of 60% iPhone owners. The actual population proportion in this example illustration is 53.8%. +
+Note that proportions are not the only kind of population parameter we might +be interested in. For example, suppose an undergraduate student studying at the University +of British Columbia in Canada is looking for an apartment +to rent. They need to create a budget, so they want to know about +studio apartment rental prices in Vancouver. This student might +formulate the question:
+What is the average price per month of studio apartment rentals in Vancouver?
+In this case, the population consists of all studio apartment rentals in Vancouver, and the +population parameter is the average price per month. Here we used the average +as a measure of the center to describe the “typical value” of studio apartment +rental prices. But even within this one example, we could also be interested in +many other population parameters. For instance, we know that not every studio +apartment rental in Vancouver will have the same price per month. The student +might be interested in how much monthly prices vary and want to find a measure +of the rentals’ spread (or variability), such as the standard deviation. Or perhaps the +student might be interested in the fraction of studio apartment rentals that +cost more than $1000 per month. The question we want to answer will help us +determine the parameter we want to estimate. If we were somehow able to observe +the whole population of studio apartment rental offerings in Vancouver, we +could compute each of these numbers exactly; therefore, these are all +population parameters. There are many kinds of observations and population +parameters that you will run into in practice, but in this chapter, we will +focus on two settings:
+We will look at an example using data from +Inside Airbnb (Cox n.d.). Airbnb is an online +marketplace for arranging vacation rentals and places to stay. The data set +contains listings for Vancouver, Canada, in September 2020. Our data +includes an ID number, neighborhood, type of room, the number of people the +rental accommodates, number of bathrooms, bedrooms, beds, and the price per +night.
+ + + +## # A tibble: 4,594 × 8
+## id neighbourhood room_type accommodates bathrooms bedrooms beds price
+## <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
+## 1 1 Downtown Entire h… 5 2 baths 2 2 150
+## 2 2 Downtown Eastside Entire h… 4 2 baths 2 2 132
+## 3 3 West End Entire h… 2 1 bath 1 1 85
+## 4 4 Kensington-Cedar… Entire h… 2 1 bath 1 0 146
+## 5 5 Kensington-Cedar… Entire h… 4 1 bath 1 2 110
+## 6 6 Hastings-Sunrise Entire h… 4 1 bath 2 3 195
+## 7 7 Renfrew-Collingw… Entire h… 8 3 baths 4 5 130
+## 8 8 Mount Pleasant Entire h… 2 1 bath 1 1 94
+## 9 9 Grandview-Woodla… Private … 2 1 privat… 1 1 79
+## 10 10 West End Private … 2 1 privat… 1 1 75
+## # ℹ 4,584 more rows
+Suppose the city of Vancouver wants information about Airbnb rentals to help
+plan city bylaws, and they want to know how many Airbnb places are listed as
+entire homes and apartments (rather than as private or shared rooms). Therefore
+they may want to estimate the true proportion of all Airbnb listings where the
+“type of place” is listed as “entire home or apartment.” Of course, we usually
+do not have access to the true population, but here let’s imagine (for learning
+purposes) that our data set represents the population of all Airbnb rental
+listings in Vancouver, Canada. We can find the proportion of listings where
+room_type == "Entire home/apt"
.
+
airbnb |>
+ summarize(
+ n = sum(room_type == "Entire home/apt"),
+ proportion = sum(room_type == "Entire home/apt") / nrow(airbnb)
+ )
## # A tibble: 1 × 2
+## n proportion
+## <int> <dbl>
+## 1 3434 0.747
+We can see that the proportion of Entire home/apt
listings in
+the data set is 0.747. This
+value, 0.747, is the population parameter. Remember, this
+parameter value is usually unknown in real data analysis problems, as it is
+typically not possible to make measurements for an entire population.
Instead, perhaps we can approximate it with a small subset of data!
+To investigate this idea, let’s try randomly selecting 40 listings (i.e., taking a random sample of
+size 40 from our population), and computing the proportion for that sample.
+We will use the rep_sample_n
function from the infer
+package to take the sample. The arguments of rep_sample_n
are (1) the data frame to
+sample from, and (2) the size of the sample to take.
library(infer)
+
+sample_1 <- rep_sample_n(tbl = airbnb, size = 40)
+
+airbnb_sample_1 <- summarize(sample_1,
+ n = sum(room_type == "Entire home/apt"),
+ prop = sum(room_type == "Entire home/apt") / 40
+)
+
+airbnb_sample_1
## # A tibble: 1 × 3
+## replicate n prop
+## <int> <int> <dbl>
+## 1 1 28 0.7
+Here we see that the proportion of entire home/apartment listings in this +random sample is 0.7. Wow—that’s close to our +true population value! But remember, we computed the proportion using a random sample of size 40. +This has two consequences. First, this value is only an estimate, i.e., our best guess +of our population parameter using this sample. +Given that we are estimating a single value here, we often +refer to it as a point estimate. Second, since the sample was random, +if we were to take another random sample of size 40 and compute the proportion for that sample, +we would not get the same answer:
+sample_2 <- rep_sample_n(airbnb, size = 40)
+
+airbnb_sample_2 <- summarize(sample_2,
+ n = sum(room_type == "Entire home/apt"),
+ prop = sum(room_type == "Entire home/apt") / 40
+)
+
+airbnb_sample_2
## # A tibble: 1 × 3
+## replicate n prop
+## <int> <int> <dbl>
+## 1 1 35 0.875
+Confirmed! We get a different value for our estimate this time. +That means that our point estimate might be unreliable. Indeed, estimates vary from sample to +sample due to sampling variability. But just how much +should we expect the estimates of our random samples to vary? +Or in other words, how much can we really trust our point estimate based on a single sample?
+To understand this, we will simulate many samples (much more than just two) +of size 40 from our population of listings and calculate the proportion of +entire home/apartment listings in each sample. This simulation will create +many sample proportions, which we can visualize using a histogram. The +distribution of the estimate for all possible samples of a given size (which we +commonly refer to as \(n\)) from a population is called +a sampling distribution. The sampling distribution will help us see how much we would +expect our sample proportions from this population to vary for samples of size 40.
+We again use the rep_sample_n
to take samples of size 40 from our
+population of Airbnb listings. But this time we set the reps
argument to 20,000 to specify
+that we want to take 20,000 samples of size 40.
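A sketch of that call, storing the result in the samples data frame used below:
samples <- rep_sample_n(airbnb, size = 40, reps = 20000)

samples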
+
## # A tibble: 800,000 × 9
+## # Groups: replicate [20,000]
+## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds
+## <int> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl>
+## 1 1 4403 Downtown Entire h… 2 1 bath 1 1
+## 2 1 902 Kensington-C… Private … 2 1 shared… 1 1
+## 3 1 3808 Hastings-Sun… Entire h… 6 1.5 baths 1 3
+## 4 1 561 Kensington-C… Entire h… 6 1 bath 2 2
+## 5 1 3385 Mount Pleasa… Entire h… 4 1 bath 1 1
+## 6 1 4232 Shaughnessy Entire h… 6 1.5 baths 2 2
+## 7 1 1169 Downtown Entire h… 3 1 bath 1 1
+## 8 1 959 Kitsilano Private … 1 1.5 shar… 1 1
+## 9 1 2171 Downtown Entire h… 2 1 bath 1 1
+## 10 1 1258 Dunbar South… Entire h… 4 1 bath 2 2
+## # ℹ 799,990 more rows
+## # ℹ 1 more variable: price <dbl>
+Notice that the column replicate
indicates the replicate, or sample, to which
+each listing belongs. Above, since by default R only prints the first few rows,
+it looks like all of the listings have replicate
set to 1. But you can
+check the last few entries using the tail()
function to verify that
+we indeed created 20,000 samples (or replicates).
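For example:
tail(samples)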
## # A tibble: 6 × 9
+## # Groups: replicate [1]
+## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds
+## <int> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl>
+## 1 20000 3414 Marpole Entire h… 4 1 bath 2 2
+## 2 20000 1974 Hastings-Sunr… Private … 2 1 shared… 1 1
+## 3 20000 1846 Riley Park Entire h… 4 1 bath 2 3
+## 4 20000 862 Downtown Entire h… 5 2 baths 2 2
+## 5 20000 3295 Victoria-Fras… Private … 2 1 shared… 1 1
+## 6 20000 997 Dunbar Southl… Private … 1 1.5 shar… 1 1
+## # ℹ 1 more variable: price <dbl>
+Now that we have obtained the samples, we need to compute the
+proportion of entire home/apartment listings in each sample.
+We first group the data by the replicate
variable—to group the
+set of listings in each sample together—and then use summarize
+to compute the proportion in each sample.
+We print both the first and last few entries of the resulting data frame
+below to show that we end up with 20,000 point estimates, one for each of the 20,000 samples.
sample_estimates <- samples |>
+ group_by(replicate) |>
+ summarize(sample_proportion = sum(room_type == "Entire home/apt") / 40)
+
+sample_estimates
## # A tibble: 20,000 × 2
+## replicate sample_proportion
+## <int> <dbl>
+## 1 1 0.85
+## 2 2 0.85
+## 3 3 0.65
+## 4 4 0.7
+## 5 5 0.75
+## 6 6 0.725
+## 7 7 0.775
+## 8 8 0.775
+## 9 9 0.7
+## 10 10 0.675
+## # ℹ 19,990 more rows
+
+## # A tibble: 6 × 2
+## replicate sample_proportion
+## <int> <dbl>
+## 1 19995 0.75
+## 2 19996 0.675
+## 3 19997 0.625
+## 4 19998 0.75
+## 5 19999 0.875
+## 6 20000 0.65
+We can now visualize the sampling distribution of sample proportions +for samples of size 40 using a histogram in Figure 10.2. Keep in mind: in the real world, +we don’t have access to the full population. So we +can’t take many samples and can’t actually construct or visualize the sampling distribution. +We have created this particular example +such that we do have access to the full population, which lets us visualize the +sampling distribution directly for learning purposes.
+sampling_distribution <- ggplot(sample_estimates, aes(x = sample_proportion)) +
+ geom_histogram(color = "lightgrey", bins = 12) +
+ labs(x = "Sample proportions", y = "Count") +
+ theme(text = element_text(size = 12))
+
+sampling_distribution
+Figure 10.2: Sampling distribution of the sample proportion for sample size 40. +
+The sampling distribution in Figure 10.2 appears +to be bell-shaped, is roughly symmetric, and has one peak. It is centered +around 0.7 and the sample proportions +range from about 0.4 to about +1. In fact, we can +calculate the mean of the sample proportions.
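One way to compute that mean is to summarize the sample_proportion column of the sample_estimates data frame created above; a call along these lines gives the value printed below.

# average of the 20,000 sample proportions
sample_estimates |>
  summarize(mean_proportion = mean(sample_proportion))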
+ +## # A tibble: 1 × 1
+## mean_proportion
+## <dbl>
+## 1 0.747
+We notice that the sample proportions are centered around the population +proportion value, 0.747! In general, the mean of +the sampling distribution should be equal to the population proportion. +This is great news because it means that the sample proportion is neither an overestimate nor an +underestimate of the population proportion. +In other words, if you were to take many samples as we did above, there is no tendency +towards over or underestimating the population proportion. +In a real data analysis setting where you just have access to your single +sample, this implies that you would suspect that your sample point estimate is +roughly equally likely to be above or below the true population proportion.
+In the previous section, our variable of interest—room_type
—was
+categorical, and the population parameter was a proportion. As mentioned in
+the chapter introduction, there are many choices of the population parameter
+for each type of variable. What if we wanted to infer something about a
+population of quantitative variables instead? For instance, a traveler
+visiting Vancouver, Canada may wish to estimate the
+population mean (or average) price per night of Airbnb listings. Knowing
+the average could help them tell whether a particular listing is overpriced.
+We can visualize the population distribution of the price per night with a histogram.
population_distribution <- ggplot(airbnb, aes(x = price)) +
+ geom_histogram(color = "lightgrey") +
+ labs(x = "Price per night (dollars)", y = "Count") +
+ theme(text = element_text(size = 12))
+
+population_distribution
+Figure 10.3: Population distribution of price per night (dollars) for all Airbnb listings in Vancouver, Canada. +
+In Figure 10.3, we see that the population distribution +has one peak. It is also skewed (i.e., is not symmetric): most of the listings are +less than $250 per night, but a small number of listings cost much more, +creating a long tail on the histogram’s right side. +Along with visualizing the population, we can calculate the population mean, +the average price per night for all the Airbnb listings.
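A summarize call roughly like this one (using the airbnb population data frame) computes that population mean:

# population mean price per night across all listings
airbnb |>
  summarize(mean_price = mean(price))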
+ +## # A tibble: 1 × 1
+## mean_price
+## <dbl>
+## 1 154.51
+The price per night of all Airbnb rentals in Vancouver, BC +is $154.51, on average. This value is our +population parameter since we are calculating it using the population data.
+Now suppose we did not have access to the population data (which is usually the
+case!), yet we wanted to estimate the mean price per night. We could answer
+this question by taking a random sample of as many Airbnb listings as our time
+and resources allow. Let’s say we could do this for 40 listings. What would
+such a sample look like? Let’s take advantage of the fact that we do have
+access to the population data and simulate taking one random sample of 40
+listings in R, again using rep_sample_n
.
+
We can create a histogram to visualize the distribution of observations in the +sample (Figure 10.4), and calculate the mean +of our sample.
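The sampling step itself can be sketched as follows, again assuming the population data frame is named airbnb; the result is the one_sample data frame used in the code below.

# draw a single random sample of 40 listings from the population
one_sample <- airbnb |>
  rep_sample_n(size = 40)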
+sample_distribution <- ggplot(one_sample, aes(price)) +
+ geom_histogram(color = "lightgrey") +
+ labs(x = "Price per night (dollars)", y = "Count") +
+ theme(text = element_text(size = 12))
+
+sample_distribution
+Figure 10.4: Distribution of price per night (dollars) for sample of 40 Airbnb listings. +
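The sample mean reported below can be obtained from one_sample with a summarize call along these lines (rep_sample_n groups its output by replicate, which is why a replicate column appears in the result):

# mean price per night in our single sample of 40 listings
one_sample |>
  group_by(replicate) |>
  summarize(mean_price = mean(price))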
+## # A tibble: 1 × 2
+## replicate mean_price
+## <int> <dbl>
+## 1 1 155.80
+The average value of the sample of size 40 +is $155.80. This +number is a point estimate for the mean of the full population. +Recall that the population mean was +$154.51. So our estimate was fairly close to +the population parameter: the mean was about +0.8% +off. Note that we usually cannot compute the estimate’s accuracy in practice +since we do not have access to the population parameter; if we did, we wouldn’t +need to estimate it!
+Also, recall from the previous section that the point estimate can vary; if we +took another random sample from the population, our estimate’s value might +change. So then, did we just get lucky with our point estimate above? How much +does our estimate vary across different samples of size 40 in this example? +Again, since we have access to the population, we can take many samples and +plot the sampling distribution of sample means for samples of size 40 to +get a sense for this variation. In this case, we’ll use 20,000 samples of size +40.
+ +## # A tibble: 800,000 × 9
+## # Groups: replicate [20,000]
+## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds
+## <int> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl>
+## 1 1 1177 Downtown Entire h… 4 2 baths 2 2
+## 2 1 4063 Downtown Entire h… 2 1 bath 1 1
+## 3 1 2641 Kitsilano Private … 1 1 shared… 1 1
+## 4 1 1941 West End Entire h… 2 1 bath 1 1
+## 5 1 2431 Mount Pleasa… Entire h… 2 1 bath 1 1
+## 6 1 1871 Arbutus Ridge Entire h… 4 1 bath 2 2
+## 7 1 2557 Marpole Private … 3 1 privat… 1 2
+## 8 1 3534 Downtown Entire h… 2 1 bath 1 1
+## 9 1 4379 Downtown Entire h… 4 1 bath 1 0
+## 10 1 2161 Downtown Entire h… 4 2 baths 2 2
+## # ℹ 799,990 more rows
+## # ℹ 1 more variable: price <dbl>
+Now we can calculate the sample mean for each replicate and plot the sampling +distribution of sample means for samples of size 40.
+sample_estimates <- samples |>
+ group_by(replicate) |>
+ summarize(mean_price = mean(price))
+
+sample_estimates
## # A tibble: 20,000 × 2
+## replicate mean_price
+## <int> <dbl>
+## 1 1 160.06
+## 2 2 173.18
+## 3 3 131.20
+## 4 4 176.96
+## 5 5 125.65
+## 6 6 148.84
+## 7 7 134.82
+## 8 8 137.26
+## 9 9 166.11
+## 10 10 157.81
+## # ℹ 19,990 more rows
+sampling_distribution_40 <- ggplot(sample_estimates, aes(x = mean_price)) +
+ geom_histogram(color = "lightgrey") +
+ labs(x = "Sample mean price per night (dollars)", y = "Count") +
+ theme(text = element_text(size = 12))
+
+sampling_distribution_40
+Figure 10.5: Sampling distribution of the sample means for sample size of 40. +
+In Figure 10.5, the sampling distribution of the mean +has one peak and is bell-shaped. Most of the estimates are between +about $140 and +$170; but there is +a good fraction of cases outside this range (i.e., where the point estimate was +not close to the population parameter). So it does indeed look like we were +quite lucky when we estimated the population mean with only +0.8% error.
+Let’s visualize the population distribution, distribution of the sample, and +the sampling distribution on one plot to compare them in Figure +10.6. Comparing these three distributions, the centers +of the distributions are all around the same price (around $150). The original +population distribution has a long right tail, and the sample distribution has +a similar shape to that of the population distribution. However, the sampling +distribution is not shaped like the population or sample distribution. Instead, +it has a bell shape, and it has a lower spread than the population or sample +distributions. The sample means vary less than the individual observations +because there will be some high values and some small values in any random +sample, which will keep the average from being too extreme. +
+ ++Figure 10.6: Comparison of population distribution, sample distribution, and sampling distribution. +
+Given that there is quite a bit of variation in the sampling distribution of +the sample mean—i.e., the point estimate that we obtain is not very +reliable—is there any way to improve the estimate? One way to improve a +point estimate is to take a larger sample. To illustrate what effect this +has, we will take many samples of size 20, 50, 100, and 500, and plot the +sampling distribution of the sample mean. We indicate the mean of the sampling +distribution with a vertical dashed line.
++Figure 10.7: Comparison of sampling distributions, with mean highlighted as a vertical dashed line. +
+Based on the visualization in Figure 10.7, three points +about the sample mean become clear. First, the mean of the sample mean (across +samples) is equal to the population mean. In other words, the sampling +distribution is centered at the population mean. Second, increasing the size of +the sample decreases the spread (i.e., the variability) of the sampling +distribution. Therefore, a larger sample size results in a more reliable point +estimate of the population parameter. And third, the distribution of the sample +mean is roughly bell-shaped.
+++ +Note: You might notice that in the
+n = 20
case in Figure 10.7, +the distribution is not quite bell-shaped. There is a bit of skew towards the right! +You might also notice that in then = 50
case and larger, that skew seems to disappear. +In general, the sampling distribution—for both means and proportions—only +becomes bell-shaped once the sample size is large enough. +How large is “large enough?” Unfortunately, it depends entirely on the problem at hand. But +as a rule of thumb, often a sample size of at least 20 will suffice.
Why all this emphasis on sampling distributions?
+We saw in the previous section that we could compute a point estimate of a +population parameter using a sample of observations from the population. And +since we constructed examples where we had access to the population, we could +evaluate how accurate the estimate was, and even get a sense of how much the +estimate would vary for different samples from the population. But in real +data analysis settings, we usually have just one sample from our population +and do not have access to the population itself. Therefore we cannot construct +the sampling distribution as we did in the previous section. And as we saw, our +sample estimate’s value can vary significantly from the population parameter. +So reporting the point estimate from a single sample alone may not be enough. +We also need to report some notion of uncertainty in the value of the point +estimate.
+Unfortunately, we cannot construct the exact sampling distribution without +full access to the population. However, if we could somehow approximate what +the sampling distribution would look like for a sample, we could +use that approximation to then report how uncertain our sample +point estimate is (as we did above with the exact sampling +distribution). There are several methods to accomplish this; in this book, we +will use the bootstrap. We will discuss interval estimation and +construct +confidence intervals using just a single sample from a population. A +confidence interval is a range of plausible values for our population parameter.
+Here is the key idea. First, if you take a big enough sample, it looks like +the population. Notice the histograms’ shapes for samples of different sizes +taken from the population in Figure 10.8. We +see that the sample’s distribution looks like that of the population for a +large enough sample.
++Figure 10.8: Comparison of samples of different sizes from the population. +
+In the previous section, we took many samples of the same size from our +population to get a sense of the variability of a sample estimate. But if our +sample is big enough that it looks like our population, we can pretend that our +sample is the population, and take more samples (with replacement) of the +same size from it instead! This very clever technique is +called the bootstrap. Note that by taking many samples from our single, observed +sample, we do not obtain the true sampling distribution, but rather an +approximation that we call the bootstrap distribution.
+ +++Note: We must sample with replacement when using the bootstrap. +Otherwise, if we had a sample of size \(n\), and obtained a sample from it of +size \(n\) without replacement, it would just return our original sample!
+
This section will explore how to create a bootstrap distribution from a single sample using R. The process is visualized in Figure 10.9. For a sample of size \(n\), you would do the following:
1. Randomly select an observation from the original sample, which was drawn from the population.
2. Record the observation's value.
3. Replace that observation back into the original sample.
4. Repeat steps 1–3 (sampling with replacement) until you have \(n\) observations, which form a bootstrap sample.
5. Calculate the bootstrap point estimate (e.g., mean, median, proportion, slope, etc.) of the \(n\) observations in your bootstrap sample.
++Figure 10.9: Overview of the bootstrap process. +
+Let’s continue working with our Airbnb example to illustrate how we might create +and use a bootstrap distribution using just a single sample from the population. +Once again, suppose we are +interested in estimating the population mean price per night of all Airbnb +listings in Vancouver, Canada, using a single sample size of 40. +Recall our point estimate was $155.80. The +histogram of prices in the sample is displayed in Figure 10.10.
+ +## # A tibble: 40 × 8
+## id neighbourhood room_type accommodates bathrooms bedrooms beds price
+## <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
+## 1 3928 Marpole Private … 2 1 shared… 1 1 58
+## 2 3013 Kensington-Cedar… Entire h… 4 1 bath 2 2 112
+## 3 3156 Downtown Entire h… 6 2 baths 2 2 151
+## 4 3873 Dunbar Southlands Private … 5 1 bath 2 3 700
+## 5 3632 Downtown Eastside Entire h… 6 2 baths 3 3 157
+## 6 296 Kitsilano Private … 1 1 shared… 1 1 100
+## 7 3514 West End Entire h… 2 1 bath 1 1 110
+## 8 594 Sunset Entire h… 5 1 bath 3 3 105
+## 9 3305 Dunbar Southlands Entire h… 4 1 bath 1 2 196
+## 10 938 Downtown Entire h… 7 2 baths 2 3 269
+## # ℹ 30 more rows
+one_sample_dist <- ggplot(one_sample, aes(price)) +
+ geom_histogram(color = "lightgrey") +
+ labs(x = "Price per night (dollars)", y = "Count") +
+ theme(text = element_text(size = 12))
+
+one_sample_dist
+Figure 10.10: Histogram of price per night (dollars) for one sample of size 40. +
+The histogram for the sample is skewed, with a few observations out to the right. The +mean of the sample is $155.80. +Remember, in practice, we usually only have this one sample from the population. So +this sample and estimate are the only data we can work with.
+We now perform steps 1–5 listed above to generate a single bootstrap
+sample in R and calculate a point estimate from that bootstrap sample. We will
+use the rep_sample_n
function as we did when we were
+creating our sampling distribution. But critically, note that we now
+pass one_sample
—our single sample of size 40—as the first argument.
+And since we need to sample with replacement,
+we change the argument for replace
from its default value of FALSE
to TRUE
.
+
boot1 <- one_sample |>
+ rep_sample_n(size = 40, replace = TRUE, reps = 1)
+boot1_dist <- ggplot(boot1, aes(price)) +
+ geom_histogram(color = "lightgrey") +
+ labs(x = "Price per night (dollars)", y = "Count") +
+ theme(text = element_text(size = 12))
+
+boot1_dist
+Figure 10.11: Bootstrap distribution. +
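The point estimate below comes from summarizing the bootstrap sample; a sketch of the call:

# mean price in the single bootstrap sample
boot1 |>
  group_by(replicate) |>
  summarize(mean_price = mean(price))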
+## # A tibble: 1 × 2
+## replicate mean_price
+## <int> <dbl>
+## 1 1 164.20
+Notice in Figure 10.11 that the histogram of our bootstrap sample +has a similar shape to the original sample histogram. Though the shapes of +the distributions are similar, they are not identical. You’ll also notice that +the original sample mean and the bootstrap sample mean differ. How might that +happen? Remember that we are sampling with replacement from the original +sample, so we don’t end up with the same sample values again. We are pretending +that our single sample is close to the population, and we are trying to +mimic drawing another sample from the population by drawing one from our original +sample.
+Let’s now take 20,000 bootstrap samples from the original sample (one_sample
)
+using rep_sample_n
, and calculate the means for
+each of those replicates. Recall that this assumes that one_sample
looks like
+our original population; but since we do not have access to the population itself,
+this is often the best we can do.
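A sketch of that call, mirroring the single bootstrap sample above but with reps set to 20,000, and storing the result under the name boot20000 used below:

# take 20,000 bootstrap samples of size 40, with replacement, from our single sample
boot20000 <- one_sample |>
  rep_sample_n(size = 40, replace = TRUE, reps = 20000)

boot20000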
## # A tibble: 800,000 × 9
+## # Groups: replicate [20,000]
+## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds
+## <int> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl>
+## 1 1 1276 Hastings-Sun… Entire h… 2 1 bath 1 1
+## 2 1 3235 Hastings-Sun… Entire h… 2 1 bath 1 1
+## 3 1 1301 Oakridge Entire h… 12 2 baths 2 12
+## 4 1 118 Grandview-Wo… Entire h… 4 1 bath 2 2
+## 5 1 2550 Downtown Eas… Private … 2 1.5 shar… 1 1
+## 6 1 1006 Grandview-Wo… Entire h… 5 1 bath 3 4
+## 7 1 3632 Downtown Eas… Entire h… 6 2 baths 3 3
+## 8 1 1923 West End Entire h… 4 2 baths 2 2
+## 9 1 3873 Dunbar South… Private … 5 1 bath 2 3
+## 10 1 2349 Kerrisdale Private … 2 1 shared… 1 1
+## # ℹ 799,990 more rows
+## # ℹ 1 more variable: price <dbl>
+
+## # A tibble: 6 × 9
+## # Groups: replicate [1]
+## replicate id neighbourhood room_type accommodates bathrooms bedrooms beds
+## <int> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl>
+## 1 20000 1949 Kitsilano Entire h… 3 1 bath 1 1
+## 2 20000 1025 Kensington-Ce… Entire h… 3 1 bath 1 1
+## 3 20000 3013 Kensington-Ce… Entire h… 4 1 bath 2 2
+## 4 20000 2868 Downtown Entire h… 2 1 bath 1 1
+## 5 20000 3156 Downtown Entire h… 6 2 baths 2 2
+## 6 20000 1923 West End Entire h… 4 2 baths 2 2
+## # ℹ 1 more variable: price <dbl>
+Let’s take a look at the histograms of the first six replicates of our bootstrap samples.
+six_bootstrap_samples <- boot20000 |>
+ filter(replicate <= 6)
+
+ggplot(six_bootstrap_samples, aes(price)) +
+ geom_histogram(color = "lightgrey") +
+ labs(x = "Price per night (dollars)", y = "Count") +
+ facet_wrap(~replicate) +
+ theme(text = element_text(size = 12))
+Figure 10.12: Histograms of the first six replicates of the bootstrap samples. +
+We see in Figure 10.12 how the +bootstrap samples differ. We can also calculate the sample mean for each of +these six replicates.
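A summarize call like the following, applied to the six_bootstrap_samples data frame created above, produces the six means shown below:

# sample mean for each of the first six bootstrap replicates
six_bootstrap_samples |>
  group_by(replicate) |>
  summarize(mean_price = mean(price))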
+ +## # A tibble: 6 × 2
+## replicate mean_price
+## <int> <dbl>
+## 1 1 177.2
+## 2 2 131.45
+## 3 3 179.10
+## 4 4 171.35
+## 5 5 191.32
+## 6 6 170.05
+We can see that the bootstrap sample distributions and the sample means are +different. They are different because we are sampling with replacement. We +will now calculate point estimates for our 20,000 bootstrap samples and +generate a bootstrap distribution of our point estimates. The bootstrap +distribution (Figure 10.13) suggests how we might expect +our point estimate to behave if we took another sample.
+boot20000_means <- boot20000 |>
+ group_by(replicate) |>
+ summarize(mean_price = mean(price))
+
+boot20000_means
## # A tibble: 20,000 × 2
+## replicate mean_price
+## <int> <dbl>
+## 1 1 177.2
+## 2 2 131.45
+## 3 3 179.10
+## 4 4 171.35
+## 5 5 191.32
+## 6 6 170.05
+## 7 7 178.83
+## 8 8 154.78
+## 9 9 163.85
+## 10 10 209.28
+## # ℹ 19,990 more rows
+
+## # A tibble: 6 × 2
+## replicate mean_price
+## <int> <dbl>
+## 1 19995 130.40
+## 2 19996 189.18
+## 3 19997 168.98
+## 4 19998 168.23
+## 5 19999 155.73
+## 6 20000 136.95
+boot_est_dist <- ggplot(boot20000_means, aes(x = mean_price)) +
+ geom_histogram(color = "lightgrey") +
+ labs(x = "Sample mean price per night (dollars)", y = "Count") +
+ theme(text = element_text(size = 12))
+
+boot_est_dist
+Figure 10.13: Distribution of the bootstrap sample means. +
+Let’s compare the bootstrap distribution—which we construct by taking many samples from our original sample of size 40—with +the true sampling distribution—which corresponds to taking many samples from the population.
++Figure 10.14: Comparison of the distribution of the bootstrap sample means and sampling distribution. +
+There are two essential points that we can take away from Figure +10.14. First, the shape and spread of the true sampling +distribution and the bootstrap distribution are similar; the bootstrap +distribution lets us get a sense of the point estimate’s variability. The +second important point is that the means of these two distributions are +different. The sampling distribution is centered at +$154.51, the population mean value. However, the bootstrap +distribution is centered at the original sample’s mean price per night, +$155.87. Because we are resampling from the +original sample repeatedly, we see that the bootstrap distribution is centered +at the original sample’s mean value (unlike the sampling distribution of the +sample mean, which is centered at the population parameter value).
+Figure +10.15 summarizes the bootstrapping process. +The idea here is that we can use this distribution of bootstrap sample means to +approximate the sampling distribution of the sample means when we only have one +sample. Since the bootstrap distribution pretty well approximates the sampling +distribution spread, we can use the bootstrap spread to help us develop a +plausible range for our population parameter along with our estimate!
++Figure 10.15: Summary of bootstrapping process. +
+Now that we have constructed our bootstrap distribution, let’s use it to create +an approximate 95% percentile bootstrap confidence interval. +A confidence interval is a range of plausible values for the population parameter. We will +find the range of values covering the middle 95% of the bootstrap +distribution, giving us a 95% confidence interval. You may be wondering, what +does “95% confidence” mean? If we took 100 random samples and calculated 100 +95% confidence intervals, then about 95% of the ranges would capture the +population parameter’s value. Note there’s nothing special about 95%. We +could have used other levels, such as 90% or 99%. There is a balance between +our level of confidence and precision. A higher confidence level corresponds to +a wider range of the interval, and a lower confidence level corresponds to a +narrower range. Therefore the level we choose is based on what chance we are +willing to take of being wrong based on the implications of being wrong for our +application. In general, we choose confidence levels to be comfortable with our +level of uncertainty but not so strict that the interval is unhelpful. For +instance, if our decision impacts human life and the implications of being +wrong are deadly, we may want to be very confident and choose a higher +confidence level.
+To calculate a 95% percentile bootstrap confidence interval, we will do the following:
1. Arrange the observations in the bootstrap distribution in ascending order.
2. Find the value such that 2.5% of observations fall below it (the 2.5% percentile). Use that value as the lower bound of the interval.
3. Find the value such that 97.5% of observations fall below it (the 97.5% percentile). Use that value as the upper bound of the interval.
+To do this in R, we can use the quantile()
function. Quantiles are expressed in proportions rather than
+percentages, so the 2.5th and 97.5th percentiles would be the 0.025 and 0.975 quantiles, respectively.
+
bounds <- boot20000_means |>
+ select(mean_price) |>
+ pull() |>
+ quantile(c(0.025, 0.975))
+
+bounds
## 2.5% 97.5%
+## 119 204
+Our interval, $119.28 to $203.63, captures +the middle 95% of the sample mean prices in the bootstrap distribution. We can +visualize the interval on our distribution in Figure +10.16.
++Figure 10.16: Distribution of the bootstrap sample means with percentile lower and upper bounds. +
+To finish our estimation of the population parameter, we would report the point +estimate and our confidence interval’s lower and upper bounds. Here the sample +mean price per night of 40 Airbnb listings was +$155.80, and we are 95% “confident” that the true +population mean price per night for all Airbnb listings in Vancouver is between +$119.28 and $203.63. +Notice that our interval does indeed contain the true +population mean value, $154.51! However, in +practice, we would not know whether our interval captured the population +parameter or not because we usually only have a single sample, not the entire +population. This is the best we can do when we only have one sample!
+This chapter is only the beginning of the journey into statistical inference. +We can extend the concepts learned here to do much more than report point +estimates and confidence intervals, such as testing for real differences +between populations, tests for associations between variables, and so much +more. We have just scratched the surface of statistical inference; however, the +material presented here will serve as the foundation for more advanced +statistical techniques you may learn about in the future!
+Practice exercises for the material covered in this chapter +can be found in the accompanying +worksheets repository +in the two “Statistical inference” rows. +You can launch an interactive version of each worksheet in your browser by clicking the “launch binder” button. +You can also preview a non-interactive version of each worksheet by clicking “view worksheet.” +If you instead decide to download the worksheets and run them on your own machine, +make sure to follow the instructions for computer setup +found in Chapter 13. This will ensure that the automated feedback +and guidance that the worksheets provide will function as intended.
+tidyverse
and infer
in a slightly more
+in-depth manner than the present chapter. Chapters 9 and 10 take the next step
+beyond the scope of this chapter and begin to provide some of the initial
+mathematical underpinnings of inference and more advanced applications of the
+concept of inference in testing hypotheses and performing regression. This
+material offers a great starting point for getting more into the technical side
+of statistics.This chapter provides an introduction to data science and the R programming language. +The goal here is to get your hands dirty right from the start! We will walk through an entire data analysis, +and along the way introduce different types of data analysis question, some fundamental programming +concepts in R, and the basics of loading, cleaning, and visualizing data. In the following chapters, we will +dig into each of these steps in much more detail; but for now, let’s jump in to see how much we can do +with data science!
+By the end of the chapter, readers will be able to do the following:
+tidyverse
package into R.read_csv
.filter
, select
, arrange
, and slice
.mutate
.ggplot
bar plot.?
to access help and documentation tools in R.In this chapter, we will walk through a full analysis of a data set relating to +languages spoken at home by Canadian residents (Figure 1.1). Many Indigenous peoples exist in Canada +with their own cultures and languages; these languages are often unique to Canada and not spoken +anywhere else in the world (Statistics Canada 2018). Sadly, colonization has +led to the loss of many of these languages. For instance, generations of +children were not allowed to speak their mother tongue (the first language an +individual learns in childhood) in Canadian residential schools. Colonizers +also renamed places they had “discovered” (K. Wilson 2018). Acts such as these +have significantly harmed the continuity of Indigenous languages in Canada, and +some languages are considered “endangered” as few people report speaking them. +To learn more, please see Canadian Geographic’s article, “Mapping Indigenous Languages in +Canada” (Walker 2017), +They Came for the Children: Canada, Aboriginal +peoples, and Residential Schools (Truth and Reconciliation Commission of Canada 2012) +and the Truth and Reconciliation Commission of Canada’s +Calls to Action (Truth and Reconciliation Commission of Canada 2015).
++Figure 1.1: Map of Canada. +
+The data set we will study in this chapter is taken from
+the canlang
R data package
+(Timbers 2020), which has
+population language data collected during the 2016 Canadian census (Statistics Canada 2016a).
+In this data, there are 214 languages recorded, each having six different properties:
category
: Higher-level language category, describing whether the language is an Official Canadian language, an Aboriginal (i.e., Indigenous) language, or a Non-Official and Non-Aboriginal language.language
: The name of the language.mother_tongue
: Number of Canadian residents who reported the language as their mother tongue. Mother tongue is generally defined as the language someone was exposed to since birth.most_at_home
: Number of Canadian residents who reported the language as being spoken most often at home.most_at_work
: Number of Canadian residents who reported the language as being used most often at work.lang_known
: Number of Canadian residents who reported knowledge of the language.According to the census, more than 60 Aboriginal languages were reported +as being spoken in Canada. Suppose we want to know which are the most common; +then we might ask the following question, which we wish to answer using our data:
+Which ten Aboriginal languages were most often reported in 2016 as mother +tongues in Canada, and how many people speak each of them?
+++Note: Data science cannot be done without +a deep understanding of the data and +problem domain. In this book, we have simplified the data sets used in our +examples to concentrate on methods and fundamental concepts. But in real +life, you cannot and should not do data science without a domain expert. +Alternatively, it is common to practice data science in your own domain of +expertise! Remember that when you work with data, it is essential to think +about how the data were collected, which affects the conclusions you can +draw. If your data are biased, then your results will be biased!
+
Every good data analysis begins with a question—like the +above—that you aim to answer using data. As it turns out, there +are actually a number of different types of question regarding data: +descriptive, exploratory, predictive, inferential, causal, and mechanistic, +all of which are defined in Table 1.1. +Carefully formulating a question as early as possible in your analysis—and +correctly identifying which type of question it is—will guide your overall approach to +the analysis as well as the selection of appropriate tools. +
+Question type | +Description | +Example | +
---|---|---|
Descriptive | +A question that asks about summarized characteristics of a data set without interpretation (i.e., report a fact). | +How many people live in each province and territory in Canada? | +
Exploratory | +A question that asks if there are patterns, trends, or relationships within a single data set. Often used to propose hypotheses for future study. | +Does political party voting change with indicators of wealth in a set of data collected on 2,000 people living in Canada? | +
Predictive | +A question that asks about predicting measurements or labels for individuals (people or things). The focus is on what things predict some outcome, but not what causes the outcome. | +What political party will someone vote for in the next Canadian election? | +
Inferential | +A question that looks for patterns, trends, or relationships in a single data set and also asks for quantification of how applicable these findings are to the wider population. | +Does political party voting change with indicators of wealth for all people living in Canada? | +
Causal | +A question that asks about whether changing one factor will lead to a change in another factor, on average, in the wider population. | +Does wealth lead to voting for a certain political party in Canadian elections? | +
Mechanistic | +A question that asks about the underlying mechanism of the observed patterns, trends, or relationships (i.e., how does it happen?) | +How does wealth lead to voting for a certain political party in Canadian elections? | +
In this book, you will learn techniques to answer the +first four types of question: descriptive, exploratory, predictive, and inferential; +causal and mechanistic questions are beyond the scope of this book. +In particular, you will learn how to apply the following analysis tools:
+Referring to Table 1.1, our question about +Aboriginal languages is an example of a descriptive question: we are +summarizing the characteristics of a data set without further interpretation. +And referring to the list above, it looks like we should use visualization +and perhaps some summarization to answer the question. So in the remainder +of this chapter, we will work towards making a visualization that shows +us the ten most common Aboriginal languages in Canada and their associated counts, +according to the 2016 census.
+A data set is, at its core essence, a structured collection of numbers and characters. +Aside from that, there are really no strict rules; data sets can come in +many different forms! Perhaps the most common form of data set that you will +find in the wild, however, is tabular data. Think spreadsheets in Microsoft Excel: tabular data are +rectangular-shaped and spreadsheet-like, as shown in Figure +1.2. In this book, we will focus primarily on tabular data.
+Since we are using R for data analysis in this book, the first step for us is to +load the data into R. When we load tabular data into +R, it is represented as a data frame object. Figure +1.2 shows that an R data frame is very similar +to a spreadsheet. We refer to the rows as observations; +these are the individual objects +for which we collect data. In Figure 1.2, the observations are +languages. We refer to the columns as variables; these are the characteristics of each +observation. In Figure 1.2, the variables are the the +language’s category, its name, the number of mother tongue speakers, etc.
++Figure 1.2: A spreadsheet versus a data frame in R. +
+The first kind of data file that we will learn how to load into R as a data
+frame is the comma-separated values format (.csv
for short). These files
+have names ending in .csv
, and can be opened and saved using common
+spreadsheet programs like Microsoft Excel and Google Sheets. For example, the
+.csv
file named can_lang.csv
+is included with the code for this book.
+If we were to open this data in a plain text editor (a program like Notepad that just shows
+text with no formatting), we would see each row on its own line, and each entry in the table separated by a comma:
category,language,mother_tongue,most_at_home,most_at_work,lang_known
+Aboriginal languages,"Aboriginal languages, n.o.s.",590,235,30,665
+Non-Official & Non-Aboriginal languages,Afrikaans,10260,4785,85,23415
+Non-Official & Non-Aboriginal languages,"Afro-Asiatic languages, n.i.e.",1150,44
+Non-Official & Non-Aboriginal languages,Akan (Twi),13460,5985,25,22150
+Non-Official & Non-Aboriginal languages,Albanian,26895,13135,345,31930
+Aboriginal languages,"Algonquian languages, n.i.e.",45,10,0,120
+Aboriginal languages,Algonquin,1260,370,40,2480
+Non-Official & Non-Aboriginal languages,American Sign Language,2685,3020,1145,21
+Non-Official & Non-Aboriginal languages,Amharic,22465,12785,200,33670
+To load this data into R so that we can do things with it (e.g., perform
+analyses or create data visualizations), we will need to use a function. A
+function is a special word in R that takes instructions (we call these
+arguments) and does something. The function we will use to load a .csv
file
+into R is called read_csv
. In its most basic
+use-case, read_csv
expects that the data file:
,
) to separate the columns, andBelow you’ll see the code used to load the data into R using the read_csv
+function. Note that the read_csv
function is not included in the base
+installation of R, meaning that it is not one of the primary functions ready to
+use when you install R. Therefore, you need to load it from somewhere else
+before you can use it. The place from which we will load it is called an R package.
+An R package is a collection of functions that can be used in addition to the
+built-in R package functions once loaded. The read_csv
function, in
+particular, can be made accessible by loading
+the tidyverse
R package (Wickham 2021b; Wickham et al. 2019)
+using the library
function. The tidyverse
package contains many
+functions that we will use throughout this book to load, clean, wrangle,
+and visualize data.
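Loading the package is a single call to the library function; the sketch below is what produces the startup messages that follow.

# load the tidyverse collection of packages (readr, dplyr, ggplot2, and friends)
library(tidyverse)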
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
+## ✔ dplyr 1.1.2 ✔ readr 2.1.4
+## ✔ forcats 1.0.0 ✔ stringr 1.5.0
+## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
+## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
+## ✔ purrr 1.0.1
+## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+## ✖ dplyr::filter() masks stats::filter()
+## ✖ dplyr::lag() masks stats::lag()
+## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
+++Note: You may have noticed that we got some extra +output from R regarding attached packages and conflicts below our code +line. These are examples of messages in R, which give the user more +information that might be handy to know. The
+Attaching packages
message is +natural when loadingtidyverse
, sincetidyverse
actually automatically +causes other packages to be imported too, such asdplyr
. In the future, +when we loadtidyverse
in this book, we will silence these messages to help +with the readability of the book. TheConflicts
message is also totally normal +in this circumstance. This message tells you if functions from different +packages share the same name, which is confusing to R. For example, in this +case, thedplyr
package and thestats
package both provide a function +calledfilter
. The message above (dplyr::filter() masks stats::filter()
) +is R telling you that it is going to default to thedplyr
package version +of this function. So if you use thefilter
function, you will be using the +dplyr
version. In order to use thestats
version, you need to use its +full namestats::filter
. Messages are not errors, so generally you don’t +need to take action when you see a message; but you should always read the message +and critically think about what it means and whether you need to do anything +about it.
After loading the tidyverse
package, we can call the read_csv
function and
+pass it a single argument: the name of the file, "can_lang.csv"
. We have to
+put quotes around file names and other letters and words that we use in our
+code to distinguish it from the special words (like functions!) that make up the R programming
+language. The file’s name is the only argument we need to provide because our
+file satisfies everything else that the read_csv
function expects in the default
+use-case. Figure 1.3 describes how we use the read_csv
+to read data into R.
+Figure 1.3: Syntax for the read_csv
function.
+
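A minimal sketch of that call, assuming the file can_lang.csv sits in the current working directory, is shown here; it produces the data frame printed below.

# read the comma-separated values file into R as a data frame
read_csv("can_lang.csv")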
## # A tibble: 214 × 6
+## category language mother_tongue most_at_home most_at_work lang_known
+## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
+## 1 Aboriginal langu… Aborigi… 590 235 30 665
+## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415
+## 3 Non-Official & N… Afro-As… 1150 445 10 2775
+## 4 Non-Official & N… Akan (T… 13460 5985 25 22150
+## 5 Non-Official & N… Albanian 26895 13135 345 31930
+## 6 Aboriginal langu… Algonqu… 45 10 0 120
+## 7 Aboriginal langu… Algonqu… 1260 370 40 2480
+## 8 Non-Official & N… America… 2685 3020 1145 21930
+## 9 Non-Official & N… Amharic 22465 12785 200 33670
+## 10 Non-Official & N… Arabic 419890 223535 5585 629055
+## # ℹ 204 more rows
+++Note: There is another function +that also loads csv files named
+read.csv
. We will always use +read_csv
in this book, as it is designed to play nicely with all of the +othertidyverse
functions, which we will use extensively. Be +careful not to accidentally useread.csv
, as it can cause some tricky +errors to occur in your code that are hard to track down!
When we loaded the 2016 Canadian census language data
+using read_csv
, we did not give this data frame a name.
+Therefore the data was just printed on the screen,
+and we cannot do anything else with it. That isn’t very useful.
+What would be more useful would be to give a name
+to the data frame that read_csv
outputs,
+so that we can refer to it later for analysis and visualization.
The way to assign a name to a value in R is via the assignment symbol <-
.
+On the left side of the assignment symbol you put the name that you want
+to use, and on the right side of the assignment symbol
+you put the value that you want the name to refer to.
+Names can be used to refer to almost anything in R, such as numbers,
+words (also known as strings of characters), and data frames!
+Below, we set my_number
to 3
(the result of 1+2
)
+and we set name
to the string "Alice"
.
Note that when
+we name something in R using the assignment symbol, <-
,
+we do not need to surround the name we are creating with quotes. This is
+because we are formally telling R that this special word denotes
+the value of whatever is on the right-hand side.
+Only characters and words that act as values on the right-hand side of the assignment
+symbol—e.g., the file name "data/can_lang.csv"
that we specified before, or "Alice"
above—need
+to be surrounded by quotes.
After making the assignment, we can use the special name words we have created in
+place of their values. For example, if we want to do something with the value 3
later on,
+we can just use my_number
instead. Let’s try adding 2 to my_number
; you will see that
+R just interprets this as adding 3 and 2:
## [1] 5
+Object names can consist of letters, numbers, periods .
and underscores _
.
+Other symbols won’t work since they have their own meanings in R. For example,
+-
is the subtraction symbol; if we try to assign a name with
+the -
symbol, R will complain and we will get an error!
Error in my - number <- 1 : object 'my' not found
+There are certain conventions for naming objects in R.
+When naming an object we
+suggest using only lowercase letters, numbers and underscores _
to separate
+the words in a name. R is case sensitive, which means that Letter
and
+letter
would be two different objects in R. You should also try to give your
+objects meaningful names. For instance, you can name a data frame x
.
+However, using more meaningful terms, such as language_data
, will help you
+remember what each name in your code represents. We recommend following the
+Tidyverse naming conventions outlined in the Tidyverse Style Guide (Wickham 2020). Let’s
+now use the assignment symbol to give the name
+can_lang
to the 2016 Canadian census language data frame that we get from
+read_csv
.
Wait a minute, nothing happened this time! Where’s our data?
+Actually, something did happen: the data was loaded in
+and now has the name can_lang
associated with it.
+And we can use that name to access the data frame and do things with it.
+For example, we can type the name of the data frame to print the first few rows
+on the screen. You will also see at the top that the number of observations (i.e., rows) and
+variables (i.e., columns) are printed. Printing the first few rows of a data frame
+like this is a handy way to get a quick sense for what is contained in a data frame.
## # A tibble: 214 × 6
+## category language mother_tongue most_at_home most_at_work lang_known
+## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
+## 1 Aboriginal langu… Aborigi… 590 235 30 665
+## 2 Non-Official & N… Afrikaa… 10260 4785 85 23415
+## 3 Non-Official & N… Afro-As… 1150 445 10 2775
+## 4 Non-Official & N… Akan (T… 13460 5985 25 22150
+## 5 Non-Official & N… Albanian 26895 13135 345 31930
+## 6 Aboriginal langu… Algonqu… 45 10 0 120
+## 7 Aboriginal langu… Algonqu… 1260 370 40 2480
+## 8 Non-Official & N… America… 2685 3020 1145 21930
+## 9 Non-Official & N… Amharic 22465 12785 200 33670
+## 10 Non-Official & N… Arabic 419890 223535 5585 629055
+## # ℹ 204 more rows
+filter
& select
Now that we’ve loaded our data into R, we can start wrangling the data to
+find the ten Aboriginal languages that were most often reported
+in 2016 as mother tongues in Canada. In particular, we will construct
+a table with the ten Aboriginal languages that have the largest
+counts in the mother_tongue
column.
+The filter
and select
functions from the tidyverse
package will help us
+here. The filter
function allows you to obtain a subset of the
+rows with specific values, while the select
function allows you
+to obtain a subset of the columns. Therefore, we can filter
the rows to extract the
+Aboriginal languages in the data set, and then use select
to obtain only the
+columns we want to include in our table.
filter
to extract rowsLooking at the can_lang
data above, we see the category
column contains different
+high-level categories of languages, which include “Aboriginal languages”,
+“Non-Official & Non-Aboriginal languages” and “Official languages”. To answer
+our question we want to filter our data set so we restrict our attention
+to only those languages in the “Aboriginal languages” category.
We can use the filter
function to obtain the subset of rows with desired
+values from a data frame. Figure 1.4 outlines what arguments we need to specify to use filter
. The first argument to filter
is the name of the data frame
+object, can_lang
. The second argument is a logical statement to use when
+filtering the rows. A logical statement evaluates to either TRUE
or FALSE
;
+filter
keeps only those rows for which the logical statement evaluates to TRUE
.
+For example, in our analysis, we are interested in keeping only languages in the
+“Aboriginal languages” higher-level category. We can use
+the equivalency operator ==
to compare the values
+of the category
column with the value "Aboriginal languages"
; you will learn about
+many other kinds of logical statements in Chapter 3. Similar to
+when we loaded the data file and put quotes around the file name, here we need
+to put quotes around "Aboriginal languages"
. Using quotes tells R that this
+is a string value and not one of the special words that make up the R
+programming language, or one of the names we have given to data frames in the
+code we have already written.
+Figure 1.4: Syntax for the filter
function.
+
With these arguments, filter
returns a data frame that has all the columns of
+the input data frame, but only those rows we asked for in our logical filter
+statement.
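A sketch of the call, storing the result under the name aboriginal_lang that is used later in the chapter:

# keep only the rows whose category is "Aboriginal languages"
aboriginal_lang <- filter(can_lang, category == "Aboriginal languages")
aboriginal_lang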
## # A tibble: 67 × 6
+## category language mother_tongue most_at_home most_at_work lang_known
+## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
+## 1 Aboriginal langu… Aborigi… 590 235 30 665
+## 2 Aboriginal langu… Algonqu… 45 10 0 120
+## 3 Aboriginal langu… Algonqu… 1260 370 40 2480
+## 4 Aboriginal langu… Athabas… 50 10 0 85
+## 5 Aboriginal langu… Atikame… 6150 5465 1100 6645
+## 6 Aboriginal langu… Babine … 110 20 10 210
+## 7 Aboriginal langu… Beaver 190 50 0 340
+## 8 Aboriginal langu… Blackfo… 2815 1110 85 5645
+## 9 Aboriginal langu… Carrier 1025 250 15 2100
+## 10 Aboriginal langu… Cayuga 45 10 10 125
+## # ℹ 57 more rows
+It’s good practice to check the output after using a
+function in R. We can see the original can_lang
data set contained 214 rows
+with multiple kinds of category
. The data frame
+aboriginal_lang
contains only 67 rows, and looks like it only contains languages in
+the “Aboriginal languages” in the category
column. So it looks like the function
+gave us the result we wanted!
select
to extract columnsNow let’s use select
to extract the language
and mother_tongue
columns
+from this data frame. Figure 1.5 shows us the syntax for the select
function. To extract these columns, we need to provide the select
+function with three arguments. The first argument is the name of the data frame
+object, which in this example is aboriginal_lang
. The second and third
+arguments are the column names that we want to select: language
and
+mother_tongue
. After passing these three arguments, the select
function
+returns two columns (the language
and mother_tongue
columns that we asked
+for) as a data frame. This code is also a great example of why being
+able to name things in R is useful: you can see that we are using the
+result of our earlier filter
step (which we named aboriginal_lang
) here
+in the next step of the analysis!
+Figure 1.5: Syntax for the select
function.
+
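A sketch of the call, storing the result under the name selected_lang that is used later in the chapter:

# keep only the language and mother_tongue columns
selected_lang <- select(aboriginal_lang, language, mother_tongue)
selected_lang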
## # A tibble: 67 × 2
+## language mother_tongue
+## <chr> <dbl>
+## 1 Aboriginal languages, n.o.s. 590
+## 2 Algonquian languages, n.i.e. 45
+## 3 Algonquin 1260
+## 4 Athabaskan languages, n.i.e. 50
+## 5 Atikamekw 6150
+## 6 Babine (Wetsuwet'en) 110
+## 7 Beaver 190
+## 8 Blackfoot 2815
+## 9 Carrier 1025
+## 10 Cayuga 45
+## # ℹ 57 more rows
+arrange
to order and slice
to select rows by index numberWe have used filter
and select
to obtain a table with only the Aboriginal
+languages in the data set and their associated counts. However, we want to know
+the ten languages that are spoken most often. As a next step, we will
+order the mother_tongue
column from largest to smallest value and then extract only
+the top ten rows. This is where the arrange
and slice
functions come to the
+rescue!
The arrange
function allows us to order the rows of a data frame by the
+values of a particular column. Figure 1.6 details what arguments we need to specify to
+use the arrange
function. We need to pass the data frame as the first
+argument to this function, and the variable to order by as the second argument.
+Since we want to choose the ten Aboriginal languages most often reported as a mother tongue
+language, we will use the arrange
function to order the rows in our
+selected_lang
data frame by the mother_tongue
column. We want to
+arrange the rows in descending order (from largest to smallest),
+so we pass the column to the desc
function before using it as an argument.
+Figure 1.6: Syntax for the arrange
function.
+
## # A tibble: 67 × 2
+## language mother_tongue
+## <chr> <dbl>
+## 1 Cree, n.o.s. 64050
+## 2 Inuktitut 35210
+## 3 Ojibway 17885
+## 4 Oji-Cree 12855
+## 5 Dene 10700
+## 6 Montagnais (Innu) 10235
+## 7 Mi'kmaq 6690
+## 8 Atikamekw 6150
+## 9 Plains Cree 3065
+## 10 Stoney 3025
+## # ℹ 57 more rows
+Next we will use the slice
function, which selects rows according to their
+row number. Since we want to choose the most common ten languages, we will indicate we want the
+rows 1 to 10 using the argument 1:10
.
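Continuing from the ordered data frame (called arranged_lang above for illustration), a sketch of the call that stores the result under the name ten_lang used in the rest of the chapter:

# keep the first ten rows, i.e., the ten largest mother tongue counts
ten_lang <- slice(arranged_lang, 1:10)
ten_lang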
## # A tibble: 10 × 2
+## language mother_tongue
+## <chr> <dbl>
+## 1 Cree, n.o.s. 64050
+## 2 Inuktitut 35210
+## 3 Ojibway 17885
+## 4 Oji-Cree 12855
+## 5 Dene 10700
+## 6 Montagnais (Innu) 10235
+## 7 Mi'kmaq 6690
+## 8 Atikamekw 6150
+## 9 Plains Cree 3065
+## 10 Stoney 3025
+mutate
Recall that our data analysis question referred to the count of Canadians
+that speak each of the top ten most commonly reported Aboriginal languages as
+their mother tongue, and the ten_lang
data frame indeed contains those
+counts… But perhaps, seeing these numbers, we became curious about the
+percentage of the population of Canada associated with each count. It is
+common to come up with new data analysis questions in the process of answering
+a first one—so fear not and explore! To answer this small
+question along the way, we need to divide each count in the mother_tongue
+column by the total Canadian population according to the 2016
+census—i.e., 35,151,728—and multiply it by 100. We can perform
+this computation using the mutate
function. We pass the ten_lang
+data frame as its first argument, then specify the equation that computes the percentages
+in the second argument. By using a new variable name on the left hand side of the equation,
+we will create a new column in the data frame; and if we use an existing name, we will
+modify that variable. In this case, we will opt to
+create a new column called mother_tongue_percent
.
canadian_population <- 35151728
+ten_lang_percent <- mutate(ten_lang, mother_tongue_percent = 100 * mother_tongue / canadian_population)
+ten_lang_percent
## # A tibble: 10 × 3
+## language mother_tongue mother_tongue_percent
+## <chr> <dbl> <dbl>
+## 1 Cree, n.o.s. 64050 0.182
+## 2 Inuktitut 35210 0.100
+## 3 Ojibway 17885 0.0509
+## 4 Oji-Cree 12855 0.0366
+## 5 Dene 10700 0.0304
+## 6 Montagnais (Innu) 10235 0.0291
+## 7 Mi'kmaq 6690 0.0190
+## 8 Atikamekw 6150 0.0175
+## 9 Plains Cree 3065 0.00872
+## 10 Stoney 3025 0.00861
+The ten_lang_percent
data frame shows that
+the ten Aboriginal languages in the ten_lang
data frame were spoken
+as a mother tongue by between 0.008% and 0.18% of the Canadian population.
The ten_lang
table we generated in Section 1.8 answers our initial data analysis question.
+Are we done? Well, not quite; tables are almost never the best way to present
+the result of your analysis to your audience. Even the ten_lang
table with
+only two columns presents some difficulty: for example, you have to scrutinize
+the table quite closely to get a sense for the relative numbers of speakers of
+each language. When you move on to more complicated analyses, this issue only
+gets worse. In contrast, a visualization would convey this information in a much
+more easily understood format.
+Visualizations are a great tool for summarizing information to help you
+effectively communicate with your audience, and
+creating effective data visualizations is an essential component of any data
+analysis. In this section we will develop a visualization of the
+ten Aboriginal languages that were most often reported in 2016 as mother tongues in
+Canada, as well as the number of people that speak each of them.
ggplot
to create a bar plotIn our data set, we can see that language
and mother_tongue
are in separate
+columns (or variables). In addition, there is a single row (or observation) for each language.
+The data are, therefore, in what we call a tidy data format. Tidy data is a
+fundamental concept and will be a significant focus in the remainder of this
+book: many of the functions from tidyverse
require tidy data, including the
+ggplot
function that we will use shortly for our visualization. We will
+formally introduce tidy data in Chapter 3.
We will make a bar plot to visualize our data. A bar plot is a chart where the
+lengths of the bars represent certain values, like counts or proportions. We
+will make a bar plot using the mother_tongue
and language
columns from our
+ten_lang
data frame. To create a bar plot of these two variables using the
+ggplot
function, we must specify the data frame, which variables
+to put on the x and y axes, and what kind of plot to create. The ggplot
+function and its common usage is illustrated in Figure 1.7.
+Figure 1.8 shows the resulting bar plot
+generated by following the instructions in Figure 1.7.
+Figure 1.7: Creating a bar plot with the ggplot
function.
+
+Figure 1.8: Bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue. Note that this visualization is not done yet; there are still improvements to be made. +
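A sketch of the kind of call that produces Figure 1.8; the axis labels are refined in the next step.

ggplot(ten_lang, aes(x = language, y = mother_tongue)) +   # specify the data and the aesthetic mappings
  geom_bar(stat = "identity")                              # draw bars with heights given by mother_tongue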
+++Note: The vast majority of the +time, a single expression in R must be contained in a single line of code. +However, there are a small number of situations in which you can have a +single R expression span multiple lines. Above is one such case: here, R knows that a line cannot +end with a
++
symbol, and so it keeps reading the next line to figure out +what the right-hand side of the+
symbol should be. We could, of course, +put all of the added layers on one line of code, but splitting them across +multiple lines helps a lot with code readability.
It is exciting that we can already visualize our data to help answer our
+question, but we are not done yet! We can (and should) do more to improve the
+interpretability of the data visualization that we created. For example, by
+default, R uses the column names as the axis labels. Usually these
+column names do not have enough information about the variable in the column.
+We really should replace this default with a more informative label. For the
+example above, R uses the column name mother_tongue
as the label for the
+y axis, but most people will not know what that is. And even if they did, they
+will not know how we measured this variable, or the group of people on which the
+measurements were taken. An axis label that reads “Mother Tongue (Number of
+Canadian Residents)” would be much more informative.
Adding additional layers to our visualizations that we create in ggplot
is
+one common and easy way to improve and refine our data visualizations. New
+layers are added to ggplot
objects using the +
symbol. For example, we can
+use the xlab
(short for x axis label) and ylab
(short for y axis label) functions
+to add layers where we specify meaningful
+and informative labels for the x and y axes. Again, since we are specifying
+words (e.g. "Mother Tongue (Number of Canadian Residents)"
) as arguments to
+xlab
and ylab
, we surround them with double quotation marks. We can add many more
+layers to format the plot further, and we will explore these in Chapter
+4.
ggplot(ten_lang, aes(x = language, y = mother_tongue)) +
  geom_bar(stat = "identity") +
  xlab("Language") +
  ylab("Mother Tongue (Number of Canadian Residents)")
+Figure 1.9: Bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue with x and y labels. Note that this visualization is not done yet; there are still improvements to be made. +
+The result is shown in Figure 1.9. +This is already quite an improvement! Let’s tackle the next major issue with the visualization +in Figure 1.9: the overlapping x axis labels, which are +currently making it difficult to read the different language names. +One solution is to rotate the plot such that the bars are horizontal rather than vertical. +To accomplish this, we will swap the x and y coordinate axes:
ggplot(ten_lang, aes(x = mother_tongue, y = language)) +
  geom_bar(stat = "identity") +
  xlab("Mother Tongue (Number of Canadian Residents)") +
  ylab("Language")
+Figure 1.10: Horizontal bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue. There are no more serious issues with this visualization, but it could be refined further. +
+Another big step forward, as shown in Figure 1.10! There
are no more serious issues with the visualization. Now it is time to refine it to make it even better suited to answering the question we asked earlier in this chapter. For example, the visualization could be made clearer by
+organizing the bars according to the number of Canadian residents reporting
+each language, rather than in alphabetical order. We can reorder the bars using
+the reorder
function, which orders a variable (here language
) based on the
+values of the second variable (mother_tongue
).
ggplot(ten_lang, aes(x = mother_tongue,
                     y = reorder(language, mother_tongue))) +
  geom_bar(stat = "identity") +
  xlab("Mother Tongue (Number of Canadian Residents)") +
  ylab("Language")
+Figure 1.11: Bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue with bars reordered. +
+Figure 1.11 provides a very clear and well-organized +answer to our original question; we can see what the ten most often reported Aboriginal languages +were, according to the 2016 Canadian census, and how many people speak each of them. For +instance, we can see that the Aboriginal language most often reported was Cree +n.o.s. with over 60,000 Canadian residents reporting it as their mother tongue.
+++Note: “n.o.s.” means “not otherwise specified”, so Cree n.o.s. refers to +individuals who reported Cree as their mother tongue. In this data set, the +Cree languages include the following categories: Cree n.o.s., Swampy Cree, +Plains Cree, Woods Cree, and a ‘Cree not included elsewhere’ category (which +includes Moose Cree, Northern East Cree and Southern East Cree) +(Statistics Canada 2016b).
+
In the block of code below, we put everything from this chapter together, with a few
+modifications. In particular, we have actually skipped the
+select
step that we did above; since you specify the variable names to plot
+in the ggplot
function, you don’t actually need to select
the columns in advance
+when creating a visualization. We have also provided comments next to
+many of the lines of code below using the
+hash symbol #
. When R sees a #
sign, it
+will ignore all of the text that
+comes after the symbol on that line. So you can use comments to explain lines
+of code for others, and perhaps more importantly, your future self!
+It’s good practice to get in the habit of
+commenting your code to improve its readability.
This exercise demonstrates the power of R. In relatively few lines of code, we
+performed an entire data science workflow with a highly effective data
+visualization! We asked a question, loaded the data into R, wrangled the data
+(using filter
, arrange
and slice
) and created a data visualization to
+help answer our question. In this chapter, you got a quick taste of the data
+science workflow; continue on with the next few chapters to learn each of
+these steps in much more detail!
library(tidyverse)

# load the data set
can_lang <- read_csv("data/can_lang.csv")

# obtain the 10 most common Aboriginal languages
aboriginal_lang <- filter(can_lang, category == "Aboriginal languages")
arranged_lang <- arrange(aboriginal_lang, desc(mother_tongue))
ten_lang <- slice(arranged_lang, 1:10)

# create the visualization
ggplot(ten_lang, aes(x = mother_tongue,
                     y = reorder(language, mother_tongue))) +
  geom_bar(stat = "identity") +
  xlab("Mother Tongue (Number of Canadian Residents)") +
  ylab("Language")
+Figure 1.12: Putting it all together: bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue. +
+There are many R functions in the tidyverse
package (and beyond!), and
+nobody can be expected to remember what every one of them does
or all of the arguments we have to give them. Fortunately, R provides the ? symbol, which offers an easy way to pull up the documentation for
+most functions quickly. To use the ?
symbol to access documentation, you
+just put the name of the function you are curious about after the ?
symbol.
+For example, if you had forgotten what the filter
function
+did or exactly what arguments to pass in, you could run the following
+code:
?filter
+Figure 1.13 shows the documentation that will pop up,
+including a high-level description of the function, its arguments,
+a description of each, and more. Note that you may find some of the
+text in the documentation a bit too technical right now
+(for example, what is dbplyr
, and what is a lazy data frame?).
+Fear not: as you work through this book, many of these terms will be introduced
+to you, and slowly but surely you will become more adept at understanding and navigating
+documentation like that shown in Figure 1.13. But do keep in mind that the documentation
+is not written to teach you about a function; it is just there as a reference to remind
+you about the different arguments and usage of functions that you have already learned about elsewhere.
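If you prefer a function call over the ? shorthand, base R's help function pulls up the same documentation page; for example, the call below is equivalent to ?filter.
help(filter)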
+Figure 1.13: The documentation for the filter
function, including a high-level description, a list of arguments and their meanings, and more.
+
Practice exercises for the material covered in this chapter +can be found in the accompanying +worksheets repository +in the “R and the tidyverse” row. +You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. +You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” +If you instead decide to download the worksheet and run it on your own machine, +make sure to follow the instructions for computer setup +found in Chapter 13. This will ensure that the automated feedback +and guidance that the worksheets provide will function as intended.
+ +A typical data analysis involves not only writing and executing code, but also writing text and displaying images +that help tell the story of the analysis. In fact, ideally, we would like to interleave these three media, +with the text and images serving as narration for the code and its output. +In this chapter we will show you how to accomplish this using Jupyter notebooks, a common coding platform in +data science. Jupyter notebooks do precisely what we need: they let you combine text, images, and (executable!) code in a single +document. In this chapter, we will focus on the use of Jupyter notebooks to program in R and write +text via a web interface. +These skills are essential to getting your analysis running; think of it like getting dressed in the morning! +Note that we assume that you already have Jupyter set up and ready to use. If that is not the case, please first read +Chapter 13 to learn how to install and configure Jupyter on your own +computer.
By the end of the chapter, readers will be able to do the following:
Export Jupyter notebooks to other standard file types (e.g., .html, .pdf).
Jupyter (Kluyver et al. 2016) is a web-based interactive development environment for creating, editing, and executing documents called Jupyter notebooks. Jupyter notebooks are documents that contain a mix of computer code (and its output) and formattable text. Given that they combine these two analysis artifacts in a single document (code is not separate from the output or written report), notebooks are one of the leading tools to create reproducible data analyses. A reproducible data analysis is one where you can reliably and easily re-create the same results when analyzing the same data. Although this sounds like something that should always be true of any data analysis, in reality, this is not often the case; one needs to make a conscious effort to perform data analysis in a reproducible manner. An example of what a Jupyter notebook looks like is shown in Figure 11.1.
++Figure 11.1: A screenshot of a Jupyter Notebook. +
+One of the easiest ways to start working with Jupyter is to use a +web-based platform called JupyterHub. JupyterHubs often have Jupyter, R, a number of R +packages, and collaboration tools installed, configured and ready to use. +JupyterHubs are usually created and provisioned by organizations, +and require authentication to gain access. For example, if you are reading +this book as part of a course, your instructor may have a JupyterHub +already set up for you to use! Jupyter can also be installed on your +own computer; see Chapter 13 for instructions.
+The sections of a Jupyter notebook that contain code are referred to as code cells. +A code cell that has not yet been +executed has no number inside the square brackets to the left of the cell +(Figure 11.2). Running a code cell will execute all of +the code it contains, and the output (if any exists) will be displayed directly +underneath the code that generated it. Outputs may include printed text or +numbers, data frames and data visualizations. Cells that have been executed +also have a number inside the square brackets to the left of the cell. +This number indicates the order in which the cells were run +(Figure 11.3).
++Figure 11.2: A code cell in Jupyter that has not yet been executed. +
++Figure 11.3: A code cell in Jupyter that has been executed. +
+Code cells can be run independently or as part of executing the entire notebook +using one of the “Run all” commands found in the Run or Kernel menus +in Jupyter. Running a single code cell independently is a workflow typically +used when editing or writing your own R code. Executing an entire notebook is a +workflow typically used to ensure that your analysis runs in its entirety before +sharing it with others, and when using a notebook as part of an automated +process.
+To run a code cell independently, the cell needs to first be activated. This
+is done by clicking on it with the cursor. Jupyter will indicate a cell has been
+activated by highlighting it with a blue rectangle to its left. After the cell
+has been activated (Figure 11.4), the cell can be run
by either pressing the Run (▶) button in the toolbar, or by using the keyboard shortcut Shift + Enter.
+Figure 11.4: An activated cell that is ready to be run. The blue rectangle to the cell’s left (annotated by a red arrow) indicates that it is ready to be run. The cell can be run by clicking the run button (circled in red). +
+To execute all of the code cells in an entire notebook, you have three options:
+Select Run >> Run All Cells from the menu.
Select Kernel >> Restart Kernel and Run All Cells… from the menu (Figure 11.5).
Click the (⏭) button in the toolbar.
All of these commands result in all of the code cells in a notebook being run. +However, there is a slight difference between them. In particular, only +options 2 and 3 above will restart the R session before running all of the +cells; option 1 will not restart the session. Restarting the R session means +that all previous objects that were created from running cells before this +command was run will be deleted. In other words, restarting the session and +then running all cells (options 2 or 3) emulates how your notebook code would +run if you completely restarted Jupyter before executing your entire notebook.
++Figure 11.5: Restarting the R session can be accomplished by clicking Restart Kernel and Run All Cells… +
+The kernel is a program that executes the code inside your notebook and +outputs the results. Kernels for many different programming languages have +been created for Jupyter, which means that Jupyter can interpret and execute +the code of many different programming languages. To run R code, your notebook +will need an R kernel. In the top right of your window, you can see a circle +that indicates the status of your kernel. If the circle is empty (◯) +the kernel is idle and ready to execute code. If the circle is filled in (⬤) +the kernel is busy running some code.
+You may run into problems where your kernel is stuck for an excessive amount +of time, your notebook is very slow and unresponsive, or your kernel loses its +connection. If this happens, try the following steps:
+To create a new code cell in Jupyter (Figure 11.6), click the +
button in the
+toolbar. By default, all new cells in Jupyter start out as code cells,
+so after this, all you have to do is write R code within the new cell you just
+created!
+Figure 11.6: New cells can be created by clicking the + button, and are by default code cells. +
+Text cells inside a Jupyter notebook are called Markdown cells. Markdown cells +are rich formatted text cells, which means you can bold and italicize +text, create subject headers, create bullet and numbered lists, and more. These cells are +given the name “Markdown” because they use Markdown language to specify the rich text formatting. +You do not need to learn Markdown to write text in the Markdown cells in +Jupyter; plain text will work just fine. However, you might want to learn a bit +about Markdown eventually to enable you to create nicely formatted analyses. +See the additional resources at the end of this chapter to find out +where you can start learning Markdown.
+To edit a Markdown cell in Jupyter, you need to double click on the cell. Once
+you do this, the unformatted (or unrendered) version of the text will be
+shown (Figure 11.7). You
+can then use your keyboard to edit the text. To view the formatted
+(or rendered) text (Figure 11.8), click the Run (▶) button in the toolbar,
+or use the Shift + Enter
keyboard shortcut.
+Figure 11.7: A Markdown cell in Jupyter that has not yet been rendered and can be edited. +
++Figure 11.8: A Markdown cell in Jupyter that has been rendered and exhibits rich text formatting. +
+To create a new Markdown cell in Jupyter, click the +
button in the toolbar.
+By default, all new cells in Jupyter start as code cells, so
+the cell format needs to be changed to be recognized and rendered as a Markdown
+cell. To do this, click on the cell with your cursor to
+ensure it is activated. Then click on the drop-down box on the toolbar that says “Code” (it
+is next to the ⏭ button), and change it from “Code” to “Markdown” (Figure 11.9).
+Figure 11.9: New cells are by default code cells. To create Markdown cells, the cell format must be changed. +
+As with any file you work on, it is critical to save your work often so you
+don’t lose your progress! Jupyter has an autosave feature, where open files are
+saved periodically. The default for this is every two minutes. You can also
+manually save a Jupyter notebook by selecting Save Notebook from the
+File menu, by clicking the disk icon on the toolbar,
+or by using a keyboard shortcut (Control + S
for Windows, or Command + S
for
+Mac OS).
As you might know (or at least imagine) by now, Jupyter notebooks are great for +interactively editing, writing and running R code; this is what they were +designed for! Consequently, Jupyter notebooks are flexible in regards to code +cell execution order. This flexibility means that code cells can be run in any +arbitrary order using the Run (▶) button. But this flexibility has a downside: +it can lead to Jupyter notebooks whose code cannot be executed in a linear +order (from top to bottom of the notebook). A nonlinear notebook is problematic +because a linear order is the conventional way code documents are run, and +others will have this expectation when running your notebook. Finally, if the +code is used in some automated process, it will need to run in a linear order, +from top to bottom of the notebook.
+The most common way to inadvertently create a nonlinear notebook is to rely solely
+on using the ▶ button to execute cells. For example,
+suppose you write some R code that creates an R object, say a variable named
+y
. When you execute that cell and create y
, it will continue
+to exist until it is deliberately deleted with R code, or when the Jupyter
+notebook R session (i.e., kernel) is stopped or restarted. It can also be
+referenced in another distinct code cell (Figure 11.10).
+Together, this means that you could then write a code cell further above in the
+notebook that references y
and execute it without error in the current session
+(Figure 11.11). This could also be done successfully in
+future sessions if, and only if, you run the cells in the same unconventional
+order. However, it is difficult to remember this unconventional order, and it
+is not the order that others would expect your code to be executed in. Thus, in
+the future, this would lead
+to errors when the notebook is run in the conventional
+linear order (Figure 11.12).
+Figure 11.10: Code that was written out of order, but not yet executed. +
++Figure 11.11: Code that was written out of order, and was executed using the run button in a nonlinear order without error. The order of execution can be traced by following the numbers to the left of the code cells; their order indicates the order in which the cells were executed. +
++Figure 11.12: Code that was written out of order, and was executed in a linear order using “Restart Kernel and Run All Cells…” This resulted in an error at the execution of the second code cell and it failed to run all code cells in the notebook. +
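As a minimal sketch of the scenario in Figures 11.10–11.12 (the object name and values here are hypothetical), imagine these two cells appearing in this order in a notebook:
# Cell 1 (appears first in the notebook): refers to y, which no earlier cell
# has created; a top-to-bottom run stops here with "object 'y' not found"
print(y * 2)

# Cell 2 (appears second in the notebook): creates y; running this cell first
# and then Cell 1 with the Run button works, but only in that unconventional order
y <- 10
Running Cell 2 first and then Cell 1 with the Run (▶) button succeeds in the current session, but Restart Kernel and Run All Cells… executes the cells from top to bottom, so Cell 1 fails exactly as in Figure 11.12.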
+You can also accidentally create a nonfunctioning notebook by +creating an object in a cell that later gets deleted. In such a +scenario, that object only exists for that one particular R session and will +not exist once the notebook is restarted and run again. If that +object was referenced in another cell in that notebook, an error +would occur when the notebook was run again in a new session.
+These events may not negatively affect the current R session when +the code is being written; but as you might now see, they will likely lead to +errors when that notebook is run in a future session. Regularly executing +the entire notebook in a fresh R session will help guard +against this. If you restart your session and new errors seem to pop up when +you run all of your cells in linear order, you can at least be aware that there +is an issue. Knowing this sooner rather than later will allow you to +fix the issue and ensure your notebook can be run linearly from start to finish.
As a best practice, we recommend running the entire notebook in a fresh R session at least 2–3 times within any period of work. Note that, critically, you must do this in a fresh R session by restarting your kernel. We recommend using either the Kernel >> Restart Kernel and Run All Cells… command from the menu or the ⏭ button in the toolbar. Note that the Run >> Run All Cells menu item will not restart the kernel, and so it is not sufficient to guard against these errors.
+Most data analyses these days depend on functions from external R packages that
+are not built into R. One example is the tidyverse
metapackage that we
+heavily rely on in this book. This package provides us access to functions like
+read_csv
for reading data, select
for subsetting columns, and ggplot
for
+creating high-quality graphics.
As mentioned earlier in the book, external R packages need to be loaded before
+the functions they contain can be used. Our recommended way to do this is via
+library(package_name)
. But where should this line of code be written in a
+Jupyter notebook? One idea could be to load the library right before the
+function is used in the notebook. However, although this technically works, this
+causes hidden, or at least non-obvious, R package dependencies when others view
+or try to run the notebook. These hidden dependencies can lead to errors when
+the notebook is executed on another computer if the needed R packages are not
installed. Additionally, if the data analysis code takes a long time to run, it can take a great deal of time to uncover the hidden dependencies that need to be installed so that the analysis can run without error.
Therefore, we recommend you load all R packages in a code cell near the top of +the Jupyter notebook. Loading all your packages at the start ensures that all +packages are loaded before their functions are called, assuming the notebook is +run in a linear order from top to bottom as recommended above. It also makes it +easy for others viewing or running the notebook to see what external R packages +are used in the analysis, and hence, what packages they should install on +their computer to run the analysis successfully.
+Write code so that it can be executed in a linear order.
As you write code in a Jupyter notebook, run the notebook in a linear order +and in its entirety often (2–3 times every work session) via the Kernel >> +Restart Kernel and Run All Cells… command from the Jupyter menu or the ⏭ +button in the toolbar.
Write the code that loads external R packages near the top of the Jupyter +notebook.
It is essential to preview data files before you try to read them into R to see +whether or not there are column names, what the delimiters are, and if there are +lines you need to skip. In Jupyter, you preview data files stored as plain text +files (e.g., comma- and tab-separated files) in their plain text format (Figure 11.14) by +right-clicking on the file’s name in the Jupyter file explorer, selecting +Open with, and then selecting Editor (Figure 11.13). +Suppose you do not specify to open +the data file with an editor. In that case, Jupyter will render a nice table +for you, and you will not be able to see the column delimiters, and therefore +you will not know which function to use, nor which arguments to use and values +to specify for them.
++Figure 11.13: Opening data files with an editor in Jupyter. +
++Figure 11.14: A data file as viewed in an editor in Jupyter. +
+In Jupyter, viewing, editing and running R code is done in the Jupyter notebook
+file format with file extension .ipynb
. This file format is not easy to open and
+view outside of Jupyter. Thus, to share your analysis with people who do not
+commonly use Jupyter, it is recommended that you export your executed analysis
+as a more common file type, such as an .html
file, or a .pdf
. We recommend
+exporting the Jupyter notebook after executing the analysis so that you can
+also share the outputs of your code. Note, however, that your audience will not be
+able to run your analysis using a .html
or .pdf
file. If you want your audience
+to be able to reproduce the analysis, you must provide them with the .ipynb
Jupyter notebook file.
Exporting to .html
will result in a shareable file that anyone can open
+using a web browser (e.g., Firefox, Safari, Chrome, or Edge). The .html
+output will produce a document that is visually similar to what the Jupyter notebook
+looked like inside Jupyter. One point of caution here is that if there are
+images in your Jupyter notebook, you will need to share the image files and the
+.html
file to see them.
Exporting to .pdf
will result in a shareable file that anyone can open
+using many programs, including Adobe Acrobat, Preview, web browsers and many
+more. The benefit of exporting to PDF is that it is a standalone document,
+even if the Jupyter notebook included references to image files.
+Unfortunately, the default settings will result in a document
+that visually looks quite different from what the Jupyter notebook looked
+like. The font, page margins, and other details will appear different in the .pdf
output.
At some point, you will want to create a new, fresh Jupyter notebook for your +own project instead of viewing, running or editing a notebook that was started +by someone else. To do this, navigate to the Launcher tab, and click on +the R icon under the Notebook heading. If no Launcher tab is visible, +you can get a new one via clicking the + button at the top of the Jupyter +file explorer (Figure 11.15).
++Figure 11.15: Clicking on the R icon under the Notebook heading will create a new Jupyter notebook with an R kernel. +
+Once you have created a new Jupyter notebook, be sure to give it a descriptive
+name, as the default file name is Untitled.ipynb
. You can rename files by
+first right-clicking on the file name of the notebook you just created, and
+then clicking Rename. This will make
+the file name editable. Use your keyboard to
+change the name. Pressing Enter
or clicking anywhere else in the Jupyter
+interface will save the changed file name.
We recommend not using white space or non-standard characters in file names.
+Doing so will not prevent you from using that file in Jupyter. However, these
+sorts of things become troublesome as you start to do more advanced data
+science projects that involve repetition and automation. We recommend naming
+files using lower case characters and separating words by a dash (-
) or an
+underscore (_
).