-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathchapter2.Rmd
338 lines (268 loc) · 13.6 KB
/
chapter2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
---
title: "Rethinking Chapter 2"
author: "Gregor Mathes"
date: "2020-26-07"
output:
html_document:
toc: true
toc_float: true
number_sections: true
theme: journal
keep_md: false
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.dim=c(7,4))
library(tidyverse)
library(rethinking)
```
# Introduction
This is the first part of a series where I work through the practice questions of the second edition of Richard McElreaths [Statistical Rethinking](https://xcelab.net/rm/statistical-rethinking/). Each post covers a new chapter. There are already some awesome sources for this book online like [Jeffrey Girard](https://jmgirard.com/statistical-rethinking-ch2/) working through the exercises of the first edition, or [Solomon Kurz](https://bookdown.org/ajkurz/Statistical_Rethinking_recoded/) leading through each example of the book with the *brms* and the *tidyverse* packages. You can even watch the [lectures of McElreath](https://www.youtube.com/playlist?list=PLDcUM9US4XdNM4Edgs7weiyIguLSToZRI) on Youtube and work through the [homework and solutions](https://github.com/rmcelreath/statrethinking_winter2019/tree/master/homework).
However, so far I couldn't find a source providing solutions for the practice questions of the second edition, or the homework practices, in a tidy(-verse) way. My aim here is therefore to provide solutions for each homework and practice question of the second edition, using the *tidyverse* and the *rethinking* packages. The first part of the series will cover chapter 2, which corresponds to week 1 of the lectures and homework.
# Homework
## Question 1
**Suppose the globe tossing data had turned out to be 8 water in 15 tosses. Construct the posterior distribution, using grid approximation. Use the same flat prior as before.**
We can do all of this in one pipe:
```{r homework 1}
tibble(
# define grid
p.grid = seq(from = 0, to = 1, length.out = 20),
# define prior, in this case flat
prior = rep(1, 20)
) %>%
# compute likelihood at each value in grid for 8 water in 15 tosses
mutate(likelihood = dbinom(8, size = 15, prob = p.grid),
# compute product of likelihood and prior
unstd.likelihood = likelihood * prior,
# standardise the posterior so it sums to 1
posterior = unstd.likelihood / sum(unstd.likelihood)) %>%
# directly pipe that to a ggplot
ggplot() +
geom_line(aes(p.grid, posterior)) +
labs(y = "posterior probability", x = "probability of water") +
theme_minimal()
```
## Question 2
**Start over in 1, but now use a prior that is zero below p = 0:5 and a constant above p = 0:5. This corresponds to prior information that a majority of the Earth’s surface is water. What difference does the better prior make? If it helps, compare posterior distributions (using both priors) to the true value p = 0:7.**
Same approach, but with additional colums. First we calculate old and new posteriors and then transform it to longer format for plotting.
```{r homework 2}
grid_comp <- tibble(
# define grid
p.grid = seq(from = 0, to = 1, length.out = 20),
# define flat prior used in question 1
prior.old = rep(1, 20)
) %>%
# new prior with majority water
mutate(prior.new = if_else(p.grid <= 0.5, 0, 1),
# compute likelihood at each value in grid for 8 water in 15 tosses
likelihood = dbinom(8, size = 15, prob = p.grid),
# compute product of likelihood and prior
unstd.likelihood.old = likelihood * prior.old,
unstd.likelihood.new = likelihood * prior.new,
# standardise the posterior so it sums to 1
posterior.old = unstd.likelihood.old / sum(unstd.likelihood.old),
posterior.new = unstd.likelihood.new / sum(unstd.likelihood.new))
# data wrangling for plotting
grid_comp %>%
# select colums needed
select(p.grid, posterior.old, posterior.new) %>%
# transform it to longer format
pivot_longer(cols = c(posterior.old, posterior.new),
names_to = "type", values_to = "posterior") %>%
# pipe it to ggplot
ggplot(aes(x = p.grid, y = posterior, colour = type)) +
# add true value
geom_vline(xintercept = 0.7, colour = "coral") +
# label line
annotate("text", x = 0.78, y = 0.2, label = "true value", colour = "coral") +
geom_line(size = 2) +
scale_colour_manual(values =c("grey30", "grey70")) +
labs(y = "posterior probability", x = "probability of water") +
theme_light()
```
## Question 3
**This problem is more open-ended than the others. Feel free to collaborate on the solution. Suppose you want to estimate the Earth’s proportion of water very precisely. Specifically, you want the 99% percentile interval of the posterior distribution of p to be only 0.05 wide. This means the distance between the upper and lower bound of the interval should be 0.05. How many times will you have to toss the globe to do this? I won’t require a precise answer. I’m honestly more interested in your approach.**
As mentioned, this is going to be tricky. We need to simulate sample sizes, and then iteratively apply a function that calculates the 99% percentile interval to these sample sizes. Let's define the function:
```{r homework 3 int_width}
int_width <- function(n) {
# dependant on our sample size n, we draw the summed up number of water from a
# binomial distribution
w <- rbinom(1, size = n, prob = 0.7)
# now we need to apply our standard procedure to get the posterior based on grid
# approximation
tibble(
# define grid
p.grid = seq(from = 0, to = 1, length.out = 1000
),
# define prior
prior = rep(1, 1000)) %>%
# compute likelihood at each value in grid for
mutate(
likelihood = dbinom(w, size = n, prob = p.grid),
# compute product of likelihood and prior
unstd.likelihood = likelihood * prior,
# standardise the posterior so it sums to 1
posterior = unstd.likelihood / sum(unstd.likelihood)
) %>%
select(posterior, p.grid) %>%
# draw samples
sample_n(size = 1000, weight = posterior, replace = TRUE) %>%
# get 99% percentile interval
summarise(per.int = PI(samples = p.grid, prob = 0.99)) %>%
# get the total width
summarise(width = diff(per.int)) %>%
pull()
}
```
Now we can apply the function to our sample sizes. I use sample sizes up to 3000, and run 100 simulation at each sample size.
```{r homework 3 simulation}
# define sample sizes
n_list <- c(20, 50, 100, 200, 400, 800, 1000, 2000, 3000) %>%
# repeat each sample size 100 times
rep(., each = 100)
# we use map_dfr to apply the int_width function to n_list
sim_width <- map_dfr(n_list, int_width) %>%
# some data wrangling for plotting
add_column(n.list = n_list) %>%
rename("width" = "100%")
# pipe it to ggplot
ggplot(sim_width) +
# reference line and text
geom_hline(yintercept = 0.05, color = "coral") +
annotate("text", x = 200, y = 0.075, label = "aim width", colour = "coral") +
# simulations
geom_point(aes(x = n.list, y = width), alpha = 0.05) +
labs(x = "sample size", y = "interval width") +
theme_minimal()
```
It seems that more than 2000, but less than 3000 globe tosses are sufficient to produce interval that precise.
# Easy practices
## Question 2E1
**Which of the expressions below correspond to the statement: "the probability of rain on Monday?"**
1. Pr(rain)
2. Pr(rain|Monday)
3. Pr(Monday|rain)
4. Pr(rain,Monday)/Pr(Monday)
Let's rephrase each expression as a statement:
1 is the probability of rain.
2 is the probability of rain, given that it is Monday.
3 is the probability of it being Monday, given rain.
4 is the probability of rain and it being Monday, given that it is Monday.
It is quite obvious that 2 is the correct answer here, but 4 is correct as well. Using the joint probability definition on page 37, we can see that *PR(rain, Mondy)/Pr(Monday) = Pr(rain|Monday)*.
## Question 2E2
**Which of the following statements corresponds to the expression: "Pr(Monday|rain)?"**
1. The probability of rain on Monday.
2. The probability of rain, given that it is Monday.
3. The probability that it is Monday, given that it is raining.
4. The probability that it is Monday and that it is raining.
Now we have to convert each statement to an expression:
1 would be Pr(rain|Monday).
2 would be Pr(rain|Monday).
3 would be Pr(Monday|rain).
4 would be Pr(Monday,rain).
The correct answer is 3.
## Question 2E3
**Which of the expressions below correspond to the statement: "the probability that it is Monday, given that it is raining?"**
Pr(Monday|rain)
Pr(rain|Monday)
Pr(rain|Monday)Pr(Monday)
Pr(rain|Monday)Pr(Monday)/Pr(rain)
Pr(Monday|rain)Pr(rain)/Pr(Monday)
Let’s convert each expression into a statement:
1 would be the probability that it is Monday, given that it is raining.
2 would be the probability of rain, given that it is Monday.
3 needs to be converted using the formula on page 37:
*Pr(rain|Monday)Pr(Monday) = Pr(rain,Monday)*
4 is the same as the previous option but with division added:
*Pr(rain|Monday)Pr(Monday)/Pr(rain) = Pr(rain,Monday)/Pr(rain)*
Simplify this using joint probability:
*Pr(rain,Monday)/Pr(rain) = Pr(Monday|rain)*
5 is the same as 4 but with the terms exchanged. So it can be interpreted (repeating all the previous work) as the probability of rain, given that it is Monday.
The correct answers are thus 1 and 4.
## Question 2E4
**The Bayesian statistician Bruno de Finetti (1906-1985) began his book on probability theory with the declaration: “PROBABILITY DOES NOT EXIST.” The capitals appeared in the original, so I imagine de Finetti wanted us to shout the statement. What he meant is that probability is a device for describing uncertainty from the perspective of an observer with limited knowledge; it has no objective reality. Discuss the globe tossing example from the chapter, in light of this statement. What does it mean to say “the probability of water is 0.7”?**
Solution copied over from [Jeffrey Girard](https://jmgirard.com/statistical-rethinking-ch2/):
From the Bayesian perspective, there is one true value of a parameter at any given time and thus there is no uncertainty and no probability in “objective reality.” It is only from the perspective of an observer with limited knowledge of this true value that uncertainty exists and that probability is a useful device. So the statement, “the probability of water is 0.7” means that, given our limited knowledge, our estimate of this parameter’s value is 0.7 (but it has some single true value independent of our uncertainty).
# Medium practices
## Question 2M1
**Recall the globe tossing model from the chapter. Compute and plot the grid approximate posterior distribution for each of the following sets of observations. In each case, assume a uniform prior for p.**
1. W,W,W
2. W,W,W,L
3. L,W,W,L,W,W,W
The solution for this is given on page 40. Let's transform it to a pipe. As we apply three inputs to it, we can save some time by making a function of w for water and l for land.
```{r question 2m1}
grid_approx <- function(w, l){
tibble(
# define grid
p.grid = seq(from = 0, to = 1, length.out = 20),
# define prior
prior = rep(1, 20)
) %>%
# compute likelihood at each value in grid for
mutate(likelihood = dbinom(w, size = w + l, prob = p.grid),
# compute product of likelihood and prior
unstd.likelihood = likelihood * prior,
# standardise the posterior so it sums to 1
posterior = unstd.likelihood / sum(unstd.likelihood)) %>%
# directly pipe that to a ggplot
ggplot() +
geom_line(aes(p.grid, posterior)) +
labs(y = "posterior probability", x = "probability of water") +
theme_minimal()
}
```
Now we can approximate the posterior for each option.
Option 1:
```{r 2M1 option 1}
grid_approx(w = 3, l = 0) # 3 water, 0 land
```
Option 2:
```{r 2M1 option 2}
grid_approx(w = 3, l = 1) # 3 water, 1 land
```
Option 3:
```{r 2M1 option 3}
grid_approx(w = 5, l = 2) # 5 water, 2 land
```
## Question 2M2
**Now assume a prior for p that is equal to zero when p<0.5 and is a positive constant when p≥0.5. Again compute and plot the grid approximate posterior distribution for each of the sets of observations in the problem just above.**
All we have to do is to update the prior outside of the tibble in our 'grid_approx()' function:
```{r question 2M2}
grid_approx <- function(w, l){
tibble(
# define grid
p.grid = seq(from = 0, to = 1, length.out = 20),
) %>%
# define prior
mutate(prior = if_else(p.grid <= 0.5, 0, 1),
# compute likelihood at each value in grid for
likelihood = dbinom(w, size = w + l, prob = p.grid),
# compute product of likelihood and prior
unstd.likelihood = likelihood * prior,
# standardise the posterior so it sums to 1
posterior = unstd.likelihood / sum(unstd.likelihood)) %>%
# directly pipe that to a ggplot
ggplot() +
geom_line(aes(p.grid, posterior)) +
labs(y = "posterior probability", x = "probability of water") +
theme_minimal()
}
```
Now we can apply the function to our three options again.
Option 1:
```{r 2M2 option 1}
grid_approx(w = 3, l = 0) # 3 water, 0 land
```
Option 2:
```{r 2M2 option 2}
grid_approx(w = 3, l = 1) # 3 water, 1 land
```
Option 3:
```{r 2M2 option 3}
grid_approx(w = 5, l = 2) # 5 water, 2 land
```
## Question 2M3 to 2H4
Questions 2M3 to 2H4 are rather phylosophical and can be solved by applying the equations given in the chapter. No code is needed to solve the questions and each question is the same as in the first edition, why I refer to the [solutions from Jeffrey Girard](https://jmgirard.com/statistical-rethinking-ch2/) here.
--------------------------------------------------------------------------------
```{r session info}
sessionInfo()
```