\chapter{Supervised learning in a nutshell}
\label{sec-nutshell}
In which we try to describe the outlines of the ``lifecycle'' of
supervised learning, including hyperparameter tuning and evaluation of
the final product.
\section{General case}
We start with a very generic setting.
\subsection{Minimal problem specification}
Given:
\begin{itemize}
\item Space of inputs $\mathcal{X}$
\item Space of outputs $\mathcal{Y}$
\item Space of possible {\em hypotheses} $\hclass$ such that each
$h \in \hclass$ is a function
$h: \mathcal{X} \rightarrow \mathcal{Y}$
\item Loss function $\loss : \mathcal{Y} \times \mathcal{Y}
\rightarrow \R$
\end{itemize}
a {\em supervised learning algorithm} $\alg$
takes as input a {\em data set} of the form
\[\data = \left\{\left(\ex{x}{1}, \ex{y}{1}\right), \dots,
\left(\ex{x}{n}, \ex{y}{n}\right)\right\}\;\;,\]
where $\ex{x}{i} \in \mathcal{X}$ and $\ex{y}{i} \in \mathcal{Y}$
and returns an $h \in \hclass$.
\subsection{Evaluating a hypothesis}
Given a problem specification and a set of data $\data$, we
evaluate hypothesis $h$ according to average loss, or {\em error},
\begin{eqnarray*}
\mathcal{E}(h, \loss, \data) = \frac{1}{|\data|}\sum_{i = 1}^{|\data|}
\loss(h(\ex{x}{i}), \ex{y}{i})\;\;.
\end{eqnarray*}
If the data used for evaluation {\em were not used during learning
of the hypothesis} then this is a reasonable estimate of how well
the hypothesis will make additional predictions on new data from the
same source.
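To make the computation concrete, here is a minimal Python sketch of this
average-loss calculation (the function name \texttt{error} and its interface
are our own choices, not part of the problem specification):
\begin{verbatim}
# Average the per-example loss over the data set, as in the formula above.
def error(h, loss, data):
    # h: hypothesis mapping x -> guess; loss: function of (guess, y);
    # data: list of (x, y) pairs
    return sum(loss(h(x), y) for x, y in data) / len(data)

# Example with squared loss and a fixed hypothesis:
squared_loss = lambda g, y: (g - y) ** 2
h = lambda x: 2 * x + 1
print(error(h, squared_loss, [(0.0, 1.0), (1.0, 3.5)]))  # 0.125
\end{verbatim}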
\subsection{Evaluating a supervised learning algorithm}
A {\em validation strategy} $\mathcal{V}$ takes an algorithm
$\alg$, a loss function $\loss$, and a data source
$\data$, and produces a real number that measures how well $\alg$
performs on data from that source.
\subsubsection{Using a validation set}
In the simplest case, we can divide $\data$ into two sets,
$\data^\text{train}$ and $\data^\text{val}$, train on the first, and
then evaluate the resulting hypothesis on the second. In that case,
\[\mathcal{V}(\alg, \loss, \data) =
\mathcal{E}(\alg(\data^\text{train}),
\loss, \data^\text{val})\;\;.\]
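As a sketch, assuming the \texttt{error} helper from the previous example and
treating the validation fraction as a free choice of ours:
\begin{verbatim}
import random

# Split D into train/val pieces, train on the first, evaluate on the second.
def validate_once(algorithm, loss, data, val_fraction=0.2):
    data = list(data)
    random.shuffle(data)
    n_val = int(len(data) * val_fraction)
    d_val, d_train = data[:n_val], data[n_val:]
    h = algorithm(d_train)        # algorithm returns a hypothesis x -> guess
    return error(h, loss, d_val)  # error() as defined earlier
\end{verbatim}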
\subsubsection{Using multiple training/evaluation runs}
We can't reliably evaluate an algorithm based on a single application of it to
a single training and test set, because there are many aspects of the
training and testing data, as well as, sometimes, randomness in
the algorithm itself, that cause {\em variance} in the performance of
the algorithm. To get a good idea of how well an algorithm
performs, we need to train it and evaluate the resulting hypothesis
multiple times, and report the average, over $K$ executions of the
algorithm, of the error of the hypothesis it produced each time.
We divide the data into $2K$ random non-overlapping subsets:
$\data_1^\text{train}, \data_1^\text{val}, \ldots,
\data_K^\text{train}, \data_K^\text{val}.$ Then,
\[\mathcal{V}(\alg, \loss, \data) =
\frac{1}{K}\sum_{k=1}^K \mathcal{E}(\alg(\data_k^\text{train}),
\loss, \data_k^\text{val})\;\;.\]
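A sketch of this procedure, using equal-sized chunks (the relative sizes of
the training and validation pieces are a free choice, not fixed by the notes):
\begin{verbatim}
import random

# Average error over K disjoint train/val pairs drawn from 2K
# non-overlapping chunks of the shuffled data.
def validate_k_splits(algorithm, loss, data, k=5):
    data = list(data)
    random.shuffle(data)
    chunk = len(data) // (2 * k)
    total = 0.0
    for i in range(k):
        d_train = data[2 * i * chunk : (2 * i + 1) * chunk]
        d_val   = data[(2 * i + 1) * chunk : (2 * i + 2) * chunk]
        total += error(algorithm(d_train), loss, d_val)  # error() as above
    return total / k
\end{verbatim}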
\subsubsection{Cross validation}
In {\em cross validation}, we do a similar computation, but allow data
to be re-used in the $K$ different iterations of training and testing
the algorithm (but never share training and testing data for a single
iteration!). See Section~\ref{cross-validation} for details.
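A compact sketch of $K$-fold cross validation (this version simply splits the
shuffled data into $K$ folds; the fold count is an arbitrary choice):
\begin{verbatim}
import random

# K-fold cross validation: each fold serves once as validation data while
# the remaining folds together form the training data, so data are re-used
# across iterations but never shared within one iteration.
def cross_validate(algorithm, loss, data, k=5):
    data = list(data)
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    total = 0.0
    for i in range(k):
        d_val = folds[i]
        d_train = [pair for j in range(k) if j != i for pair in folds[j]]
        total += error(algorithm(d_train), loss, d_val)  # error() as above
    return total / k
\end{verbatim}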
\subsection{Comparing supervised learning algorithms}
Now, if we have two different algorithms $\alg_1$ and
$\alg_2$, we might be interested in knowing which one will
produce hypotheses that generalize the best, using data from a
particular source. We could compute
$\mathcal{V}(\alg_1, \loss, \data)$ and
$\mathcal{V}(\alg_2, \loss, \data)$, and prefer the
algorithm with the lower validation error. More generally, given
algorithms $\alg_1, \ldots, \alg_M$, we would prefer
\[\alg^* = \argmin{m} \mathcal{V}(\alg_m, \loss, \data)\;\;.\]
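In code, this comparison is just an argmin over validation scores; any of the
validation procedures sketched above can stand in for \texttt{validate}:
\begin{verbatim}
# Return the algorithm with the lowest validation score.
def best_algorithm(algorithms, loss, data, validate):
    scores = [validate(alg, loss, data) for alg in algorithms]
    return algorithms[scores.index(min(scores))]
\end{verbatim}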
\subsection{Fielding a hypothesis}
Now what? We have to deliver a hypothesis to our customer.
We now know how to find the algorithm, $\alg^*$, that works best for
our type of data. We can apply it to all of our data to get the best
hypothesis we know how to create, which would be
\[h^* = \alg^*(\data)\;\;,\]
and deliver this resulting hypothesis as our best product.
\subsection{Learning algorithms as optimizers}
Most learning algorithms work by optimizing some objective involving
the training data and a loss
function. \note{Interestingly, this loss function is {\em not always
the same} as the loss function that is used for evaluation! We
will see this in logistic regression.}
So, for example (assuming a perfect optimizer, which doesn't, of
course, exist), we might say that our algorithm solves an optimization
problem:
\[\alg(\data) = \argmin{h \in \hclass} \mathcal{J}(h; \data)\;\;.\]
Our objective often has the form
\[\mathcal{J}(h; \data) = \mathcal{E}(h,\loss, \data) +
\mathcal{R}(h)\;\;,\]
where $\loss$ is a loss to be minimized during training and
$\mathcal{R}$ is a regularization term.
\subsection{Hyperparameters}
Often, rather than comparing an arbitrary collection of learning algorithms,
we think of our learning algorithm as having some parameters that
affect the way it maps data to a hypothesis. These are not
parameters of the hypothesis itself, but rather parameters of the
{\em algorithm}. We call these {\em hyperparameters}. A classic
example would be to use a hyperparameter $\lambda$ to govern the weight of a
regularization term on an objective to be optimized:
\[\mathcal{J}(h; \data) = \mathcal{E}(h,\loss, \data) +
\lambda \mathcal{R}(h)\;\;.\]
Then we could think of our algorithm as $\alg(\data;
\lambda)$.
Picking a good value of $\lambda$ is the same as comparing different
supervised learning algorithms, which is accomplished by validating
them and picking the best one!
\section{Concrete case: linear regression}
In linear regression the problem formulation is this:
\begin{itemize}
\item $\mathcal{X} = \R^d$
\item $\mathcal{Y} = \R$
\item $\hclass = \{\theta^T x + \theta_0\}$ for values of
parameters $\theta \in \R^d$ and $\theta_0 \in \R$.
\item $\loss(g, y) = (g - y)^2$
\end{itemize}
Our learning algorithm has hyperparameter $\lambda$ and can be written
as:
\[\alg(\data; \lambda) = \Theta^*(\lambda, \data) = \argmin{\theta, \theta_0}
\frac{1}{|\data|} \sum_{(x, y) \in \data} (\theta^Tx +
\theta_0 - y)^2 + \lambda\|\theta\|^2 \;\;.\]
For a particular training data set and parameter $\lambda$, it finds
the best hypothesis on this data, specified with parameters $\Theta =
(\theta, \theta_0)$, written $\Theta^*(\lambda, \data)$.
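Because this objective is quadratic in the parameters, it has a closed-form
minimizer. Here is a numpy sketch of $\alg(\data; \lambda)$ along those lines
(the function name \texttt{train\_ridge} and the array-based interface are our
own; note that the offset $\theta_0$ is not penalized, matching the objective
above):
\begin{verbatim}
import numpy as np

# Closed-form minimizer of mean squared error + lambda * ||theta||^2.
def train_ridge(X, y, lam):
    # X: (n, d) array of inputs, y: (n,) array of targets
    n, d = X.shape
    X1 = np.hstack([X, np.ones((n, 1))])  # extra column for theta_0
    reg = lam * n * np.eye(d + 1)         # n accounts for the 1/n in the mse
    reg[d, d] = 0.0                       # do not penalize the offset
    Theta = np.linalg.solve(X1.T @ X1 + reg, X1.T @ y)
    return Theta[:d], Theta[d]            # (theta, theta_0)
\end{verbatim}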
Picking the best value of the hyperparameter is choosing among
learning algorithms. We could, most simply,
optimize using a single training / validation split, so $\data =
\data^\text{train} \cup \data^\text{val}$, and
\begin{align*}
\lambda^* & = \argmin{\lambda} \mathcal{V}(\alg_\lambda, \loss,
\data^\text{val}) \\
& = \argmin{\lambda} \mathcal{E}(\Theta^*(\lambda, \data^\text{train}), \text{mse},
\data^\text{val}) \\
& = \argmin{\lambda} \frac{1}{|\data^\text{val}|} \sum_{(x, y) \in
  \data^\text{val}} (\theta^*(\lambda, \data^\text{train})^T x +
  \theta^*_0(\lambda, \data^\text{train}) - y)^2\;\;.
\end{align*}
It would be much better to select the best $\lambda$ using multiple
runs or cross-validation; that would just be a different choice of
the validation procedure $\mathcal{V}$ in the top line.
Note that we don't include the regularization term when measuring
validation error, because we just want to measure how well the
algorithm's output predicts the values of {\em new} points, and so
that is exactly what we measure. We use the regularizer during
training because we don't want to focus only on optimizing predictions
on the training data.
Finally! To make a predictor to ship out into the world, we would use
all the data we have, $\data$, to train, using the best
hyperparameters we know, and return
\begin{align*}
\Theta^* & = \alg(\data; \lambda^*) \\
& = \Theta^*(\lambda^*, \data) \\
& = \argmin{\theta, \theta_0}
\frac{1}{|\data|} \sum_{(x, y) \in \data} (\theta^Tx +
\theta_0 - y)^2 + \lambda^*\|\theta\|^2
\end{align*}
Finally, a customer might evaluate this hypothesis on their data,
which we have never seen during training or validation, as
\begin{align*}
\mathcal{E}^\text{test} & = \mathcal{E}(\Theta^*, \text{mse},
\data^\text{test}) \\
& = \frac{1}{|\data^\text{test}|} \sum_{(x, y) \in \data^\text{test}} ({\theta^*}^Tx +
\theta_0^* - y)^2
\end{align*}
Here are the same ideas, written out in informal pseudocode.
\begin{verbatim}
# returns theta_best(D, lambda)
define train(D, lambda):
    return minimize(mse(theta, D) + lambda * norm(theta)**2, theta)

# returns lambda_best using very simple validation
define simple_tune(D_train, D_val, possible_lambda_vals):
    scores = [mse(train(D_train, lambda), D_val)
              for lambda in possible_lambda_vals]
    return possible_lambda_vals[least_index(scores)]

# returns theta_best overall
define theta_best(D_train, D_val, possible_lambda_vals):
    return train(D_train + D_val,
                 simple_tune(D_train, D_val, possible_lambda_vals))

# customer evaluation of the theta delivered to them
define customer_val(theta):
    return mse(theta, D_test)
\end{verbatim}
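For comparison, here is a runnable numpy counterpart to the pseudocode,
reusing \texttt{train\_ridge} from the earlier sketch; here a data set
\texttt{D} is a pair \texttt{(X, y)} of arrays, and the helper names are our
own:
\begin{verbatim}
import numpy as np

def mse(theta, theta_0, D):
    X, y = D
    return np.mean((X @ theta + theta_0 - y) ** 2)

def simple_tune(D_train, D_val, possible_lambda_vals):
    # validation error of the hypothesis trained with each candidate lambda
    scores = [mse(*train_ridge(*D_train, lam), D_val)
              for lam in possible_lambda_vals]
    return possible_lambda_vals[int(np.argmin(scores))]

def theta_best(D_train, D_val, possible_lambda_vals):
    lam = simple_tune(D_train, D_val, possible_lambda_vals)
    X = np.vstack([D_train[0], D_val[0]])      # retrain on all the data
    y = np.concatenate([D_train[1], D_val[1]])
    return train_ridge(X, y, lam)
\end{verbatim}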
\section{Concrete case: logistic regression}
In binary logistic regression the problem formulation is as follows.
We are writing the class labels as 1 and 0.
\begin{itemize}
\item $\mathcal{X} = \R^d$
\item $\mathcal{Y} = \{1, 0\}$
\item $\hclass = \{\sigma(\theta^T x + \theta_0)\}$ for values of
parameters $\theta \in \R^d$ and $\theta_0 \in \R$.
\item $\loss(g, y) = \loss_{01}(g, y)$
\item Proxy loss $\loss_\text{nll}(g, y) = -\left(y \log (g) + (1 - y)\log (1 -g)\right)$
\end{itemize}
Our learning algorithm has hyperparameter $\lambda$ and can be written
as:
\[\alg(\data; \lambda) = \Theta^*(\lambda, \data) = \argmin{\theta, \theta_0}
\frac{1}{|\data|} \sum_{(x, y) \in \data} \loss_\text{nll}(\sigma(\theta^Tx +
\theta_0), y) + \lambda\|\theta\|^2 \;\;.\]
For a particular training data set and parameter $\lambda$, it finds
the best hypothesis on this data, specified with parameters $\Theta =
(\theta, \theta_0)$, written $\Theta^*(\lambda, \data)$ according to
the proxy loss $\loss_\text{nll}$.
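There is no closed form here, so the optimization is typically done
iteratively. A minimal gradient-descent sketch of $\alg(\data; \lambda)$,
with the step size and iteration count as arbitrary choices of our own:
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient descent on mean NLL loss + lambda * ||theta||^2.
def train_logistic(X, y, lam, step=0.1, iters=5000):
    # X: (n, d) inputs; y: (n,) labels in {0, 1}
    n, d = X.shape
    theta, theta_0 = np.zeros(d), 0.0
    for _ in range(iters):
        g = sigmoid(X @ theta + theta_0)
        theta -= step * (X.T @ (g - y) / n + 2 * lam * theta)
        theta_0 -= step * np.mean(g - y)
    return theta, theta_0
\end{verbatim}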
Picking the best value of the hyperparameter is choosing among
learning algorithms based on their actual predictions. We could, most simply,
optimize using a single training / validation split, so $\data =
\data^\text{train} \cup \data^\text{val}$, and {\em we use the real
01 loss}:
\begin{align*}
\lambda^* & = \argmin{\lambda} \mathcal{V}(\alg_\lambda, \loss_{01},
\data^\text{val}) \\
& = \argmin{\lambda} \mathcal{E}(\Theta^*(\lambda, \data^\text{train}), \loss_{01},
\data^\text{val}) \\
& = \argmin{\lambda} \frac{1}{|\data^\text{val}|} \sum_{(x, y) \in
  \data^\text{val}} \loss_{01}(\sigma(\theta^*(\lambda, \data^\text{train})^T x +
  \theta^*_0(\lambda, \data^\text{train})), y)\;\;.
\end{align*}
It would be much better to select the best $\lambda$ using multiple
runs or cross-validation; that would just be a different choice of
the validation procedure $\mathcal{V}$ in the top line.
Finally! To make a predictor to ship out into the world, we would use
all the data we have, $\data$, to train, using the best
hyperparameters we know, and return
\begin{align*}
\Theta^* & = \alg(\data; \lambda^*)
\end{align*}
\question{What loss function is being optimized inside this algorithm?}
Finally, a customer might evaluate this hypothesis on their data,
which we have never seen during training or validation, as
\begin{align*}
\mathcal{E}^\text{test} & = \mathcal{E}(\Theta^*, \loss_{01},
\data^\text{test})
\end{align*}
The customer just wants to buy the right stocks! So we use the real
$\loss_{01}$ here for validation.
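A sketch of that evaluation; turning the hypothesis's output
$\sigma(\theta^Tx + \theta_0)$ into a hard label by thresholding at $0.5$ is
our assumption about how predictions are made:
\begin{verbatim}
import numpy as np

# 0-1 error of a trained logistic hypothesis on a held-out data set (X, y).
def error_01(theta, theta_0, D):
    X, y = D
    g = 1.0 / (1.0 + np.exp(-(X @ theta + theta_0)))  # sigma(theta^T x + theta_0)
    predictions = (g > 0.5).astype(float)             # threshold at 0.5 (our choice)
    return float(np.mean(predictions != y))
\end{verbatim}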
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "top"
%%% End: