\chapter{Supervised learning in a nutshell}
\label{sec-nutshell}
In which we try to describe the outlines of the ``lifecycle'' of
supervised learning, including hyperparameter tuning and evaluation of
the final product.
\section{General case}
We start with a very generic setting.
\subsection{Minimal problem specification}
Given:
\begin{itemize}
\item Space of inputs $\mathcal{X}$
\item Space of outputs $\mathcal{Y}$
\item Space of possible {\em hypotheses} $\hclass$ such that each
$h \in \hclass$ is a function
$h: \mathcal{X} \rightarrow \mathcal{Y}$
\item Loss function $\loss : \mathcal{Y} \times \mathcal{Y}
\rightarrow \R$
\end{itemize}
a {\em supervised learning algorithm} $\alg$
takes as input a {\em data set} of the form
\[\data = \left\{\left(\ex{x}{1}, \ex{y}{1}\right), \dots,
\left(\ex{x}{n}, \ex{y}{n}\right)\right\}\;\;,\]
where $\ex{x}{i} \in \mathcal{X}$ and $\ex{y}{i} \in \mathcal{Y}$
and returns an $h \in \hclass$.
\subsection{Evaluating a hypothesis}
Given a problem specification and a set of data $\data$, we
evaluate hypothesis $h$ according to average loss, or {\em error},
\begin{eqnarray*}
\mathcal{E}(h, \loss, \data) = \frac{1}{|\data|}\sum_{i = 1}^{|\data|}
\loss(h(\ex{x}{i}), \ex{y}{i})\;\;.
\end{eqnarray*}
If the data used for evaluation {\em were not used during learning
of the hypothesis} then this is a reasonable estimate of how well
the hypothesis will make additional predictions on new data from the
same source.
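To make the computation concrete, here is a minimal Python sketch of this
average-loss calculation (the function name \texttt{error} and its interface
are our own choices, not part of the problem specification):
\begin{verbatim}
# Average the per-example loss over the data set, as in the formula above.
def error(h, loss, data):
    # h: hypothesis mapping x -> guess; loss: function of (guess, y);
    # data: list of (x, y) pairs
    return sum(loss(h(x), y) for x, y in data) / len(data)

# Example with squared loss and a fixed hypothesis:
squared_loss = lambda g, y: (g - y) ** 2
h = lambda x: 2 * x + 1
print(error(h, squared_loss, [(0.0, 1.0), (1.0, 3.5)]))  # 0.125
\end{verbatim}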
\subsection{Evaluating a supervised learning algorithm}
A {\em validation strategy} $\mathcal{V}$ takes an algorithm
$\alg$, a loss function $\loss$, and a data source
$\data$, and produces a real number that measures how well $\alg$
performs on data from that source.
\subsubsection{Using a validation set}
In the simplest case, we can divide $\data$ into two sets,
$\data^\text{train}$ and $\data^\text{val}$, train on the first, and
then evaluate the resulting hypothesis on the second. In that case,
\[\mathcal{V}(\alg, \loss, \data) =
\mathcal{E}(\alg(\data^\text{train}),
\loss, \data^\text{val})\;\;.\]
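As a sketch, assuming the \texttt{error} helper from the previous example and
treating the validation fraction as a free choice of ours:
\begin{verbatim}
import random

# Split D into train/val pieces, train on the first, evaluate on the second.
def validate_once(algorithm, loss, data, val_fraction=0.2):
    data = list(data)
    random.shuffle(data)
    n_val = int(len(data) * val_fraction)
    d_val, d_train = data[:n_val], data[n_val:]
    h = algorithm(d_train)        # algorithm returns a hypothesis x -> guess
    return error(h, loss, d_val)  # error() as defined earlier
\end{verbatim}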
\subsubsection{Using multiple training/evaluation runs}
We can't reliably evaluate an algorithm based on a single application of it to
a single training and test set, because there are many aspects of the
training and testing data, as well as, sometimes, randomness in
the algorithm itself, that cause {\em variance} in the performance of
the algorithm. To get a good idea of how well an algorithm
performs, we need to train it and evaluate the resulting hypothesis
multiple times, and report the average, over $K$ executions of the
algorithm, of the error of the hypothesis it produced each time.
We divide the data into $2K$ random non-overlapping subsets:
$\data_1^\text{train}, \data_1^\text{val}, \ldots,
\data_K^\text{train}, \data_K^\text{val}.$ Then,
\[\mathcal{V}(\alg, \loss, \data) =
\frac{1}{K}\sum_{k=1}^K \mathcal{E}(\alg(\data_k^\text{train}),
\loss, \data_k^\text{val})\;\;.\]
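A sketch of this procedure, using equal-sized chunks (the relative sizes of
the training and validation pieces are a free choice, not fixed by the notes):
\begin{verbatim}
import random

# Average error over K disjoint train/val pairs drawn from 2K
# non-overlapping chunks of the shuffled data.
def validate_k_splits(algorithm, loss, data, k=5):
    data = list(data)
    random.shuffle(data)
    chunk = len(data) // (2 * k)
    total = 0.0
    for i in range(k):
        d_train = data[2 * i * chunk : (2 * i + 1) * chunk]
        d_val   = data[(2 * i + 1) * chunk : (2 * i + 2) * chunk]
        total += error(algorithm(d_train), loss, d_val)  # error() as above
    return total / k
\end{verbatim}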
\subsubsection{Cross validation}
In {\em cross validation}, we do a similar computation, but allow data
to be re-used in the $K$ different iterations of training and testing
the algorithm (but never share training and testing data for a single
iteration!). See Section~\ref{cross-validation} for details.
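A compact sketch of $K$-fold cross validation (this version simply splits the
shuffled data into $K$ folds; the fold count is an arbitrary choice):
\begin{verbatim}
import random

# K-fold cross validation: each fold serves once as validation data while
# the remaining folds together form the training data, so data are re-used
# across iterations but never shared within one iteration.
def cross_validate(algorithm, loss, data, k=5):
    data = list(data)
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    total = 0.0
    for i in range(k):
        d_val = folds[i]
        d_train = [pair for j in range(k) if j != i for pair in folds[j]]
        total += error(algorithm(d_train), loss, d_val)  # error() as above
    return total / k
\end{verbatim}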
\subsection{Comparing supervised learning algorithms}
Now, if we have two different algorithms $\alg_1$ and
$\alg_2$, we might be interested in knowing which one will
produce hypotheses that generalize the best, using data from a
particular source. We could compute
$\mathcal{V}(\alg_1, \loss, \data)$ and
$\mathcal{V}(\alg_2, \loss, \data)$, and prefer the
algorithm with the lower validation error. More generally, given
algorithms $\alg_1, \ldots, \alg_M$, we would prefer
\[\alg^* = \argmin{m} \mathcal{V}(\alg_m, \loss, \data)\;\;.\]
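In code, this comparison is just an argmin over validation scores; any of the
validation procedures sketched above can stand in for \texttt{validate}:
\begin{verbatim}
# Return the algorithm with the lowest validation score.
def best_algorithm(algorithms, loss, data, validate):
    scores = [validate(alg, loss, data) for alg in algorithms]
    return algorithms[scores.index(min(scores))]
\end{verbatim}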
\subsection{Fielding a hypothesis}
Now what? We have to deliver a hypothesis to our customer.
We now know how to find the algorithm, $\alg^*$, that works best for
our type of data. We can apply it to all of our data to get the best
hypothesis we know how to create, which would be
\[h^* = \alg^*(\data)\;\;,\]
and deliver this resulting hypothesis as our best product.
\subsection{Learning algorithms as optimizers}
Most learning algorithms work by optimizing some objective involving
the training data and a loss
function. \note{Interestingly, this loss function is {\em not always
the same} as the loss function that is used for evaluation! We
will see this in logistic regression.}
So, for example (assuming a perfect optimizer, which doesn't, of
course, exist), we might say that our algorithm solves an optimization
problem:
\[\alg(\data) = \argmin{h \in \hclass} \mathcal{J}(h; \data)\;\;.\]
Our objective often has the form
\[\mathcal{J}(h; \data) = \mathcal{E}(h,\loss, \data) +
\mathcal{R}(h)\;\;,\]
where $\loss$ is a loss to be minimized during training and
$\mathcal{R}$ is a regularization term.
\subsection{Hyperparameters}
Often, rather than comparing an arbitrary collection of learning algorithms,
we think of our learning algorithm as having some parameters that
affect the way it maps data to a hypothesis. These are not
parameters of the hypothesis itself, but rather parameters of the
{\em algorithm}. We call these {\em hyperparameters}. A classic
example would be to use a hyperparameter $\lambda$ to govern the weight of a
regularization term on an objective to be optimized:
\[\mathcal{J}(h; \data) = \mathcal{E}(h,\loss, \data) +
\lambda \mathcal{R}(h)\;\;.\]
Then we could think of our algorithm as $\alg(\data;
\lambda)$.
Picking a good value of $\lambda$ is the same as comparing different
supervised learning algorithms, which is accomplished by validating
them and picking the best one!
\section{Concrete case: linear regression}
In linear regression the problem formulation is this:
\begin{itemize}
\item $\mathcal{X} = \R^d$
\item $\mathcal{Y} = \R$
\item $\hclass = \{\theta^T x + \theta_0\}$ for values of
parameters $\theta \in \R^d$ and $\theta_0 \in \R$.
\item $\loss(g, y) = (g - y)^2$
\end{itemize}
Our learning algorithm has hyperparameter $\lambda$ and can be written
as:
\[\alg(\data; \lambda) = \Theta^*(\lambda, \data) = \argmin{\theta, \theta_0}
\frac{1}{|\data|} \sum_{(x, y) \in \data} (\theta^Tx +
\theta_0 - y)^2 + \lambda\|\theta\|^2 \;\;.\]
For a particular training data set and parameter $\lambda$, it finds
the best hypothesis on this data, specified with parameters $\Theta =
(\theta, \theta_0)$, written $\Theta^*(\lambda, \data)$.
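Because this objective is quadratic in the parameters, it has a closed-form
minimizer. Here is a numpy sketch of $\alg(\data; \lambda)$ along those lines
(the function name \texttt{train\_ridge} and the array-based interface are our
own; note that the offset $\theta_0$ is not penalized, matching the objective
above):
\begin{verbatim}
import numpy as np

# Closed-form minimizer of mean squared error + lambda * ||theta||^2.
def train_ridge(X, y, lam):
    # X: (n, d) array of inputs, y: (n,) array of targets
    n, d = X.shape
    X1 = np.hstack([X, np.ones((n, 1))])  # extra column for theta_0
    reg = lam * n * np.eye(d + 1)         # n accounts for the 1/n in the mse
    reg[d, d] = 0.0                       # do not penalize the offset
    Theta = np.linalg.solve(X1.T @ X1 + reg, X1.T @ y)
    return Theta[:d], Theta[d]            # (theta, theta_0)
\end{verbatim}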
Picking the best value of the hyperparameter is choosing among
learning algorithms. We could, most simply,
optimize using a single training / validation split, so $\data =
\data^\text{train} \cup \data^\text{val}$, and
\begin{align*}
\lambda^* & = \argmin{\lambda} \mathcal{V}(\alg_\lambda, \loss,
\data^\text{val}) \\
& = \argmin{\lambda} \mathcal{E}(\Theta^*(\lambda, \data^\text{train}), \text{mse},
\data^\text{val}) \\
& = \argmin{\lambda} \frac{1}{|\data^\text{val}|} \sum_{(x, y) \in
  \data^\text{val}} (\theta^*(\lambda, \data^\text{train})^T x +
  \theta^*_0(\lambda, \data^\text{train}) - y)^2\;\;.
\end{align*}
It would be much better to select the best $\lambda$ using multiple
runs or cross-validation; that would just be a different choice of
the validation procedure $\mathcal{V}$ in the top line.
Note that we don't include the regularization term when measuring
validation error, because we just want to measure how well the
algorithm's output predicts the values of {\em new} points, and so
that is exactly what we measure. We use the regularizer during
training because we don't want to focus only on optimizing predictions
on the training data.
Finally! To make a predictor to ship out into the world, we would use
all the data we have, $\data$, to train, using the best
hyperparameters we know, and return
\begin{align*}
\Theta^* & = \alg(\data; \lambda^*) \\
& = \Theta^*(\lambda^*, \data) \\
& = \argmin{\theta, \theta_0}
\frac{1}{|\data|} \sum_{(x, y) \in \data} (\theta^Tx +
\theta_0 - y)^2 + \lambda^*\|\theta\|^2
\end{align*}
Finally, a customer might evaluate this hypothesis on their data,
which we have never seen during training or validation, as
\begin{align*}
\mathcal{E}^\text{test} & = \mathcal{E}(\Theta^*, \text{mse},
\data^\text{test}) \\
& = \frac{1}{|\data^\text{test}|} \sum_{(x, y) \in \data^\text{test}} ({\theta^*}^Tx +
\theta_0^* - y)^2
\end{align*}
Here are the same ideas, written out in informal pseudocode.
\begin{verbatim}
# returns theta_best(D, lambda)
define train(D, lambda):
    return minimize(mse(theta, D) + lambda * norm(theta)**2, theta)

# returns lambda_best using very simple validation
define simple_tune(D_train, D_val, possible_lambda_vals):
    scores = [mse(train(D_train, lambda), D_val)
              for lambda in possible_lambda_vals]
    return possible_lambda_vals[least_index(scores)]

# returns theta_best overall
define theta_best(D_train, D_val, possible_lambda_vals):
    return train(D_train + D_val,
                 simple_tune(D_train, D_val, possible_lambda_vals))

# customer evaluation of the theta delivered to them
define customer_val(theta):
    return mse(theta, D_test)
\end{verbatim}
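For comparison, here is a runnable numpy counterpart to the pseudocode,
reusing \texttt{train\_ridge} from the earlier sketch; here a data set
\texttt{D} is a pair \texttt{(X, y)} of arrays, and the helper names are our
own:
\begin{verbatim}
import numpy as np

def mse(theta, theta_0, D):
    X, y = D
    return np.mean((X @ theta + theta_0 - y) ** 2)

def simple_tune(D_train, D_val, possible_lambda_vals):
    # validation error of the hypothesis trained with each candidate lambda
    scores = [mse(*train_ridge(*D_train, lam), D_val)
              for lam in possible_lambda_vals]
    return possible_lambda_vals[int(np.argmin(scores))]

def theta_best(D_train, D_val, possible_lambda_vals):
    lam = simple_tune(D_train, D_val, possible_lambda_vals)
    X = np.vstack([D_train[0], D_val[0]])      # retrain on all the data
    y = np.concatenate([D_train[1], D_val[1]])
    return train_ridge(X, y, lam)
\end{verbatim}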
\section{Concrete case: logistic regression}
In binary logistic regression the problem formulation is as follows.
We are writing the class labels as 1 and 0.
\begin{itemize}
\item $\mathcal{X} = \R^d$
\item $\mathcal{Y} = \{1, 0\}$
\item $\hclass = \{\sigma(\theta^T x + \theta_0)\}$ for values of
parameters $\theta \in \R^d$ and $\theta_0 \in \R$.
\item $\loss(g, y) = \loss_{01}(g, y)$
\item Proxy loss $\loss_\text{nll}(g, y) = -\left(y \log (g) + (1 - y)\log (1 -g)\right)$
\end{itemize}
Our learning algorithm has hyperparameter $\lambda$ and can be written
as:
\[\alg(\data; \lambda) = \Theta^*(\lambda, \data) = \argmin{\theta, \theta_0}
\frac{1}{|\data|} \sum_{(x, y) \in \data} \loss_\text{nll}(\sigma(\theta^Tx +
\theta_0), y) + \lambda\|\theta\|^2 \;\;.\]
For a particular training data set and parameter $\lambda$, it finds
the best hypothesis on this data, specified with parameters $\Theta =
(\theta, \theta_0)$, written $\Theta^*(\lambda, \data)$ according to
the proxy loss $\loss_\text{nll}$.
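There is no closed form here, so the optimization is typically done
iteratively. A minimal gradient-descent sketch of $\alg(\data; \lambda)$,
with the step size and iteration count as arbitrary choices of our own:
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient descent on mean NLL loss + lambda * ||theta||^2.
def train_logistic(X, y, lam, step=0.1, iters=5000):
    # X: (n, d) inputs; y: (n,) labels in {0, 1}
    n, d = X.shape
    theta, theta_0 = np.zeros(d), 0.0
    for _ in range(iters):
        g = sigmoid(X @ theta + theta_0)
        theta -= step * (X.T @ (g - y) / n + 2 * lam * theta)
        theta_0 -= step * np.mean(g - y)
    return theta, theta_0
\end{verbatim}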
Picking the best value of the hyperparameter is choosing among
learning algorithms based on their actual predictions. We could, most simply,
optimize using a single training / validation split, so $\data =
\data^\text{train} \cup \data^\text{val}$, and {\em we use the real
01 loss}:
\begin{align*}
\lambda^* & = \argmin{\lambda} \mathcal{V}(\alg_\lambda, \loss_{01},
\data^\text{val}) \\
& = \argmin{\lambda} \mathcal{E}(\Theta^*(\lambda, \data^\text{train}), \loss_{01},
\data^\text{val}) \\
& = \argmin{\lambda} \frac{1}{|\data^\text{val}|} \sum_{(x, y) \in
  \data^\text{val}} \loss_{01}(\sigma(\theta^*(\lambda, \data^\text{train})^T x +
  \theta^*_0(\lambda, \data^\text{train})), y)\;\;.
\end{align*}
It would be much better to select the best $\lambda$ using multiple
runs or cross-validation; that would just be a different choice of
the validation procedure $\mathcal{V}$ in the top line.
Finally! To make a predictor to ship out into the world, we would use
all the data we have, $\data$, to train, using the best
hyperparameters we know, and return
\begin{align*}
\Theta^* & = \alg(\data; \lambda^*)
\end{align*}
\question{What loss function is being optimized inside this algorithm?}
Finally, a customer might evaluate this hypothesis on their data,
which we have never seen during training or validation, as
\begin{align*}
\mathcal{E}^\text{test} & = \mathcal{E}(\Theta^*, \loss_{01},
\data^\text{test})
\end{align*}
The customer just wants to buy the right stocks! So we use the real
$\loss_{01}$ here for validation.
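A sketch of that evaluation; turning the hypothesis's output
$\sigma(\theta^Tx + \theta_0)$ into a hard label by thresholding at $0.5$ is
our assumption about how predictions are made:
\begin{verbatim}
import numpy as np

# 0-1 error of a trained logistic hypothesis on a held-out data set (X, y).
def error_01(theta, theta_0, D):
    X, y = D
    g = 1.0 / (1.0 + np.exp(-(X @ theta + theta_0)))  # sigma(theta^T x + theta_0)
    predictions = (g > 0.5).astype(float)             # threshold at 0.5 (our choice)
    return float(np.mean(predictions != y))
\end{verbatim}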
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "top"
%%% End: