Showing 3 changed files with 109 additions and 3 deletions.
@@ -0,0 +1,106 @@
\section{Lecture 3: Dropout}

\subsection{Lecture}

How does standard regularization work in the Bayesian world? We go from the plain loss to a regularized one:

\begin{gather*}
\mathcal{L}(w, X, T) \to \min_w \\
\mathcal{L}(w, X, T) + \lambda R(w) \to \min_w
\end{gather*}

Then:
\begin{align*}
- \sum_{i=1}^n \log p(t_i | x_i, w) - \log p(w) \to \min_w
&\Longleftrightarrow p (T | X, w) p(w) \to \max_w \\
&\Longleftrightarrow p(w | X, T) \to \max_w
\end{align*}
So the regularizer plays the role of the negative log-prior, and regularized training is MAP estimation of $w$.
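
For example (a standard identity, stated here for completeness rather than taken from the lecture): with a Gaussian prior, the MAP objective reduces to the usual $L_2$ penalty,
\[
p(w) = \mathcal{N}(w | 0, \sigma^2 I) \quad \Longrightarrow \quad - \log p(w) = \frac{1}{2 \sigma^2} \| w \|^2 + \mathrm{const},
\]
so $R(w) = \| w \|^2$ with $\lambda = \frac{1}{2 \sigma^2}$.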

Standard stochastic training used gradient estimates of the form
\[
n \frac{\partial}{\partial w} \log p(t_k | x_k, w),
\]
computed on a randomly chosen object $k$.

Then dropout was introduced: the weights are multiplied by a random binary mask at every step,
\[
n \frac{\partial}{\partial w} \log p(t_k | x_k, w \odot \eta), \qquad \eta \sim \operatorname{Ber}(\eta | 1 - p).
\]
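
To spell out the step that is implicit here (assuming $k$ is drawn uniformly from $\{1, \dots, n\}$): this is an unbiased estimate of the full-data gradient,
\[
\EE_k \left[ n \frac{\partial}{\partial w} \log p(t_k | x_k, w) \right] = \frac{\partial}{\partial w} \sum_{i=1}^n \log p(t_i | x_i, w),
\]
and dropout additionally resamples the mask $\eta$ at every step.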

Dropout was just a heuristic, and no one understood why it worked.

Let's consider a simple model:
\[
y_i = \sum_j w_{ij} \eta_{ij} z_j,
\]
where $z_j$ is the input, $w_{ij}$ is the weight, and $\eta_{ij}$ is the Bernoulli mask.

Statistics of $y$:
\[
\EE y_i = \sum_j \overbrace{w_{ij} (1 - p)}^{\mu_{ij}} z_j = \sum_j \mu_{ij} z_j
\]

And:
\[
\DD y_i = \sum_j p (1 - p) (w_{ij} z_j)^2 = \sum_j \left( \frac{p}{1 - p}\right) \mu_{ij}^2 z_j^2
\]
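
For completeness, the per-term calculation behind this (using $\DD \eta_{ij} = p(1-p)$ and $\mu_{ij} = (1-p) w_{ij}$):
\[
\DD \left[ w_{ij} \eta_{ij} z_j \right] = w_{ij}^2 z_j^2 \, \DD \eta_{ij} = p (1-p) \, w_{ij}^2 z_j^2 = \frac{p}{1-p} \, \mu_{ij}^2 z_j^2 .
\]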

Since $y_i$ is a sum of many independent terms, by the central limit theorem its distribution is approximately Gaussian:
\[
y_i \leadsto \mathcal{N} \Big( y_i \,\Big|\, \sum_j \mu_{ij} z_j, \; \sum_j \left( \frac{p}{1 - p}\right) \mu_{ij}^2 z_j^2 \Big)
\]

So far the noise was Bernoulli. Let's consider what happens if we use Gaussian noise instead:
\[
y_i = \sum_j \mu_{ij} \delta_{ij} z_j, \qquad \delta_{ij} \sim \mathcal{N} (\delta_{ij} | 1, \alpha)
\]

The statistics would be:
\begin{align*}
\EE y_i &= \sum_j \mu_{ij} z_j \\
\DD y_i &= \sum_j \alpha \mu_{ij}^2 z_j^2
\end{align*}
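
The variance line follows from the same per-term calculation as before, now with $\DD \delta_{ij} = \alpha$:
\[
\DD \left[ \mu_{ij} \delta_{ij} z_j \right] = \mu_{ij}^2 z_j^2 \, \DD \delta_{ij} = \alpha \, \mu_{ij}^2 z_j^2 .
\]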

Thus we can take $\alpha = \frac{p}{1 - p}$ and get the same first two moments. From now on we will use the Gaussian noise, simply because it is easier to work with.

How do we find the right value of $\alpha$ (equivalently, of $p$), i.e.\ a good dropout rate? Consider the expected log-likelihood under the noise:
\begin{align*}
\LL (\mu) &= \int q(\delta | \alpha) \sum_{i = 1}^n \log p(t_i | x_i, \mu \odot \delta) \, d \delta \\
&= \{ w = \mu \odot \delta \} \\
&= \int q(w | \mu, \alpha) \log p(T | X, w) \, d w \to \max_{\mu} \\
&= \bigstar
\end{align*}
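
To make the change of variables explicit (spelling out a step that is only implicit above): each weight is $w_l = \mu_l \delta_l$ with $\delta_l \sim \mathcal{N}(\delta_l | 1, \alpha)$, hence
\[
q(w_l | \mu_l, \alpha) = \mathcal{N} (w_l | \mu_l, \alpha \mu_l^2),
\]
which is exactly the approximate posterior that appears in the KL term below.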
Now think of the following: what if we make our model Bayesian? If we find a prior $p(w)$ such that $KL(q(w | \mu, \alpha) || p(w)) = f(\alpha)$, i.e.\ the KL term does not depend on $\mu$, then subtracting it changes nothing for the optimization over $\mu$, and we can also maximize over $\alpha$:

\begin{align*}
\bigstar &\sim \int q(w | \mu, \alpha) \log p(T | X, w) \, d w - KL(q(w | \mu, \alpha) || p(w)) \to \max_{\mu, \alpha}
\end{align*}
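
For context (a standard fact from variational inference, not derived in the lecture): this quantity is the evidence lower bound,
\[
\log p(T | X) \geq \int q(w | \mu, \alpha) \log p(T | X, w) \, d w - KL(q(w | \mu, \alpha) || p(w)),
\]
so maximizing it over $\mu$ and $\alpha$ is approximate Bayesian inference over the weights.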

Let's interpret the model as a Bayesian model, with the prior
\[
p(w | \lambda) = \prod_{l} \mathcal{N} (w_l | 0, \lambda_l^{2})
\]

Then the objective becomes
\[
\int q(w | \mu, \alpha) \log p(T | X, w) \, d w - \sum_l KL (q(w_l | \mu_l, \alpha_l) || p(w_l | \lambda_l)) \to \max_{\mu, \alpha}
\]

The KL divergence can be computed in closed form:
\[
KL(\mathcal{N} (w_l | \mu_l, \alpha_l \mu_l^2) || \mathcal{N} (w_l | 0, \lambda_l^2)) = \log \frac{\lambda_l}{\sqrt{\alpha_l} \, |\mu_l|} + \frac{\mu_l^2 + \alpha_l \mu_l^2}{2 \lambda_l^2} - \frac{1}{2}
\]
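
This is just the standard KL between two univariate Gaussians (quoted here as a check),
\[
KL(\mathcal{N}(m_1, s_1^2) || \mathcal{N}(m_2, s_2^2)) = \log \frac{s_2}{s_1} + \frac{s_1^2 + (m_1 - m_2)^2}{2 s_2^2} - \frac{1}{2},
\]
with $m_1 = \mu_l$, $s_1^2 = \alpha_l \mu_l^2$, $m_2 = 0$, $s_2 = \lambda_l$.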

We can minimize this with respect to $\lambda_l$ by setting the derivative to zero:
\begin{align*}
&\frac{1}{\lambda_l} - \frac{\mu_l^2 + \alpha_l \mu_l^2}{\lambda_l^3} = 0 \\
&(\lambda_l^*)^2 = \mu_l^2 + \alpha_l \mu_l^2
\end{align*}

If we substitute this optimal value into the KL divergence, it depends only on $\alpha_l$:
\[
\frac{1}{2} \log \frac{\mu_l^2 + \alpha_l \mu_l^2}{\alpha_l \mu_l^2} + \frac{\mu_l^2 + \alpha_l \mu_l^2}{2 (\mu_l^2 + \alpha_l \mu_l^2)} - \frac{1}{2} = \frac{1}{2} \log \frac{\alpha_l + 1}{\alpha_l}
\]
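
Putting the pieces together (restating the bound above with the optimal $\lambda_l$ plugged in), the objective being maximized is
\[
\int q(w | \mu, \alpha) \log p(T | X, w) \, d w - \sum_l \frac{1}{2} \log \frac{\alpha_l + 1}{\alpha_l} \to \max_{\mu, \alpha}
\]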

So the function can now be optimized with respect to $\alpha$ as well, while for a fixed $\alpha$ we recover the same training procedure as standard dropout. By interpreting the model as a Bayesian one, we obtain an algorithm that tunes the dropout rate automatically.

Later, the shared $\alpha$ was replaced by a separate $\alpha_l$ for each weight, so every weight gets its own dropout rate, and the same derivation goes through.
@@ -1,5 +1,3 @@
\documentclass[a4paper]{article}

% plastic bags

\usepackage[12pt]{extsizes}