Add lecture3
K-dizzled committed Feb 27, 2024
1 parent 6aedcbe commit 5fe898c
Showing 3 changed files with 109 additions and 3 deletions.
106 changes: 106 additions & 0 deletions Lectures/lecture03.tex
@@ -0,0 +1,106 @@
\section{Lecture 3: Dropout}

\subsection{Lecture}

How does standard regularization work in the Bayesian world?

\begin{gather*}
\mathcal{L}(w, X, T) \to \min_w \\
\mathcal{L}(w, X, T) + \lambda R(w) \to \min_w
\end{gather*}

Then:
\begin{align*}
- \sum_{i=1}^n \log p(t_i | x_i, w) - \log p(w) \to \min_w
&\Longleftrightarrow p (T | X, w) p(w) \to \max_w \\
&\Longleftrightarrow p(w | X, T) \to \max_w
\end{align*}
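
As a familiar special case (not spelled out above), $L_2$ regularization $R(w) = \|w\|^2$ corresponds to a zero-mean Gaussian prior:
\[
p(w) = \mathcal{N} (w | 0, \sigma^2 I) \quad\Longrightarrow\quad -\log p(w) = \frac{\|w\|^2}{2 \sigma^2} + \mathrm{const},
\]
so the regularization coefficient plays the role of $\lambda = \frac{1}{2 \sigma^2}$.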

The usual training (without dropout) used stochastic gradient estimates of the form, for a randomly chosen object $k$:
\[
n \frac{\partial}{\partial w} \log p(t_k | x_k, w)
\]

Then dropout was introduced:
\[
n \frac{\partial}{\partial w} \log p(t_k | x_k, w \odot \eta), \qquad \eta \sim \operatorname{Ber}(\eta | 1 - p)
\]

Dropout was introduced purely as a heuristic, and at first no one understood why it worked.
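
To make this concrete, here is a minimal sketch of dropout applied to the weights of a single linear layer. This is only an illustration, not code from the lecture; the layer sizes and the dropout probability $p$ are arbitrary choices.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                       # dropout probability
W = rng.normal(size=(3, 5))   # weights of a linear layer
z = rng.normal(size=5)        # layer input

# Training-time forward pass: each weight is kept with probability 1 - p.
eta = rng.binomial(1, 1 - p, size=W.shape)   # Bernoulli mask
y_train = (W * eta) @ z

# Test-time forward pass: use the expected weights W * (1 - p).
y_test = (W * (1 - p)) @ z
\end{verbatim}
At test time no sampling is done; the weights are simply scaled by the expected mask value $1 - p$.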

Let's consider a simple model:
\[
y_i = \sum_j (w_{ij} \eta_{ij}) z_j
\]
where $z_j$ is the input, $w_{ij}$ is the weight, and $\eta_{ij}$ is the Bernoulli mask.

Statistics of $y$. Since $\EE \eta_{ij} = 1 - p$:
\[
\EE y_i = \sum_j \overbrace{w_{ij} (1 - p)}^{\mu_{ij}} z_j = \sum_j \mu_{ij} z_j
\]

And since $\DD \eta_{ij} = p (1 - p)$:
\[
\DD y_i = \sum_j p (1 - p) (w_{ij} z_j)^2 = \sum_j \left( \frac{p}{1 - p}\right) \mu_{ij}^2 z_j^2
\]

By the central limit theorem, the resulting distribution of $y_i$ is approximately Gaussian:
\[
y_i \leadsto \mathcal{N} \Big(y_i \,\Big|\, \sum_j \mu_{ij} z_j, \; \sum_j \left( \frac{p}{1 - p}\right) \mu_{ij}^2 z_j^2\Big)
\]
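
As a quick numerical sanity check of these two formulas, one can compare Monte Carlo estimates with the closed-form expressions (a NumPy sketch; the dimension, the value of $p$, and the random weights are arbitrary choices):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
p = 0.3
w = rng.normal(size=5)        # one row of weights w_{ij}
z = rng.normal(size=5)        # inputs z_j
mu = w * (1 - p)              # mu_{ij} = w_{ij} (1 - p)

# Monte Carlo estimate of E[y] and D[y] under Bernoulli masks.
eta = rng.binomial(1, 1 - p, size=(100000, 5))
y = (eta * w) @ z
print(y.mean(), mu @ z)                            # E y  vs  sum_j mu_j z_j
print(y.var(), (p / (1 - p)) * (mu**2 @ z**2))     # D y  vs  closed form
\end{verbatim}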

So far the noise has been Bernoulli. Now consider Gaussian noise instead:
\[
y_i = \sum_j \mu_{ij} \delta_{ij} z_j, \qquad \delta_{ij} \sim \mathcal{N} (\delta_{ij} | 1, \alpha)
\]

Statistics would be:
\begin{align*}
\EE y_i &= \sum_j \mu_{ij} z_j, &
\DD y_i &= \sum_j \alpha \mu_{ij}^2 z_j^2
\end{align*}

Thus, taking $\alpha = \frac{p}{1 - p}$ gives the same statistics (for example, $p = 0.5$ corresponds to $\alpha = 1$). From now on we will work with Gaussian noise, simply because it is easier to handle.

How do we find the right value of $\alpha$ (equivalently, of $p$), i.e.\ a good dropout rate?
\begin{align*}
\LL (\mu) &= \int q(\delta | \alpha) \sum_{i = 1}^n \log p(t_i | x_i, \mu \odot \delta) d \delta \\
&= \{ w = \mu \odot \delta \} \\
&= \int q(w | \mu, \alpha) \log p(T | X, w) d w \to \max_{\mu} \\
&= \bigstar
\end{align*}
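
The change of variables works because a scaled Gaussian is again Gaussian: if $\delta_l \sim \mathcal{N} (\delta_l | 1, \alpha)$ is the noise on weight $l$, then
\[
w_l = \mu_l \delta_l \sim \mathcal{N} (w_l | \mu_l, \alpha \mu_l^2),
\]
so $q(w | \mu, \alpha)$ is a fully factorized Gaussian with means $\mu_l$ and variances $\alpha \mu_l^2$.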
Now consider the following: what if we make our model Bayesian? If we can find a prior $p(w)$ such that $KL(q(w | \mu, \alpha) || p(w)) = f(\alpha)$, i.e.\ the KL term depends only on $\alpha$, then:

\begin{align*}
\bigstar &\sim \int q(w | \mu, \alpha) \log p(T | X, w) d w - KL(q(w | \mu, \alpha) || p(w)) \to \max_{\mu, \alpha}
\end{align*}
This is exactly the variational lower bound (ELBO) for the corresponding Bayesian model.

Let's interpret the model as a Bayesian model.
\[
p(w | \lambda) = \prod_{l} \mathcal{N} (w_l | 0, \lambda_l^{2})
\]

\[
\int q(w | \mu, \alpha) \log p(T | X, w) d w - \sum_l KL (q(w_l | \mu_l, \alpha_l) || p(w_l | \lambda_l)) \to \max_{\mu, \alpha}
\]

The KL divergence can be computed in closed form:
\[
KL(\mathcal{N} (w_l | \mu_l, \alpha_l \mu_l^2) || \mathcal{N} (w_l | 0, \lambda_l^2)) = \log \frac{\lambda_l}{\sqrt{\alpha_l} |\mu_l|} + \frac{\mu_l^2 + \alpha_l \mu_l^2}{2 \lambda_l^2} - \frac{1}{2}
\]
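
For reference, this follows from the standard closed-form KL between two univariate Gaussians,
\[
KL(\mathcal{N} (\mu_1, \sigma_1^2) || \mathcal{N} (\mu_2, \sigma_2^2)) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2},
\]
with $\mu_1 = \mu_l$, $\sigma_1^2 = \alpha_l \mu_l^2$, $\mu_2 = 0$ and $\sigma_2^2 = \lambda_l^2$.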

We can minimize this with respect to $\lambda_l$ by setting the derivative to zero:
\[
\frac{1}{\lambda_l} - \frac{\mu_l^2 + \alpha_l \mu_l^2}{\lambda_l^3} = 0
\quad\Longrightarrow\quad
(\lambda_l^*)^2 = \mu_l^2 + \alpha_l \mu_l^2
\]

Substituting this optimal value into the KL divergence gives
\[
\frac{1}{2} \log \frac{\mu_l^2 + \alpha_l \mu_l^2}{\alpha_l \mu_l^2} + \frac{\mu_l^2 + \alpha_l \mu_l^2}{2 (\mu_l^2 + \alpha_l \mu_l^2)} - \frac{1}{2} = \frac{1}{2} \log \frac{\alpha_l + 1}{\alpha_l}
\]

As a result, the KL term depends only on $\alpha$, so we can now optimize the objective with respect to $\alpha$ as well, while the data term remains the same as in standard (Gaussian) dropout. By interpreting the model as a Bayesian one, we obtain an algorithm that tunes the dropout rate automatically.

Later, an individual $\alpha_l$ was taken instead of a single shared $\alpha$, giving each weight its own dropout rate; the derivation and the results stay the same.
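
As an illustration of the resulting objective, here is a minimal sketch of one stochastic estimate of the lower bound for a linear model with per-weight rates $\alpha_l$, using the reparameterization $w_l = \mu_l \delta_l$ and the KL term derived above. The data, the Gaussian likelihood, and all sizes are made up for the example; the lecture does not fix a particular model.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 5
X = rng.normal(size=(n, d))
T = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

mu = rng.normal(size=d)            # variational means mu_l
log_alpha = np.zeros(d)            # per-weight log alpha_l

def objective_estimate(mu, log_alpha):
    alpha = np.exp(log_alpha)
    # Reparameterization: w_l = mu_l * delta_l, delta_l ~ N(1, alpha_l).
    delta = 1.0 + np.sqrt(alpha) * rng.normal(size=d)
    w = mu * delta
    # Gaussian log-likelihood log p(T | X, w) with unit noise variance.
    log_lik = -0.5 * np.sum((T - X @ w) ** 2) - 0.5 * n * np.log(2 * np.pi)
    # KL term after optimizing out lambda_l: 0.5 * log((alpha_l + 1) / alpha_l).
    kl = np.sum(0.5 * np.log((alpha + 1.0) / alpha))
    return log_lik - kl

print(objective_estimate(mu, log_alpha))
\end{verbatim}
In practice one would maximize such estimates with stochastic gradients with respect to both $\mu$ and $\log \alpha$.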

4 changes: 3 additions & 1 deletion neurobayes.tex
@@ -1,3 +1,5 @@
\documentclass[a4paper]{article}

\input{preambule.tex}

\usepackage{fancyhdr}
@@ -26,6 +28,6 @@ \section{Lecture 2}

\input{Lectures/lecture02.tex}


\input{Lectures/lecture03.tex}

\end{document}
2 changes: 0 additions & 2 deletions preambule.tex
@@ -1,5 +1,3 @@
\documentclass[a4paper]{article}

% plastic bags

\usepackage[12pt]{extsizes}
