diff --git a/Lectures/lecture03.tex b/Lectures/lecture03.tex
new file mode 100644
index 0000000..1df37ec
--- /dev/null
+++ b/Lectures/lecture03.tex
@@ -0,0 +1,106 @@
+\section{Lecture 3: Dropout}
+
+\subsection{Lecture}
+
+How does standard regularization work in the Bayesian world?
+
+\begin{gather*}
+    \mathcal{L}(w, X, T) \to \min_w \\
+    \mathcal{L}(w, X, T) + \lambda R(w) \to \min_w
+\end{gather*}
+
+In the Bayesian interpretation this is MAP estimation:
+\begin{align*}
+    - \sum_{i=1}^n \log p(t_i | x_i, w) - \log p(w) \to \min_w
+    &\Longleftrightarrow p(T | X, w) p(w) \to \max_w \\
+    &\Longleftrightarrow p(w | X, T) \to \max_w
+\end{align*}
+
+Before dropout, a stochastic training step used the gradient estimate (for a randomly chosen object $k$):
+\[
+    n \frac{\partial}{\partial w} \log p(t_k | x_k, w)
+\]
+
+Then dropout was introduced:
+\[
+    n \frac{\partial}{\partial w} \log p(t_k | x_k, w \odot \eta), \qquad \eta \sim \operatorname{Ber}(\eta | 1 - p)
+\]
+
+Dropout started out as a heuristic, and at first no one understood why it worked.
+
+Let's consider a simple model:
+\[
+    y_i = \sum_j (w_{ij} \eta_{ij}) z_j
+\]
+where $z_j$ is the input, $w_{ij}$ is the weight, and $\eta_{ij}$ is the Bernoulli mask.
+
+Statistics of $y$:
+\[
+    \EE y_i = \sum_j \overbrace{w_{ij} (1 - p)}^{\mu_{ij}} z_j = \sum_j \mu_{ij} z_j
+\]
+
+And the variance:
+\[
+    \DD y_i = \sum_j p (1 - p) (w_{ij} z_j)^2 = \sum_j \left( \frac{p}{1 - p}\right) \mu_{ij}^2 z_j^2
+\]
+
+By the central limit theorem the resulting distribution of $y_i$ is approximately Gaussian:
+\[
+    y_i \leadsto \mathcal{N} \left( y_i \,\Big|\, \sum_j \mu_{ij} z_j, \; \sum_j \left( \frac{p}{1 - p}\right) \mu_{ij}^2 z_j^2 \right)
+\]
+
+So far the noise was Bernoulli. Let's consider what happens if we take Gaussian noise instead:
+\[
+    y_i = \sum_j \mu_{ij} \delta_{ij} z_j, \qquad \delta_{ij} \sim \mathcal{N} (\delta_{ij} | 1, \alpha)
+\]
+
+The statistics would be:
+\begin{align*}
+    \EE y_i &= \sum_j \mu_{ij} z_j \\
+    \DD y_i &= \sum_j \alpha \mu_{ij}^2 z_j^2
+\end{align*}
+
+Thus, if we take $\alpha = \frac{p}{1 - p}$, we get the same statistics. From now on we will consider the Gaussian noise, simply because it is easier to work with.
+
+How do we find the right value of $\alpha$ (equivalently, of the dropout rate $p$)?
+\begin{align*}
+    \LL (\mu) &= \int q(\delta | \alpha) \sum_{i = 1}^n \log p(t_i | x_i, \mu \odot \delta) \, d \delta \\
+    &= \{ w = \mu \odot \delta \} \\
+    &= \int q(w | \mu, \alpha) \log p(T | X, w) \, d w \to \max_{\mu} \\
+    &= \bigstar
+\end{align*}
+Now think of the following: what if we make our model Bayesian? If we can find a prior $p(w)$ such that $KL(q(w | \mu, \alpha) || p(w)) = f(\alpha)$ depends only on $\alpha$, then:
+
+\begin{align*}
+    \bigstar &\sim \int q(w | \mu, \alpha) \log p(T | X, w) \, d w - KL(q(w | \mu, \alpha) || p(w)) \to \max_{\mu, \alpha}
+\end{align*}
+For a fixed $\alpha$ the extra term is a constant, so the optimum over $\mu$ does not change, but the same objective can now also be optimized over $\alpha$.
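+
+As a quick Monte-Carlo sanity check of the moment matching above, here is a minimal NumPy sketch (an illustration only; the layer size, sample count and seed are arbitrary choices, not from the lecture):
+
+\begin{verbatim}
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+# One linear unit with d inputs (sizes are arbitrary, for illustration only).
+d, n_samples, p = 200, 50_000, 0.5
+w = rng.normal(size=d)      # weights w_j of a single output unit
+z = rng.normal(size=d)      # a fixed input z_j
+mu = w * (1 - p)            # mu_j = w_j * (1 - p)
+alpha = p / (1 - p)
+
+# Bernoulli dropout: eta ~ Ber(1 - p), y = sum_j w_j eta_j z_j
+eta = rng.binomial(1, 1 - p, size=(n_samples, d))
+y_bern = (w * eta * z).sum(axis=1)
+
+# Gaussian dropout: delta ~ N(1, alpha), y = sum_j mu_j delta_j z_j
+delta = rng.normal(1.0, np.sqrt(alpha), size=(n_samples, d))
+y_gauss = (mu * delta * z).sum(axis=1)
+
+# Both should match the analytic mean and variance up to Monte-Carlo error.
+print(y_bern.mean(), y_gauss.mean(), (mu * z).sum())
+print(y_bern.var(), y_gauss.var(), (alpha * mu**2 * z**2).sum())
+\end{verbatim}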
+
+Let's interpret the model as a Bayesian one, with a factorized Gaussian prior on the weights:
+\[
+    p(w | \lambda) = \prod_{l} \mathcal{N} (w_l | 0, \lambda_l^{2})
+\]
+
+The objective becomes:
+\[
+    \int q(w | \mu, \alpha) \log p(T | X, w) \, d w - \sum_l KL (q(w_l | \mu_l, \alpha) || p(w_l | \lambda_l)) \to \max_{\mu, \alpha, \lambda}
+\]
+
+The KL divergence can be computed in closed form:
+\[
+    KL(\mathcal{N} (w_l | \mu_l, \alpha \mu_l^2) || \mathcal{N} (w_l | 0, \lambda_l^2)) = \log \frac{\lambda_l}{\sqrt{\alpha} |\mu_l|} + \frac{\mu_l^2 + \alpha \mu_l^2}{2 \lambda_l^2} - \frac{1}{2}
+\]
+
+We can minimize this with respect to $\lambda_l$ by setting the derivative to zero:
+\begin{align*}
+    \frac{1}{\lambda_l} - \frac{\mu_l^2 + \alpha \mu_l^2}{\lambda_l^3} &= 0 \\
+    (\lambda_l^*)^2 &= \mu_l^2 + \alpha \mu_l^2
+\end{align*}
+
+If we substitute this optimal value into the KL divergence:
+\[
+    \frac{1}{2} \log \frac{\mu_l^2 + \alpha \mu_l^2}{\alpha \mu_l^2} + \frac{\mu_l^2 + \alpha \mu_l^2}{2 (\mu_l^2 + \alpha \mu_l^2)} - \frac{1}{2} = \frac{1}{2} \log \frac{\alpha + 1}{\alpha}
+\]
+
+The KL term depends only on $\alpha$ and not on $\mu$, so the optimization over $\mu$ is exactly what we had in standard (Gaussian) dropout, while the very same objective now also determines $\alpha$. By interpreting the model as a Bayesian one, we obtained an algorithm that automatically tunes the dropout rate.
+
+Later on we decided to take a separate $\alpha_l$ for each weight instead of a single shared $\alpha$, and we got the same results.
+
diff --git a/neurobayes.tex b/neurobayes.tex
index 3254b6f..1c358b2 100644
--- a/neurobayes.tex
+++ b/neurobayes.tex
@@ -1,3 +1,5 @@
+\documentclass[a4paper]{article}
+
 \input{preambule.tex}
 
 \usepackage{fancyhdr}
@@ -26,6 +28,6 @@
 
 \section{Lecture 2}
 \input{Lectures/lecture02.tex}
-
+\input{Lectures/lecture03.tex}
 
 \end{document}
\ No newline at end of file
diff --git a/preambule.tex b/preambule.tex
index 8a0b784..7251e26 100644
--- a/preambule.tex
+++ b/preambule.tex
@@ -1,5 +1,3 @@
-\documentclass[a4paper]{article}
-
 %plastikovye pakety
 \usepackage[12pt]{extsizes}
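
As a companion illustration to the new Lecture 3 notes (not part of the patch above), the sketch below shows how the alpha-only KL term can tune the dropout rate automatically on a toy 1-D regression problem. The model, data sizes and grid search are illustrative assumptions; for this Gaussian toy model the expected log-likelihood under q(w) = N(mu, alpha*mu^2) is available in closed form, whereas in a real network it would be estimated by sampling the noise as in the lecture.

import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D regression t = w*x + noise with known noise std sigma.
# All names, sizes and grids below are illustrative assumptions.
n, w_true, sigma = 100, 2.0, 0.5
x = rng.normal(size=n)
t = w_true * x + sigma * rng.normal(size=n)

def elbo(mu, alpha):
    # Expected log-likelihood under q(w) = N(mu, alpha * mu^2),
    # using E_q[(t - w*x)^2] = (t - mu*x)^2 + alpha * mu^2 * x^2.
    exp_loglik = (-0.5 * np.sum((t - mu * x) ** 2 + alpha * mu**2 * x**2) / sigma**2
                  - n * np.log(sigma * np.sqrt(2 * np.pi)))
    kl = 0.5 * np.log(1 + 1 / alpha)  # the alpha-only KL term derived in the lecture
    return exp_loglik - kl

# Crude grid search over mu and alpha; p = alpha / (1 + alpha) is the matched dropout rate.
mus = np.linspace(0.01, 4.0, 400)
alphas = np.logspace(-3, 1, 300)
scores = np.array([[elbo(m, a) for a in alphas] for m in mus])
i, j = np.unravel_index(scores.argmax(), scores.shape)
print("mu* =", mus[i], "alpha* =", alphas[j], "p* =", alphas[j] / (1 + alphas[j]))

The grid search here only stands in for the point of the lecture: the same data-fit versus KL trade-off can be optimized by gradient methods over mu and alpha jointly.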