# Summary of “Neural Networks for Machine Learning by Geoffrey Hinton”

• Activation function
• Logistic neurons
• $y=\frac{1}{1+e^{-z}}$
• $\frac{dy}{dz}=y(1-y)$
• The softmax function
• General tricks
• Dropout (See more at https://arxiv.org/abs/1207.0580)
• A useful way to regularize the net.
• Details: randomly remove some hidden units in a layer for each training example
• Advantage: It stops hidden units from relying too much on other hidden units
• Tricks to speed up the learning procedure (See more at http://ruder.io/optimizing-gradient-descent/)
1. Use momentum
• $v(t) = \alpha v(t-1)-\varepsilon \frac{\partial E}{\partial w}(t)$
• \begin{aligned}\Delta w(t) & = v(t)\\ & = \alpha v(t – 1) – \varepsilon \frac{\partial E}{\partial w}(t)\\ & = \alpha \Delta w(t – 1) – \varepsilon \frac{\partial E}{\partial w}(t) \end{aligned}
2. Use separate adaptive learning rates for each parameter
• $\Delta w_{i,j} = -\varepsilon g_{i,j} \frac{\partial E}{\partial w_{i,j}}$
• $g_{i, j} = \begin{cases}g_{i,j}(t – 1) + \delta & \frac{\partial E}{\partial w_{i,j}}(t)\frac{\partial E}{\partial w_{i,j}}(t – 1) > 0 \\ g_{i,j}(t-1) \times \delta & otherwise\end{cases}$
• $\delta$ is a variable setted by hand ($0.05$ is usually a good choice)
• This method ensures that local learning rates increase when the gradients don’t change sign and decay rapidly when oscillations start
3. Use rmsprop
• $F(w, t) = \delta F(w, t – 1) + (1 – \delta)(\frac{\partial E}{\partial w})^2$
• $\Delta w = -\frac{\varepsilon}{\sqrt{F(w, t)}} \cdot \frac{\partial E}{\partial w}$
• $\delta$ is a variable setted by hand ($0.9$ is usually a good choice)
• Preventing overfitting
1. Get more data
2. Average many different models
3. Use Bayesian Approach
• Bayes Theorem: $P(A|B)=\frac{P(A)P(B|A)}{P(B)}$
• Deriving Bayes’ Theorem: $P(A|B)=\frac{P(AB)}{P(B)}=\frac{P(A)P(B|A)}{P(B)}$
• Understanding
• Let $P(A)$ be the prior probability, assuming that we already know the prior distribution.
• Then we get a new data $B$, so we can compute $P(B|A)$ as we already have prior distribution.
• After that, we use $P(B)$ to renormalize as $P(B) = \int\limits_A{P(B|A)P(A)}$.
• Finally, our posterior distribution $P(A|B)$ is $P(A|B)=\frac{P(A)P(B|A)}{P(B)}$.
• The Bayesian interpretation of weight decay
• Let $W$ be the set of weights in a network. We add Gaussian noise $\epsilon \sim N(0, \sigma_B^2)$ to the the output of the network, let $P(A|W)$ donate the probability we get the correct anwser $A$ after we add the noise to the output $B$ with weights $W$.
• Considering the Maximum A Posteriori(MAP), we are trying to maximize $P(W|A) = \frac{P(W)P(A|W)}{P(A)}$, which is the same as minimize $-\log P(W|A) = -\log P(W) – \log P(A|W) + \log P(A)$. As $\log P(A)$ has nothing to do with $W$, we can just ignore that term. So we can define our cost as $C = -\log P(W) – \log P(A|W)$, and we are trying to minimize the cost $C$.
• Let us regularise weights $W$ by imposing the Gaussian distribution $N(0,\sigma_w^2)$, then $C = \frac{1}{2\sigma_w^2}\sum\limits_{i}{w_i^2}+\frac{1}{2\sigma_B^2}\sum\limits_c{(y_c-\hat{y})^2} + k$ where $k$ is the sum of some constant terms.
• If we take that equation and multiply through by $2\sigma_{B}^2$ and ignore the constant $k$. Then we got a new cost function $C’ = \sum\limits_c{(y_c-\hat{y})^2} + \frac{2\sigma_B^2}{2\sigma_w^2}\sum\limits_{i}{w_i^2}$.
• As we can see here, the weight decay parameter in this case should not be an arbitrary hack.
• See more at $L^2$ regularization is equivalent to Gaussian Prior and Bayesian Methods for Neural Networks
• Empirical Bayes
• As we mentioned above, if we want the $L^2$ weight decay to work excellent, we need to know the variance of the weight prior $\sigma_w$ and the noise variance $\sigma_B$
• Steps
• Start with guesses for both the noise variance and the weight prior variance.
• While not yet bored
• Do some learning using the ratio of the variances as the weight penalty coefficient.
• Reset the noise variance to be the variance of the residual errors.
• Reset the weight prior variance to be the variance of the distribution of the actual learned weights.
• Go back to the start of this loop.
• We don’t need a validation set to choose the weight penalty coefficient.
• Full Bayesian Learning
• Instead of trying to find the best single setting of parameters we compute the full posterior distribution over all possible parameter settings.
• This approach let overfitting disappear, cuz there’s no reason why the amount of data should influence our prior beliefs about the complexity. So this approach allows us to use complicated models even when we do not have much data.
• This approach does not involve any gradient descent so there are no local optimum issues.
• Practical Full Bayesian Learning
• Use Markov Chain Monte Carlo to approximate the posterior probability
• We can use the sampling noise from mini-batch as the noise which the MCMC method needs.
4. Use the model has the right capacity
1. Limit the number of hidden layers and the number of units per layer
2. Use early stopping
3. Use weight-decay
4. Introduce noise
• $y = \sum\limits_i{x_iw_i}$
• Consider $L^2$ weight-decay and Gaussian noise
• Let $x_i^{noisy}$ equals to $x_i+\epsilon_i$ where $\epsilon_i$ is sampled from $N(0,\sigma^{2}_{i})$
• Then $y^{noisy} = y + \sum\limits_{i}{w_i\epsilon_i}$

\begin{aligned}E\bigl((y^{noisy}-\hat{y})^2\bigr) & = E\Bigl(\bigl((y – \hat{y}) + \sum\limits_{i}{w_i\epsilon_i}\bigr)^2\Bigr)\\ & = (y – \hat{y})^2 + E\bigl(2(y-\hat{y})\sum\limits_{i}{w_i\epsilon_i}\bigr) + E\bigl((\sum\limits_{i}{w_i\epsilon_i})^2\bigr)\\ & = (y – \hat{y})^2 + E(\sum\limits_{i}{w_i^2\epsilon_i^2})\\ & = (y – \hat{y})^2 + \sum\limits_{i}{w_i^2\sigma_i^2}\end{aligned}

• So $\sigma_i^2$ is equivalent to an $L^2$ penalty in this simple network.
• If the network is more complicated, Gaussian noise usually works better than $L^2$ weight penalty.
• Combine models (cooperation)
• Mathematics analysis
• Combine models
• $\bar{y} = \frac{1}{n} \sum\limits_{i = 1}^{n}{y_i}$
• If we use $L^2$ penalty, then the cost is $(\bar{y}-\hat{y})^2$, so is its expectation.
• Pick a predictor at random
• If we use $L^2$ penalty, then the expectation of the cost should be

\begin{aligned}E\big((\hat{y}-y_i)^2\big) & = E\Big(\big((\hat{y} – \bar{y}) – (y_i – \bar{y})\big)^2\Big)\\ & = E\big((\hat{y} – \bar{y})^2 + (y_i – \bar{y})^2 – 2(\hat{y} – \bar{y})(y_i – \bar{y})\big)\\ & = (\bar{y} – \hat{y})^2 + E\big((y_i-\bar{y})^2\big) – 2(\hat{y} – \bar{y})E(y_i-\bar{y})\\ & = (\bar{y} – \hat{y})^2 + E\big((y_i-\bar{y})^2\big) – \frac{2}{n}(\hat{y} – \bar{y})\sum\limits_{i = 1}^{n}{y_i – \bar{y}}\\ & = (\bar{y} – \hat{y})^2 + E\big((y_i-\bar{y})^2\big)\end{aligned}

• Comparison

• Apparently the expectation cost of picking a random predictor is no less than combining models’ as $E\big((y_i-\bar{y})^2\big) \ge 0$.
• The more the results of those models differ, the better is combining models than picking a random model.
• Ways to make predictors differ
• Use lots of different kinds of models
• Bagging: Train different models on different subsets of the data.
• Boosting: Train a sequence of low capacity models. Weight the training cases differently for each model in the sequence.
• Boosting up-weights cases that previous models got wrong.
• It focused the computational resources on modelling the tricky cases.
• Mixtures of Experts (specialization)
• Cluster the training cases into subsets based on the relationship between input and output so that each cluster can be well modelled by an expert.
• Averaging models like what we mentioned above causes cooperation, not specialization.
• Cost function that encourages cooperation: $C = (\hat{y}-\bar{y})^2$
• Cost function that encourages specialization: $C = \sum\limits_{i}{p_i(\hat{y}-y_i)^2}$ where $p_i$ is the probability of picking each expert assigned by the manager where manager is a seperately trained model.
• Principal Compopnents Analysis
• This projects n-dimensional data into an m-dimensional subspace in which the data have the most variance.
• We reconstruct by using the mean value on the $n-m$ directions that are not represented. The reconstruction error is the sum over all these unrepresented directions of the squared differences of the datapoint from the mean.
• Using back-propagation to implement PCA inefficiently
• Try to make the output be the same as the input in a network with a central bottleneck. The activities of the hidden units in the bottleneck form an efficient code.
• If the hidden and output layers are linear, it will learn hidden units that are a linear function of the data and minimize the squared reconstructing error which is exactly what PCA dose.
• Neural networks
• Convolutional Neural Networks
• Weight sharing
• If a feature is useful in some locations during training, detectors for that feature will be available in all location during testing
• Pooling
• Get a small amount of translational invariance at each level by averaging some neighbouring replicated detectors to give a single output to the next level.
• Advantage: It reduces the inputs to the next layer, allowing us to have many more different feature maps.
• Recurrent neural networks
• Comparing with other Hidden Sate Models
1. Linear Dynamical Systems
• Let $x$ donate the state vector and $A$ donate the transition matirx
• The variation can take two forms:
1. Flow: $x$ varies coontinuously with time
• $\frac{d x(t)}{d t} = A \cdot x(t)$
2. Mapping: $x$ varies in discrete steps
• $x_{m+1} = A \cdot x_m$
• The distribution over the hidden state given the data so far can be computed effeciently
• The hidden states follow Gaussian distribution
2. Hidden Markov Models
• There are a number of things called states. And the system is always in exactly one of those states.
• Transitions between states are stochastic and controlled by a transition matrix
• HMMs have efficient algorithm for inference and learning
• The hidden states follow discrete distribution
• A HMM with $n$ state can only store $\log (n)$ bits of information because a HMM has to stay at one exactly state of its $n$ states
3. RNNs
• RNNs can potentially learn to implement lots of small programs that each capture a nugget of knowledge and run in parallel, interacting to produce very complicated effects
• RNNs are much harder to train
• Training the RNNs with backpropagation
• Expand RNN in time sequence
• Constraint the weights expanded from the same original weights
• Train as a normal network
• Reasons why RNNs are hard to train:
• In the forward pass, We use squashing functions (like the logistic) to prevent the activity vectors from exploding
• In the backward pass, the pass is completely linear, so it’s very likely to vanish or explode.
• Four effective ways to learn an RNN
1. Long Short Term Memory
2. Hessian Free Optimization
3. Echo State Networks
• Fix the input-to-hidden connections and the hidden-to-hidden connections at random values and only learn the hidden-to-ouput connections.
• It’s very important to set these connections very carefully.
• Key ideal: a big random expansion of the input vector can help.
• ESNs demonstrate that it’s very important to initialize weights sensibly.
4. Good initialization with momentum
• RNNs with multiplicative connections
• Reasons why we need multiplicative connections
• If we consider an RNN as an automaton, then the hidden state of time $t$, should depend on the conjunction of input of time $t$ and hidden state of time $t – 1$.
• Instead of using the inputs to the RNN to provide additive extra input to the hidden units, we can use the current input to choose the whole hidden-to-hidden weight matrix. But this approach requires so many parameters.
• Details
• Let’s consider this model

• Then, multiplicative connections can be written as $c_f = (b^{\mathsf{T}}w_f)(a^{\mathsf{T}}u_f)v_f$ or $c_f=(b^{\mathsf{T}}w_f)(u_fv_f^{\mathsf{T}})a$ where $(b^{\mathsf{T}}w_f)$ can be considered as scalar coefficients and $(u_fv_f^{\mathsf{T}})$ can be considered to be the transition matrix with rank $1$.
• In fact, if here’re $n$ hidden units, here’re $n$ factors, and the description above shows one factor. And all this factors together actually let the input to determine the transition matrix.
• Hopfield Nets
• Energy function: $E = -\sum\limits_i{s_ib_i}-\sum\limits_{i<j}{s_is_jw_{ij}}$.
• Hopfield proposed that memories could be energy minima of a neural net.
• Storing a memory using Hopfield’s storage rule
• If the units are binary threshold units: $\Delta w = 4(s_i-\frac{1}{2})(s_j-\frac{1}{2})$.
• If the units are of activities of $1$ and $-1$: $\Delta w_{ij}=s_is_j$
• Dealing with spurious minima in Hopfield Nets
• Let the net settle from a random initial state and then do unlearning.
• Crick and Mitchison proposed unlearning as a model of what dreams are for.
• Storing a memory using pseudo-likelihood method
• Use the perceptron convergence procedure to train each unit to have the correct state given the states of all the other units in that vector.
• Boltzmann Machines
• Can be regarded as Hopfield nets with hidden units.
• Instead of using the net to store memories, use it to construct interpretations of sensory input.
• The input is represented by the visible units.
• The interpretation is represented by the states of the hidden units.
• The badness of the interpretation is represented by the energy.
• Use Simulated Annealing to make Boltzmann Machines escape from local minima.
• $P(s_i=1)=\frac{1}{1+e^{-\frac{\Delta E_i}{T}}}$ where $T$ is the temperature.
• Rasing the noise level is equivalent to decreasing all the energy gaps between configurations
• Thermal equilibrium
• The thing that settles down is the probability distribution over configurations
• Using energies to define probabilities
• Let $v$ donates the configuration on the visible units and $h$ on the hidden units.
• $P(v, h) = \frac{e^{-E(v, h)}}{\sum\limits_{g}{\sum\limits_{u}{e^{-E(u, g)}}}}$
• $P(v) = \frac{\sum\limits_h{e^{-E(v, h)}}}{\sum\limits_{g}{\sum\limits_{u}{e^{-E(u, g)}}}}$
• We can use MCMC to sample the probabilities.
• The boltzmann Machine learning algorithm
• The goal of learning: We are trying to maximize the product of the probabilities that the Boltzmann machine assigns to the binary vectors in the training set. This is equivalent to maximize the sum of the log probabilities that Boltzmann machine assigns to the training vectors.
• Everything that one weight needs to know about the other weights and the data is contained in the difference of two correlations.

$$\frac{\partial \log P(v)}{\partial w_{ij}}=E_v(s_is_j)-E_{m}(s_is_j)$$

• where $E_v$ donates the expected value of the product of states at thermal equilibrium when $v$ is clamped on the visible units and $E_w$ donates the expected value of the product of states at thermal equilibrium with no clamping.

• Positive phase: finding hidden configurations that work well with $v$ and lower their energies.

• Negative phase: finding the joint configurations that are the best competitors and raises their energies.
• Use MCMC to collect the statistics.
• Restricted Boltzmann Machines

• Definition: Only one hidden layer and no connections between hidden units.
• In an RBM it only takes one step to reach thermal equilibrium when the visible units are clamped

$$P(h_j=1)=\frac{1}{1+e^{-(b_j+\sum\limits_{i \in vis}{v_iw_{ij}})}}$$

• Learning layers of features by stacking RBMs
• Training steps
• First train a layer of features that receive input directly from the data.
• Then treat the activations of the trained features as if they were the original data and learn features of features in a second hidden layer.
• Then do it again.
• Generating steps
• Get an equilibrium sample from the top level RBM by performing alternating Gibbs sampling for a long time.
• Perform a top-down pass to get states for all the other layers.
• Modeling real-valued data with an RBM
• Use stepped sigmoid units
• Belief Networks
• Requires labeled data.
• The learning time doesn’t scale well.
• It can get stuck in poor local optima.
• Overcoming the limitations of back-propagation by using unsupervised learning
• Keep the efficiency and simplicity of using a gradient method for adjusting the weights, but use it for modelling the structure of the sensory input.
• Adjust the weights to maximize the probability that a generative model would have generated the sensory input.
• Definition: a directed acyclic graph composed of stochastic variables.
• Sigmoid Belief Networks (SBN)
• Definition: Belief Networks with logistic neurons.
• Naive learning algorithm
1. Get the posterior distribution over hidden states given the observed data. We can use MCMC to do this.
2. Get an unbiased sample from the posterior distribution.
3. For each unit, maximize the log probability that its binary state in the sample from the posterior would be generated by the sampled binary states of its parents.
4. $\Delta w_{ji}=\epsilon s_j\big(s_i-P(s_i=1)\big)$
• The wake-sleep algorithm
• We assume wrongly that the posterior over hidden configurations factorizes into a product of distributions for each separate hidden unit. Though this assuming is wrong, the learning of an SBN will still work.
• Wake phase: Use recognition weights to perform a bottom-up pass which trans the generative weights to reconstruct activities in each layer from the layer above.
• Sleep phase: Use generative weights to generate samples from the model, and this action trains the recognition weights to reconstruct activities in each layer from the layer below.
• Strong AI
• 2 theories on concept
1. The feature theory: A concept is a set of semantic features
2. The structuralist theory: The meaning of a concept lies in its relationships to other concepts