1. TL;DR

• Force the length of features represents the age of this identity when taking this photo.
• In this case, after doing normalization, features are Age-Invariant.

2. Motivation

• Aging process caused many problems (such as shape changes, texture changes, etc.), and those problems lead to large intra-class variation.
• In the field of AIFR, Conventional solutions model age feature and identity feature simultaneously. Considering the mixed feature are usually undecomposable, the property of mixing potentially reduce the robustness of recognizing cross-age faces.

3. Contriution

• They proposed a new approach called OE-CNNs.
• Specifically, we decompose deep face features into two orthogonal components to represent age-related and identity-related features.
• In this way, identity-related features are then Age-Invariant.
• They also built a new dataset called Cross-Age Face dataset (CAF).
• This dataset includes about 313,986 face images from 4,668 identities. Each identity has approximately 80 face images.
• They manually washed the data. And this dataset is fairly randomly separated across ages.

4. Experiments & Results

• Experiments on the MORPH Album 2 Dataset

• Experiments on the LFW Dataset

• Results on FG-NET Dataset and CACD-VS Dataset are also available in this paper.

5. Questions & Thoughts

• What if we embedding features to a flat instead of a sphere and let the height represents the age?
• Age can be replaced by other attributes of this image like pose, blur etc.
• The structure of their network can be adapted to GridFace.v7.relative_confidence_coefficient.
• The idea is actually quite similar to using Auto-Encoder to do the transfer.
• In comparison to A-Softmax, forcing the length of features to represents the age of this person actually restrict the capacity of this model which reduces the risk of over-fitting.

Why not use a single RNN

• There’s is an intuitive interpretation for this: “If you can discriminate a person from his eyes, why should I look at his nose?”

• So when we cut pictures into different patches and train with different CNNs, actually we are forcing the network to learn every single piece of information.

• And in practice, we are using this trick.

A-Softmax

• $L = \frac{1}{N} \sum\limits_{i}{-\log(\frac{e^{ \left| {x_i} \right| \psi(\theta_{y_i,i})}}{e^{ \left| {x_i} \right| \psi(\theta_{y_i,i})} + \sum\limits_{j|j \ne y_i}{e^{ \left| {x_j} \right| \cos(\theta _{j,i})}}})}$

• $\psi(\theta_{y_i,i})=(-1)^k\cos(m\theta_{y_i,i})-2k$, $\theta_{y_i,i}\in[\frac{k\pi}{m}, \frac{(k+1)\pi}{m}]$ and $k \in [0, m-1], \ m \in \mathbb{Z}^+$

• The plot of $\psi(\theta_{y_i,i})$

Normalizing the weight could reduce the prior caused by the training data imbalance

• Suppose we use a neural network to extract a $1D$ feature $f_i$ for each sample $i$ in the dataset and use Softmax to evaluate our network. To make our analysis easier, we normalize our features. Suppose there are only two classes in the dataset. There are $m$ samples in class 1 and $n$ samples in class 2. When our network is strong enough, our features are distributed at both ends of the diameter of the unit circle.

• Without bias terms, the loss function can be written as:
$L = -\sum\limits_{i = 1}^{m + n}\sum\limits_{j = 1}^{2}{a_{i,j} \log(p_{i,j})}$, where $p_{i,j}$ means the probability that sample i belongs to class $j$ (generated by the softmax) and $a_{i,j} = [sample \ i \ belongs \ to \ class \ j]$. Assume that $w_i$ means the weights in the softmax layer for class $i$.

• Then, $\frac{\partial L}{\partial w_1} = (m – 1) \sum\limits_{i|a_{i,1}=1}{f_i} = m (m – 1)$.

• And, $\frac{\partial L}{\partial w_2} = (n – 1) \sum\limits_{i|a_{i,2}=1}{f_i} = n (n – 1)$.

• If $m = 100$ while $n = 1$, then $\frac{\partial L}{\partial w_1} / \frac{\partial L}{\partial w_2} \approx 10000$. It’s clear to see that the derivative of $w_1$ is much bigger than the derivative of $w_2$. That’s why the larger sample number a class has, the larger the associated norm of weights tends to be.

Biases are useless for softmax

• In this paper, they use an experiment using MNIST to empirically prove that Biases is not necessary for softmax.

• In practical, Biases do are useless.

Understanding to “the prior that faces also lie on a manifold”

• As descript in NormFace, the feature distribution of softmax is ‘radial’. So after normalization, features lie on a very thick line on a hypersphere. That’s why Euclidean features failed and $cos$ similarity works well.

• This is also an interpretation of why the Euclidean margin is incompatible with softmax loss_

Other Points

1. Closed-set FR can be well addressed as a classification problem. Open-set FR is a metric learning problem.

2. The key criterion for metric learning in FR: the maxima intra-class distance is smaller than the minima inter-class distance.

3. Separable $\ne$ discriminative and softmax is only separable.

Preface

This paper is totally empirical. You can hardly find any mathematics analysis. Therefore, every conclusion we drew is from experimental results.

The main gaps between neuroscience models and ML models

1. Neurons encode information in a sparse and distributed way. This corresponds to a trade-off between the richness of representation and small action potential energy expenditure. However, without additional regularization, ordinary feed-forward neural networks do not have this property. This is not only not biologically implausible and hurts gradient-based optimization.
2. Sigmoid and tanh are equivalent up to a linear transformation(do some tiny changes to the axis, we can get another one). the tanh has a steady state at 0 and is therefore preferred from the optimization standpoint, but it forces an antisymmetry around 0, which is absent in biological neurons.

Potential problems of rectifiers

1. The hard saturation at 0 may hurt optimization by blocking gradient back-propagation.
2. In order to efficiently represent symmetric or antisymmetric behaviour in the data, a rectifier network would need twice as many hidden units as a network of symmetric or antisymmetric activation functions.
3. Rectifier networks are subject to ill-conditioning of parametrization.
• Consider for each layer of depth $i$ of the network a scalar $\alpha_i$, and scaling the prameters as $w_i’=\frac{w_i}{\alpha_i}$ and $b_i’=\frac{b_i}{\prod_{j=1}^{i}{\alpha_j}}$. The output units values then change as follow: $s’=\frac{s}{\prod_{j=1}^{i}{\alpha_j}}$
• Therefore, as long as $\prod_{j=1}^{n}{\alpha_j}=1$, the network function is identical.

1. Rectifiers are not only biologically plausible, they are also computationally efficient.
2. There’s almost no improvement when using unsupervised pre-training with rectifier activations.
3. Networks trained with the rectifier activation functions can find local minima of greater or equality than those obtained with its smooth counterpart, the soft plush.
4. Rectifier networks are truly sparse.

What did they do in this paper?

• They greatly improved the result on LFW by improving the alignment and representation.

How did they do this?

1. They apply 2D and 3D alignments to frontalize the picture.

2. They carefully customized the structure of the descriptor(a CNN).

High Lights

1. 3D Alignment

2. Good performance on many datasets

3. Well hand-crafted DNN structures(those locally connected layers)

4. Use a large training set to overcome overfitting

Q&A

1. Q: The detail of 2D alignment?

A: Step 1: Detect 6 fiducial points. Step 2: Use a regressor to apply one or more affine transformations.

2. Q: The detail of 3D alignment?

A:

num detail
1 Generate a generic 3D model of 67 anchor points from USF Human-ID database
2 Fine tune a unique 3D model for each image using residual vector and covariance matrix(?)
3 Locate 67 fiducial points in each picture
4 Do apply Triangulation to both anchor points and fiducial points
5 For each pair of corresponding triangles, apply 2D alignment
3. Q: Why there’s a $L^2$ normalization?

A: To cope with the change of illumination. And there’s another usage: When you use softmax and the data is fairly blurred, then the descriptor will tend to assign $\vec 0$ to every picture as its representation to minimize the loss function. So in order to fix this bug, we set all the representations to the length of 1 to avoid this problem.

Appendix

1. ReLU has a great ability to learn. In practical, ReLU and $\tanh$ are mostly used in CV. And in some cases, some variant of ReLU like PReLU can even do a better job.

2. In 3D alignment literature, PCA is widely used to fine tune a unique 3D model for each picture. And a 3D Engine can be used to generate data to train neural networks. And in some cases, some triangle in the Triangulation is invisible, then you need to guess what it looks like. And in this case, adversarial models are recommended for conventional solutions usually offer a blurred image while adversarial models offer clear images with sharp edges.

3. In 2D alignment, affine transformation and projective transformation are applied. There’re only 4 parameters in affine transformations, so Least Squares are the most used solution. When it comes to projective transformation, things get different. Though there’re only 2 more parameters, the projective transformation would greatly change the shape of faces(affine transformation would do the same stuff but not so much). So, usually, we combine the transformation with the whole net. Though we don’t know how much the transformation does exactly, as we only care about the result, it works and it works well. So, yes, this is still a black box. Moreover, there’s a network called Spatial Transformer Networks(STN) and it actually is doing Projective Transformation’s job.

4. When you are doing the alignment, for example using affine transformation, interpolation may be needed. Usually, there are two ways: 1. Use the nearest point as a substitute; 2. Use a linear combination of four closest points as the answer.

5. In face recognition, we can keep ROC good by limiting the angle of the face.

Taylor series

A Taylor series is a representation of a function as an infinite sum of terms that are calculated from the values of the function’s derivatives at a single point

$$f(a) \approx \sum\limits_{n=0}^{\infty}{\frac{f^{(n)}(a)}{n!}(x-a)^n}$$

where $f^{n}(a)$ donates the $n^{th}$ derivative of $f$ evaluated at the point $a$.

And here’s a very intuitive example:

The exponential function $e^x$ (in blue), and the sum of the first $n + 1$ terms of its Taylor series at $0$ (in red).

Newton’s method

Overview

In calculus, Newton’s method is an iterative method for finding the roots of a differentiable function $f$.

In machine learning, we apply Newton’s method to the derivative $f’$ of the cost function $f$.

One-dimension version

In the one-dimensional problem, Newton’s method attempts to construct a sequence ${x_n}$ from an initial guess $x_0$ that converges towards some value $\hat{x}$ satisfying $f'(\hat{x})=0$. In another word, we are trying to find a stationary point of $f$.

Consider the second order Taylor expansion $f_T(x)$ of $f$ around $x_n$ is

$$f_T(x)=f_T(x_n+\Delta x) \approx f(x_n) + f'(x_n) \Delta x + \frac{1}{2}f”(x_n) \Delta x^2$$

So now, we are trying to find a $\Delta x$ that sets the derivative of this last expression with respect to $\Delta x$ equal to zero, which means

$$\frac{\rm{d}}{\rm{d} \Delta x}(f(x_n) + f'(x_n) \Delta x + \frac{1}{2}f”(x_n) \Delta x^2) = f'(x_n)+f”(x_n)\Delta x = 0$$

Apparently, $\Delta x = -\frac{f'(x_n)}{f”(x_n)}$ is the solution. So it can be hoped that $x_{n+1} = x_n+\Delta x = x_n – \frac{f'(x_n)}{f”(x_n)}$ will be closer to a stationary point $\hat{x}$.

High dimensional version

Still consider the second order Taylor expansion $f_T(x)$ of $f$ around $x_n$ is

$$f_T(x)=f_T(x_n+\Delta x) \approx f(x_n) + \Delta x^{\mathsf{T}} \nabla f(x_n) + \frac{1}{2} {\rm H} f(x_n) \Delta x (\Delta x)^{\mathsf{T}}$$

where ${\rm H} f(x)$ donates the Hessian matrix of $f(x)$ and $\nabla f(x)$ donates the gradient. (See more about Taylor expansion at https://en.wikipedia.org/wiki/Taylor_series#Taylor_series_in_several_variables)

So $\Delta x = – [{\rm H}f(x_n)]^{-1}\nabla f(x_n)$ should be a good choice.

Limitation

As is known to us all, the time complexity to get $A^{-1}$ given $A$ is $O(n^3)$ when $A$ is a $n \times n$ matrix. So when the data set is of too many dimensions, the algorithm will work quite slow.

The reason steepest descent goes wrong is that the gradient for one weight gets messed up by simultaneous changes to all the other weights.

And the Hessian matrix determines the sizes of these interactions so that Newton’s method minimize these interactions as much as possible.

Conventional machine-learning

• Conventional machine-learning techniques were limited in their ability to process natural data in their raw form.

• Problems such as image and speech recognition require the input-output function to be insensitive to irrelevant variations of the input while being very sensitive to particular minute variations.

• This is why Conventional machine-learning models require a good feature extractor designed by humans. But neural networks can learn this features automatically using a general-purpose learning procedure. This is the key advantage of deep learning.

Activation Functions

• In past decades, neural nets used smoother non-linearities, such as $\tanh(z)$ or Sigmoid function, but ReLU typically learns much faster in networks with many layers.

Learning Procedure

• The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer.

• Recent theoretical and empirical results strongly suggest that local minima are not a serious problem in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero and the surface curves up in most dimensions and curves down in the reminder. The analysis seems to show that saddle points with only a few downward curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.

• For smaller data sets, unsupervised pre-training helps to prevent overfitting. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.

CNN

• CNNs are designed to process data that come in the form of multiple arrays.

• The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within each unit is connected to local patched in the feature maps of the previous layer through a set of weights called a filter bank. All units in a feature map share the same filter bank.

• Although the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one.

• Many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones.

• The pooling allows representations to vary very little when elements in the previous layer vary in position and appearance.

• The convolutional and pooling layers in CNNs are directly inspired by classic notions of simple cells and complex cells in visual neuroscience.

The future of deep learning

• Unsupervised learning had a catalytic effect in reviving interest in deep learning. We expected unsupervised learning to become far more important in the longer term.

Views from Ge Li (not from this passage)

Fields Technology Versatility
Computer Vision developed bad
Natural Language Processing developing bad
Speech Recognition developed good

Summary of “Neural Networks for Machine Learning by Geoffrey Hinton”

• Activation function
• Logistic neurons
• $y=\frac{1}{1+e^{-z}}$
• $\frac{dy}{dz}=y(1-y)$
• The softmax function
• General tricks
• Dropout (See more at https://arxiv.org/abs/1207.0580)
• A useful way to regularize the net.
• Details: randomly remove some hidden units in a layer for each training example
• Advantage: It stops hidden units from relying too much on other hidden units
• Tricks to speed up the learning procedure (See more at http://ruder.io/optimizing-gradient-descent/)
1. Use momentum
• $v(t) = \alpha v(t-1)-\varepsilon \frac{\partial E}{\partial w}(t)$
• \begin{aligned}\Delta w(t) & = v(t)\\ & = \alpha v(t – 1) – \varepsilon \frac{\partial E}{\partial w}(t)\\ & = \alpha \Delta w(t – 1) – \varepsilon \frac{\partial E}{\partial w}(t) \end{aligned}
2. Use separate adaptive learning rates for each parameter
• $\Delta w_{i,j} = -\varepsilon g_{i,j} \frac{\partial E}{\partial w_{i,j}}$
• $g_{i, j} = \begin{cases}g_{i,j}(t – 1) + \delta & \frac{\partial E}{\partial w_{i,j}}(t)\frac{\partial E}{\partial w_{i,j}}(t – 1) > 0 \\ g_{i,j}(t-1) \times \delta & otherwise\end{cases}$
• $\delta$ is a variable setted by hand ($0.05$ is usually a good choice)
• This method ensures that local learning rates increase when the gradients don’t change sign and decay rapidly when oscillations start
3. Use rmsprop
• $F(w, t) = \delta F(w, t – 1) + (1 – \delta)(\frac{\partial E}{\partial w})^2$
• $\Delta w = -\frac{\varepsilon}{\sqrt{F(w, t)}} \cdot \frac{\partial E}{\partial w}$
• $\delta$ is a variable setted by hand ($0.9$ is usually a good choice)
• Preventing overfitting
1. Get more data
2. Average many different models
3. Use Bayesian Approach
• Bayes Theorem: $P(A|B)=\frac{P(A)P(B|A)}{P(B)}$
• Deriving Bayes’ Theorem: $P(A|B)=\frac{P(AB)}{P(B)}=\frac{P(A)P(B|A)}{P(B)}$
• Understanding
• Let $P(A)$ be the prior probability, assuming that we already know the prior distribution.
• Then we get a new data $B$, so we can compute $P(B|A)$ as we already have prior distribution.
• After that, we use $P(B)$ to renormalize as $P(B) = \int\limits_A{P(B|A)P(A)}$.
• Finally, our posterior distribution $P(A|B)$ is $P(A|B)=\frac{P(A)P(B|A)}{P(B)}$.
• The Bayesian interpretation of weight decay
• Let $W$ be the set of weights in a network. We add Gaussian noise $\epsilon \sim N(0, \sigma_B^2)$ to the the output of the network, let $P(A|W)$ donate the probability we get the correct anwser $A$ after we add the noise to the output $B$ with weights $W$.
• Considering the Maximum A Posteriori(MAP), we are trying to maximize $P(W|A) = \frac{P(W)P(A|W)}{P(A)}$, which is the same as minimize $-\log P(W|A) = -\log P(W) – \log P(A|W) + \log P(A)$. As $\log P(A)$ has nothing to do with $W$, we can just ignore that term. So we can define our cost as $C = -\log P(W) – \log P(A|W)$, and we are trying to minimize the cost $C$.
• Let us regularise weights $W$ by imposing the Gaussian distribution $N(0,\sigma_w^2)$, then $C = \frac{1}{2\sigma_w^2}\sum\limits_{i}{w_i^2}+\frac{1}{2\sigma_B^2}\sum\limits_c{(y_c-\hat{y})^2} + k$ where $k$ is the sum of some constant terms.
• If we take that equation and multiply through by $2\sigma_{B}^2$ and ignore the constant $k$. Then we got a new cost function $C’ = \sum\limits_c{(y_c-\hat{y})^2} + \frac{2\sigma_B^2}{2\sigma_w^2}\sum\limits_{i}{w_i^2}$.
• As we can see here, the weight decay parameter in this case should not be an arbitrary hack.
• See more at $L^2$ regularization is equivalent to Gaussian Prior and Bayesian Methods for Neural Networks
• Empirical Bayes
• As we mentioned above, if we want the $L^2$ weight decay to work excellent, we need to know the variance of the weight prior $\sigma_w$ and the noise variance $\sigma_B$
• Steps
• Start with guesses for both the noise variance and the weight prior variance.
• While not yet bored
• Do some learning using the ratio of the variances as the weight penalty coefficient.
• Reset the noise variance to be the variance of the residual errors.
• Reset the weight prior variance to be the variance of the distribution of the actual learned weights.
• Go back to the start of this loop.
• We don’t need a validation set to choose the weight penalty coefficient.
• Full Bayesian Learning
• Instead of trying to find the best single setting of parameters we compute the full posterior distribution over all possible parameter settings.
• This approach let overfitting disappear, cuz there’s no reason why the amount of data should influence our prior beliefs about the complexity. So this approach allows us to use complicated models even when we do not have much data.
• This approach does not involve any gradient descent so there are no local optimum issues.
• Practical Full Bayesian Learning
• Use Markov Chain Monte Carlo to approximate the posterior probability
• We can use the sampling noise from mini-batch as the noise which the MCMC method needs.
4. Use the model has the right capacity
1. Limit the number of hidden layers and the number of units per layer
2. Use early stopping
3. Use weight-decay
4. Introduce noise
• $y = \sum\limits_i{x_iw_i}$
• Consider $L^2$ weight-decay and Gaussian noise
• Let $x_i^{noisy}$ equals to $x_i+\epsilon_i$ where $\epsilon_i$ is sampled from $N(0,\sigma^{2}_{i})$
• Then $y^{noisy} = y + \sum\limits_{i}{w_i\epsilon_i}$

\begin{aligned}E\bigl((y^{noisy}-\hat{y})^2\bigr) & = E\Bigl(\bigl((y – \hat{y}) + \sum\limits_{i}{w_i\epsilon_i}\bigr)^2\Bigr)\\ & = (y – \hat{y})^2 + E\bigl(2(y-\hat{y})\sum\limits_{i}{w_i\epsilon_i}\bigr) + E\bigl((\sum\limits_{i}{w_i\epsilon_i})^2\bigr)\\ & = (y – \hat{y})^2 + E(\sum\limits_{i}{w_i^2\epsilon_i^2})\\ & = (y – \hat{y})^2 + \sum\limits_{i}{w_i^2\sigma_i^2}\end{aligned}

• So $\sigma_i^2$ is equivalent to an $L^2$ penalty in this simple network.
• If the network is more complicated, Gaussian noise usually works better than $L^2$ weight penalty.
• Combine models (cooperation)
• Mathematics analysis
• Combine models
• $\bar{y} = \frac{1}{n} \sum\limits_{i = 1}^{n}{y_i}$
• If we use $L^2$ penalty, then the cost is $(\bar{y}-\hat{y})^2$, so is its expectation.
• Pick a predictor at random
• If we use $L^2$ penalty, then the expectation of the cost should be

\begin{aligned}E\big((\hat{y}-y_i)^2\big) & = E\Big(\big((\hat{y} – \bar{y}) – (y_i – \bar{y})\big)^2\Big)\\ & = E\big((\hat{y} – \bar{y})^2 + (y_i – \bar{y})^2 – 2(\hat{y} – \bar{y})(y_i – \bar{y})\big)\\ & = (\bar{y} – \hat{y})^2 + E\big((y_i-\bar{y})^2\big) – 2(\hat{y} – \bar{y})E(y_i-\bar{y})\\ & = (\bar{y} – \hat{y})^2 + E\big((y_i-\bar{y})^2\big) – \frac{2}{n}(\hat{y} – \bar{y})\sum\limits_{i = 1}^{n}{y_i – \bar{y}}\\ & = (\bar{y} – \hat{y})^2 + E\big((y_i-\bar{y})^2\big)\end{aligned}

• Comparison

• Apparently the expectation cost of picking a random predictor is no less than combining models’ as $E\big((y_i-\bar{y})^2\big) \ge 0$.
• The more the results of those models differ, the better is combining models than picking a random model.
• Ways to make predictors differ
• Use lots of different kinds of models
• Bagging: Train different models on different subsets of the data.
• Boosting: Train a sequence of low capacity models. Weight the training cases differently for each model in the sequence.
• Boosting up-weights cases that previous models got wrong.
• It focused the computational resources on modelling the tricky cases.
• Mixtures of Experts (specialization)
• Cluster the training cases into subsets based on the relationship between input and output so that each cluster can be well modelled by an expert.
• Averaging models like what we mentioned above causes cooperation, not specialization.
• Cost function that encourages cooperation: $C = (\hat{y}-\bar{y})^2$
• Cost function that encourages specialization: $C = \sum\limits_{i}{p_i(\hat{y}-y_i)^2}$ where $p_i$ is the probability of picking each expert assigned by the manager where manager is a seperately trained model.
• Principal Compopnents Analysis
• This projects n-dimensional data into an m-dimensional subspace in which the data have the most variance.
• We reconstruct by using the mean value on the $n-m$ directions that are not represented. The reconstruction error is the sum over all these unrepresented directions of the squared differences of the datapoint from the mean.
• Using back-propagation to implement PCA inefficiently
• Try to make the output be the same as the input in a network with a central bottleneck. The activities of the hidden units in the bottleneck form an efficient code.
• If the hidden and output layers are linear, it will learn hidden units that are a linear function of the data and minimize the squared reconstructing error which is exactly what PCA dose.
• Neural networks
• Convolutional Neural Networks
• Weight sharing
• If a feature is useful in some locations during training, detectors for that feature will be available in all location during testing
• Pooling
• Get a small amount of translational invariance at each level by averaging some neighbouring replicated detectors to give a single output to the next level.
• Advantage: It reduces the inputs to the next layer, allowing us to have many more different feature maps.
• Disadvantage: We lost information about the precise position of things
• Recurrent neural networks
• Comparing with other Hidden Sate Models
1. Linear Dynamical Systems
• Let $x$ donate the state vector and $A$ donate the transition matirx
• The variation can take two forms:
1. Flow: $x$ varies coontinuously with time
• $\frac{d x(t)}{d t} = A \cdot x(t)$
2. Mapping: $x$ varies in discrete steps
• $x_{m+1} = A \cdot x_m$
• The distribution over the hidden state given the data so far can be computed effeciently
• The hidden states follow Gaussian distribution
2. Hidden Markov Models
• There are a number of things called states. And the system is always in exactly one of those states.
• Transitions between states are stochastic and controlled by a transition matrix
• HMMs have efficient algorithm for inference and learning
• The hidden states follow discrete distribution
• A HMM with $n$ state can only store $\log (n)$ bits of information because a HMM has to stay at one exactly state of its $n$ states
3. RNNs
• RNNs can remember much more information
• RNNs can potentially learn to implement lots of small programs that each capture a nugget of knowledge and run in parallel, interacting to produce very complicated effects
• RNNs are much harder to train
• Training the RNNs with backpropagation
• Expand RNN in time sequence
• Constraint the weights expanded from the same original weights
• Train as a normal network
• Reasons why RNNs are hard to train:
• In the forward pass, We use squashing functions (like the logistic) to prevent the activity vectors from exploding
• In the backward pass, the pass is completely linear, so it’s very likely to vanish or explode.
• Four effective ways to learn an RNN
1. Long Short Term Memory
2. Hessian Free Optimization
3. Echo State Networks
• Fix the input-to-hidden connections and the hidden-to-hidden connections at random values and only learn the hidden-to-ouput connections.
• It’s very important to set these connections very carefully.
• Key ideal: a big random expansion of the input vector can help.
• ESNs demonstrate that it’s very important to initialize weights sensibly.
4. Good initialization with momentum
• RNNs with multiplicative connections
• Reasons why we need multiplicative connections
• If we consider an RNN as an automaton, then the hidden state of time $t$, should depend on the conjunction of input of time $t$ and hidden state of time $t – 1$.
• Instead of using the inputs to the RNN to provide additive extra input to the hidden units, we can use the current input to choose the whole hidden-to-hidden weight matrix. But this approach requires so many parameters.
• Details
• Let’s consider this model

• Then, multiplicative connections can be written as $c_f = (b^{\mathsf{T}}w_f)(a^{\mathsf{T}}u_f)v_f$ or $c_f=(b^{\mathsf{T}}w_f)(u_fv_f^{\mathsf{T}})a$ where $(b^{\mathsf{T}}w_f)$ can be considered as scalar coefficients and $(u_fv_f^{\mathsf{T}})$ can be considered to be the transition matrix with rank $1$.
• In fact, if here’re $n$ hidden units, here’re $n$ factors, and the description above shows one factor. And all this factors together actually let the input to determine the transition matrix.
• Hopfield Nets
• Energy function: $E = -\sum\limits_i{s_ib_i}-\sum\limits_{i<j}{s_is_jw_{ij}}$.
• Hopfield proposed that memories could be energy minima of a neural net.
• Storing a memory using Hopfield’s storage rule
• If the units are binary threshold units: $\Delta w = 4(s_i-\frac{1}{2})(s_j-\frac{1}{2})$.
• If the units are of activities of $1$ and $-1$: $\Delta w_{ij}=s_is_j$
• Dealing with spurious minima in Hopfield Nets
• Let the net settle from a random initial state and then do unlearning.
• Crick and Mitchison proposed unlearning as a model of what dreams are for.
• Storing a memory using pseudo-likelihood method
• Use the perceptron convergence procedure to train each unit to have the correct state given the states of all the other units in that vector.
• Boltzmann Machines
• Can be regarded as Hopfield nets with hidden units.
• Instead of using the net to store memories, use it to construct interpretations of sensory input.
• The input is represented by the visible units.
• The interpretation is represented by the states of the hidden units.
• The badness of the interpretation is represented by the energy.
• Use Simulated Annealing to make Boltzmann Machines escape from local minima.
• $P(s_i=1)=\frac{1}{1+e^{-\frac{\Delta E_i}{T}}}$ where $T$ is the temperature.
• Rasing the noise level is equivalent to decreasing all the energy gaps between configurations
• Thermal equilibrium
• The thing that settles down is the probability distribution over configurations
• Using energies to define probabilities
• Let $v$ donates the configuration on the visible units and $h$ on the hidden units.
• $P(v, h) = \frac{e^{-E(v, h)}}{\sum\limits_{g}{\sum\limits_{u}{e^{-E(u, g)}}}}$
• $P(v) = \frac{\sum\limits_h{e^{-E(v, h)}}}{\sum\limits_{g}{\sum\limits_{u}{e^{-E(u, g)}}}}$
• We can use MCMC to sample the probabilities.
• The boltzmann Machine learning algorithm
• The goal of learning: We are trying to maximize the product of the probabilities that the Boltzmann machine assigns to the binary vectors in the training set. This is equivalent to maximize the sum of the log probabilities that Boltzmann machine assigns to the training vectors.
• Everything that one weight needs to know about the other weights and the data is contained in the difference of two correlations.

$$\frac{\partial \log P(v)}{\partial w_{ij}}=E_v(s_is_j)-E_{m}(s_is_j)$$

• where $E_v$ donates the expected value of the product of states at thermal equilibrium when $v$ is clamped on the visible units and $E_w$ donates the expected value of the product of states at thermal equilibrium with no clamping.

• Positive phase: finding hidden configurations that work well with $v$ and lower their energies.

• Negative phase: finding the joint configurations that are the best competitors and raises their energies.
• Use MCMC to collect the statistics.
• Restricted Boltzmann Machines

• Definition: Only one hidden layer and no connections between hidden units.
• In an RBM it only takes one step to reach thermal equilibrium when the visible units are clamped

$$P(h_j=1)=\frac{1}{1+e^{-(b_j+\sum\limits_{i \in vis}{v_iw_{ij}})}}$$

• Learning layers of features by stacking RBMs
• Training steps
• First train a layer of features that receive input directly from the data.
• Then treat the activations of the trained features as if they were the original data and learn features of features in a second hidden layer.
• Then do it again.
• Generating steps
• Get an equilibrium sample from the top level RBM by performing alternating Gibbs sampling for a long time.
• Perform a top-down pass to get states for all the other layers.
• Modeling real-valued data with an RBM
• Use stepped sigmoid units
• Belief Networks
• Disadvantage of back-propagation
• Requires labeled data.
• The learning time doesn’t scale well.
• It can get stuck in poor local optima.
• Overcoming the limitations of back-propagation by using unsupervised learning
• Keep the efficiency and simplicity of using a gradient method for adjusting the weights, but use it for modelling the structure of the sensory input.
• Adjust the weights to maximize the probability that a generative model would have generated the sensory input.
• Definition: a directed acyclic graph composed of stochastic variables.
• Sigmoid Belief Networks (SBN)
• Definition: Belief Networks with logistic neurons.
• Naive learning algorithm
1. Get the posterior distribution over hidden states given the observed data. We can use MCMC to do this.
2. Get an unbiased sample from the posterior distribution.
3. For each unit, maximize the log probability that its binary state in the sample from the posterior would be generated by the sampled binary states of its parents.
4. $\Delta w_{ji}=\epsilon s_j\big(s_i-P(s_i=1)\big)$
• The wake-sleep algorithm
• We assume wrongly that the posterior over hidden configurations factorizes into a product of distributions for each separate hidden unit. Though this assuming is wrong, the learning of an SBN will still work.
• Wake phase: Use recognition weights to perform a bottom-up pass which trans the generative weights to reconstruct activities in each layer from the layer above.
• Sleep phase: Use generative weights to generate samples from the model, and this action trains the recognition weights to reconstruct activities in each layer from the layer below.
• Strong AI
• 2 theories on concept
1. The feature theory: A concept is a set of semantic features
2. The structuralist theory: The meaning of a concept lies in its relationships to other concepts