 Activation function
 Logistic neurons
$y=\frac{1}{1+e^{-z}}$
$\frac{dy}{dz}=y(1-y)$
 The softmax function
 $y_i=\frac{e^{z_i}}{\sum\limits_{j:j \in group}{e^{z_j}}}$
$\frac{\partial y_i}{\partial z_i} = y_i(1-y_i)$
Use cross-entropy as the cost function. (See more at http://peterroelants.github.io/posts/neural_network_implementation_intermezzo02/)
$E = -\sum\limits_{j:j \in group}{\hat{y}_j \log y_j}$
$\frac{\partial E}{\partial z_i}=y_i-\hat{y}_i$
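A minimal pure-Python check of these formulas; the three-class logits and one-hot target below are made up for illustration, and the analytic gradient $y_i - \hat{y}_i$ is compared against finite differences:

```python
import math

def softmax(z):
    # Subtract the max for numerical stability.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(y, target):
    # E = -sum_j t_j * log(y_j)
    return -sum(t * math.log(p) for t, p in zip(target, y))

z = [1.0, 2.0, 0.5]        # made-up logits
target = [0.0, 1.0, 0.0]   # one-hot label
y = softmax(z)

# Analytic gradient from the note: dE/dz_i = y_i - t_i
analytic = [yi - ti for yi, ti in zip(y, target)]

# Finite-difference check of the same gradient.
eps = 1e-6
numeric = []
for i in range(len(z)):
    zp = list(z); zp[i] += eps
    zm = list(z); zm[i] -= eps
    numeric.append((cross_entropy(softmax(zp), target)
                    - cross_entropy(softmax(zm), target)) / (2 * eps))

print(analytic)
print(numeric)
```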
 General tricks
 Dropout (See more at https://arxiv.org/abs/1207.0580)
 A useful way to regularize the net.
 Details: randomly remove some hidden units in a layer for each training example
 Advantage: It stops hidden units from relying too much on other hidden units
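A sketch of the idea, assuming the now-common "inverted dropout" scaling in which survivors are rescaled at training time so nothing changes at test time (the cited paper instead halves the weights at test time):

```python
import random

def dropout(activations, p_drop=0.5, train=True):
    # Training: drop each hidden unit independently with probability p_drop
    # and rescale the survivors by 1/(1 - p_drop) ("inverted dropout").
    if not train:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(0)
print(dropout([0.3, -1.2, 0.8, 0.5], p_drop=0.5))
```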
Tricks to speed up the learning procedure (See more at http://ruder.io/optimizing-gradient-descent/)
 Use momentum
$v(t) = \alpha v(t-1)-\varepsilon \frac{\partial E}{\partial w}(t)$
$\begin{aligned}\Delta w(t) & = v(t)\\
& = \alpha v(t-1) - \varepsilon \frac{\partial E}{\partial w}(t)\\
& = \alpha \Delta w(t-1) - \varepsilon \frac{\partial E}{\partial w}(t) \end{aligned}$
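The momentum update above, sketched on a toy quadratic $E(w)=w^2$; the learning rate, momentum coefficient, and iteration count are arbitrary choices:

```python
def momentum_step(w, v, grad, lr=0.1, alpha=0.9):
    # v(t) = alpha * v(t-1) - lr * dE/dw ;  delta w(t) = v(t)
    v = alpha * v - lr * grad(w)
    return w + v, v

# Minimize the toy cost E(w) = w^2, whose gradient is 2w.
w, v = 5.0, 0.0
for _ in range(300):
    w, v = momentum_step(w, v, grad=lambda u: 2.0 * u)
print(w)
```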
 Use separate adaptive learning rates for each parameter
$\Delta w_{i,j} = -\varepsilon g_{i,j} \frac{\partial E}{\partial w_{i,j}}$
$g_{i, j}(t) = \begin{cases}g_{i,j}(t-1) + \delta & \text{if } \frac{\partial E}{\partial w_{i,j}}(t)\frac{\partial E}{\partial w_{i,j}}(t-1) > 0 \\ g_{i,j}(t-1) \times \delta & \text{otherwise}\end{cases}$
$\delta$ is a parameter set by hand ($0.05$ is usually a good choice)
 This method ensures that local learning rates increase when the gradients don’t change sign and decay rapidly when oscillations start
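A sketch of the local-gain rule exactly as written above; the gradient sequence is made up, and the clamping bounds are an extra safeguard that the note does not state:

```python
def update_gain(g, grad_now, grad_prev, delta=0.05, g_min=0.1, g_max=10.0):
    # Grow the gain additively while the gradient keeps its sign;
    # shrink it multiplicatively (by the rule as written above) otherwise.
    if grad_now * grad_prev > 0:
        g = g + delta
    else:
        g = g * delta
    return min(max(g, g_min), g_max)   # clamping: an extra safeguard

g, prev = 1.0, 0.5
for gr in [0.4, 0.3, -0.2]:            # two consistent steps, then a sign flip
    g = update_gain(g, gr, prev)
    prev = gr
print(g)
```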
 Use rmsprop
 $F(w, t) = \delta F(w, t – 1) + (1 – \delta)(\frac{\partial E}{\partial w})^2$
$\Delta w = -\frac{\varepsilon}{\sqrt{F(w, t)}} \cdot \frac{\partial E}{\partial w}$
$\delta$ is a parameter set by hand ($0.9$ is usually a good choice)
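A sketch of rmsprop on the same kind of toy quadratic; the small `eps` in the denominator is a standard extra to avoid division by zero, and all constants are arbitrary:

```python
import math

def rmsprop_step(w, F, grad, lr=0.01, delta=0.9, eps=1e-8):
    # F(w, t) = delta * F(w, t-1) + (1 - delta) * (dE/dw)^2
    g = grad(w)
    F = delta * F + (1.0 - delta) * g * g
    w = w - lr * g / (math.sqrt(F) + eps)   # eps guards against division by zero
    return w, F

# Minimize E(w) = w^2 (gradient 2w).
w, F = 5.0, 0.0
for _ in range(2000):
    w, F = rmsprop_step(w, F, grad=lambda u: 2.0 * u)
print(w)
```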
 Preventing overfitting
 Get more data
 Average many different models
 Use Bayesian Approach
Bayes' Theorem: $P(A \mid B)=\frac{P(A)P(B \mid A)}{P(B)}$
Deriving Bayes' Theorem: $P(A \mid B)=\frac{P(A, B)}{P(B)}=\frac{P(A)P(B \mid A)}{P(B)}$
 Understanding
 Let $P(A)$ be the prior probability, assuming that we already know the prior distribution.
Then we observe new data $B$, so we can compute the likelihood $P(B \mid A)$.
After that, we use $P(B)$ to renormalize, where $P(B) = \int_A{P(B \mid A)P(A)}\,dA$.
Finally, our posterior distribution is $P(A \mid B)=\frac{P(A)P(B \mid A)}{P(B)}$.
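A toy two-hypothesis example of these steps with made-up numbers (here the integral over $A$ collapses to a two-term sum):

```python
# Made-up numbers: prior P(A) and likelihoods for binary data B.
p_A = 0.01
p_B_given_A = 0.90
p_B_given_not_A = 0.05

# Renormalizing term P(B): the integral over A collapses to a sum
# because there are only two hypotheses, A and not-A.
p_B = p_B_given_A * p_A + p_B_given_not_A * (1.0 - p_A)

# Posterior P(A | B) = P(A) * P(B | A) / P(B)
posterior = p_A * p_B_given_A / p_B
print(posterior)
```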
 The Bayesian interpretation of weight decay
Let $W$ be the set of weights in a network. We add Gaussian noise $\epsilon \sim N(0, \sigma_B^2)$ to the output of the network, and let $P(A \mid W)$ denote the probability of getting the correct answer $A$ after adding the noise to the output with weights $W$.
Considering the Maximum A Posteriori (MAP) estimate, we are trying to maximize $P(W \mid A) = \frac{P(W)P(A \mid W)}{P(A)}$, which is the same as minimizing $-\log P(W \mid A) = -\log P(W) - \log P(A \mid W) + \log P(A)$. As $\log P(A)$ has nothing to do with $W$, we can just ignore that term. So we define the cost as $C = -\log P(W) - \log P(A \mid W)$, and we try to minimize $C$.
Let us regularise the weights $W$ by imposing the Gaussian prior $N(0,\sigma_w^2)$; then $C = \frac{1}{2\sigma_w^2}\sum\limits_{i}{w_i^2}+\frac{1}{2\sigma_B^2}\sum\limits_c{(y_c-\hat{y}_c)^2} + k$ where $k$ collects the constant terms.
Multiplying through by $2\sigma_{B}^2$ and ignoring the constant $k$, we get a new cost function $C' = \sum\limits_c{(y_c-\hat{y}_c)^2} + \frac{\sigma_B^2}{\sigma_w^2}\sum\limits_{i}{w_i^2}$.
As we can see, the weight decay parameter is not an arbitrary hack here: it is the ratio of the noise variance to the weight-prior variance.
See more at "$L^2$ regularization is equivalent to Gaussian Prior" and "Bayesian Methods for Neural Networks"
 Empirical Bayes
As mentioned above, for $L^2$ weight decay to work well, we need to know the weight-prior variance $\sigma_w^2$ and the noise variance $\sigma_B^2$
 Steps
 Start with guesses for both the noise variance and the weight prior variance.
 While not yet bored
 Do some learning using the ratio of the variances as the weight penalty coefficient.
 Reset the noise variance to be the variance of the residual errors.
 Reset the weight prior variance to be the variance of the distribution of the actual learned weights.
 Go back to the start of this loop.
 We don’t need a validation set to choose the weight penalty coefficient.
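The loop above can be sketched for a single-weight linear model. Everything here is a made-up illustration, not the note's prescription: the synthetic data, the closed-form ridge solution used as the "learning" step, and the iteration count standing in for "while not yet bored":

```python
import random

random.seed(1)
# Hypothetical data from y = 2x + Gaussian noise (std 0.5).
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
ys = [2.0 * x + random.gauss(0.0, 0.5) for x in xs]

var_noise, var_prior = 1.0, 1.0          # initial guesses for both variances
for _ in range(20):                      # "while not yet bored"
    lam = var_noise / var_prior          # ratio of variances = penalty coefficient
    # "Do some learning": closed-form ridge solution for a single weight.
    w = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)
    # Reset the noise variance to the variance of the residual errors.
    residuals = [y - w * x for x, y in zip(xs, ys)]
    var_noise = sum(r * r for r in residuals) / len(residuals)
    # Reset the weight-prior variance from the learned weights;
    # with a single weight this degenerates to w^2.
    var_prior = w * w
print(w, var_noise, var_prior)
```

No validation set was used to pick the penalty `lam`; it falls out of the two fitted variances.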
 Full Bayesian Learning
 Instead of trying to find the best single setting of parameters we compute the full posterior distribution over all possible parameter settings.
This approach makes overfitting disappear, because there is no reason why the amount of data should influence our prior beliefs about model complexity. So it allows us to use complicated models even when we do not have much data.
 This approach does not involve any gradient descent so there are no local optimum issues.
 Practical Full Bayesian Learning
 Use Markov Chain Monte Carlo to approximate the posterior probability
We can use the sampling noise from minibatches as the noise that the MCMC method needs.
Use a model that has the right capacity
 Limit the number of hidden layers and the number of units per layer
 Use early stopping
Use weight decay
 Introduce noise
 $y = \sum\limits_i{x_iw_i}$
Consider $L^2$ weight decay and Gaussian noise
Let $x_i^{noisy} = x_i+\epsilon_i$ where $\epsilon_i$ is sampled from $N(0,\sigma^{2}_{i})$
 Then $y^{noisy} = y + \sum\limits_{i}{w_i\epsilon_i}$
$$\begin{aligned}E\bigl((y^{noisy}-\hat{y})^2\bigr) & = E\Bigl(\bigl((y - \hat{y}) + \sum\limits_{i}{w_i\epsilon_i}\bigr)^2\Bigr)\\
& = (y - \hat{y})^2 + E\bigl(2(y-\hat{y})\sum\limits_{i}{w_i\epsilon_i}\bigr) + E\bigl((\sum\limits_{i}{w_i\epsilon_i})^2\bigr)\\
& = (y - \hat{y})^2 + E(\sum\limits_{i}{w_i^2\epsilon_i^2})\\
& = (y - \hat{y})^2 + \sum\limits_{i}{w_i^2\sigma_i^2}\end{aligned}$$ The cross term vanishes because the $\epsilon_i$ are independent with zero mean. So $\sigma_i^2 w_i^2$ is equivalent to an $L^2$ penalty in this simple network.
 If the network is more complicated, Gaussian noise usually works better than $L^2$ weight penalty.
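A Monte Carlo check of the equivalence derived above, on a made-up three-input linear unit (weights, inputs, noise levels, and target are all arbitrary):

```python
import random

random.seed(0)
w = [0.5, -1.0, 2.0]                 # made-up weights
x = [1.0, 2.0, -0.5]                 # made-up inputs
sigma = [0.3, 0.1, 0.2]              # per-input noise standard deviations
y_hat = 1.0                          # made-up target

y = sum(wi * xi for wi, xi in zip(w, x))
# Analytic value from the derivation: (y - y_hat)^2 + sum_i w_i^2 sigma_i^2
analytic = (y - y_hat) ** 2 + sum(wi ** 2 * si ** 2 for wi, si in zip(w, sigma))

# Empirical estimate of E((y_noisy - y_hat)^2) over many noise draws.
n = 100000
total = 0.0
for _ in range(n):
    y_noisy = sum(wi * (xi + random.gauss(0.0, si))
                  for wi, xi, si in zip(w, x, sigma))
    total += (y_noisy - y_hat) ** 2
empirical = total / n
print(analytic, empirical)
```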
 Combine models (cooperation)
Mathematical analysis
 Combine models
 $\bar{y} = \frac{1}{n} \sum\limits_{i = 1}^{n}{y_i}$
If we use the squared error as the cost, the cost is $(\bar{y}-\hat{y})^2$, and so is its expectation.
 Pick a predictor at random
If we use the squared error, then the expectation of the cost is
$$\begin{aligned}E\big((\hat{y}-y_i)^2\big) & = E\Big(\big((\hat{y} - \bar{y}) - (y_i - \bar{y})\big)^2\Big)\\
& = E\big((\hat{y} - \bar{y})^2 + (y_i - \bar{y})^2 - 2(\hat{y} - \bar{y})(y_i - \bar{y})\big)\\
& = (\bar{y} - \hat{y})^2 + E\big((y_i-\bar{y})^2\big) - 2(\hat{y} - \bar{y})E(y_i-\bar{y})\\
& = (\bar{y} - \hat{y})^2 + E\big((y_i-\bar{y})^2\big) - \frac{2}{n}(\hat{y} - \bar{y})\sum\limits_{i = 1}^{n}{(y_i - \bar{y})}\\
& = (\bar{y} - \hat{y})^2 + E\big((y_i-\bar{y})^2\big)\end{aligned}$$ 
Comparison
Apparently the expected cost of picking a random predictor is no less than that of the combined model, since $E\big((y_i-\bar{y})^2\big) \ge 0$.
The more the outputs of the models differ, the bigger the advantage of combining them over picking one at random.
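A quick numerical illustration of this comparison, with made-up predictions drawn around a made-up target:

```python
import random

random.seed(0)
y_hat = 1.0                                            # the target
preds = [random.gauss(1.0, 0.5) for _ in range(10)]    # made-up model outputs
y_bar = sum(preds) / len(preds)

avg_cost = (y_bar - y_hat) ** 2                        # cost of the combined model
# Expected cost of picking one predictor uniformly at random.
rand_cost = sum((p - y_hat) ** 2 for p in preds) / len(preds)
# Spread of the predictors around their mean.
spread = sum((p - y_bar) ** 2 for p in preds) / len(preds)
print(avg_cost, rand_cost)
```

The identity from the derivation holds exactly: `rand_cost == avg_cost + spread`, so the random pick can never beat the average in expectation.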
 Ways to make predictors differ
 Use lots of different kinds of models
 Bagging: Train different models on different subsets of the data.
 Boosting: Train a sequence of low capacity models. Weight the training cases differently for each model in the sequence.
 Boosting upweights cases that previous models got wrong.
It focuses the computational resources on modelling the tricky cases.
 Mixtures of Experts (specialization)
 Cluster the training cases into subsets based on the relationship between input and output so that each cluster can be well modelled by an expert.
 Averaging models like what we mentioned above causes cooperation, not specialization.
Cost function that encourages cooperation: $C = (\hat{y}-\bar{y})^2$
Cost function that encourages specialization: $C = \sum\limits_{i}{p_i(\hat{y}-y_i)^2}$ where $p_i$ is the probability of picking expert $i$, assigned by a manager that is a separately trained model.
Principal Components Analysis
This projects $n$-dimensional data into an $m$-dimensional subspace in which the data have the most variance.
We reconstruct by using the mean value on the $n-m$ directions that are not represented. The reconstruction error is the sum, over all these unrepresented directions, of the squared differences of the datapoint from the mean.
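A sketch of projection and reconstruction with NumPy on synthetic data; the principal directions come from the SVD of the centered data, and all sizes and parameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 made-up points in 3-D that mostly vary along one direction.
direction = np.array([1.0, 2.0, 0.5])
data = rng.normal(size=(200, 1)) * direction \
       + rng.normal(scale=0.1, size=(200, 3))

mean = data.mean(axis=0)
centered = data - mean
# Principal directions = right singular vectors of the centered data.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
m = 1
components = vt[:m]                      # top-m directions

# Project onto the m-dimensional subspace and reconstruct; the
# discarded n - m directions are implicitly replaced by the mean.
codes = centered @ components.T
recon = codes @ components + mean
err = ((data - recon) ** 2).sum(axis=1).mean()
print(err)
```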
 Using backpropagation to implement PCA inefficiently
 Try to make the output be the same as the input in a network with a central bottleneck. The activities of the hidden units in the bottleneck form an efficient code.
If the hidden and output layers are linear, it will learn hidden units that are a linear function of the data and minimize the squared reconstruction error, which is exactly what PCA does.
 Neural networks
 Convolutional Neural Networks
 Weight sharing
If a feature is useful in some locations during training, detectors for that feature will be available in all locations during testing
 Pooling
 Get a small amount of translational invariance at each level by averaging some neighbouring replicated detectors to give a single output to the next level.
 Advantage: It reduces the inputs to the next layer, allowing us to have many more different feature maps.
Disadvantage: We lose information about the precise position of things
 Recurrent neural networks
Comparison with other Hidden State Models
 Linear Dynamical Systems
Let $x$ denote the state vector and $A$ denote the transition matrix
 The variation can take two forms:
Flow: $x$ varies continuously with time
 $\frac{d x(t)}{d t} = A \cdot x(t)$
 Mapping: $x$ varies in discrete steps
 $x_{m+1} = A \cdot x_m$
The distribution over the hidden state given the data so far can be computed efficiently
The hidden states follow a Gaussian distribution
 Hidden Markov Models
 There are a number of things called states. And the system is always in exactly one of those states.
 Transitions between states are stochastic and controlled by a transition matrix
HMMs have efficient algorithms for inference and learning
The hidden states follow a discrete distribution
An HMM with $n$ states can only store $\log (n)$ bits of information, because at any moment it must be in exactly one of its $n$ states
 RNNs
 RNNs can remember much more information
 RNNs can potentially learn to implement lots of small programs that each capture a nugget of knowledge and run in parallel, interacting to produce very complicated effects
 RNNs are much harder to train
Training RNNs with backpropagation
Unroll the RNN through time
Constrain the weights unrolled from the same original weights to stay tied
 Train as a normal network
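These steps can be sketched for a tiny linear RNN with one shared hidden-to-hidden weight. Backpropagation through time sums the gradient contributions of that shared weight across the unrolled steps, and a finite-difference check confirms the result (all numbers below are arbitrary):

```python
def forward(w_h, w_x, xs, h0=0.0):
    # Same weights w_h, w_x are reused at every time step (tied weights).
    hs = [h0]
    for x in xs:
        hs.append(w_h * hs[-1] + w_x * x)
    return hs

def loss(w_h, w_x, xs, target):
    # Squared error on the final hidden state only.
    return 0.5 * (forward(w_h, w_x, xs)[-1] - target) ** 2

def bptt_grad_wh(w_h, w_x, xs, target):
    hs = forward(w_h, w_x, xs)
    delta = hs[-1] - target          # dE/dh_T
    grad = 0.0
    for t in range(len(xs), 0, -1):
        grad += delta * hs[t - 1]    # contribution of the shared w_h at step t
        delta *= w_h                 # propagate the error one step back in time
    return grad

xs, target, w_h, w_x = [1.0, -0.5, 2.0], 3.0, 0.7, 1.3
analytic = bptt_grad_wh(w_h, w_x, xs, target)
eps = 1e-6
numeric = (loss(w_h + eps, w_x, xs, target)
           - loss(w_h - eps, w_x, xs, target)) / (2 * eps)
print(analytic, numeric)
```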
 Reasons why RNNs are hard to train:
In the forward pass, we use squashing functions (like the logistic) to prevent the activity vectors from exploding
In the backward pass, the pass is completely linear, so gradients are very likely to vanish or explode.
 Four effective ways to learn an RNN
Long Short-Term Memory
Hessian-Free Optimization
 Echo State Networks
Fix the input-to-hidden connections and the hidden-to-hidden connections at random values and only learn the hidden-to-output connections.
 It’s very important to set these connections very carefully.
Key idea: a big random expansion of the input vector can help.
 ESNs demonstrate that it’s very important to initialize weights sensibly.
 Good initialization with momentum
 RNNs with multiplicative connections
 Reasons why we need multiplicative connections
If we consider an RNN as an automaton, the hidden state at time $t$ should depend on the conjunction of the input at time $t$ and the hidden state at time $t-1$.
Instead of using the inputs to the RNN to provide additive extra input to the hidden units, we can use the current input to choose the whole hidden-to-hidden weight matrix. But this approach requires far too many parameters.
 Details
 Let’s consider this model
(picture from https://www.coursera.org/learn/neural-networks/home/welcome)
 Then, multiplicative connections can be written as $c_f = (b^{\mathsf{T}}w_f)(a^{\mathsf{T}}u_f)v_f$ or $c_f=(b^{\mathsf{T}}w_f)(u_fv_f^{\mathsf{T}})a$ where $(b^{\mathsf{T}}w_f)$ can be considered as scalar coefficients and $(u_fv_f^{\mathsf{T}})$ can be considered to be the transition matrix with rank $1$.
In fact, if there are $n$ hidden units, there are $n$ factors, and the description above shows one factor. All these factors together let the input determine the transition matrix.
 Hopfield Nets
Energy function: $E = -\sum\limits_i{s_ib_i}-\sum\limits_{i<j}{s_is_jw_{ij}}$.
 Hopfield proposed that memories could be energy minima of a neural net.
 Storing a memory using Hopfield’s storage rule
If the units are binary threshold units: $\Delta w_{ij} = 4(s_i-\frac{1}{2})(s_j-\frac{1}{2})$.
If the units have activities of $1$ and $-1$: $\Delta w_{ij}=s_is_j$
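A tiny sketch of storing and recalling one memory with $\pm 1$ units using the rule above; the pattern and net size are arbitrary:

```python
def train_hopfield(patterns):
    # Storage rule for +-1 units: delta w_ij = s_i * s_j, no self-connections.
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for s in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += s[i] * s[j]
    return w

def settle(w, state, sweeps=20):
    # Repeated threshold updates lower the energy until a minimum is reached.
    n = len(state)
    state = list(state)
    for _ in range(sweeps):
        for i in range(n):
            total = sum(w[i][j] * state[j] for j in range(n))
            state[i] = 1 if total >= 0 else -1
    return state

pattern = [1, -1, 1, -1, 1, -1]          # the memory to store
w = train_hopfield([pattern])
corrupted = list(pattern)
corrupted[0] = -corrupted[0]             # flip one unit
print(settle(w, corrupted))              # settles back to an energy minimum
```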
 Dealing with spurious minima in Hopfield Nets
 Let the net settle from a random initial state and then do unlearning.
 Crick and Mitchison proposed unlearning as a model of what dreams are for.
Storing a memory using the pseudo-likelihood method
 Use the perceptron convergence procedure to train each unit to have the correct state given the states of all the other units in that vector.
 Boltzmann Machines
 Can be regarded as Hopfield nets with hidden units.
 Instead of using the net to store memories, use it to construct interpretations of sensory input.
 The input is represented by the visible units.
 The interpretation is represented by the states of the hidden units.
 The badness of the interpretation is represented by the energy.
 Use Simulated Annealing to make Boltzmann Machines escape from local minima.
$P(s_i=1)=\frac{1}{1+e^{-\frac{\Delta E_i}{T}}}$ where $T$ is the temperature.
Raising the noise level is equivalent to decreasing all the energy gaps between configurations
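The transfer rule above as a small function, evaluated at a few illustrative temperatures; higher $T$ pushes the probability toward $0.5$ (more noise), lower $T$ makes the unit nearly deterministic:

```python
import math

def p_on(delta_E, T):
    # P(s_i = 1) = 1 / (1 + exp(-delta_E / T)); delta_E is the energy gap.
    return 1.0 / (1.0 + math.exp(-delta_E / T))

for T in [4.0, 1.0, 0.25]:               # illustrative temperatures
    print(T, p_on(1.0, T))
```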
 Thermal equilibrium
 The thing that settles down is the probability distribution over configurations
 Using energies to define probabilities
Let $v$ denote the configuration on the visible units and $h$ the configuration on the hidden units.
$P(v, h) = \frac{e^{-E(v, h)}}{\sum\limits_{g}{\sum\limits_{u}{e^{-E(u, g)}}}}$
$P(v) = \frac{\sum\limits_h{e^{-E(v, h)}}}{\sum\limits_{g}{\sum\limits_{u}{e^{-E(u, g)}}}}$
 We can use MCMC to sample the probabilities.
The Boltzmann Machine learning algorithm
The goal of learning: We are trying to maximize the product of the probabilities that the Boltzmann machine assigns to the binary vectors in the training set. This is equivalent to maximizing the sum of the log probabilities that the Boltzmann machine assigns to the training vectors.
 Everything that one weight needs to know about the other weights and the data is contained in the difference of two correlations.
$$\frac{\partial \log P(v)}{\partial w_{ij}}=E_v(s_is_j)-E_{m}(s_is_j)$$
where $E_v$ denotes the expected value of the product of states at thermal equilibrium when $v$ is clamped on the visible units, and $E_m$ denotes the expected value of the product of states at thermal equilibrium with no clamping.

Positive phase: find hidden configurations that work well with $v$ and lower their energies.
Negative phase: find the joint configurations that are the best competitors and raise their energies.
 Use MCMC to collect the statistics.

Restricted Boltzmann Machines
 Definition: Only one hidden layer and no connections between hidden units.
 In an RBM it only takes one step to reach thermal equilibrium when the visible units are clamped
$$P(h_j=1)=\frac{1}{1+e^{-(b_j+\sum\limits_{i \in vis}{v_iw_{ij}})}}$$
 So we can quickly get the exact value of $E_v(v_ih_j)$
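A sketch of that one-step computation for a made-up RBM with 3 visible and 2 hidden units (weights, biases, and the clamped visible vector are all arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, b_hid):
    # With the visible units clamped, the hidden units are conditionally
    # independent, so one parallel update reaches thermal equilibrium.
    return [sigmoid(b_hid[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b_hid))]

# Hypothetical 3-visible, 2-hidden RBM.
W = [[0.5, -0.3],
     [0.1,  0.8],
     [-0.2, 0.4]]
b_hid = [0.0, -0.5]
v = [1, 0, 1]
probs = hidden_probs(v, W, b_hid)
print(probs)
# The exact clamped statistic E_v(v_i * h_j) is then just v_i * P(h_j = 1).
```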
 Contrastive divergence (CD)
(picture from https://www.coursera.org/learn/neural-networks/home/welcome)
 Learning layers of features by stacking RBMs
 Training steps
 First train a layer of features that receive input directly from the data.
 Then treat the activations of the trained features as if they were the original data and learn features of features in a second hidden layer.
 Then do it again.
 Generating steps
 Get an equilibrium sample from the top level RBM by performing alternating Gibbs sampling for a long time.
Perform a top-down pass to get states for all the other layers.
Modeling real-valued data with an RBM
 Use stepped sigmoid units
 Belief Networks
 Disadvantage of backpropagation
 Requires labeled data.
 The learning time doesn’t scale well.
 It can get stuck in poor local optima.
 Overcoming the limitations of backpropagation by using unsupervised learning
 Keep the efficiency and simplicity of using a gradient method for adjusting the weights, but use it for modelling the structure of the sensory input.
 Adjust the weights to maximize the probability that a generative model would have generated the sensory input.
 Definition: a directed acyclic graph composed of stochastic variables.
 Sigmoid Belief Networks (SBN)
 Definition: Belief Networks with logistic neurons.
 Naive learning algorithm
 Get the posterior distribution over hidden states given the observed data. We can use MCMC to do this.
 Get an unbiased sample from the posterior distribution.
 For each unit, maximize the log probability that its binary state in the sample from the posterior would be generated by the sampled binary states of its parents.
$\Delta w_{ji}=\epsilon s_j\big(s_i-P(s_i=1)\big)$
 The wakesleep algorithm
We wrongly assume that the posterior over hidden configurations factorizes into a product of distributions for each separate hidden unit. Although this assumption is wrong, learning in an SBN still works.
Wake phase: Use the recognition weights to perform a bottom-up pass; this trains the generative weights to reconstruct the activities in each layer from the layer above.
Sleep phase: Use the generative weights to generate samples from the model; this trains the recognition weights to reconstruct the activities in each layer from the layer below.
 Strong AI
Two theories of concepts
 The feature theory: A concept is a set of semantic features
 The structuralist theory: The meaning of a concept lies in its relationships to other concepts