Newton’s method

Taylor series

A Taylor series is a representation of a function as an infinite sum of terms that are calculated from the values of the function's derivatives at a single point:

$$f(x) = \sum\limits_{n=0}^{\infty}{\frac{f^{(n)}(a)}{n!}(x-a)^n}$$

where $f^{(n)}(a)$ denotes the $n^{th}$ derivative of $f$ evaluated at the point $a$.

And here’s a very intuitive example:

(Figure: the exponential function $e^x$, in blue, and the sum of the first $n + 1$ terms of its Taylor series at $0$, in red.)

Newton’s method


In calculus, Newton’s method is an iterative method for finding the roots of a differentiable function $f$.

In machine learning, we apply Newton's method to the derivative $f'$ of the cost function $f$.

One-dimensional version

In the one-dimensional problem, Newton's method attempts to construct a sequence $\{x_n\}$ from an initial guess $x_0$ that converges towards some value $\hat{x}$ satisfying $f'(\hat{x})=0$. In other words, we are trying to find a stationary point of $f$.

Consider the second-order Taylor expansion $f_T(x)$ of $f$ around $x_n$:

$$f_T(x)=f_T(x_n+\Delta x) \approx f(x_n) + f'(x_n) \Delta x + \frac{1}{2}f''(x_n) \Delta x^2$$

So now, we are trying to find a $\Delta x$ that sets the derivative of this last expression with respect to $\Delta x$ equal to zero, which means

$$\frac{\mathrm{d}}{\mathrm{d}\,\Delta x}\Bigl(f(x_n) + f'(x_n) \Delta x + \frac{1}{2}f''(x_n) \Delta x^2\Bigr) = f'(x_n)+f''(x_n)\Delta x = 0$$

Clearly $\Delta x = -\frac{f'(x_n)}{f''(x_n)}$ is the solution, so we can hope that $x_{n+1} = x_n+\Delta x = x_n - \frac{f'(x_n)}{f''(x_n)}$ will be closer to a stationary point $\hat{x}$.
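
Here is a minimal sketch of this one-dimensional update; the toy cost $f(x) = x^4 - 3x^2 + x$ and the starting point are my own illustrative assumptions, not anything from the derivation above.

```python
# A minimal sketch of the update x_{n+1} = x_n - f'(x_n)/f''(x_n).
def newton_1d(df, d2f, x0, n_iters=20, tol=1e-10):
    x = x0
    for _ in range(n_iters):
        step = df(x) / d2f(x)     # Delta x = -f'(x)/f''(x), applied below
        x -= step
        if abs(step) < tol:       # stop once the update becomes negligible
            break
    return x

df  = lambda x: 4 * x**3 - 6 * x + 1   # f'(x)  for the toy f(x) = x^4 - 3x^2 + x
d2f = lambda x: 12 * x**2 - 6          # f''(x)

print(newton_1d(df, d2f, x0=2.0))      # converges to a stationary point of f
```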

High-dimensional version

Again, consider the second-order Taylor expansion $f_T(x)$ of $f$ around $x_n$:

$$f_T(x)=f_T(x_n+\Delta x) \approx f(x_n) + \Delta x^{\mathsf{T}} \nabla f(x_n) + \frac{1}{2} \Delta x^{\mathsf{T}} {\rm H} f(x_n) \Delta x$$

where ${\rm H} f(x)$ denotes the Hessian matrix of $f(x)$ and $\nabla f(x)$ denotes the gradient.

Setting the gradient of this expansion with respect to $\Delta x$ to zero gives $\nabla f(x_n) + {\rm H}f(x_n)\,\Delta x = 0$, so $\Delta x = -[{\rm H}f(x_n)]^{-1}\nabla f(x_n)$ should be a good choice.


Inverting an $n \times n$ matrix takes $O(n^3)$ time, so when the parameters are high-dimensional the algorithm becomes quite slow.


The reason steepest descent goes wrong is that the gradient for one weight gets messed up by simultaneous changes to all the other weights.

The Hessian matrix captures the sizes of these interactions, so Newton's method can take them into account instead of ignoring them.
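
Here is a minimal NumPy sketch of the multidimensional step $\Delta x = -[{\rm H}f(x_n)]^{-1}\nabla f(x_n)$; the quadratic toy cost is my own assumption, and in practice one solves the linear system rather than forming the inverse.

```python
# A minimal sketch of one Newton step on an assumed quadratic toy cost.
import numpy as np

def newton_step(grad, hess, x):
    # Solving H * delta = -grad is cheaper and more stable than inverting H.
    return x + np.linalg.solve(hess(x), -grad(x))

# Toy cost f(x) = 0.5 * x^T A x - b^T x, whose minimum satisfies A x = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b   # gradient of the toy cost
hess = lambda x: A           # Hessian of the toy cost (constant here)

x = newton_step(grad, hess, np.zeros(2))
print(x, np.linalg.solve(A, b))   # Newton reaches the minimum in one step for a quadratic
```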


Auto Encoder vs. PCA (again)

I tried using an Auto Encoder and PCA to do dimensionality reduction.
The dataset is from House Prices: Advanced Regression Techniques.
I transformed the data from 79D to 30D, and then reconstructed it back to 79D.
Here’s the result:

As you can see, PCA actually did a better job.
My conclusion for now is that an Auto Encoder is good at capturing several different patterns, while PCA is the better choice when the data follow a single pattern.
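
For reference, here is a minimal sketch of the PCA half of this comparison; the random matrix is only a stand-in for the preprocessed 79-D House Prices features, whose loading and cleaning are omitted.

```python
# A minimal sketch: reduce to 30-D with PCA and reconstruct back to 79-D.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 79)              # placeholder for the real 79-D features

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=30)
Z = pca.fit_transform(X_std)              # 79-D -> 30-D
X_rec = pca.inverse_transform(Z)          # 30-D -> reconstructed 79-D

print(np.mean((X_std - X_rec) ** 2))      # reconstruction error to compare with the autoencoder
```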

Notes on “Deep Learning, Yoshua Bengio et al. (2015)”

Conventional machine-learning

  • Conventional machine-learning techniques were limited in their ability to process natural data in their raw form.

  • Problems such as image and speech recognition require the input-output function to be insensitive to irrelevant variations of the input while being very sensitive to particular minute variations.

  • This is why conventional machine-learning models require a good feature extractor designed by humans. Neural networks, however, can learn these features automatically using a general-purpose learning procedure, which is the key advantage of deep learning.

Activation Functions

  • In past decades, neural nets used smoother non-linearities, such as $\tanh(z)$ or Sigmoid function, but ReLU typically learns much faster in networks with many layers.

Learning Procedure

  • The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer.

  • Recent theoretical and empirical results strongly suggest that local minima are not a serious problem in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero and the surface curves up in most dimensions and curves down in the remainder. The analysis seems to show that saddle points with only a few downward curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.

  • For smaller data sets, unsupervised pre-training helps to prevent overfitting. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.


Convolutional Neural Networks

  • CNNs are designed to process data that come in the form of multiple arrays.

  • The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. All units in a feature map share the same filter bank.

  • While the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one.

  • Many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones.

  • The pooling allows representations to vary very little when elements in the previous layer vary in position and appearance.

  • The convolutional and pooling layers in CNNs are directly inspired by classic notions of simple cells and complex cells in visual neuroscience.

The future of deep learning

  • Unsupervised learning had a catalytic effect in reviving interest in deep learning. We expect unsupervised learning to become far more important in the longer term.

Views from Ge Li (not from this passage)

Field                        | Technology | Versatility
---------------------------- | ---------- | -----------
Computer Vision              | developed  | bad
Natural Language Processing  | developing | bad
Speech Recognition           | developed  | good

Summary of “Neural Networks for Machine Learning by Geoffrey Hinton”

  • Activation function
    • Logistic neurons
      • $y=\frac{1}{1+e^{-z}}$
      • $\frac{dy}{dz}=y(1-y)$
    • The softmax function
  • General tricks
    • Dropout
      • A useful way to regularize the net.
      • Details: randomly remove some hidden units in a layer for each training example
      • Advantage: It stops hidden units from relying too much on other hidden units
    • Tricks to speed up the learning procedure (a small NumPy sketch of the momentum and rmsprop updates appears at the end of these notes)
      1. Use momentum
        • $v(t) = \alpha v(t-1)-\varepsilon \frac{\partial E}{\partial w}(t)$
        • $\begin{aligned}\Delta w(t) & = v(t)\\
          & = \alpha v(t - 1) - \varepsilon \frac{\partial E}{\partial w}(t)\\
          & = \alpha \Delta w(t - 1) - \varepsilon \frac{\partial E}{\partial w}(t) \end{aligned}$
      2. Use separate adaptive learning rates for each parameter
        • $\Delta w_{i,j} = -\varepsilon g_{i,j} \frac{\partial E}{\partial w_{i,j}}$
        • $g_{i, j}(t) = \begin{cases}g_{i,j}(t - 1) + \delta & \text{if } \frac{\partial E}{\partial w_{i,j}}(t)\frac{\partial E}{\partial w_{i,j}}(t - 1) > 0 \\ g_{i,j}(t-1) \times (1 - \delta) & \text{otherwise}\end{cases}$
        • $\delta$ is a hyperparameter set by hand ($0.05$ is usually a good choice)
        • This method ensures that local learning rates increase when the gradients don’t change sign and decay rapidly when oscillations start
      3. Use rmsprop
        • $F(w, t) = \delta F(w, t – 1) + (1 – \delta)(\frac{\partial E}{\partial w})^2$
        • $\Delta w = -\frac{\varepsilon}{\sqrt{F(w, t)}} \cdot \frac{\partial E}{\partial w}$
        • $\delta$ is a hyperparameter set by hand ($0.9$ is usually a good choice)
    • Preventing overfitting
      1. Get more data
      2. Average many different models
      3. Use Bayesian Approach
        • Bayes Theorem: $P(A|B)=\frac{P(A)P(B|A)}{P(B)}$
          • Deriving Bayes’ Theorem: $P(A|B)=\frac{P(AB)}{P(B)}=\frac{P(A)P(B|A)}{P(B)}$
          • Understanding
            • Let $P(A)$ be the prior probability, assuming that we already know the prior distribution.
            • Then we get a new data $B$, so we can compute $P(B|A)$ as we already have prior distribution.
            • After that, we use $P(B)$ to renormalize, where $P(B) = \int{P(B|A)P(A)\,\mathrm{d}A}$.
            • Finally, our posterior distribution $P(A|B)$ is $P(A|B)=\frac{P(A)P(B|A)}{P(B)}$.
        • The Bayesian interpretation of weight decay
          • Let $W$ be the set of weights in a network. We add Gaussian noise $\epsilon \sim N(0, \sigma_B^2)$ to the output of the network; let $P(A|W)$ denote the probability that we get the correct answer $A$ after adding the noise to the output $B$ with weights $W$.
          • Considering the Maximum A Posteriori (MAP) estimate, we are trying to maximize $P(W|A) = \frac{P(W)P(A|W)}{P(A)}$, which is the same as minimizing $-\log P(W|A) = -\log P(W) - \log P(A|W) + \log P(A)$. As $\log P(A)$ has nothing to do with $W$, we can ignore that term. So we define our cost as $C = -\log P(W) - \log P(A|W)$, and we try to minimize this cost $C$.
          • Let us regularise weights $W$ by imposing the Gaussian distribution $N(0,\sigma_w^2)$, then $C = \frac{1}{2\sigma_w^2}\sum\limits_{i}{w_i^2}+\frac{1}{2\sigma_B^2}\sum\limits_c{(y_c-\hat{y})^2} + k$ where $k$ is the sum of some constant terms.
          • If we multiply that equation through by $2\sigma_{B}^2$ and ignore the constant $k$, we get a new cost function $C' = \sum\limits_c{(y_c-\hat{y})^2} + \frac{\sigma_B^2}{\sigma_w^2}\sum\limits_{i}{w_i^2}$.
          • As we can see, the weight-decay coefficient here is determined by the ratio of two variances, so it should not be an arbitrary hack.
          • See more at $L^2$ regularization is equivalent to Gaussian Prior and Bayesian Methods for Neural Networks
        • Empirical Bayes
          • As we mentioned above, if we want $L^2$ weight decay to work well, we need to know the variance of the weight prior $\sigma_w$ and the noise variance $\sigma_B$
          • Steps
            • Start with guesses for both the noise variance and the weight prior variance.
            • While not yet bored
              • Do some learning using the ratio of the variances as the weight penalty coefficient.
              • Reset the noise variance to be the variance of the residual errors.
              • Reset the weight prior variance to be the variance of the distribution of the actual learned weights.
            • Go back to the start of this loop.
          • We don’t need a validation set to choose the weight penalty coefficient.
        • Full Bayesian Learning
          • Instead of trying to find the best single setting of parameters we compute the full posterior distribution over all possible parameter settings.
          • This approach makes overfitting disappear, because there's no reason why the amount of data should influence our prior beliefs about the complexity. So this approach allows us to use complicated models even when we do not have much data.
          • This approach does not involve any gradient descent so there are no local optimum issues.
        • Practical Full Bayesian Learning
          • Use Markov Chain Monte Carlo to approximate the posterior probability
          • We can use the sampling noise from mini-batch as the noise which the MCMC method needs.
      4. Use a model that has the right capacity
        1. Limit the number of hidden layers and the number of units per layer
        2. Use early stopping
        3. Use weight-decay
        4. Introduce noise
          • $y = \sum\limits_i{x_iw_i}$
          • Consider $L^2$ weight-decay and Gaussian noise
          • Let $x_i^{noisy}$ equal $x_i+\epsilon_i$ where $\epsilon_i$ is sampled from $N(0,\sigma^{2}_{i})$
          • Then $y^{noisy} = y + \sum\limits_{i}{w_i\epsilon_i}$

          $$\begin{aligned}E\bigl((y^{noisy}-\hat{y})^2\bigr) & = E\Bigl(\bigl((y - \hat{y}) + \sum\limits_{i}{w_i\epsilon_i}\bigr)^2\Bigr)\\
          & = (y - \hat{y})^2 + E\bigl(2(y-\hat{y})\sum\limits_{i}{w_i\epsilon_i}\bigr) + E\bigl((\sum\limits_{i}{w_i\epsilon_i})^2\bigr)\\
          & = (y - \hat{y})^2 + E(\sum\limits_{i}{w_i^2\epsilon_i^2})\\
          & = (y - \hat{y})^2 + \sum\limits_{i}{w_i^2\sigma_i^2}\end{aligned}$$

          • So $\sigma_i^2$ is equivalent to an $L^2$ penalty in this simple network.
          • If the network is more complicated, Gaussian noise usually works better than $L^2$ weight penalty.
    • Combine models (cooperation)
      • Mathematics analysis
        • Combine models
          • $\bar{y} = \frac{1}{n} \sum\limits_{i = 1}^{n}{y_i}$
          • If we use the squared error as the cost, then the cost is $(\bar{y}-\hat{y})^2$, and so is its expectation.
        • Pick a predictor at random
          • If we use the squared error as the cost, then the expectation of the cost should be

          $$\begin{aligned}E\big((\hat{y}-y_i)^2\big) & = E\Big(\big((\hat{y} - \bar{y}) - (y_i - \bar{y})\big)^2\Big)\\
          & = E\big((\hat{y} - \bar{y})^2 + (y_i - \bar{y})^2 - 2(\hat{y} - \bar{y})(y_i - \bar{y})\big)\\
          & = (\bar{y} - \hat{y})^2 + E\big((y_i-\bar{y})^2\big) - 2(\hat{y} - \bar{y})E(y_i-\bar{y})\\
          & = (\bar{y} - \hat{y})^2 + E\big((y_i-\bar{y})^2\big) - \frac{2}{n}(\hat{y} - \bar{y})\sum\limits_{i = 1}^{n}{(y_i - \bar{y})}\\
          & = (\bar{y} - \hat{y})^2 + E\big((y_i-\bar{y})^2\big)\end{aligned}$$

        • Comparison

          • Clearly the expected cost of picking a random predictor is no less than that of combining the models, since $E\big((y_i-\bar{y})^2\big) \ge 0$.
          • The more the models' predictions differ, the bigger the advantage of combining them over picking one at random.
      • Ways to make predictors differ
        • Use lots of different kinds of models
        • Bagging: Train different models on different subsets of the data.
        • Boosting: Train a sequence of low capacity models. Weight the training cases differently for each model in the sequence.
          • Boosting up-weights cases that previous models got wrong.
          • It focuses the computational resources on modelling the tricky cases.
    • Mixtures of Experts (specialization)
      • Cluster the training cases into subsets based on the relationship between input and output so that each cluster can be well modelled by an expert.
      • Averaging models like what we mentioned above causes cooperation, not specialization.
        • Cost function that encourages cooperation: $C = (\hat{y}-\bar{y})^2$
        • Cost function that encourages specialization: $C = \sum\limits_{i}{p_i(\hat{y}-y_i)^2}$ where $p_i$ is the probability of picking expert $i$, assigned by the manager (a separately trained gating model).
    • Principal Components Analysis
      • This projects n-dimensional data into an m-dimensional subspace in which the data have the most variance.
      • We reconstruct by using the mean value on the $n-m$ directions that are not represented. The reconstruction error is the sum over all these unrepresented directions of the squared differences of the datapoint from the mean.
      • Using back-propagation to implement PCA inefficiently
        • Try to make the output be the same as the input in a network with a central bottleneck. The activities of the hidden units in the bottleneck form an efficient code.
        • If the hidden and output layers are linear, it will learn hidden units that are a linear function of the data and minimize the squared reconstruction error, which is exactly what PCA does.
  • Neural networks
    • Convolutional Neural Networks
      • Weight sharing
          • If a feature is useful in some locations during training, detectors for that feature will be available in all locations during testing
      • Pooling
        • Get a small amount of translational invariance at each level by averaging some neighbouring replicated detectors to give a single output to the next level.
        • Advantage: It reduces the inputs to the next layer, allowing us to have many more different feature maps.
        • Disadvantage: We lose information about the precise positions of things
    • Recurrent neural networks
      • Comparing with other Hidden Sate Models
        1. Linear Dynamical Systems
          • Let $x$ denote the state vector and $A$ denote the transition matrix
          • The variation can take two forms:
            1. Flow: $x$ varies continuously with time
              • $\frac{d x(t)}{d t} = A \cdot x(t)$
            2. Mapping: $x$ varies in discrete steps
              • $x_{m+1} = A \cdot x_m$
          • The distribution over the hidden state given the data so far can be computed efficiently
          • The hidden states follow Gaussian distribution
        2. Hidden Markov Models
          • There are a number of things called states. And the system is always in exactly one of those states.
          • Transitions between states are stochastic and controlled by a transition matrix
          • HMMs have efficient algorithms for inference and learning
          • The hidden states follow discrete distribution
          • An HMM with $n$ states can only store $\log (n)$ bits of information, because at any time it is in exactly one of its $n$ states
        3. RNNs
          • RNNs can remember much more information
          • RNNs can potentially learn to implement lots of small programs that each capture a nugget of knowledge and run in parallel, interacting to produce very complicated effects
          • RNNs are much harder to train
      • Training the RNNs with backpropagation
        • Expand RNN in time sequence
        • Constrain the weights that are expanded from the same original weight to remain equal
        • Train as a normal network
      • Reasons why RNNs are hard to train:
        • In the forward pass, we use squashing functions (like the logistic) to prevent the activity vectors from exploding
        • In the backward pass, the pass is completely linear, so it’s very likely to vanish or explode.
      • Four effective ways to learn an RNN
        1. Long Short Term Memory
        2. Hessian Free Optimization
        3. Echo State Networks
          • Fix the input-to-hidden connections and the hidden-to-hidden connections at random values and only learn the hidden-to-output connections.
          • It's very important to set the scale of these random connections carefully.
          • Key idea: a big random expansion of the input vector can help.
          • ESNs demonstrate that it’s very important to initialize weights sensibly.
        4. Good initialization with momentum
      • RNNs with multiplicative connections
        • Reasons why we need multiplicative connections
          • If we consider an RNN as an automaton, then the hidden state of time $t$, should depend on the conjunction of input of time $t$ and hidden state of time $t – 1$.
          • Instead of using the inputs to the RNN to provide additive extra input to the hidden units, we could use the current input to choose the whole hidden-to-hidden weight matrix. But this approach requires far too many parameters.
        • Details
          • Let’s consider this model

          multiplicative connections


          • Then the multiplicative connections can be written as $c_f = (b^{\mathsf{T}}w_f)(a^{\mathsf{T}}u_f)v_f$ or $c_f=(b^{\mathsf{T}}w_f)(v_fu_f^{\mathsf{T}})a$, where $(b^{\mathsf{T}}w_f)$ can be considered a scalar coefficient and $(v_fu_f^{\mathsf{T}})$ a rank-$1$ transition matrix.
          • In fact, if there are $n$ hidden units, there are $n$ factors, and the description above shows one factor. All these factors together let the input determine the transition matrix.
    • Hopfield Nets
      • Energy function: $E = -\sum\limits_i{s_ib_i}-\sum\limits_{i<j}{s_is_jw_{ij}}$.
      • Hopfield proposed that memories could be energy minima of a neural net.
      • Storing a memory using Hopfield’s storage rule
        • If the units are binary threshold units: $\Delta w = 4(s_i-\frac{1}{2})(s_j-\frac{1}{2})$.
        • If the units are of activities of $1$ and $-1$: $\Delta w_{ij}=s_is_j$
      • Dealing with spurious minima in Hopfield Nets
        • Let the net settle from a random initial state and then do unlearning.
        • Crick and Mitchison proposed unlearning as a model of what dreams are for.
      • Storing a memory using pseudo-likelihood method
        • Use the perceptron convergence procedure to train each unit to have the correct state given the states of all the other units in that vector.
    • Boltzmann Machines
      • Can be regarded as Hopfield nets with hidden units.
      • Instead of using the net to store memories, use it to construct interpretations of sensory input.
        • The input is represented by the visible units.
        • The interpretation is represented by the states of the hidden units.
        • The badness of the interpretation is represented by the energy.
      • Use Simulated Annealing to make Boltzmann Machines escape from local minima.
        • $P(s_i=1)=\frac{1}{1+e^{-\frac{\Delta E_i}{T}}}$ where $T$ is the temperature.
        • Raising the noise level is equivalent to decreasing all the energy gaps between configurations
      • Thermal equilibrium
        • The thing that settles down is the probability distribution over configurations
      • Using energies to define probabilities
        • Let $v$ denote the configuration on the visible units and $h$ on the hidden units.
        • $P(v, h) = \frac{e^{-E(v, h)}}{\sum\limits_{g}{\sum\limits_{u}{e^{-E(u, g)}}}}$
        • $P(v) = \frac{\sum\limits_h{e^{-E(v, h)}}}{\sum\limits_{g}{\sum\limits_{u}{e^{-E(u, g)}}}}$
        • We can use MCMC to sample the probabilities.
      • The Boltzmann Machine learning algorithm
        • The goal of learning: We are trying to maximize the product of the probabilities that the Boltzmann machine assigns to the binary vectors in the training set. This is equivalent to maximizing the sum of the log probabilities that the Boltzmann machine assigns to the training vectors.
        • Everything that one weight needs to know about the other weights and the data is contained in the difference of two correlations.

        $$\frac{\partial \log P(v)}{\partial w_{ij}}=E_v(s_is_j)-E_{model}(s_is_j)$$

        • where $E_v$ denotes the expected value of the product of states at thermal equilibrium when $v$ is clamped on the visible units, and $E_{model}$ denotes the expected value of the product of states at thermal equilibrium with no clamping.

        • Positive phase: finding hidden configurations that work well with $v$ and lower their energies.

        • Negative phase: finding the joint configurations that are the best competitors and raising their energies.
        • Use MCMC to collect the statistics.
      • Restricted Boltzmann Machines

        • Definition: Only one hidden layer and no connections between hidden units.
        • In an RBM it only takes one step to reach thermal equilibrium when the visible units are clamped

        $$P(h_j=1)=\frac{1}{1+e^{-(b_j+\sum\limits_{i \in vis}{v_iw_{ij}})}}$$



        • Learning layers of features by stacking RBMs
          • Training steps
            • First train a layer of features that receive input directly from the data.
            • Then treat the activations of the trained features as if they were the original data and learn features of features in a second hidden layer.
            • Then do it again.
          • Generating steps
            • Get an equilibrium sample from the top level RBM by performing alternating Gibbs sampling for a long time.
            • Perform a top-down pass to get states for all the other layers.
        • Modeling real-valued data with an RBM
          • Use stepped sigmoid units
    • Belief Networks
      • Disadvantage of back-propagation
        • Requires labeled data.
        • The learning time doesn’t scale well.
        • It can get stuck in poor local optima.
      • Overcoming the limitations of back-propagation by using unsupervised learning
        • Keep the efficiency and simplicity of using a gradient method for adjusting the weights, but use it for modelling the structure of the sensory input.
        • Adjust the weights to maximize the probability that a generative model would have generated the sensory input.
      • Definition: a directed acyclic graph composed of stochastic variables.
    • Sigmoid Belief Networks (SBN)
      • Definition: Belief Networks with logistic neurons.
      • Naive learning algorithm
        1. Get the posterior distribution over hidden states given the observed data. We can use MCMC to do this.
        2. Get an unbiased sample from the posterior distribution.
        3. For each unit, maximize the log probability that its binary state in the sample from the posterior would be generated by the sampled binary states of its parents.
        4. $\Delta w_{ji}=\epsilon s_j\big(s_i-P(s_i=1)\big)$
      • The wake-sleep algorithm
        • We assume, wrongly, that the posterior over hidden configurations factorizes into a product of distributions for each separate hidden unit. Though this assumption is wrong, learning an SBN still works.
        • Wake phase: Use the recognition weights to perform a bottom-up pass, which trains the generative weights to reconstruct activities in each layer from the layer above.
        • Sleep phase: Use generative weights to generate samples from the model, and this action trains the recognition weights to reconstruct activities in each layer from the layer below.
  • Strong AI
    • Two theories of concepts
      1. The feature theory: A concept is a set of semantic features
      2. The structuralist theory: The meaning of a concept lies in its relationships to other concepts
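
To make the update rules in the "Tricks to speed up the learning procedure" section above concrete, here is a small NumPy sketch of the momentum and rmsprop updates; the toy cost $E(w)=\frac{1}{2}\|w\|^2$ and the hyperparameter values are illustrative assumptions, not tuned settings.

```python
# A small sketch of the momentum and rmsprop updates summarized above.
import numpy as np

def momentum_update(w, grad, v, alpha=0.9, eps=0.01):
    """v(t) = alpha * v(t-1) - eps * dE/dw ;  w <- w + v(t)"""
    v = alpha * v - eps * grad
    return w + v, v

def rmsprop_update(w, grad, F, delta=0.9, eps=0.01, stability=1e-8):
    """F(t) = delta * F(t-1) + (1-delta) * (dE/dw)^2 ;  w <- w - eps * dE/dw / sqrt(F)"""
    F = delta * F + (1.0 - delta) * grad ** 2
    return w - eps * grad / (np.sqrt(F) + stability), F

w_m, v = np.array([1.0, -2.0]), np.zeros(2)
w_r, F = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(300):
    w_m, v = momentum_update(w_m, w_m.copy(), v)   # for this toy cost, dE/dw = w
    w_r, F = rmsprop_update(w_r, w_r.copy(), F)
print(w_m, w_r)   # both move towards the minimum at the origin
```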

Notes on “A Neural Probabilistic Language Model”

Ⅰ. Distributed Representation

1. Fight the curse of dimensionality

As sentences with similar meanings can use quite different words, it's very hard for the n-gram model to generalize. In the paper, they propose to use a Distributed Representation to fight the curse of dimensionality. The Distributed Representation gives the model the ability to make use of semantic information, which by itself already improves the model. Moreover, it lets each training sentence inform the model about an exponential number of semantically neighbouring sentences, which makes the model generalize much better.

2. Deal with out-of-vocabulary words

All we need to do is assign this out-of-vocabulary word a distributed representation. To do that, we first treat the unknown word $j$ as a blank that needs to be filled, and we use the network to estimate the probability $p_i$ of each word $i$ in the vocabulary. Denoting by $C_i$ the distributed representation of word $i$ in the vocabulary, we assign $\sum\limits_{i \in vocabulary}{C_i \cdot p_i}$ to $C_j$ as the distributed representation of word $j$. After that, we can incorporate word $j$ into the vocabulary and use this slightly larger vocabulary for further computation.

This approach is quite elegant because it is arguably close to how the human brain handles unknown words.
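
Here is a minimal NumPy sketch of that assignment step; the embedding matrix and the probability vector below are made up, since the real $p_i$ would come from the trained network.

```python
# A minimal sketch of the out-of-vocabulary trick described above.
import numpy as np

def embed_oov_word(C, p):
    """Return C_j = sum_i p_i * C_i, the probability-weighted average of known embeddings."""
    return np.asarray(p) @ C

C = np.random.rand(5, 3)                  # 5 known words, 3-D embeddings (made up)
p = np.array([0.1, 0.4, 0.2, 0.2, 0.1])   # stand-in for the network's prediction for the blank
C_j = embed_oov_word(C, p)
C = np.vstack([C, C_j])                   # the new word joins the vocabulary
print(C.shape)                            # (6, 3)
```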

3. Deal with polysemous words

We can simply assign multiple distributed representations to a single polysemous word.

4. Comparison with n-gram models

The distributed representation makes use of semantic information and turns discrete random variables into continuous ones. These two features let the network generalize much better than n-gram models.

Ⅱ. Improve computation efficiency

1. Represent the conditional probability with a tree structure

For each prediction, the model only goes through $O(\log n)$ nodes in the tree structure, so much less computation is needed.
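
Here is a minimal sketch of how a tree-structured output can be evaluated with $O(\log n)$ binary decisions per word; the tiny four-word tree, the node vectors and the context vector are made-up illustrations rather than the paper's actual construction.

```python
# A minimal sketch of a tree-structured (hierarchical) softmax over a toy vocabulary.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def word_probability(h, path, node_vectors):
    """P(word | context h): product of sigmoid decisions along the word's path."""
    p = 1.0
    for node_id, branch in path:          # branch is +1 (left) or -1 (right)
        p *= sigmoid(branch * np.dot(h, node_vectors[node_id]))
    return p

rng = np.random.default_rng(0)
node_vectors = rng.normal(size=(3, 5))    # 3 internal nodes, hidden size 5 (made up)
h = rng.normal(size=5)                    # stand-in for the context representation
paths = {
    "cat": [(0, +1), (1, +1)],
    "dog": [(0, +1), (1, -1)],
    "car": [(0, -1), (2, +1)],
    "bus": [(0, -1), (2, -1)],
}
probs = {w: word_probability(h, path, node_vectors) for w, path in paths.items()}
print(probs, sum(probs.values()))         # the four probabilities sum to 1
```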

2. Parallel implementation

In this paper, a data-based parallel implementation and a parameter-based parallel implementation are discussed.

Ⅲ. Relating to Strong AI

1. Taking advantage of prior knowledge

In this paper, they take advantage of the semantic information.

2. Decompose the whole network into smaller parts

This can make computation faster and make the network easier to adapt to other tasks. In Opening the black box of Deep Neural Networks via Information, it is argued that a large amount of computation is spent compressing the input into an effective representation. So if we can modularize the network and set up a set of general APIs, it could make a huge difference in practical implementations.

Ⅳ. Papers involved

  1. Y. Bengio and J-S. Senécal. Quick training of probabilistic neural nets by importance sampling. AISTATS, 2003.
  2. A. Berger, S. Della Pietra, and V. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22:39–71, 1996.



Perceptron

A two-class classifier.
A feedforward neural network without hidden layers.

More specifically, it maps its input $x$ to output $f(x)$ with parameter $\theta$:

$$f(x) = \begin{cases}
1 & \text{if } x \cdot \theta > 0 \\
0 & \text{otherwise}
\end{cases}$$

Learning Algorithm

  1. Randomly initialize $\theta$
  2. For each example $i$ in the training set, update parameter $\theta$ as follow:
    $$\theta = \theta + (y_i - f(\theta \cdot x_i)) \cdot x_i$$
  3. Repeat step 2 until the classifier classifies most examples correctly.
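
Here is a minimal NumPy sketch of this learning algorithm; the toy dataset and the constant bias feature in the last column are my own assumptions.

```python
# A minimal sketch of the perceptron learning algorithm above.
import numpy as np

def predict(theta, x):
    return 1.0 if np.dot(theta, x) > 0 else 0.0

def train_perceptron(X, y, n_epochs=100, seed=0):
    theta = np.random.default_rng(seed).normal(size=X.shape[1])  # step 1: random init
    for _ in range(n_epochs):                                    # step 3: repeat
        mistakes = 0
        for x_i, y_i in zip(X, y):                               # step 2: per-example update
            error = y_i - predict(theta, x_i)
            theta += error * x_i
            mistakes += error != 0
        if mistakes == 0:      # every example is classified correctly
            break
    return theta

# Toy linearly separable data; the last column is a constant bias input.
X = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [2.0, 2.0, 1.0]])
y = np.array([0.0, 0.0, 0.0, 1.0])
theta = train_perceptron(X, y)
print([predict(theta, x) for x in X])   # should match y
```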


  1. Let's change our point of view: each example defines a constraint, and the desired $\theta$ lies in the intersection of those constraints. If you know Linear Programming, you will see that this is actually a half-plane intersection problem.
  2. It's easy to see that if $\theta_1, \theta_2$ are feasible, then $a\theta_1 + b\theta_2$ is feasible when $a + b = 1$ and $a, b \ge 0$. This is easy to prove: a half-plane is a convex set, and the intersection of convex sets is convex. This property helps to prove the convergence of the algorithm.
  3. If you have solved linear regression problems with gradient descent, you will know that we may overshoot the best solution when the learning rate is too large. The same problem exists in this algorithm, so it may not converge to a solution that fits the dataset perfectly. So consider a "generously feasible" solution, one that lies within the feasible region by a margin at least as great as the length of the input vector that defines each constraint plane. The algorithm can only be proved to converge towards such solutions, so "solution" in the convergence proof below means "generously feasible solution".


As you can see from the definition, the Perceptron is a linear classifier. Thus, if the dataset is not linearly separable, this algorithm will not converge.

If you imagine plotting the dataset in a plane (or space), then with some linear algebra you can get the intuition that we are adjusting the decision boundary (separating hyperplane) according to every single example, and each iteration makes the parameter $\theta$ better. In fact, this algorithm will converge, and here is the proof:

  1. Because the set of feasible solutions is convex, the modifications made for other examples won't make the decision boundary worse.
  2. For a single misclassified example $i$, each modification changes the value of $\theta \cdot x_i$ by $||x_i||^2$, and the total change needed before it is classified correctly is $|\theta \cdot x_i|$. Thus, the maximum number of iterations before the classifier classifies every example correctly is $O(\max\limits_i{(\frac{|\theta \cdot x_i|}{||x_i||^2})})$

And here is a more rigorous proof, read it if you like:


The most well-known disadvantage of this algorithm is that it can't simulate the XOR function. But there are actually more general theorems, like the "Group Invariance Theorem". So I decided to read Perceptrons: an introduction to computational geometry (Minsky & Papert, 1969) first; then I will come back to finish this part.

—————————— UPD 2017.8.15 ——————————
I thought Perceptrons: an introduction to computational geometry was a paper, but it's actually a 292-page book! So I'm giving up on reading it for now. Maybe I will read it in university?