Notes on “Deep Learning”, Yann LeCun, Yoshua Bengio, and Geoffrey Hinton (2015)

Conventional machine-learning

  • Conventional machine-learning techniques were limited in their ability to process natural data in their raw form.

  • Problems such as image and speech recognition require the input-output function to be insensitive to irrelevant variations of the input while being very sensitive to particular minute variations.

  • This is why conventional machine-learning models require a good feature extractor designed by humans. Neural networks, by contrast, can learn these features automatically using a general-purpose learning procedure. This is the key advantage of deep learning.

Activation Functions

  • In past decades, neural nets used smoother non-linearities, such as $\tanh(z)$ or the sigmoid function $1/(1+\exp(-z))$, but the ReLU, $f(z) = \max(z, 0)$, typically learns much faster in networks with many layers.
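
The notes do not say why ReLU tends to learn faster. One commonly given explanation, stated here as an assumption rather than as a claim from the notes, is that the derivatives of tanh and sigmoid shrink toward zero for large inputs, while the ReLU derivative stays at 1 for any positive input. The sketch below only compares the three derivatives at a few points; the function names are illustrative.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def d_sigmoid(z):
    """Derivative of the sigmoid: s * (1 - s), at most 0.25."""
    s = sigmoid(z)
    return s * (1.0 - s)


def d_tanh(z):
    """Derivative of tanh: 1 - tanh(z)^2, at most 1."""
    return 1.0 - math.tanh(z) ** 2


def relu(z):
    """Rectified linear unit, max(z, 0)."""
    return max(z, 0.0)


def d_relu(z):
    """Derivative of ReLU: 1 for z > 0, else 0 (0 at z = 0 by convention)."""
    return 1.0 if z > 0 else 0.0


# Compare how quickly each derivative shrinks as the input grows.
for z in (0.0, 2.0, 5.0):
    print(f"z={z}: d_sigmoid={d_sigmoid(z):.5f} "
          f"d_tanh={d_tanh(z):.5f} d_relu={d_relu(z):.1f}")
```

Because backpropagation multiplies such derivatives layer by layer, small factors compound in deep networks, which is one plausible reading of "learns much faster in networks with many layers".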

Learning Procedure

  • The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer.

  • Recent theoretical and empirical results strongly suggest that local minima are not a serious problem in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero and the surface curves up in most dimensions and curves down in the remainder. The analysis seems to show that saddle points with only a few downward-curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it matters little which of these saddle points the algorithm gets stuck at.

  • Unsupervised pre-training helps to prevent overfitting on smaller data sets. Once deep learning had been rehabilitated, it turned out that the pre-training stage was needed only for small data sets.
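
To make the saddle-point bullet above concrete, here is a minimal sketch using a toy function, f(x, y) = x² − y², chosen purely for illustration (it is not from the paper). At the origin the gradient is zero, the surface curves up along x and down along y. Plain gradient descent started exactly on the x-axis stalls at the saddle, while a tiny offset in y lets it escape. The step size and step count are arbitrary assumptions.

```python
def f(x, y):
    """Toy objective with a saddle point at the origin."""
    return x * x - y * y


def grad(x, y):
    """Gradient of f: (df/dx, df/dy) = (2x, -2y)."""
    return (2.0 * x, -2.0 * y)


# Second derivatives of f: positive along x (curves up), negative along y (curves down).
CURVATURE = (2.0, -2.0)


def descend(x, y, lr=0.1, steps=50):
    """Run plain gradient descent and return the final point."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


stuck = descend(1.0, 0.0)      # starts exactly on the x-axis
escaped = descend(1.0, 1e-6)   # same start with a tiny offset in y
print("on the axis:   ", stuck)
print("slightly off:  ", escaped)
```

This toy function is unbounded below, so it shows only the geometry of a saddle point. It says nothing about the claim that saddle points in real networks have similar objective values.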


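Convolutional Neural Networks

  • Convolutional neural networks (CNNs) are designed to process data that come in the form of multiple arrays, for example a colour image made up of three 2D arrays, one per colour channel.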

  • The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. All units in a feature map share the same filter bank.

  • The role of the convolutional layer is to detect local conjunctions of features from the previous layer, whereas the role of the pooling layer is to merge semantically similar features into one.

  • Many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones.

  • Pooling allows representations to vary very little when elements in the previous layer vary in position and appearance.

  • The convolutional and pooling layers in CNNs are directly inspired by classic notions of simple cells and complex cells in visual neuroscience.
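
The weight sharing and pooling described above can be sketched in one dimension. The example below is a simplified illustration, not the paper's architecture: the filter `[1, 1]`, the bias of −1, the ReLU non-linearity, and the pool size of 4 are all assumptions chosen so the effect is easy to see. Every output unit applies the same filter to a different local patch, and max pooling then hides a one-position shift of the input pattern.

```python
def feature_map(signal, kernel, bias):
    """Slide one shared kernel over the signal (cross-correlation), add a bias, apply ReLU."""
    k = len(kernel)
    out = []
    for i in range(len(signal) - k + 1):
        # Each unit sees only a local patch and reuses the same weights.
        response = sum(signal[i + j] * kernel[j] for j in range(k)) + bias
        out.append(max(0.0, response))
    return out


def max_pool(values, size):
    """Non-overlapping max pooling over windows of the given size."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


kernel = [1, 1]   # responds most strongly to two adjacent ones
bias = -1.0       # only a full match survives the ReLU

pattern = [0, 0, 1, 1, 0, 0, 0, 0, 0]
shifted = [0, 0, 0, 1, 1, 0, 0, 0, 0]   # same pattern, moved one step right

fm_a = feature_map(pattern, kernel, bias)
fm_b = feature_map(shifted, kernel, bias)
print("feature maps:", fm_a, fm_b)
print("pooled:      ", max_pool(fm_a, 4), max_pool(fm_b, 4))
```

The two feature maps differ, because the detected pattern sits at a different position, but the pooled outputs agree. A larger shift that crosses a pooling-window boundary would change the pooled output, so the invariance is only local.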

The future of deep learning

  • Unsupervised learning had a catalytic effect in reviving interest in deep learning. The authors expect unsupervised learning to become far more important in the longer term.

Views from Ge Li (not from this passage)

Field                        Technology  Versatility
Computer Vision              developed   bad
Natural Language Processing  developing  bad
Speech Recognition           developed   good
