# Notes on “Deep Sparse Rectifier Neural Networks” (Glorot, Bordes & Bengio, 2011)

## Preface

This paper is almost entirely empirical, with little mathematical analysis; every conclusion below is therefore drawn from experimental results.

## The main gaps between neuroscience models and ML models

1. Neurons encode information in a sparse and distributed way, which corresponds to a trade-off between richness of representation and the energy cost of action potentials. However, without additional regularization, ordinary feed-forward neural networks do not have this property. This lack of sparsity is not only biologically implausible but can also hurt gradient-based optimization.
2. Sigmoid and tanh are equivalent up to a linear transformation (with a small rescaling and shift of the axes, one can be obtained from the other; see the check below). Tanh has a steady state at 0 and is therefore preferred from the optimization standpoint, but it forces an antisymmetry around 0 that is absent in biological neurons.
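
As a quick sanity check, here is a minimal NumPy sketch (the helper names are mine, not the paper's) of the identity $\tanh(x) = 2\,\sigma(2x) - 1$, which is exactly the linear transformation relating the two activations:

```python
# Verify numerically that tanh is a rescaled and shifted sigmoid:
# tanh(x) = 2 * sigmoid(2x) - 1.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)
lhs = np.tanh(x)
rhs = 2.0 * sigmoid(2.0 * x) - 1.0

# The two curves coincide up to floating-point error.
print(np.max(np.abs(lhs - rhs)))  # ~1e-16
```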

## Potential problems of rectifiers

1. The hard saturation at 0 may hurt optimization by blocking gradient back-propagation (see the first sketch after this list).
2. In order to efficiently represent symmetric or antisymmetric behaviour in the data, a rectifier network would need twice as many hidden units as a network with symmetric or antisymmetric activation functions.
3. Rectifier networks are subject to ill-conditioning of the parametrization.
• Consider for each layer $i$ of the network a scalar $\alpha_i$, and scale the parameters as $w_i' = \frac{w_i}{\alpha_i}$ and $b_i' = \frac{b_i}{\prod_{j=1}^{i}{\alpha_j}}$. The output unit values then change as follows: $s' = \frac{s}{\prod_{j=1}^{n}{\alpha_j}}$, where $n$ is the depth of the network.
• Therefore, as long as $\prod_{j=1}^{n}{\alpha_j}=1$, the network function is identical, yet the parameter magnitudes (and hence the gradients with respect to them) can differ arbitrarily between equivalent networks; this is the ill-conditioning (a numerical check is sketched below).
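
To make problem 1 concrete, a minimal sketch (assuming NumPy; the toy unit below is mine, not from the paper): once a rectifier unit's pre-activation is negative, both its output and its derivative are exactly zero, so the weight feeding it receives no gradient from that example.

```python
# Hard saturation at 0: for a negative pre-activation, the rectifier's output
# and derivative are both 0, so no error signal flows back through the unit.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Derivative of max(0, z), taken as 0 at z = 0.
    return np.where(z > 0.0, 1.0, 0.0)

w, b, x = -2.0, 0.5, 1.0
z = w * x + b            # pre-activation = -1.5 < 0
print(relu(z))           # 0.0 -> the unit is "off"
print(relu_grad(z) * x)  # 0.0 -> gradient w.r.t. w is blocked
```

To see the scaling argument of problem 3 concretely, here is a minimal NumPy sketch (all variable names are mine, not the paper's): dividing layer $i$'s weights by $\alpha_i$ and its biases by $\prod_{j=1}^{i}\alpha_j$ leaves the rectifier network's output unchanged whenever $\prod_j \alpha_j = 1$, because the rectifier is positively homogeneous ($\mathrm{relu}(a z) = a\,\mathrm{relu}(z)$ for $a > 0$), even though the parameters themselves become very different.

```python
# Two rectifier networks with rescaled parameters compute the same function
# when the product of the per-layer scaling factors equals 1.
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0.0, z)

# A small 3-layer rectifier network with random parameters.
Ws = [rng.standard_normal((4, 4)) for _ in range(3)]
bs = [rng.standard_normal(4) for _ in range(3)]

def forward(Ws, bs, x):
    h = x
    for W, b in zip(Ws, bs):
        h = relu(W @ h + b)
    return h

alphas = np.array([2.0, 5.0, 0.1])  # product of the alphas is 1
Ws2 = [W / a for W, a in zip(Ws, alphas)]                          # w_i' = w_i / alpha_i
bs2 = [b / np.prod(alphas[: i + 1]) for i, b in enumerate(bs)]     # b_i' = b_i / prod_{j<=i} alpha_j

x = rng.standard_normal(4)
print(np.allclose(forward(Ws, bs, x), forward(Ws2, bs2, x)))  # True
```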