## Preface

This paper is entirely empirical and contains almost no mathematical analysis, so every conclusion is drawn from experimental results.

## The main gaps between neuroscience models and ML models

- Neurons encode information in a sparse and distributed way. This corresponds to a trade-off between richness of representation and low energy expenditure on action potentials. Ordinary feed-forward neural networks, however, do not have this property without additional regularization. This is not only biologically implausible but also hurts gradient-based optimization.
- Sigmoid and tanh are equivalent up to a linear transformation: rescaling and shifting the axes of one gives the other, since $\tanh(x) = 2\,\mathrm{sigmoid}(2x) - 1$. Tanh has a steady state at 0 and is therefore preferred from the optimization standpoint, but it forces an antisymmetry around 0 that is absent in biological neurons.
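
The relationship between sigmoid and tanh can be checked numerically. The sketch below assumes the standard logistic sigmoid $1/(1+e^{-x})$; the function names and sample points are illustrative and not taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x: float) -> float:
    """tanh written in terms of the sigmoid: scale the input by 2,
    then stretch the output by 2 and shift it down by 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


# Arbitrary sample points (an assumption made for illustration).
sample_points = [-3.0, -1.0, 0.0, 0.5, 2.0]

# Largest absolute difference between the library tanh and the rewritten form.
max_gap = max(abs(math.tanh(x) - tanh_via_sigmoid(x)) for x in sample_points)
print(max_gap)
```

The gap should sit at floating-point rounding level, which is what "equivalent up to a linear transformation" amounts to in practice: the two units differ only in how their input and output axes are scaled and shifted.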

## Potential problems of rectifiers

- The hard saturation at 0 may hurt optimization by blocking gradient back-propagation.
- To efficiently represent symmetric or antisymmetric behaviour in the data, a rectifier network would need twice as many hidden units as a network of symmetric or antisymmetric activation functions. For example, reproducing the identity takes two rectifier units, since $x = \max(0, x) - \max(0, -x)$.
- Rectifier networks are subject to ill-conditioning of the parametrization.
  - For each layer of depth $i$ of the network, consider a scalar $\alpha_i$ and rescale the parameters as $w_i'=\frac{w_i}{\alpha_i}$ and $b_i'=\frac{b_i}{\prod_{j=1}^{i}{\alpha_j}}$. The output unit values then change as follows: $s'=\frac{s}{\prod_{j=1}^{n}{\alpha_j}}$, where $n$ is the number of layers.
  - Therefore, as long as $\prod_{j=1}^{n}{\alpha_j}=1$, the network function is identical, so many different parameter settings compute exactly the same function.
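
The rescaling argument can be tried on a toy network. This is a minimal sketch under stated assumptions: the scalars $\alpha_i$ are taken to be positive (the argument relies on $\max(0, cz) = c\max(0, z)$ for $c > 0$), a rectifier is applied on every layer including the output, and the weights, input, and helper names are made up for illustration.

```python
def relu(v):
    """Element-wise rectifier max(0, x)."""
    return [max(0.0, x) for x in v]


def affine(W, b, h):
    """Compute W h + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]


def forward(layers, x):
    """Apply relu(W h + b) layer by layer (assumed architecture)."""
    h = x
    for W, b in layers:
        h = relu(affine(W, b, h))
    return h


def rescale(layers, alphas):
    """Divide W_i by alpha_i and b_i by the running product alpha_1 * ... * alpha_i."""
    scaled, running = [], 1.0
    for (W, b), a in zip(layers, alphas):
        running *= a
        scaled.append(([[w / a for w in row] for row in W],
                       [bias / running for bias in b]))
    return scaled


# Hypothetical 2-3-2-1 network and input.
layers = [
    ([[1.0, -2.0], [0.5, 1.5], [-1.0, 0.5]], [0.1, -0.3, 0.2]),
    ([[1.0, 0.5, -1.0], [-0.5, 2.0, 1.0]], [0.05, -0.1]),
    ([[1.5, -0.5]], [0.2]),
]
x = [0.7, -0.4]

out = forward(layers, x)
# Product of alphas is 1: the output should match the original.
out_balanced = forward(rescale(layers, [2.0, 0.25, 2.0]), x)
# Product of alphas is 4: the output should be the original divided by 4.
out_unbalanced = forward(rescale(layers, [2.0, 4.0, 0.5]), x)

print(out, out_balanced, out_unbalanced)
```

The balanced rescaling changes every weight matrix and bias yet leaves the output unchanged up to rounding, which illustrates why the parametrization is ill-conditioned: the individual weights can be scaled up or down freely without changing the function being computed.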

## Advantages of rectifiers

- Rectifiers are not only biologically plausible, they are also computationally efficient.
- There’s almost no improvement when using unsupervised pre-training with rectifier activations.
- Networks trained with the rectifier activation function can find local minima of greater or equal quality than those obtained with its smooth counterpart, the softplus.
- Rectifier networks are truly sparse.
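
The difference between true sparsity and near-sparsity can be seen by comparing the rectifier with the softplus, $\log(1 + e^{x})$, on the same inputs. The sketch below uses random Gaussian pre-activations as a stand-in for a layer's inputs. That is an assumption for illustration only and does not reproduce the paper's experiments, which measure sparsity on trained networks.

```python
import math
import random


def relu(x: float) -> float:
    """Rectifier: exactly 0 for every non-positive input."""
    return max(0.0, x)


def softplus(x: float) -> float:
    """Smooth counterpart of the rectifier, log(1 + exp(x))."""
    return math.log1p(math.exp(x))


# Hypothetical pre-activations: 1000 draws from a standard normal, fixed seed.
rng = random.Random(0)
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(1000)]

relu_out = [relu(z) for z in pre_activations]
softplus_out = [softplus(z) for z in pre_activations]

# Fraction of activations that are exactly zero under each function.
relu_zero_fraction = sum(1 for a in relu_out if a == 0.0) / len(relu_out)
softplus_zero_fraction = sum(1 for a in softplus_out if a == 0.0) / len(softplus_out)

print(relu_zero_fraction, softplus_zero_fraction)
```

With zero-centred inputs, about half of the rectifier outputs are exact zeros, whereas the softplus outputs are small for negative inputs but never exactly zero. That is the sense in which rectifier networks are truly sparse.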