This paper is totally empirical. You can hardly find any mathematics analysis. Therefore, every conclusion we drew is from experimental results.
A Taylor series is a representation of a function as an infinite sum of terms that are calculated from the values of the function’s derivatives at a single point
$$f(a) \approx \sum\limits_{n=0}^{\infty}{\frac{f^{(n)}(a)}{n!}(x-a)^n}$$
where $f^{n}(a)$ donates the $n^{th}$ derivative of $f$ evaluated at the point $a$.
And here’s a very intuitive example:
The exponential function $e^x$ (in blue), and the sum of the first $n + 1$ terms of its Taylor series at $0$ (in red).
In calculus, Newton’s method is an iterative method for finding the roots of a differentiable function $f$.
In machine learning, we apply Newton’s method to the derivative $f’$ of the cost function $f$.
In the one-dimensional problem, Newton’s method attempts to construct a sequence ${x_n}$ from an initial guess $x_0$ that converges towards some value $\hat{x}$ satisfying $f'(\hat{x})=0$. In another word, we are trying to find a stationary point of $f$.
Consider the second order Taylor expansion $f_T(x)$ of $f$ around $x_n$ is
$$f_T(x)=f_T(x_n+\Delta x) \approx f(x_n) + f'(x_n) \Delta x + \frac{1}{2}f”(x_n) \Delta x^2$$
So now, we are trying to find a $\Delta x$ that sets the derivative of this last expression with respect to $\Delta x$ equal to zero, which means
$$\frac{\rm{d}}{\rm{d} \Delta x}(f(x_n) + f'(x_n) \Delta x + \frac{1}{2}f”(x_n) \Delta x^2) = f'(x_n)+f”(x_n)\Delta x = 0$$
Apparently, $\Delta x = -\frac{f'(x_n)}{f”(x_n)}$ is the solution. So it can be hoped that $x_{n+1} = x_n+\Delta x = x_n – \frac{f'(x_n)}{f”(x_n)}$ will be closer to a stationary point $\hat{x}$.
Still consider the second order Taylor expansion $f_T(x)$ of $f$ around $x_n$ is
$$f_T(x)=f_T(x_n+\Delta x) \approx f(x_n) + \Delta x^{\mathsf{T}} \nabla f(x_n) + \frac{1}{2} {\rm H} f(x_n) \Delta x (\Delta x)^{\mathsf{T}}$$
where ${\rm H} f(x)$ donates the Hessian matrix of $f(x)$ and $\nabla f(x)$ donates the gradient. (See more about Taylor expansion at https://en.wikipedia.org/wiki/Taylor_series#Taylor_series_in_several_variables)
So $\Delta x = – [{\rm H}f(x_n)]^{-1}\nabla f(x_n)$ should be a good choice.
As is known to us all, the time complexity to get $A^{-1}$ given $A$ is $O(n^3)$ when $A$ is a $n \times n$ matrix. So when the data set is of too many dimensions, the algorithm will work quite slow.
The reason steepest descent goes wrong is that the gradient for one weight gets messed up by simultaneous changes to all the other weights.
And the Hessian matrix determines the sizes of these interactions so that Newton’s method minimize these interactions as much as possible.
As you can see, PCA even did a better job.
And now I come to a conclusion that Auto Encoder is good at seeking different patterns and when fitting a single pattern PCA is a better choice.
Conventional machine-learning techniques were limited in their ability to process natural data in their raw form.
Problems such as image and speech recognition require the input-output function to be insensitive to irrelevant variations of the input while being very sensitive to particular minute variations.
This is why Conventional machine-learning models require a good feature extractor designed by humans. But neural networks can learn this features automatically using a general-purpose learning procedure. This is the key advantage of deep learning.
The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer.
Recent theoretical and empirical results strongly suggest that local minima are not a serious problem in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero and the surface curves up in most dimensions and curves down in the reminder. The analysis seems to show that saddle points with only a few downward curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.
For smaller data sets, unsupervised pre-training helps to prevent overfitting. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.
CNNs are designed to process data that come in the form of multiple arrays.
The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within each unit is connected to local patched in the feature maps of the previous layer through a set of weights called a filter bank. All units in a feature map share the same filter bank.
Although the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one.
Many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones.
The pooling allows representations to vary very little when elements in the previous layer vary in position and appearance.
The convolutional and pooling layers in CNNs are directly inspired by classic notions of simple cells and complex cells in visual neuroscience.
Fields | Technology | Versatility |
---|---|---|
Computer Vision | developed | bad |
Natural Language Processing | developing | bad |
Speech Recognition | developed | good |
$$\begin{aligned}E\bigl((y^{noisy}-\hat{y})^2\bigr) & = E\Bigl(\bigl((y – \hat{y}) + \sum\limits_{i}{w_i\epsilon_i}\bigr)^2\Bigr)\\
& = (y – \hat{y})^2 + E\bigl(2(y-\hat{y})\sum\limits_{i}{w_i\epsilon_i}\bigr) + E\bigl((\sum\limits_{i}{w_i\epsilon_i})^2\bigr)\\
& = (y – \hat{y})^2 + E(\sum\limits_{i}{w_i^2\epsilon_i^2})\\
& = (y – \hat{y})^2 + \sum\limits_{i}{w_i^2\sigma_i^2}\end{aligned}$$
$$\begin{aligned}E\big((\hat{y}-y_i)^2\big) & = E\Big(\big((\hat{y} – \bar{y}) – (y_i – \bar{y})\big)^2\Big)\\
& = E\big((\hat{y} – \bar{y})^2 + (y_i – \bar{y})^2 – 2(\hat{y} – \bar{y})(y_i – \bar{y})\big)\\
& = (\bar{y} – \hat{y})^2 + E\big((y_i-\bar{y})^2\big) – 2(\hat{y} – \bar{y})E(y_i-\bar{y})\\
& = (\bar{y} – \hat{y})^2 + E\big((y_i-\bar{y})^2\big) – \frac{2}{n}(\hat{y} – \bar{y})\sum\limits_{i = 1}^{n}{y_i – \bar{y}}\\
& = (\bar{y} – \hat{y})^2 + E\big((y_i-\bar{y})^2\big)\end{aligned}$$
Comparison
(picture from https://www.coursera.org/learn/neural-networks/home/welcome)
$$\frac{\partial \log P(v)}{\partial w_{ij}}=E_v(s_is_j)-E_{m}(s_is_j)$$
Positive phase: finding hidden configurations that work well with $v$ and lower their energies.
Restricted Boltzmann Machines
$$P(h_j=1)=\frac{1}{1+e^{-(b_j+\sum\limits_{i \in vis}{v_iw_{ij}})}}$$
(picture from https://www.coursera.org/learn/neural-networks/home/welcome)
The first time I saw the picture below, I thought it’s just a joke. However, after my first try, it’s true, though I tried my best to avoid it.
(picture from: xkcd, CC BY-NC 2.5)
This website can find abbreviation from the full name of your project.
A cool abbreviation makes your project much posher!
This website can provide you with semantic suggestions.
It’s quite useful for students when writing English articles.