A Taylor series is a representation of a function as an infinite sum of terms that are calculated from the values of the function’s derivatives at a single point:
$$f(x) = \sum\limits_{n=0}^{\infty}{\frac{f^{(n)}(a)}{n!}(x-a)^n}$$
where $f^{(n)}(a)$ denotes the $n^{\text{th}}$ derivative of $f$ evaluated at the point $a$.
And here’s a very intuitive example:
The exponential function $e^x$ (in blue), and the sum of the first $n + 1$ terms of its Taylor series at $0$ (in red).
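We can reproduce the idea behind that picture with a few lines of code: a minimal sketch (function and term counts chosen just for illustration) showing how the partial sums of the series at $0$ approach $e^x$.

```python
import math

def taylor_exp(x, n_terms):
    """Partial sum of the Taylor series of e^x at a = 0: sum of x^k / k!."""
    return sum(x**k / math.factorial(k) for k in range(n_terms))

# The partial sums approach math.exp(1) = e as more terms are added.
for n in [1, 2, 4, 8]:
    print(n, taylor_exp(1.0, n))
```

With only 8 terms the partial sum already agrees with $e$ to four decimal places, which is why the red curve in such plots hugs the blue one so quickly near the expansion point.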
In calculus, Newton’s method is an iterative method for finding the roots of a differentiable function $f$.
In machine learning, we apply Newton’s method to the derivative $f'$ of the cost function $f$.
In the one-dimensional problem, Newton’s method attempts to construct a sequence $\{x_n\}$ from an initial guess $x_0$ that converges towards some value $\hat{x}$ satisfying $f'(\hat{x})=0$. In other words, we are trying to find a stationary point of $f$.
Consider the second-order Taylor expansion $f_T(x)$ of $f$ around $x_n$:
$$f_T(x)=f_T(x_n+\Delta x) \approx f(x_n) + f'(x_n) \Delta x + \frac{1}{2}f''(x_n) \Delta x^2$$
So now, we are trying to find a $\Delta x$ that sets the derivative of this last expression with respect to $\Delta x$ equal to zero, which means
$$\frac{\rm{d}}{\rm{d} \Delta x}\Bigl(f(x_n) + f'(x_n) \Delta x + \frac{1}{2}f''(x_n) \Delta x^2\Bigr) = f'(x_n)+f''(x_n)\Delta x = 0$$
Apparently, $\Delta x = -\frac{f'(x_n)}{f''(x_n)}$ is the solution. So we can hope that $x_{n+1} = x_n+\Delta x = x_n - \frac{f'(x_n)}{f''(x_n)}$ will be closer to a stationary point $\hat{x}$.
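The one-dimensional iteration is short enough to sketch directly; the test function below is a made-up example with stationary points at $-1$, $0$, and $1$.

```python
def newton_stationary(f1, f2, x0, tol=1e-10, max_iter=50):
    """1-D Newton's method for a stationary point: x_{n+1} = x_n - f'(x_n)/f''(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = f1(x) / f2(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: f(x) = x^4 - 2x^2, which has stationary points at -1, 0, and 1.
f1 = lambda x: 4 * x**3 - 4 * x   # f'
f2 = lambda x: 12 * x**2 - 4      # f''
print(newton_stationary(f1, f2, 1.5))  # converges to the minimum at x = 1
```

Note that which stationary point it converges to depends on the initial guess $x_0$, and the iteration is undefined wherever $f''(x_n) = 0$.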
Now consider the multivariate case, where the second-order Taylor expansion $f_T(x)$ of $f$ around $x_n$ is
$$f_T(x)=f_T(x_n+\Delta x) \approx f(x_n) + \Delta x^{\mathsf{T}} \nabla f(x_n) + \frac{1}{2} \Delta x^{\mathsf{T}} {\rm H} f(x_n) \Delta x$$
where ${\rm H} f(x)$ denotes the Hessian matrix of $f(x)$ and $\nabla f(x)$ denotes the gradient. (See more about Taylor expansion at https://en.wikipedia.org/wiki/Taylor_series#Taylor_series_in_several_variables)
So $\Delta x = – [{\rm H}f(x_n)]^{-1}\nabla f(x_n)$ should be a good choice.
The time complexity of computing $A^{-1}$ for an $n \times n$ matrix $A$ is $O(n^3)$. So when the data set has many dimensions, the algorithm becomes quite slow.
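In practice the Newton step is computed by solving the linear system ${\rm H}f(x_n)\,\Delta x = -\nabla f(x_n)$ rather than forming the inverse explicitly, which has the same asymptotic cost but is cheaper and more numerically stable. A minimal sketch (the quadratic below is a made-up example whose Hessian is constant, so one step lands on the minimizer):

```python
import numpy as np

def newton_step(grad, hess):
    """One Newton update: solve H @ dx = -grad instead of computing H^{-1}."""
    return np.linalg.solve(hess, -grad)

# Toy quadratic f(x) = 0.5 x^T A x - b^T x; its gradient is Ax - b, its Hessian is A,
# and its exact minimizer solves Ax = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = np.zeros(2)
x = x + newton_step(A @ x - b, A)  # a single Newton step reaches the minimizer
```

For a general (non-quadratic) $f$, the same step is repeated until the gradient is close to zero.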
The reason steepest descent goes wrong is that the gradient for one weight gets messed up by simultaneous changes to all the other weights.
The Hessian matrix captures the sizes of these interactions, so Newton’s method can account for them as much as possible.
As you can see, PCA even did a better job.
And now I come to the conclusion that autoencoders are good at discovering multiple different patterns, while for fitting a single pattern PCA is a better choice.
Conventional machine-learning techniques were limited in their ability to process natural data in their raw form.
Problems such as image and speech recognition require the input-output function to be insensitive to irrelevant variations of the input while being very sensitive to particular minute variations.
This is why conventional machine-learning models require a good feature extractor designed by humans. But neural networks can learn these features automatically using a general-purpose learning procedure. This is the key advantage of deep learning.
The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer.
Recent theoretical and empirical results strongly suggest that local minima are not a serious problem in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero and the surface curves up in most dimensions and curves down in the remainder. The analysis seems to show that saddle points with only a few downward-curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.
For smaller data sets, unsupervised pre-training helps to prevent overfitting. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.
CNNs are designed to process data that come in the form of multiple arrays.
The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. All units in a feature map share the same filter bank.
While the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one.
Many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones.
The pooling allows representations to vary very little when elements in the previous layer vary in position and appearance.
The convolutional and pooling layers in CNNs are directly inspired by classic notions of simple cells and complex cells in visual neuroscience.
| Fields | Technology | Versatility |
| --- | --- | --- |
| Computer Vision | developed | bad |
| Natural Language Processing | developing | bad |
| Speech Recognition | developed | good |
$$\begin{aligned}E\bigl((y^{noisy}-\hat{y})^2\bigr) & = E\Bigl(\bigl((y - \hat{y}) + \sum\limits_{i}{w_i\epsilon_i}\bigr)^2\Bigr)\\
& = (y - \hat{y})^2 + E\bigl(2(y-\hat{y})\sum\limits_{i}{w_i\epsilon_i}\bigr) + E\bigl((\sum\limits_{i}{w_i\epsilon_i})^2\bigr)\\
& = (y - \hat{y})^2 + E(\sum\limits_{i}{w_i^2\epsilon_i^2})\\
& = (y - \hat{y})^2 + \sum\limits_{i}{w_i^2\sigma_i^2}\end{aligned}$$
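The identity above can be sanity-checked numerically. A quick Monte Carlo sketch (all the weights, noise levels, and outputs below are made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.5, -1.0, 2.0])     # weights w_i (hypothetical values)
sigma = np.array([0.3, 0.1, 0.2])  # noise standard deviations sigma_i
y, y_hat = -3.5, 1.5               # clean output y and target y_hat

# y_noisy - y_hat = (y - y_hat) + sum_i w_i * eps_i, with eps_i ~ N(0, sigma_i^2)
eps = rng.normal(0.0, sigma, size=(200_000, 3))
mc = np.mean(((y - y_hat) + eps @ w) ** 2)            # Monte Carlo estimate
closed_form = (y - y_hat) ** 2 + np.sum(w**2 * sigma**2)
print(mc, closed_form)
```

The two numbers agree to within sampling error, confirming that adding zero-mean input noise is equivalent to an $L_2$-style penalty $\sum_i w_i^2 \sigma_i^2$ on the weights.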
$$\begin{aligned}E\big((\hat{y}-y_i)^2\big) & = E\Big(\big((\hat{y} - \bar{y}) - (y_i - \bar{y})\big)^2\Big)\\
& = E\big((\hat{y} - \bar{y})^2 + (y_i - \bar{y})^2 - 2(\hat{y} - \bar{y})(y_i - \bar{y})\big)\\
& = (\bar{y} - \hat{y})^2 + E\big((y_i-\bar{y})^2\big) - 2(\hat{y} - \bar{y})E(y_i-\bar{y})\\
& = (\bar{y} - \hat{y})^2 + E\big((y_i-\bar{y})^2\big) - \frac{2}{n}(\hat{y} - \bar{y})\sum\limits_{i = 1}^{n}{(y_i - \bar{y})}\\
& = (\bar{y} - \hat{y})^2 + E\big((y_i-\bar{y})^2\big)\end{aligned}$$
Comparison
(picture from https://www.coursera.org/learn/neural-networks/home/welcome)
$$\frac{\partial \log P(v)}{\partial w_{ij}}=E_v(s_is_j)-E_{m}(s_is_j)$$
Positive phase: finding hidden configurations that work well with $v$ and lower their energies.
Restricted Boltzmann Machines
$$P(h_j=1)=\frac{1}{1+e^{-(b_j+\sum\limits_{i \in vis}{v_iw_{ij}})}}$$
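The formula above is just a sigmoid over the total input to each hidden unit, which vectorizes nicely. A minimal sketch (the RBM sizes and weights are hypothetical):

```python
import numpy as np

def hidden_probs(v, W, b):
    """P(h_j = 1 | v) = sigmoid(b_j + sum_i v_i * w_ij), for all hidden units j at once."""
    return 1.0 / (1.0 + np.exp(-(b + v @ W)))

# Hypothetical RBM with 3 visible and 2 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))       # w_ij: visible unit i -> hidden unit j
b = np.zeros(2)                   # hidden biases b_j
v = np.array([1.0, 0.0, 1.0])     # a binary visible configuration
p = hidden_probs(v, W, b)
h = (rng.random(2) < p).astype(float)  # sample binary hidden states from P(h_j = 1)
```

Sampling $h$ from these probabilities (rather than using the probabilities directly) is what the positive and negative phases of contrastive divergence do.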
(picture from https://www.coursera.org/learn/neural-networks/home/welcome)
The first time I saw the picture below, I thought it was just a joke. However, after my first attempt, I found it to be true, even though I tried my best to avoid it.
(picture from: xkcd, CC BY-NC 2.5)
This website can find an abbreviation from the full name of your project.
A cool abbreviation makes your project much posher!
This website can provide you with semantic suggestions.
It’s quite useful for students when writing English articles.
As sentences with similar meanings can use quite different words, it’s very hard for the n-gram model to generalize. In the paper, they propose to use distributed representations to fight the curse of dimensionality. The distributed representation gives the model the ability to make use of semantic information. This feature by itself improves the model. Moreover, this proposal lets each training sentence inform the model about an exponential number of semantically neighbouring sentences, which makes the model generalize much better.
It’s clear that the only thing we need to do is to assign this out-of-vocabulary word a distributed representation. To do that, we first consider this unknown word $j$ as a blank which needs to be filled, and we use the network to estimate the probability $p_i$ of each word $i$ in the vocabulary. Denoting by $C_i$ the distributed representation for word $i$ in the vocabulary, we assign $\sum\limits_{i \in \text{vocabulary}}{C_i \cdot p_i}$ to $C_j$ as the distributed representation for word $j$. After that, we can incorporate word $j$ into the vocabulary and use this slightly larger vocabulary for further computation.
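The weighted-average step is a single matrix-vector product. A minimal sketch (the vocabulary, embedding dimension, and probabilities below are all made-up):

```python
import numpy as np

def oov_embedding(C, p):
    """Distributed representation for an out-of-vocabulary word:
    C_j = sum_i p_i * C_i, the probability-weighted average of known embeddings."""
    return p @ C

# Hypothetical vocabulary of 4 words with 3-dimensional embeddings.
C = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
p = np.array([0.1, 0.2, 0.3, 0.4])  # the network's fill-in-the-blank probabilities
C_j = oov_embedding(C, p)           # -> [0.5, 0.6, 0.3]
```

The new row `C_j` can then be appended to `C`, growing the vocabulary by one word.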
This approach is quite elegant because it resembles how humans infer the meaning of an unknown word from context.
We can just simply assign multiple distributed representations for a single polysemous word.
Use distributed representations to make use of semantic information and to turn discrete random variables into continuous variables. These two features let the network generalize much better than n-gram models.
For each classification, it only traverses $O(\log n)$ nodes in the tree structure, so the network needs to evaluate far fewer parameters per prediction.
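A quick back-of-the-envelope sketch of that saving (the vocabulary sizes are hypothetical examples):

```python
import math

def path_length(vocab_size):
    """In a balanced binary word tree, predicting one word touches only
    ceil(log2(n)) internal nodes, instead of scoring all n words."""
    return math.ceil(math.log2(vocab_size))

# A 100,000-word vocabulary needs only ~17 binary decisions per word,
# versus a flat 100,000-way softmax.
print(path_length(100_000))  # -> 17
```

Each internal node holds one binary classifier, so both training and inference cost per word drop from $O(n)$ to $O(\log n)$.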
In this paper, data-based and parameter-based parallel implementations are mentioned.
In this paper, they take advantage of the semantic information.
This can make computation faster and make it easier to adapt the network to other tasks. In Opening the black box of Deep Neural Networks via Information, it’s said that a large amount of computation is spent compressing the input into an effective representation. So if we can modularize the network and set up a set of general APIs, it could make a huge difference in practical implementations.
A two-class classifier.
A Feedforward Neural Network without hidden layers.
More specifically: It maps its input $x$ to output $f(x)$ with parameter $\theta$:
$$
\begin{equation}
f(x) = \begin{cases}
1 & \text{if }x \cdot \theta > 0 \\
0 & \text{otherwise}
\end{cases}
\end{equation}
$$
step 2
until the classifier classifies most examples correctly.

As you can see from the definition, the Perceptron is a linear classifier. Thus, if the dataset is not linearly separable, this algorithm will not converge.
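The update loop above can be sketched as follows (a minimal version, assuming labels in $\{0, 1\}$, the convention $f(x) = 1$ when $x \cdot \theta > 0$, and a bias folded in as a constant feature; the toy data is made up):

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Perceptron learning rule: nudge theta toward misclassified examples."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            pred = 1.0 if x_i @ theta > 0 else 0.0
            if pred != y_i:
                theta += (y_i - pred) * x_i  # +x for a false negative, -x for a false positive
                errors += 1
        if errors == 0:  # converged: every example is classified correctly
            break
    return theta

# Linearly separable toy data; the last column is a constant 1 (the bias feature).
X = np.array([[1, 2, 1], [2, 3, 1], [-1, -2, 1], [-2, -1, 1]], dtype=float)
y = np.array([1, 1, 0, 0], dtype=float)
theta = train_perceptron(X, y)
```

On non-separable data this loop never reaches `errors == 0` and simply stops after `epochs` passes, which matches the convergence caveat above.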
If you can imagine plotting the dataset in a plane (or space), with some knowledge of linear algebra you can easily build the intuition that we are actually adjusting the decision boundary (separating hyperplane) according to every single example. And each iteration makes the parameter $\theta$ better. Actually, this algorithm will converge, and here is the proof:
- Because the set of feasible solutions is convex, the modifications made for other examples won’t make the decision boundary worse.
- As for a single misclassified example $i$, each modification changes the value of $\theta \cdot x_i$ by $||x_i||^2$. And the total amount needed for it to become correct is $|\theta \cdot x_i|$. Thus, the maximum number of iterations before the classifier classifies each example correctly is $O(\max\limits_i{(\frac{|\theta \cdot x_i|}{||x_i||^2})})$.
And here is a more rigorous proof, read it if you like: http://leijun00.github.io/2014/08/perceptron/
The most well-known disadvantage of this algorithm is that it can’t simulate the XOR function. But actually, there are more general theorems, like the “Group Invariance Theorem”. So I decided to read Perceptrons: an introduction to computational geometry (Minsky & Papert, 1969) first. Then I will come back to finish this part.
—————————— UPD 2017.8.15 ——————————
I thought Perceptrons: an introduction to computational geometry was a paper. But it’s actually a book with 292 pages! So I’ll give up reading it for now. Maybe I will read it in university?