Newton’s method

Taylor series

A Taylor series is a representation of a function as an infinite sum of terms that are calculated from the values of the function’s derivatives at a single point

$$f(a) \approx \sum\limits_{n=0}^{\infty}{\frac{f^{(n)}(a)}{n!}(x-a)^n}$$

where $f^{n}(a)$ donates the $n^{th}$ derivative of $f$ evaluated at the point $a$.

And here’s a very intuitive example:

Taylor series
The exponential function $e^x$ (in blue), and the sum of the first $n + 1$ terms of its Taylor series at $0$ (in red).

Newton’s method

Overview

In calculus, Newton’s method is an iterative method for finding the roots of a differentiable function $f$.

In machine learning, we apply Newton’s method to the derivative $f’$ of the cost function $f$.

One-dimension version

In the one-dimensional problem, Newton’s method attempts to construct a sequence ${x_n}$ from an initial guess $x_0$ that converges towards some value $\hat{x}$ satisfying $f'(\hat{x})=0$. In another word, we are trying to find a stationary point of $f$.

Consider the second order Taylor expansion $f_T(x)$ of $f$ around $x_n$ is

$$f_T(x)=f_T(x_n+\Delta x) \approx f(x_n) + f'(x_n) \Delta x + \frac{1}{2}f”(x_n) \Delta x^2$$

So now, we are trying to find a $\Delta x$ that sets the derivative of this last expression with respect to $\Delta x$ equal to zero, which means

$$\frac{\rm{d}}{\rm{d} \Delta x}(f(x_n) + f'(x_n) \Delta x + \frac{1}{2}f”(x_n) \Delta x^2) = f'(x_n)+f”(x_n)\Delta x = 0$$

Apparently, $\Delta x = -\frac{f'(x_n)}{f”(x_n)}$ is the solution. So it can be hoped that $x_{n+1} = x_n+\Delta x = x_n – \frac{f'(x_n)}{f”(x_n)}$ will be closer to a stationary point $\hat{x}$.

High dimensional version

Still consider the second order Taylor expansion $f_T(x)$ of $f$ around $x_n$ is

$$f_T(x)=f_T(x_n+\Delta x) \approx f(x_n) + \Delta x^{\mathsf{T}} \nabla f(x_n) + \frac{1}{2} {\rm H} f(x_n) \Delta x (\Delta x)^{\mathsf{T}}$$

where ${\rm H} f(x)$ donates the Hessian matrix of $f(x)$ and $\nabla f(x)$ donates the gradient. (See more about Taylor expansion at https://en.wikipedia.org/wiki/Taylor_series#Taylor_series_in_several_variables)

So $\Delta x = – [{\rm H}f(x_n)]^{-1}\nabla f(x_n)$ should be a good choice.

Limitation

As is known to us all, the time complexity to get $A^{-1}$ given $A$ is $O(n^3)$ when $A$ is a $n \times n$ matrix. So when the data set is of too many dimensions, the algorithm will work quite slow.

Advantage

The reason steepest descent goes wrong is that the gradient for one weight gets messed up by simultaneous changes to all the other weights.

And the Hessian matrix determines the sizes of these interactions so that Newton’s method minimize these interactions as much as possible.

References

Notes on “A Neural Probabilistic Language Model”

Ⅰ. Distributed Representation

1. Fight the curse of dimensionality

As sentences with similar meanings can use quite different words, it’s very hard for the n-gram model to generalize. In the paper, they propose to use Distributed Representation to fight the curse of dimensionality. The Distributed Representation give the model the ability to make use of semantic information. This feature itself does improve the model. And this proposal let each training sentence inform the model an exponential number of semantically neighbouring sentences which makes the model generalize much better.

2. Deal with out-of-vocabulary words

It’s clear to see that the only stuff we need to deal with is to assign this out-of-vocabulary word a distributed representation. So to do that, we first consider this unknown word $j$ as a blank which need to be filled. And we use the network to estimate the probability $p_i$ of each word $i$ in the vocabulary. Donating $C_i$ as the distributed representation for word $i$ in the vocabulary, we assign $\sum\limits_{i \ in \ vocabulary}{C_i \cdot p_i}$ to $C_j$ as the distributed representation for word $j$. After that, we can incorporate word $j$ in the vocabulary and use this slightly larger vocabulary for further computation.

This approach is quite elegant because this is exactly how human brain works.

3. Deal with polysemous words

We can just simply assign multiple distributed representations for a single polysemous word.

4. comparison with n-gram models

Use distributed representation to make use of semantic information and to turn discrete random variables to continuous variables. And these two features let the network generalize much better than n-gram models.

Ⅱ. Improve computation efficiency

1. Represent the conditional probability with a tree structure

For each classification, it will go through $O(\log n)$ nodes in the tree structure. This makes the network use much fewer parameters to process.

2. Parallel implementation

In this paper, data-based parallel implementation and parameter-based implementation are mentioned.

Ⅲ. Relating to Strong AI

1. Taking advantage of prior knowledge

In this paper, they take advantage of the semantic information.

2. Decompose the whole network into smaller parts

This can make computation faster and easier to adapt it to other works. In Opening the black box of Deep Neural Networks via Information, it’s said that a large amount of computation is used to compression of input to effective representation. So if we can modularize the network and set up a set of general APIs, it can make a huge difference in practical implementation.

Ⅳ. Paper involves

  1. Bengio and J-S. Senécal. Quick training of probabilistic neural nets by importance sampling. In AISTATS, 2003
  2. Berger, S. Della Pietra, and V. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22:39–71, 1996

Perceptrons

Definition

A two-class classifier.
A Feedforward Neural Network without hidden layers.

More specifically: It maps its input $x$ to output $f(x)$ with parameter $\theta$:
$$
\begin{equation}
f(x) = \begin{cases}
0 & \text{if }x \cdot \theta > 0 \\
1 & \text{otherwise}
\end{cases}
\end{equation}
$$

Learning Algorithm

  1. Random initialize $\theta$
  2. For each example $i$ in the training set, update parameter $\theta$ as follow:
    $$\theta = \theta + (y_i – f(\theta \cdot x_i)) \cdot x_i$$
  3. Repeat step 2 until the classifier classifies most examples correctly.

Properties

  1. Let’s change a view. Let each example donates a constraint. And the expected $\theta$ is at the intersection of those constraints. If you know Linear Programming, then you will see that this is actually a Half-Plane Intersection Problem.
  2. It’s clear to see that if $\theta_1, \theta_2$ is legal, then $a\theta_1 + b\theta_2$ is legal when $a + b = 1, a, b \ge 0$. It easy to proof: Half-Plane is a convex set. The intersection of some convex sets is convex. And this property can help to prove the convergence of this algorithm.
  3. If you know how to solve Linear Regression Problems with Gradient Descent, then you will know that sometimes we may pass the best solution if we choose a learning rate which is too large. In this algorithm, the same problem exists. So this algorithm will converge. But it may not converge to a solution which fit the dataset perfectly. So consider “generously feasible solution” that lies within the feasible region by a margin at least as great as the length of the input vector that defines each constraint plane. The algorithm can only be proved that it will converge to the “generously feasible solution”. Thus the solution in the proof of convergence below means “generously feasible solution”.

Convergence

As you can see from the definition, Perceptron is a linear classifier. Thus, if the dataset is not linearly separable, this algorithm will not converge.

If you can imagine plotting the dataset to a plane(space), with knowledge of linear algebra you can easily have an intuition that we are actually adjusting the decision boundary(separating hyperplane) according to every single example. And each iteration makes the parameter $\theta$ better. Actually, this algorithm will converge. And here is the proof:

  1. Because the feasible solution is a convex set, the modification made from other examples won’t make the decision boundary worse.
  2. As for a single misclassified example $i$, each modification will change the value of $\theta \cdot x$ by $||x_i||^2$. And the total value need to be correct is $|\theta \cdot x_i|$. Thus, the maximum number of iteration before the classifier makes the right classification for each example is $O(\max\limits_i{(\frac{|\theta \cdot x_i|}{||x_i||^2})})$

And here is a more rigorous proof, read it if you like: http://leijun00.github.io/2014/08/perceptron/

Disadvantages

The most well-known disadvantage of this algorithm is that it can’t simulate the XOR Function. But actually, there are more general theorems, like “Group Invariance Theorem”. So I decide to read Perceptrons: an introduction to computational geometry(Minsky & Papert,1969) first. Then I will come back to finish this part.

—————————— UPD 2017.8.15 ——————————
I thought Perceptrons: an introduction to computational geometry is a paper. But it’s actually a book with 292 pages! So I give it up to read it for now. Maybe I will read it in university?

Summary of “Machine Learning by Andrew NG”

Preface

I found this course from zhihu. Lots of people recommend this course as “the best way to start with Machine Learning”. So I spent two weeks to finish this course. After finishing it, I found it a great course as well! So here’s the link: https://www.coursera.org/learn/machine-learning/home/welcome. It won’t cost you so much time(about 50 hours are enough), but will lead you to a new world.

Problems in this course

According to wikipedia, here are 5 subfields:
1. Classification: To divide inputs to known classes.
2. Regression: To estimate the relationships among variables.
3. Clustering: To divide inputs to classes. Unlike in classification, the groups are not known beforehand.
4. Density estimation: To find the distribution of inputs in some space.
5. Dimensionality reduction: To simplify inputs by mapping them into a lower-dimensional space.

In this course, all these 5 topics are involved.

Algorithms in this course

  1. Gradient Descent: A powerful algorithm to solve Classification Problems and (Linear) Regression Problems. This algorithm use derivative of the cost function to minimize the cost function.
  2. Stochastic Gradient Descent: A variant of Gradient Descent. When dealing with a large amount of data, it’s much faster than Gradient Descent. But it’s a little bit harder to converge.
  3. Mini-Batch Gradient Descent: A variant of Gradient Descent. It cost less time to complete a single iteration than Gradient Descent, but slower than Stochastic Gradient Descent. But it can fit data better than Stochastic Gradient Descent. Actually you can regard this algorithm as a compromise between the original Gradient Descent and Stochastic Gradient Descent.
  4. Collaborative Filtering: A variant of Gradient Descent. It’s often used in Recommender system.
  5. Normal Equation: A great way to solve Linear Regression Problems. It use numerical tricks to fit the data perfectly. In this algorithm we have to compute the inverse of a matrix, which can be solved in $O(n^3)$. So this algorithm can’t deal with datasets with so much features.
  6. Support vector machine(SVM): A powerful tool to solve Classification Problems and (Linear) Regression Problems. In this course, Andrew explains the application of this algorithm in classification problems, and it can be described as a Large Margin Classifier. Furthermore the cost function of SVMs is convex, so it won’t be trapped in the local optimum. Moreover with the “Kernel trick”, it can fit nonlinear hypothesis well.
  7. Neural Network(Backpropagation): The most popular algorithm in Machine Learning. Neural networks try to simulate our brain, so it’s believed as the most possible way to build strong AI. And Backpropagation use derivative of the cost function to minimize the cost function. It’s easy to learn, and perform well on many problems.
  8. K-Means Algorithm: This algorithm try to find patterns in data by itself. It divides data to different unknown classes. It’s useful in analysis.
  9. (Multivariate)Gaussian Distribution Algorithm: An algorithm based Gaussian Distribution to solve Density Estimation Problems. Widely used in Anomaly Detection.

Useful Tricks

  1. Feature Scaling: Scale data to make algorithms work better. Widely used in Gradient Descent and other algorithms.
  2. One-vs-All: This trick allows you to do very little modification on your two-class classifier to make it a multi-class classifier.
  3. Regularization: It’s the most useful way to solve overfitting problems.
  4. Gradient Check: An easy numerical way to determine whether your implement of cost function is bug-free.
  5. Random Initialization: A necessary part of Neural Network. And Random Initialize for several times is also a good way to increase the possibility to find global optimum rather than local optimum.
  6. Train/Validation/Test set: A way to assign your dataset. It’s widely used in almost every single algorithm.
  7. Learning Curve: A good way to evaluate your algorithm. And it can help you to decide how to improve your algorithm.
  8. Precision/ Recall/ $F_1$ Score: A good way to evaluate your algorithm, especially when your dataset it skewed.
  9. Principal Component Analysis(PCA): A good way to compress your data. It can reduce the number of principal components. This can speed up your algorithm. Also it can help you to visualize your data.
  10. Ceiling Analysis: A way to the pipeline of your Machine Learning system. It can help you to decide which component to optimize worth the most.

Important Ideas

  1. Build a naive system as fast as possible. Optimize your system later.
  2. Do analyze your system. Let the result of analysi tell you what to do next instead of intuition.