# Newton’s method

## Taylor series

A Taylor series is a representation of a function as an infinite sum of terms that are calculated from the values of the function’s derivatives at a single point

$$f(a) \approx \sum\limits_{n=0}^{\infty}{\frac{f^{(n)}(a)}{n!}(x-a)^n}$$

where $f^{n}(a)$ donates the $n^{th}$ derivative of $f$ evaluated at the point $a$.

And here’s a very intuitive example:

The exponential function $e^x$ (in blue), and the sum of the first $n + 1$ terms of its Taylor series at $0$ (in red).

## Newton’s method

### Overview

In calculus, Newton’s method is an iterative method for finding the roots of a differentiable function $f$.

In machine learning, we apply Newton’s method to the derivative $f’$ of the cost function $f$.

### One-dimension version

In the one-dimensional problem, Newton’s method attempts to construct a sequence ${x_n}$ from an initial guess $x_0$ that converges towards some value $\hat{x}$ satisfying $f'(\hat{x})=0$. In another word, we are trying to find a stationary point of $f$.

Consider the second order Taylor expansion $f_T(x)$ of $f$ around $x_n$ is

$$f_T(x)=f_T(x_n+\Delta x) \approx f(x_n) + f'(x_n) \Delta x + \frac{1}{2}f”(x_n) \Delta x^2$$

So now, we are trying to find a $\Delta x$ that sets the derivative of this last expression with respect to $\Delta x$ equal to zero, which means

$$\frac{\rm{d}}{\rm{d} \Delta x}(f(x_n) + f'(x_n) \Delta x + \frac{1}{2}f”(x_n) \Delta x^2) = f'(x_n)+f”(x_n)\Delta x = 0$$

Apparently, $\Delta x = -\frac{f'(x_n)}{f”(x_n)}$ is the solution. So it can be hoped that $x_{n+1} = x_n+\Delta x = x_n – \frac{f'(x_n)}{f”(x_n)}$ will be closer to a stationary point $\hat{x}$.

### High dimensional version

Still consider the second order Taylor expansion $f_T(x)$ of $f$ around $x_n$ is

$$f_T(x)=f_T(x_n+\Delta x) \approx f(x_n) + \Delta x^{\mathsf{T}} \nabla f(x_n) + \frac{1}{2} {\rm H} f(x_n) \Delta x (\Delta x)^{\mathsf{T}}$$

where ${\rm H} f(x)$ donates the Hessian matrix of $f(x)$ and $\nabla f(x)$ donates the gradient. (See more about Taylor expansion at https://en.wikipedia.org/wiki/Taylor_series#Taylor_series_in_several_variables)

So $\Delta x = – [{\rm H}f(x_n)]^{-1}\nabla f(x_n)$ should be a good choice.

## Limitation

As is known to us all, the time complexity to get $A^{-1}$ given $A$ is $O(n^3)$ when $A$ is a $n \times n$ matrix. So when the data set is of too many dimensions, the algorithm will work quite slow.