A two-class classifier.
A Feedforward Neural Network without hidden layers.

More specifically, it maps its input $x$ to an output $f(x)$ with parameter $\theta$:
$$f(x) = \begin{cases}
1 & \text{if }x \cdot \theta > 0 \\
0 & \text{otherwise}
\end{cases}$$

Learning Algorithm

  1. Randomly initialize $\theta$.
  2. For each example $i$ in the training set, update the parameter $\theta$ as follows:
    $$\theta = \theta + (y_i - f(x_i)) \cdot x_i$$
  3. Repeat step 2 until the classifier classifies most examples correctly.
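
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `predict` and `train_perceptron`, the `max_epochs` cap, the fixed `seed`, and the constant third input component that acts as a bias are all assumptions made for the example. It assumes $f$ outputs 1 when $\theta \cdot x > 0$, and it stops only when a full pass makes no mistakes, which is stricter than "most examples".

```python
import random


def predict(theta, x):
    """Return 1 if theta . x > 0, otherwise 0."""
    return 1 if sum(t * xi for t, xi in zip(theta, x)) > 0 else 0


def train_perceptron(examples, labels, max_epochs=1000, seed=0):
    """Run the perceptron update rule until one full pass has no mistakes."""
    rng = random.Random(seed)
    # Step 1: random initialization of theta.
    theta = [rng.uniform(-1.0, 1.0) for _ in examples[0]]
    for _ in range(max_epochs):
        mistakes = 0
        # Step 2: theta = theta + (y_i - f(x_i)) * x_i for each example.
        for x, y in zip(examples, labels):
            error = y - predict(theta, x)
            if error != 0:
                mistakes += 1
                theta = [t + error * xi for t, xi in zip(theta, x)]
        # Step 3: stop once every example is classified correctly.
        if mistakes == 0:
            break
    return theta


# Logical AND, which is linearly separable. The trailing 1 is a bias input.
and_inputs = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
and_labels = [0, 0, 0, 1]
theta_and = train_perceptron(and_inputs, and_labels)
print([predict(theta_and, x) for x in and_inputs])
```

When the error $y_i - f(x_i)$ is zero the update leaves $\theta$ unchanged, so the explicit `if` only serves to count mistakes.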


  1. Let’s change our point of view. Each example defines a constraint on $\theta$, and the $\theta$ we are looking for lies in the intersection of those constraints. If you know Linear Programming, you will see that this is actually a Half-Plane Intersection Problem.
  2. If $\theta_1$ and $\theta_2$ are both legal (feasible), then $a\theta_1 + b\theta_2$ is legal whenever $a + b = 1$ and $a, b \ge 0$. This is easy to prove: a half-plane is a convex set, and the intersection of convex sets is convex. This property helps to prove the convergence of the algorithm.
  3. If you know how to solve Linear Regression problems with Gradient Descent, you know that a learning rate that is too large can make us step past the best solution. The same problem exists in this algorithm: an update can overshoot, so a single step does not necessarily bring $\theta$ closer to every solution that fits the dataset perfectly. So consider a “generously feasible solution”: one that lies within the feasible region by a margin at least as great as the length of the input vector that defines each constraint plane. Convergence can only be proved with respect to such solutions, so “solution” in the proof of convergence below means “generously feasible solution”.
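
The convexity claim in point 2 can be checked directly. As a worked sketch, assume an example $x$ with label 1 contributes the constraint $\theta \cdot x > 0$ (this sign convention is an assumption; the label-0 case is the same with $\le 0$). For legal $\theta_1, \theta_2$ and $a + b = 1$, $a, b \ge 0$:

$$(a\theta_1 + b\theta_2) \cdot x = a(\theta_1 \cdot x) + b(\theta_2 \cdot x) > 0$$

The inequality holds because both dot products are positive and $a$ and $b$ cannot both be zero. Since this is true for every constraint, the combination satisfies all of them.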


As you can see from the definition, Perceptron is a linear classifier. Thus, if the dataset is not linearly separable, this algorithm will not converge.
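
A small experiment illustrates the non-separable case. The sketch below uses XOR, a standard example of a dataset that no line can separate, and records how many mistakes each pass makes. The helper names, the zero initialization (chosen here instead of random initialization so the run is reproducible), the bias input, and the choice of 50 epochs are all assumptions for illustration. It assumes $f$ outputs 1 when $\theta \cdot x > 0$.

```python
def predict(theta, x):
    """Return 1 if theta . x > 0, otherwise 0."""
    return 1 if sum(t * xi for t, xi in zip(theta, x)) > 0 else 0


def count_mistakes_per_epoch(examples, labels, epochs):
    """Apply the perceptron update rule and record the mistakes in each pass."""
    theta = [0.0] * len(examples[0])  # zero start, an assumption for determinism
    history = []
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(examples, labels):
            error = y - predict(theta, x)
            if error != 0:
                mistakes += 1
                theta = [t + error * xi for t, xi in zip(theta, x)]
        history.append(mistakes)
    return history


# XOR with a constant bias input as the third component.
xor_inputs = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
xor_labels = [0, 1, 1, 0]
history = count_mistakes_per_epoch(xor_inputs, xor_labels, epochs=50)

# A pass with zero mistakes would mean theta fits XOR, which is impossible,
# so every entry in history is at least 1.
print(min(history))
```

However many epochs are run, the mistake count never reaches zero, so the updates never stop.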

If you imagine plotting the dataset in a plane (or a higher-dimensional space), a little linear algebra gives the intuition that we are adjusting the decision boundary (the separating hyperplane) according to every single example, and that each iteration makes the parameter $\theta$ better. In fact, when the dataset is linearly separable, this algorithm will converge. Here is the proof:

  1. Because the set of feasible solutions is convex, the updates made for other examples won’t make the decision boundary worse.
  2. For a single misclassified example $i$, each update changes the value of $\theta \cdot x_i$ by $||x_i||^2$, and the total amount that needs to be corrected is $|\theta \cdot x_i|$. Thus, the maximum number of iterations before the classifier classifies each example correctly is $O(\max\limits_i{(\frac{|\theta \cdot x_i|}{||x_i||^2})})$.
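
The $||x_i||^2$ in point 2 comes from a one-line calculation. As a worked sketch, assume example $i$ has $y_i = 1$ but is classified as 0, so the update is $\theta' = \theta + x_i$:

$$\theta' \cdot x_i = (\theta + x_i) \cdot x_i = \theta \cdot x_i + ||x_i||^2$$

Looking at this example in isolation, after $k$ such updates the value is $\theta \cdot x_i + k||x_i||^2$, which becomes positive once $k > \frac{|\theta \cdot x_i|}{||x_i||^2}$. The opposite mistake ($y_i = 0$, classified as 1) is symmetric: the update subtracts $x_i$ and the value drops by $||x_i||^2$ each time.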

And here is a more rigorous proof, read it if you like:


The most well-known disadvantage of this algorithm is that it can’t simulate the XOR function. But there are actually more general theorems, such as the “Group Invariance Theorem”. So I decided to read Perceptrons: An Introduction to Computational Geometry (Minsky & Papert, 1969) first. Then I will come back to finish this part.

—————————— UPD 2017.8.15 ——————————
I thought Perceptrons: An Introduction to Computational Geometry was a paper. But it’s actually a book with 292 pages! So I have given up on reading it for now. Maybe I will read it in university?
