## 1. TL;DR

• Force the length (norm) of each feature vector to represent the age of the person at the time the photo was taken.
• After normalization, the features are then age-invariant.

## 2. Motivation

• The aging process changes a face in many ways (shape changes, texture changes, etc.), and these changes lead to large intra-class variation.
• In the field of age-invariant face recognition (AIFR), conventional solutions model the age feature and the identity feature simultaneously. Because the mixed features are usually hard to decompose, this mixing potentially reduces the robustness of recognizing cross-age faces.

## 3. Contribution

• They proposed a new approach called OE-CNNs.
• Specifically, they decompose deep face features into two orthogonal components that represent age-related and identity-related features.
• In this way, the identity-related features are age-invariant.
• They also built a new dataset called the Cross-Age Face dataset (CAF).
• This dataset includes about 313,986 face images from 4,668 identities. Each identity has approximately 80 face images.
• They cleaned the data manually, and the images are spread fairly evenly across ages.
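The decomposition described above can be sketched as follows. This is a minimal sketch, assuming (as in the TL;DR) that the age-related component is the feature norm and the identity-related component is the unit-length direction. The function name `decompose`, the `eps` guard, and the toy features are hypothetical and not taken from the paper.

```python
import numpy as np

def decompose(features, eps=1e-12):
    """Split each row into a radial part (its norm) and an angular part (a unit vector)."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    directions = features / np.maximum(norms, eps)  # eps guards against zero vectors
    return norms[:, 0], directions

# Two hypothetical features pointing the same way but with different lengths,
# standing in for the same identity photographed at two different ages.
same_person = np.array([[3.0, 4.0], [6.0, 8.0]])
radial, angular = decompose(same_person)
```

In this toy case the two rows of `angular` coincide while `radial` differs, which is the sense in which the normalized feature no longer carries the age information.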

## 4. Experiments & Results

• Experiments on the MORPH Album 2 Dataset

• Experiments on the LFW Dataset

• Results on the FG-NET and CACD-VS datasets are also available in the paper.

## 5. Questions & Thoughts

• What if we embed the features onto a plane instead of a sphere and let the height represent the age?
• Age can be replaced by other attributes of the image, such as pose or blur.
• The structure of their network can be adapted to GridFace.v7.relative_confidence_coefficient.
• The idea is actually quite similar to using an auto-encoder to do the transfer.
• Compared with A-Softmax, forcing the length of the features to represent the person’s age actually restricts the capacity of the model, which reduces the risk of over-fitting.

## Why not use a single CNN

• There is an intuitive interpretation for this: “If you can discriminate a person from his eyes, why should I look at his nose?”

• So when we cut pictures into different patches and train with different CNNs, actually we are forcing the network to learn every single piece of information.

• And in practice, we are using this trick.

## A-Softmax

• $L = \frac{1}{N} \sum\limits_{i}{-\log\left(\frac{e^{ \| x_i \| \psi(\theta_{y_i,i})}}{e^{ \| x_i \| \psi(\theta_{y_i,i})} + \sum\limits_{j|j \ne y_i}{e^{ \| x_i \| \cos(\theta _{j,i})}}}\right)}$

• $\psi(\theta_{y_i,i})=(-1)^k\cos(m\theta_{y_i,i})-2k$, $\theta_{y_i,i}\in[\frac{k\pi}{m}, \frac{(k+1)\pi}{m}]$ and $k \in [0, m-1], \ m \in \mathbb{Z}^+$

• The plot of $\psi(\theta_{y_i,i})$
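The piecewise definition of $\psi$ above can be evaluated numerically. This is a minimal sketch: the function name `psi` and the default `m=4` are choices made here for illustration, and $k$ is recovered from $\theta$ as $\lfloor m\theta/\pi \rfloor$, capped at $m-1$.

```python
import numpy as np

def psi(theta, m=4):
    """psi(theta) = (-1)^k * cos(m * theta) - 2k on [k*pi/m, (k+1)*pi/m]."""
    theta = np.asarray(theta, dtype=float)
    # Index of the interval that theta falls into, capped so theta = pi maps to k = m - 1.
    k = np.minimum(np.floor(theta * m / np.pi), m - 1)
    sign = np.where(k % 2 == 0, 1.0, -1.0)
    return sign * np.cos(m * theta) - 2.0 * k

thetas = np.linspace(0.0, np.pi, 1000)
values = psi(thetas)  # decreases monotonically from 1 at theta = 0
```

With `m=1` the function reduces to the ordinary $\cos(\theta)$ used in the plain softmax logit.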

## Normalizing the weight could reduce the prior caused by the training data imbalance

• Suppose we use a neural network to extract a $1D$ feature $f_i$ for each sample $i$ in the dataset and use Softmax to evaluate our network. To make our analysis easier, we normalize our features. Suppose there are only two classes in the dataset. There are $m$ samples in class 1 and $n$ samples in class 2. When our network is strong enough, our features are distributed at both ends of the diameter of the unit circle.

• Without bias terms, the loss function can be written as:
$L = -\sum\limits_{i = 1}^{m + n}\sum\limits_{j = 1}^{2}{a_{i,j} \log(p_{i,j})}$, where $p_{i,j}$ is the probability that sample $i$ belongs to class $j$ (generated by the softmax) and $a_{i,j} = [\text{sample } i \text{ belongs to class } j]$. Let $w_j$ denote the weights in the softmax layer for class $j$.

• Then, $\frac{\partial L}{\partial w_1} = (m - 1) \sum\limits_{i|a_{i,1}=1}{f_i} = m (m - 1)$.

• And, $\frac{\partial L}{\partial w_2} = (n - 1) \sum\limits_{i|a_{i,2}=1}{f_i} = n (n - 1)$.

• If $m = 100$ while $n = 1$, then $\frac{\partial L}{\partial w_1} / \frac{\partial L}{\partial w_2} \approx 10000$. The derivative with respect to $w_1$ is clearly much larger than the derivative with respect to $w_2$. This rough argument suggests why the more samples a class has, the larger the norm of its weights tends to be.

## Biases are useless for softmax

• In this paper, they run an experiment on MNIST to show empirically that biases are not necessary for softmax.

• In practice, biases are indeed useless.

## Understanding “the prior that faces also lie on a manifold”

• As described in NormFace, the feature distribution learned with softmax is ‘radial’. So after normalization, the features lie on a very thick line on a hypersphere. That’s why Euclidean distance on the raw features fails while cosine similarity works well.

• This is also an interpretation of why the Euclidean margin is incompatible with the softmax loss.
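To make the ‘radial’ picture concrete, here is a small sketch with two made-up features that share a direction but differ in length. The vectors and the helper name `cosine_similarity` are illustrative assumptions, not values from NormFace.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

direction = np.array([3.0, 4.0]) / 5.0  # a unit vector
f_small = 1.0 * direction               # feature with a small norm
f_large = 10.0 * direction              # same direction, much larger norm

euclidean_distance = float(np.linalg.norm(f_small - f_large))
cosine = cosine_similarity(f_small, f_large)
```

The Euclidean distance between the two features is large even though they lie on the same ray, while the cosine similarity treats them as identical.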

## Other Points

1. Closed-set FR can be well addressed as a classification problem. Open-set FR is a metric learning problem.

2. The key criterion for metric learning in FR: the maximal intra-class distance is smaller than the minimal inter-class distance.

3. Separable $\ne$ discriminative, and softmax features are only separable.
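The criterion in point 2 can be checked directly on a set of labelled features. This is a minimal sketch using Euclidean distance. The function name is hypothetical, and it assumes at least two classes and at least one class with two or more samples.

```python
import numpy as np

def satisfies_margin_criterion(features, labels):
    """True if the largest intra-class distance is below the smallest inter-class distance."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples.
    diffs = features[:, None, :] - features[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)
    same_class = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    max_intra = dist[same_class & not_self].max()
    min_inter = dist[~same_class].min()
    return bool(max_intra < min_inter)

compact = satisfies_margin_criterion([[0, 0], [0, 1], [5, 5], [5, 6]], [0, 0, 1, 1])
stretched = satisfies_margin_criterion([[0, 0], [0, 4], [1, 0], [1, 4]], [0, 0, 1, 1])
```

The second toy set is still linearly separable, but its classes are stretched so that same-class samples are farther apart than different-class ones, which illustrates point 3.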

## Preface

This paper is entirely empirical. You can hardly find any mathematical analysis, so every conclusion is drawn from experimental results.

## The main gaps between neuroscience models and ML models

1. Neurons encode information in a sparse and distributed way. This corresponds to a trade-off between the richness of representation and small action potential energy expenditure. However, without additional regularization, ordinary feed-forward neural networks do not have this property. This is not only biologically implausible but also hurts gradient-based optimization.
2. Sigmoid and tanh are equivalent up to a linear transformation (shifting and rescaling the axes turns one into the other). The tanh has a steady state at 0 and is therefore preferred from the optimization standpoint, but it forces an antisymmetry around 0, which is absent in biological neurons.

## Potential problems of rectifiers

1. The hard saturation at 0 may hurt optimization by blocking gradient back-propagation.
2. In order to efficiently represent symmetric or antisymmetric behaviour in the data, a rectifier network would need twice as many hidden units as a network of symmetric or antisymmetric activation functions.
3. Rectifier networks are subject to ill-conditioning of parametrization.
• Consider for each layer of depth $i$ of the network a positive scalar $\alpha_i$, and scale the parameters as $w_i'=\frac{w_i}{\alpha_i}$ and $b_i'=\frac{b_i}{\prod_{j=1}^{i}{\alpha_j}}$. The output unit values then change as follows: $s'=\frac{s}{\prod_{j=1}^{n}{\alpha_j}}$, where $n$ is the number of layers.
• Therefore, as long as $\prod_{j=1}^{n}{\alpha_j}=1$, the network function is identical.
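The rescaling argument above can be checked numerically. This is a minimal sketch, assuming a small fully connected rectifier network with random weights. The layer sizes, the helper names `forward` and `rescale`, and the particular $\alpha_i$ values are arbitrary choices for the demo, and the scalars are taken to be positive so that they commute with the rectifier.

```python
import numpy as np

def forward(weights, biases, x):
    """Plain rectifier network: h_i = max(0, W_i h_{i-1} + b_i)."""
    h = x
    for W, b in zip(weights, biases):
        h = np.maximum(W @ h + b, 0.0)
    return h

def rescale(weights, biases, alphas):
    """Apply W_i' = W_i / alpha_i and b_i' = b_i / prod_{j<=i} alpha_j."""
    cumulative = np.cumprod(alphas)
    new_weights = [W / a for W, a in zip(weights, alphas)]
    new_biases = [b / c for b, c in zip(biases, cumulative)]
    return new_weights, new_biases

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]  # arbitrary layer widths for the demo
weights = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(3)]
biases = [rng.normal(size=sizes[i + 1]) for i in range(3)]
x = rng.normal(size=4)

alphas = np.array([2.0, 0.25, 2.0])  # positive scalars whose product is 1
scaled_weights, scaled_biases = rescale(weights, biases, alphas)
original_output = forward(weights, biases, x)
rescaled_output = forward(scaled_weights, scaled_biases, x)
```

The two outputs agree up to floating-point error even though the individual weight matrices differ by large factors. This is why the parametrization is ill-conditioned: very different parameter values describe the same function.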

1. Rectifiers are not only biologically plausible, they are also computationally efficient.
2. There’s almost no improvement when using unsupervised pre-training with rectifier activations.
3. Networks trained with the rectifier activation function can find local minima of greater or equal quality than those obtained with its smooth counterpart, the softplus.
4. Rectifier networks are truly sparse.
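The difference in sparsity between the rectifier and the softplus can be seen on random pre-activations. This is a minimal sketch with standard-normal inputs, which are an assumption made for the demo and not the paper's setup.

```python
import numpy as np

def rectifier(x):
    """max(0, x): outputs exact zeros for negative inputs."""
    return np.maximum(x, 0.0)

def softplus(x):
    """log(1 + e^x), computed stably: strictly positive for finite inputs."""
    return np.logaddexp(0.0, x)

rng = np.random.default_rng(0)
pre_activations = rng.normal(size=10_000)

rectifier_zero_fraction = float(np.mean(rectifier(pre_activations) == 0.0))
softplus_zero_fraction = float(np.mean(softplus(pre_activations) == 0.0))
```

Roughly half of the rectifier outputs are exactly zero here, while the softplus outputs are small for negative inputs but never exactly zero.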

## What did they do in this paper?

• They greatly improved the result on LFW by improving the alignment and representation.

## How did they do this?

1. They apply 2D and 3D alignments to frontalize the picture.

2. They carefully customized the structure of the descriptor (a CNN).

## Highlights

1. 3D Alignment

2. Good performance on many datasets

3. Well hand-crafted DNN structures (those locally connected layers)

4. Use a large training set to overcome overfitting

## Q&A

1. Q: What are the details of the 2D alignment?

A: Step 1: Detect 6 fiducial points. Step 2: Use a regressor to apply one or more affine transformations.

2. Q: What are the details of the 3D alignment?

A:

| Step | Detail |
| --- | --- |
| 1 | Generate a generic 3D model of 67 anchor points from the USF Human-ID database |
| 2 | Fine-tune a unique 3D model for each image using a residual vector and covariance matrix(?) |
| 3 | Locate 67 fiducial points in each picture |
| 4 | Apply triangulation to both the anchor points and the fiducial points |
| 5 | For each pair of corresponding triangles, apply 2D alignment |

3. Q: Why is there an $L^2$ normalization?

A: To cope with changes in illumination. There is another use as well: when you use softmax and the data is fairly blurred, the descriptor tends to assign $\vec 0$ to every picture as its representation in order to minimize the loss function. To avoid this problem, we set all the representations to length 1.

## Appendix

1. ReLU has a great ability to learn. In practice, ReLU and $\tanh$ are the most commonly used activations in CV. In some cases, a variant of ReLU such as PReLU can do even better.

2. In the 3D alignment literature, PCA is widely used to fine-tune a unique 3D model for each picture, and a 3D engine can be used to generate data to train neural networks. In some cases, some triangles in the triangulation are invisible, so you need to guess what they look like. Here adversarial models are recommended, because conventional solutions usually produce a blurred image while adversarial models produce clear images with sharp edges.

3. In 2D alignment, affine and projective transformations are applied. A 2D affine transformation has only 6 parameters, so least squares is the most common way to fit it. Projective transformations are a different story. Though there are only 2 more parameters, a projective transformation can greatly change the shape of a face (an affine transformation does the same, but not as much). So we usually combine the transformation with the whole network. We don’t know exactly how much the transformation contributes, but as we only care about the result, it works and it works well. So, yes, this is still a black box. Moreover, there is a network called Spatial Transformer Networks (STN) that actually does the projective transformation’s job.

4. When doing the alignment, for example with an affine transformation, interpolation may be needed. There are two common approaches: 1. use the nearest pixel as a substitute; 2. use a linear combination of the four closest pixels (bilinear interpolation).

5. In face recognition, we can keep ROC good by limiting the angle of the face.
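Points 3 and 4 above can be sketched together: fit an affine transformation to point correspondences by least squares, then sample an image at non-integer coordinates with bilinear interpolation. This is a minimal sketch, not the pipeline from the paper. The function names `fit_affine` and `bilinear_sample`, the border clamping, and the toy data are assumptions made for illustration.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2x3 affine matrix M such that dst ~ M @ [x, y, 1]."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    # Homogeneous source coordinates, one row [x, y, 1] per point.
    A = np.hstack([src, np.ones((len(src), 1))])
    params, *_ = np.linalg.lstsq(A, dst, rcond=None)
    return params.T  # shape (2, 3): six parameters in total

def bilinear_sample(image, x, y):
    """Interpolate a 2D image at real-valued (x, y) from its four nearest pixels."""
    h, w = image.shape
    # Clamp to the image so the neighbouring indices stay valid.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = (1 - dx) * image[y0, x0] + dx * image[y0, x1]
    bottom = (1 - dx) * image[y1, x0] + dx * image[y1, x1]
    return float((1 - dy) * top + dy * bottom)

# Toy correspondences generated from a known affine map (hypothetical values).
true_M = np.array([[1.1, 0.2, 3.0], [-0.1, 0.9, -2.0]])
src_points = np.array([[0, 0], [1, 0], [0, 1], [2, 3], [4, 1], [3, 5]], dtype=float)
dst_points = src_points @ true_M[:, :2].T + true_M[:, 2]
estimated_M = fit_affine(src_points, dst_points)

tiny_image = np.array([[0.0, 1.0], [2.0, 3.0]])
center_value = bilinear_sample(tiny_image, 0.5, 0.5)
```

With noise-free correspondences the fit recovers the generating matrix. With noisy landmarks it returns the least-squares compromise instead.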

## Taylor series

A Taylor series is a representation of a function as an infinite sum of terms that are calculated from the values of the function’s derivatives at a single point:

$$f(x) \approx \sum\limits_{n=0}^{\infty}{\frac{f^{(n)}(a)}{n!}(x-a)^n}$$

where $f^{(n)}(a)$ denotes the $n^{th}$ derivative of $f$ evaluated at the point $a$.

And here’s a very intuitive example:

The exponential function $e^x$ (in blue), and the sum of the first $n + 1$ terms of its Taylor series at $0$ (in red).
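The same example can be reproduced numerically. This is a minimal sketch of the partial sums of the Taylor series of $e^x$ at $0$. The function name `exp_taylor` is chosen here for illustration.

```python
import math

def exp_taylor(x, n_terms):
    """Sum of the first n_terms terms of the Taylor series of e^x at 0."""
    total, term = 0.0, 1.0  # term starts as x^0 / 0!
    for n in range(n_terms):
        total += term
        term *= x / (n + 1)  # turn x^n / n! into x^(n+1) / (n+1)!
    return total

# Absolute error at x = 2 for a growing number of terms.
errors = [abs(exp_taylor(2.0, n) - math.exp(2.0)) for n in (2, 5, 10, 20)]
```

The error shrinks quickly as more terms are added, which matches the way the red curve approaches the blue one in the figure.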

## Newton’s method

### Overview

In calculus, Newton’s method is an iterative method for finding the roots of a differentiable function $f$.

In machine learning, we apply Newton’s method to the derivative $f'$ of the cost function $f$.

### One-dimension version

In the one-dimensional problem, Newton’s method attempts to construct a sequence $\{x_n\}$ from an initial guess $x_0$ that converges towards some value $\hat{x}$ satisfying $f'(\hat{x})=0$. In other words, we are trying to find a stationary point of $f$.

The second-order Taylor expansion $f_T(x)$ of $f$ around $x_n$ is

$$f_T(x)=f_T(x_n+\Delta x) \approx f(x_n) + f'(x_n) \Delta x + \frac{1}{2}f''(x_n) \Delta x^2$$

We now look for a $\Delta x$ that sets the derivative of this expression with respect to $\Delta x$ to zero, which means

$$\frac{\rm{d}}{\rm{d} \Delta x}\left(f(x_n) + f'(x_n) \Delta x + \frac{1}{2}f''(x_n) \Delta x^2\right) = f'(x_n)+f''(x_n)\Delta x = 0$$

Clearly, $\Delta x = -\frac{f'(x_n)}{f''(x_n)}$ is the solution (assuming $f''(x_n) \ne 0$). So it can be hoped that $x_{n+1} = x_n+\Delta x = x_n - \frac{f'(x_n)}{f''(x_n)}$ will be closer to a stationary point $\hat{x}$.
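The one-dimensional update can be sketched in a few lines. This is a minimal sketch, assuming the caller supplies $f'$ and $f''$ as functions. The name `newton_stationary_point`, the stopping rule, and the test function $f(x) = e^x - 2x$ are choices made here for illustration.

```python
import math

def newton_stationary_point(df, d2f, x0, tol=1e-10, max_iter=50):
    """Iterate x <- x - f'(x) / f''(x) until the step is smaller than tol."""
    x = x0
    for _ in range(max_iter):
        step = -df(x) / d2f(x)  # assumes f''(x) != 0 along the way
        x += step
        if abs(step) < tol:
            break
    return x

# f(x) = e^x - 2x has f'(x) = e^x - 2 and f''(x) = e^x,
# so its only stationary point is x = ln 2.
x_hat = newton_stationary_point(lambda x: math.exp(x) - 2.0, math.exp, x0=0.0)
```

Starting from $x_0 = 0$, the iterates settle on $\ln 2$ within a handful of steps. A poor starting point or a vanishing $f''$ can still make the iteration diverge.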

### High dimensional version

Again, the second-order Taylor expansion $f_T(x)$ of $f$ around $x_n$ is

$$f_T(x)=f_T(x_n+\Delta x) \approx f(x_n) + \Delta x^{\mathsf{T}} \nabla f(x_n) + \frac{1}{2} \Delta x^{\mathsf{T}} \, {\rm H} f(x_n) \, \Delta x$$

where ${\rm H} f(x)$ denotes the Hessian matrix of $f(x)$ and $\nabla f(x)$ denotes the gradient. (See more about Taylor expansion at https://en.wikipedia.org/wiki/Taylor_series#Taylor_series_in_several_variables)

Setting the gradient of this expression with respect to $\Delta x$ to zero gives $\nabla f(x_n) + {\rm H} f(x_n) \Delta x = 0$, so $\Delta x = - [{\rm H}f(x_n)]^{-1}\nabla f(x_n)$ should be a good choice.
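A single multi-dimensional Newton step can be sketched as follows. This is a minimal sketch, assuming the gradient and Hessian are available as functions. It solves the linear system ${\rm H}f(x_n)\,\Delta x = -\nabla f(x_n)$ rather than forming the inverse explicitly, and the quadratic test function is a made-up example.

```python
import numpy as np

def newton_step(grad, hess, x):
    """Return x + dx, where dx solves H f(x) dx = -grad f(x)."""
    return x - np.linalg.solve(hess(x), grad(x))

# Quadratic example f(x) = 0.5 * x^T A x - b^T x with a symmetric positive definite A:
# the gradient is A x - b and the Hessian is A, so one step lands on the minimizer.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
hess = lambda x: A

x1 = newton_step(grad, hess, np.zeros(2))
```

For a quadratic the second-order expansion is exact, so a single step reaches the stationary point. For general functions the step is repeated until the gradient is small.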

## Limitation

Inverting an $n \times n$ matrix $A$ takes $O(n^3)$ time with standard methods. So when the problem has too many dimensions, the algorithm becomes quite slow.

The reason steepest descent goes wrong is that the gradient for one weight gets messed up by simultaneous changes to all the other weights.

The Hessian matrix determines the sizes of these interactions, so Newton’s method uses it to minimize their effect as much as possible.

## Auto Encoder vs. PCA (again)

I tried to use an auto-encoder and PCA to do dimensionality reduction.
The dataset is from House Prices: Advanced Regression Techniques.
I transformed the data from 79D to 30D, and then reconstructed the data back to 79D.
Here’s the result:

As you can see, PCA even did a better job.
My conclusion for now is that an auto-encoder is good at capturing several different patterns, while PCA is the better choice when fitting a single pattern.
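The PCA half of this experiment can be sketched as follows. This is a minimal sketch that uses random synthetic data as a stand-in for the House Prices features, so the numbers it produces say nothing about the real dataset. The function name `pca_reconstruct` and the data shapes are assumptions made for the demo.

```python
import numpy as np

def pca_reconstruct(X, k):
    """Project X onto its top-k principal components and map back to the input space."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions, sorted by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    codes = Xc @ components.T         # reduced k-dimensional representation
    return codes @ components + mean  # reconstruction in the original dimension

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 samples with 79 correlated features.
X = rng.normal(size=(200, 79)) @ rng.normal(size=(79, 79))

X_hat = pca_reconstruct(X, 30)
mse_30 = float(np.mean((X - X_hat) ** 2))
mse_60 = float(np.mean((X - pca_reconstruct(X, 60)) ** 2))
```

An auto-encoder with a 30-unit bottleneck could be compared against `mse_30` on the same data. Among linear encoders and decoders, PCA already gives the lowest squared reconstruction error for a given dimension.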