Experiments on the MORPH Album 2 Dataset
Experiments on the LFW Dataset
Results on FG-NET Dataset and CACD-VS Dataset are also available in this paper.
There is an intuitive interpretation for this: “If you can discriminate a person from his eyes, why should I look at his nose?”
So when we cut pictures into different patches and train a separate CNN on each, we are actually forcing the network to learn every single piece of information. And this trick is indeed used in practice.
$\psi(\theta_{y_i,i})=(-1)^k\cos(m\theta_{y_i,i})-2k$, where $\theta_{y_i,i}\in\left[\frac{k\pi}{m}, \frac{(k+1)\pi}{m}\right]$, $k \in \{0, 1, \dots, m-1\}$, and $m \in \mathbb{Z}^+$
The plot of $\psi(\theta_{y_i,i})$
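The piecewise formula above can be sketched in a few lines of code (my own implementation: the segment index $k$ is recovered as $\lfloor m\theta/\pi \rfloor$, clamped to $m-1$ at $\theta=\pi$):

```python
import math

def psi(theta, m):
    """SphereFace's piecewise angular margin function:
    psi(theta) = (-1)^k * cos(m*theta) - 2k for theta in [k*pi/m, (k+1)*pi/m]."""
    k = min(int(theta * m / math.pi), m - 1)  # which segment theta falls in
    return (-1) ** k * math.cos(m * theta) - 2 * k

# psi decreases monotonically on [0, pi], from 1 down to -(2m - 1)
m = 4
samples = [psi(t * math.pi / 1000, m) for t in range(1001)]
print(samples[0], samples[-1])  # 1.0 and -7.0 for m = 4
```

Because of the $-2k$ shift, the pieces join continuously and the whole function is monotonically decreasing in $\theta$, which is what makes it a usable margin.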
Suppose we use a neural network to extract a $1D$ feature $f_i$ for each sample $i$ in the dataset and use Softmax to evaluate our network. To make our analysis easier, we normalize our features. Suppose there are only two classes in the dataset. There are $m$ samples in class 1 and $n$ samples in class 2. When our network is strong enough, our features are distributed at both ends of the diameter of the unit circle.
Without bias terms, the loss function can be written as:
$L = -\sum\limits_{i = 1}^{m + n}\sum\limits_{j = 1}^{2}{a_{i,j} \log(p_{i,j})}$, where $p_{i,j}$ is the probability (generated by the softmax) that sample $i$ belongs to class $j$, and $a_{i,j} = [\text{sample } i \text{ belongs to class } j]$ (an Iverson bracket). Let $w_i$ denote the weights in the softmax layer for class $i$.
Then, $\frac{\partial L}{\partial w_1} = (m - 1) \sum\limits_{i|a_{i,1}=1}{f_i} = m (m - 1)$.
And, $\frac{\partial L}{\partial w_2} = (n - 1) \sum\limits_{i|a_{i,2}=1}{f_i} = n (n - 1)$.
If $m = 100$ while $n = 2$, then $\frac{\partial L}{\partial w_1} / \frac{\partial L}{\partial w_2} = \frac{m(m-1)}{n(n-1)} \approx 5000$. It’s clear that the derivative with respect to $w_1$ is much bigger than the derivative with respect to $w_2$. That’s why the more samples a class has, the larger the norm of its associated weights tends to be.
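A toy numeric sketch of the qualitative claim (my own construction, not the exact expressions above): with 1-D normalized features and symmetric weights, the total gradient push each class exerts on its own weight scales linearly with the class size.

```python
import numpy as np

# Toy sketch: class 1 has m samples at f = +1, class 2 has n samples at f = -1
m, n = 100, 2
w = np.array([0.5, -0.5])  # symmetric softmax weights, no bias

def grad_contribution(f, label):
    """Per-sample dL/dw_j for softmax cross-entropy with a 1-D feature f."""
    logits = w * f
    p = np.exp(logits) / np.exp(logits).sum()
    onehot = np.eye(2)[label]
    return (p - onehot) * f

g1 = m * grad_contribution(+1.0, 0)  # class 1's total push on the weights
g2 = n * grad_contribution(-1.0, 1)  # class 2's total push on the weights

# the push class 1 exerts on w_1 is m/n times the push class 2 exerts on w_2
ratio = abs(g1[0]) / abs(g2[1])
print(ratio)
```

With the symmetric setup the per-sample gradient magnitudes are equal, so the ratio is exactly $m/n = 50$: the big class dominates its weight’s update.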
In this paper, they use an MNIST experiment to show empirically that the bias term is not necessary for softmax.
In practice, biases are indeed of little use here.
As described in NormFace, the feature distribution learned by softmax is ‘radial’: each class’s features spread along a direction from the origin. So after normalization, the features of a class collapse onto a compact region of the hypersphere. That’s why Euclidean distance fails while $\cos$ similarity works well.
This is also an interpretation of why the Euclidean margin is incompatible with softmax loss.
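On the unit hypersphere the two metrics are directly linked: $\|a-b\|^2 = 2 - 2\cos\theta$, so squared Euclidean distance between normalized features is a monotone function of cosine similarity. A quick check (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 128))
a /= np.linalg.norm(a)  # L2-normalize both features
b /= np.linalg.norm(b)

cos_sim = a @ b
sq_euclid = np.sum((a - b) ** 2)

# identity on the unit hypersphere: ||a - b||^2 = 2 - 2 * cos(theta)
print(sq_euclid, 2 - 2 * cos_sim)
```

So once features are normalized, ranking by Euclidean distance and ranking by cosine similarity give the same order.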
Closed-set FR can be well addressed as a classification problem. Open-set FR is a metric learning problem.
The key criterion for metric learning in FR: the maximal intra-class distance is smaller than the minimal inter-class distance.
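The criterion is easy to check numerically on toy data (my own sketch, with two well-separated synthetic classes):

```python
import numpy as np

rng = np.random.default_rng(1)
# two toy classes: tight clusters around distant centers
c1 = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
c2 = rng.normal(loc=5.0, scale=0.1, size=(20, 2))

def pairwise(a, b):
    """All pairwise Euclidean distances between rows of a and rows of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

max_intra = max(pairwise(c1, c1).max(), pairwise(c2, c2).max())
min_inter = pairwise(c1, c2).min()
print(max_intra, min_inter)  # the criterion holds: max_intra < min_inter
```

Features satisfying this criterion can be verified with a simple nearest-neighbor rule, which is exactly what open-set FR needs.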
Separable $\ne$ discriminative and softmax is only separable.
This paper is almost entirely empirical; you can hardly find any mathematical analysis. Therefore, every conclusion drawn here comes from experimental results.
They carefully customized the structure of the descriptor (a CNN).
3D Alignment
Good performance on many datasets
Well hand-crafted DNN structures (those locally connected layers)
Use a large training set to overcome overfitting
Q: The detail of 2D alignment?
A: Step 1: Detect 6 fiducial points. Step 2: Use a regressor to apply one or more affine transformations.
Q: The detail of 3D alignment?
A:
num | detail |
---|---|
1 | Generate a generic 3D model with 67 anchor points from the USF Human-ID database |
2 | Fine-tune a unique 3D model for each image using the residual vector and covariance matrix (?) |
3 | Locate 67 fiducial points in each picture |
4 | Apply triangulation to both the anchor points and the fiducial points |
5 | For each pair of corresponding triangles, apply a 2D alignment |
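Step 5 can be sketched as solving for the affine map that carries one triangle’s vertices onto the other’s (my own illustration, not DeepFace’s actual code — three point pairs determine the six affine parameters exactly):

```python
import numpy as np

def triangle_affine(src, dst):
    """Solve for the 2-D affine map A (2x3) with A @ [x, y, 1]^T = [x', y']^T
    from three point correspondences (one triangle pair)."""
    src = np.asarray(src, float)
    dst = np.asarray(dst, float)
    M = np.hstack([src, np.ones((3, 1))])  # 3x3 system: rows [x, y, 1]
    # solve M @ A.T = dst for the six affine parameters
    return np.linalg.solve(M, dst).T

src = [(0, 0), (1, 0), (0, 1)]
dst = [(2, 1), (4, 1), (2, 3)]  # the source triangle scaled by 2, translated
A = triangle_affine(src, dst)
pt = A @ np.array([1.0, 0.0, 1.0])  # maps vertex (1, 0) to (4, 1)
print(pt)
```

Applying one such map per triangle pair gives the piecewise-affine warp of the whole face.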
Q: Why is there an $L^2$ normalization?
A: To cope with changes in illumination. And there’s another use: when you use softmax and the data is fairly blurred, the descriptor tends to assign $\vec 0$ to every picture as its representation in order to minimize the loss function. So to fix this, we rescale every representation to unit length.
ReLU has a great ability to learn. In practice, ReLU and $\tanh$ are the most used activations in CV, and in some cases a variant of ReLU such as PReLU can do an even better job.
In the 3D alignment literature, PCA is widely used to fine-tune a unique 3D model for each picture, and a 3D engine can be used to generate data to train neural networks. In some cases, a triangle in the triangulation is invisible, and you need to guess what it looks like. Here adversarial models are recommended: conventional solutions usually produce a blurred image, while adversarial models produce clear images with sharp edges.
In 2D alignment, affine and projective transformations are applied. A 2D affine transformation has only 6 parameters, so least squares is the most common solution. When it comes to projective transformations, things are different. Though there are only 2 more parameters (8 in total), a projective transformation can greatly change the shape of a face (an affine transformation does the same, but not as much). So, usually, we combine the transformation with the whole net. Though we don’t know exactly what the transformation does, we only care about the result, and it works well. So, yes, this is still a black box. Moreover, there is a network called the Spatial Transformer Network (STN), and it is essentially doing the projective transformation’s job.
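The least-squares fit of the 6 affine parameters from fiducial-point correspondences can be sketched like this (a toy example of mine, with a made-up ground-truth transform and slightly noisy points):

```python
import numpy as np

rng = np.random.default_rng(0)
true_A = np.array([[1.1, 0.1, 2.0],
                   [-0.1, 0.9, 1.0]])  # toy ground-truth affine map (2x3)

# six fiducial-point correspondences with a little detection noise
src = rng.uniform(0, 100, size=(6, 2))
src_h = np.hstack([src, np.ones((6, 1))])          # homogeneous coords
dst = src_h @ true_A.T + rng.normal(scale=1e-3, size=(6, 2))

# least squares: pick the 6 parameters minimizing ||src_h @ A.T - dst||
A_hat, *_ = np.linalg.lstsq(src_h, dst, rcond=None)
A_hat = A_hat.T
print(np.abs(A_hat - true_A).max())  # close to the ground truth
```

With more than 3 correspondences the system is overdetermined, which is exactly why least squares is the natural tool here.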
When you are doing the alignment, for example with an affine transformation, interpolation may be needed. Usually there are two ways: 1. use the nearest pixel as a substitute; 2. use a linear combination of the four closest pixels (bilinear interpolation).
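The four-nearest-points option can be sketched as follows (my own minimal implementation, ignoring image borders):

```python
import numpy as np

def bilinear(img, x, y):
    """Sample img at a non-integer location (x, y) using a linear
    combination of the four nearest pixels, weighted by proximity."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * img[y0, x0] +
            dx * (1 - dy) * img[y0, x0 + 1] +
            (1 - dx) * dy * img[y0 + 1, x0] +
            dx * dy * img[y0 + 1, x0 + 1])

img = np.array([[0.0, 10.0],
                [20.0, 30.0]])
print(bilinear(img, 0.5, 0.5))  # center of the 2x2 patch: average 15.0
```

Nearest-neighbor is cheaper but produces blocky artifacts; bilinear is the usual default for warping faces.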
In face recognition, we can keep the ROC curve good by limiting the pose angle of the face.
A Taylor series is a representation of a function as an infinite sum of terms that are calculated from the values of the function’s derivatives at a single point
$$f(x) \approx \sum\limits_{n=0}^{\infty}{\frac{f^{(n)}(a)}{n!}(x-a)^n}$$
where $f^{(n)}(a)$ denotes the $n^{th}$ derivative of $f$ evaluated at the point $a$.
And here’s a very intuitive example:
The exponential function $e^x$ (in blue), and the sum of the first $n + 1$ terms of its Taylor series at $0$ (in red).
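The same example in code (my own sketch): the partial sums of the series at $0$ approach $e^x$ as more terms are added.

```python
import math

def exp_taylor(x, n_terms):
    """Partial sum of the Taylor series of e^x around 0: sum x^n / n!."""
    return sum(x ** n / math.factorial(n) for n in range(n_terms))

# more terms -> closer to the true exponential at x = 1
for n in (2, 5, 10):
    print(n, exp_taylor(1.0, n), math.exp(1.0))
```

With just 10 terms the partial sum at $x=1$ already agrees with $e$ to about six decimal places, since the factorial in the denominator makes the tail shrink very fast.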
In calculus, Newton’s method is an iterative method for finding the roots of a differentiable function $f$.
In machine learning, we apply Newton’s method to the derivative $f’$ of the cost function $f$.
In the one-dimensional problem, Newton’s method attempts to construct a sequence $\{x_n\}$ from an initial guess $x_0$ that converges towards some value $\hat{x}$ satisfying $f'(\hat{x})=0$. In other words, we are trying to find a stationary point of $f$.
Consider the second order Taylor expansion $f_T(x)$ of $f$ around $x_n$ is
$$f_T(x)=f_T(x_n+\Delta x) \approx f(x_n) + f'(x_n) \Delta x + \frac{1}{2}f''(x_n) \Delta x^2$$
So now, we are trying to find a $\Delta x$ that sets the derivative of this last expression with respect to $\Delta x$ equal to zero, which means
$$\frac{{\rm d}}{{\rm d}\, \Delta x}\left(f(x_n) + f'(x_n) \Delta x + \frac{1}{2}f''(x_n) \Delta x^2\right) = f'(x_n)+f''(x_n)\Delta x = 0$$
Apparently, $\Delta x = -\frac{f'(x_n)}{f''(x_n)}$ is the solution. So it can be hoped that $x_{n+1} = x_n+\Delta x = x_n - \frac{f'(x_n)}{f''(x_n)}$ will be closer to a stationary point $\hat{x}$.
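The 1-D update can be sketched directly (the cost function here is a toy convex example of my choosing, so a single stationary point exists):

```python
import math

def newton_min(f_prime, f_double_prime, x0, iters=20):
    """Find a stationary point of f via x_{n+1} = x_n - f'(x_n)/f''(x_n)."""
    x = x0
    for _ in range(iters):
        x = x - f_prime(x) / f_double_prime(x)
    return x

# toy cost: f(x) = x^2 + e^x, so f'(x) = 2x + e^x and f''(x) = 2 + e^x > 0
x_hat = newton_min(lambda x: 2 * x + math.exp(x),
                   lambda x: 2 + math.exp(x),
                   x0=0.0)
print(x_hat, 2 * x_hat + math.exp(x_hat))  # f'(x_hat) is ~0
```

Because the quadratic model is re-fit at every step, convergence near the solution is quadratic: the number of correct digits roughly doubles per iteration.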
Still consider the second order Taylor expansion $f_T(x)$ of $f$ around $x_n$ is
$$f_T(x)=f_T(x_n+\Delta x) \approx f(x_n) + \Delta x^{\mathsf{T}} \nabla f(x_n) + \frac{1}{2} \Delta x^{\mathsf{T}}\, {\rm H} f(x_n)\, \Delta x$$
where ${\rm H} f(x)$ denotes the Hessian matrix of $f(x)$ and $\nabla f(x)$ denotes the gradient. (See more about Taylor expansion at https://en.wikipedia.org/wiki/Taylor_series#Taylor_series_in_several_variables)
So $\Delta x = – [{\rm H}f(x_n)]^{-1}\nabla f(x_n)$ should be a good choice.
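Numerically, one rarely forms $[{\rm H}f(x_n)]^{-1}$ explicitly; instead one solves the linear system ${\rm H}f(x_n)\,\Delta x = -\nabla f(x_n)$. A sketch on a toy quadratic of my choosing, where a single Newton step lands exactly at the minimum:

```python
import numpy as np

# toy quadratic cost: f(x) = 0.5 x^T A x - b^T x
# gradient = A x - b, Hessian = A (constant)
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, 1.0])

x = np.array([10.0, -10.0])        # arbitrary starting point
grad = A @ x - b
step = np.linalg.solve(A, -grad)   # solve H @ dx = -grad, no explicit inverse
x = x + step

print(x, A @ x - b)  # gradient at the new point is ~0
```

For a quadratic, the second-order Taylor model is exact, so one step reaches the stationary point; for a general $f$ the step is repeated until convergence.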
Computing $A^{-1}$ for an $n \times n$ matrix $A$ takes $O(n^3)$ time. So when the data has many dimensions, the algorithm becomes quite slow.
The reason steepest descent goes wrong is that the gradient for one weight gets messed up by simultaneous changes to all the other weights.
And the Hessian matrix captures the sizes of these interactions, so Newton’s method can correct for them as much as possible.
As you can see, PCA did an even better job.
And now I come to a conclusion: autoencoders are good at capturing several different patterns, while PCA is the better choice when fitting a single pattern.
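For reference, a minimal PCA reconstruction via SVD (my own sketch, not the experiment behind the conclusion above): on data generated from a single linear pattern, a couple of principal components reconstruct the data almost perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)
# data drawn from a single linear pattern: rank-2 structure plus small noise
basis = rng.normal(size=(2, 20))
X = rng.normal(size=(500, 2)) @ basis + 0.01 * rng.normal(size=(500, 20))
X = X - X.mean(axis=0)  # PCA assumes centered data

# project onto the top-2 principal components and reconstruct
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_hat = X @ Vt[:2].T @ Vt[:2]

rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(rel_err)  # small: two components capture the single pattern
```

When the data mixes several distinct nonlinear patterns, this linear projection can no longer capture them all, which is where an autoencoder has the advantage.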