Erjin recommended this article to me.
I found it extremely useful as well.
So I am posting it here in case I need to revisit it in the future.
http://dustintran.com/blog/aresearchtoengineeringworkflow
Notes on “Orthogonal Deep Features Decomposition for Age-Invariant Face Recognition (2018)”
1. TL;DR
 Force the length (norm) of a feature to represent the age of the person at the time the photo was taken.
 The features then become age-invariant once they are normalized.
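To make the TL;DR concrete, here is a minimal sketch of splitting a feature vector into its length and its direction. The function name `decompose` and the toy vectors are hypothetical, chosen only for illustration; this is not the authors' network, just the geometric idea.

```python
import math

def decompose(feature):
    """Split a feature vector into (length, unit direction)."""
    norm = math.sqrt(sum(v * v for v in feature))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    direction = [v / norm for v in feature]
    return norm, direction

# Two made-up features of the same identity at different ages:
young = [3.0, 4.0]   # length 5
old = [6.0, 8.0]     # length 10, same direction as `young`

age_young, id_young = decompose(young)
age_old, id_old = decompose(old)
print(age_young, age_old)   # the lengths differ (age-related part)
print(id_young, id_old)     # the directions match (identity-related part)
```

After normalization only the direction is left, so two photos of the same person taken at different ages map to the same point.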
2. Motivation
 The aging process causes many problems (shape changes, texture changes, etc.), and these lead to large intra-class variation.
 In the field of AIFR, conventional solutions model the age feature and the identity feature simultaneously. Since the mixed features are usually not decomposable, this mixing potentially reduces the robustness of recognizing cross-age faces.
3. Contribution
 They proposed a new approach called OE-CNNs.
 Specifically, they decompose deep face features into two orthogonal components that represent age-related and identity-related features.
 In this way, the identity-related features are age-invariant.
 They also built a new dataset called the Cross-Age Face dataset (CAF).
 This dataset includes about 313,986 face images from 4,668 identities. Each identity has approximately 80 face images.
 They cleaned the data manually, and the images are spread fairly evenly across ages.
4. Experiments & Results

Experiments on the MORPH Album 2 Dataset

Experiments on the LFW Dataset

Results on the FG-NET dataset and the CACD-VS dataset are also available in this paper.
5. Questions & Thoughts
 What if we embed the features in a plane instead of on a sphere and let the height represent the age?
 Age can be replaced by other attributes of the image, such as pose or blur.
 The structure of their network can be adapted to GridFace.v7.relative_confidence_coefficient.
 The idea is actually quite similar to using an AutoEncoder to do the transfer.
 Compared with A-Softmax, forcing the length of the features to represent the age of the person actually restricts the capacity of the model, which reduces the risk of overfitting.
Notes on “ORDER-EMBEDDINGS OF IMAGES AND LANGUAGE (2016)”
Notes on “Deep learning face representation by joint identification-verification (2014)”
Why not use a single RNN

There is an intuitive interpretation for this: “If you can discriminate a person from his eyes, why should I look at his nose?”

So when we cut pictures into different patches and train a different CNN on each, we are actually forcing the network to learn every single piece of information.

And in practice, we do use this trick.
No More…
Notes on “SphereFace: Deep Hypersphere Embedding for Face Recognition (2017)”
A-Softmax
 $L = \frac{1}{N} \sum\limits_{i}{-\log\left(\frac{e^{\left\| x_i \right\| \psi(\theta_{y_i,i})}}{e^{\left\| x_i \right\| \psi(\theta_{y_i,i})} + \sum\limits_{j \mid j \ne y_i}{e^{\left\| x_i \right\| \cos(\theta_{j,i})}}}\right)}$

$\psi(\theta_{y_i,i})=(-1)^k\cos(m\theta_{y_i,i})-2k$, $\theta_{y_i,i}\in[\frac{k\pi}{m}, \frac{(k+1)\pi}{m}]$ and $k \in [0, m-1], \ m \in \mathbb{Z}^+$

The plot of $\psi(\theta_{y_i,i})$
 In practice, we set $m$ to $4$. Detailed proof of the properties of the A-Softmax loss: https://www.cnblogs.com/heguanyou/p/7503025.html#undefined
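The piecewise definition of $\psi$ is easier to trust once you evaluate it. Below is a small sketch, assuming $\theta \in [0, \pi]$ and the default $m = 4$ mentioned above; the function name `psi` is my own choice.

```python
import math

def psi(theta, m=4):
    """psi(theta) = (-1)^k * cos(m * theta) - 2k for theta in [k*pi/m, (k+1)*pi/m]."""
    if not 0.0 <= theta <= math.pi:
        raise ValueError("theta must lie in [0, pi]")
    # Index of the interval that contains theta, clamped so theta = pi uses k = m - 1.
    k = min(int(theta * m / math.pi), m - 1)
    return (-1) ** k * math.cos(m * theta) - 2 * k

# Sample psi on a grid to see that it decreases monotonically from 1 to 1 - 2m.
grid = [i * math.pi / 100 for i in range(101)]
values = [psi(t) for t in grid]
print(values[0], values[-1])
```

The function starts at $1$ for $\theta = 0$ and falls continuously to $1 - 2m$ at $\theta = \pi$, which is what makes it a harder target than $\cos(\theta)$ for the true class.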
Normalizing the weight could reduce the prior caused by the training data imbalance

Suppose we use a neural network to extract a $1D$ feature $f_i$ for each sample $i$ in the dataset and use Softmax to evaluate our network. To make our analysis easier, we normalize our features. Suppose there are only two classes in the dataset. There are $m$ samples in class 1 and $n$ samples in class 2. When our network is strong enough, our features are distributed at both ends of the diameter of the unit circle.

Without bias terms, the loss function can be written as:
$L = -\sum\limits_{i = 1}^{m + n}\sum\limits_{j = 1}^{2}{a_{i,j} \log(p_{i,j})}$, where $p_{i,j}$ is the probability that sample $i$ belongs to class $j$ (generated by the softmax) and $a_{i,j} = [\text{sample } i \text{ belongs to class } j]$. Let $w_i$ denote the weights in the softmax layer for class $i$.
Then, $\frac{\partial L}{\partial w_1} = (m - 1) \sum\limits_{i \mid a_{i,1}=1}{f_i} = m (m - 1)$.

And, $\frac{\partial L}{\partial w_2} = (n - 1) \sum\limits_{i \mid a_{i,2}=1}{f_i} = n (n - 1)$.

If $m = 100$ while $n = 1$, then $\frac{\partial L}{\partial w_1} / \frac{\partial L}{\partial w_2} \approx 10000$. Clearly, the derivative with respect to $w_1$ is much bigger than the derivative with respect to $w_2$. That is why the more samples a class has, the larger the norm of its associated weights tends to be.
Biases are useless for softmax

In this paper, they use an experiment on MNIST to show empirically that biases are not necessary for softmax.

In practice, biases are indeed useless.
Understanding “the prior that faces also lie on a manifold”

As described in NormFace, the feature distribution of softmax is ‘radial’. So after normalization, the features lie on a very thick line on a hypersphere. That is why Euclidean distance between features fails and $\cos$ similarity works well.

This is also an interpretation of why the Euclidean margin is incompatible with the softmax loss.
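A toy example makes the ‘radial’ argument tangible. The three 2D features below are made up: two point in the same direction with very different lengths (same identity), and one points elsewhere (another identity).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between vectors a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical radial features: same direction, different lengths.
same_id_short = [1.0, 0.1]
same_id_long = [10.0, 1.0]
other_id = [0.1, 1.0]

print(euclidean_distance(same_id_short, same_id_long))  # large
print(euclidean_distance(same_id_short, other_id))      # small
print(cosine_similarity(same_id_short, same_id_long))   # close to 1
print(cosine_similarity(same_id_short, other_id))       # much lower
```

Euclidean distance ranks the other identity as the nearer neighbour, while cosine similarity correctly groups the two features that share a direction.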
Other Points

Closed-set FR can be well addressed as a classification problem. Open-set FR is a metric learning problem.

The key criterion for metric learning in FR: the maximal intra-class distance is smaller than the minimal inter-class distance.

Separable $\ne$ discriminative, and softmax features are only separable.
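The criterion above is easy to check on a small set of features. This is a sketch using plain Euclidean distance and invented 2D points; the helper name `satisfies_margin_criterion` is hypothetical.

```python
import math
from itertools import combinations

def satisfies_margin_criterion(features, labels):
    """True if the max intra-class distance is below the min inter-class distance."""
    max_intra, min_inter = 0.0, math.inf
    for (fa, la), (fb, lb) in combinations(zip(features, labels), 2):
        d = math.dist(fa, fb)
        if la == lb:
            max_intra = max(max_intra, d)
        else:
            min_inter = min(min_inter, d)
    return max_intra < min_inter

labels = ["A", "A", "B", "B"]

# Separable by the line x = 0.5, but each class is stretched along y.
separable_only = [(0.0, 0.0), (0.0, 10.0), (1.0, 0.0), (1.0, 10.0)]

# Compact classes that are far apart.
discriminative = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

print(satisfies_margin_criterion(separable_only, labels))
print(satisfies_margin_criterion(discriminative, labels))
```

The first set can be split perfectly by a classifier yet fails the criterion, which is the gap between separable and discriminative.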
Notes on “Deep Sparse Rectifier Neural Networks, Yoshua Bengio et al. (2011)”
Preface
This paper is entirely empirical. You can hardly find any mathematical analysis, so every conclusion drawn here comes from experimental results.
The main gaps between neuroscience models and ML models
 Neurons encode information in a sparse and distributed way. This corresponds to a trade-off between the richness of representation and small action potential energy expenditure. However, without additional regularization, ordinary feedforward neural networks do not have this property. This is not only biologically implausible but also hurts gradient-based optimization.
 Sigmoid and tanh are equivalent up to a linear transformation (shifting and rescaling the axes of one gives the other). Tanh has a steady state at 0 and is therefore preferred from the optimization standpoint, but it forces an antisymmetry around 0, which is absent in biological neurons.
Potential problems of rectifiers
 The hard saturation at 0 may hurt optimization by blocking gradient backpropagation.
 In order to efficiently represent symmetric or antisymmetric behaviour in the data, a rectifier network would need twice as many hidden units as a network of symmetric or antisymmetric activation functions.
 Rectifier networks are subject to ill-conditioning of the parametrization.
 Consider for each layer of depth $i$ of the network a scalar $\alpha_i$, and scale the parameters as $w_i'=\frac{w_i}{\alpha_i}$ and $b_i'=\frac{b_i}{\prod_{j=1}^{i}{\alpha_j}}$. The output unit values then change as follows: $s'=\frac{s}{\prod_{j=1}^{n}{\alpha_j}}$, where $n$ is the number of layers.
 Therefore, as long as $\prod_{j=1}^{n}{\alpha_j}=1$, the network function is identical.
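The rescaling argument can be checked numerically. Below is a minimal sketch with a scalar two-layer network (one rectifier hidden unit, linear output), assuming every $\alpha_i$ is positive so that it can be pulled out of the rectifier; all names and parameter values are invented.

```python
def relu(z):
    return max(0.0, z)

def two_layer_net(x, w1, b1, w2, b2):
    """Scalar network: rectifier hidden unit followed by a linear output."""
    h = relu(w1 * x + b1)
    return w2 * h + b2

def rescale(w1, b1, w2, b2, alpha1, alpha2):
    """Apply w_i' = w_i / alpha_i and b_i' = b_i / prod_{j<=i} alpha_j."""
    return (w1 / alpha1, b1 / alpha1, w2 / alpha2, b2 / (alpha1 * alpha2))

params = (1.5, -0.5, 2.0, 0.25)

# alpha1 * alpha2 = 1, so the network function should not change.
same_function = rescale(*params, alpha1=4.0, alpha2=0.25)

# alpha1 * alpha2 = 6, so every output should shrink by a factor of 6.
scaled_function = rescale(*params, alpha1=2.0, alpha2=3.0)

for x in (-1.0, 0.5, 2.0):
    print(two_layer_net(x, *params),
          two_layer_net(x, *same_function),
          two_layer_net(x, *scaled_function))
```

Very different parameter values produce the same function, which is exactly why the parametrization is ill-conditioned.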
Advantage of rectifiers
 Rectifiers are not only biologically plausible, they are also computationally efficient.
 There’s almost no improvement when using unsupervised pretraining with rectifier activations.
 Networks trained with the rectifier activation function can find local minima of greater or equal quality than those obtained with its smooth counterpart, the softplus.
 Rectifier networks are truly sparse.
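“Truly sparse” means exact zeros, not just small values. A quick sketch comparing the rectifier with the softplus on a handful of made-up pre-activations:

```python
import math

def relu(z):
    return max(0.0, z)

def softplus(z):
    """Smooth counterpart of the rectifier: log(1 + e^z)."""
    return math.log1p(math.exp(z))

def zero_fraction(activation, pre_activations):
    """Fraction of units whose output is exactly zero."""
    outputs = [activation(z) for z in pre_activations]
    return sum(1 for y in outputs if y == 0.0) / len(outputs)

# Hypothetical pre-activations, half of them negative.
pre_activations = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]

print(zero_fraction(relu, pre_activations))
print(zero_fraction(softplus, pre_activations))
```

The rectifier switches off every unit with a negative input, while the softplus output is small but never exactly zero.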
Notes on “DeepFace: Closing the Gap to Human-Level Performance in Face Verification, Yaniv Taigman et al. (2014)”
What did they do in this paper?
 They greatly improved the result on LFW by improving the alignment and representation.
How did they do this?
 They apply 2D and 3D alignments to frontalize the picture.

They carefully customized the structure of the descriptor (a CNN).
Highlights

3D Alignment

Good performance on many datasets

Well-handcrafted DNN structures (those locally connected layers)

Use a large training set to overcome overfitting
Q&A

Q: The detail of 2D alignment?
A: Step 1: Detect 6 fiducial points. Step 2: Use a regressor to apply one or more affine transformations.

Q: The detail of 3D alignment?
A:
 1. Generate a generic 3D model of 67 anchor points from the USF Human-ID database.
 2. Fine-tune a unique 3D model for each image using the residual vector and covariance matrix(?).
 3. Locate 67 fiducial points in each picture.
 4. Apply triangulation to both the anchor points and the fiducial points.
 5. For each pair of corresponding triangles, apply 2D alignment.
Q: Why is there an $L^2$ normalization?
A: To cope with changes in illumination. It also serves a second purpose: when you train with softmax and the data is fairly blurry, the descriptor tends to assign $\vec 0$ to every picture as its representation in order to minimize the loss. Fixing the length of every representation to 1 avoids this problem.
Appendix

ReLU has a great ability to learn. In practice, ReLU and $\tanh$ are the most commonly used activations in CV. In some cases, a ReLU variant such as PReLU can do an even better job.

In the 3D alignment literature, PCA is widely used to fine-tune a unique 3D model for each picture, and a 3D engine can be used to generate data for training neural networks. In some cases, some triangles in the triangulation are invisible, and you need to guess what they look like. Adversarial models are recommended here, because conventional solutions usually produce a blurred image while adversarial models produce clear images with sharp edges.

In 2D alignment, affine and projective transformations are applied. An affine transformation has only 6 parameters, so least squares is the most common way to solve for it. Projective transformations are a different story. Although they have only 2 more parameters, they can change the shape of a face greatly (an affine transformation does the same, but to a lesser degree). So we usually train the transformation jointly with the whole network. We do not know exactly what the transformation does, but since we only care about the result, it works, and it works well. So yes, this is still a black box. Moreover, there is a network called Spatial Transformer Networks (STN) that actually does the job of the projective transformation.

When you do the alignment, for example with an affine transformation, interpolation may be needed. There are two usual ways: 1. Use the value of the nearest point (nearest-neighbor interpolation); 2. Use a linear combination of the four closest points (bilinear interpolation).
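Both interpolation schemes fit in a few lines. This is a sketch on an image stored as a list of rows, with `x` as the column coordinate and `y` as the row coordinate; the function names and the tiny 2×2 image are made up for illustration.

```python
import math

def nearest_neighbor(image, x, y):
    """Return the value of the pixel closest to (x, y), clamped to the image."""
    row = min(max(round(y), 0), len(image) - 1)
    col = min(max(round(x), 0), len(image[0]) - 1)
    return image[row][col]

def bilinear(image, x, y):
    """Blend the four pixels surrounding (x, y) by their distances to it."""
    height, width = len(image), len(image[0])
    if not (0 <= x <= width - 1 and 0 <= y <= height - 1):
        raise ValueError("point lies outside the image")
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    tx, ty = x - x0, y - y0
    top = (1 - tx) * image[y0][x0] + tx * image[y0][x1]
    bottom = (1 - tx) * image[y1][x0] + tx * image[y1][x1]
    return (1 - ty) * top + ty * bottom

image = [[0.0, 10.0],
         [20.0, 30.0]]

print(nearest_neighbor(image, 0.2, 0.9))  # picks the pixel at row 1, column 0
print(bilinear(image, 0.5, 0.5))          # average of all four pixels
```

Nearest-neighbor is cheaper but produces blocky results; bilinear is smoother because the output changes continuously with the sampling position.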

In face recognition, we can keep ROC good by limiting the angle of the face.
Protected: Summary on Google mock phone interview
Newton’s method
Taylor series
A Taylor series is a representation of a function as an infinite sum of terms that are calculated from the values of the function’s derivatives at a single point.
$$f(x) \approx \sum\limits_{n=0}^{\infty}{\frac{f^{(n)}(a)}{n!}(x-a)^n}$$
where $f^{(n)}(a)$ denotes the $n^{th}$ derivative of $f$ evaluated at the point $a$.
And here’s a very intuitive example:
The exponential function $e^x$ (in blue), and the sum of the first $n + 1$ terms of its Taylor series at $0$ (in red).
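The same example can be reproduced numerically. Since every derivative of $e^x$ at $0$ equals $1$, the $k$-th term is simply $x^k / k!$. The function name `taylor_exp` is my own.

```python
import math

def taylor_exp(x, n):
    """Sum of the first n + 1 terms of the Taylor series of e^x at 0."""
    total, term = 0.0, 1.0          # term starts as x^0 / 0!
    for k in range(n + 1):
        total += term
        term *= x / (k + 1)         # turn x^k / k! into x^(k+1) / (k+1)!
    return total

for n in (0, 1, 2, 5, 10):
    approx = taylor_exp(1.0, n)
    print(n, approx, abs(approx - math.e))
```

The error shrinks very quickly as $n$ grows, which matches the way the red curve hugs the blue one over a wider and wider range.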
Newton’s method
Overview
In calculus, Newton’s method is an iterative method for finding the roots of a differentiable function $f$.
In machine learning, we apply Newton’s method to the derivative $f’$ of the cost function $f$.
One-dimensional version
In the one-dimensional problem, Newton’s method attempts to construct a sequence $\{x_n\}$ from an initial guess $x_0$ that converges towards some value $\hat{x}$ satisfying $f'(\hat{x})=0$. In other words, we are trying to find a stationary point of $f$.
The second-order Taylor expansion $f_T(x)$ of $f$ around $x_n$ is
$$f_T(x)=f_T(x_n+\Delta x) \approx f(x_n) + f'(x_n) \Delta x + \frac{1}{2}f''(x_n) \Delta x^2$$
So now, we are trying to find a $\Delta x$ that sets the derivative of this last expression with respect to $\Delta x$ equal to zero, which means
$$\frac{\rm{d}}{\rm{d} \Delta x}(f(x_n) + f'(x_n) \Delta x + \frac{1}{2}f''(x_n) \Delta x^2) = f'(x_n)+f''(x_n)\Delta x = 0$$
Clearly, $\Delta x = -\frac{f'(x_n)}{f''(x_n)}$ is the solution. So it can be hoped that $x_{n+1} = x_n+\Delta x = x_n - \frac{f'(x_n)}{f''(x_n)}$ will be closer to a stationary point $\hat{x}$.
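Here is a minimal sketch of the one-dimensional update, assuming the caller supplies $f'$ and $f''$ and that $f''$ never becomes zero along the way. The test function $f(x) = x - \ln x$ is my own example; its only stationary point is $x = 1$.

```python
def newton_stationary_point(df, d2f, x0, tol=1e-10, max_iter=50):
    """Iterate x <- x - f'(x) / f''(x) until the step is smaller than tol."""
    x = x0
    for _ in range(max_iter):
        step = -df(x) / d2f(x)   # raises ZeroDivisionError if f''(x) == 0
        x += step
        if abs(step) < tol:
            break
    return x

# f(x) = x - ln(x), so f'(x) = 1 - 1/x and f''(x) = 1/x^2.
x_hat = newton_stationary_point(
    df=lambda x: 1.0 - 1.0 / x,
    d2f=lambda x: 1.0 / x ** 2,
    x0=0.5,
)
print(x_hat)
```

Starting from $x_0 = 0.5$, the iterates are $0.75$, $0.9375$, and so on: the distance to $1$ is squared at every step, so only a handful of iterations are needed.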
High-dimensional version
Again, the second-order Taylor expansion $f_T(x)$ of $f$ around $x_n$ is
$$f_T(x)=f_T(x_n+\Delta x) \approx f(x_n) + \Delta x^{\mathsf{T}} \nabla f(x_n) + \frac{1}{2} \Delta x^{\mathsf{T}} \, {\rm H} f(x_n) \, \Delta x$$
where ${\rm H} f(x)$ denotes the Hessian matrix of $f(x)$ and $\nabla f(x)$ denotes the gradient. (See more about Taylor expansion at https://en.wikipedia.org/wiki/Taylor_series#Taylor_series_in_several_variables)
Setting the gradient of this expression with respect to $\Delta x$ to zero gives $\nabla f(x_n) + {\rm H}f(x_n)\Delta x = 0$, so $\Delta x = -[{\rm H}f(x_n)]^{-1}\nabla f(x_n)$ should be a good choice.
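For two variables the Hessian can be inverted by hand, which keeps a sketch of the update short. The quadratic $f(x, y) = x^2 + xy + 2y^2 - 3x$ is a made-up example; because it is exactly quadratic, a single Newton step lands on its minimum at $(12/7, -3/7)$.

```python
def newton_step_2d(grad, hess, point):
    """One Newton update: point - H^{-1} * gradient, for a 2x2 Hessian."""
    gx, gy = grad(point)
    (a, b), (c, d) = hess(point)
    det = a * d - b * c
    if det == 0:
        raise ValueError("Hessian is singular")
    # The inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det.
    dx = -(d * gx - b * gy) / det
    dy = -(-c * gx + a * gy) / det
    return (point[0] + dx, point[1] + dy)

def grad(p):
    x, y = p
    return (2 * x + y - 3, x + 4 * y)

def hess(p):
    return ((2.0, 1.0), (1.0, 4.0))

minimum = newton_step_2d(grad, hess, (0.0, 0.0))
print(minimum, grad(minimum))
```

For larger problems you would solve the linear system ${\rm H}\,\Delta x = -\nabla f$ instead of forming the inverse explicitly, but the update is the same.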
Limitation
As is well known, computing $A^{-1}$ from $A$ takes $O(n^3)$ time when $A$ is an $n \times n$ matrix. So when the problem has too many dimensions, the algorithm runs quite slowly.
Advantage
The reason steepest descent goes wrong is that the gradient for one weight gets messed up by simultaneous changes to all the other weights.
The Hessian matrix measures the sizes of these interactions, so Newton’s method can compensate for them as much as possible.
Auto Encoder vs. PCA (again)
I tried using an Auto Encoder and PCA for dimensionality reduction.
The dataset is from House Prices: Advanced Regression Techniques.
I reduced the data from 79D to 30D and then reconstructed it to 70D.
Here’s the result:
As you can see, PCA even did a better job.
My conclusion for now is that an Auto Encoder is good at capturing several different patterns, while PCA is the better choice when fitting a single pattern.