What did they do in this paper?
 They greatly improved the state of the art on LFW by improving both the alignment and the representation.
How did they do this?
 They apply 2D and 3D alignment to frontalize the face image.

They carefully customized the structure of the descriptor (a CNN).
Highlights

3D Alignment

Good performance on many datasets

Well-handcrafted DNN structure (the locally connected layers)

Use a large training set to combat overfitting
Q&A

Q: The detail of 2D alignment?
A: Step 1: Detect 6 fiducial points with a trained regressor. Step 2: Apply one or more affine transformations to warp the face toward a canonical pose.
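The affine-warping step above can be sketched in NumPy: given corresponding fiducial points, the affine parameters are the least-squares solution of a linear system. The point coordinates below are made up for illustration, not taken from the paper.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares fit of a 2D affine transform mapping src -> dst.

    src, dst: (N, 2) arrays of corresponding points, N >= 3.
    Returns a 2x3 matrix A such that dst ~= [x, y, 1] @ A.T.
    """
    n = src.shape[0]
    src_h = np.hstack([src, np.ones((n, 1))])      # homogeneous coords, (N, 3)
    A_T, *_ = np.linalg.lstsq(src_h, dst, rcond=None)
    return A_T.T                                   # (2, 3)

def apply_affine(A, pts):
    """Apply a 2x3 affine transform to (N, 2) points."""
    pts_h = np.hstack([pts, np.ones((pts.shape[0], 1))])
    return pts_h @ A.T

# Hypothetical detected fiducials and canonical template positions
detected = np.array([[30., 40.], [70., 42.], [50., 60.],
                     [35., 80.], [65., 81.], [50., 95.]])
template = np.array([[25., 35.], [75., 35.], [50., 55.],
                     [32., 78.], [68., 78.], [50., 92.]])

A = fit_affine(detected, template)
warped = apply_affine(A, detected)   # fiducials warped toward the template
```

Since the identity is itself an affine transform, the fitted warp can never match the template worse than leaving the points untouched.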

Q: The detail of 3D alignment?
A:
1. Generate a generic 3D model of 67 anchor points from the USF Human ID database.
2. Fine-tune a unique 3D model for each image using the residual vector and a covariance matrix(?).
3. Locate 67 fiducial points in each picture.
4. Apply a triangulation to both the anchor points and the fiducial points.
5. For each pair of corresponding triangles, apply a 2D alignment.
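Step 5 above is a piecewise affine warp: three vertex correspondences give exactly six equations for the six affine parameters, so each triangle pair determines its transform exactly. A small NumPy sketch with made-up triangles:

```python
import numpy as np

def triangle_affine(tri_src, tri_dst):
    """Exact 2D affine transform mapping one triangle onto another.

    tri_src, tri_dst: (3, 2) arrays of triangle vertices (non-degenerate).
    Three point pairs give six equations for the six affine parameters.
    """
    src_h = np.hstack([tri_src, np.ones((3, 1))])  # (3, 3), invertible
    A_T = np.linalg.solve(src_h, tri_dst)          # solve src_h @ A.T = tri_dst
    return A_T.T                                   # (2, 3)

# Hypothetical triangle pair
src = np.array([[0., 0.], [1., 0.], [0., 1.]])
dst = np.array([[2., 1.], [4., 1.], [2., 3.]])

A = triangle_affine(src, dst)
mapped = np.hstack([src, np.ones((3, 1))]) @ A.T   # vertices land exactly on dst
```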
Q: Why is there an $L^2$ normalization?
A: To cope with changes in illumination. There is a second benefit: when you train with softmax and the data is fairly blurry or ambiguous, the descriptor may tend to assign $\vec 0$ to every picture as its representation in order to minimize the loss function. Normalizing all representations to unit length avoids this collapse.
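A minimal NumPy sketch of plain row-wise $L^2$ normalization (not the paper's exact normalization recipe, just the unit-length idea described above):

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each descriptor (row of x) to unit L2 norm.

    eps guards against division by zero for all-zero descriptors.
    """
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

# Two hypothetical descriptors of different magnitudes
descriptors = np.array([[3., 4.],
                        [0.6, 0.8]])
unit = l2_normalize(descriptors)   # every row now has length 1
```

After normalization, the zero-vector "shortcut" is gone: no representation can shrink toward $\vec 0$, since every descriptor lives on the unit sphere.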
Appendix

ReLU has a great ability to learn. In practice, ReLU and $\tanh$ are the most commonly used activations in CV. In some cases, a variant of ReLU such as PReLU can do an even better job.
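The two ReLU-family activations mentioned above, sketched in NumPy. The slope `alpha = 0.25` is just an illustrative default; in PReLU it is a learned parameter.

```python
import numpy as np

def relu(x):
    """ReLU: pass positives through, clamp negatives to zero."""
    return np.maximum(0.0, x)

def prelu(x, alpha=0.25):
    """PReLU: like ReLU, but negative inputs keep a small (learnable) slope."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
```

The nonzero negative slope is what lets PReLU keep a gradient where ReLU would go "dead".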

In the 3D-alignment literature, PCA is widely used to fine-tune a unique 3D model for each picture, and a 3D engine can be used to generate synthetic data to train neural networks. In some cases, a triangle in the triangulation is invisible (self-occluded), and you need to guess what it looks like. Adversarial models are recommended here, because conventional solutions usually produce a blurred image while adversarial models produce clear images with sharp edges.

In 2D alignment, affine transformations and projective transformations are applied. A 2D affine transformation has only 6 parameters, so least squares is the most common way to estimate it. When it comes to projective transformations, things are different: although they add only 2 more parameters (8 in total), a projective transformation can change the shape of a face far more drastically than an affine one. So, usually, we learn the transformation jointly with the whole net. We don't know exactly what the transformation does internally, but since we only care about the result, it works, and it works well; so yes, this is still a black box. Moreover, there is a network called Spatial Transformer Networks (STN) that does essentially this job, learning such transformations end-to-end.
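The classical way to estimate the 8 projective parameters from point correspondences is the DLT (direct linear transform): stack two linear equations per correspondence and take the null vector via SVD. A NumPy sketch with made-up correspondences (four points suffice for an exact fit):

```python
import numpy as np

def fit_homography(src, dst):
    """DLT estimate of a 3x3 projective transform (8 free parameters).

    src, dst: (N, 2) arrays of correspondences, N >= 4.
    Returns H with H[2, 2] = 1, mapping [x, y, 1] to ~[x', y', 1] up to scale.
    """
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    A = np.array(rows)
    _, _, vt = np.linalg.svd(A)       # null vector of A gives H up to scale
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]                # fix the scale (assumes H[2,2] != 0)

def apply_h(H, pts):
    """Apply a homography to (N, 2) points, dividing out the last coordinate."""
    pts_h = np.hstack([pts, np.ones((pts.shape[0], 1))])
    q = pts_h @ H.T
    return q[:, :2] / q[:, 2:3]

# Hypothetical correspondences: unit square -> a general quadrilateral
src = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])
dst = np.array([[0., 0.], [2., 0.], [2.5, 1.5], [0., 1.]])
H = fit_homography(src, dst)
```

The per-point division by the third coordinate is exactly what affine transforms lack, and is why projective warps can distort face shape so much more.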

When you are doing the alignment, for example with an affine transformation, interpolation may be needed. Usually, there are two ways: 1. nearest-neighbor, which uses the value of the closest pixel; 2. bilinear, which uses a linear combination of the four closest pixels.
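The two interpolation schemes can be sketched directly; the tiny 2x2 image below is made up for illustration.

```python
import numpy as np

def sample_nearest(img, x, y):
    """Nearest-neighbor: take the value of the single closest pixel."""
    return img[int(round(y)), int(round(x))]

def sample_bilinear(img, x, y):
    """Bilinear: weighted average of the four surrounding pixels."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    return ((1 - wy) * ((1 - wx) * img[y0, x0] + wx * img[y0, x1])
            + wy * ((1 - wx) * img[y1, x0] + wx * img[y1, x1]))

# Hypothetical 2x2 image
img = np.array([[0., 10.],
                [20., 30.]])
```

Nearest-neighbor is cheaper but produces blocky warps; bilinear gives smoother results and is the usual default for face alignment.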

In face recognition, we can keep the ROC curve good by limiting the pose angle of the faces we accept.