# Notes on “SphereFace Deep Hypersphere Embedding for Face Recognition (2017)”

## A-Softmax

• $L = \frac{1}{N} \sum\limits_{i}{-\log(\frac{e^{ \left| {x_i} \right| \psi(\theta_{y_i,i})}}{e^{ \left| {x_i} \right| \psi(\theta_{y_i,i})} + \sum\limits_{j|j \ne y_i}{e^{ \left| {x_j} \right| \cos(\theta _{j,i})}}})}$

• $\psi(\theta_{y_i,i})=(-1)^k\cos(m\theta_{y_i,i})-2k$, $\theta_{y_i,i}\in[\frac{k\pi}{m}, \frac{(k+1)\pi}{m}]$ and $k \in [0, m-1], \ m \in \mathbb{Z}^+$

• The plot of $\psi(\theta_{y_i,i})$

## Normalizing the weight could reduce the prior caused by the training data imbalance

• Suppose we use a neural network to extract a $1D$ feature $f_i$ for each sample $i$ in the dataset and use Softmax to evaluate our network. To make our analysis easier, we normalize our features. Suppose there are only two classes in the dataset. There are $m$ samples in class 1 and $n$ samples in class 2. When our network is strong enough, our features are distributed at both ends of the diameter of the unit circle.

• Without bias terms, the loss function can be written as:
$L = -\sum\limits_{i = 1}^{m + n}\sum\limits_{j = 1}^{2}{a_{i,j} \log(p_{i,j})}$, where $p_{i,j}$ means the probability that sample i belongs to class $j$ (generated by the softmax) and $a_{i,j} = [sample \ i \ belongs \ to \ class \ j]$. Assume that $w_i$ means the weights in the softmax layer for class $i$.

• Then, $\frac{\partial L}{\partial w_1} = (m – 1) \sum\limits_{i|a_{i,1}=1}{f_i} = m (m – 1)$.

• And, $\frac{\partial L}{\partial w_2} = (n – 1) \sum\limits_{i|a_{i,2}=1}{f_i} = n (n – 1)$.

• If $m = 100$ while $n = 1$, then $\frac{\partial L}{\partial w_1} / \frac{\partial L}{\partial w_2} \approx 10000$. It’s clear to see that the derivative of $w_1$ is much bigger than the derivative of $w_2$. That’s why the larger sample number a class has, the larger the associated norm of weights tends to be.

## Biases are useless for softmax

• In this paper, they use an experiment using MNIST to empirically prove that Biases is not necessary for softmax.

• In practical, Biases do are useless.

## Understanding to “the prior that faces also lie on a manifold”

• As descript in NormFace, the feature distribution of softmax is ‘radial’. So after normalization, features lie on a very thick line on a hypersphere. That’s why Euclidean features failed and $cos$ similarity works well.

• This is also an interpretation of why the Euclidean margin is incompatible with softmax loss_

## Other Points

1. Closed-set FR can be well addressed as a classification problem. Open-set FR is a metric learning problem.

2. The key criterion for metric learning in FR: the maxima intra-class distance is smaller than the minima inter-class distance.

3. Separable $\ne$ discriminative and softmax is only separable.