A-Softmax
$L = \frac{1}{N} \sum\limits_{i}{-\log\left(\frac{e^{\left\|x_i\right\|\psi(\theta_{y_i,i})}}{e^{\left\|x_i\right\|\psi(\theta_{y_i,i})} + \sum\limits_{j \ne y_i}{e^{\left\|x_i\right\|\cos(\theta_{j,i})}}}\right)}$

$\psi(\theta_{y_i,i})=(-1)^k\cos(m\theta_{y_i,i})-2k$, for $\theta_{y_i,i}\in[\frac{k\pi}{m}, \frac{(k+1)\pi}{m}]$, $k \in [0, m-1]$, and $m \in \mathbb{Z}^+$

(Figure: the plot of $\psi(\theta_{y_i,i})$, which decreases monotonically from $1$ at $\theta = 0$ to $-(2m-1)$ at $\theta = \pi$.)
In practice, we set $m$ to $4$. A detailed proof of the properties of the A-Softmax loss: https://www.cnblogs.com/heguanyou/p/7503025.html#undefined
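The piecewise definition above is easy to make concrete. Below is a minimal NumPy sketch (my own helper, not the paper's code) that evaluates $\psi$ on $[0, \pi]$:

```python
import numpy as np

def psi(theta, m=4):
    """A-Softmax's monotonically decreasing surrogate for cos(m * theta).

    psi(theta) = (-1)^k * cos(m * theta) - 2k
    for theta in [k*pi/m, (k+1)*pi/m], k = 0, ..., m-1.
    """
    theta = np.asarray(theta, dtype=float)
    # Which of the m segments each angle falls into.
    k = np.clip(np.floor(theta * m / np.pi), 0, m - 1).astype(int)
    return ((-1.0) ** k) * np.cos(m * theta) - 2.0 * k
```

The sign flip and the $-2k$ shift stitch the $m$ cosine lobes into one continuous, strictly decreasing curve from $\psi(0) = 1$ to $\psi(\pi) = -(2m - 1)$, which is what makes it usable as an angular penalty.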
Normalizing the weights could reduce the prior caused by training-data imbalance

Suppose we use a neural network to extract a $2$-D feature $f_i$ for each sample $i$ in the dataset and use softmax to evaluate the network. To make the analysis easier, we normalize the features onto the unit circle. Suppose there are only two classes in the dataset: $m$ samples in class 1 and $n$ samples in class 2. When the network is strong enough, the features of the two classes are distributed at the two ends of a diameter of the unit circle.

Without bias terms, the loss function can be written as:
$L = -\sum\limits_{i = 1}^{m + n}\sum\limits_{j = 1}^{2}{a_{i,j} \log(p_{i,j})}$, where $p_{i,j}$ is the probability (produced by the softmax) that sample $i$ belongs to class $j$, and $a_{i,j} = [\text{sample } i \text{ belongs to class } j]$ is the Iverson bracket. Let $w_j$ denote the softmax-layer weight vector for class $j$.
Then, $\frac{\partial L}{\partial w_1} = (m - 1) \sum\limits_{i:\,a_{i,1} = 1}{f_i}$, whose norm is $m(m - 1)$ (the class-1 features are identical unit vectors).

And, $\frac{\partial L}{\partial w_2} = (n - 1) \sum\limits_{i:\,a_{i,2} = 1}{f_i}$, whose norm is $n(n - 1)$.

If $m = 100$ while $n = 1$, the norm of $\frac{\partial L}{\partial w_1}$ is $9900$ while that of $\frac{\partial L}{\partial w_2}$ is $0$; treating $m(m-1) \approx m^2$ and $n(n-1) \approx n^2$, the ratio is roughly $10^4$. The derivative with respect to $w_1$ is clearly much larger than the derivative with respect to $w_2$. That is why the more samples a class has, the larger the norm of its weight vector tends to be.
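A tiny numeric sketch of the imbalance (my own simplification, not the notes' exact derivation: it only measures the attractive pull each class's own samples exert on its weight vector at initialization, ignoring the push from the other class and the $(m-1)$ factors above):

```python
import numpy as np

def class_pull(m, n):
    """Norm of the attractive gradient component each class exerts on its own
    softmax weight vector at initialization (W = 0, so every p_{i,j} = 1/2).

    Class 1 has m unit features at f; class 2 has n unit features at -f.
    The pull on w_j from its own samples is sum over {i : a_{i,j}=1} of
    (1 - p_{i,j}) * f_i.
    """
    f = np.array([1.0, 0.0])
    pull_w1 = m * (1.0 - 0.5) * f        # class-1 samples all sit at f
    pull_w2 = n * (1.0 - 0.5) * (-f)     # class-2 samples all sit at -f
    return np.linalg.norm(pull_w1), np.linalg.norm(pull_w2)

p1, p2 = class_pull(100, 1)  # ratio p1 / p2 == m / n == 100
```

Even in this crude version the pull on $w_1$ scales with $m$, so the majority class drags its weight vector much harder on every update.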
Biases are useless for softmax

In this paper, the authors use an MNIST experiment to show empirically that biases are not necessary for softmax.

In practice, biases are indeed useless.
Understanding "the prior that faces also lie on a manifold"

As described in NormFace, the feature distribution learned by softmax is "radial": features of the same class spread along a ray from the origin, so before normalization each class looks like a thick line, and after normalization it collapses onto a small region of the hypersphere. That is why Euclidean distance on the raw features fails while cosine similarity works well.
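A quick sketch of the failure mode, with toy 2-D numbers of my own: two features of the same identity share a direction but differ in magnitude, and a second identity points in an orthogonal direction.

```python
import numpy as np

# "Radial" features: person A's two features share a direction but differ
# in magnitude; person B's feature points in an orthogonal direction.
a1 = np.array([1.0, 0.0])    # person A, low-norm feature
a2 = np.array([10.0, 0.0])   # person A, high-norm feature
b = np.array([0.0, 1.0])     # person B

def euclid(u, v):
    return np.linalg.norm(u - v)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Euclidean distance says A's own features are farther apart (9.0) than
# A is from B (~1.41); cosine similarity correctly groups A together
# (cosine(a1, a2) = 1) and separates A from B (cosine(a1, b) = 0).
```

Normalizing the features (or equivalently comparing by angle) removes the spurious magnitude axis, which is exactly what the radial-distribution argument predicts.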

This is also an interpretation of why the Euclidean margin is incompatible with the softmax loss.
Other Points

Closed-set FR can be well addressed as a classification problem, while open-set FR is essentially a metric learning problem.

The key criterion for metric learning in FR: the maximal intra-class distance must be smaller than the minimal inter-class distance.
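This criterion is straightforward to check numerically; here is a small sketch (a hypothetical helper and toy embeddings of my own, not from the notes):

```python
import numpy as np

def is_discriminative(feats, labels):
    """True iff the maximal intra-class distance is strictly smaller than
    the minimal inter-class distance (the FR metric-learning criterion)."""
    feats = np.asarray(feats, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix via broadcasting.
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    max_intra = d[same & off_diag].max()
    min_inter = d[~same].min()
    return max_intra < min_inter

# Two tight, well-separated clusters satisfy the criterion; interleaved
# clusters do not.
good = is_discriminative([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.2, 5.0]],
                         [0, 0, 1, 1])
```

Separable features (a hyperplane between classes) can still violate this check when a class is spread wider than the gap between classes, which is the gap A-Softmax's angular margin is designed to close.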

Separable $\ne$ discriminative, and softmax only learns separable features.