Ⅰ. Distributed Representation
1. Fight the curse of dimensionality
Sentences with similar meanings can use quite different words, so it is very hard for an n-gram model to generalize. In the paper, they propose distributed representations to fight the curse of dimensionality: each word is mapped to a dense feature vector, which gives the model the ability to exploit semantic similarity between words. This feature alone improves the model, and it also lets each training sentence inform the model about an exponential number of semantically neighbouring sentences, which makes the model generalize much better.
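A minimal sketch of the idea, assuming a toy vocabulary and randomly initialized parameters (the sizes and the single-layer scorer are illustrative, not the paper's exact architecture): discrete word indices are mapped to dense vectors, and the concatenated context vectors are used to score every next word.

```python
import numpy as np

# Toy setup: vocabulary size V, embedding dimension d, context of 3 previous words.
V, d, context = 10_000, 64, 3
rng = np.random.default_rng(0)

C = rng.normal(scale=0.1, size=(V, d))            # distributed representations, one row per word
W = rng.normal(scale=0.1, size=(context * d, V))  # output weights of a one-layer scorer
b = np.zeros(V)

def next_word_distribution(context_ids):
    """Map discrete word ids to dense vectors, then score every word in the vocabulary."""
    x = C[context_ids].reshape(-1)                # concatenate the context embeddings
    logits = x @ W + b
    logits -= logits.max()                        # numerical stability for softmax
    p = np.exp(logits)
    return p / p.sum()

p = next_word_distribution([12, 7, 431])          # arbitrary toy word ids
print(p.shape, p.sum())                           # (10000,) 1.0
```

Because similar words end up with similar rows of $C$, contexts that were never seen in training can still receive sensible probabilities, which is exactly what an n-gram table of discrete counts cannot do.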
2. Deal with out-of-vocabulary words
The only thing we need to do is assign this out-of-vocabulary word a distributed representation. To do that, we first treat the unknown word $j$ as a blank that needs to be filled, and use the network to estimate the probability $p_i$ of each word $i$ in the vocabulary appearing in that position. Denoting by $C_i$ the distributed representation of word $i$ in the vocabulary, we assign $C_j = \sum_{i \in \text{vocabulary}} p_i \, C_i$ as the distributed representation of word $j$ (a short sketch follows below). After that, we can incorporate word $j$ into the vocabulary and use this slightly larger vocabulary for further computation.
This approach is quite elegant; it mirrors how people infer the meaning of an unfamiliar word from the context around it.
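A minimal sketch of the expected-embedding computation, assuming the probability vector $p$ over the blank position comes from the language model (here it is a toy distribution):

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 5_000, 64
C = rng.normal(scale=0.1, size=(V, d))   # existing embedding matrix, one row per known word

def embed_unknown_word(p):
    """Expected embedding of the blank position: C_j = sum_i p_i * C_i."""
    p = np.asarray(p)
    assert p.shape == (V,) and abs(p.sum() - 1.0) < 1e-6
    return p @ C                          # (V,) @ (V, d) -> (d,)

# In practice p would come from the model predicting the blank; here it is random.
p = rng.random(V)
p /= p.sum()
C_j = embed_unknown_word(p)
C = np.vstack([C, C_j])                   # grow the vocabulary by one word
print(C.shape)                            # (5001, 64)
```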
3. Deal with polysemous words
We can simply assign multiple distributed representations to a single polysemous word, one per sense.
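A minimal sketch of what this could look like in practice: each polysemous word keeps several sense vectors, and one is selected based on the current context. The selection rule below (cosine similarity against a context vector) is an assumption for illustration; the paper does not specify a mechanism.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64

# Hypothetical store: each word maps to one or more sense vectors.
senses = {
    "bank": rng.normal(size=(2, d)),   # e.g. river bank vs. financial bank
    "run":  rng.normal(size=(3, d)),
}

def pick_sense(word, context_vec):
    """Choose the sense vector most similar (cosine) to the context vector."""
    S = senses[word]
    sims = (S @ context_vec) / (np.linalg.norm(S, axis=1) * np.linalg.norm(context_vec) + 1e-12)
    return S[np.argmax(sims)]

context_vec = rng.normal(size=d)       # e.g. average of surrounding word embeddings
v = pick_sense("bank", context_vec)
print(v.shape)                          # (64,)
```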
4. Comparison with n-gram models
The distributed representation lets the model make use of semantic information and turns discrete random variables into smooth functions of continuous variables. These two features let the network generalize much better than n-gram models.
Ⅱ. Improve computation efficiency
1. Represent the conditional probability with a tree structure
For each prediction, the model only traverses $O(\log n)$ nodes of the tree instead of scoring all $n$ words in the vocabulary, which greatly reduces the computation needed per prediction.
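A minimal sketch of a binary hierarchical decomposition, assuming a toy balanced tree over the vocabulary; the word codes and node vectors here are invented for illustration, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(3)
V, d = 8, 16                     # tiny vocabulary so the tree is easy to see
depth = int(np.log2(V))          # balanced binary tree: log2(V) decisions per word

# One parameter vector per internal node (V - 1 of them in a full binary tree).
node_vecs = rng.normal(scale=0.1, size=(V - 1, d))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_probability(word_id, h):
    """P(word | context) as a product of O(log V) binary decisions along the tree path."""
    p, node = 1.0, 0
    for bit_pos in reversed(range(depth)):
        go_right = (word_id >> bit_pos) & 1       # the word's binary code gives the path
        q = sigmoid(node_vecs[node] @ h)          # probability of branching right
        p *= q if go_right else (1.0 - q)
        node = 2 * node + 1 + go_right            # move to the child node
    return p

h = rng.normal(size=d)                            # hidden representation of the context
probs = [word_probability(w, h) for w in range(V)]
print(sum(probs))                                 # ~1.0: the leaf probabilities sum to one
```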
2. Parallel implementation
The paper discusses two parallel implementations: data-parallel processing, where different processors work on different subsets of the training data, and parameter-parallel processing, where different processors are responsible for different subsets of the parameters.
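A minimal sketch of the data-parallel idea, assuming a generic `grad(params, batch)` function; the synchronization details discussed in the paper are omitted, and the toy problem is just fitting a scalar mean.

```python
import numpy as np

def data_parallel_step(params, batches, grad, lr=0.1):
    """Each worker computes a gradient on its own shard; gradients are averaged, then applied."""
    grads = [grad(params, b) for b in batches]    # in practice these run on separate processors
    mean_grad = sum(grads) / len(grads)
    return params - lr * mean_grad

# Toy example: estimate the mean of some data, sharded across 4 "workers".
data = np.random.default_rng(4).normal(loc=3.0, size=400)
batches = np.split(data, 4)
grad = lambda p, b: 2.0 * (p - b).mean()          # d/dp of mean squared error on one shard

p = 0.0
for _ in range(100):
    p = data_parallel_step(p, batches, grad)
print(round(p, 2))                                # close to 3.0
```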
Ⅲ. Relating to Strong AI
1. Taking advantage of prior knowledge
In this paper, the prior knowledge they take advantage of is semantic information about words.
2. Decompose the whole network into smaller parts
This can make computation faster and make it easier to adapt the network to other tasks. In *Opening the Black Box of Deep Neural Networks via Information*, it is argued that a large share of training computation goes into compressing the input into an effective representation. So if we can modularize the network and define a set of general APIs, it could make a huge difference in practical implementations.
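A minimal sketch of what such a decomposition could look like: a reusable encoder that compresses the input into a representation, and small task-specific heads behind a common interface. The class names and interface are illustrative assumptions, not anything defined in either paper.

```python
import numpy as np

class Encoder:
    """Reusable module: compresses raw input into an effective representation."""
    def __init__(self, in_dim, rep_dim, seed=0):
        self.W = np.random.default_rng(seed).normal(scale=0.1, size=(in_dim, rep_dim))
    def __call__(self, x):
        return np.tanh(x @ self.W)

class Head:
    """Small task-specific module that consumes the shared representation."""
    def __init__(self, rep_dim, out_dim, seed=1):
        self.W = np.random.default_rng(seed).normal(scale=0.1, size=(rep_dim, out_dim))
    def __call__(self, rep):
        return rep @ self.W

# The expensive encoder is trained once and reused; each new task only needs a new head.
encoder = Encoder(in_dim=300, rep_dim=64)
sentiment_head = Head(rep_dim=64, out_dim=2)
topic_head = Head(rep_dim=64, out_dim=10, seed=2)

x = np.random.default_rng(5).normal(size=300)
rep = encoder(x)
print(sentiment_head(rep).shape, topic_head(rep).shape)   # (2,) (10,)
```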