Notes on “A Neural Probabilistic Language Model”

Ⅰ. Distributed Representation

1. Fight the curse of dimensionality

Since sentences with similar meanings can be built from quite different words, it is hard for an n-gram model to generalize. The paper proposes distributed representations to fight the curse of dimensionality: each word is mapped to a learned, real-valued feature vector, which lets the model exploit semantic similarity between words. This by itself improves the model, and it also means that each training sentence informs the model about an exponential number of semantically neighbouring sentences, which makes it generalize much better.
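To make this concrete, here is a rough sketch of the model's forward pass (toy sizes and variable names are mine, and I drop the paper's optional direct input-to-output connections): each context word is mapped to a row of a shared feature matrix $C$, the vectors are concatenated, and a small tanh layer plus softmax gives the next-word distribution.

```python
import numpy as np

# Minimal sketch of the neural probabilistic language model's forward pass.
# Vocabulary size V, feature dimension m, n-1 context words, hidden size h:
# all toy values chosen for illustration.
V, m, context, h = 10, 5, 3, 8
rng = np.random.default_rng(0)

C = rng.normal(size=(V, m))              # shared word feature matrix (the distributed representation)
H = rng.normal(size=(h, context * m))    # hidden-layer weights
d = np.zeros(h)                          # hidden-layer bias
U = rng.normal(size=(V, h))              # hidden-to-output weights
b = np.zeros(V)                          # output bias

def next_word_probs(context_ids):
    """P(w_t | previous n-1 words) for every word in the vocabulary."""
    x = C[context_ids].reshape(-1)       # concatenate the context words' feature vectors
    a = np.tanh(d + H @ x)               # hidden layer
    y = b + U @ a                        # unnormalised score for each vocabulary word
    e = np.exp(y - y.max())
    return e / e.sum()                   # softmax

probs = next_word_probs([2, 7, 1])       # indices of three previous words
assert abs(probs.sum() - 1.0) < 1e-9
```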

2. Deal with out-of-vocabulary words

The only thing we need to do is assign the out-of-vocabulary word a distributed representation. To do that, we treat the unknown word $j$ as a blank to be filled and use the network to estimate the probability $p_i$ of each word $i$ in the vocabulary filling that blank. Denoting by $C_i$ the distributed representation of word $i$, we assign $C_j = \sum_{i \in \text{vocabulary}} p_i \, C_i$ as the distributed representation of word $j$. After that, word $j$ can be added to the vocabulary, and this slightly larger vocabulary used for further computation.

This approach is quite elegant because it mirrors how people infer the meaning of an unfamiliar word from its context.
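A small continuation of the sketch above (my own reading of the trick; it reuses `next_word_probs` and `C` from the previous block):

```python
def oov_representation(context_ids):
    """Initialise a feature vector for an out-of-vocabulary word seen in this context."""
    p = next_word_probs(context_ids)     # probability of each known word filling the blank
    return p @ C                         # C_j = sum_i p_i * C_i, a probability-weighted average

C_new = oov_representation([2, 7, 1])
C_extended = np.vstack([C, C_new])       # the vocabulary grown by one word
# (U and b would also need a new output row before the new word can itself be predicted.)
```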

3. Deal with polysemous words

We can simply assign several distributed representations to a single polysemous word, one per sense.
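One hypothetical way to realise this (not something the paper implements): store one feature vector per sense and, for a given context, pick the sense whose vector best matches a context vector, for example the probability-weighted average from the out-of-vocabulary trick above.

```python
import numpy as np

# Hypothetical sketch: one feature vector per sense of an ambiguous word,
# with the sense chosen by how well it matches a context vector.
rng = np.random.default_rng(1)
m = 5
senses = {"bank": [rng.normal(size=m), rng.normal(size=m)]}   # e.g. riverbank vs. financial bank

def pick_sense(word, context_vector):
    """Select the sense vector most compatible with the given context vector."""
    scores = [float(context_vector @ v) for v in senses[word]]
    return senses[word][int(np.argmax(scores))]
```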

4. Comparison with n-gram models

Distributed representations let the model make use of semantic information and turn discrete random variables into points in a continuous space. Together, these two features let the network generalize much better than n-gram models.

Ⅱ. Improve computation efficiency

1. Represent the conditional probability with a tree structure

Each prediction only traverses $O(\log n)$ nodes of the tree instead of evaluating all $n$ output words, which greatly reduces the computation needed to produce a normalized probability.
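A minimal sketch of the idea with a complete binary tree over the vocabulary (the tree layout and variable names are mine; the paper only suggests the decomposition): each word is a leaf, and its probability is the product of the binary decisions along the path from the root, so a prediction touches $O(\log |V|)$ internal nodes instead of all $|V|$ outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
V, h = 8, 4                              # vocabulary size (a power of two for simplicity), hidden size
# One logistic unit per internal node of a complete binary tree: V - 1 nodes.
node_w = rng.normal(size=(V - 1, h))
node_b = np.zeros(V - 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def word_prob(word_id, hidden):
    """P(word | context) as a product of O(log V) binary decisions down the tree."""
    prob, node = 1.0, 0                  # start at the root (node 0, heap-style indexing)
    for depth in reversed(range(int(np.log2(V)))):   # read the word's path, most significant bit first
        go_right = (word_id >> depth) & 1
        p_right = sigmoid(node_w[node] @ hidden + node_b[node])
        prob *= p_right if go_right else 1.0 - p_right
        node = 2 * node + 1 + go_right   # descend to the chosen child
    return prob

hidden = rng.normal(size=h)              # stands in for the network's hidden activation
total = sum(word_prob(w, hidden) for w in range(V))
assert abs(total - 1.0) < 1e-9           # the leaf probabilities still sum to one
```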

2. Parallel implementation

The paper mentions both a data-parallel implementation (splitting the training data across processors) and a parameter-parallel implementation (splitting the parameters across processors).
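As a rough illustration of the data-parallel variant only (a simplified sketch, not the paper's actual shared-memory or cluster setup): each worker computes a gradient on its own shard of the minibatch, and the averaged gradient updates the shared parameters.

```python
import numpy as np

# Simplified data-parallel step: each "worker" handles a slice of the batch and
# computes a gradient on it; the averaged gradient updates the shared parameters.
# (A toy quadratic loss stands in for the language model.)
rng = np.random.default_rng(0)
theta = rng.normal(size=4)               # shared parameters
batch = rng.normal(size=(32, 4))         # one minibatch of inputs
n_workers, lr = 4, 0.1

def local_gradient(params, shard):
    # Gradient of the toy loss 0.5 * ||shard @ params||^2 averaged over this shard.
    return shard.T @ (shard @ params) / len(shard)

shards = np.array_split(batch, n_workers)
grads = [local_gradient(theta, s) for s in shards]   # in reality these run in parallel
theta -= lr * np.mean(grads, axis=0)                 # synchronised parameter update
```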

Ⅲ. Relating to Strong AI

1. Taking advantage of prior knowledge

In this paper, the prior knowledge being exploited is semantic: words with similar meanings should receive similar feature vectors and lead to similar predictions.

2. Decompose the whole network into smaller parts

This can make computation faster and make the network easier to adapt to other tasks. In “Opening the Black Box of Deep Neural Networks via Information”, it is argued that a large part of training is spent compressing the input into an effective representation. So if we could modularize the network and define a set of general APIs for such components, it could make a huge difference in practical implementations.

Ⅳ. Papers involved

  1. Y. Bengio and J-S. Senécal. Quick training of probabilistic neural nets by importance sampling. In AISTATS, 2003.
  2. A. Berger, S. Della Pietra, and V. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22:39–71, 1996.
