Thursday, February 7, 2013

Maximum Likelihood Estimate / Laplace's law

What is the maximum likelihood estimate in statistics? Let's use an example from natural language processing.

Let's say that we want to estimate the probability of every word in the English vocabulary. What this means is that we want to estimate the probability that a randomly chosen word happens to be a particular word. There are many uses for such an estimate, called a unigram probability, such as trying to predict the next word a user will write (in order to type it for the user). Morse code gives the most frequent letters (the letters with the highest probability of being used) the shortest codes in order to make sending messages faster. Something similar can be done for words in order to speed up searching through text and to compress text files.

How can we estimate such a probability? What is usually done is that a large amount of text is collected, called a corpus, and the words in the text are counted. By dividing the number of times each word occurs by the total number of words in the corpus we get an estimate of the probability of each word. This assumes that the corpus contains a large number of words and that the words are used naturally (not a single word repeated over and over, for example).

So P(w) = F(w)/F(*),
where P(w) is the probability of the word "w", F(w) is the frequency, that is, the number of times that the word "w" is found in the corpus, and F(*) is the total number of words in the corpus.
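
As a rough sketch, this count-and-divide estimate might be computed like this in Python (the tiny one-sentence corpus and the whitespace tokenisation are just assumptions for the illustration):

    from collections import Counter

    # A toy one-sentence corpus; a real corpus would contain millions of words.
    corpus = "the boy played with the girl"
    words = corpus.split()         # naive whitespace tokenisation

    freq = Counter(words)          # F(w): how many times each word occurs
    total = sum(freq.values())     # F(*): total number of words in the corpus

    # Maximum likelihood estimate: P(w) = F(w) / F(*)
    mle = {w: f / total for w, f in freq.items()}

    print(mle["the"])   # 2/6 ≈ 0.333
    print(mle["boy"])   # 1/6 ≈ 0.167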

What is the problem with this estimation? The problem is that any word which is not found in the corpus is assumed to have a probability of zero, that is, to not exist. This is not realistic, as even a very large corpus will not contain every possible word, and new words are invented all the time. What we need to do is give a little less probability to the words we have in the corpus in order to share some of it with the words which are not in the corpus. This is called smoothing.

As it is, the way we are estimating the probability gives all of the probability to the words in the corpus, which is the assignment of probabilities that makes the observed corpus as likely as possible. For this reason, this method is called the maximum likelihood estimate.

Now how can we smooth out our probability to give the unknown words a little of the probability too? A simple way we can do this is by assuming that each unknown word has an equal probability which is less than the probabilities of the words in the corpus. We can do this easily by assuming that each unknown word has a frequency of 1 and that each known word has a frequency of one more than it actually is. By increasing the number of times that each word occurs in the corpus by 1, including the words which aren't there, our probabilities will change such that the unknown words will all have an equal probability which is less than any known word's probability. In effect, no word will have a probability of zero, since every word would now look like it occurs at least once.

We might be tempted to say that the new probability is "(F(w) + 1)/F(*)", but this would be wrong because when you add together the probabilities of each word the total should come to 1, which it will not using this equation. As it was before, if you add together all the probabilities in the form of "F(w)/F(*)", the total would come to 1. For example if our corpus consisted of just one sentence "the boy played with the girl", then the probabilities of each word would be 2/6 for "the", 1/6 for "boy", 1/6 for "played", 1/6 for "with" and 1/6 for "girl". 2/6 + 1/6 + 1/6 + 1/6 + 1/6 = 1. If we use the wrong formula on the other hand, the probabilities would be 3/6 for "the", 2/6 for "boy", 2/6 for "played", 2/6 for "with" and 2/6 for "girl". 3/6 + 2/6 + 2/6 + 2/6 + 2/6 = 11/6 ≈ 1.8.
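
We can check this sum with a couple of lines of Python (a quick sketch reusing the same toy sentence, with exact fractions so the totals are easy to read):

    from collections import Counter
    from fractions import Fraction

    words = "the boy played with the girl".split()
    freq = Counter(words)
    total = sum(freq.values())  # 6 words in the toy corpus

    # Correct maximum likelihood probabilities: F(w) / F(*)
    print(sum(Fraction(f, total) for f in freq.values()))      # 1
    # Wrong "add one to the numerator only" version: (F(w) + 1) / F(*)
    print(sum(Fraction(f + 1, total) for f in freq.values()))  # 11/6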

What we need to do is add 1 to the denominator of the probability fraction for every different word we could be finding the probability of. This means we need to know how many different words there can be in total, unknown ones together with the known ones. This is called the vocabulary size, and assuming that you know it, you just need to add it to the denominator of the probability fraction like this:

P(w) = (F(w) + 1)/(F(*) + V),
where V is the vocabulary size, that is, the total number of different words, including unknown ones.
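
Here is a sketch of the smoothed estimate in Python; the vocabulary size of 1000 is just a made-up figure for the illustration:

    from collections import Counter

    words = "the boy played with the girl".split()
    freq = Counter(words)          # F(w)
    total = sum(freq.values())     # F(*)
    V = 1000                       # assumed vocabulary size (known + unknown words)

    def laplace_prob(w):
        # P(w) = (F(w) + 1) / (F(*) + V); an unseen word gets 1 / (F(*) + V)
        return (freq[w] + 1) / (total + V)

    print(laplace_prob("the"))       # (2 + 1) / (6 + 1000)
    print(laplace_prob("aardvark"))  # (0 + 1) / (6 + 1000), a word not in the corpus

Summing this over all V words in the vocabulary gives (F(*) + V)/(F(*) + V) = 1, which is exactly the property the naive formula was missing.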

This method is called Laplace's law, the rule of succession, or add-one smoothing. It's not a very realistic estimation either, however, since in reality unknown words will not all have equal probabilities and will not necessarily be less probable than the rarest words in the corpus. Also, giving the unknown words a frequency of 1 might look small, but it is only one less than the smoothed frequency of the rarest words in the corpus, so it might be too large an estimate, especially considering that we might be including words in our vocabulary size which just don't exist any more (maybe no one uses them).

There are other smoothing methods which can be used, such as making use of synonyms of unknown words: by using the probabilities of synonyms which are found in the corpus, you can estimate an unknown word's probability.

Let's see another example of how the maximum likelihood estimate and Laplace's law can be used. Let's say that we want to do something more interesting than finding the probabilities of unigrams (single words) and instead want to find the probabilities of bigrams, that is, pairs of consecutive words. The nice thing about bigrams is that you can use a word to predict what the next word in a sentence might be. This would be useful for making the writing of text messages faster by having your cell phone predict and write the next word for you. Using more than just the previous word to predict the next word would make for more accurate guesses, but let's just consider bigrams instead of full "n-grams".

The probability we want to find is that of the user writing a given word, given that the previous word he/she wrote is known. So, for example, we want to know the probability of the user writing "cream" after he/she has written the word "ice". We can then do this for every word and sort the words by their probability in order to present the user with a list of suggested next words. Using the maximum likelihood estimate, the probability is found by taking a corpus, counting the number of times the word pair "ice cream" is found, and dividing that by the number of times the word "ice" is found. That way we estimate the probability of the word "ice" being followed by the word "cream".

P(w2 | w1) = F(w1 w2)/F(w1),
where w1 is the previous word, w2 is the next word, F(w1 w2) is the number of times w2 follows w1 in the corpus, and P(w2 | w1) is the probability of the next word being w2 given that the previous word was w1.
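
As a sketch, this conditional estimate could be computed like this in Python (the little corpus is again just an invented example):

    from collections import Counter

    words = "i like ice cream and i like ice skating".split()
    bigrams = list(zip(words, words[1:]))   # consecutive word pairs

    unigram_freq = Counter(words)    # F(w1)
    bigram_freq = Counter(bigrams)   # F(w1 w2)

    def next_word_prob(w1, w2):
        # P(w2 | w1) = F(w1 w2) / F(w1)
        return bigram_freq[(w1, w2)] / unigram_freq[w1]

    print(next_word_prob("ice", "cream"))    # 1/2: "ice" appears twice, followed by "cream" once
    print(next_word_prob("ice", "skating"))  # 1/2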

Using Laplace's law we again add 1 to the numerator, and to the denominator we add the number of different outcomes we are spreading the probability over. Since the question is "which word comes next after w1?", those outcomes are the V possible next words, so we add the vocabulary size V to the denominator, just as before. (It is tempting to add the total number of possible word pairs, which would be the vocabulary size squared, but for a fixed previous word only V different words can follow it, and adding V is what makes the probabilities of all possible next words add up to 1.) This assumes that every possible word can follow every other possible word, including the same word, which isn't realistic at all, but it might do.

P(w2 | w1) = (F(w1 w2) + 1)/(F(w1) + V)
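
And here is a sketch of the smoothed version, building on the same toy example (again with a made-up vocabulary size of 1000):

    from collections import Counter

    words = "i like ice cream and i like ice skating".split()
    bigrams = list(zip(words, words[1:]))

    unigram_freq = Counter(words)
    bigram_freq = Counter(bigrams)
    V = 1000  # assumed vocabulary size

    def laplace_next_word_prob(w1, w2):
        # P(w2 | w1) = (F(w1 w2) + 1) / (F(w1) + V)
        return (bigram_freq[(w1, w2)] + 1) / (unigram_freq[w1] + V)

    print(laplace_next_word_prob("ice", "cream"))   # (1 + 1) / (2 + 1000)
    print(laplace_next_word_prob("ice", "hockey"))  # (0 + 1) / (2 + 1000), an unseen pair

Again, for a fixed previous word, the probabilities of all V possible next words add up to 1.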