Sunday, April 17, 2016

The Kullback–Leibler divergence

In a text file, each character is represented by a number of bits, usually 8. But in data compression and information theory we learn that the number of bits need not be the same for every character: the most frequent characters can be given the fewest bits whilst the least frequent characters can be given many bits. According to Shannon's information theory, the smallest number of bits needed to represent a character that occurs with probability "p" is
-log2(p)

This means that if a character occurs half the time in a document, then you need -log2(0.5) bits, which equals 1. On the other hand, a character that occurs an eighth of the time should be represented with -log2(0.125) bits, which equals 3. The rarer the character, the more bits should be assigned to it, in order to be able to use fewer bits for the more frequent characters. This is what makes the document as a whole smaller.
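
To make this concrete, here is a minimal sketch in Python (the probabilities are the same ones used in the example above):

import math

def ideal_bits(p):
    # Smallest number of bits for a character that occurs with probability p.
    return -math.log2(p)

print(ideal_bits(0.5))    # 1.0 bit for a character that occurs half the time
print(ideal_bits(0.125))  # 3.0 bits for a character that occurs an eighth of the time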

The entropy of a document is the expected number of bits per character, that is, the average number of bits used by a character in the document. If each character is given the minimum number of bits, as described above, then the expected number of bits per character is
-sum(p log2(p) for all character probabilities p)
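
As a small sketch of how this could be computed in Python (counting characters with collections.Counter and treating their relative frequencies as the probabilities):

from collections import Counter
import math

def entropy(text):
    # Expected number of bits per character when every character
    # is given the minimum -log2(p) bits.
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(entropy("aaaabbcd"))  # 1.75 bits per character on average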

Now that we know what entropy is, we can understand what the Kullback–Leibler divergence is. Assume that the number of bits for each character is calculated based on a different document than the one being compressed. Clearly this is a recipe for disaster, but you might want to compute an average probability for each character once, based on a representative corpus, and then always use these probabilities for all documents, which saves time. How many extra bits per character will this cost us? This is what the Kullback–Leibler divergence measures.

What we'll do is calculate the difference in the average number of bits per character when using the wrong probabilities instead of the correct ones. This is done as follows:
-sum(p log2(q)) - -sum(p log2(p))

Notice how the correct probability for a particular character is "p", but in the first term the number of bits is based on a different probability "q" (the one estimated from the other corpus), while the averaging is still weighted by the correct "p". This can now be simplified as follows:
-sum(p log2(q)) - -sum(p log2(p))
sum(p log2(p)) - sum(p log2(q))
sum(p log2(p) - p log2(q))
sum(p(log2(p) - log2(q)))
sum(p log2(p/q))

And this is the Kullback–Leibler divergence between the distribution the correct probabilities "p" come from and the distribution the wrong probabilities "q" come from. In this form it measures the number of extra bits per character wasted by using the wrong probabilities, and it is usually used to measure how different two probability distributions are.
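
As a final sketch (the two character distributions below are made up for illustration), the divergence can be computed either as the difference in average bits per character or directly with the simplified formula, and the two agree:

import math

def kl_divergence(p, q):
    # Kullback-Leibler divergence, in bits, between two distributions
    # given as dictionaries mapping each character to its probability.
    return sum(p[c] * math.log2(p[c] / q[c]) for c in p)

# Correct probabilities of the document being compressed.
p = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
# Probabilities estimated once from a different, "representative" corpus.
q = {'a': 0.25, 'b': 0.25, 'c': 0.25, 'd': 0.25}

# Average bits per character using the wrong probabilities minus
# the average using the correct ones...
avg_bits_wrong = -sum(p[c] * math.log2(q[c]) for c in p)
avg_bits_right = -sum(p[c] * math.log2(p[c]) for c in p)
print(avg_bits_wrong - avg_bits_right)  # 0.25 extra bits per character

# ...is the same as sum(p log2(p/q)).
print(kl_divergence(p, q))  # 0.25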