Wednesday, June 24, 2015

Compressed frequencies: Representing frequencies with fewer bits

In a cache you usually store the most frequently used data from a larger but slower memory. For example, you keep your most frequently accessed files cached in RAM rather than on your hard drive. Since you can't fit all the contents of your hard disk in RAM, you keep only the most frequently used files that fit. You will still need to access your hard disk once in a while for your less frequently used files, but the average file access time will be greatly reduced.

The problem is how to keep count of the number of times each file is used in order to know which files are the most frequently used. The obvious solution is to associate each file with a number and increment that number each time the file is used. But numbers take space as well, and sometimes this becomes a significant problem. You might not be able to afford 4 or 8 bytes of memory worth of frequency integers for every item. Is there a way to bring down the number of bytes used by the frequency integers without losing their usefulness?

Here is an example of a 4 byte int in memory representing the number 45723:
00000000 00000000 10110010 10011011

The most obvious thing you can do is to tear away the most significant bits (the ones on the left, which have a larger value) by using smaller range number types such as the 2 byte short, which gives us 10110010 10011011. If the frequency is sometimes, but rarely, larger than the range provided by these smaller types, then you can just cap it off by stopping incrementation once the maximum number is reached. For example, if we're using an unsigned 2 byte short, then the maximum number it can hold is 65535. Once the frequency reaches this number, it gets frozen and is never incremented again. Many frequencies follow a Zipfian distribution, meaning that the vast majority of items will have a small frequency, followed by a handful of very frequent items. An example of this is words in a document, where most words will occur only once and only a few words such as "the" and "of" will occur frequently. If this is the case then you will be fine with capping off your frequencies, since only a few items will have a large frequency and it might not be important to order these high frequency items among themselves.
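
As a minimal sketch, a capped increment might look like this in Python (the function name and the default cap of 65535, the maximum of an unsigned 2 byte short, are just for illustration):

def capped_increment(frequency, cap=65535): #65535 is the largest value an unsigned 2 byte short can hold
    if frequency < cap: #keep counting normally below the cap
        return frequency + 1
    return frequency #frozen: once the cap is reached the frequency never changes again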

It might seem more useful to instead tear away the least significant bits (the ones on the right, which have a smaller value), since these contribute the least to the number's value. The way you do this is to divide the frequency by a constant and keep only the whole number part. For example, if we divide the above number by 256, we'd be shifting the bits one byte to the right, which gives us 00000000 00000000 00000000 10110010. The least significant byte has been removed, which means that we can use fewer bytes to store the frequency. But in order to do that you need to first have the actual frequency, which defeats the purpose. So what we can do is simulate the division by incrementing the frequency only once every 256 times. If we do that then the stored number will always be a 256th of the actual frequency, which is the frequency without its least significant byte. But how do you know when to increment the frequency next? If you keep a separate counter which counts to 256 in order to know when to increment next, then you lose the space you would have saved. Instead we can do it stochastically using random numbers: increment the frequency with a probability of 1 in 256, and the stored value will be approximately a 256th of the actual frequency.
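
Here is a small sketch of this stochastic increment in Python (the function name and the divisor parameter are illustrative):

import random

def stochastic_increment(frequency, divisor=256):
    if random.randint(1, divisor) == 1: #increment with a probability of 1 in divisor
        return frequency + 1
    return frequency #most of the time the stored frequency stays the same

After a million calls the stored value will be around 1000000/256, which is about 3906, and multiplying it back by 256 gives an approximation of the actual count.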

By combining these two ideas we can reduce an 8 byte frequency to a single byte, and that byte will approximate one of the original 8 bytes of the actual frequency. Here is a Python function that increments a compressed frequency which is stored in a given number of bytes and has a given number of least significant bytes torn off.

import random

def compressed_increment(frequency, bytes_length, bytes_torn):
    if frequency < 256**bytes_length - 1: #cap the frequency at the maximum number that can be contained in bytes_length bytes
        if random.randint(1, 256**bytes_torn) == 1: #increment with a probability of 1 in 256^bytes_torn (the divisor that shifts the frequency right by bytes_torn bytes)
            return frequency + 1
    return frequency #otherwise leave the frequency unchanged
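
For example, here is a hypothetical run that counts a million events using a single byte of storage with two bytes torn off (the exact result will vary between runs):

frequency = 0
for _ in range(1000000):
    frequency = compressed_increment(frequency, 1, 2)
print(frequency) #about 1000000/256**2, which is roughly 15, and it fits in a single byte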

Of course, this is lossy compression: information is lost. This means that the compressed frequency is not useful in certain situations, such as when you also need to decrement the frequency or when approximate frequencies are inadequate.