Saturday, July 20, 2013

No Free Lunch

There's no free lunch: you always have to lose something in order to gain something. At least this is true in computer science, where there is even something called the No Free Lunch theorem. In this post I'd like to highlight various instances in computer science where you have to lose something in order to gain something, and offer some other fields where I think this also applies.

As you read these examples, keep in mind those games where you have to set the stats of your character, such as speed, strength, and so on, and where you have a fixed total number of points which you can distribute as you see fit, but no more. This creates a no free lunch situation, because you cannot have a character with perfect stats. Some stats must always be sacrificed in order to improve others.

Lossless compression

When you use WinZip in order to make a file smaller in such a way that you can extract the original file back again, you are using something called lossless compression. Can you design a compression algorithm which can make any file smaller and still be able to decompress it back? No, because there are more large files than there are small ones.

Computer files consist of a sequence of ones and zeros called bits. Let's say that we have designed an algorithm which compresses files consisting of 2 bits. There can only be 4 such files, which are:
00
01
10
11
Now let's say that our algorithm will somehow compress these files into 1-bit files. How many different 1-bit files are there? Just 2, which are:
0
1
So we want to compress each of the four 2-bit files into one of the two 1-bit files. What this results in is that at least two of the large files will turn into the same compressed file, so when you come to decompress it, it is impossible to know which of the two possible files the original was. You have lost information, and hence the process is not reversible. It would be a lossy compression, not a lossless one. (This counting argument is known as the pigeonhole principle.)
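
To see this concretely, here is a small sketch in Python (my choice of language for the examples in this post) which tries every possible way of assigning a 1-bit output to each of the four 2-bit files, and confirms that every such scheme maps at least two different files to the same output:

from itertools import product

files = ["00", "01", "10", "11"]

# Try every possible compression scheme: each scheme assigns one of the
# two 1-bit outputs ("0" or "1") to each of the four 2-bit files.
for outputs in product("01", repeat=len(files)):
    scheme = dict(zip(files, outputs))
    # The scheme is reversible only if no two files share an output,
    # which is impossible with only 2 outputs for 4 files.
    reversible = len(set(scheme.values())) == len(files)
    print(scheme, "reversible" if reversible else "collision")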

So what lossless compression algorithms do, inevitably, is make some files smaller and others bigger than they were. It is the only way: to compress some files, you must enlarge some others. A good algorithm would be designed to compress the most typical files whilst enlarging the less common ones. But what is a typical file? This depends on the application, which is why good compression results when an algorithm is designed for compressing only certain types of files, such as pictures or music.

Here is an example of how the 2 bit files can be compressed:
00: 0
01: 10
10: 110
11: 111
So the file "00" will be actually compressed whilst the rest will be either the same size or bigger. As you can see each file was compressed into a unique file so it is possible to reverse the process and obtain the original files.

There's no free lunch in lossless compression: you've got to have some files which will be enlarged if you want some others to be compressed.

Hashing functions

A hashing function is an algorithm which transforms a piece of data (such as a person's name) into a number within a fixed range. This number is called a hash.

Here is how the name "john" can be hashed into the number 7. Start by associating each letter with a number (for example "a" with 1, "b" with 2, and so on), then take each letter in the name and find its associated number. These numbers are then added together, producing a single number. So "john" is associated with the numbers [10, 15, 8, 14], whose sum is 47. However, we want the number to always be within a fixed range, such as between 0 and 9. We can do this by dividing the number by 10 and keeping the remainder, which will always be between 0 and 9. So 47 divided by 10 gives 4 remainder 7, so the hash of the name "john" is 7. This is of course just an example, and hash functions can be much more complicated than this.
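
Here is a sketch of this toy hash function (assuming lower case names with no spaces; the name toy_hash is my own):

def toy_hash(name, table_size=10):
    # "a" counts as 1, "b" as 2, and so on up to "z" as 26.
    total = sum(ord(letter) - ord("a") + 1 for letter in name)
    # Keep only the remainder so the result fits in the fixed range.
    return total % table_size

print(toy_hash("john"))  # 7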

Hashing functions are used in order to create a hash table, which is a table where every row is associated with a number and data is placed in a row according to its hash. The number of rows is the range of the hashes, so "john" in the previous example would be placed in row 7. This speeds up searching for names, since you can quickly find whether a name is in the table by just computing its hash and checking whether that particular row contains that particular name.
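
A minimal sketch of a hash table built on top of the toy_hash function above might look like this (each row is a list, which anticipates the collision problem discussed next):

class HashTable:
    def __init__(self, rows=10):
        # Each row is a list, so names which share a hash can coexist.
        self.rows = [[] for _ in range(rows)]

    def add(self, name):
        self.rows[toy_hash(name, len(self.rows))].append(name)

    def contains(self, name):
        # Only one row needs to be searched, not the whole table.
        return name in self.rows[toy_hash(name, len(self.rows))]

table = HashTable()
table.add("john")
print(table.contains("john"))  # True
print(table.contains("mary"))  # False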

The problem is that different names can give the same hash, which creates a collision in the hash table. There are ways around this, but a good hashing function will minimize the number of collisions. So is there a hash function which gives the minimal number of collisions for any data?

Obviously not. Let's say that the range of the hashing function allows you to use a hash table with 1000 rows. This means that at most you can have 1000 different names without a collision, after which the next name must result in a collision. Suppose there is a list of 1000 names which, when hashed, are all given different hashes. The hash function works well when the names are picked from that list. But there are more than 1000 possible names; there are potentially infinitely many. Since only 1000 of them can occupy distinct rows in the hash table, there are potentially infinitely many names which will all try to occupy the same row, regardless of which hashing function is used. This is simply because of the limited available space (the pigeonhole principle again). So you can always find a list of names which will result in collisions. You can increase the range of the hashing function, but it must always be a finite range, or it is not a hashing function and cannot be used in a hash table.
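
As a concrete illustration using the toy_hash function above: since the hash depends only on the sum of the letters, any two names made of the same letters must collide, and with only 10 rows, any 11 distinct names must contain at least one collision (the example names below are arbitrary):

print(toy_hash("john"), toy_hash("nohj"))  # 7 7

names = ["ann", "bob", "carl", "dina", "ed", "fay",
         "gus", "hal", "ivy", "jo", "kim"]
hashes = [toy_hash(name) for name in names]
# 11 names but only 10 possible hashes: two names must share a row.
print(len(set(hashes)) < len(names))  # True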

There is no free lunch when it comes to hashing functions. You can only find a hashing function which works well with a particular subset of the data (usually the most typical data, just like in compression). There will always be some data which will result in a collision.

Machine learning / regression analysis

Both machine learning and regression analysis are processes of trying to guess a pattern from some examples. For example, say I'm thinking of a sequence of numbers which starts 0, 2, 4, 6, 8, ... What will the next number be? You might be quick to guess that it will be 10, but this isn't necessarily true. You have simply recognized one pattern which the list follows, namely the sequence of even numbers. But this isn't the only pattern which starts with those numbers. With more examples it might look like this: 0, 2, 4, 6, 8, 1, 3, 5, 7, 9, 10, 12, 14, 16, 18, 11, 13, 15, 17, 19, ... in which case we might guess the next number to be 20, since the sequence alternates between five even numbers and five odd numbers.

In fact, for any finite sequence of numbers that you can think of, there exists an equation which starts with that sequence. A method for finding such an equation is called polynomial interpolation. For example, an equation which produces the sequence 1, 3, 2, 4 (for x = 1, 2, 3, 4) is:
f(x) = 1((x-2)(x-3)(x-4))/((1-2)(1-3)(1-4))
     + 3((x-1)(x-3)(x-4))/((2-1)(2-3)(2-4))
     + 2((x-1)(x-2)(x-4))/((3-1)(3-2)(3-4))
     + 4((x-1)(x-2)(x-3))/((4-1)(4-2)(4-3))
which simplifies to:
f(x) = (2x^3 - 15x^2 + 35x - 20)/2
which gives the following values:
f(1) = 1
f(2) = 3
f(3) = 2
f(4) = 4
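
If you want to check this yourself, here is a short sketch which evaluates the simplified polynomial:

def f(x):
    return (2*x**3 - 15*x**2 + 35*x - 20) / 2

print([f(x) for x in [1, 2, 3, 4]])  # [1.0, 3.0, 2.0, 4.0]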

And this can be done for any finite sequence. The equation might become more and more complex (even when simplified), but there will always be one. This means that for any example sequence you come up with, there are infinitely many different equations which start with that sequence and then continue differently. So no finite list of examples suffices to pin down a single pattern: there are infinitely many patterns which fit the examples, not just the one you were thinking of initially.
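
To make this concrete, here is a sketch of two different equations which both start with 0, 2, 4, 6, 8 (at x = 1 to 5) but then continue differently; the extra term in g is an arbitrary choice which happens to be zero at the first five points:

def f(x):
    return 2*(x - 1)  # the "even numbers" pattern: 0, 2, 4, 6, 8, 10, ...

def g(x):
    # Identical to f at x = 1 to 5, because the extra term is zero there.
    return 2*(x - 1) + (x-1)*(x-2)*(x-3)*(x-4)*(x-5)

print([f(x) for x in range(1, 7)])  # [0, 2, 4, 6, 8, 10]
print([g(x) for x in range(1, 7)])  # [0, 2, 4, 6, 8, 130]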

In general it is not simply a sequence of numbers that we need to find a pattern for. Machine learning is used to learn how to do a task from a set of examples of that task. It could be YouTube trying to guess what sort of videos you like based on the ones you have already seen. It could be Amazon trying to guess what sort of products you tend to buy based on the ones you have already bought. It could be a natural language generation program which, given a number of good stories, tries to produce a new original good story. It could be a program which tries to find an investment that will not fail in the current economic situation, using past economic situations as examples. It could be an automatic car which tries to learn how to drive by observing how people drive. All of these can be thought of as more complex versions of the equation shown earlier, with the difference that instead of an equation you have a full-blown program.

But the same problem persists. Different learning methods will give different patterns which describe the examples, and the problem is choosing the learning method which gives you the pattern that you want. This is called an inductive bias: different learning methods tend towards different patterns, so a particular learning method will tend to give patterns which are suitable for particular uses whilst another method gives patterns which are suitable for different uses. And this applies even to us humans, because we try to find patterns which describe our observations all the time, from a student who is trying to learn how to do an exercise based on the examples given by the teacher, to a scientist who is trying to find a scientific theory which describes some natural process. There is no one best method of learning, since each can give a pattern which fits the examples but does not fit new observations.

There is no free lunch in learning. Each learning method can fail and so there is no universally good method.

Programming languages?

We now come to the first of my unfounded hypotheses. I think that programming languages also have no free lunch. Each programming language is designed to be efficient at creating certain programs, but none is efficient at everything. The more general purpose a programming language is, the harder it is to find errors in the code, which is why Haskell was reduced in generality to produce Helium, which has better error messages. It seems like a perfect programming language cannot be created: it will either be too unwieldy to be used by humans or too specialized to be useful in general. So there is no free lunch; there cannot be a general and easy to use programming language. It must be domain specific (usually implicitly).

Data structures and algorithms?

Here is another of my unfounded hypotheses. Data structures are ways of representing data in order to be efficient for some particular use, such as searching. The problem is that there are always tradeoffs involved, and sometimes these tradeoffs might not be obvious. For example, although a sorted list is fast to search through, it is slower to add new data to. On the other hand, an unsorted list is faster to add data to (just append it to the end) but slower to search through. See the sketch after this paragraph.

Sorting algorithms also have tradeoffs. Many general sorting algorithms have a worst-case time complexity of O(n^2), but merge sort has a worst-case time complexity of O(n log n) at the expense of using extra memory proportional to the size of the list. Heap sort has a worst-case time complexity of O(n log n) (without big O notation it is roughly 2n log n) and sorts in place, but it still has its disadvantages. On the other hand, when an algorithm is optimized and improved to handle more and more special cases in order to avoid reaching the worst-case time complexity, another tradeoff is involved: the size of the program needed to implement the algorithm. And different computers will be more efficient at executing certain algorithms than others, just as we humans are better at using the slow sorting algorithms than the fast ones when sorting manually.

It seems like there cannot be perfect data structures and algorithms which are good in every situation and use; there will always be some situation which the data structure or algorithm is not well suited for. So there is no free lunch: there cannot be a general algorithm or data structure which works well in every situation. It must be application specific.
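
Here is a minimal sketch of the sorted versus unsorted list tradeoff, using Python's bisect module for the sorted case (the example values are arbitrary):

import bisect

# Unsorted list: adding is trivial, searching must scan everything.
unsorted_data = []
unsorted_data.append(42)     # fast add: just put it at the end
print(42 in unsorted_data)   # slow search: checks every element

# Sorted list: searching is fast, but adding must preserve the order.
sorted_data = [10, 20, 30, 50]
bisect.insort(sorted_data, 42)  # add: binary search, then shift elements
index = bisect.bisect_left(sorted_data, 42)  # fast binary search
print(index < len(sorted_data) and sorted_data[index] == 42)  # True
print(sorted_data)  # [10, 20, 30, 42, 50]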