Sunday, September 30, 2012

The Birthday Paradox

The birthday paradox is the probability of finding two people who share the same birthday in a group of randomly chosen people. The paradox is that it takes surprisingly very few people to get a high probability. This is because you are looking for two people with the same birthday rather than looking for one person with a particular birthday. It assumes that every day of the year has an equal chance of being someone's birthday, which isn't the case since there are more birthdays during the summer months, probably because nine months before then are the autumn months which, I assume, is when couples start spending more time inside and it's not too cold yet.

The probability is also used to find the probability of a hash collision. A hash collision is when a hashing function gives the same hash value for different inputs. One case where this is a problem is when hashes are used to represent contracts which are stored at notaries in order to keep the notary from reading the contract. When the contract needs to be read, the contract is sent to the notary who will then check if the hash is still the same, meaning that it wasn't changed. Of course it could still be changed and give the same hash value, but while this is hard to do for a particular contract, finding a pair of contracts which give the same hash value is easier and so it is important that the probability of this happening is taken into consideration.

So, assuming that birthdays are uniformly distributed across all days of the year and that a year has 365 days, the probability of 2 random people having the same birthday is 1/365 (because it's like asking what are the chances of the second person's birthday being on a particular date) whilst the probability of two people among 366 random people, or more, having the same birthday is 1 (because there are only 365 possible birthdays). But what is the probability of anything in between?

If we have "n" people, then the question is what is the probability of at least one of the those people having a birthday which is the same as another different person among the "n" people. In order to avoid checking for when 2 people share the same birthday, or 3 people, or 4 and so on, we will instead be checking what the probability of no one in the group sharing a birthday is and then subtract it from 1 in order to find what the probability of that not happening is. So we will find what the probability of no one sharing a birthday not happening is.

So what is the probability that no one among 3 people shares a birthday? The first person can have any birthday they want so their birthday can be in any day of the year, giving them a probability of 365/365 of having the right birthday. The second person can have any birthday except the one of the first, so they have a probability of 364/365 of having the right birthday. The third can have any birthday except the ones of the first and the second, so they have a probability of 363/365 of having the right birthday. All these events must happen and they are independent (the birthday of one person will not change the probability of any of the other persons having a right birthday), so we just multiply them together. However we want this whole event to not happen so that we know the probability of finding at least two people who share a birthday, so:
P(3) = 1 - ( 365/365 x 364/365 x 363/365 )
P(3) = 1 - ( (365 x 364 x 363)/(365^3) )
P(3) ≈ 0.00548

What about for 4 people?
P(4) = 1 - ( 365/365 x 364/365 x 363/365 x 362/365 )
P(4) = 1 - ( (365 x 364 x 363 x 362)/(365^4) )
P(4) ≈ 0.01636

What about for "n" people?
P(n) = 1 - ( 365/365 x 364/365 x 363/365 x ... x (356 - n + 1)/365 )
P(n) = 1 - ( (365 x 364 x 363 x ... x (365 - n + 1))/(365^n) )

We can simplify this further by using factorials. Since 7! = 7x6x5x4x3x2x1 and 4! = 4x3x2x1, we can find what 7x6x5 is by calculating 7!/4!, so that parts in 7! that we don't want get cancelled out by 4!. In general,
n x (n-1) x (n-2) x ... x (n-k+1) = n!/(n-k)!
So we can no use this to continue the simplification:

P(n) = 1 - ( (365 x 364 x 363 x ... x (365 - n + 1))/(365^n) )
P(n) = 1 - ( (365!/(365-n)!)/(365^n) )

We can further simplify this by using combinations. Since
n!/(k! x (n-k)!) = nCk,
we can turn any a!/(a-b)! into (b! x a!)/(b! x (a-b)!) by multiplying top and bottom by b! and subsequently turn it into b! x (a!/(b! x (a-b)!) which becomes b! x aCb. So now we can use this to continue the simplification:

P(n) = 1 - ( (365!/(365-n)!)/(365^n) )
P(n) = 1 - ( (n! x 365Cn)/(365^n) )

Now that we've simplified the probability formula, we can find out that P(23) is the smallest number of people that you will need in order to have at least a 50% probability of finding two people who share the same birthday. So in a crowd of 23 people you will probably find two who share the same birthday.