Wednesday, July 8, 2015

Expected number of uniformly distributed collisions (birthday problem)

Here's an interesting mathematical problem. If you have "n" objects to be inserted into "m" available slots using a uniformly distributed random placement, how many collisions with already occupied slots should we expect to happen? This is useful for hashtables and other data structures where duplicates are not allowed.

Here is a Python 3 program that simulates inserting objects into random positions in an array and counting the average number of collisions.

def collisions(n, m):
 trials = 10000
 total_collisions = 0
 for _ in range(trials):
  slot_is_occupied = [ False for _ in range(m) ]
  for _ in range(n):
   slot = random.randint(0, m-1)
   if slot_is_occupied[slot]:
    total_collisions += 1
   else:
    slot_is_occupied[slot] = True
 return total_collisions/trials

Here is a sample of the average number of collisions given by the above function for different values of "n" and "m":
n\m12345678910
10.00.00.00.00.00.00.00.00.00.0
21.00.49470.33340.25150.20540.1630.14370.12970.11180.1053
32.01.24470.88610.6850.5570.46330.40230.35370.3250.2819
43.02.12911.58121.25881.04430.90540.77540.69240.61760.5459
54.03.06172.41.9441.64571.42041.23641.11230.99840.9019
65.04.0343.2652.70782.3042.02181.76041.60041.43491.318
76.05.01684.17423.54063.04982.67162.39732.14991.94171.7744
87.06.00755.12064.4033.83633.39053.03042.73782.51512.3219
98.07.00356.07655.30524.67884.16523.7383.41683.12052.8816
109.08.00167.05266.22335.54014.96324.4934.09133.77213.5016

Basically the answer is the number of objects "n" minus the number of occupied slots. This will give us the number of objects excluding the ones which were inserted without collision, that is, in an empty slot. For example, if I insert 5 objects into an array but at the end there are only 3 occupied slots, then that must mean that 2 of those objects were inserted in the same slot as some other objects (they collided with them).

The question is how to predict the expected number of occupied slots.

Expected number of occupied slots

What is the average number of slots ending up being occupied by at least one object? This previous blog post explains that you basically just need to multiply the probability of a given slot being occupied at the end by the number of slots. So what is the probability of a slot being occupied?

Probability of a slot being occupied

What is the probability that an object is inserted into a particular slot out of "m" slots?
1/m

Therefore the probability that the slot remains empty is
1 - 1/m

What is the probability that the slot is still empty after another placement? It's the probability that the first object did not land on the slot AND that the second object did not land on the slot too. These two probabilities are independent of each other, so
(1 - 1/m)(1 - 1/m)

In general, after "n" objects have been placed, the probability that the slot is still empty is
(1 - 1/m)^n

Notice that this makes sense for n = 0 because if no objects were placed, then the probability that the slot is empty is 1.

Which means that after "n" objects have been placed, the probability that the slot is occupied is
1 - (1 - 1/m)^n

Therefore...

Therefore, the expect number of occupied slots among "m" slots after "n" objects have been inserted with uniform probability is
m(1 - (1 - 1/m)^n)

Which means that the expected number of collisions is
n - m(1 - (1 - 1/m)^n)

Here is the same table as the one at the top showing the corresponding predicted number of collisions:

n\m12345678910
10.00.00.00.00.00.00.00.0-0.00.0
21.00.50.33330.250.20.16670.14290.1250.11110.1
32.01.250.88890.68750.560.47220.40820.35940.3210.29
43.02.1251.59261.26561.0480.89350.77840.68950.61870.561
54.03.06252.39511.94921.63841.41131.23871.10330.99440.9049
65.04.03133.26342.71192.31072.00941.7761.59041.43941.3144
76.05.01564.17563.53393.04862.67452.37942.14161.94621.783
87.06.00785.11714.40053.83893.39543.03952.74892.50772.3047
98.07.00396.0785.30034.67114.16283.74813.40533.1182.8742
109.08.0027.0526.22535.53694.9694.49844.10463.77153.4868

The maximum absolute error between the two tables is 0.0179.

Probabilities are average proportions (expected value)

Intuitively, if a coin flip has a probability of 1/2 of turning out heads, and we flipped the coin 100 times, we expect that 1/2 of those 100 flips will be heads. What is meant by "expect" is that if we do this 100 coin flip experiment for many times, count the number of times it turns out heads for each 100 flip trial, and take the average of these counts, the average will be close to 1/2 of 100. Furthermore, the more 100 flip trials we include in our average, the closer the average will be 1/2 of 100.

If this were the case, then a probability can be treated as an average proportion, because if a probability of something happening is, say, 1/100, then after 1000 attempts we should find that, on average, 1/100 of those 1000 attempts would be the thing happening. In general, if the probability of an outcome is "p", and "n" attempts are made, then we should have "pn" positive outcomes. That probability is acting as a proportion of the average number of attempts made which will result in a positive outcome out of the attempts made. In fact, semantically speaking, the phrase "This outcome occurs with probability 1/100" and the phrase "This outcome occurs once every 100 times" are identical.

A simple proof of this is in the way we estimate the probability of an outcome. We attempt to produce the outcome (such as a coin flip resulting in heads) for a number of times "n", count the number of times "x" the outcome is positive (heads), and then just find x/n. But in order for this probability to be reliable, the quotient must remain constant for different values of "n" (the value "x" will change according to "n" to keep x/n equal). Given this statement, if we know a reliable probability x/n, and have performed the experiment "m" times, then the number of positive outcomes "y" can be predicted as follows:
For x/n to be reliable, x/n = y/m
Therefore, y = m(y/m) = m(x/n)
That is, since x/n is known and "m" is known, "y" can be found using those two values only.

Of course this is not a rigorous proof. To get a rigorous proof we need to turn to a field of probability called expected value. The expected value of a random variable (such as a coin flip) is the average of the values (assumed to be numerical) of the outcomes after a large number of trials. It is defined as the sum of each outcome multiplied by its probability. For example, the expected value of the value on a die is
1*1/6 + 2*1/6 + 3*1/6 + 4*1/6 + 5*1/6 + 6*1/6
because for each outcome from 1 to 6, the probability is 1/6.

In general, if the probability of outcome "o_i" is "p_i", then the expected outcome is
sum(o_i*p_i for all i)

But this isn't useful for proving the statement in the title. The proof is in this Math Exchange answer which explains that the expected number of positive outcomes out of "n" attempts, given that the probability of each outcome each time is "p", is "pn". It goes like this:

Let the random variable "U_i" be the outcome of the "i"th attempt (heads or tails). If the outcome is positive (heads), "U_i" is 1, otherwise it is 0. Given "n" attempts, the number of positive outcomes is
U_1 + U_2 + U_3 + ... + U_n

Call this actual number of positive outcomes "X", that is
X = U_1 + U_2 + U_3 + ... + U_n

The expected value of "X", written as E(X) is
E(X) = E(U_1 + U_2 + U_3 + ... + U_n)

Since the expected value is a linear operator,
E(X) = E(U_1) + E(U_2) + E(U_3) + ... + E(U_n)

Now, given the above definition of what an expected value is,
E(U_i) = 1*(probability of U_i = 1) + 0*(probability of U_i = 0)

If the probability of "U_i" being 1 is "p_i", then
E(U_i) = p_i

But for all "i", the probability of "U_i" is the same. That is
E(U_i) = p

So that means that
E(X) = p + p + p + ... + p
E(X) = pn

And there we have it, the expected number of positive outcomes out of "n" attempts, each of which has a probability of "p", is "pn", which means that the probability "p" can be treated exactly as if it was the proportion of positive outcomes out of a number of trials.