Hide behind huge numbers, making fighting against very expensive
Birthday problem or paradox is the probability that, from a given set of people, two of them will have the same birthday. It is a paradox because the result defies common sense. For a group of 23 people, the chance that two of them share the same birthday is greater than 50%, and for a group of 57 people, it is higher than 99%.
The best known use of Birthday Problem paradox is probably the Cryptographic Attack known as the Birthday Attack. This attack exploits the math behind the Birthday Paradox, by looking for collisions in a small set, having a much higher collision chance than expected.
Recently I came across a different use of the same paradox, in what else than the infamous Conficker. Here, the use of this statistical paradox is different, with the purpose of making the fight against this worm much harder.
Here is the problem explained: each day, they have a pool of 50,000 (pseudo)randomly generated URLs, out of each any infected computer randomly chooses 500. The total number of possible draws is huge. It actually has 1,215 digits in decimal representation and people will find hard even to imagine it. Just for the fun of it, I annexed the file to the end of my post. However, as you will see, in practice things work on a much smaller scale.
Registering all of them is an incredibly challenging task, and here lies the power of the aforementioned statistical problem. The problem here is to find the probability that a random group of 500 URLs contains at least one out of a smaller set (I will refer to such cases from here on as hits). And the result is amazing: if one registers 50 URLs, the chance to hit is 39.514%, and if one registers 500 URLs the chance to hit becomes 99.359%. That is, with only 1% of the pool registered, one can achieve more than 99% success rate in spreading new malware content using Conficker.
The graph of hit chance is here
On the horizontal axis we have the number of registered URLs and on the vertical axis we have the chance that any random draw will hit at least one of them, or in other words, the chance that a computer infected with Conficker will access one of these URLs.
This shows the importance of blocking as many of the 50,000 URLs as possible. A single missed URL that happens to be registered for malicious purpose can get 1% chance to spread malware to Conficker infected machines.
Randomly blocking some of the URLs have limited benefits, since the pool size is fairly big and the number of URLs potentially used by the malware is relatively small (at least two orders of magnitude).
Even if the above statement is true, there are some particularities that may help overcome these facts. The domains have to be registered through a limited number of Registrars, based on their TLD. By working with the registrars directly, bulkily blocking large numbers of domains becomes less of a problem than Conficker’s authors had foreseen, and with all the attention this thing is getting, people are willing to put in a lot of work to see this threat over.
The “good guys” also may use this paradox to their own advantage. It may give means in estimating the real size of the “infection” in the world. By registering a limited numbers of URLs, one can monitor the incoming requests, and knowing the chance a URL is picked, one can extrapolate to the number of infected machines.
Appendix 1 - Mathematical reasoning behind the numbers I’ve presented here.
Let’s denote by C(n,k) the number of combinations of size k chosen from a set of n elements (S). Our problem is to determine the probability that a randomly chosen set hit at least one element from a smaller subset of S. Let m be the number of elements in that smaller set. M is the subset of S, having Card(M) = m. The total number of possible k sized sets out of S is C(n,k).
In order to see how many of them contain at least an element from M we check first its complement. That is, the number of k-sets that do not contain an element from M. It is obvious that to have such sets, m has to be smaller than or equal to n– k. In other words if m is greater than n– k, there is no possible choice of k-sets that do not contain elements from M. If m is smaller than n– k and we subtract M from S (S\M) we get a subset of S, denoted by S’ that has n– m elements. It is clear that all k-sets from S that do not contain elements from M are also k-sets for S’, and all k-sets from S’ are also k-sets for S, so the sets are equivalent. Thus the number of k-sets from S that do not contain elements from M is equal to the number of k-sets from S’ which is equal to C(n-m,k).
As a direct result, the number of k-sets from S containing at least an element from M is C(n,k)–C(n-m,k). In order to compute the probability we divide this number by the total number of sets. We get P(m) = 1 – C(n–m,k)/C(n,k). If we break this down we get to
As we see, the second element is a product of sub-unitary numbers, which decreases towards 0 as we increase the number of elements (m). As a matter of fact, each element in the product is smaller than the first element (n–k)/n (trivial to prove under the assumption 1=j=m=n-k) resulting in the following approximation,
that is closing to 0 faster than an exponential. This means that our probability can be approximated with the following formula
Another debate may be started around the fact Conficker doesn't check for duplicates when picking up the 500 URLs. To take this into account, we have to estimate the average number of duplicates in a randomly picked 500 set out of the bigger 50,000 possible choices. A collision counting formula may be found at Collision counting formula. Applying the formula on our case, gives an estimate of 2.4867 duplicates on any random draw. To take this into account, we have to adjust previous calculations with 497 instead of 500, but this doesn't induce a notable difference in the results.
Another approach for the same arguments is to take into account the number of combinations with repetitions, rather than the number of combinations. This changes the above formulas to C(n+k–1,k) used instead of C(n,k); Combinations with repetitions. Having the following substitution n'=n+k–1 we get to the same formulas, but n' used in place of n. The differences in the numbers above are insignificant, and this is true for similar cases: n much bigger in comparison with k.
Appendix 2 - Here is a table showing the probabilities to get a hit for up to 100 URLs. The values are computed with the exact formula, not using the approximation, but in most cases, especially with large numbers, the estimation gives a pretty good idea.
|#Chosen URLs||Chance to hit|
Appendix 3 - Number of 500 sized groups out of a pool of 50,000