Reservoir Sampling - Once and for all

Today I'll talk about a topic which I consider very important - Reservoir Sampling.

Imagine that you have a set of N numbers. You would like to generate a random subset of size K, such that any such subset has equal probability to be generated. In other words, every element of the set has equal probability to end up in that subset. The algorithm to do this is actually very straight forward, however, the proof of why it works is not so trivial. Hence I will describe the algorithm first show the code and then I'll add the formal mathematical proof for it which I wrote in LaTex due to the need for mathematical notation.

First thing is first: The Algorithm. The algorithm works as follows: we start by taking the first K elements from the set. We then iterate from indexes K+1 to N and generate a random number between {0 - i +1} where i is the current index we are traversing. If the random number we get is between 0 to k (say index j) we replace the number in that index with the number at i. Otherwise, we continue to the next number.

Code:

The resulting subset will be in the first K elements of the vector V. I've attached to this post a pdf file with the proof I wrote for the correctness of the method. I think that reading it and understanding it is worth your while. 

 

Reservoir Sampling - Once and for all

ReservoirSampling1.pdf