Reservoir Sampling
A simple random sampling strategy to produce a sample without replacement from a stream of data - that is, in one pass: O(N)
Want to sample sinstances - uniformly at random without replacement - from a population size of n records, where n is not known.
Figuring out n would require 2 passes. Reservoir sampling achieves this in 1 pass.
A reservoir R here is simply an array of size s. Let D be data stream of size n
Algorithm:
- Store first s elements into R.
- for each element in position k = s+1 to n ,
- accept it with probability s/k
- if accepted, choose a random element from R to replace.
Partial analysis:
Base case is trivial. For the k+1st case, the probability a given element i with position <= k is in R is s/k. The prob. i is replaced is the probability k+1st element is chosen multiplied by i being chosen to be replaced, which is: s/(k+1) * 1/s = 1/(k+1), and prob that i is not replaced is k/k+1.
So any given element's probability of lasting after k+1 rounds is: (chosen in k steps, and not removed in k steps)
= s/k * k/(k+1), which is s/(k+1).
So, when k+1 = n, any element is present with probability s/n.
Distributing Reservoir Sampling
It is very simple to distribute the reservoir sampling algorithm to n nodes.
Split the data stream into n partitions, one for each node. Apply reservoir sampling with reservoir size s, the final reservoir size, on each of the partitions. Finally, aggregate each reservoir into a final reservoir sample by carrying out reservoir sampling on them.
Lets say you split data of size n into 2 nodes, where each partition is of size n/2. Sub-reservoirs R1 and R2 are each of size s.
Probability that a record will be in sub-reservoir is:
s / (n/2) = 2s/n
The Probability that a record will end up in the final reservoir given it is in a sub-reservoir is: s/(2s) = 1/2.
It follows that the probability any given record will end up in the final reservoir is:
2s/n * 1/2 = s/n