Sunday, February 26, 2012

Generating k random samples from a very large (Infinite) Set

Inside-out version of the Fisher-Yates shuffle  can be used to generate a random sample of size k from a very large source set as follows.
Pseudocode:
  Source[n]  // n size array of elements
  A[k]  // k size array of containing random samples
  A[0] ← S[0]
   for
i from 1 to k - 1 do
       
r ← random (0 .. i)
      
A[i] ← A[r]
      
A[r] ← Source[i]

   for i from k to n - 1 do
       
r ← random (0 .. i)
       if (r < k) then A[r] ← Source[i]

Order of the first k samples is immaterial, therefore the first loop can be removed and A can be initialized to be the first k items of Source. This yields Algorithm R (An instance of a family of randomized algorithms called Reservoir Sampling).

The modified algorithm is as follows:

Pseudocode:
  Source[n]  // n size array of elements
  A[k]  // k size array of containing random samples
 for i from 0 to k - 1 do
      
A[i] ← Source[i]

   for i from k to n - 1 do
       
r ← random (0 .. i)
       if (r < k) then A[r] ← Source[i]


This algorithm's time complexity  is O(n).

Further Reading 
 Wikipedia Article on Reservoir Sampling
Tutorial Slides
Jeffery Stock's Reservoir Sampling Paper
Lecture Silde

No comments:

Post a Comment