Sunday, February 26, 2012

Generating k random samples from a very large (Infinite) Set

Inside-out version of the Fisher-Yates shuffle  can be used to generate a random sample of size k from a very large source set as follows.
Pseudocode:
  Source[n]  // n size array of elements
  A[k]  // k size array of containing random samples
  A[0] ← S[0]
   for
i from 1 to k - 1 do
       
r ← random (0 .. i)
      
A[i] ← A[r]
      
A[r] ← Source[i]

   for i from k to n - 1 do
       
r ← random (0 .. i)
       if (r < k) then A[r] ← Source[i]

Order of the first k samples is immaterial, therefore the first loop can be removed and A can be initialized to be the first k items of Source. This yields Algorithm R (An instance of a family of randomized algorithms called Reservoir Sampling).

The modified algorithm is as follows:

Pseudocode:
  Source[n]  // n size array of elements
  A[k]  // k size array of containing random samples
 for i from 0 to k - 1 do
      
A[i] ← Source[i]

   for i from k to n - 1 do
       
r ← random (0 .. i)
       if (r < k) then A[r] ← Source[i]


This algorithm's time complexity  is O(n).

Further Reading 
 Wikipedia Article on Reservoir Sampling
Tutorial Slides
Jeffery Stock's Reservoir Sampling Paper
Lecture Silde

Random Set Shuffling

 Linear In-Place Shuffling :

The modern version of Fisher–Yates shuffle ( aka Knuth shuffle) can be used for generating a random permutation of a finite set or for randomly shuffling  a set. Fisher–Yates shuffle is unbiased, so that every permutation is equally likely.

Pseudocode :
A[n] : An array of n elements
for i  from n − 1 to 1 do
       j ← random integer with 0 ≤ ji
       swap A[j] and A[i]

This algorithm's time complexity  is O(n).

Generation of a New Shuffled Set from a Source Set:

Inside-out algorithm simultaneously initializes and shuffles an array from a given source array.

Pseudocode :
Source[n] // Source array
A[n] // Target array 
A[0] ← Source[0]
for i from 1 to n − 1 do
      j ← random integer with 0 ≤ ji
      A[i] ← A[j]
      A[j] ← Source[i]
 
This algorithm's time complexity  is O(n).
These algorithms are also used for randomly shuffling playing cards in electronic computer games.