Problem

Choose k entries from n numbers so that each number is selected with probability k/n.

Basic idea

  • Put the first k numbers (1, 2, 3, …, k) into the reservoir.
  • For number k+1, pick it with a probability of k/(k+1). If it is picked, it replaces a number in the reservoir chosen uniformly at random.
  • In general, for number k+i, pick it with a probability of k/(k+i). If it is picked, it replaces a number in the reservoir chosen uniformly at random.
  • Repeat until k+i reaches n.
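
The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names reservoir_sample, stream and rng are assumptions made for this example, and the input is assumed to be any iterable whose length need not be known in advance.

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Return up to k items from `stream`, each kept with probability k/n."""
    reservoir = []
    for count, item in enumerate(stream, start=1):
        if count <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        elif rng.random() < k / count:
            # Item number `count` is picked with probability k/count,
            # then overwrites a uniformly chosen slot in the reservoir.
            reservoir[rng.randrange(k)] = item
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(1, 101), 5))
```

If the stream holds fewer than k items, this sketch simply returns all of them.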

Proof

  • For k+i, the probability that it is selected and will replace a number in the reservoir is k/(k+i)
  • For a number X that was already in the reservoir, the probability that it is still there after k+i has been processed is
    • P(X was in the reservoir last time) × P(X is not replaced by k+i)
    • = P(X was in the reservoir last time) × (1 - P(k+i is selected and replaces X))
    • = k/(k+i-1) × (1 - k/(k+i) × 1/k), where k/(k+i-1) comes from the previous step (for i = 1 it is k/k = 1)
    • = k/(k+i-1) × (k+i-1)/(k+i)
    • = k/(k+i)
  • When k+i reaches n, the probability of each number staying in the reservoir is k/n
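
The recurrence in the proof can also be checked with exact arithmetic. The sketch below is only an illustration: the function name survival_probability is made up for this example, and it tracks a number X that started in the initial reservoir, multiplying in the "not replaced" factor at each step.

```python
from fractions import Fraction


def survival_probability(k, n):
    """Exact probability that an initial reservoir item survives until item n."""
    p = Fraction(1)  # X is in the reservoir with certainty after the first k items
    for m in range(k + 1, n + 1):
        # Item m is selected with probability k/m and hits X's slot with 1/k.
        p_replaced = Fraction(k, m) * Fraction(1, k)
        p *= 1 - p_replaced
        # Matches the closed form k/(k+i) from the proof, with m = k+i.
        assert p == Fraction(k, m)
    return p


if __name__ == "__main__":
    print(survival_probability(3, 4))
    print(survival_probability(5, 100))
```

Because Fraction keeps every value exact, the assertion inside the loop confirms the k/(k+i) result at every step rather than only approximately.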

Example

  • Choose 3 numbers from [111, 222, 333, 444]. Make sure each number is selected with a probability of 3/4
  • First, choose [111, 222, 333] as the initial reservoir
  • Then choose 444 with a probability of 3/4
  • For 111, it stays with a probability of
    • P(444 is not selected) + P(444 is selected but it replaces 222 or 333)
    • = 1/4 + 3/4 × 2/3
    • = 3/4
  • The same holds for 222 and 333
  • So every number, including 444, ends up in the reservoir with a probability of 3/4
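
The example is small enough to enumerate every possible outcome exactly. The sketch below is an illustration only, and the variable names (outcomes, inclusion, newcomer) are made up for it. It lists each possible final reservoir with its probability and sums up how often each number appears.

```python
from fractions import Fraction

numbers = [111, 222, 333, 444]
k = 3
initial = numbers[:k]   # [111, 222, 333]
newcomer = numbers[k]   # 444

# Map each possible final reservoir to its exact probability.
outcomes = {}
p_select = Fraction(k, k + 1)            # 444 is picked with probability 3/4
outcomes[tuple(initial)] = 1 - p_select  # 444 is not picked: reservoir unchanged
for slot in range(k):
    final = list(initial)
    final[slot] = newcomer               # 444 replaces one of the three slots
    outcomes[tuple(final)] = p_select * Fraction(1, k)

# Probability that each number ends up in the final reservoir.
inclusion = {
    x: sum(p for reservoir, p in outcomes.items() if x in reservoir)
    for x in numbers
}
for x, p in inclusion.items():
    print(x, p)
```

There are four equally likely final reservoirs (probability 1/4 each), and every number appears in exactly three of them, which gives 3/4 for each.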

Source:

http://www.geeksforgeeks.org/reservoir-sampling/
http://blog.csdn.net/javastart/article/details/50610868