Sampling equiprobably with dice

From Wikipedia, the free encyclopedia

An illustrative example of how to compute with the probabilities associated to dice, as well as the analysis of algorithms, is the following problem. Suppose you have a group of 19 individuals and an ordinary six-sided die. Your task is to use that die to select one of these individuals equiproabably, i.e. so that all individuals are equally likely, having probability 1 / 19 of being selected, and using as few rolls of the die as possible. This is an example of the more general problem of using dice to sample a range equiprobably, where the range has no common factor with the number of faces of the die, which is in turn equivalent to one of the key problems of the algorithmics of random number generators. This is because with today's hardware, even if a random number generator samples a physical source, it will need to load the sample into a discrete register consisting of a finite number of bits. Hence the question arises of how to sample equiprobably from a range not being a power of two using a random number generator that returns some fixed number of random bits.

The main result is that the expected number E[T] of rolls when an n-sided die is used to select equiprobably from a range m where n < m and (n,m) = 1 and the discrete random variable T represents the number of rolls, obeys the bound

E[T] \le \lfloor \log_n m \rfloor + 1 +
\frac{m-1}{m} \frac{n}{n-1}.

This article presents the basic concepts and a basic analysis of the problem. For a much more sophisticated treatment of the case n = 2, consult the article by H. Prodinger, who also discusses the problem in terms of leader election on a network. (The reference is here).

Contents

[edit] Building the algorithm

First, we return to the case of an ordinary die and nineteen individuals. A moment's thought reveals that a bounded number of rolls of the die, say k, will never suffice, as there is no way to distribute the values from one to nineteen among the 6k possible outcomes so that all values are equally likely. We note, however, that if 6k is large, and we use q \, \bmod \,19, where q\in [0, 6^k) is the outcome, to choose the individual, the probabilities will be very close to \frac{1}{19}, being perturbed only slightly by the remainder r = 6^k \,\bmod \,19. This suggests that we roll the die until the number 6k of possible outcomes is at least nineteen, and take q \, \bmod \,19, if q < 6^k - r.\, If q \ge 6^k - r, however, we will need to roll the die again.

Recall that we require as few rolls of the die as possible. Therefore we should make use of the r discarded outcomes that trigger another set of rolls, so as to not loose any valuable random bits. Each of these discarded outcomes may combine with any one of the outcomes of the next set of rolls, forming 6r possible pairs. This finally suggests the following procedure. Introduce the variable ck, where k is the number of rolls, and set c0 = 1. This is the sequence of remainders r. Similarly, introduce the variable qk and set q0 = 0. This is the index into the remainder interval if applicable. At every step, roll the die, obtaining a value q between zero and five, i.e. modulo six. Let c_k = 6 c_{k-1} \,\bmod\, 19 and p = 6 q_{k-1} + q.\, If p < 6 c_{k-1} - c_k\,, take p \, \bmod \,19 and halt, otherwise, set q_k = p - (6 c_{k-1} - c_k)\, and repeat.

This algorithm is actually quite simple. Roll the die. If the value obtained is below the remainder interval, take the value modulo nineteen. If not, use six times the remainder interval as the new available range. Roll again and combine with the previous index into the remainder interval to obtain an index into the new range, and repeat.

[edit] Analysis

This procedure may be analysed in various ways. The simplest is to fix an individual and compute the probability of him/her being chosen. If this turns out to be \frac{1}{19}, then the algorithm is indeed correct. With this in mind we introduce the probability pk of the individual being chosen after k rolls, and make use of the fact that the sequence of remainders must repeat with a period that divides \varphi(19) = 18, according to Fermat's little theorem.

We find that


p_1 =  \frac{0}{6} + \frac{6}{6} p_2, \quad
p_2 =  \frac{1}{36} + \frac{17}{36} p_3, \quad
p_3 =  \frac{5}{102} + \frac{7}{102} p_4

p_4 = \frac{2}{42} + \frac{4}{42} p_5, \quad
p_5 = \frac{1}{24} + \frac{5}{24} p_6, \quad
p_6 = \frac{1}{30} + \frac{11}{30} p_7

p_7 = \frac{3}{66} + \frac{9}{66} p_8, \quad
p_8 = \frac{2}{54} + \frac{16}{54} p_9, \quad
p_9 = \frac{5}{96} + \frac{1}{96} p_1.

This yields p_1 = \frac{1}{19}, so the algorithm is correct. In fact we have

p_1 = p_2 = p_3 = \ldots = p_9 = \frac{1}{19}
\quad \mbox{because} \quad
\frac{1}{19} = \frac{1}{19} x + (1-x) \frac{1}{19}.

We introduce tk, the expected number of rolls after k rolls, in order to verify that the algorithm fulfills the requirement of needing few rolls. Using the same probabilities as above, we find that


t_1 =  \frac{0}{6} + \frac{6}{6} (1 + t_2), \quad
t_2 =  \frac{19}{36} + \frac{17}{36} (1 + t_3), \quad
t_3 =  \frac{95}{102} + \frac{7}{102} (1+ t_4)

t_4 = \frac{38}{42} + \frac{4}{42} (1 + t_5), \quad
t_5 = \frac{19}{24} + \frac{5}{24} (1 + t_6), \quad
t_6 = \frac{19}{30} + \frac{11}{30} (1 + t_7)

t_7 = \frac{57}{66} + \frac{9}{66} (1 + t_8), \quad
t_8 = \frac{38}{54} + \frac{16}{54} (1 + t_9), \quad
t_9 = \frac{95}{96} + \frac{1}{96} (1 + t_1).

This yields t_1 = \frac{25281276}{10077695} \approx 2.508636747, so we need on average two and a half rolls of a six-sided die to select one of nineteen individuals equiprobably.

The same result may be obtained by introducing the probability generating function f(x) given by


f(x) = 
\frac{19}{6^2} x^2 + 
\frac{95}{6^3} x^3 + 
\frac{38}{6^4} x^4 + 
\frac{19}{6^5} x^5 + 
\frac{19}{6^6} x^6 + 
\frac{57}{6^7} x^7 + 
\frac{38}{6^8} x^8 + 
\frac{95}{6^9} x^9 + 
\frac{1}{6^9} x^9 f(x)

and using the fact that

 t_1 = \left. \frac{d}{dx} f(x) \right|_{x=1}
= \frac{25281276}{10077695}.

The correctness of this equivalent approach will be shown in the next section.

[edit] Analysis of the general case

The analysis of the general case, i.e. the expected number of rolls where an n-sided die is used to select equiprobably from a range m where n < m and (n,m) = 1 is very instructive, and serves to illustrate the techniques used in the manipulation of discrete random variables.

Here it is useful to introduce an infinite tree, consisting of internal nodes and of outdegree n, where each level of the tree represents the nk possible outcomes after k rolls, the root being the initial state (zero rolls). We may think of the levels being divided from left to right into three zones, a blue zone, a green one, and a red one. The blue nodes represent outcomes that correspond to a halt after fewer than k rolls, the green ones, to a halt at the current level (exactly k rolls) and the red ones, to the remainder case (another roll is required). All children of the blue nodes are blue, as are children of the green nodes. There are ck red nodes, and of their children, c_{k+1} = n c_k \, \bmod \, m are in red in turn, while  n c_k - c_{k+1}\, are green. The number of green nodes is always a multiple of m and a fraction of 1 / m of them correspond to a particular individual or value, thus showing by inspection that the algorithm samples the range equiprobably.

The three colors correspond to three sequences, (ak),(bk), and (ck), where ak / nk is the probability of having halted after fewer than k rolls, bk / nk is the probability of halting after exactly k rolls, and ck / nk is the probability of needing at least k + 1 rolls. These sequences obey the following recurrences:

a_{k+1} = n a_k + n b_k, \quad 
b_{k+1} = n c_k - c_{k+1}, \quad \mbox{and} \quad
c_{k+1} = n c_k\, \bmod \, m

as well as

a_k + b_k + c_k = \,n^k.

Let T be the random variable giving the number of rolls. The fact that ck / nk is the probability of needing at least k + 1 rolls may also be derived from the conditional probabilities of needing another roll at any time.

P[T \ge k+1] =
\frac{c_k}{b_k+c_k}
\frac{c_{k-1}}{b_{k-1}+c_{k-1}}
\cdots
\frac{c_1}{c_1+b_1}.

But b_k + c_k = \, n c_{k-1}, so this is


\frac{c_k}{n c_{k-1}}
\frac{c_{k-1}}{n c_{k-2}}
\cdots
\frac{c_1}{n c_0} = \frac{c_k}{n^k}.

Using the definition of the expected value, we find that


E[T] = \sum_{k\ge 1} k P[T=k] = \sum_{k \ge 1} P[T\ge k] = \sum_{k\ge 0} \frac{c_k}{n^k}.

Recalling the definition of ck, this becomes


\begin{array}{lcl}
E[T] & = & \sum_{k\ge 0} \frac{n^k \, \bmod \, m}{n^k} \\
& = & \sum_{n^k < m} 1 + \sum_{n^k > m} \frac{n^k \, \bmod \, m}{n^k} \\
& = & \lfloor \log_n m \rfloor + 1 + 
\frac{1}{n^{\lfloor \log_n m \rfloor + 1}} 
\sum_{k\ge 0} \frac{n^{k + \lfloor \log_n m \rfloor + 1} \, \bmod \, m}{n^k} \\
& \le & \lfloor \log_n m \rfloor + 1 + 
\frac{1}{n^{\lfloor \log_n m \rfloor + 1}}
\sum_{k\ge 0} \frac{m-1}{n^k} \\
& \le & \lfloor \log_n m \rfloor + 1 + 
\frac{1}{n^{\log_n m -1 + 1}}
(m-1) \frac{1}{1-1/n} \\
& = & \lfloor \log_n m \rfloor + 1 +
\frac{m-1}{m} \frac{n}{n-1}.
\end{array}

Hence we need about two more rolls than \lfloor \log_n m \rfloor to select equiprobably from m individuals using an n-sided die.

[edit] External links

  • Quim Testar, Antonio González, Marko Riedel, et al. Aleae iactae sunt, newsgroup es.ciencia.matematicas, In Spanish.
  • Quim Testar, Antonio González, Marko Riedel, et al. Acotando el numero de tiradas de dado esperadas, newsgroup es.ciencia.matematicas, In Spanish.

[edit] References