Talk:Fisher-Yates shuffle

From Wikipedia, the free encyclopedia

Contents

[edit] code style

I separated testing the index from decrementing it, since doing both at once is discouraged in many standard coding styles. Also, this makes it easier to state what n exactly means, as an invariant. (The old version would need 'n is the maximum pertinent index', but then the initialization statement doesn't gibe with that.) It might be more natural to move the --n to the end of the loop body (and index n-1). I suspect a for-loop is actually more appropriate (esp. if the target reader is expected to recognize a pre/post decrement).

Should the variable 'n' be replaced with something more index-like, like 'i'? In particular, there is confusion between the algorithm's n (which is 0-based), and the n in the pseudocode's An (which is 1-based).

Finally, I like the discussion at the end about drawbacks of pseudo-random number generators, but that's orthogonal to the current page; should this page just link to that? (Like MaxEnt below, I kinda prefer expository treatments over encyclopedic ones in many cases.)

not-just-yeti (talk) 17:29, 14 March 2008 (UTC)


[edit] Comments from Ilmari Karonen's talk page

(Copied from User talk:Ilmari Karonen.)

I see you contributed this article recently. I read it through on a superficial level. It's quite good in pointing out the majority of pitfalls. You can never say enough about pitfalls where random numbers are concerned. A couple of points: there are many applications, such as video games, where a small bias is immaterial. In the case of using a language's built-in sort function, it be helpful to mention which sort algorithms are unbiased for this purpose. I think plain quicksort is OK, but some quicksorts oversample the pivot element. I would worry about those extra calls to the comparison function. The C++ STL is quite explicit about the properties of the provided sort() algorithms, though I can't recall right now which are in and which are out for this purpose.

There are some who might complain that this article is more tutorial than encyclopedic (not citing every claim from another reference). I personally would favour seeing more expository synthesis, where the stiff/authoritative sources are assimilited for the benefit of a less expert reader. A good contribution. (Not expecting a reply, use my talk page if you wish to.) MaxEnt 06:52, 15 August 2007 (UTC)

Thanks for the comments. Since you've given me the excuse, I'll try to outline a bit where I've been planning to go with this article. I've your comments here to the article's talk page, since I feel they could be useful to others besides just me. Similarly, the rest of my response below is as much for myself (thinking out loud) and anyone else who might be reading than just you.
I agree that the "Potential sources of bias" section is rather lengthy; it just turned out that way. Part of the reason for spending so much affort and words on that section was my (perhaps misguided) aim to include more or less all of the issues already listed in the shuffling article; another was my personal experience that the Fisher-Yates shuffle, though simple, tends to be an easy algorithm to implement incorrectly. Of course, much of the difficulty comes from not understanding the logic behind the algorithm, but instead treating it as a magic incantation that "just works somehow", a problem that I hope the earlier sections will help to correct. Certainly I wouldn't see condensing the problems section into a more concise and easily understandable form as a bad thing at all.
As for sourcing, I do need to work on it more, if only by matching the existing sources with claims that are found in them. Much of the "Potential sources of bias" section can be sourced to either Knuth or Fisher&Yates themselves, and I'm sure there are others sources that treat the issues in more depth — it's not like I just pulled it all out of a hat. I did name all the <ref> tags, in anticipation of reusing them, but I'll still need to actually do that.
Another issue I'm aware of that needs work is distinguishing between the general problem of shuffling a pre-existing finite set and the specific problem of obtaining a random permutation of the integers 1–N (or 0–(N−1)). A particular reason for emphasizing the distinction is that there's a variant of the algorithm (some might consider it a different algorithm entirely) due to Knuth which is specific to the second task, avoiding the swap operation by running the shuffle "backwards" and initializing the array as it is shuffled:
  1. Let n := 1.
  2. Choose a random k such that 1 ≤ kn.
  3. If kn, let An := Ak.
  4. Let Ak := n.
  5. Increase n by one.
  6. If nN, repeat from step 2.
I haven't actually added this to the article yet, both because I want to try chasing down the original reference (Knuth, The Stanford GraphBase, 1994, p. 104) rather than cite the oblique mention in The Art of Computer Programming (3rd ed.) and because I want to clarify other parts of the article at the same time.
Similarly, I don't think Sattolo's algorithm really makes much sense when applied to any set that isn't ordered to begin with, although I've half a mind to split that one off into its own article anyway. Come to think of it, I do believe the "backwards" variant above can also be made to produce cyclic permutations by adjusting the range of k like in Sattolo's algorithm. For that I have no cite, though I do believe I can sketch a proof.
As for the "sort with random comparator" approach, the reason I haven't mentioned which sorting algorithms produce unbiased results with it is that I'm not aware of any, though I'm willing to believe there could be some. In any case, though, those who attempt that approach usually do so using a built-in sort function whose implementation details may vary and thus can't be relied on. It might be best to just remove the weaselly "some" from that paragraph entirely... In any case, it really could use better references, though I don't know if the approach has actually been analysed in any academic publication. —Ilmari Karonen (talk) 11:51, 15 August 2007 (UTC)

[edit] Example Programs are Confused

"Below is the same algorithm in C#". The program below that statement is obviously not the same algorithm because all of the variables have been renamed in a contradictory manner. "n" is now "i", "k" is now "n". How on Earth is this supposed to make sense? Miqrogroove (talk) 22:41, 18 January 2008 (UTC)

I cleaned up the C# code to use the same variable names, but then decided to remove it entirely since the result was nearly identical to the Java version. Having multiple example implementations might not be a bad idea as such, but there's little point in providing minor variations for a bunch of different C-derived languages. Let's at least have something with different-looking syntax, like Python for example. —Ilmari Karonen (talk) 05:09, 20 January 2008 (UTC)

[edit] Inaccuracy

"Also, of course, no pseudorandom number generator can produce more distinct sequences than there are distinct seed values it may be initialized with. Thus, it doesn't matter much if a generator has 1024 bits of internal state if it is only ever initialized with a 32-bit seed."

This is not necessarily true. If we switch to a different algorithm, such as the reservoir model (cite: Knuth vol 2 section 3.4.2 algorithm R), we only need our pseudorandom number generator to produce numbers [0 N) to produce all N! sequences. —Preceding unsigned comment added by 142.104.68.102 (talk) 23:51, 19 March 2008 (UTC)

Interesting, I didn't know that. But maybe what the orignal author wanted to say was, that implementations probably do something like one seeding - one shuffle. Then the program terminates and the long period of the PRNG goes to a waste. I understand that is why under Unix we should use the system provided random (but I don't really know much about that). If we use a PRNG library in our program (e.g. a Mersenne Twister is provided for Free Pascal under Windows) we probably should better save the internal state of the PRNG (and reload+continue from there on the next run) instead of seeding every time we start our program. I hope that wasn't too off-topic. Reading about the potential implemenation problems here was very helpful for me; I might have fallen into one of those traps. --Stevemiller (talk) 04:09, 27 March 2008 (UTC)