Talk:Duff's device

From Wikipedia, the free encyclopedia

This article was originally based on material from the Free On-line Dictionary of Computing, which is licensed under the GFDL.

Contents

[edit] I really do not understand why keep this requirement of (count > 0) - if you rewrite it a bit, it can be zero:

Oops, made a mistake :)

/* this assumes count is signed */
switch (count & 7) {
        while ((count -= 8) > -1) {
                *d = *s++;
        case 7: *d = *s++;
        case 6: *d = *s++;
        case 5: *d = *s++;
        case 4: *d = *s++;
        case 3: *d = *s++;
        case 2: *d = *s++;
        case 1: *d = *s++;
        case 0: ;
        }
}

Best regards,

 Ashot.
Put a better explanation for non-C programmers please...
I'm not so sure that a non-C programmer would get a lot of excitement/revulsion out of this --kibibu 01:15, 7 October 2005 (UTC)
The sort of explanation for a non-C programmer would be useful for those just starting to program in C as well. Also people wondering about this as the basis of coroutines and protothreads. Hackwrench 02:17, 7 October 2005 (UTC)
The only thing a non-C programmer needs to understand this article is a beginner's knowledge of C, which can be obtained the usual way and doesn't belong here. This holds even more so for understanding C implementations of coroutines and protothreads that use the same trick that Duff's Device uses (they do not use Duff's Device per se; as Duff wrote in his original email, "I have another revolting way to use switches to implement interrupt driven state machines [...]"). As for the rewrite, it adds an extra subtract, test, and branch for any multiple of 8, so it would be slightly more efficient to test for count == 0 as a special case -- which often isn't needed (when it is known that count can't be 0). -- Jibal 08:21, 6 April 2007 (UTC)
Your code does not compile. Code cannot exist in a switch-case before the first case-label. Adding a dummy "case -1:" there would work though. Ps. Please sign your comments with ~~~~. Bisqwit 13:28, 24 August 2007 (UTC)
I don't know what compiler you used but the code compiles just fine on my g++ compiler. The dummy is was not needed on this compiler. 8 September 2007

[edit] slashdot

This article has been linked from slashdot, so I just added the slashdotted template as a precaution. --Codemonkey 01:07, 7 October 2005 (UTC)

The mod 8 operation could be optimized at the cost of readability. Mod is a complex operation, and unless the compiler optimizes for the specific case of mod 2n likely any performance improvements seen by using a jump table will vanish.

It is more optimal to use a logical and for the modulus and a logical shift for the division.

This also illustrates the importance of maintaining a length that is a power of 2.

--67.173.214.71 02:09, 7 October 2005 (UTC)

What do you mean that mod is a complicated operation? It comes free with every assembly DIV instruction. Hackwrench 02:29, 7 October 2005 (UTC)
Also why would the remainder be a loop? Hackwrench 02:29, 7 October 2005 (UTC)
Re: "[Mod] comes free with every assembly DIV instruction." Counting instructions can be misleading. On most processor implementations, not every instruction is born equal. For instance, on an Intel Pentium 4 processor, the DIV instruction takes 80 cycles, while a mask operation takes only 1 cycle (note: mod 2n can be implemented as a mask operation, i.e., using the AND instruction). In other words, on an Intel Pentium 4, if you know that the divisor is a power of 2, your implementation of the modulus operation can be 80 times faster than the general case where nothing is known about the divisor and where you must fall back to use the DIV instruction. [A source for the instruction latencies is http://swox.com/doc/x86-timing.pdf].
Re: "Duff's Device is an optimized implementation of a serial copy." Tom Duff's device is not just about optimizing serial copy. It's more general than that. It's a way to express loop unrolling (or loop unwinding) in the C programming language. Optimizing serial copy is just one application of loop unrolling, there are many others. See point 1 of Tom's original usenet posting (available for instance there http://www.lysator.liu.se/c/duffs-device.html).
Re: "Based on an algorithm used widely by assemblers..." Loop unrolling is also implemented in many compilers (e.g., GNU's GCC). Since fewer and fewer folks write assembly anymore, loop unrolling is perhaps now more likely to be the result of a compilation.
If you're interested in optimization, the best thing you can do is throw any compiler that doesn't optimize for mod 2n into the trash and use a decent one; writing less readable code to help out a non-optimizing compiler is penny wise and pound foolish. As for "maintaining a length that is a power of 2", Duff's Device handles any length (except 0); 8 is just the block factor for unrolling, which should indeed be a power of 2. Also, the jump table is a result of the case values being a non-sparse set of constants; it has nothing to do with powers of 2. Finally, Duff's Device does not obtain a performance improvement over a more straightforward loop unrolling implementation that uses an 8-at-a-time loop plus a switch statement for the leftover elements (the way Duff said he would "usually handle this"), it only reduces the number of lines of code by combining the loop and the switch statement. -- Jibal 08:47, 6 April 2007 (UTC)

[edit] Tom himself

Here is a fresh comment from Tom himself why he wrote it on the Lambda weblog (nexus of the sect of functional programming :-):

What was I reading to write that?
I knew enough to write that bit of code because I had read the source code for Dennis Ritchie's 
PDP-11 C Compiler, specifically the version shipped with Fifth Edition UNIX. It's a remarkable 
program -- it compiles a fairly sophisticated language, producing decent code, all in a 64K byte 
footprint.

Most of the syntax analysis is done by recursive descent, except for expressions, which are parsed 
by using precedence tables with what we called the Railway Siding algorithm when I was in school. 
Code generation is done by a table-driven tree-pattern matcher.

The whole compiler runs in two passes (uhh, actually seven if you count one for the preprocessor, 
one for the optional freestanding peep-hole optimizer and three for the assembler), but the 
organization is essentially single pass -- it was split in two for space reasons. The first pass 
does syntax analysis & typechecking and generates code for everything but expressions. (And maybe 
switch statements? It's been 30 years since I read the code.) Its output is an assembly-language 
code/data stream with embedded expression trees to be processed in the second pass.

If the machine it compiled for were still of contemporary interest, I'd recommend it to anyone 
looking for a good introduction to the practicalities of software design and implementation.

Presumably to keep the compiler's size managable, case labels are treated syntactically as 
ordinary goto labels with a number attached. There was a stack of switch statements, and whenever 
a case was encountered, a label went in the output and was recorded in the stack (with an error 
if the stack was empty, i.e. if no switch was active.) Since case labels weren't syntactically 
bound to switch statements, they could occur anywhere in the switch's body, including, for example, 
in the middle of a nested do-while statement.

By Tom Duff at Wed, 11/23/2005 - 17:54

Here is Tom's comment

--Marc van Woerkom 15:12, 2 December 2005 (UTC)


[edit] Missing information

There is no explanation in the text on why the pointer increment is missing from "to". —The preceding unsigned comment was added by Skypher (talkcontribs) 15:54, 28 January 2007 (UTC).

The canonical example of this is copying to memory-mapped I/O. —The preceding unsigned comment was added by 71.245.136.42 (talk) 06:18, 9 February 2007 (UTC).
There's nothing "canonical" about Duff's original example; note that the Duff's Device given in the article isn't even the same code as in his original email. The non-incrementing of to in his case is irrelevant to the working of the device and deserves at most a footnote, not its current confusing treatment as if it were a major issue (the fact that there's a library routine to copy isn't relevant, because there are many other ways to use the device that won't be found in a library routine, e.g. *to++ = func(*from++)).
Yes, there is:
Note that to is not incremented because Duff was copying to a single memory-mapped output register.
Kurosa 01:39, 22 February 2007 (UTC)

[edit] Missing Book

K&R's The C Programming language should probably be added to the list of books, since Stroustrup's is there. SavantEdge 23:43, 26 April 2007 (UTC)

I agree. I added it. Jeff Carr (talk) 05:57, 19 November 2007 (UTC)

[edit] One usage of the device is not listed

Apart from being used in the context of writing to a device. The device is also used to test C compilers and C++ compilers. Essentially, if the compiler freaks out on the code it isn't a true C or C++ compiler.

Also, as others said, there are other uses of the device other than writing to device. The *to++ = func(*from++) example is as such a more general use of the device.

I also think one should be explicite about telling that the switch is done outside of the loop. It seems that many people misses this detail - if the switch is done inside the loop you make the test each time and gain nothing and also it is not duff's device any more.

Alf —Preceding unsigned comment added by 217.144.227.22 (talk) 11:30, 9 January 2008 (UTC)

[edit] Irreducible flow graph?

Doesn't Duff's Device give you an irreducible flow graph, so that an optimizing compiler would most likely split it, giving one the same code as one would get handling the count % 8 bytes outside the loop? —Preceding unsigned comment added by 209.234.91.82 (talk) 15:59, 31 January 2008 (UTC)

[edit] Modern compilers

Shouldn't modern compilers already have some kind of loop unrolling optimization like this already?

I don't see any advantage of using this device, other than for historic reason.--Huntercool (talk) 10:37, 6 June 2008 (UTC)

Duff's Device targets a very special problem (serial copy to a register) that modern compilers are unlikely to have an optimization for. That said, the device itself is probably not useful for its original purpose, but the technique is still applicable. For example, it is possible to implement protothreads in pure C with the same nested switch/case construct. Rrelf (talk) 11:28, 11 June 2008 (UTC)

[edit] Removing content

There currently are two source-code variants of the same concept in the article. The similarity between the two is very high, so i'm thinking about merging them to leave only one snippet. However, before i do such a thing, i'd like some feedback about it.

Also, there's this: "Its behavior is undefined if it is entered with count having the value zero.", with which i disagree. The behaviour is quite predictable: it will run a loop of roughly MAXINT iterations, copying the memory contents starting with *from... -- Jokes Free4Me (talk) 10:58, 11 June 2008 (UTC)

As far as I'm concerned, the second variant (the one that starts with "duff.c" in a comment) is entirely redundant and can be safely removed. Rrelf (talk) 11:25, 11 June 2008 (UTC)