Replay system
From Wikipedia, the free encyclopedia
The Replay system is a little known subsystem within the Intel Pentium 4 processor. Its primary function is to catch operations that have been mistakenly sent for execution by the processor's scheduler. Operations caught by the replay system are then re-executed in a loop until the conditions necessary for their proper execution have been fulfilled.
Contents |
[edit] Origins
The replay system came about as a result of Intel's now defunct quest for ever increasing clock speeds. These higher clock speeds necessitated very lengthy pipelines (up to 31 stages in the Prescott core). Because of this, there are 6 stages between the scheduler and the execution units in the Prescott core. In an attempt to maintain acceptable performance, Intel engineers were forced to design the scheduler to be very optimistic.
[edit] Operation
The scheduler in a Pentium 4 processor is so aggressive that it will send operations for execution without a guarantee that they can be successfully executed. (Among other things, the scheduler assumes all data is in level 1 cache.) The most common reason execution fails is that the requisite data is not available, which itself is most likely due to a cache miss. When this happens, the replay system springs into action. The replay system signals the scheduler to stop, and then repeatedly executes the failed string of dependent operations until they have completed successfully.
[edit] Performance considerations
Ironically, in certain cases the replay system can have a very bad impact on performance. Under normal circumstances, the execution units in the Pentium 4 are in use roughly 33% of the time. When the replay system is invoked, it will occupy execution units nearly every available cycle. This wastes power, which is an increasingly important architectural design metric, but poses no performance penalty because the execution units would be sitting idle anyway. However, if hyper-threading is in use, the replay system will prevent the other thread from utilizing the execution units. (In Prescott, the Pentium 4 gained a replay queue, which reduce the time it will occupy the execution units, but still...) This is the true cause behind any performance degradation concerning hyper-threading. On the other hand, if each thread is processing different types of operations, the replay system will not interfere, and a performance increase can appear. This explains why performance with hyper-threading is application dependent.