FLOPS

From Wikipedia, the free encyclopedia

In computing, FLOPS (or flops) is an acronym for FLoating point Operations Per Second. It is used as a measure of a computer's performance, especially in fields of scientific computation that make heavy use of floating-point calculations, and is similar in purpose to instructions per second. One should speak in the singular of a FLOPS and not of a FLOP, although the latter is frequently encountered: the final S stands for "second" and does not indicate a plural. Alternatively, the singular FLOP (or flop) is used as an abbreviation for "FLoating-point OPeration", and a flop count is a count of these operations (e.g., the number required by a given algorithm or computer program). In this context, "flops" is simply a plural rather than a rate.

Computing devices span an enormous range of performance levels in floating-point applications, so it makes sense to introduce larger units than FLOPS. The standard SI prefixes can be used for this purpose, resulting in such units as megaFLOPS (MFLOPS = 1,000,000 FLOPS), gigaFLOPS (GFLOPS = 1,000 MFLOPS), teraFLOPS (TFLOPS = 1,000 GFLOPS), petaFLOPS (PFLOPS = 1,000 TFLOPS) and exaFLOPS (EFLOPS = 1,000 PFLOPS). As of 2006, the fastest supercomputer's performance tops out at about one petaFLOPS.

A basic calculator, by contrast, performs relatively few FLOPS. Each calculation request to a typical calculator requires only a single operation, so there is rarely any need for its response time to exceed what the operator needs. Any response time below 0.1 second is experienced as instantaneous by a human operator, so a simple calculator could be said to operate at about 10 FLOPS.

Humans are even worse than calculators at floating-point operations. If it takes a person 15 minutes (900 seconds) to carry out a long-division problem to 10 significant digits, that person is computing at roughly 1/900 ≈ 0.001 FLOPS, i.e., in the milliFLOPS range. Bear in mind, however, that a purely mathematical test will not truly measure a human's FLOPS, as a human is simultaneously processing thoughts, consciousness, smells, sounds, touch, sight and motor coordination.
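As a rough illustration, the two estimates above amount to dividing an operation count by the elapsed time and attaching an SI prefix. The following Python sketch (the helper name and prefix table are my own, for illustration only) reproduces the calculator and human figures:

    # Turn an operation count and elapsed time into a FLOPS rate
    # with an SI prefix, using the two estimates from the text above.
    def flops(operations, seconds):
        """Return a human-readable FLOPS rate for `operations` done in `seconds`."""
        rate = operations / seconds
        for prefix, scale in [("exa", 1e18), ("peta", 1e15), ("tera", 1e12),
                              ("giga", 1e9), ("mega", 1e6), ("kilo", 1e3),
                              ("", 1.0), ("milli", 1e-3)]:
            if rate >= scale:
                return f"{rate / scale:.2f} {prefix}FLOPS"
        return f"{rate:.2e} FLOPS"

    print(flops(1, 0.1))       # calculator: one operation per 0.1 s -> 10.00 FLOPS
    print(flops(1, 15 * 60))   # human long division: 1 op in 15 min -> 1.11 milliFLOPS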


FLOPS as a measure of performance

In order for FLOPS to be useful as a measure of floating-point performance, a standard benchmark must be available on all computers of interest. One example is the LINPACK benchmark.

FLOPS in isolation are arguably not very useful as a benchmark for modern computers. There are many factors in computer performance other than raw floating-point computation speed, such as I/O performance, interprocessor communication, cache coherence, and the memory hierarchy. This means that supercomputers are in general capable of only a small fraction of their "theoretical peak" FLOPS throughput (obtained by adding together the theoretical peak FLOPS of every element of the system). Even when operating on large, highly parallel problems, their performance will be bursty, mostly due to the residual effects of Amdahl's law. Real benchmarks therefore measure both peak actual FLOPS performance and sustained FLOPS performance.
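The gap between the two figures can be made concrete with a small experiment. The Python sketch below (not the LINPACK benchmark; the machine parameters are hypothetical placeholders) computes a theoretical peak from core count, clock rate and FLOPs per cycle, then measures the rate actually sustained by one dense matrix multiply:

    # A minimal sketch contrasting theoretical peak FLOPS with the
    # rate actually sustained on one workload.
    import time
    import numpy as np

    cores = 2                  # hypothetical: number of cores
    clock_hz = 2.0e9           # hypothetical: clock frequency in Hz
    flops_per_cycle = 4        # hypothetical: e.g., a 2-wide SIMD multiply-add
    peak = cores * clock_hz * flops_per_cycle
    print(f"theoretical peak: {peak / 1e9:.1f} GFLOPS")

    # An n x n matrix product performs roughly 2*n**3 floating-point operations.
    n = 1024
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    start = time.perf_counter()
    c = a @ b
    elapsed = time.perf_counter() - start
    print(f"sustained: {2 * n**3 / elapsed / 1e9:.1f} GFLOPS")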

For ordinary (non-scientific) applications, integer operations (measured in MIPS) are far more common. Measuring floating-point operation speed therefore does not accurately predict how a processor will perform on arbitrary workloads. However, for many scientific jobs, such as data analysis, a FLOPS rating is effective.

Historically, the earliest reliably documented serious use of the floating-point operation as a metric appears to be the AEC's justification to Congress for purchasing a Control Data CDC 6600 in the mid-1960s.

The terminology is currently so confusing that, until April 24, 2006, U.S. export control was based upon measurement of "Composite Theoretical Performance" (CTP) in millions of "Theoretical Operations Per Second", or MTOPS. On that date, however, the U.S. Department of Commerce's Bureau of Industry and Security amended the Export Administration Regulations to base controls on Adjusted Peak Performance (APP) in Weighted TeraFLOPS (WT).

FLOPS, GPUs, and game consoles


Very high FLOPS figures are often quoted for inexpensive computer video cards and game consoles.

For example, the Xbox 360 has been announced as having a total floating-point performance of around 1 TFLOPS, while the PlayStation 3 has been announced as having a theoretical 2.18 TFLOPS. By comparison, a common AMD Athlon 64 or Intel Pentium 4 general-purpose PC would have a FLOPS rating of around 10 GFLOPS if the performance of its CPU alone were considered. Taken at face value, the 1 TFLOPS and 2.18 TFLOPS ratings sometimes quoted for the consoles would appear to class them as supercomputers.

These FLOPS figures should be treated with caution, as they are often the product of marketing. The game console figures are typically based on total system performance (CPU + GPU). In the extreme case, the TFLOPS figure is primarily derived from the function of the single-purpose texture filtering unit of the GPU. This piece of logic is tasked with computing a weighted average of sometimes hundreds of pixels in a texture during a look-up (particularly when performing a quadrilinear anisotropically filtered fetch from a 3D texture). However, single-purpose hardware can never be included in an honest FLOPS figure.
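To see why counting filtering operations inflates the totals, consider even the simplest case, a bilinear fetch. The Python sketch below (weights and texel values are arbitrary placeholders) blends four texels per colour channel, already roughly a dozen arithmetic operations; trilinear and anisotropic fetches average many more taps:

    # An illustrative sketch of why texture filtering inflates FLOPS figures:
    # a single bilinear texture fetch already blends four texels per channel.
    def bilinear(t00, t10, t01, t11, fx, fy):
        """Weighted average of four texels: ~6 muls, 3 adds, 3 subs per channel."""
        top = t00 * (1 - fx) + t10 * fx        # blend along x, upper row
        bottom = t01 * (1 - fx) + t11 * fx     # blend along x, lower row
        return top * (1 - fy) + bottom * fy    # blend the two rows along y

    # Trilinear filtering blends two such results, and the "quadrilinear
    # anisotropic" case in the text averages many more taps, so counting
    # these fixed-function operations quickly yields huge FLOPS totals.
    print(bilinear(0.0, 1.0, 0.0, 1.0, fx=0.25, fy=0.5))  # -> 0.25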

Still, the programmable pixel pipelines of modern GPUs are capable of a theoretical peak performance that is an order of magnitude higher than that of a CPU. An NVIDIA 7800 GTX 512 is capable of around 200 GFLOPS, and the current (11/06) NVIDIA 8800 GTX is capable of sustaining 330 GFLOPS. ATI's latest X1900 architecture (2/06) has a claimed performance of 554 GFLOPS [1]. This is possible because 3D graphics is a classic example of a highly parallelizable problem that can easily be split between different execution units and pipelines, allowing a large speed gain to be obtained by scaling up the number of logic gates while exploiting the fact that the cost-efficiency sweet spot of (number of transistors) × frequency currently lies at around 500 MHz. This has to do with the imperfection rate in the manufacturing process, which rises exponentially with frequency. The NVIDIA Quad SLI configuration with two dual-GPU GeForce 7950 GX2 boards (four GeForce 7950 GPUs) is claimed to provide up to 6 TFLOPS of computing power.

While a CPU devotes its transistors to running a single thread of execution very quickly at high frequency, a GPU packs a great many more transistors running at a lower speed, because it is designed to process a large number of pixels simultaneously with no requirement that any individual pixel be completed quickly. Moreover, GPUs are not designed to perform branch operations (IF statements, which determine what will be executed based on the value of a piece of data) well. The circuits for this, in particular the circuits for predicting how a program will branch in order to ready data for it, consume an inordinate number of transistors on a CPU that could otherwise be used for floating-point units. Lastly, CPUs access data more unpredictably, which requires them to include a large amount of on-chip memory, called a cache, for quick random access. This cache accounts for the majority of a CPU's transistors.

General-purpose computing on GPUs (GPGPU) is an emerging field that hopes to exploit the vast advantage in raw FLOPS, as well as in memory bandwidth, of modern video cards. As an example, occlusion testing in games is often done by rasterizing a piece of geometry and counting the number of pixels changed in the z-buffer, a technique that is highly suboptimal when measured in floating-point operations. A few applications can even take advantage of the texture fetch unit to compute averages over (1-, 2-, or 3-dimensional) sorted data for a further boost in performance.
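The averaging trick mentioned above is essentially a mipmap-style reduction, which the texture hardware performs as a side effect of filtering. A plain-CPU Python sketch of the same idea (the array contents are placeholders) repeatedly averages 2×2 blocks until a single value, the mean, remains:

    # A minimal CPU sketch of a mipmap-style reduction: repeatedly average
    # 2x2 blocks of a 2^k x 2^k array until one value (the mean) remains.
    import numpy as np

    def reduce_average(data):
        """Average a 2^k x 2^k array by repeated 2x2 block averaging."""
        while data.size > 1:
            data = (data[0::2, 0::2] + data[1::2, 0::2] +
                    data[0::2, 1::2] + data[1::2, 1::2]) / 4.0
        return data[0, 0]

    grid = np.arange(16.0).reshape(4, 4)
    print(reduce_average(grid), grid.mean())  # both print 7.5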

Records

In June 2006, a new computer was announced by the Japanese research institute RIKEN: the MDGRAPE-3. The computer's performance tops out at one petaFLOPS, over three times faster than the Blue Gene/L. MDGRAPE-3 is not a general-purpose computer, which is why it does not appear in the TOP500 list; it has special-purpose pipelines for simulating molecular dynamics. MDGRAPE-3 houses 4,808 custom processors, 64 servers each with 256 dual-core processors, and 37 servers each containing 74 processors, for a total of 40,314 processor cores, compared to the 131,072 needed by the Blue Gene/L. MDGRAPE-3 is able to do many more computations with fewer chips because of its specialized architecture. The computer is a joint project between RIKEN, Hitachi, Intel, and the NEC subsidiary SGI Japan.

Distributed computing uses the Internet to link personal computers to achieve a similar effect:

  • The entire BOINC network averages about 536 TFLOPS. [2]
  • SETI@home computes data at more than 270 TFLOPS. [3]
  • Folding@home has been able to sustain over 990 teraFLOPS. [4] Note that as of March 22nd, 2007, PlayStation 3 owners may participate in the FAH project; because of this, FAH is now sustaining considerably more than 210 TFLOPS (990 TFLOPS as of 3/25/07). See the current stats [5] for details.
  • Einstein@home is crunching more than 65 TFLOPS. [6]
  • As of June 2005, GIMPS is sustaining 23 TFLOPS. [7]
  • Intel Corporation has recently unveiled the experimental multi-core POLARIS chip, which achieves 1 TFLOPS at 3.2 GHz. The 80-core chip can increase this to 1.8 TFLOPS at 5.6 GHz, although the thermal dissipation at that frequency exceeds 260 watts (see the back-of-envelope check after this list).
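The two POLARIS operating points quoted above are mutually consistent: both work out to roughly four floating-point operations per core per cycle. A quick Python check, assuming all 80 cores run at the quoted clock rate:

    # Back-of-envelope check of the quoted POLARIS figures, assuming
    # all 80 cores run at the quoted clock rate.
    cores = 80
    for total_flops, clock_hz in [(1.0e12, 3.2e9), (1.8e12, 5.6e9)]:
        per_core_per_cycle = total_flops / (cores * clock_hz)
        print(f"{per_core_per_cycle:.1f} FLOPs per core per cycle")  # ~3.9, ~4.0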

Cost of computing

  • 1997: about US$30,000 per GFLOPS, with two 16-processor Beowulf clusters built from Pentium Pro machines [8]
  • 2000, May: $640 per GFLOPS, KLAT2, University of Kentucky
  • 2003, August: $82 per GFLOPS, KASY0, University of Kentucky
  • 2005: about $2.60 per GFLOPS ($300 console / 115 GFLOPS, CPU only) for the Xbox 360, should Linux be implemented on it as intended [9]
  • 2006, February: about $1 per GFLOPS in an ATI PC add-in graphics card (X1900 architecture)

This trend toward low cost follows Moore's law.

Pop culture references

  • In the Star Trek fictional universe, circa 2364, the android Data was constructed with an initial linear computational speed rated at 60 trillion operations per second, or 60 TIPS (thereby potentially 'dating' the series Star Trek: The Next Generation, in which he appears); however, he was later able to far exceed this limit by modifying his hardware and software.
  • In the movie Terminator 3, Skynet is said to have expanded over the Internet at 60 TFLOPS, a nonsensical misuse of the term.

References

  1. ATI launches new flagship graphics chip family X1900
  2. [1]
  3. BOINC stats at SETI@home
  4. [2]
  5. [3]
  6. Einstein@Home - Server Status
  7. Internet PrimeNet Server: Parallel Technology for the Great Internet Mersenne Prime Search
  8. Loki and Hyglac
  9. Linux on Xbox 360

External links