x87

x87 is the floating-point subset of the x86 architecture instruction set. It originated as an extension of the 8086 instruction set in the form of optional floating-point coprocessors that worked in tandem with the corresponding x86 CPUs. These chips had names ending in "87", and the unit was also known as the NPX (Numeric Processor eXtension). Like other extensions to the basic instruction set, x87 instructions are not strictly needed to construct working programs, but they provide hardware and microcode implementations of common numerical tasks, allowing these tasks to be performed much faster than equivalent software routines. The x87 instruction set includes instructions for basic floating-point operations such as addition, subtraction and comparison, as well as for more complex numerical operations, such as computing the tangent function and its inverse.

Most x86 processors since the Intel 80486 have implemented these x87 instructions in the main CPU, but the term is sometimes still used to refer to that part of the instruction set. Before x87 instructions were standard in PCs, compilers or programmers had to use rather slow library calls to perform floating-point operations, a method that is still common in low-cost embedded systems.

Description

The x87 registers form an eight-level deep, non-strict stack structure ranging from ST(0) to ST(7). The registers can be pushed and popped, and either operand can also access them directly using an offset relative to the top of the stack. (This scheme may be compared to how a stack frame may be both pushed/popped and indexed.)

There are instructions to push, calculate, and pop values on top of this stack; monadic operations (FSQRT, FPTAN, etc.) implicitly address the topmost register, ST(0), while dyadic operations (FADD, FMUL, FCOM, etc.) implicitly address ST(0) and ST(1). The non-strict stack model also allows dyadic operations to use ST(0) together with a direct memory operand or with an explicitly specified stack register, ST(x), in a role similar to a traditional accumulator (a combined destination and left operand). The roles can also be reversed on an instruction-by-instruction basis, with ST(0) as the unmodified operand and ST(x) as the destination. Furthermore, the contents of ST(0) can be exchanged with another stack register using the FXCH ST(x) instruction.
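The addressing rules described above can be modelled in a few lines. The following is a minimal Python sketch, not a bit-accurate or cycle-accurate emulator. The class name `X87Stack` and its method names are hypothetical, chosen only to mirror the mnemonics. Values are ordinary Python floats rather than 80-bit registers, and empty registers are marked with `None` where the real FPU uses a tag word.

```python
class X87Stack:
    """Toy model of the eight-register x87 stack (illustrative only)."""

    def __init__(self):
        self.regs = [None] * 8  # None marks an empty register
        self.top = 0            # index of the physical register acting as ST(0)

    def _phys(self, i):
        """Map the stack-relative index ST(i) to a physical register index."""
        if not 0 <= i <= 7:
            raise IndexError("ST(i) requires 0 <= i <= 7")
        return (self.top + i) % 8

    def st(self, i=0):
        """Read ST(i); reading an empty register is treated as underflow."""
        value = self.regs[self._phys(i)]
        if value is None:
            raise RuntimeError(f"stack underflow: ST({i}) is empty")
        return value

    def fld(self, value):
        """Push a value: the top moves down by one and becomes the new ST(0)."""
        new_top = (self.top - 1) % 8
        if self.regs[new_top] is not None:
            raise RuntimeError("stack overflow: all eight registers are in use")
        self.top = new_top
        self.regs[new_top] = float(value)

    def fstp(self):
        """Pop ST(0) and return it."""
        value = self.st(0)
        self.regs[self.top] = None
        self.top = (self.top + 1) % 8
        return value

    def fadd(self, i=1, to_sti=False):
        """ST(0) += ST(i), or ST(i) += ST(0) when to_sti is True."""
        result = self.st(0) + self.st(i)
        self.regs[self._phys(i if to_sti else 0)] = result

    def faddp(self):
        """ST(1) = ST(1) + ST(0), then pop."""
        self.fadd(1, to_sti=True)
        self.fstp()

    def fmulp(self):
        """ST(1) = ST(1) * ST(0), then pop."""
        self.regs[self._phys(1)] = self.st(1) * self.st(0)
        self.fstp()

    def fxch(self, i=1):
        """Exchange the contents of ST(0) and ST(i)."""
        self.st(0), self.st(i)  # both registers must hold a value
        a, b = self._phys(0), self._phys(i)
        self.regs[a], self.regs[b] = self.regs[b], self.regs[a]


# Evaluate (1.5 + 2.5) * 2.0 in stack style.
fpu = X87Stack()
fpu.fld(1.5)
fpu.fld(2.5)
fpu.faddp()       # ST(0) = 4.0
fpu.fld(2.0)
fpu.fmulp()       # ST(0) = 8.0
result = fpu.st(0)

# Push another value, then swap it with the result below it.
fpu.fld(1.0)
fpu.fxch(1)       # ST(0) = 8.0, ST(1) = 1.0
print(result, fpu.st(0), fpu.st(1))
```

In this model, `fadd(i)` with `to_sti=False` plays the accumulator role described above, while `to_sti=True` corresponds to the reversed form in which ST(x) is the destination.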

These properties make the x87 stack usable as seven freely addressable registers plus a dedicated accumulator (or as seven independent accumulators). This is especially applicable on superscalar x86 processors (such as the Pentium of 1993 and later), where the exchange instructions (opcodes D9C8h to D9CFh) are optimized down to a zero clock penalty by using one of the integer paths for FXCH ST(x) in parallel with the FPU instruction. Although the stack model is natural and convenient for human assembly language programmers, some compiler writers have found it complicated to construct automatic code generators that schedule x87 code effectively. Such a stack-based interface can potentially minimize the need to save scratch variables in function calls compared with a register-based interface^[1] (although, historically, design issues in the original implementation limited that potential^[2]^[3]).

The x87 provides single-precision, double-precision and 80-bit double-extended-precision binary floating-point arithmetic as per the IEEE 754-1985 standard. By default, the x87 processors all use 80-bit double-extended precision internally, to allow for sustained precision over many calculations (see the IEEE 754 design rationale). A given sequence of arithmetic operations may thus behave slightly differently compared to a strict single-precision or double-precision IEEE 754 FPU.^[4] This can be problematic for some semi-numerical calculations that assume double precision for correct operation. To avoid such problems, the x87 can be configured, via the precision-control field of its control register, to automatically round to single or double precision after each operation. Since the introduction of SSE2, the x87 instructions are not as essential as they once were, but they remain important as a high-precision scalar unit for numerical calculations that are sensitive to round-off error and require the 64-bit mantissa precision and extended range available in the 80-bit format.
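The effect of the wider significand can be illustrated without x87 hardware by rounding exact rational results to a chosen number of significand bits. The sketch below is an illustration under stated assumptions, not a model of the FPU itself. The helper `round_to_precision` is hypothetical, it ignores the exponent range (no overflow, underflow or subnormals), and it always rounds to nearest with ties to even. It evaluates (1 + 2^-60) - 1 with a 53-bit significand (double precision) and with a 64-bit significand (double-extended precision).

```python
from fractions import Fraction


def round_to_precision(x, p):
    """Round the exact value x to a p-bit significand, ties to even.

    Hypothetical helper: the exponent range is treated as unbounded.
    """
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    sign = -1 if x < 0 else 1
    m = abs(x)
    # Find e such that 2**e <= m < 2**(e + 1).
    e = m.numerator.bit_length() - m.denominator.bit_length()
    if Fraction(2) ** e > m:
        e -= 1
    # Scale so that the significand becomes an integer in [2**(p-1), 2**p).
    scale = Fraction(2) ** (p - 1 - e)
    scaled = m * scale
    q, r = divmod(scaled.numerator, scaled.denominator)
    if 2 * r > scaled.denominator or (2 * r == scaled.denominator and q % 2 == 1):
        q += 1
    return sign * Fraction(q) / scale


def add_then_subtract_one(tiny, p):
    """Compute (1 + tiny) - 1, rounding after each operation to p bits."""
    total = round_to_precision(1 + tiny, p)
    return round_to_precision(total - 1, p)


tiny = Fraction(1, 2**60)
double_result = add_then_subtract_one(tiny, 53)    # tiny is lost
extended_result = add_then_subtract_one(tiny, 64)  # tiny survives
print(double_result, extended_result == tiny)
```

With 53 bits the intermediate sum rounds back to exactly 1, so the final result is 0. With 64 bits the sum is represented exactly and the small term is recovered. This is the kind of difference in behaviour that the precision-control setting is meant to remove.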

Performance

Clock cycle counts for examples of typical x87 FPU instructions (only register-register versions shown here).^[5]

The A~B notation (minimum to maximum) covers timing variations dependent on transient pipeline status as well as the arithmetic precision chosen (32, 64 or 80 bits); it also includes variations due to numerical cases (such as the number of set bits, zero, etc.). The L→H notation depicts values corresponding to the lowest (L) and the highest (H) maximum clock frequencies that were available.

x87 implementation	FADD	FMUL	FDIV	FXCH	FCOM	FSQRT	FPTAN	FPATAN	Max clock	Peak FMUL/sec	FMUL speed relative to 5 MHz 8087^§
8087	70~100	90~145	193~203	10~15	40~50	180~186	30~540	250~800	5→10 MHz	34~55K → 100~111K	1.0 → 2.0 times as fast
80287 (original)	70~100	90~145	193~203	10~15	40~50	180~186	30~540	250~800	6→12 MHz	41~66K → 83~133K	1.2 → 2.4 times as fast
80387 (and later 287 models)	23~34	29~57	88~91	18	24	122~129	191~497	314~487	16→33 MHz	280~552K → 579~1100K	approx 10 → 20 × as fast
80486 (or 80487)	8~20	16	73	4	4	83~87	200~273	218~303	16→50 MHz	1.0M → 3.1M	approx 18 → 56 × as fast
Cyrix 6x86, Cyrix MII	4~7	4~6	24~34	2	4	59~60	117~129	97~161	66→300 MHz	11~16M → 50~75M	approx 320 → 1400 ×
AMD K6 (including K6 II/III)	2	2	21~41	2	3	21~41	todo	todo	166→550 MHz	83M → 275M	approx 1500 → 5000 ×
Pentium / Pentium MMX	1~3	1~3	39	1 (0*)	1~4	70	17~173	19~134	60→300 MHz	20~60M → 100~300M	approx 1100 → 5400 ×
Pentium Pro	1~3	2~5	16~56	1 (0*)	1	28~68	todo	todo	150→200 MHz	30~75M → 40~100M	approx 1400 → 1800 ×
Pentium II / III	1~3	2~5	17~38	1 (0*)	1	27~50	todo	todo	233→1400 MHz	47~116M → 280~700M	approx 2100 → 13000 ×
Athlon (K7)	1~4	1~4	13~24	1 (0*)	1~2	16~35	todo	todo	500→2330 MHz	125~500M → 0.580~2.33G	approx 9000 → 42000 ×
Pentium 4	1~5	2~7	20~43	Multiple cycles	1	20~43	todo	todo	1.3→3.8 GHz	186~650M → 0.543~1.90G	approx 11000 → 34000 ×
Athlon 64 (K8)	1~4	1~4	13~24	1 (0*)	1~2	16~35	todo	todo	1.0→3.2 GHz	250~1000M → 0.800~3.2G	approx 18000 → 58000 ×

* An effective zero clock delay is often possible, via superscalar execution.

^§ The 5 MHz 8087 was the original x87 processor. Compared to typical software-implemented floating-point routines on an 8086 (without an 8087), the factors would be even larger, perhaps by another factor of 10 (a correct floating-point addition in assembly language may well consume over 1000 cycles).
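The "Peak FMUL/sec" column appears to follow from dividing the clock frequency by the FMUL cycle count. The short Python sketch below assumes that relationship, which is inferred from the figures rather than stated with the table. The function name `peak_ops_per_second` is hypothetical.

```python
def peak_ops_per_second(clock_hz, min_cycles, max_cycles):
    """Return (slowest, fastest) operations per second for one instruction.

    Assumes one instruction completes every `cycles` clock cycles, with
    no overlap between consecutive instructions.
    """
    return clock_hz / max_cycles, clock_hz / min_cycles


# 8087 at 5 MHz, FMUL taking 90~145 cycles.
slow_8087, fast_8087 = peak_ops_per_second(5_000_000, 90, 145)

# 80387 at 16 MHz, FMUL taking 29~57 cycles.
slow_387, fast_387 = peak_ops_per_second(16_000_000, 29, 57)

print(f"8087  @  5 MHz: {slow_8087 / 1e3:.1f}K to {fast_8087 / 1e3:.1f}K FMUL/s")
print(f"80387 @ 16 MHz: {slow_387 / 1e3:.1f}K to {fast_387 / 1e3:.1f}K FMUL/s")
print(f"Speed-up of the 80387 over the 8087 (fastest case): {fast_387 / fast_8087:.1f}x")
```

Under this assumption the 5 MHz 8087 comes out at about 34.5K to 55.6K multiplications per second, and the 16 MHz 80387 at about 280.7K to 551.7K. These agree with the 34~55K and 280~552K entries in the table.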

Manufacturers

Companies that have designed or manufactured^[a] floating-point units compatible with the Intel 8087 or later models include AMD (287, 387, 486DX, 5x86, K5, K6, K7, K8), Chips and Technologies (the Super MATH coprocessors), Cyrix (the FasMath, Cx87SLC, Cx87DLC, etc., 6x86, Cyrix MII), Fujitsu (early Pentium Mobile etc.), Harris Semiconductor (manufactured 80387 and 486DX processors), IBM (various 387 and 486 designs), IDT (the WinChip, C3, C7, Nano, etc.), IIT (the 2C87, 3C87, etc.), LC Technology (the Green MATH coprocessors), National Semiconductor (the Geode GX1, Geode GXm, etc.), NexGen (the Nx587), Rise Technology (the mP6), ST Microelectronics (manufactured 486DX, 5x86, etc.), Texas Instruments (manufactured 486DX processors etc.), Transmeta (the TM5600 and TM5800), ULSI (the Math·Co coprocessors), VIA (the C3, C7, and Nano, etc.), and Xtend (the 83S87SX-25 and other coprocessors).

Architectural generations

8087

The 8087 was the first math coprocessor for 16-bit processors designed by Intel. It was built to be paired with the Intel 8088 or 8086 microprocessors. Intel's 8231 floating-point processor was an earlier design, but it was a licensed version of AMD's Am9511 of 1977.^[6] The Am9511 was primarily intended for the Intel 8080 but, with some glue logic, could be used with almost any microprocessor system that had a spare interrupt input or interrupt vector available. The family included the 32-bit Am9511 and Am9511A (or Intel 8231/8231A) and the later 64-bit Am9512 (or Intel 8232).

80187

The 80187 (80C187)^[7] is the math coprocessor for the Intel 80186 CPU. It is incapable of operating with the 80188, as the 80188 has an 8-bit data bus; the 80188 can only use the 8087. The 80187 did not appear at the same time as the 80186 and 80188, but was in fact launched after the 80287 and the 80387. Although its interface to the main processor is the same as that of the 8087, its core is that of the 80387, so it is fully IEEE 754 compliant and capable of executing all of the 80387's extra instructions.^[8]

80287

Figures: 6 MHz version of the Intel 80287; Intel 80287 die shot; Intel 80287XL; Intel 80287XLT.

The 80287 (i287) is the math coprocessor for the Intel 80286 series of microprocessors. Intel's models included variants with specified upper frequency limits ranging from 6 to 12 MHz. These were later followed by the i80287XL, which uses the 387 microarchitecture, and the i80287XLT, a special version intended for laptops, as well as other variants.

The 80287XL is actually an 80387SX with a 287 pinout. It contains an internal 3/2 multiplier, so that motherboards which ran the coprocessor at 2/3 of the CPU speed could instead run the FPU at the same speed as the CPU. Other 287 models with 387-like performance are the Intel 80C287, built using CHMOS III, and the AMD 80EC287, manufactured in AMD's CMOS process using only fully static gates.

The 80287 and 80287XL work with the 80386 microprocessor, and were the only coprocessors available for the 80386 until the introduction of the 80387 in 1987. They can also work with the Cyrix Cx486SLC. For both of these chips, however, the 80387 is strongly preferred for its higher performance and the greater capability of its instruction set.

80387

The 80387 (387 or i387) is the first Intel coprocessor to be fully compliant with the IEEE 754-1985 standard. Released in 1987, a full two years after the 386 chip, the i387 is much faster than Intel's previous 8087/80287 coprocessors and has improved trigonometric functions. The FPTAN and FPATAN instructions of the 8087 and 80287 are limited to an argument in the range ±π/4 (±45°), and the 8087 and 80287 have no direct instructions for the sin and cos functions.^[9]
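A restricted argument range means that software has to reduce an arbitrary angle before calling the hardware primitive. The Python sketch below illustrates one possible reduction using standard trigonometric identities. It is not taken from any Intel or compiler runtime library. The function `tan_limited` is a hypothetical stand-in for a tangent primitive, implemented here with `math.tan`, that accepts only arguments in [0, π/4], an interval that lies within the range stated above.

```python
import math

QUARTER_PI = math.pi / 4
HALF_PI = math.pi / 2


def tan_limited(x):
    """Stand-in for a tangent primitive that accepts only 0 <= x <= pi/4."""
    # A tiny tolerance absorbs rounding at the interval boundaries.
    if not -1e-12 <= x <= QUARTER_PI + 1e-12:
        raise ValueError("argument outside [0, pi/4]")
    return math.tan(x)


def tan_any(x):
    """Tangent of any finite angle using only the restricted primitive."""
    # tan has period pi, so first reduce the angle to [0, pi].
    r = math.fmod(x, math.pi)
    if r < 0.0:
        r += math.pi
    if r <= QUARTER_PI:
        return tan_limited(r)
    if r < HALF_PI:
        # tan(r) = 1 / tan(pi/2 - r)
        return 1.0 / tan_limited(HALF_PI - r)
    if r <= 3 * QUARTER_PI:
        # tan(r) = -1 / tan(r - pi/2); exactly at the pole this raises
        # ZeroDivisionError, which the sketch lets propagate.
        return -1.0 / tan_limited(r - HALF_PI)
    # tan(r) = -tan(pi - r)
    return -tan_limited(math.pi - r)


for angle in (0.3, 1.0, 2.0, 3.0, -1.0, 10.0):
    print(angle, tan_any(angle), math.tan(angle))
```

This simple reduction loses accuracy for very large arguments, because the floating-point value of π is itself rounded. Production libraries use more careful reduction schemes.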

Figure: Intel 80387 CPU die image.

Without a coprocessor, the 386 normally performs floating-point arithmetic through (slow) software routines, invoked at runtime through a software exception handler. When a math coprocessor is paired with the 386, the coprocessor performs the floating-point arithmetic in hardware, returning results much faster than an emulating software library call.

The i387 is compatible only with the standard i386 chip, which has a 32-bit processor bus. The later cost-reduced i386SX, which has a narrower 16-bit data bus, cannot interface with the i387's 32-bit bus. The i386SX requires its own coprocessor, the 80387SX, which is compatible with the SX's narrower 16-bit data bus.

Figures: 16 MHz version of the Intel 80187; i387; i387SX; i387DX; i387 microarchitecture with 16-bit barrel shifter and CORDIC unit; i386DX with i387DX; i487SX.

80487

The i487SX was marketed as a floating-point coprocessor for Intel i486SX machines. It actually contains a full i486DX implementation. When installed in an i486SX system, the i487 disables the main CPU and takes over all CPU operations.

80587

The Nx587 was the last FPU for x86 to be manufactured separately from the CPU, in this case NexGen's Nx586.

References

[1] William Kahan (2 November 1990). "On the advantages of 8087's stack". Unpublished course notes, Computer Science Division, University of California at Berkeley (PDF).
[2] William Kahan (8 July 1989). "How Intel 8087 stack overflow/underflow should have been handled" (PDF).
[3] Jack Woehr (1 November 1997). "A conversation with William Kahan".
[4] David Monniaux. "The pitfalls of verifying floating-point computations". To appear in ACM TOPLAS.
[5] Numbers are taken from respective processors' data sheets, programming manuals, and optimization manuals.
[6] http://www.cpushack.com/2010/09/23/arithmetic-processors-then-and-now/
[7] CPU Collection - Model 80187
[8] http://www.datasheetcatalog.org/datasheet/Intel/mXryvuw.pdf
[9] Borland Turbo Assembler documentation

Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 1: Basic Architecture (PDF). Intel.

Notes

[a] Fabless companies design a chip and rely on a fabbed company to manufacture it, while fabbed companies can do both the design and the manufacture by themselves.

External links

Everything you always wanted to know about math coprocessors


This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.
