x86 assembly language

x86 assembly language is the assembly language for the x86 class of processors, which includes Intel's Pentium series and AMD's Athlon series.

History: x86 assembly language

The Intel 8086 and 8088 were the first CPUs to implement the instruction set that is now commonly referred to as x86. Both are 16-bit processors; the 8088 differs from the 8086 mainly in having an 8-bit external data bus. The x86 assembly language also applies to the many later CPUs from Intel, such as the 80186, 80188, 80286, 80386, 80486 and Pentium, and to compatible CPUs from other vendors such as AMD and Cyrix. The term x86 refers to all the CPUs that can run the same original assembly language.

x86 instruction set architecture

The x86 processor and instruction set design is CISC; however, since the latter half of the 1990s the internal architecture has moved towards a more RISC-like or VLIW-like design. Most modern x86 processors translate their instructions into one or more RISC-like "micro-ops" before executing them, which allows the substeps of complex instructions to be executed in parallel in a superscalar fashion, rather than only whole instructions in parallel as on the original Pentium. This behaviour is, however, invisible to the assembly programmer.

The modern x86 instruction set is really a series of extensions to an instruction set whose design lineage goes back to the Intel 8008 microprocessor. Nearly full binary backward compatibility exists from the Intel 8086 through to modern processors such as the Pentium 4, Intel Core, Athlon 64 and Opteron. (There are certain unusual exceptions, such as the counted shift instructions, corrections to the original PUSHA instruction, some orphaned Intel 80286 semantics, the dropped LOADALL instruction, and the Pentium 4 giving up on precise FPU operation counts.) Each successive extension has either been added directly or been accompanied by new execution modes in the processor.

The various kinds of instructions

In general, the features of the modern x86 instruction set are:

  • variable-length instructions with no alignment requirement (encoded as little endian, as is all data in the x86 architecture)
  • both general and implicit register usage. Although the registers are nominally general purpose, there are very few of them and nearly all are used implicitly by one or more instructions (for example ECX as a count and ESI/EDI as string pointers), with no facility to substitute a different register. In this sense they are not truly general purpose.
  • two-operand instructions (that is to say, the first operand is usually both an input and the output)
  • various complex addressing modes (including immediate addressing, offset addressing, and scaled index addressing; PC-relative data addressing is available only in 64-bit mode, as RIP-relative addressing)
  • special support for atomic operations (XCHG, CMPXCHG(8B), XADD, and integer instructions combined with the LOCK prefix)
  • floating point instructions (operating on a stack of registers) and integer instructions
  • condition flags produced implicitly (by most integer ALU instructions) and explicitly (by the CMP and TEST instructions)
  • SIMD instructions (instructions which perform a single operation in parallel on many operands encoded in adjacent cells of wider registers).

The stack

The x86 processor also comes with a built-in execution stack mechanism. The CALL/RET and the INT/IRET instructions use the properly set up stack to save and restore call-return points.

However, unlike many other processor families such as Power and 68K, there is no hardware support for multiple stacks and so data and control-flow information must be combined into the same stack.

In order to alleviate this limitation, instructions like ENTER/LEAVE, or other direct manipulations of the stack register (ESP) can be used for saving local data in the stack. The instruction architecture also includes PUSH/POP instructions for direct usage of the stack for integer and address quantities. This helps simplify ABI specifications with respect to "call stack" software support mechanisms as compared with some RISC architectures which must be more explicit about call stack details.

The combination of a single hardware stack and the limited number of other registers available creates one of the most significant bottlenecks in x86 code.

Execution modes

The processor supports numerous modes of operation in which some instructions are available and some are not. A 16-bit subset of instructions is available in "real mode" (available since the 8086), "16-bit protected mode" (available since the 80286), or "v86 mode" (available since the Intel 80386). In "32-bit protected mode" (available in processors starting with the Intel 80386) or in the 32-bit "compatibility mode" of 64-bit processors (available when the 64-bit extensions are enabled), 32-bit instructions (plus SIMD instructions) are available. In "long mode" (available since the AMD Opteron processor) 64-bit instructions are available.

The instruction set is based on similar ideas in each mode, but involves different ways of accessing memory and thus employs different programming strategies.

The assembly language within each mode is described in the subsections that follow.

Real mode

Real mode is mostly 16-bit, but since the 80386 it is possible to use 32-bit registers in this mode. It is also possible to enable partial 32-bit addressing in real mode through a quirk that appears under certain conditions when switching from protected mode back to real mode. Some DOS extenders use this to access more than 1 megabyte of RAM. Assembly programmers sometimes call this state unreal mode.

A memory reference specifies a 16-bit offset in a segment; the actual 20-bit address is given by SEGMENT * 16 + OFFSET, where SEGMENT is the contents of the segment register for the segment, and OFFSET is the offset within that segment. Segments are either implicit or made explicit via a segment override. By default, accesses through the general registers use the DS (data) segment, accesses through the stack registers use the SS (stack) segment, and instruction fetches through IP use the CS (code) segment. This segmented architecture allows addressing just a little over 1MB of memory; however, only 64K can be addressed within a given segment at any one time.

On IBM PC compatible machines, this also caused great confusion around the "A20" address line. Addresses from 0x100000 to 0x10FFEF can be formed from a segment and an offset, but the 8086 and 8088 have only 20 address lines, so the top bit is dropped and such addresses wrap around to the bottom of memory. Later processors have more address lines and do not wrap on their own, so PC compatibles added a gate on the A20 line to reproduce the old behaviour when required; with the gate open, the extra (almost) 64K above 1MB becomes addressable from real mode.
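
The address calculation above can be sketched in a few lines of Python. This is only an illustrative model, not processor code: the function name real_mode_address and the a20_enabled parameter are assumptions made for this example, with a20_enabled=False standing in for a machine that drops the 21st address bit.

```python
def real_mode_address(segment: int, offset: int, a20_enabled: bool = True) -> int:
    """Return the physical address for a 16-bit segment:offset pair."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    address = segment * 16 + offset      # SEGMENT * 16 + OFFSET
    if not a20_enabled:
        address &= 0xFFFFF               # keep only 20 bits, so the address wraps
    return address


print(hex(real_mode_address(0xB800, 0x0000)))                     # 0xb8000
print(hex(real_mode_address(0xFFFF, 0xFFFF)))                     # 0x10ffef
print(hex(real_mode_address(0xFFFF, 0xFFFF, a20_enabled=False)))  # 0xffef
```

The last two calls show the same segment:offset pair reaching just above 1MB in one case and wrapping to low memory in the other.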

In order to use more than 64K of memory, the segment registers must be used. This created great complications for C compiler implementors who introduced odd pointer modes such as "near", "far" and "huge" to leverage the implicit nature of segmented architecture to different degrees, with some pointers containing 16-bit offsets within implied segments and other pointers containing segment addresses and offsets within segments.

For more details on this topic, see x86 assembly programming in real mode.

Protected mode

In the 80286, 16-bit protected mode was added. It was used in early operating systems that needed memory protection. The mode provides 24-bit physical addressing, for a maximum of 16 megabytes of physical memory, and up to 1 GB of virtual address space. Some early Unix operating systems, OS/2 1.x and Windows 3.x used this mode. Today, 16-bit protected mode is still used for running legacy applications, e.g. DPMI-compatible DOS extender programs (through virtual DOS machines) or Windows 3.x applications (through the Windows on Windows subsystem) and certain classes of device drivers in OS/2 2.0 and later, all under control of a 32-bit kernel.

In protected mode, a segment register no longer contains the physical address of the beginning of a segment, but contains a "selector" that points to a system-level structure called a segment descriptor. A segment descriptor contains the physical address of the beginning of the segment, the length of the segment, and access permissions to that segment. The offset is checked against the length of the segment, with offsets referring to locations outside the segment causing an exception. Offsets referring to locations inside the segment are combined with the physical address of the beginning of the segment to get the physical address corresponding to that offset.
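
The lookup and check described above can be modelled roughly as follows. This is a simplified sketch under stated assumptions: the names SegmentDescriptor, ProtectionFault and translate are hypothetical, the descriptor table is a plain Python list, the segment length is stored as a limit (the highest valid offset), and the access permissions are reduced to a single writable flag. The only hardware detail relied on is that the table index is held in the selector above its low three bits.

```python
from dataclasses import dataclass


@dataclass
class SegmentDescriptor:
    base: int       # address of the beginning of the segment
    limit: int      # highest valid offset within the segment
    writable: bool  # simplified stand-in for the access permissions


class ProtectionFault(Exception):
    """Stands in for the exception the processor raises."""


def translate(table, selector: int, offset: int, write: bool = False) -> int:
    index = selector >> 3  # the low 3 bits hold the table indicator and privilege level
    if index >= len(table) or table[index] is None:
        raise ProtectionFault("no descriptor for selector")
    descriptor = table[index]
    if offset > descriptor.limit:
        raise ProtectionFault("offset outside segment")
    if write and not descriptor.writable:
        raise ProtectionFault("segment is not writable")
    return descriptor.base + offset


# Entry 0 is left empty; entry 1 describes a 64K read-only segment.
table = [None, SegmentDescriptor(base=0x100000, limit=0xFFFF, writable=False)]
print(hex(translate(table, 0x08, 0x0010)))  # 0x100010
```

An offset of 0x10000 with the same selector, or any write through it, raises ProtectionFault in this model.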

The instruction set in protected mode is perfectly backward compatible with the one used in real mode.

In this mode, the same techniques used to access more than 64K of memory in real mode are used; "far", or long, pointers contain segment selectors rather than segment addresses.

In the 80386, 32-bit protected mode was added. It enables full 32-bit addressing, paging, a few more registers, and some new instructions to handle the 32-bit addressing.

In 32-bit protected mode, with paging not enabled, the address in a segment descriptor is the physical address of the beginning of the segment, and the address calculated from the address of the beginning of a segment and the offset within that segment is a physical address. With paging enabled, the address in a segment descriptor is the "linear" address, in a 32-bit address space, of the beginning of the segment, and the address calculated from the address of the beginning of a segment and the offset within that segment is a linear address in that address space. Addresses in that address space are translated to physical addresses via a page table. Linear addresses are 32-bit addresses. By default, physical addresses are also 32-bit addresses; however, there exists a page extension mode called Physical Address Extension or PAE, first added in the Intel Pentium Pro, which allows an additional 4 bits of physical addressing. This mode does not change the length of segment offsets or linear addresses; those are still only 32 bits.
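
As a rough illustration of the linear-to-physical step, the sketch below assumes the classic 32-bit scheme without PAE and with 4 KB pages, in which a linear address is split into a 10-bit page directory index, a 10-bit page table index and a 12-bit offset. The function names and the use of nested dictionaries for the tables are assumptions made for the example; a missing entry (a KeyError here) stands in for a page fault.

```python
def split_linear_address(linear: int):
    """Split a 32-bit linear address into (directory index, table index, offset)."""
    directory_index = (linear >> 22) & 0x3FF  # top 10 bits
    table_index = (linear >> 12) & 0x3FF      # next 10 bits
    page_offset = linear & 0xFFF              # low 12 bits
    return directory_index, table_index, page_offset


def translate_linear(page_directory: dict, linear: int) -> int:
    """Walk a two-level table held as nested dicts and return a physical address."""
    directory_index, table_index, page_offset = split_linear_address(linear)
    page_table = page_directory[directory_index]  # KeyError models a page fault
    frame_base = page_table[table_index]          # physical address of the page frame
    return frame_base + page_offset


# Hypothetical mapping: the page at linear 0x00403000 lives at physical 0x7000.
page_directory = {1: {3: 0x7000}}
print(split_linear_address(0x00403025))                   # (1, 3, 37)
print(hex(translate_linear(page_directory, 0x00403025)))  # 0x7025
```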

In this mode, as offsets within segments are 32 bits, there was less need for explicit segmentation, and, as 48-bit segmented addresses (segment selectors plus offset within segment) were translated to 32-bit linear addresses, explicit segmentation didn't conveniently expand the address space available to a program. Therefore, C compiler vendors and operating system vendors rarely supported segmented addresses in 32-bit protected mode.

x86 processors that support protected mode boot into real mode for backward compatibility with the older 8086 class of processors. Typically, the operating system is responsible for switching to protected mode if it so wishes.

For more details on this topic, see x86 assembly programming in protected mode.

Long mode

Long mode, as implemented in the AMD64 instruction set, is a mode that enables 64-bit addressing, 64-bit extensions of most registers and some new 64-bit registers as well. It is mostly an extension of the 32-bit instruction set, but unlike the 16-bit to 32-bit transition, a number of instructions were dropped in 64-bit mode. This does not affect actual binary backward compatibility (legacy code executes in other modes that retain support for those instructions), but it changes the way assemblers and compilers for new code have to work.

To switch to long mode, the processor first has to switch from real mode to protected mode, and then to long mode. Toby Opferman has written an example driver that does this under a 32-bit operating system. The driver takes over control of the 32-bit system, enters long mode and then executes a raw 64-bit binary supplied by a user-mode application. It is then able to return to protected mode and hand control back to the operating system. The source code is available from his website, http://www.opferman.net/Files/64drv.zip.

For more details on this topic, see x86 assembly programming in long mode.

Integer registers

The x86 in real mode and 16-bit protected mode contains 6 general 16-bit registers (AX, BX, CX, DX, SI, DI), 2 special stack registers (BP and SP), one 16-bit flags register (FLAGS), and 4 segment registers (CS, SS, DS, ES). The first 4 of the general registers are split into top and bottom half 8-bit registers (AX = AH:AL, BX = BH:BL, CX = CH:CL, DX = DH:DL) which are independently usable in 8-bit instruction forms. The instruction pointer (IP) register exists, but is only used in an implicit manner (though its value can be stored on the stack and accessed without a problem).

Starting with the Intel 80386 processor, the x86 in 32-bit protected mode extended the 16-bit registers to 32 bits (EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP, EFLAGS, EIP). The older 16-bit registers are overlaid on the bottom half of the 32-bit registers and can be accessed with an operand-size override prefix. There is no "high-half" 16-bit register access; instead, Intel chose to generalize the addressing so that every register could be used for scaled index addressing, and so that EBP could be used as a general register as well as a stack register.
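
The way the 8-bit, 16-bit and 32-bit names overlay one another can be modelled with a small Python class. This is only a sketch: the class name Register32 and the attribute names e, x, h and l (chosen to echo EAX, AX, AH and AL) are assumptions made for the example.

```python
class Register32:
    """Model of a register such as EAX with its AX, AH and AL views."""

    def __init__(self, value: int = 0):
        self.e = value & 0xFFFFFFFF           # the full 32-bit register

    @property
    def x(self) -> int:                       # low 16 bits (AX)
        return self.e & 0xFFFF

    @x.setter
    def x(self, value: int) -> None:
        self.e = (self.e & 0xFFFF0000) | (value & 0xFFFF)

    @property
    def h(self) -> int:                       # bits 8..15 (AH)
        return (self.e >> 8) & 0xFF

    @h.setter
    def h(self, value: int) -> None:
        self.e = (self.e & 0xFFFF00FF) | ((value & 0xFF) << 8)

    @property
    def l(self) -> int:                       # bits 0..7 (AL)
        return self.e & 0xFF

    @l.setter
    def l(self, value: int) -> None:
        self.e = (self.e & 0xFFFFFF00) | (value & 0xFF)


eax = Register32(0x12345678)
print(hex(eax.x), hex(eax.h), hex(eax.l))  # 0x5678 0x56 0x78
eax.l = 0xFF                               # writing AL leaves the other bits alone
print(hex(eax.e))                          # 0x123456ff
```

Note that the upper 16 bits have no name of their own in this model, matching the absence of "high-half" 16-bit access.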

Starting with the AMD Opteron processor, the x86 in 64-bit long mode (as a subset of AMD64 or x86-64 mode) extended the 32-bit registers in much the same way that 32-bit protected mode had extended the 16-bit registers before it (RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP, RFLAGS, RIP). However, AMD also added 8 additional 64-bit general registers (R8, R9, ..., R15).

The addressing modes were not dramatically changed from 32-bit mode, except that addressing was extended to 64 bits, virtual addresses are now sign extended from the implemented address width (so that, as implementations grow, address space is added equally at the top and the bottom of the range), and segmentation details have been dramatically reduced.

Floating point stack

Starting with the Intel 8087 floating point coprocessor (and first integrated as a standard extension of the x86 architecture in the Intel 80486DX chip) the x86 processor includes an 8-entry 80-bit floating point stack with individually selectable entries (st(0), st(1), ..., st(7), where st(0) is always the top entry of the stack). Floating point instructions can push entries onto the stack, or pop the top entry off. As one of its two operands, a floating point instruction may select any stack entry; however, the other must be st(0). The FXCH instruction also exists as a convenience, to allow st(0) to be swapped with any other stack entry. A floating point stack entry is only valid if it has previously been pushed onto the stack. The floating point instructions read and write memory, using the integer addressing modes, as floating point values that are 32-bit, 64-bit or 80-bit, or as integer values that are 16-bit, 32-bit or 64-bit. Implicit numeric format conversions are performed as necessary.

The x86 floating point instructions can operate in one of 3 possible precision modes with respect to operand size (32-bit, 64-bit or 80-bit), as well as in various rounding modes. For compatibility with external non-x86 sources (such as data generated on a RISC processor, which will typically support only 64-bit floating point), the precision is usually set to 64-bit, and the rounding mode is usually left at round-to-nearest, the processor default. However, some C and Fortran compilers use the full 80-bit precision for maximum accuracy.

Note that the AMD64 did not add additional entries to the floating point stack, though the additional integer registers can be used for memory addressing.

SIMD registers

Starting in the mid-1990s, numerous SIMD (Single Instruction, Multiple Data) instruction extensions have been introduced. They are described here by their marketing names.

MMX

The MMX instruction set, which first appeared in the Pentium MMX, mapped 8 64-bit SIMD registers (MM0, MM1, ..., MM7) on top of the floating point stack, but did not adopt the stack-like semantics. The reason for this mapping was so that existing operating systems could still correctly save and restore the register state when multitasking, without modification. An MMX instruction can access any of the MMX registers directly. Because the registers are shared, a block of MMX instructions must be followed by the EMMS instruction before floating point instructions are used again. EMMS marks every FP stack entry as empty, so any values left in the registers are lost.

MMX instructions use the MMX registers as pairs of 32-bit integer values, or sets of 4 16-bit integer values, or sets of 8 8-bit integer values.

Note that the AMD64 architecture did not add additional MMX registers, though the additional integer registers can be used for memory addressing.

3DNow!

The 3DNow! instructions use the MMX registers as pairs of 32-bit floating point values. 3DNow! instructions must execute in blocks bracketed by the FEMMS instruction, which also clears the FP stack.

Streaming SIMD extensions

Starting with the Intel Pentium III, the SSE instruction sets use 8 new 128-bit registers called SSE registers (XMM0, XMM1, ..., XMM7). An SSE instruction can access any of these registers directly. Intel, followed by AMD, added more SIMD instruction sets that used these same registers, until AMD introduced the AMD64 long mode. AMD64 adds 8 more 128-bit registers (XMM8, XMM9, ..., XMM15), for a total of 16, and extends the instructions to be able to use any of them.

The format of these registers depends on the instructions using them. The original SSE instruction set uses them as 4 simultaneous 32-bit floating point values. SSE2 allows usage of them as 2 simultaneous 64-bit floating point values, 4 simultaneous 32-bit integer values, 8 simultaneous 16-bit integer values, or 16 simultaneous 8-bit values.

Instruction overview

As a CISC processor, the x86 offers a large number of instructions of varying capabilities.

Integer ALU instructions

x86 assembly has the standard arithmetic operations add, sub, mul and idiv (plus neg for two's-complement negation); the logical operators and, or, xor and not; arithmetic and logical bit shifts, sal/sar and shl/shr; rotates with and without carry, rcl/rcr and rol/ror; and a complement of BCD arithmetic instructions, aaa, aad, daa and others.

Floating point instructions

x86 assembly language includes instructions for a stack-based floating point unit. They include addition, subtraction, negation, multiplication, division, remainder, square roots, integer truncation, fraction truncation, and scale by power of two. The operations also include conversion instructions which can load or store a value from memory in any of the following formats: Binary coded decimal, 32-bit integer, 64-bit integer, 32-bit floating point, 64-bit floating point or 80-bit floating point (upon loading, the value is converted to the currently used floating point mode). The x86 also includes a number of transcendental functions including sine, cosine, tangent, arctangent, exponentiation with the base 2 and logarithms to bases 2, 10, or e.

The stack register to stack register format of the instructions is usually F(OP) st, st(*) or F(OP) st(*), st, where st is equivalent to st(0), and st(*) is one of the 8 stack registers (st(0), st(1), ..., st(7)). As with the integer instructions, the first operand is both the first source operand and the destination operand. FSUBR and FDIVR are notable in that they swap the source operands before performing the subtraction or division. The addition, subtraction, multiplication, division, store and comparison instructions include forms that pop the top of the stack after their operation is complete. For example, FADDP st(1), st performs the calculation st(1) = st(1) + st(0), then removes st(0) from the top of the stack, so that the result that was in st(1) becomes the new top of the stack, st(0).
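
The FADDP example can be traced with a toy model of the register stack. This is a sketch under assumptions: the class X87Stack and its method names are invented for the example, values are held as ordinary Python floats (64-bit, not the 80-bit format of the hardware), and a Python list stands in for the eight physical registers.

```python
class X87Stack:
    """Toy model of the floating point stack; index 0 is st(0)."""

    def __init__(self):
        self._entries = []

    def fld(self, value: float) -> None:
        """Push a value; the previous st(0) becomes st(1)."""
        if len(self._entries) == 8:
            raise OverflowError("floating point stack is full")
        self._entries.insert(0, float(value))

    def st(self, i: int) -> float:
        return self._entries[i]

    def faddp(self, i: int) -> None:
        """Model FADDP st(i), st: st(i) = st(i) + st(0), then pop st(0)."""
        self._entries[i] = self._entries[i] + self._entries[0]
        self._entries.pop(0)

    def depth(self) -> int:
        return len(self._entries)


fpu = X87Stack()
fpu.fld(2.0)                   # st(0) = 2.0
fpu.fld(3.0)                   # st(0) = 3.0, st(1) = 2.0
fpu.faddp(1)                   # st(1) = 2.0 + 3.0, then the old st(0) is popped
print(fpu.st(0), fpu.depth())  # 5.0 1
```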

SIMD instructions

Modern x86 CPUs contain SIMD instructions, which largely perform the same operation in parallel on many values encoded in a wide SIMD register. Various instruction technologies support different operations on different register sets, but taken as a complete whole (from MMX to SSE3) they include general computations in integer or floating point arithmetic (addition, subtraction, multiplication, shift, minimum, maximum, comparison, division or square root). So for example, PADDW MM0, MM1 performs 4 parallel 16-bit (indicated by the W) integer adds (indicated by the PADD) of the mm1 values to the mm0 values and stores the result in mm0. SSE and SSE2 also include floating point modes in which only the very first value of the registers is actually modified. Some other unusual instructions have been added, including a sum of absolute differences (used for motion estimation in video processing, such as is done in MPEG) and a 16-bit multiply accumulation instruction (useful for software-based alpha-blending). The SSE3 and 3DNow! extensions include addition and subtraction instructions for treating paired floating point values like complex numbers.
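
The PADDW example can be mimicked in Python by treating each 64-bit MMX register as an integer holding four 16-bit lanes. This is a minimal sketch, with the function name paddw chosen for the example; it models the wraparound form of the instruction, in which a carry out of one lane is discarded rather than propagated into the next.

```python
def paddw(dst: int, src: int) -> int:
    """Add four packed 16-bit lanes of two 64-bit values, wrapping per lane."""
    result = 0
    for lane in range(4):
        shift = 16 * lane
        a = (dst >> shift) & 0xFFFF
        b = (src >> shift) & 0xFFFF
        result |= ((a + b) & 0xFFFF) << shift  # the carry does not reach the next lane
    return result


mm0 = 0x0001_FFFF_0003_0004
mm1 = 0x0001_0001_0001_0001
mm0 = paddw(mm0, mm1)
print(f"{mm0:016x}")  # 0002000000040005
```

The second lane from the top shows the wraparound: 0xFFFF + 1 becomes 0x0000 and the lane above it is unaffected.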

These instruction sets also include numerous fixed sub-word instructions for shuffling, inserting and extracting values within the registers. In addition there are instructions for moving data between the integer registers and SSE/MMX registers.

Data manipulation instructions

The x86 processor also includes complex addressing modes for addressing memory with an immediate offset, a register, a register with an offset, a scaled register with or without an offset, and a register with an optional offset and another scaled register. So for example, one can encode mov eax, [Table + ebx + esi*4] as a single instruction which loads 32 bits of data from the address computed as (Table + ebx + esi * 4) offset from the DS selector, and stores it to the eax register. In general the x86 processor can load and use memory matched to the size of any register it is operating on. (The SIMD instructions also include half-load instructions.)
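
The address computation in the mov eax, [Table + ebx + esi*4] example can be sketched as follows. The model is illustrative only: the function names, the flat bytearray used as memory, and the value chosen for Table are assumptions, and segmentation is ignored.

```python
def effective_address(base: int = 0, index: int = 0, scale: int = 1,
                      displacement: int = 0, bits: int = 32) -> int:
    """Compute base + index * scale + displacement, truncated to the address size."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return (base + index * scale + displacement) & ((1 << bits) - 1)


def load32(memory: bytearray, address: int) -> int:
    """Read a 32-bit little-endian value, as a load into eax would."""
    return int.from_bytes(memory[address:address + 4], "little")


memory = bytearray(0x2000)
TABLE = 0x1000            # hypothetical address of Table
ebx, esi = 0x20, 3

address = effective_address(base=ebx, index=esi, scale=4, displacement=TABLE)
memory[address:address + 4] = (0xDEADBEEF).to_bytes(4, "little")

print(hex(address))                  # 0x102c
print(hex(load32(memory, address)))  # 0xdeadbeef
```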

The x86 instruction set includes string load, store and move instructions (LODS, STOS, and MOVS) which perform each operation on a specified size (B for 8-bit byte, W for 16-bit word, D for 32-bit double word) and then increment the implicit address register (SI for LODS, DI for STOS, and both for MOVS), or decrement it if the direction flag is set. For the load and store, the implicit target/source register is AL, AX or EAX (depending on size). Loads through SI use the DS segment by default, while stores through DI always use ES; MOVS therefore uses DS for the load and ES for the store. In modern x86 processors, these complex instructions generally don't offer any performance advantage over more simply implemented separate load/store and address increment instructions.
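
A single step of each string instruction can be modelled as below. This is a simplified sketch: the class StringUnit and its attribute names are invented for the example, memory is one flat bytearray (so the segment registers are ignored), and acc holds only the value just loaded rather than modelling the untouched upper bits of EAX.

```python
class StringUnit:
    """Toy model of LODS, STOS and MOVS over a flat bytearray."""

    def __init__(self, memory: bytearray):
        self.memory = memory
        self.si = 0       # source index
        self.di = 0       # destination index
        self.acc = 0      # stands in for AL / AX / EAX
        self.df = False   # direction flag: False counts up, True counts down

    def _step(self, size: int) -> int:
        return -size if self.df else size

    def lods(self, size: int) -> None:
        self.acc = int.from_bytes(self.memory[self.si:self.si + size], "little")
        self.si += self._step(size)

    def stos(self, size: int) -> None:
        value = self.acc & ((1 << (8 * size)) - 1)
        self.memory[self.di:self.di + size] = value.to_bytes(size, "little")
        self.di += self._step(size)

    def movs(self, size: int) -> None:
        self.memory[self.di:self.di + size] = self.memory[self.si:self.si + size]
        self.si += self._step(size)
        self.di += self._step(size)


unit = StringUnit(bytearray(b"abcd\x00\x00\x00\x00"))
unit.si, unit.di = 0, 4
unit.movs(1)              # like MOVSB: copies "a"
unit.movs(1)              # like MOVSB: copies "b"
unit.lods(2)              # like LODSW: loads "cd" into the accumulator
unit.stos(2)              # like STOSW: stores it at the destination
print(bytes(unit.memory), unit.si, unit.di)  # b'abcdabcd' 4 8
```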

The stack is implemented with an implicitly decrementing (push) and incrementing (pop) stack pointer. In 16-bit mode, this implicit stack pointer is addressed as SS:[SP], in 32-bit mode it's SS:[ESP], and in 64-bit mode it's [RSP]. The stack pointer actually points to the most recently stored value: a push first decrements the pointer and then stores, and a pop loads and then increments. The size of each step matches the operating mode of the processor (i.e., 16, 32, or 64 bits), which is the default width of the PUSH/POP/CALL/RET instructions. Also included are the instructions ENTER and LEAVE which reserve and remove data from the top of the stack while setting up a stack frame pointer in BP/EBP/RBP. However, direct setting of, or addition to and subtraction from, the SP/ESP/RSP register is also supported, so the ENTER/LEAVE instructions are generally unnecessary. Other instructions for manipulating the stack include PUSHF/POPF for storing and retrieving the (E)FLAGS register. The PUSHA/POPA instructions will store and retrieve the entire integer register state to and from the stack.
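
The behaviour of PUSH and POP can be illustrated with a small model in which the stack pointer always holds the address of the most recently pushed value. The class name Stack, the 64-byte memory and the default 4-byte width (standing in for 32-bit mode) are assumptions made for this sketch; underflow and overflow are not checked.

```python
class Stack:
    """Toy downward-growing stack; sp is the address of the last pushed value."""

    def __init__(self, size: int = 64, width: int = 4):
        self.memory = bytearray(size)
        self.width = width   # bytes per push: 2, 4 or 8 depending on the mode
        self.sp = size       # an empty stack starts just past the end of memory

    def push(self, value: int) -> None:
        self.sp -= self.width                        # decrement first
        mask = (1 << (8 * self.width)) - 1
        self.memory[self.sp:self.sp + self.width] = \
            (value & mask).to_bytes(self.width, "little")

    def pop(self) -> int:
        value = int.from_bytes(self.memory[self.sp:self.sp + self.width], "little")
        self.sp += self.width                        # increment after loading
        return value


stack = Stack()
stack.push(0x11223344)
stack.push(5)
print(stack.sp)          # 56
print(stack.pop())       # 5
print(hex(stack.pop()))  # 0x11223344
print(stack.sp)          # 64
```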

Values for a SIMD load or store are assumed to be packed in adjacent memory positions, in sequential little-endian order, matching the layout of the SIMD register. Some SSE load and store instructions require 16-byte alignment to function properly. The SIMD instruction sets also include "prefetch" instructions which perform the load but do not target any register, used for cache loading. The SSE instruction sets also include non-temporal store instructions which will perform stores straight to memory without performing a cache allocate if the destination is not already cached (otherwise it will behave like a regular store).

Most generic integer, floating point and SIMD instructions can take one memory operand, specified with a complex address, as the second source parameter. Integer instructions can also accept one memory parameter as a destination operand.

Programming flow

The x86 assembly has an unconditional jump operation, jmp, which can take an immediate address, a register or an indirect address as a parameter. (Note that most RISC processors only support a link register or short immediate displacement for jumping.)

Also supported are several conditional jumps, including je (jump on equality), jne (jump on inequality), jg (jump on greater than, signed), jl (jump on less than, signed), ja (jump on above/greater than, unsigned), jb (jump on below/less than, unsigned). These conditional operations are based on the state of specific bits in the (E)FLAGS register. Many arithmetic and logic operations set, clear or complement these flags depending on their result. The cmp (compare) and test instructions set the flags as if they had performed a subtraction or a bitwise AND operation, respectively, without altering the values of the operands. There are also instructions such as clc (clear carry flag) and cmc (complement carry flag) which work on the flags directly. Floating point comparisons are performed via the FCOM or FICOM instructions, whose results are placed in the FPU status word and eventually have to be transferred to the integer flags (for example with FNSTSW AX followed by SAHF).
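
The difference between the signed and unsigned conditions can be seen by computing the flags that cmp would produce and evaluating each condition on them. This is an illustrative Python sketch: the function names cmp_flags and taken are invented for the example, and only the four flags these six jumps depend on are modelled.

```python
def cmp_flags(a: int, b: int, bits: int = 32) -> dict:
    """Return the flags produced by comparing a with b (computing a - b)."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    a &= mask
    b &= mask
    result = (a - b) & mask
    return {
        "ZF": result == 0,                           # operands were equal
        "CF": a < b,                                 # unsigned borrow
        "SF": bool(result & sign),                   # sign bit of the result
        "OF": bool((a ^ b) & (a ^ result) & sign),   # signed overflow
    }


CONDITIONS = {
    "je":  lambda f: f["ZF"],
    "jne": lambda f: not f["ZF"],
    "jg":  lambda f: not f["ZF"] and f["SF"] == f["OF"],
    "jl":  lambda f: f["SF"] != f["OF"],
    "ja":  lambda f: not f["CF"] and not f["ZF"],
    "jb":  lambda f: f["CF"],
}


def taken(mnemonic: str, a: int, b: int, bits: int = 32) -> bool:
    """Would the conditional jump be taken after cmp a, b?"""
    return CONDITIONS[mnemonic](cmp_flags(a, b, bits))


# 0xFFFFFFFF is -1 when read as signed, but the largest value when unsigned.
print(taken("jl", 0xFFFFFFFF, 1))  # True
print(taken("ja", 0xFFFFFFFF, 1))  # True
print(taken("jg", 0xFFFFFFFF, 1))  # False
```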

Jump operations come in three forms, depending on the size of the operand. A short jump uses an 8-bit signed operand, which is an offset relative to the address of the instruction following the jump. A near jump stays within the current code segment: in real mode or 16-bit protected mode its operand is a 16-bit signed relative offset, and in 32-bit protected mode it is a 16-bit or 32-bit signed relative offset, in both cases applied like that of a short jump. A far jump is one that uses the full segment base:offset value as an absolute address. There are also indirect and indexed forms of near and far jumps.
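
How a short jump's 8-bit operand selects its target can be shown with a small calculation. This sketch assumes the common two-byte encoding of a short jump (one opcode byte followed by the displacement byte) and measures the displacement from the address of the following instruction; the function name and parameters are invented for the example.

```python
def short_jump_target(instruction_address: int, rel8: int,
                      instruction_length: int = 2, bits: int = 16) -> int:
    """Return the target of a short jump whose operand byte is rel8."""
    displacement = rel8 - 0x100 if rel8 & 0x80 else rel8  # sign-extend the byte
    next_instruction = instruction_address + instruction_length
    return (next_instruction + displacement) & ((1 << bits) - 1)


print(hex(short_jump_target(0x0100, 0x05)))  # 0x107
print(hex(short_jump_target(0x0100, 0xFE)))  # 0x100
```

In the second call the operand byte 0xFE is -2, which cancels the two-byte length of the jump itself, so the instruction jumps to its own address.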

In addition to the simple jump operations, there are the call (call a subroutine) and ret (return from subroutine) instructions. Before transferring control to the subroutine, call pushes the offset address of the instruction following the call onto the stack; ret pops this value off the stack, and jumps to it, effectively returning the flow of control to that part of the program. In the case of a far call, the code segment is pushed before the offset, and the matching far return pops both.

There are also two related instructions. int (interrupt) saves the current (E)FLAGS register on the stack, then performs a far call, except that instead of an address it uses an interrupt vector, an index into a table of interrupt handler addresses. The matching return-from-interrupt instruction is iret, which restores the flags after returning. Soft interrupts of the type described above are used by some operating systems for system calls, and can also be used in debugging hard interrupt handlers. Hard interrupts are triggered by external hardware events.

