EBCDIC

From Wikipedia, the free encyclopedia

EBCDIC (Extended Binary Coded Decimal Interchange Code) is an 8-bit character encoding (code page) used on IBM mainframe operating systems such as z/OS, OS/390, VM and VSE, as well as IBM minicomputer operating systems such as OS/400 and i5/OS. It is also employed on various non-IBM platforms such as Fujitsu-Siemens' BS2000/OSD, HP MPE/iX, and Unisys MCP. It descended from punched cards and the corresponding six-bit binary-coded decimal code used by most of IBM's computer peripherals of the late 1950s and early 1960s.

1 History
2 Technical details
3 Codepage layout
4 Trivia
5 See also
6 External links

History

EBCDIC was devised in 1963 and 1964 by IBM and was announced with the release of the IBM System/360 line of mainframe computers. It was created to extend the Binary-Coded Decimal that existed at the time. EBCDIC was developed separately from ASCII. EBCDIC is an 8-bit encoding, versus the 7-bit encoding of ASCII.

Interestingly, IBM was a chief proponent of the ASCII standardization committee. However, IBM did not have time to prepare ASCII peripherals (such as card punch machines) to ship with its System/360 computers, so the company settled on EBCDIC at the time. The System/360 became wildly successful, and thus so did EBCDIC.

All IBM mainframe peripherals and operating systems (except Linux on zSeries) use EBCDIC as their native encoding, but software can translate to and from other encodings. Many hardware peripherals provide translation as well, and modern mainframes (such as IBM zSeries) include processor instructions that accelerate translation between character sets in hardware.

At the time it was devised, EBCDIC made it relatively easy to enter data into a computer with punch cards. Since punch cards are never used on mainframes nowadays, EBCDIC is used in modern mainframes solely for backwards compatibility. It has no technical advantage over ASCII-based code pages such as the ISO-8859 series or Unicode. As with single-byte extended ASCII codepages, most EBCDIC codepages only allow up to 2 languages (English and one other language) to be used in a database or text file.

Where true support for multilingual text is desired, a system supporting far more characters is needed. Generally this is done with some form of Unicode support. There is an EBCDIC Unicode Transformation Format called UTF-EBCDIC, proposed by the Unicode Consortium, but it is not intended to be used in open interchange environments and, even on EBCDIC-based systems, it is almost never used. IBM mainframes support UTF-16, but they do not support UTF-EBCDIC natively.

Technical details

EBCDIC code pages and ASCII-based code pages are incompatible with each other. A code page assigns a character to each numeric byte value, so the same byte value is interpreted as a different character depending on the code page used. Data stored in EBCDIC therefore requires a code page conversion before the text can be viewed on ASCII-based machines, such as a personal computer.
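The conversion can be illustrated with a minimal Python sketch. It assumes Python's standard `cp500` codec (CCSID 500) as the EBCDIC code page and ISO 8859-1 as the ASCII-based one; the sample string is arbitrary.

```python
# Encode the same text in an EBCDIC code page and in an ASCII-based one.
text = "Hello"
ebcdic_bytes = text.encode("cp500")    # EBCDIC, CCSID 500
latin1_bytes = text.encode("latin-1")  # ISO 8859-1

print(ebcdic_bytes.hex())  # c885939396
print(latin1_bytes.hex())  # 48656c6c6f

# Interpreting EBCDIC bytes with an ASCII-based code page gives
# unrelated characters rather than the original text.
misread = ebcdic_bytes.decode("latin-1")

# Decoding with the matching code page recovers the text.
roundtrip = ebcdic_bytes.decode("cp500")
print(roundtrip)  # Hello
```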

A single EBCDIC byte occupies eight bits, which are divided into two halves, or nibbles. The first four bits are called the zone and represent the category of the character, whereas the last four bits are called the digit and identify the specific character.
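As a small sketch of this split, the hypothetical helper `split_nibbles` below separates a byte value into its zone and digit halves. The sample characters are encoded with Python's `cp500` codec, which is an assumption made only to obtain EBCDIC byte values.

```python
from typing import Tuple


def split_nibbles(byte: int) -> Tuple[int, int]:
    """Return (zone, digit) for one 8-bit value."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in the range 0x00-0xFF")
    zone = byte >> 4     # high four bits
    digit = byte & 0x0F  # low four bits
    return zone, digit


for ch in "A5":
    code = ch.encode("cp500")[0]
    zone, digit = split_nibbles(code)
    print(f"{ch!r}: 0x{code:02X} -> zone 0x{zone:X}, digit 0x{digit:X}")
```

Here "A" (0xC1) falls in zone C and "5" (0xF5) in zone F, which matches the rows of the code page table.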

There is a close correspondence between hexadecimal character codes and punch card codes for EBCDIC. This was an important feature at the time the EBCDIC scheme was created. An IBM card punch could make a 12-row punch card with up to two punches per column, the first punch somewhere in the first 3 rows (called the zone) and the second punch somewhere in the last 9 rows (called the number). The zone could thus be considered a value from 0 to 3, and the number a value from 0 to 9, where 0 means no punch and a non-zero value means the corresponding row was punched. The initial version of EBCDIC was just ((0xF - zone) << 4) + number and defined only the lower-left 10x4 part of the table shown below (the zone was apparently reversed so the letters would at least be in alphabetic order).
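The formula can be tried out with a short Python sketch. The function name `early_ebcdic_code` is hypothetical, the zone and number arguments are the abstract 0 to 3 and 0 to 9 values described above rather than physical row labels, and the `cp500` codec is used only to show which character each resulting code denotes in a modern code page.

```python
def early_ebcdic_code(zone: int, number: int) -> int:
    """Map a (zone, number) punch pair to a code via ((0xF - zone) << 4) + number."""
    if not 0 <= zone <= 3:
        raise ValueError("zone must be in 0..3")
    if not 0 <= number <= 9:
        raise ValueError("number must be in 0..9")
    # Parenthesised explicitly: in many languages '+' binds tighter than '<<'.
    return ((0xF - zone) << 4) + number


# A few (zone, number) pairs and the characters they land on in cp500.
for zone, number in [(0, 5), (1, 2), (2, 1), (3, 1)]:
    code = early_ebcdic_code(zone, number)
    char = bytes([code]).decode("cp500")
    print(f"zone={zone} number={number} -> 0x{code:02X} {char!r}")
```

The four pairs yield 0xF5, 0xE2, 0xD1 and 0xC1, which are "5", "S", "J" and "A" in the table: digits in row F and the three letter groups in rows E, D and C.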

The first 64 code points (00-3F) are control characters, 33 of which have ASCII equivalents. One notable difference between the two sets is that ASCII has carriage return (CR) and linefeed (LF) codes, which are generally used as end of line indicators within ASCII text files, whereas EBCDIC has additional newline (NL) and reverse newline (RNL) codes. The other 31 control codes are used for various terminal and device controls, mostly specific to IBM hardware.
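The separate line-ending controls can be inspected with a brief Python sketch. It assumes Python's `cp500` codec, whose mapping of these control codes to Unicode is one convention among several; other conversion tables may treat NL differently.

```python
# EBCDIC control codes related to line endings, as placed in CCSID 500.
controls = {"CR": 0x0D, "NL": 0x15, "LF": 0x25}

# Decode each single byte and record the Unicode code point it maps to.
decoded = {name: bytes([code]).decode("cp500") for name, code in controls.items()}

for name, code in controls.items():
    print(f"EBCDIC 0x{code:02X} ({name}) -> U+{ord(decoded[name]):04X}")
```

With this codec, CR and LF map to the familiar ASCII control codes, while NL maps to a distinct control character outside the ASCII range, so a naive conversion can leave text without the line feeds an ASCII-based system expects.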

There are a number of different versions of EBCDIC, customized for different countries. Some East Asian countries use a double byte extension of EBCDIC to allow display of Chinese, Japanese and Korean scripts for their mainframes. In the double byte extension of EBCDIC, there are shift codes [0x0E,0x0F] to shift between the single byte and double byte modes.
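A minimal sketch of how a reader might separate the two modes is shown below. The function `split_runs` is hypothetical, the sketch assumes the stream starts in single-byte mode and that the shift codes never occur inside a double-byte pair, and the double-byte values in the sample are placeholders rather than real characters.

```python
SO = 0x0E  # shift out: switch to double-byte mode
SI = 0x0F  # shift in: return to single-byte mode


def split_runs(data: bytes):
    """Split a mixed stream into (mode, bytes) runs, dropping the shift codes."""
    runs = []
    mode = "single"
    current = bytearray()
    for b in data:
        entering_double = b == SO and mode == "single"
        leaving_double = b == SI and mode == "double"
        if entering_double or leaving_double:
            if current:
                runs.append((mode, bytes(current)))
                current = bytearray()
            mode = "double" if entering_double else "single"
        else:
            current.append(b)
    if current:
        runs.append((mode, bytes(current)))
    return runs


sample = b"\xc1\x0e\x42\x43\x44\x45\x0f\xc2"
print(split_runs(sample))
```

Each "double" run would then be decoded two bytes at a time with the relevant double-byte table, which is outside the scope of this sketch.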

IBM typically identifies each of its code pages by a number called a CCSID (Coded Character Set IDentifier). Note that even under the same CCSID, a character can be assigned to different byte values on different systems. For example, the newline character can be a different byte value in z/OS UNIX System Services than in the other EBCDIC-based operating systems. This becomes an issue when transferring EBCDIC-based text data between machines.
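The platform-specific newline difference depends on system configuration and is not reproduced here. As a related illustration of why byte values cannot be interpreted without knowing the exact code page, the following sketch compares two EBCDIC code pages available as Python standard codecs, `cp037` and `cp500`; the helper name `differing_bytes` is hypothetical.

```python
def differing_bytes(codec_a: str, codec_b: str):
    """List (byte value, char in codec_a, char in codec_b) where the two differ."""
    diffs = []
    for value in range(0x40, 0x100):  # skip the control range 00-3F
        raw = bytes([value])
        a = raw.decode(codec_a)
        b = raw.decode(codec_b)
        if a != b:
            diffs.append((value, a, b))
    return diffs


diffs = differing_bytes("cp037", "cp500")
for value, a, b in diffs:
    print(f"0x{value:02X}: cp037 {a!r}  cp500 {b!r}")
```

Letters and digits occupy the same positions in both code pages, while a handful of punctuation characters such as the square brackets move, which is enough to corrupt source code or structured data transferred with the wrong table.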

Codepage layout

This is CCSID 500, a variant of EBCDIC. Characters 00–3F and FF are controls, 40 is space, 41 is no-break space, and CA is soft hyphen. Characters are shown with their equivalent ISO 8859-1 codes:

	-0	-1	-2	-3	-4	-5	-6	-7	-8	-9	-A	-B	-C	-D	-E	-F
0-	NUL `00`	SOH `01`	STX `02`	ETX `03`	SEL	HT `09`	RNL	DEL `7F`	GE	SPS	RPT	VT `0B`	FF `0C`	CR `0D`	SO `0E`	SI `0F`
1-	DLE `10`	DC1 `11`	DC2 `12`	DC3 `13`	RES ENP	NL	BS `08`	POC	CAN `18`	EM `19`	UBS	CU1	IFS `1C`	IGS `1D`	IRS `1E`	IUS ITB `1F`
2-	DS	SOS	FS	WUS	BYP INP	LF `0A`	ETB `17`	ESC `1B`	SA	SFE	SM SW	CSP	MFA	ENQ `05`	ACK `06`	BEL `07`
3-			SYN `16`	IR	PP	TRN	NBS	EOT `04`	SBS	IT	RFF	CU3	DC4 `14`	NAK `15`		SUB `1A`
4-	SP `20`	RSP `A0`	â `E2`	ä `E4`	à `E0`	á `E1`	ã `E3`	å `E5`	ç `E7`	ñ `F1`	[ `5B`	. `2E`	< `3C`	( `28`	+ `2B`	! `21`
5-	& `26`	é `E9`	ê `EA`	ë `EB`	è `E8`	í `ED`	î `EE`	ï `EF`	ì `EC`	ß `DF`	] `5D`	$ `24`	* `2A`	) `29`	; `3B`	^ `5E`
6-	- `2D`	/ `2F`	Â `C2`	Ä `C4`	À `C0`	Á `C1`	Ã `C3`	Å `C5`	Ç `C7`	Ñ `D1`	¦ `A6`	, `2C`	% `25`	_ `5F`	> `3E`	? `3F`
7-	ø `F8`	É `C9`	Ê `CA`	Ë `CB`	È `C8`	Í `CD`	Î `CE`	Ï `CF`	Ì `CC`	` `60`	: `3A`	# `23`	@ `40`	' `27`	= `3D`	" `22`
8-	Ø `D8`	a `61`	b `62`	c `63`	d `64`	e `65`	f `66`	g `67`	h `68`	i `69`	« `AB`	» `BB`	ð `F0`	ý `FD`	þ `FE`	± `B1`
9-	° `B0`	j `6A`	k `6B`	l `6C`	m `6D`	n `6E`	o `6F`	p `70`	q `71`	r `72`	ª `AA`	º `BA`	æ `E6`	¸ `B8`	Æ `C6`	¤ `A4`
A-	µ `B5`	~ `7E`	s `73`	t `74`	u `75`	v `76`	w `77`	x `78`	y `79`	z `7A`	¡ `A1`	¿ `BF`	Ð `D0`	Ý `DD`	Þ `DE`	® `AE`
B-	¢ `A2`	£ `A3`	¥ `A5`	· `B7`	© `A9`	§ `A7`	¶ `B6`	¼ `BC`	½ `BD`	¾ `BE`	¬ `AC`	| `7C`	¯ `AF`	¨ `A8`	´ `B4`	× `D7`
C-	{ `7B`	A `41`	B `42`	C `43`	D `44`	E `45`	F `46`	G `47`	H `48`	I `49`	SHY `AD`	ô `F4`	ö `F6`	ò `F2`	ó `F3`	õ `F5`
D-	} `7D`	J `4A`	K `4B`	L `4C`	M `4D`	N `4E`	O `4F`	P `50`	Q `51`	R `52`	¹ `B9`	û `FB`	ü `FC`	ù `F9`	ú `FA`	ÿ `FF`
E-	\ `5C`	÷ `F7`	S `53`	T `54`	U `55`	V `56`	W `57`	X `58`	Y `59`	Z `5A`	² `B2`	Ô `D4`	Ö `D6`	Ò `D2`	Ó `D3`	Õ `D5`
F-	0 `30`	1 `31`	2 `32`	3 `33`	4 `34`	5 `35`	6 `36`	7 `37`	8 `38`	9 `39`	³ `B3`	Û `DB`	Ü `DC`	Ù `D9`	Ú `DA`	EO

Trivia

Famed open source software advocate and hacker Eric S. Raymond writes in his Jargon File that EBCDIC was almost universally loathed by early hackers and programmers because of its multitude of different versions, none of which resembled the other versions, and that IBM produced it in direct competition with the already-established ASCII.

The Jargon File 4.4.7 gives the following definition:

	EBCDIC: /eb´s@·dik/, /eb´see`dik/, /eb´k@·dik/, n. [abbreviation, Extended Binary Coded Decimal Interchange Code] An alleged character set used on IBM dinosaurs. It exists in at least six mutually incompatible versions, all featuring such delights as non-contiguous letter sequences and the absence of several ASCII punctuation characters fairly important for modern computer languages (exactly which characters are absent varies according to which version of EBCDIC you're looking at). IBM adapted EBCDIC from punched card code in the early 1960s and promulgated it as a customer-control tactic (see connector conspiracy), spurning the already established ASCII standard. Today, IBM claims to be an open-systems company, but IBM's own description of the EBCDIC variants and how to convert between them is still internally classified top-secret, burn-before-reading. Hackers blanch at the very name of EBCDIC and consider it a manifestation of purest evil. See also fear and loathing.

Another popular complaint is that the EBCDIC alphabetic characters follow an archaic punch card encoding rather than a linear ordering like ASCII. The upshot of this is that incrementing the character code for "I" does not produce the code for "J", and likewise there is a gap between the codes for "R" and "S". Thus programming a simple control loop to cycle through only the alphabetic characters is problematic.
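The gaps can be located with a short Python sketch, assuming the `cp500` codec as the EBCDIC code page. The helper `ebcdic_uppercase` is a hypothetical workaround that enumerates the three contiguous letter ranges explicitly instead of incrementing one counter.

```python
import string

# EBCDIC code of each uppercase letter, in alphabetical order.
codes = [(ch, ch.encode("cp500")[0]) for ch in string.ascii_uppercase]

# Adjacent letters whose codes are not consecutive.
gaps = [(a, b) for (a, ca), (b, cb) in zip(codes, codes[1:]) if cb - ca != 1]
print(gaps)  # [('I', 'J'), ('R', 'S')]


def ebcdic_uppercase() -> bytes:
    """Return the EBCDIC codes for A-Z by joining the three letter ranges."""
    return (
        bytes(range(0xC1, 0xCA))    # A-I
        + bytes(range(0xD1, 0xDA))  # J-R
        + bytes(range(0xE2, 0xEA))  # S-Z
    )


print(ebcdic_uppercase().decode("cp500"))  # ABCDEFGHIJKLMNOPQRSTUVWXYZ
```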

These incompatibilities were also the source of many jokes. A popular one went:

Professor: So the American government went to IBM to come up with a data encryption standard, and they came up with—
Student: EBCDIC!