UTF-EBCDIC
UTF-EBCDIC is a character encoding used to represent Unicode characters. It is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes can process the characters without much difficulty. Its advantages for existing EBCDIC-based systems are similar to UTF-8's advantages for existing ASCII-based systems. UTF-EBCDIC is defined in Unicode Technical Report #16.
To produce the UTF-EBCDIC encoded version of a series of Unicode code points, an encoding based on UTF-8 (known in the specification as UTF-8-Mod) is applied first. The main difference between this encoding and UTF-8 is that it allows Unicode code points U+0080 through U+009F (the C1 control codes) to be represented as a single byte and therefore later mapped to the corresponding EBCDIC control codes. To achieve this, UTF-8-Mod uses 101XXXXX instead of 10XXXXXX as the format for trailing bytes in a multi-byte sequence, which frees the byte values 10000000 through 10011111 for the C1 controls. Because each trailing byte then holds only 5 bits rather than 6, the UTF-8-Mod encoding of a code point above U+009F is never shorter, and often longer, than its UTF-8 encoding.
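The bit layout described above can be sketched in Python. This is a minimal illustration, not the normative algorithm from the technical report: the function name `utf8_mod_encode` is hypothetical, the length thresholds are derived from the 5-bit trailing-byte format, and surrogate and other validity checks are omitted.

```python
def utf8_mod_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8-Mod (illustrative sketch)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    # U+0000..U+009F (ASCII plus the C1 controls) stay a single byte.
    if cp <= 0x9F:
        return bytes([cp])
    # (highest code point, lead-byte prefix, number of trailing bytes)
    forms = (
        (0x3FF, 0xC0, 1),     # 110yyyyy + 1 trailing byte
        (0x3FFF, 0xE0, 2),    # 1110zzzz + 2 trailing bytes
        (0x3FFFF, 0xF0, 3),   # 11110www + 3 trailing bytes
        (0x3FFFFF, 0xF8, 4),  # 111110vv + 4 trailing bytes
    )
    for limit, prefix, ntrail in forms:
        if cp <= limit:
            trail = []
            for _ in range(ntrail):
                # Each trailing byte is 101XXXXX: only 5 payload bits.
                trail.append(0xA0 | (cp & 0x1F))
                cp >>= 5
            # What is left of cp fits in the lead byte's payload bits.
            return bytes([prefix | cp] + trail[::-1])
    raise AssertionError("unreachable for valid code points")


# Compare the result with Python's built-in UTF-8 encoder.
for cp in (0x41, 0x85, 0xA0, 0x400, 0x20AC):
    mod = utf8_mod_encode(cp)
    utf8 = chr(cp).encode("utf-8")
    print(f"U+{cp:04X}: UTF-8-Mod {mod.hex()} | UTF-8 {utf8.hex()}")
```

In this sketch U+0085 stays a single byte (UTF-8 needs two), while U+0400 takes three bytes (UTF-8 needs two).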
The UTF-8-Mod transformation leaves the data in an ASCII-based format (for example, U+0041 "A" is still encoded as 01000001), so each byte is fed through a reversible (one-to-one) lookup table to produce the final UTF-EBCDIC encoding. For example, 01000001 in this table maps to 11000001; thus the UTF-EBCDIC encoding of U+0041 (Unicode's "A") is 0xC1 (EBCDIC's "A").
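The byte-for-byte mapping step can be illustrated with a deliberately partial table. The sketch below contains only pairs that appear in this article (such as 0x41 to 0xC1); the names `UTF8MOD_TO_EBCDIC`, `EBCDIC_TO_UTF8MOD` and `remap` are hypothetical, and the complete 256-entry table is the one defined in the technical report.

```python
# Partial UTF-8-Mod -> UTF-EBCDIC byte map (illustrative only;
# the real table has one entry for each of the 256 byte values).
UTF8MOD_TO_EBCDIC = {
    0x20: 0x40,  # space
    0x41: 0xC1,  # "A"
    0x5F: 0x6D,  # "_"
    0x61: 0x81,  # "a"
    0x85: 0x25,  # NEL, a C1 control kept as a single byte
    0xBF: 0x73,  # trailing byte carrying the bits 11111
    0xC2: 0x76,  # a two-byte lead byte
}

# The mapping is one-to-one, so inverting it gives the decoding table.
EBCDIC_TO_UTF8MOD = {e: m for m, e in UTF8MOD_TO_EBCDIC.items()}


def remap(data: bytes, table: dict) -> bytes:
    """Translate each byte independently through the lookup table."""
    return bytes(table[b] for b in data)


encoded = remap(b"A a_", UTF8MOD_TO_EBCDIC)
print(encoded.hex())  # c140816d
# Reversibility: mapping back restores the UTF-8-Mod bytes.
assert remap(encoded, EBCDIC_TO_UTF8MOD) == b"A a_"
```

Because every byte is translated independently, the step neither changes the length of the data nor needs to know where character boundaries fall.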
This encoding form is rarely used, even on the EBCDIC-based mainframes for which it was designed. IBM EBCDIC-based mainframe operating systems, such as z/OS, usually use UTF-16 for complete Unicode support. For example, DB2 UDB, COBOL, PL/I, Java and the IBM XML toolkit support UTF-16 on IBM mainframes.
Codepage layout
There are 160 characters with single-byte encodings in UTF-EBCDIC (compared to 128 in UTF-8). The single-byte portion resembles IBM-1047 rather than IBM-37, as the location of the square brackets shows: UTF-EBCDIC, like IBM-1047, places [ and ] at hex AD and BD respectively, whereas CCSID 37 places them at hex BA and BB.
UTF-EBCDIC

|    | _0 | _1 | _2 | _3 | _4 | _5 | _6 | _7 | _8 | _9 | _A | _B | _C | _D | _E | _F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0_ | NUL 0000 0 | SOH 0001 1 | STX 0002 2 | ETX 0003 3 | ST 009C 4 | HT 0009 5 | SSA 0086 6 | DEL 007F 7 | EPA 0097 8 | RI 008D 9 | SS2 008E 10 | VT 000B 11 | FF 000C 12 | CR 000D 13 | SO 000E 14 | SI 000F 15 |
| 1_ | DLE 0010 16 | DC1 0011 17 | DC2 0012 18 | DC3 0013 19 | OSC 009D 20 | LF 000A 21 | BS 0008 22 | ESA 0087 23 | CAN 0018 24 | EM 0019 25 | PU2 0092 26 | SS3 008F 27 | FS 001C 28 | GS 001D 29 | RS 001E 30 | US 001F 31 |
| 2_ | PAD 0080 32 | HOP 0081 33 | BPH 0082 34 | NBH 0083 35 | IND 0084 36 | NEL 0085 37 | ETB 0017 38 | ESC 001B 39 | HTS 0088 40 | HTJ 0089 41 | VTS 008A 42 | PLD 008B 43 | PLU 008C 44 | ENQ 0005 45 | ACK 0006 46 | BEL 0007 47 |
| 3_ | DCS 0090 48 | PU1 0091 49 | SYN 0016 50 | STS 0093 51 | CCH 0094 52 | MW 0095 53 | SPA 0096 54 | EOT 0004 55 | SOS 0098 56 | SGCI 0099 57 | SCI 009A 58 | CSI 009B 59 | DC4 0014 60 | NAK 0015 61 | PM 009E 62 | SUB 001A 63 |
| 4_ | SP 0020 64 | • +00 65 | • +01 66 | • +02 67 | • +03 68 | • +04 69 | • +05 70 | • +06 71 | • +07 72 | • +08 73 | • +09 74 | . 002E 75 | < 003C 76 | ( 0028 77 | + 002B 78 | \| 007C 79 |
| 5_ | & 0026 80 | • +0A 81 | • +0B 82 | • +0C 83 | • +0D 84 | • +0E 85 | • +0F 86 | • +10 87 | • +11 88 | • +12 89 | ! 0021 90 | $ 0024 91 | * 002A 92 | ) 0029 93 | ; 003B 94 | ^ 005E 95 |
| 6_ | - 002D 96 | / 002F 97 | • +13 98 | • +14 99 | • +15 100 | • +16 101 | • +17 102 | • +18 103 | • +19 104 | • +1A 105 | • +1B 106 | , 002C 107 | % 0025 108 | _ 005F 109 | > 003E 110 | ? 003F 111 |
| 7_ | • +1C 112 | • +1D 113 | • +1E 114 | • +1F 115 | 2 116 | 2 117 | 2 118 | 2 119 | 2 120 | ` 0060 121 | : 003A 122 | # 0023 123 | @ 0040 124 | ' 0027 125 | = 003D 126 | " 0022 127 |
| 8_ | 2 00A0 128 | a 0061 129 | b 0062 130 | c 0063 131 | d 0064 132 | e 0065 133 | f 0066 134 | g 0067 135 | h 0068 136 | i 0069 137 | 2 00C0 138 | 2 00E0 139 | 2 0100 140 | 2 0120 141 | 2 0140 142 | 2 0160 143 |
| 9_ | 2 0180 144 | j 006A 145 | k 006B 146 | l 006C 147 | m 006D 148 | n 006E 149 | o 006F 150 | p 0070 151 | q 0071 152 | r 0072 153 | 2 01A0 154 | 2 01C0 155 | 2 01E0 156 | 2 0200 157 | 2 0220 158 | 2 0240 159 |
| A_ | 2 0260 160 | ~ 007E 161 | s 0073 162 | t 0074 163 | u 0075 164 | v 0076 165 | w 0077 166 | x 0078 167 | y 0079 168 | z 007A 169 | 2 0280 170 | 2 02A0 171 | 2 02C0 172 | [ 005B 173 | 2 02E0 174 | 2 0300 175 |
| B_ | 2 0320 176 | 2 0340 177 | 2 0360 178 | 2 0380 179 | 2 03A0 180 | 2 03C0 181 | 2 03E0 182 | 3 183 | 3 0400 184 | 3 0800 185 | 3 0C00 186 | 3 1000 187 | 3 1400 188 | ] 005D 189 | 3 1800 190 | 3 1C00 191 |
| C_ | { 007B 192 | A 0041 193 | B 0042 194 | C 0043 195 | D 0044 196 | E 0045 197 | F 0046 198 | G 0047 199 | H 0048 200 | I 0049 201 | 3 2000 202 | 3 2400 203 | 3 2800 204 | 3 2C00 205 | 3 3000 206 | 3 3400 207 |
| D_ | } 007D 208 | J 004A 209 | K 004B 210 | L 004C 211 | M 004D 212 | N 004E 213 | O 004F 214 | P 0050 215 | Q 0051 216 | R 0052 217 | 3 3800 218 | 3 3C00 219 | 4 4000 220 | 4 8000 221 | 4 10000 222 | 4 18000 223 |
| E_ | \ 005C 224 | 4 20000 225 | S 0053 226 | T 0054 227 | U 0055 228 | V 0056 229 | W 0057 230 | X 0058 231 | Y 0059 232 | Z 005A 233 | 4 28000 234 | 4 30000 235 | 4 38000 236 | 5 40000 237 | 5 100000 238 | 239 |
| F_ | 0 0030 240 | 1 0031 241 | 2 0032 242 | 3 0033 243 | 4 0034 244 | 5 0035 245 | 6 0036 246 | 7 0037 247 | 8 0038 248 | 9 0039 249 | 250 | 251 | 252 | 253 | 254 | APC 009F 255 |

Outside row F_ (which holds the ordinary digit characters), cells containing a single-digit number followed by a code point are the start bytes for a sequence of that many bytes. The hexadecimal code point shown in the cell is the lowest character value encoded using that start byte. This value can be greater than the value obtained by following the start byte with continuation bytes that are all 65 (hex 0x41), if that would result in an invalid overlong form.
Cells with a dot are continuation bytes. The hexadecimal number shown after the plus sign is the value of the 5 bits they add.
Cells containing a single-digit number but no code point are start bytes (for a sequence of that many bytes) which can never appear in properly encoded UTF-EBCDIC text, because any possible continuation would result in an invalid overlong form. For example, 0x76 is such a byte because even 0x76 0x73 (which maps to the UTF-8-Mod sequence 0xC2 0xBF) would merely be an overlong encoding of U+005F (properly encoded as UTF-8-Mod 0x5F, UTF-EBCDIC 0x6D).
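The overlong rule can be checked with a small decoder that works on the UTF-8-Mod bytes, that is, after the lookup table has been reversed. This is a hedged sketch rather than the report's algorithm: the name `utf8_mod_decode` is hypothetical, and the lowest valid code point per sequence length (00A0, 0400, 4000, 40000) is read off the code chart.

```python
def utf8_mod_decode(seq: bytes) -> int:
    """Decode one UTF-8-Mod sequence to a code point (illustrative sketch)."""
    first = seq[0]
    if first <= 0x9F:  # single-byte range U+0000..U+009F
        if len(seq) != 1:
            raise ValueError("unexpected extra bytes")
        return first
    # (mask, lead-byte prefix, trailing bytes, lowest non-overlong value)
    forms = (
        (0xE0, 0xC0, 1, 0xA0),
        (0xF0, 0xE0, 2, 0x400),
        (0xF8, 0xF0, 3, 0x4000),
        (0xFC, 0xF8, 4, 0x40000),
    )
    for mask, prefix, ntrail, lowest in forms:
        if first & mask == prefix:
            if len(seq) != 1 + ntrail:
                raise ValueError("wrong sequence length")
            cp = first & ~mask & 0xFF  # payload bits of the lead byte
            for b in seq[1:]:
                if b & 0xE0 != 0xA0:  # trailing bytes are 101XXXXX
                    raise ValueError("bad trailing byte")
                cp = (cp << 5) | (b & 0x1F)
            if cp < lowest:
                raise ValueError("overlong form")
            if cp > 0x10FFFF:
                raise ValueError("beyond the Unicode range")
            return cp
    raise ValueError("not a valid lead byte")


print(hex(utf8_mod_decode(bytes([0xC5, 0xA0]))))  # 0xa0
try:
    utf8_mod_decode(bytes([0xC2, 0xBF]))  # would decode to U+005F
except ValueError as err:
    print(err)  # overlong form
```

With these thresholds the UTF-8-Mod lead bytes 0xC0 through 0xC4 and 0xE0 can never start a valid sequence, which matches the five two-byte cells and one three-byte cell shown without a code point in the chart.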
External links
- http://www.unicode.org/reports/tr16/ Unicode Technical Report #16: the definition of UTF-EBCDIC