Shift JIS

From Wikipedia, the free encyclopedia

The correct title of this article is Shift_JIS. The substitution or omission of an _ is due to technical restrictions.

Shift_JIS (SJIS) is a character encoding for the Japanese language, originally developed by the Japanese company ASCII Corporation in conjunction with Microsoft and standardized as JIS X 0208 Appendix 1. It is based on character sets defined within JIS standards JIS X 0201:1997 (for the single-byte characters) and JIS X 0208:1997 (for the double-byte characters). The lead bytes for the double-byte characters are "shifted" around the 63 half-width katakana characters in the single-byte range 0xA1 to 0xDF. The single-byte characters 0x00 to 0x7F match the ASCII encoding, except for a yen sign at 0x5C and an overline at 0x7E in place of the ASCII character set's backslash and tilde respectively. Despite this, 0x5C is still treated as the escape character in JavaScript on the web. The single-byte characters from 0xA1 to 0xDF map to the half-width katakana characters found in JIS X 0201.
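The single-byte layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `decode_single_byte` is hypothetical, the Unicode targets (U+00A5, U+203E, and the half-width katakana block starting at U+FF61) are assumed mappings not given in the text, and the sketch follows the strict JIS X 0201 reading in which 0x5C is a yen sign rather than a backslash.

```python
def decode_single_byte(b: int) -> str:
    """Decode one single-byte Shift_JIS code (JIS X 0201 layout) to a Unicode character.

    Hypothetical helper for illustration; Unicode targets are assumptions.
    """
    if b == 0x5C:
        return "\u00a5"  # yen sign sits where ASCII has the backslash
    if b == 0x7E:
        return "\u203e"  # overline sits where ASCII has the tilde
    if 0x00 <= b <= 0x7F:
        return chr(b)    # all other codes in this range match ASCII
    if 0xA1 <= b <= 0xDF:
        # 63 half-width katakana codes, assumed to map onto U+FF61..U+FF9F
        return chr(0xFF61 + (b - 0xA1))
    raise ValueError(f"0x{b:02X} is not a single-byte character")
```

For example, `decode_single_byte(0xB1)` yields the half-width katakana letter A (U+FF71), while a byte such as 0x81 is rejected because it can only start a double-byte character.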

Shift_JIS requires an 8-bit clean medium for transmission. It is fully backwards compatible with the legacy JIS X 0201 single-byte encoding: it supports half-width katakana, and any valid JIS X 0201 string is also a valid Shift_JIS string. However, Shift_JIS only guarantees that the first byte of a double-byte character has its high bit set (0x80 or above); the second byte can be either high or low. This makes reliable Shift_JIS detection difficult. By contrast, the competing 8-bit format EUC-JP, which does not support single-byte half-width katakana, allows for a much cleaner and more direct conversion to and from JIS X 0208 code points, as all bytes with the high bit set are part of a multi-byte character and all bytes below 0x80 are single-byte characters.
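To illustrate why the second byte matters, here is a minimal Python sketch that splits a byte string into characters. It assumes lead bytes lie in 0x81 to 0x9F and 0xE0 to 0xEF (the ranges used for JIS X 0208; vendor extensions use more), and the names `is_lead_byte` and `split_characters` are hypothetical. It does not validate the second byte.

```python
from typing import List


def is_lead_byte(b: int) -> bool:
    """True if b starts a double-byte character (assumed JIS X 0208 ranges only)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF


def split_characters(data: bytes) -> List[bytes]:
    """Split Shift_JIS bytes into one- or two-byte characters, scanning from the start."""
    chars = []
    i = 0
    while i < len(data):
        if is_lead_byte(data[i]):
            if i + 1 >= len(data):
                raise ValueError("truncated double-byte character")
            chars.append(data[i:i + 2])  # second byte may be below 0x80
            i += 2
        else:
            chars.append(data[i:i + 1])
            i += 1
    return chars
```

In `split_characters(b"A\xb1\x83\x5c")` the final 0x5C is the second byte of a double-byte character, not a yen sign or backslash. A scanner that looks at bytes in isolation, or starts in the middle of the stream, cannot tell the difference.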

For a double-byte JIS sequence j_1 j_2, the transformation to the corresponding Shift_JIS bytes s_1 s_2 is:

s_1 = \begin{cases} \left\lfloor \frac{j_1 + 1}{2} \right\rfloor + 112 & \text{if } 33 \le j_1 \le 94 \\ \left\lfloor \frac{j_1 + 1}{2} \right\rfloor + 176 & \text{if } 95 \le j_1 \le 126 \end{cases}

s_2 = \begin{cases} j_2 + 31 + \begin{cases} 1 & \text{if } j_2 \ge 96 \\ 0 & \text{otherwise} \end{cases} & \text{if } j_1 \text{ is odd} \\ j_2 + 126 & \text{if } j_1 \text{ is even} \end{cases}
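The transformation can be written out directly in code. The following Python sketch is a literal transcription of the formulas; the function name `jis_to_sjis` is hypothetical, and the input check assumes that both JIS bytes lie in the range 33 to 126 (0x21 to 0x7E).

```python
from typing import Tuple


def jis_to_sjis(j1: int, j2: int) -> Tuple[int, int]:
    """Convert a double-byte JIS X 0208 code (j1, j2) to Shift_JIS bytes (s1, s2)."""
    if not (33 <= j1 <= 126 and 33 <= j2 <= 126):
        raise ValueError("JIS bytes are assumed to lie in 33..126")

    # First byte: two JIS rows share one lead byte, offset by range.
    if j1 <= 94:
        s1 = (j1 + 1) // 2 + 112
    else:
        s1 = (j1 + 1) // 2 + 176

    # Second byte: odd rows take the low half and skip 0x7F; even rows take the high half.
    if j1 % 2 == 1:
        s2 = j2 + 31 + (1 if j2 >= 96 else 0)
    else:
        s2 = j2 + 126
    return s1, s2
```

For instance, the JIS code 0x25 0x3D converts to 0x83 0x5C, a double-byte character whose second byte coincides with the single-byte code 0x5C.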

Many different versions of Shift_JIS exist. There are two areas for expansion. Firstly, JIS X 0208 does not fill the whole 94×94 space encoded for it in Shift_JIS, so there is room for more characters there; these are really extensions to JIS X 0208 rather than to Shift_JIS itself. The most popular such extension is Windows-31J (otherwise known as Code page 932), popularized by Microsoft. Secondly, Shift_JIS has more encoding space than is needed for JIS X 0201 and JIS X 0208, and this space can be, and is, used for yet more characters. For example, the space with lead bytes 0xF5 to 0xF9 is used by Japanese mobile phone operators for pictographs for use in e-mail (KDDI goes further and defines hundreds more in the space with lead bytes 0xF3 and 0xF4).

Beyond even this there have been numerous minor variations made on Shift_JIS, with individual characters here and there altered. Most of these extensions and variants have no IANA registration, so there is much scope for confusion if the extensions are used. Microsoft Code Page 932 is registered separately from Shift_JIS.

IBM 943 has the same extensions as Code Page 932.

Shift_JIS byte map

The chart below gives the detailed meaning of each byte in a Shift_JIS encoded stream.

First byte
  0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SP ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ ¥ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ‾ DEL

0x00 to 0x1F, 0x7F: Non-printable ASCII character
0x20 to 0x5B, 0x5D to 0x7D: Unaltered ASCII character
0x5C, 0x7E: Modified ASCII character (yen sign and overline)
0xA1 to 0xDF: Single-byte half-width katakana
0x81 to 0x9F, 0xE0 to 0xEF: First byte of a double-byte JIS X 0208 character
0x80, 0xA0, 0xF0 to 0xFF: Unused as first byte of a JIS X 0208 character

Second byte
0x40 to 0x7E, 0x80 to 0x9E: Second byte of a double-byte JIS X 0208 character whose first half of the JIS sequence was odd
0x9F to 0xFC: Second byte of a double-byte JIS X 0208 character whose first half of the JIS sequence was even
0x00 to 0x3F, 0x7F, 0xFD to 0xFF: Unused as second byte of a JIS X 0208 character
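
Because the second byte records whether the first half of the JIS sequence was odd or even, a Shift_JIS byte pair can be converted back to its JIS code. The Python sketch below inverts the transformation given earlier. The function name `sjis_to_jis` is hypothetical, and the byte ranges it checks are the ones that follow from those formulas for JIS X 0208 only; extension areas are rejected.

```python
from typing import Tuple


def sjis_to_jis(s1: int, s2: int) -> Tuple[int, int]:
    """Convert Shift_JIS bytes (s1, s2) back to a double-byte JIS X 0208 code (j1, j2)."""
    # Undo the first-byte offset to recover floor((j1 + 1) / 2).
    if 0x81 <= s1 <= 0x9F:
        half = s1 - 112
    elif 0xE0 <= s1 <= 0xEF:
        half = s1 - 176
    else:
        raise ValueError("not a JIS X 0208 lead byte")

    if 0x9F <= s2 <= 0xFC:
        # High half of the second-byte range: j1 was even.
        j1 = 2 * half
        j2 = s2 - 126
    elif 0x40 <= s2 <= 0x7E:
        # Low half, below the skipped 0x7F: j1 was odd and j2 < 96.
        j1 = 2 * half - 1
        j2 = s2 - 31
    elif 0x80 <= s2 <= 0x9E:
        # Low half, above the skipped 0x7F: j1 was odd and j2 >= 96.
        j1 = 2 * half - 1
        j2 = s2 - 32
    else:
        raise ValueError("not a valid second byte")
    return j1, j2
```

For example, `sjis_to_jis(0x88, 0x9F)` gives the JIS code 0x30 0x21, and a second byte of 0x7F is rejected as unused.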

See also

External links