Code page 930
Code page 930 (abbreviated as CP930, also known as Japanese EBCDIC) is a code page created by IBM for representing Japanese text. It is a superset of EBCDIC. It is commonly used on IBM OS/390 and IBM AS/400 systems.
It encodes halfwidth Katakana, fullwidth Katakana, Hiragana, and Kanji.
Technical detail
CP930 uses 1 byte to encode halfwidth Katakana and 2 bytes to encode all other Japanese characters. If only halfwidth Katakana mixed with Latin characters is used, which was the standard until the 1980s, CP930 can be considered a pure 8-bit encoding. Otherwise it is a mixed single-byte/double-byte encoding in which a Shift-Out byte (0x0E) marks the start of a double-byte run and a Shift-In byte (0x0F) marks its end. Thus a 4-character Kanji name is commonly encoded as 10 bytes: 8 bytes for the characters plus the 2 shift bytes.
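The byte count above can be checked with a short sketch. It does not perform a real CP930 conversion (Python's standard library ships no `cp930` codec); instead it assumes, as an approximation, that any character Unicode classifies as East Asian Wide or Fullwidth takes 2 bytes and everything else takes 1. The names `is_double_byte` and `mixed_length` are illustrative only.

```python
import unicodedata


def is_double_byte(ch):
    """Assumption: Wide/Fullwidth characters are the double-byte ones."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def mixed_length(text):
    """Estimate the byte length of text in a shift-code mixed encoding."""
    length = 0
    in_double = False  # a field starts in single-byte mode
    for ch in text:
        double = is_double_byte(ch)
        if double != in_double:
            length += 1  # one shift byte at every mode change
            in_double = double
        length += 2 if double else 1
    if in_double:
        length += 1  # closing shift byte to return to single-byte mode
    return length


print(mixed_length("山田太郎"))  # 4 Kanji: 1 + 4*2 + 1 = 10
```

Halfwidth Katakana such as "ﾔﾏﾀﾞ" is counted at one byte per character with no shift bytes, which matches the pure 8-bit case described above.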
Practical considerations
CP930 itself and the ways it is commonly used contain a number of idiosyncrasies that make working with it hard in practice (see also EBCDIC for idiosyncrasies of the EBCDIC standard):
- Because of the Shift-Out and Shift-In codes, parsing a byte sequence from the middle is hard: without having seen the preceding shift byte, a parser cannot tell whether it is in single-byte or double-byte mode.
- On the positive side, the Shift-Out (0x0E) and Shift-In (0x0F) bytes are a useful hint for spotting CP930 or a similar mixed EBCDIC code page, even when the data has been run through an incorrect code page conversion resulting in mojibake.
- Although CP930 allows text that mixes halfwidth and fullwidth characters, many database schemas strictly distinguish between columns containing only single-byte halfwidth Katakana and columns containing only double-byte fullwidth characters. This is a convenience for software developers: it makes it easier to predict the text length that fits a given column size in bytes, and vice versa.
- On the downside, this means that, for consistency, Latin text in such a fullwidth column has to be entered as, or converted into, fullwidth alphabetic characters so that it is encoded as double-byte characters. This complicates database searches, since a search term typed in ordinary Latin letters will not match.
- When database columns are implicitly defined as pure fullwidth text, the Shift-Out and Shift-In codes are often omitted, which strictly speaking results in incorrect encoding. Code page converters may or may not tolerate the missing shift codes.
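Several of the points above can be illustrated with a minimal sketch that works on raw bytes only and does no real CP930 conversion. It assumes that the shift bytes never occur inside a double-byte character and that a stream starts in single-byte mode. The helper names `split_runs`, `wrap_dbcs` and `to_fullwidth` are hypothetical, and the sample bytes in the usage lines are arbitrary placeholders, not verified CP930 code points.

```python
SO = 0x0E  # Shift-Out: start of a double-byte run
SI = 0x0F  # Shift-In: end of a double-byte run


def split_runs(data):
    """Split a mixed stream into ("sbcs", bytes) and ("dbcs", bytes) runs.

    Parsing must begin at a known single-byte position; starting in the
    middle of a stream leaves the mode unknown.
    """
    runs = []
    mode = "sbcs"
    buf = bytearray()
    for b in data:
        if (mode == "sbcs" and b == SO) or (mode == "dbcs" and b == SI):
            if buf:
                runs.append((mode, bytes(buf)))
                buf = bytearray()
            mode = "dbcs" if b == SO else "sbcs"
        else:
            buf.append(b)
    if buf:
        runs.append((mode, bytes(buf)))
    return runs


def wrap_dbcs(field):
    """Add shift bytes to a pure double-byte field that omits them."""
    if not field:
        return field
    if field[0] == SO and field[-1] == SI:
        return field  # already bracketed
    if len(field) % 2:
        raise ValueError("double-byte field must have an even length")
    return bytes([SO]) + field + bytes([SI])


def to_fullwidth(text):
    """Map printable ASCII to the fullwidth Unicode forms (U+FF01..U+FF5E)."""
    out = []
    for ch in text:
        if ch == " ":
            out.append("\u3000")  # ideographic space
        elif "!" <= ch <= "~":
            out.append(chr(ord(ch) + 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


sample = bytes([0xC1, 0xC2, SO, 0x45, 0x66, 0x45, 0x67, SI, 0xC3])
print(split_runs(sample))
print(wrap_dbcs(b"\x45\x66").hex())
print(to_fullwidth("IBM 930"))
```

A converter that is strict about shift codes could be fed the output of `wrap_dbcs`, while `to_fullwidth` shows the kind of normalisation a search term would need before it can match text stored in a fullwidth-only column.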