Thursday, July 10, 2008

Unicode

Frequently, I use UTF-8, GB2312, ASCII... I understand ASCII well because it is simple. For Unicode, however, I have long been confused by the various terms (e.g. UTF-8, UTF-16, UTF-32, UCS-2, UCS-4...). I have now done some research and summarize here what I have learned.

Resources:
Official site: http://www.unicode.org/
Unicode Standard (version 5): http://www.unicode.org/versions/Unicode5.0.0/bookmarks.html
Unicode FAQ: http://unicode.org/faq/
FAQ of various encoding forms: http://unicode.org/faq/utf_bom.html
Online tools:
An online Unicode conversion tool: http://rishida.net/scripts/uniview/conversion.php (Displays various encoding representations of what you input.)
Unihan database: http://www.unicode.org/charts/unihan.html (You can query by Unicode code point.)
Another good tool: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%26%2320320%3B&mode=char

Windows XP provides a useful tool called "Character Map": Start -> All Programs -> Accessories -> System Tools -> Character Map.
In Linux/Unix, iconv is a powerful tool for converting between different character encodings. (It seems the BOM described below is not handled by iconv, so you should not include a BOM in the source file.)
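
For example, to convert a GB2312 file to UTF-8 (the file names here are hypothetical):

    iconv -f GB2312 -t UTF-8 input.txt > output.txt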

Two entities:
Unicode Consortium
ISO/IEC JTC1/SC2/WG2
The good news is that these two bodies are well synchronized, and each standard from one is aligned with the corresponding standard from the other.
For example, Unicode 5.0 is synchronized with Amendment 2 to ISO/IEC 10646:2003, plus the Sindhi additions.

Encoding forms
The purpose of an encoding form is to map every Unicode code point to a unique byte sequence. The arrangement of Unicode code points is fixed; only the encoding forms on top of it vary.
ISO/IEC 10646 defines four encoding forms for the universal character set: UCS-4, UCS-2, UTF-8 and UTF-16.
Code unit: the encoding of every character consists of an integral number of code units. For example, the UTF-16 code unit is 16 bits, so an encoded character is either 16 or 32 bits long. (A concrete sketch follows this list.)
(1) UCS-4/UTF-32
Currently the two are almost identical. Each code point takes exactly 32 bits, so this is a fixed-length encoding.
"This single 4-byte-long code unit corresponds to the Unicode scalar value, which is the abstract number associated with a Unicode character. "
(2) UCS-2/UTF-16
The code unit is 16 bits. Commonly used characters can usually be encoded in one code unit (16 bits).
From wikipedia: "UTF-16: For characters in the Basic Multilingual Plane (BMP) the resulting encoding is a single 16-bit word. For characters in the other planes, the encoding will result in a pair of 16-bit words, together called a surrogate pair. All possible code points from U+0000 through U+10FFFF, except for the surrogate code points U+D800–U+DFFF (which are not characters), are uniquely mapped by UTF-16 regardless of the code point's current or future character assignment or use."
From wikipedia: "UCS-2 (2-byte Universal Character Set) is an obsolete character encoding which is a predecessor to UTF-16. The UCS-2 encoding form is nearly identical to that of UTF-16, except that it does not support surrogate pairs and therefore can only encode characters in the BMP range U+0000 through U+FFFF. As a consequence it is a fixed-length encoding that always encodes characters into a single 16-bit value."
In short, UCS-2 is a fixed-length encoding that covers only the BMP range U+0000 through U+FFFF, while UTF-16 is a variable-length encoding that reaches characters in the other planes via surrogate pairs.
Note: the two values 0xFFFE and 0xFFFF, as well as the 32 values from 0xFDD0 to 0xFDEF, represent noncharacters. They are invalid in interchange, but may be freely used internally in an implementation. Unpaired surrogates are invalid as well, i.e. any value in the range 0xD800 to 0xDBFF not followed by a value in the range 0xDC00 to 0xDFFF.
(3) UTF-8
The code unit is 8 bits. Each code point is encoded in one to four octets.
From wikipedia: "It is able to represent any character in the Unicode standard, yet the initial encoding of byte codes and character assignments for UTF-8 is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages"
Note: initial encoding in UTF-8 is NOT compatible with latin1(ISO-8859).
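
To make the three encoding forms concrete, here is a minimal sketch in Python 3 (my own illustration; the two sample characters are arbitrary choices) that encodes one BMP character and one supplementary-plane character in each form:

    # U+4E2D lies in the BMP; U+20000 lies in a supplementary plane (CJK Ext. B).
    for ch in ("\u4e2d", "\U00020000"):
        print("U+%04X" % ord(ch),
              ch.encode("utf-8").hex(),      # one to four 8-bit code units
              ch.encode("utf-16-be").hex(),  # one 16-bit code unit, or a surrogate pair
              ch.encode("utf-32-be").hex())  # always one 32-bit code unit
    # Expected output:
    # U+4E2D e4b8ad 4e2d 00004e2d
    # U+20000 f0a08080 d840dc00 00020000

Note how U+20000 becomes the surrogate pair D840 DC00 in the UTF-16 column, while U+4E2D fits in a single code unit.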

Byte Order
For UTF-16 and UTF-32, the code unit is more than one byte, so a natural problem arises: byte order, i.e. big-endian (most significant byte first) or little-endian (least significant byte first). For UTF-8 this problem does not exist, because the code unit is a single byte. To address it, a Byte Order Mark (BOM) can be used. The BOM actually indicates not only the byte order but also the encoding form.
BOM table: From http://unicode.org/faq/utf_bom.html:

Bytes        Encoding Form
00 00 FE FF  UTF-32, big-endian
FF FE 00 00  UTF-32, little-endian
FE FF        UTF-16, big-endian
FF FE        UTF-16, little-endian
EF BB BF     UTF-8
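
As a sketch of how this table is applied, here is a hypothetical Python helper (sniff_bom is my own name, not something from the FAQ). The UTF-32 patterns must be tested before the UTF-16 ones, because FF FE is a prefix of FF FE 00 00:

    import codecs

    # Longest BOMs first: FF FE (UTF-16 LE) is a prefix of FF FE 00 00 (UTF-32 LE).
    _BOMS = [
        (codecs.BOM_UTF32_BE, "UTF-32, big-endian"),
        (codecs.BOM_UTF32_LE, "UTF-32, little-endian"),
        (codecs.BOM_UTF8,     "UTF-8"),
        (codecs.BOM_UTF16_BE, "UTF-16, big-endian"),
        (codecs.BOM_UTF16_LE, "UTF-16, little-endian"),
    ]

    def sniff_bom(data):
        """Return (encoding form, BOM length), or (None, 0) if no BOM is found."""
        for bom, name in _BOMS:
            if data.startswith(bom):
                return name, len(bom)
        return None, 0  # no BOM: see the guidelines below

For example, sniff_bom(open("some.txt", "rb").read(4)) inspects the first four bytes of a file.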

Here are some guidelines to follow(From http://unicode.org/faq/utf_bom.html):

  1. A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.
  2. Some protocols allow optional BOMs in the case of untagged text. In those cases,
    • Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.
    • Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.
  3. Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided. For example, a shell script must begin with the ASCII characters #!/bin/sh.
  4. Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.
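
As an illustration of guideline 4 in Python (again my own example): the byte-order-neutral "utf-16" codec writes a BOM in native byte order, while the explicitly tagged BE/LE codecs never do:

    print("A".encode("utf-16"))     # BOM plus native order, e.g.
                                    # b'\xff\xfeA\x00' on a little-endian machine
    print("A".encode("utf-16-be"))  # b'\x00A' -- no BOM
    print("A".encode("utf-16-le"))  # b'A\x00' -- no BOM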

Summary:

From http://unicode.org/faq/utf_bom.html:

Name                     UTF-8    UTF-16   UTF-16BE    UTF-16LE       UTF-32   UTF-32BE    UTF-32LE
Smallest code point      0000     0000     0000        0000           0000     0000        0000
Largest code point       10FFFF   10FFFF   10FFFF      10FFFF         10FFFF   10FFFF      10FFFF
Code unit size           8 bits   16 bits  16 bits     16 bits        32 bits  32 bits     32 bits
Byte order               N/A      <BOM>    big-endian  little-endian  <BOM>    big-endian  little-endian
Minimal bytes/character  1        2        2           2              4        4           4
Maximal bytes/character  4        4        4           4              4        4           4

Note: all forms encode exactly the same set of valid code points, from U+0000 through U+10FFFF.

More:
None of the UTFs can generate every arbitrary byte sequence. In other words, not every 4-byte sequence represents a legal UCS-4/UTF-32 encoding.
From the Unicode Consortium site: "Each UTF is reversible, thus every UTF supports lossless round tripping: mapping from any Unicode coded character sequence S to a sequence of bytes and back will produce S again. To ensure round tripping, a UTF mapping must also map all code points that are not valid Unicode characters to unique byte sequences. These invalid code points are the 66 noncharacters (including FFFE and FFFF), as well as unpaired surrogates."
This means that to guarantee reversibility, not only valid characters but also invalid code points must be handled and encoded appropriately.
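
A quick illustration of these points in Python (the byte values are my own examples): an arbitrary byte sequence need not be well-formed UTF-8, a noncharacter such as U+FFFE still round-trips losslessly, and an unpaired surrogate cannot be encoded at all:

    # 0xC0 0xAF would be an "overlong" sequence; no correct encoder produces it.
    try:
        b"\xc0\xaf".decode("utf-8")
    except UnicodeDecodeError as e:
        print(e)  # invalid start byte

    # The noncharacter U+FFFE is a valid code point and round-trips losslessly.
    s = "\u4e2d\ufffe"
    assert s.encode("utf-8").decode("utf-8") == s

    # Well-formed UTF-8 excludes surrogate code points, so Python's strict
    # codec refuses an unpaired surrogate.
    try:
        "\ud800".encode("utf-8")
    except UnicodeEncodeError as e:
        print(e)  # surrogates not allowed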

How to fit a Unicode character into an ASCII stream?
See http://unicode.org/faq/utf_bom.html#31. Several methods are used in practice: (1) UTF-8; (2) '\uXXXX' escapes in C or Java; (3) numeric character references such as "&#20320;" in HTML or XML.
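
For method (3), Python's standard "xmlcharrefreplace" error handler produces exactly such numeric references (my illustration; 0x4F60 is 20320 in decimal):

    ch = "\u4f60"  # U+4F60, written with the \uXXXX escape of method (2)
    print(ch.encode("ascii", "xmlcharrefreplace"))  # b'&#20320;' -- method (3)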