Universal character set

The Universal Character Set (UCS) is a character encoding defined by the international standard ISO 10646 and shared with the Unicode Standard.

ISO 10646 is an informal citation for the ISO/IEC 10646 family of standards. When citing a particular version of a particular part of the standard, the form ISO/IEC 10646-{part}:{year}, like ISO/IEC 10646-1:1993, is preferred.

Since 1991, the Unicode Consortium has worked with ISO to develop the Unicode Standard and ISO 10646 in tandem. The character encoding portion of Version 2.0 of the Unicode Standard is identical to ISO/IEC 10646-1:1993 plus its first seven published amendments. Unicode 3.0 was published in February 2000, and the relevant portions were later adopted as ISO/IEC 10646-1:2000.

The UCS is kept synchronized character by character with Unicode. It has over a million code points (hexadecimal 0 through 10FFFF), but only the first 65,536 (the Basic Multilingual Plane, or BMP) are commonly used. The remainder are reserved for such purposes as representing ancient Egyptian hieroglyphs or rare Chinese characters. Many code points, even in the BMP, are deliberately not assigned to characters, either to allow for future expansion or to minimize conflicts with other encoding forms.
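The size of the code space can be checked with a little arithmetic. The following is a minimal sketch, assuming the upper limit of hexadecimal 10FFFF and blocks ("planes") of 65,536 code points each, of which the BMP is the first; the names `MAX_CODE_POINT`, `PLANE_SIZE`, and `plane_of` are illustrative and do not come from the standard.

```python
# Upper limit of the UCS code space and the size of one plane.
MAX_CODE_POINT = 0x10FFFF
PLANE_SIZE = 0x10000  # 65,536 code points; plane 0 is the BMP


def plane_of(code_point: int) -> int:
    """Return the plane number of a code point (0 means the BMP)."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("code point outside the UCS range")
    return code_point // PLANE_SIZE


# Code points are numbered from 0, so the count is the limit plus one.
total_code_points = MAX_CODE_POINT + 1              # 1,114,112
total_planes = total_code_points // PLANE_SIZE      # 17
```

Under these assumptions, `plane_of(0x41)` is 0 (the letter A lies in the BMP), while `plane_of(0x10000)` is 1, the first plane beyond it.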

Encodings of the Universal Character Set

ISO 10646 defines several character encoding forms for the Universal Character Set. The simplest is UCS-2, which uses a single code value between 0 and 65535 for each character and allows that value to be represented as exactly two bytes (one 16-bit word). UCS-2 thereby permits a binary representation of every code point in the BMP, as long as the code point represents a character. Code points outside the BMP cannot be represented in UCS-2.
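As an illustration, a UCS-2 encoder can be sketched in a few lines. This is a minimal sketch, not a reference implementation: the function name `encode_ucs2` is hypothetical, big-endian byte order is an assumption made for the example, and the only unassigned range it rejects is D800-DFFF.

```python
import struct


def encode_ucs2(text: str) -> bytes:
    """Encode text as UCS-2: exactly one 16-bit word per character.

    Big-endian byte order is assumed here for illustration.
    """
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if code_point > 0xFFFF:
            # Code points beyond the BMP do not fit in one 16-bit word.
            raise ValueError(f"U+{code_point:04X} is outside the BMP")
        if 0xD800 <= code_point <= 0xDFFF:
            # This range is not assigned to characters.
            raise ValueError(f"U+{code_point:04X} is not a character")
        out += struct.pack(">H", code_point)
    return bytes(out)
```

For example, `encode_ucs2("Aé")` returns `b'\x00A\x00\xe9'`, two bytes per character, while `encode_ucs2("\U00010000")` raises `ValueError` because that code point lies outside the BMP.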

Another encoding is UCS-4, which uses a single code value between 0 and, theoretically, hexadecimal FFFFFFFF for each character (although the UCS stops at 10FFFF) and allows that value to be represented as exactly four bytes (one 32-bit word). UCS-4 thereby permits a binary representation of every code point in the UCS, including those outside the BMP. As in UCS-2, every encoded character has a fixed length in bytes, which makes the encoding simple to manipulate, but it requires twice as much storage as UCS-2.
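The fixed four-byte layout can be sketched as follows. The function name `encode_ucs4` is hypothetical and big-endian byte order is again an assumption for the example; the final lines compare the storage cost with the two bytes per character that UCS-2 would need for the same BMP text.

```python
import struct


def encode_ucs4(text: str) -> bytes:
    """Encode text as UCS-4: exactly one 32-bit word per code point.

    Big-endian byte order is assumed here for illustration.
    """
    out = bytearray()
    for ch in text:
        # ord() yields the code point; ">I" packs it into 4 bytes.
        out += struct.pack(">I", ord(ch))
    return bytes(out)


bmp_text = "UCS"
ucs4_bytes = encode_ucs4(bmp_text)   # 4 bytes per character
ucs2_size = 2 * len(bmp_text)        # 2 bytes per character in UCS-2
```

Here `len(ucs4_bytes)` is 12, twice `ucs2_size`. Unlike UCS-2, the same function also handles code points beyond the BMP: `encode_ucs4("\U00010000")` returns `b'\x00\x01\x00\x00'`.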

The most common encoding, UTF-16, is similar, but not identical, to UCS-2. UCS-2 precludes the use of code values in the range D800-DFFF, since the corresponding code points are not assigned to characters, while UTF-16 uses pairs of code values in that range to represent characters beyond the BMP. For example, the code values D800 DC00 are illegal in UCS-2, while the same sequence in UTF-16 is a surrogate pair representing the character at code point hexadecimal 10000.
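The surrogate pair arithmetic can be sketched as follows. The text above gives only the example D800 DC00; the split into a high half (D800-DBFF) and a low half (DC00-DFFF), each carrying 10 bits of the offset from hexadecimal 10000, is the standard UTF-16 rule and is supplied here. The function names are illustrative.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point beyond the BMP into two UTF-16 code values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP need a pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

With these definitions, `to_surrogate_pair(0x10000)` returns `(0xD800, 0xDC00)`, matching the example above, and `from_surrogate_pair(0xD800, 0xDC00)` returns `0x10000`.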

Occasionally, articles about Unicode mistakenly refer to UCS-2 as "UCS-16". This is incorrect: UCS-2 is the 16-bit character encoding and UCS-4 is the 32-bit character encoding; there is no UCS-16.