Unicode

Unicode is an international standard whose goal is to assign a single unique integer, called a code point, to every character needed by every written human language. It is an explicit aim of Unicode to supersede traditional character encodings, such as those defined by the ISO 8859 standard, which are used in various countries of the world but are largely incompatible with each other.

Unicode Consortium

The California-based Unicode Consortium first published "The Unicode Standard" in 1991, and continues to develop standards based on that original work. Unicode was developed in conjunction with the International Organization for Standardization, and it shares its character repertoire with ISO 10646. Unicode and ISO 10646 are equivalent as character encodings, but The Unicode Standard contains much more information for implementers: it covers topics such as bitwise encoding, collation, and rendering in depth, and enumerates a multitude of character properties, including those needed for bidirectional (BiDi) text support. The two standards also use slightly different terminology, although efforts are being made to reconcile the differences.

Repertoire

Unicode reserves 1114112 (a little more than 2^20) code points, and currently assigns characters to more than 70000 of those code points. The first 256 codes precisely match those of ISO 8859-1, the most popular 8-bit character encoding in the "Western world"; as a result, the first 128 characters are also identical to ASCII.
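The correspondence with ISO 8859-1 and ASCII can be checked directly. The following is a minimal sketch using Python's built-in "latin-1" and "ascii" codecs; the variable names are illustrative only.

```python
# Size of the code space: code points run from 0 to 0x10FFFF inclusive.
TOTAL_CODE_POINTS = 0x10FFFF + 1  # 1114112, a little more than 2**20

# Each byte 0..255, decoded as ISO 8859-1, should yield the code point
# with the same numeric value.
latin1_matches = all(
    ord(bytes([b]).decode("latin-1")) == b for b in range(256)
)

# Each byte 0..127, decoded as ASCII, should do the same.
ascii_matches = all(
    ord(bytes([b]).decode("ascii")) == b for b in range(128)
)

print(TOTAL_CODE_POINTS, latin1_matches, ascii_matches)
```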

The Unicode code space for characters is divided into 17 "planes", each of which has 65536 code points. The first plane (plane 0), the Basic Multilingual Plane (BMP), is where most characters have been assigned so far. The BMP contains characters for almost all modern languages, and a large number of special characters. Most of the allocated code points in the BMP are used to encode CJK characters.

Two more planes are used for "graphic" characters. Plane 1, the Supplementary Multilingual Plane (SMP), is mostly used for historic scripts, e.g. Egyptian hieroglyphs (not yet encoded), but is also used for music symbols. Plane 2, the Supplementary Ideographic Plane (SIP), is used for about 40000 rare Chinese characters that are mostly historic, although there are some modern ones.
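Because each plane holds 65536 (2^16) code points, the plane of a character is simply its code point divided by 65536. A small sketch follows; the helper name plane_of and the sample characters are chosen here for illustration.

```python
# Names of the three planes discussed above.
PLANE_NAMES = {0: "BMP", 1: "SMP", 2: "SIP"}


def plane_of(char: str) -> int:
    """Return the plane number (0..16) of a single character."""
    return ord(char) >> 16  # same as ord(char) // 65536


samples = [
    "A",           # U+0041, Latin letter
    "\u8449",      # U+8449, a CJK ideograph in the BMP
    "\U0001D11E",  # U+1D11E, musical symbol G clef
    "\U00020000",  # U+20000, first code point of plane 2
]

for ch in samples:
    plane = plane_of(ch)
    print(f"U+{ord(ch):04X} is in plane {plane} ({PLANE_NAMES[plane]})")
```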

There is much controversy among Chinese language specialists about the desirability and technical merit of the "Han unification" process used to map multiple Chinese and Japanese character sets into a single set of unified glyphs.

The cap of ~2^20 code points exists in order to maintain compatibility with the UTF-16 encoding, which can only address that range (see below). The 10% utilisation of the Unicode code space suggests that this ~20-bit limit is unlikely to be reached in the near future.
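The exact figure follows from how UTF-16 works: a 16-bit unit addresses the BMP directly, and a pair made of one of 1024 "high" and one of 1024 "low" surrogate values addresses everything else. A sketch of the arithmetic (the constant names are illustrative):

```python
# Surrogate ranges reserved inside the BMP for UTF-16 pairs.
HIGH_SURROGATES = range(0xD800, 0xDC00)  # 1024 values
LOW_SURROGATES = range(0xDC00, 0xE000)   # 1024 values

# Every (high, low) combination addresses one supplementary code point.
supplementary = len(HIGH_SURROGATES) * len(LOW_SURROGATES)  # 2**20

# Add the 65536 code points of the BMP itself.
addressable = 0x10000 + supplementary

print(supplementary, addressable)
```

The total, 1114112, equals 17 planes of 65536 code points each.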

Encodings

So far, this article has only said that Unicode is a means of assigning a unique number to each possible character used by humans. How these numbers are stored in text processing is another matter; problems result from the fact that most of the world's software has so far been written to deal with 8-bit character encodings only, and Unicode support has only been added slowly in recent years.

The internal logic of most legacy software would typically use 8 bits for each character, making it impossible to use more than 256 code points without special processing. Several mechanisms have therefore been suggested to implement Unicode; which one is chosen depends on available storage space, source code compatibility, and interoperability with other systems.

  1. The simplest way to store all of the roughly 2^20 Unicode code points is to use 32 bits, that is, four bytes, for each character; this encoding is referred to as UCS-4. The main problem with this method is that it uses four times the space of traditional encodings, which is why it is rarely used for external storage. However, due to its simplicity, many programs use a 32-bit encoding internally when processing Unicode. UTF-32 is another name for this encoding: UCS-4 implies the ISO 10646 standard, while UTF-32 implies the Unicode Consortium standard, but the two differ only on a few minor points.
  2. The oldest of the Unicode encodings is ISO 10646's UCS-2, which uses two bytes (16 bits) for each character. It therefore supports only the first 65,536 code points, the BMP (see above). Code points from all planes, including the BMP, can be represented using the UTF-16 encoding, which requires one 16-bit word for characters in the BMP, and a pair of 16-bit words for characters in higher planes. Characters in UTF-16 are therefore of variable length: they are represented by either two or four bytes. A range of unassigned code points in the BMP is reserved for making these "surrogate pairs". This can make internal processing a bit more complex.
  3. Another common encoding is UTF-8, which is built from 8-bit (one-byte) units. Here, the first 128 code points are encoded identically to ASCII; bytes with a value higher than 127 are part of multi-byte sequences. To cover the full Unicode range, a character occupies up to four bytes (the original design allowed sequences of up to six bytes to cover the larger 31-bit ISO 10646 code space). UTF-8 is therefore also a variable-length encoding, but it has several advantages, especially when adding Unicode support to existing software. First, no changes are required for text that is ASCII only. Second, most functions from the standard library of the C programming language that have traditionally been used for character processing (such as strcmp for comparisons and trivial sorting) still work, because they operate on 8-bit values. (By contrast, to support the 16- or 32-bit encodings mentioned above, large parts of older software would have to be rewritten.) Third, for texts that use relatively few non-ASCII characters (that is, texts in most Western languages), the encoding is very space-efficient because it requires only slightly more than 8 bits per character on average.
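The trade-offs between the three encodings can be seen by encoding a few sample characters. The sketch below uses Python's built-in codecs (the big-endian variants, so that no byte order mark is added); the helper names encoded_sizes and to_surrogate_pair are illustrative and not part of any standard.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes used by each encoding for the text."""
    codecs = ("utf-32-be", "utf-16-be", "utf-8")
    return {name: len(text.encode(name)) for name in codecs}


def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above the BMP into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a pair")
    offset = code_point - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


for ch in ("A", "\u0394", "\u8449", "\U0001D11E"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

high, low = to_surrogate_pair(0x1D11E)
print(f"U+1D11E -> {high:04X} {low:04X}")

# Byte-wise comparison of UTF-8 (what strcmp does) preserves code point order.
words = ["zebra", "\u0394", "\u8449", "\U0001D11E", "apple"]
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
by_code_point = sorted(words)
```

Only the ASCII letter fits in one UTF-8 byte, while the character outside the BMP needs four bytes in all three encodings.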

The Unicode standard also includes a number of related items, such as character properties, text normalisation forms, and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts).
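Many programming languages expose part of this data. As a brief illustration, Python's standard unicodedata module can be queried for a character's general category and bidirectional class, and can apply the normalisation forms; the sample characters below are chosen only for the example.

```python
import unicodedata

# General category: "Lu" means "Letter, uppercase".
category_of_a = unicodedata.category("A")

# Bidirectional class: "L" is left-to-right, "R" is right-to-left,
# and "AL" is right-to-left Arabic.
bidi_latin = unicodedata.bidirectional("A")
bidi_hebrew = unicodedata.bidirectional("\u05E7")  # Hebrew letter Qof
bidi_arabic = unicodedata.bidirectional("\u0645")  # Arabic letter Meem

# Normalisation: "e" followed by a combining acute accent composes to a
# single precomposed character under NFC, and decomposes again under NFD.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
round_trip = unicodedata.normalize("NFD", composed)

print(category_of_a, bidi_latin, bidi_hebrew, bidi_arabic)
print(len(decomposed), len(composed))
```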

Unicode on the web

Recent web browsers display web pages using Unicode if an appropriate font is installed.

Although syntax rules may affect the order in which characters are allowed to appear, both HTML 4.0 and XML 1.0 documents are, by definition, composed of characters from the entire range of Unicode code points, minus only a handful of disallowed control characters and the permanently unassigned code points D800-DFFF and FFFE-FFFF. These characters appear either directly as bytes according to the document's encoding, if the encoding supports them, or as numeric character references based on the character's Unicode code point, as long as the document's encoding supports the digits and symbols required to write the references (all encodings approved for use on the Internet do). For example, the references &#916; &#1049; &#1511; &#1605; &#3671; &#12353; &#21494; &#33865; &#45307; (or the same numeric values expressed in hexadecimal, with &#x as the prefix) display in a browser as Δ, Й, ק, م, ๗, ぁ, 叶, 葉 and 냻. With the proper fonts installed, these symbols look like the Greek capital letter "Delta", the Cyrillic capital letter "Short I", the Hebrew letter "Qof", the Arabic letter "Meem", the Thai numeral 7, the Japanese Hiragana "A", the simplified Chinese "Leaf", the traditional Chinese "Leaf", and a Korean Han-geul syllable "Nyrh", respectively.
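A numeric character reference is just the character's code point written in decimal (after &#) or hexadecimal (after &#x), followed by a semicolon. A minimal sketch of producing and decoding such references follows; the two helper functions are hypothetical, while html.unescape is part of Python's standard library.

```python
import html


def to_decimal_reference(char: str) -> str:
    """Write a character as a decimal numeric character reference."""
    return f"&#{ord(char)};"


def to_hex_reference(char: str) -> str:
    """Write a character as a hexadecimal numeric character reference."""
    return f"&#x{ord(char):X};"


delta = "\u0394"  # Greek capital letter Delta
print(to_decimal_reference(delta))  # &#916;
print(to_hex_reference(delta))      # &#x394;

# Decoding either form gives back the original character.
decoded_decimal = html.unescape("&#916;")
decoded_hex = html.unescape("&#x394;")
```

Because the references themselves consist only of ASCII characters, they survive in a document whose encoding cannot represent the character directly.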

Unicode fonts

Free software fonts exist for most, and possibly all, of the characters in the BMP. They may be downloaded freely from the Internet.

Retail fonts that use the Unicode encoding are increasingly common, since first TrueType and now OpenType use Unicode.

The fact that a font uses a Unicode encoding says nothing about how much of Unicode the font supports. There are thousands of Unicode-encoded fonts on the market, but probably fewer than half a dozen that attempt to support most of Unicode. Most fonts focus on a particular script.



All Wikipedia text is available under the terms of the GNU Free Documentation License

 