UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding that represents Unicode text as a stream of bytes.
Description
UTF-8 is standardized as RFC 3629 (UTF-8, a transformation format of ISO 10646), which obsoletes the earlier RFC 2279 and is quite extensive and detailed. A short summary follows for readers who want only a general overview.
Characters with code points below 128 are encoded with a single byte containing their value: these correspond exactly to the 128 7-bit ASCII characters. All other characters require several bytes. Each of these bytes has its top bit set, so its value is always 128 or greater and can never be mistaken for a 7-bit ASCII character (in particular the control characters, e.g. carriage return). The bits of the character's code point are split into groups, which are then distributed among the lower bit positions of these bytes.
Code range (hexadecimal) | UTF-16 | UTF-8 (binary) | Notes
U+0000 - U+007F | 00000000 0xxxxxxx | 0xxxxxxx | ASCII equivalence range; byte begins with 0
U+0080 - U+07FF | 00000xxx xxxxxxxx | 110xxxxx 10xxxxxx | first byte begins with 110, the following byte begins with 10
U+0800 - U+FFFF | xxxxxxxx xxxxxxxx | 1110xxxx 10xxxxxx 10xxxxxx | first byte begins with 1110, the two following bytes begin with 10
U+10000 - U+10FFFF | 110110xx xxxxxxxx 110111xx xxxxxxxx | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | UTF-16 requires surrogate pairs; an offset of 0x10000 is subtracted, so the bit pattern is not identical to UTF-8
For example, the character alef (א), which is Unicode code point U+05D0, is encoded into UTF-8 as follows:
- It falls into the range 0x0080 to 0x07FF, so it must be encoded with 2 bytes: 110xxxxx 10xxxxxx.
- Hexadecimal 0x05D0 is equivalent to binary 101-1101-0000.
- These 11 bits are placed, in order, into the positions marked "x": 11010111 10010000.
- The final result is the two bytes, more conveniently written as the hexadecimal pair 0xD7 0x90. That is the letter alef in UTF-8.
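The steps above can be sketched in Python (a minimal illustration of the 2-byte case only; the function name is a made-up example, not part of any standard API):

```python
def encode_2byte(code_point):
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    assert 0x80 <= code_point <= 0x7FF
    # Split the 11 significant bits: the top 5 go into the lead byte,
    # the low 6 into the continuation byte.
    byte1 = 0b11000000 | (code_point >> 6)          # 110xxxxx
    byte2 = 0b10000000 | (code_point & 0b00111111)  # 10xxxxxx
    return bytes([byte1, byte2])

print(encode_2byte(0x05D0).hex())  # alef: d790
```

The result matches what Python's built-in codec produces for the same character, `"א".encode("utf-8")`.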
So the first 128 characters need one byte, and the next 1920 characters need two bytes. These include the Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic alphabets. The rest of the UCS-2 characters use three bytes, and supplementary characters are encoded in 4 bytes. Representing the full 31-bit codespace of UCS-4 may require up to 6 bytes, but there are currently no plans to assign characters beyond the 1 million or so that can be represented in 4 bytes in both UTF-8 and UTF-16.
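These byte counts can be checked directly with Python's built-in codec (a quick illustration using one sample character from each length class, not part of the original article):

```python
# One sample character from each UTF-8 length class.
samples = {
    "A": 1,   # U+0041, ASCII
    "א": 2,   # U+05D0, Hebrew alef
    "€": 3,   # U+20AC, euro sign
    "𐍈": 4,   # U+10348, Gothic hwair (supplementary plane)
}
for ch, expected in samples.items():
    encoded = ch.encode("utf-8")
    assert len(encoded) == expected
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({len(encoded)} bytes)")
```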
Advantages
- In UTF-16 every character takes at least 2 bytes, and some take 4. In UTF-8, ASCII characters (including the basic Latin alphabet) take only 1 byte, although rarer characters may take up to 4. For text that is mostly ASCII, UTF-8 therefore saves space compared to UTF-16.
- Most existing computer software (including operating systems) was not written with Unicode in mind, and using Unicode with it can create compatibility problems. For example, the C standard library marks the end of a string with a single byte of value 0x00. In UTF-16, the English letter "A" is encoded as 0x0041; the library would treat the 0x00 byte as the end of the string and ignore everything after it. UTF-8, however, is designed so that the bytes of a multi-byte sequence never take ASCII values, which prevents this and similar problems.
- UTF-8 strings can be sorted using standard byte-oriented sorting routines, and the resulting order is the same as sorting by Unicode code point (although this is not a culturally correct collation; for example, it does not interleave lowercase and capital letters).
- Since the top bit is set in every byte of a multi-byte UTF-8 sequence, software designed to process ASCII or other 8-bit codes will never mistake one of those bytes for a space, so whitespace-based tokenizing routines continue to work correctly on UTF-8 encoded strings.
- Although characters have variable-length encodings, their boundaries can be found without elaborate parsing: the first byte of a sequence indicates its length, and continuation bytes always begin with 10, so a decoder can resynchronize at the start of the next character.
- UTF-8 is the default encoding for the XML format.
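The compatibility claims above can be checked empirically (a small Python sketch, not from the original article, relying only on the standard codec):

```python
# Every byte of a multi-byte UTF-8 sequence is >= 0x80, so no 0x00
# terminator, ASCII control character, or space can appear inside one.
text = "Résumé, א, €"
for ch in text:
    encoded = ch.encode("utf-8")
    if len(encoded) > 1:
        assert all(b >= 0x80 for b in encoded)

# In particular, no embedded NUL byte that would truncate a C string:
assert 0x00 not in text.encode("utf-8")
print("no ASCII bytes occur inside multi-byte sequences")
```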
Disadvantages
- UTF-8 is variable-length: different characters take sequences of different lengths to encode, so operations such as indexing the n-th character cannot be done in constant time. This can be mitigated, however, by creating an abstract interface for working with UTF-8 strings that makes the encoding transparent to the user.
- A badly written (and non-standards-compliant) UTF-8 parser may accept a number of different "overlong" pseudo-UTF-8 byte sequences and convert them to the same Unicode output. This provides a way for information to leak past validation routines designed to examine data in its 8-bit representation.
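A compliant decoder must reject such sequences; Python's strict UTF-8 codec illustrates the correct behaviour (a minimal demonstration using 0xC0 0xAF, a classic overlong pseudo-encoding of "/"):

```python
# 0xC0 0xAF is an overlong two-byte encoding of U+002F ("/").
# A compliant decoder must reject it; a sloppy one that accepted it
# could let "/" sneak past a byte-level path-validation check.
overlong = b"\xc0\xaf"
try:
    overlong.decode("utf-8")
    print("accepted (non-compliant!)")
except UnicodeDecodeError:
    print("rejected, as required")
```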