
UTF-8


UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding that represents Unicode text as a stream of bytes.

Description

UTF-8 is currently standardized as RFC 2279 (UTF-8, a transformation format of ISO 10646), which is extensive and detailed. A short summary follows for readers interested only in a general overview.

Characters whose code values are below 128 are encoded with a single byte containing that value; these correspond exactly to the 128 7-bit ASCII characters. All other characters require several bytes. The upper bit of each of these bytes is always 1, so every such byte has a value of 128 or more and cannot be mistaken for a 7-bit ASCII character (particularly the ones used for control, e.g. Carriage Return). The bits of the character's code value are divided into several groups, which are then placed in the lower bit positions of these bytes.
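The two properties described above can be checked with Python's built-in UTF-8 codec. This is only a minimal sketch; the sample strings are arbitrary choices for illustration and do not come from the standard.

```python
# ASCII text produces the same byte values in UTF-8 as in plain ASCII.
ascii_text = "Hello\r\n"
same_as_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Every byte of a multi-byte sequence has its upper bit set (value >= 0x80),
# so none of them can be confused with a 7-bit ASCII character.
multibyte = "\u20ac".encode("utf-8")  # euro sign, chosen as an arbitrary example
high_bits_set = all(byte & 0x80 for byte in multibyte)

print(same_as_ascii, high_bits_set)
```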

Code range          UTF-16               UTF-8                                  Notes
(hexadecimal)       (binary)             (binary)
U00000 - U0007F     00000000 0xxxxxxx    0xxxxxxx                               ASCII equivalence range; byte begins with zero
U00080 - U007FF     00000xxx xxxxxxxx    110xxxxx 10xxxxxx                      first byte begins with 11, the following byte(s) begin with 10
U00800 - U0FFFF     xxxxxxxx xxxxxxxx    1110xxxx 10xxxxxx 10xxxxxx
U10000 - U10FFFF    110110xx xxxxxxxx    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                    110111xx xxxxxxxx*

* UTF-16 requires surrogate characters; an offset of 0x10000 is subtracted, so the bit pattern is not identical with UTF-8

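The surrogate note for the last code range can be made concrete with a short Python sketch. The function name utf16_surrogates is hypothetical, introduced only for illustration; the arithmetic follows the bit patterns given in the table (subtract 0x10000, then split the remaining 20 bits into two groups of 10).

```python
def utf16_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the surrogate-pair range")
    offset = code_point - 0x10000      # 20-bit value after removing the offset
    high = 0xD800 | (offset >> 10)     # 110110xx xxxxxxxx (upper 10 bits)
    low = 0xDC00 | (offset & 0x3FF)    # 110111xx xxxxxxxx (lower 10 bits)
    return high, low


high, low = utf16_surrogates(0x10000)
print(hex(high), hex(low))
```

For the first code point of the range the offset is zero, so the pair consists of the two bare surrogate prefixes.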
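For example, the character alef (א), which is Unicode 0x05D0, is encoded into UTF-8 in this way: it falls in the range 0x0080 to 0x07FF, so it is encoded using two bytes of the form 110xxxxx 10xxxxxx. In binary, 0x05D0 is 101 1101 0000. These 11 bits are placed in order into the positions marked x, giving 11010111 10010000, that is, the two bytes 0xD7 0x90.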
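The encoding of alef can be reproduced with a minimal Python sketch of the bit manipulation. The function name encode_utf8 is hypothetical, and the sketch covers only the one- to four-byte forms; it does not reject the surrogate values 0xD800 to 0xDFFF, which a strict encoder would refuse.

```python
def encode_utf8(code_point):
    """Encode one code point (0 .. 0x10FFFF) into UTF-8 bytes."""
    if code_point < 0:
        raise ValueError("code point must not be negative")
    if code_point < 0x80:        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point is beyond 0x10FFFF")


print(encode_utf8(0x05D0).hex())  # d790
```

The result can be compared against Python's own codec with chr(0x05D0).encode("utf-8").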

So the first 128 characters need one byte. The next 1920 characters need two bytes to encode; these include Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters. The rest of the UCS-2 characters use three bytes, and additional characters are encoded in 4 bytes. Representing the full 31-bit codespace of UCS-4 may require up to 6 bytes, but there are currently no plans to assign characters beyond the 1 million or so that can be represented in 4 bytes in both UTF-8 and UTF-16.
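The counts in this paragraph follow directly from the range boundaries, as the sketch below tries to show. The names LIMITS and utf8_length are hypothetical, and the five- and six-byte limits assume the original RFC 2279 scheme, in which each extra byte contributes six more bits.

```python
# Exclusive upper bounds of the code ranges that fit in 1 to 6 bytes
# (assumption: the RFC 2279 scheme, covering 7, 11, 16, 21, 26 and 31 bits).
LIMITS = [0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000]


def utf8_length(code_point):
    """Return how many bytes the UTF-8 form of a code point needs."""
    if code_point < 0:
        raise ValueError("code point must not be negative")
    for n_bytes, limit in enumerate(LIMITS, start=1):
        if code_point < limit:
            return n_bytes
    raise ValueError("code point does not fit in 31 bits")


print(0x800 - 0x80)        # 1920 characters need two bytes
print(0x110000 - 0x10000)  # 1048576 code points above the UCS-2 range
```

The second figure is the "1 million or so" characters reachable through UTF-16 surrogates, all of which fit in four UTF-8 bytes.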

Advantages

Disadvantages

Example web pages written in UTF-8:

wikipedia.org dumped 2003-03-17 with terodump