Tutorial: Character Encoding and Unicode

WWW2005 Tutorial: Internationalizing Web Content and Web Technology

10 May 2005, Makuhari, Chiba, Japan

Martin J. Dürst (duerst@it.aoyama.ac.jp)

Department of Integrated Information Technology
College of Science and Engineering
Aoyama Gakuin University
Tokyo/Sagamihara, Japan


© 2005 Martin J. Dürst, Aoyama Gakuin University

Character encoding is a very central and basic necessity for internationalization. For computer communication, characters have to be encoded into bytes. There are very simple encodings, but also more complicated ones. Over the years and around the world, a long list of corporate, national, and regional encodings has developed, which cover different sets of characters. The most complicated and the largest character encodings have been developed and are in use in Asia.

Unicode/ISO 10646 is steadily replacing these encodings in more and more places. Unicode is a single, large set of characters including all presently used scripts of the world, with remaining historic scripts being added. Unicode comes with two main encodings, UTF-8 and UTF-16, both very well designed for specific purposes. Because Unicode includes all the characters of all the well-used legacy encodings, mapping from older encodings to Unicode is usually not a problem, although there are some issues where care is necessary, in particular for East Asian character encodings.

  • To represent characters on a computer, they have to be:
    • selected (character repertoire)
    • represented as numbers (coded character set, character table)
    • ultimately, represented as bits and bytes (character encoding)
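The three layers above can be sketched with Python's built-in Unicode support; the character あ (HIRAGANA LETTER A) is chosen purely as an example:

```python
# Layer 1: the character itself, selected from the repertoire.
ch = 'あ'

# Layer 2: the coded character set assigns each character a number.
print(hex(ord(ch)))        # 0x3042, i.e. U+3042

# Layer 3: a character encoding turns that number into bytes.
print(ch.encode('utf-8'))  # b'\xe3\x81\x82'
```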

In general, character encoding deals with how to denote characters by more basic or primitive elements, such as numbers, bytes (octets), or bits. This includes a number of separable decisions, and a number of abstract layers of representation, which we will look at in greater detail later. For the moment, we will use the term encoding somewhat loosely.

General developments:

  • More and more characters, more and more bits per character
  • Encodings for larger and larger communities (local, corporate, national, supranational, global)
  • More coding complexity (code switching,...)
  • More and more rendering complexity (from a direct key-character-glyph mapping to complex relationships)

The history of character encodings contains many ingenious designs, but also quite a few accidental developments. The search for the best encoding was always, to some extent, in conflict with the need to use a common encoding that met many needs, even if somewhat incompletely.

A brief history of character encoding is provided in Richard Gillam, Unicode Demystified, pp. 25-59.

  • 5-bit encodings
  • 6-bit encodings
  • 7-bit encodings (US-ASCII, ISO 646 national variants,...)
  • 8-bit encodings (EBCDIC, ISO 8859-x series,...)
  • multibyte encodings

One tendency that can clearly be identified in the history of character encodings is the increase in the number of characters in a typical encoding. The small repertoires of early encodings were mainly due to the strong limitations of memory and display/printing capabilities of early technology.

  • Multibyte encodings (e.g. shift_JIS, euc-jp,...)
    • difficult to identify characters within a byte stream
    • adapting a program from single-byte to multibyte requires a lot of work
    • different multibyte encodings require different adaptations
  • Code switching (e.g. iso-2022-jp, ISO 2022 in general)
    • even more difficult to handle than multibyte encodings
    • difficult to get accurate information on some parts
  • Ad-hoc and research-based encodings
  • Corporate encodings
  • National encodings
  • Supranational encodings

Character encodings in most cases started out with a large variety of encodings, but converged sooner or later.
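The difficulty of identifying characters within a byte stream can be illustrated with Shift_JIS, where the second byte of a two-byte character may fall in the ASCII range (the well-known 0x5C problem; ソ is chosen here purely as an example):

```python
# In Shift_JIS, KATAKANA LETTER SO (U+30BD) encodes as 0x83 0x5C.
# A naive byte-by-byte scan sees the 0x5C and mistakes it for an
# ASCII backslash, even though it is the trailing half of ソ.
so = 'ソ'
data = so.encode('shift_jis')

print(data)          # b'\x83\\'
print(0x5C in data)  # True -- looks like '\' to an ASCII-only scanner
```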

Basic idea: A single encoding for the whole world

Originally two separate projects:

  • ISO/IEC 10646 (by ISO/IEC JTC1 SC2 WG2)
  • The Unicode Standard (Unicode Consortium)

Merged between 1991 and 1993 to avoid two global encodings.

For simplicity, this talk uses the term Unicode for the common product unless it is ambiguous.

  1. Two standards, very closely in sync
  2. A set of characters
  3. A table of characters, with a number for each character
  4. Three encodings
  5. A lot of help for working with characters

We use the term Unicode here for what is essentially two standards.

  • The Unicode Standard, available as a book (ISBN 0-321-18578-1) and online
  • ISO/IEC 10646, available on a CD
  • ISO/IEC 10646 translated into many national variants (e.g. JIS X 0221)
  • Identifying characters:
    • What scripts
    • What characters (as opposed to non-characters such as logos, icons,...)
    • What characters (what level of abstraction,...)
  • Encoding at the center of input/rendering/processing/storage
  • Unification (which characters are the same, which are different)
    • One and the same character should only be encoded once, independent of language
    • For small scripts, solution is obvious or can be handled case-by-case

The term character set has been used in various ways in the industry. Here we are talking about a set in the mathematical sense. To avoid confusion, this is often also called a character repertoire.

For CJKV ideographs, more systematic approach needed:

  • Based on unification rules used in Japanese national standard
  • Which is oriented towards average users/usage
  • "clearly different" shapes: separate code points (e.g. 体 U+4F53 vs. 體 U+9AD4)
  • "closely similar" shapes: separate code points if separate meanings (e.g. 大、犬、太、土、士)
  • otherwise "closely similar" shapes: one code point (e.g. the glyph variants of 辻, U+8FBB)
  • Original ISO/IEC design (32-bit): 128 groups of 256 planes
  • Original Unicode design (16-bit): 16 bits per character, 2^16 = 65,536 characters
  • Final structure: 1+16=17 planes:
    • BMP (Basic Multilingual Plane, plane 0, modern-use characters)
    • Plane 1, SMP (supplementary multilingual plane, historic scripts,...)
    • Plane 2, SIP (Supplementary Ideographic Plane, really rare ideographs)
    • Planes 3-13: currently unused
    • Plane 14, SSP (Supplementary Special-purpose Plane, tags, variant selectors,...)
    • Planes 15 and 16: Private Use
  • Character numbers are hexadecimal, using digits 0-9 and A-F
    • Notation: U+hhhh, with hhhh being 4-6 hex digits
  • First 128 positions: identical to US-ASCII
  • First 256 positions: identical to ISO 8859-1 (Latin-1)
  • Characters are arranged by script and function in blocks
    • Smallest blocks: Kanbun, Katakana Phonetic Extensions,... (16 characters each)
    • Largest block: CJK Unified Ideographs Extension B, from U+20000 to U+2A6DF (42720 characters)
  • Maybe more than one block for a script (example: 0600..06FF Arabic; 0750..077F Arabic Supplement)

Identifying the characters to encode is the first step; the next step is to give each character a number and to arrange the characters in a table.

  • Some numbers are not used, or serve a special function
    • Non-character code-points
    • Private use code points
  • Some 'characters' don't look or work like what we think about characters
    • Control characters
    • Formatting characters

The term code point is also used for numbers that are not really characters.

When emphasizing the fact of looking at the numbers or positions in the table, which may or may not be occupied by real characters, the term code point is often used.

  • UTF-8
    • Multibyte encoding with nice properties
    • ASCII-compatible in a very strong sense
  • UTF-16
    • Variable-length encoding, most characters 16 bits
    • Internal encoding for many applications and systems
  • UTF-32
    • Very simple and straightforward
    • Used on some Unix systems for internal processing

It would have been nice if a single encoding for Unicode addressed all encoding needs. Unfortunately, due to @@@@, this is not (yet?) the case. Unicode defines three encodings with different code unit sizes for different purposes.

from      to         usage              byte 1      byte 2      byte 3      byte 4
U+0000    U+007F     US-ASCII           0xxx xxxx   -           -           -
U+0080    U+07FF     Latin,..., Arabic  110x xxxx   10xx xxxx   -           -
U+0800    U+FFFF     rest of BMP        1110 xxxx   10xx xxxx   10xx xxxx   -
U+10000   U+10FFFF   non-BMP            1111 0xxx   10xx xxxx   10xx xxxx   10xx xxxx

Only shortest encoding allowed.
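The byte patterns in the table above can be checked directly by encoding one character from each row and printing every byte in binary (the sample characters are chosen purely as examples):

```python
# One character per UTF-8 length class:
# U+0041 (1 byte), U+00E9 (2), U+3042 (3), U+1D11E (4).
for ch in ['A', 'é', 'あ', '𝄞']:
    bits = ' '.join(f'{b:08b}' for b in ch.encode('utf-8'))
    print(f'U+{ord(ch):05X}: {bits}')
# e.g. the first line is: U+00041: 01000001
```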

  • Clear roles of byte values (single: 0..., initial: 11..., trailing: 10..., unused: C0, C1, F5-FF; cf. e.g. with Shift_JIS)
  • Easy to detect start of character in byte stream, synchronize in case of errors
  • Easy to distinguish from other encodings based on byte pattern
  • No overlapping matches
  • Strictly protects US-ASCII, important for many protocols and operating systems
  • Binary sorting identical to UCS-4/UTF-32
  • Reasonably compact for ASCII-heavy text

See The Properties and Promises of UTF-8

  • Default encoding for XML
  • Many protocols
  • IRI->URI conversion
  • Some file systems (getting popular for Unix/Linux file names)
  • Internal processing in some applications
  • Code unit is 16 bits (two bytes, one half-word)
  • One code unit for characters in BMP
  • Two code units for characters in Planes 1-16
  • 2048 "code points" in BMP reserved for UTF-16
    • High surrogates: D800-DBFF
    • Low surrogates: DC00-DFFF

The term code unit was specially created to deal with the fact that encoding characters with UTF-16 is a three-step process, with code units as an additional step between code points and bytes.
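The surrogate arithmetic for a non-BMP character can be sketched as follows; U+1D11E (MUSICAL SYMBOL G CLEF) is used purely as an example:

```python
cp = 0x1D11E
v = cp - 0x10000             # 20-bit value in the range 0..0xFFFFF

high = 0xD800 + (v >> 10)    # high surrogate, D800-DBFF (top 10 bits)
low  = 0xDC00 + (v & 0x3FF)  # low surrogate,  DC00-DFFF (bottom 10 bits)
print(hex(high), hex(low))   # 0xd834 0xdd1e

# Cross-check against the built-in codec (big-endian, no BOM):
assert chr(cp).encode('utf-16-be') == b'\xd8\x34\xdd\x1e'
```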

  • Reasonably compact and efficient for well-used characters
  • Endianness problem:

    16-bit values get stored differently on big-endian and little-endian machines

  • Solutions:
    • Byte Order Mark: U+FEFF; reverse (U+FFFE) is not a character
    • Explicit labeling: UTF-16BE and UTF-16LE
  • Usage: Mostly for internal processing
    • Microsoft Windows
    • Java
  • Each character is directly represented as a 32-bit code unit (one word)
  • Straightforward, but rather inefficient encoding
  • Used internally on some Unix-like systems and applications, and for certain kinds of processing
  • Endianness problems and solutions are similar to those of UTF-16
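The endianness problem and the byte order mark can be made concrete with a one-character string ('A', U+0041, chosen purely as an example); note that the plain 'utf-16' codec prepends a BOM in the platform's native byte order:

```python
s = 'A'

print(s.encode('utf-16-be'))  # b'\x00A'  -- big-endian, high byte first
print(s.encode('utf-16-le'))  # b'A\x00'  -- little-endian, low byte first
print(s.encode('utf-16'))     # BOM (FE FF or FF FE) followed by the data
```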

Only in the Unicode Standard, not part of ISO/IEC 10646

  • Rendering and processing
  • Conversion from/to legacy (non-Unicode) encodings
  • Bidirectional rendering
  • Sorting
  • Normalization
  • ...
  • Round-tripping (compatibility characters)
  • Escaping
  • Minor points (XML Japanese Profile,...)
    • Vendor-specific extensions
    • minor mapping differences
  • Formatting characters,...
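Normalization, one of the items above, can be sketched with Python's unicodedata module: é exists both as one precomposed code point (U+00E9) and as a base letter plus combining accent (U+0065 U+0301), and NFC/NFD convert between the two forms:

```python
import unicodedata

composed   = '\u00e9'   # é as a single precomposed code point
decomposed = 'e\u0301'  # e + COMBINING ACUTE ACCENT

print(composed == decomposed)                                # False
print(unicodedata.normalize('NFC', decomposed) == composed)  # True
print(unicodedata.normalize('NFD', composed) == decomposed)  # True
```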

[and a break!]
