Tutorial: Character Encoding and Unicode

WWW2005 Tutorial: Internationalizing Web Content and Web Technology

10 May 2005, Makuhari, Chiba, Japan

Martin J. Dürst (duerst@it.aoyama.ac.jp)

Department of Integrated Information Technology
College of Science and Engineering
Aoyama Gakuin University
Tokyo/Sagamihara, Japan


© 2005 Martin J. Dürst, Aoyama Gakuin University

Character encoding is a very central and basic necessity for internationalization. For computer communication, characters have to be encoded into bytes. There are very simple encodings, but also more complicated ones. Over the years and around the world, a long list of corporate, national, and regional encodings has developed, which cover different sets of characters. The most complicated and the largest character encodings have been developed and are in use in Asia.

Unicode/ISO 10646 is steadily replacing these encodings in more and more places. Unicode is a single, large set of characters including all presently used scripts of the world, with remaining historic scripts being added. Unicode comes with two main encodings, UTF-8 and UTF-16, both very well designed for specific purposes. Because Unicode includes all the characters of all the well-used legacy encodings, mapping from older encodings to Unicode is usually not a problem, although there are some issues where care is necessary, in particular for East Asian character encodings.

  • To represent characters on a computer, they have to be:
    • selected (character repertoire)
    • represented as numbers (coded character set, character table)
    • ultimately, represented as bits and bytes (character encoding)
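The three layers above can be sketched with Python's built-in Unicode support; the character あ (HIRAGANA LETTER A) is chosen purely as an example:

```python
# Layer 1: the character itself, selected from the repertoire.
ch = 'あ'

# Layer 2: the coded character set assigns each character a number.
print(hex(ord(ch)))        # 0x3042, i.e. U+3042

# Layer 3: a character encoding turns that number into bytes.
print(ch.encode('utf-8'))  # b'\xe3\x81\x82'
```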

In general, character encoding deals with how to denote characters by more basic or primitive elements, such as numbers, bytes (octets), or bits. This includes a number of separable decisions, and a number of abstract layers of representation, which we will look at in greater detail later. For the moment, we will use the term encoding somewhat loosely.

General developments:

  • More and more characters, more and more bits per character
  • Encodings for larger and larger communities (local, corporate, national, supranational, global)
  • More coding complexity (code switching,...)
  • More and more rendering complexity (from a direct key-character-glyph mapping to complex relationships)

The history of character encodings contains many ingenious designs, but also quite a few accidental developments. The search for the best encoding was always, to some extent, in conflict with the need to use a common encoding that met many needs, even if somewhat incompletely.

A brief history of character encoding is provided in Richard Gillam, Unicode Demystified, pp. 25-59.

  • 5-bit encodings
  • 6-bit encodings
  • 7-bit encodings (US-ASCII, ISO 646 national variants,...)
  • 8-bit encodings (EBCDIC, ISO 8859-x series,...)
  • multibyte encodings

One tendency that can clearly be identified in the history of character encodings is the increase in the number of characters in a typical encoding. The small repertoires of early encodings were mainly due to the strong limitations of memory and display/printing capabilities of early technology.

  • Multibyte encodings (e.g. shift_JIS, euc-jp,...)
    • difficult to identify characters within a byte stream
    • adapting a program from single-byte to multibyte requires a lot of work
    • different multibyte encodings require different adaptations
  • Code switching (e.g. iso-2022-jp, ISO 2022 in general)
    • even more difficult to handle than multibyte encodings
    • difficult to get accurate information on some parts
  • Ad-hoc and research-based encodings
  • Corporate encodings
  • National encodings
  • Supranational encodings

Character encodings in most cases started out with a large variety of encodings, but converged sooner or later.
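The difficulty of identifying characters within a byte stream can be illustrated with Shift_JIS, where the second byte of a two-byte character may fall in the ASCII range (the well-known 0x5C problem; ソ is chosen here purely as an example):

```python
# In Shift_JIS, KATAKANA LETTER SO (U+30BD) encodes as 0x83 0x5C.
# A naive byte-by-byte scan sees the 0x5C and mistakes it for an
# ASCII backslash, even though it is the trailing half of ソ.
so = 'ソ'
data = so.encode('shift_jis')

print(data)          # b'\x83\\'
print(0x5C in data)  # True -- looks like '\' to an ASCII-only scanner
```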

Basic idea: A single encoding for the whole world

Originally two separate projects:

  • ISO/IEC 10646 (by ISO/IEC JTC1 SC2 WG2)
  • The Unicode Standard (Unicode Consortium)

Merged between 1991 and 1993 to avoid two global encodings.

For simplicity, this talk uses the term Unicode for the common product unless it is ambiguous.

  1. Two standards, very closely in sync
  2. A set of characters
  3. A table of characters, with a number for each character
  4. Three encodings
  5. A lot of help for working with characters

We use the term Unicode here for what is essentially two standards.

  • The Unicode Standard, available as a book (ISBN 0-321-18578-1) and online
  • ISO/IEC 10646, available on a CD
  • ISO/IEC 10646 translated into many national variants (e.g. JIS X 0221)
  • Identifying characters:
    • What scripts
    • What characters (as opposed to non-characters such as logos, icons,...)
    • What characters (what level of abstraction,...)
  • Encoding at the center of input/rendering/processing/storage
  • Unification (which characters are the same, which are different)
    • One and the same character should only be encoded once, independent of language
    • For small scripts, solution is obvious or can be handled case-by-case

The term character set has been used in various ways in the industry. Here we are talking about a set in the mathematical sense. To avoid confusion, this is often also called a character repertoire.

For CJKV ideographs, more systematic approach needed:

  • Based on unification rules used in Japanese national standard
  • Which is oriented towards average users/usage
  • "clearly different" shapes: separate code points (e.g. 体 U+4F53 vs. 體 U+9AD4)
  • "closely similar" shapes: separate code points if separate meanings (e.g. 大、犬、太、土、士)
  • otherwise "closely similar" shapes: one code point (e.g. the glyph variants of 辻, U+8FBB)
  • Original ISO/IEC design (32-bit): 128 groups of 256 planes
  • Original Unicode design (16-bit): 16 bits per character, 2^16 = 65,536 characters
  • Final structure: 1+16=17 planes:
    • BMP (Basic Multilingual Plane, plane 0, modern-use characters)
    • Plane 1, SMP (supplementary multilingual plane, historic scripts,...)
    • Plane 2, SIP (Supplementary Ideographic Plane, really rare ideographs)
    • Planes 3-13: currently unused
    • Plane 14, SSP (Supplementary Special-purpose Plane, tags, variant selectors,...)
    • Planes 15 and 16: Private Use
  • Character numbers are hexadecimal, using digits 0-9 and A-F
    • Notation: U+hhhh, with hhhh being 4-6 hex digits
  • First 128 positions: identical to US-ASCII
  • First 256 positions: identical to ISO 8859-1 (Latin-1)
  • Characters are arranged by script and function in blocks
    • Smallest blocks: Kanbun, Katakana Phonetic Extensions,... (16 characters each)
    • Largest block: CJK Unified Ideographs Extension B, from U+20000 to U+2A6DF (42720 characters)
  • Maybe more than one block for a script (example: 0600..06FF Arabic; 0750..077F Arabic Supplement)

Identifying the characters to encode is the first step; the next step is to give each character a number and to arrange the characters in a table.

  • Some numbers are not used, or serve a special function
    • Non-character code-points
    • Private use code points
  • Some 'characters' don't look or work like what we think about characters
    • Control characters
    • Formatting characters

The term code point is also used for numbers that are not really characters.

When emphasizing the fact of looking at the numbers or positions in the table, which may or may not be occupied by real characters, the term code point is often used.

  • UTF-8
    • Multibyte encoding with nice properties
    • ASCII-compatible in a very strong sense
  • UTF-16
    • Variable-length encoding, most characters 16 bits
    • Internal encoding for many applications and systems
  • UTF-32
    • Very simple and straightforward
    • Used on some Unix systems for internal processing

It would have been nice if a single encoding for Unicode addressed all encoding needs. Unfortunately, due to @@@@, this is not (yet?) the case. Unicode defines three encodings with different code unit sizes for different purposes.

from      to         usage              byte 1      byte 2      byte 3      byte 4
U+0000    U+007F     US-ASCII           0xxx xxxx   -           -           -
U+0080    U+07FF     Latin,..., Arabic  110x xxxx   10xx xxxx   -           -
U+0800    U+FFFF     rest of BMP        1110 xxxx   10xx xxxx   10xx xxxx   -
U+10000   U+10FFFF   non-BMP            1111 0xxx   10xx xxxx   10xx xxxx   10xx xxxx

Only shortest encoding allowed.
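The byte patterns in the table above can be checked directly by encoding one character from each row and printing every byte in binary (the sample characters are chosen purely as examples):

```python
# One character per UTF-8 length class:
# U+0041 (1 byte), U+00E9 (2), U+3042 (3), U+1D11E (4).
for ch in ['A', 'é', 'あ', '𝄞']:
    bits = ' '.join(f'{b:08b}' for b in ch.encode('utf-8'))
    print(f'U+{ord(ch):05X}: {bits}')
# e.g. the first line is: U+00041: 01000001
```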

  • Clear roles of byte values (single: 0..., initial: 11..., trailing: 10..., unused: C0, C1, F5-FF; cf. e.g. with Shift_JIS)
  • Easy to detect start of character in byte stream, synchronize in case of errors
  • Easy to distinguish from other encodings based on byte pattern
  • No overlapping matches
  • Strictly protects US-ASCII, important for many protocols and operating systems
  • Binary sorting identical to UCS-4/UTF-32
  • Reasonably compact for ASCII-heavy text

See The Properties and Promises of UTF-8

  • Default encoding for XML
  • Many protocols
  • IRI->URI conversion
  • Some file systems (getting popular for Unix/Linux file names)
  • Internal processing in some applications
  • Code unit is 16 bits (two bytes, one half-word)
  • One code unit for characters in BMP
  • Two code units for characters in Planes 1-16
  • 2048 "code points" in BMP reserved for UTF-16
    • High surrogates: D800-DBFF
    • Low surrogates: DC00-DFFF

The term code unit was specially created to deal with the fact that encoding characters with UTF-16 is a three-step process, with code units as an additional step between code points and bytes.
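The surrogate arithmetic for a non-BMP character can be sketched as follows; U+1D11E (MUSICAL SYMBOL G CLEF) is used purely as an example:

```python
cp = 0x1D11E
v = cp - 0x10000             # 20-bit value in the range 0..0xFFFFF

high = 0xD800 + (v >> 10)    # high surrogate, D800-DBFF (top 10 bits)
low  = 0xDC00 + (v & 0x3FF)  # low surrogate,  DC00-DFFF (bottom 10 bits)
print(hex(high), hex(low))   # 0xd834 0xdd1e

# Cross-check against the built-in codec (big-endian, no BOM):
assert chr(cp).encode('utf-16-be') == b'\xd8\x34\xdd\x1e'
```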

  • Reasonably compact and efficient for well-used characters
  • Endianness problem:

    16-bit values get stored differently on big-endian and little-endian machines

  • Solutions:
    • Byte Order Mark: U+FEFF; reverse (U+FFFE) is not a character
    • Explicit labeling: UTF-16BE and UTF-16LE
  • Usage: Mostly for internal processing
    • Microsoft Windows
    • Java
  • Each character is directly represented as a 32-bit code unit (one word)
  • Straightforward, but rather inefficient encoding
  • Used internally on some Unix-like systems and applications, and for certain kinds of processing
  • Endianness problems and solutions are similar to those of UTF-16
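The endianness problem and the byte order mark can be made concrete with a one-character string ('A', U+0041, chosen purely as an example); note that the plain 'utf-16' codec prepends a BOM in the platform's native byte order:

```python
s = 'A'

print(s.encode('utf-16-be'))  # b'\x00A'  -- big-endian, high byte first
print(s.encode('utf-16-le'))  # b'A\x00'  -- little-endian, low byte first
print(s.encode('utf-16'))     # BOM (FE FF or FF FE) followed by the data
```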

Only in the Unicode Standard, not part of ISO/IEC 10646

  • Rendering and processing
  • Conversion from/to legacy (non-Unicode) encodings
  • Bidirectional rendering
  • Sorting
  • Normalization
  • ...
  • Round-tripping (compatibility characters)
  • Escaping
  • Minor points (XML Japanese Profile,...)
    • Vendor-specific extensions
    • minor mapping differences
  • Formatting characters,...
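Normalization, one of the items above, can be sketched with Python's unicodedata module: é exists both as one precomposed code point (U+00E9) and as a base letter plus combining accent (U+0065 U+0301), and NFC/NFD convert between the two forms:

```python
import unicodedata

composed   = '\u00e9'   # é as a single precomposed code point
decomposed = 'e\u0301'  # e + COMBINING ACUTE ACCENT

print(composed == decomposed)                                # False
print(unicodedata.normalize('NFC', decomposed) == composed)  # True
print(unicodedata.normalize('NFD', composed) == decomposed)  # True
```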

[and a break!]
