Unicode - Win32 apps

Unicode is a worldwide character-encoding standard. Windows uses Unicode exclusively for character and string manipulation. For a detailed description of all aspects of Unicode, refer to The Unicode Standard.

Compared to older mechanisms for handling character and string data, Unicode simplifies software localization and improves multilingual text processing. By using Unicode to represent character and string data in your applications, you can enable universal data exchange capabilities for global marketing, using a single binary file for every possible character code. Unicode does the following:

  • Allows any combination of characters, drawn from any combination of scripts and languages, to co-exist in a single document.
  • Defines semantics for each character.
  • Standardizes script behavior.
  • Provides a standard algorithm for bidirectional text.
  • Defines cross-mappings to other standards.
  • Defines multiple encodings of its single character set: UTF-7, UTF-8, UTF-16, and UTF-32. Conversion of data among these encodings is lossless.

Unicode supports numerous scripts used by languages around the world, and also a large number of technical symbols and special characters used in publishing. The supported scripts include, but are not limited to, Latin, Greek, Cyrillic, Hebrew, Arabic, Devanagari, Thai, Han, Hangul, Hiragana, and Katakana. Supported languages include, but are not limited to, German, French, English, Greek, Russian, Hebrew, Arabic, Hindi, Thai, Chinese, Korean, and Japanese. Unicode currently can represent the vast majority of characters in modern computer use around the world, and continues to be updated to make it even more complete.

Unicode-enabled functions are described in Conventions for Function Prototypes. These functions use UTF-16 (wide character) encoding, which is the most common encoding of Unicode and the one used for native Unicode encoding on Windows operating systems. Each code value is 16 bits wide, in contrast to the older code page approach to character and string data, which uses 8-bit code values. The use of 16 bits allows the direct encoding of 65,536 characters. In fact, the universe of symbols used to transcribe human languages is even larger than that, and UTF-16 code points in the range U+D800 through U+DFFF are used to form surrogate pairs, which constitute 32-bit encodings of supplementary characters. See Surrogates and Supplementary Characters for further discussion.
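
As a concrete illustration (a minimal C++ sketch, not from this article; the helper name ToSurrogatePair is ours), here is how a supplementary code point is split into its UTF-16 surrogate pair:

```cpp
#include <cstdint>
#include <cstdio>

// Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16
// surrogate pair: subtract 0x10000, then distribute the remaining 20 bits
// across the high (U+D800..U+DBFF) and low (U+DC00..U+DFFF) surrogate ranges.
void ToSurrogatePair(uint32_t codePoint, uint16_t& high, uint16_t& low)
{
    uint32_t v = codePoint - 0x10000;
    high = static_cast<uint16_t>(0xD800 + (v >> 10));   // top 10 bits
    low  = static_cast<uint16_t>(0xDC00 + (v & 0x3FF)); // bottom 10 bits
}

int main()
{
    uint16_t hi, lo;
    ToSurrogatePair(0x1F600, hi, lo);               // U+1F600, a supplementary character
    printf("U+1F600 -> 0x%04X 0x%04X\n", hi, lo);   // prints 0xD83D 0xDE00
    return 0;
}
```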

The Unicode character set includes numerous combining characters, such as U+0308 ("¨"), a combining dieresis or umlaut. Unicode can often represent the same glyph in either a "composed" or a "decomposed" form: for example, the composed form of "Ä" is the single Unicode code point U+00C4, while its decomposed form is "A" + "¨" (U+0041 U+0308). Unicode does not define a composed form for every combination of base character and combining marks, and a character can be written fully decomposed even when a composed form exists: for example, the Vietnamese lowercase "o" with circumflex and tilde ("ỗ", U+1ED7) can equivalently be written as U+006F U+0302 U+0303 (o + Circumflex + Tilde). For further discussion of combining characters and related issues, see Using Unicode Normalization to Represent Strings.
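
Windows exposes conversion between these equivalent forms through the NormalizeString function (declared in winnls.h, linked via normaliz.lib). A minimal sketch, with error handling omitted:

```cpp
#include <windows.h>
#include <cstdio>
#pragma comment(lib, "normaliz.lib")

int main()
{
    // "A" (U+0041) followed by combining diaeresis (U+0308): decomposed "Ä".
    const wchar_t decomposed[] = L"A\u0308";
    wchar_t composed[8] = {};

    // Normalization Form C recombines base letter + combining mark.
    int len = NormalizeString(NormalizationC, decomposed, -1, composed, 8);
    if (len > 0)
        printf("NFC result: U+%04X\n", (unsigned)composed[0]); // U+00C4 ("Ä")
    return 0;
}
```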

For compatibility with 8-bit and 7-bit environments, Unicode can also be encoded as UTF-8 and UTF-7, respectively. While Unicode-enabled functions in Windows use UTF-16, it is also possible to work with data encoded in UTF-8 or UTF-7, which are supported in Windows as multibyte character set code pages.

New Windows applications should use UTF-16 as their internal data representation. Windows also provides extensive support for code pages, and mixed use in the same application is possible. Even new Unicode-based applications sometimes have to work with code pages. Reasons for this are discussed in Code Pages.

An application can use the MultiByteToWideChar and WideCharToMultiByte functions to convert between strings based on code pages and Unicode strings. Although their names refer to "MultiByte", these functions work equally well with single-byte character set (SBCS), double-byte character set (DBCS), and multibyte character set (MBCS) code pages.
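
A common usage pattern is to call the function twice: first to size the buffer, then to convert. The following sketch (the helper name Utf8ToUtf16 is ours) converts UTF-8 text to UTF-16:

```cpp
#include <windows.h>
#include <string>

// Convert UTF-8 to UTF-16 with MultiByteToWideChar. Passing -1 for the
// source length treats the input as null-terminated; the returned length
// then includes the terminator.
std::wstring Utf8ToUtf16(const std::string& utf8)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
    if (len <= 0)
        return std::wstring(); // conversion failed
    std::wstring utf16(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &utf16[0], len);
    utf16.resize(len - 1); // drop the trailing null stored in the buffer
    return utf16;
}
```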

Typically, a Windows application should use UTF-16 internally, converting only as part of a "thin layer" over the interface that must use another format. This technique defends against loss and corruption of data. Each code page supports different characters, but none of them supports the full spectrum of characters provided by Unicode. Most of the code pages support different subsets, differently encoded. The code pages for UTF-8 and UTF-7 are an exception, since they support the complete Unicode character set, and conversion between these encodings and UTF-16 is lossless.

Data converted directly from the encoding used by one code page to the encoding used by another is subject to corruption, because the same data value on different code pages can encode a different character. Even when your application is converting as close to the interface as possible, you should think carefully about the range of data to handle.

Data converted from Unicode to a code page is subject to data loss, because a given code page might not be able to represent every character used in that particular Unicode data. Therefore, note that WideCharToMultiByte might lose some data if the target code page cannot represent all of the characters in the Unicode string.
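
You can detect such loss through the function's lpUsedDefaultChar parameter, as in this sketch, which assumes Windows code page 1252 as the target (for UTF-8 targets this parameter must be NULL):

```cpp
#include <windows.h>
#include <cstdio>

int main()
{
    // "Łódź": U+0141 and U+017A have no representation in code page 1252.
    const wchar_t src[] = L"\u0141\u00F3d\u017A";
    char dst[16] = {};
    BOOL usedDefault = FALSE;

    WideCharToMultiByte(1252, WC_NO_BEST_FIT_CHARS, src, -1,
                        dst, sizeof(dst), nullptr, &usedDefault);
    if (usedDefault)
        printf("Lossy: at least one character became the default char.\n");
    return 0;
}
```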

When modernizing code page-based legacy applications to use Unicode, you can use generic functions and the TEXT macro to maintain a single set of sources from which to compile two versions of your application. One version supports Unicode and the other one works with Windows code pages. Using this mechanism, you can convert even very large applications from Windows code pages to Unicode while maintaining application sources that can be compiled, built, and tested at all phases of the conversion. For more information, see Conventions for Function Prototypes.
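
In practice the mechanism looks like the following sketch, which compiles as either the Unicode or the code-page version depending on whether UNICODE and _UNICODE are defined:

```cpp
#include <windows.h>
#include <tchar.h>
#pragma comment(lib, "user32.lib")

int main()
{
    // With UNICODE/_UNICODE defined, TCHAR is WCHAR, TEXT("...") yields a
    // wide literal, and MessageBox resolves to MessageBoxW; without them,
    // the same source compiles against the "ANSI" (code page) versions.
    TCHAR caption[] = TEXT("Sample");
    MessageBox(nullptr, TEXT("Hello, world"), caption, MB_OK);
    return 0;
}
```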

Unicode characters and strings use data types that are distinct from those for code page-based characters and strings. Along with a series of macros and naming conventions, this distinction minimizes the chance of accidentally mixing the two types of character data. It facilitates compiler type checking to ensure that only Unicode parameter values are used with functions expecting Unicode strings.
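
A short sketch of the effect (both function declarations here are hypothetical):

```cpp
#include <windows.h>

void TakesUnicode(LPCWSTR text); // expects UTF-16 (WCHAR) data
void TakesAnsi(LPCSTR text);     // expects code-page (char) data

void Demo()
{
    TakesUnicode(L"wide");      // OK: L"..." produces WCHAR data
    TakesAnsi("narrow");        // OK: "..." produces char data
    // TakesUnicode("narrow");  // compile error: const char* is not const WCHAR*
}
```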

Related topics

  • Character Sets
  • Sorting
  • Surrogates and Supplementary Characters
  • Using Unicode Normalization to Represent Strings

FAQs

What is the difference between UTF-32 and Unicode?

UTF-32 is not a separate standard but a fixed-length encoding of Unicode, in contrast to the other Unicode transformation formats, which are variable-length encodings. Each 32-bit value in UTF-32 represents one Unicode code point and is exactly equal to that code point's numerical value.
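
A small C++ illustration (ours, not from the FAQ source):

```cpp
#include <cstdio>

int main()
{
    char32_t c = U'\U0001F600';     // one UTF-32 code unit per code point
    printf("0x%08X\n", (unsigned)c); // prints 0x0001F600, the scalar value
    return 0;
}
```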

What is the difference between UTF-8 and Unicode?

Unicode is a character set: it assigns a number (code point) to every character. UTF-8 is an encoding: it defines how those numbers are represented as bytes.

What is Unicode used for?

Unicode is an international character encoding standard that provides a unique number for every character across languages and scripts, making almost all characters accessible across platforms, programs, and devices.

Why is Unicode better than ASCII?

The Unicode standard includes over 149,000 characters covering most of the world's writing systems. ASCII uses one byte per character and defines only 128 characters, whereas Unicode encodings use up to four bytes per character, giving them the capacity for a vastly larger character repertoire.

Should I use UTF-32?

If you frequently need to access APIs that require string parameters to be in UTF-32, it may be more convenient to work with UTF-32 strings all the time. In many situations that does not matter, but the convenience of having a fixed number of code units per code point can be the deciding factor.

Is Unicode the same as UTF-16?

No. Unicode is the character set; UTF-16 (16-bit Unicode Transformation Format) is one encoding of it, capable of encoding all 1,112,064 valid Unicode code points (in fact, that number of code points is dictated by the design of UTF-16). The encoding is variable-length: code points are encoded with one or two 16-bit code units.

Should I use UTF-8 or UTF-16?

UTF-8 is preferable to UTF-16 for the majority of websites, because the largely ASCII markup and text found there encodes more compactly in UTF-8.

Which is better, ASCII or UTF-8?

UTF-8 uses 8-bit code units and is backward compatible with ASCII: the 128 ASCII characters keep their one-byte encodings, while every other Unicode character is represented by a multi-byte sequence. ASCII itself can represent only its original 128 characters.

Is Python UTF-8 or Unicode?

UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for “Unicode Transformation Format”, and the '8' means that 8-bit values are used in the encoding. (There are also UTF-16 and UTF-32 encodings, but they are less frequently used than UTF-8.)

Why do phones use Unicode?

Unicode provides a universal standard for character encoding, which means that text can be accurately represented and interpreted across different platforms, operating systems, and programming languages.

What is Unicode primarily used for?

Unicode is intended to address the need for a workable, reliable world text encoding. It was originally described as "wide-body ASCII" stretched to 16 bits, and it has since grown beyond 16 bits to encompass the characters of all the world's living languages.

What are the disadvantages of Unicode?

Non-ASCII characters need more bits, which causes documents to take up more space than with encodings that are specific to a particular language or writing system.

What is an example of Unicode characters?

Unicode supports more than a million code points, which are written with a "U" followed by a plus sign and the number in hexadecimal; for example, the word "Hello" is written U+0048 U+0065 U+006C U+006C U+006F.

What characters are supported by Unicode?

The simplest answer is that Unicode covers all of the languages that can be written in the following widely-used scripts: Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, Devanagari, Bengali, Gurmukhi, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Hangul, ...

Are emojis part of Unicode?

Yes. Each new version of the Unicode Standard generally adds new emoji, though the exact number varies.

What is the difference between Unicode and non-Unicode characters?

The only difference between the Unicode and the non-Unicode versions is whether OAWCHAR or char data type is used for character data. The length arguments always indicate the number of characters, not the number of bytes. OAWCHAR is mapped to the C Unicode data type wchar_t.

Can UTF-8 represent all Unicode characters?

Each UTF can represent any Unicode character that you need to represent. UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes. The first 128 Unicode code points are encoded as 1 byte in UTF-8.
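
You can verify those byte counts with u8 string literals in C++ (our example; sizeof counts the null terminator, hence the -1):

```cpp
#include <cstdio>

int main()
{
    printf("U+0041  -> %zu byte(s)\n", sizeof(u8"A") - 1);          // 1
    printf("U+00E9  -> %zu byte(s)\n", sizeof(u8"\u00E9") - 1);     // 2
    printf("U+20AC  -> %zu byte(s)\n", sizeof(u8"\u20AC") - 1);     // 3
    printf("U+1F600 -> %zu byte(s)\n", sizeof(u8"\U0001F600") - 1); // 4
    return 0;
}
```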

What is the difference between ASCII and Unicode?

Unicode is a universal character encoding standard that includes roughly 150,000 characters from the world's languages, while ASCII defines only 128. ASCII uses 1 byte per character, whereas Unicode encodings use up to 4 bytes, so Unicode can represent a far wider variety of characters.
