Unicode - Win32 apps

Unicode is a worldwide character-encoding standard. Windows uses Unicode exclusively for character and string manipulation. For a detailed description of all aspects of Unicode, refer to The Unicode Standard.

Compared to older mechanisms for handling character and string data, Unicode simplifies software localization and improves multilingual text processing. By using Unicode to represent character and string data in your applications, you can enable universal data exchange capabilities for global marketing, using a single binary file for every possible character code. Unicode does the following:

  • Allows any combination of characters, drawn from any combination of scripts and languages, to co-exist in a single document.
  • Defines semantics for each character.
  • Standardizes script behavior.
  • Provides a standard algorithm for bidirectional text.
  • Defines cross-mappings to other standards.
  • Defines multiple encodings of its single character set: UTF-7, UTF-8, UTF-16, and UTF-32. Conversion of data among these encodings is lossless.

Unicode supports numerous scripts used by languages around the world, and also a large number of technical symbols and special characters used in publishing. The supported scripts include, but are not limited to, Latin, Greek, Cyrillic, Hebrew, Arabic, Devanagari, Thai, Han, Hangul, Hiragana, and Katakana. Supported languages include, but are not limited to, German, French, English, Greek, Russian, Hebrew, Arabic, Hindi, Thai, Chinese, Korean, and Japanese. Unicode currently can represent the vast majority of characters in modern computer use around the world, and continues to be updated to make it even more complete.

Unicode-enabled functions are described in Conventions for Function Prototypes. These functions use UTF-16 (wide character) encoding, which is the most common encoding of Unicode and the one used for native Unicode encoding on Windows operating systems. Each code value is 16 bits wide, in contrast to the older code page approach to character and string data, which uses 8-bit code values. The use of 16 bits allows the direct encoding of 65,536 characters. In fact, the universe of symbols used to transcribe human languages is even larger than that, and UTF-16 code points in the range U+D800 through U+DFFF are used to form surrogate pairs, which constitute 32-bit encodings of supplementary characters. See Surrogates and Supplementary Characters for further discussion.
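
As a concrete illustration (a minimal C++ sketch, not from this article; the helper name ToSurrogatePair is ours), here is how a supplementary code point is split into its UTF-16 surrogate pair:

```cpp
#include <cstdint>
#include <cstdio>

// Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16
// surrogate pair: subtract 0x10000, then distribute the remaining 20 bits
// across the high (U+D800..U+DBFF) and low (U+DC00..U+DFFF) surrogate ranges.
void ToSurrogatePair(uint32_t codePoint, uint16_t& high, uint16_t& low)
{
    uint32_t v = codePoint - 0x10000;
    high = static_cast<uint16_t>(0xD800 + (v >> 10));   // top 10 bits
    low  = static_cast<uint16_t>(0xDC00 + (v & 0x3FF)); // bottom 10 bits
}

int main()
{
    uint16_t hi, lo;
    ToSurrogatePair(0x1F600, hi, lo);               // U+1F600, a supplementary character
    printf("U+1F600 -> 0x%04X 0x%04X\n", hi, lo);   // prints 0xD83D 0xDE00
    return 0;
}
```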

The Unicode character set includes numerous combining characters, such as U+0308 ("¨"), a combining dieresis or umlaut. Unicode can often represent the same glyph in either a "composed" or a "decomposed" form: for example, the composed form of "Ä" is the single Unicode code point U+00C4, while its decomposed form is "A" + "¨" (U+0041 U+0308). Unicode does not define a composed form for every combination of base character and combining marks, and a character can be written fully decomposed even when a composed form exists: for example, the Vietnamese lowercase "o" with circumflex and tilde ("ỗ", U+1ED7) can equivalently be written as U+006F U+0302 U+0303 (o + Circumflex + Tilde). For further discussion of combining characters and related issues, see Using Unicode Normalization to Represent Strings.
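
Windows exposes conversion between these equivalent forms through the NormalizeString function (declared in winnls.h, linked via normaliz.lib). A minimal sketch, with error handling omitted:

```cpp
#include <windows.h>
#include <cstdio>
#pragma comment(lib, "normaliz.lib")

int main()
{
    // "A" (U+0041) followed by combining diaeresis (U+0308): decomposed "Ä".
    const wchar_t decomposed[] = L"A\u0308";
    wchar_t composed[8] = {};

    // Normalization Form C recombines base letter + combining mark.
    int len = NormalizeString(NormalizationC, decomposed, -1, composed, 8);
    if (len > 0)
        printf("NFC result: U+%04X\n", (unsigned)composed[0]); // U+00C4 ("Ä")
    return 0;
}
```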

For compatibility with 8-bit and 7-bit environments, Unicode can also be encoded as UTF-8 and UTF-7, respectively. While Unicode-enabled functions in Windows use UTF-16, it is also possible to work with data encoded in UTF-8 or UTF-7, which are supported in Windows as multibyte character set code pages.

New Windows applications should use UTF-16 as their internal data representation. Windows also provides extensive support for code pages, and mixed use in the same application is possible. Even new Unicode-based applications sometimes have to work with code pages. Reasons for this are discussed in Code Pages.

An application can use the MultiByteToWideChar and WideCharToMultiByte functions to convert between strings based on code pages and Unicode strings. Although their names refer to "MultiByte", these functions work equally well with single-byte character set (SBCS), double-byte character set (DBCS), and multibyte character set (MBCS) code pages.
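
A common usage pattern is to call the function twice: first to size the buffer, then to convert. The following sketch (the helper name Utf8ToUtf16 is ours) converts UTF-8 text to UTF-16:

```cpp
#include <windows.h>
#include <string>

// Convert UTF-8 to UTF-16 with MultiByteToWideChar. Passing -1 for the
// source length treats the input as null-terminated; the returned length
// then includes the terminator.
std::wstring Utf8ToUtf16(const std::string& utf8)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
    if (len <= 0)
        return std::wstring(); // conversion failed
    std::wstring utf16(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &utf16[0], len);
    utf16.resize(len - 1); // drop the trailing null stored in the buffer
    return utf16;
}
```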

Typically, a Windows application should use UTF-16 internally, converting only as part of a "thin layer" over the interface that must use another format. This technique defends against loss and corruption of data. Each code page supports different characters, but none of them supports the full spectrum of characters provided by Unicode. Most of the code pages support different subsets, differently encoded. The code pages for UTF-8 and UTF-7 are an exception, since they support the complete Unicode character set, and conversion between these encodings and UTF-16 is lossless.

Data converted directly from the encoding used by one code page to the encoding used by another is subject to corruption, because the same data value on different code pages can encode a different character. Even when your application is converting as close to the interface as possible, you should think carefully about the range of data to handle.

Data converted from Unicode to a code page is subject to data loss, because a given code page might not be able to represent every character used in that particular Unicode data. Therefore, note that WideCharToMultiByte might lose some data if the target code page cannot represent all of the characters in the Unicode string.
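
You can detect such loss through the function's lpUsedDefaultChar parameter, as in this sketch, which assumes Windows code page 1252 as the target (for UTF-8 targets this parameter must be NULL):

```cpp
#include <windows.h>
#include <cstdio>

int main()
{
    // "Łódź": U+0141 and U+017A have no representation in code page 1252.
    const wchar_t src[] = L"\u0141\u00F3d\u017A";
    char dst[16] = {};
    BOOL usedDefault = FALSE;

    WideCharToMultiByte(1252, WC_NO_BEST_FIT_CHARS, src, -1,
                        dst, sizeof(dst), nullptr, &usedDefault);
    if (usedDefault)
        printf("Lossy: at least one character became the default char.\n");
    return 0;
}
```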

When modernizing code page-based legacy applications to use Unicode, you can use generic functions and the TEXT macro to maintain a single set of sources from which to compile two versions of your application. One version supports Unicode and the other one works with Windows code pages. Using this mechanism, you can convert even very large applications from Windows code pages to Unicode while maintaining application sources that can be compiled, built, and tested at all phases of the conversion. For more information, see Conventions for Function Prototypes.
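
In practice the mechanism looks like the following sketch, which compiles as either the Unicode or the code-page version depending on whether UNICODE and _UNICODE are defined:

```cpp
#include <windows.h>
#include <tchar.h>
#pragma comment(lib, "user32.lib")

int main()
{
    // With UNICODE/_UNICODE defined, TCHAR is WCHAR, TEXT("...") yields a
    // wide literal, and MessageBox resolves to MessageBoxW; without them,
    // the same source compiles against the "ANSI" (code page) versions.
    TCHAR caption[] = TEXT("Sample");
    MessageBox(nullptr, TEXT("Hello, world"), caption, MB_OK);
    return 0;
}
```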

Unicode characters and strings use data types that are distinct from those for code page-based characters and strings. Along with a series of macros and naming conventions, this distinction minimizes the chance of accidentally mixing the two types of character data. It facilitates compiler type checking to ensure that only Unicode parameter values are used with functions expecting Unicode strings.
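
A short sketch of the effect (both function declarations here are hypothetical):

```cpp
#include <windows.h>

void TakesUnicode(LPCWSTR text); // expects UTF-16 (WCHAR) data
void TakesAnsi(LPCSTR text);     // expects code-page (char) data

void Demo()
{
    TakesUnicode(L"wide");      // OK: L"..." produces WCHAR data
    TakesAnsi("narrow");        // OK: "..." produces char data
    // TakesUnicode("narrow");  // compile error: const char* is not const WCHAR*
}
```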

Related topics

  • Character Sets
  • Sorting
  • Surrogates and Supplementary Characters
  • Using Unicode Normalization to Represent Strings

FAQs

What is the difference between UTF-32 and Unicode?

UTF-32 is not a separate standard but a fixed-length encoding of Unicode, in contrast to the other Unicode transformation formats, which are variable-length encodings. Each 32-bit value in UTF-32 represents one Unicode code point and is exactly equal to that code point's numerical value.
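
A small C++ illustration (ours, not from the FAQ source):

```cpp
#include <cstdio>

int main()
{
    char32_t c = U'\U0001F600';     // one UTF-32 code unit per code point
    printf("0x%08X\n", (unsigned)c); // prints 0x0001F600, the scalar value
    return 0;
}
```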

What is the difference between UTF-8 and Unicode?

Unicode is a character set: it assigns a number (code point) to every character. UTF-8 is an encoding: it defines how those numbers are represented as bytes.

What is Unicode used for?

Unicode is an international character encoding standard that provides a unique number for every character across languages and scripts, making almost all characters accessible across platforms, programs, and devices.

Why is Unicode better than ASCII?

The Unicode standard includes over 149,000 characters covering most of the world's writing systems. ASCII uses one byte per character and defines only 128 characters, whereas Unicode encodings use up to four bytes per character, giving them the capacity for a vastly larger character repertoire.

Should I use UTF-32?

If you frequently need to access APIs that require string parameters to be in UTF-32, it may be more convenient to work with UTF-32 strings all the time. In many situations that does not matter, but the convenience of having a fixed number of code units per code point can be the deciding factor.

Is Unicode the same as UTF-16?

No. Unicode is the character set; UTF-16 (16-bit Unicode Transformation Format) is one encoding of it, capable of encoding all 1,112,064 valid Unicode code points (in fact, that number of code points is dictated by the design of UTF-16). The encoding is variable-length: code points are encoded with one or two 16-bit code units.

Should I use UTF-8 or UTF-16?

UTF-8 is preferable to UTF-16 for the majority of websites, because the largely ASCII markup and text found there encodes more compactly in UTF-8.

Which is better, ASCII or UTF-8?

UTF-8 uses 8-bit code units and is backward compatible with ASCII: the 128 ASCII characters keep their one-byte encodings, while every other Unicode character is represented by a multi-byte sequence. ASCII itself can represent only its original 128 characters.

Is Python UTF-8 or Unicode?

UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for “Unicode Transformation Format”, and the '8' means that 8-bit values are used in the encoding. (There are also UTF-16 and UTF-32 encodings, but they are less frequently used than UTF-8.)

Why do phones use Unicode?

Unicode provides a universal standard for character encoding, which means that text can be accurately represented and interpreted across different platforms, operating systems, and programming languages.

What is Unicode primarily used for?

Unicode is intended to address the need for a workable, reliable world text encoding. It was originally described as "wide-body ASCII" stretched to 16 bits, and it has since grown beyond 16 bits to encompass the characters of all the world's living languages.

What are the disadvantages of Unicode?

Non-ASCII characters need more bits, which causes documents to take up more space than with encodings that are specific to a particular language or writing system.

What is an example of Unicode characters?

Unicode supports more than a million code points, which are written with a "U" followed by a plus sign and the number in hexadecimal; for example, the word "Hello" is written U+0048 U+0065 U+006C U+006C U+006F.

What characters are supported by Unicode?

The simplest answer is that Unicode covers all of the languages that can be written in the following widely-used scripts: Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, Devanagari, Bengali, Gurmukhi, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Hangul, ...

Are emojis part of Unicode?

Yes. Each new version of the Unicode Standard generally adds new emoji, though the exact number varies.

What is the difference between Unicode and non-Unicode characters?

The only difference between the Unicode and the non-Unicode versions is whether OAWCHAR or char data type is used for character data. The length arguments always indicate the number of characters, not the number of bytes. OAWCHAR is mapped to the C Unicode data type wchar_t.

Can UTF-8 represent all Unicode characters?

Each UTF can represent any Unicode character that you need to represent. UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes. The first 128 Unicode code points are encoded as 1 byte in UTF-8.
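
You can verify those byte counts with u8 string literals in C++ (our example; sizeof counts the null terminator, hence the -1):

```cpp
#include <cstdio>

int main()
{
    printf("U+0041  -> %zu byte(s)\n", sizeof(u8"A") - 1);          // 1
    printf("U+00E9  -> %zu byte(s)\n", sizeof(u8"\u00E9") - 1);     // 2
    printf("U+20AC  -> %zu byte(s)\n", sizeof(u8"\u20AC") - 1);     // 3
    printf("U+1F600 -> %zu byte(s)\n", sizeof(u8"\U0001F600") - 1); // 4
    return 0;
}
```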

What is the difference between ASCII and Unicode?

Unicode is a universal character encoding standard that includes roughly 150,000 characters from the world's languages, while ASCII defines only 128. ASCII uses 1 byte per character, whereas Unicode encodings use up to 4 bytes, so Unicode can represent a far wider variety of characters.
