Unicode HOWTO

Release: 1.12

This HOWTO discusses Python’s support for the Unicode specification for representing textual data, and explains various problems that people commonly encounter when trying to work with Unicode.

Introduction to Unicode

Definitions

Today’s programs need to be able to handle a wide variety of characters. Applications are often internationalized to display messages and output in a variety of user-selectable languages; the same program might need to output an error message in English, French, Japanese, Hebrew, or Russian. Web content can be written in any of these languages and can also include a variety of emoji symbols. Python’s string type uses the Unicode Standard for representing characters, which lets Python programs work with all these different possible characters.

Unicode (https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code. The Unicode specifications are continually revised and updated to add new languages and symbols.

A character is the smallest possible component of a text. ‘A’, ‘B’, ‘C’, etc., are all different characters. So are ‘È’ and ‘Í’. Characters vary depending on the language or context you’re talking about. For example, there’s a character for “Roman Numeral One”, ‘Ⅰ’, that’s separate from the uppercase letter ‘I’. They’ll usually look the same, but these are two different characters that have different meanings.

The Unicode standard describes how characters are represented by code points. A code point value is an integer in the range 0 to 0x10FFFF (about 1.1 million values; the actual number assigned is less than that). In the standard and in this document, a code point is written using the notation U+265E to mean the character with value 0x265e (9,822 in decimal).
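You can verify this correspondence in the interpreter with the built-in ord() and chr() functions, which are covered in more detail later; this is just a quick sketch:

>>> ord('\N{BLACK CHESS KNIGHT}')
9822
>>> hex(9822)
'0x265e'
>>> chr(0x265E)
'♞'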

The Unicode standard contains a lot of tables listing characters and their corresponding code points:

0061    'a'; LATIN SMALL LETTER A
0062    'b'; LATIN SMALL LETTER B
0063    'c'; LATIN SMALL LETTER C
...
007B    '{'; LEFT CURLY BRACKET
...
2167    'Ⅷ'; ROMAN NUMERAL EIGHT
2168    'Ⅸ'; ROMAN NUMERAL NINE
...
265E    '♞'; BLACK CHESS KNIGHT
265F    '♟'; BLACK CHESS PAWN
...
1F600   '😀'; GRINNING FACE
1F609   '😉'; WINKING FACE
...

Strictly, these definitions imply that it’s meaningless to say ‘this is character U+265E’. U+265E is a code point, which represents some particular character; in this case, it represents the character ‘BLACK CHESS KNIGHT’, ‘♞’. In informal contexts, this distinction between code points and characters will sometimes be forgotten.

A character is represented on a screen or on paper by a set of graphical elements that’s called a glyph. The glyph for an uppercase A, for example, is two diagonal strokes and a horizontal stroke, though the exact details will depend on the font being used. Most Python code doesn’t need to worry about glyphs; figuring out the correct glyph to display is generally the job of a GUI toolkit or a terminal’s font renderer.

Encodings

To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). This sequence of code points needs to be represented in memory as a set of code units, and code units are then mapped to 8-bit bytes. The rules for translating a Unicode string into a sequence of bytes are called a character encoding, or just an encoding.

The first encoding you might think of is using 32-bit integers as the code unit, and then using the CPU’s representation of 32-bit integers. In this representation, the string “Python” might look like this:

   P           y           t           h           o           n
0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

This representation is straightforward but using it presents a number ofproblems.

  1. It’s not portable; different processors order the bytes differently.

  2. It’s very wasteful of space. In most texts, the majority of the code points are less than 127, or less than 255, so a lot of space is occupied by 0x00 bytes. The above string takes 24 bytes compared to the 6 bytes needed for an ASCII representation. Increased RAM usage doesn’t matter too much (desktop computers have gigabytes of RAM, and strings aren’t usually that large), but expanding our usage of disk and network bandwidth by a factor of 4 is intolerable.

  3. It’s not compatible with existing C functions such as strlen(), so a new family of wide string functions would need to be used.

Therefore this encoding isn’t used very much, and people instead choose other encodings that are more efficient and convenient, such as UTF-8.

UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for “Unicode Transformation Format”, and the ‘8’ means that 8-bit values are used in the encoding. (There are also UTF-16 and UTF-32 encodings, but they are less frequently used than UTF-8.) UTF-8 uses the following rules:

  1. If the code point is < 128, it’s represented by the corresponding byte value.

  2. If the code point is >= 128, it’s turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255. (The short session after this list shows both cases.)
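To see these rules in action, you can encode a few sample characters and inspect the resulting byte lengths; the characters chosen here are arbitrary:

>>> 'a'.encode('utf-8')           # code point 97 is < 128: one byte
b'a'
>>> '\u00e9'.encode('utf-8')      # 'é': two bytes
b'\xc3\xa9'
>>> '\u20ac'.encode('utf-8')      # '€': three bytes
b'\xe2\x82\xac'
>>> '\U0001f600'.encode('utf-8')  # '😀': four bytes
b'\xf0\x9f\x98\x80'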

UTF-8 has several convenient properties:

  1. It can handle any Unicode code point.

  2. A Unicode string is turned into a sequence of bytes that contains embedded zero bytes only where they represent the null character (U+0000). This means that UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes for anything other than end-of-string markers.

  3. A string of ASCII text is also valid UTF-8 text.

  4. UTF-8 is fairly compact; the majority of commonly used characters can be represented with one or two bytes.

  5. If bytes are corrupted or lost, it’s possible to determine the start of the next UTF-8-encoded code point and resynchronize. It’s also unlikely that random 8-bit data will look like valid UTF-8.

  6. UTF-8 is a byte-oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes. This avoids the byte-ordering issues that can occur with integer- and word-oriented encodings, like UTF-16 and UTF-32, where the sequence of bytes varies depending on the hardware on which the string was encoded. (The session after this list illustrates the difference.)
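The byte-ordering difference is easy to observe by comparing the endian-specific UTF-16 codecs with UTF-8; this small sketch uses U+03A9, GREEK CAPITAL LETTER OMEGA:

>>> '\u03a9'.encode('utf-8')      # the same bytes on every machine
b'\xce\xa9'
>>> '\u03a9'.encode('utf-16-le')  # little-endian: low byte first
b'\xa9\x03'
>>> '\u03a9'.encode('utf-16-be')  # big-endian: high byte first
b'\x03\xa9'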

References

The Unicode Consortium site has character charts, a glossary, and PDF versions of the Unicode specification. Be prepared for some difficult reading. A chronology of the origin and development of Unicode is also available on the site.

On the Computerphile YouTube channel, Tom Scott briefly discusses the history of Unicode and UTF-8 (9 minutes 36 seconds).

To help understand the standard, Jukka Korpela has written an introductory guide to reading the Unicode character tables.

Another good introductory article was written by Joel Spolsky. If this introduction didn’t make things clear to you, you should try reading this alternate article before continuing.

Wikipedia entries are often helpful; see the entries for “character encoding” and UTF-8, for example.

Python’s Unicode Support

Now that you’ve learned the rudiments of Unicode, we can look at Python’s Unicode features.

The String Type

Since Python 3.0, the language’s str type contains Unicode characters, meaning any string created using "unicode rocks!", 'unicode rocks!', or the triple-quoted string syntax is stored as Unicode.

The default encoding for Python source code is UTF-8, so you can simply include a Unicode character in a string literal:

try:
    with open('/tmp/input.txt', 'r') as f:
        ...
except OSError:
    # 'File not found' error message.
    print("Fichier non trouvé")

Side note: Python 3 also supports using Unicode characters in identifiers:

répertoire = "/tmp/records.log"
with open(répertoire, "w") as f:
    f.write("test\n")

If you can’t enter a particular character in your editor or want to keep the source code ASCII-only for some reason, you can also use escape sequences in string literals. (Depending on your system, you may see the actual capital-delta glyph instead of a \u escape.)

>>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
'\u0394'
>>> "\u0394"                          # Using a 16-bit hex value
'\u0394'
>>> "\U00000394"                      # Using a 32-bit hex value
'\u0394'

In addition, one can create a string using the decode() method of bytes. This method takes an encoding argument, such as UTF-8, and optionally an errors argument.

The errors argument specifies the response when the input string can’t be converted according to the encoding’s rules. Legal values for this argument are 'strict' (raise a UnicodeDecodeError exception), 'replace' (use U+FFFD, REPLACEMENT CHARACTER), 'ignore' (just leave the character out of the Unicode result), or 'backslashreplace' (inserts a \xNN escape sequence). The following examples show the differences:

>>> b'\x80abc'.decode("utf-8", "strict")
Traceback (most recent call last):
    ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
>>> b'\x80abc'.decode("utf-8", "replace")
'\ufffdabc'
>>> b'\x80abc'.decode("utf-8", "backslashreplace")
'\\x80abc'
>>> b'\x80abc'.decode("utf-8", "ignore")
'abc'

Encodings are specified as strings containing the encoding’s name. Python comes with roughly 100 different encodings; see the Python Library Reference at Standard Encodings for a list. Some encodings have multiple names; for example, 'latin-1', 'iso_8859_1' and '8859' are all synonyms for the same encoding.
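All of an encoding’s names select the same codec, as a quick check with the synonyms above shows:

>>> b'caf\xe9'.decode('latin-1')
'café'
>>> b'caf\xe9'.decode('iso_8859_1')
'café'
>>> b'caf\xe9'.decode('8859')
'café'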

One-character Unicode strings can also be created with the chr() built-in function, which takes integers and returns a Unicode string of length 1 that contains the corresponding code point. The reverse operation is the built-in ord() function that takes a one-character Unicode string and returns the code point value:

>>> chr(57344)
'\ue000'
>>> ord('\ue000')
57344

Converting to Bytes

The opposite method of bytes.decode() is str.encode(), which returns a bytes representation of the Unicode string, encoded in the requested encoding.

The errors parameter is the same as the parameter of the decode() method but supports a few more possible handlers. As well as 'strict', 'ignore', and 'replace' (which in this case inserts a question mark instead of the unencodable character), there is also 'xmlcharrefreplace' (inserts an XML character reference), 'backslashreplace' (inserts a \uNNNN escape sequence) and 'namereplace' (inserts a \N{...} escape sequence).

The following example shows the different results:

>>> u = chr(40960) + 'abcd' + chr(1972)
>>> u.encode('utf-8')
b'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
    ...
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
b'abcd'
>>> u.encode('ascii', 'replace')
b'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
b'&#40960;abcd&#1972;'
>>> u.encode('ascii', 'backslashreplace')
b'\\ua000abcd\\u07b4'
>>> u.encode('ascii', 'namereplace')
b'\\N{YI SYLLABLE IT}abcd\\u07b4'

The low-level routines for registering and accessing the available encodings are found in the codecs module. Implementing new encodings also requires understanding the codecs module. However, the encoding and decoding functions returned by this module are usually more low-level than is comfortable, and writing new encodings is a specialized task, so the module won’t be covered in this HOWTO.

Unicode Literals in Python Source Code

In Python source code, specific Unicode code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape sequence is similar, but expects eight hex digits, not four:

>>> s = "a\xac\u1234\u20ac\U00008000"
... #     ^^^^ two-digit hex escape
... #         ^^^^^^ four-digit Unicode escape
... #                     ^^^^^^^^^^ eight-digit Unicode escape
>>> [ord(c) for c in s]
[97, 172, 4660, 8364, 32768]

Using escape sequences for code points greater than 127 is fine in small doses, but becomes an annoyance if you’re using many accented characters, as you would in a program with messages in French or some other accent-using language. You can also assemble strings using the chr() built-in function, but this is even more tedious.

Ideally, you’d want to be able to write literals in your language’s natural encoding. You could then edit Python source code with your favorite editor which would display the accented characters naturally, and have the right characters used at runtime.

Python supports writing source code in UTF-8 by default, but you can use almost any encoding if you declare the encoding being used. This is done by including a special comment as either the first or second line of the source file:

#!/usr/bin/env python
# -*- coding: latin-1 -*-

u = 'abcdé'
print(ord(u[-1]))

The syntax is inspired by Emacs’s notation for specifying variables local to a file. Emacs supports many different variables, but Python only supports ‘coding’. The -*- symbols indicate to Emacs that the comment is special; they have no significance to Python but are a convention. Python looks for coding: name or coding=name in the comment.

If you don’t include such a comment, the default encoding used will be UTF-8, as already mentioned. See also PEP 263 for more information.

Unicode Properties

The Unicode specification includes a database of information about code points. For each defined code point, the information includes the character’s name, its category, the numeric value if applicable (for characters representing numeric concepts such as the Roman numerals, fractions such as one-third and four-fifths, etc.). There are also display-related properties, such as how to use the code point in bidirectional text.

The following program displays some information about several characters, and prints the numeric value of one particular character:

import unicodedata

u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231)

for i, c in enumerate(u):
    print(i, '%04x' % ord(c), unicodedata.category(c), end=" ")
    print(unicodedata.name(c))

# Get numeric value of second character
print(unicodedata.numeric(u[1]))

When run, this prints:

0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
1 0bf2 No TAMIL NUMBER ONE THOUSAND
2 0f84 Mn TIBETAN MARK HALANTA
3 1770 Lo TAGBANWA LETTER SA
4 33af So SQUARE RAD OVER S SQUARED
1000.0

The category codes are abbreviations describing the nature of the character. These are grouped into categories such as “Letter”, “Number”, “Punctuation”, or “Symbol”, which in turn are broken up into subcategories. To take the codes from the above output, 'Ll' means ‘Letter, lowercase’, 'No' means “Number, other”, 'Mn' is “Mark, nonspacing”, and 'So' is “Symbol, other”. See the General Category Values section of the Unicode Character Database documentation for a list of category codes.

Comparing Strings

Unicode adds some complication to comparing strings, because the same set of characters can be represented by different sequences of code points. For example, a letter like ‘ê’ can be represented as a single code point U+00EA, or as U+0065 U+0302, which is the code point for ‘e’ followed by a code point for ‘COMBINING CIRCUMFLEX ACCENT’. These will produce the same output when printed, but one is a string of length 1 and the other is of length 2.
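You can see both representations in the interpreter; they print identically but have different lengths and compare unequal:

>>> single = '\u00ea'     # 'ê' as a single code point
>>> multiple = 'e\u0302'  # 'e' followed by COMBINING CIRCUMFLEX ACCENT
>>> print(single, multiple)
ê ê
>>> len(single), len(multiple)
(1, 2)
>>> single == multiple
False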

One tool for a case-insensitive comparison is the casefold() string method that converts a string to a case-insensitive form following an algorithm described by the Unicode Standard. This algorithm has special handling for characters such as the German letter ‘ß’ (code point U+00DF), which becomes the pair of lowercase letters ‘ss’.

>>> street = 'Gürzenichstraße'
>>> street.casefold()
'gürzenichstrasse'

A second tool is the unicodedata module’s normalize() function that converts strings to one of several normal forms, where letters followed by a combining character are replaced with single characters. normalize() can be used to perform string comparisons that won’t falsely report inequality if two strings use combining characters differently:

import unicodedata

def compare_strs(s1, s2):
    def NFD(s):
        return unicodedata.normalize('NFD', s)
    return NFD(s1) == NFD(s2)

single_char = 'ê'
multiple_chars = '\N{LATIN SMALL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
print('length of first string=', len(single_char))
print('length of second string=', len(multiple_chars))
print(compare_strs(single_char, multiple_chars))

When run, this outputs:

$ python compare-strs.py
length of first string= 1
length of second string= 2
True

The first argument to the normalize() function is a string giving the desired normalization form, which can be one of ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.
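For example, the composed forms (NFC, NFKC) turn ‘ê’ into a single code point, while the decomposed forms (NFD, NFKD) split it in two:

>>> import unicodedata
>>> s = '\u00ea'   # 'ê'
>>> [(form, len(unicodedata.normalize(form, s)))
...  for form in ('NFC', 'NFD', 'NFKC', 'NFKD')]
[('NFC', 1), ('NFD', 2), ('NFKC', 1), ('NFKD', 2)]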

The Unicode Standard also specifies how to do caseless comparisons:

import unicodedata

def compare_caseless(s1, s2):
    def NFD(s):
        return unicodedata.normalize('NFD', s)
    return NFD(NFD(s1).casefold()) == NFD(NFD(s2).casefold())

# Example usage
single_char = 'ê'
multiple_chars = '\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
print(compare_caseless(single_char, multiple_chars))

This will print True. (Why is NFD() invoked twice? Because there are a few characters that make casefold() return a non-normalized string, so the result needs to be normalized again. See section 3.13 of the Unicode Standard for a discussion and an example.)

Unicode Regular Expressions

The regular expressions supported by the re module can be provided either as bytes or strings. Some of the special character sequences such as \d and \w have different meanings depending on whether the pattern is supplied as bytes or a string. For example, \d will match the characters [0-9] in bytes but in strings will match any character that’s in the 'Nd' category.

The string in this example has the number 57 written in both Thai and Arabic numerals:

import re

p = re.compile(r'\d+')

s = "Over \u0e55\u0e57 57 flavours"
m = p.search(s)
print(repr(m.group()))

When executed, \d+ will match the Thai numerals and print them out. If you supply the re.ASCII flag to compile(), \d+ will match the substring “57” instead.

Similarly, \w matches a wide variety of Unicode characters but only [a-zA-Z0-9_] in bytes or if re.ASCII is supplied, and \s will match either Unicode whitespace characters or [ \t\n\r\f\v].
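A short session showing the effect of re.ASCII on the example string above (\u0e55 and \u0e57 are the Thai digits five and seven):

>>> import re
>>> s = "Over \u0e55\u0e57 57 flavours"
>>> re.search(r'\d+', s).group()            # the Thai numerals match first
'๕๗'
>>> re.search(r'\d+', s, re.ASCII).group()  # only [0-9] matches
'57'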

References

Some good alternative discussions of Python’s Unicode support are:

The str type is described in the Python library reference at Text Sequence Type — str.

The documentation for the unicodedata module.

The documentation for the codecs module.

Marc-André Lemburg gave a presentation titled “Python and Unicode” (PDF slides) at EuroPython 2002. The slides are an excellent overview of the design of Python 2’s Unicode features (where the Unicode string type is called unicode and literals start with u).

Reading and Writing Unicode Data

Once you’ve written some code that works with Unicode data, the next problem is input/output. How do you get Unicode strings into your program, and how do you convert Unicode into a form suitable for storage or transmission?

It’s possible that you may not need to do anything depending on your input sources and output destinations; you should check whether the libraries used in your application support Unicode natively. XML parsers often return Unicode data, for example. Many relational databases also support Unicode-valued columns and can return Unicode values from an SQL query.

Unicode data is usually converted to a particular encoding before it gets written to disk or sent over a socket. It’s possible to do all the work yourself: open a file, read an 8-bit bytes object from it, and convert the bytes with bytes.decode(encoding). However, the manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode character can be represented by several bytes. If you want to read the file in arbitrary-sized chunks (say, 1024 or 4096 bytes), you need to write error-handling code to catch the case where only part of the bytes encoding a single Unicode character are read at the end of a chunk. One solution would be to read the entire file into memory and then perform the decoding, but that prevents you from working with files that are extremely large; if you need to read a 2 GiB file, you need 2 GiB of RAM. (More, really, since for at least a moment you’d need to have both the encoded string and its Unicode version in memory.)

The solution would be to use the low-level decoding interface to catch the case of partial coding sequences. The work of implementing this has already been done for you: the built-in open() function can return a file-like object that assumes the file’s contents are in a specified encoding and accepts Unicode parameters for methods such as read() and write(). This works through open()'s encoding and errors parameters, which are interpreted just like those in str.encode() and bytes.decode().

Reading Unicode from a file is therefore simple:

with open('unicode.txt', encoding='utf-8') as f:
    for line in f:
        print(repr(line))

It’s also possible to open files in update mode, allowing both reading andwriting:

with open('test', encoding='utf-8', mode='w+') as f:
    f.write('\u4500 blah blah blah\n')
    f.seek(0)
    print(repr(f.readline()[:1]))

The Unicode character U+FEFF is used as a byte-order mark (BOM), and is often written as the first character of a file in order to assist with autodetection of the file’s byte ordering. Some encodings, such as UTF-16, expect a BOM to be present at the start of a file; when such an encoding is used, the BOM will be automatically written as the first character and will be silently dropped when the file is read. There are variants of these encodings, such as ‘utf-16-le’ and ‘utf-16-be’ for little-endian and big-endian encodings, that specify one particular byte ordering and don’t skip the BOM.
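For instance, the generic ‘utf-16’ codec writes a BOM while the endian-specific variants do not; the ‘utf-16’ output below is from a little-endian machine (a big-endian machine would produce b'\xfe\xff' followed by big-endian code units):

>>> 'abc'.encode('utf-16-le')  # no BOM; byte order fixed by the codec name
b'a\x00b\x00c\x00'
>>> 'abc'.encode('utf-16')     # BOM written first (little-endian here)
b'\xff\xfea\x00b\x00c\x00'
>>> b'\xff\xfea\x00b\x00c\x00'.decode('utf-16')  # BOM silently dropped
'abc'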

In some areas, it is also convention to use a “BOM” at the start of UTF-8-encoded files; the name is misleading since UTF-8 is not byte-order dependent. The mark simply announces that the file is encoded in UTF-8. For reading such files, use the ‘utf-8-sig’ codec to automatically skip the mark if present.
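A quick sketch of how the ‘utf-8-sig’ codec behaves:

>>> data = 'caf\u00e9'.encode('utf-8-sig')  # writes the three-byte UTF-8 "BOM"
>>> data
b'\xef\xbb\xbfcaf\xc3\xa9'
>>> data.decode('utf-8-sig')                # mark skipped if present...
'café'
>>> b'caf\xc3\xa9'.decode('utf-8-sig')      # ...and harmless if absent
'café'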

Unicode filenames

Most of the operating systems in common use today support filenames that contain arbitrary Unicode characters. Usually this is implemented by converting the Unicode string into some encoding that varies depending on the system. Today Python is converging on using UTF-8: Python on MacOS has used UTF-8 for several versions, and Python 3.6 switched to using UTF-8 on Windows as well. On Unix systems, there will only be a filesystem encoding if you’ve set the LANG or LC_CTYPE environment variables; if you haven’t, the default encoding is again UTF-8.

The sys.getfilesystemencoding() function returns the encoding to use on your current system, in case you want to do the encoding manually, but there’s not much reason to bother. When opening a file for reading or writing, you can usually just provide the Unicode string as the filename, and it will be automatically converted to the right encoding for you:

filename = 'filename\u4500abc'
with open(filename, 'w') as f:
    f.write('blah\n')

Functions in the os module such as os.stat() will also accept Unicode filenames.

The os.listdir() function returns filenames, which raises an issue: should it return the Unicode version of filenames, or should it return bytes containing the encoded versions? os.listdir() can do both, depending on whether you provided the directory path as bytes or a Unicode string. If you pass a Unicode string as the path, filenames will be decoded using the filesystem’s encoding and a list of Unicode strings will be returned, while passing a byte path will return the filenames as bytes. For example, assuming the default filesystem encoding is UTF-8, running the following program:

fn = 'filename\u4500abc'
f = open(fn, 'w')
f.close()

import os
print(os.listdir(b'.'))
print(os.listdir('.'))

will produce the following output:

$ python listdir-test.py
[b'filename\xe4\x94\x80abc', ...]
['filename\u4500abc', ...]

The first list contains UTF-8-encoded filenames, and the second list contains the Unicode versions.

Note that on most occasions, you can just stick with using Unicode with these APIs. The bytes APIs should only be used on systems where undecodable file names can be present; that’s pretty much only Unix systems now.

Tips for Writing Unicode-aware Programs

This section provides some suggestions on writing software that deals withUnicode.

The most important tip is:

Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end.

If you attempt to write processing functions that accept both Unicode and byte strings, you will find your program vulnerable to bugs wherever you combine the two different kinds of strings. There is no automatic encoding or decoding: if you do e.g. str + bytes, a TypeError will be raised.
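For example, mixing the two types fails immediately rather than silently guessing an encoding (the exact error message shown is from a recent CPython):

>>> 'one' + b'two'
Traceback (most recent call last):
    ...
TypeError: can only concatenate str (not "bytes") to str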

When using data coming from a web browser or some other untrusted source, a common technique is to check for illegal characters in a string before using the string in a generated command line or storing it in a database. If you’re doing this, be careful to check the decoded string, not the encoded bytes data; some encodings may have interesting properties, such as not being bijective or not being fully ASCII-compatible. This is especially true if the input data also specifies the encoding, since the attacker can then choose a clever way to hide malicious text in the encoded bytestream.

Converting Between File Encodings

The StreamRecoder class can transparently convert between encodings, taking a stream that returns data in encoding #1 and behaving like a stream returning data in encoding #2.

For example, if you have an input file f that’s in Latin-1, you can wrap it with a StreamRecoder to return bytes encoded in UTF-8:

import codecs

new_f = codecs.StreamRecoder(f,
    # en/decoder: used by read() to encode its results and
    # by write() to decode its input.
    codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),

    # reader/writer: used to read and write to the stream.
    codecs.getreader('latin-1'), codecs.getwriter('latin-1'))

Files in an Unknown Encoding

What can you do if you need to make a change to a file, but don’t know the file’s encoding? If you know the encoding is ASCII-compatible and only want to examine or modify the ASCII parts, you can open the file with the surrogateescape error handler:

with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
    data = f.read()

# make changes to the string 'data'

with open(fname + '.new', 'w', encoding="ascii", errors="surrogateescape") as f:
    f.write(data)

The surrogateescape error handler will decode any non-ASCII bytes as code points in a special range running from U+DC80 to U+DCFF. These code points will then turn back into the same bytes when the surrogateescape error handler is used to encode the data and write it back out.
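A short demonstration of the round trip (b'\x80' is not valid ASCII):

>>> raw = b'abc\x80def'
>>> text = raw.decode('ascii', 'surrogateescape')
>>> text
'abc\udc80def'
>>> text.encode('ascii', 'surrogateescape') == raw
True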

References

One section of Mastering Python 3 Input/Output, a PyCon 2010 talk by David Beazley, discusses text processing and binary data handling.

The PDF slides for Marc-André Lemburg’s presentation “Writing Unicode-aware Applications in Python” discuss questions of character encodings as well as how to internationalize and localize an application. These slides cover Python 2.x only.

The Guts of Unicode in Python is a PyCon 2013 talk by Benjamin Peterson that discusses the internal Unicode representation in Python 3.3.

Acknowledgements

The initial draft of this document was written by Andrew Kuchling. It has since been revised further by Alexander Belopolsky, Georg Brandl, Andrew Kuchling, and Ezio Melotti.

Thanks to the following people who have noted errors or offered suggestions on this article: Éric Araujo, Nicholas Bastin, Nick Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André Lemburg, Martin von Löwis, Terry J. Reedy, Serhiy Storchaka, Eryk Sun, Chad Whitacre, Graham Wideman.
