
For simplicity, in U, a string consists of characters.

If you need to work with Unicode at a low level, this section describes the minimum you need to know to use the convenient Unicode methods with confidence.

If you don't need to know about the Unicode standard, skip this section.

Standards and Encodings

Human languages are complex. Computers, conversely, are simple: they work with bytes. The only way to translate human language elements into bytes is to map them with an encoding.

Human character -> standard -> encoding -> byte

As an example, ASCII was one of the first standards. It directly maps characters to bytes: the letter 'a' is represented by the number 97.

Human character -> ASCII standard (`a` is 97) -> ASCII encoding (1 character in 1 byte) -> ASCII byte
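To make this chain concrete, here is a minimal sketch in Python (rather than U), using only built-ins: `ord` gives the number the standard assigns to the character, and `str.encode` produces the encoded byte.

```python
# The ASCII chain: character -> number (standard) -> byte (encoding).
text = "a"

print(ord(text))                # 97    -> the ASCII standard maps 'a' to 97
print(text.encode("ascii"))     # b'a'  -> the ASCII encoding stores it in 1 byte
print(text.encode("ascii")[0])  # 97    -> that single byte holds the value 97
```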

However, ASCII was designed to represent characters and symbols from English, not non-English characters such as Japanese. To standardize the encoding of all written languages, the Unicode system was designed.

Unicode Standard

Unicode is not an encoding. It's a standard that assigns numbers to characters; those numbers are called code points. There are multiple ways of encoding Unicode characters: the Unicode standard defines several Unicode Transformation Formats (UTF) such as UTF-8, UTF-16, and UTF-32. Discussing their differences is out of scope for this section.

UTF-8 is now the de facto encoding for websites, many modern programming languages, and files.

Human character -> Unicode standard (`a` is U+0061) -> UTF encoding (1 character in ONE OR MORE bytes) -> UTF bytes
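As a rough illustration in Python (not U), the same character can be viewed as a code point and then encoded into a different number of bytes depending on which UTF is chosen:

```python
# One code point, three encodings with different byte counts.
char = "a"

print(f"U+{ord(char):04X}")      # U+0061 -> the code point
print(char.encode("utf-8"))      # b'a'               -> 1 byte in UTF-8
print(char.encode("utf-16-be"))  # b'\x00a'           -> 2 bytes in UTF-16
print(char.encode("utf-32-be"))  # b'\x00\x00\x00a'   -> 4 bytes in UTF-32
```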

Note that in Unicode, we denote code points in hexadecimal. For example:

Character   Code point(s)                  Name
'a'         U+0061                         LATIN SMALL LETTER A
'Я'         U+042F                         CYRILLIC CAPITAL LETTER YA
'和'        U+548C                         CJK UNIFIED IDEOGRAPH-548C (harmony, peace; peaceful, calm)
'🚀'        U+1F680                        ROCKET
'🏴‍☠️'      U+1F3F4 U+200D U+2620 U+FE0F   PIRATE FLAG (emoji sequence)

One of the main differences from ASCII is that Unicode characters are encoded into a variable number of bytes. For example, in UTF-8 the pirate flag '🏴‍☠️' takes 13 bytes! This makes processing Unicode strings complex.
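A quick Python sketch (again, not U code; Python 3.8+ for the `hex` separator) shows this variable length in practice. `len` on a Python string counts code points, while `len` on the encoded value counts UTF-8 bytes:

```python
# One visible symbol, four code points, thirteen UTF-8 bytes.
flag = "🏴‍☠️"

print(len(flag))                   # 4  -> code points
print(len(flag.encode("utf-8")))   # 13 -> UTF-8 bytes
print(flag.encode("utf-8").hex(" "))
# f0 9f 8f b4 e2 80 8d e2 98 a0 ef b8 8f
```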

We suggest reading about code points, graphemes, and combining character sequences to understand how to manage Unicode strings and why they can be complicated to work with (sorting, comparing, ...).

Terms

Every character defined by the Unicode standard has a unique identifier, and characters are categorized into blocks, such as Latin, Arabic, or Cyrillic.
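For instance, Python's standard unicodedata module (used here purely as an illustration; this is not U's API) can look up the standard name attached to a code point:

```python
import unicodedata

# Every code point has a number and a standard name.
for char in ["a", "Я", "🚀"]:
    print(f"U+{ord(char):04X}", unicodedata.name(char))
# U+0061 LATIN SMALL LETTER A
# U+042F CYRILLIC CAPITAL LETTER YA
# U+1F680 ROCKET
```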

The key terms to understand about Unicode are:

  • A code point is a number that represents a Unicode character in the Unicode character set. Think of this number as an index into the Unicode character set; it isn't yet encoded. It is written with a 'U+' prefix, like 'U+0061' for 'a'.

  • A Unicode scalar value, or Unicode scalar, is any code point EXCEPT the surrogate code points. Surrogate pairs (special reserved ranges) are used only by UTF-16. If you never work with UTF-16 code units, you can treat Unicode scalar values and code points as the same thing.

  • An encoded code point consists of a sequence of integers called code units. The encoding determines the bit width of its code units: 8, 16, or 32 bits (see the sketch after this list):

For example, here is how the musical symbol '𝄞' (U+1D11E, MUSICAL SYMBOL G CLEF) is encoded in each format:

Character encoding   Code unit size   Encoded value         # of bytes   # of code units
UTF-8                8-bit            0xF0 0x9D 0x84 0x9E   4            4
UTF-16               16-bit           0xD834 0xDD1E         4            2
UTF-32               32-bit           0x0001D11E            4            1

  • Two or more Unicode scalar values can be combined to create a grapheme cluster, or simply a grapheme. A grapheme is drawn as a single visual symbol. For example, the pirate flag 🏴‍☠️ is:

    • one human-visible symbol on screen,
    • represented by 4 code points: U+1F3F4, U+200D, U+2620, U+FE0F,
    • encoded in UTF-8 as 13 code units (bytes): F0 9F 8F B4 E2 80 8D E2 98 A0 EF B8 8F.
  • A string consists of Unicode scalars (1 code point) and graphemes (2 or more code points).
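The Python sketch below (not U code) ties these terms together for the pirate flag and for the musical symbol '𝄞' from the code unit table above. Counting graphemes requires a Unicode segmentation library, which Python's standard library does not provide, so only code points and code units are shown here.

```python
# Code points, code units, and why grapheme counting is a separate problem.
flag = "\U0001F3F4\u200D\u2620\uFE0F"  # 🏴‍☠️ written as its 4 code points
clef = "\U0001D11E"                    # 𝄞 (MUSICAL SYMBOL G CLEF), one code point

# Code points (all of these are also Unicode scalar values).
print([f"U+{ord(c):04X}" for c in flag])   # ['U+1F3F4', 'U+200D', 'U+2620', 'U+FE0F']

# Code units depend on the encoding (same character, different counts).
print(len(clef.encode("utf-8")))           # 4 -> four 8-bit code units
print(len(clef.encode("utf-16-le")) // 2)  # 2 -> two 16-bit code units (a surrogate pair)
print(len(clef.encode("utf-32-le")) // 4)  # 1 -> one 32-bit code unit

# len() counts code points, not graphemes: the single visible flag reports 4.
print(len(flag))                           # 4
```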