For simplicity, in U, a string consists of characters.
If you need to work with low-level Unicode, this section describes the minimum you need to know to use the convenient Unicode methods with confidence.
If you don't need to know about the Unicode standard, skip this section.
Standards and Encodings¶
Human languages are complex. Computers, conversely, are simple: they have bytes. The only way to translate human language elements to bytes is to map them with an encoding.
As an example, ASCII was one of the first standards. It directly maps characters to bytes: the letter `a` is represented as the number 97.
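To make that mapping concrete, here is a minimal sketch in Python (used here purely for illustration, not as U code):

```python
# ASCII maps each character directly to one byte.
text = "a"

print(ord("a"))                     # 97: the number ASCII assigns to 'a'
print(text.encode("ascii"))         # b'a'
print(list(text.encode("ascii")))   # [97]: exactly one byte per character
```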
However, ASCII was designed to represent characters and symbols from English, not non-English characters such as Japanese. To standardize the encoding of all written languages, the Unicode standard was designed.
Unicode Standard¶
Unicode is not an encoding. It's a way of assigning numbers to represent characters. Those numbers are called code points. There are multiple ways of encoding Unicode characters: the Unicode standard defines several Unicode Transformation Formats (UTF): UTF-8, UTF-16, UTF-32... Discussing the differences is out of scope for this section.
UTF-8 is now the de facto encoding for websites, many modern programming languages, and files.
Human character -> Unicode standard (`a` is U+0061) -> UTF encoding (one character encoded into one or more bytes) -> UTF bytes
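As an illustration of that pipeline, here is a short Python sketch (Python is used here only for demonstration):

```python
# Character -> code point -> UTF-8 bytes.
char = "🚀"

# The code point is the number the Unicode standard assigns to the character.
print(f"U+{ord(char):04X}")     # U+1F680

# UTF-8 turns that single character into several bytes.
utf8 = char.encode("utf-8")
print(utf8.hex(" "))            # f0 9f 9a 80
print(len(utf8))                # 4 bytes
```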
Note that in Unicode, we denote code points in hexadecimal. For example:
| character | code point(s) | name |
|---|---|---|
| 'a' | U+0061 | LATIN SMALL LETTER A |
| 'Я' | U+042F | CYRILLIC CAPITAL LETTER YA |
| '和' | U+548C | (harmony, peace; peaceful, calm) |
| '🚀' | U+1F680 | ROCKET |
| '🏴‍☠️' | U+1F3F4, U+200D, U+2620, U+FE0F | Pirate Flag |
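These official names can also be looked up programmatically, for example with Python's standard `unicodedata` module (shown only as an illustration):

```python
import unicodedata

# unicodedata.name() returns the official Unicode name of a single code point;
# the second argument is a fallback for code points without a formal name.
for char in ["a", "Я", "和", "🚀"]:
    print(f"U+{ord(char):04X}  {unicodedata.name(char, '(no name)')}")

# U+0061  LATIN SMALL LETTER A
# U+042F  CYRILLIC CAPITAL LETTER YA
# U+548C  CJK UNIFIED IDEOGRAPH-548C
# U+1F680  ROCKET
```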
One of the main differences with ASCII is that Unicode characters are encoded into a variable number of bytes. For example, the pirate flag '🏴‍☠️' takes 13 bytes in UTF-8! This makes processing Unicode strings complex.
We suggest reading about code points, graphemes, and combining character sequences to understand how to manage Unicode strings and why they can be complicated to work with (sorting, comparing...).
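A quick Python sketch (again, only for illustration) makes the variable length visible: each entry below is one visual symbol, but the number of code points and UTF-8 bytes differs:

```python
# UTF-8 is variable-length: different characters need different numbers of bytes.
for char in ["a", "Я", "和", "🚀", "🏴‍☠️"]:
    utf8 = char.encode("utf-8")
    print(f"{char!r}: {len(char)} code point(s), {len(utf8)} byte(s) in UTF-8")

# 'a'   : 1 code point,  1 byte
# 'Я'   : 1 code point,  2 bytes
# '和'  : 1 code point,  3 bytes
# '🚀'  : 1 code point,  4 bytes
# '🏴‍☠️': 4 code points, 13 bytes
```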
Terms¶
Every character defined by the Unicode standard has a unique identifier, and characters are categorized into blocks, such as Latin, Arabic, or Cyrillic.
The key terms to understand about Unicode are:
- A code point is a number that represents a Unicode character in the Unicode character set. Think of this number as an index into the Unicode character set. It isn't yet encoded. It is written with a 'U+' prefix, like 'U+0061' for 'a'.
- A Unicode scalar value, or Unicode scalar, is any code point EXCEPT the surrogate code points. Surrogate pairs (special value ranges) are used by UTF-16 only. In UTF-8, Unicode scalar values are the same as code points.
- When encoded, a code point is represented as a sequence of integers called code units. The encoding determines the bit width of a code unit: 8, 16, or 32 bits. For example, here is how the single code point U+1D11E (MUSICAL SYMBOL G CLEF, '𝄞') is encoded (see also the sketch after this list):
| Character Encoding | Code unit size | Encoded value | # of bytes | # of code units |
|---|---|---|---|---|
| UTF-8 | 8-bit | 0xF0 0x9D 0x84 0x9E | 4 | 4 |
| UTF-16 | 16-bit | 0xD834 0xDD1E | 4 | 2 |
| UTF-32 | 32-bit | 0x0001D11E | 4 | 1 |
- Two or more Unicode scalar values can be combined to create grapheme clusters, or simply graphemes. A grapheme is drawn as a single visual symbol. For example, the pirate flag 🏴‍☠️ (see the sketch after this list):
  - is one human symbol on screen,
  - is represented by 4 code points: U+1F3F4, U+200D, U+2620, U+FE0F,
  - is encoded as 13 UTF-8 code units: F0 9F 8F B4 E2 80 8D E2 98 A0 EF B8 8F.
- A string consists of Unicode scalars (one code point) and graphemes (two or more code points).
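The sketch below (Python, used here only for illustration) ties these terms together: it encodes the single code point U+1D11E with each UTF to show the code units, then breaks the pirate flag grapheme down into its code points and UTF-8 code units. Note that Python's standard library does not segment graphemes, so the "1 grapheme" count is stated rather than computed; a dedicated library would be needed for that.

```python
# Code units: the same code point, U+1D11E, encoded with the three UTFs.
clef = "\U0001D11E"
print(clef.encode("utf-8").hex(" "))      # f0 9d 84 9e  -> 4 8-bit code units
print(clef.encode("utf-16-be").hex(" "))  # d8 34 dd 1e  -> 2 16-bit code units
print(clef.encode("utf-32-be").hex(" "))  # 00 01 d1 1e  -> 1 32-bit code unit

# Grapheme vs. code points vs. code units for the pirate flag.
flag = "\U0001F3F4\u200D\u2620\uFE0F"     # 🏴‍☠️ built from its 4 code points
print([f"U+{ord(c):04X}" for c in flag])  # ['U+1F3F4', 'U+200D', 'U+2620', 'U+FE0F']
print(len(flag))                          # 4 code points (Unicode scalars)
print(len(flag.encode("utf-8")))          # 13 UTF-8 code units (bytes)
# On screen, those 4 code points render as 1 grapheme: a single pirate flag.
```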