Unicode
The characters “U+” are an ASCIIfied version of the MULTISET UNION “⊎” U+228E character (the U-like union symbol with a plus sign inside it), which was meant to symbolize Unicode as the union of character sets
— codepoint - Why is 'U+' used to designate a Unicode code point? - Stack Overflow
Composite and Precomposed Characters
Unicode contains a vast number of characters, many of which have different Unicode numbers, but are in fact the same character. A simple example is the letter e-acute: this can be represented by é, which in UTF-8 encoding is the two hex bytes c3 a9, or by é, which is the three hex bytes 65 cc 81. In some fonts there may be small differences, but in most cases we see identical characters and expect our computers to treat them the same.
é and é
Apfelstrudel, Downloads – The Eclectic Light Company
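The byte sequences quoted above can be checked directly. The following is a minimal Python sketch, not part of the quoted article; the variable names are illustrative, and the escapes `\u00e9` and `e\u0301` are used so that the two forms cannot be silently converted by an editor.

```python
# Precomposed form: a single code point, U+00E9 LATIN SMALL LETTER E WITH ACUTE.
precomposed = "\u00e9"
# Decomposed form: 'e' (U+0065) followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "e\u0301"

# UTF-8 bytes of each form, as space-separated hex (bytes.hex(sep) needs Python 3.8+).
utf8_precomposed = precomposed.encode("utf-8").hex(" ")
utf8_decomposed = decomposed.encode("utf-8").hex(" ")

# The two strings usually render identically but are different code point sequences.
same_string = precomposed == decomposed

print(utf8_precomposed)
print(utf8_decomposed)
print(len(precomposed), len(decomposed), same_string)
```

Without normalization, a plain string comparison treats the two forms as different, which is why file names or search terms that look identical can fail to match.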
Normalization forms:
NFC: Precomposed string with canonical mapping
NFD: Decomposed string with canonical mapping
NFKC: Precomposed string with compatibility mapping
NFKD: Decomposed string with compatibility mapping
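As a rough illustration of the four forms, here is a short sketch using `unicodedata.normalize` from Python's standard library. The sample strings (e-acute and the "fi" ligature U+FB01) are chosen for this example and do not come from the sources quoted above.

```python
import unicodedata

precomposed = "\u00e9"   # e-acute as one code point
decomposed = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI, a compatibility character

# Canonical forms: NFC composes, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

# Canonical normalization leaves the ligature alone;
# compatibility normalization replaces it with plain "fi".
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
nfkd_ligature = unicodedata.normalize("NFKD", ligature)

# After normalizing both sides to the same form, the strings compare equal.
equal_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)
```

Which form to use depends on the application; the main point is to normalize both sides of a comparison to the same form. The compatibility forms (NFKC, NFKD) discard distinctions such as ligatures, so they suit matching and search more than storage of original text.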