Unicode

The characters “U+” are an ASCIIfied version of the MULTISET UNION “⊎” U+228E character (the U-like union symbol with a plus sign inside it), which was meant to symbolize Unicode as the union of character sets
— codepoint - Why is 'U+' used to designate a Unicode code point? - Stack Overflow

A Programmer’s Introduction to Unicode – Nathan Reed’s coding blog
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software

Composite and Precomposed Characters

Unicode contains a vast number of characters, and some text can be encoded in more than one way: different code point sequences can represent what is in fact the same character. A simple example is the letter e-acute: it can be represented by the precomposed character é (U+00E9), which in UTF-8 encoding is the two hex bytes c3 a9, or by the letter e followed by a combining acute accent (U+0065 U+0301), which is the three hex bytes 65 cc 81. In some fonts there may be small differences in rendering, but in most cases we see identical characters and expect our computers to treat them the same.

é (U+00E9) and é (U+0065 U+0301)
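
To see that the two forms really are different data, here is a minimal Python sketch. The variable names are only illustrative, and the strings are written with escapes so the example does not depend on how an editor saves the file.

```python
# Two ways to write e-acute, spelled with escapes to avoid ambiguity.
precomposed = "\u00e9"   # single code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# UTF-8 bytes of each form, as space-separated hex (bytes.hex(sep) needs Python 3.8+).
precomposed_hex = precomposed.encode("utf-8").hex(" ")
decomposed_hex = decomposed.encode("utf-8").hex(" ")

print(precomposed_hex)                    # c3 a9
print(decomposed_hex)                     # 65 cc 81
print(len(precomposed), len(decomposed))  # 1 2 (code points, not bytes)
print(precomposed == decomposed)          # False: a plain comparison sees different strings
```

A plain `==` compares code points, so strings that look identical on screen can still compare unequal unless they are normalized first.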

Normalization forms:

NFC: Precomposed string with canonical mapping
NFD: Decomposed string with canonical mapping
NFKC: Precomposed string with compatibility mapping
NFKD: Decomposed string with compatibility mapping
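
The four forms can be compared with Python's standard `unicodedata.normalize`. This is a small sketch rather than a complete treatment: the helper `code_points` and the choice of the ﬁ ligature (U+FB01) as a compatibility example are assumptions added for illustration.

```python
import unicodedata

decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI, which has only a compatibility mapping


def code_points(text: str) -> str:
    """Illustrative helper: render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Normalize both samples under each form and collect the resulting code points.
results = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    results[form] = (
        code_points(unicodedata.normalize(form, decomposed)),
        code_points(unicodedata.normalize(form, ligature)),
    )
    print(form, results[form])

# Comparing after normalizing both sides to the same form treats the two spellings as equal.
same = unicodedata.normalize("NFC", "\u00e9") == unicodedata.normalize("NFC", decomposed)
print(same)  # True
```

The canonical forms (NFC, NFD) leave the ligature alone, while the compatibility forms (NFKC, NFKD) replace it with the plain letters f and i. The compatibility forms can therefore change how text looks and are not always reversible.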


