Unicode

The characters “U+” are an ASCIIfied version of the MULTISET UNION “⊎” U+228E character (the U-like union symbol with a plus sign inside it), which was meant to symbolize Unicode as the union of character sets.

Source: Stack Overflow, “Why is 'U+' used to designate a Unicode code point?”

Composite and Precomposed Characters

Unicode contains many characters that can be written as more than one sequence of code points while still being the same character. A simple example is the letter e-acute: it can be represented by the single precomposed character é (U+00E9), which in UTF-8 encoding is the two hex bytes c3 a9, or by the letter e followed by a combining acute accent (U+0065 U+0301), which is the three hex bytes 65 cc 81. In some fonts the two forms may render slightly differently, but in most cases they look identical, and we expect our computers to treat them the same.

é and é
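The difference between the two forms can be checked directly. The following is a minimal Python sketch; the helper name utf8_hex is made up for this illustration, and the two strings are written with escape sequences so the example does not depend on how this page's text was encoded.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))

# Precomposed form: LATIN SMALL LETTER E WITH ACUTE
precomposed = "\u00e9"
# Decomposed form: LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
decomposed = "e\u0301"

print(utf8_hex(precomposed))              # c3 a9
print(utf8_hex(decomposed))               # 65 cc 81
print(len(precomposed), len(decomposed))  # 1 2 (code points, not bytes)
print(precomposed == decomposed)          # False: plain comparison sees different code points
```

A plain string comparison reports the two as different, which is why normalization is needed before comparing, sorting, or hashing text.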

Normalization forms:

  • NFC: canonical decomposition followed by canonical composition (precomposed form)

  • NFD: canonical decomposition (decomposed form)

  • NFKC: compatibility decomposition followed by canonical composition (precomposed form)

  • NFKD: compatibility decomposition (decomposed form)
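As a rough illustration of the four forms, here is a short Python sketch using the standard library's unicodedata.normalize. The helper same_text and the choice of the "fi" ligature (U+FB01) as a compatibility example are assumptions made for this sketch, not something taken from the notes above.

```python
import unicodedata

def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

precomposed = "\u00e9"   # e-acute as one code point
decomposed = "e\u0301"   # e followed by a combining acute accent
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI

# Canonical forms: NFC composes, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)    # "\u00e9"
nfd = unicodedata.normalize("NFD", precomposed)   # "e\u0301"

# Compatibility forms also replace characters such as ligatures.
nfc_lig = unicodedata.normalize("NFC", ligature)                  # unchanged: "\ufb01"
nfkc_lig = unicodedata.normalize("NFKC", ligature)                # "fi"
nfkd_mix = unicodedata.normalize("NFKD", ligature + precomposed)  # "fie\u0301"

print(same_text(precomposed, decomposed))  # True
print(nfkc_lig)                            # fi
```

The compatibility forms (NFKC, NFKD) lose distinctions such as ligatures, so they tend to suit searching and matching, while the canonical forms (NFC, NFD) are the safer choice when the text itself has to be preserved.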
