General Concepts for Multilingual Programming

Localized or Unicode Character Encoding

Make sure you know which character encoding your data uses. Some languages make this easier to handle than others, and a few make it very difficult. Encoding is the process of taking characters and representing them as bytes in a specific format. Decoding is the reverse: taking bytes in a specific format and extracting the characters from them. It is usually safe to assume that the Unicode character set is a superset of the characters covered by other encodings, that being the whole intent.
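As a minimal sketch of encoding and decoding, the Python snippet below uses only the built-in str.encode and bytes.decode methods. The sample word and the variable names are illustrative assumptions, and UTF-8 and Latin-1 are picked merely as two common encodings to compare.

```python
# The same text, written with an escape so the source file stays ASCII.
text = "caf\u00e9"  # "café"

# Encoding: characters -> bytes. The result depends on the encoding chosen.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# Decoding: bytes -> characters. This only round-trips cleanly when the
# same encoding is used in both directions.
roundtrip_utf8 = utf8_bytes.decode("utf-8")
roundtrip_latin1 = latin1_bytes.decode("latin-1")

# Decoding with the wrong encoding can silently produce garbage...
mojibake = utf8_bytes.decode("latin-1")

# ...or fail outright, depending on the bytes involved.
try:
    latin1_bytes.decode("utf-8")
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True

print(utf8_bytes, latin1_bytes, repr(mojibake), wrong_decode_failed)
```

The two byte strings differ even though they stand for the same characters, which is why the encoding has to be known before any bytes are interpreted as text.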

Unicode Character Normalization

There are several reasons why a string might need to be normalized before you work with it. One common operation that requires normalization is comparing two Unicode strings for equality. Normalization is the process of giving every equivalent grapheme the same underlying representation. This is necessary because the same grapheme can be stored in more than one way. For example, an accented a can be a single precomposed code point (U+00E1), or a plain a followed by a combining acute accent (U+0061 U+0301). The first is one code point and the second is two, so a naive comparison treats them as different. A normalization form picks a single representation for all the code point sequences that represent the same thing.
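The comparison problem can be sketched in Python with the standard library's unicodedata.normalize function. The helper name same_text is a hypothetical one introduced here for illustration, and the choice of NFC over NFD is an assumption; either works as long as both sides use the same form.

```python
import unicodedata

precomposed = "\u00e1"   # accented a as a single code point
decomposed = "a\u0301"   # plain a followed by a combining acute accent

# Without normalization the two strings compare as different.
naive_equal = precomposed == decomposed

# NFC composes where possible; NFD decomposes where possible.
nfc_form = unicodedata.normalize("NFC", decomposed)
nfd_form = unicodedata.normalize("NFD", precomposed)


def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(naive_equal, same_text(precomposed, decomposed))
```

Normalizing both operands at the point of comparison keeps the sketch simple; a real program might instead normalize text once, when it enters the system.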

What is a grapheme?

A grapheme is an individual symbol that the user recognizes as singular and self-contained, regardless of how many code points are used to store it. The precomposed accented a and the a followed by a combining accent both represent the same grapheme. That is why they should be mapped to the same representation during any normalization step.
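To make the difference between code points and graphemes concrete, here is a rough Python sketch. The function approx_grapheme_count is a hypothetical helper, and it is only an approximation: it skips combining marks using unicodedata.combining, and it does not implement the full grapheme cluster rules of the Unicode standard (for instance, it does not handle emoji joined with zero-width joiners).

```python
import unicodedata


def approx_grapheme_count(s: str) -> int:
    """Hypothetical helper: count code points that are not combining marks.

    This is a simplification for illustration, not a full grapheme
    segmentation algorithm.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


precomposed = "\u00e1"   # one code point
decomposed = "a\u0301"   # two code points, same grapheme

# len() counts code points, so the two forms report different lengths...
code_point_lengths = (len(precomposed), len(decomposed))

# ...while the approximate grapheme count is the same for both.
grapheme_counts = (
    approx_grapheme_count(precomposed),
    approx_grapheme_count(decomposed),
)

print(code_point_lengths, grapheme_counts)
```

For anything beyond simple accented letters, a dedicated segmentation library is likely a better fit than this sketch.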