-@section overview_unicode_encodings Unicode Representations
-
-Unicode provides a unique code to identify every character, however in practice
-these codes are not always used directly but encoded using one of the standard
-UTF or Unicode Transformation Formats which are algorithms mapping the Unicode
-codes to byte code sequences. The simplest of them is UTF-32 which simply maps
-the Unicode code to a 4 byte sequence representing this 32 bit number (although
-this is still not completely trivial as the mapping is different for little and
-big-endian architectures). UTF-32 is commonly used under Unix systems for
-internal representation of Unicode strings. Another very widespread standard is
-UTF-16 which is used by Microsoft Windows: it encodes the first (approximately)
-64 thousands of Unicode characters using only 2 bytes and uses a pair of 16-bit
-codes to encode the characters beyond this. Finally, the most widespread
-encoding used for the external Unicode storage (e.g. files and network
-protocols) is UTF-8 which is byte-oriented and so avoids the endianness
-ambiguities of UTF-16 and UTF-32. However UTF-8 uses a variable number of bytes
-for representing Unicode characters which makes it less efficient than UTF-32
-for internal representation.
-
-From the C/C++ programmer perspective the situation is further complicated by
-the fact that the standard type @c wchar_t which is used to represent the
+@section overview_unicode_encodings Unicode Representations and Terminology
+
+When working with Unicode, it's important to precisely define some basic terms.
+
+A <b><em>glyph</em></b> is a particular image (usually part of a font) that
+represents a character or part of a character.
+A character may have one or more glyphs associated with it; e.g. some of the
+possible glyphs for the capital letter 'A' are:
+
+@image html overview_unicode_glyphs.png
+
+Unicode assigns each character of almost any existing alphabet/script a number,
+called its <b><em>code point</em></b>; it's typically written in documentation
+and on the Unicode website as @c U+xxxx, where @c xxxx is a hexadecimal number.
+
+Note that typically one character is assigned exactly one code point, but there
+are exceptions: the so-called <em>precomposed characters</em>
+(see http://en.wikipedia.org/wiki/Precomposed_character) and the <em>ligatures</em>.
+In these cases a single "character" may be mapped to more than one code point or,
+vice versa, several characters may be mapped to a single code point.
+
+The Unicode standard divides the space of all possible code points into
+<b><em>planes</em></b>; a plane is a range of 65,536 (0x10000) contiguous
+Unicode code points.
+Planes are numbered from 0 to 16, where the first one is the @e BMP, or Basic
+Multilingual Plane.
+The BMP contains characters for all modern languages and a large number of
+special characters. The other planes contain mainly historic scripts and
+special-purpose characters, or are unused.
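+
+For instance, the plane which a given code point belongs to can be computed by
+simply dividing by the plane size; a minimal C++ sketch (the function name is
+purely illustrative, not part of any library API):
+
+@code
+#include <cstdint>
+
+// Return the Unicode plane (0..16) containing the given code point;
+// each plane spans 0x10000 code points.
+unsigned GetUnicodePlane(std::uint32_t codePoint)
+{
+    return codePoint / 0x10000; // equivalently: codePoint >> 16
+}
+
+// GetUnicodePlane(0x0041)  == 0 (LATIN CAPITAL LETTER A, in the BMP)
+// GetUnicodePlane(0x1F600) == 1 (GRINNING FACE, outside the BMP)
+@endcode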
+
+Code points are represented in computer memory as a sequence of one or more
+<b><em>code units</em></b>, where a code unit is a unit of memory: 8, 16, or 32 bits.
+More precisely, a code unit is the minimal bit combination that can represent a
+unit of encoded text for processing or interchange.
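+
+In C++ terms, the natural code unit types are @c char (8 bits), @c char16_t
+(16 bits) and @c char32_t (32 bits); a small sketch, assuming a C++11 compiler
+(note that since C++20 the type of @c u8 literals is @c char8_t instead):
+
+@code
+// The same one-character text stored using the three code unit sizes:
+const char*     utf8  = u8"A"; // 8 bit code units (UTF-8)
+const char16_t* utf16 = u"A";  // 16 bit code units (UTF-16)
+const char32_t* utf32 = U"A";  // 32 bit code units (UTF-32)
+@endcode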
+
+The <b><em>UTF</em></b> or Unicode Transformation Formats are algorithms mapping the Unicode
+code points to code unit sequences. The simplest of them is <b>UTF-32</b>, where
+each code unit is composed of 32 bits (4 bytes) and each code point is always
+represented by a single code unit (a fixed-length encoding).
+Note that even UTF-32 is not completely trivial, as the byte order of its code
+units differs between little-endian and big-endian architectures.
+UTF-32 is commonly used under Unix systems for internal representation of
+Unicode strings.
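+
+To illustrate the endianness issue, here is a small self-contained sketch
+(illustrative only) showing how the bytes of a single UTF-32 code unit can be
+inspected on the current architecture:
+
+@code
+#include <cstdint>
+#include <cstring>
+
+int main()
+{
+    std::uint32_t codeUnit = 0x41; // U+0041, LATIN CAPITAL LETTER A
+
+    unsigned char bytes[4];
+    std::memcpy(bytes, &codeUnit, 4);
+
+    // On a little-endian machine bytes is { 0x41, 0x00, 0x00, 0x00 }
+    // (UTF-32LE); on a big-endian one it is { 0x00, 0x00, 0x00, 0x41 }
+    // (UTF-32BE).
+}
+@endcode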
+
+Another very widespread standard is <b>UTF-16</b>, which is used by Microsoft
+Windows: it encodes the first (approximately) 65,536 Unicode code points
+(the BMP plane) using single 16-bit code units (2 bytes), and uses a pair of
+16-bit code units, called a <em>surrogate pair</em>, to encode the code points
+beyond it. Thus UTF-16 uses a variable number of code units to encode each
+code point.
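+
+The mapping to a surrogate pair is simple arithmetic on the code point value;
+a minimal sketch of the standard algorithm (the function name is illustrative,
+not an existing API):
+
+@code
+#include <cstdint>
+
+// Encode a code point beyond the BMP (U+10000..U+10FFFF) as a UTF-16
+// surrogate pair; the caller must ensure codePoint >= 0x10000.
+void EncodeSurrogatePair(std::uint32_t codePoint,
+                         std::uint16_t& high, std::uint16_t& low)
+{
+    const std::uint32_t v = codePoint - 0x10000; // 20 significant bits remain
+
+    high = 0xD800 + (v >> 10);   // lead (high) surrogate: top 10 bits
+    low  = 0xDC00 + (v & 0x3FF); // trail (low) surrogate: bottom 10 bits
+}
+
+// E.g. U+1F600 (GRINNING FACE) becomes the pair 0xD83D, 0xDE00.
+@endcode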
+
+Finally, the most widespread encoding used for external Unicode storage
+(e.g. files and network protocols) is <b>UTF-8</b>, which is byte-oriented and so
+avoids the endianness ambiguities of UTF-16 and UTF-32.
+UTF-8 uses code units of 8 bits (1 byte); code points beyond the ASCII range
+are represented using a variable number of bytes (from 2 to 4), which makes it
+less efficient than UTF-32 for internal representation.
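+
+For reference, here is a minimal sketch of the standard UTF-8 encoding
+algorithm (no validation of surrogates or out-of-range values is performed;
+the function name is purely illustrative):
+
+@code
+#include <cstdint>
+
+// Encode a single code point as UTF-8; writes 1 to 4 bytes into "out"
+// and returns the number of bytes written.
+int EncodeUTF8(std::uint32_t cp, unsigned char* out)
+{
+    if ( cp < 0x80 )        // 7 bits: plain ASCII, a single byte
+    {
+        out[0] = cp;
+        return 1;
+    }
+    if ( cp < 0x800 )       // up to 11 bits: two bytes
+    {
+        out[0] = 0xC0 | (cp >> 6);
+        out[1] = 0x80 | (cp & 0x3F);
+        return 2;
+    }
+    if ( cp < 0x10000 )     // up to 16 bits: three bytes
+    {
+        out[0] = 0xE0 | (cp >> 12);
+        out[1] = 0x80 | ((cp >> 6) & 0x3F);
+        out[2] = 0x80 | (cp & 0x3F);
+        return 3;
+    }
+                            // up to 21 bits: four bytes
+    out[0] = 0xF0 | (cp >> 18);
+    out[1] = 0x80 | ((cp >> 12) & 0x3F);
+    out[2] = 0x80 | ((cp >> 6) & 0x3F);
+    out[3] = 0x80 | (cp & 0x3F);
+    return 4;
+}
+
+// E.g. U+20AC (EURO SIGN) is encoded as the 3 bytes 0xE2 0x82 0xAC.
+@endcode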
+
+As a visual aid to understand the differences between the various concepts
+described so far, look at the different UTF representations of the same code point:
+
+@image html overview_unicode_codes.png
+
+In this particular case UTF-8 requires more space than UTF-16 (3 bytes instead of 2).
+
+Note that from the C/C++ programmer's perspective the situation is further complicated
+by the fact that the standard type @c wchar_t which is usually used to represent the