X-Git-Url: https://git.saurik.com/wxWidgets.git/blobdiff_plain/727aa9062ba6ffa3153069e15df38dca958172d5..ae901b234c4a0aa7c1777b3bd181dd7f8517ad21:/docs/doxygen/overviews/unicode.h

diff --git a/docs/doxygen/overviews/unicode.h b/docs/doxygen/overviews/unicode.h
index e50454a1cd..a84dc50aa1 100644
--- a/docs/doxygen/overviews/unicode.h
+++ b/docs/doxygen/overviews/unicode.h
@@ -49,8 +49,8 @@ other services should be ready to deal with Unicode.
 
 When working with Unicode, it's important to define the meaning of some terms.
 
-A glyph is a particular image that represents a character or part
-of a character.
+A glyph is a particular image (usually part of a font) that
+represents a character or part of a character.
 Any character may have one or more glyph associated; e.g. some of the possible
 glyphs for the capital letter 'A' are:
 
@@ -60,7 +60,13 @@ Unicode assigns each character of almost any existing alphabet/script a number,
 which is called code point; it's typically indicated in documentation
 manuals and in the Unicode website as @c U+xxxx where @c xxxx is an hexadecimal number.
 
-The Unicode standard divides the space of all possible code points in @e planes;
+Note that typically one character is assigned exactly one code point, but there
+are exceptions; the so-called precomposed characters
+(see http://en.wikipedia.org/wiki/Precomposed_character) or the ligatures.
+In these cases a single "character" may be mapped to more than one code point or
+viceversa more characters may be mapped to a single code point.
+
+The Unicode standard divides the space of all possible code points in planes;
 a plane is a range of 65,536 (1000016) contiguous Unicode code points.
 Planes are numbered from 0 to 16, where the first one is the @e BMP, or Basic
 Multilingual Plane.
@@ -73,7 +79,7 @@ Code points are represented in computer memory as a sequence of one or more
 More precisely, a code unit is the minimal bit combination that can represent a
 unit of encoded text for processing or interchange.
 
-The @e UTF or Unicode Transformation Formats are algorithms mapping the Unicode
+The UTF or Unicode Transformation Formats are algorithms mapping the Unicode
 code points to code unit sequences. The simplest of them is UTF-32 where
 each code unit is composed by 32 bits (4 bytes) and each code point is always
 represented by a single code unit (fixed length encoding).
@@ -129,7 +135,7 @@ programs require the Microsoft Layer for Unicode to run on Windows 95/98/ME.
 However, unlike the Unicode build mode of the previous versions of wxWidgets, this
 support is mostly transparent: you can still continue to work with the @b narrow
 (i.e. current locale-encoded @c char*) strings even if @b wide
-(i.e. UTF16/UCS2-encoded @c wchar_t* or UTF8-encoded @c char*) strings are also
+(i.e. UTF16-encoded @c wchar_t* or UTF8-encoded @c char*) strings are also
 supported. Any wxWidgets function accepts arguments of either type as both
 kinds of strings are implicitly converted to wxString, so both
 @code
@@ -366,19 +372,13 @@ const char *p = s.ToUTF8();
 puts(p); // or call any other function taking const char *
 @endcode
 does @b not work because the temporary buffer returned by wxString::ToUTF8() is
-destroyed and @c p is left pointing nowhere. To correct this you may use
+destroyed and @c p is left pointing nowhere. To correct this you should use
 @code
-wxCharBuffer p(s.ToUTF8());
+const wxScopedCharBuffer p(s.ToUTF8());
 puts(p);
 @endcode
-which does work but results in an unnecessary copy of string data in the build
-configurations when wxString::ToUTF8() returns the pointer to internal string buffer.
-If this inefficiency is important you may write
-@code
-const wxUTF8Buf p(s.ToUTF8());
-puts(p);
-@endcode
-where @c wxUTF8Buf is the type corresponding to the real return type of wxString::ToUTF8().
+which does work.
+
 Similarly, wxWX2WCbuf can be used for the return type of wxString::wc_str().
 But, once again, none of these cryptic types is really needed if you just pass
 the return value of any of the functions mentioned in this section to another
@@ -386,7 +386,7 @@ function directly.
 
 @section overview_unicode_settings Unicode Related Compilation Settings
 
-@c wxUSE_UNICODE is now defined as 1 by default to indicate Unicode support.
+@c wxUSE_UNICODE is now defined as @c 1 by default to indicate Unicode support.
 If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is
 also defined, otherwise @c wxUSE_UNICODE_WCHAR is.
 