X-Git-Url: https://git.saurik.com/wxWidgets.git/blobdiff_plain/727aa9062ba6ffa3153069e15df38dca958172d5..c8c77ee2af68bcea8ba157b4d5a4e2cd5b4912bd:/docs/doxygen/overviews/unicode.h

diff --git a/docs/doxygen/overviews/unicode.h b/docs/doxygen/overviews/unicode.h
index e50454a1cd..4013ff8e06 100644
--- a/docs/doxygen/overviews/unicode.h
+++ b/docs/doxygen/overviews/unicode.h
@@ -3,7 +3,7 @@
 // Purpose: topic overview
 // Author: wxWidgets team
 // RCS-ID: $Id$
-// Licence: wxWindows license
+// Licence: wxWindows licence
 /////////////////////////////////////////////////////////////////////////////
 
 /**
@@ -49,8 +49,8 @@ other services should be ready to deal with Unicode.
 
 When working with Unicode, it's important to define the meaning of some terms.
 
-A glyph is a particular image that represents a character or part
-of a character.
+A glyph is a particular image (usually part of a font) that
+represents a character or part of a character.
 Any character may have one or more glyph associated; e.g. some of the possible
 glyphs for the capital letter 'A' are:
 
@@ -60,7 +60,13 @@ Unicode assigns each character of almost any existing alphabet/script a number,
 which is called code point; it's typically indicated in documentation manuals
 and in the Unicode website as @c U+xxxx where @c xxxx is an hexadecimal number.
 
-The Unicode standard divides the space of all possible code points in @e planes;
+Note that typically one character is assigned exactly one code point, but there
+are exceptions; the so-called precomposed characters
+(see http://en.wikipedia.org/wiki/Precomposed_character) or the ligatures.
+In these cases a single "character" may be mapped to more than one code point or
+viceversa more characters may be mapped to a single code point.
+
+The Unicode standard divides the space of all possible code points in planes;
 a plane is a range of 65,536 (10000<sub>16</sub>) contiguous Unicode code points.
 Planes are numbered from 0 to 16, where the first one is the @e BMP, or Basic
 Multilingual Plane.
@@ -73,7 +79,7 @@ Code points are represented in computer memory as a sequence of one or more
 More precisely, a code unit is the minimal bit combination that can represent a
 unit of encoded text for processing or interchange.
 
-The @e UTF or Unicode Transformation Formats are algorithms mapping the Unicode
+The UTF or Unicode Transformation Formats are algorithms mapping the Unicode
 code points to code unit sequences. The simplest of them is UTF-32 where
 each code unit is composed by 32 bits (4 bytes) and each code point is always
 represented by a single code unit (fixed length encoding).
@@ -129,7 +135,7 @@ programs require the Microsoft Layer for Unicode to run on Windows 95/98/ME.
 However, unlike the Unicode build mode of the previous versions of wxWidgets, this
 support is mostly transparent: you can still continue to work with the @b narrow
 (i.e. current locale-encoded @c char*) strings even if @b wide
-(i.e. UTF16/UCS2-encoded @c wchar_t* or UTF8-encoded @c char*) strings are also
+(i.e. UTF16-encoded @c wchar_t* or UTF8-encoded @c char*) strings are also
 supported. Any wxWidgets function accepts arguments of either type as both
 kinds of strings are implicitly converted to wxString, so both
 @code
@@ -192,7 +198,7 @@ work. Here are some examples, using a wxString object @c s and some integer
 @c n:
 
  - Writing @code switch ( s[n] ) @endcode doesn't work because the argument of
-   the switch statement must an integer expression so you need to replace
+   the switch statement must be an integer expression so you need to replace
    @c s[n] with @code s[n].GetValue() @endcode. You may also force the
    conversion to @c char or @c wchar_t by using an explicit cast but beware that
    converting the value to char uses the conversion to current locale and may
@@ -224,7 +230,7 @@ problems:
  - Using a cast to force the issue (listed only for completeness):
    @code printf("Hello, %s", (const char *)s.c_str()) @endcode
 
- - The result of @c c_str() can not be cast to @c char* but only to @c const @c
+ - The result of @c c_str() cannot be cast to @c char* but only to @c const @c
    @c char*. Of course, modifying the string via the pointer returned by this
    method has never been possible but unfortunately it was occasionally useful
    to use a @c const_cast here to pass the value to const-incorrect functions.
@@ -366,19 +372,13 @@ const char *p = s.ToUTF8();
 puts(p); // or call any other function taking const char *
 @endcode
 does @b not work because the temporary buffer returned by wxString::ToUTF8() is
-destroyed and @c p is left pointing nowhere. To correct this you may use
+destroyed and @c p is left pointing nowhere. To correct this you should use
 @code
-wxCharBuffer p(s.ToUTF8());
+const wxScopedCharBuffer p(s.ToUTF8());
 puts(p);
 @endcode
-which does work but results in an unnecessary copy of string data in the build
-configurations when wxString::ToUTF8() returns the pointer to internal string buffer.
-If this inefficiency is important you may write
-@code
-const wxUTF8Buf p(s.ToUTF8());
-puts(p);
-@endcode
-where @c wxUTF8Buf is the type corresponding to the real return type of wxString::ToUTF8().
+which does work.
+
 Similarly, wxWX2WCbuf can be used for the return type of wxString::wc_str().
 But, once again, none of these cryptic types is really needed if you just pass
 the return value of any of the functions mentioned in this section to another
@@ -386,7 +386,7 @@ function directly.
 
 @section overview_unicode_settings Unicode Related Compilation Settings
 
-@c wxUSE_UNICODE is now defined as 1 by default to indicate Unicode support.
+@c wxUSE_UNICODE is now defined as @c 1 by default to indicate Unicode support.
 If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is
 also defined, otherwise @c wxUSE_UNICODE_WCHAR is.