X-Git-Url: https://git.saurik.com/wxWidgets.git/blobdiff_plain/77ef61f539f418d21989856f8cfcd3fc287443f2..4e15d1caa03346c126015019c1fdf093033ef40b:/docs/doxygen/overviews/unicode.h?ds=sidebyside diff --git a/docs/doxygen/overviews/unicode.h b/docs/doxygen/overviews/unicode.h index e372007b1e..2f42904b30 100644 --- a/docs/doxygen/overviews/unicode.h +++ b/docs/doxygen/overviews/unicode.h @@ -3,13 +3,15 @@ // Purpose: topic overview // Author: wxWidgets team // RCS-ID: $Id$ -// Licence: wxWindows license +// Licence: wxWindows licence ///////////////////////////////////////////////////////////////////////////// /** @page overview_unicode Unicode Support in wxWidgets +@tableofcontents + This section describes how does wxWidgets support Unicode and how can it affect your programs. @@ -19,15 +21,8 @@ correct any more. Please see @ref overview_changes_unicode for the details of these changes. You can skip the first two sections if you're already familiar with Unicode and -wish to jump directly in the details of its support in the library: -@li @ref overview_unicode_what -@li @ref overview_unicode_encodings -@li @ref overview_unicode_supportin -@li @ref overview_unicode_pitfalls -@li @ref overview_unicode_supportout -@li @ref overview_unicode_settings +wish to jump directly in the details of its support in the library. -
@section overview_unicode_what What is Unicode? @@ -49,30 +44,40 @@ other services should be ready to deal with Unicode. When working with Unicode, it's important to define the meaning of some terms. -A @e glyph is a particular image that represents a @e character or part of a character. +A glyph is a particular image (usually part of a font) that +represents a character or part of a character. Any character may have one or more glyph associated; e.g. some of the possible glyphs for the capital letter 'A' are: @image html overview_unicode_glyphs.png Unicode assigns each character of almost any existing alphabet/script a number, -which is called code point; it's typically indicated in documentation +which is called code point; it's typically indicated in documentation manuals and in the Unicode website as @c U+xxxx where @c xxxx is an hexadecimal number. -The Unicode standard divides the space of all possible code points in @e planes; +Note that typically one character is assigned exactly one code point, but there +are exceptions; the so-called precomposed characters +(see http://en.wikipedia.org/wiki/Precomposed_character) or the ligatures. +In these cases a single "character" may be mapped to more than one code point or +viceversa more characters may be mapped to a single code point. + +The Unicode standard divides the space of all possible code points in planes; a plane is a range of 65,536 (0x10000) contiguous Unicode code points. Planes are numbered from 0 to 16, where the first one is the @e BMP, or Basic Multilingual Plane. +The BMP contains characters for all modern languages, and a large number of +special characters. The other planes in fact contain mainly historic scripts, +special-purpose characters or are unused. Code points are represented in computer memory as a sequence of one or more -code units, where a code unit is a unit of memory: 8, 16, or 32 bits. +code units, where a code unit is a unit of memory: 8, 16, or 32 bits. 
More precisely, a code unit is the minimal bit combination that can represent a unit of encoded text for processing or interchange. -The @e UTF or Unicode Transformation Formats are algorithms mapping the Unicode +The UTF or Unicode Transformation Formats are algorithms mapping the Unicode code points to code unit sequences. The simplest of them is UTF-32 where -each code unit is composed by 32 bits (4 bytes) and each code point is represented -by a single code unit. +each code unit is composed by 32 bits (4 bytes) and each code point is always +represented by a single code unit (fixed length encoding). (Note that even UTF-32 is still not completely trivial as the mapping is different for little and big-endian architectures). UTF-32 is commonly used under Unix systems for internal representation of Unicode strings. @@ -81,6 +86,7 @@ Another very widespread standard is UTF-16 which is used by Microsoft Win it encodes the first (approximately) 64 thousands of Unicode code points (the BMP plane) using 16-bit code units (2 bytes) and uses a pair of 16-bit code units to encode the characters beyond this. These pairs are called @e surrogate. +Thus UTF16 uses a variable number of code units to encode each code point. Finally, the most widespread encoding used for the external Unicode storage (e.g. files and network protocols) is UTF-8 which is byte-oriented and so @@ -107,7 +113,7 @@ Typically when UTF8 is used, code units are stored into @c char types, since @c char are 8bit wide on almost all systems; when using UTF16 typically code units are stored into @c wchar_t types since @c wchar_t is at least 16bits on all systems. This is also the approach used by wxString. -See @ref overview_wxstring for more info. +See @ref overview_string for more info. See also http://unicode.org/glossary/ for the official definitions of the terms reported above. @@ -115,16 +121,19 @@ terms reported above. 
@section overview_unicode_supportin Unicode Support in wxWidgets -Since wxWidgets 3.0 Unicode support is always enabled and building the library -without it is not recommended any longer and will cease to be supported in the -near future. This means that internally only Unicode strings are used and that, -under Microsoft Windows, Unicode system API is used which means that wxWidgets -programs require the Microsoft Layer for Unicode to run on Windows 95/98/ME. +@subsection overview_unicode_support_default Unicode is Always Used by Default + +Since wxWidgets 3.0 Unicode support is always enabled and while building the +library without it is still possible, it is not recommended any longer and will +cease to be supported in the near future. This means that internally only +Unicode strings are used and that, under Microsoft Windows, Unicode system API +is used which means that wxWidgets programs require the Microsoft Layer for +Unicode to run on Windows 95/98/ME. However, unlike the Unicode build mode of the previous versions of wxWidgets, this support is mostly transparent: you can still continue to work with the @b narrow -(i.e. current-locale-encoded @c char*) strings even if @b wide -(i.e. UTF16/UCS2-encoded @c wchar_t* or UTF8-encoded @c char) strings are also +(i.e. current locale-encoded @c char*) strings even if @b wide +(i.e. UTF16-encoded @c wchar_t* or UTF8-encoded @c char*) strings are also supported. Any wxWidgets function accepts arguments of either type as both kinds of strings are implicitly converted to wxString, so both @code @@ -132,7 +141,7 @@ wxMessageBox("Hello, world!"); @endcode and the somewhat less usual @code -wxMessageBox(L"Salut \u00e0 toi!"); // 00E0 is "Latin Small Letter a with Grave" +wxMessageBox(L"Salut \u00E0 toi!"); // U+00E0 is "Latin Small Letter a with Grave" @endcode work as expected. @@ -147,9 +156,10 @@ in the case of gcc). 
In particular, the most common encoding used under modern Unix systems is UTF-8 and as the string above is not a valid UTF-8 byte sequence, nothing would be displayed at all in this case. Thus it is important to never use 8-bit (instead of 7-bit) characters directly in the program source -but use wide strings or, alternatively, write +but use wide strings or, alternatively, write: @code -wxMessageBox(wxString::FromUTF8("Salut \xc3\xa0 toi!")); +wxMessageBox(wxString::FromUTF8("Salut \xC3\xA0 toi!")); + // in UTF8 the character U+00E0 is encoded as 0xC3A0 @endcode In a similar way, wxString provides access to its contents as either @c wchar_t or @@ -169,6 +179,54 @@ in your program there is really nothing special to do. However you should be aware of the potential problems covered by the following section. +@subsection overview_unicode_support_utf Choosing Unicode Representation + +wxWidgets uses the system @c wchar_t in wxString implementation by default +under all systems. Thus, under Microsoft Windows, UCS-2 (simplified version of +UTF-16 without support for surrogate characters) is used as @c wchar_t is 2 +bytes on this platform. Under Unix systems, including Mac OS X, UCS-4 (also +known as UTF-32) is used by default, however it is also possible to build +wxWidgets to use UTF-8 internally by passing @c --enable-utf8 option to +configure. + +The interface provided by wxString is the same independently of the format used +internally. However different formats have specific advantages and +disadvantages. Notably, under Unix, the underlying graphical toolkit (e.g. +GTK+) usually uses UTF-8 encoded strings and using the same representations for +the strings in wxWidgets allows to avoid conversion from UTF-32 to UTF-8 and +vice versa each time a string is shown in the UI or retrieved from it. The +overhead of such conversions is usually negligible for small strings but may be +important for some programs. 
If you believe that it would be advantageous to +use UTF-8 for the strings in your particular application, you may rebuild +wxWidgets to use UTF-8 as explained above (notice that this is currently not +supported under Microsoft Windows and arguably doesn't make much sense there as +Windows itself uses UTF-16 and not UTF-8) but be sure to be aware of the +performance implications (see @ref overview_unicode_performance) of using UTF-8 +in wxString before doing this! + +Generally speaking you should only use non-default UTF-8 build in specific +circumstances e.g. building for resource-constrained systems where the overhead +of conversions (and also reduced memory usage of UTF-8 compared to UTF-32 for +the European languages) can be important. If the environment in which your +program is running is under your control -- as is quite often the case in such +scenarios -- consider ensuring that the system always uses UTF-8 locale and +use @c --enable-utf8only configure option to disable support for the other +locales and consider all strings to be in UTF-8. This further reduces the code +size and removes the need for conversions in more cases. + + +@subsection overview_unicode_settings Unicode Related Preprocessor Symbols + +@c wxUSE_UNICODE is defined as 1 now to indicate Unicode support. It can be +explicitly set to 0 in @c setup.h under MSW or you can use @c --disable-unicode +under Unix but doing this is strongly discouraged. By default, @c +wxUSE_UNICODE_WCHAR is also defined as 1, however in UTF-8 build (described in +the previous section), it is set to 0 and @c wxUSE_UNICODE_UTF8, which is +usually 0, is set to 1 instead. In the latter case, @c wxUSE_UTF8_LOCALE_ONLY +can also be set to 1 to indicate that all strings are considered to be in UTF-8. + + + @section overview_unicode_pitfalls Potential Unicode Pitfalls The problems can be separated into three broad classes: @@ -186,7 +244,7 @@ work. 
Here are some examples, using a wxString object @c s and some integer @c n: - Writing @code switch ( s[n] ) @endcode doesn't work because the argument of - the switch statement must an integer expression so you need to replace + the switch statement must be an integer expression so you need to replace @c s[n] with @code s[n].GetValue() @endcode. You may also force the conversion to @c char or @c wchar_t by using an explicit cast but beware that converting the value to char uses the conversion to current locale and may @@ -218,7 +276,7 @@ problems: - Using a cast to force the issue (listed only for completeness): @code printf("Hello, %s", (const char *)s.c_str()) @endcode - - The result of @c c_str() can not be cast to @c char* but only to @c const @c + - The result of @c c_str() cannot be cast to @c char* but only to @c const @c @c char*. Of course, modifying the string via the pointer returned by this method has never been possible but unfortunately it was occasionally useful to use a @c const_cast here to pass the value to const-incorrect functions. @@ -268,17 +326,18 @@ wxWidgets 3.0 and the new code should be used with this in mind and ideally avoiding implicit conversions to @c char*. -@subsection overview_unicode_performance Unicode Performance Implications +@subsection overview_unicode_performance Performance Implications of Using UTF-8 -Under Unix systems wxString class uses variable-width UTF-8 encoding for -internal representation and this implies that it can't guarantee constant-time -access to N-th element of the string any longer as to find the position of this -character in the string we have to examine all the preceding ones. 
Usually this -doesn't matter much because most algorithms used on the strings examine them -sequentially anyhow and because wxString implements a cache for iterating over -the string by index but it can have serious consequences for algorithms -using random access to string elements as they typically acquire O(N^2) time -complexity instead of O(N) where N is the length of the string. +As mentioned above, under Unix systems wxString class can use variable-width +UTF-8 encoding for internal representation. In this case it can't guarantee +constant-time access to N-th element of the string any longer as to find the +position of this character in the string we have to examine all the preceding +ones. Usually this doesn't matter much because most algorithms used on the +strings examine them sequentially anyhow and because wxString implements a +cache for iterating over the string by index but it can have serious +consequences for algorithms using random access to string elements as they +typically acquire O(N^2) time complexity instead of O(N) where N is the length +of the string. Even despite caching the index, indexed access should be replaced with sequential access using string iterators. For example a typical loop: @@ -327,6 +386,7 @@ different encoding of it. So you need to be able to convert the data to various representations and the wxString methods wxString::ToAscii(), wxString::ToUTF8() (or its synonym wxString::utf8_str()), wxString::mb_str(), wxString::c_str() and wxString::wc_str() can be used for this. + The first of them should be only used for the string containing 7-bit ASCII characters only, anything else will be replaced by some substitution character. 
wxString::mb_str() converts the string to the encoding used by the current locale @@ -359,33 +419,16 @@ const char *p = s.ToUTF8(); puts(p); // or call any other function taking const char * @endcode does @b not work because the temporary buffer returned by wxString::ToUTF8() is -destroyed and @c p is left pointing nowhere. To correct this you may use +destroyed and @c p is left pointing nowhere. To correct this you should use @code -wxCharBuffer p(s.ToUTF8()); +const wxScopedCharBuffer p(s.ToUTF8()); puts(p); @endcode -which does work but results in an unnecessary copy of string data in the build -configurations when wxString::ToUTF8() returns the pointer to internal string buffer. -If this inefficiency is important you may write -@code -const wxUTF8Buf p(s.ToUTF8()); -puts(p); -@endcode -where @c wxUTF8Buf is the type corresponding to the real return type of wxString::ToUTF8(). +which does work. + Similarly, wxWX2WCbuf can be used for the return type of wxString::wc_str(). But, once again, none of these cryptic types is really needed if you just pass the return value of any of the functions mentioned in this section to another function directly. -@section overview_unicode_settings Unicode Related Compilation Settings - -@c wxUSE_UNICODE is now defined as 1 by default to indicate Unicode support. -If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is -also defined, otherwise @c wxUSE_UNICODE_WCHAR is. - -You are encouraged to always use the default build settings of wxWidgets; this avoids -the need of different builds of the same application/library because of different -"build modes". - */ -
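The overview text in this patch describes UTF-8 and UTF-16 as mappings from code points to sequences of code units, and its new source comment says that U+00E0 is encoded in UTF-8 as the bytes 0xC3 0xA0. The following is a minimal standalone sketch of both mappings in plain C++, without wxWidgets. The names EncodeUtf8, EncodeUtf16 and CheckExamples are hypothetical helpers introduced only for illustration, and the sketch assumes valid input: it does not reject surrogate values or code points above U+10FFFF.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not a wxWidgets API): encode one code point as UTF-8,
// producing 1 to 4 code units of 8 bits each. Assumes cp is a valid code point.
std::string EncodeUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // 7-bit ASCII is stored unchanged in a single byte.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper (not a wxWidgets API): encode one code point as UTF-16.
// BMP code points take one 16-bit unit, all others take a surrogate pair.
std::u16string EncodeUtf16(std::uint32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        cp -= 0x10000;                                        // 20 bits remain
        out += static_cast<char16_t>(0xD800 | (cp >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));  // low surrogate
    }
    return out;
}

// Small self-check tying the sketch back to the values quoted in the patch.
void CheckExamples()
{
    assert(EncodeUtf8(0xE0) == "\xC3\xA0");   // U+00E0, as in the FromUTF8() example
    assert(EncodeUtf16(0xE0).size() == 1);    // inside the BMP: one code unit
    assert(EncodeUtf16(0x1F600).size() == 2); // outside the BMP: surrogate pair
}
```

In real wxWidgets code these conversions would normally be left to wxString and its conversion methods; the sketch is only meant to make the variable-length nature of UTF-8 and UTF-16 concrete.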