X-Git-Url: https://git.saurik.com/wxWidgets.git/blobdiff_plain/36b952b759285491d8e95e7cbe227d08b8c91160..fbfe2d4e7caa118e1b609151bc72e7e5c7ac0f32:/docs/doxygen/overviews/unicode.h diff --git a/docs/doxygen/overviews/unicode.h b/docs/doxygen/overviews/unicode.h index bb58e1f1ba..e9bc453939 100644 --- a/docs/doxygen/overviews/unicode.h +++ b/docs/doxygen/overviews/unicode.h @@ -10,13 +10,20 @@ @page overview_unicode Unicode Support in wxWidgets -This section briefly describes the state of the Unicode support in wxWidgets. -Read it if you want to know more about how to write programs able to work with -characters from languages other than English. +This section describes how does wxWidgets support Unicode and how can it affect +your programs. +Notice that Unicode support has changed radically in wxWidgets 3.0 and a lot of +existing material pertaining to the previous versions of the library is not +correct any more. Please see @ref overview_changes_unicode for the details of +these changes. + +You can skip the first two sections if you're already familiar with Unicode and +wish to jump directly in the details of its support in the library: @li @ref overview_unicode_what -@li @ref overview_unicode_ansi +@li @ref overview_unicode_encodings @li @ref overview_unicode_supportin +@li @ref overview_unicode_pitfalls @li @ref overview_unicode_supportout @li @ref overview_unicode_settings @@ -25,142 +32,312 @@ characters from languages other than English. @section overview_unicode_what What is Unicode? -wxWidgets has support for compiling in Unicode mode on the platforms which -support it. Unicode is a standard for character encoding which addresses the -shortcomings of the previous, 8 bit standards, by using at least 16 (and -possibly 32) bits for encoding each character. This allows to have at least -65536 characters (what is called the BMP, or basic multilingual plane) and -possible 2^32 of them instead of the usual 256 and is sufficient to encode all -of the world languages at once. A different approach is to encode all -strings in UTF8 which does not require the use of wide characters and -additionally is backwards compatible with 7-bit ASCII. The solution to -use UTF8 is prefered under Linux and partially OS X. - -More details about Unicode may be found at . - -Writing internationalized programs is much easier with Unicode. Moreover -even a program which uses only standard ASCII can benefit from using Unicode -for string representation because there will be no need to convert all -strings the program uses to/from Unicode each time a system call is made. - -@section overview_unicode_ansi Unicode and ANSI Modes - -Until wxWidgets 3.0 it was possible to compile the library both in -ANSI (=8-bit) mode as well as in wide char mode (16-bit per character -on Windows and 32-but on most Unix versions, Linux and OS X). This -has been changed in wxWidget with the removal of the ANSI mode, -but much effort has been made so that most of the previous ANSI -code should still compile and work as before. +Unicode is a standard for character encoding which addresses the shortcomings +of the previous, 8 bit standards, by using at least 16 (and possibly 32) bits +for encoding each character. This allows to have at least 65536 characters +(in what is called the BMP, or basic multilingual plane) and possible 2^32 of +them instead of the usual 256 and is sufficient to encode all of the world +languages at once. More details about Unicode may be found at +http://www.unicode.org/. + +From a practical point of view, using Unicode is almost a requirement when +writing applications for international audience. Moreover, any application +reading files which it didn't produce or receiving data from the network from +other services should be ready to deal with Unicode. + + +@section overview_unicode_encodings Unicode Representations + +Unicode provides a unique code to identify every character, however in practice +these codes are not always used directly but encoded using one of the standard +UTF or Unicode Transformation Formats which are algorithms mapping the Unicode +codes to byte code sequences. The simplest of them is UTF-32 which simply maps +the Unicode code to a 4 byte sequence representing this 32 bit number (although +this is still not completely trivial as the mapping is different for little and +big-endian architectures). UTF-32 is commonly used under Unix systems for +internal representation of Unicode strings. Another very widespread standard is +UTF-16 which is used by Microsoft Windows: it encodes the first (approximately) +64 thousands of Unicode characters using only 2 bytes and uses a pair of 16-bit +codes to encode the characters beyond this. Finally, the most widespread +encoding used for the external Unicode storage (e.g. files and network +protocols) is UTF-8 which is byte-oriented and so avoids the endianness +ambiguities of UTF-16 and UTF-32. However UTF-8 uses a variable number of bytes +for representing Unicode characters which makes it less efficient than UTF-32 +for internal representation. + +From the C/C++ programmer perspective the situation is further complicated by +the fact that the standard type @c wchar_t which is used to represent the +Unicode ("wide") strings in C/C++ doesn't have the same size on all platforms. +It is 4 bytes under Unix systems, corresponding to the tradition of using +UTF-32, but only 2 bytes under Windows which is required by compatibility with +the OS which uses UTF-16. -@section overview_unicode_supportin Unicode Support in wxWidgets -Since wxWidgets 3.0 Unicode support is always enabled meaning -that the wxString class always uses Unicode to encode its content. -Under Windows wxString uses UCS-2 (basically an array of 16-bit -wchar_t). Under Unix, Linux and OS X however, wxString uses UTF8 -to encode its content. +@section overview_unicode_supportin Unicode Support in wxWidgets -For the programmer, the biggest change is that iterating over -a string can be slower than before since wxString has to parse -the entire string in order to find the n-th character in a -string, meaning that iterating over a string should no longer -be done by index but using iterators. Old code will still work -but might be less efficient. +Since wxWidgets 3.0 Unicode support is always enabled and building the library +without it is not recommended any longer and will cease to be supported in the +near future. This means that internally only Unicode strings are used and that, +under Microsoft Windows, Unicode system API is used which means that wxWidgets +programs require the Microsoft Layer for Unicode to run on Windows 95/98/ME. + +However, unlike Unicode build mode in the previous versions of wxWidgets, this +support is mostly transparent: you can still continue to work with the narrow +(i.e. @c char*) strings even if wide (i.e. @c wchar_t*) strings are also +supported. Any wxWidgets function accepts arguments of either type as both +kinds of strings are implicitly converted to wxString, so both +@code +wxMessageBox("Hello, world!"); +@endcode +and somewhat less usual +@code +wxMessageBox(L"Salut \u00e0 toi!"); // 00E0 is "Latin Small Letter a with Grave" +@endcode +work as expected. -Old code like this: +Notice that the narrow strings used with wxWidgets are @e always assumed to be +in the current locale encoding, so writing +@code +wxMessageBox("Salut à toi!"); +@endcode +wouldn't work if the encoding used on the user system is incompatible with +ISO-8859-1 (or even if the sources were compiled under different locale +in the case of gcc). In particular, the most common encoding used under +modern Unix systems is UTF-8 and as the string above is not a valid UTF-8 byte +sequence, nothing would be displayed at all in this case. Thus it is important +to never use 8 bit characters directly in the program source but use wide +strings or, alternatively, write +@code +wxMessageBox(wxString::FromUTF8("Salut \xc3\xa0 toi!")); +@endcode +In a similar way, wxString provides access to its contents as either wchar_t or +char character buffer. Of course, the latter only works if the string contains +data representable in the current locale encoding. This will always be the case +if the string had been initially constructed from a narrow string or if it +contains only 7-bit ASCII data but otherwise this conversion is not guaranteed +to succeed. And as with @c FromUTF8() example above, you can always use @c +ToUTF8() to retrieve the string contents in UTF-8 encoding -- this, unlike +converting to @c char* using the current locale, never fails + +To summarize, Unicode support in wxWidgets is mostly transparent for the +application and if you use wxString objects for storing all the character data +in your program there is really nothing special to do. However you should be +aware of the potential problems covered by the following section. + + +@section overview_unicode_pitfalls Potential Unicode Pitfalls + +The problems can be separated into three broad classes: + +@subsection overview_unicode_compilation_errors Unicode-Related Compilation Errors + +Because of the need to support implicit conversions to both @c char and @c +wchar_t, wxString implementation is rather involved and many of its operators +don't return the types which they could be naively expected to return. For +example, the @c operator[] doesn't return neither a @c char nor a @c wchar_t +but an object of a helper class wxUniChar or wxUniCharRef which is implicitly +convertible to either. Usually you don't need to worry about this as the +conversions do their work behind the scenes however in some cases it doesn't +work. Here are some examples, using a wxString object @c s and some integer @c +n: + + - Writing @code switch ( s[n] ) @endcode doesn't work because the argument of + the switch statement must an integer expression so you need to replace + @c s[n] with @code s[n].GetValue() @endcode. You may also force the + conversion to char or wchar_t by using an explicit cast but beware that + converting the value to char uses the conversion to current locale and may + return 0 if it fails. Finally notice that writing @code (wxChar)s[n] @endcode + works both with wxWidgets 3.0 and previous library versions and so should be + used for writing code which should be compatible with both 2.8 and 3.0. + + - Similarly, @code &s[n] @endcode doesn't yield a pointer to char so you may + not pass it to functions expecting @c char* or @c wchar_t*. Consider using + string iterators instead if possible or replace this expression with + @code s.c_str() + n @endcode otherwise. + +Another class of problems is related to the fact that the value returned by @c +c_str() itself is also not just a pointer to a buffer but a value of helper +class wxCStrData which is implicitly convertible to both narrow and wide +strings. Again, this mostly will be unnoticeable but can result in some +problems: + + - You shouldn't pass @c c_str() result to vararg functions such as standard + @c printf(). Some compilers (notably g++) warn about this but even if they + don't, this @code printf("Hello, %s", s.c_str()) @endcode is not going to + work. It can be corrected in one of the following ways: + + - Preferred: @code wxPrintf("Hello, %s", s) @endcode (notice the absence + of @c c_str(), it is not needed at all with wxWidgets functions) + - Compatible with wxWidgets 2.8: @code wxPrintf("Hello, %s", s.c_str()) @endcode + - Using an explicit conversion to narrow, multibyte, string: + @code printf("Hello, %s", (const char *)s.mb_str()) @endcode + - Using a cast to force the issue (listed only for completeness): + @code printf("Hello, %s", (const char *)s.c_str()) @endcode + + - The result of @c c_str() can not be cast to @c char* but only to @c const @c + @c char*. Of course, modifying the string via the pointer returned by this + method has never been possible but unfortunately it was occasionally useful + to use a @c const_cast here to pass the value to const-incorrect functions. + This can be done either using new wxString::char_str() (and matching + wchar_str()) method or by writing a double cast: + @code (char *)(const char *)s.c_str() @endcode + + - One of the unfortunate consequences of the possibility to pass wxString to + @c wxPrintf() without using @c c_str() is that it is now impossible to pass + the elements of unnamed enumerations to @c wxPrintf() and other similar + vararg functions, i.e. + @code + enum { Red, Green, Blue }; + wxPrintf("Red is %d", Red); + @endcode + doesn't compile. The easiest workaround is to give a name to the enum. + +Other unexpected compilation errors may arise but they should happen even more +rarely than the above-mentioned ones and the solution should usually be quite +simple: just use the explicit methods of wxUniChar and wxCStrData classes +instead of relying on their implicit conversions if the compiler can't choose +among them. + + +@subsection overview_unicode_data_loss Data Loss due To Unicode Conversion Errors + +wxString API provides implicit conversion of the internal Unicode string +contents to narrow, char strings. This can be very convenient and is absolutely +necessary for backwards compatibility with the existing code using wxWidgets +however it is a rather dangerous operation as it can easily give unexpected +results if the string contents isn't convertible to the current locale. + +To be precise, the conversion will always succeed if the string was created +from a narrow string initially. It will also succeed if the current encoding is +UTF-8 as all Unicode strings are representable in this encoding. However +initializing the string using FromUTF8() method and then accessing it as a char +string via its c_str() method is a recipe for disaster as the program may work +perfectly well during testing on Unix systems using UTF-8 locale but completely +fail under Windows where UTF-8 locales are never used because c_str() would +return an empty string. + +The simplest way to ensure that this doesn't happen is to avoid conversions to +@c char* completely by using wxString throughout your program. However if the +program never manipulates 8 bit strings internally, using @c char* pointers is +safe as well. So the existing code needs to be reviewed when upgrading to +wxWidgets 3.0 and the new code should be used with this in mind and ideally +avoiding implicit conversions to @c char*. + + +@subsection overview_unicode_performance Unicode Performance Implications + +Under Unix systems wxString class uses variable-width UTF-8 encoding for +internal representation and this implies that it can't guarantee constant-time +access to N-th element of the string any longer as to find the position of this +character in the string we have to examine all the preceding ones. Usually this +doesn't matter much because most algorithms used on the strings examine them +sequentially anyhow and because wxString implements a cache for iterating over +the string by index but it can have serious consequences for algorithms +using random access to string elements as they typically acquire O(N^2) time +complexity instead of O(N) where N is the length of the string. + +Even despite caching the index, indexed access should be replaced with +sequential access using string iterators. For example a typical loop: @code -wxString s = wxT("hello"); -size_t i; -for (i = 0; i < s.Len(); i++) +wxString s("hello"); +for ( size_t i = 0; i < s.length(); i++ ) { - wxChar ch = s[i]; + wchar_t ch = s[i]; // do something with it } @endcode - -should be replaced (especially in time critical places) with: - +should be rewritten as @code -wxString s = "hello"; -wxString::const_iterator i; -for (i = s.begin(); i != s.end(); ++i) +wxString s("hello"); +for ( wxString::const_iterator i = s.begin(); i != s.end(); ++i ) { - wxUniChar uni_ch = *i; - wxChar ch = uni_ch; - // same as: wxChar ch = *i + wchar_t ch = *i // do something with it } @endcode -If you want to replace individual characters in the string you -need to get a reference to that character: - +Another, similar, alternative is to use pointer arithmetic: @code -wxString s = "hello"; -wxString::iterator i; -for (i = s.begin(); i != s.end(); ++i) +wxString s("hello"); +for ( const wchar_t *p = s.wc_str(); *p; p++ ) { - wxUniCharRef ch = *i; - ch = 'a'; - // same as: *i = 'a'; + wchar_t ch = *i + + // do something with it } @endcode +however this doesn't work correctly for strings with embedded @c NUL characters +and the use of iterators is generally preferred as they provide some run-time +checks (at least in debug build) unlike the raw pointers. But if you do use +them, it is better to use wchar_t pointers rather than char ones to avoid the +data loss problems due to conversion as discussed in the previous section. -which will change the content of the wxString s from "hello" to "aaaaa". -String literals are translated to Unicode when they are assigned to -a wxString object so code can be written like this: +@section overview_unicode_supportout Unicode and the Outside World +Even though wxWidgets always uses Unicode internally, not all the other +libraries and programs do and even those that do use Unicode may use a +different encoding of it. So you need to be able to convert the data to various +representations and the wxString methods ToAscii(), ToUTF8() (or its synonym +utf8_str()), mb_str(), c_str() and wc_str() can be used for this. The first of +them should be only used for the string containing 7-bit ASCII characters only, +anything else will be replaced by some substitution character. mb_str() +converts the string to the encoding used by the current locale and so can +return an empty string if the string contains characters not representable in +it as explained in @ref overview_unicode_data_loss. The same applies to c_str() +if its result is used as a narrow string. Finally, ToUTF8() and wc_str() +functions never fail and always return a pointer to char string containing the +UTF-8 representation of the string or wchar_t string. + +wxString also provides two convenience functions: From8BitData() and +To8BitData(). They can be used to create wxString from arbitrary binary data +without supposing that it is in current locale encoding, and then get it back, +again, without any conversion or, rather, undoing the conversion used by +From8BitData(). Because of this you should only use From8BitData() for the +strings created using To8BitData(). Also notice that in spite of the +availability of these functions, wxString is not the ideal class for storing +arbitrary binary data as they can take up to 4 times more space than needed +(when using @c wchar_t internal representation on the systems where size of +wide characters is 4 bytes) and you should consider using wxMemoryBuffer +instead. + +Final word of caution: most of these functions may return either directly the +pointer to internal string buffer or a temporary wxCharBuffer or wxWCharBuffer +object. Such objects are implicitly convertible to char and wchar_t pointers, +respectively, and so the result of, for example, ToUTF8() can always be passed +directly to a function taking @c const @c char*. However code such as @code -wxString s = "Hello, world!"; -int len = s.Len(); +const char *p = s.ToUTF8(); +... +puts(p); // or call any other function taking const char * @endcode - -wxWidgets provides wrappers around most Posix C functions (like printf(..)) -and the syntax has been adapted to support input with wxString, normal -C-style strings and wchar_t strings: - +does @b not work because the temporary buffer returned by ToUTF8() is destroyed +and @c p is left pointing nowhere. To correct this you may use @code -wxString s; -s.Printf( "%s %s %s", "hello1", L"hello2", wxString("hello3") ); -wxPrintf( "Three times hello %s\n", s ); +wxCharBuffer p(s.ToUTF8()); +puts(p); @endcode - -@section overview_unicode_supportout Unicode and the Outside World - -We have seen that it was easy to write Unicode programs using wxWidgets types -and macros, but it has been also mentioned that it isn't quite enough. Although -everything works fine inside the program, things can get nasty when it tries to -communicate with the outside world which, sadly, often expects ANSI strings (a -notable exception is the entire Win32 API which accepts either Unicode or ANSI -strings and which thus makes it unnecessary to ever perform any conversions in -the program). GTK 2.0 only accepts UTF-8 strings. - -To get an ANSI string from a wxString, you may use the mb_str() function which -always returns an ANSI string (independently of the mode - while the usual -c_str() returns a pointer to the internal representation which is either ASCII -or Unicode). More rarely used, but still useful, is wc_str() function which -always returns the Unicode string. - -Sometimes it is also necessary to go from ANSI strings to wxStrings. In this -case, you can use the converter-constructor, as follows: - +which does work but results in an unnecessary copy of string data in the build +configurations when ToUTF8() returns the pointer to internal string buffer. If +this inefficiency is important you may write @code -const char* ascii_str = "Some text"; -wxString str(ascii_str, wxConvUTF8); +const wxUTF8Buf p(s.ToUTF8()); +puts(p); @endcode - -For more information about converters and Unicode see the @ref overview_mbconv. - +where @c wxUTF8Buf is the type corresponding to the real return type of +ToUTF8(). Similarly, wxWX2WCbuf can be used for the return type of wc_str(). +But, once again, none of these cryptic types is really needed if you just pass +the return value of any of the functions mentioned in this section to another +function directly. @section overview_unicode_settings Unicode Related Compilation Settings -You should define @c wxUSE_UNICODE to 1 to compile your program in Unicode -mode. Since wxWidgets 3.0 this is always the case. When compiled in UTF8 -mode @c wxUSE_UNICODE_UTF8 is also defined. +@c wxUSE_UNICODE is now defined as 1 by default to indicate Unicode support. +If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is +also defined, otherwise @c wxUSE_UNICODE_WCHAR is. */