docs/doxygen/overviews/unicode.h

   1 /////////////////////////////////////////////////////////////////////////////
   2 // Name:        unicode.h
   3 // Purpose:     topic overview
   4 // Author:      wxWidgets team
   5 // RCS-ID:      $Id$
   6 // Licence:     wxWindows license
   7 /////////////////////////////////////////////////////////////////////////////
   8
   9 /**
  10
  11 @page overview_unicode Unicode Support in wxWidgets
  12
  13 This section briefly describes the state of the Unicode support in wxWidgets.
  14 Read it if you want to know more about how to write programs able to work with
  15 characters from languages other than English.
  16
  17 @li @ref overview_unicode_what
  18 @li @ref overview_unicode_ansi
  19 @li @ref overview_unicode_supportin
  20 @li @ref overview_unicode_supportout
  21 @li @ref overview_unicode_settings
  22 @li @ref overview_unicode_traps
  23
  24
  25 <hr>
  26
  27
  28 @section overview_unicode_what What is Unicode?
  29
wxWidgets has support for compiling in Unicode mode on the platforms which
support it. Unicode is a standard for character encoding which addresses the
shortcomings of the previous 8-bit standards by using at least 16 (and
possibly 32) bits for encoding each character. This allows at least 65536
characters (what is called the BMP, or basic multilingual plane), and possibly
2^32 of them, instead of the usual 256, which is sufficient to encode all of
the world's languages at once. More details about Unicode may be found at
<http://www.unicode.org/>.
  38
As this solution is obviously preferable to the previous ones (think of
incompatible encodings for the same language, locale chaos and so on), many
modern operating systems support it. Probably the first example is Windows NT,
which has used only Unicode internally since its very first version.
  43
Writing internationalized programs is much easier with Unicode and, as the
support for it improves, it should become more and more so. Moreover, in the
Windows NT/2000 case, even a program which uses only standard ASCII can
profit from using Unicode because it will work more efficiently: there will
be no need for the system to convert all strings the program uses to/from
Unicode each time a system call is made.
  50
  51
  52 @section overview_unicode_ansi Unicode and ANSI Modes
  53
As not all platforms supported by wxWidgets support Unicode (fully) yet, in
many cases it is unwise to write a program which can only work in a Unicode
environment. A better solution is to write programs in such a way that they may
be compiled either in ANSI (traditional) mode or in the Unicode one.
  58
  59 This can be achieved quite simply by using the means provided by wxWidgets.
  60 Basically, there are only a few things to watch out for:
  61
  62 - Character type (@c char or @c wchar_t)
  63 - Literal strings (i.e. @c "Hello, world!" or @c '*')
  64 - String functions (@c strlen(), @c strcpy(), ...)
  65 - Special preprocessor tokens (@c __FILE__, @c __DATE__ and @c __TIME__)
  66
Let's look at them in order. First of all, each character in a Unicode program
takes 2 or 4 bytes instead of the usual one, so another type should be used to
store the characters (@c char only holds 1 byte usually). This type is called
@c wchar_t which stands for @e wide-character type.
  71
Also, the string and character constants should be encoded using wide
characters (@c wchar_t type) which typically take 2 or 4 bytes instead of
@c char which only takes one. This is achieved by using the standard C (and
C++) way: just put the letter @c 'L' immediately before any string or character
constant and it becomes a @e long constant, i.e. a wide character one. For
example, @c L"Hello" is a wide-character string and @c L'*' is a single wide
character.
  79
  80 Of course, the usual standard C functions don't work with @c wchar_t strings,
  81 so another set of functions exists which do the same thing but accept
  82 @c wchar_t* instead of @c char*. For example, a function to get the length of a
  83 wide-character string is called @c wcslen() (compare with @c strlen() - you see
  84 that the only difference is that the "str" prefix standing for "string" has
  85 been replaced with "wcs" standing for "wide-character string").
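
As a rough illustration of the two function families, the sketch below measures
the same text as a narrow and as a wide string using only the standard C
library. The helper names @c narrow_length, @c wide_length and
@c wide_storage_bytes are made up for this example and are not part of
wxWidgets.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Length, in characters, of a narrow (char-based) string.
std::size_t narrow_length(const char* s)
{
    return std::strlen(s);
}

// Length, in wide characters, of a wide (wchar_t-based) string.
std::size_t wide_length(const wchar_t* ws)
{
    return std::wcslen(ws);
}

// Bytes needed to store the characters of a wide string,
// not counting the terminating null character.
std::size_t wide_storage_bytes(const wchar_t* ws)
{
    return std::wcslen(ws) * sizeof(wchar_t);
}
```

Both functions report the same character count for the same text, but the wide
string occupies @c sizeof(wchar_t) bytes per character, which differs between
platforms.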
  86
And finally, the standard preprocessor tokens enumerated above expand to ANSI
strings but it is more likely that Unicode strings are wanted in the Unicode
build. wxWidgets provides the macros @c __TFILE__, @c __TDATE__ and
@c __TTIME__ which behave exactly as the standard ones except that they produce
ANSI strings in the ANSI build and Unicode ones in the Unicode build.
  92
To summarize, here is a brief example of what a program which can be compiled
in both ANSI and Unicode modes could look like:
  95
@code
#ifdef __UNICODE__
    // Unicode build: wide characters, wide literals and wcs*() functions.
    wchar_t wch = L'*';
    const wchar_t *ws = L"Hello, world!";
    size_t len = wcslen(ws);

    // %ls is the portable conversion for a wchar_t string in wprintf().
    wprintf(L"Compiled at %ls\n", __TDATE__);
#else // ANSI
    // ANSI build: plain char, narrow literals and str*() functions.
    char ch = '*';
    const char *s = "Hello, world!";
    size_t len = strlen(s);

    printf("Compiled at %s\n", __DATE__);
#endif // Unicode/ANSI
@endcode
 111
Of course, it would be nearly impossible to write such programs if it had to
be done this way (try to imagine the number of UNICODE checks an average
program would have!). Luckily, there is another way, described in the next
section.
 115
 116
 117 @section overview_unicode_supportin Unicode Support in wxWidgets
 118
In wxWidgets, the code fragment from above should instead be written as:
 120
@code
wxChar ch = wxT('*');
wxString s = wxT("Hello, world!");
size_t len = s.Len();
@endcode
 126
 127 What happens here? First of all, you see that there are no more UNICODE checks
 128 at all. Instead, we define some types and macros which behave differently in
 129 the Unicode and ANSI builds and allow us to avoid using conditional compilation
 130 in the program itself.
 131
We have a @c wxChar type which maps to either @c char or @c wchar_t depending
on the mode in which the program is being compiled. There is no need for a
separate type for strings though, because the standard wxString supports
Unicode, i.e. it stores either ANSI or Unicode strings depending on the compile
mode.
 136
Finally, there is a special wxT() macro which should enclose all literal
strings in the program. As is easy to see by comparing the last fragment with
the one above, this macro expands to nothing in the (usual) ANSI mode and
prefixes @c 'L' to its argument in the Unicode mode.
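
To make the mechanism concrete, here is a minimal sketch of how such a
character type and literal macro could be defined. The names @c DEMO_UNICODE,
@c demo_char, @c DEMO_T, @c demo_strlen and @c greeting_length are hypothetical
and chosen for this illustration only; the real wxWidgets definitions are more
elaborate.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Define DEMO_UNICODE (e.g. with -DDEMO_UNICODE) to select the wide build.
#ifdef DEMO_UNICODE
    typedef wchar_t demo_char;
    // Paste the L prefix onto the literal: DEMO_T("x") becomes L"x".
    #define DEMO_T(x) L ## x
    inline std::size_t demo_strlen(const demo_char* s) { return std::wcslen(s); }
#else
    typedef char demo_char;
    // Leave the literal unchanged: DEMO_T("x") stays "x".
    #define DEMO_T(x) x
    inline std::size_t demo_strlen(const demo_char* s) { return std::strlen(s); }
#endif

// The same source line compiles in both modes without any further #ifdef.
inline std::size_t greeting_length()
{
    const demo_char* greeting = DEMO_T("Hello, world!");
    return demo_strlen(greeting);
}
```

The conditional compilation is confined to one place, so code such as
@c greeting_length() stays identical in both builds.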
 141
The important conclusion is that if you use @c wxChar instead of @c char, use
@c wxString instead of C style strings and don't forget to enclose all string
literals inside the wxT() macro, your program automatically becomes (almost)
Unicode compliant!
 146
Let us state the rules once again:
 148
 149 @li Always use wxChar instead of @c char
@li Always enclose literal string constants in the wxT() macro unless they're
    already converted to the right representation (another standard wxWidgets
    macro _() does it, for example, so there is no need for wxT() in this case)
    or you intend to pass the constant directly to an external function which
    doesn't accept wide-character strings.
 155 @li Use wxString instead of C style strings.
 156
 157
 158 @section overview_unicode_supportout Unicode and the Outside World
 159
We have seen that it is easy to write Unicode programs using wxWidgets types
and macros, but it has also been mentioned that this isn't quite enough.
Although everything works fine inside the program, things can get nasty when it
tries to communicate with the outside world which, sadly, often expects ANSI
strings. A notable exception is the entire Win32 API, which accepts either
Unicode or ANSI strings and thus makes it unnecessary to ever perform any
conversions in the program. GTK 2.0, on the other hand, only accepts UTF-8
strings.
 167
To get an ANSI string from a wxString, you may use the mb_str() function which
always returns an ANSI string, independently of the mode. The usual c_str(), by
contrast, returns a pointer to the internal representation which is either ANSI
or Unicode. More rarely used, but still useful, is the wc_str() function which
always returns the Unicode string.
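
For comparison, a wide-to-narrow conversion of this kind can be sketched with
the standard C library alone. The helper @c to_multibyte below is hypothetical,
relies on @c std::wcstombs() and the current C locale, and is not how wxWidgets
implements its own conversions.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Convert a wide string to the multibyte encoding of the current C locale.
// Returns false if some character cannot be represented in that encoding.
bool to_multibyte(const std::wstring& wide, std::string& out)
{
    // Each wide character needs at most MB_CUR_MAX bytes; add one for '\0'.
    std::vector<char> buffer(wide.size() * MB_CUR_MAX + 1);

    const std::size_t written =
        std::wcstombs(buffer.data(), wide.c_str(), buffer.size());
    if (written == static_cast<std::size_t>(-1))
        return false;

    out.assign(buffer.data(), written);
    return true;
}
```

With the default "C" locale only plain ASCII text is certain to convert; a
program would normally select the user's locale with @c std::setlocale() before
relying on such a conversion.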
 173
 174 Sometimes it is also necessary to go from ANSI strings to wxStrings. In this
 175 case, you can use the converter-constructor, as follows:
 176
@code
const char* ascii_str = "Some text";
wxString str(ascii_str, wxConvUTF8);
@endcode
 181
 182 This code also compiles fine under a non-Unicode build of wxWidgets, but in
 183 that case the converter is ignored.
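
The converter argument says how the bytes of the narrow string should be
interpreted. As an illustration of what a UTF-8 converter has to do, the
following sketch decodes UTF-8 bytes into code points in plain C++. The function
@c decode_utf8 is a hypothetical helper written for this overview; it performs
only basic validation (it does not reject overlong forms or surrogates) and
does not reflect the actual wxConvUTF8 implementation.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into Unicode code points.
// Returns false on truncated or malformed input.
bool decode_utf8(const std::string& bytes, std::u32string& out)
{
    out.clear();
    std::size_t i = 0;
    while (i < bytes.size())
    {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t code;
        std::size_t extra; // number of continuation bytes that follow

        if (lead < 0x80)                { code = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { code = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { code = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { code = lead & 0x07; extra = 3; }
        else
            return false; // stray continuation byte or invalid lead byte

        if (extra > bytes.size() - i - 1)
            return false; // the sequence is cut short

        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char next = static_cast<unsigned char>(bytes[i + k]);
            if ((next & 0xC0) != 0x80)
                return false; // not a continuation byte
            code = (code << 6) | (next & 0x3F);
        }

        out.push_back(code);
        i += extra + 1;
    }
    return true;
}
```

For instance, the two bytes 0xC3 0xA9 decode to the single code point U+00E9,
which shows why the byte count and the character count of a string can differ.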
 184
 185 For more information about converters and Unicode see the @ref overview_mbconv.
 186
 187
 188 @section overview_unicode_settings Unicode Related Compilation Settings
 189
 190 You should define @c wxUSE_UNICODE to 1 to compile your program in Unicode
 191 mode. This currently works for wxMSW, wxGTK, wxMac and wxX11. If you compile
 192 your program in ANSI mode you can still define @c wxUSE_WCHAR_T to get some
 193 limited support for @c wchar_t type.
 194
 195 This will allow your program to perform conversions between Unicode strings and
 196 ANSI ones (using @ref overview_mbconv "wxMBConv") and construct wxString
 197 objects from Unicode strings (presumably read from some external file or
 198 elsewhere).
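
Code that needs to behave differently in the two builds can test the setting
with the preprocessor. In the sketch below the fallback definition exists only
to keep the fragment self-contained (in a real program the value comes from the
wxWidgets configuration), and @c build_mode_name is a hypothetical helper.

```cpp
// Fallback for this stand-alone sketch only; wxWidgets normally sets this.
#ifndef wxUSE_UNICODE
    #define wxUSE_UNICODE 0
#endif

// Report which mode this translation unit was compiled in.
inline const char* build_mode_name()
{
#if wxUSE_UNICODE
    return "Unicode";
#else
    return "ANSI";
#endif
}
```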
 199
 200
 201 @section overview_unicode_traps Traps for the Unwary
 202
@li Casting the result of c_str() to void* now yields a char* pointer, not a
    wxChar* one.
 204 @li Passing c_str(), mb_str() or wc_str() to variadic functions doesn't work.
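
The second trap can be modelled without wxWidgets. Assuming that a function
like c_str() returns a small proxy object convertible to a character pointer
rather than a raw pointer (the @c DemoCStr class below is a hypothetical
stand-in, not the real wxWidgets type), a variadic function such as printf()
receives the object itself unless the pointer is extracted explicitly:

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for a proxy returned by a c_str()-like function.
class DemoCStr
{
public:
    explicit DemoCStr(const char* text) : m_text(text) {}

    // The implicit conversion works for ordinary, typed parameters...
    operator const char*() const { return m_text; }

private:
    const char* m_text;
};

// ...but a variadic argument has no declared type to trigger it, so the
// conversion must be requested explicitly before the call.
std::string format_name(const DemoCStr& name)
{
    char buffer[64];
    // Wrong: std::snprintf(buffer, sizeof buffer, "Name: %s", name);
    std::snprintf(buffer, sizeof buffer, "Name: %s",
                  static_cast<const char*>(name));
    return buffer;
}
```

The explicit cast makes the intended representation unambiguous at the call
site.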
 205
 206 */
 207