/////////////////////////////////////////////////////////////////////////////
// Name: unicode.h
// Purpose: topic overview
// Author: wxWidgets team
// RCS-ID: $Id$
// Licence: wxWindows license
/////////////////////////////////////////////////////////////////////////////

/**

@page overview_unicode Unicode Support in wxWidgets

This section briefly describes the state of the Unicode support in wxWidgets.
Read it if you want to know more about how to write programs able to work with
characters from languages other than English.

@li @ref overview_unicode_what
@li @ref overview_unicode_ansi
@li @ref overview_unicode_supportin
@li @ref overview_unicode_supportout
@li @ref overview_unicode_settings
@li @ref overview_unicode_traps


<hr>


@section overview_unicode_what What is Unicode?

wxWidgets can be compiled in Unicode mode on the platforms which support it.
Unicode is a standard for character encoding which addresses the shortcomings
of the previous 8-bit standards by using at least 16 (and possibly 32) bits to
encode each character. This allows for at least 65536 characters (the so-called
BMP, or basic multilingual plane) and possibly as many as 2^32 of them, instead
of the usual 256, which is sufficient to encode all of the world's languages at
once. More details about Unicode may be found at <http://www.unicode.org/>.
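
As a small illustration of the numbers above, the following sketch checks
whether a code point fits in the BMP and how many 16-bit units are needed to
store it. The helper names are hypothetical and not part of wxWidgets, and the
0x10FFFF bound is the limit the Unicode standard places on code points rather
than something stated in this overview.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration only; they are not wxWidgets APIs.

// True if the code point lies in the basic multilingual plane, i.e. it is one
// of the first 65536 (2^16) values and fits in a single 16-bit unit.
inline bool IsInBmp(std::uint32_t codePoint)
{
    return codePoint <= 0xFFFF;
}

// Number of 16-bit units needed to store one code point: one inside the BMP,
// two (a surrogate pair) outside of it. Surrogate values themselves are not
// rejected here, to keep the sketch short.
inline int Utf16UnitsNeeded(std::uint32_t codePoint)
{
    assert(codePoint <= 0x10FFFF); // largest code point defined by Unicode
    return IsInBmp(codePoint) ? 1 : 2;
}
```

For example, U+20AC (the euro sign) needs one 16-bit unit, while U+1F600 lies
outside the BMP and needs two.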

As this solution is obviously preferable to the previous ones (think of
incompatible encodings for the same language, locale chaos and so on), many
modern operating systems support it. Probably the first example is Windows NT,
which has used only Unicode internally since its very first version.

Writing internationalized programs is much easier with Unicode and, as the
support for it improves, it should become more and more so. Moreover, in the
Windows NT/2000 case, even programs which use only standard ASCII can profit
from using Unicode because they will work more efficiently: there will be no
need for the system to convert all strings the program uses to/from Unicode
each time a system call is made.


@section overview_unicode_ansi Unicode and ANSI Modes

As not all platforms supported by wxWidgets support Unicode (fully) yet, in
many cases it is unwise to write a program which can only work in a Unicode
environment. A better solution is to write programs in such a way that they may
be compiled either in ANSI (traditional) mode or in Unicode mode.

This can be achieved quite simply by using the means provided by wxWidgets.
Basically, there are only a few things to watch out for:

- Character type (@c char or @c wchar_t)
- Literal strings and characters (e.g. @c "Hello, world!" or @c '*')
- String functions (@c strlen(), @c strcpy(), ...)
- Special preprocessor tokens (@c __FILE__, @c __DATE__ and @c __TIME__)

Let's look at them in order. First of all, each character in a Unicode program
takes 2 (or 4) bytes instead of the usual one, so another type should be used
to store the characters (@c char usually holds only 1 byte). This type is
called @c wchar_t, which stands for @e wide-character type.

Also, the string and character constants should be encoded using wide
characters (@c wchar_t type), which typically take 2 or 4 bytes, instead of
@c char, which takes only one. This is achieved in the standard C (and C++)
way: just put the letter @c L immediately before any string or character
constant and it becomes a @e long constant, i.e. a wide-character one. Note
that the @c L must be a prefix: writing it after the constant is not valid C or
C++.

Of course, the usual standard C functions don't work with @c wchar_t strings,
so another set of functions exists which do the same thing but accept
@c wchar_t* instead of @c char*. For example, the function to get the length of
a wide-character string is called @c wcslen() (compare with @c strlen(): the
only difference is that the "str" prefix standing for "string" has been
replaced with "wcs" standing for "wide-character string").
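
The parallel between the two function families can be sketched with the
standard library alone. The wrapper names below are hypothetical and exist only
to put the @c str and @c wcs variants side by side.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical wrappers for illustration only; they are not wxWidgets APIs.

// Length in characters of a narrow (char) string, using the str* family.
inline std::size_t NarrowLength(const char* s)
{
    assert(s != nullptr);
    return std::strlen(s);
}

// Length in characters of a wide (wchar_t) string, using the wcs* family.
inline std::size_t WideLength(const wchar_t* ws)
{
    assert(ws != nullptr);
    return std::wcslen(ws);
}

// Storage in bytes taken by the characters of a wide string (terminator
// excluded); sizeof(wchar_t) is typically 2 or 4 depending on the platform.
inline std::size_t WideSizeInBytes(const wchar_t* ws)
{
    return WideLength(ws) * sizeof(wchar_t);
}
```

Both length functions count characters, so the narrow and wide spellings of the
same ASCII text report the same length even though the wide one occupies more
bytes.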

And finally, the standard preprocessor tokens enumerated above expand to ANSI
strings, but Unicode strings are more likely to be wanted in the Unicode build.
wxWidgets provides the macros @c __TFILE__, @c __TDATE__ and @c __TTIME__ which
behave exactly like the standard ones except that they produce ANSI strings in
the ANSI build and Unicode ones in the Unicode build.
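
One common way to obtain a wide version of such a token is token pasting
through two macro levels. The sketch below is an assumption about how macros of
this kind can be built, not the actual wxWidgets definitions; all @c MY_ names
and the @c MY_UNICODE_BUILD switch are hypothetical.

```cpp
#include <cassert>
#include <cwchar>

// Hypothetical macros for illustration; the names are not wxWidgets ones.
// Two levels are needed so that __FILE__ and __DATE__ are expanded to their
// string literals before the L prefix is pasted onto them.
#define MY_WIDEN_HELPER(x) L##x
#define MY_WIDEN(x) MY_WIDEN_HELPER(x)

#ifdef MY_UNICODE_BUILD // hypothetical build switch
    #define MY_TFILE MY_WIDEN(__FILE__)
    #define MY_TDATE MY_WIDEN(__DATE__)
#else
    #define MY_TFILE __FILE__
    #define MY_TDATE __DATE__
#endif

// The wide variant can be used directly, whatever the build switch says.
inline const wchar_t* WideBuildDate()
{
    const wchar_t* date = MY_WIDEN(__DATE__);
    assert(std::wcslen(date) == 11); // __DATE__ has the form "Mmm dd yyyy"
    return date;
}
```

With a single macro level the paste would produce the invalid token
@c L__DATE__, which is why the helper macro is there.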
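
To summarize, here is a brief example of what a program which can be compiled
in both ANSI and Unicode modes could look like:

@code
// Assumes <stdio.h>, <string.h> and <wchar.h> are included; __TDATE__ is the
// wxWidgets macro described above.
#ifdef __UNICODE__
// Unicode build: wide characters, wide literals and the wcs*() functions.
wchar_t wch = L'*';
const wchar_t *ws = L"Hello, world!";
size_t len = wcslen(ws);

// %ls is the portable conversion for a wchar_t string in wprintf().
wprintf(L"Compiled at %ls\n", __TDATE__);
#else // ANSI
// ANSI build: plain char, narrow literals and the str*() functions.
char ch = '*';
const char *s = "Hello, world!";
size_t len = strlen(s);

printf("Compiled at %s\n", __DATE__);
#endif // Unicode/ANSI
@endcode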

Of course, it would be nearly impossible to write such programs if it had to
be done this way (try to imagine the number of UNICODE checks an average
program would need!). Luckily, there is another way, described in the next
section.


@section overview_unicode_supportin Unicode Support in wxWidgets

In wxWidgets, the code fragment from above should instead be written as:

@code
wxChar ch = wxT('*');
wxString s = wxT("Hello, world!");
size_t len = s.Len();
@endcode

What happens here? First of all, you see that there are no more UNICODE checks
at all. Instead, we define some types and macros which behave differently in
the Unicode and ANSI builds and allow us to avoid using conditional compilation
in the program itself.

We have a @c wxChar type which maps to either @c char or @c wchar_t depending
on the mode in which the program is being compiled. There is no need for a
separate type for strings though, because the standard wxString supports
Unicode, i.e. it stores either ANSI or Unicode strings depending on the compile
mode.

Finally, there is a special wxT() macro which should enclose all literal
strings in the program. As is easy to see by comparing the last fragment with
the one above, this macro expands to just its argument in the (usual) ANSI mode
and prefixes @c 'L' to its argument in the Unicode mode.
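
The mechanism described above can be sketched in a few lines. This is a
simplified illustration, not the actual wxWidgets definitions: @c MY_UNICODE,
@c MyChar, @c MY_T and @c MyStrlen are hypothetical names standing in for the
build setting, @c wxChar, wxT() and a string function respectively.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// A simplified sketch, not the actual wxWidgets definitions.
#ifdef MY_UNICODE // hypothetical build switch
    typedef wchar_t MyChar;
    #define MY_T(x) L##x // Unicode build: prefix the literal with L
#else
    typedef char MyChar;
    #define MY_T(x) x    // ANSI build: leave the literal unchanged
#endif

// Overloads hide the choice between the str* and wcs* function families.
inline std::size_t MyStrlen(const char* s)    { return std::strlen(s); }
inline std::size_t MyStrlen(const wchar_t* s) { return std::wcslen(s); }

// The same source line now compiles in both builds without any #ifdef.
inline std::size_t GreetingLength()
{
    // The character literal and the character type agree in either build.
    assert(sizeof(MY_T('*')) == sizeof(MyChar));

    const MyChar* greeting = MY_T("Hello, world!");
    return MyStrlen(greeting);
}
```

The conditional compilation has not disappeared; it has moved into one header,
so the program itself no longer needs to mention it.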

The important conclusion is that if you use @c wxChar instead of @c char, use
@c wxString instead of C style strings and don't forget to enclose all string
literals inside the wxT() macro, your program automatically becomes (almost)
Unicode compliant!

Let us state the rules once again:

@li Always use wxChar instead of @c char.
@li Always enclose literal string constants in the wxT() macro unless they're
already converted to the right representation (another standard wxWidgets
macro _() does it, for example, so there is no need for wxT() in this case)
or you intend to pass the constant directly to an external function which
doesn't accept wide-character strings.
@li Use wxString instead of C style strings.


@section overview_unicode_supportout Unicode and the Outside World

We have seen that it is easy to write Unicode programs using wxWidgets types
and macros, but it has also been mentioned that this isn't quite enough.
Although everything works fine inside the program, things can get nasty when it
tries to communicate with the outside world which, sadly, often expects ANSI
strings (a notable exception is the entire Win32 API, which accepts either
Unicode or ANSI strings and thus makes it unnecessary to ever perform any
conversions in the program). GTK 2.0 only accepts UTF-8 strings.

To get an ANSI string from a wxString, you may use the mb_str() function which
always returns an ANSI string (independently of the mode, whereas the usual
c_str() returns a pointer to the internal representation, which is either ANSI
or Unicode). More rarely used, but still useful, is the wc_str() function which
always returns the Unicode string.
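
To make the wide-to-narrow direction concrete, here is a sketch that uses only
the standard @c wcstombs() function. It is an analogy, not how mb_str() is
implemented: @c ToNarrow is a hypothetical helper, and the conversion follows
the current C locale, so in the default "C" locale only ASCII text is
guaranteed to convert.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper built on the standard library only; it is not how
// wxString::mb_str() is implemented.
inline std::string ToNarrow(const std::wstring& wide)
{
    // Each wide character needs at most MB_CUR_MAX bytes, plus a terminator.
    std::vector<char> buffer(wide.size() * MB_CUR_MAX + 1);
    const std::size_t written =
        std::wcstombs(&buffer[0], wide.c_str(), buffer.size());

    // wcstombs() returns (size_t)-1 if a character cannot be represented.
    if (written == static_cast<std::size_t>(-1))
        return std::string();

    assert(written < buffer.size());
    return std::string(&buffer[0], written);
}
```

Returning an empty string on failure keeps the sketch short; real code would
probably want to report the error instead.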

Sometimes it is also necessary to go from ANSI strings to wxStrings. In this
case, you can use the converter-constructor, as follows:

@code
const char* ascii_str = "Some text";
wxString str(ascii_str, wxConvUTF8);
@endcode

This code also compiles fine under a non-Unicode build of wxWidgets, but in
that case the converter is ignored.
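
The narrow-to-wide direction can be sketched in the same spirit with the
standard @c mbstowcs() function. Again this is only an analogy: @c ToWide is a
hypothetical helper that decodes according to the current C locale, whereas the
constructor above decodes with the converter it is given (UTF-8 in that
example).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper built on the standard library only; it does not use or
// reproduce the wxWidgets converter classes.
inline std::wstring ToWide(const char* narrow)
{
    assert(narrow != nullptr);

    // A multibyte string never decodes to more wide characters than it has
    // bytes, so strlen() + 1 elements are always enough.
    std::vector<wchar_t> buffer(std::strlen(narrow) + 1);
    const std::size_t converted =
        std::mbstowcs(&buffer[0], narrow, buffer.size());

    // mbstowcs() returns (size_t)-1 on an invalid multibyte sequence.
    if (converted == static_cast<std::size_t>(-1))
        return std::wstring();

    return std::wstring(&buffer[0], converted);
}
```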

For more information about converters and Unicode see the @ref overview_mbconv.


@section overview_unicode_settings Unicode Related Compilation Settings

You should define @c wxUSE_UNICODE to 1 to compile your program in Unicode
mode. This currently works for wxMSW, wxGTK, wxMac and wxX11. If you compile
your program in ANSI mode you can still define @c wxUSE_WCHAR_T to get some
limited support for the @c wchar_t type.

This will allow your program to perform conversions between Unicode strings and
ANSI ones (using @ref overview_mbconv "wxMBConv") and construct wxString
objects from Unicode strings (presumably read from some external file or
elsewhere).


@section overview_unicode_traps Traps for the Unwary

@li Casting the result of c_str() to @c void* now yields a @c char*, not a
@c wxChar*.
@li Passing c_str(), mb_str() or wc_str() to variadic functions doesn't work.
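
A likely reason for the second trap is that these functions may return helper
objects with implicit conversions rather than raw pointers, and a variadic
function applies no such conversion to its arguments. The sketch below
illustrates the problem and the usual workaround with a hypothetical
@c CStrProxy class; it is an assumption for illustration, not a wxWidgets type.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical stand-in for a c_str()-like return value that is an object
// with an implicit conversion rather than a plain pointer.
class CStrProxy
{
public:
    explicit CStrProxy(const std::string& s) : m_str(s) {}
    operator const char*() const { return m_str.c_str(); }

private:
    std::string m_str;
};

inline std::string FormatGreeting(const CStrProxy& name)
{
    char buffer[64];

    // Wrong: std::snprintf(buffer, sizeof(buffer), "Hello, %s!", name);
    // A variadic function receives the object itself because no implicit
    // conversion is applied. The explicit cast passes the pointer %s expects.
    const int n = std::snprintf(buffer, sizeof(buffer), "Hello, %s!",
                                static_cast<const char*>(name));
    assert(n >= 0);
    return std::string(buffer);
}
```

Assigning the result to a correctly typed pointer variable before the call
works just as well as the cast.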

*/
