Commit | Line | Data |
---|---|---|
1 | ///////////////////////////////////////////////////////////////////////////// | |
2 | // Name: unicode.h | |
3 | // Purpose: topic overview | |
4 | // Author: wxWidgets team | |
5 | // RCS-ID: $Id$ | |
6 | // Licence: wxWindows license | |
7 | ///////////////////////////////////////////////////////////////////////////// | |
8 | ||
9 | /** | |
10 | ||
11 | @page overview_unicode Unicode Support in wxWidgets | |
12 | ||
13 | This section describes how wxWidgets supports Unicode and how this can affect | |
14 | your programs. | |
15 | ||
16 | Notice that Unicode support has changed radically in wxWidgets 3.0 and a lot of | |
17 | existing material pertaining to the previous versions of the library is not | |
18 | correct any more. Please see @ref overview_changes_unicode for the details of | |
19 | these changes. | |
20 | ||
21 | You can skip the first two sections if you're already familiar with Unicode and | |
22 | wish to jump directly to the details of its support in the library: | |
23 | @li @ref overview_unicode_what | |
24 | @li @ref overview_unicode_encodings | |
25 | @li @ref overview_unicode_supportin | |
26 | @li @ref overview_unicode_pitfalls | |
27 | @li @ref overview_unicode_supportout | |
28 | @li @ref overview_unicode_settings | |
29 | ||
30 | <hr> | |
31 | ||
32 | ||
33 | @section overview_unicode_what What is Unicode? | |
34 | ||
35 | Unicode is a standard for character encoding which addresses the shortcomings | |
36 | of the previous 8-bit standards by assigning a unique code, called a code | |
37 | point, to every character instead of being limited to the usual 256 of them. | |
38 | The first 65536 code points form the BMP, or Basic Multilingual Plane, and the | |
39 | full code space extends up to U+10FFFF (slightly over a million code points), | |
40 | which is sufficient to encode all of the world's languages at once. More | |
41 | details about Unicode may be found at http://www.unicode.org/. | |
42 | ||
43 | From a practical point of view, using Unicode is almost a requirement when | |
44 | writing applications for an international audience. Moreover, any application | |
45 | reading files which it didn't produce or receiving data from the network from | |
46 | other services should be ready to deal with Unicode. | |
47 | ||
48 | ||
49 | @section overview_unicode_encodings Unicode Representations | |
50 | ||
51 | Unicode provides a unique code to identify every character. In practice, | |
52 | however, these codes are not always used directly but are encoded using one of | |
53 | the standard UTFs, or Unicode Transformation Formats, which are algorithms | |
54 | mapping the Unicode codes to byte sequences. The simplest of them is UTF-32, | |
55 | which simply maps the Unicode code to a 4 byte sequence representing this 32 bit | |
56 | number (although this is still not completely trivial as the byte order differs | |
57 | between little- and big-endian architectures). UTF-32 is commonly used under | |
58 | Unix systems for internal representation of Unicode strings. Another very | |
59 | widespread standard is UTF-16, which is used by Microsoft Windows: it encodes | |
60 | the first (approximately) 64 thousand Unicode characters using only 2 bytes and | |
61 | uses a pair of 16-bit codes to encode the characters beyond this. Finally, the | |
62 | most widespread encoding used for external Unicode storage (e.g. files and | |
63 | network protocols) is UTF-8, which is byte-oriented and so avoids the endianness | |
64 | ambiguities of UTF-16 and UTF-32. However, UTF-8 uses a variable number of bytes | |
65 | to represent each Unicode character, which makes it less efficient than UTF-32 | |
66 | for internal representation. | |
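
To make the transformation formats described above more concrete, here is a
minimal sketch in standard C++ (no wxWidgets involved) of how a single code
point could be encoded in UTF-8 and UTF-16. The names @c EncodeUTF8 and
@c EncodeUTF16 are hypothetical helpers introduced only for this illustration;
they are not part of the wxWidgets API, and real programs should rely on the
conversions provided by wxString instead.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not a wxWidgets function): encode one Unicode code
// point as a sequence of 1 to 4 UTF-8 bytes.
std::string EncodeUTF8(std::uint32_t cp)
{
    // Valid code points end at U+10FFFF and exclude the surrogate range.
    assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));

    std::string out;
    if ( cp < 0x80 )            // 7-bit ASCII: a single byte
    {
        out += static_cast<char>(cp);
    }
    else if ( cp < 0x800 )      // 2 bytes: 110xxxxx 10xxxxxx
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if ( cp < 0x10000 )    // 3 bytes: rest of the BMP
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else                        // 4 bytes: characters beyond the BMP
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: encode one code point as UTF-16, using a single
// 16-bit unit inside the BMP and a surrogate pair beyond it.
std::vector<std::uint16_t> EncodeUTF16(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));

    std::vector<std::uint16_t> out;
    if ( cp < 0x10000 )
    {
        out.push_back(static_cast<std::uint16_t>(cp));
    }
    else
    {
        const std::uint32_t v = cp - 0x10000;   // 20 bits remain
        out.push_back(static_cast<std::uint16_t>(0xD800 | (v >> 10)));
        out.push_back(static_cast<std::uint16_t>(0xDC00 | (v & 0x3FF)));
    }
    return out;
}
```

With these helpers, U+00E0 ("Latin Small Letter a with Grave") becomes the two
bytes 0xC3 0xA0 in UTF-8 and a single 16-bit unit in UTF-16, whereas a
character outside the BMP such as U+1F600 needs four UTF-8 bytes and the
surrogate pair 0xD83D 0xDE00 in UTF-16.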
67 | ||
68 | From the C/C++ programmer's perspective the situation is further complicated by | |
69 | the fact that the standard type @c wchar_t, which is used to represent | |
70 | Unicode ("wide") strings in C/C++, doesn't have the same size on all platforms. | |
71 | It is 4 bytes under Unix systems, corresponding to the tradition of using | |
72 | UTF-32, but only 2 bytes under Windows, as required for compatibility with | |
73 | the OS, which uses UTF-16. | |
74 | ||
75 | ||
76 | @section overview_unicode_supportin Unicode Support in wxWidgets | |
77 | ||
78 | Since wxWidgets 3.0 Unicode support is always enabled and building the library | |
79 | without it is no longer recommended and will cease to be supported in the | |
80 | near future. This means that only Unicode strings are used internally and that, | |
81 | under Microsoft Windows, the Unicode system API is used, so wxWidgets | |
82 | programs require the Microsoft Layer for Unicode to run on Windows 95/98/ME. | |
83 | ||
84 | However, unlike the Unicode build mode in previous versions of wxWidgets, this | |
85 | support is mostly transparent: you can still continue to work with narrow | |
86 | (i.e. @c char*) strings even if wide (i.e. @c wchar_t*) strings are also | |
87 | supported. Any wxWidgets function accepts arguments of either type as both | |
88 | kinds of strings are implicitly converted to wxString, so both | |
89 | @code | |
90 | wxMessageBox("Hello, world!"); | |
91 | @endcode | |
92 | and somewhat less usual | |
93 | @code | |
94 | wxMessageBox(L"Salut \u00e0 toi!"); // 00E0 is "Latin Small Letter a with Grave" | |
95 | @endcode | |
96 | work as expected. | |
97 | ||
98 | Notice that the narrow strings used with wxWidgets are @e always assumed to be | |
99 | in the current locale encoding, so writing | |
100 | @code | |
101 | wxMessageBox("Salut à toi!"); | |
102 | @endcode | |
103 | wouldn't work if the encoding used on the user system is incompatible with | |
104 | ISO-8859-1 (or even if the sources were compiled under a different locale | |
105 | in the case of gcc). In particular, the most common encoding used under | |
106 | modern Unix systems is UTF-8 and, as the string above is not a valid UTF-8 byte | |
107 | sequence, nothing would be displayed at all in this case. Thus it is important | |
108 | never to use 8 bit characters directly in the program source but to use wide | |
109 | strings or, alternatively, write | |
110 | @code | |
111 | wxMessageBox(wxString::FromUTF8("Salut \xc3\xa0 toi!")); | |
112 | @endcode | |
113 | ||
114 | In a similar way, wxString provides access to its contents as either a wchar_t | |
115 | or a char character buffer. Of course, the latter only works if the string | |
116 | contains data representable in the current locale encoding. This will always be | |
117 | the case if the string was initially constructed from a narrow string or if it | |
118 | contains only 7-bit ASCII data, but otherwise this conversion is not guaranteed | |
119 | to succeed. And, as with the @c FromUTF8() example above, you can always use | |
120 | @c ToUTF8() to retrieve the string contents in UTF-8 encoding -- this, unlike | |
121 | converting to @c char* using the current locale, never fails. | |
122 | ||
123 | To summarize, Unicode support in wxWidgets is mostly transparent for the | |
124 | application and if you use wxString objects for storing all the character data | |
125 | in your program there is really nothing special to do. However you should be | |
126 | aware of the potential problems covered by the following section. | |
127 | ||
128 | ||
129 | @section overview_unicode_pitfalls Potential Unicode Pitfalls | |
130 | ||
131 | The problems can be separated into three broad classes: | |
132 | ||
133 | @subsection overview_unicode_compilation_errors Unicode-Related Compilation Errors | |
134 | ||
135 | Because of the need to support implicit conversions to both @c char and @c | |
136 | wchar_t, the wxString implementation is rather involved and many of its operators | |
137 | don't return the types which they could be naively expected to return. For | |
138 | example, @c operator[] returns neither a @c char nor a @c wchar_t | |
139 | but an object of a helper class wxUniChar or wxUniCharRef which is implicitly | |
140 | convertible to either. Usually you don't need to worry about this as the | |
141 | conversions do their work behind the scenes, however in some cases they don't | |
142 | work. Here are some examples, using a wxString object @c s and some integer @c | |
143 | n: | |
144 | ||
145 | - Writing @code switch ( s[n] ) @endcode doesn't work because the argument of | |
146 | the switch statement must be an integer expression so you need to replace | |
147 | @c s[n] with @code s[n].GetValue() @endcode. You may also force the | |
148 | conversion to char or wchar_t by using an explicit cast but beware that | |
149 | converting the value to char uses the conversion to current locale and may | |
150 | return 0 if it fails. Finally notice that writing @code (wxChar)s[n] @endcode | |
151 | works both with wxWidgets 3.0 and previous library versions and so should be | |
152 | used for writing code which should be compatible with both 2.8 and 3.0. | |
153 | ||
154 | - Similarly, @code &s[n] @endcode doesn't yield a pointer to char so you may | |
155 | not pass it to functions expecting @c char* or @c wchar_t*. Consider using | |
156 | string iterators instead if possible or replace this expression with | |
157 | @code s.c_str() + n @endcode otherwise. | |
158 | ||
159 | Another class of problems is related to the fact that the value returned by @c | |
160 | c_str() itself is also not just a pointer to a buffer but a value of the helper | |
161 | class wxCStrData which is implicitly convertible to both narrow and wide | |
162 | strings. Again, this will mostly be unnoticeable but can result in some | |
163 | problems: | |
164 | ||
165 | - You shouldn't pass @c c_str() result to vararg functions such as standard | |
166 | @c printf(). Some compilers (notably g++) warn about this but even if they | |
167 | don't, this @code printf("Hello, %s", s.c_str()) @endcode is not going to | |
168 | work. It can be corrected in one of the following ways: | |
169 | ||
170 | - Preferred: @code wxPrintf("Hello, %s", s) @endcode (notice the absence | |
171 | of @c c_str(), it is not needed at all with wxWidgets functions) | |
172 | - Compatible with wxWidgets 2.8: @code wxPrintf("Hello, %s", s.c_str()) @endcode | |
173 | - Using an explicit conversion to narrow, multibyte, string: | |
174 | @code printf("Hello, %s", (const char *)s.mb_str()) @endcode | |
175 | - Using a cast to force the issue (listed only for completeness): | |
176 | @code printf("Hello, %s", (const char *)s.c_str()) @endcode | |
177 | ||
178 | - The result of @c c_str() cannot be cast to @c char* but only to @c const | |
179 | @c char*. Of course, modifying the string via the pointer returned by this | |
180 | method has never been possible but unfortunately it was occasionally useful | |
181 | to use a @c const_cast here to pass the value to const-incorrect functions. | |
182 | This can be done either using the new wxString::char_str() (and matching | |
183 | wchar_str()) method or by writing a double cast: | |
184 | @code (char *)(const char *)s.c_str() @endcode | |
185 | ||
186 | - One of the unfortunate consequences of the possibility to pass wxString to | |
187 | @c wxPrintf() without using @c c_str() is that it is now impossible to pass | |
188 | the elements of unnamed enumerations to @c wxPrintf() and other similar | |
189 | vararg functions, i.e. | |
190 | @code | |
191 | enum { Red, Green, Blue }; | |
192 | wxPrintf("Red is %d", Red); | |
193 | @endcode | |
194 | doesn't compile. The easiest workaround is to give a name to the enum. | |
195 | ||
196 | Other unexpected compilation errors may arise but they should happen even more | |
197 | rarely than the above-mentioned ones and the solution should usually be quite | |
198 | simple: just use the explicit methods of wxUniChar and wxCStrData classes | |
199 | instead of relying on their implicit conversions if the compiler can't choose | |
200 | among them. | |
201 | ||
202 | ||
203 | @subsection overview_unicode_data_loss Data Loss Due to Unicode Conversion Errors | |
204 | ||
205 | The wxString API provides implicit conversion of the internal Unicode string | |
206 | contents to narrow, char strings. This can be very convenient and is absolutely | |
207 | necessary for backwards compatibility with the existing code using wxWidgets. | |
208 | However, it is a rather dangerous operation as it can easily give unexpected | |
209 | results if the string contents aren't convertible to the current locale. | |
210 | ||
211 | To be precise, the conversion will always succeed if the string was created | |
212 | from a narrow string initially. It will also succeed if the current encoding is | |
213 | UTF-8 as all Unicode strings are representable in this encoding. However | |
214 | initializing the string using FromUTF8() method and then accessing it as a char | |
215 | string via its c_str() method is a recipe for disaster as the program may work | |
216 | perfectly well during testing on Unix systems using UTF-8 locale but completely | |
217 | fail under Windows where UTF-8 locales are never used because c_str() would | |
218 | return an empty string. | |
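
The all-or-nothing nature of this failure can be illustrated with a small
sketch in standard C++. It does not use wxWidgets at all: the hypothetical
function @c ToLatin1OrEmpty merely imitates, under the assumption of an
ISO-8859-1 locale, the behaviour described above, where a single
unrepresentable character makes the whole narrow conversion yield an empty
string.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a conversion to a narrow 8-bit encoding
// (ISO-8859-1 is assumed here). It is NOT the wxString implementation, it
// only mimics the "empty string on failure" behaviour discussed in the text.
std::string ToLatin1OrEmpty(const std::u32string& text)
{
    std::string out;
    for ( char32_t ch : text )
    {
        // ISO-8859-1 can only represent code points U+0000..U+00FF.
        if ( ch > 0xFF )
            return std::string();   // one bad character loses everything

        out += static_cast<char>(static_cast<unsigned char>(ch));
    }
    return out;
}
```

In this sketch a string such as "Salut \u00e0 toi!" converts without loss,
but adding a single euro sign (U+20AC) to it makes the result empty, which is
why a program tested only with representable data can fail on real input.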
219 | ||
220 | The simplest way to ensure that this doesn't happen is to avoid conversions to | |
221 | @c char* completely by using wxString throughout your program. However if the | |
222 | program never manipulates 8 bit strings internally, using @c char* pointers is | |
223 | safe as well. So the existing code needs to be reviewed when upgrading to | |
224 | wxWidgets 3.0 and the new code should be written with this in mind, ideally | |
225 | avoiding implicit conversions to @c char*. | |
226 | ||
227 | ||
228 | @subsection overview_unicode_performance Unicode Performance Implications | |
229 | ||
230 | Under Unix systems the wxString class uses variable-width UTF-8 encoding for | |
231 | internal representation. This implies that it can no longer guarantee constant-time | |
232 | access to the N-th element of the string, as to find the position of this | |
233 | character in the string we have to examine all the preceding ones. Usually this | |
234 | doesn't matter much because most algorithms used on the strings examine them | |
235 | sequentially anyhow and because wxString implements a cache for iterating over | |
236 | the string by index, but it can have serious consequences for algorithms | |
237 | using random access to string elements as they typically acquire O(N^2) time | |
238 | complexity instead of O(N) where N is the length of the string. | |
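
The reason for the linear cost of indexed access can be seen in the following
sketch, written in standard C++ and independent of the actual wxString
implementation. @c OffsetOfNthChar and @c CountCharsByIndex are hypothetical
names used only for this illustration, and the input is assumed to be valid
UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: return the byte offset of the n-th character of a
// valid UTF-8 string, or std::string::npos if it has fewer characters.
// Every call has to scan the string from its beginning.
std::size_t OffsetOfNthChar(const std::string& utf8, std::size_t n)
{
    // A valid UTF-8 string never starts with a continuation byte.
    assert(utf8.empty() ||
           (static_cast<unsigned char>(utf8[0]) & 0xC0) != 0x80);

    std::size_t count = 0;
    for ( std::size_t pos = 0; pos < utf8.size(); ++pos )
    {
        // Continuation bytes look like 10xxxxxx, any other byte starts a
        // new character.
        const unsigned char byte = static_cast<unsigned char>(utf8[pos]);
        if ( (byte & 0xC0) != 0x80 )
        {
            if ( count == n )
                return pos;
            ++count;
        }
    }
    return std::string::npos;
}

// Naive indexed loop: each lookup rescans the string from the start, so
// visiting all N characters this way takes O(N^2) time in total.
std::size_t CountCharsByIndex(const std::string& utf8)
{
    std::size_t n = 0;
    while ( OffsetOfNthChar(utf8, n) != std::string::npos )
        ++n;
    return n;
}
```

A single sequential pass over the bytes would visit the same characters in
O(N) time, which is what iterating over the string achieves.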
239 | ||
240 | Even though the index is cached, indexed access should be replaced with | |
241 | sequential access using string iterators. For example a typical loop: | |
242 | @code | |
243 | wxString s("hello"); | |
244 | for ( size_t i = 0; i < s.length(); i++ ) | |
245 | { | |
246 | wchar_t ch = s[i]; | |
247 | ||
248 | // do something with it | |
249 | } | |
250 | @endcode | |
251 | should be rewritten as | |
252 | @code | |
253 | wxString s("hello"); | |
254 | for ( wxString::const_iterator i = s.begin(); i != s.end(); ++i ) | |
255 | { | |
256 | wchar_t ch = *i; | |
257 | ||
258 | // do something with it | |
259 | } | |
260 | @endcode | |
261 | ||
262 | Another, similar, alternative is to use pointer arithmetic: | |
263 | @code | |
264 | wxString s("hello"); | |
265 | const wxWX2WCbuf buf = s.wc_str(); // keeps the wide buffer alive during the loop | |
266 | for ( const wchar_t *p = buf; *p; p++ ) | |
267 | { | |
268 | wchar_t ch = *p; | |
269 | // do something with it | |
270 | } | |
271 | @endcode | |
272 | however, this doesn't work correctly for strings with embedded @c NUL characters | |
273 | and the use of iterators is generally preferred as they provide some run-time | |
274 | checks (at least in debug builds) unlike raw pointers. But if you do use | |
275 | pointers, it is better to use wchar_t pointers rather than char ones to avoid the | |
276 | data loss problems due to conversion as discussed in the previous section. | |
277 | ||
278 | ||
279 | @section overview_unicode_supportout Unicode and the Outside World | |
280 | ||
281 | Even though wxWidgets always uses Unicode internally, not all the other | |
282 | libraries and programs do and even those that do use Unicode may use a | |
283 | different encoding of it. So you need to be able to convert the data to various | |
284 | representations and the wxString methods ToAscii(), ToUTF8() (or its synonym | |
285 | utf8_str()), mb_str(), c_str() and wc_str() can be used for this. The first of | |
286 | them should only be used for strings containing 7-bit ASCII characters only, as | |
287 | anything else will be replaced by some substitution character. mb_str() | |
288 | converts the string to the encoding used by the current locale and so can | |
289 | return an empty string if the string contains characters not representable in | |
290 | it as explained in @ref overview_unicode_data_loss. The same applies to c_str() | |
291 | if its result is used as a narrow string. Finally, the ToUTF8() and wc_str() | |
292 | functions never fail and always return, respectively, a char string containing | |
293 | the UTF-8 representation of the string or a wchar_t string. | |
294 | ||
295 | wxString also provides two convenience functions: From8BitData() and | |
296 | To8BitData(). They can be used to create wxString from arbitrary binary data | |
297 | without supposing that it is in current locale encoding, and then get it back, | |
298 | again, without any conversion or, rather, undoing the conversion used by | |
299 | From8BitData(). Because of this you should only use To8BitData() for the | |
300 | strings created using From8BitData(). Also notice that in spite of the | |
301 | availability of these functions, wxString is not the ideal class for storing | |
302 | arbitrary binary data as it can take up to 4 times more space than needed | |
303 | (when using @c wchar_t internal representation on the systems where size of | |
304 | wide characters is 4 bytes) and you should consider using wxMemoryBuffer | |
305 | instead. | |
306 | ||
307 | A final word of caution: most of these functions may return either a pointer | |
308 | directly to the internal string buffer or a temporary wxCharBuffer or wxWCharBuffer | |
309 | object. Such objects are implicitly convertible to char and wchar_t pointers, | |
310 | respectively, and so the result of, for example, ToUTF8() can always be passed | |
311 | directly to a function taking @c const @c char*. However code such as | |
312 | @code | |
313 | const char *p = s.ToUTF8(); | |
314 | ... | |
315 | puts(p); // or call any other function taking const char * | |
316 | @endcode | |
317 | does @b not work because the temporary buffer returned by ToUTF8() is destroyed | |
318 | and @c p is left pointing nowhere. To correct this you may use | |
319 | @code | |
320 | wxCharBuffer p(s.ToUTF8()); | |
321 | puts(p); | |
322 | @endcode | |
323 | which does work but results in an unnecessary copy of string data in the build | |
324 | configurations when ToUTF8() returns the pointer to internal string buffer. If | |
325 | this inefficiency is important you may write | |
326 | @code | |
327 | const wxUTF8Buf p(s.ToUTF8()); | |
328 | puts(p); | |
329 | @endcode | |
330 | where @c wxUTF8Buf is the type corresponding to the real return type of | |
331 | ToUTF8(). Similarly, wxWX2WCbuf can be used for the return type of wc_str(). | |
332 | But, once again, none of these cryptic types is really needed if you just pass | |
333 | the return value of any of the functions mentioned in this section to another | |
334 | function directly. | |
335 | ||
336 | @section overview_unicode_settings Unicode Related Compilation Settings | |
337 | ||
338 | @c wxUSE_UNICODE is now defined as 1 by default to indicate Unicode support. | |
339 | If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is | |
340 | also defined, otherwise @c wxUSE_UNICODE_WCHAR is. | |
341 | ||
342 | */ | |
343 |