Commit | Line | Data |
---|---|---|
1 | ///////////////////////////////////////////////////////////////////////////// | |
2 | // Name: unicode.h | |
3 | // Purpose: topic overview | |
4 | // Author: wxWidgets team | |
5 | // Licence: wxWindows licence | |
6 | ///////////////////////////////////////////////////////////////////////////// | |
7 | ||
8 | /** | |
9 | ||
10 | @page overview_unicode Unicode Support in wxWidgets | |
11 | ||
12 | @tableofcontents | |
13 | ||
14 | This section describes how wxWidgets supports Unicode and how this can affect | |
15 | your programs. | |
16 | ||
17 | Notice that Unicode support has changed radically in wxWidgets 3.0 and a lot of | |
18 | existing material pertaining to the previous versions of the library is not | |
19 | correct any more. Please see @ref overview_changes_unicode for the details of | |
20 | these changes. | |
21 | ||
22 | You can skip the first two sections if you're already familiar with Unicode and | |
23 | wish to jump directly into the details of its support in the library. | |
24 | ||
25 | ||
26 | ||
27 | @section overview_unicode_what What is Unicode? | |
28 | ||
29 | Unicode is a standard for character encoding which addresses the shortcomings | |
30 | of the previous standards (e.g. the ASCII standard), by using 8, 16 or 32 bits | |
31 | for encoding each character. | |
32 | This provides enough code points (see below for the definition) to | |
33 | encode all of the world's languages at once. | |
34 | More details about Unicode may be found at http://www.unicode.org/. | |
35 | ||
36 | From a practical point of view, using Unicode is almost a requirement when | |
37 | writing applications for an international audience. Moreover, any application | |
38 | reading files which it didn't produce or receiving data from the network from | |
39 | other services should be ready to deal with Unicode. | |
40 | ||
41 | ||
42 | @section overview_unicode_encodings Unicode Representations and Terminology | |
43 | ||
44 | When working with Unicode, it's important to define the meaning of some terms. | |
45 | ||
46 | A <b><em>glyph</em></b> is a particular image (usually part of a font) that | |
47 | represents a character or part of a character. | |
48 | Any character may have one or more associated glyphs; e.g. some of the possible | |
49 | glyphs for the capital letter 'A' are: | |
50 | ||
51 | @image html overview_unicode_glyphs.png | |
52 | ||
53 | Unicode assigns each character of almost any existing alphabet/script a number, | |
54 | which is called a <b><em>code point</em></b>; it's typically written in documentation | |
55 | and on the Unicode website as @c U+xxxx where @c xxxx is a hexadecimal number. | |
56 | ||
57 | Note that typically one character is assigned exactly one code point, but there | |
58 | are exceptions, such as the so-called <em>precomposed characters</em> | |
59 | (see http://en.wikipedia.org/wiki/Precomposed_character) or the <em>ligatures</em>. | |
60 | In these cases a single "character" may be mapped to more than one code point or, | |
61 | vice versa, several characters may be mapped to a single code point. | |
62 | ||
63 | The Unicode standard divides the space of all possible code points into <b><em>planes</em></b>; | |
64 | a plane is a range of 65,536 (0x10000) contiguous Unicode code points. | |
65 | Planes are numbered from 0 to 16, where the first one is the @e BMP, or Basic | |
66 | Multilingual Plane. | |
67 | The BMP contains characters for all modern languages, and a large number of | |
68 | special characters. The other planes in fact contain mainly historic scripts, | |
69 | special-purpose characters or are unused. | |
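
Since every plane holds 0x10000 code points, the plane of a code point is simply its value divided by 0x10000. The following is a minimal sketch in standard C++; @c planeOf and @c isInBmp are hypothetical helper names used only for illustration and are not part of wxWidgets:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: return the plane number (0..16) of a code point.
unsigned planeOf(std::uint32_t codePoint)
{
    // Unicode code points end at U+10FFFF, the last one of plane 16.
    assert(codePoint <= 0x10FFFF);

    // Each plane spans 0x10000 code points, so the plane is the value
    // shifted right by 16 bits.
    return codePoint >> 16;
}

// Hypothetical helper: plane 0 is the Basic Multilingual Plane.
bool isInBmp(std::uint32_t codePoint)
{
    return planeOf(codePoint) == 0;
}
```

For instance, U+00E0 lies in the BMP whereas U+1F600 lies in plane 1.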
70 | ||
71 | Code points are represented in computer memory as a sequence of one or more | |
72 | <b><em>code units</em></b>, where a code unit is a unit of memory: 8, 16, or 32 bits. | |
73 | More precisely, a code unit is the minimal bit combination that can represent a | |
74 | unit of encoded text for processing or interchange. | |
75 | ||
76 | The <b><em>UTF</em></b> or Unicode Transformation Formats are algorithms mapping the Unicode | |
77 | code points to code unit sequences. The simplest of them is <b>UTF-32</b> where | |
78 | each code unit is composed of 32 bits (4 bytes) and each code point is always | |
79 | represented by a single code unit (fixed length encoding). | |
80 | (Note that even UTF-32 is still not completely trivial as the mapping is different | |
81 | for little and big-endian architectures). UTF-32 is commonly used under Unix systems for | |
82 | internal representation of Unicode strings. | |
83 | ||
84 | Another very widespread standard is <b>UTF-16</b> which is used by Microsoft Windows: | |
85 | it encodes the code points of the BMP (approximately the first 64 thousand) | |
86 | using a single 16-bit code unit (2 bytes) and uses a pair of 16-bit code | |
87 | units to encode the characters beyond this. These pairs are called @e surrogate @e pairs. | |
88 | Thus UTF-16 uses a variable number of code units to encode each code point. | |
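
The variable-length nature of UTF-16 can be sketched in a few lines of standard C++. This is an illustration of the encoding rule only; @c encodeUtf16 is a hypothetical helper name, not a wxWidgets function:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one code point as one or two UTF-16 code units.
std::vector<std::uint16_t> encodeUtf16(std::uint32_t codePoint)
{
    // Valid code points end at U+10FFFF and exclude the surrogate range.
    assert(codePoint <= 0x10FFFF);
    assert(!(codePoint >= 0xD800 && codePoint <= 0xDFFF));

    std::vector<std::uint16_t> units;
    if ( codePoint < 0x10000 )
    {
        // BMP code point: a single 16-bit code unit is enough.
        units.push_back(static_cast<std::uint16_t>(codePoint));
    }
    else
    {
        // Beyond the BMP: split the remaining 20 bits into a surrogate pair.
        const std::uint32_t offset = codePoint - 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (offset >> 10)));
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (offset & 0x3FF)));
    }

    return units;
}
```

With this sketch U+00E0 yields the single code unit 0x00E0, while U+1F600 yields the surrogate pair 0xD83D 0xDE00.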
89 | ||
90 | Finally, the most widespread encoding used for the external Unicode storage | |
91 | (e.g. files and network protocols) is <b>UTF-8</b> which is byte-oriented and so | |
92 | avoids the endianness ambiguities of UTF-16 and UTF-32. | |
93 | UTF-8 uses code units of 8 bits (1 byte); code points beyond the usual English | |
94 | alphabet are represented using a variable number of bytes, which makes it less | |
95 | efficient than UTF-32 for internal representation. | |
96 | ||
97 | As a visual aid to understanding the differences between the various concepts described | |
98 | so far, look at the different UTF representations of the same code point: | |
99 | ||
100 | @image html overview_unicode_codes.png | |
101 | ||
102 | In this particular case UTF-8 requires more space than UTF-16 (3 bytes instead of 2). | |
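
To see where the three bytes come from, here is a minimal sketch of the UTF-8 encoding rule in standard C++. The name @c encodeUtf8 is hypothetical, and U+20AC (the euro sign) is used below only as an example of a BMP code point needing three bytes; it is not necessarily the code point shown in the figure above:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one code point as 1 to 4 UTF-8 bytes.
std::string encodeUtf8(std::uint32_t codePoint)
{
    // Valid code points end at U+10FFFF and exclude the surrogate range.
    assert(codePoint <= 0x10FFFF);
    assert(!(codePoint >= 0xD800 && codePoint <= 0xDFFF));

    std::string bytes;
    if ( codePoint < 0x80 )
    {
        // 7-bit ASCII is stored unchanged in a single byte.
        bytes += static_cast<char>(codePoint);
    }
    else if ( codePoint < 0x800 )
    {
        // 110xxxxx 10xxxxxx
        bytes += static_cast<char>(0xC0 | (codePoint >> 6));
        bytes += static_cast<char>(0x80 | (codePoint & 0x3F));
    }
    else if ( codePoint < 0x10000 )
    {
        // 1110xxxx 10xxxxxx 10xxxxxx
        bytes += static_cast<char>(0xE0 | (codePoint >> 12));
        bytes += static_cast<char>(0x80 | ((codePoint >> 6) & 0x3F));
        bytes += static_cast<char>(0x80 | (codePoint & 0x3F));
    }
    else
    {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        bytes += static_cast<char>(0xF0 | (codePoint >> 18));
        bytes += static_cast<char>(0x80 | ((codePoint >> 12) & 0x3F));
        bytes += static_cast<char>(0x80 | ((codePoint >> 6) & 0x3F));
        bytes += static_cast<char>(0x80 | (codePoint & 0x3F));
    }

    return bytes;
}
```

Under these assumptions U+0041 takes one byte, U+00E0 takes two (0xC3 0xA0) and U+20AC takes three (0xE2 0x82 0xAC), whereas all three fit in a single 16-bit code unit in UTF-16.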
103 | ||
104 | Note that from the C/C++ programmer's perspective the situation is further complicated | |
105 | by the fact that the standard type @c wchar_t, which is usually used to represent | |
106 | Unicode ("wide") strings in C/C++, doesn't have the same size on all platforms. | |
107 | It is 4 bytes under Unix systems, corresponding to the tradition of using | |
108 | UTF-32, but only 2 bytes under Windows, as required for compatibility with | |
109 | the OS, which uses UTF-16. | |
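
Portable code that cares about this difference can inspect the width of @c wchar_t itself. The sketch below is plain standard C++; @c wideEncodingForWidth and @c kWideCharBits are hypothetical names introduced only for this illustration, and the mapping from width to encoding is the convention described above, not something the C++ standard guarantees:

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>

// Hypothetical helper: name the encoding conventionally stored in a wide
// character type of the given width in bits.
std::string wideEncodingForWidth(std::size_t bits)
{
    assert(bits >= 8);

    if ( bits >= 32 )
        return "UTF-32";    // typical for Unix systems
    if ( bits >= 16 )
        return "UTF-16";    // typical for Microsoft Windows

    return "narrow";        // too small for either encoding form
}

// Width of wchar_t on the platform this code is compiled for.
const std::size_t kWideCharBits = sizeof(wchar_t) * CHAR_BIT;
```

Calling <tt>wideEncodingForWidth(kWideCharBits)</tt> then tells which of the two conventions applies to the current build.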
110 | ||
111 | Typically when UTF-8 is used, code units are stored in @c char types, since | |
112 | @c char is 8 bits wide on almost all systems; when using UTF-16, code | |
113 | units are typically stored in @c wchar_t types since @c wchar_t is at least 16 bits on | |
114 | all systems. This is also the approach used by wxString. | |
115 | See @ref overview_string for more info. | |
116 | ||
117 | See also http://unicode.org/glossary/ for the official definitions of the | |
118 | terms reported above. | |
119 | ||
120 | ||
121 | @section overview_unicode_supportin Unicode Support in wxWidgets | |
122 | ||
123 | @subsection overview_unicode_support_default Unicode is Always Used by Default | |
124 | ||
125 | Since wxWidgets 3.0 Unicode support is always enabled and while building the | |
126 | library without it is still possible, it is not recommended any longer and will | |
127 | cease to be supported in the near future. This means that internally only | |
128 | Unicode strings are used and that, under Microsoft Windows, Unicode system API | |
129 | is used which means that wxWidgets programs require the Microsoft Layer for | |
130 | Unicode to run on Windows 95/98/ME. | |
131 | ||
132 | However, unlike the Unicode build mode of the previous versions of wxWidgets, this | |
133 | support is mostly transparent: you can still continue to work with the @b narrow | |
134 | (i.e. current locale-encoded @c char*) strings even if @b wide | |
135 | (i.e. UTF-16-encoded @c wchar_t* or UTF-8-encoded @c char*) strings are also | |
136 | supported. Any wxWidgets function accepts arguments of either type as both | |
137 | kinds of strings are implicitly converted to wxString, so both | |
138 | @code | |
139 | wxMessageBox("Hello, world!"); | |
140 | @endcode | |
141 | and the somewhat less usual | |
142 | @code | |
143 | wxMessageBox(L"Salut \u00E0 toi!"); // U+00E0 is "Latin Small Letter a with Grave" | |
144 | @endcode | |
145 | work as expected. | |
146 | ||
147 | Notice that the narrow strings used with wxWidgets are @e always assumed to be | |
148 | in the current locale encoding, so writing | |
149 | @code | |
150 | wxMessageBox("Salut à toi!"); | |
151 | @endcode | |
152 | wouldn't work if the encoding used on the user system is incompatible with | |
153 | ISO-8859-1 (or even if the sources were compiled under a different locale | |
154 | in the case of gcc). In particular, the most common encoding used under | |
155 | modern Unix systems is UTF-8 and as the string above is not a valid UTF-8 byte | |
156 | sequence, nothing would be displayed at all in this case. Thus it is important | |
157 | to <b>never use 8-bit (instead of 7-bit) characters directly in the program source</b> | |
158 | but use wide strings or, alternatively, write: | |
159 | @code | |
160 | wxMessageBox(wxString::FromUTF8("Salut \xC3\xA0 toi!")); | |
161 | // in UTF-8 the character U+00E0 is encoded as the two bytes 0xC3 0xA0 | |
162 | @endcode | |
163 | ||
164 | In a similar way, wxString provides access to its contents as either @c wchar_t or | |
165 | @c char character buffer. Of course, the latter only works if the string contains | |
166 | data representable in the current locale encoding. This will always be the case | |
167 | if the string had been initially constructed from a narrow string or if it | |
168 | contains only 7-bit ASCII data but otherwise this conversion is not guaranteed | |
169 | to succeed. And as with the wxString::FromUTF8() example above, you can always use | |
170 | wxString::ToUTF8() to retrieve the string contents in UTF-8 encoding -- this, | |
171 | unlike converting to @c char* using the current locale, never fails. | |
172 | ||
173 | For more info about how wxString works, please see the @ref overview_string. | |
174 | ||
175 | To summarize, Unicode support in wxWidgets is mostly @b transparent for the | |
176 | application and if you use wxString objects for storing all the character data | |
177 | in your program there is really nothing special to do. However you should be | |
178 | aware of the potential problems covered by the following section. | |
179 | ||
180 | ||
181 | @subsection overview_unicode_support_utf Choosing Unicode Representation | |
182 | ||
183 | wxWidgets uses the system @c wchar_t in wxString implementation by default | |
184 | under all systems. Thus, under Microsoft Windows, UCS-2 (simplified version of | |
185 | UTF-16 without support for surrogate characters) is used as @c wchar_t is 2 | |
186 | bytes on this platform. Under Unix systems, including Mac OS X, UCS-4 (also | |
187 | known as UTF-32) is used by default, however it is also possible to build | |
188 | wxWidgets to use UTF-8 internally by passing @c --enable-utf8 option to | |
189 | configure. | |
190 | ||
191 | The interface provided by wxString is the same independently of the format used | |
192 | internally. However different formats have specific advantages and | |
193 | disadvantages. Notably, under Unix, the underlying graphical toolkit (e.g. | |
194 | GTK+) usually uses UTF-8 encoded strings and using the same representation for | |
195 | the strings in wxWidgets avoids the conversion from UTF-32 to UTF-8 and | |
196 | vice versa each time a string is shown in the UI or retrieved from it. The | |
197 | overhead of such conversions is usually negligible for small strings but may be | |
198 | important for some programs. If you believe that it would be advantageous to | |
199 | use UTF-8 for the strings in your particular application, you may rebuild | |
200 | wxWidgets to use UTF-8 as explained above (notice that this is currently not | |
201 | supported under Microsoft Windows and arguably doesn't make much sense there as | |
202 | Windows itself uses UTF-16 and not UTF-8) but be sure to be aware of the | |
203 | performance implications (see @ref overview_unicode_performance) of using UTF-8 | |
204 | in wxString before doing this! | |
205 | ||
206 | Generally speaking you should only use the non-default UTF-8 build in specific | |
207 | circumstances e.g. building for resource-constrained systems where the overhead | |
208 | of conversions (and also reduced memory usage of UTF-8 compared to UTF-32 for | |
209 | the European languages) can be important. If the environment in which your | |
210 | program is running is under your control -- as is quite often the case in such | |
211 | scenarios -- consider ensuring that the system always uses UTF-8 locale and | |
212 | use @c --enable-utf8only configure option to disable support for the other | |
213 | locales and consider all strings to be in UTF-8. This further reduces the code | |
214 | size and removes the need for conversions in more cases. | |
215 | ||
216 | ||
217 | @subsection overview_unicode_settings Unicode Related Preprocessor Symbols | |
218 | ||
219 | @c wxUSE_UNICODE is defined as 1 now to indicate Unicode support. It can be | |
220 | explicitly set to 0 in @c setup.h under MSW or you can use @c --disable-unicode | |
221 | under Unix but doing this is strongly discouraged. By default, @c | |
222 | wxUSE_UNICODE_WCHAR is also defined as 1, however in UTF-8 build (described in | |
223 | the previous section), it is set to 0 and @c wxUSE_UNICODE_UTF8, which is | |
224 | usually 0, is set to 1 instead. In the latter case, @c wxUSE_UTF8_LOCALE_ONLY | |
225 | can also be set to 1 to indicate that all strings are considered to be in UTF-8. | |
226 | ||
227 | ||
228 | ||
229 | @section overview_unicode_pitfalls Potential Unicode Pitfalls | |
230 | ||
231 | The problems can be separated into three broad classes: | |
232 | ||
233 | @subsection overview_unicode_compilation_errors Unicode-Related Compilation Errors | |
234 | ||
235 | Because of the need to support implicit conversions to both @c char and | |
236 | @c wchar_t, wxString implementation is rather involved and many of its operators | |
237 | don't return the types which they could be naively expected to return. | |
238 | For example, the @c operator[] returns neither a @c char nor a @c wchar_t | |
239 | but an object of a helper class wxUniChar or wxUniCharRef which is implicitly | |
240 | convertible to either. Usually you don't need to worry about this as the | |
241 | conversions do their work behind the scenes; however, in some cases they don't | |
242 | work. Here are some examples, using a wxString object @c s and some integer @c | |
243 | n: | |
244 | ||
245 | - Writing @code switch ( s[n] ) @endcode doesn't work because the argument of | |
246 | the switch statement must be an integer expression so you need to replace | |
247 | @c s[n] with @code s[n].GetValue() @endcode. You may also force the | |
248 | conversion to @c char or @c wchar_t by using an explicit cast but beware that | |
249 | converting the value to char uses the conversion to current locale and may | |
250 | return 0 if it fails. Finally notice that writing @code (wxChar)s[n] @endcode | |
251 | works both with wxWidgets 3.0 and previous library versions and so should be | |
252 | used for writing code which should be compatible with both 2.8 and 3.0. | |
253 | ||
254 | - Similarly, @code &s[n] @endcode doesn't yield a pointer to char so you may | |
255 | not pass it to functions expecting @c char* or @c wchar_t*. Consider using | |
256 | string iterators instead if possible or replace this expression with | |
257 | @code s.c_str() + n @endcode otherwise. | |
258 | ||
259 | Another class of problems is related to the fact that the value returned by | |
260 | @c c_str() itself is also not just a pointer to a buffer but a value of helper | |
261 | class wxCStrData which is implicitly convertible to both narrow and wide | |
262 | strings. Again, this mostly will be unnoticeable but can result in some | |
263 | problems: | |
264 | ||
265 | - You shouldn't pass @c c_str() result to vararg functions such as standard | |
266 | @c printf(). Some compilers (notably g++) warn about this but even if they | |
267 | don't, this @code printf("Hello, %s", s.c_str()) @endcode is not going to | |
268 | work. It can be corrected in one of the following ways: | |
269 | ||
270 | - Preferred: @code wxPrintf("Hello, %s", s) @endcode (notice the absence | |
271 | of @c c_str(), it is not needed at all with wxWidgets functions) | |
272 | - Compatible with wxWidgets 2.8: @code wxPrintf("Hello, %s", s.c_str()) @endcode | |
273 | - Using an explicit conversion to narrow, multibyte, string: | |
274 | @code printf("Hello, %s", (const char *)s.mb_str()) @endcode | |
275 | - Using a cast to force the issue (listed only for completeness): | |
276 | @code printf("Hello, %s", (const char *)s.c_str()) @endcode | |
277 | ||
278 | - The result of @c c_str() cannot be cast to @c char* but only to @c const | |
279 | @c char*. Of course, modifying the string via the pointer returned by this | |
280 | method has never been possible but unfortunately it was occasionally useful | |
281 | to use a @c const_cast here to pass the value to const-incorrect functions. | |
282 | This can be done either using new wxString::char_str() (and matching | |
283 | wchar_str()) method or by writing a double cast: | |
284 | @code (char *)(const char *)s.c_str() @endcode | |
285 | ||
286 | - One of the unfortunate consequences of the possibility to pass wxString to | |
287 | @c wxPrintf() without using @c c_str() is that it is now impossible to pass | |
288 | the elements of unnamed enumerations to @c wxPrintf() and other similar | |
289 | vararg functions, i.e. | |
290 | @code | |
291 | enum { Red, Green, Blue }; | |
292 | wxPrintf("Red is %d", Red); | |
293 | @endcode | |
294 | doesn't compile. The easiest workaround is to give a name to the enum. | |
295 | ||
296 | Other unexpected compilation errors may arise but they should happen even more | |
297 | rarely than the above-mentioned ones and the solution should usually be quite | |
298 | simple: just use the explicit methods of wxUniChar and wxCStrData classes | |
299 | instead of relying on their implicit conversions if the compiler can't choose | |
300 | among them. | |
301 | ||
302 | ||
303 | @subsection overview_unicode_data_loss Data Loss due To Unicode Conversion Errors | |
304 | ||
305 | wxString API provides implicit conversion of the internal Unicode string | |
306 | contents to narrow, char strings. This can be very convenient and is absolutely | |
307 | necessary for backwards compatibility with the existing code using wxWidgets; | |
308 | however, it is a rather dangerous operation as it can easily give unexpected | |
309 | results if the string contents aren't convertible to the current locale encoding. | |
310 | ||
311 | To be precise, the conversion will always succeed if the string was created | |
312 | from a narrow string initially. It will also succeed if the current encoding is | |
313 | UTF-8 as all Unicode strings are representable in this encoding. However | |
314 | initializing the string using wxString::FromUTF8() method and then accessing it | |
315 | as a char string via its wxString::c_str() method is a recipe for disaster as the | |
316 | program may work perfectly well during testing on Unix systems using UTF-8 locale | |
317 | but completely fail under Windows where UTF-8 locales are never used because | |
318 | wxString::c_str() would return an empty string. | |
319 | ||
320 | The simplest way to ensure that this doesn't happen is to avoid conversions to | |
321 | @c char* completely by using wxString throughout your program. However if the | |
322 | program never manipulates 8 bit strings internally, using @c char* pointers is | |
323 | safe as well. So the existing code needs to be reviewed when upgrading to | |
324 | wxWidgets 3.0 and the new code should be written with this in mind, ideally | |
325 | avoiding implicit conversions to @c char*. | |
326 | ||
327 | ||
328 | @subsection overview_unicode_performance Performance Implications of Using UTF-8 | |
329 | ||
330 | As mentioned above, under Unix systems wxString class can use variable-width | |
331 | UTF-8 encoding for internal representation. In this case it can't guarantee | |
332 | constant-time access to N-th element of the string any longer as to find the | |
333 | position of this character in the string we have to examine all the preceding | |
334 | ones. Usually this doesn't matter much because most algorithms used on the | |
335 | strings examine them sequentially anyhow and because wxString implements a | |
336 | cache for iterating over the string by index but it can have serious | |
337 | consequences for algorithms using random access to string elements as they | |
338 | typically acquire O(N^2) time complexity instead of O(N) where N is the length | |
339 | of the string. | |
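
The cost of indexed access in a variable-width encoding can be illustrated without wxString at all. The following standard C++ sketch finds where the N-th code point starts in a UTF-8 byte string; @c byteOffsetOfCodePoint is a hypothetical helper written for this overview, it assumes valid UTF-8 input and it is not how wxString is actually implemented:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: return the byte offset at which the n-th (0-based)
// code point starts, or utf8.size() if the string has too few code points.
std::size_t byteOffsetOfCodePoint(const std::string& utf8, std::size_t n)
{
    std::size_t seen = 0;
    for ( std::size_t pos = 0; pos < utf8.size(); ++pos )
    {
        // Continuation bytes have the form 10xxxxxx; any other byte
        // starts a new code point.
        const unsigned char byte = static_cast<unsigned char>(utf8[pos]);
        if ( (byte & 0xC0) != 0x80 )
        {
            if ( seen == n )
                return pos;
            ++seen;
        }
    }

    return utf8.size();
}
```

Each call has to walk the string from its beginning, so calling it once per index inside a loop over the whole string is what produces the quadratic behaviour described above.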
340 | ||
341 | Even despite caching the index, indexed access should be replaced with | |
342 | sequential access using string iterators. For example a typical loop: | |
343 | @code | |
344 | wxString s("hello"); | |
345 | for ( size_t i = 0; i < s.length(); i++ ) | |
346 | { | |
347 | wchar_t ch = s[i]; | |
348 | ||
349 | // do something with it | |
350 | } | |
351 | @endcode | |
352 | should be rewritten as | |
353 | @code | |
354 | wxString s("hello"); | |
355 | for ( wxString::const_iterator i = s.begin(); i != s.end(); ++i ) | |
356 | { | |
357 | wchar_t ch = *i; | |
358 | ||
359 | // do something with it | |
360 | } | |
361 | @endcode | |
362 | ||
363 | Another, similar, alternative is to use pointer arithmetic: | |
364 | @code | |
365 | wxString s("hello"); | |
366 | for ( const wchar_t *p = s.wc_str(); *p; p++ ) | |
367 | { | |
368 | wchar_t ch = *p; | |
369 | ||
370 | // do something with it | |
371 | } | |
372 | @endcode | |
373 | however this doesn't work correctly for strings with embedded @c NUL characters | |
374 | and the use of iterators is generally preferred as they provide some run-time | |
375 | checks (at least in debug build) unlike the raw pointers. But if you do use | |
376 | them, it is better to use @c wchar_t pointers rather than @c char ones to avoid the | |
377 | data loss problems due to conversion as discussed in the previous section. | |
378 | ||
379 | ||
380 | @section overview_unicode_supportout Unicode and the Outside World | |
381 | ||
382 | Even though wxWidgets always uses Unicode internally, not all the other | |
383 | libraries and programs do and even those that do use Unicode may use a | |
384 | different encoding of it. So you need to be able to convert the data to various | |
385 | representations and the wxString methods wxString::ToAscii(), wxString::ToUTF8() | |
386 | (or its synonym wxString::utf8_str()), wxString::mb_str(), wxString::c_str() and | |
387 | wxString::wc_str() can be used for this. | |
388 | ||
389 | The first of them should only be used for strings containing nothing but 7-bit ASCII | |
390 | characters; anything else will be replaced by some substitution character. | |
391 | wxString::mb_str() converts the string to the encoding used by the current locale | |
392 | and so can return an empty string if the string contains characters not representable in | |
393 | it as explained in @ref overview_unicode_data_loss. The same applies to wxString::c_str() | |
394 | if its result is used as a narrow string. Finally, wxString::ToUTF8() and wxString::wc_str() | |
395 | never fail: they always return, respectively, a @c char string containing the | |
396 | UTF-8 representation of the string and a @c wchar_t string. | |
397 | ||
398 | wxString also provides two convenience functions: wxString::From8BitData() and | |
399 | wxString::To8BitData(). They can be used to create a wxString from arbitrary binary | |
400 | data without supposing that it is in current locale encoding, and then get it back, | |
401 | again, without any conversion or, rather, undoing the conversion used by | |
402 | wxString::From8BitData(). Because of this you should only use wxString::To8BitData() | |
403 | for the strings created using wxString::From8BitData(). Also notice that in spite | |
404 | of the availability of these functions, wxString is not the ideal class for storing | |
405 | arbitrary binary data as it can take up to 4 times more space than needed | |
406 | (when using @c wchar_t internal representation on the systems where size of | |
407 | wide characters is 4 bytes) and you should consider using wxMemoryBuffer | |
408 | instead. | |
409 | ||
410 | A final word of caution: most of these functions may return either a pointer | |
411 | directly to the internal string buffer or a temporary wxCharBuffer or wxWCharBuffer | |
412 | object. Such objects are implicitly convertible to @c char and @c wchar_t pointers, | |
413 | respectively, and so the result of, for example, wxString::ToUTF8() can always be | |
414 | passed directly to a function taking <tt>const char*</tt>. However code such as | |
415 | @code | |
416 | const char *p = s.ToUTF8(); | |
417 | ... | |
418 | puts(p); // or call any other function taking const char * | |
419 | @endcode | |
420 | does @b not work because the temporary buffer returned by wxString::ToUTF8() is | |
421 | destroyed at the end of the first statement and @c p is left dangling. To correct this you should use | |
422 | @code | |
423 | const wxScopedCharBuffer p(s.ToUTF8()); | |
424 | puts(p); | |
425 | @endcode | |
426 | which does work. | |
427 | ||
428 | Similarly, wxWX2WCbuf can be used for the return type of wxString::wc_str(). | |
429 | But, once again, none of these cryptic types is really needed if you just pass | |
430 | the return value of any of the functions mentioned in this section to another | |
431 | function directly. | |
432 | ||
433 | */ |