Commit | Line | Data |
---|---|---|
1 | ///////////////////////////////////////////////////////////////////////////// | |
2 | // Name: unicode.h | |
3 | // Purpose: topic overview | |
4 | // Author: wxWidgets team | |
5 | // RCS-ID: $Id$ | |
6 | // Licence: wxWindows licence | |
7 | ///////////////////////////////////////////////////////////////////////////// | |
8 | ||
9 | /** | |
10 | ||
11 | @page overview_unicode Unicode Support in wxWidgets | |
12 | ||
13 | This section describes how wxWidgets supports Unicode and how this can affect | |
14 | your programs. | |
15 | ||
16 | Notice that Unicode support has changed radically in wxWidgets 3.0 and a lot of | |
17 | existing material pertaining to the previous versions of the library is not | |
18 | correct any more. Please see @ref overview_changes_unicode for the details of | |
19 | these changes. | |
20 | ||
21 | You can skip the first two sections if you're already familiar with Unicode and | |
22 | wish to jump directly to the details of its support in the library: | |
23 | @li @ref overview_unicode_what | |
24 | @li @ref overview_unicode_encodings | |
25 | @li @ref overview_unicode_supportin | |
26 | @li @ref overview_unicode_pitfalls | |
27 | @li @ref overview_unicode_supportout | |
28 | @li @ref overview_unicode_settings | |
29 | ||
30 | <hr> | |
31 | ||
32 | ||
33 | @section overview_unicode_what What is Unicode? | |
34 | ||
35 | Unicode is a standard for character encoding which addresses the shortcomings | |
36 | of the previous standards (e.g. the ASCII standard) by encoding each character | |
37 | as one or more units of 8, 16 or 32 bits. | |
38 | This provides enough code points (see below for the definition) to | |
39 | encode all of the world's languages at once. | |
40 | More details about Unicode may be found at http://www.unicode.org/. | |
41 | ||
42 | From a practical point of view, using Unicode is almost a requirement when | |
43 | writing applications for an international audience. Moreover, any application | |
44 | reading files which it didn't produce or receiving data from the network from | |
45 | other services should be ready to deal with Unicode. | |
46 | ||
47 | ||
48 | @section overview_unicode_encodings Unicode Representations and Terminology | |
49 | ||
50 | When working with Unicode, it's important to define the meaning of some terms. | |
51 | ||
52 | A <b><em>glyph</em></b> is a particular image (usually part of a font) that | |
53 | represents a character or part of a character. | |
54 | Any character may have one or more associated glyphs; e.g. some of the possible | |
55 | glyphs for the capital letter 'A' are: | |
56 | ||
57 | @image html overview_unicode_glyphs.png | |
58 | ||
59 | Unicode assigns each character of almost any existing alphabet/script a number, | |
60 | which is called a <b><em>code point</em></b>; it's typically written in documentation | |
61 | and on the Unicode website as @c U+xxxx where @c xxxx is a hexadecimal number. | |
62 | ||
63 | Note that typically one character is assigned exactly one code point, but there | |
64 | are exceptions, such as the so-called <em>precomposed characters</em> | |
65 | (see http://en.wikipedia.org/wiki/Precomposed_character) or the <em>ligatures</em>. | |
66 | In these cases a single "character" may be mapped to more than one code point or, | |
67 | vice versa, several characters may be mapped to a single code point. | |
68 | ||
69 | The Unicode standard divides the space of all possible code points into <b><em>planes</em></b>; | |
70 | a plane is a range of 65,536 (0x10000) contiguous Unicode code points. | |
71 | Planes are numbered from 0 to 16, where the first one is the @e BMP, or Basic | |
72 | Multilingual Plane. | |
73 | The BMP contains characters for almost all modern languages, and a large number of | |
74 | special characters. The other planes contain mainly historic scripts and | |
75 | special-purpose characters, or are unused. | |
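
Because every plane holds exactly 0x10000 code points, the plane of a code point
can be read off from the bits above the low 16. The following is only a minimal
sketch in standard C++; the names @c UnicodePlane and @c IsInBMP are hypothetical
helpers introduced for illustration and are not part of wxWidgets.

```cpp
#include <cstdint>

// Hypothetical helper (not a wxWidgets function): returns the plane, 0 to 16,
// containing the given code point. Each plane spans 0x10000 code points, so
// the plane number is whatever remains after dropping the low 16 bits.
constexpr unsigned UnicodePlane(std::uint32_t codePoint)
{
    return static_cast<unsigned>(codePoint >> 16);
}

// Hypothetical helper: true if the code point lies in the Basic Multilingual
// Plane, i.e. in plane 0.
constexpr bool IsInBMP(std::uint32_t codePoint)
{
    return UnicodePlane(codePoint) == 0;
}
```

For instance, U+00E0 is in plane 0 (the BMP) while U+1F600 is in plane 1.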
76 | ||
77 | Code points are represented in computer memory as a sequence of one or more | |
78 | <b><em>code units</em></b>, where a code unit is a unit of memory: 8, 16, or 32 bits. | |
79 | More precisely, a code unit is the minimal bit combination that can represent a | |
80 | unit of encoded text for processing or interchange. | |
81 | ||
82 | The <b><em>UTF</em></b> or Unicode Transformation Formats are algorithms mapping the Unicode | |
83 | code points to code unit sequences. The simplest of them is <b>UTF-32</b> where | |
84 | each code unit is composed of 32 bits (4 bytes) and each code point is always | |
85 | represented by a single code unit (fixed length encoding). | |
86 | (Note that even UTF-32 is still not completely trivial as the mapping is different | |
87 | for little and big-endian architectures). UTF-32 is commonly used under Unix systems for | |
88 | internal representation of Unicode strings. | |
89 | ||
90 | Another very widespread standard is <b>UTF-16</b> which is used by Microsoft Windows: | |
91 | it encodes the first (approximately) 64 thousand Unicode code points | |
92 | (the BMP) using 16-bit code units (2 bytes) and uses a pair of 16-bit code | |
93 | units to encode the characters beyond this. These pairs are called <em>surrogate pairs</em>. | |
94 | Thus UTF-16 uses a variable number of code units to encode each code point. | |
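
The way a code point beyond the BMP is split into a pair of 16-bit code units can
be sketched as follows. This is an illustration in standard C++ only, not the
code used by wxWidgets; @c EncodeUTF16 is a hypothetical name and the function
assumes that its argument is a valid code point (at most 0x10FFFF and not itself
in the surrogate range).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not a wxWidgets function): encodes one code point as a
// sequence of one or two UTF-16 code units.
// Assumption: codePoint is valid, i.e. <= 0x10FFFF and not in 0xD800..0xDFFF.
std::vector<std::uint16_t> EncodeUTF16(std::uint32_t codePoint)
{
    std::vector<std::uint16_t> units;
    if ( codePoint < 0x10000 )
    {
        // BMP code point: stored as is in a single code unit.
        units.push_back(static_cast<std::uint16_t>(codePoint));
    }
    else
    {
        // Code point beyond the BMP: the 20-bit offset from 0x10000 is split
        // into two 10-bit halves carried by the two surrogates.
        const std::uint32_t offset = codePoint - 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (offset >> 10)));   // high surrogate
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (offset & 0x3FF))); // low surrogate
    }
    return units;
}
```

With this sketch U+00E0 yields the single code unit 0x00E0, whereas U+1F600
yields the two code units 0xD83D and 0xDE00.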
95 | ||
96 | Finally, the most widespread encoding used for the external Unicode storage | |
97 | (e.g. files and network protocols) is <b>UTF-8</b> which is byte-oriented and so | |
98 | avoids the endianness ambiguities of UTF-16 and UTF-32. | |
99 | UTF-8 uses code units of 8 bits (1 byte); code points outside the 7-bit ASCII | |
100 | range are represented using 2 to 4 bytes each, which makes it less | |
101 | efficient than UTF-32 for internal representation. | |
102 | ||
103 | As a visual aid to understand the differences between the various concepts described | |
104 | so far, look at the different UTF representations of the same code point: | |
105 | ||
106 | @image html overview_unicode_codes.png | |
107 | ||
108 | In this particular case UTF-8 requires more space than UTF-16 (3 bytes instead of 2). | |
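
The variable-length nature of UTF-8 can be made concrete with a short sketch in
standard C++. It is not taken from wxWidgets: @c EncodeUTF8 is a hypothetical
helper which assumes a valid code point, and U+20AC is used below only as an
example of a BMP code point needing 3 bytes in UTF-8 but a single 2-byte code
unit in UTF-16.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not a wxWidgets function): encodes one code point as
// UTF-8. Assumption: cp is a valid code point (<= 0x10FFFF, not a surrogate).
std::string EncodeUTF8(std::uint32_t cp)
{
    std::string out;
    if ( cp < 0x80 )
    {
        // 7-bit ASCII: one byte, identical to the ASCII code.
        out += static_cast<char>(cp);
    }
    else if ( cp < 0x800 )
    {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if ( cp < 0x10000 )
    {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Here U+0041 takes 1 byte, U+00E0 takes 2 bytes (0xC3 0xA0) and U+20AC takes
3 bytes (0xE2 0x82 0xAC).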
109 | ||
110 | Note that from the C/C++ programmer's perspective the situation is further complicated | |
111 | by the fact that the standard type @c wchar_t which is usually used to represent the | |
112 | Unicode ("wide") strings in C/C++ doesn't have the same size on all platforms. | |
113 | It is 4 bytes under Unix systems, corresponding to the tradition of using | |
114 | UTF-32, but only 2 bytes under Windows, as required for compatibility with | |
115 | the OS which uses UTF-16. | |
116 | ||
117 | Typically when UTF-8 is used, code units are stored in @c char values, since | |
118 | @c char is 8 bits wide on almost all systems; when using UTF-16, code | |
119 | units are typically stored in @c wchar_t values since @c wchar_t is at least 16 bits wide on | |
120 | all systems. This is also the approach used by wxString. | |
121 | See @ref overview_string for more info. | |
122 | ||
123 | See also http://unicode.org/glossary/ for the official definitions of the | |
124 | terms reported above. | |
125 | ||
126 | ||
127 | @section overview_unicode_supportin Unicode Support in wxWidgets | |
128 | ||
129 | Since wxWidgets 3.0 Unicode support is always enabled; building the library | |
130 | without it is no longer recommended and will cease to be supported in the | |
131 | near future. This means that internally only Unicode strings are used and that, | |
132 | under Microsoft Windows, the Unicode system API is used, so wxWidgets | |
133 | programs require the Microsoft Layer for Unicode to run on Windows 95/98/ME. | |
134 | ||
135 | However, unlike the Unicode build mode of the previous versions of wxWidgets, this | |
136 | support is mostly transparent: you can still continue to work with the @b narrow | |
137 | (i.e. current locale-encoded @c char*) strings even if @b wide | |
138 | (i.e. UTF-16-encoded @c wchar_t* or UTF-8-encoded @c char*) strings are also | |
139 | supported. Any wxWidgets function accepts arguments of either type as both | |
140 | kinds of strings are implicitly converted to wxString, so both | |
141 | @code | |
142 | wxMessageBox("Hello, world!"); | |
143 | @endcode | |
144 | and the somewhat less usual | |
145 | @code | |
146 | wxMessageBox(L"Salut \u00E0 toi!"); // U+00E0 is "Latin Small Letter a with Grave" | |
147 | @endcode | |
148 | work as expected. | |
149 | ||
150 | Notice that the narrow strings used with wxWidgets are @e always assumed to be | |
151 | in the current locale encoding, so writing | |
152 | @code | |
153 | wxMessageBox("Salut à toi!"); | |
154 | @endcode | |
155 | wouldn't work if the encoding used on the user system is incompatible with | |
156 | ISO-8859-1 (or even if the sources were compiled under a different locale | |
157 | in the case of gcc). In particular, the most common encoding used under | |
158 | modern Unix systems is UTF-8 and as the string above is not a valid UTF-8 byte | |
159 | sequence, nothing would be displayed at all in this case. Thus it is important | |
160 | to <b>never use 8-bit (instead of 7-bit) characters directly in the program source</b> | |
161 | but use wide strings or, alternatively, write: | |
162 | @code | |
163 | wxMessageBox(wxString::FromUTF8("Salut \xC3\xA0 toi!")); | |
164 | // in UTF-8 the character U+00E0 is encoded as the two bytes 0xC3 0xA0 | |
165 | @endcode | |
166 | ||
167 | In a similar way, wxString provides access to its contents as either @c wchar_t or | |
168 | @c char character buffer. Of course, the latter only works if the string contains | |
169 | data representable in the current locale encoding. This will always be the case | |
170 | if the string was initially constructed from a narrow string or if it | |
171 | contains only 7-bit ASCII data but otherwise this conversion is not guaranteed | |
172 | to succeed. And as with the wxString::FromUTF8() example above, you can always use | |
173 | wxString::ToUTF8() to retrieve the string contents in UTF-8 encoding -- this, | |
174 | unlike converting to @c char* using the current locale, never fails. | |
175 | ||
176 | For more info about how wxString works, please see the @ref overview_string. | |
177 | ||
178 | To summarize, Unicode support in wxWidgets is mostly @b transparent for the | |
179 | application and if you use wxString objects for storing all the character data | |
180 | in your program there is really nothing special to do. However you should be | |
181 | aware of the potential problems covered by the following section. | |
182 | ||
183 | ||
184 | @section overview_unicode_pitfalls Potential Unicode Pitfalls | |
185 | ||
186 | The problems can be separated into three broad classes: | |
187 | ||
188 | @subsection overview_unicode_compilation_errors Unicode-Related Compilation Errors | |
189 | ||
190 | Because of the need to support implicit conversions to both @c char and | |
191 | @c wchar_t, the wxString implementation is rather involved and many of its operators | |
192 | don't return the types which they could be naively expected to return. | |
193 | For example, the @c operator[] returns neither a @c char nor a @c wchar_t | |
194 | but an object of a helper class wxUniChar or wxUniCharRef which is implicitly | |
195 | convertible to either. Usually you don't need to worry about this as the | |
196 | conversions do their work behind the scenes; however, in some cases this doesn't | |
197 | work. Here are some examples, using a wxString object @c s and some integer @c | |
198 | n: | |
199 | ||
200 | - Writing @code switch ( s[n] ) @endcode doesn't work because the argument of | |
201 | the switch statement must be an integer expression so you need to replace | |
202 | @c s[n] with @code s[n].GetValue() @endcode. You may also force the | |
203 | conversion to @c char or @c wchar_t by using an explicit cast but beware that | |
204 | converting the value to char uses the conversion to current locale and may | |
205 | return 0 if it fails. Finally notice that writing @code (wxChar)s[n] @endcode | |
206 | works both with wxWidgets 3.0 and previous library versions and so should be | |
207 | used for writing code which should be compatible with both 2.8 and 3.0. | |
208 | ||
209 | - Similarly, @code &s[n] @endcode doesn't yield a pointer to char so you may | |
210 | not pass it to functions expecting @c char* or @c wchar_t*. Consider using | |
211 | string iterators instead if possible or replace this expression with | |
212 | @code s.c_str() + n @endcode otherwise. | |
213 | ||
214 | Another class of problems is related to the fact that the value returned by | |
215 | @c c_str() itself is also not just a pointer to a buffer but a value of helper | |
216 | class wxCStrData which is implicitly convertible to both narrow and wide | |
217 | strings. Again, this mostly will be unnoticeable but can result in some | |
218 | problems: | |
219 | ||
220 | - You shouldn't pass @c c_str() result to vararg functions such as standard | |
221 | @c printf(). Some compilers (notably g++) warn about this but even if they | |
222 | don't, this @code printf("Hello, %s", s.c_str()) @endcode is not going to | |
223 | work. It can be corrected in one of the following ways: | |
224 | ||
225 | - Preferred: @code wxPrintf("Hello, %s", s) @endcode (notice the absence | |
226 | of @c c_str(), it is not needed at all with wxWidgets functions) | |
227 | - Compatible with wxWidgets 2.8: @code wxPrintf("Hello, %s", s.c_str()) @endcode | |
228 | - Using an explicit conversion to narrow, multibyte, string: | |
229 | @code printf("Hello, %s", (const char *)s.mb_str()) @endcode | |
230 | - Using a cast to force the issue (listed only for completeness): | |
231 | @code printf("Hello, %s", (const char *)s.c_str()) @endcode | |
232 | ||
233 | - The result of @c c_str() cannot be cast to @c char* but only to @c const | |
234 | @c char*. Of course, modifying the string via the pointer returned by this | |
235 | method has never been possible but unfortunately it was occasionally useful | |
236 | to use a @c const_cast here to pass the value to const-incorrect functions. | |
237 | This can be done either using the new wxString::char_str() (and matching | |
238 | wchar_str()) method or by writing a double cast: | |
239 | @code (char *)(const char *)s.c_str() @endcode | |
240 | ||
241 | - One of the unfortunate consequences of the possibility to pass wxString to | |
242 | @c wxPrintf() without using @c c_str() is that it is now impossible to pass | |
243 | the elements of unnamed enumerations to @c wxPrintf() and other similar | |
244 | vararg functions, i.e. | |
245 | @code | |
246 | enum { Red, Green, Blue }; | |
247 | wxPrintf("Red is %d", Red); | |
248 | @endcode | |
249 | doesn't compile. The easiest workaround is to give a name to the enum. | |
250 | ||
251 | Other unexpected compilation errors may arise but they should happen even more | |
252 | rarely than the above-mentioned ones and the solution should usually be quite | |
253 | simple: just use the explicit methods of wxUniChar and wxCStrData classes | |
254 | instead of relying on their implicit conversions if the compiler can't choose | |
255 | among them. | |
256 | ||
257 | ||
258 | @subsection overview_unicode_data_loss Data Loss Due to Unicode Conversion Errors | |
259 | ||
260 | The wxString API provides implicit conversion of the internal Unicode string | |
261 | contents to narrow, char strings. This can be very convenient and is absolutely | |
262 | necessary for backwards compatibility with the existing code using wxWidgets; | |
263 | however, it is a rather dangerous operation as it can easily give unexpected | |
264 | results if the string contents aren't convertible to the current locale encoding. | |
265 | ||
266 | To be precise, the conversion will always succeed if the string was created | |
267 | from a narrow string initially. It will also succeed if the current encoding is | |
268 | UTF-8 as all Unicode strings are representable in this encoding. However | |
269 | initializing the string using wxString::FromUTF8() method and then accessing it | |
270 | as a char string via its wxString::c_str() method is a recipe for disaster as the | |
271 | program may work perfectly well during testing on Unix systems using a UTF-8 locale | |
272 | but completely fail under Windows where UTF-8 locales are never used because | |
273 | wxString::c_str() would return an empty string. | |
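
The failure mode described above can be imitated without wxWidgets. The sketch
below is only an analogy in standard C++: @c ToLatin1OrEmpty is a hypothetical
helper, not a wxWidgets function, and ISO-8859-1 merely stands in for whatever
non-UTF-8 encoding the current locale might use. As described for the narrow
string conversion, it returns an empty string as soon as one character is not
representable.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not a wxWidgets function): converts a sequence of
// Unicode code points to ISO-8859-1, used here as a stand-in for a locale
// encoding which is not UTF-8. If any code point is not representable, the
// whole conversion fails and an empty string is returned.
std::string ToLatin1OrEmpty(const std::vector<std::uint32_t>& codePoints)
{
    std::string out;
    for ( std::uint32_t cp : codePoints )
    {
        if ( cp > 0xFF )
            return std::string();   // not representable: everything is lost

        out += static_cast<char>(cp);
    }
    return out;
}
```

A text containing only U+00E0 and ASCII letters converts successfully, but
adding a single U+20AC makes the result empty, which is easy to miss if all the
testing is done with representable data only.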
274 | ||
275 | The simplest way to ensure that this doesn't happen is to avoid conversions to | |
276 | @c char* completely by using wxString throughout your program. However if the | |
277 | program never manipulates 8-bit strings internally, using @c char* pointers is | |
278 | safe as well. So the existing code needs to be reviewed when upgrading to | |
279 | wxWidgets 3.0 and the new code should be written with this in mind, ideally | |
280 | avoiding implicit conversions to @c char*. | |
281 | ||
282 | ||
283 | @subsection overview_unicode_performance Unicode Performance Implications | |
284 | ||
285 | Under Unix systems the wxString class uses the variable-width UTF-8 encoding for | |
286 | internal representation and this implies that it can't guarantee constant-time | |
287 | access to the N-th element of the string any longer as to find the position of this | |
288 | character in the string we have to examine all the preceding ones. Usually this | |
289 | doesn't matter much because most algorithms used on the strings examine them | |
290 | sequentially anyhow and because wxString implements a cache for iterating over | |
291 | the string by index but it can have serious consequences for algorithms | |
292 | using random access to string elements as they typically acquire O(N^2) time | |
293 | complexity instead of O(N) where N is the length of the string. | |
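
To see why access by index is linear in a UTF-8 string, consider this sketch of
locating the N-th character in a UTF-8 buffer. It is written in standard C++ for
illustration only and is not the wxString implementation; @c Utf8OffsetOfNth is
a hypothetical name and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a wxWidgets function): returns the byte offset of
// the n-th code point (counting from 0) in a UTF-8 string, or
// std::string::npos if the string contains n or fewer code points.
// Assumption: utf8 is well-formed.
std::size_t Utf8OffsetOfNth(const std::string& utf8, std::size_t n)
{
    std::size_t count = 0;
    for ( std::size_t i = 0; i < utf8.size(); ++i )
    {
        // Continuation bytes look like 10xxxxxx; any other byte starts a new
        // code point.
        const bool startsCodePoint =
            (static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80;

        if ( startsCodePoint )
        {
            if ( count == n )
                return i;
            ++count;
        }
    }
    return std::string::npos;
}
```

Each call has to walk the string from its start, so calling such a function for
every index from 0 to N-1 visits on the order of N^2 bytes in total, whereas a
single sequential pass visits each byte once.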
294 | ||
295 | Even with this caching, indexed access should be replaced with | |
296 | sequential access using string iterators. For example a typical loop: | |
297 | @code | |
298 | wxString s("hello"); | |
299 | for ( size_t i = 0; i < s.length(); i++ ) | |
300 | { | |
301 | wchar_t ch = s[i]; | |
302 | ||
303 | // do something with it | |
304 | } | |
305 | @endcode | |
306 | should be rewritten as | |
307 | @code | |
308 | wxString s("hello"); | |
309 | for ( wxString::const_iterator i = s.begin(); i != s.end(); ++i ) | |
310 | { | |
311 | wchar_t ch = *i; | |
312 | ||
313 | // do something with it | |
314 | } | |
315 | @endcode | |
316 | ||
317 | Another, similar, alternative is to use pointer arithmetic: | |
318 | @code | |
319 | wxString s("hello"); | |
320 | for ( const wchar_t *p = s.wc_str(); *p; p++ ) | |
321 | { | |
322 | wchar_t ch = *p; | |
323 | ||
324 | // do something with it | |
325 | } | |
326 | @endcode | |
327 | however this doesn't work correctly for strings with embedded @c NUL characters | |
328 | and the use of iterators is generally preferred as they provide some run-time | |
329 | checks (at least in debug build) unlike the raw pointers. But if you do use | |
330 | them, it is better to use @c wchar_t pointers rather than @c char ones to avoid the | |
331 | data loss problems due to conversion as discussed in the previous section. | |
332 | ||
333 | ||
334 | @section overview_unicode_supportout Unicode and the Outside World | |
335 | ||
336 | Even though wxWidgets always uses Unicode internally, not all the other | |
337 | libraries and programs do and even those that do use Unicode may use a | |
338 | different encoding of it. So you need to be able to convert the data to various | |
339 | representations and the wxString methods wxString::ToAscii(), wxString::ToUTF8() | |
340 | (or its synonym wxString::utf8_str()), wxString::mb_str(), wxString::c_str() and | |
341 | wxString::wc_str() can be used for this. | |
342 | ||
343 | The first of them should only be used for strings containing only 7-bit ASCII | |
344 | characters; anything else will be replaced by some substitution character. | |
345 | wxString::mb_str() converts the string to the encoding used by the current locale | |
346 | and so can return an empty string if the string contains characters not representable in | |
347 | it as explained in @ref overview_unicode_data_loss. The same applies to wxString::c_str() | |
348 | if its result is used as a narrow string. Finally, wxString::ToUTF8() and wxString::wc_str() | |
349 | never fail and always return a pointer to, respectively, a @c char string containing the | |
350 | UTF-8 representation of the string or a @c wchar_t string. | |
351 | ||
352 | wxString also provides two convenience functions: wxString::From8BitData() and | |
353 | wxString::To8BitData(). They can be used to create a wxString from arbitrary binary | |
354 | data without supposing that it is in the current locale encoding, and then get it back, | |
355 | again, without any conversion or, rather, undoing the conversion used by | |
356 | wxString::From8BitData(). Because of this you should only use wxString::To8BitData() | |
357 | for the strings created using wxString::From8BitData(). Also notice that in spite | |
358 | of the availability of these functions, wxString is not the ideal class for storing | |
359 | arbitrary binary data as it can take up to 4 times more space than needed | |
360 | (when using @c wchar_t internal representation on the systems where size of | |
361 | wide characters is 4 bytes) and you should consider using wxMemoryBuffer | |
362 | instead. | |
363 | ||
364 | A final word of caution: most of these functions may return either a direct | |
365 | pointer to the internal string buffer or a temporary wxCharBuffer or wxWCharBuffer | |
366 | object. Such objects are implicitly convertible to @c char and @c wchar_t pointers, | |
367 | respectively, and so the result of, for example, wxString::ToUTF8() can always be | |
368 | passed directly to a function taking <tt>const char*</tt>. However code such as | |
369 | @code | |
370 | const char *p = s.ToUTF8(); | |
371 | ... | |
372 | puts(p); // or call any other function taking const char * | |
373 | @endcode | |
374 | does @b not work because the temporary buffer returned by wxString::ToUTF8() is | |
375 | destroyed at the end of the first statement and @c p is left dangling. To correct this you should use | |
376 | @code | |
377 | const wxScopedCharBuffer p(s.ToUTF8()); | |
378 | puts(p); | |
379 | @endcode | |
380 | which does work. | |
381 | ||
382 | Similarly, wxWX2WCbuf can be used for the return type of wxString::wc_str(). | |
383 | But, once again, none of these cryptic types is really needed if you just pass | |
384 | the return value of any of the functions mentioned in this section to another | |
385 | function directly. | |
386 | ||
387 | @section overview_unicode_settings Unicode Related Compilation Settings | |
388 | ||
389 | @c wxUSE_UNICODE is now defined as @c 1 by default to indicate Unicode support. | |
390 | If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is | |
391 | also defined, otherwise @c wxUSE_UNICODE_WCHAR is. | |
392 | ||
393 | You are encouraged to always use the default build settings of wxWidgets; this avoids | |
394 | the need for different builds of the same application/library because of different | |
395 | "build modes". | |
396 | ||
397 | */ | |
398 |