Commit | Line | Data |
---|---|---|
15b6757b | 1 | ///////////////////////////////////////////////////////////////////////////// |
2cd3cc94 | 2 | // Name: unicode.h |
15b6757b FM |
3 | // Purpose: topic overview |
4 | // Author: wxWidgets team | |
5 | // RCS-ID: $Id$ | |
6 | // Licence: wxWindows license | |
7 | ///////////////////////////////////////////////////////////////////////////// | |
8 | ||
880efa2a | 9 | /** |
36c9828f | 10 | |
2cd3cc94 BP |
11 | @page overview_unicode Unicode Support in wxWidgets |
12 | ||
cc506697 VZ |
13 | This section describes how wxWidgets supports Unicode and how this can affect
14 | your programs. | |
36c9828f | 15 | |
cc506697 VZ |
16 | Notice that Unicode support has changed radically in wxWidgets 3.0 and a lot of |
17 | existing material pertaining to the previous versions of the library is not | |
18 | correct any more. Please see @ref overview_changes_unicode for the details of | |
19 | these changes. | |
20 | ||
21 | You can skip the first two sections if you're already familiar with Unicode and | |
22 | wish to jump directly into the details of its support in the library: | |
2cd3cc94 | 23 | @li @ref overview_unicode_what |
cc506697 | 24 | @li @ref overview_unicode_encodings |
2cd3cc94 | 25 | @li @ref overview_unicode_supportin |
cc506697 | 26 | @li @ref overview_unicode_pitfalls |
2cd3cc94 BP |
27 | @li @ref overview_unicode_supportout |
28 | @li @ref overview_unicode_settings | |
36c9828f | 29 | |
2cd3cc94 | 30 | <hr> |
36c9828f FM |
31 | |
32 | ||
2cd3cc94 BP |
33 | @section overview_unicode_what What is Unicode? |
34 | ||
cc506697 | 35 | Unicode is a standard for character encoding which addresses the shortcomings |
77ef61f5 FM |
36 | of the previous standards (e.g. the ASCII standard), by using 8, 16 or 32 bits |
37 | for encoding each character. | |
38 | This provides enough code points (see below for the definition) to | |
39 | encode all of the world's languages at once. | |
40 | More details about Unicode may be found at http://www.unicode.org/. | |
cc506697 VZ |
41 | |
42 | From a practical point of view, using Unicode is almost a requirement when | |
43 | writing applications for an international audience. Moreover, any application | |
44 | reading files which it didn't produce or receiving data from the network from | |
45 | other services should be ready to deal with Unicode. | |
46 | ||
47 | ||
77ef61f5 FM |
48 | @section overview_unicode_encodings Unicode Representations and Terminology |
49 | ||
50 | When working with Unicode, it's important to define the meaning of some terms. | |
51 | ||
727aa906 FM |
52 | A <b><em>glyph</em></b> is a particular image that represents a character or part |
53 | of a character. | |
77ef61f5 FM |
54 | Any character may have one or more associated glyphs; e.g. some of the possible
55 | glyphs for the capital letter 'A' are: | |
56 | ||
57 | @image html overview_unicode_glyphs.png | |
58 | ||
59 | Unicode assigns each character of almost any existing alphabet/script a number, | |
727aa906 | 60 | which is called a <b><em>code point</em></b>; it's typically indicated in documentation
77ef61f5 FM |
61 | manuals and on the Unicode website as @c U+xxxx where @c xxxx is a hexadecimal number.
62 | ||
63 | The Unicode standard divides the space of all possible code points into @e planes; | |
64 | a plane is a range of 65,536 (0x10000) contiguous Unicode code points. | |
65 | Planes are numbered from 0 to 16, where the first one is the @e BMP, or Basic | |
66 | Multilingual Plane. | |
727aa906 FM |
67 | The BMP contains characters for all modern languages, and a large number of |
68 | special characters. The other planes in fact contain mainly historic scripts, | |
69 | special-purpose characters or are unused. | |
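Since every plane spans exactly 0x10000 code points, the plane of a code point
is simply its value divided by 0x10000. The following is a minimal sketch of
that arithmetic in standard C++; the names @c PlaneOf and @c IsInBMP are
hypothetical helpers chosen for illustration and are not part of wxWidgets.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns the plane number (0..16) of a code point.
// Each plane holds 0x10000 code points, so the plane is cp / 0x10000,
// which is the same as shifting the value right by 16 bits.
inline unsigned PlaneOf(std::uint32_t cp)
{
    return static_cast<unsigned>(cp >> 16);
}

// Hypothetical helper: true if the code point lies in the Basic
// Multilingual Plane (plane 0).
inline bool IsInBMP(std::uint32_t cp)
{
    return PlaneOf(cp) == 0;
}
```

For example, U+00E0 is in plane 0 (the BMP) while U+1F600 is in plane 1.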
77ef61f5 FM |
70 | |
71 | Code points are represented in computer memory as a sequence of one or more | |
727aa906 | 72 | <b><em>code units</em></b>, where a code unit is a unit of memory: 8, 16, or 32 bits. |
77ef61f5 FM |
73 | More precisely, a code unit is the minimal bit combination that can represent a |
74 | unit of encoded text for processing or interchange. | |
75 | ||
76 | The @e UTF or Unicode Transformation Formats are algorithms mapping the Unicode | |
77 | code points to code unit sequences. The simplest of them is <b>UTF-32</b> where | |
727aa906 FM |
78 | each code unit is composed of 32 bits (4 bytes) and each code point is always
79 | represented by a single code unit (fixed length encoding). | |
77ef61f5 FM |
80 | (Note that even UTF-32 is still not completely trivial as the mapping is different |
81 | for little and big-endian architectures). UTF-32 is commonly used under Unix systems for | |
82 | internal representation of Unicode strings. | |
83 | ||
84 | Another very widespread standard is <b>UTF-16</b> which is used by Microsoft Windows: | |
85 | it encodes the first (approximately) 64 thousand Unicode code points | |
86 | (the BMP) using 16-bit code units (2 bytes) and uses a pair of 16-bit code | |
87 | units to encode the characters beyond this. These pairs are called @e surrogate @e pairs. | |
727aa906 | 88 | Thus UTF-16 uses a variable number of code units to encode each code point.
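The mapping from a code point to UTF-16 code units can be sketched in a few
lines of standard C++. This is only an illustration of the encoding form and
not the code used by wxWidgets; @c EncodeUTF16 is a hypothetical name, and the
sketch assumes its argument is a valid code point (at most 0x10FFFF and not
itself in the surrogate range).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encodes one code point as UTF-16 code units.
// Assumes cp is a valid code point (<= 0x10FFFF, not a surrogate).
std::vector<std::uint16_t> EncodeUTF16(std::uint32_t cp)
{
    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        // BMP code points are stored directly in a single code unit.
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Code points beyond the BMP are split into two 10-bit halves.
        const std::uint32_t v = cp - 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));   // high surrogate
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return units;
}
```

With this sketch U+00E0 yields the single code unit 0x00E0, whereas U+1F600
yields the two code units 0xD83D and 0xDE00.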
77ef61f5 FM |
89 | |
90 | Finally, the most widespread encoding used for the external Unicode storage | |
91 | (e.g. files and network protocols) is <b>UTF-8</b> which is byte-oriented and so | |
92 | avoids the endianness ambiguities of UTF-16 and UTF-32. | |
93 | UTF-8 uses code units of 8 bits (1 byte); code points beyond the usual English | |
94 | alphabet are represented using a variable number of bytes, which makes it less | |
95 | efficient than UTF-32 for internal representation. | |
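To make the variable-length nature of UTF-8 concrete, here is a minimal sketch
of an encoder in standard C++. It is not taken from wxWidgets; @c EncodeUTF8 is
a hypothetical name, and the sketch assumes a valid code point (at most
0x10FFFF) without performing any validation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one code point as 1 to 4 UTF-8 bytes.
// Assumes cp is a valid code point; no validation is performed.
std::string EncodeUTF8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // 7-bit ASCII is stored unchanged in one byte.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Under these assumptions 'A' (U+0041) takes one byte, U+00E0 takes the two bytes
0xC3 0xA0 and U+20AC takes the three bytes 0xE2 0x82 0xAC.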
96 | ||
97 | As a visual aid for understanding the differences between the various concepts described | |
98 | so far, look at the different UTF representations of the same code point: | |
99 | ||
100 | @image html overview_unicode_codes.png | |
101 | ||
102 | In this particular case UTF-8 requires more space than UTF-16 (3 bytes instead of 2). | |
103 | ||
104 | Note that from the C/C++ programmer's perspective the situation is further complicated | |
105 | by the fact that the standard type @c wchar_t which is usually used to represent the | |
cc506697 VZ |
106 | Unicode ("wide") strings in C/C++ doesn't have the same size on all platforms. |
107 | It is 4 bytes under Unix systems, corresponding to the tradition of using | |
108 | UTF-32, but only 2 bytes under Windows, as required for compatibility with | |
109 | the OS, which uses UTF-16. | |
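A program can check which situation it is in without any platform-specific
code. The sketch below uses only the standard library; @c WideCharBits and
@c WideCharHoldsAnyCodePoint are hypothetical names introduced for this
example.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Hypothetical helper: width of wchar_t in bits on the current platform.
inline std::size_t WideCharBits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical helper: the largest code point, U+10FFFF, needs 21 bits,
// so a single wchar_t can hold any code point only if it is that wide.
inline bool WideCharHoldsAnyCodePoint()
{
    return WideCharBits() >= 21;
}
```

On a platform with a 2-byte @c wchar_t the second function returns false, which
is exactly the case where surrogate pairs are needed.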
2cd3cc94 | 110 | |
77ef61f5 FM |
111 | Typically when UTF-8 is used, code units are stored in @c char types, since
112 | @c char is 8 bits wide on almost all systems; when using UTF-16, code | |
113 | units are typically stored in @c wchar_t types since @c wchar_t is at least 16 bits on | |
114 | all systems. This is also the approach used by wxString. | |
727aa906 | 115 | See @ref overview_string for more info. |
77ef61f5 FM |
116 | |
117 | See also http://unicode.org/glossary/ for the official definitions of the | |
118 | terms reported above. | |
119 | ||
2cd3cc94 | 120 | |
cc506697 | 121 | @section overview_unicode_supportin Unicode Support in wxWidgets |
2cd3cc94 | 122 | |
cc506697 VZ |
123 | Since wxWidgets 3.0 Unicode support is always enabled and building the library |
124 | without it is not recommended any longer and will cease to be supported in the | |
125 | near future. This means that internally only Unicode strings are used and that, | |
126 | under Microsoft Windows, Unicode system API is used which means that wxWidgets | |
127 | programs require the Microsoft Layer for Unicode to run on Windows 95/98/ME. | |
128 | ||
77ef61f5 FM |
129 | However, unlike the Unicode build mode of the previous versions of wxWidgets, this |
130 | support is mostly transparent: you can still continue to work with the @b narrow | |
727aa906 FM |
131 | (i.e. current locale-encoded @c char*) strings even if @b wide |
132 | (i.e. UTF-16/UCS-2-encoded @c wchar_t* or UTF-8-encoded @c char*) strings are also | |
cc506697 VZ |
133 | supported. Any wxWidgets function accepts arguments of either type as both |
134 | kinds of strings are implicitly converted to wxString, so both | |
135 | @code | |
136 | wxMessageBox("Hello, world!"); | |
137 | @endcode | |
77ef61f5 | 138 | and the somewhat less usual |
cc506697 | 139 | @code |
727aa906 | 140 | wxMessageBox(L"Salut \u00E0 toi!"); // U+00E0 is "Latin Small Letter a with Grave" |
cc506697 VZ |
141 | @endcode |
142 | work as expected. | |
2cd3cc94 | 143 | |
cc506697 VZ |
144 | Notice that the narrow strings used with wxWidgets are @e always assumed to be |
145 | in the current locale encoding, so writing | |
146 | @code | |
147 | wxMessageBox("Salut à toi!"); | |
148 | @endcode | |
149 | wouldn't work if the encoding used on the user system is incompatible with | |
f99af6c0 VS |
150 | ISO-8859-1 (or even if the sources were compiled under different locale |
151 | in the case of gcc). In particular, the most common encoding used under | |
152 | modern Unix systems is UTF-8 and as the string above is not a valid UTF-8 byte | |
153 | sequence, nothing would be displayed at all in this case. Thus it is important | |
77ef61f5 | 154 | to <b>never use 8-bit (instead of 7-bit) characters directly in the program source</b> |
727aa906 | 155 | but use wide strings or, alternatively, write: |
cc506697 | 156 | @code |
727aa906 FM |
157 | wxMessageBox(wxString::FromUTF8("Salut \xC3\xA0 toi!")); |
158 | // in UTF-8 the character U+00E0 is encoded as the two bytes 0xC3 0xA0 | |
cc506697 | 159 | @endcode |
2cd3cc94 | 160 | |
77ef61f5 FM |
161 | In a similar way, wxString provides access to its contents as either a @c wchar_t or
162 | a @c char character buffer. Of course, the latter only works if the string contains
cc506697 VZ |
163 | data representable in the current locale encoding. This will always be the case |
164 | if the string had been initially constructed from a narrow string or if it | |
165 | contains only 7-bit ASCII data but otherwise this conversion is not guaranteed | |
91fa0da4 FM |
166 | to succeed. And as with the wxString::FromUTF8() example above, you can always use
167 | wxString::ToUTF8() to retrieve the string contents in UTF-8 encoding -- this, | |
168 | unlike converting to @c char* using the current locale, never fails. | |
cc506697 | 169 | |
77ef61f5 FM |
170 | For more info about how wxString works, please see the @ref overview_string. |
171 | ||
172 | To summarize, Unicode support in wxWidgets is mostly @b transparent for the | |
cc506697 VZ |
173 | application and if you use wxString objects for storing all the character data |
174 | in your program there is really nothing special to do. However you should be | |
175 | aware of the potential problems covered by the following section. | |
176 | ||
177 | ||
178 | @section overview_unicode_pitfalls Potential Unicode Pitfalls | |
179 | ||
180 | The problems can be separated into three broad classes: | |
181 | ||
182 | @subsection overview_unicode_compilation_errors Unicode-Related Compilation Errors | |
183 | ||
77ef61f5 FM |
184 | Because of the need to support implicit conversions to both @c char and |
185 | @c wchar_t, the wxString implementation is rather involved and many of its operators | |
186 | don't return the types which they could be naively expected to return. | |
187 | For example, the @c operator[] returns neither a @c char nor a @c wchar_t
cc506697 VZ |
188 | but an object of a helper class wxUniChar or wxUniCharRef which is implicitly |
189 | convertible to either. Usually you don't need to worry about this as the | |
190 | conversions do their work behind the scenes; however, in some cases this doesn't | |
191 | work. Here are some examples, using a wxString object @c s and some integer @c | |
192 | n: | |
193 | ||
194 | - Writing @code switch ( s[n] ) @endcode doesn't work because the argument of | |
195 | the switch statement must be an integer expression so you need to replace | |
196 | @c s[n] with @code s[n].GetValue() @endcode. You may also force the | |
77ef61f5 | 197 | conversion to @c char or @c wchar_t by using an explicit cast but beware that |
cc506697 VZ |
198 | converting the value to char uses the conversion to current locale and may |
199 | return 0 if it fails. Finally notice that writing @code (wxChar)s[n] @endcode | |
200 | works both with wxWidgets 3.0 and previous library versions and so should be | |
201 | used for writing code which should be compatible with both 2.8 and 3.0. | |
202 | ||
203 | - Similarly, @code &s[n] @endcode doesn't yield a pointer to char so you may | |
204 | not pass it to functions expecting @c char* or @c wchar_t*. Consider using | |
205 | string iterators instead if possible or replace this expression with | |
206 | @code s.c_str() + n @endcode otherwise. | |
207 | ||
91fa0da4 FM |
208 | Another class of problems is related to the fact that the value returned by |
209 | @c c_str() itself is also not just a pointer to a buffer but a value of helper | |
cc506697 VZ |
210 | class wxCStrData which is implicitly convertible to both narrow and wide |
211 | strings. Again, this will mostly be unnoticeable but can result in some | |
212 | problems: | |
213 | ||
214 | - You shouldn't pass @c c_str() result to vararg functions such as standard | |
215 | @c printf(). Some compilers (notably g++) warn about this but even if they | |
216 | don't, this @code printf("Hello, %s", s.c_str()) @endcode is not going to | |
217 | work. It can be corrected in one of the following ways: | |
218 | ||
219 | - Preferred: @code wxPrintf("Hello, %s", s) @endcode (notice the absence | |
220 | of @c c_str(), it is not needed at all with wxWidgets functions) | |
221 | - Compatible with wxWidgets 2.8: @code wxPrintf("Hello, %s", s.c_str()) @endcode | |
222 | - Using an explicit conversion to narrow, multibyte, string: | |
f99af6c0 | 223 | @code printf("Hello, %s", (const char *)s.mb_str()) @endcode |
cc506697 VZ |
224 | - Using a cast to force the issue (listed only for completeness): |
225 | @code printf("Hello, %s", (const char *)s.c_str()) @endcode | |
226 | ||
227 | - The result of @c c_str() cannot be cast to @c char* but only to @c const | |
228 | @c char*. Of course, modifying the string via the pointer returned by this | |
229 | method has never been possible but unfortunately it was occasionally useful | |
230 | to use a @c const_cast here to pass the value to const-incorrect functions. | |
231 | This can be done either using the new wxString::char_str() (and matching | |
232 | wchar_str()) method or by writing a double cast: | |
233 | @code (char *)(const char *)s.c_str() @endcode | |
234 | ||
235 | - One of the unfortunate consequences of the possibility to pass wxString to | |
236 | @c wxPrintf() without using @c c_str() is that it is now impossible to pass | |
237 | the elements of unnamed enumerations to @c wxPrintf() and other similar | |
238 | vararg functions, i.e. | |
239 | @code | |
240 | enum { Red, Green, Blue }; | |
241 | wxPrintf("Red is %d", Red); | |
242 | @endcode | |
243 | doesn't compile. The easiest workaround is to give a name to the enum. | |
244 | ||
245 | Other unexpected compilation errors may arise but they should happen even more | |
246 | rarely than the above-mentioned ones and the solution should usually be quite | |
247 | simple: just use the explicit methods of wxUniChar and wxCStrData classes | |
248 | instead of relying on their implicit conversions if the compiler can't choose | |
249 | among them. | |
250 | ||
251 | ||
252 | @subsection overview_unicode_data_loss Data Loss due To Unicode Conversion Errors | |
253 | ||
254 | The wxString API provides implicit conversion of the internal Unicode string | |
255 | contents to narrow, char strings. This can be very convenient and is absolutely | |
256 | necessary for backwards compatibility with the existing code using wxWidgets. | |
257 | However, it is a rather dangerous operation as it can easily give unexpected | |
258 | results if the string contents aren't convertible to the current locale encoding. | |
259 | ||
260 | To be precise, the conversion will always succeed if the string was created | |
261 | from a narrow string initially. It will also succeed if the current encoding is | |
262 | UTF-8 as all Unicode strings are representable in this encoding. However | |
91fa0da4 FM |
263 | initializing the string using wxString::FromUTF8() method and then accessing it |
264 | as a char string via its wxString::c_str() method is a recipe for disaster as the | |
265 | program may work perfectly well during testing on Unix systems using UTF-8 locale | |
266 | but completely fail under Windows where UTF-8 locales are never used because | |
267 | wxString::c_str() would return an empty string. | |
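The failure mode can be illustrated without wxWidgets at all. The sketch below
converts UTF-32 text to ISO-8859-1 using only the standard library and, by
analogy with the behaviour described above, yields an empty string as soon as
one character is not representable. It is an analogy only and does not show how
wxString performs the conversion; @c ToLatin1 is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: converts UTF-32 text to ISO-8859-1.
// Returns false and leaves 'out' empty if any code point is above 0xFF,
// i.e. not representable in the target encoding.
bool ToLatin1(const std::u32string& text, std::string& out)
{
    out.clear();
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] > 0xFF) {
            out.clear();   // all-or-nothing: the whole conversion fails
            return false;
        }
        out += static_cast<char>(static_cast<unsigned char>(text[i]));
    }
    return true;
}
```

Text containing only U+00E0 converts successfully, but adding a single U+20AC
(the euro sign) makes the whole conversion fail, which is why code that works
in testing may break once real user data arrives.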
cc506697 VZ |
268 | |
269 | The simplest way to ensure that this doesn't happen is to avoid conversions to | |
270 | @c char* completely by using wxString throughout your program. However if the | |
271 | program never manipulates 8 bit strings internally, using @c char* pointers is | |
272 | safe as well. So the existing code needs to be reviewed when upgrading to | |
273 | wxWidgets 3.0 and new code should be written with this in mind, ideally | |
274 | avoiding implicit conversions to @c char*. | |
275 | ||
276 | ||
277 | @subsection overview_unicode_performance Unicode Performance Implications | |
278 | ||
279 | Under Unix systems wxString class uses variable-width UTF-8 encoding for | |
280 | internal representation and this implies that it can't guarantee constant-time | |
281 | access to N-th element of the string any longer as to find the position of this | |
282 | character in the string we have to examine all the preceding ones. Usually this | |
283 | doesn't matter much because most algorithms used on the strings examine them | |
a6919a6a RR |
284 | sequentially anyhow and because wxString implements a cache for iterating over |
285 | the string by index but it can have serious consequences for algorithms | |
286 | using random access to string elements as they typically acquire O(N^2) time | |
cc506697 VZ |
287 | complexity instead of O(N) where N is the length of the string. |
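The reason indexed access is expensive can be seen in a plain UTF-8 buffer.
The following sketch uses only the standard library and does not reflect the
actual wxString implementation (in particular it has no index cache); the
function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool IsContinuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: byte offset of the n-th code point (0-based).
// Every call has to scan from the start of the buffer, so it costs O(n);
// calling it for each index in a loop therefore costs O(N^2) overall.
std::size_t OffsetOfCodePoint(const std::string& utf8, std::size_t n)
{
    for (std::size_t offset = 0; offset < utf8.size(); ++offset) {
        if (!IsContinuation(static_cast<unsigned char>(utf8[offset]))) {
            if (n == 0)
                return offset;
            --n;
        }
    }
    return utf8.size(); // n is past the end of the string
}

// Hypothetical helper: one sequential pass visits every code point in
// O(N) total, which is what iterator-based loops achieve.
std::size_t CountCodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        if (!IsContinuation(static_cast<unsigned char>(utf8[i])))
            ++count;
    }
    return count;
}
```

For the 12-byte string "Salut \xC3\xA0 toi" there are 11 code points, and the
code point with index 7 starts at byte offset 8 because U+00E0 occupies two
bytes.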
288 | ||
a6919a6a | 289 | Despite the index cache, indexed access should be replaced with
cc506697 | 290 | sequential access using string iterators. For example a typical loop: |
7b74e828 | 291 | @code |
cc506697 VZ |
292 | wxString s("hello"); |
293 | for ( size_t i = 0; i < s.length(); i++ ) | |
7b74e828 | 294 | { |
cc506697 | 295 | wchar_t ch = s[i]; |
91fa0da4 | 296 | |
7b74e828 RR |
297 | // do something with it |
298 | } | |
299 | @endcode | |
cc506697 | 300 | should be rewritten as |
2cd3cc94 | 301 | @code |
cc506697 VZ |
302 | wxString s("hello"); |
303 | for ( wxString::const_iterator i = s.begin(); i != s.end(); ++i ) | |
7b74e828 | 304 | { |
cc506697 | 305 | wchar_t ch = *i;
91fa0da4 | 306 | |
7b74e828 RR |
307 | // do something with it |
308 | } | |
2cd3cc94 BP |
309 | @endcode |
310 | ||
cc506697 | 311 | Another, similar, alternative is to use pointer arithmetic: |
7b74e828 | 312 | @code |
cc506697 VZ |
313 | wxString s("hello"); |
314 | for ( const wchar_t *p = s.wc_str(); *p; p++ ) | |
7b74e828 | 315 | { |
cc506697 VZ |
316 | wchar_t ch = *p;
317 | ||
318 | // do something with it | |
7b74e828 RR |
319 | } |
320 | @endcode | |
cc506697 VZ |
321 | however this doesn't work correctly for strings with embedded @c NUL characters |
322 | and the use of iterators is generally preferred as they provide some run-time | |
323 | checks (at least in debug build) unlike the raw pointers. But if you do use | |
77ef61f5 | 324 | them, it is better to use @c wchar_t pointers rather than @c char ones to avoid the |
cc506697 | 325 | data loss problems due to conversion as discussed in the previous section. |
2cd3cc94 | 326 | |
2cd3cc94 | 327 | |
cc506697 | 328 | @section overview_unicode_supportout Unicode and the Outside World |
2cd3cc94 | 329 | |
cc506697 VZ |
330 | Even though wxWidgets always uses Unicode internally, not all the other |
331 | libraries and programs do and even those that do use Unicode may use a | |
332 | different encoding of it. So you need to be able to convert the data to various | |
91fa0da4 FM |
333 | representations and the wxString methods wxString::ToAscii(), wxString::ToUTF8() |
334 | (or its synonym wxString::utf8_str()), wxString::mb_str(), wxString::c_str() and | |
335 | wxString::wc_str() can be used for this. | |
727aa906 | 336 | |
91fa0da4 FM |
337 | The first of them should only be used for strings containing nothing but 7-bit ASCII
338 | characters; anything else will be replaced by some substitution character. | |
339 | wxString::mb_str() converts the string to the encoding used by the current locale | |
340 | and so can return an empty string if the string contains characters not representable in | |
341 | it as explained in @ref overview_unicode_data_loss. The same applies to wxString::c_str() | |
342 | if its result is used as a narrow string. Finally, wxString::ToUTF8() and wxString::wc_str() | |
cc506697 | 343 | functions never fail and always return, respectively, a pointer to a char string containing the
77ef61f5 | 344 | UTF-8 representation of the string or a @c wchar_t string.
cc506697 | 345 | |
91fa0da4 FM |
346 | wxString also provides two convenience functions: wxString::From8BitData() and |
347 | wxString::To8BitData(). They can be used to create a wxString from arbitrary binary | |
348 | data without supposing that it is in current locale encoding, and then get it back, | |
cc506697 | 349 | again, without any conversion or, rather, undoing the conversion used by |
91fa0da4 FM |
350 | wxString::From8BitData(). Because of this you should only use wxString::To8BitData()
351 | for the strings created using wxString::From8BitData(). Also notice that in spite
352 | of the availability of these functions, wxString is not the ideal class for storing | |
cc506697 VZ |
353 | arbitrary binary data as it can take up to 4 times more space than needed
354 | (when using @c wchar_t internal representation on the systems where size of | |
355 | wide characters is 4 bytes) and you should consider using wxMemoryBuffer | |
356 | instead. | |
357 | ||
358 | A final word of caution: most of these functions may return either a direct | |
359 | pointer to the internal string buffer or a temporary wxCharBuffer or wxWCharBuffer | |
77ef61f5 | 360 | object. Such objects are implicitly convertible to @c char and @c wchar_t pointers, |
91fa0da4 | 361 | respectively, and so the result of, for example, wxString::ToUTF8() can always be |
77ef61f5 | 362 | passed directly to a function taking <tt>const char*</tt>. However code such as |
7b74e828 | 363 | @code |
cc506697 VZ |
364 | const char *p = s.ToUTF8(); |
365 | ... | |
366 | puts(p); // or call any other function taking const char * | |
7b74e828 | 367 | @endcode |
91fa0da4 FM |
368 | does @b not work because the temporary buffer returned by wxString::ToUTF8() is |
369 | destroyed and @c p is left pointing nowhere. To correct this you may use | |
7b74e828 | 370 | @code |
cc506697 VZ |
371 | wxCharBuffer p(s.ToUTF8()); |
372 | puts(p); | |
7b74e828 | 373 | @endcode |
cc506697 | 374 | which does work but results in an unnecessary copy of string data in the build
91fa0da4 FM |
375 | configurations where wxString::ToUTF8() returns a pointer to the internal string buffer.
376 | If this inefficiency is important you may write | |
2cd3cc94 | 377 | @code |
cc506697 VZ |
378 | const wxUTF8Buf p(s.ToUTF8()); |
379 | puts(p); | |
2cd3cc94 | 380 | @endcode |
91fa0da4 FM |
381 | where @c wxUTF8Buf is the type corresponding to the real return type of wxString::ToUTF8(). |
382 | Similarly, wxWX2WCbuf can be used for the return type of wxString::wc_str(). | |
cc506697 VZ |
383 | But, once again, none of these cryptic types is really needed if you just pass |
384 | the return value of any of the functions mentioned in this section to another | |
385 | function directly. | |
2cd3cc94 BP |
386 | |
387 | @section overview_unicode_settings Unicode Related Compilation Settings | |
388 | ||
cc506697 VZ |
389 | @c wxUSE_UNICODE is now defined as 1 by default to indicate Unicode support. |
390 | If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is | |
391 | also defined, otherwise @c wxUSE_UNICODE_WCHAR is. | |
2cd3cc94 | 392 | |
77ef61f5 FM |
393 | You are encouraged to always use the default build settings of wxWidgets; this avoids |
394 | the need for different builds of the same application/library because of different | |
395 | "build modes". | |
396 | ||
2cd3cc94 BP |
397 | */ |
398 |