/////////////////////////////////////////////////////////////////////////////
// Name:        unicode.h
// Purpose:     topic overview
// Author:      wxWidgets team
// RCS-ID:      $Id$
// Licence:     wxWindows licence
/////////////////////////////////////////////////////////////////////////////

/**

@page overview_unicode Unicode Support in wxWidgets

This section describes how wxWidgets supports Unicode and how this can affect
your programs.

Notice that Unicode support has changed radically in wxWidgets 3.0 and a lot of
existing material pertaining to the previous versions of the library is not
correct any more. Please see @ref overview_changes_unicode for the details of
these changes.

You can skip the first two sections if you're already familiar with Unicode and
wish to jump directly into the details of its support in the library:
@li @ref overview_unicode_what
@li @ref overview_unicode_encodings
@li @ref overview_unicode_supportin
@li @ref overview_unicode_pitfalls
@li @ref overview_unicode_supportout
@li @ref overview_unicode_settings

<hr>


@section overview_unicode_what What is Unicode?

Unicode is a standard for character encoding which addresses the shortcomings
of the previous standards (e.g. the ASCII standard) by assigning a number to
every character and storing these numbers using units of 8, 16 or 32 bits.
This provides enough code points (see below for the definition) to encode all
of the world's languages at once.
More details about Unicode may be found at http://www.unicode.org/.

From a practical point of view, using Unicode is almost a requirement when
writing applications for an international audience. Moreover, any application
reading files which it didn't produce or receiving data from the network from
other services should be ready to deal with Unicode.


@section overview_unicode_encodings Unicode Representations and Terminology

When working with Unicode, it's important to define the meaning of some terms.

A <b><em>glyph</em></b> is a particular image (usually part of a font) that
represents a character or part of a character.
Any character may have one or more glyphs associated with it; e.g. some of the
possible glyphs for the capital letter 'A' are:

@image html overview_unicode_glyphs.png

Unicode assigns each character of almost any existing alphabet/script a number,
which is called a <b><em>code point</em></b>; it's typically written in
documentation and on the Unicode website as @c U+xxxx where @c xxxx is a
hexadecimal number.

Note that typically one character is assigned exactly one code point, but there
are exceptions, such as the so-called <em>precomposed characters</em>
(see http://en.wikipedia.org/wiki/Precomposed_character) and the <em>ligatures</em>.
In these cases a single "character" may be mapped to more than one code point or,
vice versa, several characters may be mapped to a single code point.

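As a small illustration of these exceptions, the following standalone C++11
sketch (it is not part of wxWidgets and the function names are chosen only for
this example) spells the letter "à" both as the single precomposed code point
U+00E0 and as the letter 'a' followed by the combining grave accent U+0300. It
also shows the ligature U+FB01, which stands for the two letters "fi" using a
single code point. Each element of a @c std::u32string holds one code point, so
the string sizes are the code point counts.

```cpp
#include <string>

// "à" written as one precomposed code point: U+00E0.
inline std::u32string PrecomposedAGrave() { return U"\u00E0"; }

// The same character written as two code points:
// 'a' (U+0061) followed by the combining grave accent (U+0300).
inline std::u32string DecomposedAGrave() { return U"a\u0300"; }

// The ligature U+FB01 represents the two letters "fi" with one code point.
inline std::u32string FiLigature() { return U"\uFB01"; }
```
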
The Unicode standard divides the space of all possible code points into
<b><em>planes</em></b>; a plane is a range of 65,536 (0x10000) contiguous
Unicode code points.
Planes are numbered from 0 to 16, where the first one is the @e BMP, or Basic
Multilingual Plane.
The BMP contains characters for almost all modern languages and a large number
of special characters. The other planes mainly contain historic scripts and
special-purpose characters, or are unused.

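Because each plane holds exactly 0x10000 code points, the plane of a code point
is simply its value with the low 16 bits shifted away. The helper names in this
sketch are hypothetical and exist only for illustration; they are not wxWidgets
functions.

```cpp
#include <cassert>
#include <cstdint>

// Returns the plane (0..16) which the given code point belongs to.
inline unsigned PlaneOf(std::uint32_t codePoint)
{
    assert(codePoint <= 0x10FFFF); // U+10FFFF is the last code point of plane 16
    return static_cast<unsigned>(codePoint >> 16);
}

// True if the code point lies in the Basic Multilingual Plane (plane 0).
inline bool IsInBMP(std::uint32_t codePoint)
{
    return PlaneOf(codePoint) == 0;
}
```
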
Code points are represented in computer memory as a sequence of one or more
<b><em>code units</em></b>, where a code unit is a unit of memory: 8, 16, or 32 bits.
More precisely, a code unit is the minimal bit combination that can represent a
unit of encoded text for processing or interchange.

The <b><em>UTF</em></b> or Unicode Transformation Formats are algorithms mapping the Unicode
code points to code unit sequences. The simplest of them is <b>UTF-32</b> where
each code unit is composed of 32 bits (4 bytes) and each code point is always
represented by a single code unit (fixed length encoding).
(Note that even UTF-32 is still not completely trivial as the byte order of a
code unit differs between little and big-endian architectures.) UTF-32 is
commonly used under Unix systems for internal representation of Unicode strings.

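The endianness remark can be made concrete with a short standalone sketch which
writes the single UTF-32 code unit of a code point as bytes in both orders. The
function names are assumptions made for this example only; for U+20AC (the euro
sign) the big-endian form is the bytes 00 00 20 AC and the little-endian form is
AC 20 00 00.

```cpp
#include <array>
#include <cstdint>

// Extracts the byte located 'shift' bits above the lowest bit of a code point.
inline unsigned char ByteAt(std::uint32_t codePoint, unsigned shift)
{
    return static_cast<unsigned char>((codePoint >> shift) & 0xFF);
}

// UTF-32BE: the most significant byte of the code unit comes first.
inline std::array<unsigned char, 4> ToUTF32BE(std::uint32_t cp)
{
    return {{ ByteAt(cp, 24), ByteAt(cp, 16), ByteAt(cp, 8), ByteAt(cp, 0) }};
}

// UTF-32LE: the least significant byte of the code unit comes first.
inline std::array<unsigned char, 4> ToUTF32LE(std::uint32_t cp)
{
    return {{ ByteAt(cp, 0), ByteAt(cp, 8), ByteAt(cp, 16), ByteAt(cp, 24) }};
}
```
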
Another very widespread standard is <b>UTF-16</b> which is used by Microsoft Windows:
it encodes the first (approximately) 64 thousand Unicode code points
(the BMP) using a single 16-bit code unit (2 bytes) each and uses a pair of
16-bit code units to encode the characters beyond this. These pairs are called
<em>surrogate pairs</em>.
Thus UTF-16 uses a variable number of code units to encode each code point.

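The following standalone sketch shows how a code point is turned into one or
two UTF-16 code units. It follows the standard UTF-16 algorithm but is not the
code used by wxWidgets, and @c EncodeUTF16 is a name invented for this example.
A BMP code point such as U+20AC yields the single unit 0x20AC, while U+1F600
yields the surrogate pair 0xD83D 0xDE00.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encodes one code point as a sequence of one or two UTF-16 code units.
inline std::vector<std::uint16_t> EncodeUTF16(std::uint32_t cp)
{
    // Valid code points end at U+10FFFF and exclude the surrogate range itself.
    assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));

    if ( cp < 0x10000 )
    {
        // BMP code points are stored directly in a single code unit.
        return { static_cast<std::uint16_t>(cp) };
    }

    // Other planes: split the remaining 20 bits between two surrogates.
    const std::uint32_t v = cp - 0x10000;
    return { static_cast<std::uint16_t>(0xD800 + (v >> 10)),     // high surrogate
             static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)) }; // low surrogate
}
```
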
Finally, the most widespread encoding used for the external Unicode storage
(e.g. files and network protocols) is <b>UTF-8</b> which is byte-oriented and so
avoids the endianness ambiguities of UTF-16 and UTF-32.
UTF-8 uses code units of 8 bits (1 byte); code points beyond the 7-bit ASCII
range (which covers the usual English alphabet) are represented using a
variable number of bytes, which makes it less efficient than UTF-32 for
internal representation.

As a visual aid to understanding the differences between the various concepts
described so far, look at the different UTF representations of the same code point:

@image html overview_unicode_codes.png

In this particular case UTF-8 requires more space than UTF-16 (3 bytes instead of 2).

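The size difference can be checked with a minimal UTF-8 encoder. This is a
standalone sketch of the standard algorithm, not the wxWidgets implementation,
and @c EncodeUTF8 is a hypothetical name. U+20AC is used here only as an assumed
example of a BMP code point above U+07FF (it is not necessarily the one shown in
the image): it needs the three bytes E2 82 AC in UTF-8 but a single 2-byte code
unit in UTF-16, whereas U+00E0 needs only the two bytes C3 A0.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encodes one code point as a sequence of one to four UTF-8 bytes.
inline std::string EncodeUTF8(std::uint32_t cp)
{
    // Valid code points end at U+10FFFF and exclude the surrogate range.
    assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));

    std::string out;
    if ( cp < 0x80 )
    {
        // 7-bit ASCII is stored unchanged in one byte.
        out += static_cast<char>(cp);
    }
    else if ( cp < 0x800 )
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if ( cp < 0x10000 )
    {
        // The rest of the BMP needs three bytes.
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        // Code points outside the BMP need four bytes.
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```
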
Note that from the C/C++ programmer's perspective the situation is further
complicated by the fact that the standard type @c wchar_t, which is usually
used to represent the Unicode ("wide") strings in C/C++, doesn't have the same
size on all platforms.
It is 4 bytes under Unix systems, corresponding to the tradition of using
UTF-32, but only 2 bytes under Windows, as required for compatibility with
the OS which uses UTF-16.

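If a program needs to know which of the two situations it is compiled for, it
can query the properties of @c wchar_t at compile time. This is only a sketch
using the standard library; the two function names are invented for this
example and are not part of wxWidgets.

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>

// Width of wchar_t in bits on the platform this code is compiled for.
constexpr std::size_t WideCharBits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

// True if a single wchar_t can hold any code point up to U+10FFFF, i.e. if
// wide strings can use a fixed-width encoding such as UTF-32 rather than UTF-16.
constexpr bool WideCharHoldsAnyCodePoint()
{
    return WCHAR_MAX >= 0x10FFFF;
}
```
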
Typically when UTF-8 is used, code units are stored in @c char types, since
@c char is 8 bits wide on almost all systems; when using UTF-16, code units are
typically stored in @c wchar_t types since @c wchar_t is at least 16 bits wide
on all systems. This is also the approach used by wxString.
See @ref overview_string for more info.

See also http://unicode.org/glossary/ for the official definitions of the
terms reported above.


@section overview_unicode_supportin Unicode Support in wxWidgets

Since wxWidgets 3.0 Unicode support is always enabled; building the library
without it is no longer recommended and will cease to be supported in the
near future. This means that internally only Unicode strings are used and that,
under Microsoft Windows, the Unicode system API is used, which means that
wxWidgets programs require the Microsoft Layer for Unicode to run on
Windows 95/98/ME.

However, unlike the Unicode build mode of the previous versions of wxWidgets, this
support is mostly transparent: you can still continue to work with the @b narrow
(i.e. current locale-encoded @c char*) strings even if @b wide
(i.e. UTF-16-encoded @c wchar_t* or UTF-8-encoded @c char*) strings are also
supported. Any wxWidgets function accepts arguments of either type as both
kinds of strings are implicitly converted to wxString, so both
@code
wxMessageBox("Hello, world!");
@endcode
and the somewhat less usual
@code
wxMessageBox(L"Salut \u00E0 toi!"); // U+00E0 is "Latin Small Letter a with Grave"
@endcode
work as expected.

Notice that the narrow strings used with wxWidgets are @e always assumed to be
in the current locale encoding, so writing
@code
wxMessageBox("Salut à toi!");
@endcode
wouldn't work if the encoding used on the user system is incompatible with
ISO-8859-1 (or even if the sources were compiled under a different locale
in the case of gcc). In particular, the most common encoding used under
modern Unix systems is UTF-8 and as the string above, when stored in
ISO-8859-1, is not a valid UTF-8 byte sequence, nothing would be displayed at
all in this case. Thus it is important
to <b>never use 8-bit (instead of 7-bit) characters directly in the program source</b>
but use wide strings or, alternatively, write:
@code
wxMessageBox(wxString::FromUTF8("Salut \xC3\xA0 toi!"));
// in UTF-8 the character U+00E0 is encoded as the two bytes 0xC3 0xA0
@endcode

In a similar way, wxString provides access to its contents as either a @c wchar_t or
a @c char character buffer. Of course, the latter only works if the string contains
data representable in the current locale encoding. This will always be the case
if the string was initially constructed from a narrow string or if it
contains only 7-bit ASCII data, but otherwise this conversion is not guaranteed
to succeed. And as with the wxString::FromUTF8() example above, you can always use
wxString::ToUTF8() to retrieve the string contents in UTF-8 encoding -- this,
unlike converting to @c char* using the current locale, never fails.

For more info about how wxString works, please see the @ref overview_string.

To summarize, Unicode support in wxWidgets is mostly @b transparent for the
application and if you use wxString objects for storing all the character data
in your program there is really nothing special to do. However you should be
aware of the potential problems covered by the following section.


@section overview_unicode_pitfalls Potential Unicode Pitfalls

The problems can be separated into three broad classes:

@subsection overview_unicode_compilation_errors Unicode-Related Compilation Errors

Because of the need to support implicit conversions to both @c char and
@c wchar_t, the wxString implementation is rather involved and many of its operators
don't return the types which they could be naively expected to return.
For example, @c operator[] returns neither a @c char nor a @c wchar_t
but an object of a helper class wxUniChar or wxUniCharRef which is implicitly
convertible to either. Usually you don't need to worry about this as the
conversions do their work behind the scenes; however, in some cases this doesn't
work. Here are some examples, using a wxString object @c s and some integer @c n:

- Writing @code switch ( s[n] ) @endcode doesn't work because the argument of
  the switch statement must be an integer expression so you need to replace
  @c s[n] with @code s[n].GetValue() @endcode. You may also force the
  conversion to @c char or @c wchar_t by using an explicit cast but beware that
  converting the value to char uses the conversion to current locale and may
  return 0 if it fails. Finally notice that writing @code (wxChar)s[n] @endcode
  works both with wxWidgets 3.0 and previous library versions and so should be
  used for writing code which should be compatible with both 2.8 and 3.0.

- Similarly, @code &s[n] @endcode doesn't yield a pointer to char so you may
  not pass it to functions expecting @c char* or @c wchar_t*. Consider using
  string iterators instead if possible or replace this expression with
  @code s.c_str() + n @endcode otherwise.

Another class of problems is related to the fact that the value returned by
@c c_str() itself is also not just a pointer to a buffer but a value of the helper
class wxCStrData which is implicitly convertible to both narrow and wide
strings. Again, this will mostly be unnoticeable but can result in some
problems:

- You shouldn't pass the @c c_str() result to vararg functions such as standard
  @c printf(). Some compilers (notably g++) warn about this but even if they
  don't, this @code printf("Hello, %s", s.c_str()) @endcode is not going to
  work. It can be corrected in one of the following ways:

  - Preferred: @code wxPrintf("Hello, %s", s) @endcode (notice the absence
    of @c c_str(), it is not needed at all with wxWidgets functions)
  - Compatible with wxWidgets 2.8: @code wxPrintf("Hello, %s", s.c_str()) @endcode
  - Using an explicit conversion to narrow, multibyte, string:
    @code printf("Hello, %s", (const char *)s.mb_str()) @endcode
  - Using a cast to force the issue (listed only for completeness):
    @code printf("Hello, %s", (const char *)s.c_str()) @endcode

- The result of @c c_str() cannot be cast to @c char* but only to
  @c const @c char*. Of course, modifying the string via the pointer returned by this
  method has never been possible but unfortunately it was occasionally useful
  to use a @c const_cast here to pass the value to const-incorrect functions.
  This can be done either using the new wxString::char_str() (and matching
  wchar_str()) method or by writing a double cast:
  @code (char *)(const char *)s.c_str() @endcode

- One of the unfortunate consequences of the possibility to pass wxString to
  @c wxPrintf() without using @c c_str() is that it is now impossible to pass
  the elements of unnamed enumerations to @c wxPrintf() and other similar
  vararg functions, i.e.
  @code
  enum { Red, Green, Blue };
  wxPrintf("Red is %d", Red);
  @endcode
  doesn't compile. The easiest workaround is to give a name to the enum.

Other unexpected compilation errors may arise but they should happen even more
rarely than the above-mentioned ones and the solution should usually be quite
simple: just use the explicit methods of wxUniChar and wxCStrData classes
instead of relying on their implicit conversions if the compiler can't choose
among them.


@subsection overview_unicode_data_loss Data Loss due To Unicode Conversion Errors

The wxString API provides implicit conversion of the internal Unicode string
contents to narrow, char strings. This can be very convenient and is absolutely
necessary for backwards compatibility with the existing code using wxWidgets;
however, it is a rather dangerous operation as it can easily give unexpected
results if the string contents aren't convertible to the current locale encoding.

To be precise, the conversion will always succeed if the string was created
from a narrow string initially. It will also succeed if the current encoding is
UTF-8 as all Unicode strings are representable in this encoding. However
initializing the string using the wxString::FromUTF8() method and then accessing it
as a char string via its wxString::c_str() method is a recipe for disaster as the
program may work perfectly well during testing on Unix systems using a UTF-8 locale
but completely fail under Windows, where UTF-8 locales are never used, because
wxString::c_str() would return an empty string.

The simplest way to ensure that this doesn't happen is to avoid conversions to
@c char* completely by using wxString throughout your program. However if the
program never manipulates 8 bit strings internally, using @c char* pointers is
safe as well. So the existing code needs to be reviewed when upgrading to
wxWidgets 3.0 and the new code should be written with this in mind, ideally
avoiding implicit conversions to @c char*.


@subsection overview_unicode_performance Unicode Performance Implications

Under Unix systems the wxString class uses variable-width UTF-8 encoding for
internal representation and this implies that it can't guarantee constant-time
access to the N-th element of the string any longer, as to find the position of this
character in the string we have to examine all the preceding ones. Usually this
doesn't matter much because most algorithms used on the strings examine them
sequentially anyhow and because wxString implements a cache for iterating over
the string by index, but it can have serious consequences for algorithms
using random access to string elements as they typically acquire O(N^2) time
complexity instead of O(N) where N is the length of the string.

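To see where the extra cost comes from, consider this standalone sketch of an
index lookup in a UTF-8 buffer. It is not the wxString implementation (which,
as noted above, also caches positions) and @c OffsetOfCodePoint is a
hypothetical helper which assumes that its input is valid UTF-8. Each call has
to walk the bytes from the start of the string, so calling it for every index
in a loop takes a number of steps proportional to N^2.

```cpp
#include <cstddef>
#include <string>

// Returns the byte offset at which the n-th code point (counting from 0)
// starts in the given UTF-8 string, or std::string::npos if the string
// contains fewer than n + 1 code points.
inline std::size_t OffsetOfCodePoint(const std::string& utf8, std::size_t n)
{
    std::size_t seen = 0; // number of code points found before position i
    for ( std::size_t i = 0; i < utf8.size(); ++i )
    {
        const unsigned char byte = static_cast<unsigned char>(utf8[i]);

        // Bytes of the form 10xxxxxx continue the previous code point.
        if ( (byte & 0xC0) == 0x80 )
            continue;

        // Any other byte starts a new code point.
        if ( seen == n )
            return i;

        ++seen;
    }

    return std::string::npos;
}
```
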
Even with the index cache, indexed access should be replaced with
sequential access using string iterators. For example a typical loop:
@code
wxString s("hello");
for ( size_t i = 0; i < s.length(); i++ )
{
    wchar_t ch = s[i];

    // do something with it
}
@endcode
should be rewritten as
@code
wxString s("hello");
for ( wxString::const_iterator i = s.begin(); i != s.end(); ++i )
{
    wchar_t ch = *i;

    // do something with it
}
@endcode

Another, similar, alternative is to use pointer arithmetic:
@code
wxString s("hello");
for ( const wchar_t *p = s.wc_str(); *p; p++ )
{
    wchar_t ch = *p;

    // do something with it
}
@endcode
However this doesn't work correctly for strings with embedded @c NUL characters
and the use of iterators is generally preferred as they provide some run-time
checks (at least in debug build) unlike the raw pointers. But if you do use
raw pointers, it is better to use @c wchar_t pointers rather than @c char ones to avoid the
data loss problems due to conversion as discussed in the previous section.


@section overview_unicode_supportout Unicode and the Outside World

Even though wxWidgets always uses Unicode internally, not all the other
libraries and programs do and even those that do use Unicode may use a
different encoding of it. So you need to be able to convert the data to various
representations and the wxString methods wxString::ToAscii(), wxString::ToUTF8()
(or its synonym wxString::utf8_str()), wxString::mb_str(), wxString::c_str() and
wxString::wc_str() can be used for this.

The first of them should only be used for strings containing nothing but 7-bit
ASCII characters; anything else will be replaced by some substitution character.
wxString::mb_str() converts the string to the encoding used by the current locale
and so can return an empty string if the string contains characters not representable in
it as explained in @ref overview_unicode_data_loss. The same applies to wxString::c_str()
if its result is used as a narrow string. Finally, wxString::ToUTF8() and wxString::wc_str()
never fail and always return a @c char string containing the UTF-8
representation of the string or a @c wchar_t string, respectively.

wxString also provides two convenience functions: wxString::From8BitData() and
wxString::To8BitData(). They can be used to create a wxString from arbitrary binary
data without supposing that it is in current locale encoding, and then get it back,
again, without any conversion or, rather, undoing the conversion used by
wxString::From8BitData(). Because of this you should only use wxString::To8BitData()
for the strings created using wxString::From8BitData(). Also notice that in spite
of the availability of these functions, wxString is not the ideal class for storing
arbitrary binary data as it can take up to 4 times more space than needed
(when using @c wchar_t internal representation on the systems where size of
wide characters is 4 bytes) and you should consider using wxMemoryBuffer
instead.

A final word of caution: most of these functions may return either a direct
pointer to the internal string buffer or a temporary wxCharBuffer or wxWCharBuffer
object. Such objects are implicitly convertible to @c char and @c wchar_t pointers,
respectively, and so the result of, for example, wxString::ToUTF8() can always be
passed directly to a function taking <tt>const char*</tt>. However code such as
@code
const char *p = s.ToUTF8();
...
puts(p); // or call any other function taking const char *
@endcode
does @b not work because the temporary buffer returned by wxString::ToUTF8() is
destroyed at the end of the first statement and @c p is left pointing nowhere.
To correct this you should use
@code
const wxScopedCharBuffer p(s.ToUTF8());
puts(p);
@endcode
which does work.

Similarly, wxWX2WCbuf can be used for the return type of wxString::wc_str().
But, once again, none of these cryptic types is really needed if you just pass
the return value of any of the functions mentioned in this section to another
function directly.

@section overview_unicode_settings Unicode Related Compilation Settings

@c wxUSE_UNICODE is now defined as @c 1 by default to indicate Unicode support.
If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is
also defined, otherwise @c wxUSE_UNICODE_WCHAR is.

You are encouraged to always use the default build settings of wxWidgets; this avoids
the need for different builds of the same application/library because of different
"build modes".

*/