Commit | Line | Data |
---|---|---|
15b6757b | 1 | ///////////////////////////////////////////////////////////////////////////// |
2cd3cc94 | 2 | // Name: unicode.h |
15b6757b FM |
3 | // Purpose: topic overview |
4 | // Author: wxWidgets team | |
5 | // RCS-ID: $Id$ | |
6 | // Licence: wxWindows license | |
7 | ///////////////////////////////////////////////////////////////////////////// | |
8 | ||
880efa2a | 9 | /** |
36c9828f | 10 | |
2cd3cc94 BP |
11 | @page overview_unicode Unicode Support in wxWidgets |
12 | ||
cc506697 VZ |
13 | This section describes how wxWidgets supports Unicode and how this can affect
14 | your programs. | |
36c9828f | 15 | |
cc506697 VZ |
16 | Notice that Unicode support has changed radically in wxWidgets 3.0 and a lot of |
17 | existing material pertaining to the previous versions of the library is not | |
18 | correct any more. Please see @ref overview_changes_unicode for the details of | |
19 | these changes. | |
20 | ||
21 | You can skip the first two sections if you're already familiar with Unicode and | |
22 | wish to jump directly to the details of its support in the library: | |
2cd3cc94 | 23 | @li @ref overview_unicode_what |
cc506697 | 24 | @li @ref overview_unicode_encodings |
2cd3cc94 | 25 | @li @ref overview_unicode_supportin |
cc506697 | 26 | @li @ref overview_unicode_pitfalls |
2cd3cc94 BP |
27 | @li @ref overview_unicode_supportout |
28 | @li @ref overview_unicode_settings | |
36c9828f | 29 | |
2cd3cc94 | 30 | <hr> |
36c9828f FM |
31 | |
32 | ||
2cd3cc94 BP |
33 | @section overview_unicode_what What is Unicode? |
34 | ||
cc506697 VZ |
35 | Unicode is a standard for character encoding which addresses the shortcomings |
36 | of the previous, 8 bit standards, by using at least 16 (and possibly 32) bits | |
37 | for encoding each character. This makes it possible to have at least 65536 | |
38 | characters (in what is called the BMP, or basic multilingual plane), and over a | |
39 | million in total, instead of the usual 256, and is sufficient to encode all of | |
40 | the world's languages at once. More details about Unicode may be found at | |
41 | http://www.unicode.org/. | |
42 | ||
43 | From a practical point of view, using Unicode is almost a requirement when | |
44 | writing applications for an international audience. Moreover, any application | |
45 | reading files which it didn't produce or receiving data from the network from | |
46 | other services should be ready to deal with Unicode. | |
47 | ||
48 | ||
49 | @section overview_unicode_encodings Unicode Representations | |
50 | ||
51 | Unicode provides a unique code to identify every character; however, in practice | |
52 | these codes are not always used directly but are encoded using one of the standard | |
53 | UTFs, or Unicode Transformation Formats, which are algorithms mapping the Unicode | |
54 | codes to byte sequences. The simplest of them is UTF-32 which simply maps | |
55 | the Unicode code to a 4 byte sequence representing this 32 bit number (although | |
56 | this is still not completely trivial as the mapping is different for little and | |
57 | big-endian architectures). UTF-32 is commonly used under Unix systems for | |
58 | internal representation of Unicode strings. Another very widespread standard is | |
59 | UTF-16 which is used by Microsoft Windows: it encodes the first (approximately) | |
60 | 64 thousand Unicode characters using only 2 bytes and uses a pair of 16-bit | |
61 | codes to encode the characters beyond this. Finally, the most widespread | |
62 | encoding used for the external Unicode storage (e.g. files and network | |
63 | protocols) is UTF-8 which is byte-oriented and so avoids the endianness | |
64 | ambiguities of UTF-16 and UTF-32. However, UTF-8 uses a variable number of bytes | |
65 | for representing Unicode characters which makes it less efficient than UTF-32 | |
66 | for internal representation. | |
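
The mapping from a Unicode code to UTF-8 bytes or UTF-16 code units described above can be sketched in standard C++ as follows. This is only an illustration and not part of wxWidgets: the function names @c EncodeUtf8 and @c EncodeUtf16 are hypothetical, and the sketch assumes the input is a valid code (it does not reject the surrogate range U+D800..U+DFFF).

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Encode one Unicode code (at most U+10FFFF) as 1 to 4 UTF-8 bytes.
// Hypothetical helper for illustration only.
std::string EncodeUtf8(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    std::string out;
    if ( cp < 0x80 )
    {
        // 7-bit ASCII is stored unchanged in a single byte.
        out += static_cast<char>(cp);
    }
    else if ( cp < 0x800 )
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if ( cp < 0x10000 )
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one Unicode code as 1 or 2 UTF-16 code units.
// Codes beyond the BMP become a pair of 16-bit codes (a surrogate pair).
std::vector<std::uint16_t> EncodeUtf16(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    std::vector<std::uint16_t> out;
    if ( cp < 0x10000 )
    {
        out.push_back(static_cast<std::uint16_t>(cp));
    }
    else
    {
        const std::uint32_t v = cp - 0x10000;
        out.push_back(static_cast<std::uint16_t>(0xD800 | (v >> 10)));
        out.push_back(static_cast<std::uint16_t>(0xDC00 | (v & 0x3FF)));
    }
    return out;
}
```

With these helpers, U+00E0 ("Latin Small Letter a with Grave") takes two bytes in UTF-8 but a single 16-bit unit in UTF-16, while a code outside the BMP such as U+1F600 takes four bytes in UTF-8 and two units in UTF-16.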
67 | ||
68 | From the C/C++ programmer's perspective, the situation is further complicated by | |
69 | the fact that the standard type @c wchar_t which is used to represent the | |
70 | Unicode ("wide") strings in C/C++ doesn't have the same size on all platforms. | |
71 | It is 4 bytes under Unix systems, corresponding to the tradition of using | |
72 | UTF-32, but only 2 bytes under Windows which is required by compatibility with | |
73 | the OS which uses UTF-16. | |
2cd3cc94 | 74 | |
2cd3cc94 | 75 | |
cc506697 | 76 | @section overview_unicode_supportin Unicode Support in wxWidgets |
2cd3cc94 | 77 | |
cc506697 VZ |
78 | Since wxWidgets 3.0 Unicode support is always enabled and building the library |
79 | without it is not recommended any longer and will cease to be supported in the | |
80 | near future. This means that internally only Unicode strings are used and that, | |
81 | under Microsoft Windows, Unicode system API is used which means that wxWidgets | |
82 | programs require the Microsoft Layer for Unicode to run on Windows 95/98/ME. | |
83 | ||
84 | However, unlike Unicode build mode in the previous versions of wxWidgets, this | |
85 | support is mostly transparent: you can still continue to work with the narrow | |
86 | (i.e. @c char*) strings even if wide (i.e. @c wchar_t*) strings are also | |
87 | supported. Any wxWidgets function accepts arguments of either type as both | |
88 | kinds of strings are implicitly converted to wxString, so both | |
89 | @code | |
90 | wxMessageBox("Hello, world!"); | |
91 | @endcode | |
92 | and somewhat less usual | |
93 | @code | |
94 | wxMessageBox(L"Salut \u00e0 toi!"); // 00E0 is "Latin Small Letter a with Grave" | |
95 | @endcode | |
96 | work as expected. | |
2cd3cc94 | 97 | |
cc506697 VZ |
98 | Notice that the narrow strings used with wxWidgets are @e always assumed to be |
99 | in the current locale encoding, so writing | |
100 | @code | |
101 | wxMessageBox("Salut à toi!"); | |
102 | @endcode | |
103 | wouldn't work if the encoding used on the user system is incompatible with | |
104 | ISO-8859-1. In particular, the most common encoding used under modern Unix | |
105 | systems is UTF-8 and as the string above is not a valid UTF-8 byte sequence, | |
106 | nothing would be displayed at all in this case. Thus it is important to never | |
107 | use 8 bit characters directly in the program source but use wide strings or, | |
108 | alternatively, write | |
109 | @code | |
110 | wxMessageBox(wxString::FromUTF8("Salut \xc3\xa0 toi!")); | |
111 | @endcode | |
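
To see why the ISO-8859-1 form of the string is rejected under a UTF-8 locale, the following standard C++ sketch checks whether a byte sequence has the shape of valid UTF-8. It is not a wxWidgets function: @c IsValidUtf8 is a hypothetical name, and the check is deliberately simplified (it verifies lead and continuation bytes but does not reject every overlong form or encoded surrogate).

```cpp
#include <cstddef>
#include <string>

// Simplified structural UTF-8 check, for illustration only.
bool IsValidUtf8(const std::string& s)
{
    std::size_t i = 0;
    while ( i < s.size() )
    {
        const unsigned char lead = static_cast<unsigned char>(s[i]);

        // Number of continuation bytes announced by the lead byte.
        std::size_t extra;
        if ( lead < 0x80 )
            extra = 0;
        else if ( lead >= 0xC2 && lead <= 0xDF )
            extra = 1;
        else if ( lead >= 0xE0 && lead <= 0xEF )
            extra = 2;
        else if ( lead >= 0xF0 && lead <= 0xF4 )
            extra = 3;
        else
            return false; // stray continuation byte or invalid lead byte

        // The sequence must not run past the end of the string.
        if ( s.size() - i <= extra )
            return false;

        // Every continuation byte must look like 10xxxxxx.
        for ( std::size_t k = 1; k <= extra; ++k )
        {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ( (c & 0xC0) != 0x80 )
                return false;
        }

        i += extra + 1;
    }
    return true;
}
```

Assuming the source file is saved in ISO-8859-1, the letter in the first example is the single byte 0xE0, which announces a three-byte UTF-8 sequence but is followed by a space, so the check fails for "Salut \xe0 toi!" and succeeds for "Salut \xc3\xa0 toi!".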
2cd3cc94 | 112 | |
cc506697 VZ |
113 | In a similar way, wxString provides access to its contents as either a wchar_t or
114 | a char character buffer. Of course, the latter only works if the string contains | |
115 | data representable in the current locale encoding. This will always be the case | |
116 | if the string was initially constructed from a narrow string or if it | |
117 | contains only 7-bit ASCII data but otherwise this conversion is not guaranteed | |
118 | to succeed. And as with the @c FromUTF8() example above, you can always use @c | |
119 | ToUTF8() to retrieve the string contents in UTF-8 encoding -- this, unlike | |
120 | converting to @c char* using the current locale, never fails. | |
121 | ||
122 | To summarize, Unicode support in wxWidgets is mostly transparent for the | |
123 | application and if you use wxString objects for storing all the character data | |
124 | in your program there is really nothing special to do. However you should be | |
125 | aware of the potential problems covered by the following section. | |
126 | ||
127 | ||
128 | @section overview_unicode_pitfalls Potential Unicode Pitfalls | |
129 | ||
130 | The problems can be separated into three broad classes: | |
131 | ||
132 | @subsection overview_unicode_compilation_errors Unicode-Related Compilation Errors | |
133 | ||
134 | Because of the need to support implicit conversions to both @c char and @c | |
135 | wchar_t, wxString implementation is rather involved and many of its operators | |
136 | don't return the types which they could be naively expected to return. For | |
137 | example, the @c operator[] returns neither a @c char nor a @c wchar_t | |
138 | but an object of a helper class wxUniChar or wxUniCharRef which is implicitly | |
139 | convertible to either. Usually you don't need to worry about this as the | |
140 | conversions do their work behind the scenes; however, in some cases this doesn't | |
141 | work. Here are some examples, using a wxString object @c s and some integer @c | |
142 | n: | |
143 | ||
144 | - Writing @code switch ( s[n] ) @endcode doesn't work because the argument of | |
145 | the switch statement must be an integer expression so you need to replace | |
146 | @c s[n] with @code s[n].GetValue() @endcode. You may also force the | |
147 | conversion to char or wchar_t by using an explicit cast but beware that | |
148 | converting the value to char uses the conversion to current locale and may | |
149 | return 0 if it fails. Finally notice that writing @code (wxChar)s[n] @endcode | |
150 | works both with wxWidgets 3.0 and previous library versions and so should be | |
151 | used for writing code which should be compatible with both 2.8 and 3.0. | |
152 | ||
153 | - Similarly, @code &s[n] @endcode doesn't yield a pointer to char so you may | |
154 | not pass it to functions expecting @c char* or @c wchar_t*. Consider using | |
155 | string iterators instead if possible or replace this expression with | |
156 | @code s.c_str() + n @endcode otherwise. | |
157 | ||
158 | Another class of problems is related to the fact that the value returned by @c | |
159 | c_str() itself is also not just a pointer to a buffer but a value of helper | |
160 | class wxCStrData which is implicitly convertible to both narrow and wide | |
161 | strings. Again, this mostly will be unnoticeable but can result in some | |
162 | problems: | |
163 | ||
164 | - You shouldn't pass @c c_str() result to vararg functions such as standard | |
165 | @c printf(). Some compilers (notably g++) warn about this but even if they | |
166 | don't, this @code printf("Hello, %s", s.c_str()) @endcode is not going to | |
167 | work. It can be corrected in one of the following ways: | |
168 | ||
169 | - Preferred: @code wxPrintf("Hello, %s", s) @endcode (notice the absence | |
170 | of @c c_str(), it is not needed at all with wxWidgets functions) | |
171 | - Compatible with wxWidgets 2.8: @code wxPrintf("Hello, %s", s.c_str()) @endcode | |
172 | - Using an explicit conversion to narrow, multibyte, string: | |
173 | @code printf("Hello, %s", s.mb_str()) @endcode | |
174 | - Using a cast to force the issue (listed only for completeness): | |
175 | @code printf("Hello, %s", (const char *)s.c_str()) @endcode | |
176 | ||
177 | - The result of @c c_str() cannot be cast to @c char* but only to @c const | |
178 | @c char*. Of course, modifying the string via the pointer returned by this | |
179 | method has never been possible but unfortunately it was occasionally useful | |
180 | to use a @c const_cast here to pass the value to const-incorrect functions. | |
181 | This can be done either using new wxString::char_str() (and matching | |
182 | wchar_str()) method or by writing a double cast: | |
183 | @code (char *)(const char *)s.c_str() @endcode | |
184 | ||
185 | - One of the unfortunate consequences of the possibility to pass wxString to | |
186 | @c wxPrintf() without using @c c_str() is that it is now impossible to pass | |
187 | the elements of unnamed enumerations to @c wxPrintf() and other similar | |
188 | vararg functions, i.e. | |
189 | @code | |
190 | enum { Red, Green, Blue }; | |
191 | wxPrintf("Red is %d", Red); | |
192 | @endcode | |
193 | doesn't compile. The easiest workaround is to give a name to the enum. | |
194 | ||
195 | Other unexpected compilation errors may arise but they should happen even more | |
196 | rarely than the above-mentioned ones and the solution should usually be quite | |
197 | simple: just use the explicit methods of wxUniChar and wxCStrData classes | |
198 | instead of relying on their implicit conversions if the compiler can't choose | |
199 | among them. | |
200 | ||
201 | ||
202 | @subsection overview_unicode_data_loss Data Loss Due to Unicode Conversion Errors | |
203 | ||
204 | The wxString API provides implicit conversion of the internal Unicode string | |
205 | contents to narrow, char strings. This can be very convenient and is absolutely | |
206 | necessary for backwards compatibility with the existing code using wxWidgets; | |
207 | however, it is a rather dangerous operation as it can easily give unexpected | |
208 | results if the string contents aren't convertible to the current locale encoding. | |
209 | ||
210 | To be precise, the conversion will always succeed if the string was created | |
211 | from a narrow string initially. It will also succeed if the current encoding is | |
212 | UTF-8 as all Unicode strings are representable in this encoding. However | |
213 | initializing the string using FromUTF8() method and then accessing it as a char | |
214 | string via its c_str() method is a recipe for disaster as the program may work | |
215 | perfectly well during testing on Unix systems using UTF-8 locale but completely | |
216 | fail under Windows where UTF-8 locales are never used because c_str() would | |
217 | return an empty string. | |
218 | ||
219 | The simplest way to ensure that this doesn't happen is to avoid conversions to | |
220 | @c char* completely by using wxString throughout your program. However if the | |
221 | program never manipulates 8 bit strings internally, using @c char* pointers is | |
222 | safe as well. So the existing code needs to be reviewed when upgrading to | |
223 | wxWidgets 3.0 and new code should be written with this in mind, ideally | |
224 | avoiding implicit conversions to @c char*. | |
225 | ||
226 | ||
227 | @subsection overview_unicode_performance Unicode Performance Implications | |
228 | ||
229 | Under Unix systems wxString class uses variable-width UTF-8 encoding for | |
230 | internal representation and this implies that it can't guarantee constant-time | |
231 | access to N-th element of the string any longer as to find the position of this | |
232 | character in the string we have to examine all the preceding ones. Usually this | |
233 | doesn't matter much because most algorithms used on the strings examine them | |
234 | sequentially anyhow, but it can have serious consequences for the algorithms | |
235 | using indexed access to string elements as they typically acquire O(N^2) time | |
236 | complexity instead of O(N) where N is the length of the string. | |
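
The cost difference can be made concrete with a small standard C++ sketch that counts how many characters have to be stepped over when walking a UTF-8 string by index versus sequentially. This does not use wxString: all names here (@c SeqLen, @c StepsForIndexedLoop, @c StepsForSequentialLoop) are hypothetical, and the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence starting with the given lead byte.
// Assumes valid UTF-8 input; illustration only.
std::size_t SeqLen(unsigned char lead)
{
    if ( lead < 0x80 )
        return 1;
    if ( (lead >> 5) == 0x6 )
        return 2;
    if ( (lead >> 4) == 0xE )
        return 3;
    return 4;
}

// Number of characters (not bytes) in a UTF-8 string.
std::size_t CountChars(const std::string& utf8)
{
    std::size_t n = 0;
    for ( std::size_t pos = 0; pos < utf8.size();
          pos += SeqLen(static_cast<unsigned char>(utf8[pos])) )
        ++n;
    return n;
}

// Indexed access: locating character i means skipping the i preceding ones,
// so visiting every character costs 0 + 1 + ... + (N-1) skips in total.
std::size_t StepsForIndexedLoop(const std::string& utf8)
{
    std::size_t steps = 0;
    const std::size_t n = CountChars(utf8);
    for ( std::size_t i = 0; i < n; ++i )
    {
        std::size_t pos = 0;
        for ( std::size_t k = 0; k < i; ++k )
        {
            pos += SeqLen(static_cast<unsigned char>(utf8[pos]));
            ++steps;
        }
    }
    return steps;
}

// Sequential access: each character is stepped over exactly once.
std::size_t StepsForSequentialLoop(const std::string& utf8)
{
    std::size_t steps = 0;
    for ( std::size_t pos = 0; pos < utf8.size();
          pos += SeqLen(static_cast<unsigned char>(utf8[pos])) )
        ++steps;
    return steps;
}
```

For a string of N characters the indexed loop performs N(N-1)/2 skips while the sequential one performs N, which is the quadratic versus linear behaviour described above.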
237 | ||
238 | To return to the linear complexity, indexed access should be replaced with | |
239 | sequential access using string iterators. For example a typical loop: | |
7b74e828 | 240 | @code |
cc506697 VZ |
241 | wxString s("hello"); |
242 | for ( size_t i = 0; i < s.length(); i++ ) | |
7b74e828 | 243 | { |
cc506697 | 244 | wchar_t ch = s[i]; |
7b74e828 RR |
245 | |
246 | // do something with it | |
247 | } | |
248 | @endcode | |
cc506697 | 249 | should be rewritten as |
2cd3cc94 | 250 | @code |
cc506697 VZ |
251 | wxString s("hello"); |
252 | for ( wxString::const_iterator i = s.begin(); i != s.end(); ++i ) | |
7b74e828 | 253 | { |
cc506697 | 254 | wchar_t ch = *i;
7b74e828 RR |
255 | |
256 | // do something with it | |
257 | } | |
2cd3cc94 BP |
258 | @endcode |
259 | ||
cc506697 | 260 | Another, similar, alternative is to use pointer arithmetic: |
7b74e828 | 261 | @code |
cc506697 VZ |
262 | wxString s("hello"); |
263 | for ( const wchar_t *p = s.wc_str(); *p; p++ ) | |
7b74e828 | 264 | { |
cc506697 VZ |
265 | wchar_t ch = *p;
266 | ||
267 | // do something with it | |
7b74e828 RR |
268 | } |
269 | @endcode | |
cc506697 VZ |
270 | however this doesn't work correctly for strings with embedded @c NUL characters |
271 | and the use of iterators is generally preferred as they provide some run-time | |
272 | checks (at least in debug build) unlike the raw pointers. But if you do use | |
273 | them, it is better to use wchar_t pointers rather than char ones to avoid the | |
274 | data loss problems due to conversion as discussed in the previous section. | |
2cd3cc94 | 275 | |
2cd3cc94 | 276 | |
cc506697 | 277 | @section overview_unicode_supportout Unicode and the Outside World |
2cd3cc94 | 278 | |
cc506697 VZ |
279 | Even though wxWidgets always uses Unicode internally, not all the other |
280 | libraries and programs do and even those that do use Unicode may use a | |
281 | different encoding of it. So you need to be able to convert the data to various | |
282 | representations and the wxString methods ToAscii(), ToUTF8() (or its synonym | |
283 | utf8_str()), mb_str(), c_str() and wc_str() can be used for this. The first of | |
284 | them should only be used for strings containing only 7-bit ASCII characters; | |
285 | anything else will be replaced by some substitution character. mb_str() | |
286 | converts the string to the encoding used by the current locale and so can | |
287 | return an empty string if the string contains characters not representable in | |
288 | it as explained in @ref overview_unicode_data_loss. The same applies to c_str() | |
289 | if its result is used as a narrow string. Finally, the ToUTF8() and wc_str() | |
290 | functions never fail and always return a pointer to a char string containing the | |
291 | UTF-8 representation of the string or to a wchar_t string, respectively. | |
292 | ||
293 | wxString also provides two convenience functions: From8BitData() and | |
294 | To8BitData(). They can be used to create wxString from arbitrary binary data | |
295 | without supposing that it is in current locale encoding, and then get it back, | |
296 | again, without any conversion or, rather, undoing the conversion used by | |
297 | From8BitData(). Because of this you should only use To8BitData() for the | |
298 | strings created using From8BitData(). Also notice that in spite of the | |
299 | availability of these functions, wxString is not the ideal class for storing | |
300 | arbitrary binary data as it can take up to 4 times more space than needed | |
301 | (when using @c wchar_t internal representation on the systems where size of | |
302 | wide characters is 4 bytes) and you should consider using wxMemoryBuffer | |
303 | instead. | |
304 | ||
305 | A final word of caution: most of these functions may return either a pointer | |
306 | directly to the internal string buffer or a temporary wxCharBuffer or wxWCharBuffer | |
307 | object. Such objects are implicitly convertible to char and wchar_t pointers, | |
308 | respectively, and so the result of, for example, ToUTF8() can always be passed | |
309 | directly to a function taking @c const @c char*. However code such as | |
7b74e828 | 310 | @code |
cc506697 VZ |
311 | const char *p = s.ToUTF8(); |
312 | ... | |
313 | puts(p); // or call any other function taking const char * | |
7b74e828 | 314 | @endcode |
cc506697 VZ |
315 | does @b not work because the temporary buffer returned by ToUTF8() is destroyed |
316 | and @c p is left pointing nowhere. To correct this you may use | |
7b74e828 | 317 | @code |
cc506697 VZ |
318 | wxCharBuffer p(s.ToUTF8()); |
319 | puts(p); | |
7b74e828 | 320 | @endcode |
cc506697 VZ |
321 | which does work but results in an unnecessary copy of string data in the build |
322 | configurations where ToUTF8() returns a pointer to the internal string buffer. If | |
323 | this inefficiency is important you may write | |
2cd3cc94 | 324 | @code |
cc506697 VZ |
325 | const wxUTF8Buf p(s.ToUTF8()); |
326 | puts(p); | |
2cd3cc94 | 327 | @endcode |
cc506697 VZ |
328 | where @c wxUTF8Buf is the type corresponding to the real return type of |
329 | ToUTF8(). Similarly, wxWX2WCbuf can be used for the return type of wc_str(). | |
330 | But, once again, none of these cryptic types is really needed if you just pass | |
331 | the return value of any of the functions mentioned in this section to another | |
332 | function directly. | |
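
The lifetime rule behind this advice is ordinary C++ and can be reproduced without wxWidgets. In the sketch below @c std::string stands in for the temporary buffer object and @c ToUtf8Stub is a hypothetical stand-in for a conversion function returning such a temporary; neither is part of the wxWidgets API.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a conversion returning a temporary owning buffer
// (std::string plays the role of the buffer class here).
std::string ToUtf8Stub(const std::string& s)
{
    return s;
}

// Stand-in for "any other function taking const char *".
std::size_t LengthViaC(const char* p)
{
    return std::strlen(p);
}

// Safe: the temporary lives until the end of the full expression,
// i.e. until after LengthViaC() has returned.
std::size_t PassDirectly(const std::string& s)
{
    return LengthViaC(ToUtf8Stub(s).c_str());
}

// Safe: a named buffer object keeps the data alive while the pointer is used.
std::size_t KeepBuffer(const std::string& s)
{
    const std::string buf(ToUtf8Stub(s));
    const char* p = buf.c_str();
    return LengthViaC(p);
}

// Broken, shown only as a comment:
//   const char* p = ToUtf8Stub(s).c_str(); // temporary is destroyed here
//   LengthViaC(p);                         // p now dangles
```

Both safe variants correspond to the patterns recommended above: either pass the result straight to the consuming function or store it in an object of the buffer type before taking a pointer to it.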
2cd3cc94 BP |
333 | |
334 | @section overview_unicode_settings Unicode Related Compilation Settings | |
335 | ||
cc506697 VZ |
336 | @c wxUSE_UNICODE is now defined as 1 by default to indicate Unicode support. |
337 | If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is | |
338 | also defined, otherwise @c wxUSE_UNICODE_WCHAR is. | |
2cd3cc94 BP |
339 | |
340 | */ | |
341 |