Commit | Line | Data |
---|---|---|
15b6757b | 1 | ///////////////////////////////////////////////////////////////////////////// |
2cd3cc94 | 2 | // Name: unicode.h |
15b6757b FM |
3 | // Purpose: topic overview |
4 | // Author: wxWidgets team | |
5 | // RCS-ID: $Id$ | |
526954c5 | 6 | // Licence: wxWindows licence |
15b6757b FM |
7 | ///////////////////////////////////////////////////////////////////////////// |
8 | ||
880efa2a | 9 | /** |
36c9828f | 10 | |
2cd3cc94 BP |
11 | @page overview_unicode Unicode Support in wxWidgets |
12 | ||
cc506697 VZ |
13 | This section describes how wxWidgets supports Unicode and how it can affect
14 | your programs. | |
36c9828f | 15 | |
cc506697 VZ |
16 | Notice that Unicode support has changed radically in wxWidgets 3.0 and a lot of |
17 | existing material pertaining to the previous versions of the library is not | |
18 | correct any more. Please see @ref overview_changes_unicode for the details of | |
19 | these changes. | |
20 | ||
21 | You can skip the first two sections if you're already familiar with Unicode and | |
22 | wish to jump directly to the details of its support in the library: | |
2cd3cc94 | 23 | @li @ref overview_unicode_what |
cc506697 | 24 | @li @ref overview_unicode_encodings |
2cd3cc94 | 25 | @li @ref overview_unicode_supportin |
cc506697 | 26 | @li @ref overview_unicode_pitfalls |
2cd3cc94 | 27 | @li @ref overview_unicode_supportout |
36c9828f | 28 | |
2cd3cc94 | 29 | <hr> |
36c9828f FM |
30 | |
31 | ||
2cd3cc94 BP |
32 | @section overview_unicode_what What is Unicode? |
33 | ||
cc506697 | 34 | Unicode is a standard for character encoding which addresses the shortcomings |
77ef61f5 FM |
35 | of the previous standards (e.g. the ASCII standard), by using 8, 16 or 32 bits |
36 | for encoding each character. | |
37 | This provides enough code points (see below for the definition) to | |
38 | encode all of the world's languages at once. | |
39 | More details about Unicode may be found at http://www.unicode.org/. | |
cc506697 VZ |
40 | |
41 | From a practical point of view, using Unicode is almost a requirement when | |
42 | writing applications for an international audience. Moreover, any application | |
43 | reading files which it didn't produce or receiving data from the network from | |
44 | other services should be ready to deal with Unicode. | |
45 | ||
46 | ||
77ef61f5 FM |
47 | @section overview_unicode_encodings Unicode Representations and Terminology |
48 | ||
49 | When working with Unicode, it's important to define the meaning of some terms. | |
50 | ||
2f365fcb FM |
51 | A <b><em>glyph</em></b> is a particular image (usually part of a font) that |
52 | represents a character or part of a character. | |
77ef61f5 FM |
53 | Any character may have one or more associated glyphs; e.g. some of the possible
54 | glyphs for the capital letter 'A' are: | |
55 | ||
56 | @image html overview_unicode_glyphs.png | |
57 | ||
58 | Unicode assigns each character of almost any existing alphabet/script a number, | |
727aa906 | 59 | which is called a <b><em>code point</em></b>; it's typically written in documentation |
77ef61f5 FM |
60 | and on the Unicode website as @c U+xxxx where @c xxxx is a hexadecimal number.
61 | ||
2f365fcb FM |
62 | Note that typically one character is assigned exactly one code point, but there |
63 | are exceptions; the so-called <em>precomposed characters</em> | |
64 | (see http://en.wikipedia.org/wiki/Precomposed_character) or the <em>ligatures</em>. | |
65 | In these cases a single "character" may be mapped to more than one code point or | |
66 | vice versa, several characters may be mapped to a single code point. | |
67 | ||
68 | The Unicode standard divides the space of all possible code points into <b><em>planes</em></b>; | |
77ef61f5 FM |
69 | a plane is a range of 65,536 (0x10000) contiguous Unicode code points.
70 | Planes are numbered from 0 to 16, where the first one is the @e BMP, or Basic | |
71 | Multilingual Plane. | |
727aa906 FM |
72 | The BMP contains characters for all modern languages, and a large number of |
73 | special characters. The other planes in fact contain mainly historic scripts, | |
74 | special-purpose characters or are unused. | |
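Since every plane spans exactly 0x10000 code points, the plane of a code point can be read directly from its upper bits. The following is a minimal sketch in standard C++, not part of wxWidgets; the names @c PlaneOf and @c IsInBMP are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns the plane (0..16) containing a code point.
// Each plane holds 0x10000 code points, so the plane number is cp / 0x10000.
inline unsigned PlaneOf(std::uint32_t codePoint)
{
    assert(codePoint <= 0x10FFFF); // U+10FFFF is the last code point of plane 16
    return static_cast<unsigned>(codePoint >> 16);
}

// Hypothetical helper: true if the code point lies in the Basic Multilingual Plane.
inline bool IsInBMP(std::uint32_t codePoint)
{
    return PlaneOf(codePoint) == 0;
}
```

For example, U+00E0 lies in plane 0 (the BMP), while U+1F600 lies in plane 1.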
77ef61f5 FM |
75 | |
76 | Code points are represented in computer memory as a sequence of one or more | |
727aa906 | 77 | <b><em>code units</em></b>, where a code unit is a unit of memory: 8, 16, or 32 bits. |
77ef61f5 FM |
78 | More precisely, a code unit is the minimal bit combination that can represent a |
79 | unit of encoded text for processing or interchange. | |
80 | ||
2f365fcb | 81 | The <b><em>UTF</em></b> or Unicode Transformation Formats are algorithms mapping the Unicode |
77ef61f5 | 82 | code points to code unit sequences. The simplest of them is <b>UTF-32</b> where |
727aa906 FM |
83 | each code unit is composed of 32 bits (4 bytes) and each code point is always
84 | represented by a single code unit (fixed length encoding). | |
77ef61f5 FM |
85 | (Note that even UTF-32 is still not completely trivial as the mapping is different |
86 | for little and big-endian architectures). UTF-32 is commonly used under Unix systems for | |
87 | internal representation of Unicode strings. | |
88 | ||
89 | Another very widespread standard is <b>UTF-16</b> which is used by Microsoft Windows: | |
90 | it encodes the first (approximately) 64 thousand Unicode code points | |
91 | (the BMP) using a single 16-bit code unit (2 bytes) each and uses a pair of 16-bit code | |
92 | units to encode the characters beyond this. These pairs are called @e surrogate @e pairs. | |
727aa906 | 93 | Thus UTF-16 uses a variable number of code units to encode each code point. |
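The surrogate mechanism described above can be sketched in a few lines of standard C++. This is an illustration only, not wxWidgets code: @c EncodeUTF16 is a hypothetical name, and the sketch assumes its argument is a valid code point (at most U+10FFFF and not itself in the surrogate range U+D800..U+DFFF).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: encodes one code point as one or two UTF-16 code units.
// Assumes cp is a valid code point outside the surrogate range.
std::vector<std::uint16_t> EncodeUTF16(std::uint32_t cp)
{
    std::vector<std::uint16_t> units;
    if (cp < 0x10000)
    {
        // Code points in the BMP fit in a single 16-bit code unit.
        units.push_back(static_cast<std::uint16_t>(cp));
    }
    else
    {
        // Beyond the BMP: split the remaining 20 bits into two 10-bit halves.
        const std::uint32_t v = cp - 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));   // high surrogate
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return units;
}
```

With this sketch U+00E0 yields the single unit 0x00E0, whereas U+1F600 yields the pair 0xD83D, 0xDE00.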
77ef61f5 FM |
94 | |
95 | Finally, the most widespread encoding used for the external Unicode storage | |
96 | (e.g. files and network protocols) is <b>UTF-8</b> which is byte-oriented and so | |
97 | avoids the endianness ambiguities of UTF-16 and UTF-32. | |
98 | UTF-8 uses code units of 8 bits (1 byte); code points beyond the usual English | |
99 | alphabet are represented using a variable number of bytes, which makes it less | |
100 | efficient than UTF-32 for internal representation. | |
101 | ||
102 | As a visual aid for understanding the differences between the various concepts described | |
103 | so far, look at the different UTF representations of the same code point: | |
104 | ||
105 | @image html overview_unicode_codes.png | |
106 | ||
107 | In this particular case UTF-8 requires more space than UTF-16 (3 bytes instead of 2). | |
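To make the variable-length nature of UTF-8 concrete, here is a minimal sketch of an encoder in standard C++. It is not wxWidgets code; @c EncodeUTF8 is a hypothetical name and the sketch assumes a valid code point (at most U+10FFFF). The code point U+20AC used in the note after the block is an arbitrary choice for illustration and is not necessarily the one shown in the image above.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one code point as 1 to 4 UTF-8 bytes.
// Assumes cp is a valid code point (<= 0x10FFFF).
std::string EncodeUTF8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80)
    {
        // 7-bit ASCII is stored unchanged in a single byte.
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For instance, U+0041 occupies 1 byte, U+00E0 occupies 2 bytes (0xC3 0xA0) and U+20AC occupies 3 bytes (0xE2 0x82 0xAC), although each of them fits in a single 2-byte UTF-16 code unit.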
108 | ||
109 | Note that from the C/C++ programmer's perspective the situation is further complicated | |
110 | by the fact that the standard type @c wchar_t which is usually used to represent the | |
cc506697 VZ |
111 | Unicode ("wide") strings in C/C++ doesn't have the same size on all platforms. |
112 | It is 4 bytes under Unix systems, corresponding to the tradition of using | |
113 | UTF-32, but only 2 bytes under Windows which is required by compatibility with | |
114 | the OS which uses UTF-16. | |
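The platform dependence of @c wchar_t can be checked at compile time. The sketch below uses only standard C++; the helper names @c WideCharBits and @c WideCharHoldsAnyCodePoint are hypothetical, and the mapping from size to encoding merely restates the convention described above rather than querying the system.

```cpp
#include <climits>
#include <cstddef>

// Hypothetical helper: number of bits in one wchar_t code unit on this platform.
constexpr std::size_t WideCharBits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical helper: true if a single wchar_t can hold any code point,
// i.e. wide strings can be treated as UTF-32 (typical under Unix); false
// where wchar_t is only 16 bits wide and UTF-16 is used (typical under Windows).
constexpr bool WideCharHoldsAnyCodePoint()
{
    return WideCharBits() >= 21; // U+10FFFF needs 21 bits
}
```

Code that needs a fixed-width code unit regardless of the platform can use the standard types @c char16_t and @c char32_t instead.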
2cd3cc94 | 115 | |
77ef61f5 FM |
116 | Typically, when UTF-8 is used, code units are stored in @c char variables, since
117 | @c char is 8 bits wide on almost all systems; when UTF-16 is used, code | |
118 | units are typically stored in @c wchar_t variables since @c wchar_t is at least 16 bits wide on | |
119 | all systems. This is also the approach used by wxString. | |
727aa906 | 120 | See @ref overview_string for more info. |
77ef61f5 FM |
121 | |
122 | See also http://unicode.org/glossary/ for the official definitions of the | |
123 | terms reported above. | |
124 | ||
2cd3cc94 | 125 | |
cc506697 | 126 | @section overview_unicode_supportin Unicode Support in wxWidgets |
2cd3cc94 | 127 | |
bf0f2c4b VZ |
128 | @subsection overview_unicode_support_default Unicode is Always Used by Default |
129 | ||
130 | Since wxWidgets 3.0 Unicode support is always enabled and while building the | |
131 | library without it is still possible, it is not recommended any longer and will | |
132 | cease to be supported in the near future. This means that internally only | |
133 | Unicode strings are used and that, under Microsoft Windows, Unicode system API | |
134 | is used which means that wxWidgets programs require the Microsoft Layer for | |
135 | Unicode to run on Windows 95/98/ME. | |
cc506697 | 136 | |
77ef61f5 FM |
137 | However, unlike the Unicode build mode of the previous versions of wxWidgets, this |
138 | support is mostly transparent: you can still continue to work with the @b narrow | |
727aa906 | 139 | (i.e. current locale-encoded @c char*) strings even if @b wide |
2f365fcb | 140 | (i.e. UTF-16-encoded @c wchar_t* or UTF-8-encoded @c char*) strings are also |
cc506697 VZ |
141 | supported. Any wxWidgets function accepts arguments of either type as both |
142 | kinds of strings are implicitly converted to wxString, so both | |
143 | @code | |
144 | wxMessageBox("Hello, world!"); | |
145 | @endcode | |
77ef61f5 | 146 | and the somewhat less usual |
cc506697 | 147 | @code |
727aa906 | 148 | wxMessageBox(L"Salut \u00E0 toi!"); // U+00E0 is "Latin Small Letter a with Grave" |
cc506697 VZ |
149 | @endcode |
150 | work as expected. | |
2cd3cc94 | 151 | |
cc506697 VZ |
152 | Notice that the narrow strings used with wxWidgets are @e always assumed to be |
153 | in the current locale encoding, so writing | |
154 | @code | |
155 | wxMessageBox("Salut à toi!"); | |
156 | @endcode | |
157 | wouldn't work if the encoding used on the user system is incompatible with | |
f99af6c0 VS |
158 | ISO-8859-1 (or even if the sources were compiled under a different locale
159 | in the case of gcc). In particular, the most common encoding used under | |
160 | modern Unix systems is UTF-8 and as the string above is not a valid UTF-8 byte | |
161 | sequence, nothing would be displayed at all in this case. Thus it is important | |
77ef61f5 | 162 | to <b>never use 8-bit (instead of 7-bit) characters directly in the program source</b> |
727aa906 | 163 | but use wide strings or, alternatively, write: |
cc506697 | 164 | @code |
727aa906 FM |
165 | wxMessageBox(wxString::FromUTF8("Salut \xC3\xA0 toi!")); |
166 | // in UTF-8 the character U+00E0 is encoded as the two bytes 0xC3 0xA0 | |
cc506697 | 167 | @endcode |
2cd3cc94 | 168 | |
77ef61f5 FM |
169 | In a similar way, wxString provides access to its contents as either @c wchar_t or |
170 | @c char character buffer. Of course, the latter only works if the string contains | |
cc506697 VZ |
171 | data representable in the current locale encoding. This will always be the case |
172 | if the string had been initially constructed from a narrow string or if it | |
173 | contains only 7-bit ASCII data but otherwise this conversion is not guaranteed | |
91fa0da4 FM |
174 | to succeed. And as with wxString::FromUTF8() example above, you can always use |
175 | wxString::ToUTF8() to retrieve the string contents in UTF-8 encoding -- this, | |
176 | unlike converting to @c char* using the current locale, never fails. | |
cc506697 | 177 | |
77ef61f5 FM |
178 | For more info about how wxString works, please see the @ref overview_string. |
179 | ||
180 | To summarize, Unicode support in wxWidgets is mostly @b transparent for the | |
cc506697 VZ |
181 | application and if you use wxString objects for storing all the character data |
182 | in your program there is really nothing special to do. However you should be | |
183 | aware of the potential problems covered by the following section. | |
184 | ||
185 | ||
bf0f2c4b VZ |
186 | @subsection overview_unicode_support_utf Choosing Unicode Representation |
187 | ||
188 | wxWidgets uses the system @c wchar_t in wxString implementation by default | |
189 | under all systems. Thus, under Microsoft Windows, UCS-2 (simplified version of | |
190 | UTF-16 without support for surrogate characters) is used as @c wchar_t is 2 | |
191 | bytes on this platform. Under Unix systems, including Mac OS X, UCS-4 (also | |
192 | known as UTF-32) is used by default, however it is also possible to build | |
193 | wxWidgets to use UTF-8 internally by passing @c --enable-utf8 option to | |
194 | configure. | |
195 | ||
196 | The interface provided by wxString is the same independently of the format used | |
197 | internally. However different formats have specific advantages and | |
198 | disadvantages. Notably, under Unix, the underlying graphical toolkit (e.g. | |
199 | GTK+) usually uses UTF-8 encoded strings and using the same representations for | |
200 | the strings in wxWidgets avoids converting from UTF-32 to UTF-8 and | |
201 | vice versa each time a string is shown in the UI or retrieved from it. The | |
202 | overhead of such conversions is usually negligible for small strings but may be | |
203 | important for some programs. If you believe that it would be advantageous to | |
204 | use UTF-8 for the strings in your particular application, you may rebuild | |
205 | wxWidgets to use UTF-8 as explained above (notice that this is currently not | |
206 | supported under Microsoft Windows and arguably doesn't make much sense there as | |
207 | Windows itself uses UTF-16 and not UTF-8) but be sure to be aware of the | |
208 | performance implications (see @ref overview_unicode_performance) of using UTF-8 | |
209 | in wxString before doing this! | |
210 | ||
211 | Generally speaking you should only use non-default UTF-8 build in specific | |
212 | circumstances e.g. building for resource-constrained systems where the overhead | |
213 | of conversions (and also reduced memory usage of UTF-8 compared to UTF-32 for | |
214 | the European languages) can be important. If the environment in which your | |
215 | program is running is under your control -- as is quite often the case in such | |
216 | scenarios -- consider ensuring that the system always uses UTF-8 locale and | |
217 | use @c --enable-utf8only configure option to disable support for the other | |
218 | locales and consider all strings to be in UTF-8. This further reduces the code | |
219 | size and removes the need for conversions in more cases. | |
220 | ||
221 | ||
222 | @subsection overview_unicode_settings Unicode Related Preprocessor Symbols | |
223 | ||
224 | @c wxUSE_UNICODE is defined as 1 now to indicate Unicode support. It can be | |
225 | explicitly set to 0 in @c setup.h under MSW or you can use @c --disable-unicode | |
226 | under Unix but doing this is strongly discouraged. By default, @c | |
227 | wxUSE_UNICODE_WCHAR is also defined as 1, however in UTF-8 build (described in | |
228 | the previous section), it is set to 0 and @c wxUSE_UNICODE_UTF8, which is | |
229 | usually 0, is set to 1 instead. In the latter case, @c wxUSE_UTF8_LOCALE_ONLY | |
230 | can also be set to 1 to indicate that all strings are considered to be in UTF-8. | |
231 | ||
232 | ||
233 | ||
cc506697 VZ |
234 | @section overview_unicode_pitfalls Potential Unicode Pitfalls |
235 | ||
236 | The problems can be separated into three broad classes: | |
237 | ||
238 | @subsection overview_unicode_compilation_errors Unicode-Related Compilation Errors | |
239 | ||
77ef61f5 FM |
240 | Because of the need to support implicit conversions to both @c char and |
241 | @c wchar_t, wxString implementation is rather involved and many of its operators | |
242 | don't return the types which they could be naively expected to return. | |
243 | For example, the @c operator[] returns neither a @c char nor a @c wchar_t | |
cc506697 VZ |
244 | but an object of a helper class wxUniChar or wxUniCharRef which is implicitly |
245 | convertible to either. Usually you don't need to worry about this as the | |
246 | conversions do their work behind the scenes; however, in some cases this doesn't | |
247 | work. Here are some examples, using a wxString object @c s and some integer @c | |
248 | n: | |
249 | ||
250 | - Writing @code switch ( s[n] ) @endcode doesn't work because the argument of | |
d9384bfb | 251 | the switch statement must be an integer expression so you need to replace |
cc506697 | 252 | @c s[n] with @code s[n].GetValue() @endcode. You may also force the |
77ef61f5 | 253 | conversion to @c char or @c wchar_t by using an explicit cast but beware that |
cc506697 VZ |
254 | converting the value to char uses the conversion to current locale and may |
255 | return 0 if it fails. Finally notice that writing @code (wxChar)s[n] @endcode | |
256 | works both with wxWidgets 3.0 and previous library versions and so should be | |
257 | used for writing code which should be compatible with both 2.8 and 3.0. | |
258 | ||
259 | - Similarly, @code &s[n] @endcode doesn't yield a pointer to char so you may | |
260 | not pass it to functions expecting @c char* or @c wchar_t*. Consider using | |
261 | string iterators instead if possible or replace this expression with | |
262 | @code s.c_str() + n @endcode otherwise. | |
263 | ||
91fa0da4 FM |
264 | Another class of problems is related to the fact that the value returned by |
265 | @c c_str() itself is also not just a pointer to a buffer but a value of helper | |
cc506697 VZ |
266 | class wxCStrData which is implicitly convertible to both narrow and wide |
267 | strings. Again, this mostly will be unnoticeable but can result in some | |
268 | problems: | |
269 | ||
270 | - You shouldn't pass @c c_str() result to vararg functions such as standard | |
271 | @c printf(). Some compilers (notably g++) warn about this but even if they | |
272 | don't, this @code printf("Hello, %s", s.c_str()) @endcode is not going to | |
273 | work. It can be corrected in one of the following ways: | |
274 | ||
275 | - Preferred: @code wxPrintf("Hello, %s", s) @endcode (notice the absence | |
276 | of @c c_str(), it is not needed at all with wxWidgets functions) | |
277 | - Compatible with wxWidgets 2.8: @code wxPrintf("Hello, %s", s.c_str()) @endcode | |
278 | - Using an explicit conversion to narrow, multibyte, string: | |
f99af6c0 | 279 | @code printf("Hello, %s", (const char *)s.mb_str()) @endcode |
cc506697 VZ |
280 | - Using a cast to force the issue (listed only for completeness): |
281 | @code printf("Hello, %s", (const char *)s.c_str()) @endcode | |
282 | ||
4c51a665 | 283 | - The result of @c c_str() cannot be cast to @c char* but only to @c const |
cc506697 VZ |
284 | @c char*. Of course, modifying the string via the pointer returned by this
285 | method has never been possible but unfortunately it was occasionally useful | |
286 | to use a @c const_cast here to pass the value to const-incorrect functions. | |
287 | This can be done either using new wxString::char_str() (and matching | |
288 | wchar_str()) method or by writing a double cast: | |
289 | @code (char *)(const char *)s.c_str() @endcode | |
290 | ||
291 | - One of the unfortunate consequences of the possibility to pass wxString to | |
292 | @c wxPrintf() without using @c c_str() is that it is now impossible to pass | |
293 | the elements of unnamed enumerations to @c wxPrintf() and other similar | |
294 | vararg functions, i.e. | |
295 | @code | |
296 | enum { Red, Green, Blue }; | |
297 | wxPrintf("Red is %d", Red); | |
298 | @endcode | |
299 | doesn't compile. The easiest workaround is to give a name to the enum. | |
300 | ||
301 | Other unexpected compilation errors may arise but they should happen even more | |
302 | rarely than the above-mentioned ones and the solution should usually be quite | |
303 | simple: just use the explicit methods of wxUniChar and wxCStrData classes | |
304 | instead of relying on their implicit conversions if the compiler can't choose | |
305 | among them. | |
306 | ||
307 | ||
308 | @subsection overview_unicode_data_loss Data Loss due To Unicode Conversion Errors | |
309 | ||
310 | wxString API provides implicit conversion of the internal Unicode string | |
311 | contents to narrow, char strings. This can be very convenient and is absolutely | |
312 | necessary for backwards compatibility with the existing code using wxWidgets | |
313 | however it is a rather dangerous operation as it can easily give unexpected | |
314 | results if the string contents aren't representable in the current locale encoding. | |
315 | ||
316 | To be precise, the conversion will always succeed if the string was created | |
317 | from a narrow string initially. It will also succeed if the current encoding is | |
318 | UTF-8 as all Unicode strings are representable in this encoding. However | |
91fa0da4 FM |
319 | initializing the string using wxString::FromUTF8() method and then accessing it |
320 | as a char string via its wxString::c_str() method is a recipe for disaster as the | |
321 | program may work perfectly well during testing on Unix systems using UTF-8 locale | |
322 | but completely fail under Windows where UTF-8 locales are never used because | |
323 | wxString::c_str() would return an empty string. | |
cc506697 VZ |
324 | |
325 | The simplest way to ensure that this doesn't happen is to avoid conversions to | |
326 | @c char* completely by using wxString throughout your program. However if the | |
327 | program never manipulates 8 bit strings internally, using @c char* pointers is | |
328 | safe as well. So the existing code needs to be reviewed when upgrading to | |
329 | wxWidgets 3.0 and new code should be written with this in mind, ideally | |
330 | avoiding implicit conversions to @c char*. | |
331 | ||
332 | ||
bf0f2c4b | 333 | @subsection overview_unicode_performance Performance Implications of Using UTF-8 |
cc506697 | 334 | |
bf0f2c4b VZ |
335 | As mentioned above, under Unix systems wxString class can use variable-width |
336 | UTF-8 encoding for internal representation. In this case it can't guarantee | |
337 | constant-time access to N-th element of the string any longer as to find the | |
338 | position of this character in the string we have to examine all the preceding | |
339 | ones. Usually this doesn't matter much because most algorithms used on the | |
340 | strings examine them sequentially anyhow and because wxString implements a | |
341 | cache for iterating over the string by index but it can have serious | |
342 | consequences for algorithms using random access to string elements as they | |
343 | typically acquire O(N^2) time complexity instead of O(N) where N is the length | |
344 | of the string. | |
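The reason for the linear cost can be seen in a minimal sketch of how the N-th character is located in a UTF-8 buffer. This is an illustration in standard C++ under the assumption of well-formed UTF-8 input; it is not how wxString is implemented, and @c OffsetOfCodePoint is a hypothetical name.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte offset of the n-th code point (0-based)
// in a well-formed UTF-8 string, or std::string::npos if there is none.
// Every byte before the result has to be inspected, hence the O(N) cost.
std::size_t OffsetOfCodePoint(const std::string& utf8, std::size_t n)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i)
    {
        // Continuation bytes have the form 10xxxxxx and never start a code point.
        const bool isLead = (static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80;
        if (isLead)
        {
            if (seen == n)
                return i;
            ++seen;
        }
    }
    return std::string::npos;
}
```

Calling such a function once per index inside a loop over the whole string is what produces the quadratic behaviour mentioned above; an iterator only has to step over the bytes of the current character.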
cc506697 | 345 | |
a6919a6a | 346 | Even with the index cache, indexed access should be replaced with |
cc506697 | 347 | sequential access using string iterators. For example a typical loop: |
7b74e828 | 348 | @code |
cc506697 VZ |
349 | wxString s("hello"); |
350 | for ( size_t i = 0; i < s.length(); i++ ) | |
7b74e828 | 351 | { |
cc506697 | 352 | wchar_t ch = s[i]; |
91fa0da4 | 353 | |
7b74e828 RR |
354 | // do something with it |
355 | } | |
356 | @endcode | |
cc506697 | 357 | should be rewritten as |
2cd3cc94 | 358 | @code |
cc506697 VZ |
359 | wxString s("hello"); |
360 | for ( wxString::const_iterator i = s.begin(); i != s.end(); ++i ) | |
7b74e828 | 361 | { |
cc506697 | 362 | wchar_t ch = *i; |
91fa0da4 | 363 | |
7b74e828 RR |
364 | // do something with it |
365 | } | |
2cd3cc94 BP |
366 | @endcode |
367 | ||
cc506697 | 368 | Another, similar, alternative is to use pointer arithmetic: |
7b74e828 | 369 | @code |
cc506697 VZ |
370 | wxString s("hello"); |
371 | for ( const wchar_t *p = s.wc_str(); *p; p++ ) | |
7b74e828 | 372 | { |
cc506697 VZ |
373 | wchar_t ch = *p; | |
374 | ||
375 | // do something with it | |
7b74e828 RR |
376 | } |
377 | @endcode | |
cc506697 VZ |
378 | however this doesn't work correctly for strings with embedded @c NUL characters |
379 | and the use of iterators is generally preferred as they provide some run-time | |
380 | checks (at least in debug build) unlike the raw pointers. But if you do use | |
77ef61f5 | 381 | them, it is better to use @c wchar_t pointers rather than @c char ones to avoid the |
cc506697 | 382 | data loss problems due to conversion as discussed in the previous section. |
2cd3cc94 | 383 | |
2cd3cc94 | 384 | |
cc506697 | 385 | @section overview_unicode_supportout Unicode and the Outside World |
2cd3cc94 | 386 | |
cc506697 VZ |
387 | Even though wxWidgets always uses Unicode internally, not all the other |
388 | libraries and programs do and even those that do use Unicode may use a | |
389 | different encoding of it. So you need to be able to convert the data to various | |
91fa0da4 FM |
390 | representations and the wxString methods wxString::ToAscii(), wxString::ToUTF8() |
391 | (or its synonym wxString::utf8_str()), wxString::mb_str(), wxString::c_str() and | |
392 | wxString::wc_str() can be used for this. | |
727aa906 | 393 | |
91fa0da4 FM |
394 | The first of them should only be used for strings containing 7-bit ASCII characters
395 | only; anything else will be replaced by some substitution character. | |
396 | wxString::mb_str() converts the string to the encoding used by the current locale | |
397 | and so can return an empty string if the string contains characters not representable in | |
398 | it as explained in @ref overview_unicode_data_loss. The same applies to wxString::c_str() | |
399 | if its result is used as a narrow string. Finally, wxString::ToUTF8() and wxString::wc_str() | |
cc506697 | 400 | functions never fail: they always return, respectively, a @c char string containing the |
77ef61f5 | 401 | UTF-8 representation of the string and a @c wchar_t string. |
cc506697 | 402 | |
91fa0da4 FM |
403 | wxString also provides two convenience functions: wxString::From8BitData() and |
404 | wxString::To8BitData(). They can be used to create a wxString from arbitrary binary | |
405 | data without supposing that it is in current locale encoding, and then get it back, | |
cc506697 | 406 | again, without any conversion or, rather, undoing the conversion used by |
91fa0da4 FM |
407 | wxString::From8BitData(). Because of this you should only use wxString::To8BitData()
408 | for the strings created using wxString::From8BitData(). Also notice that in spite | |
409 | of the availability of these functions, wxString is not the ideal class for storing | |
cc506697 VZ |
410 | arbitrary binary data as it can take up to 4 times more space than needed
411 | (when using @c wchar_t internal representation on the systems where size of | |
412 | wide characters is 4 bytes) and you should consider using wxMemoryBuffer | |
413 | instead. | |
414 | ||
415 | A final word of caution: most of these functions may return either directly the | |
416 | pointer to internal string buffer or a temporary wxCharBuffer or wxWCharBuffer | |
77ef61f5 | 417 | object. Such objects are implicitly convertible to @c char and @c wchar_t pointers, |
91fa0da4 | 418 | respectively, and so the result of, for example, wxString::ToUTF8() can always be |
77ef61f5 | 419 | passed directly to a function taking <tt>const char*</tt>. However code such as |
7b74e828 | 420 | @code |
cc506697 VZ |
421 | const char *p = s.ToUTF8(); |
422 | ... | |
423 | puts(p); // or call any other function taking const char * | |
7b74e828 | 424 | @endcode |
91fa0da4 | 425 | does @b not work because the temporary buffer returned by wxString::ToUTF8() is |
197380a0 | 426 | destroyed and @c p is left pointing nowhere. To correct this you should use |
7b74e828 | 427 | @code |
197380a0 | 428 | const wxScopedCharBuffer p(s.ToUTF8()); |
cc506697 | 429 | puts(p); |
7b74e828 | 430 | @endcode |
197380a0 VS |
431 | which does work. |
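The same lifetime rule applies to any function returning an owning temporary, so it can be illustrated without wxWidgets. In the sketch below, written in standard C++, @c std::string merely stands in for the temporary buffer object and the names @c ToNarrowCopy and @c SafeLength are hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a conversion function which returns a temporary
// object owning the converted characters.
std::string ToNarrowCopy(const std::string& s)
{
    return s;
}

// Hypothetical helper showing the safe pattern.
std::size_t SafeLength(const std::string& s)
{
    // WRONG: the temporary would be destroyed at the end of the statement,
    // leaving the pointer dangling:
    //   const char *p = ToNarrowCopy(s).c_str();

    // RIGHT: a named object keeps the buffer alive for as long as p is used.
    const std::string owner(ToNarrowCopy(s));
    const char *p = owner.c_str();
    return std::strlen(p);
}
```

Passing the temporary directly as a function argument, e.g. <tt>std::strlen(ToNarrowCopy(s).c_str())</tt>, is also fine because the temporary lives until the end of the full expression.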
432 | ||
91fa0da4 | 433 | Similarly, wxWX2WCbuf can be used for the return type of wxString::wc_str(). |
cc506697 VZ |
434 | But, once again, none of these cryptic types is really needed if you just pass |
435 | the return value of any of the functions mentioned in this section to another | |
436 | function directly. | |
2cd3cc94 | 437 | |
2cd3cc94 BP |
438 | */ |
439 |