Commit | Line | Data |
---|---|---|
15b6757b | 1 | ///////////////////////////////////////////////////////////////////////////// |
f05d2fde | 2 | // Name: string.h |
15b6757b FM |
3 | // Purpose: topic overview |
4 | // Author: wxWidgets team | |
5 | // RCS-ID: $Id$ | |
526954c5 | 6 | // Licence: wxWindows licence |
15b6757b FM |
7 | ///////////////////////////////////////////////////////////////////////////// |
8 | ||
880efa2a | 9 | /** |
36c9828f | 10 | |
f05d2fde BP |
11 | @page overview_string wxString Overview |
12 | ||
831e1028 | 13 | @tableofcontents |
f05d2fde | 14 | |
727aa906 | 15 | wxString is a class which represents a Unicode string of arbitrary length and |
2f365fcb | 16 | containing arbitrary Unicode characters. |
f05d2fde BP |
17 | |
18 | This class has all the standard operations you can expect to find in a string | |
19 | class: dynamic memory management (string extends to accommodate new | |
2f365fcb FM |
20 | characters), construction from other strings, compatibility with C strings and |
21 | wide character C strings, assignment operators, access to individual characters, string | |
727aa906 FM |
22 | concatenation and comparison, substring extraction, case conversion, trimming and |
23 | padding (with spaces), searching and replacing and both C-like @c printf (wxString::Printf) | |
f05d2fde BP |
24 | and stream-like insertion functions as well as much more - see wxString for a |
25 | list of all functions. | |
26 | ||
727aa906 FM |
27 | The wxString class has been completely rewritten for wxWidgets 3.0 but much work |
28 | has been done to make existing code using ANSI string literals work as it did | |
29 | in previous versions. | |
30 | ||
31 | ||
831e1028 | 32 | @section overview_string_internal Internal wxString Encoding |
727aa906 | 33 | |
2f365fcb | 34 | Since wxWidgets 3.0 wxString internally uses <b>UTF-16</b> (with Unicode |
727aa906 FM |
35 | code units stored in @c wchar_t) under Windows and <b>UTF-8</b> (with Unicode |
36 | code units stored in @c char) under Unix, Linux and Mac OS X to store its content. | |
37 | ||
38 | For definitions of the terms <em>code units</em> and <em>code points</em>, please | |
39 | see the @ref overview_unicode_encodings paragraph. | |
40 | ||
727aa906 | 41 | For simplicity of implementation, wxString when <tt>wxUSE_UNICODE_WCHAR==1</tt> |
2f365fcb FM |
42 | (e.g. on Windows) uses <em>per code unit indexing</em> instead of |
43 | <em>per code point indexing</em> and doesn't know anything about surrogate pairs; | |
c6d93dd7 | 44 | in other words it always considers code points to be composed of 1 code unit, |
2f365fcb | 45 | while this is really true only for characters in the @e BMP (Basic Multilingual Plane). |
727aa906 | 46 | Thus when iterating over a UTF-16 string stored in a wxString under Windows, the user |
2f365fcb | 47 | code has to take care of <em>surrogate pairs</em> itself. |
727aa906 FM |
48 | (Note however that Windows itself has built-in support for surrogate pairs in UTF-16, |
49 | such as for drawing strings on screen.) | |
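As a minimal sketch of what taking care of surrogate pairs involves, the following standalone function counts code points in a UTF-16 sequence. It uses @c std::u16string rather than wxString so that it compiles without wxWidgets, and the name @c CountCodePointsUtf16 is hypothetical (it is not part of the wxWidgets API).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-16 string,
// treating a high surrogate followed by a low surrogate as one code point.
std::size_t CountCodePointsUtf16(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const char16_t unit = s[i];
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        const bool nextIsLow = i + 1 < s.size()
                               && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (isHigh && nextIsLow)
            ++i; // skip the low surrogate, it belongs to the same code point
        ++count;
    }
    return count;
}
```

For U+10300 (stored as the two code units 0xD800 0xDF00) this returns 1, whereas the number of code units is 2.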
50 | ||
2f365fcb FM |
51 | @remarks |
52 | Note that while the behaviour of wxString when <tt>wxUSE_UNICODE_WCHAR==1</tt> | |
53 | resembles UCS-2 encoding, it's not completely correct to refer to wxString as | |
c6d93dd7 FM |
54 | UCS-2 encoded since you can encode code points outside the @e BMP in a wxString |
55 | as two code units (i.e. as a surrogate pair; as already mentioned however wxString | |
56 | will "see" them as two different code points). | |
2f365fcb | 57 | |
727aa906 | 58 | When instead <tt>wxUSE_UNICODE_UTF8==1</tt> (e.g. on Linux and Mac OS X) |
2f365fcb FM |
59 | wxString handles UTF8 multi-byte sequences just fine, even for characters outside | |
60 | the BMP (it implements <em>per code point indexing</em>), so that you can use | |
727aa906 FM |
61 | UTF8 in a completely transparent way: |
62 | ||
63 | Example: | |
64 | @code | |
65 | // first test, using exotic characters outside of the Unicode BMP: | |
66 | ||
67 | wxString test = wxString::FromUTF8("\xF0\x90\x8C\x80"); | |
68 | // U+10300 is "OLD ITALIC LETTER A" and is part of Unicode Plane 1 | |
69 | // in UTF8 it's encoded as 0xF0 0x90 0x8C 0x80 | |
70 | ||
71 | // it's a single Unicode code-point encoded as: | |
72 | // - a UTF16 surrogate pair under Windows | |
73 | // - a UTF8 multi-byte sequence under Linux | |
74 | // (without considering the final NUL) | |
75 | ||
76 | wxPrintf("wxString reports a length of %d character(s)", test.length()); | |
77 | // prints "wxString reports a length of 1 character(s)" on Linux | |
78 | // prints "wxString reports a length of 2 character(s)" on Windows | |
2f365fcb | 79 | // since wxString on Windows doesn't have surrogate pairs support! |
727aa906 FM |
80 | |
81 | ||
82 | // second test, this time using characters part of the Unicode BMP: | |
83 | ||
84 | wxString test2 = wxString::FromUTF8("\x41\xC3\xA0\xE2\x82\xAC"); | |
85 | // this is the UTF8 encoding of capital letter A followed by | |
86 | // 'small letter a with grave' followed by the 'euro sign' | |
87 | ||
88 | // they are 3 Unicode code-points encoded as: | |
89 | // - 3 UTF16 code units under Windows | |
90 | // - 6 UTF8 code units under Linux | |
91 | // (without considering the final NUL) | |
92 | ||
93 | wxPrintf("wxString reports a length of %d character(s)", test2.length()); | |
94 | // prints "wxString reports a length of 3 character(s)" on Linux | |
95 | // prints "wxString reports a length of 3 character(s)" on Windows | |
96 | @endcode | |
97 | ||
98 | To better explain what is stated above, consider the second string of the example | |
99 | above; it's composed of 3 characters and the final @c NUL: | |
100 | ||
101 | @image html overview_wxstring_encoding.png | |
102 | ||
2f365fcb FM |
103 | As you can see, UTF16 encoding is straightforward (for characters in the @e BMP) |
104 | and in this example the UTF16-encoded wxString takes 8 bytes. | |
727aa906 FM |
105 | UTF8 encoding is more elaborate and in this example takes 7 bytes. |
106 | ||
727aa906 | 107 | In general, for strings containing many Latin characters UTF8 provides a big |
2f365fcb FM |
108 | advantage over UTF16 in memory footprint, but requires some |
109 | more processing for common operations such as length calculation. | |
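The extra processing can be illustrated with a small standalone sketch. It assumes well-formed UTF-8 input held in a @c std::string; the function names are hypothetical and this is not how wxString itself is implemented. Counting code points requires a pass over all bytes, while the number of UTF-16 code units follows from the lead byte of each sequence.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in well-formed UTF-8 by
// skipping continuation bytes (those of the form 10xxxxxx).
std::size_t CountCodePointsUtf8(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8)
    {
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// Hypothetical helper: number of UTF-16 code units needed for the same text.
// A lead byte >= 0xF0 starts a 4-byte sequence, i.e. a code point outside
// the BMP, which needs a surrogate pair (2 code units).
std::size_t CountUtf16CodeUnits(const std::string& utf8)
{
    std::size_t units = 0;
    for (unsigned char byte : utf8)
    {
        if ((byte & 0xC0) == 0x80)
            continue; // continuation byte, already accounted for
        units += (byte >= 0xF0) ? 2 : 1;
    }
    return units;
}
```

For the second example string, the UTF-8 form has 6 code units (7 bytes with the terminating NUL) and the UTF-16 form has 3 code units (8 bytes with the terminating NUL), matching the figures given above.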
110 | ||
111 | Finally, note that the type used by wxString to store Unicode code units | |
112 | (@c wchar_t or @c char) is always @c typedef-ined to be ::wxStringCharType. | |
727aa906 FM |
113 | |
114 | ||
2f365fcb FM |
115 | @section overview_string_binary Using wxString to store binary data |
116 | ||
117 | wxString can be used to store binary data (even if it contains @c NULs) using the | |
118 | functions wxString::To8BitData and wxString::From8BitData. | |
119 | ||
120 | Beware that even though @c NUL characters are allowed, in the current string implementation | |
121 | some methods might not work correctly with them. | |
122 | ||
123 | Note however that other classes like wxMemoryBuffer are more suited to this task. | |
124 | For handling binary data you may also want to look at the wxStreamBuffer, | |
125 | wxMemoryOutputStream, wxMemoryInputStream classes. | |
126 | ||
f05d2fde BP |
127 | |
128 | @section overview_string_comparison Comparison to Other String Classes | |
129 | ||
130 | The advantages of using a special string class instead of working directly with | |
131 | C strings are so obvious that there is a huge number of such classes available. | |
132 | The most important advantage is that you no longer need to remember to allocate/free | |
133 | memory for C strings; working with fixed size buffers almost inevitably leads | |
727aa906 | 134 | to buffer overflows. At last, C++ has a standard string class (@c std::string). So |
f05d2fde BP |
135 | why the need for wxString? There are several advantages: |
136 | ||
727aa906 FM |
137 | @li <b>Efficiency:</b> Since wxWidgets 3.0 wxString uses @c std::string (in UTF8 |
138 | mode under Linux, Unix and OS X) or @c std::wstring (in UTF16 mode under Windows) | |
139 | internally by default to store its contents. wxString will therefore inherit the | |
140 | performance characteristics from @c std::string. | |
f05d2fde | 141 | @li <b>Compatibility:</b> This class tries to combine almost full compatibility |
727aa906 FM |
142 | with the old wxWidgets 1.xx wxString class, some reminiscence of MFC's |
143 | CString class and 90% of the functionality of @c std::string class. | |
144 | @li <b>Rich set of functions:</b> Some of the functions present in wxString are | |
145 | very useful but don't exist in most other string classes: for example, | |
146 | wxString::AfterFirst, wxString::BeforeLast, wxString::Printf. | |
147 | Of course, all the standard string operations are supported as well. | |
148 | @li <b>wxString is Unicode friendly:</b> it allows easy conversion to | |
149 | and from ANSI and Unicode strings (see @ref overview_unicode | |
150 | for more details) and maps to @c std::wstring transparently. | |
f05d2fde BP |
151 | @li <b>Used by wxWidgets:</b> And, of course, this class is used everywhere |
152 | inside wxWidgets so there is no performance loss which would result from | |
727aa906 | 153 | conversions of objects of any other string class (including @c std::string) to |
f05d2fde BP |
154 | wxString internally by wxWidgets. |
155 | ||
156 | However, there are several problems as well. The most important one is probably | |
157 | that there are often several functions to do exactly the same thing: for | |
47e1c61b | 158 | example, to get the length of the string either one of wxString::length(), |
f05d2fde | 159 | wxString::Len() or wxString::Length() may be used. The first function, as |
727aa906 | 160 | almost all the other functions in lowercase, is @c std::string compatible. The |
f05d2fde BP |
161 | second one is the "native" wxString version and the last one is the wxWidgets |
162 | 1.xx way. | |
163 | ||
727aa906 | 164 | So which is better to use? The usage of the @c std::string compatible functions is |
f05d2fde | 165 | strongly advised! It will make your code more familiar to other C++ |
727aa906 | 166 | programmers (who are supposed to have knowledge of @c std::string but not of |
f05d2fde | 167 | wxString), let you reuse the same code in both wxWidgets and other programs (by |
727aa906 | 168 | just typedefing wxString as @c std::string when used outside wxWidgets) and |
f05d2fde | 169 | keep it compatible with future versions of wxWidgets which will probably start |
727aa906 | 170 | using @c std::string sooner or later too. |
f05d2fde | 171 | |
727aa906 | 172 | In the situations where there is no corresponding @c std::string function, please |
f05d2fde BP |
173 | try to use the new wxString methods and not the old wxWidgets 1.xx variants |
174 | which are deprecated and may disappear in future versions. | |
175 | ||
176 | ||
177 | @section overview_string_advice Advice About Using wxString | |
178 | ||
727aa906 FM |
179 | @subsection overview_string_implicitconv Implicit conversions |
180 | ||
f05d2fde BP |
181 | Probably the main trap with using this class is the implicit conversion |
182 | operator to <tt>const char*</tt>. It is advised that you use wxString::c_str() | |
183 | instead to clearly indicate when the conversion is done. Specifically, the | |
184 | danger of this implicit conversion may be seen in the following code fragment: | |
185 | ||
186 | @code | |
187 | // this function converts the input string to uppercase, | |
188 | // outputs it to the screen and returns the result | |
189 | const char *SayHELLO(const wxString& input) | |
190 | { | |
191 | wxString output = input.Upper(); | |
192 | printf("Hello, %s!\n", output); | |
193 | return output; | |
194 | } | |
195 | @endcode | |
196 | ||
197 | There are two nasty bugs in these three lines. The first is in the call to the | |
198 | @c printf() function. Although the implicit conversion to C strings is applied | |
199 | automatically by the compiler in the case of | |
200 | ||
201 | @code | |
202 | puts(output); | |
203 | @endcode | |
204 | ||
205 | because the argument of @c puts() is known to be of the type | |
206 | <tt>const char*</tt>, this is @b not done for @c printf() which is a function | |
207 | with a variable number of arguments (and whose arguments are of unknown types). | |
208 | So this call may do any number of things (including displaying the correct | |
727aa906 FM |
209 | string on screen), although the most likely result is a program crash. |
210 | The solution is to use wxString::c_str(). Just replace this line with this: | |
f05d2fde BP |
211 | |
212 | @code | |
213 | printf("Hello, %s!\n", output.c_str()); | |
214 | @endcode | |
215 | ||
216 | The second bug is that returning @c output doesn't work. The implicit cast is | |
217 | used again, so the code compiles, but as it returns a pointer to a buffer | |
218 | belonging to a local variable which is deleted as soon as the function exits, | |
219 | its contents are completely arbitrary. The solution to this problem is also | |
220 | easy, just make the function return wxString instead of a C string. | |
221 | ||
222 | This leads us to the following general advice: all functions taking string | |
727aa906 | 223 | arguments should take <tt>const wxString&</tt> (this makes assignment to the |
47e1c61b RR |
224 | strings inside the function faster) and all functions returning strings |
225 | should return wxString - this makes it safe to return local variables. | |
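A possible corrected version of the function discussed above is sketched here with @c std::string so that it compiles without wxWidgets; the name @c SayHelloFixed is hypothetical. With wxString the same two changes apply: pass the result of @c c_str() to @c printf() and return the string by value.

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// std::string analogue of the corrected function (hypothetical name).
// Takes the argument by const reference and returns the result by value,
// so no pointer into a destroyed local object ever escapes.
std::string SayHelloFixed(const std::string& input)
{
    std::string output = input;
    for (char& ch : output)
        ch = static_cast<char>(std::toupper(static_cast<unsigned char>(ch)));

    // explicit conversion to a C string for the variadic function
    std::printf("Hello, %s!\n", output.c_str());
    return output;
}
```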
f05d2fde | 226 | |
727aa906 FM |
227 | Finally note that wxString uses the current locale encoding to convert any C string |
228 | literal to Unicode. The same is done for converting to and from @c std::string | |
229 | and for the return value of c_str(). | |
230 | For this conversion, the @a wxConvLibc class instance is used. | |
231 | See wxCSConv and wxMBConv. | |
232 | ||
233 | ||
831e1028 | 234 | @subsection overview_string_iterating Iterating wxString Characters |
727aa906 FM |
235 | |
236 | As previously described, when <tt>wxUSE_UNICODE_UTF8==1</tt>, wxString internally | |
237 | uses the variable-length UTF8 encoding. | |
238 | Accessing a UTF-8 string by index can be very @b inefficient because | |
239 | a single character is represented by a variable number of bytes so that | |
240 | the entire string has to be parsed in order to find the character. | |
241 | Since iterating over a string by index is a common programming technique and | |
242 | was also possible and encouraged by wxString using the access operator[](), | |
243 | wxString implements caching of the last used index so that iterating over | |
244 | a string is a linear operation even in UTF-8 mode. | |
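The idea of caching the last used index can be sketched as follows. This is only an illustration of the technique under the assumption of well-formed UTF-8 input, with hypothetical names; it is not the actual wxString implementation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical class: maps a code point index to a byte offset in a
// well-formed UTF-8 string, remembering the last position looked up.
class CachedUtf8Index
{
public:
    explicit CachedUtf8Index(const std::string& utf8) : m_data(utf8) {}

    // Returns the byte offset at which code point number 'index' starts.
    std::size_t ByteOffset(std::size_t index)
    {
        // The cache only helps when moving forwards; otherwise restart.
        if (index < m_lastIndex)
        {
            m_lastIndex = 0;
            m_lastOffset = 0;
        }

        while (m_lastIndex < index && m_lastOffset < m_data.size())
        {
            m_lastOffset += SequenceLength(
                static_cast<unsigned char>(m_data[m_lastOffset]));
            ++m_lastIndex;
        }
        return m_lastOffset;
    }

private:
    // Length in bytes of the sequence introduced by the given lead byte.
    static std::size_t SequenceLength(unsigned char lead)
    {
        if (lead < 0x80) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        return 4;
    }

    std::string m_data;
    std::size_t m_lastIndex = 0;   // last code point index looked up
    std::size_t m_lastOffset = 0;  // byte offset of that code point
};
```

Looking up indices 0, 1, 2, ... in order advances by one sequence per call, so a full forward loop touches each byte once instead of rescanning from the start every time.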
245 | ||
246 | It is nonetheless recommended to use @b iterators (instead of index based | |
247 | access) like this: | |
248 | ||
249 | @code | |
250 | wxString s = "hello"; | |
251 | wxString::const_iterator i; | |
252 | for (i = s.begin(); i != s.end(); ++i) | |
253 | { | |
254 | wxUniChar uni_ch = *i; | |
255 | // do something with it | |
256 | } | |
257 | @endcode | |
258 | ||
259 | ||
f05d2fde BP |
260 | |
261 | @section overview_string_related String Related Functions and Classes | |
262 | ||
263 | As most programs use character strings, the standard C library provides quite | |
264 | a few functions to work with them. Unfortunately, some of them have rather | |
265 | counter-intuitive behaviour (like @c strncpy() which doesn't always terminate | |
266 | the resulting string with a @c NUL) and are in general not very safe (passing | |
267 | @NULL to them will probably lead to a program crash). Moreover, some very useful | |
268 | functions are not standard at all. This is why in addition to all wxString | |
269 | functions, there are also a few global string functions which try to correct | |
270 | these problems: wxIsEmpty() verifies whether the string is empty (returning | |
2cd3cc94 | 271 | @true for @NULL pointers), wxStrlen() also handles @NULL correctly and returns |
f05d2fde BP |
272 | 0 for them and wxStricmp() is just a platform-independent version of |
273 | case-insensitive string comparison function known either as @c stricmp() or | |
274 | @c strcasecmp() on different platforms. | |
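The @NULL handling described above can be pictured with the following standalone sketch. The helper names are hypothetical and these are not the wxWidgets implementations of wxStrlen() or wxIsEmpty(); the sketch only shows the behaviour the text describes.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: like strlen(), but returns 0 for a null pointer
// instead of crashing.
inline std::size_t SafeStrlen(const char* s)
{
    return s ? std::strlen(s) : 0;
}

// Hypothetical helper: true for a null pointer or an empty string.
inline bool IsEmptyString(const char* s)
{
    return !s || !*s;
}
```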
275 | ||
831e1028 | 276 | The <tt>@<wx/string.h@></tt> header also defines wxSnprintf() and wxVsnprintf() |
f05d2fde BP |
277 | functions which should be used instead of the inherently dangerous standard |
278 | @c sprintf(); they use @c snprintf(), which checks the buffer size, | |
279 | whenever possible. Of course, you may also use wxString::Printf which is also | |
280 | safe. | |
281 | ||
282 | There is another class which might be useful when working with wxString: | |
283 | wxStringTokenizer. It is helpful when a string must be broken into tokens and | |
284 | replaces the standard C library @c strtok() function. | |
285 | ||
286 | And the very last string-related class is wxArrayString: it is just a version | |
287 | of the "template" dynamic array class which is specialized to work with | |
288 | strings. Please note that this class is specially optimized (using its | |
289 | knowledge of the internal structure of wxString) for storing strings and so it | |
290 | is vastly better from a performance point of view than a wxObjectArray of | |
291 | wxStrings. | |
292 | ||
293 | ||
f05d2fde BP |
294 | @section overview_string_tuning Tuning wxString for Your Application |
295 | ||
296 | @note This section is strictly about performance issues and is absolutely not | |
297 | necessary to read for using wxString class. Please skip it unless you feel | |
727aa906 | 298 | familiar with profilers and related tools. |
f05d2fde BP |
299 | |
300 | For performance reasons wxString doesn't allocate exactly the amount of | |
301 | memory needed for each string. Instead, it adds a small amount of space to each | |
302 | allocated block which allows it to not reallocate memory (a relatively | |
303 | expensive operation) too often when a string is constructed by | |
304 | successively adding one character at a time to it, as for example in: | |
305 | ||
306 | @code | |
307 | // delete all vowels from the string | |
308 | wxString DeleteAllVowels(const wxString& original) | |
309 | { | |
47e1c61b | 310 | wxString vowels( "aeuioAEIOU" ); |
f05d2fde | 311 | wxString result; |
47e1c61b RR |
312 | wxString::const_iterator i; |
313 | for ( i = original.begin(); i != original.end(); ++i ) | |
f05d2fde | 314 | { |
47e1c61b RR |
315 | if (vowels.Find( *i ) == wxNOT_FOUND) |
316 | result += *i; | |
f05d2fde BP |
317 | } |
318 | ||
319 | return result; | |
320 | } | |
321 | @endcode | |
322 | ||
323 | This is quite a common situation and not allocating extra memory at all would | |
324 | lead to very bad performance in this case because there would be as many memory | |
325 | (re)allocations as there are consonants in the original string. Allocating too | |
326 | much extra memory would help to improve the speed in this situation, but due to | |
327 | a great number of wxString objects typically used in a program would also | |
328 | increase the memory consumption too much. | |
329 | ||
330 | The very best solution in precisely this case would be to use wxString::Alloc() | |
331 | function to preallocate, for example, len bytes from the beginning - this will | |
332 | lead to exactly one memory allocation being performed (because the result is at | |
333 | most as long as the original string). | |
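The same preallocation idea can be sketched with @c std::string, whose @c reserve() plays a role comparable to the one described for wxString::Alloc(). The function name below is hypothetical and the code deliberately avoids wxWidgets so that it compiles standalone.

```cpp
#include <string>

// std::string analogue of the example (hypothetical name): the result
// can never be longer than the input, so reserving that much up front
// avoids any reallocation while appending.
std::string DeleteAllVowelsPreallocated(const std::string& original)
{
    const std::string vowels("aeiouAEIOU");

    std::string result;
    result.reserve(original.size());

    for (char ch : original)
    {
        if (vowels.find(ch) == std::string::npos)
            result += ch;
    }
    return result;
}
```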
334 | ||
335 | However, using wxString::Alloc() is tedious and so wxString tries to do its | |
336 | best. The default algorithm assumes that memory allocation is done in | |
337 | granularity of at least 16 bytes (which is the case on almost all | |
338 | widespread platforms) and so nothing is lost if the amount of memory to | |
339 | allocate is rounded up to the next multiple of 16. This way, no memory is lost | |
340 | and 15 iterations out of 16 in the example above won't allocate memory but use | |
341 | the already allocated pool. | |
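The arithmetic behind this can be checked with a small model. It is a sketch under the assumption that capacity is simply rounded up to the next multiple of 16 on every reallocation; the names are hypothetical and the real allocation code in wxString may differ.

```cpp
#include <cstddef>

// Assumed allocation granularity, as in the text above.
constexpr std::size_t kGranularity = 16;

// Rounds n up to the next multiple of kGranularity.
constexpr std::size_t RoundUpToGranularity(std::size_t n)
{
    return (n + kGranularity - 1) / kGranularity * kGranularity;
}

// Models appending one character at a time and counts how often the
// buffer would have to be reallocated.
std::size_t CountReallocations(std::size_t appends)
{
    std::size_t capacity = 0;
    std::size_t reallocations = 0;
    for (std::size_t size = 1; size <= appends; ++size)
    {
        if (size > capacity)
        {
            capacity = RoundUpToGranularity(size);
            ++reallocations;
        }
    }
    return reallocations;
}
```

In this model, appending 32 characters one by one triggers 2 reallocations (at sizes 1 and 17), so 15 out of every 16 appends reuse memory that is already allocated.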
342 | ||
343 | The default approach is quite conservative. Allocating more memory may bring | |
344 | important performance benefits for programs using (relatively) few very long | |
345 | strings. The amount of memory allocated is configured by the setting of | |
346 | @c EXTRA_ALLOC in the file string.cpp during compilation (be sure to understand | |
347 | why its default value is what it is before modifying it!). You may try setting | |
348 | it to a greater amount (say twice nLen) or to 0 (to see performance degradation | |
349 | which will follow) and analyse the impact of it on your program. If you do it, | |
350 | you will probably find it helpful to also define @c WXSTRING_STATISTICS symbol | |
351 | which tells the wxString class to collect performance statistics and to show | |
352 | them on stderr on program termination. This will show you the average length of | |
353 | strings your program manipulates, their average initial length and also the | |
354 | percent of times when memory wasn't reallocated when string concatenation was | |
355 | done but the already preallocated memory was used (this value should be about | |
356 | 98% for the default allocation policy, if it is less than 90% you should | |
357 | really consider fine tuning wxString for your application). | |
358 | ||
359 | It goes without saying that a profiler should be used to measure the precise | |
360 | difference the change to @c EXTRA_ALLOC makes to your program. | |
361 | ||
727aa906 FM |
362 | |
363 | @section overview_string_settings wxString Related Compilation Settings | |
364 | ||
365 | Much work has been done to make existing code using ANSI string literals | |
366 | work as before version 3.0. | |
2f365fcb | 367 | |
727aa906 FM |
368 | If you nonetheless need to have a wxString that uses @c wchar_t |
369 | on Unix and Linux, too, you can specify this on the command line with the | |
370 | @c configure @c --disable-utf8 switch or you can consider using wxUString | |
371 | or @c std::wstring instead. | |
372 | ||
2f365fcb FM |
373 | @c wxUSE_UNICODE is now defined as @c 1 by default to indicate Unicode support. |
374 | If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is | |
375 | also defined, otherwise @c wxUSE_UNICODE_WCHAR is. | |
376 | See also @ref page_wxusedef_important. | |
727aa906 | 377 | |
f05d2fde | 378 | */ |