| 1 | ///////////////////////////////////////////////////////////////////////////// |
| 2 | // Name: string.h |
| 3 | // Purpose: topic overview |
| 4 | // Author: wxWidgets team |
| 5 | // RCS-ID: $Id$ |
| 6 | // Licence: wxWindows license |
| 7 | ///////////////////////////////////////////////////////////////////////////// |
| 8 | |
| 9 | /** |
| 10 | |
| 11 | @page overview_string wxString Overview |
| 12 | |
| 13 | Classes: wxString, wxArrayString, wxStringTokenizer |
| 14 | |
| 15 | @li @ref overview_string_intro |
| 16 | @li @ref overview_string_internal |
| 17 | @li @ref overview_string_binary |
| 18 | @li @ref overview_string_comparison |
| 19 | @li @ref overview_string_advice |
| 20 | @li @ref overview_string_related |
| 21 | @li @ref overview_string_tuning |
| 22 | @li @ref overview_string_settings |
| 23 | |
| 24 | |
| 25 | <hr> |
| 26 | |
| 27 | |
| 28 | @section overview_string_intro Introduction |
| 29 | |
| 30 | wxString is a class which represents a Unicode string of arbitrary length and |
| 31 | containing arbitrary Unicode characters. |
| 32 | |
| 33 | This class has all the standard operations you can expect to find in a string |
| 34 | class: dynamic memory management (string extends to accommodate new |
| 35 | characters), construction from other strings, compatibility with C strings and |
| 36 | wide character C strings, assignment operators, access to individual characters, string |
| 37 | concatenation and comparison, substring extraction, case conversion, trimming and |
| 38 | padding (with spaces), searching and replacing, and both C-like @c printf (wxString::Printf) |
| 39 | and stream-like insertion functions, among much more. See wxString for a |
| 40 | list of all functions. |
| 41 | |
| 42 | The wxString class has been completely rewritten for wxWidgets 3.0 but much work |
| 43 | has been done to make existing code using ANSI string literals work as it did |
| 44 | in previous versions. |
| 45 | |
| 46 | |
| 47 | @section overview_string_internal Internal wxString encoding |
| 48 | |
| 49 | Since wxWidgets 3.0 wxString internally uses <b>UTF-16</b> (with Unicode |
| 50 | code units stored in @c wchar_t) under Windows and <b>UTF-8</b> (with Unicode |
| 51 | code units stored in @c char) under Unix, Linux and Mac OS X to store its content. |
| 52 | |
| 53 | For definitions of the terms <em>code units</em> and <em>code points</em>, please |
| 54 | see the @ref overview_unicode_encodings paragraph. |
| 55 | |
| 56 | For simplicity of implementation, wxString when <tt>wxUSE_UNICODE_WCHAR==1</tt> |
| 57 | (e.g. on Windows) uses <em>per code unit indexing</em> instead of |
| 58 | <em>per code point indexing</em> and doesn't know anything about surrogate pairs; |
| 59 | in other words it always considers code points to be composed of 1 code unit, |
| 60 | while this is really true only for characters in the @e BMP (Basic Multilingual Plane). |
| 61 | Thus when iterating over a UTF-16 string stored in a wxString under Windows, the user |
| 62 | code has to take care of <em>surrogate pairs</em> itself. |
| 63 | (Note however that Windows itself has built-in support for surrogate pairs in UTF-16, |
| 64 | such as for drawing strings on screen.) |
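
As a minimal sketch of what taking care of surrogate pairs can look like in
user code, the following helper counts code points rather than code units in a
UTF-16 sequence. It is written against @c std::u16string only to stay
self-contained; @c CountCodePointsUtf16 is a hypothetical name and is not part
of wxWidgets.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a wxWidgets function): counts Unicode code points
// in a UTF-16 sequence by treating a high surrogate that is immediately
// followed by a low surrogate as a single code point.
std::size_t CountCodePointsUtf16(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const char16_t unit = s[i];
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        if (isHigh && i + 1 < s.size())
        {
            const char16_t next = s[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF)
                ++i; // skip the low surrogate, it belongs to the same code point
        }
        ++count;
    }
    return count;
}
```

For a string holding only U+10300 this returns 1, although the string contains
2 code units.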
| 65 | |
| 66 | @remarks |
| 67 | Note that while the behaviour of wxString when <tt>wxUSE_UNICODE_WCHAR==1</tt> |
| 68 | resembles UCS-2 encoding, it's not completely correct to refer to wxString as |
| 69 | UCS-2 encoded since you can encode code points outside the @e BMP in a wxString |
| 70 | as two code units (i.e. as a surrogate pair; as already mentioned, however, wxString |
| 71 | will "see" them as two different code points). |
| 72 | |
| 73 | When instead <tt>wxUSE_UNICODE_UTF8==1</tt> (e.g. on Linux and Mac OS X) |
| 74 | wxString handles UTF8 multi-byte sequences just fine, including characters outside |
| 75 | the BMP (it implements <em>per code point indexing</em>), so that you can use |
| 76 | UTF8 in a completely transparent way: |
| 77 | |
| 78 | Example: |
| 79 | @code |
| 80 | // first test, using exotic characters outside of the Unicode BMP: |
| 81 | |
| 82 | wxString test = wxString::FromUTF8("\xF0\x90\x8C\x80"); |
| 83 | // U+10300 is "OLD ITALIC LETTER A" and is part of Unicode Plane 1 |
| 84 | // in UTF8 it's encoded as 0xF0 0x90 0x8C 0x80 |
| 85 | |
| 86 | // it's a single Unicode code-point encoded as: |
| 87 | // - a UTF16 surrogate pair under Windows |
| 88 | // - a UTF8 multi-byte sequence under Linux |
| 89 | // (without considering the final NUL) |
| 90 | |
| 91 | wxPrintf("wxString reports a length of %d character(s)", test.length()); |
| 92 | // prints "wxString reports a length of 1 character(s)" on Linux |
| 93 | // prints "wxString reports a length of 2 character(s)" on Windows |
| 94 | // since wxString on Windows doesn't have surrogate pairs support! |
| 95 | |
| 96 | |
| 97 | // second test, this time using characters part of the Unicode BMP: |
| 98 | |
| 99 | wxString test2 = wxString::FromUTF8("\x41\xC3\xA0\xE2\x82\xAC"); |
| 100 | // this is the UTF8 encoding of capital letter A followed by |
| 101 | // 'small case letter a with grave' followed by the 'euro sign' |
| 102 | |
| 103 | // they are 3 Unicode code-points encoded as: |
| 104 | // - 3 UTF16 code units under Windows |
| 105 | // - 6 UTF8 code units under Linux |
| 106 | // (without considering the final NUL) |
| 107 | |
| 108 | wxPrintf("wxString reports a length of %d character(s)", test2.length()); |
| 109 | // prints "wxString reports a length of 3 character(s)" on Linux |
| 110 | // prints "wxString reports a length of 3 character(s)" on Windows |
| 111 | @endcode |
| 112 | |
| 113 | To better explain what is stated above, consider the second string of the example |
| 114 | above; it's composed of 3 characters and the final @c NUL: |
| 115 | |
| 116 | @image html overview_wxstring_encoding.png |
| 117 | |
| 118 | As you can see, UTF16 encoding is straightforward (for characters in the @e BMP) |
| 119 | and in this example the UTF16-encoded wxString takes 8 bytes. |
| 120 | UTF8 encoding is more elaborate and in this example takes 7 bytes. |
| 121 | |
| 122 | In general, for strings containing many Latin characters UTF8 has a much |
| 123 | smaller memory footprint than UTF16, but it requires some |
| 124 | more processing for common operations such as length calculation. |
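
The extra processing can be illustrated with a small sketch: in UTF-8 the
number of code points cannot be derived from the byte count, so the whole
buffer has to be scanned. @c CountCodePointsUtf8 is a hypothetical helper that
assumes well-formed input; it is not a description of how wxString is
implemented.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in well-formed UTF-8 by counting
// every byte that is not a continuation byte (bit pattern 10xxxxxx).
std::size_t CountCodePointsUtf8(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8)
    {
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the 6-byte sequence <tt>41 C3 A0 E2 82 AC</tt> discussed above, the helper
visits all 6 bytes and reports 3 code points.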
| 125 | |
| 126 | Finally, note that the type used by wxString to store Unicode code units |
| 127 | (@c wchar_t or @c char) is always @c typedef-ined to be ::wxStringCharType. |
| 128 | |
| 129 | |
| 130 | @section overview_string_binary Using wxString to store binary data |
| 131 | |
| 132 | wxString can be used to store binary data (even if it contains @c NUL characters) using the |
| 133 | functions wxString::To8BitData and wxString::From8BitData. |
| 134 | |
| 135 | Beware that even though @c NUL characters are allowed, in the current string implementation |
| 136 | some methods might not work correctly with them. |
| 137 | |
| 138 | Note however that other classes like wxMemoryBuffer are more suited to this task. |
| 139 | For handling binary data you may also want to look at the wxStreamBuffer, |
| 140 | wxMemoryOutputStream, wxMemoryInputStream classes. |
| 141 | |
| 142 | |
| 143 | @section overview_string_comparison Comparison to Other String Classes |
| 144 | |
| 145 | The advantages of using a special string class instead of working directly with |
| 146 | C strings are so obvious that there is a huge number of such classes available. |
| 147 | The most important advantage is that there is no need to remember to allocate and free |
| 148 | memory for C strings; working with fixed size buffers almost inevitably leads |
| 149 | to buffer overflows. Besides, C++ has a standard string class (@c std::string). So |
| 150 | why the need for wxString? There are several advantages: |
| 151 | |
| 152 | @li <b>Efficiency:</b> Since wxWidgets 3.0 wxString uses @c std::string (in UTF8 |
| 153 | mode under Linux, Unix and OS X) or @c std::wstring (in UTF16 mode under Windows) |
| 154 | internally by default to store its contents. wxString will therefore inherit the |
| 155 | performance characteristics from @c std::string. |
| 156 | @li <b>Compatibility:</b> This class tries to combine almost full compatibility |
| 157 | with the old wxWidgets 1.xx wxString class, some reminiscence of MFC's |
| 158 | CString class and 90% of the functionality of @c std::string class. |
| 159 | @li <b>Rich set of functions:</b> Some of the functions present in wxString are |
| 160 | very useful but don't exist in most other string classes: for example, |
| 161 | wxString::AfterFirst, wxString::BeforeLast, wxString::Printf. |
| 162 | Of course, all the standard string operations are supported as well. |
| 163 | @li <b>wxString is Unicode friendly:</b> it makes it easy to convert to |
| 164 | and from ANSI and Unicode strings (see @ref overview_unicode |
| 165 | for more details) and maps to @c std::wstring transparently. |
| 166 | @li <b>Used by wxWidgets:</b> And, of course, this class is used everywhere |
| 167 | inside wxWidgets so there is no performance loss which would result from |
| 168 | conversions of objects of any other string class (including @c std::string) to |
| 169 | wxString internally by wxWidgets. |
| 170 | |
| 171 | However, there are several problems as well. The most important one is probably |
| 172 | that there are often several functions to do exactly the same thing: for |
| 173 | example, to get the length of the string either one of wxString::length(), |
| 174 | wxString::Len() or wxString::Length() may be used. The first function, as |
| 175 | almost all the other functions in lowercase, is @c std::string compatible. The |
| 176 | second one is the "native" wxString version and the last one is the wxWidgets |
| 177 | 1.xx way. |
| 178 | |
| 179 | So which is better to use? The usage of the @c std::string compatible functions is |
| 180 | strongly advised! It will make your code more familiar to other C++ |
| 181 | programmers (who are supposed to have knowledge of @c std::string but not of |
| 182 | wxString), let you reuse the same code in both wxWidgets and other programs (by |
| 183 | just typedefing wxString as @c std::string when used outside wxWidgets) and |
| 184 | keep it compatible with future versions of wxWidgets which will probably start |
| 185 | using @c std::string sooner or later too. |
| 186 | |
| 187 | In the situations where there is no corresponding @c std::string function, please |
| 188 | try to use the new wxString methods and not the old wxWidgets 1.xx variants |
| 189 | which are deprecated and may disappear in future versions. |
| 190 | |
| 191 | |
| 192 | @section overview_string_advice Advice About Using wxString |
| 193 | |
| 194 | @subsection overview_string_implicitconv Implicit conversions |
| 195 | |
| 196 | Probably the main trap with using this class is the implicit conversion |
| 197 | operator to <tt>const char*</tt>. It is advised that you use wxString::c_str() |
| 198 | instead to clearly indicate when the conversion is done. Specifically, the |
| 199 | danger of this implicit conversion may be seen in the following code fragment: |
| 200 | |
| 201 | @code |
| 202 | // this function converts the input string to uppercase, |
| 203 | // outputs it to the screen and returns the result |
| 204 | const char *SayHELLO(const wxString& input) |
| 205 | { |
| 206 | wxString output = input.Upper(); |
| 207 | printf("Hello, %s!\n", output); |
| 208 | return output; |
| 209 | } |
| 210 | @endcode |
| 211 | |
| 212 | There are two nasty bugs in these three lines. The first is in the call to the |
| 213 | @c printf() function. Although the implicit conversion to C strings is applied |
| 214 | automatically by the compiler in the case of |
| 215 | |
| 216 | @code |
| 217 | puts(output); |
| 218 | @endcode |
| 219 | |
| 220 | because the argument of @c puts() is known to be of the type |
| 221 | <tt>const char*</tt>, this is @b not done for @c printf() which is a function |
| 222 | with variable number of arguments (and whose arguments are of unknown types). |
| 223 | So this call may do any number of things (including displaying the correct |
| 224 | string on screen), although the most likely result is a program crash. |
| 225 | The solution is to use wxString::c_str(). Just replace this line with this: |
| 226 | |
| 227 | @code |
| 228 | printf("Hello, %s!\n", output.c_str()); |
| 229 | @endcode |
| 230 | |
| 231 | The second bug is that returning @c output doesn't work. The implicit cast is |
| 232 | used again, so the code compiles, but as it returns a pointer to a buffer |
| 233 | belonging to a local variable which is deleted as soon as the function exits, |
| 234 | its contents are completely arbitrary. The solution to this problem is also |
| 235 | easy, just make the function return wxString instead of a C string. |
| 236 | |
| 237 | This leads us to the following general advice: all functions taking string |
| 238 | arguments should take <tt>const wxString&</tt> (this makes assignment to the |
| 239 | strings inside the function faster) and all functions returning strings |
| 240 | should return wxString - this makes it safe to return local variables. |
| 241 | |
| 242 | Finally note that wxString uses the current locale encoding to convert any C string |
| 243 | literal to Unicode. The same is done for converting to and from @c std::string |
| 244 | and for the return value of c_str(). |
| 245 | For this conversion, the @a wxConvLibc class instance is used. |
| 246 | See wxCSConv and wxMBConv. |
| 247 | |
| 248 | |
| 249 | @subsection overview_string_iterating Iterating wxString's characters |
| 250 | |
| 251 | As previously described, when <tt>wxUSE_UNICODE_UTF8==1</tt>, wxString internally |
| 252 | uses the variable-length UTF8 encoding. |
| 253 | Accessing a UTF-8 string by index can be very @b inefficient because |
| 254 | a single character is represented by a variable number of bytes so that |
| 255 | the entire string has to be parsed in order to find the character. |
| 256 | Since iterating over a string by index is a common programming technique, one that |
| 257 | wxString has always allowed and encouraged through its access operator[](), |
| 258 | wxString implements caching of the last used index so that iterating over |
| 259 | a string is a linear operation even in UTF-8 mode. |
| 260 | |
| 261 | It is nonetheless recommended to use @b iterators (instead of index based |
| 262 | access) like this: |
| 263 | |
| 264 | @code |
| 265 | wxString s = "hello"; |
| 266 | wxString::const_iterator i; |
| 267 | for (i = s.begin(); i != s.end(); ++i) |
| 268 | { |
| 269 | wxUniChar uni_ch = *i; |
| 270 | // do something with it |
| 271 | } |
| 272 | @endcode |
| 273 | |
| 274 | |
| 275 | |
| 276 | @section overview_string_related String Related Functions and Classes |
| 277 | |
| 278 | As most programs use character strings, the standard C library provides quite |
| 279 | a few functions to work with them. Unfortunately, some of them have rather |
| 280 | counter-intuitive behaviour (like @c strncpy() which doesn't always terminate |
| 281 | the resulting string with a @c NUL) and are in general not very safe (passing |
| 282 | @NULL to them will probably lead to program crash). Moreover, some very useful |
| 283 | functions are not standard at all. This is why in addition to all wxString |
| 284 | functions, there are also a few global string functions which try to correct |
| 285 | these problems: wxIsEmpty() verifies whether the string is empty (returning |
| 286 | @true for @NULL pointers), wxStrlen() also handles @NULL correctly and returns |
| 287 | 0 for them and wxStricmp() is just a platform-independent version of |
| 288 | case-insensitive string comparison function known either as @c stricmp() or |
| 289 | @c strcasecmp() on different platforms. |
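
The two pitfalls mentioned above can be sketched with the standard library
alone. @c CopyTerminated and @c SafeLength are hypothetical helpers written for
this illustration: the first always terminates its result, unlike a bare
@c strncpy() call, and the second mimics the behaviour described for wxStrlen()
without claiming to be its implementation.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies src into a buffer of dstSize bytes and always
// terminates the result, truncating src if it does not fit.
void CopyTerminated(char* dst, std::size_t dstSize, const char* src)
{
    if (dstSize == 0)
        return;
    // strncpy() alone leaves dst unterminated when src has dstSize - 1 or
    // more characters, so the terminator is written explicitly.
    std::strncpy(dst, src, dstSize - 1);
    dst[dstSize - 1] = '\0';
}

// Hypothetical helper: like strlen(), but returns 0 for a null pointer
// instead of invoking undefined behaviour.
std::size_t SafeLength(const char* s)
{
    return s ? std::strlen(s) : 0;
}
```

Copying <tt>"hello"</tt> into a 4-byte buffer with @c CopyTerminated yields
<tt>"hel"</tt>, and @c SafeLength returns 0 when given a null pointer.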
| 290 | |
| 291 | The <tt>@<wx/string.h@></tt> header also defines ::wxSnprintf and ::wxVsnprintf |
| 292 | functions which should be used instead of the inherently dangerous standard |
| 293 | @c sprintf(); whenever possible they use @c snprintf(), which does buffer size |
| 294 | checks. Of course, you may also use wxString::Printf which is also |
| 295 | safe. |
| 296 | |
| 297 | There is another class which might be useful when working with wxString: |
| 298 | wxStringTokenizer. It is helpful when a string must be broken into tokens and |
| 299 | replaces the standard C library @c strtok() function. |
| 300 | |
| 301 | And the very last string-related class is wxArrayString: it is just a version |
| 302 | of the "template" dynamic array class which is specialized to work with |
| 303 | strings. Please note that this class is specially optimized (using its |
| 304 | knowledge of the internal structure of wxString) for storing strings and so it |
| 305 | is vastly better from a performance point of view than a wxObjectArray of |
| 306 | wxStrings. |
| 307 | |
| 308 | |
| 309 | @section overview_string_tuning Tuning wxString for Your Application |
| 310 | |
| 311 | @note This section is strictly about performance issues and you do not |
| 312 | need to read it in order to use the wxString class. Please skip it unless you feel |
| 313 | familiar with profilers and related tools. |
| 314 | |
| 315 | For performance reasons wxString doesn't allocate exactly the amount of |
| 316 | memory needed for each string. Instead, it adds a small amount of space to each |
| 317 | allocated block which allows it to not reallocate memory (a relatively |
| 318 | expensive operation) too often, for example when a string is constructed by |
| 319 | successively adding one character at a time to it, as in: |
| 320 | |
| 321 | @code |
| 322 | // delete all vowels from the string |
| 323 | wxString DeleteAllVowels(const wxString& original) |
| 324 | { |
| 325 | wxString vowels( "aeuioAEIOU" ); |
| 326 | wxString result; |
| 327 | wxString::const_iterator i; |
| 328 | for ( i = original.begin(); i != original.end(); ++i ) |
| 329 | { |
| 330 | if (vowels.Find( *i ) == wxNOT_FOUND) |
| 331 | result += *i; |
| 332 | } |
| 333 | |
| 334 | return result; |
| 335 | } |
| 336 | @endcode |
| 337 | |
| 338 | This is quite a common situation and not allocating extra memory at all would |
| 339 | lead to very bad performance in this case because there would be as many memory |
| 340 | (re)allocations as there are consonants in the original string. Allocating too |
| 341 | much extra memory would help to improve the speed in this situation, but because of |
| 342 | the great number of wxString objects typically used in a program it would also |
| 343 | increase the memory consumption too much. |
| 344 | |
| 345 | The very best solution in precisely this case would be to use the wxString::Alloc() |
| 346 | function to preallocate room for the whole original string up front - this will |
| 347 | lead to exactly one memory allocation being performed (because the result is at |
| 348 | most as long as the original string). |
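
The preallocation idea can be sketched as follows. To keep the example
self-contained it uses @c std::string and its @c reserve() method as a stand-in
for wxString and wxString::Alloc(); @c DeleteAllVowelsReserved is a
hypothetical name, and the sketch is not meant as the definitive way to write
this function with wxString.

```cpp
#include <string>

// Stand-in sketch using std::string instead of wxString: the result can never
// be longer than the input, so room for the whole input is reserved once
// before any character is appended.
std::string DeleteAllVowelsReserved(const std::string& original)
{
    const std::string vowels("aeiouAEIOU");
    std::string result;
    result.reserve(original.size()); // single up-front allocation
    for (char ch : original)
    {
        if (vowels.find(ch) == std::string::npos)
            result += ch; // stays within the reserved capacity
    }
    return result;
}
```

With wxString the corresponding call would be wxString::Alloc() on the result
string before entering the loop.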
| 349 | |
| 350 | However, using wxString::Alloc() is tedious and so wxString tries to do its |
| 351 | best. The default algorithm assumes that memory allocation is done in |
| 352 | granularity of at least 16 bytes (which is the case on almost all |
| 353 | widespread platforms) and so nothing is lost if the amount of memory to |
| 354 | allocate is rounded up to the next multiple of 16. This way, no memory is lost |
| 355 | and 15 iterations out of 16 in the example above won't allocate memory but use |
| 356 | the already allocated pool. |
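
The arithmetic behind this claim can be checked with a small model. Both
functions below are hypothetical and only simulate the rounding policy
described above; they are not taken from the wxString sources.

```cpp
#include <cstddef>

// Rounds a requested size up to a multiple of 16.
std::size_t RoundUpTo16(std::size_t n)
{
    return (n + 15) / 16 * 16;
}

// Counts how many times a buffer grown one character at a time would have to
// be reallocated if every allocation were rounded up as above.
std::size_t CountReallocations(std::size_t finalLength)
{
    std::size_t capacity = 0;
    std::size_t reallocations = 0;
    for (std::size_t len = 1; len <= finalLength; ++len)
    {
        if (len > capacity)
        {
            capacity = RoundUpTo16(len);
            ++reallocations;
        }
    }
    return reallocations;
}
```

In this model, appending 160 characters one at a time triggers 10
reallocations, that is one for every 16 appended characters.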
| 357 | |
| 358 | The default approach is quite conservative. Allocating more memory may bring |
| 359 | important performance benefits for programs using (relatively) few very long |
| 360 | strings. The amount of memory allocated is configured by the setting of |
| 361 | @c EXTRA_ALLOC in the file string.cpp during compilation (be sure to understand |
| 362 | why its default value is what it is before modifying it!). You may try setting |
| 363 | it to a greater amount (say twice nLen) or to 0 (to see the performance degradation |
| 364 | which will follow) and analyse the impact of it on your program. If you do it, |
| 365 | you will probably find it helpful to also define the @c WXSTRING_STATISTICS symbol |
| 366 | which tells the wxString class to collect performance statistics and to show |
| 367 | them on stderr on program termination. This will show you the average length of |
| 368 | strings your program manipulates, their average initial length and also the |
| 369 | percentage of times when memory wasn't reallocated when string concatenation was |
| 370 | done but the already preallocated memory was used (this value should be about |
| 371 | 98% for the default allocation policy; if it is less than 90% you should |
| 372 | really consider fine tuning wxString for your application). |
| 373 | |
| 374 | It goes without saying that a profiler should be used to measure the precise |
| 375 | difference the change to @c EXTRA_ALLOC makes to your program. |
| 376 | |
| 377 | |
| 378 | @section overview_string_settings wxString Related Compilation Settings |
| 379 | |
| 380 | Much work has been done to make existing code using ANSI string literals |
| 381 | work as it did before version 3.0. |
| 382 | |
| 383 | If you nonetheless need to have a wxString that uses @c wchar_t |
| 384 | on Unix and Linux, too, you can specify this on the command line with the |
| 385 | @c configure @c --disable-utf8 switch or you can consider using wxUString |
| 386 | or @c std::wstring instead. |
| 387 | |
| 388 | @c wxUSE_UNICODE is now defined as @c 1 by default to indicate Unicode support. |
| 389 | If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is |
| 390 | also defined, otherwise @c wxUSE_UNICODE_WCHAR is. |
| 391 | See also @ref page_wxusedef_important. |
| 392 | |
| 393 | */ |
| 394 | |