X-Git-Url: https://git.saurik.com/wxWidgets.git/blobdiff_plain/ade80f997fcf42e9fcd9ca5526c3850388392f53..1a289c624bca8583cd7f2cd6bfc22a70fc1a41db:/docs/doxygen/overviews/string.h diff --git a/docs/doxygen/overviews/string.h b/docs/doxygen/overviews/string.h index 6a5cf38908..f74c6e3a2e 100644 --- a/docs/doxygen/overviews/string.h +++ b/docs/doxygen/overviews/string.h @@ -13,11 +13,13 @@ Classes: wxString, wxArrayString, wxStringTokenizer @li @ref overview_string_intro +@li @ref overview_string_internal +@li @ref overview_string_binary @li @ref overview_string_comparison @li @ref overview_string_advice @li @ref overview_string_related -@li @ref overview_string_refcount @li @ref overview_string_tuning +@li @ref overview_string_settings
@@ -25,25 +27,118 @@ Classes: wxString, wxArrayString, wxStringTokenizer @section overview_string_intro Introduction -wxString is a class which represents a character string of arbitrary length and -containing arbitrary characters. The ASCII NUL character is allowed, but be -aware that in the current string implementation some methods might not work -correctly in this case. - -Since wxWidgets 3.0 wxString internally uses UCS-2 (basically 2-byte per -character wchar_t) under Windows and UTF-8 under Unix, Linux and -OS X to store its content. Much work has been done to make -existing code using ANSI string literals work as before. +wxString is a class which represents a Unicode string of arbitrary length and +containing arbitrary Unicode characters. This class has all the standard operations you can expect to find in a string class: dynamic memory management (string extends to accommodate new -characters), construction from other strings, C strings, wide character C strings -and characters, assignment operators, access to individual characters, string -concatenation and comparison, substring extraction, case conversion, trimming and padding (with -spaces), searching and replacing and both C-like @c printf (wxString::Printf) +characters), construction from other strings, compatibility with C strings and +wide character C strings, assignment operators, access to individual characters, string +concatenation and comparison, substring extraction, case conversion, trimming and +padding (with spaces), searching and replacing and both C-like @c printf (wxString::Printf) and stream-like insertion functions as well as much more - see wxString for a list of all functions. +The wxString class has been completely rewritten for wxWidgets 3.0 but much work +has been done to make existing code using ANSI string literals work as it did +in previous versions. + + +@section overview_string_internal Internal wxString encoding + +Since wxWidgets 3.0 wxString internally uses UTF-16 (with Unicode +code units stored in @c wchar_t) under Windows and UTF-8 (with Unicode +code units stored in @c char) under Unix, Linux and Mac OS X to store its content. + +For definitions of code units and code points terms, please +see the @ref overview_unicode_encodings paragraph. + +For simplicity of implementation, wxString when wxUSE_UNICODE_WCHAR==1 +(e.g. on Windows) uses per code unit indexing instead of +per code point indexing and doesn't know anything about surrogate pairs; +in other words it always considers code points to be composed by 1 code unit, +while this is really true only for characters in the @e BMP (Basic Multilingual Plane). +Thus when iterating over a UTF-16 string stored in a wxString under Windows, the user +code has to take care of surrogate pairs himself. +(Note however that Windows itself has built-in support for surrogate pairs in UTF-16, +such as for drawing strings on screen.) + +@remarks +Note that while the behaviour of wxString when wxUSE_UNICODE_WCHAR==1 +resembles UCS-2 encoding, it's not completely correct to refer to wxString as +UCS-2 encoded since you can encode code points outside the @e BMP in a wxString +as two code units (i.e. as a surrogate pair; as already mentioned however wxString +will "see" them as two different code points) + +When instead wxUSE_UNICODE_UTF8==1 (e.g. on Linux and Mac OS X) +wxString handles UTF8 multi-bytes sequences just fine also for characters outside +the BMP (it implements per code point indexing), so that you can use +UTF8 in a completely transparent way: + +Example: +@code + // first test, using exotic characters outside of the Unicode BMP: + + wxString test = wxString::FromUTF8("\xF0\x90\x8C\x80"); + // U+10300 is "OLD ITALIC LETTER A" and is part of Unicode Plane 1 + // in UTF8 it's encoded as 0xF0 0x90 0x8C 0x80 + + // it's a single Unicode code-point encoded as: + // - a UTF16 surrogate pair under Windows + // - a UTF8 multiple-bytes sequence under Linux + // (without considering the final NULL) + + wxPrintf("wxString reports a length of %d character(s)", test.length()); + // prints "wxString reports a length of 1 character(s)" on Linux + // prints "wxString reports a length of 2 character(s)" on Windows + // since wxString on Windows doesn't have surrogate pairs support! + + + // second test, this time using characters part of the Unicode BMP: + + wxString test2 = wxString::FromUTF8("\x41\xC3\xA0\xE2\x82\xAC"); + // this is the UTF8 encoding of capital letter A followed by + // 'small case letter a with grave' followed by the 'euro sign' + + // they are 3 Unicode code-points encoded as: + // - 3 UTF16 code units under Windows + // - 6 UTF8 code units under Linux + // (without considering the final NULL) + + wxPrintf("wxString reports a length of %d character(s)", test2.length()); + // prints "wxString reports a length of 3 character(s)" on Linux + // prints "wxString reports a length of 3 character(s)" on Windows +@endcode + +To better explain what stated above, consider the second string of the example +above; it's composed by 3 characters and the final @c NULL: + +@image html overview_wxstring_encoding.png + +As you can see, UTF16 encoding is straightforward (for characters in the @e BMP) +and in this example the UTF16-encoded wxString takes 8 bytes. +UTF8 encoding is more elaborated and in this example takes 7 bytes. + +In general, for strings containing many latin characters UTF8 provides a big +advantage with regards to the memory footprint respect UTF16, but requires some +more processing for common operations like e.g. length calculation. + +Finally, note that the type used by wxString to store Unicode code units +(@c wchar_t or @c char) is always @c typedef-ined to be ::wxStringCharType. + + +@section overview_string_binary Using wxString to store binary data + +wxString can be used to store binary data (even if it contains @c NULs) using the +functions wxString::To8BitData and wxString::From8BitData. + +Beware that even if @c NUL character is allowed, in the current string implementation +some methods might not work correctly with them. + +Note however that other classes like wxMemoryBuffer are more suited to this task. +For handling binary data you may also want to look at the wxStreamBuffer, +wxMemoryOutputStream, wxMemoryInputStream classes. + @section overview_string_comparison Comparison to Other String Classes @@ -51,56 +146,53 @@ The advantages of using a special string class instead of working directly with C strings are so obvious that there is a huge number of such classes available. The most important advantage is the need to always remember to allocate/free memory for C strings; working with fixed size buffers almost inevitably leads -to buffer overflows. At last, C++ has a standard string class (std::string). So +to buffer overflows. At last, C++ has a standard string class (@c std::string). So why the need for wxString? There are several advantages: -@li Efficiency: This class was made to be as efficient as possible: both in - terms of size (each wxString objects takes exactly the same space as a - char* pointer, see @ref overview_string_refcount - "reference counting") and speed. It also provides performance - @ref overview_string_tuning "statistics gathering code" which may be - enabled to fine tune the memory allocation strategy for your particular - application - and the gain might be quite big. +@li Efficiency: Since wxWidgets 3.0 wxString uses @c std::string (in UTF8 + mode under Linux, Unix and OS X) or @c std::wstring (in UTF16 mode under Windows) + internally by default to store its contents. wxString will therefore inherit the + performance characteristics from @c std::string. @li Compatibility: This class tries to combine almost full compatibility - with the old wxWidgets 1.xx wxString class, some reminiscence to MFC - CString class and 90% of the functionality of std::string class. -@li Rich set of functions: Some of the functions present in wxString are very - useful but don't exist in most of other string classes: for example, - wxString::AfterFirst, wxString::BeforeLast, wxString::operators or - wxString::Printf. Of course, all the standard string operations are - supported as well. -@li Unicode wxString is Unicode friendly: it allows to easily convert to - and from ANSI and Unicode strings in any build mode (see the - @ref overview_unicode "unicode overview" for more details) and maps to - either @c string or @c wstring transparently depending on the current mode. + with the old wxWidgets 1.xx wxString class, some reminiscence of MFC's + CString class and 90% of the functionality of @c std::string class. +@li Rich set of functions: Some of the functions present in wxString are + very useful but don't exist in most of other string classes: for example, + wxString::AfterFirst, wxString::BeforeLast, wxString::Printf. + Of course, all the standard string operations are supported as well. +@li wxString is Unicode friendly: it allows to easily convert to + and from ANSI and Unicode strings (see @ref overview_unicode + for more details) and maps to @c std::wstring transparently. @li Used by wxWidgets: And, of course, this class is used everywhere inside wxWidgets so there is no performance loss which would result from - conversions of objects of any other string class (including std::string) to + conversions of objects of any other string class (including @c std::string) to wxString internally by wxWidgets. However, there are several problems as well. The most important one is probably that there are often several functions to do exactly the same thing: for -example, to get the length of the string either one of @c length(), +example, to get the length of the string either one of wxString::length(), wxString::Len() or wxString::Length() may be used. The first function, as -almost all the other functions in lowercase, is std::string compatible. The +almost all the other functions in lowercase, is @c std::string compatible. The second one is the "native" wxString version and the last one is the wxWidgets 1.xx way. -So which is better to use? The usage of the std::string compatible functions is +So which is better to use? The usage of the @c std::string compatible functions is strongly advised! It will both make your code more familiar to other C++ -programmers (who are supposed to have knowledge of std::string but not of +programmers (who are supposed to have knowledge of @c std::string but not of wxString), let you reuse the same code in both wxWidgets and other programs (by -just typedefing wxString as std::string when used outside wxWidgets) and by +just typedefing wxString as @c std::string when used outside wxWidgets) and by staying compatible with future versions of wxWidgets which will probably start -using std::string sooner or later too. +using @c std::string sooner or later too. -In the situations where there is no corresponding std::string function, please +In the situations where there is no corresponding @c std::string function, please try to use the new wxString methods and not the old wxWidgets 1.xx variants which are deprecated and may disappear in future versions. @section overview_string_advice Advice About Using wxString +@subsection overview_string_implicitconv Implicit conversions + Probably the main trap with using this class is the implicit conversion operator to const char*. It is advised that you use wxString::c_str() instead to clearly indicate when the conversion is done. Specifically, the @@ -129,8 +221,8 @@ because the argument of @c puts() is known to be of the type const char*, this is @b not done for @c printf() which is a function with variable number of arguments (and whose arguments are of unknown types). So this call may do any number of things (including displaying the correct -string on screen), although the most likely result is a program crash. The -solution is to use wxString::c_str(). Just replace this line with this: +string on screen), although the most likely result is a program crash. +The solution is to use wxString::c_str(). Just replace this line with this: @code printf("Hello, %s!\n", output.c_str()); @@ -143,10 +235,42 @@ its contents are completely arbitrary. The solution to this problem is also easy, just make the function return wxString instead of a C string. This leads us to the following general advice: all functions taking string -arguments should take const wxString (this makes assignment to the -strings inside the function faster because of -@ref overview_string_refcount "reference counting") and all functions returning -strings should return wxString - this makes it safe to return local variables. +arguments should take const wxString& (this makes assignment to the +strings inside the function faster) and all functions returning strings +should return wxString - this makes it safe to return local variables. + +Finally note that wxString uses the current locale encoding to convert any C string +literal to Unicode. The same is done for converting to and from @c std::string +and for the return value of c_str(). +For this conversion, the @a wxConvLibc class instance is used. +See wxCSConv and wxMBConv. + + +@subsection overview_string_iterating Iterating wxString's characters + +As previously described, when wxUSE_UNICODE_UTF8==1, wxString internally +uses the variable-length UTF8 encoding. +Accessing a UTF-8 string by index can be very @b inefficient because +a single character is represented by a variable number of bytes so that +the entire string has to be parsed in order to find the character. +Since iterating over a string by index is a common programming technique and +was also possible and encouraged by wxString using the access operator[]() +wxString implements caching of the last used index so that iterating over +a string is a linear operation even in UTF-8 mode. + +It is nonetheless recommended to use @b iterators (instead of index based +access) like this: + +@code +wxString s = "hello"; +wxString::const_iterator i; +for (i = s.begin(); i != s.end(); ++i) +{ + wxUniChar uni_ch = *i; + // do something with it +} +@endcode + @section overview_string_related String Related Functions and Classes @@ -164,7 +288,7 @@ these problems: wxIsEmpty() verifies whether the string is empty (returning case-insensitive string comparison function known either as @c stricmp() or @c strcasecmp() on different platforms. -The @ header also defines wxSnprintf and wxVsnprintf +The @ header also defines ::wxSnprintf and ::wxVsnprintf functions which should be used instead of the inherently dangerous standard @c sprintf() and which use @c snprintf() instead which does buffer size checks whenever possible. Of course, you may also use wxString::Printf which is also @@ -182,36 +306,11 @@ is vastly better from a performance point of view than a wxObjectArray of wxStrings. -@section overview_string_refcount Reference Counting and Why You Shouldn't Care - -All considerations for wxObject-derived -@ref overview_refcount "reference counted" objects are valid also for wxString, -even if it does not derive from wxObject. - -Probably the unique case when you might want to think about reference counting -is when a string character is taken from a string which is not a constant (or -a constant reference). In this case, due to C++ rules, the "read-only" -@c operator[] (which is the same as wxString::GetChar()) cannot be chosen and -the "read/write" @c operator[] (the same as wxString::GetWritableChar()) is -used instead. As the call to this operator may modify the string, its data is -unshared (COW is done) and so if the string was really shared there is some -performance loss (both in terms of speed and memory consumption). In the rare -cases when this may be important, you might prefer using wxString::GetChar() -instead of the array subscript operator for this reasons. Please note that -wxString::at() method has the same problem as the subscript operator in this -situation and so using it is not really better. Also note that if all string -arguments to your functions are passed as const wxString (see the -@ref overview_string_advice section) this situation will almost never arise -because for constant references the correct operator is called automatically. - - @section overview_string_tuning Tuning wxString for Your Application @note This section is strictly about performance issues and is absolutely not necessary to read for using wxString class. Please skip it unless you feel -familiar with profilers and relative tools. If you do read it, please also -read the preceding section about -@ref overview_string_refcount "reference counting". +familiar with profilers and relative tools. For the performance reasons wxString doesn't allocate exactly the amount of memory needed for each string. Instead, it adds a small amount of space to each @@ -223,13 +322,13 @@ subsequently adding one character at a time to it, as for example in: // delete all vowels from the string wxString DeleteAllVowels(const wxString& original) { + wxString vowels( "aeuioAEIOU" ); wxString result; - - size_t len = original.length(); - for ( size_t n = 0; n < len; n++ ) + wxString::const_iterator i; + for ( i = original.begin(); i != original.end(); ++i ) { - if ( strchr("aeuio", tolower(original[n])) == NULL ) - result += original[n]; + if (vowels.Find( *i ) == wxNOT_FOUND) + result += *i; } return result; @@ -275,5 +374,21 @@ really consider fine tuning wxString for your application). It goes without saying that a profiler should be used to measure the precise difference the change to @c EXTRA_ALLOC makes to your program. + +@section overview_string_settings wxString Related Compilation Settings + +Much work has been done to make existing code using ANSI string literals +work as before version 3.0. + +If you nonetheless need to have a wxString that uses @c wchar_t +on Unix and Linux, too, you can specify this on the command line with the +@c configure @c --disable-utf8 switch or you can consider using wxUString +or @c std::wstring instead. + +@c wxUSE_UNICODE is now defined as @c 1 by default to indicate Unicode support. +If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is +also defined, otherwise @c wxUSE_UNICODE_WCHAR is. +See also @ref page_wxusedef_important. + */