// Purpose: topic overview
// Author: wxWidgets team
// RCS-ID: $Id$
-// Licence: wxWindows license
+// Licence: wxWindows licence
/////////////////////////////////////////////////////////////////////////////
/**
Classes: wxString, wxArrayString, wxStringTokenizer
@li @ref overview_string_intro
+@li @ref overview_string_internal
+@li @ref overview_string_binary
@li @ref overview_string_comparison
@li @ref overview_string_advice
@li @ref overview_string_related
-@li @ref overview_string_refcount
@li @ref overview_string_tuning
+@li @ref overview_string_settings
<hr>
@section overview_string_intro Introduction
-wxString is a class which represents a character string of arbitrary length
-(limited by @c MAX_INT which is usually 2147483647 on 32 bit machines) and
-containing arbitrary characters. The ASCII NUL character is allowed, but be
-aware that in the current string implementation some methods might not work
-correctly in this case.
-
-wxString works with both ASCII (traditional, 7 or 8 bit, characters) as well as
-Unicode (wide characters) strings.
+wxString is a class which represents a Unicode string of arbitrary length and
+containing arbitrary Unicode characters.
This class has all the standard operations you can expect to find in a string
class: dynamic memory management (string extends to accommodate new
-characters), construction from other strings, C strings and characters,
-assignment operators, access to individual characters, string concatenation and
-comparison, substring extraction, case conversion, trimming and padding (with
-spaces), searching and replacing and both C-like @c printf (wxString::Printf)
+characters), construction from other strings, compatibility with C strings and
+wide character C strings, assignment operators, access to individual characters, string
+concatenation and comparison, substring extraction, case conversion, trimming and
+padding (with spaces), searching and replacing and both C-like @c printf (wxString::Printf)
and stream-like insertion functions as well as much more - see wxString for a
list of all functions.
+The wxString class has been completely rewritten for wxWidgets 3.0 but much work
+has been done to make existing code using ANSI string literals work as it did
+in previous versions.
+
+
+@section overview_string_internal Internal wxString encoding
+
+Since wxWidgets 3.0 wxString internally uses <b>UTF-16</b> (with Unicode
+code units stored in @c wchar_t) under Windows and <b>UTF-8</b> (with Unicode
+code units stored in @c char) under Unix, Linux and Mac OS X to store its content.
+
+For definitions of <em>code units</em> and <em>code points</em> terms, please
+see the @ref overview_unicode_encodings paragraph.
+
+For simplicity of implementation, wxString when <tt>wxUSE_UNICODE_WCHAR==1</tt>
+(e.g. on Windows) uses <em>per code unit indexing</em> instead of
+<em>per code point indexing</em> and doesn't know anything about surrogate pairs;
+in other words it always considers code points to be composed by 1 code unit,
+while this is really true only for characters in the @e BMP (Basic Multilingual Plane).
+Thus when iterating over a UTF-16 string stored in a wxString under Windows, the user
+code has to take care of <em>surrogate pairs</em> himself.
+(Note however that Windows itself has built-in support for surrogate pairs in UTF-16,
+such as for drawing strings on screen.)
+
+@remarks
+Note that while the behaviour of wxString when <tt>wxUSE_UNICODE_WCHAR==1</tt>
+resembles UCS-2 encoding, it's not completely correct to refer to wxString as
+UCS-2 encoded since you can encode code points outside the @e BMP in a wxString
+as two code units (i.e. as a surrogate pair; as already mentioned however wxString
+will "see" them as two different code points)
+
+When instead <tt>wxUSE_UNICODE_UTF8==1</tt> (e.g. on Linux and Mac OS X)
+wxString handles UTF8 multi-bytes sequences just fine also for characters outside
+the BMP (it implements <em>per code point indexing</em>), so that you can use
+UTF8 in a completely transparent way:
+
+Example:
+@code
+ // first test, using exotic characters outside of the Unicode BMP:
+
+ wxString test = wxString::FromUTF8("\xF0\x90\x8C\x80");
+ // U+10300 is "OLD ITALIC LETTER A" and is part of Unicode Plane 1
+ // in UTF8 it's encoded as 0xF0 0x90 0x8C 0x80
+
+ // it's a single Unicode code-point encoded as:
+ // - a UTF16 surrogate pair under Windows
+ // - a UTF8 multiple-bytes sequence under Linux
+ // (without considering the final NULL)
+
+ wxPrintf("wxString reports a length of %d character(s)", test.length());
+ // prints "wxString reports a length of 1 character(s)" on Linux
+ // prints "wxString reports a length of 2 character(s)" on Windows
+ // since wxString on Windows doesn't have surrogate pairs support!
+
+
+ // second test, this time using characters part of the Unicode BMP:
+
+ wxString test2 = wxString::FromUTF8("\x41\xC3\xA0\xE2\x82\xAC");
+ // this is the UTF8 encoding of capital letter A followed by
+ // 'small case letter a with grave' followed by the 'euro sign'
+
+ // they are 3 Unicode code-points encoded as:
+ // - 3 UTF16 code units under Windows
+ // - 6 UTF8 code units under Linux
+ // (without considering the final NULL)
+
+ wxPrintf("wxString reports a length of %d character(s)", test2.length());
+ // prints "wxString reports a length of 3 character(s)" on Linux
+ // prints "wxString reports a length of 3 character(s)" on Windows
+@endcode
+
+To better explain what stated above, consider the second string of the example
+above; it's composed by 3 characters and the final @c NULL:
+
+@image html overview_wxstring_encoding.png
+
+As you can see, UTF16 encoding is straightforward (for characters in the @e BMP)
+and in this example the UTF16-encoded wxString takes 8 bytes.
+UTF8 encoding is more elaborated and in this example takes 7 bytes.
+
+In general, for strings containing many latin characters UTF8 provides a big
+advantage with regards to the memory footprint respect UTF16, but requires some
+more processing for common operations like e.g. length calculation.
+
+Finally, note that the type used by wxString to store Unicode code units
+(@c wchar_t or @c char) is always @c typedef-ined to be ::wxStringCharType.
+
+
+@section overview_string_binary Using wxString to store binary data
+
+wxString can be used to store binary data (even if it contains @c NULs) using the
+functions wxString::To8BitData and wxString::From8BitData.
+
+Beware that even if @c NUL character is allowed, in the current string implementation
+some methods might not work correctly with them.
+
+Note however that other classes like wxMemoryBuffer are more suited to this task.
+For handling binary data you may also want to look at the wxStreamBuffer,
+wxMemoryOutputStream, wxMemoryInputStream classes.
+
@section overview_string_comparison Comparison to Other String Classes
C strings are so obvious that there is a huge number of such classes available.
The most important advantage is the need to always remember to allocate/free
memory for C strings; working with fixed size buffers almost inevitably leads
-to buffer overflows. At last, C++ has a standard string class (std::string). So
+to buffer overflows. At last, C++ has a standard string class (@c std::string). So
why the need for wxString? There are several advantages:
-@li <b>Efficiency:</b> This class was made to be as efficient as possible: both in
- terms of size (each wxString objects takes exactly the same space as a
- <tt>char*</tt> pointer, see @ref overview_string_refcount
- "reference counting") and speed. It also provides performance
- @ref overview_string_tuning "statistics gathering code" which may be
- enabled to fine tune the memory allocation strategy for your particular
- application - and the gain might be quite big.
+@li <b>Efficiency:</b> Since wxWidgets 3.0 wxString uses @c std::string (in UTF8
+ mode under Linux, Unix and OS X) or @c std::wstring (in UTF16 mode under Windows)
+ internally by default to store its contents. wxString will therefore inherit the
+ performance characteristics from @c std::string.
@li <b>Compatibility:</b> This class tries to combine almost full compatibility
- with the old wxWidgets 1.xx wxString class, some reminiscence to MFC
- CString class and 90% of the functionality of std::string class.
-@li <b>Rich set of functions:</b> Some of the functions present in wxString are very
- useful but don't exist in most of other string classes: for example,
- wxString::AfterFirst, wxString::BeforeLast, wxString::operators or
- wxString::Printf. Of course, all the standard string operations are
- supported as well.
-@li <b>Unicode wxString is Unicode friendly:</b> it allows to easily convert to
- and from ANSI and Unicode strings in any build mode (see the
- @ref overview_unicode "unicode overview" for more details) and maps to
- either @c string or @c wstring transparently depending on the current mode.
+ with the old wxWidgets 1.xx wxString class, some reminiscence of MFC's
+ CString class and 90% of the functionality of @c std::string class.
+@li <b>Rich set of functions:</b> Some of the functions present in wxString are
+ very useful but don't exist in most of other string classes: for example,
+ wxString::AfterFirst, wxString::BeforeLast, wxString::Printf.
+ Of course, all the standard string operations are supported as well.
+@li <b>wxString is Unicode friendly:</b> it allows to easily convert to
+ and from ANSI and Unicode strings (see @ref overview_unicode
+ for more details) and maps to @c std::wstring transparently.
@li <b>Used by wxWidgets:</b> And, of course, this class is used everywhere
inside wxWidgets so there is no performance loss which would result from
- conversions of objects of any other string class (including std::string) to
+ conversions of objects of any other string class (including @c std::string) to
wxString internally by wxWidgets.
However, there are several problems as well. The most important one is probably
that there are often several functions to do exactly the same thing: for
-example, to get the length of the string either one of @c length(),
+example, to get the length of the string either one of wxString::length(),
wxString::Len() or wxString::Length() may be used. The first function, as
-almost all the other functions in lowercase, is std::string compatible. The
+almost all the other functions in lowercase, is @c std::string compatible. The
second one is the "native" wxString version and the last one is the wxWidgets
1.xx way.
-So which is better to use? The usage of the std::string compatible functions is
+So which is better to use? The usage of the @c std::string compatible functions is
strongly advised! It will both make your code more familiar to other C++
-programmers (who are supposed to have knowledge of std::string but not of
+programmers (who are supposed to have knowledge of @c std::string but not of
wxString), let you reuse the same code in both wxWidgets and other programs (by
-just typedefing wxString as std::string when used outside wxWidgets) and by
+just typedefing wxString as @c std::string when used outside wxWidgets) and by
staying compatible with future versions of wxWidgets which will probably start
-using std::string sooner or later too.
+using @c std::string sooner or later too.
-In the situations where there is no corresponding std::string function, please
+In the situations where there is no corresponding @c std::string function, please
try to use the new wxString methods and not the old wxWidgets 1.xx variants
which are deprecated and may disappear in future versions.
@section overview_string_advice Advice About Using wxString
+@subsection overview_string_implicitconv Implicit conversions
+
Probably the main trap with using this class is the implicit conversion
operator to <tt>const char*</tt>. It is advised that you use wxString::c_str()
instead to clearly indicate when the conversion is done. Specifically, the
<tt>const char*</tt>, this is @b not done for @c printf() which is a function
with variable number of arguments (and whose arguments are of unknown types).
So this call may do any number of things (including displaying the correct
-string on screen), although the most likely result is a program crash. The
-solution is to use wxString::c_str(). Just replace this line with this:
+string on screen), although the most likely result is a program crash.
+The solution is to use wxString::c_str(). Just replace this line with this:
@code
printf("Hello, %s!\n", output.c_str());
easy, just make the function return wxString instead of a C string.
This leads us to the following general advice: all functions taking string
-arguments should take <tt>const wxString</tt> (this makes assignment to the
-strings inside the function faster because of
-@ref overview_string_refcount "reference counting") and all functions returning
-strings should return wxString - this makes it safe to return local variables.
+arguments should take <tt>const wxString&</tt> (this makes assignment to the
+strings inside the function faster) and all functions returning strings
+should return wxString - this makes it safe to return local variables.
+
+Finally note that wxString uses the current locale encoding to convert any C string
+literal to Unicode. The same is done for converting to and from @c std::string
+and for the return value of c_str().
+For this conversion, the @a wxConvLibc class instance is used.
+See wxCSConv and wxMBConv.
+
+
+@subsection overview_string_iterating Iterating wxString's characters
+
+As previously described, when <tt>wxUSE_UNICODE_UTF8==1</tt>, wxString internally
+uses the variable-length UTF8 encoding.
+Accessing a UTF-8 string by index can be very @b inefficient because
+a single character is represented by a variable number of bytes so that
+the entire string has to be parsed in order to find the character.
+Since iterating over a string by index is a common programming technique and
+was also possible and encouraged by wxString using the access operator[]()
+wxString implements caching of the last used index so that iterating over
+a string is a linear operation even in UTF-8 mode.
+
+It is nonetheless recommended to use @b iterators (instead of index based
+access) like this:
+
+@code
+wxString s = "hello";
+wxString::const_iterator i;
+for (i = s.begin(); i != s.end(); ++i)
+{
+ wxUniChar uni_ch = *i;
+ // do something with it
+}
+@endcode
+
@section overview_string_related String Related Functions and Classes
case-insensitive string comparison function known either as @c stricmp() or
@c strcasecmp() on different platforms.
-The <tt>@<wx/string.h@></tt> header also defines wxSnprintf and wxVsnprintf
+The <tt>@<wx/string.h@></tt> header also defines ::wxSnprintf and ::wxVsnprintf
functions which should be used instead of the inherently dangerous standard
@c sprintf() and which use @c snprintf() instead which does buffer size checks
whenever possible. Of course, you may also use wxString::Printf which is also
wxStrings.
-@section overview_string_refcount Reference Counting and Why You Shouldn't Care
-
-All considerations for wxObject-derived
-@ref overview_refcount "reference counted" objects are valid also for wxString,
-even if it does not derive from wxObject.
-
-Probably the unique case when you might want to think about reference counting
-is when a string character is taken from a string which is not a constant (or
-a constant reference). In this case, due to C++ rules, the "read-only"
-@c operator[] (which is the same as wxString::GetChar()) cannot be chosen and
-the "read/write" @c operator[] (the same as wxString::GetWritableChar()) is
-used instead. As the call to this operator may modify the string, its data is
-unshared (COW is done) and so if the string was really shared there is some
-performance loss (both in terms of speed and memory consumption). In the rare
-cases when this may be important, you might prefer using wxString::GetChar()
-instead of the array subscript operator for this reasons. Please note that
-wxString::at() method has the same problem as the subscript operator in this
-situation and so using it is not really better. Also note that if all string
-arguments to your functions are passed as <tt>const wxString</tt> (see the
-@ref overview_string_advice section) this situation will almost never arise
-because for constant references the correct operator is called automatically.
-
-
@section overview_string_tuning Tuning wxString for Your Application
@note This section is strictly about performance issues and is absolutely not
necessary to read for using wxString class. Please skip it unless you feel
-familiar with profilers and relative tools. If you do read it, please also
-read the preceding section about
-@ref overview_string_refcount "reference counting".
+familiar with profilers and relative tools.
For the performance reasons wxString doesn't allocate exactly the amount of
memory needed for each string. Instead, it adds a small amount of space to each
// delete all vowels from the string
wxString DeleteAllVowels(const wxString& original)
{
+ wxString vowels( "aeuioAEIOU" );
wxString result;
-
- size_t len = original.length();
- for ( size_t n = 0; n < len; n++ )
+ wxString::const_iterator i;
+ for ( i = original.begin(); i != original.end(); ++i )
{
- if ( strchr("aeuio", tolower(original[n])) == NULL )
- result += original[n];
+ if (vowels.Find( *i ) == wxNOT_FOUND)
+ result += *i;
}
return result;
It goes without saying that a profiler should be used to measure the precise
difference the change to @c EXTRA_ALLOC makes to your program.
+
+@section overview_string_settings wxString Related Compilation Settings
+
+Much work has been done to make existing code using ANSI string literals
+work as before version 3.0.
+
+If you nonetheless need to have a wxString that uses @c wchar_t
+on Unix and Linux, too, you can specify this on the command line with the
+@c configure @c --disable-utf8 switch or you can consider using wxUString
+or @c std::wstring instead.
+
+@c wxUSE_UNICODE is now defined as @c 1 by default to indicate Unicode support.
+If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is
+also defined, otherwise @c wxUSE_UNICODE_WCHAR is.
+See also @ref page_wxusedef_important.
+
*/