moved many things from wxString reference page to the wxString overview; updated...

author Francesco Montorsi <f18m_cpp217828@yahoo.it>

Sat, 6 Dec 2008 16:24:52 +0000 (16:24 +0000)

committer Francesco Montorsi <f18m_cpp217828@yahoo.it>

Sat, 6 Dec 2008 16:24:52 +0000 (16:24 +0000)
diff --git a/docs/doxygen/images/overview_unicode_codes.dia b/docs/doxygen/images/overview_unicode_codes.dia

new file mode 100644 (file)

index 0000000..e8bd50f

Binary files /dev/null and b/docs/doxygen/images/overview_unicode_codes.dia differ
diff --git a/docs/doxygen/images/overview_unicode_codes.png b/docs/doxygen/images/overview_unicode_codes.png

index c936ea066180c771180244efba90f600226dea11..0da2d8ffa84c0b92b615a8bc6446fcc13d18f386 100644 (file)

Binary files a/docs/doxygen/images/overview_unicode_codes.png and b/docs/doxygen/images/overview_unicode_codes.png differ
diff --git a/docs/doxygen/images/overview_wxstring_encoding.dia b/docs/doxygen/images/overview_wxstring_encoding.dia

new file mode 100644 (file)

index 0000000..4d42a4a

Binary files /dev/null and b/docs/doxygen/images/overview_wxstring_encoding.dia differ
diff --git a/docs/doxygen/images/overview_wxstring_encoding.png b/docs/doxygen/images/overview_wxstring_encoding.png

new file mode 100644 (file)

index 0000000..f81af5d

Binary files /dev/null and b/docs/doxygen/images/overview_wxstring_encoding.png differ
diff --git a/docs/doxygen/overviews/mbconvclasses.h b/docs/doxygen/overviews/mbconvclasses.h

index 4dbb18b64c29bc580604891118704096157daa68..5cec57242181fbfa0d4f282b5eeb48fae53804a6 100644 (file)
--- a/docs/doxygen/overviews/mbconvclasses.h
+++ b/docs/doxygen/overviews/mbconvclasses.h
@@ -51,6 +51,8 @@ unhindered through any traditional transport channels.
  
  @section overview_mbconv_string Background: The wxString Class
  
+@todo rewrite this overview; it's not up2date with wxString changes
+
  If you have compiled wxWidgets in Unicode mode, the wxChar type will become
  identical to wchar_t rather than char, and a wxString stores wxChars. Hence,
  all wxString manipulation in your application will then operate on Unicode
diff --git a/docs/doxygen/overviews/string.h b/docs/doxygen/overviews/string.h

index 42247d28d4df104336c58d3c72072028c02f8947..54513b6bb43c2884c1f6220c072a3a56476935e8 100644 (file)
--- a/docs/doxygen/overviews/string.h
+++ b/docs/doxygen/overviews/string.h
@@ -13,10 +13,12 @@
  Classes: wxString, wxArrayString, wxStringTokenizer
  
  @li @ref overview_string_intro
+@li @ref overview_string_internal
  @li @ref overview_string_comparison
  @li @ref overview_string_advice
  @li @ref overview_string_related
  @li @ref overview_string_tuning
+@li @ref overview_string_settings
  
  
  <hr>
@@ -24,25 +26,104 @@ Classes: wxString, wxArrayString, wxStringTokenizer
  
  @section overview_string_intro Introduction
  
-wxString is a class which represents a character string of arbitrary length and
-containing arbitrary characters. The ASCII NUL character is allowed, but be
-aware that in the current string implementation some methods might not work
-correctly in this case.
+wxString is a class which represents a Unicode string of arbitrary length and
+containing arbitrary characters.
  
  
-Since wxWidgets 3.0 wxString internally uses UCS-2 (basically 2-byte per
-character wchar_t) under Windows and UTF-8 under Unix, Linux and
-OS X to store its content. Much work has been done to make
-existing code using ANSI string literals work as before.
+The @c NUL character is allowed, but be
+aware that in the current string implementation some methods might not work
+correctly in this case. @todo still true?
  
  This class has all the standard operations you can expect to find in a string
  class: dynamic memory management (string extends to accommodate new
-characters), construction from other strings, C strings, wide character C strings 
+characters), construction from other strings, C strings, wide character C strings
  and characters, assignment operators, access to individual characters, string
-concatenation and comparison, substring extraction, case conversion, trimming and padding (with
-spaces), searching and replacing and both C-like @c printf (wxString::Printf)
+concatenation and comparison, substring extraction, case conversion, trimming and
+padding (with spaces), searching and replacing and both C-like @c printf (wxString::Printf)
  and stream-like insertion functions as well as much more - see wxString for a
  list of all functions.
  
+The wxString class has been completely rewritten for wxWidgets 3.0 but much work
+has been done to make existing code using ANSI string literals work as it did
+in previous versions.
+
+
+@section overview_string_internal Internal wxString encoding
+
+Since wxWidgets 3.0 wxString internally uses <b>UCS-2</b> (with Unicode
+code units stored in @c wchar_t) under Windows and <b>UTF-8</b> (with Unicode
+code units stored in @c char) under Unix, Linux and Mac OS X to store its content.
+
+For definitions of the terms <em>code unit</em> and <em>code point</em>, please
+see the @ref overview_unicode_encodings paragraph.
+
+Note that there is a difference between UCS-2 and UTF-16: the former is a fixed-length
+encoding, without <em>surrogate pairs</em>, while the latter is a
+variable-length encoding. Apart from this, the two encodings are identical.
+
+For simplicity of implementation, wxString when <tt>wxUSE_UNICODE_WCHAR==1</tt>
+(e.g. on Windows) uses UCS-2 and thus doesn't know anything about surrogate pairs;
+it always assumes 1 code unit per code point, which is true only for
+characters in the @e BMP (Basic Multilingual Plane).
+Thus when iterating over a UTF-16 string stored in a wxString under Windows, the user
+code has to take care of <em>surrogate pair</em> handling itself.
+(Note however that Windows itself has built-in support for surrogate pairs in UTF-16,
+such as for drawing strings on screen.)
+
+When instead <tt>wxUSE_UNICODE_UTF8==1</tt> (e.g. on Linux and Mac OS X)
+wxString handles UTF8 multi-byte sequences just fine, so that you can use
+UTF8 in a completely transparent way:
+
+Example:
+@code
+    // first test, using exotic characters outside of the Unicode BMP:
+
+    wxString test = wxString::FromUTF8("\xF0\x90\x8C\x80");
+        // U+10300 is "OLD ITALIC LETTER A" and is part of Unicode Plane 1
+        // in UTF8 it's encoded as 0xF0 0x90 0x8C 0x80
+
+    // it's a single Unicode code-point encoded as:
+    // - a UTF16 surrogate pair under Windows
+    // - a UTF8 multiple-bytes sequence under Linux
+    // (without considering the final NULL)
+
+    wxPrintf("wxString reports a length of %d character(s)", test.length());
+        // prints "wxString reports a length of 1 character(s)" on Linux
+        // prints "wxString reports a length of 2 character(s)" on Windows
+        // since wxString under Windows doesn't handle surrogate pairs!
+
+
+    // second test, this time using characters part of the Unicode BMP:
+
+    wxString test2 = wxString::FromUTF8("\x41\xC3\xA0\xE2\x82\xAC");
+        // this is the UTF8 encoding of capital letter A followed by
+        // 'small case letter a with grave' followed by the 'euro sign'
+
+    // they are 3 Unicode code-points encoded as:
+    // - 3 UTF16 code units under Windows
+    // - 6 UTF8 code units under Linux
+    // (without considering the final NULL)
+
+    wxPrintf("wxString reports a length of %d character(s)", test2.length());
+        // prints "wxString reports a length of 3 character(s)" on Linux
+        // prints "wxString reports a length of 3 character(s)" on Windows
+@endcode
+
+To better explain what is stated above, consider the second string of the example
+above; it's composed of 3 characters and the final @c NUL:
+
+@image html overview_wxstring_encoding.png
+
+As you can see, UCS2/UTF16 encoding is straightforward (for characters in the @e BMP)
+and in this example the UCS2-encoded wxString takes 8 bytes.
+UTF8 encoding is more elaborate and in this example takes 7 bytes.
+
+The type used by wxString to store Unicode code units is called wxStringCharType.
+
+In general, for strings containing many latin characters UTF8 provides a big
+advantage in memory footprint compared to UTF16, but requires some more processing
+for common operations such as length calculation.
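The code unit counts in the example above can be reproduced without wxWidgets. What follows is only a sketch in standard C++, assuming well-formed UTF-8 input; the helper names CountCodePoints and CountUtf16Units are hypothetical and are not part of wxString.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a well-formed UTF-8
// string by counting every byte that is not a continuation byte (10xxxxxx).
std::size_t CountCodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char c : utf8)
    {
        if ((c & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// Hypothetical helper: counts the UTF-16 code units needed for the same text.
// A 4-byte UTF-8 sequence (lead byte 11110xxx) encodes a code point outside
// the BMP, which needs a surrogate pair (2 units); everything else needs 1.
std::size_t CountUtf16Units(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char c : utf8)
    {
        assert(c < 0xF8); // bytes 0xF8..0xFF never occur in well-formed UTF-8
        if ((c & 0xC0) == 0x80)
            continue;     // continuation byte, already accounted for
        count += ((c & 0xF8) == 0xF0) ? 2 : 1;
    }
    return count;
}
```

For the Old Italic letter (bytes F0 90 8C 80) these helpers give 1 code point and 2 UTF-16 code units; for the second string they give 3 and 3, which matches the lengths listed in the example.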
+
+
  
  @section overview_string_comparison Comparison to Other String Classes
  
@@ -50,52 +131,53 @@ The advantages of using a special string class instead of working directly with
  C strings are so obvious that there is a huge number of such classes available.
  The most important advantage is the need to always remember to allocate/free
  memory for C strings; working with fixed size buffers almost inevitably leads
-to buffer overflows. At last, C++ has a standard string class (std::string). So
+to buffer overflows. At last, C++ has a standard string class (@c std::string). So
  why the need for wxString? There are several advantages:
  
-@li <b>Efficiency:</b> Since wxWidgets 3.0 wxString uses std::string (UTF8
-    mode under Linux, Unix and OS X) or std::wstring (MSW) internally by
-    default to store its constent. wxString will therefore inherit the
-    performance characteristics from std::string.
+@li <b>Efficiency:</b> Since wxWidgets 3.0 wxString uses @c std::string (in UTF8
+    mode under Linux, Unix and OS X) or @c std::wstring (in UTF16 mode under Windows)
+    internally by default to store its contents. wxString will therefore inherit the
+    performance characteristics from @c std::string.
  @li <b>Compatibility:</b> This class tries to combine almost full compatibility
-    with the old wxWidgets 1.xx wxString class, some reminiscence to MFC
-    CString class and 90% of the functionality of std::string class.
-@li <b>Rich set of functions:</b> Some of the functions present in wxString are very
-    useful but don't exist in most of other string classes: for example,
-    wxString::AfterFirst, wxString::BeforeLast, wxString::operators or
-    wxString::Printf. Of course, all the standard string operations are
-    supported as well.
-@li <b>Unicode wxString is Unicode friendly:</b> it allows to easily convert to
-    and from ANSI and Unicode strings (see the @ref overview_unicode "unicode overview"
-    for more details) and maps to @c wstring transparently.
+    with the old wxWidgets 1.xx wxString class, some reminiscence of MFC's
+    CString class and 90% of the functionality of @c std::string class.
+@li <b>Rich set of functions:</b> Some of the functions present in wxString are
+    very useful but don't exist in most other string classes: for example,
+    wxString::AfterFirst, wxString::BeforeLast, wxString::Printf.
+    Of course, all the standard string operations are supported as well.
+@li <b>wxString is Unicode friendly:</b> it allows easy conversion to
+    and from ANSI and Unicode strings (see @ref overview_unicode
+    for more details) and maps to @c std::wstring transparently.
  @li <b>Used by wxWidgets:</b> And, of course, this class is used everywhere
      inside wxWidgets so there is no performance loss which would result from
-    conversions of objects of any other string class (including std::string) to
+    conversions of objects of any other string class (including @c std::string) to
      wxString internally by wxWidgets.
  
  However, there are several problems as well. The most important one is probably
  that there are often several functions to do exactly the same thing: for
  example, to get the length of the string either one of wxString::length(),
  wxString::Len() or wxString::Length() may be used. The first function, as
-almost all the other functions in lowercase, is std::string compatible. The
+almost all the other functions in lowercase, is @c std::string compatible. The
  second one is the "native" wxString version and the last one is the wxWidgets
  1.xx way.
  
-So which is better to use? The usage of the std::string compatible functions is
+So which is better to use? The usage of the @c std::string compatible functions is
  strongly advised! It will both make your code more familiar to other C++
-programmers (who are supposed to have knowledge of std::string but not of
+programmers (who are supposed to have knowledge of @c std::string but not of
  wxString), let you reuse the same code in both wxWidgets and other programs (by
-just typedefing wxString as std::string when used outside wxWidgets) and by
+just typedefing wxString as @c std::string when used outside wxWidgets) and by
  staying compatible with future versions of wxWidgets which will probably start
-using std::string sooner or later too.
+using @c std::string sooner or later too.
  
  
-In the situations where there is no corresponding std::string function, please
+In the situations where there is no corresponding @c std::string function, please
  try to use the new wxString methods and not the old wxWidgets 1.xx variants
  which are deprecated and may disappear in future versions.
  
  
  @section overview_string_advice Advice About Using wxString
  
+@subsection overview_string_implicitconv Implicit conversions
+
  Probably the main trap with using this class is the implicit conversion
  operator to <tt>const char*</tt>. It is advised that you use wxString::c_str()
  instead to clearly indicate when the conversion is done. Specifically, the
@@ -124,8 +206,8 @@ because the argument of @c puts() is known to be of the type
  <tt>const char*</tt>, this is @b not done for @c printf() which is a function
  with variable number of arguments (and whose arguments are of unknown types).
  So this call may do any number of things (including displaying the correct
-string on screen), although the most likely result is a program crash. The
-solution is to use wxString::c_str(). Just replace this line with this:
+string on screen), although the most likely result is a program crash.
+The solution is to use wxString::c_str(). Just replace this line with this:
  
  @code
  printf("Hello, %s!\n", output.c_str());
@@ -138,10 +220,43 @@ its contents are completely arbitrary. The solution to this problem is also
  easy, just make the function return wxString instead of a C string.
  
  This leads us to the following general advice: all functions taking string
-arguments should take <tt>const wxString</tt> (this makes assignment to the
+arguments should take <tt>const wxString&</tt> (this makes assignment to the
  strings inside the function faster) and all functions returning strings
  should return wxString - this makes it safe to return local variables.
  
+Finally note that wxString uses the current locale encoding to convert any C string
+literal to Unicode. The same is done for converting to and from @c std::string
+and for the return value of c_str().
+For this conversion, the @a wxConvLibc class instance is used.
+See wxCSConv and wxMBConv.
+
+
+@subsection overview_string_iterating Iterating wxString's characters
+
+As previously described, when <tt>wxUSE_UNICODE_UTF8==1</tt>, wxString internally
+uses the variable-length UTF8 encoding.
+Accessing a UTF-8 string by index can be very @b inefficient because
+a single character is represented by a variable number of bytes so that
+the entire string has to be parsed in order to find the character.
+Since iterating over a string by index is a common programming technique and
+was also possible and encouraged by wxString using the access operator[](),
+wxString implements caching of the last used index so that iterating over
+a string is a linear operation even in UTF-8 mode.
+
+It is nonetheless recommended to use @b iterators (instead of index based
+access) like this:
+
+@code
+wxString s = "hello";
+wxString::const_iterator i;
+for (i = s.begin(); i != s.end(); ++i)
+{
+    wxUniChar uni_ch = *i;
+    // do something with it
+}
+@endcode
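To see why plain index access into UTF-8 data is costly without such a cache, consider the rough sketch below. It is written in standard C++ and assumes well-formed UTF-8; SequenceLength and OffsetOfCodePoint are hypothetical names used for illustration only and do not reflect how wxString is actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: byte length of the UTF-8 sequence that starts with
// the given lead byte (assumes the byte really is a valid lead byte).
std::size_t SequenceLength(unsigned char lead)
{
    if (lead < 0x80)
        return 1;                   // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0)
        return 2;                   // 110xxxxx
    if ((lead & 0xF0) == 0xE0)
        return 3;                   // 1110xxxx
    assert((lead & 0xF8) == 0xF0);  // 11110xxx
    return 4;
}

// Hypothetical helper: byte offset of the code point at position `index`.
// Every preceding sequence has to be skipped one by one, so a single lookup
// costs time proportional to `index`.
std::size_t OffsetOfCodePoint(const std::string& utf8, std::size_t index)
{
    std::size_t offset = 0;
    while (index-- > 0)
    {
        assert(offset < utf8.size());
        offset += SequenceLength(static_cast<unsigned char>(utf8[offset]));
    }
    return offset;
}
```

Calling such a lookup for every index in a loop restarts the walk each time, which makes the whole loop quadratic. An iterator, or a cached last position, only ever advances by one sequence per step.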
+
+
  
  @section overview_string_related String Related Functions and Classes
  
@@ -158,7 +273,7 @@ these problems: wxIsEmpty() verifies whether the string is empty (returning
  case-insensitive string comparison function known either as @c stricmp() or
  @c strcasecmp() on different platforms.
  
-The <tt>@<wx/string.h@></tt> header also defines wxSnprintf and wxVsnprintf
+The <tt>@<wx/string.h@></tt> header also defines ::wxSnprintf and ::wxVsnprintf
  functions which should be used instead of the inherently dangerous standard
  @c sprintf() and which use @c snprintf() instead which does buffer size checks
  whenever possible. Of course, you may also use wxString::Printf which is also
@@ -180,7 +295,7 @@ wxStrings.
  
  @note This section is strictly about performance issues and is absolutely not
  necessary to read for using wxString class. Please skip it unless you feel
-familiar with profilers and relative tools. 
+familiar with profilers and relative tools.
  
  For the performance reasons wxString doesn't allocate exactly the amount of
  memory needed for each string. Instead, it adds a small amount of space to each
@@ -244,5 +359,16 @@ really consider fine tuning wxString for your application).
  It goes without saying that a profiler should be used to measure the precise
  difference the change to @c EXTRA_ALLOC makes to your program.
  
+
+@section overview_string_settings wxString Related Compilation Settings
+
+Much work has been done to make existing code using ANSI string literals
+work as it did before version 3.0.
+If you nonetheless need to have a wxString that uses @c wchar_t
+on Unix and Linux, too, you can specify this on the command line with the
+@c configure @c --disable-utf8 switch or you can consider using wxUString
+or @c std::wstring instead.
+
+
  */
  
diff --git a/docs/doxygen/overviews/unicode.h b/docs/doxygen/overviews/unicode.h

index e372007b1ed0a836e050e3aa0ee1e82c1ac2b66f..e50454a1cd78126c0daedefd384902d31415162e 100644 (file)
--- a/docs/doxygen/overviews/unicode.h
+++ b/docs/doxygen/overviews/unicode.h
@@ -49,30 +49,34 @@ other services should be ready to deal with Unicode.
  
  When working with Unicode, it's important to define the meaning of some terms.
  
-A @e glyph is a particular image that represents a @e character or part of a character.
+A <b><em>glyph</em></b> is a particular image that represents a character or part
+of a character.
  Any character may have one or more glyph associated; e.g. some of the possible
  glyphs for the capital letter 'A' are:
  
  @image html overview_unicode_glyphs.png
  
  Unicode assigns each character of almost any existing alphabet/script a number,
-which is called <em>code point</em>; it's typically indicated in documentation
+which is called <b><em>code point</em></b>; it's typically indicated in documentation
  manuals and in the Unicode website as @c U+xxxx where @c xxxx is a hexadecimal number.
  
  The Unicode standard divides the space of all possible code points into @e planes;
  a plane is a range of 65,536 (10000 in hexadecimal) contiguous Unicode code points.
  Planes are numbered from 0 to 16, where the first one is the @e BMP, or Basic
  Multilingual Plane.
+The BMP contains characters for all modern languages, and a large number of
+special characters. The other planes in fact contain mainly historic scripts,
+special-purpose characters or are unused.
  
  Code points are represented in computer memory as a sequence of one or more
-<em>code units</em>, where a code unit is a unit of memory: 8, 16, or 32 bits.
+<b><em>code units</em></b>, where a code unit is a unit of memory: 8, 16, or 32 bits.
  More precisely, a code unit is the minimal bit combination that can represent a
  unit of encoded text for processing or interchange.
  
  The @e UTF or Unicode Transformation Formats are algorithms mapping the Unicode
  code points to code unit sequences. The simplest of them is <b>UTF-32</b> where
-each code unit is composed by 32 bits (4 bytes) and each code point is represented
-by a single code unit.
+each code unit is composed of 32 bits (4 bytes) and each code point is always
+represented by a single code unit (fixed length encoding).
  (Note that even UTF-32 is still not completely trivial as the mapping is different
  for little and big-endian architectures). UTF-32 is commonly used under Unix systems for
  internal representation of Unicode strings.
@@ -81,6 +85,7 @@ Another very widespread standard is <b>UTF-16</b> which is used by Microsoft Win
  it encodes the first (approximately) 64 thousands of Unicode code points
  (the BMP plane) using 16-bit code units (2 bytes) and uses a pair of 16-bit code
  units to encode the characters beyond this. These pairs are called @e surrogate.
+Thus UTF16 uses a variable number of code units to encode each code point.
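The surrogate pair arithmetic is short enough to sketch. The following standard C++ fragment follows the UTF-16 rule of splitting the 20-bit value (code point minus 0x10000) into a high and a low half; the function names ToSurrogatePair and FromSurrogatePair are hypothetical and chosen for illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper: splits a code point beyond the BMP into its UTF-16
// surrogate pair (high surrogate first, low surrogate second).
std::pair<std::uint16_t, std::uint16_t> ToSurrogatePair(std::uint32_t codePoint)
{
    assert(codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    const std::uint32_t v = codePoint - 0x10000; // 20 significant bits
    const std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (v >> 10));
    const std::uint16_t low = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));
    return {high, low};
}

// Hypothetical helper: recombines a surrogate pair into the code point.
std::uint32_t FromSurrogatePair(std::uint16_t high, std::uint16_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000
         + ((static_cast<std::uint32_t>(high) - 0xD800) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00);
}
```

For instance, U+10300 becomes the pair 0xD800 0xDF00, while any code point in the BMP is stored as a single 16-bit unit and never goes through this split.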
  
  Finally, the most widespread encoding used for the external Unicode storage
  (e.g. files and network protocols) is <b>UTF-8</b> which is byte-oriented and so
@@ -107,7 +112,7 @@ Typically when UTF8 is used, code units are stored into @c char types, since
  @c char are 8bit wide on almost all systems; when using UTF16 typically code
  units are stored into @c wchar_t types since @c wchar_t is at least 16bits on
  all systems. This is also the approach used by wxString.
-See @ref overview_wxstring for more info.
+See @ref overview_string for more info.
  
  See also http://unicode.org/glossary/ for the official definitions of the
  terms reported above.
@@ -123,8 +128,8 @@ programs require the Microsoft Layer for Unicode to run on Windows 95/98/ME.
  
  However, unlike the Unicode build mode of the previous versions of wxWidgets, this
  support is mostly transparent: you can still continue to work with the @b narrow
-(i.e. current-locale-encoded @c char*) strings even if @b wide
-(i.e. UTF16/UCS2-encoded @c wchar_t* or UTF8-encoded @c char) strings are also
+(i.e. current locale-encoded @c char*) strings even if @b wide
+(i.e. UTF16/UCS2-encoded @c wchar_t* or UTF8-encoded @c char*) strings are also
  supported. Any wxWidgets function accepts arguments of either type as both
  kinds of strings are implicitly converted to wxString, so both
  @code
@@ -132,7 +137,7 @@ wxMessageBox("Hello, world!");
  @endcode
  and the somewhat less usual
  @code
-wxMessageBox(L"Salut \u00e0 toi!"); // 00E0 is "Latin Small Letter a with Grave"
+wxMessageBox(L"Salut \u00E0 toi!"); // U+00E0 is "Latin Small Letter a with Grave"
  @endcode
  work as expected.
  
@@ -147,9 +152,10 @@ in the case of gcc). In particular, the most common encoding used under
  modern Unix systems is UTF-8 and as the string above is not a valid UTF-8 byte
  sequence, nothing would be displayed at all in this case. Thus it is important
  to <b>never use 8-bit (instead of 7-bit) characters directly in the program source</b>
-but use wide strings or, alternatively, write
+but use wide strings or, alternatively, write:
  @code
-wxMessageBox(wxString::FromUTF8("Salut \xc3\xa0 toi!"));
+wxMessageBox(wxString::FromUTF8("Salut \xC3\xA0 toi!"));
+    // in UTF8 the character U+00E0 is encoded as the two bytes 0xC3 0xA0
  @endcode
  
  In a similar way, wxString provides access to its contents as either @c wchar_t or
@@ -327,6 +333,7 @@ different encoding of it. So you need to be able to convert the data to various
  representations and the wxString methods wxString::ToAscii(), wxString::ToUTF8()
  (or its synonym wxString::utf8_str()), wxString::mb_str(), wxString::c_str() and
  wxString::wc_str() can be used for this.
+
  The first of them should be only used for the string containing 7-bit ASCII characters
  only, anything else will be replaced by some substitution character.
  wxString::mb_str() converts the string to the encoding used by the current locale
diff --git a/interface/wx/string.h b/interface/wx/string.h

index ed2afa915ff5578b2c143b45d8780d84171180dc..7d57b228a63521288a179db8e5e4d8912a4730fd 100644 (file)
--- a/interface/wx/string.h
+++ b/interface/wx/string.h
@@ -6,59 +6,6 @@
  // Licence:     wxWindows license
  /////////////////////////////////////////////////////////////////////////////
  
-/**
-    @class wxStringBuffer
-
-    This tiny class allows you to conveniently access the wxString internal buffer
-    as a writable pointer without any risk of forgetting to restore the string
-    to the usable state later.
-
-    For example, assuming you have a low-level OS function called
-    @c "GetMeaningOfLifeAsString(char *)" returning the value in the provided
-    buffer (which must be writable, of course) you might call it like this:
-
-    @code
-        wxString theAnswer;
-        GetMeaningOfLifeAsString(wxStringBuffer(theAnswer, 1024));
-        if ( theAnswer != "42" )
-            wxLogError("Something is very wrong!");
-    @endcode
-
-    Note that the exact usage of this depends on whether or not wxUSE_STL is
-    enabled. If wxUSE_STL is enabled, wxStringBuffer creates a separate empty
-    character buffer, and if wxUSE_STL is disabled, it uses GetWriteBuf() from
-    wxString, keeping the same buffer wxString uses intact. In other words,
-    relying on wxStringBuffer containing the old wxString data is not a good
-    idea if you want to build your program both with and without wxUSE_STL.
-
-    @library{wxbase}
-    @category{data}
-*/
-class wxStringBuffer
-{
-public:
-    /**
-        Constructs a writable string buffer object associated with the given string
-        and containing enough space for at least @a len characters.
-        Basically, this is equivalent to calling wxString::GetWriteBuf() and
-        saving the result.
-    */
-    wxStringBuffer(const wxString& str, size_t len);
-
-    /**
-        Restores the string passed to the constructor to the usable state by calling
-        wxString::UngetWriteBuf() on it.
-    */
-    ~wxStringBuffer();
-
-    /**
-        Returns the writable pointer to a buffer of the size at least equal to the
-        length specified in the constructor.
-    */
-    wxStringCharType* operator wxStringCharType *();
-};
-
-
  
  /**
      @class wxString
@@ -68,66 +15,29 @@ public:
      version wxWidgets 3.0.
  
      wxString is a class representing a Unicode character string.
-    wxString uses @c std::string internally to store its content
-    unless this is not supported by the compiler or disabled
-    specifically when building wxWidgets and it therefore inherits
-    many features from @c std::string. Most implementations of
-    @c std::string are thread-safe and don't use reference counting.
-    By default, wxString uses @c std::string internally even if
-    wxUSE_STL is not defined.
-
-    wxString now internally uses UTF-16 under Windows and UTF-8 under
-    Unix, Linux and OS X to store its content. Note that when iterating
-    over a UTF-16 string under Windows, the user code has to take care
-    of surrogate pair handling whereas Windows itself has built-in
-    support pairs in UTF-16, such as for drawing strings on screen.
-
-    Much work has been done to make existing code using ANSI string literals
-    work as before. If you nonetheless need to have a wxString that uses wchar_t
-    on Unix and Linux, too, you can specify this on the command line with the
-    @c configure @c --disable-utf8 switch or you can consider using wxUString
-    or std::wstring instead.
-
-    Accessing a UTF-8 string by index can be very inefficient because
-    a single character is represented by a variable number of bytes so that
-    the entire string has to be parsed in order to find the character.
-    Since iterating over a string by index is a common programming technique and
-    was also possible and encouraged by wxString using the access operator[]()
-    wxString implements caching of the last used index so that iterating over
-    a string is a linear operation even in UTF-8 mode.
-
-    It is nonetheless recommended to use iterators (instead of index based
-    access) like this:
-
-    @code
-    wxString s = "hello";
-    wxString::const_iterator i;
-    for (i = s.begin(); i != s.end(); ++i)
-    {
-        wxUniChar uni_ch = *i;
-        // do something with it
-    }
-    @endcode
-
-    Please see the @ref overview_string and the @ref overview_unicode for more
-    information about it.
-
-    wxString uses the current locale encoding to convert any C string
-    literal to Unicode. The same is done for converting to and from
-    @c std::string and for the return value of c_str().
-    For this conversion, the @a wxConvLibc class instance is used.
-    See wxCSConv and wxMBConv.
-
-    wxString implements most of the methods of the @c std::string class.
-    These standard functions are only listed here, but they are not
-    fully documented in this manual. Please see the STL documentation.
+    wxString uses @c std::basic_string internally to store its content, even if
+    @c wxUSE_STL is not defined, unless this is not supported by the compiler or
+    has been explicitly disabled when building wxWidgets. It therefore inherits
+    many features from @c std::basic_string. (Note that most implementations of
+    @c std::basic_string are thread-safe and don't use reference counting.)
+
+    The standard functions inherited from @c std::basic_string are only listed
+    here and are not fully documented in this manual; see the STL documentation
+    (http://www.cppreference.com/wiki/string/start) for more info.
      The behaviour of all these functions is identical to the behaviour
      described there.
  
      You may notice that wxString sometimes has several functions which do
-    the same thing like Length(), Len() and length() which
-    all return the string length. In all cases of such duplication the
-    @c std::string compatible method should be used.
+    the same thing, such as Length(), Len() and length(), which all return the
+    string length. In all such cases of duplication the @c std::string
+    compatible methods should be used.
+
+    For information about the internal encoding used by wxString and
+    for important warnings and advice on using it, please read
+    the @ref overview_string.
+
+    In wxWidgets 3.0 wxString always stores Unicode strings, so you should
+    also be sure to read the @ref overview_unicode.
  
  
      @section string_construct Constructors and assignment operators
@@ -229,6 +139,7 @@ public:
      original string is not modified and the function returns the extracted
      substring.
  
+    @li at()
      @li substr()
      @li Mid()
      @li operator()()
@@ -1344,14 +1255,6 @@ public:
          STL reference for their documentation.
      */
      //@{
-        size_t length() const;
-        size_type size() const;
-        size_type max_size() const;
-        size_type capacity() const;
-        void reserve(size_t sz);
-
-        void resize(size_t nSize, wxUniChar ch = '\0');
-
          wxString& append(const wxString& str, size_t pos, size_t n);
          wxString& append(const wxString& str);
          wxString& append(const char *sz, size_t n);
@@ -1366,8 +1269,13 @@ public:
          wxString& assign(size_t n, wxUniChar ch);
          wxString& assign(const_iterator first, const_iterator last);
  
+        wxUniChar at(size_t n) const;
+        wxUniCharRef at(size_t n);
+
          void clear();
  
+        size_type capacity() const;
+
          int compare(const wxString& str) const;
          int compare(size_t nStart, size_t nLen, const wxString& str) const;
          int compare(size_t nStart, size_t nLen,
@@ -1377,6 +1285,8 @@ public:
          int compare(size_t nStart, size_t nLen,
                const wchar_t* sz, size_t nCount = npos) const;
  
+        wxCStrData data() const;
+
          bool empty() const;
  
          wxString& erase(size_type pos = 0, size_type n = npos);
@@ -1387,6 +1297,28 @@ public:
          size_t find(const char* sz, size_t nStart = 0, size_t n = npos) const;
          size_t find(const wchar_t* sz, size_t nStart = 0, size_t n = npos) const;
          size_t find(wxUniChar ch, size_t nStart = 0) const;
+        size_t find_first_of(const char* sz, size_t nStart = 0) const;
+        size_t find_first_of(const wchar_t* sz, size_t nStart = 0) const;
+        size_t find_first_of(const char* sz, size_t nStart, size_t n) const;
+        size_t find_first_of(const wchar_t* sz, size_t nStart, size_t n) const;
+        size_t find_first_of(wxUniChar c, size_t nStart = 0) const;
+        size_t find_last_of(const wxString& str, size_t nStart = npos) const;
+        size_t find_last_of(const char* sz, size_t nStart = npos) const;
+        size_t find_last_of(const wchar_t* sz, size_t nStart = npos) const;
+        size_t find_last_of(const char* sz, size_t nStart, size_t n) const;
+        size_t find_last_of(const wchar_t* sz, size_t nStart, size_t n) const;
+        size_t find_last_of(wxUniChar c, size_t nStart = npos) const;
+        size_t find_first_not_of(const wxString& str, size_t nStart = 0) const;
+        size_t find_first_not_of(const char* sz, size_t nStart = 0) const;
+        size_t find_first_not_of(const wchar_t* sz, size_t nStart = 0) const;
+        size_t find_first_not_of(const char* sz, size_t nStart, size_t n) const;
+        size_t find_first_not_of(const wchar_t* sz, size_t nStart, size_t n) const;
+        size_t find_first_not_of(wxUniChar ch, size_t nStart = 0) const;
+        size_t find_last_not_of(const wxString& str, size_t nStart = npos) const;
+        size_t find_last_not_of(const char* sz, size_t nStart = npos) const;
+        size_t find_last_not_of(const wchar_t* sz, size_t nStart = npos) const;
+        size_t find_last_not_of(const char* sz, size_t nStart, size_t n) const;
+        size_t find_last_not_of(const wchar_t* sz, size_t nStart, size_t n) const;
  
          wxString& insert(size_t nPos, const wxString& str);
          wxString& insert(size_t nPos, const wxString& str, size_t nStart, size_t n);
@@ -1397,6 +1329,13 @@ public:
          void insert(iterator it, const_iterator first, const_iterator last);
          void insert(iterator it, size_type n, wxUniChar ch);
  
+        size_t length() const;
+
+        size_type max_size() const;
+
+        void reserve(size_t sz);
+        void resize(size_t nSize, wxUniChar ch = '\0');
+
          wxString& replace(size_t nStart, size_t nLen, const wxString& str);
          wxString& replace(size_t nStart, size_t nLen, size_t nCount, wxUniChar ch);
          wxString& replace(size_t nStart, size_t nLen,
@@ -1423,12 +1362,10 @@ public:
          size_t rfind(const wchar_t* sz, size_t nStart = npos, size_t n = npos) const;
          size_t rfind(wxUniChar ch, size_t nStart = npos) const;
  
+        size_type size() const;
          wxString substr(size_t nStart = 0, size_t nLen = npos) const;
-
          void swap(wxString& str);
-
      //@}
-
  };
  
  /**
@@ -1510,3 +1447,55 @@ public:
      wxChar* operator wxChar *();
  };
  
+
+/**
+    @class wxStringBuffer
+
+    This tiny class allows you to conveniently access the wxString internal buffer
+    as a writable pointer without any risk of forgetting to restore the string
+    to the usable state later.
+
+    For example, assuming you have a low-level OS function called
+    @c "GetMeaningOfLifeAsString(char *)" returning the value in the provided
+    buffer (which must be writable, of course) you might call it like this:
+
+    @code
+        wxString theAnswer;
+        GetMeaningOfLifeAsString(wxStringBuffer(theAnswer, 1024));
+        if ( theAnswer != "42" )
+            wxLogError("Something is very wrong!");
+    @endcode
+
+    Note that the exact usage of this depends on whether or not @c wxUSE_STL is
+    enabled. If @c wxUSE_STL is enabled, wxStringBuffer creates a separate empty
+    character buffer, and if @c wxUSE_STL is disabled, it uses GetWriteBuf() from
+    wxString, keeping the same buffer wxString uses intact. In other words,
+    relying on wxStringBuffer containing the old wxString data is not a good
+    idea if you want to build your program both with and without @c wxUSE_STL.
+
+    @library{wxbase}
+    @category{data}
+*/
+class wxStringBuffer
+{
+public:
+    /**
+        Constructs a writable string buffer object associated with the given string
+        and containing enough space for at least @a len characters.
+        Basically, this is equivalent to calling wxString::GetWriteBuf() and
+        saving the result.
+    */
+    wxStringBuffer(const wxString& str, size_t len);
+
+    /**
+        Restores the string passed to the constructor to the usable state by calling
+        wxString::UngetWriteBuf() on it.
+    */
+    ~wxStringBuffer();
+
+    /**
+        Returns the writable pointer to a buffer of the size at least equal to the
+        length specified in the constructor.
+    */
+    wxStringCharType* operator wxStringCharType *();
+};
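Review note on the wxStringBuffer documentation moved in the last hunk: the pattern it describes (hand out a writable buffer, then restore the string in the destructor) can be sketched with the standard library alone. The class below is a hypothetical stand-in named `ScopedWriteBuffer`, working on `std::string`; it is not the wxWidgets implementation. It follows the behaviour the text attributes to the `wxUSE_STL` build, where a separate empty buffer is created and the old string contents are not visible through it. `GetMeaningOfLifeAsString` is likewise a dummy defined here only so the sketch is self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Dummy stand-in for the low-level C function named in the documentation.
void GetMeaningOfLifeAsString(char* buf)
{
    std::strcpy(buf, "42");
}

// Hypothetical illustration of the RAII idea, not wxStringBuffer itself.
class ScopedWriteBuffer
{
public:
    // Allocates a separate zero-filled buffer with room for len characters
    // plus a terminating NUL; the target's old contents are not copied in.
    ScopedWriteBuffer(std::string& target, std::size_t len)
        : m_target(target), m_buf(len + 1, '\0')
    {
    }

    // Copies the NUL-terminated buffer contents back into the target string.
    // (A sketch: allocation failure here is not handled.)
    ~ScopedWriteBuffer()
    {
        m_buf.back() = '\0'; // guarantee termination even if the callee filled the buffer
        m_target.assign(m_buf.data());
    }

    ScopedWriteBuffer(const ScopedWriteBuffer&) = delete;
    ScopedWriteBuffer& operator=(const ScopedWriteBuffer&) = delete;

    // Writable pointer to the temporary buffer.
    operator char*() { return m_buf.data(); }

private:
    std::string& m_target;
    std::vector<char> m_buf;
};

std::string AskTheQuestion()
{
    std::string theAnswer = "old contents";
    // The temporary is destroyed at the end of this statement, which is when
    // theAnswer receives the new value.
    GetMeaningOfLifeAsString(ScopedWriteBuffer(theAnswer, 1024));
    return theAnswer;
}
```

Because the sketch starts from an empty buffer, code that expects to find the previous string contents inside the buffer would break, which is the portability caveat the documentation raises for builds with and without `wxUSE_STL`.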