X-Git-Url: https://git.saurik.com/wxWidgets.git/blobdiff_plain/727aa9062ba6ffa3153069e15df38dca958172d5..b2b31b87fbac61af6eb96f3f5fa960ee4479bb1d:/docs/doxygen/overviews/string.h
diff --git a/docs/doxygen/overviews/string.h b/docs/doxygen/overviews/string.h
index 54513b6bb4..927a208d0e 100644
--- a/docs/doxygen/overviews/string.h
+++ b/docs/doxygen/overviews/string.h
@@ -3,7 +3,7 @@
// Purpose: topic overview
// Author: wxWidgets team
// RCS-ID: $Id$
-// Licence: wxWindows license
+// Licence: wxWindows licence
/////////////////////////////////////////////////////////////////////////////
/**
@@ -14,6 +14,7 @@ Classes: wxString, wxArrayString, wxStringTokenizer
@li @ref overview_string_intro
@li @ref overview_string_internal
+@li @ref overview_string_binary
@li @ref overview_string_comparison
@li @ref overview_string_advice
@li @ref overview_string_related
@@ -27,16 +28,12 @@ Classes: wxString, wxArrayString, wxStringTokenizer
@section overview_string_intro Introduction
wxString is a class which represents a Unicode string of arbitrary length and
-containing arbitrary characters.
-
-The @c NUL character is allowed, but be
-aware that in the current string implementation some methods might not work
-correctly in this case. @todo still true?
+containing arbitrary Unicode characters.
This class has all the standard operations you can expect to find in a string
class: dynamic memory management (string extends to accommodate new
-characters), construction from other strings, C strings, wide character C strings
-and characters, assignment operators, access to individual characters, string
+characters), construction from other strings, compatibility with C strings and
+wide character C strings, assignment operators, access to individual characters, string
concatenation and comparison, substring extraction, case conversion, trimming and
padding (with spaces), searching and replacing and both C-like @c printf (wxString::Printf)
and stream-like insertion functions as well as much more - see wxString for a
@@ -49,28 +46,33 @@ in previous versions.
@section overview_string_internal Internal wxString encoding
-Since wxWidgets 3.0 wxString internally uses UCS-2 (with Unicode
+Since wxWidgets 3.0 wxString internally uses UTF-16 (with Unicode
code units stored in @c wchar_t) under Windows and UTF-8 (with Unicode
code units stored in @c char) under Unix, Linux and Mac OS X to store its content.
For definitions of code units and code points terms, please
see the @ref overview_unicode_encodings paragraph.
-Note that there is a difference about UCS-2 and UTF-16: the first is a fixed-length
-encoding, without surrogate pairs, while the latter is a
-variable-length encoding. Except for this the two encodings are identical.
-
For simplicity of implementation, wxString when wxUSE_UNICODE_WCHAR==1
-(e.g. on Windows) uses UCS-2 and thus doesn't know anything about surrogate pairs;
-it always consider 1 code unit per 1 code point, while this is really true only for
-characters in the @e BMP (Basic Multilingual Plane).
+(e.g. on Windows) uses per-code-unit indexing instead of
+per-code-point indexing and doesn't know anything about surrogate pairs;
+in other words it always treats each code point as exactly 1 code unit,
+while this is true only for characters in the @e BMP (Basic Multilingual Plane).
Thus when iterating over a UTF-16 string stored in a wxString under Windows, the user
-code has to take care of surrogate pair handling himself.
+code has to take care of surrogate pairs itself.
(Note however that Windows itself has built-in support for surrogate pairs in UTF-16,
such as for drawing strings on screen.)
+@remarks
+Note that while the behaviour of wxString when wxUSE_UNICODE_WCHAR==1
+resembles UCS-2 encoding, it's not entirely correct to call wxString
+UCS-2 encoded, since code points outside the @e BMP can be stored in a wxString
+as two code units (i.e. as a surrogate pair); as already mentioned, however,
+wxString will "see" them as two different code points.
+
When instead wxUSE_UNICODE_UTF8==1 (e.g. on Linux and Mac OS X)
-wxString handles UTF8 multi-bytes sequences just fine, so that you can use
+wxString handles UTF8 multi-byte sequences correctly, including characters outside
+the BMP (it implements per-code-point indexing), so that you can use
UTF8 in a completely transparent way:
Example:
@@ -89,7 +91,7 @@ Example:
wxPrintf("wxString reports a length of %d character(s)", test.length());
// prints "wxString reports a length of 1 character(s)" on Linux
// prints "wxString reports a length of 2 character(s)" on Windows
- // since Windows doesn't have surrogate pairs support!
+ // since wxString on Windows doesn't have surrogate pair support!
// second test, this time using characters part of the Unicode BMP:
@@ -113,17 +115,30 @@ above; it's composed by 3 characters and the final @c NULL:
@image html overview_wxstring_encoding.png
-As you can see, UCS2/UTF16 encoding is straightforward (for characters in the @e BMP)
-and in this example the UCS2-encoded wxString takes 8 bytes.
+As you can see, UTF16 encoding is straightforward (for characters in the @e BMP)
+and in this example the UTF16-encoded wxString takes 8 bytes.
UTF8 encoding is more elaborated and in this example takes 7 bytes.
-The type used by wxString to store Unicode code units is called wxStringCharType.
-
In general, for strings containing many latin characters UTF8 provides a big
-advantage in memory footprint respect UTF16, but requires some more processing
-for common operations like e.g. length calculation.
+advantage in memory footprint compared to UTF16, but requires some
+more processing for common operations such as length calculation.
+
+Finally, note that the type used by wxString to store Unicode code units
+(@c wchar_t or @c char) is always available as the @c typedef ::wxStringCharType.
+@section overview_string_binary Using wxString to store binary data
+
+wxString can be used to store binary data (even if it contains @c NULs) using the
+functions wxString::To8BitData and wxString::From8BitData.
+
+Beware that even though @c NUL characters are allowed, in the current string
+implementation some methods might not work correctly with them.
+
+Note however that other classes like wxMemoryBuffer are better suited to this task.
+For handling binary data you may also want to look at the wxStreamBuffer,
+wxMemoryOutputStream and wxMemoryInputStream classes.
+
@section overview_string_comparison Comparison to Other String Classes
@@ -364,11 +379,16 @@ difference the change to @c EXTRA_ALLOC makes to your program.
Much work has been done to make existing code using ANSI string literals
work as before version 3.0.
+
If you nonetheless need to have a wxString that uses @c wchar_t
on Unix and Linux, too, you can specify this on the command line with the
@c configure @c --disable-utf8 switch or you can consider using wxUString
or @c std::wstring instead.
+@c wxUSE_UNICODE is now defined as @c 1 by default to indicate Unicode support.
+If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is
+also defined, otherwise @c wxUSE_UNICODE_WCHAR is.
+See also @ref page_wxusedef_important.
*/