git-svn-id: https://svn.wxwidgets.org/svn/wx/wxWidgets/trunk@57204 c3d73ce0-8a6f-49c7-b76d-6d57e0e08775
@li @ref overview_string_intro
@li @ref overview_string_internal
+@li @ref overview_string_binary
@li @ref overview_string_comparison
@li @ref overview_string_advice
@li @ref overview_string_related
@section overview_string_intro Introduction
wxString is a class which represents a Unicode string of arbitrary length and
-containing arbitrary characters.
-
-The @c NUL character is allowed, but be
-aware that in the current string implementation some methods might not work
-correctly in this case. @todo still true?
+containing arbitrary Unicode characters.
This class has all the standard operations you can expect to find in a string
class: dynamic memory management (string extends to accommodate new
-characters), construction from other strings, C strings, wide character C strings
-and characters, assignment operators, access to individual characters, string
+characters), construction from other strings, compatibility with C strings and
+wide character C strings, assignment operators, access to individual characters, string
concatenation and comparison, substring extraction, case conversion, trimming and
padding (with spaces), searching and replacing and both C-like @c printf (wxString::Printf)
and stream-like insertion functions as well as much more - see wxString for a
@section overview_string_internal Internal wxString encoding
-Since wxWidgets 3.0 wxString internally uses <b>UCS-2</b> (with Unicode
+Since wxWidgets 3.0 wxString internally uses <b>UTF-16</b> (with Unicode
code units stored in @c wchar_t) under Windows and <b>UTF-8</b> (with Unicode
code units stored in @c char) under Unix, Linux and Mac OS X to store its content.
For definitions of the terms <em>code units</em> and <em>code points</em>, please
see the @ref overview_unicode_encodings paragraph.
-Note that there is a difference about UCS-2 and UTF-16: the first is a fixed-length
-encoding, without <em>surrogate pairs</em>, while the latter is a
-variable-length encoding. Except for this the two encodings are identical.
-
For simplicity of implementation, wxString when <tt>wxUSE_UNICODE_WCHAR==1</tt>
-(e.g. on Windows) uses UCS-2 and thus doesn't know anything about surrogate pairs;
-it always consider 1 code unit per 1 code point, while this is really true only for
-characters in the @e BMP (Basic Multilingual Plane).
+(e.g. on Windows) uses <em>per code unit indexing</em> instead of
+<em>per code point indexing</em> and doesn't know anything about surrogate pairs;
+in other words it always considers code points to be composed of 1 code unit,
+while this is really true only for characters in the @e BMP (Basic Multilingual Plane).
Thus when iterating over a UTF-16 string stored in a wxString under Windows, the user
-code has to take care of <em>surrogate pair</em> handling himself.
+code has to take care of <em>surrogate pairs</em> itself.
(Note however that Windows itself has built-in support for surrogate pairs in UTF-16,
such as for drawing strings on screen.)
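As an illustration of what "taking care of surrogate pairs" can mean in user code, here is a minimal sketch that counts code points in a UTF-16 sequence. It uses only the C++ standard library and @c std::u16string rather than wxString itself; the helper names (@c IsHighSurrogate, @c IsLowSurrogate, @c CountCodePointsUtf16) are hypothetical and not part of wxWidgets.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true for a high (leading) surrogate, U+D800..U+DBFF.
inline bool IsHighSurrogate(char16_t unit)
{
    return unit >= 0xD800 && unit <= 0xDBFF;
}

// Hypothetical helper: true for a low (trailing) surrogate, U+DC00..U+DFFF.
inline bool IsLowSurrogate(char16_t unit)
{
    return unit >= 0xDC00 && unit <= 0xDFFF;
}

// Counts code points in a UTF-16 sequence. A well-formed surrogate pair
// counts as one code point; an unpaired surrogate counts as one on its own.
std::size_t CountCodePointsUtf16(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        if (IsHighSurrogate(s[i]) && i + 1 < s.size() && IsLowSurrogate(s[i + 1]))
            ++i; // skip the trailing half of the pair
        ++count;
    }
    return count;
}
```

For a string holding the single character U+1D11E, the code unit count is 2 while this function reports 1 code point, which mirrors the difference between per code unit and per code point indexing described above.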
+@remarks
+Note that while the behaviour of wxString when <tt>wxUSE_UNICODE_WCHAR==1</tt>
+resembles UCS-2 encoding, it's not completely correct to refer to wxString as
+UCS-2 encoded since you can encode characters outside the @e BMP in a wxString.
+
When instead <tt>wxUSE_UNICODE_UTF8==1</tt> (e.g. on Linux and Mac OS X)
-wxString handles UTF8 multi-bytes sequences just fine, so that you can use
+wxString handles UTF8 multi-byte sequences correctly, including those for characters
+outside the BMP (it implements <em>per code point indexing</em>), so that you can use
UTF8 in a completely transparent way:
Example:
wxPrintf("wxString reports a length of %d character(s)", test.length());
// prints "wxString reports a length of 1 character(s)" on Linux
// prints "wxString reports a length of 2 character(s)" on Windows
- // since Windows doesn't have surrogate pairs support!
+ // since wxString on Windows doesn't have surrogate pairs support!
// second test, this time using characters part of the Unicode BMP:
@image html overview_wxstring_encoding.png
-As you can see, UCS2/UTF16 encoding is straightforward (for characters in the @e BMP)
-and in this example the UCS2-encoded wxString takes 8 bytes.
+As you can see, UTF16 encoding is straightforward (for characters in the @e BMP)
+and in this example the UTF16-encoded wxString takes 8 bytes.
UTF8 encoding is more elaborate and in this example takes 7 bytes.
-The type used by wxString to store Unicode code units is called wxStringCharType.
-
In general, for strings containing many latin characters UTF8 provides a big
-advantage in memory footprint respect UTF16, but requires some more processing
-for common operations like e.g. length calculation.
+advantage over UTF16 with regard to memory footprint, but requires some
+more processing for common operations such as length calculation.
+
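To make the extra processing concrete, the following sketch shows why computing a length in code points over UTF-8 storage needs a pass over the bytes. It is standard C++ only and does not reflect how wxString is actually implemented; @c CountCodePointsUtf8 is a hypothetical name, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 byte sequence (assumed well-formed) by
// counting every byte that is not a continuation byte of the form 10xxxxxx.
std::size_t CountCodePointsUtf8(const std::string& bytes)
{
    std::size_t count = 0;
    for (unsigned char b : bytes)
    {
        if ((b & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

The byte count is available in constant time from the container, whereas the code point count above is linear in the length of the string.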
+Finally, note that the type used by wxString to store Unicode code units
+(@c wchar_t or @c char) is always @c typedef-ined to be ::wxStringCharType.
+@section overview_string_binary Using wxString to store binary data
+
+wxString can be used to store binary data (even if it contains @c NULs) using the
+functions wxString::To8BitData and wxString::From8BitData.
+
+Beware that even though @c NUL characters are allowed, in the current string implementation
+some methods might not work correctly with them.
+
+Note however that other classes like wxMemoryBuffer are more suited to this task.
+For handling binary data you may also want to look at the wxStreamBuffer,
+wxMemoryOutputStream, wxMemoryInputStream classes.
+
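The kind of problem embedded @c NULs can cause is easy to reproduce with the C++ standard library alone. The sketch below uses @c std::string as a stand-in (it is not wxString code, and the function names are hypothetical): the container keeps all bytes, but any code path that goes through a C-style, NUL-terminated view sees a truncated buffer.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Builds the 5-byte buffer "ab\0cd". The explicit length is required,
// otherwise construction would stop at the embedded NUL.
std::string MakeBinarySample()
{
    return std::string("ab\0cd", 5);
}

// Length as seen by C-style functions, which stop at the first NUL.
std::size_t CStyleLength(const std::string& data)
{
    return std::strlen(data.c_str());
}
```

Here @c MakeBinarySample().size() is 5 while @c CStyleLength() of the same buffer is 2, which is why a dedicated buffer class is usually a better fit for binary data.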
@section overview_string_comparison Comparison to Other String Classes
Much work has been done to make existing code using ANSI string literals
work as before version 3.0.
If you nonetheless need to have a wxString that uses @c wchar_t
on Unix and Linux, too, you can specify this on the command line with the
@c configure @c --disable-utf8 switch or you can consider using wxUString
or @c std::wstring instead.
+@c wxUSE_UNICODE is now defined as @c 1 by default to indicate Unicode support.
+If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is
+also defined, otherwise @c wxUSE_UNICODE_WCHAR is.
+See also @ref page_wxusedef_important.
When working with Unicode, it's important to define the meaning of some terms.
-A <b><em>glyph</em></b> is a particular image that represents a character or part
-of a character.
+A <b><em>glyph</em></b> is a particular image (usually part of a font) that
+represents a character or part of a character.
Any character may have one or more glyphs associated with it; e.g. some of the possible
glyphs for the capital letter 'A' are:
which is called <b><em>code point</em></b>; it's typically indicated in documentation
manuals and on the Unicode website as @c U+xxxx where @c xxxx is a hexadecimal number.
-The Unicode standard divides the space of all possible code points in @e planes;
+Note that typically one character is assigned exactly one code point, but there
+are exceptions, such as the so-called <em>precomposed characters</em>
+(see http://en.wikipedia.org/wiki/Precomposed_character) or the <em>ligatures</em>.
+In these cases a single "character" may be mapped to more than one code point or,
+vice versa, several characters may be mapped to a single code point.
+
+The Unicode standard divides the space of all possible code points into <b><em>planes</em></b>;
a plane is a range of 65,536 (0x10000) contiguous Unicode code points.
Planes are numbered from 0 to 16, where the first one is the @e BMP, or Basic
Multilingual Plane.
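Since each plane spans 0x10000 code points, the plane of a code point can be obtained by dropping its low 16 bits. A minimal sketch in standard C++ follows; @c PlaneOf and @c IsInBmp are hypothetical helper names, not wxWidgets functions.

```cpp
#include <cstdint>

// Returns the plane number (0..16) of a code point, or -1 if the value
// lies beyond U+10FFFF, the last code point of plane 16.
int PlaneOf(std::uint32_t codePoint)
{
    if (codePoint > 0x10FFFF)
        return -1;
    return static_cast<int>(codePoint >> 16);
}

// True if the code point belongs to plane 0, the Basic Multilingual Plane.
bool IsInBmp(std::uint32_t codePoint)
{
    return PlaneOf(codePoint) == 0;
}
```

For example, U+0041 falls in plane 0 (the BMP), while U+1D11E falls in plane 1.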
More precisely, a code unit is the minimal bit combination that can represent a
unit of encoded text for processing or interchange.
-The @e UTF or Unicode Transformation Formats are algorithms mapping the Unicode
+The <b><em>UTF</em></b> or Unicode Transformation Formats are algorithms mapping the Unicode
code points to code unit sequences. The simplest of them is <b>UTF-32</b> where
each code unit is composed of 32 bits (4 bytes) and each code point is always
represented by a single code unit (fixed length encoding).
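To show what "mapping code points to code unit sequences" looks like for the variable-length formats, here is a sketch of UTF-16 and UTF-8 encoders for a single code point. It is written against the C++ standard library only, the function names are hypothetical, and the input is assumed to be a valid code point (at most U+10FFFF and not itself a surrogate); no validation is performed.

```cpp
#include <string>

// Encodes one code point as UTF-16: one code unit inside the BMP,
// a surrogate pair (two code units) outside of it.
std::u16string EncodeUtf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000)
    {
        out.push_back(static_cast<char16_t>(cp));
    }
    else
    {
        const char32_t v = cp - 0x10000; // 20 significant bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}

// Encodes one code point as UTF-8 using 1 to 4 code units (bytes).
std::string EncodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80)
    {
        out.push_back(static_cast<char>(cp));
    }
    else if (cp < 0x800)
    {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    else if (cp < 0x10000)
    {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    else
    {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

With these helpers, U+1D11E becomes the two UTF-16 code units D834 DD1E and the four UTF-8 bytes F0 9D 84 9E, whereas in UTF-32 it would be the single code unit 0001D11E.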
However, unlike the Unicode build mode of the previous versions of wxWidgets, this
support is mostly transparent: you can still continue to work with the @b narrow
(i.e. current locale-encoded @c char*) strings even if @b wide
-(i.e. UTF16/UCS2-encoded @c wchar_t* or UTF8-encoded @c char*) strings are also
+(i.e. UTF16-encoded @c wchar_t* or UTF8-encoded @c char*) strings are also
supported. Any wxWidgets function accepts arguments of either type as both
kinds of strings are implicitly converted to wxString, so both
@code
supported. Any wxWidgets function accepts arguments of either type as both
kinds of strings are implicitly converted to wxString, so both
@code
@section overview_unicode_settings Unicode Related Compilation Settings
-@c wxUSE_UNICODE is now defined as 1 by default to indicate Unicode support.
+@c wxUSE_UNICODE is now defined as @c 1 by default to indicate Unicode support.
If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is
also defined, otherwise @c wxUSE_UNICODE_WCHAR is.
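Code that depends on the storage type can branch on these symbols at compile time. The sketch below is only an illustration: the fallback @c \#define lines are an assumption added so that the fragment compiles without any wxWidgets headers (a real build gets the values from the library's configuration), and @c StorageUnit and @c StoresUtf8 are hypothetical names.

```cpp
// Assumption: fallback values for a standalone build without wxWidgets.
// In a real wxWidgets build these symbols are already defined.
#ifndef wxUSE_UNICODE_UTF8
    #define wxUSE_UNICODE_UTF8 0
#endif
#ifndef wxUSE_UNICODE_WCHAR
    #define wxUSE_UNICODE_WCHAR 1
#endif

// Hypothetical alias for the code unit type matching the selected storage.
#if wxUSE_UNICODE_UTF8
typedef char StorageUnit;
#else
typedef wchar_t StorageUnit;
#endif

// Reports at run time which storage was selected at compile time.
inline bool StoresUtf8()
{
    return wxUSE_UNICODE_UTF8 != 0;
}
```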