From: Francesco Montorsi
Date: Mon, 8 Dec 2008 19:25:07 +0000 (+0000)
Subject: added a overview_string_binary section describing what is wxString support with regar...
X-Git-Url: https://git.saurik.com/wxWidgets.git/commitdiff_plain/2f365fcbd591ef4da63f5ca44d1f4b22ab20d287

added a overview_string_binary section describing what is wxString support with regard to binary data; removed traces of UCS2 wording; it was not completely correct (see wx-dev thread 'string changes doubts and docs')

git-svn-id: https://svn.wxwidgets.org/svn/wx/wxWidgets/trunk@57204 c3d73ce0-8a6f-49c7-b76d-6d57e0e08775
---

diff --git a/docs/doxygen/images/overview_unicode_codes.dia b/docs/doxygen/images/overview_unicode_codes.dia
index e8bd50f988..0f6f9066be 100644
Binary files a/docs/doxygen/images/overview_unicode_codes.dia and b/docs/doxygen/images/overview_unicode_codes.dia differ
diff --git a/docs/doxygen/images/overview_unicode_codes.png b/docs/doxygen/images/overview_unicode_codes.png
index 0da2d8ffa8..e58cb8cfd7 100644
Binary files a/docs/doxygen/images/overview_unicode_codes.png and b/docs/doxygen/images/overview_unicode_codes.png differ
diff --git a/docs/doxygen/images/overview_wxstring_encoding.dia b/docs/doxygen/images/overview_wxstring_encoding.dia
index 4d42a4a1a0..fce4617b7b 100644
Binary files a/docs/doxygen/images/overview_wxstring_encoding.dia and b/docs/doxygen/images/overview_wxstring_encoding.dia differ
diff --git a/docs/doxygen/images/overview_wxstring_encoding.png b/docs/doxygen/images/overview_wxstring_encoding.png
index f81af5d1a2..5a152f8c44 100644
Binary files a/docs/doxygen/images/overview_wxstring_encoding.png and b/docs/doxygen/images/overview_wxstring_encoding.png differ
diff --git a/docs/doxygen/overviews/string.h b/docs/doxygen/overviews/string.h
index 54513b6bb4..3829548e3c 100644
--- a/docs/doxygen/overviews/string.h
+++ b/docs/doxygen/overviews/string.h
@@ -14,6 +14,7 @@ Classes: wxString, wxArrayString, wxStringTokenizer
 @li @ref overview_string_intro
 @li @ref overview_string_internal
+@li @ref overview_string_binary
 @li @ref overview_string_comparison
 @li @ref overview_string_advice
 @li @ref overview_string_related
@@ -27,16 +28,12 @@ Classes: wxString, wxArrayString, wxStringTokenizer
 @section overview_string_intro Introduction

 wxString is a class which represents a Unicode string of arbitrary length and
-containing arbitrary characters.
-
-The @c NUL character is allowed, but be
-aware that in the current string implementation some methods might not work
-correctly in this case. @todo still true?
+containing arbitrary Unicode characters.

 This class has all the standard operations you can expect to find in a string
 class: dynamic memory management (string extends to accommodate new
-characters), construction from other strings, C strings, wide character C strings
-and characters, assignment operators, access to individual characters, string
+characters), construction from other strings, compatibility with C strings and
+wide character C strings, assignment operators, access to individual characters, string
 concatenation and comparison, substring extraction, case conversion, trimming and
 padding (with spaces), searching and replacing and both C-like @c printf (wxString::Printf)
 and stream-like insertion functions as well as much more - see wxString for a
@@ -49,28 +46,31 @@ in previous versions.
 @section overview_string_internal Internal wxString encoding

-Since wxWidgets 3.0 wxString internally uses UCS-2 (with Unicode
+Since wxWidgets 3.0 wxString internally uses UTF-16 (with Unicode
 code units stored in @c wchar_t) under Windows and UTF-8 (with Unicode
 code units stored in @c char) under Unix, Linux and Mac OS X to store its content.

 For definitions of code units and code points terms, please see
 the @ref overview_unicode_encodings paragraph.

-Note that there is a difference about UCS-2 and UTF-16: the first is a fixed-length
-encoding, without surrogate pairs, while the latter is a
-variable-length encoding. Except for this the two encodings are identical.
-
 For simplicity of implementation, wxString when wxUSE_UNICODE_WCHAR==1
-(e.g. on Windows) uses UCS-2 and thus doesn't know anything about surrogate pairs;
-it always consider 1 code unit per 1 code point, while this is really true only for
-characters in the @e BMP (Basic Multilingual Plane).
+(e.g. on Windows) uses per code unit indexing instead of
+per code point indexing and doesn't know anything about surrogate pairs;
+in other words it always considers code points to be composed by 1 code point,
+while this is really true only for characters in the @e BMP (Basic Multilingual Plane).
 Thus when iterating over a UTF-16 string stored in a wxString under Windows, the user
-code has to take care of surrogate pair handling himself.
+code has to take care of surrogate pairs himself.
 (Note however that Windows itself has built-in support for surrogate pairs in UTF-16,
 such as for drawing strings on screen.)

+@remarks
+Note that while the behaviour of wxString when wxUSE_UNICODE_WCHAR==1
+resembles UCS-2 encoding, it's not completely correct to refer to wxString as
+UCS-2 encoded since you can encode characters outside the @e BMP in a wxString.
+
 When instead wxUSE_UNICODE_UTF8==1 (e.g. on Linux and Mac OS X)
-wxString handles UTF8 multi-bytes sequences just fine, so that you can use
+wxString handles UTF8 multi-bytes sequences just fine also for characters outside
+the BMP (it implements per code point indexing), so that you can use
 UTF8 in a completely transparent way:

 Example:
@@ -89,7 +89,7 @@ Example:
     wxPrintf("wxString reports a length of %d character(s)", test.length());
     // prints "wxString reports a length of 1 character(s)" on Linux
     // prints "wxString reports a length of 2 character(s)" on Windows
-    // since Windows doesn't have surrogate pairs support!
+    // since wxString on Windows doesn't have surrogate pairs support!

     // second test, this time using characters part of the Unicode BMP:
@@ -113,17 +113,30 @@ above; it's composed by 3 characters and the final @c NULL:
 @image html overview_wxstring_encoding.png

-As you can see, UCS2/UTF16 encoding is straightforward (for characters in the @e BMP)
-and in this example the UCS2-encoded wxString takes 8 bytes.
+As you can see, UTF16 encoding is straightforward (for characters in the @e BMP)
+and in this example the UTF16-encoded wxString takes 8 bytes.
 UTF8 encoding is more elaborated and in this example takes 7 bytes.

-The type used by wxString to store Unicode code units is called wxStringCharType.
-
 In general, for strings containing many latin characters UTF8 provides a big
-advantage in memory footprint respect UTF16, but requires some more processing
-for common operations like e.g. length calculation.
+advantage with regards to the memory footprint respect UTF16, but requires some
+more processing for common operations like e.g. length calculation.
+
+Finally, note that the type used by wxString to store Unicode code units
+(@c wchar_t or @c char) is always @c typedef-ined to be ::wxStringCharType.

+@section overview_string_binary Using wxString to store binary data
+
+wxString can be used to store binary data (even if it contains @c NULs) using the
+functions wxString::To8BitData and wxString::From8BitData.
+
+Beware that even if @c NUL character is allowed, in the current string implementation
+some methods might not work correctly with them.
+
+Note however that other classes like wxMemoryBuffer are more suited to this task.
+For handling binary data you may also want to look at the wxStreamBuffer,
+wxMemoryOutputStream, wxMemoryInputStream classes.
+
 @section overview_string_comparison Comparison to Other String Classes

@@ -364,11 +377,16 @@ difference the change to @c EXTRA_ALLOC makes to your program.
 Much work has been done to make existing code using ANSI string literals
 work as before version 3.0.
+
 If you nonetheless need to have a wxString that uses @c wchar_t
 on Unix and Linux, too, you can specify this on the command line with the
 @c configure @c --disable-utf8 switch or you can consider using wxUString
 or @c std::wstring instead.

+@c wxUSE_UNICODE is now defined as @c 1 by default to indicate Unicode support.
+If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is
+also defined, otherwise @c wxUSE_UNICODE_WCHAR is.
+See also @ref page_wxusedef_important.

 */

diff --git a/docs/doxygen/overviews/unicode.h b/docs/doxygen/overviews/unicode.h
index e50454a1cd..be0d550b9b 100644
--- a/docs/doxygen/overviews/unicode.h
+++ b/docs/doxygen/overviews/unicode.h
@@ -49,8 +49,8 @@ other services should be ready to deal with Unicode.
 When working with Unicode, it's important to define the meaning of some terms.

-A glyph is a particular image that represents a character or part
-of a character.
+A glyph is a particular image (usually part of a font) that
+represents a character or part of a character.
 Any character may have one or more glyph associated; e.g. some of the possible
 glyphs for the capital letter 'A' are:

@@ -60,7 +60,13 @@ Unicode assigns each character of almost any existing alphabet/script a number,
 which is called code point; it's typically indicated in documentation
 manuals and in the Unicode website as @c U+xxxx where @c xxxx is an hexadecimal number.

-The Unicode standard divides the space of all possible code points in @e planes;
+Note that typically one character is assigned exactly one code point, but there
+are exceptions; the so-called precomposed characters
+(see http://en.wikipedia.org/wiki/Precomposed_character) or the ligatures.
+In these cases a single "character" may be mapped to more than one code point or
+viceversa more characters may be mapped to a single code point.
+
+The Unicode standard divides the space of all possible code points in planes;
 a plane is a range of 65,536 (1000016) contiguous Unicode code points.
 Planes are numbered from 0 to 16, where the first one is the @e BMP, or Basic
 Multilingual Plane.
@@ -73,7 +79,7 @@ Code points are represented in computer memory as a sequence of one or more
 More precisely, a code unit is the minimal bit combination that can represent
 a unit of encoded text for processing or interchange.

-The @e UTF or Unicode Transformation Formats are algorithms mapping the Unicode
+The UTF or Unicode Transformation Formats are algorithms mapping the Unicode
 code points to code unit sequences. The simplest of them is UTF-32 where
 each code unit is composed by 32 bits (4 bytes) and each code point is always
 represented by a single code unit (fixed length encoding).
@@ -129,7 +135,7 @@ programs require the Microsoft Layer for Unicode to run on Windows 95/98/ME.
 However, unlike the Unicode build mode of the previous versions of wxWidgets, this
 support is mostly transparent: you can still continue to work with the @b narrow
 (i.e. current locale-encoded @c char*) strings even if @b wide
-(i.e. UTF16/UCS2-encoded @c wchar_t* or UTF8-encoded @c char*) strings are also
+(i.e. UTF16-encoded @c wchar_t* or UTF8-encoded @c char*) strings are also
 supported. Any wxWidgets function accepts arguments of either type as both
 kinds of strings are implicitly converted to wxString, so both
 @code
@@ -386,7 +392,7 @@ function directly.
 @section overview_unicode_settings Unicode Related Compilation Settings

-@c wxUSE_UNICODE is now defined as 1 by default to indicate Unicode support.
+@c wxUSE_UNICODE is now defined as @c 1 by default to indicate Unicode support.
 If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is
 also defined, otherwise @c wxUSE_UNICODE_WCHAR is.