added a overview_string_binary section describing what is wxString support with regar...

author Francesco Montorsi <f18m_cpp217828@yahoo.it>

Mon, 8 Dec 2008 19:25:07 +0000 (19:25 +0000)

committer Francesco Montorsi <f18m_cpp217828@yahoo.it>

Mon, 8 Dec 2008 19:25:07 +0000 (19:25 +0000)
diff --git a/docs/doxygen/images/overview_unicode_codes.dia b/docs/doxygen/images/overview_unicode_codes.dia

index e8bd50f9886e88786473734eda523315913bceba..0f6f9066bee2ef972ea8418a965c8c812d740933 100644 (file)

Binary files a/docs/doxygen/images/overview_unicode_codes.dia and b/docs/doxygen/images/overview_unicode_codes.dia differ
diff --git a/docs/doxygen/images/overview_unicode_codes.png b/docs/doxygen/images/overview_unicode_codes.png

index 0da2d8ffa84c0b92b615a8bc6446fcc13d18f386..e58cb8cfd7eb3db3f980c7392d31659b712f5e42 100644 (file)

Binary files a/docs/doxygen/images/overview_unicode_codes.png and b/docs/doxygen/images/overview_unicode_codes.png differ
diff --git a/docs/doxygen/images/overview_wxstring_encoding.dia b/docs/doxygen/images/overview_wxstring_encoding.dia

index 4d42a4a1a033ed2c737c5121e31d3cbf7051e8ed..fce4617b7bf69bd8e6f0fd7681de345d168ebf50 100644 (file)

Binary files a/docs/doxygen/images/overview_wxstring_encoding.dia and b/docs/doxygen/images/overview_wxstring_encoding.dia differ
diff --git a/docs/doxygen/images/overview_wxstring_encoding.png b/docs/doxygen/images/overview_wxstring_encoding.png

index f81af5d1a2781f63920c5cfbabe9dd63e521452e..5a152f8c4423df0b1ea865120b739e7a6adbcdcd 100644 (file)

Binary files a/docs/doxygen/images/overview_wxstring_encoding.png and b/docs/doxygen/images/overview_wxstring_encoding.png differ
diff --git a/docs/doxygen/overviews/string.h b/docs/doxygen/overviews/string.h

index 54513b6bb43c2884c1f6220c072a3a56476935e8..3829548e3c0e5587acfc056a929a7f21b7f2d1a9 100644 (file)
--- a/docs/doxygen/overviews/string.h
+++ b/docs/doxygen/overviews/string.h
@@ -14,6 +14,7 @@ Classes: wxString, wxArrayString, wxStringTokenizer
  
  @li @ref overview_string_intro
  @li @ref overview_string_internal
+@li @ref overview_string_binary
  @li @ref overview_string_comparison
  @li @ref overview_string_advice
  @li @ref overview_string_related
@@ -27,16 +28,12 @@ Classes: wxString, wxArrayString, wxStringTokenizer
  @section overview_string_intro Introduction
  
  wxString is a class which represents a Unicode string of arbitrary length and
-containing arbitrary characters.
-
-The @c NUL character is allowed, but be
-aware that in the current string implementation some methods might not work
-correctly in this case. @todo still true?
+containing arbitrary Unicode characters.
  
  This class has all the standard operations you can expect to find in a string
  class: dynamic memory management (string extends to accommodate new
-characters), construction from other strings, C strings, wide character C strings
-and characters, assignment operators, access to individual characters, string
+characters), construction from other strings, compatibility with C strings and
+wide character C strings, assignment operators, access to individual characters, string
  concatenation and comparison, substring extraction, case conversion, trimming and
  padding (with spaces), searching and replacing and both C-like @c printf (wxString::Printf)
  and stream-like insertion functions as well as much more - see wxString for a
@@ -49,28 +46,31 @@ in previous versions.
  
  @section overview_string_internal Internal wxString encoding
  
-Since wxWidgets 3.0 wxString internally uses <b>UCS-2</b> (with Unicode
+Since wxWidgets 3.0 wxString internally uses <b>UTF-16</b> (with Unicode
  code units stored in @c wchar_t) under Windows and <b>UTF-8</b> (with Unicode
  code units stored in @c char) under Unix, Linux and Mac OS X to store its content.
  
  For definitions of <em>code units</em> and <em>code points</em> terms, please
  see the @ref overview_unicode_encodings paragraph.
  
-Note that there is a difference about UCS-2 and UTF-16: the first is a fixed-length
-encoding, without <em>surrogate pairs</em>, while the latter is a
-variable-length encoding. Except for this the two encodings are identical.
-
  For simplicity of implementation, wxString when <tt>wxUSE_UNICODE_WCHAR==1</tt>
-(e.g. on Windows) uses UCS-2 and thus doesn't know anything about surrogate pairs;
-it always consider 1 code unit per 1 code point, while this is really true only for
-characters in the @e BMP (Basic Multilingual Plane).
+(e.g. on Windows) uses <em>per code unit indexing</em> instead of
+<em>per code point indexing</em> and doesn't know anything about surrogate pairs;
+in other words it always considers code points to be composed of 1 code unit,
+while this is really true only for characters in the @e BMP (Basic Multilingual Plane).
  Thus when iterating over a UTF-16 string stored in a wxString under Windows, the user
-code has to take care of <em>surrogate pair</em> handling himself.
+code has to take care of <em>surrogate pairs</em> itself.
  (Note however that Windows itself has built-in support for surrogate pairs in UTF-16,
  such as for drawing strings on screen.)
  
+@remarks
+Note that while the behaviour of wxString when <tt>wxUSE_UNICODE_WCHAR==1</tt>
+resembles UCS-2 encoding, it's not completely correct to refer to wxString as
+UCS-2 encoded since you can encode characters outside the @e BMP in a wxString.
+
  When instead <tt>wxUSE_UNICODE_UTF8==1</tt> (e.g. on Linux and Mac OS X)
-wxString handles UTF8 multi-bytes sequences just fine, so that you can use
+wxString handles UTF8 multi-byte sequences just fine, including for characters outside
+the BMP (it implements <em>per code point indexing</em>), so that you can use
  UTF8 in a completely transparent way:
  
  Example:
@@ -89,7 +89,7 @@ Example:
      wxPrintf("wxString reports a length of %d character(s)", test.length());
          // prints "wxString reports a length of 1 character(s)" on Linux
          // prints "wxString reports a length of 2 character(s)" on Windows
-        // since Windows doesn't have surrogate pairs support!
+        // since wxString on Windows doesn't have surrogate pairs support!
  
  
      // second test, this time using characters part of the Unicode BMP:
@@ -113,17 +113,30 @@ above; it's composed by 3 characters and the final @c NULL:
  
  @image html overview_wxstring_encoding.png
  
-As you can see, UCS2/UTF16 encoding is straightforward (for characters in the @e BMP)
-and in this example the UCS2-encoded wxString takes 8 bytes.
+As you can see, UTF16 encoding is straightforward (for characters in the @e BMP)
+and in this example the UTF16-encoded wxString takes 8 bytes.
  UTF8 encoding is more elaborated and in this example takes 7 bytes.
  
-The type used by wxString to store Unicode code units is called wxStringCharType.
-
  In general, for strings containing many latin characters UTF8 provides a big
-advantage in memory footprint respect UTF16, but requires some more processing
-for common operations like e.g. length calculation.
+advantage in memory footprint compared to UTF16, but requires some
+more processing for common operations such as length calculation.
+
+Finally, note that the type used by wxString to store Unicode code units
+(@c wchar_t or @c char) is always @c typedef-ined to be ::wxStringCharType.
  
  
  
  
+@section overview_string_binary Using wxString to store binary data
+
+wxString can be used to store binary data (even if it contains @c NULs) using the
+functions wxString::To8BitData and wxString::From8BitData.
+
+Beware that even though @c NUL characters are allowed, in the current string implementation
+some methods might not work correctly with them.
+
+Note however that other classes like wxMemoryBuffer are more suited to this task.
+For handling binary data you may also want to look at the wxStreamBuffer,
+wxMemoryOutputStream, wxMemoryInputStream classes.
+
  
  @section overview_string_comparison Comparison to Other String Classes
  
  
  @section overview_string_comparison Comparison to Other String Classes
  
@@ -364,11 +377,16 @@ difference the change to @c EXTRA_ALLOC makes to your program.
  
  Much work has been done to make existing code using ANSI string literals
  work as before version 3.0.
+
  If you nonetheless need to have a wxString that uses @c wchar_t
  on Unix and Linux, too, you can specify this on the command line with the
  @c configure @c --disable-utf8 switch or you can consider using wxUString
  or @c std::wstring instead.
  
+@c wxUSE_UNICODE is now defined as @c 1 by default to indicate Unicode support.
+If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is
+also defined, otherwise @c wxUSE_UNICODE_WCHAR is.
+See also @ref page_wxusedef_important.
  
  */
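
The difference between per code unit indexing and per code point indexing described in the string.h hunks above can be illustrated without wxWidgets. What follows is a minimal standalone C++11 sketch, not wxString's actual implementation: the function names `CountCodePointsUtf16` and `CountCodePointsUtf8` are hypothetical, and the UTF-8 counter assumes well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-16 string.
// A high surrogate (0xD800..0xDBFF) followed by a low surrogate
// (0xDC00..0xDFFF) forms one code point outside the BMP.
std::size_t CountCodePointsUtf16(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const char16_t unit = s[i];
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        if (isHigh && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
        {
            ++i; // skip the low surrogate, it belongs to the same code point
        }
        ++count;
    }
    return count;
}

// Hypothetical helper: counts code points in well-formed UTF-8.
// Every byte that is not a continuation byte (10xxxxxx) starts a code point.
std::size_t CountCodePointsUtf8(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s)
    {
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    assert(count <= s.size()); // a code point needs at least one code unit
    return count;
}
```

For U+1D11E (picked here only as an example of a character outside the BMP), the UTF-16 string holds 2 code units and the UTF-8 string holds 4, yet both helpers report 1 code point. A container that indexes per code unit would report 2 for the UTF-16 form, which mirrors the Windows result quoted in the overview's example.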
  
diff --git a/docs/doxygen/overviews/unicode.h b/docs/doxygen/overviews/unicode.h

index e50454a1cd78126c0daedefd384902d31415162e..be0d550b9bb501fa91ed888d78599e170d476f42 100644 (file)
--- a/docs/doxygen/overviews/unicode.h
+++ b/docs/doxygen/overviews/unicode.h
@@ -49,8 +49,8 @@ other services should be ready to deal with Unicode.
  
  When working with Unicode, it's important to define the meaning of some terms.
  
-A <b><em>glyph</em></b> is a particular image that represents a character or part
-of a character.
+A <b><em>glyph</em></b> is a particular image (usually part of a font) that
+represents a character or part of a character.
  Any character may have one or more glyph associated; e.g. some of the possible
  glyphs for the capital letter 'A' are:
  
@@ -60,7 +60,13 @@ Unicode assigns each character of almost any existing alphabet/script a number,
  which is called <b><em>code point</em></b>; it's typically indicated in documentation
  manuals and in the Unicode website as @c U+xxxx where @c xxxx is an hexadecimal number.
  
-The Unicode standard divides the space of all possible code points in @e planes;
+Note that typically one character is assigned exactly one code point, but there
+are exceptions, such as the so-called <em>precomposed characters</em>
+(see http://en.wikipedia.org/wiki/Precomposed_character) or the <em>ligatures</em>.
+In these cases a single "character" may be mapped to more than one code point or,
+vice versa, several characters may be mapped to a single code point.
+
+The Unicode standard divides the space of all possible code points in <b><em>planes</em></b>;
  a plane is a range of 65,536 (10000 in hexadecimal) contiguous Unicode code points.
  Planes are numbered from 0 to 16, where the first one is the @e BMP, or Basic
  Multilingual Plane.
@@ -73,7 +79,7 @@ Code points are represented in computer memory as a sequence of one or more
  More precisely, a code unit is the minimal bit combination that can represent a
  unit of encoded text for processing or interchange.
  
-The @e UTF or Unicode Transformation Formats are algorithms mapping the Unicode
+The <b><em>UTF</em></b> or Unicode Transformation Formats are algorithms mapping the Unicode
  code points to code unit sequences. The simplest of them is <b>UTF-32</b> where
  each code unit is composed by 32 bits (4 bytes) and each code point is always
  represented by a single code unit (fixed length encoding).
@@ -129,7 +135,7 @@ programs require the Microsoft Layer for Unicode to run on Windows 95/98/ME.
  However, unlike the Unicode build mode of the previous versions of wxWidgets, this
  support is mostly transparent: you can still continue to work with the @b narrow
  (i.e. current locale-encoded @c char*) strings even if @b wide
-(i.e. UTF16/UCS2-encoded @c wchar_t* or UTF8-encoded @c char*) strings are also
+(i.e. UTF16-encoded @c wchar_t* or UTF8-encoded @c char*) strings are also
  supported. Any wxWidgets function accepts arguments of either type as both
  kinds of strings are implicitly converted to wxString, so both
  @code
@@ -386,7 +392,7 @@ function directly.
  
  @section overview_unicode_settings Unicode Related Compilation Settings
  
-@c wxUSE_UNICODE is now defined as 1 by default to indicate Unicode support.
+@c wxUSE_UNICODE is now defined as @c 1 by default to indicate Unicode support.
  If UTF-8 is used for the internal storage in wxString, @c wxUSE_UNICODE_UTF8 is
  also defined, otherwise @c wxUSE_UNICODE_WCHAR is.
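
The terms touched by the unicode.h hunks (planes, code points, code units, UTF encodings) can be made concrete with a short standalone C++11 sketch. It is only an illustration under stated assumptions: `PlaneOf`, `EncodeUtf16` and `EncodeUtf8` are hypothetical names and are not part of wxWidgets, and the input is assumed to be a valid code point that is not itself a surrogate.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true for code points that can be encoded,
// i.e. at most U+10FFFF and not in the surrogate range.
bool IsScalarValue(char32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Each plane spans 0x10000 code points, so the plane index is cp / 0x10000.
unsigned PlaneOf(char32_t cp)
{
    assert(cp <= 0x10FFFF);
    return static_cast<unsigned>(cp >> 16);
}

// UTF-16: one 16-bit code unit inside the BMP, a surrogate pair outside it.
std::u16string EncodeUtf16(char32_t cp)
{
    assert(IsScalarValue(cp));
    std::u16string out;
    if (cp < 0x10000)
    {
        out.push_back(static_cast<char16_t>(cp));
    }
    else
    {
        const char32_t v = cp - 0x10000; // 20 significant bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low
    }
    return out;
}

// UTF-8: one to four 8-bit code units depending on the code point value.
std::string EncodeUtf8(char32_t cp)
{
    assert(IsScalarValue(cp));
    std::string out;
    if (cp < 0x80)
    {
        out.push_back(static_cast<char>(cp));
    }
    else if (cp < 0x800)
    {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    else if (cp < 0x10000)
    {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    else
    {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

With these helpers, U+0041 lies in plane 0 (the BMP) and takes one code unit in either encoding, while U+1D11E lies in plane 1 and takes two UTF-16 code units (0xD834, 0xDD1E) or four UTF-8 code units (0xF0, 0x9D, 0x84, 0x9E).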
  