/////////////////////////////////////////////////////////////////////////////
// Name:        unicode.h
// Purpose:     topic overview
// Author:      wxWidgets team
// RCS-ID:      $Id$
// Licence:     wxWindows license
/////////////////////////////////////////////////////////////////////////////

/**

@page overview_unicode Unicode Support in wxWidgets

This section briefly describes the state of the Unicode support in wxWidgets.
Read it if you want to know more about how to write programs able to work with
characters from languages other than English.

@li @ref overview_unicode_what
@li @ref overview_unicode_ansi
@li @ref overview_unicode_supportin
@li @ref overview_unicode_supportout
@li @ref overview_unicode_settings

<hr>


@section overview_unicode_what What is Unicode?

wxWidgets has support for compiling in Unicode mode on the platforms which
support it. Unicode is a standard for character encoding which addresses the
shortcomings of the previous 8-bit standards by using at least 16 (and
possibly 32) bits to encode each character. This allows for at least 65536
characters (the so-called BMP, or Basic Multilingual Plane) and possibly up to
2^32 of them, instead of the usual 256, which is sufficient to encode all of
the world's languages at once. A different approach is to encode all strings
in UTF-8, which does not require the use of wide characters and is
additionally backwards compatible with 7-bit ASCII. UTF-8 is the preferred
solution under Linux and, partially, under OS X.
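
To make the ASCII compatibility of UTF-8 concrete, here is a minimal sketch in
standard C++ (no wxWidgets types are involved). The helper name EncodeUtf8() is
hypothetical and introduced only for illustration; the sketch assumes a valid
code point and does not reject surrogate values.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not a wxWidgets function): encode a single Unicode
// code point as a UTF-8 byte sequence of 1 to 4 bytes.
std::string EncodeUtf8(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF); // largest code point defined by Unicode

    std::string out;
    if (cp < 0x80)
    {
        // 7-bit ASCII is stored unchanged in a single byte.
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

An ASCII letter is encoded as the same single byte it occupies in ASCII, while
characters outside the ASCII range take between two and four bytes.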

More details about Unicode may be found at <http://www.unicode.org/>.

Writing internationalized programs is much easier with Unicode. Moreover,
even a program which uses only standard ASCII can benefit from using Unicode
for string representation because there is no need to convert all the
strings the program uses to/from Unicode each time a system call is made.

@section overview_unicode_ansi Unicode and ANSI Modes

Until wxWidgets 3.0 it was possible to compile the library either in
ANSI (8-bit) mode or in wide char mode (16 bits per character
on Windows and 32 bits on most Unix versions, Linux and OS X). This
has changed in wxWidgets 3.0 with the removal of the ANSI mode,
but much effort has been made so that most existing ANSI
code should still compile and work as before.

@section overview_unicode_supportin Unicode Support in wxWidgets

Since wxWidgets 3.0 Unicode support is always enabled, meaning
that the wxString class always uses Unicode to encode its content.
Under Windows, wxString uses UCS-2 (basically an array of 16-bit
wchar_t). Under Unix, Linux and OS X, however, wxString uses UTF-8
to encode its content.

For the programmer, the biggest change is that accessing characters by
index can be slower than before: with a UTF-8 representation, wxString
has to parse the string from its beginning in order to find the n-th
character. Iterating over a string should therefore no longer be done
by index but using iterators. Old code will still work but might be
less efficient.
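
The cost of indexed access in a UTF-8 string can be illustrated with a small
sketch in standard C++. It does not use wxString and is not how wxWidgets
implements it; the helpers IsUtf8Continuation(), OffsetOfNthChar() and
CountChars() are hypothetical names introduced here, and the input is assumed
to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// A byte of the form 10xxxxxx continues a multi-byte sequence; any other
// byte starts a new character.
inline bool IsUtf8Continuation(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}

// Hypothetical helper: return the byte offset at which the n-th character
// (counting from 0) starts. The scan always begins at offset 0, so the cost
// grows with n. Returns utf8.size() if the string has no such character.
std::size_t OffsetOfNthChar(const std::string& utf8, std::size_t n)
{
    std::size_t pos = 0;
    while (pos < utf8.size())
    {
        if (n == 0)
            return pos;

        // Skip the lead byte and any continuation bytes that follow it.
        ++pos;
        while (pos < utf8.size() &&
               IsUtf8Continuation(static_cast<unsigned char>(utf8[pos])))
        {
            ++pos;
        }
        --n;
    }
    return utf8.size();
}

// Hypothetical helper: visit every character in a single pass, which is
// what iterator-based code amounts to.
std::size_t CountChars(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8)
    {
        if (!IsUtf8Continuation(byte))
            ++count;
    }
    return count;
}
```

A loop calling OffsetOfNthChar() for every index rescans the start of the
string each time, whereas a single pass such as CountChars() touches each byte
only once. This is the difference between the two loops shown next.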

Old code like this:

@code
wxString s = wxT("hello");
size_t i;
for (i = 0; i < s.Len(); i++)
{
    wxChar ch = s[i];
    
    // do something with it
}
@endcode

should be replaced (especially in time-critical places) with:

@code
wxString s = "hello";
wxString::const_iterator i;
for (i = s.begin(); i != s.end(); ++i)
{
    wxUniChar uni_ch = *i;
    wxChar ch = uni_ch;
    // same as:   wxChar ch = *i
    
    // do something with it
}
@endcode

If you want to replace individual characters in the string, you
need to get a reference to each character:

@code
wxString s = "hello";
wxString::iterator i;
for (i = s.begin(); i != s.end(); ++i)
{
    wxUniCharRef ch = *i;
    ch = 'a';
    // same as:  *i = 'a';
}
@endcode

which will change the content of the wxString s from "hello" to "aaaaa".

String literals are translated to Unicode when they are assigned to
a wxString object, so code can be written like this:

@code
wxString s = "Hello, world!";
size_t len = s.Len();
@endcode

wxWidgets provides wrappers around most POSIX C functions (like printf())
and their syntax has been adapted to accept wxString, normal
C-style strings and wchar_t strings as arguments:

@code
wxString s;
s.Printf( "%s %s %s", "hello1", L"hello2", wxString("hello3") );
wxPrintf( "Three times hello %s\n", s );
@endcode

@section overview_unicode_supportout Unicode and the Outside World

We have seen that it is easy to write Unicode programs using wxWidgets types
and macros, but this alone isn't quite enough. Although
everything works fine inside the program, things can get nasty when it tries to
communicate with the outside world which, sadly, often expects ANSI strings (a
notable exception is the entire Win32 API, which accepts either Unicode or ANSI
strings and thus makes it unnecessary to ever perform any conversions in
the program). GTK 2.0 only accepts UTF-8 strings.

To get an ANSI string from a wxString, you may use the mb_str() function, which
always returns an ANSI string (independently of the mode, whereas the usual
c_str() returns a pointer to the internal representation, which is either ASCII
or Unicode). More rarely used, but still useful, is the wc_str() function, which
always returns the Unicode string.

Sometimes it is also necessary to go from ANSI strings to wxStrings. In this
case, you can use the converter-constructor, as follows:

@code
const char* ascii_str = "Some text";
wxString str(ascii_str, wxConvUTF8);
@endcode
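
Conceptually, a UTF-8 converter has to turn the incoming bytes back into
Unicode code points. The following is only a sketch of that idea in standard
C++ and not the wxWidgets implementation: DecodeUtf8() is a hypothetical helper
introduced here, and it assumes well-formed input and performs no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of wxWidgets): decode a UTF-8 byte string
// into a sequence of 32-bit code points. Assumes well-formed input.
std::u32string DecodeUtf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size())
    {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);

        // The lead byte tells how many continuation bytes follow and
        // carries the most significant bits of the code point.
        std::size_t extra;
        char32_t cp;
        if (lead < 0x80)      { extra = 0; cp = lead; }        // ASCII
        else if (lead < 0xE0) { extra = 1; cp = lead & 0x1F; } // 110xxxxx
        else if (lead < 0xF0) { extra = 2; cp = lead & 0x0F; } // 1110xxxx
        else                  { extra = 3; cp = lead & 0x07; } // 11110xxx
        ++i;

        // Each continuation byte contributes its 6 low bits.
        for (; extra > 0 && i < bytes.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);

        out += cp;
    }
    return out;
}
```

Pure ASCII input such as "Some text" decodes to one code point per byte, which
is why ASCII literals can be passed through a UTF-8 converter unchanged.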

For more information about converters and Unicode see the @ref overview_mbconv.


@section overview_unicode_settings Unicode Related Compilation Settings

You should define @c wxUSE_UNICODE to 1 to compile your program in Unicode
mode. Since wxWidgets 3.0 this is always the case. When compiled in UTF-8
mode, @c wxUSE_UNICODE_UTF8 is also defined.

*/