/////////////////////////////////////////////////////////////////////////////
// Name:        unicode.h
// Purpose:     topic overview
// Author:      wxWidgets team
// RCS-ID:      $Id$
// Licence:     wxWindows license
/////////////////////////////////////////////////////////////////////////////

/*!

@page overview_unicode Unicode Support in wxWidgets

This section briefly describes the state of the Unicode support in wxWidgets.
Read it if you want to know more about how to write programs able to work with
characters from languages other than English.

@li @ref overview_unicode_what
@li @ref overview_unicode_ansi
@li @ref overview_unicode_supportin
@li @ref overview_unicode_supportout
@li @ref overview_unicode_settings
@li @ref overview_unicode_traps


<hr>


@section overview_unicode_what What is Unicode?

wxWidgets can be compiled in Unicode mode on the platforms which support it.
Unicode is a character encoding standard which addresses the shortcomings of
the previous 8-bit standards by using at least 16 (and possibly 32) bits to
encode each character. This allows at least 65536 characters (the so-called
BMP, or Basic Multilingual Plane) and potentially 2^32 of them, instead of the
usual 256, which is sufficient to encode all of the world's languages at once.
More details about Unicode may be found at <http://www.unicode.org/>.
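
The step from the 16-bit BMP to the characters beyond it can be made concrete
with a small sketch. The helper below is purely illustrative (@c EncodeUtf16 is
a hypothetical name, not a wxWidgets function) and assumes the UTF-16 encoding,
in which a character outside the BMP is stored as a pair of 16-bit units:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative helper (not part of wxWidgets): encode one Unicode code point
// as UTF-16 code units. Surrogate code points (0xD800-0xDFFF) are not
// rejected here, to keep the sketch short.
std::vector<std::uint16_t> EncodeUtf16(std::uint32_t cp)
{
    // Code points above 0x10FFFF cannot be represented in UTF-16.
    assert(cp <= 0x10FFFF);

    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        // Inside the BMP: a single 16-bit unit is enough.
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Outside the BMP: split the remaining 20 bits into a surrogate pair.
        cp -= 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return units;
}
```

A BMP character such as U+0041 yields a single unit, while U+1F600 yields the
pair 0xD83D, 0xDE00.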

As this solution is obviously preferable to the previous ones (think of
incompatible encodings for the same language, locale chaos and so on), many
modern operating systems support it. Probably the first example is Windows NT,
which has used only Unicode internally since its very first version.

Writing internationalized programs is much easier with Unicode and, as support
for it improves, it should become even more so. Moreover, under Windows
NT/2000, even programs which use only standard ASCII can benefit from using
Unicode because they will run more efficiently: the system does not need to
convert all the strings the program uses to and from Unicode each time a
system call is made.


@section overview_unicode_ansi Unicode and ANSI Modes

As not all platforms supported by wxWidgets (fully) support Unicode yet, in
many cases it is unwise to write a program which can only work in a Unicode
environment. A better solution is to write programs in such a way that they may
be compiled either in ANSI (traditional) mode or in Unicode mode.

This can be achieved quite simply by using the means provided by wxWidgets.
Basically, there are only a few things to watch out for:

- Character type (@c char or @c wchar_t)
- Literal strings (e.g. @c "Hello, world!" or @c '*')
- String functions (@c strlen(), @c strcpy(), ...)
- Special preprocessor tokens (@c __FILE__, @c __DATE__ and @c __TIME__)

Let's look at them in order. First of all, each character in a Unicode program
typically takes 2 or 4 bytes instead of the usual one, so another type must be
used to store characters (@c char usually holds only 1 byte). This type is
called @c wchar_t, which stands for @e wide-character type.

Also, string and character constants should be encoded using wide characters
(@c wchar_t type), which typically take 2 or 4 bytes each instead of the single
byte of a @c char. This is achieved in the standard C (and C++) way: just
prefix any string or character constant with the letter @c L and it becomes a
@e long constant, i.e. a wide-character one. Note that the @c L must come
immediately before the opening quote; it cannot be placed after the constant.

Of course, the usual standard C functions don't work with @c wchar_t strings,
so another set of functions exists which do the same thing but accept
@c wchar_t* instead of @c char*. For example, the function returning the length
of a wide-character string is called @c wcslen() (compare with @c strlen(): the
only difference is that the "str" prefix standing for "string" has been
replaced with "wcs" standing for "wide-character string").
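
As a minimal sketch of this difference, using only the standard C library (the
function names are made up for this example), both lengths below are counted
in characters, so the two kinds of string differ only in the storage they need:

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Length of a narrow string literal, in characters.
std::size_t NarrowLength()
{
    return std::strlen("Hello, world!");
}

// Length of the same text as a wide string literal; the L prefix makes the
// literal an array of wchar_t instead of char.
std::size_t WideLength()
{
    return std::wcslen(L"Hello, world!");
}

// Storage needed for the wide characters (without the terminating zero).
// sizeof(wchar_t) depends on the platform, typically 2 or 4.
std::size_t WideByteSize()
{
    return WideLength() * sizeof(wchar_t);
}
```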

And finally, the standard preprocessor tokens enumerated above expand to ANSI
strings, but Unicode strings are more likely to be wanted in the Unicode build.
wxWidgets provides the macros @c __TFILE__, @c __TDATE__ and @c __TTIME__ which
behave exactly like the standard ones except that they produce ANSI strings in
the ANSI build and Unicode ones in the Unicode build.
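
How a wide version of such a token can be obtained is sketched below. This only
illustrates the general preprocessor technique and is not necessarily how
wxWidgets defines its macros; the names @c SKETCH_WIDEN and @c WideBuildDate
are hypothetical. The extra level of macro expansion is needed so that
@c __DATE__ is replaced by its string literal before @c L is pasted onto it:

```cpp
#include <cstddef>
#include <cwchar>

// Hypothetical helper macros: the outer one expands its argument first,
// the inner one pastes the L prefix onto the resulting string literal.
#define SKETCH_WIDEN_HELPER(x) L##x
#define SKETCH_WIDEN(x) SKETCH_WIDEN_HELPER(x)

// Returns the compilation date as a wide string.
const wchar_t* WideBuildDate()
{
    return SKETCH_WIDEN(__DATE__);
}

// __DATE__ always has the form "Mmm dd yyyy", so its length is fixed.
std::size_t WideBuildDateLength()
{
    return std::wcslen(WideBuildDate());
}
```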

To summarize, here is a brief example of what a program which can be compiled
in both ANSI and Unicode modes could look like:

@code
#ifdef __UNICODE__
    wchar_t wch = L'*';
    const wchar_t *ws = L"Hello, world!";
    size_t len = wcslen(ws);

    // %ls is the portable conversion for a wchar_t* argument
    wprintf(L"Compiled at %ls\n", __TDATE__);
#else // ANSI
    char ch = '*';
    const char *s = "Hello, world!";
    size_t len = strlen(s);

    printf("Compiled at %s\n", __DATE__);
#endif // Unicode/ANSI
@endcode

Of course, it would be nearly impossible to write such programs if it had to
be done this way (try to imagine the number of UNICODE checks an average
program would have!). Luckily, there is another way: see the next section.


@section overview_unicode_supportin Unicode Support in wxWidgets

In wxWidgets, the code fragment from above should instead be written as:

@code
wxChar ch = wxT('*');
wxString s = wxT("Hello, world!");
int len = s.Len();
@endcode

What happens here? First of all, you see that there are no more UNICODE checks
at all. Instead, we define some types and macros which behave differently in
the Unicode and ANSI builds and allow us to avoid using conditional compilation
in the program itself.

We have a @c wxChar type which maps to either @c char or @c wchar_t depending
on the mode in which the program is being compiled. There is no need for a
separate string type though, because the standard wxString supports Unicode,
i.e. it stores either ANSI or Unicode strings depending on the compile mode.

Finally, there is a special wxT() macro which should enclose all literal
strings in the program. As is easy to see by comparing the last fragment with
the one above, this macro leaves its argument unchanged in the (usual) ANSI
mode and prefixes @c 'L' to its argument in the Unicode mode.
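
The mechanism can be sketched in a few lines of standard C++. Everything named
@c Sketch or @c SKETCH below is hypothetical and merely stands in for
@c wxChar, wxT() and wxString; the real wxWidgets definitions may differ, and
the build mode is reduced here to a single made-up preprocessor switch:

```cpp
#include <string>

// Hypothetical switch standing in for the Unicode/ANSI build setting.
#ifndef SKETCH_UNICODE
    #define SKETCH_UNICODE 1
#endif

#if SKETCH_UNICODE
    typedef wchar_t SketchChar;
    #define SKETCH_T_HELPER(x) L##x   // Unicode: prefix the literal with L
#else
    typedef char SketchChar;
    #define SKETCH_T_HELPER(x) x      // ANSI: leave the literal unchanged
#endif
#define SKETCH_T(x) SKETCH_T_HELPER(x)

// A string type built on whichever character type was selected.
typedef std::basic_string<SketchChar> SketchString;

// The same source compiles in both modes without any conditional code.
SketchString MakeGreeting()
{
    SketchChar ch = SKETCH_T('*');
    SketchString s = SKETCH_T("Hello, world!");
    s += ch;
    return s;
}
```

The conditional compilation is confined to the handful of definitions at the
top, which is the point of the approach: user code such as @c MakeGreeting()
contains no checks at all.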

The important conclusion is that if you use @c wxChar instead of @c char, use
@c wxString instead of C style strings and don't forget to enclose all string
literals in the wxT() macro, your program automatically becomes (almost)
Unicode compliant!

Let us state the rules once again:

@li Always use @c wxChar instead of @c char.
@li Always enclose literal string constants in the wxT() macro unless they're
    already converted to the right representation (another standard wxWidgets
    macro, _(), does this, for example, so there is no need for wxT() in this
    case) or you intend to pass the constant directly to an external function
    which doesn't accept wide-character strings.
@li Use wxString instead of C style strings.


@section overview_unicode_supportout Unicode and the Outside World

We have seen that it is easy to write Unicode programs using wxWidgets types
and macros, but it has also been mentioned that this isn't quite enough.
Although everything works fine inside the program, things can get nasty when it
tries to communicate with the outside world which, sadly, often expects ANSI
strings. A notable exception is the entire Win32 API, which accepts either
Unicode or ANSI strings and thus makes it unnecessary to ever perform any
conversions in the program. GTK 2.0, on the other hand, only accepts UTF-8
strings.

To get an ANSI string from a wxString, you may use the mb_str() function, which
always returns an ANSI string regardless of the mode (whereas the usual c_str()
returns a pointer to the internal representation, which is either ANSI or
Unicode). More rarely used, but still useful, is the wc_str() function, which
always returns the Unicode string.
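
What such a conversion involves can be sketched with the standard C library
alone. The helper below is hypothetical and only approximates the idea behind
mb_str(): it converts a wide string to the multibyte encoding of the current C
locale, which is an assumption of this example and not a statement about how
wxString implements the conversion:

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: wide string -> multibyte string in the current
// C locale. Returns an empty string if a character cannot be converted.
std::string ToMultiByte(const std::wstring& wide)
{
    // Each wide character needs at most MB_CUR_MAX bytes, plus one byte
    // for the terminating zero.
    std::vector<char> buf(wide.size() * MB_CUR_MAX + 1);

    const std::size_t n = std::wcstombs(buf.data(), wide.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        return std::string(); // conversion failed

    return std::string(buf.data(), n);
}
```

Note that the conversion can fail or lose information when the target encoding
cannot represent some character, which is why the result should be checked.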

Sometimes it is also necessary to go from ANSI strings to wxStrings. In this
case, you can use the converter-constructor, as follows:

@code
const char* ascii_str = "Some text";
wxString str(ascii_str, wxConvUTF8);
@endcode

This code also compiles fine under a non-Unicode build of wxWidgets, but in
that case the converter is ignored.
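
To illustrate what a UTF-8 converter has to do with the bytes it is given,
here is a simplified, hypothetical decoder (@c DecodeUtf8 is not a wxWidgets
function). It is only a sketch: it checks the overall byte structure but does
not reject overlong sequences or surrogates, and it signals any error by
returning an empty result:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decode UTF-8 bytes into Unicode code points.
// Returns an empty vector on malformed or truncated input.
std::vector<std::uint32_t> DecodeUtf8(const std::string& bytes)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra;   // number of continuation bytes expected
        std::uint32_t cp;    // payload bits collected so far

        if (lead < 0x80)                { extra = 0; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else return std::vector<std::uint32_t>(); // invalid lead byte

        if (i + extra >= bytes.size())
            return std::vector<std::uint32_t>(); // sequence is truncated

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                return std::vector<std::uint32_t>(); // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```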

For more information about converters and Unicode, see @ref overview_mbconv.


@section overview_unicode_settings Unicode Related Compilation Settings

You should define @c wxUSE_UNICODE to 1 to compile your program in Unicode
mode. This currently works for wxMSW, wxGTK, wxMac and wxX11. If you compile
your program in ANSI mode, you can still define @c wxUSE_WCHAR_T to get some
limited support for the @c wchar_t type.

This will allow your program to perform conversions between Unicode strings and
ANSI ones (using @ref overview_mbconv "wxMBConv") and construct wxString
objects from Unicode strings (presumably read from some external file or
elsewhere).


@section overview_unicode_traps Traps for the Unwary

@li Casting the result of c_str() to @c void* now yields a pointer to @c char
    data, not to @c wxChar data.
@li Passing c_str(), mb_str() or wc_str() to variadic functions doesn't work.
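
The second trap can be illustrated without wxWidgets. Assuming the value
returned by such a function is a helper object which converts implicitly to a
character pointer (@c CStrProxy below is a hypothetical stand-in, not a
wxWidgets class), the conversion never happens when the object is passed
through the ellipsis of a variadic function, so an explicit cast is needed:

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for an object returned by a c_str()-like function.
class CStrProxy
{
public:
    explicit CStrProxy(const std::string& s) : m_str(s) {}

    // Implicit conversion used in ordinary, non-variadic contexts.
    operator const char*() const { return m_str.c_str(); }

private:
    std::string m_str;
};

std::string FormatGreeting(const CStrProxy& name)
{
    char buf[64];

    // Wrong: passing "name" itself would hand the whole object to "...",
    // where no implicit conversion to const char* is applied.
    //   std::snprintf(buf, sizeof buf, "Hello, %s!", name);

    // Right: convert explicitly before the variadic call.
    std::snprintf(buf, sizeof buf, "Hello, %s!",
                  static_cast<const char*>(name));
    return buf;
}
```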

*/