%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Name:        tunicode.tex
%% Purpose:     Overview of the Unicode support in wxWindows
%% Author:      Vadim Zeitlin
%% Modified by:
%% Created:     22.09.99
%% RCS-ID:      $Id$
%% Copyright:   (c) 1999 Vadim Zeitlin <zeitlin@dptmaths.ens-cachan.fr>
%% Licence:     wxWindows license
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Unicode support in wxWindows}\label{unicode}

This section briefly describes the state of the Unicode support in wxWindows.
Read it if you want to know more about how to write programs able to work with
characters from languages other than English.

\subsection{What is Unicode?}

Starting with release 2.1 wxWindows has support for compiling in Unicode mode
on the platforms which support it. Unicode is a standard for character
encoding which addresses the shortcomings of the previous 8-bit standards by
using at least 16 (and possibly 32) bits for encoding each character. This
allows for at least 65536 characters (what is called the BMP, or basic
multilingual plane) and possibly as many as $2^{32}$ of them, instead of the
usual 256, which is sufficient to encode all of the world's languages at once.
More details about Unicode may be found at {\tt www.unicode.org}.

% TODO expand on it, say that Unicode extends ASCII, mention ISO8859, ...

As this solution is obviously preferable to the previous ones (think of
incompatible encodings for the same language, locale chaos and so on), many
modern operating systems support it. Probably the first example is Windows NT,
which has used only Unicode internally since its very first version.

Writing internationalized programs is much easier with Unicode and, as the
support for it improves, it should become more and more so. Moreover, in the
Windows NT/2000 case, even programs which use only standard ASCII can profit
from using Unicode because they will work more efficiently - there will be no
need for the system to convert all strings the program uses to/from Unicode
each time a system call is made.

\subsection{Unicode and ANSI modes}

As not all platforms supported by wxWindows support Unicode (fully) yet, in
many cases it is unwise to write a program which can only work in a Unicode
environment. A better solution is to write programs in such a way that they may
be compiled either in ANSI (traditional) mode or in the Unicode one.

This can be achieved quite simply by using the means provided by wxWindows.
Basically, there are only a few things to watch out for:

\begin{itemize}
\item Character type ({\tt char} or {\tt wchar\_t})
\item Literal strings and characters (e.g. {\tt "Hello, world!"} or {\tt '*'})
\item String functions ({\tt strlen()}, {\tt strcpy()}, ...)
\item Special preprocessor tokens ({\tt \_\_FILE\_\_}, {\tt \_\_DATE\_\_} 
and {\tt \_\_TIME\_\_})
\end{itemize}

Let's look at them in order. First of all, each character in a Unicode
program takes 2 (or more) bytes instead of the usual one, so another type
should be used to store the characters ({\tt char} usually holds only 1 byte).
This type is called {\tt wchar\_t}, which stands for {\it wide-character type}.

Also, the string and character constants should be encoded using wide
characters ({\tt wchar\_t} type) which typically take $2$ or $4$ bytes instead
of {\tt char} which only takes one. This is achieved by using the standard C
(and C++) way: just prefix any string or character constant with the letter
{\tt 'L'} and it becomes a {\it long} constant, i.e. a wide-character one.
Note that {\tt 'L'} must come immediately before the opening quote; putting it
after the constant does not work.

Of course, the usual standard C functions don't work with {\tt wchar\_t}
strings, so another set of functions exists which do the same thing but accept
{\tt wchar\_t *} instead of {\tt char *}. For example, a function to get the
length of a wide-character string is called {\tt wcslen()} (compare with 
{\tt strlen()} - you see that the only difference is that the "str" prefix
standing for "string" has been replaced with "wcs" standing for "wide-character
string").
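
As a rough illustration of the points above, the following standard C++ sketch
compares the same text stored as a narrow and as a wide string. The function
names are invented for this example, and the number of bytes per
{\tt wchar\_t} depends on the compiler.

\begin{verbatim}
#include <cstddef>
#include <cstring>
#include <cwchar>

// Length in characters of the same text stored as narrow and wide strings.
std::size_t NarrowLength() { return std::strlen("Hello, world!"); }
std::size_t WideLength()   { return std::wcslen(L"Hello, world!"); }

// Storage in bytes needed for the text itself (without the terminating NUL).
std::size_t NarrowBytes() { return NarrowLength() * sizeof(char); }
std::size_t WideBytes()   { return WideLength() * sizeof(wchar_t); }
\end{verbatim}

Both lengths are 13 characters, while the wide version occupies
{\tt sizeof(wchar\_t)} times as many bytes.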

And finally, the standard preprocessor tokens enumerated above expand to ANSI
strings, but it is more likely that Unicode strings are wanted in the Unicode
build. wxWindows provides the macros {\tt \_\_TFILE\_\_}, {\tt \_\_TDATE\_\_} 
and {\tt \_\_TTIME\_\_} which behave exactly as the standard ones except that
they produce ANSI strings in the ANSI build and Unicode ones in the Unicode
build.
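
One way such macros could be built, shown here only as a sketch and not
necessarily how wxWindows defines them, is to paste the {\tt L} prefix onto the
string literal the standard token expands to. The {\tt EXAMPLE\_WIDEN} names
are hypothetical.

\begin{verbatim}
#include <cwchar>

// Hypothetical helper macros, named for this example only.
// Two levels are needed so that __FILE__ and __DATE__ are expanded to their
// string literals before the L prefix is pasted onto them.
#define EXAMPLE_WIDEN_IMPL(x) L ## x
#define EXAMPLE_WIDEN(x) EXAMPLE_WIDEN_IMPL(x)

// Wide-character versions of the standard narrow tokens.
const wchar_t *WideFileName()  { return EXAMPLE_WIDEN(__FILE__); }
const wchar_t *WideBuildDate() { return EXAMPLE_WIDEN(__DATE__); }
\end{verbatim}

With a single macro level the argument would not be expanded first, and the
result would be the invalid token {\tt L\_\_FILE\_\_}.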

To summarize, here is a brief example of what a program which can be compiled
in both ANSI and Unicode modes could look like:

\begin{verbatim}
#ifdef __UNICODE__
    wchar_t wch = L'*';
    const wchar_t *ws = L"Hello, world!";
    int len = wcslen(ws);

    // %ls denotes a wide string argument regardless of the C runtime
    wprintf(L"Compiled at %ls\n", __TDATE__);
#else // ANSI
    char ch = '*';
    const char *s = "Hello, world!";
    int len = strlen(s);

    printf("Compiled at %s\n", __DATE__);
#endif // Unicode/ANSI
\end{verbatim}

Of course, it would be nearly impossible to write such programs if it had to
be done this way (try to imagine the number of {\tt \#ifdef UNICODE} an average
program would have!). Luckily, there is another way - see the next
section.

\subsection{Unicode support in wxWindows}

In wxWindows, the code fragment above should instead be written as:

\begin{verbatim}
    wxChar ch = wxT('*');
    wxString s = wxT("Hello, world!");
    int len = s.Len();
\end{verbatim}

What happens here? First of all, you see that there are no more {\tt \#ifdef}s
at all. Instead, we define some types and macros which behave differently in
the Unicode and ANSI builds and allow us to avoid using conditional
compilation in the program itself.

We have a {\tt wxChar} type which maps to either {\tt char} or {\tt wchar\_t} 
depending on the mode in which the program is being compiled. There is no need
for a separate type for strings though, because the standard 
\helpref{wxString}{wxstring} supports Unicode, i.e. it stores either ANSI or
Unicode strings depending on the compile mode.

Finally, there is a special \helpref{wxT()}{wxt} macro which should enclose all
literal strings in the program. As is easy to see by comparing the last
fragment with the one above, this macro expands to nothing in the (usual) ANSI
mode and prefixes {\tt 'L'} to its argument in the Unicode mode.
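
The idea behind {\tt wxChar} and {\tt wxT()} can be sketched in plain C++ as
follows. This is a simplified illustration under assumed names
({\tt EXAMPLE\_UNICODE}, {\tt example\_char}, {\tt EXAMPLE\_T} and
{\tt ExampleStrlen} are all hypothetical), not the actual wxWindows
definitions.

\begin{verbatim}
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical names which only mimic the idea behind wxChar and wxT().
#ifdef EXAMPLE_UNICODE
    typedef wchar_t example_char;
    #define EXAMPLE_T(x) L ## x     // prefix the literal with L
    inline std::size_t ExampleStrlen(const example_char *s)
        { return std::wcslen(s); }
#else
    typedef char example_char;
    #define EXAMPLE_T(x) x          // leave the literal unchanged
    inline std::size_t ExampleStrlen(const example_char *s)
        { return std::strlen(s); }
#endif

// The same source line now compiles in both modes.
const example_char *Greeting() { return EXAMPLE_T("Hello, world!"); }
\end{verbatim}

The conditional compilation is still there, but it is confined to one header
instead of being repeated throughout the program.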

The important conclusion is that if you use {\tt wxChar} instead of 
{\tt char}, avoid using C style strings and use {\tt wxString} instead, and
don't forget to enclose all string literals inside the \helpref{wxT()}{wxt}
macro, your program automatically becomes (almost) Unicode compliant!

Let us state the rules once again:

\begin{itemize}
\item Always use {\tt wxChar} instead of {\tt char}
\item Always enclose literal string constants in the \helpref{wxT()}{wxt} macro
unless they're already converted to the right representation (another standard
wxWindows macro, \helpref{\_()}{underscore}, does it, for example, so there is
no need for {\tt wxT()} in this case) or you intend to pass the constant
directly to an external function which doesn't accept wide-character strings.
\item Use {\tt wxString} instead of C style strings.
\end{itemize}

\subsection{Unicode and the outside world}

We have seen that it is easy to write Unicode programs using wxWindows types
and macros, but it has also been mentioned that this isn't quite enough.
Although everything works fine inside the program, things can get nasty when
it tries to communicate with the outside world which, sadly, often expects
ANSI strings (a notable exception is the entire Win32 API which accepts either
Unicode or ANSI strings and which thus makes it unnecessary to ever perform
any conversions in the program). GTK 2.0, for its part, only accepts UTF-8
strings.

To get an ANSI string from a wxString, you may use the 
{\tt mb\_str()} function, which always returns an ANSI
string independently of the mode (while the usual 
\helpref{c\_str()}{wxstringcstr} returns a pointer to the internal
representation, which is either ANSI or Unicode). More rarely used, but still
useful, is the {\tt wc\_str()} function, which always returns
the Unicode string.
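
To give an idea of what such a conversion involves, here is a minimal sketch
using only the standard C library function {\tt wcstombs()}; it is a stand-in
for illustration and says nothing about how {\tt mb\_str()} is implemented.
The helper name {\tt ToMultiByte} is hypothetical, and the result depends on
the current C locale.

\begin{verbatim}
#include <cstdlib>
#include <string>
#include <vector>

// Convert a wide string to a multibyte string in the encoding of the current
// C locale. Returns an empty string if some character cannot be converted.
std::string ToMultiByte(const std::wstring &wide)
{
    // MB_CUR_MAX is the maximum number of bytes per character in this locale.
    std::vector<char> buf(wide.size() * MB_CUR_MAX + 1);
    const std::size_t n = std::wcstombs(buf.data(), wide.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        return std::string();
    return std::string(buf.data(), n);
}
\end{verbatim}

A conversion like this can fail or lose information whenever the target
encoding cannot represent a character, which is one reason to keep strings in
Unicode inside the program and convert only at its boundaries.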

% TODO describe fn_str(), wx_str(), wxCharBuf classes, ...

\subsection{Unicode-related compilation settings}

You should define {\tt wxUSE\_UNICODE} to $1$ to compile your program in
Unicode mode. Note that it currently only works in Win32 and GTK 2.0 and
that some parts of wxWindows are not Unicode-compliant yet (ODBC classes, for
example). If you compile your program in ANSI mode you can still define
{\tt wxUSE\_WCHAR\_T} to get some limited support for the {\tt wchar\_t} type.

Defining {\tt wxUSE\_WCHAR\_T} will allow your program to perform conversions
between Unicode strings and ANSI ones (using
\helpref{wxMBConv classes}{mbconvclasses}) 
and construct wxString objects from Unicode strings (presumably read
from some external file or elsewhere).
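
For illustration only, a conversion in the multibyte-to-wide direction can be
sketched with the standard C library function {\tt mbstowcs()}. This is not
the wxMBConv API; the helper name {\tt ToWide} is hypothetical and the input
is interpreted according to the current C locale.

\begin{verbatim}
#include <cstdlib>
#include <string>
#include <vector>

// Convert a multibyte string in the encoding of the current C locale to a
// wide string. Returns an empty string if the input is not valid.
std::wstring ToWide(const std::string &narrow)
{
    // A wide string never has more characters than the input has bytes.
    std::vector<wchar_t> buf(narrow.size() + 1);
    const std::size_t n = std::mbstowcs(buf.data(), narrow.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        return std::wstring();
    return std::wstring(buf.data(), n);
}
\end{verbatim}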