
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Name:        tunicode.tex
%% Purpose:     Overview of the Unicode support in wxWidgets
%% Author:      Vadim Zeitlin
%% Modified by:
%% Created:     22.09.99
%% RCS-ID:      $Id$
%% Copyright:   (c) 1999 Vadim Zeitlin <zeitlin@dptmaths.ens-cachan.fr>
%% Licence:     wxWindows license
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Unicode support in wxWidgets}\label{unicode}

This section briefly describes the state of the Unicode support in wxWidgets.
Read it if you want to know more about how to write programs able to work with
characters from languages other than English.

\subsection{What is Unicode?}\label{whatisunicode}

wxWidgets has support for compiling in Unicode mode
on the platforms which support it. Unicode is a standard for character
encoding which addresses the shortcomings of the previous 8-bit standards by
using at least 16 (and possibly 32) bits for encoding each character. This
allows for at least 65536 characters (what is called the BMP, or basic
multilingual plane) and possibly $2^{32}$ of them, instead of the usual 256,
which is sufficient to encode all of the world's languages at once. More
details about Unicode may be found at {\tt www.unicode.org}.

% TODO expand on it, say that Unicode extends ASCII, mention ISO8859, ...

As this solution is obviously preferable to the previous ones (think of
incompatible encodings for the same language, locale chaos and so on), many
modern operating systems support it. Probably the first example is Windows NT,
which has used only Unicode internally since its very first version.

Writing internationalized programs is much easier with Unicode and, as the
support for it improves, it should become more and more so. Moreover, in the
Windows NT/2000 case, even programs which use only standard ASCII can profit
from using Unicode because they will work more efficiently: there will be no
need for the system to convert all strings the program uses to/from Unicode
each time a system call is made.

\subsection{Unicode and ANSI modes}\label{unicodeandansi}

As not all platforms supported by wxWidgets support Unicode (fully) yet, in
many cases it is unwise to write a program which can only work in a Unicode
environment. A better solution is to write programs in such a way that they
may be compiled either in ANSI (traditional) mode or in Unicode mode.

This can be achieved quite simply by using the means provided by wxWidgets.
Basically, there are only a few things to watch out for:

\begin{itemize}
\item Character type ({\tt char} or {\tt wchar\_t})
\item Literal strings (i.e. {\tt "Hello, world!"} or {\tt '*'})
\item String functions ({\tt strlen()}, {\tt strcpy()}, ...)
\item Special preprocessor tokens ({\tt \_\_FILE\_\_}, {\tt \_\_DATE\_\_} 
and {\tt \_\_TIME\_\_})
\end{itemize}

Let's look at them in order. First of all, each character in a Unicode
program takes 2 (or, on some platforms, 4) bytes instead of the usual one, so
another type should be used to store the characters ({\tt char} usually holds
only 1 byte). This type is called {\tt wchar\_t}, which stands for
{\it wide-character type}.

Also, the string and character constants should be encoded using wide
characters ({\tt wchar\_t} type) which typically take $2$ or $4$ bytes instead
of {\tt char} which only takes one. This is achieved by using the standard C
(and C++) way: just put the letter {\tt L} immediately before any string or
character constant, with no space in between, and it becomes a {\it long}
constant, i.e. a wide-character one.
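
As an illustration of the {\tt L} prefix, the following minimal sketch uses
only standard C++ (no wxWidgets). The names {\tt literal\_length} and
{\tt check\_literals} are hypothetical helpers introduced for this example.
It shows that a narrow and a wide literal hold the same number of characters
while occupying a different number of bytes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: counts the characters in a string literal, excluding
// the terminating null. It works for both narrow (char) and wide (wchar_t)
// literals because the element type is deduced from the array.
template <typename Ch, std::size_t N>
std::size_t literal_length(const Ch (&)[N]) { return N - 1; }

void check_literals()
{
    char    narrow = '*';   // ordinary character constant
    wchar_t wide   = L'*';  // the L prefix makes it a wide-character constant

    assert(sizeof(narrow) == 1);              // char is one byte by definition
    assert(sizeof(wide) == sizeof(wchar_t));  // typically 2 or 4 bytes

    // Same text, same number of characters, different storage size.
    assert(literal_length("Hello, world!") == 13);
    assert(literal_length(L"Hello, world!") == 13);
    assert(sizeof(L"Hello, world!") == 14 * sizeof(wchar_t));
}
```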

Of course, the usual standard C functions don't work with {\tt wchar\_t}
strings, so another set of functions exists which do the same thing but accept
{\tt wchar\_t *} instead of {\tt char *}. For example, a function to get the
length of a wide-character string is called {\tt wcslen()} (compare with 
{\tt strlen()} - you see that the only difference is that the "str" prefix
standing for "string" has been replaced with "wcs" standing for "wide-character
string").
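
The naming correspondence between the two families of functions can be
sketched as follows, again in plain standard C++. The function name
{\tt check\_string\_functions} is hypothetical; the library calls themselves
are the standard ones from {\tt <cstring>} and {\tt <cwchar>}.

```cpp
#include <cassert>
#include <cstring>
#include <cwchar>

void check_string_functions()
{
    const char    *s  = "Hello, world!";
    const wchar_t *ws = L"Hello, world!";

    // Both functions return the number of characters, not bytes.
    assert(std::strlen(s) == 13);
    assert(std::wcslen(ws) == 13);

    // strcpy()/wcscpy() and strcmp()/wcscmp() pair up in the same way.
    char    buf[16];
    wchar_t wbuf[16];
    std::strcpy(buf, s);
    std::wcscpy(wbuf, ws);
    assert(std::strcmp(buf, s) == 0);
    assert(std::wcscmp(wbuf, ws) == 0);
}
```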

And finally, the standard preprocessor tokens enumerated above expand to ANSI
strings but it is more likely that Unicode strings are wanted in the Unicode
build. wxWidgets provides the macros {\tt \_\_TFILE\_\_}, {\tt \_\_TDATE\_\_} 
and {\tt \_\_TTIME\_\_} which behave exactly as the standard ones except that
they produce ANSI strings in the ANSI build and Unicode ones in the Unicode
build.
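
One common way to obtain wide versions of these tokens is a two-level
token-pasting macro. The sketch below shows that general technique in
standard C++; the {\tt MY\_} macros and the two accessor functions are
hypothetical and are not the wxWidgets definitions.

```cpp
#include <cassert>
#include <cwchar>

// Hypothetical helper macros, not the wxWidgets definitions. Two levels are
// needed so that __DATE__ is expanded to its string literal before the L
// prefix is pasted onto it.
#define MY_WIDEN_IMPL(x) L##x
#define MY_WIDEN(x)      MY_WIDEN_IMPL(x)

// Wide-character counterparts of the standard tokens.
#define MY_WIDE_FILE MY_WIDEN(__FILE__)
#define MY_WIDE_DATE MY_WIDEN(__DATE__)
#define MY_WIDE_TIME MY_WIDEN(__TIME__)

// __DATE__ has the form "Mmm dd yyyy" and __TIME__ the form "hh:mm:ss".
const wchar_t *build_date() { return MY_WIDE_DATE; }
const wchar_t *build_time() { return MY_WIDE_TIME; }
```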

To summarize, here is a brief example of what a program which can be compiled
in both ANSI and Unicode modes could look like:

\begin{verbatim}
#ifdef __UNICODE__
    wchar_t wch = L'*';
    const wchar_t *ws = L"Hello, world!";
    size_t len = wcslen(ws);    // length in characters, not bytes

    // %ls is the standard conversion for a wide string argument
    wprintf(L"Compiled at %ls\n", __TDATE__);
#else // ANSI
    char ch = '*';
    const char *s = "Hello, world!";
    size_t len = strlen(s);

    printf("Compiled at %s\n", __DATE__);
#endif // Unicode/ANSI
\end{verbatim}

Of course, it would be nearly impossible to write such programs if it had to
be done this way (try to imagine the number of {\tt \#ifdef}s an average
program would need!). Luckily, there is another way: see the next section.

\subsection{Unicode support in wxWidgets}\label{unicodeinsidewxw}

In wxWidgets, the code fragment from above should instead be written as follows:

\begin{verbatim}
    wxChar ch = wxT('*');
    wxString s = wxT("Hello, world!");
    int len = s.Len();
\end{verbatim}

What happens here? First of all, you see that there are no more {\tt \#ifdef}s
at all. Instead, we define some types and macros which behave differently in
the Unicode and ANSI builds and allow us to avoid using conditional
compilation in the program itself.

We have a {\tt wxChar} type which maps to either {\tt char} or {\tt wchar\_t}
depending on the mode in which the program is being compiled. There is no need
for a separate type for strings though, because the standard
\helpref{wxString}{wxstring} supports Unicode, i.e. it stores either ANSI or
Unicode strings depending on the compile mode.

Finally, there is a special \helpref{wxT()}{wxt} macro which should enclose all
literal strings in the program. As is easy to see by comparing the last
fragment with the one above, this macro leaves its argument unchanged in the
(usual) ANSI mode and prefixes {\tt L} to its argument in the Unicode mode.
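
To make the mechanism concrete, here is a minimal sketch of how such a
character type and literal macro could be defined. It is an illustration in
standard C++ only: {\tt MY\_UNICODE}, {\tt my\_char}, {\tt MY\_T} and
{\tt my\_strlen} are hypothetical names and this is not the actual wxWidgets
implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical names throughout: MY_UNICODE, my_char, MY_T and my_strlen are
// illustrations, not the actual wxWidgets definitions.
#ifdef MY_UNICODE
    typedef wchar_t my_char;
    #define MY_T(x) L##x    // Unicode build: prefix the literal with L
    inline std::size_t my_strlen(const my_char *s) { return std::wcslen(s); }
#else
    typedef char my_char;
    #define MY_T(x) x       // ANSI build: leave the literal untouched
    inline std::size_t my_strlen(const my_char *s) { return std::strlen(s); }
#endif

// Application code is written once and compiles in either mode.
std::size_t greeting_length()
{
    const my_char *s = MY_T("Hello, world!");
    return my_strlen(s);
}
```

Compiling the same file with and without {\tt -DMY\_UNICODE} switches the
character type, and the application function needs no {\tt \#ifdef} of its own.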

The important conclusion is that if you use {\tt wxChar} instead of
{\tt char}, avoid C style strings in favour of {\tt wxString}, and don't
forget to enclose all string literals inside the \helpref{wxT()}{wxt} macro,
your program automatically becomes (almost) Unicode compliant!

Let us state the rules once again:

\begin{itemize}
\item Always use {\tt wxChar} instead of {\tt char}
\item Always enclose literal string constants in the \helpref{wxT()}{wxt} macro
unless they're already converted to the right representation (another standard
wxWidgets macro \helpref{\_()}{underscore} does it, for example, so there is no
need for {\tt wxT()} in this case) or you intend to pass the constant directly
to an external function which doesn't accept wide-character strings.
\item Use {\tt wxString} instead of C style strings.
\end{itemize}

\subsection{Unicode and the outside world}\label{unicodeoutsidewxw}

We have seen that it is easy to write Unicode programs using wxWidgets types
and macros, but it has also been mentioned that this isn't quite enough.
Although everything works fine inside the program, things can get nasty when
it tries to communicate with the outside world which, sadly, often expects
ANSI strings (a notable exception is the entire Win32 API which accepts either
Unicode or ANSI strings and which thus makes it unnecessary to ever perform
any conversions in the program). GTK 2.0, on the other hand, accepts only
UTF-8 strings.

To get an ANSI string from a wxString, you may use the
{\tt mb\_str()} function, which always returns an ANSI
string independently of the mode (whereas the usual
\helpref{c\_str()}{wxstringcstr} returns a pointer to the internal
representation, which is either ANSI or Unicode). More rarely used, but still
useful, is the {\tt wc\_str()} function, which always returns
the Unicode string.

Sometimes it is also necessary to go from ANSI strings to wxStrings.  
In this case, you can use the converter-constructor, as follows:
 
\begin{verbatim}
   const char* ascii_str = "Some text";
   wxString str(ascii_str, wxConvUTF8);
\end{verbatim}

This code also compiles fine under a non-Unicode build of wxWidgets,
but in that case the converter is ignored.
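
To show what such a conversion involves underneath, the sketch below performs
the same round trip with the standard C library functions {\tt mbstowcs()}
and {\tt wcstombs()}, which use the current C locale. This is only an analogy
under that assumption: wxWidgets itself goes through its wxMBConv classes, and
{\tt to\_wide} and {\tt to\_narrow} are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: narrow -> wide, using the current C locale.
std::wstring to_wide(const std::string &narrow)
{
    // A multibyte string never yields more wide characters than it has bytes.
    std::wstring wide(narrow.size(), L'\0');
    std::size_t n = std::mbstowcs(&wide[0], narrow.c_str(), wide.size());
    if (n == static_cast<std::size_t>(-1))
        return std::wstring();              // invalid multibyte sequence
    wide.resize(n);
    return wide;
}

// Hypothetical helper: wide -> narrow, using the current C locale.
std::string to_narrow(const std::wstring &wide)
{
    // MB_CUR_MAX is the maximum number of bytes per character in this locale.
    std::string narrow(wide.size() * MB_CUR_MAX, '\0');
    std::size_t n = std::wcstombs(&narrow[0], wide.c_str(), narrow.size());
    if (n == static_cast<std::size_t>(-1))
        return std::string();               // character not representable
    narrow.resize(n);
    return narrow;
}
```

In the default {\tt "C"} locale only plain ASCII text is guaranteed to convert;
anything else depends on the locale selected with {\tt setlocale()}.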

For more information about converters and Unicode see
the \helpref{wxMBConv classes overview}{mbconvclasses}.

% TODO describe fn_str(), wx_str(), wxCharBuf classes, ...

\subsection{Unicode-related compilation settings}\label{unicodesettings}

You should define {\tt wxUSE\_UNICODE} to $1$ to compile your program in
Unicode mode. This currently works for wxMSW, wxGTK, wxMac and wxX11. If you
compile your program in ANSI mode, you can still define {\tt wxUSE\_WCHAR\_T}
to get some limited support for the {\tt wchar\_t} type.

This will allow your program to perform conversions between Unicode strings and
ANSI ones (using \helpref{wxMBConv classes}{mbconvclasses}) 
and construct wxString objects from Unicode strings (presumably read
from some external file or elsewhere).