Commit	Line	Data
	1	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
	2	%% Name: tunicode.tex
	3	%% Purpose: Overview of the Unicode support in wxWidgets
	4	%% Author: Vadim Zeitlin
	5	%% Modified by:
	6	%% Created: 22.09.99
	7	%% RCS-ID: $Id$
	8	%% Copyright: (c) 1999 Vadim Zeitlin <zeitlin@dptmaths.ens-cachan.fr>
	9	%% Licence: wxWindows license
	10	%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
	11
	12	\section{Unicode support in wxWidgets}\label{unicode}
	13
	14	This section briefly describes the state of the Unicode support in wxWidgets.
	15	Read it if you want to know more about how to write programs able to work with
	16	characters from languages other than English.
	17
	18	\subsection{What is Unicode?}\label{whatisunicode}
	19
	20	wxWidgets can be compiled in Unicode mode
	21	on the platforms which support it. Unicode is a character encoding
	22	standard which addresses the shortcomings of the earlier 8-bit standards by
	23	using at least 16 (and possibly 32) bits to encode each character. This
	24	allows for at least 65536 characters (the so-called BMP, or basic
	25	multilingual plane), and up to 1114112 code points in total, instead of the
	26	usual 256, which is sufficient to encode all of the world's languages at once.
	27	More details about Unicode may be found at {\tt www.unicode.org}.
	28
	29	% TODO expand on it, say that Unicode extends ASCII, mention ISO8859, ...
	30
	31	As this solution is obviously preferable to the previous ones (think of
	32	incompatible encodings for the same language, locale chaos and so on), many
	33	modern operating systems support it. Probably the first example is Windows NT,
	34	which has used only Unicode internally since its very first version.
	35
	36	Writing internationalized programs is much easier with Unicode and, as the
	37	support for it improves, it should become even more so. Moreover, in the
	38	Windows NT/2000 case, even programs which use only standard ASCII can profit
	39	from using Unicode because they will work more efficiently: there will be no
	40	need for the system to convert all strings the program uses to/from Unicode
	41	each time a system call is made.
	42
	43	\subsection{Unicode and ANSI modes}\label{unicodeandansi}
	44
	45	As not all platforms supported by wxWidgets support Unicode (fully) yet, in
	46	many cases it is unwise to write a program which can only work in a Unicode
	47	environment. A better solution is to write programs in such a way that they may
	48	be compiled either in ANSI (traditional) mode or in Unicode mode.
	49
	50	This can be achieved quite simply by using the means provided by wxWidgets.
	51	Basically, there are only a few things to watch out for:
	52
	53	\begin{itemize}
	54	\item Character type ({\tt char} or {\tt wchar\_t})
	55	\item Literal strings and characters (e.g. {\tt "Hello, world!"} or {\tt '*'})
	56	\item String functions ({\tt strlen()}, {\tt strcpy()}, ...)
	57	\item Special preprocessor tokens ({\tt \_\_FILE\_\_}, {\tt \_\_DATE\_\_}
	58	and {\tt \_\_TIME\_\_})
	59	\end{itemize}
	60
	61	Let's look at them in order. First of all, each character in a Unicode
	62	program takes 2 (or 4) bytes instead of the usual one, so another type should
	63	be used to store the characters ({\tt char} usually holds only 1 byte). This
	64	type is called {\tt wchar\_t}, which stands for {\it wide-character type}.
	65
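As a minimal illustration of the size difference, the following standalone
C++ sketch reports the sizes of the two character types. The helper names are
hypothetical and not part of wxWidgets, and the size of {\tt wchar\_t} is
platform dependent (commonly 2 bytes with Microsoft compilers and 4 bytes with
GCC on Unix systems).

\begin{verbatim}
#include <cassert>
#include <cstddef>

// Hypothetical helpers (not wxWidgets API): sizes of the character types.
inline std::size_t narrow_char_size() { return sizeof(char); }
inline std::size_t wide_char_size()   { return sizeof(wchar_t); }

// Checks that hold on every conforming implementation.
inline void check_char_sizes() {
    assert(narrow_char_size() == 1);                 // guaranteed by the language
    assert(wide_char_size() >= narrow_char_size());  // exact value varies
}
\end{verbatim}

Calling {\tt check\_char\_sizes()} from any function is enough to exercise the
sketch.
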
	66	Also, the string and character constants should be encoded using wide
	67	characters ({\tt wchar\_t} type) which typically take $2$ or $4$ bytes instead
	68	of {\tt char} which only takes one. This is achieved by using the standard C
	69	(and C++) way: just put the letter {\tt L} immediately before any string or
	70	character constant and it becomes a {\it long} constant, i.e. a wide character
	71	one. Note that {\tt L} is strictly a prefix: the language provides no
	72	equivalent suffix form.
	73
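For illustration, here is a small standalone sketch of narrow and wide
constants side by side, in plain C++ without wxWidgets. The variable and
function names are made up for this example.

\begin{verbatim}
#include <cassert>

const char     narrow_ch  = '*';               // narrow character constant
const wchar_t  wide_ch    = L'*';              // wide character constant
const char*    narrow_str = "Hello, world!";   // narrow string constant
const wchar_t* wide_str   = L"Hello, world!";  // wide string constant

inline void check_literals() {
    assert(narrow_ch == '*');
    assert(wide_ch == L'*');
    assert(narrow_str[0] == 'H');
    assert(wide_str[0] == L'H');
    // A literal with two characters occupies three elements because of
    // the terminating null; only the element size differs.
    assert(sizeof("ab") == 3);
    assert(sizeof(L"ab") == 3 * sizeof(wchar_t));
}
\end{verbatim}
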
	74	Of course, the usual standard C functions don't work with {\tt wchar\_t}
	75	strings, so another set of functions exists which do the same thing but accept
	76	{\tt wchar\_t *} instead of {\tt char *}. For example, the function to get the
	77	length of a wide-character string is called {\tt wcslen()} (compare with
	78	{\tt strlen()}: the only difference is that the ``str'' prefix
	79	standing for ``string'' has been replaced with ``wcs'' standing for
	80	``wide-character string'').
	81
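The correspondence between the two families of functions can be sketched as
follows, using only the standard library. The function name
{\tt check\_string\_functions()} is hypothetical.

\begin{verbatim}
#include <cassert>
#include <cstring>
#include <cwchar>

inline void check_string_functions() {
    const char*    s  = "Hello, world!";
    const wchar_t* ws = L"Hello, world!";

    // strlen() and wcslen() both count characters, not bytes.
    assert(std::strlen(s) == 13);
    assert(std::wcslen(ws) == 13);

    // wcscpy() and wcscmp() are the counterparts of strcpy() and strcmp().
    wchar_t buf[16];        // large enough: 13 characters + terminator
    std::wcscpy(buf, ws);
    assert(std::wcscmp(buf, ws) == 0);
}
\end{verbatim}
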
	82	And finally, the standard preprocessor tokens enumerated above expand to ANSI
	83	strings, but Unicode strings are more likely to be wanted in the Unicode
	84	build. wxWidgets provides the macros {\tt \_\_TFILE\_\_}, {\tt \_\_TDATE\_\_}
	85	and {\tt \_\_TTIME\_\_} which behave exactly as the standard ones except that
	86	they produce ANSI strings in the ANSI build and Unicode ones in the Unicode build.
	87
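One way such wide counterparts could be built from the standard tokens is
with the preprocessor's token-pasting operator. The sketch below is only an
illustration: the names {\tt MY\_WIDEN}, {\tt MY\_TDATE} and {\tt MY\_TTIME}
are hypothetical, and the actual wxWidgets definitions may differ.

\begin{verbatim}
#include <cassert>
#include <cwchar>

// Two-level macro: the argument (e.g. __DATE__) is first expanded to a
// string literal, and only then is L pasted in front of it.
#define MY_WIDEN2(x) L##x
#define MY_WIDEN(x)  MY_WIDEN2(x)

// Hypothetical wide counterparts of the standard tokens.
#define MY_TDATE MY_WIDEN(__DATE__)
#define MY_TTIME MY_WIDEN(__TIME__)

inline void check_wide_tokens() {
    const wchar_t* date_str = MY_TDATE;
    const wchar_t* time_str = MY_TTIME;
    assert(std::wcslen(date_str) == 11);  // __DATE__ has the form "Mmm dd yyyy"
    assert(std::wcslen(time_str) == 8);   // __TIME__ has the form "hh:mm:ss"
}
\end{verbatim}

The extra macro level matters: pasting {\tt L} directly onto the unexpanded
token would produce the identifier {\tt L\_\_DATE\_\_} rather than a wide
string literal.
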
	88	To summarize, here is a brief example of what a program which can be compiled
	89	in both ANSI and Unicode modes could look like:
	90
	91	\begin{verbatim}
	92	#ifdef __UNICODE__
	93	wchar_t wch = L'*';
	94	const wchar_t *ws = L"Hello, world!";
	95	size_t len = wcslen(ws);   // wcslen() returns size_t
	96
	97	wprintf(L"Compiled at %ls\n", __TDATE__);   // %ls takes a wide string
	98	#else // ANSI
	99	char ch = '*';
	100	const char *s = "Hello, world!";
	101	size_t len = strlen(s);    // strlen() returns size_t
	102
	103	printf("Compiled at %s\n", __DATE__);
	104	#endif // Unicode/ANSI
	105	\end{verbatim}
	106
	107	Of course, it would be nearly impossible to write such programs if it had to
	108	be done this way (try to imagine the number of {\tt \#ifdef}s an average
	109	program would need!). Luckily, there is another way, described in the next
	110	section.
	111
	112	\subsection{Unicode support in wxWidgets}\label{unicodeinsidewxw}
	113
	114	In wxWidgets, the code fragment from above should instead be written as:
	115
	116	\begin{verbatim}
	117	wxChar ch = wxT('*');
	118	wxString s = wxT("Hello, world!");
	119	size_t len = s.Len();
	120	\end{verbatim}
	121
	122	What happens here? First of all, you see that there are no more {\tt \#ifdef}s
	123	at all. Instead, we define some types and macros which behave differently in
	124	the Unicode and ANSI builds and allow us to avoid using conditional
	125	compilation in the program itself.
	126
	127	We have a {\tt wxChar} type which maps to either {\tt char} or {\tt wchar\_t}
	128	depending on the mode in which the program is being compiled. There is no need for
	129	a separate type for strings though, because the standard
	130	\helpref{wxString}{wxstring} supports Unicode, i.e. it stores either ANSI or
	131	Unicode strings depending on the compile mode.
	132
	133	Finally, there is a special \helpref{wxT()}{wxt} macro which should enclose all
	134	literal strings in the program. As is easy to see by comparing the last
	135	fragment with the one above, this macro leaves its argument unchanged in the
	136	(usual) ANSI mode and prefixes {\tt 'L'} to it in the Unicode mode.
	137
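The mechanism can be sketched in a few lines of standalone C++. This is an
assumption-laden illustration rather than the wxWidgets implementation: the
names {\tt MY\_UNICODE}, {\tt my\_char}, {\tt MY\_T} and {\tt my\_strlen} are
hypothetical stand-ins for the build setting, {\tt wxChar}, {\tt wxT()} and
the string length function respectively.

\begin{verbatim}
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical build switch standing in for the library's Unicode setting.
#ifndef MY_UNICODE
#define MY_UNICODE 1
#endif

#if MY_UNICODE
typedef wchar_t my_char;   // wide characters in the Unicode build
#define MY_T(x) L##x       // prefix the literal with L
inline std::size_t my_strlen(const my_char* s) { return std::wcslen(s); }
#else
typedef char my_char;      // narrow characters in the ANSI build
#define MY_T(x) x          // leave the literal unchanged
inline std::size_t my_strlen(const my_char* s) { return std::strlen(s); }
#endif

// The same source lines compile unchanged in either mode.
inline std::size_t hello_length() {
    const my_char* s = MY_T("Hello, world!");
    return my_strlen(s);
}
\end{verbatim}

Compiling the sketch with {\tt MY\_UNICODE} defined to $0$ or $1$ selects the
ANSI or the Unicode variant; {\tt hello\_length()} returns $13$ in both cases.
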
	138	The important conclusion is that if you use {\tt wxChar} instead of
	139	{\tt char}, use {\tt wxString} instead of C style strings, and
	140	enclose all string literals in the \helpref{wxT()}{wxt} macro, your
	141	program automatically becomes (almost) Unicode compliant!
	142
	143	Let us state the rules once again:
	144
	145	\begin{itemize}
	146	\item Always use {\tt wxChar} instead of {\tt char}
	147	\item Always enclose literal string constants in the \helpref{wxT()}{wxt} macro
	148	unless they're already converted to the right representation (another standard
	149	wxWidgets macro, \helpref{\_()}{underscore}, does this, for example, so there is
	150	no need for {\tt wxT()} in this case) or you intend to pass the constant directly
	151	to an external function which doesn't accept wide-character strings.
	152	\item Use {\tt wxString} instead of C style strings.
	153	\end{itemize}
	154
	155	\subsection{Unicode and the outside world}\label{unicodeoutsidewxw}
	156
	157	We have seen that it is easy to write Unicode programs using wxWidgets types
	158	and macros, but it has also been mentioned that this isn't quite enough.
	159	Although everything works fine inside the program, things can get nasty when
	160	it tries to communicate with the outside world which, sadly, often expects
	161	ANSI strings (a notable exception is the entire Win32 API which accepts either
	162	Unicode or ANSI strings and which thus makes it unnecessary to ever perform
	163	any conversions in the program). GTK 2.0, by contrast, only accepts UTF-8 strings.
	164
	165	To get an ANSI string from a wxString, you may use the
	166	{\tt mb\_str()} function which always returns an ANSI
	167	string (independently of the mode, whereas the usual
	168	\helpref{c\_str()}{wxstringcstr} returns a pointer to the internal
	169	representation which is either ANSI or Unicode). More rarely used, but still
	170	useful, is the {\tt wc\_str()} function which always returns
	171	a Unicode string.
	172
	173	Sometimes it is also necessary to go from ANSI strings to wxStrings.
	174	In this case, you can use the converter-constructor, as follows:
	175
	176	\begin{verbatim}
	177	const char* ascii_str = "Some text";
	178	wxString str(ascii_str, wxConvUTF8);
	179	\end{verbatim}
	180
	181	This code also compiles fine under a non-Unicode build of wxWidgets,
	182	but in that case the converter is ignored.
	183
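To make the two directions of conversion more concrete, here is a sketch of
the same idea using only the standard C library. It is not how wxWidgets
implements its conversions: {\tt to\_narrow()} and {\tt to\_wide()} are
hypothetical helpers, and they convert according to the current C locale
rather than through a converter object such as {\tt wxConvUTF8}.

\begin{verbatim}
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

// Wide -> multibyte, using the current C locale.
// Returns an empty string if some character cannot be represented.
inline std::string to_narrow(const std::wstring& ws) {
    // MB_CUR_MAX bytes per character is the worst case in this locale.
    std::string out(ws.size() * MB_CUR_MAX + 1, '\0');
    std::size_t n = std::wcstombs(&out[0], ws.c_str(), out.size());
    if (n == static_cast<std::size_t>(-1))
        return std::string();
    out.resize(n);
    return out;
}

// Multibyte -> wide, using the current C locale.
// Returns an empty string if the input is not valid in this locale.
inline std::wstring to_wide(const std::string& s) {
    // A wide string never needs more elements than the input has bytes.
    std::wstring out(s.size() + 1, L'\0');
    std::size_t n = std::mbstowcs(&out[0], s.c_str(), out.size());
    if (n == static_cast<std::size_t>(-1))
        return std::wstring();
    out.resize(n);
    return out;
}
\end{verbatim}

For plain ASCII text the round trip is lossless in any locale, for example
{\tt to\_wide(to\_narrow(L"Some text"))} gives back the original string.
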
	184	For more information about converters and Unicode see
	185	the \helpref{wxMBConv classes overview}{mbconvclasses}.
	186
	187	% TODO describe fn_str(), wx_str(), wxCharBuf classes, ...
	188
	189	\subsection{Unicode-related compilation settings}\label{unicodesettings}
	190
	191	You should define {\tt wxUSE\_UNICODE} to $1$ to compile your program in
	192	Unicode mode. This currently works for wxMSW, wxGTK, wxMac and wxX11. If you
	193	compile your program in ANSI mode you can still define {\tt wxUSE\_WCHAR\_T}
	194	to get some limited support for the {\tt wchar\_t} type.
	195
	196	This will allow your program to perform conversions between Unicode strings and
	197	ANSI ones (using \helpref{wxMBConv classes}{mbconvclasses})
	198	and construct wxString objects from Unicode strings (presumably read
	199	from some external file or elsewhere).
	200