%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Name:        tunicode.tex
%% Purpose:     Overview of the Unicode support in wxWidgets
%% Author:      Vadim Zeitlin
%% Modified by:
%% Created:     22.09.99
%% RCS-ID:      $Id$
%% Copyright:   (c) 1999 Vadim Zeitlin <zeitlin@dptmaths.ens-cachan.fr>
%% Licence:     wxWindows license
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Unicode support in wxWidgets}\label{unicode}

This section briefly describes the state of the Unicode support in wxWidgets.
Read it if you want to know more about how to write programs able to work with
characters from languages other than English.

\subsection{What is Unicode?}\label{whatisunicode}

wxWidgets can be compiled in Unicode mode on the platforms which support it.
Unicode is a standard for character encoding which addresses the shortcomings
of the previous 8-bit standards by using at least 16 (and possibly 32) bits
for encoding each character. Sixteen bits allow for 65536 characters (the
so-called BMP, or basic multilingual plane) instead of the usual 256, and the
full Unicode code space, which extends up to U+10FFFF, is sufficient to encode
all of the world's languages at once. More details about Unicode may be found
at {\tt www.unicode.org}.

% TODO expand on it, say that Unicode extends ASCII, mention ISO8859, ...

As this solution is obviously preferable to the previous ones (think of
incompatible encodings for the same language, locale chaos and so on), many
modern operating systems support it. Probably the first example is Windows NT,
which has used only Unicode internally since its very first version.

Writing internationalized programs is much easier with Unicode and, as the
support for it improves, it should become more and more so. Moreover, in the
Windows NT/2000 case, even a program which uses only standard ASCII can profit
from using Unicode because it will work more efficiently: there will be no
need for the system to convert all strings the program uses to/from Unicode
each time a system call is made.

\subsection{Unicode and ANSI modes}\label{unicodeandansi}

As not all platforms supported by wxWidgets support Unicode (fully) yet, in
many cases it is unwise to write a program which can only work in a Unicode
environment. A better solution is to write programs in such a way that they
may be compiled either in ANSI (traditional) mode or in the Unicode one.

This can be achieved quite simply by using the means provided by wxWidgets.
Basically, there are only a few things to watch out for:

\begin{itemize}
\item Character type ({\tt char} or {\tt wchar\_t})
\item Literal strings (i.e. {\tt "Hello, world!"} or {\tt '*'})
\item String functions ({\tt strlen()}, {\tt strcpy()}, ...)
\item Special preprocessor tokens ({\tt \_\_FILE\_\_}, {\tt \_\_DATE\_\_}
and {\tt \_\_TIME\_\_})
\end{itemize}

Let's look at them in order. First of all, each character in a Unicode
program takes 2 (or 4) bytes instead of the usual one, so another type should
be used to store the characters ({\tt char} usually holds only 1 byte). This
type is called {\tt wchar\_t}, which stands for {\it wide-character type}.

Also, the string and character constants should be encoded using wide
characters ({\tt wchar\_t} type) which typically take $2$ or $4$ bytes instead
of {\tt char} which only takes one. This is achieved by using the standard C
(and C++) way: just put the letter {\tt L} immediately before any string or
character constant and it becomes a {\it long} constant, i.e. a wide-character
one. The prefix must come before the opening quote; placing it after the
constant is not valid C or C++.

Of course, the usual standard C functions don't work with {\tt wchar\_t}
strings, so another set of functions exists which do the same thing but accept
{\tt wchar\_t *} instead of {\tt char *}. For example, a function to get the
length of a wide-character string is called {\tt wcslen()}. Compare it with
{\tt strlen()}: the only difference is that the "str" prefix standing for
"string" has been replaced with "wcs" standing for "wide-character string".

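As a minimal sketch of the correspondence between the two function families,
the following standard C++ fragment computes the length of the same text as a
narrow and as a wide string. The helper names {\tt narrow\_length},
{\tt wide\_length}, {\tt wide\_storage\_bytes} and {\tt check\_lengths} are
made up for this illustration and are not part of wxWidgets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>   // std::strlen
#include <cwchar>    // std::wcslen

// Length in characters of a narrow (char) string.
std::size_t narrow_length(const char* s) { return std::strlen(s); }

// Length in characters of a wide (wchar_t) string.
std::size_t wide_length(const wchar_t* ws) { return std::wcslen(ws); }

// Bytes needed to store a wide string including its terminating L'\0'.
// sizeof(wchar_t) is implementation-defined (commonly 2 or 4).
std::size_t wide_storage_bytes(const wchar_t* ws)
{
    return (std::wcslen(ws) + 1) * sizeof(wchar_t);
}

// Small self-check: both functions count characters, not bytes.
void check_lengths()
{
    assert(narrow_length("Hello, world!") == 13);
    assert(wide_length(L"Hello, world!") == 13);
}
```

Both functions count characters rather than bytes, so the storage size of a
wide string has to be computed with {\tt sizeof(wchar\_t)}, which depends on
the platform.
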
And finally, the standard preprocessor tokens enumerated above expand to ANSI
strings but it is more likely that Unicode strings are wanted in the Unicode
build. wxWidgets provides the macros {\tt \_\_TFILE\_\_}, {\tt \_\_TDATE\_\_}
and {\tt \_\_TTIME\_\_} which behave exactly as the standard ones except that
they produce ANSI strings in ANSI build and Unicode ones in the Unicode build.

To summarize, here is a brief example of what a program which can be compiled
in both ANSI and Unicode modes might look like:

\begin{verbatim}
#ifdef __UNICODE__
    wchar_t wch = L'*';
    const wchar_t *ws = L"Hello, world!";
    size_t len = wcslen(ws);    // length in characters, not bytes

    // %ls is the standard conversion for a wide string argument
    wprintf(L"Compiled at %ls\n", __TDATE__);
#else // ANSI
    char ch = '*';
    const char *s = "Hello, world!";
    size_t len = strlen(s);

    printf("Compiled at %s\n", __DATE__);
#endif // Unicode/ANSI
\end{verbatim}

Of course, it would be nearly impossible to write such programs if it had to
be done this way (try to imagine the number of {\tt \#ifdef UNICODE} an average
program would have!). Luckily, there is another way, described in the next
section.

\subsection{Unicode support in wxWidgets}\label{unicodeinsidewxw}

In wxWidgets, the code fragment from above should instead be written as:

\begin{verbatim}
    wxChar ch = wxT('*');
    wxString s = wxT("Hello, world!");
    size_t len = s.Len();
\end{verbatim}

What happens here? First of all, you see that there are no more {\tt \#ifdef}s
at all. Instead, we define some types and macros which behave differently in
the Unicode and ANSI builds and allow us to avoid using conditional
compilation in the program itself.

We have a {\tt wxChar} type which maps to either {\tt char} or {\tt wchar\_t}
depending on the mode in which the program is being compiled. There is no need
for a separate type for strings though, because the standard
\helpref{wxString}{wxstring} supports Unicode, i.e. it stores either ANSI or
Unicode strings depending on the compile mode.

Finally, there is a special \helpref{wxT()}{wxt} macro which should enclose all
literal strings in the program. As is easy to see by comparing the last
fragment with the one above, this macro expands to just its argument in the
(usual) ANSI mode and prefixes {\tt L} to its argument in the Unicode mode.

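To illustrate the idea, here is a minimal sketch in standard C++ of how a
character type and a literal macro of this kind could be defined. The names
{\tt MY\_UNICODE}, {\tt MyChar}, {\tt MY\_T}, {\tt my\_strlen} and
{\tt greeting\_length} are hypothetical and chosen for this example only; the
actual wxWidgets definitions may well differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>   // std::strlen
#include <cwchar>    // std::wcslen

// MY_UNICODE, MyChar, MY_T and my_strlen are hypothetical names for this
// sketch; they are not the wxWidgets definitions.
#ifdef MY_UNICODE
    typedef wchar_t MyChar;
    #define MY_T(x) L ## x      // paste the L prefix onto the literal
    inline std::size_t my_strlen(const MyChar* s) { return std::wcslen(s); }
#else
    typedef char MyChar;
    #define MY_T(x) x           // leave the literal unchanged
    inline std::size_t my_strlen(const MyChar* s) { return std::strlen(s); }
#endif

// The same source lines compile in both modes without any #ifdef here.
inline std::size_t greeting_length()
{
    const MyChar* s = MY_T("Hello, world!");
    return my_strlen(s);
}

// Small self-check valid in either mode.
inline void check_greeting()
{
    assert(greeting_length() == 13);
}
```

The conditional compilation is confined to one place, and the code using
{\tt MyChar} and {\tt MY\_T()} stays identical in both builds.
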
The important conclusion is that if you use {\tt wxChar} instead of
{\tt char}, avoid using C style strings and use {\tt wxString} instead and
don't forget to enclose all string literals inside the \helpref{wxT()}{wxt}
macro, your program automatically becomes (almost) Unicode compliant!

Let us state the rules once again:

\begin{itemize}
\item Always use {\tt wxChar} instead of {\tt char}
\item Always enclose literal string constants in the \helpref{wxT()}{wxt} macro
unless they're already converted to the right representation (another standard
wxWidgets macro \helpref{\_()}{underscore} does it, for example, so there is no
need for {\tt wxT()} in this case) or you intend to pass the constant directly
to an external function which doesn't accept wide-character strings.
\item Use {\tt wxString} instead of C style strings.
\end{itemize}

\subsection{Unicode and the outside world}\label{unicodeoutsidewxw}

We have seen that it is easy to write Unicode programs using wxWidgets types
and macros, but it has also been mentioned that this isn't quite enough.
Although everything works fine inside the program, things can get nasty when
it tries to communicate with the outside world which, sadly, often expects
ANSI strings (a notable exception is the entire Win32 API which accepts either
Unicode or ANSI strings and which thus makes it unnecessary to ever perform
any conversions in the program). GTK 2.0 only accepts UTF-8 strings.

To get an ANSI string from a wxString, you may use the {\tt mb\_str()}
function which always returns an ANSI string, independently of the mode. By
contrast, the usual \helpref{c\_str()}{wxstringcstr} returns a pointer to the
internal representation, which is either ANSI or Unicode. More rarely used, but
still useful, is the {\tt wc\_str()} function which always returns the Unicode
string.

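To give an idea of what such a conversion involves, here is a rough sketch
using only the standard C++ library and the current C locale. It is not how
wxWidgets implements {\tt mb\_str()} or {\tt wc\_str()}, which rely on the
wxMBConv classes; the helper names {\tt to\_narrow} and {\tt to\_wide} are
assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>    // std::wcsrtombs, std::mbsrtowcs, std::mbstate_t
#include <string>
#include <vector>

// Hypothetical helper: wide -> multibyte using the current C locale.
// Returns an empty string if some character cannot be represented.
std::string to_narrow(const std::wstring& ws)
{
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = ws.c_str();
    // With a null destination the call only reports the bytes needed.
    std::size_t n = std::wcsrtombs(nullptr, &src, 0, &state);
    if (n == static_cast<std::size_t>(-1))
        return std::string();
    std::vector<char> buf(n + 1);
    src = ws.c_str();
    state = std::mbstate_t();
    std::wcsrtombs(buf.data(), &src, buf.size(), &state);
    return std::string(buf.data(), n);
}

// Hypothetical helper: multibyte -> wide using the current C locale.
// Returns an empty string if the input is not a valid multibyte sequence.
std::wstring to_wide(const std::string& s)
{
    std::mbstate_t state = std::mbstate_t();
    const char* src = s.c_str();
    std::size_t n = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (n == static_cast<std::size_t>(-1))
        return std::wstring();
    std::vector<wchar_t> buf(n + 1);
    src = s.c_str();
    state = std::mbstate_t();
    std::mbsrtowcs(buf.data(), &src, buf.size(), &state);
    return std::wstring(buf.data(), n);
}
```

For plain ASCII text both directions round-trip in any locale; for other
characters the result depends on the locale in effect, which is one reason to
prefer an explicit converter.
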
Sometimes it is also necessary to go from ANSI strings to wxStrings.
In this case, you can use the converter-constructor, as follows:

\begin{verbatim}
    const char* ascii_str = "Some text";
    wxString str(ascii_str, wxConvUTF8);
\end{verbatim}

This code also compiles fine under a non-Unicode build of wxWidgets,
but in that case the converter is ignored.

For more information about converters and Unicode see
the \helpref{wxMBConv classes overview}{mbconvclasses}.

% TODO describe fn_str(), wx_str(), wxCharBuf classes, ...

\subsection{Unicode-related compilation settings}\label{unicodesettings}

You should define {\tt wxUSE\_UNICODE} to $1$ to compile your program in
Unicode mode. This currently works for wxMSW, wxGTK, wxMac and wxX11. If you
compile your program in ANSI mode you can still define {\tt wxUSE\_WCHAR\_T}
to get some limited support for the {\tt wchar\_t} type.

This will allow your program to perform conversions between Unicode strings and
ANSI ones (using \helpref{wxMBConv classes}{mbconvclasses})
and construct wxString objects from Unicode strings (presumably read
from some external file or elsewhere).