Commit | Line | Data |
---|---|---|
0c5d3e1c VZ |
1 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
2 | %% Name: tunicode.tex | |
3 | %% Purpose: Overview of the Unicode support in wxWindows | |
4 | %% Author: Vadim Zeitlin | |
5 | %% Modified by: | |
6 | %% Created: 22.09.99 | |
7 | %% RCS-ID: $Id$ | |
8 | %% Copyright: (c) 1999 Vadim Zeitlin <zeitlin@dptmaths.ens-cachan.fr> | |
9 | %% Licence: wxWindows license | |
10 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% | |
11 | ||
12 | \section{Unicode support in wxWindows}\label{unicode} | |
13 | ||
14 | This section briefly describes the state of the Unicode support in wxWindows. | |
15 | Read it if you want to know more about how to write programs able to work with | |
16 | characters from languages other than English. | |
17 | ||
18 | \subsection{What is Unicode?} | |
19 | ||
20 | Starting with release 2.1 wxWindows has support for compiling in Unicode mode | |
21 | on the platforms which support it. Unicode is a standard for character | |
22 | encoding which addresses the shortcomings of the previous, 8 bit standards, by | |
23 | using 16 bits for encoding each character. This allows for 65536 characters | |
24 | instead of the usual 256 and is sufficient to encode all of the world's | |
25 | languages at once. More details about Unicode may be found at {\tt www.unicode.org}. | |
26 | ||
27 | % TODO expand on it, say that Unicode extends ASCII, mention ISO8859, ... | |
28 | ||
29 | As this solution is obviously preferable to the previous ones (think of | |
30 | incompatible encodings for the same language, locale chaos and so on), many | |
31 | modern operating systems support it. Probably the first example is Windows NT | |
32 | which has used only Unicode internally since its very first version. | |
33 | ||
34 | Writing internationalized programs is much easier with Unicode and, as the | |
35 | support for it improves, it should become more and more so. Moreover, in the | |
36 | Windows NT/2000 case, even programs which use only standard ASCII can profit | |
37 | from using Unicode because they will work more efficiently - there will be no | |
38 | need for the system to convert all strings the program uses to/from Unicode | |
39 | each time a system call is made. | |
40 | ||
41 | \subsection{Unicode and ANSI modes} | |
42 | ||
43 | As not all platforms supported by wxWindows support Unicode (fully) yet, in | |
44 | many cases it is unwise to write a program which can only work in a Unicode | |
45 | environment. A better solution is to write programs in such a way that they may | |
46 | be compiled either in ANSI (traditional) mode or in the Unicode one. | |
47 | ||
48 | This can be achieved quite simply by using the means provided by wxWindows. | |
49 | Basically, there are only a few things to watch out for: | |
4c61bdab | 50 | |
0c5d3e1c VZ |
51 | \begin{itemize} |
52 | \item Character type ({\tt char} or {\tt wchar\_t}) | |
53 | \item Literal strings (i.e. {\tt "Hello, world!"} or {\tt '*'}) | |
54 | \item String functions ({\tt strlen()}, {\tt strcpy()}, ...) | |
55 | \end{itemize} | |
56 | ||
57 | Let's look at them in order. First of all, each character in a Unicode | |
58 | program takes 2 bytes instead of the usual one, so another type should be used to | |
59 | store the characters ({\tt char} only holds 1 byte usually). This type is | |
60 | called {\tt wchar\_t} which stands for {\it wide-character type}. | |
61 | ||
62 | Also, the string and character constants should be encoded on 2 bytes instead | |
63 | of one. This is achieved by using the standard C (and C++) way: just prefix | |
64 | any string or character constant with the letter {\tt L} and it becomes a | |
65 | {\it long} constant, i.e. a wide character one. Note that the {\tt L} must | |
66 | come before the opening quote: standard C and C++ do not accept it after the | |
67 | constant. | |
68 | ||
69 | Finally, the standard C functions don't work with {\tt wchar\_t} strings, so | |
70 | another set of functions exists which do the same thing but accept | |
71 | {\tt wchar\_t *} instead of {\tt char *}. For example, a function to get the | |
72 | length of a wide-character string is called {\tt wcslen()} (compare with | |
73 | {\tt strlen()} - you see that the only difference is that the "str" prefix | |
74 | standing for "string" has been replaced with "wcs" standing for | |
75 | "wide-character string"). | |
76 | ||
77 | To summarize, here is a brief example of what a program which can be compiled | |
78 | in both ANSI and Unicode modes might look like: | |
79 | ||
80 | \begin{verbatim} | |
81 | #ifdef __UNICODE__ | |
82 | wchar_t wch = L'*'; | |
83 | const wchar_t *ws = L"Hello, world!"; | |
84 | int len = wcslen(ws); | |
85 | #else // ANSI | |
86 | char ch = '*'; | |
87 | const char *s = "Hello, world!"; | |
88 | int len = strlen(s); | |
89 | #endif // Unicode/ANSI | |
90 | \end{verbatim} | |
91 | ||
92 | Of course, it would be nearly impossible to write such programs if it had to | |
93 | be done this way (try to imagine the number of {\tt #ifdef UNICODE} an average | |
94 | program would have!). Luckily, there is another way - see the next | |
95 | section. | |
96 | ||
97 | \subsection{Unicode support in wxWindows} | |
98 | ||
4c61bdab | 99 | In wxWindows, the code fragment from above should be written like this instead: |
0c5d3e1c VZ |
100 | |
101 | \begin{verbatim} | |
4c61bdab JS |
102 | wxChar ch = T('*'); |
103 | wxString s = T("Hello, world!"); | |
0c5d3e1c VZ |
104 | int len = s.Len(); |
105 | \end{verbatim} | |
106 | ||
107 | What happens here? First of all, you see that there are no more {\tt #ifdef}s | |
108 | at all. Instead, we define some types and macros which behave differently in | |
109 | the Unicode and ANSI builds and allow us to avoid using conditional | |
110 | compilation in the program itself. | |
111 | ||
112 | We have a {\tt wxChar} type which maps to either {\tt char} or {\tt wchar\_t} | |
113 | depending on the mode in which the program is being compiled. There is no need for | |
114 | a separate type for strings though, because the standard | |
115 | \helpref{wxString}{wxstring} supports Unicode, i.e. it stores either ANSI or | |
116 | Unicode strings depending on the mode. | |
117 | ||
4c61bdab | 118 | Finally, there is a special {\tt T()} macro which should enclose all literal |
0c5d3e1c VZ |
119 | strings in the program. As is easy to see by comparing the last fragment with | |
120 | the one above, this macro expands to nothing in the (usual) ANSI mode and | |
121 | prefixes {\tt 'L'} to its argument in the Unicode mode. | |
122 | ||
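The mechanism described above can be sketched in plain C++. This is only an illustration under stated assumptions: {\tt my\_char}, {\tt MY\_T()}, {\tt my\_strlen()} and the {\tt MY\_UNICODE} symbol are hypothetical names invented for this example and are not the actual wxWindows definitions of {\tt wxChar} and {\tt T()}.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for wxChar and T(); the real wxWindows
// definitions may differ. MY_UNICODE is an assumed build symbol.
#ifdef MY_UNICODE
    typedef wchar_t my_char;
    // Token pasting glues L onto the literal: MY_T("x") becomes L"x".
    #define MY_T(x) L ## x
    inline std::size_t my_strlen(const my_char *s)
    {
        assert(s != NULL);
        return std::wcslen(s);
    }
#else // ANSI
    typedef char my_char;
    // In the ANSI build the macro leaves the literal unchanged.
    #define MY_T(x) x
    inline std::size_t my_strlen(const my_char *s)
    {
        assert(s != NULL);
        return std::strlen(s);
    }
#endif // Unicode/ANSI

// The same source line compiles in both builds, without any #ifdef
// at the point of use.
inline std::size_t greeting_length()
{
    const my_char *s = MY_T("Hello, world!");
    return my_strlen(s);
}
```

The conditional compilation is confined to one header-like block, which is the point of the design: user code only ever mentions the neutral names.
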
123 | The important conclusion is that if you use {\tt wxChar} instead of | |
124 | {\tt char}, avoid using C style strings and use {\tt wxString} instead and | |
4c61bdab | 125 | don't forget to enclose all string literals inside the {\tt T()} macro, your |
0c5d3e1c VZ |
126 | program automatically becomes (almost) Unicode compliant! |
127 | ||
128 | Let us state the rules once again: | |
4c61bdab | 129 | |
0c5d3e1c VZ |
130 | \begin{itemize} |
131 | \item Always use {\tt wxChar} instead of {\tt char} | |
4c61bdab | 132 | \item Always enclose literal string constants in the {\tt T()} macro unless |
0c5d3e1c | 133 | they're already converted to the right representation (another standard |
4c61bdab | 134 | wxWindows macro {\tt \_()} does it, so there is no need for {\tt T()} in this |
0c5d3e1c VZ |
135 | case) or you intend to pass the constant directly to an external function |
136 | which doesn't accept wide-character strings. | |
137 | \item Use {\tt wxString} instead of C style strings. | |
138 | \end{itemize} | |
139 | ||
140 | \subsection{Unicode and the outside world} | |
141 | ||
142 | We have seen that it is easy to write Unicode programs using wxWindows types | |
143 | and macros, but it has also been mentioned that it isn't quite enough. | |
144 | Although everything works fine inside the program, things can get nasty when | |
145 | it tries to communicate with the outside world which, sadly, often expects | |
146 | ANSI strings (a notable exception is the entire Win32 API which accepts either | |
147 | Unicode or ANSI strings and which thus makes it unnecessary to ever perform | |
148 | any conversions in the program). | |
149 | ||
150 | To get an ANSI string from a wxString, you may use the | |
151 | \helpref{mb\_str()}{wxstringmbstr} function which always returns an ANSI | |
152 | string (independently of the mode - while the usual | |
153 | \helpref{c\_str()}{wxstringcstr} returns a pointer to the internal | |
154 | representation which is either ANSI or Unicode). More rarely used, but still | |
155 | useful, is the \helpref{wc\_str()}{wxstringwcstr} function which always returns | |
156 | the Unicode string. | |
157 | ||
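To give an idea of the kind of conversion involved when a wide-character string has to be handed to an ANSI-only function, here is a minimal sketch using only the standard C library function {\tt wcstombs()}. It is not how {\tt mb\_str()} is implemented in wxWindows; the helper name {\tt to\_multibyte()} is hypothetical, and the result depends on the current C locale.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cwchar>
#include <string>
#include <vector>

// Hypothetical helper, not part of wxWindows: converts a wide-character
// string to a multibyte string in the current C locale.
// Returns an empty string if some character cannot be converted, so an
// empty result is ambiguous for non-empty input in this simple sketch.
inline std::string to_multibyte(const wchar_t *ws)
{
    assert(ws != NULL);

    // Each wide character needs at most MB_CUR_MAX bytes; add one byte
    // for the terminating NUL.
    const std::size_t capacity = std::wcslen(ws) * MB_CUR_MAX + 1;
    std::vector<char> buf(capacity);

    // wcstombs() returns the number of bytes written, not counting the
    // terminating NUL, or (size_t)-1 on a conversion error.
    const std::size_t written = std::wcstombs(&buf[0], ws, capacity);
    if (written == static_cast<std::size_t>(-1))
        return std::string();

    return std::string(&buf[0], written);
}
```

For plain ASCII input such as {\tt L"Hello"} the result is the same text as an ordinary {\tt char} string; characters outside the current locale's character set make the conversion fail, which is exactly the kind of loss to keep in mind when leaving the Unicode world.
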
158 | % TODO describe fn_str(), wx_str(), wxCharBuf classes, ... | |
4c61bdab JS |
159 | % Please remember to put a blank line at the end of each file! (Tex2RTF 'issue') |
160 |