Commit | Line | Data |
---|---|---|
15b6757b | 1 | ///////////////////////////////////////////////////////////////////////////// |
2cd3cc94 | 2 | // Name: unicode.h |
15b6757b FM |
3 | // Purpose: topic overview |
4 | // Author: wxWidgets team | |
5 | // RCS-ID: $Id$ | |
6 | // Licence: wxWindows license | |
7 | ///////////////////////////////////////////////////////////////////////////// | |
8 | ||
9 | /*! | |
36c9828f | 10 | |
2cd3cc94 BP |
11 | @page overview_unicode Unicode Support in wxWidgets |
12 | ||
13 | This section briefly describes the state of the Unicode support in wxWidgets. | |
14 | Read it if you want to know more about how to write programs able to work with | |
15 | characters from languages other than English. | |
36c9828f | 16 | |
2cd3cc94 BP |
17 | @li @ref overview_unicode_what |
18 | @li @ref overview_unicode_ansi | |
19 | @li @ref overview_unicode_supportin | |
20 | @li @ref overview_unicode_supportout | |
21 | @li @ref overview_unicode_settings | |
22 | @li @ref overview_unicode_traps | |
36c9828f | 23 | |
36c9828f | 24 | |
2cd3cc94 | 25 | <hr> |
36c9828f FM |
26 | |
27 | ||
2cd3cc94 BP |
28 | @section overview_unicode_what What is Unicode? |
29 | ||
30 | wxWidgets has support for compiling in Unicode mode on the platforms which | |
31 | support it. Unicode is a standard for character encoding which addresses the | |
32 | shortcomings of the previous 8-bit standards by using at least 16 (and | |
33 | possibly 32) bits to encode each character. This allows for at least | |
34 | 65536 characters (what is called the BMP, or Basic Multilingual Plane) and | |
35 | possibly 2^32 of them instead of the usual 256, which is sufficient to encode all | |
36 | of the world's languages at once. More details about Unicode may be found at | |
37 | <http://www.unicode.org/>. | |
38 | ||
39 | As this solution is obviously preferable to the previous ones (think of | |
40 | incompatible encodings for the same language, locale chaos and so on), many | |
41 | modern operating systems support it. Probably the first example is Windows NT, | |
42 | which has used only Unicode internally since its very first version. | |
43 | ||
44 | Writing internationalized programs is much easier with Unicode and, as the | |
45 | support for it improves, it should become more and more so. Moreover, in the | |
46 | Windows NT/2000 case, even programs which use only standard ASCII can | |
47 | profit from using Unicode because they will work more efficiently: there will | |
48 | be no need for the system to convert all strings the program uses to/from | |
49 | Unicode each time a system call is made. | |
50 | ||
51 | ||
52 | @section overview_unicode_ansi Unicode and ANSI Modes | |
53 | ||
54 | As not all platforms supported by wxWidgets support Unicode (fully) yet, in | |
55 | many cases it is unwise to write a program which can only work in a Unicode | |
56 | environment. A better solution is to write programs in such a way that they may | |
57 | be compiled either in ANSI (traditional) mode or in the Unicode one. | |
58 | ||
59 | This can be achieved quite simply by using the means provided by wxWidgets. | |
60 | Basically, there are only a few things to watch out for: | |
61 | ||
62 | - Character type (@c char or @c wchar_t) | |
63 | - Literal strings (i.e. @c "Hello, world!" or @c '*') | |
64 | - String functions (@c strlen(), @c strcpy(), ...) | |
65 | - Special preprocessor tokens (@c __FILE__, @c __DATE__ and @c __TIME__) | |
66 | ||
67 | Let's look at them in order. First of all, each character in a Unicode program | |
68 | typically takes 2 or 4 bytes instead of the usual one, so another type should be used to store the | |
69 | characters (@c char usually holds only 1 byte). This type is called @c wchar_t | |
70 | which stands for @e wide-character type. | |
71 | ||
72 | Also, the string and character constants should be encoded using wide | |
73 | characters (@c wchar_t type) which typically take 2 or 4 bytes instead of | |
74 | @c char which only takes one. This is achieved by using the standard C (and | |
75 | C++) way: just put the letter @c 'L' immediately before any string or character | |
76 | constant and it becomes a @e long constant, i.e. a wide character one. Note that | |
77 | the @c 'L' must be a prefix: putting it after the constant is not valid C or | |
78 | C++. | |
79 | ||
80 | Of course, the usual standard C functions don't work with @c wchar_t strings, | |
81 | so another set of functions exists which do the same thing but accept | |
82 | @c wchar_t* instead of @c char*. For example, a function to get the length of a | |
83 | wide-character string is called @c wcslen() (compare with @c strlen() - you see | |
84 | that the only difference is that the "str" prefix standing for "string" has | |
85 | been replaced with "wcs" standing for "wide-character string"). | |
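
As a rough illustration of the character type, literal and string function points above, here is a minimal sketch in standard C++ only (it does not use wxWidgets). The names `narrow_text`, `wide_text`, `narrow_length`, `wide_length` and `wide_bytes` are hypothetical and exist only for this example. It shows that the character count is the same for both string kinds, while the storage size depends on the platform's `wchar_t`.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// The same text as a narrow (char) and a wide (wchar_t) literal.
// Note that the L goes *before* the opening quote.
static const char    *narrow_text = "Hello, world!";
static const wchar_t *wide_text   = L"Hello, world!";

// Length in characters, using the function matching each character type.
inline std::size_t narrow_length() { return std::strlen(narrow_text); }
inline std::size_t wide_length()   { return std::wcslen(wide_text); }

// Storage needed for the wide string in bytes (without the terminator).
// sizeof(wchar_t) is platform dependent, so no fixed value is assumed here.
inline std::size_t wide_bytes() { return wide_length() * sizeof(wchar_t); }
```

Both length functions report the same number of characters; only `wide_bytes()` varies between platforms.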
86 | ||
87 | And finally, the standard preprocessor tokens enumerated above expand to ANSI | |
88 | strings but it is more likely that Unicode strings are wanted in the Unicode | |
89 | build. wxWidgets provides the macros @c __TFILE__, @c __TDATE__ and | |
90 | @c __TTIME__ which behave exactly as the standard ones except that they produce | |
91 | ANSI strings in ANSI build and Unicode ones in the Unicode build. | |
92 | ||
93 | To summarize, here is a brief example of what a program which can be compiled | |
94 | in both ANSI and Unicode modes could look like: | |
95 | ||
96 | @code | |
97 | #ifdef __UNICODE__ | |
98 | wchar_t wch = L'*'; | |
99 | const wchar_t *ws = L"Hello, world!"; | |
100 | size_t len = wcslen(ws); | |
101 | ||
102 | wprintf(L"Compiled at %ls\n", __TDATE__); // %ls: wide string argument | |
103 | #else // ANSI | |
104 | char ch = '*'; | |
105 | const char *s = "Hello, world!"; | |
106 | size_t len = strlen(s); | |
107 | ||
108 | printf("Compiled at %s\n", __DATE__); | |
109 | #endif // Unicode/ANSI | |
110 | @endcode | |
111 | ||
112 | Of course, it would be nearly impossible to write such programs if it had to | |
3863c5eb | 113 | be done this way (try to imagine the number of UNICODE checks an average |
2cd3cc94 BP |
114 | program would have!). Luckily, there is another way - see the next section. |
115 | ||
116 | ||
117 | @section overview_unicode_supportin Unicode Support in wxWidgets | |
118 | ||
119 | In wxWidgets, the code fragment from above should instead be written as: | |
120 | ||
121 | @code | |
122 | wxChar ch = wxT('*'); | |
123 | wxString s = wxT("Hello, world!"); | |
124 | size_t len = s.Len(); | |
125 | @endcode | |
126 | ||
127 | What happens here? First of all, you see that there are no more UNICODE checks | |
128 | at all. Instead, we define some types and macros which behave differently in | |
129 | the Unicode and ANSI builds and allow us to avoid using conditional compilation | |
130 | in the program itself. | |
131 | ||
132 | We have a @c wxChar type which maps to either @c char or @c wchar_t depending | |
133 | on the mode in which the program is being compiled. There is no need for a separate | |
134 | type for strings though, because the standard wxString supports Unicode, i.e. | |
135 | it stores either ANSI or Unicode strings depending on the compile mode. | |
136 | ||
137 | Finally, there is a special wxT() macro which should enclose all literal | |
138 | strings in the program. As is easy to see by comparing the last fragment with | |
139 | the one above, this macro expands to nothing in the (usual) ANSI mode and | |
140 | prefixes @c 'L' to its argument in the Unicode mode. | |
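
To make the mechanism more concrete, the following is a minimal sketch of how such a type and macro pair could be built. It is not the actual wxWidgets definition: `MY_UNICODE`, `MyChar`, `MY_T`, `MY_T_HELPER` and `MyStrlen` are hypothetical names invented for this example, and only standard C++ is used.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical build switch; a real project would set it from the build system.
// #define MY_UNICODE 1

#if defined(MY_UNICODE) && MY_UNICODE
    typedef wchar_t MyChar;
    // Token pasting turns "text" into L"text" and '*' into L'*'.
    #define MY_T_HELPER(x) L ## x
    #define MY_T(x) MY_T_HELPER(x)
    inline std::size_t MyStrlen(const MyChar *s) { return std::wcslen(s); }
#else
    typedef char MyChar;
    // In the narrow build the macro expands to its argument unchanged.
    #define MY_T(x) x
    inline std::size_t MyStrlen(const MyChar *s) { return std::strlen(s); }
#endif

// These lines compile unchanged in either mode, with no #ifdef at the use site.
static const MyChar *greeting = MY_T("Hello, world!");
static const MyChar  star     = MY_T('*');
```

The conditional compilation is confined to one place, so the rest of the program can be written once against `MyChar` and `MY_T()`.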
141 | ||
142 | The important conclusion is that if you use @c wxChar instead of @c char, avoid | |
143 | using C style strings and use @c wxString instead, and don't forget to enclose | |
144 | all string literals inside the wxT() macro, your program automatically becomes | |
145 | (almost) Unicode compliant! | |
146 | ||
147 | Let us state the rules once again: | |
148 | ||
149 | @li Always use wxChar instead of @c char | |
150 | @li Always enclose literal string constants in the wxT() macro unless they're | |
151 | already converted to the right representation (another standard wxWidgets | |
152 | macro _() does it, for example, so there is no need for wxT() in this case) | |
153 | or you intend to pass the constant directly to an external function which | |
154 | doesn't accept wide-character strings. | |
155 | @li Use wxString instead of C style strings. | |
156 | ||
157 | ||
158 | @section overview_unicode_supportout Unicode and the Outside World | |
159 | ||
160 | We have seen that it is easy to write Unicode programs using wxWidgets types | |
161 | and macros, but it has also been mentioned that this isn't quite enough. Although | |
162 | everything works fine inside the program, things can get nasty when it tries to | |
163 | communicate with the outside world which, sadly, often expects ANSI strings (a | |
164 | notable exception is the entire Win32 API which accepts either Unicode or ANSI | |
165 | strings and which thus makes it unnecessary to ever perform any conversions in | |
166 | the program). GTK 2.0 only accepts UTF-8 strings. | |
167 | ||
168 | To get an ANSI string from a wxString, you may use the mb_str() function which | |
169 | always returns an ANSI string (independently of the mode - while the usual | |
170 | c_str() returns a pointer to the internal representation which is either ANSI | |
171 | or Unicode). More rarely used, but still useful, is the wc_str() function which | |
172 | always returns a Unicode string. | |
173 | ||
174 | Sometimes it is also necessary to go from ANSI strings to wxStrings. In this | |
175 | case, you can use the converter-constructor, as follows: | |
176 | ||
177 | @code | |
178 | const char* ascii_str = "Some text"; | |
179 | wxString str(ascii_str, wxConvUTF8); | |
180 | @endcode | |
181 | ||
182 | This code also compiles fine under a non-Unicode build of wxWidgets, but in | |
183 | that case the converter is ignored. | |
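
For readers curious about what a conversion between the two representations involves, here is a minimal sketch using only the standard C library functions `wcstombs()` and `mbstowcs()`. It is not how wxWidgets converters are implemented, the helper names `to_multibyte` and `to_wide` are hypothetical, and the result depends on the current C locale (in the default "C" locale only plain ASCII text is guaranteed to convert).

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: wide string -> multibyte string in the current C locale.
// Returns an empty string if some character cannot be converted.
inline std::string to_multibyte(const std::wstring &ws)
{
    // MB_CUR_MAX is the maximum number of bytes per character in this locale.
    std::vector<char> buf(ws.size() * MB_CUR_MAX + 1);
    const std::size_t n = std::wcstombs(buf.data(), ws.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        return std::string();
    return std::string(buf.data(), n);
}

// Hypothetical helper: multibyte string -> wide string in the current C locale.
// Returns an empty string if the input is not valid in this locale.
inline std::wstring to_wide(const std::string &s)
{
    // A wide string never has more characters than the input has bytes.
    std::vector<wchar_t> buf(s.size() + 1);
    const std::size_t n = std::mbstowcs(buf.data(), s.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        return std::wstring();
    return std::wstring(buf.data(), n);
}
```

Because both helpers signal failure with an empty result, a caller that needs to distinguish a failed conversion from an empty input would have to check the input separately.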
184 | ||
185 | For more information about converters and Unicode, see @ref overview_mbconv. | |
186 | ||
187 | ||
188 | @section overview_unicode_settings Unicode Related Compilation Settings | |
189 | ||
190 | You should define @c wxUSE_UNICODE to 1 to compile your program in Unicode | |
191 | mode. This currently works for wxMSW, wxGTK, wxMac and wxX11. If you compile | |
192 | your program in ANSI mode you can still define @c wxUSE_WCHAR_T to get some | |
193 | limited support for @c wchar_t type. | |
194 | ||
195 | This will allow your program to perform conversions between Unicode strings and | |
196 | ANSI ones (using @ref overview_mbconv "wxMBConv") and construct wxString | |
197 | objects from Unicode strings (presumably read from some external file or | |
198 | elsewhere). | |
199 | ||
200 | ||
201 | @section overview_unicode_traps Traps for the Unwary | |
202 | ||
203 | @li Casting the result of c_str() to void* now yields a char* pointer, not a wxChar* one. | |
204 | @li Passing c_str(), mb_str() or wc_str() to variadic functions doesn't work. | |
205 | ||
206 | */ | |
207 |