Commit | Line | Data |
---|---|---|
15b6757b FM |
1 | ///////////////////////////////////////////////////////////////////////////// |
2 | // Name: unicode | |
3 | // Purpose: topic overview | |
4 | // Author: wxWidgets team | |
5 | // RCS-ID: $Id$ | |
6 | // Licence: wxWindows license | |
7 | ///////////////////////////////////////////////////////////////////////////// | |
8 | ||
9 | /*! | |
36c9828f | 10 | |
75b31b23 | 11 | @page overview_unicode Unicode support in wxWidgets |
36c9828f | 12 | |
15b6757b FM |
13 | This section briefly describes the state of the Unicode support in wxWidgets. |
14 | Read it if you want to know more about how to write programs able to work with | |
15 | characters from languages other than English. | |
36c9828f | 16 | |
7fa3c420 SC |
17 | @li @ref overview_whatisunicode |
18 | @li @ref overview_unicodeandansi | |
19 | @li @ref overview_unicodeinsidewxw | |
20 | @li @ref overview_unicodeoutsidewxw | |
21 | @li @ref overview_unicodesettings | |
22 | @li @ref overview_topic8 | |
36c9828f | 23 | |
7fa3c420 SC |
24 | |
25 | @section overview_whatisunicode What is Unicode? | |
36c9828f | 26 | |
15b6757b FM |
27 | wxWidgets has support for compiling in Unicode mode |
28 | on the platforms which support it. Unicode is a standard for character | |
29 | encoding which addresses the shortcomings of the previous 8-bit standards by | |
30 | using at least 16 (and possibly 32) bits for encoding each character. This | |
31 | allows at least 65536 characters (what is called the BMP, or basic | |
32 | multilingual plane) and possibly 2^32 of them instead of the usual 256 and | |
33 | is sufficient to encode all of the world's languages at once. More details about | |
34 | Unicode may be found at http://www.unicode.org. | |
7fa3c420 | 35 | |
15b6757b FM |
36 | As this solution is obviously preferable to the previous ones (think of |
37 | incompatible encodings for the same language, locale chaos and so on), many | |
38 | modern operating systems support it. Probably the first example is Windows NT, | |
39 | which has used only Unicode internally since its very first version. | |
7fa3c420 | 40 | |
15b6757b FM |
41 | Writing internationalized programs is much easier with Unicode and, as the |
42 | support for it improves, it should become more and more so. Moreover, in the | |
43 | Windows NT/2000 case, even programs which use only standard ASCII can profit | |
44 | from using Unicode because they will work more efficiently - there will be no | |
45 | need for the system to convert all strings the program uses to/from Unicode | |
46 | each time a system call is made. | |
36c9828f | 47 | |
7fa3c420 | 48 | @section overview_unicodeandansi Unicode and ANSI modes |
36c9828f | 49 | |
15b6757b FM |
50 | As not all platforms supported by wxWidgets support Unicode (fully) yet, in |
51 | many cases it is unwise to write a program which can only work in a Unicode | |
52 | environment. A better solution is to write programs in such a way that they may | |
53 | be compiled either in ANSI (traditional) mode or in the Unicode one. | |
7fa3c420 | 54 | |
15b6757b FM |
55 | This can be achieved quite simply by using the means provided by wxWidgets. |
56 | Basically, there are only a few things to watch out for: | |
36c9828f FM |
57 | |
58 | ||
7fa3c420 SC |
59 | - Character type (@c char or @c wchar_t) |
60 | - Literal strings (i.e. @c "Hello, world!" or @c '*') | |
61 | - String functions (@c strlen(), @c strcpy(), ...) | |
62 | - Special preprocessor tokens (@c __FILE__, @c __DATE__ | |
15b6757b | 63 | and @c __TIME__) |
36c9828f FM |
64 | |
65 | ||
15b6757b FM |
66 | Let's look at them in order. First of all, each character in a Unicode |
67 | program takes at least 2 bytes instead of the usual one, so another type should be used to | |
68 | store the characters (@c char usually holds only 1 byte). This type is | |
69 | called @c wchar_t, which stands for @e wide-character type. | |
7fa3c420 | 70 | |
15b6757b FM |
71 | Also, the string and character constants should be encoded using wide |
72 | characters (@c wchar_t type) which typically take 2 or 4 bytes instead | |
73 | of @c char which only takes one. This is achieved by using the standard C | |
74 | (and C++) way: just put the letter @c 'L' immediately before any string or | |
75 | character constant and it becomes a @e long constant, i.e. a wide-character | |
76 | one. The prefix must directly precede the opening quote, as in | |
77 | @c L"Hello, world!" or @c L'*'. | |
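As a minimal sketch of the point about wide constants, the following standard C++ fragment compares a narrow literal with its @c L-prefixed counterpart. It does not use wxWidgets; the names @c narrowStr, @c wideStr, @c elementCount and @c checkLiteralSizes are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// The same text as a narrow (char) and a wide (wchar_t) literal.
const char    narrowStr[] = "Hello";
const wchar_t wideStr[]   = L"Hello";

// Number of array elements, including the terminating zero.
template <typename T, std::size_t N>
constexpr std::size_t elementCount(const T (&)[N]) { return N; }

// Checks that both literals have the same element count but that the wide
// one occupies sizeof(wchar_t) bytes per element (commonly 2 or 4).
void checkLiteralSizes()
{
    assert(elementCount(narrowStr) == 6);
    assert(elementCount(wideStr) == 6);
    assert(sizeof(narrowStr) == 6);
    assert(sizeof(wideStr) == 6 * sizeof(wchar_t));
    assert(sizeof(L'*') == sizeof(wchar_t));   // a wide character constant
}
```

The element counts match because the @c L prefix changes only the element type, not the number of characters.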
7fa3c420 | 78 | |
15b6757b FM |
79 | Of course, the usual standard C functions don't work with @c wchar_t |
80 | strings, so another set of functions exists which do the same thing but accept | |
81 | @c wchar_t * instead of @c char *. For example, a function to get the | |
36c9828f | 82 | length of a wide-character string is called @c wcslen() (compare with |
15b6757b FM |
83 | @c strlen() - you see that the only difference is that the "str" prefix |
84 | standing for "string" has been replaced with "wcs" standing for "wide-character | |
85 | string"). | |
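The correspondence between the "str" and "wcs" families can be sketched in plain standard C++ as follows. The helper names (@c narrowLength, @c wideLength, @c copyAndCompare) are hypothetical and exist only for this example; the library calls themselves come from the standard @c cstring and @c cwchar headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Length of a narrow string, counted in char elements.
std::size_t narrowLength(const char* s)
{
    assert(s != nullptr);
    return std::strlen(s);
}

// Length of a wide string, counted in wchar_t elements (not in bytes).
std::size_t wideLength(const wchar_t* ws)
{
    assert(ws != nullptr);
    return std::wcslen(ws);
}

// wcscpy() and wcscmp() mirror strcpy() and strcmp().
// Returns true if the copy made into a local buffer compares equal to src.
bool copyAndCompare(const wchar_t* src)
{
    wchar_t buffer[32];
    if (std::wcslen(src) >= 32)
        return false;               // would not fit, including the terminator
    std::wcscpy(buffer, src);
    return std::wcscmp(buffer, src) == 0;
}
```

Note that @c wcslen() reports the number of wide characters, so the result is the same as for the narrow string even though the wide one uses more memory.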
7fa3c420 | 86 | |
15b6757b FM |
87 | And finally, the standard preprocessor tokens enumerated above expand to ANSI |
88 | strings but it is more likely that Unicode strings are wanted in the Unicode | |
36c9828f | 89 | build. wxWidgets provides the macros @c __TFILE__, @c __TDATE__ |
15b6757b FM |
90 | and @c __TTIME__ which behave exactly as the standard ones except that |
91 | they produce ANSI strings in ANSI build and Unicode ones in the Unicode build. | |
7fa3c420 | 92 | |
15b6757b FM |
93 | To summarize, here is a brief example of what a program which can be compiled |
94 | in both ANSI and Unicode modes could look like: | |
36c9828f | 95 | |
15b6757b FM |
96 | @code |
97 | #ifdef __UNICODE__ | |
98 | wchar_t wch = L'*'; | |
99 | const wchar_t *ws = L"Hello, world!"; | |
100 | int len = wcslen(ws); | |
36c9828f | 101 | |
15b6757b FM |
102 | wprintf(L"Compiled at %ls\n", __TDATE__); |
103 | #else // ANSI | |
104 | char ch = '*'; | |
105 | const char *s = "Hello, world!"; | |
106 | int len = strlen(s); | |
36c9828f | 107 | |
15b6757b FM |
108 | printf("Compiled at %s\n", __DATE__); |
109 | #endif // Unicode/ANSI | |
110 | @endcode | |
36c9828f | 111 | |
15b6757b FM |
112 | Of course, it would be nearly impossible to write such programs if it had to |
113 | be done this way (try to imagine the number of @c #ifdef __UNICODE__ checks an average | |
114 | program would have!). Luckily, there is another way - see the next | |
115 | section. | |
36c9828f | 116 | |
7fa3c420 | 117 | @section overview_unicodeinsidewxw Unicode support in wxWidgets |
36c9828f | 118 | |
15b6757b | 119 | In wxWidgets, the code fragment from above should instead be written as: |
36c9828f | 120 | |
15b6757b FM |
121 | @code |
122 | wxChar ch = wxT('*'); | |
123 | wxString s = wxT("Hello, world!"); | |
124 | int len = s.Len(); | |
125 | @endcode | |
36c9828f | 126 | |
15b6757b FM |
127 | What happens here? First of all, you see that there are no more @c #ifdefs |
128 | at all. Instead, we define some types and macros which behave differently in | |
129 | the Unicode and ANSI builds and allow us to avoid using conditional | |
130 | compilation in the program itself. | |
7fa3c420 | 131 | |
36c9828f | 132 | We have a @c wxChar type which maps to either @c char or @c wchar_t
15b6757b | 133 | depending on the mode in which the program is being compiled. There is no need for
36c9828f | 134 | a separate type for strings though, because the standard |
15b6757b FM |
135 | #wxString supports Unicode, i.e. it stores either ANSI or |
136 | Unicode strings depending on the compile mode. | |
7fa3c420 | 137 | |
15b6757b FM |
138 | Finally, there is a special #wxT() macro which should enclose all |
139 | literal strings in the program. As is easy to see by comparing the last | |
140 | fragment with the one above, this macro leaves its argument unchanged in the (usual) ANSI | |
141 | mode and prefixes @c 'L' to its argument in the Unicode mode. | |
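To illustrate the idea only, here is a sketch of how a character type, a literal macro and a length function could be switched by one preprocessor symbol. It is not the actual wxWidgets implementation: @c MY_UNICODE, @c MyChar, @c MY_T and @c MyStrlen are hypothetical names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical build switch; set it to 0 to get the narrow variant.
#define MY_UNICODE 1

#if MY_UNICODE
    typedef wchar_t MyChar;
    // Token pasting turns "text" into L"text" and '*' into L'*'.
    #define MY_T(x) L ## x
    inline std::size_t MyStrlen(const MyChar* s)
    {
        assert(s != nullptr);
        return std::wcslen(s);
    }
#else
    typedef char MyChar;
    // In the narrow build the literal is left unchanged.
    #define MY_T(x) x
    inline std::size_t MyStrlen(const MyChar* s)
    {
        assert(s != nullptr);
        return std::strlen(s);
    }
#endif

// The code below is identical in both builds.
const MyChar  star     = MY_T('*');
const MyChar* greeting = MY_T("Hello, world!");
```

With this arrangement the conditional compilation lives in one place and the rest of the program is written once against @c MyChar and @c MY_T().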
7fa3c420 | 142 | |
36c9828f | 143 | The important conclusion is that if you use @c wxChar instead of |
15b6757b FM |
144 | @c char, avoid using C style strings and use @c wxString instead, and |
145 | don't forget to enclose all string literals inside the #wxT() macro, your | |
146 | program automatically becomes (almost) Unicode compliant! | |
36c9828f | 147 | |
7fa3c420 | 148 | Let us state the rules once again:
36c9828f | 149 | |
7fa3c420 SC |
150 | - Always use @c wxChar instead of @c char |
151 | - Always enclose literal string constants in the #wxT() macro | |
15b6757b FM |
152 | unless they're already converted to the right representation (another standard |
153 | wxWidgets macro #_() does it, for example, so there is no | |
154 | need for @c wxT() in this case) or you intend to pass the constant directly | |
155 | to an external function which doesn't accept wide-character strings. | |
7fa3c420 | 156 | - Use @c wxString instead of C style strings. |
36c9828f | 157 | |
7fa3c420 | 158 | @section overview_unicodeoutsidewxw Unicode and the outside world |
36c9828f | 159 | |
15b6757b FM |
160 | We have seen that it is easy to write Unicode programs using wxWidgets types |
161 | and macros, but it has also been mentioned that this isn't quite enough. | |
162 | Although everything works fine inside the program, things can get nasty when | |
163 | it tries to communicate with the outside world which, sadly, often expects | |
164 | ANSI strings (a notable exception is the entire Win32 API which accepts either | |
165 | Unicode or ANSI strings and which thus makes it unnecessary to ever perform | |
166 | any conversions in the program). GTK 2.0 only accepts UTF-8 strings. | |
7fa3c420 | 167 | |
36c9828f | 168 | To get an ANSI string from a wxString, you may use the |
15b6757b | 169 | mb_str() function which always returns an ANSI |
36c9828f | 170 | string (independently of the mode - while the usual |
15b6757b FM |
171 | #c_str() returns a pointer to the internal |
172 | representation which is either ANSI or Unicode). More rarely used, but still | |
173 | useful, is the wc_str() function which always returns | |
174 | a Unicode string. | |
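For readers unfamiliar with this kind of conversion, the sketch below shows a wide-to-multibyte conversion in standard C++ using @c std::wcstombs(). It is only an analogy for what a function such as mb_str() has to accomplish and says nothing about how wxString implements it; @c wideToNarrow is a hypothetical helper, and the result depends on the current C locale.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Converts a wide string to a multibyte one using the current C locale.
// Returns false if some character cannot be represented in that locale.
bool wideToNarrow(const std::wstring& wide, std::string& narrow)
{
    // MB_CUR_MAX is the maximum number of bytes per character in the
    // current locale; one extra byte leaves room for the terminator.
    std::vector<char> buffer(wide.size() * MB_CUR_MAX + 1);

    const std::size_t written =
        std::wcstombs(buffer.data(), wide.c_str(), buffer.size());
    if (written == static_cast<std::size_t>(-1))
        return false;               // unrepresentable character

    narrow.assign(buffer.data(), written);
    return true;
}
```

Plain ASCII text converts in any locale; whether other characters convert depends on the locale in effect, which is one reason explicit converters are useful.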
7fa3c420 | 175 | |
36c9828f | 176 | Sometimes it is also necessary to go from ANSI strings to wxStrings. |
15b6757b | 177 | In this case, you can use the converter-constructor, as follows: |
36c9828f FM |
178 | |
179 | ||
15b6757b FM |
180 | @code |
181 | const char* ascii_str = "Some text"; | |
182 | wxString str(ascii_str, wxConvUTF8); | |
183 | @endcode | |
36c9828f | 184 | |
15b6757b FM |
185 | This code also compiles fine under a non-Unicode build of wxWidgets, |
186 | but in that case the converter is ignored. | |
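To give a rough idea of what a UTF-8 converter has to do with the bytes it receives, here is a hand-written decoder in standard C++. It is a simplified sketch and not the code behind @c wxConvUTF8: @c decodeUtf8 is a hypothetical function, and it does not reject overlong encodings or surrogate code points as a strict decoder would.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 bytes into UTF-32 code points.
// Returns false on truncated input or on an invalid lead/continuation byte.
bool decodeUtf8(const std::string& bytes, std::u32string& out)
{
    out.clear();
    for (std::size_t i = 0; i < bytes.size(); ) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t extra;          // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else return false;          // stray continuation or invalid lead byte

        if (i + extra >= bytes.size())
            return false;           // sequence is cut off

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;       // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        out.push_back(cp);
        i += extra + 1;
    }
    return true;
}
```

Pure ASCII input passes through unchanged, one code point per byte, which is why the @c ascii_str example above converts cleanly with a UTF-8 converter.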
7fa3c420 | 187 | |
15b6757b | 188 | For more information about converters and Unicode see |
7fa3c420 | 189 | the @ref overview_mbconvclasses. |
36c9828f | 190 | |
7fa3c420 | 191 | @section overview_unicodesettings Unicode-related compilation settings |
36c9828f | 192 | |
15b6757b FM |
193 | You should define @c wxUSE_UNICODE to 1 to compile your program in |
194 | Unicode mode. This currently works for wxMSW, wxGTK, wxMac and wxX11. If you | |
36c9828f | 195 | compile your program in ANSI mode you can still define @c wxUSE_WCHAR_T |
15b6757b | 196 | to get some limited support for @c wchar_t type. |
7fa3c420 | 197 | |
15b6757b | 198 | This will allow your program to perform conversions between Unicode strings and |
7fa3c420 | 199 | ANSI ones (using @ref overview_mbconvclasses) |
15b6757b FM |
200 | and construct wxString objects from Unicode strings (presumably read |
201 | from some external file or elsewhere). | |
36c9828f | 202 | |
7fa3c420 | 203 | @section overview_topic8 Traps for the unwary |
36c9828f | 204 | |
7fa3c420 SC |
205 | - Casting the result of c_str() to void* now yields a char*, not a wxChar* |
206 | - Passing c_str(), mb_str() or wc_str() to variadic functions | |
15b6757b | 207 | doesn't work |
36c9828f | 208 | |
15b6757b | 209 | */ |
36c9828f FM |
210 | |
211 |