/////////////////////////////////////////////////////////////////////////////
// Name: unicode.h
// Purpose: topic overview
// Author: wxWidgets team
// RCS-ID: $Id$
// Licence: wxWindows license
/////////////////////////////////////////////////////////////////////////////

/**

@page overview_unicode Unicode Support in wxWidgets

This section briefly describes the state of the Unicode support in wxWidgets.
Read it if you want to know more about how to write programs able to work with
characters from languages other than English.

@li @ref overview_unicode_what
@li @ref overview_unicode_ansi
@li @ref overview_unicode_supportin
@li @ref overview_unicode_supportout
@li @ref overview_unicode_settings
@li @ref overview_unicode_traps


<hr>


@section overview_unicode_what What is Unicode?

wxWidgets has support for compiling in Unicode mode on the platforms which
support it. Unicode is a standard for character encoding which addresses the
shortcomings of the previous, 8 bit standards, by using at least 16 (and
possibly 32) bits for encoding each character. This allows for at least
65536 characters (what is called the BMP, or basic multilingual plane) and
possibly 2^32 of them instead of the usual 256 and is sufficient to encode all
of the world's languages at once. More details about Unicode may be found at
<http://www.unicode.org/>.
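
As a rough illustration of the numbers above, the following sketch (plain
standard C++11, not wxWidgets code; the function names are made up for this
example) checks whether a character fits in the BMP and how many 16-bit units
it would need. It assumes the UTF-16 convention, in which characters outside
the BMP are stored as a pair of 16-bit units.

@code
#include <cassert>

// True if the code point lies in the basic multilingual plane (BMP),
// i.e. it can be stored in a single 16-bit unit.
bool is_in_bmp(char32_t code_point)
{
    return code_point <= 0xFFFF;
}

// Number of 16-bit units needed for a code point under UTF-16:
// one inside the BMP, two (a surrogate pair) outside it.
int utf16_units(char32_t code_point)
{
    assert(code_point <= 0x10FFFF); // highest code point Unicode defines
    return is_in_bmp(code_point) ? 1 : 2;
}
@endcode

For instance, the euro sign (U+20AC) needs one unit, while a character such as
U+1F600 needs two.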

Because this solution is obviously preferable to the previous ones (think of
incompatible encodings for the same language, locale chaos and so on), many
modern operating systems support it. Probably the first example is Windows NT,
which has used only Unicode internally since its very first version.

Writing internationalized programs is much easier with Unicode and, as the
support for it improves, it should become more and more so. Moreover, in the
Windows NT/2000 case, even programs which use only standard ASCII can
profit from using Unicode because they will work more efficiently: there will
be no need for the system to convert all strings the program uses to/from
Unicode each time a system call is made.


@section overview_unicode_ansi Unicode and ANSI Modes

As not all platforms supported by wxWidgets support Unicode (fully) yet, in
many cases it is unwise to write a program which can only work in a Unicode
environment. A better solution is to write programs in such a way that they may
be compiled either in ANSI (traditional) mode or in the Unicode one.

This can be achieved quite simply by using the means provided by wxWidgets.
Basically, there are only a few things to watch out for:

- Character type (@c char or @c wchar_t)
- Literal strings (i.e. @c "Hello, world!" or @c '*')
- String functions (@c strlen(), @c strcpy(), ...)
- Special preprocessor tokens (@c __FILE__, @c __DATE__ and @c __TIME__)

Let's look at them in order. First of all, each character in a Unicode program
takes at least 2 bytes instead of the usual one, so another type should be used
to store the characters (@c char usually holds only 1 byte). This type is
called @c wchar_t, which stands for @e wide-character type.

Also, the string and character constants should be encoded using wide
characters (@c wchar_t type), which typically take 2 or 4 bytes instead of
@c char, which takes only one. This is achieved in the standard C (and C++)
way: just put the letter @c L immediately before any string or character
constant and it becomes a @e long constant, i.e. a wide-character one. The
prefix must come directly before the opening quote, as in @c L"Hello"; it
cannot be written after the constant.

Of course, the usual standard C functions don't work with @c wchar_t strings,
so another set of functions exists which do the same thing but accept
@c wchar_t* instead of @c char*. For example, a function to get the length of a
wide-character string is called @c wcslen() (compare with @c strlen() - you see
that the only difference is that the "str" prefix standing for "string" has
been replaced with "wcs" standing for "wide-character string").
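
The naming pattern can be seen side by side in the short sketch below. It uses
only the standard library; the wrapper names @c narrow_length() and
@c wide_length() are hypothetical and exist only to put the two calls next to
each other.

@code
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Length, in characters, of a narrow (char) string.
std::size_t narrow_length(const char* s)
{
    assert(s != nullptr);
    return std::strlen(s);
}

// The same operation for a wide string: "str" becomes "wcs".
std::size_t wide_length(const wchar_t* ws)
{
    assert(ws != nullptr);
    return std::wcslen(ws);
}
@endcode

Both functions count characters, not bytes, so the same text gives the same
result whether it is stored as @c char or as @c wchar_t.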

And finally, the standard preprocessor tokens enumerated above expand to ANSI
strings but it is more likely that Unicode strings are wanted in the Unicode
build. wxWidgets provides the macros @c __TFILE__, @c __TDATE__ and
@c __TTIME__ which behave exactly as the standard ones except that they produce
ANSI strings in ANSI build and Unicode ones in the Unicode build.
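
One common way to obtain such wide versions is token pasting with a two-level
macro. The sketch below only illustrates that general technique; the names
@c MY_WIDEN, @c MY_TDATE and @c MY_UNICODE_BUILD are hypothetical, and this is
not claimed to be how wxWidgets defines its own macros.

@code
#include <cassert>
#include <cwchar>

// Two levels are needed so that __DATE__ is expanded to a string literal
// before the L prefix is pasted onto it.
#define MY_WIDEN_IMPL(x) L##x
#define MY_WIDEN(x) MY_WIDEN_IMPL(x)

// Pick the wide or the narrow form depending on a (hypothetical) build switch.
#ifdef MY_UNICODE_BUILD
    #define MY_TDATE MY_WIDEN(__DATE__)
#else
    #define MY_TDATE __DATE__
#endif

// Always returns the compilation date as a wide string.
const wchar_t* wide_build_date()
{
    const wchar_t* date = MY_WIDEN(__DATE__);
    assert(std::wcslen(date) == 11); // __DATE__ has the form "Mmm dd yyyy"
    return date;
}
@endcode

Without the intermediate @c MY_WIDEN_IMPL step the preprocessor would paste
@c L onto the token @c __DATE__ itself instead of onto the string it expands
to.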

To summarize, here is a brief example of what a program which can be compiled
in both ANSI and Unicode modes could look like:

@code
#ifdef __UNICODE__
    // Unicode build: wide character type, L-prefixed literals, wcs* functions
    wchar_t wch = L'*';
    const wchar_t *ws = L"Hello, world!";
    size_t len = wcslen(ws);

    // %ls is the standard conversion for a wide-character string argument
    wprintf(L"Compiled at %ls\n", __TDATE__);
#else // ANSI
    // ANSI build: plain char, ordinary literals, str* functions
    char ch = '*';
    const char *s = "Hello, world!";
    size_t len = strlen(s);

    printf("Compiled at %s\n", __DATE__);
#endif // Unicode/ANSI
@endcode

Of course, it would be nearly impossible to write such programs if it had to
be done this way (try to imagine the number of UNICODE checks an average
program would have!). Luckily, there is another way - see the next section.


@section overview_unicode_supportin Unicode Support in wxWidgets

In wxWidgets, the code fragment from above should instead be written as:

@code
wxChar ch = wxT('*');
wxString s = wxT("Hello, world!");
size_t len = s.Len();
@endcode

What happens here? First of all, you see that there are no more UNICODE checks
at all. Instead, we define some types and macros which behave differently in
the Unicode and ANSI builds and allow us to avoid using conditional compilation
in the program itself.

We have a @c wxChar type which maps to either @c char or @c wchar_t depending
on the mode in which the program is being compiled. There is no need for a
separate type for strings though, because the standard wxString supports
Unicode, i.e. it stores either ANSI or Unicode strings depending on the compile
mode.

Finally, there is a special wxT() macro which should enclose all literal
strings in the program. As is easy to see by comparing the last fragment with
the one above, this macro leaves its argument unchanged in the (usual) ANSI
mode and prefixes @c L to its argument in the Unicode mode.
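
To make the mechanism concrete, here is a minimal sketch of how such a type and
macro pair could be defined. Everything in it is a stand-in: @c my_char,
@c MY_T, @c my_string and the @c MY_UNICODE_BUILD switch are hypothetical names
chosen for this example, and the real wxWidgets definitions may differ.

@code
#include <cassert>
#include <string>

#ifdef MY_UNICODE_BUILD               // hypothetical build switch
    typedef wchar_t my_char;          // plays the role of wxChar
    #define MY_T_IMPL(x) L##x
    #define MY_T(x) MY_T_IMPL(x)      // plays the role of wxT(): adds L
#else
    typedef char my_char;
    #define MY_T(x) x                 // ANSI mode: argument left unchanged
#endif

// Simple stand-in for a string class built on the selected character type.
typedef std::basic_string<my_char> my_string;

// The same source compiles in either mode, with no #ifdef in user code.
my_string starred_greeting()
{
    const my_char ch = MY_T('*');
    my_string s = MY_T("Hello, world!");
    s += ch;
    return s;
}
@endcode

The conditional compilation is confined to the few lines that define the type
and the macro, which is the point of the approach described above.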

The important conclusion is that if you use @c wxChar instead of @c char, avoid
using C style strings and use @c wxString instead and don't forget to enclose
all string literals in the wxT() macro, your program automatically becomes
(almost) Unicode compliant!

Let us state the rules once again:

@li Always use wxChar instead of @c char
@li Always enclose literal string constants in the wxT() macro unless they're
    already converted to the right representation (another standard wxWidgets
    macro _() does it, for example, so there is no need for wxT() in this case)
    or you intend to pass the constant directly to an external function which
    doesn't accept wide-character strings.
@li Use wxString instead of C style strings.


@section overview_unicode_supportout Unicode and the Outside World

We have seen that it is easy to write Unicode programs using wxWidgets types
and macros, but it has also been mentioned that it isn't quite enough. Although
everything works fine inside the program, things can get nasty when it tries to
communicate with the outside world which, sadly, often expects ANSI strings (a
notable exception is the entire Win32 API, which accepts either Unicode or ANSI
strings and thus makes it unnecessary to ever perform any conversions in
the program). GTK 2.0 only accepts UTF-8 strings.

To get an ANSI string from a wxString, you may use the mb_str() function, which
always returns an ANSI string (independently of the mode, whereas the usual
c_str() returns a pointer to the internal representation, which is either ANSI
or Unicode). More rarely used, but still useful, is the wc_str() function,
which always returns the Unicode string.
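
Conceptually, a function like mb_str() has to turn wide characters into a
multibyte sequence. The sketch below shows one way such a conversion can be
done with the standard C library function @c std::wcsrtombs(). It is not how
wxWidgets implements mb_str(); the helper name @c to_multibyte() is
hypothetical, and the result depends on the current C locale (in the default
"C" locale only plain ASCII text is expected to convert).

@code
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

// Convert a wide string to a multibyte one using the current C locale.
std::string to_multibyte(const std::wstring& wide)
{
    // First pass: a null destination only computes the required length.
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = wide.c_str();
    const std::size_t len = std::wcsrtombs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("character not representable in this locale");

    // Second pass: perform the conversion into a buffer of that size.
    std::string out(len, '\0');
    state = std::mbstate_t();
    src = wide.c_str();
    const std::size_t written = std::wcsrtombs(&out[0], &src, len, &state);
    assert(written == len);
    (void)written; // avoids an unused-variable warning when NDEBUG is defined
    return out;
}
@endcode

A caller would pass the returned string's @c c_str() to an interface that
expects a narrow string.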

Sometimes it is also necessary to go from ANSI strings to wxStrings. In this
case, you can use the converter-constructor, as follows:

@code
const char* ascii_str = "Some text";
wxString str(ascii_str, wxConvUTF8);
@endcode

This code also compiles fine under a non-Unicode build of wxWidgets, but in
that case the converter is ignored.

For more information about converters and Unicode see the @ref overview_mbconv.


@section overview_unicode_settings Unicode Related Compilation Settings

You should define @c wxUSE_UNICODE to 1 to compile your program in Unicode
mode. This currently works for wxMSW, wxGTK, wxMac and wxX11. If you compile
your program in ANSI mode you can still define @c wxUSE_WCHAR_T to get some
limited support for the @c wchar_t type.

Defining @c wxUSE_WCHAR_T will allow your program to perform conversions
between Unicode strings and ANSI ones (using @ref overview_mbconv "wxMBConv")
and construct wxString objects from Unicode strings (presumably read from some
external file or elsewhere).


@section overview_unicode_traps Traps for the Unwary

@li Casting the result of c_str() to void* now gives a pointer to @c char data,
    not to @c wxChar data.
@li Passing c_str(), mb_str() or wc_str() to variadic functions doesn't work.
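
The second trap can be illustrated without wxWidgets. The sketch below assumes
that a string accessor returns a small helper object rather than a raw pointer;
@c CStrProxy and @c format_greeting() are hypothetical names used only for this
example. The compiler applies no user-defined conversion to arguments passed
through "...", so the object itself would be passed instead of the pointer the
format string expects, and an explicit cast is needed.

@code
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical stand-in for a helper object returned by a string accessor.
class CStrProxy
{
public:
    explicit CStrProxy(const std::string& s) : m_str(s) {}

    // Implicit conversion, used in ordinary (non-variadic) calls.
    operator const char*() const { return m_str.c_str(); }

private:
    const std::string& m_str;
};

std::string format_greeting(const std::string& name)
{
    CStrProxy proxy(name);
    char buf[64];

    // Wrong: passing "proxy" itself would hand the object to snprintf().
    // Right: convert explicitly to the pointer type that %s expects.
    const int n = std::snprintf(buf, sizeof buf, "Hello, %s!",
                                static_cast<const char*>(proxy));
    assert(n >= 0);
    (void)n; // avoids an unused-variable warning when NDEBUG is defined
    return buf;
}
@endcode

The same reasoning applies to any argument of class type handed to a
printf-style function.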

*/