Commit | Line | Data |
---|---|---|
15b6757b | 1 | ///////////////////////////////////////////////////////////////////////////// |
2cd3cc94 | 2 | // Name: unicode.h |
15b6757b FM |
3 | // Purpose: topic overview |
4 | // Author: wxWidgets team | |
5 | // RCS-ID: $Id$ | |
6 | // Licence: wxWindows license | |
7 | ///////////////////////////////////////////////////////////////////////////// | |
8 | ||
880efa2a | 9 | /** |
36c9828f | 10 | |
2cd3cc94 BP |
11 | @page overview_unicode Unicode Support in wxWidgets |
12 | ||
13 | This section briefly describes the state of the Unicode support in wxWidgets. | |
14 | Read it if you want to know more about how to write programs able to work with | |
15 | characters from languages other than English. | |
36c9828f | 16 | |
2cd3cc94 BP |
17 | @li @ref overview_unicode_what |
18 | @li @ref overview_unicode_ansi | |
19 | @li @ref overview_unicode_supportin | |
20 | @li @ref overview_unicode_supportout | |
21 | @li @ref overview_unicode_settings | |
36c9828f | 22 | |
2cd3cc94 | 23 | <hr> |
36c9828f FM |
24 | |
25 | ||
2cd3cc94 BP |
26 | @section overview_unicode_what What is Unicode? |
27 | ||
28 | wxWidgets has support for compiling in Unicode mode on the platforms which | |
29 | support it. Unicode is a standard for character encoding which addresses the | |
30 | shortcomings of the previous 8-bit standards by using at least 16 (and | |
31 | possibly 32) bits to encode each character. This makes at least 65536 | |
32 | characters available (the so-called BMP, or Basic Multilingual Plane), and | |
33 | possibly 2^32 of them, instead of the usual 256, and is sufficient to encode all | |
7b74e828 RR |
34 | of the world's languages at once. A different approach is to encode all |
35 | strings in UTF-8, which does not require the use of wide characters and | |
36 | additionally is backwards compatible with 7-bit ASCII. The UTF-8 approach | |
37 | is preferred under Linux and, partially, under OS X. | |
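The backwards compatibility with 7-bit ASCII can be seen in a minimal sketch of a UTF-8 encoder. This is plain standard C++, not wxWidgets code, and the function name EncodeUtf8 is made up for this illustration. It assumes a valid code point and skips surrogate checks.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point (U+0000..U+10FFFF) as UTF-8.
// Illustrative helper only; validation of surrogates is omitted.
std::string EncodeUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // 7-bit ASCII is stored unchanged in a single byte.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

An ASCII character such as 'A' occupies one byte, while characters outside ASCII take two to four bytes, so the number of bytes in a UTF-8 string is in general not the number of characters.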
2cd3cc94 | 38 | |
7b74e828 | 39 | More details about Unicode may be found at <http://www.unicode.org/>. |
2cd3cc94 | 40 | |
ffac5996 | 41 | Writing internationalized programs is much easier with Unicode. Moreover,
7b74e828 RR |
42 | even a program which uses only standard ASCII can benefit from using Unicode |
43 | for string representation because there will be no need to convert all | |
44 | strings the program uses to/from Unicode each time a system call is made. | |
2cd3cc94 BP |
45 | |
46 | @section overview_unicode_ansi Unicode and ANSI Modes | |
47 | ||
7b74e828 RR |
48 | Until wxWidgets 3.0 it was possible to compile the library either in |
49 | ANSI (8-bit) mode or in wide char mode (16 bits per character | |
50 | on Windows and 32 bits on most Unix versions, Linux and OS X). This | |
ffac5996 RR |
51 | has been changed in wxWidgets 3.0 with the removal of the ANSI mode, |
52 | but much effort has been made so that most of the previous ANSI | |
53 | code should still compile and work as before. | |
2cd3cc94 | 54 | |
7b74e828 | 55 | @section overview_unicode_supportin Unicode Support in wxWidgets |
2cd3cc94 | 56 | |
7b74e828 RR |
57 | Since wxWidgets 3.0, Unicode support is always enabled, meaning |
58 | that the wxString class always uses Unicode to encode its content. | |
ffac5996 RR |
59 | Under Windows wxString uses UCS-2 (basically an array of 16-bit |
60 | wchar_t). Under Unix, Linux and OS X, however, wxString uses UTF-8 | |
61 | to encode its content. | |
2cd3cc94 | 62 | |
7b74e828 RR |
63 | For the programmer, the biggest change is that accessing a character |
64 | by index can be slower than before: with a variable-width encoding | |
65 | such as UTF-8, wxString has to scan the string from its start to find | |
66 | the n-th character. Iterating over a string should therefore no longer | |
67 | be done by index but with iterators. Old code will still work | |
68 | but might be less efficient. | |
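To see why indexed access costs more with UTF-8, consider this sketch of locating the n-th character in a UTF-8 byte string. It is plain standard C++ written for illustration; it is not how wxString is implemented, and the name Utf8OffsetOfChar is hypothetical. It assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Return the byte offset at which the n-th character (code point) of a
// valid UTF-8 string starts, or std::string::npos if n is out of range.
// Illustrative helper only, not a wxWidgets function.
std::size_t Utf8OffsetOfChar(const std::string& utf8, std::size_t n)
{
    std::size_t seen = 0;
    for (std::size_t pos = 0; pos < utf8.size(); ++pos) {
        // Bytes of the form 10xxxxxx continue a character;
        // every other byte starts a new one.
        const bool startsChar =
            (static_cast<unsigned char>(utf8[pos]) & 0xC0) != 0x80;
        if (startsChar) {
            if (seen == n)
                return pos;
            ++seen;
        }
    }
    return std::string::npos;
}
```

Each lookup walks the string from the beginning, so a loop that calls such a function once per index does work proportional to the square of the string length, whereas an iterator only needs to step to the next character.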
2cd3cc94 | 69 | |
7b74e828 | 70 | Old code like this: |
2cd3cc94 | 71 | |
7b74e828 RR |
72 | @code |
73 | wxString s = wxT("hello"); | |
74 | size_t i; | |
75 | for (i = 0; i < s.Len(); i++) | |
76 | { | |
77 | wxChar ch = s[i]; | |
78 | ||
79 | // do something with it | |
80 | } | |
81 | @endcode | |
82 | ||
83 | should be replaced (especially in time-critical places) with: | |
2cd3cc94 BP |
84 | |
85 | @code | |
7b74e828 | 86 | wxString s = "hello"; |
36b952b7 | 87 | wxString::const_iterator i; |
7b74e828 RR |
88 | for (i = s.begin(); i != s.end(); ++i) |
89 | { | |
90 | wxUniChar uni_ch = *i; | |
91 | wxChar ch = uni_ch; | |
92 | // same as: wxChar ch = *i | |
93 | ||
94 | // do something with it | |
95 | } | |
2cd3cc94 BP |
96 | @endcode |
97 | ||
7b74e828 RR |
98 | If you want to replace individual characters in the string you |
99 | need to get a reference to that character: | |
2cd3cc94 | 100 | |
7b74e828 RR |
101 | @code |
102 | wxString s = "hello"; | |
103 | wxString::iterator i; | |
104 | for (i = s.begin(); i != s.end(); ++i) | |
105 | { | |
106 | wxUniCharRef ch = *i; | |
107 | ch = 'a'; | |
108 | // same as: *i = 'a'; | |
109 | } | |
110 | @endcode | |
2cd3cc94 | 111 | |
7b74e828 | 112 | which will change the content of the wxString s from "hello" to "aaaaa". |
2cd3cc94 | 113 | |
7b74e828 RR |
114 | String literals are translated to Unicode when they are assigned to |
115 | a wxString object so code can be written like this: | |
2cd3cc94 | 116 | |
7b74e828 RR |
117 | @code |
118 | wxString s = "Hello, world!"; | |
119 | size_t len = s.Len(); | |
120 | @endcode | |
2cd3cc94 | 121 | |
7b74e828 RR |
122 | wxWidgets provides wrappers around most POSIX C functions (like printf()) |
123 | and the syntax has been adapted to support input with wxString, normal | |
124 | C-style strings and wchar_t strings: | |
2cd3cc94 | 125 | |
7b74e828 RR |
126 | @code |
127 | wxString s; | |
128 | s.Printf( "%s %s %s", "hello1", L"hello2", wxString("hello3") ); | |
129 | wxPrintf( "Three times hello %s\n", s ); | |
130 | @endcode | |
2cd3cc94 BP |
131 | |
132 | @section overview_unicode_supportout Unicode and the Outside World | |
133 | ||
134 | We have seen that it is easy to write Unicode programs using wxWidgets types | |
135 | and macros, but it has also been mentioned that this isn't quite enough. Although | |
136 | everything works fine inside the program, things can get nasty when it tries to | |
137 | communicate with the outside world which, sadly, often expects ANSI strings (a | |
138 | notable exception is the entire Win32 API, which accepts either Unicode or ANSI | |
139 | strings and thus makes it unnecessary to ever perform any conversions in | |
140 | the program). GTK 2.0 only accepts UTF-8 strings. | |
141 | ||
142 | To get an ANSI string from a wxString, you may use the mb_str() function, which | |
143 | always returns an ANSI string (independently of the mode, whereas the usual | |
144 | c_str() returns a pointer to the internal representation, which is either ASCII | |
145 | or Unicode). More rarely used, but still useful, is the wc_str() function, which | |
146 | always returns the Unicode string. | |
147 | ||
148 | Sometimes it is also necessary to go from ANSI strings to wxStrings. In this | |
149 | case, you can use the converter-constructor, as follows: | |
150 | ||
151 | @code | |
152 | const char* ascii_str = "Some text"; | |
153 | wxString str(ascii_str, wxConvUTF8); | |
154 | @endcode | |
155 | ||
2cd3cc94 BP |
156 | For more information about converters and Unicode, see @ref overview_mbconv. |
157 | ||
158 | ||
159 | @section overview_unicode_settings Unicode Related Compilation Settings | |
160 | ||
161 | You should define @c wxUSE_UNICODE to 1 to compile your program in Unicode | |
7b74e828 RR |
162 | mode. Since wxWidgets 3.0 this is always the case. When compiled in UTF-8 |
163 | mode, @c wxUSE_UNICODE_UTF8 is also defined. | |
2cd3cc94 BP |
164 | |
165 | */ | |
166 |