.\" Hey, Emacs! This is -*-nroff-*- you know...
.\"
.\" uconv.1: manual page for the uconv utility.
.\"
.\" Copyright (C) 2016 and later: Unicode, Inc. and others.
.\" License & terms of use: http://www.unicode.org/copyright.html
.\" Copyright (C) 2000-2013 IBM, Inc. and others.
.\"
.\" Manual page by Yves Arrouye <yves@realnames.com>.
.\"
.TH UCONV 1 "2005-jul-1" "ICU MANPAGE" "ICU @VERSION@ Manual"
.SH NAME
.B uconv
\- convert data from one encoding to another
.SH SYNOPSIS
.B uconv
[
.BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
]
[
.BI "\-V\fP, \fB\-\-version"
]
[
.BI "\-s\fP, \fB\-\-silent"
]
[
.BI "\-v\fP, \fB\-\-verbose"
]
[
.BI "\-l\fP, \fB\-\-list"
|
.BI "\-l\fP, \fB\-\-list\-code" " code"
|
.BI "\-\-default\-code"
|
.BI "\-L\fP, \fB\-\-list\-transliterators"
]
[
.BI "\-\-canon"
]
[
.BI "\-x" " transliteration"
]
[
.BI "\-\-to\-callback" " callback"
|
.B "\-c"
]
[
.BI "\-\-from\-callback" " callback"
|
.B "\-i"
]
[
.BI "\-\-callback" " callback"
]
[
.BI "\-\-fallback"
|
.BI "\-\-no\-fallback"
]
[
.BI "\-b\fP, \fB\-\-block\-size" " size"
]
[
.BI "\-f\fP, \fB\-\-from\-code" " encoding"
]
[
.BI "\-t\fP, \fB\-\-to\-code" " encoding"
]
[
.BI "\-\-add\-signature"
]
[
.BI "\-\-remove\-signature"
]
[
.BI "\-o\fP, \fB\-\-output" " file"
]
[
.IR file .\|.\|.
]
.SH DESCRIPTION
.B uconv
converts, or transcodes, each given
.I file
(or standard input if no
.I file
is specified) from one
.I encoding
to another.
The transcoding is done using Unicode as a pivot encoding
(i.e., the data are first transcoded from their original encoding to
Unicode, and then from Unicode to the destination encoding).
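.PP
The pivot model can be illustrated outside ICU.
The following is a minimal Python sketch under stated assumptions, not the
implementation used by uconv: it relies on the codecs built into Python
rather than on ICU converters, and the function name transcode is
hypothetical.
.RS 4
.nf
```python
def transcode(data: bytes, from_code: str, to_code: str) -> bytes:
    """Transcode bytes through Unicode, mirroring the two-step pivot model."""
    # Step 1: original encoding -> Unicode (a Python str).
    text = data.decode(from_code)
    # Step 2: Unicode -> destination encoding.
    return text.encode(to_code)


# The bytes 63 61 66 e9 spell "cafe" with an acute accent in ISO-8859-1.
latin1 = bytes.fromhex("636166e9")
utf8 = transcode(latin1, "iso-8859-1", "utf-8")
print(utf8.hex())
```
.fi
.RE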
.PP
If an
.I encoding
is not specified or is
.BR \- ,
the default encoding is used. Thus, calling
.B uconv
with no
.I encoding
provides an easy way to validate and sanitize data files for
further consumption by tools requiring data in the default encoding.
.PP
When calling
.BR uconv ,
it is possible to specify callbacks that handle invalid
characters in the input, or characters that cannot be transcoded to
the destination encoding. Some encodings, for example, offer a default
substitution character that can be used to represent the occurrence of
such characters in the input. Other callbacks offer a useful visual
representation of the invalid data.
.PP
.B uconv
can also run the specified
.I transliteration
on the transcoded data,
in which case transliteration happens as an intermediate step,
after the data have been transcoded to Unicode.
The
.I transliteration
can be either a list of semicolon-separated transliterator names,
or an arbitrarily complex set of rules in the ICU transliteration
rules format.
.PP
For transcoding purposes,
.B uconv
options are compatible with those of
.BR iconv (1),
making it easy to use
.B uconv
in its place in scripts. However, the encoding names used by
.B uconv
and ICU are not necessarily the same as the ones used by
.BR iconv (1).
Also, options that provide informational data, such as the
.B \-l\fP, \fB\-\-list
option offered by some
.BR iconv (1)
variants such as GNU's, produce data in a slightly different and
easier to parse format.
.SH OPTIONS
.TP
.BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
Print help about usage and exit.
.TP
.BR "\-V\fP, \fB\-\-version"
Print the version of
.B uconv
and exit.
.TP
.BI "\-s\fP, \fB\-\-silent"
Suppress messages during execution.
.TP
.BI "\-v\fP, \fB\-\-verbose"
Display extra informative messages during execution.
.TP
.BI "\-l\fP, \fB\-\-list"
List all the available encodings and exit.
.TP
.BI "\-l\fP, \fB\-\-list\-code" " code"
List only the
.I code
encoding and exit. If
.I code
is not a valid encoding, exit with an error.
.TP
.BI "\-\-default\-code"
List only the name of the default encoding and exit.
.TP
.BI "\-L\fP, \fB\-\-list\-transliterators"
List all the available transliterators and exit.
.TP
.BI "\-\-canon"
If used with
.BI "\-l\fP, \fB\-\-list"
or
.BR "\-\-default\-code" ,
the list of encodings is produced in a format compatible with
.BR convrtrs.txt (5).
If used with
.BR "\-L\fP, \fB\-\-list\-transliterators" ,
print only one transliterator name per line.
.TP
.BI "\-x" " transliteration"
Run the given
.I transliteration
on the transcoded Unicode data,
and use the transliterated data as input for the transcoding to
the destination encoding.
.TP
.BI "\-\-to\-callback" " callback"
Use
.I callback
to handle characters that cannot be transcoded to the destination
encoding. See section
.B CALLBACKS
for details on valid callbacks.
.TP
.B "\-c"
Omit invalid characters from the output.
Same as
.BR "\-\-to\-callback skip" .
.TP
.BI "\-\-from\-callback" " callback"
Use
.I callback
to handle characters that cannot be transcoded from the original
encoding. See section
.B CALLBACKS
for details on valid callbacks.
.TP
.B "\-i"
Ignore invalid sequences in the input.
Same as
.BR "\-\-from\-callback skip" .
.TP
.BI "\-\-callback" " callback"
Use
.I callback
to handle both characters that cannot be transcoded from the original
encoding and characters that cannot be transcoded to the destination
encoding. See section
.B CALLBACKS
for details on valid callbacks.
.TP
.BI "\-\-fallback"
Use the fallback mapping when transcoding from
Unicode to the destination encoding.
.TP
.BI "\-\-no\-fallback"
Do not use the fallback mapping when transcoding from Unicode to the
destination encoding.
This is the default.
.TP
.BI "\-b\fP, \fB\-\-block\-size" " size"
Read input in blocks of
.I size
bytes at a time. The default block size is
4096.
.TP
.BI "\-f\fP, \fB\-\-from\-code" " encoding"
Set the original encoding of the data to
.IR encoding .
.TP
.BI "\-t\fP, \fB\-\-to\-code" " encoding"
Transcode the data to
.IR encoding .
.TP
.BI "\-\-add\-signature"
Add a U+FEFF Unicode signature character (BOM) if the output charset
supports it and does not already add one itself.
.TP
.BI "\-\-remove\-signature"
Remove a U+FEFF Unicode signature character (BOM).
.TP
.BI "\-o\fP, \fB\-\-output" " file"
Write the transcoded data to
.IR file .
.SH CALLBACKS
.B uconv
supports specifying callbacks to handle invalid data. Callbacks can be
set for both directions of transcoding: from the original encoding to
Unicode, with the
.B "\-\-from\-callback"
option, and from Unicode to the destination encoding, with the
.B "\-\-to\-callback"
option.
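.PP
The two directions can be illustrated by analogy with the error handlers
of the Python codecs.
This is only a sketch of the concept: the handler names strict, replace
and ignore belong to Python, not to uconv, and the behavior of Python
codecs is assumed to resemble, not match, that of ICU converters.
.RS 4
.nf
```python
# Input direction (compare --from-callback): the byte ff is never valid UTF-8.
bad_input = bytes.fromhex("41ff42")

# "replace" resembles the substitute callback: U+FFFD marks the bad byte.
substituted = bad_input.decode("utf-8", errors="replace")
# "ignore" resembles the skip callback: the bad byte is dropped.
skipped = bad_input.decode("utf-8", errors="ignore")

# Output direction (compare --to-callback): U+20AC has no ISO-8859-1 mapping.
text = "A" + chr(0x20AC) + "B"
# "replace" writes the substitute of the codec, a question mark here.
to_substituted = text.encode("iso-8859-1", errors="replace")
to_skipped = text.encode("iso-8859-1", errors="ignore")

# "strict" resembles the stop callback: conversion fails with an error.
try:
    bad_input.decode("utf-8", errors="strict")
    stopped = False
except UnicodeDecodeError:
    stopped = True

print(ascii(substituted), skipped, to_substituted, to_skipped, stopped)
```
.fi
.RE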
.PP
The following is a list of valid
.I callback
names, along with a description of their behavior. The list of
callbacks actually supported by
.B uconv
is displayed when it is called with
.BR "\-h\fP, \fB\-\-help" .
.PP
.TP \w'\fBescape-unicode'u+3n
.B substitute
Write the encoding's substitute sequence, or the Unicode
replacement character
.B U+FFFD
when transcoding to Unicode.
.TP
.B skip
Ignore the invalid data.
.TP
.B stop
Stop with an error when encountering invalid data.
This is the default callback.
.TP
.B escape
Same as
.BR escape-icu .
.TP
.B escape-icu
Replace the missing characters with a string of the format
.B %U\fIhhhh\fP
for plane 0 characters, and
.B %U\fIhhhh\fP%U\fIhhhh\fP
for characters in planes 1 and above,
where
.I hhhh
is the hexadecimal value of one of the UTF-16 code units representing the
character. Characters from planes 1 and above are written as a pair of
UTF-16 surrogate code units.
.TP
.B escape-java
Replace the missing characters with a string of the format
.B \eu\fIhhhh\fP
for plane 0 characters, and
.B \eu\fIhhhh\fP\eu\fIhhhh\fP
for characters in planes 1 and above,
where
.I hhhh
is the hexadecimal value of one of the UTF-16 code units representing the
character. Characters from planes 1 and above are written as a pair of
UTF-16 surrogate code units.
.TP
.B escape-c
Replace the missing characters with a string of the format
.B \eu\fIhhhh\fP
for plane 0 characters, and
.B \eU\fIhhhhhhhh\fP
for characters in planes 1 and above,
where
.I hhhh
and
.I hhhhhhhh
are the hexadecimal values of the Unicode code point.
.TP
.B escape-xml
Same as
.BR escape-xml-hex .
.TP
.B escape-xml-hex
Replace the missing characters with a string of the format
.BR &#x\fIhhhh\fP; ,
where
.I hhhh
is the hexadecimal value of the Unicode code point.
.TP
.B escape-xml-dec
Replace the missing characters with a string of the format
.BR &#\fInnnn\fP; ,
where
.I nnnn
is the decimal value of the Unicode code point.
.TP
.B escape-unicode
Replace the missing characters with a string of the format
.BR {U+\fIhhhh\fP} ,
where
.I hhhh
is the hexadecimal value of the Unicode code point.
That hexadecimal string is of variable length and can use from 4 to
6 digits.
This is the conventional notation for a Unicode code point in
the literature, delimited by curly braces for easy recognition of those
substitutions in the output.
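.PP
The escape formats described above can be sketched in Python as follows.
This is an illustration of the formats only, not code taken from ICU.
The function names are hypothetical, and the use of uppercase digits and
the absence of zero padding in the XML forms are assumptions that this
manual page does not specify.
.RS 4
.nf
```python
def utf16_units(cp: int) -> list:
    """Return the UTF-16 code units (one or two) for a code point."""
    if cp < 0x10000:
        return [cp]  # plane 0: a single code unit
    offset = cp - 0x10000
    # Planes 1 and above: a high surrogate followed by a low surrogate.
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def escape_icu(cp: int) -> str:
    # One %Uhhhh group per UTF-16 code unit.
    return "".join("%U{:04X}".format(unit) for unit in utf16_units(cp))


def escape_xml_hex(cp: int) -> str:
    return "&#x{:X};".format(cp)


def escape_xml_dec(cp: int) -> str:
    return "&#{};".format(cp)


def escape_unicode(cp: int) -> str:
    # At least 4 hexadecimal digits, more for planes 1 and above.
    return "{{U+{:04X}}}".format(cp)


for cp in (0x20AC, 0x1F600):
    print(escape_icu(cp), escape_xml_hex(cp), escape_xml_dec(cp),
          escape_unicode(cp))
```
.fi
.RE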
.SH EXAMPLES
Convert data from a given
.I encoding
to the platform encoding:

.RS 4
.B \fR$ \fPuconv \-f \fIencoding\fP
.RE
.PP
Check if a
.I file
contains valid data for a given
.IR encoding :

.RS 4
.B \fR$ \fPuconv \-f \fIencoding\fP \-c \fIfile\fP >/dev/null
.RE
.PP
Convert a UTF-8
.I file
to a given
.I encoding
and ensure that the resulting text is valid for any version of HTML:

.RS 4
.B \fR$ \fPuconv \-f utf-8 \-t \fIencoding\fP \e
.br
.B " \-\-callback escape-xml-dec \fIfile\fP"
.RE
.PP
Display the names of the Unicode code points in a UTF-8 file:

.RS 4
.B \fR$ \fPuconv \-f utf-8 \-x any-name \fIfile\fP
.RE
.PP
Print the name of a Unicode code point whose value is known (\fBU+30AB\fP
in this example):

.RS 4
.B \fR$ \fPecho '\eu30ab' | uconv \-x 'hex-any; any-name'; echo
.br
{KATAKANA LETTER KA}{LINE FEED}
.br
$
.RE

(The names are delimited by curly braces.
The name of the line terminator is also displayed.)
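.PP
For comparison, a similar lookup can be sketched with the Python
unicodedata module.
This is not how uconv resolves names: the function name any_name is
hypothetical, and the fallback to the U+hhhh notation is an assumption
made here because unicodedata has no names for control characters.
.RS 4
.nf
```python
import unicodedata


def any_name(text: str) -> str:
    """Replace each character with its Unicode name in curly braces."""
    parts = []
    for ch in text:
        # Control characters have no name in unicodedata, so fall back
        # to the U+hhhh notation for them.
        name = unicodedata.name(ch, "U+{:04X}".format(ord(ch)))
        parts.append("{" + name + "}")
    return "".join(parts)


print(any_name(chr(0x30AB)))
```
.fi
.RE
.PP
Unlike the uconv output shown above, this sketch reports a line feed by
its code point rather than by a name.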
.PP
Normalize UTF-8 data using Unicode NFKC, remove all control characters,
and map Katakana to Hiragana:

.RS 4
.B \fR$ \fPuconv \-f utf-8 \-t utf-8 \e
.br
.B " \-x '::nfkc; [:Cc:] >; ::katakana-hiragana;'"
.RE
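.PP
The three rules in that transliteration can be approximated in Python to
show what each step does.
This is a rough sketch, not the ICU transform: the function name
normalize is hypothetical, and the Katakana step assumes a fixed code
point offset, which covers only U+30A1 through U+30F6 and ignores any
special cases that ICU may handle.
.RS 4
.nf
```python
import unicodedata

# Katakana U+30A1..U+30F6 sit exactly 0x60 above their Hiragana counterparts.
KATAKANA_TO_HIRAGANA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}


def normalize(text: str) -> str:
    # Rule "::nfkc": apply Unicode NFKC normalization.
    text = unicodedata.normalize("NFKC", text)
    # Rule "[:Cc:] >": delete every character in general category Cc.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
    # Rule "::katakana-hiragana": approximated by the offset table above.
    return text.translate(KATAKANA_TO_HIRAGANA)


# Halfwidth KA (U+FF76), a tab, then fullwidth KA (U+30AB).
sample = chr(0xFF76) + chr(9) + chr(0x30AB)
print(normalize(sample))
```
.fi
.RE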
.SH CAVEATS AND BUGS
.B uconv
reports errors as occurring at the first invalid byte
encountered. This may be confusing to users of GNU
.BR iconv (1),
which reports errors as occurring at the first byte of an invalid
sequence. For multi-byte character sets or encodings, this means that
.B uconv
error positions may be at a later offset in the input stream than
would be the case with GNU
.BR iconv (1).
.PP
The reporting of error positions when a transliterator is used may be
inaccurate or unavailable, in which case
.B uconv
will report the offset in the output stream at which the error
occurred.
.SH AUTHORS
Jonas Utterstroem
.br
Yves Arrouye
.SH VERSION
@VERSION@
.SH COPYRIGHT
Copyright (C) 2000-2005 IBM, Inc. and others.
.SH SEE ALSO
.BR iconv (1)