.\" Hey, Emacs! This is -*-nroff-*- you know...
.\"
.\" uconv.1: manual page for the uconv utility.
.\"
.\" Copyright (C) 2016 and later: Unicode, Inc. and others.
.\" License & terms of use: http://www.unicode.org/copyright.html
.\" Copyright (C) 2000-2013 IBM, Inc. and others.
.\"
.\" Manual page by Yves Arrouye <yves@realnames.com>.
.\"
.TH UCONV 1 "2005-jul-1" "ICU MANPAGE" "ICU @VERSION@ Manual"
.SH NAME
.B uconv
\- convert data from one encoding to another
.SH SYNOPSIS
.B uconv
[
.BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
]
[
.BI "\-V\fP, \fB\-\-version"
]
[
.BI "\-s\fP, \fB\-\-silent"
]
[
.BI "\-v\fP, \fB\-\-verbose"
]
[
.BI "\-l\fP, \fB\-\-list"
|
.BI "\-l\fP, \fB\-\-list\-code" " code"
|
.BI "\-\-default-code"
|
.BI "\-L\fP, \fB\-\-list\-transliterators"
]
[
.BI "\-\-canon"
]
[
.BI "\-x" " transliteration"
]
[
.BI "\-\-to\-callback" " callback"
|
.B "\-c"
]
[
.BI "\-\-from\-callback" " callback"
|
.B "\-i"
]
[
.BI "\-\-callback" " callback"
]
[
.BI "\-\-fallback"
|
.BI "\-\-no\-fallback"
]
[
.BI "\-b\fP, \fB\-\-block\-size" " size"
]
[
.BI "\-f\fP, \fB\-\-from\-code" " encoding"
]
[
.BI "\-t\fP, \fB\-\-to\-code" " encoding"
]
[
.BI "\-\-add\-signature"
]
[
.BI "\-\-remove\-signature"
]
[
.BI "\-o\fP, \fB\-\-output" " file"
]
[
.IR file .\|.\|.
]
.SH DESCRIPTION
.B uconv
converts, or transcodes, each given
.I file
(or its standard input if no
.I file
is specified) from one
.I encoding
to another.
The transcoding is done using Unicode as a pivot encoding
(i.e. the data are first transcoded from their original encoding to
Unicode, and then from Unicode to the destination encoding).
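.PP
The two-step pivot described above can be sketched in Python. This is only
an illustration of the idea, not how
.B uconv
is implemented: the function name
.I transcode
is hypothetical, and the encoding names are Python codec names, which may
differ from the names ICU accepts.
.nf
```python
def transcode(data: bytes, from_code: str, to_code: str) -> bytes:
    """Transcode bytes by pivoting through Unicode text."""
    # Step 1: original encoding -> Unicode (a Python str).
    text = data.decode(from_code)
    # Step 2: Unicode -> destination encoding.
    return text.encode(to_code)


# U+00E9 is one byte (0xE9) in ISO-8859-1 and two bytes (0xC3 0xA9) in UTF-8.
latin1_bytes = bytes([0x63, 0x61, 0x66, 0xE9])
utf8_bytes = transcode(latin1_bytes, "iso-8859-1", "utf-8")
print(utf8_bytes.hex())  # prints 636166c3a9
```
.fi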
.PP
If an
.I encoding
is not specified or is
.BR - ,
the default encoding is used. Thus, calling
.B uconv
with no
.I encoding
provides an easy way to validate and sanitize data files for
further consumption by tools requiring data in the default encoding.
.PP
When calling
.BR uconv ,
it is possible to specify callbacks that are used to handle invalid
characters in the input, or characters that cannot be transcoded to
the destination encoding. Some encodings, for example, offer a default
substitution character that can be used to represent the occurrence of
such characters in the input. Other callbacks offer a useful visual
representation of the invalid data.
.PP
.B uconv
can also run the specified
.IR transliteration
on the transcoded data,
in which case transliteration will happen as an intermediate step,
after the data have been transcoded to Unicode.
The
.I transliteration
can be either a list of semicolon-separated transliterator names,
or an arbitrarily complex set of rules in the ICU transliteration
rules format.
.PP
For transcoding purposes,
.B uconv
options are compatible with those of
.BR iconv (1),
making it easy to replace it in scripts. It is not necessarily the case,
however, that the encoding names used by
.B uconv
and ICU are the same as the ones used by
.BR iconv (1).
Also, options that provide informational data, such as the
.B \-l\fP, \fB\-\-list
one offered by some
.BR iconv (1)
variants such as GNU's, produce data in a slightly different and
easier to parse format.
.SH OPTIONS
.TP
.BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
Print help about usage and exit.
.TP
.BR "\-V\fP, \fB\-\-version"
Print the version of
.B uconv
and exit.
.TP
.BI "\-s\fP, \fB\-\-silent"
Suppress messages during execution.
.TP
.BI "\-v\fP, \fB\-\-verbose"
Display extra informative messages during execution.
.TP
.BI "\-l\fP, \fB\-\-list"
List all the available encodings and exit.
.TP
.BI "\-l\fP, \fB\-\-list\-code" " code"
List only the
.I code
encoding and exit. If
.I code
is not a proper encoding, exit with an error.
.TP
.BI "\-\-default-code"
List only the name of the default encoding and exit.
.TP
.BI "\-L\fP, \fB\-\-list\-transliterators"
List all the available transliterators and exit.
.TP
.BI "\-\-canon"
If used with
.BI "\-l\fP, \fB\-\-list"
or
.BR "\-\-default-code" ,
the list of encodings is produced in a format compatible with
.BR convrtrs.txt (5).
If used with
.BR "\-L\fP, \fB\-\-list\-transliterators" ,
print only one transliterator name per line.
.TP
.BI "\-x" " transliteration"
Run the given
.IR transliteration
on the transcoded Unicode data,
and use the transliterated data as input for the transcoding to
the destination encoding.
.TP
.BI "\-\-to\-callback" " callback"
Use
.I callback
to handle characters that cannot be transcoded to the destination
encoding. See section
.B CALLBACKS
for details on valid callbacks.
.TP
.B "\-c"
Omit invalid characters from the output.
Same as
.BR "\-\-to\-callback skip" .
.TP
.BI "\-\-from\-callback" " callback"
Use
.I callback
to handle characters that cannot be transcoded from the original
encoding. See section
.B CALLBACKS
for details on valid callbacks.
.TP
.B "\-i"
Ignore invalid sequences in the input.
Same as
.BR "\-\-from\-callback skip" .
.TP
.BI "\-\-callback" " callback"
Use
.I callback
to handle both characters that cannot be transcoded from the original
encoding and characters that cannot be transcoded to the destination
encoding. See section
.B CALLBACKS
for details on valid callbacks.
.TP
.BI "\-\-fallback"
Use the fallback mapping when transcoding from
Unicode to the destination encoding.
.TP
.BI "\-\-no\-fallback"
Do not use the fallback mapping when transcoding from Unicode to the
destination encoding.
This is the default.
.TP
.BI "\-b\fP, \fB\-\-block\-size" " size"
Read input in blocks of
.I size
bytes at a time. The default block size is
4096.
.TP
.BI "\-f\fP, \fB\-\-from\-code" " encoding"
Set the original encoding of the data to
.IR encoding .
.TP
.BI "\-t\fP, \fB\-\-to\-code" " encoding"
Transcode the data to
.IR encoding .
.TP
.BI "\-\-add\-signature"
Add a U+FEFF Unicode signature character (BOM) if the output charset
supports it and does not add one anyway.
.TP
.BI "\-\-remove\-signature"
Remove a U+FEFF Unicode signature character (BOM).
.TP
.BI "\-o\fP, \fB\-\-output" " file"
Write the transcoded data to
.IR file .
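.PP
To make the effect of the signature options above concrete, here is a
minimal Python sketch restricted to UTF-8, where the U+FEFF signature is the
byte sequence EF BB BF. The helper names
.I add_signature
and
.I remove_signature
are hypothetical, and the choice not to add a second signature when one is
already present is an assumption of this sketch, not a statement about the
tool's internals.
.nf
```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF, i.e. U+FEFF encoded as UTF-8


def add_signature(data: bytes) -> bytes:
    """Prepend a UTF-8 BOM unless the data already starts with one."""
    return data if data.startswith(BOM) else BOM + data


def remove_signature(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM if present."""
    return data[len(BOM):] if data.startswith(BOM) else data


signed = add_signature(b"hello")
print(signed.hex())              # prints efbbbf68656c6c6f
print(remove_signature(signed))  # prints b'hello'
```
.fi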
.SH CALLBACKS
.B uconv
supports specifying callbacks to handle invalid data. Callbacks can be
set for both directions of transcoding: from the original encoding to
Unicode, with the
.BR "\-\-from\-callback"
option, and from Unicode to the destination encoding, with the
.BR "\-\-to\-callback"
option.
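.PP
The two directions can be illustrated with Python's codec error handlers,
which play a role loosely comparable to callbacks. This is an analogy only:
the handler names used here (strict, ignore, replace, xmlcharrefreplace) are
Python's, not
.B uconv
callback names, and their exact output may differ from what ICU produces.
.nf
```python
# Direction 1: original encoding -> Unicode (compare --from-callback).
bad_utf8 = bytes([0x61, 0xFF, 0x62])  # 0xFF can never appear in UTF-8

print(bad_utf8.decode("utf-8", errors="ignore"))  # drops the bad byte: prints ab
replaced = bad_utf8.decode("utf-8", errors="replace")
print(replaced == "a" + chr(0xFFFD) + "b")        # U+FFFD inserted: prints True
try:
    bad_utf8.decode("utf-8", errors="strict")     # stops with an error
except UnicodeDecodeError as err:
    print("decode stopped at byte offset", err.start)

# Direction 2: Unicode -> destination encoding (compare --to-callback).
text = "caf" + chr(0xE9)  # U+00E9 has no ASCII representation

print(text.encode("ascii", errors="ignore"))             # prints b'caf'
print(text.encode("ascii", errors="replace"))            # prints b'caf?'
print(text.encode("ascii", errors="xmlcharrefreplace"))  # prints b'caf&#233;'
```
.fi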
.PP
The following is a list of valid
.I callback
names, along with a description of their behavior. The list of
callbacks actually supported by
.B uconv
is displayed when it is called with
.BR "\-h\fP, \fB\-\-help" .
.PP
.TP \w'\fBescape-unicode'u+3n
.B substitute
Write the encoding's substitute sequence, or the Unicode
replacement character
.B U+FFFD
when transcoding to Unicode.
.TP
.B skip
Ignore the invalid data.
.TP
.B stop
Stop with an error when encountering invalid data.
This is the default callback.
.TP
.B escape
Same as
.BR escape-icu .
.TP
.B escape-icu
Replace the missing characters with a string of the format
.BR %U\fIhhhh\fP
for plane 0 characters, and
.BR %U\fIhhhh\fP%U\fIhhhh\fP
for planes 1 and above characters,
where
.I hhhh
is the hexadecimal value of one of the UTF-16 code units representing the
character. Characters from planes 1 and above are written as a pair of
UTF-16 surrogate code units.
.TP
.B escape-java
Replace the missing characters with a string of the format
.BR \eu\fIhhhh\fP
for plane 0 characters, and
.BR \eu\fIhhhh\fP\eu\fIhhhh\fP
for planes 1 and above characters,
where
.I hhhh
is the hexadecimal value of one of the UTF-16 code units representing the
character. Characters from planes 1 and above are written as a pair of
UTF-16 surrogate code units.
.TP
.B escape-c
Replace the missing characters with a string of the format
.BR \eu\fIhhhh\fP
for plane 0 characters, and
.BR \eU\fIhhhhhhhh\fP
for planes 1 and above characters,
where
.I hhhh
and
.I hhhhhhhh
are the hexadecimal values of the Unicode codepoint.
.TP
.B escape-xml
Same as
.BR escape-xml-hex .
.TP
.B escape-xml-hex
Replace the missing characters with a string of the format
.BR &#x\fIhhhh\fP; ,
where
.I hhhh
is the hexadecimal value of the Unicode codepoint.
.TP
.B escape-xml-dec
Replace the missing characters with a string of the format
.BR &#\fInnnn\fP; ,
where
.I nnnn
is the decimal value of the Unicode codepoint.
.TP
.B escape-unicode
Replace the missing characters with a string of the format
.BR {U+\fIhhhh\fP} ,
where
.I hhhh
is the hexadecimal value of the Unicode codepoint.
That hexadecimal string is of variable length and can use from 4 to
6 digits.
This is the format universally used to denote a Unicode codepoint in
the literature, delimited by curly braces for easy recognition of those
substitutions in the output.
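.PP
The escape formats listed above can be compared side by side with a small
Python sketch that formats a single code point in each style. It follows the
descriptions in this section only. The function names, the use of uppercase
hexadecimal digits, and the absence of zero padding in the two XML forms are
assumptions of the sketch, so the actual output of
.B uconv
may differ in those details.
.nf
```python
BACKSLASH = chr(92)  # a literal backslash character


def utf16_units(cp: int) -> list:
    """Return the UTF-16 code unit(s) for a code point."""
    if cp < 0x10000:
        return [cp]
    cp -= 0x10000
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]


def escape(cp: int, style: str) -> str:
    """Format one code point in the style of the named callback."""
    if style == "escape-icu":
        return "".join("%U{:04X}".format(u) for u in utf16_units(cp))
    if style == "escape-java":
        return "".join(BACKSLASH + "u{:04X}".format(u) for u in utf16_units(cp))
    if style == "escape-c":
        if cp < 0x10000:
            return BACKSLASH + "u{:04X}".format(cp)
        return BACKSLASH + "U{:08X}".format(cp)
    if style == "escape-xml-hex":
        return "&#x{:X};".format(cp)
    if style == "escape-xml-dec":
        return "&#{:d};".format(cp)
    if style == "escape-unicode":
        return "{{U+{:04X}}}".format(cp)
    raise ValueError("unknown style: " + style)


# Show a plane 0 code point (U+30AB) next to a plane 1 code point (U+1F600).
for style in ("escape-icu", "escape-java", "escape-c",
              "escape-xml-hex", "escape-xml-dec", "escape-unicode"):
    print(style, escape(0x30AB, style), escape(0x1F600, style))
```
.fi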
.SH EXAMPLES
Convert data from a given
.I encoding
to the platform encoding:

.RS 4
.B \fR$ \fPuconv \-f \fIencoding\fP
.RE
.PP
Check if a
.I file
contains valid data for a given
.IR encoding :

.RS 4
.B \fR$ \fPuconv \-f \fIencoding\fP \-c \fIfile\fP >/dev/null
.RE
.PP
Convert a UTF-8
.I file
to a given
.I encoding
and ensure that the resulting text is good for any version of HTML:

.RS 4
.B \fR$ \fPuconv \-f utf-8 \-t \fIencoding\fP \e
.br
.B " \-\-callback escape-xml-dec \fIfile\fP"
.RE
.PP
Display the names of the Unicode code points in a UTF-8 file:

.RS 4
.B \fR$ \fPuconv \-f utf-8 \-x any-name \fIfile\fP
.RE
.PP
Print the name of a Unicode code point whose value is known (\fBU+30AB\fP
in this example):

.RS 4
.B \fR$ \fPecho '\eu30ab' | uconv \-x 'hex-any; any-name'; echo
.br
{KATAKANA LETTER KA}{LINE FEED}
.br
$
.RE

(The names are delimited by curly braces.
The name of the line terminator is also displayed.)
.PP
Normalize UTF-8 data using Unicode NFKC, remove all control characters,
and map Katakana to Hiragana:

.RS 4
.B \fR$ \fPuconv \-f utf-8 \-t utf-8 \e
.br
.B " \-x '::nfkc; [:Cc:] >; ::katakana-hiragana;'"
.RE
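.PP
For comparison, the three steps of the last example (NFKC normalization,
removal of control characters, Katakana to Hiragana mapping) can be
approximated in Python with the standard unicodedata module. The function
name
.I normalize
is hypothetical, and the Katakana mapping is a plain code point offset over
U+30A1 through U+30F6. That is a simplification and an assumption of this
sketch, so it may not match every case handled by the ICU transliterator.
.nf
```python
import unicodedata


def normalize(text: str) -> str:
    """NFKC-normalize, drop control characters, map Katakana to Hiragana."""
    out = []
    for ch in unicodedata.normalize("NFKC", text):
        if unicodedata.category(ch) == "Cc":
            continue  # control character, as in the [:Cc:] > rule
        cp = ord(ch)
        if 0x30A1 <= cp <= 0x30F6:
            ch = chr(cp - 0x60)  # Katakana letters sit 0x60 above Hiragana
        out.append(ch)
    return "".join(out)


halfwidth_ka = chr(0xFF76)  # HALFWIDTH KATAKANA LETTER KA
result = normalize(halfwidth_ka + chr(0x0A))  # the line feed is removed
print(unicodedata.name(result))  # prints HIRAGANA LETTER KA
```
.fi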
.SH CAVEATS AND BUGS
.B uconv
reports errors as occurring at the first invalid byte
encountered. This may be confusing to users of GNU
.BR iconv (1),
which reports errors as occurring at the first byte of an invalid
sequence. For multi-byte character sets or encodings, this means that
.BR uconv
error positions may be at a later offset in the input stream than
would be the case with GNU
.BR iconv (1).
.PP
The reporting of error positions when a transliterator is used may be
inaccurate or unavailable, in which case
.BR uconv
will report the offset in the output stream at which the error
occurred.
.SH AUTHORS
Jonas Utterstroem
.br
Yves Arrouye
.SH VERSION
@VERSION@
.SH COPYRIGHT
Copyright (C) 2000-2005 IBM, Inc. and others.
.SH SEE ALSO
.BR iconv (1)