.\" Hey, Emacs! This is -*-nroff-*- you know...
.\"
.\" uconv.1: manual page for the uconv utility.
.\"
.\" Copyright (C) 2000-2005 IBM, Inc. and others.
.\"
.\" Manual page by Yves Arrouye <yves@realnames.com>.
.\"
.TH UCONV 1 "2005-jul-1" "ICU MANPAGE" "ICU @VERSION@ Manual"
.SH NAME
.B uconv
\- convert data from one encoding to another
.SH SYNOPSIS
.B uconv
[
.BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
]
[
.BI "\-V\fP, \fB\-\-version"
]
[
.BI "\-s\fP, \fB\-\-silent"
]
[
.BI "\-v\fP, \fB\-\-verbose"
]
[
.BI "\-l\fP, \fB\-\-list"
|
.BI "\-l\fP, \fB\-\-list\-code" " code"
|
.BI "\-\-default-code"
|
.BI "\-L\fP, \fB\-\-list\-transliterators"
]
[
.BI "\-\-canon"
]
[
.BI "\-x" " transliteration"
]
[
.BI "\-\-to\-callback" " callback"
|
.B "\-c"
]
[
.BI "\-\-from\-callback" " callback"
|
.B "\-i"
]
[
.BI "\-\-callback" " callback"
]
[
.BI "\-\-fallback"
|
.BI "\-\-no\-fallback"
]
[
.BI "\-b\fP, \fB\-\-block\-size" " size"
]
[
.BI "\-f\fP, \fB\-\-from\-code" " encoding"
]
[
.BI "\-t\fP, \fB\-\-to\-code" " encoding"
]
[
.BI "\-\-add\-signature"
]
[
.BI "\-\-remove\-signature"
]
[
.BI "\-o\fP, \fB\-\-output" " file"
]
[
.IR file .\|.\|.
]
.SH DESCRIPTION
.B uconv
converts, or transcodes, each given
.I file
(or its standard input if no
.I file
is specified) from one
.I encoding
to another.
The transcoding is done using Unicode as a pivot encoding
(i.e. the data are first transcoded from their original encoding to
Unicode, and then from Unicode to the destination encoding).
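.PP
The pivot step can be pictured with a short Python sketch.
It only illustrates the two-stage idea using Python's built-in codecs
and is not how
.B uconv
or ICU is implemented; the function name
.I transcode
is hypothetical.
.PP
.nf
```python
def transcode(data: bytes, from_code: str, to_code: str) -> bytes:
    """Transcode bytes by pivoting through Unicode text."""
    # Stage 1: original encoding -> Unicode (a Python str).
    text = data.decode(from_code)
    # Stage 2: Unicode -> destination encoding.
    return text.encode(to_code)


if __name__ == "__main__":
    # The word "cafe" with an e-acute, encoded in ISO-8859-1.
    latin1 = bytes([0x63, 0x61, 0x66, 0xE9])
    print(transcode(latin1, "iso-8859-1", "utf-8"))
```
.fi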
.PP
If an
.I encoding
is not specified or is
.BR - ,
the default encoding is used. Thus, calling
.B uconv
with no
.I encoding
provides an easy way to validate and sanitize data files for
further consumption by tools requiring data in the default encoding.
.PP
When calling
.BR uconv ,
it is possible to specify callbacks that are used to handle invalid
characters in the input, or characters that cannot be transcoded to
the destination encoding. Some encodings, for example, offer a default
substitution character that can be used to represent the occurrence of
such characters in the input. Other callbacks offer a useful visual
representation of the invalid data.
.PP
.B uconv
can also run the specified
.IR transliteration
on the transcoded data,
in which case transliteration will happen as an intermediate step,
after the data have been transcoded to Unicode.
The
.I transliteration
can be either a list of semicolon-separated transliterator names,
or an arbitrarily complex set of rules in the ICU transliteration
rules format.
.PP
For transcoding purposes,
.B uconv
options are compatible with those of
.BR iconv (1),
making it easy to replace it in scripts. It is not necessarily the case,
however, that the encoding names used by
.B uconv
and ICU are the same as the ones used by
.BR iconv (1).
Also, options that provide informational data, such as the
.B \-l\fP, \fB\-\-list
one offered by some
.BR iconv (1)
variants such as GNU's, produce data in a slightly different and
easier to parse format.
.SH OPTIONS
.TP
.BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
Print help about usage and exit.
.TP
.BR "\-V\fP, \fB\-\-version"
Print the version of
.B uconv
and exit.
.TP
.BI "\-s\fP, \fB\-\-silent"
Suppress messages during execution.
.TP
.BI "\-v\fP, \fB\-\-verbose"
Display extra informative messages during execution.
.TP
.BI "\-l\fP, \fB\-\-list"
List all the available encodings and exit.
.TP
.BI "\-l\fP, \fB\-\-list\-code" " code"
List only the
.I code
encoding and exit. If
.I code
is not a proper encoding, exit with an error.
.TP
.BI "\-\-default-code"
List only the name of the default encoding and exit.
.TP
.BI "\-L\fP, \fB\-\-list\-transliterators"
List all the available transliterators and exit.
.TP
.BI "\-\-canon"
If used with
.BI "\-l\fP, \fB\-\-list"
or
.BR "\-\-default-code" ,
the list of encodings is produced in a format compatible with
.BR convrtrs.txt (5).
If used with
.BR "\-L\fP, \fB\-\-list\-transliterators" ,
print only one transliterator name per line.
.TP
.BI "\-x" " transliteration"
Run the given
.IR transliteration
on the transcoded Unicode data,
and use the transliterated data as input for the transcoding to
the destination encoding.
.TP
.BI "\-\-to\-callback" " callback"
Use
.I callback
to handle characters that cannot be transcoded to the destination
encoding. See section
.B CALLBACKS
for details on valid callbacks.
.TP
.B "\-c"
Omit invalid characters from the output.
Same as
.BR "\-\-to\-callback skip" .
.TP
.BI "\-\-from\-callback" " callback"
Use
.I callback
to handle characters that cannot be transcoded from the original
encoding. See section
.B CALLBACKS
for details on valid callbacks.
.TP
.B "\-i"
Ignore invalid sequences in the input.
Same as
.BR "\-\-from\-callback skip" .
.TP
.BI "\-\-callback" " callback"
Use
.I callback
to handle both characters that cannot be transcoded from the original
encoding and characters that cannot be transcoded to the destination
encoding. See section
.B CALLBACKS
for details on valid callbacks.
.TP
.BI "\-\-fallback"
Use the fallback mapping when transcoding from
Unicode to the destination encoding.
.TP
.BI "\-\-no\-fallback"
Do not use the fallback mapping when transcoding from Unicode to the
destination encoding.
This is the default.
.TP
.BI "\-b\fP, \fB\-\-block\-size" " size"
Read input in blocks of
.I size
bytes at a time. The default block size is
4096.
.TP
.BI "\-f\fP, \fB\-\-from\-code" " encoding"
Set the original encoding of the data to
.IR encoding .
.TP
.BI "\-t\fP, \fB\-\-to\-code" " encoding"
Transcode the data to
.IR encoding .
.TP
.BI "\-\-add\-signature"
Add a U+FEFF Unicode signature character (BOM) if the output charset
supports it and does not add one anyway.
.TP
.BI "\-\-remove\-signature"
Remove a U+FEFF Unicode signature character (BOM).
.TP
.BI "\-o\fP, \fB\-\-output" " file"
Write the transcoded data to
.IR file .
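.PP
To make the effect of the two signature options concrete, the sketch
below shows what adding or removing U+FEFF means at the byte level for
UTF-8 data, where the signature is the three bytes EF BB BF.
It is a Python illustration under that assumption, not
.BR uconv 's
code, and both helper names are hypothetical.
.PP
.nf
```python
import codecs


def add_signature(data: bytes) -> bytes:
    """Prepend the UTF-8 encoding of U+FEFF unless it is already there."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


def remove_signature(data: bytes) -> bytes:
    """Drop a leading UTF-8 encoding of U+FEFF, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    signed = add_signature(b"hello")
    print(signed.hex())
    print(remove_signature(signed))
```
.fi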
.SH CALLBACKS
.B uconv
supports specifying callbacks to handle invalid data. Callbacks can be
set for both directions of transcoding: from the original encoding to
Unicode, with the
.BR "\-\-from\-callback"
option, and from Unicode to the destination encoding, with the
.BR "\-\-to\-callback"
option.
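.PP
The two directions can be pictured with Python's codec error handlers,
which play a loosely similar role.
In this sketch the mapping from the callback names
.BR stop ,
.B skip
and
.B substitute
to Python's handlers is an assumption made for illustration, and the
function names are hypothetical; ICU's own substitution characters may
differ from Python's.
.PP
.nf
```python
# Assumed, illustrative mapping from callback names to Python handlers.
HANDLERS = {"stop": "strict", "skip": "ignore", "substitute": "replace"}


def to_unicode(data: bytes, from_code: str, callback: str = "stop") -> str:
    """Original encoding -> Unicode (the --from-callback direction)."""
    return data.decode(from_code, errors=HANDLERS[callback])


def from_unicode(text: str, to_code: str, callback: str = "stop") -> bytes:
    """Unicode -> destination encoding (the --to-callback direction)."""
    return text.encode(to_code, errors=HANDLERS[callback])


if __name__ == "__main__":
    bad_input = bytes([0x61, 0xFF, 0x62])      # 0xFF is never valid UTF-8
    print(to_unicode(bad_input, "utf-8", "skip"))
    katakana_ka = chr(0x30AB)                  # not representable in ASCII
    print(from_unicode("a" + katakana_ka, "ascii", "substitute"))
```
.fi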
.PP
The following is a list of valid
.I callback
names, along with a description of their behavior. The list of
callbacks actually supported by
.B uconv
is displayed when it is called with
.BR "\-h\fP, \fB\-\-help" .
.PP
.TP \w'\fBescape-unicode'u+3n
.B substitute
Write the encoding's substitute sequence, or the Unicode
replacement character
.B U+FFFD
when transcoding to Unicode.
.TP
.B skip
Ignore the invalid data.
.TP
.B stop
Stop with an error when encountering invalid data.
This is the default callback.
.TP
.B escape
Same as
.BR escape-icu .
.TP
.B escape-icu
Replace the missing characters with a string of the format
.BR %U\fIhhhh\fP
for plane 0 characters, and
.BR %U\fIhhhh\fP%U\fIhhhh\fP
for planes 1 and above characters,
where
.I hhhh
is the hexadecimal value of one of the UTF-16 code units representing the
character. Characters from planes 1 and above are written as a pair of
UTF-16 surrogate code units.
.TP
.B escape-java
Replace the missing characters with a string of the format
.BR \eu\fIhhhh\fP
for plane 0 characters, and
.BR \eu\fIhhhh\fP\eu\fIhhhh\fP
for planes 1 and above characters,
where
.I hhhh
is the hexadecimal value of one of the UTF-16 code units representing the
character. Characters from planes 1 and above are written as a pair of
UTF-16 surrogate code units.
.TP
.B escape-c
Replace the missing characters with a string of the format
.BR \eu\fIhhhh\fP
for plane 0 characters, and
.BR \eU\fIhhhhhhhh\fP
for planes 1 and above characters,
where
.I hhhh
and
.I hhhhhhhh
are the hexadecimal values of the Unicode codepoint.
.TP
.B escape-xml
Same as
.BR escape-xml-hex .
.TP
.B escape-xml-hex
Replace the missing characters with a string of the format
.BR &#x\fIhhhh\fP; ,
where
.I hhhh
is the hexadecimal value of the Unicode codepoint.
.TP
.B escape-xml-dec
Replace the missing characters with a string of the format
.BR &#\fInnnn\fP; ,
where
.I nnnn
is the decimal value of the Unicode codepoint.
.TP
.B escape-unicode
Replace the missing characters with a string of the format
.BR {U+\fIhhhh\fP} ,
where
.I hhhh
is the hexadecimal value of the Unicode codepoint.
That hexadecimal string is of variable length and can use from 4 to
6 digits.
This is the format universally used to denote a Unicode codepoint in
the literature, delimited by curly braces for easy recognition of those
substitutions in the output.
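.PP
The escape formats listed above can be reproduced with a few lines of
Python, which may help when post-processing
.B uconv
output.
This is a sketch written from the descriptions in this section, not
taken from ICU; the function names are hypothetical, and the use of
uppercase hexadecimal digits and the amount of zero padding in the XML
forms are assumptions.
.PP
.nf
```python
BACKSLASH = chr(0x5C)  # built with chr() so this listing needs no escapes


def utf16_units(cp: int) -> list:
    """Return the UTF-16 code unit(s) for a Unicode code point."""
    if cp < 0x10000:
        return [cp]                      # plane 0: a single code unit
    cp -= 0x10000                        # planes 1 and above: surrogate pair
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]


def escape(cp: int, style: str) -> str:
    """Format code point cp according to the named escape callback."""
    if style == "escape-icu":
        return "".join("%U{:04X}".format(u) for u in utf16_units(cp))
    if style == "escape-java":
        return "".join(BACKSLASH + "u{:04X}".format(u) for u in utf16_units(cp))
    if style == "escape-c":
        if cp < 0x10000:
            return BACKSLASH + "u{:04X}".format(cp)
        return BACKSLASH + "U{:08X}".format(cp)
    if style == "escape-xml-hex":
        return "&#x{:X};".format(cp)
    if style == "escape-xml-dec":
        return "&#{:d};".format(cp)
    if style == "escape-unicode":
        return "{{U+{:04X}}}".format(cp)  # 4 to 6 hexadecimal digits
    raise ValueError("unknown callback: " + style)


if __name__ == "__main__":
    for name in ("escape-icu", "escape-java", "escape-c",
                 "escape-xml-hex", "escape-xml-dec", "escape-unicode"):
        print(name, escape(0x30AB, name), escape(0x1F600, name))
```
.fi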
.SH EXAMPLES
Convert data from a given
.I encoding
to the platform encoding:

.RS 4
.B \fR$ \fPuconv \-f \fIencoding\fP
.RE
.PP
Check if a
.I file
contains valid data for a given
.IR encoding :

.RS 4
.B \fR$ \fPuconv \-f \fIencoding\fP \-c \fIfile\fP >/dev/null
.RE
.PP
Convert a UTF-8
.I file
to a given
.I encoding
and ensure that the resulting text is good for any version of HTML:

.RS 4
.B \fR$ \fPuconv \-f utf-8 \-t \fIencoding\fP \e
.br
.B " \-\-callback escape-xml-dec \fIfile\fP"
.RE
.PP
Display the names of the Unicode code points in a UTF-8 file:

.RS 4
.B \fR$ \fPuconv \-f utf-8 \-x any-name \fIfile\fP
.RE
.PP
Print the name of a Unicode code point whose value is known (\fBU+30AB\fP
in this example):

.RS 4
.B \fR$ \fPecho '\eu30ab' | uconv \-x 'hex-any; any-name'; echo
.br
{KATAKANA LETTER KA}{LINE FEED}
.br
$
.RE

(The names are delimited by curly braces.
The name of the line terminator is also displayed.)
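.PP
For comparison, a similar brace-delimited listing can be produced with
Python's
.I unicodedata
module.
This is only a rough analogue of the any-name transliterator: the
function name is hypothetical, and because control characters have no
formal name in
.IR unicodedata ,
the sketch falls back to the code point where the example above shows
LINE FEED.
.PP
.nf
```python
import unicodedata


def any_name(text: str) -> str:
    """Wrap the Unicode name of each character in curly braces."""
    return "".join(
        # Unnamed characters (e.g. controls) fall back to U+hhhh.
        "{" + unicodedata.name(ch, "U+{:04X}".format(ord(ch))) + "}"
        for ch in text
    )


if __name__ == "__main__":
    print(any_name(chr(0x30AB) + chr(0x0A)))
```
.fi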
.PP
Normalize UTF-8 data using Unicode NFKC, remove all control characters,
and map Katakana to Hiragana:

.RS 4
.B \fR$ \fPuconv \-f utf-8 \-t utf-8 \e
.br
.B " \-x '::nfkc; [:Cc:] >; ::katakana-hiragana;'"
.RE
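.PP
The three rules in the last example can be approximated in Python to
see what each stage does.
This is a simplified sketch, not ICU's transliterator: the function
name is hypothetical, and the Katakana to Hiragana step assumes a
plain code point shift of 0x60 over the range U+30A1 to U+30F6, which
ignores special cases that ICU may handle differently.
.PP
.nf
```python
import unicodedata


def normalize_kana(text: str) -> str:
    """Apply NFKC, drop control characters, map Katakana to Hiragana."""
    # ::nfkc;
    text = unicodedata.normalize("NFKC", text)
    # [:Cc:] >;   (delete every character of general category Cc)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
    # ::katakana-hiragana;   (approximated by a fixed offset)
    return "".join(
        chr(ord(ch) - 0x60) if 0x30A1 <= ord(ch) <= 0x30F6 else ch
        for ch in text
    )


if __name__ == "__main__":
    # Halfwidth Katakana KA, a BEL control character, and a Latin letter.
    sample = chr(0xFF76) + chr(0x07) + "A"
    print(normalize_kana(sample))
```
.fi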
.SH CAVEATS AND BUGS
.B uconv
reports errors as occurring at the first invalid byte
encountered. This may be confusing to users of GNU
.BR iconv (1),
which reports errors as occurring at the first byte of an invalid
sequence. For multi-byte character sets or encodings, this means that
.BR uconv
error positions may be at a later offset in the input stream than
would be the case with GNU
.BR iconv (1).
.PP
The reporting of error positions when a transliterator is used may be
inaccurate or unavailable, in which case
.BR uconv
will report the offset in the output stream at which the error
occurred.
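.PP
The difference between the two reporting conventions is easiest to see
on a concrete byte string.
The Python sketch below is an illustration only: Python's decoder
exposes both the offset where an invalid sequence starts and the offset
just past the bytes it rejected, and the second can be larger for
multi-byte encodings.
How these map onto the exact offsets printed by
.B uconv
or GNU
.BR iconv (1)
is not verified here, and the function name is hypothetical.
.PP
.nf
```python
def invalid_offsets(data: bytes, encoding: str = "utf-8"):
    """Return (start, end) of the first undecodable sequence, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # start: first byte of the invalid sequence
        # end:   offset of the byte at which decoding gave up
        return err.start, err.end
    return None


if __name__ == "__main__":
    # "A", then a 3-byte UTF-8 sequence cut short by another "A".
    data = bytes([0x41, 0xE3, 0x81, 0x41])
    print(invalid_offsets(data))
```
.fi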
.\" .SH FILES
.\" .TP 15
.\" .B @pkgicudatadir@/@PACKAGE@/@VERSION@/uconvmsg.dat
.\" Compiled resource bundle containing localized messages printed
.\" by
.\" .BR uconv .
.SH AUTHORS
Jonas Utterstroem
.br
Yves Arrouye
.SH VERSION
@VERSION@
.SH COPYRIGHT
Copyright (C) 2000-2005 IBM, Inc. and others.
.SH SEE ALSO
.BR iconv (1)