.\" Hey, Emacs! This is -*-nroff-*- you know...
.\" uconv.1: manual page for the uconv utility.
.\" Copyright (C) 2016 and later: Unicode, Inc. and others.
.\" License & terms of use: http://www.unicode.org/copyright.html
.\" Copyright (C) 2000-2013 IBM, Inc. and others.
.\" Manual page by Yves Arrouye <yves@realnames.com>.
.TH UCONV 1 "2005-jul-1" "ICU MANPAGE" "ICU @VERSION@ Manual"
\- convert data from one encoding to another
.BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
.BI "\-V\fP, \fB\-\-version"
.BI "\-s\fP, \fB\-\-silent"
.BI "\-v\fP, \fB\-\-verbose"
.BI "\-l\fP, \fB\-\-list"
.BI "\-l\fP, \fB\-\-list\-code" " code"
.BI "\-\-default\-code"
.BI "\-L\fP, \fB\-\-list\-transliterators"
.BI "\-x" " transliteration"
.BI "\-\-to\-callback" " callback"
.BI "\-\-from\-callback" " callback"
.BI "\-\-callback" " callback"
.BI "\-\-no\-fallback"
.BI "\-b\fP, \fB\-\-block\-size" " size"
.BI "\-f\fP, \fB\-\-from\-code" " encoding"
.BI "\-t\fP, \fB\-\-to\-code" " encoding"
.BI "\-\-add\-signature"
.BI "\-\-remove\-signature"
.BI "\-o\fP, \fB\-\-output" " file"
converts, or transcodes, each given
(or its standard input if no
is specified) from one
The transcoding is done using Unicode as a pivot encoding
(i.e. the data are first transcoded from their original encoding to
Unicode, and then from Unicode to the destination encoding).
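.PP
The two-step pivot described above can be illustrated with a minimal
Python sketch. It uses Python's own codecs rather than ICU, so the
function name
.B transcode
and the encoding names are assumptions made for illustration only, and
the mapping tables may differ from the ones ICU uses.
.PP
.nf
```python
def transcode(data: bytes, from_code: str, to_code: str) -> bytes:
    """Transcode bytes by pivoting through Unicode (a Python str)."""
    # Step 1: original encoding -> Unicode.
    text = data.decode(from_code)
    # Step 2: Unicode -> destination encoding.
    return text.encode(to_code)


# "cafe" with an acute accent on the final letter, stored as Latin-1.
latin1_bytes = ("caf" + chr(0xE9)).encode("latin-1")
utf8_bytes = transcode(latin1_bytes, "latin-1", "utf-8")
print(len(latin1_bytes), len(utf8_bytes))
```
.fi
.PP
The accented letter takes one byte in Latin-1 and two bytes in UTF-8,
so the byte count grows even though the Unicode text in the middle is
unchanged.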
is not specified or is
the default encoding is used. Thus, calling
provides an easy way to validate and sanitize data files for
further consumption by tools requiring data in the default encoding.
it is possible to specify callbacks that are used to handle invalid
characters in the input, or characters that cannot be transcoded to
the destination encoding. Some encodings, for example, offer a default
substitution character that can be used to represent the occurrence of
such characters in the input. Other callbacks offer a useful visual
representation of the invalid data.
can also run the specified
on the transcoded data,
in which case transliteration will happen as an intermediate step,
after the data have been transcoded to Unicode.
can be either a list of semicolon-separated transliterator names,
or an arbitrarily complex set of rules in the ICU transliteration
For transcoding purposes,
options are compatible with those of
making it easy to replace it in scripts. It is not necessarily the case,
however, that the encoding names used by
and ICU are the same as the ones used by
Also, options that provide informational data, such as the
.B \-l\fP, \fB\-\-list
variants such as GNU's, produce data in a slightly different and
easier to parse format.
.BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
Print help about usage and exit.
.BR "\-V\fP, \fB\-\-version"
.BI "\-s\fP, \fB\-\-silent"
Suppress messages during execution.
.BI "\-v\fP, \fB\-\-verbose"
Display extra informative messages during execution.
.BI "\-l\fP, \fB\-\-list"
List all the available encodings and exit.
.BI "\-l\fP, \fB\-\-list\-code" " code"
encoding and exit. If
is not a proper encoding, exit with an error.
.BI "\-\-default\-code"
List only the name of the default encoding and exit.
.BI "\-L\fP, \fB\-\-list\-transliterators"
List all the available transliterators and exit.
.BI "\-l\fP, \fB\-\-list"
.BR "\-\-default\-code" ,
the list of encodings is produced in a format compatible with
.BR convrtrs.txt (5).
.BR "\-L\fP, \fB\-\-list\-transliterators" ,
print only one transliterator name per line.
.BI "\-x" " transliteration"
on the transcoded Unicode data,
and use the transliterated data as input for the transcoding to
the destination encoding.
.BI "\-\-to\-callback" " callback"
to handle characters that cannot be transcoded to the destination
encoding. See section
for details on valid callbacks.
Omit invalid characters from the output.
.BR "\-\-to\-callback skip" .
.BI "\-\-from\-callback" " callback"
to handle characters that cannot be transcoded from the original
encoding. See section
for details on valid callbacks.
Ignore invalid sequences in the input.
.BR "\-\-from\-callback skip" .
.BI "\-\-callback" " callback"
to handle both characters that cannot be transcoded from the original
encoding and characters that cannot be transcoded to the destination
encoding. See section
for details on valid callbacks.
Use the fallback mapping when transcoding from
Unicode to the destination encoding.
.BI "\-\-no\-fallback"
Do not use the fallback mapping when transcoding from Unicode to the
destination encoding.
.BI "\-b\fP, \fB\-\-block\-size" " size"
Read input in blocks of
bytes at a time. The default block size is
.BI "\-f\fP, \fB\-\-from\-code" " encoding"
Set the original encoding of the data to
.BI "\-t\fP, \fB\-\-to\-code" " encoding"
Transcode the data to
.BI "\-\-add\-signature"
Add a U+FEFF Unicode signature character (BOM) if the output charset
supports it and the converter does not already add one itself.
.BI "\-\-remove\-signature"
Remove a U+FEFF Unicode signature character (BOM).
.BI "\-o\fP, \fB\-\-output" " file"
Write the transcoded data to
supports specifying callbacks to handle invalid data. Callbacks can be
set for both directions of transcoding: from the original encoding to
.BR "\-\-from\-callback"
option, and from Unicode to the destination encoding, with the
.BR "\-\-to\-callback"
The following is a list of valid
names, along with a description of their behavior. The list of
callbacks actually supported by
is displayed when it is called with
.BR "\-h\fP, \fB\-\-help" .
.TP \w'\fBescape-unicode'u+3n
Write the encoding's substitute sequence, or the Unicode
replacement character
when transcoding to Unicode.
Ignore the invalid data.
Stop with an error when encountering invalid data.
This is the default callback.
Replace the missing characters with a string of the format
for plane 0 characters, and
.BR %U\fIhhhh\fP%U\fIhhhh\fP
for planes 1 and above characters,
is the hexadecimal value of one of the UTF-16 code units representing the
character. Characters from planes 1 and above are written as a pair of
UTF-16 surrogate code units.
Replace the missing characters with a string of the format
for plane 0 characters, and
.BR \eu\fIhhhh\fP\eu\fIhhhh\fP
for planes 1 and above characters,
is the hexadecimal value of one of the UTF-16 code units representing the
character. Characters from planes 1 and above are written as a pair of
UTF-16 surrogate code units.
Replace the missing characters with a string of the format
for plane 0 characters, and
.BR \eU\fIhhhhhhhh\fP
for planes 1 and above characters,
are the hexadecimal values of the Unicode codepoint.
Replace the missing characters with a string of the format
is the hexadecimal value of the Unicode codepoint.
Replace the missing characters with a string of the format
is the decimal value of the Unicode codepoint.
Replace the missing characters with a string of the format
is the hexadecimal value of the Unicode codepoint.
That hexadecimal string is of variable length and can use from 4 to
This is the format conventionally used to denote a Unicode codepoint in
the literature, delimited by curly braces for easy recognition of those
substitutions in the output.
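.PP
The escape formats described above can be sketched in Python as follows.
This is an illustration only, not ICU's implementation. The style names
used as keys (escape-icu, escape-java, escape-c, escape-xml-hex,
escape-xml-dec, escape-unicode) are assumed to match the callback names
listed by
.BR "uconv \-\-help" ,
and the upper-case hexadecimal digits and zero padding are assumptions
as well.
.PP
.nf
```python
BACKSLASH = chr(92)


def utf16_units(cp: int) -> list:
    """Return the one or two UTF-16 code units for a code point."""
    if cp < 0x10000:
        return [cp]
    cp -= 0x10000
    # High (lead) surrogate, then low (trail) surrogate.
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]


def escape(cp: int, style: str) -> str:
    """Format one unmappable code point in the given escape style."""
    if style == "escape-icu":       # %Uhhhh per UTF-16 code unit
        return "".join("%U{:04X}".format(u) for u in utf16_units(cp))
    if style == "escape-java":      # backslash-u per UTF-16 code unit
        return "".join(BACKSLASH + "u{:04X}".format(u)
                       for u in utf16_units(cp))
    if style == "escape-c":         # 4 digits in plane 0, 8 digits above
        if cp < 0x10000:
            return BACKSLASH + "u{:04X}".format(cp)
        return BACKSLASH + "U{:08X}".format(cp)
    if style == "escape-xml-hex":   # hexadecimal character reference
        return "&#x{:X};".format(cp)
    if style == "escape-xml-dec":   # decimal character reference
        return "&#{:d};".format(cp)
    if style == "escape-unicode":   # {U+hhhh}, at least 4 digits
        return "{{U+{:04X}}}".format(cp)
    raise ValueError("unknown style: " + style)


for name in ("escape-icu", "escape-c", "escape-xml-dec", "escape-unicode"):
    print(name, escape(0x1F600, name))
```
.fi
.PP
U+1F600 lies outside plane 0, so the styles based on UTF-16 code units
emit a surrogate pair while the others emit the code point value
directly.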
Convert data from a given
to the platform encoding:
.B \fR$ \fPuconv \-f \fIencoding\fP
contains valid data for a given
.B \fR$ \fPuconv \-f \fIencoding\fP \-c \fIfile\fP >/dev/null
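.PP
The same kind of validity check can be approximated in Python by
attempting a strict decode and discarding the result. This is only a
sketch: the helper name
.B is_valid
is hypothetical, and Python's codecs may accept or reject slightly
different byte sequences than ICU's converters do.
.PP
.nf
```python
def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly from the given encoding."""
    try:
        # The decoded text is thrown away; only success matters.
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


good = bytes([0x63, 0xC3, 0xA9])   # "c" plus a complete 2-byte sequence
bad = bytes([0x63, 0xE9])          # "c" plus a truncated sequence
print(is_valid(good, "utf-8"), is_valid(bad, "utf-8"))
```
.fi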
and ensure that the resulting text is good for any version of HTML:
.B \fR$ \fPuconv \-f utf-8 \-t \fIencoding\fP \e
.B " \-\-callback escape-xml-dec \fIfile\fP"
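.PP
For comparison, Python's standard
.B xmlcharrefreplace
error handler produces decimal character references in a similar way
when a character cannot be encoded. The sketch below assumes ASCII as
the destination encoding and is not meant to reproduce
uconv output byte for byte.
.PP
.nf
```python
# "naive" with a diaeresis on the i, followed by KATAKANA LETTER KA.
text = "na" + chr(0xEF) + "ve " + chr(0x30AB)

# Characters missing from ASCII become &#NNNN; references.
html_safe = text.encode("ascii", errors="xmlcharrefreplace")
print(html_safe.decode("ascii"))
```
.fi
.PP
Every character that ASCII can represent passes through unchanged, so
the result stays readable while remaining safe to embed in HTML.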
Display the names of the Unicode code points in a UTF-8 file:
.B \fR$ \fPuconv \-f utf-8 \-x any-name \fIfile\fP
Print the name of a Unicode code point whose value is known (\fBU+30AB\fP
.B \fR$ \fPecho '\eu30ab' | uconv \-x 'hex-any; any-name'; echo
{KATAKANA LETTER KA}{LINE FEED}
(The names are delimited by curly braces.
The name of the line terminator is displayed as well.)
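.PP
A rough Python counterpart of the name lookup uses the standard
.B unicodedata
module. The helper name
.B names
is hypothetical. Unlike the output shown above, Python's database has no
name for control characters such as the line feed, so this sketch falls
back to a U+hhhh label for them.
.PP
.nf
```python
import unicodedata


def names(text: str) -> str:
    """Return the name of each character, wrapped in curly braces."""
    parts = []
    for ch in text:
        # Fall back to U+hhhh when the database has no name.
        label = unicodedata.name(ch, "U+%04X" % ord(ch))
        parts.append("{" + label + "}")
    return "".join(parts)


print(names(chr(0x30AB)))
```
.fi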
Normalize UTF-8 data using Unicode NFKC, remove all control characters,
and map Katakana to Hiragana:
.B \fR$ \fPuconv \-f utf-8 \-t utf-8 \e
.B " \-x '::nfkc; [:Cc:] >; ::katakana-hiragana;'"
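.PP
The three steps of that rule set can be approximated in Python. This is
a simplified sketch under stated assumptions: the helper name
.B clean
is hypothetical, and the Katakana to Hiragana step only shifts the
range U+30A1 to U+30F6 down by 0x60, which covers the basic letters but
is not claimed to match every case handled by ICU's transliterator.
.PP
.nf
```python
import unicodedata


def clean(text: str) -> str:
    """NFKC-normalize, drop control characters, map Katakana to Hiragana."""
    # Step 1: compatibility normalization (NFKC).
    text = unicodedata.normalize("NFKC", text)
    out = []
    for ch in text:
        # Step 2: drop characters in general category Cc (controls).
        if unicodedata.category(ch) == "Cc":
            continue
        # Step 3: shift basic Katakana letters onto Hiragana.
        cp = ord(ch)
        if 0x30A1 <= cp <= 0x30F6:
            ch = chr(cp - 0x60)
        out.append(ch)
    return "".join(out)


# Half-width KA, a BEL control character, then the letter "a".
sample = chr(0xFF76) + chr(0x07) + "a"
print([hex(ord(c)) for c in clean(sample)])
```
.fi
.PP
NFKC first turns the half-width Katakana letter into its full-width
form, which is why the range check in the last step is sufficient here.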
reports errors as occurring at the first invalid byte
encountered. This may be confusing to users of GNU
which reports errors as occurring at the first byte of an invalid
sequence. For multi-byte character sets or encodings, this means that
error positions may be at a later offset in the input stream than
would be the case with GNU
The reporting of error positions when a transliterator is used may be
inaccurate or unavailable, in which case
will report the offset in the output stream at which the error
Copyright (C) 2000-2005 IBM, Inc. and others.