.\" Hey, Emacs! This is -*-nroff-*- you know...
.\" uconv.1: manual page for the uconv utility.
.\" Copyright (C) 2000-2010 IBM, Inc. and others.
.\" Manual page by Yves Arrouye <yves@realnames.com>.
.TH UCONV 1 "2005-jul-1" "ICU MANPAGE" "ICU @VERSION@ Manual"
\- convert data from one encoding to another
.BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
.BI "\-V\fP, \fB\-\-version"
.BI "\-s\fP, \fB\-\-silent"
.BI "\-v\fP, \fB\-\-verbose"
.BI "\-l\fP, \fB\-\-list"
.BI "\-l\fP, \fB\-\-list\-code" " code"
.BI "\-\-default\-code"
.BI "\-L\fP, \fB\-\-list\-transliterators"
.BI "\-x" " transliteration"
.BI "\-\-to\-callback" " callback"
.BI "\-\-from\-callback" " callback"
.BI "\-\-callback" " callback"
.BI "\-\-no\-fallback"
.BI "\-b\fP, \fB\-\-block\-size" " size"
.BI "\-f\fP, \fB\-\-from\-code" " encoding"
.BI "\-t\fP, \fB\-\-to\-code" " encoding"
.BI "\-\-add\-signature"
.BI "\-\-remove\-signature"
.BI "\-o\fP, \fB\-\-output" " file"
converts, or transcodes, each given
(or its standard input if no
is specified) from one
The transcoding is done using Unicode as a pivot encoding
(i.e. the data are first transcoded from their original encoding to
Unicode, and then from Unicode to the destination encoding).
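.PP
The two-step pivot described above can be sketched in a few lines of Python.
This is only an illustration of the idea, not how the utility is implemented;
the function name
.B transcode
and the sample bytes are assumptions made for the example, and Python's own
codecs stand in for the ICU converters.
.PP
.nf
```python
def transcode(data, from_code, to_code):
    # Step 1: original encoding -> Unicode (the pivot).
    text = data.decode(from_code)
    # Step 2: Unicode -> destination encoding.
    return text.encode(to_code)


# The word "cafe" with an acute accent on the e, in ISO-8859-1.
latin1 = bytes([0x63, 0x61, 0x66, 0xE9])
utf8 = transcode(latin1, "iso-8859-1", "utf-8")
print(utf8.hex())  # 636166c3a9
```
.fi
.PP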
is not specified or is
the default encoding is used. Thus, calling
provides an easy way to validate and sanitize data files for
further consumption by tools requiring data in the default encoding.
it is possible to specify callbacks that are used to handle invalid
characters in the input, or characters that cannot be transcoded to
the destination encoding. Some encodings, for example, offer a default
substitution character that can be used to represent the occurrence of
such characters in the input. Other callbacks offer a useful visual
representation of the invalid data.
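.PP
As a rough analogy only, the error handlers of Python's codecs behave much
like the callbacks described above.
The sketch below is not ICU code and makes no claim about the exact output of
the utility; the helper name
.B encode_ascii
is an assumption made for the example.
.PP
.nf
```python
# U+00E9 (e with acute accent) has no ASCII representation.
text = "caf" + chr(0xE9)


def encode_ascii(errors):
    # The errors argument plays the role of the callback here.
    return text.encode("ascii", errors=errors)


skipped = encode_ascii("ignore")             # drop the character
substituted = encode_ascii("replace")        # substitution character
escaped = encode_ascii("xmlcharrefreplace")  # visual representation

try:
    encode_ascii("strict")                   # stop with an error
    stopped = False
except UnicodeEncodeError:
    stopped = True

print(skipped, substituted, escaped, stopped)
```
.fi
.PP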
can also run the specified
on the transcoded data,
in which case transliteration will happen as an intermediate step,
after the data have been transcoded to Unicode.
can be either a list of semicolon-separated transliterator names,
or an arbitrarily complex set of rules in the ICU transliteration
For transcoding purposes,
options are compatible with those of
making it easy to replace it in scripts. It is not necessarily the case,
however, that the encoding names used by
and ICU are the same as the ones used by
Also, options that provide informational data, such as the
.B \-l\fP, \fB\-\-list
variants such as GNU's, produce data in a slightly different and
easier to parse format.
.BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
Print help about usage and exit.
.BR "\-V\fP, \fB\-\-version"
.BI "\-s\fP, \fB\-\-silent"
Suppress messages during execution.
.BI "\-v\fP, \fB\-\-verbose"
Display extra informative messages during execution.
.BI "\-l\fP, \fB\-\-list"
List all the available encodings and exit.
.BI "\-l\fP, \fB\-\-list\-code" " code"
encoding and exit. If
is not a proper encoding, exit with an error.
.BI "\-\-default\-code"
List only the name of the default encoding and exit.
.BI "\-L\fP, \fB\-\-list\-transliterators"
List all the available transliterators and exit.
.BI "\-l\fP, \fB\-\-list"
.BR "\-\-default\-code" ,
the list of encodings is produced in a format compatible with
.BR convrtrs.txt (5).
.BR "\-L\fP, \fB\-\-list\-transliterators" ,
print only one transliterator name per line.
.BI "\-x" " transliteration"
on the transcoded Unicode data,
and use the transliterated data as input for the transcoding to
the destination encoding.
.BI "\-\-to\-callback" " callback"
to handle characters that cannot be transcoded to the destination
encoding. See section
for details on valid callbacks.
Omit invalid characters from the output.
.BR "\-\-to\-callback skip" .
.BI "\-\-from\-callback" " callback"
to handle characters that cannot be transcoded from the original
encoding. See section
for details on valid callbacks.
Ignore invalid sequences in the input.
.BR "\-\-from\-callback skip" .
.BI "\-\-callback" " callback"
to handle both characters that cannot be transcoded from the original
encoding and characters that cannot be transcoded to the destination
encoding. See section
for details on valid callbacks.
Use the fallback mapping when transcoding from
Unicode to the destination encoding.
.BI "\-\-no\-fallback"
Do not use the fallback mapping when transcoding from Unicode to the
destination encoding.
.BI "\-b\fP, \fB\-\-block\-size" " size"
Read input in blocks of
bytes at a time. The default block size is
.BI "\-f\fP, \fB\-\-from\-code" " encoding"
Set the original encoding of the data to
.BI "\-t\fP, \fB\-\-to\-code" " encoding"
Transcode the data to
.BI "\-\-add\-signature"
Add a U+FEFF Unicode signature character (BOM) if the output charset
supports it and does not add one anyway.
.BI "\-\-remove\-signature"
Remove a U+FEFF Unicode signature character (BOM).
.BI "\-o\fP, \fB\-\-output" " file"
Write the transcoded data to
supports specifying callbacks to handle invalid data. Callbacks can be
set for both directions of transcoding: from the original encoding to
.BR "\-\-from\-callback"
option, and from Unicode to the destination encoding, with the
.BR "\-\-to\-callback"
The following is a list of valid
names, along with a description of their behavior. The list of
callbacks actually supported by
is displayed when it is called with
.BR "\-h\fP, \fB\-\-help" .
.TP \w'\fBescape-unicode'u+3n
Write the encoding's substitute sequence, or the Unicode
replacement character
when transcoding to Unicode.
Ignore the invalid data.
Stop with an error when encountering invalid data.
This is the default callback.
Replace the missing characters with a string of the format
for plane 0 characters, and
.BR %U\fIhhhh\fP%U\fIhhhh\fP
for planes 1 and above characters,
is the hexadecimal value of one of the UTF-16 code units representing the
character. Characters from planes 1 and above are written as a pair of
UTF-16 surrogate code units.
Replace the missing characters with a string of the format
for plane 0 characters, and
.BR \eu\fIhhhh\fP\eu\fIhhhh\fP
for planes 1 and above characters,
is the hexadecimal value of one of the UTF-16 code units representing the
character. Characters from planes 1 and above are written as a pair of
UTF-16 surrogate code units.
Replace the missing characters with a string of the format
for plane 0 characters, and
.BR \eU\fIhhhhhhhh\fP
for planes 1 and above characters,
are the hexadecimal values of the Unicode codepoint.
Replace the missing characters with a string of the format
is the hexadecimal value of the Unicode codepoint.
Replace the missing characters with a string of the format
is the decimal value of the Unicode codepoint.
Replace the missing characters with a string of the format
is the hexadecimal value of the Unicode codepoint.
That hexadecimal string is of variable length and can use from 4 to
This is the format universally used to denote a Unicode codepoint in
the literature, delimited by curly braces for easy recognition of those
substitutions in the output.
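.PP
The escape formats listed above differ mainly in whether they work on UTF-16
code units or on whole code points.
The following Python sketch shows the arithmetic for three of them; it is an
illustration under stated assumptions, not ICU code.
The function names are hypothetical, and the use of uppercase hexadecimal
digits is an assumption.
.PP
.nf
```python
def utf16_units(cp):
    # Plane 0 code points are a single UTF-16 code unit.
    if cp < 0x10000:
        return [cp]
    # Planes 1 and above are split into a surrogate pair.
    v = cp - 0x10000
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


def escape_percent_u(cp):
    # One %Uhhhh group per UTF-16 code unit.
    return "".join("%U{:04X}".format(u) for u in utf16_units(cp))


def escape_braces(cp):
    # Code point in curly braces, at least four hexadecimal digits.
    return "{{U+{:04X}}}".format(cp)


def escape_xml_dec(cp):
    # Decimal XML character reference.
    return "&#{:d};".format(cp)


print(escape_percent_u(0x30AB))   # %U30AB
print(escape_percent_u(0x1F600))  # %UD83D%UDE00
print(escape_braces(0x1F600))     # {U+1F600}
print(escape_xml_dec(0x30AB))     # &#12459;
```
.fi
.PP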
Convert data from a given
to the platform encoding:
.B \fR$ \fPuconv \-f \fIencoding\fP
contains valid data for a given
.B \fR$ \fPuconv \-f \fIencoding\fP \-c \fIfile\fP >/dev/null
and ensure that the resulting text is good for any version of HTML:
.B \fR$ \fPuconv \-f utf-8 \-t \fIencoding\fP \e
.B " \-\-callback escape-xml-dec \fIfile\fP"
Display the names of the Unicode code points in a UTF-8 file:
.B \fR$ \fPuconv \-f utf-8 \-x any-name \fIfile\fP
Print the name of a Unicode code point whose value is known (\fBU+30AB\fP
.B \fR$ \fPecho '\eu30ab' | uconv \-x 'hex-any; any-name'; echo
{KATAKANA LETTER KA}{LINE FEED}
(The names are delimited by curly braces.
The name of the line terminator is displayed as well.)
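.PP
For comparison, the same name lookup can be approximated with Python's
standard
.B unicodedata
module.
This is a sketch and not a substitute for the ICU transliterator: the helper
name
.B names
and the
.B UNNAMED
placeholder are assumptions, and Python assigns no name to control characters
such as the line terminator.
.PP
.nf
```python
import unicodedata


def names(text):
    # Wrap each character name in curly braces, as in the output above.
    return "".join(
        "{" + unicodedata.name(ch, "UNNAMED") + "}" for ch in text
    )


print(names(chr(0x30AB)))  # {KATAKANA LETTER KA}
```
.fi
.PP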
Normalize UTF-8 data using Unicode NFKC, remove all control characters,
and map Katakana to Hiragana:
.B \fR$ \fPuconv \-f utf-8 \-t utf-8 \e
.B " \-x '::nfkc; [:Cc:] >; ::katakana-hiragana;'"
reports errors as occurring at the first invalid byte
encountered. This may be confusing to users of GNU
which reports errors as occurring at the first byte of an invalid
sequence. For multi-byte character sets or encodings, this means that
error positions may be at a later offset in the input stream than
would be the case with GNU
The reporting of error positions when a transliterator is used may be
inaccurate or unavailable, in which case
will report the offset in the output stream at which the error
Copyright (C) 2000-2005 IBM, Inc. and others.