.\" Hey, Emacs! This is -*-nroff-*- you know...
.\"
.\" uconv.1: manual page for the uconv utility.
.\"
.\" Copyright (C) 2016 and later: Unicode, Inc. and others.
.\" License & terms of use: http://www.unicode.org/copyright.html
.\" Copyright (C) 2000-2013 IBM, Inc. and others.
.\"
.\" Manual page by Yves Arrouye <yves@realnames.com>.
.\"
.TH UCONV 1 "2005-jul-1" "ICU MANPAGE" "ICU @VERSION@ Manual"
.SH NAME
.B uconv
\- convert data from one encoding to another
.SH SYNOPSIS
.B uconv
[
.BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
]
[
.BI "\-V\fP, \fB\-\-version"
]
[
.BI "\-s\fP, \fB\-\-silent"
]
[
.BI "\-v\fP, \fB\-\-verbose"
]
[
.BI "\-l\fP, \fB\-\-list"
|
.BI "\-l\fP, \fB\-\-list\-code" " code"
|
.BI "\-\-default\-code"
|
.BI "\-L\fP, \fB\-\-list\-transliterators"
]
[
.BI "\-\-canon"
]
[
.BI "\-x" " transliteration"
]
[
.BI "\-\-to\-callback" " callback"
|
.B "\-c"
]
[
.BI "\-\-from\-callback" " callback"
|
.B "\-i"
]
[
.BI "\-\-callback" " callback"
]
[
.BI "\-\-fallback"
|
.BI "\-\-no\-fallback"
]
[
.BI "\-b\fP, \fB\-\-block\-size" " size"
]
[
.BI "\-f\fP, \fB\-\-from\-code" " encoding"
]
[
.BI "\-t\fP, \fB\-\-to\-code" " encoding"
]
[
.BI "\-\-add\-signature"
]
[
.BI "\-\-remove\-signature"
]
[
.BI "\-o\fP, \fB\-\-output" " file"
]
[
.IR file .\|.\|.
]
.SH DESCRIPTION
.B uconv
converts, or transcodes, each given
.I file
(or standard input if no
.I file
is specified) from one
.I encoding
to another.
The transcoding is done using Unicode as a pivot encoding
(i.e. the data are first transcoded from their original encoding to
Unicode, and then from Unicode to the destination encoding).
.PP
If an
.I encoding
is not specified or is
.BR \- ,
the default encoding is used. Thus, calling
.B uconv
with no
.I encoding
provides an easy way to validate and sanitize data files for
further consumption by tools requiring data in the default encoding.
.PP
When calling
.BR uconv ,
it is possible to specify callbacks that are used to handle invalid
characters in the input, or characters that cannot be transcoded to
the destination encoding. Some encodings, for example, offer a default
substitution character that can be used to represent the occurrence of
such characters in the input. Other callbacks offer a useful visual
representation of the invalid data.
.PP
.B uconv
can also run the specified
.I transliteration
on the transcoded data,
in which case transliteration will happen as an intermediate step,
after the data have been transcoded to Unicode.
The
.I transliteration
can be either a list of semicolon-separated transliterator names,
or an arbitrarily complex set of rules in the ICU transliteration
rules format.
.PP
For transcoding purposes,
.B uconv
options are compatible with those of
.BR iconv (1),
making it easy to use as a replacement in scripts. It is not
necessarily the case, however, that the encoding names used by
.B uconv
and ICU are the same as the ones used by
.BR iconv (1).
Also, options that provide informational data, such as the
.B \-l\fP, \fB\-\-list
option offered by some
.BR iconv (1)
variants such as GNU's, produce data in a slightly different,
easier-to-parse format.
.SH OPTIONS
.TP
.BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
Print help about usage and exit.
.TP
.BR "\-V\fP, \fB\-\-version"
Print the version of
.B uconv
and exit.
.TP
.BI "\-s\fP, \fB\-\-silent"
Suppress messages during execution.
.TP
.BI "\-v\fP, \fB\-\-verbose"
Display extra informative messages during execution.
.TP
.BI "\-l\fP, \fB\-\-list"
List all the available encodings and exit.
.TP
.BI "\-l\fP, \fB\-\-list\-code" " code"
List only the
.I code
encoding and exit. If
.I code
is not a proper encoding, exit with an error.
.TP
.BI "\-\-default\-code"
List only the name of the default encoding and exit.
.TP
.BI "\-L\fP, \fB\-\-list\-transliterators"
List all the available transliterators and exit.
.TP
.BI "\-\-canon"
If used with
.BI "\-l\fP, \fB\-\-list"
or
.BR "\-\-default\-code" ,
the list of encodings is produced in a format compatible with
.BR convrtrs.txt (5).
If used with
.BR "\-L\fP, \fB\-\-list\-transliterators" ,
print only one transliterator name per line.
.TP
.BI "\-x" " transliteration"
Run the given
.I transliteration
on the transcoded Unicode data,
and use the transliterated data as input for the transcoding to
the destination encoding.
.TP
.BI "\-\-to\-callback" " callback"
Use
.I callback
to handle characters that cannot be transcoded to the destination
encoding. See section
.B CALLBACKS
for details on valid callbacks.
.TP
.B "\-c"
Omit invalid characters from the output.
Same as
.BR "\-\-to\-callback skip" .
.TP
.BI "\-\-from\-callback" " callback"
Use
.I callback
to handle characters that cannot be transcoded from the original
encoding. See section
.B CALLBACKS
for details on valid callbacks.
.TP
.B "\-i"
Ignore invalid sequences in the input.
Same as
.BR "\-\-from\-callback skip" .
.TP
.BI "\-\-callback" " callback"
Use
.I callback
to handle both characters that cannot be transcoded from the original
encoding and characters that cannot be transcoded to the destination
encoding. See section
.B CALLBACKS
for details on valid callbacks.
.TP
.BI "\-\-fallback"
Use the fallback mapping when transcoding from
Unicode to the destination encoding.
.TP
.BI "\-\-no\-fallback"
Do not use the fallback mapping when transcoding from Unicode to the
destination encoding.
This is the default.
.TP
.BI "\-b\fP, \fB\-\-block\-size" " size"
Read input in blocks of
.I size
bytes at a time. The default block size is
4096.
.TP
.BI "\-f\fP, \fB\-\-from\-code" " encoding"
Set the original encoding of the data to
.IR encoding .
.TP
.BI "\-t\fP, \fB\-\-to\-code" " encoding"
Transcode the data to
.IR encoding .
.TP
.BI "\-\-add\-signature"
Add a U+FEFF Unicode signature character (BOM) if the output charset
supports it and does not already add one itself.
.TP
.BI "\-\-remove\-signature"
Remove a U+FEFF Unicode signature character (BOM).
.TP
.BI "\-o\fP, \fB\-\-output" " file"
Write the transcoded data to
.IR file .
.SH CALLBACKS
.B uconv
supports specifying callbacks to handle invalid data. Callbacks can be
set for both directions of transcoding: from the original encoding to
Unicode, with the
.B "\-\-from\-callback"
option, and from Unicode to the destination encoding, with the
.B "\-\-to\-callback"
option.
.PP
The following is a list of valid
.I callback
names, along with a description of their behavior. The list of
callbacks actually supported by
.B uconv
is displayed when it is called with
.BR "\-h\fP, \fB\-\-help" .
.PP
.TP \w'\fBescape-unicode'u+3n
.B substitute
Write the encoding's substitute sequence, or the Unicode
replacement character
.B U+FFFD
when transcoding to Unicode.
.TP
.B skip
Ignore the invalid data.
.TP
.B stop
Stop with an error when encountering invalid data.
This is the default callback.
.TP
.B escape
Same as
.BR escape-icu .
.TP
.B escape-icu
Replace the missing characters with a string of the format
.BR %U\fIhhhh\fP
for plane 0 characters, and
.BR %U\fIhhhh\fP%U\fIhhhh\fP
for planes 1 and above characters,
where
.I hhhh
is the hexadecimal value of one of the UTF-16 code units representing the
character. Characters from planes 1 and above are written as a pair of
UTF-16 surrogate code units.
.TP
.B escape-java
Replace the missing characters with a string of the format
.BR \eu\fIhhhh\fP
for plane 0 characters, and
.BR \eu\fIhhhh\fP\eu\fIhhhh\fP
for planes 1 and above characters,
where
.I hhhh
is the hexadecimal value of one of the UTF-16 code units representing the
character. Characters from planes 1 and above are written as a pair of
UTF-16 surrogate code units.
.TP
.B escape-c
Replace the missing characters with a string of the format
.BR \eu\fIhhhh\fP
for plane 0 characters, and
.BR \eU\fIhhhhhhhh\fP
for planes 1 and above characters,
where
.I hhhh
and
.I hhhhhhhh
are the hexadecimal values of the Unicode codepoint.
.TP
.B escape-xml
Same as
.BR escape-xml-hex .
.TP
.B escape-xml-hex
Replace the missing characters with a string of the format
.BR &#x\fIhhhh\fP; ,
where
.I hhhh
is the hexadecimal value of the Unicode codepoint.
.TP
.B escape-xml-dec
Replace the missing characters with a string of the format
.BR &#\fInnnn\fP; ,
where
.I nnnn
is the decimal value of the Unicode codepoint.
.TP
.B escape-unicode
Replace the missing characters with a string of the format
.BR {U+\fIhhhh\fP} ,
where
.I hhhh
is the hexadecimal value of the Unicode codepoint.
That hexadecimal string is of variable length and can use from 4 to
6 digits.
This is the notation conventionally used to denote a Unicode codepoint,
delimited by curly braces for easy recognition of those
substitutions in the output.
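.PP
As a small illustration of the escape formats described above, the
following sketch feeds the UTF-8 encoding of
.B U+20AC
(EURO SIGN) to
.B uconv
and transcodes it to US-ASCII, which cannot represent that character.
The choice of character, the
.B us-ascii
encoding name, and the use of
.BR printf (1)
are assumptions made for this example;
.B \-\-callback
is used so that the callback applies in both directions:

.RS 4
.B \fR$ \fPprintf '\e342\e202\e254\en' | uconv \-f utf-8 \-t us-ascii \e
.br
.B "    \-\-callback escape-xml-dec"
.RE

Going by the
.B escape-xml-dec
format above, the character should appear in the output as
.BR &#8364; ,
the decimal value of
.BR U+20AC .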
.SH EXAMPLES
Convert data from a given
.I encoding
to the platform encoding:

.RS 4
.B \fR$ \fPuconv \-f \fIencoding\fP
.RE
.PP
Check if a
.I file
contains valid data for a given
.IR encoding :

.RS 4
.B \fR$ \fPuconv \-f \fIencoding\fP \-c \fIfile\fP >/dev/null
.RE
.PP
Convert a UTF-8
.I file
to a given
.I encoding
and ensure that the resulting text is suitable for any version of HTML:

.RS 4
.B \fR$ \fPuconv \-f utf-8 \-t \fIencoding\fP \e
.br
.B "    \-\-callback escape-xml-dec \fIfile\fP"
.RE
.PP
Display the names of the Unicode code points in a UTF-8 file:

.RS 4
.B \fR$ \fPuconv \-f utf-8 \-x any-name \fIfile\fP
.RE
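.PP
Print the name of the default encoding, then check whether a given
.I encoding
name is known to
.B uconv
(a sketch based on the
.B \-\-default\-code
and
.B \-\-list\-code
descriptions in
.BR OPTIONS ;
the second command is expected to exit with an error if
.I encoding
is not a proper encoding):

.RS 4
.B \fR$ \fPuconv \-\-default\-code
.br
.B \fR$ \fPuconv \-\-list\-code \fIencoding\fP
.RE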
.PP
Print the name of a Unicode code point whose value is known (\fBU+30AB\fP
in this example):

.RS 4
.B \fR$ \fPecho '\eu30ab' | uconv \-x 'hex-any; any-name'; echo
.br
{KATAKANA LETTER KA}{LINE FEED}
.br
$
.RE

(The names are delimited by curly braces.
The name of the line terminator is also displayed.)
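.PP
Remove a leading U+FEFF signature (BOM) from a UTF-8
.I file
and write the result to
.I outfile
(a sketch combining the
.B \-\-remove\-signature
and
.B \-o
options described in
.BR OPTIONS ;
.I outfile
is a placeholder name):

.RS 4
.B \fR$ \fPuconv \-f utf-8 \-t utf-8 \-\-remove\-signature \e
.br
.B "    \-o \fIoutfile\fP \fIfile\fP"
.RE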
.PP
Normalize UTF-8 data using Unicode NFKC, remove all control characters,
and map Katakana to Hiragana:

.RS 4
.B \fR$ \fPuconv \-f utf-8 \-t utf-8 \e
.br
.B "      \-x '::nfkc; [:Cc:] >; ::katakana-hiragana;'"
.RE
.SH CAVEATS AND BUGS
.B uconv
reports errors as occurring at the first invalid byte
encountered. This may be confusing to users of GNU
.BR iconv (1),
which reports errors as occurring at the first byte of an invalid
sequence. For multi-byte character sets or encodings, this means that
.B uconv
error positions may be at a later offset in the input stream than
would be the case with GNU
.BR iconv (1).
.PP
The reporting of error positions when a transliterator is used may be
inaccurate or unavailable, in which case
.B uconv
will report the offset in the output stream at which the error
occurred.
.SH AUTHORS
Jonas Utterstroem
.br
Yves Arrouye
.SH VERSION
@VERSION@
.SH COPYRIGHT
Copyright (C) 2000-2005 IBM, Inc. and others.
.SH SEE ALSO
.BR iconv (1)