# ***************************************************************************
# * Copyright (C) 2004-2016, International Business Machines
# * Corporation; Unicode, Inc.; and others. All Rights Reserved.
# ***************************************************************************
# note: a global filter is more efficient, but MUST include all source chars
:: [\u0000-\u007E 、。 \u3099-゜ ァ-ー 。-゚ー[:Hiragana:] [:Katakana:] [:nonspacing mark:]] ;
# This is largely a one-to-one mapping, but it has a
# few kinks:
#
# 1. The Katakana va/vi/ve/vo (30F7-30FA) have no
#    Hiragana equivalents. We use Hiragana wa/wi/we/wo
#    (308F-3092) with a voicing mark (3099), which is
#    semantically equivalent. However, this is a
#    non-roundtripping transformation.
# 2. The Katakana small ka/ke (30F5,30F6) have no
#    Hiragana equivalents. We convert them to normal
#    Hiragana ka/ke (304B,3051). This is a one-way
#    information-losing transformation and precludes
#    round-tripping of 30F5 and 30F6.
# 3. The combining marks 3099-309C are in the Hiragana
#    block, but they apply to Katakana as well, so we
#    leave them untouched.
# 4. The Katakana prolonged sound mark 30FC doubles the
#    preceding vowel. This is a one-way
#    information-losing transformation from Katakana to
#    Hiragana.
# 5. The Katakana middle dot separates words in foreign
#    expressions; we leave this unmodified.
#
# The above points preclude successful round-trip
# transformations of arbitrary input text. However,
# they provide naturalistic results that should conform
# to user expectations.
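#
# For illustration only (a sketch of the behavior described
# above, written as comments, not as rules), the
# Katakana-to-Hiragana direction should give results such as:
#   ヷ (30F7)        becomes わ゙ (308F 3099)   [point 1]
#   ヵ (30F5)        becomes か (304B)          [point 2]
#   カー (30AB 30FC) becomes かあ (304B 3042)   [point 4]
#   ・ (30FB)        is left unchanged          [point 5]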
# Combining equivalents va/vi/ve/vo

# One-to-one mappings, main block
# 3041:3094 ↔ 30A1:30F4
# One-way Katakana-Hiragana transform of small Katakana
# ka/ke to normal Hiragana ka/ke.
# Katakana followed by a prolonged sound mark 30FC has
# its final vowel doubled. This is a Katakana-Hiragana
# one-way information-losing transformation. We
# include the small Katakana (e.g., small A 30A1, which
# has already become Hiragana 3041 by this point) and
# do not distinguish them from their large
# counterparts. Doubling a small vowel as another small
# Hiragana vowel makes no sense, so we don't do so. In
# natural text this should never occur anyway. If a
# 30FC is seen without a preceding vowel sound (e.g.,
# after n 30F3), we do not change it.
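#
# A minimal sketch of how such a context rule might be
# written, shown commented out so that it has no effect.
# The names $long and $xa are assumptions made for this
# illustration, and the set shown is hypothetical and
# abbreviated:
#   $long = ー ;
#   $xa = [ぁ あ か が さ ざ] ;   (Hiragana ending in the vowel a)
#   あ ← $xa { $long } ;          (e.g., かー becomes かあ)
#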
# The following categories are Hiragana, not Katakana
# as might be expected, since by the time we get to the
# 30FC, the preceding character will have already been
# transformed to Hiragana.
# {The following mechanically generated from the
# note: a global filter is more efficient, but MUST include all source chars!!
:: ([\u0000-\u007E 、。 \u3099-゜ ァ-ー 。-゚ー[:Hiragana:] [:Katakana:] [:nonspacing mark:]]);