#--------------------------------------------------------------------
# Copyright (c) 1999-2004, International Business Machines
# Corporation and others. All Rights Reserved.
#--------------------------------------------------------------------
# note: a global filter is more efficient, but MUST include all source chars
:: [\u0000-\u007E \u3001\u3002 \u3099-\u309C \u30A1-\u30FC \uFF61-\uFF9Fー[:Hiragana:] [:Katakana:] [:nonspacing mark:]] ;
# This is largely a one-to-one mapping, but it has a
# 1. The Katakana va/vi/ve/vo (30F7-30FA) have no
# Hiragana equivalents. We use Hiragana wa/wi/we/wo
# (308F-3092) with a voicing mark (3099), which is
# semantically equivalent. However, this is a
# non-round-tripping transformation.
# 2. The Katakana small ka/ke (30F5,30F6) have no
# Hiragana equivalents. We convert them to normal
# Hiragana ka/ke (304B,3051). This is a one-way
# information-losing transformation and precludes
# round-tripping of 30F5 and 30F6.
# 3. The combining marks 3099-309C are in the Hiragana
# block, but they apply to Katakana as well, so we
# leave them untouched.
# 4. The Katakana prolonged sound mark 30FC lengthens the
# preceding vowel; we convert it to a doubled Hiragana
# vowel. This is a one-way information-losing
# transformation from Katakana to Hiragana.
# 5. The Katakana middle dot (30FB) separates words in
# foreign expressions; we leave this unmodified.
# The above points preclude successful round-trip
# transformations of arbitrary input text. However,
# they provide naturalistic results that should conform
# to user expectations.
# Combining equivalents va/vi/ve/vo
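# Illustrative sketch only, not the shipped rules: assuming the two-way
# "<>" rule operator, each pairing of a Hiragana wa-row letter plus the
# voicing mark 3099 with its Katakana v-letter could be written as below.
# The choice of "<>" here is an assumption.
#   \u308F\u3099 <> \u30F7 ;   # wa + voicing mark <> va
#   \u3090\u3099 <> \u30F8 ;   # wi + voicing mark <> vi
#   \u3091\u3099 <> \u30F9 ;   # we + voicing mark <> ve
#   \u3092\u3099 <> \u30FA ;   # wo + voicing mark <> vo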
# One-to-one mappings, main block
# 3041:3094 <> 30A1:30F4
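# Illustrative sketch only: each Katakana code point in this range is the
# Hiragana code point plus 0x60, so the block expands to one two-way rule
# per letter. The first two entries and the last would look like this:
#   \u3041 <> \u30A1 ;   # small a
#   \u3042 <> \u30A2 ;   # a
#   \u3094 <> \u30F4 ;   # vu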
# One-way Katakana-Hiragana transform of small Katakana ka/ke
# (30F5,30F6) to normal Hiragana ka/ke (304B,3051).
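# Illustrative sketch only: with "<" marking a rule that applies solely in
# the reverse (Katakana-to-Hiragana) direction, the two small ka/ke
# mappings described in point 2 could be written as follows. The exact
# rule form is an assumption.
#   \u304B < \u30F5 ;   # small Katakana ka -> Hiragana ka
#   \u3051 < \u30F6 ;   # small Katakana ke -> Hiragana ke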
# A Katakana followed by a prolonged sound mark 30FC has
# its final vowel doubled. This is a Katakana-Hiragana
# one-way information-losing transformation. We
# include the small Katakana (e.g., small A 30A1) and
# do not distinguish them from their large
# counterparts. Doubling the vowel of a small Katakana
# as a small Hiragana vowel would not make sense, so
# we don't do so. In natural text this should never
# occur anyway. If a 30FC is seen without a preceding
# vowel sound (e.g., after n 30F3), we do not change it.
# The following categories are Hiragana, not Katakana
# as might be expected, since by the time we get to the
# 30FC, the preceding character will have already been
# transformed to Hiragana.
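# Illustrative sketch only, with hypothetical variable names and a
# deliberately partial category: a reverse-only rule can use the already
# converted Hiragana as preceding context ("{" separates that context
# from the text being replaced).
#   $long = \u30FC ;
#   $xa = [\u3041 \u3042 \u304B \u304C] ;   # some Hiragana ending in vowel a
#   \u3042 < $xa { $long ;                  # 30FC after an a-sound becomes a
# A 30FC with no matching context, such as one after n, is left unchanged
# because no rule applies to it.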
# {The following mechanically generated from the
# note: a global filter is more efficient, but MUST include all source chars!!
:: ([\u0000-\u007E \u3001\u3002 \u3099-\u309C \u30A1-\u30FC \uFF61-\uFF9Fー[:Hiragana:] [:Katakana:] [:nonspacing mark:]]);