Commit | Line | Data |
---|---|---|
1 | * Copyright (C) 2004-2012, International Business Machines | |
2 | * Corporation and others. All Rights Reserved. | |
3 | * | |
4 | * file name: changes.txt | |
5 | * encoding: US-ASCII | |
6 | * tab size: 8 (not used) | |
7 | * indentation:4 | |
8 | * | |
9 | * created on: 2004may06 | |
10 | * created by: Markus W. Scherer | |
11 | * | |
12 | * change log for Unicode updates | |
13 | ||
14 | ---------------------------------------------------------------------------- *** | |
15 | ||
16 | Future Unicode update | |
17 | ||
18 | The tools have been simplified since the Unicode 6.1 update. See | |
19 | - http://site.icu-project.org/design/props/ppucd | |
20 | - http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972 | |
21 | ||
22 | * Unicode version numbers | |
23 | - icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates | |
24 | ||
25 | * file preparation | |
26 | - ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py: | |
27 | - ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src | |
28 | - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. | |
29 | - Check test file diffs for previously commented-out, known-failing data lines; | |
30 | probably need to keep those commented out. | |
31 | ||
32 | * PropertyValueAliases.txt changes | |
33 | - Script codes that are in ISO 15924 but not in Unicode are now listed in | |
34 | preparseucd.py, in the _scripts_only_in_iso15924 variable. | |
35 | If there are new ISO codes, then add them. | |
36 | If Unicode adds some of them, then remove them from the .py variable. | |
37 | ||
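The bookkeeping rule for ISO-only script codes can be sketched as a small set operation. This is a minimal illustration, not the code in preparseucd.py: the function name `reconcile_iso_only_scripts` and the use of plain sets are assumptions, and the real `_scripts_only_in_iso15924` variable may be structured differently.

```python
def reconcile_iso_only_scripts(iso_only, iso_codes, unicode_codes):
    """Return the updated set of script codes that are in ISO 15924 but not in Unicode.

    iso_only      -- codes currently listed as ISO-only (hypothetical set form)
    iso_codes     -- all codes currently registered in ISO 15924
    unicode_codes -- all script codes listed in PropertyValueAliases.txt
    """
    updated = set(iso_only)
    # "If there are new ISO codes, then add them."
    updated |= set(iso_codes) - set(unicode_codes)
    # "If Unicode adds some of them, then remove them."
    updated -= set(unicode_codes)
    return updated


if __name__ == "__main__":
    # Codes taken from the Unicode 6.1 notes: Shrd moved into Unicode,
    # while Khoj, Tirh and Hluw were newly registered in ISO 15924.
    before = {"Shrd", "Afak"}
    iso = {"Shrd", "Afak", "Khoj", "Tirh", "Hluw"}
    ucd = {"Shrd", "Latn"}
    print(sorted(reconcile_iso_only_scripts(before, iso, ucd)))
```

Keeping the rule as two explicit steps mirrors the two sentences of the note, which makes it easy to check each step against a diff of PropertyValueAliases.txt.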
38 | * UnicodeData.txt changes | |
39 | - No more manual changes for CJK ranges for algorithmic names; | |
40 | those are now written to ppucd.txt and genprops reads them from there. | |
41 | ||
42 | * generate core properties data files (makeprops.sh was deleted) | |
43 | - ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src | |
44 | ||
45 | * no more manual updates of source/data/unidata/norm2/nfkc_cf.txt | |
46 | - it is now generated by preparseucd.py | |
47 | ||
48 | * no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt | |
49 | - it is now generated by preparseucd.py | |
50 | - make sure that the Unicode data folder passed into preparseucd.py | |
51 | includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt | |
52 | (can be in some subfolder) | |
53 | ||
54 | * generate normalization data files | |
55 | - ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib | |
56 | - ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in | |
57 | - ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata | |
58 | - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt | |
59 | - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt | |
60 | - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt | |
61 | - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt | |
62 | ||
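The four gennorm2 invocations above differ only in the output file and the list of input files; each later data file layers its input on top of nfc.txt. A minimal sketch that builds the same command lines from a table follows. The helper name `gennorm2_commands` and the `NORM2_JOBS` table are illustrative, and only the `-o` and `-s` options shown above are used.

```python
# Output file -> input files, exactly as listed in the commands above.
NORM2_JOBS = [
    ("nfc.nrm", ["nfc.txt"]),
    ("nfkc.nrm", ["nfc.txt", "nfkc.txt"]),
    ("nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"]),
    ("uts46.nrm", ["nfc.txt", "uts46.txt"]),
]


def gennorm2_commands(src_data_in, unidata):
    """Build one argument list per .nrm file (nothing is executed here)."""
    commands = []
    for output, inputs in NORM2_JOBS:
        commands.append(
            ["bin/gennorm2", "-o", f"{src_data_in}/{output}",
             "-s", f"{unidata}/norm2", *inputs]
        )
    return commands


if __name__ == "__main__":
    for cmd in gennorm2_commands("$SRC_DATA_IN", "$UNIDATA"):
        print(" ".join(cmd))
```

The argument lists could be passed to `subprocess.run` from the ICU build folder, with LD_LIBRARY_PATH set as in the first step above.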
63 | * build ICU (make install) | |
64 | * build Unicode tools using CMake+make | |
65 | ||
66 | * new way to call genuca (makeuca.sh was deleted) | |
67 | - ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src | |
68 | ||
69 | ---------------------------------------------------------------------------- *** | |
70 | ||
71 | Unicode 6.1 update | |
72 | ||
73 | *** ICU Trac | |
74 | ||
75 | - ticket 8995 final update to Unicode 6.1 | |
76 | - ticket 8994 regenerate source/layout/CanonData.cpp | |
77 | ||
78 | - ticket 8961 support Unicode "Age" value *names* | |
79 | - ticket 8963 support multiple character name aliases & types | |
80 | ||
81 | - ticket 8827 "update ICU to Unicode 6.1" | |
82 | - C++ branches/markus/uni61 at r30864 from trunk at r30843 | |
83 | - Java branches/markus/uni61 at r30865 from trunk at r30863 | |
84 | ||
85 | *** Unicode version numbers | |
86 | - makedata.mak | |
87 | - uchar.h | |
88 | (configure.in & configure have been modified to extract the version from uchar.h) | |
89 | - com.ibm.icu.util.VersionInfo | |
90 | - icutools/unicode/makedefs.sh | |
91 | + also review & update other definitions in that file, | |
92 | e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l | |
93 | ||
94 | *** data files & enums & parser code | |
95 | ||
96 | * file preparation | |
97 | ||
98 | ~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed | |
99 | - This prepares both unidata and testdata files in respective output subfolders. | |
100 | - Check test file diffs for previously commented-out, known-failing data lines; | |
101 | probably need to keep those commented out. | |
102 | ||
103 | * PropertyValueAliases.txt changes | |
104 | - 11 new block names: | |
105 | Arabic_Extended_A | |
106 | Arabic_Mathematical_Alphabetic_Symbols | |
107 | Chakma | |
108 | Meetei_Mayek_Extensions | |
109 | Meroitic_Cursive | |
110 | Meroitic_Hieroglyphs | |
111 | Miao | |
112 | Sharada | |
113 | Sora_Sompeng | |
114 | Sundanese_Supplement | |
115 | Takri | |
116 | -> add to uchar.h | |
117 | -> add to UCharacter.UnicodeBlock IDs | |
118 | Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) | |
119 | replace public static final int \1_ID = \2; \3 | |
120 | -> add to UCharacter.UnicodeBlock objects | |
121 | Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) | |
122 | replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 | |
123 | - 1 new Joining_Group (jg) value: | |
124 | Rohingya_Yeh | |
125 | -> uchar.h & UCharacter.JoiningGroup | |
126 | - 2 new Line_Break (lb) values: | |
127 | CJ=Conditional_Japanese_Starter | |
128 | HL=Hebrew_Letter | |
129 | -> uchar.h & UCharacter.LineBreak | |
130 | - 7 new scripts: | |
131 | sc ; Cakm ; Chakma | |
132 | sc ; Merc ; Meroitic_Cursive | |
133 | sc ; Mero ; Meroitic_Hieroglyphs | |
134 | sc ; Plrd ; Miao | |
135 | sc ; Shrd ; Sharada | |
136 | sc ; Sora ; Sora_Sompeng | |
137 | sc ; Takr ; Takri | |
138 | -> remove these from SyntheticPropertyValueAliases.txt | |
139 | -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() | |
140 | and in com.ibm.icu.dev.test.lang.TestUScript.java | |
141 | - 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html | |
142 | (added 2011-06-21) | |
143 | Khoj 322 Khojki | |
144 | Tirh 326 Tirhuta | |
145 | and another one added 2011-12-09 | |
146 | Hluw 080 Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs) | |
147 | -> uscript.h | |
148 | -> com.ibm.icu.lang.UScript | |
149 | find USCRIPT_([^ ]+) *= ([0-9]+),(.+) | |
150 | replace public static final int \1 = \2;\3 | |
151 | -> SyntheticPropertyValueAliases.txt | |
152 | -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() | |
153 | and in com.ibm.icu.dev.test.lang.TestUScript.java | |
154 | ||
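The Eclipse find/replace patterns in this section can be tried outside the IDE. The sketch below applies the same three regular expressions with Python's `re.sub`; the function names and the sample input lines (`UBLOCK_SAMPLE_BLOCK`, `USCRIPT_SAMPLE`, value 999) are made up for illustration and are not real ICU constants.

```python
import re


def to_block_id(line):
    """uchar.h UBLOCK_ enum line -> UCharacter.UnicodeBlock *_ID constant."""
    return re.sub(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)",
                  r"public static final int \1_ID = \2; \3", line)


def to_block_object(line):
    """uchar.h UBLOCK_ enum line -> UCharacter.UnicodeBlock object declaration."""
    return re.sub(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)",
                  r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
                  line)


def to_script_constant(line):
    """uscript.h USCRIPT_ enum line -> com.ibm.icu.lang.UScript constant."""
    return re.sub(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)",
                  r"public static final int \1 = \2;\3", line)


if __name__ == "__main__":
    # Hypothetical input lines in the shape the patterns expect.
    print(to_block_id("    UBLOCK_SAMPLE_BLOCK = 999, /*[E000]*/"))
    print(to_block_object("    UBLOCK_SAMPLE_BLOCK = 999, /*[E000]*/"))
    print(to_script_constant("    USCRIPT_SAMPLE = 999,/* Smpl */"))
```

Lines that do not match a pattern pass through unchanged, so the functions can be applied to a whole copied enum section before pasting it into the Java source.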
155 | * UnicodeData.txt changes | |
156 | - the last Unihan code point changes from U+9FCB to U+9FCC | |
157 | search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive) | |
158 | + do change gennames.c | |
159 | + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java | |
160 | ||
161 | * DerivedBidiClass.txt changes | |
162 | - 2 new default-AL blocks: | |
163 | # Arabic Extended-A: U+08A0 - U+08FF (was default-R) | |
164 | # Arabic Mathematical Alphabetic Symbols: | |
165 | # U+1EE00 - U+1EEFF (was default-R) | |
166 | - 2 new default-R blocks: | |
167 | # Meroitic Hieroglyphs: | |
168 | # U+10980 - U+1099F | |
169 | # Meroitic Cursive: U+109A0 - U+109FF | |
170 | -> should be picked up by the explicit data in the file | |
171 | ||
172 | * NameAliases.txt changes | |
173 | - from | |
174 | # Each line has two fields | |
175 | # First field: Code point | |
176 | # Second field: Alias | |
177 | - to | |
178 | # Each line has three fields, as described here: | |
179 | # | |
180 | # First field: Code point | |
181 | # Second field: Alias | |
182 | # Third field: Type | |
183 | - Also, the file previously allowed multiple aliases per code point, but only now does it | |
184 | actually provide more than one, even several of the same type. For example, | |
185 | FEFF;BYTE ORDER MARK;alternate | |
186 | FEFF;BOM;abbreviation | |
187 | FEFF;ZWNBSP;abbreviation | |
188 | - This breaks our gennames parser, unames.icu data structure, and API. | |
189 | Fix gennames to only pick up "correction" aliases. | |
190 | New ticket #8963 for further changes. | |
191 | ||
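The interim fix (pick up only "correction" aliases from the new three-field format) can be sketched as follows. This is an illustration in Python, not the gennames C code; the function names are assumptions, and the 01A2 sample line is included only to show a "correction" entry next to the FEFF lines quoted above.

```python
def parse_name_aliases(lines):
    """Yield (code_point, alias, alias_type) from NameAliases.txt-style lines."""
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) != 3:
            raise ValueError(f"expected three fields: {raw!r}")
        code_point, alias, alias_type = fields
        yield int(code_point, 16), alias, alias_type


def correction_aliases(lines):
    """Keep only aliases of type "correction", one per code point."""
    return {cp: alias
            for cp, alias, alias_type in parse_name_aliases(lines)
            if alias_type == "correction"}


if __name__ == "__main__":
    sample = [
        "# First field: Code point",
        "01A2;LATIN CAPITAL LETTER GHA;correction",
        "FEFF;BYTE ORDER MARK;alternate",
        "FEFF;BOM;abbreviation",
        "FEFF;ZWNBSP;abbreviation",
    ]
    print(correction_aliases(sample))
```

A dictionary keyed by code point keeps at most one alias per character, which matches the old single-alias data structure; supporting several aliases and types is the subject of ticket #8963.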
192 | * run genpname/preparse.pl (on Linux) | |
193 | + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname | |
194 | + make sure that data.h is writable | |
195 | + perl preparse.pl ~/svn.icu/trunk/src > out.txt | |
196 | + check that preparse.pl shows no errors and that the Info and Warning lines in out.txt look ok | |
197 | ||
198 | * build ICU (make install) | |
199 | so that the tools build can pick up the new definitions from the installed header files. | |
200 | * build Unicode tools (at least genpname) using CMake+make | |
201 | ||
202 | * run genpname | |
203 | (builds both pnames.icu and propname_data.h) | |
204 | - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in | |
205 | - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource | |
206 | ||
207 | * build ICU (make install) | |
208 | * build Unicode tools using CMake+make | |
209 | ||
210 | * update source/data/unidata/norm2/nfkc_cf.txt | |
211 | - follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt | |
212 | ||
213 | * update source/data/unidata/norm2/uts46.txt | |
214 | - download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt | |
215 | to ~/svn.icu/tools/trunk/src/unicode/py | |
216 | - adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008". | |
217 | - ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py | |
218 | - ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2 | |
219 | ||
220 | * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to | |
221 | sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) | |
222 | - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters | |
223 | - Unicode 6.0..6.1: U+2260, U+226E, U+226F | |
224 | - nothing new in 6.1, no test file to update | |
225 | ||
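The grep step above can also be done with a short script that filters out the ASCII entries automatically. The sketch assumes that data lines have semicolon-separated fields (code point or `XXXX..YYYY` range, then status) with `#` comments; the function name and the sample lines are illustrative, not copied from IdnaMappingTable.txt.

```python
def non_ascii_std3_valid(lines):
    """Return non-ASCII code points whose status is disallowed_STD3_valid."""
    result = []
    for raw in lines:
        data = raw.split("#", 1)[0]  # strip trailing comment
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        result.extend(cp for cp in range(first, last + 1) if cp > 0x7F)
    return result


if __name__ == "__main__":
    sample = [
        "0000..002C    ; disallowed_STD3_valid   # ASCII range, ignored",
        "00E0..00F6    ; valid                   # not of interest",
        "2260          ; disallowed_STD3_valid   # NOT EQUAL TO",
        "226E..226F    ; disallowed_STD3_valid   # NOT LESS-THAN..NOT GREATER-THAN",
    ]
    print([f"U+{cp:04X}" for cp in non_ascii_std3_valid(sample)])
```

Comparing the result for the old and new file versions shows whether uts46test.cpp and UTS46Test.java need new test characters.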
226 | * generate core properties data files | |
227 | - in initial bootstrapping, change the UCA version | |
228 | in source/data/unidata/FractionalUCA.txt to match the new Unicode version | |
229 | - ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld | |
230 | - rebuild ICU & tools | |
231 | + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR, | |
232 | check if the UCA version in FractionalUCA.txt matches the new Unicode version | |
233 | (see step above) | |
234 | - run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm: | |
235 | ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld | |
236 | - rebuild ICU & tools | |
237 | ||
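The version check mentioned in the bootstrapping step can be automated before running genrb. The sketch below assumes that FractionalUCA.txt declares its version in a line of the form `[UCA version = x.y.z]`; that line format and the helper names are assumptions, so adjust the pattern if the file differs.

```python
import re

# Assumed form of the version line in FractionalUCA.txt.
_UCA_VERSION = re.compile(r"\[UCA version\s*=\s*([0-9]+(?:\.[0-9]+)*)\]")


def uca_version(lines):
    """Return the first UCA version string found, or None."""
    for line in lines:
        match = _UCA_VERSION.search(line)
        if match:
            return match.group(1)
    return None


def _major_minor(version):
    parts = [int(part) for part in version.split(".")] + [0, 0]
    return tuple(parts[:2])


def versions_match(uca, unicode_version):
    """Compare major.minor only, so "6.1.0" matches "6.1"."""
    return _major_minor(uca) == _major_minor(unicode_version)


if __name__ == "__main__":
    sample = ["# Fractional UCA Table", "[UCA version = 6.1.0]"]
    found = uca_version(sample)
    print(found, versions_match(found, "6.1"))
```

Running such a check first gives a clearer message than the U_INVALID_FORMAT_ERROR that genrb reports on a mismatch.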
238 | * update Java data files | |
239 | - refresh just the UCD-related files, just to be safe | |
240 | - see (ICU4C)/source/data/icu4j-readme.txt | |
241 | - mkdir /tmp/icu4j | |
242 | - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install | |
243 | output: | |
244 | ... | |
245 | Unicode .icu files built to ./out/build/icudt49l | |
246 | mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b | |
247 | mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b | |
248 | echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt | |
249 | LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b | |
250 | mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b" | |
251 | jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/ | |
252 | mkdir -p /tmp/icu4j/main/shared/data | |
253 | cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data | |
254 | jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/ | |
255 | mkdir -p /tmp/icu4j/main/shared/data | |
256 | cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data | |
257 | make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data' | |
258 | - copy the big-endian Unicode data files to another location, | |
259 | separate from the other data files | |
260 | mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll | |
261 | mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr | |
262 | ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b | |
263 | ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu | |
264 | ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b | |
265 | ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll | |
266 | ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr | |
267 | - refresh ICU4J | |
268 | ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b | |
269 | ||
270 | * refresh Java test .txt files | |
271 | - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode | |
272 | ||
273 | * test ICU so far, fix test code where necessary | |
274 | - temporarily ignore collation issues that look like UCA/UCD mismatches, | |
275 | until UCA data is updated | |
276 | ||
277 | * UCA | |
278 | ||
279 | - get output from Mark's tools; look in | |
280 | http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. version>.txt | |
281 | - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt | |
282 | - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt | |
283 | (note that the target file name drops the underscore before "Rules") | |
284 | - update (ICU)/source/test/testdata/CollationTest_*.txt | |
285 | and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt | |
286 | with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt) | |
287 | - check test file diffs for previously commented-out, known-failing data lines; | |
288 | probably need to keep those commented out | |
289 | - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani | |
290 | - run makeuca.sh: | |
291 | ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld | |
292 | - rebuild ICU4C | |
293 | - refresh ICU4J collation data: | |
294 | (subset of instructions above for properties data refresh, except copies all coll/*) | |
295 | ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install | |
296 | ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll | |
297 | ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll | |
298 | ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b | |
299 | - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging) | |
300 | - note on intltest: if collate/UCAConformanceTest fails, then | |
301 | utility/MultithreadTest/TestCollators will fail as well; | |
302 | fix the conformance test before looking into the multi-thread test | |
303 | ||
304 | * When refreshing all of ICU4J data from ICU4C | |
305 | - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install | |
306 | - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data | |
307 | or | |
308 | - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install | |
309 | ||
310 | *** LayoutEngine script information | |
311 | ||
312 | (For details see the Unicode 5.2 change log below.) | |
313 | ||
314 | * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. | |
315 | This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp | |
316 | in the working directory. | |
317 | (It also generates ScriptRunData.cpp, which is no longer needed.) | |
318 | ||
319 | The generated files have a current copyright date and "@draft" statement. | |
320 | ||
321 | - diff current <icu>/source/layout files vs. generated ones | |
322 | ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout | |
323 | review and manually merge desired changes; | |
324 | fix gratuitous changes, incorrect @draft and missing aliases; | |
325 | Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc. | |
326 | - if you just copy the above files, then | |
327 | fix mixed line endings, review the diffs as above and restore changes to API tags etc.; | |
328 | manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h | |
329 | ||
330 | *** merge the Unicode update branches back onto the trunk | |
331 | - do not merge the icudata.jar and testdata.jar, | |
332 | instead rebuild them from merged & tested ICU4C | |
333 | ||
334 | ---------------------------------------------------------------------------- *** | |
335 | ||
336 | ICU 4.8 (no Unicode update, just new script codes) | |
337 | ||
338 | * 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html | |
339 | (added 2010-12-21) | |
340 | Afak 439 Afaka | |
341 | Jurc 510 Jurchen | |
342 | Mroo 199 Mro, Mru | |
343 | Nshu 499 Nüshu | |
344 | Shrd 319 Sharada, Śāradā | |
345 | Sora 398 Sora Sompeng | |
346 | Takr 321 Takri, Ṭākrī, Ṭāṅkrī | |
347 | Tang 520 Tangut | |
348 | Wole 480 Woleai | |
349 | -> uscript.h | |
350 | -> com.ibm.icu.lang.UScript | |
351 | find USCRIPT_([^ ]+) *= ([0-9]+),(.+) | |
352 | replace public static final int \1 = \2;\3 | |
353 | -> genpname/SyntheticPropertyValueAliases.txt | |
354 | -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() | |
355 | and in com.ibm.icu.dev.test.lang.TestUScript.java | |
356 | ||
357 | * run genpname/preparse.pl (on Linux) | |
358 | + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname | |
359 | + make sure that data.h is writable | |
360 | + perl preparse.pl ~/svn.icu/trunk/src > out.txt | |
361 | + check that preparse.pl shows no errors and that the Info and Warning lines in out.txt look ok | |
362 | ||
363 | * rebuild Unicode tools (at least genpname) using make | |
364 | - You might first need to "make install" ICU so that the tools build can pick | |
365 | up the new definitions from the installed header files. | |
366 | ||
367 | * run genpname | |
368 | (builds both pnames.icu and propname_data.h) | |
369 | - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in | |
370 | - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource | |
371 | - rebuild ICU & tools | |
372 | ||
373 | * run genprops | |
374 | - ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0 | |
375 | - ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0 | |
376 | - rebuild ICU & tools | |
377 | ||
378 | * update Java data files | |
379 | - refresh just the UCD-related files, just to be safe | |
380 | - see (ICU4C)/source/data/icu4j-readme.txt | |
381 | - mkdir /tmp/icu4j | |
382 | - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install | |
383 | - copy the big-endian Unicode data files to another location, | |
384 | separate from the other data files | |
385 | mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b | |
386 | ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b | |
387 | ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b | |
388 | - refresh ICU4J | |
389 | ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b | |
390 | ||
391 | * should have updated the layout engine script codes but forgot | |
392 | ||
393 | ---------------------------------------------------------------------------- *** | |
394 | ||
395 | Unicode 6.0 update | |
396 | ||
397 | *** related ICU Trac tickets | |
398 | ||
399 | 7264 Unicode 6.0 Update | |
400 | ||
401 | *** Unicode version numbers | |
402 | - makedata.mak | |
403 | - uchar.h | |
404 | (configure.in & configure have been modified to extract the version from uchar.h) | |
405 | - com.ibm.icu.util.VersionInfo | |
406 | ||
407 | *** data files & enums & parser code | |
408 | ||
409 | * file preparation | |
410 | ||
411 | ~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed | |
412 | - This now prepares both unidata and testdata files in respective output subfolders. | |
413 | ||
414 | * PropertyAliases.txt changes | |
415 | - new Script_Extensions property defined in the new ScriptExtensions.txt file | |
416 | but not listed in PropertyAliases.txt; reported to unicode.org; | |
417 | -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt | |
418 | scx; Script_Extensions | |
419 | -> uchar.h with new UProperty section | |
420 | -> com.ibm.icu.lang.UProperty, parallel with uchar.h | |
421 | ||
422 | * PropertyValueAliases.txt changes | |
423 | - 12 new block names: | |
424 | Alchemical_Symbols | |
425 | Bamum_Supplement | |
426 | Batak | |
427 | Brahmi | |
428 | CJK_Unified_Ideographs_Extension_D | |
429 | Emoticons | |
430 | Ethiopic_Extended_A | |
431 | Kana_Supplement | |
432 | Mandaic | |
433 | Miscellaneous_Symbols_And_Pictographs | |
434 | Playing_Cards | |
435 | Transport_And_Map_Symbols | |
436 | -> add to uchar.h | |
437 | -> add to UCharacter.UnicodeBlock | |
438 | Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) | |
439 | replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 | |
440 | - Joining_Group (jg) values: | |
441 | Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal, which becomes an alias | |
442 | -> uchar.h & UCharacter.JoiningGroup | |
443 | - 3 new scripts: | |
444 | sc ; Batk ; Batak | |
445 | sc ; Brah ; Brahmi | |
446 | sc ; Mand ; Mandaic | |
447 | -> remove these from SyntheticPropertyValueAliases.txt | |
448 | -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN | |
449 | -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() | |
450 | and in com.ibm.icu.dev.test.lang.TestUScript.java | |
451 | - 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html | |
452 | (added 2009-11-11..2010-07-18) | |
453 | Bass 259 Bassa Vah | |
454 | Dupl 755 Duployan shorthand | |
455 | Elba 226 Elbasan | |
456 | Gran 343 Grantha | |
457 | Kpel 436 Kpelle | |
458 | Loma 437 Loma | |
459 | Mend 438 Mende | |
460 | Merc 101 Meroitic Cursive | |
461 | Narb 106 Old North Arabian | |
462 | Nbat 159 Nabataean | |
463 | Palm 126 Palmyrene | |
464 | Sind 318 Sindhi | |
465 | Wara 262 Warang Citi | |
466 | -> uscript.h | |
467 | -> com.ibm.icu.lang.UScript | |
468 | find USCRIPT_([^ ]+) *= ([0-9]+),(.+) | |
469 | replace public static final int \1 = \2;\3 | |
470 | -> SyntheticPropertyValueAliases.txt | |
471 | -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() | |
472 | and in com.ibm.icu.dev.test.lang.TestUScript.java | |
473 | - ISO 15924 name change | |
474 | Mero 100 Meroitic Hieroglyphs (was Meroitic) | |
475 | -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC | |
476 | - property value alias added for Cham; Cham was already moved out of SyntheticPropertyValueAliases.txt | |
477 | ||
478 | * UnicodeData.txt changes | |
479 | - new CJK block: | |
480 | 2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;; | |
481 | 2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;; | |
482 | -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion | |
483 | ||
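The First/Last pair above is how UnicodeData.txt marks a range of characters with algorithmic names. A minimal sketch of collecting such ranges follows; it is an illustration in Python, not the gennames.c logic, and the function name `algorithmic_ranges` is an assumption.

```python
_FIRST = ", First>"
_LAST = ", Last>"


def algorithmic_ranges(lines):
    """Return (range_name, first_code_point, last_code_point) tuples."""
    ranges = []
    pending = None  # (name, first code point) while waiting for the Last line
    for line in lines:
        fields = line.split(";")
        if len(fields) < 2:
            continue
        code_point, name = int(fields[0], 16), fields[1]
        if name.startswith("<") and name.endswith(_FIRST):
            pending = (name[1:-len(_FIRST)], code_point)
        elif name.endswith(_LAST) and pending is not None:
            ranges.append((pending[0], pending[1], code_point))
            pending = None
    return ranges


if __name__ == "__main__":
    sample = [
        "2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;",
        "2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;",
    ]
    for name, first, last in algorithmic_ranges(sample):
        print(f"{name}: U+{first:04X}..U+{last:04X}")
```

Diffing the collected ranges between two UnicodeData.txt versions is a quick way to spot new or extended CJK blocks that need a gennames.c change.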
484 | * build Unicode tools using CMake+make | |
485 | ||
486 | * run genpname/preparse.pl (on Linux) | |
487 | + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname | |
488 | + make sure that data.h is writable | |
489 | + perl preparse.pl ~/svn.icu/trunk/src > out.txt | |
490 | + check that preparse.pl shows no errors and that the Info and Warning lines in out.txt look ok | |
491 | ||
492 | * rebuild Unicode tools (at least genpname) using make | |
493 | - You might first need to "make install" ICU so that the tools build can pick | |
494 | up the new definitions from the installed header files. | |
495 | ||
496 | * run genpname | |
497 | - ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in | |
498 | - rebuild ICU & tools | |
499 | ||
500 | * update source/data/unidata/norm2/nfkc_cf.txt | |
501 | - follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt | |
502 | ||
503 | * update source/data/unidata/norm2/uts46.txt | |
504 | - download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt | |
505 | to ~/svn.icu/tools/trunk/src/unicode/py | |
506 | - adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values | |
507 | - ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py | |
508 | - ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2 | |
509 | ||
510 | * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to | |
511 | sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) | |
512 | - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters | |
513 | - Unicode 6.0: U+2260, U+226E, U+226F | |
514 | ||
515 | * generate core properties data files | |
516 | - ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld | |
517 | - rebuild ICU & tools | |
518 | - run makeuca.sh so that genuca picks up the new nfc.nrm: | |
519 | ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld | |
520 | - rebuild ICU & tools | |
521 | ||
522 | * implement new Script_Extensions property (provisional) | |
523 | - parser & generator: genprops & uprops.icu | |
524 | - uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp | |
525 | - UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java | |
526 | ||
527 | * switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2 | |
528 | - (one-time change) | |
529 | - genbidi/gencase/genprops tools changes | |
530 | - re-run makeprops.sh (see above) | |
531 | - UCharacterProperty.java, UCharacterTypeIterator.java, | |
532 | UBiDiProps.java, UCaseProps.java, and several others with minor changes; | |
533 | UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java | |
534 | ||
535 | * update Java data files | |
536 | - refresh just the UCD-related files, just to be safe | |
537 | - see (ICU4C)/source/data/icu4j-readme.txt | |
538 | - mkdir /tmp/icu4j | |
539 | - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install | |
540 | output: | |
541 | ... | |
542 | Unicode .icu files built to ./out/build/icudt45l | |
543 | mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b | |
544 | echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt | |
545 | LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b | |
546 | jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b | |
547 | mkdir -p /tmp/icu4j/main/shared/data | |
548 | cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data | |
549 | - copy the big-endian Unicode data files to another location, | |
550 | separate from the other data files | |
551 | mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll | |
552 | mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr | |
553 | ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b | |
554 | ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu | |
555 | ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b | |
556 | ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll | |
557 | ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr | |
558 | - refresh ICU4J | |
559 | ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b | |
560 | ||
561 | * refresh Java test .txt files | |
562 | - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode | |
563 | ||
564 | * un-hardcode normalization skippable (NF*_Inert) test data | |
565 | - removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools | |
566 | ||
567 | * copy updated break iterator test files | |
568 | - now handled by early ucdcopy.py and | |
569 | copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata | |
570 | (old instructions: | |
571 | copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt | |
572 | to ~/svn.icu/trunk/src/source/test/testdata) | |
573 | - they are not used in ICU4J | |
574 | ||
575 | * UCA | |
576 | ||
577 | - get output from Mark's tools; look in | |
578 | http://www.unicode.org/~book/incoming/mark/uca6.0.0/ | |
579 | http://www.macchiato.com/unicode/utc/additional-uca-files | |
580 | http://www.unicode.org/Public/UCA/6.0.0/ | |
581 | http://www.unicode.org/~mdavis/uca/ | |
582 | - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt | |
583 | - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt | |
584 | - update Han-implicit ranges for new CJK extensions: | |
585 | swapCJK() in ucol.cpp & ImplicitCEGenerator.java | |
586 | - genuca: allow bytes 02 for U+FFFE, new merge-sort character; | |
587 | do not add it into invuca so that tailoring primary-after an ignorable works | |
588 | - genuca: permit space between [variable top] bytes | |
589 | - ucol.cpp: treat noncharacters like unassigned rather than ignorable | |
590 | - run makeuca.sh: | |
591 | ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld | |
592 | - rebuild ICU4C | |
593 | - refresh ICU4J collation data: | |
594 | (subset of instructions above for properties data refresh, except copies all coll/*) | |
595 | ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install | |
596 | mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll | |
597 | ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll | |
598 | ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b | |
599 | - update (ICU)/source/test/testdata/CollationTest_*.txt | |
600 | and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt | |
601 | with output from Mark's Unicode tools | |
602 | - run all tests with the *_SHORT.txt or the full files (the full ones have comments) | |
603 | - note on intltest: if collate/UCAConformanceTest fails, then | |
604 | utility/MultithreadTest/TestCollators will fail as well; | |
605 | fix the conformance test before looking into the multi-thread test | |
606 | ||
607 | * When refreshing all of ICU4J data from ICU4C | |
608 | - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install | |
609 | - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data | |
610 | or | |
611 | - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install | |
612 | ||
613 | *** LayoutEngine script information | |
614 | ||
615 | (For details see the Unicode 5.2 change log below.) | |
616 | ||
617 | * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h, | |
618 | ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates | |
619 | ScriptRunData.cpp, which is no longer needed.) | |
620 | ||
621 | The generated files have a current copyright date and "@draft" statement. | |
622 | ||
623 | * copy the above files into <icu>/source/layout, replacing the old files. | |
624 | * fix mixed line endings | |
625 | * review the diffs and fix incorrect @draft and missing aliases; | |
626 | Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc. | |
627 | * manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h | |
628 | ||
629 | ---------------------------------------------------------------------------- *** | |
630 | ||
631 | Unicode 5.2 update | |
632 | ||
633 | *** related ICU Trac tickets | |
634 | ||
635 | 7084 Unicode 5.2 | |
636 | ||
637 | 7167 verify collation bytes | |
638 | 7235 Java test NAME_ALIAS | |
639 | 7236 Java DerivedCoreProperties.txt test | |
640 | 7237 Java BidiTest.txt | |
641 | 7238 UTrie2 in core unidata | |
642 | 7239 test for tailoring gaps | |
643 | 7240 Java fix CollationMiscTest | |
644 | 7243 update layout engine for Unicode 5.2 | |
645 | ||
646 | *** Unicode version numbers | |
647 | - makedata.mak | |
648 | - uchar.h | |
649 | - configure.in & configure | |
650 | - update ucdVersion in gennames.c if an algorithmic range changes | |
651 | ||
652 | *** data files & enums & parser code | |
653 | ||
654 | * file preparation | |
655 | ||
656 | python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata | |
657 | - includes finding files regardless of version numbers, | |
658 | copying them, and performing the equivalent processing of the | |
659 | ucdstrip and ucdmerge tools on the desired set of files | |
660 | ||
661 | * notes on changes | |
662 | - PropertyAliases.txt | |
663 | moved from numeric to enumerated: | |
664 | ccc ; Canonical_Combining_Class | |
665 | new string properties: | |
666 | NFKC_CF ; NFKC_Casefold | |
667 | Name_Alias; Name_Alias | |
668 | new binary properties: | |
669 | Cased ; Cased | |
670 | CI ; Case_Ignorable | |
671 | CWCF ; Changes_When_Casefolded | |
672 | CWCM ; Changes_When_Casemapped | |
673 | CWKCF ; Changes_When_NFKC_Casefolded | |
674 | CWL ; Changes_When_Lowercased | |
675 | CWT ; Changes_When_Titlecased | |
676 | CWU ; Changes_When_Uppercased | |
677 | new CJK Unihan properties (not supported by ICU) | |
678 | - PropertyValueAliases.txt | |
679 | new block names | |
680 | new scripts | |
681 | one script code change: | |
682 | sc ; Qaai ; Inherited | |
683 | -> | |
684 | sc ; Zinh ; Inherited ; Qaai | |
685 | new Line_Break (lb) value: | |
686 | lb ; CP ; Close_Parenthesis | |
687 | new Joining_Group (jg) values: Farsi_Yeh, Nya | |
688 | other new values: | |
689 | ccc; 214; ATA ; Attached_Above | |
690 | - DerivedBidiClass.txt | |
691 | new default-R range: U+1E800 - U+1EFFF | |
692 | - UnicodeData.txt | |
693 | all of the ISO comments are gone | |
694 | new CJK block end: | |
695 | 9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last> | |
696 | new CJK block: | |
697 | 2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;; | |
698 | 2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;; | |
699 | ||
700 | * genpname | |
701 | - run preparse.pl | |
702 | + cd \svn\icuproj\icu\trunk\source\tools\genpname | |
703 | + make sure that data.h is writable | |
704 | + perl preparse.pl \svn\icuproj\icu\trunk > out.txt | |
705 | + preparse.pl complains with errors like the following: | |
706 | Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34. | |
707 | This is because ICU 4.0 had scripts from ISO 15924 which are now | |
708 | added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt | |
709 | and PropertyValueAliases.txt. | |
710 | -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt: | |
711 | Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt | |
712 | + preparse.pl complains with errors about block names missing from uchar.h; add them | |
713 | ||
714 | * uchar.h & uscript.h & uprops.h & uprops.c & genprops | |
715 | - new block & script values | |
716 | + 26 new blocks | |
717 | copy new blocks from Blocks.txt | |
718 | MS VC++ 2008 regular expression: | |
719 | find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$" | |
720 | replace with " UBLOCK_\3 = 172, /*[\1]*/" | |
721 | + several new script values already added in ICU 4.0 for ISO 15924 coverage | |
722 | (removed from SyntheticPropertyValueAliases.txt, see genpname notes above) | |
723 | + 3 new script values added for ISO 15924 and Unicode 5.2 coverage | |
724 | + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2) | |
725 | (added to SyntheticPropertyValueAliases.txt) | |
726 | - new Joining Group (JG) values: Farsi_Yeh, Nya | |
727 | - new Line_Break (lb) value: | |
728 | lb ; CP ; Close_Parenthesis | |
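The Visual Studio find/replace above can also be approximated in a script. The following is a minimal sketch, not the procedure actually used for this update: the function name `blocks_to_enum_lines`, the upper-casing of block names, and the automatic numbering are assumptions added for illustration. The original regex inserts the block name unchanged and the fixed placeholder value 172.

```python
import re

# Matches Blocks.txt data lines such as "A960..A97F; Hangul Jamo Extended-A".
BLOCK_LINE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+); ([A-Z].+)$")

def blocks_to_enum_lines(lines, first_value):
    """Turn Blocks.txt lines into UBLOCK_ enum lines, numbering from first_value."""
    out = []
    value = first_value
    for line in lines:
        m = BLOCK_LINE.match(line.strip())
        if m is None:
            continue  # skip comments and blank lines
        start, _end, name = m.groups()
        # Assumption: the enum name is the block name in upper case,
        # with spaces and hyphens replaced by underscores.
        ident = re.sub(r"[ -]", "_", name).upper()
        out.append("    UBLOCK_%s = %d, /*[%s]*/" % (ident, value, start))
        value += 1
    return out

# Arbitrary sample input; the starting value 172 is only a placeholder here.
sample = [
    "# comment line, ignored",
    "A960..A97F; Hangul Jamo Extended-A",
    "D7B0..D7FF; Hangul Jamo Extended-B",
]
enum_lines = blocks_to_enum_lines(sample, 172)
print("\n".join(enum_lines))
```

The generated lines still need review against uchar.h, since the real enum values depend on the order in which the new blocks are added.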
729 | ||
730 | * hardcoded Unihan range end/limit | |
731 | - Unihan range end moves from 9FC3 to 9FCB | |
732 | search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive) | |
733 | + do change gennames.c | |
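The search for the old range end and limit can be scripted when no IDE or grep is at hand. This is a rough sketch only: `find_hex_constants` and its default list of file suffixes are hypothetical choices, not part of the ICU tools.

```python
import re
from pathlib import Path

def find_hex_constants(root, pattern, suffixes=(".c", ".cpp", ".h", ".txt", ".java")):
    """Return (path, line number, stripped line) for each case-insensitive match."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Assumption: only these text file types are worth scanning.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

For this update the call would be something like `find_hex_constants("source", r"9FC[34]")`. Each hit still needs a manual decision, because some occurrences must not change (see the Unicode 4.1 notes on BOCU and GB 18030).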
734 | ||
735 | * Compare definitions of new binary properties with what we used to use | |
736 | in algorithms, to see if the definitions changed. | |
737 | - Verified that definitions for Cased and Case_Ignorable are unchanged. | |
738 | The gencase tool now parses the newly public Case_Ignorable values | |
739 | in case the definition changes in the future. | |
740 | ||
741 | * uchar.c & uprops.h & uprops.c & genprops | |
742 | - new numeric values that didn't exist in Unicode data before: | |
743 | 1/7, 1/9, 1/10, 3/10, 1/16, 3/16 | |
744 | the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5, | |
745 | therefore redesign the encoding of numeric types and values for formatVersion 6; | |
746 | design for simple numbers up to at least 144 ("one gross"), | |
747 | large values up to at least 10^20, | |
748 | and fractions with numerators -1..17 and denominators 1..16 | |
749 | to cover current and expected future values | |
750 | (e.g., more Han numeric values, Meroitic twelfths) | |
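The ranges quoted above can be sanity-checked against the new numeric values. The sketch below only tests the stated numerator and denominator ranges. It does not model the actual bit layout of uprops.icu formatVersion 6, and the function names are hypothetical.

```python
from fractions import Fraction

# Ranges quoted from the formatVersion 6 design notes above.
NUMERATORS = range(-1, 18)    # -1..17
DENOMINATORS = range(1, 17)   # 1..16

def fits_fraction_design(value):
    """True if the lowest-terms numerator and denominator are both in range."""
    return value.numerator in NUMERATORS and value.denominator in DENOMINATORS

def needs_new_format(value):
    """True if the denominator exceeds 9, the stated limit of formatVersion 5."""
    return value.denominator > 9

new_values = [Fraction(1, 7), Fraction(1, 9), Fraction(1, 10),
              Fraction(3, 10), Fraction(1, 16), Fraction(3, 16)]
for v in new_values:
    print(v, fits_fraction_design(v), needs_new_format(v))
```

A hypothetical Meroitic twelfth such as `Fraction(11, 12)` also passes `fits_fraction_design`, while `Fraction(1, 17)` does not.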
751 | ||
752 | * reimplement Hangul_Syllable_Type for new Jamo characters | |
753 | - the old code assumed that all Jamo characters are in the 11xx block | |
754 | - Unicode 5.2 fills holes there and adds new Jamo characters in | |
755 | A960..A97F; Hangul Jamo Extended-A | |
756 | and in | |
757 | D7B0..D7FF; Hangul Jamo Extended-B | |
758 | - Hangul_Syllable_Type can be trivially derived from a subset of | |
759 | Grapheme_Cluster_Break values | |
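The derivation mentioned above can be pictured as a small lookup on the short property value aliases. This is an illustrative sketch and not the ICU implementation: `GCB_TO_HST` and `hangul_syllable_type` are names made up here.

```python
# Grapheme_Cluster_Break values that correspond one-to-one to
# Hangul_Syllable_Type values; every other GCB value maps to NA.
GCB_TO_HST = {"L": "L", "V": "V", "T": "T", "LV": "LV", "LVT": "LVT"}

def hangul_syllable_type(gcb_value):
    """Derive the hst short alias from a GCB short alias."""
    return GCB_TO_HST.get(gcb_value, "NA")

for gcb in ("L", "LVT", "EX", "XX"):
    print(gcb, "->", hangul_syllable_type(gcb))
```

Because the mapping depends only on the Grapheme_Cluster_Break value, it needs no assumption about which blocks contain Jamo characters.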
760 | ||
761 | * build Unicode data source code for hardcoding core data | |
762 | C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data | |
763 | ||
764 | ICU data make path is \svn\icuproj\icu\trunk\source\data\ | |
765 | ICU root path is \svn\icuproj\icu\trunk | |
766 | Information: cannot find "ucmlocal.mk". Not building user-additional converter files. | |
767 | Information: cannot find "brklocal.mk". Not building user-additional break iterator files. | |
768 | Information: cannot find "reslocal.mk". Not building user-additional resource bundle files. | |
769 | Information: cannot find "collocal.mk". Not building user-additional resource bundle files. | |
770 | Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files. | |
771 | Information: cannot find "trnslocal.mk". Not building user-additional transliterator files. | |
772 | Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files. | |
773 | Information: cannot find "spreplocal.mk". Not building user-additional stringprep files. | |
774 | Creating data file for Unicode Property Names | |
775 | Creating data file for Unicode Character Properties | |
776 | Creating data file for Unicode Case Mapping Properties | |
777 | Creating data file for Unicode BiDi/Shaping Properties | |
778 | Creating data file for Unicode Normalization | |
779 | Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l" | |
780 | Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp" | |
781 | ||
782 | - copy the .c source files to C:\svn\icuproj\icu\trunk\source\common | |
783 | and rebuild the common library | |
784 | ||
785 | *** UCA | |
786 | ||
787 | - update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools) | |
788 | - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools | |
789 | - update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools | |
790 | [ Begin obsolete instructions: | |
791 | Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files. | |
792 | - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py | |
793 | on Windows: | |
794 | python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt | |
795 | python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt | |
796 | End obsolete instructions] | |
797 | - run all tests with the *_SHORT.txt or the full files (the full ones have comments) | |
798 | not just the *_STUB.txt files | |
799 | - note on intltest: if collate/UCAConformanceTest fails, then | |
800 | utility/MultithreadTest/TestCollators will fail as well; | |
801 | fix the conformance test before looking into the multi-thread test | |
802 | ||
803 | *** Implement Cased & Case_Ignorable properties | |
804 | - via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable() | |
805 | - Problem: These properties should be disjoint, but aren't | |
806 | - UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not | |
807 | - change ucase.icu to be able to store any combination of Cased and Case_Ignorable | |
808 | ||
809 | *** Implement Changes_When_Xyz properties | |
810 | - without stored data | |
811 | ||
812 | *** Implement Name_Alias property | |
813 | - add it as another name field in unames.icu | |
814 | - make it available via u_charName() and UCharNameChoice | |
815 | - consider it in u_charFromName() | |
816 | ||
817 | *** Break iterators | |
818 | ||
819 | * Update break iterator rules to new UAX versions and new property values | |
820 | * Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary | |
821 | ||
822 | *** new BidiTest file | |
823 | - review format and data | |
824 | - copy BidiTest.txt to source/test/testdata | |
825 | - write test code using this data | |
826 | - fix ICU code where it fails the conformance test | |
827 | ||
828 | *** Java | |
829 | - generally, find and update code corresponding to C/C++ | |
830 | - UCharacter.UnicodeBlock constants: | |
831 | a) add an _ID integer per new block, update COUNT | |
832 | b) add a class instance per new block | |
833 | Visual Studio regex: | |
834 | find UBLOCK_{[^ ]+} = [0-9]+, {/.+} | |
835 | replace with public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 | |
836 | - CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias() | |
837 | ||
838 | - port test changes to Java | |
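The Visual Studio regex in step b) translates directly into other regex dialects. Below is a sketch of the same substitution in Python, assuming the input lines look like the C enum lines in uchar.h. The helper name `enum_to_java` and the sample line are illustrative only.

```python
import re

# Same pattern as the Visual Studio regex, with {...} groups written as (...).
ENUM_LINE = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")

def enum_to_java(line):
    """Rewrite one C enum line as a UnicodeBlock constant declaration."""
    return ENUM_LINE.sub(
        r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
        line)

java_line = enum_to_java("    UBLOCK_SAMARITAN = 172, /*[0800]*/")
print(java_line)
```

Lines that do not match the pattern are returned unchanged, so the function can be applied to a whole copied enum section.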
839 | ||
840 | *** LayoutEngine script information | |
841 | ||
842 | (For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833) | |
843 | ||
844 | * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h, | |
845 | ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates | |
846 | ScriptRunData.cpp, which is no longer needed.) | |
847 | ||
848 | The generated files have a current copyright date and "@draft" statement. | |
849 | ||
850 | -> Eric Mader wrote in email on 20090930: | |
851 | "I think the tool has been modified to update @draft to @stable for | |
852 | older scripts and to add @draft for new scripts. | |
853 | (I worked with an intern on this last year.) | |
854 | You should check the output after you run it." | |
855 | ||
856 | * copy the above files into <icu>/source/layout, replacing the old files. | |
857 | * fix mixed line endings | |
858 | * review the diffs and fix incorrect @draft and missing aliases | |
859 | * manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h | |
860 | ||
861 | Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp | |
862 | and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...) | |
863 | ||
864 | -> Eric Mader wrote in email on 20090930: | |
865 | "This is just a matter of making sure that all the per-script tables have | |
866 | entries for any new scripts that were added. | |
867 | If any new Indic characters were added, then the class tables in | |
868 | IndicClassTables.cpp should be updated to reflect this. | |
869 | John Emmons should know how to do this if it's required." | |
870 | ||
871 | * rebuild the layout and layoutex libraries. | |
872 | ||
873 | *** Documentation | |
874 | - Update User Guide | |
875 | + Jamo_Short_Name, sfc->scf, binary property value aliases | |
876 | ||
877 | ---------------------------------------------------------------------------- *** | |
878 | ||
879 | Unicode 5.1 update | |
880 | ||
881 | *** related ICU Trac tickets | |
882 | ||
883 | 5696 Update to Unicode 5.1 | |
884 | ||
885 | *** Unicode version numbers | |
886 | - makedata.mak | |
887 | - uchar.h | |
888 | - configure.in & configure | |
889 | - update ucdVersion in gennames.c if an algorithmic range changes | |
890 | ||
891 | *** data files & enums & parser code | |
892 | ||
893 | * file preparation | |
894 | - ucdstrip: | |
895 | DerivedCoreProperties.txt | |
896 | DerivedNormalizationProps.txt | |
897 | NormalizationTest.txt | |
898 | PropList.txt | |
899 | Scripts.txt | |
900 | GraphemeBreakProperty.txt | |
901 | SentenceBreakProperty.txt | |
902 | WordBreakProperty.txt | |
903 | - ucdstrip and ucdmerge: | |
904 | EastAsianWidth.txt | |
905 | LineBreak.txt | |
906 | ||
907 | * my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers) | |
908 | copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\ | |
909 | copy 5.1.0\ucd\Blocks.txt ..\unidata\ | |
910 | copy 5.1.0\ucd\CaseFolding.txt ..\unidata\ | |
911 | copy 5.1.0\ucd\DerivedAge.txt ..\unidata\ | |
912 | copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\ | |
913 | copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\ | |
914 | copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\ | |
915 | copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\ | |
916 | copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\ | |
917 | copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\ | |
918 | copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\ | |
919 | copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\ | |
920 | copy 5.1.0\ucd\UnicodeData.txt ..\unidata\ | |
921 | ||
922 | ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt | |
923 | ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt | |
924 | ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt | |
925 | ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt | |
926 | ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt | |
927 | ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt | |
928 | ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt | |
929 | ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt | |
930 | ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt | |
931 | ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt | |
932 | ||
933 | * genpname | |
934 | - run preparse.pl | |
935 | + cd \svn\icuproj\icu\uni51\source\tools\genpname | |
936 | + make sure that data.h is writable | |
937 | + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt | |
938 | + preparse.pl complains with errors like the following: | |
939 | Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30. | |
940 | This is because ICU 3.8 had scripts from ISO 15924 which are now | |
941 | added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt | |
942 | and PropertyValueAliases.txt. | |
943 | -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt: | |
944 | Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii | |
945 | + PropertyValueAliases.txt now explicitly contains values for boolean properties: | |
946 | N/Y, No/Yes, F/T, False/True | |
947 | -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases. | |
948 | It will use further values from the file if present. | |
949 | ||
950 | * uchar.h & uscript.h & uprops.h & uprops.c & genprops | |
951 | - new block & script values | |
952 | + 17 new blocks | |
953 | + 11 new script values already added in ICU 3.8 for ISO 15924 coverage | |
954 | (removed from SyntheticPropertyValueAliases.txt) | |
955 | + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1) | |
956 | (added to SyntheticPropertyValueAliases.txt) | |
957 | - uprops.icu (uprops.h) only provides 7 bits for script codes. | |
958 | ICU 4.0 now has USCRIPT_CODE_LIMIT=130 script codes. | |
959 | No script code above 127 is assigned to any Unicode character yet, | |
960 | so ICU 4.0 uprops.icu does not store any | |
961 | script code values greater than 127. | |
962 | However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129 | |
963 | in a parallel bit field, and that overflows now. | |
964 | Also, future values >=128 would be incompatible anyway. | |
965 | uprops.h is modified to move around several of the bit fields | |
966 | in the properties vector words, and now uses 8 bits for the script code. | |
967 | Two other bit fields also grow to accommodate future growth: | |
968 | Block (current count: 172) grows from 8 to 9 bits, | |
969 | and Word_Break grows from 4 to 5 bits. | |
970 | - renamed property Simple_Case_Folding (sfc->scf) | |
971 | + nothing to be done: handled as normal alias | |
972 | - new property JSN Jamo_Short_Name | |
973 | + no new API: only contributes to the Name property | |
974 | - new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark | |
975 | - new Joining Group (JG) value: Burushashki_Yeh_Barree | |
976 | - new Sentence_Break (SB) values: | |
977 | SB ; CR ; CR | |
978 | SB ; EX ; Extend | |
979 | SB ; LF ; LF | |
980 | SB ; SC ; SContinue | |
981 | - new Word_Break (WB) values: | |
982 | WB ; CR ; CR | |
983 | WB ; Extend ; Extend | |
984 | WB ; LF ; LF | |
985 | WB ; MB ; MidNumLet | |
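The bit-field overflow described in the uprops.h note above comes down to simple arithmetic. The sketch below checks only the widths, using the counts quoted in the note. It assumes block values run from 0 to count-1 and does not reproduce the real layout of the properties vector words.

```python
def bits_needed(max_value):
    """Number of bits needed to store every integer in 0..max_value."""
    return max(1, max_value.bit_length())

USCRIPT_CODE_LIMIT = 130   # from the note above
BLOCK_COUNT = 172          # from the note above

max_script = USCRIPT_CODE_LIMIT - 1
script_bits = bits_needed(max_script)
block_bits = bits_needed(BLOCK_COUNT - 1)   # assumption: values are 0..171
fits_in_7_bits = max_script < (1 << 7)

print(max_script, script_bits, fits_in_7_bits)
print(BLOCK_COUNT - 1, block_bits)
```

The maximum script value 129 needs 8 bits, which is why the 7-bit field overflows. The block values still fit in 8 bits, so the move to 9 bits only leaves room for future growth.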
986 | ||
987 | * Further changes in the 2008-02-29 update: | |
988 | - Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP | |
989 | because they should not normally be invisible. | |
990 | - new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed) | |
991 | - new Grapheme_Cluster_Break (GCB) value: PP=Prepend | |
992 | - new Word_Break (WB) value: NL=Newline | |
993 | ||
994 | * hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison) | |
995 | - Unihan range end moves from 9FBB to 9FC3 | |
996 | search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive) | |
997 | + do change gennames.c | |
998 | ||
999 | * build Unicode data source code for hardcoding core data | |
1000 | C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data | |
1001 | ||
1002 | ICU data make path is \svn\icuproj\icu\uni51\source\data\ | |
1003 | ICU root path is \svn\icuproj\icu\uni51 | |
1004 | Information: cannot find "ucmlocal.mk". Not building user-additional converter files. | |
1005 | Information: cannot find "brklocal.mk". Not building user-additional break iterator files. | |
1006 | Information: cannot find "reslocal.mk". Not building user-additional resource bundle files. | |
1007 | Information: cannot find "collocal.mk". Not building user-additional resource bundle files. | |
1008 | Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files. | |
1009 | Information: cannot find "trnslocal.mk". Not building user-additional transliterator files. | |
1010 | Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files. | |
1011 | Creating data file for Unicode Character Properties | |
1012 | Creating data file for Unicode Case Mapping Properties | |
1013 | Creating data file for Unicode BiDi/Shaping Properties | |
1014 | Creating data file for Unicode Normalization | |
1015 | Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l" | |
1016 | Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp" | |
1017 | ||
1018 | - copy the .c source files to C:\svn\icuproj\icu\uni51\source\common | |
1019 | and rebuild the common library | |
1020 | ||
1021 | *** Break iterators | |
1022 | ||
1023 | * Update break iterator rules to new UAX versions and new property values | |
1024 | ||
1025 | *** UCA | |
1026 | ||
1027 | * update FractionalUCA.txt and UCARules.txt with new canonical closure | |
1028 | ||
1029 | *** Test suites | |
1030 | - Test that APIs using Unicode property value aliases (like UnicodeSet) | |
1031 | support all of the boolean values N/Y, No/Yes, F/T, False/True | |
1032 | -> TestBinaryValues() tests in both cintltst and intltest | |
1033 | ||
1034 | *** LayoutEngine script information | |
1035 | * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h, | |
1036 | ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates | |
1037 | ScriptRunData.cpp, which is no longer needed.) | |
1038 | ||
1039 | The generated files have a current copyright date and "@draft" statement. | |
1040 | ||
1041 | * copy the above files into <icu>/source/layout, replacing the old files. | |
1042 | ||
1043 | Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp | |
1044 | and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...) | |
1045 | ||
1046 | * rebuild the layout and layoutex libraries. | |
1047 | ||
1048 | *** Documentation | |
1049 | - Update User Guide | |
1050 | + Jamo_Short_Name, sfc->scf, binary property value aliases | |
1051 | ||
1052 | ---------------------------------------------------------------------------- *** | |
1053 | ||
1054 | Unicode 5.0 update | |
1055 | ||
1056 | *** related Jitterbugs | |
1057 | ||
1058 | 5084 RFE: Update to Unicode 5.0 | |
1059 | ||
1060 | *** data files & enums & parser code | |
1061 | ||
1062 | * file preparation | |
1063 | - ucdstrip: | |
1064 | DerivedCoreProperties.txt | |
1065 | DerivedNormalizationProps.txt | |
1066 | NormalizationTest.txt | |
1067 | PropList.txt | |
1068 | Scripts.txt | |
1069 | GraphemeBreakProperty.txt | |
1070 | SentenceBreakProperty.txt | |
1071 | WordBreakProperty.txt | |
1072 | - ucdstrip and ucdmerge: | |
1073 | EastAsianWidth.txt | |
1074 | LineBreak.txt | |
1075 | ||
1076 | * my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers) | |
1077 | copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\ | |
1078 | copy 5.0.0\ucd\Blocks.txt ..\unidata\ | |
1079 | copy 5.0.0\ucd\CaseFolding.txt ..\unidata\ | |
1080 | copy 5.0.0\ucd\DerivedAge.txt ..\unidata\ | |
1081 | copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\ | |
1082 | copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\ | |
1083 | copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\ | |
1084 | copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\ | |
1085 | copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\ | |
1086 | copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\ | |
1087 | copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\ | |
1088 | copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\ | |
1089 | copy 5.0.0\ucd\UnicodeData.txt ..\unidata\ | |
1090 | ||
1091 | ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt | |
1092 | ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt | |
1093 | ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt | |
1094 | ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt | |
1095 | ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt | |
1096 | ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt | |
1097 | ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt | |
1098 | ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt | |
1099 | ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt | |
1100 | ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt | |
1101 | ||
1102 | * update FractionalUCA.txt and UCARules.txt with new canonical closure | |
1103 | ||
1104 | * genpname | |
1105 | - run preparse.pl | |
1106 | + make sure that data.h is writable | |
1107 | + perl preparse.pl \cvs\oss\icu > out.txt | |
1108 | ||
1109 | * uchar.h & uscript.h & uprops.h & uprops.c & genprops | |
1110 | - new block & script values | |
1111 | + script values already added in ICU 3.6 because all of ISO 15924 is now covered | |
1112 | ||
1113 | * build Unicode data source code for hardcoding core data | |
1114 | C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data | |
1115 | ||
1116 | ICU data make path is \cvs\oss\icu\source\data\ | |
1117 | ICU root path is \cvs\oss\icu | |
1118 | Information: cannot find "ucmlocal.mk". Not building user-additional converter files. | |
1119 | [etc.] | |
1120 | Creating data file for Unicode Character Properties | |
1121 | Creating data file for Unicode Case Mapping Properties | |
1122 | Creating data file for Unicode BiDi/Shaping Properties | |
1123 | Creating data file for Unicode Normalization | |
1124 | Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l" | |
1125 | Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp" | |
1126 | ||
1127 | - copy the .c source files to C:\cvs\oss\icu\source\common | |
1128 | and rebuild the common library | |
1129 | ||
1130 | *** Unicode version numbers | |
1131 | - makedata.mak | |
1132 | - uchar.h | |
1133 | - configure.in | |
1134 | ||
1135 | *** LayoutEngine script information | |
1136 | * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h, | |
1137 | ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates | |
1138 | ScriptRunData.cpp, which is no longer needed.) | |
1139 | ||
1140 | The generated files have a current copyright date and "@draft" statement. | |
1141 | ||
1142 | * copy the above files into <icu>/source/layout, replacing the old files. | |
1143 | ||
1144 | Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp | |
1145 | and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...) | |
1146 | ||
1147 | * rebuild the layout and layoutex libraries. | |
1148 | ||
1149 | ---------------------------------------------------------------------------- *** | |
1150 | ||
1151 | Unicode 4.1 update | |
1152 | ||
1153 | *** related Jitterbugs | |
1154 | ||
1155 | 4332 RFE: Update to Unicode 4.1 | |
1156 | 4157 RBBI, TR29 4.1 updates | |
1157 | ||
1158 | *** data files & enums & parser code | |
1159 | ||
1160 | * file preparation | |
1161 | - ucdstrip: | |
1162 | DerivedCoreProperties.txt | |
1163 | DerivedNormalizationProps.txt | |
1164 | NormalizationTest.txt | |
1165 | GraphemeBreakProperty.txt | |
1166 | SentenceBreakProperty.txt | |
1167 | WordBreakProperty.txt | |
1168 | - ucdstrip and ucdmerge: | |
1169 | EastAsianWidth.txt | |
1170 | LineBreak.txt | |
1171 | ||
1172 | * add new files to the repository | |
1173 | GraphemeBreakProperty.txt | |
1174 | SentenceBreakProperty.txt | |
1175 | WordBreakProperty.txt | |
1176 | ||
1177 | * update FractionalUCA.txt and UCARules.txt with new canonical closure | |
1178 | ||
1179 | * genpname | |
1180 | - handle new enumerated properties in sub read_uchar | |
1181 | - run preparse.pl | |
1182 | ||
1183 | * uchar.h & uscript.h & uprops.h & uprops.c & genprops | |
1184 | - new binary properties | |
1185 | + Pattern_Syntax | |
1186 | + Pattern_White_Space | |
1187 | - new enumerated properties | |
1188 | + Grapheme_Cluster_Break | |
1189 | + Sentence_Break | |
1190 | + Word_Break | |
1191 | - new block & script & line break values | |
1192 | ||
1193 | * gencase | |
1194 | - case-ignorable changes | |
1195 | see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods | |
1196 | now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk | |
1197 | ||
1198 | *** Unicode version numbers | |
1199 | - makedata.mak | |
1200 | - uchar.h | |
1201 | - configure.in | |
1202 | ||
1203 | *** tests | |
1204 | - verify that u_charMirror() round-trips | |
1205 | - test all new properties and some new values of old properties | |
1206 | ||
1207 | *** other code | |
1208 | ||
1209 | * hardcoded Unihan range end/limit | |
1210 | - Unihan range end moves from 9FA5 to 9FBB | |
1211 | search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive) | |
1212 | + do not modify BOCU/BOCSU code because that would change the encoding | |
1213 | and break binary compatibility! | |
1214 | + similarly, do not change the GB 18030 range data (ucnvmbcs.c) | |
1215 | or NamePrepProfile.txt | |
1216 | + ignore trietest.c: test data is arbitrary | |
1217 | + ignore tstnorm.cpp: test optimization, not important | |
1218 | + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF | |
1219 | + do change line_th.txt and word_th.txt | |
1220 | by replacing hardcoded ranges with the new property values | |
1221 | + do change gennames.c | |
1222 | ||
1223 | source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6 | |
1224 | source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6 | |
1225 | source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5, | |
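The search described above (regex 9FA[56], case-insensitive) can be sketched in Python as follows. The helper name and the sample lines are hypothetical; in practice the input would be lines read from the source tree, and each hit still needs the manual change/ignore decision listed above.

```python
import re

# Matches the old Unihan range end (9FA5) and limit (9FA6), in any letter case.
UNIHAN_END_OR_LIMIT = re.compile(r"9FA[56]", re.IGNORECASE)


def find_hardcoded_unihan(lines):
    """Return (line_number, text) pairs for lines mentioning 9FA5 or 9FA6."""
    return [
        (number, text)
        for number, text in enumerate(lines, 1)
        if UNIHAN_END_OR_LIMIT.search(text)
    ]


# Hypothetical sample input modeled on the hits listed above.
sample = [
    "0x4e00, 0x9fa5,",
    r"\u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C",
    "0x9fa6 is the limit",
    "0x9fff",
]
hits = find_hardcoded_unihan(sample)
```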
1226 | ||
1227 | * case mappings | |
1228 | - compare new special casing context conditions with previous ones | |
1229 | see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods | |
1230 | ||
1231 | * genpname | |
1232 | - consider storing only the short name if it is the same as the long name | |
1233 | ||
1234 | *** other reviews | |
1235 | - UAX #29 changes (grapheme/word/sentence breaks) | |
1236 | - UAX #14 changes (line breaks) | |
1237 | - Pattern_Syntax & Pattern_White_Space | |
1238 | ||
1239 | ---------------------------------------------------------------------------- *** | |
1240 | ||
1241 | Unicode 4.0.1 update | |
1242 | ||
1243 | *** related Jitterbugs | |
1244 | ||
1245 | 3170 RFE: Update to Unicode 4.0.1 | |
1246 | 3171 Add new Unicode 4.0.1 properties | |
1247 | 3520 use Unicode 4.0.1 updates for break iteration | |
1248 | ||
1249 | *** data files & enums & parser code | |
1250 | ||
1251 | * file preparation | |
1252 | - ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt | |
1253 | - ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt | |
1254 | ||
1255 | * file fixes | |
1256 | - fix UnicodeData.txt general categories of Ethiopic digits Nd->No | |
1257 | according to PRI #26 | |
1258 | http://www.unicode.org/review/resolved-pri.html#pri26 | |
1259 | - undone again because no corrigendum in sight; | |
1260 | instead modified tests to not check consistency on this for Unicode 4.0.1 | |
1261 | ||
1262 | * ucdterms.txt | |
1263 | - update from http://www.unicode.org/copyright.html | |
1264 | formatted for plain text | |
1265 | ||
1266 | * uchar.h & uprops.h & uprops.c & genprops | |
1267 | - add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed | |
1268 | - add U_LB_INSEPARABLE due to a spelling fix | |
1269 | + put short name comment only on line with new constant | |
1270 | for genpname perl script parser | |
1271 | - new binary properties | |
1272 | + STerm | |
1273 | + Variation_Selector | |
1274 | ||
1275 | * genpname | |
1276 | - fix genpname perl script so that it doesn't choke on more than 2 names per property value | |
1277 | - perl script: correctly calculate the maximum number of fields per row | |
1278 | ||
1279 | * uscript.h | |
1280 | - new script code Hrkt=Katakana_Or_Hiragana | |
1281 | ||
1282 | * gennorm.c: track changes in DerivedNormalizationProps.txt | |
1283 | - "FNC" -> "FC_NFKC" | |
1284 | - single field "NFD_NO" -> two fields "NFD_QC; N" etc. | |
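The quick-check field change above can be sketched in Python as follows. This is not the gennorm.c parser: the function name is hypothetical and the sample lines in the usage note are assumed for illustration, based only on the "NFD_NO" -> "NFD_QC; N" description above.

```python
def parse_quick_check(line):
    """Parse one quick-check line in either the old single-field form
    ("NFD_NO") or the new two-field form ("NFD_QC; N").

    Returns (code_range, property, value) or None for other lines.
    """
    data = line.split("#", 1)[0].strip()  # drop trailing comment
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    code_range = fields[0]
    if len(fields) == 2 and "_" in fields[1]:
        # Old form: the form and the value are fused, e.g. NFD_NO, NFKC_MAYBE.
        form, value = fields[1].rsplit("_", 1)
        if value in ("NO", "MAYBE"):
            return code_range, form + "_QC", value[0]
        return None
    if len(fields) == 3 and fields[1].endswith("_QC"):
        # New form: property name and value are separate fields.
        return code_range, fields[1], fields[2]
    return None
```

With assumed sample lines, parse_quick_check("00C0..00C5 ; NFD_NO") and parse_quick_check("00C0..00C5 ; NFD_QC; N") both return the same tuple, so downstream code can stay independent of the file version.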
1285 | ||
1286 | * genprops/props2.c: track changes in DerivedNumericValues.txt | |
1287 | - changed from 3 columns to 2, dropping the numeric type | |
1288 | + assume that the type is always numeric for Han characters, | |
1289 | and that only those are added in addition to what UnicodeData.txt lists | |
1290 | ||
1291 | *** Unicode version numbers | |
1292 | - makedata.mak | |
1293 | - uchar.h | |
1294 | - configure.in | |
1295 | ||
1296 | *** tests | |
1297 | - update test of default bidi classes according to PRI #28 | |
1298 | /tsutil/cucdtst/TestUnicodeData | |
1299 | http://www.unicode.org/review/resolved-pri.html#pri28 | |
1300 | - bidi tests: change exemplar character for ES depending on Unicode version | |
1301 | - change hardcoded expected property values where they change | |
1302 | ||
1303 | *** other code | |
1304 | ||
1305 | * name matching | |
1306 | - read UCD.html | |
1307 | ||
1308 | * scripts | |
1309 | - use new Hrkt=Katakana_Or_Hiragana | |
1310 | ||
1311 | * ZWJ & ZWNJ | |
1312 | - are now part of combining character sequences | |
1313 | - break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ |