1 * Copyright (C) 2004-2015, International Business Machines
2 * Corporation and others. All Rights Reserved.
4 * file name: changes.txt
6 * tab size: 8 (not used)
9 * created on: 2004may06
10 * created by: Markus W. Scherer
12 * change log for Unicode updates
14 ---------------------------------------------------------------------------- ***
16 * New ISO 15924 script codes
18 Starting with ICU 55, we do not add UScriptCode constants any more until their scripts
19 are encoded in Unicode, or can be assumed to be encoded in the next Unicode version.
20 Script enum constant names should follow the Unicode script property value aliases,
21 which are assigned only when the scripts are encoded.
22 When we add script constants early and guess the name wrong, we end up with confusing
23 enum constants, and we have sometimes had to add aliases.
25 Exception: Script codes like Latf and Aran that are not subject to separate encoding
26 can be added at any time.
28 Script codes not yet in ICU: http://www.unicode.org/iso15924/codechanges.html
30 Added 2014-11-15, see http://bugs.icu-project.org/trac/ticket/11561
32 - Aran 161 Arabic (Nastaliq variant)
33 - Kitl 505 Khitan large script
34 - Kits 288 Khitan small script
38 Aran can be added as USCRIPT_ARABIC_NASTALIQ at any time.
40 Adlam, Marchen, and Osage are expected to go into Unicode 9;
41 we should assign Unicode script property value aliases for them
42 soon after Unicode 8 is released, and add them in ICU 56.
44 Khitan scripts will be encoded later.
46 ---------------------------------------------------------------------------- ***
48 Unicode 8.0 update for ICU ??
52 - U+1DE9 COMBINING LATIN SMALL LETTER BETA
53 sorts with Greek Beta, should sort with Latin B?
55 No, it was deliberate:
57 03B2;GREEK SMALL LETTER BETA;Ll;;;;0392;;0392
58 1D5D;MODIFIER LETTER SMALL BETA;Lm;<super> 03B2;;;;;
59 1DE9;COMBINING LATIN SMALL LETTER BETA;Mn;<sort> 03B2;;;;;
60 1D66;GREEK SUBSCRIPT SMALL LETTER BETA;Ll;<sub> 03B2;;;;;
62 Note the relationship to U+1D5D.
64 When the disunified *Latin* beta base letter shows up in Unicode 8.0:
66 U+A7B4 LATIN CAPITAL LETTER BETA
67 U+A7B5 LATIN SMALL LETTER BETA
69 we could re-evaluate what U+1DE9 equates to, for collation,
70 but currently there isn’t any Latin beta to serve that function.
73 - ICU_ROOT=~/svn.icu/trunk
74 - ICU_SRC_DIR=$ICU_ROOT/src
75 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR
76 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
79 ---------------------------------------------------------------------------- ***
81 Unicode 7.0 update for ICU 54
83 http://www.unicode.org/review/pri271/ -- beta review
84 http://www.unicode.org/reports/uax-proposed-updates.html
85 http://www.unicode.org/versions/beta-7.0.0.html#notable_issues
86 http://www.unicode.org/reports/tr44/tr44-13.html
90 - ticket 10821: Unicode 7.0, UCA 7.0
91 - C++ branches/markus/uni70 at r35584 from trunk at r35580
92 - Java branches/markus/uni70 at r35587 from trunk at r35545
96 - ticket 7195: UCA 7.0 CLDR root collation
97 - branches/markus/uni70 at r10062 from trunk at r10061
99 - ticket 6762: script metadata for Unicode 7.0 new scripts
101 *** Unicode version numbers
104 - com.ibm.icu.util.VersionInfo
105 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
107 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
108 so that the makefiles see the new version number.
110 *** data files & enums & parser code
114 - download UCD & IDNA files
115 - make sure that the Unicode data folder passed into preparseucd.py
116 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
117 - only for manual diffs: remove version suffixes from the file names
118 ~/unidata/uni70/20140403$ ../../desuffixucd.py .
119 (see https://sites.google.com/site/unicodetools/inputdata)
120 - only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
121 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src
122 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
123 - Restore TODO diffs in source/data/unidata/UCARules.txt
125 meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt
126 - Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt
128 - also: from http://unicode.org/Public/security/7.0.0/ download new
129 confusables.txt & confusablesWholeScript.txt
130 and copy to $ICU_ROOT/src/source/data/unidata/
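
The "remove version suffixes" step can be pictured with a small sketch. This is not the code of desuffixucd.py; the suffix pattern and the sample file names below are assumptions for illustration only.

```python
import re

# Hypothetical pattern: a "-<version>" suffix such as "-7.0.0" or "-7.0.0d1"
# sitting directly before the file extension. The real desuffixucd.py may
# use different rules.
_VERSION_SUFFIX = re.compile(r"-\d+\.\d+\.\d+(?:d\d+)?(?=\.[A-Za-z0-9]+$)")


def strip_version_suffix(file_name):
    """Return file_name without a version suffix before the extension."""
    return _VERSION_SUFFIX.sub("", file_name)


# Illustrative names only; files without a suffix are left unchanged.
for name in ("UnicodeData-7.0.0d1.txt", "IdnaMappingTable.txt", "Unihan-7.0.0.zip"):
    print(name, "->", strip_version_suffix(name))
```

Stripping the suffix makes file names stable across drafts, which is what allows a plain directory diff between two downloads.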
132 * initial preparseucd.py changes
133 - remove new Unicode scripts from the
134 only-in-ISO-15924 list according to the error message:
135 ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass',
136 'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm',
137 'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj']
138 from _scripts_only_in_iso15924
139 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
140 and in com.ibm.icu.dev.test.lang.TestUScript.java
141 - NamesList.txt now has a heading with a non-ASCII character
142 + keep ppucd.txt in platform charset, rather than changing tool/test parsers
143 + escape non-ASCII characters in heading comments
144 - the tool takes the Unicode copyright line from PropertyAliases.txt, whose copyright year is currently still 2013
145 + get the copyright from the first file whose copyright line contains the current year
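
The ValueError quoted above comes from a consistency check between the scripts that Unicode now defines and the ISO-only list. A minimal sketch of such a check follows, assuming a simple list/set representation; the function name and the sample data are hypothetical, and preparseucd.py itself may be structured differently.

```python
# Hypothetical stand-in for the tool's list of script codes that exist in
# ISO 15924 but are not yet Unicode Script property values.
scripts_only_in_iso15924 = ["Hmng", "Lina", "Kitl", "Kits", "Mani"]


def check_iso_only_scripts(unicode_scripts, iso_only):
    """Raise ValueError naming the codes that Unicode now defines itself."""
    stale = [code for code in iso_only if code in unicode_scripts]
    if stale:
        raise ValueError(
            "remove %s from _scripts_only_in_iso15924" % (stale,))


# A few script codes that Unicode 7.0 adds (taken from the notes above).
unicode_7_scripts = {"Hmng", "Lina", "Mani", "Aghb", "Bass"}
try:
    check_iso_only_scripts(unicode_7_scripts, scripts_only_in_iso15924)
except ValueError as error:
    print(error)
```

Failing loudly here is what forces the list to be trimmed at each Unicode update instead of silently carrying duplicates.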
147 * PropertyValueAliases.txt changes
148 - 32 new Block (blk) values:
149 blk; Bassa_Vah ; Bassa_Vah
150 blk; Caucasian_Albanian ; Caucasian_Albanian
151 blk; Coptic_Epact_Numbers ; Coptic_Epact_Numbers
152 blk; Diacriticals_Ext ; Combining_Diacritical_Marks_Extended
153 blk; Duployan ; Duployan
154 blk; Elbasan ; Elbasan
155 blk; Geometric_Shapes_Ext ; Geometric_Shapes_Extended
156 blk; Grantha ; Grantha
158 blk; Khudawadi ; Khudawadi
159 blk; Latin_Ext_E ; Latin_Extended_E
160 blk; Linear_A ; Linear_A
161 blk; Mahajani ; Mahajani
162 blk; Manichaean ; Manichaean
163 blk; Mende_Kikakui ; Mende_Kikakui
166 blk; Myanmar_Ext_B ; Myanmar_Extended_B
167 blk; Nabataean ; Nabataean
168 blk; Old_North_Arabian ; Old_North_Arabian
169 blk; Old_Permic ; Old_Permic
170 blk; Ornamental_Dingbats ; Ornamental_Dingbats
171 blk; Pahawh_Hmong ; Pahawh_Hmong
172 blk; Palmyrene ; Palmyrene
173 blk; Pau_Cin_Hau ; Pau_Cin_Hau
174 blk; Psalter_Pahlavi ; Psalter_Pahlavi
175 blk; Shorthand_Format_Controls ; Shorthand_Format_Controls
176 blk; Siddham ; Siddham
177 blk; Sinhala_Archaic_Numbers ; Sinhala_Archaic_Numbers
178 blk; Sup_Arrows_C ; Supplemental_Arrows_C
179 blk; Tirhuta ; Tirhuta
180 blk; Warang_Citi ; Warang_Citi
182 use long property names for enum constants
183 -> add to UCharacter.UnicodeBlock IDs
184 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
185 replace public static final int \1_ID = \2; \3
186 -> add to UCharacter.UnicodeBlock objects
187 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
188 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
189 - 28 new Joining_Group (jg) values:
190 jg ; Manichaean_Aleph ; Manichaean_Aleph
191 jg ; Manichaean_Ayin ; Manichaean_Ayin
192 jg ; Manichaean_Beth ; Manichaean_Beth
193 jg ; Manichaean_Daleth ; Manichaean_Daleth
194 jg ; Manichaean_Dhamedh ; Manichaean_Dhamedh
195 jg ; Manichaean_Five ; Manichaean_Five
196 jg ; Manichaean_Gimel ; Manichaean_Gimel
197 jg ; Manichaean_Heth ; Manichaean_Heth
198 jg ; Manichaean_Hundred ; Manichaean_Hundred
199 jg ; Manichaean_Kaph ; Manichaean_Kaph
200 jg ; Manichaean_Lamedh ; Manichaean_Lamedh
201 jg ; Manichaean_Mem ; Manichaean_Mem
202 jg ; Manichaean_Nun ; Manichaean_Nun
203 jg ; Manichaean_One ; Manichaean_One
204 jg ; Manichaean_Pe ; Manichaean_Pe
205 jg ; Manichaean_Qoph ; Manichaean_Qoph
206 jg ; Manichaean_Resh ; Manichaean_Resh
207 jg ; Manichaean_Sadhe ; Manichaean_Sadhe
208 jg ; Manichaean_Samekh ; Manichaean_Samekh
209 jg ; Manichaean_Taw ; Manichaean_Taw
210 jg ; Manichaean_Ten ; Manichaean_Ten
211 jg ; Manichaean_Teth ; Manichaean_Teth
212 jg ; Manichaean_Thamedh ; Manichaean_Thamedh
213 jg ; Manichaean_Twenty ; Manichaean_Twenty
214 jg ; Manichaean_Waw ; Manichaean_Waw
215 jg ; Manichaean_Yodh ; Manichaean_Yodh
216 jg ; Manichaean_Zayin ; Manichaean_Zayin
217 jg ; Straight_Waw ; Straight_Waw
218 -> uchar.h & UCharacter.JoiningGroup
219 - 23 new Script (sc) values:
220 sc ; Aghb ; Caucasian_Albanian
221 sc ; Bass ; Bassa_Vah
225 sc ; Hmng ; Pahawh_Hmong
229 sc ; Mani ; Manichaean
230 sc ; Mend ; Mende_Kikakui
233 sc ; Narb ; Old_North_Arabian
234 sc ; Nbat ; Nabataean
235 sc ; Palm ; Palmyrene
236 sc ; Pauc ; Pau_Cin_Hau
237 sc ; Perm ; Old_Permic
238 sc ; Phlp ; Psalter_Pahlavi
240 sc ; Sind ; Khudawadi
242 sc ; Wara ; Warang_Citi
243 -> uscript.h (many were added before)
244 comment "Mende Kikakui" for USCRIPT_MENDE
245 add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias
246 -> com.ibm.icu.lang.UScript
247 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
248 replace public static final int \1 = \2; \3
249 - 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
258 -> uscript.h (some overlap with additions from Unicode)
259 -> com.ibm.icu.lang.UScript
260 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
261 replace public static final int \1 = \2; \3
262 -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924
263 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
264 and in com.ibm.icu.dev.test.lang.TestUScript.java
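
The USCRIPT find/replace pair above can also be applied outside Eclipse. The sketch below runs the same pattern through Python's re module; the input line and its numeric value are placeholders, not checked against uscript.h.

```python
import re

# The find/replace pair from the notes, written as a Python regex.
FIND = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
REPLACE = r"public static final int \1 = \2; \3"


def to_java_constant(c_enum_line):
    """Rewrite one uscript.h-style enum line as a Java int constant."""
    return FIND.sub(REPLACE, c_enum_line)


# Illustrative input; the value 145 is a placeholder.
print(to_java_constant("USCRIPT_KHUDAWADI = 145,/* Sind */"))
```

Note that the pattern requires at least one character after the comma, so an enum line without a trailing comment is left unchanged.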
266 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
267 (not strictly necessary for NOT_ENCODED scripts)
268 ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
270 * generate normalization data files
272 - export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
273 - SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
274 - UNIDATA=$ICU_SRC_DIR/source/data/unidata
275 - bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
276 - bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
277 - bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
278 - bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
279 - bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
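
The four .nrm commands above follow one pattern: each output is built from nfc.txt plus the files layered on top of it. A sketch that generates the same argument lists from a table (the helper name is hypothetical; only the flags shown in the commands above are used, and the --csource variant is left out):

```python
# Output file -> input files, as listed in the steps above.
NORM2_OUTPUTS = [
    ("nfc.nrm", ["nfc.txt"]),
    ("nfkc.nrm", ["nfc.txt", "nfkc.txt"]),
    ("nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"]),
    ("uts46.nrm", ["nfc.txt", "uts46.txt"]),
]


def gennorm2_commands(src_data_in, unidata):
    """Return one gennorm2 argument list per .nrm file."""
    return [
        ["bin/gennorm2", "-o", "%s/%s" % (src_data_in, out),
         "-s", "%s/norm2" % unidata] + inputs
        for out, inputs in NORM2_OUTPUTS
    ]


# Print the commands with the shell variables left unexpanded.
for argv in gennorm2_commands("$SRC_DATA_IN", "$UNIDATA"):
    print(" ".join(argv))
```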
281 * build ICU (make install)
282 so that the tools build can pick up the new definitions from the installed header files.
284 ~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
286 * build Unicode tools using CMake+make
288 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
290 # Location (--prefix) of where ICU was installed.
291 set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst)
292 # Location of the ICU source tree.
293 set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src)
295 ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
296 ~/svn.icutools/trunk/dbg/unicode/c$ make
299 - new code point range for Joining_Group values: 10AC0..10AFF Manichaean
300 + add second array of Joining_Group values for at most 10800..10FFF
301 icutools: unicode/c/genprops/bidipropsbuilder.cpp
302 icu: source/common/ubidi_props.h/.c/_data.h
303 icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java
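
The idea of a second Joining_Group array can be sketched as a lookup that tries two contiguous ranges. The range bounds of the first array, the stored values, and all names below are assumptions for illustration; the real tables are generated by genprops and live in the files listed above.

```python
# Hypothetical ranges and values for illustration only. 0 stands for
# "no joining group".
JG_START, JG_LIMIT = 0x0620, 0x08A0          # assumed first range
JG_START2, JG_LIMIT2 = 0x10AC0, 0x10B00      # Manichaean block 10AC0..10AFF

jg_array = [0] * (JG_LIMIT - JG_START)
jg_array2 = [0] * (JG_LIMIT2 - JG_START2)
jg_array2[0x10AC0 - JG_START2] = 1           # placeholder value


def joining_group(code_point):
    """Look up a code point in whichever array covers it, else return 0."""
    if JG_START <= code_point < JG_LIMIT:
        return jg_array[code_point - JG_START]
    if JG_START2 <= code_point < JG_LIMIT2:
        return jg_array2[code_point - JG_START2]
    return 0


print(joining_group(0x10AC0), joining_group(0x0041))
```

Two small arrays avoid one large, mostly empty table spanning everything between the two ranges.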
305 * generate core properties data files
306 - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
307 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR
308 - rebuild ICU (make install) & tools
309 - run genuca again (see step above) so that it picks up the new nfc.nrm
310 - rebuild ICU (make install) & tools
312 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
313 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
314 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
315 - Unicode 6.0..7.0: U+2260, U+226E, U+226F
316 - nothing new in 7.0, no test file to update
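
The grep step above can be approximated by a short script that keeps only non-ASCII code points with status disallowed_STD3_valid. The sample lines are abridged and only imitate the layout of IdnaMappingTable.txt; treat them as illustrative input, not as a copy of the file.

```python
def non_ascii_std3_valid(lines):
    """Yield code points above U+007F whose status is disallowed_STD3_valid."""
    for line in lines:
        data = line.split("#", 1)[0]          # drop trailing comment
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        for code_point in range(first, last + 1):
            if code_point > 0x7F:
                yield code_point


# Sample lines in the style of IdnaMappingTable.txt (abridged, illustrative).
sample = [
    "0000..002C    ; disallowed_STD3_valid                  # 1.1",
    "2260          ; disallowed_STD3_valid                  # 1.1  NOT EQUAL TO",
    "2261..226D    ; valid                                  # 1.1",
    "226E..226F    ; disallowed_STD3_valid                  # 1.1",
]
print(["U+%04X" % cp for cp in non_ascii_std3_valid(sample)])
```

Any code point reported here that is not already covered by the tests would be a candidate for a new test case.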
318 * run & fix ICU4C tests
320 * update Java data files
321 - refresh just the UCD-related files, just to be safe
322 - see (ICU4C)/source/data/icu4j-readme.txt
324 - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
327 Unicode .icu files built to ./out/build/icudt53l
328 echo timestamp > uni-core-data
329 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b
330 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b
331 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
332 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b
333 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b"
334 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/
335 mkdir -p /tmp/icu4j/main/shared/data
336 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
337 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/
338 mkdir -p /tmp/icu4j/main/shared/data
339 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
340 make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data'
341 - copy the big-endian Unicode data files to another location,
342 separate from the other data files
344 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
345 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
346 cd ~/svn.icu/uni70/dbg/data/out/icu4j
347 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
348 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
349 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
350 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
351 cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
352 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
354 ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
356 * update CollationFCD.java
357 + copy & paste the initializers of lcccIndex[] etc. from
358 ICU4C/source/i18n/collationfcd.cpp to
359 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
361 * refresh Java test .txt files
362 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
363 cd $ICU_SRC_DIR/source/data/unidata
364 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
365 cd ../../test/testdata
366 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
367 cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
371 - download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/
372 - run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata)
373 - update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/
374 - run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA
375 - output files are in ~/svn.unitools/Generated/uca/7.0.0/
376 - review data; compare files, use blankweights.sed or similar
377 ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt
378 - cd ~/svn.unitools/Generated/uca/7.0.0/
379 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
380 cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
381 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
382 (note removing the underscore before "Rules")
383 cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
384 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
385 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
386 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
387 cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
388 cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
389 cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
390 - run genuca, see command line above
392 - refresh ICU4J collation data:
393 (subset of instructions above for properties data refresh, except copies all coll/*)
395 ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
396 ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
397 ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
398 ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
399 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
400 - note on intltest: if collate/UCAConformanceTest fails, then
401 utility/MultithreadTest/TestCollators will fail as well;
402 fix the conformance test before looking into the multi-thread test
403 - copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors
404 - copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch
405 ~/svn.unitools$ cp Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/
407 * When refreshing all of ICU4J data from ICU4C
408 - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
409 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
411 - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
413 * run & fix ICU4J tests
415 *** LayoutEngine script information
417 (For details see the Unicode 5.2 change log below.)
419 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
420 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
421 in the working directory.
422 (It also generates ScriptRunData.cpp, which is no longer needed.)
424 The generated files have a current copyright date and "@stable" statement.
425 ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java
426 for "born stable" Unicode API constants, and to stop parsing ICU version numbers
427 which may not contain dots any more.
429 - diff current <icu>/source/layout files vs. generated ones
430 ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
431 review and manually merge desired changes;
432 fix gratuitous changes, incorrect @draft/@stable and missing aliases;
433 Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
434 - if you just copy the above files, then
435 fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
436 manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
439 - send notice to icu-design about new born-@stable API (enum constants etc.)
441 *** merge the Unicode update branches back onto the trunk
442 - do not merge the icudata.jar and testdata.jar,
443 instead rebuild them from merged & tested ICU4C
445 ---------------------------------------------------------------------------- ***
447 Unicode 6.3 update for ICU 52
449 http://www.unicode.org/review/pri249/ -- beta review
450 http://www.unicode.org/reports/uax-proposed-updates.html
451 http://www.unicode.org/versions/beta-6.3.0.html#notable_issues
452 http://www.unicode.org/reports/tr44/tr44-11.html
456 - ticket 10128: update ICU to Unicode 6.3 beta
457 - ticket 10168: update ICU to Unicode 6.3 final
458 - C++ branches/markus/uni63 at r33552 from trunk at r33551
459 - Java branches/markus/uni63 at r33550 from trunk at r33553
461 - ticket 10142: implement Unicode 6.3 bidi algorithm additions
463 *** Unicode version numbers
466 (configure.in & configure: have been modified to extract the version from uchar.h)
467 - com.ibm.icu.util.VersionInfo
468 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
470 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
471 so that the makefiles see the new version number.
473 *** data files & enums & parser code
477 - download UCD, UCA & IDNA files
478 - make sure that the Unicode data folder passed into preparseucd.py
479 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
480 - modify preparseucd.py:
481 parse new file BidiBrackets.txt
482 with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type
483 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src
484 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
485 - Check test file diffs for previously commented-out, known-failing data lines;
486 probably need to keep those commented out.
488 * PropertyAliases.txt changes
489 - 1 new Enumerated Property
490 bpt ; Bidi_Paired_Bracket_Type
491 -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType
492 -> ubidi_props.h & .c & UBiDiProps.java
493 -> remember to write the max value at UBIDI_MAX_VALUES_INDEX
495 -> change ubidi.icu format version from 2.0 to 2.1
496 - 1 new Miscellaneous Property
497 bpb ; Bidi_Paired_Bracket
498 -> uchar.h & UProperty.java
501 * PropertyValueAliases.txt changes
502 - 3 Bidi_Paired_Bracket_Type (bpt) values:
506 -> uchar.h & UCharacter.BidiPairedBracketType
507 -> ubidi_props.h & .c & UBiDiProps.java
508 -> change ubidi.icu format version from 2.0 to 2.1
509 - 4 new Bidi_Class (bc) values:
510 bc ; FSI ; First_Strong_Isolate
511 bc ; LRI ; Left_To_Right_Isolate
512 bc ; RLI ; Right_To_Left_Isolate
513 bc ; PDI ; Pop_Directional_Isolate
514 -> uchar.h & UCharacterEnums.ECharacterDirection
515 -> until the bidi code gets updated,
516 Roozbeh suggests mapping the new bc values to ON (Other_Neutral)
517 - 3 new Word_Break (WB) values:
518 WB ; HL ; Hebrew_Letter
519 WB ; SQ ; Single_Quote
520 WB ; DQ ; Double_Quote
521 -> uchar.h & UCharacter.WordBreak
522 -> first time Word_Break numeric constants exceed 4 bits (now 17 values)
523 - 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
525 Aghb 239 Caucasian Albanian
528 -> com.ibm.icu.lang.UScript
529 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
530 replace public static final int \1 = \2;\3
531 -> preparseucd.py _scripts_only_in_iso15924
532 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
533 and in com.ibm.icu.dev.test.lang.TestUScript.java
534 -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
535 (not strictly necessary for NOT_ENCODED scripts)
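
The note about Word_Break exceeding 4 bits is plain arithmetic: 4 bits hold 16 distinct values, so 17 values need a fifth bit. A small sketch of the calculation (the helper name is made up for this example):

```python
def bits_needed(value_count):
    """Return the number of bits needed to store values 0..value_count-1."""
    return max(1, (value_count - 1).bit_length())


# 14 values before the 3 additions, 17 values after.
print(bits_needed(14), bits_needed(17))
```

Any data structure that packs Word_Break into a 4-bit field therefore needs a wider field from this version on.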
537 * generate normalization data files
538 - ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib
539 - ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in
540 - ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata
541 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
542 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
543 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
544 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
546 * build ICU (make install)
547 so that the tools build can pick up the new definitions from the installed header files.
549 ~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
551 * build Unicode tools using CMake+make
553 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
555 # Location (--prefix) of where ICU was installed.
556 set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst)
557 # Location of the ICU source tree.
558 set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src)
560 ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
561 ~/svn.icutools/trunk/dbg/unicode/c$ make
563 * generate core properties data files
564 - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src
565 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src
566 - rebuild ICU (make install) & tools
567 - run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
568 - rebuild ICU (make install) & tools
570 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
571 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
572 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
573 - Unicode 6.0..6.3: U+2260, U+226E, U+226F
574 - nothing new in 6.3, no test file to update
576 * update Java data files
577 - refresh just the UCD-related files, just to be safe
578 - see (ICU4C)/source/data/icu4j-readme.txt
580 - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
583 Unicode .icu files built to ./out/build/icudt52l
584 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b
585 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b
586 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
587 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b
588 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b"
589 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/
590 mkdir -p /tmp/icu4j/main/shared/data
591 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
592 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/
593 mkdir -p /tmp/icu4j/main/shared/data
594 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
595 make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data'
596 - copy the big-endian Unicode data files to another location,
597 separate from the other data files
598 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
599 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
600 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
601 ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu
602 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
603 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
604 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
606 ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
608 * refresh Java test .txt files
609 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
611 * UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files
613 - get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
614 - CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
615 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
616 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
617 (note removing the underscore before "Rules")
618 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
619 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
620 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
621 - check test file diffs for previously commented-out, known-failing data lines;
622 probably need to keep those commented out
623 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
624 - run genuca, see command line above
626 - refresh ICU4J collation data:
627 (subset of instructions above for properties data refresh, except copies all coll/*)
628 ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
629 ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
630 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
631 ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
632 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
633 - note on intltest: if collate/UCAConformanceTest fails, then
634 utility/MultithreadTest/TestCollators will fail as well;
635 fix the conformance test before looking into the multi-thread test
637 * test ICU, fix test code where necessary
639 * When refreshing all of ICU4J data from ICU4C
640 - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
641 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
643 - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
645 *** LayoutEngine script information
646 - skipped for Unicode 6.3: no new scripts
648 *** merge the Unicode update branches back onto the trunk
649 - do not merge the icudata.jar and testdata.jar,
650 instead rebuild them from merged & tested ICU4C
652 ---------------------------------------------------------------------------- ***
654 Unicode 6.2 update for ICU 50
656 http://www.unicode.org/review/pri230/
657 http://www.unicode.org/versions/beta-6.2.0.html
658 http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0
659 http://www.unicode.org/review/pri227/ Changes to Script Extensions Property Values
660 http://www.unicode.org/review/pri228/ Changing some common characters from Punctuation to Symbol
661 http://www.unicode.org/review/pri229/ Linebreaking Changes for Pictographic Symbols
662 http://www.unicode.org/reports/tr46/tr46-8.html IDNA
663 http://unicode.org/Public/idna/6.2.0/
667 - ticket 9515: Unicode 6.2: final ICU update
669 - ticket 9514: UCA 6.2: fix UCARules.txt
671 - ticket 9437: update ICU to Unicode 6.2
672 - C++ branches/markus/uni62 at r32050 from trunk at r32041
673 - Java branches/markus/uni62 at r32068 from trunk at r32066
675 *** Unicode version numbers
678 (configure.in & configure: have been modified to extract the version from uchar.h)
679 - com.ibm.icu.util.VersionInfo
680 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
682 *** data files & enums & parser code
686 - download UCD, UCA & IDNA files
687 - make sure that the Unicode data folder passed into preparseucd.py
688 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
689 - modify preparseucd.py: NamesList.txt is now in UTF-8
690 - ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src
691 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
692 - Check test file diffs for previously commented-out, known-failing data lines;
693 probably need to keep those commented out.
695 * PropertyValueAliases.txt changes
696 - 1 new Line_Break (lb) value:
697 lb ; RI ; Regional_Indicator
698 -> uchar.h & UCharacter.LineBreak
699 - 1 new Word_Break (WB) value:
700 WB ; RI ; Regional_Indicator
701 -> uchar.h & UCharacter.WordBreak
702 - 1 new Grapheme_Cluster_Break (GCB) value:
703 GCB; RI ; Regional_Indicator
704 -> uchar.h & UCharacter.GraphemeClusterBreak
706 * 3 new numeric values
707 The new value -1 was really supposed to be NaN, but that would have required
708 new UnicodeData.txt syntax. It can already be represented as a "fraction" of -1/1,
709 but encodeNumericValue() in corepropsbuilder.cpp had to be fixed.
710 cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1
711 cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1
712 The two new values 216000 and 432000 require an addition to the encoding of numeric values.
713 cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000
714 cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000
715 -> uprops.h, uchar.c & UCharacterProperty.java
716 -> cucdtst.c & UCharacterTest.java
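A rough illustration of the three new values, not the actual ICU encoding:
-1 fits a numerator/denominator pair, and 216000 and 432000 are small
multiples of 60^3, so a sexagesimal mantissa/exponent split represents them
compactly. The helper name split_base60 is hypothetical.

```python
from fractions import Fraction


def split_base60(value):
    """Return (mantissa, exponent) such that value == mantissa * 60**exponent."""
    if value <= 0:
        raise ValueError("expected a positive integer")
    exponent = 0
    # Strip factors of 60 until the remaining mantissa is no longer divisible.
    while value % 60 == 0:
        value //= 60
        exponent += 1
    return value, exponent


# -1 written as a "fraction" with numerator -1 and denominator 1.
minus_one = Fraction(-1, 1)
print(minus_one.numerator, minus_one.denominator)

# The two large cuneiform values from the ppucd.txt lines above.
for nv in (216000, 432000):
    print(nv, split_base60(nv))
```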
718 * generate normalization data files
719 - ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib
720 - ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in
721 - ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata
722 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
723 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
724 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
725 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
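The four gennorm2 runs above differ only in the output file and the list of
input files. A minimal sketch that tabulates them and prints the command
lines; it does not execute anything, and the function name and parameters
are made up for illustration.

```python
# Output .nrm file -> input .txt files, as listed in the commands above.
NORM2_JOBS = [
    ("nfc.nrm", ["nfc.txt"]),
    ("nfkc.nrm", ["nfc.txt", "nfkc.txt"]),
    ("nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"]),
    ("uts46.nrm", ["nfc.txt", "uts46.txt"]),
]


def gennorm2_commands(src_data_in, unidata, tool="bin/gennorm2"):
    """Return one argument list per gennorm2 run; nothing is executed."""
    return [
        [tool, "-o", f"{src_data_in}/{out}", "-s", f"{unidata}/norm2", *inputs]
        for out, inputs in NORM2_JOBS
    ]


for cmd in gennorm2_commands("$SRC_DATA_IN", "$UNIDATA"):
    print(" ".join(cmd))
```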
727 * build ICU (make install)
728 so that the tools build can pick up the new definitions from the installed header files.
729 * build Unicode tools using CMake+make
731 * generate core properties data files
732 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src
733 - in initial bootstrapping, change the UCA version
734 in source/data/unidata/FractionalUCA.txt to match the new Unicode version
735 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src
736 - rebuild ICU (make install) & tools
737 + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
738 check if the UCA version in FractionalUCA.txt matches the new Unicode version
740 - run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
741 - rebuild ICU (make install) & tools
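The version check mentioned for the genrb failure can be scripted. A minimal
sketch, assuming that FractionalUCA.txt carries its version in a line of the
form "[UCA version = 6.2.0]"; that line format and the function names are
assumptions, so adjust the pattern to the actual file.

```python
import re

# Assumption: the version appears in a line such as "[UCA version = 6.2.0]".
_UCA_VERSION = re.compile(r"\[UCA version\s*=\s*([0-9.]+)\]")


def uca_version(text):
    """Return the UCA version string found in the text, or None."""
    match = _UCA_VERSION.search(text)
    return match.group(1) if match else None


def versions_match(text, unicode_version):
    """Compare major.minor of the UCA version with the Unicode version."""
    found = uca_version(text)
    if found is None:
        return False
    return found.split(".")[:2] == unicode_version.split(".")[:2]


sample = "# sample header\n[UCA version = 6.2.0]\n"
print(uca_version(sample), versions_match(sample, "6.2"))
```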
743 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
744 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
745 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
746 - Unicode 6.0..6.2: U+2260, U+226E, U+226F
747 - nothing new in 6.2, no test file to update
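The grep for "disallowed_STD3_valid" on non-ASCII characters can be
approximated with a small filter. A sketch, assuming semicolon-separated
fields, "#" comments and "XXXX..YYYY" ranges; the sample lines only
illustrate that layout and are not copied from the data file.

```python
def non_ascii_std3_valid(lines):
    """Return (start, end) ranges above U+007F with status disallowed_STD3_valid."""
    result = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        start = int(start, 16)
        end = int(end, 16) if end else start
        if end > 0x7F:
            # Clip ranges that begin in ASCII so only the non-ASCII part is reported.
            result.append((max(start, 0x80), end))
    return result


sample = [
    "0000..002C    ; disallowed_STD3_valid # illustrative ASCII range",
    "00A0          ; disallowed_STD3_mapped ; 0020 # illustrative mapped entry",
    "2260          ; disallowed_STD3_valid # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid # NOT LESS-THAN..NOT GREATER-THAN",
]
for start, end in non_ascii_std3_valid(sample):
    print(f"U+{start:04X}..U+{end:04X}")
```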
749 * update Java data files
750 - refresh just the UCD-related files, just to be safe
751 - see (ICU4C)/source/data/icu4j-readme.txt
753 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
756 Unicode .icu files built to ./out/build/icudt50l
757 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b
758 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b
759 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
760 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b
761 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b"
762 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/
763 mkdir -p /tmp/icu4j/main/shared/data
764 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
765 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/
766 mkdir -p /tmp/icu4j/main/shared/data
767 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
768 make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data'
769 - copy the big-endian Unicode data files to another location,
770 separate from the other data files
771 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
772 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
773 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
774 ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu
775 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
776 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
777 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
779 ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
781 * refresh Java test .txt files
782 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
786 - get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
787 - CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
788 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
789 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
790 (note: remove the underscore before "Rules")
791 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
792 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
793 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
794 - check test file diffs for previously commented-out, known-failing data lines;
795 probably need to keep those commented out
796 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
797 - run genuca, see command line above
799 - refresh ICU4J collation data:
800 (subset of instructions above for properties data refresh, except copies all coll/*)
801 ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
802 ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
803 ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
804 ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
805 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
806 - note on intltest: if collate/UCAConformanceTest fails, then
807 utility/MultithreadTest/TestCollators will fail as well;
808 fix the conformance test before looking into the multi-thread test
810 * test ICU, fix test code where necessary
812 * When refreshing all of ICU4J data from ICU4C
813 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
814 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
816 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
818 *** LayoutEngine script information
819 - skipped for Unicode 6.2: no new scripts
821 *** merge the Unicode update branches back onto the trunk
822 - do not merge the icudata.jar and testdata.jar,
823 instead rebuild them from merged & tested ICU4C
825 ---------------------------------------------------------------------------- ***
827 Future Unicode update
829 Tools simplified since the Unicode 6.1 update. See
830 - http://site.icu-project.org/design/props/ppucd
831 - http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972
833 * Unicode version numbers
834 - icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates
837 - ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:
838 - ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src
839 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
840 - Check test file diffs for previously commented-out, known-failing data lines;
841 probably need to keep those commented out.
843 * PropertyValueAliases.txt changes
844 - Script codes that are in ISO 15924 but not in Unicode are now listed in
845 preparseucd.py, in the _scripts_only_in_iso15924 variable.
846 If there are new ISO codes, then add them.
847 If Unicode adds some of them, then remove them from the .py variable.
849 * UnicodeData.txt changes
850 - No more manual changes for CJK ranges for algorithmic names;
851 those are now written to ppucd.txt and genprops reads them from there.
853 * generate core properties data files (makeprops.sh was deleted)
854 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src
856 * no more manual updates of source/data/unidata/norm2/nfkc_cf.txt
857 - it is now generated by preparseucd.py
859 * no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt
860 - it is now generated by preparseucd.py
861 - make sure that the Unicode data folder passed into preparseucd.py
862 includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
863 (can be in some subfolder)
865 * generate normalization data files
866 - ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib
867 - ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in
868 - ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata
869 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
870 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
871 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
872 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
874 * build ICU (make install)
875 * build Unicode tools using CMake+make
877 * new way to call genuca (makeuca.sh was deleted)
878 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src
880 ---------------------------------------------------------------------------- ***
886 - ticket 8995 final update to Unicode 6.1
887 - ticket 8994 regenerate source/layout/CanonData.cpp
889 - ticket 8961 support Unicode "Age" value *names*
890 - ticket 8963 support multiple character name aliases & types
892 - ticket 8827 "update ICU to Unicode 6.1"
893 - C++ branches/markus/uni61 at r30864 from trunk at r30843
894 - Java branches/markus/uni61 at r30865 from trunk at r30863
896 *** Unicode version numbers
899 (configure.in & configure: have been modified to extract the version from uchar.h)
900 - com.ibm.icu.util.VersionInfo
901 - icutools/unicode/makedefs.sh
902 + also review & update other definitions in that file,
903 e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l
905 *** data files & enums & parser code
909 ~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed
910 - This prepares both unidata and testdata files in respective output subfolders.
911 - Check test file diffs for previously commented-out, known-failing data lines;
912 probably need to keep those commented out.
914 * PropertyValueAliases.txt changes
915 - 11 new block names:
917 Arabic_Mathematical_Alphabetic_Symbols
919 Meetei_Mayek_Extensions
928 -> add to UCharacter.UnicodeBlock IDs
929 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
930 replace public static final int \1_ID = \2; \3
931 -> add to UCharacter.UnicodeBlock objects
932 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
933 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
934 - 1 new Joining_Group (jg) value:
936 -> uchar.h & UCharacter.JoiningGroup
937 - 2 new Line_Break (lb) values:
938 CJ=Conditional_Japanese_Starter
940 -> uchar.h & UCharacter.LineBreak
943 sc ; Merc ; Meroitic_Cursive
944 sc ; Mero ; Meroitic_Hieroglyphs
947 sc ; Sora ; Sora_Sompeng
949 -> remove these from SyntheticPropertyValueAliases.txt
950 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
951 and in com.ibm.icu.dev.test.lang.TestUScript.java
952 - 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
956 and another one added 2011-12-09
957 Hluw 080 Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)
959 -> com.ibm.icu.lang.UScript
960 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
961 replace public static final int \1 = \2;\3
962 -> SyntheticPropertyValueAliases.txt
963 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
964 and in com.ibm.icu.dev.test.lang.TestUScript.java
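The Eclipse find/replace pairs in this list also work outside the IDE. A
minimal sketch with Python's re module, using the same patterns; the input
lines are hypothetical stand-ins for uchar.h and uscript.h enum entries.

```python
import re

# Find/replace pairs from the notes above, written for Python's re module.
UBLOCK_ID = (
    r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)",
    r"public static final int \1_ID = \2; \3",
)
UBLOCK_OBJECT = (
    r"UBLOCK_([^ ]+) = [0-9]+, (/.+)",
    r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
)
USCRIPT_INT = (
    r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)",
    r"public static final int \1 = \2;\3",
)


def convert(rule, line):
    """Apply one (pattern, replacement) rule to a single source line."""
    pattern, replacement = rule
    return re.sub(pattern, replacement, line)


# Hypothetical enum lines, not real constants.
block_line = "UBLOCK_SAMPLE_BLOCK = 123, /*[0000]*/"
script_line = "USCRIPT_SAMPLE = 45,/* Samp */"

print(convert(UBLOCK_ID, block_line))
print(convert(UBLOCK_OBJECT, block_line))
print(convert(USCRIPT_INT, script_line))
```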
966 * UnicodeData.txt changes
967 - the last Unihan code point changes from U+9FCB to U+9FCC
968 search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive)
969 + do change gennames.c
970 + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java
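The case-insensitive search for 9FC[BC] can be sketched as follows. The
sample text is hypothetical; in practice the function would be fed the
contents of each source file.

```python
import re

# Matches both the old end (9FCB) and the old limit (9FCC), in any case.
PATTERN = re.compile(r"9FC[BC]", re.IGNORECASE)


def find_hits(text):
    """Return (line_number, line) pairs for lines that mention 9FCB or 9FCC."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), 1)
        if PATTERN.search(line)
    ]


# Hypothetical source lines for illustration only.
sample = "end = 0x9fcb;\nunrelated = 0x9FC0;\n/* limit 0x9FCC */"
for number, line in find_hits(sample):
    print(number, line)
```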
972 * DerivedBidiClass.txt changes
973 - 2 new default-AL blocks:
974 # Arabic Extended-A: U+08A0 - U+08FF (was default-R)
975 # Arabic Mathematical Alphabetic Symbols:
976 # U+1EE00 - U+1EEFF (was default-R)
977 - 2 new default-R blocks:
978 # Meroitic Hieroglyphs:
980 # Meroitic Cursive: U+109A0 - U+109FF
981 -> should be picked up by the explicit data in the file
983 * NameAliases.txt changes
985 # Each line has two fields
986 # First field: Code point
987 # Second field: Alias
989 # Each line has three fields, as described here:
991 # First field: Code point
992 # Second field: Alias
994 - Also, the file format previously allowed multiple aliases per code point, but
995 only now does the file actually provide them, including several of the same type. For example,
996 FEFF;BYTE ORDER MARK;alternate
997 FEFF;BOM;abbreviation
998 FEFF;ZWNBSP;abbreviation
999 - This breaks our gennames parser, unames.icu data structure, and API.
1000 Fix gennames to only pick up "correction" aliases.
1001 New ticket #8963 for further changes.
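A minimal sketch of reading the three-field format and keeping only the
"correction" aliases, as described for the gennames fix. It is not the
gennames code; the function names are hypothetical, and the sample lines
include one correction entry added for illustration.

```python
from collections import defaultdict


def parse_name_aliases(lines):
    """Map code point -> list of (alias, type) from three-field lines."""
    aliases = defaultdict(list)
    for line in lines:
        data = line.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not data:
            continue
        code_point, alias, alias_type = (f.strip() for f in data.split(";"))
        aliases[int(code_point, 16)].append((alias, alias_type))
    return aliases


def corrections_only(aliases):
    """Keep only aliases of type "correction"; drop code points without one."""
    result = {}
    for code_point, pairs in aliases.items():
        kept = [alias for alias, alias_type in pairs if alias_type == "correction"]
        if kept:
            result[code_point] = kept
    return result


sample = [
    "# illustrative input",
    "01A2;LATIN CAPITAL LETTER GHA;correction",
    "FEFF;BYTE ORDER MARK;alternate",
    "FEFF;BOM;abbreviation",
    "FEFF;ZWNBSP;abbreviation",
]
parsed = parse_name_aliases(sample)
print(len(parsed[0xFEFF]), corrections_only(parsed))
```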
1003 * run genpname/preparse.pl (on Linux)
1004 + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
1005 + make sure that data.h is writable
1006 + perl preparse.pl ~/svn.icu/trunk/src > out.txt
1007 + preparse.pl shows no errors, out.txt Info and Warning lines look ok
1009 * build ICU (make install)
1010 so that the tools build can pick up the new definitions from the installed header files.
1011 * build Unicode tools (at least genpname) using CMake+make
1014 (builds both pnames.icu and propname_data.h)
1015 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
1016 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
1018 * build ICU (make install)
1019 * build Unicode tools using CMake+make
1021 * update source/data/unidata/norm2/nfkc_cf.txt
1022 - follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
1024 * update source/data/unidata/norm2/uts46.txt
1025 - download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
1026 to ~/svn.icu/tools/trunk/src/unicode/py
1027 - adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".
1028 - ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
1029 - ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
1031 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1032 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1033 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1034 - Unicode 6.0..6.1: U+2260, U+226E, U+226F
1035 - nothing new in 6.1, no test file to update
1037 * generate core properties data files
1038 - in initial bootstrapping, change the UCA version
1039 in source/data/unidata/FractionalUCA.txt to match the new Unicode version
1040 - ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
1041 - rebuild ICU & tools
1042 + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
1043 check if the UCA version in FractionalUCA.txt matches the new Unicode version
1045 - run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:
1046 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
1047 - rebuild ICU & tools
1049 * update Java data files
1050 - refresh just the UCD-related files, just to be safe
1051 - see (ICU4C)/source/data/icu4j-readme.txt
1053 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1056 Unicode .icu files built to ./out/build/icudt49l
1057 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b
1058 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b
1059 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
1060 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b
1061 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b"
1062 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/
1063 mkdir -p /tmp/icu4j/main/shared/data
1064 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1065 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/
1066 mkdir -p /tmp/icu4j/main/shared/data
1067 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1068 make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data'
1069 - copy the big-endian Unicode data files to another location,
1070 separate from the other data files
1071 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
1072 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
1073 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
1074 ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu
1075 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
1076 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
1077 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
1079 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
1081 * refresh Java test .txt files
1082 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1084 * test ICU so far, fix test code where necessary
1085 - temporarily ignore collation issues that look like UCA/UCD mismatches,
1086 until UCA data is updated
1090 - get output from Mark's tools; look in
1091 http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. version>.txt
1092 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1093 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1094 (note: remove the underscore before "Rules")
1095 - update (ICU)/source/test/testdata/CollationTest_*.txt
1096 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1097 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
1098 - check test file diffs for previously commented-out, known-failing data lines;
1099 probably need to keep those commented out
1100 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
1102 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
1104 - refresh ICU4J collation data:
1105 (subset of instructions above for properties data refresh, except copies all coll/*)
1106 ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1107 ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
1108 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
1109 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
1110 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
1111 - note on intltest: if collate/UCAConformanceTest fails, then
1112 utility/MultithreadTest/TestCollators will fail as well;
1113 fix the conformance test before looking into the multi-thread test
1115 * When refreshing all of ICU4J data from ICU4C
1116 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1117 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
1119 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
1121 *** LayoutEngine script information
1123 (For details see the Unicode 5.2 change log below.)
1125 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
1126 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
1127 in the working directory.
1128 (It also generates ScriptRunData.cpp, which is no longer needed.)
1130 The generated files have a current copyright date and "@draft" statement.
1132 - diff current <icu>/source/layout files vs. generated ones
1133 ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
1134 review and manually merge desired changes;
1135 fix gratuitous changes, incorrect @draft and missing aliases;
1136 Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
1137 - if you just copy the above files, then
1138 fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
1139 manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
1141 *** merge the Unicode update branches back onto the trunk
1142 - do not merge the icudata.jar and testdata.jar,
1143 instead rebuild them from merged & tested ICU4C
1145 ---------------------------------------------------------------------------- ***
1147 ICU 4.8 (no Unicode update, just new script codes)
1149 * 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
1155 Shrd 319 Sharada, Śāradā
1156 Sora 398 Sora Sompeng
1157 Takr 321 Takri, Ṭākrī, Ṭāṅkrī
1161 -> com.ibm.icu.lang.UScript
1162 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
1163 replace public static final int \1 = \2;\3
1164 -> genpname/SyntheticPropertyValueAliases.txt
1165 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
1166 and in com.ibm.icu.dev.test.lang.TestUScript.java
1168 * run genpname/preparse.pl (on Linux)
1169 + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
1170 + make sure that data.h is writable
1171 + perl preparse.pl ~/svn.icu/trunk/src > out.txt
1172 + preparse.pl shows no errors, out.txt Info and Warning lines look ok
1174 * rebuild Unicode tools (at least genpname) using make
1175 - You might first need to "make install" ICU so that the tools build can pick
1176 up the new definitions from the installed header files.
1179 (builds both pnames.icu and propname_data.h)
1180 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
1181 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
1182 - rebuild ICU & tools
1185 - ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
1186 - ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
1187 - rebuild ICU & tools
1189 * update Java data files
1190 - refresh just the UCD-related files, just to be safe
1191 - see (ICU4C)/source/data/icu4j-readme.txt
1193 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1194 - copy the big-endian Unicode data files to another location,
1195 separate from the other data files
1196 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
1197 ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
1198 ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
1200 ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b
1202 * should have updated the layout engine script codes but forgot
1204 ---------------------------------------------------------------------------- ***
1208 *** related ICU Trac tickets
1210 7264 Unicode 6.0 Update
1212 *** Unicode version numbers
1215 (configure.in & configure: have been modified to extract the version from uchar.h)
1216 - com.ibm.icu.util.VersionInfo
1218 *** data files & enums & parser code
1222 ~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
1223 - This now prepares both unidata and testdata files in respective output subfolders.
1225 * PropertyAliases.txt changes
1226 - new Script_Extensions property defined in the new ScriptExtensions.txt file
1227 but not listed in PropertyAliases.txt; reported to unicode.org;
1228 -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
1229 scx; Script_Extensions
1230 -> uchar.h with new UProperty section
1231 -> com.ibm.icu.lang.UProperty, parallel with uchar.h
1233 * PropertyValueAliases.txt changes
1234 - 12 new block names:
1239 CJK_Unified_Ideographs_Extension_D
1244 Miscellaneous_Symbols_And_Pictographs
1246 Transport_And_Map_Symbols
1248 -> add to UCharacter.UnicodeBlock
1249 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
1250 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1251 - Joining_Group (jg) values:
1252 Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias
1253 -> uchar.h & UCharacter.JoiningGroup
1258 -> remove these from SyntheticPropertyValueAliases.txt
1259 -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
1260 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
1261 and in com.ibm.icu.dev.test.lang.TestUScript.java
1262 - 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
1263 (added 2009-11-11..2010-07-18)
1265 Dupl 755 Duployan shorthand
1271 Merc 101 Meroitic Cursive
1272 Narb 106 Old North Arabian
1276 Wara 262 Warang Citi
1278 -> com.ibm.icu.lang.UScript
1279 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
1280 replace public static final int \1 = \2;\3
1281 -> SyntheticPropertyValueAliases.txt
1282 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
1283 and in com.ibm.icu.dev.test.lang.TestUScript.java
1284 - ISO 15924 name change
1285 Mero 100 Meroitic Hieroglyphs (was Meroitic)
1286 -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
1287 - property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt
1289 * UnicodeData.txt changes
1291 2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
1292 2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
1293 -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion
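The First/Last pair above delimits one algorithmic-name range. A minimal
sketch of pulling such ranges out of UnicodeData.txt lines; it is not the
gennames code, and the function name is hypothetical.

```python
def first_last_ranges(lines):
    """Return (label, start, end) for each matching <..., First>/<..., Last> pair."""
    ranges = []
    pending = None  # (label, start) of a First line waiting for its Last line
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 2:
            continue
        name = fields[1]
        if name.startswith("<") and name.endswith(", First>"):
            pending = (name[1:-len(", First>")], int(fields[0], 16))
        elif name.startswith("<") and name.endswith(", Last>") and pending:
            label = name[1:-len(", Last>")]
            if label == pending[0]:
                ranges.append((label, pending[1], int(fields[0], 16)))
            pending = None
    return ranges


sample = [
    "2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;",
    "2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;",
]
for label, start, end in first_last_ranges(sample):
    print(f"{label}: U+{start:04X}..U+{end:04X}")
```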
1295 * build Unicode tools using CMake+make
1297 * run genpname/preparse.pl (on Linux)
1298 + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
1299 + make sure that data.h is writable
1300 + perl preparse.pl ~/svn.icu/trunk/src > out.txt
1301 + preparse.pl shows no errors, out.txt Info and Warning lines look ok
1303 * rebuild Unicode tools (at least genpname) using make
1304 - You might first need to "make install" ICU so that the tools build can pick
1305 up the new definitions from the installed header files.
1308 - ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
1309 - rebuild ICU & tools
1311 * update source/data/unidata/norm2/nfkc_cf.txt
1312 - follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
1314 * update source/data/unidata/norm2/uts46.txt
1315 - download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
1316 to ~/svn.icu/tools/trunk/src/unicode/py
1317 - adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
1318 - ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
1319 - ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
1321 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1322 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1323 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1324 - Unicode 6.0: U+2260, U+226E, U+226F
1326 * generate core properties data files
1327 - ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
1328 - rebuild ICU & tools
1329 - run makeuca.sh so that genuca picks up the new nfc.nrm:
1330 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
1331 - rebuild ICU & tools
1333 * implement new Script_Extensions property (provisional)
1334 - parser & generator: genprops & uprops.icu
1335 - uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
1336 - UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java
1338 * switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
1340 - genbidi/gencase/genprops tools changes
1341 - re-run makeprops.sh (see above)
1342 - UCharacterProperty.java, UCharacterTypeIterator.java,
1343 UBiDiProps.java, UCaseProps.java, and several others with minor changes;
1344 UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java
1346 * update Java data files
1347 - refresh just the UCD-related files, just to be safe
1348 - see (ICU4C)/source/data/icu4j-readme.txt
1350 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1353 Unicode .icu files built to ./out/build/icudt45l
1354 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
1355 echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
1356 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
1357 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
1358 mkdir -p /tmp/icu4j/main/shared/data
1359 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1360 - copy the big-endian Unicode data files to another location,
1361 separate from the other data files
1362 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
1363 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
1364 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
1365 ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
1366 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
1367 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
1368 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
1370 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
1372 * refresh Java test .txt files
1373 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1375 * un-hardcode normalization skippable (NF*_Inert) test data
1376 - removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools
1378 * copy updated break iterator test files
1379 - now handled by early ucdcopy.py and
1380 copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
1382 copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
1383 to ~/svn.icu/trunk/src/source/test/testdata)
1384 - they are not used in ICU4J
1388 - get output from Mark's tools; look in
1389 http://www.unicode.org/~book/incoming/mark/uca6.0.0/
1390 http://www.macchiato.com/unicode/utc/additional-uca-files
1391 http://www.unicode.org/Public/UCA/6.0.0/
1392 http://www.unicode.org/~mdavis/uca/
1393 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1394 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1395 - update Han-implicit ranges for new CJK extensions:
1396 swapCJK() in ucol.cpp & ImplicitCEGenerator.java
1397 - genuca: allow bytes 02 for U+FFFE, new merge-sort character;
1398 do not add it into invuca so that tailoring primary-after an ignorable works
1399 - genuca: permit space between [variable top] bytes
1400 - ucol.cpp: treat noncharacters like unassigned rather than ignorable
1402 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
1404 - refresh ICU4J collation data:
1405 (subset of instructions above for properties data refresh, except copies all coll/*)
1406 ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1407 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
1408 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
1409 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
1410 - update (ICU)/source/test/testdata/CollationTest_*.txt
1411 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1412 with output from Mark's Unicode tools
1413 - run all tests with the *_SHORT.txt or the full files (the full ones have comments)
1414 - note on intltest: if collate/UCAConformanceTest fails, then
1415 utility/MultithreadTest/TestCollators will fail as well;
1416 fix the conformance test before looking into the multi-thread test
1418 * When refreshing all of ICU4J data from ICU4C
1419 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1420 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
1422 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
1424 *** LayoutEngine script information
1426 (For details see the Unicode 5.2 change log below.)
1428 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
1429 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
1430 ScriptRunData.cpp, which is no longer needed.)
1432 The generated files have a current copyright date and "@draft" statement.
1434 * copy the above files into <icu>/source/layout, replacing the old files.
1435 * fix mixed line endings
1436 * review the diffs and fix incorrect @draft and missing aliases;
1437 Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
1438 * manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
1440 ---------------------------------------------------------------------------- ***
1444 *** related ICU Trac tickets
1448 7167 verify collation bytes
1449 7235 Java test NAME_ALIAS
1450 7236 Java DerivedCoreProperties.txt test
1451 7237 Java BidiTest.txt
1452 7238 UTrie2 in core unidata
1453 7239 test for tailoring gaps
1454 7240 Java fix CollationMiscTest
1455 7243 update layout engine for Unicode 5.2
1457 *** Unicode version numbers
1460 - configure.in & configure
1461 - update ucdVersion in gennames.c if an algorithmic range changes
1463 *** data files & enums & parser code
1467 python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
1468 - includes finding files regardless of version numbers,
1469 copying them, and performing the equivalent processing of the
1470 ucdstrip and ucdmerge tools on the desired set of files
1473 - PropertyAliases.txt
1474 moved from numeric to enumerated:
1475 ccc ; Canonical_Combining_Class
1476 new string properties:
1477 NFKC_CF ; NFKC_Casefold
1478 Name_Alias; Name_Alias
1479 new binary properties:
1482 CWCF ; Changes_When_Casefolded
1483 CWCM ; Changes_When_Casemapped
1484 CWKCF ; Changes_When_NFKC_Casefolded
1485 CWL ; Changes_When_Lowercased
1486 CWT ; Changes_When_Titlecased
1487 CWU ; Changes_When_Uppercased
1488 new CJK Unihan properties (not supported by ICU)
1489 - PropertyValueAliases.txt
1492 one script code change:
1493 sc ; Qaai ; Inherited
1495 sc ; Zinh ; Inherited ; Qaai
1496 new Line_Break (lb) value:
1497 lb ; CP ; Close_Parenthesis
1498 new Joining_Group (jg) values: Farsi_Yeh, Nya
1500 ccc; 214; ATA ; Attached_Above
1501 - DerivedBidiClass.txt
1502 new default-R range: U+1E800 - U+1EFFF
1504 all of the ISO comments are gone
1506 9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
1508 2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
1509 2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;
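To illustrate the PropertyValueAliases.txt changes noted above (the extra Qaai alias, the ccc line with a numeric field), here is a minimal parsing sketch. The function name and the returned dictionary layout are assumptions for illustration, not what preparse.pl does.

```python
def parse_value_alias(line):
    """Parse one PropertyValueAliases.txt line; return None for comments/blank lines."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    prop, rest = fields[0], fields[1:]
    numeric = None
    if prop == "ccc":
        # ccc lines carry the numeric class before the names
        numeric, rest = int(rest[0]), rest[1:]
    short, long_name, *extra = rest
    return {"property": prop, "numeric": numeric,
            "short": short, "long": long_name, "aliases": extra}
```

With this shape, both Zinh and Qaai can be looked up as names for Inherited once the line "sc ; Zinh ; Inherited ; Qaai" is parsed.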
1513 + cd \svn\icuproj\icu\trunk\source\tools\genpname
1514 + make sure that data.h is writable
1515 + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
1516 + preparse.pl complains with errors like the following:
1517 Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
1518 This is because ICU 4.0 had scripts from ISO 15924 which are now
1519 added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
1520 and PropertyValueAliases.txt.
1521 -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
1522 Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
1523 + preparse.pl complains with errors about block names missing from uchar.h; add them
1525 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
1526 - new block & script values
1528 copy new blocks from Blocks.txt
1529 MS VC++ 2008 regular expression:
1530 find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
1531 replace with " UBLOCK_\3 = 172, /*[\1]*/"
1532 + several new script values already added in ICU 4.0 for ISO 15924 coverage
1533 (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
1534 + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
1535 + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
1536 (added to SyntheticPropertyValueAliases.txt)
1537 - new Joining Group (JG) values: Farsi_Yeh, Nya
1538 - new Line_Break (lb) value:
1539 lb ; CP ; Close_Parenthesis
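The MS VC++ find/replace for new blocks noted above can be approximated in Python. This is a sketch under assumptions: the naming rule (upper-case, runs of other characters become underscores) and the helper name are guesses for illustration, and the starting value is passed in rather than hardcoded.

```python
import re

# Matches Blocks.txt data lines such as "A960..A97F; Hangul Jamo Extended-A".
BLOCK_LINE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+);\s*(\S.*?)\s*$")

def block_enum_lines(blocks_txt, first_value):
    """Yield one UBLOCK_ enum line per block, numbering from first_value."""
    value = first_value
    for line in blocks_txt.splitlines():
        m = BLOCK_LINE.match(line)
        if not m:
            continue  # skip comments and blank lines
        start, _end, name = m.groups()
        # Assumed naming rule: upper-case, non-alphanumeric runs become "_".
        ident = re.sub(r"[^A-Z0-9]+", "_", name.upper())
        yield f"    UBLOCK_{ident} = {value}, /*[{start}]*/"
        value += 1
```

Unlike the editor regex, this numbers the constants consecutively, so the values do not need to be fixed up by hand afterwards.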
1541 * hardcoded Unihan range end/limit
1542 - Unihan range end moves from 9FC3 to 9FCB
1543 search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
1544 + do change gennames.c
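The search for the hardcoded range end/limit can be scripted. A minimal sketch, assuming a plain directory walk is good enough; the function name and the list of file suffixes are made up for illustration.

```python
import re
from pathlib import Path

UNIHAN_END = re.compile(r"9FC[34]", re.IGNORECASE)  # regex from the note above

def find_hardcoded(root, pattern=UNIHAN_END,
                   suffixes=(".c", ".cpp", ".h", ".java", ".txt")):
    """Return (path, line number, stripped line) for every matching line."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), 1):
            if pattern.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Each hit still needs a manual decision, since not every occurrence should change.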
1546 * Compare definitions of new binary properties with what we used to use
1547 in algorithms, to see if the definitions changed.
1548 - Verified that definitions for Cased and Case_Ignorable are unchanged.
1549 The gencase tool now parses the newly public Case_Ignorable values
1550 in case the definition changes in the future.
1552 * uchar.c & uprops.h & uprops.c & genprops
1553 - new numeric values that didn't exist in Unicode data before:
1554 1/7, 1/9, 1/10, 3/10, 1/16, 3/16
1555 the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
1556 therefore redesign the encoding of numeric types and values for formatVersion 6;
1557 design for simple numbers up to at least 144 ("one gross"),
1558 large values up to at least 10^20,
1559 and fractions with numerators -1..17 and denominators 1..16
1560 to cover current and expected future values
1561 (e.g., more Han numeric values, Meroitic twelfths)
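To make the fraction range concrete, here is one hypothetical way to pack numerators -1..17 and denominators 1..16 into a small integer. It only illustrates that the stated ranges fit in 9 bits; it is not claimed to be the actual formatVersion 6 layout.

```python
NUM_MIN, NUM_MAX = -1, 17   # numerator range from the note above
DEN_MIN, DEN_MAX = 1, 16    # denominator range from the note above

def pack_fraction(num, den):
    """Pack a fraction into (numerator offset << 4) | (denominator offset)."""
    if not (NUM_MIN <= num <= NUM_MAX and DEN_MIN <= den <= DEN_MAX):
        raise ValueError(f"{num}/{den} is outside the supported range")
    return ((num - NUM_MIN) << 4) | (den - DEN_MIN)

def unpack_fraction(code):
    """Inverse of pack_fraction(); returns (numerator, denominator)."""
    return (code >> 4) + NUM_MIN, (code & 0xF) + DEN_MIN
```

For example, pack_fraction(3, 16) gives 79, and unpack_fraction(79) gives (3, 16). The new values 1/10 and 3/16 fit because the denominator field covers the full 1..16 range.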
1563 * reimplement Hangul_Syllable_Type for new Jamo characters
1564 - the old code assumed that all Jamo characters are in the 11xx block
1565 - Unicode 5.2 fills holes there and adds new Jamo characters in
1566 A960..A97F; Hangul Jamo Extended-A
1568 D7B0..D7FF; Hangul Jamo Extended-B
1569 - Hangul_Syllable_Type can be trivially derived from a subset of
1570 Grapheme_Cluster_Break values
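The derivation mentioned above could look like the following sketch. It assumes the short value names of the two properties coincide for the Hangul values (L, V, T, LV, LVT); the function name is made up.

```python
# Grapheme_Cluster_Break values that correspond to a Hangul_Syllable_Type.
GCB_TO_HST = {"L": "L", "V": "V", "T": "T", "LV": "LV", "LVT": "LVT"}

def hangul_syllable_type(gcb_value):
    """Map a GCB short name to an HST short name; everything else is NA."""
    return GCB_TO_HST.get(gcb_value, "NA")
```

Because the lookup is keyed on the property value rather than on code point ranges, new Jamo outside the 11xx block need no special handling.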
1572 * build Unicode data source code for hardcoding core data
1573 C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data
1575 ICU data make path is \svn\icuproj\icu\trunk\source\data\
1576 ICU root path is \svn\icuproj\icu\trunk
1577 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
1578 Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
1579 Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
1580 Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
1581 Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
1582 Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
1583 Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
1584 Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
1585 Creating data file for Unicode Property Names
1586 Creating data file for Unicode Character Properties
1587 Creating data file for Unicode Case Mapping Properties
1588 Creating data file for Unicode BiDi/Shaping Properties
1589 Creating data file for Unicode Normalization
1590 Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
1591 Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"
1593 - copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
1594 and rebuild the common library
1598 - update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
1599 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
1600 - update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
1601 [ Begin obsolete instructions:
1602 Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files.
1603 - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
1605 python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
1606 python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
1607 End obsolete instructions]
1608 - run all tests with the *_SHORT.txt or the full files (the full ones have comments)
1609 not just the *_STUB.txt files
1610 - note on intltest: if collate/UCAConformanceTest fails, then
1611 utility/MultithreadTest/TestCollators will fail as well;
1612 fix the conformance test before looking into the multi-thread test
1614 *** Implement Cased & Case_Ignorable properties
1615 - via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
1616 - Problem: These properties should be disjoint, but aren't
1617 - UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
1618 - change ucase.icu to be able to store any combination of Cased and Case_Ignorable
1620 *** Implement Changes_When_Xyz properties
1621 - without stored data
1623 *** Implement Name_Alias property
1624 - add it as another name field in unames.icu
1625 - make it available via u_charName() and UCharNameChoice
1626 - consider it in u_charFromName()
1630 * Update break iterator rules to new UAX versions and new property values
1631 * Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary
1633 *** new BidiTest file
1634 - review format and data
1635 - copy BidiTest.txt to source/test/testdata
1636 - write test code using this data
1637 - fix ICU code where it fails the conformance test
1640 - generally, find and update code corresponding to C/C++
1641 - UCharacter.UnicodeBlock constants:
1642 a) add an _ID integer per new block, update COUNT
1643 b) add a class instance per new block
1644 Visual Studio regex:
1645 find UBLOCK_{[^ ]+} = [0-9]+, {/.+}
1646 replace with public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1647 - CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()
1649 - port test changes to Java
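The Visual Studio find/replace for UnicodeBlock instances translates directly to Python's re module. A sketch only: the helper name is made up, and the input is assumed to be one enum line copied from uchar.h.

```python
import re

# Same pattern as the Visual Studio regex above, in Python syntax.
UBLOCK = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
JAVA_TEMPLATE = (r'public static final UnicodeBlock \1 = '
                 r'new UnicodeBlock("\1", \1_ID); \2')

def java_block_constant(c_enum_line):
    """Turn one C enum line into the matching Java class-instance line."""
    return UBLOCK.sub(JAVA_TEMPLATE, c_enum_line.strip())
```

The _ID integer constants from step a) still have to be added separately.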
1651 *** LayoutEngine script information
1653 (For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)
1655 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
1656 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
1657 ScriptRunData.cpp, which is no longer needed.)
1659 The generated files have a current copyright date and "@draft" statement.
1661 -> Eric Mader wrote in email on 20090930:
1662 "I think the tool has been modified to update @draft to @stable for
1663 older scripts and to add @draft for new scripts.
1664 (I worked with an intern on this last year.)
1665 You should check the output after you run it."
1667 * copy the above files into <icu>/source/layout, replacing the old files.
1668 * fix mixed line endings
1669 * review the diffs and fix incorrect @draft and missing aliases
1670 * manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
1672 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
1673 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
1675 -> Eric Mader wrote in email on 20090930:
1676 "This is just a matter of making sure that all the per-script tables have
1677 entries for any new scripts that were added.
1678 If any new Indic characters were added, then the class tables in
1679 IndicClassTables.cpp should be updated to reflect this.
1680 John Emmons should know how to do this if it's required."
1682 * rebuild the layout and layoutex libraries.
1686 + Jamo_Short_Name, sfc->scf, binary property value aliases
1688 ---------------------------------------------------------------------------- ***
1692 *** related ICU Trac tickets
1694 5696 Update to Unicode 5.1
1696 *** Unicode version numbers
1699 - configure.in & configure
1700 - update ucdVersion in gennames.c if an algorithmic range changes
1702 *** data files & enums & parser code
1706 DerivedCoreProperties.txt
1707 DerivedNormalizationProps.txt
1708 NormalizationTest.txt
1711 GraphemeBreakProperty.txt
1712 SentenceBreakProperty.txt
1713 WordBreakProperty.txt
1714 - ucdstrip and ucdmerge:
1718 * my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
1719 copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
1720 copy 5.1.0\ucd\Blocks.txt ..\unidata\
1721 copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
1722 copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
1723 copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
1724 copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
1725 copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
1726 copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
1727 copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
1728 copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
1729 copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
1730 copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
1731 copy 5.1.0\ucd\UnicodeData.txt ..\unidata\
1733 ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
1734 ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
1735 ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
1736 ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
1737 ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
1738 ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
1739 ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
1740 ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
1741 ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
1742 ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
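For readers without the tools at hand, the processing in the batch file can be approximated in Python. This is a rough sketch under assumptions: it takes ucdstrip to remove comments and empty lines, and ucdmerge to join directly adjacent ranges that carry the same value. The function names are made up, and the real tools may differ in details.

```python
def strip_comments(lines):
    """Assumed ucdstrip behavior: drop '#' comments and empty lines."""
    for line in lines:
        data = line.split("#", 1)[0].rstrip()
        if data:
            yield data

def merge_ranges(lines):
    """Assumed ucdmerge behavior: join adjacent ranges with equal values."""
    merged = []  # entries are [start, end, value]
    for line in lines:
        rng, value = (part.strip() for part in line.split(";", 1))
        start, _, end = rng.partition("..")
        lo, hi = int(start, 16), int(end or start, 16)
        if merged and merged[-1][2] == value and merged[-1][1] + 1 == lo:
            merged[-1][1] = hi          # extend the previous range
        else:
            merged.append([lo, hi, value])
    for lo, hi, value in merged:
        rng = f"{lo:04X}" if lo == hi else f"{lo:04X}..{hi:04X}"
        yield f"{rng};{value}"
```

Chaining the two, as in merge_ranges(strip_comments(f)), mirrors the "ucdstrip | ucdmerge" pipeline used for EastAsianWidth.txt and LineBreak.txt.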
1746 + cd \svn\icuproj\icu\uni51\source\tools\genpname
1747 + make sure that data.h is writable
1748 + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
1749 + preparse.pl complains with errors like the following:
1750 Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
1751 This is because ICU 3.8 had scripts from ISO 15924 which are now
1752 added to Unicode 5.1, and the Perl script reports a conflict between SyntheticPropertyValueAliases.txt
1753 and PropertyValueAliases.txt.
1754 -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
1755 Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
1756 + PropertyValueAliases.txt now explicitly contains values for boolean properties:
1757 N/Y, No/Yes, F/T, False/True
1758 -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
1759 It will use further values from the file if present.
1761 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
1762 - new block & script values
1764 + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
1765 (removed from SyntheticPropertyValueAliases.txt)
1766 + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
1767 (added to SyntheticPropertyValueAliases.txt)
1768 - uprops.icu (uprops.h) only provides 7 bits for script codes.
1769 In ICU 4.0 there are USCRIPT_CODE_LIMIT=130 script codes now.
1770 None of the codes above 127 is yet the script code of an
1771 assigned Unicode character, so ICU 4.0 uprops.icu does not store any
1772 script code values greater than 127.
1773 However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
1774 in a parallel bit field, and that overflows now.
1775 Also, future values >=128 would be incompatible anyway.
1776 uprops.h is modified to move around several of the bit fields
1777 in the properties vector words, and now uses 8 bits for the script code.
1778 Two other bit fields also grow to accommodate future growth:
1779 Block (current count: 172) grows from 8 to 9 bits,
1780 and Word_Break grows from 4 to 5 bits.
1781 - renamed property Simple_Case_Folding (sfc->scf)
1782 + nothing to be done: handled as normal alias
1783 - new property JSN Jamo_Short_Name
1784 + no new API: only contributes to the Name property
1785 - new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
1786 - new Joining Group (JG) value: Burushashki_Yeh_Barree
1787 - new Sentence_Break (SB) values:
1792 - new Word_Break (WB) values:
1794 WB ; Extend ; Extend
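The script-code bit field overflow described above is simple arithmetic; a small sketch makes the numbers explicit. The helper names are made up, and the constants are the ones quoted in this note.

```python
def bits_needed(max_value):
    """Number of bits required to store values 0..max_value."""
    return max(1, max_value.bit_length())

def fits(max_value, width):
    """True if values 0..max_value fit into a bit field of the given width."""
    return max_value < (1 << width)

USCRIPT_CODE_LIMIT = 130                 # ICU 4.0, from the note above
max_script = USCRIPT_CODE_LIMIT - 1      # 129 must be stored in the parallel field
```

Here bits_needed(max_script) is 8, so the old 7-bit field overflows. The block count of 172 still fits into 8 bits, which means the move to 9 bits is purely headroom for future growth.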
1798 * Further changes in the 2008-02-29 update:
1799 - Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
1800 because they should not normally be invisible.
1801 - new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
1802 - new Grapheme_Cluster_Break (GCB) value: PP=Prepend
1803 - new Word_Break (WB) value: NL=Newline
1805 * hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
1806 - Unihan range end moves from 9FBB to 9FC3
1807 search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
1808 + do change gennames.c
1810 * build Unicode data source code for hardcoding core data
1811 C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data
1813 ICU data make path is \svn\icuproj\icu\uni51\source\data\
1814 ICU root path is \svn\icuproj\icu\uni51
1815 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
1816 Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
1817 Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
1818 Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
1819 Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
1820 Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
1821 Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
1822 Creating data file for Unicode Character Properties
1823 Creating data file for Unicode Case Mapping Properties
1824 Creating data file for Unicode BiDi/Shaping Properties
1825 Creating data file for Unicode Normalization
1826 Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
1827 Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"
1829 - copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
1830 and rebuild the common library
1834 * Update break iterator rules to new UAX versions and new property values
1838 * update FractionalUCA.txt and UCARules.txt with new canonical closure
1841 - Test that APIs using Unicode property value aliases (like UnicodeSet)
1842 support all of the boolean values N/Y, No/Yes, F/T, False/True
1843 -> TestBinaryValues() tests in both cintltst and intltest
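What such a test has to cover can be sketched as a lookup over the eight names. This is an illustration only; the function name is made up, and case-insensitive matching with surrounding whitespace ignored is an assumption about how lenient the APIs are.

```python
# The explicit boolean value names from PropertyValueAliases.txt.
BINARY_VALUE_ALIASES = {
    "n": False, "no": False, "f": False, "false": False,
    "y": True, "yes": True, "t": True, "true": True,
}

def parse_binary_value(text):
    """Return True/False for a binary property value name, else raise ValueError."""
    key = text.strip().lower()
    try:
        return BINARY_VALUE_ALIASES[key]
    except KeyError:
        raise ValueError(f"not a binary property value: {text!r}") from None
```

A test in this style would loop over all eight names and check that each one selects the same set as its canonical form.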
1845 *** LayoutEngine script information
1846 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
1847 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
1848 ScriptRunData.cpp, which is no longer needed.)
1850 The generated files have a current copyright date and "@draft" statement.
1852 * copy the above files into <icu>/source/layout, replacing the old files.
1854 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
1855 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
1857 * rebuild the layout and layoutex libraries.
1861 + Jamo_Short_Name, sfc->scf, binary property value aliases
1863 ---------------------------------------------------------------------------- ***
1867 *** related Jitterbugs
1869 5084 RFE: Update to Unicode 5.0
1871 *** data files & enums & parser code
1875 DerivedCoreProperties.txt
1876 DerivedNormalizationProps.txt
1877 NormalizationTest.txt
1880 GraphemeBreakProperty.txt
1881 SentenceBreakProperty.txt
1882 WordBreakProperty.txt
1883 - ucdstrip and ucdmerge:
1887 * my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
1888 copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
1889 copy 5.0.0\ucd\Blocks.txt ..\unidata\
1890 copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
1891 copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
1892 copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
1893 copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
1894 copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
1895 copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
1896 copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
1897 copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
1898 copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
1899 copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
1900 copy 5.0.0\ucd\UnicodeData.txt ..\unidata\
1902 ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
1903 ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
1904 ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
1905 ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
1906 ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
1907 ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
1908 ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
1909 ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
1910 ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
1911 ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
1913 * update FractionalUCA.txt and UCARules.txt with new canonical closure
1917 + make sure that data.h is writable
1918 + perl preparse.pl \cvs\oss\icu > out.txt
1920 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
1921 - new block & script values
1922 + script values already added in ICU 3.6 because all of ISO 15924 is now covered
1924 * build Unicode data source code for hardcoding core data
1925 C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data
1927 ICU data make path is \cvs\oss\icu\source\data\
1928 ICU root path is \cvs\oss\icu
1929 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
1931 Creating data file for Unicode Character Properties
1932 Creating data file for Unicode Case Mapping Properties
1933 Creating data file for Unicode BiDi/Shaping Properties
1934 Creating data file for Unicode Normalization
1935 Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
1936 Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"
1938 - copy the .c source files to C:\cvs\oss\icu\source\common
1939 and rebuild the common library
1941 *** Unicode version numbers
1946 *** LayoutEngine script information
1947 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
1948 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
1949 ScriptRunData.cpp, which is no longer needed.)
1951 The generated files have a current copyright date and "@draft" statement.
1953 * copy the above files into <icu>/source/layout, replacing the old files.
1955 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
1956 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
1958 * rebuild the layout and layoutex libraries.
1960 ---------------------------------------------------------------------------- ***
1964 *** related Jitterbugs
1966 4332 RFE: Update to Unicode 4.1
1967 4157 RBBI, TR29 4.1 updates
1969 *** data files & enums & parser code
1973 DerivedCoreProperties.txt
1974 DerivedNormalizationProps.txt
1975 NormalizationTest.txt
1976 GraphemeBreakProperty.txt
1977 SentenceBreakProperty.txt
1978 WordBreakProperty.txt
1979 - ucdstrip and ucdmerge:
1983 * add new files to the repository
1984 GraphemeBreakProperty.txt
1985 SentenceBreakProperty.txt
1986 WordBreakProperty.txt
1988 * update FractionalUCA.txt and UCARules.txt with new canonical closure
1991 - handle new enumerated properties in sub read_uchar
1994 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
1995 - new binary properties
1997 + Pattern_White_Space
1998 - new enumerated properties
1999 + Grapheme_Cluster_Break
2002 - new block & script & line break values
2005 - case-ignorable changes
2006 see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
2007 now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk
2009 *** Unicode version numbers
2015 - verify that u_charMirror() round-trips
2016 - test all new properties and some new values of old properties
2020 * hardcoded Unihan range end/limit
2021 - Unihan range end moves from 9FA5 to 9FBB
2022 search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
2023 + do not modify BOCU/BOCSU code because that would change the encoding
2024 and break binary compatibility!
2025 + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
2027 + ignore trietest.c: test data is arbitrary
2028 + ignore tstnorm.cpp: test optimization, not important
2029 + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
2030 + do change line_th.txt and word_th.txt
2031 by replacing hardcoded ranges with the new property values
2032 + do change gennames.c
2034 source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
2035 source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
2036 source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5,
2039 - compare new special casing context conditions with previous ones
2040 see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
2043 - consider storing only the short name if it is the same as the long name
2046 - UAX #29 changes (grapheme/word/sentence breaks)
2047 - UAX #14 changes (line breaks)
2048 - Pattern_Syntax & Pattern_White_Space
2050 ---------------------------------------------------------------------------- ***
2052 Unicode 4.0.1 update
2054 *** related Jitterbugs
2056 3170 RFE: Update to Unicode 4.0.1
2057 3171 Add new Unicode 4.0.1 properties
2058 3520 use Unicode 4.0.1 updates for break iteration
2060 *** data files & enums & parser code
2063 - ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
2064 - ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt
2067 - fix UnicodeData.txt general categories of Ethiopic digits Nd->No
2068 according to PRI #26
2069 http://www.unicode.org/review/resolved-pri.html#pri26
2070 - undone again because no corrigendum in sight;
2071 instead modified tests to not check consistency on this for Unicode 4.0.1
2074 - update from http://www.unicode.org/copyright.html
2075 formatted for plain text
2077 * uchar.h & uprops.h & uprops.c & genprops
2078 - add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
2079 - add U_LB_INSEPARABLE due to a spelling fix
2080 + put short name comment only on line with new constant
2081 for genpname perl script parser
2082 - new binary properties
2084 + Variation_Selector
2087 - fix genpname perl script so that it doesn't choke on more than 2 names per property value
2088 - perl script: correctly calculate the maximum number of fields per row
2091 - new script code Hrkt=Katakana_Or_Hiragana
2093 * gennorm.c track changes in DerivedNormalizationProps.txt
2094 - "FNC" -> "FC_NFKC"
2095 - single field "NFD_NO" -> two fields "NFD_QC; N" etc.
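The two format changes above can be handled by normalizing old-style lines to the new shape. A sketch, not the gennorm.c code: it assumes the old single-field names follow the pattern NFD_NO, NFC_MAYBE and so on, and the function name is made up.

```python
import re

OLD_QC = re.compile(r"^(NFK?[CD])_(NO|MAYBE)$")   # assumed old single-field names

def parse_norm_prop_line(line):
    """Return (code range, property, [values]) in the new format, or None."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    code_range, prop, rest = fields[0], fields[1], fields[2:]
    if prop == "FNC":
        prop = "FC_NFKC"                  # renamed property
    m = OLD_QC.match(prop)
    if m and not rest:
        prop = m.group(1) + "_QC"         # NFD_NO -> NFD_QC
        rest = [m.group(2)[0]]            # NO -> N, MAYBE -> M
    return code_range, prop, rest
```

Both "00C0..00C5 ; NFD_NO" and "00C0..00C5 ; NFD_QC; N" then parse to the same triple, so downstream code only needs to know the new names.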
2097 * genprops/props2.c track changes in DerivedNumericValues.txt
2098 - changed from 3 columns to 2, dropping the numeric type
2099 + assume that the type is always numeric for Han characters,
2100 and that only those are added in addition to what UnicodeData.txt lists
2102 *** Unicode version numbers
2108 - update test of default bidi classes according to PRI #28
2109 /tsutil/cucdtst/TestUnicodeData
2110 http://www.unicode.org/review/resolved-pri.html#pri28
2111 - bidi tests: change exemplar character for ES depending on Unicode version
2112 - change hardcoded expected property values where they change
2120 - use new Hrkt=Katakana_Or_Hiragana
2123 - are now part of combining character sequences
2124 - break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ