1 * Copyright (C) 2016 and later: Unicode, Inc. and others.
2 * License & terms of use: http://www.unicode.org/copyright.html
3 * Copyright (C) 2004-2016, International Business Machines
4 * Corporation and others. All Rights Reserved.
6 * file name: changes.txt
8 * tab size: 8 (not used)
11 * created on: 2004may06
12 * created by: Markus W. Scherer
14 * change log for Unicode updates
16 * For each new Unicode version, during the beta period,
17 * I copy the change log for the previous version to the top of this file.
18 * I adjust the versions, tickets, URLs, and paths.
19 * I work my way through the steps listed in the log, top to bottom,
20 * adjusting the log as necessary.
21 * I report problems to the UTC and/or CLDR and/or ICU.
22 * Before the data is final, I "turn the crank" several more times,
23 * using appropriate subsets of the steps.
25 ---------------------------------------------------------------------------- ***
27 * New ISO 15924 script codes
29 Starting with ICU 55, we no longer add UScriptCode constants for new scripts
30 until they are encoded in Unicode,
31 or can be assumed to be encoded in the next Unicode version.
32 Script enum constant names want to follow the Unicode script property value aliases,
33 which are assigned only when the scripts are encoded.
34 When we add constants early and guess the names wrong, we end up with confusing enum constants,
35 and we have sometimes had to add aliases.
37 Variant script codes like Latf and Aran that are not subject to separate encoding
38 can be added at any time.
39 (For example, Aran could be added as USCRIPT_ARABIC_NASTALIQ.)
41 We add script codes used in CLDR or in the spoof checker.
42 This includes combination/alias codes like Hanb and Jamo.
43 See http://unicode.org/reports/tr35/#unicode_script_subtag_validity
44 and look for "alias" on http://unicode.org/iso15924/iso15924-codes.html
46 We add special Z* script codes like Zsye.
48 For new script codes see http://www.unicode.org/iso15924/codechanges.html
50 ---------------------------------------------------------------------------- ***
52 Unicode 11.0 update for ICU 62
54 http://www.unicode.org/versions/Unicode11.0.0/
55 http://unicode.org/versions/beta-11.0.0.html
56 https://www.unicode.org/review/pri372/
57 http://www.unicode.org/reports/uax-proposed-updates.html
58 http://www.unicode.org/reports/tr44/tr44-21.html
60 * Command-line environment setup
62 UNICODE_DATA=~/unidata/uni11/20180521
63 CLDR_SRC=~/svn.cldr/uni
64 ICU_ROOT=~/svn.icu/uni
65 ICU_SRC=$ICU_ROOT/src
67 ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
68 ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
69 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
73 - ticket:13630: Unicode 11
74 - ^/branches/markus/uni11
78 - cldrbug 10978: Unicode 11
79 - ^/branches/markus/uni11
81 *** Unicode version numbers
84 - com.ibm.icu.util.VersionInfo
85 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
87 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
88 so that the makefiles see the new version number.
90 *** data files & enums & parser code
93 - mkdir -p $UNICODE_DATA
94 - download Unicode files into $UNICODE_DATA
95 + subfolders: emoji, idna, security, ucd, uca
96 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
98 * for manual diffs and for Unicode Tools input data updates:
99 remove version suffixes from the file names
100 ~$ unidata/desuffixucd.py $UNICODE_DATA
101 (see https://sites.google.com/site/unicodetools/inputdata)
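The renaming step above can be sketched as follows. This is not the actual desuffixucd.py; it is a minimal illustration that assumes version suffixes look like "-11.0.0" or "-11.0.0d5" directly before the file extension. The function name and the regular expression are hypothetical.

```python
import re

# Assumed suffix shape: "-<major>.<minor>[.<update>]", optionally followed by
# a draft marker such as "d5", directly before the file extension.
_VERSION_SUFFIX = re.compile(
    r"-[0-9]+(?:\.[0-9]+)+(?:d[0-9]+)?(?=\.[A-Za-z0-9]+$)")


def strip_version_suffix(filename):
    """Return the file name without a Unicode version suffix, if it has one."""
    return _VERSION_SUFFIX.sub("", filename)


if __name__ == "__main__":
    for name in ("UnicodeData-11.0.0d5.txt", "DerivedAge-11.0.0.txt", "ReadMe.txt"):
        print(name, "->", strip_version_suffix(name))
```

Names without a recognizable suffix are returned unchanged, so the function can be applied to a whole folder.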
103 * process and/or copy files
104 - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
105 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
106 + For debugging, and tweaking how ppucd.txt is written,
107 the tool has an --only_ppucd option:
108 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
110 - cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
112 * build ICU (make install)
113 so that the tools build can pick up the new definitions from the installed header files.
115 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
117 * preparseucd.py changes
119 NameError: unknown property Extended_Pictographic
120 -> add Extended_Pictographic binary property
121 -> add new short names for all Emoji properties
123 * new constants for new property values
124 - preparseucd.py error:
125 ValueError: missing uchar.h enum constants for some property values:
126 [(u'blk', set([u'Georgian_Ext', u'Hanifi_Rohingya', u'Medefaidrin', u'Sogdian', u'Makasar',
127 u'Old_Sogdian', u'Dogra', u'Gunjala_Gondi', u'Chess_Symbols', u'Mayan_Numerals',
128 u'Indic_Siyaq_Numbers'])),
129 (u'jg', set([u'Hanifi_Rohingya_Kinna_Ya', u'Hanifi_Rohingya_Pa'])),
130 (u'sc', set([u'Medf', u'Sogd', u'Dogr', u'Rohg', u'Maka', u'Sogo', u'Gong'])),
131 (u'GCB', set([u'LinkC', u'Virama'])),
132 (u'WB', set([u'WSegSpace']))]
133 = PropertyValueAliases.txt new property values (diff old & new .txt files)
134 blk; Chess_Symbols ; Chess_Symbols
136 blk; Georgian_Ext ; Georgian_Extended
137 blk; Gunjala_Gondi ; Gunjala_Gondi
138 blk; Hanifi_Rohingya ; Hanifi_Rohingya
139 blk; Indic_Siyaq_Numbers ; Indic_Siyaq_Numbers
140 blk; Makasar ; Makasar
141 blk; Mayan_Numerals ; Mayan_Numerals
142 blk; Medefaidrin ; Medefaidrin
143 blk; Old_Sogdian ; Old_Sogdian
144 blk; Sogdian ; Sogdian
146 use long property names for enum constants,
147 for the trailing comment get the block start code point: diff old & new Blocks.txt
148 -> add to UCharacter.UnicodeBlock IDs
149 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
150 replace public static final int \1_ID = \2; \3
151 -> add to UCharacter.UnicodeBlock objects
152 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
153 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
155 GCB; LinkC ; LinkingConsonant
157 -> uchar.h & UCharacter.GraphemeClusterBreak
158 -> these two later removed again: http://www.unicode.org/L2/L2018/18115.htm#155-A76
160 InSC; Consonant_Initial_Postfixed ; Consonant_Initial_Postfixed
161 -> ignore: ICU does not yet support this property
163 jg ; Hanifi_Rohingya_Kinna_Ya ; Hanifi_Rohingya_Kinna_Ya
164 jg ; Hanifi_Rohingya_Pa ; Hanifi_Rohingya_Pa
165 -> uchar.h & UCharacter.JoiningGroup
168 sc ; Gong ; Gunjala_Gondi
170 sc ; Medf ; Medefaidrin
171 sc ; Rohg ; Hanifi_Rohingya
173 sc ; Sogo ; Old_Sogdian
174 -> uscript.h & com.ibm.icu.lang.UScript
175 -> Nushu had been added already
176 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
177 and in com.ibm.icu.dev.test.lang.TestUScript.java
179 WB ; WSegSpace ; WSegSpace
180 -> uchar.h & UCharacter.WordBreak
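The "diff old & new .txt files" step for PropertyValueAliases.txt can also be approximated with a small script. The sketch below is an assumption about one convenient way to do it, not part of preparseucd.py: it treats each line as semicolon-separated fields, keeps the first two (property and short value name), and reports pairs that appear only in the new file. Function names are hypothetical.

```python
def parse_property_values(text):
    """Collect (property, first value field) pairs from PropertyValueAliases-style text."""
    pairs = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) >= 2:
            pairs.add((fields[0], fields[1]))
    return pairs


def new_property_values(old_text, new_text):
    """Return the sorted pairs that are present in new_text but not in old_text."""
    return sorted(parse_property_values(new_text) - parse_property_values(old_text))


if __name__ == "__main__":
    old = "WB ; XX ; Other\n"
    new = ("# sample lines only\n"
           "WB ; WSegSpace ; WSegSpace\n"
           "WB ; XX ; Other\n"
           "blk; Chess_Symbols ; Chess_Symbols\n")
    for prop, value in new_property_values(old, new):
        print(prop, value)
```

For ccc lines the second field is the numeric class rather than a short name; the sketch does not special-case that.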
182 * New short names for emoji properties
184 - short names set in preparseucd.py
187 - boolean emoji property Extended_Pictographic
188 -> added in preparseucd.py
189 -> uchar.h & UProperty.java
190 - misc. property Equivalent_Unified_Ideograph (EqUIdeo)
191 as shown in PropertyValueAliases.txt
194 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
195 (not strictly necessary for NOT_ENCODED scripts)
196 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
198 * update spoof checker UnicodeSet initializers:
199 inclusionPat & recommendedPat in uspoof.cpp
200 INCLUSION & RECOMMENDED in SpoofChecker.java
201 - make sure that the Unicode Tools tree contains the latest security data files
202 - go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
203 - update the hardcoded version number there in the DIRECTORY path
204 - run the tool (no special environment variables needed)
205 - copy & paste from the Console output into the .cpp & .java files
207 * generate normalization data files
208 cd $ICU_ROOT/dbg/icu4c
209 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
210 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
211 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
212 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
213 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
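The five gennorm2 invocations above follow one pattern: each output is built from nfc.txt plus the extra input files for that form. A minimal sketch that only assembles the argument lists (it does not run anything) is shown here; the function name and the table are illustrative, and the flags are exactly those used in the commands above.

```python
# Output file -> input files, copied from the commands listed above.
NORM2_OUTPUTS = [
    ("nfc.nrm", ["nfc.txt"]),
    ("nfkc.nrm", ["nfc.txt", "nfkc.txt"]),
    ("nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"]),
    ("uts46.nrm", ["nfc.txt", "uts46.txt"]),
]


def gennorm2_commands(icu_src, data_in, unidata):
    """Build the gennorm2 argument lists; paths are plain strings."""
    norm2_src = unidata + "/norm2"
    commands = [[
        "bin/gennorm2", "-o", icu_src + "/icu4c/source/common/norm2_nfc_data.h",
        "-s", norm2_src, "nfc.txt", "--csource",
    ]]
    for output, inputs in NORM2_OUTPUTS:
        commands.append(
            ["bin/gennorm2", "-o", data_in + "/" + output, "-s", norm2_src] + inputs)
    return commands


if __name__ == "__main__":
    for command in gennorm2_commands("$ICU_SRC", "$ICU4C_DATA_IN", "$ICU4C_UNIDATA"):
        print(" ".join(command))
```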
215 * build ICU (make install)
216 so that the tools build can pick up the new definitions from the installed header files.
218 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
220 * build Unicode tools using CMake+make
222 $ICU_SRC/tools/unicode/c/icudefs.txt:
224 # Location (--prefix) of where ICU was installed.
225 set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
226 # Location of the ICU4C source tree.
227 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c)
230 mkdir -p tools/unicode/c
233 $ICU_ROOT/dbg/tools/unicode/c$
234 cmake ../../../../src/tools/unicode/c
237 * generate core properties data files
238 $ICU_ROOT/dbg/tools/unicode/c$
239 genprops/genprops $ICU_SRC/icu4c
240 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
241 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
242 - rebuild ICU (make install) & tools
245 genprops error: casepropsbuilder: too many exceptions words
246 genprops error: failure finalizing the data - U_BUFFER_OVERFLOW_ERROR
247 - With the addition of Georgian Mtavruli capital letters,
248 there are now too many simple case mappings with big mapping deltas
249 that yield uncompressible exceptions.
250 - Changing the data structure (now formatVersion 4),
251 adding one bit for no-simple-case-folding (for Cherokee), and
252 one optional slot for a big delta (for most faraway mappings),
253 together with another bit for whether that is negative.
254 This makes most Cherokee & Georgian etc. case mappings compressible,
255 reducing the number of exceptions words.
256 - Further changes to gain one more bit for the exceptions index,
257 for future growth. For details, see casepropsbuilder.cpp.
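The "big delta plus a bit for whether it is negative" idea can be illustrated in isolation. The sketch below is not the ucase.icu data layout: the small-delta range is an arbitrary assumption, and the dictionary fields are hypothetical stand-ins for the optional slot and the sign bit. The sample value is the Georgian Mtavruli mapping U+1C90 -> U+10D0.

```python
# Assumption for illustration only: deltas in this range are stored directly;
# the real limits are defined by the data format, see casepropsbuilder.cpp.
SMALL_DELTA_MIN, SMALL_DELTA_MAX = -64, 63


def encode_delta(delta):
    """Store small deltas inline, large ones as unsigned magnitude plus a sign bit."""
    if SMALL_DELTA_MIN <= delta <= SMALL_DELTA_MAX:
        return {"kind": "inline", "delta": delta}
    return {"kind": "exception", "big_delta": abs(delta), "is_negative": delta < 0}


def decode_delta(encoded):
    """Recover the signed delta from either representation."""
    if encoded["kind"] == "inline":
        return encoded["delta"]
    magnitude = encoded["big_delta"]
    return -magnitude if encoded["is_negative"] else magnitude


if __name__ == "__main__":
    mtavruli_an, mkhedruli_an = 0x1C90, 0x10D0
    encoded = encode_delta(mkhedruli_an - mtavruli_an)
    print(encoded, hex(mtavruli_an + decode_delta(encoded)))
```

Keeping the magnitude unsigned means one slot covers faraway mappings in both directions, at the cost of a single extra bit.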
259 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
260 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
261 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
262 - Unicode 6.0..11.0: U+2260, U+226E, U+226F
263 - nothing new in this Unicode version, no test file to update
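The grep step above can be scripted so that only non-ASCII code points are reported. This is a minimal sketch that assumes the usual "code point or range ; status # comment" line shape of IdnaMappingTable.txt; the function name and the sample lines are illustrative, not copied from the data file.

```python
def non_ascii_disallowed_std3_valid(lines):
    """Return non-ASCII code points whose status field is disallowed_STD3_valid."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0]  # drop the trailing comment
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        # Skip the ASCII part of any range.
        found.extend(range(max(start, 0x80), end + 1))
    return found


if __name__ == "__main__":
    sample = [
        "0000..002C ; disallowed_STD3_valid # sample ASCII range",
        "2260 ; disallowed_STD3_valid # sample single code point",
        "226E..226F ; disallowed_STD3_valid # sample range",
        "00C0 ; mapped ; 00E0 # sample mapped line",
    ]
    print(["U+%04X" % cp for cp in non_ascii_disallowed_std3_valid(sample)])
```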
265 * run & fix ICU4C tests
266 - Andy handles RBBI & spoof check test failures
268 - Errors in char.txt, word.txt, word_POSIX.txt like
269 createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET" at line 46, column 16
270 because \p{Grapheme_Cluster_Break = EBG} and \p{Word_Break = EBG} are empty.
271 -> Temporary(!) workaround: Add an arbitrary code point to these sets to make them
272 not empty, just to get ICU building.
273 -> Intermediate workaround: Remove $E_Base_GAZ and other now-unused variables
274 and properties together with the rules that used them (GB 10, WB 14).
275 -> Andy adjusts the rule sets further to sync with
276 Unicode 11 grapheme, word, and line break spec changes.
278 * collation: CLDR collation root, UCA DUCET
280 - UCA DUCET goes into Mark's Unicode tools, see
281 https://sites.google.com/site/unicodetools/home#TOC-UCA
282 diff the main mapping file, look for bad changes
283 (for example, more bytes per weight for common characters)
284 ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/uca/11.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-11.txt
285 ~/svn.unitools/trunk$ meld ../frac-10.txt ../frac-11.txt
287 - CLDR root data files are checked into $CLDR_SRC/common/uca/
288 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
290 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
291 cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
292 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
293 cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
294 (note removing the underscore before "Rules")
295 cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
296 - restore TODO diffs in UCARules.txt
297 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
298 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
299 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
300 from the CLDR root files (..._CLDR_..._SHORT.txt)
301 cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
302 cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
303 cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
304 - if CLDR common/uca/unihan-index.txt changes, then update
305 CLDR common/collation/root.xml <collation type="private-unihan">
306 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
308 - run genuca, see command line above;
310 Error: Unknown script for first-primary sample character U+1180B on line 28649 of /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
311 FDD1 1180B; [71 CC 02, 05, 05] # Dogra first primary (compressible)
312 (add the character to genuca.cpp sampleCharsToScripts[])
313 + look up the USCRIPT_ code for the new sample characters
314 (should be obvious from the comment in the error output)
315 + *add* mappings to sampleCharsToScripts[], do not replace them
316 (in case the script sample characters flip-flop)
317 + insert new scripts in DUCET script order, see the top_byte table
318 at the beginning of FractionalUCA.txt
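When several scripts are new, it can help to pull the sample character and the script name out of each first-primary line before editing sampleCharsToScripts[] by hand. The sketch below is a hypothetical helper, not part of genuca; it assumes the line shape shown in the error output above.

```python
import re

# Matches lines like:
# FDD1 1180B; [71 CC 02, 05, 05] # Dogra first primary (compressible)
_FIRST_PRIMARY = re.compile(
    r"^FDD1 ([0-9A-F]{4,6}); \[[^\]]*\] # (\S+) first primary")


def parse_first_primary(line):
    """Return (sample code point, script name), or None if the line does not match."""
    match = _FIRST_PRIMARY.match(line.strip())
    if match is None:
        return None
    return int(match.group(1), 16), match.group(2)


if __name__ == "__main__":
    line = "FDD1 1180B; [71 CC 02, 05, 05] # Dogra first primary (compressible)"
    code_point, script = parse_first_primary(line)
    print("U+%04X %s" % (code_point, script))
```

The USCRIPT_ constant still has to be looked up manually from the script name.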
322 https://sites.google.com/site/unicodetools/unihan
324 org.unicode.draft.GenerateUnihanCollators
327 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
328 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
329 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
330 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
333 org.unicode.draft.GenerateUnihanCollatorFiles
334 with the same arguments
337 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
338 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
341 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
342 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
343 - run CLDR unit tests, commit to CLDR
344 - generate ICU zh collation data: run CLDR
345 org.unicode.cldr.icu.NewLdml2IcuConverter
346 with program arguments
348 -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
349 -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
350 -d /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/coll
351 -p /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/xml/collation
355 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
358 * run & fix ICU4C tests, now with new CLDR collation root data
359 - run all tests with the collation test data *_SHORT.txt or the full files
360 (the full ones have comments, useful for debugging)
361 - note on intltest: if collate/UCAConformanceTest fails, then
362 utility/MultithreadTest/TestCollators will fail as well;
363 fix the conformance test before looking into the multi-thread test
365 * update Java data files
366 - refresh just the UCD/UCA-related/derived files, just to be safe
367 - see (ICU4C)/source/data/icu4j-readme.txt
368 - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
369 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
372 Unicode .icu files built to ./out/build/icudt61l
373 echo timestamp > uni-core-data
374 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt61b
375 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b
376 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
377 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt61l.dat ./out/icu4j/icudt61b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt61l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt61b
378 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b"
379 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt61b/
380 mkdir -p /tmp/icu4j/main/shared/data
381 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
382 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt61b/
383 mkdir -p /tmp/icu4j/main/shared/data
384 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
385 make[1]: Leaving directory '/usr/local/google/home/mscherer/svn.icu/uni/dbg/icu4c/data'
386 - copy the big-endian Unicode data files to another location,
387 separate from the other data files,
388 and then refresh ICU4J
389 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
390 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
391 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
392 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
393 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
394 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
395 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
396 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
397 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
398 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
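The cp/rm sequence above amounts to a file selection rule: confusables.cfu, all top-level *.icu files except cnvalias.icu, all top-level *.nrm files, and the direct contents of coll/ and brkitr/. A minimal sketch of that rule as a predicate follows; the function name is hypothetical, and paths are taken relative to com/ibm/icu/impl/data/$ICUDT.

```python
import fnmatch


def is_unicode_data_file(rel_path):
    """Return True if the file is one of those copied by the commands above."""
    parts = rel_path.split("/")
    if len(parts) == 2:
        # Direct children of coll/ and brkitr/ are copied wholesale.
        return parts[0] in ("coll", "brkitr")
    if len(parts) != 1:
        return False
    if rel_path == "cnvalias.icu":
        return False  # removed again after the *.icu copy
    return (rel_path == "confusables.cfu"
            or fnmatch.fnmatchcase(rel_path, "*.icu")
            or fnmatch.fnmatchcase(rel_path, "*.nrm"))


if __name__ == "__main__":
    for name in ("uprops.icu", "cnvalias.icu", "nfc.nrm", "coll/ucadata.icu", "zoneinfo64.res"):
        print(name, is_unicode_data_file(name))
```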
400 * When refreshing all of ICU4J data from ICU4C
401 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
402 - cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
404 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
406 * update CollationFCD.java
407 + copy & paste the initializers of lcccIndex[] etc. from
408 ICU4C/source/i18n/collationfcd.cpp to
409 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
411 * refresh Java test .txt files
412 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
413 cd $ICU_SRC/icu4c/source/data/unidata
414 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
415 cd ../../test/testdata
416 cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
417 cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
419 * run & fix ICU4J tests
422 - send notice to icu-design about new born-@stable API (enum constants etc.)
424 *** CLDR numbering systems
425 - look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
426 Unicode 11: using Unicode 11 CLDR ticket #10978
427 rohg 10D30..10D39 Hanifi_Rohingya
428 gong 11DA0..11DA9 Gunjala_Gondi
429 Earlier: CLDR tickets specific to adding new numbering systems.
430 Unicode 10: http://unicode.org/cldr/trac/ticket/10219
431 Unicode 9: http://unicode.org/cldr/trac/ticket/9692
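A rough way to list sets of decimal digits is sketched below with Python's unicodedata module. It is an illustration, not the query used for CLDR: it looks for runs of ten consecutive gc=Nd code points with decimal values 0..9, and it only sees the Unicode version built into the Python interpreter, so new sets show up only once Python has caught up.

```python
import sys
import unicodedata


def _is_decimal(code_point, value):
    ch = chr(code_point)
    return (unicodedata.category(ch) == "Nd"
            and unicodedata.decimal(ch, None) == value)


def decimal_digit_sets():
    """Return (first, last) code point pairs for runs of digits 0..9."""
    sets = []
    cp = 0
    while cp + 9 <= sys.maxunicode:
        if all(_is_decimal(cp + i, i) for i in range(10)):
            sets.append((cp, cp + 9))
            cp += 10
        else:
            cp += 1
    return sets


if __name__ == "__main__":
    print("Unicode", unicodedata.unidata_version)
    for first, last in decimal_digit_sets():
        print("%04X..%04X" % (first, last))
```

Comparing the output of two Python versions with different Unicode data gives the candidate ranges to check against CLDR numbering systems.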
433 *** merge the Unicode update branches back onto the trunk
434 - do not merge the icudata.jar and testdata.jar,
435 instead rebuild them from merged & tested ICU4C
436 - make sure that changes to Unicode tools are checked in:
437 http://www.unicode.org/utility/trac/log/trunk/unicodetools
439 ---------------------------------------------------------------------------- ***
441 Unicode 10.0 update for ICU 60
443 http://www.unicode.org/versions/Unicode10.0.0/
444 http://www.unicode.org/versions/beta-10.0.0.html
445 http://blog.unicode.org/2017/03/unicode-100-beta-review.html
446 http://www.unicode.org/review/pri350/
447 http://www.unicode.org/reports/uax-proposed-updates.html
448 http://www.unicode.org/reports/tr44/tr44-19.html
450 * Command-line environment setup
452 UNICODE_DATA=~/unidata/uni10/20170605
453 CLDR_SRC=~/svn.cldr/uni10
454 ICU_ROOT=~/svn.icu/uni10
455 ICU_SRC=$ICU_ROOT/src
457 ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
458 ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
459 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
463 - ticket:12985: Unicode 10
464 - ticket:13061: undo hacks from emoji 5.0 update
465 - ticket:13062: add Emoji_Component property
466 - ^/branches/markus/uni10
470 - cldrbug 10055: Unicode 10
471 - cldrbug 9882: Unicode 10 script metadata
472 - cldrbug 10219: numbering systems for Unicode 10
474 *** Unicode version numbers
477 - com.ibm.icu.util.VersionInfo
478 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
480 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
481 so that the makefiles see the new version number.
483 *** data files & enums & parser code
486 - mkdir -p $UNICODE_DATA
487 - download Unicode 10.0 files into $UNICODE_DATA
488 + subfolders: ucd, uca, idna, security
489 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
490 - download emoji 5.0 files into $UNICODE_DATA/emoji
492 * for manual diffs: remove version suffixes from the file names
493 ~$ unidata/desuffixucd.py $UNICODE_DATA
494 (see https://sites.google.com/site/unicodetools/inputdata)
496 * process and/or copy files
497 - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
498 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
499 + For debugging, and tweaking how ppucd.txt is written,
500 the tool has an --only_ppucd option:
501 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
503 - cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
505 * build ICU (make install)
506 so that the tools build can pick up the new definitions from the installed header files.
508 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
510 * preparseucd.py changes
511 - remove or add new Unicode scripts from/to the
512 only-in-ISO-15924 list according to the error messages:
513 ValueError: remove ['Nshu'] from _scripts_only_in_iso15924
514 -> adjust _scripts_only_in_iso15924 as indicated
516 Exception: no default values (@missing lines) for some Catalog or Enumerated properties: [u'vo']
517 -> add vo=Vertical_Orientation to _ignored_properties
518 -> later removed again, parsing the file, even though we do not yet store data for runtime use
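The first check above is a consistency test between two lists: a script code must not be listed as only-in-ISO-15924 once Unicode encodes it. A minimal sketch of such a check is shown here; it is not the code from preparseucd.py, the function names are hypothetical, and only the error message text follows the one quoted above.

```python
def scripts_to_remove(only_in_iso15924, unicode_scripts):
    """Return the codes that are listed as ISO-only but are now Unicode scripts."""
    return sorted(set(only_in_iso15924) & set(unicode_scripts))


def check_scripts_only_in_iso15924(only_in_iso15924, unicode_scripts):
    """Raise ValueError naming the codes that must be dropped from the ISO-only list."""
    stale = scripts_to_remove(only_in_iso15924, unicode_scripts)
    if stale:
        raise ValueError("remove %s from _scripts_only_in_iso15924" % stale)


if __name__ == "__main__":
    try:
        check_scripts_only_in_iso15924(["Aran", "Nshu"], ["Latn", "Nshu"])
    except ValueError as error:
        print("ValueError:", error)
```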
520 * new constants for new property values
521 - preparseucd.py error:
522 ValueError: missing uchar.h enum constants for some property values:
523 [(u'blk', set([u'Zanabazar_Square', u'Nushu', u'CJK_Ext_F',
524 u'Kana_Ext_A', u'Syriac_Sup', u'Masaram_Gondi', u'Soyombo'])),
525 (u'jg', set([u'Malayalam_Bha', u'Malayalam_Llla', u'Malayalam_Nya', u'Malayalam_Lla',
526 u'Malayalam_Nga', u'Malayalam_Ssa', u'Malayalam_Tta', u'Malayalam_Ra',
527 u'Malayalam_Nna', u'Malayalam_Ja', u'Malayalam_Nnna'])),
528 (u'sc', set([u'Soyo', u'Gonm', u'Zanb']))]
529 = PropertyValueAliases.txt new property values (diff old & new .txt files)
530 blk; CJK_Ext_F ; CJK_Unified_Ideographs_Extension_F
531 blk; Kana_Ext_A ; Kana_Extended_A
532 blk; Masaram_Gondi ; Masaram_Gondi
534 blk; Soyombo ; Soyombo
535 blk; Syriac_Sup ; Syriac_Supplement
536 blk; Zanabazar_Square ; Zanabazar_Square
538 use long property names for enum constants,
539 for the trailing comment get the block start code point: diff old & new Blocks.txt
540 -> add to UCharacter.UnicodeBlock IDs
541 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
542 replace public static final int \1_ID = \2; \3
543 -> add to UCharacter.UnicodeBlock objects
544 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
545 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
547 jg ; Malayalam_Bha ; Malayalam_Bha
548 jg ; Malayalam_Ja ; Malayalam_Ja
549 jg ; Malayalam_Lla ; Malayalam_Lla
550 jg ; Malayalam_Llla ; Malayalam_Llla
551 jg ; Malayalam_Nga ; Malayalam_Nga
552 jg ; Malayalam_Nna ; Malayalam_Nna
553 jg ; Malayalam_Nnna ; Malayalam_Nnna
554 jg ; Malayalam_Nya ; Malayalam_Nya
555 jg ; Malayalam_Ra ; Malayalam_Ra
556 jg ; Malayalam_Ssa ; Malayalam_Ssa
557 jg ; Malayalam_Tta ; Malayalam_Tta
558 -> uchar.h & UCharacter.JoiningGroup
560 sc ; Gonm ; Masaram_Gondi
563 sc ; Zanb ; Zanabazar_Square
564 -> uscript.h & com.ibm.icu.lang.UScript
565 -> Nushu had been added already
566 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
567 and in com.ibm.icu.dev.test.lang.TestUScript.java
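The two Eclipse find/replace patterns for the UnicodeBlock IDs and objects can be tried outside the IDE. The sketch below applies the same patterns with Python's re module; the sample input line mimics a uchar.h enum entry, and its numeric value is illustrative only.

```python
import re

# Patterns and replacements as given in the Eclipse steps above.
ID_FIND = r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)"
ID_REPLACE = r"public static final int \1_ID = \2; \3"
OBJECT_FIND = r"UBLOCK_([^ ]+) = [0-9]+, (/.+)"
OBJECT_REPLACE = (
    r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2')


def to_java_id(line):
    """Turn a uchar.h UBLOCK_ enum line into a UnicodeBlock ID constant."""
    return re.sub(ID_FIND, ID_REPLACE, line)


def to_java_object(line):
    """Turn a uchar.h UBLOCK_ enum line into a UnicodeBlock object declaration."""
    return re.sub(OBJECT_FIND, OBJECT_REPLACE, line)


if __name__ == "__main__":
    sample = "UBLOCK_NUSHU = 277, /*[1B170]*/"  # numeric value is illustrative
    print(to_java_id(sample))
    print(to_java_object(sample))
```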
569 * New properties as shown in PropertyValueAliases.txt changes
570 - boolean Emoji_Component from emoji 5
571 -> uchar.h & UProperty.java
573 # Regional_Indicator (RI)
575 RI ; N ; No ; F ; False
576 RI ; Y ; Yes ; T ; True
577 -> uchar.h & UProperty.java
578 -> single immutable range, to be hardcoded
580 # Prepended_Concatenation_Mark (PCM)
582 PCM; N ; No ; F ; False
583 PCM; Y ; Yes ; T ; True
584 -> was new in Unicode 9
585 -> uchar.h & UProperty.java
587 # Vertical_Orientation (vo)
590 vo ; Tr ; Transformed_Rotated
591 vo ; Tu ; Transformed_Upright
593 -> only pre-parsed for now, but not yet stored for runtime use
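"Single immutable range, to be hardcoded" for Regional_Indicator means the property needs no data table: it is true exactly for the regional indicator symbols U+1F1E6..U+1F1FF. A minimal sketch of such a hardcoded check follows; the function name is hypothetical and this is not the ICU implementation.

```python
REGIONAL_INDICATOR_START = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
REGIONAL_INDICATOR_END = 0x1F1FF    # REGIONAL INDICATOR SYMBOL LETTER Z


def is_regional_indicator(code_point):
    """Return True for code points with Regional_Indicator=Yes."""
    return REGIONAL_INDICATOR_START <= code_point <= REGIONAL_INDICATOR_END


if __name__ == "__main__":
    for cp in (0x1F1E5, 0x1F1E6, 0x1F1FF, 0x1F200):
        print("U+%04X" % cp, is_regional_indicator(cp))
```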
595 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
596 (not strictly necessary for NOT_ENCODED scripts)
597 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
599 * generate normalization data files
600 cd $ICU_ROOT/dbg/icu4c
601 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
602 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
603 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
604 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
605 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
607 * build ICU (make install)
608 so that the tools build can pick up the new definitions from the installed header files.
610 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
612 * build Unicode tools using CMake+make
614 $ICU_SRC/tools/unicode/c/icudefs.txt:
616 # Location (--prefix) of where ICU was installed.
617 set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
618 # Location of the ICU4C source tree.
619 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c)
621 $ICU_ROOT/dbg/tools/unicode/c$
622 cmake ../../../../src/tools/unicode/c
625 * generate core properties data files
626 $ICU_ROOT/dbg/tools/unicode/c$
627 genprops/genprops $ICU_SRC/icu4c
628 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
629 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
630 - rebuild ICU (make install) & tools
632 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
633 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
634 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
635 - Unicode 6.0..10.0: U+2260, U+226E, U+226F
636 - nothing new in this Unicode version, no test file to update
638 * run & fix ICU4C tests
639 - Andy handles RBBI & spoof check test failures
641 * collation: CLDR collation root, UCA DUCET
643 - UCA DUCET goes into Mark's Unicode tools, see
644 https://sites.google.com/site/unicodetools/home#TOC-UCA
645 - CLDR root data files are checked into $CLDR_SRC/common/uca/
646 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
648 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
649 cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
650 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
651 cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
652 (note removing the underscore before "Rules")
653 cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
654 - restore TODO diffs in UCARules.txt
655 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
656 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
657 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
658 from the CLDR root files (..._CLDR_..._SHORT.txt)
659 cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
660 cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
661 cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
662 - if CLDR common/uca/unihan-index.txt changes, then update
663 CLDR common/collation/root.xml <collation type="private-unihan">
664 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
666 - run genuca, see command line above;
668 Error: Unknown script for first-primary sample character U+11D10 on line 28117 of /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/unidata/FractionalUCA.txt:
669 FDD1 11D10; [70 D5 02, 05, 05] # Masaram_Gondi first primary (compressible)
670 (add the character to genuca.cpp sampleCharsToScripts[])
671 + look up the USCRIPT_ code for the new sample characters
672 (should be obvious from the comment in the error output)
673 + *add* mappings to sampleCharsToScripts[], do not replace them
674 (in case the script sample characters flip-flop)
675 + insert new scripts in DUCET script order, see the top_byte table
676 at the beginning of FractionalUCA.txt
680 https://sites.google.com/site/unicodetools/unihan
682 org.unicode.draft.GenerateUnihanCollators
685 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
686 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
687 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
688 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
691 org.unicode.draft.GenerateUnihanCollatorFiles
692 with the same arguments
695 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
696 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
699 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
700 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
701 - run CLDR unit tests, commit to CLDR
702 - generate ICU zh collation data: run CLDR
703 org.unicode.cldr.icu.NewLdml2IcuConverter
704 with program arguments
706 -s /usr/local/google/home/mscherer/svn.cldr/uni10/common/collation
707 -m /usr/local/google/home/mscherer/svn.cldr/uni10/common/supplemental
708 -d /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/coll
709 -p /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/xml/collation
713 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
716 * run & fix ICU4C tests, now with new CLDR collation root data
717 - run all tests with the collation test data *_SHORT.txt or the full files
718 (the full ones have comments, useful for debugging)
719 - note on intltest: if collate/UCAConformanceTest fails, then
720 utility/MultithreadTest/TestCollators will fail as well;
721 fix the conformance test before looking into the multi-thread test
723 * update Java data files
724 - refresh just the UCD/UCA-related/derived files, just to be safe
725 - see (ICU4C)/source/data/icu4j-readme.txt
726 - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
727 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
730 Unicode .icu files built to ./out/build/icudt60l
731 echo timestamp > uni-core-data
732 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt60b
733 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b
734 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
735 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt60l.dat ./out/icu4j/icudt60b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt60l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt60b
736 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b"
737 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt60b/
738 mkdir -p /tmp/icu4j/main/shared/data
739 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
740 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt60b/
741 mkdir -p /tmp/icu4j/main/shared/data
742 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
743 make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/uni10/dbg/icu4c/data'
744 - copy the big-endian Unicode data files to another location,
745 separate from the other data files,
746 and then refresh ICU4J
747 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
748 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
749 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
750 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
751 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
752 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
753 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
754 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
755 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
756 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
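
The selective copy above (confusables.cfu, *.icu without cnvalias.icu, *.nrm, coll/*, brkitr/*) can be sketched in Python. This is only an illustration, not part of the ICU build: the function name copy_unicode_data and its parameters are hypothetical, and the file patterns are taken from the commands above.

```python
import shutil
from pathlib import Path


def copy_unicode_data(src_root: Path, dst_root: Path, icudt: str) -> list[str]:
    """Copy only the Unicode-related data files for one data version.

    src_root and dst_root are the folders that contain com/ibm/icu/...;
    icudt is the data folder name, e.g. "icudt60b" (hypothetical parameter).
    Returns the copied paths relative to the data folder, sorted.
    """
    rel = Path("com/ibm/icu/impl/data") / icudt
    src, dst = src_root / rel, dst_root / rel
    copied = []
    # Top-level files: confusables.cfu, *.icu (except cnvalias.icu), *.nrm
    top = [src / "confusables.cfu", *src.glob("*.icu"), *src.glob("*.nrm")]
    for path in top:
        if path.name == "cnvalias.icu" or not path.is_file():
            continue
        dst.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, dst / path.name)
        copied.append(path.name)
    # Subfolders whose files are all copied: coll/* and brkitr/*
    for sub in ("coll", "brkitr"):
        for path in (src / sub).glob("*"):
            if path.is_file():
                (dst / sub).mkdir(parents=True, exist_ok=True)
                shutil.copy2(path, dst / sub / path.name)
                copied.append(f"{sub}/{path.name}")
    return sorted(copied)
```

Skipping cnvalias.icu during the copy has the same effect as the cp followed by rm above; other files in the data folder are left behind.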
758 * When refreshing all of ICU4J data from ICU4C
759 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
760 - cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
762 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
764 * update CollationFCD.java
765 + copy & paste the initializers of lcccIndex[] etc. from
766 ICU4C/source/i18n/collationfcd.cpp to
767 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
769 * refresh Java test .txt files
770 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
771 cd $ICU_SRC/icu4c/source/data/unidata
772 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
773 cd ../../test/testdata
774 cp BidiCharacterTest.txt BidiTest.txt IdnaTest.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
775 cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
777 * run & fix ICU4J tests
780 - send notice to icu-design about new born-@stable API (enum constants etc.)
782 *** CLDR numbering systems
783 - look for new sets of decimal digits (gc=ND & nv=4) and submit a CLDR ticket
784 Unicode 10: http://unicode.org/cldr/trac/ticket/10219
785 Unicode 9: http://unicode.org/cldr/trac/ticket/9692
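
One way to list the sets of decimal digits is to look for the single character in each set that has gc=Nd and numeric value 4. The sketch below does this with Python's unicodedata module. It is an assumption that this is good enough for a first look: the module reflects the Unicode version bundled with the Python interpreter, not the beta UCD files, so for a new Unicode version the same filter would have to be applied to the new data files instead.

```python
import sys
import unicodedata


def decimal_digit_fours() -> list[int]:
    """Return code points with gc=Nd and decimal value 4.

    Each set of decimal digits contains exactly one such character, so the
    result has one entry per digit set known to this Python's Unicode data.
    """
    return [
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Nd"
        and unicodedata.decimal(chr(cp), None) == 4
    ]


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    for cp in decimal_digit_fours():
        print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
```

Comparing the output of two Python versions with different unidata_version values shows which digit sets are new.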
787 *** merge the Unicode update branches back onto the trunk
788 - do not merge icudata.jar and testdata.jar;
789 instead, rebuild them from the merged & tested ICU4C
790 - make sure that changes to Unicode tools are checked in:
791 http://www.unicode.org/utility/trac/log/trunk/unicodetools
793 ---------------------------------------------------------------------------- ***
795 Emoji 5.0 update for ICU 59
796 - ICU 59 mostly remains on Unicode 9.0
797 - except that bidi and segmentation data are updated to Unicode 10 beta
799 First run of tools on combined icu4c/icu4j/tools trunk after svn repository reorg.
801 * Command-line environment setup
803 ICU_ROOT=~/svn.icu/trunk
804 ICU_SRC_DIR=$ICU_ROOT/src
805 ICU4C_SRC_DIR=$ICU_SRC_DIR/icu4c
807 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
808 SRC_DATA_IN=$ICU4C_SRC_DIR/source/data/in
809 UNIDATA=$ICU4C_SRC_DIR/source/data/unidata
813 - ticket:12900: take Emoji 5.0 properties data into ICU 59 once it's released
814 - changes directly on trunk
816 *** data files & enums & parser code
820 - download Unicode 9.0 files into a uni90e50 folder: ucd, idna, security (skip uca)
821 - download emoji 5.0 beta files into the same uni90e50 folder
822 - download Unicode 10.0 beta files: ucd
823 + copy Unicode 10 bidi files to the uni90e50/ucd folder:
825 BidiCharacterTest.txt
828 extracted/DerivedBidiClass.txt
829 + copy Unicode 10 segmentation files to the uni90e50/ucd folder:
833 * preparseucd.py changes
834 - adjust for combined trunks
835 - write new copyright lines
836 - ignore new Emoji_Component property for now
838 * process and/or copy files
839 - ~/svn.icu/trunk/src/tools/unicode$ py/preparseucd.py ~/unidata/uni90e50/20170322 $ICU_SRC_DIR
840 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
842 - cp ~/unidata/uni90e50/20170322/security/confusables.txt $UNIDATA
844 * build ICU (make install)
845 so that the tools build can pick up the new definitions from the installed header files.
847 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
849 * build Unicode tools using CMake+make
851 ~/svn.icu/trunk/src/tools/unicode/c/icudefs.txt:
853 # Location (--prefix) of where ICU was installed.
854 set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
855 # Location of the ICU4C source tree.
856 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/trunk/src/icu4c)
858 ~/svn.icu/trunk/dbg/tools/unicode/c$
859 cmake ../../../../src/tools/unicode/c
862 * generate core properties data files
863 ~/svn.icu/trunk/dbg/tools/unicode/c$
864 genprops/genprops $ICU4C_SRC_DIR
865 - rebuild ICU (make install) & tools
867 * run & fix ICU4C tests
868 - Andy handles RBBI & spoof check test failures
870 * update Java data files
871 - refresh just the UCD/UCA-related/derived files, just to be safe
872 - see (ICU4C)/source/data/icu4j-readme.txt
874 - ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
877 Unicode .icu files built to ./out/build/icudt59l
878 echo timestamp > uni-core-data
879 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt59b
880 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b
881 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
882 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt59l.dat ./out/icu4j/icudt59b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt59l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt59b
883 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b"
884 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt59b/
885 mkdir -p /tmp/icu4j/main/shared/data
886 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
887 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt59b/
888 mkdir -p /tmp/icu4j/main/shared/data
889 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
890 make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/trunk/dbg/icu4c/data'
891 - copy the big-endian Unicode data files to another location,
892 separate from the other data files,
893 and then refresh ICU4J
894 cd ~/svn.icu/trunk/dbg/icu4c/data/out/icu4j
895 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
896 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
897 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
898 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
899 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
900 jar uvf ~/svn.icu/trunk/src/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
902 * When refreshing all of ICU4J data from ICU4C
903 - ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
904 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu/trunk/src/icu4j/main/shared/data
906 - ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=~/svn.icu/trunk/src/icu4j icu4j-data-install
908 * refresh Java test .txt files
909 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
910 cd $ICU4C_SRC_DIR/source/data/unidata
911 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
912 cd ../../test/testdata
913 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
914 cp ~/unidata/uni90e50/20170322/ucd/CompositionExclusions.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
916 * run & fix ICU4J tests
918 ---------------------------------------------------------------------------- ***
920 Unicode 9.0 update for ICU 58
922 * Command-line environment setup
924 ICU_ROOT=~/svn.icu/trunk
925 ICU_SRC_DIR=$ICU_ROOT/src
927 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
928 SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
929 UNIDATA=$ICU_SRC_DIR/source/data/unidata
931 http://www.unicode.org/review/pri323/ -- beta review
932 http://www.unicode.org/reports/uax-proposed-updates.html
933 http://www.unicode.org/versions/beta-9.0.0.html
934 http://www.unicode.org/versions/Unicode9.0.0/
935 http://www.unicode.org/reports/tr44/tr44-17.html
939 - ticket:12526: integrate Unicode 9
940 - C++ ^/icu/branches/markus/uni90, ^/icu/branches/markus/uni90b
941 - Java ^/icu4j/branches/markus/uni90, ^/icu4j/branches/markus/uni90b
945 - cldrbug 9414: UCA 9
946 - ^/branches/markus/uni90 at r11518 from trunk at r11517
948 - cldrbug 8745: Unicode 9.0 script metadata
950 *** Unicode version numbers
953 - com.ibm.icu.util.VersionInfo
954 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
956 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
957 so that the makefiles see the new version number.
959 *** data files & enums & parser code
963 - download UCD & IDNA files
964 - make sure that the Unicode data folder passed into preparseucd.py
965 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
966 - only for manual diffs: remove version suffixes from the file names
967 ~/unidata/uni70/20140403$ ../../desuffixucd.py .
968 (see https://sites.google.com/site/unicodetools/inputdata)
969 - only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
970 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni90/20160603 $ICU_SRC_DIR ~/svn.icutools/trunk/src
971 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
973 - also: from http://unicode.org/Public/security/9.0.0/ download new confusables.txt
975 cp ~/unidata/uni90/20160603/security/confusables.txt $UNIDATA
977 * preparseucd.py changes
978 - remove or add new Unicode scripts from/to the
979 only-in-ISO-15924 list according to the error messages:
980 ValueError: remove ['Tang'] from _scripts_only_in_iso15924
981 ValueError: sc = Hanb (uchar.h USCRIPT_HAN_WITH_BOPOMOFO) not in the UCD
982 ValueError: sc = Jamo (uchar.h USCRIPT_JAMO) not in the UCD
983 ValueError: sc = Zsye (uchar.h USCRIPT_SYMBOLS_EMOJI) not in the UCD
984 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
985 and in com.ibm.icu.dev.test.lang.TestUScript.java
986 - DerivedNumericValues.txt new numeric values
987 0D58 ; 0.00625 ; ; 1/160 # No MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH
988 0D59 ; 0.025 ; ; 1/40 # No MALAYALAM FRACTION ONE FORTIETH
989 0D5A ; 0.0375 ; ; 3/80 # No MALAYALAM FRACTION THREE EIGHTIETHS
990 0D5B ; 0.05 ; ; 1/20 # No MALAYALAM FRACTION ONE TWENTIETH
991 0D5D ; 0.15 ; ; 3/20 # No MALAYALAM FRACTION THREE TWENTIETHS
992 -> change uprops.h, corepropsbuilder.cpp/encodeNumericValue(),
993 uchar.c, UCharacterProperty.java
994 to support a new series of values
995 - adjust preparseucd.py for Tangut algorithmic names
997 algnamesrange;17000..187EC;han;CJK UNIFIED IDEOGRAPH-
999 algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH-
1000 - avoid block-compressing most String/Miscellaneous property values,
1001 triggered by genprops not coping with a multi-code point Case_Folding on
1002 block;1C80..1C8F;...;Cased;cf=0442;CWCF;...
1003 keep block-compressing empty-string mappings NFKC_CF="" for tags and variation selectors
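
The effect of the Tangut algnamesrange line above can be sketched as follows. This assumes that a range of type "han" yields the name prefix followed by the uppercase hexadecimal code point; the function names are hypothetical and are not taken from preparseucd.py.

```python
def parse_algnamesrange(line: str) -> tuple[int, int, str, str]:
    """Split 'algnamesrange;START..END;TYPE;PREFIX' into its fields."""
    keyword, cp_range, name_type, prefix = line.split(";", 3)
    if keyword != "algnamesrange":
        raise ValueError(f"not an algnamesrange line: {line!r}")
    start, end = (int(s, 16) for s in cp_range.split(".."))
    return start, end, name_type, prefix


def algorithmic_name(line: str, cp: int) -> str:
    """Return PREFIX + uppercase hex code point for a 'han'-type range."""
    start, end, name_type, prefix = parse_algnamesrange(line)
    if name_type != "han":
        raise ValueError(f"unsupported range type {name_type!r}")
    if not start <= cp <= end:
        raise ValueError(f"U+{cp:04X} is outside {start:04X}..{end:04X}")
    return f"{prefix}{cp:04X}"


if __name__ == "__main__":
    tangut = "algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH-"
    print(algorithmic_name(tangut, 0x17000))
```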
1005 * PropertyAliases.txt changes
1006 - 1 new property PCM=Prepended_Concatenation_Mark
1007 Ignore: Only useful for layout engines.
1008 Ok to list in ppucd.txt.
1010 * PropertyValueAliases.txt new property values
1012 blk; Bhaiksuki ; Bhaiksuki
1013 blk; Cyrillic_Ext_C ; Cyrillic_Extended_C
1014 blk; Glagolitic_Sup ; Glagolitic_Supplement
1015 blk; Ideographic_Symbols ; Ideographic_Symbols_And_Punctuation
1016 blk; Marchen ; Marchen
1017 blk; Mongolian_Sup ; Mongolian_Supplement
1020 blk; Tangut ; Tangut
1021 blk; Tangut_Components ; Tangut_Components
1023 use long property names for enum constants
1024 -> add to UCharacter.UnicodeBlock IDs
1025 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
1026 replace public static final int \1_ID = \2; \3
1027 -> add to UCharacter.UnicodeBlock objects
1028 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
1029 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
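
The two Eclipse find/replace patterns above also work unchanged with Python's re.sub, which can be handy for checking them on a single line before running them on the whole file. The sample input line is illustrative only; the numeric value and the trailing comment are assumptions about what a uchar.h line looks like.

```python
import re

# Pattern pair for the UnicodeBlock IDs (copied from the log above).
ID_FIND = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
ID_REPLACE = r"public static final int \1_ID = \2; \3"

# Pattern pair for the UnicodeBlock objects (copied from the log above).
OBJ_FIND = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
OBJ_REPLACE = (
    r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2'
)


def to_java_id(c_line: str) -> str:
    """Turn a uchar.h UBLOCK_ enum line into a Java *_ID constant."""
    return ID_FIND.sub(ID_REPLACE, c_line)


def to_java_object(c_line: str) -> str:
    """Turn a uchar.h UBLOCK_ enum line into a Java UnicodeBlock object."""
    return OBJ_FIND.sub(OBJ_REPLACE, c_line)


if __name__ == "__main__":
    sample = "UBLOCK_TANGUT = 272, /*[17000]*/"  # illustrative input line
    print(to_java_id(sample))
    print(to_java_object(sample))
```

Both patterns require a comment (starting with "/") after the comma, so enum lines without a trailing comment are left untouched.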
1032 GCB; EBG ; E_Base_GAZ
1033 GCB; EM ; E_Modifier
1034 GCB; GAZ ; Glue_After_Zwj
1036 -> uchar.h & UCharacter.GraphemeClusterBreak
1038 jg ; African_Feh ; African_Feh
1039 jg ; African_Noon ; African_Noon
1040 jg ; African_Qaf ; African_Qaf
1041 -> uchar.h & UCharacter.JoiningGroup
1044 lb ; EM ; E_Modifier
1046 -> uchar.h & UCharacter.LineBreak
1049 sc ; Bhks ; Bhaiksuki
1054 -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
1057 WB ; EBG ; E_Base_GAZ
1058 WB ; EM ; E_Modifier
1059 WB ; GAZ ; Glue_After_Zwj
1061 -> uchar.h & UCharacter.WordBreak
1063 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
1064 (not strictly necessary for NOT_ENCODED scripts)
1065 ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
1067 * generate normalization data files
1069 bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
1070 bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
1071 bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
1072 bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
1073 bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
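
The .nrm commands above differ only in the output file and in the list of input files, which build on nfc.txt. A small table makes that layering explicit. This is a sketch that only builds the argument lists; it does not run gennorm2, and the names NORM_INPUTS and gennorm2_commands are hypothetical. The --csource invocation is left out.

```python
# Output file -> input .txt files, in command-line order (from the log above).
NORM_INPUTS = {
    "nfc.nrm": ["nfc.txt"],
    "nfkc.nrm": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf.nrm": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46.nrm": ["nfc.txt", "uts46.txt"],
}


def gennorm2_commands(src_data_in: str, unidata: str) -> list[list[str]]:
    """Build one gennorm2 argument list per .nrm output file."""
    return [
        ["bin/gennorm2", "-o", f"{src_data_in}/{out}", "-s", f"{unidata}/norm2", *inputs]
        for out, inputs in NORM_INPUTS.items()
    ]


if __name__ == "__main__":
    for args in gennorm2_commands("$SRC_DATA_IN", "$UNIDATA"):
        print(" ".join(args))
```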
1075 * build ICU (make install)
1076 so that the tools build can pick up the new definitions from the installed header files.
1078 $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 30 out.txt
1080 * build Unicode tools using CMake+make
1082 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
1084 # Location (--prefix) of where ICU was installed.
1085 set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
1086 # Location of the ICU source tree.
1087 set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)
1089 ~/svn.icutools/trunk/dbg/unicode/c$
1090 cmake ../../../src/unicode/c
1093 * generate core properties data files
1094 ~/svn.icutools/trunk/dbg/unicode/c$
1095 genprops/genprops $ICU_SRC_DIR
1096 genuca/genuca --hanOrder implicit $ICU_SRC_DIR
1097 genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
1098 - rebuild ICU (make install) & tools
1100 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1101 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1102 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1103 - Unicode 6.0..9.0: U+2260, U+226E, U+226F
1104 - nothing new in 9.0, no test file to update
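
The grep for "disallowed_STD3_valid" on non-ASCII characters can be sketched in Python as below. The sketch assumes the usual semicolon-separated layout of IdnaMappingTable.txt (code point or range, then status, with "#" comments); the sample lines are illustrative and modeled on that layout, not copied from a particular version of the file.

```python
from typing import Iterable, Iterator


def non_ascii_std3_valid(lines: Iterable[str]) -> Iterator[int]:
    """Yield non-ASCII code points whose status is disallowed_STD3_valid."""
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        for cp in range(start, end + 1):
            if cp > 0x7F:
                yield cp


if __name__ == "__main__":
    sample = [  # illustrative lines in the assumed file format
        "0000..002C    ; disallowed_STD3_valid     # <control-0000>..COMMA",
        "00DF          ; deviation ; 0073 0073     # LATIN SMALL LETTER SHARP S",
        "2260          ; disallowed_STD3_valid     # NOT EQUAL TO",
        "226E..226F    ; disallowed_STD3_valid     # NOT LESS-THAN..NOT GREATER-THAN",
    ]
    print([f"U+{cp:04X}" for cp in non_ascii_std3_valid(sample)])
```

Any code point in the result that is not yet covered by uts46test.cpp and UTS46Test.java is a candidate for a new test case.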
1106 * run & fix ICU4C tests
1107 - Andy handles RBBI & spoof check test failures
1109 * collation: CLDR collation root, UCA DUCET
1111 - UCA DUCET goes into Mark's Unicode tools, see
1112 https://sites.google.com/site/unicodetools/home#TOC-UCA
1113 - CLDR root data files are checked into (CLDR UCA branch)/common/uca/
1114 cp (UCA generated)/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/
1116 - cd (CLDR UCA branch)/common/uca/
1117 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1118 cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
1119 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1120 cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
1121 (note: the next copy drops the underscore before "Rules" from the file name)
1122 cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
1123 - restore TODO diffs in UCARules.txt
1124 meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
1125 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
1126 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1127 from the CLDR root files (..._CLDR_..._SHORT.txt)
1128 cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1129 cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1130 cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
1131 - if CLDR common/uca/unihan-index.txt changes, then update
1132 CLDR common/collation/root.xml <collation type="private-unihan">
1133 and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
1135 - run genuca, see command line above;
1137 Error: Unknown script for first-primary sample character U+104B5 on line 32599 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt:
1138 FDD1 104B5; [75 B8 02, 05, 05] # Osage first primary (compressible)
1139 (add the character to genuca.cpp sampleCharsToScripts[])
1140 + look up the USCRIPT_ code for the new sample characters
1141 (should be obvious from the comment in the error output)
1142 + *add* mappings to sampleCharsToScripts[], do not replace them
1143 (in case the script sample characters flip-flop)
1144 + insert new scripts in DUCET script order, see the top_byte table
1145 at the beginning of FractionalUCA.txt
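
To collect the sample characters before running genuca, the first-primary lines of FractionalUCA.txt can be scanned with a small script. This is a sketch that assumes the line shape shown in the error output above (FDD1, the sample character, a bracketed weight, then a comment ending in "first primary"); the regular expression and function name are hypothetical.

```python
import re
from typing import Optional

# Matches lines like:
# FDD1 104B5; [75 B8 02, 05, 05] # Osage first primary (compressible)
FIRST_PRIMARY = re.compile(
    r"^FDD1 ([0-9A-F]{4,6});\s*\[[^\]]*\]\s*#\s*(.+?) first primary"
)


def parse_first_primary(line: str) -> Optional[tuple[int, str]]:
    """Return (sample code point, script name from the comment) or None."""
    match = FIRST_PRIMARY.match(line)
    if match is None:
        return None
    return int(match.group(1), 16), match.group(2)


if __name__ == "__main__":
    line = "FDD1 104B5; [75 B8 02, 05, 05] # Osage first primary (compressible)"
    cp, script = parse_first_primary(line)
    print(f"U+{cp:04X} -> {script}")
```

The script name from the comment still has to be mapped to its USCRIPT_ constant by hand when adding the entry to sampleCharsToScripts[].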
1150 org.unicode.draft.GenerateUnihanCollators
1152 -DSVN_WORKSPACE=/home/mscherer/svn.unitools/trunk
1153 -DOTHER_WORKSPACE=/home/mscherer/svn.unitools
1154 -DUCD_DIR=/home/mscherer/svn.unitools/trunk/data
1155 -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
1159 org.unicode.draft.GenerateUnihanCollatorFiles
1160 with the same arguments
1163 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
1164 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
1167 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
1168 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
1170 - generate ICU zh collation data: run CLDR
1171 org.unicode.cldr.icu.NewLdml2IcuConverter
1172 with program arguments
1174 -s /home/mscherer/svn.cldr/trunk/common/collation
1175 -m /home/mscherer/svn.cldr/trunk/common/supplemental
1176 -d /home/mscherer/svn.icu/trunk/src/source/data/coll
1177 -p /home/mscherer/svn.icu/trunk/src/source/data/xml/collation
1180 -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
1183 * run & fix ICU4C tests, now with new CLDR collation root data
1184 - run all tests with the collation test data *_SHORT.txt or the full files
1185 (the full ones have comments, useful for debugging)
1186 - note on intltest: if collate/UCAConformanceTest fails, then
1187 utility/MultithreadTest/TestCollators will fail as well;
1188 fix the conformance test before looking into the multi-thread test
1190 * update Java data files
1191 - refresh just the UCD/UCA-related/derived files, just to be safe
1192 - see (ICU4C)/source/data/icu4j-readme.txt
1194 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1197 Unicode .icu files built to ./out/build/icudt58l
1198 echo timestamp > uni-core-data
1199 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt58b
1200 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b
1201 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
1202 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt58l.dat ./out/icu4j/icudt58b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt58l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt58b
1203 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b"
1204 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt58b/
1205 mkdir -p /tmp/icu4j/main/shared/data
1206 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1207 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt58b/
1208 mkdir -p /tmp/icu4j/main/shared/data
1209 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1210 make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
1211 - copy the big-endian Unicode data files to another location,
1212 separate from the other data files,
1213 and then refresh ICU4J
1214 cd ~/svn.icu/trunk/dbg/data/out/icu4j
1215 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1216 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1217 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1218 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1219 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1220 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1221 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1222 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1223 jar uvf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1225 * When refreshing all of ICU4J data from ICU4C
1226 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1227 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
1229 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
1231 * update CollationFCD.java
1232 + copy & paste the initializers of lcccIndex[] etc. from
1233 ICU4C/source/i18n/collationfcd.cpp to
1234 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1236 * refresh Java test .txt files
1237 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1238 cd $ICU_SRC_DIR/source/data/unidata
1239 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1240 cd ../../test/testdata
1241 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1242 cp ~/unidata/uni90/20160603/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1244 * run & fix ICU4J tests
1246 *** LayoutEngine script information
1248 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
1249 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
1250 in the working directory.
1252 (It also generates ScriptRunData.cpp, which is no longer needed.)
1254 It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
1256 which maps ICU versions to the numbers of script/language constants
1257 that were added then.
1258 (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
1260 The generated files have a current copyright date and "@deprecated" statement.
1262 * Review changes, fix Java tool if necessary, and copy to ICU4C
1263 cd ~/svn.icu4j/trunk/src
1264 meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
1265 cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
1266 cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
1269 - send notice to icu-design about new born-@stable API (enum constants etc.)
1271 *** merge the Unicode update branches back onto the trunk
1272 - do not merge icudata.jar and testdata.jar;
1273 instead, rebuild them from the merged & tested ICU4C
1274 - make sure that changes to Unicode tools & ICU tools are checked in
1275 http://www.unicode.org/utility/trac/log/trunk/unicodetools
1276 http://bugs.icu-project.org/trac/log/tools/trunk
1278 ---------------------------------------------------------------------------- ***
1280 New script codes early in ICU 58: http://bugs.icu-project.org/trac/ticket/11764
1283 - new scripts in Unicode 9: Adlm, Bhks, Marc, Newa, Osge
1284 - new combination/alias codes: Hanb, Jamo
1285 - used in CLDR 29 and in spoof checker
1288 Add new codes to uscript.h & UScript.java, see Unicode update logs.
1289 -> com.ibm.icu.lang.UScript
1290 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
1291 replace public static final int \1 = \2; \3
1293 Manually edit ppucd.txt and icutools:unicode/c/genprops/pnames_data.h,
1294 add new script codes.
1295 "Long" script names only where established in Unicode 9 PropertyValueAliases.txt.
1297 Note: If we have to run preparseucd.py again before the Unicode 9 update,
1298 then we need to manually keep/restore the new script codes.
1300 ICU_ROOT=~/svn.icu/trunk
1301 ICU_SRC_DIR=$ICU_ROOT/src
1303 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
1304 SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
1305 UNIDATA=$ICU_SRC_DIR/source/data/unidata
1307 Adjust unicode/c/genprops/*builder.cpp for #ifndef/#ifdef changes in _data.h files,
1308 see http://bugs.icu-project.org/trac/ticket/12141
1310 make install, then icutools cmake & make, then
1311 ~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR
1313 Generate Java data as usual, only update pnames.icu & uprops.icu.
1315 *** LayoutEngine script information
1317 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
1318 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
1319 in the working directory.
1321 (It also generates ScriptRunData.cpp, which is no longer needed.)
1323 It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
1325 which maps ICU versions to the numbers of script/language constants
1326 that were added then.
1327 (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
1329 The generated files have a current copyright date and "@deprecated" statement.
1331 * Review changes, fix Java tool if necessary, and copy to ICU4C
1332 cd ~/svn.icu4j/trunk/src
1333 meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
1334 cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
1335 cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
1337 ---------------------------------------------------------------------------- ***
1339 Emoji properties added in ICU 57: http://bugs.icu-project.org/trac/ticket/11802
1341 Edit preparseucd.py to add & parse new properties.
1342 They share the UCD property namespace but are not listed in PropertyAliases.txt.
1344 Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/
1345 Initial data from emoji/2.0/
1347 ICU_ROOT=~/svn.icu/trunk
1348 ICU_SRC_DIR=$ICU_ROOT/src
1350 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
1351 SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
1352 UNIDATA=$ICU_SRC_DIR/source/data/unidata
1354 Add binary-property constants to uchar.h enum UProperty & UProperty.java.
1356 ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src
1357 (Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.)
1359 Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java
1361 make install, then icutools cmake & make, then
1362 ~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR
1364 Generate Java data as usual, only update pnames.icu & uprops.icu.
1366 ---------------------------------------------------------------------------- ***
1368 Unicode 8.0 update for ICU 56
1370 * Command-line environment setup
1372 ICU_ROOT=~/svn.icu/trunk
1373 ICU_SRC_DIR=$ICU_ROOT/src
1375 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
1376 SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
1377 UNIDATA=$ICU_SRC_DIR/source/data/unidata
1379 http://www.unicode.org/review/pri297/ -- beta review
1380 http://www.unicode.org/reports/uax-proposed-updates.html
1381 http://unicode.org/versions/beta-8.0.0.html
1382 http://www.unicode.org/versions/Unicode8.0.0/
1383 http://www.unicode.org/reports/tr44/tr44-15.html
1387 - ticket:11574: Unicode 8
1388 - C++ branches/markus/uni80 at r37351 from trunk at r37343
1389 - Java branches/markus/uni80 at r37352 from trunk at r37338
1393 - cldrbug 8311: UCA 8
1394 - branches/markus/uni80 at r11518 from trunk at r11517
1396 - cldrbug 8109: Unicode 8.0 script metadata
1397 - cldrbug 8418: Updated segmentation for Unicode 8.0
1399 *** Unicode version numbers
1402 - com.ibm.icu.util.VersionInfo
1403 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
1405 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
1406 so that the makefiles see the new version number.
1408 *** data files & enums & parser code
1412 - download UCD & IDNA files
1413 - make sure that the Unicode data folder passed into preparseucd.py
1414 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
1415 - only for manual diffs: remove version suffixes from the file names
1416 ~/unidata/uni70/20140403$ ../../desuffixucd.py .
1417 (see https://sites.google.com/site/unicodetools/inputdata)
1418 - only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
1419 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src
1420 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1422 - also: from http://unicode.org/Public/security/8.0.0/ download new
1423 confusables.txt & confusablesWholeScript.txt
1424 and copy to $UNIDATA
1425 ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA
1426 ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA
1428 * initial preparseucd.py changes
1429 - remove new Unicode scripts from the
1430 only-in-ISO-15924 list according to the error message:
1431 ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw']
1432 from _scripts_only_in_iso15924
1433 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
1434 and in com.ibm.icu.dev.test.lang.TestUScript.java
1435 - property and file name change:
1436 IndicMatraCategory -> IndicPositionalCategory
1437 - UnicodeData.txt unusual numeric values (fractions not in lowest terms, e.g., 2/12)
1438 109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;
1439 109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;
1440 109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;;
1441 109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;;
1442 109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;;
1443 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;
1444 109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;;
1445 109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;;
1446 109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;;
1447 109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;
1448 -> change preparseucd.py to map them to fractions in lowest terms (e.g., 2/12 -> 1/6)
1449 which are listed in DerivedNumericValues.txt;
1450 keeps storage in data file simple
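A minimal sketch of the numeric-value mapping described above, assuming the value arrives as a string such as "2/12". The helper name reduce_numeric_value is hypothetical and is not the function used in preparseucd.py.

```python
from fractions import Fraction


def reduce_numeric_value(nv):
    """Return a UnicodeData.txt-style numeric value string in lowest terms.

    Hypothetical helper for illustration only.
    Values without a '/' are returned unchanged.
    """
    if '/' not in nv:
        return nv
    f = Fraction(nv)  # Fraction('2/12') is normalized to 1/6
    if f.denominator == 1:
        return str(f.numerator)
    return '%d/%d' % (f.numerator, f.denominator)


for nv in ('1/12', '2/12', '6/12', '10/12'):
    print(nv, '->', reduce_numeric_value(nv))
```

Values that are already in lowest terms (1/12, 5/12, 7/12) pass through unchanged, so the mapping can be applied to every fraction without special cases.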
1452 * PropertyValueAliases.txt changes
1453 - 10 new Block (blk) values:
1455 blk; Anatolian_Hieroglyphs ; Anatolian_Hieroglyphs
1456 blk; Cherokee_Sup ; Cherokee_Supplement
1457 blk; CJK_Ext_E ; CJK_Unified_Ideographs_Extension_E
1458 blk; Early_Dynastic_Cuneiform ; Early_Dynastic_Cuneiform
1459 blk; Hatran ; Hatran
1460 blk; Multani ; Multani
1461 blk; Old_Hungarian ; Old_Hungarian
1462 blk; Sup_Symbols_And_Pictographs ; Supplemental_Symbols_And_Pictographs
1463 blk; Sutton_SignWriting ; Sutton_SignWriting
1465 use long property names for enum constants
1466 -> add to UCharacter.UnicodeBlock IDs
1467 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
1468 replace public static final int \1_ID = \2; \3
1469 -> add to UCharacter.UnicodeBlock objects
1470 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
1471 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1472 - 6 new Script (sc) values:
1475 sc ; Hluw ; Anatolian_Hieroglyphs
1476 sc ; Hung ; Old_Hungarian
1478 sc ; Sgnw ; SignWriting
1479 -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
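The two Eclipse find/replace pairs for UBLOCK_ constants above can be tried out with Python's re module. This is only a sketch: the input line is made up, and the block number 999 is a placeholder, not a real UBLOCK_ value.

```python
import re

# Made-up uchar.h-style enum line; 999 is a placeholder value.
C_LINE = 'UBLOCK_HATRAN = 999, /* placeholder value */'


def to_java_id(line):
    """Apply the first find/replace pair (UnicodeBlock IDs)."""
    return re.sub(r'UBLOCK_([^ ]+) = ([0-9]+), (/.+)',
                  r'public static final int \1_ID = \2; \3',
                  line)


def to_java_object(line):
    """Apply the second find/replace pair (UnicodeBlock objects)."""
    return re.sub(r'UBLOCK_([^ ]+) = [0-9]+, (/.+)',
                  r'public static final UnicodeBlock \1 = '
                  r'new UnicodeBlock("\1", \1_ID); \2',
                  line)


print(to_java_id(C_LINE))
print(to_java_object(C_LINE))
```

Both patterns require a trailing comment that starts with "/" after the comma, so enum lines without a comment are left untouched.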
1481 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
1482 (not strictly necessary for NOT_ENCODED scripts)
1483 ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
1485 * generate normalization data files
1487 bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
1488 bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
1489 bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
1490 bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
1491 bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
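The five gennorm2 invocations above differ only in the output file and in which input files are layered on top of nfc.txt. A sketch that builds the same argument lists from a table, assuming SRC_DATA_IN and UNIDATA are defined as in the Unicode 7.0 log ($ICU_SRC_DIR/source/data/in and $ICU_SRC_DIR/source/data/unidata). The function name is hypothetical and nothing is executed.

```python
import posixpath


def gennorm2_commands(icu_src_dir):
    """Return the five gennorm2 argument lists; does not run them."""
    norm2_dir = posixpath.join(icu_src_dir, 'source/data/unidata/norm2')
    src_data_in = posixpath.join(icu_src_dir, 'source/data/in')
    common = posixpath.join(icu_src_dir, 'source/common')

    def cmd(output, inputs, extra=()):
        return (['bin/gennorm2', '-o', output, '-s', norm2_dir] +
                list(inputs) + list(extra))

    nfc = ['nfc.txt']
    return [
        cmd(posixpath.join(common, 'norm2_nfc_data.h'), nfc, ['--csource']),
        cmd(posixpath.join(src_data_in, 'nfc.nrm'), nfc),
        cmd(posixpath.join(src_data_in, 'nfkc.nrm'), nfc + ['nfkc.txt']),
        cmd(posixpath.join(src_data_in, 'nfkc_cf.nrm'),
            nfc + ['nfkc.txt', 'nfkc_cf.txt']),
        cmd(posixpath.join(src_data_in, 'uts46.nrm'), nfc + ['uts46.txt']),
    ]


# '/path/to/icu/src' stands in for $ICU_SRC_DIR.
for argv in gennorm2_commands('/path/to/icu/src'):
    print(' '.join(argv))
```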
1493 * build ICU (make install)
1494 so that the tools build can pick up the new definitions from the installed header files.
1496 $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
1498 * build Unicode tools using CMake+make
1500 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
1502 # Location (--prefix) of where ICU was installed.
1503 set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
1504 # Location of the ICU source tree.
1505 set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)
1507 ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
1508 ~/svn.icutools/trunk/dbg/unicode/c$ make
1510 * generate core properties data files
1511 - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
1512 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR
1513 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
1514 - rebuild ICU (make install) & tools
1515 - run genuca again (see step above) so that it picks up the new nfc.nrm
1516 - rebuild ICU (make install) & tools
1518 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1519 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1520 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1521 - Unicode 6.0..8.0: U+2260, U+226E, U+226F
1522 - nothing new in 8.0, no test file to update
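The grep step above can also be done with a small filter that keeps only non-ASCII code points. A sketch, assuming the usual semicolon-separated layout with "#" comments. The sample lines are made up to show the format and are not copied from IdnaMappingTable.txt. The function name is hypothetical.

```python
def non_ascii_with_status(lines, status='disallowed_STD3_valid'):
    """Return (start, end) ranges above U+007F whose status field matches."""
    hits = []
    for line in lines:
        data = line.split('#', 1)[0]          # drop the trailing comment
        fields = [f.strip() for f in data.split(';')]
        if len(fields) < 2 or not fields[0]:
            continue                          # blank or comment-only line
        if fields[1] != status:
            continue
        start_text, _, end_text = fields[0].partition('..')
        start = int(start_text, 16)
        end = int(end_text, 16) if end_text else start
        if end > 0x7F:
            hits.append((max(start, 0x80), end))
    return hits


# Made-up sample lines, for illustration of the format only.
SAMPLE = [
    '# comment line',
    '0000..002C    ; disallowed_STD3_valid   # ASCII range, ignored',
    'E000          ; disallowed_STD3_valid   # hypothetical non-ASCII entry',
    'E001          ; valid                   # hypothetical, other status',
]

for start, end in non_ascii_with_status(SAMPLE):
    print('U+%04X..U+%04X' % (start, end))
```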
1524 * run & fix ICU4C tests
1525 - bad Cherokee case folding due to difference in fallbacks:
1526 UCD case folding falls back to no mapping,
1527 ICU runtime case folding falls back to lowercasing;
1528 fixed casepropsbuilder.cpp to generate scf mappings to self
1529 when there is an slc mapping but no scf
1530 - Andy handles RBBI & spoof check test failures
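A small model of the Cherokee case folding problem and the fix described above. This is a sketch, not the casepropsbuilder.cpp code: the maps are plain dicts and the function names are hypothetical. The sample pair is the Unicode 8 Cherokee letter U+13A0, which lowercases to U+AB70, while U+AB70 case-folds to U+13A0.

```python
def runtime_fold(cp, slc, scf):
    """Model of the runtime fallback: scf if present, else slc, else self."""
    if cp in scf:
        return scf[cp]
    return slc.get(cp, cp)


def add_self_foldings(slc, scf):
    """Return a copy of scf with cp -> cp added where slc exists but scf does not."""
    result = dict(scf)
    for cp in slc:
        if cp not in result:
            result[cp] = cp
    return result


slc = {0x13A0: 0xAB70}   # simple lowercase mapping
scf = {0xAB70: 0x13A0}   # simple case folding (folds to the uppercase letter)

# Without the fix, the fallback to lowercasing makes the two letters swap.
print('%04X' % runtime_fold(0x13A0, slc, scf))

# With explicit mappings to self, both letters fold to U+13A0.
fixed = add_self_foldings(slc, scf)
print('%04X' % runtime_fold(0x13A0, slc, fixed))
```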
1532 * collation: CLDR collation root, UCA DUCET
1534 - UCA DUCET goes into Mark's Unicode tools, see
1535 https://sites.google.com/site/unicodetools/home#TOC-UCA
1536 - CLDR root data files are checked into (CLDR UCA branch)/common/uca/
1537 - cd (CLDR UCA branch)/common/uca/
1538 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1539 cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
1540 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1541 cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
1542 (note removing the underscore before "Rules")
1543 cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
1544 - restore TODO diffs in UCARules.txt
1545 meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
1546 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
1547 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1548 from the CLDR root files (..._CLDR_..._SHORT.txt)
1549 cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1550 cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1551 cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
1552 - if CLDR common/uca/unihan-index.txt changes, then update
1553 CLDR common/collation/root.xml <collation type="private-unihan">
1554 and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
1555 - run genuca, see command line above;
1557 Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt
1558 (add the character to genuca.cpp sampleCharsToScripts[])
1559 + look up the script for the new sample characters
1560 (e.g., in FractionalUCA.txt)
1561 + *add* mappings to sampleCharsToScripts[], do not replace them
1562 (in case the script sample characters flip-flop)
1563 + insert new scripts in DUCET script order, see the top_byte table
1564 at the beginning of FractionalUCA.txt
1567 * run & fix ICU4C tests, now with new CLDR collation root data
1568 - run all tests with the collation test data *_SHORT.txt or the full files
1569 (the full ones have comments, useful for debugging)
1570 - note on intltest: if collate/UCAConformanceTest fails, then
1571 utility/MultithreadTest/TestCollators will fail as well;
1572 fix the conformance test before looking into the multi-thread test
1573 - fixed bug in CollationWeights::getWeightRanges()
1574 exposed by new data and CollationTest::TestRootElements
1576 * update Java data files
1577 - refresh just the UCD/UCA-related/derived files, just to be safe
1578 - see (ICU4C)/source/data/icu4j-readme.txt
1580 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1583 Unicode .icu files built to ./out/build/icudt56l
1584 echo timestamp > uni-core-data
1585 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b
1586 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b
1587 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
1588 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b
1589 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b"
1590 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/
1591 mkdir -p /tmp/icu4j/main/shared/data
1592 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1593 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/
1594 mkdir -p /tmp/icu4j/main/shared/data
1595 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1596 make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
1597 - copy the big-endian Unicode data files to another location,
1598 separate from the other data files,
1599 and then refresh ICU4J
1600 cd ~/svn.icu/trunk/dbg/data/out/icu4j
1601 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1602 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1603 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1604 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1605 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1606 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1607 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1608 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1609 jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1611 * When refreshing all of ICU4J data from ICU4C
1612 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1613 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
1615 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
1617 * update CollationFCD.java
1618 + copy & paste the initializers of lcccIndex[] etc. from
1619 ICU4C/source/i18n/collationfcd.cpp to
1620 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1622 * refresh Java test .txt files
1623 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1624 cd $ICU_SRC_DIR/source/data/unidata
1625 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1626 cd ../../test/testdata
1627 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1628 cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1630 * run & fix ICU4J tests
1632 *** LayoutEngine script information
1634 * ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more,
1635 because the layout engine was deprecated in ICU 54.
1636 Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java
1637 to write lines that we used to add manually.
1639 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
1640 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
1641 in the working directory.
1643 (It also generates ScriptRunData.cpp, which is no longer needed.)
1645 It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
1647 which maps ICU versions to the numbers of script/language constants
1648 that were added then.
1649 (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
1651 The generated files have a current copyright date and "@deprecated" statement.
1653 * Review changes, fix Java tool if necessary, and copy to ICU4C
1654 cd ~/svn.icu4j/trunk/src
1655 meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
1656 cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
1657 cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
1660 - send notice to icu-design about new born-@stable API (enum constants etc.)
1662 *** merge the Unicode update branches back onto the trunk
1663 - do not merge the icudata.jar and testdata.jar,
1664 instead rebuild them from merged & tested ICU4C
1665 - make sure that changes to Unicode tools & ICU tools are checked in
1666 http://www.unicode.org/utility/trac/log/trunk/unicodetools
1667 http://bugs.icu-project.org/trac/log/tools/trunk
1669 ---------------------------------------------------------------------------- ***
1671 Unicode 7.0 update for ICU 54
1673 http://www.unicode.org/review/pri271/ -- beta review
1674 http://www.unicode.org/reports/uax-proposed-updates.html
1675 http://www.unicode.org/versions/beta-7.0.0.html#notable_issues
1676 http://www.unicode.org/reports/tr44/tr44-13.html
1680 - ticket 10821: Unicode 7.0, UCA 7.0
1681 - C++ branches/markus/uni70 at r35584 from trunk at r35580
1682 - Java branches/markus/uni70 at r35587 from trunk at r35545
1686 - ticket 7195: UCA 7.0 CLDR root collation
1687 - branches/markus/uni70 at r10062 from trunk at r10061
1689 - ticket 6762: script metadata for Unicode 7.0 new scripts
1691 *** Unicode version numbers
1694 - com.ibm.icu.util.VersionInfo
1695 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
1697 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
1698 so that the makefiles see the new version number.
1700 *** data files & enums & parser code
1704 - download UCD & IDNA files
1705 - make sure that the Unicode data folder passed into preparseucd.py
1706 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
1707 - only for manual diffs: remove version suffixes from the file names
1708 ~/unidata/uni70/20140403$ ../../desuffixucd.py .
1709 (see https://sites.google.com/site/unicodetools/inputdata)
1710 - only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
1711 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src
1712 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1713 - Restore TODO diffs in source/data/unidata/UCARules.txt
1715 meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt
1716 - Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt
1718 - also: from http://unicode.org/Public/security/7.0.0/ download new
1719 confusables.txt & confusablesWholeScript.txt
1720 and copy to $ICU_ROOT/src/source/data/unidata/
1722 * initial preparseucd.py changes
1723 - remove new Unicode scripts from the
1724 only-in-ISO-15924 list according to the error message:
1725 ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass',
1726 'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm',
1727 'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj']
1728 from _scripts_only_in_iso15924
1729 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
1730 and in com.ibm.icu.dev.test.lang.TestUScript.java
1731 - NamesList.txt now has a heading with a non-ASCII character
1732 + keep ppucd.txt in platform charset, rather than changing tool/test parsers
1733 + escape non-ASCII characters in heading comments
1734 - gets Unicode copyright line from PropertyAliases.txt which is currently still at 2013
1735 + get the copyright from the first file whose copyright line contains the current year
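The two preparseucd.py adjustments above (escaping non-ASCII characters in heading comments, and taking the copyright from the first file whose line contains the current year) could look roughly like this. Both helper names are hypothetical, the sample strings are made up, and the backslash escape style is an assumption rather than the format the tool actually writes.

```python
def escape_non_ascii(text):
    """Replace each non-ASCII character with a backslash escape."""
    return text.encode('ascii', 'backslashreplace').decode('ascii')


def pick_copyright(copyright_lines, year):
    """Return the first copyright line that mentions the given year, or None."""
    year_text = str(year)
    for line in copyright_lines:
        if year_text in line:
            return line
    return None


# Made-up lines for illustration; not copied from the UCD files.
lines = [
    '# Copyright (c) 1991-2013 Unicode, Inc.',
    '# Copyright (c) 1991-2014 Unicode, Inc.',
]
print(pick_copyright(lines, 2014))
print(escape_non_ascii(u'heading with \u00e9'))
```

Escaping keeps ppucd.txt readable in the platform charset, so the tool and test parsers do not need to change.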
1737 * PropertyValueAliases.txt changes
1738 - 32 new Block (blk) values:
1739 blk; Bassa_Vah ; Bassa_Vah
1740 blk; Caucasian_Albanian ; Caucasian_Albanian
1741 blk; Coptic_Epact_Numbers ; Coptic_Epact_Numbers
1742 blk; Diacriticals_Ext ; Combining_Diacritical_Marks_Extended
1743 blk; Duployan ; Duployan
1744 blk; Elbasan ; Elbasan
1745 blk; Geometric_Shapes_Ext ; Geometric_Shapes_Extended
1746 blk; Grantha ; Grantha
1747 blk; Khojki ; Khojki
1748 blk; Khudawadi ; Khudawadi
1749 blk; Latin_Ext_E ; Latin_Extended_E
1750 blk; Linear_A ; Linear_A
1751 blk; Mahajani ; Mahajani
1752 blk; Manichaean ; Manichaean
1753 blk; Mende_Kikakui ; Mende_Kikakui
1756 blk; Myanmar_Ext_B ; Myanmar_Extended_B
1757 blk; Nabataean ; Nabataean
1758 blk; Old_North_Arabian ; Old_North_Arabian
1759 blk; Old_Permic ; Old_Permic
1760 blk; Ornamental_Dingbats ; Ornamental_Dingbats
1761 blk; Pahawh_Hmong ; Pahawh_Hmong
1762 blk; Palmyrene ; Palmyrene
1763 blk; Pau_Cin_Hau ; Pau_Cin_Hau
1764 blk; Psalter_Pahlavi ; Psalter_Pahlavi
1765 blk; Shorthand_Format_Controls ; Shorthand_Format_Controls
1766 blk; Siddham ; Siddham
1767 blk; Sinhala_Archaic_Numbers ; Sinhala_Archaic_Numbers
1768 blk; Sup_Arrows_C ; Supplemental_Arrows_C
1769 blk; Tirhuta ; Tirhuta
1770 blk; Warang_Citi ; Warang_Citi
1772 use long property names for enum constants
1773 -> add to UCharacter.UnicodeBlock IDs
1774 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
1775 replace public static final int \1_ID = \2; \3
1776 -> add to UCharacter.UnicodeBlock objects
1777 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
1778 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1779 - 28 new Joining_Group (jg) values:
1780 jg ; Manichaean_Aleph ; Manichaean_Aleph
1781 jg ; Manichaean_Ayin ; Manichaean_Ayin
1782 jg ; Manichaean_Beth ; Manichaean_Beth
1783 jg ; Manichaean_Daleth ; Manichaean_Daleth
1784 jg ; Manichaean_Dhamedh ; Manichaean_Dhamedh
1785 jg ; Manichaean_Five ; Manichaean_Five
1786 jg ; Manichaean_Gimel ; Manichaean_Gimel
1787 jg ; Manichaean_Heth ; Manichaean_Heth
1788 jg ; Manichaean_Hundred ; Manichaean_Hundred
1789 jg ; Manichaean_Kaph ; Manichaean_Kaph
1790 jg ; Manichaean_Lamedh ; Manichaean_Lamedh
1791 jg ; Manichaean_Mem ; Manichaean_Mem
1792 jg ; Manichaean_Nun ; Manichaean_Nun
1793 jg ; Manichaean_One ; Manichaean_One
1794 jg ; Manichaean_Pe ; Manichaean_Pe
1795 jg ; Manichaean_Qoph ; Manichaean_Qoph
1796 jg ; Manichaean_Resh ; Manichaean_Resh
1797 jg ; Manichaean_Sadhe ; Manichaean_Sadhe
1798 jg ; Manichaean_Samekh ; Manichaean_Samekh
1799 jg ; Manichaean_Taw ; Manichaean_Taw
1800 jg ; Manichaean_Ten ; Manichaean_Ten
1801 jg ; Manichaean_Teth ; Manichaean_Teth
1802 jg ; Manichaean_Thamedh ; Manichaean_Thamedh
1803 jg ; Manichaean_Twenty ; Manichaean_Twenty
1804 jg ; Manichaean_Waw ; Manichaean_Waw
1805 jg ; Manichaean_Yodh ; Manichaean_Yodh
1806 jg ; Manichaean_Zayin ; Manichaean_Zayin
1807 jg ; Straight_Waw ; Straight_Waw
1808 -> uchar.h & UCharacter.JoiningGroup
1809 - 23 new Script (sc) values:
1810 sc ; Aghb ; Caucasian_Albanian
1811 sc ; Bass ; Bassa_Vah
1812 sc ; Dupl ; Duployan
1815 sc ; Hmng ; Pahawh_Hmong
1817 sc ; Lina ; Linear_A
1818 sc ; Mahj ; Mahajani
1819 sc ; Mani ; Manichaean
1820 sc ; Mend ; Mende_Kikakui
1823 sc ; Narb ; Old_North_Arabian
1824 sc ; Nbat ; Nabataean
1825 sc ; Palm ; Palmyrene
1826 sc ; Pauc ; Pau_Cin_Hau
1827 sc ; Perm ; Old_Permic
1828 sc ; Phlp ; Psalter_Pahlavi
1830 sc ; Sind ; Khudawadi
1832 sc ; Wara ; Warang_Citi
1833 -> uscript.h (many were added before)
1834 comment "Mende Kikakui" for USCRIPT_MENDE
1835 add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias
1836 -> com.ibm.icu.lang.UScript
1837 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
1838 replace public static final int \1 = \2; \3
1839 - 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
1846 Pauc 263 Pau Cin Hau
1848 -> uscript.h (some overlap with additions from Unicode)
1849 -> com.ibm.icu.lang.UScript
1850 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
1851 replace public static final int \1 = \2; \3
1852 -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924
1853 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
1854 and in com.ibm.icu.dev.test.lang.TestUScript.java
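Codes added to _scripts_only_in_iso15924 have to come out again once Unicode encodes the script, which is what the "ValueError: remove [...]" message in these logs reports. A sketch of such a consistency check. The function name and its arguments are hypothetical, and the real check in preparseucd.py may be structured differently.

```python
def check_scripts_only_in_iso15924(only_in_iso, unicode_scripts):
    """Raise ValueError if a code listed as ISO-only is also a Unicode script."""
    overlap = sorted(set(only_in_iso) & set(unicode_scripts))
    if overlap:
        raise ValueError(
            'remove %s from _scripts_only_in_iso15924' % overlap)


only_in_iso = ['Ahom', 'Hatr', 'Mult']

# No overlap: passes silently.
check_scripts_only_in_iso15924(only_in_iso, ['Latn', 'Sind'])

# Once a later Unicode version encodes Ahom, the check fails.
try:
    check_scripts_only_in_iso15924(only_in_iso, ['Latn', 'Ahom'])
except ValueError as e:
    print('ValueError:', e)
```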
1856 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
1857 (not strictly necessary for NOT_ENCODED scripts)
1858 ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
1860 * generate normalization data files
1862 - export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
1863 - SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
1864 - UNIDATA=$ICU_SRC_DIR/source/data/unidata
1865 - bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
1866 - bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
1867 - bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
1868 - bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
1869 - bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
1871 * build ICU (make install)
1872 so that the tools build can pick up the new definitions from the installed header files.
1874 ~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
1876 * build Unicode tools using CMake+make
1878 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
1880 # Location (--prefix) of where ICU was installed.
1881 set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst)
1882 # Location of the ICU source tree.
1883 set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src)
1885 ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
1886 ~/svn.icutools/trunk/dbg/unicode/c$ make
1889 - new code point range for Joining_Group values: 10AC0..10AFF Manichaean
1890 + add second array of Joining_Group values for at most 10800..10FFF
1891 icutools: unicode/c/genprops/bidipropsbuilder.cpp
1892 icu: source/common/ubidi_props.h/.c/_data.h
1893 icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java
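A sketch of a Joining_Group lookup over two code point ranges, as in the second-array change above. The class name, the start of the first range, and the numeric values are placeholders. Only the Manichaean range 10AC0..10AFF comes from the log.

```python
class JoiningGroupTable(object):
    """Joining_Group lookup with two arrays (illustrative only)."""

    def __init__(self, start, values, start2, values2):
        self.start, self.values = start, values
        self.start2, self.values2 = start2, values2

    def get(self, cp):
        i = cp - self.start
        if 0 <= i < len(self.values):
            return self.values[i]
        i = cp - self.start2
        if 0 <= i < len(self.values2):
            return self.values2[i]
        return 0  # stands for No_Joining_Group


# Placeholder data: 0x0620 and the values 1..3 are invented for this sketch.
# The second array covers 10AC0..10AFF (0x40 code points).
table = JoiningGroupTable(0x0620, [1, 2], 0x10AC0, [3] * 0x40)

print(table.get(0x10AC0), table.get(0x10AFF), table.get(0x10B00))
```

Two short arrays avoid storing zeros for the large gap between the ranges.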
1895 * generate core properties data files
1896 - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
1897 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR
1898 - rebuild ICU (make install) & tools
1899 - run genuca again (see step above) so that it picks up the new nfc.nrm
1900 - rebuild ICU (make install) & tools
1902 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1903 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1904 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1905 - Unicode 6.0..7.0: U+2260, U+226E, U+226F
1906 - nothing new in 7.0, no test file to update
1908 * run & fix ICU4C tests
1910 * update Java data files
1911 - refresh just the UCD-related files, just to be safe
1912 - see (ICU4C)/source/data/icu4j-readme.txt
1914 - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1917 Unicode .icu files built to ./out/build/icudt53l
1918 echo timestamp > uni-core-data
1919 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b
1920 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b
1921 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
1922 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b
1923 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b"
1924 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/
1925 mkdir -p /tmp/icu4j/main/shared/data
1926 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1927 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/
1928 mkdir -p /tmp/icu4j/main/shared/data
1929 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1930 make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data'
1931 - copy the big-endian Unicode data files to another location,
1932 separate from the other data files
1934 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1935 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1936 cd ~/svn.icu/uni70/dbg/data/out/icu4j
1937 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1938 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1939 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1940 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1941 cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1942 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1944 ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1946 * update CollationFCD.java
1947 + copy & paste the initializers of lcccIndex[] etc. from
1948 ICU4C/source/i18n/collationfcd.cpp to
1949 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1951 * refresh Java test .txt files
1952 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1953 cd $ICU_SRC_DIR/source/data/unidata
1954 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1955 cd ../../test/testdata
1956 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1957 cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1961 - download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/
1962 - run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata)
1963 - update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/
1964 - run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA
1965 - output files are in ~/svn.unitools/Generated/uca/7.0.0/
1966 - review data; compare files, use blankweights.sed or similar
1967 ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt
1968 - cd ~/svn.unitools/Generated/uca/7.0.0/
1969 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1970 cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
1971 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1972 (note removing the underscore before "Rules")
1973 cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
1974 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
1975 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1976 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
1977 cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1978 cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1979 cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
1980 - run genuca, see command line above
1982 - refresh ICU4J collation data:
1983 (subset of instructions above for properties data refresh, except copies all coll/*)
1985 ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1986 ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1987 ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1988 ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1989 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
1990 - note on intltest: if collate/UCAConformanceTest fails, then
1991 utility/MultithreadTest/TestCollators will fail as well;
1992 fix the conformance test before looking into the multi-thread test
1993 - copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors
1994 - copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch
1995 ~/svn.unitools$ cp Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/
1997 * When refreshing all of ICU4J data from ICU4C
1998 - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1999 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
2001 - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
2003 * run & fix ICU4J tests
2005 *** LayoutEngine script information
2007 (For details see the Unicode 5.2 change log below.)
2009 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
2010 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
2011 in the working directory.
2012 (It also generates ScriptRunData.cpp, which is no longer needed.)
2014 The generated files have a current copyright date and "@stable" statement.
2015 ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java
2016 for "born stable" Unicode API constants, and to stop parsing ICU version numbers
2017 which may not contain dots any more.
2019 - diff current <icu>/source/layout files vs. generated ones
2020 ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
2021 review and manually merge desired changes;
2022 fix gratuitous changes, incorrect @draft/@stable and missing aliases;
2023 Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
2024 - if you just copy the above files, then
2025 fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
2026 manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
2029 - send notice to icu-design about new born-@stable API (enum constants etc.)
2031 *** merge the Unicode update branches back onto the trunk
2032 - do not merge the icudata.jar and testdata.jar,
2033 instead rebuild them from merged & tested ICU4C
2035 ---------------------------------------------------------------------------- ***
2039 http://www.unicode.org/review/pri249/ -- beta review
2040 http://www.unicode.org/reports/uax-proposed-updates.html
2041 http://www.unicode.org/versions/beta-6.3.0.html#notable_issues
2042 http://www.unicode.org/reports/tr44/tr44-11.html
2046 - ticket 10128: update ICU to Unicode 6.3 beta
2047 - ticket 10168: update ICU to Unicode 6.3 final
2048 - C++ branches/markus/uni63 at r33552 from trunk at r33551
2049 - Java branches/markus/uni63 at r33550 from trunk at r33553
2051 - ticket 10142: implement Unicode 6.3 bidi algorithm additions
2053 *** Unicode version numbers
2056 (configure.in & configure: have been modified to extract the version from uchar.h)
2057 - com.ibm.icu.util.VersionInfo
2058 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
2060 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
2061 so that the makefiles see the new version number.
2063 *** data files & enums & parser code
2067 - download UCD, UCA & IDNA files
2068 - make sure that the Unicode data folder passed into preparseucd.py
2069 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
2070 - modify preparseucd.py:
2071 parse new file BidiBrackets.txt
2072 with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type
2073 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src
2074 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2075 - Check test file diffs for previously commented-out, known-failing data lines;
2076 probably need to keep those commented out.
2078 * PropertyAliases.txt changes
2079 - 1 new Enumerated Property
2080 bpt ; Bidi_Paired_Bracket_Type
2081 -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType
2082 -> ubidi_props.h & .c & UBiDiProps.java
2083 -> remember to write the max value at UBIDI_MAX_VALUES_INDEX
2085 -> change ubidi.icu format version from 2.0 to 2.1
2086 - 1 new Miscellaneous Property
2087 bpb ; Bidi_Paired_Bracket
2088 -> uchar.h & UProperty.java
2091 * PropertyValueAliases.txt changes
2092 - 3 Bidi_Paired_Bracket_Type (bpt) values:
2096 -> uchar.h & UCharacter.BidiPairedBracketType
2097 -> ubidi_props.h & .c & UBiDiProps.java
2098 -> change ubidi.icu format version from 2.0 to 2.1
2099 - 4 new Bidi_Class (bc) values:
2100 bc ; FSI ; First_Strong_Isolate
2101 bc ; LRI ; Left_To_Right_Isolate
2102 bc ; RLI ; Right_To_Left_Isolate
2103 bc ; PDI ; Pop_Directional_Isolate
2104 -> uchar.h & UCharacterEnums.ECharacterDirection
2105 -> until the bidi code gets updated,
2106 Roozbeh suggests mapping the new bc values to ON (Other_Neutral)
2107 - 3 new Word_Break (WB) values:
2108 WB ; HL ; Hebrew_Letter
2109 WB ; SQ ; Single_Quote
2110 WB ; DQ ; Double_Quote
2111 -> uchar.h & UCharacter.WordBreak
2112 -> first time Word_Break numeric constants exceed 4 bits (now 17 values)
2113 - 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
2115 Aghb 239 Caucasian Albanian
2118 -> com.ibm.icu.lang.UScript
2119 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2120 replace public static final int \1 = \2;\3
2121 -> preparseucd.py _scripts_only_in_iso15924
2122 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
2123 and in com.ibm.icu.dev.test.lang.TestUScript.java
2124 -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
2125 (not strictly necessary for NOT_ENCODED scripts)
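The find/replace pair for com.ibm.icu.lang.UScript above can also be applied outside an editor. A minimal Python sketch, assuming the uscript.h enum lines have the shape that the pattern expects; the sample constant name and value are made up for illustration and are not real script codes:

```python
import re

# Pattern and replacement copied from the find/replace step above.
USCRIPT_ENUM = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
JAVA_CONSTANT = r"public static final int \1 = \2;\3"


def c_enum_to_java(line: str) -> str:
    """Rewrite one C enum line as a Java int constant declaration."""
    return USCRIPT_ENUM.sub(JAVA_CONSTANT, line)


# Hypothetical input line; name, value and comment are placeholders.
sample = "    USCRIPT_EXAMPLE_SCRIPT = 999,/* Exmp */"
converted = c_enum_to_java(sample)
print(converted)
```

Only the matched part of the line is rewritten, so leading indentation survives; lines that do not match the pattern pass through unchanged and still need a manual look.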
2127 * generate normalization data files
2128 - ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib
2129 - ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in
2130 - ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata
2131 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
2132 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
2133 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
2134 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
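The four gennorm2 calls above differ only in the output file and the list of input files; every later file builds on nfc.txt. A sketch in Python that assembles the same argument lists. It only builds the lists and does not run anything; the function name and the table are my own and not part of the ICU tools:

```python
from pathlib import PurePosixPath

# Output .nrm file -> input .txt files, as listed in the gennorm2 steps above.
NORM_FILES = {
    "nfc.nrm": ["nfc.txt"],
    "nfkc.nrm": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf.nrm": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46.nrm": ["nfc.txt", "uts46.txt"],
}


def gennorm2_commands(src_data_in: str, unidata: str) -> list[list[str]]:
    """Return one argv list per .nrm file, mirroring the command lines above."""
    norm2_dir = PurePosixPath(unidata) / "norm2"
    return [
        ["bin/gennorm2", "-o", str(PurePosixPath(src_data_in) / output),
         "-s", str(norm2_dir), *inputs]
        for output, inputs in NORM_FILES.items()
    ]


commands = gennorm2_commands("src/source/data/in", "src/source/data/unidata")
for command in commands:
    print(" ".join(command))
```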
2136 * build ICU (make install)
2137 so that the tools build can pick up the new definitions from the installed header files.
2139 ~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
2141 * build Unicode tools using CMake+make
2143 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
2145 # Location (--prefix) of where ICU was installed.
2146 set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst)
2147 # Location of the ICU source tree.
2148 set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src)
2150 ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
2151 ~/svn.icutools/trunk/dbg/unicode/c$ make
2153 * generate core properties data files
2154 - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src
2155 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src
2156 - rebuild ICU (make install) & tools
2157 - run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
2158 - rebuild ICU (make install) & tools
2160 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2161 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2162 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2163 - Unicode 6.0..6.3: U+2260, U+226E, U+226F
2164 - nothing new in 6.3, no test file to update
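The grep step above can be sketched in Python. This assumes the usual semicolon-separated layout of IdnaMappingTable.txt (code point or range, then status, with "#" comments); the sample lines are illustrative only and were not copied from a real data file:

```python
def non_ascii_std3_valid(lines):
    """List non-ASCII code points or ranges with status disallowed_STD3_valid."""
    hits = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")
        last = last or first  # a single code point is a one-element range
        if int(last, 16) > 0x7F:
            hits.append(f"U+{first}" if last == first else f"U+{first}..U+{last}")
    return hits


# Illustrative sample lines in the assumed format.
sample = [
    "# comment line",
    "003D          ; disallowed_STD3_valid",
    "00E0..00F6    ; valid",
    "2260          ; disallowed_STD3_valid",
    "226E..226F    ; disallowed_STD3_valid",
]
found = non_ascii_std3_valid(sample)
print(found)
```

Anything in the result beyond U+2260, U+226E and U+226F would be a reason to update uts46test.cpp and UTS46Test.java.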
2166 * update Java data files
2167 - refresh just the UCD-related files, just to be safe
2168 - see (ICU4C)/source/data/icu4j-readme.txt
2170 - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2173 Unicode .icu files built to ./out/build/icudt52l
2174 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b
2175 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b
2176 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
2177 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b
2178 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b"
2179 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/
2180 mkdir -p /tmp/icu4j/main/shared/data
2181 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2182 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/
2183 mkdir -p /tmp/icu4j/main/shared/data
2184 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2185 make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data'
2186 - copy the big-endian Unicode data files to another location,
2187 separate from the other data files
2188 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
2189 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
2190 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
2191 ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu
2192 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
2193 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
2194 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
2196 ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
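The cp/rm sequence above amounts to a file selection: top-level *.icu (minus cnvalias.icu) and *.nrm, plus coll/*.icu and everything in brkitr/. A Python sketch of just that selection, working on relative path strings rather than touching the file system; the function name and the sample file list are assumptions for illustration:

```python
from fnmatch import fnmatchcase


def select_unicode_data_files(relative_paths):
    """Pick the files that the cp/rm steps above would leave in /tmp/icu4j."""
    selected = []
    for path in relative_paths:
        top_level = "/" not in path
        if top_level and path == "cnvalias.icu":
            continue  # copied by *.icu, then removed again
        if top_level and (fnmatchcase(path, "*.icu") or fnmatchcase(path, "*.nrm")):
            selected.append(path)
        elif fnmatchcase(path, "coll/*.icu") or fnmatchcase(path, "brkitr/*"):
            selected.append(path)
    return selected


# Illustrative listing of the icudt52b folder; not a complete or verified inventory.
sample = [
    "ubidi.icu", "uprops.icu", "cnvalias.icu", "nfc.nrm",
    "coll/ucadata.icu", "coll/root.res", "brkitr/word.brk", "root.res",
]
chosen = select_unicode_data_files(sample)
print(chosen)
```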
2198 * refresh Java test .txt files
2199 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2201 * UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files
2203 - get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
2204 - CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
2205 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2206 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2207 (note removing the underscore before "Rules")
2208 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
2209 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2210 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
2211 - check test file diffs for previously commented-out, known-failing data lines;
2212 probably need to keep those commented out
2213 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
2214 - run genuca, see command line above
2216 - refresh ICU4J collation data:
2217 (subset of instructions above for properties data refresh, except copies all coll/*)
2218 ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2219 ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
2220 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
2221 ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
2222 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
2223 - note on intltest: if collate/UCAConformanceTest fails, then
2224 utility/MultithreadTest/TestCollators will fail as well;
2225 fix the conformance test before looking into the multi-thread test
2227 * test ICU, fix test code where necessary
2229 * When refreshing all of ICU4J data from ICU4C
2230 - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2231 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
2233 - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
2235 *** LayoutEngine script information
2236 - skipped for Unicode 6.3: no new scripts
2238 *** merge the Unicode update branches back onto the trunk
2239 - do not merge the icudata.jar and testdata.jar,
2240 instead rebuild them from merged & tested ICU4C
2242 ---------------------------------------------------------------------------- ***
2246 http://www.unicode.org/review/pri230/
2247 http://www.unicode.org/versions/beta-6.2.0.html
2248 http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0
2249 http://www.unicode.org/review/pri227/ Changes to Script Extensions Property Values
2250 http://www.unicode.org/review/pri228/ Changing some common characters from Punctuation to Symbol
2251 http://www.unicode.org/review/pri229/ Linebreaking Changes for Pictographic Symbols
2252 http://www.unicode.org/reports/tr46/tr46-8.html IDNA
2253 http://unicode.org/Public/idna/6.2.0/
2257 - ticket 9515: Unicode 6.2: final ICU update
2259 - ticket 9514: UCA 6.2: fix UCARules.txt
2261 - ticket 9437: update ICU to Unicode 6.2
2262 - C++ branches/markus/uni62 at r32050 from trunk at r32041
2263 - Java branches/markus/uni62 at r32068 from trunk at r32066
2265 *** Unicode version numbers
2268 (configure.in & configure: have been modified to extract the version from uchar.h)
2269 - com.ibm.icu.util.VersionInfo
2270 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
2272 *** data files & enums & parser code
2276 - download UCD, UCA & IDNA files
2277 - make sure that the Unicode data folder passed into preparseucd.py
2278 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
2279 - modify preparseucd.py: NamesList.txt is now in UTF-8
2280 - ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src
2281 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2282 - Check test file diffs for previously commented-out, known-failing data lines;
2283 probably need to keep those commented out.
2285 * PropertyValueAliases.txt changes
2286 - 1 new Line_Break (lb) value:
2287 lb ; RI ; Regional_Indicator
2288 -> uchar.h & UCharacter.LineBreak
2289 - 1 new Word_Break (WB) value:
2290 WB ; RI ; Regional_Indicator
2291 -> uchar.h & UCharacter.WordBreak
2292 - 1 new Grapheme_Cluster_Break (GCB) value:
2293 GCB; RI ; Regional_Indicator
2294 -> uchar.h & UCharacter.GraphemeClusterBreak
2296 * 3 new numeric values
2297 The new value -1 was really supposed to be NaN, but that would have required
2298 new UnicodeData.txt syntax. It can already be represented as a "fraction" of -1/1,
2299 but encodeNumericValue() in corepropsbuilder.cpp had to be fixed.
2300 cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1
2301 cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1
2302 The two new values 216000 and 432000 require an addition to the encoding of numeric values.
2303 cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000
2304 cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000
2305 -> uprops.h, uchar.c & UCharacterProperty.java
2306 -> cucdtst.c & UCharacterTest.java
2308 * generate normalization data files
2309 - ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib
2310 - ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in
2311 - ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata
2312 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
2313 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
2314 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
2315 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
2317 * build ICU (make install)
2318 so that the tools build can pick up the new definitions from the installed header files.
2319 * build Unicode tools using CMake+make
2321 * generate core properties data files
2322 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src
2323 - in initial bootstrapping, change the UCA version
2324 in source/data/unidata/FractionalUCA.txt to match the new Unicode version
2325 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src
2326 - rebuild ICU (make install) & tools
2327 + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
2328 check if the UCA version in FractionalUCA.txt matches the new Unicode version
2330 - run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
2331 - rebuild ICU (make install) & tools
2333 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2334 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2335 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2336 - Unicode 6.0..6.2: U+2260, U+226E, U+226F
2337 - nothing new in 6.2, no test file to update
2339 * update Java data files
2340 - refresh just the UCD-related files, just to be safe
2341 - see (ICU4C)/source/data/icu4j-readme.txt
2343 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2346 Unicode .icu files built to ./out/build/icudt50l
2347 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b
2348 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b
2349 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
2350 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b
2351 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b"
2352 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/
2353 mkdir -p /tmp/icu4j/main/shared/data
2354 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2355 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/
2356 mkdir -p /tmp/icu4j/main/shared/data
2357 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2358 make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data'
2359 - copy the big-endian Unicode data files to another location,
2360 separate from the other data files
2361 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
2362 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
2363 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
2364 ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu
2365 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
2366 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
2367 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
2369 ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
2371 * refresh Java test .txt files
2372 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2376 - get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
2377 - CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
2378 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2379 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2380 (note removing the underscore before "Rules")
2381 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
2382 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2383 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
2384 - check test file diffs for previously commented-out, known-failing data lines;
2385 probably need to keep those commented out
2386 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
2387 - run genuca, see command line above
2389 - refresh ICU4J collation data:
2390 (subset of instructions above for properties data refresh, except copies all coll/*)
2391 ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2392 ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
2393 ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
2394 ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
2395 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
2396 - note on intltest: if collate/UCAConformanceTest fails, then
2397 utility/MultithreadTest/TestCollators will fail as well;
2398 fix the conformance test before looking into the multi-thread test
2400 * test ICU, fix test code where necessary
2402 * When refreshing all of ICU4J data from ICU4C
2403 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2404 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
2406 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
2408 *** LayoutEngine script information
2409 - skipped for Unicode 6.2: no new scripts
2411 *** merge the Unicode update branches back onto the trunk
2412 - do not merge the icudata.jar and testdata.jar,
2413 instead rebuild them from merged & tested ICU4C
2415 ---------------------------------------------------------------------------- ***
2417 Future Unicode update
2419 The tools have been simplified since the Unicode 6.1 update. See
2420 - http://site.icu-project.org/design/props/ppucd
2421 - http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972
2423 * Unicode version numbers
2424 - icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates
2427 - ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:
2428 - ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src
2429 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2430 - Check test file diffs for previously commented-out, known-failing data lines;
2431 probably need to keep those commented out.
2433 * PropertyValueAliases.txt changes
2434 - Script codes that are in ISO 15924 but not in Unicode are now listed in
2435 preparseucd.py, in the _scripts_only_in_iso15924 variable.
2436 If there are new ISO codes, then add them.
2437 If Unicode adds some of them, then remove them from the .py variable.
2439 * UnicodeData.txt changes
2440 - No more manual changes for CJK ranges for algorithmic names;
2441 those are now written to ppucd.txt and genprops reads them from there.
2443 * generate core properties data files (makeprops.sh was deleted)
2444 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src
2446 * no more manual updates of source/data/unidata/norm2/nfkc_cf.txt
2447 - it is now generated by preparseucd.py
2449 * no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt
2450 - it is now generated by preparseucd.py
2451 - make sure that the Unicode data folder passed into preparseucd.py
2452 includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
2453 (can be in some subfolder)
2455 * generate normalization data files
2456 - ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib
2457 - ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in
2458 - ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata
2459 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
2460 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
2461 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
2462 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
2464 * build ICU (make install)
2465 * build Unicode tools using CMake+make
2467 * new way to call genuca (makeuca.sh was deleted)
2468 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src
2470 ---------------------------------------------------------------------------- ***
2476 - ticket 8995 final update to Unicode 6.1
2477 - ticket 8994 regenerate source/layout/CanonData.cpp
2479 - ticket 8961 support Unicode "Age" value *names*
2480 - ticket 8963 support multiple character name aliases & types
2482 - ticket 8827 "update ICU to Unicode 6.1"
2483 - C++ branches/markus/uni61 at r30864 from trunk at r30843
2484 - Java branches/markus/uni61 at r30865 from trunk at r30863
2486 *** Unicode version numbers
2489 (configure.in & configure: have been modified to extract the version from uchar.h)
2490 - com.ibm.icu.util.VersionInfo
2491 - icutools/unicode/makedefs.sh
2492 + also review & update other definitions in that file,
2493 e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l
2495 *** data files & enums & parser code
2499 ~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed
2500 - This prepares both unidata and testdata files in respective output subfolders.
2501 - Check test file diffs for previously commented-out, known-failing data lines;
2502 probably need to keep those commented out.
2504 * PropertyValueAliases.txt changes
2505 - 11 new block names:
2507 Arabic_Mathematical_Alphabetic_Symbols
2509 Meetei_Mayek_Extensions
2511 Meroitic_Hieroglyphs
2515 Sundanese_Supplement
2518 -> add to UCharacter.UnicodeBlock IDs
2519 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
2520 replace public static final int \1_ID = \2; \3
2521 -> add to UCharacter.UnicodeBlock objects
2522 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
2523 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
2524 - 1 new Joining_Group (jg) value:
2526 -> uchar.h & UCharacter.JoiningGroup
2527 - 2 new Line_Break (lb) values:
2528 CJ=Conditional_Japanese_Starter
2530 -> uchar.h & UCharacter.LineBreak
2533 sc ; Merc ; Meroitic_Cursive
2534 sc ; Mero ; Meroitic_Hieroglyphs
2537 sc ; Sora ; Sora_Sompeng
2539 -> remove these from SyntheticPropertyValueAliases.txt
2540 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
2541 and in com.ibm.icu.dev.test.lang.TestUScript.java
2542 - 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
2546 and another one added 2011-12-09
2547 Hluw 080 Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)
2549 -> com.ibm.icu.lang.UScript
2550 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2551 replace public static final int \1 = \2;\3
2552 -> SyntheticPropertyValueAliases.txt
2553 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
2554 and in com.ibm.icu.dev.test.lang.TestUScript.java
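The two Eclipse find/replace patterns for the new block names (UnicodeBlock IDs and UnicodeBlock objects, earlier in this list) can be tried out in Python as well. A minimal sketch, assuming the uchar.h lines look the way the patterns expect; the block name, value and comment in the sample are placeholders, not real constants:

```python
import re

# Patterns and replacements copied from the two Eclipse steps above.
UBLOCK_ID = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
ID_REPLACEMENT = r"public static final int \1_ID = \2; \3"

UBLOCK_OBJECT = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
OBJECT_REPLACEMENT = (
    r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2'
)

# Hypothetical uchar.h line; name, value and comment are made up.
sample = "UBLOCK_EXAMPLE_BLOCK = 999, /*[E000]*/"

id_line = UBLOCK_ID.sub(ID_REPLACEMENT, sample)
object_line = UBLOCK_OBJECT.sub(OBJECT_REPLACEMENT, sample)
print(id_line)
print(object_line)
```

Both replacements start from the same C line, so the same block of copied enum lines has to be pasted twice in the Java file, once per pattern.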
2556 * UnicodeData.txt changes
2557 - the last Unihan code point changes from U+9FCB to U+9FCC
2558 search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive)
2559 + do change gennames.c
2560 + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java
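The search for the old end and limit values can be scripted with the same regex. A small Python sketch; the sample source lines and identifier names are hypothetical and only show what the case-insensitive pattern does and does not catch:

```python
import re

# Regex from the step above: matches 9FCB (old end) and 9FCC (old limit).
UNIHAN_END_OR_LIMIT = re.compile(r"9FC[BC]", re.IGNORECASE)


def find_hits(text: str):
    """Return (line number, stripped line) for every line that matches."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if UNIHAN_END_OR_LIMIT.search(line)
    ]


# Hypothetical source lines; the identifiers are placeholders.
sample = """\
#define EXAMPLE_CJK_END 0x9fcb
static const int exampleLimit = 0x9FCC;
int unrelated = 0x9FCA;
"""
hits = find_hits(sample)
for number, line in hits:
    print(number, line)
```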
2562 * DerivedBidiClass.txt changes
2563 - 2 new default-AL blocks:
2564 # Arabic Extended-A: U+08A0 - U+08FF (was default-R)
2565 # Arabic Mathematical Alphabetic Symbols:
2566 # U+1EE00 - U+1EEFF (was default-R)
2567 - 2 new default-R blocks:
2568 # Meroitic Hieroglyphs:
2570 # Meroitic Cursive: U+109A0 - U+109FF
2571 -> should be picked up by the explicit data in the file
2573 * NameAliases.txt changes
2575 # Each line has two fields
2576 # First field: Code point
2577 # Second field: Alias
2579 # Each line has three fields, as described here:
2581 # First field: Code point
2582 # Second field: Alias
2584 - Also, the file format previously allowed multiple aliases per code point, but only now does it
2585 actually provide them, including several of the same type. For example,
2586 FEFF;BYTE ORDER MARK;alternate
2587 FEFF;BOM;abbreviation
2588 FEFF;ZWNBSP;abbreviation
2589 - This breaks our gennames parser, unames.icu data structure, and API.
2590 Fix gennames to only pick up "correction" aliases.
2591 New ticket #8963 for further changes.
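The "correction only" filter can be sketched as follows in Python. This is not the gennames code, just an illustration of the idea on the three-field format; the FEFF lines are the ones quoted above, while the E000 line is a made-up correction entry on a private-use code point:

```python
def correction_aliases(lines):
    """Return (code point, alias) pairs for aliases of type "correction"."""
    result = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # skip comments and blank lines
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) != 3:
            continue  # not in the three-field format
        code, alias, alias_type = fields
        if alias_type == "correction":
            result.append((int(code, 16), alias))
    return result


sample = [
    "# Each line has three fields",
    "FEFF;BYTE ORDER MARK;alternate",
    "FEFF;BOM;abbreviation",
    "FEFF;ZWNBSP;abbreviation",
    "E000;EXAMPLE CORRECTED NAME;correction",  # hypothetical entry
]
corrections = correction_aliases(sample)
print(corrections)
```

With this filter the FEFF aliases are dropped entirely, which matches the interim behavior described above until ticket #8963 adds support for multiple aliases and types.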
2593 * run genpname/preparse.pl (on Linux)
2594 + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
2595 + make sure that data.h is writable
2596 + perl preparse.pl ~/svn.icu/trunk/src > out.txt
2597 + preparse.pl shows no errors, out.txt Info and Warning lines look ok
2599 * build ICU (make install)
2600 so that the tools build can pick up the new definitions from the installed header files.
2601 * build Unicode tools (at least genpname) using CMake+make
2604 (builds both pnames.icu and propname_data.h)
2605 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
2606 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
2608 * build ICU (make install)
2609 * build Unicode tools using CMake+make
2611 * update source/data/unidata/norm2/nfkc_cf.txt
2612 - follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
2614 * update source/data/unidata/norm2/uts46.txt
2615 - download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
2616 to ~/svn.icu/tools/trunk/src/unicode/py
2617 - adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".
2618 - ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
2619 - ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
2621 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2622 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2623 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2624 - Unicode 6.0..6.1: U+2260, U+226E, U+226F
2625 - nothing new in 6.1, no test file to update
2627 * generate core properties data files
2628 - in initial bootstrapping, change the UCA version
2629 in source/data/unidata/FractionalUCA.txt to match the new Unicode version
2630 - ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
2631 - rebuild ICU & tools
2632 + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
2633 check if the UCA version in FractionalUCA.txt matches the new Unicode version
2635 - run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:
2636 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
2637 - rebuild ICU & tools
2639 * update Java data files
2640 - refresh just the UCD-related files, just to be safe
2641 - see (ICU4C)/source/data/icu4j-readme.txt
2643 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2646 Unicode .icu files built to ./out/build/icudt49l
2647 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b
2648 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b
2649 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
2650 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b
2651 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b"
2652 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/
2653 mkdir -p /tmp/icu4j/main/shared/data
2654 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2655 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/
2656 mkdir -p /tmp/icu4j/main/shared/data
2657 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2658 make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data'
2659 - copy the big-endian Unicode data files to another location,
2660 separate from the other data files
2661 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
2662 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
2663 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
2664 ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu
2665 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
2666 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
2667 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
2669 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
2671 * refresh Java test .txt files
2672 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2674 * test ICU so far, fix test code where necessary
2675 - temporarily ignore collation issues that look like UCA/UCD mismatches,
2676 until UCA data is updated
2680 - get output from Mark's tools; look in
2681 http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. version>.txt
2682 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2683 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2684 (note removing the underscore before "Rules")
2685 - update (ICU)/source/test/testdata/CollationTest_*.txt
2686 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2687 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
2688 - check test file diffs for previously commented-out, known-failing data lines;
2689 probably need to keep those commented out
2690 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
2692 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
2694 - refresh ICU4J collation data:
2695 (subset of instructions above for properties data refresh, except copies all coll/*)
2696 ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2697 ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
2698 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
2699 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
2700 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
2701 - note on intltest: if collate/UCAConformanceTest fails, then
2702 utility/MultithreadTest/TestCollators will fail as well;
2703 fix the conformance test before looking into the multi-thread test
2705 * When refreshing all of ICU4J data from ICU4C
2706 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2707 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
2709 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
2711 *** LayoutEngine script information
2713 (For details see the Unicode 5.2 change log below.)
2715 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
2716 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
2717 in the working directory.
2718 (It also generates ScriptRunData.cpp, which is no longer needed.)
2720 The generated files have a current copyright date and "@draft" statement.
2722 - diff current <icu>/source/layout files vs. generated ones
2723 ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
2724 review and manually merge desired changes;
2725 fix gratuitous changes, incorrect @draft and missing aliases;
2726 Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
2727 - if you just copy the above files, then
2728 fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
2729 manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
2731 *** merge the Unicode update branches back onto the trunk
2732 - do not merge the icudata.jar and testdata.jar;
2733 instead rebuild them from merged & tested ICU4C
2735 ---------------------------------------------------------------------------- ***
2737 ICU 4.8 (no Unicode update, just new script codes)
2739 * 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
2745 Shrd 319 Sharada, Śāradā
2746 Sora 398 Sora Sompeng
2747 Takr 321 Takri, Ṭākrī, Ṭāṅkrī
2751 -> com.ibm.icu.lang.UScript
2752 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2753 replace public static final int \1 = \2;\3
2754 -> genpname/SyntheticPropertyValueAliases.txt
2755 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
2756 and in com.ibm.icu.dev.test.lang.TestUScript.java
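The find/replace pair above can also be run outside an editor. A minimal Python sketch, assuming the uscript.h enum lines look like the sample below; the helper name uscript_to_java is hypothetical, and the sample line is illustrative and should be checked against uscript.h:

```python
import re

# Same pattern and replacement as in the notes above, in Python syntax.
USCRIPT_ENUM = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
JAVA_CONSTANT = r"public static final int \1 = \2;\3"


def uscript_to_java(line):
    """Rewrite one uscript.h enum line as a UScript.java constant (hypothetical helper)."""
    return USCRIPT_ENUM.sub(JAVA_CONSTANT, line)


if __name__ == "__main__":
    # Illustrative input; verify names and values against the real header.
    print(uscript_to_java("    USCRIPT_SHARADA = 151,/* Shrd */"))
```

Lines that do not match the pattern pass through unchanged, so the helper can be applied to a whole copied enum block.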
2758 * run genpname/preparse.pl (on Linux)
2759 + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
2760 + make sure that data.h is writable
2761 + perl preparse.pl ~/svn.icu/trunk/src > out.txt
2762 + preparse.pl shows no errors, out.txt Info and Warning lines look ok
2764 * rebuild Unicode tools (at least genpname) using make
2765 - You might first need to "make install" ICU so that the tools build can pick
2766 up the new definitions from the installed header files.
2769 (builds both pnames.icu and propname_data.h)
2770 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
2771 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
2772 - rebuild ICU & tools
2775 - ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
2776 - ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
2777 - rebuild ICU & tools
2779 * update Java data files
2780 - refresh just the UCD-related files, just to be safe
2781 - see (ICU4C)/source/data/icu4j-readme.txt
2783 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2784 - copy the big-endian Unicode data files to another location,
2785 separate from the other data files
2786 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
2787 ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
2788 ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
2790 ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b
2792 * should have updated the layout engine script codes but forgot
2794 ---------------------------------------------------------------------------- ***
2798 *** related ICU Trac tickets
2800 7264 Unicode 6.0 Update
2802 *** Unicode version numbers
2805 (configure.in & configure have been modified to extract the version from uchar.h)
2806 - com.ibm.icu.util.VersionInfo
2808 *** data files & enums & parser code
2812 ~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
2813 - This now prepares both unidata and testdata files in respective output subfolders.
2815 * PropertyAliases.txt changes
2816 - new Script_Extensions property defined in the new ScriptExtensions.txt file
2817 but not listed in PropertyAliases.txt; reported to unicode.org;
2818 -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
2819 scx; Script_Extensions
2820 -> uchar.h with new UProperty section
2821 -> com.ibm.icu.lang.UProperty, parallel with uchar.h
2823 * PropertyValueAliases.txt changes
2824 - 12 new block names:
2829 CJK_Unified_Ideographs_Extension_D
2834 Miscellaneous_Symbols_And_Pictographs
2836 Transport_And_Map_Symbols
2838 -> add to UCharacter.UnicodeBlock
2839 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
2840 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
2841 - Joining_Group (jg) values:
2842 Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias
2843 -> uchar.h & UCharacter.JoiningGroup
2848 -> remove these from SyntheticPropertyValueAliases.txt
2849 -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
2850 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
2851 and in com.ibm.icu.dev.test.lang.TestUScript.java
2852 - 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
2853 (added 2009-11-11..2010-07-18)
2855 Dupl 755 Duployan shorthand
2861 Merc 101 Meroitic Cursive
2862 Narb 106 Old North Arabian
2866 Wara 262 Warang Citi
2868 -> com.ibm.icu.lang.UScript
2869 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2870 replace public static final int \1 = \2;\3
2871 -> SyntheticPropertyValueAliases.txt
2872 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
2873 and in com.ibm.icu.dev.test.lang.TestUScript.java
2874 - ISO 15924 name change
2875 Mero 100 Meroitic Hieroglyphs (was Meroitic)
2876 -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
2877 - property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt
2879 * UnicodeData.txt changes
2881 2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
2882 2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
2883 -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion
2885 * build Unicode tools using CMake+make
2887 * run genpname/preparse.pl (on Linux)
2888 + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
2889 + make sure that data.h is writable
2890 + perl preparse.pl ~/svn.icu/trunk/src > out.txt
2891 + preparse.pl shows no errors, out.txt Info and Warning lines look ok
2893 * rebuild Unicode tools (at least genpname) using make
2894 - You might first need to "make install" ICU so that the tools build can pick
2895 up the new definitions from the installed header files.
2898 - ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
2899 - rebuild ICU & tools
2901 * update source/data/unidata/norm2/nfkc_cf.txt
2902 - follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
2904 * update source/data/unidata/norm2/uts46.txt
2905 - download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
2906 to ~/svn.icu/tools/trunk/src/unicode/py
2907 - adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
2908 - ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
2909 - ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
2911 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2912 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2913 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2914 - Unicode 6.0: U+2260, U+226E, U+226F
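The grep step above can be sketched in Python. This assumes the semicolon-delimited "range ; status" layout of IdnaMappingTable.txt with trailing # comments; the function name and the sample lines are illustrative, not taken from the real file:

```python
def non_ascii_std3_valid(lines):
    """Yield (first, last) code point ranges marked disallowed_STD3_valid above U+007F."""
    for raw in lines:
        data = raw.split("#", 1)[0].strip()      # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        if end > 0x7F:
            # Clip ranges that begin in ASCII so only the non-ASCII part is reported.
            yield max(start, 0x80), end


# Illustrative lines in the assumed file layout.
SAMPLE = """\
0000..002C    ; disallowed_STD3_valid   # <control>..COMMA
0041          ; mapped ; 0061           # LATIN CAPITAL LETTER A
2260          ; disallowed_STD3_valid   # NOT EQUAL TO
226E..226F    ; disallowed_STD3_valid   # NOT LESS-THAN..NOT GREATER-THAN
""".splitlines()

if __name__ == "__main__":
    for start, end in non_ascii_std3_valid(SAMPLE):
        print(f"U+{start:04X}..U+{end:04X}")
```

Any range reported here is a candidate for new test cases in uts46test.cpp and UTS46Test.java.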
2916 * generate core properties data files
2917 - ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
2918 - rebuild ICU & tools
2919 - run makeuca.sh so that genuca picks up the new nfc.nrm:
2920 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
2921 - rebuild ICU & tools
2923 * implement new Script_Extensions property (provisional)
2924 - parser & generator: genprops & uprops.icu
2925 - uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
2926 - UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java
2928 * switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
2930 - genbidi/gencase/genprops tools changes
2931 - re-run makeprops.sh (see above)
2932 - UCharacterProperty.java, UCharacterTypeIterator.java,
2933 UBiDiProps.java, UCaseProps.java, and several others with minor changes;
2934 UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java
2936 * update Java data files
2937 - refresh just the UCD-related files, just to be safe
2938 - see (ICU4C)/source/data/icu4j-readme.txt
2940 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2943 Unicode .icu files built to ./out/build/icudt45l
2944 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
2945 echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
2946 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
2947 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
2948 mkdir -p /tmp/icu4j/main/shared/data
2949 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2950 - copy the big-endian Unicode data files to another location,
2951 separate from the other data files
2952 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
2953 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
2954 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
2955 ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
2956 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
2957 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
2958 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
2960 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
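The mkdir/cp/rm sequence above can be sketched as one Python helper. This is only an illustration of the copy step: the function name stage_unicode_data is hypothetical, the jar uf step is not included, and the caller supplies the icudt directory paths:

```python
import shutil
from pathlib import Path


def stage_unicode_data(src, dst):
    """Copy the Unicode-related data files from src to dst, mirroring the cp/rm steps."""
    src, dst = Path(src), Path(dst)
    copied = []
    # (subdirectory, glob pattern) pairs corresponding to the cp commands in the notes.
    for subdir, pattern in (("", "*.icu"), ("", "*.nrm"), ("coll", "*.icu"), ("brkitr", "*")):
        target = dst / subdir
        target.mkdir(parents=True, exist_ok=True)
        for path in sorted((src / subdir).glob(pattern)):
            # The notes remove cnvalias.icu again after copying, so skip it here.
            if subdir == "" and path.name == "cnvalias.icu":
                continue
            if path.is_file():
                shutil.copy2(path, target / path.name)
                copied.append((Path(subdir) / path.name).as_posix())
    return copied
```

The returned list of relative paths can be compared against the previous release to spot files that appeared or disappeared.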
2962 * refresh Java test .txt files
2963 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2965 * un-hardcode normalization skippable (NF*_Inert) test data
2966 - removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools
2968 * copy updated break iterator test files
2969 - now handled by early ucdcopy.py and
2970 copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
2972 copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
2973 to ~/svn.icu/trunk/src/source/test/testdata)
2974 - they are not used in ICU4J
2978 - get output from Mark's tools; look in
2979 http://www.unicode.org/~book/incoming/mark/uca6.0.0/
2980 http://www.macchiato.com/unicode/utc/additional-uca-files
2981 http://www.unicode.org/Public/UCA/6.0.0/
2982 http://www.unicode.org/~mdavis/uca/
2983 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2984 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2985 - update Han-implicit ranges for new CJK extensions:
2986 swapCJK() in ucol.cpp & ImplicitCEGenerator.java
2987 - genuca: allow bytes 02 for U+FFFE, new merge-sort character;
2988 do not add it into invuca so that tailoring primary-after an ignorable works
2989 - genuca: permit space between [variable top] bytes
2990 - ucol.cpp: treat noncharacters like unassigned rather than ignorable
2992 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
2994 - refresh ICU4J collation data:
2995 (subset of instructions above for properties data refresh, except copies all coll/*)
2996 ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2997 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
2998 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
2999 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
3000 - update (ICU)/source/test/testdata/CollationTest_*.txt
3001 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
3002 with output from Mark's Unicode tools
3003 - run all tests with the *_SHORT.txt or the full files (the full ones have comments)
3004 - note on intltest: if collate/UCAConformanceTest fails, then
3005 utility/MultithreadTest/TestCollators will fail as well;
3006 fix the conformance test before looking into the multi-thread test
3008 * When refreshing all of ICU4J data from ICU4C
3009 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3010 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
3012 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
3014 *** LayoutEngine script information
3016 (For details see the Unicode 5.2 change log below.)
3018 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
3019 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
3020 ScriptRunData.cpp, which is no longer needed.)
3022 The generated files have a current copyright date and "@draft" statement.
3024 * copy the above files into <icu>/source/layout, replacing the old files.
3025 * fix mixed line endings
3026 * review the diffs and fix incorrect @draft and missing aliases;
3027 Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
3028 * manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
3030 ---------------------------------------------------------------------------- ***
3034 *** related ICU Trac tickets
3038 7167 verify collation bytes
3039 7235 Java test NAME_ALIAS
3040 7236 Java DerivedCoreProperties.txt test
3041 7237 Java BidiTest.txt
3042 7238 UTrie2 in core unidata
3043 7239 test for tailoring gaps
3044 7240 Java fix CollationMiscTest
3045 7243 update layout engine for Unicode 5.2
3047 *** Unicode version numbers
3050 - configure.in & configure
3051 - update ucdVersion in gennames.c if an algorithmic range changes
3053 *** data files & enums & parser code
3057 python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
3058 - includes finding files regardless of version numbers,
3059 copying them, and performing the equivalent processing of the
3060 ucdstrip and ucdmerge tools on the desired set of files
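"Finding files regardless of version numbers" can be illustrated with a small filename normalizer. This is a sketch, not the code in ucdcopy.py; the suffix shapes ("-5.2.0", "-5.2.0d10") and the function name are assumptions:

```python
import re

# Matches a version suffix such as "-5.2.0" or "-5.2.0d10" directly before the extension.
_VERSION_SUFFIX = re.compile(r"-\d+(\.\d+)*(d\d+)?(?=\.[A-Za-z]+$)")


def strip_version(filename):
    """Return the file name without its version suffix (hypothetical helper)."""
    return _VERSION_SUFFIX.sub("", filename)


if __name__ == "__main__":
    for name in ("UnicodeData-5.2.0d10.txt", "GraphemeBreakTest-5.2.0.txt", "Blocks.txt"):
        print(name, "->", strip_version(name))
```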
3063 - PropertyAliases.txt
3064 moved from numeric to enumerated:
3065 ccc ; Canonical_Combining_Class
3066 new string properties:
3067 NFKC_CF ; NFKC_Casefold
3068 Name_Alias; Name_Alias
3069 new binary properties:
3072 CWCF ; Changes_When_Casefolded
3073 CWCM ; Changes_When_Casemapped
3074 CWKCF ; Changes_When_NFKC_Casefolded
3075 CWL ; Changes_When_Lowercased
3076 CWT ; Changes_When_Titlecased
3077 CWU ; Changes_When_Uppercased
3078 new CJK Unihan properties (not supported by ICU)
3079 - PropertyValueAliases.txt
3082 one script code change:
3083 sc ; Qaai ; Inherited
3085 sc ; Zinh ; Inherited ; Qaai
3086 new Line_Break (lb) value:
3087 lb ; CP ; Close_Parenthesis
3088 new Joining_Group (jg) values: Farsi_Yeh, Nya
3090 ccc; 214; ATA ; Attached_Above
3091 - DerivedBidiClass.txt
3092 new default-R range: U+1E800 - U+1EFFF
3094 all of the ISO comments are gone
3096 9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
3098 2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
3099 2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;
3103 + cd \svn\icuproj\icu\trunk\source\tools\genpname
3104 + make sure that data.h is writable
3105 + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
3106 + preparse.pl complains with errors like the following:
3107 Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
3108 This is because ICU 4.0 had scripts from ISO 15924 which are now
3109 added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
3110 and PropertyValueAliases.txt.
3111 -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
3112 Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
3113 + preparse.pl complains with errors about block names missing from uchar.h; add them
3115 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
3116 - new block & script values
3118 copy new blocks from Blocks.txt
3119 MS VC++ 2008 regular expression:
3120 find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
3121 replace with " UBLOCK_\3 = 172, /*[\1]*/"
3122 + several new script values already added in ICU 4.0 for ISO 15924 coverage
3123 (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
3124 + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
3125 + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
3126 (added to SyntheticPropertyValueAliases.txt)
3127 - new Joining Group (JG) values: Farsi_Yeh, Nya
3128 - new Line_Break (lb) value:
3129 lb ; CP ; Close_Parenthesis
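The Blocks.txt find/replace from the notes above can be sketched in Python. Going beyond the editor regex, this version also upper-cases the block name and numbers the constants sequentially; both additions and the function name are assumptions, and the output still needs review against uchar.h:

```python
import re

# Python form of the Visual Studio pattern: start..end; Block Name
BLOCK_LINE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+); ([A-Z].+)$")


def blocks_to_enum(lines, first_value):
    """Turn Blocks.txt data lines into UBLOCK_ enum lines, numbered from first_value."""
    out = []
    value = first_value
    for line in lines:
        match = BLOCK_LINE.match(line.rstrip())
        if not match:
            continue  # comment and blank lines do not match
        # "Hangul Jamo Extended-A" -> "HANGUL_JAMO_EXTENDED_A"
        name = re.sub(r"[^A-Za-z0-9]+", "_", match.group(3)).upper()
        out.append(f" UBLOCK_{name} = {value}, /*[{match.group(1)}]*/")
        value += 1
    return out


if __name__ == "__main__":
    sample = ["# illustrative subset", "0800..083F; Samaritan", "A960..A97F; Hangul Jamo Extended-A"]
    print("\n".join(blocks_to_enum(sample, 172)))
```

Feed it only the lines for new blocks, in Blocks.txt order, and pass the first unused enum value.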
3131 * hardcoded Unihan range end/limit
3132 - Unihan range end moves from 9FC3 to 9FCB
3133 search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
3134 + do change gennames.c
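The case-insensitive search for the old end/limit values can be scripted. A minimal sketch; the function name and the list of file suffixes are assumptions, and the pattern must be adjusted for each Unicode version:

```python
import re
from pathlib import Path


def find_hardcoded(root, pattern=r"9FC[34]", suffixes=(".c", ".cpp", ".h", ".java")):
    """Return (relative path, line number, line) for each source line matching pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((path.relative_to(root).as_posix(), number, line.strip()))
    return hits
```

Each hit still needs a manual look, since the same hex digits can appear in unrelated constants.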
3136 * Compare definitions of new binary properties with what we used to use
3137 in algorithms, to see if the definitions changed.
3138 - Verified that definitions for Cased and Case_Ignorable are unchanged.
3139 The gencase tool now parses the newly public Case_Ignorable values
3140 in case the definition changes in the future.
3142 * uchar.c & uprops.h & uprops.c & genprops
3143 - new numeric values that didn't exist in Unicode data before:
3144 1/7, 1/9, 1/10, 3/10, 1/16, 3/16
3145 the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
3146 therefore redesign the encoding of numeric types and values for formatVersion 6;
3147 design for simple numbers up to at least 144 ("one gross"),
3148 large values up to at least 10^20,
3149 and fractions with numerators -1..17 and denominators 1..16
3150 to cover current and expected future values
3151 (e.g., more Han numeric values, Meroitic twelfths)
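To illustrate the fraction range in the design above (numerators -1..17, denominators 1..16), here is one possible packing. The bit layout is an assumption made for this sketch and may differ from the real formatVersion 6 layout in uprops.h:

```python
NUMERATOR_MIN, NUMERATOR_MAX = -1, 17
DENOMINATOR_MIN, DENOMINATOR_MAX = 1, 16


def pack_fraction(numerator, denominator):
    """Pack a fraction into a small integer: biased numerator above 4 denominator bits."""
    if not NUMERATOR_MIN <= numerator <= NUMERATOR_MAX:
        raise ValueError("numerator out of range")
    if not DENOMINATOR_MIN <= denominator <= DENOMINATOR_MAX:
        raise ValueError("denominator out of range")
    return ((numerator - NUMERATOR_MIN) << 4) | (denominator - DENOMINATOR_MIN)


def unpack_fraction(code):
    """Inverse of pack_fraction; returns (numerator, denominator)."""
    return (code >> 4) + NUMERATOR_MIN, (code & 0xF) + DENOMINATOR_MIN


if __name__ == "__main__":
    # The fractions listed in the notes as new in Unicode 5.2.
    for fraction in [(1, 7), (1, 9), (1, 10), (3, 10), (1, 16), (3, 16)]:
        print(fraction, "->", pack_fraction(*fraction))
```

With 19 numerators and 16 denominators this range needs 304 distinct codes, which is why it cannot share a single byte with the other numeric types.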
3153 * reimplement Hangul_Syllable_Type for new Jamo characters
3154 - the old code assumed that all Jamo characters are in the 11xx block
3155 - Unicode 5.2 fills holes there and adds new Jamo characters in
3156 A960..A97F; Hangul Jamo Extended-A
3158 D7B0..D7FF; Hangul Jamo Extended-B
3159 - Hangul_Syllable_Type can be trivially derived from a subset of
3160 Grapheme_Cluster_Break values
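The derivation mentioned above is a direct value mapping. A minimal sketch using the short property value aliases; the function name is hypothetical, and looking up the Grapheme_Cluster_Break value of a character is left to the caller:

```python
# Grapheme_Cluster_Break values that correspond one-to-one to Hangul_Syllable_Type values.
GCB_TO_HST = {"L": "L", "V": "V", "T": "T", "LV": "LV", "LVT": "LVT"}


def hangul_syllable_type(gcb):
    """Map a GCB short alias to the hst short alias; everything else is NA (Not_Applicable)."""
    return GCB_TO_HST.get(gcb, "NA")
```

Because the mapping keys off property data rather than code point ranges, new Jamo blocks need no code change.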
3162 * build Unicode data source code for hardcoding core data
3163 C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data
3165 ICU data make path is \svn\icuproj\icu\trunk\source\data\
3166 ICU root path is \svn\icuproj\icu\trunk
3167 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
3168 Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
3169 Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
3170 Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
3171 Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
3172 Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
3173 Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
3174 Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
3175 Creating data file for Unicode Property Names
3176 Creating data file for Unicode Character Properties
3177 Creating data file for Unicode Case Mapping Properties
3178 Creating data file for Unicode BiDi/Shaping Properties
3179 Creating data file for Unicode Normalization
3180 Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
3181 Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"
3183 - copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
3184 and rebuild the common library
3188 - update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
3189 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
3190 - update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
3191 [ Begin obsolete instructions:
3192 Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files.
3193 - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
3195 python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
3196 python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
3197 End obsolete instructions]
3198 - run all tests with the *_SHORT.txt or the full files (the full ones have comments)
3199 not just the *_STUB.txt files
3200 - note on intltest: if collate/UCAConformanceTest fails, then
3201 utility/MultithreadTest/TestCollators will fail as well;
3202 fix the conformance test before looking into the multi-thread test
3204 *** Implement Cased & Case_Ignorable properties
3205 - via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
3206 - Problem: These properties should be disjoint, but aren't
3207 - UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
3208 - change ucase.icu to be able to store any combination of Cased and Case_Ignorable
3210 *** Implement Changes_When_Xyz properties
3211 - without stored data
3213 *** Implement Name_Alias property
3214 - add it as another name field in unames.icu
3215 - make it available via u_charName() and UCharNameChoice
3216 - consider it in u_charFromName()
3220 * Update break iterator rules to new UAX versions and new property values
3221 * Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary
3223 *** new BidiTest file
3224 - review format and data
3225 - copy BidiTest.txt to source/test/testdata
3226 - write test code using this data
3227 - fix ICU code where it fails the conformance test
3230 - generally, find and update code corresponding to C/C++
3231 - UCharacter.UnicodeBlock constants:
3232 a) add an _ID integer per new block, update COUNT
3233 b) add a class instance per new block
3234 Visual Studio regex:
3235 find UBLOCK_{[^ ]+} = [0-9]+, {/.+}
3236 replace with public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
3237 - CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()
3239 - port test changes to Java
3241 *** LayoutEngine script information
3243 (For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)
3245 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
3246 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
3247 ScriptRunData.cpp, which is no longer needed.)
3249 The generated files have a current copyright date and "@draft" statement.
3251 -> Eric Mader wrote in email on 20090930:
3252 "I think the tool has been modified to update @draft to @stable for
3253 older scripts and to add @draft for new scripts.
3254 (I worked with an intern on this last year.)
3255 You should check the output after you run it."
3257 * copy the above files into <icu>/source/layout, replacing the old files.
3258 * fix mixed line endings
3259 * review the diffs and fix incorrect @draft and missing aliases
3260 * manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
3262 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
3263 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
3265 -> Eric Mader wrote in email on 20090930:
3266 "This is just a matter of making sure that all the per-script tables have
3267 entries for any new scripts that were added.
3268 If any new Indic characters were added, then the class tables in
3269 IndicClassTables.cpp should be updated to reflect this.
3270 John Emmons should know how to do this if it's required."
3272 * rebuild the layout and layoutex libraries.
3276 + Jamo_Short_Name, sfc->scf, binary property value aliases
3278 ---------------------------------------------------------------------------- ***
3282 *** related ICU Trac tickets
3284 5696 Update to Unicode 5.1
3286 *** Unicode version numbers
3289 - configure.in & configure
3290 - update ucdVersion in gennames.c if an algorithmic range changes
3292 *** data files & enums & parser code
3296 DerivedCoreProperties.txt
3297 DerivedNormalizationProps.txt
3298 NormalizationTest.txt
3301 GraphemeBreakProperty.txt
3302 SentenceBreakProperty.txt
3303 WordBreakProperty.txt
3304 - ucdstrip and ucdmerge:
3308 * my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
3309 copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
3310 copy 5.1.0\ucd\Blocks.txt ..\unidata\
3311 copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
3312 copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
3313 copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
3314 copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
3315 copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
3316 copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
3317 copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
3318 copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
3319 copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
3320 copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
3321 copy 5.1.0\ucd\UnicodeData.txt ..\unidata\
3323 ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
3324 ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
3325 ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
3326 ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
3327 ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
3328 ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
3329 ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
3330 ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
3331 ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
3332 ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
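For readers without the C tools, the two filters in the batch file can be approximated in Python. This sketch assumes that ucdstrip removes trailing comments from data lines and that ucdmerge merges adjacent ranges with identical data; it is not the tools' actual implementation, and edge cases may differ:

```python
def ucdstrip(lines):
    """Remove '#' comments that follow data; comment-only and empty lines pass through."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.lstrip().startswith("#"):
            yield line
        else:
            yield line.split("#", 1)[0].rstrip()


def _parse(line):
    code_range, _, value = line.partition(";")
    first, _, last = code_range.strip().partition("..")
    start = int(first, 16)
    return start, (int(last, 16) if last else start), value.strip()


def _format(start, end, value):
    code_range = f"{start:04X}" if start == end else f"{start:04X}..{end:04X}"
    return f"{code_range};{value}"


def ucdmerge(lines):
    """Merge adjacent code point ranges that carry the same value."""
    current = None
    for line in lines:
        if ";" not in line or line.lstrip().startswith("#"):
            if current:               # flush before passing a non-data line through
                yield _format(*current)
                current = None
            yield line
            continue
        start, end, value = _parse(line)
        if current and current[2] == value and current[1] + 1 == start:
            current = (current[0], end, value)
        else:
            if current:
                yield _format(*current)
            current = (start, end, value)
    if current:
        yield _format(*current)
```

Chaining them as ucdmerge(ucdstrip(lines)) corresponds to the "ucdstrip < in | ucdmerge > out" lines above.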
3336 + cd \svn\icuproj\icu\uni51\source\tools\genpname
3337 + make sure that data.h is writable
3338 + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
3339 + preparse.pl complains with errors like the following:
3340 Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
3341 This is because ICU 3.8 had scripts from ISO 15924 which are now
3342 added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt
3343 and PropertyValueAliases.txt.
3344 -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
3345 Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
3346 + PropertyValueAliases.txt now explicitly contains values for boolean properties:
3347 N/Y, No/Yes, F/T, False/True
3348 -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
3349 It will use further values from the file if present.
3351 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
3352 - new block & script values
3354 + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
3355 (removed from SyntheticPropertyValueAliases.txt)
3356 + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
3357 (added to SyntheticPropertyValueAliases.txt)
3358 - uprops.icu (uprops.h) only provides 7 bits for script codes.
3359 In ICU 4.0 there are USCRIPT_CODE_LIMIT=130 script codes now.
3360 None of the codes above 127 is yet the script code of an
3361 assigned Unicode character, so ICU 4.0 uprops.icu does not store any
3362 script code values greater than 127.
3363 However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
3364 in a parallel bit field, and that overflows now.
3365 Also, future values >=128 would be incompatible anyway.
3366 uprops.h is modified to move around several of the bit fields
3367 in the properties vector words, and now uses 8 bits for the script code.
3368 Two other bit fields also grow to accommodate future growth:
3369 Block (current count: 172) grows from 8 to 9 bits,
3370 and Word_Break grows from 4 to 5 bits.
3371 - renamed property Simple_Case_Folding (sfc->scf)
3372 + nothing to be done: handled as normal alias
3373 - new property JSN Jamo_Short_Name
3374 + no new API: only contributes to the Name property
3375 - new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
3376 - new Joining Group (JG) value: Burushashki_Yeh_Barree
3377 - new Sentence_Break (SB) values:
3382 - new Word_Break (WB) values:
3384 WB ; Extend ; Extend
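The script-code overflow described above is simple bit-width arithmetic. A small sketch with hypothetical helper names:

```python
def bits_needed(max_value):
    """Number of bits required to store values 0..max_value."""
    return max(1, max_value.bit_length())


def fits(max_value, bits):
    """True if values 0..max_value fit into a bit field of the given width."""
    return max_value < (1 << bits)


if __name__ == "__main__":
    uscript_code_limit = 130                      # from the notes above
    print(bits_needed(127))                       # largest value a 7-bit field can hold
    print(bits_needed(uscript_code_limit - 1))    # maximum script value 129
    print(fits(uscript_code_limit - 1, 7), fits(uscript_code_limit - 1, 8))
```

The same check shows that 172 blocks (values 0..171) still fit into 8 bits, so the move to 9 bits is headroom for future growth rather than a fix for an overflow.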
3388 * Further changes in the 2008-02-29 update:
3389 - Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
3390 because they should not normally be invisible.
3391 - new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
3392 - new Grapheme_Cluster_Break (GCB) value: PP=Prepend
3393 - new Word_Break (WB) value: NL=Newline
3395 * hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
3396 - Unihan range end moves from 9FBB to 9FC3
3397 search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
3398 + do change gennames.c
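
The search for the old end/limit values can be sketched in Python as follows. The sample lines are made up for illustration; in practice the search runs over the ICU source tree.

```python
import re

# Matches the old Unihan range end (9FBB) and limit (9FBC), case-insensitively.
UNIHAN_END_OR_LIMIT = re.compile(r"9FB[BC]", re.IGNORECASE)


def find_hardcoded_ranges(lines):
    """Return (1-based line number, text) for each line mentioning 9FBB or 9FBC."""
    return [(number, line)
            for number, line in enumerate(lines, 1)
            if UNIHAN_END_OR_LIMIT.search(line)]


# Hypothetical sample input, not copied from any ICU file.
sample = [
    "    0x4e00, 0x9fbb,",
    "    if (c < 0x9FBC) {",
    "    0xac00, 0xd7a3,",
]
hits = find_hardcoded_ranges(sample)
```

Each hit still needs a manual decision, since some hardcoded ranges must stay unchanged (see the Unicode 4.1 update).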
3400 * build Unicode data source code for hardcoding core data
3401 C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data
3403 ICU data make path is \svn\icuproj\icu\uni51\source\data\
3404 ICU root path is \svn\icuproj\icu\uni51
3405 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
3406 Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
3407 Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
3408 Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
3409 Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
3410 Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
3411 Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
3412 Creating data file for Unicode Character Properties
3413 Creating data file for Unicode Case Mapping Properties
3414 Creating data file for Unicode BiDi/Shaping Properties
3415 Creating data file for Unicode Normalization
3416 Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
3417 Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"
3419 - copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
3420 and rebuild the common library
3424 * Update break iterator rules to new UAX versions and new property values
3428 * update FractionalUCA.txt and UCARules.txt with new canonical closure
3431 - Test that APIs using Unicode property value aliases (like UnicodeSet)
3432 support all of the boolean values N/Y, No/Yes, F/T, False/True
3433 -> TestBinaryValues() tests in both cintltst and intltest
3435 *** LayoutEngine script information
3436 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguage.h,
3437 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
3438 ScriptRunData.cpp, which is no longer needed.)
3440 The generated files have a current copyright date and "@draft" statement.
3442 * copy the above files into <icu>/source/layout, replacing the old files.
3444 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
3445 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
3447 * rebuild the layout and layoutex libraries.
3451 + Jamo_Short_Name, sfc->scf, binary property value aliases
3453 ---------------------------------------------------------------------------- ***
3457 *** related Jitterbugs
3459 5084 RFE: Update to Unicode 5.0
3461 *** data files & enums & parser code
3465 DerivedCoreProperties.txt
3466 DerivedNormalizationProps.txt
3467 NormalizationTest.txt
3470 GraphemeBreakProperty.txt
3471 SentenceBreakProperty.txt
3472 WordBreakProperty.txt
3473 - ucdstrip and ucdmerge:
3477 * my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
3478 copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
3479 copy 5.0.0\ucd\Blocks.txt ..\unidata\
3480 copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
3481 copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
3482 copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
3483 copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
3484 copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
3485 copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
3486 copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
3487 copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
3488 copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
3489 copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
3490 copy 5.0.0\ucd\UnicodeData.txt ..\unidata\
3492 ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
3493 ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
3494 ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
3495 ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
3496 ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
3497 ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
3498 ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
3499 ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
3500 ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
3501 ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
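
What the two filters do can be sketched in Python. This assumes that ucdstrip removes trailing comments from data lines (keeping comment-only lines) and that ucdmerge merges adjacent code point ranges that carry the same value; the actual tools may differ in detail, and the function names here are invented for the example.

```python
def strip_comments(lines):
    """Remove trailing '#' comments from data lines; keep comment-only lines."""
    out = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.startswith("#"):
            line = line.split("#", 1)[0].rstrip()
        out.append(line)
    return out


def format_entry(entry):
    """Turn a [start, end, value] entry back into a data line."""
    if isinstance(entry, str):
        return entry
    start, end, value = entry
    rng = "%04X" % start if start == end else "%04X..%04X" % (start, end)
    return "%s;%s" % (rng, value)


def merge_ranges(lines):
    """Merge adjacent 'start[..end];value' lines with equal values."""
    merged = []  # str (passed through) or [start, end, value]
    for line in lines:
        if not line or line.startswith("#") or ";" not in line:
            merged.append(line)
            continue
        rng, value = (field.strip() for field in line.split(";", 1))
        first, _, last = rng.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        prev = merged[-1] if merged else None
        if isinstance(prev, list) and prev[2] == value and prev[1] + 1 == start:
            prev[1] = end          # extend the previous range
        else:
            merged.append([start, end, value])
    return [format_entry(entry) for entry in merged]


# Made-up sample in the style of EastAsianWidth.txt.
sample = [
    "# sample header",
    "0020;Na # Zs SPACE",
    "0021..0023;Na # Po",
    "0024;Na # Sc",
    "00A1;A # Po",
]
result = merge_ranges(strip_comments(sample))
```

Chaining the two functions mirrors the "ucdstrip < in | ucdmerge > out" pipeline used for EastAsianWidth.txt and LineBreak.txt.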
3503 * update FractionalUCA.txt and UCARules.txt with new canonical closure
3507 + make sure that data.h is writable
3508 + perl preparse.pl \cvs\oss\icu > out.txt
3510 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
3511 - new block & script values
3512 + script values already added in ICU 3.6 because all of ISO 15924 is now covered
3514 * build Unicode data source code for hardcoding core data
3515 C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data
3517 ICU data make path is \cvs\oss\icu\source\data\
3518 ICU root path is \cvs\oss\icu
3519 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
3521 Creating data file for Unicode Character Properties
3522 Creating data file for Unicode Case Mapping Properties
3523 Creating data file for Unicode BiDi/Shaping Properties
3524 Creating data file for Unicode Normalization
3525 Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
3526 Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"
3528 - copy the .c source files to C:\cvs\oss\icu\source\common
3529 and rebuild the common library
3531 *** Unicode version numbers
3536 *** LayoutEngine script information
3537 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguage.h,
3538 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
3539 ScriptRunData.cpp, which is no longer needed.)
3541 The generated files have a current copyright date and "@draft" statement.
3543 * copy the above files into <icu>/source/layout, replacing the old files.
3545 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
3546 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
3548 * rebuild the layout and layoutex libraries.
3550 ---------------------------------------------------------------------------- ***
3554 *** related Jitterbugs
3556 4332 RFE: Update to Unicode 4.1
3557 4157 RBBI, TR29 4.1 updates
3559 *** data files & enums & parser code
3563 DerivedCoreProperties.txt
3564 DerivedNormalizationProps.txt
3565 NormalizationTest.txt
3566 GraphemeBreakProperty.txt
3567 SentenceBreakProperty.txt
3568 WordBreakProperty.txt
3569 - ucdstrip and ucdmerge:
3573 * add new files to the repository
3574 GraphemeBreakProperty.txt
3575 SentenceBreakProperty.txt
3576 WordBreakProperty.txt
3578 * update FractionalUCA.txt and UCARules.txt with new canonical closure
3581 - handle new enumerated properties in sub read_uchar
3584 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
3585 - new binary properties
3587 + Pattern_White_Space
3588 - new enumerated properties
3589 + Grapheme_Cluster_Break
3592 - new block & script & line break values
3595 - case-ignorable changes
3596 see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
3597 now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk
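
The new case-ignorable definition can be sketched in Python. The standard unicodedata module supplies general categories but not Word_Break, so the MidLetter set below is a small illustrative sample; real data would come from WordBreakProperty.txt. The function name is invented for the example.

```python
import unicodedata

# Illustrative sample of Word_Break=MidLetter code points (assumption:
# taken from memory of the Unicode 4.1 data, not read from the file).
MID_LETTER_SAMPLE = frozenset({0x0027, 0x00B7, 0x2019, 0x2027})

# General categories named in the definition above.
CASE_IGNORABLE_CATEGORIES = frozenset({"Mn", "Me", "Cf", "Lm", "Sk"})


def is_case_ignorable(ch, mid_letter=MID_LETTER_SAMPLE):
    """Word_Break=MidLetter, or general category Mn, Me, Cf, Lm or Sk."""
    return (ord(ch) in mid_letter
            or unicodedata.category(ch) in CASE_IGNORABLE_CATEGORIES)
```

Note that Python's unicodedata reflects whatever Unicode version the interpreter ships with, so results for individual characters may differ from Unicode 4.1.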
3599 *** Unicode version numbers
3605 - verify that u_charMirror() round-trips
3606 - test all new properties and some new values of old properties
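
The round-trip check can be sketched in Python over a plain mapping table. This does not call u_charMirror(); the dictionary stands in for the data parsed from BidiMirroring.txt, and the function name is invented for the example.

```python
def find_non_round_trips(mirror_map):
    """Return code points c for which mirror(mirror(c)) != c.

    mirror(c) is mirror_map[c] if present, else c itself, which is the
    assumed behavior for characters without a mirror mapping.
    """
    def mirror(c):
        return mirror_map.get(c, c)

    return sorted(c for c in mirror_map if mirror(mirror(c)) != c)


# Small sample: parentheses and angle signs map to each other.
good = {0x0028: 0x0029, 0x0029: 0x0028, 0x003C: 0x003E, 0x003E: 0x003C}
# Broken sample: the reverse mapping for U+0029 is missing.
broken = {0x0028: 0x0029}
```

An empty result means every mapped character round-trips.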
3610 * hardcoded Unihan range end/limit
3611 - Unihan range end moves from 9FA5 to 9FBB
3612 search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
3613 + do not modify BOCU/BOCSU code because that would change the encoding
3614 and break binary compatibility!
3615 + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
3617 + ignore trietest.c: test data is arbitrary
3618 + ignore tstnorm.cpp: test optimization, not important
3619 + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
3620 + do change line_th.txt and word_th.txt
3621 by replacing hardcoded ranges with the new property values
3622 + do change gennames.c
3624 source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
3625 source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
3626 source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5,
3629 - compare new special casing context conditions with previous ones
3630 see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
3633 - consider storing only the short name if it is the same as the long name
3636 - UAX #29 changes (grapheme/word/sentence breaks)
3637 - UAX #14 changes (line breaks)
3638 - Pattern_Syntax & Pattern_White_Space
3640 ---------------------------------------------------------------------------- ***
3642 Unicode 4.0.1 update
3644 *** related Jitterbugs
3646 3170 RFE: Update to Unicode 4.0.1
3647 3171 Add new Unicode 4.0.1 properties
3648 3520 use Unicode 4.0.1 updates for break iteration
3650 *** data files & enums & parser code
3653 - ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
3654 - ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt
3657 - fix UnicodeData.txt general categories of Ethiopic digits Nd->No
3658 according to PRI #26
3659 http://www.unicode.org/review/resolved-pri.html#pri26
3660 - undone again because no corrigendum in sight;
3661 instead modified tests to not check consistency on this for Unicode 4.0.1
3664 - update from http://www.unicode.org/copyright.html
3665 formatted for plain text
3667 * uchar.h & uprops.h & uprops.c & genprops
3668 - add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
3669 - add U_LB_INSEPARABLE due to a spelling fix
3670 + put short name comment only on line with new constant
3671 for genpname perl script parser
3672 - new binary properties
3674 + Variation_Selector
3677 - fix genpname perl script so that it doesn't choke on more than 2 names per property value
3678 - perl script: correctly calculate the maximum number of fields per row
3681 - new script code Hrkt=Katakana_Or_Hiragana
3683 * gennorm.c track changes in DerivedNormalizationProps.txt
3684 - "FNC" -> "FC_NFKC"
3685 - single field "NFD_NO" -> two fields "NFD_QC; N" etc.
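
The quick-check format change can be sketched as a parser that accepts both the old single-field and the new two-field form. This is Python for illustration, not the gennorm.c code; the function names are invented, and the sample lines follow the style of the file rather than quoting it.

```python
def parse_quick_check(fields):
    """Normalize old ('NFD_NO') and new ('NFD_QC', 'N') forms to (property, value)."""
    if len(fields) == 1:
        # Old format: one field such as NFD_NO or NFC_MAYBE.
        form, _, answer = fields[0].partition("_")
        value = {"NO": "N", "MAYBE": "M"}[answer]
        return form + "_QC", value
    # New format: property name and value in separate fields.
    prop, value = fields
    return prop, value


def parse_line(line):
    """Split a data line into (code point range, (property, value))."""
    data = line.split("#", 1)[0]
    parts = [part.strip() for part in data.split(";")]
    return parts[0], parse_quick_check(parts[1:])
```

Both forms then yield the same result, which makes it easy to compare data from before and after the format change.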
3687 * genprops/props2.c track changes in DerivedNumericValues.txt
3688 - changed from 3 columns to 2, dropping the numeric type
3689 + assume that the type is always numeric for Han characters,
3690 and that only those are added in addition to what UnicodeData.txt lists
3692 *** Unicode version numbers
3698 - update test of default bidi classes according to PRI #28
3699 /tsutil/cucdtst/TestUnicodeData
3700 http://www.unicode.org/review/resolved-pri.html#pri28
3701 - bidi tests: change exemplar character for ES depending on Unicode version
3702 - change hardcoded expected property values where they change
3710 - use new Hrkt=Katakana_Or_Hiragana
3713 - are now part of combining character sequences
3714 - break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ