1 * Copyright (C) 2016 and later: Unicode, Inc. and others.
2 * License & terms of use: http://www.unicode.org/copyright.html
3 * Copyright (C) 2004-2016, International Business Machines
4 * Corporation and others. All Rights Reserved.
5 *
6 * file name: changes.txt
7 * encoding: US-ASCII
8 * tab size: 8 (not used)
9 * indentation:4
10 *
11 * created on: 2004may06
12 * created by: Markus W. Scherer
13 *
14 * change log for Unicode updates
15 *
16 * For each new Unicode version, during the beta period,
17 * I copy the change log for the previous version to the top of this file.
18 * I adjust the versions, tickets, URLs, and paths.
19 * I work my way through the steps listed in the log, top to bottom,
20 * adjusting the log as necessary.
21 * I report problems to the UTC and/or CLDR and/or ICU.
22 * Before the data is final, I "turn the crank" several more times,
23 * using appropriate subsets of the steps.
24
25 ---------------------------------------------------------------------------- ***
26
27 * New ISO 15924 script codes
28
29 Starting with ICU 55, we do not add UScriptCode constants for new scripts any more
30 until they are encoded in Unicode,
31 or can be assumed to be encoded in the next Unicode version.
32 Script enum constant names should follow the Unicode script property value aliases,
33 which are assigned only when the scripts are encoded.
34 When we add constants early and guess wrong, we end up with confusing enum constants
35 and sometimes have to add aliases.
36
37 Variant script codes like Latf and Aran that are not subject to separate encoding
38 can be added at any time.
39 (For example, Aran could be added as USCRIPT_ARABIC_NASTALIQ.)
40
41 We add script codes used in CLDR or in the spoof checker.
42 This includes combination/alias codes like Hanb and Jamo.
43 See http://unicode.org/reports/tr35/#unicode_script_subtag_validity
44 and look for "alias" on http://unicode.org/iso15924/iso15924-codes.html
45
46 We add special Z* script codes like Zsye.
47
48 For new script codes see http://www.unicode.org/iso15924/codechanges.html
49
50 ---------------------------------------------------------------------------- ***
51
52 Unicode 11.0 update for ICU 62
53
54 http://www.unicode.org/versions/Unicode11.0.0/
55 http://unicode.org/versions/beta-11.0.0.html
56 https://www.unicode.org/review/pri372/
57 http://www.unicode.org/reports/uax-proposed-updates.html
58 http://www.unicode.org/reports/tr44/tr44-21.html
59
60 * Command-line environment setup
61
62 UNICODE_DATA=~/unidata/uni11/20180521
63 CLDR_SRC=~/svn.cldr/uni
64 ICU_ROOT=~/svn.icu/uni
65 ICU_SRC=$ICU_ROOT/src
66 ICUDT=icudt61b
67 ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
68 ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
69 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
70
71 *** ICU Trac
72
73 - ticket:13630: Unicode 11
74 - ^/branches/markus/uni11
75
76 *** CLDR Trac
77
78 - cldrbug 10978: Unicode 11
79 - ^/branches/markus/uni11
80
81 *** Unicode version numbers
82 - makedata.mak
83 - uchar.h
84 - com.ibm.icu.util.VersionInfo
85 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
86
87 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
88 so that the makefiles see the new version number.
89
90 *** data files & enums & parser code
91
92 * download files
93 - mkdir -p $UNICODE_DATA
94 - download Unicode files into $UNICODE_DATA
95 + subfolders: emoji, idna, security, ucd, uca
96 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
97
98 * for manual diffs and for Unicode Tools input data updates:
99 remove version suffixes from the file names
100 ~$ unidata/desuffixucd.py $UNICODE_DATA
101 (see https://sites.google.com/site/unicodetools/inputdata)
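The suffix-removal step can be sketched in Python as follows. This is only an illustration of the idea, not the code of desuffixucd.py: the function name strip_version_suffix and the assumed suffix shape ("-<version>" with an optional "d<draft>" part before the extension) are assumptions made for this example.

```python
import re

# Assumed suffix shape (an illustration, not taken from desuffixucd.py):
#   <stem>-<major>[.<minor>[.<update>]][d<draft>]<.ext>
# for example "UnicodeData-11.0.0d13.txt".
_VERSION_SUFFIX = re.compile(
    r"^(?P<stem>[A-Za-z0-9_\-]+)-\d+(?:\.\d+){0,2}(?:d\d+)?(?P<ext>\.[A-Za-z]+)$")


def strip_version_suffix(file_name):
    """Return file_name without a version suffix; unchanged if it has none."""
    match = _VERSION_SUFFIX.match(file_name)
    if match is None:
        return file_name
    return match.group("stem") + match.group("ext")
```

Applied to every file under $UNICODE_DATA, a helper like this would let two downloads be diffed file by file even when the draft numbers in the names differ.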
102
103 * process and/or copy files
104 - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
105 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
106 + For debugging, and tweaking how ppucd.txt is written,
107 the tool has an --only_ppucd option:
108 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
109
110 - cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
111
112 * build ICU (make install)
113 so that the tools build can pick up the new definitions from the installed header files.
114
115 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
116
117 * preparseucd.py changes
118 - fix other errors
119 NameError: unknown property Extended_Pictographic
120 -> add Extended_Pictographic binary property
121 -> add new short names for all Emoji properties
122
123 * new constants for new property values
124 - preparseucd.py error:
125 ValueError: missing uchar.h enum constants for some property values:
126 [(u'blk', set([u'Georgian_Ext', u'Hanifi_Rohingya', u'Medefaidrin', u'Sogdian', u'Makasar',
127 u'Old_Sogdian', u'Dogra', u'Gunjala_Gondi', u'Chess_Symbols', u'Mayan_Numerals',
128 u'Indic_Siyaq_Numbers'])),
129 (u'jg', set([u'Hanifi_Rohingya_Kinna_Ya', u'Hanifi_Rohingya_Pa'])),
130 (u'sc', set([u'Medf', u'Sogd', u'Dogr', u'Rohg', u'Maka', u'Sogo', u'Gong'])),
131 (u'GCB', set([u'LinkC', u'Virama'])),
132 (u'WB', set([u'WSegSpace']))]
133 = PropertyValueAliases.txt new property values (diff old & new .txt files)
134 blk; Chess_Symbols ; Chess_Symbols
135 blk; Dogra ; Dogra
136 blk; Georgian_Ext ; Georgian_Extended
137 blk; Gunjala_Gondi ; Gunjala_Gondi
138 blk; Hanifi_Rohingya ; Hanifi_Rohingya
139 blk; Indic_Siyaq_Numbers ; Indic_Siyaq_Numbers
140 blk; Makasar ; Makasar
141 blk; Mayan_Numerals ; Mayan_Numerals
142 blk; Medefaidrin ; Medefaidrin
143 blk; Old_Sogdian ; Old_Sogdian
144 blk; Sogdian ; Sogdian
145 -> add to uchar.h
146 use long property names for enum constants,
147 for the trailing comment get the block start code point: diff old & new Blocks.txt
148 -> add to UCharacter.UnicodeBlock IDs
149 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
150 replace public static final int \1_ID = \2; \3
151 -> add to UCharacter.UnicodeBlock objects
152 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
153 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
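The two Eclipse find/replace steps above can also be run outside the IDE. The sketch below uses the same patterns with Python's re module; the function names to_java_id and to_java_object are made up for this example, and the sample line in the usage note is illustrative rather than copied from uchar.h.

```python
import re

# Same patterns as the Eclipse find/replace pairs in the log.
ID_FIND = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
ID_REPLACE = r"public static final int \1_ID = \2; \3"

OBJECT_FIND = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
OBJECT_REPLACE = (
    r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2')


def to_java_id(c_enum_line):
    """Turn one uchar.h UBLOCK_ enum line into a UnicodeBlock ..._ID constant."""
    return ID_FIND.sub(ID_REPLACE, c_enum_line)


def to_java_object(c_enum_line):
    """Turn one uchar.h UBLOCK_ enum line into a UnicodeBlock object declaration."""
    return OBJECT_FIND.sub(OBJECT_REPLACE, c_enum_line)
```

For an input line shaped like "UBLOCK_CHESS_SYMBOLS = 281, /*[1FA00]*/", the first function yields the int constant and the second the object declaration, both keeping the trailing comment.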
154
155 GCB; LinkC ; LinkingConsonant
156 GCB; Virama ; Virama
157 -> uchar.h & UCharacter.GraphemeClusterBreak
158 -> these two later removed again: http://www.unicode.org/L2/L2018/18115.htm#155-A76
159
160 InSC; Consonant_Initial_Postfixed ; Consonant_Initial_Postfixed
161 -> ignore: ICU does not yet support this property
162
163 jg ; Hanifi_Rohingya_Kinna_Ya ; Hanifi_Rohingya_Kinna_Ya
164 jg ; Hanifi_Rohingya_Pa ; Hanifi_Rohingya_Pa
165 -> uchar.h & UCharacter.JoiningGroup
166
167 sc ; Dogr ; Dogra
168 sc ; Gong ; Gunjala_Gondi
169 sc ; Maka ; Makasar
170 sc ; Medf ; Medefaidrin
171 sc ; Rohg ; Hanifi_Rohingya
172 sc ; Sogd ; Sogdian
173 sc ; Sogo ; Old_Sogdian
174 -> uscript.h & com.ibm.icu.lang.UScript
175 -> Nushu had been added already
176 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
177 and in com.ibm.icu.dev.test.lang.TestUScript.java
178
179 WB ; WSegSpace ; WSegSpace
180 -> uchar.h & UCharacter.WordBreak
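The "diff old & new .txt files" step for PropertyValueAliases.txt can be approximated with a small script. This is a minimal sketch, assuming that each data line has the form "property ; short name ; long name" with optional "#" comments; the helper names parse_aliases and new_values_by_property are hypothetical. Properties whose second field is numeric (such as ccc) are not treated specially here.

```python
def parse_aliases(text):
    """Collect (property, short name, long name) triples from the file text."""
    values = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) >= 3:
            values.add((fields[0], fields[1], fields[2]))
    return values


def new_values_by_property(old_text, new_text):
    """Group the triples that appear only in new_text by property name."""
    added = parse_aliases(new_text) - parse_aliases(old_text)
    result = {}
    for prop, short_name, long_name in sorted(added):
        result.setdefault(prop, []).append((short_name, long_name))
    return result
```

The result is keyed by property alias (blk, jg, sc, and so on), which matches how the additions are grouped in the list above.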
181
182 * New short names for emoji properties
183 - see UTS #51
184 - short names set in preparseucd.py
185
186 * New properties
187 - boolean emoji property Extended_Pictographic
188 -> added in preparseucd.py
189 -> uchar.h & UProperty.java
190 - misc. property Equivalent_Unified_Ideograph (EqUIdeo)
191 as shown in PropertyValueAliases.txt
192 -> ignore for now
193
194 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
195 (not strictly necessary for NOT_ENCODED scripts)
196 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
197
198 * update spoof checker UnicodeSet initializers:
199 inclusionPat & recommendedPat in uspoof.cpp
200 INCLUSION & RECOMMENDED in SpoofChecker.java
201 - make sure that the Unicode Tools tree contains the latest security data files
202 - go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
203 - update the hardcoded version number there in the DIRECTORY path
204 - run the tool (no special environment variables needed)
205 - copy & paste from the Console output into the .cpp & .java files
206
207 * generate normalization data files
208 cd $ICU_ROOT/dbg/icu4c
209 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
210 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
211 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
212 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
213 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
214
215 * build ICU (make install)
216 so that the tools build can pick up the new definitions from the installed header files.
217
218 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
219
220 * build Unicode tools using CMake+make
221
222 $ICU_SRC/tools/unicode/c/icudefs.txt:
223
224 # Location (--prefix) of where ICU was installed.
225 set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
226 # Location of the ICU4C source tree.
227 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c)
228
229 $ICU_ROOT/dbg$
230 mkdir -p tools/unicode/c
231 cd tools/unicode/c
232
233 $ICU_ROOT/dbg/tools/unicode/c$
234 cmake ../../../../src/tools/unicode/c
235 make
236
237 * generate core properties data files
238 $ICU_ROOT/dbg/tools/unicode/c$
239 genprops/genprops $ICU_SRC/icu4c
240 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
241 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
242 - rebuild ICU (make install) & tools
243
244 * Fix case props
245 genprops error: casepropsbuilder: too many exceptions words
246 genprops error: failure finalizing the data - U_BUFFER_OVERFLOW_ERROR
247 - With the addition of Georgian Mtavruli capital letters,
248 there are now too many simple case mappings with big mapping deltas
249 that yield uncompressible exceptions.
250 - Changing the data structure (now formatVersion 4),
251 adding one bit for no-simple-case-folding (for Cherokee), and
252 one optional slot for a big delta (for most faraway mappings),
253 together with another bit for whether that is negative.
254 This makes most Cherokee & Georgian etc. case mappings compressible,
255 reducing the number of exceptions words.
256 - Further changes to gain one more bit for the exceptions index,
257 for future growth. For details, see casepropsbuilder.cpp.
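The idea of storing a big delta as a magnitude plus a separate "negative" bit can be illustrated as follows. This is a sketch of the arithmetic only; it does not reproduce the actual bit layout in casepropsbuilder.cpp, and the function names encode_delta and decode_delta are hypothetical.

```python
def encode_delta(code_point, mapped):
    """Split the mapping delta into an unsigned magnitude and a sign flag."""
    delta = mapped - code_point
    return abs(delta), delta < 0


def decode_delta(code_point, magnitude, is_negative):
    """Rebuild the mapped code point from the magnitude and the sign flag."""
    if is_negative:
        return code_point - magnitude
    return code_point + magnitude
```

For example, mapping Georgian U+10D0 to its Mtavruli capital U+1C90 gives the magnitude 0xBC0 with the flag clear, and the reverse mapping gives the same magnitude with the flag set, so both directions need only one unsigned slot plus one bit.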
258
259 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
260 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
261 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
262 - Unicode 6.0..11.0: U+2260, U+226E, U+226F
263 - nothing new in this Unicode version, no test file to update
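The grep step above can be sketched as a small filter. It assumes the usual semicolon-separated layout of IdnaMappingTable.txt (code point or range, then status, with "#" comments); the function name non_ascii_std3_valid is made up for this example.

```python
def non_ascii_std3_valid(mapping_table_text):
    """Return (first, last) ranges above U+007F with status disallowed_STD3_valid."""
    found = []
    for line in mapping_table_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        if last > 0x7F:
            # Clip ranges that begin inside ASCII to their non-ASCII part.
            found.append((max(first, 0x80), last))
    return found
```

Comparing the output for the old and the new table shows whether the list of characters (U+2260, U+226E, U+226F so far) has grown and the tests need updating.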
264
265 * run & fix ICU4C tests
266 - Andy handles RBBI & spoof check test failures
267
268 - Errors in char.txt, word.txt, word_POSIX.txt like
269 createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET" at line 46, column 16
270 because \p{Grapheme_Cluster_Break = EBG} and \p{Word_Break = EBG} are empty.
271 -> Temporary(!) workaround: Add an arbitrary code point to these sets to make them
272 not empty, just to get ICU building.
273 -> Intermediate workaround: Remove $E_Base_GAZ and other now-unused variables
274 and properties together with the rules that used them (GB 10, WB 14).
275 -> Andy adjusts the rule sets further to sync with
276 Unicode 11 grapheme, word, and line break spec changes.
277
278 * collation: CLDR collation root, UCA DUCET
279
280 - UCA DUCET goes into Mark's Unicode tools, see
281 https://sites.google.com/site/unicodetools/home#TOC-UCA
282 diff the main mapping file, look for bad changes
283 (for example, more bytes per weight for common characters)
284 ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/uca/11.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-11.txt
285 ~/svn.unitools/trunk$ meld ../frac-10.txt ../frac-11.txt
286
287 - CLDR root data files are checked into $CLDR_SRC/common/uca/
288 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
289
290 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
291 cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
292 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
293 cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
294 (note removing the underscore before "Rules")
295 cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
296 - restore TODO diffs in UCARules.txt
297 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
298 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
299 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
300 from the CLDR root files (..._CLDR_..._SHORT.txt)
301 cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
302 cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
303 cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
304 - if CLDR common/uca/unihan-index.txt changes, then update
305 CLDR common/collation/root.xml <collation type="private-unihan">
306 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
307
308 - run genuca, see command line above;
309 deal with
310 Error: Unknown script for first-primary sample character U+1180B on line 28649 of /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
311 FDD1 1180B; [71 CC 02, 05, 05] # Dogra first primary (compressible)
312 (add the character to genuca.cpp sampleCharsToScripts[])
313 + look up the USCRIPT_ code for the new sample characters
314 (should be obvious from the comment in the error output)
315 + *add* mappings to sampleCharsToScripts[], do not replace them
316 (in case the script sample characters flip-flop)
317 + insert new scripts in DUCET script order, see the top_byte table
318 at the beginning of FractionalUCA.txt
319 - rebuild ICU4C
320
321 * Unihan collators
322 https://sites.google.com/site/unicodetools/unihan
323 - run Unicode Tools
324 org.unicode.draft.GenerateUnihanCollators
325 with VM arguments
326 -ea
327 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
328 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
329 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
330 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
331 -DUVERSION=11.0.0
332 - run Unicode Tools
333 org.unicode.draft.GenerateUnihanCollatorFiles
334 with the same arguments
335 - check CLDR diffs
336 cd $CLDR_SRC
337 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
338 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
339 - copy to CLDR
340 cd $CLDR_SRC
341 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
342 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
343 - run CLDR unit tests, commit to CLDR
344 - generate ICU zh collation data: run CLDR
345 org.unicode.cldr.icu.NewLdml2IcuConverter
346 with program arguments
347 -t collation
348 -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
349 -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
350 -d /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/coll
351 -p /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/xml/collation
352 zh
353 and VM arguments
354 -ea
355 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
356 - rebuild ICU4C
357
358 * run & fix ICU4C tests, now with new CLDR collation root data
359 - run all tests with the collation test data *_SHORT.txt or the full files
360 (the full ones have comments, useful for debugging)
361 - note on intltest: if collate/UCAConformanceTest fails, then
362 utility/MultithreadTest/TestCollators will fail as well;
363 fix the conformance test before looking into the multi-thread test
364
365 * update Java data files
366 - refresh just the UCD/UCA-related/derived files, just to be safe
367 - see (ICU4C)/source/data/icu4j-readme.txt
368 - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
369 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
370 output:
371 ...
372 Unicode .icu files built to ./out/build/icudt61l
373 echo timestamp > uni-core-data
374 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt61b
375 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b
376 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
377 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt61l.dat ./out/icu4j/icudt61b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt61l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt61b
378 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b"
379 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt61b/
380 mkdir -p /tmp/icu4j/main/shared/data
381 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
382 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt61b/
383 mkdir -p /tmp/icu4j/main/shared/data
384 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
385 make[1]: Leaving directory '/usr/local/google/home/mscherer/svn.icu/uni/dbg/icu4c/data'
386 - copy the big-endian Unicode data files to another location,
387 separate from the other data files,
388 and then refresh ICU4J
389 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
390 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
391 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
392 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
393 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
394 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
395 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
396 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
397 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
398 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
399
400 * When refreshing all of ICU4J data from ICU4C
401 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
402 - cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
403 or
404 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
405
406 * update CollationFCD.java
407 + copy & paste the initializers of lcccIndex[] etc. from
408 ICU4C/source/i18n/collationfcd.cpp to
409 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
410
411 * refresh Java test .txt files
412 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
413 cd $ICU_SRC/icu4c/source/data/unidata
414 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
415 cd ../../test/testdata
416 cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
417 cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
418
419 * run & fix ICU4J tests
420
421 *** API additions
422 - send notice to icu-design about new born-@stable API (enum constants etc.)
423
424 *** CLDR numbering systems
425 - look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
426 Unicode 11: using Unicode 11 CLDR ticket #10978
427 rohg 10D30..10D39 Hanifi_Rohingya
428 gong 11DA0..11DA9 Gunjala_Gondi
429 Earlier: CLDR tickets specific to adding new numbering systems.
430 Unicode 10: http://unicode.org/cldr/trac/ticket/10219
431 Unicode 9: http://unicode.org/cldr/trac/ticket/9692
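One way to look for new sets of decimal digits is to scan UnicodeData.txt for characters with gc=Nd and numeric value 4, which should match exactly one character per digit set. The sketch below assumes that each set is a contiguous run of ten code points in the order 0..9; the function names decimal_digit_sets and new_digit_sets are hypothetical.

```python
def decimal_digit_sets(unicode_data_text):
    """Return (first, last) code point ranges, one per digit-four character."""
    sets = []
    for line in unicode_data_text.splitlines():
        fields = line.split(";")
        if len(fields) < 9:
            continue
        # Field 2 is General_Category, field 8 is the numeric value.
        if fields[2] == "Nd" and fields[8] == "4":
            four = int(fields[0], 16)
            sets.append((four - 4, four + 5))  # assumes contiguous digits 0..9
    return sets


def new_digit_sets(old_text, new_text):
    """Return digit sets present in the new file but not in the old one."""
    return sorted(set(decimal_digit_sets(new_text)) - set(decimal_digit_sets(old_text)))
```

Run against the Unicode 10 and Unicode 11 files, a check like this should report the two ranges listed above for rohg and gong.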
432
433 *** merge the Unicode update branches back onto the trunk
434 - do not merge the icudata.jar and testdata.jar,
435 instead rebuild them from merged & tested ICU4C
436 - make sure that changes to Unicode tools are checked in:
437 http://www.unicode.org/utility/trac/log/trunk/unicodetools
438
439 ---------------------------------------------------------------------------- ***
440
441 Unicode 10.0 update for ICU 60
442
443 http://www.unicode.org/versions/Unicode10.0.0/
444 http://www.unicode.org/versions/beta-10.0.0.html
445 http://blog.unicode.org/2017/03/unicode-100-beta-review.html
446 http://www.unicode.org/review/pri350/
447 http://www.unicode.org/reports/uax-proposed-updates.html
448 http://www.unicode.org/reports/tr44/tr44-19.html
449
450 * Command-line environment setup
451
452 UNICODE_DATA=~/unidata/uni10/20170605
453 CLDR_SRC=~/svn.cldr/uni10
454 ICU_ROOT=~/svn.icu/uni10
455 ICU_SRC=$ICU_ROOT/src
456 ICUDT=icudt60b
457 ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
458 ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
459 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
460
461 *** ICU Trac
462
463 - ticket:12985: Unicode 10
464 - ticket:13061: undo hacks from emoji 5.0 update
465 - ticket:13062: add Emoji_Component property
466 - ^/branches/markus/uni10
467
468 *** CLDR Trac
469
470 - cldrbug 10055: Unicode 10
471 - cldrbug 9882: Unicode 10 script metadata
472 - cldrbug 10219: numbering systems for Unicode 10
473
474 *** Unicode version numbers
475 - makedata.mak
476 - uchar.h
477 - com.ibm.icu.util.VersionInfo
478 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
479
480 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
481 so that the makefiles see the new version number.
482
483 *** data files & enums & parser code
484
485 * download files
486 - mkdir -p $UNICODE_DATA
487 - download Unicode 10.0 files into $UNICODE_DATA
488 + subfolders: ucd, uca, idna, security
489 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
490 - download emoji 5.0 files into $UNICODE_DATA/emoji
491
492 * for manual diffs: remove version suffixes from the file names
493 ~$ unidata/desuffixucd.py $UNICODE_DATA
494 (see https://sites.google.com/site/unicodetools/inputdata)
495
496 * process and/or copy files
497 - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
498 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
499 + For debugging, and tweaking how ppucd.txt is written,
500 the tool has an --only_ppucd option:
501 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
502
503 - cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
504
505 * build ICU (make install)
506 so that the tools build can pick up the new definitions from the installed header files.
507
508 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
509
510 * preparseucd.py changes
511 - remove or add new Unicode scripts from/to the
512 only-in-ISO-15924 list according to the error messages:
513 ValueError: remove ['Nshu'] from _scripts_only_in_iso15924
514 -> adjust _scripts_only_in_iso15924 as indicated
515 - fix other errors
516 Exception: no default values (@missing lines) for some Catalog or Enumerated properties: [u'vo']
517 -> add vo=Vertical_Orientation to _ignored_properties
518 -> later removed again, parsing the file, even though we do not yet store data for runtime use
519
520 * new constants for new property values
521 - preparseucd.py error:
522 ValueError: missing uchar.h enum constants for some property values:
523 [(u'blk', set([u'Zanabazar_Square', u'Nushu', u'CJK_Ext_F',
524 u'Kana_Ext_A', u'Syriac_Sup', u'Masaram_Gondi', u'Soyombo'])),
525 (u'jg', set([u'Malayalam_Bha', u'Malayalam_Llla', u'Malayalam_Nya', u'Malayalam_Lla',
526 u'Malayalam_Nga', u'Malayalam_Ssa', u'Malayalam_Tta', u'Malayalam_Ra',
527 u'Malayalam_Nna', u'Malayalam_Ja', u'Malayalam_Nnna'])),
528 (u'sc', set([u'Soyo', u'Gonm', u'Zanb']))]
529 = PropertyValueAliases.txt new property values (diff old & new .txt files)
530 blk; CJK_Ext_F ; CJK_Unified_Ideographs_Extension_F
531 blk; Kana_Ext_A ; Kana_Extended_A
532 blk; Masaram_Gondi ; Masaram_Gondi
533 blk; Nushu ; Nushu
534 blk; Soyombo ; Soyombo
535 blk; Syriac_Sup ; Syriac_Supplement
536 blk; Zanabazar_Square ; Zanabazar_Square
537 -> add to uchar.h
538 use long property names for enum constants,
539 for the trailing comment get the block start code point: diff old & new Blocks.txt
540 -> add to UCharacter.UnicodeBlock IDs
541 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
542 replace public static final int \1_ID = \2; \3
543 -> add to UCharacter.UnicodeBlock objects
544 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
545 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
546
547 jg ; Malayalam_Bha ; Malayalam_Bha
548 jg ; Malayalam_Ja ; Malayalam_Ja
549 jg ; Malayalam_Lla ; Malayalam_Lla
550 jg ; Malayalam_Llla ; Malayalam_Llla
551 jg ; Malayalam_Nga ; Malayalam_Nga
552 jg ; Malayalam_Nna ; Malayalam_Nna
553 jg ; Malayalam_Nnna ; Malayalam_Nnna
554 jg ; Malayalam_Nya ; Malayalam_Nya
555 jg ; Malayalam_Ra ; Malayalam_Ra
556 jg ; Malayalam_Ssa ; Malayalam_Ssa
557 jg ; Malayalam_Tta ; Malayalam_Tta
558 -> uchar.h & UCharacter.JoiningGroup
559
560 sc ; Gonm ; Masaram_Gondi
561 sc ; Nshu ; Nushu
562 sc ; Soyo ; Soyombo
563 sc ; Zanb ; Zanabazar_Square
564 -> uscript.h & com.ibm.icu.lang.UScript
565 -> Nushu had been added already
566 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
567 and in com.ibm.icu.dev.test.lang.TestUScript.java
568
569 * New properties as shown in PropertyValueAliases.txt changes
570 - boolean Emoji_Component from emoji 5
571 -> uchar.h & UProperty.java
572 - boolean
573 # Regional_Indicator (RI)
574
575 RI ; N ; No ; F ; False
576 RI ; Y ; Yes ; T ; True
577 -> uchar.h & UProperty.java
578 -> single immutable range, to be hardcoded
579 - boolean
580 # Prepended_Concatenation_Mark (PCM)
581
582 PCM; N ; No ; F ; False
583 PCM; Y ; Yes ; T ; True
584 -> was new in Unicode 9
585 -> uchar.h & UProperty.java
586 - enumerated
587 # Vertical_Orientation (vo)
588
589 vo ; R ; Rotated
590 vo ; Tr ; Transformed_Rotated
591 vo ; Tu ; Transformed_Upright
592 vo ; U ; Upright
593 -> only pre-parsed for now, but not yet stored for runtime use
594
595 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
596 (not strictly necessary for NOT_ENCODED scripts)
597 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
598
599 * generate normalization data files
600 cd $ICU_ROOT/dbg/icu4c
601 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
602 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
603 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
604 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
605 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
606
607 * build ICU (make install)
608 so that the tools build can pick up the new definitions from the installed header files.
609
610 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
611
612 * build Unicode tools using CMake+make
613
614 $ICU_SRC/tools/unicode/c/icudefs.txt:
615
616 # Location (--prefix) of where ICU was installed.
617 set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
618 # Location of the ICU4C source tree.
619 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c)
620
621 $ICU_ROOT/dbg/tools/unicode/c$
622 cmake ../../../../src/tools/unicode/c
623 make
624
625 * generate core properties data files
626 $ICU_ROOT/dbg/tools/unicode/c$
627 genprops/genprops $ICU_SRC/icu4c
628 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
629 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
630 - rebuild ICU (make install) & tools
631
632 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
633 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
634 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
635 - Unicode 6.0..10.0: U+2260, U+226E, U+226F
636 - nothing new in this Unicode version, no test file to update
637
638 * run & fix ICU4C tests
639 - Andy handles RBBI & spoof check test failures
640
641 * collation: CLDR collation root, UCA DUCET
642
643 - UCA DUCET goes into Mark's Unicode tools, see
644 https://sites.google.com/site/unicodetools/home#TOC-UCA
645 - CLDR root data files are checked into $CLDR_SRC/common/uca/
646 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
647
648 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
649 cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
650 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
651 cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
652 (note removing the underscore before "Rules")
653 cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
654 - restore TODO diffs in UCARules.txt
655 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
656 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
657 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
658 from the CLDR root files (..._CLDR_..._SHORT.txt)
659 cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
660 cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
661 cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
662 - if CLDR common/uca/unihan-index.txt changes, then update
663 CLDR common/collation/root.xml <collation type="private-unihan">
664 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
665
666 - run genuca, see command line above;
667 deal with
668 Error: Unknown script for first-primary sample character U+11D10 on line 28117 of /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/unidata/FractionalUCA.txt:
669 FDD1 11D10; [70 D5 02, 05, 05] # Masaram_Gondi first primary (compressible)
670 (add the character to genuca.cpp sampleCharsToScripts[])
671 + look up the USCRIPT_ code for the new sample characters
672 (should be obvious from the comment in the error output)
673 + *add* mappings to sampleCharsToScripts[], do not replace them
674 (in case the script sample characters flip-flop)
675 + insert new scripts in DUCET script order, see the top_byte table
676 at the beginning of FractionalUCA.txt
677 - rebuild ICU4C
678
679 * Unihan collators
680 https://sites.google.com/site/unicodetools/unihan
681 - run Unicode Tools
682 org.unicode.draft.GenerateUnihanCollators
683 with VM arguments
684 -ea
685 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
686 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
687 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
688 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
689 -DUVERSION=10.0.0
690 - run Unicode Tools
691 org.unicode.draft.GenerateUnihanCollatorFiles
692 with the same arguments
693 - check CLDR diffs
694 cd $CLDR_SRC
695 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
696 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
697 - copy to CLDR
698 cd $CLDR_SRC
699 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
700 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
701 - run CLDR unit tests, commit to CLDR
702 - generate ICU zh collation data: run CLDR
703 org.unicode.cldr.icu.NewLdml2IcuConverter
704 with program arguments
705 -t collation
706 -s /usr/local/google/home/mscherer/svn.cldr/uni10/common/collation
707 -m /usr/local/google/home/mscherer/svn.cldr/uni10/common/supplemental
708 -d /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/coll
709 -p /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/xml/collation
710 zh
711 and VM arguments
712 -ea
713 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
714 - rebuild ICU4C
715
716 * run & fix ICU4C tests, now with new CLDR collation root data
717 - run all tests with the collation test data *_SHORT.txt or the full files
718 (the full ones have comments, useful for debugging)
719 - note on intltest: if collate/UCAConformanceTest fails, then
720 utility/MultithreadTest/TestCollators will fail as well;
721 fix the conformance test before looking into the multi-thread test
722
723 * update Java data files
724 - refresh just the UCD/UCA-related/derived files, just to be safe
725 - see (ICU4C)/source/data/icu4j-readme.txt
726 - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
727 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
728 output:
729 ...
730 Unicode .icu files built to ./out/build/icudt60l
731 echo timestamp > uni-core-data
732 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt60b
733 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b
734 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
735 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt60l.dat ./out/icu4j/icudt60b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt60l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt60b
736 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b"
737 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt60b/
738 mkdir -p /tmp/icu4j/main/shared/data
739 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
740 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt60b/
741 mkdir -p /tmp/icu4j/main/shared/data
742 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
743 make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/uni10/dbg/icu4c/data'
744 - copy the big-endian Unicode data files to another location,
745 separate from the other data files,
746 and then refresh ICU4J
747 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
748 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
749 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
750 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
751 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
752 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
753 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
754 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
755 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
756 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
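
The cp/rm sequence above amounts to a small selection rule over the files in the icu4j output folder. A minimal Python sketch of that rule follows, assuming paths relative to com/ibm/icu/impl/data/$ICUDT; the function name and the sample file list are illustrative only.

```python
import fnmatch
import posixpath

# Top-level files copied by the cp commands above.
TOP_LEVEL_PATTERNS = ["confusables.cfu", "*.icu", "*.nrm"]
# Removed again after the *.icu copy.
EXCLUDED_NAMES = {"cnvalias.icu"}
# Subfolders whose direct contents are copied (cp without -r).
WHOLE_FOLDERS = {"coll", "brkitr"}


def select_unicode_files(relative_paths):
    """Return the paths that the manual cp/rm sequence leaves under /tmp/icu4j."""
    selected = []
    for path in relative_paths:
        folder, name = posixpath.split(path)
        if folder == "":
            if name in EXCLUDED_NAMES:
                continue
            if any(fnmatch.fnmatchcase(name, pat) for pat in TOP_LEVEL_PATTERNS):
                selected.append(path)
        elif folder in WHOLE_FOLDERS:
            selected.append(path)
    return selected


# Illustrative input, not a real directory listing.
sample = ["uprops.icu", "cnvalias.icu", "nfc.nrm", "confusables.cfu",
          "coll/ucadata.icu", "brkitr/line.brk", "zone/en.res", "root.res"]
print(select_unicode_files(sample))
```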
757
758 * When refreshing all of ICU4J data from ICU4C
759 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
760 - cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
761 or
762 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
763
764 * update CollationFCD.java
765 + copy & paste the initializers of lcccIndex[] etc. from
766 ICU4C/source/i18n/collationfcd.cpp to
767 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
768
769 * refresh Java test .txt files
770 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
771 cd $ICU_SRC/icu4c/source/data/unidata
772 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
773 cd ../../test/testdata
774 cp BidiCharacterTest.txt BidiTest.txt IdnaTest.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
775 cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
776
777 * run & fix ICU4J tests
778
779 *** API additions
780 - send notice to icu-design about new born-@stable API (enum constants etc.)
781
782 *** CLDR numbering systems
783 - look for new sets of decimal digits (gc=Nd & nv=4) and submit a CLDR ticket
784 Unicode 10: http://unicode.org/cldr/trac/ticket/10219
785 Unicode 9: http://unicode.org/cldr/trac/ticket/9692
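
The gc=Nd & nv=4 search can be approximated with Python's standard unicodedata module. This is only a sketch: it reports the digit sets known to the Unicode version bundled with the Python interpreter (see unicodedata.unidata_version), not those in the new UCD files, and it assumes that each set of ten digits is contiguous so that the zero digit sits four code points before the digit four. The function name is hypothetical.

```python
import sys
import unicodedata


def decimal_digit_zeros():
    """Return the code points of the zero digit of each gc=Nd digit set.

    Finds characters with gc=Nd and decimal value 4, then steps back
    four code points (assumes contiguous digit sets).
    """
    zeros = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Nd" and unicodedata.decimal(ch, None) == 4:
            zeros.append(cp - 4)
    return zeros


zeros = decimal_digit_zeros()
print(unicodedata.unidata_version, len(zeros), "digit sets")
```

Comparing the output of two interpreter versions (or a similar scan over two versions of the UCD files) gives the list of new digit sets to put into the CLDR ticket.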
786
787 *** merge the Unicode update branches back onto the trunk
788 - do not merge the icudata.jar and testdata.jar,
789 instead rebuild them from merged & tested ICU4C
790 - make sure that changes to Unicode tools are checked in:
791 http://www.unicode.org/utility/trac/log/trunk/unicodetools
792
793 ---------------------------------------------------------------------------- ***
794
795 Emoji 5.0 update for ICU 59
796 - ICU 59 mostly remains on Unicode 9.0
797 - except that the bidi and segmentation data are updated to Unicode 10 beta
798
799 First run of tools on combined icu4c/icu4j/tools trunk after svn repository reorg.
800
801 * Command-line environment setup
802
803 ICU_ROOT=~/svn.icu/trunk
804 ICU_SRC_DIR=$ICU_ROOT/src
805 ICU4C_SRC_DIR=$ICU_SRC_DIR/icu4c
806 ICUDT=icudt59b
807 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
808 SRC_DATA_IN=$ICU4C_SRC_DIR/source/data/in
809 UNIDATA=$ICU4C_SRC_DIR/source/data/unidata
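
The ICUDT value and the icudt59l/icudt59b names that appear in this log follow the ICU data naming pattern icudt + major version + a type letter (l for little-endian, b for big-endian, both for ASCII-family platforms). A tiny Python sketch, with a hypothetical function name, that covers only these two letters:

```python
def icu_data_name(major_version, big_endian):
    """Build names like icudt59b (big-endian) or icudt59l (little-endian)."""
    return "icudt%d%s" % (major_version, "b" if big_endian else "l")


print(icu_data_name(59, True))   # value used for ICUDT in this section
print(icu_data_name(59, False))  # name of the build folder on a little-endian machine
```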
810
811 *** ICU Trac
812
813 - ticket:12900: take Emoji 5.0 properties data into ICU 59 once it's released
814 - changes directly on trunk
815
816 *** data files & enums & parser code
817
818 * download files
819
820 - download Unicode 9.0 files into a uni90e50 folder: ucd, idna, security (skip uca)
821 - download emoji 5.0 beta files into the same uni90e50 folder
822 - download Unicode 10.0 beta files: ucd
823 + copy Unicode 10 bidi files to the uni90e50/ucd folder:
824 BidiBrackets.txt
825 BidiCharacterTest.txt
826 BidiMirroring.txt
827 BidiTest.txt
828 extracted/DerivedBidiClass.txt
829 + copy Unicode 10 segmentation files to the uni90e50/ucd folder:
830 LineBreak.txt
831 auxiliary/*
832
833 * preparseucd.py changes
834 - adjust for combined trunks
835 - write new copyright lines
836 - ignore new Emoji_Component property for now
837
838 * process and/or copy files
839 - ~/svn.icu/trunk/src/tools/unicode$ py/preparseucd.py ~/unidata/uni90e50/20170322 $ICU_SRC_DIR
840 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
841
842 - cp ~/unidata/uni90e50/20170322/security/confusables.txt $UNIDATA
843
844 * build ICU (make install)
845 so that the tools build can pick up the new definitions from the installed header files.
846
847 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
848
849 * build Unicode tools using CMake+make
850
851 ~/svn.icu/trunk/src/tools/unicode/c/icudefs.txt:
852
853 # Location (--prefix) of where ICU was installed.
854 set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
855 # Location of the ICU4C source tree.
856 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/trunk/src/icu4c)
857
858 ~/svn.icu/trunk/dbg/tools/unicode/c$
859 cmake ../../../../src/tools/unicode/c
860 make
861
862 * generate core properties data files
863 ~/svn.icu/trunk/dbg/tools/unicode/c$
864 genprops/genprops $ICU4C_SRC_DIR
865 - rebuild ICU (make install) & tools
866
867 * run & fix ICU4C tests
868 - Andy handles RBBI & spoof check test failures
869
870 * update Java data files
871 - refresh just the UCD/UCA-related/derived files, just to be safe
872 - see (ICU4C)/source/data/icu4j-readme.txt
873 - mkdir /tmp/icu4j
874 - ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
875 output:
876 ...
877 Unicode .icu files built to ./out/build/icudt59l
878 echo timestamp > uni-core-data
879 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt59b
880 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b
881 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
882 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt59l.dat ./out/icu4j/icudt59b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt59l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt59b
883 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b"
884 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt59b/
885 mkdir -p /tmp/icu4j/main/shared/data
886 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
887 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt59b/
888 mkdir -p /tmp/icu4j/main/shared/data
889 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
890 make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/trunk/dbg/icu4c/data'
891 - copy the big-endian Unicode data files to another location,
892 separate from the other data files,
893 and then refresh ICU4J
894 cd ~/svn.icu/trunk/dbg/icu4c/data/out/icu4j
895 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
896 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
897 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
898 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
899 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
900 jar uvf ~/svn.icu/trunk/src/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
901
902 * When refreshing all of ICU4J data from ICU4C
903 - ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
904 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu/trunk/src/icu4j/main/shared/data
905 or
906 - ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=~/svn.icu/trunk/src/icu4j icu4j-data-install
907
908 * refresh Java test .txt files
909 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
910 cd $ICU4C_SRC_DIR/source/data/unidata
911 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
912 cd ../../test/testdata
913 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
914 cp ~/unidata/uni90e50/20170322/ucd/CompositionExclusions.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
915
916 * run & fix ICU4J tests
917
918 ---------------------------------------------------------------------------- ***
919
920 Unicode 9.0 update for ICU 58
921
922 * Command-line environment setup
923
924 ICU_ROOT=~/svn.icu/trunk
925 ICU_SRC_DIR=$ICU_ROOT/src
926 ICUDT=icudt58b
927 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
928 SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
929 UNIDATA=$ICU_SRC_DIR/source/data/unidata
930
931 http://www.unicode.org/review/pri323/ -- beta review
932 http://www.unicode.org/reports/uax-proposed-updates.html
933 http://www.unicode.org/versions/beta-9.0.0.html
934 http://www.unicode.org/versions/Unicode9.0.0/
935 http://www.unicode.org/reports/tr44/tr44-17.html
936
937 *** ICU Trac
938
939 - ticket:12526: integrate Unicode 9
940 - C++ ^/icu/branches/markus/uni90, ^/icu/branches/markus/uni90b
941 - Java ^/icu4j/branches/markus/uni90, ^/icu4j/branches/markus/uni90b
942
943 *** CLDR Trac
944
945 - cldrbug 9414: UCA 9
946 - ^/branches/markus/uni90 at r11518 from trunk at r11517
947
948 - cldrbug 8745: Unicode 9.0 script metadata
949
950 *** Unicode version numbers
951 - makedata.mak
952 - uchar.h
953 - com.ibm.icu.util.VersionInfo
954 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
955
956 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
957 so that the makefiles see the new version number.
958
959 *** data files & enums & parser code
960
961 * file preparation
962
963 - download UCD & IDNA files
964 - make sure that the Unicode data folder passed into preparseucd.py
965 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
966 - only for manual diffs: remove version suffixes from the file names
967 ~/unidata/uni70/20140403$ ../../desuffixucd.py .
968 (see https://sites.google.com/site/unicodetools/inputdata)
969 - only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
970 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni90/20160603 $ICU_SRC_DIR ~/svn.icutools/trunk/src
971 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
972
973 - also: from http://unicode.org/Public/security/9.0.0/ download new confusables.txt
974 and copy to $UNIDATA
975 cp ~/unidata/uni90/20160603/security/confusables.txt $UNIDATA
976
977 * preparseucd.py changes
978 - remove or add new Unicode scripts from/to the
979 only-in-ISO-15924 list according to the error messages:
980 ValueError: remove ['Tang'] from _scripts_only_in_iso15924
981 ValueError: sc = Hanb (uchar.h USCRIPT_HAN_WITH_BOPOMOFO) not in the UCD
982 ValueError: sc = Jamo (uchar.h USCRIPT_JAMO) not in the UCD
983 ValueError: sc = Zsye (uchar.h USCRIPT_SYMBOLS_EMOJI) not in the UCD
984 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
985 and in com.ibm.icu.dev.test.lang.TestUScript.java
986 - DerivedNumericValues.txt new numeric values
987 0D58 ; 0.00625 ; ; 1/160 # No MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH
988 0D59 ; 0.025 ; ; 1/40 # No MALAYALAM FRACTION ONE FORTIETH
989 0D5A ; 0.0375 ; ; 3/80 # No MALAYALAM FRACTION THREE EIGHTIETHS
990 0D5B ; 0.05 ; ; 1/20 # No MALAYALAM FRACTION ONE TWENTIETH
991 0D5D ; 0.15 ; ; 3/20 # No MALAYALAM FRACTION THREE TWENTIETHS
992 -> change uprops.h, corepropsbuilder.cpp/encodeNumericValue(),
993 uchar.c, UCharacterProperty.java
994 to support a new series of values
995 - adjust preparseucd.py for Tangut algorithmic names
996 in ppucd.txt:
997 algnamesrange;17000..187EC;han;CJK UNIFIED IDEOGRAPH-
998 ->
999 algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH-
1000 - avoid block-compressing most String/Miscellaneous property values,
1001 triggered by genprops not coping with a multi-code point Case_Folding on
1002 block;1C80..1C8F;...;Cased;cf=0442;CWCF;...
1003 keep block-compressing empty-string mappings NFKC_CF="" for tags and variation selectors
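
As a sanity check on the new DerivedNumericValues.txt entries listed above, the decimal and fraction columns can be compared exactly. A minimal Python sketch using the standard fractions module; the row layout and function names are chosen for illustration only.

```python
from fractions import Fraction

# (code point, decimal column, fraction column) as listed above.
NEW_VALUES = [
    (0x0D58, "0.00625", "1/160"),
    (0x0D59, "0.025", "1/40"),
    (0x0D5A, "0.0375", "3/80"),
    (0x0D5B, "0.05", "1/20"),
    (0x0D5D, "0.15", "3/20"),
]


def mismatched_rows(rows):
    """Return the rows whose decimal and fraction columns disagree."""
    return [row for row in rows if Fraction(row[1]) != Fraction(row[2])]


def is_20_times_power_of_two(denominator):
    """True if the denominator is 20 * 2**k for some k >= 0."""
    quotient, remainder = divmod(denominator, 20)
    return remainder == 0 and quotient > 0 and (quotient & (quotient - 1)) == 0


for cp, dec, frac in NEW_VALUES:
    value = Fraction(frac)
    print("U+%04X %s = %d/%d" % (cp, dec, value.numerator, value.denominator))
```

All five denominators are 20 times a power of two. That is an observation about this data, not a description of how uprops.h encodes the values.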
1004
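The algnamesrange line for Tangut describes names that consist of a fixed prefix followed by the code point in hexadecimal. A minimal Python sketch of that rule, assuming the range and prefix from the corrected ppucd.txt line above; the function name and the tuple layout are hypothetical.

```python
def algorithmic_name(code_point, ranges):
    """Return prefix + uppercase hex code point for a code point in one of the ranges.

    ranges is a list of (start, end, prefix) tuples; returns None outside all ranges.
    """
    for start, end, prefix in ranges:
        if start <= code_point <= end:
            return "%s%04X" % (prefix, code_point)
    return None


# From: algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH-
RANGES = [(0x17000, 0x187EC, "TANGUT IDEOGRAPH-")]
print(algorithmic_name(0x17000, RANGES))
```
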
1005 * PropertyAliases.txt changes
1006 - 1 new property PCM=Prepended_Concatenation_Mark
1007 Ignore: Only useful for layout engines.
1008 Ok to list in ppucd.txt.
1009
1010 * PropertyValueAliases.txt new property values
1011 blk; Adlam ; Adlam
1012 blk; Bhaiksuki ; Bhaiksuki
1013 blk; Cyrillic_Ext_C ; Cyrillic_Extended_C
1014 blk; Glagolitic_Sup ; Glagolitic_Supplement
1015 blk; Ideographic_Symbols ; Ideographic_Symbols_And_Punctuation
1016 blk; Marchen ; Marchen
1017 blk; Mongolian_Sup ; Mongolian_Supplement
1018 blk; Newa ; Newa
1019 blk; Osage ; Osage
1020 blk; Tangut ; Tangut
1021 blk; Tangut_Components ; Tangut_Components
1022 -> add to uchar.h
1023 use long property names for enum constants
1024 -> add to UCharacter.UnicodeBlock IDs
1025 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
1026 replace public static final int \1_ID = \2; \3
1027 -> add to UCharacter.UnicodeBlock objects
1028 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
1029 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
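
The two Eclipse find/replace steps above can also be reproduced outside the IDE. A minimal Python sketch with the same patterns follows; the sample input line is illustrative of the uchar.h enum format and should not be taken as a verified copy of the header.

```python
import re

# Same patterns as the Eclipse find/replace pairs above.
ID_FIND = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
ID_REPLACE = r"public static final int \g<1>_ID = \g<2>; \g<3>"

OBJECT_FIND = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
OBJECT_REPLACE = (
    r'public static final UnicodeBlock \g<1> = '
    r'new UnicodeBlock("\g<1>", \g<1>_ID); \g<2>')


def to_java_id(line):
    """Turn a uchar.h UBLOCK_ enum line into a UCharacter.UnicodeBlock ID constant."""
    return ID_FIND.sub(ID_REPLACE, line)


def to_java_object(line):
    """Turn a uchar.h UBLOCK_ enum line into a UCharacter.UnicodeBlock object."""
    return OBJECT_FIND.sub(OBJECT_REPLACE, line)


sample = "UBLOCK_ADLAM = 263, /*[1E900]*/"  # illustrative line
print(to_java_id(sample))
print(to_java_object(sample))
```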
1030
1031 GCB; EB ; E_Base
1032 GCB; EBG ; E_Base_GAZ
1033 GCB; EM ; E_Modifier
1034 GCB; GAZ ; Glue_After_Zwj
1035 GCB; ZWJ ; ZWJ
1036 -> uchar.h & UCharacter.GraphemeClusterBreak
1037
1038 jg ; African_Feh ; African_Feh
1039 jg ; African_Noon ; African_Noon
1040 jg ; African_Qaf ; African_Qaf
1041 -> uchar.h & UCharacter.JoiningGroup
1042
1043 lb ; EB ; E_Base
1044 lb ; EM ; E_Modifier
1045 lb ; ZWJ ; ZWJ
1046 -> uchar.h & UCharacter.LineBreak
1047
1048 sc ; Adlm ; Adlam
1049 sc ; Bhks ; Bhaiksuki
1050 sc ; Marc ; Marchen
1051 sc ; Newa ; Newa
1052 sc ; Osge ; Osage
1053 sc ; Tang ; Tangut
1054 -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
1055
1056 WB ; EB ; E_Base
1057 WB ; EBG ; E_Base_GAZ
1058 WB ; EM ; E_Modifier
1059 WB ; GAZ ; Glue_After_Zwj
1060 WB ; ZWJ ; ZWJ
1061 -> uchar.h & UCharacter.WordBreak
1062
1063 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
1064 (not strictly necessary for NOT_ENCODED scripts)
1065 ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
1066
1067 * generate normalization data files
1068 cd $ICU_ROOT/dbg
1069 bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
1070 bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
1071 bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
1072 bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
1073 bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
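
The five gennorm2 invocations above differ only in the output file and in the stack of input files layered on top of nfc.txt. A minimal Python sketch that builds the same argument lists, using only the options that appear above (-o, -s, --csource); the function name is hypothetical, and the sketch only constructs the commands without running them.

```python
def gennorm2_commands(unidata, src_data_in, icu_src_dir):
    """Return the gennorm2 argument lists in the order used in this log."""
    norm2_dir = unidata + "/norm2"
    # (output file, input files in layering order, write C source?)
    jobs = [
        (icu_src_dir + "/source/common/norm2_nfc_data.h", ["nfc.txt"], True),
        (src_data_in + "/nfc.nrm", ["nfc.txt"], False),
        (src_data_in + "/nfkc.nrm", ["nfc.txt", "nfkc.txt"], False),
        (src_data_in + "/nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"], False),
        (src_data_in + "/uts46.nrm", ["nfc.txt", "uts46.txt"], False),
    ]
    commands = []
    for output, inputs, csource in jobs:
        command = ["bin/gennorm2", "-o", output, "-s", norm2_dir] + inputs
        if csource:
            command.append("--csource")
        commands.append(command)
    return commands


for command in gennorm2_commands("$UNIDATA", "$SRC_DATA_IN", "$ICU_SRC_DIR"):
    print(" ".join(command))
```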
1074
1075 * build ICU (make install)
1076 so that the tools build can pick up the new definitions from the installed header files.
1077
1078 $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 30 out.txt
1079
1080 * build Unicode tools using CMake+make
1081
1082 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
1083
1084 # Location (--prefix) of where ICU was installed.
1085 set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
1086 # Location of the ICU source tree.
1087 set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)
1088
1089 ~/svn.icutools/trunk/dbg/unicode/c$
1090 cmake ../../../src/unicode/c
1091 make
1092
1093 * generate core properties data files
1094 ~/svn.icutools/trunk/dbg/unicode/c$
1095 genprops/genprops $ICU_SRC_DIR
1096 genuca/genuca --hanOrder implicit $ICU_SRC_DIR
1097 genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
1098 - rebuild ICU (make install) & tools
1099
1100 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1101 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1102 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1103 - Unicode 6.0..9.0: U+2260, U+226E, U+226F
1104 - nothing new in 9.0, no test file to update
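
The grep for "disallowed_STD3_valid" on non-ASCII characters can be sketched in Python as follows. The sketch assumes the usual IdnaMappingTable.txt layout (code point or range, semicolon, status, optional further fields, then a # comment); the function name is hypothetical and the sample lines are modeled on that layout rather than copied from the file.

```python
def non_ascii_std3_valid(lines):
    """Return (start, end) ranges with status disallowed_STD3_valid above U+007F."""
    result = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        if end > 0x7F:
            result.append((max(start, 0x80), end))
    return result


# Illustrative lines in the IdnaMappingTable.txt layout.
sample = [
    "0000..002C    ; disallowed_STD3_valid   # <control-0000>..COMMA",
    "002D..002E    ; valid                   # HYPHEN-MINUS..FULL STOP",
    "2260          ; disallowed_STD3_valid   # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid   # NOT LESS-THAN..NOT GREATER-THAN",
]
for start, end in non_ascii_std3_valid(sample):
    print("U+%04X..U+%04X" % (start, end))
```

Any range reported beyond U+2260, U+226E and U+226F would be a candidate for new test cases in uts46test.cpp and UTS46Test.java.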
1105
1106 * run & fix ICU4C tests
1107 - Andy handles RBBI & spoof check test failures
1108
1109 * collation: CLDR collation root, UCA DUCET
1110
1111 - UCA DUCET goes into Mark's Unicode tools, see
1112 https://sites.google.com/site/unicodetools/home#TOC-UCA
1113 - CLDR root data files are checked into (CLDR UCA branch)/common/uca/
1114 cp (UCA generated)/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/
1115
1116 - cd (CLDR UCA branch)/common/uca/
1117 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1118 cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
1119 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1120 cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
1121 (note: the copy below drops the underscore before "Rules" in the file name)
1122 cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
1123 - restore TODO diffs in UCARules.txt
1124 meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
1125 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
1126 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1127 from the CLDR root files (..._CLDR_..._SHORT.txt)
1128 cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1129 cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1130 cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
1131 - if CLDR common/uca/unihan-index.txt changes, then update
1132 CLDR common/collation/root.xml <collation type="private-unihan">
1133 and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
1134
1135 - run genuca, see command line above;
1136 deal with errors such as
1137 Error: Unknown script for first-primary sample character U+104B5 on line 32599 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt:
1138 FDD1 104B5; [75 B8 02, 05, 05] # Osage first primary (compressible)
1139 (add the character to genuca.cpp sampleCharsToScripts[])
1140 + look up the USCRIPT_ code for the new sample characters
1141 (should be obvious from the comment in the error output)
1142 + *add* mappings to sampleCharsToScripts[], do not replace them
1143 (in case the script sample characters flip-flop)
1144 + insert new scripts in DUCET script order, see the top_byte table
1145 at the beginning of FractionalUCA.txt
1146 - rebuild ICU4C
1147
1148 * Unihan collators
1149 - run Unicode Tools
1150 org.unicode.draft.GenerateUnihanCollators
1151 with VM arguments
1152 -DSVN_WORKSPACE=/home/mscherer/svn.unitools/trunk
1153 -DOTHER_WORKSPACE=/home/mscherer/svn.unitools
1154 -DUCD_DIR=/home/mscherer/svn.unitools/trunk/data
1155 -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
1156 -DUVERSION=9.0.0
1157 -ea
1158 - run Unicode Tools
1159 org.unicode.draft.GenerateUnihanCollatorFiles
1160 with the same arguments
1161 - check CLDR diffs
1162 cd ~/svn.cldr/trunk
1163 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
1164 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
1165 - copy to CLDR
1166 cd ~/svn.cldr/trunk
1167 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
1168 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
1169 - commit to CLDR
1170 - generate ICU zh collation data: run CLDR
1171 org.unicode.cldr.icu.NewLdml2IcuConverter
1172 with program arguments
1173 -t collation
1174 -s /home/mscherer/svn.cldr/trunk/common/collation
1175 -m /home/mscherer/svn.cldr/trunk/common/supplemental
1176 -d /home/mscherer/svn.icu/trunk/src/source/data/coll
1177 -p /home/mscherer/svn.icu/trunk/src/source/data/xml/collation
1178 zh
1179 and VM arguments
1180 -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
1181 - rebuild ICU4C
1182
1183 * run & fix ICU4C tests, now with new CLDR collation root data
1184 - run all tests with the collation test data *_SHORT.txt or the full files
1185 (the full ones have comments, useful for debugging)
1186 - note on intltest: if collate/UCAConformanceTest fails, then
1187 utility/MultithreadTest/TestCollators will fail as well;
1188 fix the conformance test before looking into the multi-thread test
1189
1190 * update Java data files
1191 - refresh just the UCD/UCA-related/derived files, just to be safe
1192 - see (ICU4C)/source/data/icu4j-readme.txt
1193 - mkdir /tmp/icu4j
1194 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1195 output:
1196 ...
1197 Unicode .icu files built to ./out/build/icudt58l
1198 echo timestamp > uni-core-data
1199 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt58b
1200 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b
1201 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
1202 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt58l.dat ./out/icu4j/icudt58b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt58l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt58b
1203 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b"
1204 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt58b/
1205 mkdir -p /tmp/icu4j/main/shared/data
1206 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1207 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt58b/
1208 mkdir -p /tmp/icu4j/main/shared/data
1209 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1210 make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
1211 - copy the big-endian Unicode data files to another location,
1212 separate from the other data files,
1213 and then refresh ICU4J
1214 cd ~/svn.icu/trunk/dbg/data/out/icu4j
1215 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1216 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1217 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1218 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1219 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1220 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1221 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1222 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1223 jar uvf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1224
1225 * When refreshing all of ICU4J data from ICU4C
1226 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1227 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
1228 or
1229 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
1230
1231 * update CollationFCD.java
1232 + copy & paste the initializers of lcccIndex[] etc. from
1233 ICU4C/source/i18n/collationfcd.cpp to
1234 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1235
1236 * refresh Java test .txt files
1237 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1238 cd $ICU_SRC_DIR/source/data/unidata
1239 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1240 cd ../../test/testdata
1241 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1242 cp ~/unidata/uni90/20160603/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1243
1244 * run & fix ICU4J tests
1245
1246 *** LayoutEngine script information
1247
1248 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
1249 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
1250 in the working directory.
1251
1252 (It also generates ScriptRunData.cpp, which is no longer needed.)
1253
1254 It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
1255 (a plain text file)
1256 which maps ICU versions to the numbers of script/language constants
1257 that were added then.
1258 (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
1259
1260 The generated files have a current copyright date and "@deprecated" statement.
1261
1262 * Review changes, fix Java tool if necessary, and copy to ICU4C
1263 cd ~/svn.icu4j/trunk/src
1264 meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
1265 cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
1266 cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
1267
1268 *** API additions
1269 - send notice to icu-design about new born-@stable API (enum constants etc.)
1270
1271 *** merge the Unicode update branches back onto the trunk
1272 - do not merge the icudata.jar and testdata.jar,
1273 instead rebuild them from merged & tested ICU4C
1274 - make sure that changes to Unicode tools & ICU tools are checked in
1275 http://www.unicode.org/utility/trac/log/trunk/unicodetools
1276 http://bugs.icu-project.org/trac/log/tools/trunk
1277
1278 ---------------------------------------------------------------------------- ***
1279
1280 New script codes early in ICU 58: http://bugs.icu-project.org/trac/ticket/11764
1281
1282 Adding
1283 - new scripts in Unicode 9: Adlm, Bhks, Marc, Newa, Osge
1284 - new combination/alias codes: Hanb, Jamo
1285 - used in CLDR 29 and in spoof checker
1286 - new Z* code: Zsye
1287
1288 Add new codes to uscript.h & UScript.java, see Unicode update logs.
1289 -> com.ibm.icu.lang.UScript
1290 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
1291 replace public static final int \1 = \2; \3
1292
1293 Manually edit ppucd.txt and icutools:unicode/c/genprops/pnames_data.h,
1294 add new script codes.
1295 "Long" script names only where established in Unicode 9 PropertyValueAliases.txt.
1296
1297 Note: If we have to run preparseucd.py again before the Unicode 9 update,
1298 then we need to manually keep/restore the new script codes.
1299
1300 ICU_ROOT=~/svn.icu/trunk
1301 ICU_SRC_DIR=$ICU_ROOT/src
1302 ICUDT=icudt57b
1303 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
1304 SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
1305 UNIDATA=$ICU_SRC_DIR/source/data/unidata
1306
1307 Adjust unicode/c/genprops/*builder.cpp for #ifndef/#ifdef changes in _data.h files,
1308 see http://bugs.icu-project.org/trac/ticket/12141
1309
1310 make install, then icutools cmake & make, then
1311 ~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR
1312
1313 Generate Java data as usual, only update pnames.icu & uprops.icu.
1314
1315 *** LayoutEngine script information
1316
1317 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
1318 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
1319 in the working directory.
1320
1321 (It also generates ScriptRunData.cpp, which is no longer needed.)
1322
1323 It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
1324 (a plain text file)
1325 which maps ICU versions to the numbers of script/language constants
1326 that were added then.
1327 (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
1328
1329 The generated files have a current copyright date and "@deprecated" statement.
1330
1331 * Review changes, fix Java tool if necessary, and copy to ICU4C
1332 cd ~/svn.icu4j/trunk/src
1333 meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
1334 cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
1335 cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
1336
1337 ---------------------------------------------------------------------------- ***
1338
1339 Emoji properties added in ICU 57: http://bugs.icu-project.org/trac/ticket/11802
1340
1341 Edit preparseucd.py to add & parse new properties.
1342 They share the UCD property namespace but are not listed in PropertyAliases.txt.
1343
1344 Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/
1345 Initial data from emoji/2.0/
1346
1347 ICU_ROOT=~/svn.icu/trunk
1348 ICU_SRC_DIR=$ICU_ROOT/src
1349 ICUDT=icudt56b
1350 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
1351 SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
1352 UNIDATA=$ICU_SRC_DIR/source/data/unidata
1353
1354 Add binary-property constants to uchar.h enum UProperty & UProperty.java.
1355
1356 ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src
1357 (Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.)
1358
1359 Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java
1360
1361 make install, then icutools cmake & make, then
1362 ~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR
1363
1364 Generate Java data as usual, only update pnames.icu & uprops.icu.
1365
1366 ---------------------------------------------------------------------------- ***
1367
1368 Unicode 8.0 update for ICU 56
1369
1370 * Command-line environment setup
1371
1372 ICU_ROOT=~/svn.icu/trunk
1373 ICU_SRC_DIR=$ICU_ROOT/src
1374 ICUDT=icudt56b
1375 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
1376 SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
1377 UNIDATA=$ICU_SRC_DIR/source/data/unidata
1378
1379 http://www.unicode.org/review/pri297/ -- beta review
1380 http://www.unicode.org/reports/uax-proposed-updates.html
1381 http://unicode.org/versions/beta-8.0.0.html
1382 http://www.unicode.org/versions/Unicode8.0.0/
1383 http://www.unicode.org/reports/tr44/tr44-15.html
1384
1385 *** ICU Trac
1386
1387 - ticket:11574: Unicode 8
1388 - C++ branches/markus/uni80 at r37351 from trunk at r37343
1389 - Java branches/markus/uni80 at r37352 from trunk at r37338
1390
1391 *** CLDR Trac
1392
1393 - cldrbug 8311: UCA 8
1394 - branches/markus/uni80 at r11518 from trunk at r11517
1395
1396 - cldrbug 8109: Unicode 8.0 script metadata
1397 - cldrbug 8418: Updated segmentation for Unicode 8.0
1398
1399 *** Unicode version numbers
1400 - makedata.mak
1401 - uchar.h
1402 - com.ibm.icu.util.VersionInfo
1403 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
1404
1405 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
1406 so that the makefiles see the new version number.
1407
1408 *** data files & enums & parser code
1409
1410 * file preparation
1411
1412 - download UCD & IDNA files
1413 - make sure that the Unicode data folder passed into preparseucd.py
1414 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
1415 - only for manual diffs: remove version suffixes from the file names
1416 ~/unidata/uni70/20140403$ ../../desuffixucd.py .
1417 (see https://sites.google.com/site/unicodetools/inputdata)
1418 - only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
1419 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src
1420 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1421
1422 - also: from http://unicode.org/Public/security/8.0.0/ download new
1423 confusables.txt & confusablesWholeScript.txt
1424 and copy to $UNIDATA
1425 ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA
1426 ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA
1427
1428 * initial preparseucd.py changes
1429 - remove new Unicode scripts from the
1430 only-in-ISO-15924 list according to the error message:
1431 ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw']
1432 from _scripts_only_in_iso15924
1433 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
1434 and in com.ibm.icu.dev.test.lang.TestUScript.java
1435 - property and file name change:
1436 IndicMatraCategory -> IndicPositionalCategory
1437 - UnicodeData.txt unusual numeric values (some fractions not in lowest terms)
1438 109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;
1439 109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;
1440 109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;;
1441 109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;;
1442 109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;;
1443 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;
1444 109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;;
1445 109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;;
1446 109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;;
1447 109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;
1448 -> change preparseucd.py to map them to fractions in lowest terms (e.g., 2/12 -> 1/6),
1449 which are the values listed in DerivedNumericValues.txt;
1450 this keeps storage in the data file simple
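
A minimal sketch of the fraction mapping described above. It is not the actual preparseucd.py code; the function name reduce_numeric_value is hypothetical, and the use of Python's fractions module is an assumption.

```python
from fractions import Fraction


def reduce_numeric_value(value):
    """Return a UnicodeData.txt numeric value with any fraction in lowest terms."""
    if "/" not in value:
        return value  # plain integers such as "12" pass through unchanged
    fraction = Fraction(value)  # Fraction("2/12") normalizes to 1/6
    if fraction.denominator == 1:
        return str(fraction.numerator)
    return f"{fraction.numerator}/{fraction.denominator}"


# The ten Meroitic values listed above: 1/12 .. 10/12.
meroitic = [f"{n}/12" for n in range(1, 11)]
reduced = [reduce_numeric_value(v) for v in meroitic]
```

Values that are already in lowest terms (1/12, 5/12, 7/12) come back unchanged, so the mapping can be applied to every numeric value.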
1451
1452 * PropertyValueAliases.txt changes
1453 - 10 new Block (blk) values:
1454 blk; Ahom ; Ahom
1455 blk; Anatolian_Hieroglyphs ; Anatolian_Hieroglyphs
1456 blk; Cherokee_Sup ; Cherokee_Supplement
1457 blk; CJK_Ext_E ; CJK_Unified_Ideographs_Extension_E
1458 blk; Early_Dynastic_Cuneiform ; Early_Dynastic_Cuneiform
1459 blk; Hatran ; Hatran
1460 blk; Multani ; Multani
1461 blk; Old_Hungarian ; Old_Hungarian
1462 blk; Sup_Symbols_And_Pictographs ; Supplemental_Symbols_And_Pictographs
1463 blk; Sutton_SignWriting ; Sutton_SignWriting
1464 -> add to uchar.h
1465 use long property names for enum constants
1466 -> add to UCharacter.UnicodeBlock IDs
1467 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
1468 replace public static final int \1_ID = \2; \3
1469 -> add to UCharacter.UnicodeBlock objects
1470 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
1471 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1472 - 6 new Script (sc) values:
1473 sc ; Ahom ; Ahom
1474 sc ; Hatr ; Hatran
1475 sc ; Hluw ; Anatolian_Hieroglyphs
1476 sc ; Hung ; Old_Hungarian
1477 sc ; Mult ; Multani
1478 sc ; Sgnw ; SignWriting
1479 -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
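
The two Eclipse find/replace steps for the block constants can probably be reproduced outside the IDE. The following is a sketch with Python's re module, using the same patterns; the function names and the sample input line are illustrative assumptions, not copied from uchar.h.

```python
import re

# Same regular expressions as in the Eclipse find/replace steps above.
BLOCK_ID_PATTERN = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
BLOCK_OBJECT_PATTERN = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")


def to_java_block_id(line):
    """Turn a uchar.h UBLOCK_ enum line into a UnicodeBlock *_ID constant."""
    return BLOCK_ID_PATTERN.sub(r"public static final int \1_ID = \2; \3", line)


def to_java_block_object(line):
    """Turn a uchar.h UBLOCK_ enum line into a UnicodeBlock object declaration."""
    return BLOCK_OBJECT_PATTERN.sub(
        r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2', line)


# Illustrative input line in the style of the uchar.h block enum.
sample = "UBLOCK_AHOM = 253, /*[11700]*/"
id_line = to_java_block_id(sample)
object_line = to_java_block_object(sample)
```

Lines that do not match the pattern are returned unchanged, so the functions can be mapped over a pasted block of enum lines.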
1480
1481 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
1482 (not strictly necessary for NOT_ENCODED scripts)
1483 ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
1484
1485 * generate normalization data files
1486 cd $ICU_ROOT/dbg
1487 bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
1488 bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
1489 bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
1490 bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
1491 bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
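
The five gennorm2 invocations differ only in output file, input files, and the --csource option. The following is a sketch that builds the argument lists without running anything; the function name gennorm2_commands is hypothetical, and the paths follow the SRC_DATA_IN and UNIDATA settings above.

```python
import posixpath


def gennorm2_commands(icu_src_dir):
    """Build the five gennorm2 argument lists from the step above (nothing is executed)."""
    src_data_in = posixpath.join(icu_src_dir, "source/data/in")
    norm2_dir = posixpath.join(icu_src_dir, "source/data/unidata/norm2")
    # (output file, input files, extra options), in the order listed above.
    jobs = [
        (posixpath.join(icu_src_dir, "source/common/norm2_nfc_data.h"), ["nfc.txt"], ["--csource"]),
        (posixpath.join(src_data_in, "nfc.nrm"), ["nfc.txt"], []),
        (posixpath.join(src_data_in, "nfkc.nrm"), ["nfc.txt", "nfkc.txt"], []),
        (posixpath.join(src_data_in, "nfkc_cf.nrm"), ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"], []),
        (posixpath.join(src_data_in, "uts46.nrm"), ["nfc.txt", "uts46.txt"], []),
    ]
    return [["bin/gennorm2", "-o", output, "-s", norm2_dir, *inputs, *options]
            for output, inputs, options in jobs]
```

Each list could be handed to subprocess.run(..., cwd=..., check=True) from $ICU_ROOT/dbg, with LD_LIBRARY_PATH set as in the environment setup.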
1492
1493 * build ICU (make install)
1494 so that the tools build can pick up the new definitions from the installed header files.
1495
1496 $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
1497
1498 * build Unicode tools using CMake+make
1499
1500 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
1501
1502 # Location (--prefix) of where ICU was installed.
1503 set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
1504 # Location of the ICU source tree.
1505 set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)
1506
1507 ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
1508 ~/svn.icutools/trunk/dbg/unicode/c$ make
1509
1510 * generate core properties data files
1511 - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
1512 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR
1513 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
1514 - rebuild ICU (make install) & tools
1515 - run genuca again (see step above) so that it picks up the new nfc.nrm
1516 - rebuild ICU (make install) & tools
1517
1518 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1519 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1520 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1521 - Unicode 6.0..8.0: U+2260, U+226E, U+226F
1522 - nothing new in 8.0, no test file to update
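
The grep for "disallowed_STD3_valid" on non-ASCII characters could also be done with a small script. The following is a sketch, assuming the usual UCD-style layout (semicolon-separated fields, ".." for ranges, "#" for comments); the function name and the sample lines are illustrative and not copied from IdnaMappingTable.txt.

```python
def non_ascii_std3_valid(lines):
    """Return (start, end) ranges with status disallowed_STD3_valid that reach beyond ASCII."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start_text, _, end_text = fields[0].partition("..")
        start = int(start_text, 16)
        end = int(end_text, 16) if end_text else start
        if end > 0x7F:
            found.append((start, end))
    return found


# Hand-written lines in the style of the mapping table.
sample = [
    "0000..002C    ; disallowed_STD3_valid                  # <control-0000>..COMMA",
    "0041          ; mapped                 ; 0061          # LATIN CAPITAL LETTER A",
    "2260          ; disallowed_STD3_valid                  # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid                  # NOT LESS-THAN..NOT GREATER-THAN",
]
hits = non_ascii_std3_valid(sample)
```

Comparing the result for the new file against the known set (U+2260, U+226E, U+226F) shows whether the UTS #46 tests need new cases.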
1523
1524 * run & fix ICU4C tests
1525 - bad Cherokee case folding due to difference in fallbacks:
1526 UCD case folding falls back to no mapping,
1527 ICU runtime case folding falls back to lowercasing;
1528 fixed casepropsbuilder.cpp to generate scf mappings to self
1529 when there is an slc mapping but no scf
1530 - Andy handles RBBI & spoof check test failures
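
The Cherokee case folding fix can be illustrated with plain dictionaries. This is a sketch of the idea only, not the casepropsbuilder.cpp code; the function names are hypothetical, and the sample mappings assume that U+13A0 lowercases to U+AB70 while U+AB70 folds to U+13A0.

```python
def runtime_fold(code_point, slc, scf):
    """Model of the runtime fallback: use scf if present, else fall back to lowercasing."""
    return scf.get(code_point, slc.get(code_point, code_point))


def add_self_foldings(slc, scf):
    """Add an explicit scf mapping to self wherever there is an slc mapping but no scf."""
    result = dict(scf)
    for code_point in slc:
        if code_point not in result:
            result[code_point] = code_point
    return result


# Sample data: Latin A has both mappings; Cherokee U+13A0 has only slc.
slc = {0x41: 0x61, 0x13A0: 0xAB70}
scf = {0x41: 0x61, 0xAB70: 0x13A0}

before = runtime_fold(0x13A0, slc, scf)                          # falls back to lowercasing
after = runtime_fold(0x13A0, slc, add_self_foldings(slc, scf))   # explicit mapping to self
```

With the explicit self mapping, the runtime fallback to lowercasing is never reached for these characters, which matches the UCD behavior of "no mapping".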
1531
1532 * collation: CLDR collation root, UCA DUCET
1533
1534 - UCA DUCET goes into Mark's Unicode tools, see
1535 https://sites.google.com/site/unicodetools/home#TOC-UCA
1536 - CLDR root data files are checked into (CLDR UCA branch)/common/uca/
1537 - cd (CLDR UCA branch)/common/uca/
1538 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1539 cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
1540 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1541 cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
1542 (note: the target file name drops the underscore before "Rules")
1543 cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
1544 - restore TODO diffs in UCARules.txt
1545 meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
1546 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
1547 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1548 from the CLDR root files (..._CLDR_..._SHORT.txt)
1549 cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1550 cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1551 cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
1552 - if CLDR common/uca/unihan-index.txt changes, then update
1553 CLDR common/collation/root.xml <collation type="private-unihan">
1554 and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
1555 - run genuca, see command line above;
1556 deal with
1557 Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt
1558 (add the character to genuca.cpp sampleCharsToScripts[])
1559 + look up the script for the new sample characters
1560 (e.g., in FractionalUCA.txt)
1561 + *add* mappings to sampleCharsToScripts[], do not replace them
1562 (in case the script sample characters flip-flop)
1563 + insert new scripts in DUCET script order, see the top_byte table
1564 at the beginning of FractionalUCA.txt
1565 - rebuild ICU4C
1566
1567 * run & fix ICU4C tests, now with new CLDR collation root data
1568 - run all tests with the collation test data *_SHORT.txt or the full files
1569 (the full ones have comments, useful for debugging)
1570 - note on intltest: if collate/UCAConformanceTest fails, then
1571 utility/MultithreadTest/TestCollators will fail as well;
1572 fix the conformance test before looking into the multi-thread test
1573 - fixed bug in CollationWeights::getWeightRanges()
1574 exposed by new data and CollationTest::TestRootElements
1575
1576 * update Java data files
1577 - refresh just the UCD/UCA-related/derived files, just to be safe
1578 - see (ICU4C)/source/data/icu4j-readme.txt
1579 - mkdir /tmp/icu4j
1580 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1581 output:
1582 ...
1583 Unicode .icu files built to ./out/build/icudt56l
1584 echo timestamp > uni-core-data
1585 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b
1586 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b
1587 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
1588 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b
1589 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b"
1590 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/
1591 mkdir -p /tmp/icu4j/main/shared/data
1592 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1593 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/
1594 mkdir -p /tmp/icu4j/main/shared/data
1595 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1596 make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
1597 - copy the big-endian Unicode data files to another location,
1598 separate from the other data files,
1599 and then refresh ICU4J
1600 cd ~/svn.icu/trunk/dbg/data/out/icu4j
1601 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1602 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1603 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1604 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1605 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1606 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1607 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1608 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1609 jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
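
The cp/rm sequence above selects a subset of the big-endian data files. The following is a sketch of the same selection as a predicate over paths relative to com/ibm/icu/impl/data/$ICUDT; the function name and the sample file names are illustrative assumptions.

```python
def is_unicode_data_file(relative_path):
    """Mirror the cp/rm steps above: top-level *.icu (minus cnvalias.icu), *.nrm,
    confusables.cfu, and the direct contents of coll/ and brkitr/."""
    directory, _, name = relative_path.rpartition("/")
    if directory == "":
        if name == "cnvalias.icu":
            return False  # removed again after the *.icu copy
        return name == "confusables.cfu" or name.endswith((".icu", ".nrm"))
    return directory in ("coll", "brkitr")


# Illustrative file names only.
files = [
    "uprops.icu", "cnvalias.icu", "nfc.nrm", "confusables.cfu",
    "coll/root.res", "brkitr/char.brk", "root.res", "zone/en.res",
]
selected = [path for path in files if is_unicode_data_file(path)]
```

Only the selected files would then be copied under /tmp/icu4j before the "jar uf" step.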
1610
1611 * When refreshing all of ICU4J data from ICU4C
1612 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1613 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
1614 or
1615 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
1616
1617 * update CollationFCD.java
1618 + copy & paste the initializers of lcccIndex[] etc. from
1619 ICU4C/source/i18n/collationfcd.cpp to
1620 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1621
1622 * refresh Java test .txt files
1623 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1624 cd $ICU_SRC_DIR/source/data/unidata
1625 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1626 cd ../../test/testdata
1627 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1628 cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1629
1630 * run & fix ICU4J tests
1631
1632 *** LayoutEngine script information
1633
1634 * ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more,
1635 because the layout engine was deprecated in ICU 54.
1636 Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java
1637 to write lines that we used to add manually.
1638
1639 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
1640 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
1641 in the working directory.
1642
1643 (It also generates ScriptRunData.cpp, which is no longer needed.)
1644
1645 It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
1646 (a plain text file)
1647 which maps ICU versions to the numbers of script/language constants
1648 that were added then.
1649 (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
1650
1651 The generated files have a current copyright date and "@deprecated" statement.
1652
1653 * Review changes, fix Java tool if necessary, and copy to ICU4C
1654 cd ~/svn.icu4j/trunk/src
1655 meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
1656 cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
1657 cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
1658
1659 *** API additions
1660 - send notice to icu-design about new born-@stable API (enum constants etc.)
1661
1662 *** merge the Unicode update branches back onto the trunk
1663 - do not merge the icudata.jar and testdata.jar,
1664 instead rebuild them from merged & tested ICU4C
1665 - make sure that changes to Unicode tools & ICU tools are checked in
1666 http://www.unicode.org/utility/trac/log/trunk/unicodetools
1667 http://bugs.icu-project.org/trac/log/tools/trunk
1668
1669 ---------------------------------------------------------------------------- ***
1670
1671 Unicode 7.0 update for ICU 54
1672
1673 http://www.unicode.org/review/pri271/ -- beta review
1674 http://www.unicode.org/reports/uax-proposed-updates.html
1675 http://www.unicode.org/versions/beta-7.0.0.html#notable_issues
1676 http://www.unicode.org/reports/tr44/tr44-13.html
1677
1678 *** ICU Trac
1679
1680 - ticket 10821: Unicode 7.0, UCA 7.0
1681 - C++ branches/markus/uni70 at r35584 from trunk at r35580
1682 - Java branches/markus/uni70 at r35587 from trunk at r35545
1683
1684 *** CLDR Trac
1685
1686 - ticket 7195: UCA 7.0 CLDR root collation
1687 - branches/markus/uni70 at r10062 from trunk at r10061
1688
1689 - ticket 6762: script metadata for Unicode 7.0 new scripts
1690
1691 *** Unicode version numbers
1692 - makedata.mak
1693 - uchar.h
1694 - com.ibm.icu.util.VersionInfo
1695 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
1696
1697 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
1698 so that the makefiles see the new version number.
1699
1700 *** data files & enums & parser code
1701
1702 * file preparation
1703
1704 - download UCD & IDNA files
1705 - make sure that the Unicode data folder passed into preparseucd.py
1706 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
1707 - only for manual diffs: remove version suffixes from the file names
1708 ~/unidata/uni70/20140403$ ../../desuffixucd.py .
1709 (see https://sites.google.com/site/unicodetools/inputdata)
1710 - only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
1711 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src
1712 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1713 - Restore TODO diffs in source/data/unidata/UCARules.txt
1714 cd $ICU_SRC_DIR
1715 meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt
1716 - Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt
1717
1718 - also: from http://unicode.org/Public/security/7.0.0/ download new
1719 confusables.txt & confusablesWholeScript.txt
1720 and copy to $ICU_ROOT/src/source/data/unidata/
1721
1722 * initial preparseucd.py changes
1723 - remove new Unicode scripts from the
1724 only-in-ISO-15924 list according to the error message:
1725 ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass',
1726 'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm',
1727 'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj']
1728 from _scripts_only_in_iso15924
1729 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
1730 and in com.ibm.icu.dev.test.lang.TestUScript.java
1731 - NamesList.txt now has a heading with a non-ASCII character
1732 + keep ppucd.txt in platform charset, rather than changing tool/test parsers
1733 + escape non-ASCII characters in heading comments
1734 - preparseucd.py gets the Unicode copyright line from PropertyAliases.txt, which is currently still at 2013
1735 + get the copyright from the first file whose copyright line contains the current year
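
A sketch of the copyright selection described in the last item. It is not the actual preparseucd.py code; the function name pick_copyright and the sample copyright lines are hypothetical.

```python
import re


def pick_copyright(candidates, current_year):
    """Return the first copyright line that mentions current_year, or None.

    candidates is an iterable of (file_name, copyright_line) pairs in parse order.
    """
    year_pattern = re.compile(r"\b" + str(current_year) + r"\b")
    for _file_name, line in candidates:
        if year_pattern.search(line):
            return line
    return None


# Illustrative copyright lines, not copied from the UCD files.
candidates = [
    ("PropertyAliases.txt", "# Copyright (c) 1991-2013 Unicode, Inc."),
    ("PropertyValueAliases.txt", "# Copyright (c) 1991-2014 Unicode, Inc."),
]
chosen = pick_copyright(candidates, 2014)
```

Returning None when no file has the current year leaves it to the caller to decide whether to fail or to fall back to an older line.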
1736
1737 * PropertyValueAliases.txt changes
1738 - 32 new Block (blk) values:
1739 blk; Bassa_Vah ; Bassa_Vah
1740 blk; Caucasian_Albanian ; Caucasian_Albanian
1741 blk; Coptic_Epact_Numbers ; Coptic_Epact_Numbers
1742 blk; Diacriticals_Ext ; Combining_Diacritical_Marks_Extended
1743 blk; Duployan ; Duployan
1744 blk; Elbasan ; Elbasan
1745 blk; Geometric_Shapes_Ext ; Geometric_Shapes_Extended
1746 blk; Grantha ; Grantha
1747 blk; Khojki ; Khojki
1748 blk; Khudawadi ; Khudawadi
1749 blk; Latin_Ext_E ; Latin_Extended_E
1750 blk; Linear_A ; Linear_A
1751 blk; Mahajani ; Mahajani
1752 blk; Manichaean ; Manichaean
1753 blk; Mende_Kikakui ; Mende_Kikakui
1754 blk; Modi ; Modi
1755 blk; Mro ; Mro
1756 blk; Myanmar_Ext_B ; Myanmar_Extended_B
1757 blk; Nabataean ; Nabataean
1758 blk; Old_North_Arabian ; Old_North_Arabian
1759 blk; Old_Permic ; Old_Permic
1760 blk; Ornamental_Dingbats ; Ornamental_Dingbats
1761 blk; Pahawh_Hmong ; Pahawh_Hmong
1762 blk; Palmyrene ; Palmyrene
1763 blk; Pau_Cin_Hau ; Pau_Cin_Hau
1764 blk; Psalter_Pahlavi ; Psalter_Pahlavi
1765 blk; Shorthand_Format_Controls ; Shorthand_Format_Controls
1766 blk; Siddham ; Siddham
1767 blk; Sinhala_Archaic_Numbers ; Sinhala_Archaic_Numbers
1768 blk; Sup_Arrows_C ; Supplemental_Arrows_C
1769 blk; Tirhuta ; Tirhuta
1770 blk; Warang_Citi ; Warang_Citi
1771 -> add to uchar.h
1772 use long property names for enum constants
1773 -> add to UCharacter.UnicodeBlock IDs
1774 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
1775 replace public static final int \1_ID = \2; \3
1776 -> add to UCharacter.UnicodeBlock objects
1777 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
1778 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1779 - 28 new Joining_Group (jg) values:
1780 jg ; Manichaean_Aleph ; Manichaean_Aleph
1781 jg ; Manichaean_Ayin ; Manichaean_Ayin
1782 jg ; Manichaean_Beth ; Manichaean_Beth
1783 jg ; Manichaean_Daleth ; Manichaean_Daleth
1784 jg ; Manichaean_Dhamedh ; Manichaean_Dhamedh
1785 jg ; Manichaean_Five ; Manichaean_Five
1786 jg ; Manichaean_Gimel ; Manichaean_Gimel
1787 jg ; Manichaean_Heth ; Manichaean_Heth
1788 jg ; Manichaean_Hundred ; Manichaean_Hundred
1789 jg ; Manichaean_Kaph ; Manichaean_Kaph
1790 jg ; Manichaean_Lamedh ; Manichaean_Lamedh
1791 jg ; Manichaean_Mem ; Manichaean_Mem
1792 jg ; Manichaean_Nun ; Manichaean_Nun
1793 jg ; Manichaean_One ; Manichaean_One
1794 jg ; Manichaean_Pe ; Manichaean_Pe
1795 jg ; Manichaean_Qoph ; Manichaean_Qoph
1796 jg ; Manichaean_Resh ; Manichaean_Resh
1797 jg ; Manichaean_Sadhe ; Manichaean_Sadhe
1798 jg ; Manichaean_Samekh ; Manichaean_Samekh
1799 jg ; Manichaean_Taw ; Manichaean_Taw
1800 jg ; Manichaean_Ten ; Manichaean_Ten
1801 jg ; Manichaean_Teth ; Manichaean_Teth
1802 jg ; Manichaean_Thamedh ; Manichaean_Thamedh
1803 jg ; Manichaean_Twenty ; Manichaean_Twenty
1804 jg ; Manichaean_Waw ; Manichaean_Waw
1805 jg ; Manichaean_Yodh ; Manichaean_Yodh
1806 jg ; Manichaean_Zayin ; Manichaean_Zayin
1807 jg ; Straight_Waw ; Straight_Waw
1808 -> uchar.h & UCharacter.JoiningGroup
1809 - 23 new Script (sc) values:
1810 sc ; Aghb ; Caucasian_Albanian
1811 sc ; Bass ; Bassa_Vah
1812 sc ; Dupl ; Duployan
1813 sc ; Elba ; Elbasan
1814 sc ; Gran ; Grantha
1815 sc ; Hmng ; Pahawh_Hmong
1816 sc ; Khoj ; Khojki
1817 sc ; Lina ; Linear_A
1818 sc ; Mahj ; Mahajani
1819 sc ; Mani ; Manichaean
1820 sc ; Mend ; Mende_Kikakui
1821 sc ; Modi ; Modi
1822 sc ; Mroo ; Mro
1823 sc ; Narb ; Old_North_Arabian
1824 sc ; Nbat ; Nabataean
1825 sc ; Palm ; Palmyrene
1826 sc ; Pauc ; Pau_Cin_Hau
1827 sc ; Perm ; Old_Permic
1828 sc ; Phlp ; Psalter_Pahlavi
1829 sc ; Sidd ; Siddham
1830 sc ; Sind ; Khudawadi
1831 sc ; Tirh ; Tirhuta
1832 sc ; Wara ; Warang_Citi
1833 -> uscript.h (many were added before)
1834 comment "Mende Kikakui" for USCRIPT_MENDE
1835 add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias
1836 -> com.ibm.icu.lang.UScript
1837 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
1838 replace public static final int \1 = \2; \3
1839 - 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
1840 (added 2012-11-01)
1841 Ahom 338 Ahom
1842 Hatr 127 Hatran
1843 Mult 323 Multani
1844 (added 2013-10-12)
1845 Modi 324 Modi
1846 Pauc 263 Pau Cin Hau
1847 Sidd 302 Siddham
1848 -> uscript.h (some overlap with additions from Unicode)
1849 -> com.ibm.icu.lang.UScript
1850 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
1851 replace public static final int \1 = \2; \3
1852 -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924
1853 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
1854 and in com.ibm.icu.dev.test.lang.TestUScript.java
1855
1856 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
1857 (not strictly necessary for NOT_ENCODED scripts)
1858 ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
1859
1860 * generate normalization data files
1861 - cd $ICU_ROOT/dbg
1862 - export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
1863 - SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
1864 - UNIDATA=$ICU_SRC_DIR/source/data/unidata
1865 - bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
1866 - bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
1867 - bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
1868 - bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
1869 - bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
1870
1871 * build ICU (make install)
1872 so that the tools build can pick up the new definitions from the installed header files.
1873
1874 ~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
1875
1876 * build Unicode tools using CMake+make
1877
1878 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
1879
1880 # Location (--prefix) of where ICU was installed.
1881 set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst)
1882 # Location of the ICU source tree.
1883 set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src)
1884
1885 ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
1886 ~/svn.icutools/trunk/dbg/unicode/c$ make
1887
1888 * genprops work
1889 - new code point range for Joining_Group values: 10AC0..10AFF Manichaean
1890 + add second array of Joining_Group values for at most 10800..10FFF
1891 icutools: unicode/c/genprops/bidipropsbuilder.cpp
1892 icu: source/common/ubidi_props.h/.c/_data.h
1893 icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java
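
A sketch of a lookup with two Joining_Group arrays, one per code point range. The class name, the start of the first range, and the stored values are assumptions for illustration; only the second range (Manichaean, from 10AC0) comes from the note above, and 0 stands in for "no joining group".

```python
class JoiningGroupTable:
    """Two contiguous arrays of Joining_Group values, each with its own start code point."""

    def __init__(self, start, values, start2, values2):
        self.start, self.values = start, values
        self.start2, self.values2 = start2, values2

    def get(self, code_point):
        # First array: the original (Arabic-area) range.
        if self.start <= code_point < self.start + len(self.values):
            return self.values[code_point - self.start]
        # Second array: the added supplementary range.
        if self.start2 <= code_point < self.start2 + len(self.values2):
            return self.values2[code_point - self.start2]
        return 0  # outside both ranges: no joining group


# Placeholder values; real data would hold Joining_Group enum values.
table = JoiningGroupTable(0x0620, [1, 2, 3], 0x10AC0, [7, 8])
```

Keeping two short arrays avoids one large, mostly empty array that would span from the Arabic range to the supplementary range.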
1894
1895 * generate core properties data files
1896 - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
1897 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR
1898 - rebuild ICU (make install) & tools
1899 - run genuca again (see step above) so that it picks up the new nfc.nrm
1900 - rebuild ICU (make install) & tools
1901
1902 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1903 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1904 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1905 - Unicode 6.0..7.0: U+2260, U+226E, U+226F
1906 - nothing new in 7.0, no test file to update
1907
1908 * run & fix ICU4C tests
1909
1910 * update Java data files
1911 - refresh just the UCD-related files, just to be safe
1912 - see (ICU4C)/source/data/icu4j-readme.txt
1913 - mkdir /tmp/icu4j
1914 - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1915 output:
1916 ...
1917 Unicode .icu files built to ./out/build/icudt53l
1918 echo timestamp > uni-core-data
1919 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b
1920 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b
1921 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
1922 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b
1923 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b"
1924 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/
1925 mkdir -p /tmp/icu4j/main/shared/data
1926 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1927 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/
1928 mkdir -p /tmp/icu4j/main/shared/data
1929 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1930 make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data'
1931 - copy the big-endian Unicode data files to another location,
1932 separate from the other data files
1933 ICUDT=icudt54b
1934 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1935 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1936 cd ~/svn.icu/uni70/dbg/data/out/icu4j
1937 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1938 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1939 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1940 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1941 cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1942 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1943 - refresh ICU4J
1944 ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1945
1946 * update CollationFCD.java
1947 + copy & paste the initializers of lcccIndex[] etc. from
1948 ICU4C/source/i18n/collationfcd.cpp to
1949 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1950
1951 * refresh Java test .txt files
1952 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1953 cd $ICU_SRC_DIR/source/data/unidata
1954 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1955 cd ../../test/testdata
1956 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1957 cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1958
1959 * UCA
1960
1961 - download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/
1962 - run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata)
1963 - update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/
1964 - run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA
1965 - output files are in ~/svn.unitools/Generated/uca/7.0.0/
1966 - review data; compare files, use blankweights.sed or similar
1967 ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt
1968 - cd ~/svn.unitools/Generated/uca/7.0.0/
1969 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1970 cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
1971 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1972 (note removing the underscore before "Rules")
1973 cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
1974 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
1975 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1976 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
1977 cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1978 cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1979 cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
1980 - run genuca, see command line above
1981 - rebuild ICU4C
1982 - refresh ICU4J collation data:
1983 (subset of instructions above for properties data refresh, except copies all coll/*)
1984 ICUDT=icudt54b
1985 ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1986 ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1987 ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1988 ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1989 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
1990 - note on intltest: if collate/UCAConformanceTest fails, then
1991 utility/MultithreadTest/TestCollators will fail as well;
1992 fix the conformance test before looking into the multi-thread test
1993 - copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors
1994 - copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch
1995 ~/svn.unitools$ cp Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/
1996
1997 * When refreshing all of ICU4J data from ICU4C
1998 - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1999 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
2000 or
2001 - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
2002
2003 * run & fix ICU4J tests
2004
2005 *** LayoutEngine script information
2006
2007 (For details see the Unicode 5.2 change log below.)
2008
2009 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
2010 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
2011 in the working directory.
2012 (It also generates ScriptRunData.cpp, which is no longer needed.)
2013
2014 The generated files have a current copyright date and "@stable" statement.
2015 ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java
2016 for "born stable" Unicode API constants, and to stop parsing ICU version numbers
2017 which may not contain dots any more.
2018
2019 - diff current <icu>/source/layout files vs. generated ones
2020 ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
2021 review and manually merge desired changes;
2022 fix gratuitous changes, incorrect @draft/@stable and missing aliases;
2023 Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
2024 - if you just copy the above files, then
2025 fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
2026 manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
2027
2028 *** API additions
2029 - send notice to icu-design about new born-@stable API (enum constants etc.)
2030
2031 *** merge the Unicode update branches back onto the trunk
2032 - do not merge the icudata.jar and testdata.jar,
2033 instead rebuild them from merged & tested ICU4C
2034
2035 ---------------------------------------------------------------------------- ***
2036
2037 Unicode 6.3 update
2038
2039 http://www.unicode.org/review/pri249/ -- beta review
2040 http://www.unicode.org/reports/uax-proposed-updates.html
2041 http://www.unicode.org/versions/beta-6.3.0.html#notable_issues
2042 http://www.unicode.org/reports/tr44/tr44-11.html
2043
2044 *** ICU Trac
2045
2046 - ticket 10128: update ICU to Unicode 6.3 beta
2047 - ticket 10168: update ICU to Unicode 6.3 final
2048 - C++ branches/markus/uni63 at r33552 from trunk at r33551
2049 - Java branches/markus/uni63 at r33550 from trunk at r33553
2050
2051 - ticket 10142: implement Unicode 6.3 bidi algorithm additions
2052
2053 *** Unicode version numbers
2054 - makedata.mak
2055 - uchar.h
(configure.in & configure have been modified to extract the version from uchar.h)
2057 - com.ibm.icu.util.VersionInfo
2058 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
2059
2060 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
2061 so that the makefiles see the new version number.
2062
2063 *** data files & enums & parser code
2064
2065 * file preparation
2066
2067 - download UCD, UCA & IDNA files
2068 - make sure that the Unicode data folder passed into preparseucd.py
2069 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
2070 - modify preparseucd.py:
2071 parse new file BidiBrackets.txt
2072 with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type
2073 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src
2074 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2075 - Check test file diffs for previously commented-out, known-failing data lines;
2076 probably need to keep those commented out.
2077
2078 * PropertyAliases.txt changes
2079 - 1 new Enumerated Property
2080 bpt ; Bidi_Paired_Bracket_Type
2081 -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType
2082 -> ubidi_props.h & .c & UBiDiProps.java
2083 -> remember to write the max value at UBIDI_MAX_VALUES_INDEX
2084 -> uprops.cpp
2085 -> change ubidi.icu format version from 2.0 to 2.1
2086 - 1 new Miscellaneous Property
2087 bpb ; Bidi_Paired_Bracket
2088 -> uchar.h & UProperty.java
2089 -> ppucd.h & .cpp
2090
2091 * PropertyValueAliases.txt changes
2092 - 3 Bidi_Paired_Bracket_Type (bpt) values:
2093 bpt; c ; Close
2094 bpt; n ; None
2095 bpt; o ; Open
2096 -> uchar.h & UCharacter.BidiPairedBracketType
2097 -> ubidi_props.h & .c & UBiDiProps.java
2098 -> change ubidi.icu format version from 2.0 to 2.1
2099 - 4 new Bidi_Class (bc) values:
2100 bc ; FSI ; First_Strong_Isolate
2101 bc ; LRI ; Left_To_Right_Isolate
2102 bc ; RLI ; Right_To_Left_Isolate
2103 bc ; PDI ; Pop_Directional_Isolate
2104 -> uchar.h & UCharacterEnums.ECharacterDirection
2105 -> until the bidi code gets updated,
2106 Roozbeh suggests mapping the new bc values to ON (Other_Neutral)
2107 - 3 new Word_Break (WB) values:
2108 WB ; HL ; Hebrew_Letter
2109 WB ; SQ ; Single_Quote
2110 WB ; DQ ; Double_Quote
2111 -> uchar.h & UCharacter.WordBreak
2112 -> first time Word_Break numeric constants exceed 4 bits (now 17 values)
2113 - 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
2114 (added 2012-10-16)
2115 Aghb 239 Caucasian Albanian
2116 Mahj 314 Mahajani
2117 -> uscript.h
2118 -> com.ibm.icu.lang.UScript
2119 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2120 replace public static final int \1 = \2;\3
2121 -> preparseucd.py _scripts_only_in_iso15924
2122 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
2123 and in com.ibm.icu.dev.test.lang.TestUScript.java
2124 -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
2125 (not strictly necessary for NOT_ENCODED scripts)
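The find/replace pair listed above for com.ibm.icu.lang.UScript can also be applied outside an editor. A minimal Python sketch, assuming the uscript.h enum lines look like the sample below. The helper name c_enum_to_java is made up for this example, and the numeric value in the sample line is a placeholder rather than a value copied from uscript.h.

```python
import re

# Pattern and replacement as given in the change log for uscript.h -> UScript.java.
USCRIPT_ENUM = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
JAVA_CONSTANT = r"public static final int \1 = \2;\3"


def c_enum_to_java(line):
    """Rewrite one C enum line into a Java constant; other lines pass through."""
    return USCRIPT_ENUM.sub(JAVA_CONSTANT, line)


# Illustrative input line; the value 160 is a placeholder.
sample = "USCRIPT_MAHAJANI = 160,/* Mahj */"
print(c_enum_to_java(sample))
# prints: public static final int MAHAJANI = 160;/* Mahj */
```

Lines that do not match the pattern are returned unchanged, so the helper can be run over a whole pasted enum block.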
2126
2127 * generate normalization data files
2128 - ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib
2129 - ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in
2130 - ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata
2131 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
2132 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
2133 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
2134 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
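The four gennorm2 invocations above differ only in the output file and in the list of input files, and each later file builds on nfc.txt. A small Python sketch that derives the argument lists from one table; it only builds the commands and does not run gennorm2. The function name gennorm2_commands and the table name are invented for this sketch, and the directories passed in are whatever SRC_DATA_IN and UNIDATA point to.

```python
# Output file -> input files, as in the four gennorm2 commands above.
NORM2_TARGETS = [
    ("nfc.nrm", ["nfc.txt"]),
    ("nfkc.nrm", ["nfc.txt", "nfkc.txt"]),
    ("nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"]),
    ("uts46.nrm", ["nfc.txt", "uts46.txt"]),
]


def gennorm2_commands(src_data_in, unidata):
    """Return one argv list per .nrm file; nothing is executed here."""
    commands = []
    for output, inputs in NORM2_TARGETS:
        argv = ["bin/gennorm2",
                "-o", f"{src_data_in}/{output}",
                "-s", f"{unidata}/norm2"]
        commands.append(argv + inputs)
    return commands


for argv in gennorm2_commands("source/data/in", "source/data/unidata"):
    print(" ".join(argv))
```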
2135
2136 * build ICU (make install)
2137 so that the tools build can pick up the new definitions from the installed header files.
2138
2139 ~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
2140
2141 * build Unicode tools using CMake+make
2142
2143 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
2144
# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst)
# Location of the ICU source tree.
set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src)
2149
2150 ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
2151 ~/svn.icutools/trunk/dbg/unicode/c$ make
2152
2153 * generate core properties data files
2154 - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src
2155 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src
2156 - rebuild ICU (make install) & tools
2157 - run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
2158 - rebuild ICU (make install) & tools
2159
2160 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2161 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2162 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2163 - Unicode 6.0..6.3: U+2260, U+226E, U+226F
2164 - nothing new in 6.3, no test file to update
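The grep step above can be narrowed to non-ASCII code points with a few lines of Python. This is a sketch under assumptions: it expects semicolon-separated fields with the code point or range first and the status second, plus optional "#" comments. The sample lines are illustrative, not copied from IdnaMappingTable.txt, and the function name non_ascii_std3_valid is made up.

```python
def non_ascii_std3_valid(lines):
    """Return (first, last) ranges with status disallowed_STD3_valid
    whose last code point is above U+007F."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        if last > 0x7F:
            found.append((first, last))
    return found


# Illustrative lines in the assumed format.
sample = [
    "0000..002C    ; disallowed_STD3_valid   # ASCII range, ignored",
    "00A0          ; disallowed_STD3_mapped  ; 0020 # different status",
    "2260          ; disallowed_STD3_valid   # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid   # NOT LESS-THAN..NOT GREATER-THAN",
]
for first, last in non_ascii_std3_valid(sample):
    print(f"U+{first:04X}..U+{last:04X}")
```

Any range reported here that is not already covered by uts46test.cpp and UTS46Test.java is a candidate for a new test case.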
2165
2166 * update Java data files
2167 - refresh just the UCD-related files, just to be safe
2168 - see (ICU4C)/source/data/icu4j-readme.txt
2169 - mkdir /tmp/icu4j
2170 - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2171 output:
2172 ...
2173 Unicode .icu files built to ./out/build/icudt52l
2174 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b
2175 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b
2176 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
2177 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b
2178 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b"
2179 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/
2180 mkdir -p /tmp/icu4j/main/shared/data
2181 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2182 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/
2183 mkdir -p /tmp/icu4j/main/shared/data
2184 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2185 make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data'
2186 - copy the big-endian Unicode data files to another location,
2187 separate from the other data files
2188 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
2189 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
2190 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
2191 ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu
2192 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
2193 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
2194 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
2195 - refresh ICU4J
2196 ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
2197
2198 * refresh Java test .txt files
2199 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2200
2201 * UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files
2202
2203 - get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
2204 - CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
2205 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2206 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2207 (note removing the underscore before "Rules")
2208 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
2209 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2210 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
2211 - check test file diffs for previously commented-out, known-failing data lines;
2212 probably need to keep those commented out
2213 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
2214 - run genuca, see command line above
2215 - rebuild ICU4C
2216 - refresh ICU4J collation data:
2217 (subset of instructions above for properties data refresh, except copies all coll/*)
2218 ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2219 ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
2220 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
2221 ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
2222 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
2223 - note on intltest: if collate/UCAConformanceTest fails, then
2224 utility/MultithreadTest/TestCollators will fail as well;
2225 fix the conformance test before looking into the multi-thread test
2226
2227 * test ICU, fix test code where necessary
2228
2229 * When refreshing all of ICU4J data from ICU4C
2230 - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2231 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
2232 or
2233 - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
2234
2235 *** LayoutEngine script information
2236 - skipped for Unicode 6.3: no new scripts
2237
2238 *** merge the Unicode update branches back onto the trunk
2239 - do not merge the icudata.jar and testdata.jar,
2240 instead rebuild them from merged & tested ICU4C
2241
2242 ---------------------------------------------------------------------------- ***
2243
2244 Unicode 6.2 update
2245
2246 http://www.unicode.org/review/pri230/
2247 http://www.unicode.org/versions/beta-6.2.0.html
2248 http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0
2249 http://www.unicode.org/review/pri227/ Changes to Script Extensions Property Values
2250 http://www.unicode.org/review/pri228/ Changing some common characters from Punctuation to Symbol
2251 http://www.unicode.org/review/pri229/ Linebreaking Changes for Pictographic Symbols
2252 http://www.unicode.org/reports/tr46/tr46-8.html IDNA
2253 http://unicode.org/Public/idna/6.2.0/
2254
2255 *** ICU Trac
2256
2257 - ticket 9515: Unicode 6.2: final ICU update
2258
2259 - ticket 9514: UCA 6.2: fix UCARules.txt
2260
2261 - ticket 9437: update ICU to Unicode 6.2
2262 - C++ branches/markus/uni62 at r32050 from trunk at r32041
2263 - Java branches/markus/uni62 at r32068 from trunk at r32066
2264
2265 *** Unicode version numbers
2266 - makedata.mak
2267 - uchar.h
(configure.in & configure have been modified to extract the version from uchar.h)
2269 - com.ibm.icu.util.VersionInfo
2270 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
2271
2272 *** data files & enums & parser code
2273
2274 * file preparation
2275
2276 - download UCD, UCA & IDNA files
2277 - make sure that the Unicode data folder passed into preparseucd.py
2278 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
2279 - modify preparseucd.py: NamesList.txt is now in UTF-8
2280 - ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src
2281 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2282 - Check test file diffs for previously commented-out, known-failing data lines;
2283 probably need to keep those commented out.
2284
2285 * PropertyValueAliases.txt changes
2286 - 1 new Line_Break (lb) value:
2287 lb ; RI ; Regional_Indicator
2288 -> uchar.h & UCharacter.LineBreak
2289 - 1 new Word_Break (WB) value:
2290 WB ; RI ; Regional_Indicator
2291 -> uchar.h & UCharacter.WordBreak
2292 - 1 new Grapheme_Cluster_Break (GCB) value:
2293 GCB; RI ; Regional_Indicator
2294 -> uchar.h & UCharacter.GraphemeClusterBreak
2295
2296 * 3 new numeric values
The new value -1 was really meant to be NaN, but NaN would have required
new UnicodeData.txt syntax. The existing encoding can already represent -1 as a "fraction" of -1/1,
but encodeNumericValue() in corepropsbuilder.cpp had to be fixed.
2300 cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1
2301 cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1
2302 The two new values 216000 and 432000 require an addition to the encoding of numeric values.
2303 cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000
2304 cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000
2305 -> uprops.h, uchar.c & UCharacterProperty.java
2306 -> cucdtst.c & UCharacterTest.java
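The arithmetic behind these three values is easy to check. The Python sketch below is only arithmetic: it shows -1 written as the fraction -1/1 and shows that both large values are small multiples of a power of 60. It does not claim to reproduce the encoding that was actually added to uprops.h, and the helper name split_base60 is invented for this example.

```python
from fractions import Fraction


def split_base60(value):
    """Return (mantissa, exponent) with value == mantissa * 60**exponent
    and the exponent as large as possible."""
    if value <= 0:
        raise ValueError("expected a positive integer")
    exponent = 0
    while value % 60 == 0:
        value //= 60
        exponent += 1
    return value, exponent


# -1 expressed as the "fraction" -1/1 mentioned above.
minus_one = Fraction(-1, 1)
print(minus_one.numerator, minus_one.denominator)   # prints: -1 1

for nv in (216000, 432000):
    print(nv, split_base60(nv))
# prints: 216000 (1, 3)
# prints: 432000 (2, 3)
```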
2307
2308 * generate normalization data files
2309 - ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib
2310 - ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in
2311 - ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata
2312 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
2313 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
2314 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
2315 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
2316
2317 * build ICU (make install)
2318 so that the tools build can pick up the new definitions from the installed header files.
2319 * build Unicode tools using CMake+make
2320
2321 * generate core properties data files
2322 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src
2323 - in initial bootstrapping, change the UCA version
2324 in source/data/unidata/FractionalUCA.txt to match the new Unicode version
2325 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src
2326 - rebuild ICU (make install) & tools
+ if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
2328 check if the UCA version in FractionalUCA.txt matches the new Unicode version
2329 (see step above)
2330 - run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
2331 - rebuild ICU (make install) & tools
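The version check mentioned in the genrb note above can be done before running genuca. A minimal Python sketch, assuming the version appears in FractionalUCA.txt on a line of the form "[UCA version = 6.2.0]". That header syntax is an assumption and should be checked against the real file; the function names are made up.

```python
import re

# Assumed header syntax; verify against the real FractionalUCA.txt.
UCA_VERSION = re.compile(r"\[UCA version\s*=\s*([0-9]+(?:\.[0-9]+)*)\]")


def uca_version(lines):
    """Return the first UCA version string found, or None."""
    for line in lines:
        match = UCA_VERSION.search(line)
        if match:
            return match.group(1)
    return None


def versions_match(lines, unicode_version):
    """True if the file's UCA version equals the given Unicode version string."""
    return uca_version(lines) == unicode_version


# Illustrative file header.
header = ["# Fractional UCA Table", "[UCA version = 6.1.0]"]
print(uca_version(header), versions_match(header, "6.2.0"))
```

The comparison is a plain string comparison, so "6.2" and "6.2.0" count as different; pass the version in the same form the file uses.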
2332
2333 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2334 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2335 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2336 - Unicode 6.0..6.2: U+2260, U+226E, U+226F
2337 - nothing new in 6.2, no test file to update
2338
2339 * update Java data files
2340 - refresh just the UCD-related files, just to be safe
2341 - see (ICU4C)/source/data/icu4j-readme.txt
2342 - mkdir /tmp/icu4j
2343 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2344 output:
2345 ...
2346 Unicode .icu files built to ./out/build/icudt50l
2347 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b
2348 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b
2349 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
2350 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b
2351 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b"
2352 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/
2353 mkdir -p /tmp/icu4j/main/shared/data
2354 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2355 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/
2356 mkdir -p /tmp/icu4j/main/shared/data
2357 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2358 make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data'
2359 - copy the big-endian Unicode data files to another location,
2360 separate from the other data files
2361 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
2362 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
2363 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
2364 ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu
2365 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
2366 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
2367 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
2368 - refresh ICU4J
2369 ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
2370
2371 * refresh Java test .txt files
2372 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2373
2374 * UCA
2375
2376 - get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
2377 - CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
2378 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2379 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2380 (note removing the underscore before "Rules")
2381 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
2382 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2383 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
2384 - check test file diffs for previously commented-out, known-failing data lines;
2385 probably need to keep those commented out
2386 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
2387 - run genuca, see command line above
2388 - rebuild ICU4C
2389 - refresh ICU4J collation data:
2390 (subset of instructions above for properties data refresh, except copies all coll/*)
2391 ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2392 ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
2393 ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
2394 ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
2395 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
2396 - note on intltest: if collate/UCAConformanceTest fails, then
2397 utility/MultithreadTest/TestCollators will fail as well;
2398 fix the conformance test before looking into the multi-thread test
2399
2400 * test ICU, fix test code where necessary
2401
2402 * When refreshing all of ICU4J data from ICU4C
2403 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2404 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
2405 or
2406 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
2407
2408 *** LayoutEngine script information
2409 - skipped for Unicode 6.2: no new scripts
2410
2411 *** merge the Unicode update branches back onto the trunk
2412 - do not merge the icudata.jar and testdata.jar,
2413 instead rebuild them from merged & tested ICU4C
2414
2415 ---------------------------------------------------------------------------- ***
2416
2417 Future Unicode update
2418
The tools have been simplified since the Unicode 6.1 update. See
2420 - http://site.icu-project.org/design/props/ppucd
2421 - http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972
2422
2423 * Unicode version numbers
2424 - icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates
2425
2426 * file preparation
2427 - ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:
2428 - ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src
2429 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2430 - Check test file diffs for previously commented-out, known-failing data lines;
2431 probably need to keep those commented out.
2432
2433 * PropertyValueAliases.txt changes
2434 - Script codes that are in ISO 15924 but not in Unicode are now listed in
2435 preparseucd.py, in the _scripts_only_in_iso15924 variable.
2436 If there are new ISO codes, then add them.
2437 If Unicode adds some of them, then remove them from the .py variable.
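The bookkeeping rule above amounts to a set union followed by a set difference. A Python sketch of that rule; the real _scripts_only_in_iso15924 variable in preparseucd.py may be structured differently, the function name update_iso_only is made up, and the "newly encoded" example is hypothetical.

```python
def update_iso_only(iso_only, new_iso_codes, newly_encoded):
    """Return the updated set of script codes that are in ISO 15924
    but not (yet) in Unicode."""
    return (set(iso_only) | set(new_iso_codes)) - set(newly_encoded)


# Codes taken from this log; which ones Unicode encodes later is hypothetical here.
iso_only = {"Khoj", "Tirh", "Hluw"}
iso_only = update_iso_only(iso_only, {"Aghb", "Mahj"}, set())
print(sorted(iso_only))
# prints: ['Aghb', 'Hluw', 'Khoj', 'Mahj', 'Tirh']

# Hypothetical later update in which one of the codes becomes encoded.
print(sorted(update_iso_only(iso_only, set(), {"Khoj"})))
# prints: ['Aghb', 'Hluw', 'Mahj', 'Tirh']
```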
2438
2439 * UnicodeData.txt changes
2440 - No more manual changes for CJK ranges for algorithmic names;
2441 those are now written to ppucd.txt and genprops reads them from there.
2442
2443 * generate core properties data files (makeprops.sh was deleted)
2444 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src
2445
2446 * no more manual updates of source/data/unidata/norm2/nfkc_cf.txt
2447 - it is now generated by preparseucd.py
2448
2449 * no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt
2450 - it is now generated by preparseucd.py
2451 - make sure that the Unicode data folder passed into preparseucd.py
2452 includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
2453 (can be in some subfolder)
2454
2455 * generate normalization data files
2456 - ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib
2457 - ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in
2458 - ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata
2459 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
2460 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
2461 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
2462 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
2463
2464 * build ICU (make install)
2465 * build Unicode tools using CMake+make
2466
2467 * new way to call genuca (makeuca.sh was deleted)
2468 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src
2469
2470 ---------------------------------------------------------------------------- ***
2471
2472 Unicode 6.1 update
2473
2474 *** ICU Trac
2475
2476 - ticket 8995 final update to Unicode 6.1
2477 - ticket 8994 regenerate source/layout/CanonData.cpp
2478
2479 - ticket 8961 support Unicode "Age" value *names*
2480 - ticket 8963 support multiple character name aliases & types
2481
2482 - ticket 8827 "update ICU to Unicode 6.1"
2483 - C++ branches/markus/uni61 at r30864 from trunk at r30843
2484 - Java branches/markus/uni61 at r30865 from trunk at r30863
2485
2486 *** Unicode version numbers
2487 - makedata.mak
2488 - uchar.h
(configure.in & configure have been modified to extract the version from uchar.h)
2490 - com.ibm.icu.util.VersionInfo
2491 - icutools/unicode/makedefs.sh
2492 + also review & update other definitions in that file,
2493 e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l
2494
2495 *** data files & enums & parser code
2496
2497 * file preparation
2498
2499 ~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed
2500 - This prepares both unidata and testdata files in respective output subfolders.
2501 - Check test file diffs for previously commented-out, known-failing data lines;
2502 probably need to keep those commented out.
2503
2504 * PropertyValueAliases.txt changes
2505 - 11 new block names:
2506 Arabic_Extended_A
2507 Arabic_Mathematical_Alphabetic_Symbols
2508 Chakma
2509 Meetei_Mayek_Extensions
2510 Meroitic_Cursive
2511 Meroitic_Hieroglyphs
2512 Miao
2513 Sharada
2514 Sora_Sompeng
2515 Sundanese_Supplement
2516 Takri
2517 -> add to uchar.h
2518 -> add to UCharacter.UnicodeBlock IDs
2519 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
2520 replace public static final int \1_ID = \2; \3
2521 -> add to UCharacter.UnicodeBlock objects
2522 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
2523 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
2524 - 1 new Joining_Group (jg) value:
2525 Rohingya_Yeh
2526 -> uchar.h & UCharacter.JoiningGroup
2527 - 2 new Line_Break (lb) values:
2528 CJ=Conditional_Japanese_Starter
2529 HL=Hebrew_Letter
2530 -> uchar.h & UCharacter.LineBreak
2531 - 7 new scripts:
2532 sc ; Cakm ; Chakma
2533 sc ; Merc ; Meroitic_Cursive
2534 sc ; Mero ; Meroitic_Hieroglyphs
2535 sc ; Plrd ; Miao
2536 sc ; Shrd ; Sharada
2537 sc ; Sora ; Sora_Sompeng
2538 sc ; Takr ; Takri
2539 -> remove these from SyntheticPropertyValueAliases.txt
2540 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
2541 and in com.ibm.icu.dev.test.lang.TestUScript.java
2542 - 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
2543 (added 2011-06-21)
2544 Khoj 322 Khojki
2545 Tirh 326 Tirhuta
2546 and another one added 2011-12-09
2547 Hluw 080 Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)
2548 -> uscript.h
2549 -> com.ibm.icu.lang.UScript
2550 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2551 replace public static final int \1 = \2;\3
2552 -> SyntheticPropertyValueAliases.txt
2553 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
2554 and in com.ibm.icu.dev.test.lang.TestUScript.java
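The two Eclipse find/replace pairs listed above for UCharacter.UnicodeBlock produce two Java lines from each uchar.h block constant: one for the ID and one for the object. A Python sketch of both passes; the sample line and its numeric value are placeholders, not copied from uchar.h, and the function names are invented.

```python
import re

# Patterns and replacements as listed in the log for UBLOCK_ constants.
BLOCK_ID = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
BLOCK_OBJECT = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")


def to_block_id(line):
    """First pass: C enum constant -> Java int ID constant."""
    return BLOCK_ID.sub(r"public static final int \1_ID = \2; \3", line)


def to_block_object(line):
    """Second pass: C enum constant -> Java UnicodeBlock object."""
    return BLOCK_OBJECT.sub(
        r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
        line)


# Illustrative input; the value 212 is a placeholder.
sample = "UBLOCK_CHAKMA = 212, /*[11100]*/"
print(to_block_id(sample))
print(to_block_object(sample))
```

Both passes start from the same C source line, so the uchar.h block needs to be pasted twice, once per target location in UCharacter.java.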
2555
2556 * UnicodeData.txt changes
2557 - the last Unihan code point changes from U+9FCB to U+9FCC
2558 search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive)
2559 + do change gennames.c
2560 + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java
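The case-insensitive search for 9FC[BC] can be scripted so that no occurrence is missed. A Python sketch that scans lines of text; the sample lines are invented and are not quotes from gennames.c, ucol.cpp or ImplicitCEGenerator.java.

```python
import re

# Regex from the step above: matches both the old end and the new limit.
UNIHAN_END = re.compile(r"9FC[BC]", re.IGNORECASE)


def find_unihan_end(lines):
    """Return (line_number, text) pairs that mention 9FCB or 9FCC."""
    return [(number, text)
            for number, text in enumerate(lines, 1)
            if UNIHAN_END.search(text)]


# Invented sample lines.
sample = [
    "if (c <= 0x9fcb) {",
    "#define CJK_LIMIT 0x9FCC",
    "start = 0x4E00;",
]
for number, text in find_unihan_end(sample):
    print(number, text)
```

Each hit still needs a manual look, since an "end" value moves from 9FCB to 9FCC while a "limit" value moves from 9FCC to 9FCD.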
2561
2562 * DerivedBidiClass.txt changes
2563 - 2 new default-AL blocks:
2564 # Arabic Extended-A: U+08A0 - U+08FF (was default-R)
2565 # Arabic Mathematical Alphabetic Symbols:
2566 # U+1EE00 - U+1EEFF (was default-R)
2567 - 2 new default-R blocks:
2568 # Meroitic Hieroglyphs:
2569 # U+10980 - U+1099F
2570 # Meroitic Cursive: U+109A0 - U+109FF
2571 -> should be picked up by the explicit data in the file
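
For a quick sanity check of the new defaults, the four ranges listed above can be put into a small lookup table. This is a sketch in Python that covers only the ranges named in this section; it is not how ICU stores Bidi_Class data, and the function name is made up for illustration.

```python
# Default Bidi_Class ranges named in this section, inclusive.
DEFAULT_RANGES = [
    (0x08A0, 0x08FF, "AL"),    # Arabic Extended-A
    (0x1EE00, 0x1EEFF, "AL"),  # Arabic Mathematical Alphabetic Symbols
    (0x10980, 0x1099F, "R"),   # Meroitic Hieroglyphs
    (0x109A0, 0x109FF, "R"),   # Meroitic Cursive
]


def default_bidi_class(code_point):
    """Return the default class for the ranges above, or None if not listed."""
    for start, end, bidi_class in DEFAULT_RANGES:
        if start <= code_point <= end:
            return bidi_class
    return None


print(default_bidi_class(0x08A0), default_bidi_class(0x109A0))
```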
2572
2573 * NameAliases.txt changes
2574 - from
2575 # Each line has two fields
2576 # First field: Code point
2577 # Second field: Alias
2578 - to
2579 # Each line has three fields, as described here:
2580 #
2581 # First field: Code point
2582 # Second field: Alias
2583 # Third field: Type
2584 - Also, the file format previously allowed multiple aliases per code point, but only now
2585 does the file actually provide them, including several of the same type. For example,
2586 FEFF;BYTE ORDER MARK;alternate
2587 FEFF;BOM;abbreviation
2588 FEFF;ZWNBSP;abbreviation
2589 - This breaks our gennames parser, unames.icu data structure, and API.
2590 Fix gennames to only pick up "correction" aliases.
2591 New ticket #8963 for further changes.
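
The three-field format and the "correction only" filter can be sketched as follows. This is an illustration in Python, not the gennames parser (which is written in C). The function names are made up, and the 01A2 sample line is added here only so that the filter has something to keep; the FEFF lines are the ones quoted above.

```python
def parse_name_aliases(lines):
    """Yield (code_point, alias, alias_type) from three-field lines."""
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, alias, alias_type = (field.strip() for field in line.split(";"))
        yield int(code, 16), alias, alias_type


def corrections_only(lines):
    """Keep only the aliases whose type is "correction"."""
    return [(cp, alias) for cp, alias, kind in parse_name_aliases(lines)
            if kind == "correction"]


sample = [
    "# Each line has three fields",
    "01A2;LATIN CAPITAL LETTER GHA;correction",
    "FEFF;BYTE ORDER MARK;alternate",
    "FEFF;BOM;abbreviation",
    "FEFF;ZWNBSP;abbreviation",
]
for cp, alias in corrections_only(sample):
    print("U+%04X %s" % (cp, alias))
```

Filtering by type keeps at most the kind of alias that the old two-field data carried, which is why it works as a stopgap until the data structure and API are extended.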
2592
2593 * run genpname/preparse.pl (on Linux)
2594 + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
2595 + make sure that data.h is writable
2596 + perl preparse.pl ~/svn.icu/trunk/src > out.txt
2597 + preparse.pl shows no errors, out.txt Info and Warning lines look ok
2598
2599 * build ICU (make install)
2600 so that the tools build can pick up the new definitions from the installed header files.
2601 * build Unicode tools (at least genpname) using CMake+make
2602
2603 * run genpname
2604 (builds both pnames.icu and propname_data.h)
2605 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
2606 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
2607
2608 * build ICU (make install)
2609 * build Unicode tools using CMake+make
2610
2611 * update source/data/unidata/norm2/nfkc_cf.txt
2612 - follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
2613
2614 * update source/data/unidata/norm2/uts46.txt
2615 - download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
2616 to ~/svn.icu/tools/trunk/src/unicode/py
2617 - adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".
2618 - ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
2619 - ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
2620
2621 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2622 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2623 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2624 - Unicode 6.0..6.1: U+2260, U+226E, U+226F
2625 - nothing new in 6.1, no test file to update
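
The grep step above can be made more precise by parsing the status field and the code point range. This is a minimal sketch in Python; the function name is made up, and the sample lines are written in the style of IdnaMappingTable.txt for illustration rather than copied from the file.

```python
def non_ascii_std3_valid(lines):
    """Return code points above U+007F with status disallowed_STD3_valid."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0]  # strip the trailing comment
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        found.extend(cp for cp in range(start, end + 1) if cp > 0x7F)
    return found


# Hypothetical sample lines in the style of IdnaMappingTable.txt.
sample = [
    "0000..002C    ; disallowed_STD3_valid    # <control-0000>..COMMA",
    "00DF          ; deviation ; 0073 0073    # LATIN SMALL LETTER SHARP S",
    "2260          ; disallowed_STD3_valid    # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid    # NOT LESS-THAN..NOT GREATER-THAN",
]
print(["U+%04X" % cp for cp in non_ascii_std3_valid(sample)])
```

Comparing the result for the old and new table versions shows whether the test files need new entries.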
2626
2627 * generate core properties data files
2628 - in initial bootstrapping, change the UCA version
2629 in source/data/unidata/FractionalUCA.txt to match the new Unicode version
2630 - ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
2631 - rebuild ICU & tools
2632 + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
2633 check if the UCA version in FractionalUCA.txt matches the new Unicode version
2634 (see step above)
2635 - run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:
2636 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
2637 - rebuild ICU & tools
2638
2639 * update Java data files
2640 - refresh just the UCD-related files, just to be safe
2641 - see (ICU4C)/source/data/icu4j-readme.txt
2642 - mkdir /tmp/icu4j
2643 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2644 output:
2645 ...
2646 Unicode .icu files built to ./out/build/icudt49l
2647 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b
2648 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b
2649 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
2650 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b
2651 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b"
2652 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/
2653 mkdir -p /tmp/icu4j/main/shared/data
2654 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2655 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/
2656 mkdir -p /tmp/icu4j/main/shared/data
2657 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2658 make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data'
2659 - copy the big-endian Unicode data files to another location,
2660 separate from the other data files
2661 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
2662 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
2663 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
2664 ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu
2665 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
2666 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
2667 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
2668 - refresh ICU4J
2669 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
2670
2671 * refresh Java test .txt files
2672 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2673
2674 * test ICU so far, fix test code where necessary
2675 - temporarily ignore collation issues that look like UCA/UCD mismatches,
2676 until UCA data is updated
2677
2678 * UCA
2679
2680 - get output from Mark's tools; look in
2681 http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. version>.txt
2682 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2683 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2684 (note removing the underscore before "Rules")
2685 - update (ICU)/source/test/testdata/CollationTest_*.txt
2686 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2687 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
2688 - check test file diffs for previously commented-out, known-failing data lines;
2689 probably need to keep those commented out
2690 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
2691 - run makeuca.sh:
2692 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
2693 - rebuild ICU4C
2694 - refresh ICU4J collation data:
2695 (subset of instructions above for properties data refresh, except copies all coll/*)
2696 ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2697 ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
2698 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
2699 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
2700 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
2701 - note on intltest: if collate/UCAConformanceTest fails, then
2702 utility/MultithreadTest/TestCollators will fail as well;
2703 fix the conformance test before looking into the multi-thread test
2704
2705 * When refreshing all of ICU4J data from ICU4C
2706 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2707 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
2708 or
2709 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
2710
2711 *** LayoutEngine script information
2712
2713 (For details see the Unicode 5.2 change log below.)
2714
2715 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
2716 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
2717 in the working directory.
2718 (It also generates ScriptRunData.cpp, which is no longer needed.)
2719
2720 The generated files have a current copyright date and "@draft" statement.
2721
2722 - diff current <icu>/source/layout files vs. generated ones
2723 ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
2724 review and manually merge desired changes;
2725 fix gratuitous changes, incorrect @draft and missing aliases;
2726 Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
2727 - if you just copy the above files, then
2728 fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
2729 manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
2730
2731 *** merge the Unicode update branches back onto the trunk
2732 - do not merge the icudata.jar and testdata.jar,
2733 instead rebuild them from merged & tested ICU4C
2734
2735 ---------------------------------------------------------------------------- ***
2736
2737 ICU 4.8 (no Unicode update, just new script codes)
2738
2739 * 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
2740 (added 2010-12-21)
2741 Afak 439 Afaka
2742 Jurc 510 Jurchen
2743 Mroo 199 Mro, Mru
2744 Nshu 499 Nüshu
2745 Shrd 319 Sharada, Śāradā
2746 Sora 398 Sora Sompeng
2747 Takr 321 Takri, Ṭākrī, Ṭāṅkrī
2748 Tang 520 Tangut
2749 Wole 480 Woleai
2750 -> uscript.h
2751 -> com.ibm.icu.lang.UScript
2752 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2753 replace public static final int \1 = \2;\3
2754 -> genpname/SyntheticPropertyValueAliases.txt
2755 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
2756 and in com.ibm.icu.dev.test.lang.TestUScript.java
2757
2758 * run genpname/preparse.pl (on Linux)
2759 + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
2760 + make sure that data.h is writable
2761 + perl preparse.pl ~/svn.icu/trunk/src > out.txt
2762 + preparse.pl shows no errors, out.txt Info and Warning lines look ok
2763
2764 * rebuild Unicode tools (at least genpname) using make
2765 - You might first need to "make install" ICU so that the tools build can pick
2766 up the new definitions from the installed header files.
2767
2768 * run genpname
2769 (builds both pnames.icu and propname_data.h)
2770 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
2771 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
2772 - rebuild ICU & tools
2773
2774 * run genprops
2775 - ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
2776 - ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
2777 - rebuild ICU & tools
2778
2779 * update Java data files
2780 - refresh just the UCD-related files, just to be safe
2781 - see (ICU4C)/source/data/icu4j-readme.txt
2782 - mkdir /tmp/icu4j
2783 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2784 - copy the big-endian Unicode data files to another location,
2785 separate from the other data files
2786 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
2787 ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
2788 ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
2789 - refresh ICU4J
2790 ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b
2791
2792 * should have updated the layout engine script codes but forgot
2793
2794 ---------------------------------------------------------------------------- ***
2795
2796 Unicode 6.0 update
2797
2798 *** related ICU Trac tickets
2799
2800 7264 Unicode 6.0 Update
2801
2802 *** Unicode version numbers
2803 - makedata.mak
2804 - uchar.h
2805 (configure.in & configure: have been modified to extract the version from uchar.h)
2806 - com.ibm.icu.util.VersionInfo
2807
2808 *** data files & enums & parser code
2809
2810 * file preparation
2811
2812 ~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
2813 - This now prepares both unidata and testdata files in respective output subfolders.
2814
2815 * PropertyAliases.txt changes
2816 - new Script_Extensions property defined in the new ScriptExtensions.txt file
2817 but not listed in PropertyAliases.txt; reported to unicode.org;
2818 -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
2819 scx; Script_Extensions
2820 -> uchar.h with new UProperty section
2821 -> com.ibm.icu.lang.UProperty, parallel with uchar.h
2822
2823 * PropertyValueAliases.txt changes
2824 - 12 new block names:
2825 Alchemical_Symbols
2826 Bamum_Supplement
2827 Batak
2828 Brahmi
2829 CJK_Unified_Ideographs_Extension_D
2830 Emoticons
2831 Ethiopic_Extended_A
2832 Kana_Supplement
2833 Mandaic
2834 Miscellaneous_Symbols_And_Pictographs
2835 Playing_Cards
2836 Transport_And_Map_Symbols
2837 -> add to uchar.h
2838 -> add to UCharacter.UnicodeBlock
2839 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
2840 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
2841 - Joining_Group (jg) values:
2842 Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias
2843 -> uchar.h & UCharacter.JoiningGroup
2844 - 3 new scripts:
2845 sc ; Batk ; Batak
2846 sc ; Brah ; Brahmi
2847 sc ; Mand ; Mandaic
2848 -> remove these from SyntheticPropertyValueAliases.txt
2849 -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
2850 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
2851 and in com.ibm.icu.dev.test.lang.TestUScript.java
2852 - 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
2853 (added 2009-11-11..2010-07-18)
2854 Bass 259 Bassa Vah
2855 Dupl 755 Duployan shorthand
2856 Elba 226 Elbasan
2857 Gran 343 Grantha
2858 Kpel 436 Kpelle
2859 Loma 437 Loma
2860 Mend 438 Mende
2861 Merc 101 Meroitic Cursive
2862 Narb 106 Old North Arabian
2863 Nbat 159 Nabataean
2864 Palm 126 Palmyrene
2865 Sind 318 Sindhi
2866 Wara 262 Warang Citi
2867 -> uscript.h
2868 -> com.ibm.icu.lang.UScript
2869 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2870 replace public static final int \1 = \2;\3
2871 -> SyntheticPropertyValueAliases.txt
2872 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
2873 and in com.ibm.icu.dev.test.lang.TestUScript.java
2874 - ISO 15924 name change
2875 Mero 100 Meroitic Hieroglyphs (was Meroitic)
2876 -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
2877 - property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt
2878
2879 * UnicodeData.txt changes
2880 - new CJK block:
2881 2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
2882 2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
2883 -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion
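
The First/Last line pairs in UnicodeData.txt mark algorithmic ranges such as the new Extension D block. The following Python sketch shows how such pairs can be collected into ranges; it is an illustration only, not the gennames.c logic, and the function name is made up.

```python
def algorithmic_ranges(lines):
    """Pair "<Name, First>" and "<Name, Last>" lines into (name, start, end)."""
    ranges = []
    pending = None  # (name, start) of a First line waiting for its Last line
    for line in lines:
        fields = line.split(";")
        code_point, name = int(fields[0], 16), fields[1]
        if name.endswith(", First>"):
            pending = (name[1:-len(", First>")], code_point)
        elif name.endswith(", Last>") and pending is not None:
            ranges.append((pending[0], pending[1], code_point))
            pending = None
    return ranges


# The two lines quoted above.
sample = [
    "2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;",
    "2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;",
]
for name, start, end in algorithmic_ranges(sample):
    print("%s: U+%04X..U+%04X" % (name, start, end))
```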
2884
2885 * build Unicode tools using CMake+make
2886
2887 * run genpname/preparse.pl (on Linux)
2888 + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
2889 + make sure that data.h is writable
2890 + perl preparse.pl ~/svn.icu/trunk/src > out.txt
2891 + preparse.pl shows no errors, out.txt Info and Warning lines look ok
2892
2893 * rebuild Unicode tools (at least genpname) using make
2894 - You might first need to "make install" ICU so that the tools build can pick
2895 up the new definitions from the installed header files.
2896
2897 * run genpname
2898 - ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
2899 - rebuild ICU & tools
2900
2901 * update source/data/unidata/norm2/nfkc_cf.txt
2902 - follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
2903
2904 * update source/data/unidata/norm2/uts46.txt
2905 - download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
2906 to ~/svn.icu/tools/trunk/src/unicode/py
2907 - adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
2908 - ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
2909 - ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
2910
2911 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2912 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2913 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2914 - Unicode 6.0: U+2260, U+226E, U+226F
2915
2916 * generate core properties data files
2917 - ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
2918 - rebuild ICU & tools
2919 - run makeuca.sh so that genuca picks up the new nfc.nrm:
2920 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
2921 - rebuild ICU & tools
2922
2923 * implement new Script_Extensions property (provisional)
2924 - parser & generator: genprops & uprops.icu
2925 - uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
2926 - UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java
2927
2928 * switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
2929 - (one-time change)
2930 - genbidi/gencase/genprops tools changes
2931 - re-run makeprops.sh (see above)
2932 - UCharacterProperty.java, UCharacterTypeIterator.java,
2933 UBiDiProps.java, UCaseProps.java, and several others with minor changes;
2934 UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java
2935
2936 * update Java data files
2937 - refresh just the UCD-related files, just to be safe
2938 - see (ICU4C)/source/data/icu4j-readme.txt
2939 - mkdir /tmp/icu4j
2940 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2941 output:
2942 ...
2943 Unicode .icu files built to ./out/build/icudt45l
2944 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
2945 echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
2946 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
2947 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
2948 mkdir -p /tmp/icu4j/main/shared/data
2949 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2950 - copy the big-endian Unicode data files to another location,
2951 separate from the other data files
2952 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
2953 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
2954 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
2955 ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
2956 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
2957 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
2958 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
2959 - refresh ICU4J
2960 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
2961
2962 * refresh Java test .txt files
2963 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2964
2965 * un-hardcode normalization skippable (NF*_Inert) test data
2966 - removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools
2967
2968 * copy updated break iterator test files
2969 - now handled by early ucdcopy.py and
2970 copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
2971 (old instructions:
2972 copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
2973 to ~/svn.icu/trunk/src/source/test/testdata)
2974 - they are not used in ICU4J
2975
2976 * UCA
2977
2978 - get output from Mark's tools; look in
2979 http://www.unicode.org/~book/incoming/mark/uca6.0.0/
2980 http://www.macchiato.com/unicode/utc/additional-uca-files
2981 http://www.unicode.org/Public/UCA/6.0.0/
2982 http://www.unicode.org/~mdavis/uca/
2983 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2984 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2985 - update Han-implicit ranges for new CJK extensions:
2986 swapCJK() in ucol.cpp & ImplicitCEGenerator.java
2987 - genuca: allow bytes 02 for U+FFFE, new merge-sort character;
2988 do not add it into invuca so that tailoring primary-after an ignorable works
2989 - genuca: permit space between [variable top] bytes
2990 - ucol.cpp: treat noncharacters like unassigned rather than ignorable
2991 - run makeuca.sh:
2992 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
2993 - rebuild ICU4C
2994 - refresh ICU4J collation data:
2995 (subset of instructions above for properties data refresh, except copies all coll/*)
2996 ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2997 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
2998 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
2999 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
3000 - update (ICU)/source/test/testdata/CollationTest_*.txt
3001 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
3002 with output from Mark's Unicode tools
3003 - run all tests with the *_SHORT.txt or the full files (the full ones have comments)
3004 - note on intltest: if collate/UCAConformanceTest fails, then
3005 utility/MultithreadTest/TestCollators will fail as well;
3006 fix the conformance test before looking into the multi-thread test
3007
3008 * When refreshing all of ICU4J data from ICU4C
3009 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3010 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
3011 or
3012 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
3013
3014 *** LayoutEngine script information
3015
3016 (For details see the Unicode 5.2 change log below.)
3017
3018 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
3019 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
3020 ScriptRunData.cpp, which is no longer needed.)
3021
3022 The generated files have a current copyright date and "@draft" statement.
3023
3024 * copy the above files into <icu>/source/layout, replacing the old files.
3025 * fix mixed line endings
3026 * review the diffs and fix incorrect @draft and missing aliases;
3027 Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
3028 * manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
3029
3030 ---------------------------------------------------------------------------- ***
3031
3032 Unicode 5.2 update
3033
3034 *** related ICU Trac tickets
3035
3036 7084 Unicode 5.2
3037
3038 7167 verify collation bytes
3039 7235 Java test NAME_ALIAS
3040 7236 Java DerivedCoreProperties.txt test
3041 7237 Java BidiTest.txt
3042 7238 UTrie2 in core unidata
3043 7239 test for tailoring gaps
3044 7240 Java fix CollationMiscTest
3045 7243 update layout engine for Unicode 5.2
3046
3047 *** Unicode version numbers
3048 - makedata.mak
3049 - uchar.h
3050 - configure.in & configure
3051 - update ucdVersion in gennames.c if an algorithmic range changes
3052
3053 *** data files & enums & parser code
3054
3055 * file preparation
3056
3057 python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
3058 - includes finding files regardless of version numbers,
3059 copying them, and performing the equivalent processing of the
3060 ucdstrip and ucdmerge tools on the desired set of files
3061
3062 * notes on changes
3063 - PropertyAliases.txt
3064 moved from numeric to enumerated:
3065 ccc ; Canonical_Combining_Class
3066 new string properties:
3067 NFKC_CF ; NFKC_Casefold
3068 Name_Alias; Name_Alias
3069 new binary properties:
3070 Cased ; Cased
3071 CI ; Case_Ignorable
3072 CWCF ; Changes_When_Casefolded
3073 CWCM ; Changes_When_Casemapped
3074 CWKCF ; Changes_When_NFKC_Casefolded
3075 CWL ; Changes_When_Lowercased
3076 CWT ; Changes_When_Titlecased
3077 CWU ; Changes_When_Uppercased
3078 new CJK Unihan properties (not supported by ICU)
3079 - PropertyValueAliases.txt
3080 new block names
3081 new scripts
3082 one script code change:
3083 sc ; Qaai ; Inherited
3084 ->
3085 sc ; Zinh ; Inherited ; Qaai
3086 new Line_Break (lb) value:
3087 lb ; CP ; Close_Parenthesis
3088 new Joining_Group (jg) values: Farsi_Yeh, Nya
3089 other new values:
3090 ccc; 214; ATA ; Attached_Above
3091 - DerivedBidiClass.txt
3092 new default-R range: U+1E800 - U+1EFFF
3093 - UnicodeData.txt
3094 all of the ISO comments are gone
3095 new CJK block end:
3096 9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
3097 new CJK block:
3098 2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
3099 2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;
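
The PropertyValueAliases.txt entries quoted in the notes above have slightly different shapes: most have a short name, a long name and optional further aliases, while ccc lines carry the numeric value first. A minimal Python sketch of a parser for these shapes follows; the function name and the dictionary layout are assumptions for illustration and do not reflect preparse.pl.

```python
def parse_value_alias(line):
    """Split one PropertyValueAliases.txt line into its parts."""
    data = line.split("#", 1)[0]  # strip any trailing comment
    fields = [field.strip() for field in data.split(";")]
    prop, rest = fields[0], fields[1:]
    numeric = None
    if prop == "ccc":  # ccc lines carry the numeric value first
        numeric, rest = int(rest[0]), rest[1:]
    return {"property": prop, "numeric": numeric,
            "short": rest[0], "long": rest[1], "other": rest[2:]}


# Lines quoted in the notes above.
print(parse_value_alias("sc ; Zinh ; Inherited ; Qaai"))
print(parse_value_alias("ccc; 214; ATA ; Attached_Above"))
print(parse_value_alias("lb ; CP ; Close_Parenthesis"))
```

With this layout, the script code change shows up as Qaai moving from the "short" slot to the "other" list while Zinh takes its place.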
3100
3101 * genpname
3102 - run preparse.pl
3103 + cd \svn\icuproj\icu\trunk\source\tools\genpname
3104 + make sure that data.h is writable
3105 + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
3106 + preparse.pl complains with errors like the following:
3107 Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
3108 This is because ICU 4.0 had scripts from ISO 15924 which are now
3109 added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
3110 and PropertyValueAliases.txt.
3111 -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
3112 Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
3113 + preparse.pl complains with errors about block names missing from uchar.h; add them
3114
3115 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
3116 - new block & script values
3117 + 26 new blocks
3118 copy new blocks from Blocks.txt
3119 MS VC++ 2008 regular expression:
3120 find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
3121 replace with " UBLOCK_\3 = 172, /*[\1]*/"
3122 + several new script values already added in ICU 4.0 for ISO 15924 coverage
3123 (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
3124 + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
3125 + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
3126 (added to SyntheticPropertyValueAliases.txt)
3127 - new Joining_Group (jg) values: Farsi_Yeh, Nya
3128 - new Line_Break (lb) value:
3129 lb ; CP ; Close_Parenthesis
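
The MS VC++ find/replace for new blocks translates to other regex dialects by replacing the {...} groups with (...). Below is a sketch in Python. It goes one step beyond the original pattern by also turning the block name into an identifier (uppercase, underscores); that step, the function name, and the sample input are assumptions for illustration. As in the original replacement text, the numeric value is a placeholder that must be renumbered by hand.

```python
import re

# Python form of the VC++ pattern: start..end; Block Name
BLOCK_LINE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+); ([A-Z].+)$")


def block_constant(line, value):
    """Turn one Blocks.txt line into a uchar.h-style enum line, or None."""
    match = BLOCK_LINE.match(line)
    if match is None:
        return None
    start, _end, name = match.groups()
    # Extra step not in the original regex: make the name an identifier.
    identifier = re.sub(r"[^A-Za-z0-9]+", "_", name).upper()
    return "    UBLOCK_%s = %d, /*[%s]*/" % (identifier, value, start)


print(block_constant("A960..A97F; Hangul Jamo Extended-A", 172))
print(block_constant("# comment line", 172))
```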
3130
3131 * hardcoded Unihan range end/limit
3132 - Unihan range end moves from 9FC3 to 9FCB
3133 search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
3134 + do change gennames.c
3135
3136 * Compare definitions of new binary properties with what we used to use
3137 in algorithms, to see if the definitions changed.
3138 - Verified that definitions for Cased and Case_Ignorable are unchanged.
3139 The gencase tool now parses the newly public Case_Ignorable values
3140 in case the definition changes in the future.
3141
3142 * uchar.c & uprops.h & uprops.c & genprops
3143 - new numeric values that didn't exist in Unicode data before:
3144 1/7, 1/9, 1/10, 3/10, 1/16, 3/16
3145 the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
3146 therefore redesign the encoding of numeric types and values for formatVersion 6;
3147 design for simple numbers up to at least 144 ("one gross"),
3148 large values up to at least 10^20,
3149 and fractions with numerators -1..17 and denominators 1..16
3150 to cover current and expected future values
3151 (e.g., more Han numeric values, Meroitic twelfths)
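
To illustrate why a numerator range of -1..17 and a denominator range of 1..16 fit a compact encoding: the denominator needs exactly 4 bits, and the 19 possible numerators go above it, for 304 values in total. The Python sketch below shows one hypothetical packing with these ranges. It is not the actual uprops.icu formatVersion 6 layout, and all names in it are made up.

```python
NUMERATOR_MIN, NUMERATOR_MAX = -1, 17
DENOMINATOR_MIN, DENOMINATOR_MAX = 1, 16


def pack_fraction(numerator, denominator):
    """Pack a fraction into one small integer (hypothetical layout)."""
    if not NUMERATOR_MIN <= numerator <= NUMERATOR_MAX:
        raise ValueError("numerator out of range")
    if not DENOMINATOR_MIN <= denominator <= DENOMINATOR_MAX:
        raise ValueError("denominator out of range")
    # Upper part: numerator shifted to 0..18; low 4 bits: denominator - 1.
    return ((numerator - NUMERATOR_MIN) << 4) | (denominator - DENOMINATOR_MIN)


def unpack_fraction(packed):
    """Inverse of pack_fraction(): return (numerator, denominator)."""
    return (packed >> 4) + NUMERATOR_MIN, (packed & 0xF) + DENOMINATOR_MIN


# The new numeric values listed above all round-trip.
for fraction in [(1, 7), (1, 9), (1, 10), (3, 10), (1, 16), (3, 16)]:
    assert unpack_fraction(pack_fraction(*fraction)) == fraction
print("all new fractions round-trip")
```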
3152
3153 * reimplement Hangul_Syllable_Type for new Jamo characters
3154 - the old code assumed that all Jamo characters are in the 11xx block
3155 - Unicode 5.2 fills holes there and adds new Jamo characters in
3156 A960..A97F; Hangul Jamo Extended-A
3157 and in
3158 D7B0..D7FF; Hangul Jamo Extended-B
3159 - Hangul_Syllable_Type can be trivially derived from a subset of
3160 Grapheme_Cluster_Break values
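
The derivation mentioned above amounts to a five-entry mapping: the Grapheme_Cluster_Break values L, V, T, LV and LVT correspond to the Hangul_Syllable_Type values of the same name, and every other value maps to NA (Not_Applicable). A minimal Python sketch using the short property value aliases follows; the names are made up, and this is not the ICU implementation.

```python
# Grapheme_Cluster_Break values that carry Hangul syllable information.
GCB_TO_HST = {"L": "L", "V": "V", "T": "T", "LV": "LV", "LVT": "LVT"}


def hangul_syllable_type(gcb_value):
    """Derive Hangul_Syllable_Type (short alias) from a GCB short alias."""
    return GCB_TO_HST.get(gcb_value, "NA")


print(hangul_syllable_type("LV"), hangul_syllable_type("EX"))
```

Because the mapping depends only on the property value, new Jamo characters in any block are covered without code point range checks.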
3161
3162 * build Unicode data source code for hardcoding core data
3163 C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data
3164
3165 ICU data make path is \svn\icuproj\icu\trunk\source\data\
3166 ICU root path is \svn\icuproj\icu\trunk
3167 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
3168 Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
3169 Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
3170 Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
3171 Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
3172 Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
3173 Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
3174 Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
3175 Creating data file for Unicode Property Names
3176 Creating data file for Unicode Character Properties
3177 Creating data file for Unicode Case Mapping Properties
3178 Creating data file for Unicode BiDi/Shaping Properties
3179 Creating data file for Unicode Normalization
3180 Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
3181 Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"
3182
3183 - copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
3184 and rebuild the common library
3185
3186 *** UCA
3187
3188 - update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
3189 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
3190 - update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
3191 [ Begin obsolete instructions:
3192 Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files.
3193 - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
3194 on Windows:
3195 python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
3196 python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
3197 End obsolete instructions]
3198 - run all tests with the *_SHORT.txt or the full files (the full ones have comments)
3199 not just the *_STUB.txt files
3200 - note on intltest: if collate/UCAConformanceTest fails, then
3201 utility/MultithreadTest/TestCollators will fail as well;
3202 fix the conformance test before looking into the multi-thread test
3203
3204 *** Implement Cased & Case_Ignorable properties
3205 - via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
3206 - Problem: These properties should be disjoint, but aren't
3207 - UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
3208 - change ucase.icu to be able to store any combination of Cased and Case_Ignorable
3209
3210 *** Implement Changes_When_Xyz properties
3211 - without stored data
3212
3213 *** Implement Name_Alias property
3214 - add it as another name field in unames.icu
3215 - make it available via u_charName() and UCharNameChoice
3216 - consider it in u_charFromName()
3217
3218 *** Break iterators
3219
3220 * Update break iterator rules to new UAX versions and new property values
3221 * Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary
3222
3223 *** new BidiTest file
3224 - review format and data
3225 - copy BidiTest.txt to source/test/testdata
3226 - write test code using this data
3227 - fix ICU code where it fails the conformance test
3228
3229 *** Java
3230 - generally, find and update code corresponding to C/C++
3231 - UCharacter.UnicodeBlock constants:
3232 a) add an _ID integer per new block, update COUNT
3233 b) add a class instance per new block
3234 Visual Studio regex:
3235 find UBLOCK_{[^ ]+} = [0-9]+, {/.+}
3236 replace with public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
3237 - CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()
3238
3239 - port test changes to Java
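
For anyone without Visual Studio, the find/replace step for the UnicodeBlock class instances might be reproduced with a script along these lines. It is a sketch only: the Visual Studio tagged groups {...} are translated to Python groups (...), and the sample input line in the usage note is illustrative, not copied from uchar.h.

```python
import re

# Visual Studio "UBLOCK_{[^ ]+} = [0-9]+, {/.+}" rewritten with Python groups.
PATTERN = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
REPLACEMENT = (
    r'public static final UnicodeBlock \g<1> = '
    r'new UnicodeBlock("\g<1>", \g<1>_ID); \g<2>'
)


def convert(line: str) -> str:
    """Turn one C enum line into the matching Java class-instance line."""
    return PATTERN.sub(REPLACEMENT, line)
```

For example, convert("UBLOCK_EXAMPLE = 1, /*[0000]*/") produces a Java declaration for a block named EXAMPLE with the trailing comment preserved.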
3240
3241 *** LayoutEngine script information
3242
3243 (For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)
3244
3245 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
3246 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
3247 ScriptRunData.cpp, which is no longer needed.)
3248
3249 The generated files have a current copyright date and "@draft" statement.
3250
3251 -> Eric Mader wrote in email on 20090930:
3252 "I think the tool has been modified to update @draft to @stable for
3253 older scripts and to add @draft for new scripts.
3254 (I worked with an intern on this last year.)
3255 You should check the output after you run it."
3256
3257 * copy the above files into <icu>/source/layout, replacing the old files.
3258 * fix mixed line endings
3259 * review the diffs and fix incorrect @draft and missing aliases
3260 * manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
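
The "fix mixed line endings" step above could be done with a small helper like the following. This is a sketch under the assumption that the generated files should end up with one consistent line terminator; the function name and the default of LF are choices made for this example.

```python
def normalize_line_endings(data: bytes, newline: bytes = b"\n") -> bytes:
    """Convert CRLF, lone CR and LF to a single line terminator."""
    # First collapse everything to LF, then expand to the requested terminator.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", newline)
```

Reading and writing the files in binary mode avoids any implicit newline translation by the runtime.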
3261
3262 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
3263 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
3264
3265 -> Eric Mader wrote in email on 20090930:
3266 "This is just a matter of making sure that all the per-script tables have
3267 entries for any new scripts that were added.
3268 If any new Indic characters were added, then the class tables in
3269 IndicClassTables.cpp should be updated to reflect this.
3270 John Emmons should know how to do this if it's required."
3271
3272 * rebuild the layout and layoutex libraries.
3273
3274 *** Documentation
3275 - Update User Guide
3276 + Jamo_Short_Name, sfc->scf, binary property value aliases
3277
3278 ---------------------------------------------------------------------------- ***
3279
3280 Unicode 5.1 update
3281
3282 *** related ICU Trac tickets
3283
3284 5696 Update to Unicode 5.1
3285
3286 *** Unicode version numbers
3287 - makedata.mak
3288 - uchar.h
3289 - configure.in & configure
3290 - update ucdVersion in gennames.c if an algorithmic range changes
3291
3292 *** data files & enums & parser code
3293
3294 * file preparation
3295 - ucdstrip:
3296 DerivedCoreProperties.txt
3297 DerivedNormalizationProps.txt
3298 NormalizationTest.txt
3299 PropList.txt
3300 Scripts.txt
3301 GraphemeBreakProperty.txt
3302 SentenceBreakProperty.txt
3303 WordBreakProperty.txt
3304 - ucdstrip and ucdmerge:
3305 EastAsianWidth.txt
3306 LineBreak.txt
3307
3308 * my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
3309 copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
3310 copy 5.1.0\ucd\Blocks.txt ..\unidata\
3311 copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
3312 copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
3313 copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
3314 copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
3315 copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
3316 copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
3317 copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
3318 copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
3319 copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
3320 copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
3321 copy 5.1.0\ucd\UnicodeData.txt ..\unidata\
3322
3323 ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
3324 ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
3325 ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
3326 ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
3327 ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
3328 ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
3329 ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
3330 ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
3331 ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
3332 ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
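
As a rough illustration of what the ucdstrip and ucdmerge steps accomplish, the following sketch strips '#' comments and then merges adjacent code point ranges that carry identical property fields. It is an approximation written for this log, not the actual tool source; details such as how the real tools treat blank lines or whitespace are assumptions.

```python
def strip_comments(lines):
    """Drop '#' comments and trailing whitespace; skip lines that become empty."""
    for line in lines:
        text = line.split("#", 1)[0].rstrip()
        if text:
            yield text


def merge_ranges(lines):
    """Merge adjacent code point ranges whose remaining fields are identical."""
    merged = []  # entries are [start, end, rest-of-line]
    for line in lines:
        rng, rest = line.split(";", 1)
        rest = rest.strip()
        first, _, last = rng.strip().partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        if merged and merged[-1][2] == rest and merged[-1][1] + 1 == start:
            merged[-1][1] = end  # extend the previous range
        else:
            merged.append([start, end, rest])
    return [
        ("%04X;%s" % (s, r)) if s == e else ("%04X..%04X;%s" % (s, e, r))
        for s, e, r in merged
    ]
```

Chaining the two, merge_ranges(strip_comments(lines)), mirrors the "ucdstrip < ... | ucdmerge > ..." pipelines above.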
3333
3334 * genpname
3335 - run preparse.pl
3336 + cd \svn\icuproj\icu\uni51\source\tools\genpname
3337 + make sure that data.h is writable
3338 + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
3339 + preparse.pl complains with errors like the following:
3340 Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
3341 This is because ICU 3.8 had scripts from ISO 15924 which are now
3342 added to Unicode 5.1, and preparse.pl reports a conflict between SyntheticPropertyValueAliases.txt
3343 and PropertyValueAliases.txt.
3344 -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
3345 Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
3346 + PropertyValueAliases.txt now explicitly contains values for boolean properties:
3347 N/Y, No/Yes, F/T, False/True
3348 -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
3349 It will use further values from the file if present.
3350
3351 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
3352 - new block & script values
3353 + 17 new blocks
3354 + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
3355 (removed from SyntheticPropertyValueAliases.txt)
3356 + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
3357 (added to SyntheticPropertyValueAliases.txt)
3358 - uprops.icu (uprops.h) only provides 7 bits for script codes.
3359 In ICU 4.0 there are now USCRIPT_CODE_LIMIT=130 script codes.
3360 None of the codes above 127 is yet the script code of an
3361 assigned Unicode character, so ICU 4.0 uprops.icu does not store any
3362 script code values greater than 127.
3363 However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
3364 in a parallel bit field, and that overflows now.
3365 Also, future values >=128 would be incompatible anyway.
3366 uprops.h is modified to move around several of the bit fields
3367 in the properties vector words, and now uses 8 bits for the script code.
3368 Two other bit fields also grow to accommodate future growth:
3369 Block (current count: 172) grows from 8 to 9 bits,
3370 and Word_Break grows from 4 to 5 bits.
3371 - renamed short alias of property Simple_Case_Folding (sfc->scf)
3372 + nothing to be done: handled as normal alias
3373 - new property JSN Jamo_Short_Name
3374 + no new API: only contributes to the Name property
3375 - new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
3376 - new Joining Group (JG) value: Burushashki_Yeh_Barree
3377 - new Sentence_Break (SB) values:
3378 SB ; CR ; CR
3379 SB ; EX ; Extend
3380 SB ; LF ; LF
3381 SB ; SC ; SContinue
3382 - new Word_Break (WB) values:
3383 WB ; CR ; CR
3384 WB ; Extend ; Extend
3385 WB ; LF ; LF
3386 WB ; MB ; MidNumLet
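
The bit-field arithmetic behind the uprops.h change described above can be checked with a few lines. This is a sketch; bits_needed is a hypothetical helper, and it assumes enum values run from 0 to count-1.

```python
def bits_needed(max_value: int) -> int:
    """Number of bits required to store values 0..max_value."""
    return max(1, max_value.bit_length())


# Script codes: USCRIPT_CODE_LIMIT=130, so the maximum value is 129.
SCRIPT_BITS = bits_needed(130 - 1)
# Blocks: current count 172, so the maximum value is 171.
BLOCK_BITS = bits_needed(172 - 1)
```

The maximum script value 129 needs 8 bits, which is why the old 7-bit field overflows. The 172 blocks still fit in 8 bits, so the move to 9 bits is headroom for future growth.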
3387
3388 * Further changes in the 2008-02-29 update:
3389 - Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
3390 because they should not normally be invisible.
3391 - new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
3392 - new Grapheme_Cluster_Break (GCB) value: PP=Prepend
3393 - new Word_Break (WB) value: NL=Newline
3394
3395 * hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
3396 - Unihan range end moves from 9FBB to 9FC3
3397 search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
3398 + do change gennames.c
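
The search for the old Unihan end and limit values can be scripted, for example as follows. This is a minimal sketch that scans lines already read from a file; walking the source tree and choosing which hits to change is left to the reviewer, as described above.

```python
import re

# Matches the old range end 9FBB and the old limit 9FBC, in either case.
UNIHAN_END_OR_LIMIT = re.compile(r"9FB[BC]", re.IGNORECASE)


def find_hits(lines):
    """Return (line number, text) pairs for lines that mention 9FBB or 9FBC."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, 1)
        if UNIHAN_END_OR_LIMIT.search(line)
    ]
```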
3399
3400 * build Unicode data source code for hardcoding core data
3401 C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data
3402
3403 ICU data make path is \svn\icuproj\icu\uni51\source\data\
3404 ICU root path is \svn\icuproj\icu\uni51
3405 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
3406 Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
3407 Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
3408 Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
3409 Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
3410 Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
3411 Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
3412 Creating data file for Unicode Character Properties
3413 Creating data file for Unicode Case Mapping Properties
3414 Creating data file for Unicode BiDi/Shaping Properties
3415 Creating data file for Unicode Normalization
3416 Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
3417 Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"
3418
3419 - copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
3420 and rebuild the common library
3421
3422 *** Break iterators
3423
3424 * Update break iterator rules to new UAX versions and new property values
3425
3426 *** UCA
3427
3428 * update FractionalUCA.txt and UCARules.txt with new canonical closure
3429
3430 *** Test suites
3431 - Test that APIs using Unicode property value aliases (like UnicodeSet)
3432 support all of the boolean values N/Y, No/Yes, F/T, False/True
3433 -> TestBinaryValues() tests in both cintltst and intltest
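
The kind of check that such a test performs can be sketched as below. This is not the TestBinaryValues() code; it only illustrates that all eight spellings should resolve to the expected boolean. It assumes loose matching that ignores case, spaces, underscores and hyphens, and the names used are hypothetical.

```python
BINARY_VALUE_ALIASES = {
    "n": False, "no": False, "f": False, "false": False,
    "y": True, "yes": True, "t": True, "true": True,
}


def parse_binary_value(name: str) -> bool:
    """Map a binary property value alias to a bool; raise KeyError if unknown."""
    key = "".join(ch for ch in name.lower() if ch not in " _-")
    return BINARY_VALUE_ALIASES[key]
```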
3434
3435 *** LayoutEngine script information
3436 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
3437 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
3438 ScriptRunData.cpp, which is no longer needed.)
3439
3440 The generated files have a current copyright date and "@draft" statement.
3441
3442 * copy the above files into <icu>/source/layout, replacing the old files.
3443
3444 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
3445 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
3446
3447 * rebuild the layout and layoutex libraries.
3448
3449 *** Documentation
3450 - Update User Guide
3451 + Jamo_Short_Name, sfc->scf, binary property value aliases
3452
3453 ---------------------------------------------------------------------------- ***
3454
3455 Unicode 5.0 update
3456
3457 *** related Jitterbugs
3458
3459 5084 RFE: Update to Unicode 5.0
3460
3461 *** data files & enums & parser code
3462
3463 * file preparation
3464 - ucdstrip:
3465 DerivedCoreProperties.txt
3466 DerivedNormalizationProps.txt
3467 NormalizationTest.txt
3468 PropList.txt
3469 Scripts.txt
3470 GraphemeBreakProperty.txt
3471 SentenceBreakProperty.txt
3472 WordBreakProperty.txt
3473 - ucdstrip and ucdmerge:
3474 EastAsianWidth.txt
3475 LineBreak.txt
3476
3477 * my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
3478 copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
3479 copy 5.0.0\ucd\Blocks.txt ..\unidata\
3480 copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
3481 copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
3482 copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
3483 copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
3484 copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
3485 copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
3486 copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
3487 copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
3488 copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
3489 copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
3490 copy 5.0.0\ucd\UnicodeData.txt ..\unidata\
3491
3492 ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
3493 ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
3494 ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
3495 ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
3496 ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
3497 ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
3498 ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
3499 ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
3500 ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
3501 ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
3502
3503 * update FractionalUCA.txt and UCARules.txt with new canonical closure
3504
3505 * genpname
3506 - run preparse.pl
3507 + make sure that data.h is writable
3508 + perl preparse.pl \cvs\oss\icu > out.txt
3509
3510 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
3511 - new block & script values
3512 + script values already added in ICU 3.6 because all of ISO 15924 is now covered
3513
3514 * build Unicode data source code for hardcoding core data
3515 C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data
3516
3517 ICU data make path is \cvs\oss\icu\source\data\
3518 ICU root path is \cvs\oss\icu
3519 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
3520 [etc.]
3521 Creating data file for Unicode Character Properties
3522 Creating data file for Unicode Case Mapping Properties
3523 Creating data file for Unicode BiDi/Shaping Properties
3524 Creating data file for Unicode Normalization
3525 Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
3526 Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"
3527
3528 - copy the .c source files to C:\cvs\oss\icu\source\common
3529 and rebuild the common library
3530
3531 *** Unicode version numbers
3532 - makedata.mak
3533 - uchar.h
3534 - configure.in
3535
3536 *** LayoutEngine script information
3537 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
3538 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
3539 ScriptRunData.cpp, which is no longer needed.)
3540
3541 The generated files have a current copyright date and "@draft" statement.
3542
3543 * copy the above files into <icu>/source/layout, replacing the old files.
3544
3545 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
3546 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
3547
3548 * rebuild the layout and layoutex libraries.
3549
3550 ---------------------------------------------------------------------------- ***
3551
3552 Unicode 4.1 update
3553
3554 *** related Jitterbugs
3555
3556 4332 RFE: Update to Unicode 4.1
3557 4157 RBBI, TR29 4.1 updates
3558
3559 *** data files & enums & parser code
3560
3561 * file preparation
3562 - ucdstrip:
3563 DerivedCoreProperties.txt
3564 DerivedNormalizationProps.txt
3565 NormalizationTest.txt
3566 GraphemeBreakProperty.txt
3567 SentenceBreakProperty.txt
3568 WordBreakProperty.txt
3569 - ucdstrip and ucdmerge:
3570 EastAsianWidth.txt
3571 LineBreak.txt
3572
3573 * add new files to the repository
3574 GraphemeBreakProperty.txt
3575 SentenceBreakProperty.txt
3576 WordBreakProperty.txt
3577
3578 * update FractionalUCA.txt and UCARules.txt with new canonical closure
3579
3580 * genpname
3581 - handle new enumerated properties in sub read_uchar
3582 - run preparse.pl
3583
3584 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
3585 - new binary properties
3586 + Pattern_Syntax
3587 + Pattern_White_Space
3588 - new enumerated properties
3589 + Grapheme_Cluster_Break
3590 + Sentence_Break
3591 + Word_Break
3592 - new block & script & line break values
3593
3594 * gencase
3595 - case-ignorable changes
3596 see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
3597 now: (D47a) Word_Break=MidLetter or General_Category in Mn, Me, Cf, Lm, Sk
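
The case-ignorable definition above can be expressed as a predicate. The sketch below uses Python's unicodedata for the General_Category part; since Word_Break is not available there, it uses a small illustrative sample of MidLetter code points. The real set comes from WordBreakProperty.txt, and results depend on the Unicode version of the Python runtime rather than Unicode 4.1.

```python
import unicodedata

# Illustrative sample only; the authoritative list is in WordBreakProperty.txt.
SAMPLE_MIDLETTER = {0x0027, 0x00B7, 0x2019, 0x2027}
CASE_IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}


def is_case_ignorable(ch: str) -> bool:
    """Return True if ch is MidLetter (sample set) or gc is Mn, Me, Cf, Lm or Sk."""
    return (
        ord(ch) in SAMPLE_MIDLETTER
        or unicodedata.category(ch) in CASE_IGNORABLE_CATEGORIES
    )
```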
3598
3599 *** Unicode version numbers
3600 - makedata.mak
3601 - uchar.h
3602 - configure.in
3603
3604 *** tests
3605 - verify that u_charMirror() round-trips
3606 - test all new properties and some new values of old properties
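
The round-trip property for mirroring can be stated as mirror(mirror(c)) == c for every code point in the mapping. The sketch below checks this on a plain dictionary; it does not call u_charMirror(), and the three sample pairs are a tiny excerpt chosen for illustration.

```python
# Tiny illustrative excerpt of a mirroring table (both directions listed).
SAMPLE_MIRRORS = {
    0x0028: 0x0029, 0x0029: 0x0028,
    0x003C: 0x003E, 0x003E: 0x003C,
    0x005B: 0x005D, 0x005D: 0x005B,
}


def mirror(table, c):
    """Return the mirrored code point, or c itself if it has no mapping."""
    return table.get(c, c)


def round_trip_failures(table):
    """List the code points for which mirroring twice does not return the input."""
    return [c for c in sorted(table) if mirror(table, mirror(table, c)) != c]
```

A table that lists only one direction of a pair, such as {0x0028: 0x0029}, is reported as a failure for U+0028.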
3607
3608 *** other code
3609
3610 * hardcoded Unihan range end/limit
3611 - Unihan range end moves from 9FA5 to 9FBB
3612 search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
3613 + do not modify BOCU/BOCSU code because that would change the encoding
3614 and break binary compatibility!
3615 + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
3616 NamePrepProfile.txt
3617 + ignore trietest.c: test data is arbitrary
3618 + ignore tstnorm.cpp: test optimization, not important
3619 + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
3620 + do change line_th.txt and word_th.txt
3621 by replacing hardcoded ranges with the new property values
3622 + do change gennames.c
3623
3624 source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
3625 source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
3626 source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5,
3627
3628 * case mappings
3629 - compare new special casing context conditions with previous ones
3630 see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
3631
3632 * genpname
3633 - consider storing only the short name if it is the same as the long name
3634
3635 *** other reviews
3636 - UAX #29 changes (grapheme/word/sentence breaks)
3637 - UAX #14 changes (line breaks)
3638 - Pattern_Syntax & Pattern_White_Space
3639
3640 ---------------------------------------------------------------------------- ***
3641
3642 Unicode 4.0.1 update
3643
3644 *** related Jitterbugs
3645
3646 3170 RFE: Update to Unicode 4.0.1
3647 3171 Add new Unicode 4.0.1 properties
3648 3520 use Unicode 4.0.1 updates for break iteration
3649
3650 *** data files & enums & parser code
3651
3652 * file preparation
3653 - ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
3654 - ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt
3655
3656 * file fixes
3657 - fix UnicodeData.txt general categories of Ethiopic digits Nd->No
3658 according to PRI #26
3659 http://www.unicode.org/review/resolved-pri.html#pri26
3660 - undone again because no corrigendum in sight;
3661 instead modified tests to not check consistency on this for Unicode 4.0.1
3662
3663 * ucdterms.txt
3664 - update from http://www.unicode.org/copyright.html
3665 formatted for plain text
3666
3667 * uchar.h & uprops.h & uprops.c & genprops
3668 - add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
3669 - add U_LB_INSEPARABLE due to a spelling fix
3670 + put short name comment only on line with new constant
3671 for genpname perl script parser
3672 - new binary properties
3673 + STerm
3674 + Variation_Selector
3675
3676 * genpname
3677 - fix genpname perl script so that it doesn't choke on more than 2 names per property value
3678 - perl script: correctly calculate the maximum number of fields per row
3679
3680 * uscript.h
3681 - new script code Hrkt=Katakana_Or_Hiragana
3682
3683 * gennorm.c track changes in DerivedNormalizationProps.txt
3684 - "FNC" -> "FC_NFKC"
3685 - single field "NFD_NO" -> two fields "NFD_QC; N" etc.
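
A parser that tolerates both the old single-field form and the new two-field form might look like the sketch below. It is not the gennorm.c code; it assumes the old suffixes _NO and _MAYBE correspond to the new values N and M, and the function name is hypothetical.

```python
# Assumed correspondence between old name suffixes and new quick-check values.
OLD_SUFFIXES = {"_NO": "N", "_MAYBE": "M"}


def parse_quick_check(fields):
    """Return (property, value) for a quick-check entry, or None otherwise.

    fields holds the semicolon-separated fields after the code point range.
    """
    fields = [f.strip() for f in fields]
    if len(fields) == 2 and fields[0].endswith("_QC"):
        return fields[0], fields[1]  # new form: "NFD_QC; N"
    if len(fields) == 1:
        for suffix, value in OLD_SUFFIXES.items():
            if fields[0].endswith(suffix):
                # old form: "NFD_NO" becomes ("NFD_QC", "N")
                return fields[0][: -len(suffix)] + "_QC", value
    return None
```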
3686
3687 * genprops/props2.c track changes in DerivedNumericValues.txt
3688 - changed from 3 columns to 2, dropping the numeric type
3689 + assume that the type is always numeric for Han characters,
3690 and that only those are added in addition to what UnicodeData.txt lists
3691
3692 *** Unicode version numbers
3693 - makedata.mak
3694 - uchar.h
3695 - configure.in
3696
3697 *** tests
3698 - update test of default bidi classes according to PRI #28
3699 /tsutil/cucdtst/TestUnicodeData
3700 http://www.unicode.org/review/resolved-pri.html#pri28
3701 - bidi tests: change exemplar character for ES depending on Unicode version
3702 - change hardcoded expected property values where they change
3703
3704 *** other code
3705
3706 * name matching
3707 - read UCD.html
3708
3709 * scripts
3710 - use new Hrkt=Katakana_Or_Hiragana
3711
3712 * ZWJ & ZWNJ
3713 - are now part of combining character sequences
3714 - break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ