* Copyright (C) 2016 and later: Unicode, Inc. and others.
* License & terms of use: http://www.unicode.org/copyright.html
* Copyright (C) 2004-2016, International Business Machines
* Corporation and others. All Rights Reserved.
*
* file name: changes.txt
* encoding: US-ASCII
* tab size: 8 (not used)
* indentation:4
*
* created on: 2004may06
* created by: Markus W. Scherer
*
* change log for Unicode updates
*
* For each new Unicode version, during the beta period,
* I copy the change log for the previous version to the top of this file.
* I adjust the versions, tickets, URLs, and paths.
* I work my way through the steps listed in the log, top to bottom,
* adjusting the log as necessary.
* I report problems to the UTC and/or CLDR and/or ICU.
* Before the data is final, I "turn the crank" several more times,
* using appropriate subsets of the steps.

---------------------------------------------------------------------------- ***

* New ISO 15924 script codes

Starting with ICU 55, we do not add UScriptCode constants for new scripts any more
until they are encoded in Unicode,
or can be assumed to be encoded in the next Unicode version.
Script enum constant names want to follow the Unicode script property value aliases,
which are assigned only when the scripts are encoded.
When we encode scripts early and guess wrong, then we have confusing enum constants
and have sometimes added aliases.

Variant script codes like Latf and Aran that are not subject to separate encoding
can be added at any time.
(For example, Aran could be added as USCRIPT_ARABIC_NASTALIQ.)

We add script codes used in CLDR or in the spoof checker.
This includes combination/alias codes like Hanb and Jamo.
See http://unicode.org/reports/tr35/#unicode_script_subtag_validity
and look for "alias" on http://unicode.org/iso15924/iso15924-codes.html

We add special Z* script codes like Zsye.

For new script codes see http://www.unicode.org/iso15924/codechanges.html

---------------------------------------------------------------------------- ***

Unicode 12.1 update for ICU 64.2

** This is an abbreviated update with one new character for the new
** Japanese era expected to start on 2019-May-01: U+32FF SQUARE ERA NAME REIWA
https://en.wikipedia.org/wiki/Reiwa_period

http://www.unicode.org/versions/Unicode12.1.0/

ICU-20497 Unicode 12.1

cldrbug 11978: Unicode 12.1

* Command-line environment setup

UNICODE_DATA=~/unidata/uni121/20190403
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt64b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
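
Since most later steps depend on these variables, a quick check that none is
unset can save a failed run. The following is a minimal sketch only;
check_unicode_env is a hypothetical helper name and is not part of ICU or its tools.

```shell
# Hypothetical helper (not part of ICU): report every variable used by the
# steps in this log that is unset or empty, and return non-zero if any is.
check_unicode_env() {
    missing=0
    for name in UNICODE_DATA CLDR_SRC ICU_ROOT ICU_SRC ICUDT ICU4C_DATA_IN ICU4C_UNIDATA; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Run check_unicode_env after the assignments above; a non-zero status means
at least one variable still needs to be set.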

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
    so that the makefiles see the new version number.
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
    + subfolders: emoji, idna, security, ucd, uca
    + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

* for manual diffs and for Unicode Tools input data updates:
    remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
    (see https://sites.google.com/site/unicodetools/inputdata)
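
To illustrate what "remove version suffixes" means for a single file name,
here is a rough sketch. It is not the logic of desuffixucd.py; the function
name strip_version_suffix is hypothetical, and the suffix shape
("-12.1.0" or "-12.1.0d3" before the extension) is an assumption.

```shell
# Hypothetical illustration only; the real renaming is done by unidata/desuffixucd.py.
# Assumes a suffix like "-12.1.0" or "-12.1.0d3" directly before the file extension.
strip_version_suffix() {
    printf '%s\n' "$1" |
        sed -E 's/-[0-9]+\.[0-9]+\.[0-9]+(d[0-9]+)?(\.[A-Za-z]+)$/\2/'
}

strip_version_suffix "UnicodeData-12.1.0d1.txt"
strip_version_suffix "confusables.txt"
```

Names without such a suffix pass through unchanged, so the function can be
applied to every file in a folder.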

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
    + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
    + For debugging, and tweaking how ppucd.txt is written,
      the tool has an --only_ppucd option:
      py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
    so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
    so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
    sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..12.1: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update
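
The grep step above could look roughly like the following sketch. The sample
lines only imitate the layout of IdnaMappingTable.txt and are not copied from
the real file; the function names are hypothetical. It assumes that every
entry starts with a code point of at least four hex digits, so ASCII entries
start with 0000..007F.

```shell
# Sample lines in the style of IdnaMappingTable.txt (illustrative, not the real data).
sample_idna_lines() {
    cat <<'EOF'
003D          ; disallowed_STD3_valid   # EQUALS SIGN
0061..007A    ; valid                   # LATIN SMALL LETTER A..Z
2260          ; disallowed_STD3_valid   # NOT EQUAL TO
226E..226F    ; disallowed_STD3_valid   # NOT LESS-THAN..NOT GREATER-THAN
EOF
}

# Keep the disallowed_STD3_valid entries, then drop those that start in U+0000..U+007F.
find_non_ascii_std3() {
    grep 'disallowed_STD3_valid' | grep -Ev '^00[0-7][0-9A-F]'
}

sample_idna_lines | find_non_ascii_std3
```

On the real file, replace sample_idna_lines with
cat $UNICODE_DATA/idna/IdnaMappingTable.txt (path assumed from the subfolder
list above) and compare the result with the list of known code points.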

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
    https://sites.google.com/site/unicodetools/home#TOC-UCA
    diff the main mapping file, look for bad changes
    (for example, more bytes per weight for common characters)
    ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.1.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.1.txt
    ~/svn.unitools/trunk$ meld ../frac-12.txt ../frac-12.1.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
    and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
    from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
    CLDR common/collation/root.xml <collation type="private-unihan">
    and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca, see command line above
- rebuild ICU4C

* Unihan collators
    https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
    with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
    -DUVERSION=12.1.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
    with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
    with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
    -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
    zh
    and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
    (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
    utility/MultithreadTest/TestCollators will fail as well;
    fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt64b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt64l.dat ./out/icu4j/icudt64b.dat -s ./out/build/icudt64l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt64b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt64b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt64b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
    separate from the other data files,
    and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
    + copy & paste the initializers of lcccIndex[] etc. from
      ICU4C/source/i18n/collationfcd.cpp to
      ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
    for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
    in new blocks (Blocks.txt)
    Unicode 12: using Unicode 12 CLDR ticket #11478
    hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
    wcho 1E2F0..1E2F9 Wancho
    Unicode 11: using Unicode 11 CLDR ticket #10978
    rohg 10D30..10D39 Hanifi_Rohingya
    gong 11DA0..11DA9 Gunjala_Gondi
    Earlier: CLDR tickets specific to adding new numbering systems.
    Unicode 10: http://unicode.org/cldr/trac/ticket/10219
    Unicode 9: http://unicode.org/cldr/trac/ticket/9692
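
The egrep search above finds the digit four of each decimal digit set, which
gives one hit per numbering system. A small sketch of the same filter on
sample input follows; the sample lines are simplified and only imitate the
ppucd.txt layout (semicolon-separated, code point in the second field), and
the function name is hypothetical.

```shell
# Simplified sample lines in the style of ppucd.txt (illustrative, not the real data).
sample_ppucd_lines() {
    cat <<'EOF'
cp;0034;AHex;bc=EN;gc=Nd;na=DIGIT FOUR;nt=De;nv=4
cp;1E144;bc=L;gc=Nd;na=NYIAKENG PUACHUE HMONG DIGIT FOUR;nt=De;nv=4
cp;2163;bc=L;gc=Nl;na=ROMAN NUMERAL FOUR;nt=Nu;nv=4
EOF
}

# Same pattern as in the log (grep -E is equivalent to egrep); print only the code points.
sample_ppucd_lines | grep -E ';gc=Nd.+;nv=4' | cut -d';' -f2
```

The Roman numeral is filtered out because its General_Category is Nl, not Nd.
Hits in blocks that are new in Blocks.txt are candidates for new CLDR
numbering systems.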

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
    instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
    http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Unicode 12.0 update for ICU 64

http://www.unicode.org/versions/Unicode12.0.0/
http://unicode.org/versions/beta-12.0.0.html
https://www.unicode.org/review/pri389/
http://www.unicode.org/reports/uax-proposed-updates.html
http://www.unicode.org/reports/tr44/tr44-23.html

ICU-20203 Unicode 12

ICU-20111 move text layout properties data into a data file

cldrbug 11478: Unicode 12
Accidentally used ^/trunk instead of ^/branches/markus/uni12

* Command-line environment setup

UNICODE_DATA=~/unidata/uni12/20190309
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt63b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
    so that the makefiles see the new version number.

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
    + subfolders: emoji, idna, security, ucd, uca
    + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

* for manual diffs and for Unicode Tools input data updates:
    remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
    (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
    + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
    + For debugging, and tweaking how ppucd.txt is written,
      the tool has an --only_ppucd option:
      py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
    so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
    [(u'blk', set([u'Symbols_And_Pictographs_Ext_A', u'Elymaic',
    u'Ottoman_Siyaq_Numbers', u'Nandinagari', u'Nyiakeng_Puachue_Hmong',
    u'Small_Kana_Ext', u'Egyptian_Hieroglyph_Format_Controls', u'Wancho', u'Tamil_Sup'])),
    (u'sc', set([u'Nand', u'Wcho', u'Elym', u'Hmnp']))]
    = PropertyValueAliases.txt new property values (diff old & new .txt files)
    blk; Egyptian_Hieroglyph_Format_Controls; Egyptian_Hieroglyph_Format_Controls
    blk; Elymaic ; Elymaic
    blk; Nandinagari ; Nandinagari
    blk; Nyiakeng_Puachue_Hmong ; Nyiakeng_Puachue_Hmong
    blk; Ottoman_Siyaq_Numbers ; Ottoman_Siyaq_Numbers
    blk; Small_Kana_Ext ; Small_Kana_Extension
    blk; Symbols_And_Pictographs_Ext_A ; Symbols_And_Pictographs_Extended_A
    blk; Tamil_Sup ; Tamil_Supplement
    blk; Wancho ; Wancho
    -> add to uchar.h
        use long property names for enum constants,
        for the trailing comment get the block start code point: diff old & new Blocks.txt
    -> add to UCharacter.UnicodeBlock IDs
        Eclipse find    UBLOCK_([^ ]+) = ([0-9]+), (/.+)
                replace public static final int \1_ID = \2; \3
    -> add to UCharacter.UnicodeBlock objects
        Eclipse find    UBLOCK_([^ ]+) = [0-9]+, (/.+)
                replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
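
For reference, the two Eclipse find/replace operations can be approximated
on the command line with sed -E. This is a sketch only; the function names
are hypothetical, the sample enum value is illustrative, and in the second
pattern the trailing comment is capture group 2 because the number is not
captured.

```shell
# Turn a uchar.h UBLOCK_ enum line (on stdin) into a UCharacter.UnicodeBlock ID constant.
to_block_id() {
    sed -E 's|UBLOCK_([^ ]+) = ([0-9]+), (/.+)|public static final int \1_ID = \2; \3|'
}

# Turn the same line into a UnicodeBlock object declaration.
# The number is not captured here, so the trailing comment is group 2.
to_block_object() {
    sed -E 's|UBLOCK_([^ ]+) = [0-9]+, (/.+)|public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2|'
}

# Sample line in the style of uchar.h; the numeric value is illustrative.
line='    UBLOCK_ELYMAIC = 293, /*[10FE0]*/'
printf '%s\n' "$line" | to_block_id
printf '%s\n' "$line" | to_block_object
```

Both functions leave lines that do not match the pattern unchanged, so the
new enum lines can be piped through them as a group.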

    sc ; Elym ; Elymaic
    sc ; Hmnp ; Nyiakeng_Puachue_Hmong
    sc ; Nand ; Nandinagari
    sc ; Wcho ; Wancho
    -> uscript.h & com.ibm.icu.lang.UScript
    -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
        and in com.ibm.icu.dev.test.lang.TestUScript.java

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
    so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
    sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..12.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- update test of default bidi classes:
    Bidi range \U0001ED00-\U0001ED4F changes default from R to AL,
    see diffs in DerivedBidiClass.txt
    + /tsutil/cucdtst/TestUnicodeData enumDefaultsRange() defaultBidi[]
    + UCharacterTest.java TestIteration() defaultBidi[]
- Andy handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
    https://sites.google.com/site/unicodetools/home#TOC-UCA
    diff the main mapping file, look for bad changes
    (for example, more bytes per weight for common characters)
    ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.txt
    ~/svn.unitools/trunk$ meld ../frac-11.txt ../frac-12.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
    and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
    from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
    CLDR common/collation/root.xml <collation type="private-unihan">
    and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca, see command line above;
    deal with
    Error: Unknown script for first-primary sample character U+119CE on line 29233 of /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
    FDD1 119CE; [71 CD 02, 05, 05] # Nandinagari first primary (compressible)
    (add the character to genuca.cpp sampleCharsToScripts[])
    + This time, I added code to genuca.cpp to use uscript_getSampleUnicodeString(script)
      and cache its values.
      Works as long as the script metadata is updated before the collation data.
- rebuild ICU4C

* Unihan collators
    https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
    with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
    -DUVERSION=12.0.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
    with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
    with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
    -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
    zh
    and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
    (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
    utility/MultithreadTest/TestCollators will fail as well;
    fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    output:
    ...
    Unicode .icu files built to ./out/build/icudt63l
    echo timestamp > uni-core-data
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt63b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b
    echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt63l.dat ./out/icu4j/icudt63b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt63l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt63b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt63b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt63b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
    separate from the other data files,
    and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
    + copy & paste the initializers of lcccIndex[] etc. from
      ICU4C/source/i18n/collationfcd.cpp to
      ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
    for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
    in new blocks (Blocks.txt)
    Unicode 12: using Unicode 12 CLDR ticket #11478
    hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
    wcho 1E2F0..1E2F9 Wancho
    Unicode 11: using Unicode 11 CLDR ticket #10978
    rohg 10D30..10D39 Hanifi_Rohingya
    gong 11DA0..11DA9 Gunjala_Gondi
    Earlier: CLDR tickets specific to adding new numbering systems.
    Unicode 10: http://unicode.org/cldr/trac/ticket/10219
    Unicode 9: http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
    instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
    http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***
651
652ICU 63 addition of ICU support of text layout properties InPC, InSC, vo
653
654* Command-line environment setup
655
656UNICODE_DATA=~/unidata/uni11/20180609
657CLDR_SRC=~/svn.cldr/uni
658ICU_ROOT=~/icu/mine
659ICU_SRC=$ICU_ROOT/src
660ICUDT=icudt62b
661ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
662ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
663export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
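
Optional sanity check for the setup above: a minimal sketch, assuming a POSIX shell.
check_env is a hypothetical helper, not part of ICU. It only reports variables
that are unset or empty; it does not check that the paths exist.

```shell
# Hypothetical helper (not part of ICU): print the name of each listed
# variable that is unset or empty; return non-zero if any is missing.
check_env() {
    missing=0
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}

# Usage with the variables set above:
# check_env UNICODE_DATA CLDR_SRC ICU_ROOT ICU_SRC ICUDT ICU4C_DATA_IN ICU4C_UNIDATA
```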
664
665*** Links
666
667https://unicode-org.atlassian.net/browse/ICU-8966 InPC & InSC
668https://unicode-org.atlassian.net/browse/ICU-12850 vo
669
670*** data files & enums & parser code
671
672* API additions
673- for each of the three new enumerated properties
674 + uchar.h: add the enum UProperty constant UCHAR_<long prop name>
675 + uchar.h: update UCHAR_INT_LIMIT
676 + uchar.h: add the enum U<long prop name>
677 with constants U_<short prop name>_<long value name>
678 + UProperty.java: add the constant <long prop name>
679 + UProperty.java: update INT_LIMIT
680 + UCharacter.java: add the interface <long prop name>
681 with constants <long value name>
682
683* process and/or copy files
684- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
685 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
686 + It also writes tools/unicode/c/genprops/pnames_data.h with property and value
687 names and aliases.
688 + For debugging, and tweaking how ppucd.txt is written,
689 the tool has an --only_ppucd option:
690 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
691
692* preparseucd.py changes
693- add new property short names (uppercase) to _prop_and_value_re
694 so that ParseUCharHeader() parses the new enum constants
695
696* build ICU (make install)
697 so that the tools build can pick up the new definitions from the installed header files.
698
699 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
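
 With the one-liner above, the shell's exit status afterwards is that of "date",
 not of "make". A minimal sketch of a wrapper that keeps the build's exit status,
 assuming a POSIX shell; run_logged is a hypothetical name, not an ICU tool.

```shell
# Hypothetical wrapper (not part of ICU): run a command with stdout and
# stderr redirected to a log file, show the last 30 log lines and the
# date, then return the command's own exit status.
run_logged() {
    log=$1
    shift
    "$@" > "$log" 2>&1
    status=$?
    tail -n 30 "$log"
    date
    return "$status"
}

# Usage, from $ICU_ROOT/dbg/icu4c:
# run_logged out.txt make -j7 install && echo "install OK"
```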
700
701* build Unicode tools using CMake+make
702
703$ICU_SRC/tools/unicode/c/icudefs.txt:
704
705# Location (--prefix) of where ICU was installed.
706set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
707# Location of the ICU4C source tree.
708set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/mine/src/icu4c)
709
710 $ICU_ROOT/dbg$
711 mkdir -p tools/unicode/c
712 cd tools/unicode/c
713
714 $ICU_ROOT/dbg/tools/unicode/c$
715 cmake ../../../../src/tools/unicode/c
716 make
717
718* generate core properties data files
719 $ICU_ROOT/dbg/tools/unicode/c$
720 genprops/genprops $ICU_SRC/icu4c
721- rebuild ICU (make install) & tools
722
723* write data for runtime, hardcoded for now
724- add genprops/layoutpropsbuilder.cpp with pieces from sibling files
725- generate new icu4c/source/common/ulayout_props_data.h
726- for each of the three new enumerated properties
727 + int property max value
728 + small, 8-bit UCPTrie
729 (A small 16-bit trie with bit fields for these three properties
730 is very nearly the same size as the sum of the three.)
731
732* wire into C++
733- uprops.cpp: #include ulayout_props_data.h
734- uprops.cpp: add getInPC() etc. functions
735- uprops.cpp: add lines to intProps[], include max values
736- uprops.h: add UPropertySource constants
737- uprops.cpp: add uprops_addPropertyStarts(src)
738- uniset_props.cpp: add to UnicodeSet_initInclusion()
739- intltest/ucdtest.cpp: write unit tests
740
741* update Java data files
742- refresh just the pnames.icu file with the new property [value] names, just to be safe
743- see $ICU_SRC/icu4c/source/data/icu4j-readme.txt
744- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
745- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
746- copy the big-endian Unicode data files to another location,
747 separate from the other data files,
748 and then refresh ICU4J
749 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
750 cp com/ibm/icu/impl/data/$ICUDT/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
751 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
752
753* wire into Java
754- UCharacterProperty.java: add new SRC_INPC etc. constants as in C++
755- UCharacterProperty.java: for each new property
756 + create a nested class to hold its CodePointTrie
757 + initialize it from a string literal
758 + paste in the initializer printed by genprops
759 + add a new IntProperty object to the intProps[] array
760 + use the correct max int value for each property, also printed by genprops
761- UCharacterProperty.java: add ulayout_addPropertyStarts(src, set)
762- UnicodeSet.java: add to getInclusions()
763- UCharacterTest.java: write unit tests
764
765---------------------------------------------------------------------------- ***
766
767Unicode 11.0 update for ICU 62
768
769http://www.unicode.org/versions/Unicode11.0.0/
770http://unicode.org/versions/beta-11.0.0.html
771https://www.unicode.org/review/pri372/
772http://www.unicode.org/reports/uax-proposed-updates.html
773http://www.unicode.org/reports/tr44/tr44-21.html
774
775* Command-line environment setup
776
777UNICODE_DATA=~/unidata/uni11/20180521
778CLDR_SRC=~/svn.cldr/uni
779ICU_ROOT=~/svn.icu/uni
780ICU_SRC=$ICU_ROOT/src
781ICUDT=icudt61b
782ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
783ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
784export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
785
786*** ICU Trac
787
788- ticket:13630: Unicode 11
789- ^/branches/markus/uni11
790
791*** CLDR Trac
792
793- cldrbug 10978: Unicode 11
794- ^/branches/markus/uni11
795
796*** Unicode version numbers
797- makedata.mak
798- uchar.h
799- com.ibm.icu.util.VersionInfo
800- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
801
802- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
803 so that the makefiles see the new version number.
804
805*** data files & enums & parser code
806
807* download files
808- mkdir -p $UNICODE_DATA
809- download Unicode files into $UNICODE_DATA
810 + subfolders: emoji, idna, security, ucd, uca
811 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
812
813* for manual diffs and for Unicode Tools input data updates:
814 remove version suffixes from the file names
815 ~$ unidata/desuffixucd.py $UNICODE_DATA
816 (see https://sites.google.com/site/unicodetools/inputdata)
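
 To illustrate what "remove version suffixes" means: a minimal sketch, assuming
 a POSIX shell with sed -E, and assuming suffixes of the form -11.0.0 or
 -11.0.0d13 before the file extension. desuffix_name is a hypothetical function;
 it is not the logic of desuffixucd.py, which may handle more cases.

```shell
# Sketch: strip a trailing "-<version>[d<draft>]" from a file name,
# keeping the extension. Names without such a suffix pass through.
desuffix_name() {
    printf '%s\n' "$1" |
    sed -E 's/-[0-9]+(\.[0-9]+)*(d[0-9]+)?(\.[A-Za-z0-9]+)$/\3/'
}

# Usage:
# desuffix_name UnicodeData-11.0.0d1.txt
```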
817
818* process and/or copy files
819- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
820 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
821 + For debugging, and tweaking how ppucd.txt is written,
822 the tool has an --only_ppucd option:
823 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
824
825- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
826
827* build ICU (make install)
828 so that the tools build can pick up the new definitions from the installed header files.
829
830 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
831
832* preparseucd.py changes
833- fix other errors
834 NameError: unknown property Extended_Pictographic
835 -> add Extended_Pictographic binary property
836 -> add new short names for all Emoji properties
837
838* new constants for new property values
839- preparseucd.py error:
840 ValueError: missing uchar.h enum constants for some property values:
841 [(u'blk', set([u'Georgian_Ext', u'Hanifi_Rohingya', u'Medefaidrin', u'Sogdian', u'Makasar',
842 u'Old_Sogdian', u'Dogra', u'Gunjala_Gondi', u'Chess_Symbols', u'Mayan_Numerals',
843 u'Indic_Siyaq_Numbers'])),
844 (u'jg', set([u'Hanifi_Rohingya_Kinna_Ya', u'Hanifi_Rohingya_Pa'])),
845 (u'sc', set([u'Medf', u'Sogd', u'Dogr', u'Rohg', u'Maka', u'Sogo', u'Gong'])),
846 (u'GCB', set([u'LinkC', u'Virama'])),
847 (u'WB', set([u'WSegSpace']))]
848 = PropertyValueAliases.txt new property values (diff old & new .txt files)
849 blk; Chess_Symbols ; Chess_Symbols
850 blk; Dogra ; Dogra
851 blk; Georgian_Ext ; Georgian_Extended
852 blk; Gunjala_Gondi ; Gunjala_Gondi
853 blk; Hanifi_Rohingya ; Hanifi_Rohingya
854 blk; Indic_Siyaq_Numbers ; Indic_Siyaq_Numbers
855 blk; Makasar ; Makasar
856 blk; Mayan_Numerals ; Mayan_Numerals
857 blk; Medefaidrin ; Medefaidrin
858 blk; Old_Sogdian ; Old_Sogdian
859 blk; Sogdian ; Sogdian
860 -> add to uchar.h
861 use long property names for enum constants,
862 for the trailing comment get the block start code point: diff old & new Blocks.txt
863 -> add to UCharacter.UnicodeBlock IDs
864 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
865 replace public static final int \1_ID = \2; \3
866 -> add to UCharacter.UnicodeBlock objects
867 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
868 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
869
870 GCB; LinkC ; LinkingConsonant
871 GCB; Virama ; Virama
872 -> uchar.h & UCharacter.GraphemeClusterBreak
873 -> these two later removed again: http://www.unicode.org/L2/L2018/18115.htm#155-A76
874
875 InSC; Consonant_Initial_Postfixed ; Consonant_Initial_Postfixed
876 -> ignore: ICU does not yet support this property
877
878 jg ; Hanifi_Rohingya_Kinna_Ya ; Hanifi_Rohingya_Kinna_Ya
879 jg ; Hanifi_Rohingya_Pa ; Hanifi_Rohingya_Pa
880 -> uchar.h & UCharacter.JoiningGroup
881
882 sc ; Dogr ; Dogra
883 sc ; Gong ; Gunjala_Gondi
884 sc ; Maka ; Makasar
885 sc ; Medf ; Medefaidrin
886 sc ; Rohg ; Hanifi_Rohingya
887 sc ; Sogd ; Sogdian
888 sc ; Sogo ; Old_Sogdian
889 -> uscript.h & com.ibm.icu.lang.UScript
890 -> Nushu had been added already
891 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
892 and in com.ibm.icu.dev.test.lang.TestUScript.java
893
894 WB ; WSegSpace ; WSegSpace
895 -> uchar.h & UCharacter.WordBreak
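
 The two Eclipse find/replace steps for the UBLOCK_ constants above can also be
 done on the command line. A minimal sketch, assuming sed -E and input lines
 copied from uchar.h; the function names are hypothetical, and the sample
 constant value in the usage line is only illustrative.

```shell
# Sketch: sed equivalents of the two Eclipse find/replace patterns.
# Both read uchar.h-style "UBLOCK_<NAME> = <n>, /*...*/" lines on stdin.

# UBLOCK_X = n, /*...*/  ->  public static final int X_ID = n; /*...*/
ublock_to_java_ids() {
    sed -E 's|UBLOCK_([^ ]+) = ([0-9]+), (/.+)|public static final int \1_ID = \2; \3|'
}

# UBLOCK_X = n, /*...*/  ->  public static final UnicodeBlock X = new UnicodeBlock("X", X_ID); /*...*/
ublock_to_java_objects() {
    sed -E 's|UBLOCK_([^ ]+) = [0-9]+, (/.+)|public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2|'
}

# Usage:
# echo 'UBLOCK_DOGRA = 282, /*[11800]*/' | ublock_to_java_ids
```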
896
897* New short names for emoji properties
898- see UTS #51
899- short names set in preparseucd.py
900
901* New properties
902- boolean emoji property Extended_Pictographic
903 -> added in preparseucd.py
904 -> uchar.h & UProperty.java
905- misc. property Equivalent_Unified_Ideograph (EqUIdeo)
906 as shown in PropertyValueAliases.txt
907 -> ignore for now
908
909* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
910 (not strictly necessary for NOT_ENCODED scripts)
911 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
912
913* update spoof checker UnicodeSet initializers:
914 inclusionPat & recommendedPat in uspoof.cpp
915 INCLUSION & RECOMMENDED in SpoofChecker.java
916- make sure that the Unicode Tools tree contains the latest security data files
917- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
918- update the hardcoded version number there in the DIRECTORY path
919- run the tool (no special environment variables needed)
920- copy & paste from the Console output into the .cpp & .java files
921
922* generate normalization data files
923 cd $ICU_ROOT/dbg/icu4c
924 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
925 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
926 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
927 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
928 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
929
930* build ICU (make install)
931 so that the tools build can pick up the new definitions from the installed header files.
932
933 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
934
935* build Unicode tools using CMake+make
936
937$ICU_SRC/tools/unicode/c/icudefs.txt:
938
939# Location (--prefix) of where ICU was installed.
940set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
941# Location of the ICU4C source tree.
942set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c)
943
944 $ICU_ROOT/dbg$
945 mkdir -p tools/unicode/c
946 cd tools/unicode/c
947
948 $ICU_ROOT/dbg/tools/unicode/c$
949 cmake ../../../../src/tools/unicode/c
950 make
951
952* generate core properties data files
953 $ICU_ROOT/dbg/tools/unicode/c$
954 genprops/genprops $ICU_SRC/icu4c
955 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
956 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
957- rebuild ICU (make install) & tools
958
959* Fix case props
960 genprops error: casepropsbuilder: too many exceptions words
961 genprops error: failure finalizing the data - U_BUFFER_OVERFLOW_ERROR
962- With the addition of Georgian Mtavruli capital letters,
963 there are now too many simple case mappings with big mapping deltas
964 that yield uncompressible exceptions.
965- Changing the data structure (now formatVersion 4),
966 adding one bit for no-simple-case-folding (for Cherokee), and
967 one optional slot for a big delta (for most faraway mappings),
968 together with another bit for whether that is negative.
969 This makes most Cherokee & Georgian etc. case mappings compressible,
970 reducing the number of exceptions words.
971- Further changes to gain one more bit for the exceptions index,
972 for future growth. For details, see casepropsbuilder.cpp.
973
974* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
975 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
976- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
977- Unicode 6.0..11.0: U+2260, U+226E, U+226F
978- nothing new in this Unicode version, no test file to update
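
- the grep step above as a minimal sketch, assuming a POSIX shell and assuming
  that each data line starts with an uppercase hex code point (or range) of at
  least four digits, as in IdnaMappingTable.txt. std3_valid_non_ascii is a
  hypothetical name.

```shell
# Sketch: list "disallowed_STD3_valid" data lines whose first code point
# is above U+007F. Comment lines (starting with '#') are skipped.
std3_valid_non_ascii() {
    grep -v '^#' "$1" |
    grep 'disallowed_STD3_valid' |
    grep -Ev '^00[0-7][0-9A-F]([^0-9A-F]|$)'
}

# Usage:
# std3_valid_non_ascii $UNICODE_DATA/idna/IdnaMappingTable.txt
```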
979
980* run & fix ICU4C tests
981- Andy handles RBBI & spoof check test failures
982
983- Errors in char.txt, word.txt, word_POSIX.txt like
984 createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET" at line 46, column 16
985 because \p{Grapheme_Cluster_Break = EBG} and \p{Word_Break = EBG} are empty.
986 -> Temporary(!) workaround: Add an arbitrary code point to these sets to make them
987 not empty, just to get ICU building.
988 -> Intermediate workaround: Remove $E_Base_GAZ and other now-unused variables
989 and properties together with the rules that used them (GB 10, WB 14).
990 -> Andy adjusts the rule sets further to sync with
991 Unicode 11 grapheme, word, and line break spec changes.
992
993* collation: CLDR collation root, UCA DUCET
994
995- UCA DUCET goes into Mark's Unicode tools, see
996 https://sites.google.com/site/unicodetools/home#TOC-UCA
997 diff the main mapping file, look for bad changes
998 (for example, more bytes per weight for common characters)
999 ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/uca/11.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-11.txt
1000 ~/svn.unitools/trunk$ meld ../frac-10.txt ../frac-11.txt
1001
1002- CLDR root data files are checked into $CLDR_SRC/common/uca/
1003 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
1004
1005- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1006 cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
1007- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1008 cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
1009 (note removing the underscore before "Rules")
1010 cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
1011- restore TODO diffs in UCARules.txt
1012 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
1013- update (ICU4C)/source/test/testdata/CollationTest_*.txt
1014 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1015 from the CLDR root files (..._CLDR_..._SHORT.txt)
1016 cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1017 cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1018 cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
1019- if CLDR common/uca/unihan-index.txt changes, then update
1020 CLDR common/collation/root.xml <collation type="private-unihan">
1021 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
1022
1023- run genuca, see command line above;
1024 deal with
1025 Error: Unknown script for first-primary sample character U+1180B on line 28649 of /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
1026 FDD1 1180B; [71 CC 02, 05, 05] # Dogra first primary (compressible)
1027 (add the character to genuca.cpp sampleCharsToScripts[])
1028 + look up the USCRIPT_ code for the new sample characters
1029 (should be obvious from the comment in the error output)
1030 + *add* mappings to sampleCharsToScripts[], do not replace them
1031 (in case the script sample characters flip-flop)
1032 + insert new scripts in DUCET script order, see the top_byte table
1033 at the beginning of FractionalUCA.txt
1034- rebuild ICU4C
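
- the CollationTest_*.txt copies above drop "_CLDR" from each file name.
  A minimal sketch that does the rename in a loop, assuming a POSIX shell;
  copy_collation_tests is a hypothetical helper, not an ICU or CLDR tool.

```shell
# Hypothetical helper: copy CollationTest_CLDR_*_SHORT.txt files from a
# source folder to a destination folder, removing "_CLDR" from each name.
copy_collation_tests() {
    src=$1
    dest=$2
    for f in "$src"/CollationTest_CLDR_*_SHORT.txt; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$f" ] || continue
        name=$(basename "$f" | sed 's/^CollationTest_CLDR_/CollationTest_/')
        cp "$f" "$dest/$name"
    done
}

# Usage:
# copy_collation_tests $CLDR_SRC/common/uca $ICU_SRC/icu4c/source/test/testdata
```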
1035
1036* Unihan collators
1037 https://sites.google.com/site/unicodetools/unihan
1038- run Unicode Tools
1039 org.unicode.draft.GenerateUnihanCollators
1040 with VM arguments
1041 -ea
1042 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
1043 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
1044 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
1045 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
1046 -DUVERSION=11.0.0
1047- run Unicode Tools
1048 org.unicode.draft.GenerateUnihanCollatorFiles
1049 with the same arguments
1050- check CLDR diffs
1051 cd $CLDR_SRC
1052 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
1053 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
1054- copy to CLDR
1055 cd $CLDR_SRC
1056 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
1057 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
1058- run CLDR unit tests, commit to CLDR
1059- generate ICU zh collation data: run CLDR
1060 org.unicode.cldr.icu.NewLdml2IcuConverter
1061 with program arguments
1062 -t collation
1063 -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
1064 -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
1065 -d /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/coll
1066 -p /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/xml/collation
1067 zh
1068 and VM arguments
1069 -ea
1070 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
1071- rebuild ICU4C
1072
1073* run & fix ICU4C tests, now with new CLDR collation root data
1074- run all tests with the collation test data *_SHORT.txt or the full files
1075 (the full ones have comments, useful for debugging)
1076- note on intltest: if collate/UCAConformanceTest fails, then
1077 utility/MultithreadTest/TestCollators will fail as well;
1078 fix the conformance test before looking into the multi-thread test
1079
1080* update Java data files
1081- refresh just the UCD/UCA-related/derived files, just to be safe
1082- see (ICU4C)/source/data/icu4j-readme.txt
1083- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1084- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1085 output:
1086 ...
1087 Unicode .icu files built to ./out/build/icudt61l
1088 echo timestamp > uni-core-data
1089 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt61b
1090 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b
1091 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
1092 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt61l.dat ./out/icu4j/icudt61b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt61l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt61b
1093 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b"
1094 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt61b/
1095 mkdir -p /tmp/icu4j/main/shared/data
1096 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1097 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt61b/
1098 mkdir -p /tmp/icu4j/main/shared/data
1099 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1100 make[1]: Leaving directory '/usr/local/google/home/mscherer/svn.icu/uni/dbg/icu4c/data'
1101- copy the big-endian Unicode data files to another location,
1102 separate from the other data files,
1103 and then refresh ICU4J
1104 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
1105 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1106 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1107 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1108 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1109 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1110 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1111 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1112 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1113 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1114
1115* When refreshing all of ICU4J data from ICU4C
1116- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1117- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
1118or
1119- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
1120
1121* update CollationFCD.java
1122 + copy & paste the initializers of lcccIndex[] etc. from
1123 ICU4C/source/i18n/collationfcd.cpp to
1124 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1125
1126* refresh Java test .txt files
1127- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1128 cd $ICU_SRC/icu4c/source/data/unidata
1129 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1130 cd ../../test/testdata
1131 cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1132 cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1133
1134* run & fix ICU4J tests
1135
1136*** API additions
1137- send notice to icu-design about new born-@stable API (enum constants etc.)
1138
1139*** CLDR numbering systems
1140- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
1141 Unicode 11: using Unicode 11 CLDR ticket #10978
1142 rohg 10D30..10D39 Hanifi_Rohingya
1143 gong 11DA0..11DA9 Gunjala_Gondi
1144 Earlier: CLDR tickets specific to adding new numbering systems.
1145 Unicode 10: http://unicode.org/cldr/trac/ticket/10219
1146 Unicode 9: http://unicode.org/cldr/trac/ticket/9692
1147
1148*** merge the Unicode update branches back onto the trunk
1149- do not merge the icudata.jar and testdata.jar,
1150 instead rebuild them from merged & tested ICU4C
1151- make sure that changes to Unicode tools are checked in:
1152 http://www.unicode.org/utility/trac/log/trunk/unicodetools
1153
1154---------------------------------------------------------------------------- ***
1155
1156Unicode 10.0 update for ICU 60
1157
1158http://www.unicode.org/versions/Unicode10.0.0/
1159http://www.unicode.org/versions/beta-10.0.0.html
1160http://blog.unicode.org/2017/03/unicode-100-beta-review.html
1161http://www.unicode.org/review/pri350/
1162http://www.unicode.org/reports/uax-proposed-updates.html
1163http://www.unicode.org/reports/tr44/tr44-19.html
1164
1165* Command-line environment setup
1166
1167UNICODE_DATA=~/unidata/uni10/20170605
1168CLDR_SRC=~/svn.cldr/uni10
1169ICU_ROOT=~/svn.icu/uni10
1170ICU_SRC=$ICU_ROOT/src
1171ICUDT=icudt60b
1172ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
1173ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
1174export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
1175
1176*** ICU Trac
1177
1178- ticket:12985: Unicode 10
1179- ticket:13061: undo hacks from emoji 5.0 update
1180- ticket:13062: add Emoji_Component property
1181- ^/branches/markus/uni10
1182
1183*** CLDR Trac
1184
1185- cldrbug 10055: Unicode 10
1186- cldrbug 9882: Unicode 10 script metadata
1187- cldrbug 10219: numbering systems for Unicode 10
1188
1189*** Unicode version numbers
1190- makedata.mak
1191- uchar.h
1192- com.ibm.icu.util.VersionInfo
1193- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
1194
1195- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
1196 so that the makefiles see the new version number.
1197
1198*** data files & enums & parser code
1199
1200* download files
1201- mkdir -p $UNICODE_DATA
1202- download Unicode 10.0 files into $UNICODE_DATA
1203 + subfolders: ucd, uca, idna, security
1204 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
1205- download emoji 5.0 files into $UNICODE_DATA/emoji
1206
1207* for manual diffs: remove version suffixes from the file names
1208 ~$ unidata/desuffixucd.py $UNICODE_DATA
1209 (see https://sites.google.com/site/unicodetools/inputdata)
1210
1211* process and/or copy files
1212- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
1213 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1214 + For debugging, and tweaking how ppucd.txt is written,
1215 the tool has an --only_ppucd option:
1216 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
1217
1218- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
1219
1220* build ICU (make install)
1221 so that the tools build can pick up the new definitions from the installed header files.
1222
1223 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
1224
1225* preparseucd.py changes
1226- remove or add new Unicode scripts from/to the
1227 only-in-ISO-15924 list according to the error messages:
1228 ValueError: remove ['Nshu'] from _scripts_only_in_iso15924
1229 -> adjust _scripts_only_in_iso15924 as indicated
1230- fix other errors
1231 Exception: no default values (@missing lines) for some Catalog or Enumerated properties: [u'vo']
1232 -> add vo=Vertical_Orientation to _ignored_properties
1233 -> later removed again, parsing the file, even though we do not yet store data for runtime use
1234
1235* new constants for new property values
1236- preparseucd.py error:
1237 ValueError: missing uchar.h enum constants for some property values:
1238 [(u'blk', set([u'Zanabazar_Square', u'Nushu', u'CJK_Ext_F',
1239 u'Kana_Ext_A', u'Syriac_Sup', u'Masaram_Gondi', u'Soyombo'])),
1240 (u'jg', set([u'Malayalam_Bha', u'Malayalam_Llla', u'Malayalam_Nya', u'Malayalam_Lla',
1241 u'Malayalam_Nga', u'Malayalam_Ssa', u'Malayalam_Tta', u'Malayalam_Ra',
1242 u'Malayalam_Nna', u'Malayalam_Ja', u'Malayalam_Nnna'])),
1243 (u'sc', set([u'Soyo', u'Gonm', u'Zanb']))]
1244 = PropertyValueAliases.txt new property values (diff old & new .txt files)
1245 blk; CJK_Ext_F ; CJK_Unified_Ideographs_Extension_F
1246 blk; Kana_Ext_A ; Kana_Extended_A
1247 blk; Masaram_Gondi ; Masaram_Gondi
1248 blk; Nushu ; Nushu
1249 blk; Soyombo ; Soyombo
1250 blk; Syriac_Sup ; Syriac_Supplement
1251 blk; Zanabazar_Square ; Zanabazar_Square
1252 -> add to uchar.h
1253 use long property names for enum constants,
1254 for the trailing comment get the block start code point: diff old & new Blocks.txt
1255 -> add to UCharacter.UnicodeBlock IDs
1256 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
1257 replace public static final int \1_ID = \2; \3
1258 -> add to UCharacter.UnicodeBlock objects
1259 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
1260 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1261
1262 jg ; Malayalam_Bha ; Malayalam_Bha
1263 jg ; Malayalam_Ja ; Malayalam_Ja
1264 jg ; Malayalam_Lla ; Malayalam_Lla
1265 jg ; Malayalam_Llla ; Malayalam_Llla
1266 jg ; Malayalam_Nga ; Malayalam_Nga
1267 jg ; Malayalam_Nna ; Malayalam_Nna
1268 jg ; Malayalam_Nnna ; Malayalam_Nnna
1269 jg ; Malayalam_Nya ; Malayalam_Nya
1270 jg ; Malayalam_Ra ; Malayalam_Ra
1271 jg ; Malayalam_Ssa ; Malayalam_Ssa
1272 jg ; Malayalam_Tta ; Malayalam_Tta
1273 -> uchar.h & UCharacter.JoiningGroup
1274
1275 sc ; Gonm ; Masaram_Gondi
1276 sc ; Nshu ; Nushu
1277 sc ; Soyo ; Soyombo
1278 sc ; Zanb ; Zanabazar_Square
1279 -> uscript.h & com.ibm.icu.lang.UScript
1280 -> Nushu had been added already
1281 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
1282 and in com.ibm.icu.dev.test.lang.TestUScript.java
1283
1284* New properties as shown in PropertyValueAliases.txt changes
1285- boolean Emoji_Component from emoji 5
1286 -> uchar.h & UProperty.java
1287- boolean
1288 # Regional_Indicator (RI)
1289
1290 RI ; N ; No ; F ; False
1291 RI ; Y ; Yes ; T ; True
1292 -> uchar.h & UProperty.java
1293 -> single immutable range, to be hardcoded
1294- boolean
1295 # Prepended_Concatenation_Mark (PCM)
1296
1297 PCM; N ; No ; F ; False
1298 PCM; Y ; Yes ; T ; True
1299 -> was new in Unicode 9
1300 -> uchar.h & UProperty.java
1301- enumerated
1302 # Vertical_Orientation (vo)
1303
1304 vo ; R ; Rotated
1305 vo ; Tr ; Transformed_Rotated
1306 vo ; Tu ; Transformed_Upright
1307 vo ; U ; Upright
1308 -> only pre-parsed for now, but not yet stored for runtime use
1309
1310* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
1311 (not strictly necessary for NOT_ENCODED scripts)
1312 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
1313
1314* generate normalization data files
1315 cd $ICU_ROOT/dbg/icu4c
1316 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
1317 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
1318 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
1319 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
1320 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
1321
1322* build ICU (make install)
1323 so that the tools build can pick up the new definitions from the installed header files.
1324
1325 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
1326
1327* build Unicode tools using CMake+make
1328
1329$ICU_SRC/tools/unicode/c/icudefs.txt:
1330
1331# Location (--prefix) of where ICU was installed.
1332set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
1333# Location of the ICU4C source tree.
1334set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c)
1335
1336 $ICU_ROOT/dbg/tools/unicode/c$
1337 cmake ../../../../src/tools/unicode/c
1338 make
1339
1340* generate core properties data files
1341 $ICU_ROOT/dbg/tools/unicode/c$
1342 genprops/genprops $ICU_SRC/icu4c
1343 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
1344 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
1345- rebuild ICU (make install) & tools
1346
1347* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1348 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1349- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1350- Unicode 6.0..10.0: U+2260, U+226E, U+226F
1351- nothing new in this Unicode version, no test file to update
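
The grep step above can also be sketched in Python. This is only a minimal sketch: it assumes the usual IdnaMappingTable.txt layout (semicolon-separated fields, code point or range first, status second, comments after '#'). The function name and the sample lines are illustrative and not taken from the real file.

```python
def find_std3_valid_non_ascii(lines):
    """Return (first, last) ranges above U+007F whose status is disallowed_STD3_valid."""
    hits = []
    for line in lines:
        # Drop the trailing comment and skip empty or comment-only lines.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        # The first field is either a single code point or "start..end".
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        if last > 0x7F:
            hits.append((max(first, 0x80), last))
    return hits


# Illustrative lines modeled on the file format (an assumption, not real data).
sample = [
    "0000..002C    ; disallowed_STD3_valid                  # 1.1  <control-0000>..COMMA",
    "2260          ; disallowed_STD3_valid                  # 1.1  NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid                  # 1.1  NOT LESS-THAN..NOT GREATER-THAN",
    "00C0          ; mapped                 ; 00E0          # 1.1  LATIN CAPITAL LETTER A WITH GRAVE",
]

for first, last in find_std3_valid_non_ascii(sample):
    print(f"U+{first:04X}..U+{last:04X}")
```

Any range reported here that is not already covered by the tests would be a candidate for uts46test.cpp and UTS46Test.java.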
1352
1353* run & fix ICU4C tests
1354- Andy handles RBBI & spoof check test failures
1355
1356* collation: CLDR collation root, UCA DUCET
1357
1358- UCA DUCET goes into Mark's Unicode tools, see
1359 https://sites.google.com/site/unicodetools/home#TOC-UCA
1360- CLDR root data files are checked into $CLDR_SRC/common/uca/
1361 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
1362
1363- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1364 cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
1365- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1366 cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
1367 (note: the destination file name drops the underscore before "Rules")
1368 cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
1369- restore TODO diffs in UCARules.txt
1370 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
1371- update (ICU4C)/source/test/testdata/CollationTest_*.txt
1372 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1373 from the CLDR root files (..._CLDR_..._SHORT.txt)
1374 cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1375 cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1376 cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
1377- if CLDR common/uca/unihan-index.txt changes, then update
1378 CLDR common/collation/root.xml <collation type="private-unihan">
1379 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
1380
1381- run genuca, see command line above;
1382 deal with
1383 Error: Unknown script for first-primary sample character U+11D10 on line 28117 of /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/unidata/FractionalUCA.txt:
1384 FDD1 11D10; [70 D5 02, 05, 05] # Masaram_Gondi first primary (compressible)
1385 (add the character to genuca.cpp sampleCharsToScripts[])
1386 + look up the USCRIPT_ code for the new sample characters
1387 (should be obvious from the comment in the error output)
1388 + *add* mappings to sampleCharsToScripts[], do not replace them
1389 (in case the script sample characters flip-flop)
1390 + insert new scripts in DUCET script order, see the top_byte table
1391 at the beginning of FractionalUCA.txt
1392- rebuild ICU4C
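
The genuca error output quoted above already names the sample character and the script. A small sketch of pulling both out of the quoted FractionalUCA.txt line follows. The regular expression is an assumption based only on the line shown in this log, and parse_first_primary is a hypothetical helper. Mapping the script name to its USCRIPT_ constant and editing sampleCharsToScripts[] remain manual steps.

```python
import re

# Shape of a "first primary" line as quoted in the genuca error output (assumed).
FIRST_PRIMARY = re.compile(
    r"^FDD1 (?P<cp>[0-9A-F]{4,6});\s*\[[^\]]*\]\s*#\s*(?P<script>\w+) first primary"
)


def parse_first_primary(line):
    """Return (code point, script name) for a first-primary line, else None."""
    m = FIRST_PRIMARY.match(line.strip())
    if m is None:
        return None
    return int(m.group("cp"), 16), m.group("script")


line = "FDD1 11D10; [70 D5 02, 05, 05] # Masaram_Gondi first primary (compressible)"
cp, script = parse_first_primary(line)
print(f"U+{cp:04X} -> {script}")
```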
1393
1394* Unihan collators
1395 https://sites.google.com/site/unicodetools/unihan
1396- run Unicode Tools
1397 org.unicode.draft.GenerateUnihanCollators
1398 with VM arguments
1399 -ea
1400 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
1401 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
1402 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
1403 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
1404 -DUVERSION=10.0.0
1405- run Unicode Tools
1406 org.unicode.draft.GenerateUnihanCollatorFiles
1407 with the same arguments
1408- check CLDR diffs
1409 cd $CLDR_SRC
1410 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
1411 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
1412- copy to CLDR
1413 cd $CLDR_SRC
1414 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
1415 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
1416- run CLDR unit tests, commit to CLDR
1417- generate ICU zh collation data: run CLDR
1418 org.unicode.cldr.icu.NewLdml2IcuConverter
1419 with program arguments
1420 -t collation
1421 -s /usr/local/google/home/mscherer/svn.cldr/uni10/common/collation
1422 -m /usr/local/google/home/mscherer/svn.cldr/uni10/common/supplemental
1423 -d /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/coll
1424 -p /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/xml/collation
1425 zh
1426 and VM arguments
1427 -ea
1428 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
1429- rebuild ICU4C
1430
1431* run & fix ICU4C tests, now with new CLDR collation root data
1432- run all tests with the collation test data *_SHORT.txt or the full files
1433 (the full ones have comments, useful for debugging)
1434- note on intltest: if collate/UCAConformanceTest fails, then
1435 utility/MultithreadTest/TestCollators will fail as well;
1436 fix the conformance test before looking into the multi-thread test
1437
1438* update Java data files
1439- refresh just the UCD/UCA-related/derived files, just to be safe
1440- see (ICU4C)/source/data/icu4j-readme.txt
1441- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1442- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1443 output:
1444 ...
1445 Unicode .icu files built to ./out/build/icudt60l
1446 echo timestamp > uni-core-data
1447 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt60b
1448 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b
1449 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
1450 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt60l.dat ./out/icu4j/icudt60b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt60l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt60b
1451 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b"
1452 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt60b/
1453 mkdir -p /tmp/icu4j/main/shared/data
1454 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1455 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt60b/
1456 mkdir -p /tmp/icu4j/main/shared/data
1457 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1458 make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/uni10/dbg/icu4c/data'
1459- copy the big-endian Unicode data files to another location,
1460 separate from the other data files,
1461 and then refresh ICU4J
1462 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
1463 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1464 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1465 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1466 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1467 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1468 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1469 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1470 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1471 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1472
1473* When refreshing all of ICU4J data from ICU4C
1474- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1475- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
1476or
1477- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
1478
1479* update CollationFCD.java
1480 + copy & paste the initializers of lcccIndex[] etc. from
1481 ICU4C/source/i18n/collationfcd.cpp to
1482 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1483
1484* refresh Java test .txt files
1485- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1486 cd $ICU_SRC/icu4c/source/data/unidata
1487 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1488 cd ../../test/testdata
1489 cp BidiCharacterTest.txt BidiTest.txt IdnaTest.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1490 cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1491
1492* run & fix ICU4J tests
1493
1494*** API additions
1495- send notice to icu-design about new born-@stable API (enum constants etc.)
1496
1497*** CLDR numbering systems
1498- look for new sets of decimal digits (gc=Nd & nv=4) and submit a CLDR ticket
1499 Unicode 10: http://unicode.org/cldr/trac/ticket/10219
1500 Unicode 9: http://unicode.org/cldr/trac/ticket/9692
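
The gc=Nd & nv=4 query above picks one representative per set of decimal digits. A rough Python sketch of the same query is shown below. It uses the Unicode data bundled with the Python interpreter (see unicodedata.unidata_version), not the beta UCD files, so it only illustrates the idea. decimal_digit_fours is a hypothetical name.

```python
import sys
import unicodedata


def decimal_digit_fours():
    """Return code points with General_Category=Nd whose digit value is 4."""
    result = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # digit() takes a default, so non-digits yield None instead of raising.
        if unicodedata.category(ch) == "Nd" and unicodedata.digit(ch, None) == 4:
            result.append(cp)
    return result


print(f"Unicode {unicodedata.unidata_version}")
for cp in decimal_digit_fours():
    print(f"U+{cp:04X} {unicodedata.name(chr(cp), '?')}")
```

Comparing the output of two interpreter versions built on different Unicode versions would show which digit sets are new.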
1501
1502*** merge the Unicode update branches back onto the trunk
1503- do not merge the icudata.jar and testdata.jar,
1504 instead rebuild them from merged & tested ICU4C
1505- make sure that changes to Unicode tools are checked in:
1506 http://www.unicode.org/utility/trac/log/trunk/unicodetools
1507
1508---------------------------------------------------------------------------- ***
1509
1510Emoji 5.0 update for ICU 59
1511- ICU 59 mostly remains on Unicode 9.0
1512- exception: the bidi and segmentation data are updated to Unicode 10 beta
1513
1514First run of tools on combined icu4c/icu4j/tools trunk after svn repository reorg.
1515
1516* Command-line environment setup
1517
1518ICU_ROOT=~/svn.icu/trunk
1519ICU_SRC_DIR=$ICU_ROOT/src
1520ICU4C_SRC_DIR=$ICU_SRC_DIR/icu4c
1521ICUDT=icudt59b
1522export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
1523SRC_DATA_IN=$ICU4C_SRC_DIR/source/data/in
1524UNIDATA=$ICU4C_SRC_DIR/source/data/unidata
1525
1526*** ICU Trac
1527
1528- ticket:12900: take Emoji 5.0 properties data into ICU 59 once it's released
1529- changes directly on trunk
1530
1531*** data files & enums & parser code
1532
1533* download files
1534
1535- download Unicode 9.0 files into a uni90e50 folder: ucd, idna, security (skip uca)
1536- download emoji 5.0 beta files into the same uni90e50 folder
1537- download Unicode 10.0 beta files: ucd
1538 + copy Unicode 10 bidi files to the uni90e50/ucd folder:
1539 BidiBrackets.txt
1540 BidiCharacterTest.txt
1541 BidiMirroring.txt
1542 BidiTest.txt
1543 extracted/DerivedBidiClass.txt
1544 + copy Unicode 10 segmentation files to the uni90e50/ucd folder:
1545 LineBreak.txt
1546 auxiliary/*
1547
1548* preparseucd.py changes
1549- adjust for combined trunks
1550- write new copyright lines
1551- ignore new Emoji_Component property for now
1552
1553* process and/or copy files
1554- ~/svn.icu/trunk/src/tools/unicode$ py/preparseucd.py ~/unidata/uni90e50/20170322 $ICU_SRC_DIR
1555 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1556
1557- cp ~/unidata/uni90e50/20170322/security/confusables.txt $UNIDATA
1558
1559* build ICU (make install)
1560 so that the tools build can pick up the new definitions from the installed header files.
1561
1562 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
1563
1564* build Unicode tools using CMake+make
1565
1566~/svn.icu/trunk/src/tools/unicode/c/icudefs.txt:
1567
1568# Location (--prefix) of where ICU was installed.
1569set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
1570# Location of the ICU4C source tree.
1571set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/trunk/src/icu4c)
1572
1573 ~/svn.icu/trunk/dbg/tools/unicode/c$
1574 cmake ../../../../src/tools/unicode/c
1575 make
1576
1577* generate core properties data files
1578 ~/svn.icu/trunk/dbg/tools/unicode/c$
1579 genprops/genprops $ICU4C_SRC_DIR
1580- rebuild ICU (make install) & tools
1581
1582* run & fix ICU4C tests
1583- Andy handles RBBI & spoof check test failures
1584
1585* update Java data files
1586- refresh just the UCD/UCA-related/derived files, just to be safe
1587- see (ICU4C)/source/data/icu4j-readme.txt
1588- mkdir /tmp/icu4j
1589- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1590 output:
1591 ...
1592 Unicode .icu files built to ./out/build/icudt59l
1593 echo timestamp > uni-core-data
1594 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt59b
1595 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b
1596 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
1597 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt59l.dat ./out/icu4j/icudt59b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt59l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt59b
1598 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b"
1599 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt59b/
1600 mkdir -p /tmp/icu4j/main/shared/data
1601 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1602 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt59b/
1603 mkdir -p /tmp/icu4j/main/shared/data
1604 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1605 make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/trunk/dbg/icu4c/data'
1606- copy the big-endian Unicode data files to another location,
1607 separate from the other data files,
1608 and then refresh ICU4J
1609 cd ~/svn.icu/trunk/dbg/icu4c/data/out/icu4j
1610 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1611 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1612 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1613 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1614 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1615 jar uvf ~/svn.icu/trunk/src/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1616
1617* When refreshing all of ICU4J data from ICU4C
1618- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1619- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu/trunk/src/icu4j/main/shared/data
1620or
1621- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=~/svn.icu/trunk/src/icu4j icu4j-data-install
1622
1623* refresh Java test .txt files
1624- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1625 cd $ICU4C_SRC_DIR/source/data/unidata
1626 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1627 cd ../../test/testdata
1628 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1629 cp ~/unidata/uni90e50/20170322/ucd/CompositionExclusions.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1630
1631* run & fix ICU4J tests
1632
1633---------------------------------------------------------------------------- ***
1634
1635Unicode 9.0 update for ICU 58
1636
1637* Command-line environment setup
1638
1639ICU_ROOT=~/svn.icu/trunk
1640ICU_SRC_DIR=$ICU_ROOT/src
1641ICUDT=icudt58b
1642export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
1643SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
1644UNIDATA=$ICU_SRC_DIR/source/data/unidata
1645
1646http://www.unicode.org/review/pri323/ -- beta review
1647http://www.unicode.org/reports/uax-proposed-updates.html
1648http://www.unicode.org/versions/beta-9.0.0.html
1649http://www.unicode.org/versions/Unicode9.0.0/
1650http://www.unicode.org/reports/tr44/tr44-17.html
1651
1652*** ICU Trac
1653
1654- ticket:12526: integrate Unicode 9
1655- C++ ^/icu/branches/markus/uni90, ^/icu/branches/markus/uni90b
1656- Java ^/icu4j/branches/markus/uni90, ^/icu4j/branches/markus/uni90b
1657
1658*** CLDR Trac
1659
1660- cldrbug 9414: UCA 9
1661- ^/branches/markus/uni90 at r11518 from trunk at r11517
1662
1663- cldrbug 8745: Unicode 9.0 script metadata
1664
1665*** Unicode version numbers
1666- makedata.mak
1667- uchar.h
1668- com.ibm.icu.util.VersionInfo
1669- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
1670
1671- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
1672 so that the makefiles see the new version number.
1673
1674*** data files & enums & parser code
1675
1676* file preparation
1677
1678- download UCD & IDNA files
1679- make sure that the Unicode data folder passed into preparseucd.py
1680 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
1681- only for manual diffs: remove version suffixes from the file names
1682 ~/unidata/uni70/20140403$ ../../desuffixucd.py .
1683 (see https://sites.google.com/site/unicodetools/inputdata)
1684- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
1685- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni90/20160603 $ICU_SRC_DIR ~/svn.icutools/trunk/src
1686- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
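
For the "remove version suffixes" step used for manual diffs, the idea can be sketched as follows. This is not the actual desuffixucd.py. The suffix shape (a hyphen, a dotted version, an optional draft marker such as "d1", directly before the extension) is an assumption, and strip_version_suffix is a hypothetical helper.

```python
import re

# Assumed suffix shape: "-9.0.0" or "-9.0.0d1" directly before the file extension.
_SUFFIX = re.compile(r"-\d+(?:\.\d+)*(?:d\d+)?(?=\.[A-Za-z]+$)")


def strip_version_suffix(name):
    """Return the file name without a trailing version suffix, if it has one."""
    return _SUFFIX.sub("", name)


for name in ["UnicodeData-9.0.0d1.txt", "DerivedAge-9.0.0.txt", "LineBreak.txt"]:
    print(name, "->", strip_version_suffix(name))
```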
1687
1688- also: from http://unicode.org/Public/security/9.0.0/ download new confusables.txt
1689 and copy to $UNIDATA
1690 cp ~/unidata/uni90/20160603/security/confusables.txt $UNIDATA
1691
1692* preparseucd.py changes
1693- remove or add new Unicode scripts from/to the
1694 only-in-ISO-15924 list according to the error messages:
1695 ValueError: remove ['Tang'] from _scripts_only_in_iso15924
1696 ValueError: sc = Hanb (uchar.h USCRIPT_HAN_WITH_BOPOMOFO) not in the UCD
1697 ValueError: sc = Jamo (uchar.h USCRIPT_JAMO) not in the UCD
1698 ValueError: sc = Zsye (uchar.h USCRIPT_SYMBOLS_EMOJI) not in the UCD
1699 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
1700 and in com.ibm.icu.dev.test.lang.TestUScript.java
1701- DerivedNumericValues.txt new numeric values
1702 0D58 ; 0.00625 ; ; 1/160 # No MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH
1703 0D59 ; 0.025 ; ; 1/40 # No MALAYALAM FRACTION ONE FORTIETH
1704 0D5A ; 0.0375 ; ; 3/80 # No MALAYALAM FRACTION THREE EIGHTIETHS
1705 0D5B ; 0.05 ; ; 1/20 # No MALAYALAM FRACTION ONE TWENTIETH
1706 0D5D ; 0.15 ; ; 3/20 # No MALAYALAM FRACTION THREE TWENTIETHS
1707 -> change uprops.h, corepropsbuilder.cpp/encodeNumericValue(),
1708 uchar.c, UCharacterProperty.java
1709 to support a new series of values
1710- adjust preparseucd.py for Tangut algorithmic names
1711 in ppucd.txt:
1712 algnamesrange;17000..187EC;han;CJK UNIFIED IDEOGRAPH-
1713 ->
1714 algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH-
1715- avoid block-compressing most String/Miscellaneous property values,
1716 triggered by genprops not coping with a multi-code point Case_Folding on
1717 block;1C80..1C8F;...;Cased;cf=0442;CWCF;...
1718 keep block-compressing empty-string mappings NFKC_CF="" for tags and variation selectors
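
As an illustration of the Tangut algorithmic names in the list above: each name is the prefix from the algnamesrange line followed by the code point in uppercase hex. The sketch below uses the range quoted in this log. tangut_name is a hypothetical helper, not part of preparseucd.py.

```python
# Range and prefix as quoted in the algnamesrange line of ppucd.txt.
TANGUT_FIRST, TANGUT_LAST = 0x17000, 0x187EC
TANGUT_PREFIX = "TANGUT IDEOGRAPH-"


def tangut_name(cp):
    """Return the algorithmic character name for a Tangut ideograph."""
    if not TANGUT_FIRST <= cp <= TANGUT_LAST:
        raise ValueError(f"U+{cp:04X} is outside the Tangut ideograph range")
    return f"{TANGUT_PREFIX}{cp:04X}"


print(tangut_name(0x17000))
print(tangut_name(0x187EC))
```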
1719
1720* PropertyAliases.txt changes
1721- 1 new property PCM=Prepended_Concatenation_Mark
1722 Ignore: Only useful for layout engines.
1723 Ok to list in ppucd.txt.
1724
1725* PropertyValueAliases.txt new property values
1726 blk; Adlam ; Adlam
1727 blk; Bhaiksuki ; Bhaiksuki
1728 blk; Cyrillic_Ext_C ; Cyrillic_Extended_C
1729 blk; Glagolitic_Sup ; Glagolitic_Supplement
1730 blk; Ideographic_Symbols ; Ideographic_Symbols_And_Punctuation
1731 blk; Marchen ; Marchen
1732 blk; Mongolian_Sup ; Mongolian_Supplement
1733 blk; Newa ; Newa
1734 blk; Osage ; Osage
1735 blk; Tangut ; Tangut
1736 blk; Tangut_Components ; Tangut_Components
1737 -> add to uchar.h
1738 use long property names for enum constants
1739 -> add to UCharacter.UnicodeBlock IDs
1740 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
1741 replace public static final int \1_ID = \2; \3
1742 -> add to UCharacter.UnicodeBlock objects
1743 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
1744 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
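
The two Eclipse find/replace pairs above can be tried out in Python before they are applied in the IDE. This is a sketch only: the patterns are copied from the log, while the sample input line and its numeric value are illustrative assumptions about how a uchar.h enum line looks.

```python
import re

# Find/replace pairs transcribed from the Eclipse steps above.
ID_FIND = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
ID_REPLACE = r"public static final int \g<1>_ID = \g<2>; \g<3>"

OBJ_FIND = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
OBJ_REPLACE = (
    r'public static final UnicodeBlock \g<1> = new UnicodeBlock("\g<1>", \g<1>_ID); \g<2>'
)

# Illustrative enum line (the value is an assumption for the example).
c_line = "UBLOCK_ADLAM = 263, /*[1E900]*/"

java_id = ID_FIND.sub(ID_REPLACE, c_line)
java_obj = OBJ_FIND.sub(OBJ_REPLACE, c_line)
print(java_id)
print(java_obj)
```

The \g<1> form is used instead of \1 so that the group reference stays unambiguous when it is followed directly by other text.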
1745
1746 GCB; EB ; E_Base
1747 GCB; EBG ; E_Base_GAZ
1748 GCB; EM ; E_Modifier
1749 GCB; GAZ ; Glue_After_Zwj
1750 GCB; ZWJ ; ZWJ
1751 -> uchar.h & UCharacter.GraphemeClusterBreak
1752
1753 jg ; African_Feh ; African_Feh
1754 jg ; African_Noon ; African_Noon
1755 jg ; African_Qaf ; African_Qaf
1756 -> uchar.h & UCharacter.JoiningGroup
1757
1758 lb ; EB ; E_Base
1759 lb ; EM ; E_Modifier
1760 lb ; ZWJ ; ZWJ
1761 -> uchar.h & UCharacter.LineBreak
1762
1763 sc ; Adlm ; Adlam
1764 sc ; Bhks ; Bhaiksuki
1765 sc ; Marc ; Marchen
1766 sc ; Newa ; Newa
1767 sc ; Osge ; Osage
1768 sc ; Tang ; Tangut
1769 -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
1770
1771 WB ; EB ; E_Base
1772 WB ; EBG ; E_Base_GAZ
1773 WB ; EM ; E_Modifier
1774 WB ; GAZ ; Glue_After_Zwj
1775 WB ; ZWJ ; ZWJ
1776 -> uchar.h & UCharacter.WordBreak
1777
1778* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
1779 (not strictly necessary for NOT_ENCODED scripts)
1780 ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
1781
1782* generate normalization data files
1783 cd $ICU_ROOT/dbg
1784 bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
1785 bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
1786 bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
1787 bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
1788 bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
1789
1790* build ICU (make install)
1791 so that the tools build can pick up the new definitions from the installed header files.
1792
1793 $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 30 out.txt
1794
1795* build Unicode tools using CMake+make
1796
1797~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
1798
1799 # Location (--prefix) of where ICU was installed.
1800 set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
1801 # Location of the ICU source tree.
1802 set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)
1803
1804 ~/svn.icutools/trunk/dbg/unicode/c$
1805 cmake ../../../src/unicode/c
1806 make
1807
1808* generate core properties data files
1809 ~/svn.icutools/trunk/dbg/unicode/c$
1810 genprops/genprops $ICU_SRC_DIR
1811 genuca/genuca --hanOrder implicit $ICU_SRC_DIR
1812 genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
1813- rebuild ICU (make install) & tools
1814
1815* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1816 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1817- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1818- Unicode 6.0..9.0: U+2260, U+226E, U+226F
1819- nothing new in 9.0, no test file to update
1820
1821* run & fix ICU4C tests
1822- Andy handles RBBI & spoof check test failures
1823
1824* collation: CLDR collation root, UCA DUCET
1825
1826- UCA DUCET goes into Mark's Unicode tools, see
1827 https://sites.google.com/site/unicodetools/home#TOC-UCA
1828- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
1829 cp (UCA generated)/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/
1830
1831- cd (CLDR UCA branch)/common/uca/
1832- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1833 cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
1834- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1835 cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
1836 (note: the destination file name drops the underscore before "Rules")
1837 cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
1838- restore TODO diffs in UCARules.txt
1839 meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
1840- update (ICU4C)/source/test/testdata/CollationTest_*.txt
1841 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1842 from the CLDR root files (..._CLDR_..._SHORT.txt)
1843 cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1844 cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1845 cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
1846- if CLDR common/uca/unihan-index.txt changes, then update
1847 CLDR common/collation/root.xml <collation type="private-unihan">
1848 and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
1849
1850- run genuca, see command line above;
1851 deal with
1852 Error: Unknown script for first-primary sample character U+104B5 on line 32599 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt:
1853 FDD1 104B5; [75 B8 02, 05, 05] # Osage first primary (compressible)
1854 (add the character to genuca.cpp sampleCharsToScripts[])
1855 + look up the USCRIPT_ code for the new sample characters
1856 (should be obvious from the comment in the error output)
1857 + *add* mappings to sampleCharsToScripts[], do not replace them
1858 (in case the script sample characters flip-flop)
1859 + insert new scripts in DUCET script order, see the top_byte table
1860 at the beginning of FractionalUCA.txt
1861- rebuild ICU4C
1862
1863* Unihan collators
1864- run Unicode Tools
1865 org.unicode.draft.GenerateUnihanCollators
1866 with VM arguments
1867 -DSVN_WORKSPACE=/home/mscherer/svn.unitools/trunk
1868 -DOTHER_WORKSPACE=/home/mscherer/svn.unitools
1869 -DUCD_DIR=/home/mscherer/svn.unitools/trunk/data
1870 -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
1871 -DUVERSION=9.0.0
1872 -ea
1873- run Unicode Tools
1874 org.unicode.draft.GenerateUnihanCollatorFiles
1875 with the same arguments
1876- check CLDR diffs
1877 cd ~/svn.cldr/trunk
1878 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
1879 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
1880- copy to CLDR
1881 cd ~/svn.cldr/trunk
1882 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
1883 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
1884- commit to CLDR
1885- generate ICU zh collation data: run CLDR
1886 org.unicode.cldr.icu.NewLdml2IcuConverter
1887 with program arguments
1888 -t collation
1889 -s /home/mscherer/svn.cldr/trunk/common/collation
1890 -m /home/mscherer/svn.cldr/trunk/common/supplemental
1891 -d /home/mscherer/svn.icu/trunk/src/source/data/coll
1892 -p /home/mscherer/svn.icu/trunk/src/source/data/xml/collation
1893 zh
1894 and VM arguments
1895 -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
1896- rebuild ICU4C
1897
1898* run & fix ICU4C tests, now with new CLDR collation root data
1899- run all tests with the collation test data *_SHORT.txt or the full files
1900 (the full ones have comments, useful for debugging)
1901- note on intltest: if collate/UCAConformanceTest fails, then
1902 utility/MultithreadTest/TestCollators will fail as well;
1903 fix the conformance test before looking into the multi-thread test
1904
1905* update Java data files
1906- refresh just the UCD/UCA-related/derived files, just to be safe
1907- see (ICU4C)/source/data/icu4j-readme.txt
1908- mkdir /tmp/icu4j
1909- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1910 output:
1911 ...
1912 Unicode .icu files built to ./out/build/icudt58l
1913 echo timestamp > uni-core-data
1914 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt58b
1915 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b
1916 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
1917 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt58l.dat ./out/icu4j/icudt58b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt58l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt58b
1918 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b"
1919 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt58b/
1920 mkdir -p /tmp/icu4j/main/shared/data
1921 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1922 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt58b/
1923 mkdir -p /tmp/icu4j/main/shared/data
1924 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1925 make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
1926- copy the big-endian Unicode data files to another location,
1927 separate from the other data files,
1928 and then refresh ICU4J
1929 cd ~/svn.icu/trunk/dbg/data/out/icu4j
1930 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1931 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1932 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1933 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1934 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1935 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1936 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1937 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1938 jar uvf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1939
1940* When refreshing all of ICU4J data from ICU4C
1941- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1942- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
1943or
1944- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
1945
1946* update CollationFCD.java
1947 + copy & paste the initializers of lcccIndex[] etc. from
1948 ICU4C/source/i18n/collationfcd.cpp to
1949 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1950
1951* refresh Java test .txt files
1952- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1953 cd $ICU_SRC_DIR/source/data/unidata
1954 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1955 cd ../../test/testdata
1956 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1957 cp ~/unidata/uni90/20160603/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1958
1959* run & fix ICU4J tests
1960
1961*** LayoutEngine script information
1962
1963* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
1964 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
1965 in the working directory.
1966
1967 (It also generates ScriptRunData.cpp, which is no longer needed.)
1968
1969 It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
1970 (a plain text file)
1971 which maps ICU versions to the numbers of script/language constants
1972 that were added then.
1973 (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
1974
1975 The generated files have a current copyright date and "@deprecated" statement.
1976
1977* Review changes, fix Java tool if necessary, and copy to ICU4C
1978 cd ~/svn.icu4j/trunk/src
1979 meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
1980 cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
1981 cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
1982
1983*** API additions
1984- send notice to icu-design about new born-@stable API (enum constants etc.)
1985
1986*** merge the Unicode update branches back onto the trunk
1987- do not merge the icudata.jar and testdata.jar,
1988 instead rebuild them from merged & tested ICU4C
1989- make sure that changes to Unicode tools & ICU tools are checked in
1990 http://www.unicode.org/utility/trac/log/trunk/unicodetools
1991 http://bugs.icu-project.org/trac/log/tools/trunk
1992
1993---------------------------------------------------------------------------- ***
1994
1995New script codes early in ICU 58: http://bugs.icu-project.org/trac/ticket/11764
1996
1997Adding
1998- new scripts in Unicode 9: Adlm, Bhks, Marc, Newa, Osge
1999- new combination/alias codes: Hanb, Jamo
2000 - used in CLDR 29 and in spoof checker
2001- new Z* code: Zsye
2002
2003Add new codes to uscript.h & UScript.java, see Unicode update logs.
2004 -> com.ibm.icu.lang.UScript
2005 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2006 replace public static final int \1 = \2; \3
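
The find/replace pair above can be tried outside the editor. This is only a sketch in Python regex syntax; the helper name and the sample line (including its numeric value) are assumptions for illustration, not copied from uscript.h.

```python
import re

# Find/replace pair from the log, expressed with Python's re module.
FIND = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
REPLACE = r"public static final int \1 = \2; \3"


def uscript_c_to_java(line):
    """Rewrite one uscript.h-style enum line as a UScript.java-style constant.

    Lines that do not match the pattern are returned unchanged.
    """
    return FIND.sub(REPLACE, line)


if __name__ == "__main__":
    # Illustrative input in the style of uscript.h (value is an assumption).
    sample = "      USCRIPT_ADLAM                         = 167,/* Adlm */"
    print(uscript_c_to_java(sample))
```

The leading indentation survives because only the matched part of the line is replaced.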
2007
2008Manually edit ppucd.txt and icutools:unicode/c/genprops/pnames_data.h,
2009add new script codes.
2010"Long" script names only where established in Unicode 9 PropertyValueAliases.txt.
2011
2012Note: If we have to run preparseucd.py again before the Unicode 9 update,
2013then we need to manually keep/restore the new script codes.
2014
2015ICU_ROOT=~/svn.icu/trunk
2016ICU_SRC_DIR=$ICU_ROOT/src
2017ICUDT=icudt57b
2018export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
2019SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
2020UNIDATA=$ICU_SRC_DIR/source/data/unidata
2021
2022Adjust unicode/c/genprops/*builder.cpp for #ifndef/#ifdef changes in _data.h files,
2023see http://bugs.icu-project.org/trac/ticket/12141
2024
2025make install, then icutools cmake & make, then
2026~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR
2027
2028Generate Java data as usual, only update pnames.icu & uprops.icu.
2029
2030*** LayoutEngine script information
2031
2032* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
2033 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
2034 in the working directory.
2035
2036 (It also generates ScriptRunData.cpp, which is no longer needed.)
2037
2038 It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
2039 (a plain text file)
2040 which maps ICU versions to the numbers of script/language constants
2041 that were added then.
2042 (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
2043
2044 The generated files have a current copyright date and "@deprecated" statement.
2045
2046* Review changes, fix Java tool if necessary, and copy to ICU4C
2047 cd ~/svn.icu4j/trunk/src
2048 meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
2049 cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
2050 cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
2051
2052---------------------------------------------------------------------------- ***
2053
2054Emoji properties added in ICU 57: http://bugs.icu-project.org/trac/ticket/11802
2055
2056Edit preparseucd.py to add & parse new properties.
2057They share the UCD property namespace but are not listed in PropertyAliases.txt.
2058
2059Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/
2060Initial data from emoji/2.0/
2061
2062ICU_ROOT=~/svn.icu/trunk
2063ICU_SRC_DIR=$ICU_ROOT/src
2064ICUDT=icudt56b
2065export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
2066SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
2067UNIDATA=$ICU_SRC_DIR/source/data/unidata
2068
2069Add binary-property constants to uchar.h enum UProperty & UProperty.java.
2070
2071~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src
2072(Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.)
2073
2074Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java
2075
2076make install, then icutools cmake & make, then
2077~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR
2078
2079Generate Java data as usual, only update pnames.icu & uprops.icu.
2080
2081---------------------------------------------------------------------------- ***
2082
2083Unicode 8.0 update for ICU 56
2084
2085* Command-line environment setup
2086
2087ICU_ROOT=~/svn.icu/trunk
2088ICU_SRC_DIR=$ICU_ROOT/src
2089ICUDT=icudt56b
2090export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
2091SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
2092UNIDATA=$ICU_SRC_DIR/source/data/unidata
2093
2094http://www.unicode.org/review/pri297/ -- beta review
2095http://www.unicode.org/reports/uax-proposed-updates.html
2096http://unicode.org/versions/beta-8.0.0.html
2097http://www.unicode.org/versions/Unicode8.0.0/
2098http://www.unicode.org/reports/tr44/tr44-15.html
2099
2100*** ICU Trac
2101
2102- ticket:11574: Unicode 8
2103- C++ branches/markus/uni80 at r37351 from trunk at r37343
2104- Java branches/markus/uni80 at r37352 from trunk at r37338
2105
2106*** CLDR Trac
2107
2108- cldrbug 8311: UCA 8
2109- branches/markus/uni80 at r11518 from trunk at r11517
2110
2111- cldrbug 8109: Unicode 8.0 script metadata
2112- cldrbug 8418: Updated segmentation for Unicode 8.0
2113
2114*** Unicode version numbers
2115- makedata.mak
2116- uchar.h
2117- com.ibm.icu.util.VersionInfo
2118- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
2119
2120- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
2121 so that the makefiles see the new version number.
2122
2123*** data files & enums & parser code
2124
2125* file preparation
2126
2127- download UCD & IDNA files
2128- make sure that the Unicode data folder passed into preparseucd.py
2129 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
2130- only for manual diffs: remove version suffixes from the file names
2131 ~/unidata/uni70/20140403$ ../../desuffixucd.py .
2132 (see https://sites.google.com/site/unicodetools/inputdata)
2133- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
2134- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src
2135- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2136
2137- also: from http://unicode.org/Public/security/8.0.0/ download new
2138 confusables.txt & confusablesWholeScript.txt
2139 and copy to $UNIDATA
2140 ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA
2141 ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA
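
For the "remove version suffixes" step above, the renaming rule can be sketched as a pure function on file names. This is not the actual desuffixucd.py; the suffix shape (-x.y.z with an optional dN draft marker) and the sample names are assumptions.

```python
import re

# Assumed suffix shape: "-<major>.<minor>.<update>" plus an optional draft
# marker such as "d5", placed right before the file extension.
_SUFFIX = re.compile(
    r"^(?P<stem>.+?)-\d+\.\d+\.\d+(?:d\d+)?(?P<ext>\.[A-Za-z0-9]+)$")


def desuffix(name):
    """Return the file name without its version suffix, or unchanged if none."""
    m = _SUFFIX.match(name)
    if m is None:
        return name
    return m.group("stem") + m.group("ext")


if __name__ == "__main__":
    for name in ("UnicodeData-8.0.0d5.txt", "DerivedAge-8.0.0.txt", "ReadMe.txt"):
        print(name, "->", desuffix(name))
```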
2142
2143* initial preparseucd.py changes
2144- remove new Unicode scripts from the
2145 only-in-ISO-15924 list according to the error message:
2146 ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw']
2147 from _scripts_only_in_iso15924
2148 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
2149 and in com.ibm.icu.dev.test.lang.TestUScript.java
2150- property and file name change:
2151 IndicMatraCategory -> IndicPositionalCategory
2152- UnicodeData.txt unusual numeric values (fractions not in lowest terms)
2153 109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;
2154 109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;
2155 109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;;
2156 109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;;
2157 109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;;
2158 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;
2159 109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;;
2160 109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;;
2161 109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;;
2162 109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;
2163 -> change preparseucd.py to map them to fractions in lowest terms (e.g., 2/12 -> 1/6),
2164 which are the values listed in DerivedNumericValues.txt;
2165 this keeps storage in the data file simple
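
The numeric-value mapping for the Meroitic fractions can be sketched with the Python standard library. The function name is hypothetical and this is not the code in preparseucd.py; it only shows the reduction to lowest terms.

```python
from fractions import Fraction


def reduce_numeric_value(nv):
    """Reduce a UnicodeData.txt numeric value like '2/12' to lowest terms.

    Values without a slash (plain integers) are returned unchanged.
    """
    if "/" not in nv:
        return nv
    # Fraction normalizes numerator/denominator; str() gives "n/d" again.
    return str(Fraction(nv))


if __name__ == "__main__":
    for nv in ("1/12", "2/12", "6/12", "10/12"):
        print(nv, "->", reduce_numeric_value(nv))
```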
2166
2167* PropertyValueAliases.txt changes
2168- 10 new Block (blk) values:
2169 blk; Ahom ; Ahom
2170 blk; Anatolian_Hieroglyphs ; Anatolian_Hieroglyphs
2171 blk; Cherokee_Sup ; Cherokee_Supplement
2172 blk; CJK_Ext_E ; CJK_Unified_Ideographs_Extension_E
2173 blk; Early_Dynastic_Cuneiform ; Early_Dynastic_Cuneiform
2174 blk; Hatran ; Hatran
2175 blk; Multani ; Multani
2176 blk; Old_Hungarian ; Old_Hungarian
2177 blk; Sup_Symbols_And_Pictographs ; Supplemental_Symbols_And_Pictographs
2178 blk; Sutton_SignWriting ; Sutton_SignWriting
2179 -> add to uchar.h
2180 use long property names for enum constants
2181 -> add to UCharacter.UnicodeBlock IDs
2182 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
2183 replace public static final int \1_ID = \2; \3
2184 -> add to UCharacter.UnicodeBlock objects
2185 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
2186 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
2187- 6 new Script (sc) values:
2188 sc ; Ahom ; Ahom
2189 sc ; Hatr ; Hatran
2190 sc ; Hluw ; Anatolian_Hieroglyphs
2191 sc ; Hung ; Old_Hungarian
2192 sc ; Mult ; Multani
2193 sc ; Sgnw ; SignWriting
2194 -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
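
The two Eclipse find/replace pairs for the block constants above can be checked with a short script. A sketch only, in Python regex syntax; the function names and the sample uchar.h-style line (including its value) are assumptions.

```python
import re

# Pair 1: block ID constants.
FIND_ID = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
REPLACE_ID = r"public static final int \1_ID = \2; \3"

# Pair 2: UnicodeBlock objects.
FIND_OBJ = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
REPLACE_OBJ = (r'public static final UnicodeBlock \1 = '
               r'new UnicodeBlock("\1", \1_ID); \2')


def block_id_line(line):
    """uchar.h-style enum line -> UCharacter.UnicodeBlock ID constant."""
    return FIND_ID.sub(REPLACE_ID, line)


def block_object_line(line):
    """uchar.h-style enum line -> UCharacter.UnicodeBlock object."""
    return FIND_OBJ.sub(REPLACE_OBJ, line)


if __name__ == "__main__":
    # Illustrative input in the style of uchar.h (value is an assumption).
    sample = "    UBLOCK_AHOM = 253, /*[11700]*/"
    print(block_id_line(sample))
    print(block_object_line(sample))
```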
2195
2196* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
2197 (not strictly necessary for NOT_ENCODED scripts)
2198 ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
2199
2200* generate normalization data files
2201 cd $ICU_ROOT/dbg
2202 bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
2203 bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
2204 bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
2205 bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
2206 bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
2207
2208* build ICU (make install)
2209 so that the tools build can pick up the new definitions from the installed header files.
2210
2211 $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
2212
2213* build Unicode tools using CMake+make
2214
2215~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
2216
2217 # Location (--prefix) of where ICU was installed.
2218 set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
2219 # Location of the ICU source tree.
2220 set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)
2221
2222 ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
2223 ~/svn.icutools/trunk/dbg/unicode/c$ make
2224
2225* generate core properties data files
2226- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
2227- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR
2228- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
2229- rebuild ICU (make install) & tools
2230- run genuca again (see step above) so that it picks up the new nfc.nrm
2231- rebuild ICU (make install) & tools
2232
2233* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2234 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2235- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2236- Unicode 6.0..8.0: U+2260, U+226E, U+226F
2237- nothing new in 8.0, no test file to update
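
Instead of grep, the check for "disallowed_STD3_valid" on non-ASCII characters can be scripted. A minimal sketch, assuming the usual "code point(s) ; status ; ... # comment" layout of IdnaMappingTable.txt; the function name and the sample lines are illustrative.

```python
def non_ascii_std3_valid(lines):
    """Return (first, last) ranges above U+007F with status disallowed_STD3_valid."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop comments
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        if last > 0x7F:
            # Clip ranges that begin in ASCII so only the non-ASCII part is reported.
            found.append((max(first, 0x80), last))
    return found


if __name__ == "__main__":
    sample = [
        "0000..002C    ; disallowed_STD3_valid   # 1.1  <control-0000>..COMMA",
        "002D..002E    ; valid                   # 1.1  HYPHEN-MINUS..FULL STOP",
        "2260          ; disallowed_STD3_valid   # 1.1  NOT EQUAL TO",
        "226E..226F    ; disallowed_STD3_valid   # 1.1  NOT LESS-THAN..NOT GREATER-THAN",
    ]
    for first, last in non_ascii_std3_valid(sample):
        print("U+%04X..U+%04X" % (first, last))
```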
2238
2239* run & fix ICU4C tests
2240- bad Cherokee case folding due to difference in fallbacks:
2241 UCD case folding falls back to no mapping,
2242 ICU runtime case folding falls back to lowercasing;
2243 fixed casepropsbuilder.cpp to generate scf mappings to self
2244 when there is an slc mapping but no scf
2245- Andy handles RBBI & spoof check test failures
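
The Cherokee case folding fix above can be illustrated with two small dictionaries. This is a sketch of the idea, not the code in casepropsbuilder.cpp; the function names are hypothetical, and the runtime fallback is modeled only as described in the log.

```python
def runtime_fold(c, slc, scf):
    """Model of the runtime fallback: use scf if present, else lowercase."""
    return scf.get(c, slc.get(c, c))


def add_self_foldings(slc, scf):
    """Add scf mappings to self where there is an slc mapping but no scf."""
    fixed = dict(scf)
    for c in slc:
        if c not in fixed:
            fixed[c] = c
    return fixed


if __name__ == "__main__":
    # Cherokee letter A: uppercase U+13A0, lowercase U+AB70 (Unicode 8).
    slc = {0x13A0: 0xAB70}   # simple lowercase mapping
    scf = {0xAB70: 0x13A0}   # simple case folding goes to the uppercase form
    print(hex(runtime_fold(0x13A0, slc, scf)))     # falls back to lowercasing
    fixed = add_self_foldings(slc, scf)
    print(hex(runtime_fold(0x13A0, slc, fixed)))   # now folds to itself
```

With the explicit self-mapping, both letters of the pair fold to the same code point.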
2246
2247* collation: CLDR collation root, UCA DUCET
2248
2249- UCA DUCET goes into Mark's Unicode tools, see
2250 https://sites.google.com/site/unicodetools/home#TOC-UCA
2251- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
2252- cd (CLDR UCA branch)/common/uca/
2253- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2254 cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
2255- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2256 cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
2257 (note removing the underscore before "Rules")
2258 cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
2259- restore TODO diffs in UCARules.txt
2260 meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
2261- update (ICU4C)/source/test/testdata/CollationTest_*.txt
2262 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2263 from the CLDR root files (..._CLDR_..._SHORT.txt)
2264 cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
2265 cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
2266 cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
2267- if CLDR common/uca/unihan-index.txt changes, then update
2268 CLDR common/collation/root.xml <collation type="private-unihan">
2269 and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
2270- run genuca, see command line above;
2271 deal with
2272 Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt
2273 (add the character to genuca.cpp sampleCharsToScripts[])
2274 + look up the script for the new sample characters
2275 (e.g., in FractionalUCA.txt)
2276 + *add* mappings to sampleCharsToScripts[], do not replace them
2277 (in case the script sample characters flip-flop)
2278 + insert new scripts in DUCET script order, see the top_byte table
2279 at the beginning of FractionalUCA.txt
2280- rebuild ICU4C
2281
2282* run & fix ICU4C tests, now with new CLDR collation root data
2283- run all tests with the collation test data *_SHORT.txt or the full files
2284 (the full ones have comments, useful for debugging)
2285- note on intltest: if collate/UCAConformanceTest fails, then
2286 utility/MultithreadTest/TestCollators will fail as well;
2287 fix the conformance test before looking into the multi-thread test
2288- fixed bug in CollationWeights::getWeightRanges()
2289 exposed by new data and CollationTest::TestRootElements
2290
2291* update Java data files
2292- refresh just the UCD/UCA-related/derived files, just to be safe
2293- see (ICU4C)/source/data/icu4j-readme.txt
2294- mkdir /tmp/icu4j
2295- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2296 output:
2297 ...
2298 Unicode .icu files built to ./out/build/icudt56l
2299 echo timestamp > uni-core-data
2300 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b
2301 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b
2302 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
2303 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b
2304 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b"
2305 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/
2306 mkdir -p /tmp/icu4j/main/shared/data
2307 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2308 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/
2309 mkdir -p /tmp/icu4j/main/shared/data
2310 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2311 make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
2312- copy the big-endian Unicode data files to another location,
2313 separate from the other data files,
2314 and then refresh ICU4J
2315 cd ~/svn.icu/trunk/dbg/data/out/icu4j
2316 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2317 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2318 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2319 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2320 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
2321 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2322 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2323 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2324 jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
2325
2326* When refreshing all of ICU4J data from ICU4C
2327- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2328- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
2329or
2330- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
2331
2332* update CollationFCD.java
2333 + copy & paste the initializers of lcccIndex[] etc. from
2334 ICU4C/source/i18n/collationfcd.cpp to
2335 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
2336
2337* refresh Java test .txt files
2338- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2339 cd $ICU_SRC_DIR/source/data/unidata
2340 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
2341 cd ../../test/testdata
2342 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
2343 cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
2344
2345* run & fix ICU4J tests
2346
2347*** LayoutEngine script information
2348
2349* ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more,
2350 because the layout engine was deprecated in ICU 54.
2351 Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java
2352 to write lines that we used to add manually.
2353
2354* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
2355 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
2356 in the working directory.
2357
2358 (It also generates ScriptRunData.cpp, which is no longer needed.)
2359
2360 It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
2361 (a plain text file)
2362 which maps ICU versions to the numbers of script/language constants
2363 that were added then.
2364 (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
2365
2366 The generated files have a current copyright date and "@deprecated" statement.
2367
2368* Review changes, fix Java tool if necessary, and copy to ICU4C
2369 cd ~/svn.icu4j/trunk/src
2370 meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
2371 cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
2372 cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
2373
2374*** API additions
2375- send notice to icu-design about new born-@stable API (enum constants etc.)
2376
2377*** merge the Unicode update branches back onto the trunk
2378- do not merge the icudata.jar and testdata.jar,
2379 instead rebuild them from merged & tested ICU4C
2380- make sure that changes to Unicode tools & ICU tools are checked in
2381 http://www.unicode.org/utility/trac/log/trunk/unicodetools
2382 http://bugs.icu-project.org/trac/log/tools/trunk
2383
2384---------------------------------------------------------------------------- ***
2385
2386Unicode 7.0 update for ICU 54
2387
2388http://www.unicode.org/review/pri271/ -- beta review
2389http://www.unicode.org/reports/uax-proposed-updates.html
2390http://www.unicode.org/versions/beta-7.0.0.html#notable_issues
2391http://www.unicode.org/reports/tr44/tr44-13.html
2392
2393*** ICU Trac
2394
2395- ticket 10821: Unicode 7.0, UCA 7.0
2396- C++ branches/markus/uni70 at r35584 from trunk at r35580
2397- Java branches/markus/uni70 at r35587 from trunk at r35545
2398
2399*** CLDR Trac
2400
2401- ticket 7195: UCA 7.0 CLDR root collation
2402- branches/markus/uni70 at r10062 from trunk at r10061
2403
2404- ticket 6762: script metadata for Unicode 7.0 new scripts
2405
2406*** Unicode version numbers
2407- makedata.mak
2408- uchar.h
2409- com.ibm.icu.util.VersionInfo
2410- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
2411
2412- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
2413 so that the makefiles see the new version number.
2414
2415*** data files & enums & parser code
2416
2417* file preparation
2418
2419- download UCD & IDNA files
2420- make sure that the Unicode data folder passed into preparseucd.py
2421 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
2422- only for manual diffs: remove version suffixes from the file names
2423 ~/unidata/uni70/20140403$ ../../desuffixucd.py .
2424 (see https://sites.google.com/site/unicodetools/inputdata)
2425- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
2426- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src
2427- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2428- Restore TODO diffs in source/data/unidata/UCARules.txt
2429 cd $ICU_SRC_DIR
2430 meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt
2431- Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt
2432
2433- also: from http://unicode.org/Public/security/7.0.0/ download new
2434 confusables.txt & confusablesWholeScript.txt
2435 and copy to $ICU_ROOT/src/source/data/unidata/
2436
2437* initial preparseucd.py changes
2438- remove new Unicode scripts from the
2439 only-in-ISO-15924 list according to the error message:
2440 ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass',
2441 'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm',
2442 'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj']
2443 from _scripts_only_in_iso15924
2444 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
2445 and in com.ibm.icu.dev.test.lang.TestUScript.java
2446- NamesList.txt now has a heading with a non-ASCII character
2447 + keep ppucd.txt in platform charset, rather than changing tool/test parsers
2448 + escape non-ASCII characters in heading comments
2449- gets Unicode copyright line from PropertyAliases.txt which is currently still at 2013
2450 + get the copyright from the first file whose copyright line contains the current year
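
The copyright selection rule in the last item can be sketched as follows. This is not the preparseucd.py code; the function name, the input shape (a list of file name and text pairs), and the sample copyright lines are assumptions.

```python
def find_current_copyright(files, year):
    """Return the first copyright line that mentions the given year.

    files: iterable of (file_name, text) pairs, in the order to be searched.
    Returns None if no file has a copyright line with that year.
    """
    year_text = str(year)
    for _name, text in files:
        for line in text.splitlines():
            if "Copyright" in line and year_text in line:
                return line
    return None


if __name__ == "__main__":
    # Illustrative file contents, not copied from the UCD.
    files = [
        ("PropertyAliases.txt", "# Copyright (c) 1991-2013 Unicode, Inc.\n"),
        ("PropertyValueAliases.txt", "# Copyright (c) 1991-2014 Unicode, Inc.\n"),
    ]
    print(find_current_copyright(files, 2014))
```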
2451
2452* PropertyValueAliases.txt changes
2453- 32 new Block (blk) values:
2454 blk; Bassa_Vah ; Bassa_Vah
2455 blk; Caucasian_Albanian ; Caucasian_Albanian
2456 blk; Coptic_Epact_Numbers ; Coptic_Epact_Numbers
2457 blk; Diacriticals_Ext ; Combining_Diacritical_Marks_Extended
2458 blk; Duployan ; Duployan
2459 blk; Elbasan ; Elbasan
2460 blk; Geometric_Shapes_Ext ; Geometric_Shapes_Extended
2461 blk; Grantha ; Grantha
2462 blk; Khojki ; Khojki
2463 blk; Khudawadi ; Khudawadi
2464 blk; Latin_Ext_E ; Latin_Extended_E
2465 blk; Linear_A ; Linear_A
2466 blk; Mahajani ; Mahajani
2467 blk; Manichaean ; Manichaean
2468 blk; Mende_Kikakui ; Mende_Kikakui
2469 blk; Modi ; Modi
2470 blk; Mro ; Mro
2471 blk; Myanmar_Ext_B ; Myanmar_Extended_B
2472 blk; Nabataean ; Nabataean
2473 blk; Old_North_Arabian ; Old_North_Arabian
2474 blk; Old_Permic ; Old_Permic
2475 blk; Ornamental_Dingbats ; Ornamental_Dingbats
2476 blk; Pahawh_Hmong ; Pahawh_Hmong
2477 blk; Palmyrene ; Palmyrene
2478 blk; Pau_Cin_Hau ; Pau_Cin_Hau
2479 blk; Psalter_Pahlavi ; Psalter_Pahlavi
2480 blk; Shorthand_Format_Controls ; Shorthand_Format_Controls
2481 blk; Siddham ; Siddham
2482 blk; Sinhala_Archaic_Numbers ; Sinhala_Archaic_Numbers
2483 blk; Sup_Arrows_C ; Supplemental_Arrows_C
2484 blk; Tirhuta ; Tirhuta
2485 blk; Warang_Citi ; Warang_Citi
2486 -> add to uchar.h
2487 use long property names for enum constants
2488 -> add to UCharacter.UnicodeBlock IDs
2489 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
2490 replace public static final int \1_ID = \2; \3
2491 -> add to UCharacter.UnicodeBlock objects
2492 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
2493 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
2494- 28 new Joining_Group (jg) values:
2495 jg ; Manichaean_Aleph ; Manichaean_Aleph
2496 jg ; Manichaean_Ayin ; Manichaean_Ayin
2497 jg ; Manichaean_Beth ; Manichaean_Beth
2498 jg ; Manichaean_Daleth ; Manichaean_Daleth
2499 jg ; Manichaean_Dhamedh ; Manichaean_Dhamedh
2500 jg ; Manichaean_Five ; Manichaean_Five
2501 jg ; Manichaean_Gimel ; Manichaean_Gimel
2502 jg ; Manichaean_Heth ; Manichaean_Heth
2503 jg ; Manichaean_Hundred ; Manichaean_Hundred
2504 jg ; Manichaean_Kaph ; Manichaean_Kaph
2505 jg ; Manichaean_Lamedh ; Manichaean_Lamedh
2506 jg ; Manichaean_Mem ; Manichaean_Mem
2507 jg ; Manichaean_Nun ; Manichaean_Nun
2508 jg ; Manichaean_One ; Manichaean_One
2509 jg ; Manichaean_Pe ; Manichaean_Pe
2510 jg ; Manichaean_Qoph ; Manichaean_Qoph
2511 jg ; Manichaean_Resh ; Manichaean_Resh
2512 jg ; Manichaean_Sadhe ; Manichaean_Sadhe
2513 jg ; Manichaean_Samekh ; Manichaean_Samekh
2514 jg ; Manichaean_Taw ; Manichaean_Taw
2515 jg ; Manichaean_Ten ; Manichaean_Ten
2516 jg ; Manichaean_Teth ; Manichaean_Teth
2517 jg ; Manichaean_Thamedh ; Manichaean_Thamedh
2518 jg ; Manichaean_Twenty ; Manichaean_Twenty
2519 jg ; Manichaean_Waw ; Manichaean_Waw
2520 jg ; Manichaean_Yodh ; Manichaean_Yodh
2521 jg ; Manichaean_Zayin ; Manichaean_Zayin
2522 jg ; Straight_Waw ; Straight_Waw
2523 -> uchar.h & UCharacter.JoiningGroup
2524- 23 new Script (sc) values:
2525 sc ; Aghb ; Caucasian_Albanian
2526 sc ; Bass ; Bassa_Vah
2527 sc ; Dupl ; Duployan
2528 sc ; Elba ; Elbasan
2529 sc ; Gran ; Grantha
2530 sc ; Hmng ; Pahawh_Hmong
2531 sc ; Khoj ; Khojki
2532 sc ; Lina ; Linear_A
2533 sc ; Mahj ; Mahajani
2534 sc ; Mani ; Manichaean
2535 sc ; Mend ; Mende_Kikakui
2536 sc ; Modi ; Modi
2537 sc ; Mroo ; Mro
2538 sc ; Narb ; Old_North_Arabian
2539 sc ; Nbat ; Nabataean
2540 sc ; Palm ; Palmyrene
2541 sc ; Pauc ; Pau_Cin_Hau
2542 sc ; Perm ; Old_Permic
2543 sc ; Phlp ; Psalter_Pahlavi
2544 sc ; Sidd ; Siddham
2545 sc ; Sind ; Khudawadi
2546 sc ; Tirh ; Tirhuta
2547 sc ; Wara ; Warang_Citi
2548 -> uscript.h (many were added before)
2549 comment "Mende Kikakui" for USCRIPT_MENDE
2550 add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias
2551 -> com.ibm.icu.lang.UScript
2552 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2553 replace public static final int \1 = \2; \3
2554- 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
2555 (added 2012-11-01)
2556 Ahom 338 Ahom
2557 Hatr 127 Hatran
2558 Mult 323 Multani
2559 (added 2013-10-12)
2560 Modi 324 Modi
2561 Pauc 263 Pau Cin Hau
2562 Sidd 302 Siddham
2563 -> uscript.h (some overlap with additions from Unicode)
2564 -> com.ibm.icu.lang.UScript
2565 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2566 replace public static final int \1 = \2; \3
2567 -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924
2568 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
2569 and in com.ibm.icu.dev.test.lang.TestUScript.java
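
The alias lines listed in this section all share one layout (property ; short name ; long name), so they are easy to split mechanically. A sketch only; the helper names are hypothetical, and upper-casing the long name to form the UBLOCK_ constant is an assumption based on the "use long property names for enum constants" note.

```python
def parse_alias_line(line):
    """Split a PropertyValueAliases.txt line into (property, short, long).

    Returns None for blank lines, comments, and lines with too few fields.
    """
    data = line.split("#", 1)[0]
    fields = [f.strip() for f in data.split(";")]
    if len(fields) < 3 or not fields[0]:
        return None
    return fields[0], fields[1], fields[2]


def ublock_constant(long_name):
    """Assumed naming rule: UBLOCK_ + upper-cased long block name."""
    return "UBLOCK_" + long_name.upper()


if __name__ == "__main__":
    for line in (
        "blk; Diacriticals_Ext ; Combining_Diacritical_Marks_Extended",
        "jg ; Straight_Waw ; Straight_Waw",
        "sc ; Sind ; Khudawadi",
    ):
        prop, short, long_name = parse_alias_line(line)
        if prop == "blk":
            print(ublock_constant(long_name))
        else:
            print(prop, short, long_name)
```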
2570
2571* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
2572 (not strictly necessary for NOT_ENCODED scripts)
2573 ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
2574
2575* generate normalization data files
2576- cd $ICU_ROOT/dbg
2577- export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
2578- SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
2579- UNIDATA=$ICU_SRC_DIR/source/data/unidata
2580- bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
2581- bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
2582- bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
2583- bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
2584- bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
2585
2586* build ICU (make install)
2587 so that the tools build can pick up the new definitions from the installed header files.
2588
2589~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
2590
2591* build Unicode tools using CMake+make
2592
2593~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
2594
2595# Location (--prefix) of where ICU was installed.
2596set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst)
2597# Location of the ICU source tree.
2598set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src)
2599
2600~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
2601~/svn.icutools/trunk/dbg/unicode/c$ make
2602
2603* genprops work
2604- new code point range for Joining_Group values: 10AC0..10AFF Manichaean
2605 + add second array of Joining_Group values for at most 10800..10FFF
2606 icutools: unicode/c/genprops/bidipropsbuilder.cpp
2607 icu: source/common/ubidi_props.h/.c/_data.h
2608 icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java
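
The two-array lookup for Joining_Group values can be sketched like this. All names, the bounds of the first range, and the stored values are placeholders for illustration; only the Manichaean range 10AC0..10AFF comes from the log, and this is not the ubidi_props code.

```python
NO_JOINING_GROUP = 0

# First range: hypothetical bounds for the original Arabic-area array.
JG_START, JG_LIMIT = 0x620, 0x8A0
# Second range: the Manichaean block 10AC0..10AFF (limit is exclusive).
JG_START2, JG_LIMIT2 = 0x10AC0, 0x10B00

# Placeholder data; the real arrays are generated by the builder tool.
jg_array = [NO_JOINING_GROUP] * (JG_LIMIT - JG_START)
jg_array2 = [NO_JOINING_GROUP] * (JG_LIMIT2 - JG_START2)
jg_array2[0] = 1  # placeholder value for U+10AC0


def joining_group(c):
    """Look up the Joining_Group value of code point c in the two arrays."""
    if JG_START <= c < JG_LIMIT:
        return jg_array[c - JG_START]
    if JG_START2 <= c < JG_LIMIT2:
        return jg_array2[c - JG_START2]
    return NO_JOINING_GROUP


if __name__ == "__main__":
    print(joining_group(0x10AC0), joining_group(0x41))
```

Keeping a second compact array avoids one large array that would span the gap between the two ranges.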
2609
2610* generate core properties data files
2611- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
2612- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR
2613- rebuild ICU (make install) & tools
2614- run genuca again (see step above) so that it picks up the new nfc.nrm
2615- rebuild ICU (make install) & tools
2616
2617* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2618 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2619- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2620- Unicode 6.0..7.0: U+2260, U+226E, U+226F
2621- nothing new in 7.0, no test file to update
2622
2623* run & fix ICU4C tests
2624
2625* update Java data files
2626- refresh just the UCD-related files, just to be safe
2627- see (ICU4C)/source/data/icu4j-readme.txt
2628- mkdir /tmp/icu4j
2629- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2630 output:
2631 ...
2632 Unicode .icu files built to ./out/build/icudt53l
2633 echo timestamp > uni-core-data
2634 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b
2635 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b
2636 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
2637 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b
2638 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b"
2639 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/
2640 mkdir -p /tmp/icu4j/main/shared/data
2641 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2642 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/
2643 mkdir -p /tmp/icu4j/main/shared/data
2644 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2645 make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data'
2646- copy the big-endian Unicode data files to another location,
2647 separate from the other data files
2648 ICUDT=icudt54b
2649 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2650 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2651 cd ~/svn.icu/uni70/dbg/data/out/icu4j
2652 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2653 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2654 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
2655 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2656 cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2657 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2658- refresh ICU4J
2659 ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
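
A rough Python sketch of which files the cp/rm steps above leave in /tmp/icu4j, useful for checking the selection before running jar uf. The function name is hypothetical; paths are taken relative to com/ibm/icu/impl/data/$ICUDT.

```python
import fnmatch


def select_unicode_files(relative_paths):
    """Return the paths that the copy steps above would keep.

    Mirrors: cp confusables.cfu, cp *.icu, rm cnvalias.icu, cp *.nrm,
    cp coll/*.icu, cp brkitr/*.
    """
    top_patterns = ("*.icu", "*.nrm", "confusables.cfu")
    selected = []
    for path in relative_paths:
        folder, _, name = path.rpartition("/")
        if folder == "":
            keep = (any(fnmatch.fnmatchcase(name, p) for p in top_patterns)
                    and name != "cnvalias.icu")
        elif folder == "coll":
            keep = fnmatch.fnmatchcase(name, "*.icu")
        elif folder == "brkitr":
            keep = True
        else:
            keep = False
        if keep:
            selected.append(path)
    return selected


# Made-up directory listing for illustration.
listing = ["uprops.icu", "cnvalias.icu", "nfc.nrm", "confusables.cfu", "root.res",
           "coll/ucadata.icu", "coll/root.res", "brkitr/word.brk", "zone/en.res"]
print(select_unicode_files(listing))
```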
2660
2661* update CollationFCD.java
2662 + copy & paste the initializers of lcccIndex[] etc. from
2663 ICU4C/source/i18n/collationfcd.cpp to
2664 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
2665
2666* refresh Java test .txt files
2667- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2668 cd $ICU_SRC_DIR/source/data/unidata
2669 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
2670 cd ../../test/testdata
2671 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
2672 cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
2673
2674* UCA
2675
2676- download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/
2677- run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata)
2678- update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/
2679- run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA
2680- output files are in ~/svn.unitools/Generated/uca/7.0.0/
2681- review data; compare files, use blankweights.sed or similar
2682 ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt
2683- cd ~/svn.unitools/Generated/uca/7.0.0/
2684- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2685 cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
2686- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2687 (note: remove the underscore before "Rules")
2688 cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
2689- update (ICU4C)/source/test/testdata/CollationTest_*.txt
2690 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2691 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
2692 cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
2693 cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
2694 cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
2695- run genuca, see command line above
2696- rebuild ICU4C
2697- refresh ICU4J collation data:
2698 (subset of instructions above for properties data refresh, except copies all coll/*)
2699 ICUDT=icudt54b
2700 ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2701 ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2702 ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2703 ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
2704- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
2705- note on intltest: if collate/UCAConformanceTest fails, then
2706 utility/MultithreadTest/TestCollators will fail as well;
2707 fix the conformance test before looking into the multi-thread test
2708- copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors
2709- copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch
2710 ~/svn.unitools$ cp Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/
2711
2712* When refreshing all of ICU4J data from ICU4C
2713- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2714- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
2715or
2716- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
2717
2718* run & fix ICU4J tests
2719
2720*** LayoutEngine script information
2721
2722(For details see the Unicode 5.2 change log below.)
2723
2724* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
2725 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
2726 in the working directory.
2727 (It also generates ScriptRunData.cpp, which is no longer needed.)
2728
2729 The generated files have a current copyright date and "@stable" statement.
2730 ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java
2731 for "born stable" Unicode API constants, and to stop parsing ICU version numbers
2732 which may not contain dots any more.
2733
2734- diff current <icu>/source/layout files vs. generated ones
2735 ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
2736 review and manually merge desired changes;
2737 fix gratuitous changes, incorrect @draft/@stable and missing aliases;
2738 Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
2739- if you just copy the above files, then
2740 fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
2741 manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
2742
2743*** API additions
2744- send notice to icu-design about new born-@stable API (enum constants etc.)
2745
2746*** merge the Unicode update branches back onto the trunk
2747- do not merge the icudata.jar and testdata.jar,
2748 instead rebuild them from merged & tested ICU4C
2749
2750---------------------------------------------------------------------------- ***
2751
2752Unicode 6.3 update
2753
2754http://www.unicode.org/review/pri249/ -- beta review
2755http://www.unicode.org/reports/uax-proposed-updates.html
2756http://www.unicode.org/versions/beta-6.3.0.html#notable_issues
2757http://www.unicode.org/reports/tr44/tr44-11.html
2758
2759*** ICU Trac
2760
2761- ticket 10128: update ICU to Unicode 6.3 beta
2762- ticket 10168: update ICU to Unicode 6.3 final
2763- C++ branches/markus/uni63 at r33552 from trunk at r33551
2764- Java branches/markus/uni63 at r33550 from trunk at r33553
2765
2766- ticket 10142: implement Unicode 6.3 bidi algorithm additions
2767
2768*** Unicode version numbers
2769- makedata.mak
2770- uchar.h
2771 (configure.in & configure: have been modified to extract the version from uchar.h)
2772- com.ibm.icu.util.VersionInfo
2773- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
2774
2775- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
2776 so that the makefiles see the new version number.
2777
2778*** data files & enums & parser code
2779
2780* file preparation
2781
2782- download UCD, UCA & IDNA files
2783- make sure that the Unicode data folder passed into preparseucd.py
2784 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
2785- modify preparseucd.py:
2786 parse new file BidiBrackets.txt
2787 with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type
2788- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src
2789- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2790- Check test file diffs for previously commented-out, known-failing data lines;
2791 probably need to keep those commented out.
2792
2793* PropertyAliases.txt changes
2794- 1 new Enumerated Property
2795 bpt ; Bidi_Paired_Bracket_Type
2796 -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType
2797 -> ubidi_props.h & .c & UBiDiProps.java
2798 -> remember to write the max value at UBIDI_MAX_VALUES_INDEX
2799 -> uprops.cpp
2800 -> change ubidi.icu format version from 2.0 to 2.1
2801- 1 new Miscellaneous Property
2802 bpb ; Bidi_Paired_Bracket
2803 -> uchar.h & UProperty.java
2804 -> ppucd.h & .cpp
2805
2806* PropertyValueAliases.txt changes
2807- 3 Bidi_Paired_Bracket_Type (bpt) values:
2808 bpt; c ; Close
2809 bpt; n ; None
2810 bpt; o ; Open
2811 -> uchar.h & UCharacter.BidiPairedBracketType
2812 -> ubidi_props.h & .c & UBiDiProps.java
2813 -> change ubidi.icu format version from 2.0 to 2.1
2814- 4 new Bidi_Class (bc) values:
2815 bc ; FSI ; First_Strong_Isolate
2816 bc ; LRI ; Left_To_Right_Isolate
2817 bc ; RLI ; Right_To_Left_Isolate
2818 bc ; PDI ; Pop_Directional_Isolate
2819 -> uchar.h & UCharacterEnums.ECharacterDirection
2820 -> until the bidi code gets updated,
2821 Roozbeh suggests mapping the new bc values to ON (Other_Neutral)
2822- 3 new Word_Break (WB) values:
2823 WB ; HL ; Hebrew_Letter
2824 WB ; SQ ; Single_Quote
2825 WB ; DQ ; Double_Quote
2826 -> uchar.h & UCharacter.WordBreak
2827 -> first time Word_Break numeric constants exceed 4 bits (now 17 values)
2828- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
2829 (added 2012-10-16)
2830 Aghb 239 Caucasian Albanian
2831 Mahj 314 Mahajani
2832 -> uscript.h
2833 -> com.ibm.icu.lang.UScript
2834 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2835 replace public static final int \1 = \2;\3
2836 -> preparseucd.py _scripts_only_in_iso15924
2837 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
2838 and in com.ibm.icu.dev.test.lang.TestUScript.java
2839 -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
2840 (not strictly necessary for NOT_ENCODED scripts)
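
The find/replace pair for com.ibm.icu.lang.UScript above can also be run with Python's re module, as a sketch. The pattern and replacement are the ones given in the log; the sample input lines are only shaped like uscript.h enum entries, and their numeric values are illustrative.

```python
import re

# Pattern and replacement as given in the log above.
FIND = r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)"
REPLACE = r"public static final int \1 = \2;\3"


def c_enum_to_java(line):
    """Rewrite one C enum entry into a Java int constant; other lines pass through."""
    return re.sub(FIND, REPLACE, line)


# Illustrative input; values are placeholders, not checked against uscript.h.
sample = [
    "      USCRIPT_CAUCASIAN_ALBANIAN      = 159,/* Aghb */",
    "      USCRIPT_MAHAJANI = 160,/* Mahj */",
]
for line in sample:
    print(c_enum_to_java(line))
```

Because the pattern ends in (.+), an entry with nothing after the comma is left unchanged and needs a manual edit.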
2841
2842* generate normalization data files
2843- ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib
2844- ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in
2845- ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata
2846- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
2847- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
2848- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
2849- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
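
The four gennorm2 calls above differ only in the output file and the list of input files, so they can be generated from one table. A minimal Python sketch; the function name and the example paths are hypothetical, and only the -o and -s options shown in this log are used.

```python
# Output .nrm file -> input .txt files, exactly as in the four commands above.
NORM_FILES = {
    "nfc.nrm": ["nfc.txt"],
    "nfkc.nrm": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf.nrm": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46.nrm": ["nfc.txt", "uts46.txt"],
}


def gennorm2_commands(src_data_in, unidata, tool="bin/gennorm2"):
    """Build one argument list per normalization data file."""
    commands = []
    for output, inputs in NORM_FILES.items():
        commands.append([tool, "-o", f"{src_data_in}/{output}",
                         "-s", f"{unidata}/norm2", *inputs])
    return commands


# Placeholder paths; substitute the real SRC_DATA_IN and UNIDATA values.
cmds = gennorm2_commands("/path/to/src/source/data/in", "/path/to/src/source/data/unidata")
for cmd in cmds:
    print(" ".join(cmd))
```

Each list could be passed to subprocess.run() from the build directory, with LD_LIBRARY_PATH set as in the first step.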
2850
2851* build ICU (make install)
2852 so that the tools build can pick up the new definitions from the installed header files.
2853
2854~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
2855
2856* build Unicode tools using CMake+make
2857
2858~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
2859
2860# Location (--prefix) of where ICU was installed.
2861set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst)
2862# Location of the ICU source tree.
2863set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src)
2864
2865~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
2866~/svn.icutools/trunk/dbg/unicode/c$ make
2867
2868* generate core properties data files
2869- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src
2870- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src
2871- rebuild ICU (make install) & tools
2872- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
2873- rebuild ICU (make install) & tools
2874
2875* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2876 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2877- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2878- Unicode 6.0..6.3: U+2260, U+226E, U+226F
2879- nothing new in 6.3, no test file to update
2880
2881* update Java data files
2882- refresh just the UCD-related files, just to be safe
2883- see (ICU4C)/source/data/icu4j-readme.txt
2884- mkdir /tmp/icu4j
2885- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2886 output:
2887 ...
2888 Unicode .icu files built to ./out/build/icudt52l
2889 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b
2890 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b
2891 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
2892 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b
2893 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b"
2894 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/
2895 mkdir -p /tmp/icu4j/main/shared/data
2896 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2897 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/
2898 mkdir -p /tmp/icu4j/main/shared/data
2899 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2900 make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data'
2901- copy the big-endian Unicode data files to another location,
2902 separate from the other data files
2903 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
2904 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
2905 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
2906 ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu
2907 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
2908 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
2909 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
2910- refresh ICU4J
2911 ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
2912
2913* refresh Java test .txt files
2914- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2915
2916* UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files
2917
2918- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
2919- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
2920- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2921- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2922 (note: remove the underscore before "Rules")
2923- update (ICU4C)/source/test/testdata/CollationTest_*.txt
2924 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2925 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
2926- check test file diffs for previously commented-out, known-failing data lines;
2927 probably need to keep those commented out
2928- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
2929- run genuca, see command line above
2930- rebuild ICU4C
2931- refresh ICU4J collation data:
2932 (subset of instructions above for properties data refresh, except copies all coll/*)
2933 ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2934 ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
2935 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
2936 ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
2937- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
2938- note on intltest: if collate/UCAConformanceTest fails, then
2939 utility/MultithreadTest/TestCollators will fail as well;
2940 fix the conformance test before looking into the multi-thread test
2941
2942* test ICU, fix test code where necessary
2943
2944* When refreshing all of ICU4J data from ICU4C
2945- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2946- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
2947or
2948- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
2949
2950*** LayoutEngine script information
2951- skipped for Unicode 6.3: no new scripts
2952
2953*** merge the Unicode update branches back onto the trunk
2954- do not merge the icudata.jar and testdata.jar,
2955 instead rebuild them from merged & tested ICU4C
2956
2957---------------------------------------------------------------------------- ***
2958
2959Unicode 6.2 update
2960
2961http://www.unicode.org/review/pri230/
2962http://www.unicode.org/versions/beta-6.2.0.html
2963http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0
2964http://www.unicode.org/review/pri227/ Changes to Script Extensions Property Values
2965http://www.unicode.org/review/pri228/ Changing some common characters from Punctuation to Symbol
2966http://www.unicode.org/review/pri229/ Linebreaking Changes for Pictographic Symbols
2967http://www.unicode.org/reports/tr46/tr46-8.html IDNA
2968http://unicode.org/Public/idna/6.2.0/
2969
2970*** ICU Trac
2971
2972- ticket 9515: Unicode 6.2: final ICU update
2973
2974- ticket 9514: UCA 6.2: fix UCARules.txt
2975
2976- ticket 9437: update ICU to Unicode 6.2
2977- C++ branches/markus/uni62 at r32050 from trunk at r32041
2978- Java branches/markus/uni62 at r32068 from trunk at r32066
2979
2980*** Unicode version numbers
2981- makedata.mak
2982- uchar.h
2983 (configure.in & configure: have been modified to extract the version from uchar.h)
2984- com.ibm.icu.util.VersionInfo
2985- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
2986
2987*** data files & enums & parser code
2988
2989* file preparation
2990
2991- download UCD, UCA & IDNA files
2992- make sure that the Unicode data folder passed into preparseucd.py
2993 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
2994- modify preparseucd.py: NamesList.txt is now in UTF-8
2995- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src
2996- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2997- Check test file diffs for previously commented-out, known-failing data lines;
2998 probably need to keep those commented out.
2999
3000* PropertyValueAliases.txt changes
3001- 1 new Line_Break (lb) value:
3002 lb ; RI ; Regional_Indicator
3003 -> uchar.h & UCharacter.LineBreak
3004- 1 new Word_Break (WB) value:
3005 WB ; RI ; Regional_Indicator
3006 -> uchar.h & UCharacter.WordBreak
3007- 1 new Grapheme_Cluster_Break (GCB) value:
3008 GCB; RI ; Regional_Indicator
3009 -> uchar.h & UCharacter.GraphemeClusterBreak
3010
3011* 3 new numeric values
3012 The new value -1 (really meant to be NaN, but that would have required
3013 new UnicodeData.txt syntax) can already be represented as a "fraction" of -1/1,
3014 but encodeNumericValue() in corepropsbuilder.cpp had to be fixed.
3015 cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1
3016 cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1
3017 The two new values 216000 and 432000 require an addition to the encoding of numeric values.
3018 cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000
3019 cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000
3020 -> uprops.h, uchar.c & UCharacterProperty.java
3021 -> cucdtst.c & UCharacterTest.java
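
A small Python sketch of the arithmetic behind these values. It assumes a "mantissa times power of 60" form for the two large values, which suits these sexagesimal cuneiform numbers; the log does not spell out the encoding that was added, and the function name is hypothetical.

```python
from fractions import Fraction


def split_base60(value):
    """Return (mantissa, exponent) with value == mantissa * 60**exponent.

    The mantissa is the part left after dividing out every factor of 60.
    """
    if value <= 0:
        raise ValueError("expects a positive integer")
    exponent = 0
    while value % 60 == 0:
        value //= 60
        exponent += 1
    return value, exponent


# -1 fits the existing "fraction" representation as -1/1.
minus_one = Fraction(-1, 1)
print(minus_one.numerator, minus_one.denominator)

# The two new large values are small multiples of 60**3.
for nv in (216000, 432000):
    print(nv, split_base60(nv))
```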
3022
3023* generate normalization data files
3024- ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib
3025- ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in
3026- ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata
3027- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
3028- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
3029- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
3030- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
3031
3032* build ICU (make install)
3033 so that the tools build can pick up the new definitions from the installed header files.
3034* build Unicode tools using CMake+make
3035
3036* generate core properties data files
3037- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src
3038- in initial bootstrapping, change the UCA version
3039 in source/data/unidata/FractionalUCA.txt to match the new Unicode version
3040- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src
3041- rebuild ICU (make install) & tools
3042 + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
3043 check if the UCA version in FractionalUCA.txt matches the new Unicode version
3044 (see step above)
3045- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
3046- rebuild ICU (make install) & tools
3047
3048* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
3049 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
3050- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
3051- Unicode 6.0..6.2: U+2260, U+226E, U+226F
3052- nothing new in 6.2, no test file to update
3053
3054* update Java data files
3055- refresh just the UCD-related files, just to be safe
3056- see (ICU4C)/source/data/icu4j-readme.txt
3057- mkdir /tmp/icu4j
3058- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3059 output:
3060 ...
3061 Unicode .icu files built to ./out/build/icudt50l
3062 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b
3063 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b
3064 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
3065 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b
3066 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b"
3067 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/
3068 mkdir -p /tmp/icu4j/main/shared/data
3069 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
3070 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/
3071 mkdir -p /tmp/icu4j/main/shared/data
3072 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
3073 make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data'
3074- copy the big-endian Unicode data files to another location,
3075 separate from the other data files
3076 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
3077 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
3078 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
3079 ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu
3080 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
3081 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
3082 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
3083- refresh ICU4J
3084 ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
3085
3086* refresh Java test .txt files
3087- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
3088
3089* UCA
3090
3091- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
3092- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
3093- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
3094- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
3095 (note: remove the underscore before "Rules")
3096- update (ICU4C)/source/test/testdata/CollationTest_*.txt
3097 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
3098 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
3099- check test file diffs for previously commented-out, known-failing data lines;
3100 probably need to keep those commented out
3101- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
3102- run genuca, see command line above
3103- rebuild ICU4C
3104- refresh ICU4J collation data:
3105 (subset of instructions above for properties data refresh, except copies all coll/*)
3106 ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3107 ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
3108 ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
3109 ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
3110- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
3111- note on intltest: if collate/UCAConformanceTest fails, then
3112 utility/MultithreadTest/TestCollators will fail as well;
3113 fix the conformance test before looking into the multi-thread test
3114
3115* test ICU, fix test code where necessary
3116
3117* When refreshing all of ICU4J data from ICU4C
3118- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3119- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
3120or
3121- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
3122
3123*** LayoutEngine script information
3124- skipped for Unicode 6.2: no new scripts
3125
3126*** merge the Unicode update branches back onto the trunk
3127- do not merge the icudata.jar and testdata.jar,
3128 instead rebuild them from merged & tested ICU4C
3129
3130---------------------------------------------------------------------------- ***
3131
3132Future Unicode update
3133
3134The tools have been simplified since the Unicode 6.1 update. See
3135- http://site.icu-project.org/design/props/ppucd
3136- http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972
3137
3138* Unicode version numbers
3139- icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates
3140
3141* file preparation
3142- ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:
3143- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src
3144- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
3145- Check test file diffs for previously commented-out, known-failing data lines;
3146 probably need to keep those commented out.
3147
3148* PropertyValueAliases.txt changes
3149- Script codes that are in ISO 15924 but not in Unicode are now listed in
3150 preparseucd.py, in the _scripts_only_in_iso15924 variable.
3151 If there are new ISO codes, then add them.
3152 If Unicode adds some of them, then remove them from the .py variable.
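
The maintenance rule for the ISO-only script list amounts to a set union followed by a set difference. A minimal Python sketch; the function name is hypothetical and the actual type of _scripts_only_in_iso15924 in preparseucd.py is not shown in this log. The codes in the example appear elsewhere in this file.

```python
def update_iso_only_scripts(iso_only, new_iso_codes, unicode_scripts):
    """Add new ISO 15924 codes, then drop any that Unicode now encodes."""
    merged = set(iso_only) | set(new_iso_codes)
    return sorted(merged - set(unicode_scripts))


# Example: Cakm and Merc became Unicode scripts; Aghb and Mahj are new ISO codes.
result = update_iso_only_scripts(
    iso_only=["Cakm", "Merc", "Latf"],
    new_iso_codes=["Aghb", "Mahj"],
    unicode_scripts={"Cakm", "Merc", "Latn"},
)
print(result)
```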
3153
3154* UnicodeData.txt changes
3155- No more manual changes for CJK ranges for algorithmic names;
3156 those are now written to ppucd.txt and genprops reads them from there.
3157
3158* generate core properties data files (makeprops.sh was deleted)
3159- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src
3160
3161* no more manual updates of source/data/unidata/norm2/nfkc_cf.txt
3162- it is now generated by preparseucd.py
3163
3164* no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt
3165- it is now generated by preparseucd.py
3166- make sure that the Unicode data folder passed into preparseucd.py
3167 includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
3168 (can be in some subfolder)
3169
3170* generate normalization data files
3171- ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib
3172- ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in
3173- ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata
3174- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
3175- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
3176- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
3177- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
3178
3179* build ICU (make install)
3180* build Unicode tools using CMake+make
3181
3182* new way to call genuca (makeuca.sh was deleted)
3183- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src
3184
3185---------------------------------------------------------------------------- ***
3186
3187Unicode 6.1 update
3188
3189*** ICU Trac
3190
3191- ticket 8995 final update to Unicode 6.1
3192- ticket 8994 regenerate source/layout/CanonData.cpp
3193
3194- ticket 8961 support Unicode "Age" value *names*
3195- ticket 8963 support multiple character name aliases & types
3196
3197- ticket 8827 "update ICU to Unicode 6.1"
3198- C++ branches/markus/uni61 at r30864 from trunk at r30843
3199- Java branches/markus/uni61 at r30865 from trunk at r30863
3200
3201*** Unicode version numbers
3202- makedata.mak
3203- uchar.h
3204 (configure.in & configure: have been modified to extract the version from uchar.h)
3205- com.ibm.icu.util.VersionInfo
3206- icutools/unicode/makedefs.sh
3207 + also review & update other definitions in that file,
3208 e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l
3209
3210*** data files & enums & parser code
3211
3212* file preparation
3213
3214~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed
3215- This prepares both unidata and testdata files in respective output subfolders.
3216- Check test file diffs for previously commented-out, known-failing data lines;
3217 probably need to keep those commented out.
3218
3219* PropertyValueAliases.txt changes
3220- 11 new block names:
3221 Arabic_Extended_A
3222 Arabic_Mathematical_Alphabetic_Symbols
3223 Chakma
3224 Meetei_Mayek_Extensions
3225 Meroitic_Cursive
3226 Meroitic_Hieroglyphs
3227 Miao
3228 Sharada
3229 Sora_Sompeng
3230 Sundanese_Supplement
3231 Takri
3232 -> add to uchar.h
3233 -> add to UCharacter.UnicodeBlock IDs
3234 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
3235 replace public static final int \1_ID = \2; \3
3236 -> add to UCharacter.UnicodeBlock objects
3237 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
3238 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
3239- 1 new Joining_Group (jg) value:
3240 Rohingya_Yeh
3241 -> uchar.h & UCharacter.JoiningGroup
3242- 2 new Line_Break (lb) values:
3243 CJ=Conditional_Japanese_Starter
3244 HL=Hebrew_Letter
3245 -> uchar.h & UCharacter.LineBreak
3246- 7 new scripts:
3247 sc ; Cakm ; Chakma
3248 sc ; Merc ; Meroitic_Cursive
3249 sc ; Mero ; Meroitic_Hieroglyphs
3250 sc ; Plrd ; Miao
3251 sc ; Shrd ; Sharada
3252 sc ; Sora ; Sora_Sompeng
3253 sc ; Takr ; Takri
3254 -> remove these from SyntheticPropertyValueAliases.txt
3255 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
3256 and in com.ibm.icu.dev.test.lang.TestUScript.java
3257- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
3258 (added 2011-06-21)
3259 Khoj 322 Khojki
3260 Tirh 326 Tirhuta
3261 and another one added 2011-12-09
3262 Hluw 080 Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)
3263 -> uscript.h
3264 -> com.ibm.icu.lang.UScript
3265 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
3266 replace public static final int \1 = \2;\3
3267 -> SyntheticPropertyValueAliases.txt
3268 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
3269 and in com.ibm.icu.dev.test.lang.TestUScript.java
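
The Eclipse find/replace pairs for the block constants can also be applied outside the IDE. A minimal Python sketch of the same substitutions, assuming the uchar.h lines use single spaces as in the patterns above; the function names are made up, and the sample line is only an illustration:

```python
import re

# One pattern serves both replacements; it captures name, number and comment.
UBLOCK_LINE = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")

def to_java_id(c_line):
    """uchar.h enum line -> UCharacter.UnicodeBlock ID constant."""
    return UBLOCK_LINE.sub(r"public static final int \1_ID = \2; \3", c_line)

def to_java_object(c_line):
    """uchar.h enum line -> UCharacter.UnicodeBlock object constant."""
    return UBLOCK_LINE.sub(
        r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \3',
        c_line)

sample = "UBLOCK_TAKRI = 220, /*[11680]*/"  # illustrative input line
print(to_java_id(sample))
print(to_java_object(sample))
```

The second Eclipse pattern does not capture the number, so its comment is group 2 there; in this sketch the comment is always group 3.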
3270
3271* UnicodeData.txt changes
3272- the last Unihan code point changes from U+9FCB to U+9FCC
3273 search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive)
3274 + do change gennames.c
3275 + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java
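
The search for hardcoded Unihan end/limit values can be scripted. A minimal Python sketch of the case-insensitive regex named above; the function name and the sample source lines are made up for illustration:

```python
import re

# 9FCB is the old range end, 9FCC the old limit (end + 1).
OLD_END_OR_LIMIT = re.compile(r"9FC[BC]", re.IGNORECASE)

def find_hardcoded_limits(lines):
    """Return (line number, text) pairs that mention 9FCB or 9FCC."""
    return [(n, line.rstrip()) for n, line in enumerate(lines, 1)
            if OLD_END_OR_LIMIT.search(line)]

sample = [
    "if (c >= 0x4E00 && c <= 0x9fcb) {",  # made-up source lines
    "#define CJK_LIMIT 0x9FCC",
    "unrelated = 0x9FC3;",
]
for n, text in find_hardcoded_limits(sample):
    print(n, text)
```

Each hit still needs a manual look, since an end value moves to 9FCC and a limit value moves to 9FCD.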
3276
3277* DerivedBidiClass.txt changes
3278- 2 new default-AL blocks:
3279# Arabic Extended-A: U+08A0 - U+08FF (was default-R)
3280# Arabic Mathematical Alphabetic Symbols:
3281# U+1EE00 - U+1EEFF (was default-R)
3282- 2 new default-R blocks:
3283# Meroitic Hieroglyphs:
3284# U+10980 - U+1099F
3285# Meroitic Cursive: U+109A0 - U+109FF
3286 -> should be picked up by the explicit data in the file
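
For a quick sanity check of the new default ranges, the four blocks listed above can be put into a small lookup table. A minimal Python sketch that covers only these four ranges and not the full set of defaults in DerivedBidiClass.txt; the names are assumptions for illustration:

```python
# Inclusive ranges and default Bidi_Class values named in the notes above.
DEFAULT_BIDI_RANGES = [
    (0x08A0, 0x08FF, "AL"),    # Arabic Extended-A
    (0x1EE00, 0x1EEFF, "AL"),  # Arabic Mathematical Alphabetic Symbols
    (0x10980, 0x1099F, "R"),   # Meroitic Hieroglyphs
    (0x109A0, 0x109FF, "R"),   # Meroitic Cursive
]

def default_bidi_class(cp):
    """Return the default class for cp, or None outside the listed ranges."""
    for start, end, bidi_class in DEFAULT_BIDI_RANGES:
        if start <= cp <= end:
            return bidi_class
    return None

print(default_bidi_class(0x08FF), default_bidi_class(0x109A0))
```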
3287
3288* NameAliases.txt changes
3289- from
3290 # Each line has two fields
3291 # First field: Code point
3292 # Second field: Alias
3293- to
3294 # Each line has three fields, as described here:
3295 #
3296 # First field: Code point
3297 # Second field: Alias
3298 # Third field: Type
3299- Also, the file format previously allowed multiple aliases per code point,
3300 but only now does the file actually contain them, even several of the same type. For example,
3301 FEFF;BYTE ORDER MARK;alternate
3302 FEFF;BOM;abbreviation
3303 FEFF;ZWNBSP;abbreviation
3304- This breaks our gennames parser, unames.icu data structure, and API.
3305 Fix gennames to only pick up "correction" aliases.
3306 New ticket #8963 for further changes.
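
The three-field format and the "correction"-only filter can be sketched as follows. This is a minimal Python illustration and not the gennames code; the function names are made up, the 01A2 line is a sample added for illustration, and the dict assumes at most one correction alias per code point:

```python
def parse_name_aliases(lines):
    """Yield (code point, alias, type) from NameAliases.txt-style lines."""
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        cp, alias, alias_type = (f.strip() for f in line.split(";"))
        yield int(cp, 16), alias, alias_type

def correction_aliases(lines):
    """Keep only the aliases of type 'correction'."""
    return {cp: alias for cp, alias, t in parse_name_aliases(lines)
            if t == "correction"}

sample = [
    "# Each line has three fields",
    "01A2;LATIN CAPITAL LETTER GHA;correction",
    "FEFF;BYTE ORDER MARK;alternate",
    "FEFF;BOM;abbreviation",
    "FEFF;ZWNBSP;abbreviation",
]
print(correction_aliases(sample))
```

Filtering by type sidesteps the multiple-aliases problem for U+FEFF, because none of its three aliases is a correction.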
3307
3308* run genpname/preparse.pl (on Linux)
3309 + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
3310 + make sure that data.h is writable
3311 + perl preparse.pl ~/svn.icu/trunk/src > out.txt
3312 + preparse.pl shows no errors, out.txt Info and Warning lines look ok
3313
3314* build ICU (make install)
3315 so that the tools build can pick up the new definitions from the installed header files.
3316* build Unicode tools (at least genpname) using CMake+make
3317
3318* run genpname
3319 (builds both pnames.icu and propname_data.h)
3320- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
3321- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
3322
3323* build ICU (make install)
3324* build Unicode tools using CMake+make
3325
3326* update source/data/unidata/norm2/nfkc_cf.txt
3327- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
3328
3329* update source/data/unidata/norm2/uts46.txt
3330- download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
3331 to ~/svn.icu/tools/trunk/src/unicode/py
3332- adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".
3333- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
3334- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
3335
3336* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
3337 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
3338- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
3339- Unicode 6.0..6.1: U+2260, U+226E, U+226F
3340- nothing new in 6.1, no test file to update
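
The grep for "disallowed_STD3_valid" on non-ASCII characters can be narrowed with a few lines of code. A minimal Python sketch, assuming the semicolon-separated layout with optional ".." ranges; the function name is made up and the sample lines are only modeled on the file:

```python
def non_ascii_std3_valid(lines):
    """Return (first, last) ranges with status disallowed_STD3_valid above U+007F."""
    hits = []
    for line in lines:
        data = line.split("#", 1)[0]  # drop the trailing comment
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        first, last = int(start, 16), int(end or start, 16)
        if last > 0x7F:
            hits.append((first, last))
    return hits

sample = [
    "0000..002C    ; disallowed_STD3_valid   # 1.1  <control>..COMMA",
    "2260          ; disallowed_STD3_valid   # 1.1  NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid   # 1.1  NOT LESS-THAN..NOT GREATER-THAN",
    "00C0          ; mapped ; 00E0           # 1.1  LATIN CAPITAL LETTER A WITH GRAVE",
]
for first, last in non_ascii_std3_valid(sample):
    print(f"U+{first:04X}..U+{last:04X}")
```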
3341
3342* generate core properties data files
3343- in initial bootstrapping, change the UCA version
3344 in source/data/unidata/FractionalUCA.txt to match the new Unicode version
3345- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
3346- rebuild ICU & tools
3347 + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
3348 check if the UCA version in FractionalUCA.txt matches the new Unicode version
3349 (see step above)
3350- run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:
3351 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
3352- rebuild ICU & tools
3353
3354* update Java data files
3355- refresh just the UCD-related files, just to be safe
3356- see (ICU4C)/source/data/icu4j-readme.txt
3357- mkdir /tmp/icu4j
3358- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3359 output:
3360 ...
3361 Unicode .icu files built to ./out/build/icudt49l
3362 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b
3363 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b
3364 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
3365 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b
3366 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b"
3367 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/
3368 mkdir -p /tmp/icu4j/main/shared/data
3369 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
3370 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/
3371 mkdir -p /tmp/icu4j/main/shared/data
3372 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
3373 make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data'
3374- copy the big-endian Unicode data files to another location,
3375 separate from the other data files
3376 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
3377 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
3378 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
3379 ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu
3380 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
3381 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
3382 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
3383- refresh ICU4J
3384 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
3385
3386* refresh Java test .txt files
3387- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
3388
3389* test ICU so far, fix test code where necessary
3390- temporarily ignore collation issues that look like UCA/UCD mismatches,
3391 until UCA data is updated
3392
3393* UCA
3394
3395- get output from Mark's tools; look in
3396 http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. version>.txt
3397- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
3398- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
3399 (note removing the underscore before "Rules")
3400- update (ICU)/source/test/testdata/CollationTest_*.txt
3401 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
3402 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
3403- check test file diffs for previously commented-out, known-failing data lines;
3404 probably need to keep those commented out
3405- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
3406- run makeuca.sh:
3407 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
3408- rebuild ICU4C
3409- refresh ICU4J collation data:
3410 (subset of instructions above for properties data refresh, except copies all coll/*)
3411 ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3412 ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
3413 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
3414 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
3415- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
3416- note on intltest: if collate/UCAConformanceTest fails, then
3417 utility/MultithreadTest/TestCollators will fail as well;
3418 fix the conformance test before looking into the multi-thread test
3419
3420* When refreshing all of ICU4J data from ICU4C
3421- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3422- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
3423or
3424- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
3425
3426*** LayoutEngine script information
3427
3428(For details see the Unicode 5.2 change log below.)
3429
3430* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
3431 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
3432 in the working directory.
3433 (It also generates ScriptRunData.cpp, which is no longer needed.)
3434
3435 The generated files have a current copyright date and "@draft" statement.
3436
3437- diff current <icu>/source/layout files vs. generated ones
3438 ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
3439 review and manually merge desired changes;
3440 fix gratuitous changes, incorrect @draft and missing aliases;
3441 Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
3442- if you just copy the above files, then
3443 fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
3444 manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
3445
3446*** merge the Unicode update branches back onto the trunk
3447- do not merge the icudata.jar and testdata.jar,
3448 instead rebuild them from merged & tested ICU4C
3449
3450---------------------------------------------------------------------------- ***
3451
3452ICU 4.8 (no Unicode update, just new script codes)
3453
3454* 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
3455 (added 2010-12-21)
3456 Afak 439 Afaka
3457 Jurc 510 Jurchen
3458 Mroo 199 Mro, Mru
3459 Nshu 499 Nüshu
3460 Shrd 319 Sharada, Śāradā
3461 Sora 398 Sora Sompeng
3462 Takr 321 Takri, Ṭākrī, Ṭāṅkrī
3463 Tang 520 Tangut
3464 Wole 480 Woleai
3465 -> uscript.h
3466 -> com.ibm.icu.lang.UScript
3467 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
3468 replace public static final int \1 = \2;\3
3469 -> genpname/SyntheticPropertyValueAliases.txt
3470 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
3471 and in com.ibm.icu.dev.test.lang.TestUScript.java
3472
3473* run genpname/preparse.pl (on Linux)
3474 + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
3475 + make sure that data.h is writable
3476 + perl preparse.pl ~/svn.icu/trunk/src > out.txt
3477 + preparse.pl shows no errors, out.txt Info and Warning lines look ok
3478
3479* rebuild Unicode tools (at least genpname) using make
3480- You might first need to "make install" ICU so that the tools build can pick
3481 up the new definitions from the installed header files.
3482
3483* run genpname
3484 (builds both pnames.icu and propname_data.h)
3485- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
3486- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
3487- rebuild ICU & tools
3488
3489* run genprops
3490- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
3491- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
3492- rebuild ICU & tools
3493
3494* update Java data files
3495- refresh just the UCD-related files, just to be safe
3496- see (ICU4C)/source/data/icu4j-readme.txt
3497- mkdir /tmp/icu4j
3498- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3499- copy the big-endian Unicode data files to another location,
3500 separate from the other data files
3501 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
3502 ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
3503 ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
3504- refresh ICU4J
3505 ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b
3506
3507* should have updated the layout engine script codes but forgot
3508
3509---------------------------------------------------------------------------- ***
3510
3511Unicode 6.0 update
3512
3513*** related ICU Trac tickets
3514
3515- ticket 7264 Unicode 6.0 Update
3516
3517*** Unicode version numbers
3518- makedata.mak
3519- uchar.h
3520 (configure.in & configure: have been modified to extract the version from uchar.h)
3521- com.ibm.icu.util.VersionInfo
3522
3523*** data files & enums & parser code
3524
3525* file preparation
3526
3527~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
3528- This now prepares both unidata and testdata files in respective output subfolders.
3529
3530* PropertyAliases.txt changes
3531- new Script_Extensions property defined in the new ScriptExtensions.txt file
3532 but not listed in PropertyAliases.txt; reported to unicode.org;
3533 -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
3534 scx; Script_Extensions
3535 -> uchar.h with new UProperty section
3536 -> com.ibm.icu.lang.UProperty, parallel with uchar.h
3537
3538* PropertyValueAliases.txt changes
3539- 12 new block names:
3540 Alchemical_Symbols
3541 Bamum_Supplement
3542 Batak
3543 Brahmi
3544 CJK_Unified_Ideographs_Extension_D
3545 Emoticons
3546 Ethiopic_Extended_A
3547 Kana_Supplement
3548 Mandaic
3549 Miscellaneous_Symbols_And_Pictographs
3550 Playing_Cards
3551 Transport_And_Map_Symbols
3552 -> add to uchar.h
3553 -> add to UCharacter.UnicodeBlock
3554 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
3555 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
3556- Joining_Group (jg) values:
3557 Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias
3558 -> uchar.h & UCharacter.JoiningGroup
3559- 3 new scripts:
3560 sc ; Batk ; Batak
3561 sc ; Brah ; Brahmi
3562 sc ; Mand ; Mandaic
3563 -> remove these from SyntheticPropertyValueAliases.txt
3564 -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
3565 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
3566 and in com.ibm.icu.dev.test.lang.TestUScript.java
3567- 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
3568 (added 2009-11-11..2010-07-18)
3569 Bass 259 Bassa Vah
3570 Dupl 755 Duployan shorthand
3571 Elba 226 Elbasan
3572 Gran 343 Grantha
3573 Kpel 436 Kpelle
3574 Loma 437 Loma
3575 Mend 438 Mende
3576 Merc 101 Meroitic Cursive
3577 Narb 106 Old North Arabian
3578 Nbat 159 Nabataean
3579 Palm 126 Palmyrene
3580 Sind 318 Sindhi
3581 Wara 262 Warang Citi
3582 -> uscript.h
3583 -> com.ibm.icu.lang.UScript
3584 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
3585 replace public static final int \1 = \2;\3
3586 -> SyntheticPropertyValueAliases.txt
3587 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
3588 and in com.ibm.icu.dev.test.lang.TestUScript.java
3589- ISO 15924 name change
3590 Mero 100 Meroitic Hieroglyphs (was Meroitic)
3591 -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
3592- property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt
3593
3594* UnicodeData.txt changes
3595- new CJK block:
3596 2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
3597 2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
3598 -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion
3599
3600* build Unicode tools using CMake+make
3601
3602* run genpname/preparse.pl (on Linux)
3603 + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
3604 + make sure that data.h is writable
3605 + perl preparse.pl ~/svn.icu/trunk/src > out.txt
3606 + preparse.pl shows no errors, out.txt Info and Warning lines look ok
3607
3608* rebuild Unicode tools (at least genpname) using make
3609- You might first need to "make install" ICU so that the tools build can pick
3610 up the new definitions from the installed header files.
3611
3612* run genpname
3613- ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
3614- rebuild ICU & tools
3615
3616* update source/data/unidata/norm2/nfkc_cf.txt
3617- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
3618
3619* update source/data/unidata/norm2/uts46.txt
3620- download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
3621 to ~/svn.icu/tools/trunk/src/unicode/py
3622- adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
3623- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
3624- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
3625
3626* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
3627 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
3628- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
3629- Unicode 6.0: U+2260, U+226E, U+226F
3630
3631* generate core properties data files
3632- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
3633- rebuild ICU & tools
3634- run makeuca.sh so that genuca picks up the new nfc.nrm:
3635 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
3636- rebuild ICU & tools
3637
3638* implement new Script_Extensions property (provisional)
3639- parser & generator: genprops & uprops.icu
3640- uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
3641- UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java
3642
3643* switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
3644- (one-time change)
3645- genbidi/gencase/genprops tools changes
3646- re-run makeprops.sh (see above)
3647- UCharacterProperty.java, UCharacterTypeIterator.java,
3648 UBiDiProps.java, UCaseProps.java, and several others with minor changes;
3649 UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java
3650
3651* update Java data files
3652- refresh just the UCD-related files, just to be safe
3653- see (ICU4C)/source/data/icu4j-readme.txt
3654- mkdir /tmp/icu4j
3655- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3656 output:
3657 ...
3658 Unicode .icu files built to ./out/build/icudt45l
3659 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
3660 echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
3661 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
3662 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
3663 mkdir -p /tmp/icu4j/main/shared/data
3664 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
3665- copy the big-endian Unicode data files to another location,
3666 separate from the other data files
3667 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
3668 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
3669 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
3670 ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
3671 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
3672 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
3673 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
3674- refresh ICU4J
3675 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
3676
3677* refresh Java test .txt files
3678- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
3679
3680* un-hardcode normalization skippable (NF*_Inert) test data
3681- removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools
3682
3683* copy updated break iterator test files
3684- now handled by early ucdcopy.py and
3685 copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
3686 (old instructions:
3687 copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
3688 to ~/svn.icu/trunk/src/source/test/testdata)
3689- they are not used in ICU4J
3690
3691* UCA
3692
3693- get output from Mark's tools; look in
3694 http://www.unicode.org/~book/incoming/mark/uca6.0.0/
3695 http://www.macchiato.com/unicode/utc/additional-uca-files
3696 http://www.unicode.org/Public/UCA/6.0.0/
3697 http://www.unicode.org/~mdavis/uca/
3698- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
3699- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
3700- update Han-implicit ranges for new CJK extensions:
3701 swapCJK() in ucol.cpp & ImplicitCEGenerator.java
3702- genuca: allow bytes 02 for U+FFFE, new merge-sort character;
3703 do not add it into invuca so that tailoring primary-after an ignorable works
3704- genuca: permit space between [variable top] bytes
3705- ucol.cpp: treat noncharacters like unassigned rather than ignorable
3706- run makeuca.sh:
3707 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
3708- rebuild ICU4C
3709- refresh ICU4J collation data:
3710 (subset of instructions above for properties data refresh, except copies all coll/*)
3711 ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3712 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
3713 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
3714 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
3715- update (ICU)/source/test/testdata/CollationTest_*.txt
3716 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
3717 with output from Mark's Unicode tools
3718- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
3719- note on intltest: if collate/UCAConformanceTest fails, then
3720 utility/MultithreadTest/TestCollators will fail as well;
3721 fix the conformance test before looking into the multi-thread test
3722
3723* When refreshing all of ICU4J data from ICU4C
3724- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3725- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
3726or
3727- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
3728
3729*** LayoutEngine script information
3730
3731(For details see the Unicode 5.2 change log below.)
3732
3733* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
3734ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
3735ScriptRunData.cpp, which is no longer needed.)
3736
3737The generated files have a current copyright date and "@draft" statement.
3738
3739* copy the above files into <icu>/source/layout, replacing the old files.
3740* fix mixed line endings
3741* review the diffs and fix incorrect @draft and missing aliases;
3742 Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
3743* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
3744
3745---------------------------------------------------------------------------- ***
3746
3747Unicode 5.2 update
3748
3749*** related ICU Trac tickets
3750
3751- ticket 7084 Unicode 5.2
3752
3753- ticket 7167 verify collation bytes
3754- ticket 7235 Java test NAME_ALIAS
3755- ticket 7236 Java DerivedCoreProperties.txt test
3756- ticket 7237 Java BidiTest.txt
3757- ticket 7238 UTrie2 in core unidata
3758- ticket 7239 test for tailoring gaps
3759- ticket 7240 Java fix CollationMiscTest
3760- ticket 7243 update layout engine for Unicode 5.2
3761
3762*** Unicode version numbers
3763- makedata.mak
3764- uchar.h
3765- configure.in & configure
3766- update ucdVersion in gennames.c if an algorithmic range changes
3767
3768*** data files & enums & parser code
3769
3770* file preparation
3771
3772python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
3773- includes finding files regardless of version numbers,
3774 copying them, and performing the equivalent processing of the
3775 ucdstrip and ucdmerge tools on the desired set of files
3776
3777* notes on changes
3778- PropertyAliases.txt
3779 moved from numeric to enumerated:
3780 ccc ; Canonical_Combining_Class
3781 new string properties:
3782 NFKC_CF ; NFKC_Casefold
3783 Name_Alias; Name_Alias
3784 new binary properties:
3785 Cased ; Cased
3786 CI ; Case_Ignorable
3787 CWCF ; Changes_When_Casefolded
3788 CWCM ; Changes_When_Casemapped
3789 CWKCF ; Changes_When_NFKC_Casefolded
3790 CWL ; Changes_When_Lowercased
3791 CWT ; Changes_When_Titlecased
3792 CWU ; Changes_When_Uppercased
3793 new CJK Unihan properties (not supported by ICU)
3794- PropertyValueAliases.txt
3795 new block names
3796 new scripts
3797 one script code change:
3798 sc ; Qaai ; Inherited
3799 ->
3800 sc ; Zinh ; Inherited ; Qaai
3801 new Line_Break (lb) value:
3802 lb ; CP ; Close_Parenthesis
3803 new Joining_Group (jg) values: Farsi_Yeh, Nya
3804 other new values:
3805 ccc; 214; ATA ; Attached_Above
3806- DerivedBidiClass.txt
3807 new default-R range: U+1E800 - U+1EFFF
3808- UnicodeData.txt
3809 all of the ISO comments are gone
3810 new CJK block end:
3811 9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
3812 new CJK block:
3813 2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
3814 2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;
3815
3816* genpname
3817- run preparse.pl
3818 + cd \svn\icuproj\icu\trunk\source\tools\genpname
3819 + make sure that data.h is writable
3820 + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
3821 + preparse.pl complains with errors like the following:
3822 Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
3823 This is because ICU 4.0 had scripts from ISO 15924 which are now
3824 added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
3825 and PropertyValueAliases.txt.
3826 -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
3827 Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
3828 + preparse.pl complains with errors about block names missing from uchar.h; add them
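
The conflicts that preparse.pl reports can be found ahead of time by intersecting the script codes of the two files. A minimal Python sketch, assuming both files use "sc ; Code ; Long_Name" lines; the function names and the sample excerpts are made up for illustration:

```python
def script_codes(lines):
    """Collect short script codes from 'sc ; Code ; Long_Name' lines."""
    codes = set()
    for line in lines:
        fields = [f.strip() for f in line.split("#", 1)[0].split(";")]
        if len(fields) >= 3 and fields[0] == "sc":
            codes.add(fields[1])
    return codes

def duplicated_scripts(synthetic_lines, ucd_lines):
    """Codes present in both files; remove these from the synthetic file."""
    return sorted(script_codes(synthetic_lines) & script_codes(ucd_lines))

# Made-up excerpts of the two files.
synthetic = ["sc ; Egyp ; Egyp", "sc ; Java ; Java", "sc ; Blis ; Blis"]
ucd = ["sc ; Egyp ; Egyptian_Hieroglyphs", "sc ; Java ; Javanese", "gc ; Lu ; Uppercase_Letter"]
print(duplicated_scripts(synthetic, ucd))
```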
3829
3830* uchar.h & uscript.h & uprops.h & uprops.c & genprops
3831- new block & script values
3832 + 26 new blocks
3833 copy new blocks from Blocks.txt
3834 MS VC++ 2008 regular expression:
3835 find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
3836 replace with " UBLOCK_\3 = 172, /*[\1]*/"
3837 + several new script values already added in ICU 4.0 for ISO 15924 coverage
3838 (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
3839 + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
3840 + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
3841 (added to SyntheticPropertyValueAliases.txt)
3842- new Joining_Group (jg) values: Farsi_Yeh, Nya
3843- new Line_Break (lb) value:
3844 lb ; CP ; Close_Parenthesis
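
The MS VC++ find/replace for new blocks has a direct equivalent in other regex dialects. A minimal Python sketch of the same idea; upper-casing the name and passing the enum value as a parameter are additions that the original replace pattern does not do, and the function name is made up:

```python
import re

# Same shape as the VC++ pattern above, in Python syntax.
BLOCK_LINE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+); ([A-Z].+)$")

def block_to_enum(line, value):
    """Turn a Blocks.txt line into a uchar.h-style enum line, or None."""
    m = BLOCK_LINE.match(line)
    if m is None:
        return None
    start, _end, name = m.groups()
    # Added step: spaces and hyphens become underscores, then upper-case.
    const = re.sub(r"[^A-Za-z0-9]+", "_", name).upper()
    return f"    UBLOCK_{const} = {value}, /*[{start}]*/"

# 172 is the placeholder number from the replace pattern, not a real value.
print(block_to_enum("A960..A97F; Hangul Jamo Extended-A", 172))
```

The numbers still have to be assigned by hand in sequence after the last existing block constant.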
3845
3846* hardcoded Unihan range end/limit
3847- Unihan range end moves from 9FC3 to 9FCB
3848 search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
3849 + do change gennames.c
3850
3851* Compare definitions of new binary properties with what we used to use
3852 in algorithms, to see if the definitions changed.
3853- Verified that definitions for Cased and Case_Ignorable are unchanged.
3854 The gencase tool now parses the newly public Case_Ignorable values
3855 in case the definition changes in the future.
3856
3857* uchar.c & uprops.h & uprops.c & genprops
3858- new numeric values that didn't exist in Unicode data before:
3859 1/7, 1/9, 1/10, 3/10, 1/16, 3/16
3860 the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
3861 therefore redesign the encoding of numeric types and values for formatVersion 6;
3862 design for simple numbers up to at least 144 ("one gross"),
3863 large values up to at least 10^20,
3864 and fractions with numerators -1..17 and denominators 1..16
3865 to cover current and expected future values
3866 (e.g., more Han numeric values, Meroitic twelfths)
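
To illustrate the design range for fractions, here is a minimal sketch of one possible packing. It is a hypothetical layout chosen for this example (4 low bits for the denominator, the remaining bits for the numerator offset) and is not claimed to be the actual uprops.icu formatVersion 6 encoding.

```python
NUM_MIN, NUM_MAX = -1, 17   # numerator range from the notes above
DEN_MIN, DEN_MAX = 1, 16    # denominator range from the notes above


def pack_fraction(numerator, denominator):
    """Pack a fraction into a small integer (hypothetical layout)."""
    if not (NUM_MIN <= numerator <= NUM_MAX):
        raise ValueError("numerator out of range")
    if not (DEN_MIN <= denominator <= DEN_MAX):
        raise ValueError("denominator out of range")
    return ((numerator - NUM_MIN) << 4) | (denominator - DEN_MIN)


def unpack_fraction(code):
    """Inverse of pack_fraction; returns (numerator, denominator)."""
    return (code >> 4) + NUM_MIN, (code & 0xF) + DEN_MIN


# The new Unicode 5.2 values listed above all fit into this range.
NEW_VALUES = [(1, 7), (1, 9), (1, 10), (3, 10), (1, 16), (3, 16)]
```

With these ranges there are 19 numerators and 16 denominators, so any such packing needs at most 9 bits.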
3867
3868* reimplement Hangul_Syllable_Type for new Jamo characters
3869- the old code assumed that all Jamo characters are in the 11xx block
3870- Unicode 5.2 fills holes there and adds new Jamo characters in
3871 A960..A97F; Hangul Jamo Extended-A
3872 and in
3873 D7B0..D7FF; Hangul Jamo Extended-B
3874- Hangul_Syllable_Type can be trivially derived from a subset of
3875 Grapheme_Cluster_Break values
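
The derivation can be sketched as a simple lookup. This is a minimal illustration using the short property value aliases; the function name is hypothetical and this is not the ICU implementation.

```python
# Grapheme_Cluster_Break values that correspond one-to-one to
# Hangul_Syllable_Type values (short aliases).
GCB_TO_HST = {"L": "L", "V": "V", "T": "T", "LV": "LV", "LVT": "LVT"}


def hangul_syllable_type(gcb_value):
    """Map a GCB short alias to an HST short alias; everything else is NA."""
    return GCB_TO_HST.get(gcb_value, "NA")
```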
3876
3877* build Unicode data source code for hardcoding core data
3878C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data
3879
3880ICU data make path is \svn\icuproj\icu\trunk\source\data\
3881ICU root path is \svn\icuproj\icu\trunk
3882Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
3883Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
3884Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
3885Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
3886Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
3887Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
3888Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
3889Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
3890Creating data file for Unicode Property Names
3891Creating data file for Unicode Character Properties
3892Creating data file for Unicode Case Mapping Properties
3893Creating data file for Unicode BiDi/Shaping Properties
3894Creating data file for Unicode Normalization
3895Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
3896Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"
3897
3898- copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
3899 and rebuild the common library
3900
3901*** UCA
3902
3903- update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
3904- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
3905- update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
3906[ Begin obsolete instructions:
3907 Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files.
3908 - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
3909 on Windows:
3910 python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
3911 python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
3912 End obsolete instructions]
3913- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
3914 not just the *_STUB.txt files
3915- note on intltest: if collate/UCAConformanceTest fails, then
3916 utility/MultithreadTest/TestCollators will fail as well;
3917 fix the conformance test before looking into the multi-thread test
3918
3919*** Implement Cased & Case_Ignorable properties
3920- via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
3921- Problem: These properties should be disjoint, but aren't
3922- UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
3923- change ucase.icu to be able to store any combination of Cased and Case_Ignorable
3924
3925*** Implement Changes_When_Xyz properties
3926- without stored data
3927
3928*** Implement Name_Alias property
3929- add it as another name field in unames.icu
3930- make it available via u_charName() and UCharNameChoice
3931- consider it in u_charFromName()
3932
3933*** Break iterators
3934
3935* Update break iterator rules to new UAX versions and new property values
3936* Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary
3937
3938*** new BidiTest file
3939- review format and data
3940- copy BidiTest.txt to source/test/testdata
3941- write test code using this data
3942- fix ICU code where it fails the conformance test
3943
3944*** Java
3945- generally, find and update code corresponding to C/C++
3946- UCharacter.UnicodeBlock constants:
3947 a) add an _ID integer per new block, update COUNT
3948 b) add a class instance per new block
3949 Visual Studio regex:
3950 find UBLOCK_{[^ ]+} = [0-9]+, {/.+}
3951 replace with public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
3952- CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()
3953
3954- port test changes to Java
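
Step b) of the UnicodeBlock update can be scripted instead of using the editor regex. The sketch below mirrors that find/replace pair; the function name java_block_line is hypothetical, and the input line in the usage note is only an example of the uchar.h enum format.

```python
import re

# Same shape as the editor regex: constant name, numeric value, trailing comment.
C_ENUM = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")


def java_block_line(c_line):
    """Turn one uchar.h UBLOCK_ enum line into a Java constant declaration."""
    m = C_ENUM.search(c_line)
    if m is None:
        return None  # not a block enum line
    name, comment = m.groups()
    return (f"public static final UnicodeBlock {name} = "
            f'new UnicodeBlock("{name}", {name}_ID); {comment}')
```

For example, an input line of the form "UBLOCK_SAMARITAN = 172, /*[0800]*/" yields a declaration for SAMARITAN that refers to SAMARITAN_ID and keeps the trailing comment.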
3955
3956*** LayoutEngine script information
3957
3958(For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)
3959
3960* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
3961ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
3962ScriptRunData.cpp, which is no longer needed.)
3963
3964The generated files have a current copyright date and "@draft" statement.
3965
3966-> Eric Mader wrote in email on 20090930:
3967 "I think the tool has been modified to update @draft to @stable for
3968 older scripts and to add @draft for new scripts.
3969 (I worked with an intern on this last year.)
3970 You should check the output after you run it."
3971
3972* copy the above files into <icu>/source/layout, replacing the old files.
3973* fix mixed line endings
3974* review the diffs and fix incorrect @draft and missing aliases
3975* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
3976
3977Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
3978and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
3979
3980-> Eric Mader wrote in email on 20090930:
3981 "This is just a matter of making sure that all the per-script tables have
3982 entries for any new scripts that were added.
3983 If any new Indic characters were added, then the class tables in
3984 IndicClassTables.cpp should be updated to reflect this.
3985 John Emmons should know how to do this if it's required."
3986
3987* rebuild the layout and layoutex libraries.
3988
3989*** Documentation
3990- Update User Guide
3991 + Jamo_Short_Name, sfc->scf, binary property value aliases
3992
3993---------------------------------------------------------------------------- ***
3994
3995Unicode 5.1 update
3996
3997*** related ICU Trac tickets
3998
3999 5696 Update to Unicode 5.1
4000
4001*** Unicode version numbers
4002- makedata.mak
4003- uchar.h
4004- configure.in & configure
4005- update ucdVersion in gennames.c if an algorithmic range changes
4006
4007*** data files & enums & parser code
4008
4009* file preparation
4010- ucdstrip:
4011 DerivedCoreProperties.txt
4012 DerivedNormalizationProps.txt
4013 NormalizationTest.txt
4014 PropList.txt
4015 Scripts.txt
4016 GraphemeBreakProperty.txt
4017 SentenceBreakProperty.txt
4018 WordBreakProperty.txt
4019- ucdstrip and ucdmerge:
4020 EastAsianWidth.txt
4021 LineBreak.txt
4022
4023* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
4024copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
4025copy 5.1.0\ucd\Blocks.txt ..\unidata\
4026copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
4027copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
4028copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
4029copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
4030copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
4031copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
4032copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
4033copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
4034copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
4035copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
4036copy 5.1.0\ucd\UnicodeData.txt ..\unidata\
4037
4038ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
4039ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
4040ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
4041ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
4042ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
4043ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
4044ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
4045ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
4046ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
4047ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
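
Because the batch file has to be edited for every UCD version, the command list could be generated from the version string. This is a minimal sketch assuming the same file lists and relative paths as the batch file above; the function name ucd2unidata_commands is hypothetical, and it only builds the command strings without running copy, ucdstrip, or ucdmerge.

```python
COPY_FILES = [
    "BidiMirroring.txt", "Blocks.txt", "CaseFolding.txt", "DerivedAge.txt",
    r"extracted\DerivedBidiClass.txt", r"extracted\DerivedJoiningGroup.txt",
    r"extracted\DerivedJoiningType.txt", r"extracted\DerivedNumericValues.txt",
    "NormalizationCorrections.txt", "PropertyAliases.txt",
    "PropertyValueAliases.txt", "SpecialCasing.txt", "UnicodeData.txt",
]
STRIP_FILES = [
    "DerivedCoreProperties.txt", "DerivedNormalizationProps.txt",
    "NormalizationTest.txt", "PropList.txt", "Scripts.txt",
    r"auxiliary\GraphemeBreakProperty.txt",
    r"auxiliary\SentenceBreakProperty.txt",
    r"auxiliary\WordBreakProperty.txt",
]
STRIP_MERGE_FILES = ["EastAsianWidth.txt", "LineBreak.txt"]


def basename(rel):
    """Return the file name without its Windows-style directory part."""
    return rel.rsplit("\\", 1)[-1]


def ucd2unidata_commands(version):
    """Build the batch command lines for one UCD version, e.g. "5.1.0"."""
    src = version + "\\ucd\\"
    dst = "..\\unidata\\"
    cmds = [f"copy {src}{rel} {dst}" for rel in COPY_FILES]
    for rel in STRIP_FILES:
        cmds.append(f"ucdstrip < {src}{rel} > {dst}{basename(rel)}")
    for rel in STRIP_MERGE_FILES:
        cmds.append(f"ucdstrip < {src}{rel} | ucdmerge > {dst}{basename(rel)}")
    return cmds
```

Printing the result of ucd2unidata_commands("5.1.0") line by line should reproduce the commands listed above.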
4048
4049* genpname
4050- run preparse.pl
4051 + cd \svn\icuproj\icu\uni51\source\tools\genpname
4052 + make sure that data.h is writable
4053 + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
4054 + preparse.pl complains with errors like the following:
4055 Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
4056 This is because ICU 3.8 had scripts from ISO 15924 which are now
4057 added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt
4058 and PropertyValueAliases.txt.
4059 -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
4060 Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
4061 + PropertyValueAliases.txt now explicitly contains values for boolean properties:
4062 N/Y, No/Yes, F/T, False/True
4063 -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
4064 It will use further values from the file if present.
4065
4066* uchar.h & uscript.h & uprops.h & uprops.c & genprops
4067- new block & script values
4068 + 17 new blocks
4069 + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
4070 (removed from SyntheticPropertyValueAliases.txt)
4071 + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
4072 (added to SyntheticPropertyValueAliases.txt)
4073- uprops.icu (uprops.h) provided only 7 bits for script codes.
4074 ICU 4.0 now has USCRIPT_CODE_LIMIT=130 script codes.
4075 No assigned Unicode character has a script code above 127 yet,
4076 so ICU 4.0 uprops.icu does not store any
4077 script code values greater than 127.
4078 However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
4079 in a parallel bit field, and that overflows now.
4080 Also, future values >=128 would be incompatible anyway.
4081 uprops.h is modified to move around several of the bit fields
4082 in the properties vector words, and now uses 8 bits for the script code.
4083 Two other bit fields also grow to accommodate future growth:
4084 Block (current count: 172) grows from 8 to 9 bits,
4085 and Word_Break grows from 4 to 5 bits.
4086- renamed property Simple_Case_Folding (sfc->scf)
4087 + nothing to be done: handled as normal alias
4088- new property JSN Jamo_Short_Name
4089 + no new API: only contributes to the Name property
4090- new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
4091- new Joining Group (JG) value: Burushashki_Yeh_Barree
4092- new Sentence_Break (SB) values:
4093 SB ; CR ; CR
4094 SB ; EX ; Extend
4095 SB ; LF ; LF
4096 SB ; SC ; SContinue
4097- new Word_Break (WB) values:
4098 WB ; CR ; CR
4099 WB ; Extend ; Extend
4100 WB ; LF ; LF
4101 WB ; MB ; MidNumLet
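
The bit-field overflow described in the script-code note above comes down to simple arithmetic, sketched here. The helper name bits_needed is hypothetical; the numbers are the ones quoted in the note.

```python
def bits_needed(max_value):
    """Smallest number of bits that can hold values 0..max_value."""
    return max(1, max_value.bit_length())


# Script codes stored per character: none above 127, so 7 bits suffice.
# Maximum script value USCRIPT_CODE_LIMIT-1 = 129 no longer fits in 7 bits.
# Block count 172 still fits in 8 bits; 9 bits leave room for growth.
for label, value in (("max stored script code", 127),
                     ("USCRIPT_CODE_LIMIT-1", 129),
                     ("block count", 172)):
    print(label, value, bits_needed(value))
```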
4102
4103* Further changes in the 2008-02-29 update:
4104- Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
4105 because they should not normally be invisible.
4106- new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
4107- new Grapheme_Cluster_Break (GCB) value: PP=Prepend
4108- new Word_Break (WB) value: NL=Newline
4109
4110* hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
4111- Unihan range end moves from 9FBB to 9FC3
4112 search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
4113 + do change gennames.c
4114
4115* build Unicode data source code for hardcoding core data
4116C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data
4117
4118ICU data make path is \svn\icuproj\icu\uni51\source\data\
4119ICU root path is \svn\icuproj\icu\uni51
4120Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
4121Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
4122Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
4123Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
4124Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
4125Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
4126Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
4127Creating data file for Unicode Character Properties
4128Creating data file for Unicode Case Mapping Properties
4129Creating data file for Unicode BiDi/Shaping Properties
4130Creating data file for Unicode Normalization
4131Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
4132Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"
4133
4134- copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
4135 and rebuild the common library
4136
4137*** Break iterators
4138
4139* Update break iterator rules to new UAX versions and new property values
4140
4141*** UCA
4142
4143* update FractionalUCA.txt and UCARules.txt with new canonical closure
4144
4145*** Test suites
4146- Test that APIs using Unicode property value aliases (like UnicodeSet)
4147 support all of the boolean values N/Y, No/Yes, F/T, False/True
4148 -> TestBinaryValues() tests in both cintltst and intltest
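
The kind of check meant here can be sketched as follows. This is a simplified stand-in, not the TestBinaryValues() code: the names are hypothetical, and the matching only ignores case and surrounding whitespace, which is less than full loose matching of property value aliases.

```python
# All spellings of the two boolean values named in PropertyValueAliases.txt.
BINARY_VALUE_ALIASES = {
    "n": False, "no": False, "f": False, "false": False,
    "y": True, "yes": True, "t": True, "true": True,
}


def parse_binary_value(name):
    """Return the boolean for a binary property value alias."""
    return BINARY_VALUE_ALIASES[name.strip().lower()]


def check_all_aliases():
    """Every alias must be accepted and map to the expected value."""
    for alias in ("N", "No", "F", "False"):
        assert parse_binary_value(alias) is False
    for alias in ("Y", "Yes", "T", "True"):
        assert parse_binary_value(alias) is True
    return True
```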
4149
4150*** LayoutEngine script information
4151* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
4152ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
4153ScriptRunData.cpp, which is no longer needed.)
4154
4155The generated files have a current copyright date and "@draft" statement.
4156
4157* copy the above files into <icu>/source/layout, replacing the old files.
4158
4159Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
4160and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
4161
4162* rebuild the layout and layoutex libraries.
4163
4164*** Documentation
4165- Update User Guide
4166 + Jamo_Short_Name, sfc->scf, binary property value aliases
4167
4168---------------------------------------------------------------------------- ***
4169
4170Unicode 5.0 update
4171
4172*** related Jitterbugs
4173
4174 5084 RFE: Update to Unicode 5.0
4175
4176*** data files & enums & parser code
4177
4178* file preparation
4179- ucdstrip:
4180 DerivedCoreProperties.txt
4181 DerivedNormalizationProps.txt
4182 NormalizationTest.txt
4183 PropList.txt
4184 Scripts.txt
4185 GraphemeBreakProperty.txt
4186 SentenceBreakProperty.txt
4187 WordBreakProperty.txt
4188- ucdstrip and ucdmerge:
4189 EastAsianWidth.txt
4190 LineBreak.txt
4191
4192* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
4193copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
4194copy 5.0.0\ucd\Blocks.txt ..\unidata\
4195copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
4196copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
4197copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
4198copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
4199copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
4200copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
4201copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
4202copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
4203copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
4204copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
4205copy 5.0.0\ucd\UnicodeData.txt ..\unidata\
4206
4207ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
4208ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
4209ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
4210ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
4211ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
4212ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
4213ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
4214ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
4215ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
4216ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
4217
4218* update FractionalUCA.txt and UCARules.txt with new canonical closure
4219
4220* genpname
4221- run preparse.pl
4222 + make sure that data.h is writable
4223 + perl preparse.pl \cvs\oss\icu > out.txt
4224
4225* uchar.h & uscript.h & uprops.h & uprops.c & genprops
4226- new block & script values
4227 + script values already added in ICU 3.6 because all of ISO 15924 is now covered
4228
4229* build Unicode data source code for hardcoding core data
4230C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data
4231
4232ICU data make path is \cvs\oss\icu\source\data\
4233ICU root path is \cvs\oss\icu
4234Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
4235[etc.]
4236Creating data file for Unicode Character Properties
4237Creating data file for Unicode Case Mapping Properties
4238Creating data file for Unicode BiDi/Shaping Properties
4239Creating data file for Unicode Normalization
4240Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
4241Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"
4242
4243- copy the .c source files to C:\cvs\oss\icu\source\common
4244 and rebuild the common library
4245
4246*** Unicode version numbers
4247- makedata.mak
4248- uchar.h
4249- configure.in
4250
4251*** LayoutEngine script information
4252* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
4253ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
4254ScriptRunData.cpp, which is no longer needed.)
4255
4256The generated files have a current copyright date and "@draft" statement.
4257
4258* copy the above files into <icu>/source/layout, replacing the old files.
4259
4260Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
4261and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
4262
4263* rebuild the layout and layoutex libraries.
4264
4265---------------------------------------------------------------------------- ***
4266
4267Unicode 4.1 update
4268
4269*** related Jitterbugs
4270
4271 4332 RFE: Update to Unicode 4.1
4272 4157 RBBI, TR29 4.1 updates
4273
4274*** data files & enums & parser code
4275
4276* file preparation
4277- ucdstrip:
4278 DerivedCoreProperties.txt
4279 DerivedNormalizationProps.txt
4280 NormalizationTest.txt
4281 GraphemeBreakProperty.txt
4282 SentenceBreakProperty.txt
4283 WordBreakProperty.txt
4284- ucdstrip and ucdmerge:
4285 EastAsianWidth.txt
4286 LineBreak.txt
4287
4288* add new files to the repository
4289 GraphemeBreakProperty.txt
4290 SentenceBreakProperty.txt
4291 WordBreakProperty.txt
4292
4293* update FractionalUCA.txt and UCARules.txt with new canonical closure
4294
4295* genpname
4296- handle new enumerated properties in sub read_uchar
4297- run preparse.pl
4298
4299* uchar.h & uscript.h & uprops.h & uprops.c & genprops
4300- new binary properties
4301 + Pattern_Syntax
4302 + Pattern_White_Space
4303- new enumerated properties
4304 + Grapheme_Cluster_Break
4305 + Sentence_Break
4306 + Word_Break
4307- new block & script & line break values
4308
4309* gencase
4310- case-ignorable changes
4311 see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
4312 now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk
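
The D47a definition quoted above can be sketched as a predicate. This is an illustration only and not the gencase code: Python's standard library has no Word_Break data, so the set of MidLetter code points is passed in by the caller (a hypothetical parameter), and unicodedata reflects whatever Unicode version the Python build ships, not Unicode 4.1.

```python
import unicodedata

# General categories named in D47a.
CASE_IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}


def is_case_ignorable(code_point, midletter=frozenset()):
    """True if Word_Break=MidLetter (per the supplied set) or gc in Mn, Me, Cf, Lm, Sk."""
    if code_point in midletter:
        return True
    return unicodedata.category(chr(code_point)) in CASE_IGNORABLE_CATEGORIES
```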
4313
4314*** Unicode version numbers
4315- makedata.mak
4316- uchar.h
4317- configure.in
4318
4319*** tests
4320- verify that u_charMirror() round-trips
4321- test all new properties and some new values of old properties
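
A data-level version of the round-trip check can be sketched against BidiMirroring.txt-style lines ("code; mirror # comment"). This is a minimal illustration with hypothetical function names; it checks only the data file, not u_charMirror() itself.

```python
def parse_mirroring(lines):
    """Parse "0028; 0029 # LEFT PARENTHESIS" style lines into a dict."""
    pairs = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue  # comment or blank line
        src, dst = (int(field.strip(), 16) for field in data.split(";")[:2])
        pairs[src] = dst
    return pairs


def non_round_tripping(pairs):
    """Return code points whose mirror does not map back to them."""
    return sorted(c for c, m in pairs.items() if pairs.get(m) != c)
```

An empty result from non_round_tripping() means every listed mapping has its inverse in the same data.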
4322
4323*** other code
4324
4325* hardcoded Unihan range end/limit
4326- Unihan range end moves from 9FA5 to 9FBB
4327 search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
4328 + do not modify BOCU/BOCSU code because that would change the encoding
4329 and break binary compatibility!
4330 + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
4331 NamePrepProfile.txt
4332 + ignore trietest.c: test data is arbitrary
4333 + ignore tstnorm.cpp: test optimization, not important
4334 + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
4335 + do change line_th.txt and word_th.txt
4336 by replacing hardcoded ranges with the new property values
4337 + do change gennames.c
4338
4339source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
4340source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
4341source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5,
4342
4343* case mappings
4344- compare new special casing context conditions with previous ones
4345 see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
4346
4347* genpname
4348- consider storing only the short name if it is the same as the long name
4349
4350*** other reviews
4351- UAX #29 changes (grapheme/word/sentence breaks)
4352- UAX #14 changes (line breaks)
4353- Pattern_Syntax & Pattern_White_Space
4354
4355---------------------------------------------------------------------------- ***
4356
4357Unicode 4.0.1 update
4358
4359*** related Jitterbugs
4360
4361 3170 RFE: Update to Unicode 4.0.1
4362 3171 Add new Unicode 4.0.1 properties
4363 3520 use Unicode 4.0.1 updates for break iteration
4364
4365*** data files & enums & parser code
4366
4367* file preparation
4368- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
4369- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt
4370
4371* file fixes
4372- fix UnicodeData.txt general categories of Ethiopic digits Nd->No
4373 according to PRI #26
4374 http://www.unicode.org/review/resolved-pri.html#pri26
4375- undone again because no corrigendum in sight;
4376 instead modified tests to not check consistency on this for Unicode 4.0.1
4377
4378* ucdterms.txt
4379- update from http://www.unicode.org/copyright.html
4380 formatted for plain text
4381
4382* uchar.h & uprops.h & uprops.c & genprops
4383- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
4384- add U_LB_INSEPARABLE due to a spelling fix
4385 + put short name comment only on line with new constant
4386 for genpname perl script parser
4387- new binary properties
4388 + STerm
4389 + Variation_Selector
4390
4391* genpname
4392- fix genpname perl script so that it doesn't choke on more than 2 names per property value
4393- perl script: correctly calculate the maximum number of fields per row
4394
4395* uscript.h
4396- new script code Hrkt=Katakana_Or_Hiragana
4397
4398* gennorm.c track changes in DerivedNormalizationProps.txt
4399- "FNC" -> "FC_NFKC"
4400- single field "NFD_NO" -> two fields "NFD_QC; N" etc.
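
The two format changes can be sketched as a small normalization step that accepts both old and new field layouts. This is not the gennorm.c code; the function name is hypothetical, and the handling of "_MAYBE" names is an assumption based on the "etc." above.

```python
def normalize_fields(fields):
    """Map old-style DerivedNormalizationProps.txt fields to the new style.

    fields is the list of stripped fields after the code point (range).
    """
    name = fields[0]
    if name == "FNC":
        # Property renamed; any following mapping field is kept as-is.
        return ["FC_NFKC"] + fields[1:]
    if len(fields) == 1:
        # Old single-field quick check values, e.g. "NFD_NO" -> "NFD_QC; N".
        for suffix, value in (("_NO", "N"), ("_MAYBE", "M")):
            if name.endswith(suffix):
                return [name[:-len(suffix)] + "_QC", value]
    return fields  # already in the new format
```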
4401
4402* genprops/props2.c track changes in DerivedNumericValues.txt
4403- changed from 3 columns to 2, dropping the numeric type
4404 + assume that the type is always numeric for Han characters,
4405 and that only those are added in addition to what UnicodeData.txt lists
4406
4407*** Unicode version numbers
4408- makedata.mak
4409- uchar.h
4410- configure.in
4411
4412*** tests
4413- update test of default bidi classes according to PRI #28
4414 /tsutil/cucdtst/TestUnicodeData
4415 http://www.unicode.org/review/resolved-pri.html#pri28
4416- bidi tests: change exemplar character for ES depending on Unicode version
4417- change hardcoded expected property values where they change
4418
4419*** other code
4420
4421* name matching
4422- read UCD.html
4423
4424* scripts
4425- use new Hrkt=Katakana_Or_Hiragana
4426
4427* ZWJ & ZWNJ
4428- are now part of combining character sequences
4429- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ