* Copyright (C) 2016 and later: Unicode, Inc. and others.
* License & terms of use: http://www.unicode.org/copyright.html
* Copyright (C) 2004-2016, International Business Machines
* Corporation and others. All Rights Reserved.
*
* file name: changes.txt
* encoding: US-ASCII
* tab size: 8 (not used)
* indentation:4
*
* created on: 2004may06
* created by: Markus W. Scherer
*
* change log for Unicode updates
*
* For each new Unicode version, during the beta period,
* I copy the change log for the previous version to the top of this file.
* I adjust the versions, tickets, URLs, and paths.
* I work my way through the steps listed in the log, top to bottom,
* adjusting the log as necessary.
* I report problems to the UTC and/or CLDR and/or ICU.
* Before the data is final, I "turn the crank" several more times,
* using appropriate subsets of the steps.

---------------------------------------------------------------------------- ***

* New ISO 15924 script codes

Starting with ICU 55, we do not add UScriptCode constants for new scripts any more
until they are encoded in Unicode,
or can be assumed to be encoded in the next Unicode version.
Script enum constant names want to follow the Unicode script property value aliases,
which are assigned only when the scripts are encoded.
When we encode scripts early and guess wrong, then we have confusing enum constants
and have sometimes added aliases.

Variant script codes like Latf and Aran that are not subject to separate encoding
can be added at any time.
(For example, Aran could be added as USCRIPT_ARABIC_NASTALIQ.)

We add script codes used in CLDR or in the spoof checker.
This includes combination/alias codes like Hanb and Jamo.
See http://unicode.org/reports/tr35/#unicode_script_subtag_validity
and look for "alias" on http://unicode.org/iso15924/iso15924-codes.html

We add special Z* script codes like Zsye.

For new script codes see http://www.unicode.org/iso15924/codechanges.html

---------------------------------------------------------------------------- ***

Unicode 13.0 update for ICU 66

https://www.unicode.org/versions/Unicode13.0.0/
https://www.unicode.org/versions/beta-13.0.0.html
https://www.unicode.org/Public/13.0.0/ucd/
https://www.unicode.org/reports/uax-proposed-updates.html
https://www.unicode.org/reports/tr44/tr44-25.html

https://unicode-org.atlassian.net/browse/CLDR-13387
https://unicode-org.atlassian.net/browse/ICU-20893

* Command-line environment setup

UNICODE_DATA=~/unidata/uni13/20200212
CLDR_SRC=~/cldr/uni/src
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt66b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
  + split Unihan into single-property files
    ~/unitools/trunk/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan
  + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt
    or from the ucd/cldr/ output folder of the Unicode Tools:
    Since Unicode 12/CLDR 35/ICU 64 CLDR uses modified break rules.
      cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)
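
The suffix removal step can be pictured with a small Python sketch. This is not the actual desuffixucd.py: the function name strip_version_suffix and the regular expression are assumptions about what a version suffix looks like (for example "-13.0.0" or "-13.0.0d5" directly before the file extension).

```python
import re

# Assumed suffix shape: "-<major>.<minor>.<update>", optionally followed by a
# draft marker such as "d5", directly before the file extension.
_SUFFIX = re.compile(r"-\d+\.\d+\.\d+(?:d\d+)?(?=\.[A-Za-z0-9]+$)")


def strip_version_suffix(name):
    """Return the file name without a trailing Unicode version suffix."""
    return _SUFFIX.sub("", name)


if __name__ == "__main__":
    for name in ("Blocks-13.0.0.txt", "UnicodeData-13.0.0d5.txt", "ReadMe.txt"):
        print(name, "->", strip_version_suffix(name))
```

Names without a matching suffix pass through unchanged, so running such a function over an already cleaned folder would be harmless.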

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* new constants for new property values
- preparseucd.py error:
    ValueError: missing uchar.h enum constants for some property values:
      [(u'blk', set([u'Symbols_For_Legacy_Computing', u'Dives_Akuru', u'Yezidi',
                     u'Tangut_Sup', u'CJK_Ext_G', u'Khitan_Small_Script', u'Chorasmian', u'Lisu_Sup'])),
       (u'sc', set([u'Chrs', u'Diak', u'Kits', u'Yezi'])),
       (u'InPC', set([u'Top_And_Bottom_And_Left']))]
  = PropertyValueAliases.txt new property values (diff old & new .txt files)
    blk; Chorasmian                   ; Chorasmian
    blk; CJK_Ext_G                    ; CJK_Unified_Ideographs_Extension_G
    blk; Dives_Akuru                  ; Dives_Akuru
    blk; Khitan_Small_Script          ; Khitan_Small_Script
    blk; Lisu_Sup                     ; Lisu_Supplement
    blk; Symbols_For_Legacy_Computing ; Symbols_For_Legacy_Computing
    blk; Tangut_Sup                   ; Tangut_Supplement
    blk; Yezidi                       ; Yezidi
    -> add to uchar.h before UBLOCK_COUNT
       use long property names for enum constants,
       for the trailing comment get the block start code point: diff old & new Blocks.txt
    -> add to UCharacter.UnicodeBlock IDs
       Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
               replace  public static final int \1_ID = \2; \3
    -> add to UCharacter.UnicodeBlock objects
       Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
               replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

    sc ; Chrs ; Chorasmian
    sc ; Diak ; Dives_Akuru
    sc ; Kits ; Khitan_Small_Script
    sc ; Yezi ; Yezidi
    -> uscript.h & com.ibm.icu.lang.UScript
    -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
       and in com.ibm.icu.dev.test.lang.TestUScript.java

    InPC; Top_And_Bottom_And_Left ; Top_And_Bottom_And_Left
    -> uchar.h enum UIndicPositionalCategory & UCharacter.java IndicPositionalCategory
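
The two Eclipse find/replace pairs for the UnicodeBlock constants can also be tried outside the IDE. The following Python sketch reuses the patterns from the log unchanged; the function names are made up for illustration, and the sample line (including its numeric value) is only an assumed example of what a uchar.h entry looks like.

```python
import re

# Find patterns copied from the log.
FIND_ID = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
FIND_OBJECT = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")


def to_block_id(line):
    """Rewrite a uchar.h UBLOCK_ line as a UnicodeBlock ID constant."""
    return FIND_ID.sub(r"public static final int \1_ID = \2; \3", line)


def to_block_object(line):
    """Rewrite a uchar.h UBLOCK_ line as a UnicodeBlock object constant."""
    return FIND_OBJECT.sub(
        r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
        line,
    )


if __name__ == "__main__":
    sample = "UBLOCK_YEZIDI = 308, /*[10E80]*/"  # illustrative input line
    print(to_block_id(sample))
    print(to_block_object(sample))
```

Text outside the match, such as leading indentation, is left untouched, which matches how the replacement behaves in an editor.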

* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
    (not strictly necessary for NOT_ENCODED scripts)
    $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

* build ICU (make install)
  to make sure that there are no syntax errors, and
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in i18n/uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
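
The .nrm commands above follow one pattern: each output file is built from nfc.txt plus the additional input files layered on top of it. A minimal Python sketch of that table, using only the -o and -s options that appear in the log, might look like this. The function name gennorm2_commands and the dictionary are illustrative; the sketch only builds argument lists and does not run the tool, and it leaves out the --csource header variant.

```python
import os

# Output file -> input files, as listed in the log.
NORM_INPUTS = {
    "nfc.nrm": ["nfc.txt"],
    "nfkc.nrm": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf.nrm": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46.nrm": ["nfc.txt", "uts46.txt"],
}


def gennorm2_commands(data_in, unidata):
    """Build the gennorm2 argument lists for the .nrm files (does not run them)."""
    src = os.path.join(unidata, "norm2")
    return [
        ["bin/gennorm2", "-o", os.path.join(data_in, out), "-s", src] + inputs
        for out, inputs in NORM_INPUTS.items()
    ]


if __name__ == "__main__":
    # The two paths stand in for $ICU4C_DATA_IN and $ICU4C_UNIDATA.
    for argv in gennorm2_commands("data/in", "data/unidata"):
        print(" ".join(argv))
```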

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
- tool failure:
    genprops: Script_Extensions indexes overflow bit field
    genprops: error parsing or setting values from ppucd.txt line 32696 - U_BUFFER_OVERFLOW_ERROR
  -> uprops.icu data file format :
     add two more bits to store a script code or Script_Extensions index
  -> generator code, C++ & Java runtime, uprops.icu format version 7.7
- rebuild ICU (make install) & tools
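
The genprops failure is the usual limit of a fixed-width bit field: n bits hold only 2^n distinct values, and two more bits quadruple that capacity. The Python sketch below illustrates the idea only. The field widths, the sample index, and the function name pack_field are assumptions made for the example; the log does not spell out the real uprops.icu layout.

```python
def pack_field(value, width):
    """Return value if it fits into an unsigned bit field of the given width."""
    if value < 0 or value >= (1 << width):
        raise OverflowError("%d does not fit into %d bits" % (value, width))
    return value


OLD_WIDTH = 8              # assumed width, for illustration only
NEW_WIDTH = OLD_WIDTH + 2  # "add two more bits"

if __name__ == "__main__":
    index = 300  # hypothetical index that no longer fits the old field
    try:
        pack_field(index, OLD_WIDTH)
    except OverflowError as error:
        print("old format:", error)
    print("new format:", pack_field(index, NEW_WIDTH))
```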

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..13.0: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update
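
The grep for "disallowed_STD3_valid" on non-ASCII characters can be scripted. The following Python sketch assumes the usual "code point or range ; status" line layout of IdnaMappingTable.txt; the function name and the sample lines are illustrative, not taken from the real data file.

```python
import re

# Assumed line layout: "XXXX[..YYYY] ; status [; mapping] # comment"
_LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*([A-Za-z0-9_]+)")


def non_ascii_std3_valid(lines):
    """Return (start, end) ranges above U+007F with status disallowed_STD3_valid."""
    result = []
    for line in lines:
        m = _LINE.match(line)
        if not m or m.group(3) != "disallowed_STD3_valid":
            continue
        start = int(m.group(1), 16)
        end = int(m.group(2) or m.group(1), 16)
        if end > 0x7F:
            result.append((start, end))
    return result


SAMPLE = """\
0000..002C    ; disallowed_STD3_valid                  # 1.1  <control-0000>..COMMA
002D..002E    ; valid                                  # 1.1  HYPHEN-MINUS..FULL STOP
2260          ; disallowed_STD3_valid                  # 1.1  NOT EQUAL TO
226E..226F    ; disallowed_STD3_valid                  # 1.1  NOT LESS-THAN..NOT GREATER-THAN
""".splitlines()

if __name__ == "__main__":
    for start, end in non_ascii_std3_valid(SAMPLE):
        print("%04X..%04X" % (start, end))
```

On the sample lines this reports the same three characters that the log lists for Unicode 6.0..13.0.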

* run & fix ICU4C tests
- fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files
- Andy helps with RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
  diff the main mapping file, look for bad changes
  (for example, more bytes per weight for common characters)
    ~/svn.unitools/trunk$ sed -r -f ~/cldr/uni/src/tools/scripts/uca/blankweights.sed ../Generated/UCA/13.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-13.0.txt
    ~/svn.unitools/trunk$ meld ../frac-12.1.txt ../frac-13.0.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca
    $ICU_ROOT/dbg/tools/unicode/c$
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU4C

* Unihan collators
  https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src
    -DUVERSION=13.0.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/cldr/uni/src/common/collation
    -m /usr/local/google/home/mscherer/cldr/uni/src/common/supplemental
    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt66b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt66l.dat ./out/icu4j/icudt66b.dat -s ./out/build/icudt66l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt66b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt66b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt66b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
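
The selective copy above picks only the Unicode-related files out of the full data folder. As a rough Python sketch of the same selection (the function name select_unicode_data is made up, and the real steps are the shell commands in the log), the file patterns and the cnvalias.icu exclusion can be expressed like this:

```python
import shutil
import tempfile
from pathlib import Path

# File patterns taken from the cp commands in the log.
PATTERNS = ["confusables.cfu", "*.icu", "*.nrm", "coll/*", "brkitr/*"]


def select_unicode_data(src, dst):
    """Copy the Unicode-related data files from src to dst; return the relative paths copied."""
    copied = []
    for pattern in PATTERNS:
        for f in sorted(src.glob(pattern)):
            # The log copies *.icu and then removes cnvalias.icu again; skip it here.
            if not f.is_file() or f.name == "cnvalias.icu":
                continue
            target = dst / f.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)
            copied.append(f.relative_to(src).as_posix())
    return copied


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = Path(tmp, "src"), Path(tmp, "dst")
        for rel in ("uprops.icu", "cnvalias.icu", "nfc.nrm", "root.res", "coll/root.res"):
            path = src / rel
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(b"")  # empty placeholder files for the demo
        print(select_unicode_data(src, dst))
```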

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
  for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
  in new blocks (Blocks.txt)
  Unicode 13:
    diak 11950..11959 Dives_Akuru
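
The search for new decimal digits can be sketched in Python as well. The function below looks for the digit four of each set, as the egrep does, and derives the zero..nine range from it, relying on decimal digit sets being encoded contiguously. The function name and the sample lines are illustrative; the samples are simplified stand-ins for ppucd.txt lines, which carry many more fields.

```python
def decimal_digit_ranges(lines):
    """Return (zero, nine) code point pairs for each character with gc=Nd and nv=4."""
    ranges = []
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) < 2 or fields[0] != "cp":
            continue
        props = fields[2:]
        if "gc=Nd" in props and "nv=4" in props:
            four = int(fields[1], 16)
            # Digits 0..9 of one set occupy ten consecutive code points.
            ranges.append((four - 4, four + 5))
    return ranges


# Simplified, hypothetical sample lines in the style of ppucd.txt.
SAMPLE = [
    "cp;0041;gc=Lu",
    "cp;11950;gc=Nd;nv=0",
    "cp;11954;gc=Nd;nv=4",
]

if __name__ == "__main__":
    for zero, nine in decimal_digit_ranges(SAMPLE):
        print("%04X..%04X" % (zero, nine))  # prints 11950..11959
```

Unlike the egrep pattern, this compares whole fields, so a value such as nv=40 would not be picked up by accident.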

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***

Unicode 12.1 update for ICU 64.2

** This is an abbreviated update with one new character for the new
** Japanese era expected to start on 2019-May-01: U+32FF SQUARE ERA NAME REIWA
https://en.wikipedia.org/wiki/Reiwa_period

http://www.unicode.org/versions/Unicode12.1.0/

ICU-20497 Unicode 12.1

cldrbug 11978: Unicode 12.1

* Command-line environment setup

UNICODE_DATA=~/unidata/uni121/20190403
CLDR_SRC=~/svn.cldr/uni
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt64b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

*** Unicode version numbers
- makedata.mak
- uchar.h
- com.ibm.icu.util.VersionInfo
- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  so that the makefiles see the new version number.
    cd $ICU_ROOT/dbg/icu4c
    ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

*** data files & enums & parser code

* download files
- mkdir -p $UNICODE_DATA
- download Unicode files into $UNICODE_DATA
  + subfolders: emoji, idna, security, ucd, uca
  + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)

* process and/or copy files
- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
  + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  + For debugging, and tweaking how ppucd.txt is written,
    the tool has an --only_ppucd option:
    py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

- cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* update spoof checker UnicodeSet initializers:
    inclusionPat & recommendedPat in uspoof.cpp
    INCLUSION & RECOMMENDED in SpoofChecker.java
- make sure that the Unicode Tools tree contains the latest security data files
- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
- update the hardcoded version number there in the DIRECTORY path
- run the tool (no special environment variables needed)
- copy & paste from the Console output into the .cpp & .java files

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.

    $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

* build Unicode tools using CMake+make

$ICU_SRC/tools/unicode/c/icudefs.txt:

# Location (--prefix) of where ICU was installed.
set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
# Location of the ICU4C source tree.
set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)

    $ICU_ROOT/dbg$
    mkdir -p tools/unicode/c
    cd tools/unicode/c

    $ICU_ROOT/dbg/tools/unicode/c$
    cmake ../../../../src/tools/unicode/c
    make

* generate core properties data files
    $ICU_ROOT/dbg/tools/unicode/c$
    genprops/genprops $ICU_SRC/icu4c
    genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
    genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
- rebuild ICU (make install) & tools

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..12.1: U+2260, U+226E, U+226F
- nothing new in this Unicode version, no test file to update

* run & fix ICU4C tests
- Andy handles RBBI & spoof check test failures

* collation: CLDR collation root, UCA DUCET

- UCA DUCET goes into Mark's Unicode tools, see
  https://sites.google.com/site/unicodetools/home#TOC-UCA
  diff the main mapping file, look for bad changes
  (for example, more bytes per weight for common characters)
    ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.1.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.1.txt
    ~/svn.unitools/trunk$ meld ../frac-12.txt ../frac-12.1.txt

- CLDR root data files are checked into $CLDR_SRC/common/uca/
    cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
    cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
    cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
    (note removing the underscore before "Rules")
    cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
- update (ICU4C)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  from the CLDR root files (..._CLDR_..._SHORT.txt)
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
    cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
    cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
- if CLDR common/uca/unihan-index.txt changes, then update
  CLDR common/collation/root.xml <collation type="private-unihan">
  and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

- run genuca, see command line above
- rebuild ICU4C

* Unihan collators
  https://sites.google.com/site/unicodetools/unihan
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollators
  with VM arguments
    -ea
    -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
    -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
    -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
    -DUVERSION=12.1.0
- run Unicode Tools
    org.unicode.draft.GenerateUnihanCollatorFiles
  with the same arguments
- check CLDR diffs
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
- copy to CLDR
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
- run CLDR unit tests, commit to CLDR
- generate ICU zh collation data: run CLDR
    org.unicode.cldr.icu.NewLdml2IcuConverter
  with program arguments
    -t collation
    -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
    -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
    -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
    -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
    zh
  and VM arguments
    -ea
    -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
- rebuild ICU4C

* run & fix ICU4C tests, now with new CLDR collation root data
- run all tests with the collation test data *_SHORT.txt or the full files
  (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* update Java data files
- refresh just the UCD/UCA-related/derived files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt64b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt64l.dat ./out/icu4j/icudt64b.dat -s ./out/build/icudt64l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt64b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt64b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt64b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

* When refreshing all of ICU4J data from ICU4C
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
or
- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

* update CollationFCD.java
  + copy & paste the initializers of lcccIndex[] etc. from
    ICU4C/source/i18n/collationfcd.cpp to
    ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd $ICU_SRC/icu4c/source/data/unidata
    cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cd ../../test/testdata
    cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
    cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

* run & fix ICU4J tests

*** API additions
- send notice to icu-design about new born-@stable API (enum constants etc.)

*** CLDR numbering systems
- look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR
  for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
  in new blocks (Blocks.txt)
  Unicode 12: using Unicode 12 CLDR ticket #11478
    hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
    wcho 1E2F0..1E2F9 Wancho
  Unicode 11: using Unicode 11 CLDR ticket #10978
    rohg 10D30..10D39 Hanifi_Rohingya
    gong 11DA0..11DA9 Gunjala_Gondi
  Earlier: CLDR tickets specific to adding new numbering systems.
    Unicode 10: http://unicode.org/cldr/trac/ticket/10219
    Unicode 9: http://unicode.org/cldr/trac/ticket/9692

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C
- make sure that changes to Unicode tools are checked in:
  http://www.unicode.org/utility/trac/log/trunk/unicodetools

---------------------------------------------------------------------------- ***
645
646Unicode 12.0 update for ICU 64
647
648http://www.unicode.org/versions/Unicode12.0.0/
649http://unicode.org/versions/beta-12.0.0.html
650https://www.unicode.org/review/pri389/
651http://www.unicode.org/reports/uax-proposed-updates.html
652http://www.unicode.org/reports/tr44/tr44-23.html
653
654ICU-20203 Unicode 12
655
656ICU-20111 move text layout properties data into a data file
657
658cldrbug 11478: Unicode 12
659Accidentally used ^/trunk instead of ^/branches/markus/uni12
660
661* Command-line environment setup
662
663UNICODE_DATA=~/unidata/uni12/20190309
664CLDR_SRC=~/svn.cldr/uni
665ICU_ROOT=~/icu/uni
666ICU_SRC=$ICU_ROOT/src
667ICUDT=icudt63b
668ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
669ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
670export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
671
672*** Unicode version numbers
673- makedata.mak
674- uchar.h
675- com.ibm.icu.util.VersionInfo
676- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
677
678- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
679 so that the makefiles see the new version number.
680
681*** data files & enums & parser code
682
683* download files
684- mkdir -p $UNICODE_DATA
685- download Unicode files into $UNICODE_DATA
686 + subfolders: emoji, idna, security, ucd, uca
687 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
688
689* for manual diffs and for Unicode Tools input data updates:
690 remove version suffixes from the file names
691 ~$ unidata/desuffixucd.py $UNICODE_DATA
692 (see https://sites.google.com/site/unicodetools/inputdata)
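
For illustration, removing a version suffix from a file name might look like the following. This is not the actual desuffixucd.py; the suffix shape (for example "-12.0.0" or "-12.0.0d3" directly before the extension) and the function name are assumptions.

```python
import re

# Assumed suffix shape: "-<major>.<minor>.<update>", optionally followed by a
# draft marker such as "d3", directly before the file extension.
VERSION_SUFFIX = re.compile(r"-\d+\.\d+\.\d+(?:d\d+)?(?=\.[A-Za-z]+$)")

def strip_version_suffix(file_name):
    """Return file_name without its version suffix; other names are unchanged."""
    return VERSION_SUFFIX.sub("", file_name)
```

Names without a version suffix, such as emoji-data.txt, pass through unchanged, so running it repeatedly over the same folder is harmless.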
693
694* process and/or copy files
695- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
696 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
697 + For debugging, and tweaking how ppucd.txt is written,
698 the tool has an --only_ppucd option:
699 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
700
701- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
702
703* build ICU (make install)
704 so that the tools build can pick up the new definitions from the installed header files.
705
706 $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
707
708* new constants for new property values
709- preparseucd.py error:
710 ValueError: missing uchar.h enum constants for some property values:
711 [(u'blk', set([u'Symbols_And_Pictographs_Ext_A', u'Elymaic',
712 u'Ottoman_Siyaq_Numbers', u'Nandinagari', u'Nyiakeng_Puachue_Hmong',
713 u'Small_Kana_Ext', u'Egyptian_Hieroglyph_Format_Controls', u'Wancho', u'Tamil_Sup'])),
714 (u'sc', set([u'Nand', u'Wcho', u'Elym', u'Hmnp']))]
715 = PropertyValueAliases.txt new property values (diff old & new .txt files)
716 blk; Egyptian_Hieroglyph_Format_Controls; Egyptian_Hieroglyph_Format_Controls
717 blk; Elymaic ; Elymaic
718 blk; Nandinagari ; Nandinagari
719 blk; Nyiakeng_Puachue_Hmong ; Nyiakeng_Puachue_Hmong
720 blk; Ottoman_Siyaq_Numbers ; Ottoman_Siyaq_Numbers
721 blk; Small_Kana_Ext ; Small_Kana_Extension
722 blk; Symbols_And_Pictographs_Ext_A ; Symbols_And_Pictographs_Extended_A
723 blk; Tamil_Sup ; Tamil_Supplement
724 blk; Wancho ; Wancho
725 -> add to uchar.h
726 use long property names for enum constants,
727 for the trailing comment get the block start code point: diff old & new Blocks.txt
728 -> add to UCharacter.UnicodeBlock IDs
729 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
730 replace public static final int \1_ID = \2; \3
731 -> add to UCharacter.UnicodeBlock objects
732 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
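
The two Eclipse find/replace pairs can be tried out with Python's re module, which uses the same \1, \2 back-reference syntax. This is a sketch: the enum value 291 in the sample line is made up, and the function names are hypothetical. Note that the second pattern has only two capture groups (the number is not captured), so the trailing comment is group 2 there.

```python
import re

# uchar.h enum line -> UCharacter.UnicodeBlock ID constant
ID_FIND = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
ID_REPLACE = r"public static final int \1_ID = \2; \3"

# uchar.h enum line -> UCharacter.UnicodeBlock object
OBJ_FIND = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
OBJ_REPLACE = r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2'

def to_java_id(uchar_h_line):
    return ID_FIND.sub(ID_REPLACE, uchar_h_line.strip())

def to_java_object(uchar_h_line):
    return OBJ_FIND.sub(OBJ_REPLACE, uchar_h_line.strip())

# Hypothetical input line; the numeric value is invented for the example.
sample = "    UBLOCK_ELYMAIC = 291, /*[10FE0]*/"
id_line = to_java_id(sample)
object_line = to_java_object(sample)
```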
734
735 sc ; Elym ; Elymaic
736 sc ; Hmnp ; Nyiakeng_Puachue_Hmong
737 sc ; Nand ; Nandinagari
738 sc ; Wcho ; Wancho
739 -> uscript.h & com.ibm.icu.lang.UScript
740 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
741 and in com.ibm.icu.dev.test.lang.TestUScript.java
742
743* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
744 (not strictly necessary for NOT_ENCODED scripts)
745 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
746
747* update spoof checker UnicodeSet initializers:
748 inclusionPat & recommendedPat in uspoof.cpp
749 INCLUSION & RECOMMENDED in SpoofChecker.java
750- make sure that the Unicode Tools tree contains the latest security data files
751- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
752- update the hardcoded version number there in the DIRECTORY path
753- run the tool (no special environment variables needed)
754- copy & paste from the Console output into the .cpp & .java files
755
756* generate normalization data files
757 cd $ICU_ROOT/dbg/icu4c
758 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
759 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
760 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
761 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
762 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
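
The four .nrm command lines above share one pattern: each output file is built from nfc.txt plus its own additional input files. A small sketch that generates them (the function and table names are hypothetical; only the -o and -s options shown above are used, and the --csource header variant is left out):

```python
# Output file -> input files, in the order used on the command lines above.
NORM_FILES = {
    "nfc.nrm": ["nfc.txt"],
    "nfkc.nrm": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf.nrm": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46.nrm": ["nfc.txt", "uts46.txt"],
}

def gennorm2_commands(data_in, unidata):
    """Return one argv list per .nrm file, to be run from $ICU_ROOT/dbg/icu4c."""
    return [
        ["bin/gennorm2", "-o", f"{data_in}/{out}", "-s", f"{unidata}/norm2", *inputs]
        for out, inputs in NORM_FILES.items()
    ]
```

Each returned list could be passed to subprocess.run(), with data_in and unidata taken from $ICU4C_DATA_IN and $ICU4C_UNIDATA.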
763
764* build ICU (make install)
765 so that the tools build can pick up the new definitions from the installed header files.
766
767 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
768
769* build Unicode tools using CMake+make
770
771$ICU_SRC/tools/unicode/c/icudefs.txt:
772
773# Location (--prefix) of where ICU was installed.
774set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
775# Location of the ICU4C source tree.
776set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)
777
778 $ICU_ROOT/dbg$
779 mkdir -p tools/unicode/c
780 cd tools/unicode/c
781
782 $ICU_ROOT/dbg/tools/unicode/c$
783 cmake ../../../../src/tools/unicode/c
784 make
785
786* generate core properties data files
787 $ICU_ROOT/dbg/tools/unicode/c$
788 genprops/genprops $ICU_SRC/icu4c
789 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
790 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
791- rebuild ICU (make install) & tools
792
793* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
794 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
795- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
796- Unicode 6.0..12.0: U+2260, U+226E, U+226F
797- nothing new in this Unicode version, no test file to update
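
The grep step can be sketched as a small filter. It assumes IdnaMappingTable.txt lines of the form "<code point or range> ; <status> [; <mapping>] # comment"; the function name is hypothetical.

```python
def non_ascii_std3_valid(lines):
    """Return the code points or ranges above U+007F with status disallowed_STD3_valid."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0]          # drop the trailing comment
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start = int(fields[0].split("..")[0], 16)
        if start > 0x7F:                      # ASCII entries are expected, skip them
            found.append(fields[0])
    return found
```

If the result contains anything beyond the code points already listed above, the UTS #46 tests need new cases.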
798
799* run & fix ICU4C tests
800- update test of default bidi classes:
801 Bidi range \U0001ED00-\U0001ED4F changes default from R to AL,
802 see diffs in DerivedBidiClass.txt
803 + /tsutil/cucdtst/TestUnicodeData enumDefaultsRange() defaultBidi[]
804 + UCharacterTest.java TestIteration() defaultBidi[]
805- Andy handles RBBI & spoof check test failures
806
807* collation: CLDR collation root, UCA DUCET
808
809- UCA DUCET goes into Mark's Unicode tools, see
810 https://sites.google.com/site/unicodetools/home#TOC-UCA
811 diff the main mapping file, look for bad changes
812 (for example, more bytes per weight for common characters)
813 ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.txt
814 ~/svn.unitools/trunk$ meld ../frac-11.txt ../frac-12.txt
815
816- CLDR root data files are checked into $CLDR_SRC/common/uca/
817 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
818
819- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
820 cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
821- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
822 cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
823 (note removing the underscore before "Rules")
824 cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
825- restore TODO diffs in UCARules.txt
826 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
827- update (ICU4C)/source/test/testdata/CollationTest_*.txt
828 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
829 from the CLDR root files (..._CLDR_..._SHORT.txt)
830 cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
831 cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
832 cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
833- if CLDR common/uca/unihan-index.txt changes, then update
834 CLDR common/collation/root.xml <collation type="private-unihan">
835 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
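
The file renames in the cp commands above follow a simple rule, sketched here with a hypothetical helper: the collation test files drop "_CLDR", while the other two files drop "_SHORT" (and UCA_Rules also loses its underscore).

```python
def icu_name(cldr_name):
    """Map a CLDR common/uca file name to the name used in the ICU source tree."""
    if cldr_name.startswith("CollationTest_CLDR_"):
        # CollationTest_CLDR_SHIFTED_SHORT.txt -> CollationTest_SHIFTED_SHORT.txt
        return cldr_name.replace("_CLDR_", "_", 1)
    # FractionalUCA_SHORT.txt -> FractionalUCA.txt
    # UCA_Rules_SHORT.txt     -> UCARules.txt
    return cldr_name.replace("_SHORT", "").replace("UCA_Rules", "UCARules")
```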
836
837- run genuca, see command line above;
838 deal with
839 Error: Unknown script for first-primary sample character U+119CE on line 29233 of /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
840 FDD1 119CE; [71 CD 02, 05, 05] # Nandinagari first primary (compressible)
841 (add the character to genuca.cpp sampleCharsToScripts[])
842 + This time, I added code to genuca.cpp to use uscript_getSampleUnicodeString(script)
843 and cache its values.
844 Works as long as the script metadata is updated before the collation data.
845- rebuild ICU4C
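
To see in advance which scripts genuca will ask about, the first-primary lines can be pulled out of FractionalUCA.txt. A minimal sketch, assuming the line shape shown in the error message above; the regular expression and function name are not taken from genuca.

```python
import re

# Matches lines like:
# FDD1 119CE; [71 CD 02, 05, 05] # Nandinagari first primary (compressible)
FIRST_PRIMARY = re.compile(r"FDD1 ([0-9A-F]{4,6}); \[[^\]]*\] # (.+?) first primary")

def first_primary_samples(lines):
    """Return {sample code point: script name from the comment}."""
    samples = {}
    for line in lines:
        m = FIRST_PRIMARY.match(line.strip())
        if m:
            samples[int(m.group(1), 16)] = m.group(2)
    return samples
```

Diffing the result for the old and new FractionalUCA.txt shows the new sample characters, and thus the scripts whose metadata must be in place before the collation data is built.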
846
847* Unihan collators
848 https://sites.google.com/site/unicodetools/unihan
849- run Unicode Tools
850 org.unicode.draft.GenerateUnihanCollators
851 with VM arguments
852 -ea
853 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
854 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
855 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
856 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
857 -DUVERSION=12.0.0
858- run Unicode Tools
859 org.unicode.draft.GenerateUnihanCollatorFiles
860 with the same arguments
861- check CLDR diffs
862 cd $CLDR_SRC
863 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
864 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
865- copy to CLDR
866 cd $CLDR_SRC
867 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
868 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
869- run CLDR unit tests, commit to CLDR
870- generate ICU zh collation data: run CLDR
871 org.unicode.cldr.icu.NewLdml2IcuConverter
872 with program arguments
873 -t collation
874 -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
875 -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
876 -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
877 -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
878 zh
879 and VM arguments
880 -ea
881 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
882- rebuild ICU4C
883
884* run & fix ICU4C tests, now with new CLDR collation root data
885- run all tests with the collation test data *_SHORT.txt or the full files
886 (the full ones have comments, useful for debugging)
887- note on intltest: if collate/UCAConformanceTest fails, then
888 utility/MultithreadTest/TestCollators will fail as well;
889 fix the conformance test before looking into the multi-thread test
890
891* update Java data files
892- refresh just the UCD/UCA-related/derived files, just to be safe
893- see (ICU4C)/source/data/icu4j-readme.txt
894- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
895- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
896 output:
897 ...
898 Unicode .icu files built to ./out/build/icudt63l
899 echo timestamp > uni-core-data
900 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt63b
901 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b
902 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
903 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt63l.dat ./out/icu4j/icudt63b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt63l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt63b
904 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b"
905 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt63b/
906 mkdir -p /tmp/icu4j/main/shared/data
907 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
908 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt63b/
909 mkdir -p /tmp/icu4j/main/shared/data
910 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
911 make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
912- copy the big-endian Unicode data files to another location,
913 separate from the other data files,
914 and then refresh ICU4J
915 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
916 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
917 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
918 cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
919 cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
920 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
921 cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
922 cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
923 cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
924 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
925
926* When refreshing all of ICU4J data from ICU4C
927- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
928- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
929or
930- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
931
932* update CollationFCD.java
933 + copy & paste the initializers of lcccIndex[] etc. from
934 ICU4C/source/i18n/collationfcd.cpp to
935 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
936
937* refresh Java test .txt files
938- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
939 cd $ICU_SRC/icu4c/source/data/unidata
940 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
941 cd ../../test/testdata
942 cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
943 cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
944
945* run & fix ICU4J tests
946
947*** API additions
948- send notice to icu-design about new born-@stable API (enum constants etc.)
949
950*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
952 for example, look for
953 ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
954 in new blocks (Blocks.txt)
955 Unicode 12: using Unicode 12 CLDR ticket #11478
956 hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
957 wcho 1E2F0..1E2F9 Wancho
958 Unicode 11: using Unicode 11 CLDR ticket #10978
959 rohg 10D30..10D39 Hanifi_Rohingya
960 gong 11DA0..11DA9 Gunjala_Gondi
961 Earlier: CLDR tickets specific to adding new numbering systems.
962 Unicode 10: http://unicode.org/cldr/trac/ticket/10219
963 Unicode 9: http://unicode.org/cldr/trac/ticket/9692
964
965*** merge the Unicode update branches back onto the trunk
966- do not merge the icudata.jar and testdata.jar,
967 instead rebuild them from merged & tested ICU4C
968- make sure that changes to Unicode tools are checked in:
969 http://www.unicode.org/utility/trac/log/trunk/unicodetools
970
971---------------------------------------------------------------------------- ***
972
ICU 63: add ICU support for the text layout properties InPC, InSC, vo
974
975* Command-line environment setup
976
977UNICODE_DATA=~/unidata/uni11/20180609
978CLDR_SRC=~/svn.cldr/uni
979ICU_ROOT=~/icu/mine
980ICU_SRC=$ICU_ROOT/src
981ICUDT=icudt62b
982ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
983ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
984export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
985
986*** Links
987
988https://unicode-org.atlassian.net/browse/ICU-8966 InPC & InSC
989https://unicode-org.atlassian.net/browse/ICU-12850 vo
990
991*** data files & enums & parser code
992
993* API additions
994- for each of the three new enumerated properties
995 + uchar.h: add the enum UProperty constant UCHAR_<long prop name>
996 + uchar.h: update UCHAR_INT_LIMIT
997 + uchar.h: add the enum U<long prop name>
998 with constants U_<short prop name>_<long value name>
999 + UProperty.java: add the constant <long prop name>
1000 + UProperty.java: update INT_LIMIT
1001 + UCharacter.java: add the interface <long prop name>
1002 with constants <long value name>
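
The naming patterns above can be spelled out with a few helper functions. The helpers are hypothetical and only illustrate the pattern; the inputs are the long and short property names and long value names from PropertyAliases.txt and PropertyValueAliases.txt.

```python
def c_property_constant(long_prop):
    """uchar.h enum UProperty constant: UCHAR_<long prop name>."""
    return "UCHAR_" + long_prop.upper()

def c_value_constant(short_prop, long_value):
    """uchar.h value constant: U_<short prop name>_<long value name>."""
    return f"U_{short_prop.upper()}_{long_value.upper()}"

def java_property_constant(long_prop):
    """UProperty.java constant: <long prop name> in upper case."""
    return long_prop.upper()

def java_interface_name(long_prop):
    """UCharacter.java nested interface: <long prop name> without underscores."""
    return long_prop.replace("_", "")
```

For example, Indic_Positional_Category (InPC) with value Top yields UCHAR_INDIC_POSITIONAL_CATEGORY, U_INPC_TOP, and the interface name IndicPositionalCategory.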
1003
1004* process and/or copy files
1005- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
1006 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1007 + It also writes tools/unicode/c/genprops/pnames_data.h with property and value
1008 names and aliases.
1009 + For debugging, and tweaking how ppucd.txt is written,
1010 the tool has an --only_ppucd option:
1011 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
1012
1013* preparseucd.py changes
1014- add new property short names (uppercase) to _prop_and_value_re
1015 so that ParseUCharHeader() parses the new enum constants
1016
1017* build ICU (make install)
1018 so that the tools build can pick up the new definitions from the installed header files.
1019
1020 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
1021
1022* build Unicode tools using CMake+make
1023
1024$ICU_SRC/tools/unicode/c/icudefs.txt:
1025
1026# Location (--prefix) of where ICU was installed.
1027set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
1028# Location of the ICU4C source tree.
1029set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/mine/src/icu4c)
1030
1031 $ICU_ROOT/dbg$
1032 mkdir -p tools/unicode/c
1033 cd tools/unicode/c
1034
1035 $ICU_ROOT/dbg/tools/unicode/c$
 cmake ../../../../src/tools/unicode/c
1037 make
1038
1039* generate core properties data files
1040 $ICU_ROOT/dbg/tools/unicode/c$
1041 genprops/genprops $ICU_SRC/icu4c
1042- rebuild ICU (make install) & tools
1043
1044* write data for runtime, hardcoded for now
1045- add genprops/layoutpropsbuilder.cpp with pieces from sibling files
1046- generate new icu4c/source/common/ulayout_props_data.h
1047- for each of the three new enumerated properties
1048 + int property max value
1049 + small, 8-bit UCPTrie
1050 (A small 16-bit trie with bit fields for these three properties
1051 is very nearly the same size as the sum of the three.)
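
For comparison, the bit-field alternative mentioned in the parenthetical could look like this. It is a sketch only: the field widths are assumptions chosen so that the three values fit into 16 bits, not the layout of any ICU data structure.

```python
# Assumed field widths (hypothetical): together 13 bits, which fits a 16-bit trie value.
INPC_BITS, INSC_BITS, VO_BITS = 5, 6, 2

def pack(inpc, insc, vo):
    """Combine the three property values into one 16-bit word."""
    assert 0 <= inpc < (1 << INPC_BITS)
    assert 0 <= insc < (1 << INSC_BITS)
    assert 0 <= vo < (1 << VO_BITS)
    return inpc | (insc << INPC_BITS) | (vo << (INPC_BITS + INSC_BITS))

def unpack(word):
    """Split a packed word back into (inpc, insc, vo)."""
    inpc = word & ((1 << INPC_BITS) - 1)
    insc = (word >> INPC_BITS) & ((1 << INSC_BITS) - 1)
    vo = (word >> (INPC_BITS + INSC_BITS)) & ((1 << VO_BITS) - 1)
    return inpc, insc, vo
```

With three separate 8-bit tries, each getter is a plain lookup without masking and shifting, and according to the size note above the combined trie would not save much space.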
1052
1053* wire into C++
1054- uprops.cpp: #include ulayout_props_data.h
1055- uprops.cpp: add getInPC() etc. functions
1056- uprops.cpp: add lines to intProps[], include max values
1057- uprops.h: add UPropertySource constants
1058- uprops.cpp: add uprops_addPropertyStarts(src)
1059- uniset_props.cpp: add to UnicodeSet_initInclusion()
1060- intltest/ucdtest.cpp: write unit tests
1061
1062* update Java data files
1063- refresh just the pnames.icu file with the new property [value] names, just to be safe
1064- see $ICU_SRC/icu4c/source/data/icu4j-readme.txt
1065- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1066- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1067- copy the big-endian Unicode data files to another location,
1068 separate from the other data files,
1069 and then refresh ICU4J
1070 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
1071 cp com/ibm/icu/impl/data/$ICUDT/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1072 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1073
1074* wire into Java
1075- UCharacterProperty.java: add new SRC_INPC etc. constants as in C++
1076- UCharacterProperty.java: for each new property
1077 + create a nested class to hold its CodePointTrie
1078 + initialize it from a string literal
1079 + paste in the initializer printed by genprops
1080 + add a new IntProperty object to the intProps[] array
1081 + use the correct max int value for each property, also printed by genprops
1082- UCharacterProperty.java: add ulayout_addPropertyStarts(src, set)
1083- UnicodeSet.java: add to getInclusions()
1084- UCharacterTest.java: write unit tests
1085
1086---------------------------------------------------------------------------- ***
1087
1088Unicode 11.0 update for ICU 62
1089
1090http://www.unicode.org/versions/Unicode11.0.0/
1091http://unicode.org/versions/beta-11.0.0.html
1092https://www.unicode.org/review/pri372/
1093http://www.unicode.org/reports/uax-proposed-updates.html
1094http://www.unicode.org/reports/tr44/tr44-21.html
1095
1096* Command-line environment setup
1097
1098UNICODE_DATA=~/unidata/uni11/20180521
1099CLDR_SRC=~/svn.cldr/uni
1100ICU_ROOT=~/svn.icu/uni
1101ICU_SRC=$ICU_ROOT/src
1102ICUDT=icudt61b
1103ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
1104ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
1105export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
1106
1107*** ICU Trac
1108
1109- ticket:13630: Unicode 11
1110- ^/branches/markus/uni11
1111
1112*** CLDR Trac
1113
1114- cldrbug 10978: Unicode 11
1115- ^/branches/markus/uni11
1116
1117*** Unicode version numbers
1118- makedata.mak
1119- uchar.h
1120- com.ibm.icu.util.VersionInfo
1121- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
1122
1123- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
1124 so that the makefiles see the new version number.
1125
1126*** data files & enums & parser code
1127
1128* download files
1129- mkdir -p $UNICODE_DATA
1130- download Unicode files into $UNICODE_DATA
1131 + subfolders: emoji, idna, security, ucd, uca
1132 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
1133
1134* for manual diffs and for Unicode Tools input data updates:
1135 remove version suffixes from the file names
1136 ~$ unidata/desuffixucd.py $UNICODE_DATA
1137 (see https://sites.google.com/site/unicodetools/inputdata)
1138
1139* process and/or copy files
1140- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
1141 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1142 + For debugging, and tweaking how ppucd.txt is written,
1143 the tool has an --only_ppucd option:
1144 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
1145
1146- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
1147
1148* build ICU (make install)
1149 so that the tools build can pick up the new definitions from the installed header files.
1150
1151 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
1152
1153* preparseucd.py changes
1154- fix other errors
1155 NameError: unknown property Extended_Pictographic
1156 -> add Extended_Pictographic binary property
1157 -> add new short names for all Emoji properties
1158
1159* new constants for new property values
1160- preparseucd.py error:
1161 ValueError: missing uchar.h enum constants for some property values:
1162 [(u'blk', set([u'Georgian_Ext', u'Hanifi_Rohingya', u'Medefaidrin', u'Sogdian', u'Makasar',
1163 u'Old_Sogdian', u'Dogra', u'Gunjala_Gondi', u'Chess_Symbols', u'Mayan_Numerals',
1164 u'Indic_Siyaq_Numbers'])),
1165 (u'jg', set([u'Hanifi_Rohingya_Kinna_Ya', u'Hanifi_Rohingya_Pa'])),
1166 (u'sc', set([u'Medf', u'Sogd', u'Dogr', u'Rohg', u'Maka', u'Sogo', u'Gong'])),
1167 (u'GCB', set([u'LinkC', u'Virama'])),
1168 (u'WB', set([u'WSegSpace']))]
1169 = PropertyValueAliases.txt new property values (diff old & new .txt files)
1170 blk; Chess_Symbols ; Chess_Symbols
1171 blk; Dogra ; Dogra
1172 blk; Georgian_Ext ; Georgian_Extended
1173 blk; Gunjala_Gondi ; Gunjala_Gondi
1174 blk; Hanifi_Rohingya ; Hanifi_Rohingya
1175 blk; Indic_Siyaq_Numbers ; Indic_Siyaq_Numbers
1176 blk; Makasar ; Makasar
1177 blk; Mayan_Numerals ; Mayan_Numerals
1178 blk; Medefaidrin ; Medefaidrin
1179 blk; Old_Sogdian ; Old_Sogdian
1180 blk; Sogdian ; Sogdian
1181 -> add to uchar.h
1182 use long property names for enum constants,
1183 for the trailing comment get the block start code point: diff old & new Blocks.txt
1184 -> add to UCharacter.UnicodeBlock IDs
1185 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
1186 replace public static final int \1_ID = \2; \3
1187 -> add to UCharacter.UnicodeBlock objects
1188 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
1189 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1190
1191 GCB; LinkC ; LinkingConsonant
1192 GCB; Virama ; Virama
1193 -> uchar.h & UCharacter.GraphemeClusterBreak
1194 -> these two later removed again: http://www.unicode.org/L2/L2018/18115.htm#155-A76
1195
1196 InSC; Consonant_Initial_Postfixed ; Consonant_Initial_Postfixed
1197 -> ignore: ICU does not yet support this property
1198
1199 jg ; Hanifi_Rohingya_Kinna_Ya ; Hanifi_Rohingya_Kinna_Ya
1200 jg ; Hanifi_Rohingya_Pa ; Hanifi_Rohingya_Pa
1201 -> uchar.h & UCharacter.JoiningGroup
1202
1203 sc ; Dogr ; Dogra
1204 sc ; Gong ; Gunjala_Gondi
1205 sc ; Maka ; Makasar
1206 sc ; Medf ; Medefaidrin
1207 sc ; Rohg ; Hanifi_Rohingya
1208 sc ; Sogd ; Sogdian
1209 sc ; Sogo ; Old_Sogdian
1210 -> uscript.h & com.ibm.icu.lang.UScript
1211 -> Nushu had been added already
1212 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
1213 and in com.ibm.icu.dev.test.lang.TestUScript.java
1214
1215 WB ; WSegSpace ; WSegSpace
1216 -> uchar.h & UCharacter.WordBreak
1217
1218* New short names for emoji properties
1219- see UTS #51
1220- short names set in preparseucd.py
1221
1222* New properties
1223- boolean emoji property Extended_Pictographic
1224 -> added in preparseucd.py
1225 -> uchar.h & UProperty.java
1226- misc. property Equivalent_Unified_Ideograph (EqUIdeo)
1227 as shown in PropertyValueAliases.txt
1228 -> ignore for now
1229
1230* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
1231 (not strictly necessary for NOT_ENCODED scripts)
1232 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
1233
1234* update spoof checker UnicodeSet initializers:
1235 inclusionPat & recommendedPat in uspoof.cpp
1236 INCLUSION & RECOMMENDED in SpoofChecker.java
1237- make sure that the Unicode Tools tree contains the latest security data files
1238- go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
1239- update the hardcoded version number there in the DIRECTORY path
1240- run the tool (no special environment variables needed)
1241- copy & paste from the Console output into the .cpp & .java files
1242
1243* generate normalization data files
1244 cd $ICU_ROOT/dbg/icu4c
1245 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
1246 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
1247 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
1248 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
1249 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
1250
1251* build ICU (make install)
1252 so that the tools build can pick up the new definitions from the installed header files.
1253
1254 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
1255
1256* build Unicode tools using CMake+make
1257
1258$ICU_SRC/tools/unicode/c/icudefs.txt:
1259
1260# Location (--prefix) of where ICU was installed.
1261set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
1262# Location of the ICU4C source tree.
1263set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c)
1264
1265 $ICU_ROOT/dbg$
1266 mkdir -p tools/unicode/c
1267 cd tools/unicode/c
1268
1269 $ICU_ROOT/dbg/tools/unicode/c$
1270 cmake ../../../../src/tools/unicode/c
1271 make
1272
1273* generate core properties data files
1274 $ICU_ROOT/dbg/tools/unicode/c$
1275 genprops/genprops $ICU_SRC/icu4c
1276 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
1277 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
1278- rebuild ICU (make install) & tools
1279
1280* Fix case props
1281 genprops error: casepropsbuilder: too many exceptions words
1282 genprops error: failure finalizing the data - U_BUFFER_OVERFLOW_ERROR
1283- With the addition of Georgian Mtavruli capital letters,
1284 there are now too many simple case mappings with big mapping deltas
1285 that yield uncompressible exceptions.
1286- Changing the data structure (now formatVersion 4),
1287 adding one bit for no-simple-case-folding (for Cherokee), and
1288 one optional slot for a big delta (for most faraway mappings),
1289 together with another bit for whether that is negative.
1290 This makes most Cherokee & Georgian etc. case mappings compressible,
1291 reducing the number of exceptions words.
- Further changes gain one more bit for the exceptions index,
 for future growth. For details, see casepropsbuilder.cpp.
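
The idea of a big-delta slot with a separate sign bit can be sketched as follows. The inline range and the tuple layout are assumptions for illustration and do not reproduce the ucase.icu format; see casepropsbuilder.cpp for the real one.

```python
def encode_delta(src, dst, inline_min=-128, inline_max=127):
    """Encode a simple case mapping src -> dst as a delta.

    Small deltas are stored inline (the range here is hypothetical).
    Faraway mappings use an exception slot holding the unsigned
    magnitude, plus one bit saying whether the delta is negative.
    """
    delta = dst - src
    if inline_min <= delta <= inline_max:
        return ("inline", delta)
    return ("big_delta", abs(delta), delta < 0)

def apply_delta(src, encoded):
    """Recover the mapped code point from src and its encoded delta."""
    if encoded[0] == "inline":
        return src + encoded[1]
    _, magnitude, negative = encoded
    return src - magnitude if negative else src + magnitude

# Georgian Mtavruli capital AN (U+1C90) lowercases to Mkhedruli AN (U+10D0).
mtavruli = encode_delta(0x1C90, 0x10D0)
```

Storing one magnitude and one sign bit takes less room than storing the full target code point in the exceptions data, which is how faraway mappings become compressible.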
1294
1295* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1296 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1297- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1298- Unicode 6.0..11.0: U+2260, U+226E, U+226F
1299- nothing new in this Unicode version, no test file to update
1300
1301* run & fix ICU4C tests
1302- Andy handles RBBI & spoof check test failures
1303
1304- Errors in char.txt, word.txt, word_POSIX.txt like
1305 createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET" at line 46, column 16
1306 because \p{Grapheme_Cluster_Break = EBG} and \p{Word_Break = EBG} are empty.
1307 -> Temporary(!) workaround: Add an arbitrary code point to these sets to make them
1308 not empty, just to get ICU building.
1309 -> Intermediate workaround: Remove $E_Base_GAZ and other now-unused variables
1310 and properties together with the rules that used them (GB 10, WB 14).
1311 -> Andy adjusts the rule sets further to sync with
1312 Unicode 11 grapheme, word, and line break spec changes.
1313
1314* collation: CLDR collation root, UCA DUCET
1315
1316- UCA DUCET goes into Mark's Unicode tools, see
1317 https://sites.google.com/site/unicodetools/home#TOC-UCA
1318 diff the main mapping file, look for bad changes
1319 (for example, more bytes per weight for common characters)
1320 ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/uca/11.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-11.txt
1321 ~/svn.unitools/trunk$ meld ../frac-10.txt ../frac-11.txt
1322
1323- CLDR root data files are checked into $CLDR_SRC/common/uca/
1324 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
1325
1326- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1327 cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
1328- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1329 cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
1330 (note removing the underscore before "Rules")
1331 cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
1332- restore TODO diffs in UCARules.txt
1333 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
1334- update (ICU4C)/source/test/testdata/CollationTest_*.txt
1335 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1336 from the CLDR root files (..._CLDR_..._SHORT.txt)
1337 cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1338 cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1339 cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
1340- if CLDR common/uca/unihan-index.txt changes, then update
1341 CLDR common/collation/root.xml <collation type="private-unihan">
1342 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
1343
1344- run genuca, see command line above;
1345 deal with
1346 Error: Unknown script for first-primary sample character U+1180B on line 28649 of /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
1347 FDD1 1180B; [71 CC 02, 05, 05] # Dogra first primary (compressible)
1348 (add the character to genuca.cpp sampleCharsToScripts[])
1349 + look up the USCRIPT_ code for the new sample characters
1350 (should be obvious from the comment in the error output)
1351 + *add* mappings to sampleCharsToScripts[], do not replace them
1352 (in case the script sample characters flip-flop)
1353 + insert new scripts in DUCET script order, see the top_byte table
1354 at the beginning of FractionalUCA.txt
1355- rebuild ICU4C
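When genuca reports several unknown first-primary sample characters, the relevant FractionalUCA.txt lines can be pulled apart mechanically. The following is a hedged sketch only: the function name is hypothetical, and the regular expression assumes the "FDD1 <code point>; [...] # <script> first primary" shape shown in the error output above. The matching USCRIPT_ constant still has to be looked up by hand.

```python
import re

# Assumed line shape, taken from the error output quoted in the log:
#   FDD1 1180B; [71 CC 02, 05, 05] # Dogra first primary (compressible)
_FIRST_PRIMARY = re.compile(r"^FDD1 ([0-9A-F]{4,6});.*#\s*(\S+) first primary")

def first_primary_sample(line):
    """Return (sample code point, script name from the comment), or None."""
    m = _FIRST_PRIMARY.match(line)
    if m is None:
        return None
    return int(m.group(1), 16), m.group(2)
```

The returned pair is what a new entry in sampleCharsToScripts[] needs: the sample character and a hint for which script code to map it to.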
1356
1357* Unihan collators
1358 https://sites.google.com/site/unicodetools/unihan
1359- run Unicode Tools
1360 org.unicode.draft.GenerateUnihanCollators
1361 with VM arguments
1362 -ea
1363 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
1364 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
1365 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
1366 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
1367 -DUVERSION=11.0.0
1368- run Unicode Tools
1369 org.unicode.draft.GenerateUnihanCollatorFiles
1370 with the same arguments
1371- check CLDR diffs
1372 cd $CLDR_SRC
1373 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
1374 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
1375- copy to CLDR
1376 cd $CLDR_SRC
1377 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
1378 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
1379- run CLDR unit tests, commit to CLDR
1380- generate ICU zh collation data: run CLDR
1381 org.unicode.cldr.icu.NewLdml2IcuConverter
1382 with program arguments
1383 -t collation
1384 -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
1385 -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
1386 -d /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/coll
1387 -p /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/xml/collation
1388 zh
1389 and VM arguments
1390 -ea
1391 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
1392- rebuild ICU4C
1393
1394* run & fix ICU4C tests, now with new CLDR collation root data
1395- run all tests with the collation test data *_SHORT.txt or the full files
1396 (the full ones have comments, useful for debugging)
1397- note on intltest: if collate/UCAConformanceTest fails, then
1398 utility/MultithreadTest/TestCollators will fail as well;
1399 fix the conformance test before looking into the multi-thread test
1400
1401* update Java data files
1402- refresh just the UCD/UCA-related/derived files, just to be safe
1403- see (ICU4C)/source/data/icu4j-readme.txt
1404- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1405- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1406 output:
1407 ...
1408 Unicode .icu files built to ./out/build/icudt61l
1409 echo timestamp > uni-core-data
1410 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt61b
1411 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b
1412 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
1413 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt61l.dat ./out/icu4j/icudt61b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt61l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt61b
1414 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b"
1415 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt61b/
1416 mkdir -p /tmp/icu4j/main/shared/data
1417 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1418 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt61b/
1419 mkdir -p /tmp/icu4j/main/shared/data
1420 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1421 make[1]: Leaving directory '/usr/local/google/home/mscherer/svn.icu/uni/dbg/icu4c/data'
1422- copy the big-endian Unicode data files to another location,
1423 separate from the other data files,
1424 and then refresh ICU4J
1425 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
1426 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1427 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1428 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1429 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1430 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1431 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1432 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1433 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1434 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
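The selective copy above (everything before the final "jar uvf") could be scripted roughly as follows. This is a sketch under assumptions: the function name is made up, and instead of copying cnvalias.icu and deleting it afterwards, it skips that file while copying.

```python
import glob
import os
import shutil

def copy_unicode_data(src_root, dst_root, icudt):
    """Copy only the Unicode-related data files of one icudt folder."""
    rel = os.path.join("com", "ibm", "icu", "impl", "data", icudt)
    src, dst = os.path.join(src_root, rel), os.path.join(dst_root, rel)
    copied = []
    for sub, pattern in [("", "confusables.cfu"), ("", "*.icu"), ("", "*.nrm"),
                         ("coll", "*"), ("brkitr", "*")]:
        os.makedirs(os.path.join(dst, sub), exist_ok=True)
        for path in sorted(glob.glob(os.path.join(src, sub, pattern))):
            name = os.path.basename(path)
            # The log removes cnvalias.icu after the copy; skip it here instead.
            if name == "cnvalias.icu" or not os.path.isfile(path):
                continue
            shutil.copy(path, os.path.join(dst, sub, name))
            copied.append(os.path.join(sub, name))
    return copied
```

Called as copy_unicode_data(".", "/tmp/icu4j", ICUDT) from the out/icu4j folder, it would leave /tmp/icu4j ready for the "jar uvf" step.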
1435
1436* When refreshing all of ICU4J data from ICU4C
1437- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1438- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
1439or
1440- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
1441
1442* update CollationFCD.java
1443 + copy & paste the initializers of lcccIndex[] etc. from
1444 ICU4C/source/i18n/collationfcd.cpp to
1445 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1446
1447* refresh Java test .txt files
1448- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1449 cd $ICU_SRC/icu4c/source/data/unidata
1450 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1451 cd ../../test/testdata
1452 cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1453 cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1454
1455* run & fix ICU4J tests
1456
1457*** API additions
1458- send notice to icu-design about new born-@stable API (enum constants etc.)
1459
1460*** CLDR numbering systems
1461- look for new sets of decimal digits (gc=Nd & nv=4, that is, the digit four of each set) and add to CLDR
1462 Unicode 11: using Unicode 11 CLDR ticket #10978
1463 rohg 10D30..10D39 Hanifi_Rohingya
1464 gong 11DA0..11DA9 Gunjala_Gondi
1465 Earlier: CLDR tickets specific to adding new numbering systems.
1466 Unicode 10: http://unicode.org/cldr/trac/ticket/10219
1467 Unicode 9: http://unicode.org/cldr/trac/ticket/9692
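The search for new digit sets can be sketched against UnicodeData.txt. This is an illustration, not an existing tool: the function name is hypothetical, and the sketch assumes that each set of decimal digits is a contiguous run of ten code points, so the digit four identifies the whole set.

```python
def decimal_digit_sets(unicode_data_lines):
    """Return (first, last) code point pairs, one per set of decimal digits."""
    sets = []
    for line in unicode_data_lines:
        fields = line.split(";")
        if len(fields) < 7:
            continue
        # fields[2] is General_Category, fields[6] is the decimal digit value.
        if fields[2] == "Nd" and fields[6] == "4":
            four = int(fields[0], 16)
            # Assumption: digits 0..9 are contiguous, so 4 sits at offset 4.
            sets.append((four - 4, four + 5))
    return sets
```

Running this over the old and the new UnicodeData.txt and comparing the two lists gives the candidate ranges for new CLDR numbering systems.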
1468
1469*** merge the Unicode update branches back onto the trunk
1470- do not merge the icudata.jar and testdata.jar,
1471 instead rebuild them from merged & tested ICU4C
1472- make sure that changes to Unicode tools are checked in:
1473 http://www.unicode.org/utility/trac/log/trunk/unicodetools
1474
1475---------------------------------------------------------------------------- ***
1476
1477Unicode 10.0 update for ICU 60
1478
1479http://www.unicode.org/versions/Unicode10.0.0/
1480http://www.unicode.org/versions/beta-10.0.0.html
1481http://blog.unicode.org/2017/03/unicode-100-beta-review.html
1482http://www.unicode.org/review/pri350/
1483http://www.unicode.org/reports/uax-proposed-updates.html
1484http://www.unicode.org/reports/tr44/tr44-19.html
1485
1486* Command-line environment setup
1487
1488UNICODE_DATA=~/unidata/uni10/20170605
1489CLDR_SRC=~/svn.cldr/uni10
1490ICU_ROOT=~/svn.icu/uni10
1491ICU_SRC=$ICU_ROOT/src
1492ICUDT=icudt60b
1493ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
1494ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
1495export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
1496
1497*** ICU Trac
1498
1499- ticket:12985: Unicode 10
1500- ticket:13061: undo hacks from emoji 5.0 update
1501- ticket:13062: add Emoji_Component property
1502- ^/branches/markus/uni10
1503
1504*** CLDR Trac
1505
1506- cldrbug 10055: Unicode 10
1507- cldrbug 9882: Unicode 10 script metadata
1508- cldrbug 10219: numbering systems for Unicode 10
1509
1510*** Unicode version numbers
1511- makedata.mak
1512- uchar.h
1513- com.ibm.icu.util.VersionInfo
1514- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
1515
1516- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
1517 so that the makefiles see the new version number.
1518
1519*** data files & enums & parser code
1520
1521* download files
1522- mkdir -p $UNICODE_DATA
1523- download Unicode 10.0 files into $UNICODE_DATA
1524 + subfolders: ucd, uca, idna, security
1525 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
1526- download emoji 5.0 files into $UNICODE_DATA/emoji
1527
1528* for manual diffs: remove version suffixes from the file names
1529 ~$ unidata/desuffixucd.py $UNICODE_DATA
1530 (see https://sites.google.com/site/unicodetools/inputdata)
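As an illustration of what removing version suffixes means, here is a small sketch. It is not the code of desuffixucd.py; the suffix pattern (a "-major.minor.update" part, optionally followed by a draft marker such as "d1", directly before the file extension) is an assumption.

```python
import re

# Assumed suffix shape: "-10.0.0" or "-10.0.0d1" right before the extension.
_VERSION_SUFFIX = re.compile(r"-\d+\.\d+\.\d+(?:d\d+)?(?=\.[A-Za-z]+$)")

def strip_version_suffix(name):
    """Return the file name without its version suffix, if it has one."""
    return _VERSION_SUFFIX.sub("", name)
```

File names without such a suffix are returned unchanged, so the function can be applied to every file in the data folder.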
1531
1532* process and/or copy files
1533- $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
1534 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1535 + For debugging, and tweaking how ppucd.txt is written,
1536 the tool has an --only_ppucd option:
1537 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
1538
1539- cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
1540
1541* build ICU (make install)
1542 so that the tools build can pick up the new definitions from the installed header files.
1543
1544 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
1545
1546* preparseucd.py changes
1547- remove or add new Unicode scripts from/to the
1548 only-in-ISO-15924 list according to the error messages:
1549 ValueError: remove ['Nshu'] from _scripts_only_in_iso15924
1550 -> adjust _scripts_only_in_iso15924 as indicated
1551- fix other errors
1552 Exception: no default values (@missing lines) for some Catalog or Enumerated properties: [u'vo']
1553 -> add vo=Vertical_Orientation to _ignored_properties
1554 -> later removed again: we now parse the file, even though we do not yet store the data for runtime use
1555
1556* new constants for new property values
1557- preparseucd.py error:
1558 ValueError: missing uchar.h enum constants for some property values:
1559 [(u'blk', set([u'Zanabazar_Square', u'Nushu', u'CJK_Ext_F',
1560 u'Kana_Ext_A', u'Syriac_Sup', u'Masaram_Gondi', u'Soyombo'])),
1561 (u'jg', set([u'Malayalam_Bha', u'Malayalam_Llla', u'Malayalam_Nya', u'Malayalam_Lla',
1562 u'Malayalam_Nga', u'Malayalam_Ssa', u'Malayalam_Tta', u'Malayalam_Ra',
1563 u'Malayalam_Nna', u'Malayalam_Ja', u'Malayalam_Nnna'])),
1564 (u'sc', set([u'Soyo', u'Gonm', u'Zanb']))]
1565 = PropertyValueAliases.txt new property values (diff old & new .txt files)
1566 blk; CJK_Ext_F ; CJK_Unified_Ideographs_Extension_F
1567 blk; Kana_Ext_A ; Kana_Extended_A
1568 blk; Masaram_Gondi ; Masaram_Gondi
1569 blk; Nushu ; Nushu
1570 blk; Soyombo ; Soyombo
1571 blk; Syriac_Sup ; Syriac_Supplement
1572 blk; Zanabazar_Square ; Zanabazar_Square
1573 -> add to uchar.h
1574 use long property names for enum constants,
1575 for the trailing comment get the block start code point: diff old & new Blocks.txt
1576 -> add to UCharacter.UnicodeBlock IDs
1577 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
1578 replace public static final int \1_ID = \2; \3
1579 -> add to UCharacter.UnicodeBlock objects
1580 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
1581 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1582
1583 jg ; Malayalam_Bha ; Malayalam_Bha
1584 jg ; Malayalam_Ja ; Malayalam_Ja
1585 jg ; Malayalam_Lla ; Malayalam_Lla
1586 jg ; Malayalam_Llla ; Malayalam_Llla
1587 jg ; Malayalam_Nga ; Malayalam_Nga
1588 jg ; Malayalam_Nna ; Malayalam_Nna
1589 jg ; Malayalam_Nnna ; Malayalam_Nnna
1590 jg ; Malayalam_Nya ; Malayalam_Nya
1591 jg ; Malayalam_Ra ; Malayalam_Ra
1592 jg ; Malayalam_Ssa ; Malayalam_Ssa
1593 jg ; Malayalam_Tta ; Malayalam_Tta
1594 -> uchar.h & UCharacter.JoiningGroup
1595
1596 sc ; Gonm ; Masaram_Gondi
1597 sc ; Nshu ; Nushu
1598 sc ; Soyo ; Soyombo
1599 sc ; Zanb ; Zanabazar_Square
1600 -> uscript.h & com.ibm.icu.lang.UScript
1601 -> Nushu had been added already
1602 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
1603 and in com.ibm.icu.dev.test.lang.TestUScript.java
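The two Eclipse find/replace patterns for UCharacter.UnicodeBlock listed above can be tried out in Python before applying them in the editor. This is a sketch: the function names are hypothetical, and one three-group pattern is reused for both replacements, whereas the second Eclipse pattern in the log leaves the number uncaptured.

```python
import re

# Same find pattern as in the log: constant name, numeric value, trailing comment.
UBLOCK_LINE = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")

def block_id_constant(line):
    """Turn a uchar.h UBLOCK_ enum line into the Java ..._ID constant."""
    return UBLOCK_LINE.sub(r"public static final int \g<1>_ID = \g<2>; \g<3>", line)

def block_object(line):
    """Turn a uchar.h UBLOCK_ enum line into the Java UnicodeBlock object."""
    return UBLOCK_LINE.sub(
        r'public static final UnicodeBlock \g<1> = new UnicodeBlock("\g<1>", \g<1>_ID); \g<3>',
        line)
```

For a made-up input line "UBLOCK_EXAMPLE = 999, /*[12340]*/", the first function yields "public static final int EXAMPLE_ID = 999; /*[12340]*/".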
1604
1605* New properties as shown in PropertyValueAliases.txt changes
1606- boolean Emoji_Component from emoji 5
1607 -> uchar.h & UProperty.java
1608- boolean
1609 # Regional_Indicator (RI)
1610
1611 RI ; N ; No ; F ; False
1612 RI ; Y ; Yes ; T ; True
1613 -> uchar.h & UProperty.java
1614 -> single immutable range, to be hardcoded
1615- boolean
1616 # Prepended_Concatenation_Mark (PCM)
1617
1618 PCM; N ; No ; F ; False
1619 PCM; Y ; Yes ; T ; True
1620 -> was new in Unicode 9
1621 -> uchar.h & UProperty.java
1622- enumerated
1623 # Vertical_Orientation (vo)
1624
1625 vo ; R ; Rotated
1626 vo ; Tr ; Transformed_Rotated
1627 vo ; Tu ; Transformed_Upright
1628 vo ; U ; Upright
1629 -> only pre-parsed for now, but not yet stored for runtime use
1630
1631* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
1632 (not strictly necessary for NOT_ENCODED scripts)
1633 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
1634
1635* generate normalization data files
1636 cd $ICU_ROOT/dbg/icu4c
1637 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
1638 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
1639 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
1640 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
1641 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
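The four .nrm invocations above differ only in the output name and the list of input files, which a small table makes explicit. This sketch only builds the argument lists and does not run gennorm2; the names NORM2_OUTPUTS and gennorm2_commands are hypothetical, and the --csource invocation is left out.

```python
import os

# Output file -> input files, as listed in the commands in the log.
NORM2_OUTPUTS = [
    ("nfc.nrm", ["nfc.txt"]),
    ("nfkc.nrm", ["nfc.txt", "nfkc.txt"]),
    ("nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"]),
    ("uts46.nrm", ["nfc.txt", "uts46.txt"]),
]

def gennorm2_commands(data_in, unidata):
    """Return one argument list per .nrm file, relative to the ICU4C build folder."""
    src = os.path.join(unidata, "norm2")
    return [["bin/gennorm2", "-o", os.path.join(data_in, out), "-s", src] + inputs
            for out, inputs in NORM2_OUTPUTS]
```

Each list could be passed to subprocess.run() from $ICU_ROOT/dbg/icu4c, with data_in and unidata set to the values of $ICU4C_DATA_IN and $ICU4C_UNIDATA.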
1642
1643* build ICU (make install)
1644 so that the tools build can pick up the new definitions from the installed header files.
1645
1646 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
1647
1648* build Unicode tools using CMake+make
1649
1650$ICU_SRC/tools/unicode/c/icudefs.txt:
1651
1652# Location (--prefix) of where ICU was installed.
1653set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
1654# Location of the ICU4C source tree.
1655set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c)
1656
1657 $ICU_ROOT/dbg/tools/unicode/c$
1658 cmake ../../../../src/tools/unicode/c
1659 make
1660
1661* generate core properties data files
1662 $ICU_ROOT/dbg/tools/unicode/c$
1663 genprops/genprops $ICU_SRC/icu4c
1664 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
1665 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
1666- rebuild ICU (make install) & tools
1667
1668* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1669 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1670- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1671- Unicode 6.0..10.0: U+2260, U+226E, U+226F
1672- nothing new in this Unicode version, no test file to update
1673
1674* run & fix ICU4C tests
1675- Andy handles RBBI & spoof check test failures
1676
1677* collation: CLDR collation root, UCA DUCET
1678
1679- UCA DUCET goes into Mark's Unicode tools, see
1680 https://sites.google.com/site/unicodetools/home#TOC-UCA
1681- CLDR root data files are checked into $CLDR_SRC/common/uca/
1682 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
1683
1684- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1685 cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
1686- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1687 cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
1688 (note removing the underscore before "Rules")
1689 cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
1690- restore TODO diffs in UCARules.txt
1691 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
1692- update (ICU4C)/source/test/testdata/CollationTest_*.txt
1693 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1694 from the CLDR root files (..._CLDR_..._SHORT.txt)
1695 cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1696 cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1697 cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
1698- if CLDR common/uca/unihan-index.txt changes, then update
1699 CLDR common/collation/root.xml <collation type="private-unihan">
1700 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
1701
1702- run genuca, see command line above;
1703 deal with
1704 Error: Unknown script for first-primary sample character U+11D10 on line 28117 of /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/unidata/FractionalUCA.txt:
1705 FDD1 11D10; [70 D5 02, 05, 05] # Masaram_Gondi first primary (compressible)
1706 (add the character to genuca.cpp sampleCharsToScripts[])
1707 + look up the USCRIPT_ code for the new sample characters
1708 (should be obvious from the comment in the error output)
1709 + *add* mappings to sampleCharsToScripts[], do not replace them
1710 (in case the script sample characters flip-flop)
1711 + insert new scripts in DUCET script order, see the top_byte table
1712 at the beginning of FractionalUCA.txt
1713- rebuild ICU4C
1714
1715* Unihan collators
1716 https://sites.google.com/site/unicodetools/unihan
1717- run Unicode Tools
1718 org.unicode.draft.GenerateUnihanCollators
1719 with VM arguments
1720 -ea
1721 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
1722 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
1723 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
1724 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
1725 -DUVERSION=10.0.0
1726- run Unicode Tools
1727 org.unicode.draft.GenerateUnihanCollatorFiles
1728 with the same arguments
1729- check CLDR diffs
1730 cd $CLDR_SRC
1731 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
1732 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
1733- copy to CLDR
1734 cd $CLDR_SRC
1735 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
1736 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
1737- run CLDR unit tests, commit to CLDR
1738- generate ICU zh collation data: run CLDR
1739 org.unicode.cldr.icu.NewLdml2IcuConverter
1740 with program arguments
1741 -t collation
1742 -s /usr/local/google/home/mscherer/svn.cldr/uni10/common/collation
1743 -m /usr/local/google/home/mscherer/svn.cldr/uni10/common/supplemental
1744 -d /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/coll
1745 -p /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/xml/collation
1746 zh
1747 and VM arguments
1748 -ea
1749 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
1750- rebuild ICU4C
1751
1752* run & fix ICU4C tests, now with new CLDR collation root data
1753- run all tests with the collation test data *_SHORT.txt or the full files
1754 (the full ones have comments, useful for debugging)
1755- note on intltest: if collate/UCAConformanceTest fails, then
1756 utility/MultithreadTest/TestCollators will fail as well;
1757 fix the conformance test before looking into the multi-thread test
1758
1759* update Java data files
1760- refresh just the UCD/UCA-related/derived files, just to be safe
1761- see (ICU4C)/source/data/icu4j-readme.txt
1762- mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1763- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1764 output:
1765 ...
1766 Unicode .icu files built to ./out/build/icudt60l
1767 echo timestamp > uni-core-data
1768 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt60b
1769 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b
1770 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
1771 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt60l.dat ./out/icu4j/icudt60b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt60l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt60b
1772 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b"
1773 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt60b/
1774 mkdir -p /tmp/icu4j/main/shared/data
1775 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1776 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt60b/
1777 mkdir -p /tmp/icu4j/main/shared/data
1778 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1779 make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/uni10/dbg/icu4c/data'
1780- copy the big-endian Unicode data files to another location,
1781 separate from the other data files,
1782 and then refresh ICU4J
1783 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
1784 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1785 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1786 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1787 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1788 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1789 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1790 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1791 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1792 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1793
1794* When refreshing all of ICU4J data from ICU4C
1795- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1796- cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
1797or
1798- $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
1799
1800* update CollationFCD.java
1801 + copy & paste the initializers of lcccIndex[] etc. from
1802 ICU4C/source/i18n/collationfcd.cpp to
1803 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1804
1805* refresh Java test .txt files
1806- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1807 cd $ICU_SRC/icu4c/source/data/unidata
1808 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1809 cd ../../test/testdata
1810 cp BidiCharacterTest.txt BidiTest.txt IdnaTest.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1811 cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1812
1813* run & fix ICU4J tests
1814
1815*** API additions
1816- send notice to icu-design about new born-@stable API (enum constants etc.)
1817
1818*** CLDR numbering systems
1819- look for new sets of decimal digits (gc=Nd & nv=4, that is, the digit four of each set) and submit a CLDR ticket
1820 Unicode 10: http://unicode.org/cldr/trac/ticket/10219
1821 Unicode 9: http://unicode.org/cldr/trac/ticket/9692
1822
1823*** merge the Unicode update branches back onto the trunk
1824- do not merge the icudata.jar and testdata.jar,
1825 instead rebuild them from merged & tested ICU4C
1826- make sure that changes to Unicode tools are checked in:
1827 http://www.unicode.org/utility/trac/log/trunk/unicodetools
1828
1829---------------------------------------------------------------------------- ***
1830
1831Emoji 5.0 update for ICU 59
1832- ICU 59 mostly remains on Unicode 9.0
1833- except that it updates bidi and segmentation data to Unicode 10 beta
1834
1835First run of tools on combined icu4c/icu4j/tools trunk after svn repository reorg.
1836
1837* Command-line environment setup
1838
1839ICU_ROOT=~/svn.icu/trunk
1840ICU_SRC_DIR=$ICU_ROOT/src
1841ICU4C_SRC_DIR=$ICU_SRC_DIR/icu4c
1842ICUDT=icudt59b
1843export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
1844SRC_DATA_IN=$ICU4C_SRC_DIR/source/data/in
1845UNIDATA=$ICU4C_SRC_DIR/source/data/unidata
1846
1847*** ICU Trac
1848
1849- ticket:12900: take Emoji 5.0 properties data into ICU 59 once it's released
1850- changes directly on trunk
1851
1852*** data files & enums & parser code
1853
1854* download files
1855
1856- download Unicode 9.0 files into a uni90e50 folder: ucd, idna, security (skip uca)
1857- download emoji 5.0 beta files into the same uni90e50 folder
1858- download Unicode 10.0 beta files: ucd
1859 + copy Unicode 10 bidi files to the uni90e50/ucd folder:
1860 BidiBrackets.txt
1861 BidiCharacterTest.txt
1862 BidiMirroring.txt
1863 BidiTest.txt
1864 extracted/DerivedBidiClass.txt
1865 + copy Unicode 10 segmentation files to the uni90e50/ucd folder:
1866 LineBreak.txt
1867 auxiliary/*
1868
1869* preparseucd.py changes
1870- adjust for combined trunks
1871- write new copyright lines
1872- ignore new Emoji_Component property for now
1873
1874* process and/or copy files
1875- ~/svn.icu/trunk/src/tools/unicode$ py/preparseucd.py ~/unidata/uni90e50/20170322 $ICU_SRC_DIR
1876 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1877
1878- cp ~/unidata/uni90e50/20170322/security/confusables.txt $UNIDATA
1879
1880* build ICU (make install)
1881 so that the tools build can pick up the new definitions from the installed header files.
1882
1883 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
1884
1885* build Unicode tools using CMake+make
1886
1887~/svn.icu/trunk/src/tools/unicode/c/icudefs.txt:
1888
1889# Location (--prefix) of where ICU was installed.
1890set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
1891# Location of the ICU4C source tree.
1892set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/trunk/src/icu4c)
1893
1894 ~/svn.icu/trunk/dbg/tools/unicode/c$
1895 cmake ../../../../src/tools/unicode/c
1896 make
1897
1898* generate core properties data files
1899 ~/svn.icu/trunk/dbg/tools/unicode/c$
1900 genprops/genprops $ICU4C_SRC_DIR
1901- rebuild ICU (make install) & tools
1902
1903* run & fix ICU4C tests
1904- Andy handles RBBI & spoof check test failures
1905
1906* update Java data files
1907- refresh just the UCD/UCA-related/derived files, just to be safe
1908- see (ICU4C)/source/data/icu4j-readme.txt
1909- mkdir /tmp/icu4j
1910- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1911 output:
1912 ...
1913 Unicode .icu files built to ./out/build/icudt59l
1914 echo timestamp > uni-core-data
1915 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt59b
1916 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b
1917 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
1918 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt59l.dat ./out/icu4j/icudt59b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt59l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt59b
1919 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b"
1920 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt59b/
1921 mkdir -p /tmp/icu4j/main/shared/data
1922 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1923 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt59b/
1924 mkdir -p /tmp/icu4j/main/shared/data
1925 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1926 make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/trunk/dbg/icu4c/data'
1927- copy the big-endian Unicode data files to another location,
1928 separate from the other data files,
1929 and then refresh ICU4J
1930 cd ~/svn.icu/trunk/dbg/icu4c/data/out/icu4j
1931 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1932 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1933 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1934 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1935 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1936 jar uvf ~/svn.icu/trunk/src/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1937
1938* When refreshing all of ICU4J data from ICU4C
1939- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1940- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu/trunk/src/icu4j/main/shared/data
1941or
1942- ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=~/svn.icu/trunk/src/icu4j icu4j-data-install
1943
1944* refresh Java test .txt files
1945- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1946 cd $ICU4C_SRC_DIR/source/data/unidata
1947 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1948 cd ../../test/testdata
1949 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1950 cp ~/unidata/uni90e50/20170322/ucd/CompositionExclusions.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1951
1952* run & fix ICU4J tests
1953
1954---------------------------------------------------------------------------- ***
1955
1956Unicode 9.0 update for ICU 58
1957
1958* Command-line environment setup
1959
1960ICU_ROOT=~/svn.icu/trunk
1961ICU_SRC_DIR=$ICU_ROOT/src
1962ICUDT=icudt58b
1963export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
1964SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
1965UNIDATA=$ICU_SRC_DIR/source/data/unidata
1966
1967http://www.unicode.org/review/pri323/ -- beta review
1968http://www.unicode.org/reports/uax-proposed-updates.html
1969http://www.unicode.org/versions/beta-9.0.0.html
1970http://www.unicode.org/versions/Unicode9.0.0/
1971http://www.unicode.org/reports/tr44/tr44-17.html
1972
1973*** ICU Trac
1974
1975- ticket:12526: integrate Unicode 9
1976- C++ ^/icu/branches/markus/uni90, ^/icu/branches/markus/uni90b
1977- Java ^/icu4j/branches/markus/uni90, ^/icu4j/branches/markus/uni90b
1978
1979*** CLDR Trac
1980
1981- cldrbug 9414: UCA 9
1982- ^/branches/markus/uni90 at r11518 from trunk at r11517
1983
1984- cldrbug 8745: Unicode 9.0 script metadata
1985
1986*** Unicode version numbers
1987- makedata.mak
1988- uchar.h
1989- com.ibm.icu.util.VersionInfo
1990- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
1991
1992- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
1993 so that the makefiles see the new version number.
1994
1995*** data files & enums & parser code
1996
1997* file preparation
1998
1999- download UCD & IDNA files
2000- make sure that the Unicode data folder passed into preparseucd.py
2001 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
2002- only for manual diffs: remove version suffixes from the file names
2003 ~/unidata/uni70/20140403$ ../../desuffixucd.py .
2004 (see https://sites.google.com/site/unicodetools/inputdata)
2005- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
2006- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni90/20160603 $ICU_SRC_DIR ~/svn.icutools/trunk/src
2007- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2008
2009- also: from http://unicode.org/Public/security/9.0.0/ download new confusables.txt
2010 and copy to $UNIDATA
2011 cp ~/unidata/uni90/20160603/security/confusables.txt $UNIDATA
2012
2013* preparseucd.py changes
2014- remove or add new Unicode scripts from/to the
2015 only-in-ISO-15924 list according to the error messages:
2016 ValueError: remove ['Tang'] from _scripts_only_in_iso15924
2017 ValueError: sc = Hanb (uchar.h USCRIPT_HAN_WITH_BOPOMOFO) not in the UCD
2018 ValueError: sc = Jamo (uchar.h USCRIPT_JAMO) not in the UCD
2019 ValueError: sc = Zsye (uchar.h USCRIPT_SYMBOLS_EMOJI) not in the UCD
2020 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
2021 and in com.ibm.icu.dev.test.lang.TestUScript.java
2022- DerivedNumericValues.txt new numeric values
2023 0D58 ; 0.00625 ; ; 1/160 # No MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH
2024 0D59 ; 0.025 ; ; 1/40 # No MALAYALAM FRACTION ONE FORTIETH
2025 0D5A ; 0.0375 ; ; 3/80 # No MALAYALAM FRACTION THREE EIGHTIETHS
2026 0D5B ; 0.05 ; ; 1/20 # No MALAYALAM FRACTION ONE TWENTIETH
2027 0D5D ; 0.15 ; ; 3/20 # No MALAYALAM FRACTION THREE TWENTIETHS
2028 -> change uprops.h, corepropsbuilder.cpp/encodeNumericValue(),
2029 uchar.c, UCharacterProperty.java
2030 to support a new series of values
2031- adjust preparseucd.py for Tangut algorithmic names
2032 in ppucd.txt:
2033 algnamesrange;17000..187EC;han;CJK UNIFIED IDEOGRAPH-
2034 ->
2035 algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH-
2036- avoid block-compressing most String/Miscellaneous property values,
2037 triggered by genprops not coping with a multi-code point Case_Folding on
2038 block;1C80..1C8F;...;Cased;cf=0442;CWCF;...
2039 keep block-compressing empty-string mappings NFKC_CF="" for tags and variation selectors
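
The new Malayalam numeric values listed above each give a decimal and a fraction. A minimal sketch, not taken from preparseucd.py or genprops, that checks the two fields agree; the table only restates the five lines quoted in this log, and the function names are made up for illustration:

```python
from fractions import Fraction

# (code point, decimal field, fraction field), restated from the
# DerivedNumericValues.txt lines quoted in this log.
NEW_VALUES = [
    (0x0D58, "0.00625", "1/160"),
    (0x0D59, "0.025", "1/40"),
    (0x0D5A, "0.0375", "3/80"),
    (0x0D5B, "0.05", "1/20"),
    (0x0D5D, "0.15", "3/20"),
]


def fields_agree(decimal_text, fraction_text):
    """Return True if the decimal and the fraction denote the same number."""
    # Fraction parses both "0.025" and "1/40" exactly, without float rounding.
    return Fraction(decimal_text) == Fraction(fraction_text)


def check_all(rows):
    """Return the code points whose two numeric fields disagree."""
    return [cp for cp, dec, frac in rows if not fields_agree(dec, frac)]


if __name__ == "__main__":
    print(check_all(NEW_VALUES))
```

Denominators such as 160 and 80 are the reason the log calls for a new series of values in encodeNumericValue(); the sketch says nothing about how that encoding is laid out.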
2040
2041* PropertyAliases.txt changes
2042- 1 new property PCM=Prepended_Concatenation_Mark
2043 Ignore: Only useful for layout engines.
2044 Ok to list in ppucd.txt.
2045
2046* PropertyValueAliases.txt new property values
2047 blk; Adlam ; Adlam
2048 blk; Bhaiksuki ; Bhaiksuki
2049 blk; Cyrillic_Ext_C ; Cyrillic_Extended_C
2050 blk; Glagolitic_Sup ; Glagolitic_Supplement
2051 blk; Ideographic_Symbols ; Ideographic_Symbols_And_Punctuation
2052 blk; Marchen ; Marchen
2053 blk; Mongolian_Sup ; Mongolian_Supplement
2054 blk; Newa ; Newa
2055 blk; Osage ; Osage
2056 blk; Tangut ; Tangut
2057 blk; Tangut_Components ; Tangut_Components
2058 -> add to uchar.h
2059 use long property names for enum constants
2060 -> add to UCharacter.UnicodeBlock IDs
2061 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
2062 replace public static final int \1_ID = \2; \3
2063 -> add to UCharacter.UnicodeBlock objects
2064 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
2065 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
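
The two Eclipse find/replace steps above can also be tried outside the IDE. A rough Python equivalent, assuming one enum constant per line in the uchar.h style; the sample line is hypothetical (name, number and comment are invented), and the helper names are not part of any ICU tool:

```python
import re

# Same patterns as the Eclipse steps; \g<1> is Python's unambiguous
# spelling of the \1 back-reference.
ID_FIND = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
ID_REPLACE = r"public static final int \g<1>_ID = \g<2>; \g<3>"

OBJ_FIND = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
OBJ_REPLACE = (
    r'public static final UnicodeBlock \g<1> = '
    r'new UnicodeBlock("\g<1>", \g<1>_ID); \g<2>'
)


def to_java_id(c_line):
    """Turn a C enum line into the Java UnicodeBlock ID constant."""
    return ID_FIND.sub(ID_REPLACE, c_line)


def to_java_object(c_line):
    """Turn a C enum line into the Java UnicodeBlock object declaration."""
    return OBJ_FIND.sub(OBJ_REPLACE, c_line)


if __name__ == "__main__":
    # Hypothetical input line, not copied from uchar.h.
    sample = "    UBLOCK_EXAMPLE_BLOCK = 999, /*[F0000]*/"
    print(to_java_id(sample))
    print(to_java_object(sample))
```

Text outside the match (here the leading indentation) is left untouched, as in the Eclipse replace.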
2066
2067 GCB; EB ; E_Base
2068 GCB; EBG ; E_Base_GAZ
2069 GCB; EM ; E_Modifier
2070 GCB; GAZ ; Glue_After_Zwj
2071 GCB; ZWJ ; ZWJ
2072 -> uchar.h & UCharacter.GraphemeClusterBreak
2073
2074 jg ; African_Feh ; African_Feh
2075 jg ; African_Noon ; African_Noon
2076 jg ; African_Qaf ; African_Qaf
2077 -> uchar.h & UCharacter.JoiningGroup
2078
2079 lb ; EB ; E_Base
2080 lb ; EM ; E_Modifier
2081 lb ; ZWJ ; ZWJ
2082 -> uchar.h & UCharacter.LineBreak
2083
2084 sc ; Adlm ; Adlam
2085 sc ; Bhks ; Bhaiksuki
2086 sc ; Marc ; Marchen
2087 sc ; Newa ; Newa
2088 sc ; Osge ; Osage
2089 sc ; Tang ; Tangut
2090 -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
2091
2092 WB ; EB ; E_Base
2093 WB ; EBG ; E_Base_GAZ
2094 WB ; EM ; E_Modifier
2095 WB ; GAZ ; Glue_After_Zwj
2096 WB ; ZWJ ; ZWJ
2097 -> uchar.h & UCharacter.WordBreak
2098
2099* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
2100 (not strictly necessary for NOT_ENCODED scripts)
2101 ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
2102
2103* generate normalization data files
2104 cd $ICU_ROOT/dbg
2105 bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
2106 bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
2107 bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
2108 bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
2109 bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
2110
2111* build ICU (make install)
2112 so that the tools build can pick up the new definitions from the installed header files.
2113
2114 $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 30 out.txt
2115
2116* build Unicode tools using CMake+make
2117
2118~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
2119
2120 # Location (--prefix) of where ICU was installed.
2121 set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
2122 # Location of the ICU source tree.
2123 set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)
2124
2125 ~/svn.icutools/trunk/dbg/unicode/c$
2126 cmake ../../../src/unicode/c
2127 make
2128
2129* generate core properties data files
2130 ~/svn.icutools/trunk/dbg/unicode/c$
2131 genprops/genprops $ICU_SRC_DIR
2132 genuca/genuca --hanOrder implicit $ICU_SRC_DIR
2133 genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
2134- rebuild ICU (make install) & tools
2135
2136* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2137 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2138- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2139- Unicode 6.0..9.0: U+2260, U+226E, U+226F
2140- nothing new in 9.0, no test file to update
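
The grep step above can be sketched in Python as follows. This assumes the usual semicolon-separated layout (code point or range, then status, with "#" comments); the sample lines are illustrative of that layout rather than copied from IdnaMappingTable.txt, and the function name is made up:

```python
def non_ascii_std3_valid(lines):
    """Return (start, end) ranges with status disallowed_STD3_valid
    that contain at least one non-ASCII code point."""
    hits = []
    for line in lines:
        data = line.split("#", 1)[0]          # drop the trailing comment
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        bounds = fields[0].split("..")        # "2260" or "226E..226F"
        start, end = int(bounds[0], 16), int(bounds[-1], 16)
        if end > 0x7F:
            hits.append((start, end))
    return hits


if __name__ == "__main__":
    # Illustrative lines in the assumed format.
    sample = [
        "# comment-only line",
        "0000..002C    ; disallowed_STD3_valid        # ASCII, ignored",
        "00A0          ; disallowed_STD3_mapped ; 0020 # different status",
        "2260          ; disallowed_STD3_valid        # NOT EQUAL TO",
        "226E..226F    ; disallowed_STD3_valid        # NOT LESS-THAN..",
    ]
    for start, end in non_ascii_std3_valid(sample):
        print(f"U+{start:04X}..U+{end:04X}")
```

Comparing the result against the known list (U+2260, U+226E, U+226F) shows whether the test files need new cases.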
2141
2142* run & fix ICU4C tests
2143- Andy handles RBBI & spoof check test failures
2144
2145* collation: CLDR collation root, UCA DUCET
2146
2147- UCA DUCET goes into Mark's Unicode tools, see
2148 https://sites.google.com/site/unicodetools/home#TOC-UCA
2149- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
2150 cp (UCA generated)/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/
2151
2152- cd (CLDR UCA branch)/common/uca/
2153- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2154 cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
2155- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2156 cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
2157 (note removing the underscore before "Rules")
2158 cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
2159- restore TODO diffs in UCARules.txt
2160 meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
2161- update (ICU4C)/source/test/testdata/CollationTest_*.txt
2162 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2163 from the CLDR root files (..._CLDR_..._SHORT.txt)
2164 cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
2165 cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
2166 cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
2167- if CLDR common/uca/unihan-index.txt changes, then update
2168 CLDR common/collation/root.xml <collation type="private-unihan">
2169 and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
2170
2171- run genuca, see command line above;
2172 deal with
2173 Error: Unknown script for first-primary sample character U+104B5 on line 32599 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt:
2174 FDD1 104B5; [75 B8 02, 05, 05] # Osage first primary (compressible)
2175 (add the character to genuca.cpp sampleCharsToScripts[])
2176 + look up the USCRIPT_ code for the new sample characters
2177 (should be obvious from the comment in the error output)
2178 + *add* mappings to sampleCharsToScripts[], do not replace them
2179 (in case the script sample characters flip-flop)
2180 + insert new scripts in DUCET script order, see the top_byte table
2181 at the beginning of FractionalUCA.txt
2182- rebuild ICU4C
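
For the genuca error above, the sample character and the script name can be pulled out of the offending FractionalUCA.txt line. A small sketch, assuming first-primary lines always look like the one quoted in the error output; the regular expression and function name are illustrative and not part of genuca:

```python
import re

# Matches lines such as:
#   FDD1 104B5; [75 B8 02, 05, 05] # Osage first primary (compressible)
FIRST_PRIMARY = re.compile(r"^FDD1 ([0-9A-Fa-f]+);.*# (.+?) first primary")


def parse_first_primary(line):
    """Return (sample code point, script name from the comment), or None."""
    match = FIRST_PRIMARY.match(line)
    if match is None:
        return None
    return int(match.group(1), 16), match.group(2)


if __name__ == "__main__":
    line = "FDD1 104B5; [75 B8 02, 05, 05] # Osage first primary (compressible)"
    print(parse_first_primary(line))
```

The script name from the comment still has to be mapped by hand to its USCRIPT_ constant before adding the entry to sampleCharsToScripts[].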
2183
2184* Unihan collators
2185- run Unicode Tools
2186 org.unicode.draft.GenerateUnihanCollators
2187 with VM arguments
2188 -DSVN_WORKSPACE=/home/mscherer/svn.unitools/trunk
2189 -DOTHER_WORKSPACE=/home/mscherer/svn.unitools
2190 -DUCD_DIR=/home/mscherer/svn.unitools/trunk/data
2191 -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
2192 -DUVERSION=9.0.0
2193 -ea
2194- run Unicode Tools
2195 org.unicode.draft.GenerateUnihanCollatorFiles
2196 with the same arguments
2197- check CLDR diffs
2198 cd ~/svn.cldr/trunk
2199 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
2200 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
2201- copy to CLDR
2202 cd ~/svn.cldr/trunk
2203 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
2204 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
2205- commit to CLDR
2206- generate ICU zh collation data: run CLDR
2207 org.unicode.cldr.icu.NewLdml2IcuConverter
2208 with program arguments
2209 -t collation
2210 -s /home/mscherer/svn.cldr/trunk/common/collation
2211 -m /home/mscherer/svn.cldr/trunk/common/supplemental
2212 -d /home/mscherer/svn.icu/trunk/src/source/data/coll
2213 -p /home/mscherer/svn.icu/trunk/src/source/data/xml/collation
2214 zh
2215 and VM arguments
2216 -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
2217- rebuild ICU4C
2218
2219* run & fix ICU4C tests, now with new CLDR collation root data
2220- run all tests with the collation test data *_SHORT.txt or the full files
2221 (the full ones have comments, useful for debugging)
2222- note on intltest: if collate/UCAConformanceTest fails, then
2223 utility/MultithreadTest/TestCollators will fail as well;
2224 fix the conformance test before looking into the multi-thread test
2225
2226* update Java data files
2227- refresh just the UCD/UCA-related/derived files, just to be safe
2228- see (ICU4C)/source/data/icu4j-readme.txt
2229- mkdir /tmp/icu4j
2230- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2231 output:
2232 ...
2233 Unicode .icu files built to ./out/build/icudt58l
2234 echo timestamp > uni-core-data
2235 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt58b
2236 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b
2237 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
2238 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt58l.dat ./out/icu4j/icudt58b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt58l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt58b
2239 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b"
2240 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt58b/
2241 mkdir -p /tmp/icu4j/main/shared/data
2242 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2243 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt58b/
2244 mkdir -p /tmp/icu4j/main/shared/data
2245 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2246 make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
2247- copy the big-endian Unicode data files to another location,
2248 separate from the other data files,
2249 and then refresh ICU4J
2250 cd ~/svn.icu/trunk/dbg/data/out/icu4j
2251 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2252 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2253 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2254 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2255 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
2256 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2257 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2258 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2259 jar uvf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
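
The copy steps above (everything before the final jar command) can be restated as a short script. This is only a sketch of the same file selection, with hypothetical function and parameter names; it skips cnvalias.icu instead of copying and then deleting it, and it leaves the "jar uvf" step to be run by hand:

```python
import shutil
from pathlib import Path

DATA_PACKAGE = "com/ibm/icu/impl/data"


def stage_unicode_data(src_root, dst_root, icudt):
    """Copy the Unicode-related big-endian files to a separate staging tree.

    Returns the relative names of the files that were copied.
    """
    src = Path(src_root) / DATA_PACKAGE / icudt
    dst = Path(dst_root) / DATA_PACKAGE / icudt
    (dst / "coll").mkdir(parents=True, exist_ok=True)
    (dst / "brkitr").mkdir(parents=True, exist_ok=True)

    copied = []
    top_level = [src / "confusables.cfu"]
    top_level += sorted(src.glob("*.icu")) + sorted(src.glob("*.nrm"))
    for path in top_level:
        if path.name == "cnvalias.icu":
            continue  # the log removes this file again after copying *.icu
        shutil.copy(path, dst / path.name)
        copied.append(path.name)

    for sub in ("coll", "brkitr"):
        for path in sorted((src / sub).iterdir()):
            if path.is_file():  # plain "cp dir/*" also skips subdirectories
                shutil.copy(path, dst / sub / path.name)
                copied.append(f"{sub}/{path.name}")
    return copied
```

A call such as stage_unicode_data("out/icu4j", "/tmp/icu4j", "icudt58b") would correspond to the commands in this section, with the paths adjusted to the local checkout.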
2260
2261* When refreshing all of ICU4J data from ICU4C
2262- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2263- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
2264or
2265- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
2266
2267* update CollationFCD.java
2268 + copy & paste the initializers of lcccIndex[] etc. from
2269 ICU4C/source/i18n/collationfcd.cpp to
2270 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
2271
2272* refresh Java test .txt files
2273- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2274 cd $ICU_SRC_DIR/source/data/unidata
2275 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
2276 cd ../../test/testdata
2277 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
2278 cp ~/unidata/uni90/20160603/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
2279
2280* run & fix ICU4J tests
2281
2282*** LayoutEngine script information
2283
2284* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
2285 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
2286 in the working directory.
2287
2288 (It also generates ScriptRunData.cpp, which is no longer needed.)
2289
2290 It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
2291 (a plain text file)
2292 which maps ICU versions to the numbers of script/language constants
2293 that were added then.
2294 (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
2295
2296 The generated files have a current copyright date and "@deprecated" statement.
2297
2298* Review changes, fix Java tool if necessary, and copy to ICU4C
2299 cd ~/svn.icu4j/trunk/src
2300 meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
2301 cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
2302 cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
2303
2304*** API additions
2305- send notice to icu-design about new born-@stable API (enum constants etc.)
2306
2307*** merge the Unicode update branches back onto the trunk
2308- do not merge the icudata.jar and testdata.jar,
2309 instead rebuild them from merged & tested ICU4C
2310- make sure that changes to Unicode tools & ICU tools are checked in
2311 http://www.unicode.org/utility/trac/log/trunk/unicodetools
2312 http://bugs.icu-project.org/trac/log/tools/trunk
2313
2314---------------------------------------------------------------------------- ***
2315
2316New script codes early in ICU 58: http://bugs.icu-project.org/trac/ticket/11764
2317
2318Adding
2319- new scripts in Unicode 9: Adlm, Bhks, Marc, Newa, Osge
2320- new combination/alias codes: Hanb, Jamo
2321 - used in CLDR 29 and in spoof checker
2322- new Z* code: Zsye
2323
2324Add new codes to uscript.h & UScript.java, see Unicode update logs.
2325 -> com.ibm.icu.lang.UScript
2326 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2327 replace public static final int \1 = \2; \3
2328
2329Manually edit ppucd.txt and icutools:unicode/c/genprops/pnames_data.h,
2330add new script codes.
2331"Long" script names only where established in Unicode 9 PropertyValueAliases.txt.
2332
2333Note: If we have to run preparseucd.py again before the Unicode 9 update,
2334then we need to manually keep/restore the new script codes.
2335
2336ICU_ROOT=~/svn.icu/trunk
2337ICU_SRC_DIR=$ICU_ROOT/src
2338ICUDT=icudt57b
2339export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
2340SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
2341UNIDATA=$ICU_SRC_DIR/source/data/unidata
2342
2343Adjust unicode/c/genprops/*builder.cpp for #ifndef/#ifdef changes in _data.h files,
2344see http://bugs.icu-project.org/trac/ticket/12141
2345
2346make install, then icutools cmake & make, then
2347~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR
2348
2349Generate Java data as usual, only update pnames.icu & uprops.icu.
2350
2351*** LayoutEngine script information
2352
2353* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
2354 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
2355 in the working directory.
2356
2357 (It also generates ScriptRunData.cpp, which is no longer needed.)
2358
2359 It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
2360 (a plain text file)
2361 which maps ICU versions to the numbers of script/language constants
2362 that were added then.
2363 (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
2364
2365 The generated files have a current copyright date and "@deprecated" statement.
2366
2367* Review changes, fix Java tool if necessary, and copy to ICU4C
2368 cd ~/svn.icu4j/trunk/src
2369 meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
2370 cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
2371 cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
2372
2373---------------------------------------------------------------------------- ***
2374
2375Emoji properties added in ICU 57: http://bugs.icu-project.org/trac/ticket/11802
2376
2377Edit preparseucd.py to add & parse new properties.
2378They share the UCD property namespace but are not listed in PropertyAliases.txt.
2379
2380Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/
2381Initial data from emoji/2.0/
2382
2383ICU_ROOT=~/svn.icu/trunk
2384ICU_SRC_DIR=$ICU_ROOT/src
2385ICUDT=icudt56b
2386export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
2387SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
2388UNIDATA=$ICU_SRC_DIR/source/data/unidata
2389
2390Add binary-property constants to uchar.h enum UProperty & UProperty.java.
2391
2392~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src
2393(Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.)
2394
2395Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java
2396
2397make install, then icutools cmake & make, then
2398~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR
2399
2400Generate Java data as usual, only update pnames.icu & uprops.icu.
2401
2402---------------------------------------------------------------------------- ***
2403
2404Unicode 8.0 update for ICU 56
2405
2406* Command-line environment setup
2407
2408ICU_ROOT=~/svn.icu/trunk
2409ICU_SRC_DIR=$ICU_ROOT/src
2410ICUDT=icudt56b
2411export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
2412SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
2413UNIDATA=$ICU_SRC_DIR/source/data/unidata
2414
2415http://www.unicode.org/review/pri297/ -- beta review
2416http://www.unicode.org/reports/uax-proposed-updates.html
2417http://unicode.org/versions/beta-8.0.0.html
2418http://www.unicode.org/versions/Unicode8.0.0/
2419http://www.unicode.org/reports/tr44/tr44-15.html
2420
2421*** ICU Trac
2422
2423- ticket:11574: Unicode 8
2424- C++ branches/markus/uni80 at r37351 from trunk at r37343
2425- Java branches/markus/uni80 at r37352 from trunk at r37338
2426
2427*** CLDR Trac
2428
2429- cldrbug 8311: UCA 8
2430- branches/markus/uni80 at r11518 from trunk at r11517
2431
2432- cldrbug 8109: Unicode 8.0 script metadata
2433- cldrbug 8418: Updated segmentation for Unicode 8.0
2434
2435*** Unicode version numbers
2436- makedata.mak
2437- uchar.h
2438- com.ibm.icu.util.VersionInfo
2439- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
2440
2441- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
2442 so that the makefiles see the new version number.
2443
2444*** data files & enums & parser code
2445
2446* file preparation
2447
2448- download UCD & IDNA files
2449- make sure that the Unicode data folder passed into preparseucd.py
2450 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
2451- only for manual diffs: remove version suffixes from the file names
2452 ~/unidata/uni70/20140403$ ../../desuffixucd.py .
2453 (see https://sites.google.com/site/unicodetools/inputdata)
2454- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
2455- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src
2456- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2457
2458- also: from http://unicode.org/Public/security/8.0.0/ download new
2459 confusables.txt & confusablesWholeScript.txt
2460 and copy to $UNIDATA
2461 ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA
2462 ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA
2463
2464* initial preparseucd.py changes
2465- remove new Unicode scripts from the
2466 only-in-ISO-15924 list according to the error message:
2467 ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw']
2468 from _scripts_only_in_iso15924
2469 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
2470 and in com.ibm.icu.dev.test.lang.TestUScript.java
2471- property and file name change:
2472 IndicMatraCategory -> IndicPositionalCategory
2473- UnicodeData.txt unusual numeric values (fractions not in lowest terms)
2474 109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;
2475 109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;
2476 109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;;
2477 109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;;
2478 109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;;
2479 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;
2480 109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;;
2481 109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;;
2482 109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;;
2483 109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;
2484 -> change preparseucd.py to map them to reduced fractions (e.g., 2/12 -> 1/6)
2485 which are listed in DerivedNumericValues.txt;
2486 keeps storage in data file simple
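
The mapping of the Meroitic twelfths to the values listed in DerivedNumericValues.txt amounts to reducing each fraction to lowest terms. A minimal sketch of that idea; the actual change in preparseucd.py may be written differently, and the function name is hypothetical:

```python
from fractions import Fraction


def reduce_numeric_value(nv):
    """Return a UnicodeData.txt numeric value with any fraction reduced.

    Values without a "/" (plain integers) are returned unchanged.
    """
    if "/" not in nv:
        return nv
    # Fraction normalizes to lowest terms; str() gives "n/d", or "n" if d == 1.
    return str(Fraction(nv))


if __name__ == "__main__":
    for raw in ["1/12", "2/12", "3/12", "4/12", "6/12", "10/12"]:
        print(raw, "->", reduce_numeric_value(raw))
```

Reducing at parse time means the data file only ever sees fractions that already exist as distinct numeric values.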
2487
2488* PropertyValueAliases.txt changes
2489- 10 new Block (blk) values:
2490 blk; Ahom ; Ahom
2491 blk; Anatolian_Hieroglyphs ; Anatolian_Hieroglyphs
2492 blk; Cherokee_Sup ; Cherokee_Supplement
2493 blk; CJK_Ext_E ; CJK_Unified_Ideographs_Extension_E
2494 blk; Early_Dynastic_Cuneiform ; Early_Dynastic_Cuneiform
2495 blk; Hatran ; Hatran
2496 blk; Multani ; Multani
2497 blk; Old_Hungarian ; Old_Hungarian
2498 blk; Sup_Symbols_And_Pictographs ; Supplemental_Symbols_And_Pictographs
2499 blk; Sutton_SignWriting ; Sutton_SignWriting
2500 -> add to uchar.h
2501 use long property names for enum constants
2502 -> add to UCharacter.UnicodeBlock IDs
2503 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
2504 replace public static final int \1_ID = \2; \3
2505 -> add to UCharacter.UnicodeBlock objects
2506 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
2507 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
2508- 6 new Script (sc) values:
2509 sc ; Ahom ; Ahom
2510 sc ; Hatr ; Hatran
2511 sc ; Hluw ; Anatolian_Hieroglyphs
2512 sc ; Hung ; Old_Hungarian
2513 sc ; Mult ; Multani
2514 sc ; Sgnw ; SignWriting
2515 -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
2516
2517* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
2518 (not strictly necessary for NOT_ENCODED scripts)
2519 ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
2520
2521* generate normalization data files
2522 cd $ICU_ROOT/dbg
2523 bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
2524 bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
2525 bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
2526 bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
2527 bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
2528
2529* build ICU (make install)
2530 so that the tools build can pick up the new definitions from the installed header files.
2531
2532 $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
2533
2534* build Unicode tools using CMake+make
2535
2536~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
2537
2538 # Location (--prefix) of where ICU was installed.
2539 set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
2540 # Location of the ICU source tree.
2541 set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)
2542
2543 ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
2544 ~/svn.icutools/trunk/dbg/unicode/c$ make
2545
2546* generate core properties data files
2547- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
2548- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR
2549- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
2550- rebuild ICU (make install) & tools
2551- run genuca again (see step above) so that it picks up the new nfc.nrm
2552- rebuild ICU (make install) & tools
2553
2554* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2555 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2556- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2557- Unicode 6.0..8.0: U+2260, U+226E, U+226F
2558- nothing new in 8.0, no test file to update
2559
2560* run & fix ICU4C tests
2561- bad Cherokee case folding due to difference in fallbacks:
2562 UCD case folding falls back to no mapping,
2563 ICU runtime case folding falls back to lowercasing;
2564 fixed casepropsbuilder.cpp to generate scf mappings to self
2565 when there is an slc mapping but no scf
2566- Andy handles RBBI & spoof check test failures
2567
2568* collation: CLDR collation root, UCA DUCET
2569
2570- UCA DUCET goes into Mark's Unicode tools, see
2571 https://sites.google.com/site/unicodetools/home#TOC-UCA
2572- CLDR root data files are checked into (CLDR UCA branch)/common/uca/
2573- cd (CLDR UCA branch)/common/uca/
2574- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2575 cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
2576- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2577 cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
2578 (note removing the underscore before "Rules")
2579 cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
2580- restore TODO diffs in UCARules.txt
2581 meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
2582- update (ICU4C)/source/test/testdata/CollationTest_*.txt
2583 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2584 from the CLDR root files (..._CLDR_..._SHORT.txt)
2585 cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
2586 cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
2587 cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
2588- if CLDR common/uca/unihan-index.txt changes, then update
2589 CLDR common/collation/root.xml <collation type="private-unihan">
2590 and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
2591- run genuca, see command line above;
2592 deal with
2593 Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt
2594 (add the character to genuca.cpp sampleCharsToScripts[])
2595 + look up the script for the new sample characters
2596 (e.g., in FractionalUCA.txt)
2597 + *add* mappings to sampleCharsToScripts[], do not replace them
2598 (in case the script sample characters flip-flop)
2599 + insert new scripts in DUCET script order, see the top_byte table
2600 at the beginning of FractionalUCA.txt
2601- rebuild ICU4C
2602
2603* run & fix ICU4C tests, now with new CLDR collation root data
2604- run all tests with the collation test data *_SHORT.txt or the full files
2605 (the full ones have comments, useful for debugging)
2606- note on intltest: if collate/UCAConformanceTest fails, then
2607 utility/MultithreadTest/TestCollators will fail as well;
2608 fix the conformance test before looking into the multi-thread test
2609- fixed bug in CollationWeights::getWeightRanges()
2610 exposed by new data and CollationTest::TestRootElements
2611
2612* update Java data files
2613- refresh just the UCD/UCA-related/derived files, just to be safe
2614- see (ICU4C)/source/data/icu4j-readme.txt
2615- mkdir /tmp/icu4j
2616- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2617 output:
2618 ...
2619 Unicode .icu files built to ./out/build/icudt56l
2620 echo timestamp > uni-core-data
2621 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b
2622 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b
2623 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
2624 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b
2625 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b"
2626 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/
2627 mkdir -p /tmp/icu4j/main/shared/data
2628 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2629 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/
2630 mkdir -p /tmp/icu4j/main/shared/data
2631 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2632 make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
2633- copy the big-endian Unicode data files to another location,
2634 separate from the other data files,
2635 and then refresh ICU4J
2636 cd ~/svn.icu/trunk/dbg/data/out/icu4j
2637 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2638 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2639 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2640 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2641 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
2642 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2643 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2644 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2645 jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
2646
2647* When refreshing all of ICU4J data from ICU4C
2648- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2649- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
2650or
2651- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
2652
2653* update CollationFCD.java
2654 + copy & paste the initializers of lcccIndex[] etc. from
2655 ICU4C/source/i18n/collationfcd.cpp to
2656 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
2657
2658* refresh Java test .txt files
2659- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2660 cd $ICU_SRC_DIR/source/data/unidata
2661 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
2662 cd ../../test/testdata
2663 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
2664 cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
2665
2666* run & fix ICU4J tests
2667
2668*** LayoutEngine script information
2669
2670* ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more,
2671 because the layout engine was deprecated in ICU 54.
2672 Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java
2673 to write lines that we used to add manually.
2674
2675* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
2676 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
2677 in the working directory.
2678
2679 (It also generates ScriptRunData.cpp, which is no longer needed.)
2680
2681 It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
2682 (a plain text file)
2683 which maps ICU versions to the numbers of script/language constants
2684 that were added then.
2685 (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
2686
2687 The generated files have a current copyright date and "@deprecated" statement.
2688
2689* Review changes, fix Java tool if necessary, and copy to ICU4C
2690 cd ~/svn.icu4j/trunk/src
2691 meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
2692 cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
2693 cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
2694
2695*** API additions
2696- send notice to icu-design about new born-@stable API (enum constants etc.)
2697
2698*** merge the Unicode update branches back onto the trunk
2699- do not merge the icudata.jar and testdata.jar,
2700 instead rebuild them from merged & tested ICU4C
2701- make sure that changes to Unicode tools & ICU tools are checked in
2702 http://www.unicode.org/utility/trac/log/trunk/unicodetools
2703 http://bugs.icu-project.org/trac/log/tools/trunk
2704
2705---------------------------------------------------------------------------- ***
2706
2707Unicode 7.0 update for ICU 54
2708
2709http://www.unicode.org/review/pri271/ -- beta review
2710http://www.unicode.org/reports/uax-proposed-updates.html
2711http://www.unicode.org/versions/beta-7.0.0.html#notable_issues
2712http://www.unicode.org/reports/tr44/tr44-13.html
2713
2714*** ICU Trac
2715
2716- ticket 10821: Unicode 7.0, UCA 7.0
2717- C++ branches/markus/uni70 at r35584 from trunk at r35580
2718- Java branches/markus/uni70 at r35587 from trunk at r35545
2719
2720*** CLDR Trac
2721
2722- ticket 7195: UCA 7.0 CLDR root collation
2723- branches/markus/uni70 at r10062 from trunk at r10061
2724
2725- ticket 6762: script metadata for Unicode 7.0 new scripts
2726
2727*** Unicode version numbers
2728- makedata.mak
2729- uchar.h
2730- com.ibm.icu.util.VersionInfo
2731- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
2732
2733- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
2734 so that the makefiles see the new version number.
2735
2736*** data files & enums & parser code
2737
2738* file preparation
2739
2740- download UCD & IDNA files
2741- make sure that the Unicode data folder passed into preparseucd.py
2742 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
2743- only for manual diffs: remove version suffixes from the file names
2744 ~/unidata/uni70/20140403$ ../../desuffixucd.py .
2745 (see https://sites.google.com/site/unicodetools/inputdata)
2746- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
2747- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src
2748- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2749- Restore TODO diffs in source/data/unidata/UCARules.txt
2750 cd $ICU_SRC_DIR
2751 meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt
2752- Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt
2753
2754- also: from http://unicode.org/Public/security/7.0.0/ download new
2755 confusables.txt & confusablesWholeScript.txt
2756 and copy to $ICU_ROOT/src/source/data/unidata/
2757
2758* initial preparseucd.py changes
2759- remove new Unicode scripts from the
2760 only-in-ISO-15924 list according to the error message:
2761 ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass',
2762 'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm',
2763 'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj']
2764 from _scripts_only_in_iso15924
2765 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
2766 and in com.ibm.icu.dev.test.lang.TestUScript.java
2767- NamesList.txt now has a heading with a non-ASCII character
2768 + keep ppucd.txt in platform charset, rather than changing tool/test parsers
2769 + escape non-ASCII characters in heading comments
2770- preparseucd.py gets the Unicode copyright line from PropertyAliases.txt, whose copyright year is currently still 2013
2771 + instead, get the copyright from the first file whose copyright line contains the current year
2772
2773* PropertyValueAliases.txt changes
2774- 32 new Block (blk) values:
2775 blk; Bassa_Vah ; Bassa_Vah
2776 blk; Caucasian_Albanian ; Caucasian_Albanian
2777 blk; Coptic_Epact_Numbers ; Coptic_Epact_Numbers
2778 blk; Diacriticals_Ext ; Combining_Diacritical_Marks_Extended
2779 blk; Duployan ; Duployan
2780 blk; Elbasan ; Elbasan
2781 blk; Geometric_Shapes_Ext ; Geometric_Shapes_Extended
2782 blk; Grantha ; Grantha
2783 blk; Khojki ; Khojki
2784 blk; Khudawadi ; Khudawadi
2785 blk; Latin_Ext_E ; Latin_Extended_E
2786 blk; Linear_A ; Linear_A
2787 blk; Mahajani ; Mahajani
2788 blk; Manichaean ; Manichaean
2789 blk; Mende_Kikakui ; Mende_Kikakui
2790 blk; Modi ; Modi
2791 blk; Mro ; Mro
2792 blk; Myanmar_Ext_B ; Myanmar_Extended_B
2793 blk; Nabataean ; Nabataean
2794 blk; Old_North_Arabian ; Old_North_Arabian
2795 blk; Old_Permic ; Old_Permic
2796 blk; Ornamental_Dingbats ; Ornamental_Dingbats
2797 blk; Pahawh_Hmong ; Pahawh_Hmong
2798 blk; Palmyrene ; Palmyrene
2799 blk; Pau_Cin_Hau ; Pau_Cin_Hau
2800 blk; Psalter_Pahlavi ; Psalter_Pahlavi
2801 blk; Shorthand_Format_Controls ; Shorthand_Format_Controls
2802 blk; Siddham ; Siddham
2803 blk; Sinhala_Archaic_Numbers ; Sinhala_Archaic_Numbers
2804 blk; Sup_Arrows_C ; Supplemental_Arrows_C
2805 blk; Tirhuta ; Tirhuta
2806 blk; Warang_Citi ; Warang_Citi
2807 -> add to uchar.h
2808 use long property names for enum constants
2809 -> add to UCharacter.UnicodeBlock IDs
2810 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
2811 replace public static final int \1_ID = \2; \3
2812 -> add to UCharacter.UnicodeBlock objects
2813 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
2814 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
2815- 28 new Joining_Group (jg) values:
2816 jg ; Manichaean_Aleph ; Manichaean_Aleph
2817 jg ; Manichaean_Ayin ; Manichaean_Ayin
2818 jg ; Manichaean_Beth ; Manichaean_Beth
2819 jg ; Manichaean_Daleth ; Manichaean_Daleth
2820 jg ; Manichaean_Dhamedh ; Manichaean_Dhamedh
2821 jg ; Manichaean_Five ; Manichaean_Five
2822 jg ; Manichaean_Gimel ; Manichaean_Gimel
2823 jg ; Manichaean_Heth ; Manichaean_Heth
2824 jg ; Manichaean_Hundred ; Manichaean_Hundred
2825 jg ; Manichaean_Kaph ; Manichaean_Kaph
2826 jg ; Manichaean_Lamedh ; Manichaean_Lamedh
2827 jg ; Manichaean_Mem ; Manichaean_Mem
2828 jg ; Manichaean_Nun ; Manichaean_Nun
2829 jg ; Manichaean_One ; Manichaean_One
2830 jg ; Manichaean_Pe ; Manichaean_Pe
2831 jg ; Manichaean_Qoph ; Manichaean_Qoph
2832 jg ; Manichaean_Resh ; Manichaean_Resh
2833 jg ; Manichaean_Sadhe ; Manichaean_Sadhe
2834 jg ; Manichaean_Samekh ; Manichaean_Samekh
2835 jg ; Manichaean_Taw ; Manichaean_Taw
2836 jg ; Manichaean_Ten ; Manichaean_Ten
2837 jg ; Manichaean_Teth ; Manichaean_Teth
2838 jg ; Manichaean_Thamedh ; Manichaean_Thamedh
2839 jg ; Manichaean_Twenty ; Manichaean_Twenty
2840 jg ; Manichaean_Waw ; Manichaean_Waw
2841 jg ; Manichaean_Yodh ; Manichaean_Yodh
2842 jg ; Manichaean_Zayin ; Manichaean_Zayin
2843 jg ; Straight_Waw ; Straight_Waw
2844 -> uchar.h & UCharacter.JoiningGroup
2845- 23 new Script (sc) values:
2846 sc ; Aghb ; Caucasian_Albanian
2847 sc ; Bass ; Bassa_Vah
2848 sc ; Dupl ; Duployan
2849 sc ; Elba ; Elbasan
2850 sc ; Gran ; Grantha
2851 sc ; Hmng ; Pahawh_Hmong
2852 sc ; Khoj ; Khojki
2853 sc ; Lina ; Linear_A
2854 sc ; Mahj ; Mahajani
2855 sc ; Mani ; Manichaean
2856 sc ; Mend ; Mende_Kikakui
2857 sc ; Modi ; Modi
2858 sc ; Mroo ; Mro
2859 sc ; Narb ; Old_North_Arabian
2860 sc ; Nbat ; Nabataean
2861 sc ; Palm ; Palmyrene
2862 sc ; Pauc ; Pau_Cin_Hau
2863 sc ; Perm ; Old_Permic
2864 sc ; Phlp ; Psalter_Pahlavi
2865 sc ; Sidd ; Siddham
2866 sc ; Sind ; Khudawadi
2867 sc ; Tirh ; Tirhuta
2868 sc ; Wara ; Warang_Citi
2869 -> uscript.h (many were added before)
2870 comment "Mende Kikakui" for USCRIPT_MENDE
2871 add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias
2872 -> com.ibm.icu.lang.UScript
2873 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2874 replace public static final int \1 = \2; \3
2875- 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
2876 (added 2012-11-01)
2877 Ahom 338 Ahom
2878 Hatr 127 Hatran
2879 Mult 323 Multani
2880 (added 2013-10-12)
2881 Modi 324 Modi
2882 Pauc 263 Pau Cin Hau
2883 Sidd 302 Siddham
2884 -> uscript.h (some overlap with additions from Unicode)
2885 -> com.ibm.icu.lang.UScript
2886 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2887 replace public static final int \1 = \2; \3
2888 -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924
2889 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
2890 and in com.ibm.icu.dev.test.lang.TestUScript.java
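
The Eclipse find/replace pairs above can also be applied outside the IDE. This is a minimal sketch with Python's re module, not part of the ICU tools. The rule names and the convert() helper are made up for illustration, and the sample C lines (including their numeric values) only illustrate the shape of a uchar.h or uscript.h enum line.

```python
import re

# Find/replace pairs from the log, rewritten for Python's re module.
# \g<1> is used instead of \1 so that "\1_ID" cannot be misread.
BLOCK_ID = (r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)",
            r"public static final int \g<1>_ID = \g<2>; \g<3>")
BLOCK_OBJECT = (r"UBLOCK_([^ ]+) = [0-9]+, (/.+)",
                r'public static final UnicodeBlock \g<1> = new UnicodeBlock("\g<1>", \g<1>_ID); \g<2>')
SCRIPT_CODE = (r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)",
               r"public static final int \g<1> = \g<2>; \g<3>")


def convert(line, rule):
    """Apply one (pattern, replacement) rule to a single C enum line."""
    pattern, replacement = rule
    return re.sub(pattern, replacement, line)


# Illustrative input lines; the numbers are placeholders, not checked against the headers.
C_BLOCK = "UBLOCK_BASSA_VAH = 221, /*[16AD0]*/"
C_SCRIPT = "USCRIPT_AHOM = 161,/* Ahom */"

if __name__ == "__main__":
    print(convert(C_BLOCK, BLOCK_ID))
    print(convert(C_BLOCK, BLOCK_OBJECT))
    print(convert(C_SCRIPT, SCRIPT_CODE))
```

Lines that do not match a pattern pass through unchanged, so the helper can be run over a whole pasted enum section.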
2891
2892* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
2893 (not strictly necessary for NOT_ENCODED scripts)
2894 ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
2895
2896* generate normalization data files
2897- cd $ICU_ROOT/dbg
2898- export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
2899- SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
2900- UNIDATA=$ICU_SRC_DIR/source/data/unidata
2901- bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
2902- bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
2903- bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
2904- bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
2905- bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
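
The five gennorm2 invocations above follow one pattern: each output file takes nfc.txt plus the mapping files layered on top of it. A minimal sketch that only assembles the argument lists (it does not run anything) might look like this. The function name gennorm2_commands is hypothetical; the flags and file names are the ones used in the steps above.

```python
import posixpath


def gennorm2_commands(icu_src_dir):
    """Return the gennorm2 argument lists from the log; nothing is executed."""
    src_data_in = posixpath.join(icu_src_dir, "source", "data", "in")
    norm2 = posixpath.join(icu_src_dir, "source", "data", "unidata", "norm2")
    header = posixpath.join(icu_src_dir, "source", "common", "norm2_nfc_data.h")
    # (output file, input files); every output starts from nfc.txt.
    outputs = [
        ("nfc.nrm", ["nfc.txt"]),
        ("nfkc.nrm", ["nfc.txt", "nfkc.txt"]),
        ("nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"]),
        ("uts46.nrm", ["nfc.txt", "uts46.txt"]),
    ]
    # The first command writes the NFC data as a C header instead of a .nrm file.
    commands = [["bin/gennorm2", "-o", header, "-s", norm2, "nfc.txt", "--csource"]]
    for out_name, inputs in outputs:
        out_path = posixpath.join(src_data_in, out_name)
        commands.append(["bin/gennorm2", "-o", out_path, "-s", norm2] + inputs)
    return commands


if __name__ == "__main__":
    for cmd in gennorm2_commands("/path/to/icu/src"):
        print(" ".join(cmd))
```

Printing the commands first makes it easy to compare them with the previous release's log before running them from the build directory.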
2906
2907* build ICU (make install)
2908 so that the tools build can pick up the new definitions from the installed header files.
2909
2910~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
2911
2912* build Unicode tools using CMake+make
2913
2914~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
2915
2916# Location (--prefix) of where ICU was installed.
2917set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst)
2918# Location of the ICU source tree.
2919set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src)
2920
2921~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
2922~/svn.icutools/trunk/dbg/unicode/c$ make
2923
2924* genprops work
2925- new code point range for Joining_Group values: 10AC0..10AFF Manichaean
2926 + add second array of Joining_Group values for at most 10800..10FFF
2927 icutools: unicode/c/genprops/bidipropsbuilder.cpp
2928 icu: source/common/ubidi_props.h/.c/_data.h
2929 icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java
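
To illustrate the idea of a second Joining_Group array: a lookup first tries the existing dense range and then the new one. The sketch below is an assumption about the general shape only, not the actual ubidi_props code. The function name, its parameters, the array contents and the use of 0 for "no joining group" are all hypothetical.

```python
def joining_group(c, start, array, start2, array2):
    """Look up a Joining_Group value from two dense per-range arrays.

    Code points outside both ranges get 0 (assumed here to mean No_Joining_Group).
    """
    if start <= c < start + len(array):
        return array[c - start]
    if start2 <= c < start2 + len(array2):
        return array2[c - start2]
    return 0


if __name__ == "__main__":
    # Placeholder data: a tiny first range and a second range starting at 10AC0.
    first = [7, 8, 9]
    second = [41, 42]
    print(joining_group(0x621, 0x620, first, 0x10AC0, second))
    print(joining_group(0x10AC1, 0x620, first, 0x10AC0, second))
    print(joining_group(0x41, 0x620, first, 0x10AC0, second))
```

Two small arrays avoid one large, mostly empty array spanning everything between the old range and 10AC0..10AFF.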
2930
2931* generate core properties data files
2932- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
2933- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR
2934- rebuild ICU (make install) & tools
2935- run genuca again (see step above) so that it picks up the new nfc.nrm
2936- rebuild ICU (make install) & tools
2937
2938* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2939 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2940- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2941- Unicode 6.0..7.0: U+2260, U+226E, U+226F
2942- nothing new in 7.0, no test file to update
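
The grep step above can be narrowed to non-ASCII code points with a few lines of Python. This is a sketch under the assumption that each data line has the form "code point or range ; status ; ... # comment". The function name and the sample lines are illustrative and are not copied from the real IdnaMappingTable.txt.

```python
def non_ascii_with_status(lines, status="disallowed_STD3_valid"):
    """Return non-ASCII code points whose status field equals the given status."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0]          # drop the trailing comment
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != status:
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        found.extend(c for c in range(start, end + 1) if c > 0x7F)
    return found


# Illustrative lines in the assumed format.
SAMPLE = [
    "# comment line",
    "003D          ; disallowed_STD3_valid  # EQUALS SIGN",
    "00C0          ; mapped ; 00E0          # LATIN CAPITAL LETTER A WITH GRAVE",
    "2260          ; disallowed_STD3_valid  # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid  # NOT LESS-THAN..NOT GREATER-THAN",
]

if __name__ == "__main__":
    print(["U+%04X" % c for c in non_ascii_with_status(SAMPLE)])
```

Comparing the result with the list from the previous Unicode version shows whether uts46test.cpp and UTS46Test.java need new test cases.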
2943
2944* run & fix ICU4C tests
2945
2946* update Java data files
2947- refresh just the UCD-related files, just to be safe
2948- see (ICU4C)/source/data/icu4j-readme.txt
2949- mkdir /tmp/icu4j
2950- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2951 output:
2952 ...
2953 Unicode .icu files built to ./out/build/icudt53l
2954 echo timestamp > uni-core-data
2955 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b
2956 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b
2957 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
2958 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b
2959 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b"
2960 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/
2961 mkdir -p /tmp/icu4j/main/shared/data
2962 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2963 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/
2964 mkdir -p /tmp/icu4j/main/shared/data
2965 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2966 make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data'
2967- copy the big-endian Unicode data files to another location,
2968 separate from the other data files
2969 ICUDT=icudt54b
2970 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2971 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2972 cd ~/svn.icu/uni70/dbg/data/out/icu4j
2973 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2974 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2975 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
2976 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2977 cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2978 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2979- refresh ICU4J
2980 ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
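
The copy steps above (everything before the final jar uf) can be sketched in Python as follows. The function name stage_unicode_data and its parameters are hypothetical. Instead of copying cnvalias.icu and deleting it afterwards, the sketch skips that file, which should give the same staging directory.

```python
import shutil
from pathlib import Path


def stage_unicode_data(src_root, dst_root, icudt):
    """Copy only the Unicode-related big-endian data files into a staging tree."""
    rel = Path("com/ibm/icu/impl/data") / icudt
    src, dst = Path(src_root) / rel, Path(dst_root) / rel
    (dst / "coll").mkdir(parents=True, exist_ok=True)
    (dst / "brkitr").mkdir(parents=True, exist_ok=True)
    # Top level: confusables.cfu, *.icu (except cnvalias.icu) and *.nrm.
    top_level = [src / "confusables.cfu", *src.glob("*.icu"), *src.glob("*.nrm")]
    for f in top_level:
        if f.name != "cnvalias.icu":
            shutil.copy(f, dst / f.name)
    # coll: only the .icu files; brkitr: all files.
    for f in (src / "coll").glob("*.icu"):
        shutil.copy(f, dst / "coll" / f.name)
    for f in (src / "brkitr").iterdir():
        if f.is_file():
            shutil.copy(f, dst / "brkitr" / f.name)
    return dst
```

For example, stage_unicode_data("data/out/icu4j", "/tmp/icu4j", "icudt54b") would fill the staging tree that the jar uf command then merges into icudata.jar.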
2981
2982* update CollationFCD.java
2983 + copy & paste the initializers of lcccIndex[] etc. from
2984 ICU4C/source/i18n/collationfcd.cpp to
2985 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
2986
2987* refresh Java test .txt files
2988- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2989 cd $ICU_SRC_DIR/source/data/unidata
2990 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
2991 cd ../../test/testdata
2992 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
2993 cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
2994
2995* UCA
2996
2997- download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/
2998- run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata)
2999- update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/
3000- run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA
3001- output files are in ~/svn.unitools/Generated/uca/7.0.0/
3002- review data; compare files, use blankweights.sed or similar
3003 ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt
3004- cd ~/svn.unitools/Generated/uca/7.0.0/
3005- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
3006 cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
3007- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
3008 (note removing the underscore before "Rules")
3009 cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
3010- update (ICU4C)/source/test/testdata/CollationTest_*.txt
3011 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
3012 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
3013 cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
3014 cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
3015 cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
3016- run genuca, see command line above
3017- rebuild ICU4C
3018- refresh ICU4J collation data:
3019 (subset of instructions above for properties data refresh, except copies all coll/*)
3020 ICUDT=icudt54b
3021 ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3022 ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
3023 ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
3024 ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
3025- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
3026- note on intltest: if collate/UCAConformanceTest fails, then
3027 utility/MultithreadTest/TestCollators will fail as well;
3028 fix the conformance test before looking into the multi-thread test
3029- copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors
3030- copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch
3031 ~/svn.unitools$ cp Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/
3032
3033* When refreshing all of ICU4J data from ICU4C
3034- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3035- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
3036or
3037- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
3038
3039* run & fix ICU4J tests
3040
3041*** LayoutEngine script information
3042
3043(For details see the Unicode 5.2 change log below.)
3044
3045* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
3046 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
3047 in the working directory.
3048 (It also generates ScriptRunData.cpp, which is no longer needed.)
3049
3050 The generated files have a current copyright date and "@stable" statement.
3051 ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java
3052 for "born stable" Unicode API constants, and to stop parsing ICU version numbers
3053 which may not contain dots any more.
3054
3055- diff current <icu>/source/layout files vs. generated ones
3056 ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
3057 review and manually merge desired changes;
3058 fix gratuitous changes, incorrect @draft/@stable and missing aliases;
3059 Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
3060- if you just copy the above files, then
3061 fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
3062 manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
3063
3064*** API additions
3065- send notice to icu-design about new born-@stable API (enum constants etc.)
3066
3067*** merge the Unicode update branches back onto the trunk
3068- do not merge the icudata.jar and testdata.jar,
3069 instead rebuild them from merged & tested ICU4C
3070
3071---------------------------------------------------------------------------- ***
3072
3073Unicode 6.3 update
3074
3075http://www.unicode.org/review/pri249/ -- beta review
3076http://www.unicode.org/reports/uax-proposed-updates.html
3077http://www.unicode.org/versions/beta-6.3.0.html#notable_issues
3078http://www.unicode.org/reports/tr44/tr44-11.html
3079
3080*** ICU Trac
3081
3082- ticket 10128: update ICU to Unicode 6.3 beta
3083- ticket 10168: update ICU to Unicode 6.3 final
3084- C++ branches/markus/uni63 at r33552 from trunk at r33551
3085- Java branches/markus/uni63 at r33550 from trunk at r33553
3086
3087- ticket 10142: implement Unicode 6.3 bidi algorithm additions
3088
3089*** Unicode version numbers
3090- makedata.mak
3091- uchar.h
3092 (configure.in & configure have been modified to extract the version from uchar.h)
3093- com.ibm.icu.util.VersionInfo
3094- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
3095
3096- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
3097 so that the makefiles see the new version number.
3098
3099*** data files & enums & parser code
3100
3101* file preparation
3102
3103- download UCD, UCA & IDNA files
3104- make sure that the Unicode data folder passed into preparseucd.py
3105 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
3106- modify preparseucd.py:
3107 parse new file BidiBrackets.txt
3108 with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type
3109- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src
3110- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
3111- Check test file diffs for previously commented-out, known-failing data lines;
3112 probably need to keep those commented out.
3113
3114* PropertyAliases.txt changes
3115- 1 new Enumerated Property
3116 bpt ; Bidi_Paired_Bracket_Type
3117 -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType
3118 -> ubidi_props.h & .c & UBiDiProps.java
3119 -> remember to write the max value at UBIDI_MAX_VALUES_INDEX
3120 -> uprops.cpp
3121 -> change ubidi.icu format version from 2.0 to 2.1
3122- 1 new Miscellaneous Property
3123 bpb ; Bidi_Paired_Bracket
3124 -> uchar.h & UProperty.java
3125 -> ppucd.h & .cpp
3126
3127* PropertyValueAliases.txt changes
3128- 3 Bidi_Paired_Bracket_Type (bpt) values:
3129 bpt; c ; Close
3130 bpt; n ; None
3131 bpt; o ; Open
3132 -> uchar.h & UCharacter.BidiPairedBracketType
3133 -> ubidi_props.h & .c & UBiDiProps.java
3134 -> change ubidi.icu format version from 2.0 to 2.1
3135- 4 new Bidi_Class (bc) values:
3136 bc ; FSI ; First_Strong_Isolate
3137 bc ; LRI ; Left_To_Right_Isolate
3138 bc ; RLI ; Right_To_Left_Isolate
3139 bc ; PDI ; Pop_Directional_Isolate
3140 -> uchar.h & UCharacterEnums.ECharacterDirection
3141 -> until the bidi code gets updated,
3142 Roozbeh suggests mapping the new bc values to ON (Other_Neutral)
3143- 3 new Word_Break (WB) values:
3144 WB ; HL ; Hebrew_Letter
3145 WB ; SQ ; Single_Quote
3146 WB ; DQ ; Double_Quote
3147 -> uchar.h & UCharacter.WordBreak
3148 -> first time Word_Break numeric constants exceed 4 bits (now 17 values)
3149- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
3150 (added 2012-10-16)
3151 Aghb 239 Caucasian Albanian
3152 Mahj 314 Mahajani
3153 -> uscript.h
3154 -> com.ibm.icu.lang.UScript
3155 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
3156 replace public static final int \1 = \2;\3
3157 -> preparseucd.py _scripts_only_in_iso15924
3158 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
3159 and in com.ibm.icu.dev.test.lang.TestUScript.java
3160 -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
3161 (not strictly necessary for NOT_ENCODED scripts)
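
On the Word_Break note above (17 values no longer fit into 4 bits): the arithmetic can be checked with a one-line helper. The function name is made up for illustration. It assumes the constants are numbered consecutively from 0.

```python
def bits_needed(value_count):
    """Bits needed to store the constants 0 .. value_count-1."""
    return max(1, (value_count - 1).bit_length())


if __name__ == "__main__":
    # 16 values still fit into 4 bits; the 17th needs a fifth bit.
    print(bits_needed(16), bits_needed(17))
```

Any data structure that packs the Word_Break value into a 4-bit field therefore needs a wider field from this version on.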
3162
3163* generate normalization data files
3164- ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib
3165- ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in
3166- ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata
3167- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
3168- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
3169- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
3170- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
3171
3172* build ICU (make install)
3173 so that the tools build can pick up the new definitions from the installed header files.
3174
3175~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
3176
3177* build Unicode tools using CMake+make
3178
3179~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
3180
3181# Location (--prefix) of where ICU was installed.
3182set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst)
3183# Location of the ICU source tree.
3184set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src)
3185
3186~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
3187~/svn.icutools/trunk/dbg/unicode/c$ make
3188
3189* generate core properties data files
3190- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src
3191- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src
3192- rebuild ICU (make install) & tools
3193- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
3194- rebuild ICU (make install) & tools
3195
3196* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
3197 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
3198- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
3199- Unicode 6.0..6.3: U+2260, U+226E, U+226F
3200- nothing new in 6.3, no test file to update
3201
3202* update Java data files
3203- refresh just the UCD-related files, just to be safe
3204- see (ICU4C)/source/data/icu4j-readme.txt
3205- mkdir /tmp/icu4j
3206- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3207 output:
3208 ...
3209 Unicode .icu files built to ./out/build/icudt52l
3210 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b
3211 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b
3212 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
3213 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b
3214 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b"
3215 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/
3216 mkdir -p /tmp/icu4j/main/shared/data
3217 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
3218 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/
3219 mkdir -p /tmp/icu4j/main/shared/data
3220 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
3221 make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data'
3222- copy the big-endian Unicode data files to another location,
3223 separate from the other data files
3224 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
3225 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
3226 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
3227 ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu
3228 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
3229 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
3230 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
3231- refresh ICU4J
3232 ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
3233
3234* refresh Java test .txt files
3235- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
3236
3237* UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files
3238
3239- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
3240- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
3241- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
3242- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
3243 (note removing the underscore before "Rules")
3244- update (ICU4C)/source/test/testdata/CollationTest_*.txt
3245 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
3246 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
3247- check test file diffs for previously commented-out, known-failing data lines;
3248 probably need to keep those commented out
3249- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
3250- run genuca, see command line above
3251- rebuild ICU4C
3252- refresh ICU4J collation data:
3253 (subset of instructions above for properties data refresh, except copies all coll/*)
3254 ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3255 ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
3256 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
3257 ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
3258- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
3259- note on intltest: if collate/UCAConformanceTest fails, then
3260 utility/MultithreadTest/TestCollators will fail as well;
3261 fix the conformance test before looking into the multi-thread test
3262
3263* test ICU, fix test code where necessary
3264
3265* When refreshing all of ICU4J data from ICU4C
3266- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3267- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
3268or
3269- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
3270
3271*** LayoutEngine script information
3272- skipped for Unicode 6.3: no new scripts
3273
3274*** merge the Unicode update branches back onto the trunk
3275- do not merge the icudata.jar and testdata.jar,
3276 instead rebuild them from merged & tested ICU4C
3277
3278---------------------------------------------------------------------------- ***
3279
3280Unicode 6.2 update
3281
3282http://www.unicode.org/review/pri230/
3283http://www.unicode.org/versions/beta-6.2.0.html
3284http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0
3285http://www.unicode.org/review/pri227/ Changes to Script Extensions Property Values
3286http://www.unicode.org/review/pri228/ Changing some common characters from Punctuation to Symbol
3287http://www.unicode.org/review/pri229/ Linebreaking Changes for Pictographic Symbols
3288http://www.unicode.org/reports/tr46/tr46-8.html IDNA
3289http://unicode.org/Public/idna/6.2.0/
3290
3291*** ICU Trac
3292
3293- ticket 9515: Unicode 6.2: final ICU update
3294
3295- ticket 9514: UCA 6.2: fix UCARules.txt
3296
3297- ticket 9437: update ICU to Unicode 6.2
3298- C++ branches/markus/uni62 at r32050 from trunk at r32041
3299- Java branches/markus/uni62 at r32068 from trunk at r32066
3300
3301*** Unicode version numbers
3302- makedata.mak
3303- uchar.h
3304 (configure.in & configure: have been modified to extract the version from uchar.h)
3305- com.ibm.icu.util.VersionInfo
3306- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
3307
3308*** data files & enums & parser code
3309
3310* file preparation
3311
3312- download UCD, UCA & IDNA files
3313- make sure that the Unicode data folder passed into preparseucd.py
3314 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
3315- modify preparseucd.py: NamesList.txt is now in UTF-8
3316- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src
3317- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
3318- Check test file diffs for previously commented-out, known-failing data lines;
3319 probably need to keep those commented out.
3320
3321* PropertyValueAliases.txt changes
3322- 1 new Line_Break (lb) value:
3323 lb ; RI ; Regional_Indicator
3324 -> uchar.h & UCharacter.LineBreak
3325- 1 new Word_Break (WB) value:
3326 WB ; RI ; Regional_Indicator
3327 -> uchar.h & UCharacter.WordBreak
3328- 1 new Grapheme_Cluster_Break (GCB) value:
3329 GCB; RI ; Regional_Indicator
3330 -> uchar.h & UCharacter.GraphemeClusterBreak
3331
3332* 3 new numeric values
3333 The new value -1 was really supposed to be NaN, but that would have required
3334 new UnicodeData.txt syntax. The value -1 can already be represented as a "fraction" of -1/1,
3335 but encodeNumericValue() in corepropsbuilder.cpp had to be fixed.
3336 cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1
3337 cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1
3338 The two new values 216000 and 432000 require an addition to the encoding of numeric values.
3339 cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000
3340 cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000
3341 -> uprops.h, uchar.c & UCharacterProperty.java
3342 -> cucdtst.c & UCharacterTest.java
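A minimal sketch of the arithmetic behind these values, not ICU code. It assumes ppucd.txt-style "cp" lines as quoted above; the function names are hypothetical, and the base-60 split only illustrates why 216000 and 432000 are compact in sexagesimal terms. It makes no claim about how uprops.h actually stores them.

```python
from fractions import Fraction


def parse_ppucd_line(line):
    """Split a ppucd.txt-style 'cp' line into (code point, {property: value})."""
    fields = line.strip().split(";")
    if fields[0] != "cp":
        raise ValueError("not a cp line: " + line)
    code_point = int(fields[1], 16)
    props = {}
    for field in fields[2:]:
        # Properties without '=' (binary properties) get an empty value.
        name, _, value = field.partition("=")
        props[name] = value
    return code_point, props


def numeric_value(props):
    """Return nv as an exact fraction; the string '-1' becomes -1/1."""
    return Fraction(props["nv"])


def split_base60(value):
    """Write a positive integer as mantissa * 60**exponent, exponent as large as possible."""
    if value <= 0:
        raise ValueError("value must be positive")
    mantissa, exponent = value, 0
    while mantissa % 60 == 0:
        mantissa //= 60
        exponent += 1
    return mantissa, exponent
```

With the lines above, nv=-1 parses to the fraction -1/1, and 216000 and 432000 split into 1 * 60**3 and 2 * 60**3.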
3343
3344* generate normalization data files
3345- ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib
3346- ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in
3347- ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata
3348- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
3349- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
3350- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
3351- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
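The four gennorm2 calls above differ only in the output file and the list of input files. A small sketch that builds the same argument lists, assuming only the -o and -s options shown above; the function name and the dictionary are hypothetical, and nothing is executed here.

```python
# Output file -> input files, in the order used in the steps above.
NORM2_TARGETS = {
    "nfc.nrm": ["nfc.txt"],
    "nfkc.nrm": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf.nrm": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46.nrm": ["nfc.txt", "uts46.txt"],
}


def gennorm2_commands(src_data_in, unidata, gennorm2="bin/gennorm2"):
    """Return one argument list per .nrm file; the caller decides how to run them."""
    commands = []
    for output, inputs in NORM2_TARGETS.items():
        command = [gennorm2, "-o", f"{src_data_in}/{output}", "-s", f"{unidata}/norm2"]
        commands.append(command + inputs)
    return commands
```

Each returned list could be passed to subprocess.run() from the build folder, with LD_LIBRARY_PATH set as in the first step.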
3352
3353* build ICU (make install)
3354 so that the tools build can pick up the new definitions from the installed header files.
3355* build Unicode tools using CMake+make
3356
3357* generate core properties data files
3358- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src
3359- in initial bootstrapping, change the UCA version
3360 in source/data/unidata/FractionalUCA.txt to match the new Unicode version
3361- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src
3362- rebuild ICU (make install) & tools
3363 + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
3364 check if the UCA version in FractionalUCA.txt matches the new Unicode version
3365 (see step above)
3366- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
3367- rebuild ICU (make install) & tools
3368
3369* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
3370 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
3371- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
3372- Unicode 6.0..6.2: U+2260, U+226E, U+226F
3373- nothing new in 6.2, no test file to update
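The grep step above can be sketched as a small filter. This assumes the IdnaMappingTable.txt layout of "code point or range ; status ; optional mapping # comment"; the function name is hypothetical.

```python
def non_ascii_std3_valid(lines):
    """Return (start, end) ranges above U+007F whose status is disallowed_STD3_valid."""
    result = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start_text, _, end_text = fields[0].partition("..")
        start = int(start_text, 16)
        end = int(end_text, 16) if end_text else start
        if end > 0x7F:
            # Clip ranges that begin in ASCII so that only the non-ASCII part is reported.
            result.append((max(start, 0x80), end))
    return result
```

Any range reported here beyond U+2260, U+226E and U+226F would be a candidate for new test cases in uts46test.cpp and UTS46Test.java.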
3374
3375* update Java data files
3376- refresh just the UCD-related files, just to be safe
3377- see (ICU4C)/source/data/icu4j-readme.txt
3378- mkdir /tmp/icu4j
3379- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3380 output:
3381 ...
3382 Unicode .icu files built to ./out/build/icudt50l
3383 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b
3384 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b
3385 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
3386 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b
3387 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b"
3388 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/
3389 mkdir -p /tmp/icu4j/main/shared/data
3390 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
3391 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/
3392 mkdir -p /tmp/icu4j/main/shared/data
3393 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
3394 make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data'
3395- copy the big-endian Unicode data files to another location,
3396 separate from the other data files
3397 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
3398 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
3399 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
3400 ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu
3401 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
3402 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
3403 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
3404- refresh ICU4J
3405 ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
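The copy steps above select *.icu (minus cnvalias.icu), *.nrm, coll/*.icu and brkitr/*. A sketch of the same selection in Python, assuming the folder layout shown above; the function name is hypothetical. Unlike the mkdir/cp/rm sequence, it skips cnvalias.icu instead of deleting it afterwards, and it creates subfolders only when there is something to copy.

```python
import shutil
from pathlib import Path


def copy_unicode_data(src, dst):
    """Copy the Unicode-related data files from src to dst; return the relative paths copied."""
    src, dst = Path(src), Path(dst)
    selections = [
        (src, "*.icu"),
        (src, "*.nrm"),
        (src / "coll", "*.icu"),
        (src / "brkitr", "*"),
    ]
    copied = []
    for folder, pattern in selections:
        for path in sorted(folder.glob(pattern)):
            # cnvalias.icu is converter data, not Unicode data, so leave it out.
            if not path.is_file() or path.name == "cnvalias.icu":
                continue
            relative = path.relative_to(src)
            target = dst / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            copied.append(relative.as_posix())
    return copied
```

Here src would be the com/ibm/icu/impl/data/icudt50b folder under data/out/icu4j, and dst the matching folder under /tmp/icu4j.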
3406
3407* refresh Java test .txt files
3408- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
3409
3410* UCA
3411
3412- get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
3413- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
3414- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
3415- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
3416 (note removing the underscore before "Rules")
3417- update (ICU4C)/source/test/testdata/CollationTest_*.txt
3418 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
3419 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
3420- check test file diffs for previously commented-out, known-failing data lines;
3421 probably need to keep those commented out
3422- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
3423- run genuca, see command line above
3424- rebuild ICU4C
3425- refresh ICU4J collation data:
3426 (subset of instructions above for properties data refresh, except copies all coll/*)
3427 ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3428 ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
3429 ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
3430 ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
3431- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
3432- note on intltest: if collate/UCAConformanceTest fails, then
3433 utility/MultithreadTest/TestCollators will fail as well;
3434 fix the conformance test before looking into the multi-thread test
3435
3436* test ICU, fix test code where necessary
3437
3438* When refreshing all of ICU4J data from ICU4C
3439- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3440- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
3441or
3442- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
3443
3444*** LayoutEngine script information
3445- skipped for Unicode 6.2: no new scripts
3446
3447*** merge the Unicode update branches back onto the trunk
3448- do not merge the icudata.jar and testdata.jar,
3449 instead rebuild them from merged & tested ICU4C
3450
3451---------------------------------------------------------------------------- ***
3452
3453Future Unicode update
3454
3455Tools simplified since the Unicode 6.1 update. See
3456- http://site.icu-project.org/design/props/ppucd
3457- http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972
3458
3459* Unicode version numbers
3460- icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates
3461
3462* file preparation
3463- ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:
3464- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src
3465- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
3466- Check test file diffs for previously commented-out, known-failing data lines;
3467 probably need to keep those commented out.
3468
3469* PropertyValueAliases.txt changes
3470- Script codes that are in ISO 15924 but not in Unicode are now listed in
3471 preparseucd.py, in the _scripts_only_in_iso15924 variable.
3472 If there are new ISO codes, then add them.
3473 If Unicode adds some of them, then remove them from the .py variable.
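The bookkeeping rule for the ISO-only script codes amounts to a set update. A sketch under the assumption that the codes are kept as a plain collection of four-letter strings; the real type and contents of _scripts_only_in_iso15924 in preparseucd.py may differ, and the function name is hypothetical.

```python
def update_iso_only_scripts(iso_only, new_iso_codes, unicode_scripts):
    """Add new ISO 15924 codes, then drop every code that Unicode has encoded."""
    remaining = (set(iso_only) | set(new_iso_codes)) - set(unicode_scripts)
    return sorted(remaining)
```

For example, starting from Afak, Shrd, Sora and Takr, adding Khoj, Tirh and Hluw, and removing the three scripts encoded in Unicode 6.1 (Shrd, Sora, Takr) leaves Afak, Hluw, Khoj and Tirh.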
3474
3475* UnicodeData.txt changes
3476- No more manual changes for CJK ranges for algorithmic names;
3477 those are now written to ppucd.txt and genprops reads them from there.
3478
3479* generate core properties data files (makeprops.sh was deleted)
3480- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src
3481
3482* no more manual updates of source/data/unidata/norm2/nfkc_cf.txt
3483- it is now generated by preparseucd.py
3484
3485* no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt
3486- it is now generated by preparseucd.py
3487- make sure that the Unicode data folder passed into preparseucd.py
3488 includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
3489 (can be in some subfolder)
3490
3491* generate normalization data files
3492- ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib
3493- ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in
3494- ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata
3495- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
3496- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
3497- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
3498- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
3499
3500* build ICU (make install)
3501* build Unicode tools using CMake+make
3502
3503* new way to call genuca (makeuca.sh was deleted)
3504- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src
3505
3506---------------------------------------------------------------------------- ***
3507
3508Unicode 6.1 update
3509
3510*** ICU Trac
3511
3512- ticket 8995 final update to Unicode 6.1
3513- ticket 8994 regenerate source/layout/CanonData.cpp
3514
3515- ticket 8961 support Unicode "Age" value *names*
3516- ticket 8963 support multiple character name aliases & types
3517
3518- ticket 8827 "update ICU to Unicode 6.1"
3519- C++ branches/markus/uni61 at r30864 from trunk at r30843
3520- Java branches/markus/uni61 at r30865 from trunk at r30863
3521
3522*** Unicode version numbers
3523- makedata.mak
3524- uchar.h
3525 (configure.in & configure: have been modified to extract the version from uchar.h)
3526- com.ibm.icu.util.VersionInfo
3527- icutools/unicode/makedefs.sh
3528 + also review & update other definitions in that file,
3529 e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l
3530
3531*** data files & enums & parser code
3532
3533* file preparation
3534
3535~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed
3536- This prepares both unidata and testdata files in respective output subfolders.
3537- Check test file diffs for previously commented-out, known-failing data lines;
3538 probably need to keep those commented out.
3539
3540* PropertyValueAliases.txt changes
3541- 11 new block names:
3542 Arabic_Extended_A
3543 Arabic_Mathematical_Alphabetic_Symbols
3544 Chakma
3545 Meetei_Mayek_Extensions
3546 Meroitic_Cursive
3547 Meroitic_Hieroglyphs
3548 Miao
3549 Sharada
3550 Sora_Sompeng
3551 Sundanese_Supplement
3552 Takri
3553 -> add to uchar.h
3554 -> add to UCharacter.UnicodeBlock IDs
3555 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
3556 replace public static final int \1_ID = \2; \3
3557 -> add to UCharacter.UnicodeBlock objects
3558 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
3559 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
3560- 1 new Joining_Group (jg) value:
3561 Rohingya_Yeh
3562 -> uchar.h & UCharacter.JoiningGroup
3563- 2 new Line_Break (lb) values:
3564 CJ=Conditional_Japanese_Starter
3565 HL=Hebrew_Letter
3566 -> uchar.h & UCharacter.LineBreak
3567- 7 new scripts:
3568 sc ; Cakm ; Chakma
3569 sc ; Merc ; Meroitic_Cursive
3570 sc ; Mero ; Meroitic_Hieroglyphs
3571 sc ; Plrd ; Miao
3572 sc ; Shrd ; Sharada
3573 sc ; Sora ; Sora_Sompeng
3574 sc ; Takr ; Takri
3575 -> remove these from SyntheticPropertyValueAliases.txt
3576 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
3577 and in com.ibm.icu.dev.test.lang.TestUScript.java
3578- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
3579 (added 2011-06-21)
3580 Khoj 322 Khojki
3581 Tirh 326 Tirhuta
3582 and another one added 2011-12-09
3583 Hluw 080 Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)
3584 -> uscript.h
3585 -> com.ibm.icu.lang.UScript
3586 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
3587 replace public static final int \1 = \2;\3
3588 -> SyntheticPropertyValueAliases.txt
3589 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
3590 and in com.ibm.icu.dev.test.lang.TestUScript.java
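The Eclipse find/replace patterns above also work with Python's re module, which uses the same \1 backreference syntax in replacements. A sketch for trying them on pasted enum lines; the rule names and the numeric values in the example are placeholders, not the real uchar.h or uscript.h values.

```python
import re

# (find, replace) pairs copied from the steps above.
BLOCK_ID = (
    r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)",
    r"public static final int \1_ID = \2; \3",
)
BLOCK_OBJECT = (
    r"UBLOCK_([^ ]+) = [0-9]+, (/.+)",
    r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
)
SCRIPT = (
    r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)",
    r"public static final int \1 = \2;\3",
)


def convert(line, rule):
    """Apply one find/replace rule to one line of C enum source."""
    pattern, replacement = rule
    return re.sub(pattern, replacement, line)


if __name__ == "__main__":
    # Placeholder value 212; check uchar.h for the real constant.
    print(convert("    UBLOCK_CHAKMA = 212, /*[11100]*/", BLOCK_ID))
```

Leading indentation is preserved because only the matched part of the line is replaced. Lines that do not match a rule come back unchanged, so a whole enum can be piped through line by line.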
3591
3592* UnicodeData.txt changes
3593- the last Unihan code point changes from U+9FCB to U+9FCC
3594 search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive)
3595 + do change gennames.c
3596 + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java
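The search for the old end and the new limit can be sketched as a case-insensitive scan over source lines. The function name is hypothetical, and the sample lines in the usage note are invented, not taken from gennames.c or ucol.cpp.

```python
import re

# Matches 9FCB (old end) and 9FCC (new end, old limit) in any letter case.
UNIHAN_END = re.compile(r"9FC[BC]", re.IGNORECASE)


def find_unihan_limits(lines):
    """Return (line number, text) for every line that mentions 9FCB or 9FCC."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if UNIHAN_END.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Calling it on ["#define CJK_END 0x9fcb", "int x = 0;", "limit = 0x9FCC;"] reports lines 1 and 3. Each hit still needs a manual check of whether the constant is an inclusive end or an exclusive limit.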
3597
3598* DerivedBidiClass.txt changes
3599- 2 new default-AL blocks:
3600# Arabic Extended-A: U+08A0 - U+08FF (was default-R)
3601# Arabic Mathematical Alphabetic Symbols:
3602# U+1EE00 - U+1EEFF (was default-R)
3603- 2 new default-R blocks:
3604# Meroitic Hieroglyphs:
3605# U+10980 - U+1099F
3606# Meroitic Cursive: U+109A0 - U+109FF
3607 -> should be picked up by the explicit data in the file
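A sketch of what the new default ranges mean for a lookup, using only the four blocks listed above. The table and function names are hypothetical; the fallback argument stands in for every other default in DerivedBidiClass.txt, which this sketch does not model.

```python
# (start, end, default Bidi_Class) for the blocks listed above; ends are inclusive.
DEFAULT_BIDI_RANGES = [
    (0x08A0, 0x08FF, "AL"),    # Arabic Extended-A
    (0x1EE00, 0x1EEFF, "AL"),  # Arabic Mathematical Alphabetic Symbols
    (0x10980, 0x1099F, "R"),   # Meroitic Hieroglyphs
    (0x109A0, 0x109FF, "R"),   # Meroitic Cursive
]


def default_bidi_class(code_point, fallback="L"):
    """Return the default Bidi_Class for an unassigned code point in one of the listed blocks."""
    for start, end, bidi_class in DEFAULT_BIDI_RANGES:
        if start <= code_point <= end:
            return bidi_class
    return fallback
```

Since the file lists these ranges explicitly, the data builder should not need such a table; the sketch is only a quick way to spot-check expected values in tests.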
3608
3609* NameAliases.txt changes
3610- from
3611 # Each line has two fields
3612 # First field: Code point
3613 # Second field: Alias
3614- to
3615 # Each line has three fields, as described here:
3616 #
3617 # First field: Code point
3618 # Second field: Alias
3619 # Third field: Type
3620- Also, the file format previously allowed multiple aliases per code point, but only now
3621 does the file actually provide several, even several of the same type. For example,
3622 FEFF;BYTE ORDER MARK;alternate
3623 FEFF;BOM;abbreviation
3624 FEFF;ZWNBSP;abbreviation
3625- This breaks our gennames parser, unames.icu data structure, and API.
3626 Fix gennames to only pick up "correction" aliases.
3627 New ticket #8963 for further changes.
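The interim fix, keeping only "correction" aliases, can be sketched as follows. This is a Python illustration of the three-field format described above, not the gennames parser; the function names are hypothetical.

```python
def parse_name_aliases(lines):
    """Yield (code point, alias, type) from NameAliases.txt-style lines."""
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) != 3:
            raise ValueError("expected three fields: " + line)
        yield int(fields[0], 16), fields[1], fields[2]


def correction_aliases(lines):
    """Map code point -> alias, keeping only aliases of type 'correction'."""
    return {
        code_point: alias
        for code_point, alias, kind in parse_name_aliases(lines)
        if kind == "correction"
    }
```

Given the three FEFF lines above plus 01A2;LATIN CAPITAL LETTER GHA;correction, only the U+01A2 entry survives. If a code point ever had two corrections, this dictionary would keep the last one, which is one reason the data structure needs the further changes tracked in the new ticket.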
3628
3629* run genpname/preparse.pl (on Linux)
3630 + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
3631 + make sure that data.h is writable
3632 + perl preparse.pl ~/svn.icu/trunk/src > out.txt
3633 + preparse.pl shows no errors, out.txt Info and Warning lines look ok
3634
3635* build ICU (make install)
3636 so that the tools build can pick up the new definitions from the installed header files.
3637* build Unicode tools (at least genpname) using CMake+make
3638
3639* run genpname
3640 (builds both pnames.icu and propname_data.h)
3641- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
3642- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
3643
3644* build ICU (make install)
3645* build Unicode tools using CMake+make
3646
3647* update source/data/unidata/norm2/nfkc_cf.txt
3648- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
3649
3650* update source/data/unidata/norm2/uts46.txt
3651- download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
3652 to ~/svn.icu/tools/trunk/src/unicode/py
3653- adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".
3654- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
3655- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
3656
3657* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
3658 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
3659- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
3660- Unicode 6.0..6.1: U+2260, U+226E, U+226F
3661- nothing new in 6.1, no test file to update
3662
3663* generate core properties data files
3664- in initial bootstrapping, change the UCA version
3665 in source/data/unidata/FractionalUCA.txt to match the new Unicode version
3666- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
3667- rebuild ICU & tools
3668 + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
3669 check if the UCA version in FractionalUCA.txt matches the new Unicode version
3670 (see step above)
3671- run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:
3672 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
3673- rebuild ICU & tools
3674
3675* update Java data files
3676- refresh just the UCD-related files, just to be safe
3677- see (ICU4C)/source/data/icu4j-readme.txt
3678- mkdir /tmp/icu4j
3679- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3680 output:
3681 ...
3682 Unicode .icu files built to ./out/build/icudt49l
3683 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b
3684 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b
3685 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
3686 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b
3687 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b"
3688 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/
3689 mkdir -p /tmp/icu4j/main/shared/data
3690 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
3691 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/
3692 mkdir -p /tmp/icu4j/main/shared/data
3693 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
3694 make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data'
3695- copy the big-endian Unicode data files to another location,
3696 separate from the other data files
3697 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
3698 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
3699 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
3700 ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu
3701 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
3702 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
3703 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
3704- refresh ICU4J
3705 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
3706
3707* refresh Java test .txt files
3708- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
3709
3710* test ICU so far, fix test code where necessary
3711- temporarily ignore collation issues that look like UCA/UCD mismatches,
3712 until UCA data is updated
3713
3714* UCA
3715
3716- get output from Mark's tools; look in
3717 http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. version>.txt
3718- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
3719- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
3720 (note removing the underscore before "Rules")
3721- update (ICU)/source/test/testdata/CollationTest_*.txt
3722 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
3723 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
3724- check test file diffs for previously commented-out, known-failing data lines;
3725 probably need to keep those commented out
3726- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
3727- run makeuca.sh:
3728 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
3729- rebuild ICU4C
3730- refresh ICU4J collation data:
3731 (subset of instructions above for properties data refresh, except copies all coll/*)
3732 ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3733 ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
3734 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
3735 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
3736- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
3737- note on intltest: if collate/UCAConformanceTest fails, then
3738 utility/MultithreadTest/TestCollators will fail as well;
3739 fix the conformance test before looking into the multi-thread test
3740
3741* When refreshing all of ICU4J data from ICU4C
3742- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3743- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
3744or
3745- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
3746
3747*** LayoutEngine script information
3748
3749(For details see the Unicode 5.2 change log below.)
3750
3751* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
3752 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
3753 in the working directory.
3754 (It also generates ScriptRunData.cpp, which is no longer needed.)
3755
3756 The generated files have a current copyright date and "@draft" statement.
3757
3758- diff current <icu>/source/layout files vs. generated ones
3759 ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
3760 review and manually merge desired changes;
3761 fix gratuitous changes, incorrect @draft and missing aliases;
3762 Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
3763- if you just copy the above files, then
3764 fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
3765 manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
3766
3767*** merge the Unicode update branches back onto the trunk
3768- do not merge the icudata.jar and testdata.jar,
3769 instead rebuild them from merged & tested ICU4C
3770
3771---------------------------------------------------------------------------- ***
3772
3773ICU 4.8 (no Unicode update, just new script codes)
3774
3775* 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
3776 (added 2010-12-21)
3777 Afak 439 Afaka
3778 Jurc 510 Jurchen
3779 Mroo 199 Mro, Mru
3780 Nshu 499 Nüshu
3781 Shrd 319 Sharada, Śāradā
3782 Sora 398 Sora Sompeng
3783 Takr 321 Takri, Ṭākrī, Ṭāṅkrī
3784 Tang 520 Tangut
3785 Wole 480 Woleai
3786 -> uscript.h
3787 -> com.ibm.icu.lang.UScript
3788 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
3789 replace public static final int \1 = \2;\3
3790 -> genpname/SyntheticPropertyValueAliases.txt
3791 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
3792 and in com.ibm.icu.dev.test.lang.TestUScript.java
3793
3794* run genpname/preparse.pl (on Linux)
3795 + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
3796 + make sure that data.h is writable
3797 + perl preparse.pl ~/svn.icu/trunk/src > out.txt
3798 + preparse.pl shows no errors, out.txt Info and Warning lines look ok
3799
3800* rebuild Unicode tools (at least genpname) using make
3801- You might first need to "make install" ICU so that the tools build can pick
3802 up the new definitions from the installed header files.
3803
3804* run genpname
3805 (builds both pnames.icu and propname_data.h)
3806- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
3807- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
3808- rebuild ICU & tools
3809
3810* run genprops
3811- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
3812- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
3813- rebuild ICU & tools
3814
3815* update Java data files
3816- refresh just the UCD-related files, just to be safe
3817- see (ICU4C)/source/data/icu4j-readme.txt
3818- mkdir /tmp/icu4j
3819- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3820- copy the big-endian Unicode data files to another location,
3821 separate from the other data files
3822 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
3823 ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
3824 ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
3825- refresh ICU4J
3826 ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b
3827
3828* should have updated the layout engine script codes but forgot
3829
3830---------------------------------------------------------------------------- ***
3831
3832Unicode 6.0 update
3833
3834*** related ICU Trac tickets
3835
3836 7264 Unicode 6.0 Update
3837
3838*** Unicode version numbers
3839- makedata.mak
3840- uchar.h
3841 (configure.in & configure: have been modified to extract the version from uchar.h)
3842- com.ibm.icu.util.VersionInfo
3843
3844*** data files & enums & parser code
3845
3846* file preparation
3847
3848~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
3849- This now prepares both unidata and testdata files in respective output subfolders.
3850
3851* PropertyAliases.txt changes
3852- new Script_Extensions property defined in the new ScriptExtensions.txt file
3853 but not listed in PropertyAliases.txt; reported to unicode.org;
3854 -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
3855 scx; Script_Extensions
3856 -> uchar.h with new UProperty section
3857 -> com.ibm.icu.lang.UProperty, parallel with uchar.h
3858
3859* PropertyValueAliases.txt changes
3860- 12 new block names:
3861 Alchemical_Symbols
3862 Bamum_Supplement
3863 Batak
3864 Brahmi
3865 CJK_Unified_Ideographs_Extension_D
3866 Emoticons
3867 Ethiopic_Extended_A
3868 Kana_Supplement
3869 Mandaic
3870 Miscellaneous_Symbols_And_Pictographs
3871 Playing_Cards
3872 Transport_And_Map_Symbols
3873 -> add to uchar.h
3874 -> add to UCharacter.UnicodeBlock
3875 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
3876 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
3877- Joining_Group (jg) values:
3878 Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias
3879 -> uchar.h & UCharacter.JoiningGroup
3880- 3 new scripts:
3881 sc ; Batk ; Batak
3882 sc ; Brah ; Brahmi
3883 sc ; Mand ; Mandaic
3884 -> remove these from SyntheticPropertyValueAliases.txt
3885 -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
3886 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
3887 and in com.ibm.icu.dev.test.lang.TestUScript.java
3888- 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
3889 (added 2009-11-11..2010-07-18)
3890 Bass 259 Bassa Vah
3891 Dupl 755 Duployan shorthand
3892 Elba 226 Elbasan
3893 Gran 343 Grantha
3894 Kpel 436 Kpelle
3895 Loma 437 Loma
3896 Mend 438 Mende
3897 Merc 101 Meroitic Cursive
3898 Narb 106 Old North Arabian
3899 Nbat 159 Nabataean
3900 Palm 126 Palmyrene
3901 Sind 318 Sindhi
3902 Wara 262 Warang Citi
3903 -> uscript.h
3904 -> com.ibm.icu.lang.UScript
3905 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
3906 replace public static final int \1 = \2;\3
3907 -> SyntheticPropertyValueAliases.txt
3908 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
3909 and in com.ibm.icu.dev.test.lang.TestUScript.java
3910- ISO 15924 name change
3911 Mero 100 Meroitic Hieroglyphs (was Meroitic)
3912 -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
3913- property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt
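The USCRIPT_ find/replace above can also be scripted outside the IDE. A minimal Python sketch, assuming the uscript.h enum lines have the shape "USCRIPT_NAME = number,comment" that the pattern expects; uscript_to_java and the sample line are illustrative names and data, not part of the ICU tools.

```python
import re

# Same pattern as the find/replace above, in Python regex syntax.
USCRIPT_LINE = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")


def uscript_to_java(c_line):
    """Turn one uscript.h enum line into a UScript.java constant line.

    Lines that do not match the pattern are returned unchanged.
    """
    return USCRIPT_LINE.sub(r"public static final int \1 = \2;\3", c_line)


# Illustrative input line (value and comment are made up for the example).
sample = "      USCRIPT_EXAMPLE_SCRIPT = 999,/* Exmp */"
print(uscript_to_java(sample))
```

Leading whitespace is preserved because only the matched part of the line is rewritten.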
3914
3915* UnicodeData.txt changes
3916- new CJK block:
3917 2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
3918 2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
3919 -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion
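To spot new or changed algorithmic ranges before touching gennames.c, the First/Last pairs can be pulled out of UnicodeData.txt. A minimal Python sketch; algorithmic_ranges is a hypothetical helper name, and it assumes each "First" line is followed by its matching "Last" line as in the two lines quoted above.

```python
def algorithmic_ranges(lines):
    """Yield (start, end, name) for each <name, First>/<name, Last> pair."""
    start = None
    for line in lines:
        fields = line.split(";")
        if len(fields) < 2:
            continue  # blank or malformed line
        code, name = int(fields[0], 16), fields[1]
        if name.endswith(", First>"):
            start = code
        elif name.endswith(", Last>") and start is not None:
            # Strip the leading "<" and the trailing ", Last>".
            yield start, code, name[1:-len(", Last>")]
            start = None


sample = [
    "2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;",
    "2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;",
]
for start, end, name in algorithmic_ranges(sample):
    print(f"{start:04X}..{end:04X} {name}")
```

Running this on the old and new UnicodeData.txt and comparing the two outputs shows which ranges need attention.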
3920
3921* build Unicode tools using CMake+make
3922
3923* run genpname/preparse.pl (on Linux)
3924 + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
3925 + make sure that data.h is writable
3926 + perl preparse.pl ~/svn.icu/trunk/src > out.txt
3927 + preparse.pl shows no errors, out.txt Info and Warning lines look ok
3928
3929* rebuild Unicode tools (at least genpname) using make
3930- You might first need to "make install" ICU so that the tools build can pick
3931 up the new definitions from the installed header files.
3932
3933* run genpname
3934- ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
3935- rebuild ICU & tools
3936
3937* update source/data/unidata/norm2/nfkc_cf.txt
3938- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
3939
3940* update source/data/unidata/norm2/uts46.txt
3941- download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
3942 to ~/svn.icu/tools/trunk/src/unicode/py
3943- adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
3944- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
3945- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
3946
3947* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
3948 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
3949- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
3950- Unicode 6.0: U+2260, U+226E, U+226F
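The grep step above can be made more precise by parsing the code point field. A minimal Python sketch, assuming the usual IdnaMappingTable.txt layout of "code point(s) ; status ; ..." with "#" comments; non_ascii_std3_valid is a hypothetical helper name.

```python
def non_ascii_std3_valid(lines):
    """Return code points above U+007F whose status is disallowed_STD3_valid."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0]  # drop the trailing comment
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        found.extend(c for c in range(start, end + 1) if c > 0x7F)
    return found


sample = [
    "0000..002C    ; disallowed_STD3_valid                  # <control>..COMMA",
    "2260          ; disallowed_STD3_valid                  # NOT EQUAL TO",
]
print([f"U+{c:04X}" for c in non_ascii_std3_valid(sample)])
```

Any code point this reports that is not yet covered in uts46test.cpp and UTS46Test.java is a candidate for a new test case.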
3951
3952* generate core properties data files
3953- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
3954- rebuild ICU & tools
3955- run makeuca.sh so that genuca picks up the new nfc.nrm:
3956 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
3957- rebuild ICU & tools
3958
3959* implement new Script_Extensions property (provisional)
3960- parser & generator: genprops & uprops.icu
3961- uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
3962- UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java
3963
3964* switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
3965- (one-time change)
3966- genbidi/gencase/genprops tools changes
3967- re-run makeprops.sh (see above)
3968- UCharacterProperty.java, UCharacterTypeIterator.java,
3969 UBiDiProps.java, UCaseProps.java, and several others with minor changes;
3970 UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java
3971
3972* update Java data files
3973- refresh just the UCD-related files, just to be safe
3974- see (ICU4C)/source/data/icu4j-readme.txt
3975- mkdir /tmp/icu4j
3976- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3977 output:
3978 ...
3979 Unicode .icu files built to ./out/build/icudt45l
3980 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
3981 echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
3982 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
3983 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
3984 mkdir -p /tmp/icu4j/main/shared/data
3985 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
3986- copy the big-endian Unicode data files to another location,
3987 separate from the other data files
3988 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
3989 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
3990 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
3991 ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
3992 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
3993 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
3994 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
3995- refresh ICU4J
3996 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
3997
3998* refresh Java test .txt files
3999- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
4000
4001* un-hardcode normalization skippable (NF*_Inert) test data
4002- removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools
4003
4004* copy updated break iterator test files
4005- now handled by early ucdcopy.py and
4006 copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
4007 (old instructions:
4008 copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
4009 to ~/svn.icu/trunk/src/source/test/testdata)
4010- they are not used in ICU4J
4011
4012* UCA
4013
4014- get output from Mark's tools; look in
4015 http://www.unicode.org/~book/incoming/mark/uca6.0.0/
4016 http://www.macchiato.com/unicode/utc/additional-uca-files
4017 http://www.unicode.org/Public/UCA/6.0.0/
4018 http://www.unicode.org/~mdavis/uca/
4019- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
4020- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
4021- update Han-implicit ranges for new CJK extensions:
4022 swapCJK() in ucol.cpp & ImplicitCEGenerator.java
4023- genuca: allow bytes 02 for U+FFFE, new merge-sort character;
4024 do not add it into invuca so that tailoring primary-after an ignorable works
4025- genuca: permit space between [variable top] bytes
4026- ucol.cpp: treat noncharacters like unassigned rather than ignorable
4027- run makeuca.sh:
4028 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
4029- rebuild ICU4C
4030- refresh ICU4J collation data:
4031 (subset of instructions above for properties data refresh, except copies all coll/*)
4032 ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
4033 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
4034 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
4035 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
4036- update (ICU)/source/test/testdata/CollationTest_*.txt
4037 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
4038 with output from Mark's Unicode tools
4039- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
4040- note on intltest: if collate/UCAConformanceTest fails, then
4041 utility/MultithreadTest/TestCollators will fail as well;
4042 fix the conformance test before looking into the multi-thread test
4043
4044* When refreshing all of ICU4J data from ICU4C
4045- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
4046- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
4047or
4048- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
4049
4050*** LayoutEngine script information
4051
4052(For details see the Unicode 5.2 change log below.)
4053
4054* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
4055ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
4056ScriptRunData.cpp, which is no longer needed.)
4057
4058The generated files have a current copyright date and "@draft" statement.
4059
4060* copy the above files into <icu>/source/layout, replacing the old files.
4061* fix mixed line endings
4062* review the diffs and fix incorrect @draft and missing aliases;
4063 Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
4064* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
4065
4066---------------------------------------------------------------------------- ***
4067
4068Unicode 5.2 update
4069
4070*** related ICU Trac tickets
4071
4072 7084 Unicode 5.2
4073
4074 7167 verify collation bytes
4075 7235 Java test NAME_ALIAS
4076 7236 Java DerivedCoreProperties.txt test
4077 7237 Java BidiTest.txt
4078 7238 UTrie2 in core unidata
4079 7239 test for tailoring gaps
4080 7240 Java fix CollationMiscTest
4081 7243 update layout engine for Unicode 5.2
4082
4083*** Unicode version numbers
4084- makedata.mak
4085- uchar.h
4086- configure.in & configure
4087- update ucdVersion in gennames.c if an algorithmic range changes
4088
4089*** data files & enums & parser code
4090
4091* file preparation
4092
4093python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
4094- includes finding files regardless of version numbers,
4095 copying them, and performing the equivalent processing of the
4096 ucdstrip and ucdmerge tools on the desired set of files
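For orientation, a rough Python sketch of the kind of processing meant here. It is not the ucdcopy.py code. It assumes that "ucdstrip" processing means dropping "#" comments and blank lines, and that "ucdmerge" processing means coalescing adjacent code point ranges that carry the same value; the real tools may differ in details such as keeping header comments.

```python
def strip_comments(lines):
    """Drop '#' comments and blank lines (assumed ucdstrip-like behavior)."""
    for line in lines:
        data = line.split("#", 1)[0].rstrip()
        if data:
            yield data


def merge_ranges(lines):
    """Coalesce adjacent ranges with equal values (assumed ucdmerge-like behavior)."""
    merged = []  # list of [start, end, value]
    for line in lines:
        rng, value = (f.strip() for f in line.split(";", 1))
        first, _, last = rng.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        if merged and merged[-1][2] == value and merged[-1][1] + 1 == start:
            merged[-1][1] = end  # extend the previous range
        else:
            merged.append([start, end, value])
    for start, end, value in merged:
        rng = f"{start:04X}" if start == end else f"{start:04X}..{end:04X}"
        yield f"{rng};{value}"


sample = [
    "# illustrative input, not real file contents",
    "0020;Na # SPACE",
    "0021..0023;Na # EXCLAMATION MARK..NUMBER SIGN",
    "00A1;A # INVERTED EXCLAMATION MARK",
]
print("\n".join(merge_ranges(strip_comments(sample))))
```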
4097
4098* notes on changes
4099- PropertyAliases.txt
4100 moved from numeric to enumerated:
4101 ccc ; Canonical_Combining_Class
4102 new string properties:
4103 NFKC_CF ; NFKC_Casefold
4104 Name_Alias; Name_Alias
4105 new binary properties:
4106 Cased ; Cased
4107 CI ; Case_Ignorable
4108 CWCF ; Changes_When_Casefolded
4109 CWCM ; Changes_When_Casemapped
4110 CWKCF ; Changes_When_NFKC_Casefolded
4111 CWL ; Changes_When_Lowercased
4112 CWT ; Changes_When_Titlecased
4113 CWU ; Changes_When_Uppercased
4114 new CJK Unihan properties (not supported by ICU)
4115- PropertyValueAliases.txt
4116 new block names
4117 new scripts
4118 one script code change:
4119 sc ; Qaai ; Inherited
4120 ->
4121 sc ; Zinh ; Inherited ; Qaai
4122 new Line_Break (lb) value:
4123 lb ; CP ; Close_Parenthesis
4124 new Joining_Group (jg) values: Farsi_Yeh, Nya
4125 other new values:
4126 ccc; 214; ATA ; Attached_Above
4127- DerivedBidiClass.txt
4128 new default-R range: U+1E800 - U+1EFFF
4129- UnicodeData.txt
4130 all of the ISO comments are gone
4131 new CJK block end:
4132 9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
4133 new CJK block:
4134 2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
4135 2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;
4136
4137* genpname
4138- run preparse.pl
4139 + cd \svn\icuproj\icu\trunk\source\tools\genpname
4140 + make sure that data.h is writable
4141 + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
4142 + preparse.pl complains with errors like the following:
4143 Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
4144 This is because ICU 4.0 had scripts from ISO 15924 which are now
4145 added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
4146 and PropertyValueAliases.txt.
4147 -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
4148 Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
4149 + preparse.pl complains with errors about block names missing from uchar.h; add them
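The duplicate script entries can be listed before running preparse.pl. A minimal Python sketch, assuming both files use "sc ; Abcd ; Long_Name" lines (the SyntheticPropertyValueAliases.txt layout is an assumption here); script_codes and duplicated_in_synthetic are hypothetical helper names.

```python
def script_codes(lines):
    """Collect short script codes from 'sc ; Abcd ; Long_Name' lines."""
    codes = set()
    for line in lines:
        fields = [f.strip() for f in line.split("#", 1)[0].split(";")]
        if len(fields) >= 3 and fields[0] == "sc":
            codes.add(fields[1])
    return codes


def duplicated_in_synthetic(official_lines, synthetic_lines):
    """Script codes defined in both files; these should leave the synthetic file."""
    return sorted(script_codes(official_lines) & script_codes(synthetic_lines))


official = ["sc ; Egyp ; Egyptian_Hieroglyphs", "sc ; Java ; Javanese"]
synthetic = ["sc ; Egyp ; Egyp", "sc ; Bass ; Bass"]
print(duplicated_in_synthetic(official, synthetic))
```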
4150
4151* uchar.h & uscript.h & uprops.h & uprops.c & genprops
4152- new block & script values
4153 + 26 new blocks
4154 copy new blocks from Blocks.txt
4155 MS VC++ 2008 regular expression:
4156 find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
4157 replace with " UBLOCK_\3 = 172, /*[\1]*/"
4158 + several new script values already added in ICU 4.0 for ISO 15924 coverage
4159 (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
4160 + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
4161 + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
4162 (added to SyntheticPropertyValueAliases.txt)
4163- new Joining Group (JG) values: Farsi_Yeh, Nya
4164- new Line_Break (lb) value:
4165 lb ; CP ; Close_Parenthesis
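The Blocks.txt find/replace above still leaves the block names to be upper-cased and the IDs to be numbered by hand. A minimal Python sketch that does all three steps; ublock_entries is a hypothetical helper, and first_id must be supplied by the caller from the current end of the UBlockCode enum.

```python
import re

# Same shape as the Blocks.txt pattern above, in Python regex syntax.
BLOCK_LINE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+); ([A-Z].+)$")


def ublock_entries(block_lines, first_id):
    """Yield uchar.h-style enum lines, numbering blocks from first_id."""
    next_id = first_id
    for line in block_lines:
        m = BLOCK_LINE.match(line)
        if not m:
            continue  # comments and blank lines
        # "Hangul Jamo Extended-A" -> "HANGUL_JAMO_EXTENDED_A"
        name = re.sub(r"[^A-Za-z0-9]+", "_", m.group(3)).upper()
        yield f"    UBLOCK_{name} = {next_id}, /*[{m.group(1)}]*/"
        next_id += 1


new_blocks = ["A960..A97F; Hangul Jamo Extended-A", "D7B0..D7FF; Hangul Jamo Extended-B"]
# 1000 is a placeholder starting ID for the example.
print("\n".join(ublock_entries(new_blocks, 1000)))
```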
4166
4167* hardcoded Unihan range end/limit
4168- Unihan range end moves from 9FC3 to 9FCB
4169 search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
4170 + do change gennames.c
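The search for the old end and limit values can be scripted so that no source file is missed. A minimal Python sketch; find_hardcoded is a hypothetical helper that takes the text of one file and the old range end as a hex string.

```python
import re


def find_hardcoded(text, end_hex):
    """Return (line_number, line) pairs mentioning the range end or limit."""
    end = int(end_hex, 16)
    # Match both the end (e.g. 9FC3) and the limit (end + 1), ignoring case.
    pattern = re.compile(rf"{end:X}|{end + 1:X}", re.IGNORECASE)
    return [(n, line) for n, line in enumerate(text.splitlines(), 1)
            if pattern.search(line)]


source = "if(c<=0x9fc3) {\n    isHan=1;\n} else if(c<0x9FC4) {\n"
for number, line in find_hardcoded(source, "9FC3"):
    print(number, line)
```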
4171
4172* Compare definitions of new binary properties with what we used to use
4173 in algorithms, to see if the definitions changed.
4174- Verified that definitions for Cased and Case_Ignorable are unchanged.
4175 The gencase tool now parses the newly public Case_Ignorable values
4176 in case the definition changes in the future.
4177
4178* uchar.c & uprops.h & uprops.c & genprops
4179- new numeric values that didn't exist in Unicode data before:
4180 1/7, 1/9, 1/10, 3/10, 1/16, 3/16
4181 the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
4182 therefore redesign the encoding of numeric types and values for formatVersion 6;
4183 design for simple numbers up to at least 144 ("one gross"),
4184 large values up to at least 10^20,
4185 and fractions with numerators -1..17 and denominators 1..16
4186 to cover current and expected future values
4187 (e.g., more Han numeric values, Meroitic twelfths)
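To illustrate why the stated fraction range fits into a small bit field: numerators -1..17 are 19 values and denominators 1..16 are 16 values, so a biased numerator plus a 4-bit denominator is enough. The packing below is a hypothetical layout for illustration only, not necessarily the one that uprops.icu formatVersion 6 uses.

```python
from fractions import Fraction

NUM_MIN, NUM_MAX = -1, 17
DEN_MIN, DEN_MAX = 1, 16


def pack_fraction(numerator, denominator):
    """Pack a fraction into an integer: biased numerator above a 4-bit denominator."""
    if not (NUM_MIN <= numerator <= NUM_MAX and DEN_MIN <= denominator <= DEN_MAX):
        raise ValueError("fraction outside the representable range")
    return ((numerator - NUM_MIN) << 4) | (denominator - DEN_MIN)


def unpack_fraction(code):
    """Inverse of pack_fraction()."""
    return Fraction((code >> 4) + NUM_MIN, (code & 0xF) + DEN_MIN)


# The new Unicode 5.2 values listed above all round-trip.
for n, d in [(1, 7), (1, 9), (1, 10), (3, 10), (1, 16), (3, 16)]:
    assert unpack_fraction(pack_fraction(n, d)) == Fraction(n, d)
```

With this layout the largest code is for 17/16, which still fits into 9 bits.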
4188
4189* reimplement Hangul_Syllable_Type for new Jamo characters
4190- the old code assumed that all Jamo characters are in the 11xx block
4191- Unicode 5.2 fills holes there and adds new Jamo characters in
4192 A960..A97F; Hangul Jamo Extended-A
4193 and in
4194 D7B0..D7FF; Hangul Jamo Extended-B
4195- Hangul_Syllable_Type can be trivially derived from a subset of
4196 Grapheme_Cluster_Break values
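The derivation is a direct mapping: the five Hangul-related Grapheme_Cluster_Break values carry over unchanged, and every other value maps to NA (Not_Applicable). A minimal Python sketch using the short property value aliases; hangul_syllable_type is a hypothetical function name, not the ICU API.

```python
# GCB short alias -> Hangul_Syllable_Type short alias
HST_FROM_GCB = {"L": "L", "V": "V", "T": "T", "LV": "LV", "LVT": "LVT"}


def hangul_syllable_type(gcb_value):
    """Derive Hangul_Syllable_Type from a Grapheme_Cluster_Break value."""
    return HST_FROM_GCB.get(gcb_value, "NA")


print(hangul_syllable_type("LVT"))  # a precomposed syllable with trailing consonant
print(hangul_syllable_type("EX"))   # any non-Hangul value
```

This way new Jamo in the extended blocks need no special handling, as long as their Grapheme_Cluster_Break values are stored.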
4197
4198* build Unicode data source code for hardcoding core data
4199C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data
4200
4201ICU data make path is \svn\icuproj\icu\trunk\source\data\
4202ICU root path is \svn\icuproj\icu\trunk
4203Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
4204Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
4205Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
4206Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
4207Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
4208Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
4209Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
4210Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
4211Creating data file for Unicode Property Names
4212Creating data file for Unicode Character Properties
4213Creating data file for Unicode Case Mapping Properties
4214Creating data file for Unicode BiDi/Shaping Properties
4215Creating data file for Unicode Normalization
4216Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
4217Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"
4218
4219- copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
4220 and rebuild the common library
4221
4222*** UCA
4223
4224- update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
4225- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
4226- update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
4227[ Begin obsolete instructions:
4228 Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files.
4229 - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
4230 on Windows:
4231 python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
4232 python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
4233 End obsolete instructions]
4234- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
4235 not just the *_STUB.txt files
4236- note on intltest: if collate/UCAConformanceTest fails, then
4237 utility/MultithreadTest/TestCollators will fail as well;
4238 fix the conformance test before looking into the multi-thread test
4239
4240*** Implement Cased & Case_Ignorable properties
4241- via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
4242- Problem: These properties should be disjoint, but aren't
4243- UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
4244- change ucase.icu to be able to store any combination of Cased and Case_Ignorable
4245
4246*** Implement Changes_When_Xyz properties
4247- without stored data
4248
4249*** Implement Name_Alias property
4250- add it as another name field in unames.icu
4251- make it available via u_charName() and UCharNameChoice
4252- consider it in u_charFromName()
4253
4254*** Break iterators
4255
4256* Update break iterator rules to new UAX versions and new property values
4257* Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary
4258
4259*** new BidiTest file
4260- review format and data
4261- copy BidiTest.txt to source/test/testdata
4262- write test code using this data
4263- fix ICU code where it fails the conformance test
4264
4265*** Java
4266- generally, find and update code corresponding to C/C++
4267- UCharacter.UnicodeBlock constants:
4268 a) add an _ID integer per new block, update COUNT
4269 b) add a class instance per new block
4270 Visual Studio regex:
4271 find UBLOCK_{[^ ]+} = [0-9]+, {/.+}
4272 replace with public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
4273- CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()
4274
4275- port test changes to Java
4276
4277*** LayoutEngine script information
4278
4279(For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)
4280
4281* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
4282ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
4283ScriptRunData.cpp, which is no longer needed.)
4284
4285The generated files have a current copyright date and "@draft" statement.
4286
4287-> Eric Mader wrote in email on 20090930:
4288 "I think the tool has been modified to update @draft to @stable for
4289 older scripts and to add @draft for new scripts.
4290 (I worked with an intern on this last year.)
4291 You should check the output after you run it."
4292
4293* copy the above files into <icu>/source/layout, replacing the old files.
4294* fix mixed line endings
4295* review the diffs and fix incorrect @draft and missing aliases
4296* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
4297
4298Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
4299and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
4300
4301-> Eric Mader wrote in email on 20090930:
4302 "This is just a matter of making sure that all the per-script tables have
4303 entries for any new scripts that were added.
4304 If any new Indic characters were added, then the class tables in
4305 IndicClassTables.cpp should be updated to reflect this.
4306 John Emmons should know how to do this if it's required."
4307
4308* rebuild the layout and layoutex libraries.
4309
4310*** Documentation
4311- Update User Guide
4312 + Jamo_Short_Name, sfc->scf, binary property value aliases
4313
4314---------------------------------------------------------------------------- ***
4315
4316Unicode 5.1 update
4317
4318*** related ICU Trac tickets
4319
4320 5696 Update to Unicode 5.1
4321
4322*** Unicode version numbers
4323- makedata.mak
4324- uchar.h
4325- configure.in & configure
4326- update ucdVersion in gennames.c if an algorithmic range changes
4327
4328*** data files & enums & parser code
4329
4330* file preparation
4331- ucdstrip:
4332 DerivedCoreProperties.txt
4333 DerivedNormalizationProps.txt
4334 NormalizationTest.txt
4335 PropList.txt
4336 Scripts.txt
4337 GraphemeBreakProperty.txt
4338 SentenceBreakProperty.txt
4339 WordBreakProperty.txt
4340- ucdstrip and ucdmerge:
4341 EastAsianWidth.txt
4342 LineBreak.txt
4343
4344* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
4345copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
4346copy 5.1.0\ucd\Blocks.txt ..\unidata\
4347copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
4348copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
4349copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
4350copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
4351copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
4352copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
4353copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
4354copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
4355copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
4356copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
4357copy 5.1.0\ucd\UnicodeData.txt ..\unidata\
4358
4359ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
4360ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
4361ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
4362ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
4363ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
4364ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
4365ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
4366ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
4367ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
4368ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
4369
4370* genpname
4371- run preparse.pl
4372 + cd \svn\icuproj\icu\uni51\source\tools\genpname
4373 + make sure that data.h is writable
4374 + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
4375 + preparse.pl complains with errors like the following:
4376 Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
4377 This is because ICU 3.8 had scripts from ISO 15924 which are now
4378 added to Unicode 5.1, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
4379 and PropertyValueAliases.txt.
4380 -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
4381 Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
4382 + PropertyValueAliases.txt now explicitly contains values for boolean properties:
4383 N/Y, No/Yes, F/T, False/True
4384 -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
4385 It will use further values from the file if present.
4386
4387* uchar.h & uscript.h & uprops.h & uprops.c & genprops
4388- new block & script values
4389 + 17 new blocks
4390 + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
4391 (removed from SyntheticPropertyValueAliases.txt)
4392 + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
4393 (added to SyntheticPropertyValueAliases.txt)
4394- uprops.icu (uprops.h) only provides 7 bits for script codes.
4395 ICU 4.0 now has USCRIPT_CODE_LIMIT=130 script codes.
4396 No assigned Unicode character has a script code above 127 yet,
4397 so ICU 4.0 uprops.icu does not store any
4398 script code values greater than 127.
4399 However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
4400 in a parallel bit field, and that overflows now.
4401 Also, future values >=128 would be incompatible anyway.
4402 uprops.h is modified to move around several of the bit fields
4403 in the properties vector words, and now uses 8 bits for the script code.
4404 Two other bit fields also grow to accommodate future growth:
4405 Block (current count: 172) grows from 8 to 9 bits,
4406 and Word_Break grows from 4 to 5 bits.
4407- renamed property Simple_Case_Folding (sfc->scf)
4408 + nothing to be done: handled as normal alias
4409- new property JSN Jamo_Short_Name
4410 + no new API: only contributes to the Name property
4411- new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
4412- new Joining Group (JG) value: Burushashki_Yeh_Barree
4413- new Sentence_Break (SB) values:
4414 SB ; CR ; CR
4415 SB ; EX ; Extend
4416 SB ; LF ; LF
4417 SB ; SC ; SContinue
4418- new Word_Break (WB) values:
4419 WB ; CR ; CR
4420 WB ; Extend ; Extend
4421 WB ; LF ; LF
4422 WB ; MB ; MidNumLet
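The bit field overflow described in the uprops.h notes above comes down to simple arithmetic. A minimal Python sketch with hypothetical helper names, using only the numbers quoted in those notes.

```python
def bits_needed(max_value):
    """Smallest bit field width that can hold 0..max_value."""
    return max(1, max_value.bit_length())


def fits(max_value, width):
    """True if 0..max_value fits into a bit field of the given width."""
    return max_value < (1 << width)


USCRIPT_CODE_LIMIT = 130  # from the notes above
BLOCK_COUNT = 172         # from the notes above

print(fits(USCRIPT_CODE_LIMIT - 1, 7))      # False: 129 overflows 7 bits
print(bits_needed(USCRIPT_CODE_LIMIT - 1))  # 8
print(bits_needed(BLOCK_COUNT - 1))         # 8, so the 9-bit field is headroom
```

The script code field had to grow, while the Block field grew only to leave room for future values.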
4423
4424* Further changes in the 2008-02-29 update:
4425- Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
4426 because they should not normally be invisible.
4427- new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
4428- new Grapheme_Cluster_Break (GCB) value: PP=Prepend
4429- new Word_Break (WB) value: NL=Newline
4430
4431* hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
4432- Unihan range end moves from 9FBB to 9FC3
4433 search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
4434 + do change gennames.c
4435
4436* build Unicode data source code for hardcoding core data
4437C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data
4438
4439ICU data make path is \svn\icuproj\icu\uni51\source\data\
4440ICU root path is \svn\icuproj\icu\uni51
4441Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
4442Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
4443Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
4444Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
4445Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
4446Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
4447Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
4448Creating data file for Unicode Character Properties
4449Creating data file for Unicode Case Mapping Properties
4450Creating data file for Unicode BiDi/Shaping Properties
4451Creating data file for Unicode Normalization
4452Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
4453Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"
4454
4455- copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
4456 and rebuild the common library
4457
4458*** Break iterators
4459
4460* Update break iterator rules to new UAX versions and new property values
4461
4462*** UCA
4463
4464* update FractionalUCA.txt and UCARules.txt with new canonical closure
4465
4466*** Test suites
4467- Test that APIs using Unicode property value aliases (like UnicodeSet)
4468 support all of the boolean values N/Y, No/Yes, F/T, False/True
4469 -> TestBinaryValues() tests in both cintltst and intltest
4470
4471*** LayoutEngine script information
4472* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
4473ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
4474ScriptRunData.cpp, which is no longer needed.)
4475
4476The generated files have a current copyright date and "@draft" statement.
4477
4478* copy the above files into <icu>/source/layout, replacing the old files.
4479
4480Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
4481and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
4482
4483* rebuild the layout and layoutex libraries.
4484
4485*** Documentation
4486- Update User Guide
4487 + Jamo_Short_Name, sfc->scf, binary property value aliases
4488
4489---------------------------------------------------------------------------- ***
4490
73c04bcf
A
4491Unicode 5.0 update
4492
4493*** related Jitterbugs
4494
44955084 RFE: Update to Unicode 5.0
4496
4497*** data files & enums & parser code
4498
4499* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.0.0\ucd\Blocks.txt ..\unidata\
copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.0.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
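For readers without the tools at hand, here is a rough Python sketch of what the ucdstrip | ucdmerge pipeline is used for here. It assumes that ucdstrip removes '#' comments and that ucdmerge merges adjacent code point ranges with identical values. The function names are hypothetical, and the real tools may differ in details such as which comment-only lines they keep.

```python
def strip_comments(lines):
    """Drop '#' comments and blank lines; keep semicolon-delimited data."""
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if data:
            yield data


def _parse(line):
    """Split 'XXXX..YYYY;v1;v2' into (start, end, (v1, v2))."""
    fields = [f.strip() for f in line.split(";")]
    start, _, end = fields[0].partition("..")
    return int(start, 16), int(end or start, 16), tuple(fields[1:])


def merge_ranges(lines):
    """Merge adjacent code point ranges that carry identical values."""
    merged = []
    for start, end, values in map(_parse, lines):
        if merged and merged[-1][2] == values and merged[-1][1] + 1 == start:
            merged[-1] = (merged[-1][0], end, values)
        else:
            merged.append((start, end, values))
    for start, end, values in merged:
        cp = f"{start:04X}" if start == end else f"{start:04X}..{end:04X}"
        yield ";".join((cp,) + values)
```

Chaining the two, as in merge_ranges(strip_comments(lines)), mirrors the piped commands for EastAsianWidth.txt and LineBreak.txt.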

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- run preparse.pl
  + make sure that data.h is writable
  + perl preparse.pl \cvs\oss\icu > out.txt

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + script values already added in ICU 3.6 because all of ISO 15924 is now covered

* build Unicode data source code for hardcoding core data
C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data

ICU data make path is \cvs\oss\icu\source\data\
ICU root path is \cvs\oss\icu
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
[etc.]
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"

- copy the .c source files to C:\cvs\oss\icu\source\common
  and rebuild the common library

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguage.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.

---------------------------------------------------------------------------- ***

Unicode 4.1 update

*** related Jitterbugs

4332 RFE: Update to Unicode 4.1
4157 RBBI, TR29 4.1 updates

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* add new files to the repository
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- handle new enumerated properties in sub read_uchar
- run preparse.pl

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new binary properties
  + Pattern_Syntax
  + Pattern_White_Space
- new enumerated properties
  + Grapheme_Cluster_Break
  + Sentence_Break
  + Word_Break
- new block & script & line break values

* gencase
- case-ignorable changes
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
  now: (D47a) Word_Break=MidLetter or General_Category is one of Mn, Me, Cf, Lm, Sk
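The D47a condition can be written as a small predicate. This is a Python sketch, not the gencase code: is_case_ignorable() is a hypothetical name, the caller must supply the Word_Break=MidLetter set (for example parsed from WordBreakProperty.txt), and unicodedata reports categories for the Unicode version bundled with Python, not for Unicode 4.1.

```python
import unicodedata

# General categories named in the D47a definition quoted above.
CASE_IGNORABLE_CATEGORIES = frozenset({"Mn", "Me", "Cf", "Lm", "Sk"})


def is_case_ignorable(cp, midletter):
    """Return True if code point cp is case-ignorable per D47a.

    midletter: a set of code points with Word_Break=MidLetter, supplied
    by the caller because Python's unicodedata does not expose Word_Break.
    """
    if cp in midletter:
        return True
    return unicodedata.category(chr(cp)) in CASE_IGNORABLE_CATEGORIES
```

Keeping the MidLetter set as a parameter makes the dependency on WordBreakProperty.txt explicit, which is the reason that file became an input for case mapping data in this update.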

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- verify that u_charMirror() round-trips
- test all new properties and some new values of old properties

*** other code

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FA5 to 9FBB
  search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
  + do not modify BOCU/BOCSU code because that would change the encoding
    and break binary compatibility!
  + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
    NamePrepProfile.txt
  + ignore trietest.c: test data is arbitrary
  + ignore tstnorm.cpp: test optimization, not important
  + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
  + do change line_th.txt and word_th.txt
    by replacing hardcoded ranges with the new property values
  + do change gennames.c

source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5,
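A search like the one that produced the hits above could be scripted roughly as follows. This is a Python sketch: find_hardcoded_ranges() and its default list of file suffixes are assumptions for illustration, and only the regex 9FA[56] with case-insensitive matching comes from the notes.

```python
import re
from pathlib import Path

# Matches the old Unihan range end (9FA5) and limit (9FA6) in any letter case.
UNIHAN_END_OR_LIMIT = re.compile(r"9FA[56]", re.IGNORECASE)


def find_hardcoded_ranges(root, suffixes=(".c", ".cpp", ".h", ".txt")):
    """Yield (path, line number, text) for each line mentioning 9FA5 or 9FA6.

    The suffix filter is an assumed convenience to skip binary files.
    """
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if UNIHAN_END_OR_LIMIT.search(line):
                yield path, number, line.strip()
```

Each hit still needs a manual decision, because several matches (BOCU, GB 18030, NamePrepProfile.txt) must stay unchanged for compatibility.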

* case mappings
- compare new special casing context conditions with previous ones
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods

* genpname
- consider storing only the short name if it is the same as the long name

*** other reviews
- UAX #29 changes (grapheme/word/sentence breaks)
- UAX #14 changes (line breaks)
- Pattern_Syntax & Pattern_White_Space

---------------------------------------------------------------------------- ***

Unicode 4.0.1 update

*** related Jitterbugs

3170 RFE: Update to Unicode 4.0.1
3171 Add new Unicode 4.0.1 properties
3520 use Unicode 4.0.1 updates for break iteration

*** data files & enums & parser code

* file preparation
- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt

* file fixes
- fix UnicodeData.txt general categories of Ethiopic digits Nd->No
  according to PRI #26
  http://www.unicode.org/review/resolved-pri.html#pri26
- undone again because no corrigendum in sight;
  instead modified tests to not check consistency on this for Unicode 4.0.1

* ucdterms.txt
- update from http://www.unicode.org/copyright.html
  formatted for plain text

* uchar.h & uprops.h & uprops.c & genprops
- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
- add U_LB_INSEPARABLE due to a spelling fix
  + put short name comment only on line with new constant
    for genpname perl script parser
- new binary properties
  + STerm
  + Variation_Selector

* genpname
- fix genpname perl script so that it doesn't choke on more than 2 names per property value
- perl script: correctly calculate the maximum number of fields per row

* uscript.h
- new script code Hrkt=Katakana_Or_Hiragana

* gennorm.c track changes in DerivedNormalizationProps.txt
- "FNC" -> "FC_NFKC"
- single field "NFD_NO" -> two fields "NFD_QC; N" etc.
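The two format changes can be illustrated with a small parser that accepts both the old and the new line shapes. This is a Python sketch, not the gennorm.c logic: parse_norm_prop() is a hypothetical name, and the mapping of old "_MAYBE" names to the value M is my reading of the "etc." above, not something the notes state.

```python
# Rename noted above: "FNC" became "FC_NFKC".
RENAMED = {"FNC": "FC_NFKC"}

# Old single-field names such as NFD_NO become <form>_QC plus a value.
# The "_MAYBE" -> "M" entry is an assumption extrapolated from "etc.".
OLD_QC_SUFFIXES = {"_NO": "N", "_MAYBE": "M"}


def parse_norm_prop(line):
    """Return (code points, property, value) or None for non-data lines."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    if len(fields) < 2:
        return None
    code_points, prop = fields[0], fields[1]
    value = fields[2] if len(fields) > 2 else None
    prop = RENAMED.get(prop, prop)
    if value is None:
        # Old format: the value is folded into the property name.
        for suffix, qc in OLD_QC_SUFFIXES.items():
            if prop.endswith(suffix):
                prop, value = prop[: -len(suffix)] + "_QC", qc
                break
    return code_points, prop, value
```

With this normalization, an old-style "NFD_NO" line and a new-style "NFD_QC; N" line yield the same tuple.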
4722
4723* genprops/props2.c track changes in DerivedNumericValues.txt
4724- changed from 3 columns to 2, dropping the numeric type
4725 + assume that the type is always numeric for Han characters,
4726 and that only those are added in addition to what UnicodeData.txt lists
4727
4728*** Unicode version numbers
4729- makedata.mak
4730- uchar.h
4731- configure.in
4732
4733*** tests
4734- update test of default bidi classes according to PRI #28
4735 /tsutil/cucdtst/TestUnicodeData
4736 http://www.unicode.org/review/resolved-pri.html#pri28
4737- bidi tests: change exemplar character for ES depending on Unicode version
4738- change hardcoded expected property values where they change
4739
4740*** other code
4741
4742* name matching
4743- read UCD.html
4744
4745* scripts
4746- use new Hrkt=Katakana_Or_Hiragana
4747
4748* ZWJ & ZWNJ
4749- are now part of combining character sequences
4750- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ