icuSources/data/unidata/changes.txt

   1 * Copyright (C) 2016 and later: Unicode, Inc. and others.
   2 * License & terms of use: http://www.unicode.org/copyright.html
   3 * Copyright (C) 2004-2016, International Business Machines
   4 * Corporation and others.  All Rights Reserved.
   5 *
   6 *   file name:  changes.txt
   7 *   encoding:   US-ASCII
   8 *   tab size:   8 (not used)
   9 *   indentation:4
  10 *
  11 *   created on: 2004may06
  12 *   created by: Markus W. Scherer
  13 *
  14 * change log for Unicode updates
  15 *
  16 * For each new Unicode version, during the beta period,
  17 * I copy the change log for the previous version to the top of this file.
  18 * I adjust the versions, tickets, URLs, and paths.
  19 * I work my way through the steps listed in the log, top to bottom,
  20 * adjusting the log as necessary.
  21 * I report problems to the UTC and/or CLDR and/or ICU.
  22 * Before the data is final, I "turn the crank" several more times,
  23 * using appropriate subsets of the steps.
  24
  25 ---------------------------------------------------------------------------- ***
  26
  27 * New ISO 15924 script codes
  28
  29 Starting with ICU 55, we no longer add UScriptCode constants for new scripts
  30 until they are encoded in Unicode,
  31 or can be assumed to be encoded in the next Unicode version.
  32 Script enum constant names should follow the Unicode script property value aliases,
  33 which are assigned only when the scripts are encoded.
  34 When we add constants early and guess the names wrong, we end up with confusing enum constants
  35 and have sometimes had to add aliases.
  36
  37 Variant script codes like Latf and Aran that are not subject to separate encoding
  38 can be added at any time.
  39 (For example, Aran could be added as USCRIPT_ARABIC_NASTALIQ.)
  40
  41 We add script codes used in CLDR or in the spoof checker.
  42 This includes combination/alias codes like Hanb and Jamo.
  43 See http://unicode.org/reports/tr35/#unicode_script_subtag_validity
  44 and look for "alias" on http://unicode.org/iso15924/iso15924-codes.html
  45
  46 We add special Z* script codes like Zsye.
  47
  48 For new script codes see http://www.unicode.org/iso15924/codechanges.html
  49
  50 ---------------------------------------------------------------------------- ***
  51
  52 Unicode 12.1 update for ICU 64.2
  53
  54 ** This is an abbreviated update with one new character for the new
  55 ** Japanese era expected to start on 2019-May-01: U+32FF SQUARE ERA NAME REIWA
  56 https://en.wikipedia.org/wiki/Reiwa_period
  57
  58 http://www.unicode.org/versions/Unicode12.1.0/
  59
  60 ICU-20497 Unicode 12.1
  61
  62 cldrbug 11978: Unicode 12.1
  63
  64 * Command-line environment setup
  65
  66 UNICODE_DATA=~/unidata/uni121/20190403
  67 CLDR_SRC=~/svn.cldr/uni
  68 ICU_ROOT=~/icu/uni
  69 ICU_SRC=$ICU_ROOT/src
  70 ICUDT=icudt64b
  71 ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
  72 ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
  73 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
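The settings above all hang off the home directory and ICU_ROOT. For scripting the later steps, the same layout could be sketched in Python. This is only an illustration: the helper name update_env and the /home/me placeholder are assumptions, not part of ICU or of this log.

```python
import posixpath

def update_env(home, unicode_data, icudt):
    """Build the variables from the setup above as a dict of path strings."""
    icu_root = posixpath.join(home, "icu/uni")
    icu_src = posixpath.join(icu_root, "src")
    icu4c_data = posixpath.join(icu_src, "icu4c/source/data")
    return {
        "UNICODE_DATA": posixpath.join(home, unicode_data),
        "CLDR_SRC": posixpath.join(home, "svn.cldr/uni"),
        "ICU_ROOT": icu_root,
        "ICU_SRC": icu_src,
        "ICUDT": icudt,
        "ICU4C_DATA_IN": posixpath.join(icu4c_data, "in"),
        "ICU4C_UNIDATA": posixpath.join(icu4c_data, "unidata"),
        "LD_LIBRARY_PATH": posixpath.join(icu_root, "dbg/icu4c/lib"),
    }

# "/home/me" is a placeholder for the ~ used in the shell setup.
env = update_env("/home/me", "unidata/uni121/20190403", "icudt64b")
for name in sorted(env):
    print(f"{name}={env[name]}")
```

Only the snapshot directory and the ICUDT name change from one Unicode update to the next, which is why they are the parameters here.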
  74
  75 *** Unicode version numbers
  76 - makedata.mak
  77 - uchar.h
  78 - com.ibm.icu.util.VersionInfo
  79 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
  80
  81 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  82     so that the makefiles see the new version number.
  83   cd $ICU_ROOT/dbg/icu4c
  84   ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh
  85
  86 *** data files & enums & parser code
  87
  88 * download files
  89 - mkdir -p $UNICODE_DATA
  90 - download Unicode files into $UNICODE_DATA
  91   + subfolders: emoji, idna, security, ucd, uca
  92   + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
  93
  94 * for manual diffs and for Unicode Tools input data updates:
  95   remove version suffixes from the file names
  96     ~$ unidata/desuffixucd.py $UNICODE_DATA
  97   (see https://sites.google.com/site/unicodetools/inputdata)
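To show what "remove version suffixes" means for a single file name, here is a minimal Python sketch. It is not the code of desuffixucd.py; the regular expression is an assumption about how versioned UCD file names look (base name, hyphen, dotted version, optional draft marker such as d2, extension).

```python
import re

# Assumed shape: <base>-<dotted version>[d<draft number>].<extension>
_VERSION_SUFFIX = re.compile(
    r"^(?P<base>.+?)-\d+(?:\.\d+)*(?:d\d+)?(?P<ext>\.[A-Za-z0-9]+)$")

def strip_version_suffix(name):
    """Return the file name without its version suffix, or unchanged if it has none."""
    match = _VERSION_SUFFIX.match(name)
    if match is None:
        return name
    return match.group("base") + match.group("ext")

for name in ["UnicodeData-12.1.0.txt", "DerivedAge-12.1.0d2.txt", "emoji-data.txt"]:
    print(name, "->", strip_version_suffix(name))
```

A hyphen that is part of the base name, as in emoji-data.txt, is left alone because no digits follow it.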
  98
  99 * process and/or copy files
 100 - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
 101   + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
 102   + For debugging, and tweaking how ppucd.txt is written,
 103     the tool has an --only_ppucd option:
 104     py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
 105
 106 - cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
 107
 108 * build ICU (make install)
 109   so that the tools build can pick up the new definitions from the installed header files.
 110
 111   $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
 112
 113 * update spoof checker UnicodeSet initializers:
 114     inclusionPat & recommendedPat in uspoof.cpp
 115     INCLUSION & RECOMMENDED in SpoofChecker.java
 116 - make sure that the Unicode Tools tree contains the latest security data files
 117 - go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
 118 - update the hardcoded version number there in the DIRECTORY path
 119 - run the tool (no special environment variables needed)
 120 - copy & paste from the Console output into the .cpp & .java files
 121
 122 * generate normalization data files
 123   cd $ICU_ROOT/dbg/icu4c
 124   bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
 125   bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm     -s $ICU4C_UNIDATA/norm2 nfc.txt
 126   bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm    -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
 127   bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
 128   bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm   -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
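The five gennorm2 invocations above differ only in the output file, the list of input files (each list starts with nfc.txt), and the --csource flag. A small Python sketch that generates the same command lines from a table may make that layering easier to see. The table and function names are made up for this illustration; the shell variables are left unexpanded on purpose.

```python
# (output file, input files layered left to right, extra flags)
NORM2_JOBS = [
    ("$ICU_SRC/icu4c/source/common/norm2_nfc_data.h", ["nfc.txt"], ["--csource"]),
    ("$ICU4C_DATA_IN/nfc.nrm", ["nfc.txt"], []),
    ("$ICU4C_DATA_IN/nfkc.nrm", ["nfc.txt", "nfkc.txt"], []),
    ("$ICU4C_DATA_IN/nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"], []),
    ("$ICU4C_DATA_IN/uts46.nrm", ["nfc.txt", "uts46.txt"], []),
]

def gennorm2_commands(jobs=NORM2_JOBS):
    """Return one shell command line per job, in table order."""
    commands = []
    for output, inputs, flags in jobs:
        argv = ["bin/gennorm2", "-o", output, "-s", "$ICU4C_UNIDATA/norm2"]
        commands.append(" ".join(argv + inputs + flags))
    return commands

for command in gennorm2_commands():
    print(command)
```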
 129
 130 * build ICU (make install)
 131   so that the tools build can pick up the new definitions from the installed header files.
 132
 133   $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
 134
 135 * build Unicode tools using CMake+make
 136
 137 $ICU_SRC/tools/unicode/c/icudefs.txt:
 138
 139 # Location (--prefix) of where ICU was installed.
 140 set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
 141 # Location of the ICU4C source tree.
 142 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)
 143
 144   $ICU_ROOT/dbg$
 145     mkdir -p tools/unicode/c
 146     cd tools/unicode/c
 147
 148   $ICU_ROOT/dbg/tools/unicode/c$
 149     cmake ../../../../src/tools/unicode/c
 150     make
 151
 152 * generate core properties data files
 153   $ICU_ROOT/dbg/tools/unicode/c$
 154     genprops/genprops $ICU_SRC/icu4c
 155     genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
 156     genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
 157 - rebuild ICU (make install) & tools
 158
 159 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
 160   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
 161 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
 162 - Unicode 6.0..12.1: U+2260, U+226E, U+226F
 163 - nothing new in this Unicode version, no test file to update
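The grep step above can also be done with a few lines of Python, which makes the "non-ASCII only" filter explicit. This is a sketch under assumptions: it expects lines of the form "code point or range ; status # comment", and the sample lines below are illustrative, not copied from the real IdnaMappingTable.txt.

```python
def non_ascii_std3_valid(lines):
    """Yield (start, end) for disallowed_STD3_valid entries that reach above U+007F."""
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        if end > 0x7F:
            yield (start, end)

SAMPLE = [
    "# illustrative lines only",
    "003D          ; disallowed_STD3_valid  # EQUALS SIGN",
    "00E0          ; valid                  # LATIN SMALL LETTER A WITH GRAVE",
    "2260          ; disallowed_STD3_valid  # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid  # NOT LESS-THAN..NOT GREATER-THAN",
]

for start, end in non_ascii_std3_valid(SAMPLE):
    print(f"U+{start:04X}..U+{end:04X}")
```

With the sample lines, only the three characters already listed above (U+2260, U+226E, U+226F) pass the filter; anything else in the real output would be a candidate for the test files.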
 164
 165 * run & fix ICU4C tests
 166 - Andy handles RBBI & spoof check test failures
 167
 168 * collation: CLDR collation root, UCA DUCET
 169
 170 - UCA DUCET goes into Mark's Unicode tools, see
 171     https://sites.google.com/site/unicodetools/home#TOC-UCA
 172   diff the main mapping file, look for bad changes
 173   (for example, more bytes per weight for common characters)
 174     ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.1.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.1.txt
 175     ~/svn.unitools/trunk$ meld ../frac-12.txt ../frac-12.1.txt
 176
 177 - CLDR root data files are checked into $CLDR_SRC/common/uca/
 178     cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
 179
 180 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
 181     cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
 182 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
 183     cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
 184     (note removing the underscore before "Rules")
 185     cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
 186 - restore TODO diffs in UCARules.txt
 187     meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
 188 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
 189   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
 190   from the CLDR root files (..._CLDR_..._SHORT.txt)
 191     cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
 192     cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
 193     cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
 194 - if CLDR common/uca/unihan-index.txt changes, then update
 195   CLDR common/collation/root.xml <collation type="private-unihan">
 196   and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
 197
 198 - run genuca, see command line above
 199 - rebuild ICU4C
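The cp commands in this section rename the CLDR root files on the way into ICU: _SHORT is dropped from FractionalUCA, UCA_Rules_SHORT becomes UCARules, and _CLDR is dropped from the test file names. A hedged Python sketch of just that name mapping follows; the function name icu_target is hypothetical and the paths are relative to $ICU_SRC.

```python
def icu_target(cldr_name):
    """Map a file name from $CLDR_SRC/common/uca/ to its ICU path, or None if not copied."""
    if cldr_name == "FractionalUCA_SHORT.txt":
        return "icu4c/source/data/unidata/FractionalUCA.txt"
    if cldr_name == "UCA_Rules_SHORT.txt":
        # Note the removed underscore before "Rules".
        return "icu4c/source/data/unidata/UCARules.txt"
    prefix = "CollationTest_CLDR_"
    if cldr_name.startswith(prefix) and cldr_name.endswith("_SHORT.txt"):
        return "icu4c/source/test/testdata/CollationTest_" + cldr_name[len(prefix):]
    return None

for name in ["FractionalUCA_SHORT.txt", "UCA_Rules_SHORT.txt",
             "CollationTest_CLDR_SHIFTED_SHORT.txt", "unihan-index.txt"]:
    print(name, "->", icu_target(name))
```

The sketch does not cover the second copy of the test files into ICU4J, which keeps the ICU4C names unchanged.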
 200
 201 * Unihan collators
 202     https://sites.google.com/site/unicodetools/unihan
 203 - run Unicode Tools
 204     org.unicode.draft.GenerateUnihanCollators
 205   with VM arguments
 206     -ea
 207     -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
 208     -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
 209     -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
 210     -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
 211     -DUVERSION=12.1.0
 212 - run Unicode Tools
 213     org.unicode.draft.GenerateUnihanCollatorFiles
 214   with the same arguments
 215 - check CLDR diffs
 216     cd $CLDR_SRC
 217     meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
 218     meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
 219 - copy to CLDR
 220     cd $CLDR_SRC
 221     cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
 222     cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
 223 - run CLDR unit tests, commit to CLDR
 224 - generate ICU zh collation data: run CLDR
 225     org.unicode.cldr.icu.NewLdml2IcuConverter
 226   with program arguments
 227     -t collation
 228     -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
 229     -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
 230     -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
 231     -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
 232     zh
 233   and VM arguments
 234     -ea
 235     -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
 236 - rebuild ICU4C
 237
 238 * run & fix ICU4C tests, now with new CLDR collation root data
 239 - run all tests with the collation test data *_SHORT.txt or the full files
 240   (the full ones have comments, useful for debugging)
 241 - note on intltest: if collate/UCAConformanceTest fails, then
 242   utility/MultithreadTest/TestCollators will fail as well;
 243   fix the conformance test before looking into the multi-thread test
 244
 245 * update Java data files
 246 - refresh just the UCD/UCA-related/derived files, just to be safe
 247 - see (ICU4C)/source/data/icu4j-readme.txt
 248 - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 249 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 250   output:
 251     ...
 252     make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
 253     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt64b
 254     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b
 255     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt64l.dat ./out/icu4j/icudt64b.dat -s ./out/build/icudt64l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt64b
 256     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b"
 257     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt64b/
 258     mkdir -p /tmp/icu4j/main/shared/data
 259     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
 260     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt64b/
 261     mkdir -p /tmp/icu4j/main/shared/data
 262     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
 263     make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
 264 - copy the big-endian Unicode data files to another location,
 265   separate from the other data files,
 266   and then refresh ICU4J
 267     cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
 268     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
 269     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
 270     cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 271     cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 272     rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
 273     cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 274     cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
 275     cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
 276     jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
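The cp and rm commands above amount to a selection rule over paths relative to com/ibm/icu/impl/data/$ICUDT. A minimal Python sketch of that rule, as I read it from the commands; the function name and the sample file names are only examples.

```python
def is_unicode_data_file(rel_path):
    """True if the cp/rm sequence above would leave this file in /tmp/icu4j."""
    directory, _, name = rel_path.rpartition("/")
    if directory == "":
        # Top level: confusables.cfu, *.icu except cnvalias.icu, and *.nrm.
        if name == "cnvalias.icu":
            return False
        return name == "confusables.cfu" or name.endswith((".icu", ".nrm"))
    # Everything directly inside coll/ and brkitr/ is copied.
    return directory in ("coll", "brkitr")

SAMPLE_FILES = ["uprops.icu", "cnvalias.icu", "nfc.nrm", "confusables.cfu",
                "coll/ucadata.icu", "brkitr/char.brk", "zone/en.res", "root.res"]

selected = [path for path in SAMPLE_FILES if is_unicode_data_file(path)]
print(selected)
```

Everything not selected (locale resource bundles, cnvalias.icu, and so on) stays as it was in icudata.jar, since jar uvf only updates the entries it is given.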
 277
 278 * When refreshing all of ICU4J data from ICU4C
 279 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 280 - cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
 281 or
 282 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
 283
 284 * update CollationFCD.java
 285   + copy & paste the initializers of lcccIndex[] etc. from
 286     ICU4C/source/i18n/collationfcd.cpp to
 287     ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
 288
 289 * refresh Java test .txt files
 290 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
 291     cd $ICU_SRC/icu4c/source/data/unidata
 292     cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
 293     cd ../../test/testdata
 294     cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
 295     cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
 296
 297 * run & fix ICU4J tests
 298
 299 *** API additions
 300 - send notice to icu-design about new born-@stable API (enum constants etc.)
 301
 302 *** CLDR numbering systems
 303 - look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
 304   for example, look for
 305     ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
 306     in new blocks (Blocks.txt)
 307   Unicode 12: using Unicode 12 CLDR ticket #11478
 308     hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
 309     wcho 1E2F0..1E2F9 Wancho
 310   Unicode 11: using Unicode 11 CLDR ticket #10978
 311     rohg 10D30..10D39 Hanifi_Rohingya
 312     gong 11DA0..11DA9 Gunjala_Gondi
 313   Earlier: CLDR tickets specific to adding new numbering systems.
 314   Unicode 10: http://unicode.org/cldr/trac/ticket/10219
 315   Unicode 9: http://unicode.org/cldr/trac/ticket/9692
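The egrep above finds one line per decimal digit set, the digit four. A Python sketch of the same search, extended to print the full 0..9 range, is below. It rests on two assumptions that should be checked against the real data: that the matching ppucd.txt lines start with "cp;" followed by a single hex code point, and that the ten digits of a set are contiguous. The sample lines are simplified stand-ins, not real ppucd.txt content.

```python
import re

# Same pattern as in the egrep command line.
DIGIT_FOUR = re.compile(r";gc=Nd.+;nv=4")

def digit_ranges(ppucd_lines):
    """Return "zero..nine" code point ranges for each digit-four line found."""
    ranges = []
    for line in ppucd_lines:
        if not DIGIT_FOUR.search(line):
            continue
        fields = line.split(";")
        if len(fields) < 2 or fields[0] != "cp" or ".." in fields[1]:
            continue
        four = int(fields[1], 16)
        ranges.append(f"{four - 4:04X}..{four + 5:04X}")
    return ranges

SAMPLE = [
    "cp;1E144;gc=Nd;nt=De;nv=4",
    "cp;1E145;gc=Nd;nt=De;nv=5",
    "cp;2163;gc=Nl;nt=Nu;nv=4",
    "cp;1E2F4;gc=Nd;nt=De;nv=4",
]

print(digit_ranges(SAMPLE))
```

For the two Unicode 12 digit sets this gives the ranges recorded above for hmnp and wcho.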
 316
 317 *** merge the Unicode update branches back onto the trunk
 318 - do not merge the icudata.jar and testdata.jar,
 319   instead rebuild them from merged & tested ICU4C
 320 - make sure that changes to Unicode tools are checked in:
 321   http://www.unicode.org/utility/trac/log/trunk/unicodetools
 322
 323 ---------------------------------------------------------------------------- ***
 324
 325 Unicode 12.0 update for ICU 64
 326
 327 http://www.unicode.org/versions/Unicode12.0.0/
 328 http://unicode.org/versions/beta-12.0.0.html
 329 https://www.unicode.org/review/pri389/
 330 http://www.unicode.org/reports/uax-proposed-updates.html
 331 http://www.unicode.org/reports/tr44/tr44-23.html
 332
 333 ICU-20203 Unicode 12
 334
 335 ICU-20111 move text layout properties data into a data file
 336
 337 cldrbug 11478: Unicode 12
 338 Accidentally used ^/trunk instead of ^/branches/markus/uni12
 339
 340 * Command-line environment setup
 341
 342 UNICODE_DATA=~/unidata/uni12/20190309
 343 CLDR_SRC=~/svn.cldr/uni
 344 ICU_ROOT=~/icu/uni
 345 ICU_SRC=$ICU_ROOT/src
 346 ICUDT=icudt63b
 347 ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
 348 ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
 349 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
 350
 351 *** Unicode version numbers
 352 - makedata.mak
 353 - uchar.h
 354 - com.ibm.icu.util.VersionInfo
 355 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
 356
 357 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
 358   so that the makefiles see the new version number.
 359
 360 *** data files & enums & parser code
 361
 362 * download files
 363 - mkdir -p $UNICODE_DATA
 364 - download Unicode files into $UNICODE_DATA
 365   + subfolders: emoji, idna, security, ucd, uca
 366   + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
 367
 368 * for manual diffs and for Unicode Tools input data updates:
 369   remove version suffixes from the file names
 370     ~$ unidata/desuffixucd.py $UNICODE_DATA
 371   (see https://sites.google.com/site/unicodetools/inputdata)
 372
 373 * process and/or copy files
 374 - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
 375   + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
 376   + For debugging, and tweaking how ppucd.txt is written,
 377     the tool has an --only_ppucd option:
 378     py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
 379
 380 - cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
 381
 382 * build ICU (make install)
 383   so that the tools build can pick up the new definitions from the installed header files.
 384
 385   $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
 386
 387 * new constants for new property values
 388 - preparseucd.py error:
 389     ValueError: missing uchar.h enum constants for some property values:
 390     [(u'blk', set([u'Symbols_And_Pictographs_Ext_A', u'Elymaic',
 391         u'Ottoman_Siyaq_Numbers', u'Nandinagari', u'Nyiakeng_Puachue_Hmong',
 392         u'Small_Kana_Ext', u'Egyptian_Hieroglyph_Format_Controls', u'Wancho', u'Tamil_Sup'])),
 393     (u'sc', set([u'Nand', u'Wcho', u'Elym', u'Hmnp']))]
 394   = PropertyValueAliases.txt new property values (diff old & new .txt files)
 395     blk; Egyptian_Hieroglyph_Format_Controls; Egyptian_Hieroglyph_Format_Controls
 396     blk; Elymaic                          ; Elymaic
 397     blk; Nandinagari                      ; Nandinagari
 398     blk; Nyiakeng_Puachue_Hmong           ; Nyiakeng_Puachue_Hmong
 399     blk; Ottoman_Siyaq_Numbers            ; Ottoman_Siyaq_Numbers
 400     blk; Small_Kana_Ext                   ; Small_Kana_Extension
 401     blk; Symbols_And_Pictographs_Ext_A    ; Symbols_And_Pictographs_Extended_A
 402     blk; Tamil_Sup                        ; Tamil_Supplement
 403     blk; Wancho                           ; Wancho
 404   -> add to uchar.h
 405     use long property names for enum constants,
 406     for the trailing comment get the block start code point: diff old & new Blocks.txt
 407   -> add to UCharacter.UnicodeBlock IDs
 408     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
 409             replace  public static final int \1_ID = \2; \3
 410   -> add to UCharacter.UnicodeBlock objects
 411     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
 412             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
 413
 414     sc ; Elym                             ; Elymaic
 415     sc ; Hmnp                             ; Nyiakeng_Puachue_Hmong
 416     sc ; Nand                             ; Nandinagari
 417     sc ; Wcho                             ; Wancho
 418   -> uscript.h & com.ibm.icu.lang.UScript
 419   -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
 420       and in com.ibm.icu.dev.test.lang.TestUScript.java
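The two Eclipse find/replace steps for UCharacter.UnicodeBlock can be tried out with Python's re.sub, which accepts the same find patterns and backslash group references. This is a sketch for checking the patterns, not a replacement for the editor step. The second find pattern has only two groups, so the trailing comment is group 2 there. The constant value 291 in the sample line is a placeholder, not taken from uchar.h.

```python
import re

ID_FIND = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
OBJECT_FIND = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")

def to_java_id(line):
    """uchar.h enum line -> UnicodeBlock ..._ID constant."""
    return ID_FIND.sub(r"public static final int \1_ID = \2; \3", line)

def to_java_object(line):
    """uchar.h enum line -> UnicodeBlock object declaration."""
    return OBJECT_FIND.sub(
        r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
        line)

# Placeholder enum value; the block start in the comment follows the uchar.h convention.
sample = "    UBLOCK_ELYMAIC = 291, /*[10FE0]*/"
print(to_java_id(sample))
print(to_java_object(sample))
```

Leading indentation is kept because re.sub only replaces the matched part of the line.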
 421
 422 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
 423     (not strictly necessary for NOT_ENCODED scripts)
 424   $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
 425
 426 * update spoof checker UnicodeSet initializers:
 427     inclusionPat & recommendedPat in uspoof.cpp
 428     INCLUSION & RECOMMENDED in SpoofChecker.java
 429 - make sure that the Unicode Tools tree contains the latest security data files
 430 - go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
 431 - update the hardcoded version number there in the DIRECTORY path
 432 - run the tool (no special environment variables needed)
 433 - copy & paste from the Console output into the .cpp & .java files
 434
 435 * generate normalization data files
 436   cd $ICU_ROOT/dbg/icu4c
 437   bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
 438   bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm     -s $ICU4C_UNIDATA/norm2 nfc.txt
 439   bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm    -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
 440   bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
 441   bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm   -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
 442
 443 * build ICU (make install)
 444   so that the tools build can pick up the new definitions from the installed header files.
 445
 446   $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date
 447
 448 * build Unicode tools using CMake+make
 449
 450 $ICU_SRC/tools/unicode/c/icudefs.txt:
 451
 452 # Location (--prefix) of where ICU was installed.
 453 set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
 454 # Location of the ICU4C source tree.
 455 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)
 456
 457   $ICU_ROOT/dbg$
 458     mkdir -p tools/unicode/c
 459     cd tools/unicode/c
 460
 461   $ICU_ROOT/dbg/tools/unicode/c$
 462     cmake ../../../../src/tools/unicode/c
 463     make
 464
 465 * generate core properties data files
 466   $ICU_ROOT/dbg/tools/unicode/c$
 467     genprops/genprops $ICU_SRC/icu4c
 468     genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \
 469     genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
 470 - rebuild ICU (make install) & tools
 471
 472 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
 473   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
 474 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
 475 - Unicode 6.0..12.0: U+2260, U+226E, U+226F
 476 - nothing new in this Unicode version, no test file to update
 477
 478 * run & fix ICU4C tests
 479 - update test of default bidi classes:
 480   Bidi range \U0001ED00-\U0001ED4F changes default from R to AL,
 481   see diffs in DerivedBidiClass.txt
 482   + /tsutil/cucdtst/TestUnicodeData enumDefaultsRange() defaultBidi[]
 483   + UCharacterTest.java TestIteration() defaultBidi[]
 484 - Andy handles RBBI & spoof check test failures
 485
 486 * collation: CLDR collation root, UCA DUCET
 487
 488 - UCA DUCET goes into Mark's Unicode tools, see
 489     https://sites.google.com/site/unicodetools/home#TOC-UCA
 490   diff the main mapping file, look for bad changes
 491   (for example, more bytes per weight for common characters)
 492     ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.txt
 493     ~/svn.unitools/trunk$ meld ../frac-11.txt ../frac-12.txt
 494
 495 - CLDR root data files are checked into $CLDR_SRC/common/uca/
 496     cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
 497
 498 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
 499     cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
 500 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
 501     cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
 502     (note removing the underscore before "Rules")
 503     cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
 504 - restore TODO diffs in UCARules.txt
 505     meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
 506 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
 507   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
 508   from the CLDR root files (..._CLDR_..._SHORT.txt)
 509     cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
 510     cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
 511     cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
 512 - if CLDR common/uca/unihan-index.txt changes, then update
 513   CLDR common/collation/root.xml <collation type="private-unihan">
 514   and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
 515
 516 - run genuca, see command line above;
 517   deal with
 518     Error: Unknown script for first-primary sample character U+119CE on line 29233 of /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
 519     FDD1 119CE; [71 CD 02, 05, 05]      # Nandinagari first primary (compressible)
 520         (add the character to genuca.cpp sampleCharsToScripts[])
 521   + This time, I added code to genuca.cpp to use uscript_getSampleUnicodeString(script)
 522     and cache its values.
 523     This works as long as the script metadata is updated before the collation data.
 524 - rebuild ICU4C
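The genuca change described above replaces a hand-maintained table with a lookup that asks the script metadata for each script's sample string and caches the answers. The real code is C++ in genuca.cpp; the following Python sketch only illustrates the caching idea. The class, the get_sample callback, and the sample table are hypothetical stand-ins for uscript_getSampleUnicodeString() and the real metadata.

```python
class SampleCharToScript:
    """Maps a sample character to its script code, querying each script only once."""

    def __init__(self, get_sample, scripts):
        self._get_sample = get_sample
        self._scripts = list(scripts)
        self._by_char = None  # reverse map, built on first lookup

    def script_for(self, ch):
        if self._by_char is None:
            self._by_char = {}
            for script in self._scripts:
                sample = self._get_sample(script)
                if sample:
                    self._by_char[sample] = script
        return self._by_char.get(ch)


# Placeholder data standing in for the real script metadata.
HYPOTHETICAL_SAMPLES = {"Latn": "L", "Nand": "\U000119CE"}
calls = []

def get_sample(script):
    calls.append(script)
    return HYPOTHETICAL_SAMPLES.get(script, "")

lookup = SampleCharToScript(get_sample, ["Latn", "Nand", "Zzzz"])
print(lookup.script_for("\U000119CE"))
print(lookup.script_for("L"))
print(len(calls))
```

If the metadata for a new script is missing when the lookup runs, script_for returns None for its sample character, which mirrors the ordering caveat above.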
 525
 526 * Unihan collators
 527     https://sites.google.com/site/unicodetools/unihan
 528 - run Unicode Tools
 529     org.unicode.draft.GenerateUnihanCollators
 530   with VM arguments
 531     -ea
 532     -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
 533     -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
 534     -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
 535     -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
 536     -DUVERSION=12.0.0
 537 - run Unicode Tools
 538     org.unicode.draft.GenerateUnihanCollatorFiles
 539   with the same arguments
 540 - check CLDR diffs
 541     cd $CLDR_SRC
 542     meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
 543     meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
 544 - copy to CLDR
 545     cd $CLDR_SRC
 546     cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
 547     cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
 548 - run CLDR unit tests, commit to CLDR
 549 - generate ICU zh collation data: run CLDR
 550     org.unicode.cldr.icu.NewLdml2IcuConverter
 551   with program arguments
 552     -t collation
 553     -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
 554     -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
 555     -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll
 556     -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation
 557     zh
 558   and VM arguments
 559     -ea
 560     -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
 561 - rebuild ICU4C
 562
 563 * run & fix ICU4C tests, now with new CLDR collation root data
 564 - run all tests with the collation test data *_SHORT.txt or the full files
 565   (the full ones have comments, useful for debugging)
 566 - note on intltest: if collate/UCAConformanceTest fails, then
 567   utility/MultithreadTest/TestCollators will fail as well;
 568   fix the conformance test before looking into the multi-thread test
 569
 570 * update Java data files
 571 - refresh just the UCD/UCA-related/derived files, just to be safe
 572 - see (ICU4C)/source/data/icu4j-readme.txt
 573 - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 574 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 575   output:
 576     ...
 577     Unicode .icu files built to ./out/build/icudt63l
 578     echo timestamp > uni-core-data
 579     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt63b
 580     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b
 581     echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
 582     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt63l.dat ./out/icu4j/icudt63b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt63l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt63b
 583     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b"
 584     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt63b/
 585     mkdir -p /tmp/icu4j/main/shared/data
 586     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
 587     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt63b/
 588     mkdir -p /tmp/icu4j/main/shared/data
 589     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
 590     make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'
 591 - copy the big-endian Unicode data files to another location,
 592   separate from the other data files,
 593   and then refresh ICU4J
 594     cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
 595     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
 596     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
 597     cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 598     cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 599     rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
 600     cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 601     cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
 602     cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
 603     jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
 604
 605 * When refreshing all of ICU4J data from ICU4C
 606 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 607 - cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
 608 or
 609 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
 610
 611 * update CollationFCD.java
 612   + copy & paste the initializers of lcccIndex[] etc. from
 613     ICU4C/source/i18n/collationfcd.cpp to
 614     ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
 615
 616 * refresh Java test .txt files
 617 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
 618     cd $ICU_SRC/icu4c/source/data/unidata
 619     cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
 620     cd ../../test/testdata
 621     cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
 622     cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
 623
 624 * run & fix ICU4J tests
 625
 626 *** API additions
 627 - send notice to icu-design about new born-@stable API (enum constants etc.)
 628
 629 *** CLDR numbering systems
 630 - look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
 631   for example, look for
 632     ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
 633     in new blocks (Blocks.txt)
 634   Unicode 12: using Unicode 12 CLDR ticket #11478
 635     hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
 636     wcho 1E2F0..1E2F9 Wancho
 637   Unicode 11: using Unicode 11 CLDR ticket #10978
 638     rohg 10D30..10D39 Hanifi_Rohingya
 639     gong 11DA0..11DA9 Gunjala_Gondi
 640   Earlier: CLDR tickets specific to adding new numbering systems.
 641   Unicode 10: http://unicode.org/cldr/trac/ticket/10219
 642   Unicode 9: http://unicode.org/cldr/trac/ticket/9692
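
  The egrep search above can also be sketched in Python, which makes it easy
  to print the full 0..9 range for each hit. This is only an illustration:
  it assumes that ppucd.txt code point lines start with "cp;" followed by
  the code point, and that each digit set is a contiguous run of ten.
  The function name and the sample lines are made up, not taken from the tools.

```python
import re

# Same idea as the egrep pattern: gc=Nd followed later on the line by nv=4.
# The trailing (;|$) additionally rejects values such as nv=40.
DIGIT_FOUR_RE = re.compile(r";gc=Nd.+;nv=4(;|$)")


def decimal_digit_ranges(lines):
    """Return (start, end) code point ranges for sets of decimal digits.

    Assumption: digits 0..9 are contiguous, so a digit four at code point c
    means that the set spans c-4..c+5.
    """
    ranges = []
    for line in lines:
        if not line.startswith("cp;") or not DIGIT_FOUR_RE.search(line):
            continue
        cp_field = line.split(";")[1]
        if ".." in cp_field:
            continue  # a range line cannot describe a single digit
        c = int(cp_field, 16)
        ranges.append((c - 4, c + 5))
    return ranges


# Hypothetical sample lines in the assumed ppucd.txt shape.
SAMPLE = [
    "cp;1E144;gc=Nd;nt=De;nv=4",
    "cp;1E145;gc=Nd;nt=De;nv=5",
    "cp;2463;gc=No;nt=Di;nv=4",
]

for start, end in decimal_digit_ranges(SAMPLE):
    print("%04X..%04X" % (start, end))
```

  Each printed range still has to be checked against Blocks.txt by hand
  to see whether it belongs to a new block.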
 643
 644 *** merge the Unicode update branches back onto the trunk
 645 - do not merge the icudata.jar and testdata.jar,
 646   instead rebuild them from merged & tested ICU4C
 647 - make sure that changes to Unicode tools are checked in:
 648   http://www.unicode.org/utility/trac/log/trunk/unicodetools
 649
 650 ---------------------------------------------------------------------------- ***
 651
 652 ICU 63: add support for the text layout properties InPC, InSC, vo
 653
 654 * Command-line environment setup
 655
 656 UNICODE_DATA=~/unidata/uni11/20180609
 657 CLDR_SRC=~/svn.cldr/uni
 658 ICU_ROOT=~/icu/mine
 659 ICU_SRC=$ICU_ROOT/src
 660 ICUDT=icudt62b
 661 ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
 662 ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
 663 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
 664
 665 *** Links
 666
 667 https://unicode-org.atlassian.net/browse/ICU-8966 InPC & InSC
 668 https://unicode-org.atlassian.net/browse/ICU-12850 vo
 669
 670 *** data files & enums & parser code
 671
 672 * API additions
 673 - for each of the three new enumerated properties
 674   + uchar.h: add the enum UProperty constant UCHAR_<long prop name>
 675   + uchar.h: update UCHAR_INT_LIMIT
 676   + uchar.h: add the enum U<long prop name>
 677     with constants U_<short prop name>_<long value name>
 678   + UProperty.java: add the constant <long prop name>
 679   + UProperty.java: update INT_LIMIT
 680   + UCharacter.java: add the interface <long prop name>
 681     with constants <long value name>
 682
 683 * process and/or copy files
 684 - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
 685   + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
 686   + It also writes tools/unicode/c/genprops/pnames_data.h with property and value
 687     names and aliases.
 688   + For debugging, and tweaking how ppucd.txt is written,
 689     the tool has an --only_ppucd option:
 690     py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
 691
 692 * preparseucd.py changes
 693 - add new property short names (uppercase) to _prop_and_value_re
 694   so that ParseUCharHeader() parses the new enum constants
 695
 696 * build ICU (make install)
 697   so that the tools build can pick up the new definitions from the installed header files.
 698
 699   $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
 700
 701 * build Unicode tools using CMake+make
 702
 703 $ICU_SRC/tools/unicode/c/icudefs.txt:
 704
 705 # Location (--prefix) of where ICU was installed.
 706 set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)
 707 # Location of the ICU4C source tree.
 708 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/mine/src/icu4c)
 709
 710   $ICU_ROOT/dbg$
 711     mkdir -p tools/unicode/c
 712     cd tools/unicode/c
 713
 714   $ICU_ROOT/dbg/tools/unicode/c$
 715     cmake ../../../../src/tools/unicode/c
 716     make
 717
 718 * generate core properties data files
 719   $ICU_ROOT/dbg/tools/unicode/c$
 720     genprops/genprops $ICU_SRC/icu4c
 721 - rebuild ICU (make install) & tools
 722
 723 * write data for runtime, hardcoded for now
 724 - add genprops/layoutpropsbuilder.cpp with pieces from sibling files
 725 - generate new icu4c/source/common/ulayout_props_data.h
 726 - for each of the three new enumerated properties
 727   + int property max value
 728   + small, 8-bit UCPTrie
 729     (A small 16-bit trie with bit fields for these three properties
 730     is very nearly the same size as the sum of the three.)
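
  For comparison, the rejected alternative (one 16-bit trie value with bit
  fields for all three properties) could look roughly like the sketch below.
  The field widths of 4, 6, and 2 bits are assumptions picked for
  illustration only; they are not what genprops writes.

```python
# Hypothetical bit field layout inside one 16-bit value.
INPC_BITS, INSC_BITS, VO_BITS = 4, 6, 2
INSC_SHIFT = INPC_BITS
VO_SHIFT = INPC_BITS + INSC_BITS


def pack(inpc, insc, vo):
    """Combine three small enum values into one integer below 1 << 16."""
    for value, bits in ((inpc, INPC_BITS), (insc, INSC_BITS), (vo, VO_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("value %d does not fit into %d bits" % (value, bits))
    return inpc | (insc << INSC_SHIFT) | (vo << VO_SHIFT)


def unpack(packed):
    """Split a packed value back into (inpc, insc, vo)."""
    inpc = packed & ((1 << INPC_BITS) - 1)
    insc = (packed >> INSC_SHIFT) & ((1 << INSC_BITS) - 1)
    vo = (packed >> VO_SHIFT) & ((1 << VO_BITS) - 1)
    return inpc, insc, vo
```

  With separate 8-bit tries, each getter is a single lookup without
  shifting or masking, at nearly the same total data size.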
 731
 732 * wire into C++
 733 - uprops.cpp: #include ulayout_props_data.h
 734 - uprops.cpp: add getInPC() etc. functions
 735 - uprops.cpp: add lines to intProps[], include max values
 736 - uprops.h: add UPropertySource constants
 737 - uprops.cpp: add uprops_addPropertyStarts(src)
 738 - uniset_props.cpp: add to UnicodeSet_initInclusion()
 739 - intltest/ucdtest.cpp: write unit tests
 740
 741 * update Java data files
 742 - refresh just the pnames.icu file with the new property [value] names, just to be safe
 743 - see $ICU_SRC/icu4c/source/data/icu4j-readme.txt
 744 - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 745 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 746 - copy the big-endian Unicode data files to another location,
 747   separate from the other data files,
 748   and then refresh ICU4J
 749     cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
 750     cp com/ibm/icu/impl/data/$ICUDT/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 751     jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
 752
 753 * wire into Java
 754 - UCharacterProperty.java: add new SRC_INPC etc. constants as in C++
 755 - UCharacterProperty.java: for each new property
 756   + create a nested class to hold its CodePointTrie
 757   + initialize it from a string literal
 758   + paste in the initializer printed by genprops
 759   + add a new IntProperty object to the intProps[] array
 760   + use the correct max int value for each property, also printed by genprops
 761 - UCharacterProperty.java: add ulayout_addPropertyStarts(src, set)
 762 - UnicodeSet.java: add to getInclusions()
 763 - UCharacterTest.java: write unit tests
 764
 765 ---------------------------------------------------------------------------- ***
 766
 767 Unicode 11.0 update for ICU 62
 768
 769 http://www.unicode.org/versions/Unicode11.0.0/
 770 http://unicode.org/versions/beta-11.0.0.html
 771 https://www.unicode.org/review/pri372/
 772 http://www.unicode.org/reports/uax-proposed-updates.html
 773 http://www.unicode.org/reports/tr44/tr44-21.html
 774
 775 * Command-line environment setup
 776
 777 UNICODE_DATA=~/unidata/uni11/20180521
 778 CLDR_SRC=~/svn.cldr/uni
 779 ICU_ROOT=~/svn.icu/uni
 780 ICU_SRC=$ICU_ROOT/src
 781 ICUDT=icudt61b
 782 ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
 783 ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
 784 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
 785
 786 *** ICU Trac
 787
 788 - ticket:13630: Unicode 11
 789 - ^/branches/markus/uni11
 790
 791 *** CLDR Trac
 792
 793 - cldrbug 10978: Unicode 11
 794 - ^/branches/markus/uni11
 795
 796 *** Unicode version numbers
 797 - makedata.mak
 798 - uchar.h
 799 - com.ibm.icu.util.VersionInfo
 800 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
 801
 802 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
 803   so that the makefiles see the new version number.
 804
 805 *** data files & enums & parser code
 806
 807 * download files
 808 - mkdir -p $UNICODE_DATA
 809 - download Unicode files into $UNICODE_DATA
 810   + subfolders: emoji, idna, security, ucd, uca
 811   + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
 812
 813 * for manual diffs and for Unicode Tools input data updates:
 814   remove version suffixes from the file names
 815     ~$ unidata/desuffixucd.py $UNICODE_DATA
 816   (see https://sites.google.com/site/unicodetools/inputdata)
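
  The renaming can be pictured with the sketch below. It is not the actual
  desuffixucd.py; the regular expression and the sample file names are
  assumptions about what versioned UCD file names look like.

```python
import re

# Assumed shape: base name, "-", a version such as 11.0.0 with an optional
# draft marker such as d5, then the file extension.
_SUFFIX_RE = re.compile(
    r"^(?P<base>.+?)-[0-9]+\.[0-9]+\.[0-9]+(?:d[0-9]+)?(?P<ext>\.[A-Za-z0-9]+)$"
)


def desuffix(name):
    """Return the file name without its version suffix, or unchanged."""
    m = _SUFFIX_RE.match(name)
    return m.group("base") + m.group("ext") if m else name


print(desuffix("UnicodeData-11.0.0d5.txt"))
print(desuffix("emoji-data.txt"))
```

  Names without a version suffix are returned as they are, so running the
  renaming a second time does no harm.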
 817
 818 * process and/or copy files
 819 - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
 820   + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
 821   + For debugging, and tweaking how ppucd.txt is written,
 822     the tool has an --only_ppucd option:
 823     py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
 824
 825 - cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
 826
 827 * build ICU (make install)
 828   so that the tools build can pick up the new definitions from the installed header files.
 829
 830   $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
 831
 832 * preparseucd.py changes
 833 - fix other errors
 834     NameError: unknown property Extended_Pictographic
 835   -> add Extended_Pictographic binary property
 836   -> add new short names for all Emoji properties
 837
 838 * new constants for new property values
 839 - preparseucd.py error:
 840     ValueError: missing uchar.h enum constants for some property values:
 841     [(u'blk', set([u'Georgian_Ext', u'Hanifi_Rohingya', u'Medefaidrin', u'Sogdian', u'Makasar',
 842                    u'Old_Sogdian', u'Dogra', u'Gunjala_Gondi', u'Chess_Symbols', u'Mayan_Numerals',
 843                    u'Indic_Siyaq_Numbers'])),
 844      (u'jg', set([u'Hanifi_Rohingya_Kinna_Ya', u'Hanifi_Rohingya_Pa'])),
 845      (u'sc', set([u'Medf', u'Sogd', u'Dogr', u'Rohg', u'Maka', u'Sogo', u'Gong'])),
 846      (u'GCB', set([u'LinkC', u'Virama'])),
 847      (u'WB', set([u'WSegSpace']))]
 848   = PropertyValueAliases.txt new property values (diff old & new .txt files)
 849     blk; Chess_Symbols                    ; Chess_Symbols
 850     blk; Dogra                            ; Dogra
 851     blk; Georgian_Ext                     ; Georgian_Extended
 852     blk; Gunjala_Gondi                    ; Gunjala_Gondi
 853     blk; Hanifi_Rohingya                  ; Hanifi_Rohingya
 854     blk; Indic_Siyaq_Numbers              ; Indic_Siyaq_Numbers
 855     blk; Makasar                          ; Makasar
 856     blk; Mayan_Numerals                   ; Mayan_Numerals
 857     blk; Medefaidrin                      ; Medefaidrin
 858     blk; Old_Sogdian                      ; Old_Sogdian
 859     blk; Sogdian                          ; Sogdian
 860   -> add to uchar.h
 861     use long property names for enum constants,
 862     for the trailing comment get the block start code point: diff old & new Blocks.txt
 863   -> add to UCharacter.UnicodeBlock IDs
 864     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
 865             replace  public static final int \1_ID = \2; \3
 866   -> add to UCharacter.UnicodeBlock objects
 867     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
 868             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
 869
 870     GCB; LinkC                            ; LinkingConsonant
 871     GCB; Virama                           ; Virama
 872   -> uchar.h & UCharacter.GraphemeClusterBreak
 873   -> these two later removed again: http://www.unicode.org/L2/L2018/18115.htm#155-A76
 874
 875     InSC; Consonant_Initial_Postfixed     ; Consonant_Initial_Postfixed
 876   -> ignore: ICU does not yet support this property
 877
 878     jg ; Hanifi_Rohingya_Kinna_Ya         ; Hanifi_Rohingya_Kinna_Ya
 879     jg ; Hanifi_Rohingya_Pa               ; Hanifi_Rohingya_Pa
 880   -> uchar.h & UCharacter.JoiningGroup
 881
 882     sc ; Dogr                             ; Dogra
 883     sc ; Gong                             ; Gunjala_Gondi
 884     sc ; Maka                             ; Makasar
 885     sc ; Medf                             ; Medefaidrin
 886     sc ; Rohg                             ; Hanifi_Rohingya
 887     sc ; Sogd                             ; Sogdian
 888     sc ; Sogo                             ; Old_Sogdian
 889   -> uscript.h & com.ibm.icu.lang.UScript
 890   -> Nushu had been added already
 891   -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
 892       and in com.ibm.icu.dev.test.lang.TestUScript.java
 893
 894     WB ; WSegSpace                        ; WSegSpace
 895   -> uchar.h & UCharacter.WordBreak
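
  The two Eclipse find/replace patterns for UCharacter.UnicodeBlock above
  can be reproduced outside the IDE. A sketch with Python's re module,
  using the same expressions; the helper names are made up, and the block
  number in the sample line is illustrative only.

```python
import re

# Same expressions as in the Eclipse find/replace dialogs.
ID_FIND = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
ID_REPLACE = r"public static final int \g<1>_ID = \g<2>; \g<3>"

OBJ_FIND = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
OBJ_REPLACE = (
    r'public static final UnicodeBlock \g<1> = new UnicodeBlock("\g<1>", \g<1>_ID); \g<2>'
)


def to_id_line(uchar_h_line):
    """Turn a uchar.h UBLOCK_ enum line into a Java ..._ID constant."""
    return ID_FIND.sub(ID_REPLACE, uchar_h_line)


def to_object_line(uchar_h_line):
    """Turn a uchar.h UBLOCK_ enum line into a Java UnicodeBlock object."""
    return OBJ_FIND.sub(OBJ_REPLACE, uchar_h_line)


# Sample line in the style of uchar.h; the number 283 is illustrative.
line = "    UBLOCK_DOGRA = 283, /*[11800]*/"
print(to_id_line(line))
print(to_object_line(line))
```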
 896
 897 * New short names for emoji properties
 898 - see UTS #51
 899 - short names set in preparseucd.py
 900
 901 * New properties
 902 - boolean emoji property Extended_Pictographic
 903   -> added in preparseucd.py
 904   -> uchar.h & UProperty.java
 905 - misc. property Equivalent_Unified_Ideograph (EqUIdeo)
 906   as shown in PropertyValueAliases.txt
 907   -> ignore for now
 908
 909 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
 910     (not strictly necessary for NOT_ENCODED scripts)
 911   $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
 912
 913 * update spoof checker UnicodeSet initializers:
 914     inclusionPat & recommendedPat in uspoof.cpp
 915     INCLUSION & RECOMMENDED in SpoofChecker.java
 916 - make sure that the Unicode Tools tree contains the latest security data files
 917 - go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
 918 - update the hardcoded version number there in the DIRECTORY path
 919 - run the tool (no special environment variables needed)
 920 - copy & paste from the Console output into the .cpp & .java files
 921
 922 * generate normalization data files
 923   cd $ICU_ROOT/dbg/icu4c
 924   bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
 925   bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm     -s $ICU4C_UNIDATA/norm2 nfc.txt
 926   bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm    -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
 927   bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
 928   bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm   -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
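
  The binary .nrm commands above follow one pattern: every output file is
  built from nfc.txt plus zero or more overlay files. A sketch that writes
  this down as a table and rebuilds the argument lists; the table and the
  function name are made up for illustration, and only the options shown
  in the commands above are used.

```python
# Output file -> input files, in the order given on the command line.
NORM2_INPUTS = {
    "nfc.nrm": ["nfc.txt"],
    "nfkc.nrm": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf.nrm": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46.nrm": ["nfc.txt", "uts46.txt"],
}


def gennorm2_argv(output, data_in, unidata):
    """Build the gennorm2 argument list for one binary .nrm file."""
    return [
        "bin/gennorm2",
        "-o", "%s/%s" % (data_in, output),
        "-s", "%s/norm2" % unidata,
    ] + NORM2_INPUTS[output]


for name in NORM2_INPUTS:
    print(" ".join(gennorm2_argv(name, "$ICU4C_DATA_IN", "$ICU4C_UNIDATA")))
```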
 929
 930 * build ICU (make install)
 931   so that the tools build can pick up the new definitions from the installed header files.
 932
 933   $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
 934
 935 * build Unicode tools using CMake+make
 936
 937 $ICU_SRC/tools/unicode/c/icudefs.txt:
 938
 939 # Location (--prefix) of where ICU was installed.
 940 set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
 941 # Location of the ICU4C source tree.
 942 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c)
 943
 944   $ICU_ROOT/dbg$
 945     mkdir -p tools/unicode/c
 946     cd tools/unicode/c
 947
 948   $ICU_ROOT/dbg/tools/unicode/c$
 949     cmake ../../../../src/tools/unicode/c
 950     make
 951
 952 * generate core properties data files
 953   $ICU_ROOT/dbg/tools/unicode/c$
 954     genprops/genprops $ICU_SRC/icu4c
 955     genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
 956     genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
 957 - rebuild ICU (make install) & tools
 958
 959 * Fix case props
 960     genprops error: casepropsbuilder: too many exceptions words
 961     genprops error: failure finalizing the data - U_BUFFER_OVERFLOW_ERROR
 962 - With the addition of Georgian Mtavruli capital letters,
 963   there are now too many simple case mappings with big mapping deltas
 964   that yield uncompressible exceptions.
 965 - Fix: change the data structure (now formatVersion 4) by
 966   adding one bit for no-simple-case-folding (for Cherokee), and
 967   one optional slot for a big delta (for most faraway mappings),
 968   together with another bit for whether that delta is negative.
 969   This makes most Cherokee, Georgian, and similar case mappings compressible,
 970   reducing the number of exceptions words.
 971 - Further changes gain one more bit for the exceptions index,
 972   for future growth. For details, see casepropsbuilder.cpp.
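
  Why the Mtavruli letters ran into this: a simple case mapping is cheap
  only when the distance between the two code points fits into a small
  signed delta field. The sketch below shows the arithmetic. The 9-bit
  width is an assumption for illustration; casepropsbuilder.cpp has the
  real layout.

```python
def fits_signed(delta, bits):
    """Return True if delta fits into a two's-complement field of that width."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return lo <= delta <= hi


DELTA_BITS = 9  # hypothetical width of the inline delta field

# Latin: U+0041 'A' lowercases to U+0061 'a'.
latin_delta = ord("a") - ord("A")

# Georgian: U+1C90 (Mtavruli An) lowercases to U+10D0 (An).
georgian_delta = 0x10D0 - 0x1C90

for name, delta in (("Latin", latin_delta), ("Georgian", georgian_delta)):
    print(name, delta, fits_signed(delta, DELTA_BITS))
```

  A mapping whose delta does not fit has to go through the exceptions
  table, which is where the optional big-delta slot helps.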
 973
 974 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
 975   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
 976 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
 977 - Unicode 6.0..11.0: U+2260, U+226E, U+226F
 978 - nothing new in this Unicode version, no test file to update
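
  The grep for non-ASCII "disallowed_STD3_valid" entries can be sketched
  as follows. The sample lines are hypothetical and only modeled on the
  IdnaMappingTable.txt layout (code point or range, then status, then a
  comment); the function name is made up.

```python
def non_ascii_std3_valid(lines):
    """Return non-ASCII code points whose status is disallowed_STD3_valid."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0]  # drop the trailing comment
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        found.extend(c for c in range(start, end + 1) if c > 0x7F)
    return found


# Hypothetical sample lines in the assumed file layout.
SAMPLE = [
    "# comment line",
    "003D          ; disallowed_STD3_valid        # EQUALS SIGN",
    "00C0          ; mapped ; 00E0                # LATIN CAPITAL LETTER A WITH GRAVE",
    "2260          ; disallowed_STD3_valid        # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid        # NOT LESS-THAN..NOT GREATER-THAN",
]

print(["U+%04X" % c for c in non_ascii_std3_valid(SAMPLE)])
```

  Any code point reported beyond the known three would need new test
  cases in uts46test.cpp and UTS46Test.java.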
 979
 980 * run & fix ICU4C tests
 981 - Andy handles RBBI & spoof check test failures
 982
 983 - Errors in char.txt, word.txt, word_POSIX.txt like
 984     createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET"  at line 46, column 16
 985   because \p{Grapheme_Cluster_Break = EBG} and \p{Word_Break = EBG} are empty.
 986   -> Temporary(!) workaround: Add an arbitrary code point to these sets to make them
 987      not empty, just to get ICU building.
 988   -> Intermediate workaround: Remove $E_Base_GAZ and other now-unused variables
 989      and properties together with the rules that used them (GB 10, WB 14).
 990   -> Andy adjusts the rule sets further to sync with
 991      Unicode 11 grapheme, word, and line break spec changes.
 992
 993 * collation: CLDR collation root, UCA DUCET
 994
 995 - UCA DUCET goes into Mark's Unicode tools, see
 996     https://sites.google.com/site/unicodetools/home#TOC-UCA
 997   diff the main mapping file, look for bad changes
 998   (for example, more bytes per weight for common characters)
 999     ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/uca/11.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-11.txt
1000     ~/svn.unitools/trunk$ meld ../frac-10.txt ../frac-11.txt
1001
1002 - CLDR root data files are checked into $CLDR_SRC/common/uca/
1003     cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
1004
1005 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1006     cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
1007 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1008     cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
1009     (note: the target file name drops the underscore before "Rules")
1010     cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
1011 - restore TODO diffs in UCARules.txt
1012     meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
1013 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
1014   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1015   from the CLDR root files (..._CLDR_..._SHORT.txt)
1016     cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1017     cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1018     cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
1019 - if CLDR common/uca/unihan-index.txt changes, then update
1020   CLDR common/collation/root.xml <collation type="private-unihan">
1021   and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
1022
1023 - run genuca, see command line above;
1024   deal with
1025     Error: Unknown script for first-primary sample character U+1180B on line 28649 of /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
1026     FDD1 1180B; [71 CC 02, 05, 05]      # Dogra first primary (compressible)
1027         (add the character to genuca.cpp sampleCharsToScripts[])
1028   + look up the USCRIPT_ code for the new sample characters
1029     (should be obvious from the comment in the error output)
1030   + *add* mappings to sampleCharsToScripts[], do not replace them
1031     (in case the script sample characters flip-flop)
1032   + insert new scripts in DUCET script order, see the top_byte table
1033     at the beginning of FractionalUCA.txt
1034 - rebuild ICU4C
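
  The "add, do not replace" rule for sampleCharsToScripts[] can be pictured
  with a small table model. This is a sketch in Python rather than the C++
  array in genuca.cpp; U+1180A is a hypothetical earlier sample character,
  and the helper names are made up. Inserting a new script at its DUCET
  position is still a manual step.

```python
def add_sample_char(table, cp, script):
    """Add (cp, script) next to existing entries for that script.

    Existing entries are kept, so an older sample character still resolves
    if FractionalUCA.txt switches back to it later.
    """
    if (cp, script) in table:
        return
    same_script = [i for i, (_, s) in enumerate(table) if s == script]
    index = same_script[-1] + 1 if same_script else len(table)
    table.insert(index, (cp, script))


def script_for(table, cp):
    """Look up the script for a first-primary sample character."""
    for sample, script in table:
        if sample == cp:
            return script
    raise KeyError("Unknown script for sample character U+%04X" % cp)


table = [(0x1180A, "USCRIPT_DOGRA")]  # hypothetical earlier sample character
add_sample_char(table, 0x1180B, "USCRIPT_DOGRA")  # from the error message above
print(script_for(table, 0x1180A), script_for(table, 0x1180B))
```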
1035
1036 * Unihan collators
1037     https://sites.google.com/site/unicodetools/unihan
1038 - run Unicode Tools
1039     org.unicode.draft.GenerateUnihanCollators
1040   with VM arguments
1041     -ea
1042     -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
1043     -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
1044     -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
1045     -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
1046     -DUVERSION=11.0.0
1047 - run Unicode Tools
1048     org.unicode.draft.GenerateUnihanCollatorFiles
1049   with the same arguments
1050 - check CLDR diffs
1051     cd $CLDR_SRC
1052     meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
1053     meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
1054 - copy to CLDR
1055     cd $CLDR_SRC
1056     cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
1057     cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
1058 - run CLDR unit tests, commit to CLDR
1059 - generate ICU zh collation data: run CLDR
1060     org.unicode.cldr.icu.NewLdml2IcuConverter
1061   with program arguments
1062     -t collation
1063     -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
1064     -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
1065     -d /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/coll
1066     -p /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/xml/collation
1067     zh
1068   and VM arguments
1069     -ea
1070     -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
1071 - rebuild ICU4C
1072
1073 * run & fix ICU4C tests, now with new CLDR collation root data
1074 - run all tests with the collation test data *_SHORT.txt or the full files
1075   (the full ones have comments, useful for debugging)
1076 - note on intltest: if collate/UCAConformanceTest fails, then
1077   utility/MultithreadTest/TestCollators will fail as well;
1078   fix the conformance test before looking into the multi-thread test
1079
1080 * update Java data files
1081 - refresh just the UCD/UCA-related/derived files, just to be safe
1082 - see (ICU4C)/source/data/icu4j-readme.txt
1083 - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1084 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1085   output:
1086     ...
1087     Unicode .icu files built to ./out/build/icudt61l
1088     echo timestamp > uni-core-data
1089     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt61b
1090     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b
1091     echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
1092     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt61l.dat ./out/icu4j/icudt61b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt61l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt61b
1093     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b"
1094     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt61b/
1095     mkdir -p /tmp/icu4j/main/shared/data
1096     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1097     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt61b/
1098     mkdir -p /tmp/icu4j/main/shared/data
1099     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1100     make[1]: Leaving directory '/usr/local/google/home/mscherer/svn.icu/uni/dbg/icu4c/data'
1101 - copy the big-endian Unicode data files to another location,
1102   separate from the other data files,
1103   and then refresh ICU4J
1104     cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
1105     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1106     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1107     cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1108     cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1109     rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1110     cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1111     cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1112     cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1113     jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1114
1115 * When refreshing all of ICU4J data from ICU4C
1116 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1117 - cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
1118 or
1119 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
1120
1121 * update CollationFCD.java
1122   + copy & paste the initializers of lcccIndex[] etc. from
1123     ICU4C/source/i18n/collationfcd.cpp to
1124     ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1125
1126 * refresh Java test .txt files
1127 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1128     cd $ICU_SRC/icu4c/source/data/unidata
1129     cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1130     cd ../../test/testdata
1131     cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1132     cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1133
1134 * run & fix ICU4J tests
1135
1136 *** API additions
1137 - send notice to icu-design about new born-@stable API (enum constants etc.)
1138
1139 *** CLDR numbering systems
1140 - look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
1141   Unicode 11: using Unicode 11 CLDR ticket #10978
1142     rohg 10D30..10D39 Hanifi_Rohingya
1143     gong 11DA0..11DA9 Gunjala_Gondi
1144   Earlier: CLDR tickets specific to adding new numbering systems.
1145   Unicode 10: http://unicode.org/cldr/trac/ticket/10219
1146   Unicode 9: http://unicode.org/cldr/trac/ticket/9692
1147
1148 *** merge the Unicode update branches back onto the trunk
1149 - do not merge the icudata.jar and testdata.jar,
1150   instead rebuild them from merged & tested ICU4C
1151 - make sure that changes to Unicode tools are checked in:
1152   http://www.unicode.org/utility/trac/log/trunk/unicodetools
1153
1154 ---------------------------------------------------------------------------- ***
1155
1156 Unicode 10.0 update for ICU 60
1157
1158 http://www.unicode.org/versions/Unicode10.0.0/
1159 http://www.unicode.org/versions/beta-10.0.0.html
1160 http://blog.unicode.org/2017/03/unicode-100-beta-review.html
1161 http://www.unicode.org/review/pri350/
1162 http://www.unicode.org/reports/uax-proposed-updates.html
1163 http://www.unicode.org/reports/tr44/tr44-19.html
1164
1165 * Command-line environment setup
1166
1167 UNICODE_DATA=~/unidata/uni10/20170605
1168 CLDR_SRC=~/svn.cldr/uni10
1169 ICU_ROOT=~/svn.icu/uni10
1170 ICU_SRC=$ICU_ROOT/src
1171 ICUDT=icudt60b
1172 ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
1173 ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
1174 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
1175
1176 *** ICU Trac
1177
1178 - ticket:12985: Unicode 10
1179 - ticket:13061: undo hacks from emoji 5.0 update
1180 - ticket:13062: add Emoji_Component property
1181 - ^/branches/markus/uni10
1182
1183 *** CLDR Trac
1184
1185 - cldrbug 10055: Unicode 10
1186 - cldrbug 9882: Unicode 10 script metadata
1187 - cldrbug 10219: numbering systems for Unicode 10
1188
1189 *** Unicode version numbers
1190 - makedata.mak
1191 - uchar.h
1192 - com.ibm.icu.util.VersionInfo
1193 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
1194
1195 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
1196   so that the makefiles see the new version number.
1197
1198 *** data files & enums & parser code
1199
1200 * download files
1201 - mkdir -p $UNICODE_DATA
1202 - download Unicode 10.0 files into $UNICODE_DATA
1203   + subfolders: ucd, uca, idna, security
1204   + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
1205 - download emoji 5.0 files into $UNICODE_DATA/emoji
1206
1207 * for manual diffs: remove version suffixes from the file names
1208   ~$ unidata/desuffixucd.py $UNICODE_DATA
1209   (see https://sites.google.com/site/unicodetools/inputdata)
1210
1211 * process and/or copy files
1212 - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
1213   + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1214   + For debugging, and tweaking how ppucd.txt is written,
1215     the tool has an --only_ppucd option:
1216     py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
1217
1218 - cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
1219
1220 * build ICU (make install)
1221   so that the tools build can pick up the new definitions from the installed header files.
1222
1223   $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
1224
1225 * preparseucd.py changes
1226 - remove or add new Unicode scripts from/to the
1227   only-in-ISO-15924 list according to the error messages:
1228     ValueError: remove ['Nshu'] from _scripts_only_in_iso15924
1229   -> adjust _scripts_only_in_iso15924 as indicated
1230 - fix other errors
1231     Exception: no default values (@missing lines) for some Catalog or Enumerated properties: [u'vo']
1232   -> add vo=Vertical_Orientation to _ignored_properties
1233   -> later removed again: the file is now parsed, even though we do not yet store data for runtime use
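
  The check behind the "remove [...] from _scripts_only_in_iso15924" error
  amounts to a set intersection. The sketch below is not the actual
  preparseucd.py code; the function name and the sample sets are made up.

```python
def scripts_to_remove(only_in_iso15924, unicode_scripts):
    """Return script codes that are listed as ISO-only but are now in Unicode."""
    return sorted(set(only_in_iso15924) & set(unicode_scripts))


def check_scripts(only_in_iso15924, unicode_scripts):
    """Raise an error in the style of the message above if the list is stale."""
    stale = scripts_to_remove(only_in_iso15924, unicode_scripts)
    if stale:
        raise ValueError("remove %s from _scripts_only_in_iso15924" % stale)


# Hypothetical sample data: Nshu has just been encoded in Unicode.
ISO_ONLY = {"Aran", "Latf", "Nshu"}
UNICODE_SCRIPTS = {"Latn", "Nshu", "Soyo", "Zanb"}

try:
    check_scripts(ISO_ONLY, UNICODE_SCRIPTS)
except ValueError as e:
    print("ValueError:", e)
```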
1234
1235 * new constants for new property values
1236 - preparseucd.py error:
1237     ValueError: missing uchar.h enum constants for some property values:
1238     [(u'blk', set([u'Zanabazar_Square', u'Nushu', u'CJK_Ext_F',
1239                    u'Kana_Ext_A', u'Syriac_Sup', u'Masaram_Gondi', u'Soyombo'])),
1240      (u'jg', set([u'Malayalam_Bha', u'Malayalam_Llla', u'Malayalam_Nya', u'Malayalam_Lla',
1241                   u'Malayalam_Nga', u'Malayalam_Ssa', u'Malayalam_Tta', u'Malayalam_Ra',
1242                   u'Malayalam_Nna', u'Malayalam_Ja', u'Malayalam_Nnna'])),
1243      (u'sc', set([u'Soyo', u'Gonm', u'Zanb']))]
1244   = PropertyValueAliases.txt new property values (diff old & new .txt files)
1245     blk; CJK_Ext_F                        ; CJK_Unified_Ideographs_Extension_F
1246     blk; Kana_Ext_A                       ; Kana_Extended_A
1247     blk; Masaram_Gondi                    ; Masaram_Gondi
1248     blk; Nushu                            ; Nushu
1249     blk; Soyombo                          ; Soyombo
1250     blk; Syriac_Sup                       ; Syriac_Supplement
1251     blk; Zanabazar_Square                 ; Zanabazar_Square
1252   -> add to uchar.h
1253     use long property names for enum constants,
1254     for the trailing comment get the block start code point: diff old & new Blocks.txt
1255   -> add to UCharacter.UnicodeBlock IDs
1256     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
1257             replace  public static final int \1_ID = \2; \3
1258   -> add to UCharacter.UnicodeBlock objects
1259     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
1260             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1261
1262     jg ; Malayalam_Bha                    ; Malayalam_Bha
1263     jg ; Malayalam_Ja                     ; Malayalam_Ja
1264     jg ; Malayalam_Lla                    ; Malayalam_Lla
1265     jg ; Malayalam_Llla                   ; Malayalam_Llla
1266     jg ; Malayalam_Nga                    ; Malayalam_Nga
1267     jg ; Malayalam_Nna                    ; Malayalam_Nna
1268     jg ; Malayalam_Nnna                   ; Malayalam_Nnna
1269     jg ; Malayalam_Nya                    ; Malayalam_Nya
1270     jg ; Malayalam_Ra                     ; Malayalam_Ra
1271     jg ; Malayalam_Ssa                    ; Malayalam_Ssa
1272     jg ; Malayalam_Tta                    ; Malayalam_Tta
1273   -> uchar.h & UCharacter.JoiningGroup
1274
1275     sc ; Gonm                             ; Masaram_Gondi
1276     sc ; Nshu                             ; Nushu
1277     sc ; Soyo                             ; Soyombo
1278     sc ; Zanb                             ; Zanabazar_Square
1279   -> uscript.h & com.ibm.icu.lang.UScript
1280   -> Nushu had been added already
1281   -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
1282       and in com.ibm.icu.dev.test.lang.TestUScript.java
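
  Sketch (not part of the original procedure): the two Eclipse find/replace
  steps for the UBLOCK_ constants above can also be applied with Python's re
  module. The function names and the sample uchar.h line are illustrative
  assumptions; take the real values and start code points from uchar.h and
  Blocks.txt.

```python
import re

# Same patterns as the Eclipse find expressions in the log.
ID_PATTERN = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
OBJ_PATTERN = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")


def to_java_id(line):
    """Turn a uchar.h UBLOCK_ enum line into a UnicodeBlock *_ID constant."""
    return ID_PATTERN.sub(r"public static final int \1_ID = \2; \3", line)


def to_java_object(line):
    """Turn a uchar.h UBLOCK_ enum line into a UnicodeBlock object constant."""
    return OBJ_PATTERN.sub(
        r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
        line)


# Illustrative input line; the value and trailing comment are assumptions.
sample = "    UBLOCK_NUSHU = 277, /*[1B170]*/"
print(to_java_id(sample))
print(to_java_object(sample))
```

  Lines that do not match the pattern pass through unchanged, so the functions
  can be mapped over a pasted block of enum lines.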
1283
1284 * New properties as shown in PropertyValueAliases.txt changes
1285 - boolean Emoji_Component from emoji 5
1286   -> uchar.h & UProperty.java
1287 - boolean
1288     # Regional_Indicator (RI)
1289
1290     RI ; N                                ; No                               ; F                                ; False
1291     RI ; Y                                ; Yes                              ; T                                ; True
1292   -> uchar.h & UProperty.java
1293   -> single immutable range, to be hardcoded
1294 - boolean
1295     # Prepended_Concatenation_Mark (PCM)
1296
1297     PCM; N                                ; No                               ; F                                ; False
1298     PCM; Y                                ; Yes                              ; T                                ; True
1299   -> was new in Unicode 9
1300   -> uchar.h & UProperty.java
1301 - enumerated
1302     # Vertical_Orientation (vo)
1303
1304     vo ; R                                ; Rotated
1305     vo ; Tr                               ; Transformed_Rotated
1306     vo ; Tu                               ; Transformed_Upright
1307     vo ; U                                ; Upright
1308   -> only pre-parsed for now, but not yet stored for runtime use
1309
1310 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
1311     (not strictly necessary for NOT_ENCODED scripts)
1312   $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
1313
1314 * generate normalization data files
1315   cd $ICU_ROOT/dbg/icu4c
1316   bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
1317   bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm     -s $ICU4C_UNIDATA/norm2 nfc.txt
1318   bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm    -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
1319   bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
1320   bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm   -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
1321
1322 * build ICU (make install)
1323   so that the tools build can pick up the new definitions from the installed header files.
1324
1325   $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
1326
1327 * build Unicode tools using CMake+make
1328
1329 $ICU_SRC/tools/unicode/c/icudefs.txt:
1330
1331 # Location (--prefix) of where ICU was installed.
1332 set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
1333 # Location of the ICU4C source tree.
1334 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c)
1335
1336   $ICU_ROOT/dbg/tools/unicode/c$
1337     cmake ../../../../src/tools/unicode/c
1338     make
1339
1340 * generate core properties data files
1341   $ICU_ROOT/dbg/tools/unicode/c$
1342     genprops/genprops $ICU_SRC/icu4c
1343     genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
1344     genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
1345 - rebuild ICU (make install) & tools
1346
1347 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1348   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1349 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1350 - Unicode 6.0..10.0: U+2260, U+226E, U+226F
1351 - nothing new in this Unicode version, no test file to update
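
  Sketch (an alternative to the grep above, not the original method): a small
  Python filter that lists non-ASCII code points whose status is
  disallowed_STD3_valid. It assumes the usual "code point(s) ; status ; ..."
  field layout of IdnaMappingTable.txt; the function name and the sample lines
  are made up for illustration.

```python
def non_ascii_std3_valid(lines):
    """Return code points above U+007F with status disallowed_STD3_valid."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        found.extend(cp for cp in range(first, last + 1) if cp > 0x7F)
    return found


# Illustrative lines in the assumed file format.
SAMPLE = [
    "0000..002C    ; disallowed_STD3_valid                  # <control-0000>..COMMA",
    "0041          ; mapped                 ; 0061          # LATIN CAPITAL LETTER A",
    "2260          ; disallowed_STD3_valid                  # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid                  # NOT LESS-THAN..NOT GREATER-THAN",
]
print(["U+%04X" % cp for cp in non_ascii_std3_valid(SAMPLE)])
```

  Comparing the result for the old and new files shows whether the tests need
  new characters.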
1352
1353 * run & fix ICU4C tests
1354 - Andy handles RBBI & spoof check test failures
1355
1356 * collation: CLDR collation root, UCA DUCET
1357
1358 - UCA DUCET goes into Mark's Unicode tools, see
1359   https://sites.google.com/site/unicodetools/home#TOC-UCA
1360 - CLDR root data files are checked into $CLDR_SRC/common/uca/
1361     cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
1362
1363 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1364     cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
1365 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1366     cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
1367     (note removing the underscore before "Rules")
1368     cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
1369 - restore TODO diffs in UCARules.txt
1370     meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
1371 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
1372   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1373   from the CLDR root files (..._CLDR_..._SHORT.txt)
1374     cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1375     cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1376     cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
1377 - if CLDR common/uca/unihan-index.txt changes, then update
1378   CLDR common/collation/root.xml <collation type="private-unihan">
1379   and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
1380
1381 - run genuca, see command line above;
1382   deal with
1383     Error: Unknown script for first-primary sample character U+11D10 on line 28117 of /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/unidata/FractionalUCA.txt:
1384     FDD1 11D10;     [70 D5 02, 05, 05]      # Masaram_Gondi first primary (compressible)
1385         (add the character to genuca.cpp sampleCharsToScripts[])
1386   + look up the USCRIPT_ code for the new sample characters
1387     (should be obvious from the comment in the error output)
1388   + *add* mappings to sampleCharsToScripts[], do not replace them
1389     (in case the script sample characters flip-flop)
1390   + insert new scripts in DUCET script order, see the top_byte table
1391     at the beginning of FractionalUCA.txt
1392 - rebuild ICU4C
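
  Sketch (a helper idea, not part of genuca): pulling the sample character and
  the script name out of the FractionalUCA.txt line quoted in the genuca error
  above, as a starting point for the new sampleCharsToScripts[] entry. The
  regular expression and function name are assumptions based only on the two
  error examples in this log.

```python
import re

# Matches lines like:
#   FDD1 11D10;     [70 D5 02, 05, 05]      # Masaram_Gondi first primary (compressible)
FIRST_PRIMARY = re.compile(
    r"^FDD1 ([0-9A-F]{4,6});\s*\[[^\]]*\]\s*#\s*(\S+) first primary")


def sample_char_and_script(line):
    """Return (code point, script name) or None if the line does not match."""
    m = FIRST_PRIMARY.match(line)
    if m is None:
        return None
    return int(m.group(1), 16), m.group(2)


line = "FDD1 11D10;     [70 D5 02, 05, 05]      # Masaram_Gondi first primary (compressible)"
cp, script = sample_char_and_script(line)
print("U+%04X -> %s" % (cp, script))
```

  The script name still has to be mapped to its USCRIPT_ constant by hand, and
  the new entry added (not substituted) in DUCET script order as described above.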
1393
1394 * Unihan collators
1395     https://sites.google.com/site/unicodetools/unihan
1396 - run Unicode Tools
1397     org.unicode.draft.GenerateUnihanCollators
1398   with VM arguments
1399     -ea
1400     -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
1401     -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
1402     -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
1403     -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
1404     -DUVERSION=10.0.0
1405 - run Unicode Tools
1406     org.unicode.draft.GenerateUnihanCollatorFiles
1407   with the same arguments
1408 - check CLDR diffs
1409     cd $CLDR_SRC
1410     meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
1411     meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
1412 - copy to CLDR
1413     cd $CLDR_SRC
1414     cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
1415     cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
1416 - run CLDR unit tests, commit to CLDR
1417 - generate ICU zh collation data: run CLDR
1418     org.unicode.cldr.icu.NewLdml2IcuConverter
1419   with program arguments
1420     -t collation
1421     -s /usr/local/google/home/mscherer/svn.cldr/uni10/common/collation
1422     -m /usr/local/google/home/mscherer/svn.cldr/uni10/common/supplemental
1423     -d /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/coll
1424     -p /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/xml/collation
1425     zh
1426   and VM arguments
1427     -ea
1428     -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
1429 - rebuild ICU4C
1430
1431 * run & fix ICU4C tests, now with new CLDR collation root data
1432 - run all tests with the collation test data *_SHORT.txt or the full files
1433   (the full ones have comments, useful for debugging)
1434 - note on intltest: if collate/UCAConformanceTest fails, then
1435   utility/MultithreadTest/TestCollators will fail as well;
1436   fix the conformance test before looking into the multi-thread test
1437
1438 * update Java data files
1439 - refresh just the UCD/UCA-related/derived files, just to be safe
1440 - see (ICU4C)/source/data/icu4j-readme.txt
1441 - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1442 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1443   output:
1444     ...
1445     Unicode .icu files built to ./out/build/icudt60l
1446     echo timestamp > uni-core-data
1447     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt60b
1448     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b
1449     echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
1450     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt60l.dat ./out/icu4j/icudt60b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt60l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt60b
1451     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b"
1452     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt60b/
1453     mkdir -p /tmp/icu4j/main/shared/data
1454     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1455     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt60b/
1456     mkdir -p /tmp/icu4j/main/shared/data
1457     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1458     make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/uni10/dbg/icu4c/data'
1459 - copy the big-endian Unicode data files to another location,
1460   separate from the other data files,
1461   and then refresh ICU4J
1462     cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
1463     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1464     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1465     cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1466     cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1467     rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1468     cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1469     cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1470     cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1471     jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1472
1473 * When refreshing all of ICU4J data from ICU4C
1474 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1475 - cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
1476 or
1477 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
1478
1479 * update CollationFCD.java
1480   + copy & paste the initializers of lcccIndex[] etc. from
1481     ICU4C/source/i18n/collationfcd.cpp to
1482     ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1483
1484 * refresh Java test .txt files
1485 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1486     cd $ICU_SRC/icu4c/source/data/unidata
1487     cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1488     cd ../../test/testdata
1489     cp BidiCharacterTest.txt BidiTest.txt IdnaTest.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1490     cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1491
1492 * run & fix ICU4J tests
1493
1494 *** API additions
1495 - send notice to icu-design about new born-@stable API (enum constants etc.)
1496
1497 *** CLDR numbering systems
1498 - look for new sets of decimal digits (gc=ND & nv=4) and submit a CLDR ticket
1499   Unicode 10: http://unicode.org/cldr/trac/ticket/10219
1500   Unicode 9: http://unicode.org/cldr/trac/ticket/9692
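
  Sketch (one possible way to do the lookup above, not the original method):
  every set of decimal digits contains exactly one character with gc=Nd and
  nv=4, so listing those characters lists the digit sets. This uses Python's
  unicodedata module, which reflects the Unicode version built into the Python
  runtime, not the new UCD files; for beta data the UCD files would have to be
  parsed instead. The function name is an assumption.

```python
import sys
import unicodedata


def digit_four_code_points():
    """Return all code points with General_Category=Nd and numeric value 4."""
    result = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Nd" and unicodedata.numeric(ch, None) == 4:
            result.append(cp)
    return result


fours = digit_four_code_points()
print("Unicode", unicodedata.unidata_version, "has", len(fours), "sets of decimal digits")
for cp in fours:
    print("U+%04X %s" % (cp, unicodedata.name(chr(cp), "<unnamed>")))
```

  Diffing this output between two Python versions with different Unicode data
  gives the new digit sets to report to CLDR.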
1501
1502 *** merge the Unicode update branches back onto the trunk
1503 - do not merge the icudata.jar and testdata.jar,
1504   instead rebuild them from merged & tested ICU4C
1505 - make sure that changes to Unicode tools are checked in:
1506   http://www.unicode.org/utility/trac/log/trunk/unicodetools
1507
1508 ---------------------------------------------------------------------------- ***
1509
1510 Emoji 5.0 update for ICU 59
1511 - ICU 59 mostly remains on Unicode 9.0
1512 - except that it updates bidi and segmentation data to Unicode 10 beta
1513
1514 First run of tools on combined icu4c/icu4j/tools trunk after svn repository reorg.
1515
1516 * Command-line environment setup
1517
1518 ICU_ROOT=~/svn.icu/trunk
1519 ICU_SRC_DIR=$ICU_ROOT/src
1520 ICU4C_SRC_DIR=$ICU_SRC_DIR/icu4c
1521 ICUDT=icudt59b
1522 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
1523 SRC_DATA_IN=$ICU4C_SRC_DIR/source/data/in
1524 UNIDATA=$ICU4C_SRC_DIR/source/data/unidata
1525
1526 *** ICU Trac
1527
1528 - ticket:12900: take Emoji 5.0 properties data into ICU 59 once it's released
1529 - changes directly on trunk
1530
1531 *** data files & enums & parser code
1532
1533 * download files
1534
1535 - download Unicode 9.0 files into a uni90e50 folder: ucd, idna, security (skip uca)
1536 - download emoji 5.0 beta files into the same uni90e50 folder
1537 - download Unicode 10.0 beta files: ucd
1538   + copy Unicode 10 bidi files to the uni90e50/ucd folder:
1539     BidiBrackets.txt
1540     BidiCharacterTest.txt
1541     BidiMirroring.txt
1542     BidiTest.txt
1543     extracted/DerivedBidiClass.txt
1544   + copy Unicode 10 segmentation files to the uni90e50/ucd folder:
1545     LineBreak.txt
1546     auxiliary/*
1547
1548 * preparseucd.py changes
1549 - adjust for combined trunks
1550 - write new copyright lines
1551 - ignore new Emoji_Component property for now
1552
1553 * process and/or copy files
1554 - ~/svn.icu/trunk/src/tools/unicode$ py/preparseucd.py ~/unidata/uni90e50/20170322 $ICU_SRC_DIR
1555   + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1556
1557 - cp ~/unidata/uni90e50/20170322/security/confusables.txt $UNIDATA
1558
1559 * build ICU (make install)
1560   so that the tools build can pick up the new definitions from the installed header files.
1561
1562   $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
1563
1564 * build Unicode tools using CMake+make
1565
1566 ~/svn.icu/trunk/src/tools/unicode/c/icudefs.txt:
1567
1568 # Location (--prefix) of where ICU was installed.
1569 set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
1570 # Location of the ICU4C source tree.
1571 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/trunk/src/icu4c)
1572
1573   ~/svn.icu/trunk/dbg/tools/unicode/c$
1574     cmake ../../../../src/tools/unicode/c
1575     make
1576
1577 * generate core properties data files
1578   ~/svn.icu/trunk/dbg/tools/unicode/c$
1579     genprops/genprops $ICU4C_SRC_DIR
1580 - rebuild ICU (make install) & tools
1581
1582 * run & fix ICU4C tests
1583 - Andy handles RBBI & spoof check test failures
1584
1585 * update Java data files
1586 - refresh just the UCD/UCA-related/derived files, just to be safe
1587 - see (ICU4C)/source/data/icu4j-readme.txt
1588 - mkdir /tmp/icu4j
1589 - ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1590   output:
1591     ...
1592     Unicode .icu files built to ./out/build/icudt59l
1593     echo timestamp > uni-core-data
1594     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt59b
1595     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b
1596     echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
1597     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt59l.dat ./out/icu4j/icudt59b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt59l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt59b
1598     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b"
1599     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt59b/
1600     mkdir -p /tmp/icu4j/main/shared/data
1601     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1602     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt59b/
1603     mkdir -p /tmp/icu4j/main/shared/data
1604     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1605     make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/trunk/dbg/icu4c/data'
1606 - copy the big-endian Unicode data files to another location,
1607   separate from the other data files,
1608   and then refresh ICU4J
1609     cd ~/svn.icu/trunk/dbg/icu4c/data/out/icu4j
1610     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1611     cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1612     cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1613     rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1614     cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1615     jar uvf ~/svn.icu/trunk/src/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1616
1617 * When refreshing all of ICU4J data from ICU4C
1618 - ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1619 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu/trunk/src/icu4j/main/shared/data
1620 or
1621 - ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=~/svn.icu/trunk/src/icu4j icu4j-data-install
1622
1623 * refresh Java test .txt files
1624 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1625     cd $ICU4C_SRC_DIR/source/data/unidata
1626     cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1627     cd ../../test/testdata
1628     cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1629     cp ~/unidata/uni90e50/20170322/ucd/CompositionExclusions.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
1630
1631 * run & fix ICU4J tests
1632
1633 ---------------------------------------------------------------------------- ***
1634
1635 Unicode 9.0 update for ICU 58
1636
1637 * Command-line environment setup
1638
1639 ICU_ROOT=~/svn.icu/trunk
1640 ICU_SRC_DIR=$ICU_ROOT/src
1641 ICUDT=icudt58b
1642 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
1643 SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
1644 UNIDATA=$ICU_SRC_DIR/source/data/unidata
1645
1646 http://www.unicode.org/review/pri323/  -- beta review
1647 http://www.unicode.org/reports/uax-proposed-updates.html
1648 http://www.unicode.org/versions/beta-9.0.0.html
1649 http://www.unicode.org/versions/Unicode9.0.0/
1650 http://www.unicode.org/reports/tr44/tr44-17.html
1651
1652 *** ICU Trac
1653
1654 - ticket:12526: integrate Unicode 9
1655 - C++ ^/icu/branches/markus/uni90, ^/icu/branches/markus/uni90b
1656 - Java ^/icu4j/branches/markus/uni90, ^/icu4j/branches/markus/uni90b
1657
1658 *** CLDR Trac
1659
1660 - cldrbug 9414: UCA 9
1661 - ^/branches/markus/uni90 at r11518 from trunk at r11517
1662
1663 - cldrbug 8745: Unicode 9.0 script metadata
1664
1665 *** Unicode version numbers
1666 - makedata.mak
1667 - uchar.h
1668 - com.ibm.icu.util.VersionInfo
1669 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
1670
1671 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
1672   so that the makefiles see the new version number.
1673
1674 *** data files & enums & parser code
1675
1676 * file preparation
1677
1678 - download UCD & IDNA files
1679 - make sure that the Unicode data folder passed into preparseucd.py
1680   includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
1681 - only for manual diffs: remove version suffixes from the file names
1682   ~/unidata/uni70/20140403$ ../../desuffixucd.py .
1683   (see https://sites.google.com/site/unicodetools/inputdata)
1684 - only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
1685 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni90/20160603 $ICU_SRC_DIR ~/svn.icutools/trunk/src
1686   + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1687
1688 - also: from http://unicode.org/Public/security/9.0.0/ download new confusables.txt
1689   and copy to $UNIDATA
1690     cp ~/unidata/uni90/20160603/security/confusables.txt $UNIDATA
1691
1692 * preparseucd.py changes
1693 - remove or add new Unicode scripts from/to the
1694   only-in-ISO-15924 list according to the error messages:
1695     ValueError: remove ['Tang'] from _scripts_only_in_iso15924
1696     ValueError: sc = Hanb (uchar.h USCRIPT_HAN_WITH_BOPOMOFO) not in the UCD
1697     ValueError: sc = Jamo (uchar.h USCRIPT_JAMO) not in the UCD
1698     ValueError: sc = Zsye (uchar.h USCRIPT_SYMBOLS_EMOJI) not in the UCD
1699   -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
1700       and in com.ibm.icu.dev.test.lang.TestUScript.java
1701 - DerivedNumericValues.txt new numeric values
1702     0D58          ; 0.00625 ; ; 1/160 # No       MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH
1703     0D59          ; 0.025 ; ; 1/40 # No       MALAYALAM FRACTION ONE FORTIETH
1704     0D5A          ; 0.0375 ; ; 3/80 # No       MALAYALAM FRACTION THREE EIGHTIETHS
1705     0D5B          ; 0.05 ; ; 1/20 # No       MALAYALAM FRACTION ONE TWENTIETH
1706     0D5D          ; 0.15 ; ; 3/20 # No       MALAYALAM FRACTION THREE TWENTIETHS
1707   -> change uprops.h, corepropsbuilder.cpp/encodeNumericValue(),
1708      uchar.c, UCharacterProperty.java
1709      to support a new series of values
1710 - adjust preparseucd.py for Tangut algorithmic names
1711   in ppucd.txt:
1712     algnamesrange;17000..187EC;han;CJK UNIFIED IDEOGRAPH-
1713   ->
1714     algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH-
1715 - avoid block-compressing most String/Miscellaneous property values,
1716   triggered by genprops not coping with a multi-code point Case_Folding on
1717     block;1C80..1C8F;...;Cased;cf=0442;CWCF;...
1718   keep block-compressing empty-string mappings NFKC_CF="" for tags and variation selectors
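
  Sketch (a sanity check, not the ICU encoding itself): the new Malayalam
  fractions listed under DerivedNumericValues.txt above can be checked for
  internal consistency with Python's fractions module. The function names are
  assumptions, and the exact comparison only works for lines whose decimal
  field is not rounded. All five denominators turn out to be 20 times a power
  of two, which is presumably why a new series of values was needed; the
  actual encoding is in the files named above.

```python
from fractions import Fraction


def parse_numeric_line(line):
    """Return (code point text, value, decimal field agrees with rational field)."""
    data = line.split("#", 1)[0]
    fields = [f.strip() for f in data.split(";")]
    code_point, decimal_text, rational_text = fields[0], fields[1], fields[3]
    value = Fraction(rational_text)
    return code_point, value, Fraction(decimal_text) == value


def is_twenty_times_power_of_two(denominator):
    """True if denominator == 20 * 2**k for some k >= 0."""
    if denominator % 20 != 0:
        return False
    quotient = denominator // 20
    return quotient & (quotient - 1) == 0


LINES = [
    "0D58          ; 0.00625 ; ; 1/160 # No       MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH",
    "0D59          ; 0.025 ; ; 1/40 # No       MALAYALAM FRACTION ONE FORTIETH",
    "0D5A          ; 0.0375 ; ; 3/80 # No       MALAYALAM FRACTION THREE EIGHTIETHS",
    "0D5B          ; 0.05 ; ; 1/20 # No       MALAYALAM FRACTION ONE TWENTIETH",
    "0D5D          ; 0.15 ; ; 3/20 # No       MALAYALAM FRACTION THREE TWENTIETHS",
]
for line in LINES:
    cp, value, consistent = parse_numeric_line(line)
    print(cp, value, consistent, is_twenty_times_power_of_two(value.denominator))
```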
1719
1720 * PropertyAliases.txt changes
1721 - 1 new property PCM=Prepended_Concatenation_Mark
1722   Ignore: Only useful for layout engines.
1723   Ok to list in ppucd.txt.
1724
1725 * PropertyValueAliases.txt new property values
1726     blk; Adlam                            ; Adlam
1727     blk; Bhaiksuki                        ; Bhaiksuki
1728     blk; Cyrillic_Ext_C                   ; Cyrillic_Extended_C
1729     blk; Glagolitic_Sup                   ; Glagolitic_Supplement
1730     blk; Ideographic_Symbols              ; Ideographic_Symbols_And_Punctuation
1731     blk; Marchen                          ; Marchen
1732     blk; Mongolian_Sup                    ; Mongolian_Supplement
1733     blk; Newa                             ; Newa
1734     blk; Osage                            ; Osage
1735     blk; Tangut                           ; Tangut
1736     blk; Tangut_Components                ; Tangut_Components
1737   -> add to uchar.h
1738     use long property names for enum constants
1739   -> add to UCharacter.UnicodeBlock IDs
1740     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
1741             replace  public static final int \1_ID = \2; \3
1742   -> add to UCharacter.UnicodeBlock objects
1743     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
1744             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1745
1746     GCB; EB                               ; E_Base
1747     GCB; EBG                              ; E_Base_GAZ
1748     GCB; EM                               ; E_Modifier
1749     GCB; GAZ                              ; Glue_After_Zwj
1750     GCB; ZWJ                              ; ZWJ
1751   -> uchar.h & UCharacter.GraphemeClusterBreak
1752
1753     jg ; African_Feh                      ; African_Feh
1754     jg ; African_Noon                     ; African_Noon
1755     jg ; African_Qaf                      ; African_Qaf
1756   -> uchar.h & UCharacter.JoiningGroup
1757
1758     lb ; EB                               ; E_Base
1759     lb ; EM                               ; E_Modifier
1760     lb ; ZWJ                              ; ZWJ
1761   -> uchar.h & UCharacter.LineBreak
1762
1763     sc ; Adlm                             ; Adlam
1764     sc ; Bhks                             ; Bhaiksuki
1765     sc ; Marc                             ; Marchen
1766     sc ; Newa                             ; Newa
1767     sc ; Osge                             ; Osage
1768     sc ; Tang                             ; Tangut
1769   -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
1770
1771     WB ; EB                               ; E_Base
1772     WB ; EBG                              ; E_Base_GAZ
1773     WB ; EM                               ; E_Modifier
1774     WB ; GAZ                              ; Glue_After_Zwj
1775     WB ; ZWJ                              ; ZWJ
1776   -> uchar.h & UCharacter.WordBreak
1777
1778 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
1779     (not strictly necessary for NOT_ENCODED scripts)
1780   ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
1781
1782 * generate normalization data files
1783   cd $ICU_ROOT/dbg
1784   bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
1785   bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
1786   bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
1787   bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
1788   bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
1789
1790 * build ICU (make install)
1791   so that the tools build can pick up the new definitions from the installed header files.
1792
1793   $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 30 out.txt
1794
1795 * build Unicode tools using CMake+make
1796
1797 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
1798
1799   # Location (--prefix) of where ICU was installed.
1800   set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
1801   # Location of the ICU source tree.
1802   set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)
1803
1804   ~/svn.icutools/trunk/dbg/unicode/c$
1805     cmake ../../../src/unicode/c
1806     make
1807
1808 * generate core properties data files
1809   ~/svn.icutools/trunk/dbg/unicode/c$
1810     genprops/genprops $ICU_SRC_DIR
1811     genuca/genuca --hanOrder implicit $ICU_SRC_DIR
1812     genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
1813 - rebuild ICU (make install) & tools
1814
1815 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1816   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1817 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1818 - Unicode 6.0..9.0: U+2260, U+226E, U+226F
1819 - nothing new in 9.0, no test file to update
1820
1821 * run & fix ICU4C tests
1822 - Andy handles RBBI & spoof check test failures
1823
1824 * collation: CLDR collation root, UCA DUCET
1825
1826 - UCA DUCET goes into Mark's Unicode tools, see
1827   https://sites.google.com/site/unicodetools/home#TOC-UCA
1828 - CLDR root data files are checked into (CLDR UCA branch)/common/uca/
1829     cp (UCA generated)/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/
1830
1831 - cd (CLDR UCA branch)/common/uca/
1832 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1833     cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
1834 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1835     cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
1836     (note removing the underscore before "Rules")
1837     cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
1838 - restore TODO diffs in UCARules.txt
1839     meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
1840 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
1841   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1842   from the CLDR root files (..._CLDR_..._SHORT.txt)
1843     cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1844     cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1845     cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
1846 - if CLDR common/uca/unihan-index.txt changes, then update
1847   CLDR common/collation/root.xml <collation type="private-unihan">
1848   and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
1849
1850 - run genuca, see command line above;
1851   deal with
1852     Error: Unknown script for first-primary sample character U+104B5 on line 32599 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt:
1853     FDD1 104B5;     [75 B8 02, 05, 05]      # Osage first primary (compressible)
1854         (add the character to genuca.cpp sampleCharsToScripts[])
1855   + look up the USCRIPT_ code for the new sample characters
1856     (should be obvious from the comment in the error output)
1857   + *add* mappings to sampleCharsToScripts[], do not replace them
1858     (in case the script sample characters flip-flop)
1859   + insert new scripts in DUCET script order, see the top_byte table
1860     at the beginning of FractionalUCA.txt
1861 - rebuild ICU4C
1862
1863 * Unihan collators
1864 - run Unicode Tools
1865     org.unicode.draft.GenerateUnihanCollators
1866   with VM arguments
1867     -DSVN_WORKSPACE=/home/mscherer/svn.unitools/trunk
1868     -DOTHER_WORKSPACE=/home/mscherer/svn.unitools
1869     -DUCD_DIR=/home/mscherer/svn.unitools/trunk/data
1870     -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
1871     -DUVERSION=9.0.0
1872     -ea
1873 - run Unicode Tools
1874     org.unicode.draft.GenerateUnihanCollatorFiles
1875   with the same arguments
1876 - check CLDR diffs
1877     cd ~/svn.cldr/trunk
1878     meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
1879     meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
1880 - copy to CLDR
1881     cd ~/svn.cldr/trunk
1882     cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
1883     cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
1884 - commit to CLDR
1885 - generate ICU zh collation data: run CLDR
1886     org.unicode.cldr.icu.NewLdml2IcuConverter
1887   with program arguments
1888     -t collation
1889     -s /home/mscherer/svn.cldr/trunk/common/collation
1890     -m /home/mscherer/svn.cldr/trunk/common/supplemental
1891     -d /home/mscherer/svn.icu/trunk/src/source/data/coll
1892     -p /home/mscherer/svn.icu/trunk/src/source/data/xml/collation
1893     zh
1894   and VM arguments
1895     -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
1896 - rebuild ICU4C
1897
1898 * run & fix ICU4C tests, now with new CLDR collation root data
1899 - run all tests with the collation test data *_SHORT.txt or the full files
1900   (the full ones have comments, useful for debugging)
1901 - note on intltest: if collate/UCAConformanceTest fails, then
1902   utility/MultithreadTest/TestCollators will fail as well;
1903   fix the conformance test before looking into the multi-thread test
1904
1905 * update Java data files
1906 - refresh just the UCD/UCA-related/derived files, just to be safe
1907 - see (ICU4C)/source/data/icu4j-readme.txt
1908 - mkdir /tmp/icu4j
1909 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1910   output:
1911     ...
1912     Unicode .icu files built to ./out/build/icudt58l
1913     echo timestamp > uni-core-data
1914     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt58b
1915     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b
1916     echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
1917     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt58l.dat ./out/icu4j/icudt58b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt58l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt58b
1918     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b"
1919     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt58b/
1920     mkdir -p /tmp/icu4j/main/shared/data
1921     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1922     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt58b/
1923     mkdir -p /tmp/icu4j/main/shared/data
1924     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1925     make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
1926 - copy the big-endian Unicode data files to another location,
1927   separate from the other data files,
1928   and then refresh ICU4J
1929     cd ~/svn.icu/trunk/dbg/data/out/icu4j
1930     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1931     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1932     cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1933     cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1934     rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1935     cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1936     cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1937     cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1938     jar uvf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1939
1940 * When refreshing all of ICU4J data from ICU4C
1941 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1942 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
1943 or
1944 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
1945
1946 * update CollationFCD.java
1947   + copy & paste the initializers of lcccIndex[] etc. from
1948     ICU4C/source/i18n/collationfcd.cpp to
1949     ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1950
1951 * refresh Java test .txt files
1952 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1953     cd $ICU_SRC_DIR/source/data/unidata
1954     cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1955     cd ../../test/testdata
1956     cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1957     cp ~/unidata/uni90/20160603/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1958
1959 * run & fix ICU4J tests
1960
1961 *** LayoutEngine script information
1962
1963 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
1964   This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
1965   in the working directory.
1966
1967   (It also generates ScriptRunData.cpp, which is no longer needed.)
1968
1969   It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
1970   (a plain text file)
1971   which maps ICU versions to the numbers of script/language constants
1972   that were added then.
1973   (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
1974
1975   The generated files have a current copyright date and "@deprecated" statement.
1976
1977 * Review changes, fix Java tool if necessary, and copy to ICU4C
1978   cd ~/svn.icu4j/trunk/src
1979   meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
1980   cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
1981   cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
1982
1983 *** API additions
1984 - send notice to icu-design about new born-@stable API (enum constants etc.)
1985
1986 *** merge the Unicode update branches back onto the trunk
1987 - do not merge the icudata.jar and testdata.jar,
1988   instead rebuild them from merged & tested ICU4C
1989 - make sure that changes to Unicode tools & ICU tools are checked in
1990   http://www.unicode.org/utility/trac/log/trunk/unicodetools
1991   http://bugs.icu-project.org/trac/log/tools/trunk
1992
1993 ---------------------------------------------------------------------------- ***
1994
1995 New script codes early in ICU 58: http://bugs.icu-project.org/trac/ticket/11764
1996
1997 Adding
1998 - new scripts in Unicode 9: Adlm, Bhks, Marc, Newa, Osge
1999 - new combination/alias codes: Hanb, Jamo
2000   - used in CLDR 29 and in spoof checker
2001 - new Z* code: Zsye
2002
2003 Add new codes to uscript.h & UScript.java, see Unicode update logs.
2004   -> com.ibm.icu.lang.UScript
2005     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2006     replace  public static final int \1 = \2; \3
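
The find/replace pair above can also be applied outside an editor. A minimal sketch with Python's re module, using the two patterns verbatim; the enum line is a hypothetical placeholder, not a real uscript.h entry.

```python
import re

# Patterns copied from the find/replace step above.
FIND = r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)"
REPLACE = r"public static final int \1 = \2; \3"


def uscript_to_java(c_line):
    """Turn one uscript.h enum line into a UScript.java constant line."""
    return re.sub(FIND, REPLACE, c_line)


# Hypothetical enum line; the name and the number are placeholders.
c_line = "      USCRIPT_EXAMPLE_SCRIPT                = 999,/* Xmpl */"
print(uscript_to_java(c_line))
```

Leading indentation is kept because only the matched part of the line is replaced.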
2007
2008 Manually edit ppucd.txt and icutools:unicode/c/genprops/pnames_data.h,
2009 add new script codes.
2010 "Long" script names only where established in Unicode 9 PropertyValueAliases.txt.
2011
2012 Note: If we have to run preparseucd.py again before the Unicode 9 update,
2013 then we need to manually keep/restore the new script codes.
2014
2015 ICU_ROOT=~/svn.icu/trunk
2016 ICU_SRC_DIR=$ICU_ROOT/src
2017 ICUDT=icudt57b
2018 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
2019 SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
2020 UNIDATA=$ICU_SRC_DIR/source/data/unidata
2021
2022 Adjust unicode/c/genprops/*builder.cpp for #ifndef/#ifdef changes in _data.h files,
2023 see http://bugs.icu-project.org/trac/ticket/12141
2024
2025 make install, then icutools cmake & make, then
2026 ~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR
2027
2028 Generate Java data as usual, only update pnames.icu & uprops.icu.
2029
2030 *** LayoutEngine script information
2031
2032 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
2033   This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
2034   in the working directory.
2035
2036   (It also generates ScriptRunData.cpp, which is no longer needed.)
2037
2038   It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
2039   (a plain text file)
2040   which maps ICU versions to the numbers of script/language constants
2041   that were added then.
2042   (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
2043
2044   The generated files have a current copyright date and "@deprecated" statement.
2045
2046 * Review changes, fix Java tool if necessary, and copy to ICU4C
2047   cd ~/svn.icu4j/trunk/src
2048   meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
2049   cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
2050   cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
2051
2052 ---------------------------------------------------------------------------- ***
2053
2054 Emoji properties added in ICU 57: http://bugs.icu-project.org/trac/ticket/11802
2055
2056 Edit preparseucd.py to add & parse new properties.
2057 They share the UCD property namespace but are not listed in PropertyAliases.txt.
2058
2059 Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/
2060 Initial data from emoji/2.0/
2061
2062 ICU_ROOT=~/svn.icu/trunk
2063 ICU_SRC_DIR=$ICU_ROOT/src
2064 ICUDT=icudt56b
2065 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
2066 SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
2067 UNIDATA=$ICU_SRC_DIR/source/data/unidata
2068
2069 Add binary-property constants to uchar.h enum UProperty & UProperty.java.
2070
2071 ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src
2072 (Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.)
2073
2074 Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java
2075
2076 make install, then icutools cmake & make, then
2077 ~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR
2078
2079 Generate Java data as usual, only update pnames.icu & uprops.icu.
2080
2081 ---------------------------------------------------------------------------- ***
2082
2083 Unicode 8.0 update for ICU 56
2084
2085 * Command-line environment setup
2086
2087 ICU_ROOT=~/svn.icu/trunk
2088 ICU_SRC_DIR=$ICU_ROOT/src
2089 ICUDT=icudt56b
2090 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
2091 SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
2092 UNIDATA=$ICU_SRC_DIR/source/data/unidata
2093
2094 http://www.unicode.org/review/pri297/  -- beta review
2095 http://www.unicode.org/reports/uax-proposed-updates.html
2096 http://unicode.org/versions/beta-8.0.0.html
2097 http://www.unicode.org/versions/Unicode8.0.0/
2098 http://www.unicode.org/reports/tr44/tr44-15.html
2099
2100 *** ICU Trac
2101
2102 - ticket:11574: Unicode 8
2103 - C++ branches/markus/uni80 at r37351 from trunk at r37343
2104 - Java branches/markus/uni80 at r37352 from trunk at r37338
2105
2106 *** CLDR Trac
2107
2108 - cldrbug 8311: UCA 8
2109 - branches/markus/uni80 at r11518 from trunk at r11517
2110
2111 - cldrbug 8109: Unicode 8.0 script metadata
2112 - cldrbug 8418: Updated segmentation for Unicode 8.0
2113
2114 *** Unicode version numbers
2115 - makedata.mak
2116 - uchar.h
2117 - com.ibm.icu.util.VersionInfo
2118 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
2119
2120 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
2121   so that the makefiles see the new version number.
2122
2123 *** data files & enums & parser code
2124
2125 * file preparation
2126
2127 - download UCD & IDNA files
2128 - make sure that the Unicode data folder passed into preparseucd.py
2129   includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
2130 - only for manual diffs: remove version suffixes from the file names
2131   ~/unidata/uni80/20150415$ ../../desuffixucd.py .
2132   (see https://sites.google.com/site/unicodetools/inputdata)
2133 - only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
2134 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src
2135 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2136
2137 - also: from http://unicode.org/Public/security/8.0.0/ download new
2138   confusables.txt & confusablesWholeScript.txt
2139   and copy to $UNIDATA
2140     ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA
2141     ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA
2142
2143 * initial preparseucd.py changes
2144 - remove new Unicode scripts from the
2145   only-in-ISO-15924 list according to the error message:
2146     ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw']
2147     from _scripts_only_in_iso15924
2148   -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
2149       and in com.ibm.icu.dev.test.lang.TestUScript.java
2150 - property and file name change:
2151     IndicMatraCategory -> IndicPositionalCategory
2152 - UnicodeData.txt unusual numeric values (fractions not in lowest terms)
2153     109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;
2154     109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;
2155     109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;;
2156     109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;;
2157     109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;;
2158     109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;
2159     109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;;
2160     109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;;
2161     109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;;
2162     109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;
2163   -> change preparseucd.py to map them to fractions in lowest terms (e.g., 2/12 -> 1/6)
2164      which are listed in DerivedNumericValues.txt;
2165      keeps storage in data file simple
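
The fraction mapping described above can be sketched as follows. This only illustrates the idea with Python's fractions module; it is not the actual preparseucd.py code, and the function name is hypothetical.

```python
from fractions import Fraction


def reduce_numeric_value(nv):
    """Return a UnicodeData.txt numeric value with any fraction in lowest terms."""
    if "/" not in nv:
        return nv  # plain integers pass through unchanged
    numerator, denominator = nv.split("/")
    f = Fraction(int(numerator), int(denominator))  # Fraction normalizes itself
    if f.denominator == 1:
        return str(f.numerator)
    return "%d/%d" % (f.numerator, f.denominator)


# Some of the Meroitic cursive fraction values listed above.
for nv in ["1/12", "2/12", "3/12", "4/12", "6/12", "10/12"]:
    print(nv, "->", reduce_numeric_value(nv))
```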
2166
2167 * PropertyValueAliases.txt changes
2168 - 10 new Block (blk) values:
2169     blk; Ahom                             ; Ahom
2170     blk; Anatolian_Hieroglyphs            ; Anatolian_Hieroglyphs
2171     blk; Cherokee_Sup                     ; Cherokee_Supplement
2172     blk; CJK_Ext_E                        ; CJK_Unified_Ideographs_Extension_E
2173     blk; Early_Dynastic_Cuneiform         ; Early_Dynastic_Cuneiform
2174     blk; Hatran                           ; Hatran
2175     blk; Multani                          ; Multani
2176     blk; Old_Hungarian                    ; Old_Hungarian
2177     blk; Sup_Symbols_And_Pictographs      ; Supplemental_Symbols_And_Pictographs
2178     blk; Sutton_SignWriting               ; Sutton_SignWriting
2179   -> add to uchar.h
2180     use long property names for enum constants
2181   -> add to UCharacter.UnicodeBlock IDs
2182     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
2183             replace  public static final int \1_ID = \2; \3
2184   -> add to UCharacter.UnicodeBlock objects
2185     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
2186             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
2187 - 6 new Script (sc) values:
2188     sc ; Ahom                             ; Ahom
2189     sc ; Hatr                             ; Hatran
2190     sc ; Hluw                             ; Anatolian_Hieroglyphs
2191     sc ; Hung                             ; Old_Hungarian
2192     sc ; Mult                             ; Multani
2193     sc ; Sgnw                             ; SignWriting
2194   -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
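
The two Eclipse find/replace steps for the new Block values can be sketched with Python's re module, using the patterns verbatim. The enum line is a hypothetical placeholder, not a real uchar.h entry.

```python
import re

# Patterns copied from the Eclipse find/replace steps above.
FIND_ID = r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)"
REPLACE_ID = r"public static final int \1_ID = \2; \3"
FIND_OBJ = r"UBLOCK_([^ ]+) = [0-9]+, (/.+)"
REPLACE_OBJ = r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2'


def ublock_to_java_id(c_line):
    """uchar.h enum line -> UCharacter.UnicodeBlock *_ID constant."""
    return re.sub(FIND_ID, REPLACE_ID, c_line)


def ublock_to_java_object(c_line):
    """uchar.h enum line -> UCharacter.UnicodeBlock object declaration."""
    return re.sub(FIND_OBJ, REPLACE_OBJ, c_line)


# Hypothetical enum line; the name, number and comment are placeholders.
c_line = "    UBLOCK_EXAMPLE_BLOCK = 999, /*[0000]*/"
print(ublock_to_java_id(c_line))
print(ublock_to_java_object(c_line))
```

Both patterns expect exactly one space around "=" and a comment starting with "/" after the comma, so lines formatted differently are left unchanged and need manual editing.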
2195
2196 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
2197     (not strictly necessary for NOT_ENCODED scripts)
2198   ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
2199
2200 * generate normalization data files
2201   cd $ICU_ROOT/dbg
2202   bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
2203   bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
2204   bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
2205   bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
2206   bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
2207
2208 * build ICU (make install)
2209   so that the tools build can pick up the new definitions from the installed header files.
2210
2211   $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
2212
2213 * build Unicode tools using CMake+make
2214
2215 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
2216
2217   # Location (--prefix) of where ICU was installed.
2218   set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
2219   # Location of the ICU source tree.
2220   set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)
2221
2222   ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
2223   ~/svn.icutools/trunk/dbg/unicode/c$ make
2224
2225 * generate core properties data files
2226 - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
2227 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR
2228 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
2229 - rebuild ICU (make install) & tools
2230 - run genuca again (see step above) so that it picks up the new nfc.nrm
2231 - rebuild ICU (make install) & tools
2232
2233 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2234   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2235 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2236 - Unicode 6.0..8.0: U+2260, U+226E, U+226F
2237 - nothing new in 8.0, no test file to update
2238
2239 * run & fix ICU4C tests
2240 - bad Cherokee case folding due to difference in fallbacks:
2241   UCD case folding falls back to no mapping,
2242   ICU runtime case folding falls back to lowercasing;
2243   fixed casepropsbuilder.cpp to generate scf mappings to self
2244   when there is an slc mapping but no scf
2245 - Andy handles RBBI & spoof check test failures
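
The Cherokee case folding fix noted above can be illustrated with a small model. This is a sketch in Python, not the casepropsbuilder.cpp code: the dictionaries stand in for the slc and scf mappings, and both function names are hypothetical.

```python
def runtime_fold(c, scf, slc):
    """Model of the runtime fallback: use scf if present, else lowercase."""
    return scf.get(c, slc.get(c, c))


def add_scf_to_self(slc, scf):
    """Return a copy of scf with explicit self-mappings
    for characters that have an slc mapping but no scf mapping."""
    result = dict(scf)
    for c in slc:
        if c not in result:
            result[c] = c
    return result


# Cherokee in Unicode 8: U+13A0 lowercases to U+AB70, and U+AB70 folds to U+13A0.
slc = {0x13A0: 0xAB70}
scf = {0xAB70: 0x13A0}

# Without the fix, the fallback lowercases U+13A0, so the pair folds differently.
print(hex(runtime_fold(0x13A0, scf, slc)), hex(runtime_fold(0xAB70, scf, slc)))

# With explicit self-mappings, both characters fold to U+13A0.
fixed = add_scf_to_self(slc, scf)
print(hex(runtime_fold(0x13A0, fixed, slc)), hex(runtime_fold(0xAB70, fixed, slc)))
```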
2246
2247 * collation: CLDR collation root, UCA DUCET
2248
2249 - UCA DUCET goes into Mark's Unicode tools, see
2250   https://sites.google.com/site/unicodetools/home#TOC-UCA
2251 - CLDR root data files are checked into (CLDR UCA branch)/common/uca/
2252 - cd (CLDR UCA branch)/common/uca/
2253 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2254     cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
2255 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2256     cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
2257     (note removing the underscore before "Rules")
2258     cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
2259 - restore TODO diffs in UCARules.txt
2260     meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
2261 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
2262   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2263   from the CLDR root files (..._CLDR_..._SHORT.txt)
2264     cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
2265     cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
2266     cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
2267 - if CLDR common/uca/unihan-index.txt changes, then update
2268   CLDR common/collation/root.xml <collation type="private-unihan">
2269   and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
2270 - run genuca, see command line above;
2271   deal with
2272     Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt
2273         (add the character to genuca.cpp sampleCharsToScripts[])
2274   + look up the script for the new sample characters
2275     (e.g., in FractionalUCA.txt)
2276   + *add* mappings to sampleCharsToScripts[], do not replace them
2277     (in case the script sample characters flip-flop)
2278   + insert new scripts in DUCET script order, see the top_byte table
2279     at the beginning of FractionalUCA.txt
2280 - rebuild ICU4C
2281
2282 * run & fix ICU4C tests, now with new CLDR collation root data
2283 - run all tests with the collation test data *_SHORT.txt or the full files
2284   (the full ones have comments, useful for debugging)
2285 - note on intltest: if collate/UCAConformanceTest fails, then
2286   utility/MultithreadTest/TestCollators will fail as well;
2287   fix the conformance test before looking into the multi-thread test
2288 - fixed bug in CollationWeights::getWeightRanges()
2289   exposed by new data and CollationTest::TestRootElements
2290
2291 * update Java data files
2292 - refresh just the UCD/UCA-related/derived files, just to be safe
2293 - see (ICU4C)/source/data/icu4j-readme.txt
2294 - mkdir /tmp/icu4j
2295 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2296   output:
2297     ...
2298     Unicode .icu files built to ./out/build/icudt56l
2299     echo timestamp > uni-core-data
2300     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b
2301     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b
2302     echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
2303     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b
2304     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b"
2305     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/
2306     mkdir -p /tmp/icu4j/main/shared/data
2307     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2308     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/
2309     mkdir -p /tmp/icu4j/main/shared/data
2310     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2311     make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
2312 - copy the big-endian Unicode data files to another location,
2313   separate from the other data files,
2314   and then refresh ICU4J
2315     cd ~/svn.icu/trunk/dbg/data/out/icu4j
2316     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2317     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2318     cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2319     cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2320     rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
2321     cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2322     cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2323     cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2324     jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
2325
2326 * When refreshing all of ICU4J data from ICU4C
2327 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2328 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
2329 or
2330 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
2331
2332 * update CollationFCD.java
2333   + copy & paste the initializers of lcccIndex[] etc. from
2334     ICU4C/source/i18n/collationfcd.cpp to
2335     ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
2336
2337 * refresh Java test .txt files
2338 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2339     cd $ICU_SRC_DIR/source/data/unidata
2340     cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
2341     cd ../../test/testdata
2342     cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
2343     cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
2344
2345 * run & fix ICU4J tests
2346
2347 *** LayoutEngine script information
2348
2349 * ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more,
2350   because the layout engine was deprecated in ICU 54.
2351   Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java
2352   to write lines that we used to add manually.
2353
2354 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
2355   This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
2356   in the working directory.
2357
2358   (It also generates ScriptRunData.cpp, which is no longer needed.)
2359
2360   It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
2361   (a plain text file)
2362   which maps ICU versions to the numbers of script/language constants
2363   that were added then.
2364   (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
2365
2366   The generated files have a current copyright date and "@deprecated" statement.
2367
2368 * Review changes, fix Java tool if necessary, and copy to ICU4C
2369   cd ~/svn.icu4j/trunk/src
2370   meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
2371   cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
2372   cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
2373
2374 *** API additions
2375 - send notice to icu-design about new born-@stable API (enum constants etc.)
2376
2377 *** merge the Unicode update branches back onto the trunk
2378 - do not merge the icudata.jar and testdata.jar,
2379   instead rebuild them from merged & tested ICU4C
2380 - make sure that changes to Unicode tools & ICU tools are checked in
2381   http://www.unicode.org/utility/trac/log/trunk/unicodetools
2382   http://bugs.icu-project.org/trac/log/tools/trunk
2383
2384 ---------------------------------------------------------------------------- ***
2385
2386 Unicode 7.0 update for ICU 54
2387
2388 http://www.unicode.org/review/pri271/  -- beta review
2389 http://www.unicode.org/reports/uax-proposed-updates.html
2390 http://www.unicode.org/versions/beta-7.0.0.html#notable_issues
2391 http://www.unicode.org/reports/tr44/tr44-13.html
2392
2393 *** ICU Trac
2394
2395 - ticket 10821: Unicode 7.0, UCA 7.0
2396 - C++ branches/markus/uni70 at r35584 from trunk at r35580
2397 - Java branches/markus/uni70 at r35587 from trunk at r35545
2398
2399 *** CLDR Trac
2400
2401 - ticket 7195: UCA 7.0 CLDR root collation
2402 - branches/markus/uni70 at r10062 from trunk at r10061
2403
2404 - ticket 6762: script metadata for Unicode 7.0 new scripts
2405
2406 *** Unicode version numbers
2407 - makedata.mak
2408 - uchar.h
2409 - com.ibm.icu.util.VersionInfo
2410 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
2411
2412 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
2413   so that the makefiles see the new version number.
2414
2415 *** data files & enums & parser code
2416
2417 * file preparation
2418
2419 - download UCD & IDNA files
2420 - make sure that the Unicode data folder passed into preparseucd.py
2421   includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
2422 - only for manual diffs: remove version suffixes from the file names
2423   ~/unidata/uni70/20140403$ ../../desuffixucd.py .
2424   (see https://sites.google.com/site/unicodetools/inputdata)
2425 - only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
2426 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src
2427 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2428 - Restore TODO diffs in source/data/unidata/UCARules.txt
2429     cd $ICU_SRC_DIR
2430     meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt
2431 - Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt
2432
2433 - also: from http://unicode.org/Public/security/7.0.0/ download new
2434   confusables.txt & confusablesWholeScript.txt
2435   and copy to $ICU_ROOT/src/source/data/unidata/
2436
2437 * initial preparseucd.py changes
2438 - remove new Unicode scripts from the
2439   only-in-ISO-15924 list according to the error message:
2440     ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass',
2441                         'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm',
2442                         'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj']
2443     from _scripts_only_in_iso15924
2444   -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
2445       and in com.ibm.icu.dev.test.lang.TestUScript.java
2446 - NamesList.txt now has a heading with a non-ASCII character
2447   + keep ppucd.txt in platform charset, rather than changing tool/test parsers
2448   + escape non-ASCII characters in heading comments
2449 - preparseucd.py gets the Unicode copyright line from PropertyAliases.txt, which is currently still at 2013
2450   + get the copyright from the first file whose copyright line contains the current year
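
The ValueError quoted above suggests a consistency check between the ISO-only
list and the scripts that Unicode now encodes. A minimal sketch of such a check
follows; the function name and the sample data are hypothetical, and the real
preparseucd.py may be structured differently.

```python
def check_scripts_only_in_iso15924(only_in_iso, unicode_scripts):
    """Raise ValueError if a script listed as ISO-only is now a Unicode script.

    Hypothetical stand-in for the check in preparseucd.py.
    """
    stale = [code for code in only_in_iso if code in unicode_scripts]
    if stale:
        raise ValueError("remove %s from _scripts_only_in_iso15924" % stale)

# Sample data for illustration only.
only_in_iso = ["Ahom", "Hatr", "Mult", "Khoj", "Tirh"]
unicode_scripts = {"Latn", "Khoj", "Tirh"}

try:
    check_scripts_only_in_iso15924(only_in_iso, unicode_scripts)
    message = None
except ValueError as e:
    message = str(e)
```

Failing loudly here forces the list to be cleaned up during the update rather than silently carrying stale entries.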
2451
2452 * PropertyValueAliases.txt changes
2453 - 32 new Block (blk) values:
2454     blk; Bassa_Vah                        ; Bassa_Vah
2455     blk; Caucasian_Albanian               ; Caucasian_Albanian
2456     blk; Coptic_Epact_Numbers             ; Coptic_Epact_Numbers
2457     blk; Diacriticals_Ext                 ; Combining_Diacritical_Marks_Extended
2458     blk; Duployan                         ; Duployan
2459     blk; Elbasan                          ; Elbasan
2460     blk; Geometric_Shapes_Ext             ; Geometric_Shapes_Extended
2461     blk; Grantha                          ; Grantha
2462     blk; Khojki                           ; Khojki
2463     blk; Khudawadi                        ; Khudawadi
2464     blk; Latin_Ext_E                      ; Latin_Extended_E
2465     blk; Linear_A                         ; Linear_A
2466     blk; Mahajani                         ; Mahajani
2467     blk; Manichaean                       ; Manichaean
2468     blk; Mende_Kikakui                    ; Mende_Kikakui
2469     blk; Modi                             ; Modi
2470     blk; Mro                              ; Mro
2471     blk; Myanmar_Ext_B                    ; Myanmar_Extended_B
2472     blk; Nabataean                        ; Nabataean
2473     blk; Old_North_Arabian                ; Old_North_Arabian
2474     blk; Old_Permic                       ; Old_Permic
2475     blk; Ornamental_Dingbats              ; Ornamental_Dingbats
2476     blk; Pahawh_Hmong                     ; Pahawh_Hmong
2477     blk; Palmyrene                        ; Palmyrene
2478     blk; Pau_Cin_Hau                      ; Pau_Cin_Hau
2479     blk; Psalter_Pahlavi                  ; Psalter_Pahlavi
2480     blk; Shorthand_Format_Controls        ; Shorthand_Format_Controls
2481     blk; Siddham                          ; Siddham
2482     blk; Sinhala_Archaic_Numbers          ; Sinhala_Archaic_Numbers
2483     blk; Sup_Arrows_C                     ; Supplemental_Arrows_C
2484     blk; Tirhuta                          ; Tirhuta
2485     blk; Warang_Citi                      ; Warang_Citi
2486   -> add to uchar.h
2487     use long property names for enum constants
2488   -> add to UCharacter.UnicodeBlock IDs
2489     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
2490             replace  public static final int \1_ID = \2; \3
2491   -> add to UCharacter.UnicodeBlock objects
2492     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
2493             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
2494 - 28 new Joining_Group (jg) values:
2495     jg ; Manichaean_Aleph                 ; Manichaean_Aleph
2496     jg ; Manichaean_Ayin                  ; Manichaean_Ayin
2497     jg ; Manichaean_Beth                  ; Manichaean_Beth
2498     jg ; Manichaean_Daleth                ; Manichaean_Daleth
2499     jg ; Manichaean_Dhamedh               ; Manichaean_Dhamedh
2500     jg ; Manichaean_Five                  ; Manichaean_Five
2501     jg ; Manichaean_Gimel                 ; Manichaean_Gimel
2502     jg ; Manichaean_Heth                  ; Manichaean_Heth
2503     jg ; Manichaean_Hundred               ; Manichaean_Hundred
2504     jg ; Manichaean_Kaph                  ; Manichaean_Kaph
2505     jg ; Manichaean_Lamedh                ; Manichaean_Lamedh
2506     jg ; Manichaean_Mem                   ; Manichaean_Mem
2507     jg ; Manichaean_Nun                   ; Manichaean_Nun
2508     jg ; Manichaean_One                   ; Manichaean_One
2509     jg ; Manichaean_Pe                    ; Manichaean_Pe
2510     jg ; Manichaean_Qoph                  ; Manichaean_Qoph
2511     jg ; Manichaean_Resh                  ; Manichaean_Resh
2512     jg ; Manichaean_Sadhe                 ; Manichaean_Sadhe
2513     jg ; Manichaean_Samekh                ; Manichaean_Samekh
2514     jg ; Manichaean_Taw                   ; Manichaean_Taw
2515     jg ; Manichaean_Ten                   ; Manichaean_Ten
2516     jg ; Manichaean_Teth                  ; Manichaean_Teth
2517     jg ; Manichaean_Thamedh               ; Manichaean_Thamedh
2518     jg ; Manichaean_Twenty                ; Manichaean_Twenty
2519     jg ; Manichaean_Waw                   ; Manichaean_Waw
2520     jg ; Manichaean_Yodh                  ; Manichaean_Yodh
2521     jg ; Manichaean_Zayin                 ; Manichaean_Zayin
2522     jg ; Straight_Waw                     ; Straight_Waw
2523   -> uchar.h & UCharacter.JoiningGroup
2524 - 23 new Script (sc) values:
2525     sc ; Aghb                             ; Caucasian_Albanian
2526     sc ; Bass                             ; Bassa_Vah
2527     sc ; Dupl                             ; Duployan
2528     sc ; Elba                             ; Elbasan
2529     sc ; Gran                             ; Grantha
2530     sc ; Hmng                             ; Pahawh_Hmong
2531     sc ; Khoj                             ; Khojki
2532     sc ; Lina                             ; Linear_A
2533     sc ; Mahj                             ; Mahajani
2534     sc ; Mani                             ; Manichaean
2535     sc ; Mend                             ; Mende_Kikakui
2536     sc ; Modi                             ; Modi
2537     sc ; Mroo                             ; Mro
2538     sc ; Narb                             ; Old_North_Arabian
2539     sc ; Nbat                             ; Nabataean
2540     sc ; Palm                             ; Palmyrene
2541     sc ; Pauc                             ; Pau_Cin_Hau
2542     sc ; Perm                             ; Old_Permic
2543     sc ; Phlp                             ; Psalter_Pahlavi
2544     sc ; Sidd                             ; Siddham
2545     sc ; Sind                             ; Khudawadi
2546     sc ; Tirh                             ; Tirhuta
2547     sc ; Wara                             ; Warang_Citi
2548   -> uscript.h (many were added before)
2549     comment "Mende Kikakui" for USCRIPT_MENDE
2550     add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias
2551   -> com.ibm.icu.lang.UScript
2552     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2553     replace  public static final int \1 = \2; \3
2554 - 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
2555   (added 2012-11-01)
2556     Ahom        338     Ahom
2557     Hatr        127     Hatran
2558     Mult        323     Multani
2559   (added 2013-10-12)
2560     Modi        324     Modi
2561     Pauc        263     Pau Cin Hau
2562     Sidd        302     Siddham
2563   -> uscript.h (some overlap with additions from Unicode)
2564   -> com.ibm.icu.lang.UScript
2565     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2566     replace  public static final int \1 = \2; \3
2567   -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924
2568   -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
2569       and in com.ibm.icu.dev.test.lang.TestUScript.java
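
The Eclipse find/replace patterns listed above for UBLOCK_ and USCRIPT_
constants can also be applied outside the IDE. The sketch below reuses the same
regular expressions with Python's re module. The helper names and the sample
enum lines are made up for illustration; the block-object helper reuses the
three-group pattern, so the comment is group 3 rather than group 2.

```python
import re

# Same patterns as in the Eclipse steps above.
UBLOCK_ENUM = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
USCRIPT_ENUM = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")

def block_id_line(c_line):
    """C enum line -> Java UnicodeBlock ID constant."""
    return UBLOCK_ENUM.sub(r"public static final int \g<1>_ID = \g<2>; \g<3>", c_line)

def block_object_line(c_line):
    """C enum line -> Java UnicodeBlock object declaration."""
    return UBLOCK_ENUM.sub(
        r'public static final UnicodeBlock \g<1> = new UnicodeBlock("\g<1>", \g<1>_ID); \g<3>',
        c_line)

def script_line(c_line):
    """C enum line -> Java UScript constant."""
    return USCRIPT_ENUM.sub(r"public static final int \g<1> = \g<2>; \g<3>", c_line)

# Hypothetical input lines, not copied from uchar.h or uscript.h.
id_line = block_id_line("UBLOCK_EXAMPLE_BLOCK = 999, /* sample comment */")
obj_line = block_object_line("UBLOCK_EXAMPLE_BLOCK = 999, /* sample comment */")
sc_line = script_line("USCRIPT_EXAMPLE = 999,/* Exmp */")
```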
2570
2571 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
2572     (not strictly necessary for NOT_ENCODED scripts)
2573   ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
2574
2575 * generate normalization data files
2576 - cd $ICU_ROOT/dbg
2577 - export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
2578 - SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
2579 - UNIDATA=$ICU_SRC_DIR/source/data/unidata
2580 - bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
2581 - bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
2582 - bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
2583 - bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
2584 - bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
2585
2586 * build ICU (make install)
2587   so that the tools build can pick up the new definitions from the installed header files.
2588
2589 ~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
2590
2591 * build Unicode tools using CMake+make
2592
2593 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
2594
2595 # Location (--prefix) of where ICU was installed.
2596 set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst)
2597 # Location of the ICU source tree.
2598 set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src)
2599
2600 ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
2601 ~/svn.icutools/trunk/dbg/unicode/c$ make
2602
2603 * genprops work
2604 - new code point range for Joining_Group values: 10AC0..10AFF Manichaean
2605   + add second array of Joining_Group values for at most 10800..10FFF
2606     icutools: unicode/c/genprops/bidipropsbuilder.cpp
2607     icu: source/common/ubidi_props.h/.c/_data.h
2608     icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java
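
The idea of a second Joining_Group array can be sketched as a lookup that tries
two separate code point ranges in turn. The class below is an illustration
only: its name, fields, and sample values are assumptions and do not mirror the
actual ubidi_props or UBiDiProps code.

```python
class JoiningGroups:
    """Two dense arrays, each covering one code point range (hypothetical layout)."""

    def __init__(self, start, values, start2, values2):
        self.start, self.values = start, values
        self.start2, self.values2 = start2, values2

    def get(self, cp):
        # First array, for example the Arabic-related range.
        i = cp - self.start
        if 0 <= i < len(self.values):
            return self.values[i]
        # Second array, for example 10AC0..10AFF Manichaean.
        i = cp - self.start2
        if 0 <= i < len(self.values2):
            return self.values2[i]
        return 0  # stands for No_Joining_Group in this sketch

# Placeholder values; real Joining_Group constants differ.
jg = JoiningGroups(0x0620, [1, 2, 3], 0x10AC0, [7, 8])
```

Two short arrays avoid one long, mostly empty array spanning the gap between the two ranges.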
2609
2610 * generate core properties data files
2611 - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
2612 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR
2613 - rebuild ICU (make install) & tools
2614 - run genuca again (see step above) so that it picks up the new nfc.nrm
2615 - rebuild ICU (make install) & tools
2616
2617 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2618   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2619 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2620 - Unicode 6.0..7.0: U+2260, U+226E, U+226F
2621 - nothing new in 7.0, no test file to update
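
The grep step above can also be done with a short script that lists only the
non-ASCII code points. This sketch assumes data lines of the form
"code point or range ; status ; ... # comment"; the sample lines are made up
to match that assumed layout and are not copied from IdnaMappingTable.txt.

```python
def non_ascii_std3_valid(lines):
    """Return non-ASCII code points whose status is disallowed_STD3_valid."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        # Skip the ASCII part of any range.
        found.extend(range(max(start, 0x80), end + 1))
    return found

# Illustrative sample in the assumed format.
sample = [
    "# comment line",
    "0000..002C    ; disallowed_STD3_valid   # sample ASCII range",
    "2260          ; disallowed_STD3_valid   # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid   # NOT LESS-THAN..NOT GREATER-THAN",
    "00C0          ; mapped ; 00E0           # sample mapped line",
]
code_points = non_ascii_std3_valid(sample)
```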
2622
2623 * run & fix ICU4C tests
2624
2625 * update Java data files
2626 - refresh just the UCD-related files, just to be safe
2627 - see (ICU4C)/source/data/icu4j-readme.txt
2628 - mkdir /tmp/icu4j
2629 - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2630   output:
2631     ...
2632     Unicode .icu files built to ./out/build/icudt53l
2633     echo timestamp > uni-core-data
2634     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b
2635     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b
2636     echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
2637     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b
2638     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b"
2639     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/
2640     mkdir -p /tmp/icu4j/main/shared/data
2641     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2642     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/
2643     mkdir -p /tmp/icu4j/main/shared/data
2644     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2645     make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data'
2646 - copy the big-endian Unicode data files to another location,
2647   separate from the other data files
2648     ICUDT=icudt54b
2649     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2650     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2651     cd ~/svn.icu/uni70/dbg/data/out/icu4j
2652     cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2653     cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2654     rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
2655     cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
2656     cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2657     cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
2658 - refresh ICU4J
2659     ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
2660
2661 * update CollationFCD.java
2662   + copy & paste the initializers of lcccIndex[] etc. from
2663     ICU4C/source/i18n/collationfcd.cpp to
2664     ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
2665
2666 * refresh Java test .txt files
2667 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2668     cd $ICU_SRC_DIR/source/data/unidata
2669     cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
2670     cd ../../test/testdata
2671     cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
2672     cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
2673
2674 * UCA
2675
2676 - download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/
2677 - run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata)
2678 - update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/
2679 - run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA
2680 - output files are in ~/svn.unitools/Generated/uca/7.0.0/
2681 - review data; compare files, use blankweights.sed or similar
2682   ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt
2683 - cd ~/svn.unitools/Generated/uca/7.0.0/
2684 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2685   cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
2686 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2687     (note: remove the underscore before "Rules" in the target file name)
2688     cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
2689 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
2690   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2691   with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
2692     cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
2693     cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
2694     cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
2695 - run genuca, see command line above
2696 - rebuild ICU4C
2697 - refresh ICU4J collation data:
2698   (subset of instructions above for properties data refresh, except copies all coll/*)
2699     ICUDT=icudt54b
2700     ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2701     ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2702     ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
2703     ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
2704 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
2705 - note on intltest: if collate/UCAConformanceTest fails, then
2706   utility/MultithreadTest/TestCollators will fail as well;
2707   fix the conformance test before looking into the multi-thread test
2708 - copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors
2709 - copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch
2710   ~/svn.unitools$ cp Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/
2711
2712 * When refreshing all of ICU4J data from ICU4C
2713 - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2714 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
2715 or
2716 - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
2717
2718 * run & fix ICU4J tests
2719
2720 *** LayoutEngine script information
2721
2722 (For details see the Unicode 5.2 change log below.)
2723
2724 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
2725   This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
2726   in the working directory.
2727   (It also generates ScriptRunData.cpp, which is no longer needed.)
2728
2729   The generated files have a current copyright date and "@stable" statement.
2730   ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java
2731   for "born stable" Unicode API constants, and to stop parsing ICU version numbers
2732   which may not contain dots any more.
2733
2734 - diff current <icu>/source/layout files vs. generated ones
2735     ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
2736   review and manually merge desired changes;
2737   fix gratuitous changes, incorrect @draft/@stable and missing aliases;
2738   Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
2739 - if you just copy the above files, then
2740   fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
2741   manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
2742
2743 *** API additions
2744 - send notice to icu-design about new born-@stable API (enum constants etc.)
2745
2746 *** merge the Unicode update branches back onto the trunk
2747 - do not merge the icudata.jar and testdata.jar,
2748   instead rebuild them from merged & tested ICU4C
2749
2750 ---------------------------------------------------------------------------- ***
2751
2752 Unicode 6.3 update
2753
2754 http://www.unicode.org/review/pri249/  -- beta review
2755 http://www.unicode.org/reports/uax-proposed-updates.html
2756 http://www.unicode.org/versions/beta-6.3.0.html#notable_issues
2757 http://www.unicode.org/reports/tr44/tr44-11.html
2758
2759 *** ICU Trac
2760
2761 - ticket 10128: update ICU to Unicode 6.3 beta
2762 - ticket 10168: update ICU to Unicode 6.3 final
2763 - C++ branches/markus/uni63 at r33552 from trunk at r33551
2764 - Java branches/markus/uni63 at r33550 from trunk at r33553
2765
2766 - ticket 10142: implement Unicode 6.3 bidi algorithm additions
2767
2768 *** Unicode version numbers
2769 - makedata.mak
2770 - uchar.h
2771   (configure.in & configure have been modified to extract the version from uchar.h)
2772 - com.ibm.icu.util.VersionInfo
2773 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
2774
2775 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
2776   so that the makefiles see the new version number.
2777
2778 *** data files & enums & parser code
2779
2780 * file preparation
2781
2782 - download UCD, UCA & IDNA files
2783 - make sure that the Unicode data folder passed into preparseucd.py
2784   includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
2785 - modify preparseucd.py:
2786   parse new file BidiBrackets.txt
2787   with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type
2788 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src
2789 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2790 - Check test file diffs for previously commented-out, known-failing data lines;
2791   probably need to keep those commented out.
2792
2793 * PropertyAliases.txt changes
2794 - 1 new Enumerated Property
2795   bpt                      ; Bidi_Paired_Bracket_Type
2796   -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType
2797   -> ubidi_props.h & .c & UBiDiProps.java
2798   -> remember to write the max value at UBIDI_MAX_VALUES_INDEX
2799   -> uprops.cpp
2800   -> change ubidi.icu format version from 2.0 to 2.1
2801 - 1 new Miscellaneous Property
2802   bpb                      ; Bidi_Paired_Bracket
2803   -> uchar.h & UProperty.java
2804   -> ppucd.h & .cpp
2805
2806 * PropertyValueAliases.txt changes
2807 - 3 Bidi_Paired_Bracket_Type (bpt) values:
2808   bpt; c                                ; Close
2809   bpt; n                                ; None
2810   bpt; o                                ; Open
2811   -> uchar.h & UCharacter.BidiPairedBracketType
2812   -> ubidi_props.h & .c & UBiDiProps.java
2813   -> change ubidi.icu format version from 2.0 to 2.1
2814 - 4 new Bidi_Class (bc) values:
2815   bc ; FSI                              ; First_Strong_Isolate
2816   bc ; LRI                              ; Left_To_Right_Isolate
2817   bc ; RLI                              ; Right_To_Left_Isolate
2818   bc ; PDI                              ; Pop_Directional_Isolate
2819   -> uchar.h & UCharacterEnums.ECharacterDirection
2820   -> until the bidi code gets updated,
2821      Roozbeh suggests mapping the new bc values to ON (Other_Neutral)
2822 - 3 new Word_Break (WB) values:
2823   WB ; HL                               ; Hebrew_Letter
2824   WB ; SQ                               ; Single_Quote
2825   WB ; DQ                               ; Double_Quote
2826   -> uchar.h & UCharacter.WordBreak
2827   -> first time Word_Break numeric constants exceed 4 bits (now 17 values)
2828 - 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
2829   (added 2012-10-16)
2830   Aghb  239     Caucasian Albanian
2831   Mahj  314     Mahajani
2832   -> uscript.h
2833   -> com.ibm.icu.lang.UScript
2834     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2835     replace  public static final int \1 = \2;\3
2836   -> preparseucd.py _scripts_only_in_iso15924
2837   -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
2838       and in com.ibm.icu.dev.test.lang.TestUScript.java
2839   -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
2840      (not strictly necessary for NOT_ENCODED scripts)
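
The Word_Break note above (17 values no longer fit into 4 bits) is simple
arithmetic. The helper name bits_needed() is made up for this sketch.

```python
def bits_needed(num_values):
    """Bits needed to store the constants 0..num_values-1."""
    return max(1, (num_values - 1).bit_length())

bits_for_16 = bits_needed(16)  # constants 0..15
bits_for_17 = bits_needed(17)  # constants 0..16
```

With 16 values the largest constant is 15 and fits into 4 bits; the 17th value pushes the largest constant to 16, which needs a fifth bit.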
2841
2842 * generate normalization data files
2843 - ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib
2844 - ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in
2845 - ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata
2846 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
2847 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
2848 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
2849 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
2850
2851 * build ICU (make install)
2852   so that the tools build can pick up the new definitions from the installed header files.
2853
2854 ~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
2855
2856 * build Unicode tools using CMake+make
2857
2858 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
2859
2860 # Location (--prefix) of where ICU was installed.
2861 set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst)
2862 # Location of the ICU source tree.
2863 set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src)
2864
2865 ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
2866 ~/svn.icutools/trunk/dbg/unicode/c$ make
2867
2868 * generate core properties data files
2869 - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src
2870 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src
2871 - rebuild ICU (make install) & tools
2872 - run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
2873 - rebuild ICU (make install) & tools
2874
2875 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2876   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2877 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2878 - Unicode 6.0..6.3: U+2260, U+226E, U+226F
2879 - nothing new in 6.3, no test file to update
2880
2881 * update Java data files
2882 - refresh just the UCD-related files, just to be safe
2883 - see (ICU4C)/source/data/icu4j-readme.txt
2884 - mkdir /tmp/icu4j
2885 - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2886   output:
2887     ...
2888     Unicode .icu files built to ./out/build/icudt52l
2889     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b
2890     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b
2891     echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
2892     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b
2893     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b"
2894     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/
2895     mkdir -p /tmp/icu4j/main/shared/data
2896     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2897     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/
2898     mkdir -p /tmp/icu4j/main/shared/data
2899     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2900     make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data'
2901 - copy the big-endian Unicode data files to another location,
2902   separate from the other data files
2903     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
2904     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
2905     ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
2906     ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu
2907     ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
2908     ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
2909     ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
2910 - refresh ICU4J
2911     ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
2912
2913 * refresh Java test .txt files
2914 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2915
2916 * UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files
2917
2918 - get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
2919 - CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
2920 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2921 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2922   (note: remove the underscore before "Rules" in the target file name)
2923 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
2924   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2925   with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
2926 - check test file diffs for previously commented-out, known-failing data lines;
2927   probably need to keep those commented out
2928 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
2929 - run genuca, see command line above
2930 - rebuild ICU4C
2931 - refresh ICU4J collation data:
2932   (subset of instructions above for properties data refresh, except copies all coll/*)
2933     ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2934     ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
2935     ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
2936     ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
2937 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
2938 - note on intltest: if collate/UCAConformanceTest fails, then
2939   utility/MultithreadTest/TestCollators will fail as well;
2940   fix the conformance test before looking into the multi-thread test
2941
2942 * test ICU, fix test code where necessary
2943
2944 * When refreshing all of ICU4J data from ICU4C
2945 - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2946 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
2947 or
2948 - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
2949
2950 *** LayoutEngine script information
2951 - skipped for Unicode 6.3: no new scripts
2952
2953 *** merge the Unicode update branches back onto the trunk
2954 - do not merge the icudata.jar and testdata.jar,
2955   instead rebuild them from merged & tested ICU4C
2956
2957 ---------------------------------------------------------------------------- ***
2958
2959 Unicode 6.2 update
2960
2961 http://www.unicode.org/review/pri230/
2962 http://www.unicode.org/versions/beta-6.2.0.html
2963 http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0
2964 http://www.unicode.org/review/pri227/  Changes to Script Extensions Property Values
2965 http://www.unicode.org/review/pri228/  Changing some common characters from Punctuation to Symbol
2966 http://www.unicode.org/review/pri229/  Linebreaking Changes for Pictographic Symbols
2967 http://www.unicode.org/reports/tr46/tr46-8.html  IDNA
2968 http://unicode.org/Public/idna/6.2.0/
2969
2970 *** ICU Trac
2971
2972 - ticket 9515: Unicode 6.2: final ICU update
2973
2974 - ticket 9514: UCA 6.2: fix UCARules.txt
2975
2976 - ticket 9437: update ICU to Unicode 6.2
2977 - C++ branches/markus/uni62 at r32050 from trunk at r32041
2978 - Java branches/markus/uni62 at r32068 from trunk at r32066
2979
2980 *** Unicode version numbers
2981 - makedata.mak
2982 - uchar.h
2983   (configure.in & configure have been modified to extract the version from uchar.h)
2984 - com.ibm.icu.util.VersionInfo
2985 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
2986
2987 *** data files & enums & parser code
2988
2989 * file preparation
2990
2991 - download UCD, UCA & IDNA files
2992 - make sure that the Unicode data folder passed into preparseucd.py
2993   includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
2994 - modify preparseucd.py: NamesList.txt is now in UTF-8
2995 - ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src
2996 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2997 - Check test file diffs for previously commented-out, known-failing data lines;
2998   probably need to keep those commented out.
2999
3000 * PropertyValueAliases.txt changes
3001 - 1 new Line_Break (lb) value:
3002   lb ; RI                               ; Regional_Indicator
3003   -> uchar.h & UCharacter.LineBreak
3004 - 1 new Word_Break (WB) value:
3005   WB ; RI                               ; Regional_Indicator
3006   -> uchar.h & UCharacter.WordBreak
3007 - 1 new Grapheme_Cluster_Break (GCB) value:
3008   GCB; RI                               ; Regional_Indicator
3009   -> uchar.h & UCharacter.GraphemeClusterBreak
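
Sketch only, not part of the ICU tools: the three new alias lines above share the
"property ; short name ; long name" shape and can be split as below.
The function name parse_value_alias is hypothetical.

```python
def parse_value_alias(line):
    """Split one PropertyValueAliases.txt-style line into its fields."""
    # Drop a trailing comment, if any, then split on semicolons.
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    return tuple(field.strip() for field in data.split(";"))


# The three lines quoted in the list above.
lines = [
    "lb ; RI                               ; Regional_Indicator",
    "WB ; RI                               ; Regional_Indicator",
    "GCB; RI                               ; Regional_Indicator",
]
for line in lines:
    print(parse_value_alias(line))
```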
3010
3011 * 3 new numeric values
3012   The new value -1 (really meant to be NaN, but that would have required
3013   new UnicodeData.txt syntax) can already be represented as a "fraction" of -1/1,
3014   but encodeNumericValue() in corepropsbuilder.cpp had to be fixed.
3015     cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1
3016     cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1
3017   The two new values 216000 and 432000 require an addition to the encoding of numeric values.
3018     cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000
3019     cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000
3020   -> uprops.h, uchar.c & UCharacterProperty.java
3021   -> cucdtst.c & UCharacterTest.java
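
Sketch only: this does not reproduce the actual encoding in uprops.h.
It just illustrates that -1 fits a numerator/denominator pair, and that 216000 and 432000
are small multiples of a power of 60, which is one plausible way to store them compactly.
The function name split_base60 and the limits (mantissa 1..9, exponent 1..4) are
assumptions made for this illustration.

```python
from fractions import Fraction


def split_base60(value):
    """Return (mantissa, exponent) with value == mantissa * 60**exponent,
    or None if the value does not fit the assumed limits."""
    if value <= 0:
        return None
    mant, exp = value, 0
    # Divide out factors of 60, up to the assumed maximum exponent of 4.
    while mant % 60 == 0 and exp < 4:
        mant //= 60
        exp += 1
    if exp >= 1 and 1 <= mant <= 9:
        return mant, exp
    return None


# -1 written as the "fraction" -1/1.
minus_one = Fraction(-1, 1)
print(minus_one.numerator, minus_one.denominator)

# The two new cuneiform values.
print(split_base60(216000))
print(split_base60(432000))
```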
3022
3023 * generate normalization data files
3024 - ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib
3025 - ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in
3026 - ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata
3027 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
3028 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
3029 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
3030 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
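
Sketch only: the four gennorm2 invocations above differ just in the output name and the
list of input files, so they can be generated from a small table.
Only the -o and -s options shown above are used; the helper name gennorm2_commands is hypothetical.

```python
import posixpath


def gennorm2_commands(src_data_in, unidata):
    """Build the four gennorm2 argument lists from the steps above."""
    norm2 = posixpath.join(unidata, "norm2")
    jobs = [
        ("nfc.nrm", ["nfc.txt"]),
        ("nfkc.nrm", ["nfc.txt", "nfkc.txt"]),
        ("nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"]),
        ("uts46.nrm", ["nfc.txt", "uts46.txt"]),
    ]
    return [
        ["bin/gennorm2", "-o", posixpath.join(src_data_in, out), "-s", norm2] + inputs
        for out, inputs in jobs
    ]


# Print the command lines instead of running them.
for cmd in gennorm2_commands("source/data/in", "source/data/unidata"):
    print(" ".join(cmd))
```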
3031
3032 * build ICU (make install)
3033   so that the tools build can pick up the new definitions from the installed header files.
3034 * build Unicode tools using CMake+make
3035
3036 * generate core properties data files
3037 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src
3038 - in initial bootstrapping, change the UCA version
3039   in source/data/unidata/FractionalUCA.txt to match the new Unicode version
3040 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src
3041 - rebuild ICU (make install) & tools
3042   + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
3043     check if the UCA version in FractionalUCA.txt matches the new Unicode version
3044     (see step above)
3045 - run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
3046 - rebuild ICU (make install) & tools
3047
3048 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
3049   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
3050 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
3051 - Unicode 6.0..6.2: U+2260, U+226E, U+226F
3052 - nothing new in 6.2, no test file to update
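
Sketch only: a scripted version of the grep step above.
It assumes lines of the shape "code point(s) ; status ; ... # comment";
the sample lines are hand-written for illustration, not copied from IdnaMappingTable.txt,
and the function name std3_valid_non_ascii is hypothetical.

```python
def std3_valid_non_ascii(lines):
    """Yield (first, last) code point ranges above U+007F
    whose status field is disallowed_STD3_valid."""
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        # A code point field is either "XXXX" or "XXXX..YYYY".
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        if first > 0x7F:
            yield first, last


# Illustrative input only.
sample = [
    "0000..002C    ; disallowed_STD3_valid                  # ASCII range",
    "2260          ; disallowed_STD3_valid                  # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid                  # NOT LESS-THAN..NOT GREATER-THAN",
    "00A0          ; disallowed_STD3_mapped ; 0020          # NO-BREAK SPACE",
]
for first, last in std3_valid_non_ascii(sample):
    print("U+%04X..U+%04X" % (first, last))
```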
3053
3054 * update Java data files
3055 - refresh just the UCD-related files, just to be safe
3056 - see (ICU4C)/source/data/icu4j-readme.txt
3057 - mkdir /tmp/icu4j
3058 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3059   output:
3060     ...
3061     Unicode .icu files built to ./out/build/icudt50l
3062     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b
3063     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b
3064     echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
3065     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b
3066     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b"
3067     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/
3068     mkdir -p /tmp/icu4j/main/shared/data
3069     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
3070     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/
3071     mkdir -p /tmp/icu4j/main/shared/data
3072     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
3073     make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data'
3074 - copy the big-endian Unicode data files to another location,
3075   separate from the other data files
3076     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
3077     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
3078     ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
3079     ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu
3080     ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
3081     ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
3082     ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
3083 - refresh ICU4J
3084     ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
3085
3086 * refresh Java test .txt files
3087 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
3088
3089 * UCA
3090
3091 - get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
3092 - CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
3093 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
3094 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
3095   (note removing the underscore before "Rules")
3096 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
3097   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
3098   with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
3099 - check test file diffs for previously commented-out, known-failing data lines;
3100   probably need to keep those commented out
3101 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
3102 - run genuca, see command line above
3103 - rebuild ICU4C
3104 - refresh ICU4J collation data:
3105   (subset of instructions above for properties data refresh, except copies all coll/*)
3106     ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3107     ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
3108     ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
3109     ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
3110 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
3111 - note on intltest: if collate/UCAConformanceTest fails, then
3112   utility/MultithreadTest/TestCollators will fail as well;
3113   fix the conformance test before looking into the multi-thread test
3114
3115 * test ICU, fix test code where necessary
3116
3117 * When refreshing all of ICU4J data from ICU4C
3118 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3119 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
3120 or
3121 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
3122
3123 *** LayoutEngine script information
3124 - skipped for Unicode 6.2: no new scripts
3125
3126 *** merge the Unicode update branches back onto the trunk
3127 - do not merge the icudata.jar and testdata.jar,
3128   instead rebuild them from merged & tested ICU4C
3129
3130 ---------------------------------------------------------------------------- ***
3131
3132 Future Unicode update
3133
3134 Tools simplified since the Unicode 6.1 update. See
3135 - http://site.icu-project.org/design/props/ppucd
3136 - http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972
3137
3138 * Unicode version numbers
3139 - icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates
3140
3141 * file preparation
3142 - ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:
3143 - ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src
3144 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
3145 - Check test file diffs for previously commented-out, known-failing data lines;
3146   probably need to keep those commented out.
3147
3148 * PropertyValueAliases.txt changes
3149 - Script codes that are in ISO 15924 but not in Unicode are now listed in
3150   preparseucd.py, in the _scripts_only_in_iso15924 variable.
3151   If there are new ISO codes, then add them.
3152   If Unicode adds some of them, then remove them from the .py variable.
3153
3154 * UnicodeData.txt changes
3155 - No more manual changes for CJK ranges for algorithmic names;
3156   those are now written to ppucd.txt and genprops reads them from there.
3157
3158 * generate core properties data files (makeprops.sh was deleted)
3159 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src
3160
3161 * no more manual updates of source/data/unidata/norm2/nfkc_cf.txt
3162 - it is now generated by preparseucd.py
3163
3164 * no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt
3165 - it is now generated by preparseucd.py
3166 - make sure that the Unicode data folder passed into preparseucd.py
3167   includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
3168   (can be in some subfolder)
3169
3170 * generate normalization data files
3171 - ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib
3172 - ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in
3173 - ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata
3174 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
3175 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
3176 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
3177 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
3178
3179 * build ICU (make install)
3180 * build Unicode tools using CMake+make
3181
3182 * new way to call genuca (makeuca.sh was deleted)
3183 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src
3184
3185 ---------------------------------------------------------------------------- ***
3186
3187 Unicode 6.1 update
3188
3189 *** ICU Trac
3190
3191 - ticket 8995 final update to Unicode 6.1
3192 - ticket 8994 regenerate source/layout/CanonData.cpp
3193
3194 - ticket 8961 support Unicode "Age" value *names*
3195 - ticket 8963 support multiple character name aliases & types
3196
3197 - ticket 8827 "update ICU to Unicode 6.1"
3198 - C++ branches/markus/uni61 at r30864 from trunk at r30843
3199 - Java branches/markus/uni61 at r30865 from trunk at r30863
3200
3201 *** Unicode version numbers
3202 - makedata.mak
3203 - uchar.h
3204   (configure.in & configure: have been modified to extract the version from uchar.h)
3205 - com.ibm.icu.util.VersionInfo
3206 - icutools/unicode/makedefs.sh
3207   + also review & update other definitions in that file,
3208     e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l
3209
3210 *** data files & enums & parser code
3211
3212 * file preparation
3213
3214 ~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed
3215 - This prepares both unidata and testdata files in respective output subfolders.
3216 - Check test file diffs for previously commented-out, known-failing data lines;
3217   probably need to keep those commented out.
3218
3219 * PropertyValueAliases.txt changes
3220 - 11 new block names:
3221   Arabic_Extended_A
3222   Arabic_Mathematical_Alphabetic_Symbols
3223   Chakma
3224   Meetei_Mayek_Extensions
3225   Meroitic_Cursive
3226   Meroitic_Hieroglyphs
3227   Miao
3228   Sharada
3229   Sora_Sompeng
3230   Sundanese_Supplement
3231   Takri
3232   -> add to uchar.h
3233   -> add to UCharacter.UnicodeBlock IDs
3234     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
3235             replace  public static final int \1_ID = \2; \3
3236   -> add to UCharacter.UnicodeBlock objects
3237     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
3238             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
3239 - 1 new Joining_Group (jg) value:
3240   Rohingya_Yeh
3241   -> uchar.h & UCharacter.JoiningGroup
3242 - 2 new Line_Break (lb) values:
3243   CJ=Conditional_Japanese_Starter
3244   HL=Hebrew_Letter
3245   -> uchar.h & UCharacter.LineBreak
3246 - 7 new scripts:
3247   sc ; Cakm      ; Chakma
3248   sc ; Merc      ; Meroitic_Cursive
3249   sc ; Mero      ; Meroitic_Hieroglyphs
3250   sc ; Plrd      ; Miao
3251   sc ; Shrd      ; Sharada
3252   sc ; Sora      ; Sora_Sompeng
3253   sc ; Takr      ; Takri
3254   -> remove these from SyntheticPropertyValueAliases.txt
3255   -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
3256       and in com.ibm.icu.dev.test.lang.TestUScript.java
3257 - 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
3258   (added 2011-06-21)
3259   Khoj        322     Khojki
3260   Tirh        326     Tirhuta
3261     and another one added 2011-12-09
3262   Hluw        080     Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)
3263   -> uscript.h
3264   -> com.ibm.icu.lang.UScript
3265     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
3266     replace  public static final int \1 = \2;\3
3267   -> SyntheticPropertyValueAliases.txt
3268   -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
3269       and in com.ibm.icu.dev.test.lang.TestUScript.java
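
Sketch only: the Eclipse find/replace patterns above for UBLOCK_ and USCRIPT_ constants
also work with Python's re.sub, since both use \1-style group references.
The input lines and their numeric values are illustrative, not copied from uchar.h or uscript.h;
the names BLOCK_ID, BLOCK_OBJ, SCRIPT and convert are hypothetical.

```python
import re

# (find, replace) pairs taken from the steps above.
BLOCK_ID = (r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)",
            r"public static final int \1_ID = \2; \3")
BLOCK_OBJ = (r"UBLOCK_([^ ]+) = [0-9]+, (/.+)",
             r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2')
SCRIPT = (r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)",
          r"public static final int \1 = \2;\3")


def convert(rule, line):
    """Apply one find/replace pair to one line of C enum source."""
    pattern, replacement = rule
    return re.sub(pattern, replacement, line)


# Illustrative C lines (values are made up for the example).
print(convert(BLOCK_ID, "UBLOCK_CHAKMA = 212, /*[11100]*/"))
print(convert(BLOCK_OBJ, "UBLOCK_CHAKMA = 212, /*[11100]*/"))
print(convert(SCRIPT, "USCRIPT_KHOJKI = 157,/* Khoj */"))
```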
3270
3271 * UnicodeData.txt changes
3272 - the last Unihan code point changes from U+9FCB to U+9FCC
3273   search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive)
3274   + do change gennames.c
3275   + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java
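
Sketch only: the case-insensitive search for 9FC[BC] described above, applied to a list of
source lines. The sample lines are made up; the function name find_hits is hypothetical.

```python
import re

# Matches both the old end (9FCB) and the new end / old limit (9FCC).
UNIHAN_END_OR_LIMIT = re.compile(r"9FC[BC]", re.IGNORECASE)


def find_hits(lines):
    """Return (line number, line) pairs that mention either code point."""
    return [(number, line)
            for number, line in enumerate(lines, 1)
            if UNIHAN_END_OR_LIMIT.search(line)]


# Made-up lines standing in for real source files.
sample = [
    "#define CJK_END 0x9fcb",
    "int unrelated = 0;",
    "static const int32_t CJK_LIMIT = 0x9FCC;",
]
for number, line in find_hits(sample):
    print(number, line)
```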
3276
3277 * DerivedBidiClass.txt changes
3278 - 2 new default-AL blocks:
3279 #     Arabic Extended-A: U+08A0  -  U+08FF  (was default-R)
3280 #     Arabic Mathematical Alphabetic Symbols:
3281 #                       U+1EE00  - U+1EEFF  (was default-R)
3282 - 2 new default-R blocks:
3283 #     Meroitic Hieroglyphs:
3284 #                        U+10980 - U+1099F
3285 #     Meroitic Cursive:  U+109A0 - U+109FF
3286   -> should be picked up by the explicit data in the file
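
Sketch only: the four default ranges listed above, as a lookup table.
This is not how ICU stores bidi data; it covers only these four blocks,
and the names DEFAULT_BIDI_RANGES and default_bidi_class are hypothetical.

```python
# (first, last, default Bidi_Class) for the blocks listed above.
DEFAULT_BIDI_RANGES = [
    (0x08A0, 0x08FF, "AL"),    # Arabic Extended-A
    (0x1EE00, 0x1EEFF, "AL"),  # Arabic Mathematical Alphabetic Symbols
    (0x10980, 0x1099F, "R"),   # Meroitic Hieroglyphs
    (0x109A0, 0x109FF, "R"),   # Meroitic Cursive
]


def default_bidi_class(code_point):
    """Return the default class for a code point in one of the four blocks, else None."""
    for first, last, value in DEFAULT_BIDI_RANGES:
        if first <= code_point <= last:
            return value
    return None


print(default_bidi_class(0x08A0))
print(default_bidi_class(0x109A0))
print(default_bidi_class(0x0041))
```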
3287
3288 * NameAliases.txt changes
3289 - from
3290     # Each line has two fields
3291     # First field: Code point
3292     # Second field: Alias
3293 - to
3294     # Each line has three fields, as described here:
3295     #
3296     # First field:  Code point
3297     # Second field: Alias
3298     # Third field:  Type
3299 - Also, the file previously allowed multiple aliases per code point, but only now does it
3300   actually provide several, even several of the same type. For example,
3301     FEFF;BYTE ORDER MARK;alternate
3302     FEFF;BOM;abbreviation
3303     FEFF;ZWNBSP;abbreviation
3304 - This breaks our gennames parser, unames.icu data structure, and API.
3305   Fix gennames to only pick up "correction" aliases.
3306   New ticket #8963 for further changes.
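
Sketch only, not the gennames code: reading the three-field format and keeping just the
"correction" aliases, as described above. The FEFF lines are the ones quoted above;
the 01A2 line is added as an illustrative correction entry.
The function name correction_aliases is hypothetical.

```python
def correction_aliases(lines):
    """Map code point -> list of aliases whose type field is 'correction'."""
    result = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        code_point, alias, kind = [field.strip() for field in data.split(";")]
        if kind == "correction":
            result.setdefault(int(code_point, 16), []).append(alias)
    return result


sample = [
    "# Each line has three fields",
    "01A2;LATIN CAPITAL LETTER GHA;correction",
    "FEFF;BYTE ORDER MARK;alternate",
    "FEFF;BOM;abbreviation",
    "FEFF;ZWNBSP;abbreviation",
]
print(correction_aliases(sample))
```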
3307
3308 * run genpname/preparse.pl (on Linux)
3309   + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
3310   + make sure that data.h is writable
3311   + perl preparse.pl ~/svn.icu/trunk/src > out.txt
3312   + preparse.pl shows no errors, out.txt Info and Warning lines look ok
3313
3314 * build ICU (make install)
3315   so that the tools build can pick up the new definitions from the installed header files.
3316 * build Unicode tools (at least genpname) using CMake+make
3317
3318 * run genpname
3319   (builds both pnames.icu and propname_data.h)
3320 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
3321 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
3322
3323 * build ICU (make install)
3324 * build Unicode tools using CMake+make
3325
3326 * update source/data/unidata/norm2/nfkc_cf.txt
3327 - follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
3328
3329 * update source/data/unidata/norm2/uts46.txt
3330 - download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
3331   to ~/svn.icu/tools/trunk/src/unicode/py
3332 - adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".
3333 - ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
3334 - ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
3335
3336 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
3337   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
3338 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
3339 - Unicode 6.0..6.1: U+2260, U+226E, U+226F
3340 - nothing new in 6.1, no test file to update
3341
3342 * generate core properties data files
3343 - in initial bootstrapping, change the UCA version
3344   in source/data/unidata/FractionalUCA.txt to match the new Unicode version
3345 - ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
3346 - rebuild ICU & tools
3347   + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
3348     check if the UCA version in FractionalUCA.txt matches the new Unicode version
3349     (see step above)
3350 - run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:
3351   ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
3352 - rebuild ICU & tools
3353
3354 * update Java data files
3355 - refresh just the UCD-related files, just to be safe
3356 - see (ICU4C)/source/data/icu4j-readme.txt
3357 - mkdir /tmp/icu4j
3358 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3359   output:
3360     ...
3361     Unicode .icu files built to ./out/build/icudt49l
3362     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b
3363     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b
3364     echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
3365     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b
3366     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b"
3367     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/
3368     mkdir -p /tmp/icu4j/main/shared/data
3369     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
3370     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/
3371     mkdir -p /tmp/icu4j/main/shared/data
3372     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
3373     make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data'
3374 - copy the big-endian Unicode data files to another location,
3375   separate from the other data files
3376     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
3377     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
3378     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
3379     ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu
3380     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
3381     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
3382     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
3383 - refresh ICU4J
3384     ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
3385
3386 * refresh Java test .txt files
3387 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
3388
3389 * test ICU so far, fix test code where necessary
3390 - temporarily ignore collation issues that look like UCA/UCD mismatches,
3391   until UCA data is updated
3392
3393 * UCA
3394
3395 - get output from Mark's tools; look in
3396     http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. version>.txt
3397 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
3398 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
3399   (note removing the underscore before "Rules")
3400 - update (ICU)/source/test/testdata/CollationTest_*.txt
3401   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
3402   with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
3403 - check test file diffs for previously commented-out, known-failing data lines;
3404   probably need to keep those commented out
3405 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
3406 - run makeuca.sh:
3407   ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
3408 - rebuild ICU4C
3409 - refresh ICU4J collation data:
3410   (subset of instructions above for properties data refresh, except copies all coll/*)
3411     ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3412     ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
3413     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
3414     ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
3415 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
3416 - note on intltest: if collate/UCAConformanceTest fails, then
3417   utility/MultithreadTest/TestCollators will fail as well;
3418   fix the conformance test before looking into the multi-thread test
3419
3420 * When refreshing all of ICU4J data from ICU4C
3421 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3422 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
3423 or
3424 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
3425
3426 *** LayoutEngine script information
3427
3428 (For details see the Unicode 5.2 change log below.)
3429
3430 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
3431   This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
3432   in the working directory.
3433   (It also generates ScriptRunData.cpp, which is no longer needed.)
3434
3435   The generated files have a current copyright date and "@draft" statement.
3436
3437 - diff current <icu>/source/layout files vs. generated ones
3438     ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
3439   review and manually merge desired changes;
3440   fix gratuitous changes, incorrect @draft and missing aliases;
3441   Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
3442 - if you just copy the above files, then
3443   fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
3444   manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
3445
3446 *** merge the Unicode update branches back onto the trunk
3447 - do not merge the icudata.jar and testdata.jar,
3448   instead rebuild them from merged & tested ICU4C
3449
3450 ---------------------------------------------------------------------------- ***
3451
3452 ICU 4.8 (no Unicode update, just new script codes)
3453
3454 * 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
3455   (added 2010-12-21)
3456     Afak    439     Afaka
3457     Jurc    510     Jurchen
3458     Mroo    199     Mro, Mru
3459     Nshu    499     Nüshu
3460     Shrd    319     Sharada, Śāradā
3461     Sora    398     Sora Sompeng
3462     Takr    321     Takri, Ṭākrī, Ṭāṅkrī
3463     Tang    520     Tangut
3464     Wole    480     Woleai
3465   -> uscript.h
3466   -> com.ibm.icu.lang.UScript
3467     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
3468     replace  public static final int \1 = \2;\3
3469   -> genpname/SyntheticPropertyValueAliases.txt
3470   -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
3471       and in com.ibm.icu.dev.test.lang.TestUScript.java
3472
3473 * run genpname/preparse.pl (on Linux)
3474   + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
3475   + make sure that data.h is writable
3476   + perl preparse.pl ~/svn.icu/trunk/src > out.txt
3477   + preparse.pl shows no errors, out.txt Info and Warning lines look ok
3478
3479 * rebuild Unicode tools (at least genpname) using make
3480 - You might first need to "make install" ICU so that the tools build can pick
3481   up the new definitions from the installed header files.
3482
3483 * run genpname
3484   (builds both pnames.icu and propname_data.h)
3485 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
3486 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
3487 - rebuild ICU & tools
3488
3489 * run genprops
3490 - ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
3491 - ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
3492 - rebuild ICU & tools
3493
3494 * update Java data files
3495 - refresh just the UCD-related files, just to be safe
3496 - see (ICU4C)/source/data/icu4j-readme.txt
3497 - mkdir /tmp/icu4j
3498 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3499 - copy the big-endian Unicode data files to another location,
3500   separate from the other data files
3501     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
3502     ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
3503     ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
3504 - refresh ICU4J
3505     ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b
3506
3507 * should have updated the layout engine script codes but forgot
3508
3509 ---------------------------------------------------------------------------- ***
3510
3511 Unicode 6.0 update
3512
3513 *** related ICU Trac tickets
3514
3515 7264 Unicode 6.0 Update
3516
3517 *** Unicode version numbers
3518 - makedata.mak
3519 - uchar.h
3520   (configure.in & configure: have been modified to extract the version from uchar.h)
3521 - com.ibm.icu.util.VersionInfo
3522
3523 *** data files & enums & parser code
3524
3525 * file preparation
3526
3527 ~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
3528 - This now prepares both unidata and testdata files in respective output subfolders.
3529
3530 * PropertyAliases.txt changes
3531 - new Script_Extensions property defined in the new ScriptExtensions.txt file
3532   but not listed in PropertyAliases.txt; reported to unicode.org;
3533   -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
3534     scx; Script_Extensions
3535   -> uchar.h with new UProperty section
3536   -> com.ibm.icu.lang.UProperty, parallel with uchar.h
3537
3538 * PropertyValueAliases.txt changes
3539 - 12 new block names:
3540   Alchemical_Symbols
3541   Bamum_Supplement
3542   Batak
3543   Brahmi
3544   CJK_Unified_Ideographs_Extension_D
3545   Emoticons
3546   Ethiopic_Extended_A
3547   Kana_Supplement
3548   Mandaic
3549   Miscellaneous_Symbols_And_Pictographs
3550   Playing_Cards
3551   Transport_And_Map_Symbols
3552   -> add to uchar.h
3553   -> add to UCharacter.UnicodeBlock
3554     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
3555             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
3556 - Joining_Group (jg) values:
3557   Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal, which becomes an alias
3558   -> uchar.h & UCharacter.JoiningGroup
3559 - 3 new scripts:
3560   sc ; Batk      ; Batak
3561   sc ; Brah      ; Brahmi
3562   sc ; Mand      ; Mandaic
3563   -> remove these from SyntheticPropertyValueAliases.txt
3564   -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
3565   -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
3566       and in com.ibm.icu.dev.test.lang.TestUScript.java
3567 - 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
3568   (added 2009-11-11..2010-07-18)
3569   Bass        259     Bassa Vah
3570   Dupl        755     Duployan shorthand
3571   Elba        226     Elbasan
3572   Gran        343     Grantha
3573   Kpel        436     Kpelle
3574   Loma        437     Loma
3575   Mend        438     Mende
3576   Merc        101     Meroitic Cursive
3577   Narb        106     Old North Arabian
3578   Nbat        159     Nabataean
3579   Palm        126     Palmyrene
3580   Sind        318     Sindhi
3581   Wara        262     Warang Citi
3582   -> uscript.h
3583   -> com.ibm.icu.lang.UScript
3584     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
3585     replace  public static final int \1 = \2;\3
3586   -> SyntheticPropertyValueAliases.txt
3587   -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
3588       and in com.ibm.icu.dev.test.lang.TestUScript.java
3589 - ISO 15924 name change
3590   Mero        100     Meroitic Hieroglyphs (was Meroitic)
3591   -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
3592 - property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt
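The uscript.h -> com.ibm.icu.lang.UScript find/replace step above can also be run outside an editor. A minimal sketch in Python, assuming the same pattern as in the log; the function name and the sample enum value are illustrative only, not taken from uscript.h:

```python
import re

# Same pattern as the find/replace pair in the log, in Python regex syntax.
USCRIPT_LINE = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")


def uscript_to_java(c_line: str) -> str:
    """Rewrite one uscript.h enum line as a UScript.java int constant.

    Lines that do not match the pattern are returned unchanged.
    """
    return USCRIPT_LINE.sub(r"public static final int \1 = \2;\3", c_line)


if __name__ == "__main__":
    # Hypothetical input line; the numeric value is for illustration only.
    print(uscript_to_java("    USCRIPT_BASSA_VAH = 134,/* Bass */"))
```

The trailing comment is carried over by the third group, so the Java line still needs a manual review of its doc comment.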
3593
3594 * UnicodeData.txt changes
3595 - new CJK block:
3596   2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
3597   2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
3598   -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion
3599
3600 * build Unicode tools using CMake+make
3601
3602 * run genpname/preparse.pl (on Linux)
3603   + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
3604   + make sure that data.h is writable
3605   + perl preparse.pl ~/svn.icu/trunk/src > out.txt
3606   + preparse.pl shows no errors, out.txt Info and Warning lines look ok
3607
3608 * rebuild Unicode tools (at least genpname) using make
3609 - You might first need to "make install" ICU so that the tools build can pick
3610   up the new definitions from the installed header files.
3611
3612 * run genpname
3613 - ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
3614 - rebuild ICU & tools
3615
3616 * update source/data/unidata/norm2/nfkc_cf.txt
3617 - follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
3618
3619 * update source/data/unidata/norm2/uts46.txt
3620 - download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
3621   to ~/svn.icu/tools/trunk/src/unicode/py
3622 - adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
3623 - ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
3624 - ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
3625
3626 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
3627   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
3628 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
3629 - Unicode 6.0: U+2260, U+226E, U+226F
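The grep step above can be approximated with a short script. A sketch, assuming the usual "code point or range ; status # comment" line layout of IdnaMappingTable.txt; the function name is hypothetical and this is not part of idna2nrm.py:

```python
import re

# One data line: "XXXX" or "XXXX..YYYY", then ";", then the status field.
DATA_LINE = re.compile(
    r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def non_ascii_std3_valid(lines):
    """Return (start, end) ranges above U+007F marked disallowed_STD3_valid."""
    result = []
    for line in lines:
        m = DATA_LINE.match(line)
        if m is None or m.group(3) != "disallowed_STD3_valid":
            continue
        start = int(m.group(1), 16)
        end = int(m.group(2) or m.group(1), 16)
        if end > 0x7F:
            # Clip ranges that begin in ASCII to their non-ASCII part.
            result.append((max(start, 0x80), end))
    return result


if __name__ == "__main__":
    sample = [
        "0000..002C    ; disallowed_STD3_valid   # <control>..COMMA",
        "2260          ; disallowed_STD3_valid   # NOT EQUAL TO",
        "226E..226F    ; disallowed_STD3_valid   # NOT LESS-THAN..NOT GREATER-THAN",
    ]
    for start, end in non_ascii_std3_valid(sample):
        print(f"U+{start:04X}..U+{end:04X}")
```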
3630
3631 * generate core properties data files
3632 - ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
3633 - rebuild ICU & tools
3634 - run makeuca.sh so that genuca picks up the new nfc.nrm:
3635   ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
3636 - rebuild ICU & tools
3637
3638 * implement new Script_Extensions property (provisional)
3639 - parser & generator: genprops & uprops.icu
3640 - uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
3641 - UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java
3642
3643 * switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
3644 - (one-time change)
3645 - genbidi/gencase/genprops tools changes
3646 - re-run makeprops.sh (see above)
3647 - UCharacterProperty.java, UCharacterTypeIterator.java,
3648   UBiDiProps.java, UCaseProps.java, and several others with minor changes;
3649   UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java
3650
3651 * update Java data files
3652 - refresh just the UCD-related files, just to be safe
3653 - see (ICU4C)/source/data/icu4j-readme.txt
3654 - mkdir /tmp/icu4j
3655 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3656   output:
3657     ...
3658     Unicode .icu files built to ./out/build/icudt45l
3659     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
3660     echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
3661     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
3662     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
3663     mkdir -p /tmp/icu4j/main/shared/data
3664     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
3665 - copy the big-endian Unicode data files to another location,
3666   separate from the other data files
3667     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
3668     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
3669     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
3670     ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
3671     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
3672     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
3673     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
3674 - refresh ICU4J
3675     ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
3676
3677 * refresh Java test .txt files
3678 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
3679
3680 * un-hardcode normalization skippable (NF*_Inert) test data
3681 - removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools
3682
3683 * copy updated break iterator test files
3684 - now handled by early ucdcopy.py and
3685   copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
3686   (old instructions:
3687    copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
3688    to ~/svn.icu/trunk/src/source/test/testdata)
3689 - they are not used in ICU4J
3690
3691 * UCA
3692
3693 - get output from Mark's tools; look in
3694     http://www.unicode.org/~book/incoming/mark/uca6.0.0/
3695     http://www.macchiato.com/unicode/utc/additional-uca-files
3696     http://www.unicode.org/Public/UCA/6.0.0/
3697     http://www.unicode.org/~mdavis/uca/
3698 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
3699 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
3700 - update Han-implicit ranges for new CJK extensions:
3701   swapCJK() in ucol.cpp & ImplicitCEGenerator.java
3702 - genuca: allow bytes 02 for U+FFFE, new merge-sort character;
3703   do not add it into invuca so that tailoring primary-after an ignorable works
3704 - genuca: permit space between [variable top] bytes
3705 - ucol.cpp: treat noncharacters like unassigned rather than ignorable
3706 - run makeuca.sh:
3707   ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
3708 - rebuild ICU4C
3709 - refresh ICU4J collation data:
3710   (subset of instructions above for properties data refresh, except copies all coll/*)
3711     ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3712     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
3713     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
3714     ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
3715 - update (ICU)/source/test/testdata/CollationTest_*.txt
3716   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
3717   with output from Mark's Unicode tools
3718 - run all tests with the *_SHORT.txt or the full files (the full ones have comments)
3719 - note on intltest: if collate/UCAConformanceTest fails, then
3720   utility/MultithreadTest/TestCollators will fail as well;
3721   fix the conformance test before looking into the multi-thread test
3722
3723 * When refreshing all of ICU4J data from ICU4C
3724 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3725 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
3726 or
3727 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
3728
3729 *** LayoutEngine script information
3730
3731 (For details see the Unicode 5.2 change log below.)
3732
3733 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
3734 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
3735 ScriptRunData.cpp, which is no longer needed.)
3736
3737 The generated files have a current copyright date and "@draft" statement.
3738
3739 * copy the above files into <icu>/source/layout, replacing the old files.
3740 * fix mixed line endings
3741 * review the diffs and fix incorrect @draft and missing aliases;
3742   Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
3743 * manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
3744
3745 ---------------------------------------------------------------------------- ***
3746
3747 Unicode 5.2 update
3748
3749 *** related ICU Trac tickets
3750
3751 7084 Unicode 5.2
3752
3753 7167 verify collation bytes
3754 7235 Java test NAME_ALIAS
3755 7236 Java DerivedCoreProperties.txt test
3756 7237 Java BidiTest.txt
3757 7238 UTrie2 in core unidata
3758 7239 test for tailoring gaps
3759 7240 Java fix CollationMiscTest
3760 7243 update layout engine for Unicode 5.2
3761
3762 *** Unicode version numbers
3763 - makedata.mak
3764 - uchar.h
3765 - configure.in & configure
3766 - update ucdVersion in gennames.c if an algorithmic range changes
3767
3768 *** data files & enums & parser code
3769
3770 * file preparation
3771
3772 python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
3773 - includes finding files regardless of version numbers,
3774   copying them, and performing the equivalent processing of the
3775   ucdstrip and ucdmerge tools on the desired set of files
3776
3777 * notes on changes
3778 - PropertyAliases.txt
3779   moved from numeric to enumerated:
3780     ccc       ; Canonical_Combining_Class
3781   new string properties:
3782     NFKC_CF   ; NFKC_Casefold
3783     Name_Alias; Name_Alias
3784   new binary properties:
3785     Cased     ; Cased
3786     CI        ; Case_Ignorable
3787     CWCF      ; Changes_When_Casefolded
3788     CWCM      ; Changes_When_Casemapped
3789     CWKCF     ; Changes_When_NFKC_Casefolded
3790     CWL       ; Changes_When_Lowercased
3791     CWT       ; Changes_When_Titlecased
3792     CWU       ; Changes_When_Uppercased
3793   new CJK Unihan properties (not supported by ICU)
3794 - PropertyValueAliases.txt
3795   new block names
3796   new scripts
3797   one script code change:
3798     sc ; Qaai      ; Inherited
3799     ->
3800     sc ; Zinh      ; Inherited                        ; Qaai
3801   new Line_Break (lb) value:
3802     lb ; CP        ; Close_Parenthesis
3803   new Joining_Group (jg) values: Farsi_Yeh, Nya
3804   other new values:
3805     ccc; 214; ATA  ; Attached_Above
3806 - DerivedBidiClass.txt
3807   new default-R range: U+1E800 - U+1EFFF
3808 - UnicodeData.txt
3809   all of the ISO comments are gone
3810   new CJK block end:
3811     9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
3812   new CJK block:
3813     2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
3814     2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;
3815
3816 * genpname
3817 - run preparse.pl
3818   + cd \svn\icuproj\icu\trunk\source\tools\genpname
3819   + make sure that data.h is writable
3820   + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
3821   + preparse.pl complains with errors like the following:
3822       Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
3823     This is because ICU 4.0 had scripts from ISO 15924 which are now
3824     added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
3825     and PropertyValueAliases.txt.
3826     -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
3827        Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
3828   + preparse.pl complains with errors about block names missing from uchar.h; add them
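The duplicate script entries that preparse.pl complains about can be listed up front. A sketch, assuming that both SyntheticPropertyValueAliases.txt and PropertyValueAliases.txt use "sc ; Code ; Name" lines; the function names are hypothetical:

```python
def script_codes(lines):
    """Collect the script codes from 'sc ; Code ; Name' alias lines."""
    codes = set()
    for line in lines:
        # Drop trailing comments, then split the semicolon-delimited fields.
        fields = [f.strip() for f in line.split("#", 1)[0].split(";")]
        if len(fields) >= 3 and fields[0] == "sc":
            codes.add(fields[1])
    return codes


def duplicated_scripts(synthetic_lines, official_lines):
    """Script codes defined in both files; candidates for removal
    from the synthetic file."""
    return sorted(script_codes(synthetic_lines) & script_codes(official_lines))


if __name__ == "__main__":
    synthetic = ["sc ; Egyp ; Egyp", "sc ; Bass ; Bass"]  # assumed layout
    official = ["sc ; Egyp      ; Egyptian_Hieroglyphs", "sc ; Latn ; Latin"]
    print(duplicated_scripts(synthetic, official))
```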
3829
3830 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
3831 - new block & script values
3832   + 26 new blocks
3833     copy new blocks from Blocks.txt
3834     MS VC++ 2008 regular expression:
3835       find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
3836       replace with "    UBLOCK_\3 = 172, /*[\1]*/"
3837   + several new script values already added in ICU 4.0 for ISO 15924 coverage
3838     (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
3839   + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
3840   + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
3841     (added to SyntheticPropertyValueAliases.txt)
3842 - new Joining Group (JG) values: Farsi_Yeh, Nya
3843 - new Line_Break (lb) value:
3844     lb ; CP        ; Close_Parenthesis
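The Blocks.txt -> UBLOCK_ regular expression above leaves the block name as written in Blocks.txt. A sketch that also turns the name into an enum-style identifier; the name normalization and the function name are assumptions, and the enum value is passed in by the caller rather than taken from uchar.h:

```python
import re

# Same shape as the MS VC++ find pattern: "XXXX..YYYY; Block Name".
BLOCK_LINE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+); ([A-Z].+)$")


def block_enum_line(blocks_line: str, value: int):
    """Build a uchar.h-style enum line from one Blocks.txt line.

    Returns None for comments and other non-data lines.
    """
    m = BLOCK_LINE.match(blocks_line)
    if m is None:
        return None
    # Assumption: spaces and hyphens become underscores, letters upper case.
    name = re.sub(r"[^A-Za-z0-9]+", "_", m.group(3)).upper()
    return f"    UBLOCK_{name} = {value}, /*[{m.group(1)}]*/"


if __name__ == "__main__":
    # The value 172 mirrors the placeholder in the replace string above.
    print(block_enum_line("A960..A97F; Hangul Jamo Extended-A", 172))
```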
3845
3846 * hardcoded Unihan range end/limit
3847 - Unihan range end moves from 9FC3 to 9FCB
3848   search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
3849   + do change gennames.c
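The case-insensitive search for the old range end and limit can be scripted. A sketch, with a hypothetical helper that scans one file's text:

```python
import re

# Old Unihan range end (9FC3) and limit (9FC4), in either letter case.
UNIHAN_END_OR_LIMIT = re.compile(r"9FC[34]", re.IGNORECASE)


def find_hardcoded_ranges(text: str):
    """Return (line_number, line) pairs that mention the old end or limit."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), 1)
        if UNIHAN_END_OR_LIMIT.search(line)
    ]


if __name__ == "__main__":
    source = "if (c <= 0x9fc3) {\n  x = 1;\n#define LIMIT 0x9FC4\n"
    for number, line in find_hardcoded_ranges(source):
        print(number, line)
```

Each hit still needs a manual decision, since the same hex digits can occur in unrelated constants.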
3850
3851 * Compare definitions of new binary properties with what we used to use
3852   in algorithms, to see if the definitions changed.
3853 - Verified that definitions for Cased and Case_Ignorable are unchanged.
3854   The gencase tool now parses the newly public Case_Ignorable values
3855   in case the definition changes in the future.
3856
3857 * uchar.c & uprops.h & uprops.c & genprops
3858 - new numeric values that didn't exist in Unicode data before:
3859     1/7, 1/9, 1/10, 3/10, 1/16, 3/16
3860   the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
3861   therefore redesign the encoding of numeric types and values for formatVersion 6;
3862   design for simple numbers up to at least 144 ("one gross"),
3863   large values up to at least 10^20,
3864   and fractions with numerators -1..17 and denominators 1..16
3865   to cover current and expected future values
3866   (e.g., more Han numeric values, Meroitic twelfths)
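To illustrate the fraction range named above (numerators -1..17, denominators 1..16), here is a sketch of one possible packing into a small integer. The layout is hypothetical and is not claimed to be the actual uprops.icu formatVersion 6 encoding:

```python
from fractions import Fraction

NUM_MIN, NUM_MAX = -1, 17   # numerator range from the design notes
DEN_MIN, DEN_MAX = 1, 16    # denominator range from the design notes


def pack(numerator: int, denominator: int) -> int:
    """Pack a fraction into one integer (hypothetical layout)."""
    if not (NUM_MIN <= numerator <= NUM_MAX
            and DEN_MIN <= denominator <= DEN_MAX):
        raise ValueError("fraction outside the supported range")
    # 16 denominators per numerator: low 4 bits hold denominator - 1.
    return (numerator - NUM_MIN) * 16 + (denominator - DEN_MIN)


def unpack(code: int) -> Fraction:
    """Inverse of pack()."""
    return Fraction(code // 16 + NUM_MIN, code % 16 + DEN_MIN)


if __name__ == "__main__":
    # The new Unicode 5.2 values listed above.
    for n, d in [(1, 7), (1, 9), (1, 10), (3, 10), (1, 16), (3, 16)]:
        print(f"{n}/{d} -> {pack(n, d)} -> {unpack(pack(n, d))}")
```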
3867
3868 * reimplement Hangul_Syllable_Type for new Jamo characters
3869 - the old code assumed that all Jamo characters are in the 11xx block
3870 - Unicode 5.2 fills holes there and adds new Jamo characters in
3871     A960..A97F; Hangul Jamo Extended-A
3872   and in
3873     D7B0..D7FF; Hangul Jamo Extended-B
3874 - Hangul_Syllable_Type can be trivially derived from a subset of
3875   Grapheme_Cluster_Break values
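The derivation of Hangul_Syllable_Type from Grapheme_Cluster_Break can be sketched as a lookup on short property value names. This is an illustration of the idea, not the ICU implementation:

```python
# GCB values that correspond one-to-one to Hangul_Syllable_Type values.
GCB_TO_HST = {"L": "L", "V": "V", "T": "T", "LV": "LV", "LVT": "LVT"}


def hangul_syllable_type(gcb: str) -> str:
    """Map a Grapheme_Cluster_Break short name to Hangul_Syllable_Type.

    Every other GCB value maps to NA (Not_Applicable).
    """
    return GCB_TO_HST.get(gcb, "NA")


if __name__ == "__main__":
    for gcb in ("L", "LV", "LVT", "CR", "XX"):
        print(gcb, "->", hangul_syllable_type(gcb))
```

Because the lookup depends only on property data, new Jamo in the Extended-A and Extended-B blocks need no code point ranges in the code.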
3876
3877 * build Unicode data source code for hardcoding core data
3878 C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data
3879
3880 ICU data make path is \svn\icuproj\icu\trunk\source\data\
3881 ICU root path is \svn\icuproj\icu\trunk
3882 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
3883 Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
3884 Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
3885 Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
3886 Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
3887 Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
3888 Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
3889 Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
3890 Creating data file for Unicode Property Names
3891 Creating data file for Unicode Character Properties
3892 Creating data file for Unicode Case Mapping Properties
3893 Creating data file for Unicode BiDi/Shaping Properties
3894 Creating data file for Unicode Normalization
3895 Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
3896 Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"
3897
3898 - copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
3899   and rebuild the common library
3900
3901 *** UCA
3902
3903 - update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
3904 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
3905 - update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
3906 [ Begin obsolete instructions:
3907   Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files, not the *_STUB.txt files.
3908     - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
3909       on Windows:
3910         python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
3911         python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
3912   End obsolete instructions]
3913 - run all tests with the *_SHORT.txt or the full files (the full ones have comments)
3914   not just the *_STUB.txt files
3915 - note on intltest: if collate/UCAConformanceTest fails, then
3916   utility/MultithreadTest/TestCollators will fail as well;
3917   fix the conformance test before looking into the multi-thread test
3918
3919 *** Implement Cased & Case_Ignorable properties
3920 - via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
3921 - Problem: These properties should be disjoint, but aren't
3922 - UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
3923 - change ucase.icu to be able to store any combination of Cased and Case_Ignorable
3924
3925 *** Implement Changes_When_Xyz properties
3926 - without stored data
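"Without stored data" works because these properties are defined by comparing the NFD form of a character with its case-mapped NFD form. A sketch using Python's own unicodedata and string case mappings, which track Python's Unicode version rather than ICU's; the function names are hypothetical:

```python
import unicodedata


def _nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


def changes_when_lowercased(ch: str) -> bool:
    """CWL: lowercasing the NFD form changes it."""
    d = _nfd(ch)
    return d.lower() != d


def changes_when_uppercased(ch: str) -> bool:
    """CWU: uppercasing the NFD form changes it."""
    d = _nfd(ch)
    return d.upper() != d


def changes_when_casefolded(ch: str) -> bool:
    """CWCF: case folding the NFD form changes it."""
    d = _nfd(ch)
    return d.casefold() != d


if __name__ == "__main__":
    for ch in ("A", "a", "\u00df", "1"):
        print(repr(ch),
              changes_when_lowercased(ch),
              changes_when_uppercased(ch),
              changes_when_casefolded(ch))
```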
3927
3928 *** Implement Name_Alias property
3929 - add it as another name field in unames.icu
3930 - make it available via u_charName() and UCharNameChoice
3931 - consider it in u_charFromName()
3932
3933 *** Break iterators
3934
3935 * Update break iterator rules to new UAX versions and new property values
3936 * Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary
3937
3938 *** new BidiTest file
3939 - review format and data
3940 - copy BidiTest.txt to source/test/testdata
3941 - write test code using this data
3942 - fix ICU code where it fails the conformance test
3943
3944 *** Java
3945 - generally, find and update code corresponding to C/C++
3946 - UCharacter.UnicodeBlock constants:
3947   a) add an _ID integer per new block, update COUNT
3948   b) add a class instance per new block
3949      Visual Studio regex:
3950         find            UBLOCK_{[^ ]+} = [0-9]+, {/.+}
3951         replace with    public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
3952 - CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()
3953
3954 - port test changes to Java
3955
3956 *** LayoutEngine script information
3957
3958 (For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)
3959
3960 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
3961 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
3962 ScriptRunData.cpp, which is no longer needed.)
3963
3964 The generated files have a current copyright date and "@draft" statement.
3965
3966 -> Eric Mader wrote in email on 20090930:
3967     "I think the tool has been modified to update @draft to @stable for
3968      older scripts and to add @draft for new scripts.
3969      (I worked with an intern on this last year.)
3970      You should check the output after you run it."
3971
3972 * copy the above files into <icu>/source/layout, replacing the old files.
3973 * fix mixed line endings
3974 * review the diffs and fix incorrect @draft and missing aliases
3975 * manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
3976
3977 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
3978 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
3979
3980 -> Eric Mader wrote in email on 20090930:
3981     "This is just a matter of making sure that all the per-script tables have
3982      entries for any new scripts that were added.
3983      If any new Indic characters were added, then the class tables in
3984      IndicClassTables.cpp should be updated to reflect this.
3985      John Emmons should know how to do this if it's required."
3986
3987 * rebuild the layout and layoutex libraries.
3988
3989 *** Documentation
3990 - Update User Guide
3991   + Jamo_Short_Name, sfc->scf, binary property value aliases
3992
3993 ---------------------------------------------------------------------------- ***
3994
3995 Unicode 5.1 update
3996
3997 *** related ICU Trac tickets
3998
3999 5696 Update to Unicode 5.1
4000
4001 *** Unicode version numbers
4002 - makedata.mak
4003 - uchar.h
4004 - configure.in & configure
4005 - update ucdVersion in gennames.c if an algorithmic range changes
4006
4007 *** data files & enums & parser code
4008
4009 * file preparation
4010 - ucdstrip:
4011     DerivedCoreProperties.txt
4012     DerivedNormalizationProps.txt
4013     NormalizationTest.txt
4014     PropList.txt
4015     Scripts.txt
4016     GraphemeBreakProperty.txt
4017     SentenceBreakProperty.txt
4018     WordBreakProperty.txt
4019 - ucdstrip and ucdmerge:
4020     EastAsianWidth.txt
4021     LineBreak.txt
4022
4023 * my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
4024 copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
4025 copy 5.1.0\ucd\Blocks.txt ..\unidata\
4026 copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
4027 copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
4028 copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
4029 copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
4030 copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
4031 copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
4032 copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
4033 copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
4034 copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
4035 copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
4036 copy 5.1.0\ucd\UnicodeData.txt ..\unidata\
4037
4038 ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
4039 ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
4040 ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
4041 ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
4042 ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
4043 ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
4044 ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
4045 ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
4046 ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
4047 ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
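The two filters in the batch file can be pictured as a comment stripper followed by a range merger. A rough sketch of that pipeline; the exact behavior of ucdstrip and ucdmerge is not described in this log, so the details here (dropping whole comment lines, output formatting) are assumptions:

```python
def strip_comments(lines):
    """Drop '#' comments and empty lines (ucdstrip-like, approximate)."""
    out = []
    for line in lines:
        data = line.split("#", 1)[0].rstrip()
        if data:
            out.append(data)
    return out


def merge_ranges(lines):
    """Merge adjacent code point ranges with equal values
    (ucdmerge-like, approximate)."""
    merged = []  # entries are [start, end, value]
    for line in lines:
        rng, value = [f.strip() for f in line.split(";", 1)]
        first, _, last = rng.partition("..")
        start, end = int(first, 16), int(last or first, 16)
        if merged and merged[-1][2] == value and merged[-1][1] + 1 == start:
            merged[-1][1] = end  # extend the previous range
        else:
            merged.append([start, end, value])
    return [
        f"{s:04X};{v}" if s == e else f"{s:04X}..{e:04X};{v}"
        for s, e, v in merged
    ]


if __name__ == "__main__":
    raw = ["0030;NU # DIGIT ZERO", "0031..0039;NU # digits",
           "# comment", "0041;AL"]
    print(merge_ranges(strip_comments(raw)))
```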
4048
4049 * genpname
4050 - run preparse.pl
4051   + cd \svn\icuproj\icu\uni51\source\tools\genpname
4052   + make sure that data.h is writable
4053   + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
4054   + preparse.pl complains with errors like the following:
4055       Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
4056     This is because ICU 3.8 had scripts from ISO 15924 which are now
4057     added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt
4058     and PropertyValueAliases.txt.
4059     -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
4060        Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
4061   + PropertyValueAliases.txt now explicitly contains values for boolean properties:
4062       N/Y, No/Yes, F/T, False/True
4063     -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
4064        It will use further values from the file if present.
4065
4066 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
4067 - new block & script values
4068   + 17 new blocks
4069   + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
4070     (removed from SyntheticPropertyValueAliases.txt)
4071   + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
4072     (added to SyntheticPropertyValueAliases.txt)
4073 - uprops.icu (uprops.h) only provides 7 bits for script codes.
4074   In ICU 4.0 there are now USCRIPT_CODE_LIMIT=130 script codes.
4075   None of the codes above 127 is yet the script code of an
4076   assigned Unicode character, so ICU 4.0 uprops.icu does not store any
4077   script code values greater than 127.
4078   However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
4079   in a parallel bit field, and that overflows now.
4080   Also, future values >=128 would be incompatible anyway.
4081   uprops.h is modified to move around several of the bit fields
4082   in the properties vector words, and now uses 8 bits for the script code.
4083   Two other bit fields also grow to accommodate future growth:
4084   Block (current count: 172) grows from 8 to 9 bits,
4085   and Word_Break grows from 4 to 5 bits.
4086 - renamed property Simple_Case_Folding (sfc->scf)
4087   + nothing to be done: handled as normal alias
4088 - new property JSN Jamo_Short_Name
4089   + no new API: only contributes to the Name property
4090 - new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
4091 - new Joining Group (JG) value: Burushashki_Yeh_Barree
4092 - new Sentence_Break (SB) values:
4093     SB ; CR        ; CR
4094     SB ; EX        ; Extend
4095     SB ; LF        ; LF
4096     SB ; SC        ; SContinue
4097 - new Word_Break (WB) values:
4098     WB ; CR        ; CR
4099     WB ; Extend    ; Extend
4100     WB ; LF        ; LF
4101     WB ; MB        ; MidNumLet
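The bit-field overflow described above for script codes is plain arithmetic. A small sketch; the helper name is hypothetical, and the limit and block count are the values quoted in this section:

```python
def bits_needed(max_value: int) -> int:
    """Smallest bit-field width that can hold values 0..max_value."""
    return max(1, max_value.bit_length())


USCRIPT_CODE_LIMIT = 130  # from the notes above
BLOCK_COUNT = 172         # "current count" from the notes above

if __name__ == "__main__":
    # 7 bits hold 0..127, so the maximum script value 129 no longer fits.
    print("max script value", USCRIPT_CODE_LIMIT - 1,
          "needs", bits_needed(USCRIPT_CODE_LIMIT - 1), "bits")
    # Block values still fit in 8 bits; the move to 9 is headroom.
    print("max block value", BLOCK_COUNT - 1,
          "needs", bits_needed(BLOCK_COUNT - 1), "bits")
```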
4102
4103 * Further changes in the 2008-02-29 update:
4104 - Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
4105   because they should not normally be invisible.
4106 - new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
4107 - new Grapheme_Cluster_Break (GCB) value: PP=Prepend
4108 - new Word_Break (WB) value: NL=Newline
4109
4110 * hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
4111 - Unihan range end moves from 9FBB to 9FC3
4112   search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
4113   + do change gennames.c
4114
4115 * build Unicode data source code for hardcoding core data
4116 C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data
4117
4118 ICU data make path is \svn\icuproj\icu\uni51\source\data\
4119 ICU root path is \svn\icuproj\icu\uni51
4120 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
4121 Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
4122 Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
4123 Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
4124 Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
4125 Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
4126 Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
4127 Creating data file for Unicode Character Properties
4128 Creating data file for Unicode Case Mapping Properties
4129 Creating data file for Unicode BiDi/Shaping Properties
4130 Creating data file for Unicode Normalization
4131 Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
4132 Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"
4133
4134 - copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
4135   and rebuild the common library
4136
4137 *** Break iterators
4138
4139 * Update break iterator rules to new UAX versions and new property values
4140
4141 *** UCA
4142
4143 * update FractionalUCA.txt and UCARules.txt with new canonical closure
4144
4145 *** Test suites
4146 - Test that APIs using Unicode property value aliases (like UnicodeSet)
4147   support all of the boolean values N/Y, No/Yes, F/T, False/True
4148   -> TestBinaryValues() tests in both cintltst and intltest
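The check that all eight boolean aliases are accepted can be sketched as follows. This is an illustration in Python, not the TestBinaryValues() code; the function name is hypothetical and the matching here is only case-insensitive:

```python
_TRUE = {"y", "yes", "t", "true"}
_FALSE = {"n", "no", "f", "false"}


def parse_binary_value(alias: str) -> bool:
    """Map a binary property value alias to a Python bool."""
    key = alias.strip().lower()
    if key in _TRUE:
        return True
    if key in _FALSE:
        return False
    raise ValueError(f"not a binary property value alias: {alias!r}")


if __name__ == "__main__":
    for alias in ("N", "Y", "No", "Yes", "F", "T", "False", "True"):
        print(alias, "->", parse_binary_value(alias))
```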
4149
4150 *** LayoutEngine script information
4151 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
4152 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
4153 ScriptRunData.cpp, which is no longer needed.)
4154
4155 The generated files have a current copyright date and "@draft" statement.
4156
4157 * copy the above files into <icu>/source/layout, replacing the old files.
4158
4159 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
4160 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
4161
4162 * rebuild the layout and layoutex libraries.
4163
4164 *** Documentation
4165 - Update User Guide
4166   + Jamo_Short_Name, sfc->scf, binary property value aliases
4167
4168 ---------------------------------------------------------------------------- ***
4169
4170 Unicode 5.0 update
4171
4172 *** related Jitterbugs
4173
4174 5084 RFE: Update to Unicode 5.0
4175
4176 *** data files & enums & parser code
4177
4178 * file preparation
4179 - ucdstrip:
4180     DerivedCoreProperties.txt
4181     DerivedNormalizationProps.txt
4182     NormalizationTest.txt
4183     PropList.txt
4184     Scripts.txt
4185     GraphemeBreakProperty.txt
4186     SentenceBreakProperty.txt
4187     WordBreakProperty.txt
4188 - ucdstrip and ucdmerge:
4189     EastAsianWidth.txt
4190     LineBreak.txt
4191
4192 * my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
4193 copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
4194 copy 5.0.0\ucd\Blocks.txt ..\unidata\
4195 copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
4196 copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
4197 copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
4198 copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
4199 copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
4200 copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
4201 copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
4202 copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
4203 copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
4204 copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
4205 copy 5.0.0\ucd\UnicodeData.txt ..\unidata\
4206
4207 ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
4208 ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
4209 ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
4210 ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
4211 ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
4212 ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
4213 ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
4214 ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
4215 ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
4216 ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
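
The two preprocessing steps above can be sketched in Python. This is a minimal
illustration, assuming that ucdstrip drops "#" comments and blank lines and that
ucdmerge joins adjacent code point ranges that carry the same value; the real
tools may differ in details such as header handling. The names ucd_strip,
ucd_merge and parse_range are hypothetical, and the sample lines are a made-up
excerpt in EastAsianWidth.txt style.

```python
def ucd_strip(lines):
    """Drop '#' comments and blank lines, keeping only the data fields."""
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if data:
            yield data


def parse_range(field):
    """Parse 'XXXX' or 'XXXX..YYYY' into an inclusive (start, end) pair."""
    start, _, end = field.partition("..")
    return int(start, 16), int(end or start, 16)


def ucd_merge(lines):
    """Merge adjacent ranges that carry identical values (assumed behavior)."""
    merged = []
    for line in lines:
        fields = [f.strip() for f in line.split(";")]
        start, end = parse_range(fields[0])
        value = tuple(fields[1:])
        if merged and merged[-1][2] == value and merged[-1][1] + 1 == start:
            # Same value and directly adjacent: extend the previous range.
            merged[-1] = (merged[-1][0], end, value)
        else:
            merged.append((start, end, value))
    for start, end, value in merged:
        rng = f"{start:04X}" if start == end else f"{start:04X}..{end:04X}"
        yield ";".join((rng,) + value)


sample = [
    "# made-up excerpt in EastAsianWidth.txt style",
    "0020;Na # SPACE",
    "0021..0023;Na # EXCLAMATION MARK..NUMBER SIGN",
    "0024;Na # DOLLAR SIGN",
    "",
    "00A1;A # INVERTED EXCLAMATION MARK",
]
result = list(ucd_merge(ucd_strip(sample)))
print("\n".join(result))
```

Chaining the two generators mirrors the "ucdstrip | ucdmerge" pipe used for
EastAsianWidth.txt and LineBreak.txt.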
4217
4218 * update FractionalUCA.txt and UCARules.txt with new canonical closure
4219
4220 * genpname
4221 - run preparse.pl
4222   + make sure that data.h is writable
4223   + perl preparse.pl \cvs\oss\icu > out.txt
4224
4225 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
4226 - new block & script values
4227   + script values already added in ICU 3.6 because all of ISO 15924 is now covered
4228
4229 * build Unicode data source code for hardcoding core data
4230 C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data
4231
4232 ICU data make path is \cvs\oss\icu\source\data\
4233 ICU root path is \cvs\oss\icu
4234 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
4235 [etc.]
4236 Creating data file for Unicode Character Properties
4237 Creating data file for Unicode Case Mapping Properties
4238 Creating data file for Unicode BiDi/Shaping Properties
4239 Creating data file for Unicode Normalization
4240 Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
4241 Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"
4242
4243 - copy the .c source files to C:\cvs\oss\icu\source\common
4244   and rebuild the common library
4245
4246 *** Unicode version numbers
4247 - makedata.mak
4248 - uchar.h
4249 - configure.in
4250
4251 *** LayoutEngine script information
4252 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguage.h,
4253 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
4254 ScriptRunData.cpp, which is no longer needed.)
4255
4256 The generated files have a current copyright date and "@draft" statement.
4257
4258 * copy the above files into <icu>/source/layout, replacing the old files.
4259
4260 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
4261 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
4262
4263 * rebuild the layout and layoutex libraries.
4264
4265 ---------------------------------------------------------------------------- ***
4266
4267 Unicode 4.1 update
4268
4269 *** related Jitterbugs
4270
4271 4332 RFE: Update to Unicode 4.1
4272 4157 RBBI, TR29 4.1 updates
4273
4274 *** data files & enums & parser code
4275
4276 * file preparation
4277 - ucdstrip:
4278     DerivedCoreProperties.txt
4279     DerivedNormalizationProps.txt
4280     NormalizationTest.txt
4281     GraphemeBreakProperty.txt
4282     SentenceBreakProperty.txt
4283     WordBreakProperty.txt
4284 - ucdstrip and ucdmerge:
4285     EastAsianWidth.txt
4286     LineBreak.txt
4287
4288 * add new files to the repository
4289     GraphemeBreakProperty.txt
4290     SentenceBreakProperty.txt
4291     WordBreakProperty.txt
4292
4293 * update FractionalUCA.txt and UCARules.txt with new canonical closure
4294
4295 * genpname
4296 - handle new enumerated properties in sub read_uchar
4297 - run preparse.pl
4298
4299 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
4300 - new binary properties
4301   + Pattern_Syntax
4302   + Pattern_White_Space
4303 - new enumerated properties
4304   + Grapheme_Cluster_Break
4305   + Sentence_Break
4306   + Word_Break
4307 - new block & script & line break values
4308
4309 * gencase
4310 - case-ignorable changes
4311   see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
4312   now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk
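
The D47a condition above can be written as a small predicate. This is only a
sketch: Python's unicodedata module does not expose Word_Break, so
MID_LETTER_SAMPLE is a hand-picked, incomplete stand-in for the real
Word_Break=MidLetter set, and the module reflects whatever Unicode version the
Python build ships with, not 4.1. The name is_case_ignorable is hypothetical.

```python
import unicodedata

# Assumption: small illustrative subset of Word_Break=MidLetter,
# not the full list from WordBreakProperty.txt.
MID_LETTER_SAMPLE = {0x0027, 0x00B7, 0x2019, 0x2027}

# General categories named in D47a.
CASE_IGNORABLE_GC = {"Mn", "Me", "Cf", "Lm", "Sk"}


def is_case_ignorable(cp, mid_letter=MID_LETTER_SAMPLE):
    """True if cp is Word_Break=MidLetter or has gc in Mn, Me, Cf, Lm, Sk."""
    return cp in mid_letter or unicodedata.category(chr(cp)) in CASE_IGNORABLE_GC


for cp in (0x0027, 0x0301, 0x00AD, 0x0041):
    print(f"U+{cp:04X}", is_case_ignorable(cp))
```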
4313
4314 *** Unicode version numbers
4315 - makedata.mak
4316 - uchar.h
4317 - configure.in
4318
4319 *** tests
4320 - verify that u_charMirror() round-trips
4321 - test all new properties and some new values of old properties
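
The round-trip check can be sketched outside of ICU as follows. This is not the
ICU test itself: mirror() is a hypothetical stand-in for u_charMirror() that
returns the code point unchanged when there is no mapping, and the sample lines
are a small excerpt in BidiMirroring.txt style.

```python
def parse_bidi_mirroring(lines):
    """Read 'code; mirrored-code # comment' lines into a dict."""
    mapping = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        src, dst = (int(f.strip(), 16) for f in data.split(";")[:2])
        mapping[src] = dst
    return mapping


def mirror(mapping, cp):
    """Stand-in for u_charMirror(): unmapped code points map to themselves."""
    return mapping.get(cp, cp)


def round_trip_failures(mapping):
    """Code points whose mirror does not mirror back to them."""
    return [cp for cp in mapping if mirror(mapping, mirror(mapping, cp)) != cp]


sample = [
    "0028; 0029 # LEFT PARENTHESIS",
    "0029; 0028 # RIGHT PARENTHESIS",
    "003C; 003E # LESS-THAN SIGN",
    "003E; 003C # GREATER-THAN SIGN",
]
mapping = parse_bidi_mirroring(sample)
print(round_trip_failures(mapping))
```

An empty list means every mapped code point round-trips; a one-sided entry
(for example only 0028 -> 0029) would be reported.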
4322
4323 *** other code
4324
4325 * hardcoded Unihan range end/limit
4326 - Unihan range end moves from 9FA5 to 9FBB
4327   search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
4328   + do not modify BOCU/BOCSU code because that would change the encoding
4329     and break binary compatibility!
4330   + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
4331     NamePrepProfile.txt
4332   + ignore trietest.c: test data is arbitrary
4333   + ignore tstnorm.cpp: test optimization, not important
4334   + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
4335   + do change line_th.txt and word_th.txt
4336     by replacing hardcoded ranges with the new property values
4337   + do change gennames.c
4338
4339 source\data\brkitr\line_th.txt(229):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
4340 source\data\brkitr\word_th.txt(23):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
4341 source\tools\gennames\gennames.c(971):        0x4e00, 0x9fa5,
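
The search that produced hits like the ones above can be approximated with a
few lines of Python. This is a sketch, not the tool that was actually used;
find_hits is a hypothetical helper and the sample text is made up apart from
the gennames.c line quoted above.

```python
import re

# Matches both the old range end (9FA5) and limit (9FA6), in any letter case.
UNIHAN_END_OR_LIMIT = re.compile(r"9FA[56]", re.IGNORECASE)


def find_hits(name, text):
    """Return (file name, 1-based line number, stripped line) for each match."""
    return [
        (name, number, line.strip())
        for number, line in enumerate(text.splitlines(), 1)
        if UNIHAN_END_OR_LIMIT.search(line)
    ]


sample_text = "static const uint32_t ranges[] = {\n        0x4e00, 0x9fa5,\n};\n"
hits = find_hits("gennames.c", sample_text)
for name, number, line in hits:
    print(f"{name}({number}): {line}")
```

Each hit still needs the manual review described in the list above, since
several of them must deliberately stay unchanged.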
4342
4343 * case mappings
4344 - compare new special casing context conditions with previous ones
4345   see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
4346
4347 * genpname
4348 - consider storing only the short name if it is the same as the long name
4349
4350 *** other reviews
4351 - UAX #29 changes (grapheme/word/sentence breaks)
4352 - UAX #14 changes (line breaks)
4353 - Pattern_Syntax & Pattern_White_Space
4354
4355 ---------------------------------------------------------------------------- ***
4356
4357 Unicode 4.0.1 update
4358
4359 *** related Jitterbugs
4360
4361 3170 RFE: Update to Unicode 4.0.1
4362 3171 Add new Unicode 4.0.1 properties
4363 3520 use Unicode 4.0.1 updates for break iteration
4364
4365 *** data files & enums & parser code
4366
4367 * file preparation
4368 - ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
4369 - ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt
4370
4371 * file fixes
4372 - fix UnicodeData.txt general categories of Ethiopic digits Nd->No
4373   according to PRI #26
4374   http://www.unicode.org/review/resolved-pri.html#pri26
4375 - this fix was undone again because no corrigendum is in sight;
4376   instead, modified the tests to not check consistency on this for Unicode 4.0.1
4377
4378 * ucdterms.txt
4379 - update from http://www.unicode.org/copyright.html
4380   formatted for plain text
4381
4382 * uchar.h & uprops.h & uprops.c & genprops
4383 - add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
4384 - add U_LB_INSEPARABLE due to a spelling fix
4385   + put short name comment only on line with new constant
4386     for genpname perl script parser
4387 - new binary properties
4388   + STerm
4389   + Variation_Selector
4390
4391 * genpname
4392 - fix genpname perl script so that it doesn't choke on more than 2 names per property value
4393 - perl script: correctly calculate the maximum number of fields per row
4394
4395 * uscript.h
4396 - new script code Hrkt=Katakana_Or_Hiragana
4397
4398 * gennorm.c: track changes in DerivedNormalizationProps.txt
4399 - "FNC" -> "FC_NFKC"
4400 - single field "NFD_NO" -> two fields "NFD_QC; N" etc.
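
The two format changes above can be illustrated with a parser that accepts both
the old and the new spelling. This is a Python sketch, not the gennorm.c code;
parse_norm_prop is a hypothetical name, and treating an old "_MAYBE" suffix as
value "M" is an assumption that extends the "etc." above.

```python
# Old property name -> new property name.
RENAMED = {"FNC": "FC_NFKC"}

# Old single-field suffix -> new quick-check value ("_MAYBE" is an assumption).
OLD_SUFFIXES = (("_NO", "N"), ("_MAYBE", "M"))


def parse_norm_prop(line):
    """Return (range, property, value) in the new style, or None for no data."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    rng, name = fields[0], RENAMED.get(fields[1], fields[1])
    value = fields[2] if len(fields) > 2 else None
    if value is None:
        # Old style: one field such as NFD_NO becomes NFD_QC with value N.
        for suffix, qc_value in OLD_SUFFIXES:
            if name.endswith(suffix):
                name, value = name[: -len(suffix)] + "_QC", qc_value
                break
    return rng, name, value


print(parse_norm_prop("00C0..00C5    ; NFD_NO # old style"))
print(parse_norm_prop("00C0..00C5    ; NFD_QC; N # new style"))
print(parse_norm_prop("037A          ; FNC; 0020 03B9"))
```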
4401
4402 * genprops/props2.c: track changes in DerivedNumericValues.txt
4403 - changed from 3 columns to 2, dropping the numeric type
4404   + assume that the type is always numeric for Han characters,
4405     and that only those are added in addition to what UnicodeData.txt lists
4406
4407 *** Unicode version numbers
4408 - makedata.mak
4409 - uchar.h
4410 - configure.in
4411
4412 *** tests
4413 - update test of default bidi classes according to PRI #28
4414   /tsutil/cucdtst/TestUnicodeData
4415   http://www.unicode.org/review/resolved-pri.html#pri28
4416 - bidi tests: change exemplar character for ES depending on Unicode version
4417 - change hardcoded expected property values where they change
4418
4419 *** other code
4420
4421 * name matching
4422 - read UCD.html
4423
4424 * scripts
4425 - use new Hrkt=Katakana_Or_Hiragana
4426
4427 * ZWJ & ZWNJ
4428 - are now part of combining character sequences
4429 - break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ