icuSources/data/unidata/changes.txt

   1 * Copyright (C) 2016 and later: Unicode, Inc. and others.
   2 * License & terms of use: http://www.unicode.org/copyright.html
   3 * Copyright (C) 2004-2016, International Business Machines
   4 * Corporation and others.  All Rights Reserved.
   5 *
   6 *   file name:  changes.txt
   7 *   encoding:   US-ASCII
   8 *   tab size:   8 (not used)
   9 *   indentation:4
  10 *
  11 *   created on: 2004may06
  12 *   created by: Markus W. Scherer
  13 *
  14 * change log for Unicode updates
  15 *
  16 * For each new Unicode version, during the beta period,
  17 * I copy the change log for the previous version to the top of this file.
  18 * I adjust the versions, tickets, URLs, and paths.
  19 * I work my way through the steps listed in the log, top to bottom,
  20 * adjusting the log as necessary.
  21 * I report problems to the UTC and/or CLDR and/or ICU.
  22 * Before the data is final, I "turn the crank" several more times,
  23 * using appropriate subsets of the steps.
  24
  25 ---------------------------------------------------------------------------- ***
  26
  27 * New ISO 15924 script codes
  28
  29 Starting with ICU 55, we do not add UScriptCode constants for new scripts any more
  30 until they are encoded in Unicode,
  31 or can be assumed to be encoded in the next Unicode version.
  32 Script enum constant names should follow the Unicode script property value aliases,
  33 which are assigned only when the scripts are encoded.
  34 When we add constants early and guess the names wrong, then we have confusing enum constants
  35 and have sometimes added aliases.
  36
  37 Variant script codes like Latf and Aran that are not subject to separate encoding
  38 can be added at any time.
  39 (For example, Aran could be added as USCRIPT_ARABIC_NASTALIQ.)
  40
  41 We add script codes used in CLDR or in the spoof checker.
  42 This includes combination/alias codes like Hanb and Jamo.
  43 See http://unicode.org/reports/tr35/#unicode_script_subtag_validity
  44 and look for "alias" on http://unicode.org/iso15924/iso15924-codes.html
  45
  46 We add special Z* script codes like Zsye.
  47
  48 For new script codes see http://www.unicode.org/iso15924/codechanges.html
  49
  50 ---------------------------------------------------------------------------- ***
  51
  52 Unicode 11.0 update for ICU 62
  53
  54 http://www.unicode.org/versions/Unicode11.0.0/
  55 http://unicode.org/versions/beta-11.0.0.html
  56 https://www.unicode.org/review/pri372/
  57 http://www.unicode.org/reports/uax-proposed-updates.html
  58 http://www.unicode.org/reports/tr44/tr44-21.html
  59
  60 * Command-line environment setup
  61
  62 UNICODE_DATA=~/unidata/uni11/20180521
  63 CLDR_SRC=~/svn.cldr/uni
  64 ICU_ROOT=~/svn.icu/uni
  65 ICU_SRC=$ICU_ROOT/src
  66 ICUDT=icudt61b
  67 ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
  68 ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
  69 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
  70
  71 *** ICU Trac
  72
  73 - ticket:13630: Unicode 11
  74 - ^/branches/markus/uni11
  75
  76 *** CLDR Trac
  77
  78 - cldrbug 10978: Unicode 11
  79 - ^/branches/markus/uni11
  80
  81 *** Unicode version numbers
  82 - makedata.mak
  83 - uchar.h
  84 - com.ibm.icu.util.VersionInfo
  85 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
  86
  87 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
  88   so that the makefiles see the new version number.
  89
  90 *** data files & enums & parser code
  91
  92 * download files
  93 - mkdir -p $UNICODE_DATA
  94 - download Unicode files into $UNICODE_DATA
  95   + subfolders: emoji, idna, security, ucd, uca
  96   + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
  97
  98 * for manual diffs and for Unicode Tools input data updates:
  99   remove version suffixes from the file names
 100     ~$ unidata/desuffixucd.py $UNICODE_DATA
 101   (see https://sites.google.com/site/unicodetools/inputdata)
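
A minimal sketch of the suffix removal, not the actual desuffixucd.py.
It assumes the suffix looks like "-11.0.0" or "-11.0.0d13" directly before the file extension;
strip_version_suffix() and plan_renames() are hypothetical names.

```python
import re

# Assumed suffix shape: "-<major>.<minor>.<update>" plus an optional draft tag
# like "d13", directly before the file extension.
_VERSION_SUFFIX = re.compile(r"-\d+\.\d+\.\d+(?:d\d+)?(?=\.[A-Za-z0-9]+$)")


def strip_version_suffix(file_name):
    """Return file_name without a Unicode version suffix, if it has one."""
    return _VERSION_SUFFIX.sub("", file_name)


def plan_renames(file_names):
    """Map each suffixed name to its unsuffixed name; skip unchanged names."""
    renames = {}
    for name in file_names:
        new_name = strip_version_suffix(name)
        if new_name != name:
            renames[name] = new_name
    return renames
```

plan_renames() only computes the mapping; a caller would apply it with os.rename() after checking for name collisions.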
 102
 103 * process and/or copy files
 104 - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
 105   + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
 106   + For debugging, and tweaking how ppucd.txt is written,
 107     the tool has an --only_ppucd option:
 108     py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
 109
 110 - cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
 111
 112 * build ICU (make install)
 113   so that the tools build can pick up the new definitions from the installed header files.
 114
 115   $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
 116
 117 * preparseucd.py changes
 118 - fix other errors
 119     NameError: unknown property Extended_Pictographic
 120   -> add Extended_Pictographic binary property
 121   -> add new short names for all Emoji properties
 122
 123 * new constants for new property values
 124 - preparseucd.py error:
 125     ValueError: missing uchar.h enum constants for some property values:
 126     [(u'blk', set([u'Georgian_Ext', u'Hanifi_Rohingya', u'Medefaidrin', u'Sogdian', u'Makasar',
 127                    u'Old_Sogdian', u'Dogra', u'Gunjala_Gondi', u'Chess_Symbols', u'Mayan_Numerals',
 128                    u'Indic_Siyaq_Numbers'])),
 129      (u'jg', set([u'Hanifi_Rohingya_Kinna_Ya', u'Hanifi_Rohingya_Pa'])),
 130      (u'sc', set([u'Medf', u'Sogd', u'Dogr', u'Rohg', u'Maka', u'Sogo', u'Gong'])),
 131      (u'GCB', set([u'LinkC', u'Virama'])),
 132      (u'WB', set([u'WSegSpace']))]
 133   = PropertyValueAliases.txt new property values (diff old & new .txt files)
 134     blk; Chess_Symbols                    ; Chess_Symbols
 135     blk; Dogra                            ; Dogra
 136     blk; Georgian_Ext                     ; Georgian_Extended
 137     blk; Gunjala_Gondi                    ; Gunjala_Gondi
 138     blk; Hanifi_Rohingya                  ; Hanifi_Rohingya
 139     blk; Indic_Siyaq_Numbers              ; Indic_Siyaq_Numbers
 140     blk; Makasar                          ; Makasar
 141     blk; Mayan_Numerals                   ; Mayan_Numerals
 142     blk; Medefaidrin                      ; Medefaidrin
 143     blk; Old_Sogdian                      ; Old_Sogdian
 144     blk; Sogdian                          ; Sogdian
 145   -> add to uchar.h
 146     use long property names for enum constants,
 147     for the trailing comment get the block start code point: diff old & new Blocks.txt
 148   -> add to UCharacter.UnicodeBlock IDs
 149     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
 150             replace  public static final int \1_ID = \2; \3
 151   -> add to UCharacter.UnicodeBlock objects
 152     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
 153             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
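
The two Eclipse find/replace steps above can also be scripted. A sketch using the same
regular expressions with Python's re module; the function names are hypothetical, and the
numeric value and code point in the usage line are illustrative, not copied from uchar.h.

```python
import re

# Same patterns as in the Eclipse find/replace steps.
ID_PATTERN = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
OBJECT_PATTERN = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")


def to_java_id(line):
    """Turn one uchar.h UBLOCK_ enum line into a UnicodeBlock *_ID constant."""
    return ID_PATTERN.sub(r"public static final int \1_ID = \2; \3", line.strip())


def to_java_object(line):
    """Turn one uchar.h UBLOCK_ enum line into a UnicodeBlock object declaration."""
    return OBJECT_PATTERN.sub(
        r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
        line.strip())
```

For example, to_java_id("    UBLOCK_DOGRA = 293, /*[11800]*/") yields a DOGRA_ID declaration
that keeps the trailing comment. Lines that do not match the pattern come back unchanged (stripped).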
 154
 155     GCB; LinkC                            ; LinkingConsonant
 156     GCB; Virama                           ; Virama
 157   -> uchar.h & UCharacter.GraphemeClusterBreak
 158   -> these two later removed again: http://www.unicode.org/L2/L2018/18115.htm#155-A76
 159
 160     InSC; Consonant_Initial_Postfixed     ; Consonant_Initial_Postfixed
 161   -> ignore: ICU does not yet support this property
 162
 163     jg ; Hanifi_Rohingya_Kinna_Ya         ; Hanifi_Rohingya_Kinna_Ya
 164     jg ; Hanifi_Rohingya_Pa               ; Hanifi_Rohingya_Pa
 165   -> uchar.h & UCharacter.JoiningGroup
 166
 167     sc ; Dogr                             ; Dogra
 168     sc ; Gong                             ; Gunjala_Gondi
 169     sc ; Maka                             ; Makasar
 170     sc ; Medf                             ; Medefaidrin
 171     sc ; Rohg                             ; Hanifi_Rohingya
 172     sc ; Sogd                             ; Sogdian
 173     sc ; Sogo                             ; Old_Sogdian
 174   -> uscript.h & com.ibm.icu.lang.UScript
 175   -> Nushu had been added already
 176   -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
 177       and in com.ibm.icu.dev.test.lang.TestUScript.java
 178
 179     WB ; WSegSpace                        ; WSegSpace
 180   -> uchar.h & UCharacter.WordBreak
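
The "diff old & new .txt files" step for PropertyValueAliases.txt can be approximated with a
set difference. This is a sketch, not part of preparseucd.py; parse_aliases() and new_values()
are hypothetical names, and only the first three fields of each line are compared
(so lines with a different layout, such as ccc with its numeric field, are handled only roughly).

```python
def parse_aliases(text):
    """Return a set of (property, short name, long name) tuples."""
    entries = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) >= 3:
            entries.add((fields[0], fields[1], fields[2]))
    return entries


def new_values(old_text, new_text):
    """Return entries that are in new_text but not in old_text, sorted."""
    return sorted(parse_aliases(new_text) - parse_aliases(old_text))
```

Usage: read the old and new PropertyValueAliases.txt into strings and print new_values(old, new);
each tuple is a candidate for a new enum constant.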
 181
 182 * New short names for emoji properties
 183 - see UTS #51
 184 - short names set in preparseucd.py
 185
 186 * New properties
 187 - boolean emoji property Extended_Pictographic
 188   -> added in preparseucd.py
 189   -> uchar.h & UProperty.java
 190 - misc. property Equivalent_Unified_Ideograph (EqUIdeo)
 191   as shown in PropertyValueAliases.txt
 192   -> ignore for now
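
For checking a new binary property by hand, a small parser for UCD-style
"range ; property # comment" lines may help. This is a sketch under the assumption that
the emoji data file uses that layout; it is not how preparseucd.py reads the file, and
parse_binary_property() is a hypothetical name.

```python
def parse_binary_property(text, property_name):
    """Return (start, end) code point ranges listed for property_name."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2 or fields[1] != property_name:
            continue
        start, _, end = fields[0].partition("..")
        # A single code point has no ".." part; use it as both ends.
        ranges.append((int(start, 16), int(end or start, 16)))
    return ranges
```

Usage: parse_binary_property(text, "Extended_Pictographic") on the file contents returns the
ranges in file order, which can then be compared with what the built data reports.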
 193
 194 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
 195     (not strictly necessary for NOT_ENCODED scripts)
 196   $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
 197
 198 * update spoof checker UnicodeSet initializers:
 199     inclusionPat & recommendedPat in uspoof.cpp
 200     INCLUSION & RECOMMENDED in SpoofChecker.java
 201 - make sure that the Unicode Tools tree contains the latest security data files
 202 - go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator
 203 - update the hardcoded version number there in the DIRECTORY path
 204 - run the tool (no special environment variables needed)
 205 - copy & paste from the Console output into the .cpp & .java files
 206
 207 * generate normalization data files
 208   cd $ICU_ROOT/dbg/icu4c
 209   bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
 210   bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm     -s $ICU4C_UNIDATA/norm2 nfc.txt
 211   bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm    -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
 212   bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
 213   bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm   -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
 214
 215 * build ICU (make install)
 216   so that the tools build can pick up the new definitions from the installed header files.
 217
 218   $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
 219
 220 * build Unicode tools using CMake+make
 221
 222 $ICU_SRC/tools/unicode/c/icudefs.txt:
 223
 224 # Location (--prefix) of where ICU was installed.
 225 set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
 226 # Location of the ICU4C source tree.
 227 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c)
 228
 229   $ICU_ROOT/dbg$
 230     mkdir -p tools/unicode/c
 231     cd tools/unicode/c
 232
 233   $ICU_ROOT/dbg/tools/unicode/c$
 234     cmake ../../../../src/tools/unicode/c
 235     make
 236
 237 * generate core properties data files
 238   $ICU_ROOT/dbg/tools/unicode/c$
 239     genprops/genprops $ICU_SRC/icu4c
 240     genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
 241     genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
 242 - rebuild ICU (make install) & tools
 243
 244 * Fix case props
 245     genprops error: casepropsbuilder: too many exceptions words
 246     genprops error: failure finalizing the data - U_BUFFER_OVERFLOW_ERROR
 247 - With the addition of Georgian Mtavruli capital letters,
 248   there are now too many simple case mappings with big mapping deltas
 249   that yield uncompressible exceptions.
 250 - Changing the data structure (now formatVersion 4),
 251   adding one bit for no-simple-case-folding (for Cherokee), and
 252   one optional slot for a big delta (for most faraway mappings),
 253   together with another bit for whether that is negative.
 254   This makes most Cherokee & Georgian etc. case mappings compressible,
 255   reducing the number of exceptions words.
 256 - Further changes to gain one more bit for the exceptions index,
 257   for future growth. For details see casepropsbuilder.cpp.
 258
 259 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
 260   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
 261 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
 262 - Unicode 6.0..11.0: U+2260, U+226E, U+226F
 263 - nothing new in this Unicode version, no test file to update
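
The grep step can be sketched in Python as follows. It assumes IdnaMappingTable.txt lines of the
form "code point or range ; status ; ... # comment"; non_ascii_std3_valid() is a hypothetical name.

```python
def non_ascii_std3_valid(text):
    """Return non-ASCII (start, end) ranges with status disallowed_STD3_valid."""
    found = []
    for line in text.splitlines():
        data = line.split("#", 1)[0]  # drop comments
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        first, last = int(start, 16), int(end or start, 16)
        if last > 0x7F:
            # Clip ranges that begin in ASCII to their non-ASCII part.
            found.append((max(first, 0x80), last))
    return found
```

If the result contains anything beyond U+2260, U+226E and U+226F, the IDNA tests probably need new cases.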
 264
 265 * run & fix ICU4C tests
 266 - Andy handles RBBI & spoof check test failures
 267
 268 - Errors in char.txt, word.txt, word_POSIX.txt like
 269     createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET"  at line 46, column 16
 270   because \p{Grapheme_Cluster_Break = EBG} and \p{Word_Break = EBG} are empty.
 271   -> Temporary(!) workaround: Add an arbitrary code point to these sets to make them
 272      not empty, just to get ICU building.
 273   -> Intermediate workaround: Remove $E_Base_GAZ and other now-unused variables
 274      and properties together with the rules that used them (GB 10, WB 14).
 275   -> Andy adjusts the rule sets further to sync with
 276      Unicode 11 grapheme, word, and line break spec changes.
 277
 278 * collation: CLDR collation root, UCA DUCET
 279
 280 - UCA DUCET goes into Mark's Unicode tools, see
 281     https://sites.google.com/site/unicodetools/home#TOC-UCA
 282   diff the main mapping file, look for bad changes
 283   (for example, more bytes per weight for common characters)
 284     ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/uca/11.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-11.txt
 285     ~/svn.unitools/trunk$ meld ../frac-10.txt ../frac-11.txt
 286
 287 - CLDR root data files are checked into $CLDR_SRC/common/uca/
 288     cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
 289
 290 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
 291     cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
 292 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
 293     cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
 294     (note removing the underscore before "Rules")
 295     cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
 296 - restore TODO diffs in UCARules.txt
 297     meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
 298 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
 299   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
 300   from the CLDR root files (..._CLDR_..._SHORT.txt)
 301     cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
 302     cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
 303     cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
 304 - if CLDR common/uca/unihan-index.txt changes, then update
 305   CLDR common/collation/root.xml <collation type="private-unihan">
 306   and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
 307
 308 - run genuca, see command line above;
 309   deal with
 310     Error: Unknown script for first-primary sample character U+1180B on line 28649 of /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:
 311     FDD1 1180B; [71 CC 02, 05, 05]      # Dogra first primary (compressible)
 312         (add the character to genuca.cpp sampleCharsToScripts[])
 313   + look up the USCRIPT_ code for the new sample characters
 314     (should be obvious from the comment in the error output)
 315   + *add* mappings to sampleCharsToScripts[], do not replace them
 316     (in case the script sample characters flip-flop)
 317   + insert new scripts in DUCET script order, see the top_byte table
 318     at the beginning of FractionalUCA.txt
 319 - rebuild ICU4C
 320
 321 * Unihan collators
 322     https://sites.google.com/site/unicodetools/unihan
 323 - run Unicode Tools
 324     org.unicode.draft.GenerateUnihanCollators
 325   with VM arguments
 326     -ea
 327     -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
 328     -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
 329     -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
 330     -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
 331     -DUVERSION=11.0.0
 332 - run Unicode Tools
 333     org.unicode.draft.GenerateUnihanCollatorFiles
 334   with the same arguments
 335 - check CLDR diffs
 336     cd $CLDR_SRC
 337     meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
 338     meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
 339 - copy to CLDR
 340     cd $CLDR_SRC
 341     cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
 342     cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
 343 - run CLDR unit tests, commit to CLDR
 344 - generate ICU zh collation data: run CLDR
 345     org.unicode.cldr.icu.NewLdml2IcuConverter
 346   with program arguments
 347     -t collation
 348     -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation
 349     -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental
 350     -d /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/coll
 351     -p /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/xml/collation
 352     zh
 353   and VM arguments
 354     -ea
 355     -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni
 356 - rebuild ICU4C
 357
 358 * run & fix ICU4C tests, now with new CLDR collation root data
 359 - run all tests with the collation test data *_SHORT.txt or the full files
 360   (the full ones have comments, useful for debugging)
 361 - note on intltest: if collate/UCAConformanceTest fails, then
 362   utility/MultithreadTest/TestCollators will fail as well;
 363   fix the conformance test before looking into the multi-thread test
 364
 365 * update Java data files
 366 - refresh just the UCD/UCA-related/derived files, just to be safe
 367 - see (ICU4C)/source/data/icu4j-readme.txt
 368 - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 369 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 370   output:
 371     ...
 372     Unicode .icu files built to ./out/build/icudt61l
 373     echo timestamp > uni-core-data
 374     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt61b
 375     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b
 376     echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
 377     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt61l.dat ./out/icu4j/icudt61b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt61l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt61b
 378     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b"
 379     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt61b/
 380     mkdir -p /tmp/icu4j/main/shared/data
 381     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
 382     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt61b/
 383     mkdir -p /tmp/icu4j/main/shared/data
 384     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
 385     make[1]: Leaving directory '/usr/local/google/home/mscherer/svn.icu/uni/dbg/icu4c/data'
 386 - copy the big-endian Unicode data files to another location,
 387   separate from the other data files,
 388   and then refresh ICU4J
 389     cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
 390     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
 391     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
 392     cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 393     cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 394     rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
 395     cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 396     cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
 397     cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
 398     jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
 399
 400 * When refreshing all of ICU4J data from ICU4C
 401 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 402 - cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
 403 or
 404 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
 405
 406 * update CollationFCD.java
 407   + copy & paste the initializers of lcccIndex[] etc. from
 408     ICU4C/source/i18n/collationfcd.cpp to
 409     ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
 410
 411 * refresh Java test .txt files
 412 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
 413     cd $ICU_SRC/icu4c/source/data/unidata
 414     cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
 415     cd ../../test/testdata
 416     cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
 417     cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
 418
 419 * run & fix ICU4J tests
 420
 421 *** API additions
 422 - send notice to icu-design about new born-@stable API (enum constants etc.)
 423
 424 *** CLDR numbering systems
 425 - look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
 426   Unicode 11: using Unicode 11 CLDR ticket #10978
 427     rohg 10D30..10D39 Hanifi_Rohingya
 428     gong 11DA0..11DA9 Gunjala_Gondi
 429   Earlier: CLDR tickets specific to adding new numbering systems.
 430   Unicode 10: http://unicode.org/cldr/trac/ticket/10219
 431   Unicode 9: http://unicode.org/cldr/trac/ticket/9692
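
One way to list the decimal digit sets is to look for gc=Nd characters with numeric value 4,
as above, and derive each range from them. A sketch using Python's unicodedata module; it
assumes that each set is a contiguous run 0..9, and decimal_digit_ranges() is a hypothetical name.

```python
import unicodedata


def decimal_digit_ranges():
    """Return (start, end) ranges of decimal digit sets known to this Python."""
    ranges = []
    for cp in range(0x110000):
        ch = chr(cp)
        # One hit per digit set: the character with decimal value 4.
        if unicodedata.category(ch) == "Nd" and unicodedata.decimal(ch, None) == 4:
            # Assumes digits 0..9 are contiguous, so the set starts 4 below.
            ranges.append((cp - 4, cp + 5))
    return ranges
```

The result reflects the Unicode version built into the Python runtime (see unicodedata.unidata_version),
so new sets only show up when that version is new enough. Comparing the output of two runtimes,
or against the CLDR numbering systems, shows which sets still need to be added.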
 432
 433 *** merge the Unicode update branches back onto the trunk
 434 - do not merge the icudata.jar and testdata.jar,
 435   instead rebuild them from merged & tested ICU4C
 436 - make sure that changes to Unicode tools are checked in:
 437   http://www.unicode.org/utility/trac/log/trunk/unicodetools
 438
 439 ---------------------------------------------------------------------------- ***
 440
 441 Unicode 10.0 update for ICU 60
 442
 443 http://www.unicode.org/versions/Unicode10.0.0/
 444 http://www.unicode.org/versions/beta-10.0.0.html
 445 http://blog.unicode.org/2017/03/unicode-100-beta-review.html
 446 http://www.unicode.org/review/pri350/
 447 http://www.unicode.org/reports/uax-proposed-updates.html
 448 http://www.unicode.org/reports/tr44/tr44-19.html
 449
 450 * Command-line environment setup
 451
 452 UNICODE_DATA=~/unidata/uni10/20170605
 453 CLDR_SRC=~/svn.cldr/uni10
 454 ICU_ROOT=~/svn.icu/uni10
 455 ICU_SRC=$ICU_ROOT/src
 456 ICUDT=icudt60b
 457 ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
 458 ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
 459 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib
 460
 461 *** ICU Trac
 462
 463 - ticket:12985: Unicode 10
 464 - ticket:13061: undo hacks from emoji 5.0 update
 465 - ticket:13062: add Emoji_Component property
 466 - ^/branches/markus/uni10
 467
 468 *** CLDR Trac
 469
 470 - cldrbug 10055: Unicode 10
 471 - cldrbug 9882: Unicode 10 script metadata
 472 - cldrbug 10219: numbering systems for Unicode 10
 473
 474 *** Unicode version numbers
 475 - makedata.mak
 476 - uchar.h
 477 - com.ibm.icu.util.VersionInfo
 478 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
 479
 480 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
 481   so that the makefiles see the new version number.
 482
 483 *** data files & enums & parser code
 484
 485 * download files
 486 - mkdir -p $UNICODE_DATA
 487 - download Unicode 10.0 files into $UNICODE_DATA
 488   + subfolders: ucd, uca, idna, security
 489   + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
 490 - download emoji 5.0 files into $UNICODE_DATA/emoji
 491
 492 * for manual diffs: remove version suffixes from the file names
 493   ~$ unidata/desuffixucd.py $UNICODE_DATA
 494   (see https://sites.google.com/site/unicodetools/inputdata)
 495
 496 * process and/or copy files
 497 - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC
 498   + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
 499   + For debugging, and tweaking how ppucd.txt is written,
 500     the tool has an --only_ppucd option:
 501     py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile
 502
 503 - cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA
 504
 505 * build ICU (make install)
 506   so that the tools build can pick up the new definitions from the installed header files.
 507
 508   $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
 509
 510 * preparseucd.py changes
 511 - remove or add new Unicode scripts from/to the
 512   only-in-ISO-15924 list according to the error messages:
 513     ValueError: remove ['Nshu'] from _scripts_only_in_iso15924
 514   -> adjust _scripts_only_in_iso15924 as indicated
 515 - fix other errors
 516     Exception: no default values (@missing lines) for some Catalog or Enumerated properties: [u'vo']
 517   -> add vo=Vertical_Orientation to _ignored_properties
 518   -> later removed again, parsing the file, even though we do not yet store data for runtime use
 519
 520 * new constants for new property values
 521 - preparseucd.py error:
 522     ValueError: missing uchar.h enum constants for some property values:
 523     [(u'blk', set([u'Zanabazar_Square', u'Nushu', u'CJK_Ext_F',
 524                    u'Kana_Ext_A', u'Syriac_Sup', u'Masaram_Gondi', u'Soyombo'])),
 525      (u'jg', set([u'Malayalam_Bha', u'Malayalam_Llla', u'Malayalam_Nya', u'Malayalam_Lla',
 526                   u'Malayalam_Nga', u'Malayalam_Ssa', u'Malayalam_Tta', u'Malayalam_Ra',
 527                   u'Malayalam_Nna', u'Malayalam_Ja', u'Malayalam_Nnna'])),
 528      (u'sc', set([u'Soyo', u'Gonm', u'Zanb']))]
 529   = PropertyValueAliases.txt new property values (diff old & new .txt files)
 530     blk; CJK_Ext_F                        ; CJK_Unified_Ideographs_Extension_F
 531     blk; Kana_Ext_A                       ; Kana_Extended_A
 532     blk; Masaram_Gondi                    ; Masaram_Gondi
 533     blk; Nushu                            ; Nushu
 534     blk; Soyombo                          ; Soyombo
 535     blk; Syriac_Sup                       ; Syriac_Supplement
 536     blk; Zanabazar_Square                 ; Zanabazar_Square
 537   -> add to uchar.h
 538     use long property names for enum constants,
 539     for the trailing comment get the block start code point: diff old & new Blocks.txt
 540   -> add to UCharacter.UnicodeBlock IDs
 541     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
 542             replace  public static final int \1_ID = \2; \3
 543   -> add to UCharacter.UnicodeBlock objects
 544     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
 545             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
 546
 547     jg ; Malayalam_Bha                    ; Malayalam_Bha
 548     jg ; Malayalam_Ja                     ; Malayalam_Ja
 549     jg ; Malayalam_Lla                    ; Malayalam_Lla
 550     jg ; Malayalam_Llla                   ; Malayalam_Llla
 551     jg ; Malayalam_Nga                    ; Malayalam_Nga
 552     jg ; Malayalam_Nna                    ; Malayalam_Nna
 553     jg ; Malayalam_Nnna                   ; Malayalam_Nnna
 554     jg ; Malayalam_Nya                    ; Malayalam_Nya
 555     jg ; Malayalam_Ra                     ; Malayalam_Ra
 556     jg ; Malayalam_Ssa                    ; Malayalam_Ssa
 557     jg ; Malayalam_Tta                    ; Malayalam_Tta
 558   -> uchar.h & UCharacter.JoiningGroup
 559
 560     sc ; Gonm                             ; Masaram_Gondi
 561     sc ; Nshu                             ; Nushu
 562     sc ; Soyo                             ; Soyombo
 563     sc ; Zanb                             ; Zanabazar_Square
 564   -> uscript.h & com.ibm.icu.lang.UScript
 565   -> Nushu had been added already
 566   -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
 567       and in com.ibm.icu.dev.test.lang.TestUScript.java
 568
 569 * New properties as shown in PropertyValueAliases.txt changes
 570 - boolean Emoji_Component from emoji 5
 571   -> uchar.h & UProperty.java
 572 - boolean
 573     # Regional_Indicator (RI)
 574
 575     RI ; N                                ; No                               ; F                                ; False
 576     RI ; Y                                ; Yes                              ; T                                ; True
 577   -> uchar.h & UProperty.java
 578   -> single immutable range, to be hardcoded
 579 - boolean
 580     # Prepended_Concatenation_Mark (PCM)
 581
 582     PCM; N                                ; No                               ; F                                ; False
 583     PCM; Y                                ; Yes                              ; T                                ; True
 584   -> was new in Unicode 9
 585   -> uchar.h & UProperty.java
 586 - enumerated
 587     # Vertical_Orientation (vo)
 588
 589     vo ; R                                ; Rotated
 590     vo ; Tr                               ; Transformed_Rotated
 591     vo ; Tu                               ; Transformed_Upright
 592     vo ; U                                ; Upright
 593   -> only pre-parsed for now, but not yet stored for runtime use
 594
 595 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
 596     (not strictly necessary for NOT_ENCODED scripts)
 597   $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt
 598
 599 * generate normalization data files
 600   cd $ICU_ROOT/dbg/icu4c
 601   bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
 602   bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm     -s $ICU4C_UNIDATA/norm2 nfc.txt
 603   bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm    -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
 604   bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
 605   bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm   -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
 606
 607 * build ICU (make install)
 608   so that the tools build can pick up the new definitions from the installed header files.
 609
 610   $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
 611
 612 * build Unicode tools using CMake+make
 613
 614 $ICU_SRC/tools/unicode/c/icudefs.txt:
 615
 616 # Location (--prefix) of where ICU was installed.
 617 set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
 618 # Location of the ICU4C source tree.
 619 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c)
 620
 621   $ICU_ROOT/dbg/tools/unicode/c$
 622     cmake ../../../../src/tools/unicode/c
 623     make
 624
 625 * generate core properties data files
 626   $ICU_ROOT/dbg/tools/unicode/c$
 627     genprops/genprops $ICU_SRC/icu4c
 628     genuca/genuca --hanOrder implicit $ICU_SRC/icu4c
 629     genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c
 630 - rebuild ICU (make install) & tools
 631
 632 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
 633   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
 634 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
 635 - Unicode 6.0..10.0: U+2260, U+226E, U+226F
 636 - nothing new in this Unicode version, no test file to update
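
  The grep step above can be approximated with a short Python sketch. It assumes
  data lines of the form "code point or range ; status ...", which is how I read
  IdnaMappingTable.txt. The sample lines are illustrative and not copied from a
  release, and the function name is made up for this note.

```python
import re

# A data line starts with a code point or a range, then "; status".
_LINE = re.compile(r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)")

def non_ascii_std3_valid(lines):
    """Return (start, end) ranges above U+007F with status disallowed_STD3_valid."""
    ranges = []
    for line in lines:
        m = _LINE.match(line)
        if not m or m.group(3) != "disallowed_STD3_valid":
            continue
        start = int(m.group(1), 16)
        end = int(m.group(2) or m.group(1), 16)
        if end > 0x7F:
            # Clip a range that starts in ASCII so that only the non-ASCII part is reported.
            ranges.append((max(start, 0x80), end))
    return ranges

# Illustrative input in the assumed layout.
SAMPLE = [
    "# comment line",
    "0000..002C    ; disallowed_STD3_valid                  # <control>..COMMA",
    "002D..002E    ; valid                                  # HYPHEN-MINUS..FULL STOP",
    "2260          ; disallowed_STD3_valid                  # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid                  # NOT LESS-THAN..NOT GREATER-THAN",
]

for start, end in non_ascii_std3_valid(SAMPLE):
    print("U+%04X..U+%04X" % (start, end))
```

  Any range beyond U+2260, U+226E, U+226F in the real file would mean new test
  cases for uts46test.cpp and UTS46Test.java.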
 637
 638 * run & fix ICU4C tests
 639 - Andy handles RBBI & spoof check test failures
 640
 641 * collation: CLDR collation root, UCA DUCET
 642
 643 - UCA DUCET goes into Mark's Unicode tools, see
 644   https://sites.google.com/site/unicodetools/home#TOC-UCA
 645 - CLDR root data files are checked into $CLDR_SRC/common/uca/
 646     cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/
 647
 648 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
 649     cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt
 650 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
 651     cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt
 652     (note removing the underscore before "Rules")
 653     cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt
 654 - restore TODO diffs in UCARules.txt
 655     meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt
 656 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
 657   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
 658   from the CLDR root files (..._CLDR_..._SHORT.txt)
 659     cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
 660     cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
 661     cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data
 662 - if CLDR common/uca/unihan-index.txt changes, then update
 663   CLDR common/collation/root.xml <collation type="private-unihan">
 664   and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt
 665
 666 - run genuca, see command line above;
 667   deal with
 668     Error: Unknown script for first-primary sample character U+11D10 on line 28117 of /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/unidata/FractionalUCA.txt:
 669     FDD1 11D10;     [70 D5 02, 05, 05]      # Masaram_Gondi first primary (compressible)
 670         (add the character to genuca.cpp sampleCharsToScripts[])
 671   + look up the USCRIPT_ code for the new sample characters
 672     (should be obvious from the comment in the error output)
 673   + *add* mappings to sampleCharsToScripts[], do not replace them
 674     (in case the script sample characters flip-flop)
 675   + insert new scripts in DUCET script order, see the top_byte table
 676     at the beginning of FractionalUCA.txt
 677 - rebuild ICU4C
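
  To look up the script for a new sample character, the FractionalUCA.txt line
  quoted in the error can be picked apart mechanically. A minimal sketch,
  assuming first-primary lines always have the shape shown above. The upper-cased
  USCRIPT_ name is only a guess to check against uscript.h, and the function
  name is made up for this note.

```python
import re

# Shape assumed from the FractionalUCA.txt line quoted in the log.
_FIRST_PRIMARY = re.compile(
    r"^FDD1 ([0-9A-F]{4,6});\s*\[[^\]]*\]\s*#\s*(\S+) first primary")

def first_primary_sample(line):
    """Return (code point, script name from the comment), or None if the line does not match."""
    m = _FIRST_PRIMARY.match(line)
    if m is None:
        return None
    return int(m.group(1), 16), m.group(2)

line = "FDD1 11D10;     [70 D5 02, 05, 05]      # Masaram_Gondi first primary (compressible)"
cp, script = first_primary_sample(line)
# The constant name is a guess; confirm it in uscript.h before editing genuca.cpp.
print("U+%04X -> USCRIPT_%s ?" % (cp, script.upper()))
```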
 678
 679 * Unihan collators
 680     https://sites.google.com/site/unicodetools/unihan
 681 - run Unicode Tools
 682     org.unicode.draft.GenerateUnihanCollators
 683   with VM arguments
 684     -ea
 685     -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk
 686     -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools
 687     -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data
 688     -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
 689     -DUVERSION=10.0.0
 690 - run Unicode Tools
 691     org.unicode.draft.GenerateUnihanCollatorFiles
 692   with the same arguments
 693 - check CLDR diffs
 694     cd $CLDR_SRC
 695     meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
 696     meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
 697 - copy to CLDR
 698     cd $CLDR_SRC
 699     cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
 700     cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
 701 - run CLDR unit tests, commit to CLDR
 702 - generate ICU zh collation data: run CLDR
 703     org.unicode.cldr.icu.NewLdml2IcuConverter
 704   with program arguments
 705     -t collation
 706     -s /usr/local/google/home/mscherer/svn.cldr/uni10/common/collation
 707     -m /usr/local/google/home/mscherer/svn.cldr/uni10/common/supplemental
 708     -d /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/coll
 709     -p /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/xml/collation
 710     zh
 711   and VM arguments
 712     -ea
 713     -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10
 714 - rebuild ICU4C
 715
 716 * run & fix ICU4C tests, now with new CLDR collation root data
 717 - run all tests with the collation test data *_SHORT.txt or the full files
 718   (the full ones have comments, useful for debugging)
 719 - note on intltest: if collate/UCAConformanceTest fails, then
 720   utility/MultithreadTest/TestCollators will fail as well;
 721   fix the conformance test before looking into the multi-thread test
 722
 723 * update Java data files
 724 - refresh just the UCD/UCA-related/derived files, just to be safe
 725 - see (ICU4C)/source/data/icu4j-readme.txt
 726 - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 727 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 728   output:
 729     ...
 730     Unicode .icu files built to ./out/build/icudt60l
 731     echo timestamp > uni-core-data
 732     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt60b
 733     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b
 734     echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
 735     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt60l.dat ./out/icu4j/icudt60b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt60l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt60b
 736     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b"
 737     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt60b/
 738     mkdir -p /tmp/icu4j/main/shared/data
 739     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
 740     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt60b/
 741     mkdir -p /tmp/icu4j/main/shared/data
 742     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
 743     make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/uni10/dbg/icu4c/data'
 744 - copy the big-endian Unicode data files to another location,
 745   separate from the other data files,
 746   and then refresh ICU4J
 747     cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
 748     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
 749     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
 750     cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 751     cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 752     rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
 753     cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 754     cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
 755     cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
 756     jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
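
  The cp/rm commands above amount to a file filter. A minimal Python sketch of
  that filter, assuming paths are given relative to the $ICUDT folder. The
  function name and the coll/brkitr sample file names are hypothetical.

```python
from fnmatch import fnmatch

def is_unicode_data_file(rel_path):
    """True if the cp/rm sequence in the log would keep this file for the partial refresh."""
    folder, _, name = rel_path.rpartition("/")
    if folder in ("coll", "brkitr"):
        return True              # cp .../coll/* and cp .../brkitr/*
    if folder:
        return False             # no other subfolder is copied
    if name == "cnvalias.icu":
        return False             # copied by *.icu, then removed again
    return name == "confusables.cfu" or fnmatch(name, "*.icu") or fnmatch(name, "*.nrm")

# Sample names; the coll/ and brkitr/ entries are illustrative only.
names = ["uprops.icu", "cnvalias.icu", "nfc.nrm", "confusables.cfu",
         "zoneinfo64.res", "coll/root.res", "brkitr/word.brk"]
kept = [n for n in names if is_unicode_data_file(n)]
print(kept)
```

  The sketch only classifies names; the actual copy and the jar update still use
  the commands above.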
 757
 758 * When refreshing all of ICU4J data from ICU4C
 759 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 760 - cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data
 761 or
 762 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install
 763
 764 * update CollationFCD.java
 765   + copy & paste the initializers of lcccIndex[] etc. from
 766     ICU4C/source/i18n/collationfcd.cpp to
 767     ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
 768
 769 * refresh Java test .txt files
 770 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
 771     cd $ICU_SRC/icu4c/source/data/unidata
 772     cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
 773     cd ../../test/testdata
 774     cp BidiCharacterTest.txt BidiTest.txt IdnaTest.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
 775     cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
 776
 777 * run & fix ICU4J tests
 778
 779 *** API additions
 780 - send notice to icu-design about new born-@stable API (enum constants etc.)
 781
 782 *** CLDR numbering systems
 783 - look for new sets of decimal digits (gc=Nd & nv=4) and submit a CLDR ticket
 784   Unicode 10: http://unicode.org/cldr/trac/ticket/10219
 785   Unicode 9: http://unicode.org/cldr/trac/ticket/9692
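
  Searching for the digit four yields one character per set of decimal digits.
  A minimal sketch with Python's unicodedata module; note that it reflects the
  Unicode version bundled with the Python interpreter, not the new data files,
  so it is only a cross-check. The function name is made up for this note.

```python
import unicodedata

def digit_four_characters():
    """Return code points with General_Category=Nd and decimal value 4."""
    found = []
    for cp in range(0x110000):
        ch = chr(cp)
        if unicodedata.category(ch) == "Nd" and unicodedata.decimal(ch, None) == 4:
            found.append(cp)
    return found

fours = digit_four_characters()
print("Unicode version in this Python:", unicodedata.unidata_version)
for cp in fours:
    print("U+%04X %s" % (cp, unicodedata.name(chr(cp), "<unnamed>")))
```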
 786
 787 *** merge the Unicode update branches back onto the trunk
 788 - do not merge the icudata.jar and testdata.jar,
 789   instead rebuild them from merged & tested ICU4C
 790 - make sure that changes to Unicode tools are checked in:
 791   http://www.unicode.org/utility/trac/log/trunk/unicodetools
 792
 793 ---------------------------------------------------------------------------- ***
 794
 795 Emoji 5.0 update for ICU 59
 796 - ICU 59 mostly remains on Unicode 9.0
 797 - except that it updates bidi and segmentation data to Unicode 10 beta
 798
 799 First run of tools on combined icu4c/icu4j/tools trunk after svn repository reorg.
 800
 801 * Command-line environment setup
 802
 803 ICU_ROOT=~/svn.icu/trunk
 804 ICU_SRC_DIR=$ICU_ROOT/src
 805 ICU4C_SRC_DIR=$ICU_SRC_DIR/icu4c
 806 ICUDT=icudt59b
 807 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
 808 SRC_DATA_IN=$ICU4C_SRC_DIR/source/data/in
 809 UNIDATA=$ICU4C_SRC_DIR/source/data/unidata
 810
 811 *** ICU Trac
 812
 813 - ticket:12900: take Emoji 5.0 properties data into ICU 59 once it's released
 814 - changes directly on trunk
 815
 816 *** data files & enums & parser code
 817
 818 * download files
 819
 820 - download Unicode 9.0 files into a uni90e50 folder: ucd, idna, security (skip uca)
 821 - download emoji 5.0 beta files into the same uni90e50 folder
 822 - download Unicode 10.0 beta files: ucd
 823   + copy Unicode 10 bidi files to the uni90e50/ucd folder:
 824     BidiBrackets.txt
 825     BidiCharacterTest.txt
 826     BidiMirroring.txt
 827     BidiTest.txt
 828     extracted/DerivedBidiClass.txt
 829   + copy Unicode 10 segmentation files to the uni90e50/ucd folder:
 830     LineBreak.txt
 831     auxiliary/*
 832
 833 * preparseucd.py changes
 834 - adjust for combined trunks
 835 - write new copyright lines
 836 - ignore new Emoji_Component property for now
 837
 838 * process and/or copy files
 839 - ~/svn.icu/trunk/src/tools/unicode$ py/preparseucd.py ~/unidata/uni90e50/20170322 $ICU_SRC_DIR
 840   + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
 841
 842 - cp ~/unidata/uni90e50/20170322/security/confusables.txt $UNIDATA
 843
 844 * build ICU (make install)
 845   so that the tools build can pick up the new definitions from the installed header files.
 846
 847   $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date
 848
 849 * build Unicode tools using CMake+make
 850
 851 ~/svn.icu/trunk/src/tools/unicode/c/icudefs.txt:
 852
 853 # Location (--prefix) of where ICU was installed.
 854 set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)
 855 # Location of the ICU4C source tree.
 856 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/trunk/src/icu4c)
 857
 858   ~/svn.icu/trunk/dbg/tools/unicode/c$
 859     cmake ../../../../src/tools/unicode/c
 860     make
 861
 862 * generate core properties data files
 863   ~/svn.icu/trunk/dbg/tools/unicode/c$
 864     genprops/genprops $ICU4C_SRC_DIR
 865 - rebuild ICU (make install) & tools
 866
 867 * run & fix ICU4C tests
 868 - Andy handles RBBI & spoof check test failures
 869
 870 * update Java data files
 871 - refresh just the UCD/UCA-related/derived files, just to be safe
 872 - see (ICU4C)/source/data/icu4j-readme.txt
 873 - mkdir /tmp/icu4j
 874 - ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 875   output:
 876     ...
 877     Unicode .icu files built to ./out/build/icudt59l
 878     echo timestamp > uni-core-data
 879     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt59b
 880     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b
 881     echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
 882     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt59l.dat ./out/icu4j/icudt59b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt59l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt59b
 883     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b"
 884     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt59b/
 885     mkdir -p /tmp/icu4j/main/shared/data
 886     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
 887     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt59b/
 888     mkdir -p /tmp/icu4j/main/shared/data
 889     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
 890     make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/trunk/dbg/icu4c/data'
 891 - copy the big-endian Unicode data files to another location,
 892   separate from the other data files,
 893   and then refresh ICU4J
 894     cd ~/svn.icu/trunk/dbg/icu4c/data/out/icu4j
 895     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
 896     cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 897     cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 898     rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
 899     cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
 900     jar uvf ~/svn.icu/trunk/src/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
 901
 902 * When refreshing all of ICU4J data from ICU4C
 903 - ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 904 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu/trunk/src/icu4j/main/shared/data
 905 or
 906 - ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=~/svn.icu/trunk/src/icu4j icu4j-data-install
 907
 908 * refresh Java test .txt files
 909 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
 910     cd $ICU4C_SRC_DIR/source/data/unidata
 911     cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
 912     cd ../../test/testdata
 913     cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
 914     cp ~/unidata/uni90e50/20170322/ucd/CompositionExclusions.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode
 915
 916 * run & fix ICU4J tests
 917
 918 ---------------------------------------------------------------------------- ***
 919
 920 Unicode 9.0 update for ICU 58
 921
 922 * Command-line environment setup
 923
 924 ICU_ROOT=~/svn.icu/trunk
 925 ICU_SRC_DIR=$ICU_ROOT/src
 926 ICUDT=icudt58b
 927 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
 928 SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
 929 UNIDATA=$ICU_SRC_DIR/source/data/unidata
 930
 931 http://www.unicode.org/review/pri323/  -- beta review
 932 http://www.unicode.org/reports/uax-proposed-updates.html
 933 http://www.unicode.org/versions/beta-9.0.0.html
 934 http://www.unicode.org/versions/Unicode9.0.0/
 935 http://www.unicode.org/reports/tr44/tr44-17.html
 936
 937 *** ICU Trac
 938
 939 - ticket:12526: integrate Unicode 9
 940 - C++ ^/icu/branches/markus/uni90, ^/icu/branches/markus/uni90b
 941 - Java ^/icu4j/branches/markus/uni90, ^/icu4j/branches/markus/uni90b
 942
 943 *** CLDR Trac
 944
 945 - cldrbug 9414: UCA 9
 946 - ^/branches/markus/uni90 at r11518 from trunk at r11517
 947
 948 - cldrbug 8745: Unicode 9.0 script metadata
 949
 950 *** Unicode version numbers
 951 - makedata.mak
 952 - uchar.h
 953 - com.ibm.icu.util.VersionInfo
 954 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
 955
 956 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
 957   so that the makefiles see the new version number.
 958
 959 *** data files & enums & parser code
 960
 961 * file preparation
 962
 963 - download UCD & IDNA files
 964 - make sure that the Unicode data folder passed into preparseucd.py
 965   includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
 966 - only for manual diffs: remove version suffixes from the file names
 967   ~/unidata/uni70/20140403$ ../../desuffixucd.py .
 968   (see https://sites.google.com/site/unicodetools/inputdata)
 969 - only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
 970 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni90/20160603 $ICU_SRC_DIR ~/svn.icutools/trunk/src
 971   + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
 972
 973 - also: from http://unicode.org/Public/security/9.0.0/ download new confusables.txt
 974   and copy to $UNIDATA
 975     cp ~/unidata/uni90/20160603/security/confusables.txt $UNIDATA
 976
 977 * preparseucd.py changes
 978 - remove or add new Unicode scripts from/to the
 979   only-in-ISO-15924 list according to the error messages:
 980     ValueError: remove ['Tang'] from _scripts_only_in_iso15924
 981     ValueError: sc = Hanb (uchar.h USCRIPT_HAN_WITH_BOPOMOFO) not in the UCD
 982     ValueError: sc = Jamo (uchar.h USCRIPT_JAMO) not in the UCD
 983     ValueError: sc = Zsye (uchar.h USCRIPT_SYMBOLS_EMOJI) not in the UCD
 984   -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
 985       and in com.ibm.icu.dev.test.lang.TestUScript.java
 986 - DerivedNumericValues.txt new numeric values
 987     0D58          ; 0.00625 ; ; 1/160 # No       MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH
 988     0D59          ; 0.025 ; ; 1/40 # No       MALAYALAM FRACTION ONE FORTIETH
 989     0D5A          ; 0.0375 ; ; 3/80 # No       MALAYALAM FRACTION THREE EIGHTIETHS
 990     0D5B          ; 0.05 ; ; 1/20 # No       MALAYALAM FRACTION ONE TWENTIETH
 991     0D5D          ; 0.15 ; ; 3/20 # No       MALAYALAM FRACTION THREE TWENTIETHS
 992   -> change uprops.h, corepropsbuilder.cpp/encodeNumericValue(),
 993      uchar.c, UCharacterProperty.java
 994      to support a new series of values
 995 - adjust preparseucd.py for Tangut algorithmic names
 996   in ppucd.txt:
 997     algnamesrange;17000..187EC;han;CJK UNIFIED IDEOGRAPH-
 998   ->
 999     algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH-
1000 - avoid block-compressing most String/Miscellaneous property values,
1001   triggered by genprops not coping with a multi-code point Case_Folding on
1002     block;1C80..1C8F;...;Cased;cf=0442;CWCF;...
1003   keep block-compressing empty-string mappings NFKC_CF="" for tags and variation selectors
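
  The new Malayalam fraction values listed above can be checked for consistency
  between the decimal and the fraction column. A minimal sketch using Python's
  fractions module; the rows are copied from the DerivedNumericValues.txt lines
  in this log, and the function name is made up for this note.

```python
from fractions import Fraction

# (code point, decimal column, fraction column) as listed in the log.
NEW_VALUES = [
    (0x0D58, "0.00625", "1/160"),
    (0x0D59, "0.025", "1/40"),
    (0x0D5A, "0.0375", "3/80"),
    (0x0D5B, "0.05", "1/20"),
    (0x0D5D, "0.15", "3/20"),
]

def mismatches(rows):
    """Return the code points whose decimal value differs from the stated fraction."""
    return [cp for cp, dec, frac in rows if Fraction(dec) != Fraction(frac)]

print("mismatches:", mismatches(NEW_VALUES))
for cp, _, frac in NEW_VALUES:
    f = Fraction(frac)
    print("U+%04X numerator=%d denominator=%d" % (cp, f.numerator, f.denominator))
```

  All five denominators are 20 times a power of two, which may help when
  reviewing how the new series of values is encoded.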
1004
1005 * PropertyAliases.txt changes
1006 - 1 new property PCM=Prepended_Concatenation_Mark
1007   Ignore: Only useful for layout engines.
1008   Ok to list in ppucd.txt.
1009
1010 * PropertyValueAliases.txt new property values
1011     blk; Adlam                            ; Adlam
1012     blk; Bhaiksuki                        ; Bhaiksuki
1013     blk; Cyrillic_Ext_C                   ; Cyrillic_Extended_C
1014     blk; Glagolitic_Sup                   ; Glagolitic_Supplement
1015     blk; Ideographic_Symbols              ; Ideographic_Symbols_And_Punctuation
1016     blk; Marchen                          ; Marchen
1017     blk; Mongolian_Sup                    ; Mongolian_Supplement
1018     blk; Newa                             ; Newa
1019     blk; Osage                            ; Osage
1020     blk; Tangut                           ; Tangut
1021     blk; Tangut_Components                ; Tangut_Components
1022   -> add to uchar.h
1023     use long property names for enum constants
1024   -> add to UCharacter.UnicodeBlock IDs
1025     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
1026             replace  public static final int \1_ID = \2; \3
1027   -> add to UCharacter.UnicodeBlock objects
1028     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
1029             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
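
  The two Eclipse find/replace pairs above can also be tried outside the IDE.
  A minimal sketch with Python's re module, which accepts the same \1-style
  backreferences. The input line is a hypothetical uchar.h-style enum line used
  only for illustration.

```python
import re

# Hypothetical input line in the style of the uchar.h block enum.
line = "UBLOCK_ADLAM = 263, /*[1E900]*/"

# First pair: enum constant -> UnicodeBlock ID constant.
id_line = re.sub(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)",
                 r"public static final int \1_ID = \2; \3", line)

# Second pair: enum constant -> UnicodeBlock object.
obj_line = re.sub(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)",
                  r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
                  line)

print(id_line)
print(obj_line)
```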
1030
1031     GCB; EB                               ; E_Base
1032     GCB; EBG                              ; E_Base_GAZ
1033     GCB; EM                               ; E_Modifier
1034     GCB; GAZ                              ; Glue_After_Zwj
1035     GCB; ZWJ                              ; ZWJ
1036   -> uchar.h & UCharacter.GraphemeClusterBreak
1037
1038     jg ; African_Feh                      ; African_Feh
1039     jg ; African_Noon                     ; African_Noon
1040     jg ; African_Qaf                      ; African_Qaf
1041   -> uchar.h & UCharacter.JoiningGroup
1042
1043     lb ; EB                               ; E_Base
1044     lb ; EM                               ; E_Modifier
1045     lb ; ZWJ                              ; ZWJ
1046   -> uchar.h & UCharacter.LineBreak
1047
1048     sc ; Adlm                             ; Adlam
1049     sc ; Bhks                             ; Bhaiksuki
1050     sc ; Marc                             ; Marchen
1051     sc ; Newa                             ; Newa
1052     sc ; Osge                             ; Osage
1053     sc ; Tang                             ; Tangut
1054   -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
1055
1056     WB ; EB                               ; E_Base
1057     WB ; EBG                              ; E_Base_GAZ
1058     WB ; EM                               ; E_Modifier
1059     WB ; GAZ                              ; Glue_After_Zwj
1060     WB ; ZWJ                              ; ZWJ
1061   -> uchar.h & UCharacter.WordBreak
1062
1063 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
1064     (not strictly necessary for NOT_ENCODED scripts)
1065   ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
1066
1067 * generate normalization data files
1068   cd $ICU_ROOT/dbg
1069   bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
1070   bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
1071   bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
1072   bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
1073   bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
1074
1075 * build ICU (make install)
1076   so that the tools build can pick up the new definitions from the installed header files.
1077
1078   $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 30 out.txt
1079
1080 * build Unicode tools using CMake+make
1081
1082 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
1083
1084   # Location (--prefix) of where ICU was installed.
1085   set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
1086   # Location of the ICU source tree.
1087   set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)
1088
1089   ~/svn.icutools/trunk/dbg/unicode/c$
1090     cmake ../../../src/unicode/c
1091     make
1092
1093 * generate core properties data files
1094   ~/svn.icutools/trunk/dbg/unicode/c$
1095     genprops/genprops $ICU_SRC_DIR
1096     genuca/genuca --hanOrder implicit $ICU_SRC_DIR
1097     genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
1098 - rebuild ICU (make install) & tools
1099
1100 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1101   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1102 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1103 - Unicode 6.0..9.0: U+2260, U+226E, U+226F
1104 - nothing new in 9.0, no test file to update
1105
1106 * run & fix ICU4C tests
1107 - Andy handles RBBI & spoof check test failures
1108
1109 * collation: CLDR collation root, UCA DUCET
1110
1111 - UCA DUCET goes into Mark's Unicode tools, see
1112   https://sites.google.com/site/unicodetools/home#TOC-UCA
1113 - CLDR root data files are checked into (CLDR UCA branch)/common/uca/
1114     cp (UCA generated)/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/
1115
1116 - cd (CLDR UCA branch)/common/uca/
1117 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1118     cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
1119 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1120     cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
1121     (note removing the underscore before "Rules")
1122     cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
1123 - restore TODO diffs in UCARules.txt
1124     meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
1125 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
1126   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1127   from the CLDR root files (..._CLDR_..._SHORT.txt)
1128     cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1129     cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1130     cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
1131 - if CLDR common/uca/unihan-index.txt changes, then update
1132   CLDR common/collation/root.xml <collation type="private-unihan">
1133   and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
1134
1135 - run genuca, see command line above;
1136   deal with
1137     Error: Unknown script for first-primary sample character U+104B5 on line 32599 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt:
1138     FDD1 104B5;     [75 B8 02, 05, 05]      # Osage first primary (compressible)
1139         (add the character to genuca.cpp sampleCharsToScripts[])
1140   + look up the USCRIPT_ code for the new sample characters
1141     (should be obvious from the comment in the error output)
1142   + *add* mappings to sampleCharsToScripts[], do not replace them
1143     (in case the script sample characters flip-flop)
1144   + insert new scripts in DUCET script order, see the top_byte table
1145     at the beginning of FractionalUCA.txt
1146 - rebuild ICU4C
1147
1148 * Unihan collators
1149 - run Unicode Tools
1150     org.unicode.draft.GenerateUnihanCollators
1151   with VM arguments
1152     -DSVN_WORKSPACE=/home/mscherer/svn.unitools/trunk
1153     -DOTHER_WORKSPACE=/home/mscherer/svn.unitools
1154     -DUCD_DIR=/home/mscherer/svn.unitools/trunk/data
1155     -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
1156     -DUVERSION=9.0.0
1157     -ea
1158 - run Unicode Tools
1159     org.unicode.draft.GenerateUnihanCollatorFiles
1160   with the same arguments
1161 - check CLDR diffs
1162     cd ~/svn.cldr/trunk
1163     meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
1164     meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
1165 - copy to CLDR
1166     cd ~/svn.cldr/trunk
1167     cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
1168     cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
1169 - commit to CLDR
1170 - generate ICU zh collation data: run CLDR
1171     org.unicode.cldr.icu.NewLdml2IcuConverter
1172   with program arguments
1173     -t collation
1174     -s /home/mscherer/svn.cldr/trunk/common/collation
1175     -m /home/mscherer/svn.cldr/trunk/common/supplemental
1176     -d /home/mscherer/svn.icu/trunk/src/source/data/coll
1177     -p /home/mscherer/svn.icu/trunk/src/source/data/xml/collation
1178     zh
1179   and VM arguments
1180     -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
1181 - rebuild ICU4C
1182
1183 * run & fix ICU4C tests, now with new CLDR collation root data
1184 - run all tests with the collation test data *_SHORT.txt or the full files
1185   (the full ones have comments, useful for debugging)
1186 - note on intltest: if collate/UCAConformanceTest fails, then
1187   utility/MultithreadTest/TestCollators will fail as well;
1188   fix the conformance test before looking into the multi-thread test
1189
1190 * update Java data files
1191 - refresh just the UCD/UCA-related/derived files, just to be safe
1192 - see (ICU4C)/source/data/icu4j-readme.txt
1193 - mkdir /tmp/icu4j
1194 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1195   output:
1196     ...
1197     Unicode .icu files built to ./out/build/icudt58l
1198     echo timestamp > uni-core-data
1199     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt58b
1200     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b
1201     echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
1202     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt58l.dat ./out/icu4j/icudt58b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt58l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt58b
1203     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b"
1204     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt58b/
1205     mkdir -p /tmp/icu4j/main/shared/data
1206     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1207     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt58b/
1208     mkdir -p /tmp/icu4j/main/shared/data
1209     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1210     make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
1211 - copy the big-endian Unicode data files to another location,
1212   separate from the other data files,
1213   and then refresh ICU4J
1214     cd ~/svn.icu/trunk/dbg/data/out/icu4j
1215     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1216     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1217     cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1218     cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1219     rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1220     cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1221     cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1222     cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1223     jar uvf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
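
The copy steps above can also be scripted. The following is a minimal Python sketch, not part of the ICU tooling: the function name stage_unicode_data and its arguments are made up for illustration. It mirrors only the cp/rm commands listed above and does not run jar.

```python
import shutil
from pathlib import Path


def stage_unicode_data(src_root, dst_root, icudt):
    """Copy the Unicode-related data files for one data version (e.g. "icudt58b").

    Mirrors the cp/rm commands in the log: confusables.cfu, *.icu except
    cnvalias.icu, *.nrm, and everything directly under coll/ and brkitr/.
    Returns the sorted list of copied paths, relative to the data folder.
    """
    rel = Path("com/ibm/icu/impl/data") / icudt
    src = Path(src_root) / rel
    dst = Path(dst_root) / rel
    copied = []

    for pattern in ("confusables.cfu", "*.icu", "*.nrm", "coll/*", "brkitr/*"):
        for path in sorted(src.glob(pattern)):
            # The log deletes cnvalias.icu after copying *.icu, so skip it here.
            if not path.is_file() or path.name == "cnvalias.icu":
                continue
            target = dst / path.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            copied.append(path.relative_to(src).as_posix())
    return sorted(copied)
```

Usage, assuming the same folders as in the commands above: stage_unicode_data("out/icu4j", "/tmp/icu4j", "icudt58b"), followed by the jar command from the log.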
1224
1225 * When refreshing all of ICU4J data from ICU4C
1226 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1227 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
1228 or
1229 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
1230
1231 * update CollationFCD.java
1232   + copy & paste the initializers of lcccIndex[] etc. from
1233     ICU4C/source/i18n/collationfcd.cpp to
1234     ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1235
1236 * refresh Java test .txt files
1237 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1238     cd $ICU_SRC_DIR/source/data/unidata
1239     cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1240     cd ../../test/testdata
1241     cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1242     cp ~/unidata/uni90/20160603/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1243
1244 * run & fix ICU4J tests
1245
1246 *** LayoutEngine script information
1247
1248 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
1249   This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
1250   in the working directory.
1251
1252   (It also generates ScriptRunData.cpp, which is no longer needed.)
1253
1254   It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
1255   (a plain text file)
1256   which maps ICU versions to the numbers of script/language constants
1257   that were added then.
1258   (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
1259
1260   The generated files have a current copyright date and "@deprecated" statement.
1261
1262 * Review changes, fix Java tool if necessary, and copy to ICU4C
1263   cd ~/svn.icu4j/trunk/src
1264   meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
1265   cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
1266   cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
1267
1268 *** API additions
1269 - send notice to icu-design about new born-@stable API (enum constants etc.)
1270
1271 *** merge the Unicode update branches back onto the trunk
1272 - do not merge the icudata.jar and testdata.jar,
1273   instead rebuild them from merged & tested ICU4C
1274 - make sure that changes to Unicode tools & ICU tools are checked in
1275   http://www.unicode.org/utility/trac/log/trunk/unicodetools
1276   http://bugs.icu-project.org/trac/log/tools/trunk
1277
1278 ---------------------------------------------------------------------------- ***
1279
1280 New script codes early in ICU 58: http://bugs.icu-project.org/trac/ticket/11764
1281
1282 Adding
1283 - new scripts in Unicode 9: Adlm, Bhks, Marc, Newa, Osge
1284 - new combination/alias codes: Hanb, Jamo
1285   - used in CLDR 29 and in spoof checker
1286 - new Z* code: Zsye
1287
1288 Add new codes to uscript.h & UScript.java, see Unicode update logs.
1289   -> com.ibm.icu.lang.UScript
1290     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
1291     replace  public static final int \1 = \2; \3
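
The find/replace pair above can be tried outside an editor. This is a small Python sketch of the same substitution; the function name c_enum_to_java and the sample constant USCRIPT_EXAMPLE = 999 are hypothetical and only illustrate the shape of a uscript.h line.

```python
import re

# The find pattern and the replacement from the log, in Python regex syntax.
FIND = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
REPLACE = r"public static final int \1 = \2; \3"


def c_enum_to_java(line):
    """Rewrite one uscript.h enum line into a UScript.java constant line.

    Lines that do not match the pattern are returned unchanged.
    """
    return FIND.sub(REPLACE, line)


if __name__ == "__main__":
    # Hypothetical enum line, not a real script code.
    print(c_enum_to_java("      USCRIPT_EXAMPLE        = 999,/* Xmpl */"))
```

Leading indentation is kept because the pattern only starts matching at USCRIPT_.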
1292
1293 Manually edit ppucd.txt and icutools:unicode/c/genprops/pnames_data.h,
1294 add new script codes.
1295 "Long" script names only where established in Unicode 9 PropertyValueAliases.txt.
1296
1297 Note: If we have to run preparseucd.py again before the Unicode 9 update,
1298 then we need to manually keep/restore the new script codes.
1299
1300 ICU_ROOT=~/svn.icu/trunk
1301 ICU_SRC_DIR=$ICU_ROOT/src
1302 ICUDT=icudt57b
1303 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
1304 SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
1305 UNIDATA=$ICU_SRC_DIR/source/data/unidata
1306
1307 Adjust unicode/c/genprops/*builder.cpp for #ifndef/#ifdef changes in _data.h files,
1308 see http://bugs.icu-project.org/trac/ticket/12141
1309
1310 make install, then icutools cmake & make, then
1311 ~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR
1312
1313 Generate Java data as usual, only update pnames.icu & uprops.icu.
1314
1315 *** LayoutEngine script information
1316
1317 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
1318   This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
1319   in the working directory.
1320
1321   (It also generates ScriptRunData.cpp, which is no longer needed.)
1322
1323   It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
1324   (a plain text file)
1325   which maps ICU versions to the numbers of script/language constants
1326   that were added then.
1327   (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
1328
1329   The generated files have a current copyright date and "@deprecated" statement.
1330
1331 * Review changes, fix Java tool if necessary, and copy to ICU4C
1332   cd ~/svn.icu4j/trunk/src
1333   meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
1334   cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
1335   cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
1336
1337 ---------------------------------------------------------------------------- ***
1338
1339 Emoji properties added in ICU 57: http://bugs.icu-project.org/trac/ticket/11802
1340
1341 Edit preparseucd.py to add & parse new properties.
1342 They share the UCD property namespace but are not listed in PropertyAliases.txt.
1343
1344 Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/
1345 Initial data from emoji/2.0/
1346
1347 ICU_ROOT=~/svn.icu/trunk
1348 ICU_SRC_DIR=$ICU_ROOT/src
1349 ICUDT=icudt56b
1350 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
1351 SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
1352 UNIDATA=$ICU_SRC_DIR/source/data/unidata
1353
1354 Add binary-property constants to uchar.h enum UProperty & UProperty.java.
1355
1356 ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src
1357 (Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.)
1358
1359 Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java
1360
1361 make install, then icutools cmake & make, then
1362 ~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR
1363
1364 Generate Java data as usual, only update pnames.icu & uprops.icu.
1365
1366 ---------------------------------------------------------------------------- ***
1367
1368 Unicode 8.0 update for ICU 56
1369
1370 * Command-line environment setup
1371
1372 ICU_ROOT=~/svn.icu/trunk
1373 ICU_SRC_DIR=$ICU_ROOT/src
1374 ICUDT=icudt56b
1375 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
1376 SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
1377 UNIDATA=$ICU_SRC_DIR/source/data/unidata
1378
1379 http://www.unicode.org/review/pri297/  -- beta review
1380 http://www.unicode.org/reports/uax-proposed-updates.html
1381 http://unicode.org/versions/beta-8.0.0.html
1382 http://www.unicode.org/versions/Unicode8.0.0/
1383 http://www.unicode.org/reports/tr44/tr44-15.html
1384
1385 *** ICU Trac
1386
1387 - ticket:11574: Unicode 8
1388 - C++ branches/markus/uni80 at r37351 from trunk at r37343
1389 - Java branches/markus/uni80 at r37352 from trunk at r37338
1390
1391 *** CLDR Trac
1392
1393 - cldrbug 8311: UCA 8
1394 - branches/markus/uni80 at r11518 from trunk at r11517
1395
1396 - cldrbug 8109: Unicode 8.0 script metadata
1397 - cldrbug 8418: Updated segmentation for Unicode 8.0
1398
1399 *** Unicode version numbers
1400 - makedata.mak
1401 - uchar.h
1402 - com.ibm.icu.util.VersionInfo
1403 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
1404
1405 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
1406   so that the makefiles see the new version number.
1407
1408 *** data files & enums & parser code
1409
1410 * file preparation
1411
1412 - download UCD & IDNA files
1413 - make sure that the Unicode data folder passed into preparseucd.py
1414   includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
1415 - only for manual diffs: remove version suffixes from the file names
1416   ~/unidata/uni70/20140403$ ../../desuffixucd.py .
1417   (see https://sites.google.com/site/unicodetools/inputdata)
1418 - only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
1419 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src
1420 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1421
1422 - also: from http://unicode.org/Public/security/8.0.0/ download new
1423   confusables.txt & confusablesWholeScript.txt
1424   and copy to $UNIDATA
1425     ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA
1426     ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA
1427
1428 * initial preparseucd.py changes
1429 - remove new Unicode scripts from the
1430   only-in-ISO-15924 list according to the error message:
1431     ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw']
1432     from _scripts_only_in_iso15924
1433   -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
1434       and in com.ibm.icu.dev.test.lang.TestUScript.java
1435 - property and file name change:
1436     IndicMatraCategory -> IndicPositionalCategory
1437 - UnicodeData.txt unusual numeric values (fractions not in lowest terms)
1438     109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;
1439     109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;
1440     109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;;
1441     109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;;
1442     109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;;
1443     109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;
1444     109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;;
1445     109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;;
1446     109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;;
1447     109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;
1448   -> change preparseucd.py to map them to fractions in lowest terms (e.g., 2/12 -> 1/6)
1449      which are listed in DerivedNumericValues.txt;
1450      keeps storage in data file simple
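
The mapping from a value such as 2/12 to the lowest-terms form can be sketched with Python's standard fractions module. This is an illustration only, not the actual preparseucd.py code; the function name reduce_numeric_value is an assumption.

```python
from fractions import Fraction


def reduce_numeric_value(nv):
    """Return a UnicodeData.txt numeric value with any fraction in lowest terms.

    Values without a slash (plain integers, empty fields) pass through unchanged.
    """
    if "/" not in nv:
        return nv
    # Fraction normalizes on construction, e.g. Fraction("2/12") is 1/6.
    return str(Fraction(nv))


if __name__ == "__main__":
    for nv in ("1/12", "2/12", "6/12", "10/12"):
        print(nv, "->", reduce_numeric_value(nv))
```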
1451
1452 * PropertyValueAliases.txt changes
1453 - 10 new Block (blk) values:
1454     blk; Ahom                             ; Ahom
1455     blk; Anatolian_Hieroglyphs            ; Anatolian_Hieroglyphs
1456     blk; Cherokee_Sup                     ; Cherokee_Supplement
1457     blk; CJK_Ext_E                        ; CJK_Unified_Ideographs_Extension_E
1458     blk; Early_Dynastic_Cuneiform         ; Early_Dynastic_Cuneiform
1459     blk; Hatran                           ; Hatran
1460     blk; Multani                          ; Multani
1461     blk; Old_Hungarian                    ; Old_Hungarian
1462     blk; Sup_Symbols_And_Pictographs      ; Supplemental_Symbols_And_Pictographs
1463     blk; Sutton_SignWriting               ; Sutton_SignWriting
1464   -> add to uchar.h
1465     use long property names for enum constants
1466   -> add to UCharacter.UnicodeBlock IDs
1467     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
1468             replace  public static final int \1_ID = \2; \3
1469   -> add to UCharacter.UnicodeBlock objects
1470     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
1471             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1472 - 6 new Script (sc) values:
1473     sc ; Ahom                             ; Ahom
1474     sc ; Hatr                             ; Hatran
1475     sc ; Hluw                             ; Anatolian_Hieroglyphs
1476     sc ; Hung                             ; Old_Hungarian
1477     sc ; Mult                             ; Multani
1478     sc ; Sgnw                             ; SignWriting
1479   -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
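
The two Eclipse find/replace steps for UCharacter.UnicodeBlock can be reproduced with Python's re module. This is a sketch under assumptions: the helper names are made up, and UBLOCK_EXAMPLE_BLOCK = 999 is a hypothetical line that only imitates the layout of a uchar.h block constant.

```python
import re

# Eclipse-style patterns from the log, in Python regex syntax.
FIND_ID = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
FIND_OBJ = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")


def block_id_line(line):
    """First step: turn a uchar.h block enum line into the Java *_ID constant."""
    return FIND_ID.sub(r"public static final int \1_ID = \2; \3", line)


def block_object_line(line):
    """Second step: turn the same enum line into the Java UnicodeBlock object."""
    return FIND_OBJ.sub(
        r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
        line,
    )


if __name__ == "__main__":
    sample = "    UBLOCK_EXAMPLE_BLOCK = 999, /*[10000]*/"  # hypothetical line
    print(block_id_line(sample))
    print(block_object_line(sample))
```

Both steps start from the same C enum lines: the first produces the integer IDs, the second the objects that refer to those IDs.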
1480
1481 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
1482     (not strictly necessary for NOT_ENCODED scripts)
1483   ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
1484
1485 * generate normalization data files
1486   cd $ICU_ROOT/dbg
1487   bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
1488   bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
1489   bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
1490   bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
1491   bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
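
The gennorm2 calls above follow one pattern: each output is built from a list of input files, and only the C source output takes --csource. A minimal Python sketch that assembles the same command lines follows. The function name gennorm2_commands is an assumption; it uses only the options that appear in the commands above and does not run anything.

```python
def gennorm2_commands(icu_src_dir, src_data_in, unidata):
    """Return the gennorm2 command lines from the log as argument lists."""
    norm2 = f"{unidata}/norm2"
    # (output file, input files, extra options), in the order used in the log
    jobs = [
        (f"{icu_src_dir}/source/common/norm2_nfc_data.h", ["nfc.txt"], ["--csource"]),
        (f"{src_data_in}/nfc.nrm", ["nfc.txt"], []),
        (f"{src_data_in}/nfkc.nrm", ["nfc.txt", "nfkc.txt"], []),
        (f"{src_data_in}/nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"], []),
        (f"{src_data_in}/uts46.nrm", ["nfc.txt", "uts46.txt"], []),
    ]
    return [
        ["bin/gennorm2", "-o", out, "-s", norm2, *inputs, *extra]
        for out, inputs, extra in jobs
    ]


if __name__ == "__main__":
    # Print with the shell variable names so the lines can be pasted as-is.
    for cmd in gennorm2_commands("$ICU_SRC_DIR", "$SRC_DATA_IN", "$UNIDATA"):
        print(" ".join(cmd))
```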
1492
1493 * build ICU (make install)
1494   so that the tools build can pick up the new definitions from the installed header files.
1495
1496   $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
1497
1498 * build Unicode tools using CMake+make
1499
1500 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
1501
1502   # Location (--prefix) of where ICU was installed.
1503   set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
1504   # Location of the ICU source tree.
1505   set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)
1506
1507   ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
1508   ~/svn.icutools/trunk/dbg/unicode/c$ make
1509
1510 * generate core properties data files
1511 - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
1512 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR
1513 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
1514 - rebuild ICU (make install) & tools
1515 - run genuca again (see step above) so that it picks up the new nfc.nrm
1516 - rebuild ICU (make install) & tools
1517
1518 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1519   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1520 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1521 - Unicode 6.0..8.0: U+2260, U+226E, U+226F
1522 - nothing new in 8.0, no test file to update
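
The grep step above can be narrowed to non-ASCII code points with a few lines of Python. This sketch assumes the usual semicolon-separated layout of IdnaMappingTable.txt (code point or range, status, optional mapping, "#" comment). The function name is made up, and the sample lines are invented placeholders using private-use code points, not real table entries.

```python
def non_ascii_with_status(lines, status="disallowed_STD3_valid"):
    """Yield (start, end) code point ranges that have the given status
    and reach beyond ASCII (end > U+007F)."""
    for line in lines:
        data = line.split("#", 1)[0]          # drop the trailing comment
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != status:
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        if end > 0x7F:
            yield (start, end)


# Made-up lines in the IdnaMappingTable.txt field layout (placeholders only).
SAMPLE = [
    "# comment line",
    "0000..002C    ; disallowed_STD3_valid   # placeholder",
    "E000          ; disallowed_STD3_valid   # placeholder",
    "E001..E002    ; valid                   # placeholder",
]

if __name__ == "__main__":
    for start, end in non_ascii_with_status(SAMPLE):
        print("U+%04X..U+%04X" % (start, end))
```

To scan the real file, pass an open file object: non_ascii_with_status(open("IdnaMappingTable.txt", encoding="utf-8")).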
1523
1524 * run & fix ICU4C tests
1525 - bad Cherokee case folding due to difference in fallbacks:
1526   UCD case folding falls back to no mapping,
1527   ICU runtime case folding falls back to lowercasing;
1528   fixed casepropsbuilder.cpp to generate scf mappings to self
1529   when there is an slc mapping but no scf
1530 - Andy handles RBBI & spoof check test failures
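
The Cherokee case folding fix above can be illustrated with a small Python model. It is a sketch of the idea only, not the casepropsbuilder.cpp code. The function names are assumptions, and the two dictionaries hold just one Cherokee pair (U+13A0 lowercases to U+AB70, and U+AB70 folds to U+13A0).

```python
def runtime_fold(cp, slc, scf):
    """Model of the runtime fallback described above:
    use the scf mapping if present, otherwise fall back to lowercasing."""
    if cp in scf:
        return scf[cp]
    return slc.get(cp, cp)


def add_self_foldings(slc, scf):
    """Return a copy of scf with an explicit mapping to self for every
    code point that has an slc mapping but no scf mapping."""
    fixed = dict(scf)
    for cp in slc:
        fixed.setdefault(cp, cp)
    return fixed


if __name__ == "__main__":
    slc = {0x13A0: 0xAB70}   # simple lowercase mapping
    scf = {0xAB70: 0x13A0}   # simple case folding (Cherokee folds to uppercase)
    print("before: U+%04X" % runtime_fold(0x13A0, slc, scf))
    print("after:  U+%04X" % runtime_fold(0x13A0, slc, add_self_foldings(slc, scf)))
```

Without the explicit self-mapping, the model folds U+13A0 to its lowercase form, which disagrees with the UCD, where a missing folding means "no change".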
1531
1532 * collation: CLDR collation root, UCA DUCET
1533
1534 - UCA DUCET goes into Mark's Unicode tools, see
1535   https://sites.google.com/site/unicodetools/home#TOC-UCA
1536 - CLDR root data files are checked into (CLDR UCA branch)/common/uca/
1537 - cd (CLDR UCA branch)/common/uca/
1538 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1539   cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
1540 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1541     cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
1542     (note: the next command drops the underscore before "Rules" in the file name)
1543     cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
1544 - restore TODO diffs in UCARules.txt
1545     meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
1546 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
1547   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1548   from the CLDR root files (..._CLDR_..._SHORT.txt)
1549     cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1550     cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1551     cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
1552 - if CLDR common/uca/unihan-index.txt changes, then update
1553   CLDR common/collation/root.xml <collation type="private-unihan">
1554   and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
1555 - run genuca, see command line above;
1556   deal with
1557     Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt
1558         (add the character to genuca.cpp sampleCharsToScripts[])
1559   + look up the script for the new sample characters
1560     (e.g., in FractionalUCA.txt)
1561   + *add* mappings to sampleCharsToScripts[], do not replace them
1562     (in case the script sample characters flip-flop)
1563   + insert new scripts in DUCET script order, see the top_byte table
1564     at the beginning of FractionalUCA.txt
1565 - rebuild ICU4C
1566
1567 * run & fix ICU4C tests, now with new CLDR collation root data
1568 - run all tests with the collation test data *_SHORT.txt or the full files
1569   (the full ones have comments, useful for debugging)
1570 - note on intltest: if collate/UCAConformanceTest fails, then
1571   utility/MultithreadTest/TestCollators will fail as well;
1572   fix the conformance test before looking into the multi-thread test
1573 - fixed bug in CollationWeights::getWeightRanges()
1574   exposed by new data and CollationTest::TestRootElements
1575
1576 * update Java data files
1577 - refresh just the UCD/UCA-related/derived files, just to be safe
1578 - see (ICU4C)/source/data/icu4j-readme.txt
1579 - mkdir /tmp/icu4j
1580 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1581   output:
1582     ...
1583     Unicode .icu files built to ./out/build/icudt56l
1584     echo timestamp > uni-core-data
1585     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b
1586     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b
1587     echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
1588     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b
1589     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b"
1590     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/
1591     mkdir -p /tmp/icu4j/main/shared/data
1592     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1593     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/
1594     mkdir -p /tmp/icu4j/main/shared/data
1595     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1596     make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
1597 - copy the big-endian Unicode data files to another location,
1598   separate from the other data files,
1599   and then refresh ICU4J
1600     cd ~/svn.icu/trunk/dbg/data/out/icu4j
1601     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1602     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1603     cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1604     cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1605     rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1606     cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1607     cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1608     cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1609     jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1610
1611 * When refreshing all of ICU4J data from ICU4C
1612 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1613 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
1614 or
1615 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
1616
1617 * update CollationFCD.java
1618   + copy & paste the initializers of lcccIndex[] etc. from
1619     ICU4C/source/i18n/collationfcd.cpp to
1620     ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1621
1622 * refresh Java test .txt files
1623 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1624     cd $ICU_SRC_DIR/source/data/unidata
1625     cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1626     cd ../../test/testdata
1627     cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1628     cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1629
1630 * run & fix ICU4J tests
1631
1632 *** LayoutEngine script information
1633
1634 * ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more,
1635   because the layout engine was deprecated in ICU 54.
1636   Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java
1637   to write lines that we used to add manually.
1638
1639 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
1640   This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
1641   in the working directory.
1642
1643   (It also generates ScriptRunData.cpp, which is no longer needed.)
1644
1645   It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
1646   (a plain text file)
1647   which maps ICU versions to the numbers of script/language constants
1648   that were added then.
1649   (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
1650
1651   The generated files have a current copyright date and "@deprecated" statement.
1652
1653 * Review changes, fix Java tool if necessary, and copy to ICU4C
1654   cd ~/svn.icu4j/trunk/src
1655   meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
1656   cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
1657   cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
1658
1659 *** API additions
1660 - send notice to icu-design about new born-@stable API (enum constants etc.)
1661
1662 *** merge the Unicode update branches back onto the trunk
1663 - do not merge the icudata.jar and testdata.jar,
1664   instead rebuild them from merged & tested ICU4C
1665 - make sure that changes to Unicode tools & ICU tools are checked in
1666   http://www.unicode.org/utility/trac/log/trunk/unicodetools
1667   http://bugs.icu-project.org/trac/log/tools/trunk
1668
1669 ---------------------------------------------------------------------------- ***
1670
1671 Unicode 7.0 update for ICU 54
1672
1673 http://www.unicode.org/review/pri271/  -- beta review
1674 http://www.unicode.org/reports/uax-proposed-updates.html
1675 http://www.unicode.org/versions/beta-7.0.0.html#notable_issues
1676 http://www.unicode.org/reports/tr44/tr44-13.html
1677
1678 *** ICU Trac
1679
1680 - ticket 10821: Unicode 7.0, UCA 7.0
1681 - C++ branches/markus/uni70 at r35584 from trunk at r35580
1682 - Java branches/markus/uni70 at r35587 from trunk at r35545
1683
1684 *** CLDR Trac
1685
1686 - ticket 7195: UCA 7.0 CLDR root collation
1687 - branches/markus/uni70 at r10062 from trunk at r10061
1688
1689 - ticket 6762: script metadata for Unicode 7.0 new scripts
1690
1691 *** Unicode version numbers
1692 - makedata.mak
1693 - uchar.h
1694 - com.ibm.icu.util.VersionInfo
1695 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
1696
1697 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
1698   so that the makefiles see the new version number.
1699
1700 *** data files & enums & parser code
1701
1702 * file preparation
1703
1704 - download UCD & IDNA files
1705 - make sure that the Unicode data folder passed into preparseucd.py
1706   includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
1707 - only for manual diffs: remove version suffixes from the file names
1708   ~/unidata/uni70/20140403$ ../../desuffixucd.py .
1709   (see https://sites.google.com/site/unicodetools/inputdata)
1710 - only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
1711 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src
1712 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1713 - Restore TODO diffs in source/data/unidata/UCARules.txt
1714     cd $ICU_SRC_DIR
1715     meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt
1716 - Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt
1717
1718 - also: from http://unicode.org/Public/security/7.0.0/ download new
1719   confusables.txt & confusablesWholeScript.txt
1720   and copy to $ICU_ROOT/src/source/data/unidata/
1721
1722 * initial preparseucd.py changes
1723 - remove new Unicode scripts from the
1724   only-in-ISO-15924 list according to the error message:
1725     ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass',
1726                         'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm',
1727                         'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj']
1728     from _scripts_only_in_iso15924
1729   -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
1730       and in com.ibm.icu.dev.test.lang.TestUScript.java
1731 - NamesList.txt now has a heading with a non-ASCII character
1732   + keep ppucd.txt in platform charset, rather than changing tool/test parsers
1733   + escape non-ASCII characters in heading comments
1734 - preparseucd.py gets the Unicode copyright line from PropertyAliases.txt, whose copyright year is currently still 2013
1735   + get the copyright from the first file whose copyright line contains the current year
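
The copyright selection rule above ("first file whose copyright line contains the current year") can be sketched as follows. This is not the preparseucd.py implementation; the function name, the (name, text) input shape, and the sample header texts are assumptions made for illustration.

```python
import datetime
import re


def pick_copyright_line(file_texts, year=None):
    """Return the copyright line of the first file whose copyright line
    mentions the given year (default: the current year), else None.

    file_texts: iterable of (file name, file text) pairs, in reading order.
    """
    if year is None:
        year = datetime.date.today().year
    year_re = re.compile(r"\b%d\b" % year)
    for _name, text in file_texts:
        # Look only at the first line of each file that mentions "Copyright".
        for line in text.splitlines():
            if "Copyright" in line:
                if year_re.search(line):
                    return line
                break
    return None


if __name__ == "__main__":
    # Illustrative header texts, not copied from real UCD files.
    files = [
        ("PropertyAliases.txt", "# PropertyAliases\n# Copyright (c) 1991-2013 Unicode, Inc.\n"),
        ("Blocks.txt", "# Blocks\n# Copyright (c) 1991-2014 Unicode, Inc.\n"),
    ]
    print(pick_copyright_line(files, year=2014))
```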
1736
1737 * PropertyValueAliases.txt changes
1738 - 32 new Block (blk) values:
1739     blk; Bassa_Vah                        ; Bassa_Vah
1740     blk; Caucasian_Albanian               ; Caucasian_Albanian
1741     blk; Coptic_Epact_Numbers             ; Coptic_Epact_Numbers
1742     blk; Diacriticals_Ext                 ; Combining_Diacritical_Marks_Extended
1743     blk; Duployan                         ; Duployan
1744     blk; Elbasan                          ; Elbasan
1745     blk; Geometric_Shapes_Ext             ; Geometric_Shapes_Extended
1746     blk; Grantha                          ; Grantha
1747     blk; Khojki                           ; Khojki
1748     blk; Khudawadi                        ; Khudawadi
1749     blk; Latin_Ext_E                      ; Latin_Extended_E
1750     blk; Linear_A                         ; Linear_A
1751     blk; Mahajani                         ; Mahajani
1752     blk; Manichaean                       ; Manichaean
1753     blk; Mende_Kikakui                    ; Mende_Kikakui
1754     blk; Modi                             ; Modi
1755     blk; Mro                              ; Mro
1756     blk; Myanmar_Ext_B                    ; Myanmar_Extended_B
1757     blk; Nabataean                        ; Nabataean
1758     blk; Old_North_Arabian                ; Old_North_Arabian
1759     blk; Old_Permic                       ; Old_Permic
1760     blk; Ornamental_Dingbats              ; Ornamental_Dingbats
1761     blk; Pahawh_Hmong                     ; Pahawh_Hmong
1762     blk; Palmyrene                        ; Palmyrene
1763     blk; Pau_Cin_Hau                      ; Pau_Cin_Hau
1764     blk; Psalter_Pahlavi                  ; Psalter_Pahlavi
1765     blk; Shorthand_Format_Controls        ; Shorthand_Format_Controls
1766     blk; Siddham                          ; Siddham
1767     blk; Sinhala_Archaic_Numbers          ; Sinhala_Archaic_Numbers
1768     blk; Sup_Arrows_C                     ; Supplemental_Arrows_C
1769     blk; Tirhuta                          ; Tirhuta
1770     blk; Warang_Citi                      ; Warang_Citi
1771   -> add to uchar.h
1772     use long property names for enum constants
1773   -> add to UCharacter.UnicodeBlock IDs
1774     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
1775             replace  public static final int \1_ID = \2; \3
1776   -> add to UCharacter.UnicodeBlock objects
1777     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
1778             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1779 - 28 new Joining_Group (jg) values:
1780     jg ; Manichaean_Aleph                 ; Manichaean_Aleph
1781     jg ; Manichaean_Ayin                  ; Manichaean_Ayin
1782     jg ; Manichaean_Beth                  ; Manichaean_Beth
1783     jg ; Manichaean_Daleth                ; Manichaean_Daleth
1784     jg ; Manichaean_Dhamedh               ; Manichaean_Dhamedh
1785     jg ; Manichaean_Five                  ; Manichaean_Five
1786     jg ; Manichaean_Gimel                 ; Manichaean_Gimel
1787     jg ; Manichaean_Heth                  ; Manichaean_Heth
1788     jg ; Manichaean_Hundred               ; Manichaean_Hundred
1789     jg ; Manichaean_Kaph                  ; Manichaean_Kaph
1790     jg ; Manichaean_Lamedh                ; Manichaean_Lamedh
1791     jg ; Manichaean_Mem                   ; Manichaean_Mem
1792     jg ; Manichaean_Nun                   ; Manichaean_Nun
1793     jg ; Manichaean_One                   ; Manichaean_One
1794     jg ; Manichaean_Pe                    ; Manichaean_Pe
1795     jg ; Manichaean_Qoph                  ; Manichaean_Qoph
1796     jg ; Manichaean_Resh                  ; Manichaean_Resh
1797     jg ; Manichaean_Sadhe                 ; Manichaean_Sadhe
1798     jg ; Manichaean_Samekh                ; Manichaean_Samekh
1799     jg ; Manichaean_Taw                   ; Manichaean_Taw
1800     jg ; Manichaean_Ten                   ; Manichaean_Ten
1801     jg ; Manichaean_Teth                  ; Manichaean_Teth
1802     jg ; Manichaean_Thamedh               ; Manichaean_Thamedh
1803     jg ; Manichaean_Twenty                ; Manichaean_Twenty
1804     jg ; Manichaean_Waw                   ; Manichaean_Waw
1805     jg ; Manichaean_Yodh                  ; Manichaean_Yodh
1806     jg ; Manichaean_Zayin                 ; Manichaean_Zayin
1807     jg ; Straight_Waw                     ; Straight_Waw
1808   -> uchar.h & UCharacter.JoiningGroup
1809 - 23 new Script (sc) values:
1810     sc ; Aghb                             ; Caucasian_Albanian
1811     sc ; Bass                             ; Bassa_Vah
1812     sc ; Dupl                             ; Duployan
1813     sc ; Elba                             ; Elbasan
1814     sc ; Gran                             ; Grantha
1815     sc ; Hmng                             ; Pahawh_Hmong
1816     sc ; Khoj                             ; Khojki
1817     sc ; Lina                             ; Linear_A
1818     sc ; Mahj                             ; Mahajani
1819     sc ; Mani                             ; Manichaean
1820     sc ; Mend                             ; Mende_Kikakui
1821     sc ; Modi                             ; Modi
1822     sc ; Mroo                             ; Mro
1823     sc ; Narb                             ; Old_North_Arabian
1824     sc ; Nbat                             ; Nabataean
1825     sc ; Palm                             ; Palmyrene
1826     sc ; Pauc                             ; Pau_Cin_Hau
1827     sc ; Perm                             ; Old_Permic
1828     sc ; Phlp                             ; Psalter_Pahlavi
1829     sc ; Sidd                             ; Siddham
1830     sc ; Sind                             ; Khudawadi
1831     sc ; Tirh                             ; Tirhuta
1832     sc ; Wara                             ; Warang_Citi
1833   -> uscript.h (many were added before)
1834     comment "Mende Kikakui" for USCRIPT_MENDE
1835     add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias
1836   -> com.ibm.icu.lang.UScript
1837     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
1838     replace  public static final int \1 = \2; \3
1839 - 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
1840   (added 2012-11-01)
1841     Ahom        338     Ahom
1842     Hatr        127     Hatran
1843     Mult        323     Multani
1844   (added 2013-10-12)
1845     Modi        324     Modi
1846     Pauc        263     Pau Cin Hau
1847     Sidd        302     Siddham
1848   -> uscript.h (some overlap with additions from Unicode)
1849   -> com.ibm.icu.lang.UScript
1850     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
1851     replace  public static final int \1 = \2; \3
1852   -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924
1853   -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
1854       and in com.ibm.icu.dev.test.lang.TestUScript.java
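
The find/replace pair used above for com.ibm.icu.lang.UScript can be tried outside an editor. A minimal sketch in Python, using the same pattern and replacement as in the log; the input line and the name USCRIPT_EXAMPLE are made up for illustration and are not copied from uscript.h.

```python
import re

# Same expression as the "find" line in the log:
# \1 = constant name without the USCRIPT_ prefix, \2 = numeric value, \3 = rest of line.
PATTERN = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
REPLACEMENT = r"public static final int \1 = \2; \3"


def c_enum_to_java(line: str) -> str:
    """Rewrite one C enum line as a Java int constant; other lines pass through unchanged."""
    return PATTERN.sub(REPLACEMENT, line)


# Hypothetical input, shaped like an enum entry with a trailing comment.
sample = "      USCRIPT_EXAMPLE                       = 999,/* Exmp */"
print(c_enum_to_java(sample))
```

Leading indentation is kept because the pattern only starts matching at USCRIPT_.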
1855
1856 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
1857     (not strictly necessary for NOT_ENCODED scripts)
1858   ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
1859
1860 * generate normalization data files
1861 - cd $ICU_ROOT/dbg
1862 - export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
1863 - SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
1864 - UNIDATA=$ICU_SRC_DIR/source/data/unidata
1865 - bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
1866 - bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
1867 - bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
1868 - bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
1869 - bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
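
The four .nrm invocations above follow one pattern: one output file, one shared source directory, and a growing list of input files. A sketch that only builds the argument lists and does not run anything; the function name is hypothetical, the --csource variant is left out, and POSIX-style paths are assumed.

```python
import posixpath

# Output file -> input files, in the order given in the log.
NORM2_INPUTS = {
    "nfc.nrm": ["nfc.txt"],
    "nfkc.nrm": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf.nrm": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46.nrm": ["nfc.txt", "uts46.txt"],
}


def gennorm2_commands(icu_src_dir: str):
    """Return one argv list per .nrm file, relative to the ICU build directory."""
    src_data_in = posixpath.join(icu_src_dir, "source", "data", "in")
    norm2_dir = posixpath.join(icu_src_dir, "source", "data", "unidata", "norm2")
    return [
        ["bin/gennorm2", "-o", posixpath.join(src_data_in, out), "-s", norm2_dir, *files]
        for out, files in NORM2_INPUTS.items()
    ]


for argv in gennorm2_commands("/path/to/icu/src"):
    print(" ".join(argv))
```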
1870
1871 * build ICU (make install)
1872   so that the tools build can pick up the new definitions from the installed header files.
1873
1874 ~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
1875
1876 * build Unicode tools using CMake+make
1877
1878 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
1879
1880 # Location (--prefix) of where ICU was installed.
1881 set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst)
1882 # Location of the ICU source tree.
1883 set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src)
1884
1885 ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
1886 ~/svn.icutools/trunk/dbg/unicode/c$ make
1887
1888 * genprops work
1889 - new code point range for Joining_Group values: 10AC0..10AFF Manichaean
1890   + add second array of Joining_Group values for at most 10800..10FFF
1891     icutools: unicode/c/genprops/bidipropsbuilder.cpp
1892     icu: source/common/ubidi_props.h/.c/_data.h
1893     icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java
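
The idea of a second Joining_Group array can be sketched as a lookup over two disjoint code point ranges. This is an illustration only, not the ubidi_props code: the range limits, the names and the numeric value 1 are assumptions.

```python
# Hypothetical ranges; the real limits live in the generated data.
JG_START, JG_LIMIT = 0x620, 0x8A0          # first array
JG_START2, JG_LIMIT2 = 0x10AC0, 0x10B00    # second array, covers 10AC0..10AFF

NO_JOINING_GROUP = 0

jg_array = [NO_JOINING_GROUP] * (JG_LIMIT - JG_START)
jg_array2 = [NO_JOINING_GROUP] * (JG_LIMIT2 - JG_START2)

# Made-up value standing in for Manichaean_Aleph at U+10AC0.
jg_array2[0x10AC0 - JG_START2] = 1


def get_joining_group(c: int) -> int:
    """Look up c in whichever array covers it; everything else has no joining group."""
    if JG_START <= c < JG_LIMIT:
        return jg_array[c - JG_START]
    if JG_START2 <= c < JG_LIMIT2:
        return jg_array2[c - JG_START2]
    return NO_JOINING_GROUP


print(get_joining_group(0x10AC0), get_joining_group(0x41))
```

Two small arrays avoid one large array spanning the gap between the two ranges.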
1894
1895 * generate core properties data files
1896 - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
1897 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR
1898 - rebuild ICU (make install) & tools
1899 - run genuca again (see step above) so that it picks up the new nfc.nrm
1900 - rebuild ICU (make install) & tools
1901
1902 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1903   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1904 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1905 - Unicode 6.0..7.0: U+2260, U+226E, U+226F
1906 - nothing new in 7.0, no test file to update
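
The grep step above can also be done with a small filter that drops the ASCII ranges automatically. A sketch assuming the usual "code point or range ; status ; ..." layout with # comments; the sample lines only illustrate that layout and are not copied from a particular file version.

```python
def non_ascii_std3_valid(lines):
    """Yield (start, end) code point ranges with status disallowed_STD3_valid above U+007F."""
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        if end > 0x7F:
            # Clip a range that straddles the ASCII boundary.
            yield (max(start, 0x80), end)


sample = [
    "# illustrative lines only",
    "0000..002C    ; disallowed_STD3_valid                  # controls..COMMA",
    "00C0          ; mapped                 ; 00E0          # LATIN CAPITAL LETTER A WITH GRAVE",
    "2260          ; disallowed_STD3_valid                  # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid                  # NOT LESS-THAN..NOT GREATER-THAN",
]
for start, end in non_ascii_std3_valid(sample):
    print("U+%04X..U+%04X" % (start, end))
```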
1907
1908 * run & fix ICU4C tests
1909
1910 * update Java data files
1911 - refresh just the UCD-related files, just to be safe
1912 - see (ICU4C)/source/data/icu4j-readme.txt
1913 - mkdir /tmp/icu4j
1914 - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1915   output:
1916     ...
1917     Unicode .icu files built to ./out/build/icudt53l
1918     echo timestamp > uni-core-data
1919     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b
1920     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b
1921     echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
1922     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b
1923     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b"
1924     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/
1925     mkdir -p /tmp/icu4j/main/shared/data
1926     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1927     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/
1928     mkdir -p /tmp/icu4j/main/shared/data
1929     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1930     make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data'
1931 - copy the big-endian Unicode data files to another location,
1932   separate from the other data files
1933     ICUDT=icudt54b
1934     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1935     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1936     cd ~/svn.icu/uni70/dbg/data/out/icu4j
1937     cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1938     cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1939     rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
1940     cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
1941     cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1942     cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
1943 - refresh ICU4J
1944     ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
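
The selective copy above (data names ending in "l" are little-endian, those ending in "b" big-endian) can be sketched as a table of (subdirectory, pattern) rules. The function name and the demo file names are hypothetical; the demo runs on empty files in a temporary directory, not on real ICU data.

```python
import glob
import os
import shutil
import tempfile

# (subdirectory, glob pattern) pairs mirroring the cp commands in the log.
COPY_RULES = [
    ("", "confusables.cfu"),
    ("", "*.icu"),
    ("", "*.nrm"),
    ("coll", "*.icu"),
    ("brkitr", "*"),
]


def copy_unicode_data(src_root, dst_root):
    """Copy the Unicode-related files; return their paths relative to dst_root."""
    copied = []
    for subdir, pattern in COPY_RULES:
        dst_dir = os.path.join(dst_root, subdir)
        os.makedirs(dst_dir, exist_ok=True)
        for path in sorted(glob.glob(os.path.join(src_root, subdir, pattern))):
            name = os.path.basename(path)
            if not os.path.isfile(path):
                continue
            # The log copies *.icu and then removes cnvalias.icu;
            # skipping it gives the same result in a fresh destination.
            if subdir == "" and name == "cnvalias.icu":
                continue
            shutil.copy(path, os.path.join(dst_dir, name))
            copied.append(os.path.join(subdir, name))
    return copied


# Demo on a throwaway tree with hypothetical, empty files.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src")
    dst = os.path.join(tmp, "dst")
    for rel in ["uprops.icu", "cnvalias.icu", "nfc.nrm", "root.res",
                "coll/ucadata.icu", "coll/root.res", "brkitr/word.brk"]:
        path = os.path.join(src, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    copied = copy_unicode_data(src, dst)
    skipped_cnvalias = not os.path.exists(os.path.join(dst, "cnvalias.icu"))
    print(sorted(copied))
```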
1945
1946 * update CollationFCD.java
1947   + copy & paste the initializers of lcccIndex[] etc. from
1948     ICU4C/source/i18n/collationfcd.cpp to
1949     ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
1950
1951 * refresh Java test .txt files
1952 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1953     cd $ICU_SRC_DIR/source/data/unidata
1954     cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1955     cd ../../test/testdata
1956     cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1957     cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
1958
1959 * UCA
1960
1961 - download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/
1962 - run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata)
1963 - update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/
1964 - run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA
1965 - output files are in ~/svn.unitools/Generated/uca/7.0.0/
1966 - review data; compare files, use blankweights.sed or similar
1967   ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt
1968 - cd ~/svn.unitools/Generated/uca/7.0.0/
1969 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1970   cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
1971 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1972     (note removing the underscore before "Rules")
1973     cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
1974 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
1975   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1976   with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
1977     cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
1978     cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
1979     cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
1980 - run genuca, see command line above
1981 - rebuild ICU4C
1982 - refresh ICU4J collation data:
1983   (subset of instructions above for properties data refresh, except copies all coll/*)
1984     ICUDT=icudt54b
1985     ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1986     ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1987     ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
1988     ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
1989 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
1990 - note on intltest: if collate/UCAConformanceTest fails, then
1991   utility/MultithreadTest/TestCollators will fail as well;
1992   fix the conformance test before looking into the multi-thread test
1993 - copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors
1994 - copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch
1995   ~/svn.unitools$ cp Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/
1996
1997 * When refreshing all of ICU4J data from ICU4C
1998 - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1999 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
2000 or
2001 - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
2002
2003 * run & fix ICU4J tests
2004
2005 *** LayoutEngine script information
2006
2007 (For details see the Unicode 5.2 change log below.)
2008
2009 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
2010   This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
2011   in the working directory.
2012   (It also generates ScriptRunData.cpp, which is no longer needed.)
2013
2014   The generated files have a current copyright date and "@stable" statement.
2015   ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java
2016   for "born stable" Unicode API constants, and to stop parsing ICU version numbers
2017   which may not contain dots any more.
2018
2019 - diff current <icu>/source/layout files vs. generated ones
2020     ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
2021   review and manually merge desired changes;
2022   fix gratuitous changes, incorrect @draft/@stable and missing aliases;
2023   Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
2024 - if you just copy the above files, then
2025   fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
2026   manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
2027
2028 *** API additions
2029 - send notice to icu-design about new born-@stable API (enum constants etc.)
2030
2031 *** merge the Unicode update branches back onto the trunk
2032 - do not merge the icudata.jar and testdata.jar,
2033   instead rebuild them from merged & tested ICU4C
2034
2035 ---------------------------------------------------------------------------- ***
2036
2037 Unicode 6.3 update
2038
2039 http://www.unicode.org/review/pri249/  -- beta review
2040 http://www.unicode.org/reports/uax-proposed-updates.html
2041 http://www.unicode.org/versions/beta-6.3.0.html#notable_issues
2042 http://www.unicode.org/reports/tr44/tr44-11.html
2043
2044 *** ICU Trac
2045
2046 - ticket 10128: update ICU to Unicode 6.3 beta
2047 - ticket 10168: update ICU to Unicode 6.3 final
2048 - C++ branches/markus/uni63 at r33552 from trunk at r33551
2049 - Java branches/markus/uni63 at r33550 from trunk at r33553
2050
2051 - ticket 10142: implement Unicode 6.3 bidi algorithm additions
2052
2053 *** Unicode version numbers
2054 - makedata.mak
2055 - uchar.h
2056   (configure.in & configure: have been modified to extract the version from uchar.h)
2057 - com.ibm.icu.util.VersionInfo
2058 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
2059
2060 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
2061   so that the makefiles see the new version number.
2062
2063 *** data files & enums & parser code
2064
2065 * file preparation
2066
2067 - download UCD, UCA & IDNA files
2068 - make sure that the Unicode data folder passed into preparseucd.py
2069   includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
2070 - modify preparseucd.py:
2071   parse new file BidiBrackets.txt
2072   with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type
2073 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src
2074 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2075 - Check test file diffs for previously commented-out, known-failing data lines;
2076   probably need to keep those commented out.
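
For orientation, BidiBrackets.txt carries three fields per line: the code point, its Bidi_Paired_Bracket (bpb) and its Bidi_Paired_Bracket_Type (bpt). A minimal parsing sketch; this is not the code in preparseucd.py, and the function name is an assumption.

```python
def parse_bidi_brackets(lines):
    """Return {code point: (bpb code point, bpt letter)} from BidiBrackets.txt-style lines."""
    result = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        cp, bpb, bpt = (field.strip() for field in data.split(";"))
        result[int(cp, 16)] = (int(bpb, 16), bpt)
    return result


sample = [
    "# code point; bpb; bpt",
    "0028; 0029; o # LEFT PARENTHESIS",
    "0029; 0028; c # RIGHT PARENTHESIS",
]
for cp, (bpb, bpt) in sorted(parse_bidi_brackets(sample).items()):
    print("U+%04X bpb=U+%04X bpt=%s" % (cp, bpb, bpt))
```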
2077
2078 * PropertyAliases.txt changes
2079 - 1 new Enumerated Property
2080   bpt                      ; Bidi_Paired_Bracket_Type
2081   -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType
2082   -> ubidi_props.h & .c & UBiDiProps.java
2083   -> remember to write the max value at UBIDI_MAX_VALUES_INDEX
2084   -> uprops.cpp
2085   -> change ubidi.icu format version from 2.0 to 2.1
2086 - 1 new Miscellaneous Property
2087   bpb                      ; Bidi_Paired_Bracket
2088   -> uchar.h & UProperty.java
2089   -> ppucd.h & .cpp
2090
2091 * PropertyValueAliases.txt changes
2092 - 3 Bidi_Paired_Bracket_Type (bpt) values:
2093   bpt; c                                ; Close
2094   bpt; n                                ; None
2095   bpt; o                                ; Open
2096   -> uchar.h & UCharacter.BidiPairedBracketType
2097   -> ubidi_props.h & .c & UBiDiProps.java
2098   -> change ubidi.icu format version from 2.0 to 2.1
2099 - 4 new Bidi_Class (bc) values:
2100   bc ; FSI                              ; First_Strong_Isolate
2101   bc ; LRI                              ; Left_To_Right_Isolate
2102   bc ; RLI                              ; Right_To_Left_Isolate
2103   bc ; PDI                              ; Pop_Directional_Isolate
2104   -> uchar.h & UCharacterEnums.ECharacterDirection
2105   -> until the bidi code gets updated,
2106      Roozbeh suggests mapping the new bc values to ON (Other_Neutral)
2107 - 3 new Word_Break (WB) values:
2108   WB ; HL                               ; Hebrew_Letter
2109   WB ; SQ                               ; Single_Quote
2110   WB ; DQ                               ; Double_Quote
2111   -> uchar.h & UCharacter.WordBreak
2112   -> first time Word_Break numeric constants exceed 4 bits (now 17 values)
2113 - 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
2114   (added 2012-10-16)
2115   Aghb  239     Caucasian Albanian
2116   Mahj  314     Mahajani
2117   -> uscript.h
2118   -> com.ibm.icu.lang.UScript
2119     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2120     replace  public static final int \1 = \2;\3
2121   -> preparseucd.py _scripts_only_in_iso15924
2122   -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
2123       and in com.ibm.icu.dev.test.lang.TestUScript.java
2124   -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
2125      (not strictly necessary for NOT_ENCODED scripts)
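
On the Word_Break note above: the step from 4 to 5 bits is plain arithmetic. With three new values the count goes from 14 to 17, and 17 distinct values no longer fit into the 16 slots of a 4-bit field. A small sketch; the helper name is hypothetical.

```python
def bits_needed(value_count: int) -> int:
    """Minimum number of bits that can hold the values 0..value_count-1."""
    return max(1, (value_count - 1).bit_length())


for count in (14, 16, 17):
    print(count, "values ->", bits_needed(count), "bits")
```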
2126
2127 * generate normalization data files
2128 - ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib
2129 - ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in
2130 - ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata
2131 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
2132 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
2133 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
2134 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
2135
2136 * build ICU (make install)
2137   so that the tools build can pick up the new definitions from the installed header files.
2138
2139 ~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
2140
2141 * build Unicode tools using CMake+make
2142
2143 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
2144
2145 # Location (--prefix) of where ICU was installed.
2146 set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst)
2147 # Location of the ICU source tree.
2148 set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src)
2149
2150 ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
2151 ~/svn.icutools/trunk/dbg/unicode/c$ make
2152
2153 * generate core properties data files
2154 - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src
2155 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src
2156 - rebuild ICU (make install) & tools
2157 - run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
2158 - rebuild ICU (make install) & tools
2159
2160 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2161   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2162 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2163 - Unicode 6.0..6.3: U+2260, U+226E, U+226F
2164 - nothing new in 6.3, no test file to update
2165
2166 * update Java data files
2167 - refresh just the UCD-related files, just to be safe
2168 - see (ICU4C)/source/data/icu4j-readme.txt
2169 - mkdir /tmp/icu4j
2170 - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2171   output:
2172     ...
2173     Unicode .icu files built to ./out/build/icudt52l
2174     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b
2175     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b
2176     echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
2177     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b
2178     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b"
2179     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/
2180     mkdir -p /tmp/icu4j/main/shared/data
2181     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2182     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/
2183     mkdir -p /tmp/icu4j/main/shared/data
2184     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2185     make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data'
2186 - copy the big-endian Unicode data files to another location,
2187   separate from the other data files
2188     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
2189     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
2190     ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
2191     ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu
2192     ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
2193     ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
2194     ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
2195 - refresh ICU4J
2196     ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
2197
2198 * refresh Java test .txt files
2199 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2200
2201 * UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files
2202
2203 - get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
2204 - CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
2205 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2206 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2207   (note removing the underscore before "Rules")
2208 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
2209   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2210   with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
2211 - check test file diffs for previously commented-out, known-failing data lines;
2212   probably need to keep those commented out
2213 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
2214 - run genuca, see command line above
2215 - rebuild ICU4C
2216 - refresh ICU4J collation data:
2217   (subset of instructions above for properties data refresh, except copies all coll/*)
2218     ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2219     ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
2220     ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
2221     ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
2222 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
2223 - note on intltest: if collate/UCAConformanceTest fails, then
2224   utility/MultithreadTest/TestCollators will fail as well;
2225   fix the conformance test before looking into the multi-thread test
2226
2227 * test ICU, fix test code where necessary
2228
2229 * When refreshing all of ICU4J data from ICU4C
2230 - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2231 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
2232 or
2233 - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
2234
2235 *** LayoutEngine script information
2236 - skipped for Unicode 6.3: no new scripts
2237
2238 *** merge the Unicode update branches back onto the trunk
2239 - do not merge the icudata.jar and testdata.jar,
2240   instead rebuild them from merged & tested ICU4C
2241
2242 ---------------------------------------------------------------------------- ***
2243
2244 Unicode 6.2 update
2245
2246 http://www.unicode.org/review/pri230/
2247 http://www.unicode.org/versions/beta-6.2.0.html
2248 http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0
2249 http://www.unicode.org/review/pri227/  Changes to Script Extensions Property Values
2250 http://www.unicode.org/review/pri228/  Changing some common characters from Punctuation to Symbol
2251 http://www.unicode.org/review/pri229/  Linebreaking Changes for Pictographic Symbols
2252 http://www.unicode.org/reports/tr46/tr46-8.html  IDNA
2253 http://unicode.org/Public/idna/6.2.0/
2254
2255 *** ICU Trac
2256
2257 - ticket 9515: Unicode 6.2: final ICU update
2258
2259 - ticket 9514: UCA 6.2: fix UCARules.txt
2260
2261 - ticket 9437: update ICU to Unicode 6.2
2262 - C++ branches/markus/uni62 at r32050 from trunk at r32041
2263 - Java branches/markus/uni62 at r32068 from trunk at r32066
2264
2265 *** Unicode version numbers
2266 - makedata.mak
2267 - uchar.h
2268   (configure.in & configure: have been modified to extract the version from uchar.h)
2269 - com.ibm.icu.util.VersionInfo
2270 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
2271
2272 *** data files & enums & parser code
2273
2274 * file preparation
2275
2276 - download UCD, UCA & IDNA files
2277 - make sure that the Unicode data folder passed into preparseucd.py
2278   includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
2279 - modify preparseucd.py: NamesList.txt is now in UTF-8
2280 - ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src
2281 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2282 - Check test file diffs for previously commented-out, known-failing data lines;
2283   probably need to keep those commented out.
2284
2285 * PropertyValueAliases.txt changes
2286 - 1 new Line_Break (lb) value:
2287   lb ; RI                               ; Regional_Indicator
2288   -> uchar.h & UCharacter.LineBreak
2289 - 1 new Word_Break (WB) value:
2290   WB ; RI                               ; Regional_Indicator
2291   -> uchar.h & UCharacter.WordBreak
2292 - 1 new Grapheme_Cluster_Break (GCB) value:
2293   GCB; RI                               ; Regional_Indicator
2294   -> uchar.h & UCharacter.GraphemeClusterBreak
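
The three entries above share the short name RI across different properties, which is easy to confirm mechanically. A sketch that splits PropertyValueAliases.txt-style lines into (property, short name, long names); the function name is an assumption.

```python
def parse_value_alias(line):
    """Split 'prop ; short ; long [; more]' into (prop, short, [long names]); None for other lines."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    if len(fields) < 3:
        return None
    return fields[0], fields[1], fields[2:]


sample = [
    "lb ; RI                               ; Regional_Indicator",
    "WB ; RI                               ; Regional_Indicator",
    "GCB; RI                               ; Regional_Indicator",
]
props_with_ri = [parsed[0] for parsed in map(parse_value_alias, sample) if parsed and parsed[1] == "RI"]
print(props_with_ri)
```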
2295
2296 * 3 new numeric values
2297   The new value -1 was really supposed to be NaN, but that would have required
2298   new UnicodeData.txt syntax. It can already be represented as a "fraction" of -1/1,
2299   but encodeNumericValue() in corepropsbuilder.cpp had to be fixed.
2300     cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1
2301     cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1
2302   The two new values 216000 and 432000 require an addition to the encoding of numeric values.
2303     cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000
2304     cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000
2305   -> uprops.h, uchar.c & UCharacterProperty.java
2306   -> cucdtst.c & UCharacterTest.java
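
Why the two large values are easy to add: both are small multiples of a power of 60 (216000 = 1 * 60^3, 432000 = 2 * 60^3), so a compact mantissa-and-exponent form covers them. A sketch of that decomposition; the bounds (mantissa 1..9, exponent 1..4) are an assumption for illustration and are not taken from uprops.h.

```python
from fractions import Fraction


def as_base60(n: int):
    """Return (mantissa, exponent) with n == mantissa * 60**exponent, or None.

    Assumed bounds for this sketch: mantissa 1..9, exponent 1..4.
    """
    for exponent in range(4, 0, -1):
        mantissa, remainder = divmod(n, 60 ** exponent)
        if remainder == 0 and 1 <= mantissa <= 9:
            return mantissa, exponent
    return None


# -1 written as the "fraction" -1/1 is the same number.
print(Fraction(-1, 1) == -1)
print(as_base60(216000), as_base60(432000))
```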
2307
2308 * generate normalization data files
2309 - ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib
2310 - ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in
2311 - ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata
2312 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
2313 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
2314 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
2315 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
2316
2317 * build ICU (make install)
2318   so that the tools build can pick up the new definitions from the installed header files.
2319 * build Unicode tools using CMake+make
2320
2321 * generate core properties data files
2322 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src
2323 - in initial bootstrapping, change the UCA version
2324   in source/data/unidata/FractionalUCA.txt to match the new Unicode version
2325 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src
2326 - rebuild ICU (make install) & tools
2327   + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
2328     check if the UCA version in FractionalUCA.txt matches the new Unicode version
2329     (see step above)
2330 - run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
2331 - rebuild ICU (make install) & tools
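
The version mismatch behind the genrb failure above can be checked before building. A sketch that assumes FractionalUCA.txt carries a line of the form "[UCA version = x.y.z]"; that exact syntax is an assumption here, and both function names are hypothetical.

```python
import re

# Assumed header syntax: [UCA version = 6.2.0]
UCA_VERSION_RE = re.compile(r"\[UCA version\s*=\s*([0-9]+(?:\.[0-9]+)*)\]")


def uca_version(text: str):
    """Return the UCA version string found in the text, or None."""
    match = UCA_VERSION_RE.search(text)
    return match.group(1) if match else None


def same_major_minor(version_a: str, version_b: str) -> bool:
    """Compare only the first two components, so that 6.2 matches 6.2.0."""
    return version_a.split(".")[:2] == version_b.split(".")[:2]


sample = "# Fractional UCA Table (illustrative header)\n[UCA version = 6.2.0]\n"
found = uca_version(sample)
print(found, same_major_minor(found, "6.2"))
```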
2332
2333 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2334   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2335 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2336 - Unicode 6.0..6.2: U+2260, U+226E, U+226F
2337 - nothing new in 6.2, no test file to update
2338
2339 * update Java data files
2340 - refresh just the UCD-related files, just to be safe
2341 - see (ICU4C)/source/data/icu4j-readme.txt
2342 - mkdir /tmp/icu4j
2343 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2344   output:
2345     ...
2346     Unicode .icu files built to ./out/build/icudt50l
2347     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b
2348     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b
2349     echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
2350     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b
2351     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b"
2352     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/
2353     mkdir -p /tmp/icu4j/main/shared/data
2354     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2355     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/
2356     mkdir -p /tmp/icu4j/main/shared/data
2357     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2358     make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data'
2359 - copy the big-endian Unicode data files to another location,
2360   separate from the other data files
2361     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
2362     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
2363     ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
2364     ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu
2365     ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
2366     ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
2367     ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
2368 - refresh ICU4J
2369     ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
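
The manual copy steps above could be scripted roughly as follows. This is a
sketch only: the function name and the demo file names are hypothetical, and
it skips cnvalias.icu up front instead of copying and then removing it.

```python
import shutil
import tempfile
from pathlib import Path

# (subfolder, glob pattern) pairs, mirroring the cp commands in the log.
COPY_RULES = (("", "*.icu"), ("", "*.nrm"), ("coll", "*.icu"), ("brkitr", "*"))


def stage_unicode_data(src: Path, dst: Path) -> list:
    """Copy the Unicode-related data files from src to dst; return relative names."""
    copied = []
    for sub, pattern in COPY_RULES:
        out_dir = dst / sub
        out_dir.mkdir(parents=True, exist_ok=True)  # like mkdir -p
        if not (src / sub).is_dir():
            continue
        for f in sorted((src / sub).glob(pattern)):
            if not f.is_file() or f.name == "cnvalias.icu":
                continue
            shutil.copy2(f, out_dir / f.name)
            copied.append((Path(sub) / f.name).as_posix())
    return copied


# Demo on a throwaway directory tree with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "out", Path(tmp) / "staged"
    for name in ("uprops.icu", "cnvalias.icu", "nfc.nrm", "root.res",
                 "coll/ucadata.icu", "coll/root.res", "brkitr/char.brk"):
        path = src / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(b"")
    staged = stage_unicode_data(src, dst)
    cnvalias_staged = (dst / "cnvalias.icu").exists()

print(staged)
```

In real use, src would be the .../data/out/icu4j/com/ibm/icu/impl/data/icudt50b
folder and dst the matching folder under /tmp/icu4j, followed by the jar uf
step shown above.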
2370
2371 * refresh Java test .txt files
2372 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2373
2374 * UCA
2375
2376 - get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
2377 - CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
2378 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2379 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2380   (note removing the underscore before "Rules")
2381 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
2382   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2383   with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
2384 - check test file diffs for previously commented-out, known-failing data lines;
2385   probably need to keep those commented out
2386 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
2387 - run genuca, see command line above
2388 - rebuild ICU4C
2389 - refresh ICU4J collation data:
2390   (subset of instructions above for properties data refresh, except copies all coll/*)
2391     ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2392     ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
2393     ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
2394     ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
2395 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
2396 - note on intltest: if collate/UCAConformanceTest fails, then
2397   utility/MultithreadTest/TestCollators will fail as well;
2398   fix the conformance test before looking into the multi-thread test
2399
2400 * test ICU, fix test code where necessary
2401
2402 * When refreshing all of ICU4J data from ICU4C
2403 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2404 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
2405 or
2406 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
2407
2408 *** LayoutEngine script information
2409 - skipped for Unicode 6.2: no new scripts
2410
2411 *** merge the Unicode update branches back onto the trunk
2412 - do not merge the icudata.jar and testdata.jar,
2413   instead rebuild them from merged & tested ICU4C
2414
2415 ---------------------------------------------------------------------------- ***
2416
2417 Future Unicode update
2418
2419 Tools simplified since the Unicode 6.1 update. See
2420 - http://site.icu-project.org/design/props/ppucd
2421 - http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972
2422
2423 * Unicode version numbers
2424 - icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates
2425
2426 * file preparation
2427 - ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:
2428 - ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src
2429 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
2430 - Check test file diffs for previously commented-out, known-failing data lines;
2431   probably need to keep those commented out.
2432
2433 * PropertyValueAliases.txt changes
2434 - Script codes that are in ISO 15924 but not in Unicode are now listed in
2435   preparseucd.py, in the _scripts_only_in_iso15924 variable.
2436   If there are new ISO codes, then add them.
2437   If Unicode adds some of them, then remove them from the .py variable.
2438
2439 * UnicodeData.txt changes
2440 - No more manual changes for CJK ranges for algorithmic names;
2441   those are now written to ppucd.txt and genprops reads them from there.
2442
2443 * generate core properties data files (makeprops.sh was deleted)
2444 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src
2445
2446 * no more manual updates of source/data/unidata/norm2/nfkc_cf.txt
2447 - it is now generated by preparseucd.py
2448
2449 * no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt
2450 - it is now generated by preparseucd.py
2451 - make sure that the Unicode data folder passed into preparseucd.py
2452   includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
2453   (can be in some subfolder)
2454
2455 * generate normalization data files
2456 - ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib
2457 - ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in
2458 - ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata
2459 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
2460 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
2461 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
2462 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
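
The four gennorm2 invocations above follow one pattern: each output file is
built from nfc.txt plus the files layered on top of it. A small sketch that
only assembles the command lines (it does not run them); the table and
function names are hypothetical, the -o and -s options are the ones used above.

```python
# Output file -> input files, exactly as listed in the commands above.
NORM2_BUILDS = {
    "nfc.nrm": ["nfc.txt"],
    "nfkc.nrm": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf.nrm": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46.nrm": ["nfc.txt", "uts46.txt"],
}


def gennorm2_commands(src_data_in, unidata):
    """Build one argument list per .nrm file."""
    return [
        ["bin/gennorm2", "-o", f"{src_data_in}/{out}", "-s", f"{unidata}/norm2", *inputs]
        for out, inputs in NORM2_BUILDS.items()
    ]


commands = gennorm2_commands("$SRC_DATA_IN", "$UNIDATA")
for cmd in commands:
    print(" ".join(cmd))
```

Keeping the table in one place makes it harder to forget a base file when a
new normalization data file is added.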
2463
2464 * build ICU (make install)
2465 * build Unicode tools using CMake+make
2466
2467 * new way to call genuca (makeuca.sh was deleted)
2468 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src
2469
2470 ---------------------------------------------------------------------------- ***
2471
2472 Unicode 6.1 update
2473
2474 *** ICU Trac
2475
2476 - ticket 8995 final update to Unicode 6.1
2477 - ticket 8994 regenerate source/layout/CanonData.cpp
2478
2479 - ticket 8961 support Unicode "Age" value *names*
2480 - ticket 8963 support multiple character name aliases & types
2481
2482 - ticket 8827 "update ICU to Unicode 6.1"
2483 - C++ branches/markus/uni61 at r30864 from trunk at r30843
2484 - Java branches/markus/uni61 at r30865 from trunk at r30863
2485
2486 *** Unicode version numbers
2487 - makedata.mak
2488 - uchar.h
2489   (configure.in & configure: have been modified to extract the version from uchar.h)
2490 - com.ibm.icu.util.VersionInfo
2491 - icutools/unicode/makedefs.sh
2492   + also review & update other definitions in that file,
2493     e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l
2494
2495 *** data files & enums & parser code
2496
2497 * file preparation
2498
2499 ~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed
2500 - This prepares both unidata and testdata files in respective output subfolders.
2501 - Check test file diffs for previously commented-out, known-failing data lines;
2502   probably need to keep those commented out.
2503
2504 * PropertyValueAliases.txt changes
2505 - 11 new block names:
2506   Arabic_Extended_A
2507   Arabic_Mathematical_Alphabetic_Symbols
2508   Chakma
2509   Meetei_Mayek_Extensions
2510   Meroitic_Cursive
2511   Meroitic_Hieroglyphs
2512   Miao
2513   Sharada
2514   Sora_Sompeng
2515   Sundanese_Supplement
2516   Takri
2517   -> add to uchar.h
2518   -> add to UCharacter.UnicodeBlock IDs
2519     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
2520             replace  public static final int \1_ID = \2; \3
2521   -> add to UCharacter.UnicodeBlock objects
2522     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
2523             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
2524 - 1 new Joining_Group (jg) value:
2525   Rohingya_Yeh
2526   -> uchar.h & UCharacter.JoiningGroup
2527 - 2 new Line_Break (lb) values:
2528   CJ=Conditional_Japanese_Starter
2529   HL=Hebrew_Letter
2530   -> uchar.h & UCharacter.LineBreak
2531 - 7 new scripts:
2532   sc ; Cakm      ; Chakma
2533   sc ; Merc      ; Meroitic_Cursive
2534   sc ; Mero      ; Meroitic_Hieroglyphs
2535   sc ; Plrd      ; Miao
2536   sc ; Shrd      ; Sharada
2537   sc ; Sora      ; Sora_Sompeng
2538   sc ; Takr      ; Takri
2539   -> remove these from SyntheticPropertyValueAliases.txt
2540   -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
2541       and in com.ibm.icu.dev.test.lang.TestUScript.java
2542 - 3 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
2543   (2 added 2011-06-21)
2544   Khoj        322     Khojki
2545   Tirh        326     Tirhuta
2546     and another one added 2011-12-09
2547   Hluw        080     Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)
2548   -> uscript.h
2549   -> com.ibm.icu.lang.UScript
2550     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2551     replace  public static final int \1 = \2;\3
2552   -> SyntheticPropertyValueAliases.txt
2553   -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
2554       and in com.ibm.icu.dev.test.lang.TestUScript.java
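
The find/replace patterns above can be tried outside Eclipse, for example
with Python's re.sub, which accepts the same \1-style group references.
A sketch: the input lines and their numeric values are made up for
illustration, and the helper names are hypothetical.

```python
import re


def block_id(line):
    """uchar.h UBLOCK_ enum line -> UCharacter.UnicodeBlock ID constant."""
    return re.sub(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)",
                  r"public static final int \1_ID = \2; \3", line)


def block_object(line):
    """uchar.h UBLOCK_ enum line -> UCharacter.UnicodeBlock object."""
    return re.sub(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)",
                  r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
                  line)


def script_constant(line):
    """uscript.h USCRIPT_ enum line -> com.ibm.icu.lang.UScript constant."""
    return re.sub(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)",
                  r"public static final int \1 = \2;\3", line)


# Illustrative input lines; the numbers are placeholders, not the real enum values.
c_block = "    UBLOCK_CHAKMA = 212, /*[11100]*/"
c_script = "    USCRIPT_KHOJKI  = 157,/* Khoj */"

java_id = block_id(c_block)
java_object = block_object(c_block)
java_script = script_constant(c_script)
print(java_id)      # "    public static final int CHAKMA_ID = 212; /*[11100]*/"
print(java_object)
print(java_script)  # "    public static final int KHOJKI = 157;/* Khoj */"
```

Lines that do not match a pattern pass through unchanged, so the output
still needs a manual review.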
2555
2556 * UnicodeData.txt changes
2557 - the last Unihan code point changes from U+9FCB to U+9FCC
2558   search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive)
2559   + do change gennames.c
2560   + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java
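
The case-insensitive search for 9FC[BC] mentioned above, as a small sketch.
The source lines scanned here are invented for illustration and do not come
from gennames.c or ucol.cpp.

```python
import re

# Matches both the old end (9FCB) and the new end / old limit (9FCC), any case.
CJK_END_OR_LIMIT = re.compile(r"9FC[BC]", re.IGNORECASE)


def find_cjk_bounds(lines):
    """Return (line number, stripped text) for every line mentioning 9FCB or 9FCC."""
    return [(n, line.strip())
            for n, line in enumerate(lines, 1)
            if CJK_END_OR_LIMIT.search(line)]


# Hypothetical source lines.
source = [
    "#define CJK_END 0x9fcb",
    "    if (c < 0x9FCC) {",
    "    int other = 0x9fca;",
]
matches = find_cjk_bounds(source)
print(matches)
```

Each hit still has to be checked by hand: an inclusive end moves from 9FCB
to 9FCC, while an exclusive limit moves from 9FCC to 9FCD.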
2561
2562 * DerivedBidiClass.txt changes
2563 - 2 new default-AL blocks:
2564 #     Arabic Extended-A: U+08A0  -  U+08FF  (was default-R)
2565 #     Arabic Mathematical Alphabetic Symbols:
2566 #                       U+1EE00  - U+1EEFF  (was default-R)
2567 - 2 new default-R blocks:
2568 #     Meroitic Hieroglyphs:
2569 #                        U+10980 - U+1099F
2570 #     Meroitic Cursive:  U+109A0 - U+109FF
2571   -> should be picked up by the explicit data in the file
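
For illustration only, the four default ranges listed above as a lookup
table. This sketch covers just these blocks; it is not how ICU derives
Bidi_Class, and the names are hypothetical.

```python
# (start, end, default Bidi_Class) for the blocks named above.
DEFAULT_BIDI_BLOCKS = [
    (0x08A0, 0x08FF, "AL"),    # Arabic Extended-A
    (0x1EE00, 0x1EEFF, "AL"),  # Arabic Mathematical Alphabetic Symbols
    (0x10980, 0x1099F, "R"),   # Meroitic Hieroglyphs
    (0x109A0, 0x109FF, "R"),   # Meroitic Cursive
]


def default_bidi_class(cp):
    """Return the default class for cp if it falls in one of the listed blocks, else None."""
    for start, end, bidi_class in DEFAULT_BIDI_BLOCKS:
        if start <= cp <= end:
            return bidi_class
    return None


print(default_bidi_class(0x08A0), default_bidi_class(0x109FF), default_bidi_class(0x41))
```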
2572
2573 * NameAliases.txt changes
2574 - from
2575     # Each line has two fields
2576     # First field: Code point
2577     # Second field: Alias
2578 - to
2579     # Each line has three fields, as described here:
2580     #
2581     # First field:  Code point
2582     # Second field: Alias
2583     # Third field:  Type
2584 - Also, the file format always allowed multiple aliases per code point, but only now does the file
2585   actually contain more than one, even several of the same type. For example,
2586     FEFF;BYTE ORDER MARK;alternate
2587     FEFF;BOM;abbreviation
2588     FEFF;ZWNBSP;abbreviation
2589 - This breaks our gennames parser, unames.icu data structure, and API.
2590   Fix gennames to only pick up "correction" aliases.
2591   New ticket #8963 for further changes.
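
A sketch of reading the new three-field format and keeping only the
"correction" aliases, as the gennames fix above does. This is not the
gennames code; the function names are hypothetical. The FEFF lines are the
ones quoted above, and the 01A2 line is added as a sample correction entry.

```python
def parse_name_aliases(lines):
    """Parse 'code point;alias;type' lines into (int, alias, type) tuples."""
    aliases = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        fields = data.split(";")
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields: {line!r}")
        cp, alias, kind = fields
        aliases.append((int(cp, 16), alias, kind))
    return aliases


def corrections_only(aliases):
    """Keep just the aliases of type 'correction', keyed by code point."""
    return {cp: alias for cp, alias, kind in aliases if kind == "correction"}


sample = [
    "# Each line has three fields",
    "01A2;LATIN CAPITAL LETTER GHA;correction",
    "FEFF;BYTE ORDER MARK;alternate",
    "FEFF;BOM;abbreviation",
    "FEFF;ZWNBSP;abbreviation",
]
parsed = parse_name_aliases(sample)
corrections = corrections_only(parsed)
print(corrections)  # {418: 'LATIN CAPITAL LETTER GHA'}
```

The dictionary keeps one alias per code point, which matches the old
single-alias assumption; supporting several aliases per code point is what
ticket #8963 is about.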
2592
2593 * run genpname/preparse.pl (on Linux)
2594   + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
2595   + make sure that data.h is writable
2596   + perl preparse.pl ~/svn.icu/trunk/src > out.txt
2597   + preparse.pl shows no errors, out.txt Info and Warning lines look ok
2598
2599 * build ICU (make install)
2600   so that the tools build can pick up the new definitions from the installed header files.
2601 * build Unicode tools (at least genpname) using CMake+make
2602
2603 * run genpname
2604   (builds both pnames.icu and propname_data.h)
2605 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
2606 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
2607
2608 * build ICU (make install)
2609 * build Unicode tools using CMake+make
2610
2611 * update source/data/unidata/norm2/nfkc_cf.txt
2612 - follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
2613
2614 * update source/data/unidata/norm2/uts46.txt
2615 - download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
2616   to ~/svn.icu/tools/trunk/src/unicode/py
2617 - adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".
2618 - ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
2619 - ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
2620
2621 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2622   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2623 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2624 - Unicode 6.0..6.1: U+2260, U+226E, U+226F
2625 - nothing new in 6.1, no test file to update
2626
2627 * generate core properties data files
2628 - in initial bootstrapping, change the UCA version
2629   in source/data/unidata/FractionalUCA.txt to match the new Unicode version
2630 - ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
2631 - rebuild ICU & tools
2632   + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
2633     check if the UCA version in FractionalUCA.txt matches the new Unicode version
2634     (see step above)
2635 - run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:
2636   ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
2637 - rebuild ICU & tools
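
The version check mentioned above could be automated along these lines.
This is a sketch under an assumption: it expects FractionalUCA.txt to carry
a line of the form "[UCA version = x.y.z]"; verify that against the actual
file before relying on it. The function names are hypothetical.

```python
import re

# Assumed line format: [UCA version = 6.1.0]
UCA_VERSION_RE = re.compile(r"\[UCA version\s*=\s*([0-9]+(?:\.[0-9]+)*)\]")


def uca_version(lines):
    """Return the first UCA version string found, or None."""
    for line in lines:
        m = UCA_VERSION_RE.search(line)
        if m:
            return m.group(1)
    return None


def versions_match(uca, unicode_version):
    """Compare major.minor only, so 6.1.0 matches 6.1."""
    return uca.split(".")[:2] == unicode_version.split(".")[:2]


# Illustrative file header (assumed layout).
header = ["# Fractional UCA Table", "[UCA version = 6.1.0]"]
found = uca_version(header)
print(found, versions_match(found, "6.1"))  # 6.1.0 True
```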
2638
2639 * update Java data files
2640 - refresh just the UCD-related files, just to be safe
2641 - see (ICU4C)/source/data/icu4j-readme.txt
2642 - mkdir /tmp/icu4j
2643 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2644   output:
2645     ...
2646     Unicode .icu files built to ./out/build/icudt49l
2647     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b
2648     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b
2649     echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
2650     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b
2651     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b"
2652     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/
2653     mkdir -p /tmp/icu4j/main/shared/data
2654     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2655     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/
2656     mkdir -p /tmp/icu4j/main/shared/data
2657     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
2658     make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data'
2659 - copy the big-endian Unicode data files to another location,
2660   separate from the other data files
2661     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
2662     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
2663     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
2664     ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu
2665     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
2666     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
2667     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
2668 - refresh ICU4J
2669     ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
2670
2671 * refresh Java test .txt files
2672 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2673
2674 * test ICU so far, fix test code where necessary
2675 - temporarily ignore collation issues that look like UCA/UCD mismatches,
2676   until UCA data is updated
2677
2678 * UCA
2679
2680 - get output from Mark's tools; look in
2681     http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. version>.txt
2682 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2683 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2684   (note removing the underscore before "Rules")
2685 - update (ICU)/source/test/testdata/CollationTest_*.txt
2686   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
2687   with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
2688 - check test file diffs for previously commented-out, known-failing data lines;
2689   probably need to keep those commented out
2690 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
2691 - run makeuca.sh:
2692   ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
2693 - rebuild ICU4C
2694 - refresh ICU4J collation data:
2695   (subset of instructions above for properties data refresh, except copies all coll/*)
2696     ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2697     ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
2698     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
2699     ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
2700 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
2701 - note on intltest: if collate/UCAConformanceTest fails, then
2702   utility/MultithreadTest/TestCollators will fail as well;
2703   fix the conformance test before looking into the multi-thread test
2704
2705 * When refreshing all of ICU4J data from ICU4C
2706 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2707 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
2708 or
2709 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
2710
2711 *** LayoutEngine script information
2712
2713 (For details see the Unicode 5.2 change log below.)
2714
2715 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
2716   This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
2717   in the working directory.
2718   (It also generates ScriptRunData.cpp, which is no longer needed.)
2719
2720   The generated files have a current copyright date and "@draft" statement.
2721
2722 - diff current <icu>/source/layout files vs. generated ones
2723     ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
2724   review and manually merge desired changes;
2725   fix gratuitous changes, incorrect @draft and missing aliases;
2726   Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
2727 - if you just copy the above files, then
2728   fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
2729   manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
2730
2731 *** merge the Unicode update branches back onto the trunk
2732 - do not merge the icudata.jar and testdata.jar,
2733   instead rebuild them from merged & tested ICU4C
2734
2735 ---------------------------------------------------------------------------- ***
2736
2737 ICU 4.8 (no Unicode update, just new script codes)
2738
2739 * 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
2740   (added 2010-12-21)
2741     Afak    439     Afaka
2742     Jurc    510     Jurchen
2743     Mroo    199     Mro, Mru
2744     Nshu    499     Nüshu
2745     Shrd    319     Sharada, Śāradā
2746     Sora    398     Sora Sompeng
2747     Takr    321     Takri, Ṭākrī, Ṭāṅkrī
2748     Tang    520     Tangut
2749     Wole    480     Woleai
2750   -> uscript.h
2751   -> com.ibm.icu.lang.UScript
2752     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2753     replace  public static final int \1 = \2;\3
2754   -> genpname/SyntheticPropertyValueAliases.txt
2755   -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
2756       and in com.ibm.icu.dev.test.lang.TestUScript.java
2757
2758 * run genpname/preparse.pl (on Linux)
2759   + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
2760   + make sure that data.h is writable
2761   + perl preparse.pl ~/svn.icu/trunk/src > out.txt
2762   + preparse.pl shows no errors, out.txt Info and Warning lines look ok
2763
2764 * rebuild Unicode tools (at least genpname) using make
2765 - You might first need to "make install" ICU so that the tools build can pick
2766   up the new definitions from the installed header files.
2767
2768 * run genpname
2769   (builds both pnames.icu and propname_data.h)
2770 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
2771 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
2772 - rebuild ICU & tools
2773
2774 * run genprops
2775 - ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
2776 - ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
2777 - rebuild ICU & tools
2778
2779 * update Java data files
2780 - refresh just the UCD-related files, just to be safe
2781 - see (ICU4C)/source/data/icu4j-readme.txt
2782 - mkdir /tmp/icu4j
2783 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2784 - copy the big-endian Unicode data files to another location,
2785   separate from the other data files
2786     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
2787     ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
2788     ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
2789 - refresh ICU4J
2790     ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b
2791
2792 * should have updated the layout engine script codes but forgot
2793
2794 ---------------------------------------------------------------------------- ***
2795
2796 Unicode 6.0 update
2797
2798 *** related ICU Trac tickets
2799
2800 7264 Unicode 6.0 Update
2801
2802 *** Unicode version numbers
2803 - makedata.mak
2804 - uchar.h
2805   (configure.in & configure: have been modified to extract the version from uchar.h)
2806 - com.ibm.icu.util.VersionInfo
2807
2808 *** data files & enums & parser code
2809
2810 * file preparation
2811
2812 ~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
2813 - This now prepares both unidata and testdata files in respective output subfolders.
2814
2815 * PropertyAliases.txt changes
2816 - new Script_Extensions property defined in the new ScriptExtensions.txt file
2817   but not listed in PropertyAliases.txt; reported to unicode.org;
2818   -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
2819     scx; Script_Extensions
2820   -> uchar.h with new UProperty section
2821   -> com.ibm.icu.lang.UProperty, parallel with uchar.h
2822
2823 * PropertyValueAliases.txt changes
2824 - 12 new block names:
2825   Alchemical_Symbols
2826   Bamum_Supplement
2827   Batak
2828   Brahmi
2829   CJK_Unified_Ideographs_Extension_D
2830   Emoticons
2831   Ethiopic_Extended_A
2832   Kana_Supplement
2833   Mandaic
2834   Miscellaneous_Symbols_And_Pictographs
2835   Playing_Cards
2836   Transport_And_Map_Symbols
2837   -> add to uchar.h
2838   -> add to UCharacter.UnicodeBlock
2839     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
2840             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
2841 - Joining_Group (jg) values:
2842   Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias
2843   -> uchar.h & UCharacter.JoiningGroup
2844 - 3 new scripts:
2845   sc ; Batk      ; Batak
2846   sc ; Brah      ; Brahmi
2847   sc ; Mand      ; Mandaic
2848   -> remove these from SyntheticPropertyValueAliases.txt
2849   -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
2850   -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
2851       and in com.ibm.icu.dev.test.lang.TestUScript.java
2852 - 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
2853   (added 2009-11-11..2010-07-18)
2854   Bass        259     Bassa Vah
2855   Dupl        755     Duployan shorthand
2856   Elba        226     Elbasan
2857   Gran        343     Grantha
2858   Kpel        436     Kpelle
2859   Loma        437     Loma
2860   Mend        438     Mende
2861   Merc        101     Meroitic Cursive
2862   Narb        106     Old North Arabian
2863   Nbat        159     Nabataean
2864   Palm        126     Palmyrene
2865   Sind        318     Sindhi
2866   Wara        262     Warang Citi
2867   -> uscript.h
2868   -> com.ibm.icu.lang.UScript
2869     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
2870     replace  public static final int \1 = \2;\3
2871   -> SyntheticPropertyValueAliases.txt
2872   -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
2873       and in com.ibm.icu.dev.test.lang.TestUScript.java
2874 - ISO 15924 name change
2875   Mero        100     Meroitic Hieroglyphs (was Meroitic)
2876   -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
2877 - property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt
2878
2879 * UnicodeData.txt changes
2880 - new CJK block:
2881   2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
2882   2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
2883   -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion
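
How a First/Last pair such as the two lines above turns into a code point
range, as a sketch. This is not the gennames.c logic; the function name is
hypothetical, and the input is the pair of UnicodeData.txt lines quoted above.

```python
def parse_ranges(lines):
    """Collect (first, last, label) from '<label, First>' / '<label, Last>' line pairs."""
    ranges = []
    pending = None  # (code point, label) of an open First line
    for line in lines:
        fields = line.split(";")
        cp, name = int(fields[0], 16), fields[1]
        if name.endswith(", First>"):
            pending = (cp, name[1:-len(", First>")])
        elif name.endswith(", Last>"):
            label = name[1:-len(", Last>")]
            if pending is None or pending[1] != label:
                raise ValueError(f"Last without matching First: {line!r}")
            ranges.append((pending[0], cp, label))
            pending = None
    return ranges


lines = [
    "2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;",
    "2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;",
]
ranges = parse_ranges(lines)
for first, last, label in ranges:
    print(f"{first:04X}..{last:04X} {label}")  # 2B740..2B81D CJK Ideograph Extension D
```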
2884
2885 * build Unicode tools using CMake+make
2886
2887 * run genpname/preparse.pl (on Linux)
2888   + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
2889   + make sure that data.h is writable
2890   + perl preparse.pl ~/svn.icu/trunk/src > out.txt
2891   + preparse.pl shows no errors, out.txt Info and Warning lines look ok
2892
2893 * rebuild Unicode tools (at least genpname) using make
2894 - You might first need to "make install" ICU so that the tools build can pick
2895   up the new definitions from the installed header files.
2896
2897 * run genpname
2898 - ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
2899 - rebuild ICU & tools
2900
2901 * update source/data/unidata/norm2/nfkc_cf.txt
2902 - follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
2903
2904 * update source/data/unidata/norm2/uts46.txt
2905 - download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
2906   to ~/svn.icu/tools/trunk/src/unicode/py
2907 - adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
2908 - ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
2909 - ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
2910
2911 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
2912   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
2913 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
2914 - Unicode 6.0: U+2260, U+226E, U+226F
2915
2916 * generate core properties data files
2917 - ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
2918 - rebuild ICU & tools
2919 - run makeuca.sh so that genuca picks up the new nfc.nrm:
2920   ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
2921 - rebuild ICU & tools
2922
2923 * implement new Script_Extensions property (provisional)
2924 - parser & generator: genprops & uprops.icu
2925 - uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
2926 - UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java
2927
2928 * switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
2929 - (one-time change)
2930 - genbidi/gencase/genprops tools changes
2931 - re-run makeprops.sh (see above)
2932 - UCharacterProperty.java, UCharacterTypeIterator.java,
2933   UBiDiProps.java, UCaseProps.java, and several others with minor changes;
2934   UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java
2935
2936 * update Java data files
2937 - refresh just the UCD-related files, just to be safe
2938 - see (ICU4C)/source/data/icu4j-readme.txt
2939 - mkdir /tmp/icu4j
2940 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2941   output:
2942     ...
2943     Unicode .icu files built to ./out/build/icudt45l
2944     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
2945     echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
2946     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
2947     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
2948     mkdir -p /tmp/icu4j/main/shared/data
2949     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
2950 - copy the big-endian Unicode data files to another location,
2951   separate from the other data files
2952     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
2953     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
2954     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
2955     ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
2956     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
2957     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
2958     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
2959 - refresh ICU4J
2960     ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
2961
2962 * refresh Java test .txt files
2963 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
2964
2965 * un-hardcode normalization skippable (NF*_Inert) test data
2966 - removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools
2967
2968 * copy updated break iterator test files
2969 - now handled by early ucdcopy.py and
2970   copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
2971   (old instructions:
2972    copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
2973    to ~/svn.icu/trunk/src/source/test/testdata)
2974 - they are not used in ICU4J
2975
2976 * UCA
2977
2978 - get output from Mark's tools; look in
2979     http://www.unicode.org/~book/incoming/mark/uca6.0.0/
2980     http://www.macchiato.com/unicode/utc/additional-uca-files
2981     http://www.unicode.org/Public/UCA/6.0.0/
2982     http://www.unicode.org/~mdavis/uca/
2983 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
2984 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
2985 - update Han-implicit ranges for new CJK extensions:
2986   swapCJK() in ucol.cpp & ImplicitCEGenerator.java
2987 - genuca: allow bytes 02 for U+FFFE, new merge-sort character;
2988   do not add it into invuca so that tailoring primary-after an ignorable works
2989 - genuca: permit space between [variable top] bytes
2990 - ucol.cpp: treat noncharacters like unassigned rather than ignorable
2991 - run makeuca.sh:
2992   ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
2993 - rebuild ICU4C
2994 - refresh ICU4J collation data:
2995   (subset of instructions above for properties data refresh, except copies all coll/*)
2996     ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
2997     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
2998     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
2999     ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
3000 - update (ICU)/source/test/testdata/CollationTest_*.txt
3001   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
3002   with output from Mark's Unicode tools
3003 - run all tests with the *_SHORT.txt or the full files (the full ones have comments)
3004 - note on intltest: if collate/UCAConformanceTest fails, then
3005   utility/MultithreadTest/TestCollators will fail as well;
3006   fix the conformance test before looking into the multi-thread test
3007
3008 * When refreshing all of ICU4J data from ICU4C
3009 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
3010 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
3011 or
3012 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
3013
3014 *** LayoutEngine script information
3015
3016 (For details see the Unicode 5.2 change log below.)
3017
3018 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
3019 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
3020 ScriptRunData.cpp, which is no longer needed.)
3021
3022 The generated files have a current copyright date and "@draft" statement.
3023
3024 * copy the above files into <icu>/source/layout, replacing the old files.
3025 * fix mixed line endings
3026 * review the diffs and fix incorrect @draft and missing aliases;
3027   Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
3028 * manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
3029
3030 ---------------------------------------------------------------------------- ***
3031
3032 Unicode 5.2 update
3033
3034 *** related ICU Trac tickets
3035
3036 7084 Unicode 5.2
3037
3038 7167 verify collation bytes
3039 7235 Java test NAME_ALIAS
3040 7236 Java DerivedCoreProperties.txt test
3041 7237 Java BidiTest.txt
3042 7238 UTrie2 in core unidata
3043 7239 test for tailoring gaps
3044 7240 Java fix CollationMiscTest
3045 7243 update layout engine for Unicode 5.2
3046
3047 *** Unicode version numbers
3048 - makedata.mak
3049 - uchar.h
3050 - configure.in & configure
3051 - update ucdVersion in gennames.c if an algorithmic range changes
3052
3053 *** data files & enums & parser code
3054
3055 * file preparation
3056
3057 python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
3058 - includes finding files regardless of version numbers,
3059   copying them, and performing the equivalent processing of the
3060   ucdstrip and ucdmerge tools on the desired set of files
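To illustrate "finding files regardless of version numbers": a minimal sketch of stripping a version suffix from a UCD file name. It is not taken from ucdcopy.py; the suffix shapes ("-5.2.0" and "-5.2.0d5") and the helper name unversioned_name are assumptions.

```python
import re

# Assumption: versioned names look like "Name-5.2.0.txt" or "Name-5.2.0d5.txt".
_VERSION_SUFFIX = re.compile(r"-\d+(?:\.\d+)*(?:d\d+)?(?=\.txt$)")


def unversioned_name(file_name):
    """Return the file name without a trailing version suffix, if any."""
    return _VERSION_SUFFIX.sub("", file_name)
```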
3061
3062 * notes on changes
3063 - PropertyAliases.txt
3064   moved from numeric to enumerated:
3065     ccc       ; Canonical_Combining_Class
3066   new string properties:
3067     NFKC_CF   ; NFKC_Casefold
3068     Name_Alias; Name_Alias
3069   new binary properties:
3070     Cased     ; Cased
3071     CI        ; Case_Ignorable
3072     CWCF      ; Changes_When_Casefolded
3073     CWCM      ; Changes_When_Casemapped
3074     CWKCF     ; Changes_When_NFKC_Casefolded
3075     CWL       ; Changes_When_Lowercased
3076     CWT       ; Changes_When_Titlecased
3077     CWU       ; Changes_When_Uppercased
3078   new CJK Unihan properties (not supported by ICU)
3079 - PropertyValueAliases.txt
3080   new block names
3081   new scripts
3082   one script code change:
3083     sc ; Qaai      ; Inherited
3084     ->
3085     sc ; Zinh      ; Inherited                        ; Qaai
3086   new Line_Break (lb) value:
3087     lb ; CP        ; Close_Parenthesis
3088   new Joining_Group (jg) values: Farsi_Yeh, Nya
3089   other new values:
3090     ccc; 214; ATA  ; Attached_Above
3091 - DerivedBidiClass.txt
3092   new default-R range: U+1E800 - U+1EFFF
3093 - UnicodeData.txt
3094   all of the ISO comments are gone
3095   new CJK block end:
3096     9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
3097   new CJK block:
3098     2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
3099     2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;
3100
3101 * genpname
3102 - run preparse.pl
3103   + cd \svn\icuproj\icu\trunk\source\tools\genpname
3104   + make sure that data.h is writable
3105   + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
3106   + preparse.pl complains with errors like the following:
3107       Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
3108     This is because ICU 4.0 had scripts from ISO 15924 which are now
3109     added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
3110     and PropertyValueAliases.txt.
3111     -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
3112        Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
3113   + preparse.pl complains with errors about block names missing from uchar.h; add them
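The duplicate script entries that trigger the preparse.pl conflict could be found ahead of time with a sketch like the one below. It assumes that both files use "sc ; Code ; Long_Name" lines; the function names are hypothetical.

```python
def script_codes(lines):
    """Collect the short script codes from 'sc ; Code ; Long_Name' lines."""
    codes = set()
    for line in lines:
        line = line.split("#", 1)[0]  # drop trailing comments
        fields = [f.strip() for f in line.split(";")]
        if len(fields) >= 3 and fields[0] == "sc":
            codes.add(fields[1])
    return codes


def duplicate_script_codes(synthetic_lines, official_lines):
    """Return script codes defined in both files, sorted."""
    return sorted(script_codes(synthetic_lines) & script_codes(official_lines))
```

Each code returned is a candidate for removal from SyntheticPropertyValueAliases.txt.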
3114
3115 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
3116 - new block & script values
3117   + 26 new blocks
3118     copy new blocks from Blocks.txt
3119     MS VC++ 2008 regular expression:
3120       find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
3121       replace with "    UBLOCK_\3 = 172, /*[\1]*/"
3122   + several new script values already added in ICU 4.0 for ISO 15924 coverage
3123     (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
3124   + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
3125   + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
3126     (added to SyntheticPropertyValueAliases.txt)
3127 - new Joining Group (JG) values: Farsi_Yeh, Nya
3128 - new Line_Break (lb) value:
3129     lb ; CP        ; Close_Parenthesis
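The Blocks.txt regex replacement above can also be sketched in Python. Unlike the Visual C++ replacement, which writes the fixed value 172 and leaves the names to be edited by hand, this sketch uppercases the block name and numbers the constants sequentially; both of those are assumptions about what the manual cleanup does, and block_enum_lines is a hypothetical name.

```python
import re

_BLOCK_LINE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+); ([A-Z].+)$")


def block_enum_lines(blocks_txt_lines, first_value):
    """Turn 'start..end; Block Name' lines into UBLOCK_ enum lines."""
    out = []
    value = first_value
    for line in blocks_txt_lines:
        m = _BLOCK_LINE.match(line.strip())
        if not m:
            continue  # skip comments and blank lines
        start, _end, name = m.groups()
        # Uppercase; spaces and hyphens become underscores.
        constant = re.sub(r"[^A-Z0-9]+", "_", name.upper()).strip("_")
        out.append(f"    UBLOCK_{constant} = {value}, /*[{start}]*/")
        value += 1
    return out
```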
3130
3131 * hardcoded Unihan range end/limit
3132 - Unihan range end moves from 9FC3 to 9FCB
3133   search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
3134   + do change gennames.c
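The search for the old range end and limit can be sketched as a small helper. It derives the limit as end+1 and matches either value case-insensitively, which for 9FC3 is equivalent to the regex 9FC[34]; the name find_range_hits is hypothetical.

```python
import re


def find_range_hits(text, end_hex):
    """Return (line number, line) pairs mentioning the range end or limit."""
    limit_hex = format(int(end_hex, 16) + 1, "X")
    pattern = re.compile(f"{end_hex}|{limit_hex}", re.IGNORECASE)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), 1)
        if pattern.search(line)
    ]
```

Each hit still needs manual review, since some occurrences must stay unchanged.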
3135
3136 * Compare definitions of new binary properties with what we used to use
3137   in algorithms, to see if the definitions changed.
3138 - Verified that definitions for Cased and Case_Ignorable are unchanged.
3139   The gencase tool now parses the newly public Case_Ignorable values
3140   in case the definition changes in the future.
3141
3142 * uchar.c & uprops.h & uprops.c & genprops
3143 - new numeric values that didn't exist in Unicode data before:
3144     1/7, 1/9, 1/10, 3/10, 1/16, 3/16
3145   the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
3146   therefore redesign the encoding of numeric types and values for formatVersion 6;
3147   design for simple numbers up to at least 144 ("one gross"),
3148   large values up to at least 10^20,
3149   and fractions with numerators -1..17 and denominators 1..16
3150   to cover current and expected future values
3151   (e.g., more Han numeric values, Meroitic twelfths)
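As a quick sanity check of the design ranges above (not of the actual formatVersion 6 bit layout, which is not shown here), the new fractions can be tested against the stated numerator and denominator limits. The function name is hypothetical.

```python
def fits_format6_fraction(numerator, denominator):
    """Check a fraction against the design ranges: numerators -1..17, denominators 1..16."""
    return -1 <= numerator <= 17 and 1 <= denominator <= 16


# The new numeric values listed above, as (numerator, denominator) pairs.
new_fractions = [(1, 7), (1, 9), (1, 10), (3, 10), (1, 16), (3, 16)]

# The ones that formatVersion 5 could not support (denominator > 9).
unsupported_in_v5 = [(n, d) for n, d in new_fractions if d > 9]
```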
3152
3153 * reimplement Hangul_Syllable_Type for new Jamo characters
3154 - the old code assumed that all Jamo characters are in the 11xx block
3155 - Unicode 5.2 fills holes there and adds new Jamo characters in
3156     A960..A97F; Hangul Jamo Extended-A
3157   and in
3158     D7B0..D7FF; Hangul Jamo Extended-B
3159 - Hangul_Syllable_Type can be trivially derived from a subset of
3160   Grapheme_Cluster_Break values
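The derivation can be sketched as a lookup on short property value names: the five Hangul-related Grapheme_Cluster_Break values map to the same-named Hangul_Syllable_Type values, and everything else is NA. This is an illustration, not the ICU implementation; the function name is hypothetical.

```python
# GCB short value names that correspond one-to-one to Hangul_Syllable_Type values.
_GCB_TO_HST = {"L": "L", "V": "V", "T": "T", "LV": "LV", "LVT": "LVT"}


def hangul_syllable_type(gcb_short_name):
    """Derive the Hangul_Syllable_Type short name from a GCB short name."""
    return _GCB_TO_HST.get(gcb_short_name, "NA")
```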
3161
3162 * build Unicode data source code for hardcoding core data
3163 C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data
3164
3165 ICU data make path is \svn\icuproj\icu\trunk\source\data\
3166 ICU root path is \svn\icuproj\icu\trunk
3167 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
3168 Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
3169 Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
3170 Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
3171 Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
3172 Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
3173 Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
3174 Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
3175 Creating data file for Unicode Property Names
3176 Creating data file for Unicode Character Properties
3177 Creating data file for Unicode Case Mapping Properties
3178 Creating data file for Unicode BiDi/Shaping Properties
3179 Creating data file for Unicode Normalization
3180 Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
3181 Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"
3182
3183 - copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
3184   and rebuild the common library
3185
3186 *** UCA
3187
3188 - update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
3189 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
3190 - update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
3191 [ Begin obsolete instructions:
3192   Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files, not the *_STUB.txt files.
3193     - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
3194       on Windows:
3195         python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
3196         python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
3197   End obsolete instructions]
3198 - run all tests with the *_SHORT.txt or the full files (the full ones have comments)
3199   not just the *_STUB.txt files
3200 - note on intltest: if collate/UCAConformanceTest fails, then
3201   utility/MultithreadTest/TestCollators will fail as well;
3202   fix the conformance test before looking into the multi-thread test
3203
3204 *** Implement Cased & Case_Ignorable properties
3205 - via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
3206 - Problem: These properties should be disjoint, but aren't
3207 - UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
3208 - change ucase.icu to be able to store any combination of Cased and Case_Ignorable
3209
3210 *** Implement Changes_When_Xyz properties
3211 - without stored data
3212
3213 *** Implement Name_Alias property
3214 - add it as another name field in unames.icu
3215 - make it available via u_charName() and UCharNameChoice
3216 - consider it in u_charFromName()
3217
3218 *** Break iterators
3219
3220 * Update break iterator rules to new UAX versions and new property values
3221 * Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary
3222
3223 *** new BidiTest file
3224 - review format and data
3225 - copy BidiTest.txt to source/test/testdata
3226 - write test code using this data
3227 - fix ICU code where it fails the conformance test
3228
3229 *** Java
3230 - generally, find and update code corresponding to C/C++
3231 - UCharacter.UnicodeBlock constants:
3232   a) add an _ID integer per new block, update COUNT
3233   b) add a class instance per new block
3234      Visual Studio regex:
3235         find            UBLOCK_{[^ ]+} = [0-9]+, {/.+}
3236         replace with    public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
3237 - CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()
3238
3239 - port test changes to Java
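The Visual Studio find/replace for UnicodeBlock instances translates directly to a Python regex, shown here as a sketch for anyone without that editor. The input is assumed to be one C enum line per block in the uchar.h style; java_block_declaration is a hypothetical name.

```python
import re

_C_ENUM = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
_JAVA_DECL = r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2'


def java_block_declaration(c_enum_line):
    """Rewrite a C 'UBLOCK_X = n, /*...*/' line as a Java UnicodeBlock declaration."""
    return _C_ENUM.sub(_JAVA_DECL, c_enum_line)
```

Leading indentation and the trailing comment are carried over unchanged.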
3240
3241 *** LayoutEngine script information
3242
3243 (For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)
3244
3245 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
3246 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
3247 ScriptRunData.cpp, which is no longer needed.)
3248
3249 The generated files have a current copyright date and "@draft" statement.
3250
3251 -> Eric Mader wrote in email on 20090930:
3252     "I think the tool has been modified to update @draft to @stable for
3253      older scripts and to add @draft for new scripts.
3254      (I worked with an intern on this last year.)
3255      You should check the output after you run it."
3256
3257 * copy the above files into <icu>/source/layout, replacing the old files.
3258 * fix mixed line endings
3259 * review the diffs and fix incorrect @draft and missing aliases
3260 * manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
3261
3262 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
3263 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
3264
3265 -> Eric Mader wrote in email on 20090930:
3266     "This is just a matter of making sure that all the per-script tables have
3267      entries for any new scripts that were added.
3268      If any new Indic characters were added, then the class tables in
3269      IndicClassTables.cpp should be updated to reflect this.
3270      John Emmons should know how to do this if it's required."
3271
3272 * rebuild the layout and layoutex libraries.
3273
3274 *** Documentation
3275 - Update User Guide
3276   + Jamo_Short_Name, sfc->scf, binary property value aliases
3277
3278 ---------------------------------------------------------------------------- ***
3279
3280 Unicode 5.1 update
3281
3282 *** related ICU Trac tickets
3283
3284 5696 Update to Unicode 5.1
3285
3286 *** Unicode version numbers
3287 - makedata.mak
3288 - uchar.h
3289 - configure.in & configure
3290 - update ucdVersion in gennames.c if an algorithmic range changes
3291
3292 *** data files & enums & parser code
3293
3294 * file preparation
3295 - ucdstrip:
3296     DerivedCoreProperties.txt
3297     DerivedNormalizationProps.txt
3298     NormalizationTest.txt
3299     PropList.txt
3300     Scripts.txt
3301     GraphemeBreakProperty.txt
3302     SentenceBreakProperty.txt
3303     WordBreakProperty.txt
3304 - ucdstrip and ucdmerge:
3305     EastAsianWidth.txt
3306     LineBreak.txt
3307
3308 * my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
3309 copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
3310 copy 5.1.0\ucd\Blocks.txt ..\unidata\
3311 copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
3312 copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
3313 copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
3314 copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
3315 copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
3316 copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
3317 copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
3318 copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
3319 copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
3320 copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
3321 copy 5.1.0\ucd\UnicodeData.txt ..\unidata\
3322
3323 ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
3324 ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
3325 ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
3326 ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
3327 ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
3328 ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
3329 ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
3330 ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
3331 ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
3332 ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
3333
3334 * genpname
3335 - run preparse.pl
3336   + cd \svn\icuproj\icu\uni51\source\tools\genpname
3337   + make sure that data.h is writable
3338   + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
3339   + preparse.pl complains with errors like the following:
3340       Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
3341     This is because ICU 3.8 had scripts from ISO 15924 which are now
3342     added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt
3343     and PropertyValueAliases.txt.
3344     -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
3345        Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
3346   + PropertyValueAliases.txt now explicitly contains values for boolean properties:
3347       N/Y, No/Yes, F/T, False/True
3348     -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
3349        It will use further values from the file if present.
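For illustration, the boolean value aliases listed above can be resolved with a small lookup. This sketch only ignores case and surrounding whitespace, which is an assumption and simpler than full loose matching; parse_binary_value is a hypothetical name, not an ICU or preparse.pl function.

```python
_TRUE_ALIASES = {"y", "yes", "t", "true"}
_FALSE_ALIASES = {"n", "no", "f", "false"}


def parse_binary_value(alias):
    """Map a binary property value alias to a bool, or raise ValueError."""
    key = alias.strip().lower()
    if key in _TRUE_ALIASES:
        return True
    if key in _FALSE_ALIASES:
        return False
    raise ValueError(f"not a binary property value alias: {alias!r}")
```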
3350
3351 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
3352 - new block & script values
3353   + 17 new blocks
3354   + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
3355     (removed from SyntheticPropertyValueAliases.txt)
3356   + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
3357     (added to SyntheticPropertyValueAliases.txt)
3358 - uprops.icu (uprops.h) only provides 7 bits for script codes.
3359   In ICU 4.0 there are now USCRIPT_CODE_LIMIT=130 script codes.
3360   None of the codes above 127 is yet the script code of an
3361   assigned Unicode character, so ICU 4.0 uprops.icu does not store any
3362   script code values greater than 127.
3363   However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
3364   in a parallel bit field, and that overflows now.
3365   Also, future values >=128 would be incompatible anyway.
3366   uprops.h is modified to move around several of the bit fields
3367   in the properties vector words, and now uses 8 bits for the script code.
3368   Two other bit fields also grow to accommodate future growth:
3369   Block (current count: 172) grows from 8 to 9 bits,
3370   and Word_Break grows from 4 to 5 bits.
3371 - renamed property Simple_Case_Folding (sfc->scf)
3372   + nothing to be done: handled as normal alias
3373 - new property JSN Jamo_Short_Name
3374   + no new API: only contributes to the Name property
3375 - new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
3376 - new Joining Group (JG) value: Burushashki_Yeh_Barree
3377 - new Sentence_Break (SB) values:
3378     SB ; CR        ; CR
3379     SB ; EX        ; Extend
3380     SB ; LF        ; LF
3381     SB ; SC        ; SContinue
3382 - new Word_Break (WB) values:
3383     WB ; CR        ; CR
3384     WB ; Extend    ; Extend
3385     WB ; LF        ; LF
3386     WB ; MB        ; MidNumLet
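The bit-field overflow described above for the script code in uprops.h comes down to simple arithmetic, sketched here. The helper fits_in_bits is hypothetical; the numbers are the ones quoted in the note.

```python
def fits_in_bits(max_value, bits):
    """True if an unsigned field of the given width can hold max_value."""
    return 0 <= max_value < (1 << bits)


USCRIPT_CODE_LIMIT = 130                 # from the note above
max_script_value = USCRIPT_CODE_LIMIT - 1  # 129

overflows_7_bits = not fits_in_bits(max_script_value, 7)  # 7 bits hold 0..127
fits_8_bits = fits_in_bits(max_script_value, 8)           # 8 bits hold 0..255
```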
3387
3388 * Further changes in the 2008-02-29 update:
3389 - Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
3390   because they should not normally be invisible.
3391 - new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
3392 - new Grapheme_Cluster_Break (GCB) value: PP=Prepend
3393 - new Word_Break (WB) value: NL=Newline
3394
3395 * hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
3396 - Unihan range end moves from 9FBB to 9FC3
3397   search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
3398   + do change gennames.c
3399
3400 * build Unicode data source code for hardcoding core data
3401 C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data
3402
3403 ICU data make path is \svn\icuproj\icu\uni51\source\data\
3404 ICU root path is \svn\icuproj\icu\uni51
3405 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
3406 Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
3407 Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
3408 Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
3409 Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
3410 Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
3411 Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
3412 Creating data file for Unicode Character Properties
3413 Creating data file for Unicode Case Mapping Properties
3414 Creating data file for Unicode BiDi/Shaping Properties
3415 Creating data file for Unicode Normalization
3416 Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
3417 Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"
3418
3419 - copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
3420   and rebuild the common library
3421
3422 *** Break iterators
3423
3424 * Update break iterator rules to new UAX versions and new property values
3425
3426 *** UCA
3427
3428 * update FractionalUCA.txt and UCARules.txt with new canonical closure
3429
3430 *** Test suites
3431 - Test that APIs using Unicode property value aliases (like UnicodeSet)
3432   support all of the boolean values N/Y, No/Yes, F/T, False/True
3433   -> TestBinaryValues() tests in both cintltst and intltest
3434
3435 *** LayoutEngine script information
3436 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
3437 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
3438 ScriptRunData.cpp, which is no longer needed.)
3439
3440 The generated files have a current copyright date and "@draft" statement.
3441
3442 * copy the above files into <icu>/source/layout, replacing the old files.
3443
3444 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
3445 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
3446
3447 * rebuild the layout and layoutex libraries.
3448
3449 *** Documentation
3450 - Update User Guide
3451   + Jamo_Short_Name, sfc->scf, binary property value aliases
3452
3453 ---------------------------------------------------------------------------- ***
3454
3455 Unicode 5.0 update
3456
3457 *** related Jitterbugs
3458
3459 5084 RFE: Update to Unicode 5.0
3460
3461 *** data files & enums & parser code
3462
3463 * file preparation
3464 - ucdstrip:
3465     DerivedCoreProperties.txt
3466     DerivedNormalizationProps.txt
3467     NormalizationTest.txt
3468     PropList.txt
3469     Scripts.txt
3470     GraphemeBreakProperty.txt
3471     SentenceBreakProperty.txt
3472     WordBreakProperty.txt
3473 - ucdstrip and ucdmerge:
3474     EastAsianWidth.txt
3475     LineBreak.txt
3476
3477 * my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
3478 copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
3479 copy 5.0.0\ucd\Blocks.txt ..\unidata\
3480 copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
3481 copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
3482 copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
3483 copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
3484 copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
3485 copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
3486 copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
3487 copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
3488 copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
3489 copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
3490 copy 5.0.0\ucd\UnicodeData.txt ..\unidata\
3491
3492 ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
3493 ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
3494 ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
3495 ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
3496 ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
3497 ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
3498 ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
3499 ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
3500 ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
3501 ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
3502
3503 * update FractionalUCA.txt and UCARules.txt with new canonical closure
3504
3505 * genpname
3506 - run preparse.pl
3507   + make sure that data.h is writable
3508   + perl preparse.pl \cvs\oss\icu > out.txt
3509
3510 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
3511 - new block & script values
3512   + script values already added in ICU 3.6 because all of ISO 15924 is now covered
3513
3514 * build Unicode data source code for hardcoding core data
3515 C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data
3516
3517 ICU data make path is \cvs\oss\icu\source\data\
3518 ICU root path is \cvs\oss\icu
3519 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
3520 [etc.]
3521 Creating data file for Unicode Character Properties
3522 Creating data file for Unicode Case Mapping Properties
3523 Creating data file for Unicode BiDi/Shaping Properties
3524 Creating data file for Unicode Normalization
3525 Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
3526 Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"
3527
3528 - copy the .c source files to C:\cvs\oss\icu\source\common
3529   and rebuild the common library
3530
3531 *** Unicode version numbers
3532 - makedata.mak
3533 - uchar.h
3534 - configure.in
3535
3536 *** LayoutEngine script information
3537 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
3538 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
3539 ScriptRunData.cpp, which is no longer needed.)
3540
3541 The generated files have a current copyright date and "@draft" statement.
3542
3543 * copy the above files into <icu>/source/layout, replacing the old files.
3544
3545 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
3546 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
3547
3548 * rebuild the layout and layoutex libraries.
3549
3550 ---------------------------------------------------------------------------- ***
3551
3552 Unicode 4.1 update
3553
3554 *** related Jitterbugs
3555
3556 4332 RFE: Update to Unicode 4.1
3557 4157 RBBI, TR29 4.1 updates
3558
3559 *** data files & enums & parser code
3560
3561 * file preparation
3562 - ucdstrip:
3563     DerivedCoreProperties.txt
3564     DerivedNormalizationProps.txt
3565     NormalizationTest.txt
3566     GraphemeBreakProperty.txt
3567     SentenceBreakProperty.txt
3568     WordBreakProperty.txt
3569 - ucdstrip and ucdmerge:
3570     EastAsianWidth.txt
3571     LineBreak.txt
3572
3573 * add new files to the repository
3574     GraphemeBreakProperty.txt
3575     SentenceBreakProperty.txt
3576     WordBreakProperty.txt
3577
3578 * update FractionalUCA.txt and UCARules.txt with new canonical closure
3579
3580 * genpname
3581 - handle new enumerated properties in sub read_uchar
3582 - run preparse.pl
3583
3584 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
3585 - new binary properties
3586   + Pattern_Syntax
3587   + Pattern_White_Space
3588 - new enumerated properties
3589   + Grapheme_Cluster_Break
3590   + Sentence_Break
3591   + Word_Break
3592 - new block & script & line break values
3593
3594 * gencase
3595 - case-ignorable changes
3596   see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
3597   now: (D47a) Word_Break=MidLetter or General_Category in {Mn, Me, Cf, Lm, Sk}
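The D47a case-ignorable condition above can be sketched in Python. The standard library exposes General_Category but not Word_Break, so the Word_Break value is passed in by the caller; that parameter and the function name are assumptions for illustration, and this is not how gencase computes the property.

```python
import unicodedata

_CASE_IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}


def is_case_ignorable(ch, word_break=None):
    """D47a sketch: Word_Break=MidLetter, or General_Category in Mn, Me, Cf, Lm, Sk."""
    if word_break == "MidLetter":
        return True
    return unicodedata.category(ch) in _CASE_IGNORABLE_CATEGORIES
```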
3598
3599 *** Unicode version numbers
3600 - makedata.mak
3601 - uchar.h
3602 - configure.in
3603
3604 *** tests
3605 - verify that u_charMirror() round-trips
3606 - test all new properties and some new values of old properties
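The round-trip check for mirroring can be sketched independently of ICU. Here mirror_map stands in for u_charMirror() as a plain dict, and a code point without an entry maps to itself; both the dict and the function name are hypothetical.

```python
def mirror_round_trip_failures(mirror_map, code_points):
    """Return code points c for which mirror(mirror(c)) != c."""
    failures = []
    for c in code_points:
        mirrored = mirror_map.get(c, c)  # no entry: the character mirrors to itself
        if mirror_map.get(mirrored, mirrored) != c:
            failures.append(c)
    return failures
```

An empty result means every tested code point round-trips.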
3607
3608 *** other code
3609
3610 * hardcoded Unihan range end/limit
3611 - Unihan range end moves from 9FA5 to 9FBB
3612   search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
3613   + do not modify BOCU/BOCSU code because that would change the encoding
3614     and break binary compatibility!
3615   + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
3616     NamePrepProfile.txt
3617   + ignore trietest.c: test data is arbitrary
3618   + ignore tstnorm.cpp: test optimization, not important
3619   + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
3620   + do change line_th.txt and word_th.txt
3621     by replacing hardcoded ranges with the new property values
3622   + do change gennames.c
3623
3624 source\data\brkitr\line_th.txt(229):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
3625 source\data\brkitr\word_th.txt(23):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
3626 source\tools\gennames\gennames.c(971):        0x4e00, 0x9fa5,
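
The search for 9FA5/9FA6 described above can be scripted. The following is a
minimal Python sketch, assuming a plain recursive scan of a source tree. The
function name, the suffix filter, and the output layout are illustrative
choices, not part of any ICU tool. Hits still need manual review against the
do-change/ignore list above.

```python
import re
import sys
from pathlib import Path

# Matches both the range end (9FA5) and the limit (9FA6), in any letter case.
UNIHAN_END_RE = re.compile(r"9FA[56]", re.IGNORECASE)


def find_unihan_range_hits(root, suffixes=(".c", ".cpp", ".h", ".txt")):
    """Return (path, line number, stripped line) for every line that matches the regex."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if UNIHAN_END_RE.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # Usage (hypothetical script name): python find_9fa.py path/to/source
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, lineno, line in find_unihan_range_hits(root):
        print(f"{path}({lineno}): {line}")
```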
3627
3628 * case mappings
3629 - compare new special casing context conditions with previous ones
3630   see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
3631
3632 * genpname
3633 - consider storing only the short name if it is the same as the long name
3634
3635 *** other reviews
3636 - UAX #29 changes (grapheme/word/sentence breaks)
3637 - UAX #14 changes (line breaks)
3638 - Pattern_Syntax & Pattern_White_Space
3639
3640 ---------------------------------------------------------------------------- ***
3641
3642 Unicode 4.0.1 update
3643
3644 *** related Jitterbugs
3645
3646 3170 RFE: Update to Unicode 4.0.1
3647 3171 Add new Unicode 4.0.1 properties
3648 3520 use Unicode 4.0.1 updates for break iteration
3649
3650 *** data files & enums & parser code
3651
3652 * file preparation
3653 - ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
3654 - ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt
3655
3656 * file fixes
3657 - fix UnicodeData.txt general categories of Ethiopic digits Nd->No
3658   according to PRI #26
3659   http://www.unicode.org/review/resolved-pri.html#pri26
3660 - reverted because no corrigendum is in sight;
3661   instead, modified the tests to skip this consistency check for Unicode 4.0.1
3662
3663 * ucdterms.txt
3664 - update from http://www.unicode.org/copyright.html
3665   formatted for plain text
3666
3667 * uchar.h & uprops.h & uprops.c & genprops
3668 - add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
3669 - add U_LB_INSEPARABLE due to a spelling fix
3670   + put short name comment only on line with new constant
3671     for genpname perl script parser
3672 - new binary properties
3673   + STerm
3674   + Variation_Selector
3675
3676 * genpname
3677 - fix genpname perl script so that it doesn't choke on more than 2 names per property value
3678 - perl script: correctly calculate the maximum number of fields per row
3679
3680 * uscript.h
3681 - new script code Hrkt=Katakana_Or_Hiragana
3682
3683 * gennorm.c: track changes in DerivedNormalizationProps.txt
3684 - "FNC" -> "FC_NFKC"
3685 - single field "NFD_NO" -> two fields "NFD_QC; N" etc.
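
The quick-check field change above can be illustrated with a small parser that
accepts both layouts. This is a Python sketch under stated assumptions, not the
gennorm.c code: the function name and the returned tuple shape are
hypothetical, and only the quick-check lines are handled (other properties
such as FC_NFKC are skipped).

```python
NORMALIZATION_FORMS = ("NFD", "NFC", "NFKD", "NFKC")


def parse_quick_check(line):
    """Return (range_text, property, value) for a quick-check line, else None.

    New layout: "00C0..00C5 ; NFD_QC; N"  -> two fields after the range.
    Old layout: "00C0..00C5 ; NFD_NO"     -> one combined field after the range.
    """
    data = line.split("#", 1)[0].strip()  # drop trailing comment
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    if len(fields) == 3 and fields[1].endswith("_QC"):
        return fields[0], fields[1], fields[2]
    if len(fields) == 2:
        form, _, value = fields[1].rpartition("_")
        if form in NORMALIZATION_FORMS and value in ("NO", "MAYBE"):
            # Map the old combined name onto the new property/value pair.
            return fields[0], form + "_QC", value[0]
    return None
```

Both parse_quick_check("00C0..00C5 ; NFD_NO") and
parse_quick_check("00C0..00C5 ; NFD_QC; N") yield the same tuple, which keeps
downstream code independent of the file version.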
3686
3687 * genprops/props2.c: track changes in DerivedNumericValues.txt
3688 - changed from 3 columns to 2, dropping the numeric type
3689   + assume that the type is always numeric for Han characters,
3690     and that only those are added in addition to what UnicodeData.txt lists
3691
3692 *** Unicode version numbers
3693 - makedata.mak
3694 - uchar.h
3695 - configure.in
3696
3697 *** tests
3698 - update test of default bidi classes according to PRI #28
3699   /tsutil/cucdtst/TestUnicodeData
3700   http://www.unicode.org/review/resolved-pri.html#pri28
3701 - bidi tests: change exemplar character for ES depending on Unicode version
3702 - change hardcoded expected property values where they change
3703
3704 *** other code
3705
3706 * name matching
3707 - read UCD.html
3708
3709 * scripts
3710 - use new Hrkt=Katakana_Or_Hiragana
3711
3712 * ZWJ & ZWNJ
3713 - are now part of combining character sequences
3714 - break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ