icuSources/data/unidata/changes.txt

   1 * Copyright (C) 2004-2016, International Business Machines
   2 * Corporation and others.  All Rights Reserved.
   3 *
   4 *   file name:  changes.txt
   5 *   encoding:   US-ASCII
   6 *   tab size:   8 (not used)
   7 *   indentation:4
   8 *
   9 *   created on: 2004may06
  10 *   created by: Markus W. Scherer
  11 *
  12 * change log for Unicode updates
  13
  14 ---------------------------------------------------------------------------- ***
  15
  16 * New ISO 15924 script codes
  17
  18 Starting with ICU 55, we do not add UScriptCode constants any more until their scripts
  19 are encoded in Unicode, or can be assumed to be encoded in the next Unicode version.
  20 Script enum constant names should follow the Unicode script property value aliases,
  21 which are assigned only when the scripts are encoded.
  22 When we add constants early and guess the names wrong, we end up with confusing
  23 enum constants and have sometimes had to add aliases.
  24
  25 Exception: Script codes like Latf and Aran that are not subject to separate encoding
  26 can be added at any time.
  27
  28 Script codes not yet in ICU: http://www.unicode.org/iso15924/codechanges.html
  29
  30 Added 2014-11-15, see http://bugs.icu-project.org/trac/ticket/11561
  31 - Adlm  166     Adlam
  32 - Aran  161     Arabic (Nastaliq variant)
  33 - Kitl  505     Khitan large script
  34 - Kits  288     Khitan small script
  35 - Marc  332     Marchen
  36 - Osge  219     Osage
  37
  38 Aran can be added as USCRIPT_ARABIC_NASTALIQ at any time.
  39
  40 Adlam, Marchen, and Osage are expected to go into Unicode 9;
  41 we should assign Unicode script property value aliases for them
  42 soon after Unicode 8 is released, and add them in ICU 56.
  43
  44 Khitan scripts will be encoded later.
  45
  46 ---------------------------------------------------------------------------- ***
  47
  48 Emoji properties added in ICU 57: http://bugs.icu-project.org/trac/ticket/11802
  49
  50 Edit preparseucd.py to add & parse new properties.
  51 They share the UCD property namespace but are not listed in PropertyAliases.txt.
  52
  53 Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/
  54 Initial data from emoji/2.0/
  55
  56 ICU_ROOT=~/svn.icu/trunk
  57 ICU_SRC_DIR=$ICU_ROOT/src
  58 ICUDT=icudt56b
  59 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
  60 SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
  61 UNIDATA=$ICU_SRC_DIR/source/data/unidata
  62
  63 Add binary-property constants to uchar.h enum UProperty & UProperty.java.
  64
  65 ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src
  66 (Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.)
  67
  68 Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java
  69
  70 make install, then icutools cmake & make, then
  71 ~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR
  72
  73 Generate Java data as usual, only update pnames.icu & uprops.icu.
  74
  75 ---------------------------------------------------------------------------- ***
  76
  77 Unicode 8.0 update for ICU 56
  78
  79 * Command-line environment setup
  80
  81 ICU_ROOT=~/svn.icu/trunk
  82 ICU_SRC_DIR=$ICU_ROOT/src
  83 ICUDT=icudt56b
  84 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
  85 SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
  86 UNIDATA=$ICU_SRC_DIR/source/data/unidata
  87
  88 http://www.unicode.org/review/pri297/  -- beta review
  89 http://www.unicode.org/reports/uax-proposed-updates.html
  90 http://unicode.org/versions/beta-8.0.0.html
  91 http://www.unicode.org/versions/Unicode8.0.0/
  92 http://www.unicode.org/reports/tr44/tr44-15.html
  93
  94 *** ICU Trac
  95
  96 - ticket:11574: Unicode 8
  97 - C++ branches/markus/uni80 at r37351 from trunk at r37343
  98 - Java branches/markus/uni80 at r37352 from trunk at r37338
  99
 100 *** CLDR Trac
 101
 102 - cldrbug 8311: UCA 8
 103 - branches/markus/uni80 at r11518 from trunk at r11517
 104
 105 - cldrbug 8109: Unicode 8.0 script metadata
 106 - cldrbug 8418: Updated segmentation for Unicode 8.0
 107
 108 *** Unicode version numbers
 109 - makedata.mak
 110 - uchar.h
 111 - com.ibm.icu.util.VersionInfo
 112 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
 113
 114 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
 115   so that the makefiles see the new version number.
 116
 117 *** data files & enums & parser code
 118
 119 * file preparation
 120
 121 - download UCD & IDNA files
 122 - make sure that the Unicode data folder passed into preparseucd.py
 123   includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
 124 - only for manual diffs: remove version suffixes from the file names
 125   ~/unidata/uni70/20140403$ ../../desuffixucd.py .
 126   (see https://sites.google.com/site/unicodetools/inputdata)
 127 - only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
 128 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src
 129 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
 130
 131 - also: from http://unicode.org/Public/security/8.0.0/ download new
 132   confusables.txt & confusablesWholeScript.txt
 133   and copy to $UNIDATA
 134     ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA
 135     ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA
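
Sketch of the "remove version suffixes" step for manual diffs. This is not the
actual desuffixucd.py; the function name and the assumed suffix shape
("-8.0.0" or "-8.0.0d5" right before the extension) are illustrative only.

```python
import re

# Assumption: the suffix is "-<major>.<minor>.<update>" with an optional
# draft tag such as "d5", immediately before the file extension.
_VERSION_SUFFIX = re.compile(r"-\d+\.\d+\.\d+(?:d\d+)?(?=\.[A-Za-z]+$)")


def strip_version_suffix(file_name):
    """Return file_name without a trailing Unicode version suffix, if any."""
    return _VERSION_SUFFIX.sub("", file_name)


if __name__ == "__main__":
    for name in ("UnicodeData-8.0.0d5.txt",
                 "IdnaMappingTable-8.0.0.txt",
                 "confusables.txt"):
        print(name, "->", strip_version_suffix(name))
```

Names without such a suffix pass through unchanged, so the function is safe to
apply to every file in the folder.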
 136
 137 * initial preparseucd.py changes
 138 - remove new Unicode scripts from the
 139   only-in-ISO-15924 list according to the error message:
 140     ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw']
 141     from _scripts_only_in_iso15924
 142   -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
 143       and in com.ibm.icu.dev.test.lang.TestUScript.java
 144 - property and file name change:
 145     IndicMatraCategory -> IndicPositionalCategory
 146 - UnicodeData.txt unusual numeric values (some fractions are not in lowest terms)
 147     109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;
 148     109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;
 149     109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;;
 150     109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;;
 151     109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;;
 152     109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;
 153     109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;;
 154     109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;;
 155     109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;;
 156     109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;
 157   -> change preparseucd.py to map them to reduced fractions (e.g., 2/12 -> 1/6)
 158      which are listed in DerivedNumericValues.txt;
 159      keeps storage in data file simple
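
Sketch of reducing such a numeric value to lowest terms. This is not the
actual preparseucd.py code; reduce_numeric_value() is a hypothetical helper
that relies on Python's standard fractions module.

```python
from fractions import Fraction


def reduce_numeric_value(nv):
    """Return a numeric value string with any fraction in lowest terms."""
    if "/" not in nv:
        return nv  # integers and empty fields pass through unchanged
    # Fraction() normalizes on construction, e.g. Fraction("2/12") == 1/6.
    return str(Fraction(nv))


if __name__ == "__main__":
    for nv in ("1/12", "2/12", "6/12", "10/12"):
        print(nv, "->", reduce_numeric_value(nv))
```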
 160
 161 * PropertyValueAliases.txt changes
 162 - 10 new Block (blk) values:
 163     blk; Ahom                             ; Ahom
 164     blk; Anatolian_Hieroglyphs            ; Anatolian_Hieroglyphs
 165     blk; Cherokee_Sup                     ; Cherokee_Supplement
 166     blk; CJK_Ext_E                        ; CJK_Unified_Ideographs_Extension_E
 167     blk; Early_Dynastic_Cuneiform         ; Early_Dynastic_Cuneiform
 168     blk; Hatran                           ; Hatran
 169     blk; Multani                          ; Multani
 170     blk; Old_Hungarian                    ; Old_Hungarian
 171     blk; Sup_Symbols_And_Pictographs      ; Supplemental_Symbols_And_Pictographs
 172     blk; Sutton_SignWriting               ; Sutton_SignWriting
 173   -> add to uchar.h
 174     use long property names for enum constants
 175   -> add to UCharacter.UnicodeBlock IDs
 176     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
 177             replace  public static final int \1_ID = \2; \3
 178   -> add to UCharacter.UnicodeBlock objects
 179     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
 180             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
 181 - 6 new Script (sc) values:
 182     sc ; Ahom                             ; Ahom
 183     sc ; Hatr                             ; Hatran
 184     sc ; Hluw                             ; Anatolian_Hieroglyphs
 185     sc ; Hung                             ; Old_Hungarian
 186     sc ; Mult                             ; Multani
 187     sc ; Sgnw                             ; SignWriting
 188   -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript
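
The two Eclipse find/replace steps for UBLOCK_ constants above can also be
sketched with Python's re module, using the same patterns. The helper names
and the sample input line are made up for illustration.

```python
import re

# Same patterns as in the Eclipse find/replace notes above.
_ID_FIND = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
_OBJ_FIND = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")


def to_java_block_id(c_line):
    """Turn a uchar.h UBLOCK_ enum line into a UnicodeBlock *_ID constant."""
    return _ID_FIND.sub(r"public static final int \1_ID = \2; \3", c_line)


def to_java_block_object(c_line):
    """Turn a uchar.h UBLOCK_ enum line into a UnicodeBlock object constant."""
    return _OBJ_FIND.sub(
        r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
        c_line)


if __name__ == "__main__":
    sample = "    UBLOCK_AHOM = 253, /*[11700]*/"  # illustrative input line
    print(to_java_block_id(sample))
    print(to_java_block_object(sample))
```

Only the matched part of each line is rewritten, so leading indentation is
preserved.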
 189
 190 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
 191     (not strictly necessary for NOT_ENCODED scripts)
 192   ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
 193
 194 * generate normalization data files
 195   cd $ICU_ROOT/dbg
 196   bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
 197   bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
 198   bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
 199   bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
 200   bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
 201
 202 * build ICU (make install)
 203   so that the tools build can pick up the new definitions from the installed header files.
 204
 205   $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
 206
 207 * build Unicode tools using CMake+make
 208
 209 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
 210
 211   # Location (--prefix) of where ICU was installed.
 212   set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)
 213   # Location of the ICU source tree.
 214   set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)
 215
 216   ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
 217   ~/svn.icutools/trunk/dbg/unicode/c$ make
 218
 219 * generate core properties data files
 220 - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
 221 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR
 222 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR
 223 - rebuild ICU (make install) & tools
 224 - run genuca again (see step above) so that it picks up the new nfc.nrm
 225 - rebuild ICU (make install) & tools
 226
 227 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
 228   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
 229 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
 230 - Unicode 6.0..8.0: U+2260, U+226E, U+226F
 231 - nothing new in 8.0, no test file to update
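
Sketch of the grep step as a small Python filter. It assumes the usual
semicolon-separated layout of IdnaMappingTable.txt (code point or range;
status; optional mapping; "#" comment); the function name is hypothetical.

```python
def non_ascii_std3_valid(lines):
    """Return (start, end) ranges with status disallowed_STD3_valid above U+007F."""
    result = []
    for line in lines:
        data = line.split("#", 1)[0]  # drop the trailing comment
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        start_cp = int(start, 16)
        end_cp = int(end, 16) if end else start_cp
        if end_cp > 0x7F:  # keep only ranges that reach beyond ASCII
            result.append((start_cp, end_cp))
    return result


if __name__ == "__main__":
    # With a real file: open("IdnaMappingTable.txt", encoding="utf-8")
    sample = [
        "0000..002C    ; disallowed_STD3_valid   # illustrative ASCII range",
        "2260          ; disallowed_STD3_valid   # NOT EQUAL TO",
        "00C0          ; mapped ; 00E0           # illustrative mapped line",
    ]
    for start, end in non_ascii_std3_valid(sample):
        print("U+%04X..U+%04X" % (start, end))
```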
 232
 233 * run & fix ICU4C tests
 234 - bad Cherokee case folding due to difference in fallbacks:
 235   UCD case folding falls back to no mapping,
 236   ICU runtime case folding falls back to lowercasing;
 237   fixed casepropsbuilder.cpp to generate scf mappings to self
 238   when there is an slc mapping but no scf
 239 - Andy handles RBBI & spoof check test failures
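
Sketch of the Cherokee case folding fix described above, with plain dicts
standing in for the builder's data. This is not the casepropsbuilder.cpp
code; the function names are hypothetical and the code points are an example.

```python
def runtime_fold(cp, slc, scf):
    """Model of the ICU runtime fallback: scf mapping, else lowercase, else self."""
    return scf.get(cp, slc.get(cp, cp))


def add_self_foldings(slc, scf):
    """Add scf mappings to self for code points with an slc mapping but no scf."""
    fixed = dict(scf)
    for cp in slc:
        fixed.setdefault(cp, cp)
    return fixed


if __name__ == "__main__":
    slc = {0x13A0: 0xAB70}  # uppercase Cherokee letter lowercases to U+AB70
    scf = {0xAB70: 0x13A0}  # lowercase Cherokee letter folds to uppercase
    print("before: U+%04X" % runtime_fold(0x13A0, slc, scf))
    print("after:  U+%04X" % runtime_fold(0x13A0, slc, add_self_foldings(slc, scf)))
```

Without the explicit self mapping, the modeled runtime lowercases the
uppercase letter instead of leaving it unchanged as the UCD data intends.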
 240
 241 * collation: CLDR collation root, UCA DUCET
 242
 243 - UCA DUCET goes into Mark's Unicode tools, see
 244   https://sites.google.com/site/unicodetools/home#TOC-UCA
 245 - CLDR root data files are checked into (CLDR UCA branch)/common/uca/
 246 - cd (CLDR UCA branch)/common/uca/
 247 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
 248   cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
 249 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
 250     cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
 251     (note removing the underscore before "Rules")
 252     cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
 253 - restore TODO diffs in UCARules.txt
 254     meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
 255 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
 256   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
 257   from the CLDR root files (..._CLDR_..._SHORT.txt)
 258     cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
 259     cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
 260     cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
 261 - if CLDR common/uca/unihan-index.txt changes, then update
 262   CLDR common/collation/root.xml <collation type="private-unihan">
 263   and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt
 264 - run genuca, see command line above;
 265   deal with
 266     Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt
 267         (add the character to genuca.cpp sampleCharsToScripts[])
 268   + look up the script for the new sample characters
 269     (e.g., in FractionalUCA.txt)
 270   + *add* mappings to sampleCharsToScripts[], do not replace them
 271     (in case the script sample characters flip-flop)
 272   + insert new scripts in DUCET script order, see the top_byte table
 273     at the beginning of FractionalUCA.txt
 274 - rebuild ICU4C
 275
 276 * run & fix ICU4C tests, now with new CLDR collation root data
 277 - run all tests with the collation test data *_SHORT.txt or the full files
 278   (the full ones have comments, useful for debugging)
 279 - note on intltest: if collate/UCAConformanceTest fails, then
 280   utility/MultithreadTest/TestCollators will fail as well;
 281   fix the conformance test before looking into the multi-thread test
 282 - fixed bug in CollationWeights::getWeightRanges()
 283   exposed by new data and CollationTest::TestRootElements
 284
 285 * update Java data files
 286 - refresh just the UCD/UCA-related/derived files, just to be safe
 287 - see (ICU4C)/source/data/icu4j-readme.txt
 288 - mkdir /tmp/icu4j
 289 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 290   output:
 291     ...
 292     Unicode .icu files built to ./out/build/icudt56l
 293     echo timestamp > uni-core-data
 294     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b
 295     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b
 296     echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt
 297     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b
 298     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b"
 299     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/
 300     mkdir -p /tmp/icu4j/main/shared/data
 301     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
 302     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/
 303     mkdir -p /tmp/icu4j/main/shared/data
 304     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
 305     make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'
 306 - copy the big-endian Unicode data files to another location,
 307   separate from the other data files,
 308   and then refresh ICU4J
 309     cd ~/svn.icu/trunk/dbg/data/out/icu4j
 310     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
 311     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
 312     cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 313     cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 314     rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
 315     cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 316     cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
 317     cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
 318     jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
 319
 320 * When refreshing all of ICU4J data from ICU4C
 321 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 322 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
 323 or
 324 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
 325
 326 * update CollationFCD.java
 327   + copy & paste the initializers of lcccIndex[] etc. from
 328     ICU4C/source/i18n/collationfcd.cpp to
 329     ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
 330
 331 * refresh Java test .txt files
 332 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
 333     cd $ICU_SRC_DIR/source/data/unidata
 334     cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
 335     cd ../../test/testdata
 336     cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
 337     cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
 338
 339 * run & fix ICU4J tests
 340
 341 *** LayoutEngine script information
 342
 343 * ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more,
 344   because the layout engine was deprecated in ICU 54.
 345   Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java
 346   to write lines that we used to add manually.
 347
 348 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
 349   This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
 350   in the working directory.
 351
 352   (It also generates ScriptRunData.cpp, which is no longer needed.)
 353
 354   It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
 355   (a plain text file)
 356   which maps ICU versions to the numbers of script/language constants
 357   that were added then.
 358   (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)
 359
 360   The generated files have a current copyright date and "@deprecated" statement.
 361
 362 * Review changes, fix Java tool if necessary, and copy to ICU4C
 363   cd ~/svn.icu4j/trunk/src
 364   meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
 365   cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout
 366   cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout
 367
 368 *** API additions
 369 - send notice to icu-design about new born-@stable API (enum constants etc.)
 370
 371 *** merge the Unicode update branches back onto the trunk
 372 - do not merge the icudata.jar and testdata.jar,
 373   instead rebuild them from merged & tested ICU4C
 374 - make sure that changes to Unicode tools & ICU tools are checked in
 375   http://www.unicode.org/utility/trac/log/trunk/unicodetools
 376   http://bugs.icu-project.org/trac/log/tools/trunk
 377
 378 ---------------------------------------------------------------------------- ***
 379
 380 Unicode 7.0 update for ICU 54
 381
 382 http://www.unicode.org/review/pri271/  -- beta review
 383 http://www.unicode.org/reports/uax-proposed-updates.html
 384 http://www.unicode.org/versions/beta-7.0.0.html#notable_issues
 385 http://www.unicode.org/reports/tr44/tr44-13.html
 386
 387 *** ICU Trac
 388
 389 - ticket 10821: Unicode 7.0, UCA 7.0
 390 - C++ branches/markus/uni70 at r35584 from trunk at r35580
 391 - Java branches/markus/uni70 at r35587 from trunk at r35545
 392
 393 *** CLDR Trac
 394
 395 - ticket 7195: UCA 7.0 CLDR root collation
 396 - branches/markus/uni70 at r10062 from trunk at r10061
 397
 398 - ticket 6762: script metadata for Unicode 7.0 new scripts
 399
 400 *** Unicode version numbers
 401 - makedata.mak
 402 - uchar.h
 403 - com.ibm.icu.util.VersionInfo
 404 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
 405
 406 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
 407   so that the makefiles see the new version number.
 408
 409 *** data files & enums & parser code
 410
 411 * file preparation
 412
 413 - download UCD & IDNA files
 414 - make sure that the Unicode data folder passed into preparseucd.py
 415   includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
 416 - only for manual diffs: remove version suffixes from the file names
 417   ~/unidata/uni70/20140403$ ../../desuffixucd.py .
 418   (see https://sites.google.com/site/unicodetools/inputdata)
 419 - only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip
 420 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src
 421 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
 422 - Restore TODO diffs in source/data/unidata/UCARules.txt
 423     cd $ICU_SRC_DIR
 424     meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt
 425 - Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt
 426
 427 - also: from http://unicode.org/Public/security/7.0.0/ download new
 428   confusables.txt & confusablesWholeScript.txt
 429   and copy to $ICU_ROOT/src/source/data/unidata/
 430
 431 * initial preparseucd.py changes
 432 - remove new Unicode scripts from the
 433   only-in-ISO-15924 list according to the error message:
 434     ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass',
 435                         'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm',
 436                         'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj']
 437     from _scripts_only_in_iso15924
 438   -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
 439       and in com.ibm.icu.dev.test.lang.TestUScript.java
 440 - NamesList.txt now has a heading with a non-ASCII character
 441   + keep ppucd.txt in platform charset, rather than changing tool/test parsers
 442   + escape non-ASCII characters in heading comments
 443 - preparseucd.py gets the Unicode copyright line from PropertyAliases.txt, which is currently still at 2013
 444   + get the copyright from the first file whose copyright line contains the current year
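
Sketch of the kind of check that produces the "remove [...] from
_scripts_only_in_iso15924" error quoted above. The real preparseucd.py logic
may differ; the function name and arguments are assumptions.

```python
def check_scripts_only_in_iso15924(only_in_iso, unicode_scripts):
    """Raise ValueError if a script on the ISO-only list is now in Unicode."""
    now_encoded = sorted(s for s in only_in_iso if s in unicode_scripts)
    if now_encoded:
        raise ValueError(
            "remove %s from _scripts_only_in_iso15924" % now_encoded)


if __name__ == "__main__":
    try:
        check_scripts_only_in_iso15924({"Ahom", "Hatr", "Kitl"},
                                       {"Ahom", "Hatr", "Latn"})
    except ValueError as e:
        print("ValueError:", e)
```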
 445
 446 * PropertyValueAliases.txt changes
 447 - 32 new Block (blk) values:
 448     blk; Bassa_Vah                        ; Bassa_Vah
 449     blk; Caucasian_Albanian               ; Caucasian_Albanian
 450     blk; Coptic_Epact_Numbers             ; Coptic_Epact_Numbers
 451     blk; Diacriticals_Ext                 ; Combining_Diacritical_Marks_Extended
 452     blk; Duployan                         ; Duployan
 453     blk; Elbasan                          ; Elbasan
 454     blk; Geometric_Shapes_Ext             ; Geometric_Shapes_Extended
 455     blk; Grantha                          ; Grantha
 456     blk; Khojki                           ; Khojki
 457     blk; Khudawadi                        ; Khudawadi
 458     blk; Latin_Ext_E                      ; Latin_Extended_E
 459     blk; Linear_A                         ; Linear_A
 460     blk; Mahajani                         ; Mahajani
 461     blk; Manichaean                       ; Manichaean
 462     blk; Mende_Kikakui                    ; Mende_Kikakui
 463     blk; Modi                             ; Modi
 464     blk; Mro                              ; Mro
 465     blk; Myanmar_Ext_B                    ; Myanmar_Extended_B
 466     blk; Nabataean                        ; Nabataean
 467     blk; Old_North_Arabian                ; Old_North_Arabian
 468     blk; Old_Permic                       ; Old_Permic
 469     blk; Ornamental_Dingbats              ; Ornamental_Dingbats
 470     blk; Pahawh_Hmong                     ; Pahawh_Hmong
 471     blk; Palmyrene                        ; Palmyrene
 472     blk; Pau_Cin_Hau                      ; Pau_Cin_Hau
 473     blk; Psalter_Pahlavi                  ; Psalter_Pahlavi
 474     blk; Shorthand_Format_Controls        ; Shorthand_Format_Controls
 475     blk; Siddham                          ; Siddham
 476     blk; Sinhala_Archaic_Numbers          ; Sinhala_Archaic_Numbers
 477     blk; Sup_Arrows_C                     ; Supplemental_Arrows_C
 478     blk; Tirhuta                          ; Tirhuta
 479     blk; Warang_Citi                      ; Warang_Citi
 480   -> add to uchar.h
 481     use long property names for enum constants
 482   -> add to UCharacter.UnicodeBlock IDs
 483     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
 484             replace  public static final int \1_ID = \2; \3
 485   -> add to UCharacter.UnicodeBlock objects
 486     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
 487             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
 488 - 28 new Joining_Group (jg) values:
 489     jg ; Manichaean_Aleph                 ; Manichaean_Aleph
 490     jg ; Manichaean_Ayin                  ; Manichaean_Ayin
 491     jg ; Manichaean_Beth                  ; Manichaean_Beth
 492     jg ; Manichaean_Daleth                ; Manichaean_Daleth
 493     jg ; Manichaean_Dhamedh               ; Manichaean_Dhamedh
 494     jg ; Manichaean_Five                  ; Manichaean_Five
 495     jg ; Manichaean_Gimel                 ; Manichaean_Gimel
 496     jg ; Manichaean_Heth                  ; Manichaean_Heth
 497     jg ; Manichaean_Hundred               ; Manichaean_Hundred
 498     jg ; Manichaean_Kaph                  ; Manichaean_Kaph
 499     jg ; Manichaean_Lamedh                ; Manichaean_Lamedh
 500     jg ; Manichaean_Mem                   ; Manichaean_Mem
 501     jg ; Manichaean_Nun                   ; Manichaean_Nun
 502     jg ; Manichaean_One                   ; Manichaean_One
 503     jg ; Manichaean_Pe                    ; Manichaean_Pe
 504     jg ; Manichaean_Qoph                  ; Manichaean_Qoph
 505     jg ; Manichaean_Resh                  ; Manichaean_Resh
 506     jg ; Manichaean_Sadhe                 ; Manichaean_Sadhe
 507     jg ; Manichaean_Samekh                ; Manichaean_Samekh
 508     jg ; Manichaean_Taw                   ; Manichaean_Taw
 509     jg ; Manichaean_Ten                   ; Manichaean_Ten
 510     jg ; Manichaean_Teth                  ; Manichaean_Teth
 511     jg ; Manichaean_Thamedh               ; Manichaean_Thamedh
 512     jg ; Manichaean_Twenty                ; Manichaean_Twenty
 513     jg ; Manichaean_Waw                   ; Manichaean_Waw
 514     jg ; Manichaean_Yodh                  ; Manichaean_Yodh
 515     jg ; Manichaean_Zayin                 ; Manichaean_Zayin
 516     jg ; Straight_Waw                     ; Straight_Waw
 517   -> uchar.h & UCharacter.JoiningGroup
 518 - 23 new Script (sc) values:
 519     sc ; Aghb                             ; Caucasian_Albanian
 520     sc ; Bass                             ; Bassa_Vah
 521     sc ; Dupl                             ; Duployan
 522     sc ; Elba                             ; Elbasan
 523     sc ; Gran                             ; Grantha
 524     sc ; Hmng                             ; Pahawh_Hmong
 525     sc ; Khoj                             ; Khojki
 526     sc ; Lina                             ; Linear_A
 527     sc ; Mahj                             ; Mahajani
 528     sc ; Mani                             ; Manichaean
 529     sc ; Mend                             ; Mende_Kikakui
 530     sc ; Modi                             ; Modi
 531     sc ; Mroo                             ; Mro
 532     sc ; Narb                             ; Old_North_Arabian
 533     sc ; Nbat                             ; Nabataean
 534     sc ; Palm                             ; Palmyrene
 535     sc ; Pauc                             ; Pau_Cin_Hau
 536     sc ; Perm                             ; Old_Permic
 537     sc ; Phlp                             ; Psalter_Pahlavi
 538     sc ; Sidd                             ; Siddham
 539     sc ; Sind                             ; Khudawadi
 540     sc ; Tirh                             ; Tirhuta
 541     sc ; Wara                             ; Warang_Citi
 542   -> uscript.h (many were added before)
 543     comment "Mende Kikakui" for USCRIPT_MENDE
 544     add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias
 545   -> com.ibm.icu.lang.UScript
 546     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
 547     replace  public static final int \1 = \2; \3
 548 - 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
 549   (added 2012-11-01)
 550     Ahom        338     Ahom
 551     Hatr        127     Hatran
 552     Mult        323     Multani
 553   (added 2013-10-12)
 554     Modi        324     Modi
 555     Pauc        263     Pau Cin Hau
 556     Sidd        302     Siddham
 557   -> uscript.h (some overlap with additions from Unicode)
 558   -> com.ibm.icu.lang.UScript
 559     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
 560     replace  public static final int \1 = \2; \3
 561   -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924
 562   -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
 563       and in com.ibm.icu.dev.test.lang.TestUScript.java
 564
 565 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
 566     (not strictly necessary for NOT_ENCODED scripts)
 567   ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt
 568
 569 * generate normalization data files
 570 - cd $ICU_ROOT/dbg
 571 - export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
 572 - SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
 573 - UNIDATA=$ICU_SRC_DIR/source/data/unidata
 574 - bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
 575 - bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
 576 - bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
 577 - bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
 578 - bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
 579
 580 * build ICU (make install)
 581   so that the tools build can pick up the new definitions from the installed header files.
 582
 583 ~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
 584
 585 * build Unicode tools using CMake+make
 586
 587 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
 588
 589 # Location (--prefix) of where ICU was installed.
 590 set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst)
 591 # Location of the ICU source tree.
 592 set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src)
 593
 594 ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
 595 ~/svn.icutools/trunk/dbg/unicode/c$ make
 596
 597 * genprops work
 598 - new code point range for Joining_Group values: 10AC0..10AFF Manichaean
 599   + add second array of Joining_Group values for at most 10800..10FFF
 600     icutools: unicode/c/genprops/bidipropsbuilder.cpp
 601     icu: source/common/ubidi_props.h/.c/_data.h
 602     icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java
 603
 604 * generate core properties data files
 605 - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR
 606 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR
 607 - rebuild ICU (make install) & tools
 608 - run genuca again (see step above) so that it picks up the new nfc.nrm
 609 - rebuild ICU (make install) & tools
 610
 611 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
 612   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
 613 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
 614 - Unicode 6.0..7.0: U+2260, U+226E, U+226F
 615 - nothing new in 7.0, no test file to update
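
A plain grep also returns the ASCII lines, which then have to be skipped by eye. A minimal sketch that filters them out, assuming the usual IdnaMappingTable.txt layout (code point or range, then ";"-separated fields, then a "#" comment). The sample lines are abbreviated for illustration and the function name is made up.

```python
def non_ascii_std3_valid(lines):
    """Return non-ASCII code points whose status is disallowed_STD3_valid."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0]  # drop the trailing comment
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")  # single code point or range
        start = int(first, 16)
        end = int(last, 16) if last else start
        found.extend(cp for cp in range(start, end + 1) if cp > 0x7F)
    return found


# Abbreviated sample lines, for illustration only.
sample = [
    "003D          ; disallowed_STD3_valid  # EQUALS SIGN",
    "00C0          ; mapped ; 00E0          # LATIN CAPITAL LETTER A WITH GRAVE",
    "2260          ; disallowed_STD3_valid  # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid  # NOT LESS-THAN..NOT GREATER-THAN",
]
print(["U+%04X" % cp for cp in non_ascii_std3_valid(sample)])
```

To run it on the real file, pass an open file object instead of the sample list.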
 616
 617 * run & fix ICU4C tests
 618
 619 * update Java data files
 620 - refresh just the UCD-related files, just to be safe
 621 - see (ICU4C)/source/data/icu4j-readme.txt
 622 - mkdir /tmp/icu4j
 623 - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 624   output:
 625     ...
 626     Unicode .icu files built to ./out/build/icudt53l
 627     echo timestamp > uni-core-data
 628     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b
 629     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b
 630     echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
 631     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b
 632     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b"
 633     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/
 634     mkdir -p /tmp/icu4j/main/shared/data
 635     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
 636     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/
 637     mkdir -p /tmp/icu4j/main/shared/data
 638     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
 639     make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data'
 640 - copy the big-endian Unicode data files to another location,
 641   separate from the other data files
 642     ICUDT=icudt54b
 643     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
 644     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
 645     cd ~/svn.icu/uni70/dbg/data/out/icu4j
 646     cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 647     cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 648     rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
 649     cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
 650     cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
 651     cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
 652 - refresh ICU4J
 653     ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
 654
 655 * update CollationFCD.java
 656   + copy & paste the initializers of lcccIndex[] etc. from
 657     ICU4C/source/i18n/collationfcd.cpp to
 658     ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java
 659
 660 * refresh Java test .txt files
 661 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
 662     cd $ICU_SRC_DIR/source/data/unidata
 663     cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
 664     cd ../../test/testdata
 665     cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
 666     cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode
 667
 668 * UCA
 669
 670 - download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/
 671 - run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata)
 672 - update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/
 673 - run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA
 674 - output files are in ~/svn.unitools/Generated/uca/7.0.0/
 675 - review data; compare files, use blankweights.sed or similar
 676   ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt
 677 - cd ~/svn.unitools/Generated/uca/7.0.0/
 678 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
 679   cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt
 680 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
 681     (note removing the underscore before "Rules")
 682     cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
 683 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
 684   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
 685   with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
 686     cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt
 687     cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt
 688     cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data
 689 - run genuca, see command line above
 690 - rebuild ICU4C
 691 - refresh ICU4J collation data:
 692   (a subset of the instructions above for the properties data refresh, except that it copies all of coll/*)
 693     ICUDT=icudt54b
 694     ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 695     ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
 696     ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
 697     ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
 698 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
 699 - note on intltest: if collate/UCAConformanceTest fails, then
 700   utility/MultithreadTest/TestCollators will fail as well;
 701   fix the conformance test before looking into the multi-thread test
 702 - copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors
 703 - copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch
 704   ~/svn.unitools$ cp Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/
 705
 706 * When refreshing all of ICU4J data from ICU4C
 707 - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 708 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
 709 or
 710 - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
 711
 712 * run & fix ICU4J tests
 713
 714 *** LayoutEngine script information
 715
 716 (For details see the Unicode 5.2 change log below.)
 717
 718 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
 719   This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
 720   in the working directory.
 721   (It also generates ScriptRunData.cpp, which is no longer needed.)
 722
 723   The generated files have a current copyright date and "@stable" statement.
 724   ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java
 725   for "born stable" Unicode API constants, and to stop parsing ICU version numbers
 726   which may not contain dots any more.
 727
 728 - diff current <icu>/source/layout files vs. generated ones
 729     ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
 730   review and manually merge desired changes;
 731   fix gratuitous changes, incorrect @draft/@stable and missing aliases;
 732   Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
 733 - if you just copy the above files, then
 734   fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
 735   manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
 736
 737 *** API additions
 738 - send notice to icu-design about new born-@stable API (enum constants etc.)
 739
 740 *** merge the Unicode update branches back onto the trunk
 741 - do not merge the icudata.jar and testdata.jar,
 742   instead rebuild them from merged & tested ICU4C
 743
 744 ---------------------------------------------------------------------------- ***
 745
 746 Unicode 6.3 update
 747
 748 http://www.unicode.org/review/pri249/  -- beta review
 749 http://www.unicode.org/reports/uax-proposed-updates.html
 750 http://www.unicode.org/versions/beta-6.3.0.html#notable_issues
 751 http://www.unicode.org/reports/tr44/tr44-11.html
 752
 753 *** ICU Trac
 754
 755 - ticket 10128: update ICU to Unicode 6.3 beta
 756 - ticket 10168: update ICU to Unicode 6.3 final
 757 - C++ branches/markus/uni63 at r33552 from trunk at r33551
 758 - Java branches/markus/uni63 at r33550 from trunk at r33553
 759
 760 - ticket 10142: implement Unicode 6.3 bidi algorithm additions
 761
 762 *** Unicode version numbers
 763 - makedata.mak
 764 - uchar.h
 765   (configure.in & configure: have been modified to extract the version from uchar.h)
 766 - com.ibm.icu.util.VersionInfo
 767 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
 768
 769 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h
 770   so that the makefiles see the new version number.
 771
 772 *** data files & enums & parser code
 773
 774 * file preparation
 775
 776 - download UCD, UCA & IDNA files
 777 - make sure that the Unicode data folder passed into preparseucd.py
 778   includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
 779 - modify preparseucd.py:
 780   parse new file BidiBrackets.txt
 781   with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type
 782 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src
 783 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
 784 - Check test file diffs for previously commented-out, known-failing data lines;
 785   probably need to keep those commented out.
 786
 787 * PropertyAliases.txt changes
 788 - 1 new Enumerated Property
 789   bpt                      ; Bidi_Paired_Bracket_Type
 790   -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType
 791   -> ubidi_props.h & .c & UBiDiProps.java
 792   -> remember to write the max value at UBIDI_MAX_VALUES_INDEX
 793   -> uprops.cpp
 794   -> change ubidi.icu format version from 2.0 to 2.1
 795 - 1 new Miscellaneous Property
 796   bpb                      ; Bidi_Paired_Bracket
 797   -> uchar.h & UProperty.java
 798   -> ppucd.h & .cpp
 799
 800 * PropertyValueAliases.txt changes
 801 - 3 Bidi_Paired_Bracket_Type (bpt) values:
 802   bpt; c                                ; Close
 803   bpt; n                                ; None
 804   bpt; o                                ; Open
 805   -> uchar.h & UCharacter.BidiPairedBracketType
 806   -> ubidi_props.h & .c & UBiDiProps.java
 807   -> change ubidi.icu format version from 2.0 to 2.1
 808 - 4 new Bidi_Class (bc) values:
 809   bc ; FSI                              ; First_Strong_Isolate
 810   bc ; LRI                              ; Left_To_Right_Isolate
 811   bc ; RLI                              ; Right_To_Left_Isolate
 812   bc ; PDI                              ; Pop_Directional_Isolate
 813   -> uchar.h & UCharacterEnums.ECharacterDirection
 814   -> until the bidi code gets updated,
 815      Roozbeh suggests mapping the new bc values to ON (Other_Neutral)
 816 - 3 new Word_Break (WB) values:
 817   WB ; HL                               ; Hebrew_Letter
 818   WB ; SQ                               ; Single_Quote
 819   WB ; DQ                               ; Double_Quote
 820   -> uchar.h & UCharacter.WordBreak
 821   -> first time Word_Break numeric constants exceed 4 bits (now 17 values)
 822 - 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
 823   (added 2012-10-16)
 824   Aghb  239     Caucasian Albanian
 825   Mahj  314     Mahajani
 826   -> uscript.h
 827   -> com.ibm.icu.lang.UScript
 828     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
 829     replace  public static final int \1 = \2;\3
 830   -> preparseucd.py _scripts_only_in_iso15924
 831   -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
 832       and in com.ibm.icu.dev.test.lang.TestUScript.java
 833   -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata
 834      (not strictly necessary for NOT_ENCODED scripts)
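
On the Word_Break note in the list above: 17 values no longer fit into a 4-bit field. A quick check of that arithmetic, assuming the constants are numbered consecutively from 0 (an assumption for this sketch; the helper name is made up).

```python
def bits_needed(value_count):
    """Number of bits needed to store constants 0 .. value_count-1."""
    return max(1, (value_count - 1).bit_length())


for count in (16, 17):
    print(count, "values ->", bits_needed(count), "bits")
```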
 835
 836 * generate normalization data files
 837 - ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib
 838 - ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in
 839 - ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata
 840 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
 841 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
 842 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
 843 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
 844
 845 * build ICU (make install)
 846   so that the tools build can pick up the new definitions from the installed header files.
 847
 848 ~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt
 849
 850 * build Unicode tools using CMake+make
 851
 852 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:
 853
 854 # Location (--prefix) of where ICU was installed.
 855 set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst)
 856 # Location of the ICU source tree.
 857 set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src)
 858
 859 ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c
 860 ~/svn.icutools/trunk/dbg/unicode/c$ make
 861
 862 * generate core properties data files
 863 - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src
 864 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src
 865 - rebuild ICU (make install) & tools
 866 - run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
 867 - rebuild ICU (make install) & tools
 868
 869 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
 870   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
 871 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
 872 - Unicode 6.0..6.3: U+2260, U+226E, U+226F
 873 - nothing new in 6.3, no test file to update
 874
 875 * update Java data files
 876 - refresh just the UCD-related files, just to be safe
 877 - see (ICU4C)/source/data/icu4j-readme.txt
 878 - mkdir /tmp/icu4j
 879 - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 880   output:
 881     ...
 882     Unicode .icu files built to ./out/build/icudt52l
 883     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b
 884     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b
 885     echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
 886     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b
 887     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b"
 888     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/
 889     mkdir -p /tmp/icu4j/main/shared/data
 890     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
 891     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/
 892     mkdir -p /tmp/icu4j/main/shared/data
 893     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
 894     make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data'
 895 - copy the big-endian Unicode data files to another location,
 896   separate from the other data files
 897     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
 898     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
 899     ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
 900     ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu
 901     ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b
 902     ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
 903     ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr
 904 - refresh ICU4J
 905     ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
 906
 907 * refresh Java test .txt files
 908 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
 909
 910 * UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files
 911
 912 - get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
 913 - CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
 914 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
 915 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
 916   (note removing the underscore before "Rules")
 917 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
 918   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
 919   with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
 920 - check test file diffs for previously commented-out, known-failing data lines;
 921   probably need to keep those commented out
 922 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
 923 - run genuca, see command line above
 924 - rebuild ICU4C
 925 - refresh ICU4J collation data:
 926   (a subset of the instructions above for the properties data refresh, except that it copies all of coll/*)
 927     ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 928     ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
 929     ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll
 930     ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
 931 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
 932 - note on intltest: if collate/UCAConformanceTest fails, then
 933   utility/MultithreadTest/TestCollators will fail as well;
 934   fix the conformance test before looking into the multi-thread test
 935
 936 * test ICU, fix test code where necessary
 937
 938 * When refreshing all of ICU4J data from ICU4C
 939 - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 940 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
 941 or
 942 - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
 943
 944 *** LayoutEngine script information
 945 - skipped for Unicode 6.3: no new scripts
 946
 947 *** merge the Unicode update branches back onto the trunk
 948 - do not merge the icudata.jar and testdata.jar,
 949   instead rebuild them from merged & tested ICU4C
 950
 951 ---------------------------------------------------------------------------- ***
 952
 953 Unicode 6.2 update
 954
 955 http://www.unicode.org/review/pri230/
 956 http://www.unicode.org/versions/beta-6.2.0.html
 957 http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0
 958 http://www.unicode.org/review/pri227/  Changes to Script Extensions Property Values
 959 http://www.unicode.org/review/pri228/  Changing some common characters from Punctuation to Symbol
 960 http://www.unicode.org/review/pri229/  Linebreaking Changes for Pictographic Symbols
 961 http://www.unicode.org/reports/tr46/tr46-8.html  IDNA
 962 http://unicode.org/Public/idna/6.2.0/
 963
 964 *** ICU Trac
 965
 966 - ticket 9515: Unicode 6.2: final ICU update
 967
 968 - ticket 9514: UCA 6.2: fix UCARules.txt
 969
 970 - ticket 9437: update ICU to Unicode 6.2
 971 - C++ branches/markus/uni62 at r32050 from trunk at r32041
 972 - Java branches/markus/uni62 at r32068 from trunk at r32066
 973
 974 *** Unicode version numbers
 975 - makedata.mak
 976 - uchar.h
 977   (configure.in & configure: have been modified to extract the version from uchar.h)
 978 - com.ibm.icu.util.VersionInfo
 979 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
 980
 981 *** data files & enums & parser code
 982
 983 * file preparation
 984
 985 - download UCD, UCA & IDNA files
 986 - make sure that the Unicode data folder passed into preparseucd.py
 987   includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
 988 - modify preparseucd.py: NamesList.txt is now in UTF-8
 989 - ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src
 990 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
 991 - Check test file diffs for previously commented-out, known-failing data lines;
 992   probably need to keep those commented out.
 993
 994 * PropertyValueAliases.txt changes
 995 - 1 new Line_Break (lb) value:
 996   lb ; RI                               ; Regional_Indicator
 997   -> uchar.h & UCharacter.LineBreak
 998 - 1 new Word_Break (WB) value:
 999   WB ; RI                               ; Regional_Indicator
1000   -> uchar.h & UCharacter.WordBreak
1001 - 1 new Grapheme_Cluster_Break (GCB) value:
1002   GCB; RI                               ; Regional_Indicator
1003   -> uchar.h & UCharacter.GraphemeClusterBreak
1004
1005 * 3 new numeric values
1006   The new value -1 was really supposed to be NaN, but that would have required
1007   new UnicodeData.txt syntax. The value -1 can already be represented as a
1008   "fraction" of -1/1, but encodeNumericValue() in corepropsbuilder.cpp had to be fixed.
1009     cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1
1010     cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1
1011   The two new values 216000 and 432000 require an addition to the encoding of numeric values.
1012     cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000
1013     cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000
1014   -> uprops.h, uchar.c & UCharacterProperty.java
1015   -> cucdtst.c & UCharacterTest.java
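
Arithmetic behind the three values: -1 is the fraction -1/1, and the two large values are round numbers in base 60 (216000 = 60^3, 432000 = 2 * 60^3). A minimal sketch that checks this. The mantissa/exponent split is only an illustration and is not meant as a description of the actual encoding in uprops.h; the helper name is made up.

```python
from fractions import Fraction


def as_base60(value):
    """Split an integer into (mantissa, exponent) with value == mantissa * 60**exponent."""
    exponent = 0
    while value != 0 and value % 60 == 0:
        value //= 60
        exponent += 1
    return value, exponent


print(Fraction(-1, 1) == -1)
for nv in (216000, 432000):
    mantissa, exponent = as_base60(nv)
    print(nv, "=", mantissa, "* 60 **", exponent)
```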
1016
1017 * generate normalization data files
1018 - ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib
1019 - ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in
1020 - ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata
1021 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
1022 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
1023 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
1024 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
1025
1026 * build ICU (make install)
1027   so that the tools build can pick up the new definitions from the installed header files.
1028 * build Unicode tools using CMake+make
1029
1030 * generate core properties data files
1031 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src
1032 - in initial bootstrapping, change the UCA version
1033   in source/data/unidata/FractionalUCA.txt to match the new Unicode version
1034 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src
1035 - rebuild ICU (make install) & tools
1036   + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
1037     check if the UCA version in FractionalUCA.txt matches the new Unicode version
1038     (see step above)
1039 - run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
1040 - rebuild ICU (make install) & tools
1041
1042 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1043   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1044 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1045 - Unicode 6.0..6.2: U+2260, U+226E, U+226F
1046 - nothing new in 6.2, no test file to update
1047
1048 * update Java data files
1049 - refresh just the UCD-related files, just to be safe
1050 - see (ICU4C)/source/data/icu4j-readme.txt
1051 - mkdir /tmp/icu4j
1052 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1053   output:
1054     ...
1055     Unicode .icu files built to ./out/build/icudt50l
1056     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b
1057     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b
1058     echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
1059     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b
1060     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b"
1061     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/
1062     mkdir -p /tmp/icu4j/main/shared/data
1063     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1064     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/
1065     mkdir -p /tmp/icu4j/main/shared/data
1066     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1067     make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data'
1068 - copy the big-endian Unicode data files to another location,
1069   separate from the other data files
1070     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
1071     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
1072     ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
1073     ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu
1074     ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
1075     ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
1076     ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
1077 - refresh ICU4J
1078     ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
1079
1080 * refresh Java test .txt files
1081 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1082
1083 * UCA
1084
1085 - get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
1086 - CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
1087 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1088 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1089   (note removing the underscore before "Rules")
1090 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
1091   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1092   with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
1093 - check test file diffs for previously commented-out, known-failing data lines;
1094   probably need to keep those commented out
1095 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
1096 - run genuca, see command line above
1097 - rebuild ICU4C
1098 - refresh ICU4J collation data:
1099   (a subset of the instructions above for the properties data refresh, except that it copies all of coll/*)
1100     ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1101     ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
1102     ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
1103     ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
1104 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
1105 - note on intltest: if collate/UCAConformanceTest fails, then
1106   utility/MultithreadTest/TestCollators will fail as well;
1107   fix the conformance test before looking into the multi-thread test
1108
1109 * test ICU, fix test code where necessary
1110
1111 * When refreshing all of ICU4J data from ICU4C
1112 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1113 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
1114 or
1115 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
1116
1117 *** LayoutEngine script information
1118 - skipped for Unicode 6.2: no new scripts
1119
1120 *** merge the Unicode update branches back onto the trunk
1121 - do not merge the icudata.jar and testdata.jar,
1122   instead rebuild them from merged & tested ICU4C
1123
1124 ---------------------------------------------------------------------------- ***
1125
1126 Future Unicode update
1127
1128 Tools simplified since the Unicode 6.1 update. See
1129 - http://site.icu-project.org/design/props/ppucd
1130 - http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972
1131
1132 * Unicode version numbers
1133 - icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates
1134
1135 * file preparation
1136 - ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:
1137 - ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src
1138 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
1139 - Check test file diffs for previously commented-out, known-failing data lines;
1140   probably need to keep those commented out.
1141
1142 * PropertyValueAliases.txt changes
1143 - Script codes that are in ISO 15924 but not in Unicode are now listed in
1144   preparseucd.py, in the _scripts_only_in_iso15924 variable.
1145   If there are new ISO codes, then add them.
1146   If Unicode adds some of them, then remove them from the .py variable.
1147
1148 * UnicodeData.txt changes
1149 - No more manual changes for CJK ranges for algorithmic names;
1150   those are now written to ppucd.txt and genprops reads them from there.
1151
1152 * generate core properties data files (makeprops.sh was deleted)
1153 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src
1154
1155 * no more manual updates of source/data/unidata/norm2/nfkc_cf.txt
1156 - it is now generated by preparseucd.py
1157
1158 * no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt
1159 - it is now generated by preparseucd.py
1160 - make sure that the Unicode data folder passed into preparseucd.py
1161   includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
1162   (can be in some subfolder)
1163
1164 * generate normalization data files
1165 - ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib
1166 - ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in
1167 - ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata
1168 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
1169 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
1170 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
1171 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
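
The four gennorm2 calls above differ only in the output name and the list of input files. A minimal Python sketch that builds the same command lines from a table; the function name gennorm2_commands and the table NORM2_INPUTS are made up for illustration, and nothing is executed, the commands are only printed.

```python
# Output .nrm file -> input .txt files, copied from the gennorm2 commands above.
NORM2_INPUTS = {
    "nfc.nrm": ["nfc.txt"],
    "nfkc.nrm": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf.nrm": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46.nrm": ["nfc.txt", "uts46.txt"],
}


def gennorm2_commands(src_data_in, unidata, gennorm2="bin/gennorm2"):
    """Return one argv list per .nrm file (hypothetical helper, runs nothing)."""
    commands = []
    for out_name, inputs in NORM2_INPUTS.items():
        argv = [gennorm2,
                "-o", src_data_in + "/" + out_name,
                "-s", unidata + "/norm2"]
        commands.append(argv + inputs)
    return commands


if __name__ == "__main__":
    # Paths as in the notes above; adjust to the local checkout.
    root = "~/svn.icu/trunk/src/source/data"
    for argv in gennorm2_commands(root + "/in", root + "/unidata"):
        print(" ".join(argv))
```

Each list could be passed to subprocess.run() from the build folder once LD_LIBRARY_PATH is set as shown above.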
1172
1173 * build ICU (make install)
1174 * build Unicode tools using CMake+make
1175
1176 * new way to call genuca (makeuca.sh was deleted)
1177 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src
1178
1179 ---------------------------------------------------------------------------- ***
1180
1181 Unicode 6.1 update
1182
1183 *** ICU Trac
1184
1185 - ticket 8995 final update to Unicode 6.1
1186 - ticket 8994 regenerate source/layout/CanonData.cpp
1187
1188 - ticket 8961 support Unicode "Age" value *names*
1189 - ticket 8963 support multiple character name aliases & types
1190
1191 - ticket 8827 "update ICU to Unicode 6.1"
1192 - C++ branches/markus/uni61 at r30864 from trunk at r30843
1193 - Java branches/markus/uni61 at r30865 from trunk at r30863
1194
1195 *** Unicode version numbers
1196 - makedata.mak
1197 - uchar.h
1198   (configure.in & configure: have been modified to extract the version from uchar.h)
1199 - com.ibm.icu.util.VersionInfo
1200 - icutools/unicode/makedefs.sh
1201   + also review & update other definitions in that file,
1202     e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l
1203
1204 *** data files & enums & parser code
1205
1206 * file preparation
1207
1208 ~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed
1209 - This prepares both unidata and testdata files in respective output subfolders.
1210 - Check test file diffs for previously commented-out, known-failing data lines;
1211   probably need to keep those commented out.
1212
1213 * PropertyValueAliases.txt changes
1214 - 11 new block names:
1215   Arabic_Extended_A
1216   Arabic_Mathematical_Alphabetic_Symbols
1217   Chakma
1218   Meetei_Mayek_Extensions
1219   Meroitic_Cursive
1220   Meroitic_Hieroglyphs
1221   Miao
1222   Sharada
1223   Sora_Sompeng
1224   Sundanese_Supplement
1225   Takri
1226   -> add to uchar.h
1227   -> add to UCharacter.UnicodeBlock IDs
1228     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
1229             replace  public static final int \1_ID = \2; \3
1230   -> add to UCharacter.UnicodeBlock objects
1231     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
1232             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1233 - 1 new Joining_Group (jg) value:
1234   Rohingya_Yeh
1235   -> uchar.h & UCharacter.JoiningGroup
1236 - 2 new Line_Break (lb) values:
1237   CJ=Conditional_Japanese_Starter
1238   HL=Hebrew_Letter
1239   -> uchar.h & UCharacter.LineBreak
1240 - 7 new scripts:
1241   sc ; Cakm      ; Chakma
1242   sc ; Merc      ; Meroitic_Cursive
1243   sc ; Mero      ; Meroitic_Hieroglyphs
1244   sc ; Plrd      ; Miao
1245   sc ; Shrd      ; Sharada
1246   sc ; Sora      ; Sora_Sompeng
1247   sc ; Takr      ; Takri
1248   -> remove these from SyntheticPropertyValueAliases.txt
1249   -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
1250       and in com.ibm.icu.dev.test.lang.TestUScript.java
1251 - 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
1252   (added 2011-06-21)
1253   Khoj        322     Khojki
1254   Tirh        326     Tirhuta
1255     and another one added 2011-12-09
1256   Hluw        080     Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)
1257   -> uscript.h
1258   -> com.ibm.icu.lang.UScript
1259     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
1260     replace  public static final int \1 = \2;\3
1261   -> SyntheticPropertyValueAliases.txt
1262   -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
1263       and in com.ibm.icu.dev.test.lang.TestUScript.java
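
The find/replace patterns above (UBLOCK_ and USCRIPT_ lines to Java constants) can also be applied outside of Eclipse. A minimal sketch using Python's re.sub, which accepts the same \1 group references; the three function names are made up, and the sample input lines in the comments are illustrative rather than copied from uchar.h or uscript.h.

```python
import re


def block_id_line(c_line):
    """UBLOCK_X = n, /*...*/  ->  public static final int X_ID = n; /*...*/"""
    return re.sub(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)",
                  r"public static final int \1_ID = \2; \3",
                  c_line)


def block_object_line(c_line):
    """UBLOCK_X = n, /*...*/  ->  UnicodeBlock object declaration for X"""
    return re.sub(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)",
                  r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
                  c_line)


def script_line(c_line):
    """USCRIPT_X = n,/*...*/  ->  public static final int X = n;/*...*/"""
    return re.sub(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)",
                  r"public static final int \1 = \2;\3",
                  c_line)


if __name__ == "__main__":
    # Illustrative input lines (values are placeholders).
    print(block_id_line("    UBLOCK_TAKRI = 220, /*[11680]*/"))
    print(block_object_line("    UBLOCK_TAKRI = 220, /*[11680]*/"))
    print(script_line("      USCRIPT_KHOJKI = 157,/* Khoj */"))
```

Lines that do not match a pattern are returned unchanged, so the functions can be mapped over a whole pasted enum; the result still needs manual review before it goes into UCharacter.java or UScript.java.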
1264
1265 * UnicodeData.txt changes
1266 - the last Unihan code point changes from U+9FCB to U+9FCC
1267   search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive)
1268   + do change gennames.c
1269   + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java
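
The search for 9FCB/9FCC described above can be sketched in Python as follows. The helper name find_cjk_boundaries is made up; the regex is the one given in the note (9FC[BC], case-insensitive). It matches the hex digits anywhere in a line, so hits inside longer numbers are possible and each match still needs a manual look.

```python
import re

# Regex from the note above: old end (9FCB) and old limit (9FCC), any case.
CJK_END_OR_LIMIT = re.compile(r"9FC[BC]", re.IGNORECASE)


def find_cjk_boundaries(lines):
    """Return (line_number, text) pairs for lines that mention 9FCB or 9FCC."""
    hits = []
    for number, text in enumerate(lines, start=1):
        if CJK_END_OR_LIMIT.search(text):
            hits.append((number, text.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Illustrative lines, not taken from gennames.c or ucol.cpp.
    sample = ["#define CJK_END 0x9fcb\n", "int x = 0;\n", "limit = 0x9FCC;\n"]
    for number, text in find_cjk_boundaries(sample):
        print(number, text)
```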
1270
1271 * DerivedBidiClass.txt changes
1272 - 2 new default-AL blocks:
1273 #     Arabic Extended-A: U+08A0  -  U+08FF  (was default-R)
1274 #     Arabic Mathematical Alphabetic Symbols:
1275 #                       U+1EE00  - U+1EEFF  (was default-R)
1276 - 2 new default-R blocks:
1277 #     Meroitic Hieroglyphs:
1278 #                        U+10980 - U+1099F
1279 #     Meroitic Cursive:  U+109A0 - U+109FF
1280   -> should be picked up by the explicit data in the file
1281
1282 * NameAliases.txt changes
1283 - from
1284     # Each line has two fields
1285     # First field: Code point
1286     # Second field: Alias
1287 - to
1288     # Each line has three fields, as described here:
1289     #
1290     # First field:  Code point
1291     # Second field: Alias
1292     # Third field:  Type
1293 - Also, the file format always allowed multiple aliases per code point, but only now does the file
1294   actually list more than one, even several of the same type. For example,
1295     FEFF;BYTE ORDER MARK;alternate
1296     FEFF;BOM;abbreviation
1297     FEFF;ZWNBSP;abbreviation
1298 - This breaks our gennames parser, unames.icu data structure, and API.
1299   Fix gennames to only pick up "correction" aliases.
1300   New ticket #8963 for further changes.
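
A minimal Python sketch of the three-field format and of the "only pick up correction aliases" filter described above. This is not the gennames code (gennames is written in C); the function names are made up, and the sample combines the FEFF lines quoted above with one "correction" line added for illustration.

```python
SAMPLE = [
    "# Each line has three fields",
    "01A2;LATIN CAPITAL LETTER GHA;correction",
    "FEFF;BYTE ORDER MARK;alternate",
    "FEFF;BOM;abbreviation",
    "FEFF;ZWNBSP;abbreviation",
]


def parse_name_aliases(lines):
    """Return (code_point, alias, alias_type) tuples; comments and blank lines are skipped."""
    entries = []
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) != 3:
            raise ValueError("expected 3 fields: %r" % line)
        entries.append((int(fields[0], 16), fields[1], fields[2]))
    return entries


def correction_aliases(lines):
    """Map code point -> first alias of type "correction"; other types are ignored."""
    result = {}
    for code_point, alias, alias_type in parse_name_aliases(lines):
        if alias_type == "correction":
            result.setdefault(code_point, alias)
    return result


if __name__ == "__main__":
    for code_point, alias in sorted(correction_aliases(SAMPLE).items()):
        print("%04X %s" % (code_point, alias))
```

Keeping one alias per code point mirrors the limitation noted above; supporting several aliases and types is the subject of ticket #8963.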
1301
1302 * run genpname/preparse.pl (on Linux)
1303   + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
1304   + make sure that data.h is writable
1305   + perl preparse.pl ~/svn.icu/trunk/src > out.txt
1306   + preparse.pl shows no errors, out.txt Info and Warning lines look ok
1307
1308 * build ICU (make install)
1309   so that the tools build can pick up the new definitions from the installed header files.
1310 * build Unicode tools (at least genpname) using CMake+make
1311
1312 * run genpname
1313   (builds both pnames.icu and propname_data.h)
1314 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
1315 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
1316
1317 * build ICU (make install)
1318 * build Unicode tools using CMake+make
1319
1320 * update source/data/unidata/norm2/nfkc_cf.txt
1321 - follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
1322
1323 * update source/data/unidata/norm2/uts46.txt
1324 - download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
1325   to ~/svn.icu/tools/trunk/src/unicode/py
1326 - adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".
1327 - ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
1328 - ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
1329
1330 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1331   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1332 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1333 - Unicode 6.0..6.1: U+2260, U+226E, U+226F
1334 - nothing new in 6.1, no test file to update
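
The grep step above can be sketched in Python so that ASCII entries are filtered out automatically. The helper name non_ascii_std3_valid is made up, and the sample lines only imitate the layout of IdnaMappingTable.txt (code point or range, status, optional mapping, comment); check against the real file before relying on it.

```python
def non_ascii_std3_valid(lines):
    """Return (start, end) ranges above U+007F whose status is disallowed_STD3_valid."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0]
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start_text, _, end_text = fields[0].partition("..")
        start = int(start_text, 16)
        end = int(end_text, 16) if end_text else start
        if end > 0x7F:
            # Clip ranges that begin in ASCII so only the non-ASCII part is reported.
            found.append((max(start, 0x80), end))
    return found


if __name__ == "__main__":
    # Illustrative lines in the assumed file layout.
    sample = [
        "0000..002C    ; disallowed_STD3_valid   # controls and punctuation",
        "00C0          ; mapped ; 00E0           # LATIN CAPITAL LETTER A WITH GRAVE",
        "2260          ; disallowed_STD3_valid   # NOT EQUAL TO",
        "226E..226F    ; disallowed_STD3_valid   # NOT LESS-THAN..NOT GREATER-THAN",
    ]
    for start, end in non_ascii_std3_valid(sample):
        print("U+%04X..U+%04X" % (start, end))
```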
1335
1336 * generate core properties data files
1337 - in initial bootstrapping, change the UCA version
1338   in source/data/unidata/FractionalUCA.txt to match the new Unicode version
1339 - ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
1340 - rebuild ICU & tools
1341   + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
1342     check if the UCA version in FractionalUCA.txt matches the new Unicode version
1343     (see step above)
1344 - run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:
1345   ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
1346 - rebuild ICU & tools
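
The UCA version check mentioned above can be done before running the build. A minimal Python sketch, assuming that FractionalUCA.txt carries its version on a line of the form "[UCA version = 6.1.0]"; that line format, the function names, and the choice to compare only major.minor are assumptions of this sketch, not taken from genuca or genrb.

```python
import re

# Assumed format of the version line in FractionalUCA.txt.
UCA_VERSION_LINE = re.compile(r"^\[UCA version\s*=\s*([0-9.]+)\s*\]")


def uca_version(lines):
    """Return the version string from the first matching line, or None."""
    for line in lines:
        match = UCA_VERSION_LINE.match(line.strip())
        if match:
            return match.group(1)
    return None


def versions_match(uca, unicode_version):
    """Compare major.minor only (a choice made for this sketch)."""
    return uca.split(".")[:2] == unicode_version.split(".")[:2]


if __name__ == "__main__":
    sample = ["# Fractional UCA Table", "[UCA version = 6.1.0]"]
    found = uca_version(sample)
    print(found, versions_match(found, "6.1"))
```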
1347
1348 * update Java data files
1349 - refresh just the UCD-related files, just to be safe
1350 - see (ICU4C)/source/data/icu4j-readme.txt
1351 - mkdir /tmp/icu4j
1352 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1353   output:
1354     ...
1355     Unicode .icu files built to ./out/build/icudt49l
1356     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b
1357     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b
1358     echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
1359     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b
1360     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b"
1361     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/
1362     mkdir -p /tmp/icu4j/main/shared/data
1363     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1364     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/
1365     mkdir -p /tmp/icu4j/main/shared/data
1366     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
1367     make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data'
1368 - copy the big-endian Unicode data files to another location,
1369   separate from the other data files
1370     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
1371     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
1372     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
1373     ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu
1374     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
1375     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
1376     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
1377 - refresh ICU4J
1378     ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
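
The copy steps above (all *.icu except cnvalias.icu, all *.nrm, coll/*.icu and brkitr/*) can be sketched as one Python function. The name copy_unicode_data is made up; the file selection follows the cp/rm commands above, except that cnvalias.icu is skipped rather than copied and then removed. The "jar uf" step is not covered.

```python
import glob
import os
import shutil


def copy_unicode_data(src_dir, dst_dir):
    """Copy the Unicode-related data files from src_dir to dst_dir.

    Both arguments are the .../com/ibm/icu/impl/data/icudtNNb folders.
    Returns the relative paths that were copied.
    """
    patterns = ["*.icu", "*.nrm",
                os.path.join("coll", "*.icu"),
                os.path.join("brkitr", "*")]
    copied = []
    for pattern in patterns:
        for path in sorted(glob.glob(os.path.join(src_dir, pattern))):
            rel = os.path.relpath(path, src_dir)
            if rel == "cnvalias.icu" or not os.path.isfile(path):
                continue
            target = os.path.join(dst_dir, rel)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            shutil.copy2(path, target)
            copied.append(rel)
    return copied
```

For the collation-only refresh the pattern list would shrink to coll/* alone.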
1379
1380 * refresh Java test .txt files
1381 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1382
1383 * test ICU so far, fix test code where necessary
1384 - temporarily ignore collation issues that look like UCA/UCD mismatches,
1385   until UCA data is updated
1386
1387 * UCA
1388
1389 - get output from Mark's tools; look in
1390     http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. version>.txt
1391 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1392 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1393   (note: remove the underscore before "Rules" in the file name)
1394 - update (ICU)/source/test/testdata/CollationTest_*.txt
1395   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1396   with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
1397 - check test file diffs for previously commented-out, known-failing data lines;
1398   probably need to keep those commented out
1399 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
1400 - run makeuca.sh:
1401   ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
1402 - rebuild ICU4C
1403 - refresh ICU4J collation data:
1404   (subset of instructions above for properties data refresh, except copies all coll/*)
1405     ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1406     ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
1407     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
1408     ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
1409 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
1410 - note on intltest: if collate/UCAConformanceTest fails, then
1411   utility/MultithreadTest/TestCollators will fail as well;
1412   fix the conformance test before looking into the multi-thread test
1413
1414 * When refreshing all of ICU4J data from ICU4C
1415 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1416 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
1417 or
1418 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
1419
1420 *** LayoutEngine script information
1421
1422 (For details see the Unicode 5.2 change log below.)
1423
1424 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
1425   This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
1426   in the working directory.
1427   (It also generates ScriptRunData.cpp, which is no longer needed.)
1428
1429   The generated files have a current copyright date and "@draft" statement.
1430
1431 - diff current <icu>/source/layout files vs. generated ones
1432     ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
1433   review and manually merge desired changes;
1434   fix gratuitous changes, incorrect @draft and missing aliases;
1435   Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
1436 - if you just copy the above files, then
1437   fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
1438   manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
1439
1440 *** merge the Unicode update branches back onto the trunk
1441 - do not merge the icudata.jar and testdata.jar,
1442   instead rebuild them from merged & tested ICU4C
1443
1444 ---------------------------------------------------------------------------- ***
1445
1446 ICU 4.8 (no Unicode update, just new script codes)
1447
1448 * 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
1449   (added 2010-12-21)
1450     Afak    439     Afaka
1451     Jurc    510     Jurchen
1452     Mroo    199     Mro, Mru
1453     Nshu    499     Nüshu
1454     Shrd    319     Sharada, Śāradā
1455     Sora    398     Sora Sompeng
1456     Takr    321     Takri, Ṭākrī, Ṭāṅkrī
1457     Tang    520     Tangut
1458     Wole    480     Woleai
1459   -> uscript.h
1460   -> com.ibm.icu.lang.UScript
1461     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
1462     replace  public static final int \1 = \2;\3
1463   -> genpname/SyntheticPropertyValueAliases.txt
1464   -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
1465       and in com.ibm.icu.dev.test.lang.TestUScript.java
1466
1467 * run genpname/preparse.pl (on Linux)
1468   + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
1469   + make sure that data.h is writable
1470   + perl preparse.pl ~/svn.icu/trunk/src > out.txt
1471   + preparse.pl shows no errors, out.txt Info and Warning lines look ok
1472
1473 * rebuild Unicode tools (at least genpname) using make
1474 - You might first need to "make install" ICU so that the tools build can pick
1475   up the new definitions from the installed header files.
1476
1477 * run genpname
1478   (builds both pnames.icu and propname_data.h)
1479 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
1480 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
1481 - rebuild ICU & tools
1482
1483 * run genprops
1484 - ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
1485 - ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
1486 - rebuild ICU & tools
1487
1488 * update Java data files
1489 - refresh just the UCD-related files, just to be safe
1490 - see (ICU4C)/source/data/icu4j-readme.txt
1491 - mkdir /tmp/icu4j
1492 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1493 - copy the big-endian Unicode data files to another location,
1494   separate from the other data files
1495     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
1496     ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
1497     ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
1498 - refresh ICU4J
1499     ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b
1500
1501 * should have updated the layout engine script codes but forgot
1502
1503 ---------------------------------------------------------------------------- ***
1504
1505 Unicode 6.0 update
1506
1507 *** related ICU Trac tickets
1508
1509 7264 Unicode 6.0 Update
1510
1511 *** Unicode version numbers
1512 - makedata.mak
1513 - uchar.h
1514   (configure.in & configure: have been modified to extract the version from uchar.h)
1515 - com.ibm.icu.util.VersionInfo
1516
1517 *** data files & enums & parser code
1518
1519 * file preparation
1520
1521 ~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
1522 - This now prepares both unidata and testdata files in respective output subfolders.
1523
1524 * PropertyAliases.txt changes
1525 - new Script_Extensions property defined in the new ScriptExtensions.txt file
1526   but not listed in PropertyAliases.txt; reported to unicode.org;
1527   -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
1528     scx; Script_Extensions
1529   -> uchar.h with new UProperty section
1530   -> com.ibm.icu.lang.UProperty, parallel with uchar.h
1531
1532 * PropertyValueAliases.txt changes
1533 - 12 new block names:
1534   Alchemical_Symbols
1535   Bamum_Supplement
1536   Batak
1537   Brahmi
1538   CJK_Unified_Ideographs_Extension_D
1539   Emoticons
1540   Ethiopic_Extended_A
1541   Kana_Supplement
1542   Mandaic
1543   Miscellaneous_Symbols_And_Pictographs
1544   Playing_Cards
1545   Transport_And_Map_Symbols
1546   -> add to uchar.h
1547   -> add to UCharacter.UnicodeBlock
1548     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
1549             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1550 - Joining_Group (jg) values:
1551   Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias
1552   -> uchar.h & UCharacter.JoiningGroup
1553 - 3 new scripts:
1554   sc ; Batk      ; Batak
1555   sc ; Brah      ; Brahmi
1556   sc ; Mand      ; Mandaic
1557   -> remove these from SyntheticPropertyValueAliases.txt
1558   -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
1559   -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
1560       and in com.ibm.icu.dev.test.lang.TestUScript.java
1561 - 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
1562   (added 2009-11-11..2010-07-18)
1563   Bass        259     Bassa Vah
1564   Dupl        755     Duployan shorthand
1565   Elba        226     Elbasan
1566   Gran        343     Grantha
1567   Kpel        436     Kpelle
1568   Loma        437     Loma
1569   Mend        438     Mende
1570   Merc        101     Meroitic Cursive
1571   Narb        106     Old North Arabian
1572   Nbat        159     Nabataean
1573   Palm        126     Palmyrene
1574   Sind        318     Sindhi
1575   Wara        262     Warang Citi
1576   -> uscript.h
1577   -> com.ibm.icu.lang.UScript
1578     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
1579     replace  public static final int \1 = \2;\3
1580   -> SyntheticPropertyValueAliases.txt
1581   -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
1582       and in com.ibm.icu.dev.test.lang.TestUScript.java
1583 - ISO 15924 name change
1584   Mero        100     Meroitic Hieroglyphs (was Meroitic)
1585   -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
1586 - property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt
1587
1588 * UnicodeData.txt changes
1589 - new CJK block:
1590   2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
1591   2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
1592   -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion
1593
1594 * build Unicode tools using CMake+make
1595
1596 * run genpname/preparse.pl (on Linux)
1597   + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
1598   + make sure that data.h is writable
1599   + perl preparse.pl ~/svn.icu/trunk/src > out.txt
1600   + preparse.pl shows no errors, out.txt Info and Warning lines look ok
1601
1602 * rebuild Unicode tools (at least genpname) using make
1603 - You might first need to "make install" ICU so that the tools build can pick
1604   up the new definitions from the installed header files.
1605
1606 * run genpname
1607 - ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
1608 - rebuild ICU & tools
1609
1610 * update source/data/unidata/norm2/nfkc_cf.txt
1611 - follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
1612
1613 * update source/data/unidata/norm2/uts46.txt
1614 - download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
1615   to ~/svn.icu/tools/trunk/src/unicode/py
1616 - adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
1617 - ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
1618 - ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
1619
1620 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
1621   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
1622 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
1623 - Unicode 6.0: U+2260, U+226E, U+226F
1624
1625 * generate core properties data files
1626 - ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
1627 - rebuild ICU & tools
1628 - run makeuca.sh so that genuca picks up the new nfc.nrm:
1629   ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
1630 - rebuild ICU & tools
1631
1632 * implement new Script_Extensions property (provisional)
1633 - parser & generator: genprops & uprops.icu
1634 - uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
1635 - UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java
1636
1637 * switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
1638 - (one-time change)
1639 - genbidi/gencase/genprops tools changes
1640 - re-run makeprops.sh (see above)
1641 - UCharacterProperty.java, UCharacterTypeIterator.java,
1642   UBiDiProps.java, UCaseProps.java, and several others with minor changes;
1643   UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java
1644
1645 * update Java data files
1646 - refresh just the UCD-related files, just to be safe
1647 - see (ICU4C)/source/data/icu4j-readme.txt
1648 - mkdir /tmp/icu4j
1649 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1650   output:
1651     ...
1652     Unicode .icu files built to ./out/build/icudt45l
1653     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
1654     echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
1655     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
1656     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
1657     mkdir -p /tmp/icu4j/main/shared/data
1658     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
1659 - copy the big-endian Unicode data files to another location,
1660   separate from the other data files
1661     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
1662     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
1663     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
1664     ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
1665     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
1666     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
1667     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
1668 - refresh ICU4J
1669     ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
1670
1671 * refresh Java test .txt files
1672 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
1673
1674 * un-hardcode normalization skippable (NF*_Inert) test data
1675 - removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools
1676
1677 * copy updated break iterator test files
1678 - now handled by early ucdcopy.py and
1679   copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
1680   (old instructions:
1681    copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
1682    to ~/svn.icu/trunk/src/source/test/testdata)
1683 - they are not used in ICU4J
1684
1685 * UCA
1686
1687 - get output from Mark's tools; look in
1688     http://www.unicode.org/~book/incoming/mark/uca6.0.0/
1689     http://www.macchiato.com/unicode/utc/additional-uca-files
1690     http://www.unicode.org/Public/UCA/6.0.0/
1691     http://www.unicode.org/~mdavis/uca/
1692 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
1693 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
1694 - update Han-implicit ranges for new CJK extensions:
1695   swapCJK() in ucol.cpp & ImplicitCEGenerator.java
1696 - genuca: allow bytes 02 for U+FFFE, new merge-sort character;
1697   do not add it into invuca so that tailoring primary-after an ignorable works
1698 - genuca: permit space between [variable top] bytes
1699 - ucol.cpp: treat noncharacters like unassigned rather than ignorable
1700 - run makeuca.sh:
1701   ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
1702 - rebuild ICU4C
1703 - refresh ICU4J collation data:
1704   (subset of instructions above for properties data refresh, except copies all coll/*)
1705     ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1706     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
1707     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
1708     ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
1709 - update (ICU)/source/test/testdata/CollationTest_*.txt
1710   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
1711   with output from Mark's Unicode tools
1712 - run all tests with the *_SHORT.txt or the full files (the full ones have comments)
1713 - note on intltest: if collate/UCAConformanceTest fails, then
1714   utility/MultithreadTest/TestCollators will fail as well;
1715   fix the conformance test before looking into the multi-thread test
1716
1717 * When refreshing all of ICU4J data from ICU4C
1718 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
1719 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
1720 or
1721 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
1722
1723 *** LayoutEngine script information
1724
1725 (For details see the Unicode 5.2 change log below.)
1726
1727 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
1728 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
1729 ScriptRunData.cpp, which is no longer needed.)
1730
1731 The generated files have a current copyright date and "@draft" statement.
1732
1733 * copy the above files into <icu>/source/layout, replacing the old files.
1734 * fix mixed line endings
1735 * review the diffs and fix incorrect @draft and missing aliases;
1736   Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
1737 * manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
1738
1739 ---------------------------------------------------------------------------- ***
1740
1741 Unicode 5.2 update
1742
1743 *** related ICU Trac tickets
1744
1745 7084 Unicode 5.2
1746
1747 7167 verify collation bytes
1748 7235 Java test NAME_ALIAS
1749 7236 Java DerivedCoreProperties.txt test
1750 7237 Java BidiTest.txt
1751 7238 UTrie2 in core unidata
1752 7239 test for tailoring gaps
1753 7240 Java fix CollationMiscTest
1754 7243 update layout engine for Unicode 5.2
1755
1756 *** Unicode version numbers
1757 - makedata.mak
1758 - uchar.h
1759 - configure.in & configure
1760 - update ucdVersion in gennames.c if an algorithmic range changes
1761
1762 *** data files & enums & parser code
1763
1764 * file preparation
1765
1766 python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
1767 - includes finding files regardless of version numbers,
1768   copying them, and performing the equivalent processing of the
1769   ucdstrip and ucdmerge tools on the desired set of files
1770
1771 * notes on changes
1772 - PropertyAliases.txt
1773   moved from numeric to enumerated:
1774     ccc       ; Canonical_Combining_Class
1775   new string properties:
1776     NFKC_CF   ; NFKC_Casefold
1777     Name_Alias; Name_Alias
1778   new binary properties:
1779     Cased     ; Cased
1780     CI        ; Case_Ignorable
1781     CWCF      ; Changes_When_Casefolded
1782     CWCM      ; Changes_When_Casemapped
1783     CWKCF     ; Changes_When_NFKC_Casefolded
1784     CWL       ; Changes_When_Lowercased
1785     CWT       ; Changes_When_Titlecased
1786     CWU       ; Changes_When_Uppercased
1787   new CJK Unihan properties (not supported by ICU)
1788 - PropertyValueAliases.txt
1789   new block names
1790   new scripts
1791   one script code change:
1792     sc ; Qaai      ; Inherited
1793     ->
1794     sc ; Zinh      ; Inherited                        ; Qaai
1795   new Line_Break (lb) value:
1796     lb ; CP        ; Close_Parenthesis
1797   new Joining_Group (jg) values: Farsi_Yeh, Nya
1798   other new values:
1799     ccc; 214; ATA  ; Attached_Above
1800 - DerivedBidiClass.txt
1801   new default-R range: U+1E800 - U+1EFFF
1802 - UnicodeData.txt
1803   all of the ISO comments are gone
1804   new CJK block end:
1805     9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
1806   new CJK block:
1807     2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
1808     2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;
1809
1810 * genpname
1811 - run preparse.pl
1812   + cd \svn\icuproj\icu\trunk\source\tools\genpname
1813   + make sure that data.h is writable
1814   + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
1815   + preparse.pl complains with errors like the following:
1816       Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
1817     This is because ICU 4.0 had scripts from ISO 15924 which are now
1818     added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
1819     and PropertyValueAliases.txt.
1820     -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
1821        Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
1822   + preparse.pl complains with errors about block names missing from uchar.h; add them
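
The script-alias conflict described above could be spotted before running preparse.pl.
The following is a minimal sketch, not part of the ICU tools: it assumes both files use
"sc ; Code ; Long_Name" lines as in the excerpts in this log, and the function names and
sample lines are made up for illustration.

```python
def script_codes(lines):
    """Collect script codes from 'sc ; Code ; Long_Name' style lines."""
    codes = set()
    for line in lines:
        data = line.split("#", 1)[0]              # drop trailing comments
        fields = [f.strip() for f in data.split(";")]
        if len(fields) >= 3 and fields[0] == "sc":
            codes.add(fields[1])
    return codes


def duplicate_scripts(synthetic_lines, ucd_lines):
    """Return the script codes that are defined in both files, sorted."""
    return sorted(script_codes(synthetic_lines) & script_codes(ucd_lines))


# Illustrative sample lines (assumed format, not copied from the real files).
synthetic = ["sc ; Egyp      ; Egyp", "sc ; Blis      ; Blis"]
ucd = ["sc ; Egyp      ; Egyptian_Hieroglyphs", "sc ; Latn      ; Latin"]
print(duplicate_scripts(synthetic, ucd))
```

Any code reported this way would be a candidate for removal from
SyntheticPropertyValueAliases.txt.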
1823
1824 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
1825 - new block & script values
1826   + 26 new blocks
1827     copy new blocks from Blocks.txt
1828     MS VC++ 2008 regular expression:
1829       find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
1830       replace with "    UBLOCK_\3 = 172, /*[\1]*/"
1831   + several new script values already added in ICU 4.0 for ISO 15924 coverage
1832     (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
1833   + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
1834   + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
1835     (added to SyntheticPropertyValueAliases.txt)
1836 - new Joining Group (JG) values: Farsi_Yeh, Nya
1837 - new Line_Break (lb) value:
1838     lb ; CP        ; Close_Parenthesis
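
The Visual C++ find/replace for new blocks shown above can be approximated in Python.
This is only a sketch: the upper-casing, the replacement of spaces and hyphens with
underscores, and the sequential numbering are assumptions added here (the editor regex
leaves those steps to manual editing), and 172 is simply the start value that appears
in the regex above.

```python
import re

# Same shape as the editor regex: "XXXX..YYYY; Block Name"
BLOCK_LINE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+); ([A-Z].+)$")


def block_enum_lines(blocks_txt_lines, first_value):
    """Turn Blocks.txt data lines into draft UBLOCK_ enum lines."""
    out = []
    value = first_value
    for line in blocks_txt_lines:
        m = BLOCK_LINE.match(line.strip())
        if not m:
            continue                              # comments, blank lines
        # Assumed naming rule: upper case, spaces/hyphens become underscores.
        name = re.sub(r"[ -]+", "_", m.group(3)).upper()
        out.append("    UBLOCK_%s = %d, /*[%s]*/" % (name, value, m.group(1)))
        value += 1
    return out


sample = ["# comment", "0800..083F; Samaritan", "A960..A97F; Hangul Jamo Extended-A"]
for enum_line in block_enum_lines(sample, 172):
    print(enum_line)
```

The output still needs review against uchar.h before it is pasted in.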
1839
1840 * hardcoded Unihan range end/limit
1841 - Unihan range end moves from 9FC3 to 9FCB
1842   search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
1843   + do change gennames.c
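
The search for the old range end/limit can be scripted. A minimal sketch, assuming the
caller supplies the lines of each source file; the function name is hypothetical and the
default pattern is the regex given above.

```python
import re


def find_hardcoded(lines, pattern=r"9FC[34]"):
    """Return (line number, text) pairs that mention the old end or limit."""
    rx = re.compile(pattern, re.IGNORECASE)       # case-insensitive, as noted
    return [(n, line) for n, line in enumerate(lines, 1) if rx.search(line)]


sample = ["0x4e00, 0x9fc3,", "unrelated line", "limit = 0x9FC4;"]
for number, text in find_hardcoded(sample):
    print(number, text)
```

Each hit still has to be judged by hand, since not every occurrence should be changed.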
1844
1845 * Compare definitions of new binary properties with what we used to use
1846   in algorithms, to see if the definitions changed.
1847 - Verified that definitions for Cased and Case_Ignorable are unchanged.
1848   The gencase tool now parses the newly public Case_Ignorable values
1849   in case the definition changes in the future.
1850
1851 * uchar.c & uprops.h & uprops.c & genprops
1852 - new numeric values that didn't exist in Unicode data before:
1853     1/7, 1/9, 1/10, 3/10, 1/16, 3/16
1854   the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
1855   therefore redesign the encoding of numeric types and values for formatVersion 6;
1856   design for simple numbers up to at least 144 ("one gross"),
1857   large values up to at least 10^20,
1858   and fractions with numerators -1..17 and denominators 1..16
1859   to cover current and expected future values
1860   (e.g., more Han numeric values, Meroitic twelfths)
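
To illustrate the design range for fractions, the sketch below checks the new values
against numerators -1..17 and denominators 1..16. The packing shown is a hypothetical
layout chosen for this example only; it is not claimed to be the actual uprops.icu
formatVersion 6 encoding.

```python
def fits_fraction_range(numerator, denominator):
    """True if the fraction lies in the design range described above."""
    return -1 <= numerator <= 17 and 1 <= denominator <= 16


def pack_fraction(numerator, denominator):
    """Hypothetical packing: 19 numerators x 16 denominators -> 0..303."""
    if not fits_fraction_range(numerator, denominator):
        raise ValueError("fraction out of range: %d/%d" % (numerator, denominator))
    return (numerator + 1) * 16 + (denominator - 1)


def unpack_fraction(code):
    """Inverse of pack_fraction()."""
    return code // 16 - 1, code % 16 + 1


# The new Unicode 5.2 values listed above, as (numerator, denominator).
NEW_FRACTIONS = [(1, 7), (1, 9), (1, 10), (3, 10), (1, 16), (3, 16)]
# Those with denominators >9 are the ones formatVersion 5 could not hold.
too_big_for_v5 = [f for f in NEW_FRACTIONS if f[1] > 9]
print(too_big_for_v5)
```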
1861
1862 * reimplement Hangul_Syllable_Type for new Jamo characters
1863 - the old code assumed that all Jamo characters are in the 11xx block
1864 - Unicode 5.2 fills holes there and adds new Jamo characters in
1865     A960..A97F; Hangul Jamo Extended-A
1866   and in
1867     D7B0..D7FF; Hangul Jamo Extended-B
1868 - Hangul_Syllable_Type can be trivially derived from a subset of
1869   Grapheme_Cluster_Break values
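
The derivation mentioned above can be pictured as a small lookup. This is a sketch using
the short property value aliases; the function name is made up and it is not the ICU
implementation.

```python
# Grapheme_Cluster_Break values that correspond to a Hangul_Syllable_Type.
GCB_TO_HST = {"L": "L", "V": "V", "T": "T", "LV": "LV", "LVT": "LVT"}


def hangul_syllable_type(gcb):
    """Map a GCB short alias to an hst short alias; NA means Not_Applicable."""
    return GCB_TO_HST.get(gcb, "NA")


print(hangul_syllable_type("LV"), hangul_syllable_type("EX"))
```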
1870
1871 * build Unicode data source code for hardcoding core data
1872 C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data
1873
1874 ICU data make path is \svn\icuproj\icu\trunk\source\data\
1875 ICU root path is \svn\icuproj\icu\trunk
1876 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
1877 Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
1878 Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
1879 Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
1880 Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
1881 Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
1882 Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
1883 Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
1884 Creating data file for Unicode Property Names
1885 Creating data file for Unicode Character Properties
1886 Creating data file for Unicode Case Mapping Properties
1887 Creating data file for Unicode BiDi/Shaping Properties
1888 Creating data file for Unicode Normalization
1889 Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
1890 Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"
1891
1892 - copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
1893   and rebuild the common library
1894
1895 *** UCA
1896
1897 - update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
1898 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
1899 - update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
1900 [ Begin obsolete instructions:
1901   Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files, not the *_STUB.txt files.
1902     - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
1903       on Windows:
1904         python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
1905         python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
1906   End obsolete instructions]
1907 - run all tests with the *_SHORT.txt or the full files (the full ones have comments)
1908   not just the *_STUB.txt files
1909 - note on intltest: if collate/UCAConformanceTest fails, then
1910   utility/MultithreadTest/TestCollators will fail as well;
1911   fix the conformance test before looking into the multi-thread test
1912
1913 *** Implement Cased & Case_Ignorable properties
1914 - via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
1915 - Problem: These properties should be disjoint, but aren't
1916 - UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
1917 - change ucase.icu to be able to store any combination of Cased and Case_Ignorable
1918
1919 *** Implement Changes_When_Xyz properties
1920 - without stored data
1921
1922 *** Implement Name_Alias property
1923 - add it as another name field in unames.icu
1924 - make it available via u_charName() and UCharNameChoice
1925 - consider it in u_charFromName()
1926
1927 *** Break iterators
1928
1929 * Update break iterator rules to new UAX versions and new property values
1930 * Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary
1931
1932 *** new BidiTest file
1933 - review format and data
1934 - copy BidiTest.txt to source/test/testdata
1935 - write test code using this data
1936 - fix ICU code where it fails the conformance test
1937
1938 *** Java
1939 - generally, find and update code corresponding to C/C++
1940 - UCharacter.UnicodeBlock constants:
1941   a) add an _ID integer per new block, update COUNT
1942   b) add a class instance per new block
1943      Visual Studio regex:
1944         find            UBLOCK_{[^ ]+} = [0-9]+, {/.+}
1945         replace with    public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1946 - CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()
1947
1948 - port test changes to Java
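
The Visual Studio find/replace for UnicodeBlock instances translates to Python roughly
as follows. This is a sketch of the same substitution, with a hypothetical function name;
it assumes input lines shaped like the C enum lines in uchar.h.

```python
import re

# Equivalent of: find  UBLOCK_{[^ ]+} = [0-9]+, {/.+}
UBLOCK_LINE = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
JAVA_TEMPLATE = (
    r'public static final UnicodeBlock \g<1> = '
    r'new UnicodeBlock("\g<1>", \g<1>_ID); \g<2>'
)


def to_java_block_instance(c_enum_line):
    """Rewrite one C enum line into a Java UnicodeBlock declaration."""
    return UBLOCK_LINE.sub(JAVA_TEMPLATE, c_enum_line)


print(to_java_block_instance("    UBLOCK_SAMARITAN = 172, /*[0800]*/"))
```

The matching _ID integer constants and COUNT still have to be added separately.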
1949
1950 *** LayoutEngine script information
1951
1952 (For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)
1953
1954 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
1955 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
1956 ScriptRunData.cpp, which is no longer needed.)
1957
1958 The generated files have a current copyright date and "@draft" statement.
1959
1960 -> Eric Mader wrote in email on 20090930:
1961     "I think the tool has been modified to update @draft to @stable for
1962      older scripts and to add @draft for new scripts.
1963      (I worked with an intern on this last year.)
1964      You should check the output after you run it."
1965
1966 * copy the above files into <icu>/source/layout, replacing the old files.
1967 * fix mixed line endings
1968 * review the diffs and fix incorrect @draft and missing aliases
1969 * manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
1970
1971 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
1972 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
1973
1974 -> Eric Mader wrote in email on 20090930:
1975     "This is just a matter of making sure that all the per-script tables have
1976      entries for any new scripts that were added.
1977      If any new Indic characters were added, then the class tables in
1978      IndicClassTables.cpp should be updated to reflect this.
1979      John Emmons should know how to do this if it's required."
1980
1981 * rebuild the layout and layoutex libraries.
1982
1983 *** Documentation
1984 - Update User Guide
1985   + Jamo_Short_Name, sfc->scf, binary property value aliases
1986
1987 ---------------------------------------------------------------------------- ***
1988
1989 Unicode 5.1 update
1990
1991 *** related ICU Trac tickets
1992
1993 5696 Update to Unicode 5.1
1994
1995 *** Unicode version numbers
1996 - makedata.mak
1997 - uchar.h
1998 - configure.in & configure
1999 - update ucdVersion in gennames.c if an algorithmic range changes
2000
2001 *** data files & enums & parser code
2002
2003 * file preparation
2004 - ucdstrip:
2005     DerivedCoreProperties.txt
2006     DerivedNormalizationProps.txt
2007     NormalizationTest.txt
2008     PropList.txt
2009     Scripts.txt
2010     GraphemeBreakProperty.txt
2011     SentenceBreakProperty.txt
2012     WordBreakProperty.txt
2013 - ucdstrip and ucdmerge:
2014     EastAsianWidth.txt
2015     LineBreak.txt
2016
2017 * my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
2018 copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
2019 copy 5.1.0\ucd\Blocks.txt ..\unidata\
2020 copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
2021 copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
2022 copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
2023 copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
2024 copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
2025 copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
2026 copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
2027 copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
2028 copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
2029 copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
2030 copy 5.1.0\ucd\UnicodeData.txt ..\unidata\
2031
2032 ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
2033 ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
2034 ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
2035 ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
2036 ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
2037 ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
2038 ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
2039 ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
2040 ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
2041 ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
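
Since the batch file has to be edited for every UCD version, the command list could be
generated instead. The sketch below only builds the command strings shown above for a
given version; the function name and the idea of generating them are assumptions, and
nothing is executed.

```python
# File lists taken from the batch file above; paths are relative to <version>\ucd.
COPY = [
    "BidiMirroring.txt", "Blocks.txt", "CaseFolding.txt", "DerivedAge.txt",
    r"extracted\DerivedBidiClass.txt", r"extracted\DerivedJoiningGroup.txt",
    r"extracted\DerivedJoiningType.txt", r"extracted\DerivedNumericValues.txt",
    "NormalizationCorrections.txt", "PropertyAliases.txt",
    "PropertyValueAliases.txt", "SpecialCasing.txt", "UnicodeData.txt",
]
STRIP = [
    "DerivedCoreProperties.txt", "DerivedNormalizationProps.txt",
    "NormalizationTest.txt", "PropList.txt", "Scripts.txt",
    r"auxiliary\GraphemeBreakProperty.txt",
    r"auxiliary\SentenceBreakProperty.txt",
    r"auxiliary\WordBreakProperty.txt",
]
STRIP_AND_MERGE = ["EastAsianWidth.txt", "LineBreak.txt"]


def ucd2unidata_commands(version, dest="..\\unidata\\"):
    """Build the copy/ucdstrip/ucdmerge command lines for one UCD version."""
    src = version + "\\ucd\\"
    cmds = ["copy %s%s %s" % (src, f, dest) for f in COPY]
    for f in STRIP:
        name = f.split("\\")[-1]                  # output goes to a flat folder
        cmds.append("ucdstrip < %s%s > %s%s" % (src, f, dest, name))
    for f in STRIP_AND_MERGE:
        cmds.append("ucdstrip < %s%s | ucdmerge > %s%s" % (src, f, dest, f))
    return cmds


for command in ucd2unidata_commands("5.1.0"):
    print(command)
```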
2042
2043 * genpname
2044 - run preparse.pl
2045   + cd \svn\icuproj\icu\uni51\source\tools\genpname
2046   + make sure that data.h is writable
2047   + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
2048   + preparse.pl complains with errors like the following:
2049       Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
2050     This is because ICU 3.8 had scripts from ISO 15924 which are now
2051     added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt
2052     and PropertyValueAliases.txt.
2053     -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
2054        Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
2055   + PropertyValueAliases.txt now explicitly contains values for boolean properties:
2056       N/Y, No/Yes, F/T, False/True
2057     -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
2058        It will use further values from the file if present.
2059
2060 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
2061 - new block & script values
2062   + 17 new blocks
2063   + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
2064     (removed from SyntheticPropertyValueAliases.txt)
2065   + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
2066     (added to SyntheticPropertyValueAliases.txt)
2067 - uprops.icu (uprops.h) only provides 7 bits for script codes.
2068   ICU 4.0 now has USCRIPT_CODE_LIMIT=130 script codes.
2069   None of the codes above 127 is yet the script code of an
2070   assigned Unicode character, so ICU 4.0 uprops.icu does not store any
2071   script code values greater than 127.
2072   However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
2073   in a parallel bit field, and that overflows now.
2074   Also, future values >=128 would be incompatible anyway.
2075   uprops.h is modified to move around several of the bit fields
2076   in the properties vector words, and now uses 8 bits for the script code.
2077   Two other bit fields also grow to accommodate future growth:
2078   Block (current count: 172) grows from 8 to 9 bits,
2079   and Word_Break grows from 4 to 5 bits.
2080 - renamed property Simple_Case_Folding (sfc->scf)
2081   + nothing to be done: handled as normal alias
2082 - new property JSN Jamo_Short_Name
2083   + no new API: only contributes to the Name property
2084 - new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
2085 - new Joining Group (JG) value: Burushashki_Yeh_Barree
2086 - new Sentence_Break (SB) values:
2087     SB ; CR        ; CR
2088     SB ; EX        ; Extend
2089     SB ; LF        ; LF
2090     SB ; SC        ; SContinue
2091 - new Word_Break (WB) values:
2092     WB ; CR        ; CR
2093     WB ; Extend    ; Extend
2094     WB ; LF        ; LF
2095     WB ; MB        ; MidNumLet
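
The script-code overflow described in this list is simple bit arithmetic. A minimal
sketch with a hypothetical helper name; the limit and block count are the figures
quoted above.

```python
def bits_needed(max_value):
    """Number of bits required to store values 0..max_value."""
    return max(1, max_value.bit_length())


USCRIPT_CODE_LIMIT = 130                          # from the note above
max_script = USCRIPT_CODE_LIMIT - 1               # 129, stored as the maximum

print(bits_needed(127))                           # largest value a 7-bit field holds
print(bits_needed(max_script))                    # 129 no longer fits in 7 bits
print(bits_needed(172 - 1))                       # block values 0..171
```

Block values still fit in 8 bits for a count of 172; the move to 9 bits only leaves
room for growth, as stated above.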
2096
2097 * Further changes in the 2008-02-29 update:
2098 - Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
2099   because they should not normally be invisible.
2100 - new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
2101 - new Grapheme_Cluster_Break (GCB) value: PP=Prepend
2102 - new Word_Break (WB) value: NL=Newline
2103
2104 * hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
2105 - Unihan range end moves from 9FBB to 9FC3
2106   search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
2107   + do change gennames.c
2108
2109 * build Unicode data source code for hardcoding core data
2110 C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data
2111
2112 ICU data make path is \svn\icuproj\icu\uni51\source\data\
2113 ICU root path is \svn\icuproj\icu\uni51
2114 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
2115 Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
2116 Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
2117 Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
2118 Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
2119 Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
2120 Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
2121 Creating data file for Unicode Character Properties
2122 Creating data file for Unicode Case Mapping Properties
2123 Creating data file for Unicode BiDi/Shaping Properties
2124 Creating data file for Unicode Normalization
2125 Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
2126 Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"
2127
2128 - copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
2129   and rebuild the common library
2130
2131 *** Break iterators
2132
2133 * Update break iterator rules to new UAX versions and new property values
2134
2135 *** UCA
2136
2137 * update FractionalUCA.txt and UCARules.txt with new canonical closure
2138
2139 *** Test suites
2140 - Test that APIs using Unicode property value aliases (like UnicodeSet)
2141   support all of the boolean values N/Y, No/Yes, F/T, False/True
2142   -> TestBinaryValues() tests in both cintltst and intltest
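
What such a test has to accept can be sketched as a small parser for the eight boolean
aliases. The function name is hypothetical, and plain lower-casing stands in for the
looser alias matching that real APIs may apply.

```python
_TRUE = {"y", "yes", "t", "true"}
_FALSE = {"n", "no", "f", "false"}


def parse_binary_value(alias):
    """Return True/False for a boolean property value alias, else raise."""
    key = alias.strip().lower()                   # simplification: case only
    if key in _TRUE:
        return True
    if key in _FALSE:
        return False
    raise ValueError("not a binary property value alias: %r" % alias)


for name in ["N", "Y", "No", "Yes", "F", "T", "False", "True"]:
    print(name, parse_binary_value(name))
```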
2143
2144 *** LayoutEngine script information
2145 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
2146 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
2147 ScriptRunData.cpp, which is no longer needed.)
2148
2149 The generated files have a current copyright date and "@draft" statement.
2150
2151 * copy the above files into <icu>/source/layout, replacing the old files.
2152
2153 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
2154 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
2155
2156 * rebuild the layout and layoutex libraries.
2157
2158 *** Documentation
2159 - Update User Guide
2160   + Jamo_Short_Name, sfc->scf, binary property value aliases
2161
2162 ---------------------------------------------------------------------------- ***
2163
2164 Unicode 5.0 update
2165
2166 *** related Jitterbugs
2167
2168 5084 RFE: Update to Unicode 5.0
2169
2170 *** data files & enums & parser code
2171
2172 * file preparation
2173 - ucdstrip:
2174     DerivedCoreProperties.txt
2175     DerivedNormalizationProps.txt
2176     NormalizationTest.txt
2177     PropList.txt
2178     Scripts.txt
2179     GraphemeBreakProperty.txt
2180     SentenceBreakProperty.txt
2181     WordBreakProperty.txt
2182 - ucdstrip and ucdmerge:
2183     EastAsianWidth.txt
2184     LineBreak.txt
2185
2186 * my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
2187 copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
2188 copy 5.0.0\ucd\Blocks.txt ..\unidata\
2189 copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
2190 copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
2191 copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
2192 copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
2193 copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
2194 copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
2195 copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
2196 copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
2197 copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
2198 copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
2199 copy 5.0.0\ucd\UnicodeData.txt ..\unidata\
2200
2201 ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
2202 ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
2203 ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
2204 ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
2205 ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
2206 ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
2207 ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
2208 ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
2209 ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
2210 ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
2211
2212 * update FractionalUCA.txt and UCARules.txt with new canonical closure
2213
2214 * genpname
2215 - run preparse.pl
2216   + make sure that data.h is writable
2217   + perl preparse.pl \cvs\oss\icu > out.txt
2218
2219 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
2220 - new block & script values
2221   + script values already added in ICU 3.6 because all of ISO 15924 is now covered
2222
2223 * build Unicode data source code for hardcoding core data
2224 C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data
2225
2226 ICU data make path is \cvs\oss\icu\source\data\
2227 ICU root path is \cvs\oss\icu
2228 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
2229 [etc.]
2230 Creating data file for Unicode Character Properties
2231 Creating data file for Unicode Case Mapping Properties
2232 Creating data file for Unicode BiDi/Shaping Properties
2233 Creating data file for Unicode Normalization
2234 Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
2235 Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"
2236
2237 - copy the .c source files to C:\cvs\oss\icu\source\common
2238   and rebuild the common library
2239
2240 *** Unicode version numbers
2241 - makedata.mak
2242 - uchar.h
2243 - configure.in
2244
2245 *** LayoutEngine script information
2246 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
2247 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
2248 ScriptRunData.cpp, which is no longer needed.)
2249
2250 The generated files have a current copyright date and "@draft" statement.
2251
2252 * copy the above files into <icu>/source/layout, replacing the old files.
2253
2254 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
2255 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
2256
2257 * rebuild the layout and layoutex libraries.
2258
2259 ---------------------------------------------------------------------------- ***
2260
2261 Unicode 4.1 update
2262
2263 *** related Jitterbugs
2264
2265 4332 RFE: Update to Unicode 4.1
2266 4157 RBBI, TR29 4.1 updates
2267
2268 *** data files & enums & parser code
2269
2270 * file preparation
2271 - ucdstrip:
2272     DerivedCoreProperties.txt
2273     DerivedNormalizationProps.txt
2274     NormalizationTest.txt
2275     GraphemeBreakProperty.txt
2276     SentenceBreakProperty.txt
2277     WordBreakProperty.txt
2278 - ucdstrip and ucdmerge:
2279     EastAsianWidth.txt
2280     LineBreak.txt
2281
2282 * add new files to the repository
2283     GraphemeBreakProperty.txt
2284     SentenceBreakProperty.txt
2285     WordBreakProperty.txt
2286
2287 * update FractionalUCA.txt and UCARules.txt with new canonical closure
2288
2289 * genpname
2290 - handle new enumerated properties in sub read_uchar
2291 - run preparse.pl
2292
2293 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
2294 - new binary properties
2295   + Pattern_Syntax
2296   + Pattern_White_Space
2297 - new enumerated properties
2298   + Grapheme_Cluster_Break
2299   + Sentence_Break
2300   + Word_Break
2301 - new block & script & line break values
2302
2303 * gencase
2304 - case-ignorable changes
2305   see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
2306   now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk
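
The D47a rule quoted above can be written as a predicate. This is a sketch, not the
gencase code: Python's unicodedata module reflects whatever Unicode version ships with
the interpreter rather than 4.1, and the set of Word_Break=MidLetter characters must be
supplied by the caller (the sample set below is illustrative only).

```python
import unicodedata

# General categories named in the rule above.
CASE_IGNORABLE_GC = {"Mn", "Me", "Cf", "Lm", "Sk"}


def is_case_ignorable(ch, mid_letter):
    """True if ch is in the given MidLetter set or has one of the categories."""
    return ch in mid_letter or unicodedata.category(ch) in CASE_IGNORABLE_GC


# Illustrative stand-in for the real Word_Break=MidLetter data.
SAMPLE_MID_LETTER = {"\u00B7"}

print(is_case_ignorable("\u0301", SAMPLE_MID_LETTER))   # combining acute, Mn
print(is_case_ignorable("\u00B7", SAMPLE_MID_LETTER))   # in the sample set
print(is_case_ignorable("a", SAMPLE_MID_LETTER))
```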
2307
2308 *** Unicode version numbers
2309 - makedata.mak
2310 - uchar.h
2311 - configure.in
2312
2313 *** tests
2314 - verify that u_charMirror() round-trips
2315 - test all new properties and some new values of old properties
2316
2317 *** other code
2318
2319 * hardcoded Unihan range end/limit
2320 - Unihan range end moves from 9FA5 to 9FBB
2321   search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
2322   + do not modify BOCU/BOCSU code because that would change the encoding
2323     and break binary compatibility!
2324   + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
2325     NamePrepProfile.txt
2326   + ignore trietest.c: test data is arbitrary
2327   + ignore tstnorm.cpp: test optimization, not important
2328   + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
2329   + do change line_th.txt and word_th.txt
2330     by replacing hardcoded ranges with the new property values
2331   + do change gennames.c
2332
2333 source\data\brkitr\line_th.txt(229):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
2334 source\data\brkitr\word_th.txt(23):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
2335 source\tools\gennames\gennames.c(971):        0x4e00, 0x9fa5,
2336
2337 * case mappings
2338 - compare new special casing context conditions with previous ones
2339   see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
2340
2341 * genpname
2342 - consider storing only the short name if it is the same as the long name
2343
2344 *** other reviews
2345 - UAX #29 changes (grapheme/word/sentence breaks)
2346 - UAX #14 changes (line breaks)
2347 - Pattern_Syntax & Pattern_White_Space
2348
2349 ---------------------------------------------------------------------------- ***
2350
2351 Unicode 4.0.1 update
2352
2353 *** related Jitterbugs
2354
2355 3170 RFE: Update to Unicode 4.0.1
2356 3171 Add new Unicode 4.0.1 properties
2357 3520 use Unicode 4.0.1 updates for break iteration
2358
2359 *** data files & enums & parser code
2360
2361 * file preparation
2362 - ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
2363 - ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt
2364
2365 * file fixes
2366 - fix UnicodeData.txt general categories of Ethiopic digits Nd->No
2367   according to PRI #26
2368   http://www.unicode.org/review/resolved-pri.html#pri26
2369 - undone again because no corrigendum in sight;
2370   instead modified tests to not check consistency on this for Unicode 4.0.1
2371
2372 * ucdterms.txt
2373 - update from http://www.unicode.org/copyright.html
2374   formatted for plain text
2375
2376 * uchar.h & uprops.h & uprops.c & genprops
2377 - add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
2378 - add U_LB_INSEPARABLE due to a spelling fix
2379   + put short name comment only on line with new constant
2380     for genpname perl script parser
2381 - new binary properties
2382   + STerm
2383   + Variation_Selector
2384
2385 * genpname
2386 - fix genpname perl script so that it doesn't choke on more than 2 names per property value
2387 - perl script: correctly calculate the maximum number of fields per row
2388
2389 * uscript.h
2390 - new script code Hrkt=Katakana_Or_Hiragana
2391
2392 * gennorm.c: track changes in DerivedNormalizationProps.txt
2393 - "FNC" -> "FC_NFKC"
2394 - single field "NFD_NO" -> two fields "NFD_QC; N" etc.
2395
2396 * genprops/props2.c: track changes in DerivedNumericValues.txt
2397 - changed from 3 columns to 2, dropping the numeric type
2398   + assume that the type is always numeric for Han characters,
2399     and that only those are added in addition to what UnicodeData.txt lists
2400
2401 *** Unicode version numbers
2402 - makedata.mak
2403 - uchar.h
2404 - configure.in
2405
2406 *** tests
2407 - update test of default bidi classes according to PRI #28
2408   /tsutil/cucdtst/TestUnicodeData
2409   http://www.unicode.org/review/resolved-pri.html#pri28
2410 - bidi tests: change exemplar character for ES depending on Unicode version
2411 - change hardcoded expected property values where they change
2412
2413 *** other code
2414
2415 * name matching
2416 - read UCD.html
2417
2418 * scripts
2419 - use new Hrkt=Katakana_Or_Hiragana
2420
2421 * ZWJ & ZWNJ
2422 - are now part of combining character sequences
2423 - break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ