icuSources/data/unidata/changes.txt

   1 * Copyright (C) 2004-2012, International Business Machines
   2 * Corporation and others.  All Rights Reserved.
   3 *
   4 *   file name:  changes.txt
   5 *   encoding:   US-ASCII
   6 *   tab size:   8 (not used)
   7 *   indentation:4
   8 *
   9 *   created on: 2004may06
  10 *   created by: Markus W. Scherer
  11 *
  12 * change log for Unicode updates
  13
  14 ---------------------------------------------------------------------------- ***
  15
  16 Unicode 6.2 update
  17
  18 http://www.unicode.org/review/pri230/
  19 http://www.unicode.org/versions/beta-6.2.0.html
  20 http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0
  21 http://www.unicode.org/review/pri227/  Changes to Script Extensions Property Values
  22 http://www.unicode.org/review/pri228/  Changing some common characters from Punctuation to Symbol
  23 http://www.unicode.org/review/pri229/  Linebreaking Changes for Pictographic Symbols
  24 http://www.unicode.org/reports/tr46/tr46-8.html  IDNA
  25 http://unicode.org/Public/idna/6.2.0/
  26
  27 *** ICU Trac
  28
  29 - ticket 9515: Unicode 6.2: final ICU update
  30
  31 - ticket 9514: UCA 6.2: fix UCARules.txt
  32
  33 - ticket 9437: update ICU to Unicode 6.2
  34 - C++ branches/markus/uni62 at r32050 from trunk at r32041
  35 - Java branches/markus/uni62 at r32068 from trunk at r32066
  36
  37 *** Unicode version numbers
  38 - makedata.mak
  39 - uchar.h
   40   (configure.in & configure have been modified to extract the version from uchar.h)
  41 - com.ibm.icu.util.VersionInfo
  42 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_
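
A minimal sketch of the "extract the version from uchar.h" idea, in Python rather than in configure's shell code. It assumes that the header defines the version as a quoted string macro named U_UNICODE_VERSION; the function name is made up for illustration, and this is not how configure itself does it.

```python
import re

# Assumption: uchar.h contains a line such as
#   #define U_UNICODE_VERSION "6.2"
_VERSION_RE = re.compile(
    r'^\s*#\s*define\s+U_UNICODE_VERSION\s+"([0-9.]+)"', re.MULTILINE)


def unicode_version_from_header(header_text):
    """Return the version string found in the header text, or None."""
    match = _VERSION_RE.search(header_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = '/* sample header text */\n#define U_UNICODE_VERSION "6.2"\n'
    print(unicode_version_from_header(sample))
```

Reading the version from one place keeps the other files in this list from drifting apart when only some of them are updated.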
  43
  44 *** data files & enums & parser code
  45
  46 * file preparation
  47
  48 - download UCD, UCA & IDNA files
  49 - make sure that the Unicode data folder passed into preparseucd.py
  50   includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)
  51 - modify preparseucd.py: NamesList.txt is now in UTF-8
  52 - ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src
  53 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
  54 - Check test file diffs for previously commented-out, known-failing data lines;
  55   probably need to keep those commented out.
  56
  57 * PropertyValueAliases.txt changes
  58 - 1 new Line_Break (lb) value:
  59   lb ; RI                               ; Regional_Indicator
  60   -> uchar.h & UCharacter.LineBreak
  61 - 1 new Word_Break (WB) value:
  62   WB ; RI                               ; Regional_Indicator
  63   -> uchar.h & UCharacter.WordBreak
  64 - 1 new Grapheme_Cluster_Break (GCB) value:
  65   GCB; RI                               ; Regional_Indicator
  66   -> uchar.h & UCharacter.GraphemeClusterBreak
  67
  68 * 3 new numeric values
   69   The new value -1 was really supposed to be NaN, but that would have required
   70   new UnicodeData.txt syntax. It can already be represented as a "fraction" of -1/1,
   71   but encodeNumericValue() in corepropsbuilder.cpp had to be fixed.
  72     cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1
  73     cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1
  74   The two new values 216000 and 432000 require an addition to the encoding of numeric values.
  75     cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000
  76     cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000
  77   -> uprops.h, uchar.c & UCharacterProperty.java
  78   -> cucdtst.c & UCharacterTest.java
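
A small arithmetic sketch of the two cases above. It is not the ICU encoding: it only shows that -1 is an ordinary fraction -1/1, and that 216000 and 432000 are small multiples of a power of 60. Whether the actual addition to the encoding uses such a base-60 form is an assumption here, and the helper names are made up.

```python
from fractions import Fraction


def as_fraction(nv_text):
    """Parse an nv= field such as "-1", "1/2" or "216000" into a Fraction."""
    return Fraction(nv_text)


def split_base60(value):
    """Return (mantissa, exponent) with value == mantissa * 60**exponent,
    choosing the largest possible exponent."""
    if value <= 0:
        raise ValueError("expected a positive integer")
    exponent = 0
    while value % 60 == 0:
        value //= 60
        exponent += 1
    return value, exponent


if __name__ == "__main__":
    print(as_fraction("-1"))      # numerator -1, denominator 1
    print(split_base60(216000))   # 216000 == 1 * 60**3
    print(split_base60(432000))   # 432000 == 2 * 60**3
```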
  79
  80 * generate normalization data files
  81 - ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib
  82 - ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in
  83 - ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata
  84 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
  85 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
  86 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
  87 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
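
The four gennorm2 calls above differ only in the output name and the list of input files. A sketch that builds the same argument lists from a table, using only the -o and -s options shown above; the function and table names are made up, and nothing is executed.

```python
import posixpath

# Output file -> input files, in the order used in the commands above.
NORM_INPUTS = [
    ("nfc.nrm", ["nfc.txt"]),
    ("nfkc.nrm", ["nfc.txt", "nfkc.txt"]),
    ("nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"]),
    ("uts46.nrm", ["nfc.txt", "uts46.txt"]),
]


def gennorm2_commands(src_data_in, unidata, gennorm2="bin/gennorm2"):
    """Return one argv list per .nrm file; the caller decides how to run them."""
    source_dir = posixpath.join(unidata, "norm2")
    return [
        [gennorm2, "-o", posixpath.join(src_data_in, output), "-s", source_dir]
        + inputs
        for output, inputs in NORM_INPUTS
    ]


if __name__ == "__main__":
    for argv in gennorm2_commands("source/data/in", "source/data/unidata"):
        print(" ".join(argv))
```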
  88
  89 * build ICU (make install)
  90   so that the tools build can pick up the new definitions from the installed header files.
  91 * build Unicode tools using CMake+make
  92
  93 * generate core properties data files
  94 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src
  95 - in initial bootstrapping, change the UCA version
  96   in source/data/unidata/FractionalUCA.txt to match the new Unicode version
  97 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src
  98 - rebuild ICU (make install) & tools
   99   + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
 100     check if the UCA version in FractionalUCA.txt matches the new Unicode version
 101     (see step above)
 102 - run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm
 103 - rebuild ICU (make install) & tools
 104
 105 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
 106   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
 107 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
 108 - Unicode 6.0..6.2: U+2260, U+226E, U+226F
 109 - nothing new in 6.2, no test file to update
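
The grep above can also be sketched in Python. This assumes the usual layout of IdnaMappingTable.txt lines (code point or range, then status, separated by semicolons, with a trailing # comment); the sample lines are illustrative, not copied from a particular file version, and the function name is made up.

```python
def disallowed_std3_valid_non_ascii(lines):
    """Return the non-ASCII code points whose status is disallowed_STD3_valid."""
    result = []
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop the comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        result.extend(cp for cp in range(first, last + 1) if cp > 0x7F)
    return result


if __name__ == "__main__":
    sample = [
        "0000..002C    ; disallowed_STD3_valid    # <control>..COMMA",
        "00C0          ; mapped ; 00E0            # LATIN CAPITAL LETTER A WITH GRAVE",
        "2260          ; disallowed_STD3_valid    # NOT EQUAL TO",
        "226E..226F    ; disallowed_STD3_valid    # NOT LESS-THAN..NOT GREATER-THAN",
    ]
    print(["U+%04X" % cp for cp in disallowed_std3_valid_non_ascii(sample)])
```

Comparing this list between two Unicode versions shows whether the test files need new cases.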
 110
 111 * update Java data files
 112 - refresh just the UCD-related files, just to be safe
 113 - see (ICU4C)/source/data/icu4j-readme.txt
 114 - mkdir /tmp/icu4j
 115 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 116   output:
 117     ...
 118     Unicode .icu files built to ./out/build/icudt50l
 119     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b
 120     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b
 121     echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
 122     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b
 123     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b"
 124     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/
 125     mkdir -p /tmp/icu4j/main/shared/data
 126     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
 127     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/
 128     mkdir -p /tmp/icu4j/main/shared/data
 129     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
 130     make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data'
 131 - copy the big-endian Unicode data files to another location,
 132   separate from the other data files
 133     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
 134     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
 135     ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
 136     ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu
 137     ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b
 138     ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
 139     ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr
 140 - refresh ICU4J
 141     ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
 142
 143 * refresh Java test .txt files
 144 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
 145
 146 * UCA
 147
 148 - get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/
 149 - CLDR root files for ICU are in CollationAuxiliary.zip; unpack that
 150 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
 151 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  152   (note: remove the underscore before "Rules")
 153 - update (ICU4C)/source/test/testdata/CollationTest_*.txt
 154   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
 155   with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
 156 - check test file diffs for previously commented-out, known-failing data lines;
 157   probably need to keep those commented out
 158 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
 159 - run genuca, see command line above
 160 - rebuild ICU4C
 161 - refresh ICU4J collation data:
 162   (subset of instructions above for properties data refresh, except copies all coll/*)
 163     ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 164     ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
 165     ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll
 166     ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b
 167 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
 168 - note on intltest: if collate/UCAConformanceTest fails, then
 169   utility/MultithreadTest/TestCollators will fail as well;
 170   fix the conformance test before looking into the multi-thread test
 171
 172 * test ICU, fix test code where necessary
 173
 174 * When refreshing all of ICU4J data from ICU4C
 175 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 176 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
 177 or
 178 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
 179
 180 *** LayoutEngine script information
 181 - skipped for Unicode 6.2: no new scripts
 182
 183 *** merge the Unicode update branches back onto the trunk
 184 - do not merge the icudata.jar and testdata.jar,
 185   instead rebuild them from merged & tested ICU4C
 186
 187 ---------------------------------------------------------------------------- ***
 188
 189 Future Unicode update
 190
 191 Tools simplified since the Unicode 6.1 update. See
 192 - http://site.icu-project.org/design/props/ppucd
 193 - http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972
 194
 195 * Unicode version numbers
 196 - icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates
 197
 198 * file preparation
 199 - ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:
 200 - ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src
 201 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
 202 - Check test file diffs for previously commented-out, known-failing data lines;
 203   probably need to keep those commented out.
 204
 205 * PropertyValueAliases.txt changes
 206 - Script codes that are in ISO 15924 but not in Unicode are now listed in
 207   preparseucd.py, in the _scripts_only_in_iso15924 variable.
 208   If there are new ISO codes, then add them.
 209   If Unicode adds some of them, then remove them from the .py variable.
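
A sketch of the bookkeeping described above, assuming that both the ISO-only list and the Unicode script list can be treated as plain collections of four-letter codes. The actual structure of _scripts_only_in_iso15924 in preparseucd.py may differ, and the function name is made up.

```python
def stale_iso_only_codes(iso_only, unicode_scripts):
    """Return the codes listed as ISO-only that Unicode now defines itself.

    These are the entries to remove from the ISO-only variable.
    """
    return sorted(set(iso_only) & set(unicode_scripts))


if __name__ == "__main__":
    # Illustrative data: Shrd, Sora and Takr were ISO-only script codes
    # before Unicode 6.1 added them as scripts.
    iso_only = ["Khoj", "Tirh", "Hluw", "Shrd", "Sora", "Takr"]
    unicode_scripts = ["Cakm", "Merc", "Mero", "Plrd", "Shrd", "Sora", "Takr"]
    print(stale_iso_only_codes(iso_only, unicode_scripts))
```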
 210
 211 * UnicodeData.txt changes
 212 - No more manual changes for CJK ranges for algorithmic names;
 213   those are now written to ppucd.txt and genprops reads them from there.
 214
 215 * generate core properties data files (makeprops.sh was deleted)
 216 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src
 217
 218 * no more manual updates of source/data/unidata/norm2/nfkc_cf.txt
 219 - it is now generated by preparseucd.py
 220
 221 * no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt
 222 - it is now generated by preparseucd.py
 223 - make sure that the Unicode data folder passed into preparseucd.py
 224   includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
 225   (can be in some subfolder)
 226
 227 * generate normalization data files
 228 - ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib
 229 - ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in
 230 - ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata
 231 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm     -s $UNIDATA/norm2 nfc.txt
 232 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm    -s $UNIDATA/norm2 nfc.txt nfkc.txt
 233 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
 234 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm   -s $UNIDATA/norm2 nfc.txt uts46.txt
 235
 236 * build ICU (make install)
 237 * build Unicode tools using CMake+make
 238
 239 * new way to call genuca (makeuca.sh was deleted)
 240 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src
 241
 242 ---------------------------------------------------------------------------- ***
 243
 244 Unicode 6.1 update
 245
 246 *** ICU Trac
 247
 248 - ticket 8995 final update to Unicode 6.1
 249 - ticket 8994 regenerate source/layout/CanonData.cpp
 250
 251 - ticket 8961 support Unicode "Age" value *names*
 252 - ticket 8963 support multiple character name aliases & types
 253
 254 - ticket 8827 "update ICU to Unicode 6.1"
 255 - C++ branches/markus/uni61 at r30864 from trunk at r30843
 256 - Java branches/markus/uni61 at r30865 from trunk at r30863
 257
 258 *** Unicode version numbers
 259 - makedata.mak
 260 - uchar.h
  261   (configure.in & configure have been modified to extract the version from uchar.h)
 262 - com.ibm.icu.util.VersionInfo
 263 - icutools/unicode/makedefs.sh
 264   + also review & update other definitions in that file,
 265     e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l
 266
 267 *** data files & enums & parser code
 268
 269 * file preparation
 270
 271 ~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed
 272 - This prepares both unidata and testdata files in respective output subfolders.
 273 - Check test file diffs for previously commented-out, known-failing data lines;
 274   probably need to keep those commented out.
 275
 276 * PropertyValueAliases.txt changes
 277 - 11 new block names:
 278   Arabic_Extended_A
 279   Arabic_Mathematical_Alphabetic_Symbols
 280   Chakma
 281   Meetei_Mayek_Extensions
 282   Meroitic_Cursive
 283   Meroitic_Hieroglyphs
 284   Miao
 285   Sharada
 286   Sora_Sompeng
 287   Sundanese_Supplement
 288   Takri
 289   -> add to uchar.h
 290   -> add to UCharacter.UnicodeBlock IDs
 291     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
 292             replace  public static final int \1_ID = \2; \3
 293   -> add to UCharacter.UnicodeBlock objects
 294     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
 295             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
 296 - 1 new Joining_Group (jg) value:
 297   Rohingya_Yeh
 298   -> uchar.h & UCharacter.JoiningGroup
 299 - 2 new Line_Break (lb) values:
 300   CJ=Conditional_Japanese_Starter
 301   HL=Hebrew_Letter
 302   -> uchar.h & UCharacter.LineBreak
 303 - 7 new scripts:
 304   sc ; Cakm      ; Chakma
 305   sc ; Merc      ; Meroitic_Cursive
 306   sc ; Mero      ; Meroitic_Hieroglyphs
 307   sc ; Plrd      ; Miao
 308   sc ; Shrd      ; Sharada
 309   sc ; Sora      ; Sora_Sompeng
 310   sc ; Takr      ; Takri
 311   -> remove these from SyntheticPropertyValueAliases.txt
 312   -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
 313       and in com.ibm.icu.dev.test.lang.TestUScript.java
 314 - 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
 315   (added 2011-06-21)
 316   Khoj        322     Khojki
 317   Tirh        326     Tirhuta
 318     and another one added 2011-12-09
 319   Hluw        080     Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)
 320   -> uscript.h
 321   -> com.ibm.icu.lang.UScript
 322     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
 323     replace  public static final int \1 = \2;\3
 324   -> SyntheticPropertyValueAliases.txt
 325   -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
 326       and in com.ibm.icu.dev.test.lang.TestUScript.java
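
The Eclipse find/replace patterns in this section can be tried outside the IDE. The sketch below applies the same three patterns with Python's re module; the sample input lines (numeric values and comments) are made up for illustration and are not taken from uchar.h or uscript.h.

```python
import re


def to_block_id(line):
    """UBLOCK_X = n, /comment  ->  public static final int X_ID = n; /comment"""
    return re.sub(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)",
                  r"public static final int \g<1>_ID = \g<2>; \g<3>", line)


def to_block_object(line):
    """UBLOCK_X = n, /comment  ->  UnicodeBlock object declaration for X"""
    return re.sub(
        r"UBLOCK_([^ ]+) = [0-9]+, (/.+)",
        r'public static final UnicodeBlock \g<1> = '
        r'new UnicodeBlock("\g<1>", \g<1>_ID); \g<2>',
        line)


def to_script_constant(line):
    """USCRIPT_X = n,rest  ->  public static final int X = n;rest"""
    return re.sub(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)",
                  r"public static final int \g<1> = \g<2>;\g<3>", line)


if __name__ == "__main__":
    block_line = "UBLOCK_CHAKMA = 999, /*[sample]*/"      # hypothetical value
    script_line = "USCRIPT_KHOJKI = 999,/* sample */"     # hypothetical value
    print(to_block_id(block_line))
    print(to_block_object(block_line))
    print(to_script_constant(script_line))
```

The \g<1> form is used instead of \1 so that the group reference cannot run into the characters that follow it.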
 327
 328 * UnicodeData.txt changes
 329 - the last Unihan code point changes from U+9FCB to U+9FCC
  330   search for both the old end 9FCB and the old limit 9FCC (regex 9FC[BC], case-insensitive)
 331   + do change gennames.c
 332   + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java
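
A sketch of the search described above, applied to text that has already been read into memory. The function name and the sample text are made up; a plain case-insensitive grep works just as well.

```python
import re

# Matches the old inclusive end (9FCB) and the old exclusive limit (9FCC).
OLD_CJK_END_RE = re.compile(r"9FC[BC]", re.IGNORECASE)


def lines_with_old_cjk_end(text):
    """Return (line number, line) pairs that mention 9FCB or 9FCC."""
    return [(number, line)
            for number, line in enumerate(text.splitlines(), 1)
            if OLD_CJK_END_RE.search(line)]


if __name__ == "__main__":
    sample = "end = 0x9FCB\nlimit = 0x9fcc\nother = 0x9FA5\n"
    for number, line in lines_with_old_cjk_end(sample):
        print(number, line)
```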
 333
 334 * DerivedBidiClass.txt changes
 335 - 2 new default-AL blocks:
 336 #     Arabic Extended-A: U+08A0  -  U+08FF  (was default-R)
 337 #     Arabic Mathematical Alphabetic Symbols:
 338 #                       U+1EE00  - U+1EEFF  (was default-R)
 339 - 2 new default-R blocks:
 340 #     Meroitic Hieroglyphs:
 341 #                        U+10980 - U+1099F
 342 #     Meroitic Cursive:  U+109A0 - U+109FF
 343   -> should be picked up by the explicit data in the file
 344
 345 * NameAliases.txt changes
 346 - from
 347     # Each line has two fields
 348     # First field: Code point
 349     # Second field: Alias
 350 - to
 351     # Each line has three fields, as described here:
 352     #
 353     # First field:  Code point
 354     # Second field: Alias
 355     # Third field:  Type
  356 - Also, the file format previously allowed multiple aliases per code point, but only now
  357   does it actually contain them, including several of the same type. For example,
 358     FEFF;BYTE ORDER MARK;alternate
 359     FEFF;BOM;abbreviation
 360     FEFF;ZWNBSP;abbreviation
 361 - This breaks our gennames parser, unames.icu data structure, and API.
 362   Fix gennames to only pick up "correction" aliases.
 363   New ticket #8963 for further changes.
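
A sketch of the "only pick up correction aliases" filter, in Python rather than in the gennames C code. It assumes the three-field format described above (code point; alias; type); the function name is made up, and the correction line in the sample is illustrative.

```python
def correction_aliases(lines):
    """Map code point -> alias for lines whose third field is "correction"."""
    aliases = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop comments
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) != 3:
            continue                           # not a three-field data line
        code, alias, alias_type = fields
        if alias_type == "correction":
            aliases[int(code, 16)] = alias
    return aliases


if __name__ == "__main__":
    sample = [
        "# First field: Code point",
        "01A2;LATIN CAPITAL LETTER GHA;correction",
        "FEFF;BYTE ORDER MARK;alternate",
        "FEFF;BOM;abbreviation",
        "FEFF;ZWNBSP;abbreviation",
    ]
    print(correction_aliases(sample))
```

Filtering by type keeps one alias per code point, which is what the existing unames.icu data structure expects.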
 364
 365 * run genpname/preparse.pl (on Linux)
 366   + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
 367   + make sure that data.h is writable
 368   + perl preparse.pl ~/svn.icu/trunk/src > out.txt
 369   + preparse.pl shows no errors, out.txt Info and Warning lines look ok
 370
 371 * build ICU (make install)
 372   so that the tools build can pick up the new definitions from the installed header files.
 373 * build Unicode tools (at least genpname) using CMake+make
 374
 375 * run genpname
 376   (builds both pnames.icu and propname_data.h)
 377 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
 378 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
 379
 380 * build ICU (make install)
 381 * build Unicode tools using CMake+make
 382
 383 * update source/data/unidata/norm2/nfkc_cf.txt
 384 - follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
 385
 386 * update source/data/unidata/norm2/uts46.txt
 387 - download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
 388   to ~/svn.icu/tools/trunk/src/unicode/py
 389 - adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".
 390 - ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
 391 - ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
 392
 393 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
 394   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
 395 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
 396 - Unicode 6.0..6.1: U+2260, U+226E, U+226F
 397 - nothing new in 6.1, no test file to update
 398
 399 * generate core properties data files
 400 - in initial bootstrapping, change the UCA version
 401   in source/data/unidata/FractionalUCA.txt to match the new Unicode version
 402 - ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
 403 - rebuild ICU & tools
  404   + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
 405     check if the UCA version in FractionalUCA.txt matches the new Unicode version
 406     (see step above)
 407 - run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:
 408   ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
 409 - rebuild ICU & tools
 410
 411 * update Java data files
 412 - refresh just the UCD-related files, just to be safe
 413 - see (ICU4C)/source/data/icu4j-readme.txt
 414 - mkdir /tmp/icu4j
 415 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 416   output:
 417     ...
 418     Unicode .icu files built to ./out/build/icudt49l
 419     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b
 420     mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b
 421     echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
 422     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b
 423     mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b"
 424     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/
 425     mkdir -p /tmp/icu4j/main/shared/data
 426     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
 427     jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/
 428     mkdir -p /tmp/icu4j/main/shared/data
 429     cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
 430     make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data'
 431 - copy the big-endian Unicode data files to another location,
 432   separate from the other data files
 433     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
 434     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
 435     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
 436     ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu
 437     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
 438     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
 439     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
 440 - refresh ICU4J
 441     ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
 442
 443 * refresh Java test .txt files
 444 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
 445
 446 * test ICU so far, fix test code where necessary
 447 - temporarily ignore collation issues that look like UCA/UCD mismatches,
 448   until UCA data is updated
 449
 450 * UCA
 451
 452 - get output from Mark's tools; look in
 453     http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. version>.txt
 454 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
 455 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  456   (note: remove the underscore before "Rules")
 457 - update (ICU)/source/test/testdata/CollationTest_*.txt
 458   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
 459   with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
 460 - check test file diffs for previously commented-out, known-failing data lines;
 461   probably need to keep those commented out
 462 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
 463 - run makeuca.sh:
 464   ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
 465 - rebuild ICU4C
 466 - refresh ICU4J collation data:
 467   (subset of instructions above for properties data refresh, except copies all coll/*)
 468     ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 469     ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
 470     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
 471     ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
 472 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
 473 - note on intltest: if collate/UCAConformanceTest fails, then
 474   utility/MultithreadTest/TestCollators will fail as well;
 475   fix the conformance test before looking into the multi-thread test
 476
 477 * When refreshing all of ICU4J data from ICU4C
 478 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 479 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
 480 or
 481 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
 482
 483 *** LayoutEngine script information
 484
 485 (For details see the Unicode 5.2 change log below.)
 486
 487 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
 488   This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
 489   in the working directory.
 490   (It also generates ScriptRunData.cpp, which is no longer needed.)
 491
 492   The generated files have a current copyright date and "@draft" statement.
 493
 494 - diff current <icu>/source/layout files vs. generated ones
 495     ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
 496   review and manually merge desired changes;
 497   fix gratuitous changes, incorrect @draft and missing aliases;
 498   Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
 499 - if you just copy the above files, then
 500   fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
 501   manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
 502
 503 *** merge the Unicode update branches back onto the trunk
 504 - do not merge the icudata.jar and testdata.jar,
 505   instead rebuild them from merged & tested ICU4C
 506
 507 ---------------------------------------------------------------------------- ***
 508
 509 ICU 4.8 (no Unicode update, just new script codes)
 510
 511 * 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
 512   (added 2010-12-21)
 513     Afak    439     Afaka
 514     Jurc    510     Jurchen
 515     Mroo    199     Mro, Mru
 516     Nshu    499     Nüshu
 517     Shrd    319     Sharada, Śāradā
 518     Sora    398     Sora Sompeng
 519     Takr    321     Takri, Ṭākrī, Ṭāṅkrī
 520     Tang    520     Tangut
 521     Wole    480     Woleai
 522   -> uscript.h
 523   -> com.ibm.icu.lang.UScript
 524     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
 525     replace  public static final int \1 = \2;\3
 526   -> genpname/SyntheticPropertyValueAliases.txt
 527   -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
 528       and in com.ibm.icu.dev.test.lang.TestUScript.java
 529
 530 * run genpname/preparse.pl (on Linux)
 531   + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
 532   + make sure that data.h is writable
 533   + perl preparse.pl ~/svn.icu/trunk/src > out.txt
 534   + preparse.pl shows no errors, out.txt Info and Warning lines look ok
 535
 536 * rebuild Unicode tools (at least genpname) using make
 537 - You might first need to "make install" ICU so that the tools build can pick
 538   up the new definitions from the installed header files.
 539
 540 * run genpname
 541   (builds both pnames.icu and propname_data.h)
 542 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
 543 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
 544 - rebuild ICU & tools
 545
 546 * run genprops
 547 - ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
 548 - ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
 549 - rebuild ICU & tools
 550
 551 * update Java data files
 552 - refresh just the UCD-related files, just to be safe
 553 - see (ICU4C)/source/data/icu4j-readme.txt
 554 - mkdir /tmp/icu4j
 555 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 556 - copy the big-endian Unicode data files to another location,
 557   separate from the other data files
 558     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
 559     ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
 560     ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
 561 - refresh ICU4J
 562     ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b
 563
 564 * should have updated the layout engine script codes but forgot
 565
 566 ---------------------------------------------------------------------------- ***
 567
 568 Unicode 6.0 update
 569
 570 *** related ICU Trac tickets
 571
 572 7264 Unicode 6.0 Update
 573
 574 *** Unicode version numbers
 575 - makedata.mak
 576 - uchar.h
  577   (configure.in & configure have been modified to extract the version from uchar.h)
 578 - com.ibm.icu.util.VersionInfo
 579
 580 *** data files & enums & parser code
 581
 582 * file preparation
 583
 584 ~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
 585 - This now prepares both unidata and testdata files in respective output subfolders.
 586
 587 * PropertyAliases.txt changes
 588 - new Script_Extensions property defined in the new ScriptExtensions.txt file
 589   but not listed in PropertyAliases.txt; reported to unicode.org;
 590   -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
 591     scx; Script_Extensions
 592   -> uchar.h with new UProperty section
 593   -> com.ibm.icu.lang.UProperty, parallel with uchar.h
 594
 595 * PropertyValueAliases.txt changes
 596 - 12 new block names:
 597   Alchemical_Symbols
 598   Bamum_Supplement
 599   Batak
 600   Brahmi
 601   CJK_Unified_Ideographs_Extension_D
 602   Emoticons
 603   Ethiopic_Extended_A
 604   Kana_Supplement
 605   Mandaic
 606   Miscellaneous_Symbols_And_Pictographs
 607   Playing_Cards
 608   Transport_And_Map_Symbols
 609   -> add to uchar.h
 610   -> add to UCharacter.UnicodeBlock
 611     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
 612             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
 613 - Joining_Group (jg) values:
 614   Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias
 615   -> uchar.h & UCharacter.JoiningGroup
 616 - 3 new scripts:
 617   sc ; Batk      ; Batak
 618   sc ; Brah      ; Brahmi
 619   sc ; Mand      ; Mandaic
 620   -> remove these from SyntheticPropertyValueAliases.txt
 621   -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
 622   -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
 623       and in com.ibm.icu.dev.test.lang.TestUScript.java
 624 - 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
 625   (added 2009-11-11..2010-07-18)
 626   Bass        259     Bassa Vah
 627   Dupl        755     Duployan shorthand
 628   Elba        226     Elbasan
 629   Gran        343     Grantha
 630   Kpel        436     Kpelle
 631   Loma        437     Loma
 632   Mend        438     Mende
 633   Merc        101     Meroitic Cursive
 634   Narb        106     Old North Arabian
 635   Nbat        159     Nabataean
 636   Palm        126     Palmyrene
 637   Sind        318     Sindhi
 638   Wara        262     Warang Citi
 639   -> uscript.h
 640   -> com.ibm.icu.lang.UScript
 641     find     USCRIPT_([^ ]+) *= ([0-9]+),(.+)
 642     replace  public static final int \1 = \2;\3
 643   -> SyntheticPropertyValueAliases.txt
 644   -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
 645       and in com.ibm.icu.dev.test.lang.TestUScript.java
 646 - ISO 15924 name change
 647   Mero        100     Meroitic Hieroglyphs (was Meroitic)
 648   -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
 649 - property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt
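
The two find/replace recipes above (UBLOCK_ constants to UCharacter.UnicodeBlock instances, USCRIPT_ constants to UScript ints) can also be tried outside Eclipse. A minimal sketch in Python, assuming the C header lines look like the samples shown; the numeric values and comments in the samples are illustrative, not copied from uchar.h or uscript.h, and c_to_java is a hypothetical helper name:

```python
import re

# Patterns taken from the notes above; Python's re module accepts them unchanged.
BLOCK_FIND = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
BLOCK_REPLACE = r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2'

SCRIPT_FIND = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
SCRIPT_REPLACE = r"public static final int \1 = \2;\3"


def c_to_java(line):
    """Apply both substitutions; a line matching neither is returned unchanged."""
    line = BLOCK_FIND.sub(BLOCK_REPLACE, line)
    return SCRIPT_FIND.sub(SCRIPT_REPLACE, line)


if __name__ == "__main__":
    # Illustrative input lines (values are assumptions for the example).
    print(c_to_java("    UBLOCK_BATAK = 199, /*[1BC0]*/"))
    print(c_to_java("    USCRIPT_BASSA_VAH = 134,/* Bass */"))
```

The _ID integer constants and the COUNT update for UnicodeBlock still have to be added by hand; the sketch covers only the mechanical rewrite.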
 650
 651 * UnicodeData.txt changes
 652 - new CJK block:
 653   2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
 654   2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
 655   -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion
 656
 657 * build Unicode tools using CMake+make
 658
 659 * run genpname/preparse.pl (on Linux)
 660   + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
 661   + make sure that data.h is writable
 662   + perl preparse.pl ~/svn.icu/trunk/src > out.txt
 663   + preparse.pl shows no errors, out.txt Info and Warning lines look ok
 664
 665 * rebuild Unicode tools (at least genpname) using make
 666 - You might first need to "make install" ICU so that the tools build can pick
 667   up the new definitions from the installed header files.
 668
 669 * run genpname
 670 - ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
 671 - rebuild ICU & tools
 672
 673 * update source/data/unidata/norm2/nfkc_cf.txt
 674 - follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
 675
 676 * update source/data/unidata/norm2/uts46.txt
 677 - download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
 678   to ~/svn.icu/tools/trunk/src/unicode/py
 679 - adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
 680 - ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
 681 - ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
 682
 683 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
 684   sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
 685 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
 686 - Unicode 6.0: U+2260, U+226E, U+226F
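
The grep step above can be sketched as a small filter. This assumes the usual "code point or range ; status # comment" layout of IdnaMappingTable.txt; non_ascii_std3_valid is a hypothetical helper name, and the sample lines only illustrate the format:

```python
def non_ascii_std3_valid(lines):
    """Return non-ASCII code points whose status is disallowed_STD3_valid."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0]              # drop the trailing comment
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        found.extend(cp for cp in range(first, last + 1) if cp > 0x7F)
    return found


if __name__ == "__main__":
    sample = [
        "0000..002C    ; disallowed_STD3_valid   # ASCII range, filtered out",
        "2260          ; disallowed_STD3_valid   # NOT EQUAL TO",
        "226E..226F    ; disallowed_STD3_valid   # NOT LESS-THAN..NOT GREATER-THAN",
    ]
    print(["U+%04X" % cp for cp in non_ascii_std3_valid(sample)])
```

To run it on the real file, pass an open file object for IdnaMappingTable.txt instead of the sample list.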
 687
 688 * generate core properties data files
 689 - ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
 690 - rebuild ICU & tools
 691 - run makeuca.sh so that genuca picks up the new nfc.nrm:
 692   ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
 693 - rebuild ICU & tools
 694
 695 * implement new Script_Extensions property (provisional)
 696 - parser & generator: genprops & uprops.icu
 697 - uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
 698 - UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java
 699
 700 * switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
 701 - (one-time change)
 702 - genbidi/gencase/genprops tools changes
 703 - re-run makeprops.sh (see above)
 704 - UCharacterProperty.java, UCharacterTypeIterator.java,
 705   UBiDiProps.java, UCaseProps.java, and several others with minor changes;
 706   UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java
 707
 708 * update Java data files
 709 - refresh just the UCD-related files, just to be safe
 710 - see (ICU4C)/source/data/icu4j-readme.txt
 711 - mkdir /tmp/icu4j
 712 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 713   output:
 714     ...
 715     Unicode .icu files built to ./out/build/icudt45l
 716     mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
 717     echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
 718     LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH  ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
 719     jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
 720     mkdir -p /tmp/icu4j/main/shared/data
 721     cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
 722 - copy the big-endian Unicode data files to another location,
 723   separate from the other data files
 724     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
 725     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
 726     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
 727     ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
 728     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
 729     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
 730     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
 731 - refresh ICU4J
 732     ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
 733
 734 * refresh Java test .txt files
 735 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
 736
 737 * un-hardcode normalization skippable (NF*_Inert) test data
 738 - removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools
 739
 740 * copy updated break iterator test files
 741 - now handled by the earlier ucdcopy.py run (see file preparation) and by
 742   copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
 743   (old instructions:
 744    copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
 745    to ~/svn.icu/trunk/src/source/test/testdata)
 746 - they are not used in ICU4J
 747
 748 * UCA
 749
 750 - get output from Mark's tools; look in
 751     http://www.unicode.org/~book/incoming/mark/uca6.0.0/
 752     http://www.macchiato.com/unicode/utc/additional-uca-files
 753     http://www.unicode.org/Public/UCA/6.0.0/
 754     http://www.unicode.org/~mdavis/uca/
 755 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
 756 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
 757 - update Han-implicit ranges for new CJK extensions:
 758   swapCJK() in ucol.cpp & ImplicitCEGenerator.java
 759 - genuca: allow bytes 02 for U+FFFE, new merge-sort character;
 760   do not add it into invuca so that tailoring primary-after an ignorable works
 761 - genuca: permit space between [variable top] bytes
 762 - ucol.cpp: treat noncharacters like unassigned rather than ignorable
 763 - run makeuca.sh:
 764   ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
 765 - rebuild ICU4C
 766 - refresh ICU4J collation data:
 767   (subset of instructions above for properties data refresh, except copies all coll/*)
 768     ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 769     mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
 770     ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
 771     ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
 772 - update (ICU)/source/test/testdata/CollationTest_*.txt
 773   and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
 774   with output from Mark's Unicode tools
 775 - run all tests with the *_SHORT.txt or the full files (the full ones have comments)
 776 - note on intltest: if collate/UCAConformanceTest fails, then
 777   utility/MultithreadTest/TestCollators will fail as well;
 778   fix the conformance test before looking into the multi-thread test
 779
 780 * When refreshing all of ICU4J data from ICU4C
 781 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
 782 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
 783 or
 784 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
 785
 786 *** LayoutEngine script information
 787
 788 (For details see the Unicode 5.2 change log below.)
 789
 790 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
 791 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
 792 ScriptRunData.cpp, which is no longer needed.)
 793
 794 The generated files have a current copyright date and "@draft" statement.
 795
 796 * copy the above files into <icu>/source/layout, replacing the old files.
 797 * fix mixed line endings
 798 * review the diffs and fix incorrect @draft and missing aliases;
 799   Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
 800 * manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
 801
 802 ---------------------------------------------------------------------------- ***
 803
 804 Unicode 5.2 update
 805
 806 *** related ICU Trac tickets
 807
 808 7084 Unicode 5.2
 809
 810 7167 verify collation bytes
 811 7235 Java test NAME_ALIAS
 812 7236 Java DerivedCoreProperties.txt test
 813 7237 Java BidiTest.txt
 814 7238 UTrie2 in core unidata
 815 7239 test for tailoring gaps
 816 7240 Java fix CollationMiscTest
 817 7243 update layout engine for Unicode 5.2
 818
 819 *** Unicode version numbers
 820 - makedata.mak
 821 - uchar.h
 822 - configure.in & configure
 823 - update ucdVersion in gennames.c if an algorithmic range changes
 824
 825 *** data files & enums & parser code
 826
 827 * file preparation
 828
 829 python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
 830 - includes finding files regardless of version numbers,
 831   copying them, and performing the equivalent processing of the
 832   ucdstrip and ucdmerge tools on the desired set of files
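
A rough sketch of the three processing steps named above, not the actual ucdcopy.py code. It assumes that ucdstrip-style processing means dropping '#' comments and empty lines, and that ucdmerge-style processing means merging adjacent code point ranges that carry the same value; all function names are hypothetical:

```python
import re


def strip_version(filename):
    """Drop a '-<major>.<minor>.<update>[dN]' suffix before the extension."""
    return re.sub(r"-\d+\.\d+\.\d+(?:d\d+)?(?=\.\w+$)", "", filename)


def strip_comments(lines):
    """Yield data lines without '#' comments (assumed ucdstrip-like step)."""
    for line in lines:
        data = line.split("#", 1)[0].rstrip()
        if data:
            yield data


def merge_ranges(lines):
    """Merge adjacent ranges with equal values (assumed ucdmerge-like step)."""
    merged = []
    for line in lines:
        range_field, value = [f.strip() for f in line.split(";", 1)]
        start, _, end = range_field.partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        if merged and merged[-1][2] == value and merged[-1][1] + 1 == first:
            merged[-1][1] = last                  # extend the previous range
        else:
            merged.append([first, last, value])
    return [
        ("%04X;%s" % (s, v)) if s == e else ("%04X..%04X;%s" % (s, e, v))
        for s, e, v in merged
    ]


if __name__ == "__main__":
    print(strip_version("GraphemeBreakTest-6.0.0d3.txt"))
    print(merge_ranges(strip_comments(["0020;N # SPACE", "0021..0023;N", "0024;Na"])))
```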
 833
 834 * notes on changes
 835 - PropertyAliases.txt
 836   moved from numeric to enumerated:
 837     ccc       ; Canonical_Combining_Class
 838   new string properties:
 839     NFKC_CF   ; NFKC_Casefold
 840     Name_Alias; Name_Alias
 841   new binary properties:
 842     Cased     ; Cased
 843     CI        ; Case_Ignorable
 844     CWCF      ; Changes_When_Casefolded
 845     CWCM      ; Changes_When_Casemapped
 846     CWKCF     ; Changes_When_NFKC_Casefolded
 847     CWL       ; Changes_When_Lowercased
 848     CWT       ; Changes_When_Titlecased
 849     CWU       ; Changes_When_Uppercased
 850   new CJK Unihan properties (not supported by ICU)
 851 - PropertyValueAliases.txt
 852   new block names
 853   new scripts
 854   one script code change:
 855     sc ; Qaai      ; Inherited
 856     ->
 857     sc ; Zinh      ; Inherited                        ; Qaai
 858   new Line_Break (lb) value:
 859     lb ; CP        ; Close_Parenthesis
 860   new Joining_Group (jg) values: Farsi_Yeh, Nya
 861   other new values:
 862     ccc; 214; ATA  ; Attached_Above
 863 - DerivedBidiClass.txt
 864   new default-R range: U+1E800 - U+1EFFF
 865 - UnicodeData.txt
 866   all of the ISO comments are gone
 867   new CJK block end:
 868     9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
 869   new CJK block:
 870     2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
 871     2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;
 872
 873 * genpname
 874 - run preparse.pl
 875   + cd \svn\icuproj\icu\trunk\source\tools\genpname
 876   + make sure that data.h is writable
 877   + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
 878   + preparse.pl complains with errors like the following:
 879       Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
 880     This is because ICU 4.0 had scripts from ISO 15924 which are now
 881     added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
 882     and PropertyValueAliases.txt.
 883     -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
 884        Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
 885   + preparse.pl complains with errors about block names missing from uchar.h; add them
 886
 887 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
 888 - new block & script values
 889   + 26 new blocks
 890     copy new blocks from Blocks.txt
 891     MS VC++ 2008 regular expression:
 892       find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
 893       replace with "    UBLOCK_\3 = 172, /*[\1]*/"
 894   + several new script values already added in ICU 4.0 for ISO 15924 coverage
 895     (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
 896   + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
 897   + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
 898     (added to SyntheticPropertyValueAliases.txt)
 899 - new Joining Group (JG) values: Farsi_Yeh, Nya
 900 - new Line_Break (lb) value:
 901     lb ; CP        ; Close_Parenthesis
 902
 903 * hardcoded Unihan range end/limit
 904 - Unihan range end moves from 9FC3 to 9FCB
 905   search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
 906   + do change gennames.c
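
The search described above can be run over any set of source lines. A minimal sketch using the regex from the note; find_hardcoded is a hypothetical helper name and the sample lines are invented for illustration:

```python
import re

# Regex from the note above: matches the old end (9FC3) and limit (9FC4).
UNIHAN_END_OR_LIMIT = re.compile(r"9FC[34]", re.IGNORECASE)


def find_hardcoded(lines):
    """Return (line number, text) pairs that mention the old end or limit."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(lines, 1)
        if UNIHAN_END_OR_LIMIT.search(line)
    ]


if __name__ == "__main__":
    sample = [
        "if (0x4e00 <= c && c <= 0x9fc3) {",   # old end, lowercase hex
        "#define CJK_LIMIT 0x9FC4",            # old limit
        "x = 0x9FC5;",                         # unrelated value
    ]
    for number, text in find_hardcoded(sample):
        print(number, text)
```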
 907
 908 * Compare definitions of new binary properties with what we used to use
 909   in algorithms, to see if the definitions changed.
 910 - Verified that definitions for Cased and Case_Ignorable are unchanged.
 911   The gencase tool now parses the newly public Case_Ignorable values
 912   in case the definition changes in the future.
 913
 914 * uchar.c & uprops.h & uprops.c & genprops
 915 - new numeric values that didn't exist in Unicode data before:
 916     1/7, 1/9, 1/10, 3/10, 1/16, 3/16
 917   the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
 918   therefore redesign the encoding of numeric types and values for formatVersion 6;
 919   design for simple numbers up to at least 144 ("one gross"),
 920   large values up to at least 10^20,
 921   and fractions with numerators -1..17 and denominators 1..16
 922   to cover current and expected future values
 923   (e.g., more Han numeric values, Meroitic twelfths)
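
The range limits stated above can be checked against the new fraction values. This sketch tests only the limits quoted in these notes (denominators up to 9 for formatVersion 5, numerators -1..17 and denominators 1..16 for formatVersion 6); it does not reproduce the actual bit encoding in uprops.icu, and the function names are hypothetical:

```python
from fractions import Fraction

# New numeric values listed in the note above.
NEW_VALUES = [
    Fraction(1, 7), Fraction(1, 9), Fraction(1, 10),
    Fraction(3, 10), Fraction(1, 16), Fraction(3, 16),
]


def fits_format_version_5(value):
    """Only the denominator limit from the note is modeled here."""
    return value.denominator <= 9


def fits_format_version_6(value):
    """Design target from the note: numerators -1..17, denominators 1..16."""
    return -1 <= value.numerator <= 17 and 1 <= value.denominator <= 16


if __name__ == "__main__":
    for value in NEW_VALUES:
        print(value, fits_format_version_5(value), fits_format_version_6(value))
```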
 924
 925 * reimplement Hangul_Syllable_Type for new Jamo characters
 926 - the old code assumed that all Jamo characters are in the 11xx block
 927 - Unicode 5.2 fills holes there and adds new Jamo characters in
 928     A960..A97F; Hangul Jamo Extended-A
 929   and in
 930     D7B0..D7FF; Hangul Jamo Extended-B
 931 - Hangul_Syllable_Type can be trivially derived from a subset of
 932   Grapheme_Cluster_Break values
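
The derivation mentioned above amounts to a lookup: the Grapheme_Cluster_Break values L, V, T, LV and LVT correspond to the Hangul_Syllable_Type values of the same name, and every other GCB value maps to NA (Not_Applicable). A sketch using short property value names, not the ICU C implementation:

```python
# GCB short value names that double as Hangul_Syllable_Type short value names.
GCB_TO_HST = {"L": "L", "V": "V", "T": "T", "LV": "LV", "LVT": "LVT"}


def hangul_syllable_type(gcb_value):
    """Derive Hangul_Syllable_Type from a Grapheme_Cluster_Break short name."""
    return GCB_TO_HST.get(gcb_value, "NA")


if __name__ == "__main__":
    for gcb in ("L", "LVT", "EX", "XX"):
        print(gcb, "->", hangul_syllable_type(gcb))
```

Because the lookup depends only on the GCB data, new Jamo in the Extended-A and Extended-B blocks are covered without block-specific code.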
 933
 934 * build Unicode data source code for hardcoding core data
 935 C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data
 936
 937 ICU data make path is \svn\icuproj\icu\trunk\source\data\
 938 ICU root path is \svn\icuproj\icu\trunk
 939 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
 940 Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
 941 Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
 942 Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
 943 Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
 944 Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
 945 Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
 946 Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
 947 Creating data file for Unicode Property Names
 948 Creating data file for Unicode Character Properties
 949 Creating data file for Unicode Case Mapping Properties
 950 Creating data file for Unicode BiDi/Shaping Properties
 951 Creating data file for Unicode Normalization
 952 Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
 953 Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"
 954
 955 - copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
 956   and rebuild the common library
 957
 958 *** UCA
 959
 960 - update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
 961 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
 962 - update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
 963 [ Begin obsolete instructions:
 964   Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files.
 965     - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
 966       on Windows:
 967         python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
 968         python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
 969   End obsolete instructions]
 970 - run all tests with the *_SHORT.txt or the full files (the full ones have comments)
 971   not just the *_STUB.txt files
 972 - note on intltest: if collate/UCAConformanceTest fails, then
 973   utility/MultithreadTest/TestCollators will fail as well;
 974   fix the conformance test before looking into the multi-thread test
 975
 976 *** Implement Cased & Case_Ignorable properties
 977 - via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
 978 - Problem: These properties should be disjoint, but aren't
 979 - UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
 980 - change ucase.icu to be able to store any combination of Cased and Case_Ignorable
 981
 982 *** Implement Changes_When_Xyz properties
 983 - without stored data
 984
 985 *** Implement Name_Alias property
 986 - add it as another name field in unames.icu
 987 - make it available via u_charName() and UCharNameChoice
 988 - consider it in u_charFromName()
 989
 990 *** Break iterators
 991
 992 * Update break iterator rules to new UAX versions and new property values
 993 * Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary
 994
 995 *** new BidiTest file
 996 - review format and data
 997 - copy BidiTest.txt to source/test/testdata
 998 - write test code using this data
 999 - fix ICU code where it fails the conformance test
1000
1001 *** Java
1002 - generally, find and update code corresponding to C/C++
1003 - UCharacter.UnicodeBlock constants:
1004   a) add an _ID integer per new block, update COUNT
1005   b) add a class instance per new block
1006      Visual Studio regex:
1007         find            UBLOCK_{[^ ]+} = [0-9]+, {/.+}
1008         replace with    public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
1009 - CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()
1010
1011 - port test changes to Java
1012
1013 *** LayoutEngine script information
1014
1015 (For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)
1016
1017 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
1018 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
1019 ScriptRunData.cpp, which is no longer needed.)
1020
1021 The generated files have a current copyright date and "@draft" statement.
1022
1023 -> Eric Mader wrote in email on 20090930:
1024     "I think the tool has been modified to update @draft to @stable for
1025      older scripts and to add @draft for new scripts.
1026      (I worked with an intern on this last year.)
1027      You should check the output after you run it."
1028
1029 * copy the above files into <icu>/source/layout, replacing the old files.
1030 * fix mixed line endings
1031 * review the diffs and fix incorrect @draft and missing aliases
1032 * manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
1033
1034 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
1035 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
1036
1037 -> Eric Mader wrote in email on 20090930:
1038     "This is just a matter of making sure that all the per-script tables have
1039      entries for any new scripts that were added.
1040      If any new Indic characters were added, then the class tables in
1041      IndicClassTables.cpp should be updated to reflect this.
1042      John Emmons should know how to do this if it's required."
1043
1044 * rebuild the layout and layoutex libraries.
1045
1046 *** Documentation
1047 - Update User Guide
1048   + Jamo_Short_Name, sfc->scf, binary property value aliases
1049
1050 ---------------------------------------------------------------------------- ***
1051
1052 Unicode 5.1 update
1053
1054 *** related ICU Trac tickets
1055
1056 5696 Update to Unicode 5.1
1057
1058 *** Unicode version numbers
1059 - makedata.mak
1060 - uchar.h
1061 - configure.in & configure
1062 - update ucdVersion in gennames.c if an algorithmic range changes
1063
1064 *** data files & enums & parser code
1065
1066 * file preparation
1067 - ucdstrip:
1068     DerivedCoreProperties.txt
1069     DerivedNormalizationProps.txt
1070     NormalizationTest.txt
1071     PropList.txt
1072     Scripts.txt
1073     GraphemeBreakProperty.txt
1074     SentenceBreakProperty.txt
1075     WordBreakProperty.txt
1076 - ucdstrip and ucdmerge:
1077     EastAsianWidth.txt
1078     LineBreak.txt
1079
1080 * my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
1081 copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
1082 copy 5.1.0\ucd\Blocks.txt ..\unidata\
1083 copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
1084 copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
1085 copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
1086 copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
1087 copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
1088 copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
1089 copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
1090 copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
1091 copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
1092 copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
1093 copy 5.1.0\ucd\UnicodeData.txt ..\unidata\
1094
1095 ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
1096 ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
1097 ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
1098 ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
1099 ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
1100 ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
1101 ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
1102 ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
1103 ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
1104 ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
1105
1106 * genpname
1107 - run preparse.pl
1108   + cd \svn\icuproj\icu\uni51\source\tools\genpname
1109   + make sure that data.h is writable
1110   + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
1111   + preparse.pl complains with errors like the following:
1112       Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
1113     This is because ICU 3.8 had scripts from ISO 15924 which are now
1114     added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt
1115     and PropertyValueAliases.txt.
1116     -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
1117        Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
1118   + PropertyValueAliases.txt now explicitly contains values for boolean properties:
1119       N/Y, No/Yes, F/T, False/True
1120     -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
1121        It will use further values from the file if present.
1122
1123 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
1124 - new block & script values
1125   + 17 new blocks
1126   + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
1127     (removed from SyntheticPropertyValueAliases.txt)
1128   + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
1129     (added to SyntheticPropertyValueAliases.txt)
1130 - uprops.icu (uprops.h) only provides 7 bits for script codes.
1131   In ICU 4.0 there are USCRIPT_CODE_LIMIT=130 script codes now.
1132   None of the codes above 127 is the script code of an
1133   assigned Unicode character yet, so ICU 4.0 uprops.icu does not store any
1134   script code values greater than 127.
1135   However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
1136   in a parallel bit field, and that overflows now.
1137   Also, future values >=128 would be incompatible anyway.
1138   uprops.h is modified to move around several of the bit fields
1139   in the properties vector words, and now uses 8 bits for the script code.
1140   Two other bit fields also grow to accommodate future growth:
1141   Block (current count: 172) grows from 8 to 9 bits,
1142   and Word_Break grows from 4 to 5 bits.
1143 - renamed property Simple_Case_Folding (sfc->scf)
1144   + nothing to be done: handled as normal alias
1145 - new property JSN Jamo_Short_Name
1146   + no new API: only contributes to the Name property
1147 - new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
1148 - new Joining Group (JG) value: Burushashki_Yeh_Barree
1149 - new Sentence_Break (SB) values:
1150     SB ; CR        ; CR
1151     SB ; EX        ; Extend
1152     SB ; LF        ; LF
1153     SB ; SC        ; SContinue
1154 - new Word_Break (WB) values:
1155     WB ; CR        ; CR
1156     WB ; Extend    ; Extend
1157     WB ; LF        ; LF
1158     WB ; MB        ; MidNumLet
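
The bit-field arithmetic behind the uprops.h change described above (7 bits for script codes, maximum value 129, Block count 172) can be checked quickly. A sketch using only the numbers from these notes; bits_needed is a hypothetical helper name:

```python
def bits_needed(max_value):
    """Smallest bit-field width that can hold values 0..max_value."""
    return max(1, max_value.bit_length())


if __name__ == "__main__":
    script_limit = 130                     # USCRIPT_CODE_LIMIT in ICU 4.0
    print(bits_needed(127))                # largest value a 7-bit field holds
    print(bits_needed(script_limit - 1))   # maximum script value 129
    print(bits_needed(172 - 1))            # highest index with 172 Block values
```

The maximum script value 129 needs 8 bits, which is why the 7-bit field overflows. The 172 Block values still fit in 8 bits, so the move to 9 bits is headroom for future growth, as the note says.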
1159
1160 * Further changes in the 2008-02-29 update:
1161 - Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
1162   because they should not normally be invisible.
1163 - new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
1164 - new Grapheme_Cluster_Break (GCB) value: PP=Prepend
1165 - new Word_Break (WB) value: NL=Newline
1166
1167 * hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
1168 - Unihan range end moves from 9FBB to 9FC3
1169   search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
1170   + do change gennames.c
1171
1172 * build Unicode data source code for hardcoding core data
1173 C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data
1174
1175 ICU data make path is \svn\icuproj\icu\uni51\source\data\
1176 ICU root path is \svn\icuproj\icu\uni51
1177 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
1178 Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
1179 Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
1180 Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
1181 Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
1182 Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
1183 Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
1184 Creating data file for Unicode Character Properties
1185 Creating data file for Unicode Case Mapping Properties
1186 Creating data file for Unicode BiDi/Shaping Properties
1187 Creating data file for Unicode Normalization
1188 Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
1189 Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"
1190
1191 - copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
1192   and rebuild the common library
1193
1194 *** Break iterators
1195
1196 * Update break iterator rules to new UAX versions and new property values
1197
1198 *** UCA
1199
1200 * update FractionalUCA.txt and UCARules.txt with new canonical closure
1201
1202 *** Test suites
1203 - Test that APIs using Unicode property value aliases (like UnicodeSet)
1204   support all of the boolean values N/Y, No/Yes, F/T, False/True
1205   -> TestBinaryValues() tests in both cintltst and intltest
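
As an illustration of what such a test checks, here is a sketch of a lookup that accepts all eight boolean value aliases. It is not ICU code; parse_binary_value is a hypothetical name, and treating the aliases as case-insensitive is an assumption of this sketch:

```python
# The eight aliases listed in PropertyValueAliases.txt for binary properties.
BINARY_VALUE_ALIASES = {
    "n": False, "no": False, "f": False, "false": False,
    "y": True, "yes": True, "t": True, "true": True,
}


def parse_binary_value(alias):
    """Map a boolean property value alias to a bool; reject unknown names."""
    key = alias.strip().lower()
    if key not in BINARY_VALUE_ALIASES:
        raise ValueError("unknown binary property value: %r" % alias)
    return BINARY_VALUE_ALIASES[key]


if __name__ == "__main__":
    for name in ("N", "No", "F", "False", "Y", "Yes", "T", "True"):
        print(name, parse_binary_value(name))
```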
1206
1207 *** LayoutEngine script information
1208 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
1209 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
1210 ScriptRunData.cpp, which is no longer needed.)
1211
1212 The generated files have a current copyright date and "@draft" statement.
1213
1214 * copy the above files into <icu>/source/layout, replacing the old files.
1215
1216 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
1217 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
1218
1219 * rebuild the layout and layoutex libraries.
1220
1221 *** Documentation
1222 - Update User Guide
1223   + Jamo_Short_Name, sfc->scf, binary property value aliases
1224
1225 ---------------------------------------------------------------------------- ***
1226
1227 Unicode 5.0 update
1228
1229 *** related Jitterbugs
1230
1231 5084 RFE: Update to Unicode 5.0
1232
1233 *** data files & enums & parser code
1234
1235 * file preparation
1236 - ucdstrip:
1237     DerivedCoreProperties.txt
1238     DerivedNormalizationProps.txt
1239     NormalizationTest.txt
1240     PropList.txt
1241     Scripts.txt
1242     GraphemeBreakProperty.txt
1243     SentenceBreakProperty.txt
1244     WordBreakProperty.txt
1245 - ucdstrip and ucdmerge:
1246     EastAsianWidth.txt
1247     LineBreak.txt
1248
1249 * my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
1250 copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
1251 copy 5.0.0\ucd\Blocks.txt ..\unidata\
1252 copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
1253 copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
1254 copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
1255 copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
1256 copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
1257 copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
1258 copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
1259 copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
1260 copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
1261 copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
1262 copy 5.0.0\ucd\UnicodeData.txt ..\unidata\
1263
1264 ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
1265 ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
1266 ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
1267 ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
1268 ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
1269 ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
1270 ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
1271 ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
1272 ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
1273 ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
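
The strip-then-merge pipeline above can be sketched in Python as follows. This
assumes that ucdstrip drops '#' comments and empty lines and that ucdmerge
joins adjacent code point ranges that carry the same value; the function names
are hypothetical and the real tools may differ in details such as preserving
the header comment.

```python
def strip_comments(lines):
    """Yield data lines with '#' comments and surrounding whitespace removed."""
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if data:
            yield data


def merge_ranges(lines):
    """Merge adjacent code point ranges that carry the same value."""
    merged = []  # entries are [start, end, value]
    for line in lines:
        rng, value = (field.strip() for field in line.split(";", 1))
        start, _, end = rng.partition("..")
        start = int(start, 16)
        end = int(end, 16) if end else start
        if merged and merged[-1][2] == value and merged[-1][1] + 1 == start:
            merged[-1][1] = end  # extend the previous range
        else:
            merged.append([start, end, value])
    return [
        ("%04X;%s" % (s, v)) if s == e else ("%04X..%04X;%s" % (s, e, v))
        for s, e, v in merged
    ]
```

Usage would mirror the batch file: merge_ranges(strip_comments(open(path)))
for EastAsianWidth.txt and LineBreak.txt, and strip_comments alone for the
other files.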
1274
1275 * update FractionalUCA.txt and UCARules.txt with new canonical closure
1276
1277 * genpname
1278 - run preparse.pl
1279   + make sure that data.h is writable
1280   + perl preparse.pl \cvs\oss\icu > out.txt
1281
1282 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
1283 - new block & script values
1284   + script values already added in ICU 3.6 because all of ISO 15924 is now covered
1285
1286 * build Unicode data source code for hardcoding core data
1287 C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data
1288
1289 ICU data make path is \cvs\oss\icu\source\data\
1290 ICU root path is \cvs\oss\icu
1291 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
1292 [etc.]
1293 Creating data file for Unicode Character Properties
1294 Creating data file for Unicode Case Mapping Properties
1295 Creating data file for Unicode BiDi/Shaping Properties
1296 Creating data file for Unicode Normalization
1297 Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
1298 Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"
1299
1300 - copy the .c source files to C:\cvs\oss\icu\source\common
1301   and rebuild the common library
1302
1303 *** Unicode version numbers
1304 - makedata.mak
1305 - uchar.h
1306 - configure.in
1307
1308 *** LayoutEngine script information
1309 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguage.h,
1310 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
1311 ScriptRunData.cpp, which is no longer needed.)
1312
1313 The generated files have a current copyright date and "@draft" statement.
1314
1315 * copy the above files into <icu>/source/layout, replacing the old files.
1316
1317 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
1318 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
1319
1320 * rebuild the layout and layoutex libraries.
1321
1322 ---------------------------------------------------------------------------- ***
1323
1324 Unicode 4.1 update
1325
1326 *** related Jitterbugs
1327
1328 4332 RFE: Update to Unicode 4.1
1329 4157 RBBI, TR29 4.1 updates
1330
1331 *** data files & enums & parser code
1332
1333 * file preparation
1334 - ucdstrip:
1335     DerivedCoreProperties.txt
1336     DerivedNormalizationProps.txt
1337     NormalizationTest.txt
1338     GraphemeBreakProperty.txt
1339     SentenceBreakProperty.txt
1340     WordBreakProperty.txt
1341 - ucdstrip and ucdmerge:
1342     EastAsianWidth.txt
1343     LineBreak.txt
1344
1345 * add new files to the repository
1346     GraphemeBreakProperty.txt
1347     SentenceBreakProperty.txt
1348     WordBreakProperty.txt
1349
1350 * update FractionalUCA.txt and UCARules.txt with new canonical closure
1351
1352 * genpname
1353 - handle new enumerated properties in sub read_uchar
1354 - run preparse.pl
1355
1356 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
1357 - new binary properties
1358   + Pattern_Syntax
1359   + Pattern_White_Space
1360 - new enumerated properties
1361   + Grapheme_Cluster_Break
1362   + Sentence_Break
1363   + Word_Break
1364 - new block & script & line break values
1365
1366 * gencase
1367 - case-ignorable changes
1368   see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
1369   now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk
1370
1371 *** Unicode version numbers
1372 - makedata.mak
1373 - uchar.h
1374 - configure.in
1375
1376 *** tests
1377 - verify that u_charMirror() round-trips
1378 - test all new properties and some new values of old properties
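
The round-trip check can be sketched in Python against BidiMirroring.txt-style
data ("source; target # comment"). This is an illustration only, not the ICU
test: the function names are hypothetical, and mirror() here merely imitates
u_charMirror() by returning the character itself when there is no mapping.

```python
def parse_bidi_mirroring(lines):
    """Build a {code point: mirrored code point} map from data lines."""
    mapping = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        source, target = (int(field.strip(), 16) for field in data.split(";"))
        mapping[source] = target
    return mapping


def mirror(mapping, c):
    """Return the mirrored code point, or c itself if it has no mapping."""
    return mapping.get(c, c)


def non_round_tripping(mapping):
    """Return the sorted code points c for which mirror(mirror(c)) != c."""
    return sorted(c for c in mapping if mirror(mapping, mirror(mapping, c)) != c)
```

An empty result from non_round_tripping() means every mapping round-trips.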
1379
1380 *** other code
1381
1382 * hardcoded Unihan range end/limit
1383 - Unihan range end moves from 9FA5 to 9FBB
1384   search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
1385   + do not modify BOCU/BOCSU code because that would change the encoding
1386     and break binary compatibility!
1387   + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
1388     NamePrepProfile.txt
1389   + ignore trietest.c: test data is arbitrary
1390   + ignore tstnorm.cpp: test optimization, not important
1391   + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
1392   + do change line_th.txt and word_th.txt
1393     by replacing hardcoded ranges with the new property values
1394   + do change gennames.c
1395
1396 source\data\brkitr\line_th.txt(229):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
1397 source\data\brkitr\word_th.txt(23):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
1398 source\tools\gennames\gennames.c(971):        0x4e00, 0x9fa5,
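
A search like the one that produced the listing above can be sketched in
Python. The function name is hypothetical, and reading files as Latin-1 is an
assumption made so that arbitrary bytes never raise a decoding error. Every
hit still needs manual review against the do/do-not rules listed above.

```python
import os
import re

# Matches both the old Unihan range end (9FA5) and limit (9FA6), any case.
UNIHAN_END_RE = re.compile(r"9FA[56]", re.IGNORECASE)


def find_hardcoded_unihan_end(root):
    """Return (path, line number, text) for each line under root matching 9FA[56]."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, encoding="latin-1") as f:
                for number, text in enumerate(f, 1):
                    if UNIHAN_END_RE.search(text):
                        hits.append((path, number, text.rstrip("\n")))
    return hits
```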
1399
1400 * case mappings
1401 - compare new special casing context conditions with previous ones
1402   see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
1403
1404 * genpname
1405 - consider storing only the short name if it is the same as the long name
1406
1407 *** other reviews
1408 - UAX #29 changes (grapheme/word/sentence breaks)
1409 - UAX #14 changes (line breaks)
1410 - Pattern_Syntax & Pattern_White_Space
1411
1412 ---------------------------------------------------------------------------- ***
1413
1414 Unicode 4.0.1 update
1415
1416 *** related Jitterbugs
1417
1418 3170 RFE: Update to Unicode 4.0.1
1419 3171 Add new Unicode 4.0.1 properties
1420 3520 use Unicode 4.0.1 updates for break iteration
1421
1422 *** data files & enums & parser code
1423
1424 * file preparation
1425 - ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
1426 - ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt
1427
1428 * file fixes
1429 - fix UnicodeData.txt general categories of Ethiopic digits Nd->No
1430   according to PRI #26
1431   http://www.unicode.org/review/resolved-pri.html#pri26
1432 - undone again because no corrigendum in sight;
1433   instead modified tests to not check consistency on this for Unicode 4.0.1
1434
1435 * ucdterms.txt
1436 - update from http://www.unicode.org/copyright.html
1437   formatted for plain text
1438
1439 * uchar.h & uprops.h & uprops.c & genprops
1440 - add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
1441 - add U_LB_INSEPARABLE due to a spelling fix
1442   + put short name comment only on line with new constant
1443     for genpname perl script parser
1444 - new binary properties
1445   + STerm
1446   + Variation_Selector
1447
1448 * genpname
1449 - fix genpname perl script so that it doesn't choke on more than 2 names per property value
1450 - perl script: correctly calculate the maximum number of fields per row
1451
1452 * uscript.h
1453 - new script code Hrkt=Katakana_Or_Hiragana
1454
1455 * gennorm.c: track changes in DerivedNormalizationProps.txt
1456 - "FNC" -> "FC_NFKC"
1457 - single field "NFD_NO" -> two fields "NFD_QC; N" etc.
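
The two format changes can be illustrated with a small Python sketch that
reads a data line in either style and returns new-style fields. This is not
the gennorm.c code: the function name is hypothetical, and treating "_MAYBE"
like "_NO" (mapping it to "M") is an assumption that goes beyond the "etc."
above.

```python
def normalize_fields(line):
    """Split a data line into fields, mapping old-style names to new ones."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None  # comment-only or empty line
    fields = [field.strip() for field in data.split(";")]
    if len(fields) < 2:
        return fields
    name = fields[1]
    if name == "FNC":
        fields[1] = "FC_NFKC"
    elif name.endswith("_NO") or name.endswith("_MAYBE"):
        # Old single field "NFD_NO" becomes the two fields "NFD_QC", "N".
        form, _, answer = name.partition("_")
        fields[1:2] = [form + "_QC", "N" if answer == "NO" else "M"]
    return fields
```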
1458
1459 * genprops/props2.c: track changes in DerivedNumericValues.txt
1460 - changed from 3 columns to 2, dropping the numeric type
1461   + assume that the type is always numeric for Han characters,
1462     and that only those are added in addition to what UnicodeData.txt lists
1463
1464 *** Unicode version numbers
1465 - makedata.mak
1466 - uchar.h
1467 - configure.in
1468
1469 *** tests
1470 - update test of default bidi classes according to PRI #28
1471   /tsutil/cucdtst/TestUnicodeData
1472   http://www.unicode.org/review/resolved-pri.html#pri28
1473 - bidi tests: change exemplar character for ES depending on Unicode version
1474 - change hardcoded expected property values where they change
1475
1476 *** other code
1477
1478 * name matching
1479 - read UCD.html
1480
1481 * scripts
1482 - use new Hrkt=Katakana_Or_Hiragana
1483
1484 * ZWJ & ZWNJ
1485 - are now part of combining character sequences
1486 - break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ