X-Git-Url: https://git.saurik.com/apple/icu.git/blobdiff_plain/729e4ab9bc6618bc3d8a898e575df7f4019e29ca..f59164e3d128c7675a4d3934206346a3384e53a5:/icuSources/data/unidata/changes.txt
diff --git a/icuSources/data/unidata/changes.txt b/icuSources/data/unidata/changes.txt
index 5e0b8768..20602129 100644
--- a/icuSources/data/unidata/changes.txt
+++ b/icuSources/data/unidata/changes.txt
@@ -1,4 +1,4 @@
-* Copyright (C) 2004-2010, International Business Machines
+* Copyright (C) 2004-2016, International Business Machines
 * Corporation and others. All Rights Reserved.
 *
 * file name: changes.txt
@@ -13,6 +13,1495 @@
 ---------------------------------------------------------------------------- ***
+* New ISO 15924 script codes
+
+Starting with ICU 55, we do not add UScriptCode constants any more until their scripts
+are encoded in Unicode, or can be assumed to be encoded in the next Unicode version.
+Script enum constant names want to follow the Unicode script property value aliases,
+which are assigned only when the scripts are encoded.
+When we encode scripts early and guess wrong, then we have confusing enum constants
+and have sometimes added aliases.
+
+Exception: Script codes like Latf and Aran that are not subject to separate encoding
+can be added at any time.
+
+Script codes not yet in ICU: http://www.unicode.org/iso15924/codechanges.html
+
+Added 2014-11-15, see http://bugs.icu-project.org/trac/ticket/11561
+- Adlm 166 Adlam
+- Aran 161 Arabic (Nastaliq variant)
+- Kitl 505 Khitan large script
+- Kits 288 Khitan small script
+- Marc 332 Marchen
+- Osge 219 Osage
+
+Aran can be added as USCRIPT_ARABIC_NASTALIQ at any time.
+
+Adlam, Marchen, and Osage are expected to go into Unicode 9;
+we should assign Unicode script property value aliases for them
+soon after Unicode 8 is released, and add them in ICU 56.
+
+Khitan scripts will be encoded later.
+
+---------------------------------------------------------------------------- ***
+
+Emoji properties added in ICU 57: http://bugs.icu-project.org/trac/ticket/11802
+
+Edit preparseucd.py to add & parse new properties.
+They share the UCD property namespace but are not listed in PropertyAliases.txt.
+
+Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/
+Initial data from emoji/2.0/
+
+ICU_ROOT=~/svn.icu/trunk
+ICU_SRC_DIR=$ICU_ROOT/src
+ICUDT=icudt56b
+export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
+SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
+UNIDATA=$ICU_SRC_DIR/source/data/unidata
+
+Add binary-property constants to uchar.h enum UProperty & UProperty.java.
+
+~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src
+(Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.)
+
+Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java
+
+make install, then icutools cmake & make, then
+~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR
+
+Generate Java data as usual, only update pnames.icu & uprops.icu.
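Illustration only, not the actual preparseucd.py code: a minimal sketch of reading binary emoji properties from emoji-data.txt. It assumes the file follows the usual UCD line shape "code point or range ; property name # comment". The function name parse_emoji_data and the sample lines are hypothetical.

```python
import re

# Assumed line shape (usual UCD data-file convention):
#   0023          ; Emoji            # comment
#   1F3FB..1F3FF  ; Emoji_Modifier   # comment
_LINE = re.compile(
    r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)")


def parse_emoji_data(lines):
    """Return {property_name: [(start, end), ...]} with inclusive ranges."""
    props = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue  # blank or comment-only line
        m = _LINE.match(line)
        if m is None:
            raise ValueError("unexpected line: %r" % line)
        start = int(m.group(1), 16)
        # A single code point is stored as a one-element range.
        end = int(m.group(2), 16) if m.group(2) else start
        props.setdefault(m.group(3), []).append((start, end))
    return props


# Hypothetical excerpt, not copied from the real data file.
sample = [
    "# comment-only line",
    "0023          ; Emoji                # NUMBER SIGN",
    "1F3FB..1F3FF  ; Emoji_Modifier       # skin-tone modifiers",
    "",
]
print(parse_emoji_data(sample))
```

Because these properties are not listed in PropertyAliases.txt, a parser along these lines would need its own list of accepted property names rather than looking them up there.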
+ +---------------------------------------------------------------------------- *** + +Unicode 8.0 update for ICU 56 + +* Command-line environment setup + +ICU_ROOT=~/svn.icu/trunk +ICU_SRC_DIR=$ICU_ROOT/src +ICUDT=icudt56b +export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib +SRC_DATA_IN=$ICU_SRC_DIR/source/data/in +UNIDATA=$ICU_SRC_DIR/source/data/unidata + +http://www.unicode.org/review/pri297/ -- beta review +http://www.unicode.org/reports/uax-proposed-updates.html +http://unicode.org/versions/beta-8.0.0.html +http://www.unicode.org/versions/Unicode8.0.0/ +http://www.unicode.org/reports/tr44/tr44-15.html + +*** ICU Trac + +- ticket:11574: Unicode 8 +- C++ branches/markus/uni80 at r37351 from trunk at r37343 +- Java branches/markus/uni80 at r37352 from trunk at r37338 + +*** CLDR Trac + +- cldrbug 8311: UCA 8 +- branches/markus/uni80 at r11518 from trunk at r11517 + +- cldrbug 8109: Unicode 8.0 script metadata +- cldrbug 8418: Updated segmentation for Unicode 8.0 + +*** Unicode version numbers +- makedata.mak +- uchar.h +- com.ibm.icu.util.VersionInfo +- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ + +- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h + so that the makefiles see the new version number. + +*** data files & enums & parser code + +* file preparation + +- download UCD & IDNA files +- make sure that the Unicode data folder passed into preparseucd.py + includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder) +- only for manual diffs: remove version suffixes from the file names + ~/unidata/uni70/20140403$ ../../desuffixucd.py . + (see https://sites.google.com/site/unicodetools/inputdata) +- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip +- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src +- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. 
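Illustration only: a sketch of the "remove version suffixes from the file names" step used for manual diffs. This is not the real desuffixucd.py. It assumes that versioned UCD file names look like "Name-8.0.0.txt" or "Name-8.0.0d5.txt", and strip_version_suffix is a hypothetical helper.

```python
import re

# Assumption: the suffix is "-<digits>(.<digits>)*" with an optional draft
# marker "d<digits>", placed directly before the file extension.
_SUFFIX = re.compile(r"-\d+(?:\.\d+)*(?:d\d+)?(?=\.[A-Za-z]+$)")


def strip_version_suffix(name):
    """Return the file name without a trailing version suffix, if any."""
    return _SUFFIX.sub("", name)


for name in ("UnicodeData-8.0.0d5.txt", "DerivedAge-8.0.0.txt",
             "emoji-data.txt", "Unihan.zip"):
    print(name, "->", strip_version_suffix(name))
```

The lookahead keeps names such as emoji-data.txt intact, since a hyphen not followed by a digit is not treated as a version suffix.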
+
+- also: from http://unicode.org/Public/security/8.0.0/ download new
+  confusables.txt & confusablesWholeScript.txt
+  and copy to $UNIDATA
+    ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA
+    ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA
+
+* initial preparseucd.py changes
+- remove new Unicode scripts from the
+  only-in-ISO-15924 list according to the error message:
+    ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw']
+    from _scripts_only_in_iso15924
+  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
+     and in com.ibm.icu.dev.test.lang.TestUScript.java
+- property and file name change:
+  IndicMatraCategory -> IndicPositionalCategory
+- UnicodeData.txt unusual numeric values (fractions not in lowest terms)
+    109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;
+    109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;
+    109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;;
+    109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;;
+    109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;;
+    109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;
+    109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;;
+    109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;;
+    109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;;
+    109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;
+  -> change preparseucd.py to map them to reduced fractions (e.g., 2/12 to 1/6)
+     which are listed in DerivedNumericValues.txt;
+     keeps storage in data file simple
+
+* PropertyValueAliases.txt changes
+- 10 new Block (blk) values:
+    blk; Ahom ; Ahom
+    blk; Anatolian_Hieroglyphs ; Anatolian_Hieroglyphs
+    blk; Cherokee_Sup ; Cherokee_Supplement
+    blk; CJK_Ext_E ; CJK_Unified_Ideographs_Extension_E
+    blk; Early_Dynastic_Cuneiform ; Early_Dynastic_Cuneiform
+    blk; Hatran ; Hatran
+    blk; Multani ; Multani
+    blk; Old_Hungarian ;
Old_Hungarian + blk; Sup_Symbols_And_Pictographs ; Supplemental_Symbols_And_Pictographs + blk; Sutton_SignWriting ; Sutton_SignWriting + -> add to uchar.h + use long property names for enum constants + -> add to UCharacter.UnicodeBlock IDs + Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) + replace public static final int \1_ID = \2; \3 + -> add to UCharacter.UnicodeBlock objects + Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) + replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 +- 6 new Script (sc) values: + sc ; Ahom ; Ahom + sc ; Hatr ; Hatran + sc ; Hluw ; Anatolian_Hieroglyphs + sc ; Hung ; Old_Hungarian + sc ; Mult ; Multani + sc ; Sgnw ; SignWriting + -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript + +* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata + (not strictly necessary for NOT_ENCODED scripts) + ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt + +* generate normalization data files + cd $ICU_ROOT/dbg + bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource + bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt + bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt + bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt + bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt + +* build ICU (make install) + so that the tools build can pick up the new definitions from the installed header files. + + $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt + +* build Unicode tools using CMake+make + +~/svn.icutools/trunk/src/unicode/c/icudefs.txt: + + # Location (--prefix) of where ICU was installed. + set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst) + # Location of the ICU source tree. 
+ set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src) + + ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c + ~/svn.icutools/trunk/dbg/unicode/c$ make + +* generate core properties data files +- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR +- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR +- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR +- rebuild ICU (make install) & tools +- run genuca again (see step above) so that it picks up the new nfc.nrm +- rebuild ICU (make install) & tools + +* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to + sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) +- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters +- Unicode 6.0..8.0: U+2260, U+226E, U+226F +- nothing new in 8.0, no test file to update + +* run & fix ICU4C tests +- bad Cherokee case folding due to difference in fallbacks: + UCD case folding falls back to no mapping, + ICU runtime case folding falls back to lowercasing; + fixed casepropsbuilder.cpp to generate scf mappings to self + when there is an slc mapping but no scf +- Andy handles RBBI & spoof check test failures + +* collation: CLDR collation root, UCA DUCET + +- UCA DUCET goes into Mark's Unicode tools, see + https://sites.google.com/site/unicodetools/home#TOC-UCA +- CLDR root data files are checked into (CLDR UCA branch)/common/uca/ +- cd (CLDR UCA branch)/common/uca/ +- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt + cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt +- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt + cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt + (note removing the underscore before "Rules") + cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt +- restore 
TODO diffs in UCARules.txt + meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt +- update (ICU4C)/source/test/testdata/CollationTest_*.txt + and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt + from the CLDR root files (..._CLDR_..._SHORT.txt) + cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt + cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt + cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data +- if CLDR common/uca/unihan-index.txt changes, then update + CLDR common/collation/root.xml + and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt +- run genuca, see command line above; + deal with + Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt + (add the character to genuca.cpp sampleCharsToScripts[]) + + look up the script for the new sample characters + (e.g., in FractionalUCA.txt) + + *add* mappings to sampleCharsToScripts[], do not replace them + (in case the script sample characters flip-flop) + + insert new scripts in DUCET script order, see the top_byte table + at the beginning of FractionalUCA.txt +- rebuild ICU4C + +* run & fix ICU4C tests, now with new CLDR collation root data +- run all tests with the collation test data *_SHORT.txt or the full files + (the full ones have comments, useful for debugging) +- note on intltest: if collate/UCAConformanceTest fails, then + utility/MultithreadTest/TestCollators will fail as well; + fix the conformance test before looking into the multi-thread test +- fixed bug in CollationWeights::getWeightRanges() + exposed by new data and CollationTest::TestRootElements + +* update Java data files +- refresh just the UCD/UCA-related/derived files, just to be safe 
+- see (ICU4C)/source/data/icu4j-readme.txt +- mkdir /tmp/icu4j +- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + output: + ... + Unicode .icu files built to ./out/build/icudt56l + echo timestamp > uni-core-data + mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b + mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b + echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt + LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b + mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b" + jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data + jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data + make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data' +- copy the big-endian Unicode data files to another location, + separate from the other data files, + and then refresh ICU4J + cd ~/svn.icu/trunk/dbg/data/out/icu4j + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu + cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp 
com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT + +* When refreshing all of ICU4J data from ICU4C +- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data +or +- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install + +* update CollationFCD.java + + copy & paste the initializers of lcccIndex[] etc. from + ICU4C/source/i18n/collationfcd.cpp to + ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java + +* refresh Java test .txt files +- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode + cd $ICU_SRC_DIR/source/data/unidata + cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode + cd ../../test/testdata + cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode + cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode + +* run & fix ICU4J tests + +*** LayoutEngine script information + +* ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more, + because the layout engine was deprecated in ICU 54. + Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java + to write lines that we used to add manually. + +* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. + This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp + in the working directory. + + (It also generates ScriptRunData.cpp, which is no longer needed.) 
+ + It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages + (a plain text file) + which maps ICU versions to the numbers of script/language constants + that were added then. + (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.) + + The generated files have a current copyright date and "@deprecated" statement. + +* Review changes, fix Java tool if necessary, and copy to ICU4C + cd ~/svn.icu4j/trunk/src + meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout + cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout + cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout + +*** API additions +- send notice to icu-design about new born-@stable API (enum constants etc.) + +*** merge the Unicode update branches back onto the trunk +- do not merge the icudata.jar and testdata.jar, + instead rebuild them from merged & tested ICU4C +- make sure that changes to Unicode tools & ICU tools are checked in + http://www.unicode.org/utility/trac/log/trunk/unicodetools + http://bugs.icu-project.org/trac/log/tools/trunk + +---------------------------------------------------------------------------- *** + +Unicode 7.0 update for ICU 54 + +http://www.unicode.org/review/pri271/ -- beta review +http://www.unicode.org/reports/uax-proposed-updates.html +http://www.unicode.org/versions/beta-7.0.0.html#notable_issues +http://www.unicode.org/reports/tr44/tr44-13.html + +*** ICU Trac + +- ticket 10821: Unicode 7.0, UCA 7.0 +- C++ branches/markus/uni70 at r35584 from trunk at r35580 +- Java branches/markus/uni70 at r35587 from trunk at r35545 + +*** CLDR Trac + +- ticket 7195: UCA 7.0 CLDR root collation +- branches/markus/uni70 at r10062 from trunk at r10061 + +- ticket 6762: script metadata for Unicode 7.0 new scripts + +*** Unicode version numbers +- makedata.mak +- uchar.h +- com.ibm.icu.util.VersionInfo +- 
com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ + +- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h + so that the makefiles see the new version number. + +*** data files & enums & parser code + +* file preparation + +- download UCD & IDNA files +- make sure that the Unicode data folder passed into preparseucd.py + includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder) +- only for manual diffs: remove version suffixes from the file names + ~/unidata/uni70/20140403$ ../../desuffixucd.py . + (see https://sites.google.com/site/unicodetools/inputdata) +- only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip +- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src +- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. +- Restore TODO diffs in source/data/unidata/UCARules.txt + cd $ICU_SRC_DIR + meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt +- Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt + +- also: from http://unicode.org/Public/security/7.0.0/ download new + confusables.txt & confusablesWholeScript.txt + and copy to $ICU_ROOT/src/source/data/unidata/ + +* initial preparseucd.py changes +- remove new Unicode scripts from the + only-in-ISO-15924 list according to the error message: + ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass', + 'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm', + 'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj'] + from _scripts_only_in_iso15924 + -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() + and in com.ibm.icu.dev.test.lang.TestUScript.java +- NamesList.txt now has a heading with a non-ASCII character + + keep ppucd.txt in platform charset, rather than changing tool/test parsers + + escape non-ASCII characters in heading comments +- gets 
Unicode copyright line from PropertyAliases.txt which is currently still at 2013 + + get the copyright from the first file whose copyright line contains the current year + +* PropertyValueAliases.txt changes +- 32 new Block (blk) values: + blk; Bassa_Vah ; Bassa_Vah + blk; Caucasian_Albanian ; Caucasian_Albanian + blk; Coptic_Epact_Numbers ; Coptic_Epact_Numbers + blk; Diacriticals_Ext ; Combining_Diacritical_Marks_Extended + blk; Duployan ; Duployan + blk; Elbasan ; Elbasan + blk; Geometric_Shapes_Ext ; Geometric_Shapes_Extended + blk; Grantha ; Grantha + blk; Khojki ; Khojki + blk; Khudawadi ; Khudawadi + blk; Latin_Ext_E ; Latin_Extended_E + blk; Linear_A ; Linear_A + blk; Mahajani ; Mahajani + blk; Manichaean ; Manichaean + blk; Mende_Kikakui ; Mende_Kikakui + blk; Modi ; Modi + blk; Mro ; Mro + blk; Myanmar_Ext_B ; Myanmar_Extended_B + blk; Nabataean ; Nabataean + blk; Old_North_Arabian ; Old_North_Arabian + blk; Old_Permic ; Old_Permic + blk; Ornamental_Dingbats ; Ornamental_Dingbats + blk; Pahawh_Hmong ; Pahawh_Hmong + blk; Palmyrene ; Palmyrene + blk; Pau_Cin_Hau ; Pau_Cin_Hau + blk; Psalter_Pahlavi ; Psalter_Pahlavi + blk; Shorthand_Format_Controls ; Shorthand_Format_Controls + blk; Siddham ; Siddham + blk; Sinhala_Archaic_Numbers ; Sinhala_Archaic_Numbers + blk; Sup_Arrows_C ; Supplemental_Arrows_C + blk; Tirhuta ; Tirhuta + blk; Warang_Citi ; Warang_Citi + -> add to uchar.h + use long property names for enum constants + -> add to UCharacter.UnicodeBlock IDs + Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) + replace public static final int \1_ID = \2; \3 + -> add to UCharacter.UnicodeBlock objects + Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) + replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 +- 28 new Joining_Group (jg) values: + jg ; Manichaean_Aleph ; Manichaean_Aleph + jg ; Manichaean_Ayin ; Manichaean_Ayin + jg ; Manichaean_Beth ; Manichaean_Beth + jg ; Manichaean_Daleth ; Manichaean_Daleth + jg ; Manichaean_Dhamedh ; 
Manichaean_Dhamedh + jg ; Manichaean_Five ; Manichaean_Five + jg ; Manichaean_Gimel ; Manichaean_Gimel + jg ; Manichaean_Heth ; Manichaean_Heth + jg ; Manichaean_Hundred ; Manichaean_Hundred + jg ; Manichaean_Kaph ; Manichaean_Kaph + jg ; Manichaean_Lamedh ; Manichaean_Lamedh + jg ; Manichaean_Mem ; Manichaean_Mem + jg ; Manichaean_Nun ; Manichaean_Nun + jg ; Manichaean_One ; Manichaean_One + jg ; Manichaean_Pe ; Manichaean_Pe + jg ; Manichaean_Qoph ; Manichaean_Qoph + jg ; Manichaean_Resh ; Manichaean_Resh + jg ; Manichaean_Sadhe ; Manichaean_Sadhe + jg ; Manichaean_Samekh ; Manichaean_Samekh + jg ; Manichaean_Taw ; Manichaean_Taw + jg ; Manichaean_Ten ; Manichaean_Ten + jg ; Manichaean_Teth ; Manichaean_Teth + jg ; Manichaean_Thamedh ; Manichaean_Thamedh + jg ; Manichaean_Twenty ; Manichaean_Twenty + jg ; Manichaean_Waw ; Manichaean_Waw + jg ; Manichaean_Yodh ; Manichaean_Yodh + jg ; Manichaean_Zayin ; Manichaean_Zayin + jg ; Straight_Waw ; Straight_Waw + -> uchar.h & UCharacter.JoiningGroup +- 23 new Script (sc) values: + sc ; Aghb ; Caucasian_Albanian + sc ; Bass ; Bassa_Vah + sc ; Dupl ; Duployan + sc ; Elba ; Elbasan + sc ; Gran ; Grantha + sc ; Hmng ; Pahawh_Hmong + sc ; Khoj ; Khojki + sc ; Lina ; Linear_A + sc ; Mahj ; Mahajani + sc ; Mani ; Manichaean + sc ; Mend ; Mende_Kikakui + sc ; Modi ; Modi + sc ; Mroo ; Mro + sc ; Narb ; Old_North_Arabian + sc ; Nbat ; Nabataean + sc ; Palm ; Palmyrene + sc ; Pauc ; Pau_Cin_Hau + sc ; Perm ; Old_Permic + sc ; Phlp ; Psalter_Pahlavi + sc ; Sidd ; Siddham + sc ; Sind ; Khudawadi + sc ; Tirh ; Tirhuta + sc ; Wara ; Warang_Citi + -> uscript.h (many were added before) + comment "Mende Kikakui" for USCRIPT_MENDE + add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias + -> com.ibm.icu.lang.UScript + find USCRIPT_([^ ]+) *= ([0-9]+),(.+) + replace public static final int \1 = \2; \3 +- 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html + (added 2012-11-01) + Ahom 338 Ahom + Hatr 127 Hatran + 
Mult 323 Multani + (added 2013-10-12) + Modi 324 Modi + Pauc 263 Pau Cin Hau + Sidd 302 Siddham + -> uscript.h (some overlap with additions from Unicode) + -> com.ibm.icu.lang.UScript + find USCRIPT_([^ ]+) *= ([0-9]+),(.+) + replace public static final int \1 = \2; \3 + -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924 + -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() + and in com.ibm.icu.dev.test.lang.TestUScript.java + +* update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata + (not strictly necessary for NOT_ENCODED scripts) + ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt + +* generate normalization data files +- cd $ICU_ROOT/dbg +- export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib +- SRC_DATA_IN=$ICU_SRC_DIR/source/data/in +- UNIDATA=$ICU_SRC_DIR/source/data/unidata +- bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource +- bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt +- bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt +- bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt +- bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt + +* build ICU (make install) + so that the tools build can pick up the new definitions from the installed header files. + +~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt + +* build Unicode tools using CMake+make + +~/svn.icutools/trunk/src/unicode/c/icudefs.txt: + +# Location (--prefix) of where ICU was installed. +set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst) +# Location of the ICU source tree. 
+set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src) + +~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c +~/svn.icutools/trunk/dbg/unicode/c$ make + +* genprops work +- new code point range for Joining_Group values: 10AC0..10AFF Manichaean + + add second array of Joining_Group values for at most 10800..10FFF + icutools: unicode/c/genprops/bidipropsbuilder.cpp + icu: source/common/ubidi_props.h/.c/_data.h + icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java + +* generate core properties data files +- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR +- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR +- rebuild ICU (make install) & tools +- run genuca again (see step above) so that it picks up the new nfc.nrm +- rebuild ICU (make install) & tools + +* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to + sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) +- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters +- Unicode 6.0..7.0: U+2260, U+226E, U+226F +- nothing new in 7.0, no test file to update + +* run & fix ICU4C tests + +* update Java data files +- refresh just the UCD-related files, just to be safe +- see (ICU4C)/source/data/icu4j-readme.txt +- mkdir /tmp/icu4j +- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + output: + ... 
+ Unicode .icu files built to ./out/build/icudt53l + echo timestamp > uni-core-data + mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b + mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b + echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt + LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b + mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b" + jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data + jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data + make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data' +- copy the big-endian Unicode data files to another location, + separate from the other data files + ICUDT=icudt54b + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr + cd ~/svn.icu/uni70/dbg/data/out/icu4j + cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu + cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT + cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr 
+- refresh ICU4J + ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT + +* update CollationFCD.java + + copy & paste the initializers of lcccIndex[] etc. from + ICU4C/source/i18n/collationfcd.cpp to + ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java + +* refresh Java test .txt files +- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode + cd $ICU_SRC_DIR/source/data/unidata + cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode + cd ../../test/testdata + cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode + cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode + +* UCA + +- download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA// +- run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata) +- update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/ +- run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA +- output files are in ~/svn.unitools/Generated/uca/7.0.0/ +- review data; compare files, use blankweights.sed or similar + ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt +- cd ~/svn.unitools/Generated/uca/7.0.0/ +- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt + cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt +- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt + (note removing the underscore before "Rules") + cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt +- 
update (ICU4C)/source/test/testdata/CollationTest_*.txt + and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt + with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt) + cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt + cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt + cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data +- run genuca, see command line above +- rebuild ICU4C +- refresh ICU4J collation data: + (subset of instructions above for properties data refresh, except copies all coll/*) + ICUDT=icudt54b + ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll + ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT +- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging) +- note on intltest: if collate/UCAConformanceTest fails, then + utility/MultithreadTest/TestCollators will fail as well; + fix the conformance test before looking into the multi-thread test +- copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors +- copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch + ~/svn.unitools$ cp Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/ + +* When refreshing all of ICU4J data from ICU4C +- ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data +or +- ~/svn.icu/uni70/dbg$ make 
ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install + +* run & fix ICU4J tests + +*** LayoutEngine script information + +(For details see the Unicode 5.2 change log below.) + +* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. + This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp + in the working directory. + (It also generates ScriptRunData.cpp, which is no longer needed.) + + The generated files have a current copyright date and "@stable" statement. + ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java + for "born stable" Unicode API constants, and to stop parsing ICU version numbers + which may not contain dots any more. + +- diff current /source/layout files vs. generated ones + ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout + review and manually merge desired changes; + fix gratuitous changes, incorrect @draft/@stable and missing aliases; + Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc. +- if you just copy the above files, then + fix mixed line endings, review the diffs as above and restore changes to API tags etc.; + manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h + +*** API additions +- send notice to icu-design about new born-@stable API (enum constants etc.) 
+ +*** merge the Unicode update branches back onto the trunk +- do not merge the icudata.jar and testdata.jar, + instead rebuild them from merged & tested ICU4C + +---------------------------------------------------------------------------- *** + +Unicode 6.3 update + +http://www.unicode.org/review/pri249/ -- beta review +http://www.unicode.org/reports/uax-proposed-updates.html +http://www.unicode.org/versions/beta-6.3.0.html#notable_issues +http://www.unicode.org/reports/tr44/tr44-11.html + +*** ICU Trac + +- ticket 10128: update ICU to Unicode 6.3 beta +- ticket 10168: update ICU to Unicode 6.3 final +- C++ branches/markus/uni63 at r33552 from trunk at r33551 +- Java branches/markus/uni63 at r33550 from trunk at r33553 + +- ticket 10142: implement Unicode 6.3 bidi algorithm additions + +*** Unicode version numbers +- makedata.mak +- uchar.h + (configure.in & configure: have been modified to extract the version from uchar.h) +- com.ibm.icu.util.VersionInfo +- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ + +- Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h + so that the makefiles see the new version number. + +*** data files & enums & parser code + +* file preparation + +- download UCD, UCA & IDNA files +- make sure that the Unicode data folder passed into preparseucd.py + includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder) +- modify preparseucd.py: + parse new file BidiBrackets.txt + with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type +- ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src +- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. +- Check test file diffs for previously commented-out, known-failing data lines; + probably need to keep those commented out. 
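For reference, a minimal sketch of reading BidiBrackets.txt for the two new properties (bpb and bpt). This is not the code in preparseucd.py: the helper name parse_bidi_brackets and the returned dict layout are assumptions made for illustration. The sketch relies only on the published file format, in which each data line holds a code point, its paired bracket, and the type (o or c), followed by an optional # comment.

```python
def parse_bidi_brackets(lines):
    """Hypothetical helper: map code point -> (bpb code point, bpt letter)."""
    props = {}
    for line in lines:
        # Drop the trailing comment, then skip blank and comment-only lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) != 3:
            raise ValueError("unexpected BidiBrackets.txt line: %r" % line)
        cp = int(fields[0], 16)
        bpb = int(fields[1], 16)
        bpt = fields[2]
        # Only Open and Close are listed explicitly in the file.
        if bpt not in ("o", "c"):
            raise ValueError("unexpected bpt value: %r" % bpt)
        props[cp] = (bpb, bpt)
    return props


# Sample input in the BidiBrackets.txt line format.
sample = [
    "# sample data",
    "0028; 0029; o # LEFT PARENTHESIS",
    "0029; 0028; c # RIGHT PARENTHESIS",
]
brackets = parse_bidi_brackets(sample)
```

Code points without an entry keep the default bpt=n (None), so a caller would treat a missing key as "not a paired bracket".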
+ +* PropertyAliases.txt changes +- 1 new Enumerated Property + bpt ; Bidi_Paired_Bracket_Type + -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType + -> ubidi_props.h & .c & UBiDiProps.java + -> remember to write the max value at UBIDI_MAX_VALUES_INDEX + -> uprops.cpp + -> change ubidi.icu format version from 2.0 to 2.1 +- 1 new Miscellaneous Property + bpb ; Bidi_Paired_Bracket + -> uchar.h & UProperty.java + -> ppucd.h & .cpp + +* PropertyValueAliases.txt changes +- 3 Bidi_Paired_Bracket_Type (bpt) values: + bpt; c ; Close + bpt; n ; None + bpt; o ; Open + -> uchar.h & UCharacter.BidiPairedBracketType + -> ubidi_props.h & .c & UBiDiProps.java + -> change ubidi.icu format version from 2.0 to 2.1 +- 4 new Bidi_Class (bc) values: + bc ; FSI ; First_Strong_Isolate + bc ; LRI ; Left_To_Right_Isolate + bc ; RLI ; Right_To_Left_Isolate + bc ; PDI ; Pop_Directional_Isolate + -> uchar.h & UCharacterEnums.ECharacterDirection + -> until the bidi code gets updated, + Roozbeh suggests mapping the new bc values to ON (Other_Neutral) +- 3 new Word_Break (WB) values: + WB ; HL ; Hebrew_Letter + WB ; SQ ; Single_Quote + WB ; DQ ; Double_Quote + -> uchar.h & UCharacter.WordBreak + -> first time Word_Break numeric constants exceed 4 bits (now 17 values) +- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html + (added 2012-10-16) + Aghb 239 Caucasian Albanian + Mahj 314 Mahajani + -> uscript.h + -> com.ibm.icu.lang.UScript + find USCRIPT_([^ ]+) *= ([0-9]+),(.+) + replace public static final int \1 = \2;\3 + -> preparseucd.py _scripts_only_in_iso15924 + -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() + and in com.ibm.icu.dev.test.lang.TestUScript.java + -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata + (not strictly necessary for NOT_ENCODED scripts) + +* generate normalization data files +- ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib +- 
~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in +- ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata +- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt +- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt +- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt +- ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt + +* build ICU (make install) + so that the tools build can pick up the new definitions from the installed header files. + +~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt + +* build Unicode tools using CMake+make + +~/svn.icutools/trunk/src/unicode/c/icudefs.txt: + +# Location (--prefix) of where ICU was installed. +set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst) +# Location of the ICU source tree. +set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src) + +~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c +~/svn.icutools/trunk/dbg/unicode/c$ make + +* generate core properties data files +- ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src +- ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src +- rebuild ICU (make install) & tools +- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm +- rebuild ICU (make install) & tools + +* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to + sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) +- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters +- Unicode 6.0..6.3: U+2260, U+226E, U+226F +- nothing new in 6.3, no test file to update + +* update Java data files +- refresh just the UCD-related files, just to be safe +- see 
(ICU4C)/source/data/icu4j-readme.txt +- mkdir /tmp/icu4j +- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + output: + ... + Unicode .icu files built to ./out/build/icudt52l + mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b + mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b + echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt + LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b + mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b" + jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data + jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data + make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data' +- copy the big-endian Unicode data files to another location, + separate from the other data files + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr + ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b + ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu + ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b + ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu 
/tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll + ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr +- refresh ICU4J + ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b + +* refresh Java test .txt files +- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode + +* UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files + +- get output from Mark's tools; look in http://www.unicode.org/Public/UCA// +- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that +- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt +- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt + (note removing the underscore before "Rules") +- update (ICU4C)/source/test/testdata/CollationTest_*.txt + and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt + with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt) +- check test file diffs for previously commented-out, known-failing data lines; + probably need to keep those commented out +- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani +- run genuca, see command line above +- rebuild ICU4C +- refresh ICU4J collation data: + (subset of instructions above for properties data refresh, except copies all coll/*) + ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll + ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll + ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b +- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging) +- note on 
intltest: if collate/UCAConformanceTest fails, then + utility/MultithreadTest/TestCollators will fail as well; + fix the conformance test before looking into the multi-thread test + +* test ICU, fix test code where necessary + +* When refreshing all of ICU4J data from ICU4C +- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data +or +- ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install + +*** LayoutEngine script information +- skipped for Unicode 6.3: no new scripts + +*** merge the Unicode update branches back onto the trunk +- do not merge the icudata.jar and testdata.jar, + instead rebuild them from merged & tested ICU4C + +---------------------------------------------------------------------------- *** + +Unicode 6.2 update + +http://www.unicode.org/review/pri230/ +http://www.unicode.org/versions/beta-6.2.0.html +http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0 +http://www.unicode.org/review/pri227/ Changes to Script Extensions Property Values +http://www.unicode.org/review/pri228/ Changing some common characters from Punctuation to Symbol +http://www.unicode.org/review/pri229/ Linebreaking Changes for Pictographic Symbols +http://www.unicode.org/reports/tr46/tr46-8.html IDNA +http://unicode.org/Public/idna/6.2.0/ + +*** ICU Trac + +- ticket 9515: Unicode 6.2: final ICU update + +- ticket 9514: UCA 6.2: fix UCARules.txt + +- ticket 9437: update ICU to Unicode 6.2 +- C++ branches/markus/uni62 at r32050 from trunk at r32041 +- Java branches/markus/uni62 at r32068 from trunk at r32066 + +*** Unicode version numbers +- makedata.mak +- uchar.h + (configure.in & configure: have been modified to extract the version from uchar.h) +- com.ibm.icu.util.VersionInfo +- com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ + +*** data files & enums & parser code + +* file preparation + +- download UCD, UCA & IDNA files +- make sure that the 
Unicode data folder passed into preparseucd.py + includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder) +- modify preparseucd.py: NamesList.txt is now in UTF-8 +- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src +- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. +- Check test file diffs for previously commented-out, known-failing data lines; + probably need to keep those commented out. + +* PropertyValueAliases.txt changes +- 1 new Line_Break (lb) value: + lb ; RI ; Regional_Indicator + -> uchar.h & UCharacter.LineBreak +- 1 new Word_Break (WB) value: + WB ; RI ; Regional_Indicator + -> uchar.h & UCharacter.WordBreak +- 1 new Grapheme_Cluster_Break (GCB) value: + GCB; RI ; Regional_Indicator + -> uchar.h & UCharacter.GraphemeClusterBreak + +* 3 new numeric values + The new value -1, which was really supposed to be NaN but that would have required + new UnicodeData.txt syntax, can already be represented as a "fraction" of -1/1, + but encodeNumericValue() in corepropsbuilder.cpp had to be fixed. + cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1 + cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1 + The two new values 216000 and 432000 require an addition to the encoding of numeric values. 
+ cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000 + cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000 + -> uprops.h, uchar.c & UCharacterProperty.java + -> cucdtst.c & UCharacterTest.java + +* generate normalization data files +- ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib +- ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in +- ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata +- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt +- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt +- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt +- ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt + +* build ICU (make install) + so that the tools build can pick up the new definitions from the installed header files. +* build Unicode tools using CMake+make + +* generate core properties data files +- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src +- in initial bootstrapping, change the UCA version + in source/data/unidata/FractionalUCA.txt to match the new Unicode version +- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src +- rebuild ICU (make install) & tools + + if genrb fails to build coll/root.res with an U_INVALID_FORMAT_ERROR, + check if the UCA version in FractionalUCA.txt matches the new Unicode version + (see step above) +- run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm +- rebuild ICU (make install) & tools + +* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to + sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) +- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" 
on non-ASCII characters +- Unicode 6.0..6.2: U+2260, U+226E, U+226F +- nothing new in 6.2, no test file to update + +* update Java data files +- refresh just the UCD-related files, just to be safe +- see (ICU4C)/source/data/icu4j-readme.txt +- mkdir /tmp/icu4j +- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + output: + ... + Unicode .icu files built to ./out/build/icudt50l + mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b + mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b + echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt + LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b + mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b" + jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data + jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data + make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data' +- copy the big-endian Unicode data files to another location, + separate from the other data files + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr + ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b + ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu + 
~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b + ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll + ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr +- refresh ICU4J + ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b + +* refresh Java test .txt files +- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode + +* UCA + +- get output from Mark's tools; look in http://www.unicode.org/Public/UCA// +- CLDR root files for ICU are in CollationAuxiliary.zip; unpack that +- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt +- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt + (note removing the underscore before "Rules") +- update (ICU4C)/source/test/testdata/CollationTest_*.txt + and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt + with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt) +- check test file diffs for previously commented-out, known-failing data lines; + probably need to keep those commented out +- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani +- run genuca, see command line above +- rebuild ICU4C +- refresh ICU4J collation data: + (subset of instructions above for properties data refresh, except copies all coll/*) + ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll + ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll + ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j 
com/ibm/icu/impl/data/icudt50b +- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging) +- note on intltest: if collate/UCAConformanceTest fails, then + utility/MultithreadTest/TestCollators will fail as well; + fix the conformance test before looking into the multi-thread test + +* test ICU, fix test code where necessary + +* When refreshing all of ICU4J data from ICU4C +- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data +or +- ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install + +*** LayoutEngine script information +- skipped for Unicode 6.2: no new scripts + +*** merge the Unicode update branches back onto the trunk +- do not merge the icudata.jar and testdata.jar, + instead rebuild them from merged & tested ICU4C + +---------------------------------------------------------------------------- *** + +Future Unicode update + +Tools simplified since the Unicode 6.1 update. See +- http://site.icu-project.org/design/props/ppucd +- http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972 + +* Unicode version numbers +- icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates + +* file preparation +- ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py: +- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src +- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. +- Check test file diffs for previously commented-out, known-failing data lines; + probably need to keep those commented out. + +* PropertyValueAliases.txt changes +- Script codes that are in ISO 15924 but not in Unicode are now listed in + preparseucd.py, in the _scripts_only_in_iso15924 variable. + If there are new ISO codes, then add them. 
+ If Unicode adds some of them, then remove them from the .py variable. + +* UnicodeData.txt changes +- No more manual changes for CJK ranges for algorithmic names; + those are now written to ppucd.txt and genprops reads them from there. + +* generate core properties data files (makeprops.sh was deleted) +- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src + +* no more manual updates of source/data/unidata/norm2/nfkc_cf.txt +- it is now generated by preparseucd.py + +* no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt +- it is now generated by preparseucd.py +- make sure that the Unicode data folder passed into preparseucd.py + includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt + (can be in some subfolder) + +* generate normalization data files +- ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib +- ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in +- ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata +- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt +- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt +- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt +- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt + +* build ICU (make install) +* build Unicode tools using CMake+make + +* new way to call genuca (makeuca.sh was deleted) +- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src + +---------------------------------------------------------------------------- *** + +Unicode 6.1 update + +*** ICU Trac + +- ticket 8995 final update to Unicode 6.1 +- ticket 8994 regenerate source/layout/CanonData.cpp + +- ticket 8961 support Unicode "Age" value *names* +- ticket 
8963 support multiple character name aliases & types + +- ticket 8827 "update ICU to Unicode 6.1" +- C++ branches/markus/uni61 at r30864 from trunk at r30843 +- Java branches/markus/uni61 at r30865 from trunk at r30863 + +*** Unicode version numbers +- makedata.mak +- uchar.h + (configure.in & configure: have been modified to extract the version from uchar.h) +- com.ibm.icu.util.VersionInfo +- icutools/unicode/makedefs.sh + + also review & update other definitions in that file, + e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l + +*** data files & enums & parser code + +* file preparation + +~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed +- This prepares both unidata and testdata files in respective output subfolders. +- Check test file diffs for previously commented-out, known-failing data lines; + probably need to keep those commented out. + +* PropertyValueAliases.txt changes +- 11 new block names: + Arabic_Extended_A + Arabic_Mathematical_Alphabetic_Symbols + Chakma + Meetei_Mayek_Extensions + Meroitic_Cursive + Meroitic_Hieroglyphs + Miao + Sharada + Sora_Sompeng + Sundanese_Supplement + Takri + -> add to uchar.h + -> add to UCharacter.UnicodeBlock IDs + Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) + replace public static final int \1_ID = \2; \3 + -> add to UCharacter.UnicodeBlock objects + Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) + replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 +- 1 new Joining_Group (jg) value: + Rohingya_Yeh + -> uchar.h & UCharacter.JoiningGroup +- 2 new Line_Break (lb) values: + CJ=Conditional_Japanese_Starter + HL=Hebrew_Letter + -> uchar.h & UCharacter.LineBreak +- 7 new scripts: + sc ; Cakm ; Chakma + sc ; Merc ; Meroitic_Cursive + sc ; Mero ; Meroitic_Hieroglyphs + sc ; Plrd ; Miao + sc ; Shrd ; Sharada + sc ; Sora ; Sora_Sompeng + sc ; Takr ; Takri + -> remove these from SyntheticPropertyValueAliases.txt + 
-> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() + and in com.ibm.icu.dev.test.lang.TestUScript.java +- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html + (added 2011-06-21) + Khoj 322 Khojki + Tirh 326 Tirhuta + and another one added 2011-12-09 + Hluw 080 Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs) + -> uscript.h + -> com.ibm.icu.lang.UScript + find USCRIPT_([^ ]+) *= ([0-9]+),(.+) + replace public static final int \1 = \2;\3 + -> SyntheticPropertyValueAliases.txt + -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() + and in com.ibm.icu.dev.test.lang.TestUScript.java + +* UnicodeData.txt changes +- the last Unihan code point changes from U+9FCB to U+9FCC + search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive) + + do change gennames.c + + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java + +* DerivedBidiClass.txt changes +- 2 new default-AL blocks: +# Arabic Extended-A: U+08A0 - U+08FF (was default-R) +# Arabic Mathematical Alphabetic Symbols: +# U+1EE00 - U+1EEFF (was default-R) +- 2 new default-R blocks: +# Meroitic Hieroglyphs: +# U+10980 - U+1099F +# Meroitic Cursive: U+109A0 - U+109FF + -> should be picked up by the explicit data in the file + +* NameAliases.txt changes +- from + # Each line has two fields + # First field: Code point + # Second field: Alias +- to + # Each line has three fields, as described here: + # + # First field: Code point + # Second field: Alias + # Third field: Type +- Also, the file previously allowed multiple aliases but only now does it + actually provide multiple, even multiple of the same type. For example, + FEFF;BYTE ORDER MARK;alternate + FEFF;BOM;abbreviation + FEFF;ZWNBSP;abbreviation +- This breaks our gennames parser, unames.icu data structure, and API. + Fix gennames to only pick up "correction" aliases. + New ticket #8963 for further changes. 
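For illustration, a minimal Python sketch of reading the new three-field NameAliases.txt lines and keeping only the "correction" aliases. The actual fix was made in gennames (C code); the helper name correction_aliases and the returned dict layout are assumptions, and the 01A2 line is a sample correction entry in the same format, added here because the FEFF lines quoted above contain no correction alias.

```python
def correction_aliases(lines):
    """Hypothetical helper: {code point: [alias, ...]} for type "correction" only."""
    result = {}
    for line in lines:
        # Drop comments, then skip blank and comment-only lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) != 3:
            raise ValueError("expected 3 fields: %r" % line)
        cp, alias, alias_type = fields
        # A code point may now have several aliases, even several of one type,
        # so collect them in a list rather than overwriting.
        if alias_type == "correction":
            result.setdefault(int(cp, 16), []).append(alias)
    return result


sample = [
    "# Each line has three fields",
    "01A2;LATIN CAPITAL LETTER GHA;correction",
    "FEFF;BYTE ORDER MARK;alternate",
    "FEFF;BOM;abbreviation",
    "FEFF;ZWNBSP;abbreviation",
]
aliases = correction_aliases(sample)
```

Filtering by type keeps the old one-alias-per-character assumption mostly intact for consumers of the data, while the alternate and abbreviation aliases wait for the follow-up work in ticket #8963.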
+ +* run genpname/preparse.pl (on Linux) + + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname + + make sure that data.h is writable + + perl preparse.pl ~/svn.icu/trunk/src > out.txt + + preparse.pl shows no errors, out.txt Info and Warning lines look ok + +* build ICU (make install) + so that the tools build can pick up the new definitions from the installed header files. +* build Unicode tools (at least genpname) using CMake+make + +* run genpname + (builds both pnames.icu and propname_data.h) +- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in +- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource + +* build ICU (make install) +* build Unicode tools using CMake+make + +* update source/data/unidata/norm2/nfkc_cf.txt +- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt + +* update source/data/unidata/norm2/uts46.txt +- download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt + to ~/svn.icu/tools/trunk/src/unicode/py +- adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008". 
+- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py +- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2 + +* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to + sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) +- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters +- Unicode 6.0..6.1: U+2260, U+226E, U+226F +- nothing new in 6.1, no test file to update + +* generate core properties data files +- in initial bootstrapping, change the UCA version + in source/data/unidata/FractionalUCA.txt to match the new Unicode version +- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld +- rebuild ICU & tools + + if genrb fails to build coll/root.res with an U_INVALID_FORMAT_ERROR, + check if the UCA version in FractionalUCA.txt matches the new Unicode version + (see step above) +- run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm: + ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld +- rebuild ICU & tools + +* update Java data files +- refresh just the UCD-related files, just to be safe +- see (ICU4C)/source/data/icu4j-readme.txt +- mkdir /tmp/icu4j +- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + output: + ... 
+ Unicode .icu files built to ./out/build/icudt49l + mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b + mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b + echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt + LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b + mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b" + jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data + jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/ + mkdir -p /tmp/icu4j/main/shared/data + cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data + make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data' +- copy the big-endian Unicode data files to another location, + separate from the other data files + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr + ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b + ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu + ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b + ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll + ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* 
/tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr +- refresh ICU4J + ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b + +* refresh Java test .txt files +- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode + +* test ICU so far, fix test code where necessary +- temporarily ignore collation issues that look like UCA/UCD mismatches, + until UCA data is updated + +* UCA + +- get output from Mark's tools; look in + http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-.txt +- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt +- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt + (note removing the underscore before "Rules") +- update (ICU)/source/test/testdata/CollationTest_*.txt + and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt + with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt) +- check test file diffs for previously commented-out, known-failing data lines; + probably need to keep those commented out +- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani +- run makeuca.sh: + ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld +- rebuild ICU4C +- refresh ICU4J collation data: + (subset of instructions above for properties data refresh, except copies all coll/*) + ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install + ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll + ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll + ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b +- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging) +- note on intltest: if 
collate/UCAConformanceTest fails, then + utility/MultithreadTest/TestCollators will fail as well; + fix the conformance test before looking into the multi-thread test + +* When refreshing all of ICU4J data from ICU4C +- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data +or +- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install + +*** LayoutEngine script information + +(For details see the Unicode 5.2 change log below.) + +* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. + This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp + in the working directory. + (It also generates ScriptRunData.cpp, which is no longer needed.) + + The generated files have a current copyright date and "@draft" statement. + +- diff current /source/layout files vs. generated ones + ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout + review and manually merge desired changes; + fix gratuitous changes, incorrect @draft and missing aliases; + Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc. 
+- if you just copy the above files, then + fix mixed line endings, review the diffs as above and restore changes to API tags etc.; + manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h + +*** merge the Unicode update branches back onto the trunk +- do not merge the icudata.jar and testdata.jar, + instead rebuild them from merged & tested ICU4C + +---------------------------------------------------------------------------- *** + +ICU 4.8 (no Unicode update, just new script codes) + +* 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html + (added 2010-12-21) + Afak 439 Afaka + Jurc 510 Jurchen + Mroo 199 Mro, Mru + Nshu 499 Nüshu + Shrd 319 Sharada, Śāradā + Sora 398 Sora Sompeng + Takr 321 Takri, Ṭākrī, Ṭāṅkrī + Tang 520 Tangut + Wole 480 Woleai + -> uscript.h + -> com.ibm.icu.lang.UScript + find USCRIPT_([^ ]+) *= ([0-9]+),(.+) + replace public static final int \1 = \2;\3 + -> genpname/SyntheticPropertyValueAliases.txt + -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() + and in com.ibm.icu.dev.test.lang.TestUScript.java + +* run genpname/preparse.pl (on Linux) + + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname + + make sure that data.h is writable + + perl preparse.pl ~/svn.icu/trunk/src > out.txt + + preparse.pl shows no errors, out.txt Info and Warning lines look ok + +* rebuild Unicode tools (at least genpname) using make +- You might first need to "make install" ICU so that the tools build can pick + up the new definitions from the installed header files. 
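Note on the com.ibm.icu.lang.UScript step in the ICU 4.8 list above: the find/replace pair given there can also be applied with a short script instead of an editor. The following is only a sketch under assumptions: the helper name c_enum_to_java and the sample input lines (including their numeric values) are made up for illustration and are not taken from uscript.h. Only the pattern and the replacement string come from the note itself.

```python
import re

# Pattern and replacement copied from the find/replace note for UScript.
ENUM_LINE = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
JAVA_CONSTANT = r"public static final int \1 = \2;\3"


def c_enum_to_java(line):
    """Rewrite one C enum line as a Java int constant (hypothetical helper).

    Lines that do not match the pattern are returned unchanged.
    """
    return ENUM_LINE.sub(JAVA_CONSTANT, line)


if __name__ == "__main__":
    # Illustrative input only; the values are placeholders, not real data.
    sample = [
        "USCRIPT_AFAKA = 147,/* Afak */",
        "USCRIPT_TANGUT      = 154,/* Tang */",
    ]
    for line in sample:
        print(c_enum_to_java(line))
```

The pattern needs at least one character after the comma (normally the trailing comment), so an enum line that ends right after the comma is left as-is and has to be converted by hand. Column alignment before the "=" is collapsed to a single space.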
+ +* run genpname + (builds both pnames.icu and propname_data.h) +- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in +- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource +- rebuild ICU & tools + +* run genprops +- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0 +- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0 +- rebuild ICU & tools + +* update Java data files +- refresh just the UCD-related files, just to be safe +- see (ICU4C)/source/data/icu4j-readme.txt +- mkdir /tmp/icu4j +- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install +- copy the big-endian Unicode data files to another location, + separate from the other data files + mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b + ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b + ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b +- refresh ICU4J + ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b + +* should have updated the layout engine script codes but forgot + +---------------------------------------------------------------------------- *** + Unicode 6.0 update *** related ICU Trac tickets