icuSources/data/unidata/changes.txt

* Copyright (C) 2004-2006, International Business Machines
* Corporation and others.  All Rights Reserved.
*
*   file name:  changes.txt
*   encoding:   US-ASCII
*   tab size:   8 (not used)
*   indentation:4
*
*   created on: 2004may06
*   created by: Markus W. Scherer
*
* change log for Unicode updates

---------------------------------------------------------------------------- ***

Unicode 5.0 update

*** related Jitterbugs

5084 RFE: Update to Unicode 5.0

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.txt (needs to be updated each time with UCD and file version numbers)
copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.0.0\ucd\Blocks.txt ..\unidata\
copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.0.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
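
For readers without the two filter tools at hand, the pipeline above can be approximated in Python. This is only a sketch under assumptions: it takes ucdstrip to remove '#' comments and empty lines, and ucdmerge to coalesce adjacent code point ranges that carry the same value. The function names, the sample lines and the exact output spacing are made up for illustration and are not taken from the real tools.

```python
def strip_comments(lines):
    """Drop '#' comments and empty lines (assumed role of ucdstrip)."""
    out = []
    for line in lines:
        text = line.split("#", 1)[0].rstrip()
        if text:
            out.append(text)
    return out


def parse_range(field):
    """Parse 'XXXX' or 'XXXX..YYYY' into an inclusive (start, end) pair."""
    start, _, end = field.strip().partition("..")
    return int(start, 16), int(end or start, 16)


def merge_ranges(lines):
    """Coalesce adjacent ranges with the same value (assumed role of ucdmerge)."""
    merged = []  # entries are [start, end, value]
    for line in lines:
        field, _, value = line.partition(";")
        start, end = parse_range(field)
        value = value.strip()
        if merged and merged[-1][2] == value and merged[-1][1] + 1 == start:
            merged[-1][1] = end  # extend the previous range
        else:
            merged.append([start, end, value])
    result = []
    for start, end, value in merged:
        rng = f"{start:04X}" if start == end else f"{start:04X}..{end:04X}"
        result.append(f"{rng};{value}")
    return result


# Hypothetical input in the style of EastAsianWidth.txt.
sample = [
    "# sample data, made up for illustration",
    "0020;Na # SPACE",
    "0021..0023;Na # EXCLAMATION MARK..NUMBER SIGN",
    "",
    "00A1;A # INVERTED EXCLAMATION MARK",
]
print(merge_ranges(strip_comments(sample)))
```

Chaining the two functions mirrors the "ucdstrip | ucdmerge" form used for EastAsianWidth.txt and LineBreak.txt; the files that only go through ucdstrip would use strip_comments alone.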

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- run preparse.pl
  + make sure that data.h is writable
  + perl preparse.pl \cvs\oss\icu > out.txt

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + script values already added in ICU 3.6 because all of ISO 15924 is now covered

* build Unicode data source code for hardcoding core data
C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data

ICU data make path is \cvs\oss\icu\source\data\
ICU root path is \cvs\oss\icu
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
[etc.]
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"

- copy the .c source files to C:\cvs\oss\icu\source\common
  and rebuild the common library

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguage.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.

---------------------------------------------------------------------------- ***

Unicode 4.1 update

*** related Jitterbugs

4332 RFE: Update to Unicode 4.1
4157 RBBI, TR29 4.1 updates

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* add new files to the repository
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- handle new enumerated properties in sub read_uchar
- run preparse.pl

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new binary properties
  + Pattern_Syntax
  + Pattern_White_Space
- new enumerated properties
  + Grapheme_Cluster_Break
  + Sentence_Break
  + Word_Break
- new block & script & line break values

* gencase
- case-ignorable changes
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
  now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk
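
The "now:" definition above reads naturally as a predicate. The following is a minimal sketch, not the gencase implementation: Python's unicodedata module supplies the general category (for whatever Unicode version the interpreter ships, not 4.1), and because the standard library does not expose Word_Break, the MidLetter set is passed in. SAMPLE_MID_LETTER is an illustrative subset, not the list from WordBreakProperty.txt, and the function name is hypothetical.

```python
import unicodedata

# Illustrative subset only; the real set comes from WordBreakProperty.txt.
SAMPLE_MID_LETTER = {0x0027, 0x00B7, 0x2019}

# General categories named in the definition.
CASE_IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}


def is_case_ignorable(cp, mid_letter=SAMPLE_MID_LETTER):
    """True for Word_Break=MidLetter or general category Mn, Me, Cf, Lm, Sk."""
    if cp in mid_letter:
        return True
    return unicodedata.category(chr(cp)) in CASE_IGNORABLE_CATEGORIES


print(is_case_ignorable(0x0027))  # APOSTROPHE, via the sample MidLetter set
print(is_case_ignorable(0x0301))  # COMBINING ACUTE ACCENT, category Mn
print(is_case_ignorable(0x0041))  # LATIN CAPITAL LETTER A, category Lu
```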

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- verify that u_charMirror() round-trips
- test all new properties and some new values of old properties
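
The round-trip check can also be run against the data file before the library is rebuilt. The sketch below assumes BidiMirroring.txt-style lines of the form "XXXX; YYYY # comment" and models the lookup as "return the code point itself when there is no mapping". The helper names and the four sample lines are hypothetical, and this is not the ICU test code.

```python
def parse_mirroring(lines):
    """Parse 'XXXX; YYYY # comment' lines into a {code point: mirror} dict."""
    table = {}
    for line in lines:
        text = line.split("#", 1)[0].strip()
        if not text:
            continue
        src, dst = (int(field.strip(), 16) for field in text.split(";"))
        table[src] = dst
    return table


def char_mirror(table, cp):
    """Return the mirror of cp, or cp itself when there is no mapping."""
    return table.get(cp, cp)


def round_trip_failures(table):
    """List code points whose mirror does not map back to them."""
    return [cp for cp in sorted(table)
            if char_mirror(table, char_mirror(table, cp)) != cp]


# Hypothetical excerpt in the style of BidiMirroring.txt.
sample = [
    "0028; 0029 # LEFT PARENTHESIS",
    "0029; 0028 # RIGHT PARENTHESIS",
    "003C; 003E # LESS-THAN SIGN",
    "003E; 003C # GREATER-THAN SIGN",
]
table = parse_mirroring(sample)
print(round_trip_failures(table))
```

An empty list means every mapped code point round-trips; a one-way mapping would show up as its source code point.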

*** other code

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FA5 to 9FBB
  search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
  + do not modify BOCU/BOCSU code because that would change the encoding
    and break binary compatibility!
  + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
    NamePrepProfile.txt
  + ignore trietest.c: test data is arbitrary
  + ignore tstnorm.cpp: test optimization, not important
  + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
  + do change line_th.txt and word_th.txt
    by replacing hardcoded ranges with the new property values
  + do change gennames.c

source\data\brkitr\line_th.txt(229):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\data\brkitr\word_th.txt(23):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\tools\gennames\gennames.c(971):        0x4e00, 0x9fa5,
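
The search that produced the hits above can be repeated with a few lines of Python. This is a sketch only: find_hits is a hypothetical helper that applies the case-insensitive regex 9FA[56] to one file's text and reports matches in the same "name(line): text" shape, and the sample string is made up rather than copied from gennames.c.

```python
import re

# Matches both the old range end (9FA5) and the old limit (9FA6).
UNIHAN_END = re.compile(r"9FA[56]", re.IGNORECASE)


def find_hits(name, text):
    """Return 'name(line): text' entries for lines that mention 9FA5 or 9FA6."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if UNIHAN_END.search(line):
            hits.append(f"{name}({number}): {line.strip()}")
    return hits


# Hypothetical file content for illustration.
sample = "static const uint32_t range[] = {\n    0x4e00, 0x9fa5,\n};\n"
for hit in find_hits("gennames.c", sample):
    print(hit)
```

Each hit still needs the manual triage listed above, since several matches (BOCU/BOCSU, GB 18030, NamePrepProfile.txt) must be left unchanged.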

* case mappings
- compare new special casing context conditions with previous ones
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods

* genpname
- consider storing only the short name if it is the same as the long name

*** other reviews
- UAX #29 changes (grapheme/word/sentence breaks)
- UAX #14 changes (line breaks)
- Pattern_Syntax & Pattern_White_Space

---------------------------------------------------------------------------- ***

Unicode 4.0.1 update

*** related Jitterbugs

3170 RFE: Update to Unicode 4.0.1
3171 Add new Unicode 4.0.1 properties
3520 use Unicode 4.0.1 updates for break iteration

*** data files & enums & parser code

* file preparation
- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt

* file fixes
- fix UnicodeData.txt general categories of Ethiopic digits Nd->No
  according to PRI #26
  http://www.unicode.org/review/resolved-pri.html#pri26
- undone again because no corrigendum in sight;
  instead modified tests to not check consistency on this for Unicode 4.0.1

* ucdterms.txt
- update from http://www.unicode.org/copyright.html
  formatted for plain text

* uchar.h & uprops.h & uprops.c & genprops
- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
- add U_LB_INSEPARABLE due to a spelling fix
  + put short name comment only on line with new constant
    for genpname perl script parser
- new binary properties
  + STerm
  + Variation_Selector

* genpname
- fix genpname perl script so that it doesn't choke on more than 2 names per property value
- perl script: correctly calculate the maximum number of fields per row

* uscript.h
- new script code Hrkt=Katakana_Or_Hiragana

* gennorm.c: track changes in DerivedNormalizationProps.txt
- "FNC" -> "FC_NFKC"
- single field "NFD_NO" -> two fields "NFD_QC; N" etc.
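
A parser that has to accept both the old and the new spelling could normalize each line as sketched below. This is an illustration in Python, not the gennorm.c code. It assumes semicolon-separated fields with '#' comments, and it assumes that the "etc." covers a MAYBE suffix mapping to "M"; only the FNC rename and the NFD_NO example come from the notes above. The function name and sample lines are hypothetical.

```python
OLD_NAMES = {"FNC": "FC_NFKC"}         # renamed property
QC_VALUES = {"NO": "N", "MAYBE": "M"}  # "MAYBE" -> "M" is an assumption


def parse_props_line(line):
    """Parse one data line into (range, property, value); value may be None."""
    text = line.split("#", 1)[0].strip()
    if not text:
        return None
    fields = [field.strip() for field in text.split(";")]
    rng, prop = fields[0], fields[1]
    value = fields[2] if len(fields) > 2 else None
    prop = OLD_NAMES.get(prop, prop)
    # Old single-field form such as "NFD_NO": split into "NFD_QC" and "N".
    form, _, suffix = prop.partition("_")
    if value is None and suffix in QC_VALUES:
        prop, value = f"{form}_QC", QC_VALUES[suffix]
    return rng, prop, value


# Hypothetical lines in the old and the new style.
print(parse_props_line("00C0..00C5 ; NFD_NO # old style"))
print(parse_props_line("00C0..00C5 ; NFD_QC; N # new style"))
print(parse_props_line("037A ; FNC; 0020 03B9 # old property name"))
```

Both NFD lines come out as the same tuple, so code after the parser only has to deal with the new property names.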

* genprops/props2.c: track changes in DerivedNumericValues.txt
- changed from 3 columns to 2, dropping the numeric type
  + assume that the type is always numeric for Han characters,
    and that only those are added in addition to what UnicodeData.txt lists

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- update test of default bidi classes according to PRI #28
  /tsutil/cucdtst/TestUnicodeData
  http://www.unicode.org/review/resolved-pri.html#pri28
- bidi tests: change exemplar character for ES depending on Unicode version
- change hardcoded expected property values where they change

*** other code

* name matching
- read UCD.html

* scripts
- use new Hrkt=Katakana_Or_Hiragana

* ZWJ & ZWNJ
- are now part of combining character sequences
- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ
 274 - are now part of combining character sequences
 275 - break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ