icuSources/data/unidata/changes.txt

   1 * Copyright (C) 2004-2008, International Business Machines
   2 * Corporation and others.  All Rights Reserved.
   3 *
   4 *   file name:  changes.txt
   5 *   encoding:   US-ASCII
   6 *   tab size:   8 (not used)
   7 *   indentation:4
   8 *
   9 *   created on: 2004may06
  10 *   created by: Markus W. Scherer
  11 *
  12 * change log for Unicode updates
  13
  14 ---------------------------------------------------------------------------- ***
  15
  16 Unicode 5.1 update
  17
  18 *** related ICU Trac tickets
  19
  20 5696 Update to Unicode 5.1
  21
  22 *** Unicode version numbers
  23 - makedata.mak
  24 - uchar.h
  25 - configure.in & configure
  26 - update ucdVersion in gennames.c if an algorithmic range changes
  27
  28 *** data files & enums & parser code
  29
  30 * file preparation
  31 - ucdstrip:
  32     DerivedCoreProperties.txt
  33     DerivedNormalizationProps.txt
  34     NormalizationTest.txt
  35     PropList.txt
  36     Scripts.txt
  37     GraphemeBreakProperty.txt
  38     SentenceBreakProperty.txt
  39     WordBreakProperty.txt
  40 - ucdstrip and ucdmerge:
  41     EastAsianWidth.txt
  42     LineBreak.txt
  43
  44 * my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
  45 copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
  46 copy 5.1.0\ucd\Blocks.txt ..\unidata\
  47 copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
  48 copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
  49 copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
  50 copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
  51 copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
  52 copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
  53 copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
  54 copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
  55 copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
  56 copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
  57 copy 5.1.0\ucd\UnicodeData.txt ..\unidata\
  58
  59 ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
  60 ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
  61 ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
  62 ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
  63 ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
  64 ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
  65 ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
  66 ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
  67 ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
  68 ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
  69
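Because the batch file above has to be edited by hand for every UCD version, the command list could instead be generated from a single version string. The following is a minimal sketch, not part of the ICU tools: the function name ucd2unidata_commands and its parameters are hypothetical, and the file lists are copied from the listing above.

```python
import ntpath

# Files that are copied unchanged (paths relative to <version>\ucd\).
COPY_FILES = [
    r"BidiMirroring.txt", r"Blocks.txt", r"CaseFolding.txt", r"DerivedAge.txt",
    r"extracted\DerivedBidiClass.txt", r"extracted\DerivedJoiningGroup.txt",
    r"extracted\DerivedJoiningType.txt", r"extracted\DerivedNumericValues.txt",
    r"NormalizationCorrections.txt", r"PropertyAliases.txt",
    r"PropertyValueAliases.txt", r"SpecialCasing.txt", r"UnicodeData.txt",
]
# Files that go through ucdstrip only.
STRIP_FILES = [
    r"DerivedCoreProperties.txt", r"DerivedNormalizationProps.txt",
    r"NormalizationTest.txt", r"PropList.txt", r"Scripts.txt",
    r"auxiliary\GraphemeBreakProperty.txt",
    r"auxiliary\SentenceBreakProperty.txt",
    r"auxiliary\WordBreakProperty.txt",
]
# Files that go through ucdstrip and then ucdmerge.
STRIP_MERGE_FILES = [r"EastAsianWidth.txt", r"LineBreak.txt"]


def ucd2unidata_commands(version, dest="..\\unidata\\"):
    """Return the batch command lines for one UCD version (hypothetical helper)."""
    ucd = version + "\\ucd\\"
    cmds = ["copy " + ucd + name + " " + dest for name in COPY_FILES]
    for name in STRIP_FILES:
        # Output files land directly in dest, without the auxiliary\ prefix.
        cmds.append("ucdstrip < " + ucd + name + " > " + dest + ntpath.basename(name))
    for name in STRIP_MERGE_FILES:
        cmds.append("ucdstrip < " + ucd + name + " | ucdmerge > " + dest + name)
    return cmds


if __name__ == "__main__":
    print("\n".join(ucd2unidata_commands("5.1.0")))
```

Calling ucd2unidata_commands("5.0.0") should reproduce the command list used for the Unicode 5.0 update below.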
  70 * genpname
  71 - run preparse.pl
  72   + cd \svn\icuproj\icu\uni51\source\tools\genpname
  73   + make sure that data.h is writable
  74   + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
  75   + preparse.pl complains with errors like the following:
  76       Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
  77     This is because ICU 3.8 already had scripts from ISO 15924 which are now
  78     added to Unicode 5.1, so preparse.pl sees conflicting entries in SyntheticPropertyValueAliases.txt
  79     and PropertyValueAliases.txt.
  80     -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
  81        Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
  82   + PropertyValueAliases.txt now explicitly contains values for boolean properties:
  83       N/Y, No/Yes, F/T, False/True
  84     -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
  85        It will use further values from the file if present.
  86
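The handling of boolean value aliases described above (always provide N/No and Y/Yes, then pick up further names such as F/False and T/True if the file lists them) can be sketched as follows. This is an illustration only, not the Perl code in preparse.pl; the function name is hypothetical and the semicolon-separated line layout is an assumption about PropertyValueAliases.txt.

```python
def read_binary_value_aliases(lines):
    """Return {property: {alias: bool}} for binary property value lines.

    Assumed line layout: "prop ; short ; long [; more aliases...]  # comment".
    """
    defaults = {"N": False, "No": False, "Y": True, "Yes": True}
    result = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        prop, names = fields[0], fields[1:]
        if not names or names[0] not in ("N", "Y"):
            continue                           # not a binary value line
        value = names[0] == "Y"
        aliases = result.setdefault(prop, dict(defaults))
        for name in names:                     # includes F/False, T/True if present
            aliases[name] = value
    return result


if __name__ == "__main__":
    sample = ["AHex; N ; No ; F ; False", "AHex; Y ; Yes ; T ; True"]
    print(read_binary_value_aliases(sample))
```

A file that lists only N/No for a property still yields Y/Yes through the defaults, which matches the fallback behavior described for read_PropertyValueAliases.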
  87 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
  88 - new block & script values
  89   + 17 new blocks
  90   + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
  91     (removed from SyntheticPropertyValueAliases.txt)
  92   + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
  93     (added to SyntheticPropertyValueAliases.txt)
  94 - uprops.icu (uprops.h) only provides 7 bits for script codes.
  95   ICU 4.0 now has USCRIPT_CODE_LIMIT=130 script codes.
  96   None of the codes above 127 is yet the script code of an
  97   assigned Unicode character, so ICU 4.0 uprops.icu does not store any
  98   script code values greater than 127.
  99   However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
 100   in a parallel bit field, and that overflows now.
 101   Also, future values >=128 would be incompatible anyway.
 102   uprops.h is modified to move around several of the bit fields
 103   in the properties vector words, and now uses 8 bits for the script code.
 104   Two other bit fields also grow to accommodate future growth:
 105   Block (current count: 172) grows from 8 to 9 bits,
 106   and Word_Break grows from 4 to 5 bits.
 107 - renamed short alias of property Simple_Case_Folding (sfc->scf)
 108   + nothing to be done: handled as normal alias
 109 - new property JSN Jamo_Short_Name
 110   + no new API: only contributes to the Name property
 111 - new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
 112 - new Joining Group (JG) value: Burushashki_Yeh_Barree
 113 - new Sentence_Break (SB) values:
 114     SB ; CR        ; CR
 115     SB ; EX        ; Extend
 116     SB ; LF        ; LF
 117     SB ; SC        ; SContinue
 118 - new Word_Break (WB) values:
 119     WB ; CR        ; CR
 120     WB ; Extend    ; Extend
 121     WB ; LF        ; LF
 122     WB ; MB        ; MidNumLet
 123
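The bit-field arithmetic behind the uprops.h change above can be checked with a short sketch: the maximum script value 129 needs 8 bits, so it no longer fits a 7-bit field. The helper names and the field position (shift 0) are hypothetical and do not reflect the real layout of the properties vector words.

```python
def bits_needed(max_value):
    """Number of bits needed to store values 0..max_value."""
    return max(1, max_value.bit_length())


def pack_field(word, value, shift, bits):
    """Store value in a bit field of the given width; refuse values that overflow."""
    mask = (1 << bits) - 1
    if value > mask:
        raise OverflowError("%d does not fit in %d bits" % (value, bits))
    return (word & ~(mask << shift)) | (value << shift)


def unpack_field(word, shift, bits):
    """Read a bit field of the given width back out of word."""
    return (word >> shift) & ((1 << bits) - 1)


USCRIPT_CODE_LIMIT = 130                  # value quoted in the notes above
MAX_SCRIPT = USCRIPT_CODE_LIMIT - 1       # 129

if __name__ == "__main__":
    print(bits_needed(127), bits_needed(MAX_SCRIPT))    # 7 bits vs. 8 bits
    print(unpack_field(pack_field(0, MAX_SCRIPT, 0, 8), 0, 8))
```

By the same arithmetic the current block count of 172 still fits in 8 bits, so the move to 9 bits for Block is headroom for future growth, as stated above.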
 124 * Further changes in the 2008-02-29 update:
 125 - Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
 126   because they should not normally be invisible.
 127 - new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
 128 - new Grapheme_Cluster_Break (GCB) value: PP=Prepend
 129 - new Word_Break (WB) value: NL=Newline
 130
 131 * hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
 132 - Unihan range end moves from 9FBB to 9FC3
 133   search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
 134   + do change gennames.c
 135
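The search for the hardcoded Unihan end and limit values described above can be sketched as a small line scanner. The helper name is hypothetical and the sample lines are made up for illustration (modeled on the gennames.c line quoted in the Unicode 4.1 section); each hit still needs manual review, since not every occurrence should be changed.

```python
import re

# Matches both the old range end (9FBB) and the old limit (9FBC), any case.
UNIHAN_END_OR_LIMIT = re.compile(r"9FB[BC]", re.IGNORECASE)


def find_hardcoded_unihan(lines):
    """Return (line number, text) pairs for lines that mention 9FBB or 9FBC."""
    hits = []
    for number, line in enumerate(lines, 1):
        if UNIHAN_END_OR_LIMIT.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    sample = [
        "        0x4e00, 0x9fbb,\n",      # hypothetical source line: old end
        "    if(c < 0x9FBC) {\n",         # hypothetical source line: old limit
        "        0x3400, 0x4db5,\n",      # unrelated range, not reported
    ]
    for number, text in find_hardcoded_unihan(sample):
        print(number, text)
```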
 136 * build Unicode data source code for hardcoding core data
 137 C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data
 138
 139 ICU data make path is \svn\icuproj\icu\uni51\source\data\
 140 ICU root path is \svn\icuproj\icu\uni51
 141 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
 142 Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
 143 Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
 144 Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
 145 Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
 146 Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
 147 Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
 148 Creating data file for Unicode Character Properties
 149 Creating data file for Unicode Case Mapping Properties
 150 Creating data file for Unicode BiDi/Shaping Properties
 151 Creating data file for Unicode Normalization
 152 Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
 153 Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"
 154
 155 - copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
 156   and rebuild the common library
 157
 158 *** Break iterators
 159
 160 * Update break iterator rules to new UAX versions and new property values
 161
 162 *** UCA
 163
 164 * update FractionalUCA.txt and UCARules.txt with new canonical closure
 165
 166 *** Test suites
 167 - Test that APIs using Unicode property value aliases (like UnicodeSet)
 168   support all of the boolean values N/Y, No/Yes, F/T, False/True
 169   -> TestBinaryValues() tests in both cintltst and intltest
 170
 171 *** LayoutEngine script information
 172 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguage.h,
 173 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
 174 ScriptRunData.cpp, which is no longer needed.)
 175
 176 The generated files have a current copyright date and "@draft" statement.
 177
 178 * copy the above files into <icu>/source/layout, replacing the old files.
 179
 180 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
 181 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
 182
 183 * rebuild the layout and layoutex libraries.
 184
 185 *** Documentation
 186 - Update User Guide
 187   + Jamo_Short_Name, sfc->scf, binary property value aliases
 188
 189 ---------------------------------------------------------------------------- ***
 190
 191 Unicode 5.0 update
 192
 193 *** related Jitterbugs
 194
 195 5084 RFE: Update to Unicode 5.0
 196
 197 *** data files & enums & parser code
 198
 199 * file preparation
 200 - ucdstrip:
 201     DerivedCoreProperties.txt
 202     DerivedNormalizationProps.txt
 203     NormalizationTest.txt
 204     PropList.txt
 205     Scripts.txt
 206     GraphemeBreakProperty.txt
 207     SentenceBreakProperty.txt
 208     WordBreakProperty.txt
 209 - ucdstrip and ucdmerge:
 210     EastAsianWidth.txt
 211     LineBreak.txt
 212
 213 * my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
 214 copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
 215 copy 5.0.0\ucd\Blocks.txt ..\unidata\
 216 copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
 217 copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
 218 copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
 219 copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
 220 copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
 221 copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
 222 copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
 223 copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
 224 copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
 225 copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
 226 copy 5.0.0\ucd\UnicodeData.txt ..\unidata\
 227
 228 ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
 229 ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
 230 ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
 231 ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
 232 ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
 233 ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
 234 ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
 235 ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
 236 ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
 237 ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
 238
 239 * update FractionalUCA.txt and UCARules.txt with new canonical closure
 240
 241 * genpname
 242 - run preparse.pl
 243   + make sure that data.h is writable
 244   + perl preparse.pl \cvs\oss\icu > out.txt
 245
 246 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
 247 - new block & script values
 248   + script values already added in ICU 3.6 because all of ISO 15924 is now covered
 249
 250 * build Unicode data source code for hardcoding core data
 251 C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data
 252
 253 ICU data make path is \cvs\oss\icu\source\data\
 254 ICU root path is \cvs\oss\icu
 255 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
 256 [etc.]
 257 Creating data file for Unicode Character Properties
 258 Creating data file for Unicode Case Mapping Properties
 259 Creating data file for Unicode BiDi/Shaping Properties
 260 Creating data file for Unicode Normalization
 261 Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
 262 Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"
 263
 264 - copy the .c source files to C:\cvs\oss\icu\source\common
 265   and rebuild the common library
 266
 267 *** Unicode version numbers
 268 - makedata.mak
 269 - uchar.h
 270 - configure.in
 271
 272 *** LayoutEngine script information
 273 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguage.h,
 274 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
 275 ScriptRunData.cpp, which is no longer needed.)
 276
 277 The generated files have a current copyright date and "@draft" statement.
 278
 279 * copy the above files into <icu>/source/layout, replacing the old files.
 280
 281 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
 282 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
 283
 284 * rebuild the layout and layoutex libraries.
 285
 286 ---------------------------------------------------------------------------- ***
 287
 288 Unicode 4.1 update
 289
 290 *** related Jitterbugs
 291
 292 4332 RFE: Update to Unicode 4.1
 293 4157 RBBI, TR29 4.1 updates
 294
 295 *** data files & enums & parser code
 296
 297 * file preparation
 298 - ucdstrip:
 299     DerivedCoreProperties.txt
 300     DerivedNormalizationProps.txt
 301     NormalizationTest.txt
 302     GraphemeBreakProperty.txt
 303     SentenceBreakProperty.txt
 304     WordBreakProperty.txt
 305 - ucdstrip and ucdmerge:
 306     EastAsianWidth.txt
 307     LineBreak.txt
 308
 309 * add new files to the repository
 310     GraphemeBreakProperty.txt
 311     SentenceBreakProperty.txt
 312     WordBreakProperty.txt
 313
 314 * update FractionalUCA.txt and UCARules.txt with new canonical closure
 315
 316 * genpname
 317 - handle new enumerated properties in sub read_uchar
 318 - run preparse.pl
 319
 320 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
 321 - new binary properties
 322   + Pattern_Syntax
 323   + Pattern_White_Space
 324 - new enumerated properties
 325   + Grapheme_Cluster_Break
 326   + Sentence_Break
 327   + Word_Break
 328 - new block & script & line break values
 329
 330 * gencase
 331 - case-ignorable changes
 332   see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
 333   now: (D47a) Word_Break=MidLetter or General_Category in Mn, Me, Cf, Lm, Sk
 334
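The case-ignorable definition above can be expressed as a simple predicate. This sketch is not the gencase implementation: it uses the general categories from Python's unicodedata module (whose Unicode version differs from 4.1), and because that module has no Word_Break data, MID_LETTER_SAMPLE is only a small assumed subset of the Word_Break=MidLetter characters; the real set comes from WordBreakProperty.txt.

```python
import unicodedata

# General categories named in the definition above.
CASE_IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}

# Assumption: illustrative subset of Word_Break=MidLetter
# (COLON, MIDDLE DOT, HYPHENATION POINT).
MID_LETTER_SAMPLE = {"\u003A", "\u00B7", "\u2027"}


def is_case_ignorable(ch):
    """True if ch is MidLetter (sample set) or has one of the listed categories."""
    return (ch in MID_LETTER_SAMPLE
            or unicodedata.category(ch) in CASE_IGNORABLE_CATEGORIES)


if __name__ == "__main__":
    for ch in ("\u0301", "\u00AD", "\u00B7", "a"):
        print("U+%04X" % ord(ch), is_case_ignorable(ch))
```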
 335 *** Unicode version numbers
 336 - makedata.mak
 337 - uchar.h
 338 - configure.in
 339
 340 *** tests
 341 - verify that u_charMirror() round-trips
 342 - test all new properties and some new values of old properties
 343
 344 *** other code
 345
 346 * hardcoded Unihan range end/limit
 347 - Unihan range end moves from 9FA5 to 9FBB
 348   search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
 349   + do not modify BOCU/BOCSU code because that would change the encoding
 350     and break binary compatibility!
 351   + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
 352     NamePrepProfile.txt
 353   + ignore trietest.c: test data is arbitrary
 354   + ignore tstnorm.cpp: test optimization, not important
 355   + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
 356   + do change line_th.txt and word_th.txt
 357     by replacing hardcoded ranges with the new property values
 358   + do change gennames.c
 359
 360 source\data\brkitr\line_th.txt(229):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
 361 source\data\brkitr\word_th.txt(23):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
 362 source\tools\gennames\gennames.c(971):        0x4e00, 0x9fa5,
 363
 364 * case mappings
 365 - compare new special casing context conditions with previous ones
 366   see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
 367
 368 * genpname
 369 - consider storing only the short name if it is the same as the long name
 370
 371 *** other reviews
 372 - UAX #29 changes (grapheme/word/sentence breaks)
 373 - UAX #14 changes (line breaks)
 374 - Pattern_Syntax & Pattern_White_Space
 375
 376 ---------------------------------------------------------------------------- ***
 377
 378 Unicode 4.0.1 update
 379
 380 *** related Jitterbugs
 381
 382 3170 RFE: Update to Unicode 4.0.1
 383 3171 Add new Unicode 4.0.1 properties
 384 3520 use Unicode 4.0.1 updates for break iteration
 385
 386 *** data files & enums & parser code
 387
 388 * file preparation
 389 - ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
 390 - ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt
 391
 392 * file fixes
 393 - fix UnicodeData.txt general categories of Ethiopic digits Nd->No
 394   according to PRI #26
 395   http://www.unicode.org/review/resolved-pri.html#pri26
 396 - undone again because no corrigendum in sight;
 397   instead modified tests to not check consistency on this for Unicode 4.0.1
 398
 399 * ucdterms.txt
 400 - update from http://www.unicode.org/copyright.html
 401   formatted for plain text
 402
 403 * uchar.h & uprops.h & uprops.c & genprops
 404 - add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
 405 - add U_LB_INSEPARABLE due to a spelling fix
 406   + put short name comment only on line with new constant
 407     for genpname perl script parser
 408 - new binary properties
 409   + STerm
 410   + Variation_Selector
 411
 412 * genpname
 413 - fix genpname perl script so that it doesn't choke on more than 2 names per property value
 414 - perl script: correctly calculate the maximum number of fields per row
 415
 416 * uscript.h
 417 - new script code Hrkt=Katakana_Or_Hiragana
 418
 419 * gennorm.c: track changes in DerivedNormalizationProps.txt
 420 - "FNC" -> "FC_NFKC"
 421 - single field "NFD_NO" -> two fields "NFD_QC; N" etc.
 422
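A parser that accepts both the old and the new DerivedNormalizationProps.txt layouts described above might look like the following. It is a sketch only, not the gennorm.c code; the function name is hypothetical, and the old single-field names ending in _NO and _MAYBE as well as the sample lines are assumptions based on the two changes listed.

```python
def parse_norm_prop(line):
    """Return (code range, property, value) or None for blank/comment lines.

    Old layout: "range ; NFD_NO"      New layout: "range ; NFD_QC; N"
    Old name:   "FNC"                 New name:   "FC_NFKC"
    """
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    fields = [field.strip() for field in line.split(";")]
    if len(fields) < 2:
        return None
    code_range, name = fields[0], fields[1]
    if name == "FNC":                       # renamed property
        name = "FC_NFKC"
    if len(fields) >= 3:                    # new layout: value in its own field
        return code_range, name, fields[2]
    for suffix, value in (("_NO", "N"), ("_MAYBE", "M")):
        if name.endswith(suffix):           # old layout: value folded into the name
            return code_range, name[:-len(suffix)] + "_QC", value
    return code_range, name, None


if __name__ == "__main__":
    print(parse_norm_prop("00C0..00C5 ; NFD_NO"))
    print(parse_norm_prop("00C0..00C5 ; NFD_QC; N"))
```

Normalizing both layouts to the same tuple keeps the rest of a generator independent of which UCD version supplied the file.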
 423 * genprops/props2.c: track changes in DerivedNumericValues.txt
 424 - changed from 3 columns to 2, dropping the numeric type
 425   + assume that the type is always numeric for Han characters,
 426     and that only those are added in addition to what UnicodeData.txt lists
 427
 428 *** Unicode version numbers
 429 - makedata.mak
 430 - uchar.h
 431 - configure.in
 432
 433 *** tests
 434 - update test of default bidi classes according to PRI #28
 435   /tsutil/cucdtst/TestUnicodeData
 436   http://www.unicode.org/review/resolved-pri.html#pri28
 437 - bidi tests: change exemplar character for ES depending on Unicode version
 438 - change hardcoded expected property values where they change
 439
 440 *** other code
 441
 442 * name matching
 443 - read UCD.html
 444
 445 * scripts
 446 - use new Hrkt=Katakana_Or_Hiragana
 447
 448 * ZWJ & ZWNJ
 449 - are now part of combining character sequences
 450 - break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ