* Copyright (C) 2004-2008, International Business Machines
* Corporation and others.  All Rights Reserved.
*
*   file name:  changes.txt
*   encoding:   US-ASCII
*   tab size:   8 (not used)
*   indentation:4
*
*   created on: 2004may06
*   created by: Markus W. Scherer
*
* change log for Unicode updates

---------------------------------------------------------------------------- ***

Unicode 5.1 update

*** related ICU Trac tickets

5696 Update to Unicode 5.1

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in & configure
- update ucdVersion in gennames.c if an algorithmic range changes

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.1.0\ucd\Blocks.txt ..\unidata\
copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.1.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
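
Because the batch file has to be edited for every UCD version, its lines could instead be generated from a version string. The following is only a sketch in Python, assuming the directory layout and file lists shown above; the function and list names are made up for illustration and are not part of the ICU tools.

```python
# Hypothetical generator for the ucd2unidata.bat lines listed above.
# The file lists are copied from the Unicode 5.1 steps in this log.

COPY_FILES = [
    "BidiMirroring.txt",
    "Blocks.txt",
    "CaseFolding.txt",
    "DerivedAge.txt",
    "extracted\\DerivedBidiClass.txt",
    "extracted\\DerivedJoiningGroup.txt",
    "extracted\\DerivedJoiningType.txt",
    "extracted\\DerivedNumericValues.txt",
    "NormalizationCorrections.txt",
    "PropertyAliases.txt",
    "PropertyValueAliases.txt",
    "SpecialCasing.txt",
    "UnicodeData.txt",
]

STRIP_FILES = [
    "DerivedCoreProperties.txt",
    "DerivedNormalizationProps.txt",
    "NormalizationTest.txt",
    "PropList.txt",
    "Scripts.txt",
    "auxiliary\\GraphemeBreakProperty.txt",
    "auxiliary\\SentenceBreakProperty.txt",
    "auxiliary\\WordBreakProperty.txt",
]

STRIP_AND_MERGE_FILES = ["EastAsianWidth.txt", "LineBreak.txt"]


def base_name(path):
    """Drop any leading subdirectory such as extracted\\ or auxiliary\\."""
    return path.rsplit("\\", 1)[-1]


def ucd2unidata_lines(version):
    """Return the batch command lines for one UCD version, e.g. "5.1.0"."""
    src = version + "\\ucd\\"
    dest = "..\\unidata\\"
    lines = ["copy " + src + name + " " + dest for name in COPY_FILES]
    for name in STRIP_FILES:
        lines.append("ucdstrip < " + src + name + " > " + dest + base_name(name))
    for name in STRIP_AND_MERGE_FILES:
        lines.append(
            "ucdstrip < " + src + name + " | ucdmerge > " + dest + base_name(name)
        )
    return lines


if __name__ == "__main__":
    print("\n".join(ucd2unidata_lines("5.1.0")))
```

Calling ucd2unidata_lines("5.0.0") would produce the corresponding lines for the Unicode 5.0 update. The file lists still need review for each version, since new UCD files may appear.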

* genpname
- run preparse.pl
  + cd \svn\icuproj\icu\uni51\source\tools\genpname
  + make sure that data.h is writable
  + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
  + preparse.pl complains with errors like the following:
      Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
    This is because ICU 3.8 had scripts from ISO 15924 which are now
    added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt
    and PropertyValueAliases.txt.
    -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
       Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
  + PropertyValueAliases.txt now explicitly contains values for boolean properties:
      N/Y, No/Yes, F/T, False/True
    -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
       It will use further values from the file if present.
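
The four spellings of each boolean value can be pictured as a small alias table. This is a sketch in Python, not the preparse.pl code; the function name is hypothetical, and it assumes loose matching that ignores case, spaces, hyphens and underscores.

```python
# Sketch only: maps the boolean property value aliases named above
# (N/Y, No/Yes, F/T, False/True) to Python booleans.

BINARY_VALUE_ALIASES = {
    "n": False, "no": False, "f": False, "false": False,
    "y": True, "yes": True, "t": True, "true": True,
}


def parse_binary_value(alias):
    """Return True/False for a boolean property value alias.

    Assumption: matching ignores case, spaces, hyphens and underscores.
    """
    key = alias.lower()
    for ignored in (" ", "-", "_"):
        key = key.replace(ignored, "")
    if key not in BINARY_VALUE_ALIASES:
        raise ValueError("not a boolean property value alias: %r" % alias)
    return BINARY_VALUE_ALIASES[key]


if __name__ == "__main__":
    for alias in ("N", "Y", "No", "Yes", "F", "T", "False", "True"):
        print(alias, parse_binary_value(alias))
```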

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + 17 new blocks
  + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
    (removed from SyntheticPropertyValueAliases.txt)
  + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
    (added to SyntheticPropertyValueAliases.txt)
- uprops.icu (uprops.h) only provides 7 bits for script codes.
  ICU 4.0 now has USCRIPT_CODE_LIMIT=130 script codes.
  No script code above 127 is assigned to any Unicode character yet,
  so the ICU 4.0 uprops.icu does not store any per-character
  script code values greater than 127.
  However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
  in a parallel bit field, and that overflows now.
  Also, future values >=128 would be incompatible anyway.
  uprops.h is modified to move around several of the bit fields
  in the properties vector words, and now uses 8 bits for the script code.
  Two other bit fields also grow to accommodate future growth:
  Block (current count: 172) grows from 8 to 9 bits,
  and Word_Break grows from 4 to 5 bits.
- renamed property Simple_Case_Folding (sfc->scf)
  + nothing to be done: handled as normal alias
- new property JSN Jamo_Short_Name
  + no new API: only contributes to the Name property
- new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
- new Joining Group (JG) value: Burushashki_Yeh_Barree
- new Sentence_Break (SB) values:
    SB ; CR        ; CR
    SB ; EX        ; Extend
    SB ; LF        ; LF
    SB ; SC        ; SContinue
- new Word_Break (WB) values:
    WB ; CR        ; CR
    WB ; Extend    ; Extend
    WB ; LF        ; LF
    WB ; MB        ; MidNumLet
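
The bit-field overflow described in the uprops.h item above comes down to simple arithmetic. The sketch below (Python, hypothetical helper names) only checks field widths against the numbers quoted in this log; it does not reproduce the actual layout of the properties vector words. It assumes block values are numbered from 0.

```python
# Sketch only: width checks for the script and block bit fields discussed above.


def bits_needed(max_value):
    """Return how many bits an unsigned field needs to hold 0..max_value."""
    return max(1, max_value.bit_length())


def fits(max_value, field_bits):
    """True if max_value can be stored in an unsigned field of field_bits bits."""
    return max_value < (1 << field_bits)


USCRIPT_CODE_LIMIT = 130             # from the notes above
MAX_SCRIPT = USCRIPT_CODE_LIMIT - 1  # 129, stored in the parallel bit field
BLOCK_COUNT = 172                    # from the notes above

if __name__ == "__main__":
    print(fits(127, 7))                  # upper bound of per-character script codes
    print(fits(MAX_SCRIPT, 7))           # the maximum script value in a 7-bit field
    print(MAX_SCRIPT & 0x7F)             # what a 7-bit field would keep of 129
    print(bits_needed(MAX_SCRIPT))       # width required for the script field
    print(bits_needed(BLOCK_COUNT - 1))  # width required for the block field today
```

The block field does not overflow yet under this assumption; as noted above, it grows only to leave room for future values.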

* Further changes in the 2008-02-29 update:
- Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
  because they should not normally be invisible.
- new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
- new Grapheme_Cluster_Break (GCB) value: PP=Prepend
- new Word_Break (WB) value: NL=Newline

* hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
- Unihan range end moves from 9FBB to 9FC3
  search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
  + do change gennames.c
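
The search for the old range end and limit can be scripted. This is a minimal sketch in Python using the regex given above; the function name and the sample lines are made up for illustration. Every hit still needs manual review, as the Unicode 4.1 notes show (some files must not be changed).

```python
import re

# Old Unihan range end (9FBB) and limit (9FBC), matched case-insensitively.
UNIHAN_END_OR_LIMIT = re.compile(r"9FB[BC]", re.IGNORECASE)


def find_hardcoded_ranges(lines):
    """Return (line number, text) pairs for lines mentioning 9FBB or 9FBC."""
    return [
        (number, text)
        for number, text in enumerate(lines, start=1)
        if UNIHAN_END_OR_LIMIT.search(text)
    ]


if __name__ == "__main__":
    # Made-up sample input, loosely modeled on the gennames.c line quoted
    # in the Unicode 4.1 section.
    sample = [
        "        0x4e00, 0x9fbb,",
        "    if(c < 0x9FBC) {",
        "    /* unrelated line */",
    ]
    for number, text in find_hardcoded_ranges(sample):
        print(number, text.strip())
```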

* build Unicode data source code for hardcoding core data
C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data

ICU data make path is \svn\icuproj\icu\uni51\source\data\
ICU root path is \svn\icuproj\icu\uni51
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"

- copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
  and rebuild the common library

*** Break iterators

* Update break iterator rules to new UAX versions and new property values

*** UCA

* update FractionalUCA.txt and UCARules.txt with new canonical closure

*** Test suites
- Test that APIs using Unicode property value aliases (like UnicodeSet)
  support all of the boolean values N/Y, No/Yes, F/T, False/True
  -> TestBinaryValues() tests in both cintltst and intltest

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguage.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.

*** Documentation
- Update User Guide
  + Jamo_Short_Name, sfc->scf, binary property value aliases

---------------------------------------------------------------------------- ***

Unicode 5.0 update

*** related Jitterbugs

5084 RFE: Update to Unicode 5.0

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.0.0\ucd\Blocks.txt ..\unidata\
copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.0.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- run preparse.pl
  + make sure that data.h is writable
  + perl preparse.pl \cvs\oss\icu > out.txt

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + script values already added in ICU 3.6 because all of ISO 15924 is now covered

* build Unicode data source code for hardcoding core data
C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data

ICU data make path is \cvs\oss\icu\source\data\
ICU root path is \cvs\oss\icu
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
[etc.]
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"

- copy the .c source files to C:\cvs\oss\icu\source\common
  and rebuild the common library

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguage.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.

---------------------------------------------------------------------------- ***

Unicode 4.1 update

*** related Jitterbugs

4332 RFE: Update to Unicode 4.1
4157 RBBI, TR29 4.1 updates

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* add new files to the repository
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- handle new enumerated properties in sub read_uchar
- run preparse.pl

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new binary properties
  + Pattern_Syntax
  + Pattern_White_Space
- new enumerated properties
  + Grapheme_Cluster_Break
  + Sentence_Break
  + Word_Break
- new block & script & line break values

* gencase
- case-ignorable changes
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
  now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk
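
The case-ignorable condition quoted above can be written as a predicate. This is a sketch in Python, not the gencase code. Python's unicodedata module has no Word_Break property, so the MidLetter set here is a one-character illustrative stand-in (an assumption); the real set would come from WordBreakProperty.txt. General categories come from whatever Unicode version the Python runtime ships.

```python
import unicodedata

# General categories named in the (D47a) condition above.
CASE_IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}

# Illustrative stand-in only (assumption): U+00B7 MIDDLE DOT.
# The real Word_Break=MidLetter set is defined in WordBreakProperty.txt.
SAMPLE_MIDLETTER = {0x00B7}


def is_case_ignorable(ch, midletter=SAMPLE_MIDLETTER):
    """True if ch is Word_Break=MidLetter or has category Mn, Me, Cf, Lm or Sk."""
    return (
        ord(ch) in midletter
        or unicodedata.category(ch) in CASE_IGNORABLE_CATEGORIES
    )


if __name__ == "__main__":
    for ch in ("a", "\u0301", "\u00ad", "\u02b0", "^", "\u00b7"):
        print("U+%04X" % ord(ch), is_case_ignorable(ch))
```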

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- verify that u_charMirror() round-trips
- test all new properties and some new values of old properties
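
The round-trip property can also be checked at the data level, before the ICU tests run. The sketch below (Python, hypothetical function names) parses lines in the "code; mirrored code # comment" form of BidiMirroring.txt and reports code points whose mirror does not map back. It is not the ICU test for u_charMirror(), which is a C API; the sample lines are a small excerpt for illustration.

```python
# Sketch only: data-level round-trip check for mirroring pairs.


def parse_bidi_mirroring(lines):
    """Parse 'XXXX; YYYY # comment' lines into a {code point: mirror} dict."""
    pairs = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue  # skip blank lines and comment-only lines
        source, mirror = (int(field.strip(), 16) for field in data.split(";"))
        pairs[source] = mirror
    return pairs


def non_round_tripping(pairs):
    """Return the code points c for which mirror(mirror(c)) != c.

    Code points without an entry are treated as mirroring to themselves.
    """
    return sorted(c for c, m in pairs.items() if pairs.get(m, m) != c)


if __name__ == "__main__":
    sample = [
        "# small excerpt for illustration",
        "0028; 0029 # LEFT PARENTHESIS",
        "0029; 0028 # RIGHT PARENTHESIS",
        "003C; 003E # LESS-THAN SIGN",
        "003E; 003C # GREATER-THAN SIGN",
    ]
    print(non_round_tripping(parse_bidi_mirroring(sample)))
```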

*** other code

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FA5 to 9FBB
  search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
  + do not modify BOCU/BOCSU code because that would change the encoding
    and break binary compatibility!
  + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
    NamePrepProfile.txt
  + ignore trietest.c: test data is arbitrary
  + ignore tstnorm.cpp: test optimization, not important
  + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
  + do change line_th.txt and word_th.txt
    by replacing hardcoded ranges with the new property values
  + do change gennames.c

source\data\brkitr\line_th.txt(229):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\data\brkitr\word_th.txt(23):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\tools\gennames\gennames.c(971):        0x4e00, 0x9fa5,

* case mappings
- compare new special casing context conditions with previous ones
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods

* genpname
- consider storing only the short name if it is the same as the long name

*** other reviews
- UAX #29 changes (grapheme/word/sentence breaks)
- UAX #14 changes (line breaks)
- Pattern_Syntax & Pattern_White_Space

---------------------------------------------------------------------------- ***

Unicode 4.0.1 update

*** related Jitterbugs

3170 RFE: Update to Unicode 4.0.1
3171 Add new Unicode 4.0.1 properties
3520 use Unicode 4.0.1 updates for break iteration

*** data files & enums & parser code

* file preparation
- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt

* file fixes
- fix UnicodeData.txt general categories of Ethiopic digits Nd->No
  according to PRI #26
  http://www.unicode.org/review/resolved-pri.html#pri26
- undone again because no corrigendum in sight;
  instead modified tests to not check consistency on this for Unicode 4.0.1

* ucdterms.txt
- update from http://www.unicode.org/copyright.html
  formatted for plain text

* uchar.h & uprops.h & uprops.c & genprops
- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
- add U_LB_INSEPARABLE due to a spelling fix
  + put short name comment only on line with new constant
    for genpname perl script parser
- new binary properties
  + STerm
  + Variation_Selector

* genpname
- fix genpname perl script so that it doesn't choke on more than 2 names per property value
- perl script: correctly calculate the maximum number of fields per row

* uscript.h
- new script code Hrkt=Katakana_Or_Hiragana

* gennorm.c track changes in DerivedNormalizationProps.txt
- "FNC" -> "FC_NFKC"
- single field "NFD_NO" -> two fields "NFD_QC; N" etc.

* genprops/props2.c track changes in DerivedNumericValues.txt
- changed from 3 columns to 2, dropping the numeric type
  + assume that the type is always numeric for Han characters,
    and that only those are added in addition to what UnicodeData.txt lists

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- update test of default bidi classes according to PRI #28
  /tsutil/cucdtst/TestUnicodeData
  http://www.unicode.org/review/resolved-pri.html#pri28
- bidi tests: change exemplar character for ES depending on Unicode version
- change hardcoded expected property values where they change

*** other code

* name matching
- read UCD.html

* scripts
- use new Hrkt=Katakana_Or_Hiragana

* ZWJ & ZWNJ
- are now part of combining character sequences
- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ
Commit	Line	Data
46f4442e	1	* Copyright (C) 2004-2008, International Business Machines
73c04bcf A	2	* Corporation and others. All Rights Reserved.
	3	*
	4	* file name: changes.txt
	5	* encoding: US-ASCII
	6	* tab size: 8 (not used)
	7	* indentation:4
	8	*
	9	* created on: 2004may06
	10	* created by: Markus W. Scherer
	11	*
	12	* change log for Unicode updates
	13
	14	---------------------------------------------------------------------------- ***
	15
46f4442e A	16	Unicode 5.1 update
	17
	18	*** related ICU Trac tickets
	19
	20	5696 Update to Unicode 5.1
	21
	22	*** Unicode version numbers
	23	- makedata.mak
	24	- uchar.h
	25	- configure.in & configure
	26	- update ucdVersion in gennames.c if an algorithmic range changes
	27
	28	*** data files & enums & parser code
	29
	30	* file preparation
	31	- ucdstrip:
	32	DerivedCoreProperties.txt
	33	DerivedNormalizationProps.txt
	34	NormalizationTest.txt
	35	PropList.txt
	36	Scripts.txt
	37	GraphemeBreakProperty.txt
	38	SentenceBreakProperty.txt
	39	WordBreakProperty.txt
	40	- ucdstrip and ucdmerge:
	41	EastAsianWidth.txt
	42	LineBreak.txt
	43
	44	* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
	45	copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
	46	copy 5.1.0\ucd\Blocks.txt ..\unidata\
	47	copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
	48	copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
	49	copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
	50	copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
	51	copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
	52	copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
	53	copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
	54	copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
	55	copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
	56	copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
	57	copy 5.1.0\ucd\UnicodeData.txt ..\unidata\
	58
	59	ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
	60	ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
	61	ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
	62	ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
	63	ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
	64	ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
	65	ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
	66	ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
	67	ucdstrip < 5.1.0\ucd\EastAsianWidth.txt \| ucdmerge > ..\unidata\EastAsianWidth.txt
	68	ucdstrip < 5.1.0\ucd\LineBreak.txt \| ucdmerge > ..\unidata\LineBreak.txt
	69
	70	* genpname
	71	- run preparse.pl
	72	+ cd \svn\icuproj\icu\uni51\source\tools\genpname
	73	+ make sure that data.h is writable
	74	+ perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
	75	+ preparse.pl complains with errors like the following:
	76	Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
	77	This is because ICU 3.8 had scripts from ISO 15924 which are now
	78	added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt
	79	and PropertyValueAliases.txt.
80	-> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
81	Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
82	+ PropertyValueAliases.txt now explicitly contains values for boolean properties:
83	N/Y, No/Yes, F/T, False/True
84	-> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
85	It will use further values from the file if present.
86
87	* uchar.h & uscript.h & uprops.h & uprops.c & genprops
88	- new block & script values
89	+ 17 new blocks
90	+ 11 new script values already added in ICU 3.8 for ISO 15924 coverage
91	(removed from SyntheticPropertyValueAliases.txt)
92	+ 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
93	(added to SyntheticPropertyValueAliases.txt)
94	- uprops.icu (uprops.h) only provides 7 bits for script codes.
95	In ICU 4.0 there are USCRIPT_CODE_LIMIT=130 script codes now.
96	There is none above 127 yet which is the script code for an
97	assigned Unicode character, so ICU 4.0 uprops.icu does not store any
98	script code values greater than 127.
99	However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
100	in a parallel bit field, and that overflows now.
101	Also, future values >=128 would be incompatible anyway.
102	uprops.h is modified to move around several of the bit fields
103	in the properties vector words, and now uses 8 bits for the script code.
104	Two other bit fields also grow to accommodate future growth:
105	Block (current count: 172) grows from 8 to 9 bits,
106	and Word_Break grows from 4 to 5 bits.
107	- renamed property Simple_Case_Folding (sfc->scf)
108	+ nothing to be done: handled as normal alias
109	- new property JSN Jamo_Short_Name
110	+ no new API: only contributes to the Name property
111	- new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
112	- new Joining Group (JG) value: Burushashki_Yeh_Barree
113	- new Sentence_Break (SB) values:
114	SB ; CR ; CR
115	SB ; EX ; Extend
116	SB ; LF ; LF
117	SB ; SC ; SContinue
118	- new Word_Break (WB) values:
119	WB ; CR ; CR
120	WB ; Extend ; Extend
121	WB ; LF ; LF
122	WB ; MB ; MidNumLet
123
124	* Further changes in the 2008-02-29 update:
125	- Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
126	because they should not normally be invisible.
127	- new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
128	- new Grapheme_Cluster_Break (GCB) value: PP=Prepend
129	- new Word_Break (WB) value: NL=Newline
130
131	* hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
132	- Unihan range end moves from 9FBB to 9FC3
133	search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
134	+ do change gennames.c
135
136	* build Unicode data source code for hardcoding core data
137	C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data
138
139	ICU data make path is \svn\icuproj\icu\uni51\source\data\
140	ICU root path is \svn\icuproj\icu\uni51
141	Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
142	Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
143	Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
144	Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
145	Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
146	Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
147	Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
148	Creating data file for Unicode Character Properties
149	Creating data file for Unicode Case Mapping Properties
150	Creating data file for Unicode BiDi/Shaping Properties
151	Creating data file for Unicode Normalization
152	Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
153	Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"
154
155	- copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
156	and rebuild the common library
157
158	*** Break iterators
159
160	* Update break iterator rules to new UAX versions and new property values
161
162	*** UCA
163
164	* update FractionalUCA.txt and UCARules.txt with new canonical closure
165
166	*** Test suites
167	- Test that APIs using Unicode property value aliases (like UnicodeSet)
168	support all of the boolean values N/Y, No/Yes, F/T, False/True
169	-> TestBinaryValues() tests in both cintltst and intltest
170
171	*** LayoutEngine script information
172	* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguage.h,
173	ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (it also generates
174	ScriptRunData.cpp, which is no longer needed.)
175
176	The generated files have a current copyright date and "@draft" statement.
177
178	* copy the above files into <icu>/source/layout, replacing the old files.
179
180	Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
181	and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
182
183	* rebuild the layout and layoutex libraries.
184
185	*** Documentation
186	- Update User Guide
187	+ Jamo_Short_Name, sfc->scf, binary property value aliases
188
189	---------------------------------------------------------------------------- ***
190
73c04bcf A	191	Unicode 5.0 update
	192
	193	*** related Jitterbugs
	194
	195	5084 RFE: Update to Unicode 5.0
	196
	197	*** data files & enums & parser code
	198
	199	* file preparation
	200	- ucdstrip:
	201	DerivedCoreProperties.txt
	202	DerivedNormalizationProps.txt
	203	NormalizationTest.txt
	204	PropList.txt
	205	Scripts.txt
	206	GraphemeBreakProperty.txt
	207	SentenceBreakProperty.txt
	208	WordBreakProperty.txt
	209	- ucdstrip and ucdmerge:
	210	EastAsianWidth.txt
	211	LineBreak.txt
	212
46f4442e	213	* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
73c04bcf A	214	copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
	215	copy 5.0.0\ucd\Blocks.txt ..\unidata\
	216	copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
	217	copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
	218	copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
	219	copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
	220	copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
	221	copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
	222	copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
	223	copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
	224	copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
	225	copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
	226	copy 5.0.0\ucd\UnicodeData.txt ..\unidata\
	227
	228	ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
	229	ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
	230	ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
	231	ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
	232	ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
	233	ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
	234	ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
	235	ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
	236	ucdstrip < 5.0.0\ucd\EastAsianWidth.txt \| ucdmerge > ..\unidata\EastAsianWidth.txt
	237	ucdstrip < 5.0.0\ucd\LineBreak.txt \| ucdmerge > ..\unidata\LineBreak.txt
	238
	239	* update FractionalUCA.txt and UCARules.txt with new canonical closure
	240
	241	* genpname
	242	- run preparse.pl
	243	+ make sure that data.h is writable
	244	+ perl preparse.pl \cvs\oss\icu > out.txt
	245
	246	* uchar.h & uscript.h & uprops.h & uprops.c & genprops
	247	- new block & script values
	248	+ script values already added in ICU 3.6 because all of ISO 15924 is now covered
	249
	250	* build Unicode data source code for hardcoding core data
	251	C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data
	252
	253	ICU data make path is \cvs\oss\icu\source\data\
	254	ICU root path is \cvs\oss\icu
	255	Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
	256	[etc.]
	257	Creating data file for Unicode Character Properties
	258	Creating data file for Unicode Case Mapping Properties
	259	Creating data file for Unicode BiDi/Shaping Properties
	260	Creating data file for Unicode Normalization
	261	Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
	262	Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"
	263
	264	- copy the .c source files to C:\cvs\oss\icu\source\common
	265	and rebuild the common library
	266
	267	*** Unicode version numbers
	268	- makedata.mak
	269	- uchar.h
	270	- configure.in
	271
	272	*** LayoutEngine script information
	273	* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguage.h,
	274	ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (it also generates
	275	ScriptRunData.cpp, which is no longer needed.)
	276
	277	The generated files have a current copyright date and "@draft" statement.
278
279	* copy the above files into <icu>/source/layout, replacing the old files.
280
281	Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
282	and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
283
284	* rebuild the layout and layoutex libraries.
285
286	---------------------------------------------------------------------------- ***
287
288	Unicode 4.1 update
289
290	*** related Jitterbugs
291
292	4332 RFE: Update to Unicode 4.1
293	4157 RBBI, TR29 4.1 updates
294
295	*** data files & enums & parser code
296
297	* file preparation
298	- ucdstrip:
299	DerivedCoreProperties.txt
300	DerivedNormalizationProps.txt
301	NormalizationTest.txt
302	GraphemeBreakProperty.txt
303	SentenceBreakProperty.txt
304	WordBreakProperty.txt
305	- ucdstrip and ucdmerge:
306	EastAsianWidth.txt
307	LineBreak.txt
308
309	* add new files to the repository
310	GraphemeBreakProperty.txt
311	SentenceBreakProperty.txt
312	WordBreakProperty.txt
313
314	* update FractionalUCA.txt and UCARules.txt with new canonical closure
315
316	* genpname
317	- handle new enumerated properties in sub read_uchar
318	- run preparse.pl
319
320	* uchar.h & uscript.h & uprops.h & uprops.c & genprops
321	- new binary properties
322	+ Pattern_Syntax
323	+ Pattern_White_Space
324	- new enumerated properties
325	+ Grapheme_Cluster_Break
326	+ Sentence_Break
327	+ Word_Break
328	- new block & script & line break values
329
330	* gencase
331	- case-ignorable changes
332	see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
333	now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- verify that u_charMirror() round-trips
- test all new properties and some new values of old properties
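
The round-trip check can be illustrated directly on BidiMirroring.txt data. This is a
sketch, not the ICU test itself, which calls u_charMirror() from C. The function names
and the SAMPLE excerpt are hypothetical; the only assumption about the file is the
"code; mirror # comment" line format.

```python
def parse_bidi_mirroring(lines):
    """Parse 'code; mirror # comment' lines into a {code point: mirror} dict."""
    mapping = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        code, mirror = (field.strip() for field in data.split(";"))
        mapping[int(code, 16)] = int(mirror, 16)
    return mapping


def round_trip_failures(mapping):
    """Return code points c for which mirror(mirror(c)) != c."""
    def mirror(cp):
        # A code point without a mapping mirrors to itself.
        return mapping.get(cp, cp)

    return [cp for cp in sorted(mapping) if mirror(mirror(cp)) != cp]


# Hypothetical excerpt in BidiMirroring.txt format.
SAMPLE = [
    "# sample data",
    "0028; 0029 # LEFT PARENTHESIS",
    "0029; 0028 # RIGHT PARENTHESIS",
    "003C; 003E # LESS-THAN SIGN",
    "003E; 003C # GREATER-THAN SIGN",
]

failures = round_trip_failures(parse_bidi_mirroring(SAMPLE))
```

An empty failures list means every listed code point maps back to itself after two
mirror steps.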

*** other code

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FA5 to 9FBB
    search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
    + do not modify BOCU/BOCSU code because that would change the encoding
      and break binary compatibility!
    + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
      NamePrepProfile.txt
    + ignore trietest.c: test data is arbitrary
    + ignore tstnorm.cpp: test optimization, not important
    + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
    + do change line_th.txt and word_th.txt
      by replacing hardcoded ranges with the new property values
    + do change gennames.c

source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5,
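
The search described above could be scripted roughly as follows. This is a minimal
sketch, not the tool that produced the hits listed above; the function name and the
idea of walking a source root are assumptions. Hits still need manual review against
the do/ignore list.

```python
import os
import re

# Matches both the old range end (9FA5) and limit (9FA6), in any letter case.
UNIHAN_END = re.compile(r"9FA[56]", re.IGNORECASE)


def find_unihan_range_hits(root):
    """Return (path, line number, line text) for every line mentioning 9FA5/9FA6."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # latin-1 decodes any byte sequence, so binary files do not raise.
                with open(path, encoding="latin-1") as f:
                    for lineno, line in enumerate(f, 1):
                        if UNIHAN_END.search(line):
                            hits.append((path, lineno, line.rstrip()))
            except OSError:
                continue  # unreadable file: skip it
    return hits
```

Calling find_unihan_range_hits on the ICU source directory would yield tuples
comparable to the path(line) entries above.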

* case mappings
- compare new special casing context conditions with previous ones
    see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods

* genpname
- consider storing only the short name if it is the same as the long name

*** other reviews
- UAX #29 changes (grapheme/word/sentence breaks)
- UAX #14 changes (line breaks)
- Pattern_Syntax & Pattern_White_Space

---------------------------------------------------------------------------- ***

Unicode 4.0.1 update

*** related Jitterbugs

3170 RFE: Update to Unicode 4.0.1
3171 Add new Unicode 4.0.1 properties
3520 use Unicode 4.0.1 updates for break iteration

*** data files & enums & parser code

* file preparation
- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt

* file fixes
- fix UnicodeData.txt general categories of Ethiopic digits Nd->No
    according to PRI #26
    http://www.unicode.org/review/resolved-pri.html#pri26
- undone again because no corrigendum in sight;
    instead modified tests to not check consistency on this for Unicode 4.0.1

* ucdterms.txt
- update from http://www.unicode.org/copyright.html
    formatted for plain text

* uchar.h & uprops.h & uprops.c & genprops
- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
- add U_LB_INSEPARABLE due to a spelling fix
    + put short name comment only on line with new constant
      for genpname perl script parser
- new binary properties
    + STerm
    + Variation_Selector

* genpname
- fix genpname perl script so that it doesn't choke on more than 2 names per property value
- perl script: correctly calculate the maximum number of fields per row

* uscript.h
- new script code Hrkt=Katakana_Or_Hiragana

* gennorm.c track changes in DerivedNormalizationProps.txt
- "FNC" -> "FC_NFKC"
- single field "NFD_NO" -> two fields "NFD_QC; N" etc.
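
The quick-check field change can be illustrated with a parser that accepts both
layouts. This is a sketch in Python, not the gennorm.c code. The function name is
hypothetical, and the "_MAYBE" suffix for the old format is an assumption, since the
note above only spells out "NFD_NO".

```python
def parse_quick_check(line):
    """Parse one DerivedNormalizationProps.txt line.

    Returns (start, end, form, value) for quick-check entries in either the old
    single-field layout ("NFD_NO") or the new two-field layout ("NFD_QC; N"),
    and None for comments, blank lines and other properties.
    """
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    if len(fields) < 2:
        return None
    code_range, prop = fields[0], fields[1]
    if prop.endswith("_QC") and len(fields) >= 3:
        form, value = prop[:-len("_QC")], fields[2]    # new: "NFD_QC; N"
    elif prop.endswith("_NO"):
        form, value = prop[:-len("_NO")], "N"          # old: "NFD_NO"
    elif prop.endswith("_MAYBE"):
        form, value = prop[:-len("_MAYBE")], "M"       # old spelling (assumed)
    else:
        return None
    start, _, end = code_range.partition("..")
    return (int(start, 16), int(end or start, 16), form, value)
```

Both layouts map to the same tuple, so the code that consumes the result does not
need to know which file version was read.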

* genprops/props2.c track changes in DerivedNumericValues.txt
- changed from 3 columns to 2, dropping the numeric type
    + assume that the type is always numeric for Han characters,
      and that only those are added in addition to what UnicodeData.txt lists

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- update test of default bidi classes according to PRI #28
    /tsutil/cucdtst/TestUnicodeData
    http://www.unicode.org/review/resolved-pri.html#pri28
- bidi tests: change exemplar character for ES depending on Unicode version
- change hardcoded expected property values where they change

*** other code

* name matching
- read UCD.html

* scripts
- use new Hrkt=Katakana_Or_Hiragana

* ZWJ & ZWNJ
- are now part of combining character sequences
- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ