* Copyright (C) 2004-2006, International Business Machines
* Corporation and others.  All Rights Reserved.
*
*   file name:  changes.txt
*   encoding:   US-ASCII
*   tab size:   8 (not used)
*   indentation:4
*
*   created on: 2004may06
*   created by: Markus W. Scherer
*
* change log for Unicode updates

---------------------------------------------------------------------------- ***

Unicode 5.0 update

*** related Jitterbugs

5084 RFE: Update to Unicode 5.0

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    PropList.txt
    Scripts.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* my ucd2unidata.txt (needs to be updated each time with UCD and file version numbers)
copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
copy 5.0.0\ucd\Blocks.txt ..\unidata\
copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
copy 5.0.0\ucd\UnicodeData.txt ..\unidata\

ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
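
The sources of ucdstrip and ucdmerge are not part of this log. The following is only a rough sketch of the effect they are assumed to have here: ucdstrip drops comments and empty lines, and ucdmerge joins adjacent ranges that carry the same value. The function names and the sample lines are hypothetical.

```python
def strip_comments(lines):
    """Drop '#' comments, trailing whitespace and empty lines."""
    for line in lines:
        data = line.split("#", 1)[0].rstrip()
        if data:
            yield data


def parse_range(field):
    """Turn '4E00..9FBB' or '00AD' into an inclusive (start, end) pair."""
    start, _, end = field.strip().partition("..")
    return int(start, 16), int(end or start, 16)


def merge_ranges(lines):
    """Merge adjacent code point ranges that carry the same value."""
    merged = []
    for line in lines:
        field, value = (part.strip() for part in line.split(";", 1))
        start, end = parse_range(field)
        if merged and merged[-1][2] == value and merged[-1][1] + 1 == start:
            # Directly adjacent and same value: extend the previous range.
            merged[-1] = (merged[-1][0], end, value)
        else:
            merged.append((start, end, value))
    for start, end, value in merged:
        cp = f"{start:04X}" if start == end else f"{start:04X}..{end:04X}"
        yield f"{cp};{value}"


# Hypothetical sample in the style of EastAsianWidth.txt.
sample = [
    "# sample data, not a real UCD excerpt",
    "0020;Na # SPACE",
    "0021..0023;Na # EXCLAMATION MARK..NUMBER SIGN",
    "",
    "00A1;A # INVERTED EXCLAMATION MARK",
]
print("\n".join(merge_ranges(strip_comments(sample))))
```

With the sample above, the two Na lines collapse into a single range and the A line stays separate. Whether the real tools keep header comments or format ranges this way is not stated in this log.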

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- run preparse.pl
  + make sure that data.h is writable
  + perl preparse.pl \cvs\oss\icu > out.txt

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new block & script values
  + script values already added in ICU 3.6 because all of ISO 15924 is now covered

* build Unicode data source code for hardcoding core data
C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data

ICU data make path is \cvs\oss\icu\source\data\
ICU root path is \cvs\oss\icu
Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
[etc.]
Creating data file for Unicode Character Properties
Creating data file for Unicode Case Mapping Properties
Creating data file for Unicode BiDi/Shaping Properties
Creating data file for Unicode Normalization
Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"

- copy the .c source files to C:\cvs\oss\icu\source\common
  and rebuild the common library

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** LayoutEngine script information
* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguage.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.

Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)

* rebuild the layout and layoutex libraries.

---------------------------------------------------------------------------- ***

Unicode 4.1 update

*** related Jitterbugs

4332 RFE: Update to Unicode 4.1
4157 RBBI, TR29 4.1 updates

*** data files & enums & parser code

* file preparation
- ucdstrip:
    DerivedCoreProperties.txt
    DerivedNormalizationProps.txt
    NormalizationTest.txt
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt
- ucdstrip and ucdmerge:
    EastAsianWidth.txt
    LineBreak.txt

* add new files to the repository
    GraphemeBreakProperty.txt
    SentenceBreakProperty.txt
    WordBreakProperty.txt

* update FractionalUCA.txt and UCARules.txt with new canonical closure

* genpname
- handle new enumerated properties in sub read_uchar
- run preparse.pl

* uchar.h & uscript.h & uprops.h & uprops.c & genprops
- new binary properties
  + Pattern_Syntax
  + Pattern_White_Space
- new enumerated properties
  + Grapheme_Cluster_Break
  + Sentence_Break
  + Word_Break
- new block & script & line break values

* gencase
- case-ignorable changes
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
  now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk
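
The D47a condition above can be written as a small predicate. This is a sketch only: the function name is hypothetical, the MidLetter set is an illustrative subset that would really come from WordBreakProperty.txt, and Python's unicodedata reflects the Unicode version bundled with the interpreter, not 4.1.

```python
import unicodedata

# General categories named in the D47a note above.
CASE_IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}

# Illustrative subset only; load the full Word_Break=MidLetter set
# from WordBreakProperty.txt in real use.
MID_LETTER_SAMPLE = {0x0027, 0x00B7, 0x2019}


def is_case_ignorable(c, mid_letter=MID_LETTER_SAMPLE):
    """True if code point c is Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk."""
    if c in mid_letter:
        return True
    return unicodedata.category(chr(c)) in CASE_IGNORABLE_CATEGORIES


for c in (0x0027, 0x0301, 0x00AD, 0x0041):
    print(f"{c:04X}", is_case_ignorable(c))
```

U+0027 qualifies through the MidLetter set, U+0301 (Mn) and U+00AD (Cf) through their general categories, and U+0041 does not qualify.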

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- verify that u_charMirror() round-trips
- test all new properties and some new values of old properties
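
A round-trip check of the mirroring data could look like the following sketch. It works on BidiMirroring.txt-style lines instead of calling u_charMirror(); char_mirror() is a hypothetical stand-in that returns the input unchanged when no mirror is listed. The sample deliberately omits one reverse mapping so that the check has something to report.

```python
def parse_mirroring(lines):
    """Parse 'code; mirrored code # comment' lines into a dict of ints."""
    pairs = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if data:
            source, target = (int(field, 16) for field in data.split(";"))
            pairs[source] = target
    return pairs


def char_mirror(pairs, c):
    """Stand-in for u_charMirror(): c itself when no mirror is listed."""
    return pairs.get(c, c)


def round_trip_failures(pairs):
    """Return the code points c for which mirror(mirror(c)) != c."""
    return [c for c in sorted(pairs)
            if char_mirror(pairs, char_mirror(pairs, c)) != c]


# The reverse mapping for 003C is left out on purpose.
sample = [
    "0028; 0029 # LEFT PARENTHESIS",
    "0029; 0028 # RIGHT PARENTHESIS",
    "003C; 003E # LESS-THAN SIGN",
]
print([f"{c:04X}" for c in round_trip_failures(parse_mirroring(sample))])
```

Only 003C is reported, because 003E has no entry that maps back to it.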

*** other code

* hardcoded Unihan range end/limit
- Unihan range end moves from 9FA5 to 9FBB
  search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
  + do not modify BOCU/BOCSU code because that would change the encoding
    and break binary compatibility!
  + similarly, do not change the GB 18030 range data (ucnvmbcs.c)
    or NamePrepProfile.txt
  + ignore trietest.c: test data is arbitrary
  + ignore tstnorm.cpp: test optimization, not important
  + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
  + do change line_th.txt and word_th.txt
    by replacing hardcoded ranges with the new property values
  + do change gennames.c

source\data\brkitr\line_th.txt(229):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\data\brkitr\word_th.txt(23):        \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
source\tools\gennames\gennames.c(971):        0x4e00, 0x9fa5,
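
The search described above (regex 9FA[56], case-insensitive) can be reproduced with a short script. This is a sketch: the function names and the list of file suffixes are assumptions, and the output format only imitates the three hits listed above. Every hit still needs the manual review described in the list, since the regex also matches longer hex strings that merely contain 9FA5 or 9FA6.

```python
import re
from pathlib import Path

# 9FA5 is the old inclusive end, 9FA6 the old exclusive limit.
OLD_UNIHAN_END = re.compile(r"9FA[56]", re.IGNORECASE)


def scan_lines(name, lines):
    """Return 'name(line): text' for every line mentioning 9FA5 or 9FA6."""
    return [
        f"{name}({number}): {line.strip()}"
        for number, line in enumerate(lines, start=1)
        if OLD_UNIHAN_END.search(line)
    ]


def scan_tree(root, suffixes=(".c", ".cpp", ".h", ".txt")):
    """Apply scan_lines to every file under root with one of the suffixes."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            hits.extend(scan_lines(str(path), text.splitlines()))
    return hits


print(scan_lines("gennames.c", ["    0x4e00, 0x9fa5,", "    0xac00, 0xd7a3,"]))
```

Calling scan_tree() on the ICU source directory would give one combined list of hits for all matching files.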

* case mappings
- compare new special casing context conditions with previous ones
  see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods

* genpname
- consider storing only the short name if it is the same as the long name

*** other reviews
- UAX #29 changes (grapheme/word/sentence breaks)
- UAX #14 changes (line breaks)
- Pattern_Syntax & Pattern_White_Space

---------------------------------------------------------------------------- ***

Unicode 4.0.1 update

*** related Jitterbugs

3170 RFE: Update to Unicode 4.0.1
3171 Add new Unicode 4.0.1 properties
3520 use Unicode 4.0.1 updates for break iteration

*** data files & enums & parser code

* file preparation
- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt

* file fixes
- fix UnicodeData.txt general categories of Ethiopic digits Nd->No
  according to PRI #26
  http://www.unicode.org/review/resolved-pri.html#pri26
- undone again because no corrigendum in sight;
  instead modified tests to not check consistency on this for Unicode 4.0.1

* ucdterms.txt
- update from http://www.unicode.org/copyright.html
  formatted for plain text

* uchar.h & uprops.h & uprops.c & genprops
- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
- add U_LB_INSEPARABLE due to a spelling fix
  + put short name comment only on line with new constant
    for genpname perl script parser
- new binary properties
  + STerm
  + Variation_Selector

* genpname
- fix genpname perl script so that it doesn't choke on more than 2 names per property value
- perl script: correctly calculate the maximum number of fields per row

* uscript.h
- new script code Hrkt=Katakana_Or_Hiragana

* gennorm.c: track changes in DerivedNormalizationProps.txt
- "FNC" -> "FC_NFKC"
- single field "NFD_NO" -> two fields "NFD_QC; N" etc.
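
A parser that accepts both the old single-field and the new two-field form might be sketched as follows. This is not the gennorm.c code. The table lists only the NFD_NO case named above; the other old names covered by "etc." would be added by analogy, and the sample lines are made up for illustration.

```python
# Old single-field name -> (property, value) in the new format.
# Only the case named in this log; extend by analogy (assumption).
OLD_TO_NEW = {
    "NFD_NO": ("NFD_QC", "N"),
}


def parse_quick_check(line):
    """Return (code points, property, value), or None for other lines."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    if len(fields) == 2 and fields[1] in OLD_TO_NEW:
        prop, value = OLD_TO_NEW[fields[1]]        # old: "NFD_NO"
    elif len(fields) == 3 and fields[1].endswith("_QC"):
        prop, value = fields[1], fields[2]         # new: "NFD_QC; N"
    else:
        return None                                # e.g. FC_NFKC mappings
    return fields[0], prop, value


# Made-up sample lines in the old and the new style.
print(parse_quick_check("00C0..00C5    ; NFD_NO # old style"))
print(parse_quick_check("00C0..00C5    ; NFD_QC; N # new style"))
```

Both calls return the same tuple, so code downstream of the parser does not need to know which file version it is reading.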

* genprops/props2.c: track changes in DerivedNumericValues.txt
- changed from 3 columns to 2, dropping the numeric type
  + assume that the type is always numeric for Han characters,
    and that only those are added in addition to what UnicodeData.txt lists

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in

*** tests
- update test of default bidi classes according to PRI #28
  /tsutil/cucdtst/TestUnicodeData
  http://www.unicode.org/review/resolved-pri.html#pri28
- bidi tests: change exemplar character for ES depending on Unicode version
- change hardcoded expected property values where they change

*** other code

* name matching
- read UCD.html

* scripts
- use new Hrkt=Katakana_Or_Hiragana

* ZWJ & ZWNJ
- are now part of combining character sequences
- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ