1 * Copyright (C) 2004-2010, International Business Machines
2 * Corporation and others. All Rights Reserved.
4 * file name: changes.txt
6 * tab size: 8 (not used)
9 * created on: 2004may06
10 * created by: Markus W. Scherer
12 * change log for Unicode updates
14 ---------------------------------------------------------------------------- ***
18 *** related ICU Trac tickets
20 7264 Unicode 6.0 Update
22 *** Unicode version numbers
25 (configure.in & configure have been modified to extract the version from uchar.h)
26 - com.ibm.icu.util.VersionInfo
28 *** data files & enums & parser code
32 ~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
33 - This now prepares both unidata and testdata files in respective output subfolders.
35 * PropertyAliases.txt changes
36 - new Script_Extensions property defined in the new ScriptExtensions.txt file
37 but not listed in PropertyAliases.txt; reported to unicode.org;
38 -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
39 scx; Script_Extensions
40 -> uchar.h with new UProperty section
41 -> com.ibm.icu.lang.UProperty, parallel with uchar.h
43 * PropertyValueAliases.txt changes
49 CJK_Unified_Ideographs_Extension_D
54 Miscellaneous_Symbols_And_Pictographs
56 Transport_And_Map_Symbols
58 -> add to UCharacter.UnicodeBlock
59 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
60 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
61 - Joining_Group (jg) values:
62 Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias
63 -> uchar.h & UCharacter.JoiningGroup
68 -> remove these from SyntheticPropertyValueAliases.txt
69 -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
70 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
71 and in com.ibm.icu.dev.test.lang.TestUScript.java
72 - 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
73 (added 2009-11-11..2010-07-18)
75 Dupl 755 Duployan shorthand
81 Merc 101 Meroitic Cursive
82 Narb 106 Old North Arabian
88 -> com.ibm.icu.lang.UScript
89 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
90 replace public static final int \1 = \2;\3
91 -> SyntheticPropertyValueAliases.txt
92 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
93 and in com.ibm.icu.dev.test.lang.TestUScript.java
94 - ISO 15924 name change
95 Mero 100 Meroitic Hieroglyphs (was Meroitic)
96 -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
97 - property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt
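The two editor find/replace recipes above (UBLOCK_ constants to UCharacter.UnicodeBlock, USCRIPT_ constants to UScript) can also be scripted. The following is a minimal sketch, assuming Python's re module as a stand-in for the Eclipse regex dialect; the function name c_to_java and the sample input lines are illustrative and not part of the ICU tools.

```python
import re

# Same patterns as the Eclipse find/replace recipes in the notes above.
UBLOCK_FIND = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
UBLOCK_REPLACE = (
    r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2'
)

USCRIPT_FIND = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
USCRIPT_REPLACE = r"public static final int \1 = \2;\3"


def c_to_java(line):
    """Rewrite one C enum line into the matching Java constant (hypothetical helper)."""
    line = UBLOCK_FIND.sub(UBLOCK_REPLACE, line)
    return USCRIPT_FIND.sub(USCRIPT_REPLACE, line)


if __name__ == "__main__":
    # Illustrative input lines; the numeric values are samples only.
    print(c_to_java("UBLOCK_MANDAIC = 198, /*[0840]*/"))
    print(c_to_java("USCRIPT_MEROITIC_CURSIVE = 141,/* Merc */"))
```

Lines that match neither pattern pass through unchanged, so the helper can be run over a whole pasted enum section before reviewing the result by hand.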
99 * UnicodeData.txt changes
101 2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
102 2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
103 -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion
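As an illustration of why a new First/Last pair needs a matching algorithmic range in gennames.c, here is a minimal sketch that reads such pairs and derives names for code points inside them. It is not the gennames implementation; parse_ranges and cjk_ideograph_name are hypothetical helper names, and only the two UnicodeData.txt lines quoted above are used as input.

```python
# The two Extension D lines quoted from UnicodeData.txt above.
EXT_D_LINES = [
    "2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;",
    "2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;",
]


def parse_ranges(lines):
    """Collect (first, last) code point pairs from <..., First>/<..., Last> lines."""
    ranges, first = [], None
    for line in lines:
        fields = line.split(";")
        cp, name = int(fields[0], 16), fields[1]
        if name.endswith(", First>"):
            first = cp
        elif name.endswith(", Last>") and first is not None:
            ranges.append((first, cp))
            first = None
    return ranges


def cjk_ideograph_name(cp, ranges):
    """Return the algorithmic name for cp, or None if cp is outside every range."""
    for first, last in ranges:
        if first <= cp <= last:
            return "CJK UNIFIED IDEOGRAPH-%04X" % cp
    return None


if __name__ == "__main__":
    ranges = parse_ranges(EXT_D_LINES)
    print(ranges, cjk_ideograph_name(0x2B740, ranges))
```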
105 * build Unicode tools using CMake+make
107 * run genpname/preparse.pl (on Linux)
108 + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
109 + make sure that data.h is writable
110 + perl preparse.pl ~/svn.icu/trunk/src > out.txt
111 + preparse.pl shows no errors, out.txt Info and Warning lines look ok
113 * rebuild Unicode tools (at least genpname) using make
114 - You might first need to "make install" ICU so that the tools build can pick
115 up the new definitions from the installed header files.
118 - ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
119 - rebuild ICU & tools
121 * update source/data/unidata/norm2/nfkc_cf.txt
122 - follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt
124 * update source/data/unidata/norm2/uts46.txt
125 - download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
126 to ~/svn.icu/tools/trunk/src/unicode/py
127 - adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
128 - ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
129 - ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2
131 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
132 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
133 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
134 - Unicode 6.0: U+2260, U+226E, U+226F
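The grep step above can be approximated with a short script. This is a sketch only: it assumes the semicolon-separated "code point(s) ; status ; ..." layout with "#" comments, and the SAMPLE lines below are illustrative rather than copied from IdnaMappingTable.txt. The function name non_ascii_std3_valid is hypothetical.

```python
# Illustrative lines in the assumed IdnaMappingTable.txt layout.
SAMPLE = """\
003D          ; disallowed_STD3_valid                  # EQUALS SIGN
2260          ; disallowed_STD3_valid                  # NOT EQUAL TO
226E..226F    ; disallowed_STD3_valid                  # NOT LESS-THAN..NOT GREATER-THAN
00C0          ; mapped                 ; 00E0          # LATIN CAPITAL LETTER A WITH GRAVE
"""


def non_ascii_std3_valid(lines):
    """List non-ASCII code points whose status is disallowed_STD3_valid."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        found.extend(cp for cp in range(first, last + 1) if cp > 0x7F)
    return found


if __name__ == "__main__":
    for cp in non_ascii_std3_valid(SAMPLE.splitlines()):
        print("U+%04X" % cp)
```

Any code point reported here is a candidate for a new test case in uts46test.cpp and UTS46Test.java.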
136 * generate core properties data files
137 - ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
138 - rebuild ICU & tools
139 - run makeuca.sh so that genuca picks up the new nfc.nrm:
140 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
141 - rebuild ICU & tools
143 * implement new Script_Extensions property (provisional)
144 - parser & generator: genprops & uprops.icu
145 - uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
146 - UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java
148 * switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
150 - genbidi/gencase/genprops tools changes
151 - re-run makeprops.sh (see above)
152 - UCharacterProperty.java, UCharacterTypeIterator.java,
153 UBiDiProps.java, UCaseProps.java, and several others with minor changes;
154 UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java
156 * update Java data files
157 - refresh just the UCD-related files, just to be safe
158 - see (ICU4C)/source/data/icu4j-readme.txt
160 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
163 Unicode .icu files built to ./out/build/icudt45l
164 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
165 echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
166 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
167 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
168 mkdir -p /tmp/icu4j/main/shared/data
169 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
170 - copy the big-endian Unicode data files to another location,
171 separate from the other data files
172 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
173 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
174 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
175 ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
176 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
177 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
178 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
180 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
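The copy steps above select only the Unicode-related files before updating icudata.jar. A minimal sketch of that selection follows, assuming the same directory layout (top level, coll, brkitr); copy_unicode_data is a hypothetical helper, and it skips cnvalias.icu up front instead of deleting it after the copy. The final "jar uf" step is not covered.

```python
import shutil
from pathlib import Path

# (subdirectory, glob patterns) pairs mirroring the cp commands in the notes.
SELECTION = (
    ("", ("*.icu", "*.nrm")),
    ("coll", ("*.icu",)),
    ("brkitr", ("*",)),
)


def copy_unicode_data(src, dst, skip=("cnvalias.icu",)):
    """Copy the selected data files from src to dst; return relative paths copied."""
    src, dst = Path(src), Path(dst)
    copied = []
    for subdir, patterns in SELECTION:
        out = dst / subdir
        out.mkdir(parents=True, exist_ok=True)
        for pattern in patterns:
            for f in sorted((src / subdir).glob(pattern)):
                if f.is_file() and f.name not in skip:
                    shutil.copy2(f, out / f.name)
                    copied.append((Path(subdir) / f.name).as_posix())
    return copied
```

For example, copy_unicode_data("com/ibm/icu/impl/data/icudt45b", "/tmp/icu4j/com/ibm/icu/impl/data/icudt45b") would stand in for the mkdir, cp and rm lines, with both paths taken from the commands above.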
182 * refresh Java test .txt files
183 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode
185 * un-hardcode normalization skippable (NF*_Inert) test data
186 - removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools
188 * copy updated break iterator test files
189 - now handled by early ucdcopy.py and
190 copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
192 copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
193 to ~/svn.icu/trunk/src/source/test/testdata)
194 - they are not used in ICU4J
198 - get output from Mark's tools; look in
199 http://www.unicode.org/~book/incoming/mark/uca6.0.0/
200 http://www.macchiato.com/unicode/utc/additional-uca-files
201 http://www.unicode.org/Public/UCA/6.0.0/
202 http://www.unicode.org/~mdavis/uca/
203 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
204 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
205 - update Han-implicit ranges for new CJK extensions:
206 swapCJK() in ucol.cpp & ImplicitCEGenerator.java
207 - genuca: allow bytes 02 for U+FFFE, new merge-sort character;
208 do not add it into invuca so that tailoring primary-after an ignorable works
209 - genuca: permit space between [variable top] bytes
210 - ucol.cpp: treat noncharacters like unassigned rather than ignorable
212 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
214 - refresh ICU4J collation data:
215 (subset of instructions above for properties data refresh, except copies all coll/*)
216 ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
217 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
218 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
219 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
220 - update (ICU)/source/test/testdata/CollationTest_*.txt
221 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
222 with output from Mark's Unicode tools
223 - run all tests with the *_SHORT.txt or the full files (the full ones have comments)
224 - note on intltest: if collate/UCAConformanceTest fails, then
225 utility/MultithreadTest/TestCollators will fail as well;
226 fix the conformance test before looking into the multi-thread test
228 * When refreshing all of ICU4J data from ICU4C
229 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
230 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
232 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install
234 *** LayoutEngine script information
236 (For details see the Unicode 5.2 change log below.)
238 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
239 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
240 ScriptRunData.cpp, which is no longer needed.)
242 The generated files have a current copyright date and "@draft" statement.
244 * copy the above files into <icu>/source/layout, replacing the old files.
245 * fix mixed line endings
246 * review the diffs and fix incorrect @draft and missing aliases;
247 Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
248 * manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
250 ---------------------------------------------------------------------------- ***
254 *** related ICU Trac tickets
258 7167 verify collation bytes
259 7235 Java test NAME_ALIAS
260 7236 Java DerivedCoreProperties.txt test
261 7237 Java BidiTest.txt
262 7238 UTrie2 in core unidata
263 7239 test for tailoring gaps
264 7240 Java fix CollationMiscTest
265 7243 update layout engine for Unicode 5.2
267 *** Unicode version numbers
270 - configure.in & configure
271 - update ucdVersion in gennames.c if an algorithmic range changes
273 *** data files & enums & parser code
277 python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
278 - includes finding files regardless of version numbers,
279 copying them, and performing the equivalent processing of the
280 ucdstrip and ucdmerge tools on the desired set of files
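"Finding files regardless of version numbers" can be pictured as stripping a version suffix from each file name before comparing it with the wanted name. The sketch below is not taken from ucdcopy.py; it assumes suffixes of the form -5.2.0 or -5.2.0d12 directly before the extension, and unversioned_name is a hypothetical helper.

```python
import re

# Assumed suffix shape: "-" + dotted version + optional draft marker "dN",
# immediately before the file extension.
_VERSION_SUFFIX = re.compile(r"-[0-9]+(?:\.[0-9]+)*(?:d[0-9]+)?(?=\.[A-Za-z]+$)")


def unversioned_name(filename):
    """Return filename without its version suffix, if it has one."""
    return _VERSION_SUFFIX.sub("", filename)


if __name__ == "__main__":
    for name in ("UnicodeData-5.2.0.txt", "LineBreak-5.2.0d12.txt", "Blocks.txt"):
        print(name, "->", unversioned_name(name))
```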
283 - PropertyAliases.txt
284 moved from numeric to enumerated:
285 ccc ; Canonical_Combining_Class
286 new string properties:
287 NFKC_CF ; NFKC_Casefold
288 Name_Alias; Name_Alias
289 new binary properties:
292 CWCF ; Changes_When_Casefolded
293 CWCM ; Changes_When_Casemapped
294 CWKCF ; Changes_When_NFKC_Casefolded
295 CWL ; Changes_When_Lowercased
296 CWT ; Changes_When_Titlecased
297 CWU ; Changes_When_Uppercased
298 new CJK Unihan properties (not supported by ICU)
299 - PropertyValueAliases.txt
302 one script code change:
303 sc ; Qaai ; Inherited
305 sc ; Zinh ; Inherited ; Qaai
306 new Line_Break (lb) value:
307 lb ; CP ; Close_Parenthesis
308 new Joining_Group (jg) values: Farsi_Yeh, Nya
310 ccc; 214; ATA ; Attached_Above
311 - DerivedBidiClass.txt
312 new default-R range: U+1E800 - U+1EFFF
314 all of the ISO comments are gone
316 9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
318 2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
319 2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;
323 + cd \svn\icuproj\icu\trunk\source\tools\genpname
324 + make sure that data.h is writable
325 + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
326 + preparse.pl complains with errors like the following:
327 Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
328 This is because ICU 4.0 had scripts from ISO 15924 which are now
329 added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
330 and PropertyValueAliases.txt.
331 -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
332 Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
333 + preparse.pl complains with errors about block names missing from uchar.h; add them
335 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
336 - new block & script values
338 copy new blocks from Blocks.txt
339 MS VC++ 2008 regular expression:
340 find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
341 replace with " UBLOCK_\3 = 172, /*[\1]*/"
342 + several new script values already added in ICU 4.0 for ISO 15924 coverage
343 (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
344 + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
345 + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
346 (added to SyntheticPropertyValueAliases.txt)
347 - new Joining Group (JG) values: Farsi_Yeh, Nya
348 - new Line_Break (lb) value:
349 lb ; CP ; Close_Parenthesis
351 * hardcoded Unihan range end/limit
352 - Unihan range end moves from 9FC3 to 9FCB
353 search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
354 + do change gennames.c
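The search for hardcoded range ends can be scripted so that it is easy to repeat for the next update. This is a minimal sketch using the regex from the note above; find_hardcoded and the default suffix list are assumptions for illustration, not an ICU tool.

```python
import re
from pathlib import Path


def find_hardcoded(root, pattern=r"9FC[34]", suffixes=(".c", ".cpp", ".h", ".java")):
    """Return (file name, line number, line) for each case-insensitive match."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), 1):
                if rx.search(line):
                    hits.append((path.name, lineno, line.strip()))
    return hits
```

Each hit still needs a manual look, since the same hex digits can appear in unrelated constants.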
356 * Compare definitions of new binary properties with what we used to use
357 in algorithms, to see if the definitions changed.
358 - Verified that definitions for Cased and Case_Ignorable are unchanged.
359 The gencase tool now parses the newly public Case_Ignorable values
360 in case the definition changes in the future.
362 * uchar.c & uprops.h & uprops.c & genprops
363 - new numeric values that didn't exist in Unicode data before:
364 1/7, 1/9, 1/10, 3/10, 1/16, 3/16
365 the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
366 therefore redesign the encoding of numeric types and values for formatVersion 6;
367 design for simple numbers up to at least 144 ("one gross"),
368 large values up to at least 10^20,
369 and fractions with numerators -1..17 and denominators 1..16
370 to cover current and expected future values
371 (e.g., more Han numeric values, Meroitic twelfths)
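To make the stated fraction range concrete, here is one way numerators -1..17 and denominators 1..16 could be packed into a single small integer. This layout is a hypothetical illustration of the design constraint only; the actual uprops.icu formatVersion 6 encoding may differ.

```python
NUM_MIN, NUM_MAX = -1, 17  # numerator range from the notes above
DEN_MIN, DEN_MAX = 1, 16   # denominator range from the notes above


def pack_fraction(numerator, denominator):
    """Pack a fraction into one integer: high bits numerator, low 4 bits denominator."""
    if not (NUM_MIN <= numerator <= NUM_MAX and DEN_MIN <= denominator <= DEN_MAX):
        raise ValueError("fraction %d/%d is out of range" % (numerator, denominator))
    return ((numerator - NUM_MIN) << 4) | (denominator - DEN_MIN)


def unpack_fraction(code):
    """Inverse of pack_fraction: return (numerator, denominator)."""
    return (code >> 4) + NUM_MIN, (code & 0xF) + DEN_MIN


if __name__ == "__main__":
    # The new Unicode 5.2 values listed above.
    for num, den in [(1, 7), (1, 9), (1, 10), (3, 10), (1, 16), (3, 16)]:
        code = pack_fraction(num, den)
        print("%d/%d -> %d -> %s" % (num, den, code, unpack_fraction(code)))
```

With 19 possible numerators and 16 denominators, the packed values need 304 slots, which fits in 9 bits in this particular layout.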
373 * reimplement Hangul_Syllable_Type for new Jamo characters
374 - the old code assumed that all Jamo characters are in the 11xx block
375 - Unicode 5.2 fills holes there and adds new Jamo characters in
376 A960..A97F; Hangul Jamo Extended-A
378 D7B0..D7FF; Hangul Jamo Extended-B
379 - Hangul_Syllable_Type can be trivially derived from a subset of
380 Grapheme_Cluster_Break values
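The derivation mentioned above can be sketched as a small lookup: the Grapheme_Cluster_Break values L, V, T, LV and LVT carry over unchanged, and every other value maps to NA (Not_Applicable). The SAMPLE_GCB table is an illustrative stand-in for real property data, and the function name is hypothetical.

```python
# GCB short names that double as Hangul_Syllable_Type short names.
HANGUL_GCB_VALUES = frozenset({"L", "V", "T", "LV", "LVT"})


def hangul_syllable_type(gcb):
    """Derive the Hangul_Syllable_Type short name from a GCB short name."""
    return gcb if gcb in HANGUL_GCB_VALUES else "NA"


# Illustrative sample of GCB values, including the new Jamo blocks.
SAMPLE_GCB = {
    0x1100: "L",    # Hangul Jamo
    0xA960: "L",    # Hangul Jamo Extended-A
    0xD7B0: "V",    # Hangul Jamo Extended-B
    0xD7CB: "T",    # Hangul Jamo Extended-B
    0xAC00: "LV",   # Hangul Syllables
    0xAC01: "LVT",  # Hangul Syllables
    0x0041: "XX",   # not Hangul
}

if __name__ == "__main__":
    for cp, gcb in sorted(SAMPLE_GCB.items()):
        print("U+%04X gcb=%s hst=%s" % (cp, gcb, hangul_syllable_type(gcb)))
```

Because the lookup depends only on the property value, it no longer matters which block a Jamo character lives in.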
382 * build Unicode data source code for hardcoding core data
383 C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data
385 ICU data make path is \svn\icuproj\icu\trunk\source\data\
386 ICU root path is \svn\icuproj\icu\trunk
387 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
388 Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
389 Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
390 Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
391 Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
392 Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
393 Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
394 Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
395 Creating data file for Unicode Property Names
396 Creating data file for Unicode Character Properties
397 Creating data file for Unicode Case Mapping Properties
398 Creating data file for Unicode BiDi/Shaping Properties
399 Creating data file for Unicode Normalization
400 Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
401 Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"
403 - copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
404 and rebuild the common library
408 - update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
409 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
410 - update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
411 [ Begin obsolete instructions:
412 Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files.
413 - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
415 python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
416 python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
417 End obsolete instructions]
418 - run all tests with the *_SHORT.txt or the full files (the full ones have comments)
419 not just the *_STUB.txt files
420 - note on intltest: if collate/UCAConformanceTest fails, then
421 utility/MultithreadTest/TestCollators will fail as well;
422 fix the conformance test before looking into the multi-thread test
424 *** Implement Cased & Case_Ignorable properties
425 - via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
426 - Problem: These properties should be disjoint, but aren't
427 - UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
428 - change ucase.icu to be able to store any combination of Cased and Case_Ignorable
430 *** Implement Changes_When_Xyz properties
431 - without stored data
433 *** Implement Name_Alias property
434 - add it as another name field in unames.icu
435 - make it available via u_charName() and UCharNameChoice
436 - consider it in u_charFromName()
440 * Update break iterator rules to new UAX versions and new property values
441 * Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary
443 *** new BidiTest file
444 - review format and data
445 - copy BidiTest.txt to source/test/testdata
446 - write test code using this data
447 - fix ICU code where it fails the conformance test
450 - generally, find and update code corresponding to C/C++
451 - UCharacter.UnicodeBlock constants:
452 a) add an _ID integer per new block, update COUNT
453 b) add a class instance per new block
455 find UBLOCK_{[^ ]+} = [0-9]+, {/.+}
456 replace with public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
457 - CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()
459 - port test changes to Java
461 *** LayoutEngine script information
463 (For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)
465 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
466 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
467 ScriptRunData.cpp, which is no longer needed.)
469 The generated files have a current copyright date and "@draft" statement.
471 -> Eric Mader wrote in email on 20090930:
472 "I think the tool has been modified to update @draft to @stable for
473 older scripts and to add @draft for new scripts.
474 (I worked with an intern on this last year.)
475 You should check the output after you run it."
477 * copy the above files into <icu>/source/layout, replacing the old files.
478 * fix mixed line endings
479 * review the diffs and fix incorrect @draft and missing aliases
480 * manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
482 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
483 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
485 -> Eric Mader wrote in email on 20090930:
486 "This is just a matter of making sure that all the per-script tables have
487 entries for any new scripts that were added.
488 If any new Indic characters were added, then the class tables in
489 IndicClassTables.cpp should be updated to reflect this.
490 John Emmons should know how to do this if it's required."
492 * rebuild the layout and layoutex libraries.
496 + Jamo_Short_Name, sfc->scf, binary property value aliases
498 ---------------------------------------------------------------------------- ***
502 *** related ICU Trac tickets
504 5696 Update to Unicode 5.1
506 *** Unicode version numbers
509 - configure.in & configure
510 - update ucdVersion in gennames.c if an algorithmic range changes
512 *** data files & enums & parser code
516 DerivedCoreProperties.txt
517 DerivedNormalizationProps.txt
518 NormalizationTest.txt
521 GraphemeBreakProperty.txt
522 SentenceBreakProperty.txt
523 WordBreakProperty.txt
524 - ucdstrip and ucdmerge:
528 * my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
529 copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
530 copy 5.1.0\ucd\Blocks.txt ..\unidata\
531 copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
532 copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
533 copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
534 copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
535 copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
536 copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
537 copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
538 copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
539 copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
540 copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
541 copy 5.1.0\ucd\UnicodeData.txt ..\unidata\
543 ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
544 ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
545 ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
546 ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
547 ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
548 ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
549 ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
550 ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
551 ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
552 ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
556 + cd \svn\icuproj\icu\uni51\source\tools\genpname
557 + make sure that data.h is writable
558 + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
559 + preparse.pl complains with errors like the following:
560 Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
561 This is because ICU 3.8 had scripts from ISO 15924 which are now
562 added to Unicode 5.1, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
563 and PropertyValueAliases.txt.
564 -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
565 Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
566 + PropertyValueAliases.txt now explicitly contains values for boolean properties:
567 N/Y, No/Yes, F/T, False/True
568 -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
569 It will use further values from the file if present.
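The boolean value aliases listed above can be pictured as a lookup table with loose name matching. This is a minimal sketch in Python, not the preparse.pl code; the loose comparison (ignoring case, spaces, hyphens and underscores) is an assumption modelled on the usual Unicode property alias matching, and parse_binary_value is a hypothetical name.

```python
# The four alias pairs named in the notes above.
BINARY_VALUE_ALIASES = {
    "n": False, "no": False, "f": False, "false": False,
    "y": True, "yes": True, "t": True, "true": True,
}


def _loose(name):
    """Normalize an alias for loose comparison (assumed matching rule)."""
    return name.replace("_", "").replace("-", "").replace(" ", "").lower()


def parse_binary_value(alias):
    """Map a binary property value alias to a bool, or raise ValueError."""
    try:
        return BINARY_VALUE_ALIASES[_loose(alias)]
    except KeyError:
        raise ValueError("unknown binary property value: %r" % alias) from None


if __name__ == "__main__":
    for alias in ("N", "Yes", "F", "True"):
        print(alias, "->", parse_binary_value(alias))
```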
571 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
572 - new block & script values
574 + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
575 (removed from SyntheticPropertyValueAliases.txt)
576 + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
577 (added to SyntheticPropertyValueAliases.txt)
578 - uprops.icu (uprops.h) only provides 7 bits for script codes.
579 In ICU 4.0 there are USCRIPT_CODE_LIMIT=130 script codes now.
580 None of the codes above 127 is yet the script code of an
581 assigned Unicode character, so ICU 4.0 uprops.icu does not store any
582 script code values greater than 127.
583 However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
584 in a parallel bit field, and that overflows now.
585 Also, future values >=128 would be incompatible anyway.
586 uprops.h is modified to move around several of the bit fields
587 in the properties vector words, and now uses 8 bits for the script code.
588 Two other bit fields also grow to accommodate future growth:
589 Block (current count: 172) grows from 8 to 9 bits,
590 and Word_Break grows from 4 to 5 bits.
591 - renamed property Simple_Case_Folding (sfc->scf)
592 + nothing to be done: handled as normal alias
593 - new property JSN Jamo_Short_Name
594 + no new API: only contributes to the Name property
595 - new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
596 - new Joining Group (JG) value: Burushashki_Yeh_Barree
597 - new Sentence_Break (SB) values:
602 - new Word_Break (WB) values:
608 * Further changes in the 2008-02-29 update:
609 - Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
610 because they should not normally be invisible.
611 - new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
612 - new Grapheme_Cluster_Break (GCB) value: PP=Prepend
613 - new Word_Break (WB) value: NL=Newline
615 * hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
616 - Unihan range end moves from 9FBB to 9FC3
617 search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
618 + do change gennames.c
620 * build Unicode data source code for hardcoding core data
621 C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data
623 ICU data make path is \svn\icuproj\icu\uni51\source\data\
624 ICU root path is \svn\icuproj\icu\uni51
625 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
626 Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
627 Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
628 Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
629 Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
630 Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
631 Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
632 Creating data file for Unicode Character Properties
633 Creating data file for Unicode Case Mapping Properties
634 Creating data file for Unicode BiDi/Shaping Properties
635 Creating data file for Unicode Normalization
636 Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
637 Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"
639 - copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
640 and rebuild the common library
644 * Update break iterator rules to new UAX versions and new property values
648 * update FractionalUCA.txt and UCARules.txt with new canonical closure
651 - Test that APIs using Unicode property value aliases (like UnicodeSet)
652 support all of the boolean values N/Y, No/Yes, F/T, False/True
653 -> TestBinaryValues() tests in both cintltst and intltest
655 *** LayoutEngine script information
656 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
657 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
658 ScriptRunData.cpp, which is no longer needed.)
660 The generated files have a current copyright date and "@draft" statement.
662 * copy the above files into <icu>/source/layout, replacing the old files.
664 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
665 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
667 * rebuild the layout and layoutex libraries.
671 + Jamo_Short_Name, sfc->scf, binary property value aliases
673 ---------------------------------------------------------------------------- ***
677 *** related Jitterbugs
679 5084 RFE: Update to Unicode 5.0
681 *** data files & enums & parser code
685 DerivedCoreProperties.txt
686 DerivedNormalizationProps.txt
687 NormalizationTest.txt
690 GraphemeBreakProperty.txt
691 SentenceBreakProperty.txt
692 WordBreakProperty.txt
693 - ucdstrip and ucdmerge:
697 * my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
698 copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
699 copy 5.0.0\ucd\Blocks.txt ..\unidata\
700 copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
701 copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
702 copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
703 copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
704 copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
705 copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
706 copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
707 copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
708 copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
709 copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
710 copy 5.0.0\ucd\UnicodeData.txt ..\unidata\
712 ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
713 ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
714 ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
715 ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
716 ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
717 ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
718 ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
719 ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
720 ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
721 ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
723 * update FractionalUCA.txt and UCARules.txt with new canonical closure
727 + make sure that data.h is writable
728 + perl preparse.pl \cvs\oss\icu > out.txt
730 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
731 - new block & script values
732 + script values already added in ICU 3.6 because all of ISO 15924 is now covered
734 * build Unicode data source code for hardcoding core data
735 C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data
737 ICU data make path is \cvs\oss\icu\source\data\
738 ICU root path is \cvs\oss\icu
739 Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
741 Creating data file for Unicode Character Properties
742 Creating data file for Unicode Case Mapping Properties
743 Creating data file for Unicode BiDi/Shaping Properties
744 Creating data file for Unicode Normalization
745 Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
746 Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"
748 - copy the .c source files to C:\cvs\oss\icu\source\common
749 and rebuild the common library
751 *** Unicode version numbers
756 *** LayoutEngine script information
757 * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguage.h,
758 ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
759 ScriptRunData.cpp, which is no longer needed.)
761 The generated files have a current copyright date and "@draft" statement.
763 * copy the above files into <icu>/source/layout, replacing the old files.
765 Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
766 and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
768 * rebuild the layout and layoutex libraries.
770 ---------------------------------------------------------------------------- ***
774 *** related Jitterbugs
776 4332 RFE: Update to Unicode 4.1
777 4157 RBBI, TR29 4.1 updates
779 *** data files & enums & parser code
783 DerivedCoreProperties.txt
784 DerivedNormalizationProps.txt
785 NormalizationTest.txt
786 GraphemeBreakProperty.txt
787 SentenceBreakProperty.txt
788 WordBreakProperty.txt
789 - ucdstrip and ucdmerge:
793 * add new files to the repository
794 GraphemeBreakProperty.txt
795 SentenceBreakProperty.txt
796 WordBreakProperty.txt
798 * update FractionalUCA.txt and UCARules.txt with new canonical closure
801 - handle new enumerated properties in sub read_uchar
804 * uchar.h & uscript.h & uprops.h & uprops.c & genprops
805 - new binary properties
807 + Pattern_White_Space
808 - new enumerated properties
809 + Grapheme_Cluster_Break
812 - new block & script & line break values
815 - case-ignorable changes
816 see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
817 now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk
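
The new case-ignorable condition can be written as a small predicate. This is only a sketch: it uses Python's unicodedata module (which reflects whatever Unicode version the interpreter ships, not 4.1), and MID_LETTER_SAMPLE is a hypothetical, hand-picked subset standing in for the real Word_Break=MidLetter set from WordBreakProperty.txt.

```python
import unicodedata

# Hypothetical subset of Word_Break=MidLetter, for illustration only;
# the real set must be read from WordBreakProperty.txt.
MID_LETTER_SAMPLE = {0x0027, 0x00B7, 0x2019}

# General categories named in the definition: Mn, Me, Cf, Lm, Sk
CASE_IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}

def is_case_ignorable(ch):
    """Return True if ch is MidLetter (sample set) or has one of the listed categories."""
    return (ord(ch) in MID_LETTER_SAMPLE
            or unicodedata.category(ch) in CASE_IGNORABLE_CATEGORIES)
```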
819 *** Unicode version numbers
825 - verify that u_charMirror() round-trips
826 - test all new properties and some new values of old properties
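
The round-trip check for mirroring can be sketched as follows. This does not call ICU; it models u_charMirror() with a plain dictionary built from BidiMirroring.txt-style lines, and assumes that a code point without a mapping mirrors to itself. The helper names and the sample lines are illustrative.

```python
def parse_mirroring(lines):
    """Parse 'source; target # comment' lines into a {source: target} dict."""
    mapping = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        source, target = (int(f.strip(), 16) for f in data.split(";"))
        mapping[source] = target
    return mapping

def round_trip_failures(mapping):
    """Return code points c where mirror(mirror(c)) != c."""
    def mirror(c):
        # Assumption: unmapped code points mirror to themselves.
        return mapping.get(c, c)
    return [c for c in sorted(mapping) if mirror(mirror(c)) != c]

sample = [
    "0028; 0029 # LEFT PARENTHESIS",
    "0029; 0028 # RIGHT PARENTHESIS",
    "003C; 003E # LESS-THAN SIGN",
    "003E; 003C # GREATER-THAN SIGN",
]
failures = round_trip_failures(parse_mirroring(sample))
```

An empty failure list means every mirrored pair maps back to its starting code point.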
830 * hardcoded Unihan range end/limit
831 - Unihan range end moves from 9FA5 to 9FBB
832 search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
833 + do not modify BOCU/BOCSU code because that would change the encoding
834 and break binary compatibility!
835 + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
837 + ignore trietest.c: test data is arbitrary
838 + ignore tstnorm.cpp: test optimization, not important
839 + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
840 + do change line_th.txt and word_th.txt
841 by replacing hardcoded ranges with the new property values
842 + do change gennames.c
844 source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
845 source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
846 source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5,
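
A search like the one above can be scripted. The following sketch walks a source tree and reports lines matching the regex 9FA[56], case-insensitively; the function name and the default list of file suffixes are assumptions, and the hits still need manual review per the notes above.

```python
import re
from pathlib import Path

UNIHAN_END = re.compile(r"9FA[56]", re.IGNORECASE)

def find_hardcoded_ranges(root, suffixes=(".c", ".cpp", ".h", ".txt")):
    """Return (path, line number, stripped line) for every matching line under root."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if UNIHAN_END.search(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

The plain substring regex can also match longer hex strings that merely contain 9FA5 or 9FA6, so false positives are expected.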
849 - compare new special casing context conditions with previous ones
850 see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
853 - consider storing only the short name if it is the same as the long name
856 - UAX #29 changes (grapheme/word/sentence breaks)
857 - UAX #14 changes (line breaks)
858 - Pattern_Syntax & Pattern_White_Space
860 ---------------------------------------------------------------------------- ***
864 *** related Jitterbugs
866 3170 RFE: Update to Unicode 4.0.1
867 3171 Add new Unicode 4.0.1 properties
868 3520 use Unicode 4.0.1 updates for break iteration
870 *** data files & enums & parser code
873 - ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
874 - ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt
877 - fix UnicodeData.txt general categories of Ethiopic digits Nd->No
879 http://www.unicode.org/review/resolved-pri.html#pri26
880 - undone again because no corrigendum in sight;
881 instead modified tests to not check consistency on this for Unicode 4.0.1
884 - update from http://www.unicode.org/copyright.html
885 formatted for plain text
887 * uchar.h & uprops.h & uprops.c & genprops
888 - add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
889 - add U_LB_INSEPARABLE due to a spelling fix
890 + put short name comment only on line with new constant
891 for genpname perl script parser
892 - new binary properties
897 - fix genpname perl script so that it doesn't choke on more than 2 names per property value
898 - perl script: correctly calculate the maximum number of fields per row
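
The field-count issue can be illustrated in Python rather than perl. This sketch is not the genpname script; it only shows one way to parse alias rows with any number of names and to compute the maximum number of fields per row. The function names are made up, and the sample rows are in the style of PropertyValueAliases.txt.

```python
def parse_alias_rows(lines):
    """Split 'prop ; short ; long ; more...' lines into lists of trimmed fields."""
    rows = []
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if data:
            rows.append([f.strip() for f in data.split(";")])
    return rows

def max_fields(rows):
    """Largest number of fields in any row (0 for no rows)."""
    return max((len(r) for r in rows), default=0)

sample = [
    "# sample rows",
    "sc ; Hrkt ; Katakana_Or_Hiragana",
    "lb ; IN ; Inseparable ; Inseperable",  # more than 2 names for one value
]
rows = parse_alias_rows(sample)
widest = max_fields(rows)
```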
901 - new script code Hrkt=Katakana_Or_Hiragana
903 * gennorm.c track changes in DerivedNormalizationProps.txt
905 - single field "NFD_NO" -> two fields "NFD_QC; N" etc.
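
The format change can be handled by a parser that accepts both shapes. The sketch below is in Python, not the gennorm.c code; it maps the old single-field form onto the new two-field form, assuming the old suffixes were NO and MAYBE. Other properties in the file are ignored here, and the function name is hypothetical.

```python
OLD_QC_VALUES = {"NO": "N", "MAYBE": "M"}  # assumed old suffixes
FORMS = ("NFD", "NFC", "NFKD", "NFKC")

def parse_quick_check(line):
    """Return (range, property, value) for a quick-check line, else None."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    # New format: "00C0..00C5 ; NFD_QC; N"
    if len(fields) == 3 and fields[1].endswith("_QC"):
        return fields[0], fields[1], fields[2]
    # Old format: "00C0..00C5 ; NFD_NO"
    if len(fields) == 2:
        form, _, old = fields[1].partition("_")
        if form in FORMS and old in OLD_QC_VALUES:
            return fields[0], form + "_QC", OLD_QC_VALUES[old]
    return None
```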
907 * genprops/props2.c track changes in DerivedNumericValues.txt
908 - changed from 3 columns to 2, dropping the numeric type
909 + assume that the type is always numeric for Han characters,
910 and that only those are added in addition to what UnicodeData.txt lists
912 *** Unicode version numbers
918 - update test of default bidi classes according to PRI #28
919 /tsutil/cucdtst/TestUnicodeData
920 http://www.unicode.org/review/resolved-pri.html#pri28
921 - bidi tests: change exemplar character for ES depending on Unicode version
922 - change hardcoded expected property values where they change
930 - use new Hrkt=Katakana_Or_Hiragana
933 - are now part of combining character sequences
934 - break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ