* Copyright (C) 2004-2012, International Business Machines
* Corporation and others. All Rights Reserved.
*
* file name: changes.txt
* encoding: US-ASCII
* tab size: 8 (not used)
* indentation:4
*
* created on: 2004may06
* created by: Markus W. Scherer
*
* change log for Unicode updates

---------------------------------------------------------------------------- ***

Future Unicode update

Tools simplified since the Unicode 6.1 update. See
- http://site.icu-project.org/design/props/ppucd
- http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972

* Unicode version numbers
- icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates

* file preparation
- ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:
- ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src
- This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyValueAliases.txt changes
- Script codes that are in ISO 15924 but not in Unicode are now listed in
  preparseucd.py, in the _scripts_only_in_iso15924 variable.
  If there are new ISO codes, then add them.
  If Unicode adds some of them, then remove them from the .py variable.

* UnicodeData.txt changes
- No more manual changes for CJK ranges for algorithmic names;
  those are now written to ppucd.txt and genprops reads them from there.

* generate core properties data files (makeprops.sh was deleted)
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src

* no more manual updates of source/data/unidata/norm2/nfkc_cf.txt
- it is now generated by preparseucd.py

* no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt
- it is now generated by preparseucd.py
- make sure that the Unicode data folder passed into preparseucd.py
  includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
  (can be in some subfolder)

* generate normalization data files
- ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib
- ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
- ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
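
The four gennorm2 invocations above differ only in the output name and the list of input files. A minimal Python sketch that assembles the same argument lists from a table follows. The function name and the idea of driving the tool from Python are illustrative assumptions, not part of the ICU tools, and the sketch only builds the command lines without running anything.

```python
# Sketch only: builds the gennorm2 argument lists shown above; it does not run them.
# Output file name -> input files, in the order used on the command lines above.
NORM2_INPUTS = {
    "nfc.nrm": ["nfc.txt"],
    "nfkc.nrm": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf.nrm": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46.nrm": ["nfc.txt", "uts46.txt"],
}


def gennorm2_commands(src_data_in, unidata):
    """Return one argv list per .nrm file, mirroring the shell commands."""
    commands = []
    for output, inputs in NORM2_INPUTS.items():
        commands.append(
            ["bin/gennorm2", "-o", f"{src_data_in}/{output}",
             "-s", f"{unidata}/norm2", *inputs])
    return commands


if __name__ == "__main__":
    for argv in gennorm2_commands("$SRC_DATA_IN", "$UNIDATA"):
        print(" ".join(argv))
```

Keeping the input lists in one table makes the layering visible: every file builds on nfc.txt, and nfkc_cf.nrm additionally builds on nfkc.txt.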

* build ICU (make install)
* build Unicode tools using CMake+make

* new way to call genuca (makeuca.sh was deleted)
- ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src

---------------------------------------------------------------------------- ***

Unicode 6.1 update

*** ICU Trac

- ticket 8995 final update to Unicode 6.1
- ticket 8994 regenerate source/layout/CanonData.cpp

- ticket 8961 support Unicode "Age" value *names*
- ticket 8963 support multiple character name aliases & types

- ticket 8827 "update ICU to Unicode 6.1"
- C++ branches/markus/uni61 at r30864 from trunk at r30843
- Java branches/markus/uni61 at r30865 from trunk at r30863

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo
- icutools/unicode/makedefs.sh
  + also review & update other definitions in that file,
    e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l

*** data files & enums & parser code

* file preparation

~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed
- This prepares both unidata and testdata files in respective output subfolders.
- Check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out.

* PropertyValueAliases.txt changes
- 11 new block names:
    Arabic_Extended_A
    Arabic_Mathematical_Alphabetic_Symbols
    Chakma
    Meetei_Mayek_Extensions
    Meroitic_Cursive
    Meroitic_Hieroglyphs
    Miao
    Sharada
    Sora_Sompeng
    Sundanese_Supplement
    Takri
  -> add to uchar.h
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- 1 new Joining_Group (jg) value:
    Rohingya_Yeh
  -> uchar.h & UCharacter.JoiningGroup
- 2 new Line_Break (lb) values:
    CJ=Conditional_Japanese_Starter
    HL=Hebrew_Letter
  -> uchar.h & UCharacter.LineBreak
- 7 new scripts:
    sc ; Cakm ; Chakma
    sc ; Merc ; Meroitic_Cursive
    sc ; Mero ; Meroitic_Hieroglyphs
    sc ; Plrd ; Miao
    sc ; Shrd ; Sharada
    sc ; Sora ; Sora_Sompeng
    sc ; Takr ; Takri
  -> remove these from SyntheticPropertyValueAliases.txt
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2011-06-21)
    Khoj 322 Khojki
    Tirh 326 Tirhuta
  and another one added 2011-12-09
    Hluw 080 Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2;\3
  -> SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
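
The Eclipse find/replace patterns above turn C enum lines into Java constants. As a rough sketch, the same three patterns can be applied with Python's re module, whose syntax for groups and \1-style backreferences happens to match these patterns. The helper name and the two input lines are illustrative assumptions; real input would come from uchar.h and uscript.h.

```python
import re

# The find/replace pairs quoted above, unchanged except for Python string quoting.
BLOCK_ID = (r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)",
            r"public static final int \1_ID = \2; \3")
BLOCK_OBJECT = (r"UBLOCK_([^ ]+) = [0-9]+, (/.+)",
                r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2')
SCRIPT_CODE = (r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)",
               r"public static final int \1 = \2;\3")


def convert(line, rule):
    """Apply one (pattern, replacement) pair to a single header line."""
    pattern, replacement = rule
    return re.sub(pattern, replacement, line)


# Hypothetical header lines, for illustration only.
block_line = "UBLOCK_TAKRI = 220, /*[11680]*/"
script_line = "USCRIPT_KHOJKI = 157,/* Khoj */"

if __name__ == "__main__":
    print(convert(block_line, BLOCK_ID))
    print(convert(block_line, BLOCK_OBJECT))
    print(convert(script_line, SCRIPT_CODE))
```

The trailing comment is carried over as the last capture group, so the code point or script code annotation from the C header survives in the Java source.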

* UnicodeData.txt changes
- the last Unihan code point changes from U+9FCB to U+9FCC
  search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive)
  + do change gennames.c
  + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java
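
The search described above can be sketched in Python as a case-insensitive scan for 9FC[BC]. The function name and the sample lines are made up for illustration; they are not taken from gennames.c or ucol.cpp.

```python
import re

# Matches both the old end (9FCB) and the old limit (9FCC), in either case.
CJK_END_OR_LIMIT = re.compile(r"9FC[BC]", re.IGNORECASE)


def find_cjk_boundaries(lines):
    """Return (line_number, text) pairs that mention 9FCB or 9FCC."""
    return [(number, text) for number, text in enumerate(lines, start=1)
            if CJK_END_OR_LIMIT.search(text)]


# Made-up source lines, for illustration only.
sample = [
    "#define CJK_END 0x9fcb",
    "static const int CJK_LIMIT = 0x9FCC;",
    "/* unrelated line */",
]

if __name__ == "__main__":
    for number, text in find_cjk_boundaries(sample):
        print(number, text)
```

Each hit still needs a manual look, since an inclusive end and an exclusive limit have to be bumped to different new values.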

* DerivedBidiClass.txt changes
- 2 new default-AL blocks:
# Arabic Extended-A: U+08A0 - U+08FF (was default-R)
# Arabic Mathematical Alphabetic Symbols:
# U+1EE00 - U+1EEFF (was default-R)
- 2 new default-R blocks:
# Meroitic Hieroglyphs:
# U+10980 - U+1099F
# Meroitic Cursive: U+109A0 - U+109FF
  -> should be picked up by the explicit data in the file
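
For a quick spot check after regenerating the data, the four ranges listed above can be written down as a small lookup table. This is only a sketch of those four entries with a hypothetical function name; it is not the full default Bidi_Class assignment and not how ICU stores the data.

```python
# Ranges and classes exactly as listed above; all other code points are out of scope.
DEFAULT_BIDI_RANGES = [
    (0x08A0, 0x08FF, "AL"),    # Arabic Extended-A
    (0x1EE00, 0x1EEFF, "AL"),  # Arabic Mathematical Alphabetic Symbols
    (0x10980, 0x1099F, "R"),   # Meroitic Hieroglyphs
    (0x109A0, 0x109FF, "R"),   # Meroitic Cursive
]


def new_default_bidi_class(code_point):
    """Return the default class for the four blocks above, or None otherwise."""
    for start, end, bidi_class in DEFAULT_BIDI_RANGES:
        if start <= code_point <= end:
            return bidi_class
    return None


if __name__ == "__main__":
    for cp in (0x08FF, 0x1EE00, 0x10980, 0x0041):
        print(f"U+{cp:04X}", new_default_bidi_class(cp))
```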

* NameAliases.txt changes
- from
    # Each line has two fields
    # First field: Code point
    # Second field: Alias
- to
    # Each line has three fields, as described here:
    #
    # First field: Code point
    # Second field: Alias
    # Third field: Type
- Also, the file format already allowed multiple aliases per code point, but only now
  does the data actually contain them, including several of the same type. For example,
    FEFF;BYTE ORDER MARK;alternate
    FEFF;BOM;abbreviation
    FEFF;ZWNBSP;abbreviation
- This breaks our gennames parser, unames.icu data structure, and API.
  Fix gennames to only pick up "correction" aliases.
  New ticket #8963 for further changes.
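
A minimal Python sketch of reading the three-field format and keeping only the "correction" aliases follows. It illustrates the filtering idea only; gennames itself is a C tool, the function names here are hypothetical, and the sketch assumes at most one correction per code point. The 01A2 sample line is added for illustration; the FEFF lines are the ones quoted above.

```python
def parse_name_aliases(lines):
    """Yield (code_point, alias, alias_type) from three-field NameAliases.txt lines."""
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields: {line!r}")
        yield int(fields[0], 16), fields[1], fields[2]


def corrections_only(lines):
    """Map code point -> alias, keeping only aliases of type 'correction'."""
    return {code_point: alias
            for code_point, alias, alias_type in parse_name_aliases(lines)
            if alias_type == "correction"}


sample = [
    "# Each line has three fields",
    "01A2;LATIN CAPITAL LETTER GHA;correction",
    "FEFF;BYTE ORDER MARK;alternate",
    "FEFF;BOM;abbreviation",
    "FEFF;ZWNBSP;abbreviation",
]

if __name__ == "__main__":
    print(corrections_only(sample))
```

Filtering by type sidesteps the multiple-aliases problem for the moment, because the three FEFF entries are simply skipped.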

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* build ICU (make install)
  so that the tools build can pick up the new definitions from the installed header files.
* build Unicode tools (at least genpname) using CMake+make

* run genpname
  (builds both pnames.icu and propname_data.h)
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource

* build ICU (make install)
* build Unicode tools using CMake+make

* update source/data/unidata/norm2/nfkc_cf.txt
- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt

* update source/data/unidata/norm2/uts46.txt
- download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt
  to ~/svn.icu/tools/trunk/src/unicode/py
- adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".
- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0..6.1: U+2260, U+226E, U+226F
- nothing new in 6.1, no test file to update
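
The grep step above can be sketched in Python as a filter over IdnaMappingTable.txt-style lines that keeps only "disallowed_STD3_valid" entries above U+007F. The function name is hypothetical, and the sample lines are shortened illustrations of the "code points ; status # comment" layout rather than verbatim lines from the file.

```python
def non_ascii_std3_valid(lines):
    """Return (first, last) ranges above U+007F with status disallowed_STD3_valid."""
    hits = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")  # single code point or range
        first = int(start, 16)
        last = int(end, 16) if end else first
        if last > 0x7F:
            hits.append((first, last))
    return hits


# Shortened sample lines, for illustration only.
sample = [
    "0000..002C    ; disallowed_STD3_valid  # ASCII controls and punctuation",
    "0061..007A    ; valid                  # a-z",
    "2260          ; disallowed_STD3_valid  # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid  # NOT LESS-THAN..NOT GREATER-THAN",
]

if __name__ == "__main__":
    for first, last in non_ascii_std3_valid(sample):
        print(f"U+{first:04X}..U+{last:04X}")
```

Comparing this list between two Unicode versions shows whether the test files need new characters.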

* generate core properties data files
- in initial bootstrapping, change the UCA version
  in source/data/unidata/FractionalUCA.txt to match the new Unicode version
- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools
  + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
    check if the UCA version in FractionalUCA.txt matches the new Unicode version
    (see step above)
- run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt49l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b
    mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b
    echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b
    mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b"
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
    jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data
    make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data'
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
    ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr
- refresh ICU4J
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
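
The staging copy above (the mkdir/cp/rm sequence) selects top-level *.icu and *.nrm files except cnvalias.icu, plus coll/*.icu and everything in brkitr. A rough Python equivalent is sketched below, assuming both arguments point at the icudt49b directories. The function name is hypothetical, and unlike the shell steps the sketch skips cnvalias.icu instead of copying and then deleting it.

```python
import shutil
from pathlib import Path


def stage_unicode_data(src, dest):
    """Copy the UCD-related files from src to dest, like the cp/rm commands above."""
    src, dest = Path(src), Path(dest)
    (dest / "coll").mkdir(parents=True, exist_ok=True)
    (dest / "brkitr").mkdir(parents=True, exist_ok=True)
    copied = []
    # Top-level *.icu and *.nrm files, without cnvalias.icu (removed by rm above).
    for path in sorted(src.glob("*.icu")) + sorted(src.glob("*.nrm")):
        if path.name == "cnvalias.icu":
            continue
        shutil.copy2(path, dest / path.name)
        copied.append(path.name)
    # Only the .icu files from coll/.
    for path in sorted((src / "coll").glob("*.icu")):
        shutil.copy2(path, dest / "coll" / path.name)
        copied.append(f"coll/{path.name}")
    # Every regular file from brkitr/.
    for path in sorted((src / "brkitr").glob("*")):
        if path.is_file():
            shutil.copy2(path, dest / "brkitr" / path.name)
            copied.append(f"brkitr/{path.name}")
    return copied
```

Returning the list of copied names makes it easy to compare the staged set against the files that the jar uf step is expected to refresh.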

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* test ICU so far, fix test code where necessary
- temporarily ignore collation issues that look like UCA/UCD mismatches,
  until UCA data is updated

* UCA

- get output from Mark's tools; look in
  http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. version>.txt
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note removing the underscore before "Rules")
- update (ICU)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)
- check test file diffs for previously commented-out, known-failing data lines;
  probably need to keep those commented out
- check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani
- run makeuca.sh:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b
- run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.
  (It also generates ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@draft" statement.

- diff current <icu>/source/layout files vs. generated ones
  ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
  review and manually merge desired changes;
  fix gratuitous changes, incorrect @draft and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
- if you just copy the above files, then
  fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
  manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

ICU 4.8 (no Unicode update, just new script codes)

* 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2010-12-21)
    Afak 439 Afaka
    Jurc 510 Jurchen
    Mroo 199 Mro, Mru
    Nshu 499 Nüshu
    Shrd 319 Sharada, Śāradā
    Sora 398 Sora Sompeng
    Takr 321 Takri, Ṭākrī, Ṭāṅkrī
    Tang 520 Tangut
    Wole 480 Woleai
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2;\3
  -> genpname/SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* rebuild Unicode tools (at least genpname) using make
- You might first need to "make install" ICU so that the tools build can pick
  up the new definitions from the installed header files.

* run genpname
  (builds both pnames.icu and propname_data.h)
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
- rebuild ICU & tools

* run genprops
- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
- rebuild ICU & tools

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
    ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
    ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
- refresh ICU4J
    ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b

* should have updated the layout engine script codes but forgot

---------------------------------------------------------------------------- ***

Unicode 6.0 update

*** related ICU Trac tickets

7264 Unicode 6.0 Update

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo

*** data files & enums & parser code

* file preparation

~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
- This now prepares both unidata and testdata files in respective output subfolders.

* PropertyAliases.txt changes
- new Script_Extensions property defined in the new ScriptExtensions.txt file
  but not listed in PropertyAliases.txt; reported to unicode.org;
  -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
       scx; Script_Extensions
  -> uchar.h with new UProperty section
  -> com.ibm.icu.lang.UProperty, parallel with uchar.h

* PropertyValueAliases.txt changes
- 12 new block names:
    Alchemical_Symbols
    Bamum_Supplement
    Batak
    Brahmi
    CJK_Unified_Ideographs_Extension_D
    Emoticons
    Ethiopic_Extended_A
    Kana_Supplement
    Mandaic
    Miscellaneous_Symbols_And_Pictographs
    Playing_Cards
    Transport_And_Map_Symbols
  -> add to uchar.h
  -> add to UCharacter.UnicodeBlock
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- Joining_Group (jg) values:
    Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias
  -> uchar.h & UCharacter.JoiningGroup
- 3 new scripts:
    sc ; Batk ; Batak
    sc ; Brah ; Brahmi
    sc ; Mand ; Mandaic
  -> remove these from SyntheticPropertyValueAliases.txt
  -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2009-11-11..2010-07-18)
    Bass 259 Bassa Vah
    Dupl 755 Duployan shorthand
    Elba 226 Elbasan
    Gran 343 Grantha
    Kpel 436 Kpelle
    Loma 437 Loma
    Mend 438 Mende
    Merc 101 Meroitic Cursive
    Narb 106 Old North Arabian
    Nbat 159 Nabataean
    Palm 126 Palmyrene
    Sind 318 Sindhi
    Wara 262 Warang Citi
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2;\3
  -> SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- ISO 15924 name change
    Mero 100 Meroitic Hieroglyphs (was Meroitic)
  -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
- property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt

* UnicodeData.txt changes
- new CJK block:
    2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
    2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
  -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion

* build Unicode tools using CMake+make

* run genpname/preparse.pl (on Linux)
  + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
  + make sure that data.h is writable
  + perl preparse.pl ~/svn.icu/trunk/src > out.txt
  + preparse.pl shows no errors, out.txt Info and Warning lines look ok

* rebuild Unicode tools (at least genpname) using make
- You might first need to "make install" ICU so that the tools build can pick
  up the new definitions from the installed header files.

* run genpname
- ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- rebuild ICU & tools

* update source/data/unidata/norm2/nfkc_cf.txt
- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt

* update source/data/unidata/norm2/uts46.txt
- download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
  to ~/svn.icu/tools/trunk/src/unicode/py
- adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0: U+2260, U+226E, U+226F

* generate core properties data files
- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools
- run makeuca.sh so that genuca picks up the new nfc.nrm:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools

* implement new Script_Extensions property (provisional)
- parser & generator: genprops & uprops.icu
- uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
- UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java

* switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
- (one-time change)
- genbidi/gencase/genprops tools changes
- re-run makeprops.sh (see above)
- UCharacterProperty.java, UCharacterTypeIterator.java,
  UBiDiProps.java, UCaseProps.java, and several others with minor changes;
  UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt45l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
    echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
- copy the big-endian Unicode data files to another location,
  separate from the other data files
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
    ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
- refresh ICU4J
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* un-hardcode normalization skippable (NF*_Inert) test data
- removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools

* copy updated break iterator test files
- now handled by early ucdcopy.py and
  copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
  (old instructions:
   copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
   to ~/svn.icu/trunk/src/source/test/testdata)
- they are not used in ICU4J

* UCA

- get output from Mark's tools; look in
  http://www.unicode.org/~book/incoming/mark/uca6.0.0/
  http://www.macchiato.com/unicode/utc/additional-uca-files
  http://www.unicode.org/Public/UCA/6.0.0/
  http://www.unicode.org/~mdavis/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
- update Han-implicit ranges for new CJK extensions:
  swapCJK() in ucol.cpp & ImplicitCEGenerator.java
- genuca: allow bytes 02 for U+FFFE, new merge-sort character;
  do not add it into invuca so that tailoring primary-after an ignorable works
- genuca: permit space between [variable top] bytes
- ucol.cpp: treat noncharacters like unassigned rather than ignorable
- run makeuca.sh:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
    ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
    ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
- update (ICU)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools
- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
ScriptRunData.cpp, which is no longer needed.)

The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.
* fix mixed line endings
* review the diffs and fix incorrect @draft and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

---------------------------------------------------------------------------- ***

Unicode 5.2 update

*** related ICU Trac tickets

7084 Unicode 5.2

7167 verify collation bytes
7235 Java test NAME_ALIAS
7236 Java DerivedCoreProperties.txt test
7237 Java BidiTest.txt
7238 UTrie2 in core unidata
7239 test for tailoring gaps
7240 Java fix CollationMiscTest
7243 update layout engine for Unicode 5.2

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in & configure
- update ucdVersion in gennames.c if an algorithmic range changes

*** data files & enums & parser code

* file preparation

python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
- includes finding files regardless of version numbers,
  copying them, and performing the equivalent processing of the
  ucdstrip and ucdmerge tools on the desired set of files

* notes on changes
- PropertyAliases.txt
  moved from numeric to enumerated:
    ccc ; Canonical_Combining_Class
  new string properties:
    NFKC_CF ; NFKC_Casefold
    Name_Alias; Name_Alias
  new binary properties:
    Cased ; Cased
    CI ; Case_Ignorable
    CWCF ; Changes_When_Casefolded
    CWCM ; Changes_When_Casemapped
    CWKCF ; Changes_When_NFKC_Casefolded
    CWL ; Changes_When_Lowercased
    CWT ; Changes_When_Titlecased
    CWU ; Changes_When_Uppercased
  new CJK Unihan properties (not supported by ICU)
- PropertyValueAliases.txt
  new block names
  new scripts
  one script code change:
    sc ; Qaai ; Inherited
    ->
    sc ; Zinh ; Inherited ; Qaai
  new Line_Break (lb) value:
    lb ; CP ; Close_Parenthesis
  new Joining_Group (jg) values: Farsi_Yeh, Nya
  other new values:
    ccc; 214; ATA ; Attached_Above
- DerivedBidiClass.txt
  new default-R range: U+1E800 - U+1EFFF
- UnicodeData.txt
  all of the ISO comments are gone
  new CJK block end:
    9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last>
  new CJK block:
    2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
    2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;

* genpname
- run preparse.pl
  + cd \svn\icuproj\icu\trunk\source\tools\genpname
  + make sure that data.h is writable
  + perl preparse.pl \svn\icuproj\icu\trunk > out.txt
  + preparse.pl complains with errors like the following:
    Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34.
    This is because ICU 4.0 had scripts from ISO 15924 which are now
    added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt
    and PropertyValueAliases.txt.
    -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
       Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt
  + preparse.pl complains with errors about block names missing from uchar.h; add them
713
714* uchar.h & uscript.h & uprops.h & uprops.c & genprops
715- new block & script values
716 + 26 new blocks
717 copy new blocks from Blocks.txt
718 MS VC++ 2008 regular expression:
719 find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$"
720 replace with " UBLOCK_\3 = 172, /*[\1]*/"
721 + several new script values already added in ICU 4.0 for ISO 15924 coverage
722 (removed from SyntheticPropertyValueAliases.txt, see genpname notes above)
723 + 3 new script values added for ISO 15924 and Unicode 5.2 coverage
724 + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2)
725 (added to SyntheticPropertyValueAliases.txt)
726- new Joining Group (JG) values: Farsi_Yeh, Nya
727- new Line_Break (lb) value:
728 lb ; CP ; Close_Parenthesis
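
The Blocks.txt search-and-replace described above for new UBLOCK_ constants could also be scripted. The following is a sketch only: ublock_constant is a hypothetical helper, and the numeric value passed in is the caller's responsibility (the regex in the notes always writes 172, which then has to be renumbered by hand).

```python
import re

# Matches Blocks.txt data lines such as "A960..A97F; Hangul Jamo Extended-A".
_BLOCK_LINE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+);\s*(.+?)\s*$")


def ublock_constant(line, value):
    """Turn one Blocks.txt data line into a uchar.h-style enum line.

    Returns None for comments, blank lines and anything else that
    is not a "start..end; name" data line.
    """
    m = _BLOCK_LINE.match(line)
    if m is None:
        return None
    start, _end, name = m.groups()
    # Replace runs of spaces and hyphens with "_" and uppercase the result.
    ident = re.sub(r"[^A-Za-z0-9]+", "_", name).upper()
    return "    UBLOCK_%s = %d, /*[%s]*/" % (ident, value, start)
```

Unlike the editor regex, this also uppercases the name and replaces spaces and hyphens, which otherwise needs a second manual pass.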
729
730* hardcoded Unihan range end/limit
731- Unihan range end moves from 9FC3 to 9FCB
732 search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive)
733 + do change gennames.c
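
The search for hardcoded range ends can be done with any grep-like tool. As a sketch, assuming a plain directory walk over the source tree is acceptable, a hypothetical helper find_hardcoded might look like this:

```python
import os
import re


def find_hardcoded(root, pattern=r"9FC[34]"):
    """List (path, line number, line) for every case-insensitive match."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # latin-1 never fails to decode, so binary files are tolerated.
                with open(path, encoding="latin-1") as f:
                    for lineno, line in enumerate(f, 1):
                        if regex.search(line):
                            hits.append((path, lineno, line.rstrip("\n")))
            except OSError:
                continue  # unreadable file: skip it
    return hits
```

Each hit still needs manual review, because some occurrences must not be changed.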
734
735* Compare definitions of new binary properties with what we used to use
736 in algorithms, to see if the definitions changed.
737- Verified that definitions for Cased and Case_Ignorable are unchanged.
738 The gencase tool now parses the newly public Case_Ignorable values
739 in case the definition changes in the future.
740
741* uchar.c & uprops.h & uprops.c & genprops
742- new numeric values that didn't exist in Unicode data before:
743 1/7, 1/9, 1/10, 3/10, 1/16, 3/16
744 the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5,
745 therefore redesign the encoding of numeric types and values for formatVersion 6;
746 design for simple numbers up to at least 144 ("one gross"),
747 large values up to at least 10^20,
748 and fractions with numerators -1..17 and denominators 1..16
749 to cover current and expected future values
750 (e.g., more Han numeric values, Meroitic twelfths)
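
To illustrate why numerators -1..17 and denominators 1..16 are a convenient design target, here is a hypothetical packing of such fractions into a small integer. It is not claimed to be the actual uprops.icu formatVersion 6 layout; pack_fraction and unpack_fraction are names invented for this sketch.

```python
NUM_MIN, NUM_MAX = -1, 17   # 19 possible numerators
DEN_MIN, DEN_MAX = 1, 16    # 16 possible denominators fit in 4 bits


def pack_fraction(numerator, denominator):
    """Pack a fraction into one integer: high bits numerator, low 4 bits denominator."""
    if not (NUM_MIN <= numerator <= NUM_MAX and DEN_MIN <= denominator <= DEN_MAX):
        raise ValueError("fraction out of the supported range")
    return ((numerator - NUM_MIN) << 4) | (denominator - DEN_MIN)


def unpack_fraction(code):
    """Inverse of pack_fraction()."""
    return (code >> 4) + NUM_MIN, (code & 0xF) + DEN_MIN
```

All of the new values listed above (1/7, 1/9, 1/10, 3/10, 1/16, 3/16) fall inside this range, and the packed codes stay below 304.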
751
752* reimplement Hangul_Syllable_Type for new Jamo characters
753- the old code assumed that all Jamo characters are in the 11xx block
754- Unicode 5.2 fills holes there and adds new Jamo characters in
755 A960..A97F; Hangul Jamo Extended-A
756 and in
757 D7B0..D7FF; Hangul Jamo Extended-B
758- Hangul_Syllable_Type can be trivially derived from a subset of
759 Grapheme_Cluster_Break values
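
The derivation mentioned above is small enough to show in full. This sketch works on short property value names and assumes the caller already has the Grapheme_Cluster_Break value of the character; hangul_syllable_type is a hypothetical function name.

```python
# GCB values that have a Hangul_Syllable_Type value of the same name.
_HANGUL_GCB = frozenset(["L", "V", "T", "LV", "LVT"])


def hangul_syllable_type(gcb):
    """Derive Hangul_Syllable_Type from a Grapheme_Cluster_Break short name."""
    return gcb if gcb in _HANGUL_GCB else "NA"
```

Because the result depends only on the stored GCB value, no assumption about which block a Jamo character lives in is needed.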
760
761* build Unicode data source code for hardcoding core data
762C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data
763
764ICU data make path is \svn\icuproj\icu\trunk\source\data\
765ICU root path is \svn\icuproj\icu\trunk
766Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
767Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
768Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
769Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
770Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
771Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
772Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
773Information: cannot find "spreplocal.mk". Not building user-additional stringprep files.
774Creating data file for Unicode Property Names
775Creating data file for Unicode Character Properties
776Creating data file for Unicode Case Mapping Properties
777Creating data file for Unicode BiDi/Shaping Properties
778Creating data file for Unicode Normalization
779Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l"
780Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp"
781
782- copy the .c source files to C:\svn\icuproj\icu\trunk\source\common
783 and rebuild the common library
784
785*** UCA
786
787- update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools)
788- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools
789- update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools
790[ Begin obsolete instructions:
791 Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files.
792 - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py
793 on Windows:
794 python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt
795 python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt
796 End obsolete instructions]
797- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
798 not just the *_STUB.txt files
799- note on intltest: if collate/UCAConformanceTest fails, then
800 utility/MultithreadTest/TestCollators will fail as well;
801 fix the conformance test before looking into the multi-thread test
802
803*** Implement Cased & Case_Ignorable properties
804- via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable()
805- Problem: These properties should be disjoint, but aren't
806- UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not
807- change ucase.icu to be able to store any combination of Cased and Case_Ignorable
808
809*** Implement Changes_When_Xyz properties
810- without stored data
811
812*** Implement Name_Alias property
813- add it as another name field in unames.icu
814- make it available via u_charName() and UCharNameChoice
815- consider it in u_charFromName()
816
817*** Break iterators
818
819* Update break iterator rules to new UAX versions and new property values
820* Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary
821
822*** new BidiTest file
823- review format and data
824- copy BidiTest.txt to source/test/testdata
825- write test code using this data
826- fix ICU code where it fails the conformance test
827
828*** Java
829- generally, find and update code corresponding to C/C++
830- UCharacter.UnicodeBlock constants:
831 a) add an _ID integer per new block, update COUNT
832 b) add a class instance per new block
833 Visual Studio regex:
834 find UBLOCK_{[^ ]+} = [0-9]+, {/.+}
835 replace with public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
836- CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias()
837
838- port test changes to Java
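
The Visual Studio find/replace for UnicodeBlock instances shown above translates directly into a script. This is a sketch using Python's re module; java_block_line is a hypothetical helper, and the constant value in the usage example is illustrative only.

```python
import re

# Same shape as the editor regex: name, numeric value, trailing comment.
_UBLOCK = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")


def java_block_line(c_line):
    """Convert one uchar.h UBLOCK_ enum line into a UnicodeBlock declaration."""
    m = _UBLOCK.search(c_line)
    if m is None:
        return None
    name, comment = m.groups()
    return 'public static final UnicodeBlock %s = new UnicodeBlock("%s", %s_ID); %s' % (
        name, name, name, comment)
```

For example, java_block_line("    UBLOCK_SAMARITAN = 172, /*[0800]*/") yields a declaration for SAMARITAN that refers to SAMARITAN_ID, which still has to be added separately as described in step a).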
839
840*** LayoutEngine script information
841
842(For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833)
843
844* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
845ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
846ScriptRunData.cpp, which is no longer needed.)
847
848The generated files have a current copyright date and "@draft" statement.
849
850-> Eric Mader wrote in email on 20090930:
851 "I think the tool has been modified to update @draft to @stable for
852 older scripts and to add @draft for new scripts.
853 (I worked with an intern on this last year.)
854 You should check the output after you run it."
855
856* copy the above files into <icu>/source/layout, replacing the old files.
857* fix mixed line endings
858* review the diffs and fix incorrect @draft and missing aliases
859* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h
860
861Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
862and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
863
864-> Eric Mader wrote in email on 20090930:
865 "This is just a matter of making sure that all the per-script tables have
866 entries for any new scripts that were added.
867 If any new Indic characters were added, then the class tables in
868 IndicClassTables.cpp should be updated to reflect this.
869 John Emmons should know how to do this if it's required."
870
871* rebuild the layout and layoutex libraries.
872
873*** Documentation
874- Update User Guide
875 + Jamo_Short_Name, sfc->scf, binary property value aliases
876
877---------------------------------------------------------------------------- ***
878
879Unicode 5.1 update
880
881*** related ICU Trac tickets
882
8835696 Update to Unicode 5.1
884
885*** Unicode version numbers
886- makedata.mak
887- uchar.h
888- configure.in & configure
889- update ucdVersion in gennames.c if an algorithmic range changes
890
891*** data files & enums & parser code
892
893* file preparation
894- ucdstrip:
895 DerivedCoreProperties.txt
896 DerivedNormalizationProps.txt
897 NormalizationTest.txt
898 PropList.txt
899 Scripts.txt
900 GraphemeBreakProperty.txt
901 SentenceBreakProperty.txt
902 WordBreakProperty.txt
903- ucdstrip and ucdmerge:
904 EastAsianWidth.txt
905 LineBreak.txt
906
907* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
908copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\
909copy 5.1.0\ucd\Blocks.txt ..\unidata\
910copy 5.1.0\ucd\CaseFolding.txt ..\unidata\
911copy 5.1.0\ucd\DerivedAge.txt ..\unidata\
912copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
913copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
914copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
915copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
916copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\
917copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\
918copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\
919copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\
920copy 5.1.0\ucd\UnicodeData.txt ..\unidata\
921
922ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
923ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
924ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
925ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt
926ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
927ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
928ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
929ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
930ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
931ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
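
For readers without the ucdstrip and ucdmerge tools, the following sketch shows the kind of processing they are understood to perform: stripping comments and empty lines, then merging adjacent code point ranges that carry the same value. The exact behavior of the real tools is an assumption here, and the function names merely mirror the tool names.

```python
def ucdstrip(lines):
    """Drop '#' comments, trailing whitespace and empty lines."""
    out = []
    for line in lines:
        text = line.split("#", 1)[0].rstrip()
        if text:
            out.append(text)
    return out


def _parse(line):
    """Split 'XXXX[..YYYY];value' into (start, end, value)."""
    rng, value = line.split(";", 1)
    start, _, end = rng.strip().partition("..")
    start = int(start, 16)
    return start, (int(end, 16) if end else start), value.strip()


def ucdmerge(lines):
    """Merge directly adjacent ranges that have the same value."""
    merged = []
    for line in lines:
        start, end, value = _parse(line)
        if merged and merged[-1][2] == value and merged[-1][1] + 1 == start:
            merged[-1][1] = end  # extend the previous range
        else:
            merged.append([start, end, value])
    return ["%04X;%s" % (s, v) if s == e else "%04X..%04X;%s" % (s, e, v)
            for s, e, v in merged]
```

Chaining the two, as in the LineBreak.txt line of the batch file, would be ucdmerge(ucdstrip(lines)).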
932
933* genpname
934- run preparse.pl
935 + cd \svn\icuproj\icu\uni51\source\tools\genpname
936 + make sure that data.h is writable
937 + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt
938 + preparse.pl complains with errors like the following:
939 Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30.
940 This is because ICU 3.8 had scripts from ISO 15924 which are now
941 added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt
942 and PropertyValueAliases.txt.
943 -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt:
944 Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii
945 + PropertyValueAliases.txt now explicitly contains values for boolean properties:
946 N/Y, No/Yes, F/T, False/True
947 -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases.
948 It will use further values from the file if present.
949
950* uchar.h & uscript.h & uprops.h & uprops.c & genprops
951- new block & script values
952 + 17 new blocks
953 + 11 new script values already added in ICU 3.8 for ISO 15924 coverage
954 (removed from SyntheticPropertyValueAliases.txt)
955 + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1)
956 (added to SyntheticPropertyValueAliases.txt)
957- uprops.icu (uprops.h) only provides 7 bits for script codes.
958 In ICU 4.0 there are USCRIPT_CODE_LIMIT=130 script codes now.
959 There is none above 127 yet which is the script code for an
960 assigned Unicode character, so ICU 4.0 uprops.icu does not store any
961 script code values greater than 127.
962 However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129
963 in a parallel bit field, and that overflows now.
964 Also, future values >=128 would be incompatible anyway.
965 uprops.h is modified to move around several of the bit fields
966 in the properties vector words, and now uses 8 bits for the script code.
967 Two other bit fields also grow to accommodate future growth:
968 Block (current count: 172) grows from 8 to 9 bits,
969 and Word_Break grows from 4 to 5 bits.
970- renamed property Simple_Case_Folding (sfc->scf)
971 + nothing to be done: handled as normal alias
972- new property JSN Jamo_Short_Name
973 + no new API: only contributes to the Name property
974- new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark
975- new Joining Group (JG) value: Burushashki_Yeh_Barree
976- new Sentence_Break (SB) values:
977 SB ; CR ; CR
978 SB ; EX ; Extend
979 SB ; LF ; LF
980 SB ; SC ; SContinue
981- new Word_Break (WB) values:
982 WB ; CR ; CR
983 WB ; Extend ; Extend
984 WB ; LF ; LF
985 WB ; MB ; MidNumLet
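
The bit field overflow described in the uprops.icu note above comes down to simple arithmetic, which this sketch makes explicit. bits_needed is a hypothetical helper; it only computes the minimum width for a maximum value and says nothing about how uprops.h actually lays out its fields.

```python
def bits_needed(max_value):
    """Minimum number of bits that can hold values 0..max_value."""
    return max(1, max_value.bit_length())


# Script codes: the maximum value is USCRIPT_CODE_LIMIT - 1 = 129.
assert bits_needed(127) == 7   # still fits the old 7-bit field
assert bits_needed(129) == 8   # overflows 7 bits, hence the move to 8
# Blocks: 172 values fit in 8 bits; the 9-bit field is room for growth.
assert bits_needed(172 - 1) == 8
```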
986
987* Further changes in the 2008-02-29 update:
988- Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP
989 because they should not normally be invisible.
990- new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed)
991- new Grapheme_Cluster_Break (GCB) value: PP=Prepend
992- new Word_Break (WB) value: NL=Newline
993
994* hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison)
995- Unihan range end moves from 9FBB to 9FC3
996 search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive)
997 + do change gennames.c
998
999* build Unicode data source code for hardcoding core data
1000C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data
1001
1002ICU data make path is \svn\icuproj\icu\uni51\source\data\
1003ICU root path is \svn\icuproj\icu\uni51
1004Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
1005Information: cannot find "brklocal.mk". Not building user-additional break iterator files.
1006Information: cannot find "reslocal.mk". Not building user-additional resource bundle files.
1007Information: cannot find "collocal.mk". Not building user-additional resource bundle files.
1008Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files.
1009Information: cannot find "trnslocal.mk". Not building user-additional transliterator files.
1010Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files.
1011Creating data file for Unicode Character Properties
1012Creating data file for Unicode Case Mapping Properties
1013Creating data file for Unicode BiDi/Shaping Properties
1014Creating data file for Unicode Normalization
1015Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l"
1016Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp"
1017
1018- copy the .c source files to C:\svn\icuproj\icu\uni51\source\common
1019 and rebuild the common library
1020
1021*** Break iterators
1022
1023* Update break iterator rules to new UAX versions and new property values
1024
1025*** UCA
1026
1027* update FractionalUCA.txt and UCARules.txt with new canonical closure
1028
1029*** Test suites
1030- Test that APIs using Unicode property value aliases (like UnicodeSet)
1031 support all of the boolean values N/Y, No/Yes, F/T, False/True
1032 -> TestBinaryValues() tests in both cintltst and intltest
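
As an illustration of what such a test has to cover, here is a sketch of parsing the boolean value aliases. It is not ICU code; parse_binary_value is a hypothetical name, and the loose matching (ignoring case, whitespace, underscores and hyphens) is an assumption modeled on the usual property alias matching rules.

```python
import re

_TRUE = frozenset(["y", "yes", "t", "true"])
_FALSE = frozenset(["n", "no", "f", "false"])


def parse_binary_value(alias):
    """Map any of N/Y, No/Yes, F/T, False/True to a Python bool."""
    key = re.sub(r"[\s_\-]", "", alias).lower()
    if key in _TRUE:
        return True
    if key in _FALSE:
        return False
    raise ValueError("not a binary property value alias: %r" % alias)
```

A test in this style would loop over all eight aliases and check that a set built with each "true" alias equals the set built with every other one.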
1033
1034*** LayoutEngine script information
1035* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
1036ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
1037ScriptRunData.cpp, which is no longer needed.)
1038
1039The generated files have a current copyright date and "@draft" statement.
1040
1041* copy the above files into <icu>/source/layout, replacing the old files.
1042
1043Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
1044and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
1045
1046* rebuild the layout and layoutex libraries.
1047
1048*** Documentation
1049- Update User Guide
1050 + Jamo_Short_Name, sfc->scf, binary property value aliases
1051
1052---------------------------------------------------------------------------- ***
1053
1054Unicode 5.0 update
1055
1056*** related Jitterbugs
1057
10585084 RFE: Update to Unicode 5.0
1059
1060*** data files & enums & parser code
1061
1062* file preparation
1063- ucdstrip:
1064 DerivedCoreProperties.txt
1065 DerivedNormalizationProps.txt
1066 NormalizationTest.txt
1067 PropList.txt
1068 Scripts.txt
1069 GraphemeBreakProperty.txt
1070 SentenceBreakProperty.txt
1071 WordBreakProperty.txt
1072- ucdstrip and ucdmerge:
1073 EastAsianWidth.txt
1074 LineBreak.txt
1075
1076* my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers)
1077copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\
1078copy 5.0.0\ucd\Blocks.txt ..\unidata\
1079copy 5.0.0\ucd\CaseFolding.txt ..\unidata\
1080copy 5.0.0\ucd\DerivedAge.txt ..\unidata\
1081copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\
1082copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\
1083copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\
1084copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\
1085copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\
1086copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\
1087copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\
1088copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\
1089copy 5.0.0\ucd\UnicodeData.txt ..\unidata\
1090
1091ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt
1092ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt
1093ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt
1094ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt
1095ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt
1096ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt
1097ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt
1098ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt
1099ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt
1100ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt
1101
1102* update FractionalUCA.txt and UCARules.txt with new canonical closure
1103
1104* genpname
1105- run preparse.pl
1106 + make sure that data.h is writable
1107 + perl preparse.pl \cvs\oss\icu > out.txt
1108
1109* uchar.h & uscript.h & uprops.h & uprops.c & genprops
1110- new block & script values
1111 + script values already added in ICU 3.6 because all of ISO 15924 is now covered
1112
1113* build Unicode data source code for hardcoding core data
1114C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data
1115
1116ICU data make path is \cvs\oss\icu\source\data\
1117ICU root path is \cvs\oss\icu
1118Information: cannot find "ucmlocal.mk". Not building user-additional converter files.
1119[etc.]
1120Creating data file for Unicode Character Properties
1121Creating data file for Unicode Case Mapping Properties
1122Creating data file for Unicode BiDi/Shaping Properties
1123Creating data file for Unicode Normalization
1124Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l"
1125Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp"
1126
1127- copy the .c source files to C:\cvs\oss\icu\source\common
1128 and rebuild the common library
1129
1130*** Unicode version numbers
1131- makedata.mak
1132- uchar.h
1133- configure.in
1134
1135*** LayoutEngine script information
1136* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
1137ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
1138ScriptRunData.cpp, which is no longer needed.)
1139
1140The generated files have a current copyright date and "@draft" statement.
1141
1142* copy the above files into <icu>/source/layout, replacing the old files.
1143
1144Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp
1145and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...)
1146
1147* rebuild the layout and layoutex libraries.
1148
1149---------------------------------------------------------------------------- ***
1150
1151Unicode 4.1 update
1152
1153*** related Jitterbugs
1154
11554332 RFE: Update to Unicode 4.1
11564157 RBBI, TR29 4.1 updates
1157
1158*** data files & enums & parser code
1159
1160* file preparation
1161- ucdstrip:
1162 DerivedCoreProperties.txt
1163 DerivedNormalizationProps.txt
1164 NormalizationTest.txt
1165 GraphemeBreakProperty.txt
1166 SentenceBreakProperty.txt
1167 WordBreakProperty.txt
1168- ucdstrip and ucdmerge:
1169 EastAsianWidth.txt
1170 LineBreak.txt
1171
1172* add new files to the repository
1173 GraphemeBreakProperty.txt
1174 SentenceBreakProperty.txt
1175 WordBreakProperty.txt
1176
1177* update FractionalUCA.txt and UCARules.txt with new canonical closure
1178
1179* genpname
1180- handle new enumerated properties in sub read_uchar
1181- run preparse.pl
1182
1183* uchar.h & uscript.h & uprops.h & uprops.c & genprops
1184- new binary properties
1185 + Pattern_Syntax
1186 + Pattern_White_Space
1187- new enumerated properties
1188 + Grapheme_Cluster_Break
1189 + Sentence_Break
1190 + Word_Break
1191- new block & script & line break values
1192
1193* gencase
1194- case-ignorable changes
1195 see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
1196 now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk
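
The case-ignorable definition quoted above can be written as a short predicate. This sketch uses Python's unicodedata module, which reflects whatever Unicode version the Python build ships with, not Unicode 4.1. Because unicodedata has no Word_Break data, the set of MidLetter characters must be supplied by the caller; is_case_ignorable and its mid_letter parameter are hypothetical.

```python
import unicodedata

# General categories named in the definition: Mn, Me, Cf, Lm, Sk.
_CASE_IGNORABLE_GC = frozenset(["Mn", "Me", "Cf", "Lm", "Sk"])


def is_case_ignorable(ch, mid_letter=frozenset()):
    """True if ch has Word_Break=MidLetter (per the caller's set)
    or one of the general categories Mn, Me, Cf, Lm, Sk."""
    return ch in mid_letter or unicodedata.category(ch) in _CASE_IGNORABLE_GC
```

A caller would typically build mid_letter once from WordBreakProperty.txt and pass it to every call.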
1197
1198*** Unicode version numbers
1199- makedata.mak
1200- uchar.h
1201- configure.in
1202
1203*** tests
1204- verify that u_charMirror() round-trips
1205- test all new properties and some new values of old properties
1206
1207*** other code
1208
1209* hardcoded Unihan range end/limit
1210- Unihan range end moves from 9FA5 to 9FBB
1211 search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive)
1212 + do not modify BOCU/BOCSU code because that would change the encoding
1213 and break binary compatibility!
1214 + similarly, do not change the GB 18030 range data (ucnvmbcs.c),
1215 NamePrepProfile.txt
1216 + ignore trietest.c: test data is arbitrary
1217 + ignore tstnorm.cpp: test optimization, not important
1218 + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF
1219 + do change line_th.txt and word_th.txt
1220 by replacing hardcoded ranges with the new property values
1221 + do change gennames.c
1222
1223source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
1224source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6
1225source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5,
1226
1227* case mappings
1228- compare new special casing context conditions with previous ones
1229 see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods
1230
1231* genpname
1232- consider storing only the short name if it is the same as the long name
1233
1234*** other reviews
1235- UAX #29 changes (grapheme/word/sentence breaks)
1236- UAX #14 changes (line breaks)
1237- Pattern_Syntax & Pattern_White_Space
1238
1239---------------------------------------------------------------------------- ***
1240
1241Unicode 4.0.1 update
1242
1243*** related Jitterbugs
1244
12453170 RFE: Update to Unicode 4.0.1
12463171 Add new Unicode 4.0.1 properties
12473520 use Unicode 4.0.1 updates for break iteration
1248
1249*** data files & enums & parser code
1250
1251* file preparation
1252- ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt
1253- ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt
1254
1255* file fixes
1256- fix UnicodeData.txt general categories of Ethiopic digits Nd->No
1257 according to PRI #26
1258 http://www.unicode.org/review/resolved-pri.html#pri26
1259- undone again because no corrigendum in sight;
1260 instead modified tests to not check consistency on this for Unicode 4.0.1
1261
1262* ucdterms.txt
1263- update from http://www.unicode.org/copyright.html
1264 formatted for plain text
1265
1266* uchar.h & uprops.h & uprops.c & genprops
1267- add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed
1268- add U_LB_INSEPARABLE due to a spelling fix
1269 + put short name comment only on line with new constant
1270 for genpname perl script parser
1271- new binary properties
1272 + STerm
1273 + Variation_Selector
1274
1275* genpname
1276- fix genpname perl script so that it doesn't choke on more than 2 names per property value
1277- perl script: correctly calculate the maximum number of fields per row
1278
1279* uscript.h
1280- new script code Hrkt=Katakana_Or_Hiragana
1281
1282* gennorm.c track changes in DerivedNormalizationProps.txt
1283- "FNC" -> "FC_NFKC"
1284- single field "NFD_NO" -> two fields "NFD_QC; N" etc.
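
A parser that must accept both the old single-field and the new two-field quick-check syntax could normalize them as follows. This is a sketch, not gennorm.c logic: parse_qc_fields is a hypothetical helper, and the old-style spellings other than "NFD_NO" (for example "NFKC_MAYBE") are assumed by analogy.

```python
def parse_qc_fields(fields):
    """Normalize quick-check fields to a (property, value) pair.

    fields is the list of stripped fields after the code point range:
    ["NFD_NO"] in the old syntax, ["NFD_QC", "N"] in the new one.
    """
    if len(fields) == 1:
        # Old syntax: "NFD_NO" -> property "NFD_QC", value "N".
        prop, _, value = fields[0].rpartition("_")
        return prop + "_QC", value[0]
    return fields[0], fields[1]
```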
1285
1286* genprops/props2.c track changes in DerivedNumericValues.txt
1287- changed from 3 columns to 2, dropping the numeric type
1288 + assume that the type is always numeric for Han characters,
1289 and that only those are added in addition to what UnicodeData.txt lists
1290
1291*** Unicode version numbers
1292- makedata.mak
1293- uchar.h
1294- configure.in
1295
1296*** tests
1297- update test of default bidi classes according to PRI #28
1298 /tsutil/cucdtst/TestUnicodeData
1299 http://www.unicode.org/review/resolved-pri.html#pri28
1300- bidi tests: change exemplar character for ES depending on Unicode version
1301- change hardcoded expected property values where they change
1302
1303*** other code
1304
1305* name matching
1306- read UCD.html
1307
1308* scripts
1309- use new Hrkt=Katakana_Or_Hiragana
1310
1311* ZWJ & ZWNJ
1312- are now part of combining character sequences
1313- break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ