
* License & terms of use: http://www.unicode.org/copyright.html
*
*   file name:  changes.txt
*   encoding:   US-ASCII
*   tab size:   8 (not used)
*   indentation:4
*
*   created on: 2004may06
*   created by: Markus W. Scherer
*
* change log for Unicode updates
*
* For each new Unicode version, during the beta period,
* I copy the change log for the previous version to the top of this file.
* I adjust the versions, tickets, URLs, and paths.
* I work my way through the steps listed in the log, top to bottom,
* adjusting the log as necessary.
* I report problems to the UTC and/or CLDR and/or ICU.
* Before the data is final, I "turn the crank" several more times,
* using appropriate subsets of the steps.

25 ---------------------------------------------------------------------------- ***

* New ISO 15924 script codes

Starting with ICU 55, we no longer add UScriptCode constants for new scripts
until they are encoded in Unicode,
or can be assumed to be encoded in the next Unicode version.
Script enum constant names should follow the Unicode script property value aliases,
which are assigned only when the scripts are encoded.
When we add constants early and guess wrong, we end up with confusing enum constants
and have sometimes had to add aliases.

Variant script codes like Latf and Aran that are not subject to separate encoding
can be added at any time.
(For example, Aran could be added as USCRIPT_ARABIC_NASTALIQ.)

We add script codes used in CLDR or in the spoof checker.
This includes combination/alias codes like Hanb and Jamo.
See http://unicode.org/reports/tr35/#unicode_script_subtag_validity
and look for "alias" on http://unicode.org/iso15924/iso15924-codes.html

We add special Z* script codes like Zsye.

For new script codes see http://www.unicode.org/iso15924/codechanges.html

50 ---------------------------------------------------------------------------- ***

52 Unicode 13.0 update for ICU 66

54 https://www.unicode.org/versions/Unicode13.0.0/

55 https://www.unicode.org/versions/beta-13.0.0.html

56 https://www.unicode.org/Public/13.0.0/ucd/

57 https://www.unicode.org/reports/uax-proposed-updates.html

58 https://www.unicode.org/reports/tr44/tr44-25.html

60 https://unicode-org.atlassian.net/browse/CLDR-13387

61 https://unicode-org.atlassian.net/browse/ICU-20893

* Command-line environment setup

UNICODE_DATA=~/unidata/uni13/20200212
CLDR_SRC=~/cldr/uni/src
ICU_ROOT=~/icu/uni
ICU_SRC=$ICU_ROOT/src
ICUDT=icudt66b
ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in
ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

74 *** Unicode version numbers

75 - makedata.mak

76 - uchar.h

77 - com.ibm.icu.util.VersionInfo

78 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

80 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h

81 so that the makefiles see the new version number.

82 cd $ICU_ROOT/dbg/icu4c

83 ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

85 *** data files & enums & parser code

87 * download files

88 - mkdir -p $UNICODE_DATA

89 - download Unicode files into $UNICODE_DATA

90 + subfolders: emoji, idna, security, ucd, uca

91 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

92 + split Unihan into single-property files

93 ~/unitools/trunk/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan

94 + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt

95 or from the ucd/cldr/ output folder of the Unicode Tools:

96 Since Unicode 12/CLDR 35/ICU 64 CLDR uses modified break rules.

97 cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata

* for manual diffs and for Unicode Tools input data updates:
  remove version suffixes from the file names
    ~$ unidata/desuffixucd.py $UNICODE_DATA
  (see https://sites.google.com/site/unicodetools/inputdata)
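The renaming step above can be pictured with a short Python sketch. This is not the code of desuffixucd.py. It assumes that beta-period file names carry a suffix such as "-13.0.0d5" directly before the extension, and the function name strip_version_suffix is hypothetical.

```python
import re

# Assumed suffix shape: "-<major>.<minor>.<update>", optionally followed by
# a draft marker such as "d5", directly before the file extension.
_VERSION_SUFFIX = re.compile(r"-\d+\.\d+\.\d+(?:d\d+)?(?=\.[A-Za-z0-9]+$)")


def strip_version_suffix(name: str) -> str:
    """Return the file name without a version suffix, if it has one."""
    return _VERSION_SUFFIX.sub("", name)


if __name__ == "__main__":
    # Sample names are illustrative, not taken from a real download.
    for sample in ("Blocks-13.0.0d1.txt", "UnicodeData.txt"):
        print(sample, "->", strip_version_suffix(sample))
```

Names without a matching suffix pass through unchanged, so running such a step twice would be harmless.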

103

104 * process and/or copy files

105 - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC

106 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

107 + For debugging, and tweaking how ppucd.txt is written,

108 the tool has an --only_ppucd option:

109 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

110

111 - cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

112

113 * new constants for new property values

114 - preparseucd.py error:

115 ValueError: missing uchar.h enum constants for some property values:

116 [(u'blk', set([u'Symbols_For_Legacy_Computing', u'Dives_Akuru', u'Yezidi',

117 u'Tangut_Sup', u'CJK_Ext_G', u'Khitan_Small_Script', u'Chorasmian', u'Lisu_Sup'])),

118 (u'sc', set([u'Chrs', u'Diak', u'Kits', u'Yezi'])),

119 (u'InPC', set([u'Top_And_Bottom_And_Left']))]

120 = PropertyValueAliases.txt new property values (diff old & new .txt files)

121 blk; Chorasmian ; Chorasmian

122 blk; CJK_Ext_G ; CJK_Unified_Ideographs_Extension_G

123 blk; Dives_Akuru ; Dives_Akuru

124 blk; Khitan_Small_Script ; Khitan_Small_Script

125 blk; Lisu_Sup ; Lisu_Supplement

126 blk; Symbols_For_Legacy_Computing ; Symbols_For_Legacy_Computing

127 blk; Tangut_Sup ; Tangut_Supplement

128 blk; Yezidi ; Yezidi

  -> add to uchar.h before UBLOCK_COUNT
     use long property names for enum constants,
     for the trailing comment get the block start code point: diff old & new Blocks.txt
  -> add to UCharacter.UnicodeBlock IDs
     Eclipse find     UBLOCK_([^ ]+) = ([0-9]+), (/.+)
             replace  public static final int \1_ID = \2; \3
  -> add to UCharacter.UnicodeBlock objects
     Eclipse find     UBLOCK_([^ ]+) = [0-9]+, (/.+)
             replace  public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
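For anyone not using Eclipse, the two find/replace pairs above can be tried out with Python's re module, which accepts the same patterns and backreferences. This is only a sketch: the helper names are hypothetical, and the sample uchar.h line (including its numeric value) is illustrative.

```python
import re

# Same patterns and replacements as the Eclipse find/replace steps.
ID_FIND = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
ID_REPLACE = r"public static final int \1_ID = \2; \3"

OBJ_FIND = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
OBJ_REPLACE = r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2'


def to_id_constant(uchar_h_line: str) -> str:
    """Turn a uchar.h UBLOCK_ enum line into a Java *_ID constant."""
    return ID_FIND.sub(ID_REPLACE, uchar_h_line.strip())


def to_block_object(uchar_h_line: str) -> str:
    """Turn a uchar.h UBLOCK_ enum line into a Java UnicodeBlock object."""
    return OBJ_FIND.sub(OBJ_REPLACE, uchar_h_line.strip())


if __name__ == "__main__":
    sample = "UBLOCK_YEZIDI = 308, /*[10E80]*/"  # illustrative input line
    print(to_id_constant(sample))
    print(to_block_object(sample))
```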

138

139 sc ; Chrs ; Chorasmian

140 sc ; Diak ; Dives_Akuru

141 sc ; Kits ; Khitan_Small_Script

142 sc ; Yezi ; Yezidi

143 -> uscript.h & com.ibm.icu.lang.UScript

144 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()

145 and in com.ibm.icu.dev.test.lang.TestUScript.java

146

147 InPC; Top_And_Bottom_And_Left ; Top_And_Bottom_And_Left

148 -> uchar.h enum UIndicPositionalCategory & UCharacter.java IndicPositionalCategory

149

150 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata

151 (not strictly necessary for NOT_ENCODED scripts)

152 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

153

154 * build ICU (make install)

155 to make sure that there are no syntax errors, and

156 so that the tools build can pick up the new definitions from the installed header files.

157

158 $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

159

160 * update spoof checker UnicodeSet initializers:

161 inclusionPat & recommendedPat in i18n/uspoof.cpp

162 INCLUSION & RECOMMENDED in SpoofChecker.java

163 - make sure that the Unicode Tools tree contains the latest security data files

164 - go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator

165 - update the hardcoded version number there in the DIRECTORY path

166 - run the tool (no special environment variables needed)

167 - copy & paste from the Console output into the .cpp & .java files

168

* generate normalization data files
    cd $ICU_ROOT/dbg/icu4c
    bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
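The five gennorm2 invocations above layer their input files: every data file starts from nfc.txt and adds its own mappings on top. The following Python sketch only rebuilds those command lines from a small table so that the layering is easier to see. The function name and its parameters are hypothetical; the options (-o, -s, --csource) are exactly the ones used in the commands above.

```python
def gennorm2_commands(icu_src: str, data_in: str, unidata: str) -> list:
    """Build the argument lists for the five gennorm2 runs in this log."""
    norm2_dir = f"{unidata}/norm2"
    # (output file, layered input files, write C source instead of binary?)
    jobs = [
        (f"{icu_src}/icu4c/source/common/norm2_nfc_data.h", ["nfc.txt"], True),
        (f"{data_in}/nfc.nrm", ["nfc.txt"], False),
        (f"{data_in}/nfkc.nrm", ["nfc.txt", "nfkc.txt"], False),
        (f"{data_in}/nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"], False),
        (f"{data_in}/uts46.nrm", ["nfc.txt", "uts46.txt"], False),
    ]
    commands = []
    for output, inputs, csource in jobs:
        argv = ["bin/gennorm2", "-o", output, "-s", norm2_dir, *inputs]
        if csource:
            argv.append("--csource")
        commands.append(argv)
    return commands


if __name__ == "__main__":
    # Placeholders stand in for the shell variables from the environment setup.
    for argv in gennorm2_commands("$ICU_SRC", "$ICU4C_DATA_IN", "$ICU4C_UNIDATA"):
        print(" ".join(argv))
```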

176

177 * build ICU (make install)

178 so that the tools build can pick up the new definitions from the installed header files.

179

180 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

181

182 * build Unicode tools using CMake+make

183

184 $ICU_SRC/tools/unicode/c/icudefs.txt:

185

186 # Location (--prefix) of where ICU was installed.

187 set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)

188 # Location of the ICU4C source tree.

189 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)

190

191 $ICU_ROOT/dbg$

192 mkdir -p tools/unicode/c

193 cd tools/unicode/c

194

195 $ICU_ROOT/dbg/tools/unicode/c$

196 cmake ../../../../src/tools/unicode/c

197 make

198

199 * generate core properties data files

200 $ICU_ROOT/dbg/tools/unicode/c$

201 genprops/genprops $ICU_SRC/icu4c

202 - tool failure:

203 genprops: Script_Extensions indexes overflow bit field

204 genprops: error parsing or setting values from ppucd.txt line 32696 - U_BUFFER_OVERFLOW_ERROR

205 -> uprops.icu data file format :

206 add two more bits to store a script code or Script_Extensions index

207 -> generator code, C++ & Java runtime, uprops.icu format version 7.7

208 - rebuild ICU (make install) & tools

209

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
  - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
  - Unicode 6.0..13.0: U+2260, U+226E, U+226F
  - nothing new in this Unicode version, no test file to update
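The grep above can also be done with a few lines of Python that filter out the ASCII entries automatically. This is a sketch under assumptions: it expects the IdnaMappingTable.txt layout "code point or range ; status ; ... # comment", the function name is hypothetical, and the sample lines in the demo are hand-written in that assumed layout rather than copied from the data file.

```python
def non_ascii_std3_valid(lines):
    """Return (first, last) code point pairs with status disallowed_STD3_valid
    that reach beyond ASCII."""
    hits = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        if last > 0x7F:
            hits.append((first, last))
    return hits


if __name__ == "__main__":
    sample = [
        "003D          ; disallowed_STD3_valid   # EQUALS SIGN",
        "00E0..00F6    ; valid                   # (illustrative range)",
        "2260          ; disallowed_STD3_valid   # NOT EQUAL TO",
        "226E..226F    ; disallowed_STD3_valid   # NOT LESS-THAN..NOT GREATER-THAN",
    ]
    for first, last in non_ascii_std3_valid(sample):
        print(f"U+{first:04X}..U+{last:04X}")
```

Comparing the result with the known list (U+2260, U+226E, U+226F) shows whether a new Unicode version added characters that need test updates.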

215

216 * run & fix ICU4C tests

217 - fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files

218 - Andy helps with RBBI & spoof check test failures

219

220 * collation: CLDR collation root, UCA DUCET

221

222 - UCA DUCET goes into Mark's Unicode tools, see

223 https://sites.google.com/site/unicodetools/home#TOC-UCA

224 diff the main mapping file, look for bad changes

225 (for example, more bytes per weight for common characters)

226 ~/svn.unitools/trunk$ sed -r -f ~/cldr/uni/src/tools/scripts/uca/blankweights.sed ../Generated/UCA/13.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-13.0.txt

227 ~/svn.unitools/trunk$ meld ../frac-12.1.txt ../frac-13.0.txt

228

229 - CLDR root data files are checked into $CLDR_SRC/common/uca/

230 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

231

232 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt

233 cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt

234 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt

235 cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt

236 (note removing the underscore before "Rules")

237 cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt

238 - restore TODO diffs in UCARules.txt

239 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt

240 - update (ICU4C)/source/test/testdata/CollationTest_*.txt

241 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt

242 from the CLDR root files (..._CLDR_..._SHORT.txt)

243 cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt

244 cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt

245 cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data

246 - if CLDR common/uca/unihan-index.txt changes, then update

247 CLDR common/collation/root.xml <collation type="private-unihan">

248 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

249

250 - run genuca

251 $ICU_ROOT/dbg/tools/unicode/c$

252 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \

253 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c

254 - rebuild ICU4C

255

256 * Unihan collators

257 https://sites.google.com/site/unicodetools/unihan

258 - run Unicode Tools

259 org.unicode.draft.GenerateUnihanCollators

260 with VM arguments

261 -ea

262 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk

263 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools

264 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data

265 -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src

266 -DUVERSION=13.0.0

267 - run Unicode Tools

268 org.unicode.draft.GenerateUnihanCollatorFiles

269 with the same arguments

270 - check CLDR diffs

271 cd $CLDR_SRC

272 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml

273 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml

274 - copy to CLDR

275 cd $CLDR_SRC

276 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml

277 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml

278 - run CLDR unit tests, commit to CLDR

279 - generate ICU zh collation data: run CLDR

280 org.unicode.cldr.icu.NewLdml2IcuConverter

281 with program arguments

282 -t collation

283 -s /usr/local/google/home/mscherer/cldr/uni/src/common/collation

284 -m /usr/local/google/home/mscherer/cldr/uni/src/common/supplemental

285 -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll

286 -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation

287 zh

288 and VM arguments

289 -ea

290 -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src

291 - rebuild ICU4C

292

293 * run & fix ICU4C tests, now with new CLDR collation root data

294 - run all tests with the collation test data *_SHORT.txt or the full files

295 (the full ones have comments, useful for debugging)

296 - note on intltest: if collate/UCAConformanceTest fails, then

297 utility/MultithreadTest/TestCollators will fail as well;

298 fix the conformance test before looking into the multi-thread test

299

300 * update Java data files

301 - refresh just the UCD/UCA-related/derived files, just to be safe

302 - see (ICU4C)/source/data/icu4j-readme.txt

303 - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

304 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

305 output:

306 ...

307 make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'

308 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt66b

309 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b

310 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt66l.dat ./out/icu4j/icudt66b.dat -s ./out/build/icudt66l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt66b

311 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b"

312 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt66b/

313 mkdir -p /tmp/icu4j/main/shared/data

314 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data

315 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt66b/

316 mkdir -p /tmp/icu4j/main/shared/data

317 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data

318 make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'

- copy the big-endian Unicode data files to another location,
  separate from the other data files,
  and then refresh ICU4J
    cd $ICU_ROOT/dbg/icu4c/data/out/icu4j
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu
    cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT
    cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll
    cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr
    jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
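The cp and rm commands above amount to a selection rule for which files are refreshed in icudata.jar. The rule can be written down as a small Python predicate, for example to double-check a file listing. This is a sketch only: the function name is hypothetical, and the file names in the demo are illustrative.

```python
from pathlib import PurePosixPath


def is_unicode_data_file(rel_path: str) -> bool:
    """Mirror the cp/rm steps; rel_path is relative to com/ibm/icu/impl/data/$ICUDT."""
    path = PurePosixPath(rel_path)
    if len(path.parts) == 1:
        if path.name == "cnvalias.icu":
            return False  # copied by *.icu, then removed again by the rm step
        return path.name == "confusables.cfu" or path.suffix in (".icu", ".nrm")
    # "cp coll/*" and "cp brkitr/*" copy only the direct children of those folders.
    return len(path.parts) == 2 and path.parts[0] in ("coll", "brkitr")


if __name__ == "__main__":
    for name in ("uprops.icu", "cnvalias.icu", "nfc.nrm", "coll/root.res", "zh.res"):
        print(name, is_unicode_data_file(name))
```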

332

333 * When refreshing all of ICU4J data from ICU4C

334 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

335 - cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data

336 or

337 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

338

339 * update CollationFCD.java

340 + copy & paste the initializers of lcccIndex[] etc. from

341 ICU4C/source/i18n/collationfcd.cpp to

342 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

343

344 * refresh Java test .txt files

345 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

346 cd $ICU_SRC/icu4c/source/data/unidata

347 cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

348 cd ../../test/testdata

349 cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

350 cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

351

352 * run & fix ICU4J tests

353

354 *** API additions

355 - send notice to icu-design about new born-@stable API (enum constants etc.)

356

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
  for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
  in new blocks (Blocks.txt)
  Unicode 13:
    diak 11950..11959 Dives_Akuru
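The egrep above finds the "digit four" of each decimal digit set. From there the full range follows if each set occupies ten consecutive code points in ascending order. The Python sketch below makes that arithmetic explicit. It rests on assumptions: ppucd.txt character lines are taken to look like "cp;<hex code point>;name=value;...", the function name is hypothetical, and the sample lines are hand-written.

```python
import re

_CODE_POINT = re.compile(r"^cp;([0-9A-F]+);")      # single code points only, no ranges
_DIGIT_FOUR = re.compile(r";gc=Nd.+;nv=4(;|$)")    # the egrep pattern, with an end anchor


def digit_ranges(lines):
    """Return (zero, nine) code point pairs derived from each gc=Nd, nv=4 line."""
    ranges = []
    for line in lines:
        line = line.rstrip("\n")
        cp_match = _CODE_POINT.match(line)
        if cp_match and _DIGIT_FOUR.search(line):
            four = int(cp_match.group(1), 16)
            # Assumes the digits 0..9 are encoded contiguously.
            ranges.append((four - 4, four + 5))
    return ranges


if __name__ == "__main__":
    sample = [
        "cp;11950;gc=Nd;nt=De;nv=0",  # illustrative lines, not copied from ppucd.txt
        "cp;11954;gc=Nd;nt=De;nv=4",
    ]
    for zero, nine in digit_ranges(sample):
        print(f"{zero:04X}..{nine:04X}")
```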

364

365 *** merge the Unicode update branches back onto the trunk

366 - do not merge the icudata.jar and testdata.jar,

367 instead rebuild them from merged & tested ICU4C

368 - make sure that changes to Unicode tools are checked in:

369 http://www.unicode.org/utility/trac/log/trunk/unicodetools

370

371 ---------------------------------------------------------------------------- ***

372

373 Unicode 12.1 update for ICU 64.2

374

375 ** This is an abbreviated update with one new character for the new

376 ** Japanese era expected to start on 2019-May-01: U+32FF SQUARE ERA NAME REIWA

377 https://en.wikipedia.org/wiki/Reiwa_period

378

379 http://www.unicode.org/versions/Unicode12.1.0/

380

381 ICU-20497 Unicode 12.1

382

383 cldrbug 11978: Unicode 12.1

384

385 * Command-line environment setup

386

387 UNICODE_DATA=~/unidata/uni121/20190403

388 CLDR_SRC=~/svn.cldr/uni

389 ICU_ROOT=~/icu/uni

390 ICU_SRC=$ICU_ROOT/src

391 ICUDT=icudt64b

392 ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in

393 ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata

394 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

395

396 *** Unicode version numbers

397 - makedata.mak

398 - uchar.h

399 - com.ibm.icu.util.VersionInfo

400 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

401

402 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h

403 so that the makefiles see the new version number.

404 cd $ICU_ROOT/dbg/icu4c

405 ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh

406

407 *** data files & enums & parser code

408

409 * download files

410 - mkdir -p $UNICODE_DATA

411 - download Unicode files into $UNICODE_DATA

412 + subfolders: emoji, idna, security, ucd, uca

413 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

414

415 * for manual diffs and for Unicode Tools input data updates:

416 remove version suffixes from the file names

417 ~$ unidata/desuffixucd.py $UNICODE_DATA

418 (see https://sites.google.com/site/unicodetools/inputdata)

419

420 * process and/or copy files

421 - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC

422 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

423 + For debugging, and tweaking how ppucd.txt is written,

424 the tool has an --only_ppucd option:

425 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

426

427 - cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

428

429 * build ICU (make install)

430 so that the tools build can pick up the new definitions from the installed header files.

431

432 $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

433

434 * update spoof checker UnicodeSet initializers:

435 inclusionPat & recommendedPat in uspoof.cpp

436 INCLUSION & RECOMMENDED in SpoofChecker.java

437 - make sure that the Unicode Tools tree contains the latest security data files

438 - go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator

439 - update the hardcoded version number there in the DIRECTORY path

440 - run the tool (no special environment variables needed)

441 - copy & paste from the Console output into the .cpp & .java files

442

443 * generate normalization data files

444 cd $ICU_ROOT/dbg/icu4c

445 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource

446 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt

447 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt

448 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt

449 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

450

451 * build ICU (make install)

452 so that the tools build can pick up the new definitions from the installed header files.

453

454 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

455

456 * build Unicode tools using CMake+make

457

458 $ICU_SRC/tools/unicode/c/icudefs.txt:

459

460 # Location (--prefix) of where ICU was installed.

461 set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)

462 # Location of the ICU4C source tree.

463 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)

464

465 $ICU_ROOT/dbg$

466 mkdir -p tools/unicode/c

467 cd tools/unicode/c

468

469 $ICU_ROOT/dbg/tools/unicode/c$

470 cmake ../../../../src/tools/unicode/c

471 make

472

473 * generate core properties data files

474 $ICU_ROOT/dbg/tools/unicode/c$

475 genprops/genprops $ICU_SRC/icu4c

476 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \

477 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c

478 - rebuild ICU (make install) & tools

479

480 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to

481 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)

482 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters

483 - Unicode 6.0..12.1: U+2260, U+226E, U+226F

484 - nothing new in this Unicode version, no test file to update

485

486 * run & fix ICU4C tests

487 - Andy handles RBBI & spoof check test failures

488

489 * collation: CLDR collation root, UCA DUCET

490

491 - UCA DUCET goes into Mark's Unicode tools, see

492 https://sites.google.com/site/unicodetools/home#TOC-UCA

493 diff the main mapping file, look for bad changes

494 (for example, more bytes per weight for common characters)

495 ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.1.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.1.txt

496 ~/svn.unitools/trunk$ meld ../frac-12.txt ../frac-12.1.txt

497

498 - CLDR root data files are checked into $CLDR_SRC/common/uca/

499 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

500

501 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt

502 cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt

503 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt

504 cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt

505 (note removing the underscore before "Rules")

506 cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt

507 - restore TODO diffs in UCARules.txt

508 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt

509 - update (ICU4C)/source/test/testdata/CollationTest_*.txt

510 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt

511 from the CLDR root files (..._CLDR_..._SHORT.txt)

512 cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt

513 cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt

514 cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data

515 - if CLDR common/uca/unihan-index.txt changes, then update

516 CLDR common/collation/root.xml <collation type="private-unihan">

517 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

518

519 - run genuca, see command line above

520 - rebuild ICU4C

521

522 * Unihan collators

523 https://sites.google.com/site/unicodetools/unihan

524 - run Unicode Tools

525 org.unicode.draft.GenerateUnihanCollators

526 with VM arguments

527 -ea

528 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk

529 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools

530 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data

531 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni

532 -DUVERSION=12.1.0

533 - run Unicode Tools

534 org.unicode.draft.GenerateUnihanCollatorFiles

535 with the same arguments

536 - check CLDR diffs

537 cd $CLDR_SRC

538 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml

539 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml

540 - copy to CLDR

541 cd $CLDR_SRC

542 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml

543 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml

544 - run CLDR unit tests, commit to CLDR

545 - generate ICU zh collation data: run CLDR

546 org.unicode.cldr.icu.NewLdml2IcuConverter

547 with program arguments

548 -t collation

549 -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation

550 -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental

551 -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll

552 -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation

553 zh

554 and VM arguments

555 -ea

556 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni

557 - rebuild ICU4C

558

559 * run & fix ICU4C tests, now with new CLDR collation root data

560 - run all tests with the collation test data *_SHORT.txt or the full files

561 (the full ones have comments, useful for debugging)

562 - note on intltest: if collate/UCAConformanceTest fails, then

563 utility/MultithreadTest/TestCollators will fail as well;

564 fix the conformance test before looking into the multi-thread test

565

566 * update Java data files

567 - refresh just the UCD/UCA-related/derived files, just to be safe

568 - see (ICU4C)/source/data/icu4j-readme.txt

569 - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

570 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

571 output:

572 ...

573 make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'

574 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt64b

575 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b

576 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt64l.dat ./out/icu4j/icudt64b.dat -s ./out/build/icudt64l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt64b

577 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b"

578 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt64b/

579 mkdir -p /tmp/icu4j/main/shared/data

580 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data

581 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt64b/

582 mkdir -p /tmp/icu4j/main/shared/data

583 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data

584 make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'

585 - copy the big-endian Unicode data files to another location,

586 separate from the other data files,

587 and then refresh ICU4J

588 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j

589 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll

590 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr

591 cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

592 cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

593 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu

594 cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

595 cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll

596 cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr

597 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

598

599 * When refreshing all of ICU4J data from ICU4C

600 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

601 - cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data

602 or

603 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

604

605 * update CollationFCD.java

606 + copy & paste the initializers of lcccIndex[] etc. from

607 ICU4C/source/i18n/collationfcd.cpp to

608 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

609

610 * refresh Java test .txt files

611 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

612 cd $ICU_SRC/icu4c/source/data/unidata

613 cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

614 cd ../../test/testdata

615 cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

616 cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

617

618 * run & fix ICU4J tests

619

620 *** API additions

621 - send notice to icu-design about new born-@stable API (enum constants etc.)

622

*** CLDR numbering systems
- look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR
  for example, look for
    ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt
  in new blocks (Blocks.txt)
  Unicode 12: using Unicode 12 CLDR ticket #11478
    hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong
    wcho 1E2F0..1E2F9 Wancho
  Unicode 11: using Unicode 11 CLDR ticket #10978
    rohg 10D30..10D39 Hanifi_Rohingya
    gong 11DA0..11DA9 Gunjala_Gondi
  Earlier: CLDR tickets specific to adding new numbering systems.
    Unicode 10: http://unicode.org/cldr/trac/ticket/10219
    Unicode 9: http://unicode.org/cldr/trac/ticket/9692

637

638 *** merge the Unicode update branches back onto the trunk

639 - do not merge the icudata.jar and testdata.jar,

640 instead rebuild them from merged & tested ICU4C

641 - make sure that changes to Unicode tools are checked in:

642 http://www.unicode.org/utility/trac/log/trunk/unicodetools

643

644 ---------------------------------------------------------------------------- ***

645

646 Unicode 12.0 update for ICU 64

647

648 http://www.unicode.org/versions/Unicode12.0.0/

649 http://unicode.org/versions/beta-12.0.0.html

650 https://www.unicode.org/review/pri389/

651 http://www.unicode.org/reports/uax-proposed-updates.html

652 http://www.unicode.org/reports/tr44/tr44-23.html

653

654 ICU-20203 Unicode 12

655

656 ICU-20111 move text layout properties data into a data file

657

658 cldrbug 11478: Unicode 12

659 Accidentally used ^/trunk instead of ^/branches/markus/uni12

660

661 * Command-line environment setup

662

663 UNICODE_DATA=~/unidata/uni12/20190309

664 CLDR_SRC=~/svn.cldr/uni

665 ICU_ROOT=~/icu/uni

666 ICU_SRC=$ICU_ROOT/src

667 ICUDT=icudt63b

668 ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in

669 ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata

670 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

671

672 *** Unicode version numbers

673 - makedata.mak

674 - uchar.h

675 - com.ibm.icu.util.VersionInfo

676 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

677

678 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h

679 so that the makefiles see the new version number.

680

681 *** data files & enums & parser code

682

683 * download files

684 - mkdir -p $UNICODE_DATA

685 - download Unicode files into $UNICODE_DATA

686 + subfolders: emoji, idna, security, ucd, uca

687 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

688

689 * for manual diffs and for Unicode Tools input data updates:

690 remove version suffixes from the file names

691 ~$ unidata/desuffixucd.py $UNICODE_DATA

692 (see https://sites.google.com/site/unicodetools/inputdata)
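The suffix-removal step can be pictured with a minimal sketch. It is not the actual desuffixucd.py; it assumes file names of the form "<name>-<version>[d<draft>].<ext>" (for example "UnicodeData-12.0.0d3.txt"), and the function names are made up for illustration.

```python
import os
import re

# Assumed naming pattern: "<name>-<version>[d<draft>].<ext>",
# e.g. "UnicodeData-12.0.0d3.txt". This is an assumption, not taken from
# desuffixucd.py itself.
_SUFFIX_RE = re.compile(
    r"^(?P<stem>.+?)-\d+(?:\.\d+)*(?:d\d+)?(?P<ext>\.[A-Za-z0-9]+)$")


def strip_version_suffix(filename):
    """Return filename without a version suffix; unchanged if none is found."""
    match = _SUFFIX_RE.match(filename)
    if match is None:
        return filename
    return match.group("stem") + match.group("ext")


def desuffix_tree(root):
    """Rename every suffixed file under root (hypothetical helper)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            new_name = strip_version_suffix(name)
            if new_name != name:
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, new_name))
```

Calling desuffix_tree() on a copy of $UNICODE_DATA first is the safer way to try this, since the renames are done in place.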

693

694 * process and/or copy files

695 - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC

696 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

697 + For debugging, and tweaking how ppucd.txt is written,

698 the tool has an --only_ppucd option:

699 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

700

701 - cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

702

703 * build ICU (make install)

704 so that the tools build can pick up the new definitions from the installed header files.

705

706 $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

707

708 * new constants for new property values

709 - preparseucd.py error:

710 ValueError: missing uchar.h enum constants for some property values:

711 [(u'blk', set([u'Symbols_And_Pictographs_Ext_A', u'Elymaic',

712 u'Ottoman_Siyaq_Numbers', u'Nandinagari', u'Nyiakeng_Puachue_Hmong',

713 u'Small_Kana_Ext', u'Egyptian_Hieroglyph_Format_Controls', u'Wancho', u'Tamil_Sup'])),

714 (u'sc', set([u'Nand', u'Wcho', u'Elym', u'Hmnp']))]

715 = PropertyValueAliases.txt new property values (diff old & new .txt files)

716 blk; Egyptian_Hieroglyph_Format_Controls; Egyptian_Hieroglyph_Format_Controls

717 blk; Elymaic ; Elymaic

718 blk; Nandinagari ; Nandinagari

719 blk; Nyiakeng_Puachue_Hmong ; Nyiakeng_Puachue_Hmong

720 blk; Ottoman_Siyaq_Numbers ; Ottoman_Siyaq_Numbers

721 blk; Small_Kana_Ext ; Small_Kana_Extension

722 blk; Symbols_And_Pictographs_Ext_A ; Symbols_And_Pictographs_Extended_A

723 blk; Tamil_Sup ; Tamil_Supplement

724 blk; Wancho ; Wancho

725 -> add to uchar.h

726 use long property names for enum constants,

727 for the trailing comment get the block start code point: diff old & new Blocks.txt

728 -> add to UCharacter.UnicodeBlock IDs

729 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)

730 replace public static final int \1_ID = \2; \3

731 -> add to UCharacter.UnicodeBlock objects

732 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)

733 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

734

735 sc ; Elym ; Elymaic

736 sc ; Hmnp ; Nyiakeng_Puachue_Hmong

737 sc ; Nand ; Nandinagari

738 sc ; Wcho ; Wancho

739 -> uscript.h & com.ibm.icu.lang.UScript

740 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()

741 and in com.ibm.icu.dev.test.lang.TestUScript.java
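The "diff old & new .txt files" step for PropertyValueAliases.txt can be approximated with a small script. This is only a sketch: it relies on the semicolon-separated layout "property ; short name ; long name" with "#" comments, and the function names are hypothetical.

```python
def parse_property_value_aliases(lines):
    """Return a set of (property, short_name, long_name) tuples.

    Note: for ccc the second field is the numeric value, so the tuple holds
    (ccc, number, short name); that is good enough for spotting additions.
    """
    values = set()
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) >= 3:
            values.add((fields[0], fields[1], fields[2]))
    return values


def new_property_values(old_lines, new_lines):
    """Map each property to the (short, long) value names that are new."""
    added = (parse_property_value_aliases(new_lines)
             - parse_property_value_aliases(old_lines))
    by_property = {}
    for prop, short_name, long_name in sorted(added):
        by_property.setdefault(prop, []).append((short_name, long_name))
    return by_property
```

With the old and new files read via open(path).readlines(), the result groups the additions by property (blk, sc, ...) in the same shape as the lists above.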

742

743 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata

744 (not strictly necessary for NOT_ENCODED scripts)

745 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

746

747 * update spoof checker UnicodeSet initializers:

748 inclusionPat & recommendedPat in uspoof.cpp

749 INCLUSION & RECOMMENDED in SpoofChecker.java

750 - make sure that the Unicode Tools tree contains the latest security data files

751 - go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator

752 - update the hardcoded version number there in the DIRECTORY path

753 - run the tool (no special environment variables needed)

754 - copy & paste from the Console output into the .cpp & .java files

755

756 * generate normalization data files

757 cd $ICU_ROOT/dbg/icu4c

758 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource

759 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt

760 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt

761 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt

762 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
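For scripting the five gennorm2 runs, a minimal sketch could build the argument lists from a table. The flags (-o, -s, --csource) and file names are copied from the commands above; the function and variable names are assumptions for illustration.

```python
import posixpath

# Binary outputs and their input files, in the order used in the log.
_NORM2_JOBS = [
    ("nfc.nrm", ["nfc.txt"]),
    ("nfkc.nrm", ["nfc.txt", "nfkc.txt"]),
    ("nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"]),
    ("uts46.nrm", ["nfc.txt", "uts46.txt"]),
]


def gennorm2_commands(icu_src, data_in, unidata):
    """Return argv lists, meant to be run from $ICU_ROOT/dbg/icu4c."""
    norm2_dir = posixpath.join(unidata, "norm2")
    header = posixpath.join(icu_src, "icu4c/source/common/norm2_nfc_data.h")
    # First the hardcoded NFC data as C source, then the binary .nrm files.
    commands = [["bin/gennorm2", "-o", header, "-s", norm2_dir,
                 "nfc.txt", "--csource"]]
    for output, inputs in _NORM2_JOBS:
        commands.append(["bin/gennorm2", "-o",
                         posixpath.join(data_in, output),
                         "-s", norm2_dir] + inputs)
    return commands
```

Each list can be passed to subprocess.run(command, check=True, cwd=...) with the build directory as cwd.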

763

764 * build ICU (make install)

765 so that the tools build can pick up the new definitions from the installed header files.

766

767 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date

768

769 * build Unicode tools using CMake+make

770

771 $ICU_SRC/tools/unicode/c/icudefs.txt:

772

773 # Location (--prefix) of where ICU was installed.

774 set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)

775 # Location of the ICU4C source tree.

776 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c)

777

778 $ICU_ROOT/dbg$

779 mkdir -p tools/unicode/c

780 cd tools/unicode/c

781

782 $ICU_ROOT/dbg/tools/unicode/c$

783 cmake ../../../../src/tools/unicode/c

784 make

785

786 * generate core properties data files

787 $ICU_ROOT/dbg/tools/unicode/c$

788 genprops/genprops $ICU_SRC/icu4c

789 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \

790 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c

791 - rebuild ICU (make install) & tools

792

793 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to

794 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)

795 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters

796 - Unicode 6.0..12.0: U+2260, U+226E, U+226F

797 - nothing new in this Unicode version, no test file to update
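The grep for "disallowed_STD3_valid" on non-ASCII characters can be sketched as follows. The sketch covers only IdnaMappingTable.txt (not uts46.txt) and assumes its usual layout, "code point or range ; status ; ... # comment"; the function name is hypothetical.

```python
def non_ascii_std3_valid(lines):
    """Return non-ASCII code points whose status is disallowed_STD3_valid."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0]  # drop the trailing comment
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        # The first field is either "XXXX" or "XXXX..YYYY".
        start_hex, _, end_hex = fields[0].partition("..")
        start = int(start_hex, 16)
        end = int(end_hex, 16) if end_hex else start
        found.extend(cp for cp in range(start, end + 1) if cp > 0x7F)
    return found
```

If the result differs from the code points already listed for earlier Unicode versions, the two test files need new cases.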

798

799 * run & fix ICU4C tests

800 - update test of default bidi classes:

801 Bidi range \U0001ED00-\U0001ED4F changes default from R to AL,

802 see diffs in DerivedBidiClass.txt

803 + /tsutil/cucdtst/TestUnicodeData enumDefaultsRange() defaultBidi[]

804 + UCharacterTest.java TestIteration() defaultBidi[]

805 - Andy handles RBBI & spoof check test failures

806

807 * collation: CLDR collation root, UCA DUCET

808

809 - UCA DUCET goes into Mark's Unicode tools, see

810 https://sites.google.com/site/unicodetools/home#TOC-UCA

811 diff the main mapping file, look for bad changes

812 (for example, more bytes per weight for common characters)

813 ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.txt

814 ~/svn.unitools/trunk$ meld ../frac-11.txt ../frac-12.txt

815

816 - CLDR root data files are checked into $CLDR_SRC/common/uca/

817 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

818

819 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt

820 cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt

821 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt

822 cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt

823 (note removing the underscore before "Rules")

824 cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt

825 - restore TODO diffs in UCARules.txt

826 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt

827 - update (ICU4C)/source/test/testdata/CollationTest_*.txt

828 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt

829 from the CLDR root files (..._CLDR_..._SHORT.txt)

830 cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt

831 cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt

832 cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data

833 - if CLDR common/uca/unihan-index.txt changes, then update

834 CLDR common/collation/root.xml <collation type="private-unihan">

835 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

836

837 - run genuca, see command line above;

838 deal with

839 Error: Unknown script for first-primary sample character U+119CE on line 29233 of /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:

840 FDD1 119CE; [71 CD 02, 05, 05] # Nandinagari first primary (compressible)

841 (add the character to genuca.cpp sampleCharsToScripts[])

842 + This time, I added code to genuca.cpp to use uscript_getSampleUnicodeString(script)

843 and cache its values.

844 Works as long as the script metadata is updated before the collation data.

845 - rebuild ICU4C

846

847 * Unihan collators

848 https://sites.google.com/site/unicodetools/unihan

849 - run Unicode Tools

850 org.unicode.draft.GenerateUnihanCollators

851 with VM arguments

852 -ea

853 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk

854 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools

855 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data

856 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni

857 -DUVERSION=12.0.0

858 - run Unicode Tools

859 org.unicode.draft.GenerateUnihanCollatorFiles

860 with the same arguments

861 - check CLDR diffs

862 cd $CLDR_SRC

863 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml

864 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml

865 - copy to CLDR

866 cd $CLDR_SRC

867 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml

868 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml

869 - run CLDR unit tests, commit to CLDR

870 - generate ICU zh collation data: run CLDR

871 org.unicode.cldr.icu.NewLdml2IcuConverter

872 with program arguments

873 -t collation

874 -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation

875 -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental

876 -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll

877 -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation

878 zh

879 and VM arguments

880 -ea

881 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni

882 - rebuild ICU4C

883

884 * run & fix ICU4C tests, now with new CLDR collation root data

885 - run all tests with the collation test data *_SHORT.txt or the full files

886 (the full ones have comments, useful for debugging)

887 - note on intltest: if collate/UCAConformanceTest fails, then

888 utility/MultithreadTest/TestCollators will fail as well;

889 fix the conformance test before looking into the multi-thread test

890

891 * update Java data files

892 - refresh just the UCD/UCA-related/derived files, just to be safe

893 - see (ICU4C)/source/data/icu4j-readme.txt

894 - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

895 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

896 output:

897 ...

898 Unicode .icu files built to ./out/build/icudt63l

899 echo timestamp > uni-core-data

900 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt63b

901 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b

902 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt

903 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt63l.dat ./out/icu4j/icudt63b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt63l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt63b

904 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b"

905 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt63b/

906 mkdir -p /tmp/icu4j/main/shared/data

907 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data

908 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt63b/

909 mkdir -p /tmp/icu4j/main/shared/data

910 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data

911 make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data'

912 - copy the big-endian Unicode data files to another location,

913 separate from the other data files,

914 and then refresh ICU4J

915 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j

916 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll

917 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr

918 cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

919 cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

920 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu

921 cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

922 cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll

923 cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr

924 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

925

926 * When refreshing all of ICU4J data from ICU4C

927 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

928 - cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data

929 or

930 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

931

932 * update CollationFCD.java

933 + copy & paste the initializers of lcccIndex[] etc. from

934 ICU4C/source/i18n/collationfcd.cpp to

935 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

936

937 * refresh Java test .txt files

938 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

939 cd $ICU_SRC/icu4c/source/data/unidata

940 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

941 cd ../../test/testdata

942 cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

943 cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

944

945 * run & fix ICU4J tests

946

947 *** API additions

948 - send notice to icu-design about new born-@stable API (enum constants etc.)

949

950 *** CLDR numbering systems

951 - look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR

952 for example, look for

953 ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt

954 in new blocks (Blocks.txt)

955 Unicode 12: using Unicode 12 CLDR ticket #11478

956 hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong

957 wcho 1E2F0..1E2F9 Wancho

958 Unicode 11: using Unicode 11 CLDR ticket #10978

959 rohg 10D30..10D39 Hanifi_Rohingya

960 gong 11DA0..11DA9 Gunjala_Gondi

961 Earlier: CLDR tickets specific to adding new numbering systems.

962 Unicode 10: http://unicode.org/cldr/trac/ticket/10219

963 Unicode 9: http://unicode.org/cldr/trac/ticket/9692
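The egrep search for digit four can be turned into digit ranges with a short sketch. It reuses the same pattern and assumes that matching ppucd.txt lines have the form "cp;<hex code point>;..."; the sample input in the usage note is made up, and the function name is hypothetical.

```python
import re

# Same pattern as the egrep command: a decimal digit (gc=Nd) with value 4.
_DIGIT_FOUR_RE = re.compile(r";gc=Nd.+;nv=4")


def decimal_digit_ranges(ppucd_lines):
    """Return (zero, nine) code point pairs, one per digit-four line found."""
    ranges = []
    for line in ppucd_lines:
        if not _DIGIT_FOUR_RE.search(line):
            continue
        fields = line.split(";")
        if len(fields) < 2 or fields[0] != "cp":
            continue  # assumption: only single-code-point lines are relevant
        four = int(fields[1], 16)
        # Decimal digits are encoded contiguously, so zero is four minus 4.
        ranges.append((four - 4, four + 5))
    return ranges
```

For a made-up line such as "cp;1E144;gc=Nd;nt=De;nv=4" this yields the pair (0x1E140, 0x1E149), which matches the hmnp range listed above. The ranges still need to be compared against Blocks.txt by hand to see which ones are new.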

964

965 *** merge the Unicode update branches back onto the trunk

966 - do not merge the icudata.jar and testdata.jar,

967 instead rebuild them from merged & tested ICU4C

968 - make sure that changes to Unicode tools are checked in:

969 http://www.unicode.org/utility/trac/log/trunk/unicodetools

970

971 ---------------------------------------------------------------------------- ***

972

973 ICU 63: added ICU support for the text layout properties InPC, InSC, vo

974

975 * Command-line environment setup

976

977 UNICODE_DATA=~/unidata/uni11/20180609

978 CLDR_SRC=~/svn.cldr/uni

979 ICU_ROOT=~/icu/mine

980 ICU_SRC=$ICU_ROOT/src

981 ICUDT=icudt62b

982 ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in

983 ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata

984 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

985

986 *** Links

987

988 https://unicode-org.atlassian.net/browse/ICU-8966 InPC & InSC

989 https://unicode-org.atlassian.net/browse/ICU-12850 vo

990

991 *** data files & enums & parser code

992

993 * API additions

994 - for each of the three new enumerated properties

995 + uchar.h: add the enum UProperty constant UCHAR_<long prop name>

996 + uchar.h: update UCHAR_INT_LIMIT

997 + uchar.h: add the enum U<long prop name>

998 with constants U_<short prop name>_<long value name>

999 + UProperty.java: add the constant <long prop name>

1000 + UProperty.java: update INT_LIMIT

1001 + UCharacter.java: add the interface <long prop name>

1002 with constants <long value name>

1003

1004 * process and/or copy files

1005 - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC

1006 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

1007 + It also writes tools/unicode/c/genprops/pnames_data.h with property and value

1008 names and aliases.

1009 + For debugging, and tweaking how ppucd.txt is written,

1010 the tool has an --only_ppucd option:

1011 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

1012

1013 * preparseucd.py changes

1014 - add new property short names (uppercase) to _prop_and_value_re

1015 so that ParseUCharHeader() parses the new enum constants

1016

1017 * build ICU (make install)

1018 so that the tools build can pick up the new definitions from the installed header files.

1019

1020 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

1021

1022 * build Unicode tools using CMake+make

1023

1024 $ICU_SRC/tools/unicode/c/icudefs.txt:

1025

1026 # Location (--prefix) of where ICU was installed.

1027 set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c)

1028 # Location of the ICU4C source tree.

1029 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/mine/src/icu4c)

1030

1031 $ICU_ROOT/dbg$

1032 mkdir -p tools/unicode/c

1033 cd tools/unicode/c

1034

1035 $ICU_ROOT/dbg/tools/unicode/c$

1036 cmake ../../../../src/tools/unicode/c

1037 make

1038

1039 * generate core properties data files

1040 $ICU_ROOT/dbg/tools/unicode/c$

1041 genprops/genprops $ICU_SRC/icu4c

1042 - rebuild ICU (make install) & tools

1043

1044 * write data for runtime, hardcoded for now

1045 - add genprops/layoutpropsbuilder.cpp with pieces from sibling files

1046 - generate new icu4c/source/common/ulayout_props_data.h

1047 - for each of the three new enumerated properties

1048 + int property max value

1049 + small, 8-bit UCPTrie

1050 (A small 16-bit trie with bit fields for these three properties

1051 is very nearly the same size as the sum of the three.)

1052

1053 * wire into C++

1054 - uprops.cpp: #include ulayout_props_data.h

1055 - uprops.cpp: add getInPC() etc. functions

1056 - uprops.cpp: add lines to intProps[], include max values

1057 - uprops.h: add UPropertySource constants

1058 - uprops.cpp: add uprops_addPropertyStarts(src)

1059 - uniset_props.cpp: add to UnicodeSet_initInclusion()

1060 - intltest/ucdtest.cpp: write unit tests

1061

1062 * update Java data files

1063 - refresh just the pnames.icu file with the new property [value] names, just to be safe

1064 - see $ICU_SRC/icu4c/source/data/icu4j-readme.txt

1065 - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

1066 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

1067 - copy the big-endian Unicode data files to another location,

1068 separate from the other data files,

1069 and then refresh ICU4J

1070 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j

1071 cp com/ibm/icu/impl/data/$ICUDT/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

1072 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

1073

1074 * wire into Java

1075 - UCharacterProperty.java: add new SRC_INPC etc. constants as in C++

1076 - UCharacterProperty.java: for each new property

1077 + create a nested class to hold its CodePointTrie

1078 + initialize it from a string literal

1079 + paste in the initializer printed by genprops

1080 + add a new IntProperty object to the intProps[] array

1081 + use the correct max int value for each property, also printed by genprops

1082 - UCharacterProperty.java: add ulayout_addPropertyStarts(src, set)

1083 - UnicodeSet.java: add to getInclusions()

1084 - UCharacterTest.java: write unit tests

1085

1086 ---------------------------------------------------------------------------- ***

1087

1088 Unicode 11.0 update for ICU 62

1089

1090 http://www.unicode.org/versions/Unicode11.0.0/

1091 http://unicode.org/versions/beta-11.0.0.html

1092 https://www.unicode.org/review/pri372/

1093 http://www.unicode.org/reports/uax-proposed-updates.html

1094 http://www.unicode.org/reports/tr44/tr44-21.html

1095

1096 * Command-line environment setup

1097

1098 UNICODE_DATA=~/unidata/uni11/20180521

1099 CLDR_SRC=~/svn.cldr/uni

1100 ICU_ROOT=~/svn.icu/uni

1101 ICU_SRC=$ICU_ROOT/src

1102 ICUDT=icudt61b

1103 ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in

1104 ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata

1105 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

1106

1107 *** ICU Trac

1108

1109 - ticket:13630: Unicode 11

1110 - ^/branches/markus/uni11

1111

1112 *** CLDR Trac

1113

1114 - cldrbug 10978: Unicode 11

1115 - ^/branches/markus/uni11

1116

1117 *** Unicode version numbers

1118 - makedata.mak

1119 - uchar.h

1120 - com.ibm.icu.util.VersionInfo

1121 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

1122

1123 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h

1124 so that the makefiles see the new version number.

1125

1126 *** data files & enums & parser code

1127

1128 * download files

1129 - mkdir -p $UNICODE_DATA

1130 - download Unicode files into $UNICODE_DATA

1131 + subfolders: emoji, idna, security, ucd, uca

1132 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

1133

1134 * for manual diffs and for Unicode Tools input data updates:

1135 remove version suffixes from the file names

1136 ~$ unidata/desuffixucd.py $UNICODE_DATA

1137 (see https://sites.google.com/site/unicodetools/inputdata)

1138

1139 * process and/or copy files

1140 - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC

1141 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

1142 + For debugging, and tweaking how ppucd.txt is written,

1143 the tool has an --only_ppucd option:

1144 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

1145

1146 - cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

1147

1148 * build ICU (make install)

1149 so that the tools build can pick up the new definitions from the installed header files.

1150

1151 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

1152

1153 * preparseucd.py changes

1154 - fix other errors

1155 NameError: unknown property Extended_Pictographic

1156 -> add Extended_Pictographic binary property

1157 -> add new short names for all Emoji properties

1158

1159 * new constants for new property values

1160 - preparseucd.py error:

1161 ValueError: missing uchar.h enum constants for some property values:

1162 [(u'blk', set([u'Georgian_Ext', u'Hanifi_Rohingya', u'Medefaidrin', u'Sogdian', u'Makasar',

1163 u'Old_Sogdian', u'Dogra', u'Gunjala_Gondi', u'Chess_Symbols', u'Mayan_Numerals',

1164 u'Indic_Siyaq_Numbers'])),

1165 (u'jg', set([u'Hanifi_Rohingya_Kinna_Ya', u'Hanifi_Rohingya_Pa'])),

1166 (u'sc', set([u'Medf', u'Sogd', u'Dogr', u'Rohg', u'Maka', u'Sogo', u'Gong'])),

1167 (u'GCB', set([u'LinkC', u'Virama'])),

1168 (u'WB', set([u'WSegSpace']))]

1169 = PropertyValueAliases.txt new property values (diff old & new .txt files)

1170 blk; Chess_Symbols ; Chess_Symbols

1171 blk; Dogra ; Dogra

1172 blk; Georgian_Ext ; Georgian_Extended

1173 blk; Gunjala_Gondi ; Gunjala_Gondi

1174 blk; Hanifi_Rohingya ; Hanifi_Rohingya

1175 blk; Indic_Siyaq_Numbers ; Indic_Siyaq_Numbers

1176 blk; Makasar ; Makasar

1177 blk; Mayan_Numerals ; Mayan_Numerals

1178 blk; Medefaidrin ; Medefaidrin

1179 blk; Old_Sogdian ; Old_Sogdian

1180 blk; Sogdian ; Sogdian

1181 -> add to uchar.h

1182 use long property names for enum constants,

1183 for the trailing comment get the block start code point: diff old & new Blocks.txt

1184 -> add to UCharacter.UnicodeBlock IDs

1185 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)

1186 replace public static final int \1_ID = \2; \3

1187 -> add to UCharacter.UnicodeBlock objects

1188 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)

1189 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
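The two Eclipse find/replace steps can also be reproduced outside the IDE. The sketch below uses one three-group pattern for both replacements (the second Eclipse pattern leaves the number ungrouped); the function names and the sample enum line in the usage note are invented for illustration.

```python
import re

# One uchar.h enum line: UBLOCK_<NAME> = <number>, /*[start code point]*/
_UBLOCK_RE = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")


def to_java_block_id(line):
    """Rewrite a uchar.h enum line as a UCharacter.UnicodeBlock ID constant."""
    return _UBLOCK_RE.sub(
        r"public static final int \g<1>_ID = \g<2>; \g<3>", line)


def to_java_block_object(line):
    """Rewrite a uchar.h enum line as a UnicodeBlock object declaration."""
    return _UBLOCK_RE.sub(
        r'public static final UnicodeBlock \g<1> = '
        r'new UnicodeBlock("\g<1>", \g<1>_ID); \g<3>', line)
```

For a made-up input line "UBLOCK_EXAMPLE_BLOCK = 999, /*[12345]*/", the first function gives "public static final int EXAMPLE_BLOCK_ID = 999; /*[12345]*/". Lines that do not match the pattern are returned unchanged.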

1190

1191 GCB; LinkC ; LinkingConsonant

1192 GCB; Virama ; Virama

1193 -> uchar.h & UCharacter.GraphemeClusterBreak

1194 -> these two later removed again: http://www.unicode.org/L2/L2018/18115.htm#155-A76

1195

1196 InSC; Consonant_Initial_Postfixed ; Consonant_Initial_Postfixed

1197 -> ignore: ICU does not yet support this property

1198

1199 jg ; Hanifi_Rohingya_Kinna_Ya ; Hanifi_Rohingya_Kinna_Ya

1200 jg ; Hanifi_Rohingya_Pa ; Hanifi_Rohingya_Pa

1201 -> uchar.h & UCharacter.JoiningGroup

1202

1203 sc ; Dogr ; Dogra

1204 sc ; Gong ; Gunjala_Gondi

1205 sc ; Maka ; Makasar

1206 sc ; Medf ; Medefaidrin

1207 sc ; Rohg ; Hanifi_Rohingya

1208 sc ; Sogd ; Sogdian

1209 sc ; Sogo ; Old_Sogdian

1210 -> uscript.h & com.ibm.icu.lang.UScript

1211 -> Nushu had been added already

1212 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()

1213 and in com.ibm.icu.dev.test.lang.TestUScript.java

1214

1215 WB ; WSegSpace ; WSegSpace

1216 -> uchar.h & UCharacter.WordBreak

1217

1218 * New short names for emoji properties

1219 - see UTS #51

1220 - short names set in preparseucd.py

1221

1222 * New properties

1223 - boolean emoji property Extended_Pictographic

1224 -> added in preparseucd.py

1225 -> uchar.h & UProperty.java

1226 - misc. property Equivalent_Unified_Ideograph (EqUIdeo)

1227 as shown in PropertyValueAliases.txt

1228 -> ignore for now

1229

1230 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata

1231 (not strictly necessary for NOT_ENCODED scripts)

1232 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

1233

1234 * update spoof checker UnicodeSet initializers:

1235 inclusionPat & recommendedPat in uspoof.cpp

1236 INCLUSION & RECOMMENDED in SpoofChecker.java

1237 - make sure that the Unicode Tools tree contains the latest security data files

1238 - go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator

1239 - update the hardcoded version number there in the DIRECTORY path

1240 - run the tool (no special environment variables needed)

1241 - copy & paste from the Console output into the .cpp & .java files

1242

1243 * generate normalization data files

1244 cd $ICU_ROOT/dbg/icu4c

1245 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource

1246 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt

1247 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt

1248 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt

1249 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt

1250

1251 * build ICU (make install)

1252 so that the tools build can pick up the new definitions from the installed header files.

1253

1254 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

1255

1256 * build Unicode tools using CMake+make

1257

1258 $ICU_SRC/tools/unicode/c/icudefs.txt:

1259

1260 # Location (--prefix) of where ICU was installed.

1261 set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)

1262 # Location of the ICU4C source tree.

1263 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c)

1264

1265 $ICU_ROOT/dbg$

1266 mkdir -p tools/unicode/c

1267 cd tools/unicode/c

1268

1269 $ICU_ROOT/dbg/tools/unicode/c$

1270 cmake ../../../../src/tools/unicode/c

1271 make

1272

1273 * generate core properties data files

1274 $ICU_ROOT/dbg/tools/unicode/c$

1275 genprops/genprops $ICU_SRC/icu4c

1276 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c

1277 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c

1278 - rebuild ICU (make install) & tools

1279

1280 * Fix case props

1281 genprops error: casepropsbuilder: too many exceptions words

1282 genprops error: failure finalizing the data - U_BUFFER_OVERFLOW_ERROR

1283 - With the addition of Georgian Mtavruli capital letters,

1284 there are now too many simple case mappings with big mapping deltas

1285 that yield uncompressible exceptions.

1286 - Changing the data structure (now formatVersion 4),

1287 adding one bit for no-simple-case-folding (for Cherokee), and

1288 one optional slot for a big delta (for most faraway mappings),

1289 together with another bit for whether that is negative.

1290 This makes most Cherokee & Georgian etc. case mappings compressible,

1291 reducing the number of exceptions words.

1292 - Further changes to gain one more bit for the exceptions index,

1293 for future growth. For details, see casepropsbuilder.cpp.

1294

1295 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to

1296 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)

1297 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters

1298 - Unicode 6.0..11.0: U+2260, U+226E, U+226F

1299 - nothing new in this Unicode version, no test file to update

1300

1301 * run & fix ICU4C tests

1302 - Andy handles RBBI & spoof check test failures

1303

1304 - Errors in char.txt, word.txt, word_POSIX.txt like

1305 createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET" at line 46, column 16

1306 because \p{Grapheme_Cluster_Break = EBG} and \p{Word_Break = EBG} are empty.

1307 -> Temporary(!) workaround: Add an arbitrary code point to these sets to make them

1308 not empty, just to get ICU building.

1309 -> Intermediate workaround: Remove $E_Base_GAZ and other now-unused variables

1310 and properties together with the rules that used them (GB 10, WB 14).

1311 -> Andy adjusts the rule sets further to sync with

1312 Unicode 11 grapheme, word, and line break spec changes.

1313

1314 * collation: CLDR collation root, UCA DUCET

1315

1316 - UCA DUCET goes into Mark's Unicode tools, see

1317 https://sites.google.com/site/unicodetools/home#TOC-UCA

1318 diff the main mapping file, look for bad changes

1319 (for example, more bytes per weight for common characters)

1320 ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/uca/11.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-11.txt

1321 ~/svn.unitools/trunk$ meld ../frac-10.txt ../frac-11.txt

1322

1323 - CLDR root data files are checked into $CLDR_SRC/common/uca/

1324 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

1325

1326 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt

1327 cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt

1328 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt

1329 cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt

1330 (note removing the underscore before "Rules")

1331 cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt

1332 - restore TODO diffs in UCARules.txt

1333 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt

1334 - update (ICU4C)/source/test/testdata/CollationTest_*.txt

1335 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt

1336 from the CLDR root files (..._CLDR_..._SHORT.txt)

1337 cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt

1338 cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt

1339 cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data

1340 - if CLDR common/uca/unihan-index.txt changes, then update

1341 CLDR common/collation/root.xml <collation type="private-unihan">

1342 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

1343

1344 - run genuca, see command line above;

1345 deal with

1346 Error: Unknown script for first-primary sample character U+1180B on line 28649 of /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt:

1347 FDD1 1180B; [71 CC 02, 05, 05] # Dogra first primary (compressible)

1348 (add the character to genuca.cpp sampleCharsToScripts[])

1349 + look up the USCRIPT_ code for the new sample characters

1350 (should be obvious from the comment in the error output)

1351 + *add* mappings to sampleCharsToScripts[], do not replace them

1352 (in case the script sample characters flip-flop)

1353 + insert new scripts in DUCET script order, see the top_byte table

1354 at the beginning of FractionalUCA.txt

1355 - rebuild ICU4C
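
A small helper sketch for the genuca error above, assuming the FractionalUCA.txt first-primary lines keep the "FDD1 <sample>; [...] # <Script> first primary" shape shown in the error output. Both function names are hypothetical, and the USCRIPT_ constant is only a naive guess that still needs to be checked against uscript.h.

```python
import re

# Matches lines such as:
# FDD1 1180B; [71 CC 02, 05, 05] # Dogra first primary (compressible)
_FIRST_PRIMARY = re.compile(r'^FDD1 ([0-9A-F]{4,6});.*#\s*(\S+) first primary')


def parse_first_primary(line):
    """Return (sample code point, script name from the comment), or None."""
    match = _FIRST_PRIMARY.match(line)
    if match is None:
        return None
    return int(match.group(1), 16), match.group(2)


def guess_uscript_constant(script_name):
    """Naive guess at the enum constant name; verify it in uscript.h."""
    return 'USCRIPT_' + script_name.upper()
```

The parsed pair is what gets added to sampleCharsToScripts[] in genuca.cpp; existing entries stay in place.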

1356

1357 * Unihan collators

1358 https://sites.google.com/site/unicodetools/unihan

1359 - run Unicode Tools

1360 org.unicode.draft.GenerateUnihanCollators

1361 with VM arguments

1362 -ea

1363 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk

1364 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools

1365 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data

1366 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni

1367 -DUVERSION=11.0.0

1368 - run Unicode Tools

1369 org.unicode.draft.GenerateUnihanCollatorFiles

1370 with the same arguments

1371 - check CLDR diffs

1372 cd $CLDR_SRC

1373 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml

1374 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml

1375 - copy to CLDR

1376 cd $CLDR_SRC

1377 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml

1378 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml

1379 - run CLDR unit tests, commit to CLDR

1380 - generate ICU zh collation data: run CLDR

1381 org.unicode.cldr.icu.NewLdml2IcuConverter

1382 with program arguments

1383 -t collation

1384 -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation

1385 -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental

1386 -d /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/coll

1387 -p /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/xml/collation

1388 zh

1389 and VM arguments

1390 -ea

1391 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni

1392 - rebuild ICU4C

1393

1394 * run & fix ICU4C tests, now with new CLDR collation root data

1395 - run all tests with the collation test data *_SHORT.txt or the full files

1396 (the full ones have comments, useful for debugging)

1397 - note on intltest: if collate/UCAConformanceTest fails, then

1398 utility/MultithreadTest/TestCollators will fail as well;

1399 fix the conformance test before looking into the multi-thread test

1400

1401 * update Java data files

1402 - refresh just the UCD/UCA-related/derived files, just to be safe

1403 - see (ICU4C)/source/data/icu4j-readme.txt

1404 - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

1405 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

1406 output:

1407 ...

1408 Unicode .icu files built to ./out/build/icudt61l

1409 echo timestamp > uni-core-data

1410 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt61b

1411 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b

1412 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt

1413 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt61l.dat ./out/icu4j/icudt61b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt61l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt61b

1414 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b"

1415 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt61b/

1416 mkdir -p /tmp/icu4j/main/shared/data

1417 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data

1418 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt61b/

1419 mkdir -p /tmp/icu4j/main/shared/data

1420 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data

1421 make[1]: Leaving directory '/usr/local/google/home/mscherer/svn.icu/uni/dbg/icu4c/data'

1422 - copy the big-endian Unicode data files to another location,

1423 separate from the other data files,

1424 and then refresh ICU4J

1425 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j

1426 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll

1427 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr

1428 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

1429 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

1430 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu

1431 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

1432 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll

1433 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr

1434 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
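
The cp/rm sequence above selects only the Unicode-related files from the big-endian output folder. A sketch of the same selection rule in Python, for checking a file list before running jar; the function and constant names are hypothetical, and paths are relative to com/ibm/icu/impl/data/$ICUDT.

```python
from fnmatch import fnmatchcase

TOP_LEVEL_PATTERNS = ('confusables.cfu', '*.icu', '*.nrm')
WHOLE_DIRS = ('coll', 'brkitr')
EXCLUDED = ('cnvalias.icu',)  # removed again after the *.icu copy


def select_unicode_data_files(relative_paths):
    """Return the paths that the copy steps above would refresh."""
    selected = []
    for path in relative_paths:
        head, sep, tail = path.partition('/')
        if sep:
            # coll/* and brkitr/* are copied wholesale, one level deep.
            if head in WHOLE_DIRS and '/' not in tail:
                selected.append(path)
        elif path not in EXCLUDED and any(
                fnmatchcase(path, pattern) for pattern in TOP_LEVEL_PATTERNS):
            selected.append(path)
    return selected
```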

1435

1436 * When refreshing all of ICU4J data from ICU4C

1437 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

1438 - cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data

1439 or

1440 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

1441

1442 * update CollationFCD.java

1443 + copy & paste the initializers of lcccIndex[] etc. from

1444 ICU4C/source/i18n/collationfcd.cpp to

1445 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

1446

1447 * refresh Java test .txt files

1448 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

1449 cd $ICU_SRC/icu4c/source/data/unidata

1450 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

1451 cd ../../test/testdata

1452 cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

1453 cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

1454

1455 * run & fix ICU4J tests

1456

1457 *** API additions

1458 - send notice to icu-design about new born-@stable API (enum constants etc.)

1459

1460 *** CLDR numbering systems

1461 - look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR

1462 Unicode 11: using Unicode 11 CLDR ticket #10978

1463 rohg 10D30..10D39 Hanifi_Rohingya

1464 gong 11DA0..11DA9 Gunjala_Gondi

1465 Earlier: CLDR tickets specific to adding new numbering systems.

1466 Unicode 10: http://unicode.org/cldr/trac/ticket/10219

1467 Unicode 9: http://unicode.org/cldr/trac/ticket/9692
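
The gc=ND & nv=4 search works because each set of decimal digits contains exactly one character with the value 4. A sketch using Python's unicodedata module; the function name is hypothetical, and the result reflects the Unicode version of the Python build (unicodedata.unidata_version), not the UCD files being integrated.

```python
import sys
import unicodedata


def decimal_digit_set_starts():
    """Return the code point of digit zero for every set of decimal digits."""
    starts = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # One hit per digit set: general category Nd and decimal value 4.
        if unicodedata.category(ch) == 'Nd' and unicodedata.decimal(ch, None) == 4:
            starts.append(cp - 4)
    return starts
```

Comparing the output of two Python builds with different Unicode versions gives a quick list of candidate numbering systems to check against CLDR.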

1468

1469 *** merge the Unicode update branches back onto the trunk

1470 - do not merge the icudata.jar and testdata.jar,

1471 instead rebuild them from merged & tested ICU4C

1472 - make sure that changes to Unicode tools are checked in:

1473 http://www.unicode.org/utility/trac/log/trunk/unicodetools

1474

1475 ---------------------------------------------------------------------------- ***

1476

1477 Unicode 10.0 update for ICU 60

1478

1479 http://www.unicode.org/versions/Unicode10.0.0/

1480 http://www.unicode.org/versions/beta-10.0.0.html

1481 http://blog.unicode.org/2017/03/unicode-100-beta-review.html

1482 http://www.unicode.org/review/pri350/

1483 http://www.unicode.org/reports/uax-proposed-updates.html

1484 http://www.unicode.org/reports/tr44/tr44-19.html

1485

1486 * Command-line environment setup

1487

1488 UNICODE_DATA=~/unidata/uni10/20170605

1489 CLDR_SRC=~/svn.cldr/uni10

1490 ICU_ROOT=~/svn.icu/uni10

1491 ICU_SRC=$ICU_ROOT/src

1492 ICUDT=icudt60b

1493 ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in

1494 ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata

1495 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib

1496

1497 *** ICU Trac

1498

1499 - ticket:12985: Unicode 10

1500 - ticket:13061: undo hacks from emoji 5.0 update

1501 - ticket:13062: add Emoji_Component property

1502 - ^/branches/markus/uni10

1503

1504 *** CLDR Trac

1505

1506 - cldrbug 10055: Unicode 10

1507 - cldrbug 9882: Unicode 10 script metadata

1508 - cldrbug 10219: numbering systems for Unicode 10

1509

1510 *** Unicode version numbers

1511 - makedata.mak

1512 - uchar.h

1513 - com.ibm.icu.util.VersionInfo

1514 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

1515

1516 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h

1517 so that the makefiles see the new version number.

1518

1519 *** data files & enums & parser code

1520

1521 * download files

1522 - mkdir -p $UNICODE_DATA

1523 - download Unicode 10.0 files into $UNICODE_DATA

1524 + subfolders: ucd, uca, idna, security

1525 + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

1526 - download emoji 5.0 files into $UNICODE_DATA/emoji

1527

1528 * for manual diffs: remove version suffixes from the file names

1529 ~$ unidata/desuffixucd.py $UNICODE_DATA

1530 (see https://sites.google.com/site/unicodetools/inputdata)
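
This is not the real desuffixucd.py, just a sketch of the renaming rule. It assumes versioned names of the form Name-<version>[d<draft>].ext, for example UnicodeData-10.0.0d1.txt; the function name and regular expression are hypothetical.

```python
import re

# stem, "-", dotted version, optional draft number, then the file extension
_VERSION_SUFFIX = re.compile(r'^(.+?)-\d+(?:\.\d+)*(?:d\d+)?(\.[A-Za-z0-9]+)$')


def strip_version_suffix(name):
    """Return the file name without its version suffix, or unchanged."""
    match = _VERSION_SUFFIX.match(name)
    if match is None:
        return name
    return match.group(1) + match.group(2)
```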

1531

1532 * process and/or copy files

1533 - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC

1534 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

1535 + For debugging, and tweaking how ppucd.txt is written,

1536 the tool has an --only_ppucd option:

1537 py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile

1538

1539 - cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA

1540

1541 * build ICU (make install)

1542 so that the tools build can pick up the new definitions from the installed header files.

1543

1544 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

1545

1546 * preparseucd.py changes

1547 - remove or add new Unicode scripts from/to the

1548 only-in-ISO-15924 list according to the error messages:

1549 ValueError: remove ['Nshu'] from _scripts_only_in_iso15924

1550 -> adjust _scripts_only_in_iso15924 as indicated

1551 - fix other errors

1552 Exception: no default values (@missing lines) for some Catalog or Enumerated properties: [u'vo']

1553 -> add vo=Vertical_Orientation to _ignored_properties

1554 -> later removed again: the file is now parsed, even though we do not yet store its data for runtime use
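
For the _scripts_only_in_iso15924 adjustment above, a sketch of the kind of check that could produce the "remove [...]" message. This is an assumption about the logic, not a copy of preparseucd.py; both function names are hypothetical.

```python
def stale_only_in_iso_scripts(encoded_scripts, only_in_iso):
    """Return codes listed as ISO-15924-only that Unicode now encodes."""
    return sorted(set(only_in_iso) & set(encoded_scripts))


def removal_message(stale):
    """Format the hint in the style of the error shown in this log."""
    return 'remove %s from _scripts_only_in_iso15924' % stale
```

Here encoded_scripts would come from the sc lines of PropertyValueAliases.txt, and only_in_iso is the hand-maintained list in the tool.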

1555

1556 * new constants for new property values

1557 - preparseucd.py error:

1558 ValueError: missing uchar.h enum constants for some property values:

1559 [(u'blk', set([u'Zanabazar_Square', u'Nushu', u'CJK_Ext_F',

1560 u'Kana_Ext_A', u'Syriac_Sup', u'Masaram_Gondi', u'Soyombo'])),

1561 (u'jg', set([u'Malayalam_Bha', u'Malayalam_Llla', u'Malayalam_Nya', u'Malayalam_Lla',

1562 u'Malayalam_Nga', u'Malayalam_Ssa', u'Malayalam_Tta', u'Malayalam_Ra',

1563 u'Malayalam_Nna', u'Malayalam_Ja', u'Malayalam_Nnna'])),

1564 (u'sc', set([u'Soyo', u'Gonm', u'Zanb']))]

1565 = PropertyValueAliases.txt new property values (diff old & new .txt files)

1566 blk; CJK_Ext_F ; CJK_Unified_Ideographs_Extension_F

1567 blk; Kana_Ext_A ; Kana_Extended_A

1568 blk; Masaram_Gondi ; Masaram_Gondi

1569 blk; Nushu ; Nushu

1570 blk; Soyombo ; Soyombo

1571 blk; Syriac_Sup ; Syriac_Supplement

1572 blk; Zanabazar_Square ; Zanabazar_Square

1573 -> add to uchar.h

1574 use long property names for enum constants,

1575 for the trailing comment get the block start code point: diff old & new Blocks.txt

1576 -> add to UCharacter.UnicodeBlock IDs

1577 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)

1578 replace public static final int \1_ID = \2; \3

1579 -> add to UCharacter.UnicodeBlock objects

1580 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)

1581 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

1582

1583 jg ; Malayalam_Bha ; Malayalam_Bha

1584 jg ; Malayalam_Ja ; Malayalam_Ja

1585 jg ; Malayalam_Lla ; Malayalam_Lla

1586 jg ; Malayalam_Llla ; Malayalam_Llla

1587 jg ; Malayalam_Nga ; Malayalam_Nga

1588 jg ; Malayalam_Nna ; Malayalam_Nna

1589 jg ; Malayalam_Nnna ; Malayalam_Nnna

1590 jg ; Malayalam_Nya ; Malayalam_Nya

1591 jg ; Malayalam_Ra ; Malayalam_Ra

1592 jg ; Malayalam_Ssa ; Malayalam_Ssa

1593 jg ; Malayalam_Tta ; Malayalam_Tta

1594 -> uchar.h & UCharacter.JoiningGroup

1595

1596 sc ; Gonm ; Masaram_Gondi

1597 sc ; Nshu ; Nushu

1598 sc ; Soyo ; Soyombo

1599 sc ; Zanb ; Zanabazar_Square

1600 -> uscript.h & com.ibm.icu.lang.UScript

1601 -> Nushu had been added already

1602 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()

1603 and in com.ibm.icu.dev.test.lang.TestUScript.java
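
The two Eclipse find/replace steps for UCharacter.UnicodeBlock above can also be run outside the IDE. A sketch with Python's re module; it uses one pattern with three groups for both replacements, so the trailing comment is \3 in both. The function names are hypothetical, and the sample value in the usage note is illustrative.

```python
import re

# Same shape as the Eclipse pattern: name, numeric value, trailing comment.
UBLOCK_LINE = re.compile(r'UBLOCK_([^ ]+) = ([0-9]+), (/.+)')


def to_block_id(line):
    """Turn a uchar.h UBLOCK_ enum line into a Java *_ID constant."""
    return UBLOCK_LINE.sub(r'public static final int \1_ID = \2; \3', line)


def to_block_object(line):
    """Turn a uchar.h UBLOCK_ enum line into a Java UnicodeBlock object."""
    return UBLOCK_LINE.sub(
        r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \3',
        line)
```

For example, to_block_id('UBLOCK_NUSHU = 277, /*[1B170]*/') yields 'public static final int NUSHU_ID = 277; /*[1B170]*/'.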

1604

1605 * New properties as shown in PropertyValueAliases.txt changes

1606 - boolean Emoji_Component from emoji 5

1607 -> uchar.h & UProperty.java

1608 - boolean

1609 # Regional_Indicator (RI)

1610

1611 RI ; N ; No ; F ; False

1612 RI ; Y ; Yes ; T ; True

1613 -> uchar.h & UProperty.java

1614 -> single immutable range, to be hardcoded

1615 - boolean

1616 # Prepended_Concatenation_Mark (PCM)

1617

1618 PCM; N ; No ; F ; False

1619 PCM; Y ; Yes ; T ; True

1620 -> was new in Unicode 9

1621 -> uchar.h & UProperty.java

1622 - enumerated

1623 # Vertical_Orientation (vo)

1624

1625 vo ; R ; Rotated

1626 vo ; Tr ; Transformed_Rotated

1627 vo ; Tu ; Transformed_Upright

1628 vo ; U ; Upright

1629 -> only pre-parsed for now, but not yet stored for runtime use

1630

1631 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata

1632 (not strictly necessary for NOT_ENCODED scripts)

1633 $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt

1634

1635 * generate normalization data files

1636 cd $ICU_ROOT/dbg/icu4c

1637 bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource

1638 bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt

1639 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt

1640 bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt

1641 bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt
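
The .nrm commands above differ only in the output file and in the ordered list of input files, each list starting with nfc.txt. A sketch that restates them as a table; it uses only the options shown in this log, omits the --csource variant, and its function and constant names are hypothetical.

```python
# Output file in $ICU4C_DATA_IN and the input files, in command-line order.
NORM2_BUILDS = [
    ('nfc.nrm', ['nfc.txt']),
    ('nfkc.nrm', ['nfc.txt', 'nfkc.txt']),
    ('nfkc_cf.nrm', ['nfc.txt', 'nfkc.txt', 'nfkc_cf.txt']),
    ('uts46.nrm', ['nfc.txt', 'uts46.txt']),
]


def gennorm2_commands(data_in, unidata):
    """Return one argument list per .nrm file, ready for subprocess.run()."""
    return [
        ['bin/gennorm2', '-o', f'{data_in}/{output}',
         '-s', f'{unidata}/norm2', *inputs]
        for output, inputs in NORM2_BUILDS
    ]
```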

1642

1643 * build ICU (make install)

1644 so that the tools build can pick up the new definitions from the installed header files.

1645

1646 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

1647

1648 * build Unicode tools using CMake+make

1649

1650 $ICU_SRC/tools/unicode/c/icudefs.txt:

1651

1652 # Location (--prefix) of where ICU was installed.

1653 set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)

1654 # Location of the ICU4C source tree.

1655 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c)

1656

1657 $ICU_ROOT/dbg/tools/unicode/c$

1658 cmake ../../../../src/tools/unicode/c

1659 make

1660

1661 * generate core properties data files

1662 $ICU_ROOT/dbg/tools/unicode/c$

1663 genprops/genprops $ICU_SRC/icu4c

1664 genuca/genuca --hanOrder implicit $ICU_SRC/icu4c

1665 genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c

1666 - rebuild ICU (make install) & tools

1667

1668 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to

1669 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)

1670 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters

1671 - Unicode 6.0..10.0: U+2260, U+226E, U+226F

1672 - nothing new in this Unicode version, no test file to update

1673

1674 * run & fix ICU4C tests

1675 - Andy handles RBBI & spoof check test failures

1676

1677 * collation: CLDR collation root, UCA DUCET

1678

1679 - UCA DUCET goes into Mark's Unicode tools, see

1680 https://sites.google.com/site/unicodetools/home#TOC-UCA

1681 - CLDR root data files are checked into $CLDR_SRC/common/uca/

1682 cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/

1683

1684 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt

1685 cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt

1686 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt

1687 cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt

1688 (note removing the underscore before "Rules")

1689 cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt

1690 - restore TODO diffs in UCARules.txt

1691 meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt

1692 - update (ICU4C)/source/test/testdata/CollationTest_*.txt

1693 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt

1694 from the CLDR root files (..._CLDR_..._SHORT.txt)

1695 cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt

1696 cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt

1697 cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data

1698 - if CLDR common/uca/unihan-index.txt changes, then update

1699 CLDR common/collation/root.xml <collation type="private-unihan">

1700 and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt

1701

1702 - run genuca, see command line above;

1703 deal with

1704 Error: Unknown script for first-primary sample character U+11D10 on line 28117 of /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/unidata/FractionalUCA.txt:

1705 FDD1 11D10; [70 D5 02, 05, 05] # Masaram_Gondi first primary (compressible)

1706 (add the character to genuca.cpp sampleCharsToScripts[])

1707 + look up the USCRIPT_ code for the new sample characters

1708 (should be obvious from the comment in the error output)

1709 + *add* mappings to sampleCharsToScripts[], do not replace them

1710 (in case the script sample characters flip-flop)

1711 + insert new scripts in DUCET script order, see the top_byte table

1712 at the beginning of FractionalUCA.txt

1713 - rebuild ICU4C

1714

1715 * Unihan collators

1716 https://sites.google.com/site/unicodetools/unihan

1717 - run Unicode Tools

1718 org.unicode.draft.GenerateUnihanCollators

1719 with VM arguments

1720 -ea

1721 -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk

1722 -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools

1723 -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data

1724 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10

1725 -DUVERSION=10.0.0

1726 - run Unicode Tools

1727 org.unicode.draft.GenerateUnihanCollatorFiles

1728 with the same arguments

1729 - check CLDR diffs

1730 cd $CLDR_SRC

1731 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml

1732 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml

1733 - copy to CLDR

1734 cd $CLDR_SRC

1735 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml

1736 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml

1737 - run CLDR unit tests, commit to CLDR

1738 - generate ICU zh collation data: run CLDR

1739 org.unicode.cldr.icu.NewLdml2IcuConverter

1740 with program arguments

1741 -t collation

1742 -s /usr/local/google/home/mscherer/svn.cldr/uni10/common/collation

1743 -m /usr/local/google/home/mscherer/svn.cldr/uni10/common/supplemental

1744 -d /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/coll

1745 -p /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/xml/collation

1746 zh

1747 and VM arguments

1748 -ea

1749 -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10

1750 - rebuild ICU4C

1751

1752 * run & fix ICU4C tests, now with new CLDR collation root data

1753 - run all tests with the collation test data *_SHORT.txt or the full files

1754 (the full ones have comments, useful for debugging)

1755 - note on intltest: if collate/UCAConformanceTest fails, then

1756 utility/MultithreadTest/TestCollators will fail as well;

1757 fix the conformance test before looking into the multi-thread test

1758

1759 * update Java data files

1760 - refresh just the UCD/UCA-related/derived files, just to be safe

1761 - see (ICU4C)/source/data/icu4j-readme.txt

1762 - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

1763 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

1764 output:

1765 ...

1766 Unicode .icu files built to ./out/build/icudt60l

1767 echo timestamp > uni-core-data

1768 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt60b

1769 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b

1770 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt

1771 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt60l.dat ./out/icu4j/icudt60b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt60l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt60b

1772 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b"

1773 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt60b/

1774 mkdir -p /tmp/icu4j/main/shared/data

1775 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data

1776 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt60b/

1777 mkdir -p /tmp/icu4j/main/shared/data

1778 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data

1779 make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/uni10/dbg/icu4c/data'

1780 - copy the big-endian Unicode data files to another location,

1781 separate from the other data files,

1782 and then refresh ICU4J

1783 cd $ICU_ROOT/dbg/icu4c/data/out/icu4j

1784 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll

1785 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr

1786 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

1787 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

1788 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu

1789 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

1790 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll

1791 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr

1792 jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

1793

1794 * When refreshing all of ICU4J data from ICU4C

1795 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

1796 - cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data

1797 or

1798 - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install

1799

1800 * update CollationFCD.java

1801 + copy & paste the initializers of lcccIndex[] etc. from

1802 ICU4C/source/i18n/collationfcd.cpp to

1803 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

1804

1805 * refresh Java test .txt files

1806 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

1807 cd $ICU_SRC/icu4c/source/data/unidata

1808 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

1809 cd ../../test/testdata

1810 cp BidiCharacterTest.txt BidiTest.txt IdnaTest.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

1811 cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

1812

1813 * run & fix ICU4J tests

1814

1815 *** API additions

1816 - send notice to icu-design about new born-@stable API (enum constants etc.)

1817

1818 *** CLDR numbering systems

1819 - look for new sets of decimal digits (gc=ND & nv=4) and submit a CLDR ticket

1820 Unicode 10: http://unicode.org/cldr/trac/ticket/10219

1821 Unicode 9: http://unicode.org/cldr/trac/ticket/9692

1822

1823 *** merge the Unicode update branches back onto the trunk

1824 - do not merge the icudata.jar and testdata.jar,

1825 instead rebuild them from merged & tested ICU4C

1826 - make sure that changes to Unicode tools are checked in:

1827 http://www.unicode.org/utility/trac/log/trunk/unicodetools

1828

1829 ---------------------------------------------------------------------------- ***

1830

1831 Emoji 5.0 update for ICU 59

1832 - ICU 59 mostly remains on Unicode 9.0

1833 - except that it updates bidi and segmentation data to Unicode 10 beta

1834

1835 First run of tools on combined icu4c/icu4j/tools trunk after svn repository reorg.

1836

1837 * Command-line environment setup

1838

1839 ICU_ROOT=~/svn.icu/trunk

1840 ICU_SRC_DIR=$ICU_ROOT/src

1841 ICU4C_SRC_DIR=$ICU_SRC_DIR/icu4c

1842 ICUDT=icudt59b

1843 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib

1844 SRC_DATA_IN=$ICU4C_SRC_DIR/source/data/in

1845 UNIDATA=$ICU4C_SRC_DIR/source/data/unidata

1846

1847 *** ICU Trac

1848

1849 - ticket:12900: take Emoji 5.0 properties data into ICU 59 once it's released

1850 - changes directly on trunk

1851

1852 *** data files & enums & parser code

1853

1854 * download files

1855

1856 - download Unicode 9.0 files into a uni90e50 folder: ucd, idna, security (skip uca)

1857 - download emoji 5.0 beta files into the same uni90e50 folder

1858 - download Unicode 10.0 beta files: ucd

1859 + copy Unicode 10 bidi files to the uni90e50/ucd folder:

1860 BidiBrackets.txt

1861 BidiCharacterTest.txt

1862 BidiMirroring.txt

1863 BidiTest.txt

1864 extracted/DerivedBidiClass.txt

1865 + copy Unicode 10 segmentation files to the uni90e50/ucd folder:

1866 LineBreak.txt

1867 auxiliary/*
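
The selective copy above (Unicode 10 bidi and segmentation files on top of the Unicode 9 ucd folder) can be sketched as follows. The function name and the two directory arguments are hypothetical; the file list is the one given in this log.

```python
import glob
import os
import shutil

BIDI_FILES = [
    'BidiBrackets.txt',
    'BidiCharacterTest.txt',
    'BidiMirroring.txt',
    'BidiTest.txt',
    os.path.join('extracted', 'DerivedBidiClass.txt'),
]


def copy_unicode10_subset(src_ucd, dst_ucd):
    """Copy only the bidi and segmentation files; return their relative paths."""
    names = list(BIDI_FILES) + ['LineBreak.txt']
    # auxiliary/* holds the segmentation property and test files.
    for path in sorted(glob.glob(os.path.join(src_ucd, 'auxiliary', '*'))):
        if os.path.isfile(path):
            names.append(os.path.relpath(path, src_ucd))
    for rel in names:
        target = os.path.join(dst_ucd, rel)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copy2(os.path.join(src_ucd, rel), target)
    return names
```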

1868

1869 * preparseucd.py changes

1870 - adjust for combined trunks

1871 - write new copyright lines

1872 - ignore new Emoji_Component property for now

1873

1874 * process and/or copy files

1875 - ~/svn.icu/trunk/src/tools/unicode$ py/preparseucd.py ~/unidata/uni90e50/20170322 $ICU_SRC_DIR

1876 + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

1877

1878 - cp ~/unidata/uni90e50/20170322/security/confusables.txt $UNIDATA

1879

1880 * build ICU (make install)

1881 so that the tools build can pick up the new definitions from the installed header files.

1882

1883 $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date

1884

1885 * build Unicode tools using CMake+make

1886

1887 ~/svn.icu/trunk/src/tools/unicode/c/icudefs.txt:

1888

1889 # Location (--prefix) of where ICU was installed.

1890 set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c)

1891 # Location of the ICU4C source tree.

1892 set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/trunk/src/icu4c)

1893

1894 ~/svn.icu/trunk/dbg/tools/unicode/c$

1895 cmake ../../../../src/tools/unicode/c

1896 make

1897

1898 * generate core properties data files

1899 ~/svn.icu/trunk/dbg/tools/unicode/c$

1900 genprops/genprops $ICU4C_SRC_DIR

1901 - rebuild ICU (make install) & tools

1902

1903 * run & fix ICU4C tests

1904 - Andy handles RBBI & spoof check test failures

1905

1906 * update Java data files

1907 - refresh just the UCD/UCA-related/derived files, just to be safe

1908 - see (ICU4C)/source/data/icu4j-readme.txt

1909 - mkdir /tmp/icu4j

1910 - ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

1911 output:

1912 ...

1913 Unicode .icu files built to ./out/build/icudt59l

1914 echo timestamp > uni-core-data

1915 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt59b

1916 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b

1917 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt

1918 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt59l.dat ./out/icu4j/icudt59b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt59l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt59b

1919 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b"

1920 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt59b/

1921 mkdir -p /tmp/icu4j/main/shared/data

1922 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data

1923 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt59b/

1924 mkdir -p /tmp/icu4j/main/shared/data

1925 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data

1926 make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/trunk/dbg/icu4c/data'

1927 - copy the big-endian Unicode data files to another location,

1928 separate from the other data files,

1929 and then refresh ICU4J

1930 cd ~/svn.icu/trunk/dbg/icu4c/data/out/icu4j

1931 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr

1932 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

1933 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

1934 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu

1935 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr

1936 jar uvf ~/svn.icu/trunk/src/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

1937

1938 * When refreshing all of ICU4J data from ICU4C

1939 - ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

1940 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu/trunk/src/icu4j/main/shared/data

1941 or

1942 - ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=~/svn.icu/trunk/src/icu4j icu4j-data-install

1943

1944 * refresh Java test .txt files

1945 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

1946 cd $ICU4C_SRC_DIR/source/data/unidata

1947 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

1948 cd ../../test/testdata

1949 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

1950 cp ~/unidata/uni90e50/20170322/ucd/CompositionExclusions.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode

1951

1952 * run & fix ICU4J tests

1953

1954 ---------------------------------------------------------------------------- ***

1955

1956 Unicode 9.0 update for ICU 58

1957

* Command-line environment setup

ICU_ROOT=~/svn.icu/trunk
ICU_SRC_DIR=$ICU_ROOT/src
ICUDT=icudt58b
export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib
SRC_DATA_IN=$ICU_SRC_DIR/source/data/in
UNIDATA=$ICU_SRC_DIR/source/data/unidata

1966

1967 http://www.unicode.org/review/pri323/ -- beta review

1968 http://www.unicode.org/reports/uax-proposed-updates.html

1969 http://www.unicode.org/versions/beta-9.0.0.html

1970 http://www.unicode.org/versions/Unicode9.0.0/

1971 http://www.unicode.org/reports/tr44/tr44-17.html

1972

1973 *** ICU Trac

1974

1975 - ticket:12526: integrate Unicode 9

1976 - C++ ^/icu/branches/markus/uni90, ^/icu/branches/markus/uni90b

1977 - Java ^/icu4j/branches/markus/uni90, ^/icu4j/branches/markus/uni90b

1978

1979 *** CLDR Trac

1980

1981 - cldrbug 9414: UCA 9

1982 - ^/branches/markus/uni90 at r11518 from trunk at r11517

1983

1984 - cldrbug 8745: Unicode 9.0 script metadata

1985

1986 *** Unicode version numbers

1987 - makedata.mak

1988 - uchar.h

1989 - com.ibm.icu.util.VersionInfo

1990 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

1991

1992 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h

1993 so that the makefiles see the new version number.

1994

1995 *** data files & enums & parser code

1996

1997 * file preparation

1998

1999 - download UCD & IDNA files

2000 - make sure that the Unicode data folder passed into preparseucd.py

2001 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)

2002 - only for manual diffs: remove version suffixes from the file names

2003 ~/unidata/uni70/20140403$ ../../desuffixucd.py .

2004 (see https://sites.google.com/site/unicodetools/inputdata)

2005 - only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

2006 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni90/20160603 $ICU_SRC_DIR ~/svn.icutools/trunk/src

2007 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

2008

2009 - also: from http://unicode.org/Public/security/9.0.0/ download new confusables.txt

2010 and copy to $UNIDATA

2011 cp ~/unidata/uni90/20160603/security/confusables.txt $UNIDATA
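The version-suffix removal used for manual diffs can be sketched roughly as follows. This is not the actual desuffixucd.py; the suffix shape ("-9.0.0", optionally followed by a draft marker like "d1") and the name strip_version_suffix are assumptions for illustration only.

```python
import re

# Assumed suffix shape: "-<major>.<minor>.<update>", optionally followed by a
# draft marker such as "d1", directly before the file extension.
_SUFFIX = re.compile(r"-\d+\.\d+\.\d+(?:d\d+)?(?=\.[A-Za-z]+$)")


def strip_version_suffix(name):
    """Return the file name without the assumed UCD version suffix."""
    return _SUFFIX.sub("", name)
```

With these assumptions, a name like UnicodeData-9.0.0d1.txt becomes UnicodeData.txt, and names without a suffix are returned unchanged.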

2012

2013 * preparseucd.py changes

2014 - remove or add new Unicode scripts from/to the

2015 only-in-ISO-15924 list according to the error messages:

2016 ValueError: remove ['Tang'] from _scripts_only_in_iso15924

2017 ValueError: sc = Hanb (uchar.h USCRIPT_HAN_WITH_BOPOMOFO) not in the UCD

2018 ValueError: sc = Jamo (uchar.h USCRIPT_JAMO) not in the UCD

2019 ValueError: sc = Zsye (uchar.h USCRIPT_SYMBOLS_EMOJI) not in the UCD

2020 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()

2021 and in com.ibm.icu.dev.test.lang.TestUScript.java

2022 - DerivedNumericValues.txt new numeric values

2023 0D58 ; 0.00625 ; ; 1/160 # No MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH

2024 0D59 ; 0.025 ; ; 1/40 # No MALAYALAM FRACTION ONE FORTIETH

2025 0D5A ; 0.0375 ; ; 3/80 # No MALAYALAM FRACTION THREE EIGHTIETHS

2026 0D5B ; 0.05 ; ; 1/20 # No MALAYALAM FRACTION ONE TWENTIETH

2027 0D5D ; 0.15 ; ; 3/20 # No MALAYALAM FRACTION THREE TWENTIETHS

2028 -> change uprops.h, corepropsbuilder.cpp/encodeNumericValue(),

2029 uchar.c, UCharacterProperty.java

2030 to support a new series of values

2031 - adjust preparseucd.py for Tangut algorithmic names

2032 in ppucd.txt:

2033 algnamesrange;17000..187EC;han;CJK UNIFIED IDEOGRAPH-

2034 ->

2035 algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH-

2036 - avoid block-compressing most String/Miscellaneous property values,

2037 triggered by genprops not coping with a multi-code point Case_Folding on

2038 block;1C80..1C8F;...;Cased;cf=0442;CWCF;...

2039 keep block-compressing empty-string mappings NFKC_CF="" for tags and variation selectors
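For the Tangut algorithmic names above: a minimal sketch of how a name follows from an algnamesrange line in ppucd.txt, assuming that for this range type the name is simply the prefix followed by the uppercase hex code point. The function names are hypothetical and not part of preparseucd.py or genprops.

```python
def parse_algnamesrange(line):
    """Split 'algnamesrange;START..END;TYPE;PREFIX' into its fields."""
    tag, rng, kind, prefix = line.split(";", 3)
    if tag != "algnamesrange":
        raise ValueError("not an algnamesrange line: " + line)
    start, end = (int(x, 16) for x in rng.split(".."))
    return start, end, kind, prefix


def algorithmic_name(line, cp):
    """Build the character name for cp from the range's prefix."""
    start, end, _kind, prefix = parse_algnamesrange(line)
    if not start <= cp <= end:
        raise ValueError("U+%04X is outside %04X..%04X" % (cp, start, end))
    # Assumption: name = prefix + code point in uppercase hex.
    return "%s%04X" % (prefix, cp)
```

With the corrected line, U+17000 yields "TANGUT IDEOGRAPH-17000"; with the old prefix it would have been named like a CJK unified ideograph.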

2040

2041 * PropertyAliases.txt changes

2042 - 1 new property PCM=Prepended_Concatenation_Mark

2043 Ignore: Only useful for layout engines.

2044 Ok to list in ppucd.txt.

2045

2046 * PropertyValueAliases.txt new property values

2047 blk; Adlam ; Adlam

2048 blk; Bhaiksuki ; Bhaiksuki

2049 blk; Cyrillic_Ext_C ; Cyrillic_Extended_C

2050 blk; Glagolitic_Sup ; Glagolitic_Supplement

2051 blk; Ideographic_Symbols ; Ideographic_Symbols_And_Punctuation

2052 blk; Marchen ; Marchen

2053 blk; Mongolian_Sup ; Mongolian_Supplement

2054 blk; Newa ; Newa

2055 blk; Osage ; Osage

2056 blk; Tangut ; Tangut

2057 blk; Tangut_Components ; Tangut_Components

2058 -> add to uchar.h

2059 use long property names for enum constants

2060 -> add to UCharacter.UnicodeBlock IDs

2061 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)

2062 replace public static final int \1_ID = \2; \3

2063 -> add to UCharacter.UnicodeBlock objects

2064 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)

2065 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

2066

2067 GCB; EB ; E_Base

2068 GCB; EBG ; E_Base_GAZ

2069 GCB; EM ; E_Modifier

2070 GCB; GAZ ; Glue_After_Zwj

2071 GCB; ZWJ ; ZWJ

2072 -> uchar.h & UCharacter.GraphemeClusterBreak

2073

2074 jg ; African_Feh ; African_Feh

2075 jg ; African_Noon ; African_Noon

2076 jg ; African_Qaf ; African_Qaf

2077 -> uchar.h & UCharacter.JoiningGroup

2078

2079 lb ; EB ; E_Base

2080 lb ; EM ; E_Modifier

2081 lb ; ZWJ ; ZWJ

2082 -> uchar.h & UCharacter.LineBreak

2083

2084 sc ; Adlm ; Adlam

2085 sc ; Bhks ; Bhaiksuki

2086 sc ; Marc ; Marchen

2087 sc ; Newa ; Newa

2088 sc ; Osge ; Osage

2089 sc ; Tang ; Tangut

2090 -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript

2091

2092 WB ; EB ; E_Base

2093 WB ; EBG ; E_Base_GAZ

2094 WB ; EM ; E_Modifier

2095 WB ; GAZ ; Glue_After_Zwj

2096 WB ; ZWJ ; ZWJ

2097 -> uchar.h & UCharacter.WordBreak
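The two Eclipse find/replace steps for UCharacter.UnicodeBlock listed above can also be tried outside the IDE. A small sketch, assuming the patterns behave the same in Python's re module as in Eclipse for these inputs; the helper names are hypothetical.

```python
import re

# Patterns and replacements copied from the Eclipse steps in this log.
ID_FIND = r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)"
ID_REPLACE = r"public static final int \1_ID = \2; \3"
OBJ_FIND = r"UBLOCK_([^ ]+) = [0-9]+, (/.+)"
OBJ_REPLACE = r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2'


def to_java_id(c_line):
    """Turn a uchar.h UBLOCK_ enum line into a Java int constant."""
    return re.sub(ID_FIND, ID_REPLACE, c_line)


def to_java_object(c_line):
    """Turn a uchar.h UBLOCK_ enum line into a Java UnicodeBlock object."""
    return re.sub(OBJ_FIND, OBJ_REPLACE, c_line)
```

An input line shaped like "UBLOCK_ADLAM = 263, /*[1E900]*/" (the numeric value and comment are illustrative here) turns into an ADLAM_ID constant and an ADLAM object declaration, with the trailing comment carried over.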

2098

2099 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata

2100 (not strictly necessary for NOT_ENCODED scripts)

2101 ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

2102

* generate normalization data files

    cd $ICU_ROOT/dbg
    bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource
    bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt
    bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt
    bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt
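The gennorm2 calls above layer their input files: each .nrm file is built from nfc.txt plus the files specific to that form. A sketch that only assembles the argument lists, using just the options that appear in the commands above; the table and function names are hypothetical, and the --csource call for norm2_nfc_data.h is left out.

```python
# Output name -> input files, in the order used by the commands in this log.
NORM_INPUTS = {
    "nfc": ["nfc.txt"],
    "nfkc": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46": ["nfc.txt", "uts46.txt"],
}


def gennorm2_argv(name, src_data_in, unidata):
    """Build the gennorm2 argument list for one binary .nrm file."""
    return [
        "bin/gennorm2",
        "-o", "%s/%s.nrm" % (src_data_in, name),
        "-s", "%s/norm2" % unidata,
    ] + NORM_INPUTS[name]
```

The lists could be passed to subprocess.run() from $ICU_ROOT/dbg; here they only make the layering of the input files explicit.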

2110

2111 * build ICU (make install)

2112 so that the tools build can pick up the new definitions from the installed header files.

2113

2114 $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 30 out.txt

2115

2116 * build Unicode tools using CMake+make

2117

2118 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

2119

2120 # Location (--prefix) of where ICU was installed.

2121 set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)

2122 # Location of the ICU source tree.

2123 set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)

2124

2125 ~/svn.icutools/trunk/dbg/unicode/c$

2126 cmake ../../../src/unicode/c

2127 make

2128

2129 * generate core properties data files

2130 ~/svn.icutools/trunk/dbg/unicode/c$

2131 genprops/genprops $ICU_SRC_DIR

2132 genuca/genuca --hanOrder implicit $ICU_SRC_DIR

2133 genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR

2134 - rebuild ICU (make install) & tools

2135

2136 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to

2137 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)

2138 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters

2139 - Unicode 6.0..9.0: U+2260, U+226E, U+226F

2140 - nothing new in 9.0, no test file to update
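The grep step above can be sketched as a small filter. It assumes the usual UCD-style layout of IdnaMappingTable.txt (code point or range, then ";"-separated fields, then an optional "#" comment); the function name is hypothetical.

```python
def non_ascii_std3_valid(lines):
    """Return (start, end) ranges above U+007F with status disallowed_STD3_valid."""
    result = []
    for line in lines:
        data = line.split("#", 1)[0]          # drop the trailing comment
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        if end > 0x7F:
            # Clip a range that straddles the ASCII boundary.
            result.append((max(start, 0x80), end))
    return result
```

For Unicode 6.0..9.0 data, the result is expected to cover exactly U+2260, U+226E and U+226F, as noted above; anything else would call for new test cases in uts46test.cpp and UTS46Test.java.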

2141

2142 * run & fix ICU4C tests

2143 - Andy handles RBBI & spoof check test failures

2144

2145 * collation: CLDR collation root, UCA DUCET

2146

2147 - UCA DUCET goes into Mark's Unicode tools, see

2148 https://sites.google.com/site/unicodetools/home#TOC-UCA

2149 - CLDR root data files are checked into (CLDR UCA branch)/common/uca/

2150 cp (UCA generated)/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/

2151

2152 - cd (CLDR UCA branch)/common/uca/

2153 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt

2154 cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt

- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
  (note that the target file name drops the underscore before "Rules");
  save the old file first so that its TODO diffs can be restored
    cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt
    cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt
- restore TODO diffs in UCARules.txt
    meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt

2161 - update (ICU4C)/source/test/testdata/CollationTest_*.txt

2162 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt

2163 from the CLDR root files (..._CLDR_..._SHORT.txt)

2164 cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt

2165 cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt

2166 cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data

2167 - if CLDR common/uca/unihan-index.txt changes, then update

2168 CLDR common/collation/root.xml <collation type="private-unihan">

2169 and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt

2170

2171 - run genuca, see command line above;

2172 deal with

2173 Error: Unknown script for first-primary sample character U+104B5 on line 32599 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt:

2174 FDD1 104B5; [75 B8 02, 05, 05] # Osage first primary (compressible)

2175 (add the character to genuca.cpp sampleCharsToScripts[])

2176 + look up the USCRIPT_ code for the new sample characters

2177 (should be obvious from the comment in the error output)

2178 + *add* mappings to sampleCharsToScripts[], do not replace them

2179 (in case the script sample characters flip-flop)

2180 + insert new scripts in DUCET script order, see the top_byte table

2181 at the beginning of FractionalUCA.txt

2182 - rebuild ICU4C
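For the genuca error above, a sketch of pulling the sample character and the script name out of a first-primary line of FractionalUCA.txt, plus the "add, do not replace" rule for the mapping table. The real table is sampleCharsToScripts[] in genuca.cpp; the Python names and the assumed line shape ("FDD1 <code point>; [...] # <Script> first primary") are for illustration only.

```python
import re

# Assumed line shape, taken from the error output quoted in this log.
_FIRST_PRIMARY = re.compile(r"^FDD1 ([0-9A-Fa-f]{4,6});.*#\s*(.+?) first primary")


def first_primary_sample(line):
    """Return (sample code point, script name from the comment), or None."""
    m = _FIRST_PRIMARY.match(line)
    if m is None:
        return None
    return int(m.group(1), 16), m.group(2)


def add_sample_mapping(table, code_point, script):
    """Add code_point -> script without dropping or overwriting older entries."""
    old = table.setdefault(code_point, script)
    if old != script:
        raise ValueError("U+%04X already maps to %s" % (code_point, old))
    return table
```

Keeping the older entries means that a script whose sample character changes back in a later version still resolves without another code change.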

2183

2184 * Unihan collators

2185 - run Unicode Tools

2186 org.unicode.draft.GenerateUnihanCollators

2187 with VM arguments

2188 -DSVN_WORKSPACE=/home/mscherer/svn.unitools/trunk

2189 -DOTHER_WORKSPACE=/home/mscherer/svn.unitools

2190 -DUCD_DIR=/home/mscherer/svn.unitools/trunk/data

2191 -DCLDR_DIR=/home/mscherer/svn.cldr/trunk

2192 -DUVERSION=9.0.0

2193 -ea

2194 - run Unicode Tools

2195 org.unicode.draft.GenerateUnihanCollatorFiles

2196 with the same arguments

2197 - check CLDR diffs

2198 cd ~/svn.cldr/trunk

2199 meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml

2200 meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml

2201 - copy to CLDR

2202 cd ~/svn.cldr/trunk

2203 cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml

2204 cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml

2205 - commit to CLDR

2206 - generate ICU zh collation data: run CLDR

2207 org.unicode.cldr.icu.NewLdml2IcuConverter

2208 with program arguments

2209 -t collation

2210 -s /home/mscherer/svn.cldr/trunk/common/collation

2211 -m /home/mscherer/svn.cldr/trunk/common/supplemental

2212 -d /home/mscherer/svn.icu/trunk/src/source/data/coll

2213 -p /home/mscherer/svn.icu/trunk/src/source/data/xml/collation

2214 zh

2215 and VM arguments

2216 -DCLDR_DIR=/home/mscherer/svn.cldr/trunk

2217 - rebuild ICU4C

2218

2219 * run & fix ICU4C tests, now with new CLDR collation root data

2220 - run all tests with the collation test data *_SHORT.txt or the full files

2221 (the full ones have comments, useful for debugging)

2222 - note on intltest: if collate/UCAConformanceTest fails, then

2223 utility/MultithreadTest/TestCollators will fail as well;

2224 fix the conformance test before looking into the multi-thread test

2225

2226 * update Java data files

2227 - refresh just the UCD/UCA-related/derived files, just to be safe

2228 - see (ICU4C)/source/data/icu4j-readme.txt

2229 - mkdir /tmp/icu4j

2230 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

2231 output:

2232 ...

2233 Unicode .icu files built to ./out/build/icudt58l

2234 echo timestamp > uni-core-data

2235 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt58b

2236 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b

2237 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt

2238 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt58l.dat ./out/icu4j/icudt58b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt58l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt58b

2239 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b"

2240 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt58b/

2241 mkdir -p /tmp/icu4j/main/shared/data

2242 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data

2243 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt58b/

2244 mkdir -p /tmp/icu4j/main/shared/data

2245 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data

2246 make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'

2247 - copy the big-endian Unicode data files to another location,

2248 separate from the other data files,

2249 and then refresh ICU4J

2250 cd ~/svn.icu/trunk/dbg/data/out/icu4j

2251 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll

2252 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr

2253 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

2254 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

2255 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu

2256 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

2257 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll

2258 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr

2259 jar uvf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT
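The selection made by the cp/rm commands above (before the jar update) can be written down as one filter over relative paths inside the $ICUDT folder. This is only a sketch of which files end up in /tmp/icu4j; the function name is hypothetical and nothing is copied here.

```python
def select_unicode_files(relpaths):
    """Keep the files that the cp/rm steps in this log would leave in place."""
    keep = []
    for rel in relpaths:
        parent, _, name = rel.rpartition("/")
        if parent == "":
            if name == "cnvalias.icu":
                continue                      # removed again after the *.icu copy
            if name == "confusables.cfu" or name.endswith((".icu", ".nrm")):
                keep.append(rel)
        elif parent in ("coll", "brkitr"):
            keep.append(rel)                  # everything directly in these folders
    return keep
```

Locale resource bundles and the time zone files are not selected, so the jar update touches only the UCD/UCA-related data.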

2260

2261 * When refreshing all of ICU4J data from ICU4C

2262 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

2263 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data

2264 or

2265 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

2266

2267 * update CollationFCD.java

2268 + copy & paste the initializers of lcccIndex[] etc. from

2269 ICU4C/source/i18n/collationfcd.cpp to

2270 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

2271

2272 * refresh Java test .txt files

2273 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

2274 cd $ICU_SRC_DIR/source/data/unidata

2275 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

2276 cd ../../test/testdata

2277 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

2278 cp ~/unidata/uni90/20160603/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

2279

2280 * run & fix ICU4J tests

2281

2282 *** LayoutEngine script information

2283

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.
  (It also generates ScriptRunData.cpp, which is no longer needed.)

  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
  (a plain text file), which maps ICU versions to the numbers of script/language constants
  that were added in each version.
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

2297

2298 * Review changes, fix Java tool if necessary, and copy to ICU4C

2299 cd ~/svn.icu4j/trunk/src

2300 meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout

2301 cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout

2302 cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

2303

2304 *** API additions

2305 - send notice to icu-design about new born-@stable API (enum constants etc.)

2306

2307 *** merge the Unicode update branches back onto the trunk

2308 - do not merge the icudata.jar and testdata.jar,

2309 instead rebuild them from merged & tested ICU4C

2310 - make sure that changes to Unicode tools & ICU tools are checked in

2311 http://www.unicode.org/utility/trac/log/trunk/unicodetools

2312 http://bugs.icu-project.org/trac/log/tools/trunk

2313

2314 ---------------------------------------------------------------------------- ***

2315

2316 New script codes early in ICU 58: http://bugs.icu-project.org/trac/ticket/11764

2317

2318 Adding

2319 - new scripts in Unicode 9: Adlm, Bhks, Marc, Newa, Osge

2320 - new combination/alias codes: Hanb, Jamo

2321 - used in CLDR 29 and in spoof checker

2322 - new Z* code: Zsye

2323

2324 Add new codes to uscript.h & UScript.java, see Unicode update logs.

2325 -> com.ibm.icu.lang.UScript

2326 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)

2327 replace public static final int \1 = \2; \3

2328

2329 Manually edit ppucd.txt and icutools:unicode/c/genprops/pnames_data.h,

2330 add new script codes.

2331 "Long" script names only where established in Unicode 9 PropertyValueAliases.txt.

2332

2333 Note: If we have to run preparseucd.py again before the Unicode 9 update,

2334 then we need to manually keep/restore the new script codes.

2335

2336 ICU_ROOT=~/svn.icu/trunk

2337 ICU_SRC_DIR=$ICU_ROOT/src

2338 ICUDT=icudt57b

2339 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib

2340 SRC_DATA_IN=$ICU_SRC_DIR/source/data/in

2341 UNIDATA=$ICU_SRC_DIR/source/data/unidata

2342

2343 Adjust unicode/c/genprops/*builder.cpp for #ifndef/#ifdef changes in _data.h files,

2344 see http://bugs.icu-project.org/trac/ticket/12141

2345

2346 make install, then icutools cmake & make, then

2347 ~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR

2348

2349 Generate Java data as usual, only update pnames.icu & uprops.icu.

2350

2351 *** LayoutEngine script information

2352

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.
  (It also generates ScriptRunData.cpp, which is no longer needed.)

  It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages
  (a plain text file), which maps ICU versions to the numbers of script/language constants
  that were added in each version.
  (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

  The generated files have a current copyright date and "@deprecated" statement.

2366

2367 * Review changes, fix Java tool if necessary, and copy to ICU4C

2368 cd ~/svn.icu4j/trunk/src

2369 meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout

2370 cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout

2371 cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

2372

2373 ---------------------------------------------------------------------------- ***

2374

2375 Emoji properties added in ICU 57: http://bugs.icu-project.org/trac/ticket/11802

2376

2377 Edit preparseucd.py to add & parse new properties.

2378 They share the UCD property namespace but are not listed in PropertyAliases.txt.

2379

2380 Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/

2381 Initial data from emoji/2.0/

2382

2383 ICU_ROOT=~/svn.icu/trunk

2384 ICU_SRC_DIR=$ICU_ROOT/src

2385 ICUDT=icudt56b

2386 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib

2387 SRC_DATA_IN=$ICU_SRC_DIR/source/data/in

2388 UNIDATA=$ICU_SRC_DIR/source/data/unidata

2389

2390 Add binary-property constants to uchar.h enum UProperty & UProperty.java.

2391

2392 ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src

2393 (Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.)

2394

2395 Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java

2396

2397 make install, then icutools cmake & make, then

2398 ~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR

2399

2400 Generate Java data as usual, only update pnames.icu & uprops.icu.

2401

2402 ---------------------------------------------------------------------------- ***

2403

2404 Unicode 8.0 update for ICU 56

2405

2406 * Command-line environment setup

2407

2408 ICU_ROOT=~/svn.icu/trunk

2409 ICU_SRC_DIR=$ICU_ROOT/src

2410 ICUDT=icudt56b

2411 export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib

2412 SRC_DATA_IN=$ICU_SRC_DIR/source/data/in

2413 UNIDATA=$ICU_SRC_DIR/source/data/unidata

2414

2415 http://www.unicode.org/review/pri297/ -- beta review

2416 http://www.unicode.org/reports/uax-proposed-updates.html

2417 http://unicode.org/versions/beta-8.0.0.html

2418 http://www.unicode.org/versions/Unicode8.0.0/

2419 http://www.unicode.org/reports/tr44/tr44-15.html

2420

2421 *** ICU Trac

2422

2423 - ticket:11574: Unicode 8

2424 - C++ branches/markus/uni80 at r37351 from trunk at r37343

2425 - Java branches/markus/uni80 at r37352 from trunk at r37338

2426

2427 *** CLDR Trac

2428

2429 - cldrbug 8311: UCA 8

2430 - branches/markus/uni80 at r11518 from trunk at r11517

2431

2432 - cldrbug 8109: Unicode 8.0 script metadata

2433 - cldrbug 8418: Updated segmentation for Unicode 8.0

2434

2435 *** Unicode version numbers

2436 - makedata.mak

2437 - uchar.h

2438 - com.ibm.icu.util.VersionInfo

2439 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

2440

2441 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h

2442 so that the makefiles see the new version number.

2443

2444 *** data files & enums & parser code

2445

2446 * file preparation

2447

2448 - download UCD & IDNA files

2449 - make sure that the Unicode data folder passed into preparseucd.py

2450 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)

2451 - only for manual diffs: remove version suffixes from the file names

2452 ~/unidata/uni70/20140403$ ../../desuffixucd.py .

2453 (see https://sites.google.com/site/unicodetools/inputdata)

2454 - only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

2455 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src

2456 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

2457

2458 - also: from http://unicode.org/Public/security/8.0.0/ download new

2459 confusables.txt & confusablesWholeScript.txt

2460 and copy to $UNIDATA

2461 ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA

2462 ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA

2463

2464 * initial preparseucd.py changes

2465 - remove new Unicode scripts from the

2466 only-in-ISO-15924 list according to the error message:

2467 ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw']

2468 from _scripts_only_in_iso15924

2469 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()

2470 and in com.ibm.icu.dev.test.lang.TestUScript.java

2471 - property and file name change:

2472 IndicMatraCategory -> IndicPositionalCategory

- UnicodeData.txt unusual numeric values (fractions not in lowest terms)

2474 109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;;

2475 109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;;

2476 109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;;

2477 109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;;

2478 109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;;

2479 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;

2480 109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;;

2481 109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;;

2482 109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;;

2483 109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;;

-> change preparseucd.py to map them to fractions in lowest terms (e.g., 2/12 -> 1/6),
   which are the values listed in DerivedNumericValues.txt;
   this keeps the storage in the data file simple
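The mapping of the Meroitic twelfths to the values listed in DerivedNumericValues.txt can be sketched with Python's fractions module. This is not the code in preparseucd.py; the function name is hypothetical.

```python
from fractions import Fraction


def reduce_numeric_value(nv):
    """Reduce a 'numerator/denominator' value to lowest terms; pass others through."""
    if "/" not in nv:
        return nv
    # Fraction() normalizes to lowest terms when parsing the string.
    return str(Fraction(nv))
```

For example, 2/12 becomes 1/6 and 9/12 becomes 3/4, while 1/12, 5/12 and 7/12 stay as they are.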

2487

2488 * PropertyValueAliases.txt changes

2489 - 10 new Block (blk) values:

2490 blk; Ahom ; Ahom

2491 blk; Anatolian_Hieroglyphs ; Anatolian_Hieroglyphs

2492 blk; Cherokee_Sup ; Cherokee_Supplement

2493 blk; CJK_Ext_E ; CJK_Unified_Ideographs_Extension_E

2494 blk; Early_Dynastic_Cuneiform ; Early_Dynastic_Cuneiform

2495 blk; Hatran ; Hatran

2496 blk; Multani ; Multani

2497 blk; Old_Hungarian ; Old_Hungarian

2498 blk; Sup_Symbols_And_Pictographs ; Supplemental_Symbols_And_Pictographs

2499 blk; Sutton_SignWriting ; Sutton_SignWriting

2500 -> add to uchar.h

2501 use long property names for enum constants

2502 -> add to UCharacter.UnicodeBlock IDs

2503 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)

2504 replace public static final int \1_ID = \2; \3

2505 -> add to UCharacter.UnicodeBlock objects

2506 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)

2507 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

2508 - 6 new Script (sc) values:

2509 sc ; Ahom ; Ahom

2510 sc ; Hatr ; Hatran

2511 sc ; Hluw ; Anatolian_Hieroglyphs

2512 sc ; Hung ; Old_Hungarian

2513 sc ; Mult ; Multani

2514 sc ; Sgnw ; SignWriting

2515 -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript

2516

2517 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata

2518 (not strictly necessary for NOT_ENCODED scripts)

2519 ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

2520

2521 * generate normalization data files

2522 cd $ICU_ROOT/dbg

2523 bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource

2524 bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt

2525 bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt

2526 bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt

2527 bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

2528

2529 * build ICU (make install)

2530 so that the tools build can pick up the new definitions from the installed header files.

2531

2532 $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

2533

2534 * build Unicode tools using CMake+make

2535

2536 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

2537

2538 # Location (--prefix) of where ICU was installed.

2539 set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst)

2540 # Location of the ICU source tree.

2541 set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src)

2542

2543 ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c

2544 ~/svn.icutools/trunk/dbg/unicode/c$ make

2545

2546 * generate core properties data files

2547 - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR

2548 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR

2549 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR

2550 - rebuild ICU (make install) & tools

2551 - run genuca again (see step above) so that it picks up the new nfc.nrm

2552 - rebuild ICU (make install) & tools

2553

2554 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to

2555 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)

2556 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters

2557 - Unicode 6.0..8.0: U+2260, U+226E, U+226F

2558 - nothing new in 8.0, no test file to update

2559

2560 * run & fix ICU4C tests

2561 - bad Cherokee case folding due to difference in fallbacks:

2562 UCD case folding falls back to no mapping,

2563 ICU runtime case folding falls back to lowercasing;

2564 fixed casepropsbuilder.cpp to generate scf mappings to self

2565 when there is an slc mapping but no scf

2566 - Andy handles RBBI & spoof check test failures
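The Cherokee case folding fix above comes down to making the UCD fallback explicit. A sketch with plain dictionaries standing in for the slc and scf mappings; the real change is in casepropsbuilder.cpp, and the function name here is hypothetical.

```python
def add_self_case_folding(slc, scf):
    """Return scf plus identity mappings for code points with slc but no scf."""
    result = dict(scf)
    for cp in slc:
        if cp not in result:
            # UCD: no scf entry means "folds to itself". Without an explicit
            # entry, the ICU runtime would fall back to the lowercase mapping.
            result[cp] = cp
    return result
```

With slc = {0x13A0: 0xAB70} and scf = {0xAB70: 0x13A0}, the result gains the entry 0x13A0 -> 0x13A0, so U+13A0 no longer folds to its lowercase form.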

2567

2568 * collation: CLDR collation root, UCA DUCET

2569

2570 - UCA DUCET goes into Mark's Unicode tools, see

2571 https://sites.google.com/site/unicodetools/home#TOC-UCA

2572 - CLDR root data files are checked into (CLDR UCA branch)/common/uca/

2573 - cd (CLDR UCA branch)/common/uca/

2574 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt

2575 cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt

2576 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt

2577 cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt

(note: the target file name drops the underscore before "Rules")

2579 cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt

2580 - restore TODO diffs in UCARules.txt

2581 meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt

2582 - update (ICU4C)/source/test/testdata/CollationTest_*.txt

2583 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt

2584 from the CLDR root files (..._CLDR_..._SHORT.txt)

2585 cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt

2586 cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt

2587 cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data

2588 - if CLDR common/uca/unihan-index.txt changes, then update

2589 CLDR common/collation/root.xml <collation type="private-unihan">

2590 and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt

2591 - run genuca, see command line above;

2592 deal with

2593 Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt

2594 (add the character to genuca.cpp sampleCharsToScripts[])

2595 + look up the script for the new sample characters

2596 (e.g., in FractionalUCA.txt)

2597 + *add* mappings to sampleCharsToScripts[], do not replace them

2598 (in case the script sample characters flip-flop)

2599 + insert new scripts in DUCET script order, see the top_byte table

2600 at the beginning of FractionalUCA.txt

2601 - rebuild ICU4C

2602

2603 * run & fix ICU4C tests, now with new CLDR collation root data

2604 - run all tests with the collation test data *_SHORT.txt or the full files

2605 (the full ones have comments, useful for debugging)

2606 - note on intltest: if collate/UCAConformanceTest fails, then

2607 utility/MultithreadTest/TestCollators will fail as well;

2608 fix the conformance test before looking into the multi-thread test

2609 - fixed bug in CollationWeights::getWeightRanges()

2610 exposed by new data and CollationTest::TestRootElements

2611

2612 * update Java data files

2613 - refresh just the UCD/UCA-related/derived files, just to be safe

2614 - see (ICU4C)/source/data/icu4j-readme.txt

2615 - mkdir /tmp/icu4j

2616 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

2617 output:

2618 ...

2619 Unicode .icu files built to ./out/build/icudt56l

2620 echo timestamp > uni-core-data

2621 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b

2622 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b

2623 echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt

2624 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b

2625 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b"

2626 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/

2627 mkdir -p /tmp/icu4j/main/shared/data

2628 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data

2629 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/

2630 mkdir -p /tmp/icu4j/main/shared/data

2631 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data

2632 make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data'

2633 - copy the big-endian Unicode data files to another location,

2634 separate from the other data files,

2635 and then refresh ICU4J

2636 cd ~/svn.icu/trunk/dbg/data/out/icu4j

2637 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll

2638 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr

2639 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

2640 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

2641 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu

2642 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

2643 cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll

2644 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr

2645 jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

2646

2647 * When refreshing all of ICU4J data from ICU4C

2648 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

2649 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data

2650 or

2651 - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

2652

2653 * update CollationFCD.java

2654 + copy & paste the initializers of lcccIndex[] etc. from

2655 ICU4C/source/i18n/collationfcd.cpp to

2656 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

2657

2658 * refresh Java test .txt files

2659 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

2660 cd $ICU_SRC_DIR/source/data/unidata

2661 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

2662 cd ../../test/testdata

2663 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

2664 cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

2665

2666 * run & fix ICU4J tests

2667

2668 *** LayoutEngine script information

2669

2670 * ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more,

2671 because the layout engine was deprecated in ICU 54.

2672 Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java

2673 to write lines that we used to add manually.

2674

2675 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.

2676 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp

2677 in the working directory.

2678

2679 (It also generates ScriptRunData.cpp, which is no longer needed.)

2680

2681 It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages

2682 (a plain text file)

2683 which maps ICU versions to the numbers of script/language constants

2684 that were added then.

2685 (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.)

2686

2687 The generated files have a current copyright date and "@deprecated" statement.

2688

2689 * Review changes, fix Java tool if necessary, and copy to ICU4C

2690 cd ~/svn.icu4j/trunk/src

2691 meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout

2692 cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout

2693 cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout

2694

2695 *** API additions

2696 - send notice to icu-design about new born-@stable API (enum constants etc.)

2697

2698 *** merge the Unicode update branches back onto the trunk

2699 - do not merge the icudata.jar and testdata.jar,

2700 instead rebuild them from merged & tested ICU4C

2701 - make sure that changes to Unicode tools & ICU tools are checked in

2702 http://www.unicode.org/utility/trac/log/trunk/unicodetools

2703 http://bugs.icu-project.org/trac/log/tools/trunk

2704

2705 ---------------------------------------------------------------------------- ***

2706

2707 Unicode 7.0 update for ICU 54

2708

2709 http://www.unicode.org/review/pri271/ -- beta review

2710 http://www.unicode.org/reports/uax-proposed-updates.html

2711 http://www.unicode.org/versions/beta-7.0.0.html#notable_issues

2712 http://www.unicode.org/reports/tr44/tr44-13.html

2713

2714 *** ICU Trac

2715

2716 - ticket 10821: Unicode 7.0, UCA 7.0

2717 - C++ branches/markus/uni70 at r35584 from trunk at r35580

2718 - Java branches/markus/uni70 at r35587 from trunk at r35545

2719

2720 *** CLDR Trac

2721

2722 - ticket 7195: UCA 7.0 CLDR root collation

2723 - branches/markus/uni70 at r10062 from trunk at r10061

2724

2725 - ticket 6762: script metadata for Unicode 7.0 new scripts

2726

2727 *** Unicode version numbers

2728 - makedata.mak

2729 - uchar.h

2730 - com.ibm.icu.util.VersionInfo

2731 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

2732

2733 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h

2734 so that the makefiles see the new version number.

2735

2736 *** data files & enums & parser code

2737

2738 * file preparation

2739

2740 - download UCD & IDNA files

2741 - make sure that the Unicode data folder passed into preparseucd.py

2742 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)

2743 - only for manual diffs: remove version suffixes from the file names

2744 ~/unidata/uni70/20140403$ ../../desuffixucd.py .

2745 (see https://sites.google.com/site/unicodetools/inputdata)

2746 - only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip

2747 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src

2748 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

2749 - Restore TODO diffs in source/data/unidata/UCARules.txt

2750 cd $ICU_SRC_DIR

2751 meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt

2752 - Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt

2753

2754 - also: from http://unicode.org/Public/security/7.0.0/ download new

2755 confusables.txt & confusablesWholeScript.txt

2756 and copy to $ICU_ROOT/src/source/data/unidata/
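The "remove version suffixes" step in the list above could look roughly like the following Python sketch. It is not the actual desuffixucd.py. The file name pattern (base name, hyphen, x.y.z version, optional draft marker such as "d18") is an assumption, and the function name is hypothetical.

```python
import re

# Assumed pattern: <base>-<major>.<minor>.<update>[d<draft>]<.ext>
_SUFFIX_RE = re.compile(
    r"^(?P<base>.+?)-\d+\.\d+\.\d+(?:d\d+)?(?P<ext>\.[A-Za-z0-9]+)$")


def strip_version_suffix(name):
    """Return the file name without its version suffix, or unchanged if it has none."""
    m = _SUFFIX_RE.match(name)
    if m is None:
        return name
    return m.group("base") + m.group("ext")


for name in ["UnicodeData-7.0.0d18.txt", "DerivedAge-7.0.0.txt", "ReadMe.txt"]:
    print(name, "->", strip_version_suffix(name))
```

Names without a recognizable suffix pass through unchanged, so the function is safe to apply to every file in a folder before running a manual diff.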

2757

2758 * initial preparseucd.py changes

2759 - remove new Unicode scripts from the

2760 only-in-ISO-15924 list according to the error message:

2761 ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass',

2762 'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm',

2763 'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj']

2764 from _scripts_only_in_iso15924

2765 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()

2766 and in com.ibm.icu.dev.test.lang.TestUScript.java

2767 - NamesList.txt now has a heading with a non-ASCII character

2768 + keep ppucd.txt in platform charset, rather than changing tool/test parsers

2769 + escape non-ASCII characters in heading comments

- the tool took the Unicode copyright line from PropertyAliases.txt, whose copyright year is currently still 2013

2771 + get the copyright from the first file whose copyright line contains the current year
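The two preparseucd.py adjustments above (escaping non-ASCII heading text and choosing the copyright line by year) can be sketched as follows. This is not the tool's code: both function names are hypothetical, and the backslash-u escape format is an assumption, since the log does not say which escape syntax is used.

```python
import re


def escape_non_ascii(text):
    """Replace each non-ASCII character with a backslash-u style escape (assumed format)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)
        else:
            out.append("\\U%08X" % cp)
    return "".join(out)


def pick_copyright(copyright_lines, year):
    """Return the first copyright line that mentions the given year, else None."""
    year_re = re.compile(r"\b%d\b" % year)
    for line in copyright_lines:
        if year_re.search(line):
            return line
    return None


# Illustrative copyright lines, one per input file in reading order.
candidates = [
    "# Copyright (c) 1991-2013 Unicode, Inc.",
    "# Copyright (c) 1991-2014 Unicode, Inc.",
]
print(pick_copyright(candidates, 2014))
print(escape_non_ascii("Caf\u00e9"))
```

Escaping keeps ppucd.txt readable in the platform charset, so the tool and test parsers do not need to change.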

2772

2773 * PropertyValueAliases.txt changes

2774 - 32 new Block (blk) values:

2775 blk; Bassa_Vah ; Bassa_Vah

2776 blk; Caucasian_Albanian ; Caucasian_Albanian

2777 blk; Coptic_Epact_Numbers ; Coptic_Epact_Numbers

2778 blk; Diacriticals_Ext ; Combining_Diacritical_Marks_Extended

2779 blk; Duployan ; Duployan

2780 blk; Elbasan ; Elbasan

2781 blk; Geometric_Shapes_Ext ; Geometric_Shapes_Extended

2782 blk; Grantha ; Grantha

2783 blk; Khojki ; Khojki

2784 blk; Khudawadi ; Khudawadi

2785 blk; Latin_Ext_E ; Latin_Extended_E

2786 blk; Linear_A ; Linear_A

2787 blk; Mahajani ; Mahajani

2788 blk; Manichaean ; Manichaean

2789 blk; Mende_Kikakui ; Mende_Kikakui

2790 blk; Modi ; Modi

2791 blk; Mro ; Mro

2792 blk; Myanmar_Ext_B ; Myanmar_Extended_B

2793 blk; Nabataean ; Nabataean

2794 blk; Old_North_Arabian ; Old_North_Arabian

2795 blk; Old_Permic ; Old_Permic

2796 blk; Ornamental_Dingbats ; Ornamental_Dingbats

2797 blk; Pahawh_Hmong ; Pahawh_Hmong

2798 blk; Palmyrene ; Palmyrene

2799 blk; Pau_Cin_Hau ; Pau_Cin_Hau

2800 blk; Psalter_Pahlavi ; Psalter_Pahlavi

2801 blk; Shorthand_Format_Controls ; Shorthand_Format_Controls

2802 blk; Siddham ; Siddham

2803 blk; Sinhala_Archaic_Numbers ; Sinhala_Archaic_Numbers

2804 blk; Sup_Arrows_C ; Supplemental_Arrows_C

2805 blk; Tirhuta ; Tirhuta

2806 blk; Warang_Citi ; Warang_Citi

2807 -> add to uchar.h

2808 use long property names for enum constants

2809 -> add to UCharacter.UnicodeBlock IDs

2810 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)

2811 replace public static final int \1_ID = \2; \3

2812 -> add to UCharacter.UnicodeBlock objects

2813 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)

2814 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

2815 - 28 new Joining_Group (jg) values:

2816 jg ; Manichaean_Aleph ; Manichaean_Aleph

2817 jg ; Manichaean_Ayin ; Manichaean_Ayin

2818 jg ; Manichaean_Beth ; Manichaean_Beth

2819 jg ; Manichaean_Daleth ; Manichaean_Daleth

2820 jg ; Manichaean_Dhamedh ; Manichaean_Dhamedh

2821 jg ; Manichaean_Five ; Manichaean_Five

2822 jg ; Manichaean_Gimel ; Manichaean_Gimel

2823 jg ; Manichaean_Heth ; Manichaean_Heth

2824 jg ; Manichaean_Hundred ; Manichaean_Hundred

2825 jg ; Manichaean_Kaph ; Manichaean_Kaph

2826 jg ; Manichaean_Lamedh ; Manichaean_Lamedh

2827 jg ; Manichaean_Mem ; Manichaean_Mem

2828 jg ; Manichaean_Nun ; Manichaean_Nun

2829 jg ; Manichaean_One ; Manichaean_One

2830 jg ; Manichaean_Pe ; Manichaean_Pe

2831 jg ; Manichaean_Qoph ; Manichaean_Qoph

2832 jg ; Manichaean_Resh ; Manichaean_Resh

2833 jg ; Manichaean_Sadhe ; Manichaean_Sadhe

2834 jg ; Manichaean_Samekh ; Manichaean_Samekh

2835 jg ; Manichaean_Taw ; Manichaean_Taw

2836 jg ; Manichaean_Ten ; Manichaean_Ten

2837 jg ; Manichaean_Teth ; Manichaean_Teth

2838 jg ; Manichaean_Thamedh ; Manichaean_Thamedh

2839 jg ; Manichaean_Twenty ; Manichaean_Twenty

2840 jg ; Manichaean_Waw ; Manichaean_Waw

2841 jg ; Manichaean_Yodh ; Manichaean_Yodh

2842 jg ; Manichaean_Zayin ; Manichaean_Zayin

2843 jg ; Straight_Waw ; Straight_Waw

2844 -> uchar.h & UCharacter.JoiningGroup

2845 - 23 new Script (sc) values:

2846 sc ; Aghb ; Caucasian_Albanian

2847 sc ; Bass ; Bassa_Vah

2848 sc ; Dupl ; Duployan

2849 sc ; Elba ; Elbasan

2850 sc ; Gran ; Grantha

2851 sc ; Hmng ; Pahawh_Hmong

2852 sc ; Khoj ; Khojki

2853 sc ; Lina ; Linear_A

2854 sc ; Mahj ; Mahajani

2855 sc ; Mani ; Manichaean

2856 sc ; Mend ; Mende_Kikakui

2857 sc ; Modi ; Modi

2858 sc ; Mroo ; Mro

2859 sc ; Narb ; Old_North_Arabian

2860 sc ; Nbat ; Nabataean

2861 sc ; Palm ; Palmyrene

2862 sc ; Pauc ; Pau_Cin_Hau

2863 sc ; Perm ; Old_Permic

2864 sc ; Phlp ; Psalter_Pahlavi

2865 sc ; Sidd ; Siddham

2866 sc ; Sind ; Khudawadi

2867 sc ; Tirh ; Tirhuta

2868 sc ; Wara ; Warang_Citi

2869 -> uscript.h (many were added before)

2870 comment "Mende Kikakui" for USCRIPT_MENDE

2871 add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias

2872 -> com.ibm.icu.lang.UScript

2873 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)

2874 replace public static final int \1 = \2; \3

2875 - 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html

2876 (added 2012-11-01)

2877 Ahom 338 Ahom

2878 Hatr 127 Hatran

2879 Mult 323 Multani

2880 (added 2013-10-12)

2881 Modi 324 Modi

2882 Pauc 263 Pau Cin Hau

2883 Sidd 302 Siddham

2884 -> uscript.h (some overlap with additions from Unicode)

2885 -> com.ibm.icu.lang.UScript

2886 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)

2887 replace public static final int \1 = \2; \3

2888 -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924

2889 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()

2890 and in com.ibm.icu.dev.test.lang.TestUScript.java
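The Eclipse find/replace patterns listed above for UBLOCK_ and USCRIPT_ constants can also be applied outside the IDE. The sketch below uses the same regular expressions with Python's re.sub; the function names are hypothetical, and the sample input lines use made-up constant names and values rather than real uchar.h or uscript.h content.

```python
import re


def to_block_id(line):
    """C enum line -> Java int constant with an _ID suffix."""
    return re.sub(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)",
                  r"public static final int \g<1>_ID = \2; \3", line)


def to_block_object(line):
    """C enum line -> Java UnicodeBlock object declaration."""
    return re.sub(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)",
                  r'public static final UnicodeBlock \g<1> = '
                  r'new UnicodeBlock("\g<1>", \g<1>_ID); \2', line)


def to_script_constant(line):
    """C enum line -> Java int constant for UScript."""
    return re.sub(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)",
                  r"public static final int \1 = \2; \3", line)


# Made-up sample lines (names and numbers are placeholders).
block_line = "UBLOCK_EXAMPLE_BLOCK = 999, /*[FFFF0]*/"
script_line = "USCRIPT_EXAMPLE = 999,/* Xmpl */"

print(to_block_id(block_line))
print(to_block_object(block_line))
print(to_script_constant(script_line))
```

The `\g<1>` form is used where the group reference is followed directly by `_ID`, which keeps the replacement string unambiguous.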

2891

2892 * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata

2893 (not strictly necessary for NOT_ENCODED scripts)

2894 ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt

2895

2896 * generate normalization data files

2897 - cd $ICU_ROOT/dbg

2898 - export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib

2899 - SRC_DATA_IN=$ICU_SRC_DIR/source/data/in

2900 - UNIDATA=$ICU_SRC_DIR/source/data/unidata

2901 - bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource

2902 - bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt

2903 - bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt

2904 - bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt

2905 - bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

2906

2907 * build ICU (make install)

2908 so that the tools build can pick up the new definitions from the installed header files.

2909

2910 ~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

2911

2912 * build Unicode tools using CMake+make

2913

2914 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

2915

2916 # Location (--prefix) of where ICU was installed.

2917 set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst)

2918 # Location of the ICU source tree.

2919 set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src)

2920

2921 ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c

2922 ~/svn.icutools/trunk/dbg/unicode/c$ make

2923

2924 * genprops work

2925 - new code point range for Joining_Group values: 10AC0..10AFF Manichaean

2926 + add second array of Joining_Group values for at most 10800..10FFF

2927 icutools: unicode/c/genprops/bidipropsbuilder.cpp

2928 icu: source/common/ubidi_props.h/.c/_data.h

2929 icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java
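The two-array lookup described above can be sketched in Python. The real tables are generated by genprops and live in the ubidi_props code. Here the first range, the stored value 58, and all names are placeholders chosen for the illustration; only the Manichaean range 10AC0..10AFF and the 10800..10FFF bound come from the notes above.

```python
# Placeholder bounds for the first array (assumption, not the real ICU values).
JG_START, JG_LIMIT = 0x0620, 0x08A0
# Second array: Manichaean block, which must lie within 10800..10FFF.
JG_START2, JG_LIMIT2 = 0x10AC0, 0x10B00

NO_JOINING_GROUP = 0


def get_joining_group(c, jg_array, jg_array2):
    """Look up the Joining_Group value of code point c in one of two dense arrays."""
    if JG_START <= c < JG_LIMIT:
        return jg_array[c - JG_START]
    if JG_START2 <= c < JG_LIMIT2:
        return jg_array2[c - JG_START2]
    return NO_JOINING_GROUP


# Dummy tables: everything is "no joining group" except one made-up entry.
jg_array = [NO_JOINING_GROUP] * (JG_LIMIT - JG_START)
jg_array2 = [NO_JOINING_GROUP] * (JG_LIMIT2 - JG_START2)
jg_array2[0x10AC0 - JG_START2] = 58  # placeholder value

print(get_joining_group(0x10AC0, jg_array, jg_array2))
print(get_joining_group(0x0041, jg_array, jg_array2))
```

Two short dense arrays avoid one large, mostly empty array that would have to span everything between the first range and the supplementary-plane range.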

2930

2931 * generate core properties data files

2932 - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR

2933 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR

2934 - rebuild ICU (make install) & tools

2935 - run genuca again (see step above) so that it picks up the new nfc.nrm

2936 - rebuild ICU (make install) & tools

2937

2938 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to

2939 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)

2940 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters

2941 - Unicode 6.0..7.0: U+2260, U+226E, U+226F

2942 - nothing new in 7.0, no test file to update

2943

2944 * run & fix ICU4C tests

2945

2946 * update Java data files

2947 - refresh just the UCD-related files, just to be safe

2948 - see (ICU4C)/source/data/icu4j-readme.txt

2949 - mkdir /tmp/icu4j

2950 - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

2951 output:

2952 ...

2953 Unicode .icu files built to ./out/build/icudt53l

2954 echo timestamp > uni-core-data

2955 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b

2956 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b

2957 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt

2958 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b

2959 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b"

2960 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/

2961 mkdir -p /tmp/icu4j/main/shared/data

2962 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data

2963 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/

2964 mkdir -p /tmp/icu4j/main/shared/data

2965 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data

2966 make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data'

2967 - copy the big-endian Unicode data files to another location,

2968 separate from the other data files

2969 ICUDT=icudt54b

2970 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll

2971 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr

2972 cd ~/svn.icu/uni70/dbg/data/out/icu4j

2973 cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

2974 cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

2975 rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu

2976 cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT

2977 cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll

2978 cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr

2979 - refresh ICU4J

2980 ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

2981

2982 * update CollationFCD.java

2983 + copy & paste the initializers of lcccIndex[] etc. from

2984 ICU4C/source/i18n/collationfcd.cpp to

2985 ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java

2986

2987 * refresh Java test .txt files

2988 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

2989 cd $ICU_SRC_DIR/source/data/unidata

2990 cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

2991 cd ../../test/testdata

2992 cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

2993 cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode

2994

2995 * UCA

2996

2997 - download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/

2998 - run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata)

2999 - update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/

3000 - run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA

3001 - output files are in ~/svn.unitools/Generated/uca/7.0.0/

3002 - review data; compare files, use blankweights.sed or similar

3003 ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt

3004 - cd ~/svn.unitools/Generated/uca/7.0.0/

3005 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt

3006 cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt

3007 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt

(note: the target file name drops the underscore before "Rules")

3009 cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt

3010 - update (ICU4C)/source/test/testdata/CollationTest_*.txt

3011 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt

3012 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)

3013 cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt

3014 cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt

3015 cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data

3016 - run genuca, see command line above

3017 - rebuild ICU4C

3018 - refresh ICU4J collation data:

(a subset of the instructions above for the properties data refresh, except that this copies all of coll/*)

3020 ICUDT=icudt54b

3021 ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

3022 ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll

3023 ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll

3024 ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT

3025 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)

3026 - note on intltest: if collate/UCAConformanceTest fails, then

3027 utility/MultithreadTest/TestCollators will fail as well;

3028 fix the conformance test before looking into the multi-thread test

3029 - copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors

3030 - copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch

3031 ~/svn.unitools$ cp Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/

3032

3033 * When refreshing all of ICU4J data from ICU4C

3034 - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

3035 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data

3036 or

3037 - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

3038

3039 * run & fix ICU4J tests

3040

3041 *** LayoutEngine script information

3042

3043 (For details see the Unicode 5.2 change log below.)

3044

3045 * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.

3046 This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp

3047 in the working directory.

3048 (It also generates ScriptRunData.cpp, which is no longer needed.)

3049

3050 The generated files have a current copyright date and "@stable" statement.

3051 ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java

3052 for "born stable" Unicode API constants, and to stop parsing ICU version numbers

3053 which may not contain dots any more.

3054

3055 - diff current <icu>/source/layout files vs. generated ones

3056 ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout

3057 review and manually merge desired changes;

3058 fix gratuitous changes, incorrect @draft/@stable and missing aliases;

3059 Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.

3060 - if you just copy the above files, then

3061 fix mixed line endings, review the diffs as above and restore changes to API tags etc.;

3062 manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

3063

3064 *** API additions

3065 - send notice to icu-design about new born-@stable API (enum constants etc.)

3066

3067 *** merge the Unicode update branches back onto the trunk

3068 - do not merge the icudata.jar and testdata.jar,

3069 instead rebuild them from merged & tested ICU4C

3070

3071 ---------------------------------------------------------------------------- ***

3072

3073 Unicode 6.3 update

3074

3075 http://www.unicode.org/review/pri249/ -- beta review

3076 http://www.unicode.org/reports/uax-proposed-updates.html

3077 http://www.unicode.org/versions/beta-6.3.0.html#notable_issues

3078 http://www.unicode.org/reports/tr44/tr44-11.html

3079

3080 *** ICU Trac

3081

3082 - ticket 10128: update ICU to Unicode 6.3 beta

3083 - ticket 10168: update ICU to Unicode 6.3 final

3084 - C++ branches/markus/uni63 at r33552 from trunk at r33551

3085 - Java branches/markus/uni63 at r33550 from trunk at r33553

3086

3087 - ticket 10142: implement Unicode 6.3 bidi algorithm additions

3088

3089 *** Unicode version numbers

3090 - makedata.mak

3091 - uchar.h

(configure.in & configure have been modified to extract the version from uchar.h)

3093 - com.ibm.icu.util.VersionInfo

3094 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

3095

3096 - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h

3097 so that the makefiles see the new version number.

3098

3099 *** data files & enums & parser code

3100

3101 * file preparation

3102

3103 - download UCD, UCA & IDNA files

3104 - make sure that the Unicode data folder passed into preparseucd.py

3105 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)

3106 - modify preparseucd.py:

3107 parse new file BidiBrackets.txt

3108 with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type

3109 - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src

3110 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

3111 - Check test file diffs for previously commented-out, known-failing data lines;

3112 probably need to keep those commented out.

3113

3114 * PropertyAliases.txt changes

3115 - 1 new Enumerated Property

3116 bpt ; Bidi_Paired_Bracket_Type

3117 -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType

3118 -> ubidi_props.h & .c & UBiDiProps.java

3119 -> remember to write the max value at UBIDI_MAX_VALUES_INDEX

3120 -> uprops.cpp

3121 -> change ubidi.icu format version from 2.0 to 2.1

3122 - 1 new Miscellaneous Property

3123 bpb ; Bidi_Paired_Bracket

3124 -> uchar.h & UProperty.java

3125 -> ppucd.h & .cpp

3126

3127 * PropertyValueAliases.txt changes

3128 - 3 Bidi_Paired_Bracket_Type (bpt) values:

3129 bpt; c ; Close

3130 bpt; n ; None

3131 bpt; o ; Open

3132 -> uchar.h & UCharacter.BidiPairedBracketType

3133 -> ubidi_props.h & .c & UBiDiProps.java

3134 -> change ubidi.icu format version from 2.0 to 2.1

3135 - 4 new Bidi_Class (bc) values:

3136 bc ; FSI ; First_Strong_Isolate

3137 bc ; LRI ; Left_To_Right_Isolate

3138 bc ; RLI ; Right_To_Left_Isolate

3139 bc ; PDI ; Pop_Directional_Isolate

3140 -> uchar.h & UCharacterEnums.ECharacterDirection

3141 -> until the bidi code gets updated,

3142 Roozbeh suggests mapping the new bc values to ON (Other_Neutral)

3143 - 3 new Word_Break (WB) values:

3144 WB ; HL ; Hebrew_Letter

3145 WB ; SQ ; Single_Quote

3146 WB ; DQ ; Double_Quote

3147 -> uchar.h & UCharacter.WordBreak

3148 -> first time Word_Break numeric constants exceed 4 bits (now 17 values)

3149 - 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html

3150 (added 2012-10-16)

3151 Aghb 239 Caucasian Albanian

3152 Mahj 314 Mahajani

3153 -> uscript.h

3154 -> com.ibm.icu.lang.UScript

3155 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)

3156 replace public static final int \1 = \2;\3

3157 -> preparseucd.py _scripts_only_in_iso15924

3158 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()

3159 and in com.ibm.icu.dev.test.lang.TestUScript.java

3160 -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata

3161 (not strictly necessary for NOT_ENCODED scripts)
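The Word_Break note in the list above (17 values no longer fit in 4 bits) is simple arithmetic, shown here as a Python sketch. It assumes the values are numbered consecutively from 0; the function name is hypothetical.

```python
def bits_needed(num_values):
    """Number of bits needed to store the values 0..num_values-1."""
    return max(1, (num_values - 1).bit_length())


for n in (16, 17):
    print(n, "values need", bits_needed(n), "bits")
```

Sixteen values fit exactly into 4 bits, so adding a seventeenth forces any packed storage of the property to widen its field to 5 bits.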

3162

3163 * generate normalization data files

3164 - ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib

3165 - ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in

3166 - ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata

3167 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt

3168 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt

3169 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt

3170 - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

3171

3172 * build ICU (make install)

3173 so that the tools build can pick up the new definitions from the installed header files.

3174

3175 ~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt

3176

3177 * build Unicode tools using CMake+make

3178

3179 ~/svn.icutools/trunk/src/unicode/c/icudefs.txt:

3180

3181 # Location (--prefix) of where ICU was installed.

3182 set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst)

3183 # Location of the ICU source tree.

3184 set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src)

3185

3186 ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c

3187 ~/svn.icutools/trunk/dbg/unicode/c$ make

3188

3189 * generate core properties data files

3190 - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src

3191 - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src

3192 - rebuild ICU (make install) & tools

3193 - run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm

3194 - rebuild ICU (make install) & tools

3195

3196 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to

3197 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)

3198 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters

3199 - Unicode 6.0..6.3: U+2260, U+226E, U+226F

3200 - nothing new in 6.3, no test file to update
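The grep step above can be made a little more precise with a small script that keeps only the non-ASCII code points. This is a sketch, assuming the usual "code points ; status # comment" layout of IdnaMappingTable.txt with optional XXXX..YYYY ranges. The sample lines are illustrative and not copied from the file.

```python
def parse_code_points(field):
    """Parse 'XXXX' or 'XXXX..YYYY' into an inclusive (start, end) pair."""
    start, _, end = field.partition("..")
    return int(start, 16), int(end or start, 16)


def non_ascii_std3_valid(lines):
    """Return code points above U+007F whose status is disallowed_STD3_valid."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, end = parse_code_points(fields[0])
        found.extend(cp for cp in range(start, end + 1) if cp > 0x7F)
    return found


# Illustrative lines in the assumed layout.
sample = [
    "003D          ; disallowed_STD3_valid   # EQUALS SIGN",
    "00E0          ; valid                   # LATIN SMALL LETTER A WITH GRAVE",
    "2260          ; disallowed_STD3_valid   # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid   # NOT LESS-THAN..NOT GREATER-THAN",
]
hits = non_ascii_std3_valid(sample)
print(["U+%04X" % cp for cp in hits])
```

For the sample this reports the same three characters that the log lists for Unicode 6.0..6.3. Any additional code point in a new version would point to a test update.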

3201

3202 * update Java data files

3203 - refresh just the UCD-related files, just to be safe

3204 - see (ICU4C)/source/data/icu4j-readme.txt

3205 - mkdir /tmp/icu4j

3206 - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

3207 output:

3208 ...

3209 Unicode .icu files built to ./out/build/icudt52l

3210 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b

3211 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b

3212 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt

3213 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b

3214 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b"

3215 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/

3216 mkdir -p /tmp/icu4j/main/shared/data

3217 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data

3218 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/

3219 mkdir -p /tmp/icu4j/main/shared/data

3220 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data

3221 make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data'

3222 - copy the big-endian Unicode data files to another location,

3223 separate from the other data files

3224 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll

3225 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr

3226 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b

3227 ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu

3228 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b

3229 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll

3230 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr

3231 - refresh ICU4J

3232 ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b
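The cp/rm sequence above selects just the Unicode-related big-endian files before the jar update. A minimal Python sketch of that selection follows. The function name is hypothetical, and the demo uses a throwaway folder with empty files rather than a real build tree.

```python
import glob
import os
import shutil
import tempfile


def copy_unicode_data(src, dst):
    """Copy the UCD-related files from src to dst, mirroring the cp/rm steps."""
    copied = []
    # (subfolder, glob pattern) pairs taken from the cp commands above.
    for sub, pattern in [("", "*.icu"), ("", "*.nrm"), ("coll", "*.icu"), ("brkitr", "*")]:
        os.makedirs(os.path.join(dst, sub), exist_ok=True)
        for path in sorted(glob.glob(os.path.join(src, sub, pattern))):
            name = os.path.basename(path)
            # The log removes cnvalias.icu after copying; skip it here instead.
            if sub == "" and name == "cnvalias.icu":
                continue
            if os.path.isfile(path):
                shutil.copy(path, os.path.join(dst, sub, name))
                copied.append(os.path.join(sub, name))
    return copied


# Demo with empty placeholder files standing in for .../data/icudt52b.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = os.path.join(tmp, "src"), os.path.join(tmp, "dst")
    for rel in ["uprops.icu", "cnvalias.icu", "nfc.nrm", "root.res",
                "coll/ucadata.icu", "coll/root.res", "brkitr/char.brk"]:
        target = os.path.join(src, *rel.split("/"))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        open(target, "w").close()
    copied = copy_unicode_data(src, dst)
print(copied)
```

Resource bundles such as root.res are left behind, which matches the intent of refreshing only the UCD-related files.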

3233

3234 * refresh Java test .txt files

3235 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

3236

3237 * UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files

3238

3239 - get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/

3240 - CLDR root files for ICU are in CollationAuxiliary.zip; unpack that

3241 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt

3242 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt

3243 (note removing the underscore before "Rules")

3244 - update (ICU4C)/source/test/testdata/CollationTest_*.txt

3245 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt

3246 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)

3247 - check test file diffs for previously commented-out, known-failing data lines;

3248 probably need to keep those commented out

3249 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani

3250 - run genuca, see command line above

3251 - rebuild ICU4C

3252 - refresh ICU4J collation data:

3253 (subset of instructions above for properties data refresh, except copies all coll/*)

3254 ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

3255 ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll

3256 ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll

3257 ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b

3258 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)

3259 - note on intltest: if collate/UCAConformanceTest fails, then

3260 utility/MultithreadTest/TestCollators will fail as well;

3261 fix the conformance test before looking into the multi-thread test

3262

3263 * test ICU, fix test code where necessary

3264

3265 * When refreshing all of ICU4J data from ICU4C

3266 - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

3267 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data

3268 or

3269 - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

3270

3271 *** LayoutEngine script information

3272 - skipped for Unicode 6.3: no new scripts

3273

3274 *** merge the Unicode update branches back onto the trunk

3275 - do not merge the icudata.jar and testdata.jar,

3276 instead rebuild them from merged & tested ICU4C

3277

3278 ---------------------------------------------------------------------------- ***

3279

3280 Unicode 6.2 update

3281

3282 http://www.unicode.org/review/pri230/

3283 http://www.unicode.org/versions/beta-6.2.0.html

3284 http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0

3285 http://www.unicode.org/review/pri227/ Changes to Script Extensions Property Values

3286 http://www.unicode.org/review/pri228/ Changing some common characters from Punctuation to Symbol

3287 http://www.unicode.org/review/pri229/ Linebreaking Changes for Pictographic Symbols

3288 http://www.unicode.org/reports/tr46/tr46-8.html IDNA

3289 http://unicode.org/Public/idna/6.2.0/

3290

3291 *** ICU Trac

3292

3293 - ticket 9515: Unicode 6.2: final ICU update

3294

3295 - ticket 9514: UCA 6.2: fix UCARules.txt

3296

3297 - ticket 9437: update ICU to Unicode 6.2

3298 - C++ branches/markus/uni62 at r32050 from trunk at r32041

3299 - Java branches/markus/uni62 at r32068 from trunk at r32066

3300

3301 *** Unicode version numbers

3302 - makedata.mak

3303 - uchar.h

3304 (configure.in & configure: have been modified to extract the version from uchar.h)

3305 - com.ibm.icu.util.VersionInfo

3306 - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_

3307

3308 *** data files & enums & parser code

3309

3310 * file preparation

3311

3312 - download UCD, UCA & IDNA files

3313 - make sure that the Unicode data folder passed into preparseucd.py

3314 includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder)

3315 - modify preparseucd.py: NamesList.txt is now in UTF-8

3316 - ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src

3317 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

3318 - Check test file diffs for previously commented-out, known-failing data lines;

3319 probably need to keep those commented out.
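Since IdnaMappingTable.txt may sit in any subfolder of the Unicode data folder, a quick check before running preparseucd.py can save a failed run. This is a sketch with a hypothetical helper name. The demo creates a temporary folder in place of a real download such as ~/uni62/20120816.

```python
import os
import tempfile


def find_in_tree(root, filename):
    """Return the paths of all files named `filename` anywhere under `root`."""
    matches = []
    for folder, _subdirs, files in os.walk(root):
        if filename in files:
            matches.append(os.path.join(folder, filename))
    return sorted(matches)


# Self-contained demo with a throwaway folder.
with tempfile.TemporaryDirectory() as ucd_root:
    os.makedirs(os.path.join(ucd_root, "idna"))
    open(os.path.join(ucd_root, "idna", "IdnaMappingTable.txt"), "w").close()
    found = [os.path.relpath(p, ucd_root)
             for p in find_in_tree(ucd_root, "IdnaMappingTable.txt")]
    missing = find_in_tree(ucd_root, "NoSuchFile.txt")
print(found, missing)
```

An empty result means the file still needs to be downloaded. More than one result suggests a stale copy that should be removed first.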

3320

3321 * PropertyValueAliases.txt changes

3322 - 1 new Line_Break (lb) value:

3323 lb ; RI ; Regional_Indicator

3324 -> uchar.h & UCharacter.LineBreak

3325 - 1 new Word_Break (WB) value:

3326 WB ; RI ; Regional_Indicator

3327 -> uchar.h & UCharacter.WordBreak

3328 - 1 new Grapheme_Cluster_Break (GCB) value:

3329 GCB; RI ; Regional_Indicator

3330 -> uchar.h & UCharacter.GraphemeClusterBreak

3331

3332 * 3 new numeric values

The new value -1 was really supposed to be NaN, but that would have required

new UnicodeData.txt syntax. It can already be represented as a "fraction" of -1/1,

but encodeNumericValue() in corepropsbuilder.cpp had to be fixed.

3336 cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1

3337 cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1

3338 The two new values 216000 and 432000 require an addition to the encoding of numeric values.

3339 cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000

3340 cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000

3341 -> uprops.h, uchar.c & UCharacterProperty.java

3342 -> cucdtst.c & UCharacterTest.java
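The arithmetic behind the three new values can be checked quickly. The log does not spell out the encoding change itself, and uprops.h remains the authority for it. The Python sketch below only illustrates two observations: -1 written as -1/1 is still exactly -1, and the two large values are small multiples of a power of 60. Treating that as the reason they are representable compactly is an assumption, and the helper name is hypothetical.

```python
from fractions import Fraction


def base60_decompose(n):
    """Split a positive integer into (mantissa, exponent) with n == mantissa * 60**exponent."""
    if n <= 0:
        raise ValueError("expected a positive integer")
    exponent = 0
    while n % 60 == 0:
        n //= 60
        exponent += 1
    return n, exponent


# -1 written as the "fraction" -1/1 is still exactly -1.
minus_one = Fraction(-1, 1)

# The two new large values from U+12432 and U+12433.
shar2_dish = base60_decompose(216000)
shar2_min = base60_decompose(432000)
print(minus_one, shar2_dish, shar2_min)
```

Here 216000 is 1 * 60**3 and 432000 is 2 * 60**3, so each value reduces to a one-digit mantissa and a small exponent.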

3343

3344 * generate normalization data files

3345 - ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib

3346 - ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in

3347 - ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata

3348 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt

3349 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt

3350 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt

3351 - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

3352

3353 * build ICU (make install)

3354 so that the tools build can pick up the new definitions from the installed header files.

3355 * build Unicode tools using CMake+make

3356

3357 * generate core properties data files

3358 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src

3359 - in initial bootstrapping, change the UCA version

3360 in source/data/unidata/FractionalUCA.txt to match the new Unicode version

3361 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src

3362 - rebuild ICU (make install) & tools

+ if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,

3364 check if the UCA version in FractionalUCA.txt matches the new Unicode version

3365 (see step above)

3366 - run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm

3367 - rebuild ICU (make install) & tools
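The version mismatch described above can be checked before running genrb. This is a sketch, assuming that FractionalUCA.txt carries a header line of the form "[UCA version = x.y.z]". Verify that format against your copy of the file. The helper names and the sample lines are hypothetical.

```python
import re

# Assumed header format; check it against your copy of FractionalUCA.txt.
UCA_VERSION = re.compile(r"\[UCA version\s*=\s*([0-9]+(?:\.[0-9]+)*)\s*\]")


def uca_version(lines):
    """Return the version string from the first '[UCA version = ...]' line, or None."""
    for line in lines:
        match = UCA_VERSION.search(line)
        if match:
            return match.group(1)
    return None


def versions_match(uca, unicode_version):
    """Compare major.minor only, e.g. '6.2.0' against '6.2'."""
    return uca.split(".")[:2] == unicode_version.split(".")[:2]


# Illustrative file content, not copied from the real file.
sample = ["# Fractional UCA Table", "[UCA version = 6.2.0]"]
found = uca_version(sample)
print(found, versions_match(found, "6.2"))
```

If the comparison fails during initial bootstrapping, edit the version in FractionalUCA.txt as described in the step above and rerun genuca.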

3368

3369 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to

3370 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)

3371 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters

3372 - Unicode 6.0..6.2: U+2260, U+226E, U+226F

3373 - nothing new in 6.2, no test file to update

3374

3375 * update Java data files

3376 - refresh just the UCD-related files, just to be safe

3377 - see (ICU4C)/source/data/icu4j-readme.txt

3378 - mkdir /tmp/icu4j

3379 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

3380 output:

3381 ...

3382 Unicode .icu files built to ./out/build/icudt50l

3383 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b

3384 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b

3385 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt

3386 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b

3387 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b"

3388 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/

3389 mkdir -p /tmp/icu4j/main/shared/data

3390 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data

3391 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/

3392 mkdir -p /tmp/icu4j/main/shared/data

3393 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data

3394 make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data'

3395 - copy the big-endian Unicode data files to another location,

3396 separate from the other data files

3397 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll

3398 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr

3399 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b

3400 ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu

3401 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b

3402 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll

3403 ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr

3404 - refresh ICU4J

3405 ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b

3406

3407 * refresh Java test .txt files

3408 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

3409

3410 * UCA

3411

3412 - get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/

3413 - CLDR root files for ICU are in CollationAuxiliary.zip; unpack that

3414 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt

3415 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt

3416 (note removing the underscore before "Rules")

3417 - update (ICU4C)/source/test/testdata/CollationTest_*.txt

3418 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt

3419 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)

3420 - check test file diffs for previously commented-out, known-failing data lines;

3421 probably need to keep those commented out

3422 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani

3423 - run genuca, see command line above

3424 - rebuild ICU4C

3425 - refresh ICU4J collation data:

3426 (subset of instructions above for properties data refresh, except copies all coll/*)

3427 ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

3428 ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll

3429 ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll

3430 ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b

3431 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)

3432 - note on intltest: if collate/UCAConformanceTest fails, then

3433 utility/MultithreadTest/TestCollators will fail as well;

3434 fix the conformance test before looking into the multi-thread test

3435

3436 * test ICU, fix test code where necessary

3437

3438 * When refreshing all of ICU4J data from ICU4C

3439 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

3440 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data

3441 or

3442 - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

3443

3444 *** LayoutEngine script information

3445 - skipped for Unicode 6.2: no new scripts

3446

3447 *** merge the Unicode update branches back onto the trunk

3448 - do not merge the icudata.jar and testdata.jar,

3449 instead rebuild them from merged & tested ICU4C

3450

3451 ---------------------------------------------------------------------------- ***

3452

3453 Future Unicode update

3454

The tools have been simplified since the Unicode 6.1 update. See

3456 - http://site.icu-project.org/design/props/ppucd

3457 - http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972

3458

3459 * Unicode version numbers

3460 - icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates

3461

3462 * file preparation

3463 - ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py:

3464 - ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src

3465 - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders.

3466 - Check test file diffs for previously commented-out, known-failing data lines;

3467 probably need to keep those commented out.

3468

3469 * PropertyValueAliases.txt changes

3470 - Script codes that are in ISO 15924 but not in Unicode are now listed in

3471 preparseucd.py, in the _scripts_only_in_iso15924 variable.

3472 If there are new ISO codes, then add them.

3473 If Unicode adds some of them, then remove them from the .py variable.
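The add/remove rule above amounts to simple set bookkeeping. The Python sketch below uses a hypothetical stand-in for the _scripts_only_in_iso15924 variable. The real variable's type and contents in preparseucd.py may differ. The codes are the ones named elsewhere in this log for ICU 4.8 and Unicode 6.1.

```python
def update_iso_only(iso_only, new_iso_codes, newly_in_unicode):
    """Add newly registered ISO 15924 codes, then drop the ones Unicode now encodes."""
    return (set(iso_only) | set(new_iso_codes)) - set(newly_in_unicode)


# Hypothetical starting point: the ISO-only codes added for ICU 4.8.
before = {"Afak", "Jurc", "Mroo", "Nshu", "Shrd", "Sora", "Takr", "Tang", "Wole"}
# New ISO 15924 codes listed for the Unicode 6.1 update.
new_iso = {"Khoj", "Tirh", "Hluw"}
# Scripts that Unicode 6.1 encoded.
encoded = {"Cakm", "Merc", "Mero", "Plrd", "Shrd", "Sora", "Takr"}

after = update_iso_only(before, new_iso, encoded)
print(sorted(after))
```

Codes such as Cakm that were never in the ISO-only list are simply ignored by the subtraction.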

3474

3475 * UnicodeData.txt changes

3476 - No more manual changes for CJK ranges for algorithmic names;

3477 those are now written to ppucd.txt and genprops reads them from there.

3478

3479 * generate core properties data files (makeprops.sh was deleted)

3480 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src

3481

3482 * no more manual updates of source/data/unidata/norm2/nfkc_cf.txt

3483 - it is now generated by preparseucd.py

3484

3485 * no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt

3486 - it is now generated by preparseucd.py

3487 - make sure that the Unicode data folder passed into preparseucd.py

3488 includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt

3489 (can be in some subfolder)

3490

3491 * generate normalization data files

3492 - ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib

3493 - ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in

3494 - ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata

3495 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt

3496 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt

3497 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt

3498 - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt

3499

3500 * build ICU (make install)

3501 * build Unicode tools using CMake+make

3502

3503 * new way to call genuca (makeuca.sh was deleted)

3504 - ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src

3505

3506 ---------------------------------------------------------------------------- ***

3507

3508 Unicode 6.1 update

3509

3510 *** ICU Trac

3511

3512 - ticket 8995 final update to Unicode 6.1

3513 - ticket 8994 regenerate source/layout/CanonData.cpp

3514

3515 - ticket 8961 support Unicode "Age" value *names*

3516 - ticket 8963 support multiple character name aliases & types

3517

3518 - ticket 8827 "update ICU to Unicode 6.1"

3519 - C++ branches/markus/uni61 at r30864 from trunk at r30843

3520 - Java branches/markus/uni61 at r30865 from trunk at r30863

3521

3522 *** Unicode version numbers

3523 - makedata.mak

3524 - uchar.h

3525 (configure.in & configure: have been modified to extract the version from uchar.h)

3526 - com.ibm.icu.util.VersionInfo

3527 - icutools/unicode/makedefs.sh

3528 + also review & update other definitions in that file,

3529 e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l

3530

3531 *** data files & enums & parser code

3532

3533 * file preparation

3534

3535 ~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed

3536 - This prepares both unidata and testdata files in respective output subfolders.

3537 - Check test file diffs for previously commented-out, known-failing data lines;

3538 probably need to keep those commented out.

3539

3540 * PropertyValueAliases.txt changes

3541 - 11 new block names:

3542 Arabic_Extended_A

3543 Arabic_Mathematical_Alphabetic_Symbols

3544 Chakma

3545 Meetei_Mayek_Extensions

3546 Meroitic_Cursive

3547 Meroitic_Hieroglyphs

3548 Miao

3549 Sharada

3550 Sora_Sompeng

3551 Sundanese_Supplement

3552 Takri

3553 -> add to uchar.h

3554 -> add to UCharacter.UnicodeBlock IDs

3555 Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+)

3556 replace public static final int \1_ID = \2; \3

3557 -> add to UCharacter.UnicodeBlock objects

3558 Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)

3559 replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2

3560 - 1 new Joining_Group (jg) value:

3561 Rohingya_Yeh

3562 -> uchar.h & UCharacter.JoiningGroup

3563 - 2 new Line_Break (lb) values:

3564 CJ=Conditional_Japanese_Starter

3565 HL=Hebrew_Letter

3566 -> uchar.h & UCharacter.LineBreak

3567 - 7 new scripts:

3568 sc ; Cakm ; Chakma

3569 sc ; Merc ; Meroitic_Cursive

3570 sc ; Mero ; Meroitic_Hieroglyphs

3571 sc ; Plrd ; Miao

3572 sc ; Shrd ; Sharada

3573 sc ; Sora ; Sora_Sompeng

3574 sc ; Takr ; Takri

3575 -> remove these from SyntheticPropertyValueAliases.txt

3576 -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()

3577 and in com.ibm.icu.dev.test.lang.TestUScript.java

3578 - 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html

3579 (added 2011-06-21)

3580 Khoj 322 Khojki

3581 Tirh 326 Tirhuta

3582 and another one added 2011-12-09

3583 Hluw 080 Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs)

3584 -> uscript.h

3585 -> com.ibm.icu.lang.UScript

3586 find USCRIPT_([^ ]+) *= ([0-9]+),(.+)

3587 replace public static final int \1 = \2;\3

3588 -> SyntheticPropertyValueAliases.txt

3589 -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()

3590 and in com.ibm.icu.dev.test.lang.TestUScript.java
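The two Eclipse find/replace pairs above for UCharacter.UnicodeBlock can be reproduced with Python's re module. This is a minimal sketch. The sample line, including its numeric value and comment, is a placeholder and not copied from uchar.h. The replacements use \g<1> so that the group number is not run together with the following "_ID".

```python
import re

# Both find patterns from the log.
FIND_ID = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
FIND_OBJ = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")


def block_id_constant(line):
    """uchar.h enum line -> UnicodeBlock *_ID int constant."""
    return FIND_ID.sub(r"public static final int \g<1>_ID = \g<2>; \g<3>", line)


def block_object(line):
    """uchar.h enum line -> UnicodeBlock object declaration."""
    return FIND_OBJ.sub(
        r'public static final UnicodeBlock \g<1> = new UnicodeBlock("\g<1>", \g<1>_ID); \g<2>',
        line)


# Illustrative input; the value and comment are placeholders.
sample = "    UBLOCK_CHAKMA = 212, /*[11100]*/"
id_line = block_id_constant(sample)
obj_line = block_object(sample)
print(id_line)
print(obj_line)
```

Both patterns expect exactly one space around the equals sign and a comment starting with a slash after the comma, so lines formatted differently are left unchanged.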

3591

3592 * UnicodeData.txt changes

3593 - the last Unihan code point changes from U+9FCB to U+9FCC

3594 search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive)

3595 + do change gennames.c

3596 + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java
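The case-insensitive search for 9FC[BC] mentioned above can be scripted to list candidate lines in a source file. This is a sketch. The sample lines are made up to exercise the pattern and are not taken from gennames.c or ucol.cpp.

```python
import re

# Case-insensitive search for the old end (9FCB) and the new end / old limit (9FCC).
PATTERN = re.compile(r"9FC[BC]", re.IGNORECASE)


def find_unihan_limits(lines):
    """Return (line number, text) for each line that mentions 9FCB or 9FCC."""
    return [(n, text) for n, text in enumerate(lines, start=1) if PATTERN.search(text)]


# Made-up source lines, only to exercise the pattern.
sample = [
    "#define CJK_END 0x9fcb",
    "static const UChar32 CJK_LIMIT = 0x9FCC;",
    "#define HANGUL_BASE 0xAC00",
]
hits = find_unihan_limits(sample)
for number, text in hits:
    print(number, text)
```

Each hit still needs a manual look, because an inclusive end becomes 9FCC while an exclusive limit becomes 9FCD.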

3597

3598 * DerivedBidiClass.txt changes

3599 - 2 new default-AL blocks:

3600 # Arabic Extended-A: U+08A0 - U+08FF (was default-R)

3601 # Arabic Mathematical Alphabetic Symbols:

3602 # U+1EE00 - U+1EEFF (was default-R)

3603 - 2 new default-R blocks:

3604 # Meroitic Hieroglyphs:

3605 # U+10980 - U+1099F

3606 # Meroitic Cursive: U+109A0 - U+109FF

3607 -> should be picked up by the explicit data in the file

3608

3609 * NameAliases.txt changes

3610 - from

3611 # Each line has two fields

3612 # First field: Code point

3613 # Second field: Alias

3614 - to

3615 # Each line has three fields, as described here:

3616 #

3617 # First field: Code point

3618 # Second field: Alias

3619 # Third field: Type

- Also, the file format previously allowed multiple aliases per code point,

but only now does it actually provide several, even several of the same type. For example,

3622 FEFF;BYTE ORDER MARK;alternate

3623 FEFF;BOM;abbreviation

3624 FEFF;ZWNBSP;abbreviation

3625 - This breaks our gennames parser, unames.icu data structure, and API.

3626 Fix gennames to only pick up "correction" aliases.

3627 New ticket #8963 for further changes.
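The interim fix of picking up only "correction" aliases can be illustrated with a small parser for the three-field format. This is a Python sketch and not the gennames code. The FEFF lines are the examples from the log, and the 01A2 line is added here as an example of a correction alias.

```python
def correction_aliases(lines):
    """Map code point -> list of aliases whose type is 'correction'."""
    result = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # skip comments and blank lines
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) != 3:
            continue  # not in the three-field format
        code_point, alias, alias_type = fields
        if alias_type == "correction":
            result.setdefault(int(code_point, 16), []).append(alias)
    return result


sample = [
    "# First field: Code point",
    "01A2;LATIN CAPITAL LETTER GHA;correction",
    "FEFF;BYTE ORDER MARK;alternate",
    "FEFF;BOM;abbreviation",
    "FEFF;ZWNBSP;abbreviation",
]
aliases = correction_aliases(sample)
print(aliases)
```

Keeping a list per code point leaves room for the multiple aliases and types that ticket #8963 is meant to support.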

3628

3629 * run genpname/preparse.pl (on Linux)

3630 + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname

3631 + make sure that data.h is writable

3632 + perl preparse.pl ~/svn.icu/trunk/src > out.txt

3633 + preparse.pl shows no errors, out.txt Info and Warning lines look ok

3634

3635 * build ICU (make install)

3636 so that the tools build can pick up the new definitions from the installed header files.

3637 * build Unicode tools (at least genpname) using CMake+make

3638

3639 * run genpname

3640 (builds both pnames.icu and propname_data.h)

3641 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in

3642 - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource

3643

3644 * build ICU (make install)

3645 * build Unicode tools using CMake+make

3646

3647 * update source/data/unidata/norm2/nfkc_cf.txt

3648 - follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt

3649

3650 * update source/data/unidata/norm2/uts46.txt

3651 - download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt

3652 to ~/svn.icu/tools/trunk/src/unicode/py

3653 - adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008".

3654 - ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py

3655 - ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2

3656

3657 * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to

3658 sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)

3659 - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters

3660 - Unicode 6.0..6.1: U+2260, U+226E, U+226F

3661 - nothing new in 6.1, no test file to update

3662

3663 * generate core properties data files

3664 - in initial bootstrapping, change the UCA version

3665 in source/data/unidata/FractionalUCA.txt to match the new Unicode version

3666 - ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld

3667 - rebuild ICU & tools

+ if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,

3669 check if the UCA version in FractionalUCA.txt matches the new Unicode version

3670 (see step above)

3671 - run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm:

3672 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld

3673 - rebuild ICU & tools

3674

3675 * update Java data files

3676 - refresh just the UCD-related files, just to be safe

3677 - see (ICU4C)/source/data/icu4j-readme.txt

3678 - mkdir /tmp/icu4j

3679 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

3680 output:

3681 ...

3682 Unicode .icu files built to ./out/build/icudt49l

3683 mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b

3684 mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b

3685 echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt

3686 LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b

3687 mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b"

3688 jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/

3689 mkdir -p /tmp/icu4j/main/shared/data

3690 cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data

3691 jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/

3692 mkdir -p /tmp/icu4j/main/shared/data

3693 cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data

3694 make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data'

3695 - copy the big-endian Unicode data files to another location,

3696 separate from the other data files

3697 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll

3698 mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr

3699 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b

3700 ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu

3701 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b

3702 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll

3703 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr

3704 - refresh ICU4J

3705 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b

3706

3707 * refresh Java test .txt files

3708 - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

3709

3710 * test ICU so far, fix test code where necessary

3711 - temporarily ignore collation issues that look like UCA/UCD mismatches,

3712 until UCA data is updated

3713

3714 * UCA

3715

3716 - get output from Mark's tools; look in

3717 http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. version>.txt

3718 - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt

3719 - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt

3720 (note removing the underscore before "Rules")

3721 - update (ICU)/source/test/testdata/CollationTest_*.txt

3722 and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt

3723 with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt)

3724 - check test file diffs for previously commented-out, known-failing data lines;

3725 probably need to keep those commented out

3726 - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani

3727 - run makeuca.sh:

3728 ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld

3729 - rebuild ICU4C

3730 - refresh ICU4J collation data:

3731 (subset of instructions above for properties data refresh, except copies all coll/*)

3732 ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

3733 ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll

3734 ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll

3735 ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b

3736 - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging)

3737 - note on intltest: if collate/UCAConformanceTest fails, then

3738 utility/MultithreadTest/TestCollators will fail as well;

3739 fix the conformance test before looking into the multi-thread test

3740

3741 * When refreshing all of ICU4J data from ICU4C

3742 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install

3743 - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data

3744 or

3745 - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder.
  This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp
  in the working directory.
  (It also generates ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@draft" statement.

- diff current <icu>/source/layout files vs. generated ones
  ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout
  review and manually merge desired changes;
  fix gratuitous changes, incorrect @draft and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
- if you just copy the above files, then
  fix mixed line endings, review the diffs as above and restore changes to API tags etc.;
  manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

*** merge the Unicode update branches back onto the trunk
- do not merge the icudata.jar and testdata.jar,
  instead rebuild them from merged & tested ICU4C

---------------------------------------------------------------------------- ***

ICU 4.8 (no Unicode update, just new script codes)

* 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2010-12-21)
    Afak 439 Afaka
    Jurc 510 Jurchen
    Mroo 199 Mro, Mru
    Nshu 499 Nüshu
    Shrd 319 Sharada, Śāradā
    Sora 398 Sora Sompeng
    Takr 321 Takri, Ṭākrī, Ṭāṅkrī
    Tang 520 Tangut
    Wole 480 Woleai
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2;\3
  -> genpname/SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
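The find/replace step above can also be scripted instead of run in an editor.
This is a minimal sketch, not part of the ICU tools: the pattern and the
replacement are the ones listed above, while the function name and the sample
enum line (including its numeric value) are made up for illustration.

```python
import re

# Pattern and replacement copied from the find/replace step in the log.
USCRIPT_LINE = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
JAVA_TEMPLATE = r"public static final int \1 = \2;\3"


def to_java_constant(c_line):
    """Rewrite one uscript.h enum line as a UScript.java constant.

    Lines that do not match the pattern are returned unchanged.
    """
    return USCRIPT_LINE.sub(JAVA_TEMPLATE, c_line)


# Hypothetical input line; the value 147 is only an illustration.
sample = "    USCRIPT_AFAKA = 147,/* Afak */"
print(to_java_constant(sample))
```

Note that the pattern requires some text after the comma (the trailing
comment), so an enum line without a comment would be left as-is.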

* run genpname/preparse.pl (on Linux)
+ cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
+ make sure that data.h is writable
+ perl preparse.pl ~/svn.icu/trunk/src > out.txt
+ preparse.pl shows no errors, out.txt Info and Warning lines look ok

* rebuild Unicode tools (at least genpname) using make
- You might first need to "make install" ICU so that the tools build can pick
  up the new definitions from the installed header files.

* run genpname
  (builds both pnames.icu and propname_data.h)
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource
- rebuild ICU & tools

* run genprops
- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
- ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0
- rebuild ICU & tools

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- copy the big-endian Unicode data files to another location,
  separate from the other data files
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
  ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
  ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b
- refresh ICU4J
  ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b

* should have updated the layout engine script codes but forgot

---------------------------------------------------------------------------- ***

Unicode 6.0 update

*** related ICU Trac tickets

7264 Unicode 6.0 Update

*** Unicode version numbers
- makedata.mak
- uchar.h
  (configure.in & configure: have been modified to extract the version from uchar.h)
- com.ibm.icu.util.VersionInfo

*** data files & enums & parser code

* file preparation

~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed
- This now prepares both unidata and testdata files in respective output subfolders.

* PropertyAliases.txt changes
- new Script_Extensions property defined in the new ScriptExtensions.txt file
  but not listed in PropertyAliases.txt; reported to unicode.org;
  -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt
     scx; Script_Extensions
  -> uchar.h with new UProperty section
  -> com.ibm.icu.lang.UProperty, parallel with uchar.h

* PropertyValueAliases.txt changes
- 12 new block names:
    Alchemical_Symbols
    Bamum_Supplement
    Batak
    Brahmi
    CJK_Unified_Ideographs_Extension_D
    Emoticons
    Ethiopic_Extended_A
    Kana_Supplement
    Mandaic
    Miscellaneous_Symbols_And_Pictographs
    Playing_Cards
    Transport_And_Map_Symbols
  -> add to uchar.h
  -> add to UCharacter.UnicodeBlock
     Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+)
     replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2
- Joining_Group (jg) values:
    Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal which becomes an alias
  -> uchar.h & UCharacter.JoiningGroup
- 3 new scripts:
    sc ; Batk ; Batak
    sc ; Brah ; Brahmi
    sc ; Mand ; Mandaic
  -> remove these from SyntheticPropertyValueAliases.txt
  -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN
  -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html
  (added 2009-11-11..2010-07-18)
    Bass 259 Bassa Vah
    Dupl 755 Duployan shorthand
    Elba 226 Elbasan
    Gran 343 Grantha
    Kpel 436 Kpelle
    Loma 437 Loma
    Mend 438 Mende
    Merc 101 Meroitic Cursive
    Narb 106 Old North Arabian
    Nbat 159 Nabataean
    Palm 126 Palmyrene
    Sind 318 Sindhi
    Wara 262 Warang Citi
  -> uscript.h
  -> com.ibm.icu.lang.UScript
     find USCRIPT_([^ ]+) *= ([0-9]+),(.+)
     replace public static final int \1 = \2;\3
  -> SyntheticPropertyValueAliases.txt
  -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI()
     and in com.ibm.icu.dev.test.lang.TestUScript.java
- ISO 15924 name change
    Mero 100 Meroitic Hieroglyphs (was Meroitic)
  -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC
- property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt
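The Eclipse find/replace listed above for UCharacter.UnicodeBlock differs from
the UScript one: it drops the numeric value and uses the block name twice, once
for the field and once for the "_ID" constant. A minimal sketch of the same
substitution outside Eclipse, assuming a uchar.h line of the shape shown in the
sample (the sample line and its value are illustrative, and the function name
is hypothetical):

```python
import re

# Pattern from the Eclipse step above; the numeric value is matched but not captured.
UBLOCK_LINE = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
# \g<1> is Python's unambiguous spelling of the \1 back-reference.
JAVA_TEMPLATE = (
    r'public static final UnicodeBlock \g<1> = '
    r'new UnicodeBlock("\g<1>", \g<1>_ID); \g<2>'
)


def to_unicode_block(c_line):
    """Rewrite one uchar.h UBLOCK_ line as a UnicodeBlock declaration."""
    return UBLOCK_LINE.sub(JAVA_TEMPLATE, c_line)


# Hypothetical input line; the value 199 is only an illustration.
print(to_unicode_block("    UBLOCK_BATAK = 199, /*[1BC0]*/"))
```

The generated declaration refers to a matching *_ID integer constant, which
still has to be added to UCharacter.UnicodeBlock by hand or by a second pass.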

* UnicodeData.txt changes
- new CJK block:
    2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
    2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
  -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion
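To spot new algorithmic ranges like the one above, the First/Last line pairs in
UnicodeData.txt can be paired up mechanically. The following is a small sketch
under the assumption that each "<..., First>" line is directly matched by a
"<..., Last>" line with the same name; parse_ranges is a hypothetical helper,
not something in the ICU tools.

```python
def parse_ranges(lines):
    """Collect (name, first, last) for <Name, First> / <Name, Last> pairs."""
    ranges = []
    pending = None  # (name, first code point) while waiting for the Last line
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 2:
            continue
        code_point = int(fields[0], 16)
        label = fields[1]
        if label.endswith(", First>"):
            pending = (label[1:-len(", First>")], code_point)
        elif label.endswith(", Last>") and pending is not None:
            name = label[1:-len(", Last>")]
            if name == pending[0]:
                ranges.append((name, pending[1], code_point))
            pending = None
    return ranges


# The two lines quoted in the log entry above.
lines = [
    "2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;",
    "2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;",
]
for name, first, last in parse_ranges(lines):
    print(f"{name}: U+{first:04X}..U+{last:04X}, {last - first + 1} code points")
```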

* build Unicode tools using CMake+make

* run genpname/preparse.pl (on Linux)
+ cd ~/svn.icu/tools/trunk/src/unicode/c/genpname
+ make sure that data.h is writable
+ perl preparse.pl ~/svn.icu/trunk/src > out.txt
+ preparse.pl shows no errors, out.txt Info and Warning lines look ok

* rebuild Unicode tools (at least genpname) using make
- You might first need to "make install" ICU so that the tools build can pick
  up the new definitions from the installed header files.

* run genpname
- ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in
- rebuild ICU & tools

* update source/data/unidata/norm2/nfkc_cf.txt
- follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt

* update source/data/unidata/norm2/uts46.txt
- download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt
  to ~/svn.icu/tools/trunk/src/unicode/py
- adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values
- ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py
- ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2

* update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to
  sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar)
- grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters
- Unicode 6.0: U+2260, U+226E, U+226F
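The grep step above needs a manual filter for "non-ASCII"; a short script can
do both at once. This is a sketch that assumes the usual layout of
IdnaMappingTable.txt lines ("code point or range ; status ; ... # comment").
The sample lines below are illustrative rather than copied from the 6.0.0
file, and non_ascii_std3_valid is a hypothetical helper name.

```python
def non_ascii_std3_valid(lines):
    """Return (start, end) ranges above U+007F with status disallowed_STD3_valid."""
    hits = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        if end > 0x7F:
            hits.append((start, end))
    return hits


# Illustrative lines in the assumed format.
sample = [
    "003D          ; disallowed_STD3_valid    # EQUALS SIGN",
    "00E0..00F6    ; valid                    # LATIN SMALL LETTER A WITH GRAVE..",
    "2260          ; disallowed_STD3_valid    # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid    # NOT LESS-THAN..NOT GREATER-THAN",
]
for start, end in non_ascii_std3_valid(sample):
    print(f"U+{start:04X}" if start == end else f"U+{start:04X}..U+{end:04X}")
```

With the real file as input (for example via open(...).readlines()), the
result should be compared against the characters already covered in
uts46test.cpp and UTS46Test.java.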

* generate core properties data files
- ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools
- run makeuca.sh so that genuca picks up the new nfc.nrm:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU & tools

* implement new Script_Extensions property (provisional)
- parser & generator: genprops & uprops.icu
- uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp
- UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java

* switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2
- (one-time change)
- genbidi/gencase/genprops tools changes
- re-run makeprops.sh (see above)
- UCharacterProperty.java, UCharacterTypeIterator.java,
  UBiDiProps.java, UCaseProps.java, and several others with minor changes;
  UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java

* update Java data files
- refresh just the UCD-related files, just to be safe
- see (ICU4C)/source/data/icu4j-readme.txt
- mkdir /tmp/icu4j
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  output:
    ...
    Unicode .icu files built to ./out/build/icudt45l
    mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b
    echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt
    LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b
    jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b
    mkdir -p /tmp/icu4j/main/shared/data
    cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data
- copy the big-endian Unicode data files to another location,
  separate from the other data files
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
  ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr
- refresh ICU4J
  ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
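The mkdir/cp/rm sequence above selects a fixed subset of files before the
"jar uf" step. If the selection has to be repeated often, it could be scripted
roughly as follows. This is only a sketch: stage_unicode_data is a
hypothetical helper, and the caller supplies the source directory (the
icudt45b folder under data/out/icu4j) and the staging directory under
/tmp/icu4j.

```python
import shutil
from pathlib import Path


def stage_unicode_data(src, dst):
    """Copy only the Unicode-related data files from src to dst.

    Mirrors the cp/rm steps above: top-level *.icu except cnvalias.icu,
    top-level *.nrm, coll/*.icu, and every file in brkitr/.
    Returns the copied paths relative to src, sorted.
    """
    src, dst = Path(src), Path(dst)
    selected = [p for p in src.glob("*.icu") if p.name != "cnvalias.icu"]
    selected += list(src.glob("*.nrm"))
    selected += list(src.glob("coll/*.icu"))
    selected += [p for p in src.glob("brkitr/*") if p.is_file()]
    copied = []
    for path in selected:
        rel = path.relative_to(src)
        target = dst / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
        copied.append(rel.as_posix())
    return sorted(copied)
```

Skipping cnvalias.icu during the copy has the same effect as the copy-then-rm
pair in the log; the jar update itself is left to the existing command.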

* refresh Java test .txt files
- copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode

* un-hardcode normalization skippable (NF*_Inert) test data
- removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools

* copy updated break iterator test files
- now handled by early ucdcopy.py and
  copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata
  (old instructions:
  copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt
  to ~/svn.icu/trunk/src/source/test/testdata)
- they are not used in ICU4J

* UCA

- get output from Mark's tools; look in
    http://www.unicode.org/~book/incoming/mark/uca6.0.0/
    http://www.macchiato.com/unicode/utc/additional-uca-files
    http://www.unicode.org/Public/UCA/6.0.0/
    http://www.unicode.org/~mdavis/uca/
- update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt
- update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt
- update Han-implicit ranges for new CJK extensions:
  swapCJK() in ucol.cpp & ImplicitCEGenerator.java
- genuca: allow bytes 02 for U+FFFE, new merge-sort character;
  do not add it into invuca so that tailoring primary-after an ignorable works
- genuca: permit space between [variable top] bytes
- ucol.cpp: treat noncharacters like unassigned rather than ignorable
- run makeuca.sh:
  ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld
- rebuild ICU4C
- refresh ICU4J collation data:
  (subset of instructions above for properties data refresh, except copies all coll/*)
  ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
  mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll
  ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b
- update (ICU)/source/test/testdata/CollationTest_*.txt
  and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt
  with output from Mark's Unicode tools
- run all tests with the *_SHORT.txt or the full files (the full ones have comments)
- note on intltest: if collate/UCAConformanceTest fails, then
  utility/MultithreadTest/TestCollators will fail as well;
  fix the conformance test before looking into the multi-thread test

* When refreshing all of ICU4J data from ICU4C
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install
- cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data
or
- ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install

*** LayoutEngine script information

(For details see the Unicode 5.2 change log below.)

* Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h,
  ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates
  ScriptRunData.cpp, which is no longer needed.)

  The generated files have a current copyright date and "@draft" statement.

* copy the above files into <icu>/source/layout, replacing the old files.
* fix mixed line endings
* review the diffs and fix incorrect @draft and missing aliases;
  Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc.
* manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h

---------------------------------------------------------------------------- ***

Unicode 5.2 update

*** related ICU Trac tickets

7084 Unicode 5.2

7167 verify collation bytes
7235 Java test NAME_ALIAS
7236 Java DerivedCoreProperties.txt test
7237 Java BidiTest.txt
7238 UTrie2 in core unidata
7239 test for tailoring gaps
7240 Java fix CollationMiscTest
7243 update layout engine for Unicode 5.2

*** Unicode version numbers
- makedata.mak
- uchar.h
- configure.in & configure
- update ucdVersion in gennames.c if an algorithmic range changes

*** data files & enums & parser code

* file preparation

python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata
- includes finding files regardless of version numbers,
  copying them, and performing the equivalent processing of the
  ucdstrip and ucdmerge tools on the desired set of files
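"Finding files regardless of version numbers" comes down to mapping a
versioned beta file name to its plain name. The sketch below shows one way
that could work; it is not taken from ucdcopy.py, and the naming scheme
("<Name>-<version>[d<draft>].<ext>") and the helper name are assumptions made
for illustration.

```python
import re

# Assumed scheme: base name, "-", dotted version, optional draft suffix, extension.
_VERSIONED = re.compile(
    r"^([A-Za-z0-9]+)-[0-9]+(?:\.[0-9]+)*(?:d[0-9]+)?(\.[a-z]+)$"
)


def unversioned_name(filename):
    """Strip a version suffix such as "-5.2.0d6"; leave other names alone."""
    match = _VERSIONED.match(filename)
    if match is None:
        return filename
    return match.group(1) + match.group(2)


for name in ["UnicodeData-5.2.0d6.txt", "BidiTest-5.2.0.txt", "PropertyAliases.txt"]:
    print(name, "->", unversioned_name(name))
```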

* notes on changes
- PropertyAliases.txt
  moved from numeric to enumerated:
    ccc ; Canonical_Combining_Class
  new string properties:
    NFKC_CF ; NFKC_Casefold
    Name_Alias; Name_Alias
  new binary properties: