-# Copyright (c) 2001-2006 International Business Machines
+# Copyright (c) 2001-2015 International Business Machines
# Corporation and others. All Rights Reserved.
# RBBI Test Data
# \ Escape. Normal ICU unescape applied.
# \ at end of line -> Line Continuation. Remove both the backslash and the new line
+# In ICU4C, this test data is run by intltest, rbbi/RBBITest/TestExtended.
+# In ICU4J, this test data is run by com.ibm.icu.dev.test.rbbi.RBBITestExtended
+# There are two copies of this file in the source repository,
+# [ICU4C] source/test/testdata/rbbitst.txt
+# [ICU4J] main/tests/core/src/com/ibm/icu/dev/test/rbbi/rbbitst.txt
+# ICU4C's copy is the master. If any changes are made to ICU4J's copy, make sure they
+# are merged back into ICU4C's copy of the file, lest they get overwritten later.
+# TODO: figure out how to have a single copy of the file for use by both C and Java.
# Temp debugging tests
-# to test for bug #4097920
-<data>•dog,cat,mouse •(one)•(two)\n<100></data>
# Hindi combining chars. (An old test)
-<data>•भ••ा•\u0930•\u0924• •\u0938\u0941\u0902•\u0926•\u0930•
+# TODO: Update these tests for Unicode 5.1 Extended Grapheme clusters
+#<data>•भ••ा•\u0930•\u0924• •\u0938\u0941\u0902•\u0926•\u0930•
-# Bug 1587. Tamil. \u0baa\u0bc1 should be two separate characters, even though
-# Hyangmi would perfer that it be one.
+# Bug 1587. Tamil. \u0baa\u0bc1 is an Extended Grapheme Cluster
# Regression test for bug 1889
# Treat Japanese Half Width voicing marks as combining
+# Extended Grapheme Cluster Tests
+# Plain Vanilla grapheme clusters
+#<data>•a\u0301\u0302• •b\u0303\u0304•</data>
+# Assorted Hindi combining marks
+#<data>•\u0904\u0903• •\u0937\u093E• •\u0904\u093F• •\u0937\u0940• •\u0937\u0949• •\u0937\u094A• •\u0937\u094B• •\u0937\u094C•</data>
+# Thai Clusters
+# $Prepend $Extend* $PrependBase $Extend*;
+#<data>•\u0e40\u0e01•\u0e44\u0301\u0e23\u0302\u0303•\u0e40•\u0e40\u0e02•\u0e02• •</data>
<data>•\uc5f0\ud569<200> •\uc7a5\ub85c\uad50\ud68c<200> •\u1109\u1161\u11bc\u1112\u1161\u11bc<200> •\u1112\u1161\u11ab\u110b\u1175\u11ab<200> •Hello<200>,• •how<200> •are<200> •you<200> •</data>
+<data>•Hello<200>,• •how<200> •are<200> •you<200> •\uc5f0\ud569<200> •\uc7a5\ub85c\uad50\ud68c<200> •\u1109\u1161\u11bc\u1112\u1161\u11bc<200> •\u1112\u1161\u11ab\u110b\u1175\u11ab<200> •</data>
# Words containing non-BMP letters
<data>•abc<200>\U0001D800•def<200>\U0001D3FF• •</data>
# Hiragana & Katakana stay together, but separate from each other and Latin.
+# *** what to do about theoretical combos of chars? i.e. hiragana + accent
+# test normalization/dictionary handling of halfwidth katakana: same dictionary phrase in fullwidth and halfwidth
+# more Japanese tests
+# TODO: some script=common characters in the Hiragana and the Katakana block may not be treated correctly
+# (was formerly true for U+30FC); need to check and fix if so.
+#<data>•どー<400>せ<400>日本語<400>を<400>勉強<400>する<400>理由<400>について<400> •て<400>こと<400>は<400>我<400>でも<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data>
+<data>•日本語<400>を<400>勉強<400>する<400>理由<400>について<400> •て<400>こと<400>は<400>我<400>でも<400>知<400>ら<400>も<400>い<400>こと<400>なん<400>だ<400>。•</data>
+# Testing of word boundary for dictionary word containing both kanji and kana
+# Testing of Chinese segmentation (taken from a Chinese news article)
# Words with interior formatting characters
# to test for bug #4097779
<data>•aa\N{COMBINING GRAVE ACCENT}a<200> •</data>
+# fullwidth numeric, midletter characters etc should be treated like their halfwidth counterparts
+# <data>•ISN'T<200> •19<100>日<400></data>
+# why was this added with the dbbi stuff?
# to test for bug #4098467
# What follows is a string of Korean characters (I found it in the Yellow Pages
# precomposed syllables...
<data>•\uc0c1\ud56d<200> •\ud55c\uc778<200> •\uc5f0\ud569<200> •\uc7a5\ub85c\uad50\ud68c<200> •\u1109\u1161\u11bc\u1112\u1161\u11bc<200> •\u1112\u1161\u11ab\u110b\u1175\u11ab<200> •\u110b\u1167\u11ab\u1112\u1161\u11b8<200> •\u110c\u1161\u11bc\u1105\u1169\u1100\u116d\u1112\u116c<200> •</data>
-<data>•abc<200>\u4e01<400>\u4e02<400>\u3005<200>\u4e03<400>\u4e03<400>abc<200> •</data>
+# more Korean tests (Jamo not tested here, not counted as dictionary characters)
+# Disable them now because we don't include a Korean dictionary.
+#<data>•\ud604\uc7ac<200>\ub294<200> •\uac80\ucc30<200>\uc774<200> •\ubd84\uc2dd<200>\ud68c\uacc4<200>\ubb38\uc81c<200>\ub97c<200> •\uc870\uc0ac<200>\ud560<200> •\uac00\ub2a5\uc131<200>\uc740<200> •\uc5c6\ub2e4<200>\u002e•</data>
+<data>•abc<200>\u4e01<400>\u4e02<400>\u3005<400>\u4e03\u4e03<400>abc<200> •</data>
# Try some words from other scripts.
<data>•A\uff9e\uff9fBC<200> •1\uff9e\uff9f23<100></data>
+# User guide example:
+<data>•Parlez<200>-•vous<200> •français<200> •?•</data>
<data>• •\uF8FF\u2028<100>\uF8FF•</data>
<data>• \u200B\u2028<100>\u200B•</data>
+# User Guide example
+<data>•Parlez-•vous •français ?•</data>
# Old Line Break Test data. Originally located in RBBITest::TestDefaultRuleBasedLineIteration()
<data>•\uc0c1•\ud56d •\ud55c•\uc778 •\uc5f0•\ud569 •\uc7a5•\ub85c•\uad50•\ud68c•</data>
# conjoining jamo...
-# TODO: rules update needed
-#<data>•\u1109\u1161\u11bc•\u1112\u1161\u11bc •\u1112\u1161\u11ab•\u110b\u1175\u11ab #•\u110b\u1167\u11ab•\u1112\u1161\u11b8 •\u110c\u1161\u11bc•\u1105\u1169•\u1100\u116d•\u1112\u116c•</data>
+<data>•\u1109\u1161\u11bc•\u1112\u1161\u11bc •\u1112\u1161\u11ab•\u110b\u1175\u11ab •\u110b\u1167\u11ab•\u1112\u1161\u11b8 •\u110c\u1161\u11bc•\u1105\u1169•\u1100\u116d•\u1112\u116c•</data>
# to test for bug #4117554: Fullwidth .!? should be treated as postJwrd
# Surrogate line break tests.
-<data>•\u4e01•\ud840\udc01•\u4e02•abc •\ue000 •\udb80\udc01•</data>
+<data>•\u4e01•\ud840\udc01•\u4e02•abc •\ue000 •\udb80\udc01•</data> #This line and the following are equivalent.
+<data>•\u4e01•\U00020001•\u4e02•abc •\ue000 •\U000f0001•</data>
# Regression for bug 836
+# Note: Unicode 5.1 changed this behavior
+# Unicode 5.2 changed it again; there is no break following the '('
<data>•AAA(AAA •</data>
# Try some words from other scripts.
<data>•ΑΒΓ •БВГ •אבג֓ •ابت •١٢٣ •\u10A0\u10A1\u10A2 •ABC •</data>
+# ticket #4853: unpaired surrogates should behave like AL
+# Regression tests for failures that originally came from the monkey test.
+# Monkey test failure lines can, with slight reformatting, be copied into this section
+# as test cases. The error display from here is more informative.
+# Test for #10176 (in root)
+<data>•abc/•s •def•</data>
+<data>•abc/\u05D9 •def•</data>
+<data>•\u05E7\u05D7/\u05D9 •\u05DE\u05E2\u05D9\u05DC•</data>
+<data>•\u05D3\u05E8\u05D5\u05E9\u05D9\u05DD •\u05E9\u05D7\u05E7\u05E0\u05D9\u05DD/\u05D9\u05D5\u05EA•</data>
<data>•123 •Start •with •a •number.•</data>
<data>•'•start •with •a •case-•ignorable •cha'r'a'cter•</data>
+<data>•' '' •start •with •case-•ignorable & •case-•insensitive •cha'r'a'cter•</data>
+<data>• ''•aaa' •bbb '•ccc' '•ddd''' '''•eee '''•fff''' •ggg ''•</data>
+# Note: apostrophe is case-ignorable. space is not cased.
+# Test data originally from http://bugs.icu-project.org/trac/search?q=r30327
+<data>•กู<200> •กิน<200>กุ้ง<200> •ปิ้่<200>งอ<200>ยู่<200>ใน<200>ถ้ำ<200></data>
+# Data originally from intltest RBBITest::TestThaiLineBreak()
+# \u0e2f-- the Thai paiyannoi character-- isn't a letter. It's a symbol that
+# represents elided letters at the end of a long word. It should be bound to
+# the end of the word and not treated as an independent punctuation mark.
+# the one time where the paiyannoi occurs somewhere other than at the end
+# of a word is in the Thai abbreviation for "etc.", which both begins and
+# ends with a paiyannoi
+# Data originally from RBBITest::TestMixedThaiLineBreak()
+# @suwit -- Test Arabic numerals, Thai numerals, Punctuation and English characters start
+\u0E1E\u0E38\u0E17\u0E18\u0E28\u0E31\u0E01\u0E23\u0E32\u0E0A •\
+2545 •\
+\u0E23\u0E2D\u0E1A •\
+\"\u0E52\u0E52\u0E50 •\
+\u0E1b\u0E35\" •\
+\u0E23\u0E31\u0E15\u0E19\u0E42\u0E01\u0E2A\u0E34\u0E19\u0E17\u0E23\u0E4C •\
+\u0E2B\u0E23\u0E37\u0E2D •\
+# Data originally from RBBITest::TestMaiyamok()
+# The Thai maiyamok character is a shorthand symbol that means "repeat the previous
+# word". Instead of appearing as a word unto itself, however, it's kept together
+# with the word before it.
+# Test for #10296
+# Test for #10593
+# Test for city names #10691
+# Test for #10630, #10631
+# Test for #11019
+# Lao Tests
+<locale en>
+# Basic check for #7647
+# Burmese/Myanmar Tests
+<locale en>
+# Basic sanity check for #10326 (some text from http://www.unicode.org/udhr/d/udhr_mya.txt)
+<data>•လူ•တိုင်း•သည် •တူညီ •လွတ်လပ်•သော •ဂုဏ်•သိ•က္•ခါ•ဖြ•င့် •လည်းကောင်း၊ •</data>
+<data>•တူညီ•လွတ်လပ်•သော •အ•ခွ•င့်•အရေး•များ•ဖြ•င့် •လည်းကောင်း၊ •မွေး•ဖွား•လာ•သူများ •ဖြစ်သည်။•</data>
+<data>•ထို•သူ•တို့၌ •ပိုင်းခြား •ဝေဖန်•တတ်•သော •ဉာဏ်•နှ•င့် •ကျ•င့်•ဝတ် •သိတတ်•သော •စိတ်•တို့•ရှိ•ကြ၍ •</data>
+<data>•ထို•သူ•တို့သည် •အချင်းချင်း •မေတ္တာ•ထား၍ •ဆက်ဆံ•ကျ•င့်•သုံး•</data>
+# Khmer Tests
+# Test data originally from http://bugs.icu-project.org/trac/search?q=r30327
+# from the file testdata/wordsegments.txt
+<locale en>
# Jitterbug 3671 Test Case
+# Tailored (locale specific) breaking.
+# Japanese line break tailoring test
+<locale ja>
+<locale en>
+# The following data was originally in RBBITest::TestJapaneseWordBreak()
+<locale ja>
+# UBreakIteratorType UBRK_WORD, Locale "ja"
+# Don't break in runs of hiragana or runs of ideograph, where the latter includes \u3005 \u3007 \u303B (cldrbug #2009).
+# \u79C1\u9054\u306B\u4E00\u3007\u3007\u3007\u306E\u30B3\u30F3\u30D4\u30E5\u30FC\u30BF\u304C\u3042\u308B\u3002\u5948\u3005\u306F\u30EF\u30FC\u30C9\u3067\u3042\u308B\u3002
+# modified to work with dbbi code - should verify
+<locale ja>
+# Test for #10176 (in ja)
+<data>•abc/•s •def•</data>
+<data>•abc/\u05D9 •def•</data>
+<data>•\u05E7\u05D7/\u05D9 •\u05DE\u05E2\u05D9\u05DC•</data>
+<data>•\u05D3\u05E8\u05D5\u05E9\u05D9\u05DD •\u05E9\u05D7\u05E7\u05E0\u05D9\u05DD/\u05D9\u05D5\u05EA•</data>
+<locale root>
+# The following test is for #10300
+# The following test is for #10571
+# UBreakIteratorType UBRK_SENTENCE, Locale "el"
+# Add break after Greek question mark (cldrbug #2069).
+# "\u0391\u03B2, \u03B3\u03B4; \u0395 \u03B6\u03B7\u037E \u0398 \u03B9\u03BA. "
+# "\u039B\u03BC \u03BD\u03BE! \u039F\u03C0, \u03A1\u03C2? \u03A3"
+# which is "Αβ, γδ; Ε ζη; Θ ικ. Λμ νξ! Οπ, Ρς? Σ"
+<locale root>
+<data>•Αβ, γδ; Ε ζη; Θ ικ. •Λμ νξ! •Οπ, Ρς? •Σ<100></data>
+<locale el>
+<data>•Αβ, γδ; •Ε ζη; •Θ ικ. •Λμ νξ! •Οπ, Ρς? •Σ<100></data>
+# UBreakIteratorType UBRK_WORD, Locale "en_US_POSIX"
+# Words don't include colon or period (cldrbug #1969).
+<locale en_US>
+<data>•Can't<200> •have<200> •breaks<200> •in<200> •xx:yy<200> •or<200> •struct.field<200> \
+•for<200> •CS<200>-•types<200>.•</data>
+<data>•\uFF92\uFF76\uFF9E<400> •</data>
+<locale en_US_POSIX>
+<data>•Can't<200> •have<200> •breaks<200> •in<200> •xx<200>:•yy<200> •or<200> •struct<200>.•field<200> \
+•for<200> •CS<200>-•types<200>.•</data>
+<data>•\uFF92\uFF76\uFF9E<400> •</data>
+# UBreakIteratorType UBRK_CHARACTER, Locale "th"
+# Clusters should not include spacing Thai/Lao vowels (prefix or postfix), except for [SARA] AM (cldrbug #2161).
+# Update: As of Unicode 6.1 root has same behavior as th for this.
+# "\u0E01\u0E23\u0E30\u0E17\u0E48\u0E2D\u0E21\u0E23\u0E08\u0E19\u0E32 "
+# "(\u0E2A\u0E38\u0E0A\u0E32\u0E15\u0E34-\u0E08\u0E38\u0E11\u0E32\u0E21\u0E32\u0E28) "
+# "\u0E40\u0E14\u0E47\u0E01\u0E21\u0E35\u0E1B\u0E31\u0E0D\u0E2B\u0E32 "
+# which is "กระท่อมรจนา (สุชาติ-จุฑามาศ) เด็กมีปัญหา "
+<locale th>
+<data>•\u0E01•\u0E23•\u0E30•\u0E17\u0E48•\u0E2D•\u0E21•\u0E23•\u0E08•\u0E19•\u0E32• •\
+(•\u0E2A\u0E38•\u0E0A•\u0E32•\u0E15\u0E34•-•\u0E08\u0E38•\u0E11•\u0E32•\u0E21•\u0E32•\u0E28•)• •\
+\u0E40•\u0E14\u0E47•\u0E01•\u0E21\u0E35•\u0E1B\u0E31•\u0E0D•\u0E2B•\u0E32• •</data>
+# Finnish line breaking
+# These rules deal with hyphens when there is a space on the leading side.
+# There should be a break opportunity between the space and the hyphen, and not after the hyphen.
+# See CLDR ticket 3029.
+# See ICU ticket 8151
+<locale root>
+<data>•abc •- •def •abc •-•def •abc- •def •abc-•def•</data> # With ASCII hyphen
+<data>•abc •‐ •def •abc •‐•def •abc‐ •def •abc‐•def•</data> # With Unicode u2010 hyphen
+<locale fi>
+# TODO: problems with Finnish line break rules cause these two lines to fail.
+#<data>•abc •- •def •abc •-def •abc- •def •abc-•def•</data> # With ASCII hyphen
+#<data>•abc •‐ •def •abc •‐def •abc‐ •def •abc‐•def•</data> # With Unicode u2010 hyphen
+<data>•abc •- •def •abc •-def •abc- •def •</data> # With ASCII hyphen
+<data>•abc •‐ •def •abc •‐def •abc‐ •def •</data> # With Unicode u2010 hyphen
+# Test for #10176 (in fi)
+<data>•abc/•s •def•</data>
+<data>•abc/\u05D9 •def•</data>
+<data>•\u05E7\u05D7/\u05D9 •\u05DE\u05E2\u05D9\u05DC•</data>
+<data>•\u05D3\u05E8\u05D5\u05E9\u05D9\u05DD •\u05E9\u05D7\u05E7\u05E0\u05D9\u05DD/\u05D9\u05D5\u05EA•</data>
+# Test CSS line break variants: strict, normal, loose
+<locale ja@lb=strict>
+# •no brk before 3063 •no brk before 301C•no brk btw 2026 •no brk before FF01•
+<locale ja@lb=normal>
+# •brk OK before 3063 •brk OK before 301C •no brk btw 2026 •no brk before FF01•
+<locale ja@lb=loose>
+# •brk OK before 3063 •brk OK before 301C •brk OK btw 2026 •brk OK before FF01•
+<locale en@lb=strict>
+# •no brk before 3063 •no brk before 301C•no brk btw 2026 •no brk before FF01•
+<locale en@lb=normal>
+# •brk OK before 3063 •no brk before 301C •no brk btw 2026 •no brk before FF01•
+<locale en@lb=loose>
+# •brk OK before 3063 •no brk before 301C •brk OK btw 2026 •no brk before FF01•
+# Test Apple early change of lb class for 22EF
+<locale en>
+# Test Apple early change of cjdict
+<locale en>
+<data>•ジョージア<400> •</data>
+<data>•主场<400>客场<400>干练<400>条码<400>杯具<400>温婉<400>猕猴桃<400>肌肤<400>黑头<400>话唠<400>话痨<400> •</data>
+# Test Apple early change of thaidict
+<locale th>
+# Test Apple breaks for emoji clusters (same for all locales and break types)
+<locale root>
+# woman zwj woman zwj girl zwj girl, woman/fitz-1-2 zwj woman/fitz-4 zwj boy/fitz-6
+# woman zwj, baby/fitz-3, older_woman/fitz-5, runner/fitz-4, raised_fist/fitz-3, fuel_pump, fitz-3
+# man zwj hvy_blk_heart zwj man, woman, man zwj hvy_blk_heart esel zwj man, woman
+# woman zwj hvy_blk_heart/esel zwj kiss_mark zwj woman, man
+# victory_hand esel, victory_hand/esel/fitz-1-2, victory_hand/fitz-1-2, rowboat/fitz-4, vulcan_salute/fitz-5, space,
+# writing_hand fitz-1-2, splayed_hand/fitz-3, middle_finger/fitz-4, sign_of_horns/fitz-5, eye zwj left_speech_bubble, space
+# flags1 AE AF AL AM AO AR AT
+# flags2 AU AZ BA BD BE BF BG
+# flags3 BH BJ BN BO BR BS BT
+# flags4 BW BY BZ CA CD CF CG
+# flags5 CH CI CL CM CN CO CR
+# flags6 CU CV CY CZ DE DJ DK
+# flags7 DM DO DZ EC EE EG ER
+# flags8 ES ET FI FJ FR GA GB
+# flags9 GE GH GM GN GR GT GW
+# flags10 GY HK HN HR HT HU ID
+# flags11 IE IL IN IQ IR IS IT
+# flags12 JM JO JP KE KG KH KR
+# flags13 MX MY NL NO PL PT
+# flags14 RO RU SA SE SK TH TR
+# flags15 UA US VN XK ZW
+# flagsX1 ES ES ES SE SE SE
+# flagsX2 GB GB GB BG BG BG
+# flagsXtnd AE AF AL AM AO AR
+# woman zwj woman zwj girl zwj girl, woman/fitz-1-2 zwj woman/fitz-4 zwj boy/fitz-6
+# woman zwj, baby/fitz-3, older_woman/fitz-5, runner/fitz-4, raised_fist/fitz-3, fuel_pump, fitz-3
+# man zwj hvy_blk_heart zwj man, woman, man zwj hvy_blk_heart esel zwj man, woman
+# woman zwj hvy_blk_heart esel zwj kiss_mark zwj woman, man
+# victory_hand esel, victory_hand/esel/fitz-1-2, victory_hand/fitz-1-2, rowboat/fitz-4, vulcan_salute/fitz-5, space,
+# writing_hand fitz-1-2, splayed_hand/fitz-3, middle_finger/fitz-4, sign_of_horns/fitz-5, eye zwj left_speech_bubble, space
+# flags1 AE AF AL AM AO AR AT
+# flags2 AU AZ BA BD BE BF BG
+# flags3 BH BJ BN BO BR BS BT
+# flags4 BW BY BZ CA CD CF CG
+# flags5 CH CI CL CM CN CO CR
+# flags6 CU CV CY CZ DE DJ DK
+# flags7 DM DO DZ EC EE EG ER
+# flags8 ES ET FI FJ FR GA GB
+# flags9 GE GH GM GN GR GT GW
+# flags10 GY HK HN HR HT HU ID
+# flags11 IE IL IN IQ IR IS IT
+# flags12 JM JO JP KE KG KH KR
+# flags13 MX MY NL NO PL PT
+# flags14 RO RU SA SE SK TH TR
+# flags15 UA US VN XK ZW
+# flagsX1 ES ES ES SE SE SE
+# flagsX2 GB GB GB BG BG BG
+# flagsXtnd AE AF AL AM AO AR
+# woman zwj woman zwj girl zwj girl # (line, skip this for now, need safe rules and we don't generate it:) woman/fitz-1-2 zwj woman/fitz-4 zwj boy/fitz-6
+# woman zwj, baby/fitz-3, older_woman/fitz-5, runner/fitz-4, raised_fist/fitz-3, fuel_pump, fitz-3
+# man zwj hvy_blk_heart zwj man, woman, man zwj hvy_blk_heart esel zwj man, woman
+# woman zwj hvy_blk_heart esel zwj kiss_mark zwj woman, man
+# victory_hand esel, victory_hand/esel/fitz-1-2, victory_hand/fitz-1-2, rowboat/fitz-4, vulcan_salute/fitz-5 space,
+# writing_hand fitz-1-2, splayed_hand/fitz-3, middle_finger/fitz-4, sign_of_horns/fitz-5, eye zwj left_speech_bubble, space
+# no special flags handling for line
+<locale ja@lb=loose>
+# woman zwj woman zwj girl zwj girl # (line, skip this for now, need safe rules and we don't generate it:) woman/fitz-1-2 zwj woman/fitz-4 zwj boy/fitz-6
+# woman zwj, baby/fitz-3, older_woman/fitz-5, runner/fitz-4, raised_fist/fitz-3, fuel_pump, fitz-3
+# man zwj hvy_blk_heart zwj man, woman, man zwj hvy_blk_heart esel zwj man, woman
+# woman zwj hvy_blk_heart esel zwj kiss_mark zwj woman, man
+# victory_hand esel, victory_hand/esel/fitz-1-2, victory_hand/fitz-1-2, rowboat/fitz-4, vulcan_salute/fitz-5 space,
+# writing_hand fitz-1-2, splayed_hand/fitz-3, middle_finger/fitz-4, sign_of_horns/fitz-5, eye zwj left_speech_bubble, space
+# no special flags handling for line