Commit | Line | Data |
---|---|---|
2ca993e8 | 1 | file: testdata/break_rules/readme.txt |
f3c0d7a5 A |
2 | Copyright (C) 2016 and later: Unicode, Inc. and others. |
3 | License & terms of use: http://www.unicode.org/copyright.html#License | |
4 | ||
2ca993e8 A |
5 | Copyright (c) 2015-2016, International Business Machines Corporation and others. All Rights Reserved. |
6 | ||
7 | This directory contains the break iterator reference rule files used by intltest rbbi/RBBIMonkeyTest/testMonkey. | |
0f5d89e8 | 8 | The rules in this directory track the boundary rules from Unicode UAX 14 and 29. They are interpreted |
2ca993e8 A |
9 | to provide an expected set of boundary positions to compare with the results from ICU break iteration. |
10 | ||
0f5d89e8 A |
11 | ICU4J also includes copies of the test reference rules, located in the directory |
12 | main/tests/core/src/com/ibm/icu/dev/test/rbbi/break_rules/ | |
13 | The copies should be kept synchronized; there should be no differences. | |
14 | ||
2ca993e8 | 15 | Each set of reference break rules lives in a separate file. |
0f5d89e8 | 16 | The list of rule files to run by default is hard-coded into the test code, in rbbimonkeytest.cpp. |
2ca993e8 A |
17 | |
18 | Each test file includes | |
0f5d89e8 | 19 | - The type of ICU break iterator to create (word, line, sentence, etc.) |
2ca993e8 A |
20 | - The locale to use |
21 | - Character Class definitions | |
22 | - Rule definitions | |
23 | ||
24 | To Do | |
0f5d89e8 | 25 | - Extend the syntax to support rule tailoring. |
2ca993e8 A |
26 | |
27 | ||
0f5d89e8 | 28 | Character Class Definition: |
2ca993e8 A |
29 | name = set_regular_expression; |
30 | ||
31 | Rule Definition: | |
32 | rule_regular_expression; | |
33 | ||
34 | name: | |
35 | [A-Za-z_][A-Za-z0-9_]* | |
36 | ||
37 | set_regular_expression: | |
38 | The intersection of an ICU regular expression [set] expression and a UnicodeSet pattern. | |
39 | (They are mostly the same) | |
40 | May include previously defined set names, which are logically expanded in-place. | |
41 | ||
0f5d89e8 | 42 | rule_regular_expression: |
2ca993e8 A |
43 | An ICU Regular Expression. |
44 | May include set names, which are logically expanded in-place. | |
45 | May include a '÷', which defines a boundary position. | |
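The in-place expansion of set names described above can be illustrated with a short Python sketch. It is not the parser used by the monkey test: `expand_names` and the sample names `Letter`, `Digit` and `AlphaNum` are hypothetical, and the sketch assumes that every identifier matching the name pattern and already defined is meant as a set reference. A real parser would also have to skip backslash escapes and property names such as those inside `\p{...}`.

```python
import re

# Name syntax taken from the readme: [A-Za-z_][A-Za-z0-9_]*
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def expand_names(expression, definitions):
    """Replace each previously defined set name with its definition.

    Simplification (assumption): identifiers that are not defined are left
    untouched, and no attempt is made to skip escapes or property names.
    """
    def substitute(match):
        word = match.group(0)
        return definitions.get(word, word)
    return NAME.sub(substitute, expression)

# Hypothetical character class definitions, processed in file order.
definitions = {}
definitions["Letter"] = expand_names("[A-Za-z]", definitions)
definitions["Digit"] = expand_names("[0-9]", definitions)
definitions["AlphaNum"] = expand_names("[Letter Digit]", definitions)

# A hypothetical rule that refers to the sets by name.
expanded_rule = expand_names("Letter ÷ Digit", definitions)
```

Because each definition is expanded when it is stored, later definitions and rules only ever see fully expanded sets.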
46 | ||
47 | Application of the rules: | |
48 | Matching begins at the start of text, or after a previously identified boundary. | |
49 | The pseudo-code below finds the next boundary. | |
50 | ||
51 | while position < end of text |
52 |     for each rule | |
53 |         if the text at position matches this rule | |
54 |             if the rule has a '÷' | |
55 |                 Boundary is found. | |
56 |                 return the position of the '÷' within the match. | |
57 |             else | |
58 |                 position = last character of the rule match. | |
0f5d89e8 | 59 |                 break from the inner rule loop, continue the outer loop. |
2ca993e8 A |
60 | |
61 | This differs from the Unicode UAX algorithm in that each position in the text is | |
62 | not tested separately. Instead, when a rule match is found, rule application restarts with the last | |
63 | character of the preceding rule match. ICU's break rules also operate this way. | |
64 | ||
65 | Expressing rules this way simplifies UAX rules that have leading or trailing context; it | |
66 | is no longer necessary to write expressions that match the context starting from | |
67 | any position within it. | |
68 | ||
69 | This rule form differs from ICU rules in that the rules are applied sequentially, as they | |
70 | are with the Unicode UAX rules. With the main ICU break rules, all are applied in parallel. | |
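The rule application loop described in this section can be sketched in Python as follows. This is an illustration only, not the code in rbbimonkeytest.cpp. It uses Python's `re` module in place of ICU regular expressions, the helpers `compile_rule`, `next_boundary` and `all_boundaries` are hypothetical, and the two toy rules are invented for the example. The sketch assumes that a '÷' appears at most once per rule and outside any alternation, that rules without a '÷' match at least two characters, and that reaching a position where no rule matches ends the search at the end of text.

```python
import re

def compile_rule(rule):
    """Compile one rule. A '÷' becomes an empty capture group, so the
    boundary position can be read back from the match."""
    has_break = "÷" in rule
    pattern = rule.replace("÷", "()", 1)
    # VERBOSE ignores the spaces that separate rule elements.
    return re.compile(pattern, re.VERBOSE | re.DOTALL), has_break

def next_boundary(text, start, rules):
    """Return the next boundary at or after start, trying rules in order."""
    position = start
    while position < len(text):
        for regex, has_break in rules:
            match = regex.match(text, position)
            if match is None:
                continue
            if has_break:
                # Boundary found: the position of the '÷' within the match.
                return match.start(1)
            # No boundary in this rule: restart at the last matched character.
            last = match.end() - 1
            if last <= position:
                raise ValueError("rule without a boundary made no progress")
            position = last
            break
        else:
            # Assumption: if no rule matches, treat end of text as the boundary.
            return len(text)
    return len(text)

def all_boundaries(text, rules):
    """Collect boundaries by restarting after each one that is found."""
    boundaries = []
    start = 0
    while start < len(text):
        boundary = next_boundary(text, start, rules)
        if boundary <= start:
            raise ValueError("boundary did not advance")
        boundaries.append(boundary)
        start = boundary
    return boundaries

# Toy rules, applied sequentially: no break between two letters,
# otherwise break after any character.
rules = [compile_rule("[A-Za-z] [A-Za-z]"), compile_rule(". ÷")]
result = all_boundaries("ab cd", rules)   # [2, 3, 5]
```

Note how the first rule only needs to describe two adjacent letters: because matching restarts at the last character of the previous match, a run of letters of any length is chained together without the rule having to match context from every position within it.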
71 | ||
72 | Word Dictionaries | |
0f5d89e8 | 73 | The monkey test does not test dictionary-based breaking. The set named 'dictionary' is special, |
2ca993e8 A |
74 | as it is in the main ICU rules. For the monkey test, no characters from the dictionary set are |
75 | included in the randomly-generated test data. | |
76 |