Commit | Line | Data |
---|---|---|
2ca993e8 A |
1 | file: testdata/break_rules/readme.txt |
2 | Copyright (c) 2015-2016, International Business Machines Corporation and others. All Rights Reserved. | |
3 | ||
4 | This directory contains the break iterator reference rule files used by intltest rbbi/RBBIMonkeyTest/testMonkey. | |
5 | The rules in this directory track the boundary rules from Unicode UAX 14 and 29. They are interpreted | |
6 | to provide an expected set of boundary positions to compare with the results from ICU break iteration. | |
7 | ||
8 | Each set of reference break rules lives in a separate file. | |
9 | The list of rule files to run by default is hardcoded into the test code, in rbbimonkeytest.cpp. | |
10 | ||
11 | Each test file includes | |
12 | - The type of ICU break iterator to create (word, line, sentence, etc.) | |
13 | - The locale to use | |
14 | - Character Class definitions | |
15 | - Rule definitions | |
16 | ||
17 | To Do | |
18 | - Syntax for tailoring. | |
19 | ||
20 | ||
21 | Character Class Definition: | |
22 | name = set_regular_expression; | |
23 | ||
24 | Rule Definition: | |
25 | rule_regular_expression; | |
26 | ||
27 | name: | |
28 | [A-Za-z_][A-Za-z0-9_]* | |
29 | ||
30 | set_regular_expression: | |
31 | The intersection of an ICU regular expression [set] expression and a UnicodeSet pattern. | |
32 | (They are mostly the same) | |
33 | May include previously defined set names, which are logically expanded in-place. | |
34 | ||
35 | rule_regular_expression: | |
36 | An ICU Regular Expression. | |
37 | May include set names, which are logically expanded in-place. | |
38 | May include a '÷', which defines a boundary position. | |
39 | ||
40 | Application of the rules: | |
41 | Matching begins at the start of text, or after a previously identified boundary. | |
42 | The pseudo-code below finds the next boundary. | |
43 | ||
44 | while position < end of text | |
45 | for each rule | |
46 | if the text at position matches this rule | |
47 | if the rule has a '÷' | |
48 | Boundary is found. | |
49 | return the position of the '÷' within the match. | |
50 | else | |
51 | position = last character of the rule match. | |
52 | break from the rule loop, continue the outer loop. | |
53 | ||
54 | This differs from the Unicode UAX algorithm in that each position in the text is | |
55 | not tested separately. Instead, when a rule match is found, rule application restarts with the last | |
56 | character of the preceding rule match. ICU's break rules also operate this way. | |
57 | ||
58 | Expressing rules this way simplifies UAX rules that have leading or trailing context; it | |
59 | is no longer necessary to write expressions that match the context starting from | |
60 | any position within it. | |
61 | ||
62 | This rule form differs from ICU rules in that the rules are applied sequentially, as they | |
63 | are with the Unicode UAX rules. With the main ICU break rules, all are applied in parallel. | |
64 | ||
65 | Word Dictionaries | |
66 | The monkey test does not test dictionary-based breaking. The set named 'dictionary' is special, | |
67 | as it is in the main ICU rules. For the monkey test, no characters from the dictionary set are | |
68 | included in the randomly-generated test data. | |
69 |
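
The pseudo-code under "Application of the rules" can be sketched in Python as follows. This is an illustration only, not the code in rbbimonkeytest.cpp: it uses Python's `re` module in place of ICU regular expressions, it does not expand set names, and the names `Rule`, `compile_rule`, `next_boundary` and `all_boundaries` are hypothetical. The two toy rules are invented for the example and are not taken from UAX 14 or 29. The guards for "no rule matched" and "no progress" are assumptions added so that the sketch always terminates; the readme does not say how those cases are handled.

```python
import re
from typing import List, NamedTuple

BOUNDARY = "\u00f7"  # the '÷' marker used in rule definitions


class Rule(NamedTuple):
    regex: re.Pattern
    has_boundary: bool


def compile_rule(source: str) -> Rule:
    """Compile one rule; text before '÷' is captured in the group 'left'."""
    left, sep, right = source.partition(BOUNDARY)
    if not sep:
        return Rule(re.compile(source), False)
    return Rule(re.compile(f"(?P<left>{left})(?:{right})"), True)


def next_boundary(text: str, start: int, rules: List[Rule]) -> int:
    """Return the next boundary at or after start, following the pseudo-code."""
    position = start
    while position < len(text):
        for rule in rules:  # rules are tried sequentially, in order
            m = rule.regex.match(text, position)
            if m is None:
                continue
            if rule.has_boundary:
                # Boundary found: the position of '÷' within the match.
                return m.end("left")
            # No '÷': restart at the last character of the match.
            # Assumption: force progress if the match is too short to advance.
            position = max(m.end() - 1, position + 1)
            break
        else:
            # Assumption: a complete rule set always has a catch-all rule.
            raise ValueError(f"no rule matches at position {position}")
    return len(text)


def all_boundaries(text: str, rules: List[Rule]) -> List[int]:
    """Collect every boundary by restarting after each one found."""
    boundaries = []
    position = 0
    while position < len(text):
        found = next_boundary(text, position, rules)
        if found <= position:
            raise ValueError(f"boundary at {found} does not advance")
        boundaries.append(found)
        position = found
    return boundaries


# Toy rules: no break between two letters; otherwise break after any character.
TOY_RULES = [compile_rule("[A-Za-z][A-Za-z]"), compile_rule(".\u00f7")]
print(all_boundaries("ab cd", TOY_RULES))
```

For the text "ab cd", the first rule matches "ab" without a boundary, so matching restarts at "b"; the second rule then places a boundary after "b". This shows how restarting at the last character of a match lets a later rule see the trailing context of an earlier one.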