Commit | Line | Data |
---|---|---|
2ca993e8 | 1 | file: testdata/break_rules/readme.txt |
f3c0d7a5 A |
2 | Copyright (C) 2016 and later: Unicode, Inc. and others. |
3 | License & terms of use: http://www.unicode.org/copyright.html#License | |
4 | ||
2ca993e8 A |
5 | Copyright (c) 2015-2016, International Business Machines Corporation and others. All Rights Reserved. |
6 | ||
7 | This directory contains the break iterator reference rule files used by intltest rbbi/RBBIMonkeyTest/testMonkey. | |
8 | The rules in this directory track the boundary rules from Unicode UAX 14 and 29. They are interpreted | |
9 | to provide an expected set of boundary positions to compare with the results from ICU break iteration. | |
10 | ||
11 | Each set of reference break rules lives in a separate file. | |
12 | The list of rule files to run by default is hardcoded into the test code, in rbbimonkeytest.cpp. | |
13 | ||
14 | Each test file includes: | |
15 | - The type of ICU break iterator to create (word, line, sentence, etc.) | |
16 | - The locale to use | |
17 | - Character Class definitions | |
18 | - Rule definitions | |
19 | ||
20 | To Do | |
21 | - Syntax for tailoring. | |
22 | ||
23 | ||
24 | Character Class Definition: | |
25 | name = set_regular_expression; | |
26 | ||
27 | Rule Definition: | |
28 | rule_regular_expression; | |
29 | ||
30 | name: | |
31 | [A-Za-z_][A-Za-z0-9_]* | |
32 | ||
33 | set_regular_expression: | |
34 | A set expression written in the syntax common to ICU regular expression [set] expressions and UnicodeSet patterns. |
35 | (The two syntaxes are mostly the same.) |
36 | May include previously defined set names, which are logically expanded in-place. | |
37 | ||
38 | rule_regular_expression: | |
39 | An ICU Regular Expression. | |
40 | May include set names, which are logically expanded in-place. | |
41 | May include a '÷', which defines a boundary position. | |
42 | ||
43 | Application of the rules: | |
44 | Matching begins at the start of text, or after a previously identified boundary. | |
45 | The pseudo-code below finds the next boundary. | |
46 | ||
47 |     while position < end of text |
48 |         for each rule |
49 |             if the text at position matches this rule |
50 |                 if the rule has a '÷' |
51 |                     Boundary is found. |
52 |                     return the position of the '÷' within the match. |
53 |                 else |
54 |                     position = last character of the rule match. |
55 |                     break from the rule loop, continue the outer loop. |
56 | ||
57 | This differs from the Unicode UAX algorithm in that each position in the text is | |
58 | not tested separately. Instead, when a rule match is found, rule application restarts with the last | |
59 | character of the preceding rule match. ICU's break rules also operate this way. | |
60 | ||
61 | Expressing rules this way simplifies UAX rules that have leading or trailing context; it | |
62 | is no longer necessary to write expressions that match the context starting from | |
63 | any position within it. | |
64 | ||
65 | This rule form differs from ICU rules in that the rules are applied sequentially, as they | |
66 | are with the Unicode UAX rules. With the main ICU break rules, all are applied in parallel. | |
67 | ||
68 | Word Dictionaries | |
69 | The monkey test does not test dictionary-based breaking. The set named 'dictionary' is special, |
70 | as it is in the main ICU rules. For the monkey test, no characters from the dictionary set are | |
71 | included in the randomly-generated test data. | |
72 |
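
The loop described under "Application of the rules" can be sketched in Python as follows. This is only an illustration, not the code in rbbimonkeytest.cpp: Python's `re` module stands in for ICU regular expressions (the two syntaxes differ), the names `compile_rule`, `next_boundary`, `boundaries`, `RULES` and the group name `pre` are hypothetical, and the four toy rules are invented for the example rather than taken from UAX 14 or 29. The two `ValueError` guards (a rule that makes no progress, no rule matching at all) are additions of the sketch; the pseudo-code does not say what happens in those cases.

```python
import re

BOUNDARY = "÷"


def compile_rule(rule):
    """Compile one rule; return (regex, has_boundary).

    Assumption of this sketch: at most one top-level '÷' per rule. The text
    before it goes into a named group so the boundary offset can be read back.
    """
    if BOUNDARY in rule:
        before, after = rule.split(BOUNDARY, 1)
        pattern = "(?P<pre>" + before + ")(?:" + after + ")"
        return re.compile(pattern, re.DOTALL), True
    return re.compile(rule, re.DOTALL), False


def next_boundary(text, start, rules):
    """Return the next boundary found from `start`, following the pseudo-code."""
    position = start
    while position < len(text):
        for regex, has_boundary in rules:  # rules are tried in order
            match = regex.match(text, position)
            if match is None:
                continue
            if has_boundary:
                # Boundary found: report where '÷' sits within the match.
                return match.end("pre")
            # No '÷': restart rule application at the last matched character.
            restart = match.end() - 1
            if restart <= position:
                raise ValueError("rule without '÷' made no progress")
            position = restart
            break
        else:
            # Not covered by the pseudo-code; this sketch treats it as an error.
            raise ValueError("no rule matched at offset %d" % position)
    return len(text)


def boundaries(text, rules):
    """Collect every boundary by restarting after each one that is found."""
    result = []
    position = 0
    while position < len(text):
        found = next_boundary(text, position, rules)
        if found <= position:
            raise ValueError("boundary did not advance")
        result.append(found)
        position = found
    return result


# Toy rules, invented for this example (not the UAX rules).
RULES = [compile_rule(r) for r in (
    r"\r\n",        # keep CR LF together (no boundary)
    r"[\r\n]÷",     # break after a line ending
    r"[a-z][a-z]",  # no boundary between two lowercase letters
    r".÷",          # catch-all: break after any other character
)]

print(boundaries("ab\r\nc d", RULES))  # prints [2, 4, 5, 6, 7]
```

Because the rules are tried in order and the first match wins, moving the catch-all rule to the front of `RULES` would produce a boundary after every character. That is the sequential behaviour the text contrasts with the main ICU break rules, which are applied in parallel.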