file: testdata/break_rules/readme.txt
Copyright (c) 2015-2016, International Business Machines Corporation and others. All Rights Reserved.

This directory contains the break iterator reference rule files used by intltest rbbi/RBBIMonkeyTest/testMonkey.
The rules in this directory track the boundary rules from Unicode UAX #14 and UAX #29. They are interpreted
to provide an expected set of boundary positions to compare with the results from ICU break iteration.

Each set of reference break rules lives in a separate file.
The list of rule files to run by default is hardcoded into the test code, in rbbimonkeytest.cpp.

Each test file includes
  - The type of ICU break iterator to create (word, line, sentence, etc.)
  - The locale to use
  - Character Class definitions
  - Rule definitions

To Do
  - Syntax for tailoring.


Character Class Definition:
    name = set_regular_expression;

Rule Definition:
    rule_regular_expression;

name:
    [A-Za-z_][A-Za-z0-9_]*

set_regular_expression:
    An expression that is valid both as an ICU regular expression [set] expression
    and as a UnicodeSet pattern. (The two syntaxes are mostly the same.)
    May include previously defined set names, which are logically expanded in-place.

rule_regular_expression:
    An ICU Regular Expression.
    May include set names, which are logically expanded in-place.
    May include a '÷', which defines a boundary position.

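    As an illustration of the definitions above, the Python sketch below separates
    character class definitions from rule definitions and expands set names in place.
    It is not the parser used by the test. It assumes that ';' never appears inside an
    expression and that any bare identifier equal to a defined set name is a reference
    to that set, so it would wrongly expand a name that also occurs inside a property
    expression. The names parse_statements, NAME, DEFINITION and EXAMPLE are hypothetical.

```python
import re

# The "name" syntax given above.
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# "name = set_regular_expression" (the trailing ';' is removed before matching).
DEFINITION = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*)", re.DOTALL)


def parse_statements(source):
    """Split source into expanded set definitions and expanded rules."""
    sets = {}    # set name -> expanded set expression
    rules = []   # expanded rule expressions, in file order

    def expand(expression):
        # Replace each previously defined set name with its expression.
        return NAME.sub(lambda m: sets.get(m.group(), m.group()), expression)

    # Simplification (assumption): ';' never appears inside an expression.
    for statement in source.split(";"):
        statement = statement.strip()
        if not statement:
            continue
        definition = DEFINITION.fullmatch(statement)
        if definition:
            sets[definition.group(1)] = expand(definition.group(2))
        else:
            rules.append(expand(statement))
    return sets, rules


# Toy input, invented for this sketch. "\u00f7" is the '÷' character.
EXAMPLE = """
Letter = [A-Za-z];
Digit  = [0-9];
Alnum  = [Letter Digit];
Letter Letter;
Alnum \u00f7;
"""
sets, rules = parse_statements(EXAMPLE)
print(sets["Alnum"])   # [[A-Za-z] [0-9]]
print(rules[0])        # [A-Za-z] [A-Za-z]
```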
Application of the rules:
    Matching begins at the start of text, or after a previously identified boundary.
    The pseudo-code below finds the next boundary.

    while position < end of text
        for each rule
            if the text at position matches this rule
                if the rule has a '÷'
                    Boundary is found.
                    return the position of the '÷' within the match.
                else
                    position = last character of the rule match.
                    break from the rule loop, continue the outer loop.

    This differs from the Unicode UAX algorithm in that the text is not tested one
    position at a time. Instead, when a rule match is found, rule application restarts with the last
    character of the preceding rule match. ICU's break rules also operate this way.

    Expressing rules this way simplifies UAX rules that have leading or trailing context; it
    is no longer necessary to write expressions that match the context starting from
    any position within it.

    This rule form differs from ICU rules in that the rules are applied sequentially, as they
    are with the Unicode UAX rules. With the main ICU break rules, all are applied in parallel.

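    The rule application loop described in this section can be sketched in Python.
    This is an illustration under stated assumptions, not the code in rbbimonkeytest.cpp:
    Python's re module stands in for ICU regular expressions, the '÷' is recorded with
    an empty capture group, and the error checks (no rule matches, or a match that does
    not advance) are additions made here so that the loop always terminates. The names
    compile_rule, next_boundary, all_boundaries and TOY_RULES are hypothetical.

```python
import re

BREAK_MARK = "\u00f7"  # the '÷' character


def compile_rule(rule_source):
    """Return (compiled pattern, has_break) for one rule expression."""
    has_break = BREAK_MARK in rule_source
    # Assumption: an empty capture group marks where '÷' falls in the match.
    pattern = rule_source.replace(BREAK_MARK, "()", 1)
    return re.compile(pattern, re.DOTALL), has_break


def next_boundary(text, start, rules):
    """Find the next boundary after start, trying the rules in order."""
    position = start
    while position < len(text):
        for pattern, has_break in rules:
            match = pattern.match(text, position)
            if match is None:
                continue
            if has_break:
                boundary = match.start(1)
                if boundary <= start:
                    raise ValueError("boundary does not advance")
                return boundary
            # No '÷': restart rule matching at the last character of the match.
            last_char = match.end() - 1
            if last_char <= position:
                raise ValueError("rule matched without advancing")
            position = last_char
            break
        else:
            raise ValueError("no rule matched at position %d" % position)
    return len(text)


def all_boundaries(text, rules):
    """Collect every boundary, starting from the start of text."""
    boundaries = [0]
    while boundaries[-1] < len(text):
        boundaries.append(next_boundary(text, boundaries[-1], rules))
    return boundaries


# Toy rules, invented for this sketch and applied sequentially.
TOY_RULES = [
    compile_rule("[A-Za-z][A-Za-z]"),  # no boundary between two letters
    compile_rule(".\u00f7"),           # otherwise, boundary after any character
]
print(all_boundaries("ab cd", TOY_RULES))  # [0, 2, 3, 5]
```

    In the toy rules, the first rule has no '÷', so a match only moves the position to
    its last character; the second rule then supplies the boundary once no letter pair
    matches.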
Word Dictionaries
    The monkey test does not test dictionary-based breaking. The set named 'dictionary' is special,
    as it is in the main ICU rules. For the monkey test, no characters from the dictionary set are
    included in the randomly generated test data.

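    The exclusion of dictionary characters from the test data can be pictured with the
    short Python sketch below. It is only an illustration: the function name
    random_test_text, its parameters, and the toy character lists are hypothetical, and
    the real test draws its characters from the sets defined in the rule files.

```python
import random


def random_test_text(candidates, dictionary_set, length, seed=0):
    """Build random text from candidates, skipping dictionary characters."""
    allowed = sorted(set(candidates) - set(dictionary_set))
    if not allowed:
        raise ValueError("no characters left after excluding the dictionary set")
    rng = random.Random(seed)  # fixed seed so a failure can be reproduced
    return "".join(rng.choice(allowed) for _ in range(length))


# Toy data: two Thai letters stand in for the 'dictionary' set.
DICTIONARY = "\u0e01\u0e02"
text = random_test_text("ab \u0e01\u0e02.", DICTIONARY, length=20)
```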