file: testdata/break_rules/readme.txt
Copyright (C) 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html#License

Copyright (c) 2015-2016, International Business Machines Corporation and others. All Rights Reserved.

This directory contains the break iterator reference rule files used by intltest rbbi/RBBIMonkeyTest/testMonkey.
The rules in this directory track the boundary rules from Unicode UAX 14 and 29. They are interpreted
to provide an expected set of boundary positions to compare with the results from ICU break iteration.

ICU4J also includes copies of the test reference rules, located in the directory
main/tests/core/src/com/ibm/icu/dev/test/rbbi/break_rules/
The copies should be kept synchronized; there should be no differences.

Each set of reference break rules lives in a separate file.
The list of rule files to run by default is hard-coded into the test code, in rbbimonkeytest.cpp.

Each test file includes:
  - The type of ICU break iterator to create (word, line, sentence, etc.)
  - The locale to use
  - Character Class definitions
  - Rule definitions

To Do
  - Extend the syntax to support rule tailoring.


Character Class Definition:
    name = set_regular_expression;

Rule Definition:
    rule_regular_expression;

name:
    [A-Za-z_][A-Za-z0-9_]*

set_regular_expression:
    A set expression written in the syntax common to both ICU regular expression [set]
    expressions and UnicodeSet patterns. (The two syntaxes are mostly the same.)
    May include previously defined set names, which are logically expanded in-place.

rule_regular_expression:
    An ICU Regular Expression.
    May include set names, which are logically expanded in-place.
    May include a '÷', which defines a boundary position.
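
    The in-place expansion of set names described above can be pictured with a small Python
    sketch. This is an illustration only, not the parser used by the test code: the helper
    names (expand, parse_definitions) and the toy definitions are hypothetical, and the sketch
    does not treat property expressions such as \p{...} specially.

```python
import re

# Either a backslash escape (left untouched) or an identifier matching the
# 'name' syntax from this readme.
TOKEN = re.compile(r"(\\.)|([A-Za-z_][A-Za-z0-9_]*)")


def expand(expr, sets):
    """Replace every known set name in expr with its definition."""
    def substitute(m):
        if m.group(1):                       # escape such as \r: keep as-is
            return m.group(0)
        return sets.get(m.group(2), m.group(0))
    return TOKEN.sub(substitute, expr)


def parse_definitions(lines):
    """Parse 'name = set_expression;' lines, expanding earlier names."""
    sets = {}
    for line in lines:
        name, _, expr = line.rstrip(";").partition("=")
        sets[name.strip()] = expand(expr.strip(), sets)
    return sets


# Toy definitions, invented for this example.
sets = parse_definitions([
    r"CR = [\r];",
    r"LF = [\n];",
    r"Newline = [CR LF];",
])
expanded_rule = expand("CR LF", sets)
```

    Because each definition is expanded as it is read, only names defined earlier in the
    file are substituted, which matches the "previously defined" wording above.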

Application of the rules:
    Matching begins at the start of text, or after a previously identified boundary.
    The pseudo-code below finds the next boundary.

    while position < end of text
        for each rule
            if the text at position matches this rule
                if the rule has a '÷'
                    Boundary is found.
                    return the position of the '÷' within the match.
                else
                    position = last character of the rule match.
                    break from the inner rule loop, continue the outer loop.

    This differs from the Unicode UAX algorithm in that each position in the text is
    not tested separately. Instead, when a rule match is found, rule application restarts with the last
    character of the preceding rule match. ICU's break rules also operate this way.

    Expressing rules this way simplifies UAX rules that have leading or trailing context; it
    is no longer necessary to write expressions that match the context starting from
    any position within it.

    This rule form differs from ICU rules in that the rules are applied sequentially, as they
    are with the Unicode UAX rules. With the main ICU break rules, all are applied in parallel.
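
    The pseudo-code above can be sketched in Python as follows. This is a minimal illustration
    under stated assumptions, not the code in rbbimonkeytest.cpp: it uses Python's re module
    instead of ICU regular expressions, the function names are hypothetical, a no-break match
    shorter than two characters is skipped so that the loop always advances, and a position
    where no rule matches is treated as an error.

```python
import re

BREAK = "÷"


def compile_rule(rule):
    """Return (compiled regex, has_break) for one rule string."""
    has_break = BREAK in rule
    # An empty named group records where the '÷' sits inside the match.
    pattern = rule.replace(BREAK, "(?P<brk>)", 1)
    return re.compile(pattern), has_break


def next_boundary(text, start, rules):
    """Find the next boundary at or after start, trying rules in order."""
    position = start
    while position < len(text):
        for regex, has_break in rules:
            m = regex.match(text, position)
            if m is None:
                continue
            if has_break:
                return m.start("brk")        # position of '÷' in the match
            if m.end() - position < 2:
                continue                     # assumption: too short to advance
            position = m.end() - 1           # last character of the match
            break                            # restart with the first rule
        else:
            raise ValueError(f"no rule matches at {position}")
    return len(text)


def boundaries(text, rules):
    """Collect all boundaries, starting with the one at position 0."""
    result = [0]
    while result[-1] < len(text):
        nxt = next_boundary(text, result[-1], rules)
        if nxt <= result[-1]:
            raise ValueError("rule set made no progress")
        result.append(nxt)
    return result


# Toy rules: keep CR LF together, otherwise break after any character.
RULES = [compile_rule(r) for r in (r"\r\n", r"(?s).÷")]
print(boundaries("a\r\nb", RULES))           # [0, 1, 3, 4]
```

    In the example, the first rule matches CR LF without a break, so matching restarts at the
    LF and no boundary is reported between the two characters.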

Word Dictionaries
    The monkey test does not test dictionary-based breaking. The set named 'dictionary' is special,
    as it is in the main ICU rules. For the monkey test, no characters from the dictionary set are
    included in the randomly generated test data.
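
    The exclusion of dictionary characters from the test data can be pictured with the short
    Python sketch below. It is an illustration only: the function name, the alphabet, and the
    contents of the dictionary set are hypothetical and do not reflect how the monkey test
    actually builds its data.

```python
import random


def random_test_text(alphabet, dictionary, length, seed=0):
    """Build random text from alphabet, skipping dictionary characters."""
    allowed = [ch for ch in alphabet if ch not in dictionary]
    rng = random.Random(seed)                # seeded for reproducible runs
    return "".join(rng.choice(allowed) for _ in range(length))


# Hypothetical sets: two Thai letters stand in for the 'dictionary' set.
dictionary = {"\u0e01", "\u0e02"}
alphabet = "ab \r\n\u0e01\u0e02"
text = random_test_text(alphabet, dictionary, 20)
```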