/////////////////////////////////////////////////////////////////////////////
// Purpose:     topic overview
// Author:      wxWidgets team
// Licence:     wxWindows license
/////////////////////////////////////////////////////////////////////////////
@page overview_resyntax Syntax of the Built-in Regular Expression Library

A <em>regular expression</em> describes strings of characters. It's a pattern
that matches certain strings and doesn't match others.

@li @ref overview_resyntax_differentflavors
@li @ref overview_resyntax_syntax
@li @ref overview_resyntax_bracket
@li @ref overview_resyntax_escapes
@li @ref overview_resyntax_metasyntax
@li @ref overview_resyntax_matching
@li @ref overview_resyntax_limits
@li @ref overview_resyntax_bre
@li @ref overview_resyntax_characters
@section overview_resyntax_differentflavors Different Flavors of REs

Regular expressions ("REs"), as defined by POSIX, come in two flavors:
@e extended REs ("EREs") and @e basic REs ("BREs"). EREs are roughly those of
the traditional @e egrep, while BREs are roughly those of the traditional
@e ed. This implementation adds a third flavor, @e advanced REs ("AREs"),
basically EREs with some significant extensions.

This manual page primarily describes AREs. BREs mostly exist for backward
compatibility in some old programs; they are discussed at the end, in
@ref overview_resyntax_bre. POSIX EREs are almost an exact subset of AREs.
Features of AREs that are not present in EREs will be indicated.
@section overview_resyntax_syntax Regular Expression Syntax

These regular expressions are implemented using the package written by Henry
Spencer, based on the 1003.2 spec and some (not quite all) of the Perl5
extensions (thanks, Henry!). Much of the description of regular expressions
below is copied verbatim from his manual entry.

An ARE is one or more @e branches, separated by '@b |', matching anything that
matches any of the branches.

A branch is zero or more @e constraints or @e quantified atoms, concatenated.
It matches a match for the first, followed by a match for the second, etc.; an
empty branch matches the empty string.

A quantified atom is an @e atom possibly followed by a single @e quantifier.
Without a quantifier, it matches a match for the atom. The quantifiers, and
what a so-quantified atom matches, are:
@li @b * : a sequence of 0 or more matches of the atom
@li @b + : a sequence of 1 or more matches of the atom
@li @b ? : a sequence of 0 or 1 matches of the atom
@li @b {m} : a sequence of exactly @e m matches of the atom
@li @b {m,} : a sequence of @e m or more matches of the atom
@li @b {m,n} : a sequence of @e m through @e n (inclusive) matches of the
    atom; @e m may not exceed @e n
@li @b *? +? ?? {m}? {m,}? {m,n}? : @e non-greedy quantifiers, which match the
    same possibilities, but prefer the smallest number rather than the largest
    number of matches (see @ref overview_resyntax_matching)

The forms using @b { and @b } are known as @e bounds. The numbers @e m and @e n
are unsigned decimal integers with permissible values from 0 to 255 inclusive.
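
As a rough illustration of greedy versus non-greedy quantifiers and of the
limits on bounds, the sketch below uses Python's standard @c re module as a
stand-in. It is not the wxWidgets library, and @c check_bound is a hypothetical
helper that only encodes the limits stated above.

```python
import re


def check_bound(m, n=None):
    """Hypothetical helper: validate {m} or {m,n} against the stated limits."""
    values = [m] if n is None else [m, n]
    if any(not 0 <= v <= 255 for v in values):
        return False               # each number must lie in 0..255
    return n is None or m <= n     # m may not exceed n


text = "aaaaa"
greedy = re.match(r"a{2,4}", text).group()   # prefers the largest count
lazy = re.match(r"a{2,4}?", text).group()    # prefers the smallest count
```

Both patterns accept the same set of strings; only the preferred match length
differs.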
An atom is one of:

@li @b (re) : (where @e re is any regular expression) matches a match for
    @e re, with the match noted for possible reporting
@li @b (?:re) : as previous, but does no reporting (a "non-capturing" set of
    parentheses)
@li @b () : matches an empty string, noted for possible reporting
@li @b (?:) : matches an empty string, without reporting
@li @b [chars] : a @e bracket expression, matching any one of the @e chars
    (see @ref overview_resyntax_bracket for more detail)
@li @b . : matches any single character
@li @b \k : (where @e k is a non-alphanumeric character) matches that
    character taken as an ordinary character, e.g. \\ matches a backslash
@li @b \c : where @e c is alphanumeric (possibly followed by other
    characters), an @e escape (AREs only), see @ref overview_resyntax_escapes
    below
@li @b { : when followed by a character other than a digit, matches the
    left-brace character '@b {'; when followed by a digit, it is the beginning
    of a @e bound (see above)
@li @b x : where @e x is a single character with no other significance,
    matches that character.
A @e constraint matches an empty string when specific conditions are met. A
constraint may not be followed by a quantifier. The simple constraints are as
follows; some more constraints are described later, under
@ref overview_resyntax_escapes.

@li @b ^ : matches at the beginning of a line
@li @b $ : matches at the end of a line
@li @b (?=re) : @e positive lookahead (AREs only), matches at any point where
    a substring matching @e re begins
@li @b (?!re) : @e negative lookahead (AREs only), matches at any point where
    no substring matching @e re begins

The lookahead constraints may not contain back references (see later), and all
parentheses within them are considered non-capturing.

An RE may not end with '@b \'.
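
The lookahead constraints can be tried out with Python's @c re module, which
happens to use the same @b (?=re) and @b (?!re) notation. This is a sketch in a
different regex engine, assumed here only for illustration, not a wxWidgets
call.

```python
import re

text = "foobar foobaz"

# Positive lookahead: 'foo' only where 'bar' begins immediately afterwards.
positive = [m.start() for m in re.finditer(r"foo(?=bar)", text)]

# Negative lookahead: 'foo' only where 'bar' does not begin afterwards.
negative = [m.start() for m in re.finditer(r"foo(?!bar)", text)]

# The constraint matches an empty string, so the reported match is just 'foo'.
matched = re.search(r"foo(?=bar)", text).group()
```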
@section overview_resyntax_bracket Bracket Expressions

A @e bracket expression is a list of characters enclosed in '@b []'. It
normally matches any single character from the list (but see below). If the
list begins with '@b ^', it matches any single character (but see below)
@e not from the rest of the list.

If two characters in the list are separated by '@b -', this is shorthand for
the full @e range of characters between those two (inclusive) in the collating
sequence, e.g. @b [0-9] in ASCII matches any decimal digit. Two ranges may not
share an endpoint, so e.g. @b a-c-e is illegal. Ranges are very
collating-sequence-dependent, and portable programs should avoid relying on
them.

To include a literal @b ] or @b - in the list, the simplest method is to
enclose it in @b [. and @b .] to make it a collating element (see below).
Alternatively, make it the first character (following a possible '@b ^'), or
(AREs only) precede it with '@b \'. Alternatively, for '@b -', make it the last
character, or the second endpoint of a range. To use a literal @b - as the
first endpoint of a range, make it a collating element or (AREs only) precede
it with '@b \'. With the exception of these, some combinations using @b [ (see
next paragraphs), and escapes, all other special characters lose their special
significance within a bracket expression.
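
The placement rules for a literal @b ] and @b - can be checked with a short
sketch. It uses Python's @c re module as an assumed stand-in; Python follows
the same positional conventions but has no @b [. .] collating elements, so only
the positional forms are shown.

```python
import re

first = re.compile(r"[]a]")      # ']' as the first character is literal
negated = re.compile(r"[^]a]")   # the same holds right after a leading '^'
last = re.compile(r"[a-]")       # '-' as the last character is literal
```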
Within a bracket expression, a collating element (a character, a
multi-character sequence that collates as if it were a single character, or a
collating-sequence name for either) enclosed in @b [. and @b .] stands for the
sequence of characters of that collating element.

@e wxWidgets: Currently no multi-character collating elements are defined. So
in @b [.X.], @e X can either be a single character literal or the name of a
character. For example, @b [[.0.]-[.9.]] and @b [[.zero.]-[.nine.]] are
identical, and both mean the same as @b [0-9]. See
@ref overview_resyntax_characters.

Within a bracket expression, a collating element enclosed in @b [= and @b =] is
an equivalence class, standing for the sequences of characters of all collating
elements equivalent to that one, including itself. An equivalence class may not
be an endpoint of a range.

@e wxWidgets: Currently no equivalence classes are defined, so @b [=X=] stands
for just the single character @e X. @e X can either be a single character
literal or the name of a character, see @ref overview_resyntax_characters.

Within a bracket expression, the name of a @e character class enclosed in
@b [: and @b :] stands for the list of all characters (not all collating
elements!) belonging to that class. Standard character classes are:
@li @b alpha : A letter.
@li @b upper : An upper-case letter.
@li @b lower : A lower-case letter.
@li @b digit : A decimal digit.
@li @b xdigit : A hexadecimal digit.
@li @b alnum : An alphanumeric (letter or digit).
@li @b print : An alphanumeric (same as alnum).
@li @b blank : A space or tab character.
@li @b space : A character producing white space in displayed text.
@li @b punct : A punctuation character.
@li @b graph : A character with a visible representation.
@li @b cntrl : A control character.

A character class may not be used as an endpoint of a range.

@e wxWidgets: In a non-Unicode build, these character classifications depend on
the current locale, and correspond to the values returned by the ANSI C 'is'
functions: isalpha, isupper, etc. In Unicode mode they are based on Unicode
classifications, and are not affected by the current locale.
There are two special cases of bracket expressions: the bracket expressions
@b [[:@<:]] and @b [[:@>:]] are constraints, matching empty strings at the
beginning and end of a word respectively. A word is defined as a sequence of
word characters that is neither preceded nor followed by word characters. A
word character is an @e alnum character or an underscore (@b _). These special
bracket expressions are deprecated; users of AREs should use constraint escapes
instead (see @ref overview_resyntax_escapes below).
@section overview_resyntax_escapes Escapes

Escapes (AREs only), which begin with a @b \ followed by an alphanumeric
character, come in several varieties: character entry, class shorthands,
constraint escapes, and back references. A @b \ followed by an alphanumeric
character but not constituting a valid escape is illegal in AREs. In EREs,
there are no escapes: outside a bracket expression, a @b \ followed by an
alphanumeric character merely stands for that character as an ordinary
character, and inside a bracket expression, @b \ is an ordinary character. (The
latter is the one actual incompatibility between EREs and AREs.)

Character-entry escapes (AREs only) exist to make it easier to specify
non-printing and otherwise inconvenient characters in REs:
@li @b \a : alert (bell) character, as in C
@li @b \b : backspace, as in C
@li @b \B : synonym for @b \ to help reduce backslash doubling in some
    applications where there are multiple levels of backslash processing
@li @b \cX : (where @e X is any character) the character whose low-order 5
    bits are the same as those of @e X, and whose other bits are all zero
@li @b \e : the character whose collating-sequence name is '@b ESC', or
    failing that, the character with octal value 033
@li @b \f : formfeed, as in C
@li @b \n : newline, as in C
@li @b \r : carriage return, as in C
@li @b \t : horizontal tab, as in C
@li @b \uwxyz : (where @e wxyz is exactly four hexadecimal digits) the Unicode
    character @b U+@e wxyz in the local byte ordering
@li @b \Ustuvwxyz : (where @e stuvwxyz is exactly eight hexadecimal digits)
    reserved for a somewhat-hypothetical Unicode extension to 32 bits
@li @b \v : vertical tab, as in C
@li @b \xhhh : (where @e hhh is any sequence of hexadecimal digits) the
    character whose hexadecimal value is @b 0x@e hhh (a single character no
    matter how many hexadecimal digits are used)
@li @b \0 : the character whose value is @b 0
@li @b \xy : (where @e xy is exactly two octal digits, and is not a @e back
    reference (see below)) the character whose octal value is @b 0@e xy
@li @b \xyz : (where @e xyz is exactly three octal digits, and is not a back
    reference (see below)) the character whose octal value is @b 0@e xyz
Hexadecimal digits are '@b 0'-'@b 9', '@b a'-'@b f', and '@b A'-'@b F'. Octal
digits are '@b 0'-'@b 7'.

The character-entry escapes are always taken as ordinary characters. For
example, @b \135 is @b ] in ASCII, but @b \135 does not terminate a bracket
expression. Beware, however, that some applications (e.g., C compilers)
interpret such sequences themselves before the regular-expression package gets
to see them, which may require doubling (quadrupling, etc.) the '@b \'.
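
The numeric character-entry escapes come down to simple arithmetic on character
codes. The sketch below works that arithmetic out in plain Python, with no
regex engine involved; @c control_char is a hypothetical helper name introduced
only for this example.

```python
def control_char(x):
    """Character named by a control escape: keep the low-order 5 bits of X."""
    return chr(ord(x) & 0x1F)


bell = control_char("G")       # same code as the alert (bell) character
esc = chr(0o33)                # octal 033, the fallback value for ESC
bracket = chr(0o135)           # octal 135 is ']' in ASCII
hex_a = chr(int("41", 16))     # hexadecimal 41 names the single character 'A'
```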
Class-shorthand escapes (AREs only) provide shorthands for certain
commonly-used character classes:

@li @b \d : @b [[:digit:]]
@li @b \s : @b [[:space:]]
@li @b \w : @b [[:alnum:]_] (note underscore)
@li @b \D : @b [^[:digit:]]
@li @b \S : @b [^[:space:]]
@li @b \W : @b [^[:alnum:]_] (note underscore)

Within bracket expressions, '@b \d', '@b \s', and '@b \w' lose their outer
brackets, and '@b \D', '@b \S', and '@b \W' are illegal. (So, for example,
@b [a-c\d] is equivalent to @b [a-c[:digit:]]. Also, @b [a-c\D], which is
equivalent to @b [a-c^[:digit:]], is illegal.)
A constraint escape (AREs only) is a constraint, matching the empty string if
specific conditions are met, written as an escape:

@li @b \A : matches only at the beginning of the string (see
    @ref overview_resyntax_matching, below, for how this differs from '@b ^')
@li @b \m : matches only at the beginning of a word
@li @b \M : matches only at the end of a word
@li @b \y : matches only at the beginning or end of a word
@li @b \Y : matches only at a point that is not the beginning or end of a
    word
@li @b \Z : matches only at the end of the string (see
    @ref overview_resyntax_matching, below, for how this differs from '@b $')
@li @b \m : (where @e m is a nonzero digit) a @e back reference, see below
@li @b \mnn : (where @e m is a nonzero digit, and @e nn is some more digits,
    and the decimal value @e mnn is not greater than the number of closing
    capturing parentheses seen so far) a @e back reference, see below

A word is defined as in the specification of @b [[:@<:]] and @b [[:@>:]] above.
Constraint escapes are illegal within bracket expressions.
A back reference (AREs only) matches the same string matched by the
parenthesized subexpression specified by the number, so that (e.g.)
@b ([bc])\1 matches @b bb or @b cc but not '@b bc'. The subexpression must
entirely precede the back reference in the RE. Subexpressions are numbered in
the order of their leading parentheses. Non-capturing parentheses do not define
subexpressions.
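
The @b ([bc])\1 example can be reproduced in Python's @c re module, which is
assumed here as a stand-in with the same back-reference notation. The second
pattern shows that non-capturing parentheses are skipped when subexpressions
are numbered.

```python
import re

pattern = re.compile(r"([bc])\1")
results = {s: bool(pattern.fullmatch(s)) for s in ("bb", "cc", "bc")}

# (?:a) does not define a subexpression, so \1 still refers to ([bc]).
skipping = re.compile(r"(?:a)([bc])\1")
```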
There is an inherent historical ambiguity between octal character-entry escapes
and back references, which is resolved by heuristics, as hinted at above. A
leading zero always indicates an octal escape. A single non-zero digit, not
followed by another digit, is always taken as a back reference. A multi-digit
sequence not starting with a zero is taken as a back reference if it comes
after a suitable subexpression (i.e. the number is in the legal range for a
back reference), and otherwise is taken as octal.
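
The heuristic can be written down as a small decision function. This is a
simplified sketch, not the library's parser: @c classify_digit_escape and its
parameters are hypothetical names, and the function does not check that an
octal candidate consists only of octal digits.

```python
def classify_digit_escape(digits, groups_seen):
    """Classify the run of decimal digits that follows a backslash.

    digits      -- the digits after the backslash, e.g. "12"
    groups_seen -- number of capturing subexpressions closed so far
    """
    if digits.startswith("0"):
        return "octal"       # a leading zero always indicates an octal escape
    if len(digits) == 1:
        return "backref"     # a single non-zero digit is a back reference
    if int(digits) <= groups_seen:
        return "backref"     # multi-digit value within the legal range
    return "octal"           # otherwise it is taken as octal
```

For instance, a two-digit sequence such as @b \12 is a back reference only when
at least twelve capturing subexpressions precede it.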
@section overview_resyntax_metasyntax Metasyntax

In addition to the main syntax described above, there are some special forms
and miscellaneous syntactic facilities available.

Normally the flavor of RE being used is specified by application-dependent
means. However, this can be overridden by a @e director. If an RE of any flavor
begins with '@b ***:', the rest of the RE is an ARE. If an RE of any flavor
begins with '@b ***=', the rest of the RE is taken to be a literal string, with
all characters considered ordinary characters.
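
How a director changes the interpretation of the rest of the pattern can be
sketched as follows. @c split_director is a hypothetical helper, not part of
any library, and @c re.escape is used only as Python's closest equivalent of
"all characters ordinary".

```python
import re


def split_director(pattern, default="ARE"):
    """Return (flavor, rest) for a pattern with an optional leading director."""
    if pattern.startswith("***:"):
        return "ARE", pattern[4:]        # rest of the RE is an ARE
    if pattern.startswith("***="):
        return "literal", pattern[4:]    # rest is a literal string
    return default, pattern              # no director: flavor is unchanged


flavor, rest = split_director("***=a.b*c")
literal = re.compile(re.escape(rest))    # every character taken literally
```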
An ARE may begin with @e embedded options: a sequence @b (?xyz) (where @e xyz
is one or more alphabetic characters) specifies options affecting the rest of
the RE. These supplement, and can override, any options specified by the
application. The available option letters are:

@li @b b : rest of RE is a BRE
@li @b c : case-sensitive matching (usual default)
@li @b e : rest of RE is an ERE
@li @b i : case-insensitive matching (see @ref overview_resyntax_matching,
    below)
@li @b m : historical synonym for @b n
@li @b n : newline-sensitive matching (see @ref overview_resyntax_matching,
    below)
@li @b p : partial newline-sensitive matching (see
    @ref overview_resyntax_matching, below)
@li @b q : rest of RE is a literal ("quoted") string, all ordinary characters
@li @b s : non-newline-sensitive matching (usual default)
@li @b t : tight syntax (usual default; see below)
@li @b w : inverse partial newline-sensitive ("weird") matching (see
    @ref overview_resyntax_matching, below)
@li @b x : expanded syntax (see below)

Embedded options take effect at the @b ) terminating the sequence. They are
available only at the start of an ARE, and may not be used later within it.
In addition to the usual (@e tight) RE syntax, in which all characters are
significant, there is an @e expanded syntax, available in AREs with the
embedded x option. In the expanded syntax, white-space characters are ignored
and all characters between a @b # and the following newline (or the end of the
RE) are ignored, permitting paragraphing and commenting a complex RE. There are
three exceptions to that basic rule:

@li a white-space character or '@b #' preceded by '@b \' is retained
@li white space or '@b #' within a bracket expression is retained
@li white space and comments are illegal within multi-character symbols like
    the ARE '@b (?:' or the BRE '@b \('

Expanded-syntax white-space characters are blank, tab, newline, and any
character that belongs to the @e space character class.
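
Python's verbose mode behaves much like the expanded syntax and also accepts an
embedded @b (?x) option, so it can serve as an approximate illustration. It is
a different engine, assumed here only for demonstration. The two patterns below
are intended to be equivalent.

```python
import re

expanded = re.compile(r"""(?x)
    \d+      # one or more digits
    [ ]      # a space inside a bracket expression is retained
    \#       # an escaped '#' is retained
    \ items  # an escaped space is retained
""")

tight = re.compile(r"\d+ # items")
```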
Finally, in an ARE, outside bracket expressions, the sequence '@b (?#ttt)'
(where @e ttt is any text not containing a '@b )') is a comment, completely
ignored. Again, this is not allowed between the characters of multi-character
symbols like '@b (?:'. Such comments are more a historical artifact than a
useful facility, and their use is deprecated; use the expanded syntax instead.

None of these metasyntax extensions is available if the application (or an
initial @b ***= director) has specified that the user's input be treated as a
literal string rather than as an RE.
@section overview_resyntax_matching Matching

In the event that an RE could match more than one substring of a given string,
the RE matches the one starting earliest in the string. If the RE could match
more than one substring starting at that point, its choice is determined by its
@e preference: either the longest substring, or the shortest.

Most atoms, and all constraints, have no preference. A parenthesized RE has the
same preference (possibly none) as the RE. A quantified atom with quantifier
@b {m} or @b {m}? has the same preference (possibly none) as the atom itself. A
quantified atom with other normal quantifiers (including @b {m,n} with @e m
equal to @e n) prefers longest match. A quantified atom with other non-greedy
quantifiers (including @b {m,n}? with @e m equal to @e n) prefers shortest
match. A branch has the same preference as the first quantified atom in it
which has a preference. An RE consisting of two or more branches connected by
the @b | operator prefers longest match.

Subject to the constraints imposed by the rules for matching the whole RE,
subexpressions also match the longest or shortest possible substrings, based on
their preferences, with subexpressions starting earlier in the RE taking
priority over ones starting later. Note that outer subexpressions thus take
priority over their component subexpressions.

Note that the quantifiers @b {1,1} and @b {1,1}? can be used to force longest
and shortest preference, respectively, on a subexpression or a whole RE.
Match lengths are measured in characters, not collating elements. An empty
string is considered longer than no match at all. For example, @b bb* matches
the three middle characters of '@b abbbc'; @b (week|wee)(night|knights) matches
all ten characters of '@b weeknights'; when @b (.*).* is matched against @b abc
the parenthesized subexpression matches all three characters; and when
@b (a*)* is matched against @b bc both the whole RE and the parenthesized
subexpression match an empty string.
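
The '@b weeknights' example is a convenient way to see how longest-match
semantics differ from a first-match engine. The sketch below uses Python's
@c re module, a backtracking first-match engine, and emulates the longest-match
preference by brute force. @c longest_prefix_match is a hypothetical helper for
illustration only and says nothing about how the ARE matcher is implemented.

```python
import re

pattern = r"(week|wee)(night|knights)"
subject = "weeknights"

# A first-match engine stops at the first alternatives that succeed.
first_found = re.match(pattern, subject).group()


def longest_prefix_match(pattern, subject):
    """Brute-force emulation of a longest-match preference at position 0."""
    for end in range(len(subject), -1, -1):
        if re.fullmatch(pattern, subject[:end]):
            return subject[:end]
    return None


longest = longest_prefix_match(pattern, subject)

# The other examples behave the same way in both kinds of engine.
greedy_star = re.search(r"bb*", "abbbc").group()
captured = re.match(r"(.*).*", "abc").group(1)
```

Here the first-match engine reports nine characters ('weeknight'), while the
longest-match emulation reports all ten.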
If case-independent matching is specified, the effect is much as if all case
distinctions had vanished from the alphabet. When an alphabetic that exists in
multiple cases appears as an ordinary character outside a bracket expression,
it is effectively transformed into a bracket expression containing both cases,
so that @b x becomes '@b [xX]'. When it appears inside a bracket expression,
all case counterparts of it are added to the bracket expression, so that
@b [x] becomes @b [xX] and @b [^x] becomes '@b [^xX]'.
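
The stated equivalences can be checked empirically. The sketch below compares a
case-insensitive pattern with its hand-expanded counterpart using Python's
@c re module as an assumed stand-in; @c same_matches is a hypothetical helper.

```python
import re


def same_matches(folded, expanded, candidates):
    """True if both patterns accept exactly the same candidate strings."""
    return all(
        bool(re.fullmatch(folded, c, re.IGNORECASE))
        == bool(re.fullmatch(expanded, c))
        for c in candidates
    )


plain_ok = same_matches(r"x", r"[xX]", "xXyY")
bracket_ok = same_matches(r"[x]", r"[xX]", "xXyY")
negated_ok = same_matches(r"[^x]", r"[^xX]", "xXyY")
```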
If newline-sensitive matching is specified, @b . and bracket expressions using
@b ^ will never match the newline character (so that matches will never cross
newlines unless the RE explicitly arranges it) and @b ^ and @b $ will match the
empty string after and before a newline respectively, in addition to matching
at beginning and end of string respectively. ARE @b \A and @b \Z continue to
match beginning or end of string @e only.

If partial newline-sensitive matching is specified, this affects @b . and
bracket expressions as with newline-sensitive matching, but not @b ^ and
'@b $'.

If inverse partial newline-sensitive matching is specified, this affects @b ^
and @b $ as with newline-sensitive matching, but not @b . and bracket
expressions. This isn't very useful but is provided for symmetry.
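
The two halves of newline sensitivity (the anchors on one side, @b . on the
other) can be observed separately in Python's @c re module, used here as a
loose analogy only. The correspondence is an assumption and is not exact: in
Python, @b . excludes newline by default, and complemented bracket expressions
always match newline.

```python
import re

text = "one\ntwo"

# Anchor half: with MULTILINE, '^' also matches after a newline.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
string_start = re.findall(r"^\w+", text)

# Dot half: whether '.' may match the newline character.
dot_excludes = re.search(r"one.two", text) is None
dot_includes = re.search(r"one.two", text, re.DOTALL) is not None

# \A stays tied to the beginning of the string even with MULTILINE.
only_start = re.findall(r"\A\w+", text, re.MULTILINE)
```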
@section overview_resyntax_limits Limits and Compatibility

No particular limit is imposed on the length of REs. Programs intended to be
highly portable should not employ REs longer than 256 bytes, as a
POSIX-compliant implementation can refuse to accept such REs.

The only feature of AREs that is actually incompatible with POSIX EREs is that
@b \ does not lose its special significance inside bracket expressions. All
other ARE features use syntax which is illegal or has undefined or unspecified
effects in POSIX EREs; the @b *** syntax of directors likewise is outside the
POSIX syntax for both BREs and EREs.

Many of the ARE extensions are borrowed from Perl, but some have been changed
to clean them up, and a few Perl extensions are not present. Incompatibilities
of note include '@b \b', '@b \B', the lack of special treatment for a trailing
newline, the addition of complemented bracket expressions to the things
affected by newline-sensitive matching, the restrictions on parentheses and
back references in lookahead constraints, and the longest/shortest-match
(rather than first-match) matching semantics.

The matching rules for REs containing both normal and non-greedy quantifiers
have changed since early beta-test versions of this package. (The new rules are
much simpler and cleaner, but don't work as hard at guessing the user's real
intentions.)
Henry Spencer's original 1986 @e regexp package, still in widespread use,
implemented an early version of today's EREs. There are four incompatibilities
between @e regexp's near-EREs ('RREs' for short) and AREs. In roughly
increasing order of significance:

@li In AREs, @b \ followed by an alphanumeric character is either an escape or
    an error, while in RREs, it was just another way of writing the
    alphanumeric. This should not be a problem because there was no reason to
    write such a sequence in RREs.
@li @b { followed by a digit in an ARE is the beginning of a bound, while in
    RREs, @b { was always an ordinary character. Such sequences should be rare,
    and will often result in an error because following characters will not
    look like a valid bound.
@li In AREs, @b \ remains a special character within '@b []', so a literal
    @b \ within @b [] must be written '@b \\'. @b \\ also gives a literal @b \
    within @b [] in RREs, but only truly paranoid programmers routinely doubled
    the backslash.
@li AREs report the longest/shortest match for the RE, rather than the first
    found in a specified search order. This may affect some RREs which were
    written in the expectation that the first match would be reported. The
    careful crafting of RREs to optimize the search order for fast matching is
    obsolete (AREs examine all possible matches in parallel, and their
    performance is largely insensitive to their complexity), but cases where
    the search order was exploited to deliberately find a match which was
    @e not the longest/shortest will need rewriting.
@section overview_resyntax_bre Basic Regular Expressions

BREs differ from EREs in several respects. '@b |', '@b +', and @b ? are
ordinary characters and there is no equivalent for their functionality. The
delimiters for bounds are @b \{ and '@b \}', with @b { and @b } by themselves
ordinary characters. The parentheses for nested subexpressions are @b \( and
'@b \)', with @b ( and @b ) by themselves ordinary characters. @b ^ is an
ordinary character except at the beginning of the RE or the beginning of a
parenthesized subexpression, @b $ is an ordinary character except at the end of
the RE or the end of a parenthesized subexpression, and @b * is an ordinary
character if it appears at the beginning of the RE or the beginning of a
parenthesized subexpression (after a possible leading '@b ^'). Finally,
single-digit back references are available, and @b \@< and @b \@> are synonyms
for @b [[:@<:]] and @b [[:@>:]] respectively; no other escapes are available.
@section overview_resyntax_characters Regular Expression Character Names

Note that the character names are case sensitive.

right-square-bracket