/////////////////////////////////////////////////////////////////////////////
// Name:        resyntax.h
// Purpose:     topic overview
// Author:      wxWidgets team
// Licence:     wxWindows license
/////////////////////////////////////////////////////////////////////////////

@page overview_resyntax Syntax of the Built-in Regular Expression Library

A <em>regular expression</em> describes strings of characters. It's a pattern
that matches certain strings and doesn't match others.

@li @ref overview_resyntax_differentflavors
@li @ref overview_resyntax_syntax
@li @ref overview_resyntax_bracket
@li @ref overview_resyntax_escapes
@li @ref overview_resyntax_metasyntax
@li @ref overview_resyntax_matching
@li @ref overview_resyntax_limits
@li @ref overview_resyntax_bre
@li @ref overview_resyntax_characters

@section overview_resyntax_differentflavors Different Flavors of Regular Expressions

Regular expressions (RE), as defined by POSIX, come in two flavors:
<em>extended regular expressions</em> (ERE) and <em>basic regular
expressions</em> (BRE). EREs are roughly those of the traditional @e egrep,
while BREs are roughly those of the traditional @e ed. This implementation
adds a third flavor: <em>advanced regular expressions</em> (ARE), basically
EREs with some significant extensions.

This manual page primarily describes AREs. BREs mostly exist for backward
compatibility in some old programs. POSIX EREs are almost an exact subset of
AREs. Features of AREs that are not present in EREs will be indicated.

@section overview_resyntax_syntax Regular Expression Syntax

These regular expressions are implemented using the package written by Henry
Spencer, based on the 1003.2 spec and some (not quite all) of the Perl5
extensions (thanks, Henry!). Much of the description of regular expressions
below is copied verbatim from his manual entry.

An ARE is one or more @e branches, separated by "|", matching anything that
matches any of the branches.

A branch is zero or more @e constraints or <em>quantified atoms</em>,
concatenated. It matches a match for the first, followed by a match for the
second, etc.; an empty branch matches the empty string.

A quantified atom is an @e atom possibly followed by a single @e quantifier.
Without a quantifier, it matches a match for the atom. The quantifiers, and
what a so-quantified atom matches, are:

@beginTable
@row2col{ <tt>*</tt> ,
    A sequence of 0 or more matches of the atom. }
@row2col{ <tt>+</tt> ,
    A sequence of 1 or more matches of the atom. }
@row2col{ <tt>?</tt> ,
    A sequence of 0 or 1 matches of the atom. }
@row2col{ <tt>{m}</tt> ,
    A sequence of exactly @e m matches of the atom. }
@row2col{ <tt>{m\,}</tt> ,
    A sequence of @e m or more matches of the atom. }
@row2col{ <tt>{m\,n}</tt> ,
    A sequence of @e m through @e n (inclusive) matches of the atom; @e m may
    not exceed @e n. }
@row2col{ <tt>*? +? ?? {m}? {m\,}? {m\,n}?</tt> ,
    <em>Non-greedy quantifiers</em>, which match the same possibilities, but
    prefer the smallest number rather than the largest number of matches (see
    @ref overview_resyntax_matching). }
@endTable
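
The difference between a normal and a non-greedy bound can be sketched as
follows. This is only an illustration: it uses Python's @c re module as a
stand-in engine (not the library described here), on the assumption that the
two agree for a single quantified atom. The function name is hypothetical.

```python
import re

def greedy_and_lazy(text):
    """Return what a{2,4} and a{2,4}? match at the start of text."""
    greedy = re.match(r"a{2,4}", text)   # normal bound: prefers the largest count
    lazy = re.match(r"a{2,4}?", text)    # non-greedy bound: prefers the smallest count
    return greedy.group(0), lazy.group(0)

print(greedy_and_lazy("aaaaa"))  # prints ('aaaa', 'aa')
```

Both quantifiers accept the same range of repetitions; only the preferred
count differs.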

The forms using @b { and @b } are known as @e bounds. The numbers @e m and
@e n are unsigned decimal integers with permissible values from 0 to 255
inclusive.

An atom is one of:

@beginTable
@row2col{ <tt>(re)</tt> ,
    Where @e re is any regular expression, matches for @e re, with the match
    captured for possible reporting. }
@row2col{ <tt>(?:re)</tt> ,
    As previous, but does no reporting (a "non-capturing" set of
    parentheses). }
@row2col{ <tt>()</tt> ,
    Matches an empty string, captured for possible reporting. }
@row2col{ <tt>(?:)</tt> ,
    Matches an empty string, without reporting. }
@row2col{ <tt>[chars]</tt> ,
    A <em>bracket expression</em>, matching any one of the @e chars (see
    @ref overview_resyntax_bracket for more details). }
@row2col{ <tt>.</tt> ,
    Matches any single character. }
@row2col{ <tt>@\k</tt> ,
    Where @e k is a non-alphanumeric character, matches that character taken
    as an ordinary character, e.g. @\@\ matches a backslash character. }
@row2col{ <tt>@\c</tt> ,
    Where @e c is alphanumeric (possibly followed by other characters), an
    @e escape (AREs only), see @ref overview_resyntax_escapes below. }
@row2col{ <tt>@leftCurly</tt> ,
    When followed by a character other than a digit, matches the left-brace
    character "@leftCurly"; when followed by a digit, it is the beginning of a
    @e bound (see above). }
@row2col{ <tt>x</tt> ,
    Where @e x is a single character with no other significance, matches that
    character. }
@endTable

A @e constraint matches an empty string when specific conditions are met. A
constraint may not be followed by a quantifier. The simple constraints are as
follows; some more constraints are described later, under
@ref overview_resyntax_escapes.

@beginTable
@row2col{ <tt>^</tt> ,
    Matches at the beginning of a line. }
@row2col{ <tt>@$</tt> ,
    Matches at the end of a line. }
@row2col{ <tt>(?=re)</tt> ,
    <em>Positive lookahead</em> (AREs only), matches at any point where a
    substring matching @e re begins. }
@row2col{ <tt>(?!re)</tt> ,
    <em>Negative lookahead</em> (AREs only), matches at any point where no
    substring matching @e re begins. }
@endTable
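
A minimal sketch of the two lookahead constraints, again using Python's @c re
module as a stand-in engine and assuming it treats <tt>(?=re)</tt> and
<tt>(?!re)</tt> the same way for this simple case. The function name and the
sample text are made up for the illustration.

```python
import re

def lookahead_positions(text):
    """Return start offsets of 'foo' followed by 'bar', and of 'foo' not followed by 'bar'."""
    # (?=bar) succeeds only where "bar" begins; it consumes nothing.
    positive = [m.start() for m in re.finditer(r"foo(?=bar)", text)]
    # (?!bar) succeeds only where "bar" does not begin.
    negative = [m.start() for m in re.finditer(r"foo(?!bar)", text)]
    return positive, negative

print(lookahead_positions("foobar foobaz"))  # prints ([0], [7])
```

In both cases the reported match is just <tt>foo</tt>: the lookahead itself
matches an empty string.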

The lookahead constraints may not contain back references (see
@ref overview_resyntax_escapes), and all parentheses within them are
considered non-capturing. An RE may not end with a backslash.

@section overview_resyntax_bracket Bracket Expressions

A <em>bracket expression</em> is a list of characters enclosed in <tt>[]</tt>.
It normally matches any single character from the list (but see below). If the
list begins with @c ^, it matches any single character (but see below) @e not
from the rest of the list.

If two characters in the list are separated by <tt>-</tt>, this is shorthand
for the full @e range of characters between those two (inclusive) in the
collating sequence, e.g. <tt>[0-9]</tt> in ASCII matches any decimal digit.
Two ranges may not share an endpoint, so e.g. <tt>a-c-e</tt> is illegal.
Ranges are very collating-sequence-dependent, and portable programs should
avoid relying on them.

To include a literal <tt>]</tt> or <tt>-</tt> in the list, the simplest method
is to enclose it in <tt>[.</tt> and <tt>.]</tt> to make it a collating element
(see below). Alternatively, make it the first character (following a possible
<tt>^</tt>), or (AREs only) precede it with <tt>@\</tt>. Alternatively, for
<tt>-</tt>, make it the last character, or the second endpoint of a range. To
use a literal <tt>-</tt> as the first endpoint of a range, make it a collating
element or (AREs only) precede it with <tt>@\</tt>. With the exception of
these, some combinations using <tt>[</tt> (see next paragraphs), and escapes,
all other special characters lose their special significance within a bracket
expression.

Within a bracket expression, a collating element (a character, a
multi-character sequence that collates as if it were a single character, or a
collating-sequence name for either) enclosed in <tt>[.</tt> and <tt>.]</tt>
stands for the sequence of characters of that collating element.

@e wxWidgets: Currently no multi-character collating elements are defined. So
in <tt>[.X.]</tt>, @c X can either be a single character literal or the name
of a character. For example, <tt>[[.0.]-[.9.]]</tt> and
<tt>[[.zero.]-[.nine.]]</tt> are identical, and both mean the same as
<tt>[0-9]</tt>. See @ref overview_resyntax_characters.

Within a bracket expression, a collating element enclosed in @b [= and @b =]
is an equivalence class, standing for the sequences of characters of all
collating elements equivalent to that one, including itself. An equivalence
class may not be an endpoint of a range.

@e wxWidgets: Currently no equivalence classes are defined, so @b [=X=] stands
for just the single character @e X. @e X can either be a single character
literal or the name of a character, see @ref overview_resyntax_characters.

Within a bracket expression, the name of a <em>character class</em> enclosed
in @b [: and @b :] stands for the list of all characters (not all collating
elements!) belonging to that class. Standard character classes are:

@beginTable
@row2col{ <tt>alpha</tt> , A letter. }
@row2col{ <tt>upper</tt> , An upper-case letter. }
@row2col{ <tt>lower</tt> , A lower-case letter. }
@row2col{ <tt>digit</tt> , A decimal digit. }
@row2col{ <tt>xdigit</tt> , A hexadecimal digit. }
@row2col{ <tt>alnum</tt> , An alphanumeric (letter or digit). }
@row2col{ <tt>print</tt> , An alphanumeric (same as alnum). }
@row2col{ <tt>blank</tt> , A space or tab character. }
@row2col{ <tt>space</tt> , A character producing white space in displayed text. }
@row2col{ <tt>punct</tt> , A punctuation character. }
@row2col{ <tt>graph</tt> , A character with a visible representation. }
@row2col{ <tt>cntrl</tt> , A control character. }
@endTable

A character class may not be used as an endpoint of a range.

@e wxWidgets: In a non-Unicode build, these character classifications depend on
the current locale, and correspond to the values returned by the ANSI C 'is'
functions: isalpha, isupper, etc. In Unicode mode they are based on Unicode
classifications, and are not affected by the current locale.

There are two special cases of bracket expressions: the bracket expressions
@b [[:@<:]] and @b [[:@>:]] are constraints, matching empty strings at the
beginning and end of a word respectively. A word is defined as a sequence of
word characters that is neither preceded nor followed by word characters. A
word character is an @e alnum character or an underscore (@b _). These special
bracket expressions are deprecated; users of AREs should use constraint escapes
instead (see @ref overview_resyntax_escapes below).

@section overview_resyntax_escapes Escapes

Escapes (AREs only), which begin with a <tt>@\</tt> followed by an alphanumeric
character, come in several varieties: character entry, class shorthands,
constraint escapes, and back references. A <tt>@\</tt> followed by an
alphanumeric character but not constituting a valid escape is illegal in AREs.
In EREs, there are no escapes: outside a bracket expression, a <tt>@\</tt>
followed by an alphanumeric character merely stands for that character as an
ordinary character, and inside a bracket expression, <tt>@\</tt> is an ordinary
character. (The latter is the one actual incompatibility between EREs and
AREs.)

Character-entry escapes (AREs only) exist to make it easier to specify
non-printing and otherwise inconvenient characters in REs:

@beginTable
@row2col{ <tt>@\a</tt> ,
    Alert (bell) character, as in C. }
@row2col{ <tt>@\b</tt> ,
    Backspace, as in C. }
@row2col{ <tt>@\B</tt> ,
    Synonym for <tt>@\</tt> to help reduce backslash doubling in some
    applications where there are multiple levels of backslash processing. }
@row2col{ <tt>@\c</tt><em>X</em> ,
    (Where @e X is any character) the character whose low-order 5 bits are the
    same as those of @e X, and whose other bits are all zero. }
@row2col{ <tt>@\e</tt> ,
    The character whose collating-sequence name is '@b ESC', or failing that,
    the character with octal value 033. }
@row2col{ <tt>@\f</tt> ,
    Formfeed, as in C. }
@row2col{ <tt>@\n</tt> ,
    Newline, as in C. }
@row2col{ <tt>@\r</tt> ,
    Carriage return, as in C. }
@row2col{ <tt>@\t</tt> ,
    Horizontal tab, as in C. }
@row2col{ <tt>@\u</tt><em>wxyz</em> ,
    (Where @e wxyz is exactly four hexadecimal digits) the Unicode character
    @b U+@e wxyz in the local byte ordering. }
@row2col{ <tt>@\U</tt><em>stuvwxyz</em> ,
    (Where @e stuvwxyz is exactly eight hexadecimal digits) reserved for a
    somewhat-hypothetical Unicode extension to 32 bits. }
@row2col{ <tt>@\v</tt> ,
    Vertical tab, as in C. }
@row2col{ <tt>@\x</tt><em>hhh</em> ,
    (Where @e hhh is any sequence of hexadecimal digits) the character whose
    hexadecimal value is @b 0x@e hhh (a single character no matter how many
    hexadecimal digits are used). }
@row2col{ <tt>@\0</tt> ,
    The character whose value is @b 0. }
@row2col{ <tt>@\</tt><em>xy</em> ,
    (Where @e xy is exactly two octal digits, and is not a <em>back
    reference</em> (see below)) the character whose octal value is
    @b 0@e xy. }
@row2col{ <tt>@\</tt><em>xyz</em> ,
    (Where @e xyz is exactly three octal digits, and is not a back reference
    (see below)) the character whose octal value is @b 0@e xyz. }
@endTable

Hexadecimal digits are '@b 0'-'@b 9', '@b a'-'@b f', and '@b A'-'@b F'. Octal
digits are '@b 0'-'@b 7'.

The character-entry escapes are always taken as ordinary characters. For
example, @b \135 is @b ] in ASCII, but @b \135 does not terminate a bracket
expression. Beware, however, that some applications (e.g., C compilers)
interpret such sequences themselves before the regular-expression package gets
to see them, which may require doubling (quadrupling, etc.) the '@b \'.

Class-shorthand escapes (AREs only) provide shorthands for certain
commonly-used character classes:

@beginTable
@row2col{ <tt>@\d</tt> , <tt>[[:digit:]]</tt> }
@row2col{ <tt>@\s</tt> , <tt>[[:space:]]</tt> }
@row2col{ <tt>@\w</tt> , <tt>[[:alnum:]_]</tt> (note underscore) }
@row2col{ <tt>@\D</tt> , <tt>[^[:digit:]]</tt> }
@row2col{ <tt>@\S</tt> , <tt>[^[:space:]]</tt> }
@row2col{ <tt>@\W</tt> , <tt>[^[:alnum:]_]</tt> (note underscore) }
@endTable

Within bracket expressions, '@b \d', '@b \s', and '@b \w' lose their outer
brackets, and '@b \D', '@b \S', and '@b \W' are illegal. (So, for example,
@b [a-c\d] is equivalent to @b [a-c[:digit:]]. Also, @b [a-c\D], which is
equivalent to @b [a-c^[:digit:]], is illegal.)
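
The class shorthands can be tried out with a short sketch. It uses Python's
@c re module as a stand-in engine, so it only approximates the behavior
described here: Python has no <tt>[[:digit:]]</tt> notation, its @c \\w is
Unicode-aware, and, unlike an ARE, it accepts @c \\D inside brackets. The
function name and sample text are hypothetical.

```python
import re

def shorthand_demo(text):
    """Find runs matched by \\w outside brackets and by \\d inside brackets."""
    words = re.findall(r"\w+", text)       # \w: alphanumerics plus underscore
    runs = re.findall(r"[a-c\d]+", text)   # like [a-c[:digit:]]: a-c or a digit
    return words, runs

print(shorthand_demo("ab12_z c3"))  # prints (['ab12_z', 'c3'], ['ab12', 'c3'])
```

The underscore is matched by <tt>\\w</tt> but not by <tt>[a-c\\d]</tt>, which
is why the first run stops at <tt>ab12</tt>.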

A constraint escape (AREs only) is a constraint, matching the empty string if
specific conditions are met, written as an escape:

@beginTable
@row2col{ <tt>@\A</tt> ,
    Matches only at the beginning of the string (see
    @ref overview_resyntax_matching, below, for how this differs from
    '@b ^'). }
@row2col{ <tt>@\m</tt> ,
    Matches only at the beginning of a word. }
@row2col{ <tt>@\M</tt> ,
    Matches only at the end of a word. }
@row2col{ <tt>@\y</tt> ,
    Matches only at the beginning or end of a word. }
@row2col{ <tt>@\Y</tt> ,
    Matches only at a point that is not the beginning or end of a word. }
@row2col{ <tt>@\Z</tt> ,
    Matches only at the end of the string (see
    @ref overview_resyntax_matching, below, for how this differs from
    '@b $'). }
@row2col{ <tt>@\</tt><em>m</em> ,
    (Where @e m is a nonzero digit) a <em>back reference</em>, see below. }
@row2col{ <tt>@\</tt><em>mnn</em> ,
    (Where @e m is a nonzero digit, and @e nn is some more digits, and the
    decimal value @e mnn is not greater than the number of closing capturing
    parentheses seen so far) a <em>back reference</em>, see below. }
@endTable

A word is defined as in the specification of @b [[:@<:]] and @b [[:@>:]] above.
Constraint escapes are illegal within bracket expressions.

A back reference (AREs only) matches the same string matched by the
parenthesized subexpression specified by the number, so that (e.g.)
@b ([bc])\1 matches @b bb or @b cc but not '@b bc'. The subexpression must
entirely precede the back reference in the RE. Subexpressions are numbered in
the order of their leading parentheses. Non-capturing parentheses do not
define subexpressions.
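
The back-reference example above can be checked with a small sketch. It uses
Python's @c re module as a stand-in engine, assuming the two agree on
single-digit back references and on the numbering of capturing parentheses.
Both function names are hypothetical.

```python
import re

def is_doubled(text):
    """True if text is 'bb' or 'cc': \\1 must repeat what ([bc]) matched."""
    return re.fullmatch(r"([bc])\1", text) is not None

def first_numbered_group(text):
    """(?:a) is non-capturing, so ([bc]) is still subexpression number 1."""
    match = re.fullmatch(r"(?:a)([bc])\1", text)
    return match.group(1) if match else None

print(is_doubled("bb"), is_doubled("cc"), is_doubled("bc"))  # prints True True False
print(first_numbered_group("abb"))  # prints b
```

A back reference repeats the matched text, not the subexpression, which is
why <tt>bc</tt> is rejected.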

There is an inherent historical ambiguity between octal character-entry
escapes and back references, which is resolved by heuristics, as hinted at
above. A leading zero always indicates an octal escape. A single non-zero
digit, not followed by another digit, is always taken as a back reference. A
multi-digit sequence not starting with a zero is taken as a back reference if
it comes after a suitable subexpression (i.e. the number is in the legal range
for a back reference), and otherwise is taken as octal.

@section overview_resyntax_metasyntax Metasyntax

In addition to the main syntax described above, there are some special forms
and miscellaneous syntactic facilities available.

Normally the flavor of RE being used is specified by application-dependent
means. However, this can be overridden by a @e director. If an RE of any flavor
begins with '@b ***:', the rest of the RE is an ARE. If an RE of any flavor
begins with '@b ***=', the rest of the RE is taken to be a literal string, with
all characters considered ordinary characters.
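
An ARE may begin with <em>embedded options</em>: a sequence @b (?xyz) (where
@e xyz is one or more alphabetic characters) specifies options affecting the
rest of the RE. These supplement, and can override, any options specified by
the application. The available option letters are:

@beginTable
@row2col{ <tt>b</tt> ,
    Rest of RE is a BRE. }
@row2col{ <tt>c</tt> ,
    Case-sensitive matching (usual default). }
@row2col{ <tt>e</tt> ,
    Rest of RE is an ERE. }
@row2col{ <tt>i</tt> ,
    Case-insensitive matching (see @ref overview_resyntax_matching, below). }
@row2col{ <tt>m</tt> ,
    Historical synonym for @b n. }
@row2col{ <tt>n</tt> ,
    Newline-sensitive matching (see @ref overview_resyntax_matching, below). }
@row2col{ <tt>p</tt> ,
    Partial newline-sensitive matching (see @ref overview_resyntax_matching,
    below). }
@row2col{ <tt>q</tt> ,
    Rest of RE is a literal ("quoted") string, all ordinary characters. }
@row2col{ <tt>s</tt> ,
    Non-newline-sensitive matching (usual default). }
@row2col{ <tt>t</tt> ,
    Tight syntax (usual default; see below). }
@row2col{ <tt>w</tt> ,
    Inverse partial newline-sensitive ("weird") matching (see
    @ref overview_resyntax_matching, below). }
@row2col{ <tt>x</tt> ,
    Expanded syntax (see below). }
@endTable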

Embedded options take effect at the @b ) terminating the sequence. They are
available only at the start of an ARE, and may not be used later within it.
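
As a rough illustration of an embedded option, the sketch below puts the @b i
option at the start of a pattern. It uses Python's @c re module as a stand-in
engine; Python shares the <tt>(?i)</tt> spelling but has its own option
letters, so the other letters listed above should not be assumed to carry
over. The function names are hypothetical.

```python
import re

def same_word(text):
    """Match 'wxwidgets' in any letter case via the embedded i option."""
    return re.fullmatch(r"(?i)wxwidgets", text) is not None

def same_word_case_sensitive(text):
    """The same pattern without the embedded option."""
    return re.fullmatch(r"wxwidgets", text) is not None

print(same_word("WxWidgets"), same_word_case_sensitive("WxWidgets"))  # prints True False
```

Because the option is part of the pattern text, it applies even when the
application itself did not request case-insensitive matching.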

In addition to the usual (@e tight) RE syntax, in which all characters are
significant, there is an @e expanded syntax, available in AREs with the
embedded @b x option. In the expanded syntax, white-space characters are
ignored and all characters between a @b # and the following newline (or the end
of the RE) are ignored, permitting paragraphing and commenting a complex RE.
There are three exceptions to that basic rule:

@li A white-space character or '@b #' preceded by '@b \' is retained.
@li White space or '@b #' within a bracket expression is retained.
@li White space and comments are illegal within multi-character symbols like
    the ARE '@b (?:' or the BRE '@b \('.

Expanded-syntax white-space characters are blank, tab, newline, and any
character that belongs to the @e space character class.
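
The expanded syntax and its first two exceptions can be sketched as below.
The example relies on Python's @c re module as a stand-in engine, whose
verbose mode is assumed to follow the same rules for this pattern. The
pattern, the constant name and the function name are all made up for the
illustration.

```python
import re

# Hypothetical pattern: an ISO-style date, a blank, '#', and a tag word.
TAGGED_DATE = r"""(?x)            # embedded x option selects the expanded syntax
    (\d{4}) - (\d{2}) - (\d{2})   # year, month, day; the blanks here are ignored
    [ ] \# (\w+)                  # a bracketed blank and an escaped '#' are retained
"""

def parse_tagged_date(text):
    """Return (year, month, day, tag), or None if text does not match."""
    match = re.fullmatch(TAGGED_DATE, text)
    return match.groups() if match else None

print(parse_tagged_date("2024-01-31 #release"))
# prints ('2024', '01', '31', 'release')
```

Without the brackets around the blank and the backslash before the
<tt>#</tt>, both would be dropped from the pattern.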

Finally, in an ARE, outside bracket expressions, the sequence '@b (?#ttt)'
(where @e ttt is any text not containing a '@b )') is a comment, completely
ignored. Again, this is not allowed between the characters of multi-character
symbols like '@b (?:'. Such comments are more a historical artifact than a
useful facility, and their use is deprecated; use the expanded syntax instead.

None of these metasyntax extensions is available if the application (or an
initial @b ***= director) has specified that the user's input be treated as a
literal string rather than as an RE.

@section overview_resyntax_matching Matching

In the event that an RE could match more than one substring of a given string,
the RE matches the one starting earliest in the string. If the RE could match
more than one substring starting at that point, its choice is determined by its
@e preference: either the longest substring, or the shortest.

Most atoms, and all constraints, have no preference. A parenthesized RE has the
same preference (possibly none) as the RE. A quantified atom with quantifier
@b {m} or @b {m}? has the same preference (possibly none) as the atom itself. A
quantified atom with other normal quantifiers (including @b {m,n} with @e m
equal to @e n) prefers longest match. A quantified atom with other non-greedy
quantifiers (including @b {m,n}? with @e m equal to @e n) prefers shortest
match. A branch has the same preference as the first quantified atom in it
which has a preference. An RE consisting of two or more branches connected by
the @b | operator prefers longest match.

Subject to the constraints imposed by the rules for matching the whole RE,
subexpressions also match the longest or shortest possible substrings, based on
their preferences, with subexpressions starting earlier in the RE taking
priority over ones starting later. Note that outer subexpressions thus take
priority over their component subexpressions.

Note that the quantifiers @b {1,1} and @b {1,1}? can be used to force longest
and shortest preference, respectively, on a subexpression or a whole RE.

Match lengths are measured in characters, not collating elements. An empty
string is considered longer than no match at all. For example:

@li @b bb* matches the three middle characters of '@b abbbc'.
@li @b (week|wee)(night|knights) matches all ten characters of
    '@b weeknights'.
@li When @b (.*).* is matched against @b abc, the parenthesized subexpression
    matches all three characters.
@li When @b (a*)* is matched against @b bc, both the whole RE and the
    parenthesized subexpression match an empty string.
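
The "earliest start, then longest" rule can be emulated by brute force, which
makes the first two examples above easy to check. This is only a sketch under
stated assumptions: it uses Python's @c re module (a first-match engine, not
the library described here) to test candidate substrings, it covers only REs
whose overall preference is longest match, and it says nothing about how the
real matcher works internally. The function name is hypothetical.

```python
import re

def leftmost_longest(pattern, text):
    """Return the earliest-starting, then longest, substring matching pattern."""
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):              # earliest start first
        for end in range(len(text), start - 1, -1):  # longest candidate first
            if compiled.fullmatch(text, start, end):
                return text[start:end]
    return None

print(leftmost_longest(r"bb*", "abbbc"))
# prints bbb
print(leftmost_longest(r"(week|wee)(night|knights)", "weeknights"))
# prints weeknights
print(re.search(r"(week|wee)(night|knights)", "weeknights").group(0))
# prints weeknight (a first-match engine stops one character short)
```

The last line shows why REs written for a first-match engine may report a
different substring under the rules in this section.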

If case-independent matching is specified, the effect is much as if all case
distinctions had vanished from the alphabet. When an alphabetic that exists in
multiple cases appears as an ordinary character outside a bracket expression,
it is effectively transformed into a bracket expression containing both cases,
so that @b x becomes '@b [xX]'. When it appears inside a bracket expression,
all case counterparts of it are added to the bracket expression, so that
@b [x] becomes @b [xX] and @b [^x] becomes '@b [^xX]'.

When newline-sensitive matching is specified, @b . and bracket expressions
using @b ^ will never match the newline character (so that matches will never
cross newlines unless the RE explicitly arranges it) and @b ^ and @b $ will
match the empty string after and before a newline respectively, in addition to
matching at beginning and end of string respectively. ARE @b \A and @b \Z
continue to match beginning or end of string @e only.

If partial newline-sensitive matching is specified, this affects @b . and
bracket expressions as with newline-sensitive matching, but not @b ^ and
'@b $'.

If inverse partial newline-sensitive matching is specified, this affects @b ^
and @b $ as with newline-sensitive matching, but not @b . and bracket
expressions. This isn't very useful but is provided for symmetry.

@section overview_resyntax_limits Limits and Compatibility

No particular limit is imposed on the length of REs. Programs intended to be
highly portable should not employ REs longer than 256 bytes, as a
POSIX-compliant implementation can refuse to accept such REs.

The only feature of AREs that is actually incompatible with POSIX EREs is that
@b \ does not lose its special significance inside bracket expressions. All
other ARE features use syntax which is illegal or has undefined or unspecified
effects in POSIX EREs; the @b *** syntax of directors likewise is outside the
POSIX syntax for both BREs and EREs.

Many of the ARE extensions are borrowed from Perl, but some have been changed
to clean them up, and a few Perl extensions are not present. Incompatibilities
of note include '@b \b', '@b \B', the lack of special treatment for a trailing
newline, the addition of complemented bracket expressions to the things
affected by newline-sensitive matching, the restrictions on parentheses and
back references in lookahead constraints, and the longest/shortest-match
(rather than first-match) matching semantics.

The matching rules for REs containing both normal and non-greedy quantifiers
have changed since early beta-test versions of this package. (The new rules are
much simpler and cleaner, but don't work as hard at guessing the user's real
intentions.)

Henry Spencer's original 1986 @e regexp package, still in widespread use,
implemented an early version of today's EREs. There are four incompatibilities
between @e regexp's near-EREs ('RREs' for short) and AREs. In roughly
increasing order of significance:

@li In AREs, @b \ followed by an alphanumeric character is either an escape or
    an error, while in RREs, it was just another way of writing the
    alphanumeric. This should not be a problem because there was no reason to
    write such a sequence in RREs.
@li @b { followed by a digit in an ARE is the beginning of a bound, while in
    RREs, @b { was always an ordinary character. Such sequences should be rare,
    and will often result in an error because following characters will not
    look like a valid bound.
@li In AREs, @b \ remains a special character within '@b []', so a literal
    @b \ within @b [] must be written '@b \\'. @b \\ also gives a literal @b \
    within @b [] in RREs, but only truly paranoid programmers routinely doubled
    the backslash.
@li AREs report the longest/shortest match for the RE, rather than the first
    found in a specified search order. This may affect some RREs which were
    written in the expectation that the first match would be reported. (The
    careful crafting of RREs to optimize the search order for fast matching is
    obsolete (AREs examine all possible matches in parallel, and their
    performance is largely insensitive to their complexity) but cases where the
    search order was exploited to deliberately find a match which was @e not
    the longest/shortest will need rewriting.)

@section overview_resyntax_bre Basic Regular Expressions

BREs differ from EREs in several respects. '@b |', '@b +', and @b ? are
ordinary characters and there is no equivalent for their functionality. The
delimiters for bounds are @b \{ and '@b \}', with @b { and @b } by themselves
ordinary characters. The parentheses for nested subexpressions are @b \( and
'@b \)', with @b ( and @b ) by themselves ordinary characters. @b ^ is an
ordinary character except at the beginning of the RE or the beginning of a
parenthesized subexpression, @b $ is an ordinary character except at the end of
the RE or the end of a parenthesized subexpression, and @b * is an ordinary
character if it appears at the beginning of the RE or the beginning of a
parenthesized subexpression (after a possible leading '@b ^'). Finally,
single-digit back references are available, and <tt>@\@<</tt> and
<tt>@\@></tt> are synonyms for @b [[:@<:]] and @b [[:@>:]] respectively; no
other escapes are available.

@section overview_resyntax_characters Regular Expression Character Names

Note that the character names are case sensitive.

right-square-bracket