/////////////////////////////////////////////////////////////////////////////
// Purpose:     topic overview
// Author:      wxWidgets team
// Licence:     wxWindows license
/////////////////////////////////////////////////////////////////////////////

@page resyn_overview Syntax of the built-in regular expression library

A @e regular expression describes strings of characters. It's a
pattern that matches certain strings and doesn't match others.

@li @ref differentflavors_overview
@li @ref resyntax_overview
@li @ref resynbracket_overview
@li @ref relimits_overview
@li @ref resynbre_overview
@li @ref resynchars_overview

@section differentflavors Different Flavors of REs

Regular expressions ("REs"), as defined by POSIX, come in two
flavors: @e extended REs ("EREs") and @e basic REs ("BREs"). EREs are roughly those
of the traditional @e egrep, while BREs are roughly those of the traditional
@e ed. This implementation adds a third flavor, @e advanced REs ("AREs"), basically
EREs with some significant extensions.

This manual page primarily describes
AREs. BREs mostly exist for backward compatibility in some old programs;
they are discussed at the end. POSIX EREs are almost an exact subset
of AREs. Features of AREs that are not present in EREs are indicated.

@section resyntax Regular Expression Syntax

These regular expressions are implemented using
the package written by Henry Spencer, based on the 1003.2 spec and some
(not quite all) of the Perl5 extensions (thanks, Henry!). Much of the description
of regular expressions below is copied verbatim from his manual entry.

An ARE is one or more @e branches, separated by '@b |', matching anything that matches
any of the branches.

A branch is zero or more @e constraints or @e quantified
atoms, concatenated. It matches a match for the first, followed by a match
for the second, etc; an empty branch matches the empty string.

A quantified atom is an @e atom possibly followed by a single @e quantifier. Without a quantifier,
it matches a match for the atom. The quantifiers, and what a so-quantified
atom matches, are:

@li @b * : a sequence of 0 or more matches of the atom
@li @b + : a sequence of 1 or more matches of the atom
@li @b ? : a sequence of 0 or 1 matches of the atom
@li @b {m} : a sequence of exactly @e m matches of the atom
@li @b {m,} : a sequence of @e m or more matches of the atom
@li @b {m,n} : a sequence of @e m through @e n (inclusive)
    matches of the atom; @e m may not exceed @e n
@li @b *? +? ?? {m}? {m,}? {m,n}? : @e non-greedy quantifiers,
    which match the same possibilities, but prefer the
    smallest number rather than the largest number of matches
    (see @ref wxresynmatching)

The forms using @b { and @b } are known as @e bounds. The numbers @e m and @e n are unsigned
decimal integers with permissible values from 0 to 255 inclusive.
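
The difference between normal and non-greedy quantifiers can be sketched as
follows. This is only an illustration under stated assumptions: it uses the
standard C++ @c std::regex class with its default ECMAScript grammar as a
stand-in, not the library described on this page, and @c first_match is a
hypothetical helper name. For these simple patterns both dialects are expected
to pick the same substrings.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first substring of `text` matched by
// `pattern`, or an empty string when nothing matches.
// Assumption: std::regex (ECMAScript grammar) stands in for an ARE engine.
std::string first_match(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}
```

With this helper, @c first_match("<.*>", "<a><b>") yields the whole string,
while the non-greedy form @c first_match("<.*?>", "<a><b>") stops at the first
closing bracket. Bounds behave the same way: @b a{2,3} prefers three
repetitions and @b a{2,3}? prefers two.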

An atom is one of:

@li @b (re) : (where @e re is any regular expression) matches a match for
    @e re, with the match noted for possible reporting
@li @b (?:re) : as previous, but
    does no reporting (a "non-capturing" set of parentheses)
@li @b () : matches an empty
    string, noted for possible reporting
@li @b (?:) : matches an empty string, without reporting
@li @b [chars] : a @e bracket expression, matching any one of the @e chars
    (see @ref resynbracket_overview for more detail)
@li @b . : matches any single character
@li @b \k : (where @e k is a non-alphanumeric character)
    matches that character taken as an ordinary character, e.g. @b \\ matches a backslash
    character
@li @b \c : where @e c is alphanumeric (possibly followed by other characters),
    an @e escape (AREs only), see @ref wxresynescapes below
@li @b { : when followed by a character
    other than a digit, matches the left-brace character '@b {'; when followed by
    a digit, it is the beginning of a @e bound (see above)
@li @b x : where @e x is a single
    character with no other significance, matches that character.

A @e constraint matches an empty string when specific conditions are met. A constraint may
not be followed by a quantifier. The simple constraints are as follows;
some more constraints are described later, under @ref wxresynescapes.

@li @b ^ : matches at the beginning of a line
@li @b $ : matches at the end of a line
@li @b (?=re) : @e positive lookahead
    (AREs only), matches at any point where a substring matching @e re begins
@li @b (?!re) : @e negative lookahead (AREs only),
    matches at any point where no substring matching @e re begins

The lookahead constraints may not contain back references
(see later), and all parentheses within them are considered non-capturing.

An RE may not end with '@b \'.
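
A lookahead constraint tests what follows without consuming it. The sketch
below illustrates this with the standard C++ @c std::regex class (ECMAScript
grammar), which is an assumed stand-in for the library described here; the
helper names @c found and @c matched_text are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` matches anywhere in `text`.
// Assumption: std::regex (ECMAScript grammar) stands in for an ARE engine.
bool found(const std::string& pattern, const std::string& text)
{
    return std::regex_search(text, std::regex(pattern));
}

// Hypothetical helper: the text of the first match, or "" if there is none.
std::string matched_text(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}
```

Here @b foo(?=bar) is found in "foobar" but not in "foobaz", and
@b foo(?!bar) behaves the other way round. In both cases the reported match is
just "foo", because the lookahead itself matches the empty string.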

@section wxresynbracket Bracket Expressions

A @e bracket expression is a list
of characters enclosed in '@b []'. It normally matches any single character from
the list (but see below). If the list begins with '@b ^', it matches any single
character (but see below) @e not from the rest of the list.

If two characters
in the list are separated by '@b -', this is shorthand for the full @e range of
characters between those two (inclusive) in the collating sequence, e.g.
@b [0-9] in ASCII matches any decimal digit. Two ranges may not share an endpoint,
so e.g. @b a-c-e is illegal. Ranges are very collating-sequence-dependent, and portable
programs should avoid relying on them.

To include a literal @b ] or @b - in the
list, the simplest method is to enclose it in @b [. and @b .] to make it a collating
element (see below). Alternatively, make it the first character (following
a possible '@b ^'), or (AREs only) precede it with '@b \'.
Alternatively, for '@b -', make
it the last character, or the second endpoint of a range. To use a literal
@b - as the first endpoint of a range, make it a collating element or (AREs
only) precede it with '@b \'. With the exception of these, some combinations using
@b [ (see next paragraphs), and escapes, all other special characters lose
their special significance within a bracket expression.

Within a bracket
expression, a collating element (a character, a multi-character sequence
that collates as if it were a single character, or a collating-sequence
name for either) enclosed in @b [. and @b .] stands for the
sequence of characters of that collating element.

@e wxWidgets: Currently no multi-character collating elements are defined.
So in @b [.X.], @e X can either be a single character literal or
the name of a character. For example, the following are identical:
@b [[.0.]-[.9.]] and @b [[.zero.]-[.nine.]], and both mean the same as
@b [0-9].
See @ref resynchars_overview.

Within a bracket expression, a collating element enclosed in @b [= and @b =]
is an equivalence class, standing for the sequences of characters of all
collating elements equivalent to that one, including itself.
An equivalence class may not be an endpoint of a range.

@e wxWidgets: Currently no equivalence classes are defined, so
@b [=X=] stands for just the single character @e X.
@e X can either be a single character literal or the name of a character,
see @ref resynchars_overview.

Within a bracket expression,
the name of a @e character class enclosed in @b [: and @b :] stands for the list
of all characters (not all collating elements!) belonging to that class.
Standard character classes are:

@li @b alpha : A letter.
@li @b upper : An upper-case letter.
@li @b lower : A lower-case letter.
@li @b digit : A decimal digit.
@li @b xdigit : A hexadecimal digit.
@li @b alnum : An alphanumeric (letter or digit).
@li @b print : An alphanumeric (same as alnum).
@li @b blank : A space or tab character.
@li @b space : A character producing white space in displayed text.
@li @b punct : A punctuation character.
@li @b graph : A character with a visible representation.
@li @b cntrl : A control character.

A character class may not be used as an endpoint of a range.

@e wxWidgets: In a non-Unicode build, these character classifications depend on the
current locale, and correspond to the values returned by the ANSI C 'is'
functions: isalpha, isupper, etc. In Unicode mode they are based on
Unicode classifications, and are not affected by the current locale.
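
As an illustration of character classes inside bracket expressions, the sketch
below collects every match of a pattern in a string. It assumes the standard
C++ @c std::regex class (ECMAScript grammar, which also accepts the
@b [:name:] notation) as a stand-in for the library described here, and
@c find_all is a hypothetical helper name. Classification follows the locale
imbued in the regex, which is assumed to be the default "C" locale.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every non-overlapping match of `pattern`
// in `text`, in order of appearance.
// Assumption: std::regex stands in for the engine described on this page.
std::vector<std::string> find_all(const std::string& pattern,
                                  const std::string& text)
{
    const std::regex re(pattern);
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it)
    {
        out.push_back(it->str(0));
    }
    return out;
}
```

For example, @c find_all("[[:digit:]]+", "a1b22c333") collects the three runs
of digits, and @b [[:upper:]][[:lower:]]* picks out capitalized word tails
such as "Widgets" in "wxWidgets".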

There are two special cases of bracket expressions:
the bracket expressions @b [[:<:]] and @b [[:>:]] are constraints, matching empty
strings at the beginning and end of a word respectively. A word is defined
as a sequence of word characters that is neither preceded nor followed
by word characters. A word character is an @e alnum character or an underscore
(@b _). These special bracket expressions are deprecated; users of AREs should
use constraint escapes instead (see @ref wxresynescapes below).

@section wxresynescapes Escapes

Escapes (AREs only),
which begin with a @b \ followed by an alphanumeric character, come in several
varieties: character entry, class shorthands, constraint escapes, and back
references. A @b \ followed by an alphanumeric character but not constituting
a valid escape is illegal in AREs. In EREs, there are no escapes: outside
a bracket expression, a @b \ followed by an alphanumeric character merely stands
for that character as an ordinary character, and inside a bracket expression,
@b \ is an ordinary character. (The latter is the one actual incompatibility
between EREs and AREs.)

Character-entry escapes (AREs only) exist to make
it easier to specify non-printing and otherwise inconvenient characters
in REs:

@li @b \a : alert (bell) character, as in C
@li @b \b : backspace, as in C
@li @b \B : synonym
    for @b \ to help reduce backslash doubling in some applications where there
    are multiple levels of backslash processing
@li @b \cX : (where @e X is any character)
    the character whose low-order 5 bits are the same as those of @e X, and whose
    other bits are all zero
@li @b \e : the character whose collating-sequence name is
    '@b ESC', or failing that, the character with octal value 033
@li @b \f : formfeed, as in C
@li @b \n : newline, as in C
@li @b \r : carriage return, as in C
@li @b \t : horizontal tab, as in C
@li @b \uwxyz : (where @e wxyz is exactly four hexadecimal digits)
    the Unicode
    character @b U+@e wxyz in the local byte ordering
@li @b \Ustuvwxyz : (where @e stuvwxyz is
    exactly eight hexadecimal digits) reserved for a somewhat-hypothetical Unicode
    extension to 32 bits
@li @b \v : vertical tab, as in C
@li @b \xhhh : (where
    @e hhh is any sequence of hexadecimal digits) the character whose hexadecimal
    value is @b 0x@e hhh (a single character no matter how many hexadecimal digits
    are used)
@li @b \0 : the character whose value is @b 0
@li @b \xy : (where @e xy is exactly two
    octal digits, and is not a @e back reference (see below)) the character whose
    octal value is @b 0@e xy
@li @b \xyz : (where @e xyz is exactly three octal digits, and is
    not a back reference (see below))
    the character whose octal value is @b 0@e xyz

Hexadecimal digits are '@b 0'-'@b 9', '@b a'-'@b f', and '@b A'-'@b F'. Octal
digits are '@b 0'-'@b 7'.

The character-entry
escapes are always taken as ordinary characters. For example, @b \135 is @b ] in
ASCII, but @b \135 does not terminate a bracket expression. Beware, however,
that some applications (e.g., C compilers) interpret such sequences themselves
before the regular-expression package gets to see them, which may require
doubling (quadrupling, etc.) the '@b \'.
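
The backslash-doubling issue can be seen directly in C++ source. The sketch
below only builds strings and compares them; the variable names are
hypothetical. It shows that an ordinary string literal lets the compiler
consume one level of backslashes, while a C++11 raw string literal passes the
text through unchanged.

```cpp
#include <string>

// Ordinary literal: the compiler turns "\\" into one backslash, so the
// regular-expression package would see the three characters \d+
const std::string re_doubled = "\\d+";

// Raw string literal: no doubling needed, same three characters.
const std::string re_raw = R"(\d+)";

// Here the compiler itself interprets the octal escape, so the RE package
// would only ever see the single character ']'.
const std::string octal_by_compiler = "\135";

// Doubling the backslash delivers the four characters \135 to the RE package.
const std::string octal_for_regex = "\\135";
```

The first two strings are identical, whereas the last two differ: one is a
single right bracket, the other is an escape sequence left for the
regular-expression package to interpret.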

Class-shorthand escapes (AREs only) provide
shorthands for certain commonly-used character classes:

@li @b \d : @b [[:digit:]]
@li @b \s : @b [[:space:]]
@li @b \w : @b [[:alnum:]_] (note underscore)
@li @b \D : @b [^[:digit:]]
@li @b \S : @b [^[:space:]]
@li @b \W : @b [^[:alnum:]_] (note underscore)

Within bracket expressions, '@b \d', '@b \s', and
'@b \w' lose their outer brackets, and '@b \D',
'@b \S', and '@b \W' are illegal. (So, for example,
@b [a-c\d] is equivalent to @b [a-c[:digit:]].
Also, @b [a-c\D], which is equivalent to
@b [a-c^[:digit:]], is illegal.)

A constraint escape (AREs only) is a constraint,
matching the empty string if specific conditions are met, written as an
escape:

@li @b \A : matches only at the beginning of the string
    (see @ref wxresynmatching, below,
    for how this differs from '@b ^')
@li @b \m : matches only at the beginning of a word
@li @b \M : matches only at the end of a word
@li @b \y : matches only at the beginning or end of a word
@li @b \Y : matches only at a point that is not the beginning or end of
    a word
@li @b \Z : matches only at the end of the string
    (see @ref wxresynmatching, below, for
    how this differs from '@b $')
@li @b \m : (where @e m is a nonzero digit) a @e back reference,
    see below
@li @b \mnn : (where @e m is a nonzero digit, and @e nn is some more digits,
    and the decimal value @e mnn is not greater than the number of closing capturing
    parentheses seen so far) a @e back reference, see below

A word is defined
as in the specification of @b [[:<:]] and @b [[:>:]] above. Constraint escapes are
illegal within bracket expressions.

A back reference (AREs only) matches
the same string matched by the parenthesized subexpression specified by
the number, so that (e.g.) @b ([bc])\1 matches @b bb or @b cc but not '@b bc'.
The subexpression
must entirely precede the back reference in the RE. Subexpressions are numbered
in the order of their leading parentheses. Non-capturing parentheses do not
define subexpressions.
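
The back-reference example above can be tried out with a short sketch. It
assumes the standard C++ @c std::regex class (ECMAScript grammar) as a
stand-in for the library described here; @c is_doubled_b_or_c is a
hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of `text` matches ([bc])\1, i.e.
// one of the letters b or c followed by the same letter again.
// Assumption: std::regex stands in for the engine described on this page.
bool is_doubled_b_or_c(const std::string& text)
{
    // Raw string literal, so the backslash of \1 needs no doubling.
    static const std::regex re(R"(([bc])\1)");
    return std::regex_match(text, re);
}
```

The helper accepts "bb" and "cc" but rejects "bc", because @b \1 must repeat
the exact text captured by the first parenthesized subexpression rather than
re-apply its pattern.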

There is an inherent historical ambiguity between
octal character-entry escapes and back references, which is resolved by
heuristics, as hinted at above. A leading zero always indicates an octal
escape. A single non-zero digit, not followed by another digit, is always
taken as a back reference. A multi-digit sequence not starting with a zero
is taken as a back reference if it comes after a suitable subexpression
(i.e. the number is in the legal range for a back reference), and otherwise
is taken as octal.

@section remetasyntax Metasyntax

In addition to the main syntax described above,
there are some special forms and miscellaneous syntactic facilities available.

Normally the flavor of RE being used is specified by application-dependent
means. However, this can be overridden by a @e director. If an RE of any flavor
begins with '@b ***:', the rest of the RE is an ARE. If an RE of any flavor begins
with '@b ***=', the rest of the RE is taken to be a literal string, with all
characters considered ordinary characters.

An ARE may begin with @e embedded options: a sequence @b (?xyz)
(where @e xyz is one or more alphabetic characters)
specifies options affecting the rest of the RE. These supplement, and can
override, any options specified by the application. The available option
letters are:

@li @b b : rest of RE is a BRE
@li @b c : case-sensitive matching (usual default)
@li @b e : rest of RE is an ERE
@li @b i : case-insensitive matching (see @ref wxresynmatching, below)
@li @b m : historical synonym for @b n
@li @b n : newline-sensitive matching (see @ref wxresynmatching, below)
@li @b p : partial newline-sensitive matching (see @ref wxresynmatching, below)
@li @b q : rest of RE
    is a literal ("quoted") string, all ordinary characters
@li @b s : non-newline-sensitive matching (usual default)
@li @b t : tight syntax (usual default; see below)
@li @b w : inverse
    partial newline-sensitive ("weird") matching (see @ref wxresynmatching, below)
@li @b x : expanded syntax (see below)

Embedded options take effect at the @b ) terminating the
sequence. They are available only at the start of an ARE, and may not be
used later within it.

In addition to the usual (@e tight) RE syntax, in which
all characters are significant, there is an @e expanded syntax, available
in AREs with the embedded
x option. In the expanded syntax, white-space characters are ignored and
all characters between a @b # and the following newline (or the end of the
RE) are ignored, permitting paragraphing and commenting a complex RE. There
are three exceptions to that basic rule:

@li a white-space character or '@b #' preceded
    by '@b \' is retained
@li white space or '@b #' within a bracket expression is retained
@li white space and comments are illegal within multi-character symbols like
    the ARE '@b (?:' or the BRE '@b \('

Expanded-syntax white-space characters are blank,
tab, newline, and any character that belongs to the @e space character class.

Finally, in an ARE, outside bracket expressions, the sequence '@b (?#ttt)' (where
@e ttt is any text not containing a '@b )') is a comment, completely ignored. Again,
this is not allowed between the characters of multi-character symbols like
'@b (?:'. Such comments are more a historical artifact than a useful facility,
and their use is deprecated; use the expanded syntax instead.

None of these
metasyntax extensions is available if the application (or an initial @b ***=
director) has specified that the user's input be treated as a literal string
rather than as an RE.

@section wxresynmatching Matching

In the event that an RE could match more than
one substring of a given string, the RE matches the one starting earliest
in the string. If the RE could match more than one substring starting at
that point, its choice is determined by its @e preference: either the longest
substring, or the shortest.

Most atoms, and all constraints, have no preference.
A parenthesized RE has the same preference (possibly none) as the RE. A
quantified atom with quantifier @b {m} or @b {m}? has the same preference (possibly
none) as the atom itself. A quantified atom with other normal quantifiers
(including @b {m,n} with @e m equal to @e n) prefers longest match. A quantified
atom with other non-greedy quantifiers (including @b {m,n}? with @e m equal to
@e n) prefers shortest match. A branch has the same preference as the first
quantified atom in it which has a preference. An RE consisting of two or
more branches connected by the @b | operator prefers longest match.

Subject to the constraints imposed by the rules for matching the whole RE, subexpressions
also match the longest or shortest possible substrings, based on their
preferences, with subexpressions starting earlier in the RE taking priority
over ones starting later. Note that outer subexpressions thus take priority
over their component subexpressions.

Note that the quantifiers @b {1,1} and
@b {1,1}? can be used to force longest and shortest preference, respectively,
on a subexpression or a whole RE.

Match lengths are measured in characters,
not collating elements. An empty string is considered longer than no match
at all. For example, @b bb* matches the three middle characters
of '@b abbbc', @b (week|wee)(night|knights)
matches all ten characters of '@b weeknights', when @b (.*).* is matched against
@b abc the parenthesized subexpression matches all three characters, and when
@b (a*)* is matched against @b bc both the whole RE and the parenthesized subexpression
match an empty string.

If case-independent matching is specified, the effect
is much as if all case distinctions had vanished from the alphabet. When
an alphabetic that exists in multiple cases appears as an ordinary character
outside a bracket expression, it is effectively transformed into a bracket
expression containing both cases, so that @b x becomes '@b [xX]'. When it appears
inside a bracket expression, all case counterparts of it are added to the
bracket expression, so that @b [x] becomes @b [xX] and @b [^x] becomes '@b [^xX]'.
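
The effect of case-independent matching on ordinary characters and bracket
expressions can be sketched as follows. The example assumes the standard C++
@c std::regex class with its @c icase flag as a stand-in for the library
described here; @c matches_nocase is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of `text` matches `pattern`
// when case distinctions are ignored.
// Assumption: std::regex with the icase flag stands in for the engine
// described on this page.
bool matches_nocase(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern, std::regex::ECMAScript | std::regex::icase);
    return std::regex_match(text, re);
}
```

With this helper, both @b x and @b [x] accept "X", and the complemented
bracket expression @b [^x] rejects "X" as well as "x", in line with the
@b [^xX] reading given above.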

If newline-sensitive
matching is specified, @b . and bracket expressions using @b ^ will never match
the newline character (so that matches will never cross newlines unless
the RE explicitly arranges it) and @b ^ and @b $ will match the empty string after
and before a newline respectively, in addition to matching at beginning
and end of string respectively. ARE @b \A and @b \Z continue to match beginning
or end of string @e only.

If partial newline-sensitive matching is specified,
this affects @b . and bracket expressions as with newline-sensitive matching,
but not @b ^ and '@b $'.

If inverse partial newline-sensitive matching is specified,
this affects @b ^ and @b $ as with newline-sensitive matching, but not @b . and bracket
expressions. This isn't very useful but is provided for symmetry.

@section relimits Limits And Compatibility

No particular limit is imposed on the length of REs. Programs
intended to be highly portable should not employ REs longer than 256 bytes,
as a POSIX-compliant implementation can refuse to accept such REs.

The only
feature of AREs that is actually incompatible with POSIX EREs is that @b \
does not lose its special significance inside bracket expressions. All other
ARE features use syntax which is illegal or has undefined or unspecified
effects in POSIX EREs; the @b *** syntax of directors likewise is outside
the POSIX syntax for both BREs and EREs.

Many of the ARE extensions are
borrowed from Perl, but some have been changed to clean them up, and a
few Perl extensions are not present. Incompatibilities of note include '@b \b',
'@b \B', the lack of special treatment for a trailing newline, the addition of
complemented bracket expressions to the things affected by newline-sensitive
matching, the restrictions on parentheses and back references in lookahead
constraints, and the longest/shortest-match (rather than first-match) matching
semantics.

The matching rules for REs containing both normal and non-greedy
quantifiers have changed since early beta-test versions of this package.
(The new rules are much simpler and cleaner, but don't work as hard at guessing
the user's real intentions.)

Henry Spencer's original 1986 @e regexp package, still in widespread use,
implemented an early version of today's EREs. There are four incompatibilities between @e regexp's
near-EREs ('RREs' for short) and AREs. In roughly increasing order of significance:

@li In AREs, @b \ followed by an alphanumeric character is either an escape or
    an error, while in RREs, it was just another way of writing the alphanumeric.
    This should not be a problem because there was no reason to write such
    a sequence in RREs.
@li @b { followed by a digit in an ARE is the beginning of
    a bound, while in RREs, @b { was always an ordinary character. Such sequences
    should be rare, and will often result in an error because following characters
    will not look like a valid bound.
@li In AREs, @b \ remains a special character
    within '@b []', so a literal @b \ within @b [] must be
    written '@b \\'. @b \\ also gives a literal
    @b \ within @b [] in RREs, but only truly paranoid programmers routinely doubled
    the backslash.
@li AREs report the longest/shortest match for the RE, rather
    than the first found in a specified search order. This may affect some RREs
    which were written in the expectation that the first match would be reported.
    (The careful crafting of RREs to optimize the search order for fast matching
    is obsolete (AREs examine all possible matches in parallel, and their performance
    is largely insensitive to their complexity) but cases where the search
    order was exploited to deliberately find a match which was @e not the longest/shortest
    will need rewriting.)

@section wxresynbre Basic Regular Expressions

BREs differ from EREs in
several respects. '@b |', '@b +', and @b ? are ordinary characters and there is no equivalent
for their functionality. The delimiters for bounds
are @b \{ and '@b \}', with @b { and
@b } by themselves ordinary characters. The parentheses for nested subexpressions
are @b \( and '@b \)', with @b ( and @b ) by themselves
ordinary characters. @b ^ is an ordinary
character except at the beginning of the RE or the beginning of a parenthesized
subexpression, @b $ is an ordinary character except at the end of the RE or
the end of a parenthesized subexpression, and @b * is an ordinary character
if it appears at the beginning of the RE or the beginning of a parenthesized
subexpression (after a possible leading '@b ^'). Finally, single-digit back references
are available, and @b \< and @b \> are synonyms
for @b [[:<:]] and @b [[:>:]] respectively;
no other escapes are available.
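
The BRE rules for bounds, parentheses and '@b +' can be sketched with the POSIX
basic grammar offered by the standard C++ @c std::regex class. This is an
assumed stand-in for the library described here, standard library
implementations may differ in how strictly they follow POSIX, and
@c matches_basic is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of `text` matches `pattern`
// interpreted as a POSIX basic regular expression.
// Assumption: std::regex::basic stands in for the BRE flavor described here.
bool matches_basic(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern, std::regex::basic);
    return std::regex_match(text, re);
}
```

Under this grammar @b a+ matches only the literal two-character text "a+",
@b (ab) matches the literal text "(ab)", and a grouped, bounded repetition has
to be written with backslashes, as in <tt>\\(ab\\)\\{2\\}</tt>, which matches
"abab".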

@section wxresynchars Regular Expression Character Names

Note that the character names are case sensitive.

right-square-bracket