/////////////////////////////////////////////////////////////////////////////
// Name: resyntax.h
// Purpose: topic overview
// Author: wxWidgets team
// RCS-ID: $Id$
// Licence: wxWindows license
/////////////////////////////////////////////////////////////////////////////

/*!

@page overview_resyntax Syntax of the Built-in Regular Expression Library

A <em>regular expression</em> describes strings of characters. It's a pattern
that matches certain strings and doesn't match others.

@li @ref overview_resyntax_differentflavors
@li @ref overview_resyntax_syntax
@li @ref overview_resyntax_bracket
@li @ref overview_resyntax_escapes
@li @ref overview_resyntax_metasyntax
@li @ref overview_resyntax_matching
@li @ref overview_resyntax_limits
@li @ref overview_resyntax_bre
@li @ref overview_resyntax_characters

@seealso

@li #wxRegEx


<hr>


@section overview_resyntax_differentflavors Different Flavors of Regular Expressions

Regular expressions (RE), as defined by POSIX, come in two flavors:
<em>extended regular expressions</em> (ERE) and <em>basic regular
expressions</em> (BRE). EREs are roughly those of the traditional @e egrep,
while BREs are roughly those of the traditional @e ed. This implementation
adds a third flavor: <em>advanced regular expressions</em> (ARE), basically
EREs with some significant extensions.

This manual page primarily describes AREs. BREs mostly exist for backward
compatibility in some old programs. POSIX EREs are almost an exact subset of
AREs. Features of AREs that are not present in EREs will be indicated.


@section overview_resyntax_syntax Regular Expression Syntax

These regular expressions are implemented using the package written by Henry
Spencer, based on the 1003.2 spec and some (not quite all) of the Perl5
extensions (thanks, Henry!). Much of the description of regular expressions
below is copied verbatim from his manual entry.

An ARE is one or more @e branches, separated by "|", matching anything that
matches any of the branches.

A branch is zero or more @e constraints or <em>quantified atoms</em>,
concatenated. It matches a match for the first, followed by a match for the
second, etc.; an empty branch matches the empty string.

A quantified atom is an @e atom possibly followed by a single @e quantifier.
Without a quantifier, it matches a match for the atom. The quantifiers, and
what a so-quantified atom matches, are:

@beginTable
@row2col{ <tt>*</tt> ,
    A sequence of 0 or more matches of the atom. }
@row2col{ <tt>+</tt> ,
    A sequence of 1 or more matches of the atom. }
@row2col{ <tt>?</tt> ,
    A sequence of 0 or 1 matches of the atom. }
@row2col{ <tt>{m}</tt> ,
    A sequence of exactly @e m matches of the atom. }
@row2col{ <tt>{m\,}</tt> ,
    A sequence of @e m or more matches of the atom. }
@row2col{ <tt>{m\,n}</tt> ,
    A sequence of @e m through @e n (inclusive) matches of the atom; @e m may
    not exceed @e n. }
@row2col{ <tt>*? +? ?? {m}? {m\,}? {m\,n}?</tt> ,
    <em>Non-greedy quantifiers</em>, which match the same possibilities, but
    prefer the smallest number rather than the largest number of matches (see
    @ref overview_resyntax_matching). }
@endTable
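
The effect of the quantifiers in the table above can be tried out with a small
sketch. It uses the standard C++ @c std::regex class (default ECMAScript
grammar) purely as a stand-in, on the assumption that it handles these
particular quantifiers the same way an ARE does; @c first_match is a
hypothetical helper, not part of wxWidgets.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not a wxWidgets function): returns the text of the
// first match of `pattern` in `text`, or an empty string if there is none.
std::string first_match(const std::string& pattern, const std::string& text)
{
    std::smatch m;
    if (std::regex_search(text, m, std::regex(pattern)))
        return m.str(0);
    return std::string();
}
```

Against the subject string <tt>caaab</tt>, <tt>a+</tt> yields <tt>aaa</tt>
while the non-greedy <tt>a+?</tt> yields <tt>a</tt>; <tt>a{2}</tt> yields
<tt>aa</tt>, <tt>a{2,}</tt> yields <tt>aaa</tt>, and <tt>a{1,2}?</tt> yields
<tt>a</tt>.
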

The forms using @b { and @b } are known as @e bounds. The numbers @e m and
@e n are unsigned decimal integers with permissible values from 0 to 255
inclusive. An atom is one of:

@beginTable
@row2col{ <tt>(re)</tt> ,
    Where @e re is any regular expression, matches for @e re, with the match
    captured for possible reporting. }
@row2col{ <tt>(?:re)</tt> ,
    As previous, but does no reporting (a "non-capturing" set of
    parentheses). }
@row2col{ <tt>()</tt> ,
    Matches an empty string, captured for possible reporting. }
@row2col{ <tt>(?:)</tt> ,
    Matches an empty string, without reporting. }
@row2col{ <tt>[chars]</tt> ,
    A <em>bracket expression</em>, matching any one of the @e chars (see
    @ref overview_resyntax_bracket for more details). }
@row2col{ <tt>.</tt> ,
    Matches any single character. }
@row2col{ <tt>@\k</tt> ,
    Where @e k is a non-alphanumeric character, matches that character taken
    as an ordinary character, e.g. <tt>@\@\</tt> matches a backslash
    character. }
@row2col{ <tt>@\c</tt> ,
    Where @e c is alphanumeric (possibly followed by other characters), an
    @e escape (AREs only), see @ref overview_resyntax_escapes below. }
@row2col{ <tt>@leftCurly</tt> ,
    When followed by a character other than a digit, matches the left-brace
    character "@leftCurly"; when followed by a digit, it is the beginning of a
    @e bound (see above). }
@row2col{ <tt>x</tt> ,
    Where @e x is a single character with no other significance, matches that
    character. }
@endTable
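
The difference between capturing and non-capturing parentheses can be seen in
the following sketch. It again relies on @c std::regex as a stand-in (an
assumption: both forms behave the same way there), and @c captures is a
hypothetical helper introduced only for this illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the captured subexpressions (excluding the
// whole match) of the first match of `pattern` in `text`.
std::vector<std::string> captures(const std::string& pattern,
                                  const std::string& text)
{
    std::vector<std::string> out;
    std::smatch m;
    if (std::regex_search(text, m, std::regex(pattern)))
    {
        for (std::size_t i = 1; i < m.size(); ++i)
            out.push_back(m.str(i));
    }
    return out;
}
```

For the text <tt>pages 12-34</tt>, <tt>([0-9]+)-([0-9]+)</tt> reports the two
strings <tt>12</tt> and <tt>34</tt>, whereas <tt>(?:[0-9]+)-([0-9]+)</tt>
reports only <tt>34</tt>, because the first group is matched but not captured.
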

A @e constraint matches an empty string when specific conditions are met. A
constraint may not be followed by a quantifier. The simple constraints are as
follows; some more constraints are described later, under
@ref overview_resyntax_escapes.

@beginTable
@row2col{ <tt>^</tt> ,
    Matches at the beginning of a line. }
@row2col{ <tt>@$</tt> ,
    Matches at the end of a line. }
@row2col{ <tt>(?=re)</tt> ,
    <em>Positive lookahead</em> (AREs only), matches at any point where a
    substring matching @e re begins. }
@row2col{ <tt>(?!re)</tt> ,
    <em>Negative lookahead</em> (AREs only), matches at any point where no
    substring matching @e re begins. }
@endTable

The lookahead constraints may not contain back references (see later), and all
parentheses within them are considered non-capturing. An RE may not end with
<tt>@\</tt>.
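
A lookahead constraint tests what follows without consuming it. The sketch
below illustrates this with @c std::regex as a stand-in, assuming its
<tt>(?=re)</tt> and <tt>(?!re)</tt> forms behave like the ARE ones for such
simple cases; @c match_position is a hypothetical helper.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: position of the first match of `pattern` in `text`,
// or -1 if there is no match.
long match_position(const std::string& pattern, const std::string& text)
{
    std::smatch m;
    if (std::regex_search(text, m, std::regex(pattern)))
        return static_cast<long>(m.position(0));
    return -1;
}
```

In <tt>foobaz foobar</tt>, <tt>foo(?=bar)</tt> skips the first <tt>foo</tt> and
matches the one at position 7; in <tt>foobar foobaz</tt>, <tt>foo(?!bar)</tt>
likewise matches at position 7. In both cases only the three characters
<tt>foo</tt> are part of the match.
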


@section overview_resyntax_bracket Bracket Expressions

A <em>bracket expression</em> is a list of characters enclosed in <tt>[]</tt>.
It normally matches any single character from the list (but see below). If the
list begins with @c ^, it matches any single character (but see below) @e not
from the rest of the list.

If two characters in the list are separated by <tt>-</tt>, this is shorthand
for the full @e range of characters between those two (inclusive) in the
collating sequence, e.g. <tt>[0-9]</tt> in ASCII matches any decimal digit.
Two ranges may not share an endpoint, so e.g. <tt>a-c-e</tt> is illegal.
Ranges are very collating-sequence-dependent, and portable programs should
avoid relying on them.

To include a literal <tt>]</tt> or <tt>-</tt> in the list, the simplest method
is to enclose it in <tt>[.</tt> and <tt>.]</tt> to make it a collating element
(see below). Alternatively, make it the first character (following a possible
<tt>^</tt>), or (AREs only) precede it with <tt>@\</tt>. Alternatively, for
<tt>-</tt>, make it the last character, or the second endpoint of a range. To
use a literal <tt>-</tt> as the first endpoint of a range, make it a collating
element or (AREs only) precede it with <tt>@\</tt>. With the exception of
these, some combinations using <tt>[</tt> (see next paragraphs), and escapes,
all other special characters lose their special significance within a bracket
expression.

Within a bracket expression, a collating element (a character, a
multi-character sequence that collates as if it were a single character, or a
collating-sequence name for either) enclosed in <tt>[.</tt> and <tt>.]</tt>
stands for the sequence of characters of that collating element.

@e wxWidgets: Currently no multi-character collating elements are defined. So
in <tt>[.X.]</tt>, @c X can either be a single character literal or the name
of a character. For example, <tt>[[.0.]-[.9.]]</tt> and
<tt>[[.zero.]-[.nine.]]</tt> are identical, and both mean the same as
<tt>[0-9]</tt>. See @ref overview_resyntax_characters.

Within a bracket expression, a collating element enclosed in <tt>[=</tt> and
<tt>=]</tt> is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself. An
equivalence class may not be an endpoint of a range.

@e wxWidgets: Currently no equivalence classes are defined, so <tt>[=X=]</tt>
stands for just the single character @e X. @e X can either be a single
character literal or the name of a character, see
@ref overview_resyntax_characters.

Within a bracket expression, the name of a <em>character class</em> enclosed
in <tt>[:</tt> and <tt>:]</tt> stands for the list of all characters (not all
collating elements!) belonging to that class. Standard character classes are:

@beginTable
@row2col{ <tt>alpha</tt> , A letter. }
@row2col{ <tt>upper</tt> , An upper-case letter. }
@row2col{ <tt>lower</tt> , A lower-case letter. }
@row2col{ <tt>digit</tt> , A decimal digit. }
@row2col{ <tt>xdigit</tt> , A hexadecimal digit. }
@row2col{ <tt>alnum</tt> , An alphanumeric (letter or digit). }
@row2col{ <tt>print</tt> , An alphanumeric (same as alnum). }
@row2col{ <tt>blank</tt> , A space or tab character. }
@row2col{ <tt>space</tt> , A character producing white space in displayed text. }
@row2col{ <tt>punct</tt> , A punctuation character. }
@row2col{ <tt>graph</tt> , A character with a visible representation. }
@row2col{ <tt>cntrl</tt> , A control character. }
@endTable
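
Ranges, complemented lists and character classes can be exercised with the
short sketch below. It uses @c std::regex as a stand-in and assumes the default
"C" locale and an ASCII character set; @c matches_whole is a hypothetical
helper. Collating elements and equivalence classes are deliberately left out,
since the names accepted for them depend on the implementation.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of `text` matches `pattern`.
bool matches_whole(const std::string& pattern, const std::string& text)
{
    return std::regex_match(text, std::regex(pattern));
}
```

With this helper, <tt>[0-9]+</tt> accepts <tt>2024</tt> while <tt>[^0-9]+</tt>
rejects it; <tt>[[:alpha:]]+</tt> accepts <tt>wxRegEx</tt>;
<tt>[[:xdigit:]]+</tt> accepts <tt>7fA0</tt>; and a list may mix classes and
ordinary characters, so <tt>[[:alpha:][:digit:]_]+</tt> accepts <tt>wx_3</tt>.
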

A character class may not be used as an endpoint of a range.

@e wxWidgets: In a non-Unicode build, these character classifications depend
on the current locale, and correspond to the values returned by the ANSI C
'is' functions: isalpha, isupper, etc. In Unicode mode they are based on
Unicode classifications, and are not affected by the current locale.

There are two special cases of bracket expressions: the bracket expressions
<tt>[[:@<:]]</tt> and <tt>[[:@>:]]</tt> are constraints, matching empty strings
at the beginning and end of a word respectively. A word is defined as a
sequence of word characters that is neither preceded nor followed by word
characters. A word character is an @e alnum character or an underscore
(<tt>_</tt>). These special bracket expressions are deprecated; users of AREs
should use constraint escapes instead (see @ref overview_resyntax_escapes).


@section overview_resyntax_escapes Escapes

Escapes (AREs only), which begin with a <tt>@\</tt> followed by an alphanumeric
character, come in several varieties: character entry, class shorthands,
constraint escapes, and back references. A <tt>@\</tt> followed by an
alphanumeric character but not constituting a valid escape is illegal in AREs.
In EREs, there are no escapes: outside a bracket expression, a <tt>@\</tt>
followed by an alphanumeric character merely stands for that character as an
ordinary character, and inside a bracket expression, <tt>@\</tt> is an ordinary
character. (The latter is the one actual incompatibility between EREs and
AREs.)

Character-entry escapes (AREs only) exist to make it easier to specify
non-printing and otherwise inconvenient characters in REs:

@beginTable
@row2col{ <tt>@\a</tt> , Alert (bell) character, as in C. }
@row2col{ <tt>@\b</tt> , Backspace, as in C. }
@row2col{ <tt>@\B</tt> ,
    Synonym for <tt>@\</tt> to help reduce backslash doubling in some
    applications where there are multiple levels of backslash processing. }
@row2col{ <tt>@\c</tt>@e X ,
    Where @e X is any character, the character whose low-order 5 bits are the
    same as those of @e X, and whose other bits are all zero. }
@row2col{ <tt>@\e</tt> ,
    The character whose collating-sequence name is <tt>ESC</tt>, or failing
    that, the character with octal value 033. }
@row2col{ <tt>@\f</tt> , Formfeed, as in C. }
@row2col{ <tt>@\n</tt> , Newline, as in C. }
@row2col{ <tt>@\r</tt> , Carriage return, as in C. }
@row2col{ <tt>@\t</tt> , Horizontal tab, as in C. }
@row2col{ <tt>@\u</tt>@e wxyz ,
    Where @e wxyz is exactly four hexadecimal digits, the Unicode character
    <tt>U+</tt>@e wxyz in the local byte ordering. }
@row2col{ <tt>@\U</tt>@e stuvwxyz ,
    Where @e stuvwxyz is exactly eight hexadecimal digits, reserved for a
    somewhat-hypothetical Unicode extension to 32 bits. }
@row2col{ <tt>@\v</tt> , Vertical tab, as in C. }
@row2col{ <tt>@\x</tt>@e hhh ,
    Where @e hhh is any sequence of hexadecimal digits, the character whose
    hexadecimal value is <tt>0x</tt>@e hhh (a single character no matter how
    many hexadecimal digits are used). }
@row2col{ <tt>@\0</tt> , The character whose value is 0. }
@row2col{ <tt>@\</tt>@e xy ,
    Where @e xy is exactly two octal digits, and is not a <em>back
    reference</em> (see below), the character whose octal value is
    <tt>0</tt>@e xy. }
@row2col{ <tt>@\</tt>@e xyz ,
    Where @e xyz is exactly three octal digits, and is not a back reference
    (see below), the character whose octal value is <tt>0</tt>@e xyz. }
@endTable

Hexadecimal digits are <tt>0</tt>-<tt>9</tt>, <tt>a</tt>-<tt>f</tt>, and
<tt>A</tt>-<tt>F</tt>. Octal digits are <tt>0</tt>-<tt>7</tt>.

The character-entry escapes are always taken as ordinary characters. For
example, <tt>@\135</tt> is <tt>]</tt> in ASCII, but <tt>@\135</tt> does not
terminate a bracket expression. Beware, however, that some applications (e.g.,
C compilers) interpret such sequences themselves before the regular-expression
package gets to see them, which may require doubling (quadrupling, etc.) the
<tt>@\</tt>.
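
The interaction with the compiler can be made concrete with plain C++ string
literals, no regular-expression library involved. The sketch assumes an ASCII
execution character set (so that octal 135 is <tt>]</tt>); the variable names
are illustrative only.

```cpp
#include <string>

// The compiler itself replaces \135 with ']' before any RE code runs, so the
// RE package would receive the four characters  [ a ] ]
const std::string seen_after_compiler = "[a\135]";

// Doubling the backslash lets the escape survive compilation, so the RE
// package would receive the seven characters  [ a \ 1 3 5 ]
const std::string escape_preserved = "[a\\135]";

// A raw string literal avoids the doubling and yields the same seven
// characters as escape_preserved.
const std::string escape_raw = R"([a\135])";
```

Only the last two forms hand the escape itself to the regular-expression
package, which is what the text above requires for <tt>@\135</tt> to be treated
as an ordinary character inside the bracket expression.
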

Class-shorthand escapes (AREs only) provide shorthands for certain
commonly-used character classes:

@beginTable
@row2col{ <tt>@\d</tt> , <tt>[[:digit:]]</tt> }
@row2col{ <tt>@\s</tt> , <tt>[[:space:]]</tt> }
@row2col{ <tt>@\w</tt> , <tt>[[:alnum:]_]</tt> (note underscore) }
@row2col{ <tt>@\D</tt> , <tt>[^[:digit:]]</tt> }
@row2col{ <tt>@\S</tt> , <tt>[^[:space:]]</tt> }
@row2col{ <tt>@\W</tt> , <tt>[^[:alnum:]_]</tt> (note underscore) }
@endTable

Within bracket expressions, <tt>@\d</tt>, <tt>@\s</tt>, and <tt>@\w</tt> lose
their outer brackets, and <tt>@\D</tt>, <tt>@\S</tt>, and <tt>@\W</tt> are
illegal. (So, for example, <tt>[a-c@\d]</tt> is equivalent to
<tt>[a-c[:digit:]]</tt>. Also, <tt>[a-c@\D]</tt>, which is equivalent to
<tt>[a-c^[:digit:]]</tt>, is illegal.)
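
The class shorthands can be illustrated with a small substitution helper. As
before, @c std::regex serves only as a stand-in (assuming the "C" locale, where
its <tt>@\d</tt>, <tt>@\s</tt> and <tt>@\W</tt> agree with the definitions
above), and @c replace_all is a hypothetical name. Raw string literals are used
so that the backslashes need no doubling.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces every match of `pattern` in `text` by `with`.
std::string replace_all(const std::string& pattern,
                        const std::string& text,
                        const std::string& with)
{
    return std::regex_replace(text, std::regex(pattern), with);
}
```

For example, replacing <tt>@\d+</tt> by <tt>@#</tt> in <tt>a1b22c333</tt> gives
<tt>a@#b@#c@#</tt>; deleting every <tt>@\W</tt> from <tt>wx_Reg-Ex!</tt> leaves
<tt>wx_RegEx</tt> (the underscore is a word character); and deleting
<tt>[a-c@\d]+</tt> from <tt>ab12xyz</tt> leaves <tt>xyz</tt>.
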

A constraint escape (AREs only) is a constraint, matching the empty string if
specific conditions are met, written as an escape:

@beginTable
@row2col{ <tt>@\A</tt> ,
    Matches only at the beginning of the string (see
    @ref overview_resyntax_matching for how this differs from <tt>^</tt>). }
@row2col{ <tt>@\m</tt> , Matches only at the beginning of a word. }
@row2col{ <tt>@\M</tt> , Matches only at the end of a word. }
@row2col{ <tt>@\y</tt> , Matches only at the beginning or end of a word. }
@row2col{ <tt>@\Y</tt> ,
    Matches only at a point that is not the beginning or end of a word. }
@row2col{ <tt>@\Z</tt> ,
    Matches only at the end of the string (see
    @ref overview_resyntax_matching for how this differs from <tt>@$</tt>). }
@row2col{ <tt>@\</tt>@e m ,
    Where @e m is a nonzero digit, a <em>back reference</em>, see below. }
@row2col{ <tt>@\</tt>@e mnn ,
    Where @e m is a nonzero digit, and @e nn is some more digits, and the
    decimal value @e mnn is not greater than the number of closing capturing
    parentheses seen so far, a <em>back reference</em>, see below. }
@endTable

A word is defined as in the specification of <tt>[[:@<:]]</tt> and
<tt>[[:@>:]]</tt> above. Constraint escapes are illegal within bracket
expressions.

A back reference (AREs only) matches the same string matched by the
parenthesized subexpression specified by the number, so that (e.g.)
<tt>([bc])@\1</tt> matches <tt>bb</tt> or <tt>cc</tt> but not <tt>bc</tt>. The
subexpression must entirely precede the back reference in the RE.
Subexpressions are numbered in the order of their leading parentheses.
Non-capturing parentheses do not define subexpressions.

There is an inherent historical ambiguity between octal character-entry
escapes and back references, which is resolved by heuristics, as hinted at
above. A leading zero always indicates an octal escape. A single non-zero
digit, not followed by another digit, is always taken as a back reference. A
multi-digit sequence not starting with a zero is taken as a back reference if
it comes after a suitable subexpression (i.e. the number is in the legal range
for a back reference), and otherwise is taken as octal.
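
The back-reference example from the text, and the rule that non-capturing
parentheses are not numbered, can be checked with the sketch below. It uses
@c std::regex as a stand-in, assuming single-digit back references behave
there as described above; @c has_match is a hypothetical helper.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` matches anywhere in `text`.
bool has_match(const std::string& pattern, const std::string& text)
{
    return std::regex_search(text, std::regex(pattern));
}
```

Here <tt>([bc])@\1</tt> is found in <tt>bb</tt> and <tt>cc</tt> but not in
<tt>bc</tt>. In <tt>(a)(?:b)(c)@\2</tt> the back reference refers to
<tt>(c)</tt>, since <tt>(?:b)</tt> defines no subexpression, so the RE is found
in <tt>abcc</tt> but not in <tt>abca</tt>.
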


@section overview_resyntax_metasyntax Metasyntax

In addition to the main syntax described above, there are some special forms
and miscellaneous syntactic facilities available.

Normally the flavor of RE being used is specified by application-dependent
means. However, this can be overridden by a @e director. If an RE of any flavor
begins with <tt>***:</tt>, the rest of the RE is an ARE. If an RE of any flavor
begins with <tt>***=</tt>, the rest of the RE is taken to be a literal string,
with all characters considered ordinary characters.

An ARE may begin with <em>embedded options</em>: a sequence <tt>(?xyz)</tt>
(where @e xyz is one or more alphabetic characters) specifies options affecting
the rest of the RE. These supplement, and can override, any options specified
by the application. The available option letters are:

@beginTable
@row2col{ <tt>b</tt> , Rest of RE is a BRE. }
@row2col{ <tt>c</tt> , Case-sensitive matching (usual default). }
@row2col{ <tt>e</tt> , Rest of RE is an ERE. }
@row2col{ <tt>i</tt> ,
    Case-insensitive matching (see @ref overview_resyntax_matching). }
@row2col{ <tt>m</tt> , Historical synonym for <tt>n</tt>. }
@row2col{ <tt>n</tt> ,
    Newline-sensitive matching (see @ref overview_resyntax_matching). }
@row2col{ <tt>p</tt> ,
    Partial newline-sensitive matching (see
    @ref overview_resyntax_matching). }
@row2col{ <tt>q</tt> ,
    Rest of RE is a literal ("quoted") string, all ordinary characters. }
@row2col{ <tt>s</tt> , Non-newline-sensitive matching (usual default). }
@row2col{ <tt>t</tt> , Tight syntax (usual default; see below). }
@row2col{ <tt>w</tt> ,
    Inverse partial newline-sensitive ("weird") matching (see
    @ref overview_resyntax_matching). }
@row2col{ <tt>x</tt> , Expanded syntax (see below). }
@endTable

Embedded options take effect at the <tt>)</tt> terminating the sequence. They
are available only at the start of an ARE, and may not be used later within
it.

In addition to the usual (@e tight) RE syntax, in which all characters are
significant, there is an @e expanded syntax, available in AREs with the
embedded <tt>x</tt> option. In the expanded syntax, white-space characters are
ignored and all characters between a <tt>@#</tt> and the following newline (or
the end of the RE) are ignored, permitting paragraphing and commenting a
complex RE. There are three exceptions to that basic rule:

@li A white-space character or <tt>@#</tt> preceded by <tt>@\</tt> is retained.
@li White space or <tt>@#</tt> within a bracket expression is retained.
@li White space and comments are illegal within multi-character symbols like
    the ARE <tt>(?:</tt> or the BRE <tt>@\(</tt>.

Expanded-syntax white-space characters are blank, tab, newline, and any
character that belongs to the @e space character class.

Finally, in an ARE, outside bracket expressions, the sequence
<tt>(?@#ttt)</tt> (where @e ttt is any text not containing a <tt>)</tt>) is a
comment, completely ignored. Again, this is not allowed between the characters
of multi-character symbols like <tt>(?:</tt>. Such comments are more a
historical artifact than a useful facility, and their use is deprecated; use
the expanded syntax instead.

@e None of these metasyntax extensions is available if the application (or an
initial <tt>***=</tt> director) has specified that the user's input be treated
as a literal string rather than as an RE.


@section overview_resyntax_matching Matching

In the event that an RE could match more than one substring of a given string,
the RE matches the one starting earliest in the string. If the RE could match
more than one substring starting at that point, its choice is determined by
its @e preference: either the longest substring, or the shortest.

Most atoms, and all constraints, have no preference. A parenthesized RE has
the same preference (possibly none) as the RE. A quantified atom with
quantifier <tt>{m}</tt> or <tt>{m}?</tt> has the same preference (possibly
none) as the atom itself. A quantified atom with other normal quantifiers
(including <tt>{m,n}</tt> with @e m equal to @e n) prefers longest match. A
quantified atom with other non-greedy quantifiers (including <tt>{m,n}?</tt>
with @e m equal to @e n) prefers shortest match. A branch has the same
preference as the first quantified atom in it which has a preference. An RE
consisting of two or more branches connected by the <tt>|</tt> operator
prefers longest match.

Subject to the constraints imposed by the rules for matching the whole RE,
subexpressions also match the longest or shortest possible substrings, based
on their preferences, with subexpressions starting earlier in the RE taking
priority over ones starting later. Note that outer subexpressions thus take
priority over their component subexpressions.

Note that the quantifiers <tt>{1,1}</tt> and <tt>{1,1}?</tt> can be used to
force longest and shortest preference, respectively, on a subexpression or a
whole RE.

Match lengths are measured in characters, not collating elements. An empty
string is considered longer than no match at all. For example, <tt>bb*</tt>
matches the three middle characters of <tt>abbbc</tt>;
<tt>(week|wee)(night|knights)</tt> matches all ten characters of
<tt>weeknights</tt>; when <tt>(.*).*</tt> is matched against <tt>abc</tt> the
parenthesized subexpression matches all three characters; and when
<tt>(a*)*</tt> is matched against <tt>bc</tt> both the whole RE and the
parenthesized subexpression match an empty string.
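
Two of these examples can be reproduced with the sketch below, which uses
@c std::regex (ECMAScript grammar) as a stand-in. This is only an
approximation: that grammar follows Perl-like first-match rules rather than
the longest/shortest preference described here, so the sketch is limited to
cases where both rules are expected to agree, and the <tt>weeknights</tt> and
<tt>(a*)*</tt> examples are deliberately not included. @c MatchInfo and
@c find_first are hypothetical names.

```cpp
#include <regex>
#include <string>

// Hypothetical result type: the whole match and the first captured group.
struct MatchInfo
{
    bool found;
    std::string whole;
    std::string group1;
};

// Hypothetical helper: runs `pattern` against `text` and reports the earliest
// match together with its first parenthesized subexpression, if any.
MatchInfo find_first(const std::string& pattern, const std::string& text)
{
    MatchInfo info = { false, std::string(), std::string() };
    std::smatch m;
    if (std::regex_search(text, m, std::regex(pattern)))
    {
        info.found = true;
        info.whole = m.str(0);
        if (m.size() > 1)
            info.group1 = m.str(1);
    }
    return info;
}
```

Matching <tt>bb*</tt> against <tt>abbbc</tt> reports <tt>bbb</tt>, and matching
<tt>(.*).*</tt> against <tt>abc</tt> reports <tt>abc</tt> for both the whole
match and the parenthesized subexpression.
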

If case-independent matching is specified, the effect is much as if all case
distinctions had vanished from the alphabet. When an alphabetic character that
exists in multiple cases appears as an ordinary character outside a bracket
expression, it is effectively transformed into a bracket expression containing
both cases, so that <tt>x</tt> becomes <tt>[xX]</tt>. When it appears inside a
bracket expression, all case counterparts of it are added to the bracket
expression, so that <tt>[x]</tt> becomes <tt>[xX]</tt> and <tt>[^x]</tt>
becomes <tt>[^xX]</tt>.
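
The three transformations just described can be observed with the sketch
below. It uses @c std::regex with its @c icase flag as a stand-in for
case-independent matching (an assumption about equivalent behaviour for
single ASCII letters); @c matches is a hypothetical helper.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of `text` matches `pattern`,
// optionally ignoring case.
bool matches(const std::string& pattern, const std::string& text,
             bool ignore_case)
{
    std::regex::flag_type flags = std::regex::ECMAScript;
    if (ignore_case)
        flags |= std::regex::icase;
    return std::regex_match(text, std::regex(pattern, flags));
}
```

With case ignored, both <tt>x</tt> and <tt>[x]</tt> accept <tt>X</tt>, while
<tt>[^x]</tt> rejects it; with case respected, the results are the other way
round.
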

If newline-sensitive matching is specified, <tt>.</tt> and bracket expressions
using <tt>^</tt> will never match the newline character (so that matches will
never cross newlines unless the RE explicitly arranges it) and <tt>^</tt> and
<tt>@$</tt> will match the empty string after and before a newline
respectively, in addition to matching at beginning and end of string
respectively. ARE <tt>@\A</tt> and <tt>@\Z</tt> continue to match beginning or
end of string @e only.

If partial newline-sensitive matching is specified, this affects <tt>.</tt>
and bracket expressions as with newline-sensitive matching, but not
<tt>^</tt> and <tt>@$</tt>.

If inverse partial newline-sensitive matching is specified, this affects
<tt>^</tt> and <tt>@$</tt> as with newline-sensitive matching, but not
<tt>.</tt> and bracket expressions. This isn't very useful but is provided for
symmetry.


@section overview_resyntax_limits Limits and Compatibility

No particular limit is imposed on the length of REs. Programs intended to be
highly portable should not employ REs longer than 256 bytes, as a
POSIX-compliant implementation can refuse to accept such REs.

The only feature of AREs that is actually incompatible with POSIX EREs is that
<tt>@\</tt> does not lose its special significance inside bracket expressions.
All other ARE features use syntax which is illegal or has undefined or
unspecified effects in POSIX EREs; the <tt>***</tt> syntax of directors
likewise is outside the POSIX syntax for both BREs and EREs.

Many of the ARE extensions are borrowed from Perl, but some have been changed
to clean them up, and a few Perl extensions are not present. Incompatibilities
of note include <tt>@\b</tt>, <tt>@\B</tt>, the lack of special treatment for a
trailing newline, the addition of complemented bracket expressions to the
things affected by newline-sensitive matching, the restrictions on parentheses
and back references in lookahead constraints, and the longest/shortest-match
(rather than first-match) matching semantics.

The matching rules for REs containing both normal and non-greedy quantifiers
have changed since early beta-test versions of this package. (The new rules
are much simpler and cleaner, but don't work as hard at guessing the user's
real intentions.)

Henry Spencer's original 1986 @e regexp package, still in widespread use,
implemented an early version of today's EREs. There are four incompatibilities
between @e regexp's near-EREs ('RREs' for short) and AREs. In roughly
increasing order of significance:

@li In AREs, <tt>@\</tt> followed by an alphanumeric character is either an
    escape or an error, while in RREs, it was just another way of writing the
    alphanumeric. This should not be a problem because there was no reason to
    write such a sequence in RREs.
@li <tt>@leftCurly</tt> followed by a digit in an ARE is the beginning of a
    bound, while in RREs, <tt>@leftCurly</tt> was always an ordinary character.
    Such sequences should be rare, and will often result in an error because
    following characters will not look like a valid bound.
@li In AREs, <tt>@\</tt> remains a special character within <tt>[]</tt>, so a
    literal <tt>@\</tt> within <tt>[]</tt> must be written <tt>@\@\</tt>.
    <tt>@\@\</tt> also gives a literal <tt>@\</tt> within <tt>[]</tt> in RREs,
    but only truly paranoid programmers routinely doubled the backslash.
@li AREs report the longest/shortest match for the RE, rather than the first
    found in a specified search order. This may affect some RREs which were
    written in the expectation that the first match would be reported. (The
    careful crafting of RREs to optimize the search order for fast matching is
    obsolete (AREs examine all possible matches in parallel, and their
    performance is largely insensitive to their complexity) but cases where
    the search order was exploited to deliberately find a match which was
    @e not the longest/shortest will need rewriting.)


@section overview_resyntax_bre Basic Regular Expressions

BREs differ from EREs in several respects. <tt>|</tt>, <tt>+</tt>, and
<tt>?</tt> are ordinary characters and there is no equivalent for their
functionality. The delimiters for bounds are <tt>@\{</tt> and <tt>@\}</tt>,
with <tt>{</tt> and <tt>}</tt> by themselves ordinary characters. The
parentheses for nested subexpressions are <tt>@\(</tt> and <tt>@\)</tt>, with
<tt>(</tt> and <tt>)</tt> by themselves ordinary characters. <tt>^</tt> is an
ordinary character except at the beginning of the RE or the beginning of a
parenthesized subexpression, <tt>@$</tt> is an ordinary character except at the
end of the RE or the end of a parenthesized subexpression, and <tt>*</tt> is
an ordinary character if it appears at the beginning of the RE or the
beginning of a parenthesized subexpression (after a possible leading
<tt>^</tt>). Finally, single-digit back references are available, and
<tt>@\@<</tt> and <tt>@\@></tt> are synonyms for <tt>[[:@<:]]</tt> and
<tt>[[:@>:]]</tt> respectively; no other escapes are available.
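
The main BRE differences can be tried with the sketch below. It uses the POSIX
basic grammar of @c std::regex (the @c std::regex::basic flag) as a stand-in,
assuming the standard library in use implements that grammar as POSIX
describes it; @c matches_bre is a hypothetical helper. The word-boundary
synonyms are not shown, as they are specific to this implementation.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of `text` matches the BRE `pattern`.
bool matches_bre(const std::string& pattern, const std::string& text)
{
    return std::regex_match(text, std::regex(pattern, std::regex::basic));
}
```

As a BRE, <tt>a+</tt> accepts only the two-character string <tt>a+</tt>, and
<tt>(a)</tt> accepts only the literal text <tt>(a)</tt>. Grouping and bounds
need backslashes, so <tt>@\(ab@\)@\{2@\}</tt> accepts <tt>abab</tt>, and the
single-digit back reference in <tt>@\(a@\)@\1</tt> accepts <tt>aa</tt>.
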


@section overview_resyntax_characters Regular Expression Character Names

Note that the character names are case sensitive.

@beginTable
@row2col{ <tt>NUL</tt> , <tt>'@\0'</tt> }
@row2col{ <tt>SOH</tt> , <tt>'@\001'</tt> }
@row2col{ <tt>STX</tt> , <tt>'@\002'</tt> }
@row2col{ <tt>ETX</tt> , <tt>'@\003'</tt> }
@row2col{ <tt>EOT</tt> , <tt>'@\004'</tt> }
@row2col{ <tt>ENQ</tt> , <tt>'@\005'</tt> }
@row2col{ <tt>ACK</tt> , <tt>'@\006'</tt> }
@row2col{ <tt>BEL</tt> , <tt>'@\007'</tt> }
@row2col{ <tt>alert</tt> , <tt>'@\007'</tt> }
@row2col{ <tt>BS</tt> , <tt>'@\010'</tt> }
@row2col{ <tt>backspace</tt> , <tt>'@\b'</tt> }
@row2col{ <tt>HT</tt> , <tt>'@\011'</tt> }
@row2col{ <tt>tab</tt> , <tt>'@\t'</tt> }
@row2col{ <tt>LF</tt> , <tt>'@\012'</tt> }
@row2col{ <tt>newline</tt> , <tt>'@\n'</tt> }
@row2col{ <tt>VT</tt> , <tt>'@\013'</tt> }
@row2col{ <tt>vertical-tab</tt> , <tt>'@\v'</tt> }
@row2col{ <tt>FF</tt> , <tt>'@\014'</tt> }
@row2col{ <tt>form-feed</tt> , <tt>'@\f'</tt> }
@row2col{ <tt>CR</tt> , <tt>'@\015'</tt> }
@row2col{ <tt>carriage-return</tt> , <tt>'@\r'</tt> }
@row2col{ <tt>SO</tt> , <tt>'@\016'</tt> }
@row2col{ <tt>SI</tt> , <tt>'@\017'</tt> }
@row2col{ <tt>DLE</tt> , <tt>'@\020'</tt> }
@row2col{ <tt>DC1</tt> , <tt>'@\021'</tt> }
@row2col{ <tt>DC2</tt> , <tt>'@\022'</tt> }
@row2col{ <tt>DC3</tt> , <tt>'@\023'</tt> }
@row2col{ <tt>DC4</tt> , <tt>'@\024'</tt> }
@row2col{ <tt>NAK</tt> , <tt>'@\025'</tt> }
@row2col{ <tt>SYN</tt> , <tt>'@\026'</tt> }
@row2col{ <tt>ETB</tt> , <tt>'@\027'</tt> }
@row2col{ <tt>CAN</tt> , <tt>'@\030'</tt> }
@row2col{ <tt>EM</tt> , <tt>'@\031'</tt> }
@row2col{ <tt>SUB</tt> , <tt>'@\032'</tt> }
@row2col{ <tt>ESC</tt> , <tt>'@\033'</tt> }
@row2col{ <tt>IS4</tt> , <tt>'@\034'</tt> }
@row2col{ <tt>FS</tt> , <tt>'@\034'</tt> }
@row2col{ <tt>IS3</tt> , <tt>'@\035'</tt> }
@row2col{ <tt>GS</tt> , <tt>'@\035'</tt> }
@row2col{ <tt>IS2</tt> , <tt>'@\036'</tt> }
@row2col{ <tt>RS</tt> , <tt>'@\036'</tt> }
@endTable

IS1
1120
1121
36c9828f 1122
72844950 1123'\037'
36c9828f
FM
1124
1125
1126
36c9828f
FM
1127
1128
72844950 1129US
36c9828f
FM
1130
1131
1132
36c9828f 1133
72844950 1134'\037'
36c9828f
FM
1135
1136
1137
36c9828f
FM
1138
1139
72844950 1140space
36c9828f
FM
1141
1142
1143
36c9828f 1144
72844950 1145' '
36c9828f
FM
1146
1147
1148
36c9828f
FM
1149
1150
72844950 1151exclamation-mark
36c9828f
FM
1152
1153
1154
36c9828f 1155
72844950 1156'!'
36c9828f
FM
1157
1158
1159
36c9828f
FM
1160
1161
72844950 1162quotation-mark
36c9828f
FM
1163
1164
1165
36c9828f 1166
72844950 1167'"'
36c9828f 1168
36c9828f
FM
1169
1170
1171
1172
72844950 1173number-sign
36c9828f
FM
1174
1175
36c9828f
FM
1176
1177
72844950 1178'#'
36c9828f
FM
1179
1180
36c9828f
FM
1181
1182
1183
72844950 1184dollar-sign
36c9828f
FM
1185
1186
36c9828f
FM
1187
1188
72844950 1189'$'
36c9828f
FM
1190
1191
36c9828f
FM
1192
1193
1194
72844950 1195percent-sign
36c9828f
FM
1196
1197
36c9828f
FM
1198
1199
72844950 1200'%'
36c9828f
FM
1201
1202
36c9828f
FM
1203
1204
1205
72844950 1206ampersand
36c9828f
FM
1207
1208
36c9828f
FM
1209
1210
72844950 1211''
36c9828f
FM
1212
1213
36c9828f
FM
1214
1215
1216
72844950 1217apostrophe
36c9828f
FM
1218
1219
36c9828f
FM
1220
1221
72844950 1222'\''
36c9828f
FM
1223
1224
36c9828f
FM
1225
1226
1227
72844950 1228left-parenthesis
36c9828f
FM
1229
1230
36c9828f
FM
1231
1232
72844950 1233'('
36c9828f
FM
1234
1235
36c9828f
FM
1236
1237
1238
72844950 1239right-parenthesis
36c9828f
FM
1240
1241
36c9828f
FM
1242
1243
72844950 1244')'
36c9828f
FM
1245
1246
36c9828f
FM
1247
1248
1249
72844950 1250asterisk
36c9828f
FM
1251
1252
36c9828f
FM
1253
1254
72844950 1255'*'
36c9828f
FM
1256
1257
36c9828f
FM
1258
1259
1260
72844950 1261plus-sign
36c9828f
FM
1262
1263
36c9828f
FM
1264
1265
72844950 1266'+'
36c9828f
FM
1267
1268
36c9828f
FM
1269
1270
1271
72844950 1272comma
36c9828f
FM
1273
1274
36c9828f
FM
1275
1276
72844950 1277','
36c9828f
FM
1278
1279
36c9828f
FM
1280
1281
1282
72844950 1283hyphen
36c9828f
FM
1284
1285
36c9828f
FM
1286
1287
72844950 1288'-'
36c9828f
FM
1289
1290
36c9828f
FM
1291
1292
1293
72844950 1294hyphen-minus
36c9828f
FM
1295
1296
36c9828f
FM
1297
1298
72844950 1299'-'
36c9828f
FM
1300
1301
36c9828f
FM
1302
1303
1304
72844950 1305period
36c9828f
FM
1306
1307
36c9828f
FM
1308
1309
72844950 1310'.'
36c9828f
FM
1311
1312
36c9828f 1313
36c9828f 1314
36c9828f 1315
72844950 1316full-stop
36c9828f 1317
36c9828f
FM
1318
1319
36c9828f 1320
72844950 1321'.'
36c9828f
FM
1322
1323
36c9828f 1324
36c9828f 1325
36c9828f 1326
72844950 1327slash
36c9828f
FM
1328
1329
1330
1331
72844950 1332'/'
36c9828f
FM
1333
1334
36c9828f
FM
1335
1336
1337
72844950 1338solidus
36c9828f 1339
36c9828f
FM
1340
1341
1342
72844950 1343'/'
36c9828f
FM
1344
1345
36c9828f
FM
1346
1347
1348
72844950 1349zero
36c9828f 1350
36c9828f
FM
1351
1352
1353
72844950 1354'0'
36c9828f
FM
1355
1356
36c9828f
FM
1357
1358
1359
72844950 1360one
36c9828f 1361
36c9828f
FM
1362
1363
1364
72844950 1365'1'
36c9828f
FM
1366
1367
36c9828f
FM
1368
1369
1370
72844950 1371two
36c9828f 1372
36c9828f
FM
1373
1374
1375
72844950 1376'2'
36c9828f
FM
1377
1378
36c9828f
FM
1379
1380
1381
72844950 1382three
36c9828f 1383
36c9828f
FM
1384
1385
1386
72844950 1387'3'
36c9828f
FM
1388
1389
36c9828f
FM
1390
1391
1392
72844950 1393four
36c9828f 1394
36c9828f
FM
1395
1396
1397
72844950 1398'4'
36c9828f
FM
1399
1400
36c9828f
FM
1401
1402
1403
72844950 1404five
36c9828f 1405
36c9828f
FM
1406
1407
1408
72844950 1409'5'
36c9828f
FM
1410
1411
36c9828f
FM
1412
1413
1414
72844950 1415six
36c9828f 1416
36c9828f
FM
1417
1418
1419
72844950 1420'6'
36c9828f
FM
1421
1422
36c9828f
FM
1423
1424
1425
72844950 1426seven
36c9828f 1427
36c9828f
FM
1428
1429
1430
72844950 1431'7'
36c9828f
FM
1432
1433
36c9828f
FM
1434
1435
1436
72844950 1437eight
36c9828f 1438
36c9828f
FM
1439
1440
1441
72844950 1442'8'
36c9828f
FM
1443
1444
36c9828f
FM
1445
1446
1447
72844950 1448nine
36c9828f 1449
36c9828f
FM
1450
1451
1452
72844950 1453'9'
36c9828f
FM
1454
1455
36c9828f
FM
1456
1457
1458
72844950 1459colon
36c9828f 1460
36c9828f
FM
1461
1462
1463
72844950 1464':'
36c9828f
FM
1465
1466
36c9828f
FM
1467
1468
1469
72844950 1470semicolon
36c9828f 1471
36c9828f
FM
1472
1473
1474
72844950 1475';'
36c9828f
FM
1476
1477
36c9828f
FM
1478
1479
1480
72844950 1481less-than-sign
36c9828f 1482
36c9828f
FM
1483
1484
1485
72844950 1486''
36c9828f
FM
1487
1488
36c9828f
FM
1489
1490
1491
72844950 1492equals-sign
36c9828f 1493
36c9828f
FM
1494
1495
1496
72844950 1497'='
36c9828f
FM
1498
1499
36c9828f
FM
1500
1501
1502
72844950 1503greater-than-sign
36c9828f 1504
36c9828f
FM
1505
1506
1507
72844950 1508''
36c9828f
FM
1509
1510
36c9828f
FM
1511
1512
1513
72844950 1514question-mark
36c9828f 1515
36c9828f
FM
1516
1517
1518
72844950 1519'?'
36c9828f
FM
1520
1521
36c9828f
FM
1522
1523
1524
72844950 1525commercial-at
36c9828f 1526
36c9828f
FM
1527
1528
1529
72844950 1530'@'
36c9828f
FM
1531
1532
36c9828f
FM
1533
1534
1535
72844950 1536left-square-bracket
36c9828f 1537
36c9828f
FM
1538
1539
1540
72844950 1541'['
36c9828f
FM
1542
1543
36c9828f
FM
1544
1545
1546
72844950 1547backslash
36c9828f 1548
36c9828f
FM
1549
1550
1551
72844950 1552'\'
36c9828f
FM
1553
1554
36c9828f
FM
1555
1556
1557
72844950 1558reverse-solidus
36c9828f 1559
36c9828f
FM
1560
1561
1562
72844950 1563'\'
36c9828f
FM
1564
1565
36c9828f
FM
1566
1567
1568
72844950 1569right-square-bracket
36c9828f 1570
36c9828f
FM
1571
1572
1573
72844950 1574']'
36c9828f
FM
1575
1576
36c9828f
FM
1577
1578
1579
72844950 1580circumflex
36c9828f 1581
36c9828f
FM
1582
1583
1584
72844950 1585'^'
36c9828f
FM
1586
1587
36c9828f
FM
1588
1589
1590
72844950 1591circumflex-accent
36c9828f 1592
36c9828f
FM
1593
1594
1595
72844950 1596'^'
36c9828f
FM
1597
1598
36c9828f
FM
1599
1600
1601
72844950 1602underscore
36c9828f 1603
36c9828f
FM
1604
1605
1606
72844950 1607'_'
36c9828f
FM
1608
1609
36c9828f
FM
1610
1611
1612
72844950 1613low-line
36c9828f 1614
36c9828f
FM
1615
1616
1617
72844950 1618'_'
36c9828f
FM
1619
1620
36c9828f
FM
1621
1622
1623
72844950 1624grave-accent
36c9828f 1625
36c9828f
FM
1626
1627
1628
72844950 1629'''
36c9828f
FM
1630
1631
36c9828f
FM
1632
1633
1634
72844950 1635left-brace
36c9828f 1636
36c9828f
FM
1637
1638
1639
72844950 1640'{'
36c9828f
FM
1641
1642
36c9828f
FM
1643
1644
1645
72844950 1646left-curly-bracket
36c9828f 1647
36c9828f
FM
1648
1649
1650
72844950 1651'{'
36c9828f
FM
1652
1653
36c9828f
FM
1654
1655
1656
72844950 1657vertical-line
36c9828f 1658
36c9828f
FM
1659
1660
1661
72844950 1662'|'
36c9828f
FM
1663
1664
36c9828f
FM
1665
1666
1667
72844950 1668right-brace
36c9828f 1669
36c9828f
FM
1670
1671
1672
72844950 1673'}'
36c9828f
FM
1674
1675
36c9828f
FM
1676
1677
1678
72844950 1679right-curly-bracket
36c9828f 1680
36c9828f
FM
1681
1682
1683
72844950 1684'}'
36c9828f
FM
1685
1686
36c9828f
FM
1687
1688
1689
72844950 1690tilde
36c9828f 1691
36c9828f
FM
1692
1693
1694
72844950 1695'~'
36c9828f
FM
1696
1697
36c9828f
FM
1698
1699
1700
72844950 1701DEL
36c9828f 1702
36c9828f
FM
1703
1704
1705
72844950 1706'\177'
36c9828f 1707
72844950 1708*/
36c9828f 1709