/////////////////////////////////////////////////////////////////////////////
// Name:        resyntax.h
// Purpose:     topic overview
// Author:      wxWidgets team
// RCS-ID:      $Id$
// Licence:     wxWindows license
/////////////////////////////////////////////////////////////////////////////

/*!

@page overview_resyntax Syntax of the Built-in Regular Expression Library

A <em>regular expression</em> describes strings of characters. It's a pattern
that matches certain strings and doesn't match others.

@seealso #wxRegEx

@li @ref overview_resyntax_differentflavors
@li @ref overview_resyntax_syntax
@li @ref overview_resyntax_bracket
@li @ref overview_resyntax_escapes
@li @ref overview_resyntax_metasyntax
@li @ref overview_resyntax_matching
@li @ref overview_resyntax_limits
@li @ref overview_resyntax_bre
@li @ref overview_resyntax_characters


<hr>


@section overview_resyntax_differentflavors Different Flavors of REs

Regular expressions ("RE"s), as defined by POSIX, come in two
flavors: @e extended REs ("EREs") and @e basic REs ("BREs"). EREs are roughly those
of the traditional @e egrep, while BREs are roughly those of the traditional
@e ed. This implementation adds a third flavor, @e advanced REs ("AREs"), basically
EREs with some significant extensions.

This manual page primarily describes
AREs. BREs mostly exist for backward compatibility in some old programs;
they are discussed at the end, in @ref overview_resyntax_bre. POSIX EREs are almost
an exact subset of AREs. Features of AREs that are not present in EREs are indicated.


@section overview_resyntax_syntax Regular Expression Syntax

These regular expressions are implemented using
the package written by Henry Spencer, based on the 1003.2 spec and some
(not quite all) of the Perl5 extensions (thanks, Henry!). Much of the description
of regular expressions below is copied verbatim from his manual entry.

An ARE is one or more @e branches, separated by '@b |', matching anything that matches
any of the branches.

A branch is zero or more @e constraints or @e quantified
atoms, concatenated. It matches a match for the first, followed by a match
for the second, etc; an empty branch matches the empty string.

A quantified atom is an @e atom possibly followed by a single @e quantifier. Without a quantifier,
it matches a match for the atom. The quantifiers, and what a so-quantified
atom matches, are:

@b *

a sequence of 0 or more matches of the atom

@b +

a sequence of 1 or more matches of the atom

@b ?

a sequence of 0 or 1 matches of the atom

@b {m}

a sequence of exactly @e m matches of the atom

@b {m,}

a sequence of @e m or more matches of the atom

@b {m,n}

a sequence of @e m through @e n (inclusive)
matches of the atom; @e m may not exceed @e n

@b *? +? ?? {m}? {m,}? {m,n}?

@e non-greedy quantifiers,
which match the same possibilities, but prefer the
smallest number rather than the largest number of matches
(see @ref overview_resyntax_matching)

The forms using @b { and @b } are known as @e bounds. The numbers @e m and @e n are unsigned
decimal integers with permissible values from 0 to 255 inclusive.

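As an illustration only, the quantifiers above can be tried out with a small C++
helper. This is a sketch that assumes the C++ standard library's std::regex (default
ECMAScript grammar) as a stand-in, because it accepts the same quantifier and bound
forms; it does not use wxRegEx, and the helper name MatchesWhole is hypothetical.

```cpp
#include <regex>
#include <string>

// Returns true if the whole of `text` matches `pattern`.
// Assumption: std::regex stands in for the RE package described here.
bool MatchesWhole(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    return std::regex_match(text, re);
}
```

With this helper, "ab*c" accepts both "ac" and "abbbc", "ab+c" rejects "ac",
"ab?c" rejects "abbc", "ab{2}c" accepts "abbc", and "ab{1,2}c" rejects "abbbc".
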
An atom is one of:

@b (re)

(where @e re is any regular expression) matches a match for
@e re, with the match noted for possible reporting

@b (?:re)

as previous, but
does no reporting (a "non-capturing" set of parentheses)

@b ()

matches an empty
string, noted for possible reporting

@b (?:)

matches an empty string, without reporting

@b [chars]

a @e bracket expression, matching any one of the @e chars
(see @ref overview_resyntax_bracket for more detail)

@b .

matches any single character

@b \k

(where @e k is a non-alphanumeric character)
matches that character taken as an ordinary character, e.g. \\ matches a backslash
character

@b \c

where @e c is alphanumeric (possibly followed by other characters),
an @e escape (AREs only), see @ref overview_resyntax_escapes below

@b {

when followed by a character
other than a digit, matches the left-brace character '@b {'; when followed by
a digit, it is the beginning of a @e bound (see above)

@b x

where @e x is a single
character with no other significance, matches that character.

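The difference between reporting and non-capturing parentheses can be sketched as
follows. The code assumes std::regex (ECMAScript grammar) as a stand-in that shares
the @b (re) and @b (?:re) forms; wxRegEx is not used, and ReportedGroups is a
hypothetical helper name.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Collects the whole match followed by each reported (capturing) group.
// Returns an empty vector if `text` does not match `pattern` as a whole.
std::vector<std::string> ReportedGroups(const std::string& pattern,
                                        const std::string& text)
{
    std::vector<std::string> groups;
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_match(text, m, re))
    {
        for (std::size_t i = 0; i < m.size(); ++i)
            groups.push_back(m[i].str());
    }
    return groups;
}
```

For the pattern "(a)(?:b)(c)" and the text "abc", the result holds three entries:
the whole match "abc", then "a" and "c". The non-capturing group contributes nothing.
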
A @e constraint matches an empty string when specific conditions are met. A constraint may
not be followed by a quantifier. The simple constraints are as follows;
some more constraints are described later, under @ref overview_resyntax_escapes.

@b ^

matches at the beginning of a line

@b $

matches at the end of a line

@b (?=re)

@e positive lookahead
(AREs only), matches at any point where a substring matching @e re begins

@b (?!re)

@e negative lookahead (AREs only),
matches at any point where no substring matching @e re begins

The lookahead constraints may not contain back references
(see later), and all parentheses within them are considered non-capturing.

An RE may not end with '@b \'.


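A minimal sketch of the two lookahead constraints follows. It assumes std::regex
(ECMAScript grammar), which happens to use the same @b (?=re) and @b (?!re) syntax,
as a stand-in for this package; FirstMatchOffset is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Returns the offset of the first match of `pattern` in `text`, or -1 if
// there is none. A lookahead only tests what follows; it consumes nothing.
long FirstMatchOffset(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(text, m, re))
        return -1;
    return static_cast<long>(m.position(0));
}
```

Searching "foobaz foobar" for "foo(?=bar)" skips the first "foo" and reports offset 7;
searching "foobar foobaz" for "foo(?!bar)" likewise reports offset 7.
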
@section overview_resyntax_bracket Bracket Expressions

A @e bracket expression is a list
of characters enclosed in '@b []'. It normally matches any single character from
the list (but see below). If the list begins with '@b ^', it matches any single
character (but see below) @e not from the rest of the list.

If two characters
in the list are separated by '@b -', this is shorthand for the full @e range of
characters between those two (inclusive) in the collating sequence, e.g.
@b [0-9] in ASCII matches any decimal digit. Two ranges may not share an endpoint,
so e.g. @b a-c-e is illegal. Ranges are very collating-sequence-dependent, and portable
programs should avoid relying on them.

To include a literal @b ] or @b - in the
list, the simplest method is to enclose it in @b [. and @b .] to make it a collating
element (see below). Alternatively, make it the first character (following
a possible '@b ^'), or (AREs only) precede it with '@b \'.
Alternatively, for '@b -', make
it the last character, or the second endpoint of a range. To use a literal
@b - as the first endpoint of a range, make it a collating element or (AREs
only) precede it with '@b \'. With the exception of these, some combinations using
@b [ (see next paragraphs), and escapes, all other special characters lose
their special significance within a bracket expression.

Within a bracket
expression, a collating element (a character, a multi-character sequence
that collates as if it were a single character, or a collating-sequence
name for either) enclosed in @b [. and @b .] stands for the
sequence of characters of that collating element.

@e wxWidgets: Currently no multi-character collating elements are defined.
So in @b [.X.], @e X can either be a single character literal or
the name of a character. For example, @b [[.0.]-[.9.]] and
@b [[.zero.]-[.nine.]] are identical and mean the same as
@b [0-9].
See @ref overview_resyntax_characters.

Within a bracket expression, a collating element enclosed in @b [= and @b =]
is an equivalence class, standing for the sequences of characters of all
collating elements equivalent to that one, including itself.
An equivalence class may not be an endpoint of a range.

@e wxWidgets: Currently no equivalence classes are defined, so
@b [=X=] stands for just the single character @e X.
@e X can either be a single character literal or the name of a character,
see @ref overview_resyntax_characters.

Within a bracket expression,
the name of a @e character class enclosed in @b [: and @b :] stands for the list
of all characters (not all collating elements!) belonging to that class.
Standard character classes are:

@b alpha

A letter.

@b upper

An upper-case letter.

@b lower

A lower-case letter.

@b digit

A decimal digit.

@b xdigit

A hexadecimal digit.

@b alnum

An alphanumeric (letter or digit).

@b print

An alphanumeric (same as alnum).

@b blank

A space or tab character.

@b space

A character producing white space in displayed text.

@b punct

A punctuation character.

@b graph

A character with a visible representation.

@b cntrl

A control character.

A character class may not be used as an endpoint of a range.

@e wxWidgets: In a non-Unicode build, these character classifications depend on the
current locale, and correspond to the values returned by the ANSI C 'is'
functions: isalpha, isupper, etc. In Unicode mode they are based on
Unicode classifications, and are not affected by the current locale.

There are two special cases of bracket expressions:
the bracket expressions @b [[:<:]] and @b [[:>:]] are constraints, matching empty
strings at the beginning and end of a word respectively. A word is defined
as a sequence of word characters that is neither preceded nor followed
by word characters. A word character is an @e alnum character or an underscore
(@b _). These special bracket expressions are deprecated; users of AREs should
use constraint escapes instead (see @ref overview_resyntax_escapes below).


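The range, complement and character-class forms described in this section can be
sketched as below. The example assumes std::regex (ECMAScript grammar, default "C"
locale), which also accepts @b [:name:] classes inside brackets, as a stand-in for
this package; MatchesOneChar is a hypothetical helper name. The rules above for
placing a literal @b ] or @b - are not exercised, since the stand-in may treat them
differently.

```cpp
#include <regex>
#include <string>

// Returns true if the single character `c` is matched by `bracketExpr`.
bool MatchesOneChar(const std::string& bracketExpr, char c)
{
    const std::regex re(bracketExpr);
    return std::regex_match(std::string(1, c), re);
}
```

For instance, "[0-9]" accepts '7' while "[^0-9]" rejects it, "[[:alpha:]]" accepts 'x'
but not '7', and "[a-c[:digit:]]" accepts '5'.
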
@section overview_resyntax_escapes Escapes

Escapes (AREs only),
which begin with a @b \ followed by an alphanumeric character, come in several
varieties: character entry, class shorthands, constraint escapes, and back
references. A @b \ followed by an alphanumeric character but not constituting
a valid escape is illegal in AREs. In EREs, there are no escapes: outside
a bracket expression, a @b \ followed by an alphanumeric character merely stands
for that character as an ordinary character, and inside a bracket expression,
@b \ is an ordinary character. (The latter is the one actual incompatibility
between EREs and AREs.)

Character-entry escapes (AREs only) exist to make
it easier to specify non-printing and otherwise inconvenient characters
in REs:

@b \a

alert (bell) character, as in C

@b \b

backspace, as in C

@b \B

synonym
for @b \ to help reduce backslash doubling in some applications where there
are multiple levels of backslash processing

@b \c@e X

(where @e X is any character)
the character whose low-order 5 bits are the same as those of @e X, and whose
other bits are all zero

@b \e

the character whose collating-sequence name is
'@b ESC', or failing that, the character with octal value 033

@b \f

formfeed, as in C

@b \n

newline, as in C

@b \r

carriage return, as in C

@b \t

horizontal tab, as in C

@b \u@e wxyz

(where @e wxyz is exactly four hexadecimal digits)
the Unicode
character @b U+@e wxyz in the local byte ordering

@b \U@e stuvwxyz

(where @e stuvwxyz is
exactly eight hexadecimal digits) reserved for a somewhat-hypothetical Unicode
extension to 32 bits

@b \v

vertical tab, as in C

@b \x@e hhh

(where
@e hhh is any sequence of hexadecimal digits) the character whose hexadecimal
value is @b 0x@e hhh (a single character no matter how many hexadecimal digits
are used).

@b \0

the character whose value is @b 0

@b \@e xy

(where @e xy is exactly two
octal digits, and is not a @e back reference (see below)) the character whose
octal value is @b 0@e xy

@b \@e xyz

(where @e xyz is exactly three octal digits, and is
not a back reference (see below))
the character whose octal value is @b 0@e xyz

Hexadecimal digits are '@b 0'-'@b 9', '@b a'-'@b f', and '@b A'-'@b F'. Octal
digits are '@b 0'-'@b 7'.

The character-entry
escapes are always taken as ordinary characters. For example, @b \135 is @b ] in
ASCII, but @b \135 does not terminate a bracket expression. Beware, however,
that some applications (e.g., C compilers) interpret such sequences themselves
before the regular-expression package gets to see them, which may require
doubling (quadrupling, etc.) the '@b \'.

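The backslash-doubling caveat is easiest to see in C++ source. The sketch below
assumes std::regex (ECMAScript grammar) as a stand-in that understands the @b \t
and two-digit @b \x escapes; wxRegEx is not used, and all function names are
hypothetical. Note that the stand-in reads exactly two hexadecimal digits after
@b \x, unlike the rule given above.

```cpp
#include <regex>
#include <string>

// Both functions return the same four-character pattern text: a, \, t, b.
// In an ordinary literal the compiler consumes one level of backslashes,
// so the backslash must be doubled; a raw literal avoids that.
std::string PatternWithDoubledBackslash() { return "a\\tb"; }
std::string PatternAsRawLiteral() { return R"(a\tb)"; }

// Returns true if the whole of `text` matches `pattern`.
bool MatchesWholeText(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    return std::regex_match(text, re);
}
```

Either pattern matches the three-character text consisting of 'a', a tab and 'b',
because the regular-expression engine itself interprets the escape.
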
Class-shorthand escapes (AREs only) provide
shorthands for certain commonly-used character classes:

@b \d

@b [[:digit:]]

@b \s

@b [[:space:]]

@b \w

@b [[:alnum:]_] (note underscore)

@b \D

@b [^[:digit:]]

@b \S

@b [^[:space:]]

@b \W

@b [^[:alnum:]_] (note underscore)

Within bracket expressions, '@b \d', '@b \s', and
'@b \w' lose their outer brackets, and '@b \D',
'@b \S', and '@b \W' are illegal. (So, for example,
@b [a-c\d] is equivalent to @b [a-c[:digit:]].
Also, @b [a-c\D], which is equivalent to
@b [a-c^[:digit:]], is illegal.)

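The class shorthands can be compared with their bracket-expression equivalents
using a small counting helper. This sketch assumes std::regex (ECMAScript grammar),
which provides the same @b \d, @b \s and @b \w shorthands, as a stand-in for this
package; CountMatches is a hypothetical helper name.

```cpp
#include <iterator>
#include <regex>
#include <string>

// Counts the non-overlapping matches of `pattern` found in `text`.
long CountMatches(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    return static_cast<long>(std::distance(
        std::sregex_iterator(text.begin(), text.end(), re),
        std::sregex_iterator()));
}
```

Both the shorthand pattern for one or more digits and "[[:digit:]]+" find three
matches in "a1 b22 c333", and the word-character shorthand treats "foo_bar" as a
single word because of the underscore.
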
A constraint escape (AREs only) is a constraint,
matching the empty string if specific conditions are met, written as an
escape:

@b \A

matches only at the beginning of the string
(see @ref overview_resyntax_matching, below,
for how this differs from '@b ^')

@b \m

matches only at the beginning of a word

@b \M

matches only at the end of a word

@b \y

matches only at the beginning or end of a word

@b \Y

matches only at a point that is not the beginning or end of
a word

@b \Z

matches only at the end of the string
(see @ref overview_resyntax_matching, below, for
how this differs from '@b $')

@b \@e m

(where @e m is a nonzero digit) a @e back reference,
see below

@b \@e mnn

(where @e m is a nonzero digit, and @e nn is some more digits,
and the decimal value @e mnn is not greater than the number of closing capturing
parentheses seen so far) a @e back reference, see below

A word is defined
as in the specification of @b [[:<:]] and @b [[:>:]] above. Constraint escapes are
illegal within bracket expressions.

A back reference (AREs only) matches
the same string matched by the parenthesized subexpression specified by
the number, so that (e.g.) @b ([bc])\1 matches @b bb or @b cc but not '@b bc'.
The subexpression
must entirely precede the back reference in the RE. Subexpressions are numbered
in the order of their leading parentheses. Non-capturing parentheses do not
define subexpressions.

There is an inherent historical ambiguity between
octal character-entry escapes and back references, which is resolved by
heuristics, as hinted at above. A leading zero always indicates an octal
escape. A single non-zero digit, not followed by another digit, is always
taken as a back reference. A multi-digit sequence not starting with a zero
is taken as a back reference if it comes after a suitable subexpression
(i.e. the number is in the legal range for a back reference), and otherwise
is taken as octal.


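The back-reference example given above can be checked with a few lines of C++.
This is a sketch assuming std::regex (ECMAScript grammar), which supports
single-digit back references with the same syntax, as a stand-in for this package;
MatchesWithBackRef is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Returns true if `text` is one of 'b' or 'c' followed by the same character.
// The back reference repeats whatever the first parenthesized group matched.
bool MatchesWithBackRef(const std::string& text)
{
    static const std::regex re(R"(([bc])\1)");
    return std::regex_match(text, re);
}
```

The helper accepts "bb" and "cc" but rejects "bc", since the back reference must
repeat the text actually matched by the group, not merely match the same pattern.
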
@section overview_resyntax_metasyntax Metasyntax

In addition to the main syntax described above,
there are some special forms and miscellaneous syntactic facilities available.

Normally the flavor of RE being used is specified by application-dependent
means. However, this can be overridden by a @e director. If an RE of any flavor
begins with '@b ***:', the rest of the RE is an ARE. If an RE of any flavor begins
with '@b ***=', the rest of the RE is taken to be a literal string, with all
characters considered ordinary characters.

An ARE may begin with @e embedded options: a sequence @b (?xyz)
(where @e xyz is one or more alphabetic characters)
specifies options affecting the rest of the RE. These supplement, and can
override, any options specified by the application. The available option
letters are:

@b b

rest of RE is a BRE

@b c

case-sensitive matching (usual default)

@b e

rest of RE is an ERE

@b i

case-insensitive matching (see @ref overview_resyntax_matching, below)

@b m

historical synonym for @b n

@b n

newline-sensitive matching (see @ref overview_resyntax_matching, below)

@b p

partial newline-sensitive matching (see @ref overview_resyntax_matching, below)

@b q

rest of RE
is a literal ("quoted") string, all ordinary characters

@b s

non-newline-sensitive matching (usual default)

@b t

tight syntax (usual default; see below)

@b w

inverse
partial newline-sensitive ("weird") matching (see @ref overview_resyntax_matching, below)

@b x

expanded syntax (see below)

Embedded options take effect at the @b ) terminating the
sequence. They are available only at the start of an ARE, and may not be
used later within it.

In addition to the usual (@e tight) RE syntax, in which
all characters are significant, there is an @e expanded syntax, available
in AREs with the embedded
x option. In the expanded syntax, white-space characters are ignored and
all characters between a @b # and the following newline (or the end of the
RE) are ignored, permitting paragraphing and commenting a complex RE. There
are three exceptions to that basic rule:

@li a white-space character or '@b #' preceded
    by '@b \' is retained
@li white space or '@b #' within a bracket expression is retained
@li white space and comments are illegal within multi-character symbols like
    the ARE '@b (?:' or the BRE '@b \('

Expanded-syntax white-space characters are blank,
tab, newline, and any character that belongs to the @e space character class.

Finally, in an ARE, outside bracket expressions, the sequence '@b (?#ttt)' (where
@e ttt is any text not containing a '@b )') is a comment, completely ignored. Again,
this is not allowed between the characters of multi-character symbols like
'@b (?:'. Such comments are more a historical artifact than a useful facility,
and their use is deprecated; use the expanded syntax instead.

@e None of these
metasyntax extensions is available if the application (or an initial @b ***=
director) has specified that the user's input be treated as a literal string
rather than as an RE.


@section overview_resyntax_matching Matching

In the event that an RE could match more than
one substring of a given string, the RE matches the one starting earliest
in the string. If the RE could match more than one substring starting at
that point, its choice is determined by its @e preference: either the longest
substring, or the shortest.

Most atoms, and all constraints, have no preference.
A parenthesized RE has the same preference (possibly none) as the RE. A
quantified atom with quantifier @b {m} or @b {m}? has the same preference (possibly
none) as the atom itself. A quantified atom with other normal quantifiers
(including @b {m,n} with @e m equal to @e n) prefers longest match. A quantified
atom with other non-greedy quantifiers (including @b {m,n}? with @e m equal to
@e n) prefers shortest match. A branch has the same preference as the first
quantified atom in it which has a preference. An RE consisting of two or
more branches connected by the @b | operator prefers longest match.

Subject to the constraints imposed by the rules for matching the whole RE, subexpressions
also match the longest or shortest possible substrings, based on their
preferences, with subexpressions starting earlier in the RE taking priority
over ones starting later. Note that outer subexpressions thus take priority
over their component subexpressions.

Note that the quantifiers @b {1,1} and
@b {1,1}? can be used to force longest and shortest preference, respectively,
on a subexpression or a whole RE.

Match lengths are measured in characters,
not collating elements. An empty string is considered longer than no match
at all. For example, @b bb* matches the three middle characters
of '@b abbbc'; @b (week|wee)(night|knights)
matches all ten characters of '@b weeknights'; when @b (.*).* is matched against
@b abc the parenthesized subexpression matches all three characters; and when
@b (a*)* is matched against @b bc both the whole RE and the parenthesized subexpression
match an empty string.

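The longest and shortest preferences of single quantified atoms can be observed
with the sketch below. It assumes std::regex (ECMAScript grammar) as a stand-in;
that grammar agrees with the rules above for these simple patterns, but it resolves
'@b |' by trying branches in order rather than by overall length, so it would not
reproduce the @b (week|wee)(night|knights) example. FirstMatchText is a
hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Returns the text of the first match of `pattern` in `text`,
// or an empty string if there is no match.
std::string FirstMatchText(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(text, m, re))
        return std::string();
    return m.str(0);
}
```

Against "abbbc", the greedy pattern "bb*" yields "bbb" while its non-greedy
counterpart "bb*?" yields just "b"; similarly "b{1,2}" yields "bb" and "b{1,2}?"
yields "b".
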
If case-independent matching is specified, the effect
is much as if all case distinctions had vanished from the alphabet. When
an alphabetic that exists in multiple cases appears as an ordinary character
outside a bracket expression, it is effectively transformed into a bracket
expression containing both cases, so that @b x becomes '@b [xX]'. When it appears
inside a bracket expression, all case counterparts of it are added to the
bracket expression, so that @b [x] becomes @b [xX] and @b [^x] becomes '@b [^xX]'.

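Case-independent matching can be sketched as follows, assuming std::regex with its
icase flag as a stand-in for requesting the same behaviour from this package (for
example through the embedded @b i option); MatchesIgnoringCase is a hypothetical
helper name.

```cpp
#include <regex>
#include <string>

// Returns true if the whole of `text` matches `pattern`, ignoring case.
bool MatchesIgnoringCase(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern, std::regex::icase);
    return std::regex_match(text, re);
}
```

Here the pattern "x" accepts "X", just as the bracket expression "[x]" does, while
a different letter such as "y" is still rejected.
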
If newline-sensitive
matching is specified, @b . and bracket expressions using @b ^ will never match
the newline character (so that matches will never cross newlines unless
the RE explicitly arranges it) and @b ^ and @b $ will match the empty string after
and before a newline respectively, in addition to matching at beginning
and end of string respectively. ARE @b \A and @b \Z continue to match beginning
or end of string @e only.

If partial newline-sensitive matching is specified,
this affects @b . and bracket expressions as with newline-sensitive matching,
but not @b ^ and '@b $'.

If inverse partial newline-sensitive matching is specified,
this affects @b ^ and @b $ as with newline-sensitive matching, but not @b . and bracket
expressions. This isn't very useful but is provided for symmetry.


@section overview_resyntax_limits Limits and Compatibility

No particular limit is imposed on the length of REs. Programs
intended to be highly portable should not employ REs longer than 256 bytes,
as a POSIX-compliant implementation can refuse to accept such REs.

The only
feature of AREs that is actually incompatible with POSIX EREs is that @b \
does not lose its special significance inside bracket expressions. All other
ARE features use syntax which is illegal or has undefined or unspecified
effects in POSIX EREs; the @b *** syntax of directors likewise is outside
the POSIX syntax for both BREs and EREs.

Many of the ARE extensions are
borrowed from Perl, but some have been changed to clean them up, and a
few Perl extensions are not present. Incompatibilities of note include '@b \b',
'@b \B', the lack of special treatment for a trailing newline, the addition of
complemented bracket expressions to the things affected by newline-sensitive
matching, the restrictions on parentheses and back references in lookahead
constraints, and the longest/shortest-match (rather than first-match) matching
semantics.

The matching rules for REs containing both normal and non-greedy
quantifiers have changed since early beta-test versions of this package.
(The new rules are much simpler and cleaner, but don't work as hard at guessing
the user's real intentions.)

Henry Spencer's original 1986 @e regexp package, still in widespread use,
implemented an early version of today's EREs. There are four incompatibilities between @e regexp's
near-EREs ('RREs' for short) and AREs. In roughly increasing order of significance:

@li In AREs, @b \ followed by an alphanumeric character is either an escape or
    an error, while in RREs, it was just another way of writing the alphanumeric.
    This should not be a problem because there was no reason to write such
    a sequence in RREs.
@li @b { followed by a digit in an ARE is the beginning of
    a bound, while in RREs, @b { was always an ordinary character. Such sequences
    should be rare, and will often result in an error because following characters
    will not look like a valid bound.
@li In AREs, @b \ remains a special character
    within '@b []', so a literal @b \ within @b [] must be
    written '@b \\'. @b \\ also gives a literal
    @b \ within @b [] in RREs, but only truly paranoid programmers routinely doubled
    the backslash.
@li AREs report the longest/shortest match for the RE, rather
    than the first found in a specified search order. This may affect some RREs
    which were written in the expectation that the first match would be reported.
    (The careful crafting of RREs to optimize the search order for fast matching
    is obsolete (AREs examine all possible matches in parallel, and their performance
    is largely insensitive to their complexity) but cases where the search
    order was exploited to deliberately find a match which was @e not the longest/shortest
    will need rewriting.)


@section overview_resyntax_bre Basic Regular Expressions

BREs differ from EREs in
several respects. '@b |', '@b +', and @b ? are ordinary characters and there is no equivalent
for their functionality. The delimiters for bounds
are @b \{ and '@b \}', with @b { and
@b } by themselves ordinary characters. The parentheses for nested subexpressions
are @b \( and '@b \)', with @b ( and @b ) by themselves
ordinary characters. @b ^ is an ordinary
character except at the beginning of the RE or the beginning of a parenthesized
subexpression, @b $ is an ordinary character except at the end of the RE or
the end of a parenthesized subexpression, and @b * is an ordinary character
if it appears at the beginning of the RE or the beginning of a parenthesized
subexpression (after a possible leading '@b ^'). Finally, single-digit back references
are available, and @b \< and @b \> are synonyms
for @b [[:<:]] and @b [[:>:]] respectively;
no other escapes are available.


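The BRE conventions can be sketched with the POSIX basic grammar offered by the
C++ standard library. This assumes std::regex with the basic flag as a stand-in
for selecting BREs in this package (for example with the embedded @b b option);
the word-boundary synonyms are not exercised because the stand-in is not known to
support them. MatchesAsBasic is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Returns true if the whole of `text` matches `pattern` read as a BRE.
bool MatchesAsBasic(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern, std::regex::basic);
    return std::regex_match(text, re);
}
```

Read as a BRE, the pattern "a+" matches only the literal two-character text "a+",
whereas a group and bound written with backslashed parentheses and braces, such as
the raw literal used in the checks for this sketch, matches "abab".
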
@section overview_resyntax_characters Regular Expression Character Names

Note that the character names are case sensitive.

@li NUL: '\0'
@li SOH: '\001'
@li STX: '\002'
@li ETX: '\003'
@li EOT: '\004'
@li ENQ: '\005'
@li ACK: '\006'
@li BEL: '\007'
@li alert: '\007'
@li BS: '\010'
@li backspace: '\b'
@li HT: '\011'
@li tab: '\t'
@li LF: '\012'
@li newline: '\n'
@li VT: '\013'
@li vertical-tab: '\v'
@li FF: '\014'
@li form-feed: '\f'
@li CR: '\015'
@li carriage-return: '\r'
@li SO: '\016'
@li SI: '\017'
@li DLE: '\020'
@li DC1: '\021'
@li DC2: '\022'
@li DC3: '\023'
@li DC4: '\024'
@li NAK: '\025'
@li SYN: '\026'
@li ETB: '\027'
@li CAN: '\030'
@li EM: '\031'
@li SUB: '\032'
@li ESC: '\033'
@li IS4: '\034'
@li FS: '\034'
@li IS3: '\035'
@li GS
36c9828f
FM
1152
1153
1154
36c9828f 1155
72844950 1156'\035'
36c9828f
FM
1157
1158
1159
36c9828f
FM
1160
1161
72844950 1162IS2
36c9828f
FM
1163
1164
1165
36c9828f 1166
72844950 1167'\036'
36c9828f
FM
1168
1169
1170
36c9828f
FM
1171
1172
72844950 1173RS
36c9828f
FM
1174
1175
1176
36c9828f 1177
72844950 1178'\036'
36c9828f
FM
1179
1180
1181
36c9828f
FM
1182
1183
72844950 1184IS1
36c9828f
FM
1185
1186
1187
36c9828f 1188
72844950 1189'\037'
36c9828f
FM
1190
1191
1192
36c9828f
FM
1193
1194
72844950 1195US
36c9828f
FM
1196
1197
1198
36c9828f 1199
72844950 1200'\037'
36c9828f
FM
1201
1202
1203
36c9828f
FM
1204
1205
72844950 1206space
36c9828f
FM
1207
1208
1209
36c9828f 1210
72844950 1211' '
36c9828f
FM
1212
1213
1214
36c9828f
FM
1215
1216
72844950 1217exclamation-mark
36c9828f
FM
1218
1219
1220
36c9828f 1221
72844950 1222'!'
36c9828f
FM
1223
1224
1225
36c9828f
FM
1226
1227
72844950 1228quotation-mark
36c9828f
FM
1229
1230
1231
36c9828f 1232
72844950 1233'"'
36c9828f 1234
36c9828f
FM
1235
1236
1237
1238
72844950 1239number-sign
36c9828f
FM
1240
1241
36c9828f
FM
1242
1243
72844950 1244'#'
36c9828f
FM
1245
1246
36c9828f
FM
1247
1248
1249
72844950 1250dollar-sign
36c9828f
FM
1251
1252
36c9828f
FM
1253
1254
72844950 1255'$'
36c9828f
FM
1256
1257
36c9828f
FM
1258
1259
1260
72844950 1261percent-sign
36c9828f
FM
1262
1263
36c9828f
FM
1264
1265
72844950 1266'%'
36c9828f
FM
1267
1268
36c9828f
FM
1269
1270
1271
72844950 1272ampersand
36c9828f
FM
1273
1274
36c9828f
FM
1275
1276
72844950 1277'&'
36c9828f
FM
1278
1279
36c9828f
FM
1280
1281
1282
72844950 1283apostrophe
36c9828f
FM
1284
1285
36c9828f
FM
1286
1287
72844950 1288'\''
36c9828f
FM
1289
1290
36c9828f
FM
1291
1292
1293
72844950 1294left-parenthesis
36c9828f
FM
1295
1296
36c9828f
FM
1297
1298
72844950 1299'('
36c9828f
FM
1300
1301
36c9828f
FM
1302
1303
1304
72844950 1305right-parenthesis
36c9828f
FM
1306
1307
36c9828f
FM
1308
1309
72844950 1310')'
36c9828f
FM
1311
1312
36c9828f
FM
1313
1314
1315
72844950 1316asterisk
36c9828f
FM
1317
1318
36c9828f
FM
1319
1320
72844950 1321'*'
36c9828f
FM
1322
1323
36c9828f
FM
1324
1325
1326
72844950 1327plus-sign
36c9828f
FM
1328
1329
36c9828f
FM
1330
1331
72844950 1332'+'
36c9828f
FM
1333
1334
36c9828f
FM
1335
1336
1337
72844950 1338comma
36c9828f
FM
1339
1340
36c9828f
FM
1341
1342
72844950 1343','
36c9828f
FM
1344
1345
36c9828f
FM
1346
1347
1348
72844950 1349hyphen
36c9828f
FM
1350
1351
36c9828f
FM
1352
1353
72844950 1354'-'
36c9828f
FM
1355
1356
36c9828f
FM
1357
1358
1359
72844950 1360hyphen-minus
36c9828f
FM
1361
1362
36c9828f
FM
1363
1364
72844950 1365'-'
36c9828f
FM
1366
1367
36c9828f
FM
1368
1369
1370
72844950 1371period
36c9828f
FM
1372
1373
36c9828f
FM
1374
1375
72844950 1376'.'
36c9828f
FM
1377
1378
36c9828f 1379
36c9828f 1380
36c9828f 1381
72844950 1382full-stop
36c9828f 1383
36c9828f
FM
1384
1385
36c9828f 1386
72844950 1387'.'
36c9828f
FM
1388
1389
36c9828f 1390
36c9828f 1391
36c9828f 1392
72844950 1393slash
36c9828f
FM
1394
1395
1396
1397
72844950 1398'/'
36c9828f
FM
1399
1400
36c9828f
FM
1401
1402
1403
72844950 1404solidus
36c9828f 1405
36c9828f
FM
1406
1407
1408
72844950 1409'/'
36c9828f
FM
1410
1411
36c9828f
FM
1412
1413
1414
72844950 1415zero
36c9828f 1416
36c9828f
FM
1417
1418
1419
72844950 1420'0'
36c9828f
FM
1421
1422
36c9828f
FM
1423
1424
1425
72844950 1426one
36c9828f 1427
36c9828f
FM
1428
1429
1430
72844950 1431'1'
36c9828f
FM
1432
1433
36c9828f
FM
1434
1435
1436
72844950 1437two
36c9828f 1438
36c9828f
FM
1439
1440
1441
72844950 1442'2'
36c9828f
FM
1443
1444
36c9828f
FM
1445
1446
1447
72844950 1448three
36c9828f 1449
36c9828f
FM
1450
1451
1452
72844950 1453'3'
36c9828f
FM
1454
1455
36c9828f
FM
1456
1457
1458
72844950 1459four
36c9828f 1460
36c9828f
FM
1461
1462
1463
72844950 1464'4'
36c9828f
FM
1465
1466
36c9828f
FM
1467
1468
1469
72844950 1470five
36c9828f 1471
36c9828f
FM
1472
1473
1474
72844950 1475'5'
36c9828f
FM
1476
1477
36c9828f
FM
1478
1479
1480
72844950 1481six
36c9828f 1482
36c9828f
FM
1483
1484
1485
72844950 1486'6'
36c9828f
FM
1487
1488
36c9828f
FM
1489
1490
1491
72844950 1492seven
36c9828f 1493
36c9828f
FM
1494
1495
1496
72844950 1497'7'
36c9828f
FM
1498
1499
36c9828f
FM
1500
1501
1502
72844950 1503eight
36c9828f 1504
36c9828f
FM
1505
1506
1507
72844950 1508'8'
36c9828f
FM
1509
1510
36c9828f
FM
1511
1512
1513
72844950 1514nine
36c9828f 1515
36c9828f
FM
1516
1517
1518
72844950 1519'9'
36c9828f
FM
1520
1521
36c9828f
FM
1522
1523
1524
72844950 1525colon
36c9828f 1526
36c9828f
FM
1527
1528
1529
72844950 1530':'
36c9828f
FM
1531
1532
36c9828f
FM
1533
1534
1535
72844950 1536semicolon
36c9828f 1537
36c9828f
FM
1538
1539
1540
72844950 1541';'
36c9828f
FM
1542
1543
36c9828f
FM
1544
1545
1546
72844950 1547less-than-sign
36c9828f 1548
36c9828f
FM
1549
1550
1551
72844950 1552'<'
36c9828f
FM
1553
1554
36c9828f
FM
1555
1556
1557
72844950 1558equals-sign
36c9828f 1559
36c9828f
FM
1560
1561
1562
72844950 1563'='
36c9828f
FM
1564
1565
36c9828f
FM
1566
1567
1568
72844950 1569greater-than-sign
36c9828f 1570
36c9828f
FM
1571
1572
1573
72844950 1574'>'
36c9828f
FM
1575
1576
36c9828f
FM
1577
1578
1579
72844950 1580question-mark
36c9828f 1581
36c9828f
FM
1582
1583
1584
72844950 1585'?'
36c9828f
FM
1586
1587
36c9828f
FM
1588
1589
1590
72844950 1591commercial-at
36c9828f 1592
36c9828f
FM
1593
1594
1595
72844950 1596'@'
36c9828f
FM
1597
1598
36c9828f
FM
1599
1600
1601
72844950 1602left-square-bracket
36c9828f 1603
36c9828f
FM
1604
1605
1606
72844950 1607'['
36c9828f
FM
1608
1609
36c9828f
FM
1610
1611
1612
72844950 1613backslash
36c9828f 1614
36c9828f
FM
1615
1616
1617
72844950 1618'\\'
36c9828f
FM
1619
1620
36c9828f
FM
1621
1622
1623
72844950 1624reverse-solidus
36c9828f 1625
36c9828f
FM
1626
1627
1628
72844950 1629'\\'
36c9828f
FM
1630
1631
36c9828f
FM
1632
1633
1634
72844950 1635right-square-bracket
36c9828f 1636
36c9828f
FM
1637
1638
1639
72844950 1640']'
36c9828f
FM
1641
1642
36c9828f
FM
1643
1644
1645
72844950 1646circumflex
36c9828f 1647
36c9828f
FM
1648
1649
1650
72844950 1651'^'
36c9828f
FM
1652
1653
36c9828f
FM
1654
1655
1656
72844950 1657circumflex-accent
36c9828f 1658
36c9828f
FM
1659
1660
1661
72844950 1662'^'
36c9828f
FM
1663
1664
36c9828f
FM
1665
1666
1667
72844950 1668underscore
36c9828f 1669
36c9828f
FM
1670
1671
1672
72844950 1673'_'
36c9828f
FM
1674
1675
36c9828f
FM
1676
1677
1678
72844950 1679low-line
36c9828f 1680
36c9828f
FM
1681
1682
1683
72844950 1684'_'
36c9828f
FM
1685
1686
36c9828f
FM
1687
1688
1689
72844950 1690grave-accent
36c9828f 1691
36c9828f
FM
1692
1693
1694
72844950 1695'`'
36c9828f
FM
1696
1697
36c9828f
FM
1698
1699
1700
72844950 1701left-brace
36c9828f 1702
36c9828f
FM
1703
1704
1705
72844950 1706'{'
36c9828f
FM
1707
1708
36c9828f
FM
1709
1710
1711
72844950 1712left-curly-bracket
36c9828f 1713
36c9828f
FM
1714
1715
1716
72844950 1717'{'
36c9828f
FM
1718
1719
36c9828f
FM
1720
1721
1722
72844950 1723vertical-line
36c9828f 1724
36c9828f
FM
1725
1726
1727
72844950 1728'|'
36c9828f
FM
1729
1730
36c9828f
FM
1731
1732
1733
72844950 1734right-brace
36c9828f 1735
36c9828f
FM
1736
1737
1738
72844950 1739'}'
36c9828f
FM
1740
1741
36c9828f
FM
1742
1743
1744
72844950 1745right-curly-bracket
36c9828f 1746
36c9828f
FM
1747
1748
1749
72844950 1750'}'
36c9828f
FM
1751
1752
36c9828f
FM
1753
1754
1755
72844950 1756tilde
36c9828f 1757
36c9828f
FM
1758
1759
1760
72844950 1761'~'
36c9828f
FM
1762
1763
36c9828f
FM
1764
1765
1766
72844950 1767DEL
36c9828f 1768
36c9828f
FM
1769
1770
1771
72844950 1772'\177'
36c9828f 1773
72844950 1774*/
36c9828f 1775