docs/doxygen/overviews/resyntax.h

   1 /////////////////////////////////////////////////////////////////////////////
   2 // Name:        resyntax.h
   3 // Purpose:     topic overview
   4 // Author:      wxWidgets team
   5 // RCS-ID:      $Id$
   6 // Licence:     wxWindows license
   7 /////////////////////////////////////////////////////////////////////////////
   8
   9 /*!
  10
  11 @page overview_resyntax Syntax of the Built-in Regular Expression Library
  12
A <em>regular expression</em> describes strings of characters. It's a pattern
that matches certain strings and doesn't match others.
  15
  16 @li @ref overview_resyntax_differentflavors
  17 @li @ref overview_resyntax_syntax
  18 @li @ref overview_resyntax_bracket
  19 @li @ref overview_resyntax_escapes
  20 @li @ref overview_resyntax_metasyntax
  21 @li @ref overview_resyntax_matching
  22 @li @ref overview_resyntax_limits
  23 @li @ref overview_resyntax_bre
  24 @li @ref overview_resyntax_characters
  25
  26 @seealso
  27
  28 @li #wxRegEx
  29
  30
  31 <hr>
  32
  33
  34 @section overview_resyntax_differentflavors Different Flavors of Regular Expressions
  35
  36 Regular expressions (RE), as defined by POSIX, come in two flavors:
  37 <em>extended regular expressions</em> (ERE) and <em>basic regular
  38 expressions</em> (BRE). EREs are roughly those of the traditional @e egrep,
  39 while BREs are roughly those of the traditional @e ed. This implementation
  40 adds a third flavor: <em>advanced regular expressions</em> (ARE), basically
  41 EREs with some significant extensions.
  42
  43 This manual page primarily describes AREs. BREs mostly exist for backward
  44 compatibility in some old programs. POSIX EREs are almost an exact subset of
  45 AREs. Features of AREs that are not present in EREs will be indicated.
  46
  47
  48 @section overview_resyntax_syntax Regular Expression Syntax
  49
  50 These regular expressions are implemented using the package written by Henry
  51 Spencer, based on the 1003.2 spec and some (not quite all) of the Perl5
  52 extensions (thanks, Henry!).  Much of the description of regular expressions
  53 below is copied verbatim from his manual entry.
  54
  55 An ARE is one or more @e branches, separated by "|", matching anything that
  56 matches any of the branches.
  57
  58 A branch is zero or more @e constraints or @e quantified atoms, concatenated.
  59 It matches a match for the first, followed by a match for the second, etc; an
  60 empty branch matches the empty string.
  61
  62 A quantified atom is an @e atom possibly followed by a single @e quantifier.
  63 Without a quantifier, it matches a match for the atom. The quantifiers, and
  64 what a so-quantified atom matches, are:
  65
  66 @beginTable
  67 @row2col{ <tt>*</tt> ,
  68     A sequence of 0 or more matches of the atom. }
  69 @row2col{ <tt>+</tt> ,
  70     A sequence of 1 or more matches of the atom. }
  71 @row2col{ <tt>?</tt> ,
  72     A sequence of 0 or 1 matches of the atom. }
  73 @row2col{ <tt>{m}</tt> ,
  74     A sequence of exactly @e m matches of the atom. }
  75 @row2col{ <tt>{m\,}</tt> ,
  76     A sequence of @e m or more matches of the atom. }
  77 @row2col{ <tt>{m\,n}</tt> ,
  78     A sequence of @e m through @e n (inclusive) matches of the atom; @e m may
  79     not exceed @e n. }
  80 @row2col{ <tt>*? +? ?? {m}? {m\,}? {m\,n}?</tt> ,
  81     @e Non-greedy quantifiers, which match the same possibilities, but prefer
  82     the smallest number rather than the largest number of matches (see
  83     @ref overview_resyntax_matching). }
  84 @endTable
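
The difference between greedy and non-greedy quantifiers is easiest to see by
running them. The following is a minimal sketch under stated assumptions: it
uses the C++ standard library's @c std::regex (default ECMAScript grammar) as a
stand-in for wxRegEx so that it is self-contained, and the helper name
@c firstMatch is hypothetical. The quantifiers shown are expected to behave the
same way in AREs.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first substring of text matched by
// pattern, or an empty string if nothing matches.
// Assumption: std::regex (ECMAScript grammar) stands in for wxRegEx here.
std::string firstMatch(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}
```

With this helper, <tt>firstMatch("a+", "caaat")</tt> yields @c aaa, the
non-greedy <tt>firstMatch("a+?", "caaat")</tt> yields @c a, and the bound in
<tt>firstMatch("a{2}", "caaat")</tt> yields @c aa.
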
  85
  86 The forms using @b { and @b } are known as @e bounds. The numbers @e m and
  87 @e n are unsigned decimal integers with permissible values from 0 to 255
  88 inclusive. An atom is one of:
  89
  90 @beginTable
  91 @row2col{ <tt>(re)</tt> ,
  92     Where @e re is any regular expression, matches for @e re, with the match
  93     captured for possible reporting. }
  94 @row2col{ <tt>(?:re)</tt> ,
  95     As previous, but does no reporting (a "non-capturing" set of
  96     parentheses). }
  97 @row2col{ <tt>()</tt> ,
  98     Matches an empty string, captured for possible reporting. }
  99 @row2col{ <tt>(?:)</tt> ,
 100     Matches an empty string, without reporting. }
 101 @row2col{ <tt>[chars]</tt> ,
 102     A <em>bracket expression</em>, matching any one of the @e chars (see
 103     @ref overview_resyntax_bracket for more details). }
 104 @row2col{ <tt>.</tt> ,
 105     Matches any single character. }
 106 @row2col{ <tt>@\k</tt> ,
 107     Where @e k is a non-alphanumeric character, matches that character taken
 108     as an ordinary character, e.g. @\@\ matches a backslash character. }
 109 @row2col{ <tt>@\c</tt> ,
 110     Where @e c is alphanumeric (possibly followed by other characters), an
 111     @e escape (AREs only), see @ref overview_resyntax_escapes below. }
 112 @row2col{ <tt>@leftCurly</tt> ,
 113     When followed by a character other than a digit, matches the left-brace
 114     character "@leftCurly"; when followed by a digit, it is the beginning of a
 115     @e bound (see above). }
 116 @row2col{ <tt>x</tt> ,
 117     Where @e x is a single character with no other significance, matches that
 118     character. }
 119 @endTable
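
To illustrate the difference between capturing and non-capturing parentheses,
here is a small sketch. It assumes @c std::regex (ECMAScript grammar) as a
self-contained stand-in for wxRegEx, and @c capturedGroups is a hypothetical
helper introduced only for this example.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the text of each capturing group after a
// successful whole-string match, or an empty vector if there is no match.
// Assumption: std::regex (ECMAScript grammar) stands in for wxRegEx here.
std::vector<std::string> capturedGroups(const std::string& pattern,
                                        const std::string& text)
{
    const std::regex re(pattern);
    std::smatch m;
    std::vector<std::string> groups;
    if ( std::regex_match(text, m, re) )
    {
        // m[0] is the whole match; capturing groups start at index 1.
        for ( std::size_t i = 1; i < m.size(); ++i )
            groups.push_back(m.str(i));
    }
    return groups;
}
```

For the pattern <tt>(a)(?:b)(c)</tt> matched against @c abc, only two groups
are reported, @c a and @c c, because <tt>(?:b)</tt> does no reporting.
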
 120
 121 A @e constraint matches an empty string when specific conditions are met. A
 122 constraint may not be followed by a quantifier. The simple constraints are as
 123 follows; some more constraints are described later, under
 124 @ref overview_resyntax_escapes.
 125
 126 @beginTable
 127 @row2col{ <tt>^</tt> ,
 128     Matches at the beginning of a line. }
 129 @row2col{ <tt>@$</tt> ,
 130     Matches at the end of a line. }
 131 @row2col{ <tt>(?=re)</tt> ,
 132     @e Positive lookahead (AREs only), matches at any point where a substring
 133     matching @e re begins. }
 134 @row2col{ <tt>(?!re)</tt> ,
 135     @e Negative lookahead (AREs only), matches at any point where no substring
 136     matching @e re begins. }
 137 @endTable
 138
The lookahead constraints may not contain back references (see later), and all
parentheses within them are considered non-capturing. An RE may not end with
<tt>@\</tt>.
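
Because a constraint matches an empty string, a lookahead never adds to the
length of the match. The sketch below shows this, assuming @c std::regex
(ECMAScript grammar) as a self-contained stand-in for wxRegEx; the helper
@c matchLength is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: length of the first match of pattern in text,
// or -1 if there is no match.
// Assumption: std::regex (ECMAScript grammar) stands in for wxRegEx here.
int matchLength(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? static_cast<int>(m.length(0)) : -1;
}
```

Here <tt>matchLength("foo(?=bar)", "foobar")</tt> is 3, not 6: the lookahead
checks that @c bar follows but does not consume it. The negative form
<tt>foo(?!bar)</tt> fails on @c foobar and matches three characters of
@c foobaz.
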
 142
 143
 144 @section overview_resyntax_bracket Bracket Expressions
 145
 146 A <em>bracket expression</em> is a list of characters enclosed in <tt>[]</tt>.
 147 It normally matches any single character from the list (but see below). If the
 148 list begins with @c ^, it matches any single character (but see below) @e not
 149 from the rest of the list.
 150
 151 If two characters in the list are separated by <tt>-</tt>, this is shorthand
 152 for the full @e range of characters between those two (inclusive) in the
 153 collating sequence, e.g. <tt>[0-9]</tt> in ASCII matches any decimal digit.
 154 Two ranges may not share an endpoint, so e.g. <tt>a-c-e</tt> is illegal.
 155 Ranges are very collating-sequence-dependent, and portable programs should
 156 avoid relying on them.
 157
 158 To include a literal <tt>]</tt> or <tt>-</tt> in the list, the simplest method
 159 is to enclose it in <tt>[.</tt> and <tt>.]</tt> to make it a collating element
 160 (see below). Alternatively, make it the first character (following a possible
 161 <tt>^</tt>), or (AREs only) precede it with <tt>@\</tt>. Alternatively, for
 162 <tt>-</tt>, make it the last character, or the second endpoint of a range. To
 163 use a literal <tt>-</tt> as the first endpoint of a range, make it a collating
 164 element or (AREs only) precede it with <tt>@\</tt>. With the exception of
 165 these, some combinations using <tt>[</tt> (see next paragraphs), and escapes,
 166 all other special characters lose their special significance within a bracket
 167 expression.
 168
 169 Within a bracket expression, a collating element (a character, a
 170 multi-character sequence that collates as if it were a single character, or a
 171 collating-sequence name for either) enclosed in <tt>[.</tt> and <tt>.]</tt>
 172 stands for the sequence of characters of that collating element.
 173
@e wxWidgets: Currently no multi-character collating elements are defined, so
in <tt>[.X.]</tt>, @c X can be either a single character literal or the name
of a character. For example, <tt>[[.0.]-[.9.]]</tt> and
<tt>[[.zero.]-[.nine.]]</tt> are identical, and both mean the same as
<tt>[0-9]</tt>. See @ref overview_resyntax_characters.
 179
Within a bracket expression, a collating element enclosed in <tt>[=</tt> and
<tt>=]</tt> is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself. An
equivalence class may not be an endpoint of a range.

@e wxWidgets: Currently no equivalence classes are defined, so
<tt>[=X=]</tt> stands for just the single character @c X. @c X can be either a
single character literal or the name of a character, see
@ref overview_resyntax_characters.

Within a bracket expression, the name of a <em>character class</em> enclosed
in <tt>[:</tt> and <tt>:]</tt> stands for the list of all characters (not all
collating elements!) belonging to that class. Standard character classes are:
 192
 193 @beginTable
 194 @row2col{ <tt>alpha</tt>  , A letter. }
 195 @row2col{ <tt>upper</tt>  , An upper-case letter. }
 196 @row2col{ <tt>lower</tt>  , A lower-case letter. }
 197 @row2col{ <tt>digit</tt>  , A decimal digit. }
 198 @row2col{ <tt>xdigit</tt> , A hexadecimal digit. }
 199 @row2col{ <tt>alnum</tt>  , An alphanumeric (letter or digit). }
 200 @row2col{ <tt>print</tt>  , An alphanumeric (same as alnum). }
 201 @row2col{ <tt>blank</tt>  , A space or tab character. }
 202 @row2col{ <tt>space</tt>  , A character producing white space in displayed text. }
 203 @row2col{ <tt>punct</tt>  , A punctuation character. }
 204 @row2col{ <tt>graph</tt>  , A character with a visible representation. }
 205 @row2col{ <tt>cntrl</tt>  , A control character. }
 206 @endTable
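
As an illustration of character classes inside bracket expressions, the sketch
below checks a few strings against class-based patterns. It assumes
@c std::regex (ECMAScript grammar), which accepts the same <tt>[:name:]</tt>
notation, as a self-contained stand-in for wxRegEx; @c matchesWhole is a
hypothetical helper.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of text matches pattern.
// Assumption: std::regex (ECMAScript grammar) stands in for wxRegEx here.
bool matchesWhole(const std::string& pattern, const std::string& text)
{
    return std::regex_match(text, std::regex(pattern));
}
```

For example, <tt>[[:alpha:]_][[:alnum:]_]*</tt> accepts @c foo_42 but rejects
@c 42foo, and <tt>[^[:space:]]+</tt> accepts any string without white space.
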
 207
A character class may not be used as an endpoint of a range.

@e wxWidgets: In a non-Unicode build, these character classifications depend on
the current locale, and correspond to the values returned by the ANSI C 'is'
functions: isalpha, isupper, etc. In Unicode mode they are based on Unicode
classifications, and are not affected by the current locale.

There are two special cases of bracket expressions: the bracket expressions
<tt>[[:@<:]]</tt> and <tt>[[:@>:]]</tt> are constraints, matching empty strings
at the beginning and end of a word respectively. A word is defined as a
sequence of word characters that is neither preceded nor followed by word
characters. A word character is an @e alnum character or an underscore
(<tt>_</tt>). These special bracket expressions are deprecated; users of AREs
should use constraint escapes instead (see @ref overview_resyntax_escapes
below).
 220
 221
 222 @section overview_resyntax_escapes Escapes
 223
Escapes (AREs only), which begin with a <tt>@\</tt> followed by an
alphanumeric character, come in several varieties: character entry, class
shorthands, constraint escapes, and back references. A <tt>@\</tt> followed by
an alphanumeric character but not constituting a valid escape is illegal in
AREs. In EREs, there are no escapes: outside a bracket expression, a
<tt>@\</tt> followed by an alphanumeric character merely stands for that
character as an ordinary character, and inside a bracket expression,
<tt>@\</tt> is an ordinary character. (The latter is the one actual
incompatibility between EREs and AREs.)

Character-entry escapes (AREs only) exist to make it easier to specify
non-printing and otherwise inconvenient characters in REs:
 236
 237
 238
@beginTable
@row2col{ <tt>@\a</tt> ,
    Alert (bell) character, as in C. }
@row2col{ <tt>@\b</tt> ,
    Backspace, as in C. }
@row2col{ <tt>@\B</tt> ,
    Synonym for <tt>@\</tt> to help reduce backslash doubling in some
    applications where there are multiple levels of backslash processing. }
@row2col{ <tt>@\c<em>X</em></tt> ,
    (Where @e X is any character) the character whose low-order 5 bits are the
    same as those of @e X, and whose other bits are all zero. }
@row2col{ <tt>@\e</tt> ,
    The character whose collating-sequence name is <tt>ESC</tt>, or failing
    that, the character with octal value 033. }
@row2col{ <tt>@\f</tt> ,
    Formfeed, as in C. }
@row2col{ <tt>@\n</tt> ,
    Newline, as in C. }
@row2col{ <tt>@\r</tt> ,
    Carriage return, as in C. }
@row2col{ <tt>@\t</tt> ,
    Horizontal tab, as in C. }
@row2col{ <tt>@\u<em>wxyz</em></tt> ,
    (Where @e wxyz is exactly four hexadecimal digits) the Unicode character
    <tt>U+<em>wxyz</em></tt> in the local byte ordering. }
@row2col{ <tt>@\U<em>stuvwxyz</em></tt> ,
    (Where @e stuvwxyz is exactly eight hexadecimal digits) reserved for a
    somewhat-hypothetical Unicode extension to 32 bits. }
@row2col{ <tt>@\v</tt> ,
    Vertical tab, as in C. }
@row2col{ <tt>@\x<em>hhh</em></tt> ,
    (Where @e hhh is any sequence of hexadecimal digits) the character whose
    hexadecimal value is <tt>0x<em>hhh</em></tt> (a single character no matter
    how many hexadecimal digits are used). }
@row2col{ <tt>@\0</tt> ,
    The character whose value is 0. }
@row2col{ <tt>@\<em>xy</em></tt> ,
    (Where @e xy is exactly two octal digits, and is not a <em>back
    reference</em>, see below) the character whose octal value is
    <tt>0<em>xy</em></tt>. }
@row2col{ <tt>@\<em>xyz</em></tt> ,
    (Where @e xyz is exactly three octal digits, and is not a back reference,
    see below) the character whose octal value is <tt>0<em>xyz</em></tt>. }
@endTable
 318
 319
 320
Hexadecimal digits are <tt>0</tt>-<tt>9</tt>, <tt>a</tt>-<tt>f</tt>, and
<tt>A</tt>-<tt>F</tt>. Octal digits are <tt>0</tt>-<tt>7</tt>.

The character-entry escapes are always taken as ordinary characters. For
example, <tt>@\135</tt> is <tt>]</tt> in ASCII, but <tt>@\135</tt> does not
terminate a bracket expression. Beware, however, that some applications (e.g.,
C compilers) interpret such sequences themselves before the regular-expression
package gets to see them, which may require doubling (quadrupling, etc.) the
<tt>@\</tt>.

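
Two of the points above lend themselves to a short illustration: the backslash
doubling needed in C++ source code, and the bit rule given for the control
character escape. The sketch below uses only the C++ standard library; the
names @c doubled, @c raw and @c controlChar are hypothetical and exist only for
this example.

```cpp
#include <string>

// In a C++ string literal the compiler consumes one level of backslashes,
// so the two-character RE consisting of a backslash and the letter d has
// to be written with a doubled backslash, or as a raw string literal.
const std::string doubled = "\\d";
const std::string raw     = R"(\d)";

// The character selected by the control-character rule described above:
// the low-order 5 bits of x are kept and all other bits are zero.
// controlChar is a hypothetical helper used only for this illustration.
char controlChar(char x)
{
    return static_cast<char>(x & 0x1F);
}
```

Both strings hold the same two characters, and <tt>controlChar('[')</tt> gives
the character with octal value 033 (@c ESC in ASCII).
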

Class-shorthand escapes (AREs only) provide shorthands for certain
commonly-used character classes:

@beginTable
@row2col{ <tt>@\d</tt> , <tt>[[:digit:]]</tt> }
@row2col{ <tt>@\s</tt> , <tt>[[:space:]]</tt> }
@row2col{ <tt>@\w</tt> , <tt>[[:alnum:]_]</tt> (note underscore) }
@row2col{ <tt>@\D</tt> , <tt>[^[:digit:]]</tt> }
@row2col{ <tt>@\S</tt> , <tt>[^[:space:]]</tt> }
@row2col{ <tt>@\W</tt> , <tt>[^[:alnum:]_]</tt> (note underscore) }
@endTable
 357
 358
 359
Within bracket expressions, <tt>@\d</tt>, <tt>@\s</tt>, and <tt>@\w</tt> lose
their outer brackets, and <tt>@\D</tt>, <tt>@\S</tt>, and <tt>@\W</tt> are
illegal. (So, for example, <tt>[a-c@\d]</tt> is equivalent to
<tt>[a-c[:digit:]]</tt>. Also, <tt>[a-c@\D]</tt>, which is equivalent to
<tt>[a-c^[:digit:]]</tt>, is illegal.)

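
The equivalences between the shorthands and their bracket-expression forms can
be spot-checked on sample strings. This is only a sketch: it assumes
@c std::regex (ECMAScript grammar) as a self-contained stand-in for wxRegEx,
the helper @c agreeOn is hypothetical, and agreement on a handful of samples is
an illustration rather than a proof of equivalence.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: true if both patterns accept exactly the same
// strings out of samples (whole-string matching).
// Assumption: std::regex (ECMAScript grammar) stands in for wxRegEx here.
bool agreeOn(const std::string& patternA, const std::string& patternB,
             const std::vector<std::string>& samples)
{
    const std::regex a(patternA), b(patternB);
    for ( const std::string& s : samples )
    {
        if ( std::regex_match(s, a) != std::regex_match(s, b) )
            return false;
    }
    return true;
}
```

Called with the raw string literals <tt>R"([a-c\d]+)"</tt> and
<tt>"[a-c[:digit:]]+"</tt>, the helper reports agreement on samples such as
@c a1c9, @c 123 and @c d.
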

A constraint escape (AREs only) is a constraint, matching the empty string if
specific conditions are met, written as an escape:

@beginTable
@row2col{ <tt>@\A</tt> ,
    Matches only at the beginning of the string (see
    @ref overview_resyntax_matching, below, for how this differs from
    <tt>^</tt>). }
@row2col{ <tt>@\m</tt> ,
    Matches only at the beginning of a word. }
@row2col{ <tt>@\M</tt> ,
    Matches only at the end of a word. }
@row2col{ <tt>@\y</tt> ,
    Matches only at the beginning or end of a word. }
@row2col{ <tt>@\Y</tt> ,
    Matches only at a point that is not the beginning or end of a word. }
@row2col{ <tt>@\Z</tt> ,
    Matches only at the end of the string (see
    @ref overview_resyntax_matching, below, for how this differs from
    <tt>@$</tt>). }
@row2col{ <tt>@\<em>m</em></tt> ,
    (Where @e m is a nonzero digit) a <em>back reference</em>, see below. }
@row2col{ <tt>@\<em>mnn</em></tt> ,
    (Where @e m is a nonzero digit, and @e nn is some more digits, and the
    decimal value @e mnn is not greater than the number of closing capturing
    parentheses seen so far) a <em>back reference</em>, see below. }
@endTable
 411
 412
 413
A word is defined as in the specification of <tt>[[:@<:]]</tt> and
<tt>[[:@>:]]</tt> above. Constraint escapes are illegal within bracket
expressions.

A back reference (AREs only) matches the same string matched by the
parenthesized subexpression specified by the number, so that (e.g.)
<tt>([bc])@\1</tt> matches @c bb or @c cc but not @c bc. The subexpression
must entirely precede the back reference in the RE. Subexpressions are
numbered in the order of their leading parentheses. Non-capturing parentheses
do not define subexpressions.

There is an inherent historical ambiguity between octal character-entry
escapes and back references, which is resolved by heuristics, as hinted at
above. A leading zero always indicates an octal escape. A single non-zero
digit, not followed by another digit, is always taken as a back reference. A
multi-digit sequence not starting with a zero is taken as a back reference if
it comes after a suitable subexpression (i.e. the number is in the legal range
for a back reference), and otherwise is taken as octal.
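
The back reference example above can be reproduced with a few lines of code.
The sketch assumes @c std::regex (ECMAScript grammar), where single-digit back
references work the same way, as a self-contained stand-in for wxRegEx;
@c matchesWhole is a hypothetical helper.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of text matches pattern.
// Assumption: std::regex (ECMAScript grammar) stands in for wxRegEx here.
bool matchesWhole(const std::string& pattern, const std::string& text)
{
    return std::regex_match(text, std::regex(pattern));
}
```

With the pattern written as the raw string literal <tt>R"(([bc])\1)"</tt>, the
helper accepts @c bb and @c cc and rejects @c bc. In
<tt>R"((?:x)(y)\1)"</tt> the back reference refers to <tt>(y)</tt>, because the
non-capturing parentheses are not numbered.
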
 432
 433
 434 @section overview_resyntax_metasyntax Metasyntax
 435
In addition to the main syntax described above, there are some special forms
and miscellaneous syntactic facilities available.

Normally the flavor of RE being used is specified by application-dependent
means. However, this can be overridden by a @e director. If an RE of any
flavor begins with <tt>***:</tt>, the rest of the RE is an ARE. If an RE of
any flavor begins with <tt>***=</tt>, the rest of the RE is taken to be a
literal string, with all characters considered ordinary characters.

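
The director rule can be pictured as a simple prefix check. The sketch below is
not how the regular expression library itself is implemented; it is a
hypothetical illustration in standard C++, and the names @c Flavor,
@c Directed and @c splitDirector are invented for this example.

```cpp
#include <string>

// Hypothetical types and helper, for illustration only.
enum class Flavor { AsSpecified, Advanced, Literal };

struct Directed
{
    Flavor flavor;     // flavor selected by the director, if any
    std::string rest;  // the RE with the director removed
};

// Splits an optional leading director off an RE string.
Directed splitDirector(const std::string& re)
{
    if ( re.compare(0, 4, "***:") == 0 )
        return {Flavor::Advanced, re.substr(4)};
    if ( re.compare(0, 4, "***=") == 0 )
        return {Flavor::Literal, re.substr(4)};
    // No director: the application's own choice of flavor applies.
    return {Flavor::AsSpecified, re};
}
```

For instance, <tt>splitDirector("***=a+")</tt> reports a literal string whose
remaining text is @c a+, so the plus sign would be matched as an ordinary
character.
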

An ARE may begin with <em>embedded options</em>: a sequence <tt>(?xyz)</tt>
(where @e xyz is one or more alphabetic characters) specifies options
affecting the rest of the RE. These supplement, and can override, any options
specified by the application. The available option letters are:

@beginTable
@row2col{ <tt>b</tt> ,
    Rest of RE is a BRE. }
@row2col{ <tt>c</tt> ,
    Case-sensitive matching (usual default). }
@row2col{ <tt>e</tt> ,
    Rest of RE is an ERE. }
@row2col{ <tt>i</tt> ,
    Case-insensitive matching (see @ref overview_resyntax_matching, below). }
@row2col{ <tt>m</tt> ,
    Historical synonym for <tt>n</tt>. }
@row2col{ <tt>n</tt> ,
    Newline-sensitive matching (see @ref overview_resyntax_matching, below). }
@row2col{ <tt>p</tt> ,
    Partial newline-sensitive matching (see @ref overview_resyntax_matching,
    below). }
@row2col{ <tt>q</tt> ,
    Rest of RE is a literal ("quoted") string, all ordinary characters. }
@row2col{ <tt>s</tt> ,
    Non-newline-sensitive matching (usual default). }
@row2col{ <tt>t</tt> ,
    Tight syntax (usual default; see below). }
@row2col{ <tt>w</tt> ,
    Inverse partial newline-sensitive ("weird") matching (see
    @ref overview_resyntax_matching, below). }
@row2col{ <tt>x</tt> ,
    Expanded syntax (see below). }
@endTable
 498
 499 expanded syntax (see below)
 500
 501
 502
Embedded options take effect at the <tt>)</tt> terminating the sequence. They
are available only at the start of an ARE, and may not be used later within
it.

In addition to the usual (<em>tight</em>) RE syntax, in which all characters
are significant, there is an @e expanded syntax, available in AREs with the
embedded @c x option. In the expanded syntax, white-space characters are
ignored and all characters between a <tt>@#</tt> and the following newline (or
the end of the RE) are ignored, permitting paragraphing and commenting a
complex RE. There are three exceptions to that basic rule:

@li A white-space character or <tt>@#</tt> preceded by <tt>@\</tt> is
    retained.
@li White space or <tt>@#</tt> within a bracket expression is retained.
@li White space and comments are illegal within multi-character symbols like
    the ARE <tt>(?:</tt> or the BRE <tt>@\(</tt>.

Expanded-syntax white-space characters are blank, tab, newline, and any
character that belongs to the @e space character class.

Finally, in an ARE, outside bracket expressions, the sequence
<tt>(?@#ttt)</tt> (where @e ttt is any text not containing a <tt>)</tt>) is a
comment, completely ignored. Again, this is not allowed between the characters
of multi-character symbols like <tt>(?:</tt>. Such comments are more a
historical artifact than a useful facility, and their use is deprecated; use
the expanded syntax instead.

@e None of these metasyntax extensions is available if the application (or an
initial <tt>***=</tt> director) has specified that the user's input be treated
as a literal string rather than as an RE.
 533
 534
 535 @section overview_resyntax_matching Matching
 536
In the event that an RE could match more than one substring of a given
string, the RE matches the one starting earliest in the string. If the RE
could match more than one substring starting at that point, its choice is
determined by its @e preference: either the longest substring, or the
shortest.

Most atoms, and all constraints, have no preference. A parenthesized RE has
the same preference (possibly none) as the RE. A quantified atom with
quantifier <tt>{m}</tt> or <tt>{m}?</tt> has the same preference (possibly
none) as the atom itself. A quantified atom with other normal quantifiers
(including <tt>{m,n}</tt> with @e m equal to @e n) prefers longest match. A
quantified atom with other non-greedy quantifiers (including <tt>{m,n}?</tt>
with @e m equal to @e n) prefers shortest match. A branch has the same
preference as the first quantified atom in it which has a preference. An RE
consisting of two or more branches connected by the <tt>|</tt> operator
prefers longest match.

Subject to the constraints imposed by the rules for matching the whole RE,
subexpressions also match the longest or shortest possible substrings, based
on their preferences, with subexpressions starting earlier in the RE taking
priority over ones starting later. Note that outer subexpressions thus take
priority over their component subexpressions.

Note that the quantifiers <tt>{1,1}</tt> and <tt>{1,1}?</tt> can be used to
force longest and shortest preference, respectively, on a subexpression or a
whole RE.

Match lengths are measured in characters, not collating elements. An empty
string is considered longer than no match at all. For example, <tt>bb*</tt>
matches the three middle characters of @c abbbc,
<tt>(week|wee)(night|knights)</tt> matches all ten characters of
@c weeknights, when <tt>(.*).*</tt> is matched against @c abc the
parenthesized subexpression matches all three characters, and when
<tt>(a*)*</tt> is matched against @c bc both the whole RE and the
parenthesized subexpression match an empty string.

If case-independent matching is specified, the effect is much as if all case
distinctions had vanished from the alphabet. When an alphabetic character that
exists in multiple cases appears as an ordinary character outside a bracket
expression, it is effectively transformed into a bracket expression containing
both cases, so that @c x becomes <tt>[xX]</tt>. When it appears inside a
bracket expression, all case counterparts of it are added to the bracket
expression, so that <tt>[x]</tt> becomes <tt>[xX]</tt> and <tt>[^x]</tt>
becomes <tt>[^xX]</tt>.

If newline-sensitive matching is specified, <tt>.</tt> and bracket expressions
using <tt>^</tt> will never match the newline character (so that matches will
never cross newlines unless the RE explicitly arranges it) and <tt>^</tt> and
<tt>@$</tt> will match the empty string after and before a newline
respectively, in addition to matching at beginning and end of string
respectively. ARE <tt>@\A</tt> and <tt>@\Z</tt> continue to match beginning or
end of string @e only.

If partial newline-sensitive matching is specified, this affects <tt>.</tt>
and bracket expressions as with newline-sensitive matching, but not
<tt>^</tt> and <tt>@$</tt>.

If inverse partial newline-sensitive matching is specified, this affects
<tt>^</tt> and <tt>@$</tt> as with newline-sensitive matching, but not
<tt>.</tt> and bracket expressions. This isn't very useful but is provided for
symmetry.
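
Some of the examples in this section can be tried directly. The sketch below
assumes @c std::regex (ECMAScript grammar) as a self-contained stand-in for
wxRegEx and covers only cases where both are expected to give the same answer;
the helper @c subMatch is hypothetical, and case-independent matching is
requested here through the @c std::regex::icase flag rather than an embedded
option.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: returns sub-match number group (0 = whole match) of
// the first match of pattern in text, or an empty string if nothing matches.
// Assumption: std::regex (ECMAScript grammar) stands in for wxRegEx here.
std::string subMatch(const std::string& pattern, const std::string& text,
                     std::size_t group = 0,
                     std::regex::flag_type flags = std::regex::ECMAScript)
{
    const std::regex re(pattern, flags);
    std::smatch m;
    if ( !std::regex_search(text, m, re) || group >= m.size() )
        return std::string();
    return m.str(group);
}
```

Here <tt>subMatch("bb*", "abbbc")</tt> returns the three middle characters
@c bbb, and <tt>subMatch("(.*).*", "abc", 1)</tt> returns @c abc for the
parenthesized subexpression. With the @c icase flag added, the pattern @c x
finds the @c X in @c aXb.
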
 587
 588
 589 @section overview_resyntax_limits Limits and Compatibility
 590
No particular limit is imposed on the length of REs. Programs intended to be
highly portable should not employ REs longer than 256 bytes, as a
POSIX-compliant implementation can refuse to accept such REs.

The only feature of AREs that is actually incompatible with POSIX EREs is that
<tt>@\</tt> does not lose its special significance inside bracket expressions.
All other ARE features use syntax which is illegal or has undefined or
unspecified effects in POSIX EREs; the <tt>***</tt> syntax of directors
likewise is outside the POSIX syntax for both BREs and EREs.

Many of the ARE extensions are borrowed from Perl, but some have been changed
to clean them up, and a few Perl extensions are not present. Incompatibilities
of note include <tt>@\b</tt>, <tt>@\B</tt>, the lack of special treatment for a
trailing newline, the addition of complemented bracket expressions to the
things affected by newline-sensitive matching, the restrictions on parentheses
and back references in lookahead constraints, and the longest/shortest-match
(rather than first-match) matching semantics.

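
The difference between first-match and longest-match semantics can be observed
with an engine that follows the Perl-style first-match rule. The sketch below
uses @c std::regex with its default ECMAScript grammar for that purpose, so it
shows the behaviour that AREs deliberately do @e not have; it does not use
wxRegEx, and the helper name @c firstMatch is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first substring of text matched by
// pattern, or an empty string if nothing matches.
// Assumption: std::regex (ECMAScript grammar) is used here as an example of
// a first-match engine, not as a model of ARE behaviour.
std::string firstMatch(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}
```

With this first-match engine,
<tt>firstMatch("(week|wee)(night|knights)", "weeknights")</tt> returns the
nine characters @c weeknight, because the first alternatives that succeed are
kept. Under the longest-match rule described in
@ref overview_resyntax_matching, an ARE matches all ten characters.
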

The matching rules for REs containing both normal and non-greedy quantifiers
have changed since early beta-test versions of this package. (The new rules
are much simpler and cleaner, but don't work as hard at guessing the user's
real intentions.)

Henry Spencer's original 1986 @e regexp package, still in widespread use,
implemented an early version of today's EREs. There are four incompatibilities
between @e regexp's near-EREs ('RREs' for short) and AREs. In roughly
increasing order of significance:

@li In AREs, <tt>@\</tt> followed by an alphanumeric character is either an
    escape or an error, while in RREs, it was just another way of writing the
    alphanumeric. This should not be a problem because there was no reason to
    write such a sequence in RREs.
@li <tt>{</tt> followed by a digit in an ARE is the beginning of a bound,
    while in RREs, <tt>{</tt> was always an ordinary character. Such sequences
    should be rare, and will often result in an error because following
    characters will not look like a valid bound.
@li In AREs, <tt>@\</tt> remains a special character within <tt>[]</tt>, so a
    literal <tt>@\</tt> within <tt>[]</tt> must be written <tt>@\@\</tt>.
    <tt>@\@\</tt> also gives a literal <tt>@\</tt> within <tt>[]</tt> in RREs,
    but only truly paranoid programmers routinely doubled the backslash.
@li AREs report the longest/shortest match for the RE, rather than the first
    found in a specified search order. This may affect some RREs which were
    written in the expectation that the first match would be reported. (The
    careful crafting of RREs to optimize the search order for fast matching is
    obsolete (AREs examine all possible matches in parallel, and their
    performance is largely insensitive to their complexity) but cases where
    the search order was exploited to deliberately find a match which was
    @e not the longest/shortest will need rewriting.)
 637
 638
 639 @section overview_resyntax_bre Basic Regular Expressions
 640
BREs differ from EREs in several respects. <tt>|</tt>, <tt>+</tt>, and
<tt>?</tt> are ordinary characters and there is no equivalent for their
functionality. The delimiters for bounds are <tt>@\{</tt> and <tt>@\}</tt>,
with <tt>{</tt> and <tt>}</tt> by themselves ordinary characters. The
parentheses for nested subexpressions are <tt>@\(</tt> and <tt>@\)</tt>, with
<tt>(</tt> and <tt>)</tt> by themselves ordinary characters. <tt>^</tt> is an
ordinary character except at the beginning of the RE or the beginning of a
parenthesized subexpression, <tt>@$</tt> is an ordinary character except at
the end of the RE or the end of a parenthesized subexpression, and
<tt>*</tt> is an ordinary character if it appears at the beginning of the RE
or the beginning of a parenthesized subexpression (after a possible leading
<tt>^</tt>). Finally, single-digit back references are available, and
<tt>@\@<</tt> and <tt>@\@></tt> are synonyms for <tt>[[:@<:]]</tt> and
<tt>[[:@>:]]</tt> respectively; no other escapes are available.
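
The BRE rules for parentheses, bounds and <tt>+</tt> can be illustrated with
the POSIX basic grammar that the C++ standard library offers. This is a sketch
under the assumption that @c std::regex with the @c std::regex::basic flag is
an acceptable stand-in for wxRegEx in BRE mode; the helper
@c matchesWholeBasic is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of text matches the POSIX basic
// regular expression pattern.
// Assumption: std::regex with the std::regex::basic grammar stands in for
// wxRegEx here.
bool matchesWholeBasic(const std::string& pattern, const std::string& text)
{
    return std::regex_match(text, std::regex(pattern, std::regex::basic));
}
```

Written as the raw string literal <tt>R"(\(ab\)\{2\})"</tt>, the BRE accepts
@c abab. The pattern @c a+ accepts only the two-character string @c a+,
because the plus sign is an ordinary character in a BRE.
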
 656
 657
 658 @section overview_resyntax_characters Regular Expression Character Names
 659
 660 Note that the character names are case sensitive.
 661
 662
 663
 664
 665
 666
 667 NUL
 668
 669
 670
 671
 672 '\0'
 673
 674
 675
 676
 677
 678 SOH
 679
 680
 681
 682
 683 '\001'
 684
 685
 686
 687
 688
 689 STX
 690
 691
 692
 693
 694 '\002'
 695
 696
 697
 698
 699
 700 ETX
 701
 702
 703
 704
 705 '\003'
 706
 707
 708
 709
 710
 711 EOT
 712
 713
 714
 715
 716 '\004'
 717
 718
 719
 720
 721
 722 ENQ
 723
 724
 725
 726
 727 '\005'
 728
 729
 730
 731
 732
 733 ACK
 734
 735
 736
 737
 738 '\006'
 739
 740
 741
 742
 743
 744 BEL
 745
 746
 747
 748
 749 '\007'
 750
 751
 752
 753
 754
 755 alert
 756
 757
 758
 759
 760 '\007'
 761
 762
 763
 764
 765
 766 BS
 767
 768
 769
 770
 771 '\010'
 772
 773
 774
 775
 776
 777 backspace
 778
 779
 780
 781
 782 '\b'
 783
 784
 785
 786
 787
 788 HT
 789
 790
 791
 792
 793 '\011'
 794
 795
 796
 797
 798
 799 tab
 800
 801
 802
 803
 804 '\t'
 805
 806
 807
 808
 809
 810 LF
 811
 812
 813
 814
 815 '\012'
 816
 817
 818
 819
 820
 821 newline
 822
 823
 824
 825
 826 '\n'
 827
 828
 829
 830
 831
 832 VT
 833
 834
 835
 836
 837 '\013'
 838
 839
 840
 841
 842
 843 vertical-tab
 844
 845
 846
 847
 848 '\v'
 849
 850
 851
 852
 853
 854 FF
 855
 856
 857
 858
 859 '\014'
 860
 861
 862
 863
 864
 865 form-feed
 866
 867
 868
 869
 870 '\f'
 871
 872
 873
 874
 875
 876 CR
 877
 878
 879
 880
 881 '\015'
 882
 883
 884
 885
 886
 887 carriage-return
 888
 889
 890
 891
 892 '\r'
 893
 894
 895
 896
 897
 898 SO
 899
 900
 901
 902
 903 '\016'
 904
 905
 906
 907
 908
 909 SI
 910
 911
 912
 913
 914 '\017'
 915
 916
 917
 918
 919
 920 DLE
 921
 922
 923
 924
 925 '\020'
 926
 927
 928
 929
 930
 931 DC1
 932
 933
 934
 935
 936 '\021'
 937
 938
 939
 940
 941
 942 DC2
 943
 944
 945
 946
 947 '\022'
 948
 949
 950
 951
 952
 953 DC3
 954
 955
 956
 957
 958 '\023'
 959
 960
 961
 962
 963
 964 DC4
 965
 966
 967
 968
 969 '\024'
 970
 971
 972
 973
 974
 975 NAK
 976
 977
 978
 979
 980 '\025'
 981
 982
 983
 984
 985
 986 SYN
 987
 988
 989
 990
 991 '\026'
 992
 993
 994
 995
 996
 997 ETB
 998
 999
1000
1001
1002 '\027'
1003
1004
1005
1006
1007
1008 CAN
1009
1010
1011
1012
1013 '\030'
1014
1015
1016
1017
1018
1019 EM
1020
1021
1022
1023
1024 '\031'
1025
1026
1027
1028
1029
1030 SUB
1031
1032
1033
1034
1035 '\032'
1036
1037
1038
1039
1040
1041 ESC
1042
1043
1044
1045
1046 '\033'
1047
1048
1049
1050
1051
1052 IS4
1053
1054
1055
1056
1057 '\034'
1058
1059
1060
1061
1062
1063 FS
1064
1065
1066
1067
1068 '\034'
1069
1070
1071
1072
1073
1074 IS3
1075
1076
1077
1078
1079 '\035'
1080
1081
1082
1083
1084
1085 GS
1086
1087
1088
1089
1090 '\035'
1091
1092
1093
1094
1095
1096 IS2
1097
1098
1099
1100
1101 '\036'
1102
1103
1104
1105
1106
1107 RS
1108
1109
1110
1111
1112 '\036'
1113
1114
1115
1116
1117
1118 IS1
1119
1120
1121
1122
1123 '\037'
1124
1125
1126
1127
1128
1129 US
1130
1131
1132
1133
1134 '\037'
1135
1136
1137
1138
1139
1140 space
1141
1142
1143
1144
1145 ' '
1146
1147
1148
1149
1150
1151 exclamation-mark
1152
1153
1154
1155
1156 '!'
1157
1158
1159
1160
1161
1162 quotation-mark
1163
1164
1165
1166
1167 '"'
1168
1169
1170
1171
1172
1173 number-sign
1174
1175
1176
1177
1178 '#'
1179
1180
1181
1182
1183
1184 dollar-sign
1185
1186
1187
1188
1189 '$'
1190
1191
1192
1193
1194
1195 percent-sign
1196
1197
1198
1199
1200 '%'
1201
1202
1203
1204
1205
1206 ampersand
1207
1208
1209
1210
1211 ''
1212
1213
1214
1215
1216
1217 apostrophe
1218
1219
1220
1221
1222 '\''
1223
1224
1225
1226
1227
1228 left-parenthesis
1229
1230
1231
1232
1233 '('
1234
1235
1236
1237
1238
1239 right-parenthesis
1240
1241
1242
1243
1244 ')'
1245
1246
1247
1248
1249
1250 asterisk
1251
1252
1253
1254
1255 '*'
1256
1257
1258
1259
1260
1261 plus-sign
1262
1263
1264
1265
1266 '+'
1267
1268
1269
1270
1271
1272 comma
1273
1274
1275
1276
1277 ','
1278
1279
1280
1281
1282
1283 hyphen
1284
1285
1286
1287
1288 '-'
1289
1290
1291
1292
1293
1294 hyphen-minus
1295
1296
1297
1298
1299 '-'
1300
1301
1302
1303
1304
1305 period
1306
1307
1308
1309
1310 '.'
1311
1312
1313
1314
1315
1316 full-stop
1317
1318
1319
1320
1321 '.'
1322
1323
1324
1325
1326
1327 slash
1328
1329
1330
1331
1332 '/'
1333
1334
1335
1336
1337
1338 solidus
1339
1340
1341
1342
1343 '/'
1344
1345
1346
1347
1348
1349 zero
1350
1351
1352
1353
1354 '0'
1355
1356
1357
1358
1359
1360 one
1361
1362
1363
1364
1365 '1'
1366
1367
1368
1369
1370
1371 two
1372
1373
1374
1375
1376 '2'
1377
1378
1379
1380
1381
1382 three
1383
1384
1385
1386
1387 '3'
1388
1389
1390
1391
1392
1393 four
1394
1395
1396
1397
1398 '4'
1399
1400
1401
1402
1403
1404 five
1405
1406
1407
1408
1409 '5'
1410
1411
1412
1413
1414
1415 six
1416
1417
1418
1419
1420 '6'
1421
1422
1423
1424
1425
1426 seven
1427
1428
1429
1430
1431 '7'
1432
1433
1434
1435
1436
1437 eight
1438
1439
1440
1441
1442 '8'
1443
1444
1445
1446
1447
1448 nine
1449
1450
1451
1452
1453 '9'
1454
1455
1456
1457
1458
1459 colon
1460
1461
1462
1463
1464 ':'
1465
1466
1467
1468
1469
1470 semicolon
1471
1472
1473
1474
1475 ';'
1476
1477
1478
1479
1480
1481 less-than-sign
1482
1483
1484
1485
1486 ''
1487
1488
1489
1490
1491
1492 equals-sign
1493
1494
1495
1496
1497 '='
1498
1499
1500
1501
1502
1503 greater-than-sign
1504
1505
1506
1507
1508 ''
1509
1510
1511
1512
1513
1514 question-mark
1515
1516
1517
1518
1519 '?'
1520
1521
1522
1523
1524
1525 commercial-at
1526
1527
1528
1529
1530 '@'
1531
1532
1533
1534
1535
1536 left-square-bracket
1537
1538
1539
1540
1541 '['
1542
1543
1544
1545
1546
1547 backslash
1548
1549
1550
1551
1552 '\'
1553
1554
1555
1556
1557
1558 reverse-solidus
1559
1560
1561
1562
1563 '\'
1564
1565
1566
1567
1568
1569 right-square-bracket
1570
1571
1572
1573
1574 ']'
1575
1576
1577
1578
1579
1580 circumflex
1581
1582
1583
1584
1585 '^'
1586
1587
1588
1589
1590
1591 circumflex-accent
1592
1593
1594
1595
1596 '^'
1597
1598
1599
1600
1601
1602 underscore
1603
1604
1605
1606
1607 '_'
1608
1609
1610
1611
1612
1613 low-line
1614
1615
1616
1617
1618 '_'
1619
1620
1621
1622
1623
1624 grave-accent
1625
1626
1627
1628
1629 '''
1630
1631
1632
1633
1634
1635 left-brace
1636
1637
1638
1639
1640 '{'
1641
1642
1643
1644
1645
1646 left-curly-bracket
1647
1648
1649
1650
1651 '{'
1652
1653
1654
1655
1656
1657 vertical-line
1658
1659
1660
1661
1662 '|'
1663
1664
1665
1666
1667
1668 right-brace
1669
1670
1671
1672
1673 '}'
1674
1675
1676
1677
1678
1679 right-curly-bracket
1680
1681
1682
1683
1684 '}'
1685
1686
1687
1688
1689
1690 tilde
1691
1692
1693
1694
1695 '~'
1696
1697
1698
1699
1700
1701 DEL
1702
1703
1704
1705
1706 '\177'
1707
1708 */
1709