   1 /////////////////////////////////////////////////////////////////////////////
   2 // Name:        resyntax.h
   3 // Purpose:     topic overview
   4 // Author:      wxWidgets team
   5 // RCS-ID:      $Id$
   6 // Licence:     wxWindows license
   7 /////////////////////////////////////////////////////////////////////////////
   8
   9 /*!
  10
  11 @page overview_resyntax Syntax of the Built-in Regular Expression Library
  12
A <em>regular expression</em> describes strings of characters. It's a pattern
that matches certain strings and doesn't match others.
  15
  16 @seealso #wxRegEx
  17
  18 @li @ref overview_resyntax_differentflavors
  19 @li @ref overview_resyntax_syntax
  20 @li @ref overview_resyntax_bracket
  21 @li @ref overview_resyntax_escapes
  22 @li @ref overview_resyntax_metasyntax
  23 @li @ref overview_resyntax_matching
  24 @li @ref overview_resyntax_limits
  25 @li @ref overview_resyntax_bre
  26 @li @ref overview_resyntax_characters
  27
  28
  29 <hr>
  30
  31
  32 @section overview_resyntax_differentflavors Different Flavors of REs
  33
Regular expressions ("REs"), as defined by POSIX, come in two
flavors: @e extended REs ("EREs") and @e basic REs ("BREs"). EREs are roughly those
of the traditional @e egrep, while BREs are roughly those of the traditional
@e ed. This implementation adds a third flavor, @e advanced REs ("AREs"), basically
EREs with some significant extensions.

This overview primarily describes
AREs. BREs mostly exist for backward compatibility in some old programs;
they are discussed at the end, in @ref overview_resyntax_bre. POSIX EREs are almost an exact subset
of AREs. Features of AREs that are not present in EREs are indicated.
  43
  44
  45 @section overview_resyntax_syntax Regular Expression Syntax
  46
These regular expressions are implemented using
the package written by Henry Spencer, based on the 1003.2 spec and some
(not quite all) of the Perl5 extensions (thanks, Henry!). Much of the description
of regular expressions below is copied verbatim from his manual entry.

An ARE is one or more @e branches, separated by '@b |', matching anything that matches
any of the branches.

A branch is zero or more @e constraints or @e quantified
atoms, concatenated. It matches a match for the first, followed by a match
for the second, etc.; an empty branch matches the empty string.

A quantified atom is an @e atom possibly followed by a single @e quantifier. Without a quantifier,
it matches a match for the atom. The quantifiers, and what a so-quantified
atom matches, are:
  59
  60
  61
  62 @b *
  63
  64 a sequence of 0 or more matches of the atom
  65
  66 @b +
  67
  68 a sequence of 1 or more matches of the atom
  69
  70 @b ?
  71
  72 a sequence of 0 or 1 matches of the atom
  73
  74 @b {m}
  75
  76 a sequence of exactly @e m matches of the atom
  77
  78 @b {m,}
  79
  80 a sequence of @e m or more matches of the atom
  81
  82 @b {m,n}
  83
  84 a sequence of @e m through @e n (inclusive)
  85 matches of the atom; @e m may not exceed @e n
  86
  87 @b *?  +?  ??  {m}?  {m,}?  {m,n}?
  88
  89 @e non-greedy quantifiers,
  90 which match the same possibilities, but prefer the
smallest number rather than the largest number of matches (see @ref overview_resyntax_matching)
  92
  93 The forms using @b { and @b } are known as @e bounds. The numbers @e m and @e n are unsigned
  94 decimal integers with permissible values from 0 to 255 inclusive.
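
The difference between bounds, greedy quantifiers and non-greedy quantifiers can be
tried out with a small sketch. It assumes the standard C++ @c std::regex class as a
stand-in: its default ECMAScript grammar is not the ARE engine described here, but it
treats these particular quantifiers the same way. The helper name @c first_match is
hypothetical.

```cpp
#include <regex>
#include <string>

// Returns the text of the first match of `pattern` in `text`,
// or an empty string if nothing matches.
// Assumption: std::regex (ECMAScript grammar) stands in for the ARE engine.
std::string first_match(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_search(text, m, re))
        return m.str(0);
    return std::string();
}
```

With this helper, <tt>first_match("a{2,3}", "aaaa")</tt> yields three characters because the
greedy bound prefers the largest count, while <tt>first_match("a{2,3}?", "aaaa")</tt> yields
two because the non-greedy form prefers the smallest.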
  95 An atom is one of:
  96
  97 @b (re)
  98
  99 (where @e re is any regular expression) matches a match for
 100 @e re, with the match noted for possible reporting
 101
 102 @b (?:re)
 103
 104 as previous, but
does no reporting (a "non-capturing" set of parentheses)
 106
 107 @b ()
 108
 109 matches an empty
 110 string, noted for possible reporting
 111
 112 @b (?:)
 113
 114 matches an empty string, without reporting
 115
 116 @b [chars]
 117
 118 a @e bracket expression, matching any one of the @e chars
(see @ref overview_resyntax_bracket for more detail)
 120
 121 @b .
 122
 123 matches any single character
 124
 125 @b \k
 126
 127 (where @e k is a non-alphanumeric character)
 128 matches that character taken as an ordinary character, e.g. \\ matches a backslash
 129 character
 130
 131 @b \c
 132
 133 where @e c is alphanumeric (possibly followed by other characters),
an @e escape (AREs only), see @ref overview_resyntax_escapes below
 135
 136 @b {
 137
 138 when followed by a character
 139 other than a digit, matches the left-brace character '@b {'; when followed by
 140 a digit, it is the beginning of a @e bound (see above)
 141
 142 @b x
 143
 144 where @e x is a single
 145 character with no other significance, matches that character.
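
The distinction between capturing and non-capturing parentheses in the list of atoms
above can be illustrated as follows. This is a sketch only: it assumes @c std::regex as a
stand-in for the ARE engine (the two agree on these constructs), and the helper names
@c capture_count and @c captured are hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Number of capturing groups in `pattern`; (?:...) groups are not counted.
// Assumption: std::regex stands in for the ARE engine.
unsigned capture_count(const std::string& pattern)
{
    return std::regex(pattern).mark_count();
}

// Text reported for group `index` when `pattern` matches all of `text`,
// or an empty string if there is no match or no such group.
std::string captured(const std::string& pattern, const std::string& text,
                     std::size_t index)
{
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_match(text, m, re) && index < m.size())
        return m.str(index);
    return std::string();
}
```

Here <tt>(ab)(cd)</tt> reports two subexpressions, while <tt>(?:ab)(cd)</tt> reports only
one, so group 1 of the latter is the text matched by <tt>(cd)</tt>.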
 146
 147 A @e constraint matches an empty string when specific conditions are met. A constraint may
 148 not be followed by a quantifier. The simple constraints are as follows;
some more constraints are described later, under @ref overview_resyntax_escapes.
 150
 151 @b ^
 152
 153 matches at the beginning of a line
 154
 155 @b $
 156
 157 matches at the end of a line
 158
 159 @b (?=re)
 160
 161 @e positive lookahead
 162 (AREs only), matches at any point where a substring matching @e re begins
 163
 164 @b (?!re)
 165
 166 @e negative lookahead (AREs only),
 167 matches at any point where no substring matching @e re begins
 168
 169
 170
 171 The lookahead constraints may not contain back references
 172 (see later), and all parentheses within them are considered non-capturing.
 173 An RE may not end with '@b \'.
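
Lookahead constraints match an empty string, so they test what follows without
consuming it. The sketch below assumes @c std::regex as a stand-in for the ARE engine
(its default grammar accepts the same <tt>(?=re)</tt> and <tt>(?!re)</tt> forms); the helper
names are hypothetical.

```cpp
#include <regex>
#include <string>

// True if `pattern` matches somewhere in `text`.
// Assumption: std::regex stands in for the ARE engine.
bool found(const std::string& pattern, const std::string& text)
{
    return std::regex_search(text, std::regex(pattern));
}

// Text of the first match, or an empty string if there is none.
std::string first_text(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}
```

For example, <tt>foo(?=bar)</tt> is found in "foobar" but the matched text is only "foo",
and <tt>foo(?!bar)</tt> is found in "foobaz" but not in "foobar".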
 174
 175
 176 @section overview_resyntax_bracket Bracket Expressions
 177
 178 A @e bracket expression is a list
 179 of characters enclosed in '@b []'. It normally matches any single character from
 180 the list (but see below). If the list begins with '@b ^', it matches any single
 181 character (but see below) @e not from the rest of the list.
 182 If two characters
 183 in the list are separated by '@b -', this is shorthand for the full @e range of
 184 characters between those two (inclusive) in the collating sequence, e.g.
 185 @b [0-9] in ASCII matches any decimal digit. Two ranges may not share an endpoint,
 186 so e.g. @b a-c-e is illegal. Ranges are very collating-sequence-dependent, and portable
 187 programs should avoid relying on them.
 188 To include a literal @b ] or @b - in the
 189 list, the simplest method is to enclose it in @b [. and @b .] to make it a collating
 190 element (see below). Alternatively, make it the first character (following
 191 a possible '@b ^'), or (AREs only) precede it with '@b \'.
 192 Alternatively, for '@b -', make
 193 it the last character, or the second endpoint of a range. To use a literal
 194 @b - as the first endpoint of a range, make it a collating element or (AREs
 195 only) precede it with '@b \'. With the exception of these, some combinations using
 196 @b [ (see next paragraphs), and escapes, all other special characters lose
 197 their special significance within a bracket expression.
 198 Within a bracket
 199 expression, a collating element (a character, a multi-character sequence
 200 that collates as if it were a single character, or a collating-sequence
 201 name for either) enclosed in @b [. and @b .] stands for the
 202 sequence of characters of that collating element.
@e wxWidgets: Currently no multi-character collating elements are defined.
So in @b [.X.], @e X can either be a single character literal or
the name of a character. For example, @b [[.0.]-[.9.]] and @b [[.zero.]-[.nine.]]
are identical and mean the same as @b [0-9].
See @ref overview_resyntax_characters.
 209 Within a bracket expression, a collating element enclosed in @b [= and @b =]
 210 is an equivalence class, standing for the sequences of characters of all
 211 collating elements equivalent to that one, including itself.
 212 An equivalence class may not be an endpoint of a range.
 213 @e wxWidgets: Currently no equivalence classes are defined, so
 214 @b [=X=] stands for just the single character @e X.
 215 @e X can either be a single character literal or the name of a character,
see @ref overview_resyntax_characters.
 217 Within a bracket expression,
 218 the name of a @e character class enclosed in @b [: and @b :] stands for the list
 219 of all characters (not all collating elements!) belonging to that class.
 220 Standard character classes are:
 221
 222
 223
 224 @b alpha
 225
 226 A letter.
 227
 228 @b upper
 229
 230 An upper-case letter.
 231
 232 @b lower
 233
 234 A lower-case letter.
 235
 236 @b digit
 237
 238 A decimal digit.
 239
 240 @b xdigit
 241
 242 A hexadecimal digit.
 243
 244 @b alnum
 245
 246 An alphanumeric (letter or digit).
 247
 248 @b print
 249
 250 An alphanumeric (same as alnum).
 251
 252 @b blank
 253
 254 A space or tab character.
 255
 256 @b space
 257
 258 A character producing white space in displayed text.
 259
 260 @b punct
 261
 262 A punctuation character.
 263
 264 @b graph
 265
 266 A character with a visible representation.
 267
 268 @b cntrl
 269
 270 A control character.
 271
 272
 273
 274 A character class may not be used as an endpoint of a range.
@e wxWidgets: In a non-Unicode build, these character classifications depend on the
current locale, and correspond to the values returned by the ANSI C 'is'
functions: isalpha, isupper, etc. In Unicode mode they are based on
Unicode classifications, and are not affected by the current locale.

There are two special cases of bracket expressions:
the bracket expressions @b [[:<:]] and @b [[:>:]] are constraints, matching empty
strings at the beginning and end of a word respectively. A word is defined
as a sequence of word characters that is neither preceded nor followed
by word characters. A word character is an @e alnum character or an underscore
(@b _). These special bracket expressions are deprecated; users of AREs should
use constraint escapes instead (see @ref overview_resyntax_escapes below).
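
Ranges, negated lists and named character classes from this section can be exercised
with a short sketch. It assumes @c std::regex as a stand-in for the ARE engine (its
bracket expressions accept the same <tt>[:class:]</tt> notation) and the default "C"
locale; the helper name @c matches_all is hypothetical.

```cpp
#include <regex>
#include <string>

// True if `pattern` matches the whole of `text`.
// Assumption: std::regex stands in for the ARE engine.
bool matches_all(const std::string& pattern, const std::string& text)
{
    return std::regex_match(text, std::regex(pattern));
}
```

With ASCII input, <tt>[0-9]+</tt> and <tt>[[:digit:]]+</tt> accept the same strings,
<tt>[^0-9]+</tt> accepts only strings without digits, and
<tt>[[:upper:]][[:lower:]]+</tt> accepts a capitalised word such as "Regex".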
 286
 287
 288 @section overview_resyntax_escapes Escapes
 289
 290 Escapes (AREs only),
 291 which begin with a @b \ followed by an alphanumeric character, come in several
 292 varieties: character entry, class shorthands, constraint escapes, and back
 293 references. A @b \ followed by an alphanumeric character but not constituting
 294 a valid escape is illegal in AREs. In EREs, there are no escapes: outside
 295 a bracket expression, a @b \ followed by an alphanumeric character merely stands
 296 for that character as an ordinary character, and inside a bracket expression,
 297 @b \ is an ordinary character. (The latter is the one actual incompatibility
 298 between EREs and AREs.)
 299 Character-entry escapes (AREs only) exist to make
 300 it easier to specify non-printing and otherwise inconvenient characters
 301 in REs:
 302
 303
 304
 305 @b \a
 306
 307 alert (bell) character, as in C
 308
 309 @b \b
 310
 311 backspace, as in C
 312
 313 @b \B
 314
 315 synonym
 316 for @b \ to help reduce backslash doubling in some applications where there
 317 are multiple levels of backslash processing
 318
 319 @b \c@e X
 320
 321 (where X is any character)
 322 the character whose low-order 5 bits are the same as those of @e X, and whose
 323 other bits are all zero
 324
 325 @b \e
 326
 327 the character whose collating-sequence name is
 328 '@b ESC', or failing that, the character with octal value 033
 329
 330 @b \f
 331
 332 formfeed, as in C
 333
 334 @b \n
 335
 336 newline, as in C
 337
 338 @b \r
 339
 340 carriage return, as in C
 341
 342 @b \t
 343
 344 horizontal tab, as in C
 345
 346 @b \u@e wxyz
 347
 348 (where @e wxyz is exactly four hexadecimal digits)
 349 the Unicode
 350 character @b U+@e wxyz in the local byte ordering
 351
 352 @b \U@e stuvwxyz
 353
 354 (where @e stuvwxyz is
 355 exactly eight hexadecimal digits) reserved for a somewhat-hypothetical Unicode
 356 extension to 32 bits
 357
 358 @b \v
 359
vertical tab, as in C
 361
 362 @b \x@e hhh
 363
 364 (where
 365 @e hhh is any sequence of hexadecimal digits) the character whose hexadecimal
 366 value is @b 0x@e hhh (a single character no matter how many hexadecimal digits
 367 are used).
 368
 369 @b \0
 370
 371 the character whose value is @b 0
 372
 373 @b \@e xy
 374
 375 (where @e xy is exactly two
 376 octal digits, and is not a @e back reference (see below)) the character whose
 377 octal value is @b 0@e xy
 378
 379 @b \@e xyz
 380
 381 (where @e xyz is exactly three octal digits, and is
 382 not a back reference (see below))
 383 the character whose octal value is @b 0@e xyz
 384
 385
 386
 387 Hexadecimal digits are '@b 0'-'@b 9', '@b a'-'@b f', and '@b A'-'@b F'. Octal
 388 digits are '@b 0'-'@b 7'.
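
The rule given above for <tt>\\cX</tt> (keep the low-order 5 bits of @e X, clear the rest)
amounts to a single bit mask. A minimal sketch, with the hypothetical helper name
@c control_char:

```cpp
// Character denoted by the escape \cX: the low-order 5 bits of X are kept
// and all other bits are cleared.
constexpr char control_char(char x)
{
    return static_cast<char>(x & 0x1F);
}
```

In ASCII this maps 'G' to the bell character (octal 007), 'J' to newline (octal 012) and
'[' to ESC (octal 033).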
 389 The character-entry
 390 escapes are always taken as ordinary characters. For example, @b \135 is @b ] in
 391 ASCII, but @b \135 does not terminate a bracket expression. Beware, however,
 392 that some applications (e.g., C compilers) interpret  such sequences themselves
 393 before the regular-expression package gets to see them, which may require
 394 doubling (quadrupling, etc.) the '@b \'.
 395 Class-shorthand escapes (AREs only) provide
 396 shorthands for certain commonly-used character classes:
 397
 398
 399
 400 @b \d
 401
 402 @b [[:digit:]]
 403
 404 @b \s
 405
 406 @b [[:space:]]
 407
 408 @b \w
 409
 410 @b [[:alnum:]_] (note underscore)
 411
 412 @b \D
 413
 414 @b [^[:digit:]]
 415
 416 @b \S
 417
 418 @b [^[:space:]]
 419
 420 @b \W
 421
 422 @b [^[:alnum:]_] (note underscore)
 423
 424
 425
 426 Within bracket expressions, '@b \d', '@b \s', and
 427 '@b \w' lose their outer brackets, and '@b \D',
 428 '@b \S', and '@b \W' are illegal. (So, for example,
 429 @b [a-c\d] is equivalent to @b [a-c[:digit:]].
 430 Also, @b [a-c\D], which is equivalent to
 431 @b [a-c^[:digit:]], is illegal.)
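
The equivalences between class-shorthand escapes and bracket expressions can be
checked empirically. The sketch assumes @c std::regex as a stand-in for the ARE engine
(it accepts the same shorthands) and ASCII input; the helper name @c agree is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// True if the two patterns accept and reject exactly the same samples
// (each sample must be matched in full).
// Assumption: std::regex stands in for the ARE engine.
bool agree(const std::string& a, const std::string& b,
           const std::vector<std::string>& samples)
{
    const std::regex ra(a), rb(b);
    for (const std::string& s : samples)
        if (std::regex_match(s, ra) != std::regex_match(s, rb))
            return false;
    return true;
}
```

Over a handful of one-character samples, <tt>\\d</tt> agrees with <tt>[[:digit:]]</tt>,
<tt>\\w</tt> with <tt>[[:alnum:]_]</tt>, and <tt>[a-c\\d]</tt> with <tt>[a-c[:digit:]]</tt>.
Agreement on samples is evidence, not proof, of equivalence.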
 432 A constraint escape (AREs only) is a constraint,
 433 matching the empty string if specific conditions are met, written as an
 434 escape:
 435
 436
 437
 438 @b \A
 439
 440 matches only at the beginning of the string
(see @ref overview_resyntax_matching, below,
 442 for how this differs from '@b ^')
 443
 444 @b \m
 445
 446 matches only at the beginning of a word
 447
 448 @b \M
 449
 450 matches only at the end of a word
 451
 452 @b \y
 453
 454 matches only at the beginning or end of a word
 455
 456 @b \Y
 457
 458 matches only at a point that is not the beginning or end of
 459 a word
 460
 461 @b \Z
 462
 463 matches only at the end of the string
(see @ref overview_resyntax_matching, below, for
 465 how this differs from '@b $')
 466
 467 @b \@e m
 468
 469 (where @e m is a nonzero digit) a @e back reference,
 470 see below
 471
 472 @b \@e mnn
 473
 474 (where @e m is a nonzero digit, and @e nn is some more digits,
 475 and the decimal value @e mnn is not greater than the number of closing capturing
 476 parentheses seen so far) a @e back reference, see below
 477
 478
 479
 480 A word is defined
as in the specification of @b [[:<:]] and @b [[:>:]] above. Constraint escapes are
 482 illegal within bracket expressions.
 483 A back reference (AREs only) matches
 484 the same string matched by the parenthesized subexpression specified by
 485 the number, so that (e.g.) @b ([bc])\1 matches @b bb or @b cc but not '@b bc'.
 486 The subexpression
 487 must entirely precede the back reference in the RE. Subexpressions are numbered
 488 in the order of their leading parentheses. Non-capturing parentheses do not
 489 define subexpressions.
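
The back-reference example above can be reproduced directly. This sketch assumes
@c std::regex as a stand-in for the ARE engine (both support <tt>\\1</tt>); the function
name is hypothetical.

```cpp
#include <regex>
#include <string>

// True if `text` is 'b' or 'c' followed by the same character again.
// The back reference \1 must match exactly what group 1 matched.
// Assumption: std::regex stands in for the ARE engine.
bool is_doubled_b_or_c(const std::string& text)
{
    static const std::regex re(R"(([bc])\1)");
    return std::regex_match(text, re);
}
```

The function accepts "bb" and "cc" but rejects "bc", because the back reference repeats
the matched text rather than the subexpression.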
 490 There is an inherent historical ambiguity between
 491 octal character-entry  escapes and back references, which is resolved by
 492 heuristics, as hinted at above. A leading zero always indicates an octal
 493 escape. A single non-zero digit, not followed by another digit, is always
 494 taken as a back reference. A multi-digit sequence not starting with a zero
 495 is taken as a back  reference if it comes after a suitable subexpression
 496 (i.e. the number is in the legal range for a back reference), and otherwise
 497 is taken as octal.
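
The heuristics above can be written down as a small decision function. This is only a
sketch of the rules as stated in this section, not the code used by the library; the
names @c DigitEscape and @c classify_digit_escape are hypothetical, and the function does
not check that a sequence classified as octal contains only octal digits.

```cpp
#include <cstddef>
#include <string>

enum class DigitEscape { Octal, BackReference };

// Classifies the run of decimal digits that follows a backslash.
// `digits` must be non-empty and short enough to fit in an unsigned long;
// `groups_so_far` is the number of capturing groups already closed.
DigitEscape classify_digit_escape(const std::string& digits,
                                  std::size_t groups_so_far)
{
    if (digits[0] == '0')
        return DigitEscape::Octal;          // leading zero: always octal
    if (digits.size() == 1)
        return DigitEscape::BackReference;  // single non-zero digit
    // Several digits, no leading zero: a back reference only if in range.
    const unsigned long value = std::stoul(digits);
    return value <= groups_so_far ? DigitEscape::BackReference
                                  : DigitEscape::Octal;
}
```

So <tt>\\12</tt> is a back reference after twelve or more capturing groups have been
closed, and an octal escape otherwise.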
 498
 499
 500 @section overview_resyntax_metasyntax Metasyntax
 501
 502 In addition to the main syntax described above,
 503 there are some special forms and miscellaneous syntactic facilities available.
 504 Normally the flavor of RE being used is specified by application-dependent
 505 means. However, this can be overridden by a @e director. If an RE of any flavor
 506 begins with '@b ***:', the rest of the RE is an ARE. If an RE of any flavor begins
 507 with '@b ***=', the rest of the RE is taken to be a literal string, with all
 508 characters considered ordinary characters.
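
The director rule can be sketched as a simple prefix check. The code below is an
illustration of the rule as described here, not part of the library; the names
@c Flavor and @c split_director are hypothetical.

```cpp
#include <string>
#include <utility>

enum class Flavor { AsSpecified, Advanced, Literal };

// Splits a leading director off `re`. Returns the flavor it selects and the
// rest of the pattern; without a director the pattern is returned unchanged.
std::pair<Flavor, std::string> split_director(const std::string& re)
{
    if (re.compare(0, 4, "***:") == 0)
        return {Flavor::Advanced, re.substr(4)};
    if (re.compare(0, 4, "***=") == 0)
        return {Flavor::Literal, re.substr(4)};
    return {Flavor::AsSpecified, re};
}
```

For instance, <tt>***=a.b</tt> selects a literal string whose text is <tt>a.b</tt>, in which
the dot is an ordinary character.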
 509 An ARE may begin with @e embedded options: a sequence @b (?xyz)
 510 (where @e xyz is one or more alphabetic characters)
 511 specifies options affecting the rest of the RE. These supplement, and can
 512 override, any options specified by the application. The available option
 513 letters are:
 514
 515
 516
 517 @b b
 518
 519 rest of RE is a BRE
 520
 521 @b c
 522
 523 case-sensitive matching (usual default)
 524
 525 @b e
 526
 527 rest of RE is an ERE
 528
 529 @b i
 530
case-insensitive matching (see @ref overview_resyntax_matching, below)
 532
 533 @b m
 534
 535 historical synonym for @b n
 536
 537 @b n
 538
newline-sensitive matching (see @ref overview_resyntax_matching, below)
 540
 541 @b p
 542
partial newline-sensitive matching (see @ref overview_resyntax_matching, below)
 544
 545 @b q
 546
 547 rest of RE
is a literal ("quoted") string, all ordinary characters
 549
 550 @b s
 551
 552 non-newline-sensitive matching (usual default)
 553
 554 @b t
 555
 556 tight syntax (usual default; see below)
 557
 558 @b w
 559
 560 inverse
partial newline-sensitive ("weird") matching (see @ref overview_resyntax_matching, below)
 562
 563 @b x
 564
 565 expanded syntax (see below)
 566
 567
 568
 569 Embedded options take effect at the @b ) terminating the
 570 sequence. They are available only at the start of an ARE, and may not be
 571 used later within it.
 572 In addition to the usual (@e tight) RE syntax, in which
 573 all characters are significant, there is an @e expanded syntax, available
 574 in AREs with the embedded
 575 x option. In the expanded syntax, white-space characters are ignored and
 576 all characters between a @b # and the following newline (or the end of the
 577 RE) are ignored, permitting paragraphing and commenting a complex RE. There
 578 are three exceptions to that basic rule:
 579
 580
 581 a white-space character or '@b #' preceded
 582 by '@b \' is retained
 583 white space or '@b #' within a bracket expression is retained
 584 white space and comments are illegal within multi-character symbols like
 585 the ARE '@b (?:' or the BRE '@b \('
 586
 587
 588 Expanded-syntax white-space characters are blank,
 589 tab, newline, and any character that belongs to the @e space character class.
 590 Finally, in an ARE, outside bracket expressions, the sequence '@b (?#ttt)' (where
 591 @e ttt is any text not containing a '@b )') is a comment, completely ignored. Again,
 592 this is not allowed between the characters of multi-character symbols like
 593 '@b (?:'. Such comments are more a historical artifact than a useful facility,
 594 and their use is deprecated; use the expanded syntax instead.
 595 @e None of these
 596 metasyntax extensions is available if the application (or an initial @b ***=
 597 director) has specified that the user's input be treated as a literal string
 598 rather than as an RE.
 599
 600
 601 @section overview_resyntax_matching Matching
 602
 603 In the event that an RE could match more than
 604 one substring of a given string, the RE matches the one starting earliest
 605 in the string. If the RE could match more than one substring starting at
 606 that point, its choice is determined by its @e preference: either the longest
 607 substring, or the shortest.
 608 Most atoms, and all constraints, have no preference.
 609 A parenthesized RE has the same preference (possibly none) as the RE. A
 610 quantified atom with quantifier @b {m} or @b {m}? has the same preference (possibly
 611 none) as the atom itself. A quantified atom with other normal quantifiers
 612 (including @b {m,n} with @e m equal to @e n) prefers longest match. A quantified
 613 atom with other non-greedy quantifiers (including @b {m,n}? with @e m equal to
 614 @e n) prefers shortest match. A branch has the same preference as the first
 615 quantified atom in it which has a preference. An RE consisting of two or
 616 more branches connected by the @b | operator prefers longest match.
 617 Subject to the constraints imposed by the rules for matching the whole RE, subexpressions
 618 also match the longest or shortest possible substrings, based on their
 619 preferences, with subexpressions starting earlier in the RE taking priority
 620 over ones starting later. Note that outer subexpressions thus take priority
 621 over their component subexpressions.
 622 Note that the quantifiers @b {1,1} and
 623 @b {1,1}? can be used to force longest and shortest preference, respectively,
 624 on a subexpression or a whole RE.
 625 Match lengths are measured in characters,
 626 not collating elements. An empty string is considered longer than no match
 627 at all. For example, @b bb* matches the three middle characters
 628 of '@b abbbc', @b (week|wee)(night|knights)
 629 matches all ten characters of '@b weeknights', when @b (.*).* is matched against
 630 @b abc the parenthesized subexpression matches all three characters, and when
 631 @b (a*)* is matched against @b bc both the whole RE and the parenthesized subexpression
 632 match an empty string.
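
Two of the examples above can be reproduced with a short sketch. It assumes
@c std::regex as a stand-in for the ARE engine; the helper name @c matched is
hypothetical. The <tt>weeknights</tt> example is deliberately left out: the default
@c std::regex grammar takes the first alternative that works rather than the longest
overall match, so it would stop after "weeknight".

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Text matched by capturing group `group` (0 is the whole match) for the
// first match of `pattern` in `text`; empty if nothing matched.
// Assumption: std::regex stands in for the ARE engine.
std::string matched(const std::string& pattern, const std::string& text,
                    std::size_t group = 0)
{
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_search(text, m, re) && group < m.size())
        return m.str(group);
    return std::string();
}
```

Here <tt>bb*</tt> against "abbbc" gives the three middle characters, and group 1 of
<tt>(.*).*</tt> against "abc" gives all three characters.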
 633 If case-independent matching is specified, the effect
 634 is much as if all case distinctions had vanished from the alphabet. When
 635 an alphabetic that exists in multiple cases appears as an ordinary character
 636 outside a bracket expression, it is effectively transformed into a bracket
 637 expression containing both cases, so that @b x becomes '@b [xX]'. When it appears
 638 inside a bracket expression, all case counterparts of it are added to the
 639 bracket expression, so that @b [x] becomes @b [xX] and @b [^x] becomes '@b [^xX]'.
 640 If newline-sensitive
 641 matching is specified, @b . and bracket expressions using @b ^ will never match
 642 the newline character (so that matches will never cross newlines unless
 643 the RE explicitly arranges it) and @b ^ and @b $ will match the empty string after
 644 and before a newline respectively, in addition to matching at beginning
 645 and end of string respectively. ARE @b \A and @b \Z continue to match beginning
 646 or end of string @e only.
 647 If partial newline-sensitive matching is specified,
 648 this affects @b . and bracket expressions as with newline-sensitive matching,
 649 but not @b ^ and '@b $'.
 650 If inverse partial newline-sensitive matching is specified,
 651 this affects @b ^ and @b $ as with newline-sensitive matching, but not @b . and bracket
 652 expressions. This isn't very useful but is provided for symmetry.
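
The case-independent rule described earlier in this section (an ordinary @b x behaves
like @b [xX], and @b [^x] like @b [^xX]) can be observed with the sketch below. It
assumes @c std::regex with its @c icase flag as a stand-in for the ARE engine; the
helper name @c matches_nocase is hypothetical.

```cpp
#include <regex>
#include <string>

// True if `pattern`, compiled case-insensitively, matches all of `text`.
// Assumption: std::regex with the icase flag stands in for the ARE engine.
bool matches_nocase(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern, std::regex::icase);
    return std::regex_match(text, re);
}
```

With this flag both <tt>x</tt> and <tt>[x]</tt> accept "X", while <tt>[^x]</tt> rejects it.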
 653
 654
 655 @section overview_resyntax_limits Limits and Compatibility
 656
 657 No particular limit is imposed on the length of REs. Programs
 658 intended to be highly portable should not employ REs longer than 256 bytes,
 659 as a POSIX-compliant implementation can refuse to accept such REs.
 660 The only
 661 feature of AREs that is actually incompatible with POSIX EREs is that @b \
 662 does not lose its special significance inside bracket expressions. All other
 663 ARE features use syntax which is illegal or has undefined or unspecified
 664 effects in POSIX EREs; the @b *** syntax of directors likewise is outside
 665 the POSIX syntax for both BREs and EREs.
 666 Many of the ARE extensions are
 667 borrowed from Perl, but some have been changed to clean them up, and a
 668 few Perl extensions are not present. Incompatibilities of note include '@b \b',
 669 '@b \B', the lack of special treatment for a trailing newline, the addition of
 670 complemented bracket expressions to the things affected by newline-sensitive
 671 matching, the restrictions on parentheses and back references in lookahead
 672 constraints, and the longest/shortest-match (rather than first-match) matching
 673 semantics.
 674 The matching rules for REs containing both normal and non-greedy
 675 quantifiers have changed since early beta-test versions of this package.
 676 (The new rules are much simpler and cleaner, but don't work as hard at guessing
 677 the user's real intentions.)
 678 Henry Spencer's original 1986 @e regexp package, still in widespread use,
 679 implemented an early version of today's EREs. There are four incompatibilities between @e regexp's
 680 near-EREs ('RREs' for short) and AREs. In roughly increasing order of significance:
 681
 682 In AREs, @b \ followed by an alphanumeric character is either an escape or
 683 an error, while in RREs, it was just another way of writing the  alphanumeric.
 684 This should not be a problem because there was no reason to write such
 685 a sequence in RREs.
 686 @b { followed by a digit in an ARE is the beginning of
 687 a bound, while in RREs, @b { was always an ordinary character. Such sequences
 688 should be rare, and will often result in an error because following characters
 689 will not look like a valid bound.
 690 In AREs, @b \ remains a special character
 691 within '@b []', so a literal @b \ within @b [] must be
 692 written '@b \\'. @b \\ also gives a literal
 693 @b \ within @b [] in RREs, but only truly paranoid programmers routinely doubled
 694 the backslash.
 695 AREs report the longest/shortest match for the RE, rather
 696 than the first found in a specified search order. This may affect some RREs
 697 which were written in the expectation that the first match would be reported.
 698 (The careful crafting of RREs to optimize the search order for fast matching
 699 is obsolete (AREs examine all possible matches in parallel, and their performance
 700 is largely insensitive to their complexity) but cases where the search
 701 order was exploited to deliberately  find a match which was @e not the longest/shortest
 702 will need rewriting.)
 703
 704
 705 @section overview_resyntax_bre Basic Regular Expressions
 706
 707 BREs differ from EREs in
 708 several respects.  '@b |', '@b +', and @b ? are ordinary characters and there is no equivalent
 709 for their functionality. The delimiters for bounds
 710 are @b \{ and '@b \}', with @b { and
 711 @b } by themselves ordinary characters. The parentheses for nested subexpressions
 712 are @b \( and '@b \)', with @b ( and @b ) by themselves
 713 ordinary characters. @b ^ is an ordinary
 714 character except at the beginning of the RE or the beginning of a parenthesized
 715 subexpression, @b $ is an ordinary character except at the end of the RE or
 716 the end of a parenthesized subexpression, and @b * is an ordinary character
 717 if it appears at the beginning of the RE or the beginning of a parenthesized
 718 subexpression (after a possible leading '@b ^'). Finally, single-digit back references
are available, and @b \< and @b \> are synonyms
for @b [[:<:]] and @b [[:>:]] respectively;
 721 no other escapes are available.
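
The main BRE differences can be tried out with the POSIX basic grammar that
@c std::regex offers through its @c basic flag. This is a sketch under the assumption
that this grammar behaves like the BREs described here for these constructs; the
helper name @c bre_matches is hypothetical.

```cpp
#include <regex>
#include <string>

// True if the basic regular expression `bre` matches all of `text`.
// Assumption: std::regex's POSIX basic grammar stands in for the BRE flavor.
bool bre_matches(const std::string& bre, const std::string& text)
{
    const std::regex re(bre, std::regex::basic);
    return std::regex_match(text, re);
}
```

In this grammar <tt>\\(ab\\)\\{2\\}</tt> matches "abab", whereas <tt>a+</tt> matches only the
two-character string "a+" and <tt>(a)</tt> matches only the literal text "(a)".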
 722
 723
 724 @section overview_resyntax_characters Regular Expression Character Names
 725
 726 Note that the character names are case sensitive.

<table>
<tr><th>Name</th><th>Character</th></tr>
<tr><td>NUL</td><td>'\0'</td></tr>
<tr><td>SOH</td><td>'\001'</td></tr>
<tr><td>STX</td><td>'\002'</td></tr>
<tr><td>ETX</td><td>'\003'</td></tr>
<tr><td>EOT</td><td>'\004'</td></tr>
<tr><td>ENQ</td><td>'\005'</td></tr>
<tr><td>ACK</td><td>'\006'</td></tr>
<tr><td>BEL</td><td>'\007'</td></tr>
<tr><td>alert</td><td>'\007'</td></tr>
<tr><td>BS</td><td>'\010'</td></tr>
<tr><td>backspace</td><td>'\b'</td></tr>
<tr><td>HT</td><td>'\011'</td></tr>
<tr><td>tab</td><td>'\t'</td></tr>
<tr><td>LF</td><td>'\012'</td></tr>
<tr><td>newline</td><td>'\n'</td></tr>
<tr><td>VT</td><td>'\013'</td></tr>
<tr><td>vertical-tab</td><td>'\v'</td></tr>
<tr><td>FF</td><td>'\014'</td></tr>
<tr><td>form-feed</td><td>'\f'</td></tr>
<tr><td>CR</td><td>'\015'</td></tr>
<tr><td>carriage-return</td><td>'\r'</td></tr>
<tr><td>SO</td><td>'\016'</td></tr>
<tr><td>SI</td><td>'\017'</td></tr>
<tr><td>DLE</td><td>'\020'</td></tr>
<tr><td>DC1</td><td>'\021'</td></tr>
<tr><td>DC2</td><td>'\022'</td></tr>
<tr><td>DC3</td><td>'\023'</td></tr>
<tr><td>DC4</td><td>'\024'</td></tr>
<tr><td>NAK</td><td>'\025'</td></tr>
<tr><td>SYN</td><td>'\026'</td></tr>
<tr><td>ETB</td><td>'\027'</td></tr>
<tr><td>CAN</td><td>'\030'</td></tr>
<tr><td>EM</td><td>'\031'</td></tr>
<tr><td>SUB</td><td>'\032'</td></tr>
<tr><td>ESC</td><td>'\033'</td></tr>
<tr><td>IS4</td><td>'\034'</td></tr>
<tr><td>FS</td><td>'\034'</td></tr>
<tr><td>IS3</td><td>'\035'</td></tr>
<tr><td>GS</td><td>'\035'</td></tr>
<tr><td>IS2</td><td>'\036'</td></tr>
<tr><td>RS</td><td>'\036'</td></tr>
<tr><td>IS1</td><td>'\037'</td></tr>
<tr><td>US</td><td>'\037'</td></tr>
<tr><td>space</td><td>' '</td></tr>
<tr><td>exclamation-mark</td><td>'!'</td></tr>
<tr><td>quotation-mark</td><td>'"'</td></tr>
<tr><td>number-sign</td><td>'#'</td></tr>
<tr><td>dollar-sign</td><td>'$'</td></tr>
<tr><td>percent-sign</td><td>'%'</td></tr>
<tr><td>ampersand</td><td>'&amp;'</td></tr>
<tr><td>apostrophe</td><td>'\''</td></tr>
<tr><td>left-parenthesis</td><td>'('</td></tr>
<tr><td>right-parenthesis</td><td>')'</td></tr>
<tr><td>asterisk</td><td>'*'</td></tr>
<tr><td>plus-sign</td><td>'+'</td></tr>
<tr><td>comma</td><td>','</td></tr>
<tr><td>hyphen</td><td>'-'</td></tr>
<tr><td>hyphen-minus</td><td>'-'</td></tr>
<tr><td>period</td><td>'.'</td></tr>
<tr><td>full-stop</td><td>'.'</td></tr>
<tr><td>slash</td><td>'/'</td></tr>
<tr><td>solidus</td><td>'/'</td></tr>
<tr><td>zero</td><td>'0'</td></tr>
<tr><td>one</td><td>'1'</td></tr>
<tr><td>two</td><td>'2'</td></tr>
<tr><td>three</td><td>'3'</td></tr>
<tr><td>four</td><td>'4'</td></tr>
<tr><td>five</td><td>'5'</td></tr>
<tr><td>six</td><td>'6'</td></tr>
<tr><td>seven</td><td>'7'</td></tr>
<tr><td>eight</td><td>'8'</td></tr>
<tr><td>nine</td><td>'9'</td></tr>
<tr><td>colon</td><td>':'</td></tr>
<tr><td>semicolon</td><td>';'</td></tr>
<tr><td>less-than-sign</td><td>'&lt;'</td></tr>
<tr><td>equals-sign</td><td>'='</td></tr>
<tr><td>greater-than-sign</td><td>'&gt;'</td></tr>
<tr><td>question-mark</td><td>'?'</td></tr>
<tr><td>commercial-at</td><td>'@'</td></tr>
<tr><td>left-square-bracket</td><td>'['</td></tr>
<tr><td>backslash</td><td>'\\'</td></tr>
<tr><td>reverse-solidus</td><td>'\\'</td></tr>
<tr><td>right-square-bracket</td><td>']'</td></tr>
<tr><td>circumflex</td><td>'^'</td></tr>
<tr><td>circumflex-accent</td><td>'^'</td></tr>
<tr><td>underscore</td><td>'_'</td></tr>
<tr><td>low-line</td><td>'_'</td></tr>
<tr><td>grave-accent</td><td>'`'</td></tr>
<tr><td>left-brace</td><td>'{'</td></tr>
<tr><td>left-curly-bracket</td><td>'{'</td></tr>
<tr><td>vertical-line</td><td>'|'</td></tr>
<tr><td>right-brace</td><td>'}'</td></tr>
<tr><td>right-curly-bracket</td><td>'}'</td></tr>
<tr><td>tilde</td><td>'~'</td></tr>
<tr><td>DEL</td><td>'\177'</td></tr>
</table>

1774 */
1775