
   1 /////////////////////////////////////////////////////////////////////////////
   2 // Name:        resyn
   3 // Purpose:     topic overview
   4 // Author:      wxWidgets team
   5 // RCS-ID:      $Id$
   6 // Licence:     wxWindows license
   7 /////////////////////////////////////////////////////////////////////////////
   8
   9 /*!
  10
  11  @page resyn_overview Syntax of the builtin regular expression library
  12
 A @e regular expression describes strings of characters. It is a
 pattern that matches certain strings and does not match others.

 @b See also

 @li wxRegEx
 @li @ref differentflavors
 @li @ref resyntax
 @li @ref wxresynbracket
 @li @ref wxresynescapes
 @li @ref remetasyntax
 @li @ref wxresynmatching
 @li @ref relimits
 @li @ref wxresynbre
 @li @ref wxresynchars

  26
  27
  28  @section differentflavors Different Flavors of REs
  29
  30  @ref resyn_overview
 Regular expressions ("REs"), as defined by POSIX, come in two
 flavors: @e extended REs ("EREs") and @e basic REs ("BREs"). EREs are roughly those
 of the traditional @e egrep, while BREs are roughly those of the traditional
 @e ed. This implementation adds a third flavor, @e advanced REs ("AREs"), basically
 EREs with some significant extensions.
 This manual page primarily describes
 AREs. BREs mostly exist for backward compatibility in some old programs;
 they are discussed at the end, in @ref wxresynbre. POSIX EREs are almost an exact subset
 of AREs. Features of AREs that are not present in EREs are indicated.
  40
  41  @section resyntax Regular Expression Syntax
  42
  43  @ref resyn_overview
  44  These regular expressions are implemented using
  45  the package written by Henry Spencer, based on the 1003.2 spec and some
  46  (not quite all) of the Perl5 extensions (thanks, Henry!).  Much of the description
  47  of regular expressions below is copied verbatim from his manual entry.
  48  An ARE is one or more @e branches, separated by '@b |', matching anything that matches
  49  any of the branches.
  50  A branch is zero or more @e constraints or @e quantified
  51  atoms, concatenated. It matches a match for the first, followed by a match
  52  for the second, etc; an empty branch matches the empty string.
  53  A quantified atom is an @e atom possibly followed by a single @e quantifier. Without a quantifier,
  54  it matches a match for the atom. The quantifiers, and what a so-quantified
  55  atom matches, are:

 @li @b * : a sequence of 0 or more matches of the atom
 @li @b + : a sequence of 1 or more matches of the atom
 @li @b ? : a sequence of 0 or 1 matches of the atom
 @li @b {m} : a sequence of exactly @e m matches of the atom
 @li @b {m,} : a sequence of @e m or more matches of the atom
 @li @b {m,n} : a sequence of @e m through @e n (inclusive)
     matches of the atom; @e m may not exceed @e n
 @li @b *?  +?  ??  {m}?  {m,}?  {m,n}? : @e non-greedy quantifiers,
     which match the same possibilities, but prefer the
     smallest number rather than the largest number of matches (see @ref wxresynmatching)

 142  The forms using @b { and @b } are known as @e bounds. The numbers @e m and @e n are unsigned
 143  decimal integers with permissible values from 0 to 255 inclusive.
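
 As a rough illustration of the quantifiers and bounds above, the following
 sketch uses the POSIX regex.h interface in ERE mode rather than wxRegEx, on
 the assumption that the two agree for these constructs. The non-greedy forms
 are ARE-only and are not shown. MatchesWhole is a hypothetical helper, not a
 wxWidgets function.

```cpp
#include <regex.h>
#include <string>

// Returns true if the whole of `text` matches the POSIX ERE `pattern`.
// Hypothetical helper for illustration; not part of wxWidgets.
bool MatchesWhole(const std::string& pattern, const std::string& text)
{
    regex_t re;
    // Anchor the pattern so that only a match of the entire string counts.
    const std::string anchored = "^(" + pattern + ")$";
    if (regcomp(&re, anchored.c_str(), REG_EXTENDED | REG_NOSUB) != 0)
        return false;   // the pattern failed to compile
    const bool ok = regexec(&re, text.c_str(), 0, nullptr, 0) == 0;
    regfree(&re);
    return ok;
}
```

 For instance, MatchesWhole("a{1,3}", "aaa") should hold while
 MatchesWhole("a{1,3}", "aaaa") should not, and a bound such as a{3,1}, where
 m exceeds n, is expected to be rejected when the pattern is compiled.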
 144  An atom is one of:

 @li @b (re) : (where @e re is any regular expression) matches a match for
     @e re, with the match noted for possible reporting
 @li @b (?:re) : as previous, but
     does no reporting (a "non-capturing" set of parentheses)
 @li @b () : matches an empty string, noted for possible reporting
 @li @b (?:) : matches an empty string, without reporting
 @li @b [chars] : a @e bracket expression, matching any one of the @e chars
     (see @ref wxresynbracket for more detail)
 @li @b . : matches any single character
 @li @b \k : (where @e k is a non-alphanumeric character)
     matches that character taken as an ordinary character, e.g. \\ matches a backslash
     character
 @li @b \c : where @e c is alphanumeric (possibly followed by other characters),
     an @e escape (AREs only), see @ref wxresynescapes
 @li @b { : when followed by a character
     other than a digit, matches the left-brace character '@b {'; when followed by
     a digit, it is the beginning of a @e bound (see above)
 @li @b x : where @e x is a single
     character with no other significance, matches that character

 A @e constraint matches an empty string when specific conditions are met. A constraint may
 not be followed by a quantifier. The simple constraints are as follows;
 some more constraints are described later, under @ref wxresynescapes.

 @li @b ^ : matches at the beginning of a line
 @li @b $ : matches at the end of a line
 @li @b (?=re) : @e positive lookahead
     (AREs only), matches at any point where a substring matching @e re begins
 @li @b (?!re) : @e negative lookahead (AREs only),
     matches at any point where no substring matching @e re begins

 326  The lookahead constraints may not contain back references
 327  (see later), and all parentheses within them are considered non-capturing.
 328  An RE may not end with '@b \'.
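
 The difference between capturing and non-capturing parentheses, and the
 behaviour of the lookahead constraints, can be sketched with std::regex.
 This is an assumption-laden stand-in: std::regex uses the ECMAScript grammar,
 which shares these particular constructs with AREs but differs in other
 details. CaptureCount and FindAt are hypothetical helper names.

```cpp
#include <regex>
#include <string>

// Number of capturing ("reporting") groups in `pattern`.
unsigned CaptureCount(const std::string& pattern)
{
    return std::regex(pattern).mark_count();
}

// Offset of the first match of `pattern` in `text`, or -1 if none.
long FindAt(const std::string& pattern, const std::string& text)
{
    std::smatch m;
    const std::regex re(pattern);
    if (!std::regex_search(text, m, re))
        return -1;
    return static_cast<long>(m.position(0));
}
```

 Here CaptureCount("(a)(?:b)(c)") counts only the two capturing groups, and
 FindAt("q(?!u)", "quit qatar") skips the first q because it is followed by u.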
 329
 330  @section wxresynbracket Bracket Expressions
 331
 332  @ref resyn_overview
 333  A @e bracket expression is a list
 334  of characters enclosed in '@b []'. It normally matches any single character from
 335  the list (but see below). If the list begins with '@b ^', it matches any single
 336  character (but see below) @e not from the rest of the list.
 337  If two characters
 338  in the list are separated by '@b -', this is shorthand for the full @e range of
 339  characters between those two (inclusive) in the collating sequence, e.g.
 340   @b [0-9] in ASCII matches any decimal digit. Two ranges may not share an endpoint,
 341  so e.g. @b a-c-e is illegal. Ranges are very collating-sequence-dependent, and portable
 342  programs should avoid relying on them.
 343  To include a literal @b ] or @b - in the
 344  list, the simplest method is to enclose it in @b [. and @b .] to make it a collating
 345  element (see below). Alternatively, make it the first character (following
 346  a possible '@b ^'), or (AREs only) precede it with '@b \'.
 347  Alternatively, for '@b -', make
 348  it the last character, or the second endpoint of a range. To use a literal
 349   @b - as the first endpoint of a range, make it a collating element or (AREs
 350  only) precede it with '@b \'. With the exception of these, some combinations using
 351   @b [ (see next paragraphs), and escapes, all other special characters lose
 352  their special significance within a bracket expression.
 353  Within a bracket
 354  expression, a collating element (a character, a multi-character sequence
 355  that collates as if it were a single character, or a collating-sequence
 356  name for either) enclosed in @b [. and @b .] stands for the
 357  sequence of characters of that collating element.
 @e wxWidgets: Currently no multi-character collating elements are defined.
 So in @b [.X.], @e X can be either a single character literal or
 the name of a character. For example, @b [[.0.]-[.9.]] and
 @b [[.zero.]-[.nine.]] are identical, and both mean the same as
 @b [0-9].
 See @ref wxresynchars.
 364  Within a bracket expression, a collating element enclosed in @b [= and @b =]
 365  is an equivalence class, standing for the sequences of characters of all
 366  collating elements equivalent to that one, including itself.
 367  An equivalence class may not be an endpoint of a range.
 368  @e wxWidgets: Currently no equivalence classes are defined, so
 369  @b [=X=] stands for just the single character @e X.
 @e X can be either a single character literal or the name of a character,
 see @ref wxresynchars.
 372  Within a bracket expression,
 373  the name of a @e character class enclosed in @b [: and @b :] stands for the list
 374  of all characters (not all collating elements!) belonging to that class.
 375  Standard character classes are:

 @li @b alpha : A letter.
 @li @b upper : An upper-case letter.
 @li @b lower : A lower-case letter.
 @li @b digit : A decimal digit.
 @li @b xdigit : A hexadecimal digit.
 @li @b alnum : An alphanumeric (letter or digit).
 @li @b print : An alphanumeric (same as alnum).
 @li @b blank : A space or tab character.
 @li @b space : A character producing white space in displayed text.
 @li @b punct : A punctuation character.
 @li @b graph : A character with a visible representation.
 @li @b cntrl : A control character.

 514  A character class may not be used as an endpoint of a range.
 @e wxWidgets: In a non-Unicode build, these character classifications depend on the
 current locale, and correspond to the values returned by the ANSI C 'is'
 functions: isalpha, isupper, etc. In Unicode mode they are based on
 Unicode classifications, and are not affected by the current locale.
 There are two special cases of bracket expressions:
 the bracket expressions @b [[:<:]] and @b [[:>:]] are constraints, matching empty
 strings at the beginning and end of a word respectively. A word is defined
 as a sequence of word characters that is neither preceded nor followed
 by word characters. A word character is an @e alnum character or an underscore
 (@b _). These special bracket expressions are deprecated; users of AREs should
 use constraint escapes instead (see @ref wxresynescapes).
 526
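
 A few of the bracket-expression rules above can be tried out with the POSIX
 regex.h interface in ERE mode. This is only a sketch under the assumption
 that POSIX EREs and AREs agree on these rules (the backslash forms do not
 carry over, as noted above); FirstMatch is a hypothetical helper and the
 default "C" locale is assumed.

```cpp
#include <regex.h>
#include <string>

// Returns the text of the match regexec() reports for the POSIX ERE
// `pattern` in `text`, or an empty string if nothing matches.
std::string FirstMatch(const std::string& pattern, const std::string& text)
{
    regex_t re;
    if (regcomp(&re, pattern.c_str(), REG_EXTENDED) != 0)
        return std::string();   // the pattern failed to compile
    regmatch_t m[1];
    std::string result;
    if (regexec(&re, text.c_str(), 1, m, 0) == 0)
        result = text.substr(m[0].rm_so, m[0].rm_eo - m[0].rm_so);
    regfree(&re);
    return result;
}
```

 For example, the list []a-c-] contains a literal ] (first position), the
 range a-c, and a literal - (last position), so
 FirstMatch("[]a-c-]+", "x]-b[") picks out the three characters "]-b".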
 527  @section wxresynescapes Escapes
 528
 529  @ref resyn_overview
 530  Escapes (AREs only),
 531  which begin with a @b \ followed by an alphanumeric character, come in several
 532  varieties: character entry, class shorthands, constraint escapes, and back
 533  references. A @b \ followed by an alphanumeric character but not constituting
 534  a valid escape is illegal in AREs. In EREs, there are no escapes: outside
 535  a bracket expression, a @b \ followed by an alphanumeric character merely stands
 536  for that character as an ordinary character, and inside a bracket expression,
 537   @b \ is an ordinary character. (The latter is the one actual incompatibility
 538  between EREs and AREs.)
 539  Character-entry escapes (AREs only) exist to make
 540  it easier to specify non-printing and otherwise inconvenient characters
 541  in REs:

 @li @b \a : alert (bell) character, as in C
 @li @b \b : backspace, as in C
 @li @b \B : synonym
     for @b \ to help reduce backslash doubling in some applications where there
     are multiple levels of backslash processing
 @li @b \c@e X : (where @e X is any character)
     the character whose low-order 5 bits are the same as those of @e X, and whose
     other bits are all zero
 @li @b \e : the character whose collating-sequence name is
     '@b ESC', or failing that, the character with octal value 033
 @li @b \f : formfeed, as in C
 @li @b \n : newline, as in C
 @li @b \r : carriage return, as in C
 @li @b \t : horizontal tab, as in C
 @li @b \u@e wxyz : (where @e wxyz is exactly four hexadecimal digits)
     the Unicode
     character @b U+@e wxyz in the local byte ordering
 @li @b \U@e stuvwxyz : (where @e stuvwxyz is
     exactly eight hexadecimal digits) reserved for a somewhat-hypothetical Unicode
     extension to 32 bits
 @li @b \v : vertical tab, as in C
 @li @b \x@e hhh : (where
     @e hhh is any sequence of hexadecimal digits) the character whose hexadecimal
     value is @b 0x@e hhh (a single character no matter how many hexadecimal digits
     are used)
 @li @b \0 : the character whose value is @b 0
 @li @b \@e xy : (where @e xy is exactly two
     octal digits, and is not a @e back reference (see below)) the character whose
     octal value is @b 0@e xy
 @li @b \@e xyz : (where @e xyz is exactly three octal digits, and is
     not a back reference (see below))
     the character whose octal value is @b 0@e xyz

 738
 739
 740  Hexadecimal digits are '@b 0'-'@b 9', '@b a'-'@b f', and '@b A'-'@b F'. Octal
 741  digits are '@b 0'-'@b 7'.
 742  The character-entry
 743  escapes are always taken as ordinary characters. For example, @b \135 is @b ] in
 744  ASCII, but @b \135 does not terminate a bracket expression. Beware, however,
 745  that some applications (e.g., C compilers) interpret  such sequences themselves
 746  before the regular-expression package gets to see them, which may require
 747  doubling (quadrupling, etc.) the '@b \'.
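
 Two of the points above lend themselves to a small worked example: the
 arithmetic behind the control-character escape, and the backslash doubling
 needed when an RE is written as a C++ string literal. The names
 ControlEscapeValue, kEscaped and kRaw are hypothetical and exist only for
 this sketch.

```cpp
#include <string>

// Value denoted by the control-character escape for X:
// keep the low-order 5 bits of X and clear all other bits.
constexpr unsigned char ControlEscapeValue(char x)
{
    return static_cast<unsigned char>(x) & 0x1F;
}

// The same four-character RE text written two ways in C++ source.
const std::string kEscaped = "\\135";    // backslash doubled for the compiler
const std::string kRaw     = R"(\135)";  // raw string literal, no doubling
```

 With ASCII, ControlEscapeValue('[') is 0x1B (octal 033, the ESC character)
 and ControlEscapeValue('J') is 0x0A (newline).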
 748  Class-shorthand escapes (AREs only) provide
 749  shorthands for certain commonly-used character classes:

 @li @b \d : @b [[:digit:]]
 @li @b \s : @b [[:space:]]
 @li @b \w : @b [[:alnum:]_] (note underscore)
 @li @b \D : @b [^[:digit:]]
 @li @b \S : @b [^[:space:]]
 @li @b \W : @b [^[:alnum:]_] (note underscore)

 820
 821
 822  Within bracket expressions, '@b \d', '@b \s', and
 823  '@b \w' lose their outer brackets, and '@b \D',
 824  '@b \S', and '@b \W' are illegal. (So, for example,
 825   @b [a-c\d] is equivalent to @b [a-c[:digit:]].
 826  Also, @b [a-c\D], which is equivalent to
 827   @b [a-c^[:digit:]], is illegal.)
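
 The equivalences above can be checked informally with std::regex, whose
 ECMAScript grammar happens to accept both the shorthands and the
 bracketed class names. This is an assumption of convenience, not a
 statement about how wxRegEx is implemented, and FullMatch is a
 hypothetical helper.

```cpp
#include <regex>
#include <string>

// True if the whole of `text` matches `pattern` (std::regex, ECMAScript).
bool FullMatch(const std::string& pattern, const std::string& text)
{
    return std::regex_match(text, std::regex(pattern));
}
```

 Note the doubled backslashes when calling it from C++ source, for example
 FullMatch("[a-c\\d]+", "a1b2c3") alongside
 FullMatch("[a-c[:digit:]]+", "a1b2c3").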
 828  A constraint escape (AREs only) is a constraint,
 829  matching the empty string if specific conditions are met, written as an
 830  escape:

 @li @b \A : matches only at the beginning of the string
     (see @ref wxresynmatching
     for how this differs from '@b ^')
 @li @b \m : matches only at the beginning of a word
 @li @b \M : matches only at the end of a word
 @li @b \y : matches only at the beginning or end of a word
 @li @b \Y : matches only at a point that is not the beginning or end of
     a word
 @li @b \Z : matches only at the end of the string
     (see @ref wxresynmatching for
     how this differs from '@b $')
 @li @b \@e m : (where @e m is a nonzero digit) a @e back reference,
     see below
 @li @b \@e mnn : (where @e m is a nonzero digit, and @e nn is some more digits,
     and the decimal value @e mnn is not greater than the number of closing capturing
     parentheses seen so far) a @e back reference, see below

 931
 932
 A word is defined
 as in the specification of @b [[:<:]] and @b [[:>:]] above. Constraint escapes are
 illegal within bracket expressions.
 936  A back reference (AREs only) matches
 937  the same string matched by the parenthesized subexpression specified by
 938  the number, so that (e.g.) @b ([bc])\1 matches @b bb or @b cc but not '@b bc'.
 939  The subexpression
 940  must entirely precede the back reference in the RE. Subexpressions are numbered
 941  in the order of their leading parentheses. Non-capturing parentheses do not
 942  define subexpressions.
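
 Back references and the numbering of subexpressions can be illustrated with
 std::regex, on the assumption that its ECMAScript grammar treats these two
 points the same way as AREs. Groups is a hypothetical helper that returns
 the text of each numbered subexpression when the whole input matches.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Returns the text of subexpressions 1..N if `pattern` matches all of
// `text`, or an empty vector if it does not match.
std::vector<std::string> Groups(const std::string& pattern,
                                const std::string& text)
{
    std::vector<std::string> out;
    std::smatch m;
    const std::regex re(pattern);
    if (std::regex_match(text, m, re))
        for (std::size_t i = 1; i < m.size(); ++i)
            out.push_back(m.str(i));
    return out;
}
```

 With the pattern ((a)(b)) the outer group is number 1 because its leading
 parenthesis comes first, and with (?:a)(b) the only numbered group is the
 second pair of parentheses.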
 943  There is an inherent historical ambiguity between
 944  octal character-entry  escapes and back references, which is resolved by
 945  heuristics, as hinted at above. A leading zero always indicates an octal
 946  escape. A single non-zero digit, not followed by another digit, is always
 947  taken as a back reference. A multi-digit sequence not starting with a zero
 948  is taken as a back  reference if it comes after a suitable subexpression
 949  (i.e. the number is in the legal range for a back reference), and otherwise
 950  is taken as octal.
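
 The disambiguation rules just described can be written down as a small
 decision function. This is a sketch of the stated rules only, not code taken
 from the regex package: ClassifyDigitEscape and DigitEscape are hypothetical
 names, the input is assumed to be a non-empty run of decimal digits, and no
 check is made that an "octal" result really consists of octal digits.

```cpp
#include <string>

enum class DigitEscape { Octal, BackReference };

// `digits` is the digit run following the backslash; `groupsSoFar` is the
// number of capturing groups already closed at that point in the RE.
DigitEscape ClassifyDigitEscape(const std::string& digits,
                                unsigned groupsSoFar)
{
    if (digits.empty() || digits[0] == '0')
        return DigitEscape::Octal;            // a leading zero is always octal
    if (digits.size() == 1)
        return DigitEscape::BackReference;    // a lone nonzero digit
    unsigned long value = 0;
    for (char d : digits) {
        value = value * 10 + static_cast<unsigned long>(d - '0');
        if (value > groupsSoFar)
            return DigitEscape::Octal;        // out of range for a back reference
    }
    return DigitEscape::BackReference;
}
```

 So a two-digit sequence such as 12 is a back reference only once at least
 twelve capturing groups have been closed, and is otherwise read as octal.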
 951
 952  @section remetasyntax Metasyntax
 953
 954  @ref resyn_overview
 955  In addition to the main syntax described above,
 956  there are some special forms and miscellaneous syntactic facilities available.
 957  Normally the flavor of RE being used is specified by application-dependent
 958  means. However, this can be overridden by a @e director. If an RE of any flavor
 959  begins with '@b ***:', the rest of the RE is an ARE. If an RE of any flavor begins
 960  with '@b ***=', the rest of the RE is taken to be a literal string, with all
 961  characters considered ordinary characters.
 962  An ARE may begin with @e embedded options: a sequence @b (?xyz)
 963  (where @e xyz is one or more alphabetic characters)
 964  specifies options affecting the rest of the RE. These supplement, and can
 965  override, any options specified by the application. The available option
 966  letters are:

 @li @b b : rest of RE is a BRE
 @li @b c : case-sensitive matching (usual default)
 @li @b e : rest of RE is an ERE
 @li @b i : case-insensitive matching (see @ref wxresynmatching)
 @li @b m : historical synonym for @b n
 @li @b n : newline-sensitive matching (see @ref wxresynmatching)
 @li @b p : partial newline-sensitive matching (see @ref wxresynmatching)
 @li @b q : rest of RE
     is a literal ("quoted") string, all ordinary characters
 @li @b s : non-newline-sensitive matching (usual default)
 @li @b t : tight syntax (usual default; see below)
 @li @b w : inverse
     partial newline-sensitive ("weird") matching (see @ref wxresynmatching)
 @li @b x : expanded syntax (see below)

1105
1106
1107  Embedded options take effect at the @b ) terminating the
1108  sequence. They are available only at the start of an ARE, and may not be
1109  used later within it.
1110  In addition to the usual (@e tight) RE syntax, in which
1111  all characters are significant, there is an @e expanded syntax, available
1112  in AREs with the embedded
1113  x option. In the expanded syntax, white-space characters are ignored and
1114  all characters between a @b # and the following newline (or the end of the
1115  RE) are ignored, permitting paragraphing and commenting a complex RE. There
1116  are three exceptions to that basic rule:

 @li a white-space character or '@b #' preceded
     by '@b \' is retained
 @li white space or '@b #' within a bracket expression is retained
 @li white space and comments are illegal within multi-character symbols like
     the ARE '@b (?:' or the BRE '@b \('

1126  Expanded-syntax white-space characters are blank,
1127  tab, newline, and any character that belongs to the @e space character class.
1128  Finally, in an ARE, outside bracket expressions, the sequence '@b (?#ttt)' (where
1129   @e ttt is any text not containing a '@b )') is a comment, completely ignored. Again,
1130  this is not allowed between the characters of multi-character symbols like
1131  '@b (?:'. Such comments are more a historical artifact than a useful facility,
1132  and their use is deprecated; use the expanded syntax instead.
1133  @e None of these
1134  metasyntax extensions is available if the application (or an initial @b ***=
1135  director) has specified that the user's input be treated as a literal string
1136  rather than as an RE.
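
 To make the expanded-syntax rules concrete, the following sketch rewrites an
 expanded-syntax RE as the equivalent tight-syntax text. It is a hypothetical
 helper written for this overview, not the library's own code: it applies the
 first two exceptions (escaped characters and bracket expressions are kept),
 does not diagnose the third, and does not handle (?#ttt) comments.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Removes insignificant white space and # comments from `re`.
std::string StripExpanded(const std::string& re)
{
    std::string out;
    std::size_t i = 0;
    const std::size_t n = re.size();
    while (i < n) {
        const char c = re[i];
        if (c == '\\' && i + 1 < n) {
            // A backslash-escaped character is always retained.
            out += c;
            out += re[i + 1];
            i += 2;
        } else if (c == '[') {
            // Copy a bracket expression through unchanged.
            out += re[i++];
            if (i < n && re[i] == '^') out += re[i++];
            if (i < n && re[i] == ']') out += re[i++];  // leading ']' is literal
            while (i < n && re[i] != ']') {
                if (re[i] == '\\' && i + 1 < n) {
                    out += re[i++];                     // escape inside brackets
                    out += re[i++];
                } else if (re[i] == '[' && i + 1 < n &&
                           (re[i + 1] == ':' || re[i + 1] == '.' ||
                            re[i + 1] == '=')) {
                    // Nested class, collating element or equivalence class:
                    // copy up to and including its closing delimiter.
                    const char kind = re[i + 1];
                    out += re[i++];
                    out += re[i++];
                    while (i < n &&
                           !(re[i] == kind && i + 1 < n && re[i + 1] == ']'))
                        out += re[i++];
                    if (i < n) { out += re[i++]; out += re[i++]; }
                } else {
                    out += re[i++];
                }
            }
            if (i < n) out += re[i++];                  // closing ']'
        } else if (c == '#') {
            // Comment: skip up to (not including) the next newline.
            while (i < n && re[i] != '\n') ++i;
        } else if (std::isspace(static_cast<unsigned char>(c))) {
            ++i;                                        // insignificant space
        } else {
            out += c;
            ++i;
        }
    }
    return out;
}
```

 Given an input spread over two lines with a trailing comment on the first,
 the result is a single compact RE in which only escaped white space and the
 contents of bracket expressions survive untouched.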
1137
1138  @section wxresynmatching Matching
1139
1140  @ref resyn_overview
1141  In the event that an RE could match more than
1142  one substring of a given string, the RE matches the one starting earliest
1143  in the string. If the RE could match more than one substring starting at
1144  that point, its choice is determined by its @e preference: either the longest
1145  substring, or the shortest.
1146  Most atoms, and all constraints, have no preference.
1147  A parenthesized RE has the same preference (possibly none) as the RE. A
1148  quantified atom with quantifier @b {m} or @b {m}? has the same preference (possibly
1149  none) as the atom itself. A quantified atom with other normal quantifiers
1150  (including @b {m,n} with @e m equal to @e n) prefers longest match. A quantified
1151  atom with other non-greedy quantifiers (including @b {m,n}? with @e m equal to
1152   @e n) prefers shortest match. A branch has the same preference as the first
1153  quantified atom in it which has a preference. An RE consisting of two or
1154  more branches connected by the @b | operator prefers longest match.
1155  Subject to the constraints imposed by the rules for matching the whole RE, subexpressions
1156  also match the longest or shortest possible substrings, based on their
1157  preferences, with subexpressions starting earlier in the RE taking priority
1158  over ones starting later. Note that outer subexpressions thus take priority
1159  over their component subexpressions.
1160  Note that the quantifiers @b {1,1} and
1161   @b {1,1}? can be used to force longest and shortest preference, respectively,
1162  on a subexpression or a whole RE.
1163  Match lengths are measured in characters,
1164  not collating elements. An empty string is considered longer than no match
1165  at all. For example, @b bb* matches the three middle characters
1166  of '@b abbbc', @b (week|wee)(night|knights)
1167  matches all ten characters of '@b weeknights', when @b (.*).* is matched against
1168   @b abc the parenthesized subexpression matches all three characters, and when
1169   @b (a*)* is matched against @b bc both the whole RE and the parenthesized subexpression
1170  match an empty string.
1171  If case-independent matching is specified, the effect
1172  is much as if all case distinctions had vanished from the alphabet. When
1173  an alphabetic that exists in multiple cases appears as an ordinary character
1174  outside a bracket expression, it is effectively transformed into a bracket
1175  expression containing both cases, so that @b x becomes '@b [xX]'. When it appears
1176  inside a bracket expression, all case counterparts of it are added to the
1177  bracket expression, so that @b [x] becomes @b [xX] and @b [^x] becomes '@b [^xX]'.
1178  If newline-sensitive
1179  matching is specified, @b . and bracket expressions using @b ^ will never match
1180  the newline character (so that matches will never cross newlines unless
1181  the RE explicitly arranges it) and @b ^ and @b $ will match the empty string after
1182  and before a newline respectively, in addition to matching at beginning
1183  and end of string respectively. ARE @b \A and @b \Z continue to match beginning
1184  or end of string @e only.
1185  If partial newline-sensitive matching is specified,
1186  this affects @b . and bracket expressions as with newline-sensitive matching,
1187  but not @b ^ and '@b $'.
1188  If inverse partial newline-sensitive matching is specified,
1189  this affects @b ^ and @b $ as with newline-sensitive matching, but not @b . and bracket
1190  expressions. This isn't very useful but is provided for symmetry.
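
 Some of the matching rules above coincide with POSIX behaviour and can be
 observed through regex.h in ERE mode. The sketch assumes that POSIX
 leftmost-longest matching, REG_ICASE and REG_NEWLINE correspond to the
 default preference, case-independent matching and newline-sensitive matching
 described here; MatchSpan is a hypothetical helper.

```cpp
#include <regex.h>
#include <string>
#include <utility>

// Offsets [start, end) of the match regexec() reports for the ERE
// `pattern` in `text`, or {-1, -1} if there is no match.
// `extraFlags` may add REG_ICASE or REG_NEWLINE.
std::pair<long, long> MatchSpan(const std::string& pattern,
                                const std::string& text,
                                int extraFlags = 0)
{
    std::pair<long, long> span(-1, -1);
    regex_t re;
    if (regcomp(&re, pattern.c_str(), REG_EXTENDED | extraFlags) != 0)
        return span;    // the pattern failed to compile
    regmatch_t m[1];
    if (regexec(&re, text.c_str(), 1, m, 0) == 0)
        span = std::make_pair(static_cast<long>(m[0].rm_so),
                              static_cast<long>(m[0].rm_eo));
    regfree(&re);
    return span;
}
```

 The span for (week|wee)(night|knights) against weeknights should cover all
 ten characters, which is the longest-match behaviour described above rather
 than the first-alternative behaviour of Perl-style engines.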
1191
1192  @section relimits Limits And Compatibility
1193
1194  @ref resyn_overview
1195  No particular limit is imposed on the length of REs. Programs
1196  intended to be highly portable should not employ REs longer than 256 bytes,
1197  as a POSIX-compliant implementation can refuse to accept such REs.
1198  The only
1199  feature of AREs that is actually incompatible with POSIX EREs is that @b \
1200  does not lose its special significance inside bracket expressions. All other
1201  ARE features use syntax which is illegal or has undefined or unspecified
1202  effects in POSIX EREs; the @b *** syntax of directors likewise is outside
1203  the POSIX syntax for both BREs and EREs.
1204  Many of the ARE extensions are
1205  borrowed from Perl, but some have been changed to clean them up, and a
1206  few Perl extensions are not present. Incompatibilities of note include '@b \b',
1207  '@b \B', the lack of special treatment for a trailing newline, the addition of
1208  complemented bracket expressions to the things affected by newline-sensitive
1209  matching, the restrictions on parentheses and back references in lookahead
1210  constraints, and the longest/shortest-match (rather than first-match) matching
1211  semantics.
1212  The matching rules for REs containing both normal and non-greedy
1213  quantifiers have changed since early beta-test versions of this package.
1214  (The new rules are much simpler and cleaner, but don't work as hard at guessing
1215  the user's real intentions.)
1216  Henry Spencer's original 1986 @e regexp package, still in widespread use,
1217  implemented an early version of today's EREs. There are four incompatibilities between @e regexp's
1218  near-EREs ('RREs' for short) and AREs. In roughly increasing order of significance:

 @li In AREs, @b \ followed by an alphanumeric character is either an escape or
     an error, while in RREs, it was just another way of writing the alphanumeric.
     This should not be a problem because there was no reason to write such
     a sequence in RREs.
 @li @b { followed by a digit in an ARE is the beginning of
     a bound, while in RREs, @b { was always an ordinary character. Such sequences
     should be rare, and will often result in an error because following characters
     will not look like a valid bound.
 @li In AREs, @b \ remains a special character
     within '@b []', so a literal @b \ within @b [] must be
     written '@b \\'. @b \\ also gives a literal
     @b \ within @b [] in RREs, but only truly paranoid programmers routinely doubled
     the backslash.
 @li AREs report the longest/shortest match for the RE, rather
     than the first found in a specified search order. This may affect some RREs
     which were written in the expectation that the first match would be reported.
     (The careful crafting of RREs to optimize the search order for fast matching
     is obsolete, since AREs examine all possible matches in parallel and their performance
     is largely insensitive to their complexity, but cases where the search
     order was exploited to deliberately find a match which was @e not the longest/shortest
     will need rewriting.)
1242
1243
1244
1245  @section wxresynbre Basic Regular Expressions
1246
1247  @ref resyn_overview
1248  BREs differ from EREs in
1249  several respects.  '@b |', '@b +', and @b ? are ordinary characters and there is no equivalent
1250  for their functionality. The delimiters for bounds
1251  are @b \{ and '@b \}', with @b { and
1252   @b } by themselves ordinary characters. The parentheses for nested subexpressions
1253  are @b \( and '@b \)', with @b ( and @b ) by themselves
1254  ordinary characters. @b ^ is an ordinary
1255  character except at the beginning of the RE or the beginning of a parenthesized
1256  subexpression, @b $ is an ordinary character except at the end of the RE or
1257  the end of a parenthesized subexpression, and @b * is an ordinary character
1258  if it appears at the beginning of the RE or the beginning of a parenthesized
1259  subexpression (after a possible leading '@b ^'). Finally, single-digit back references
 are available, and @b \< and @b \> are synonyms
 for @b [[:<:]] and @b [[:>:]] respectively;
1262  no other escapes are available.
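
 The differences between BREs and EREs listed above are those of POSIX, so
 they can be observed with regex.h by compiling the same text with and
 without REG_EXTENDED. Matches is a hypothetical helper; the patterns are
 anchored with ^ and $ so that the whole input must match.

```cpp
#include <regex.h>
#include <string>

// True if `text` contains a match for `pattern`, compiled as a POSIX ERE
// when `extended` is true and as a POSIX BRE otherwise.
bool Matches(const std::string& pattern, const std::string& text,
             bool extended)
{
    regex_t re;
    const int flags = (extended ? REG_EXTENDED : 0) | REG_NOSUB;
    if (regcomp(&re, pattern.c_str(), flags) != 0)
        return false;   // the pattern failed to compile
    const bool ok = regexec(&re, text.c_str(), 0, nullptr, 0) == 0;
    regfree(&re);
    return ok;
}
```

 As a BRE, the text ^a+$ matches only the two-character string "a+", since +
 is an ordinary character; as an ERE it matches "aa". Bounds and groups in a
 BRE need backslashes, so the BRE ^\(ab\)\{2\}$ corresponds to the ERE
 ^(ab){2}$.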
1263
1264  @section wxresynchars Regular Expression Character Names
1265
1266  @ref resyn_overview
1267  Note that the character names are case sensitive.

 @li NUL : '\0'
 @li SOH : '\001'
 @li STX : '\002'
 @li ETX : '\003'
 @li EOT : '\004'
 @li ENQ : '\005'
 @li ACK : '\006'
 @li BEL : '\007'
 @li alert : '\007'
 @li BS : '\010'
 @li backspace : '\b'
 @li HT : '\011'
 @li tab : '\t'
 @li LF : '\012'
 @li newline : '\n'
 @li VT : '\013'
 @li vertical-tab : '\v'
 @li FF : '\014'
 @li form-feed : '\f'
 @li CR : '\015'
 @li carriage-return : '\r'
 @li SO : '\016'
 @li SI : '\017'
 @li DLE : '\020'
 @li DC1 : '\021'
 @li DC2 : '\022'
 @li DC3 : '\023'
 @li DC4 : '\024'
 @li NAK : '\025'
 @li SYN : '\026'
 @li ETB : '\027'
 @li CAN : '\030'
 @li EM : '\031'
 @li SUB : '\032'
 @li ESC : '\033'
 @li IS4 : '\034'
 @li FS : '\034'
 @li IS3 : '\035'
 @li GS : '\035'
 @li IS2 : '\036'
 @li RS : '\036'
 @li IS1 : '\037'
 @li US : '\037'
 @li space : ' '
 @li exclamation-mark : '!'
 @li quotation-mark : '"'
 @li number-sign : '#'
 @li dollar-sign : '$'
1797
1798
1799
1800
1801
1802  percent-sign
1803
1804
1805
1806
1807  '%'
1808
1809
1810
1811
1812
1813  ampersand
1814
1815
1816
1817
1818  ''
1819
1820
1821
1822
1823
1824  apostrophe
1825
1826
1827
1828
1829  '\''
1830
1831
1832
1833
1834
1835  left-parenthesis
1836
1837
1838
1839
1840  '('
1841
1842
1843
1844
1845
1846  right-parenthesis
1847
1848
1849
1850
1851  ')'
1852
1853
1854
1855
1856
1857  asterisk
1858
1859
1860
1861
1862  '*'
1863
1864
1865
1866
1867
1868  plus-sign
1869
1870
1871
1872
1873  '+'
1874
1875
1876
1877
1878
1879  comma
1880
1881
1882
1883
1884  ','
1885
1886
1887
1888
1889
1890  hyphen
1891
1892
1893
1894
1895  '-'
1896
1897
1898
1899
1900
1901  hyphen-minus
1902
1903
1904
1905
1906  '-'
1907
1908
1909
1910
1911
1912  period
1913
1914
1915
1916
1917  '.'
1918
1919
1920
1921
1922
1923  full-stop
1924
1925
1926
1927
1928  '.'
1929
1930
1931
1932
1933
1934  slash
1935
1936
1937
1938
1939  '/'
1940
1941
1942
1943
1944
1945  solidus
1946
1947
1948
1949
1950  '/'
1951
1952
1953
1954
1955
1956  zero
1957
1958
1959
1960
1961  '0'
1962
1963
1964
1965
1966
1967  one
1968
1969
1970
1971
1972  '1'
1973
1974
1975
1976
1977
1978  two
1979
1980
1981
1982
1983  '2'
1984
1985
1986
1987
1988
1989  three
1990
1991
1992
1993
1994  '3'
1995
1996
1997
1998
1999
2000  four
2001
2002
2003
2004
2005  '4'
2006
2007
2008
2009
2010
2011  five
2012
2013
2014
2015
2016  '5'
2017
2018
2019
2020
2021
2022  six
2023
2024
2025
2026
2027  '6'
2028
2029
2030
2031
2032
2033  seven
2034
2035
2036
2037
2038  '7'
2039
2040
2041
2042
2043
2044  eight
2045
2046
2047
2048
2049  '8'
2050
2051
2052
2053
2054
2055  nine
2056
2057
2058
2059
2060  '9'
2061
2062
2063
2064
2065
2066  colon
2067
2068
2069
2070
2071  ':'
2072
2073
2074
2075
2076
2077  semicolon
2078
2079
2080
2081
2082  ';'
2083
2084
2085
2086
2087
2088  less-than-sign
2089
2090
2091
2092
2093  ''
2094
2095
2096
2097
2098
2099  equals-sign
2100
2101
2102
2103
2104  '='
2105
2106
2107
2108
2109
2110  greater-than-sign
2111
2112
2113
2114
2115  ''
2116
2117
2118
2119
2120
2121  question-mark
2122
2123
2124
2125
2126  '?'
2127
2128
2129
2130
2131
2132  commercial-at
2133
2134
2135
2136
2137  '@'
2138
2139
2140
2141
2142
2143  left-square-bracket
2144
2145
2146
2147
2148  '['
2149
2150
2151
2152
2153
2154  backslash
2155
2156
2157
2158
2159  '\'
2160
2161
2162
2163
2164
2165  reverse-solidus
2166
2167
2168
2169
2170  '\'
2171
2172
2173
2174
2175
2176  right-square-bracket
2177
2178
2179
2180
2181  ']'
2182
2183
2184
2185
2186
2187  circumflex
2188
2189
2190
2191
2192  '^'
2193
2194
2195
2196
2197
2198  circumflex-accent
2199
2200
2201
2202
2203  '^'
2204
2205
2206
2207
2208
2209  underscore
2210
2211
2212
2213
2214  '_'
2215
2216
2217
2218
2219
2220  low-line
2221
2222
2223
2224
2225  '_'
2226
2227
2228
2229
2230
2231  grave-accent
2232
2233
2234
2235
2236  '''
2237
2238
2239
2240
2241
2242  left-brace
2243
2244
2245
2246
2247  '{'
2248
2249
2250
2251
2252
2253  left-curly-bracket
2254
2255
2256
2257
2258  '{'
2259
2260
2261
2262
2263
2264  vertical-line
2265
2266
2267
2268
2269  '|'
2270
2271
2272
2273
2274
2275  right-brace
2276
2277
2278
2279
2280  '}'
2281
2282
2283
2284
2285
2286  right-curly-bracket
2287
2288
2289
2290
2291  '}'
2292
2293
2294
2295
2296
2297  tilde
2298
2299
2300
2301
2302  '~'
2303
2304
2305
2306
2307
2308  DEL
2309
2310
2311
2312
2313  '\177'
2314
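 The case sensitivity and the aliases among the character names listed in
 this section can be illustrated with a small lookup sketch. It is only an
 illustration under stated assumptions: @c LookupCharName is a hypothetical
 helper, only a handful of the names are reproduced, and nothing here is
 meant to describe how the library stores its names internally.

```cpp
#include <map>
#include <string>

// Hypothetical helper: returns the character code for a character name,
// or -1 if the name is not known. Only a few names from the list are
// included; the comparison is exact, so names are case sensitive.
int LookupCharName(const std::string& name)
{
    static const std::map<std::string, char> names = {
        { "NUL",             '\0'   },
        { "HT",              '\011' },
        { "tab",             '\t'   },   // alias of HT
        { "LF",              '\012' },
        { "newline",         '\n'   },   // alias of LF
        { "hyphen",          '-'    },
        { "hyphen-minus",    '-'    },   // alias of hyphen
        { "backslash",       '\\'   },
        { "reverse-solidus", '\\'   },   // alias of backslash
        { "DEL",             '\177' },
    };

    const auto it = names.find(name);
    if (it == names.end())
        return -1;
    return static_cast<unsigned char>(it->second);
}
```

 In this sketch "tab" and "HT" yield the same code, while "Tab" or "nul"
 are not found because the spelling must match exactly.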
2315  */
2316
2317