1 /////////////////////////////////////////////////////////////////////////////
2 // Name: resyntax.h
3 // Purpose: topic overview
4 // Author: wxWidgets team
5 // RCS-ID: $Id$
6 // Licence: wxWindows license
7 /////////////////////////////////////////////////////////////////////////////
8
9 /*!
10
11 @page overview_resyntax Syntax of the Built-in Regular Expression Library
12
13 A <em>regular expression</em> describes strings of characters. It's a pattern
14 that matches certain strings and doesn't match others.
15
16 @li @ref overview_resyntax_differentflavors
17 @li @ref overview_resyntax_syntax
18 @li @ref overview_resyntax_bracket
19 @li @ref overview_resyntax_escapes
20 @li @ref overview_resyntax_metasyntax
21 @li @ref overview_resyntax_matching
22 @li @ref overview_resyntax_limits
23 @li @ref overview_resyntax_bre
24 @li @ref overview_resyntax_characters
25
26 @seealso
27
28 @li #wxRegEx
29
30
31 <hr>
32
33
34 @section overview_resyntax_differentflavors Different Flavors of Regular Expressions
35
36 Regular expressions (RE), as defined by POSIX, come in two flavors:
37 <em>extended regular expressions</em> (ERE) and <em>basic regular
38 expressions</em> (BRE). EREs are roughly those of the traditional @e egrep,
39 while BREs are roughly those of the traditional @e ed. This implementation
40 adds a third flavor: <em>advanced regular expressions</em> (ARE), basically
41 EREs with some significant extensions.
42
43 This manual page primarily describes AREs. BREs mostly exist for backward
44 compatibility in some old programs. POSIX EREs are almost an exact subset of
45 AREs. Features of AREs that are not present in EREs will be indicated.
46
47
48 @section overview_resyntax_syntax Regular Expression Syntax
49
50 These regular expressions are implemented using the package written by Henry
51 Spencer, based on the 1003.2 spec and some (not quite all) of the Perl5
52 extensions (thanks, Henry!). Much of the description of regular expressions
53 below is copied verbatim from his manual entry.
54
55 An ARE is one or more @e branches, separated by "|", matching anything that
56 matches any of the branches.
57
58 A branch is zero or more @e constraints or @e quantified atoms, concatenated.
59 It matches a match for the first, followed by a match for the second, etc; an
60 empty branch matches the empty string.
61
62 A quantified atom is an @e atom possibly followed by a single @e quantifier.
63 Without a quantifier, it matches a match for the atom. The quantifiers, and
64 what a so-quantified atom matches, are:
65
66 @beginTable
67 @row2col{ <tt>*</tt> ,
68 A sequence of 0 or more matches of the atom. }
69 @row2col{ <tt>+</tt> ,
70 A sequence of 1 or more matches of the atom. }
71 @row2col{ <tt>?</tt> ,
72 A sequence of 0 or 1 matches of the atom. }
73 @row2col{ <tt>{m}</tt> ,
74 A sequence of exactly @e m matches of the atom. }
75 @row2col{ <tt>{m\,}</tt> ,
76 A sequence of @e m or more matches of the atom. }
77 @row2col{ <tt>{m\,n}</tt> ,
78 A sequence of @e m through @e n (inclusive) matches of the atom; @e m may
79 not exceed @e n. }
80 @row2col{ <tt>*? +? ?? {m}? {m\,}? {m\,n}?</tt> ,
81 @e Non-greedy quantifiers, which match the same possibilities, but prefer
82 the smallest number rather than the largest number of matches (see
83 @ref overview_resyntax_matching). }
84 @endTable
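
As a rough illustration of the bounds and non-greedy quantifiers in the table
above, the sketch below uses the C++ standard library's @c std::regex rather
than wxRegEx. Its default ECMAScript grammar is not the ARE flavor described
here; it is only assumed to agree with AREs on these particular quantifiers.
The helper name @c first_match is made up for the example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns the first substring of `text` matched by
// `pattern`, or an empty string if nothing matches.
// Assumption: std::regex (ECMAScript grammar) stands in for wxRegEx here.
std::string first_match(const std::string& pattern, const std::string& text)
{
    std::smatch m;
    std::regex re(pattern);
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}
```

With this helper, <tt>a+</tt> applied to <tt>aaa</tt> yields <tt>aaa</tt>,
while the non-greedy <tt>a+?</tt> yields just <tt>a</tt>; likewise the bound
<tt>a{2,3}</tt> applied to <tt>aaaa</tt> yields <tt>aaa</tt>.
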
85
86 The forms using @b { and @b } are known as @e bounds. The numbers @e m and
87 @e n are unsigned decimal integers with permissible values from 0 to 255
88 inclusive. An atom is one of:
89
90 @beginTable
91 @row2col{ <tt>(re)</tt> ,
92 Where @e re is any regular expression, matches for @e re, with the match
93 captured for possible reporting. }
94 @row2col{ <tt>(?:re)</tt> ,
95 As previous, but does no reporting (a "non-capturing" set of
96 parentheses). }
97 @row2col{ <tt>()</tt> ,
98 Matches an empty string, captured for possible reporting. }
99 @row2col{ <tt>(?:)</tt> ,
100 Matches an empty string, without reporting. }
101 @row2col{ <tt>[chars]</tt> ,
102 A <em>bracket expression</em>, matching any one of the @e chars (see
103 @ref overview_resyntax_bracket for more details). }
104 @row2col{ <tt>.</tt> ,
105 Matches any single character. }
106 @row2col{ <tt>@\k</tt> ,
107 Where @e k is a non-alphanumeric character, matches that character taken
108 as an ordinary character, e.g. @\@\ matches a backslash character. }
109 @row2col{ <tt>@\c</tt> ,
110 Where @e c is alphanumeric (possibly followed by other characters), an
111 @e escape (AREs only), see @ref overview_resyntax_escapes below. }
112 @row2col{ <tt>@leftCurly</tt> ,
113 When followed by a character other than a digit, matches the left-brace
114 character "@leftCurly"; when followed by a digit, it is the beginning of a
115 @e bound (see above). }
116 @row2col{ <tt>x</tt> ,
117 Where @e x is a single character with no other significance, matches that
118 character. }
119 @endTable
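
The difference between capturing and non-capturing parentheses can be
sketched as follows. The example again relies on @c std::regex with its
default ECMAScript grammar as a stand-in (an assumption, since it is not the
ARE engine), and @c captures is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the captured subexpressions (group 1, 2, ...)
// of a whole-string match, or an empty vector if `text` does not match.
// Assumption: std::regex (ECMAScript grammar) stands in for wxRegEx here.
std::vector<std::string> captures(const std::string& pattern,
                                  const std::string& text)
{
    std::vector<std::string> groups;
    std::smatch m;
    std::regex re(pattern);
    if (std::regex_match(text, m, re))
    {
        // m[0] is the whole match; reported subexpressions start at index 1.
        for (std::size_t i = 1; i < m.size(); ++i)
            groups.push_back(m.str(i));
    }
    return groups;
}
```

Matching <tt>abc</tt> against <tt>(a)(b)(c)</tt> reports three
subexpressions, whereas <tt>(a)(?:b)(c)</tt> reports only <tt>a</tt> and
<tt>c</tt>, because the middle group does no reporting.
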
120
121 A @e constraint matches an empty string when specific conditions are met. A
122 constraint may not be followed by a quantifier. The simple constraints are as
123 follows; some more constraints are described later, under
124 @ref overview_resyntax_escapes.
125
126 @beginTable
127 @row2col{ <tt>^</tt> ,
128 Matches at the beginning of a line. }
129 @row2col{ <tt>@$</tt> ,
130 Matches at the end of a line. }
131 @row2col{ <tt>(?=re)</tt> ,
132 @e Positive lookahead (AREs only), matches at any point where a substring
133 matching @e re begins. }
134 @row2col{ <tt>(?!re)</tt> ,
135 @e Negative lookahead (AREs only), matches at any point where no substring
136 matching @e re begins. }
137 @endTable
138
The lookahead constraints may not contain back references (see later), and all
parentheses within them are considered non-capturing. An RE may not end with
"@\".
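
A lookahead constraint tests what follows without consuming it. The sketch
below shows this with @c std::regex, whose ECMAScript grammar happens to use
the same <tt>(?=re)</tt> and <tt>(?!re)</tt> syntax; treating it as
equivalent to the ARE behavior is an assumption, and @c match_text is a
hypothetical helper name.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns the text of the first match of `pattern`
// in `text`, or an empty string if there is no match.
// Assumption: std::regex (ECMAScript grammar) stands in for wxRegEx here.
std::string match_text(const std::string& pattern, const std::string& text)
{
    std::smatch m;
    std::regex re(pattern);
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}
```

Here <tt>foo(?=bar)</tt> matches only the <tt>foo</tt> in <tt>foobar</tt>
(the lookahead text is not part of the match), and <tt>foo(?!bar)</tt>
matches the <tt>foo</tt> in <tt>foobaz</tt> but nothing in <tt>foobar</tt>.
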
142
143
144 @section overview_resyntax_bracket Bracket Expressions
145
146 A <em>bracket expression</em> is a list of characters enclosed in <tt>[]</tt>.
147 It normally matches any single character from the list (but see below). If the
148 list begins with @c ^, it matches any single character (but see below) @e not
149 from the rest of the list.
150
151 If two characters in the list are separated by <tt>-</tt>, this is shorthand
152 for the full @e range of characters between those two (inclusive) in the
153 collating sequence, e.g. <tt>[0-9]</tt> in ASCII matches any decimal digit.
154 Two ranges may not share an endpoint, so e.g. <tt>a-c-e</tt> is illegal.
155 Ranges are very collating-sequence-dependent, and portable programs should
156 avoid relying on them.
157
158 To include a literal <tt>]</tt> or <tt>-</tt> in the list, the simplest method
159 is to enclose it in <tt>[.</tt> and <tt>.]</tt> to make it a collating element
160 (see below). Alternatively, make it the first character (following a possible
161 <tt>^</tt>), or (AREs only) precede it with <tt>@\</tt>. Alternatively, for
162 <tt>-</tt>, make it the last character, or the second endpoint of a range. To
163 use a literal <tt>-</tt> as the first endpoint of a range, make it a collating
164 element or (AREs only) precede it with <tt>@\</tt>. With the exception of
165 these, some combinations using <tt>[</tt> (see next paragraphs), and escapes,
166 all other special characters lose their special significance within a bracket
167 expression.
168
169 Within a bracket expression, a collating element (a character, a
170 multi-character sequence that collates as if it were a single character, or a
171 collating-sequence name for either) enclosed in <tt>[.</tt> and <tt>.]</tt>
172 stands for the sequence of characters of that collating element.
173
@e wxWidgets: Currently no multi-character collating elements are defined, so
in <tt>[.X.]</tt>, @c X can be either a single character literal or the name
of a character. For example, <tt>[[.0.]-[.9.]]</tt> and
<tt>[[.zero.]-[.nine.]]</tt> are equivalent, and both mean the same as
<tt>[0-9]</tt>. See @ref overview_resyntax_characters.
179
Within a bracket expression, a collating element enclosed in <tt>[=</tt> and
<tt>=]</tt> is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself. An
equivalence class may not be an endpoint of a range.

@e wxWidgets: Currently no equivalence classes are defined, so
<tt>[=X=]</tt> stands for just the single character @c X. @c X can be either
a single character literal or the name of a character, see
@ref overview_resyntax_characters.

Within a bracket expression, the name of a <em>character class</em> enclosed
in <tt>[:</tt> and <tt>:]</tt> stands for the list of all characters (not all
collating elements!) belonging to that class. Standard character classes are:
192
193 @beginTable
194 @row2col{ <tt>alpha</tt> , A letter. }
195 @row2col{ <tt>upper</tt> , An upper-case letter. }
196 @row2col{ <tt>lower</tt> , A lower-case letter. }
197 @row2col{ <tt>digit</tt> , A decimal digit. }
198 @row2col{ <tt>xdigit</tt> , A hexadecimal digit. }
199 @row2col{ <tt>alnum</tt> , An alphanumeric (letter or digit). }
200 @row2col{ <tt>print</tt> , An alphanumeric (same as alnum). }
201 @row2col{ <tt>blank</tt> , A space or tab character. }
202 @row2col{ <tt>space</tt> , A character producing white space in displayed text. }
203 @row2col{ <tt>punct</tt> , A punctuation character. }
204 @row2col{ <tt>graph</tt> , A character with a visible representation. }
205 @row2col{ <tt>cntrl</tt> , A control character. }
206 @endTable
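
As a small example of character classes inside bracket expressions, the
sketch below checks whether a string looks like a C-style identifier. It uses
@c std::regex with the POSIX extended grammar as a stand-in for wxRegEx (an
assumption); the function name @c is_identifier is made up for the example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if `s` is a letter or underscore followed by
// any number of alphanumerics or underscores.
// Assumption: std::regex in POSIX extended mode stands in for wxRegEx here.
bool is_identifier(const std::string& s)
{
    static const std::regex re("[[:alpha:]_][[:alnum:]_]*",
                               std::regex::extended);
    return std::regex_match(s, re);
}
```

Note that the class names appear inside an enclosing bracket expression, so
<tt>[[:alpha:]_]</tt> means "any letter, or an underscore".
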
207
A character class may not be used as an endpoint of a range.

@e wxWidgets: In a non-Unicode build, these character classifications depend
on the current locale, and correspond to the values returned by the ANSI C
"is" functions: @c isalpha, @c isupper, etc. In Unicode mode they are based on
Unicode classifications, and are not affected by the current locale.

There are two special cases of bracket expressions: the bracket expressions
<tt>[[:@<:]]</tt> and <tt>[[:@>:]]</tt> are constraints, matching empty
strings at the beginning and end of a word respectively. A word is defined as
a sequence of word characters that is neither preceded nor followed by word
characters. A word character is an @e alnum character or an underscore
(<tt>_</tt>). These special bracket expressions are deprecated; users of AREs
should use constraint escapes instead (see @ref overview_resyntax_escapes).
220
221
222 @section overview_resyntax_escapes Escapes
223
Escapes (AREs only), which begin with a <tt>@\</tt> followed by an
alphanumeric character, come in several varieties: character entry, class
shorthands, constraint escapes, and back references. A <tt>@\</tt> followed by
an alphanumeric character but not constituting a valid escape is illegal in
AREs. In EREs, there are no escapes: outside a bracket expression, a
<tt>@\</tt> followed by an alphanumeric character merely stands for that
character as an ordinary character, and inside a bracket expression,
<tt>@\</tt> is an ordinary character. (The latter is the one actual
incompatibility between EREs and AREs.)

Character-entry escapes (AREs only) exist to make it easier to specify
non-printing and otherwise inconvenient characters in REs:

@beginTable
@row2col{ <tt>@\a</tt> ,
    Alert (bell) character, as in C. }
@row2col{ <tt>@\b</tt> ,
    Backspace, as in C. }
@row2col{ <tt>@\B</tt> ,
    Synonym for <tt>@\</tt> to help reduce backslash doubling in some
    applications where there are multiple levels of backslash processing. }
@row2col{ <tt>@\c</tt><em>X</em> ,
    Where @e X is any character, the character whose low-order 5 bits are the
    same as those of @e X and whose other bits are all zero. }
@row2col{ <tt>@\e</tt> ,
    The character whose collating-sequence name is @c ESC, or failing that,
    the character with octal value 033. }
@row2col{ <tt>@\f</tt> ,
    Formfeed, as in C. }
@row2col{ <tt>@\n</tt> ,
    Newline, as in C. }
@row2col{ <tt>@\r</tt> ,
    Carriage return, as in C. }
@row2col{ <tt>@\t</tt> ,
    Horizontal tab, as in C. }
@row2col{ <tt>@\u</tt><em>wxyz</em> ,
    Where @e wxyz is exactly four hexadecimal digits, the Unicode character
    <tt>U+</tt><em>wxyz</em> in the local byte ordering. }
@row2col{ <tt>@\U</tt><em>stuvwxyz</em> ,
    Where @e stuvwxyz is exactly eight hexadecimal digits, reserved for a
    somewhat-hypothetical Unicode extension to 32 bits. }
@row2col{ <tt>@\v</tt> ,
    Vertical tab, as in C. }
@row2col{ <tt>@\x</tt><em>hhh</em> ,
    Where @e hhh is any sequence of hexadecimal digits, the character whose
    hexadecimal value is <tt>0x</tt><em>hhh</em> (a single character no matter
    how many hexadecimal digits are used). }
@row2col{ <tt>@\0</tt> ,
    The character whose value is 0. }
@row2col{ <tt>@\</tt><em>xy</em> ,
    Where @e xy is exactly two octal digits and is not a @e back reference
    (see below), the character whose octal value is <tt>0</tt><em>xy</em>. }
@row2col{ <tt>@\</tt><em>xyz</em> ,
    Where @e xyz is exactly three octal digits and is not a back reference
    (see below), the character whose octal value is <tt>0</tt><em>xyz</em>. }
@endTable

Hexadecimal digits are <tt>0</tt>-<tt>9</tt>, <tt>a</tt>-<tt>f</tt>, and
<tt>A</tt>-<tt>F</tt>. Octal digits are <tt>0</tt>-<tt>7</tt>.

The character-entry escapes are always taken as ordinary characters. For
example, <tt>@\135</tt> is <tt>]</tt> in ASCII, but <tt>@\135</tt> does not
terminate a bracket expression. Beware, however, that some applications (e.g.,
C compilers) interpret such sequences themselves before the regular-expression
package gets to see them, which may require doubling (quadrupling, etc.) the
<tt>@\</tt>.

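
The backslash-doubling caveat is easiest to see in C++ source code. The
sketch below writes the same four-character pattern in two ways, once with
the backslash doubled for the compiler and once as a C++11 raw string
literal. It uses @c std::regex (ECMAScript grammar) as a stand-in for wxRegEx,
which is an assumption: that grammar accepts exactly two hexadecimal digits
after <tt>@\x</tt>, unlike the ARE rule described above. The names
@c doubled, @c raw and @c matches are made up for the example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// The pattern \x41 (backslash, x, 4, 1) written two ways.
const std::string doubled = "\\x41";   // backslash doubled for the compiler
const std::string raw = R"(\x41)";     // raw string literal, no doubling

// Hypothetical helper: true if the whole of `text` matches `pattern`.
// Assumption: std::regex (ECMAScript grammar) stands in for wxRegEx here.
bool matches(const std::string& pattern, const std::string& text)
{
    std::regex re(pattern);
    return std::regex_match(text, re);
}
```

Both spellings hand the regular-expression engine the same four characters,
so both match the single character <tt>A</tt>. Writing <tt>"\x41"</tt> with a
single backslash would instead let the compiler produce the one-character
string <tt>A</tt> before the engine ever sees an escape.
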
Class-shorthand escapes (AREs only) provide shorthands for certain
commonly-used character classes:

@beginTable
@row2col{ <tt>@\d</tt> , <tt>[[:digit:]]</tt> }
@row2col{ <tt>@\s</tt> , <tt>[[:space:]]</tt> }
@row2col{ <tt>@\w</tt> , <tt>[[:alnum:]_]</tt> (note underscore) }
@row2col{ <tt>@\D</tt> , <tt>[^[:digit:]]</tt> }
@row2col{ <tt>@\S</tt> , <tt>[^[:space:]]</tt> }
@row2col{ <tt>@\W</tt> , <tt>[^[:alnum:]_]</tt> (note underscore) }
@endTable

Within bracket expressions, <tt>@\d</tt>, <tt>@\s</tt>, and <tt>@\w</tt> lose
their outer brackets, and <tt>@\D</tt>, <tt>@\S</tt>, and <tt>@\W</tt> are
illegal. So, for example, <tt>[a-c@\d]</tt> is equivalent to
<tt>[a-c[:digit:]]</tt>, while <tt>[a-c@\D]</tt>, which would be equivalent to
<tt>[a-c^[:digit:]]</tt>, is illegal.

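
The equivalence between the shorthands and their bracket-expression forms can
be checked with a short sketch. It counts matches using @c std::regex, whose
ECMAScript grammar supports both spellings; using it in place of wxRegEx is
an assumption, and @c count_matches is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts the non-overlapping matches of `pattern`
// found in `text`, scanning from left to right.
// Assumption: std::regex (ECMAScript grammar) stands in for wxRegEx here.
std::size_t count_matches(const std::string& pattern, const std::string& text)
{
    std::regex re(pattern);
    std::sregex_iterator first(text.begin(), text.end(), re);
    std::sregex_iterator last;
    return static_cast<std::size_t>(std::distance(first, last));
}
```

For a given input, <tt>@\d+</tt> and <tt>[[:digit:]]+</tt> should report the
same number of matches, as should <tt>@\w+</tt> and <tt>[[:alnum:]_]+</tt>,
or <tt>[a-c@\d]+</tt> and <tt>[a-c[:digit:]]+</tt>.
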
A constraint escape (AREs only) is a constraint, matching the empty string if
specific conditions are met, written as an escape:

@beginTable
@row2col{ <tt>@\A</tt> ,
    Matches only at the beginning of the string (see
    @ref overview_resyntax_matching for how this differs from <tt>^</tt>). }
@row2col{ <tt>@\m</tt> ,
    Matches only at the beginning of a word. }
@row2col{ <tt>@\M</tt> ,
    Matches only at the end of a word. }
@row2col{ <tt>@\y</tt> ,
    Matches only at the beginning or end of a word. }
@row2col{ <tt>@\Y</tt> ,
    Matches only at a point that is not the beginning or end of a word. }
@row2col{ <tt>@\Z</tt> ,
    Matches only at the end of the string (see
    @ref overview_resyntax_matching for how this differs from <tt>@$</tt>). }
@row2col{ <tt>@\</tt><em>m</em> ,
    Where @e m is a nonzero digit, a @e back reference (see below). }
@row2col{ <tt>@\</tt><em>mnn</em> ,
    Where @e m is a nonzero digit and @e nn is some more digits, and the
    decimal value @e mnn is not greater than the number of closing capturing
    parentheses seen so far, a @e back reference (see below). }
@endTable

A word is defined as in the specification of <tt>[[:@<:]]</tt> and
<tt>[[:@>:]]</tt> above (see @ref overview_resyntax_bracket). Constraint
escapes are illegal within bracket expressions.

A back reference (AREs only) matches the same string matched by the
parenthesized subexpression specified by the number, so that (e.g.)
<tt>([bc])@\1</tt> matches <tt>bb</tt> or <tt>cc</tt> but not <tt>bc</tt>. The
subexpression must entirely precede the back reference in the RE.
Subexpressions are numbered in the order of their leading parentheses.
Non-capturing parentheses do not define subexpressions.

There is an inherent historical ambiguity between octal character-entry
escapes and back references, which is resolved by heuristics, as hinted at
above. A leading zero always indicates an octal escape. A single non-zero
digit, not followed by another digit, is always taken as a back reference. A
multi-digit sequence not starting with a zero is taken as a back reference if
it comes after a suitable subexpression (i.e. the number is in the legal range
for a back reference), and otherwise is taken as octal.
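
The back-reference example from the text, <tt>([bc])@\1</tt>, can be tried
out with the following sketch. It uses @c std::regex (ECMAScript grammar) in
place of wxRegEx, on the assumption that single-digit back references behave
the same way in both; @c is_doubled_bc is a hypothetical helper name.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if `s` is exactly "bb" or "cc", i.e. one
// character from [bc] followed by a repeat of whatever group 1 matched.
// Assumption: std::regex (ECMAScript grammar) stands in for wxRegEx here.
bool is_doubled_bc(const std::string& s)
{
    static const std::regex re(R"(([bc])\1)");
    return std::regex_match(s, re);
}
```

The back reference repeats the matched text, not the subexpression, which is
why <tt>bc</tt> is rejected even though both characters belong to
<tt>[bc]</tt>.
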
432
433
434 @section overview_resyntax_metasyntax Metasyntax
435
In addition to the main syntax described above, there are some special forms
and miscellaneous syntactic facilities available.

Normally the flavor of RE being used is specified by application-dependent
means. However, this can be overridden by a @e director. If an RE of any
flavor begins with <tt>***:</tt>, the rest of the RE is an ARE. If an RE of
any flavor begins with <tt>***=</tt>, the rest of the RE is taken to be a
literal string, with all characters considered ordinary characters.

An ARE may begin with <em>embedded options</em>: a sequence <tt>(?xyz)</tt>
(where @e xyz is one or more alphabetic characters) specifies options
affecting the rest of the RE. These supplement, and can override, any options
specified by the application. The available option letters are:

@beginTable
@row2col{ <tt>b</tt> ,
    Rest of RE is a BRE. }
@row2col{ <tt>c</tt> ,
    Case-sensitive matching (usual default). }
@row2col{ <tt>e</tt> ,
    Rest of RE is an ERE. }
@row2col{ <tt>i</tt> ,
    Case-insensitive matching (see @ref overview_resyntax_matching). }
@row2col{ <tt>m</tt> ,
    Historical synonym for <tt>n</tt>. }
@row2col{ <tt>n</tt> ,
    Newline-sensitive matching (see @ref overview_resyntax_matching). }
@row2col{ <tt>p</tt> ,
    Partial newline-sensitive matching (see
    @ref overview_resyntax_matching). }
@row2col{ <tt>q</tt> ,
    Rest of RE is a literal ("quoted") string, all ordinary characters. }
@row2col{ <tt>s</tt> ,
    Non-newline-sensitive matching (usual default). }
@row2col{ <tt>t</tt> ,
    Tight syntax (usual default; see below). }
@row2col{ <tt>w</tt> ,
    Inverse partial newline-sensitive ("weird") matching (see
    @ref overview_resyntax_matching). }
@row2col{ <tt>x</tt> ,
    Expanded syntax (see below). }
@endTable

Embedded options take effect at the <tt>)</tt> terminating the sequence. They
are available only at the start of an ARE, and may not be used later within
it.

In addition to the usual (@e tight) RE syntax, in which all characters are
significant, there is an @e expanded syntax, available in AREs with the
embedded <tt>x</tt> option. In the expanded syntax, white-space characters are
ignored and all characters between a <tt>@#</tt> and the following newline (or
the end of the RE) are ignored, permitting paragraphing and commenting a
complex RE. There are three exceptions to that basic rule:

@li A white-space character or <tt>@#</tt> preceded by <tt>@\</tt> is
    retained.
@li White space or <tt>@#</tt> within a bracket expression is retained.
@li White space and comments are illegal within multi-character symbols like
    the ARE <tt>(?:</tt> or the BRE <tt>@\(</tt>.

Expanded-syntax white-space characters are blank, tab, newline, and any
character that belongs to the @e space character class.

Finally, in an ARE, outside bracket expressions, the sequence
<tt>(?@#ttt)</tt> (where @e ttt is any text not containing a <tt>)</tt>) is a
comment, completely ignored. Again, this is not allowed between the characters
of multi-character symbols like <tt>(?:</tt>. Such comments are more a
historical artifact than a useful facility, and their use is deprecated; use
the expanded syntax instead.

@e None of these metasyntax extensions is available if the application (or an
initial <tt>***=</tt> director) has specified that the user's input be treated
as a literal string rather than as an RE.
533
534
535 @section overview_resyntax_matching Matching
536
In the event that an RE could match more than one substring of a given string,
the RE matches the one starting earliest in the string. If the RE could match
more than one substring starting at that point, its choice is determined by
its @e preference: either the longest substring, or the shortest.

Most atoms, and all constraints, have no preference. A parenthesized RE has
the same preference (possibly none) as the RE. A quantified atom with
quantifier <tt>{m}</tt> or <tt>{m}?</tt> has the same preference (possibly
none) as the atom itself. A quantified atom with other normal quantifiers
(including <tt>{m,n}</tt> with @e m equal to @e n) prefers longest match. A
quantified atom with other non-greedy quantifiers (including <tt>{m,n}?</tt>
with @e m equal to @e n) prefers shortest match. A branch has the same
preference as the first quantified atom in it which has a preference. An RE
consisting of two or more branches connected by the <tt>|</tt> operator
prefers longest match.

Subject to the constraints imposed by the rules for matching the whole RE,
subexpressions also match the longest or shortest possible substrings, based
on their preferences, with subexpressions starting earlier in the RE taking
priority over ones starting later. Note that outer subexpressions thus take
priority over their component subexpressions.

Note that the quantifiers <tt>{1,1}</tt> and <tt>{1,1}?</tt> can be used to
force longest and shortest preference, respectively, on a subexpression or a
whole RE.

Match lengths are measured in characters, not collating elements. An empty
string is considered longer than no match at all. For example, <tt>bb*</tt>
matches the three middle characters of <tt>abbbc</tt>;
<tt>(week|wee)(night|knights)</tt> matches all ten characters of
<tt>weeknights</tt>; when <tt>(.*).*</tt> is matched against <tt>abc</tt>, the
parenthesized subexpression matches all three characters; and when
<tt>(a*)*</tt> is matched against <tt>bc</tt>, both the whole RE and the
parenthesized subexpression match an empty string.

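
The <tt>weeknights</tt> example above can be contrasted with a first-match
engine using the sketch below. It relies on @c std::regex rather than wxRegEx:
the POSIX extended grammar is assumed to follow the same longest-match rule as
an ARE for this pattern, while the ECMAScript grammar takes the first
alternative that succeeds, as Perl does. The helper name @c first_match is
hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns the first substring of `text` matched by
// `pattern` under the given std::regex grammar, or an empty string.
// Assumption: std::regex stands in for wxRegEx here.
std::string first_match(const std::string& pattern, const std::string& text,
                        std::regex::flag_type grammar)
{
    std::smatch m;
    std::regex re(pattern, grammar);
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}
```

With @c std::regex::extended the pattern <tt>(week|wee)(night|knights)</tt> is
expected to cover all ten characters of <tt>weeknights</tt>, whereas with
@c std::regex::ECMAScript it stops after the nine characters of
<tt>weeknight</tt>, because the first branch <tt>week</tt> is tried first and
succeeds.
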
If case-independent matching is specified, the effect is much as if all case
distinctions had vanished from the alphabet. When an alphabetic character that
exists in multiple cases appears as an ordinary character outside a bracket
expression, it is effectively transformed into a bracket expression containing
both cases, so that <tt>x</tt> becomes <tt>[xX]</tt>. When it appears inside a
bracket expression, all case counterparts of it are added to the bracket
expression, so that <tt>[x]</tt> becomes <tt>[xX]</tt> and <tt>[^x]</tt>
becomes <tt>[^xX]</tt>.

If newline-sensitive matching is specified, <tt>.</tt> and bracket expressions
using <tt>^</tt> will never match the newline character (so that matches will
never cross newlines unless the RE explicitly arranges it) and <tt>^</tt> and
<tt>@$</tt> will match the empty string after and before a newline
respectively, in addition to matching at beginning and end of string
respectively. ARE <tt>@\A</tt> and <tt>@\Z</tt> continue to match beginning or
end of string @e only.

If partial newline-sensitive matching is specified, this affects <tt>.</tt>
and bracket expressions as with newline-sensitive matching, but not
<tt>^</tt> and <tt>@$</tt>.

If inverse partial newline-sensitive matching is specified, this affects
<tt>^</tt> and <tt>@$</tt> as with newline-sensitive matching, but not
<tt>.</tt> and bracket expressions. This isn't very useful but is provided for
symmetry.
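
The case-independent rule (<tt>x</tt> behaving like <tt>[xX]</tt>) can be
sketched as below. The example uses the @c icase flag of @c std::regex as a
stand-in for a case-insensitive wxRegEx match, which is an assumption about
equivalent behavior; @c matches and @c matches_icase are hypothetical helper
names.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: whole-string match, case-sensitive.
// Assumption: std::regex (ECMAScript grammar) stands in for wxRegEx here.
bool matches(const std::string& pattern, const std::string& text)
{
    std::regex re(pattern);
    return std::regex_match(text, re);
}

// Hypothetical helper: whole-string match, ignoring case.
bool matches_icase(const std::string& pattern, const std::string& text)
{
    std::regex re(pattern, std::regex::ECMAScript | std::regex::icase);
    return std::regex_match(text, re);
}
```

Under these assumptions, <tt>x</tt> matched case-insensitively accepts
<tt>X</tt> just as the case-sensitive <tt>[xX]</tt> does, and a
case-insensitive <tt>[^x]</tt> rejects <tt>X</tt> as <tt>[^xX]</tt> would.
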
587
588
589 @section overview_resyntax_limits Limits and Compatibility
590
No particular limit is imposed on the length of REs. Programs intended to be
highly portable should not employ REs longer than 256 bytes, as a
POSIX-compliant implementation can refuse to accept such REs.

The only feature of AREs that is actually incompatible with POSIX EREs is that
<tt>@\</tt> does not lose its special significance inside bracket expressions.
All other ARE features use syntax which is illegal or has undefined or
unspecified effects in POSIX EREs; the <tt>***</tt> syntax of directors
likewise is outside the POSIX syntax for both BREs and EREs.

Many of the ARE extensions are borrowed from Perl, but some have been changed
to clean them up, and a few Perl extensions are not present.
Incompatibilities of note include <tt>@\b</tt>, <tt>@\B</tt>, the lack of
special treatment for a trailing newline, the addition of complemented bracket
expressions to the things affected by newline-sensitive matching, the
restrictions on parentheses and back references in lookahead constraints, and
the longest/shortest-match (rather than first-match) matching semantics.

The matching rules for REs containing both normal and non-greedy quantifiers
have changed since early beta-test versions of this package. (The new rules
are much simpler and cleaner, but don't work as hard at guessing the user's
real intentions.)

Henry Spencer's original 1986 @e regexp package, still in widespread use,
implemented an early version of today's EREs. There are four incompatibilities
between @e regexp's near-EREs ("RREs" for short) and AREs. In roughly
increasing order of significance:

@li In AREs, <tt>@\</tt> followed by an alphanumeric character is either an
    escape or an error, while in RREs, it was just another way of writing the
    alphanumeric. This should not be a problem because there was no reason to
    write such a sequence in RREs.
@li <tt>{</tt> followed by a digit in an ARE is the beginning of a bound,
    while in RREs, <tt>{</tt> was always an ordinary character. Such sequences
    should be rare, and will often result in an error because following
    characters will not look like a valid bound.
@li In AREs, <tt>@\</tt> remains a special character within <tt>[]</tt>, so a
    literal <tt>@\</tt> within <tt>[]</tt> must be written <tt>@\@\</tt>.
    <tt>@\@\</tt> also gives a literal <tt>@\</tt> within <tt>[]</tt> in RREs,
    but only truly paranoid programmers routinely doubled the backslash.
@li AREs report the longest/shortest match for the RE, rather than the first
    found in a specified search order. This may affect some RREs which were
    written in the expectation that the first match would be reported. (The
    careful crafting of RREs to optimize the search order for fast matching is
    obsolete (AREs examine all possible matches in parallel, and their
    performance is largely insensitive to their complexity), but cases where
    the search order was exploited to deliberately find a match which was
    @e not the longest/shortest will need rewriting.)
637
638
639 @section overview_resyntax_bre Basic Regular Expressions
640
BREs differ from EREs in several respects. <tt>|</tt>, <tt>+</tt>, and
<tt>?</tt> are ordinary characters and there is no equivalent for their
functionality. The delimiters for bounds are <tt>@\{</tt> and <tt>@\}</tt>,
with <tt>{</tt> and <tt>}</tt> by themselves ordinary characters. The
parentheses for nested subexpressions are <tt>@\(</tt> and <tt>@\)</tt>, with
<tt>(</tt> and <tt>)</tt> by themselves ordinary characters. <tt>^</tt> is an
ordinary character except at the beginning of the RE or the beginning of a
parenthesized subexpression. <tt>@$</tt> is an ordinary character except at
the end of the RE or the end of a parenthesized subexpression. <tt>*</tt> is
an ordinary character if it appears at the beginning of the RE or the
beginning of a parenthesized subexpression (after a possible leading
<tt>^</tt>). Finally, single-digit back references are available, and
<tt>@\@<</tt> and <tt>@\@></tt> are synonyms for <tt>[[:@<:]]</tt> and
<tt>[[:@>:]]</tt> respectively; no other escapes are available.
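
The BRE conventions above can be exercised with the following sketch. It
uses the POSIX basic grammar of @c std::regex as a stand-in for wxRegEx in
BRE mode, which is an assumption; the word-boundary synonyms are not covered
because the standard library is not known to provide them. @c matches_bre is
a hypothetical helper name.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of `text` matches the basic
// regular expression `pattern`.
// Assumption: std::regex in POSIX basic mode stands in for wxRegEx here.
bool matches_bre(const std::string& pattern, const std::string& text)
{
    std::regex re(pattern, std::regex::basic);
    return std::regex_match(text, re);
}
```

In this mode a bare <tt>+</tt> or <tt>|</tt> is an ordinary character, so the
pattern <tt>a+</tt> matches the two-character string <tt>a+</tt> and not
<tt>aa</tt>, while grouping and bounds need backslashes, as in
<tt>@\(ab@\)@\{2@\}</tt> for <tt>abab</tt>.
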
656
657
658 @section overview_resyntax_characters Regular Expression Character Names
659
660 Note that the character names are case sensitive.
661
662
663
664
665
666
667 NUL
668
669
670
671
672 '\0'
673
674
675
676
677
678 SOH
679
680
681
682
683 '\001'
684
685
686
687
688
689 STX
690
691
692
693
694 '\002'
695
696
697
698
699
700 ETX
701
702
703
704
705 '\003'
706
707
708
709
710
711 EOT
712
713
714
715
716 '\004'
717
718
719
720
721
722 ENQ
723
724
725
726
727 '\005'
728
729
730
731
732
733 ACK
734
735
736
737
738 '\006'
739
740
741
742
743
744 BEL
745
746
747
748
749 '\007'
750
751
752
753
754
755 alert
756
757
758
759
760 '\007'
761
762
763
764
765
766 BS
767
768
769
770
771 '\010'
772
773
774
775
776
777 backspace
778
779
780
781
782 '\b'
783
784
785
786
787
788 HT
789
790
791
792
793 '\011'
794
795
796
797
798
799 tab
800
801
802
803
804 '\t'
805
806
807
808
809
810 LF
811
812
813
814
815 '\012'
816
817
818
819
820
821 newline
822
823
824
825
826 '\n'
827
828
829
830
831
832 VT
833
834
835
836
837 '\013'
838
839
840
841
842
843 vertical-tab
844
845
846
847
848 '\v'
849
850
851
852
853
854 FF
855
856
857
858
859 '\014'
860
861
862
863
864
865 form-feed
866
867
868
869
870 '\f'
871
872
873
874
875
876 CR
877
878
879
880
881 '\015'
882
883
884
885
886
887 carriage-return
888
889
890
891
892 '\r'
893
894
895
896
897
898 SO
899
900
901
902
903 '\016'
904
905
906
907
908
909 SI
910
911
912
913
914 '\017'
915
916
917
918
919
920 DLE
921
922
923
924
925 '\020'
926
927
928
929
930
931 DC1
932
933
934
935
936 '\021'
937
938
939
940
941
942 DC2
943
944
945
946
947 '\022'
948
949
950
951
952
953 DC3
954
955
956
957
958 '\023'
959
960
961
962
963
964 DC4
965
966
967
968
969 '\024'
970
971
972
973
974
975 NAK
976
977
978
979
980 '\025'
981
982
983
984
985
986 SYN
987
988
989
990
991 '\026'
992
993
994
995
996
997 ETB
998
999
1000
1001
1002 '\027'
1003
1004
1005
1006
1007
1008 CAN
1009
1010
1011
1012
1013 '\030'
1014
1015
1016
1017
1018
1019 EM
1020
1021
1022
1023
1024 '\031'
1025
1026
1027
1028
1029
1030 SUB
1031
1032
1033
1034
1035 '\032'
1036
1037
1038
1039
1040
1041 ESC
1042
1043
1044
1045
1046 '\033'
1047
1048
1049
1050
1051
1052 IS4
1053
1054
1055
1056
1057 '\034'
1058
1059
1060
1061
1062
1063 FS
1064
1065
1066
1067
1068 '\034'
1069
1070
1071
1072
1073
1074 IS3
1075
1076
1077
1078
1079 '\035'
1080
1081
1082
1083
1084
1085 GS
1086
1087
1088
1089
1090 '\035'
1091
1092
1093
1094
1095
1096 IS2
1097
1098
1099
1100
1101 '\036'
1102
1103
1104
1105
1106
1107 RS
1108
1109
1110
1111
1112 '\036'
1113
1114
1115
1116
1117
1118 IS1
1119
1120
1121
1122
1123 '\037'
1124
1125
1126
1127
1128
1129 US
1130
1131
1132
1133
1134 '\037'
1135
1136
1137
1138
1139
1140 space
1141
1142
1143
1144
1145 ' '
1146
1147
1148
1149
1150
1151 exclamation-mark
1152
1153
1154
1155
1156 '!'
1157
1158
1159
1160
1161
1162 quotation-mark
1163
1164
1165
1166
1167 '"'
1168
1169
1170
1171
1172
1173 number-sign
1174
1175
1176
1177
1178 '#'
1179
1180
1181
1182
1183
1184 dollar-sign
1185
1186
1187
1188
1189 '$'
1190
1191
1192
1193
1194
1195 percent-sign
1196
1197
1198
1199
1200 '%'
1201
1202
1203
1204
1205
1206 ampersand
1207
1208
1209
1210
1211 ''
1212
1213
1214
1215
1216
1217 apostrophe
1218
1219
1220
1221
1222 '\''
1223
1224
1225
1226
1227
1228 left-parenthesis
1229
1230
1231
1232
1233 '('
1234
1235
1236
1237
1238
1239 right-parenthesis
1240
1241
1242
1243
1244 ')'
1245
1246
1247
1248
1249
1250 asterisk
1251
1252
1253
1254
1255 '*'
1256
1257
1258
1259
1260
1261 plus-sign
1262
1263
1264
1265
1266 '+'
1267
1268
1269
1270
1271
1272 comma
1273
1274
1275
1276
1277 ','
1278
1279
1280
1281
1282
1283 hyphen
1284
1285
1286
1287
1288 '-'
1289
1290
1291
1292
1293
1294 hyphen-minus
1295
1296
1297
1298
1299 '-'
1300
1301
1302
1303
1304
1305 period
1306
1307
1308
1309
1310 '.'
1311
1312
1313
1314
1315
1316 full-stop
1317
1318
1319
1320
1321 '.'
1322
1323
1324
1325
1326
1327 slash
1328
1329
1330
1331
1332 '/'
1333
1334
1335
1336
1337
1338 solidus
1339
1340
1341
1342
1343 '/'
1344
1345
1346
1347
1348
1349 zero
1350
1351
1352
1353
1354 '0'
1355
1356
1357
1358
1359
1360 one
1361
1362
1363
1364
1365 '1'
1366
1367
1368
1369
1370
1371 two
1372
1373
1374
1375
1376 '2'
1377
1378
1379
1380
1381
1382 three
1383
1384
1385
1386
1387 '3'
1388
1389
1390
1391
1392
1393 four
1394
1395
1396
1397
1398 '4'
1399
1400
1401
1402
1403
1404 five
1405
1406
1407
1408
1409 '5'
1410
1411
1412
1413
1414
1415 six
1416
1417
1418
1419
1420 '6'
1421
1422
1423
1424
1425
1426 seven
1427
1428
1429
1430
1431 '7'
1432
1433
1434
1435
1436
1437 eight
1438
1439
1440
1441
1442 '8'
1443
1444
1445
1446
1447
1448 nine
1449
1450
1451
1452
1453 '9'
1454
1455
1456
1457
1458
1459 colon
1460
1461
1462
1463
1464 ':'
1465
1466
1467
1468
1469
1470 semicolon
1471
1472
1473
1474
1475 ';'
1476
1477
1478
1479
1480
1481 less-than-sign
1482
1483
1484
1485
1486 ''
1487
1488
1489
1490
1491
1492 equals-sign
1493
1494
1495
1496
1497 '='
1498
1499
1500
1501
1502
1503 greater-than-sign
1504
1505
1506
1507
1508 ''
1509
1510
1511
1512
1513
1514 question-mark
1515
1516
1517
1518
1519 '?'
1520
1521
1522
1523
1524
1525 commercial-at
1526
1527
1528
1529
1530 '@'
1531
1532
1533
1534
1535
1536 left-square-bracket
1537
1538
1539
1540
1541 '['
1542
1543
1544
1545
1546
1547 backslash
1548
1549
1550
1551
1552 '\'
1553
1554
1555
1556
1557
1558 reverse-solidus
1559
1560
1561
1562
1563 '\'
1564
1565
1566
1567
1568
1569 right-square-bracket
1570
1571
1572
1573
1574 ']'
1575
1576
1577
1578
1579
1580 circumflex
1581
1582
1583
1584
1585 '^'
1586
1587
1588
1589
1590
1591 circumflex-accent
1592
1593
1594
1595
1596 '^'
1597
1598
1599
1600
1601
1602 underscore
1603
1604
1605
1606
1607 '_'
1608
1609
1610
1611
1612
1613 low-line
1614
1615
1616
1617
1618 '_'
1619
1620
1621
1622
1623
1624 grave-accent
1625
1626
1627
1628
1629 '''
1630
1631
1632
1633
1634
1635 left-brace
1636
1637
1638
1639
1640 '{'
1641
1642
1643
1644
1645
1646 left-curly-bracket
1647
1648
1649
1650
1651 '{'
1652
1653
1654
1655
1656
1657 vertical-line
1658
1659
1660
1661
1662 '|'
1663
1664
1665
1666
1667
1668 right-brace
1669
1670
1671
1672
1673 '}'
1674
1675
1676
1677
1678
1679 right-curly-bracket
1680
1681
1682
1683
1684 '}'
1685
1686
1687
1688
1689
1690 tilde
1691
1692
1693
1694
1695 '~'
1696
1697
1698
1699
1700
1701 DEL
1702
1703
1704
1705
1706 '\177'
1707
1708 */
1709