/////////////////////////////////////////////////////////////////////////////
// Name: resyntax.h
// Purpose: topic overview
// Author: wxWidgets team
// RCS-ID: $Id$
// Licence: wxWindows licence
/////////////////////////////////////////////////////////////////////////////

/**

@page overview_resyntax Regular Expressions

A <em>regular expression</em> describes strings of characters. It's a pattern
that matches certain strings and doesn't match others.

@li @ref overview_resyntax_differentflavors
@li @ref overview_resyntax_syntax
@li @ref overview_resyntax_bracket
@li @ref overview_resyntax_escapes
@li @ref overview_resyntax_metasyntax
@li @ref overview_resyntax_matching
@li @ref overview_resyntax_limits
@li @ref overview_resyntax_bre
@li @ref overview_resyntax_characters

@see

@li wxRegEx


<hr>


@section overview_resyntax_differentflavors Different Flavors of Regular Expressions

Regular expressions (RE), as defined by POSIX, come in two flavors:
<em>extended regular expressions</em> (ERE) and <em>basic regular
expressions</em> (BRE). EREs are roughly those of the traditional @e egrep,
while BREs are roughly those of the traditional @e ed. This implementation
adds a third flavor: <em>advanced regular expressions</em> (ARE), basically
EREs with some significant extensions.

This manual page primarily describes AREs. BREs mostly exist for backward
compatibility in some old programs. POSIX EREs are almost an exact subset of
AREs. Features of AREs that are not present in EREs will be indicated.


@section overview_resyntax_syntax Regular Expression Syntax

These regular expressions are implemented using the package written by Henry
Spencer, based on the 1003.2 spec and some (not quite all) of the Perl5
extensions (thanks, Henry!). Much of the description of regular expressions
below is copied verbatim from his manual entry.

An ARE is one or more @e branches, separated by "|", matching anything that
matches any of the branches.

A branch is zero or more @e constraints or @e quantified atoms, concatenated.
It matches a match for the first, followed by a match for the second, etc; an
empty branch matches the empty string.

A quantified atom is an @e atom possibly followed by a single @e quantifier.
Without a quantifier, it matches a match for the atom. The quantifiers, and
what a so-quantified atom matches, are:

@beginTable
@row2col{ <tt>*</tt> ,
          A sequence of 0 or more matches of the atom. }
@row2col{ <tt>+</tt> ,
          A sequence of 1 or more matches of the atom. }
@row2col{ <tt>?</tt> ,
          A sequence of 0 or 1 matches of the atom. }
@row2col{ <tt>{m}</tt> ,
          A sequence of exactly @e m matches of the atom. }
@row2col{ <tt>{m\,}</tt> ,
          A sequence of @e m or more matches of the atom. }
@row2col{ <tt>{m\,n}</tt> ,
          A sequence of @e m through @e n (inclusive) matches of the atom; @e m may
          not exceed @e n. }
@row2col{ <tt>*? +? ?? {m}? {m\,}? {m\,n}?</tt> ,
          @e Non-greedy quantifiers, which match the same possibilities, but prefer
          the smallest number rather than the largest number of matches (see
          @ref overview_resyntax_matching). }
@endTable
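
As a rough illustration of the quantifiers in the table above, the sketch below
uses Python's @c re module as a stand-in. That choice is an assumption made to
keep the example self-contained: it is a different engine from the Henry
Spencer package described here, but greedy quantifiers, non-greedy quantifiers
and bounds give the same results on these particular inputs. All variable names
are illustrative only.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* takes as many characters as it can before the final '>'.
greedy = re.search(r"<.*>", text).group(0)

# Non-greedy: .*? stops at the first '>' that lets the match succeed.
lazy = re.search(r"<.*?>", text).group(0)

# Bounds count repetitions of the preceding atom.
exactly_three = re.fullmatch(r"a{3}", "aaa") is not None
two_or_more = re.fullmatch(r"a{2,}", "aaaaa") is not None
one_to_two = re.fullmatch(r"a{1,2}", "aaa") is not None  # three is too many

print(greedy)  # the whole input string
print(lazy)    # <b>
```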

The forms using @b { and @b } are known as @e bounds. The numbers @e m and
@e n are unsigned decimal integers with permissible values from 0 to 255
inclusive. An atom is one of:

@beginTable
@row2col{ <tt>(re)</tt> ,
          Where @e re is any regular expression, matches for @e re, with the match
          captured for possible reporting. }
@row2col{ <tt>(?:re)</tt> ,
          As previous, but does no reporting (a "non-capturing" set of
          parentheses). }
@row2col{ <tt>()</tt> ,
          Matches an empty string, captured for possible reporting. }
@row2col{ <tt>(?:)</tt> ,
          Matches an empty string, without reporting. }
@row2col{ <tt>[chars]</tt> ,
          A <em>bracket expression</em>, matching any one of the @e chars (see
          @ref overview_resyntax_bracket for more details). }
@row2col{ <tt>.</tt> ,
          Matches any single character. }
@row2col{ <tt>@\k</tt> ,
          Where @e k is a non-alphanumeric character, matches that character taken
          as an ordinary character, e.g. @\@\ matches a backslash character. }
@row2col{ <tt>@\c</tt> ,
          Where @e c is alphanumeric (possibly followed by other characters), an
          @e escape (AREs only), see @ref overview_resyntax_escapes below. }
@row2col{ <tt>@leftCurly</tt> ,
          When followed by a character other than a digit, matches the left-brace
          character "@leftCurly"; when followed by a digit, it is the beginning of a
          @e bound (see above). }
@row2col{ <tt>x</tt> ,
          Where @e x is a single character with no other significance, matches that
          character. }
@endTable
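
The difference between capturing and non-capturing parentheses, and the effect
of escaping a non-alphanumeric character, can be sketched as follows. Python's
@c re module is used here purely as an assumed stand-in for the engine this
page documents; the atoms shown behave alike in both, and the sample strings
are made up for the illustration.

```python
import re

# (re) captures its match for later reporting; (?:re) only groups.
capturing = re.search(r"(\d+)-(\d+)", "range 10-20")
grouping_only = re.search(r"(?:\d+)-(\d+)", "range 10-20")

print(capturing.groups())      # ('10', '20')
print(grouping_only.groups())  # ('20',)

# A backslash before a non-alphanumeric character makes it ordinary,
# so the escaped dot matches only a literal dot.
literal_dot = re.findall(r"a\.c", "a.c abc")
any_char = re.findall(r"a.c", "a.c abc")

print(literal_dot)  # ['a.c']
print(any_char)     # ['a.c', 'abc']
```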

A @e constraint matches an empty string when specific conditions are met. A
constraint may not be followed by a quantifier. The simple constraints are as
follows; some more constraints are described later, under
@ref overview_resyntax_escapes.

@beginTable
@row2col{ <tt>^</tt> ,
          Matches at the beginning of a line. }
@row2col{ <tt>@$</tt> ,
          Matches at the end of a line. }
@row2col{ <tt>(?=re)</tt> ,
          @e Positive lookahead (AREs only), matches at any point where a substring
          matching @e re begins. }
@row2col{ <tt>(?!re)</tt> ,
          @e Negative lookahead (AREs only), matches at any point where no substring
          matching @e re begins. }
@endTable

The lookahead constraints may not contain back references (see later), and all
parentheses within them are considered non-capturing. An RE may not end with
<tt>@\</tt>.
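
A minimal sketch of the two lookahead constraints follows, again assuming
Python's @c re module as a stand-in engine (its lookahead syntax is the same,
although it is not the package documented here). The sample text is invented
for the example.

```python
import re

words = "foobar foobaz food"

# Positive lookahead: match "foo" only where "bar" follows. The lookahead
# itself consumes nothing, so the match ends right after "foo".
positive = re.search(r"foo(?=bar)", words)

# Negative lookahead: match "foo" only where "bar" does not follow.
negative_starts = [m.start() for m in re.finditer(r"foo(?!bar)", words)]

print(positive.span())   # (0, 3)
print(negative_starts)   # [7, 14]
```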


@section overview_resyntax_bracket Bracket Expressions

A <em>bracket expression</em> is a list of characters enclosed in <tt>[]</tt>.
It normally matches any single character from the list (but see below). If the
list begins with @c ^, it matches any single character (but see below) @e not
from the rest of the list.

If two characters in the list are separated by <tt>-</tt>, this is shorthand
for the full @e range of characters between those two (inclusive) in the
collating sequence, e.g. <tt>[0-9]</tt> in ASCII matches any decimal digit.
Two ranges may not share an endpoint, so e.g. <tt>a-c-e</tt> is illegal.
Ranges are very collating-sequence-dependent, and portable programs should
avoid relying on them.

To include a literal <tt>]</tt> or <tt>-</tt> in the list, the simplest method
is to enclose it in <tt>[.</tt> and <tt>.]</tt> to make it a collating element
(see below). Alternatively, make it the first character (following a possible
<tt>^</tt>), or (AREs only) precede it with <tt>@\</tt>. Alternatively, for
<tt>-</tt>, make it the last character, or the second endpoint of a range. To
use a literal <tt>-</tt> as the first endpoint of a range, make it a collating
element or (AREs only) precede it with <tt>@\</tt>. With the exception of
these, some combinations using <tt>[</tt> (see next paragraphs), and escapes,
all other special characters lose their special significance within a bracket
expression.
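
The positional rules for a literal <tt>]</tt> and <tt>-</tt> can be tried out
with the sketch below. It assumes Python's @c re module as a stand-in, which
follows the same "first character" and "last character" conventions but does
not support the <tt>[.</tt> <tt>.]</tt> collating-element form, so that form is
not shown.

```python
import re

# A range: any decimal digit.
digits = re.findall(r"[0-9]", "a1b22")

# ']' placed first in the list is taken literally.
closers = re.findall(r"[])]", "f(x)[y]")

# '-' placed last in the list is taken literally.
signs = re.findall(r"[+-]", "3-2+1")

# A leading '^' negates the list; ']' directly after it is still literal.
not_closers = re.findall(r"[^])]", "(x)]")

print(digits)       # ['1', '2', '2']
print(closers)      # [')', ']']
print(signs)        # ['-', '+']
print(not_closers)  # ['(', 'x']
```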

Within a bracket expression, a collating element (a character, a
multi-character sequence that collates as if it were a single character, or a
collating-sequence name for either) enclosed in <tt>[.</tt> and <tt>.]</tt>
stands for the sequence of characters of that collating element.

@e wxWidgets: Currently no multi-character collating elements are defined. So
in <tt>[.X.]</tt>, @c X can either be a single character literal or the name
of a character. For example, <tt>[[.0.]-[.9.]]</tt> and
<tt>[[.zero.]-[.nine.]]</tt> are identical, and both mean the same as
<tt>[0-9]</tt>. See @ref overview_resyntax_characters.

Within a bracket expression, a collating element enclosed in <tt>[=</tt> and
<tt>=]</tt> is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself. An
equivalence class may not be an endpoint of a range.

@e wxWidgets: Currently no equivalence classes are defined, so <tt>[=X=]</tt>
stands for just the single character @c X. @c X can either be a single
character literal or the name of a character, see
@ref overview_resyntax_characters.

Within a bracket expression, the name of a @e character class enclosed in
<tt>[:</tt> and <tt>:]</tt> stands for the list of all characters (not all
collating elements!) belonging to that class. Standard character classes are:

@beginTable
@row2col{ <tt>alpha</tt> , A letter. }
@row2col{ <tt>upper</tt> , An upper-case letter. }
@row2col{ <tt>lower</tt> , A lower-case letter. }
@row2col{ <tt>digit</tt> , A decimal digit. }
@row2col{ <tt>xdigit</tt> , A hexadecimal digit. }
@row2col{ <tt>alnum</tt> , An alphanumeric (letter or digit). }
@row2col{ <tt>print</tt> , An alphanumeric (same as alnum). }
@row2col{ <tt>blank</tt> , A space or tab character. }
@row2col{ <tt>space</tt> , A character producing white space in displayed text. }
@row2col{ <tt>punct</tt> , A punctuation character. }
@row2col{ <tt>graph</tt> , A character with a visible representation. }
@row2col{ <tt>cntrl</tt> , A control character. }
@endTable

A character class may not be used as an endpoint of a range.

@e wxWidgets: In a non-Unicode build, these character classifications depend on
the current locale, and correspond to the values returned by the ANSI C "is"
functions: <tt>isalpha</tt>, <tt>isupper</tt>, etc. In Unicode mode they are
based on Unicode classifications, and are not affected by the current locale.

There are two special cases of bracket expressions: the bracket expressions
<tt>[[:@<:]]</tt> and <tt>[[:@>:]]</tt> are constraints, matching empty strings at
the beginning and end of a word respectively. A word is defined as a sequence
of word characters that is neither preceded nor followed by word characters. A
word character is an @e alnum character or an underscore (_). These special
bracket expressions are deprecated; users of AREs should use constraint escapes
instead (see @ref overview_resyntax_escapes below).


@section overview_resyntax_escapes Escapes

Escapes (AREs only), which begin with a <tt>@\</tt> followed by an alphanumeric
character, come in several varieties: character entry, class shorthands,
constraint escapes, and back references. A <tt>@\</tt> followed by an
alphanumeric character but not constituting a valid escape is illegal in AREs.
In EREs, there are no escapes: outside a bracket expression, a <tt>@\</tt>
followed by an alphanumeric character merely stands for that character as an
ordinary character, and inside a bracket expression, <tt>@\</tt> is an ordinary
character. (The latter is the one actual incompatibility between EREs and
AREs.)

Character-entry escapes (AREs only) exist to make it easier to specify
non-printing and otherwise inconvenient characters in REs:

@beginTable
@row2col{ <tt>@\a</tt> , Alert (bell) character, as in C. }
@row2col{ <tt>@\b</tt> , Backspace, as in C. }
@row2col{ <tt>@\B</tt> ,
          Synonym for <tt>@\</tt> to help reduce backslash doubling in some
          applications where there are multiple levels of backslash processing. }
@row2col{ <tt>@\cX</tt> ,
          The character whose low-order 5 bits are the same as those of @e X, and
          whose other bits are all zero, where @e X is any character. }
@row2col{ <tt>@\e</tt> ,
          The character whose collating-sequence name is @c ESC, or failing that,
          the character with octal value 033. }
@row2col{ <tt>@\f</tt> , Formfeed, as in C. }
@row2col{ <tt>@\n</tt> , Newline, as in C. }
@row2col{ <tt>@\r</tt> , Carriage return, as in C. }
@row2col{ <tt>@\t</tt> , Horizontal tab, as in C. }
@row2col{ <tt>@\uwxyz</tt> ,
          The Unicode character <tt>U+wxyz</tt> in the local byte ordering, where
          @e wxyz is exactly four hexadecimal digits. }
@row2col{ <tt>@\Ustuvwxyz</tt> ,
          Reserved for a somewhat-hypothetical Unicode extension to 32 bits, where
          @e stuvwxyz is exactly eight hexadecimal digits. }
@row2col{ <tt>@\v</tt> , Vertical tab, as in C. }
@row2col{ <tt>@\xhhh</tt> ,
          The single character whose hexadecimal value is @e 0xhhh, where @e hhh is
          any sequence of hexadecimal digits. }
@row2col{ <tt>@\0</tt> , The character whose value is 0. }
@row2col{ <tt>@\xy</tt> ,
          The character whose octal value is @e 0xy, where @e xy is exactly two octal
          digits, and is not a <em>back reference</em> (see below). }
@row2col{ <tt>@\xyz</tt> ,
          The character whose octal value is @e 0xyz, where @e xyz is exactly three
          octal digits, and is not a <em>back reference</em> (see below). }
@endTable

Hexadecimal digits are 0-9, a-f, and A-F. Octal digits are 0-7.

The character-entry escapes are always taken as ordinary characters. For
example, <tt>@\135</tt> is <tt>]</tt> in ASCII, but <tt>@\135</tt> does not
terminate a bracket expression. Beware, however, that some applications (e.g.,
C compilers) interpret such sequences themselves before the regular-expression
package gets to see them, which may require doubling (quadrupling, etc.) the
'<tt>@\</tt>'.
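
The backslash-doubling issue, and the point that a character-entry escape is an
ordinary character even inside brackets, can be sketched as below. The example
assumes Python's @c re module as a stand-in: its string literals show the same
two levels of backslash processing that a C or C++ compiler introduces, but its
hexadecimal escape takes exactly two digits rather than any number, so only a
two-digit form is used.

```python
import re

# Raw string: one level of processing, the regex engine sees backslash-t.
raw = re.compile(r"a\tb")

# Ordinary string literal: the language consumes one backslash first, so it
# must be doubled for the engine to see the same two characters.
doubled = re.compile("a\\tb")

same_pattern = raw.pattern == doubled.pattern
matches_tab = raw.search("a\tb") is not None

# Hexadecimal and octal character entry.
hex_a = re.fullmatch(r"\x41", "A") is not None
octal_bracket = re.fullmatch(r"\135", "]") is not None

# Octal 135 is ']' but it does not terminate the bracket expression.
inside_brackets = re.findall(r"[\135a]", "a]b")

print(same_pattern, matches_tab)  # True True
print(inside_brackets)            # ['a', ']']
```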

Class-shorthand escapes (AREs only) provide shorthands for certain
commonly-used character classes:

@beginTable
@row2col{ <tt>@\d</tt> , <tt>[[:digit:]]</tt> }
@row2col{ <tt>@\s</tt> , <tt>[[:space:]]</tt> }
@row2col{ <tt>@\w</tt> , <tt>[[:alnum:]_]</tt> (note underscore) }
@row2col{ <tt>@\D</tt> , <tt>[^[:digit:]]</tt> }
@row2col{ <tt>@\S</tt> , <tt>[^[:space:]]</tt> }
@row2col{ <tt>@\W</tt> , <tt>[^[:alnum:]_]</tt> (note underscore) }
@endTable

Within bracket expressions, <tt>@\d</tt>, <tt>@\s</tt>, and <tt>@\w</tt> lose
their outer brackets, and <tt>@\D</tt>, <tt>@\S</tt>, <tt>@\W</tt> are illegal.
So, for example, <tt>[a-c@\d]</tt> is equivalent to <tt>[a-c[:digit:]]</tt>.
Also, <tt>[a-c@\D]</tt>, which is equivalent to <tt>[a-c^[:digit:]]</tt>, is
illegal.
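
The class shorthands can be exercised with the following sketch, which assumes
Python's @c re module as a stand-in and passes its @c re.ASCII flag so that the
shorthands are restricted to ASCII. Python does not share the restriction on
the upper-case shorthands inside brackets, so that rule is not demonstrated.

```python
import re

sample = "id_42 = 7\n"

digits = re.findall(r"\d", sample, re.ASCII)
words = re.findall(r"\w+", sample, re.ASCII)    # note: underscore is included
spaces = re.findall(r"\s", sample, re.ASCII)
non_word = re.findall(r"\W", sample, re.ASCII)

# Inside a bracket expression the digit shorthand simply adds the digits.
mixed = re.findall(r"[a-c\d]", "abcd123", re.ASCII)

print(digits)  # ['4', '2', '7']
print(words)   # ['id_42', '7']
print(mixed)   # ['a', 'b', 'c', '1', '2', '3']
```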

A constraint escape (AREs only) is a constraint, matching the empty string if
specific conditions are met, written as an escape:

@beginTable
@row2col{ <tt>@\A</tt> , Matches only at the beginning of the string, see
                         @ref overview_resyntax_matching for how this differs
                         from <tt>^</tt>. }
@row2col{ <tt>@\m</tt> , Matches only at the beginning of a word. }
@row2col{ <tt>@\M</tt> , Matches only at the end of a word. }
@row2col{ <tt>@\y</tt> , Matches only at the beginning or end of a word. }
@row2col{ <tt>@\Y</tt> , Matches only at a point that is not the beginning or
                         end of a word. }
@row2col{ <tt>@\Z</tt> , Matches only at the end of the string, see
                         @ref overview_resyntax_matching for how this differs
                         from <tt>@$</tt>. }
@row2col{ <tt>@\m</tt> , A <em>back reference</em>, where @e m is a non-zero
                         digit. See below. }
@row2col{ <tt>@\mnn</tt> ,
          A <em>back reference</em>, where @e m is a non-zero digit, and @e nn is some
          more digits, and the decimal value @e mnn is not greater than the number of
          closing capturing parentheses seen so far. See below. }
@endTable

A word is defined as in the specification of <tt>[[:@<:]]</tt> and
<tt>[[:@>:]]</tt> above. Constraint escapes are illegal within bracket
expressions.

A back reference (AREs only) matches the same string matched by the
parenthesized subexpression specified by the number. For example, "([bc])\1"
matches "bb" or "cc" but not "bc". The subexpression must entirely precede the
back reference in the RE. Subexpressions are numbered in the order of their
leading parentheses. Non-capturing parentheses do not define subexpressions.
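
The back-reference example from the paragraph above can be checked with a short
sketch. It assumes Python's @c re module as a stand-in engine, which treats a
single-digit back reference the same way; the second pattern is a made-up
addition showing that non-capturing parentheses are skipped when numbering.

```python
import re

doubled = re.compile(r"([bc])\1")
results = {s: doubled.fullmatch(s) is not None for s in ("bb", "cc", "bc")}

# The non-capturing group is not numbered, so group 1 is still ([bc]).
skipping = re.compile(r"(?:a)([bc])\1")
abb = skipping.fullmatch("abb") is not None
aba = skipping.fullmatch("aba") is not None

print(results)   # {'bb': True, 'cc': True, 'bc': False}
print(abb, aba)  # True False
```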

There is an inherent historical ambiguity between octal character-entry escapes
and back references, which is resolved by heuristics, as hinted at above. A
leading zero always indicates an octal escape. A single non-zero digit, not
followed by another digit, is always taken as a back reference. A multi-digit
sequence not starting with a zero is taken as a back reference if it comes
after a suitable subexpression (i.e. the number is in the legal range for a
back reference), and otherwise is taken as octal.


@section overview_resyntax_metasyntax Metasyntax

In addition to the main syntax described above, there are some special forms
and miscellaneous syntactic facilities available.

Normally the flavor of RE being used is specified by application-dependent
means. However, this can be overridden by a @e director. If an RE of any flavor
begins with <tt>***:</tt>, the rest of the RE is an ARE. If an RE of any
flavor begins with <tt>***=</tt>, the rest of the RE is taken to be a literal
string, with all characters considered ordinary characters.

An ARE may begin with <em>embedded options</em>: a sequence <tt>(?xyz)</tt>
(where @e xyz is one or more alphabetic characters) specifies options affecting
the rest of the RE. These supplement, and can override, any options specified
by the application. The available option letters are:

@beginTable
@row2col{ <tt>b</tt> , Rest of RE is a BRE. }
@row2col{ <tt>c</tt> , Case-sensitive matching (usual default). }
@row2col{ <tt>e</tt> , Rest of RE is an ERE. }
@row2col{ <tt>i</tt> , Case-insensitive matching (see
                       @ref overview_resyntax_matching, below). }
@row2col{ <tt>m</tt> , Historical synonym for @e n. }
@row2col{ <tt>n</tt> , Newline-sensitive matching (see
                       @ref overview_resyntax_matching, below). }
@row2col{ <tt>p</tt> , Partial newline-sensitive matching (see
                       @ref overview_resyntax_matching, below). }
@row2col{ <tt>q</tt> , Rest of RE is a literal ("quoted") string, all ordinary
                       characters. }
@row2col{ <tt>s</tt> , Non-newline-sensitive matching (usual default). }
@row2col{ <tt>t</tt> , Tight syntax (usual default; see below). }
@row2col{ <tt>w</tt> , Inverse partial newline-sensitive ("weird") matching
                       (see @ref overview_resyntax_matching, below). }
@row2col{ <tt>x</tt> , Expanded syntax (see below). }
@endTable

Embedded options take effect at the <tt>)</tt> terminating the sequence. They
are available only at the start of an ARE, and may not be used later within it.
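
How an embedded option overrides the application's default can be sketched with
the <tt>i</tt> letter. Python's @c re module is assumed as a stand-in here; it
accepts the same <tt>(?i)</tt> prefix, but its set of option letters is its
own, so letters such as <tt>b</tt>, <tt>e</tt> or <tt>q</tt> from the table
above have no counterpart in it.

```python
import re

# Option embedded at the very start of the pattern.
embedded = re.search(r"(?i)hello", "Say HELLO")

# The same option supplied by the application instead.
flagged = re.search(r"hello", "Say HELLO", re.IGNORECASE)

# Default (case-sensitive) matching finds nothing.
sensitive = re.search(r"hello", "Say HELLO")

print(embedded.group(0))  # HELLO
print(flagged.group(0))   # HELLO
print(sensitive)          # None
```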

In addition to the usual (@e tight) RE syntax, in which all characters are
significant, there is an @e expanded syntax, available in AREs with the
embedded x option. In the expanded syntax, white-space characters are ignored
and all characters between a <tt>@#</tt> and the following newline (or the end
of the RE) are ignored, permitting paragraphing and commenting a complex RE.
There are three exceptions to that basic rule:

@li A white-space character or <tt>@#</tt> preceded by <tt>@\</tt> is retained.
@li White space or <tt>@#</tt> within a bracket expression is retained.
@li White space and comments are illegal within multi-character symbols like
    the ARE <tt>(?:</tt> or the BRE <tt>\(</tt>.

Expanded-syntax white-space characters are blank, tab, newline, and any
character that belongs to the @e space character class.
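
The expanded syntax and its first two exceptions can be sketched as follows.
The example assumes Python's @c re module as a stand-in, whose verbose mode is
enabled by the same embedded <tt>x</tt> option and follows the same rules for
brackets and escaped characters; the pattern and names are invented for the
illustration.

```python
import re

expanded = re.compile(r"""(?x)   # the x option enables expanded syntax
    (\d{4}) - (\d{2})            # year and month; these spaces are ignored
    [ #]                         # inside brackets a space or '#' is retained
    \#                           # a backslash also keeps '#' literal
""")

# The equivalent pattern written in the usual tight syntax.
tight = re.compile(r"(\d{4})-(\d{2})[ #]\#")

for candidate in ("2024-05 #", "2024-05##", "2024-05"):
    same = (expanded.fullmatch(candidate) is None) == (tight.fullmatch(candidate) is None)
    print(candidate, same)

groups = expanded.fullmatch("2024-05 #").groups()
print(groups)  # ('2024', '05')
```

Both spellings accept and reject the same inputs; the expanded one only adds
layout and comments.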

Finally, in an ARE, outside bracket expressions, the sequence <tt>(?@#ttt)</tt>
(where @e ttt is any text not containing a <tt>)</tt>) is a comment, completely
ignored. Again, this is not allowed between the characters of multi-character
symbols like <tt>(?:</tt>. Such comments are more a historical artifact than a
useful facility, and their use is deprecated; use the expanded syntax instead.

@e None of these metasyntax extensions is available if the application (or an
initial <tt>***=</tt> director) has specified that the user's input be treated
as a literal string rather than as an RE.


@section overview_resyntax_matching Matching

In the event that an RE could match more than one substring of a given string,
the RE matches the one starting earliest in the string. If the RE could match
more than one substring starting at that point, the choice is determined by
its @e preference: either the longest substring, or the shortest.

Most atoms, and all constraints, have no preference. A parenthesized RE has the
same preference (possibly none) as the RE. A quantified atom with quantifier
<tt>{m}</tt> or <tt>{m}?</tt> has the same preference (possibly none) as the
atom itself. A quantified atom with other normal quantifiers (including
<tt>{m,n}</tt> with @e m equal to @e n) prefers longest match. A quantified
atom with other non-greedy quantifiers (including <tt>{m,n}?</tt> with @e m
equal to @e n) prefers shortest match. A branch has the same preference as the
first quantified atom in it which has a preference. An RE consisting of two or
more branches connected by the @c | operator prefers longest match.

Subject to the constraints imposed by the rules for matching the whole RE,
subexpressions also match the longest or shortest possible substrings, based on
their preferences, with subexpressions starting earlier in the RE taking
priority over ones starting later. Note that outer subexpressions thus take
priority over their component subexpressions.

Note that the quantifiers <tt>{1,1}</tt> and <tt>{1,1}?</tt> can be used to
force longest and shortest preference, respectively, on a subexpression or a
whole RE.

Match lengths are measured in characters, not collating elements. An empty
string is considered longer than no match at all. For example, <tt>bb*</tt>
matches the three middle characters of "abbbc",
<tt>(week|wee)(night|knights)</tt> matches all ten characters of "weeknights",
when <tt>(.*).*</tt> is matched against "abc" the parenthesized subexpression
matches all three characters, and when <tt>(a*)*</tt> is matched against "bc"
both the whole RE and the parenthesized subexpression match an empty string.

If case-independent matching is specified, the effect is much as if all case
distinctions had vanished from the alphabet. When an alphabetic that exists in
multiple cases appears as an ordinary character outside a bracket expression,
it is effectively transformed into a bracket expression containing both cases,
so that @c x becomes @c [xX]. When it appears inside a bracket expression, all
case counterparts of it are added to the bracket expression, so that @c [x]
becomes @c [xX] and @c [^x] becomes @c [^xX].
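
The rewriting described above (an ordinary letter becoming a two-case bracket
expression, and a negated bracket gaining both cases) can be checked with the
sketch below. It assumes Python's @c re module as a stand-in and simply
compares a case-insensitive pattern with its hand-expanded form on a few
invented samples.

```python
import re

insensitive = re.compile(r"x[^y]", re.IGNORECASE)

# What the case-independent pattern is effectively transformed into.
by_hand = re.compile(r"[xX][^yY]")

samples = ["xa", "Xa", "xY", "XY", "xy"]
agree = all(
    (insensitive.fullmatch(s) is None) == (by_hand.fullmatch(s) is None)
    for s in samples
)

print(agree)  # True
```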

If newline-sensitive matching is specified, "." and bracket expressions using
"^" will never match the newline character (so that matches will never cross
newlines unless the RE explicitly arranges it) and "^" and "$" will match the
empty string after and before a newline respectively, in addition to matching
at beginning and end of string respectively. ARE <tt>@\A</tt> and <tt>@\Z</tt>
continue to match beginning or end of string @e only.
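
The difference between the line anchors and the string-only anchors can be
sketched as follows. Python's @c re module is assumed as a stand-in, and the
correspondence is only approximate: its @c re.MULTILINE flag covers the anchor
half of newline-sensitive matching, while its defaults for "." differ from the
ones described on this page, so only the anchors are shown.

```python
import re

text = "first\nsecond"

# With line-sensitive anchors, ^ also matches just after a newline...
line_starts = [m.start() for m in re.finditer(r"^\w+", text, re.MULTILINE)]

# ...while the string-only anchor still matches at the very beginning only.
string_starts = [m.start() for m in re.finditer(r"\A\w+", text, re.MULTILINE)]

# Likewise $ matches before the newline and at the end; the string-only
# anchor matches at the end only.
line_ends = [m.start() for m in re.finditer(r"$", text, re.MULTILINE)]
string_ends = [m.start() for m in re.finditer(r"\Z", text, re.MULTILINE)]

print(line_starts, string_starts)  # [0, 6] [0]
print(line_ends, string_ends)      # [5, 12] [12]
```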

If partial newline-sensitive matching is specified, this affects "." and
bracket expressions as with newline-sensitive matching, but not "^" and "$".

If inverse partial newline-sensitive matching is specified, this affects "^"
and "$" as with newline-sensitive matching, but not "." and bracket
expressions. This isn't very useful but is provided for symmetry.


@section overview_resyntax_limits Limits and Compatibility

No particular limit is imposed on the length of REs. Programs intended to be
highly portable should not employ REs longer than 256 bytes, as a
POSIX-compliant implementation can refuse to accept such REs.

The only feature of AREs that is actually incompatible with POSIX EREs is that
<tt>@\</tt> does not lose its special significance inside bracket expressions.
All other ARE features use syntax which is illegal or has undefined or
unspecified effects in POSIX EREs; the <tt>***</tt> syntax of directors
likewise is outside the POSIX syntax for both BREs and EREs.

Many of the ARE extensions are borrowed from Perl, but some have been changed
to clean them up, and a few Perl extensions are not present. Incompatibilities
of note include <tt>@\b</tt>, <tt>@\B</tt>, the lack of special treatment for a
trailing newline, the addition of complemented bracket expressions to the
things affected by newline-sensitive matching, the restrictions on parentheses
and back references in lookahead constraints, and the longest/shortest-match
(rather than first-match) matching semantics.
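
The last incompatibility can be observed from the other side. Python's @c re
module is a Perl-style first-match engine and is used here only as an assumed
point of comparison, not as the engine this page documents. On the
<tt>(week|wee)(night|knights)</tt> example from @ref overview_resyntax_matching
it stops at the first combination of alternatives that succeeds, so it reports
nine characters where the ARE rules give all ten, and its answer changes when
the alternatives are reordered.

```python
import re

subject = "weeknights"

# First-match semantics: "week" then "night" succeeds, so the search stops.
first_match = re.search(r"(week|wee)(night|knights)", subject)

# Reordering the alternatives changes what a first-match engine reports.
reordered = re.search(r"(wee|week)(knights|night)", subject)

print(first_match.group(0), first_match.groups())  # weeknight ('week', 'night')
print(reordered.group(0), reordered.groups())      # weeknights ('wee', 'knights')
```

Under the longest-match rule described on this page, both spellings would be
expected to report the full ten characters.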

The matching rules for REs containing both normal and non-greedy quantifiers
have changed since early beta-test versions of this package. The new rules are
much simpler and cleaner, but don't work as hard at guessing the user's real
intentions.

Henry Spencer's original 1986 @e regexp package, still in widespread use,
implemented an early version of today's EREs. There are four incompatibilities
between @e regexp's near-EREs (RREs for short) and AREs. In roughly increasing
order of significance:

@li In AREs, <tt>@\</tt> followed by an alphanumeric character is either an
    escape or an error, while in RREs, it was just another way of writing the
    alphanumeric. This should not be a problem because there was no reason to
    write such a sequence in RREs.
@li @c { followed by a digit in an ARE is the beginning of a bound, while in
    RREs, @c { was always an ordinary character. Such sequences should be rare,
    and will often result in an error because following characters will not
    look like a valid bound.
@li In AREs, @c @\ remains a special character within @c [], so a literal @c @\
    within @c [] must be written as <tt>@\@\</tt>. <tt>@\@\</tt> also gives a
    literal @c @\ within @c [] in RREs, but only truly paranoid programmers
    routinely doubled the backslash.
@li AREs report the longest/shortest match for the RE, rather than the first
    found in a specified search order. This may affect some RREs which were
    written in the expectation that the first match would be reported. The
    careful crafting of RREs to optimize the search order for fast matching is
    obsolete (AREs examine all possible matches in parallel, and their
    performance is largely insensitive to their complexity) but cases where the
    search order was exploited to deliberately find a match which was @e not
    the longest/shortest will need rewriting.


@section overview_resyntax_bre Basic Regular Expressions

BREs differ from EREs in several respects. @c |, @c +, and @c ? are ordinary
characters and there is no equivalent for their functionality. The delimiters
for bounds are @c @\{ and @c @\}, with @c { and @c } by themselves ordinary
characters. The parentheses for nested subexpressions are @c @\( and @c @\),
with @c ( and @c ) by themselves ordinary characters. @c ^ is an ordinary
character except at the beginning of the RE or the beginning of a parenthesized
subexpression, @c $ is an ordinary character except at the end of the RE or the
end of a parenthesized subexpression, and @c * is an ordinary character if it
appears at the beginning of the RE or the beginning of a parenthesized
subexpression (after a possible leading <tt>^</tt>). Finally, single-digit back
references are available, and @c @\@< and @c @\@> are synonyms for
<tt>[[:@<:]]</tt> and <tt>[[:@>:]]</tt> respectively; no other escapes are
available.


@section overview_resyntax_characters Regular Expression Character Names

Note that the character names are case sensitive.

<center><table class='doctable' border='0' cellspacing='5' cellpadding='4'><tr>

<td>
@beginTable
@row2col{ <tt>NUL</tt> , @\0 }
@row2col{ <tt>SOH</tt> , @\001 }
@row2col{ <tt>STX</tt> , @\002 }
@row2col{ <tt>ETX</tt> , @\003 }
@row2col{ <tt>EOT</tt> , @\004 }
@row2col{ <tt>ENQ</tt> , @\005 }
@row2col{ <tt>ACK</tt> , @\006 }
@row2col{ <tt>BEL</tt> , @\007 }
@row2col{ <tt>alert</tt> , @\007 }
@row2col{ <tt>BS</tt> , @\010 }
@row2col{ <tt>backspace</tt> , @\b }
@row2col{ <tt>HT</tt> , @\011 }
@row2col{ <tt>tab</tt> , @\t }
@row2col{ <tt>LF</tt> , @\012 }
@row2col{ <tt>newline</tt> , @\n }
@row2col{ <tt>VT</tt> , @\013 }
@row2col{ <tt>vertical-tab</tt> , @\v }
@row2col{ <tt>FF</tt> , @\014 }
@row2col{ <tt>form-feed</tt> , @\f }
@endTable
</td>

<td>
@beginTable
@row2col{ <tt>CR</tt> , @\015 }
@row2col{ <tt>carriage-return</tt> , @\r }
@row2col{ <tt>SO</tt> , @\016 }
@row2col{ <tt>SI</tt> , @\017 }
@row2col{ <tt>DLE</tt> , @\020 }
@row2col{ <tt>DC1</tt> , @\021 }
@row2col{ <tt>DC2</tt> , @\022 }
@row2col{ <tt>DC3</tt> , @\023 }
@row2col{ <tt>DC4</tt> , @\024 }
@row2col{ <tt>NAK</tt> , @\025 }
@row2col{ <tt>SYN</tt> , @\026 }
@row2col{ <tt>ETB</tt> , @\027 }
@row2col{ <tt>CAN</tt> , @\030 }
@row2col{ <tt>EM</tt> , @\031 }
@row2col{ <tt>SUB</tt> , @\032 }
@row2col{ <tt>ESC</tt> , @\033 }
@row2col{ <tt>IS4</tt> , @\034 }
@row2col{ <tt>FS</tt> , @\034 }
@row2col{ <tt>IS3</tt> , @\035 }
@endTable
</td>

<td>
@beginTable
@row2col{ <tt>GS</tt> , @\035 }
@row2col{ <tt>IS2</tt> , @\036 }
@row2col{ <tt>RS</tt> , @\036 }
@row2col{ <tt>IS1</tt> , @\037 }
@row2col{ <tt>US</tt> , @\037 }
@row2col{ <tt>space</tt> , " " (space) }
@row2col{ <tt>exclamation-mark</tt> , ! }
@row2col{ <tt>quotation-mark</tt> , " }
@row2col{ <tt>number-sign</tt> , @# }
@row2col{ <tt>dollar-sign</tt> , @$ }
@row2col{ <tt>percent-sign</tt> , @% }
@row2col{ <tt>ampersand</tt> , @& }
@row2col{ <tt>apostrophe</tt> , ' }
@row2col{ <tt>left-parenthesis</tt> , ( }
@row2col{ <tt>right-parenthesis</tt> , ) }
@row2col{ <tt>asterisk</tt> , * }
@row2col{ <tt>plus-sign</tt> , + }
@row2col{ <tt>comma</tt> , \, }
@row2col{ <tt>hyphen</tt> , - }
@endTable
</td>

<td>
@beginTable
@row2col{ <tt>hyphen-minus</tt> , - }
@row2col{ <tt>period</tt> , . }
@row2col{ <tt>full-stop</tt> , . }
@row2col{ <tt>slash</tt> , / }
@row2col{ <tt>solidus</tt> , / }
@row2col{ <tt>zero</tt> , 0 }
@row2col{ <tt>one</tt> , 1 }
@row2col{ <tt>two</tt> , 2 }
@row2col{ <tt>three</tt> , 3 }
@row2col{ <tt>four</tt> , 4 }
@row2col{ <tt>five</tt> , 5 }
@row2col{ <tt>six</tt> , 6 }
@row2col{ <tt>seven</tt> , 7 }
@row2col{ <tt>eight</tt> , 8 }
@row2col{ <tt>nine</tt> , 9 }
@row2col{ <tt>colon</tt> , : }
@row2col{ <tt>semicolon</tt> , ; }
@row2col{ <tt>less-than-sign</tt> , @< }
@row2col{ <tt>equals-sign</tt> , = }
@endTable
</td>

<td>
@beginTable
@row2col{ <tt>greater-than-sign</tt> , @> }
@row2col{ <tt>question-mark</tt> , ? }
@row2col{ <tt>commercial-at</tt> , @@ }
@row2col{ <tt>left-square-bracket</tt> , [ }
@row2col{ <tt>backslash</tt> , @\ }
@row2col{ <tt>reverse-solidus</tt> , @\ }
@row2col{ <tt>right-square-bracket</tt> , ] }
@row2col{ <tt>circumflex</tt> , ^ }
@row2col{ <tt>circumflex-accent</tt> , ^ }
@row2col{ <tt>underscore</tt> , _ }
@row2col{ <tt>low-line</tt> , _ }
@row2col{ <tt>grave-accent</tt> , ` }
@row2col{ <tt>left-brace</tt> , @leftCurly }
@row2col{ <tt>left-curly-bracket</tt> , @leftCurly }
@row2col{ <tt>vertical-line</tt> , | }
@row2col{ <tt>right-brace</tt> , @rightCurly }
@row2col{ <tt>right-curly-bracket</tt> , @rightCurly }
@row2col{ <tt>tilde</tt> , ~ }
@row2col{ <tt>DEL</tt> , @\177 }
@endTable
</td>

</tr></table></center>

*/
