1 /////////////////////////////////////////////////////////////////////////////
2 // Name: resyntax.h
3 // Purpose: topic overview
4 // Author: wxWidgets team
5 // Licence: wxWindows licence
6 /////////////////////////////////////////////////////////////////////////////
7
8 /**
9
10 @page overview_resyntax Regular Expressions
11
12 @tableofcontents
13
14 A <em>regular expression</em> describes strings of characters. It's a pattern
15 that matches certain strings and doesn't match others.
16
17 @see wxRegEx
18
19
20
21 @section overview_resyntax_differentflavors Different Flavors of Regular Expressions
22
23 Regular expressions (RE), as defined by POSIX, come in two flavors:
24 <em>extended regular expressions</em> (ERE) and <em>basic regular
25 expressions</em> (BRE). EREs are roughly those of the traditional @e egrep,
26 while BREs are roughly those of the traditional @e ed. This implementation
27 adds a third flavor: <em>advanced regular expressions</em> (ARE), basically
28 EREs with some significant extensions.
29
30 This manual page primarily describes AREs. BREs mostly exist for backward
31 compatibility in some old programs. POSIX EREs are almost an exact subset of
32 AREs. Features of AREs that are not present in EREs will be indicated.
33
34
35 @section overview_resyntax_syntax Regular Expression Syntax
36
37 These regular expressions are implemented using the package written by Henry
38 Spencer, based on the 1003.2 spec and some (not quite all) of the Perl5
39 extensions (thanks, Henry!). Much of the description of regular expressions
40 below is copied verbatim from his manual entry.
41
42 An ARE is one or more @e branches, separated by "|", matching anything that
43 matches any of the branches.
44
45 A branch is zero or more @e constraints or @e quantified atoms, concatenated.
46 It matches a match for the first, followed by a match for the second, etc; an
47 empty branch matches the empty string.
48
49 A quantified atom is an @e atom possibly followed by a single @e quantifier.
50 Without a quantifier, it matches a match for the atom. The quantifiers, and
51 what a so-quantified atom matches, are:
52
53 @beginTable
54 @row2col{ <tt>*</tt> ,
55 A sequence of 0 or more matches of the atom. }
56 @row2col{ <tt>+</tt> ,
57 A sequence of 1 or more matches of the atom. }
58 @row2col{ <tt>?</tt> ,
59 A sequence of 0 or 1 matches of the atom. }
60 @row2col{ <tt>{m}</tt> ,
61 A sequence of exactly @e m matches of the atom. }
62 @row2col{ <tt>{m\,}</tt> ,
63 A sequence of @e m or more matches of the atom. }
64 @row2col{ <tt>{m\,n}</tt> ,
65 A sequence of @e m through @e n (inclusive) matches of the atom; @e m may
66 not exceed @e n. }
67 @row2col{ <tt>*? +? ?? {m}? {m\,}? {m\,n}?</tt> ,
68 @e Non-greedy quantifiers, which match the same possibilities, but prefer
69 the smallest number rather than the largest number of matches (see
70 @ref overview_resyntax_matching). }
71 @endTable
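
The difference between the normal and the non-greedy quantifiers can be tried
out with a short, self-contained sketch. It uses <tt>std::regex</tt> with its
default ECMAScript grammar purely as a stand-in (an assumption made so that the
example builds without wxWidgets); it is not the ARE engine described here,
although these simple patterns behave the same way in both. The helper name
<tt>firstMatch</tt> is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns the text of the first match of `pattern` in `subject`, or an empty
// string if nothing matches. std::regex (ECMAScript grammar) is only a
// stand-in here; it is not the ARE engine.
std::string firstMatch(const std::string& pattern, const std::string& subject)
{
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(subject, m, re) ? m.str(0) : std::string();
}
```

Against the subject <tt>baaaac</tt>, <tt>a+</tt> takes all four letters while
<tt>a+?</tt> settles for one; likewise <tt>a{2,3}</tt> takes three and
<tt>a{2,3}?</tt> takes two.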
72
73 The forms using @b { and @b } are known as @e bounds. The numbers @e m and
74 @e n are unsigned decimal integers with permissible values from 0 to 255
75 inclusive. An atom is one of:
76
77 @beginTable
78 @row2col{ <tt>(re)</tt> ,
79 Where @e re is any regular expression, matches for @e re, with the match
80 captured for possible reporting. }
81 @row2col{ <tt>(?:re)</tt> ,
82 As previous, but does no reporting (a "non-capturing" set of
83 parentheses). }
84 @row2col{ <tt>()</tt> ,
85 Matches an empty string, captured for possible reporting. }
86 @row2col{ <tt>(?:)</tt> ,
87 Matches an empty string, without reporting. }
88 @row2col{ <tt>[chars]</tt> ,
89 A <em>bracket expression</em>, matching any one of the @e chars (see
90 @ref overview_resyntax_bracket for more details). }
91 @row2col{ <tt>.</tt> ,
92 Matches any single character. }
93 @row2col{ <tt>@\k</tt> ,
94 Where @e k is a non-alphanumeric character, matches that character taken
95 as an ordinary character, e.g. @\@\ matches a backslash character. }
96 @row2col{ <tt>@\c</tt> ,
97 Where @e c is alphanumeric (possibly followed by other characters), an
98 @e escape (AREs only), see @ref overview_resyntax_escapes below. }
99 @row2col{ <tt>@leftCurly</tt> ,
100 When followed by a character other than a digit, matches the left-brace
101 character "@leftCurly"; when followed by a digit, it is the beginning of a
102 @e bound (see above). }
103 @row2col{ <tt>x</tt> ,
104 Where @e x is a single character with no other significance, matches that
105 character. }
106 @endTable
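
The rule for the left brace, together with the limits on bounds given earlier
(values from 0 to 255, and @e m not exceeding @e n), can be sketched as a small
classifier. This is only an illustration of the rules as stated in this
overview, not the code used by the regular expression package; the names
<tt>BraceKind</tt>, <tt>Bound</tt>, <tt>readCount</tt> and
<tt>classifyBrace</tt> are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

enum class BraceKind { Literal, Bound, Error };

struct Bound
{
    int min;
    int max; // -1 stands for "no upper limit", as in {m,}
};

// Reads an unsigned decimal number at s[i], advancing i past its digits.
// Returns -1 if s[i] is not a digit or the value exceeds 255.
static int readCount(const std::string& s, std::size_t& i)
{
    if (i >= s.size() || !std::isdigit(static_cast<unsigned char>(s[i])))
        return -1;
    int value = 0;
    while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i])))
    {
        value = value * 10 + (s[i] - '0');
        if (value > 255)
            return -1;
        ++i;
    }
    return value;
}

// Classifies the '{' found at s[pos]; fills `out` when it starts a valid bound.
BraceKind classifyBrace(const std::string& s, std::size_t pos, Bound& out)
{
    std::size_t i = pos + 1;
    if (i >= s.size() || !std::isdigit(static_cast<unsigned char>(s[i])))
        return BraceKind::Literal; // '{' not followed by a digit

    const int m = readCount(s, i);
    if (m < 0)
        return BraceKind::Error; // m is larger than 255
    int n = m; // plain {m}
    if (i < s.size() && s[i] == ',')
    {
        ++i;
        if (i < s.size() && s[i] == '}')
            n = -1; // {m,}
        else if ((n = readCount(s, i)) < m)
            return BraceKind::Error; // n missing, n > 255, or m > n
    }
    if (i >= s.size() || s[i] != '}')
        return BraceKind::Error; // bound is not terminated by '}'
    out.min = m;
    out.max = n;
    return BraceKind::Bound;
}
```

For instance, the brace in <tt>a{2,5}</tt> starts a bound with limits 2 and 5,
the brace in <tt>a{x</tt> is an ordinary character, and <tt>a{5,2}</tt> or
<tt>a{256}</tt> are rejected.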
107
108 A @e constraint matches an empty string when specific conditions are met. A
109 constraint may not be followed by a quantifier. The simple constraints are as
110 follows; some more constraints are described later, under
111 @ref overview_resyntax_escapes.
112
113 @beginTable
114 @row2col{ <tt>^</tt> ,
115 Matches at the beginning of a line. }
116 @row2col{ <tt>@$</tt> ,
117 Matches at the end of a line. }
118 @row2col{ <tt>(?=re)</tt> ,
119 @e Positive lookahead (AREs only), matches at any point where a substring
120 matching @e re begins. }
121 @row2col{ <tt>(?!re)</tt> ,
122 @e Negative lookahead (AREs only), matches at any point where no substring
123 matching @e re begins. }
124 @endTable
125
126 The lookahead constraints may not contain back references (see later), and all
127 parentheses within them are considered non-capturing. An RE may not end with
128 <tt>@\</tt>.
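
That a lookahead constraint matches an empty string, and so consumes nothing,
can be seen in the following sketch. As an assumption made to keep the example
self-contained, it uses <tt>std::regex</tt> (ECMAScript grammar), which happens
to accept the same <tt>(?=re)</tt> and <tt>(?!re)</tt> notation; it is a
stand-in, not the ARE engine. <tt>Hit</tt> and <tt>findFirst</tt> are
hypothetical names.

```cpp
#include <cassert>
#include <regex>
#include <string>

struct Hit
{
    long pos; // offset of the match in the subject, or -1 if there is none
    long len; // number of characters consumed by the match
};

// Searches `subject` for the first match of `pattern`.
// std::regex is only a stand-in for the ARE engine here.
Hit findFirst(const std::string& pattern, const std::string& subject)
{
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(subject, m, re))
        return Hit{-1, 0};
    return Hit{static_cast<long>(m.position(0)),
               static_cast<long>(m.length(0))};
}
```

Searching <tt>foobaz foobar</tt> for <tt>foo(?=bar)</tt> skips the first
<tt>foo</tt> and reports the second one with a length of three: the
<tt>bar</tt> that satisfied the constraint is not part of the match.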
129
130
131 @section overview_resyntax_bracket Bracket Expressions
132
133 A <em>bracket expression</em> is a list of characters enclosed in <tt>[]</tt>.
134 It normally matches any single character from the list (but see below). If the
135 list begins with @c ^, it matches any single character (but see below) @e not
136 from the rest of the list.
137
138 If two characters in the list are separated by <tt>-</tt>, this is shorthand
139 for the full @e range of characters between those two (inclusive) in the
140 collating sequence, e.g. <tt>[0-9]</tt> in ASCII matches any decimal digit.
141 Two ranges may not share an endpoint, so e.g. <tt>a-c-e</tt> is illegal.
142 Ranges are very collating-sequence-dependent, and portable programs should
143 avoid relying on them.
144
145 To include a literal <tt>]</tt> or <tt>-</tt> in the list, the simplest method
146 is to enclose it in <tt>[.</tt> and <tt>.]</tt> to make it a collating element
147 (see below). Alternatively, make it the first character (following a possible
148 <tt>^</tt>), or (AREs only) precede it with <tt>@\</tt>. Alternatively, for
149 <tt>-</tt>, make it the last character, or the second endpoint of a range. To
150 use a literal <tt>-</tt> as the first endpoint of a range, make it a collating
151 element or (AREs only) precede it with <tt>@\</tt>. With the exception of
152 these, some combinations using <tt>[</tt> (see next paragraphs), and escapes,
153 all other special characters lose their special significance within a bracket
154 expression.
155
156 Within a bracket expression, a collating element (a character, a
157 multi-character sequence that collates as if it were a single character, or a
158 collating-sequence name for either) enclosed in <tt>[.</tt> and <tt>.]</tt>
159 stands for the sequence of characters of that collating element.
160
161 @e wxWidgets: Currently no multi-character collating elements are defined. So
162 in <tt>[.X.]</tt>, @c X can either be a single character literal or the name
163 of a character. For example, <tt>[[.0.]-[.9.]]</tt> and
164 <tt>[[.zero.]-[.nine.]]</tt> are identical, and both mean the same as
165 <tt>[0-9]</tt>. See @ref overview_resyntax_characters.
166
167 Within a bracket expression, a collating element enclosed in <tt>[=</tt> and
168 <tt>=]</tt> is an equivalence class, standing for the sequences of characters
169 of all collating elements equivalent to that one, including itself. An
170 equivalence class may not be an endpoint of a range.
171
172 @e wxWidgets: Currently no equivalence classes are defined, so <tt>[=X=]</tt>
173 stands for just the single character @c X. @c X can either be a single
174 character literal or the name of a character, see
175 @ref overview_resyntax_characters.
176
177 Within a bracket expression, the name of a @e character class enclosed in
178 <tt>[:</tt> and <tt>:]</tt> stands for the list of all characters (not all
179 collating elements!) belonging to that class. Standard character classes are:
180
181 @beginTable
182 @row2col{ <tt>alpha</tt> , A letter. }
183 @row2col{ <tt>upper</tt> , An upper-case letter. }
184 @row2col{ <tt>lower</tt> , A lower-case letter. }
185 @row2col{ <tt>digit</tt> , A decimal digit. }
186 @row2col{ <tt>xdigit</tt> , A hexadecimal digit. }
187 @row2col{ <tt>alnum</tt> , An alphanumeric (letter or digit). }
188 @row2col{ <tt>print</tt> , An alphanumeric (same as alnum). }
189 @row2col{ <tt>blank</tt> , A space or tab character. }
190 @row2col{ <tt>space</tt> , A character producing white space in displayed text. }
191 @row2col{ <tt>punct</tt> , A punctuation character. }
192 @row2col{ <tt>graph</tt> , A character with a visible representation. }
193 @row2col{ <tt>cntrl</tt> , A control character. }
194 @endTable
195
196 A character class may not be used as an endpoint of a range.
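
As a small illustration of character classes inside bracket expressions, the
sketch below checks whether a string looks like an identifier. It assumes
<tt>std::regex</tt> as a self-contained stand-in (its default grammar accepts
the same <tt>[:name:]</tt> notation); how characters are classified there
follows the C++ library's rules, not the wxWidgets rules described in this
section. The function name <tt>isIdentifier</tt> is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if `s` is a letter or underscore followed by any number of letters,
// digits or underscores. std::regex is only a stand-in for the ARE engine.
bool isIdentifier(const std::string& s)
{
    static const std::regex re("[[:alpha:]_][[:alnum:]_]*");
    return std::regex_match(s, re);
}
```

Here <tt>foo_1</tt> and <tt>_x</tt> are accepted, while <tt>1foo</tt> and the
empty string are not.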
197
198 @e wxWidgets: In a non-Unicode build, these character classifications depend on
199 the current locale, and correspond to the values returned by the ANSI C "is"
200 functions: <tt>isalpha</tt>, <tt>isupper</tt>, etc. In Unicode mode they are
201 based on Unicode classifications, and are not affected by the current locale.
202
203 There are two special cases of bracket expressions: the bracket expressions
204 <tt>[[:@<:]]</tt> and <tt>[[:@>:]]</tt> are constraints, matching empty strings at
205 the beginning and end of a word respectively. A word is defined as a sequence
206 of word characters that is neither preceded nor followed by word characters. A
207 word character is an @e alnum character or an underscore (_). These special
208 bracket expressions are deprecated; users of AREs should use constraint escapes
209 instead (see @ref overview_resyntax_escapes below).
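
The definition of a word given above can be restated directly in code. The
sketch below is only an illustration of that definition, assuming plain
<tt>char</tt> strings classified with the C <tt>isalnum</tt> function; the
names <tt>isWordChar</tt>, <tt>wordStarts</tt> and <tt>wordEnds</tt> are
hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// A word character is an alnum character or an underscore.
bool isWordChar(char c)
{
    return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
}

// Offsets at which a word begins: a word character that is not preceded by
// another word character.
std::vector<std::size_t> wordStarts(const std::string& s)
{
    std::vector<std::size_t> starts;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (isWordChar(s[i]) && (i == 0 || !isWordChar(s[i - 1])))
            starts.push_back(i);
    return starts;
}

// Offsets at which a word ends, i.e. one past a word character that is not
// followed by another word character.
std::vector<std::size_t> wordEnds(const std::string& s)
{
    std::vector<std::size_t> ends;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (isWordChar(s[i]) && (i + 1 == s.size() || !isWordChar(s[i + 1])))
            ends.push_back(i + 1);
    return ends;
}
```

In the string <tt>ab_1 cd</tt>, words begin at offsets 0 and 5 and end at
offsets 4 and 7; these are the positions at which the empty-string constraints
for the beginning and end of a word can match.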
210
211
212 @section overview_resyntax_escapes Escapes
213
214 Escapes (AREs only), which begin with a <tt>@\</tt> followed by an alphanumeric
215 character, come in several varieties: character entry, class shorthands,
216 constraint escapes, and back references. A <tt>@\</tt> followed by an
217 alphanumeric character but not constituting a valid escape is illegal in AREs.
218 In EREs, there are no escapes: outside a bracket expression, a <tt>@\</tt>
219 followed by an alphanumeric character merely stands for that character as an
220 ordinary character, and inside a bracket expression, <tt>@\</tt> is an ordinary
221 character. (The latter is the one actual incompatibility between EREs and
222 AREs.)
223
224 Character-entry escapes (AREs only) exist to make it easier to specify
225 non-printing and otherwise inconvenient characters in REs:
226
227 @beginTable
228 @row2col{ <tt>@\a</tt> , Alert (bell) character, as in C. }
229 @row2col{ <tt>@\b</tt> , Backspace, as in C. }
230 @row2col{ <tt>@\B</tt> ,
231 Synonym for <tt>@\</tt> to help reduce backslash doubling in some
232 applications where there are multiple levels of backslash processing. }
233 @row2col{ <tt>@\cX</tt> ,
234 The character whose low-order 5 bits are the same as those of @e X, and
235 whose other bits are all zero, where @e X is any character. }
236 @row2col{ <tt>@\e</tt> ,
237 The character whose collating-sequence name is @c ESC, or failing that,
238 the character with octal value 033. }
239 @row2col{ <tt>@\f</tt> , Formfeed, as in C. }
240 @row2col{ <tt>@\n</tt> , Newline, as in C. }
241 @row2col{ <tt>@\r</tt> , Carriage return, as in C. }
242 @row2col{ <tt>@\t</tt> , Horizontal tab, as in C. }
243 @row2col{ <tt>@\uwxyz</tt> ,
244 The Unicode character <tt>U+wxyz</tt> in the local byte ordering, where
245 @e wxyz is exactly four hexadecimal digits. }
246 @row2col{ <tt>@\Ustuvwxyz</tt> ,
247 Reserved for a somewhat-hypothetical Unicode extension to 32 bits, where
248 @e stuvwxyz is exactly eight hexadecimal digits. }
249 @row2col{ <tt>@\v</tt> , Vertical tab, as in C. }
250 @row2col{ <tt>@\xhhh</tt> ,
251 The single character whose hexadecimal value is @e 0xhhh, where @e hhh is
252 any sequence of hexadecimal digits. }
253 @row2col{ <tt>@\0</tt> , The character whose value is 0. }
254 @row2col{ <tt>@\xy</tt> ,
255 The character whose octal value is @e 0xy, where @e xy is exactly two octal
256 digits, and is not a <em>back reference</em> (see below). }
257 @row2col{ <tt>@\xyz</tt> ,
258 The character whose octal value is @e 0xyz, where @e xyz is exactly three
259 octal digits, and is not a <em>back reference</em> (see below). }
260 @endTable
261
262 Hexadecimal digits are 0-9, a-f, and A-F. Octal digits are 0-7.
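
The rule for <tt>@\cX</tt> in the table above amounts to a single bit mask. The
following sketch assumes an ASCII character set; the function name
<tt>controlChar</tt> is hypothetical.

```cpp
#include <cassert>

// Character denoted by the escape \cX: the low-order 5 bits of X are kept
// and all other bits are cleared. ASCII is assumed.
constexpr char controlChar(char x)
{
    return static_cast<char>(static_cast<unsigned char>(x) & 0x1F);
}
```

Under that assumption <tt>@\cA</tt> and <tt>@\ca</tt> both denote the character
with value 1, <tt>@\cI</tt> denotes the horizontal tab (octal 011) and
<tt>@\c[</tt> denotes ESC (octal 033).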
263
264 The character-entry escapes are always taken as ordinary characters. For
265 example, <tt>@\135</tt> is <tt>]</tt> in ASCII, but <tt>@\135</tt> does not
266 terminate a bracket expression. Beware, however, that some applications (e.g.,
267 C compilers) interpret such sequences themselves before the regular-expression
268 package gets to see them, which may require doubling (quadrupling, etc.) the
269 '<tt>@\</tt>'.
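
In C++ source code the doubling looks as follows. The sketch only shows what
the compiler does to string literals before any regular expression package
sees them; the variable names are hypothetical.

```cpp
#include <cassert>
#include <string>

// The RE \d+ written as an ordinary literal: the backslash must be doubled.
const std::string escaped = "\\d+";

// The same RE as a raw string literal (C++11): no doubling is needed.
const std::string raw = R"(\d+)";

// Without doubling, the compiler itself turns \135 into ']' ...
const std::string octalByCompiler = "[\135]"; // the RE text is  []]
// ... while with doubling the escape reaches the RE package intact.
const std::string octalByPackage = "[\\135]"; // the RE text is  [\135]
```

The first two strings are identical and three characters long. The last two
differ: one holds three characters and the other six.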
270
271 Class-shorthand escapes (AREs only) provide shorthands for certain
272 commonly-used character classes:
273
274 @beginTable
275 @row2col{ <tt>@\d</tt> , <tt>[[:digit:]]</tt> }
276 @row2col{ <tt>@\s</tt> , <tt>[[:space:]]</tt> }
277 @row2col{ <tt>@\w</tt> , <tt>[[:alnum:]_]</tt> (note underscore) }
278 @row2col{ <tt>@\D</tt> , <tt>[^[:digit:]]</tt> }
279 @row2col{ <tt>@\S</tt> , <tt>[^[:space:]]</tt> }
280 @row2col{ <tt>@\W</tt> , <tt>[^[:alnum:]_]</tt> (note underscore) }
281 @endTable
282
283 Within bracket expressions, <tt>@\d</tt>, <tt>@\s</tt>, and <tt>@\w</tt> lose
284 their outer brackets, and <tt>@\D</tt>, <tt>@\S</tt>, <tt>@\W</tt> are illegal.
285 So, for example, <tt>[a-c@\d]</tt> is equivalent to <tt>[a-c[:digit:]]</tt>.
286 Also, <tt>[a-c@\D]</tt>, which is equivalent to <tt>[a-c^[:digit:]]</tt>, is
287 illegal.
288
289 A constraint escape (AREs only) is a constraint, matching the empty string if
290 specific conditions are met, written as an escape:
291
292 @beginTable
293 @row2col{ <tt>@\A</tt> , Matches only at the beginning of the string, see
294 @ref overview_resyntax_matching for how this differs
295 from <tt>^</tt>. }
296 @row2col{ <tt>@\m</tt> , Matches only at the beginning of a word. }
297 @row2col{ <tt>@\M</tt> , Matches only at the end of a word. }
298 @row2col{ <tt>@\y</tt> , Matches only at the beginning or end of a word. }
299 @row2col{ <tt>@\Y</tt> , Matches only at a point that is not the beginning or
300 end of a word. }
301 @row2col{ <tt>@\Z</tt> , Matches only at the end of the string, see
302 @ref overview_resyntax_matching for how this differs
303 from <tt>@$</tt>. }
304 @row2col{ <tt>@\m</tt> , A <em>back reference</em>, where @e m stands for a
305 non-zero digit (not the letter @c m of the word-start escape above). See below. }
306 @row2col{ <tt>@\mnn</tt> ,
307 A <em>back reference</em>, where @e m is a non-zero digit, and @e nn is some
308 more digits, and the decimal value @e mnn is not greater than the number of
309 closing capturing parentheses seen so far. See below. }
310 @endTable
311
312 A word is defined as in the specification of <tt>[[:@<:]]</tt> and
313 <tt>[[:@>:]]</tt> above. Constraint escapes are illegal within bracket
314 expressions.
315
316 A back reference (AREs only) matches the same string matched by the
317 parenthesized subexpression specified by the number. For example,
318 <tt>([bc])@\1</tt> matches "bb" or "cc" but not "bc". The subexpression must
319 entirely precede the back reference in the RE. Subexpressions are numbered in the order of their
320 leading parentheses. Non-capturing parentheses do not define subexpressions.
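
The back reference example above, and the numbering rule for non-capturing
parentheses, can be checked with a short sketch. It assumes
<tt>std::regex</tt> (ECMAScript grammar) as a self-contained stand-in, since it
uses the same notation for these two features; it is not the ARE engine. The
names <tt>kDoubled</tt>, <tt>kSkipped</tt> and <tt>matchesWhole</tt> are
hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// \1 refers to whatever the first capturing subexpression matched.
const char* const kDoubled = R"(([bc])\1)";

// The non-capturing (?:a) is not numbered, so \1 still refers to ([bc]).
const char* const kSkipped = R"((?:a)([bc])\1)";

// True if `pattern` matches the whole of `subject`.
// std::regex is only a stand-in for the ARE engine here.
bool matchesWhole(const std::string& pattern, const std::string& subject)
{
    return std::regex_match(subject, std::regex(pattern));
}
```

With these definitions <tt>kDoubled</tt> accepts "bb" and "cc" but not "bc",
and <tt>kSkipped</tt> accepts "abb" but not "aab".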
321
322 There is an inherent historical ambiguity between octal character-entry escapes
323 and back references, which is resolved by heuristics, as hinted at above. A
324 leading zero always indicates an octal escape. A single non-zero digit, not
325 followed by another digit, is always taken as a back reference. A multi-digit
326 sequence not starting with a zero is taken as a back reference if it comes
327 after a suitable subexpression (i.e. the number is in the legal range for a
328 back reference), and otherwise is taken as octal.
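
The three rules of this heuristic can be written down as a small decision
function. This is a simplified sketch of the rules as stated here, not the
parser's actual code: it assumes the run of decimal digits after the backslash
has already been isolated and is short, and it does not decide how many of
those digits an octal escape would consume. <tt>EscapeKind</tt> and
<tt>classifyDigitEscape</tt> are hypothetical names.

```cpp
#include <cassert>
#include <string>

enum class EscapeKind { Octal, BackReference };

// `digits` is the non-empty run of decimal digits following the backslash;
// `groupsClosedSoFar` is the number of closing capturing parentheses seen
// before the escape.
EscapeKind classifyDigitEscape(const std::string& digits, int groupsClosedSoFar)
{
    if (digits[0] == '0')
        return EscapeKind::Octal;          // a leading zero is always octal
    if (digits.size() == 1)
        return EscapeKind::BackReference;  // a single non-zero digit
    // Several digits: a back reference only if the number is in legal range.
    const int value = std::stoi(digits);
    return value <= groupsClosedSoFar ? EscapeKind::BackReference
                                      : EscapeKind::Octal;
}
```

So <tt>@\12</tt> is a back reference after twelve or more capturing
subexpressions have been closed, and an octal escape otherwise, whereas
<tt>@\7</tt> is always a back reference and <tt>@\012</tt> is always octal.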
329
330
331 @section overview_resyntax_metasyntax Metasyntax
332
333 In addition to the main syntax described above, there are some special forms
334 and miscellaneous syntactic facilities available.
335
336 Normally the flavor of RE being used is specified by application-dependent
337 means. However, this can be overridden by a @e director. If an RE of any flavor
338 begins with <tt>***:</tt>, the rest of the RE is an ARE. If an RE of any
339 flavor begins with <tt>***=</tt>, the rest of the RE is taken to be a literal
340 string, with all characters considered ordinary characters.
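
Recognizing a director is a matter of inspecting the first four characters.
The sketch below merely restates the two rules above; the names
<tt>Flavor</tt>, <tt>Directed</tt> and <tt>splitDirector</tt> are hypothetical
and do not correspond to any wxWidgets API.

```cpp
#include <cassert>
#include <string>

enum class Flavor { AsSpecified, Advanced, Literal };

struct Directed
{
    Flavor flavor;    // flavor selected by the director, if there is one
    std::string rest; // the RE with the director removed
};

// Splits an optional leading director off `re`.
Directed splitDirector(const std::string& re)
{
    if (re.compare(0, 4, "***:") == 0)
        return Directed{Flavor::Advanced, re.substr(4)};
    if (re.compare(0, 4, "***=") == 0)
        return Directed{Flavor::Literal, re.substr(4)};
    return Directed{Flavor::AsSpecified, re};
}
```

For example, <tt>***=a.b</tt> yields the literal string <tt>a.b</tt>, in which
the period is an ordinary character.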
341
342 An ARE may begin with <em>embedded options</em>: a sequence <tt>(?xyz)</tt>
343 (where @e xyz is one or more alphabetic characters) specifies options affecting
344 the rest of the RE. These supplement, and can override, any options specified
345 by the application. The available option letters are:
346
347 @beginTable
348 @row2col{ <tt>b</tt> , Rest of RE is a BRE. }
349 @row2col{ <tt>c</tt> , Case-sensitive matching (usual default). }
350 @row2col{ <tt>e</tt> , Rest of RE is an ERE. }
351 @row2col{ <tt>i</tt> , Case-insensitive matching (see
352 @ref overview_resyntax_matching, below). }
353 @row2col{ <tt>m</tt> , Historical synonym for @e n. }
354 @row2col{ <tt>n</tt> , Newline-sensitive matching (see
355 @ref overview_resyntax_matching, below). }
356 @row2col{ <tt>p</tt> , Partial newline-sensitive matching (see
357 @ref overview_resyntax_matching, below). }
358 @row2col{ <tt>q</tt> , Rest of RE is a literal ("quoted") string, all ordinary
359 characters. }
360 @row2col{ <tt>s</tt> , Non-newline-sensitive matching (usual default). }
361 @row2col{ <tt>t</tt> , Tight syntax (usual default; see below). }
362 @row2col{ <tt>w</tt> , Inverse partial newline-sensitive ("weird") matching
363 (see @ref overview_resyntax_matching, below). }
364 @row2col{ <tt>x</tt> , Expanded syntax (see below). }
365 @endTable
366
367 Embedded options take effect at the <tt>)</tt> terminating the sequence. They
368 are available only at the start of an ARE, and may not be used later within it.
369
370 In addition to the usual (@e tight) RE syntax, in which all characters are
371 significant, there is an @e expanded syntax, available in AREs with the
372 embedded <tt>x</tt> option. In the expanded syntax, white-space characters are ignored
373 and all characters between a <tt>@#</tt> and the following newline (or the end
374 of the RE) are ignored, permitting paragraphing and commenting a complex RE.
375 There are three exceptions to that basic rule:
376
377 @li A white-space character or <tt>@#</tt> preceded by <tt>@\</tt> is retained.
378 @li White space or <tt>@#</tt> within a bracket expression is retained.
379 @li White space and comments are illegal within multi-character symbols like
380 the ARE <tt>(?:</tt> or the BRE <tt>@\(</tt>.
381
382 Expanded-syntax white-space characters are blank, tab, newline, and any
383 character that belongs to the @e space character class.
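
The effect of the expanded syntax can be approximated by a function that
reduces an expanded RE to its tight equivalent. This is a simplified sketch of
the rules above, not the package's own code: it implements the first two
exceptions, does not diagnose the third (white space inside multi-character
symbols), and classifies white space with the C <tt>isspace</tt> function. The
names <tt>copyBracket</tt> and <tt>stripExpanded</tt> are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Copies the bracket expression starting at re[i] to `out` unchanged and
// returns the offset just past it.
static std::size_t copyBracket(const std::string& re, std::size_t i,
                               std::string& out)
{
    const std::size_t n = re.size();
    out += re[i++];                                // the opening '['
    if (i < n && re[i] == '^') out += re[i++];     // optional negation
    if (i < n && re[i] == ']') out += re[i++];     // a leading ']' is literal
    while (i < n && re[i] != ']')
    {
        const char next = (i + 1 < n) ? re[i + 1] : '\0';
        if (re[i] == '\\' && i + 1 < n)
        {
            out += re[i++];                        // escape inside brackets
            out += re[i++];
        }
        else if (re[i] == '[' && (next == ':' || next == '.' || next == '='))
        {
            // Nested [: :], [. .] or [= =]: copy through its closing pair.
            const std::string close = std::string(1, next) + "]";
            const std::size_t end = re.find(close, i + 2);
            const std::size_t stop = (end == std::string::npos) ? n : end + 2;
            out.append(re, i, stop - i);
            i = stop;
        }
        else
            out += re[i++];
    }
    if (i < n) out += re[i++];                     // the closing ']'
    return i;
}

// Removes ignorable white space and '#' comments from an expanded-syntax RE.
std::string stripExpanded(const std::string& re)
{
    std::string out;
    std::size_t i = 0;
    while (i < re.size())
    {
        const char c = re[i];
        if (c == '\\' && i + 1 < re.size())
        {
            out += c;                              // escaped character is kept
            out += re[i + 1];
            i += 2;
        }
        else if (c == '[')
            i = copyBracket(re, i, out);           // brackets are kept verbatim
        else if (c == '#')
            while (i < re.size() && re[i] != '\n') // comment runs to end of line
                ++i;
        else if (std::isspace(static_cast<unsigned char>(c)))
            ++i;                                   // white space is ignored
        else
        {
            out += c;
            ++i;
        }
    }
    return out;
}
```

A two-line RE with a trailing comment on the first line thus collapses to a
single tight RE, while a space or <tt>@#</tt> inside a bracket expression or
after a backslash survives.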
384
385 Finally, in an ARE, outside bracket expressions, the sequence <tt>(?@#ttt)</tt>
386 (where @e ttt is any text not containing a <tt>)</tt>) is a comment, completely
387 ignored. Again, this is not allowed between the characters of multi-character
388 symbols like <tt>(?:</tt>. Such comments are more a historical artifact than a
389 useful facility, and their use is deprecated; use the expanded syntax instead.
390
391 @e None of these metasyntax extensions is available if the application (or an
392 initial <tt>***=</tt> director) has specified that the user's input be treated
393 as a literal string rather than as an RE.
394
395
396 @section overview_resyntax_matching Matching
397
398 In the event that an RE could match more than one substring of a given string,
399 the RE matches the one starting earliest in the string. If the RE could match
400 more than one substring starting at that point, the choice is determined by
401 its @e preference: either the longest substring, or the shortest.
402
403 Most atoms, and all constraints, have no preference. A parenthesized RE has the
404 same preference (possibly none) as the RE. A quantified atom with quantifier
405 <tt>{m}</tt> or <tt>{m}?</tt> has the same preference (possibly none) as the
406 atom itself. A quantified atom with other normal quantifiers (including
407 <tt>{m,n}</tt> with @e m equal to @e n) prefers longest match. A quantified
408 atom with other non-greedy quantifiers (including <tt>{m,n}?</tt> with @e m
409 equal to @e n) prefers shortest match. A branch has the same preference as the
410 first quantified atom in it which has a preference. An RE consisting of two or
411 more branches connected by the @c | operator prefers longest match.
412
413 Subject to the constraints imposed by the rules for matching the whole RE,
414 subexpressions also match the longest or shortest possible substrings, based on
415 their preferences, with subexpressions starting earlier in the RE taking
416 priority over ones starting later. Note that outer subexpressions thus take
417 priority over their component subexpressions.
418
419 Note that the quantifiers <tt>{1,1}</tt> and <tt>{1,1}?</tt> can be used to
420 force longest and shortest preference, respectively, on a subexpression or a
421 whole RE.
422
423 Match lengths are measured in characters, not collating elements. An empty
424 string is considered longer than no match at all. For example, <tt>bb*</tt>
425 matches the three middle characters of "abbbc",
426 <tt>(week|wee)(night|knights)</tt> matches all ten characters of "weeknights",
427 when <tt>(.*).*</tt> is matched against "abc" the parenthesized subexpression
428 matches all three characters, and when <tt>(a*)*</tt> is matched against "bc"
429 both the whole RE and the parenthesized subexpression match an empty string.
430
431 If case-independent matching is specified, the effect is much as if all case
432 distinctions had vanished from the alphabet. When an alphabetic character that exists in
433 multiple cases appears as an ordinary character outside a bracket expression,
434 it is effectively transformed into a bracket expression containing both cases,
435 so that @c x becomes @c [xX]. When it appears inside a bracket expression, all
436 case counterparts of it are added to the bracket expression, so that @c [x]
437 becomes @c [xX] and @c [^x] becomes @c [^xX].
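
The transformation of ordinary characters can be pictured with the following
sketch. It illustrates the effect described above rather than how the package
implements it, and it assumes that the input consists of ordinary characters
only (no metacharacters or bracket expressions) and that the C
<tt>tolower</tt> and <tt>toupper</tt> functions supply the case counterparts.
The function name <tt>caseFold</tt> is hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Rewrites a string of ordinary characters so that every letter becomes a
// bracket expression containing both of its cases.
std::string caseFold(const std::string& literal)
{
    std::string out;
    for (const char c : literal)
    {
        const unsigned char u = static_cast<unsigned char>(c);
        if (std::isalpha(u))
        {
            out += '[';
            out += static_cast<char>(std::tolower(u));
            out += static_cast<char>(std::toupper(u));
            out += ']';
        }
        else
            out += c; // characters without case are left alone
    }
    return out;
}
```

Applied to <tt>Ab1</tt> this gives <tt>[aA][bB]1</tt>.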
438
439 If newline-sensitive matching is specified, "." and bracket expressions using
440 "^" will never match the newline character (so that matches will never cross
441 newlines unless the RE explicitly arranges it) and "^" and "$" will match the
442 empty string after and before a newline respectively, in addition to matching
443 at beginning and end of string respectively. ARE <tt>@\A</tt> and <tt>@\Z</tt>
444 continue to match beginning or end of string @e only.
445
446 If partial newline-sensitive matching is specified, this affects "." and
447 bracket expressions as with newline-sensitive matching, but not "^" and "$".
448
449 If inverse partial newline-sensitive matching is specified, this affects "^"
450 and "$" as with newline-sensitive matching, but not "." and bracket
451 expressions. This isn't very useful but is provided for symmetry.
452
453
454 @section overview_resyntax_limits Limits and Compatibility
455
456 No particular limit is imposed on the length of REs. Programs intended to be
457 highly portable should not employ REs longer than 256 bytes, as a
458 POSIX-compliant implementation can refuse to accept such REs.
459
460 The only feature of AREs that is actually incompatible with POSIX EREs is that
461 <tt>@\</tt> does not lose its special significance inside bracket expressions.
462 All other ARE features use syntax which is illegal or has undefined or
463 unspecified effects in POSIX EREs; the <tt>***</tt> syntax of directors
464 likewise is outside the POSIX syntax for both BREs and EREs.
465
466 Many of the ARE extensions are borrowed from Perl, but some have been changed
467 to clean them up, and a few Perl extensions are not present. Incompatibilities
468 of note include <tt>@\b</tt>, <tt>@\B</tt>, the lack of special treatment for a
469 trailing newline, the addition of complemented bracket expressions to the
470 things affected by newline-sensitive matching, the restrictions on parentheses
471 and back references in lookahead constraints, and the longest/shortest-match
472 (rather than first-match) matching semantics.
473
474 The matching rules for REs containing both normal and non-greedy quantifiers
475 have changed since early beta-test versions of this package. The new rules are
476 much simpler and cleaner, but don't work as hard at guessing the user's real
477 intentions.
478
479 Henry Spencer's original 1986 @e regexp package, still in widespread use,
480 implemented an early version of today's EREs. There are four incompatibilities
481 between @e regexp's near-EREs (RREs for short) and AREs. In roughly increasing
482 order of significance:
483
484 @li In AREs, <tt>@\</tt> followed by an alphanumeric character is either an
485 escape or an error, while in RREs, it was just another way of writing the
486 alphanumeric. This should not be a problem because there was no reason to
487 write such a sequence in RREs.
488 @li @c { followed by a digit in an ARE is the beginning of a bound, while in
489 RREs, @c { was always an ordinary character. Such sequences should be rare,
490 and will often result in an error because following characters will not
491 look like a valid bound.
492 @li In AREs, @c @\ remains a special character within @c [], so a literal @c @\
493 within @c [] must be written as <tt>@\@\</tt>. <tt>@\@\</tt> also gives a
494 literal @c @\ within @c [] in RREs, but only truly paranoid programmers
495 routinely doubled the backslash.
496 @li AREs report the longest/shortest match for the RE, rather than the first
497 found in a specified search order. This may affect some RREs which were
498 written in the expectation that the first match would be reported. The
499 careful crafting of RREs to optimize the search order for fast matching is
500 obsolete (AREs examine all possible matches in parallel, and their
501 performance is largely insensitive to their complexity) but cases where the
502 search order was exploited to deliberately find a match which was @e not
503 the longest/shortest will need rewriting.
504
505
506 @section overview_resyntax_bre Basic Regular Expressions
507
508 BREs differ from EREs in several respects. @c |, @c +, and @c ? are ordinary
509 characters and there is no equivalent for their functionality. The delimiters
510 for bounds are @c @\{ and @c @\}, with @c { and @c } by themselves ordinary
511 characters. The parentheses for nested subexpressions are @c @\( and @c @\),
512 with @c ( and @c ) by themselves ordinary characters. @c ^ is an ordinary
513 character except at the beginning of the RE or the beginning of a parenthesized
514 subexpression, @c $ is an ordinary character except at the end of the RE or the
515 end of a parenthesized subexpression, and @c * is an ordinary character if it
516 appears at the beginning of the RE or the beginning of a parenthesized
517 subexpression (after a possible leading <tt>^</tt>). Finally, single-digit back
518 references are available, and @c @\@< and @c @\@> are synonyms for
519 <tt>[[:@<:]]</tt> and <tt>[[:@>:]]</tt> respectively; no other escapes are
520 available.
521
522
523 @section overview_resyntax_characters Regular Expression Character Names
524
525 Note that the character names are case sensitive.
526
527 <center><table class='doctable' border='0' cellspacing='5' cellpadding='4'><tr>
528
529 <td>
530 @beginTable
531 @row2col{ <tt>NUL</tt> , @\0 }
532 @row2col{ <tt>SOH</tt> , @\001 }
533 @row2col{ <tt>STX</tt> , @\002 }
534 @row2col{ <tt>ETX</tt> , @\003 }
535 @row2col{ <tt>EOT</tt> , @\004 }
536 @row2col{ <tt>ENQ</tt> , @\005 }
537 @row2col{ <tt>ACK</tt> , @\006 }
538 @row2col{ <tt>BEL</tt> , @\007 }
539 @row2col{ <tt>alert</tt> , @\007 }
540 @row2col{ <tt>BS</tt> , @\010 }
541 @row2col{ <tt>backspace</tt> , @\b }
542 @row2col{ <tt>HT</tt> , @\011 }
543 @row2col{ <tt>tab</tt> , @\t }
544 @row2col{ <tt>LF</tt> , @\012 }
545 @row2col{ <tt>newline</tt> , @\n }
546 @row2col{ <tt>VT</tt> , @\013 }
547 @row2col{ <tt>vertical-tab</tt> , @\v }
548 @row2col{ <tt>FF</tt> , @\014 }
549 @row2col{ <tt>form-feed</tt> , @\f }
550 @endTable
551 </td>
552
553 <td>
554 @beginTable
555 @row2col{ <tt>CR</tt> , @\015 }
556 @row2col{ <tt>carriage-return</tt> , @\r }
557 @row2col{ <tt>SO</tt> , @\016 }
558 @row2col{ <tt>SI</tt> , @\017 }
559 @row2col{ <tt>DLE</tt> , @\020 }
560 @row2col{ <tt>DC1</tt> , @\021 }
561 @row2col{ <tt>DC2</tt> , @\022 }
562 @row2col{ <tt>DC3</tt> , @\023 }
563 @row2col{ <tt>DC4</tt> , @\024 }
564 @row2col{ <tt>NAK</tt> , @\025 }
565 @row2col{ <tt>SYN</tt> , @\026 }
566 @row2col{ <tt>ETB</tt> , @\027 }
567 @row2col{ <tt>CAN</tt> , @\030 }
568 @row2col{ <tt>EM</tt> , @\031 }
569 @row2col{ <tt>SUB</tt> , @\032 }
570 @row2col{ <tt>ESC</tt> , @\033 }
571 @row2col{ <tt>IS4</tt> , @\034 }
572 @row2col{ <tt>FS</tt> , @\034 }
573 @row2col{ <tt>IS3</tt> , @\035 }
574 @endTable
575 </td>
576
577 <td>
578 @beginTable
579 @row2col{ <tt>GS</tt> , @\035 }
580 @row2col{ <tt>IS2</tt> , @\036 }
581 @row2col{ <tt>RS</tt> , @\036 }
582 @row2col{ <tt>IS1</tt> , @\037 }
583 @row2col{ <tt>US</tt> , @\037 }
584 @row2col{ <tt>space</tt> , " " (space) }
585 @row2col{ <tt>exclamation-mark</tt> , ! }
586 @row2col{ <tt>quotation-mark</tt> , " }
587 @row2col{ <tt>number-sign</tt> , @# }
588 @row2col{ <tt>dollar-sign</tt> , @$ }
589 @row2col{ <tt>percent-sign</tt> , @% }
590 @row2col{ <tt>ampersand</tt> , @& }
591 @row2col{ <tt>apostrophe</tt> , ' }
592 @row2col{ <tt>left-parenthesis</tt> , ( }
593 @row2col{ <tt>right-parenthesis</tt> , ) }
594 @row2col{ <tt>asterisk</tt> , * }
595 @row2col{ <tt>plus-sign</tt> , + }
596 @row2col{ <tt>comma</tt> , \, }
597 @row2col{ <tt>hyphen</tt> , - }
598 @endTable
599 </td>
600
601 <td>
602 @beginTable
603 @row2col{ <tt>hyphen-minus</tt> , - }
604 @row2col{ <tt>period</tt> , . }
605 @row2col{ <tt>full-stop</tt> , . }
606 @row2col{ <tt>slash</tt> , / }
607 @row2col{ <tt>solidus</tt> , / }
608 @row2col{ <tt>zero</tt> , 0 }
609 @row2col{ <tt>one</tt> , 1 }
610 @row2col{ <tt>two</tt> , 2 }
611 @row2col{ <tt>three</tt> , 3 }
612 @row2col{ <tt>four</tt> , 4 }
613 @row2col{ <tt>five</tt> , 5 }
614 @row2col{ <tt>six</tt> , 6 }
615 @row2col{ <tt>seven</tt> , 7 }
616 @row2col{ <tt>eight</tt> , 8 }
617 @row2col{ <tt>nine</tt> , 9 }
618 @row2col{ <tt>colon</tt> , : }
619 @row2col{ <tt>semicolon</tt> , ; }
620 @row2col{ <tt>less-than-sign</tt> , @< }
621 @row2col{ <tt>equals-sign</tt> , = }
622 @endTable
623 </td>
624
625 <td>
626 @beginTable
627 @row2col{ <tt>greater-than-sign</tt> , @> }
628 @row2col{ <tt>question-mark</tt> , ? }
629 @row2col{ <tt>commercial-at</tt> , @@ }
630 @row2col{ <tt>left-square-bracket</tt> , [ }
631 @row2col{ <tt>backslash</tt> , @\ }
632 @row2col{ <tt>reverse-solidus</tt> , @\ }
633 @row2col{ <tt>right-square-bracket</tt> , ] }
634 @row2col{ <tt>circumflex</tt> , ^ }
635 @row2col{ <tt>circumflex-accent</tt> , ^ }
636 @row2col{ <tt>underscore</tt> , _ }
637 @row2col{ <tt>low-line</tt> , _ }
638 @row2col{ <tt>grave-accent</tt> , ` }
639 @row2col{ <tt>left-brace</tt> , @leftCurly }
640 @row2col{ <tt>left-curly-bracket</tt> , @leftCurly }
641 @row2col{ <tt>vertical-line</tt> , | }
642 @row2col{ <tt>right-brace</tt> , @rightCurly }
643 @row2col{ <tt>right-curly-bracket</tt> , @rightCurly }
644 @row2col{ <tt>tilde</tt> , ~ }
645 @row2col{ <tt>DEL</tt> , @\177 }
646 @endTable
647 </td>
648
649 </tr></table></center>
650
651 */