Commit | Line | Data |
---|---|---|
1 | % manual page source format generated by PolyglotMan v3.0.9, | |
2 | % available via anonymous ftp from ftp.cs.berkeley.edu:/ucb/people/phelps/tcltk/rman.tar.Z | |
3 | ||
4 | \section{Syntax of the builtin regular expression library}\label{wxresyn} | |
5 | ||
6 | A {\it regular expression} describes strings of characters. It's a | |
7 | pattern that matches certain strings and doesn't match others. | |
8 | ||
9 | \wxheading{See also} | |
10 | ||
11 | \helpref{wxRegEx}{wxregex} | |
12 | ||
13 | \subsection{Different Flavors of REs} | |
14 | ||
15 | \helpref{Syntax of the builtin regular expression library}{wxresyn} | |
16 | ||
17 | Regular expressions (``RE''s), as defined by POSIX, come in two | |
18 | flavors: {\it extended} REs (``EREs'') and {\it basic} REs (``BREs''). EREs are roughly those | |
19 | of the traditional {\it egrep}, while BREs are roughly those of the traditional | |
20 | {\it ed}. This implementation adds a third flavor, {\it advanced} REs (``AREs''), basically | |
21 | EREs with some significant extensions. | |
22 | ||
23 | This manual page primarily describes | |
24 | AREs. BREs mostly exist for backward compatibility in some old programs; | |
25 | they will be discussed at the \helpref{end}{wxresynbre}. POSIX EREs are almost an exact subset | |
26 | of AREs. Features of AREs that are not present in EREs will be indicated. | |
27 | ||
28 | \subsection{Regular Expression Syntax} | |
29 | ||
30 | \helpref{Syntax of the builtin regular expression library}{wxresyn} | |
31 | ||
32 | These regular expressions are implemented using | |
33 | the package written by Henry Spencer, based on the 1003.2 spec and some | |
34 | (not quite all) of the Perl5 extensions (thanks, Henry!). Much of the description | |
35 | of regular expressions below is copied verbatim from his manual entry. | |
36 | ||
37 | An ARE is one or more {\it branches}, separated by `{\bf $|$}', matching anything that matches | |
38 | any of the branches. | |
39 | ||
40 | A branch is zero or more {\it constraints} or {\it quantified | |
41 | atoms}, concatenated. It matches a match for the first, followed by a match | |
42 | for the second, etc; an empty branch matches the empty string. | |
43 | ||
44 | A quantified atom is an {\it atom} possibly followed by a single {\it quantifier}. Without a quantifier, | |
45 | it matches a match for the atom. The quantifiers, and what a so-quantified | |
46 | atom matches, are: | |
47 | ||
48 | \begin{twocollist}\twocolwidtha{4cm} | |
49 | \twocolitem{{\bf *}}{a sequence of 0 or more matches of the atom} | |
50 | \twocolitem{{\bf +}}{a sequence of 1 or more matches of the atom} | |
51 | \twocolitem{{\bf ?}}{a sequence of 0 or 1 matches of the atom} | |
52 | \twocolitem{{\bf \{m\}}}{a sequence of exactly {\it m} matches of the atom} | |
53 | \twocolitem{{\bf \{m,\}}}{a sequence of {\it m} or more matches of the atom} | |
54 | \twocolitem{{\bf \{m,n\}}}{a sequence of {\it m} through {\it n} (inclusive) | |
55 | matches of the atom; {\it m} may not exceed {\it n}} | |
56 | \twocolitem{{\bf *? +? ?? \{m\}? \{m,\}? \{m,n\}?}}{{\it non-greedy} quantifiers, | |
57 | which match the same possibilities, but prefer the | |
58 | smallest number rather than the largest number of matches (see \helpref{Matching}{wxresynmatching})} | |
59 | \end{twocollist} | |
60 | ||
61 | The forms using {\bf \{} and {\bf \}} are known as {\it bound}s. The numbers {\it m} and {\it n} are unsigned | |
62 | decimal integers with permissible values from 0 to 255 inclusive. | |
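The difference between greedy and non-greedy quantifiers can be sketched with a short example. It uses Python's {\it re} module, which is an assumption made purely for illustration: it is a Perl-style engine, not the library described here, although it treats these particular quantifiers the same way.

```python
import re

text = "aaaa"

# Greedy quantifiers prefer the largest number of matches of the atom.
greedy_plus = re.match(r"a+", text).group()
greedy_bound = re.match(r"a{2,3}", text).group()

# Non-greedy quantifiers match the same possibilities but prefer the smallest.
lazy_plus = re.match(r"a+?", text).group()
lazy_bound = re.match(r"a{2,3}?", text).group()

# An exact bound leaves no choice, so greediness makes no difference.
exact = re.match(r"a{2}", text).group()
exact_lazy = re.match(r"a{2}?", text).group()

print(greedy_plus, greedy_bound, lazy_plus, lazy_bound, exact, exact_lazy)
```

Here the greedy forms consume as many characters as the bound allows, while the non-greedy forms stop at the minimum.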
63 | An atom is one of: | |
64 | ||
65 | \begin{twocollist}\twocolwidtha{4cm} | |
66 | \twocolitem{{\bf (re)}}{(where {\it re} is any regular expression) matches a match for | |
67 | {\it re}, with the match noted for possible reporting} | |
68 | \twocolitem{{\bf (?:re)}}{as previous, but | |
69 | does no reporting (a ``non-capturing'' set of parentheses)} | |
70 | \twocolitem{{\bf ()}}{matches an empty | |
71 | string, noted for possible reporting} | |
72 | \twocolitem{{\bf (?:)}}{matches an empty string, without reporting} | |
73 | \twocolitem{{\bf $[chars]$}}{a {\it bracket expression}, matching any one of the {\it chars} | |
74 | (see \helpref{Bracket Expressions}{wxresynbracket} for more detail)} | |
75 | \twocolitem{{\bf .}}{matches any single character} | |
76 | \twocolitem{{\bf $\backslash$k}}{(where {\it k} is a non-alphanumeric character) | |
77 | matches that character taken as an ordinary character, e.g. $\backslash\backslash$ matches a backslash | |
78 | character} | |
79 | \twocolitem{{\bf $\backslash$c}}{where {\it c} is alphanumeric (possibly followed by other characters), | |
80 | an {\it escape} (AREs only), see \helpref{Escapes}{wxresynescapes} below} | |
81 | \twocolitem{{\bf \{}}{when followed by a character | |
82 | other than a digit, matches the left-brace character `{\bf \{}'; when followed by | |
83 | a digit, it is the beginning of a {\it bound} (see above)} | |
84 | \twocolitem{{\bf x}}{where {\it x} is a single | |
85 | character with no other significance, matches that character.} | |
86 | \end{twocollist} | |
87 | ||
88 | A {\it constraint} matches an empty string when specific conditions are met. A constraint may | |
89 | not be followed by a quantifier. The simple constraints are as follows; | |
90 | some more constraints are described later, under \helpref{Escapes}{wxresynescapes}. | |
91 | ||
92 | \begin{twocollist}\twocolwidtha{4cm} | |
93 | \twocolitem{{\bf $^$}}{matches at the beginning of a line} | |
94 | \twocolitem{{\bf \$}}{matches at the end of a line} | |
95 | \twocolitem{{\bf (?=re)}}{{\it positive lookahead} | |
96 | (AREs only), matches at any point where a substring matching {\it re} begins} | |
97 | \twocolitem{{\bf (?!re)}}{{\it negative lookahead} (AREs only), | |
98 | matches at any point where no substring matching {\it re} begins} | |
99 | \end{twocollist} | |
100 | ||
101 | The lookahead constraints may not contain back references | |
102 | (see later), and all parentheses within them are considered non-capturing. | |
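A minimal sketch of the two lookahead constraints follows. It is written with Python's {\it re} module as a stand-in (an assumption for illustration, not the library documented here); the sample strings are invented.

```python
import re

# Positive lookahead: "foo" matches only where "bar" begins right after it.
# The lookahead itself matches an empty string, so "bar" is not consumed.
positive = re.search(r"foo(?=bar)", "foobar")

# Negative lookahead: "foo" matches only where "bar" does NOT begin right after it.
negative_fail = re.search(r"foo(?!bar)", "foobar")
negative_ok = re.search(r"foo(?!bar)", "foobaz")

print(positive.group(), positive.end(), negative_fail, negative_ok.group())
```

In both cases the reported match is just the three characters before the lookahead, because a constraint contributes no characters to the match.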
103 | ||
104 | An RE may not end with `{\bf $\backslash$}'. | |
105 | ||
106 | \subsection{Bracket Expressions}\label{wxresynbracket} | |
107 | ||
108 | \helpref{Syntax of the builtin regular expression library}{wxresyn} | |
109 | ||
110 | A {\it bracket expression} is a list | |
111 | of characters enclosed in `{\bf $[]$}'. It normally matches any single character from | |
112 | the list (but see below). If the list begins with `{\bf $^$}', it matches any single | |
113 | character (but see below) {\it not} from the rest of the list. | |
114 | ||
115 | If two characters | |
116 | in the list are separated by `{\bf -}', this is shorthand for the full {\it range} of | |
117 | characters between those two (inclusive) in the collating sequence, e.g. | |
118 | {\bf $[0-9]$} in ASCII matches any decimal digit. Two ranges may not share an endpoint, | |
119 | so e.g. {\bf a-c-e} is illegal. Ranges are very collating-sequence-dependent, and portable | |
120 | programs should avoid relying on them. | |
121 | ||
122 | To include a literal {\bf $]$} or {\bf -} in the | |
123 | list, the simplest method is to enclose it in {\bf $[.$} and {\bf $.]$} to make it a collating | |
124 | element (see below). Alternatively, make it the first character (following | |
125 | a possible `{\bf $^$}'), or (AREs only) precede it with `{\bf $\backslash$}'. | |
126 | Alternatively, for `{\bf -}', make | |
127 | it the last character, or the second endpoint of a range. To use a literal | |
128 | {\bf -} as the first endpoint of a range, make it a collating element or (AREs | |
129 | only) precede it with `{\bf $\backslash$}'. With the exception of these, some combinations using | |
130 | {\bf $[$} (see next paragraphs), and escapes, all other special characters lose | |
131 | their special significance within a bracket expression. | |
132 | ||
133 | Within a bracket | |
134 | expression, a collating element (a character, a multi-character sequence | |
135 | that collates as if it were a single character, or a collating-sequence | |
136 | name for either) enclosed in {\bf $[.$} and {\bf $.]$} stands for the | |
137 | sequence of characters of that collating element. | |
138 | ||
139 | {\it wxWidgets}: Currently no multi-character collating elements are defined. | |
140 | So in {\bf $[.X.]$}, {\it X} can either be a single character literal or | |
141 | the name of a character. For example, the following two expressions are equivalent: | |
142 | {\bf $[[.0.]-[.9.]]$} and {\bf $[[.zero.]-[.nine.]]$}; both mean the same as | |
143 | {\bf $[0-9]$}. | |
144 | See \helpref{Character Names}{wxresynchars}. | |
145 | ||
146 | %The sequence is a single element of the bracket | |
147 | %expression's list. A bracket expression in a locale that has multi-character | |
148 | %collating elements can thus match more than one character. So (insidiously), | |
149 | %a bracket expression that starts with {\bf $^$} can match multi-character collating | |
150 | %elements even if none of them appear in the bracket expression! ({\it Note:} | |
151 | %Tcl currently has no multi-character collating elements. This information | |
152 | %is only for illustration.) | |
153 | % | |
154 | %For example, assume the collating sequence includes | |
155 | %a {\bf ch} multi-character collating element. Then the RE {\bf $[[.ch.]]*c$} (zero or more | |
156 | % {\bf ch}'s followed by {\bf c}) matches the first five characters of `{\bf chchcc}'. Also, the | |
157 | %RE {\bf $[^c]b$} matches all of `{\bf chb}' (because {\bf $[^c]$} matches the multi-character {\bf ch}). | |
158 | ||
159 | Within a bracket expression, a collating element enclosed in {\bf $[=$} and {\bf $=]$} | |
160 | is an equivalence class, standing for the sequences of characters of all | |
161 | collating elements equivalent to that one, including itself. | |
162 | %(If there are | |
163 | %no other equivalent collating elements, the treatment is as if the enclosing | |
164 | %delimiters were `{\bf $[.$}' and `{\bf $.]$}'.) For example, if {\bf o} | |
165 | %and {\bf $^$} are the members of an | |
166 | %equivalence class, then `{\bf $[[$=o=$]]$}', `{\bf $[[$=$^$=$]]$}', | |
167 | %and `{\bf $[o^]$}' are all synonymous. | |
168 | An equivalence class may not be an endpoint of a range. | |
169 | ||
170 | %({\it Note:} Tcl currently | |
171 | %implements only the Unicode locale. It doesn't define any equivalence classes. | |
172 | %The examples above are just illustrations.) | |
173 | ||
174 | {\it wxWidgets}: Currently no equivalence classes are defined, so | |
175 | {\bf $[=X=]$} stands for just the single character {\it X}. | |
176 | {\it X} can either be a single character literal or the name of a character, | |
177 | see \helpref{Character Names}{wxresynchars}. | |
178 | ||
179 | Within a bracket expression, | |
180 | the name of a {\it character class} enclosed in {\bf $[:$} and {\bf $:]$} stands for the list | |
181 | of all characters (not all collating elements!) belonging to that class. | |
182 | Standard character classes are: | |
183 | ||
184 | \begin{twocollist}\twocolwidtha{3cm} | |
185 | \twocolitem{{\bf alpha}}{A letter.} | |
186 | \twocolitem{{\bf upper}}{An upper-case letter.} | |
187 | \twocolitem{{\bf lower}}{A lower-case letter.} | |
188 | \twocolitem{{\bf digit}}{A decimal digit.} | |
189 | \twocolitem{{\bf xdigit}}{A hexadecimal digit.} | |
190 | \twocolitem{{\bf alnum}}{An alphanumeric (letter or digit).} | |
191 | \twocolitem{{\bf print}}{An alphanumeric (same as alnum).} | |
192 | \twocolitem{{\bf blank}}{A space or tab character.} | |
193 | \twocolitem{{\bf space}}{A character producing white space in displayed text.} | |
194 | \twocolitem{{\bf punct}}{A punctuation character.} | |
195 | \twocolitem{{\bf graph}}{A character with a visible representation.} | |
196 | \twocolitem{{\bf cntrl}}{A control character.} | |
197 | \end{twocollist} | |
198 | ||
199 | %A locale may provide others. (Note that the current Tcl | |
200 | %implementation has only one locale: the Unicode locale.) | |
201 | A character class may not be used as an endpoint of a range. | |
202 | ||
203 | {\it wxWidgets}: In a non-Unicode build, these character classifications depend on the | |
204 | current locale, and correspond to the values returned by the ANSI C 'is' | |
205 | functions: isalpha, isupper, etc. In Unicode mode they are based on | |
206 | Unicode classifications, and are not affected by the current locale. | |
207 | ||
208 | There are two special cases of bracket expressions: | |
209 | the bracket expressions {\bf $[[:$<$:]]$} and {\bf $[[:$>$:]]$} are constraints, matching empty | |
210 | strings at the beginning and end of a word respectively. A word is defined | |
211 | as a sequence of word characters that is neither preceded nor followed | |
212 | by word characters. A word character is an {\it alnum} character or an underscore | |
213 | ({\bf \_}). These special bracket expressions are deprecated; users of AREs should | |
214 | use constraint escapes instead (see \helpref{Escapes}{wxresynescapes} below). | |
215 | ||
216 | \subsection{Escapes}\label{wxresynescapes} | |
217 | ||
218 | \helpref{Syntax of the builtin regular expression library}{wxresyn} | |
219 | ||
220 | Escapes (AREs only), | |
221 | which begin with a {\bf $\backslash$} followed by an alphanumeric character, come in several | |
222 | varieties: character entry, class shorthands, constraint escapes, and back | |
223 | references. A {\bf $\backslash$} followed by an alphanumeric character but not constituting | |
224 | a valid escape is illegal in AREs. In EREs, there are no escapes: outside | |
225 | a bracket expression, a {\bf $\backslash$} followed by an alphanumeric character merely stands | |
226 | for that character as an ordinary character, and inside a bracket expression, | |
227 | {\bf $\backslash$} is an ordinary character. (The latter is the one actual incompatibility | |
228 | between EREs and AREs.) | |
229 | ||
230 | Character-entry escapes (AREs only) exist to make | |
231 | it easier to specify non-printing and otherwise inconvenient characters | |
232 | in REs: | |
233 | ||
234 | \begin{twocollist}\twocolwidtha{4cm} | |
235 | \twocolitem{{\bf $\backslash$a}}{alert (bell) character, as in C} | |
236 | \twocolitem{{\bf $\backslash$b}}{backspace, as in C} | |
237 | \twocolitem{{\bf $\backslash$B}}{synonym | |
238 | for {\bf $\backslash$} to help reduce backslash doubling in some applications where there | |
239 | are multiple levels of backslash processing} | |
240 | \twocolitem{{\bf $\backslash$c{\it X}}}{(where {\it X} is any character) | |
241 | the character whose low-order 5 bits are the same as those of {\it X}, and whose | |
242 | other bits are all zero} | |
243 | \twocolitem{{\bf $\backslash$e}}{the character whose collating-sequence name is | |
244 | `{\bf ESC}', or failing that, the character with octal value 033} | |
245 | \twocolitem{{\bf $\backslash$f}}{formfeed, as in C} | |
246 | \twocolitem{{\bf $\backslash$n}}{newline, as in C} | |
247 | \twocolitem{{\bf $\backslash$r}}{carriage return, as in C} | |
248 | \twocolitem{{\bf $\backslash$t}}{horizontal tab, as in C} | |
249 | \twocolitem{{\bf $\backslash$u{\it wxyz}}}{(where {\it wxyz} is exactly four hexadecimal digits) | |
250 | the Unicode | |
251 | character {\bf U+{\it wxyz}} in the local byte ordering} | |
252 | \twocolitem{{\bf $\backslash$U{\it stuvwxyz}}}{(where {\it stuvwxyz} is | |
253 | exactly eight hexadecimal digits) reserved for a somewhat-hypothetical Unicode | |
254 | extension to 32 bits} | |
255 | \twocolitem{{\bf $\backslash$v}}{vertical tab, as in C} | |
256 | \twocolitem{{\bf $\backslash$x{\it hhh}}}{(where | |
257 | {\it hhh} is any sequence of hexadecimal digits) the character whose hexadecimal | |
258 | value is {\bf 0x{\it hhh}} (a single character no matter how many hexadecimal digits | |
259 | are used).} | |
260 | \twocolitem{{\bf $\backslash$0}}{the character whose value is {\bf 0}} | |
261 | \twocolitem{{\bf $\backslash${\it xy}}}{(where {\it xy} is exactly two | |
262 | octal digits, and is not a {\it back reference} (see below)) the character whose | |
263 | octal value is {\bf 0{\it xy}}} | |
264 | \twocolitem{{\bf $\backslash${\it xyz}}}{(where {\it xyz} is exactly three octal digits, and is | |
265 | not a back reference (see below)) | |
266 | the character whose octal value is {\bf 0{\it xyz}}} | |
267 | \end{twocollist} | |
268 | ||
269 | Hexadecimal digits are `{\bf 0}'-`{\bf 9}', `{\bf a}'-`{\bf f}', and `{\bf A}'-`{\bf F}'. Octal | |
270 | digits are `{\bf 0}'-`{\bf 7}'. | |
271 | ||
272 | The character-entry | |
273 | escapes are always taken as ordinary characters. For example, {\bf $\backslash$135} is {\bf ]} in | |
274 | ASCII, but {\bf $\backslash$135} does not terminate a bracket expression. Beware, however, | |
275 | that some applications (e.g., C compilers) interpret such sequences themselves | |
276 | before the regular-expression package gets to see them, which may require | |
277 | doubling (quadrupling, etc.) the `{\bf $\backslash$}'. | |
278 | ||
279 | Class-shorthand escapes (AREs only) provide | |
280 | shorthands for certain commonly-used character classes: | |
281 | ||
282 | \begin{twocollist}\twocolwidtha{4cm} | |
283 | \twocolitem{{\bf $\backslash$d}}{{\bf $[[:digit:]]$}} | |
284 | \twocolitem{{\bf $\backslash$s}}{{\bf $[[:space:]]$}} | |
285 | \twocolitem{{\bf $\backslash$w}}{{\bf $[[:alnum:]\_]$} (note underscore)} | |
286 | \twocolitem{{\bf $\backslash$D}}{{\bf $[^[:digit:]]$}} | |
287 | \twocolitem{{\bf $\backslash$S}}{{\bf $[^[:space:]]$}} | |
288 | \twocolitem{{\bf $\backslash$W}}{{\bf $[^[:alnum:]\_]$} (note underscore)} | |
289 | \end{twocollist} | |
290 | ||
291 | Within bracket expressions, `{\bf $\backslash$d}', `{\bf $\backslash$s}', and | |
292 | `{\bf $\backslash$w}' lose their outer brackets, and `{\bf $\backslash$D}', | |
293 | `{\bf $\backslash$S}', and `{\bf $\backslash$W}' are illegal. (So, for example, | |
294 | {\bf $[$a-c$\backslash$d$]$} is equivalent to {\bf $[a-c[:digit:]]$}. | |
295 | Also, {\bf $[$a-c$\backslash$D$]$}, which is equivalent to | |
296 | {\bf $[a-c^[:digit:]]$}, is illegal.) | |
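The class shorthands can be tried out with a short sketch. It assumes Python's {\it re} module as a stand-in for the library described here; the lowercase shorthands behave alike inside and outside brackets in both, but the rule that the uppercase forms are illegal inside brackets is specific to AREs and is not shown.

```python
import re

# Outside brackets: digit, white-space and word-character shorthands.
digits = re.findall(r"\d+", "room 42, floor 7")
words = re.findall(r"\w+", "foo_bar baz9")       # note the underscore counts
non_digits = re.findall(r"\D+", "ab12cd")

# Inside brackets the lowercase shorthand simply adds its class to the list,
# so this accepts the letters a-c as well as any digit.
mixed = re.fullmatch(r"[a-c\d]+", "ab12c")
rejected = re.fullmatch(r"[a-c\d]+", "abd")      # 'd' is neither a-c nor a digit

print(digits, words, non_digits, mixed.group(), rejected)
```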
297 | ||
298 | A constraint escape (AREs only) is a constraint, | |
299 | matching the empty string if specific conditions are met, written as an | |
300 | escape: | |
301 | ||
302 | \begin{twocollist}\twocolwidtha{4cm} | |
303 | \twocolitem{{\bf $\backslash$A}}{matches only at the beginning of the string | |
304 | (see \helpref{Matching}{wxresynmatching}, below, | |
305 | for how this differs from `{\bf $^$}')} | |
306 | \twocolitem{{\bf $\backslash$m}}{matches only at the beginning of a word} | |
307 | \twocolitem{{\bf $\backslash$M}}{matches only at the end of a word} | |
308 | \twocolitem{{\bf $\backslash$y}}{matches only at the beginning or end of a word} | |
309 | \twocolitem{{\bf $\backslash$Y}}{matches only at a point that is not the beginning or end of | |
310 | a word} | |
311 | \twocolitem{{\bf $\backslash$Z}}{matches only at the end of the string | |
312 | (see \helpref{Matching}{wxresynmatching}, below, for | |
313 | how this differs from `{\bf \$}')} | |
314 | \twocolitem{{\bf $\backslash${\it m}}}{(where {\it m} is a nonzero digit) a {\it back reference}, | |
315 | see below} | |
316 | \twocolitem{{\bf $\backslash${\it mnn}}}{(where {\it m} is a nonzero digit, and {\it nn} is some more digits, | |
317 | and the decimal value {\it mnn} is not greater than the number of closing capturing | |
318 | parentheses seen so far) a {\it back reference}, see below} | |
319 | \end{twocollist} | |
320 | ||
321 | A word is defined | |
322 | as in the specification of {\bf $[[:$<$:]]$} and {\bf $[[:$>$:]]$} above. Constraint escapes are | |
323 | illegal within bracket expressions. | |
324 | ||
325 | A back reference (AREs only) matches | |
326 | the same string matched by the parenthesized subexpression specified by | |
327 | the number, so that (e.g.) {\bf ($[bc]$)$\backslash$1} matches {\bf bb} or {\bf cc} but not `{\bf bc}'. | |
328 | The subexpression | |
329 | must entirely precede the back reference in the RE. Subexpressions are numbered | |
330 | in the order of their leading parentheses. Non-capturing parentheses do not | |
331 | define subexpressions. | |
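The back reference example above can be checked with a small sketch, here using Python's {\it re} module as an assumed stand-in for the library described in this section.

```python
import re

# Pattern from the text: a 'b' or 'c', followed by the same character again.
pattern = re.compile(r"([bc])\1")

matches_bb = pattern.fullmatch("bb") is not None
matches_cc = pattern.fullmatch("cc") is not None
matches_bc = pattern.fullmatch("bc") is not None

# Non-capturing parentheses do not define subexpressions, so the
# back reference 1 below refers to (b), not to (?:a).
skipped = re.fullmatch(r"(?:a)(b)\1", "abb") is not None

print(matches_bb, matches_cc, matches_bc, skipped)
```

The back reference repeats the text that was actually matched, not the subexpression itself, which is why the mixed string is rejected.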
332 | ||
333 | There is an inherent historical ambiguity between | |
334 | octal character-entry escapes and back references, which is resolved by | |
335 | heuristics, as hinted at above. A leading zero always indicates an octal | |
336 | escape. A single non-zero digit, not followed by another digit, is always | |
337 | taken as a back reference. A multi-digit sequence not starting with a zero | |
338 | is taken as a back reference if it comes after a suitable subexpression | |
339 | (i.e. the number is in the legal range for a back reference), and otherwise | |
340 | is taken as octal. | |
341 | ||
342 | \subsection{Metasyntax} | |
343 | ||
344 | \helpref{Syntax of the builtin regular expression library}{wxresyn} | |
345 | ||
346 | In addition to the main syntax described above, | |
347 | there are some special forms and miscellaneous syntactic facilities available. | |
348 | ||
349 | Normally the flavor of RE being used is specified by application-dependent | |
350 | means. However, this can be overridden by a {\it director}. If an RE of any flavor | |
351 | begins with `{\bf ***:}', the rest of the RE is an ARE. If an RE of any flavor begins | |
352 | with `{\bf ***=}', the rest of the RE is taken to be a literal string, with all | |
353 | characters considered ordinary characters. | |
354 | ||
355 | An ARE may begin with {\it embedded options}: a sequence {\bf (?xyz)} | |
356 | (where {\it xyz} is one or more alphabetic characters) | |
357 | specifies options affecting the rest of the RE. These supplement, and can | |
358 | override, any options specified by the application. The available option | |
359 | letters are: | |
360 | ||
361 | \begin{twocollist}\twocolwidtha{4cm} | |
362 | \twocolitem{{\bf b}}{rest of RE is a BRE} | |
363 | \twocolitem{{\bf c}}{case-sensitive matching (usual default)} | |
364 | \twocolitem{{\bf e}}{rest of RE is an ERE} | |
365 | \twocolitem{{\bf i}}{case-insensitive matching (see \helpref{Matching}{wxresynmatching}, below)} | |
366 | \twocolitem{{\bf m}}{historical synonym for {\bf n}} | |
367 | \twocolitem{{\bf n}}{newline-sensitive matching (see \helpref{Matching}{wxresynmatching}, below)} | |
368 | \twocolitem{{\bf p}}{partial newline-sensitive matching (see \helpref{Matching}{wxresynmatching}, below)} | |
369 | \twocolitem{{\bf q}}{rest of RE | |
370 | is a literal (``quoted'') string, all ordinary characters} | |
371 | \twocolitem{{\bf s}}{non-newline-sensitive matching (usual default)} | |
372 | \twocolitem{{\bf t}}{tight syntax (usual default; see below)} | |
373 | \twocolitem{{\bf w}}{inverse | |
374 | partial newline-sensitive (``weird'') matching (see \helpref{Matching}{wxresynmatching}, below)} | |
375 | \twocolitem{{\bf x}}{expanded syntax (see below)} | |
376 | \end{twocollist} | |
377 | ||
378 | Embedded options take effect at the {\bf )} terminating the | |
379 | sequence. They are available only at the start of an ARE, and may not be | |
380 | used later within it. | |
381 | ||
382 | In addition to the usual ({\it tight}) RE syntax, in which | |
383 | all characters are significant, there is an {\it expanded} syntax, available | |
384 | %in all flavors of RE with the {\bf -expanded} switch, or | |
385 | in AREs with the embedded | |
386 | {\bf x} option. In the expanded syntax, white-space characters are ignored and | |
387 | all characters between a {\bf \#} and the following newline (or the end of the | |
388 | RE) are ignored, permitting paragraphing and commenting a complex RE. There | |
389 | are three exceptions to that basic rule: | |
390 | {\itemize | |
391 | \item% | |
392 | a white-space character or `{\bf \#}' preceded | |
393 | by `{\bf $\backslash$}' is retained | |
394 | \item% | |
395 | white space or `{\bf \#}' within a bracket expression is retained | |
396 | \item% | |
397 | white space and comments are illegal within multi-character symbols like | |
398 | the ARE `{\bf (?:}' or the BRE `{\bf $\backslash$(}' | |
399 | } | |
400 | Expanded-syntax white-space characters are blank, | |
401 | tab, newline, and any character that belongs to the {\it space} character class. | |
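As a rough illustration of the expanded syntax, the sketch below uses the embedded x option in Python's {\it re} module. This is an assumption for illustration only: Python's verbose mode follows the same basic rule and the first two exceptions listed above, but it is a different engine, and the phone-number pattern is invented.

```python
import re

# White space and comments are ignored, which allows a pattern to be
# laid out over several lines and annotated.
phone = re.compile(r"""(?x)
    (\d{3})    # first group of digits
    [ -]       # separator: white space within a bracket expression is retained
    (\d{4})    # second group of digits
""")

with_space = phone.fullmatch("555 1234")
with_dash = phone.fullmatch("555-1234")
no_separator = phone.fullmatch("5551234")

# A white-space character preceded by a backslash is retained as well.
escaped = re.fullmatch(r"(?x) a\ b   # literal space between a and b", "a b")

print(with_space.groups(), with_dash.groups(), no_separator, escaped.group())
```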
402 | ||
403 | Finally, in an ARE, outside bracket expressions, the sequence `{\bf (?\#ttt)}' (where | |
404 | {\it ttt} is any text not containing a `{\bf )}') is a comment, completely ignored. Again, | |
405 | this is not allowed between the characters of multi-character symbols like | |
406 | `{\bf (?:}'. Such comments are more a historical artifact than a useful facility, | |
407 | and their use is deprecated; use the expanded syntax instead. | |
408 | ||
409 | {\it None} of these | |
410 | metasyntax extensions is available if the application (or an initial {\bf ***=} | |
411 | director) has specified that the user's input be treated as a literal string | |
412 | rather than as an RE. | |
413 | ||
414 | \subsection{Matching}\label{wxresynmatching} | |
415 | ||
416 | \helpref{Syntax of the builtin regular expression library}{wxresyn} | |
417 | ||
418 | In the event that an RE could match more than | |
419 | one substring of a given string, the RE matches the one starting earliest | |
420 | in the string. If the RE could match more than one substring starting at | |
421 | that point, its choice is determined by its {\it preference}: either the longest | |
422 | substring, or the shortest. | |
423 | ||
424 | Most atoms, and all constraints, have no preference. | |
425 | A parenthesized RE has the same preference (possibly none) as the RE. A | |
426 | quantified atom with quantifier {\bf \{m\}} or {\bf \{m\}?} has the same preference (possibly | |
427 | none) as the atom itself. A quantified atom with other normal quantifiers | |
428 | (including {\bf \{m,n\}} with {\it m} equal to {\it n}) prefers longest match. A quantified | |
429 | atom with other non-greedy quantifiers (including {\bf \{m,n\}?} with {\it m} equal to | |
430 | {\it n}) prefers shortest match. A branch has the same preference as the first | |
431 | quantified atom in it which has a preference. An RE consisting of two or | |
432 | more branches connected by the {\bf $|$} operator prefers longest match. | |
433 | ||
434 | Subject to the constraints imposed by the rules for matching the whole RE, subexpressions | |
435 | also match the longest or shortest possible substrings, based on their | |
436 | preferences, with subexpressions starting earlier in the RE taking priority | |
437 | over ones starting later. Note that outer subexpressions thus take priority | |
438 | over their component subexpressions. | |
439 | ||
440 | Note that the quantifiers {\bf \{1,1\}} and | |
441 | {\bf \{1,1\}?} can be used to force longest and shortest preference, respectively, | |
442 | on a subexpression or a whole RE. | |
443 | ||
444 | Match lengths are measured in characters, | |
445 | not collating elements. An empty string is considered longer than no match | |
446 | at all. For example, {\bf bb*} matches the three middle characters | |
447 | of `{\bf abbbc}', {\bf (week$|$wee)(night$|$knights)} | |
448 | matches all ten characters of `{\bf weeknights}', when {\bf (.*).*} is matched against | |
449 | {\bf abc} the parenthesized subexpression matches all three characters, and when | |
450 | {\bf (a*)*} is matched against {\bf bc} both the whole RE and the parenthesized subexpression | |
451 | match an empty string. | |
452 | ||
453 | If case-independent matching is specified, the effect | |
454 | is much as if all case distinctions had vanished from the alphabet. When | |
455 | an alphabetic character that exists in multiple cases appears as an ordinary character | |
456 | outside a bracket expression, it is effectively transformed into a bracket | |
457 | expression containing both cases, so that {\bf x} becomes `{\bf $[xX]$}'. When it appears | |
458 | inside a bracket expression, all case counterparts of it are added to the | |
459 | bracket expression, so that {\bf $[x]$} becomes {\bf $[xX]$} and {\bf $[^x]$} becomes `{\bf $[^xX]$}'. | |
460 | ||
461 | If newline-sensitive | |
462 | matching is specified, {\bf .} and bracket expressions using {\bf $^$} will never match | |
463 | the newline character (so that matches will never cross newlines unless | |
464 | the RE explicitly arranges it) and {\bf $^$} and {\bf \$} will match the empty string after | |
465 | and before a newline respectively, in addition to matching at beginning | |
466 | and end of string respectively. ARE {\bf $\backslash$A} and {\bf $\backslash$Z} continue to match beginning | |
467 | or end of string {\it only}. | |
468 | ||
469 | If partial newline-sensitive matching is specified, | |
470 | this affects {\bf .} and bracket expressions as with newline-sensitive matching, | |
471 | but not {\bf $^$} and `{\bf \$}'. | |
472 | ||
473 | If inverse partial newline-sensitive matching is specified, | |
474 | this affects {\bf $^$} and {\bf \$} as with newline-sensitive matching, but not {\bf .} and bracket | |
475 | expressions. This isn't very useful but is provided for symmetry. | |
476 | ||
477 | \subsection{Limits And Compatibility} | |
478 | ||
479 | \helpref{Syntax of the builtin regular expression library}{wxresyn} | |
480 | ||
481 | No particular limit is imposed on the length of REs. Programs | |
482 | intended to be highly portable should not employ REs longer than 256 bytes, | |
483 | as a POSIX-compliant implementation can refuse to accept such REs. | |
484 | ||
485 | The only | |
486 | feature of AREs that is actually incompatible with POSIX EREs is that {\bf $\backslash$} | |
487 | does not lose its special significance inside bracket expressions. All other | |
488 | ARE features use syntax which is illegal or has undefined or unspecified | |
489 | effects in POSIX EREs; the {\bf ***} syntax of directors likewise is outside | |
490 | the POSIX syntax for both BREs and EREs. | |
491 | ||
492 | Many of the ARE extensions are | |
493 | borrowed from Perl, but some have been changed to clean them up, and a | |
494 | few Perl extensions are not present. Incompatibilities of note include `{\bf $\backslash$b}', | |
495 | `{\bf $\backslash$B}', the lack of special treatment for a trailing newline, the addition of | |
496 | complemented bracket expressions to the things affected by newline-sensitive | |
497 | matching, the restrictions on parentheses and back references in lookahead | |
498 | constraints, and the longest/shortest-match (rather than first-match) matching | |
499 | semantics. | |
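The last of these incompatibilities can be observed from the Perl side. The sketch below runs one of the patterns from the Matching section through Python's {\it re} module, which is assumed here as a convenient example of a Perl-style first-match engine; it does not exercise the library described in this document.

```python
import re

# A first-match engine tries the alternatives left to right and keeps the
# first combination that succeeds: "week" followed by "night".
first_match = re.match(r"(week|wee)(night|knights)", "weeknights")

perl_style_result = first_match.group()
perl_style_groups = first_match.groups()

print(perl_style_result, perl_style_groups)
```

A first-match engine reports only the nine characters of "weeknight", whereas an ARE, as stated under Matching, prefers the longest match and reports all ten characters of "weeknights".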
500 | ||
501 | The matching rules for REs containing both normal and non-greedy | |
502 | quantifiers have changed since early beta-test versions of this package. | |
503 | (The new rules are much simpler and cleaner, but don't work as hard at guessing | |
504 | the user's real intentions.) | |
505 | ||
506 | Henry Spencer's original 1986 {\it regexp} package, still in widespread use, | |
507 | %(e.g., in pre-8.1 releases of Tcl), | |
508 | implemented an early version of today's EREs. There are four incompatibilities between {\it regexp}'s | |
509 | near-EREs (`RREs' for short) and AREs. In roughly increasing order of significance: | |
510 | {\itemize | |
511 | \item In AREs, {\bf $\backslash$} followed by an alphanumeric character is either an escape or | |
512 | an error, while in RREs, it was just another way of writing the alphanumeric. | |
513 | This should not be a problem because there was no reason to write such | |
514 | a sequence in RREs. | |
515 | ||
516 | \item {\bf \{} followed by a digit in an ARE is the beginning of | |
517 | a bound, while in RREs, {\bf \{} was always an ordinary character. Such sequences | |
518 | should be rare, and will often result in an error because the characters that follow | |
519 | will not look like a valid bound. | |
520 | ||
521 | \item In AREs, {\bf $\backslash$} remains a special character | |
522 | within `{\bf $[]$}', so a literal {\bf $\backslash$} within {\bf $[]$} must be | |
523 | written `{\bf $\backslash\backslash$}'. {\bf $\backslash\backslash$} also gives a literal | |
524 | {\bf $\backslash$} within {\bf $[]$} in RREs, but only truly paranoid programmers routinely doubled | |
525 | the backslash. | |
526 | ||
527 | \item AREs report the longest/shortest match for the RE, rather | |
528 | than the first found in a specified search order. This may affect some RREs | |
529 | which were written in the expectation that the first match would be reported. | |
530 | Carefully crafting RREs to optimize the search order for fast matching | |
531 | is now obsolete, because AREs examine all possible matches in parallel and their performance | |
532 | is largely insensitive to their complexity. However, cases where the search | |
533 | order was exploited to deliberately find a match which was {\it not} the longest/shortest | |
534 | will need rewriting. | |
535 | } | |
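
As a rough illustration of the last item, the sketch below uses Python's {\tt re} module purely as a stand-in for a first-match engine; it is neither this library nor the original {\it regexp} package. The helper names {\tt first\_match} and {\tt longest\_match} are hypothetical. {\tt longest\_match} only emulates, for a list of top-level branches tried at the start of the string, the result that a longest-match engine would report.

\begin{verbatim}
import re

def first_match(pattern, text):
    """Return what a first-match (backtracking) engine reports at position 0."""
    m = re.match(pattern, text)
    return m.group(0) if m else None

def longest_match(branches, text):
    """Emulate longest-match semantics: try every branch at position 0
    and keep the longest result (assumption: top-level branches only)."""
    best = None
    for branch in branches:
        m = re.match(branch, text)
        if m and (best is None or len(m.group(0)) > len(best)):
            best = m.group(0)
    return best

# The first branch that succeeds wins, even though a longer match exists.
print(first_match("a|ab", "ab"))         # prints: a
# Considering every branch picks the longer alternative instead.
print(longest_match(["a", "ab"], "ab"))  # prints: ab
\end{verbatim}

A pattern that relied on the first behaviour, with the shorter branch listed first on purpose, is the kind of RRE that would need rewriting as an ARE.
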
536 | ||
537 | \subsection{Basic Regular Expressions}\label{wxresynbre} | |
538 | ||
539 | \helpref{Syntax of the builtin regular expression library}{wxresyn} | |
540 | ||
541 | BREs differ from EREs in | |
542 | several respects. `{\bf $|$}', `{\bf +}', and {\bf ?} are ordinary characters and there is no equivalent | |
543 | for their functionality. The delimiters for bounds | |
544 | are {\bf $\backslash$\{} and `{\bf $\backslash$\}}', with {\bf \{} and | |
545 | {\bf \}} by themselves ordinary characters. The parentheses for nested subexpressions | |
546 | are {\bf $\backslash$(} and `{\bf $\backslash$)}', with {\bf (} and {\bf )} by themselves | |
547 | ordinary characters. {\bf $^$} is an ordinary | |
548 | character except at the beginning of the RE or the beginning of a parenthesized | |
549 | subexpression, {\bf \$} is an ordinary character except at the end of the RE or | |
550 | the end of a parenthesized subexpression, and {\bf *} is an ordinary character | |
551 | if it appears at the beginning of the RE or the beginning of a parenthesized | |
552 | subexpression (after a possible leading `{\bf $^$}'). Finally, single-digit back references | |
553 | are available, and {\bf $\backslash<$} and {\bf $\backslash>$} are synonyms | |
554 | for {\bf $[[:<:]]$} and {\bf $[[:>:]]$} respectively; | |
555 | no other escapes are available. | |
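
To make these differences concrete, here is a minimal sketch of a translator from BRE notation to the syntax of Python's {\tt re} module, whose metacharacters resemble those of EREs. The function name {\tt bre\_to\_python} is hypothetical, and Python's {\tt re} is only a stand-in, not this library. The sketch covers just the swapped roles of the parentheses, braces, {\bf $|$}, {\bf +} and {\bf ?}. It ignores bracket expressions, the position-dependent rules for the anchors and {\bf *}, and the word-boundary escapes.

\begin{verbatim}
import re

def bre_to_python(bre):
    """Rewrite a BRE (subset only) into an equivalent Python re pattern."""
    out = []
    i = 0
    while i < len(bre):
        c = bre[i]
        if c == "\\" and i + 1 < len(bre):
            nxt = bre[i + 1]
            if nxt in "(){}":
                out.append(nxt)      # \( \) \{ \} are the BRE metacharacters
            else:
                out.append(c + nxt)  # keep other escapes, e.g. \. or \1
            i += 2
        elif c in "(){}|+?":
            out.append("\\" + c)     # ordinary characters in a BRE
            i += 1
        else:
            out.append(c)
            i += 1
    return "".join(out)

# Escaped parentheses and braces become grouping and a bound.
print(bre_to_python(r"\(ab\)\{2\}"))  # prints: (ab){2}
# A bare + is an ordinary character in a BRE, so it is escaped.
print(bre_to_python("a+b"))           # prints: a\+b
\end{verbatim}

With this translation, {\tt re.fullmatch(bre\_to\_python("a+b"), "a+b")} succeeds while the same call against {\tt "aab"} does not, reflecting that {\bf +} has no special meaning in a BRE.
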
556 | ||
557 | \subsection{Regular Expression Character Names}\label{wxresynchars} | |
558 | ||
559 | \helpref{Syntax of the builtin regular expression library}{wxresyn} | |
560 | ||
561 | The character names below can be used as collating elements in bracket expressions. Note that the names are case sensitive. | |
562 | ||
563 | \begin{twocollist} | |
564 | \twocolitem{NUL}{'$\backslash$0'} | |
565 | \twocolitem{SOH}{'$\backslash$001'} | |
566 | \twocolitem{STX}{'$\backslash$002'} | |
567 | \twocolitem{ETX}{'$\backslash$003'} | |
568 | \twocolitem{EOT}{'$\backslash$004'} | |
569 | \twocolitem{ENQ}{'$\backslash$005'} | |
570 | \twocolitem{ACK}{'$\backslash$006'} | |
571 | \twocolitem{BEL}{'$\backslash$007'} | |
572 | \twocolitem{alert}{'$\backslash$007'} | |
573 | \twocolitem{BS}{'$\backslash$010'} | |
574 | \twocolitem{backspace}{'$\backslash$b'} | |
575 | \twocolitem{HT}{'$\backslash$011'} | |
576 | \twocolitem{tab}{'$\backslash$t'} | |
577 | \twocolitem{LF}{'$\backslash$012'} | |
578 | \twocolitem{newline}{'$\backslash$n'} | |
579 | \twocolitem{VT}{'$\backslash$013'} | |
580 | \twocolitem{vertical-tab}{'$\backslash$v'} | |
581 | \twocolitem{FF}{'$\backslash$014'} | |
582 | \twocolitem{form-feed}{'$\backslash$f'} | |
583 | \twocolitem{CR}{'$\backslash$015'} | |
584 | \twocolitem{carriage-return}{'$\backslash$r'} | |
585 | \twocolitem{SO}{'$\backslash$016'} | |
586 | \twocolitem{SI}{'$\backslash$017'} | |
587 | \twocolitem{DLE}{'$\backslash$020'} | |
588 | \twocolitem{DC1}{'$\backslash$021'} | |
589 | \twocolitem{DC2}{'$\backslash$022'} | |
590 | \twocolitem{DC3}{'$\backslash$023'} | |
591 | \twocolitem{DC4}{'$\backslash$024'} | |
592 | \twocolitem{NAK}{'$\backslash$025'} | |
593 | \twocolitem{SYN}{'$\backslash$026'} | |
594 | \twocolitem{ETB}{'$\backslash$027'} | |
595 | \twocolitem{CAN}{'$\backslash$030'} | |
596 | \twocolitem{EM}{'$\backslash$031'} | |
597 | \twocolitem{SUB}{'$\backslash$032'} | |
598 | \twocolitem{ESC}{'$\backslash$033'} | |
599 | \twocolitem{IS4}{'$\backslash$034'} | |
600 | \twocolitem{FS}{'$\backslash$034'} | |
601 | \twocolitem{IS3}{'$\backslash$035'} | |
602 | \twocolitem{GS}{'$\backslash$035'} | |
603 | \twocolitem{IS2}{'$\backslash$036'} | |
604 | \twocolitem{RS}{'$\backslash$036'} | |
605 | \twocolitem{IS1}{'$\backslash$037'} | |
606 | \twocolitem{US}{'$\backslash$037'} | |
607 | \twocolitem{space}{' '} | |
608 | \twocolitem{exclamation-mark}{'!'} | |
609 | \twocolitem{quotation-mark}{'"'} | |
610 | \twocolitem{number-sign}{'\#'} | |
611 | \twocolitem{dollar-sign}{'\$'} | |
612 | \twocolitem{percent-sign}{'\%'} | |
613 | \twocolitem{ampersand}{'\&'} | |
614 | \twocolitem{apostrophe}{'$\backslash$''} | |
615 | \twocolitem{left-parenthesis}{'('} | |
616 | \twocolitem{right-parenthesis}{')'} | |
617 | \twocolitem{asterisk}{'*'} | |
618 | \twocolitem{plus-sign}{'+'} | |
619 | \twocolitem{comma}{','} | |
620 | \twocolitem{hyphen}{'-'} | |
621 | \twocolitem{hyphen-minus}{'-'} | |
622 | \twocolitem{period}{'.'} | |
623 | \twocolitem{full-stop}{'.'} | |
624 | \twocolitem{slash}{'/'} | |
625 | \twocolitem{solidus}{'/'} | |
626 | \twocolitem{zero}{'0'} | |
627 | \twocolitem{one}{'1'} | |
628 | \twocolitem{two}{'2'} | |
629 | \twocolitem{three}{'3'} | |
630 | \twocolitem{four}{'4'} | |
631 | \twocolitem{five}{'5'} | |
632 | \twocolitem{six}{'6'} | |
633 | \twocolitem{seven}{'7'} | |
634 | \twocolitem{eight}{'8'} | |
635 | \twocolitem{nine}{'9'} | |
636 | \twocolitem{colon}{':'} | |
637 | \twocolitem{semicolon}{';'} | |
638 | \twocolitem{less-than-sign}{'<'} | |
639 | \twocolitem{equals-sign}{'='} | |
640 | \twocolitem{greater-than-sign}{'>'} | |
641 | \twocolitem{question-mark}{'?'} | |
642 | \twocolitem{commercial-at}{'@'} | |
643 | \twocolitem{left-square-bracket}{'$[$'} | |
644 | \twocolitem{backslash}{'$\backslash$'} | |
645 | \twocolitem{reverse-solidus}{'$\backslash$'} | |
646 | \twocolitem{right-square-bracket}{'$]$'} | |
647 | \twocolitem{circumflex}{'$^$'} | |
648 | \twocolitem{circumflex-accent}{'$^$'} | |
649 | \twocolitem{underscore}{'\_'} | |
650 | \twocolitem{low-line}{'\_'} | |
651 | \twocolitem{grave-accent}{'`'} | |
652 | \twocolitem{left-brace}{'\{'} | |
653 | \twocolitem{left-curly-bracket}{'\{'} | |
654 | \twocolitem{vertical-line}{'$|$'} | |
655 | \twocolitem{right-brace}{'\}'} | |
656 | \twocolitem{right-curly-bracket}{'\}'} | |
657 | \twocolitem{tilde}{'\destruct{}'} | |
658 | \twocolitem{DEL}{'$\backslash$177'} | |
659 | \end{twocollist} | |
660 |