1 | % manual page source format generated by PolyglotMan v3.0.9, |
2 | % available via anonymous ftp from ftp.cs.berkeley.edu:/ucb/people/phelps/tcltk/rman.tar.Z | |
3 | ||
4 | \section{Syntax of the builtin regular expression library}\label{wxresyn} | |
5 | ||
6 | A {\it regular expression} describes strings of characters. It's a | |
7 | pattern that matches certain strings and doesn't match others. | |
8 | ||
9 | \wxheading{See also} | |
10 | ||
11 | \helpref{wxRegEx}{wxregex} | |
12 | ||
13 | ||
14 | \subsection{Different Flavors of REs} | |
15 | ||
16 | \helpref{Syntax of the builtin regular expression library}{wxresyn} | |
17 | ||
18 | Regular expressions (``RE''s), as defined by POSIX, come in two | |
19 | flavors: {\it extended} REs (``EREs'') and {\it basic} REs (``BREs''). EREs are roughly those | |
20 | of the traditional {\it egrep}, while BREs are roughly those of the traditional | |
21 | {\it ed}. This implementation adds a third flavor, {\it advanced} REs (``AREs''), basically | |
22 | EREs with some significant extensions. | |
23 | ||
24 | This manual page primarily describes | |
25 | AREs. BREs mostly exist for backward compatibility in some old programs; | |
26 | they will be discussed at the \helpref{end}{wxresynbre}. POSIX EREs are almost an exact subset | |
27 | of AREs. Features of AREs that are not present in EREs will be indicated. | |
28 | ||
29 | ||
30 | \subsection{Regular Expression Syntax} | |
31 | ||
32 | \helpref{Syntax of the builtin regular expression library}{wxresyn} | |
33 | ||
34 | These regular expressions are implemented using | |
35 | the package written by Henry Spencer, based on the 1003.2 spec and some | |
36 | (not quite all) of the Perl5 extensions (thanks, Henry!). Much of the description | |
37 | of regular expressions below is copied verbatim from his manual entry. | |
38 | ||
39 | An | |
40 | ARE is one or more {\it branches}, separated by `{\bf $|$}', matching anything that matches | |
41 | any of the branches. | |
42 | ||
43 | A branch is zero or more {\it constraints} or {\it quantified | |
44 | atoms}, concatenated. It matches a match for the first, followed by a match | |
45 | for the second, etc.; an empty branch matches the empty string. | |
46 | ||
47 | A quantified | |
48 | atom is an {\it atom} possibly followed by a single {\it quantifier}. Without a quantifier, | |
49 | it matches a match for the atom. The quantifiers, and what a so-quantified | |
50 | atom matches, are: | |
51 | ||
52 | \begin{twocollist}\twocolwidtha{4cm} | |
53 | \twocolitem{{\bf *}}{a sequence of 0 or more matches of the atom} | |
54 | \twocolitem{{\bf +}}{a sequence of 1 or more matches of the atom} | |
55 | \twocolitem{{\bf ?}}{a sequence of 0 or 1 matches of the atom} | |
56 | \twocolitem{{\bf \{m\}}}{a sequence of exactly {\it m} matches of the atom} | |
57 | \twocolitem{{\bf \{m,\}}}{a sequence of {\it m} or more matches of the atom} | |
58 | \twocolitem{{\bf \{m,n\}}}{a sequence of {\it m} through {\it n} (inclusive) | |
59 | matches of the atom; {\it m} may not exceed {\it n}} | |
60 | \twocolitem{{\bf *? +? ?? \{m\}? \{m,\}? \{m,n\}?}}{{\it non-greedy} quantifiers, | |
61 | which match the same possibilities, but prefer the | |
62 | smallest number rather than the largest number of matches (see \helpref{Matching}{wxresynmatching})} | |
63 | \end{twocollist} | |
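
As a small illustration of the quantifier table, the following sketch uses Python's {\tt re} module. That module is a different engine from the library described here, so it is only an assumption that the two agree; they do for these simple patterns, where greedy, non-greedy and bounded quantifiers behave alike. The sample text and variable names are made up for the example.

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy quantifier: prefers the largest number of repetitions.
greedy = re.search(r"<b>.*</b>", text).group()

# Non-greedy quantifier: prefers the smallest number of repetitions.
lazy = re.search(r"<b>.*?</b>", text).group()

# Bound {2,3}: accepts two or three repetitions, inclusive.
two_to_three = [s for s in ("a", "aa", "aaa", "aaaa")
                if re.fullmatch(r"a{2,3}", s)]

print(greedy)
print(lazy)
print(two_to_three)
```

The greedy form spans both tags, while the non-greedy form stops at the first closing tag.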
64 | ||
65 | The forms using {\bf \{} and {\bf \}} are known as {\it bound}s. The numbers {\it m} and {\it n} are unsigned | |
66 | decimal integers with permissible values from 0 to 255 inclusive. | |
67 | An atom is one of: | |
68 | ||
69 | \begin{twocollist}\twocolwidtha{4cm} | |
70 | \twocolitem{{\bf (re)}}{(where {\it re} is any regular expression) matches a match for | |
71 | {\it re}, with the match noted for possible reporting} | |
72 | \twocolitem{{\bf (?:re)}}{as previous, but | |
73 | does no reporting (a ``non-capturing'' set of parentheses)} | |
74 | \twocolitem{{\bf ()}}{matches an empty | |
75 | string, noted for possible reporting} | |
76 | \twocolitem{{\bf (?:)}}{matches an empty string, without reporting} | |
77 | \twocolitem{{\bf $[chars]$}}{a {\it bracket expression}, matching any one of the {\it chars} | |
78 | (see \helpref{Bracket Expressions}{wxresynbracket} for more detail)} | |
79 | \twocolitem{{\bf .}}{matches any single character } | |
80 | \twocolitem{{\bf $\backslash$k}}{(where {\it k} is a non-alphanumeric character) | |
81 | matches that character taken as an ordinary character, e.g. $\backslash\backslash$ matches a backslash | |
82 | character} | |
83 | \twocolitem{{\bf $\backslash$c}}{where {\it c} is alphanumeric (possibly followed by other characters), | |
84 | an {\it escape} (AREs only), see \helpref{Escapes}{wxresynescapes} below} | |
85 | \twocolitem{{\bf \{}}{when followed by a character | |
86 | other than a digit, matches the left-brace character `{\bf \{}'; when followed by | |
87 | a digit, it is the beginning of a {\it bound} (see above)} | |
88 | \twocolitem{{\bf x}}{where {\it x} is a single | |
89 | character with no other significance, matches that character.} | |
90 | \end{twocollist} | |
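
The difference between capturing and non-capturing parentheses can be sketched as follows. The example uses Python's {\tt re} module, which is not the library documented here but treats these particular atoms the same way; the date string is just sample data.

```python
import re

# Two capturing groups around a non-capturing one: only the capturing
# groups are noted for reporting.
m = re.fullmatch(r"([0-9]+)-(?:[0-9]+)-([0-9]+)", "2024-05-17")
captured = m.groups()

# Empty capturing parentheses match the empty string and still report it.
empty = re.fullmatch(r"a()b", "ab").group(1)

# A backslash before a non-alphanumeric character makes it ordinary.
literal_dot = re.findall(r"\.", "a.b")

print(captured, repr(empty), literal_dot)
```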
91 | ||
92 | A {\it constraint} | |
93 | matches an empty string when specific conditions are met. A constraint may | |
94 | not be followed by a quantifier. The simple constraints are as follows; | |
95 | some more constraints are described later, under \helpref{Escapes}{wxresynescapes}. | |
96 | ||
97 | \begin{twocollist}\twocolwidtha{4cm} | |
98 | \twocolitem{{\bf $^$}}{matches at the beginning of a line} | |
99 | \twocolitem{{\bf \$}}{matches at the end of a line} | |
100 | \twocolitem{{\bf (?=re)}}{{\it positive lookahead} | |
101 | (AREs only), matches at any point where a substring matching {\it re} begins} | |
102 | \twocolitem{{\bf (?!re)}}{{\it negative lookahead} (AREs only), | |
103 | matches at any point where no substring matching {\it re} begins} | |
104 | \end{twocollist} | |
105 | ||
106 | The lookahead constraints may not contain back references | |
107 | (see later), and all parentheses within them are considered non-capturing. | |
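
A minimal sketch of the two lookahead constraints, again using Python's {\tt re} module as a stand-in (an assumption; it is a different engine, but these simple lookaheads behave the same). Note that the text matched by the lookahead is tested but not consumed.

```python
import re

text = "foobar foobaz"

# Positive lookahead: 'foo' only where 'bar' begins right after it.
positive = [m.start() for m in re.finditer(r"foo(?=bar)", text)]

# Negative lookahead: 'foo' only where 'bar' does not begin right after it.
negative = [m.start() for m in re.finditer(r"foo(?!bar)", text)]

# The constraint itself matches an empty string, so only 'foo' is reported.
reported = re.search(r"foo(?=bar)", text).group()

print(positive, negative, reported)
```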
108 | ||
109 | An RE may not end with `{\bf $\backslash$}'. | |
110 | ||
111 | ||
112 | \subsection{Bracket Expressions}\label{wxresynbracket} | |
113 | ||
114 | \helpref{Syntax of the builtin regular expression library}{wxresyn} | |
115 | ||
116 | A {\it bracket expression} is a list | |
117 | of characters enclosed in `{\bf $[]$}'. It normally matches any single character from | |
118 | the list (but see below). If the list begins with `{\bf $^$}', it matches any single | |
119 | character (but see below) {\it not} from the rest of the list. | |
120 | ||
121 | If two characters | |
122 | in the list are separated by `{\bf -}', this is shorthand for the full {\it range} of | |
123 | characters between those two (inclusive) in the collating sequence, e.g. | |
124 | {\bf $[0-9]$} in ASCII matches any decimal digit. Two ranges may not share an endpoint, | |
125 | so e.g. {\bf a-c-e} is illegal. Ranges are very collating-sequence-dependent, and portable | |
126 | programs should avoid relying on them. | |
127 | ||
128 | To include a literal {\bf $]$} or {\bf -} in the | |
129 | list, the simplest method is to enclose it in {\bf $[.$} and {\bf $.]$} to make it a collating | |
130 | element (see below). Alternatively, make it the first character (following | |
131 | a possible `{\bf $^$}'), or (AREs only) precede it with `{\bf $\backslash$}'. | |
132 | Alternatively, for `{\bf -}', make | |
133 | it the last character, or the second endpoint of a range. To use a literal | |
134 | {\bf -} as the first endpoint of a range, make it a collating element or (AREs | |
135 | only) precede it with `{\bf $\backslash$}'. With the exception of these, some combinations using | |
136 | {\bf $[$} (see next paragraphs), and escapes, all other special characters lose | |
137 | their special significance within a bracket expression. | |
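
The placement rules for literal characters in a bracket expression can be tried out with the sketch below. It relies on Python's {\tt re} module, which happens to follow the same placement rules for these cases; it does not support the collating-element, equivalence-class or character-class notations described in this section, so those are not shown.

```python
import re

# A ']' placed first in the list is taken literally.
bracket_first = re.findall(r"[]a]", "a]b")

# A '-' placed last in the list is taken literally.
hyphen_last = re.findall(r"[a-]", "a-b")

# A leading '^' complements the list.
not_digit = re.findall(r"[^0-9]", "a1b2")

# Other special characters lose their significance inside the brackets.
dot_or_star = re.findall(r"[.*]", "a.b*c")

print(bracket_first, hyphen_last, not_digit, dot_or_star)
```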
138 | ||
139 | Within a bracket | |
140 | expression, a collating element (a character, a multi-character sequence | |
141 | that collates as if it were a single character, or a collating-sequence | |
142 | name for either) enclosed in {\bf $[.$} and {\bf $.]$} stands for the | |
143 | sequence of characters of that collating element. | |
144 | ||
145 | {\it wxWindows}: Currently no multi-character collating elements are defined. | |
146 | So in {\bf $[.X.]$}, {\it X} can either be a single character literal or | |
147 | the name of a character. For example, the following two expressions are identical: | |
148 | {\bf $[[.0.]-[.9.]]$} and {\bf $[[.zero.]-[.nine.]]$}; both mean the same as | |
149 | {\bf $[0-9]$}. | |
150 | See \helpref{Character Names}{wxresynchars}. | |
151 | ||
152 | %The sequence is a single element of the bracket | |
153 | %expression's list. A bracket expression in a locale that has multi-character | |
154 | %collating elements can thus match more than one character. So (insidiously), | |
155 | %a bracket expression that starts with {\bf $^$} can match multi-character collating | |
156 | %elements even if none of them appear in the bracket expression! ({\it Note:} | |
157 | %Tcl currently has no multi-character collating elements. This information | |
158 | %is only for illustration.) | |
159 | % | |
160 | %For example, assume the collating sequence includes | |
161 | %a {\bf ch} multi-character collating element. Then the RE {\bf $[[.ch.]]*c$} (zero or more | |
162 | % {\bf ch}'s followed by {\bf c}) matches the first five characters of `{\bf chchcc}'. Also, the | |
163 | %RE {\bf $[^c]b$} matches all of `{\bf chb}' (because {\bf $[^c]$} matches the multi-character {\bf ch}). | |
164 | ||
165 | Within a bracket expression, a collating element enclosed in {\bf $[=$} and {\bf $=]$} | |
166 | is an equivalence class, standing for the sequences of characters of all | |
167 | collating elements equivalent to that one, including itself. | |
168 | %(If there are | |
169 | %no other equivalent collating elements, the treatment is as if the enclosing | |
170 | %delimiters were `{\bf $[.$}' and `{\bf $.]$}'.) For example, if {\bf o} | |
171 | %and {\bf $^$} are the members of an | |
172 | %equivalence class, then `{\bf $[[$=o=$]]$}', `{\bf $[[$=$^$=$]]$}', | |
173 | %and `{\bf $[o^]$}' are all synonymous. | |
174 | An equivalence class may not be an endpoint of a range. | |
175 | ||
176 | %({\it Note:} Tcl currently | |
177 | %implements only the Unicode locale. It doesn't define any equivalence classes. | |
178 | %The examples above are just illustrations.) | |
179 | ||
180 | {\it wxWindows}: Currently no equivalence classes are defined, so | |
181 | {\bf $[=X=]$} stands for just the single character {\it X}. | |
182 | {\it X} can either be a single character literal or the name of a character, | |
183 | see \helpref{Character Names}{wxresynchars}. | |
184 | ||
185 | Within a bracket expression, | |
186 | the name of a {\it character class} enclosed in {\bf $[:$} and {\bf $:]$} stands for the list | |
187 | of all characters (not all collating elements!) belonging to that class. | |
188 | Standard character classes are: | |
189 | ||
190 | \begin{twocollist}\twocolwidtha{3cm} | |
191 | \twocolitem{{\bf alpha}}{A letter.} | |
192 | \twocolitem{{\bf upper}}{An upper-case letter.} | |
193 | \twocolitem{{\bf lower}}{A lower-case letter.} | |
194 | \twocolitem{{\bf digit}}{A decimal digit.} | |
195 | \twocolitem{{\bf xdigit}}{A hexadecimal digit.} | |
196 | \twocolitem{{\bf alnum}}{An alphanumeric (letter or digit).} | |
197 | \twocolitem{{\bf print}}{An alphanumeric (same as alnum).} | |
198 | \twocolitem{{\bf blank}}{A space or tab character.} | |
199 | \twocolitem{{\bf space}}{A character producing white space in displayed text.} | |
200 | \twocolitem{{\bf punct}}{A punctuation character.} | |
201 | \twocolitem{{\bf graph}}{A character with a visible representation.} | |
202 | \twocolitem{{\bf cntrl}}{A control character.} | |
203 | \end{twocollist} | |
204 | ||
205 | %A locale may provide others. (Note that the current Tcl | |
206 | %implementation has only one locale: the Unicode locale.) | |
207 | A character class may not be used as an endpoint of a range. | |
208 | ||
209 | {\it wxWindows:} In a non-Unicode build, these character classifications depend on the | |
210 | current locale, and correspond to the values returned by the ANSI C 'is' | |
211 | functions: isalpha, isupper, etc. In Unicode mode they are based on | |
212 | Unicode classifications, and are not affected by the current locale. | |
213 | ||
214 | There are two special cases of bracket expressions: | |
215 | the bracket expressions {\bf $[[:$<$:]]$} and {\bf $[[:$>$:]]$} are constraints, matching empty | |
216 | strings at the beginning and end of a word respectively. A word is defined | |
217 | as a sequence of word characters that is neither preceded nor followed | |
218 | by word characters. A word character is an {\it alnum} character or an underscore | |
219 | ({\bf \_}). These special bracket expressions are deprecated; users of AREs should | |
220 | use constraint escapes instead (see \helpref{Escapes}{wxresynescapes} below). | |
221 | ||
222 | ||
223 | \subsection{Escapes}\label{wxresynescapes} | |
224 | ||
225 | \helpref{Syntax of the builtin regular expression library}{wxresyn} | |
226 | ||
227 | Escapes (AREs only), | |
228 | which begin with a {\bf $\backslash$} followed by an alphanumeric character, come in several | |
229 | varieties: character entry, class shorthands, constraint escapes, and back | |
230 | references. A {\bf $\backslash$} followed by an alphanumeric character but not constituting | |
231 | a valid escape is illegal in AREs. In EREs, there are no escapes: outside | |
232 | a bracket expression, a {\bf $\backslash$} followed by an alphanumeric character merely stands | |
233 | for that character as an ordinary character, and inside a bracket expression, | |
234 | {\bf $\backslash$} is an ordinary character. (The latter is the one actual incompatibility | |
235 | between EREs and AREs.) | |
236 | ||
237 | Character-entry escapes (AREs only) exist to make | |
238 | it easier to specify non-printing and otherwise inconvenient characters | |
239 | in REs: | |
240 | ||
241 | \begin{twocollist}\twocolwidtha{4cm} | |
242 | \twocolitem{{\bf $\backslash$a}}{alert (bell) character, as in C} | |
243 | \twocolitem{{\bf $\backslash$b}}{backspace, as in C} | |
244 | \twocolitem{{\bf $\backslash$B}}{synonym | |
245 | for {\bf $\backslash$} to help reduce backslash doubling in some applications where there | |
246 | are multiple levels of backslash processing} | |
247 | \twocolitem{{\bf $\backslash$c{\it X}}}{(where {\it X} is any character) | |
248 | the character whose low-order 5 bits are the same as those of {\it X}, and whose | |
249 | other bits are all zero} | |
250 | \twocolitem{{\bf $\backslash$e}}{the character whose collating-sequence name is | |
251 | `{\bf ESC}', or failing that, the character with octal value 033} | |
252 | \twocolitem{{\bf $\backslash$f}}{formfeed, as in C} | |
253 | \twocolitem{{\bf $\backslash$n}}{newline, as in C} | |
254 | \twocolitem{{\bf $\backslash$r}}{carriage return, as in C} | |
255 | \twocolitem{{\bf $\backslash$t}}{horizontal tab, as in C} | |
256 | \twocolitem{{\bf $\backslash$u{\it wxyz}}}{(where {\it wxyz} is exactly four hexadecimal digits) | |
257 | the Unicode | |
258 | character {\bf U+{\it wxyz}} in the local byte ordering} | |
259 | \twocolitem{{\bf $\backslash$U{\it stuvwxyz}}}{(where {\it stuvwxyz} is | |
260 | exactly eight hexadecimal digits) reserved for a somewhat-hypothetical Unicode | |
261 | extension to 32 bits} | |
262 | \twocolitem{{\bf $\backslash$v}}{vertical tab, as in C} | |
263 | \twocolitem{{\bf $\backslash$x{\it hhh}}}{(where | |
264 | {\it hhh} is any sequence of hexadecimal digits) the character whose hexadecimal | |
265 | value is {\bf 0x{\it hhh}} (a single character no matter how many hexadecimal digits | |
266 | are used).} | |
267 | \twocolitem{{\bf $\backslash$0}}{the character whose value is {\bf 0}} | |
268 | \twocolitem{{\bf $\backslash${\it xy}}}{(where {\it xy} is exactly two | |
269 | octal digits, and is not a {\it back reference} (see below)) the character whose | |
270 | octal value is {\bf 0{\it xy}}} | |
271 | \twocolitem{{\bf $\backslash${\it xyz}}}{(where {\it xyz} is exactly three octal digits, and is | |
272 | not a back reference (see below)) | |
273 | the character whose octal value is {\bf 0{\it xyz}}} | |
274 | \end{twocollist} | |
275 | ||
276 | Hexadecimal digits are `{\bf 0}'-`{\bf 9}', `{\bf a}'-`{\bf f}', and `{\bf A}'-`{\bf F}'. Octal | |
277 | digits are `{\bf 0}'-`{\bf 7}'. | |
278 | ||
279 | The character-entry | |
280 | escapes are always taken as ordinary characters. For example, {\bf $\backslash$135} is {\bf ]} in | |
281 | ASCII, but {\bf $\backslash$135} does not terminate a bracket expression. Beware, however, | |
282 | that some applications (e.g., C compilers) interpret such sequences themselves | |
283 | before the regular-expression package gets to see them, which may require | |
284 | doubling (quadrupling, etc.) the `{\bf $\backslash$}'. | |
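
Three of the points above lend themselves to a short sketch: the control-character rule (low-order 5 bits kept, other bits zero), an octal escape that stays ordinary inside a bracket expression, and backslash doubling in a host language. The helper name {\tt control\_char} is hypothetical, and Python's {\tt re} module stands in for the library described here, on the assumption that it treats this octal escape the same way.

```python
import re

def control_char(x: str) -> str:
    """Return the character whose low-order 5 bits match those of x."""
    return chr(ord(x) & 0x1F)

# Octal 135 is ']' in ASCII; written as an escape it stays an ordinary
# character and does not close the bracket expression.
stays_ordinary = re.fullmatch(r"[\135]", "]") is not None

# A host language that processes backslashes itself needs the backslash
# doubled, unless it offers a raw-string notation.
same_pattern = "\\t" == r"\t"

print(repr(control_char("G")), stays_ordinary, same_pattern)
```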
285 | ||
286 | Class-shorthand escapes (AREs only) provide | |
287 | shorthands for certain commonly-used character classes: | |
288 | ||
289 | \begin{twocollist}\twocolwidtha{4cm} | |
290 | \twocolitem{{\bf $\backslash$d}}{{\bf $[[:digit:]]$}} | |
291 | \twocolitem{{\bf $\backslash$s}}{{\bf $[[:space:]]$}} | |
292 | \twocolitem{{\bf $\backslash$w}}{{\bf $[[:alnum:]\_]$} (note underscore)} | |
293 | \twocolitem{{\bf $\backslash$D}}{{\bf $[^[:digit:]]$}} | |
294 | \twocolitem{{\bf $\backslash$S}}{{\bf $[^[:space:]]$}} | |
295 | \twocolitem{{\bf $\backslash$W}}{{\bf $[^[:alnum:]\_]$} (note underscore)} | |
296 | \end{twocollist} | |
297 | ||
298 | Within bracket expressions, `{\bf $\backslash$d}', `{\bf $\backslash$s}', and | |
299 | `{\bf $\backslash$w}' lose their outer brackets, and `{\bf $\backslash$D}', | |
300 | `{\bf $\backslash$S}', and `{\bf $\backslash$W}' are illegal. (So, for example, | |
301 | {\bf $[$a-c$\backslash$d$]$} is equivalent to {\bf $[a-c[:digit:]]$}. | |
302 | Also, {\bf $[$a-c$\backslash$D$]$}, which is equivalent to | |
303 | {\bf $[a-c^[:digit:]]$}, is illegal.) | |
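
The class shorthands can be exercised with the following sketch. It uses Python's {\tt re} module with the {\tt re.ASCII} flag, so that the shorthands cover roughly the same ASCII characters as the bracket forms listed above; this is an approximation of the library described here, not its implementation. One known difference: Python accepts the complemented shorthands inside brackets, which AREs reject.

```python
import re

digits = re.findall(r"\d+", "room 42, floor 7", re.ASCII)
words = re.findall(r"\w+", "snake_case x1", re.ASCII)   # note the underscore
non_space = re.findall(r"\S+", "a b\tc", re.ASCII)

# Inside a bracket expression the shorthand loses its outer brackets,
# so this behaves like a-c plus the digit class.
mixed = re.findall(r"[a-c\d]", "a1dz9", re.ASCII)

print(digits, words, non_space, mixed)
```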
304 | ||
305 | A constraint escape (AREs only) is a constraint, | |
306 | matching the empty string if specific conditions are met, written as an | |
307 | escape: | |
308 | ||
309 | \begin{twocollist}\twocolwidtha{4cm} | |
310 | \twocolitem{{\bf $\backslash$A}}{matches only at the beginning of the string | |
311 | (see \helpref{Matching}{wxresynmatching}, below, | |
312 | for how this differs from `{\bf $^$}')} | |
313 | \twocolitem{{\bf $\backslash$m}}{matches only at the beginning of a word} | |
314 | \twocolitem{{\bf $\backslash$M}}{matches only at the end of a word} | |
315 | \twocolitem{{\bf $\backslash$y}}{matches only at the beginning or end of a word} | |
316 | \twocolitem{{\bf $\backslash$Y}}{matches only at a point that is not the beginning or end of | |
317 | a word} | |
318 | \twocolitem{{\bf $\backslash$Z}}{matches only at the end of the string | |
319 | (see \helpref{Matching}{wxresynmatching}, below, for | |
320 | how this differs from `{\bf \$}')} | |
321 | \twocolitem{{\bf $\backslash${\it m}}}{(where {\it m} is a nonzero digit) a {\it back reference}, | |
322 | see below} | |
323 | \twocolitem{{\bf $\backslash${\it mnn}}}{(where {\it m} is a nonzero digit, and {\it nn} is some more digits, | |
324 | and the decimal value {\it mnn} is not greater than the number of closing capturing | |
325 | parentheses seen so far) a {\it back reference}, see below} | |
326 | \end{twocollist} | |
327 | ||
328 | A word is defined | |
329 | as in the specification of {\bf $[[:$<$:]]$} and {\bf $[[:$>$:]]$} above. Constraint escapes are | |
330 | illegal within bracket expressions. | |
331 | ||
332 | A back reference (AREs only) matches | |
333 | the same string matched by the parenthesized subexpression specified by | |
334 | the number, so that (e.g.) {\bf ($[bc]$)$\backslash$1} matches {\bf bb} or {\bf cc} but not `{\bf bc}'. | |
335 | The subexpression | |
336 | must entirely precede the back reference in the RE. Subexpressions are numbered | |
337 | in the order of their leading parentheses. Non-capturing parentheses do not | |
338 | define subexpressions. | |
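
The back-reference example from the text, and the numbering rule, can be checked with this sketch. Python's {\tt re} module is used as a stand-in (an assumption; it numbers subexpressions by their leading parentheses in the same way).

```python
import re

# ([bc])\1 matches a doubled 'b' or 'c', but not 'bc'.
doubled = [s for s in ("bb", "cc", "bc") if re.fullmatch(r"([bc])\1", s)]

# Subexpressions are numbered in the order of their leading parentheses.
numbering = re.fullmatch(r"((a)(b))", "ab").groups()

# Non-capturing parentheses do not define a subexpression, so the
# first numbered group here is (y).
skipped = re.fullmatch(r"(?:x)(y)\1", "xyy").group(1)

print(doubled, numbering, skipped)
```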
339 | ||
340 | There is an inherent historical ambiguity between | |
341 | octal character-entry escapes and back references, which is resolved by | |
342 | heuristics, as hinted at above. A leading zero always indicates an octal | |
343 | escape. A single non-zero digit, not followed by another digit, is always | |
344 | taken as a back reference. A multi-digit sequence not starting with a zero | |
345 | is taken as a back reference if it comes after a suitable subexpression | |
346 | (i.e. the number is in the legal range for a back reference), and otherwise | |
347 | is taken as octal. | |
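
The heuristics in the preceding paragraph can be written down as a small decision function. This is only a sketch of the rules as stated, not the library's code; the function name {\tt classify\_digit\_escape} and its parameters are hypothetical, and it does not model how many digits the parser actually consumes.

```python
def classify_digit_escape(digits: str, groups_closed: int) -> str:
    """Classify a backslash followed by `digits`.

    groups_closed is the number of capturing parentheses closed so far.
    """
    if digits[0] == "0":
        return "octal"        # a leading zero always means octal
    if len(digits) == 1:
        return "backref"      # a lone non-zero digit is always a back reference
    if int(digits) <= groups_closed:
        return "backref"      # in the legal range for a back reference
    return "octal"            # otherwise taken as octal

for digits, groups in (("0", 0), ("1", 0), ("12", 12), ("12", 3), ("012", 20)):
    print(digits, groups, classify_digit_escape(digits, groups))
```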
348 | ||
349 | ||
350 | \subsection{Metasyntax} | |
351 | ||
352 | \helpref{Syntax of the builtin regular expression library}{wxresyn} | |
353 | ||
354 | In addition to the main syntax described above, | |
355 | there are some special forms and miscellaneous syntactic facilities available. | |
356 | ||
357 | Normally the flavor of RE being used is specified by application-dependent | |
358 | means. However, this can be overridden by a {\it director}. If an RE of any flavor | |
359 | begins with `{\bf ***:}', the rest of the RE is an ARE. If an RE of any flavor begins | |
360 | with `{\bf ***=}', the rest of the RE is taken to be a literal string, with all | |
361 | characters considered ordinary characters. | |
362 | ||
363 | An ARE may begin with {\it embedded options}: a sequence {\bf (?xyz)} | |
364 | (where {\it xyz} is one or more alphabetic characters) | |
365 | specifies options affecting the rest of the RE. These supplement, and can | |
366 | override, any options specified by the application. The available option | |
367 | letters are: | |
368 | ||
369 | \begin{twocollist}\twocolwidtha{4cm} | |
370 | \twocolitem{{\bf b}}{rest of RE is a BRE} | |
371 | \twocolitem{{\bf c}}{case-sensitive matching (usual default)} | |
372 | \twocolitem{{\bf e}}{rest of RE is an ERE} | |
373 | \twocolitem{{\bf i}}{case-insensitive matching (see \helpref{Matching}{wxresynmatching}, below)} | |
374 | \twocolitem{{\bf m}}{historical synonym for {\bf n}} | |
375 | \twocolitem{{\bf n}}{newline-sensitive matching (see \helpref{Matching}{wxresynmatching}, below)} | |
376 | \twocolitem{{\bf p}}{partial newline-sensitive matching (see \helpref{Matching}{wxresynmatching}, below)} | |
377 | \twocolitem{{\bf q}}{rest of RE | |
378 | is a literal (``quoted'') string, all ordinary characters} | |
379 | \twocolitem{{\bf s}}{non-newline-sensitive matching (usual default)} | |
380 | \twocolitem{{\bf t}}{tight syntax (usual default; see below)} | |
381 | \twocolitem{{\bf w}}{inverse | |
382 | partial newline-sensitive (``weird'') matching (see \helpref{Matching}{wxresynmatching}, below)} | |
383 | \twocolitem{{\bf x}}{expanded syntax (see below)} | |
384 | \end{twocollist} | |
385 | ||
386 | Embedded options take effect at the {\bf )} terminating the | |
387 | sequence. They are available only at the start of an ARE, and may not be | |
388 | used later within it. | |
389 | ||
390 | In addition to the usual ({\it tight}) RE syntax, in which | |
391 | all characters are significant, there is an {\it expanded} syntax, available | |
392 | %in all flavors of RE with the {\bf -expanded} switch, or | |
393 | in AREs with the embedded | |
394 | x option. In the expanded syntax, white-space characters are ignored and | |
395 | all characters between a {\bf \#} and the following newline (or the end of the | |
396 | RE) are ignored, permitting paragraphing and commenting a complex RE. There | |
397 | are three exceptions to that basic rule: | |
398 | {\itemize | |
399 | \item% | |
400 | a white-space character or `{\bf \#}' preceded | |
401 | by `{\bf $\backslash$}' is retained | |
402 | \item% | |
403 | white space or `{\bf \#}' within a bracket expression is retained | |
404 | \item% | |
405 | white space and comments are illegal within multi-character symbols like | |
406 | the ARE `{\bf (?:}' or the BRE `{\bf $\backslash$(}' | |
407 | } | |
408 | Expanded-syntax white-space characters are blank, | |
409 | tab, newline, and any character that belongs to the {\it space} character class. | |
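
A sketch of the expanded syntax and its first two exceptions follows. It uses the verbose mode of Python's {\tt re} module, selected with an embedded {\bf (?x)} option; that mode is a close analogue of the expanded syntax described here, though it is an assumption that every detail matches. The pattern and test strings are invented for the example.

```python
import re

pattern = r"""(?x)        # embedded option selecting the expanded syntax
    ([0-9]{4}) - ([0-9]{2})   # white space between the parts is ignored
    \ \#                      # escaped space and escaped hash are retained
    [ #]                      # inside brackets both are retained as well
"""

full = re.fullmatch(pattern, "2024-05 ##")
parts = full.groups() if full else None

# Without the escaped space in the subject, the match fails.
missing_space = re.fullmatch(pattern, "2024-05##")

print(parts, missing_space)
```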
410 | ||
411 | Finally, in an ARE, outside bracket expressions, the sequence `{\bf (?\#ttt)}' (where | |
412 | {\it ttt} is any text not containing a `{\bf )}') is a comment, completely ignored. Again, | |
413 | this is not allowed between the characters of multi-character symbols like | |
414 | `{\bf (?:}'. Such comments are more a historical artifact than a useful facility, | |
415 | and their use is deprecated; use the expanded syntax instead. | |
416 | ||
417 | {\it None} of these | |
418 | metasyntax extensions is available if the application (or an initial {\bf ***=} | |
419 | director) has specified that the user's input be treated as a literal string | |
420 | rather than as an RE. | |
421 | ||
422 | ||
423 | \subsection{Matching}\label{wxresynmatching} | |
424 | ||
425 | \helpref{Syntax of the builtin regular expression library}{wxresyn} | |
426 | ||
427 | In the event that an RE could match more than | |
428 | one substring of a given string, the RE matches the one starting earliest | |
429 | in the string. If the RE could match more than one substring starting at | |
430 | that point, its choice is determined by its {\it preference}: either the longest | |
431 | substring, or the shortest. | |
432 | ||
433 | Most atoms, and all constraints, have no preference. | |
434 | A parenthesized RE has the same preference (possibly none) as the RE. A | |
435 | quantified atom with quantifier {\bf \{m\}} or {\bf \{m\}?} has the same preference (possibly | |
436 | none) as the atom itself. A quantified atom with other normal quantifiers | |
437 | (including {\bf \{m,n\}} with {\it m} equal to {\it n}) prefers longest match. A quantified | |
438 | atom with other non-greedy quantifiers (including {\bf \{m,n\}?} with {\it m} equal to | |
439 | {\it n}) prefers shortest match. A branch has the same preference as the first | |
440 | quantified atom in it which has a preference. An RE consisting of two or | |
441 | more branches connected by the {\bf $|$} operator prefers longest match. | |
442 | ||
443 | Subject | |
444 | to the constraints imposed by the rules for matching the whole RE, subexpressions | |
445 | also match the longest or shortest possible substrings, based on their | |
446 | preferences, with subexpressions starting earlier in the RE taking priority | |
447 | over ones starting later. Note that outer subexpressions thus take priority | |
448 | over their component subexpressions. | |
449 | ||
450 | Note that the quantifiers {\bf \{1,1\}} and | |
451 | {\bf \{1,1\}?} can be used to force longest and shortest preference, respectively, | |
452 | on a subexpression or a whole RE. | |
453 | ||
454 | Match lengths are measured in characters, | |
455 | not collating elements. An empty string is considered longer than no match | |
456 | at all. For example, {\bf bb*} matches the three middle characters | |
457 | of `{\bf abbbc}'; {\bf (week$|$wee)(night$|$knights)} | |
458 | matches all ten characters of `{\bf weeknights}'; when {\bf (.*).*} is matched against | |
459 | {\bf abc}, the parenthesized subexpression matches all three characters; and when | |
460 | {\bf (a*)*} is matched against {\bf bc}, both the whole RE and the parenthesized subexpression | |
461 | match an empty string. | |
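
The contrast between longest-match and first-match semantics can be made concrete with the sketch below. Python's {\tt re} module follows the Perl-style first-match rule, so it reports only nine characters for the second example. The hypothetical helper {\tt leftmost\_longest} recovers the earliest-start, longest-substring result by brute force; it is an illustration only, not how the library works, and because it matches against slices it is not suitable for patterns with anchors or lookahead.

```python
import re

def leftmost_longest(pattern: str, text: str):
    """Earliest-starting, then longest, substring fully matched by pattern."""
    for start in range(len(text) + 1):
        # Try the longest candidate first, down to the empty string.
        for end in range(len(text), start - 1, -1):
            if re.fullmatch(pattern, text[start:end]):
                return text[start:end]
    return None

pattern = r"(week|wee)(night|knights)"
first_match = re.search(pattern, "weeknights").group()
longest = leftmost_longest(pattern, "weeknights")

print(first_match)
print(longest)
print(leftmost_longest(r"bb*", "abbbc"))
print(repr(leftmost_longest(r"(a*)*", "bc")))
```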
462 | ||
463 | If case-independent matching is specified, the effect | |
464 | is much as if all case distinctions had vanished from the alphabet. When | |
465 | an alphabetic that exists in multiple cases appears as an ordinary character | |
466 | outside a bracket expression, it is effectively transformed into a bracket | |
467 | expression containing both cases, so that {\bf x} becomes `{\bf $[xX]$}'. When it appears | |
468 | inside a bracket expression, all case counterparts of it are added to the | |
469 | bracket expression, so that {\bf $[x]$} becomes {\bf $[xX]$} and {\bf $[^x]$} becomes `{\bf $[^xX]$}'. | |
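
The rewriting described above can be checked with a short sketch. It uses the embedded {\bf (?i)} option of Python's {\tt re} module, on the assumption that its case-insensitive matching agrees with the library described here for plain ASCII letters.

```python
import re

# An ordinary 'x' behaves like the bracket expression [xX].
plain = re.fullmatch(r"(?i)x", "X") is not None
as_bracket = re.fullmatch(r"[xX]", "X") is not None

# A complemented bracket expression excludes both cases, like [^xX].
excludes_upper = re.fullmatch(r"(?i)[^x]", "X") is None
allows_other = re.fullmatch(r"(?i)[^x]", "y") is not None

print(plain, as_bracket, excludes_upper, allows_other)
```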
470 | ||
471 | If newline-sensitive | |
472 | matching is specified, {\bf .} and bracket expressions using {\bf $^$} will never match | |
473 | the newline character (so that matches will never cross newlines unless | |
474 | the RE explicitly arranges it) and {\bf $^$} and {\bf \$} will match the empty string after | |
475 | and before a newline respectively, in addition to matching at beginning | |
476 | and end of string respectively. ARE {\bf $\backslash$A} and {\bf $\backslash$Z} continue to match beginning | |
477 | or end of string {\it only}. | |
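
The difference between the line anchors and the string-only anchors can be sketched with Python's {\tt re} module. This is only a rough analogue: Python's multi-line option changes the two line anchors but nothing else, and its dot excludes newline by default, so the flags do not map one-to-one onto the options described here.

```python
import re

text = "one\ntwo"

# With line-aware anchors, '^' also matches after a newline and
# '$' also matches before one.
line_starts = re.findall(r"(?m)^[a-z]+", text)
line_ends = re.findall(r"(?m)[a-z]+$", text)

# \A and \Z keep matching at the beginning and end of the string only.
string_start = re.findall(r"(?m)\A[a-z]+", text)
string_end = re.findall(r"(?m)[a-z]+\Z", text)

print(line_starts, line_ends, string_start, string_end)
```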
478 | ||
479 | If partial newline-sensitive matching is specified, | |
480 | this affects {\bf .} and bracket expressions as with newline-sensitive matching, | |
481 | but not {\bf $^$} and `{\bf \$}'. | |
482 | ||
483 | If inverse partial newline-sensitive matching is specified, | |
484 | this affects {\bf $^$} and {\bf \$} as with newline-sensitive matching, but not {\bf .} and bracket | |
485 | expressions. This isn't very useful but is provided for symmetry. | |
486 | ||
487 | ||
488 | \subsection{Limits And Compatibility} | |
489 | ||
490 | \helpref{Syntax of the builtin regular expression library}{wxresyn} | |
491 | ||
492 | No particular limit is imposed on the length of REs. Programs | |
493 | intended to be highly portable should not employ REs longer than 256 bytes, | |
494 | as a POSIX-compliant implementation can refuse to accept such REs. | |
495 | ||
496 | The only | |
497 | feature of AREs that is actually incompatible with POSIX EREs is that {\bf $\backslash$} | |
498 | does not lose its special significance inside bracket expressions. All other | |
499 | ARE features use syntax which is illegal or has undefined or unspecified | |
500 | effects in POSIX EREs; the {\bf ***} syntax of directors likewise is outside | |
501 | the POSIX syntax for both BREs and EREs. | |
502 | ||
503 | Many of the ARE extensions are | |
504 | borrowed from Perl, but some have been changed to clean them up, and a | |
505 | few Perl extensions are not present. Incompatibilities of note include `{\bf $\backslash$b}', | |
506 | `{\bf $\backslash$B}', the lack of special treatment for a trailing newline, the addition of | |
507 | complemented bracket expressions to the things affected by newline-sensitive | |
508 | matching, the restrictions on parentheses and back references in lookahead | |
509 | constraints, and the longest/shortest-match (rather than first-match) matching | |
510 | semantics. | |
511 | ||
512 | The matching rules for REs containing both normal and non-greedy | |
513 | quantifiers have changed since early beta-test versions of this package. | |
514 | (The new rules are much simpler and cleaner, but don't work as hard at guessing | |
515 | the user's real intentions.) | |
516 | ||
517 | Henry Spencer's original 1986 {\it regexp} package, still in widespread use, | |
518 | %(e.g., in pre-8.1 releases of Tcl), | |
519 | implemented an early version of today's EREs. There are four incompatibilities between {\it regexp}'s | |
520 | near-EREs (`RREs' for short) and AREs. In roughly increasing order of significance: | |
521 | {\itemize | |
522 | \item | |
523 | In AREs, {\bf $\backslash$} followed by an alphanumeric character is either an escape or | |
524 | an error, while in RREs, it was just another way of writing the alphanumeric. | |
525 | This should not be a problem because there was no reason to write such | |
526 | a sequence in RREs. | |
527 | ||
528 | \item% | |
529 | {\bf \{} followed by a digit in an ARE is the beginning of | |
530 | a bound, while in RREs, {\bf \{} was always an ordinary character. Such sequences | |
531 | should be rare, and will often result in an error because the following characters | |
532 | will not look like a valid bound. | |
533 | ||
534 | \item% | |
535 | In AREs, {\bf $\backslash$} remains a special character | |
536 | within `{\bf $[]$}', so a literal {\bf $\backslash$} within {\bf $[]$} must be | |
537 | written `{\bf $\backslash\backslash$}'. {\bf $\backslash\backslash$} also gives a literal | |
538 | {\bf $\backslash$} within {\bf $[]$} in RREs, but only truly paranoid programmers routinely doubled | |
539 | the backslash. | |
540 | ||
541 | \item% | |
542 | AREs report the longest/shortest match for the RE, rather | |
543 | than the first found in a specified search order. This may affect some RREs | |
544 | which were written in the expectation that the first match would be reported. | |
545 | Careful crafting of RREs to optimize the search order for fast matching | |
546 | is obsolete, because AREs examine all possible matches in parallel and their performance | |
547 | is largely insensitive to their complexity. However, cases where the search | |
548 | order was exploited to deliberately find a match which was {\it not} the longest/shortest | |
549 | will need rewriting. | |
550 | } | |
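
As a rough illustration of the last incompatibility, the following Python sketch
contrasts a first-match result with a longest-match result for an alternation.
Python's {\tt re} module is used here only as a stand-in for a first-match engine;
it is not this library, and the helper names {\tt first\_match} and {\tt longest\_match}
are hypothetical. The brute-force search merely emulates the longest-match outcome
and does not reflect how AREs are implemented. Shortest-match (non-greedy)
preferences are not covered.

```python
import re


def first_match(pattern, text):
    """Return what a first-match (Perl-style) engine reports.

    Assumption: Python's re module stands in for such an engine.
    """
    m = re.search(pattern, text)
    return m.group(0) if m else None


def longest_match(pattern, text):
    """Emulate a leftmost-longest result by brute force (illustration only).

    For each start position, try end positions from the furthest one
    downwards and return the first substring the whole pattern matches.
    """
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):
        for end in range(len(text), start - 1, -1):
            if compiled.fullmatch(text, start, end):
                return text[start:end]
    return None


if __name__ == "__main__":
    for pattern, text in [("a|ab", "xab"), ("foo|foobar", "foobar")]:
        print(pattern, text, first_match(pattern, text), longest_match(pattern, text))
```

With the pattern {\tt a$|$ab} and the text {\tt xab}, the first helper returns {\tt a}
while the second returns {\tt ab}. An RRE that depended on the former result is
the kind of expression that would need rewriting.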
551 | ||
552 | ||
553 | \subsection{Basic Regular Expressions}\label{wxresynbre} | |
554 | ||
555 | \helpref{Syntax of the builtin regular expression library}{wxresyn} | |
556 | ||
557 | BREs differ from EREs in | |
558 | several respects. `{\bf $|$}', `{\bf +}', and {\bf ?} are ordinary characters and there is no equivalent | |
559 | for their functionality. The delimiters for bounds | |
560 | are {\bf $\backslash$\{} and `{\bf $\backslash$\}}', with {\bf \{} and | |
561 | {\bf \}} by themselves ordinary characters. The parentheses for nested subexpressions | |
562 | are {\bf $\backslash$(} and `{\bf $\backslash$)}', with {\bf (} and {\bf )} by themselves | |
563 | ordinary characters. {\bf $^$} is an ordinary | |
564 | character except at the beginning of the RE or the beginning of a parenthesized | |
565 | subexpression, {\bf \$} is an ordinary character except at the end of the RE or | |
566 | the end of a parenthesized subexpression, and {\bf *} is an ordinary character | |
567 | if it appears at the beginning of the RE or the beginning of a parenthesized | |
568 | subexpression (after a possible leading `{\bf $^$}'). Finally, single-digit back references | |
569 | are available, and {\bf $\backslash<$} and {\bf $\backslash>$} are synonyms | |
570 | for {\bf $[[:<:]]$} and {\bf $[[:>:]]$} respectively; | |
571 | no other escapes are available. | |
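
The differences in delimiters described above can be sketched as a small
translation from BRE spelling to ERE spelling. This is a minimal Python sketch
under stated assumptions, not part of this library: the function names
{\tt bre\_to\_ere} and {\tt \_bracket\_end} are hypothetical, bracket expressions
are copied through unchanged, and the position-dependent rules for {\bf *},
{\bf $^$} and {\bf \$} are deliberately not handled.

```python
def _bracket_end(bre, i):
    """Return the index just past the bracket expression starting at bre[i]."""
    j = i + 1
    if j < len(bre) and bre[j] == "^":
        j += 1
    if j < len(bre) and bre[j] == "]":  # a leading ] is a literal
        j += 1
    while j < len(bre) and bre[j] != "]":
        if bre[j] == "[" and j + 1 < len(bre) and bre[j + 1] in ".:=":
            # Skip over [. .], [: :] and [= =] elements as a unit.
            close = bre.find(bre[j + 1] + "]", j + 2)
            j = len(bre) if close < 0 else close + 2
        else:
            j += 1
    return min(j + 1, len(bre))


def bre_to_ere(bre):
    """Rewrite a simple BRE in ERE spelling (a sketch, not a full parser)."""
    out = []
    i = 0
    while i < len(bre):
        ch = bre[i]
        if ch == "\\" and i + 1 < len(bre):
            nxt = bre[i + 1]
            # \( \) \{ \} are the BRE grouping and bound delimiters;
            # any other escape (e.g. a back reference) is passed through.
            out.append(nxt if nxt in "(){}" else ch + nxt)
            i += 2
        elif ch == "[":
            end = _bracket_end(bre, i)
            out.append(bre[i:end])
            i = end
        elif ch in "|+?(){}":
            # Ordinary characters in a BRE must be escaped in an ERE.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For example, the BRE $\backslash$(ab$\backslash$)$\backslash$\{2$\backslash$\} becomes
the ERE (ab)\{2\}, while the BRE a+b, in which {\bf +} is an ordinary character,
becomes a$\backslash$+b.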
572 | ||
573 | ||
574 | \subsection{Regular Expression Character Names}\label{wxresynchars} | |
575 | ||
576 | \helpref{Syntax of the builtin regular expression library}{wxresyn} | |
577 | ||
578 | Note that the character names are case sensitive. | |
579 | ||
580 | \begin{twocollist} | |
581 | \twocolitem{NUL}{'$\backslash$0'} | |
582 | \twocolitem{SOH}{'$\backslash$001'} | |
583 | \twocolitem{STX}{'$\backslash$002'} | |
584 | \twocolitem{ETX}{'$\backslash$003'} | |
585 | \twocolitem{EOT}{'$\backslash$004'} | |
586 | \twocolitem{ENQ}{'$\backslash$005'} | |
587 | \twocolitem{ACK}{'$\backslash$006'} | |
588 | \twocolitem{BEL}{'$\backslash$007'} | |
589 | \twocolitem{alert}{'$\backslash$007'} | |
590 | \twocolitem{BS}{'$\backslash$010'} | |
591 | \twocolitem{backspace}{'$\backslash$b'} | |
592 | \twocolitem{HT}{'$\backslash$011'} | |
593 | \twocolitem{tab}{'$\backslash$t'} | |
594 | \twocolitem{LF}{'$\backslash$012'} | |
595 | \twocolitem{newline}{'$\backslash$n'} | |
596 | \twocolitem{VT}{'$\backslash$013'} | |
597 | \twocolitem{vertical-tab}{'$\backslash$v'} | |
598 | \twocolitem{FF}{'$\backslash$014'} | |
599 | \twocolitem{form-feed}{'$\backslash$f'} | |
600 | \twocolitem{CR}{'$\backslash$015'} | |
601 | \twocolitem{carriage-return}{'$\backslash$r'} | |
602 | \twocolitem{SO}{'$\backslash$016'} | |
603 | \twocolitem{SI}{'$\backslash$017'} | |
604 | \twocolitem{DLE}{'$\backslash$020'} | |
605 | \twocolitem{DC1}{'$\backslash$021'} | |
606 | \twocolitem{DC2}{'$\backslash$022'} | |
607 | \twocolitem{DC3}{'$\backslash$023'} | |
608 | \twocolitem{DC4}{'$\backslash$024'} | |
609 | \twocolitem{NAK}{'$\backslash$025'} | |
610 | \twocolitem{SYN}{'$\backslash$026'} | |
611 | \twocolitem{ETB}{'$\backslash$027'} | |
612 | \twocolitem{CAN}{'$\backslash$030'} | |
613 | \twocolitem{EM}{'$\backslash$031'} | |
614 | \twocolitem{SUB}{'$\backslash$032'} | |
615 | \twocolitem{ESC}{'$\backslash$033'} | |
616 | \twocolitem{IS4}{'$\backslash$034'} | |
617 | \twocolitem{FS}{'$\backslash$034'} | |
618 | \twocolitem{IS3}{'$\backslash$035'} | |
619 | \twocolitem{GS}{'$\backslash$035'} | |
620 | \twocolitem{IS2}{'$\backslash$036'} | |
621 | \twocolitem{RS}{'$\backslash$036'} | |
622 | \twocolitem{IS1}{'$\backslash$037'} | |
623 | \twocolitem{US}{'$\backslash$037'} | |
624 | \twocolitem{space}{' '} | |
625 | \twocolitem{exclamation-mark}{'!'} | |
626 | \twocolitem{quotation-mark}{'"'} | |
627 | \twocolitem{number-sign}{'\#'} | |
628 | \twocolitem{dollar-sign}{'\$'} | |
629 | \twocolitem{percent-sign}{'\%'} | |
630 | \twocolitem{ampersand}{'\&'} | |
631 | \twocolitem{apostrophe}{'$\backslash$''} |
632 | \twocolitem{left-parenthesis}{'('} | |
633 | \twocolitem{right-parenthesis}{')'} | |
634 | \twocolitem{asterisk}{'*'} | |
635 | \twocolitem{plus-sign}{'+'} | |
636 | \twocolitem{comma}{','} | |
637 | \twocolitem{hyphen}{'-'} | |
638 | \twocolitem{hyphen-minus}{'-'} | |
639 | \twocolitem{period}{'.'} | |
640 | \twocolitem{full-stop}{'.'} | |
641 | \twocolitem{slash}{'/'} | |
642 | \twocolitem{solidus}{'/'} | |
643 | \twocolitem{zero}{'0'} | |
644 | \twocolitem{one}{'1'} | |
645 | \twocolitem{two}{'2'} | |
646 | \twocolitem{three}{'3'} | |
647 | \twocolitem{four}{'4'} | |
648 | \twocolitem{five}{'5'} | |
649 | \twocolitem{six}{'6'} | |
650 | \twocolitem{seven}{'7'} | |
651 | \twocolitem{eight}{'8'} | |
652 | \twocolitem{nine}{'9'} | |
653 | \twocolitem{colon}{':'} | |
654 | \twocolitem{semicolon}{';'} | |
655 | \twocolitem{less-than-sign}{'<'} | |
656 | \twocolitem{equals-sign}{'='} | |
657 | \twocolitem{greater-than-sign}{'>'} | |
658 | \twocolitem{question-mark}{'?'} | |
659 | \twocolitem{commercial-at}{'@'} | |
660 | \twocolitem{left-square-bracket}{'$[$'} | |
661 | \twocolitem{backslash}{'$\backslash$'} | |
662 | \twocolitem{reverse-solidus}{'$\backslash$'} | |
663 | \twocolitem{right-square-bracket}{'$]$'} | |
664 | \twocolitem{circumflex}{'$^$'} | |
665 | \twocolitem{circumflex-accent}{'$^$'} | |
666 | \twocolitem{underscore}{'\_'} | |
667 | \twocolitem{low-line}{'\_'} | |
668 | \twocolitem{grave-accent}{'`'} | |
669 | \twocolitem{left-brace}{'\{'} | |
670 | \twocolitem{left-curly-bracket}{'\{'} | |
671 | \twocolitem{vertical-line}{'$|$'} | |
672 | \twocolitem{right-brace}{'\}'} |
673 | \twocolitem{right-curly-bracket}{'\}'} | |
674 | \twocolitem{tilde}{'$~$'} | |
675 | \twocolitem{DEL}{'$\backslash$177'} | |
676 | \end{twocollist} |