Commit | Line | Data |
---|---|---|
1 | '\" | |
2 | '\" Copyright (c) 1998 Sun Microsystems, Inc. | |
3 | '\" Copyright (c) 1999 Scriptics Corporation | |
4 | '\" | |
5 | '\" This software is copyrighted by the Regents of the University of | |
6 | '\" California, Sun Microsystems, Inc., Scriptics Corporation, ActiveState | |
7 | '\" Corporation and other parties. The following terms apply to all files | |
8 | '\" associated with the software unless explicitly disclaimed in | |
9 | '\" individual files. | |
10 | '\" | |
11 | '\" The authors hereby grant permission to use, copy, modify, distribute, | |
12 | '\" and license this software and its documentation for any purpose, provided | |
13 | '\" that existing copyright notices are retained in all copies and that this | |
14 | '\" notice is included verbatim in any distributions. No written agreement, | |
15 | '\" license, or royalty fee is required for any of the authorized uses. | |
16 | '\" Modifications to this software may be copyrighted by their authors | |
17 | '\" and need not follow the licensing terms described here, provided that | |
18 | '\" the new terms are clearly indicated on the first page of each file where | |
19 | '\" they apply. | |
20 | '\" | |
21 | '\" IN NO EVENT SHALL THE AUTHORS OR DISTRIBUTORS BE LIABLE TO ANY PARTY | |
22 | '\" FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES | |
23 | '\" ARISING OUT OF THE USE OF THIS SOFTWARE, ITS DOCUMENTATION, OR ANY | |
24 | '\" DERIVATIVES THEREOF, EVEN IF THE AUTHORS HAVE BEEN ADVISED OF THE | |
25 | '\" POSSIBILITY OF SUCH DAMAGE. | |
26 | '\" | |
27 | '\" THE AUTHORS AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES, | |
28 | '\" INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, | |
29 | '\" FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. THIS SOFTWARE | |
30 | '\" IS PROVIDED ON AN "AS IS" BASIS, AND THE AUTHORS AND DISTRIBUTORS HAVE | |
31 | '\" NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR | |
32 | '\" MODIFICATIONS. | |
33 | '\" | |
34 | '\" GOVERNMENT USE: If you are acquiring this software on behalf of the | |
35 | '\" U.S. government, the Government shall have only "Restricted Rights" | |
36 | '\" in the software and related documentation as defined in the Federal | |
37 | '\" Acquisition Regulations (FARs) in Clause 52.227.19 (c) (2). If you | |
38 | '\" are acquiring the software on behalf of the Department of Defense, the | |
39 | '\" software shall be classified as "Commercial Computer Software" and the | |
40 | '\" Government shall have only "Restricted Rights" as defined in Clause | |
41 | '\" 252.227-7013 (c) (1) of DFARs. Notwithstanding the foregoing, the | |
42 | '\" authors grant the U.S. Government and others acting in its behalf | |
43 | '\" permission to use and distribute the software in accordance with the | |
44 | '\" terms specified in this license. | |
45 | '\" | |
46 | '\" RCS: @(#) Id: re_syntax.n,v 1.3 1999/07/14 19:09:36 jpeek Exp | |
47 | '\" | |
48 | .so man.macros | |
49 | .TH re_syntax n "8.1" Tcl "Tcl Built-In Commands" | |
50 | .BS | |
51 | .SH NAME | |
52 | re_syntax \- Syntax of Tcl regular expressions. | |
53 | .BE | |
54 | ||
55 | .SH DESCRIPTION | |
56 | .PP | |
57 | A \fIregular expression\fR describes strings of characters. | |
58 | It's a pattern that matches certain strings and doesn't match others. | |
59 | ||
60 | .SH "DIFFERENT FLAVORS OF REs" | |
61 | Regular expressions (``RE''s), as defined by POSIX, come in two | |
62 | flavors: \fIextended\fR REs (``EREs'') and \fIbasic\fR REs (``BREs''). | |
63 | EREs are roughly those of the traditional \fIegrep\fR, while BREs are | |
64 | roughly those of the traditional \fIed\fR. This implementation adds | |
65 | a third flavor, \fIadvanced\fR REs (``AREs''), basically EREs with | |
66 | some significant extensions. | |
67 | .PP | |
68 | This manual page primarily describes AREs. BREs mostly exist for | |
69 | backward compatibility in some old programs; they will be discussed at | |
70 | the end. POSIX EREs are almost an exact subset of AREs. Features of | |
71 | AREs that are not present in EREs will be indicated. | |
72 | ||
73 | .SH "REGULAR EXPRESSION SYNTAX" | |
74 | .PP | |
75 | Tcl regular expressions are implemented using the package written by | |
76 | Henry Spencer, based on the 1003.2 spec and some (not quite all) of | |
77 | the Perl5 extensions (thanks, Henry!). Much of the description of | |
78 | regular expressions below is copied verbatim from his manual entry. | |
79 | .PP | |
80 | An ARE is one or more \fIbranches\fR, | |
81 | separated by `\fB|\fR', | |
82 | matching anything that matches any of the branches. | |
83 | .PP | |
84 | A branch is zero or more \fIconstraints\fR or \fIquantified atoms\fR, | |
85 | concatenated. | |
86 | It matches a match for the first, followed by a match for the second, etc; | |
87 | an empty branch matches the empty string. | |
88 | .PP | |
89 | A quantified atom is an \fIatom\fR possibly followed | |
90 | by a single \fIquantifier\fR. | |
91 | Without a quantifier, it matches a match for the atom. | |
92 | The quantifiers, | |
93 | and what a so-quantified atom matches, are: | |
94 | .RS 2 | |
95 | .TP 6 | |
96 | \fB*\fR | |
97 | a sequence of 0 or more matches of the atom | |
98 | .TP | |
99 | \fB+\fR | |
100 | a sequence of 1 or more matches of the atom | |
101 | .TP | |
102 | \fB?\fR | |
103 | a sequence of 0 or 1 matches of the atom | |
104 | .TP | |
105 | \fB{\fIm\fB}\fR | |
106 | a sequence of exactly \fIm\fR matches of the atom | |
107 | .TP | |
108 | \fB{\fIm\fB,}\fR | |
109 | a sequence of \fIm\fR or more matches of the atom | |
110 | .TP | |
111 | \fB{\fIm\fB,\fIn\fB}\fR | |
112 | a sequence of \fIm\fR through \fIn\fR (inclusive) matches of the atom; | |
113 | \fIm\fR may not exceed \fIn\fR | |
114 | .TP | |
115 | \fB*? +? ?? {\fIm\fB}? {\fIm\fB,}? {\fIm\fB,\fIn\fB}?\fR | |
116 | \fInon-greedy\fR quantifiers, | |
117 | which match the same possibilities, | |
118 | but prefer the smallest number rather than the largest number | |
119 | of matches (see MATCHING) | |
120 | .RE | |
121 | .PP | |
122 | The forms using | |
123 | \fB{\fR and \fB}\fR | |
124 | are known as \fIbound\fRs. | |
125 | The numbers | |
126 | \fIm\fR and \fIn\fR are unsigned decimal integers | |
127 | with permissible values from 0 to 255 inclusive. | |
128 | .PP | |
129 | An atom is one of: | |
130 | .RS 2 | |
131 | .TP 6 | |
132 | \fB(\fIre\fB)\fR | |
133 | (where \fIre\fR is any regular expression) | |
134 | matches a match for | |
135 | \fIre\fR, with the match noted for possible reporting | |
136 | .TP | |
137 | \fB(?:\fIre\fB)\fR | |
138 | as previous, | |
139 | but does no reporting | |
140 | (a ``non-capturing'' set of parentheses) | |
141 | .TP | |
142 | \fB()\fR | |
143 | matches an empty string, | |
144 | noted for possible reporting | |
145 | .TP | |
146 | \fB(?:)\fR | |
147 | matches an empty string, | |
148 | without reporting | |
149 | .TP | |
150 | \fB[\fIchars\fB]\fR | |
151 | a \fIbracket expression\fR, | |
152 | matching any one of the \fIchars\fR (see BRACKET EXPRESSIONS for more detail) | |
153 | .TP | |
154 | \fB.\fR | |
155 | matches any single character | |
156 | .TP | |
157 | \fB\e\fIk\fR | |
158 | (where \fIk\fR is a non-alphanumeric character) | |
159 | matches that character taken as an ordinary character, | |
160 | e.g. \e\e matches a backslash character | |
161 | .TP | |
162 | \fB\e\fIc\fR | |
163 | (where \fIc\fR is alphanumeric, | |
164 | possibly followed by other characters) | |
165 | is an \fIescape\fR (AREs only); | |
166 | see ESCAPES below | |
167 | .TP | |
168 | \fB{\fR | |
169 | when followed by a character other than a digit, | |
170 | matches the left-brace character `\fB{\fR'; | |
171 | when followed by a digit, it is the beginning of a | |
172 | \fIbound\fR (see above) | |
173 | .TP | |
174 | \fIx\fR | |
175 | where \fIx\fR is | |
176 | a single character with no other significance, matches that character. | |
177 | .RE | |
178 | .PP | |
179 | A \fIconstraint\fR matches an empty string when specific conditions | |
180 | are met. | |
181 | A constraint may not be followed by a quantifier. | |
182 | The simple constraints are as follows; some more constraints are | |
183 | described later, under ESCAPES. | |
184 | .RS 2 | |
185 | .TP 8 | |
186 | \fB^\fR | |
187 | matches at the beginning of a line | |
188 | .TP | |
189 | \fB$\fR | |
190 | matches at the end of a line | |
191 | .TP | |
192 | \fB(?=\fIre\fB)\fR | |
193 | \fIpositive lookahead\fR (AREs only), matches at any point | |
194 | where a substring matching \fIre\fR begins | |
195 | .TP | |
196 | \fB(?!\fIre\fB)\fR | |
197 | \fInegative lookahead\fR (AREs only), matches at any point | |
198 | where no substring matching \fIre\fR begins | |
199 | .RE | |
200 | .PP | |
201 | The lookahead constraints may not contain back references (see later), | |
202 | and all parentheses within them are considered non-capturing. | |
203 | .PP | |
204 | An RE may not end with `\fB\e\fR'. | |
205 | ||
206 | .SH "BRACKET EXPRESSIONS" | |
207 | A \fIbracket expression\fR is a list of characters enclosed in `\fB[\|]\fR'. | |
208 | It normally matches any single character from the list (but see below). | |
209 | If the list begins with `\fB^\fR', | |
210 | it matches any single character | |
211 | (but see below) \fInot\fR from the rest of the list. | |
212 | .PP | |
213 | If two characters in the list are separated by `\fB\-\fR', | |
214 | this is shorthand | |
215 | for the full \fIrange\fR of characters between those two (inclusive) in the | |
216 | collating sequence, | |
217 | e.g. | |
218 | \fB[0\-9]\fR | |
219 | in ASCII matches any decimal digit. | |
220 | Two ranges may not share an | |
221 | endpoint, so e.g. | |
222 | \fBa\-c\-e\fR | |
223 | is illegal. | |
224 | Ranges are very collating-sequence-dependent, | |
225 | and portable programs should avoid relying on them. | |
226 | .PP | |
227 | To include a literal | |
228 | \fB]\fR | |
229 | or | |
230 | \fB\-\fR | |
231 | in the list, | |
232 | the simplest method is to | |
233 | enclose it in | |
234 | \fB[.\fR and \fB.]\fR | |
235 | to make it a collating element (see below). | |
236 | Alternatively, | |
237 | make it the first character | |
238 | (following a possible `\fB^\fR'), | |
239 | or (AREs only) precede it with `\fB\e\fR'. | |
240 | Alternatively, for `\fB\-\fR', | |
241 | make it the last character, | |
242 | or the second endpoint of a range. | |
243 | To use a literal | |
244 | \fB\-\fR | |
245 | as the first endpoint of a range, | |
246 | make it a collating element | |
247 | or (AREs only) precede it with `\fB\e\fR'. | |
248 | With the exception of these, some combinations using | |
249 | \fB[\fR | |
250 | (see next | |
251 | paragraphs), and escapes, | |
252 | all other special characters lose their | |
253 | special significance within a bracket expression. | |
254 | .PP | |
255 | Within a bracket expression, a collating element (a character, | |
256 | a multi-character sequence that collates as if it were a single character, | |
257 | or a collating-sequence name for either) | |
258 | enclosed in | |
259 | \fB[.\fR and \fB.]\fR | |
260 | stands for the | |
261 | sequence of characters of that collating element. | |
262 | The sequence is a single element of the bracket expression's list. | |
263 | A bracket expression in a locale that has | |
264 | multi-character collating elements | |
265 | can thus match more than one character. | |
266 | .VS 8.2 | |
267 | So (insidiously), a bracket expression that starts with \fB^\fR | |
268 | can match multi-character collating elements even if none of them | |
269 | appear in the bracket expression! | |
270 | (\fINote:\fR Tcl currently has no multi-character collating elements. | |
271 | This information is only for illustration.) | |
272 | .PP | |
273 | For example, assume the collating sequence includes a \fBch\fR | |
274 | multi-character collating element. | |
275 | Then the RE \fB[[.ch.]]*c\fR (zero or more \fBch\fP's followed by \fBc\fP) | |
276 | matches the first five characters of `\fBchchcc\fR'. | |
277 | Also, the RE \fB[^c]b\fR matches all of `\fBchb\fR' | |
278 | (because \fB[^c]\fR matches the multi-character \fBch\fR). | |
279 | .VE 8.2 | |
280 | .PP | |
281 | Within a bracket expression, a collating element enclosed in | |
282 | \fB[=\fR | |
283 | and | |
284 | \fB=]\fR | |
285 | is an equivalence class, standing for the sequences of characters | |
286 | of all collating elements equivalent to that one, including itself. | |
287 | (If there are no other equivalent collating elements, | |
288 | the treatment is as if the enclosing delimiters were `\fB[.\fR'\& | |
289 | and `\fB.]\fR'.) | |
290 | For example, if | |
291 | \fBo\fR | |
292 | and | |
293 | \fB\o'o^'\fR | |
294 | are the members of an equivalence class, | |
295 | then `\fB[[=o=]]\fR', `\fB[[=\o'o^'=]]\fR', | |
296 | and `\fB[o\o'o^']\fR'\& | |
297 | are all synonymous. | |
298 | An equivalence class may not be an endpoint | |
299 | of a range. | |
300 | .VS 8.2 | |
301 | (\fINote:\fR | |
302 | Tcl currently implements only the Unicode locale. | |
303 | It doesn't define any equivalence classes. | |
304 | The examples above are just illustrations.) | |
305 | .VE 8.2 | |
306 | .PP | |
307 | Within a bracket expression, the name of a \fIcharacter class\fR enclosed | |
308 | in | |
309 | \fB[:\fR | |
310 | and | |
311 | \fB:]\fR | |
312 | stands for the list of all characters | |
313 | (not all collating elements!) | |
314 | belonging to that | |
315 | class. | |
316 | Standard character classes are: | |
317 | .PP | |
318 | .RS | |
319 | .ne 5 | |
320 | .nf | |
321 | .ta 3c | |
322 | \fBalpha\fR A letter. | |
323 | \fBupper\fR An upper-case letter. | |
324 | \fBlower\fR A lower-case letter. | |
325 | \fBdigit\fR A decimal digit. | |
326 | \fBxdigit\fR A hexadecimal digit. | |
327 | \fBalnum\fR An alphanumeric (letter or digit). | |
328 | \fBprint\fR An alphanumeric (same as alnum). | |
329 | \fBblank\fR A space or tab character. | |
330 | \fBspace\fR A character producing white space in displayed text. | |
331 | \fBpunct\fR A punctuation character. | |
332 | \fBgraph\fR A character with a visible representation. | |
333 | \fBcntrl\fR A control character. | |
334 | .fi | |
335 | .RE | |
336 | .PP | |
337 | A locale may provide others. | |
338 | .VS 8.2 | |
339 | (Note that the current Tcl implementation has only one locale: | |
340 | the Unicode locale.) | |
341 | .VE 8.2 | |
342 | A character class may not be used as an endpoint of a range. | |
343 | .PP | |
344 | There are two special cases of bracket expressions: | |
345 | the bracket expressions | |
346 | \fB[[:<:]]\fR | |
347 | and | |
348 | \fB[[:>:]]\fR | |
349 | are constraints, matching empty strings at | |
350 | the beginning and end of a word respectively. | |
351 | '\" note, discussion of escapes below references this definition of word | |
352 | A word is defined as a sequence of | |
353 | word characters | |
354 | that is neither preceded nor followed by | |
355 | word characters. | |
356 | A word character is an | |
357 | \fIalnum\fR | |
358 | character | |
359 | or an underscore | |
360 | (\fB_\fR). | |
361 | These special bracket expressions are deprecated; | |
362 | users of AREs should use constraint escapes instead (see below). | |
363 | .SH ESCAPES | |
364 | Escapes (AREs only), which begin with a | |
365 | \fB\e\fR | |
366 | followed by an alphanumeric character, | |
367 | come in several varieties: | |
368 | character entry, class shorthands, constraint escapes, and back references. | |
369 | A | |
370 | \fB\e\fR | |
371 | followed by an alphanumeric character but not constituting | |
372 | a valid escape is illegal in AREs. | |
373 | In EREs, there are no escapes: | |
374 | outside a bracket expression, | |
375 | a | |
376 | \fB\e\fR | |
377 | followed by an alphanumeric character merely stands for that | |
378 | character as an ordinary character, | |
379 | and inside a bracket expression, | |
380 | \fB\e\fR | |
381 | is an ordinary character. | |
382 | (The latter is the one actual incompatibility between EREs and AREs.) | |
383 | .PP | |
384 | Character-entry escapes (AREs only) exist to make it easier to specify | |
385 | non-printing and otherwise inconvenient characters in REs: | |
386 | .RS 2 | |
387 | .TP 5 | |
388 | \fB\ea\fR | |
389 | alert (bell) character, as in C | |
390 | .TP | |
391 | \fB\eb\fR | |
392 | backspace, as in C | |
393 | .TP | |
394 | \fB\eB\fR | |
395 | synonym for | |
396 | \fB\e\fR | |
397 | to help reduce backslash doubling in some | |
398 | applications where there are multiple levels of backslash processing | |
399 | .TP | |
400 | \fB\ec\fIX\fR | |
401 | (where \fIX\fR is any character) the character whose | |
402 | low-order 5 bits are the same as those of | |
403 | \fIX\fR, | |
404 | and whose other bits are all zero | |
405 | .TP | |
406 | \fB\ee\fR | |
407 | the character whose collating-sequence name | |
408 | is `\fBESC\fR', | |
409 | or failing that, the character with octal value 033 | |
410 | .TP | |
411 | \fB\ef\fR | |
412 | formfeed, as in C | |
413 | .TP | |
414 | \fB\en\fR | |
415 | newline, as in C | |
416 | .TP | |
417 | \fB\er\fR | |
418 | carriage return, as in C | |
419 | .TP | |
420 | \fB\et\fR | |
421 | horizontal tab, as in C | |
422 | .TP | |
423 | \fB\eu\fIwxyz\fR | |
424 | (where | |
425 | \fIwxyz\fR | |
426 | is exactly four hexadecimal digits) | |
427 | the Unicode character | |
428 | \fBU+\fIwxyz\fR | |
429 | in the local byte ordering | |
430 | .TP | |
431 | \fB\eU\fIstuvwxyz\fR | |
432 | (where | |
433 | \fIstuvwxyz\fR | |
434 | is exactly eight hexadecimal digits) | |
435 | reserved for a somewhat-hypothetical Unicode extension to 32 bits | |
436 | .TP | |
437 | \fB\ev\fR | |
438 | vertical tab, as in C | |
440 | .TP | |
441 | \fB\ex\fIhhh\fR | |
442 | (where | |
443 | \fIhhh\fR | |
444 | is any sequence of hexadecimal digits) | |
445 | the character whose hexadecimal value is | |
446 | \fB0x\fIhhh\fR | |
447 | (a single character no matter how many hexadecimal digits are used). | |
448 | .TP | |
449 | \fB\e0\fR | |
450 | the character whose value is | |
451 | \fB0\fR | |
452 | .TP | |
453 | \fB\e\fIxy\fR | |
454 | (where | |
455 | \fIxy\fR | |
456 | is exactly two octal digits, | |
457 | and is not a | |
458 | \fIback reference\fR (see below)) | |
459 | the character whose octal value is | |
460 | \fB0\fIxy\fR | |
461 | .TP | |
462 | \fB\e\fIxyz\fR | |
463 | (where | |
464 | \fIxyz\fR | |
465 | is exactly three octal digits, | |
466 | and is not a | |
467 | back reference (see below)) | |
468 | the character whose octal value is | |
469 | \fB0\fIxyz\fR | |
470 | .RE | |
471 | .PP | |
472 | Hexadecimal digits are `\fB0\fR'-`\fB9\fR', `\fBa\fR'-`\fBf\fR', | |
473 | and `\fBA\fR'-`\fBF\fR'. | |
474 | Octal digits are `\fB0\fR'-`\fB7\fR'. | |
475 | .PP | |
476 | The character-entry escapes are always taken as ordinary characters. | |
477 | For example, | |
478 | \fB\e135\fR | |
479 | is | |
480 | \fB]\fR | |
481 | in ASCII, | |
482 | but | |
483 | \fB\e135\fR | |
484 | does not terminate a bracket expression. | |
485 | Beware, however, that some applications (e.g., C compilers) interpret | |
486 | such sequences themselves before the regular-expression package | |
487 | gets to see them, which may require doubling (quadrupling, etc.) the `\fB\e\fR'. | |
488 | .PP | |
489 | Class-shorthand escapes (AREs only) provide shorthands for certain commonly-used | |
490 | character classes: | |
491 | .RS 2 | |
492 | .TP 10 | |
493 | \fB\ed\fR | |
494 | \fB[[:digit:]]\fR | |
495 | .TP | |
496 | \fB\es\fR | |
497 | \fB[[:space:]]\fR | |
498 | .TP | |
499 | \fB\ew\fR | |
500 | \fB[[:alnum:]_]\fR | |
501 | (note underscore) | |
502 | .TP | |
503 | \fB\eD\fR | |
504 | \fB[^[:digit:]]\fR | |
505 | .TP | |
506 | \fB\eS\fR | |
507 | \fB[^[:space:]]\fR | |
508 | .TP | |
509 | \fB\eW\fR | |
510 | \fB[^[:alnum:]_]\fR | |
511 | (note underscore) | |
512 | .RE | |
513 | .PP | |
514 | Within bracket expressions, `\fB\ed\fR', `\fB\es\fR', | |
515 | and `\fB\ew\fR'\& | |
516 | lose their outer brackets, | |
517 | and `\fB\eD\fR', `\fB\eS\fR', | |
518 | and `\fB\eW\fR'\& | |
519 | are illegal. | |
520 | .VS 8.2 | |
521 | (So, for example, \fB[a\-c\ed]\fR is equivalent to \fB[a\-c[:digit:]]\fR. | |
522 | Also, \fB[a\-c\eD]\fR, which is equivalent to \fB[a\-c^[:digit:]]\fR, is illegal.) | |
523 | .VE 8.2 | |
524 | .PP | |
525 | A constraint escape (AREs only) is a constraint, | |
526 | matching the empty string if specific conditions are met, | |
527 | written as an escape: | |
528 | .RS 2 | |
529 | .TP 6 | |
530 | \fB\eA\fR | |
531 | matches only at the beginning of the string | |
532 | (see MATCHING, below, for how this differs from `\fB^\fR') | |
533 | .TP | |
534 | \fB\em\fR | |
535 | matches only at the beginning of a word | |
536 | .TP | |
537 | \fB\eM\fR | |
538 | matches only at the end of a word | |
539 | .TP | |
540 | \fB\ey\fR | |
541 | matches only at the beginning or end of a word | |
542 | .TP | |
543 | \fB\eY\fR | |
544 | matches only at a point that is not the beginning or end of a word | |
545 | .TP | |
546 | \fB\eZ\fR | |
547 | matches only at the end of the string | |
548 | (see MATCHING, below, for how this differs from `\fB$\fR') | |
549 | .TP | |
550 | \fB\e\fIm\fR | |
551 | (where | |
552 | \fIm\fR | |
553 | is a nonzero digit) a \fIback reference\fR, see below | |
554 | .TP | |
555 | \fB\e\fImnn\fR | |
556 | (where | |
557 | \fIm\fR | |
558 | is a nonzero digit, and | |
559 | \fInn\fR | |
560 | is some more digits, | |
561 | and the decimal value | |
562 | \fImnn\fR | |
563 | is not greater than the number of closing capturing parentheses seen so far) | |
564 | a \fIback reference\fR, see below | |
565 | .RE | |
566 | .PP | |
567 | A word is defined as in the specification of | |
568 | \fB[[:<:]]\fR | |
569 | and | |
570 | \fB[[:>:]]\fR | |
571 | above. | |
572 | Constraint escapes are illegal within bracket expressions. | |
573 | .PP | |
574 | A back reference (AREs only) matches the same string matched by the parenthesized | |
575 | subexpression specified by the number, | |
576 | so that (e.g.) | |
577 | \fB([bc])\e1\fR | |
578 | matches | |
579 | \fBbb\fR | |
580 | or | |
581 | \fBcc\fR | |
582 | but not `\fBbc\fR'. | |
583 | The subexpression must entirely precede the back reference in the RE. | |
584 | Subexpressions are numbered in the order of their leading parentheses. | |
585 | Non-capturing parentheses do not define subexpressions. | |
586 | .PP | |
587 | There is an inherent historical ambiguity between octal character-entry | |
588 | escapes and back references, which is resolved by heuristics, | |
589 | as hinted at above. | |
590 | A leading zero always indicates an octal escape. | |
591 | A single non-zero digit, not followed by another digit, | |
592 | is always taken as a back reference. | |
593 | A multi-digit sequence not starting with a zero is taken as a back | |
594 | reference if it comes after a suitable subexpression | |
595 | (i.e. the number is in the legal range for a back reference), | |
596 | and otherwise is taken as octal. | |
597 | .SH "METASYNTAX" | |
598 | In addition to the main syntax described above, there are some special | |
599 | forms and miscellaneous syntactic facilities available. | |
600 | .PP | |
601 | Normally the flavor of RE being used is specified by | |
602 | application-dependent means. | |
603 | However, this can be overridden by a \fIdirector\fR. | |
604 | If an RE of any flavor begins with `\fB***:\fR', | |
605 | the rest of the RE is an ARE. | |
606 | If an RE of any flavor begins with `\fB***=\fR', | |
607 | the rest of the RE is taken to be a literal string, | |
608 | with all characters considered ordinary characters. | |
609 | .PP | |
610 | An ARE may begin with \fIembedded options\fR: | |
611 | a sequence | |
612 | \fB(?\fIxyz\fB)\fR | |
613 | (where | |
614 | \fIxyz\fR | |
615 | is one or more alphabetic characters) | |
616 | specifies options affecting the rest of the RE. | |
617 | These supplement, and can override, | |
618 | any options specified by the application. | |
619 | The available option letters are: | |
620 | .RS 2 | |
621 | .TP 3 | |
622 | \fBb\fR | |
623 | rest of RE is a BRE | |
624 | .TP 3 | |
625 | \fBc\fR | |
626 | case-sensitive matching (usual default) | |
627 | .TP 3 | |
628 | \fBe\fR | |
629 | rest of RE is an ERE | |
630 | .TP 3 | |
631 | \fBi\fR | |
632 | case-insensitive matching (see MATCHING, below) | |
633 | .TP 3 | |
634 | \fBm\fR | |
635 | historical synonym for | |
636 | \fBn\fR | |
637 | .TP 3 | |
638 | \fBn\fR | |
639 | newline-sensitive matching (see MATCHING, below) | |
640 | .TP 3 | |
641 | \fBp\fR | |
642 | partial newline-sensitive matching (see MATCHING, below) | |
643 | .TP 3 | |
644 | \fBq\fR | |
645 | rest of RE is a literal (``quoted'') string, all ordinary characters | |
646 | .TP 3 | |
647 | \fBs\fR | |
648 | non-newline-sensitive matching (usual default) | |
649 | .TP 3 | |
650 | \fBt\fR | |
651 | tight syntax (usual default; see below) | |
652 | .TP 3 | |
653 | \fBw\fR | |
654 | inverse partial newline-sensitive (``weird'') matching (see MATCHING, below) | |
655 | .TP 3 | |
656 | \fBx\fR | |
657 | expanded syntax (see below) | |
658 | .RE | |
659 | .PP | |
660 | Embedded options take effect at the | |
661 | \fB)\fR | |
662 | terminating the sequence. | |
663 | They are available only at the start of an ARE, | |
664 | and may not be used later within it. | |
665 | .PP | |
666 | In addition to the usual (\fItight\fR) RE syntax, in which all characters are | |
667 | significant, there is an \fIexpanded\fR syntax, | |
668 | available in all flavors of RE | |
669 | with the \fB\-expanded\fR switch, or in AREs with the embedded \fBx\fR option. | |
670 | In the expanded syntax, | |
671 | white-space characters are ignored | |
672 | and all characters between a | |
673 | \fB#\fR | |
674 | and the following newline (or the end of the RE) are ignored, | |
675 | permitting paragraphing and commenting a complex RE. | |
676 | There are three exceptions to that basic rule: | |
677 | .RS 2 | |
678 | .PP | |
679 | a white-space character or `\fB#\fR' preceded by `\fB\e\fR' is retained | |
680 | .PP | |
681 | white space or `\fB#\fR' within a bracket expression is retained | |
682 | .PP | |
683 | white space and comments are illegal within multi-character symbols | |
684 | like the ARE `\fB(?:\fR' or the BRE `\fB\e(\fR' | |
685 | .RE | |
686 | .PP | |
687 | Expanded-syntax white-space characters are blank, tab, newline, and | |
688 | .VS 8.2 | |
689 | any character that belongs to the \fIspace\fR character class. | |
690 | .VE 8.2 | |
691 | .PP | |
692 | Finally, in an ARE, | |
693 | outside bracket expressions, the sequence `\fB(?#\fIttt\fB)\fR' | |
694 | (where | |
695 | \fIttt\fR | |
696 | is any text not containing a `\fB)\fR') | |
697 | is a comment, | |
698 | completely ignored. | |
699 | Again, this is not allowed between the characters of | |
700 | multi-character symbols like `\fB(?:\fR'. | |
701 | Such comments are more a historical artifact than a useful facility, | |
702 | and their use is deprecated; | |
703 | use the expanded syntax instead. | |
704 | .PP | |
705 | \fINone\fR of these metasyntax extensions is available if the application | |
706 | (or an initial | |
707 | \fB***=\fR | |
708 | director) | |
709 | has specified that the user's input be treated as a literal string | |
710 | rather than as an RE. | |
711 | .SH MATCHING | |
712 | In the event that an RE could match more than one substring of a given | |
713 | string, | |
714 | the RE matches the one starting earliest in the string. | |
715 | If the RE could match more than one substring starting at that point, | |
716 | its choice is determined by its \fIpreference\fR: | |
717 | either the longest substring, or the shortest. | |
718 | .PP | |
719 | Most atoms, and all constraints, have no preference. | |
720 | A parenthesized RE has the same preference (possibly none) as the RE. | |
721 | A quantified atom with quantifier | |
722 | \fB{\fIm\fB}\fR | |
723 | or | |
724 | \fB{\fIm\fB}?\fR | |
725 | has the same preference (possibly none) as the atom itself. | |
726 | A quantified atom with other normal quantifiers (including | |
727 | \fB{\fIm\fB,\fIn\fB}\fR | |
728 | with | |
729 | \fIm\fR | |
730 | equal to | |
731 | \fIn\fR) | |
732 | prefers longest match. | |
733 | A quantified atom with other non-greedy quantifiers (including | |
734 | \fB{\fIm\fB,\fIn\fB}?\fR | |
735 | with | |
736 | \fIm\fR | |
737 | equal to | |
738 | \fIn\fR) | |
739 | prefers shortest match. | |
740 | A branch has the same preference as the first quantified atom in it | |
741 | which has a preference. | |
742 | An RE consisting of two or more branches connected by the | |
743 | \fB|\fR | |
744 | operator prefers longest match. | |
745 | .PP | |
746 | Subject to the constraints imposed by the rules for matching the whole RE, | |
747 | subexpressions also match the longest or shortest possible substrings, | |
748 | based on their preferences, | |
749 | with subexpressions starting earlier in the RE taking priority over | |
750 | ones starting later. | |
751 | Note that outer subexpressions thus take priority over | |
752 | their component subexpressions. | |
753 | .PP | |
754 | Note that the quantifiers | |
755 | \fB{1,1}\fR | |
756 | and | |
757 | \fB{1,1}?\fR | |
758 | can be used to force longest and shortest preference, respectively, | |
759 | on a subexpression or a whole RE. | |
760 | .PP | |
761 | Match lengths are measured in characters, not collating elements. | |
762 | An empty string is considered longer than no match at all. | |
763 | For example, | |
764 | \fBbb*\fR | |
765 | matches the three middle characters of `\fBabbbc\fR', | |
766 | \fB(week|wee)(night|knights)\fR | |
767 | matches all ten characters of `\fBweeknights\fR', | |
768 | when | |
769 | \fB(.*).*\fR | |
770 | is matched against | |
771 | \fBabc\fR | |
772 | the parenthesized subexpression | |
773 | matches all three characters, and | |
774 | when | |
775 | \fB(a*)*\fR | |
776 | is matched against | |
777 | \fBbc\fR | |
778 | both the whole RE and the parenthesized | |
779 | subexpression match an empty string. | |
780 | .PP | |
781 | If case-insensitive matching is specified, | |
782 | the effect is much as if all case distinctions had vanished from the | |
783 | alphabet. | |
784 | When an alphabetic character that exists in multiple cases appears as an | |
785 | ordinary character outside a bracket expression, it is effectively | |
786 | transformed into a bracket expression containing both cases, | |
787 | so that | |
788 | \fBx\fR | |
789 | becomes `\fB[xX]\fR'. | |
790 | When it appears inside a bracket expression, all case counterparts | |
791 | of it are added to the bracket expression, so that | |
792 | \fB[x]\fR | |
793 | becomes | |
794 | \fB[xX]\fR | |
795 | and | |
796 | \fB[^x]\fR | |
797 | becomes `\fB[^xX]\fR'. | |
798 | .PP | |
799 | If newline-sensitive matching is specified, \fB.\fR | |
800 | and bracket expressions using | |
801 | \fB^\fR | |
802 | will never match the newline character | |
803 | (so that matches will never cross newlines unless the RE | |
804 | explicitly arranges it) | |
805 | and | |
806 | \fB^\fR | |
807 | and | |
808 | \fB$\fR | |
809 | will match the empty string after and before a newline | |
810 | respectively, in addition to matching at beginning and end of string | |
811 | respectively. | |
812 | ARE | |
813 | \fB\eA\fR | |
814 | and | |
815 | \fB\eZ\fR | |
816 | continue to match beginning or end of string \fIonly\fR. | |
817 | .PP | |
818 | If partial newline-sensitive matching is specified, | |
819 | this affects \fB.\fR | |
820 | and bracket expressions | |
821 | as with newline-sensitive matching, but not | |
822 | \fB^\fR | |
823 | and `\fB$\fR'. | |
824 | .PP | |
825 | If inverse partial newline-sensitive matching is specified, | |
826 | this affects | |
827 | \fB^\fR | |
828 | and | |
829 | \fB$\fR | |
830 | as with | |
831 | newline-sensitive matching, | |
832 | but not \fB.\fR | |
833 | and bracket expressions. | |
834 | This isn't very useful but is provided for symmetry. | |
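The three newline modes above can be compared side by side. The following Tcl sketch assumes the \fB\-line\fR, \fB\-linestop\fR, and \fB\-lineanchor\fR switches of the \fBregexp\fR command (described in regexp(n)) select full, partial, and inverse partial newline-sensitive matching respectively; the sample string and variable names are made up for illustration.

```tcl
# Sketch: one two-line subject string, matched under each newline mode.
set s "ab\ncd"

# Default: . matches any character, including newline.
set default    [lindex [regexp -inline {.+} $s] 0]

# Partial newline-sensitive: . and [^...] stop at newline; ^ and $ unchanged.
set linestop   [lindex [regexp -inline -linestop {.+} $s] 0]

# Default anchors: ^ and $ match only at the ends of the whole string.
set noanchor   [regexp {^cd$} $s]

# Inverse partial: ^ and $ also match after and before a newline.
set lineanchor [regexp -lineanchor {^cd$} $s]

# Full newline-sensitive matching combines both effects.
set line       [lindex [regexp -inline -line {^.+$} $s] 0]

# \A still means beginning of string only, even with -line.
set absolute   [regexp -line {\Acd} $s]

puts [list $default $linestop $noanchor $lineanchor $line $absolute]
```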
835 | .SH "LIMITS AND COMPATIBILITY" | |
836 | No particular limit is imposed on the length of REs. | |
837 | Programs intended to be highly portable should not employ REs longer | |
838 | than 256 bytes, | |
839 | as a POSIX-compliant implementation can refuse to accept such REs. | |
840 | .PP | |
841 | The only feature of AREs that is actually incompatible with | |
842 | POSIX EREs is that | |
843 | \fB\e\fR | |
844 | does not lose its special | |
845 | significance inside bracket expressions. | |
846 | All other ARE features use syntax which is illegal or has | |
847 | undefined or unspecified effects in POSIX EREs; | |
848 | the | |
849 | \fB***\fR | |
850 | syntax of directors likewise is outside the POSIX | |
851 | syntax for both BREs and EREs. | |
852 | .PP | |
853 | Many of the ARE extensions are borrowed from Perl, but some have | |
854 | been changed to clean them up, and a few Perl extensions are not present. | |
855 | Incompatibilities of note include `\fB\eb\fR', `\fB\eB\fR', | |
856 | the lack of special treatment for a trailing newline, | |
857 | the addition of complemented bracket expressions to the things | |
858 | affected by newline-sensitive matching, | |
859 | the restrictions on parentheses and back references in lookahead constraints, | |
860 | and the longest/shortest-match (rather than first-match) matching semantics. | |
861 | .PP | |
862 | The matching rules for REs containing both normal and non-greedy quantifiers | |
863 | have changed since early beta-test versions of this package. | |
864 | (The new rules are much simpler and cleaner, | |
865 | but don't work as hard at guessing the user's real intentions.) | |
866 | .PP | |
867 | Henry Spencer's original 1986 \fIregexp\fR package, | |
868 | still in widespread use (e.g., in pre-8.1 releases of Tcl), | |
869 | implemented an early version of today's EREs. | |
870 | There are four incompatibilities between \fIregexp\fR's near-EREs | |
871 | (`RREs' for short) and AREs. | |
872 | In roughly increasing order of significance: | |
873 | .PP | |
874 | .RS | |
875 | In AREs, | |
876 | \fB\e\fR | |
877 | followed by an alphanumeric character is either an | |
878 | escape or an error, | |
879 | while in RREs, it was just another way of writing the | |
880 | alphanumeric. | |
881 | This should not be a problem because there was no reason to write | |
882 | such a sequence in RREs. | |
883 | .PP | |
884 | \fB{\fR | |
885 | followed by a digit in an ARE is the beginning of a bound, | |
886 | while in RREs, | |
887 | \fB{\fR | |
888 | was always an ordinary character. | |
889 | Such sequences should be rare, | |
890 | and will often result in an error because following characters | |
891 | will not look like a valid bound. | |
892 | .PP | |
893 | In AREs, | |
894 | \fB\e\fR | |
895 | remains a special character within `\fB[\|]\fR', | |
896 | so a literal | |
897 | \fB\e\fR | |
898 | within | |
899 | \fB[\|]\fR | |
900 | must be written `\fB\e\e\fR'. | |
901 | \fB\e\e\fR | |
902 | also gives a literal | |
903 | \fB\e\fR | |
904 | within | |
905 | \fB[\|]\fR | |
906 | in RREs, | |
907 | but only truly paranoid programmers routinely doubled the backslash. | |
908 | .PP | |
909 | AREs report the longest/shortest match for the RE, | |
910 | rather than the first found in a specified search order. | |
911 | This may affect some RREs which were written in the expectation that | |
912 | the first match would be reported. | |
913 | (The careful crafting of RREs to optimize the search order for fast | |
914 | matching is obsolete, because AREs examine all possible matches | |
915 | in parallel and their performance is largely insensitive to their | |
916 | complexity, but cases where the search order was exploited to deliberately | |
917 | find a match which was \fInot\fR the longest/shortest will need rewriting.) | |
918 | .RE | |
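The four incompatibilities listed above can be illustrated from the ARE side with a short Tcl sketch. It only shows what a current \fBregexp\fR command does with each construct; the RRE behavior mentioned in the comments is taken from the text above and is not exercised by the code. Variable names are illustrative.

```tcl
# 1. Backslash plus alphanumeric is an escape: \d matches a digit, not "d".
set digit   [regexp {\d} 5]
set letter  [regexp {\d} d]

# 2. { followed by a digit starts a bound: a{2} means exactly two a's.
set bound   [lindex [regexp -inline {a{2}} aaa] 0]

# 3. Backslash stays special inside [], so a literal backslash is doubled.
set slash   [regexp {[\\]} "\\"]

# 4. The longest match is reported, not the first alternative that succeeds.
#    (Per the text above, a first-match engine would stop at "a".)
set longest [lindex [regexp -inline {a|ab} ab] 0]

puts [list $digit $letter $bound $slash $longest]
```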
919 | ||
920 | .SH "BASIC REGULAR EXPRESSIONS" | |
921 | BREs differ from EREs in several respects. `\fB|\fR', `\fB+\fR', | |
922 | and | |
923 | \fB?\fR | |
924 | are ordinary characters and there is no equivalent | |
925 | for their functionality. | |
926 | The delimiters for bounds are | |
927 | \fB\e{\fR | |
928 | and `\fB\e}\fR', | |
929 | with | |
930 | \fB{\fR | |
931 | and | |
932 | \fB}\fR | |
933 | by themselves ordinary characters. | |
934 | The parentheses for nested subexpressions are | |
935 | \fB\e(\fR | |
936 | and `\fB\e)\fR', | |
937 | with | |
938 | \fB(\fR | |
939 | and | |
940 | \fB)\fR | |
941 | by themselves ordinary characters. | |
942 | \fB^\fR | |
943 | is an ordinary character except at the beginning of the | |
944 | RE or the beginning of a parenthesized subexpression, | |
945 | \fB$\fR | |
946 | is an ordinary character except at the end of the | |
947 | RE or the end of a parenthesized subexpression, | |
948 | and | |
949 | \fB*\fR | |
950 | is an ordinary character if it appears at the beginning of the | |
951 | RE or the beginning of a parenthesized subexpression | |
952 | (after a possible leading `\fB^\fR'). | |
953 | Finally, | |
954 | single-digit back references are available, | |
955 | and | |
956 | \fB\e<\fR | |
957 | and | |
958 | \fB\e>\fR | |
959 | are synonyms for | |
960 | \fB[[:<:]]\fR | |
961 | and | |
962 | \fB[[:>:]]\fR | |
963 | respectively; | |
964 | no other escapes are available. | |
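The BRE rules above can be tried from Tcl with a sketch like the following. It assumes the \fB(?b)\fR embedded option, described elsewhere in this page, is used to switch the rest of the RE to BRE syntax; the subject strings and variable names are illustrative.

```tcl
# + is an ordinary character in a BRE, so a+ matches the two characters "a+".
set plus_literal [regexp {(?b)a+} a+]
set plus_repeat  [regexp {(?b)a+} aa]

# Bounds are written \{ and \}; bare braces are ordinary characters.
set bound        [regexp {(?b)a\{2\}} aa]
set brace_plain  [regexp {(?b)a{2}} "a{2}"]

# Subexpressions are written \( and \); bare parentheses are ordinary.
set paren_plain  [regexp {(?b)(ab)} "(ab)"]
set paren_group  [regexp {(?b)(ab)} ab]

# Single-digit back references are available.
set backref      [regexp -inline {(?b)\(ab\)\1} abab]

puts [list $plus_literal $plus_repeat $bound $brace_plain \
        $paren_plain $paren_group $backref]
```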
965 | ||
966 | .SH "SEE ALSO" | |
967 | RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n) | |
968 | ||
969 | .SH KEYWORDS | |
970 | match, regular expression, string |