'\"
'\" Copyright (c) 1998 Sun Microsystems, Inc.
'\" Copyright (c) 1999 Scriptics Corporation
'\"
'\" This software is copyrighted by the Regents of the University of
'\" California, Sun Microsystems, Inc., Scriptics Corporation, ActiveState
'\" Corporation and other parties. The following terms apply to all files
'\" associated with the software unless explicitly disclaimed in
'\" individual files.
'\"
'\" The authors hereby grant permission to use, copy, modify, distribute,
'\" and license this software and its documentation for any purpose, provided
'\" that existing copyright notices are retained in all copies and that this
'\" notice is included verbatim in any distributions. No written agreement,
'\" license, or royalty fee is required for any of the authorized uses.
'\" Modifications to this software may be copyrighted by their authors
'\" and need not follow the licensing terms described here, provided that
'\" the new terms are clearly indicated on the first page of each file where
'\" they apply.
'\"
'\" IN NO EVENT SHALL THE AUTHORS OR DISTRIBUTORS BE LIABLE TO ANY PARTY
'\" FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES
'\" ARISING OUT OF THE USE OF THIS SOFTWARE, ITS DOCUMENTATION, OR ANY
'\" DERIVATIVES THEREOF, EVEN IF THE AUTHORS HAVE BEEN ADVISED OF THE
'\" POSSIBILITY OF SUCH DAMAGE.
'\"
'\" THE AUTHORS AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
'\" INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY,
'\" FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. THIS SOFTWARE
'\" IS PROVIDED ON AN "AS IS" BASIS, AND THE AUTHORS AND DISTRIBUTORS HAVE
'\" NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR
'\" MODIFICATIONS.
'\"
'\" GOVERNMENT USE: If you are acquiring this software on behalf of the
'\" U.S. government, the Government shall have only "Restricted Rights"
'\" in the software and related documentation as defined in the Federal
'\" Acquisition Regulations (FARs) in Clause 52.227.19 (c) (2). If you
'\" are acquiring the software on behalf of the Department of Defense, the
'\" software shall be classified as "Commercial Computer Software" and the
'\" Government shall have only "Restricted Rights" as defined in Clause
'\" 252.227-7013 (c) (1) of DFARs. Notwithstanding the foregoing, the
'\" authors grant the U.S. Government and others acting in its behalf
'\" permission to use and distribute the software in accordance with the
'\" terms specified in this license.
'\"
'\" RCS: @(#) Id: re_syntax.n,v 1.3 1999/07/14 19:09:36 jpeek Exp
'\"
.TH re_syntax n "8.1" Tcl "Tcl Built-In Commands"
.SH NAME
re_syntax \- Syntax of Tcl regular expressions.
.SH DESCRIPTION
.PP
A \fIregular expression\fR describes strings of characters.
It's a pattern that matches certain strings and doesn't match others.
.SH "DIFFERENT FLAVORS OF REs"
Regular expressions (``RE''s), as defined by POSIX, come in two
flavors: \fIextended\fR REs (``EREs'') and \fIbasic\fR REs (``BREs'').
EREs are roughly those of the traditional \fIegrep\fR, while BREs are
roughly those of the traditional \fIed\fR. This implementation adds
a third flavor, \fIadvanced\fR REs (``AREs''), basically EREs with
some significant extensions.
.PP
This manual page primarily describes AREs. BREs mostly exist for
backward compatibility in some old programs; they will be discussed at
the end. POSIX EREs are almost an exact subset of AREs. Features of
AREs that are not present in EREs will be indicated.
.SH "REGULAR EXPRESSION SYNTAX"
.PP
Tcl regular expressions are implemented using the package written by
Henry Spencer, based on the 1003.2 spec and some (not quite all) of
the Perl5 extensions (thanks, Henry!). Much of the description of
regular expressions below is copied verbatim from his manual entry.
.PP
An ARE is one or more \fIbranches\fR,
separated by `\fB|\fR',
matching anything that matches any of the branches.
.PP
A branch is zero or more \fIconstraints\fR or \fIquantified atoms\fR,
concatenated.
It matches a match for the first, followed by a match for the second, etc.;
an empty branch matches the empty string.
.PP
A quantified atom is an \fIatom\fR possibly followed
by a single \fIquantifier\fR.
Without a quantifier, it matches a match for the atom.
The quantifiers,
and what a so-quantified atom matches, are:
.RS 2
.TP 6
\fB*\fR
a sequence of 0 or more matches of the atom
.TP
\fB+\fR
a sequence of 1 or more matches of the atom
.TP
\fB?\fR
a sequence of 0 or 1 matches of the atom
.TP
\fB{\fIm\fB}\fR
a sequence of exactly \fIm\fR matches of the atom
.TP
\fB{\fIm\fB,}\fR
a sequence of \fIm\fR or more matches of the atom
.TP
\fB{\fIm\fB,\fIn\fB}\fR
a sequence of \fIm\fR through \fIn\fR (inclusive) matches of the atom;
\fIm\fR may not exceed \fIn\fR
.TP
\fB*?  +?  ??  {\fIm\fB}?  {\fIm\fB,}?  {\fIm\fB,\fIn\fB}?\fR
\fInon-greedy\fR quantifiers,
which match the same possibilities,
but prefer the smallest number rather than the largest number
of matches (see MATCHING)
.RE
.PP
The forms using \fB{\fR and \fB}\fR
are known as \fIbound\fRs.
The numbers \fIm\fR and \fIn\fR are unsigned decimal integers
with permissible values from 0 to 255 inclusive.
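.PP
As a minimal sketch of the quantifiers and bounds above, assuming the
\fBregexp\fR command described in regexp(n), the following compares normal
and non-greedy forms on one input.
The sample string and the variable names are illustrative only.
.nf
```tcl
# Illustrative input; the variable names are arbitrary.
set s "caaat"
regexp {a+} $s greedy      ;# normal quantifier: as many a's as possible
regexp {a+?} $s lazy       ;# non-greedy form: as few a's as possible
regexp {a{2}} $s exact     ;# bound: exactly two a's
regexp {a{2,}} $s atLeast  ;# bound: two or more a's
puts "$greedy $lazy $exact $atLeast"   ;# prints: aaa a aa aaa
```
.fi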
.PP
An atom is one of:
.RS 2
.TP 6
\fB(\fIre\fB)\fR
(where \fIre\fR is any regular expression)
matches a match for
\fIre\fR, with the match noted for possible reporting
.TP
\fB(?:\fIre\fB)\fR
as previous,
but does no reporting
(a ``non-capturing'' set of parentheses)
.TP
\fB()\fR
matches an empty string,
noted for possible reporting
.TP
\fB(?:)\fR
matches an empty string,
without reporting
.TP
\fB[\fIchars\fB]\fR
a \fIbracket expression\fR,
matching any one of the \fIchars\fR (see BRACKET EXPRESSIONS for more detail)
.TP
\fB.\fR
matches any single character
.TP
\fB\e\fIk\fR
(where \fIk\fR is a non-alphanumeric character)
matches that character taken as an ordinary character,
e.g. \e\e matches a backslash character
.TP
\fB\e\fIc\fR
where \fIc\fR is alphanumeric
(possibly followed by other characters),
an \fIescape\fR (AREs only),
see ESCAPES below
.TP
\fB{\fR
when followed by a character other than a digit,
matches the left-brace character `\fB{\fR';
when followed by a digit, it is the beginning of a
\fIbound\fR (see above)
.TP
\fIx\fR
where \fIx\fR is
a single character with no other significance, matches that character.
.RE
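.PP
The difference between capturing and non-capturing parentheses can be
sketched as follows, again assuming the \fBregexp\fR command of regexp(n);
the input and variable names are made up for illustration.
.nf
```tcl
set s "ababc"
# (?:ab)+ groups without reporting, so (c) is the first reported subexpression.
regexp {(?:ab)+(c)} $s whole sub1
puts "whole=$whole sub1=$sub1"             ;# prints: whole=ababc sub1=c
# With capturing parentheses, the repeated group takes the first slot.
regexp {(ab)+(c)} $s whole2 g1 g2
puts "whole2=$whole2 g1=$g1 g2=$g2"        ;# prints: whole2=ababc g1=ab g2=c
```
.fi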
.PP
A \fIconstraint\fR matches an empty string when specific conditions
are met.
A constraint may not be followed by a quantifier.
The simple constraints are as follows; some more constraints are
described later, under ESCAPES.
.RS 2
.TP 8
\fB^\fR
matches at the beginning of a line
.TP
\fB$\fR
matches at the end of a line
.TP
\fB(?=\fIre\fB)\fR
\fIpositive lookahead\fR (AREs only), matches at any point
where a substring matching \fIre\fR begins
.TP
\fB(?!\fIre\fB)\fR
\fInegative lookahead\fR (AREs only), matches at any point
where no substring matching \fIre\fR begins
.RE
.PP
The lookahead constraints may not contain back references (see later),
and all parentheses within them are considered non-capturing.
.PP
An RE may not end with `\fB\e\fR'.
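.PP
A short sketch of the simple constraints, assuming the \fBregexp\fR command
of regexp(n), which returns 1 on a match and 0 otherwise.
The strings and variable names are illustrative.
.nf
```tcl
# Positive lookahead: "foo" only where "bar" follows; "bar" is not consumed.
set ok1 [regexp {foo(?=bar)} "foobar" m1]
# Negative lookahead: "foo" only where "bar" does not follow.
set ok2 [regexp {foo(?!bar)} "foobar"]
set ok3 [regexp {foo(?!bar)} "foobaz" m3]
# ^ and $ tie the match to the two ends of the input here.
set ok4 [regexp {^abc$} "abc"]
set ok5 [regexp {^abc$} "xabc"]
puts "$ok1 $m1 $ok2 $ok3 $m3 $ok4 $ok5"    ;# prints: 1 foo 0 1 foo 1 0
```
.fi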
.SH "BRACKET EXPRESSIONS"
A \fIbracket expression\fR is a list of characters enclosed in `\fB[\|]\fR'.
It normally matches any single character from the list (but see below).
If the list begins with `\fB^\fR',
it matches any single character
(but see below) \fInot\fR from the rest of the list.
.PP
If two characters in the list are separated by `\fB\-\fR',
this is shorthand
for the full \fIrange\fR of characters between those two (inclusive) in the
collating sequence,
e.g. \fB[0\-9]\fR
in ASCII matches any decimal digit.
Two ranges may not share an
endpoint, so e.g. \fBa\-c\-e\fR is illegal.
Ranges are very collating-sequence-dependent,
and portable programs should avoid relying on them.
.PP
To include a literal \fB]\fR or \fB\-\fR in the list,
the simplest method is to
enclose it in
\fB[.\fR and \fB.]\fR
to make it a collating element (see below).
Alternatively,
make it the first character
(following a possible `\fB^\fR'),
or (AREs only) precede it with `\fB\e\fR'.
Alternatively, for `\fB\-\fR',
make it the last character,
or the second endpoint of a range.
To use a literal \fB\-\fR
as the first endpoint of a range,
make it a collating element
or (AREs only) precede it with `\fB\e\fR'.
With the exception of these, some combinations using \fB[\fR
(see next paragraphs), and escapes,
all other special characters lose their
special significance within a bracket expression.
.PP
Within a bracket expression, a collating element (a character,
a multi-character sequence that collates as if it were a single character,
or a collating-sequence name for either)
enclosed in
\fB[.\fR and \fB.]\fR
stands for the
sequence of characters of that collating element.
The sequence is a single element of the bracket expression's list.
A bracket expression in a locale that has
multi-character collating elements
can thus match more than one character.
So (insidiously), a bracket expression that starts with \fB^\fR
can match multi-character collating elements even if none of them
appear in the bracket expression!
(\fINote:\fR Tcl currently has no multi-character collating elements.
This information is only for illustration.)
.PP
For example, assume the collating sequence includes a \fBch\fR
multi-character collating element.
Then the RE \fB[[.ch.]]*c\fR (zero or more \fBch\fP's followed by \fBc\fP)
matches the first five characters of `\fBchchcc\fR'.
Also, the RE \fB[^c]b\fR matches all of `\fBchb\fR'
(because \fB[^c]\fR matches the multi-character \fBch\fR).
.PP
Within a bracket expression, a collating element enclosed in
\fB[=\fR and \fB=]\fR
is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself.
(If there are no other equivalent collating elements,
the treatment is as if the enclosing delimiters were `\fB[.\fR'\&
and `\fB.]\fR'.)
For example, if \fBo\fR and \fB\o'o^'\fR
are the members of an equivalence class,
then `\fB[[=o=]]\fR', `\fB[[=\o'o^'=]]\fR',
and `\fB[o\o'o^']\fR'\&
are all synonymous.
An equivalence class may not be an endpoint
of a range.
(\fINote:\fR
Tcl currently implements only the Unicode locale.
It doesn't define any equivalence classes.
The examples above are just illustrations.)
.PP
Within a bracket expression, the name of a \fIcharacter class\fR enclosed
in \fB[:\fR and \fB:]\fR
stands for the list of all characters
(not all collating elements!)
belonging to that class.
Standard character classes are:
.RS
.TP 10
\fBalpha\fR
A letter.
.TP
\fBupper\fR
An upper-case letter.
.TP
\fBlower\fR
A lower-case letter.
.TP
\fBdigit\fR
A decimal digit.
.TP
\fBxdigit\fR
A hexadecimal digit.
.TP
\fBalnum\fR
An alphanumeric (letter or digit).
.TP
\fBprint\fR
An alphanumeric (same as alnum).
.TP
\fBblank\fR
A space or tab character.
.TP
\fBspace\fR
A character producing white space in displayed text.
.TP
\fBpunct\fR
A punctuation character.
.TP
\fBgraph\fR
A character with a visible representation.
.TP
\fBcntrl\fR
A control character.
.RE
.PP
A locale may provide others.
(Note that the current Tcl implementation has only one locale:
the Unicode locale.)
A character class may not be used as an endpoint of a range.
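.PP
As an illustration of character classes inside bracket expressions,
the sketch below assumes the \fBregexp\fR command of regexp(n) with its
\fB\-inline\fR and \fB\-all\fR switches.
The input string and variable names are hypothetical.
.nf
```tcl
set s "id_42 = 0xFF"
# Every run of decimal digits in the input.
set numbers [regexp -all -inline {[[:digit:]]+} $s]
# A letter or underscore followed by alphanumerics or underscores.
set ident [regexp -inline {[[:alpha:]_][[:alnum:]_]*} $s]
# A run of hexadecimal digits that reaches the end of the input.
set hex [regexp -inline {[[:xdigit:]]+$} $s]
puts "$numbers | $ident | $hex"            ;# prints: 42 0 | id_42 | FF
```
.fi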
.PP
There are two special cases of bracket expressions:
the bracket expressions
\fB[[:<:]]\fR and \fB[[:>:]]\fR
are constraints, matching empty strings at
the beginning and end of a word respectively.
'\" note, discussion of escapes below references this definition of word
A word is defined as a sequence of
word characters
that is neither preceded nor followed by
word characters.
A word character is an \fIalnum\fR character
or an underscore (\fB_\fR).
These special bracket expressions are deprecated;
users of AREs should use constraint escapes instead (see below).
.SH ESCAPES
Escapes (AREs only), which begin with a \fB\e\fR
followed by an alphanumeric character,
come in several varieties:
character entry, class shorthands, constraint escapes, and back references.
A \fB\e\fR
followed by an alphanumeric character but not constituting
a valid escape is illegal in AREs.
In EREs, there are no escapes:
outside a bracket expression,
a \fB\e\fR
followed by an alphanumeric character merely stands for that
character as an ordinary character,
and inside a bracket expression, \fB\e\fR
is an ordinary character.
(The latter is the one actual incompatibility between EREs and AREs.)
.PP
Character-entry escapes (AREs only) exist to make it easier to specify
non-printing and otherwise inconvenient characters in REs:
.RS 2
.TP 5
\fB\ea\fR
alert (bell) character, as in C
.TP
\fB\eb\fR
backspace, as in C
.TP
\fB\eB\fR
synonym for \fB\e\fR
to help reduce backslash doubling in some
applications where there are multiple levels of backslash processing
.TP
\fB\ec\fIX\fR
(where X is any character) the character whose
low-order 5 bits are the same as those of \fIX\fR,
and whose other bits are all zero
.TP
\fB\ee\fR
the character whose collating-sequence name
is `\fBESC\fR',
or failing that, the character with octal value 033
.TP
\fB\ef\fR
formfeed, as in C
.TP
\fB\en\fR
newline, as in C
.TP
\fB\er\fR
carriage return, as in C
.TP
\fB\et\fR
horizontal tab, as in C
.TP
\fB\eu\fIwxyz\fR
(where \fIwxyz\fR
is exactly four hexadecimal digits)
the Unicode character \fBU+\fIwxyz\fR
in the local byte ordering
.TP
\fB\eU\fIstuvwxyz\fR
(where \fIstuvwxyz\fR
is exactly eight hexadecimal digits)
reserved for a somewhat-hypothetical Unicode extension to 32 bits
.TP
\fB\ev\fR
vertical tab, as in C
.TP
\fB\ex\fIhhh\fR
(where \fIhhh\fR
is any sequence of hexadecimal digits)
the character whose hexadecimal value is \fB0x\fIhhh\fR
(a single character no matter how many hexadecimal digits are used).
.TP
\fB\e0\fR
the character whose value is \fB0\fR
.TP
\fB\e\fIxy\fR
(where \fIxy\fR
is exactly two octal digits,
and is not a
\fIback reference\fR (see below))
the character whose octal value is \fB0\fIxy\fR
.TP
\fB\e\fIxyz\fR
(where \fIxyz\fR
is exactly three octal digits,
and is not a
back reference (see below))
the character whose octal value is \fB0\fIxyz\fR
.RE
.PP
Hexadecimal digits are `\fB0\fR'-`\fB9\fR', `\fBa\fR'-`\fBf\fR',
and `\fBA\fR'-`\fBF\fR'.
Octal digits are `\fB0\fR'-`\fB7\fR'.
.PP
The character-entry escapes are always taken as ordinary characters.
For example, \fB\e135\fR is \fB]\fR in ASCII,
but \fB\e135\fR
does not terminate a bracket expression.
Beware, however, that some applications (e.g., C compilers) interpret
such sequences themselves before the regular-expression package
gets to see them, which may require doubling (quadrupling, etc.) the `\fB\e\fR'.
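.PP
Class-shorthand escapes (AREs only) provide shorthands for certain commonly-used
character classes:
.RS 2
.TP 10
\fB\ed\fR
\fB[[:digit:]]\fR
.TP
\fB\es\fR
\fB[[:space:]]\fR
.TP
\fB\ew\fR
\fB[[:alnum:]_]\fR
(note underscore)
.TP
\fB\eD\fR
\fB[^[:digit:]]\fR
.TP
\fB\eS\fR
\fB[^[:space:]]\fR
.TP
\fB\eW\fR
\fB[^[:alnum:]_]\fR
(note underscore)
.RE
.PP
Within bracket expressions, `\fB\ed\fR', `\fB\es\fR',
and `\fB\ew\fR'\&
lose their outer brackets,
and `\fB\eD\fR', `\fB\eS\fR',
and `\fB\eW\fR'\&
are illegal.
(So, for example, \fB[a-c\ed]\fR is equivalent to \fB[a-c[:digit:]]\fR.
Also, \fB[a-c\eD]\fR, which is equivalent to \fB[a-c^[:digit:]]\fR, is illegal.)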
.PP
A constraint escape (AREs only) is a constraint,
matching the empty string if specific conditions are met,
written as an escape:
.RS 2
.TP 6
\fB\eA\fR
matches only at the beginning of the string
(see MATCHING, below, for how this differs from `\fB^\fR')
.TP
\fB\em\fR
matches only at the beginning of a word
.TP
\fB\eM\fR
matches only at the end of a word
.TP
\fB\ey\fR
matches only at the beginning or end of a word
.TP
\fB\eY\fR
matches only at a point that is not the beginning or end of a word
.TP
\fB\eZ\fR
matches only at the end of the string
(see MATCHING, below, for how this differs from `\fB$\fR')
.TP
\fB\e\fIm\fR
(where \fIm\fR
is a nonzero digit) a \fIback reference\fR, see below
.TP
\fB\e\fImnn\fR
(where \fIm\fR
is a nonzero digit, and \fInn\fR
is some more digits,
and the decimal value \fImnn\fR
is not greater than the number of closing capturing parentheses seen so far)
a \fIback reference\fR, see below
.RE
.PP
A word is defined as in the specification of
\fB[[:<:]]\fR and \fB[[:>:]]\fR above.
Constraint escapes are illegal within bracket expressions.
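.PP
The word and string constraint escapes can be tried out with a sketch
like the following, which assumes the \fBregexp\fR command of regexp(n);
with \fB\-all\fR and no match variables it returns the number of matches.
The sample sentence and variable names are illustrative.
.nf
```tcl
set s "cat concat cats cat"
# \m and \M pin the match to the start and end of a word.
set words [regexp -all -inline {\mcat\M} $s]
# \y accepts either edge of a word; count the places where "cat" follows one.
set count [regexp -all {\ycat} $s]
# \A and \Z match only at the very start and very end of the string.
set atStart [regexp {\Acat} $s]
set atEnd [regexp {cats\Z} $s]
puts "$words | $count | $atStart | $atEnd"   ;# prints: cat cat | 3 | 1 | 0
```
.fi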
.PP
A back reference (AREs only) matches the same string matched by the parenthesized
subexpression specified by the number,
so that (e.g.) \fB([bc])\e1\fR
matches \fBbb\fR or \fBcc\fR
but not `\fBbc\fR'.
The subexpression must entirely precede the back reference in the RE.
Subexpressions are numbered in the order of their leading parentheses.
Non-capturing parentheses do not define subexpressions.
.PP
There is an inherent historical ambiguity between octal character-entry
escapes and back references, which is resolved by heuristics,
as hinted at above.
A leading zero always indicates an octal escape.
A single non-zero digit, not followed by another digit,
is always taken as a back reference.
A multi-digit sequence not starting with a zero is taken as a back
reference if it comes after a suitable subexpression
(i.e. the number is in the legal range for a back reference),
and otherwise is taken as octal.
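.PP
The back reference example \fB([bc])\e1\fR can be checked with a small
sketch, assuming the \fBregexp\fR command of regexp(n).
The doubled-word pattern and all variable names are illustrative additions.
.nf
```tcl
# ([bc])\1 needs the same character twice, not just two characters from the set.
set r1 [regexp {([bc])\1} "bb"]
set r2 [regexp {([bc])\1} "cc"]
set r3 [regexp {([bc])\1} "bc"]
puts "$r1 $r2 $r3"                         ;# prints: 1 1 0
# A word followed by a space and the same word again.
regexp {\m([[:alpha:]]+) \1\M} "it is is fine" doubled word
puts "$doubled / $word"                    ;# prints: is is / is
```
.fi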
.SH "METASYNTAX"
In addition to the main syntax described above, there are some special
forms and miscellaneous syntactic facilities available.
.PP
Normally the flavor of RE being used is specified by
application-dependent means.
However, this can be overridden by a \fIdirector\fR.
If an RE of any flavor begins with `\fB***:\fR',
the rest of the RE is an ARE.
If an RE of any flavor begins with `\fB***=\fR',
the rest of the RE is taken to be a literal string,
with all characters considered ordinary characters.
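.PP
A brief sketch of the two directors, assuming the \fBregexp\fR command of
regexp(n); the pattern \fBa.c\fR and the inputs are chosen only to show
whether `\fB.\fR' is treated as special.
.nf
```tcl
# ***= makes the rest of the RE a literal string, so "." matches only ".".
set r1 [regexp {***=a.c} "a.c"]
set r2 [regexp {***=a.c} "abc"]
# Without a director, or with ***:, "." is the any-character atom.
set r3 [regexp {a.c} "abc"]
set r4 [regexp {***:a.c} "abc"]
puts "$r1 $r2 $r3 $r4"                     ;# prints: 1 0 1 1
```
.fi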
.PP
An ARE may begin with \fIembedded options\fR:
a sequence \fB(?\fIxyz\fB)\fR
(where \fIxyz\fR
is one or more alphabetic characters)
specifies options affecting the rest of the RE.
These supplement, and can override,
any options specified by the application.
The available option letters are:
.RS 2
.TP 3
\fBb\fR
rest of RE is a BRE
.TP
\fBc\fR
case-sensitive matching (usual default)
.TP
\fBe\fR
rest of RE is an ERE
.TP
\fBi\fR
case-insensitive matching (see MATCHING, below)
.TP
\fBm\fR
historical synonym for \fBn\fR
.TP
\fBn\fR
newline-sensitive matching (see MATCHING, below)
.TP
\fBp\fR
partial newline-sensitive matching (see MATCHING, below)
.TP
\fBq\fR
rest of RE is a literal (``quoted'') string, all ordinary characters
.TP
\fBs\fR
non-newline-sensitive matching (usual default)
.TP
\fBt\fR
tight syntax (usual default; see below)
.TP
\fBw\fR
inverse partial newline-sensitive (``weird'') matching (see MATCHING, below)
.TP
\fBx\fR
expanded syntax (see below)
.RE
.PP
Embedded options take effect at the \fB)\fR
terminating the sequence.
They are available only at the start of an ARE,
and may not be used later within it.
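.PP
Two of the embedded options can be sketched as follows, assuming the
\fBregexp\fR command of regexp(n); the inputs and variable names are
illustrative.
.nf
```tcl
# (?i) switches on case-insensitive matching for the rest of the RE.
set r1 [regexp {(?i)hello} "HeLLo World"]
set r2 [regexp {hello} "HeLLo World"]
# (?q) treats the rest of the RE as a literal string.
set r3 [regexp {(?q)a.c} "abc"]
set r4 [regexp {(?q)a.c} "xa.cx"]
puts "$r1 $r2 $r3 $r4"                     ;# prints: 1 0 0 1
```
.fi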
.PP
In addition to the usual (\fItight\fR) RE syntax, in which all characters are
significant, there is an \fIexpanded\fR syntax,
available in all flavors of RE
with the \fB-expanded\fR switch, or in AREs with the embedded x option.
In the expanded syntax,
white-space characters are ignored
and all characters between a \fB#\fR
and the following newline (or the end of the RE) are ignored,
permitting paragraphing and commenting a complex RE.
There are three exceptions to that basic rule:
.RS 2
.PP
a white-space character or `\fB#\fR' preceded by `\fB\e\fR' is retained
.PP
white space or `\fB#\fR' within a bracket expression is retained
.PP
white space and comments are illegal within multi-character symbols
like the ARE `\fB(?:\fR' or the BRE `\fB\e(\fR'
.RE
.PP
Expanded-syntax white-space characters are blank, tab, newline, and
any character that belongs to the \fIspace\fR character class.
.PP
Finally, in an ARE,
outside bracket expressions, the sequence `\fB(?#\fIttt\fB)\fR'
(where \fIttt\fR
is any text not containing a `\fB)\fR')
is a comment,
completely ignored.
Again, this is not allowed between the characters of
multi-character symbols like `\fB(?:\fR'.
Such comments are more a historical artifact than a useful facility,
and their use is deprecated;
use the expanded syntax instead.
.PP
\fINone\fR of these metasyntax extensions is available if the application
(or an initial \fB***=\fR director)
has specified that the user's input be treated as a literal string
rather than as an RE.
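.PP
As a sketch of the expanded syntax, the following spreads an RE over
several commented lines and applies it with the \fB\-expanded\fR switch of
the \fBregexp\fR command (see regexp(n)).
The date-like input, the pattern, and the variable names are hypothetical.
.nf
```tcl
# White space and "# ..." comments inside the RE are ignored in expanded syntax.
set re {
    ([[:digit:]]{4})   # year
    -
    ([[:digit:]]{2})   # month
}
regexp -expanded $re "on 2024-05-17" whole year month
puts "$whole $year $month"                 ;# prints: 2024-05 2024 05
# The embedded x option has the same effect for an ARE.
set r [regexp {(?x) a b c } "abc"]
puts $r                                    ;# prints: 1
```
.fi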
.SH MATCHING
In the event that an RE could match more than one substring of a given
string,
the RE matches the one starting earliest in the string.
If the RE could match more than one substring starting at that point,
its choice is determined by its \fIpreference\fR:
either the longest substring, or the shortest.
.PP
Most atoms, and all constraints, have no preference.
A parenthesized RE has the same preference (possibly none) as the RE.
A quantified atom with quantifier
\fB{\fIm\fB}\fR or \fB{\fIm\fB}?\fR
has the same preference (possibly none) as the atom itself.
A quantified atom with other normal quantifiers (including
\fB{\fIm\fB,\fIn\fB}\fR
with \fIm\fR equal to \fIn\fR)
prefers longest match.
A quantified atom with other non-greedy quantifiers (including
\fB{\fIm\fB,\fIn\fB}?\fR
with \fIm\fR equal to \fIn\fR)
prefers shortest match.
A branch has the same preference as the first quantified atom in it
which has a preference.
An RE consisting of two or more branches connected by the \fB|\fR
operator prefers longest match.
.PP
Subject to the constraints imposed by the rules for matching the whole RE,
subexpressions also match the longest or shortest possible substrings,
based on their preferences,
with subexpressions starting earlier in the RE taking priority over
ones starting later.
Note that outer subexpressions thus take priority over
their component subexpressions.
.PP
Note that the quantifiers \fB{1,1}\fR and \fB{1,1}?\fR
can be used to force longest and shortest preference, respectively,
on a subexpression or a whole RE.
.PP
Match lengths are measured in characters, not collating elements.
An empty string is considered longer than no match at all.
For example, \fBbb*\fR
matches the three middle characters of `\fBabbbc\fR',
\fB(week|wee)(night|knights)\fR
matches all ten characters of `\fBweeknights\fR',
when \fB(.*).*\fR is matched against \fBabc\fR
the parenthesized subexpression
matches all three characters, and
when \fB(a*)*\fR is matched against \fBbc\fR
both the whole RE and the parenthesized
subexpression match an empty string.
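.PP
The four examples in the preceding paragraph can be reproduced with a
sketch along these lines, assuming the \fBregexp\fR command of regexp(n)
and its \fB\-inline\fR switch, which returns the whole match followed by
the subexpression matches.
The variable names are illustrative.
.nf
```tcl
set m1 [regexp -inline {bb*} "abbbc"]
set m2 [regexp -inline {(week|wee)(night|knights)} "weeknights"]
set m3 [regexp -inline {(.*).*} "abc"]
set m4 [regexp -inline {(a*)*} "bc"]
puts [lindex $m1 0]                        ;# prints: bbb
puts $m2                                   ;# prints: weeknights wee knights
puts [lindex $m3 1]                        ;# prints: abc
# Both the whole match and the subexpression are empty strings here.
puts [expr {[lindex $m4 0] eq "" && [lindex $m4 1] eq ""}]   ;# prints: 1
```
.fi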
.PP
If case-independent matching is specified,
the effect is much as if all case distinctions had vanished from the
alphabet.
When an alphabetic that exists in multiple cases appears as an
ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
e.g. \fBx\fR
becomes `\fB[xX]\fR'.
When it appears inside a bracket expression, all case counterparts
of it are added to the bracket expression, so that
\fB[x]\fR becomes \fB[xX]\fR and \fB[^x]\fR
becomes `\fB[^xX]\fR'.
.PP
If newline-sensitive matching is specified, \fB.\fR
and bracket expressions using \fB^\fR
will never match the newline character
(so that matches will never cross newlines unless the RE
explicitly arranges it)
and \fB^\fR and \fB$\fR
will match the empty string after and before a newline
respectively, in addition to matching at beginning and end of string
respectively.
ARE \fB\eA\fR and \fB\eZ\fR
continue to match beginning or end of string \fIonly\fR.
.PP
If partial newline-sensitive matching is specified,
this affects \fB.\fR
and bracket expressions
as with newline-sensitive matching, but not
\fB^\fR and `\fB$\fR'.
.PP
If inverse partial newline-sensitive matching is specified,
this affects \fB^\fR and \fB$\fR
as with
newline-sensitive matching,
but not \fB.\fR
and bracket expressions.
This isn't very useful but is provided for symmetry.
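.PP
A sketch of how newline-sensitive and case-independent matching change
results, assuming the \fB\-line\fR and \fB\-nocase\fR switches of the
\fBregexp\fR command (see regexp(n)) as the application-level way of
requesting them.
The two-line input and the variable names are illustrative.
.nf
```tcl
set text "foo\nbar"
# By default ^ and $ match only at the ends of the string, and . matches newline.
set r1 [regexp {^bar$} $text]
set r2 [regexp {foo.bar} $text]
# Newline-sensitive matching, requested by switch or by the embedded n option.
set r3 [regexp -line {^bar$} $text]
set r4 [regexp {(?n)^bar$} $text]
set r5 [regexp -line {foo.bar} $text]
# Case-independent matching also widens bracket expressions.
set r6 [regexp -nocase {[x]} "X"]
puts "$r1 $r2 $r3 $r4 $r5 $r6"             ;# prints: 0 1 1 1 0 1
```
.fi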
.SH "LIMITS AND COMPATIBILITY"
No particular limit is imposed on the length of REs.
Programs intended to be highly portable should not employ REs longer
than 256 bytes,
as a POSIX-compliant implementation can refuse to accept such REs.
.PP
The only feature of AREs that is actually incompatible with
POSIX EREs is that \fB\e\fR
does not lose its special
significance inside bracket expressions.
All other ARE features use syntax which is illegal or has
undefined or unspecified effects in POSIX EREs;
the \fB***\fR
syntax of directors likewise is outside the POSIX
syntax for both BREs and EREs.
.PP
Many of the ARE extensions are borrowed from Perl, but some have
been changed to clean them up, and a few Perl extensions are not present.
Incompatibilities of note include `\fB\eb\fR', `\fB\eB\fR',
the lack of special treatment for a trailing newline,
the addition of complemented bracket expressions to the things
affected by newline-sensitive matching,
the restrictions on parentheses and back references in lookahead constraints,
and the longest/shortest-match (rather than first-match) matching semantics.
.PP
The matching rules for REs containing both normal and non-greedy quantifiers
have changed since early beta-test versions of this package.
(The new rules are much simpler and cleaner,
but don't work as hard at guessing the user's real intentions.)
.PP
Henry Spencer's original 1986 \fIregexp\fR package,
still in widespread use (e.g., in pre-8.1 releases of Tcl),
implemented an early version of today's EREs.
There are four incompatibilities between \fIregexp\fR's near-EREs
(`RREs' for short) and AREs.
In roughly increasing order of significance:
.PP
.RS
In AREs, \fB\e\fR
followed by an alphanumeric character is either an
escape or an error,
while in RREs, it was just another way of writing the
alphanumeric.
This should not be a problem because there was no reason to write
such a sequence in RREs.
.PP
\fB{\fR
followed by a digit in an ARE is the beginning of a bound,
while in RREs, \fB{\fR
was always an ordinary character.
Such sequences should be rare,
and will often result in an error because following characters
will not look like a valid bound.
.PP
In AREs, \fB\e\fR
remains a special character within `\fB[\|]\fR',
so a literal \fB\e\fR within \fB[\|]\fR
must be written `\fB\e\e\fR'.
\fB\e\e\fR also gives a literal \fB\e\fR within \fB[\|]\fR in RREs,
but only truly paranoid programmers routinely doubled the backslash.
.PP
AREs report the longest/shortest match for the RE,
rather than the first found in a specified search order.
This may affect some RREs which were written in the expectation that
the first match would be reported.
(The careful crafting of RREs to optimize the search order for fast
matching is obsolete (AREs examine all possible matches
in parallel, and their performance is largely insensitive to their
complexity) but cases where the search order was exploited to deliberately
find a match which was \fInot\fR the longest/shortest will need rewriting.)
.RE
.SH "BASIC REGULAR EXPRESSIONS"
BREs differ from EREs in several respects. `\fB|\fR', `\fB+\fR',
and \fB?\fR
are ordinary characters and there is no equivalent
for their functionality.
The delimiters for bounds are
\fB\e{\fR and `\fB\e}\fR',
with \fB{\fR and \fB}\fR
by themselves ordinary characters.
The parentheses for nested subexpressions are
\fB\e(\fR and `\fB\e)\fR',
with \fB(\fR and \fB)\fR
by themselves ordinary characters.
\fB^\fR
is an ordinary character except at the beginning of the
RE or the beginning of a parenthesized subexpression,
\fB$\fR
is an ordinary character except at the end of the
RE or the end of a parenthesized subexpression,
and \fB*\fR
is an ordinary character if it appears at the beginning of the
RE or the beginning of a parenthesized subexpression
(after a possible leading `\fB^\fR').
Finally,
single-digit back references are available,
and \fB\e<\fR and \fB\e>\fR
are synonyms for
\fB[[:<:]]\fR and \fB[[:>:]]\fR
respectively;
no other escapes are available.
.SH "SEE ALSO"
RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n)
.SH KEYWORDS
match, regular expression, string