Commit | Line | Data |
---|---|---|
07dcc217 VZ |
1 | .TH REGEX 7 "25 Oct 1995" |
2 | .BY "Henry Spencer" | |
3 | .SH NAME | |
4 | regex \- POSIX 1003.2 regular expressions | |
5 | .SH DESCRIPTION | |
6 | Regular expressions (``RE''s), | |
7 | as defined in POSIX 1003.2, come in two forms: | |
8 | modern REs (roughly those of | |
9 | .IR egrep ; | |
10 | 1003.2 calls these ``extended'' REs) | |
11 | and obsolete REs (roughly those of | |
12 | .IR ed ; | |
13 | 1003.2 ``basic'' REs). | |
14 | Obsolete REs mostly exist for backward compatibility in some old programs; | |
15 | they will be discussed at the end. | |
16 | 1003.2 leaves some aspects of RE syntax and semantics open; | |
17 | `\(dg' marks decisions on these aspects that | |
18 | may not be fully portable to other 1003.2 implementations. | |
19 | .PP | |
20 | A (modern) RE is one\(dg or more non-empty\(dg \fIbranches\fR, | |
21 | separated by `|'. | |
22 | It matches anything that matches one of the branches. | |
23 | .PP | |
24 | A branch is one\(dg or more \fIpieces\fR, concatenated. | |
25 | It matches a match for the first, followed by a match for the second, etc. | |
26 | .PP | |
27 | A piece is an \fIatom\fR possibly followed | |
28 | by a single\(dg `*', `+', `?', or \fIbound\fR. | |
29 | An atom followed by `*' matches a sequence of 0 or more matches of the atom. | |
30 | An atom followed by `+' matches a sequence of 1 or more matches of the atom. | |
31 | An atom followed by `?' matches a sequence of 0 or 1 matches of the atom. | |
32 | .PP | |
33 | A \fIbound\fR is `{' followed by an unsigned decimal integer, | |
34 | possibly followed by `,' | |
35 | possibly followed by another unsigned decimal integer, | |
36 | always followed by `}'. | |
37 | The integers must lie between 0 and RE_DUP_MAX (255\(dg) inclusive, | |
38 | and if there are two of them, the first may not exceed the second. | |
39 | An atom followed by a bound containing one integer \fIi\fR | |
40 | and no comma matches | |
41 | a sequence of exactly \fIi\fR matches of the atom. | |
42 | An atom followed by a bound | |
43 | containing one integer \fIi\fR and a comma matches | |
44 | a sequence of \fIi\fR or more matches of the atom. | |
45 | An atom followed by a bound | |
46 | containing two integers \fIi\fR and \fIj\fR matches | |
47 | a sequence of \fIi\fR through \fIj\fR (inclusive) matches of the atom. | |
48 | .PP | |
49 | An atom is a regular expression enclosed in `()' (matching a match for the | |
50 | regular expression), | |
51 | an empty set of `()' (matching the null string)\(dg, | |
52 | a \fIbracket expression\fR (see below), `.' | |
53 | (matching any single character), `^' (matching the null string at the | |
54 | beginning of a line), `$' (matching the null string at the | |
55 | end of a line), a `\e' followed by one of the characters | |
56 | `^.[$()|*+?{\e' | |
57 | (matching that character taken as an ordinary character), | |
58 | a `\e' followed by any other character\(dg | |
59 | (matching that character taken as an ordinary character, | |
60 | as if the `\e' had not been present\(dg), | |
61 | or a single character with no other significance (matching that character). | |
62 | A `{' followed by a character other than a digit is an ordinary | |
63 | character, not the beginning of a bound\(dg. | |
64 | It is illegal to end an RE with `\e'. | |
65 | .PP | |
66 | A \fIbracket expression\fR is a list of characters enclosed in `[]'. | |
67 | It normally matches any single character from the list (but see below). | |
68 | If the list begins with `^', | |
69 | it matches any single character | |
70 | (but see below) \fInot\fR from the rest of the list. | |
71 | If two characters in the list are separated by `\-', this is shorthand | |
72 | for the full \fIrange\fR of characters between those two (inclusive) in the | |
73 | collating sequence, | |
74 | e.g. `[0\-9]' in ASCII matches any decimal digit. | |
75 | It is illegal\(dg for two ranges to share an | |
76 | endpoint, e.g. `a\-c\-e'. | |
77 | Ranges are very collating-sequence-dependent, | |
78 | and portable programs should avoid relying on them. | |
79 | .PP | |
80 | To include a literal `]' in the list, make it the first character | |
81 | (following a possible `^'). | |
82 | To include a literal `\-', make it the first or last character, | |
83 | or the second endpoint of a range. | |
84 | To use a literal `\-' as the first endpoint of a range, | |
85 | enclose it in `[.' and `.]' to make it a collating element (see below). | |
86 | With the exception of these and some combinations using `[' (see next | |
87 | paragraphs), all other special characters, including `\e', lose their | |
88 | special significance within a bracket expression. | |
89 | .PP | |
90 | Within a bracket expression, a collating element (a character, | |
91 | a multi-character sequence that collates as if it were a single character, | |
92 | or a collating-sequence name for either) | |
93 | enclosed in `[.' and `.]' stands for the | |
94 | sequence of characters of that collating element. | |
95 | The sequence is a single element of the bracket expression's list. | |
96 | A bracket expression containing a multi-character collating element | |
97 | can thus match more than one character, | |
98 | e.g. if the collating sequence includes a `ch' collating element, | |
99 | then the RE `[[.ch.]]*c' matches the first five characters | |
100 | of `chchcc'. | |
101 | .PP | |
102 | Within a bracket expression, a collating element enclosed in `[=' and | |
103 | `=]' is an equivalence class, standing for the sequences of characters | |
104 | of all collating elements equivalent to that one, including itself. | |
105 | (If there are no other equivalent collating elements, | |
106 | the treatment is as if the enclosing delimiters were `[.' and `.]'.) | |
107 | For example, if o and \o'o^' are the members of an equivalence class, | |
108 | then `[[=o=]]', `[[=\o'o^'=]]', and `[o\o'o^']' are all synonymous. | |
109 | An equivalence class may not\(dg be an endpoint | |
110 | of a range. | |
111 | .PP | |
112 | Within a bracket expression, the name of a \fIcharacter class\fR enclosed | |
113 | in `[:' and `:]' stands for the list of all characters belonging to that | |
114 | class. | |
115 | Standard character class names are: | |
116 | .PP | |
117 | .RS | |
118 | .nf | |
119 | .ta 3c 6c 9c | |
120 | alnum digit punct | |
121 | alpha graph space | |
122 | blank lower upper | |
123 | cntrl print xdigit | |
124 | .fi | |
125 | .RE | |
126 | .PP | |
127 | These stand for the character classes defined in | |
128 | .IR ctype (3). | |
129 | A locale may provide others. | |
130 | A character class may not be used as an endpoint of a range. | |
131 | .PP | |
132 | There are two special cases\(dg of bracket expressions: | |
133 | the bracket expressions `[[:<:]]' and `[[:>:]]' match the null string at | |
134 | the beginning and end of a word respectively. | |
135 | A word is defined as a sequence of | |
136 | word characters | |
137 | which is neither preceded nor followed by | |
138 | word characters. | |
139 | A word character is an | |
140 | .I alnum | |
141 | character (as defined by | |
142 | .IR ctype (3)) | |
143 | or an underscore. | |
144 | This is an extension, | |
145 | compatible with but not specified by POSIX 1003.2, | |
146 | and should be used with | |
147 | caution in software intended to be portable to other systems. | |
148 | .PP | |
149 | In the event that an RE could match more than one substring of a given | |
150 | string, | |
151 | the RE matches the one starting earliest in the string. | |
152 | If the RE could match more than one substring starting at that point, | |
153 | it matches the longest. | |
154 | Subexpressions also match the longest possible substrings, subject to | |
155 | the constraint that the whole match be as long as possible, | |
156 | with subexpressions starting earlier in the RE taking priority over | |
157 | ones starting later. | |
158 | Note that higher-level subexpressions thus take priority over | |
159 | their lower-level component subexpressions. | |
160 | .PP | |
161 | Match lengths are measured in characters, not collating elements. | |
162 | A null string is considered longer than no match at all. | |
163 | For example, | |
164 | `bb*' matches the three middle characters of `abbbc', | |
165 | `(wee|week)(knights|nights)' matches all ten characters of `weeknights', | |
166 | when `(.*).*' is matched against `abc' the parenthesized subexpression | |
167 | matches all three characters, and | |
168 | when `(a*)*' is matched against `bc' both the whole RE and the parenthesized | |
169 | subexpression match the null string. | |
170 | .PP | |
171 | If case-independent matching is specified, | |
172 | the effect is much as if all case distinctions had vanished from the | |
173 | alphabet. | |
174 | When an alphabetic that exists in multiple cases appears as an | |
175 | ordinary character outside a bracket expression, it is effectively | |
176 | transformed into a bracket expression containing both cases, | |
177 | e.g. `x' becomes `[xX]'. | |
178 | When it appears inside a bracket expression, all case counterparts | |
179 | of it are added to the bracket expression, so that (e.g.) `[x]' | |
180 | becomes `[xX]' and `[^x]' becomes `[^xX]'. | |
181 | .PP | |
182 | No particular limit is imposed on the length of REs\(dg. | |
183 | Programs intended to be portable should not employ REs longer | |
184 | than 256 bytes, | |
185 | as an implementation can refuse to accept such REs and remain | |
186 | POSIX-compliant. | |
187 | .PP | |
188 | Obsolete (``basic'') regular expressions differ in several respects. | |
189 | `|', `+', and `?' are ordinary characters and there is no equivalent | |
190 | for their functionality. | |
191 | The delimiters for bounds are `\e{' and `\e}', | |
192 | with `{' and `}' by themselves ordinary characters. | |
193 | The parentheses for nested subexpressions are `\e(' and `\e)', | |
194 | with `(' and `)' by themselves ordinary characters. | |
195 | `^' is an ordinary character except at the beginning of the | |
196 | RE or\(dg the beginning of a parenthesized subexpression, | |
197 | `$' is an ordinary character except at the end of the | |
198 | RE or\(dg the end of a parenthesized subexpression, | |
199 | and `*' is an ordinary character if it appears at the beginning of the | |
200 | RE or the beginning of a parenthesized subexpression | |
201 | (after a possible leading `^'). | |
202 | Finally, there is one new type of atom, a \fIback reference\fR: | |
203 | `\e' followed by a non-zero decimal digit \fId\fR | |
204 | matches the same sequence of characters | |
205 | matched by the \fId\fRth parenthesized subexpression | |
206 | (numbering subexpressions by the positions of their opening parentheses, | |
207 | left to right), | |
208 | so that (e.g.) `\e([bc]\e)\e1' matches `bb' or `cc' but not `bc'. | |
209 | .SH SEE ALSO | |
210 | regex(3) | |
211 | .PP | |
212 | POSIX 1003.2, section 2.8 (Regular Expression Notation). | |
213 | .SH HISTORY | |
214 | Written by Henry Spencer, based on the 1003.2 spec. | |
215 | .SH BUGS | |
216 | Having two kinds of REs is a botch. | |
217 | .PP | |
218 | The current 1003.2 spec says that `)' is an ordinary character in | |
219 | the absence of an unmatched `('; | |
220 | this was an unintentional result of a wording error, | |
221 | and change is likely. | |
222 | Avoid relying on it. | |
223 | .PP | |
224 | Back references are a dreadful botch, | |
225 | posing major problems for efficient implementations. | |
226 | They are also somewhat vaguely defined | |
227 | (does | |
228 | `a\e(\e(b\e)*\e2\e)*d' match `abbbd'?). | |
229 | Avoid using them. | |
230 | .PP | |
231 | 1003.2's specification of case-independent matching is vague. | |
232 | The ``one case implies all cases'' definition given above | |
233 | is current consensus among implementors as to the right interpretation. | |
234 | .PP | |
235 | The syntax for word boundaries is incredibly ugly. |