Commit | Line | Data |
---|---|---|
5b2abdfb A |
1 | .\" Copyright (c) 1992, 1993, 1994 Henry Spencer. |
2 | .\" Copyright (c) 1992, 1993, 1994 | |
3 | .\" The Regents of the University of California. All rights reserved. | |
4 | .\" | |
5 | .\" This code is derived from software contributed to Berkeley by | |
6 | .\" Henry Spencer. | |
7 | .\" | |
8 | .\" Redistribution and use in source and binary forms, with or without | |
9 | .\" modification, are permitted provided that the following conditions | |
10 | .\" are met: | |
11 | .\" 1. Redistributions of source code must retain the above copyright | |
12 | .\" notice, this list of conditions and the following disclaimer. | |
13 | .\" 2. Redistributions in binary form must reproduce the above copyright | |
14 | .\" notice, this list of conditions and the following disclaimer in the | |
15 | .\" documentation and/or other materials provided with the distribution. | |
16 | .\" 3. All advertising materials mentioning features or use of this software | |
17 | .\" must display the following acknowledgement: | |
18 | .\" This product includes software developed by the University of | |
19 | .\" California, Berkeley and its contributors. | |
20 | .\" 4. Neither the name of the University nor the names of its contributors | |
21 | .\" may be used to endorse or promote products derived from this software | |
22 | .\" without specific prior written permission. | |
23 | .\" | |
24 | .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND | |
25 | .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE | |
26 | .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE | |
27 | .\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE | |
28 | .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL | |
29 | .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS | |
30 | .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) | |
31 | .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT | |
32 | .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY | |
33 | .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF | |
34 | .\" SUCH DAMAGE. | |
35 | .\" | |
36 | .\" @(#)re_format.7 8.3 (Berkeley) 3/20/94 | |
37 | .\" $FreeBSD: src/lib/libc/regex/re_format.7,v 1.9 2001/07/15 07:53:08 dd Exp $ | |
38 | .\" | |
39 | .Dd March 20, 1994 | |
40 | .Dt RE_FORMAT 7 | |
41 | .Os | |
42 | .Sh NAME | |
43 | .Nm re_format | |
44 | .Nd POSIX 1003.2 regular expressions | |
45 | .Sh DESCRIPTION | |
46 | Regular expressions | |
47 | .Pq Dq RE Ns s , | |
48 | as defined in | |
49 | .St -p1003.2 , | |
50 | come in two forms: | |
51 | modern REs (roughly those of | |
52 | .Xr egrep 1 ; | |
53 | 1003.2 calls these | |
54 | .Dq extended | |
55 | REs) | |
56 | and obsolete REs (roughly those of | |
57 | .Xr ed 1 ; | |
58 | 1003.2 | |
59 | .Dq basic | |
60 | REs). | |
61 | Obsolete REs mostly exist for backward compatibility in some old programs; | |
62 | they will be discussed at the end. | |
63 | .St -p1003.2 | |
64 | leaves some aspects of RE syntax and semantics open; | |
65 | `\(dd' marks decisions on these aspects that | |
66 | may not be fully portable to other | |
67 | .St -p1003.2 | |
68 | implementations. | |
69 | .Pp | |
70 | A (modern) RE is one\(dd or more non-empty\(dd | |
71 | .Em branches , | |
72 | separated by | |
73 | .Ql \&| . | |
74 | It matches anything that matches one of the branches. | |
75 | .Pp | |
76 | A branch is one\(dd or more | |
77 | .Em pieces , | |
78 | concatenated. | |
79 | It matches a match for the first, followed by a match for the second, etc. | |
80 | .Pp | |
81 | A piece is an | |
82 | .Em atom | |
83 | possibly followed | |
84 | by a single\(dd | |
85 | .Ql \&* , | |
86 | .Ql \&+ , | |
87 | .Ql \&? , | |
88 | or | |
89 | .Em bound . | |
90 | An atom followed by | |
91 | .Ql \&* | |
92 | matches a sequence of 0 or more matches of the atom. | |
93 | An atom followed by | |
94 | .Ql \&+ | |
95 | matches a sequence of 1 or more matches of the atom. | |
96 | An atom followed by | |
97 | .Ql ?\& | |
98 | matches a sequence of 0 or 1 matches of the atom. | |
99 | .Pp | |
100 | A | |
101 | .Em bound | |
102 | is | |
103 | .Ql \&{ | |
104 | followed by an unsigned decimal integer, | |
105 | possibly followed by | |
106 | .Ql \&, | |
107 | possibly followed by another unsigned decimal integer, | |
108 | always followed by | |
109 | .Ql \&} . | |
110 | The integers must lie between 0 and | |
111 | .Dv RE_DUP_MAX | |
112 | (255\(dd) inclusive, | |
113 | and if there are two of them, the first may not exceed the second. | |
114 | An atom followed by a bound containing one integer | |
115 | .Em i | |
116 | and no comma matches | |
117 | a sequence of exactly | |
118 | .Em i | |
119 | matches of the atom. | |
120 | An atom followed by a bound | |
121 | containing one integer | |
122 | .Em i | |
123 | and a comma matches | |
124 | a sequence of | |
125 | .Em i | |
126 | or more matches of the atom. | |
127 | An atom followed by a bound | |
128 | containing two integers | |
129 | .Em i | |
130 | and | |
131 | .Em j | |
132 | matches | |
133 | a sequence of | |
134 | .Em i | |
135 | through | |
136 | .Em j | |
137 | (inclusive) matches of the atom. | |
138 | .Pp | |
139 | An atom is a regular expression enclosed in | |
140 | .Ql () | |
141 | (matching a match for the | |
142 | regular expression), | |
143 | an empty set of | |
144 | .Ql () | |
145 | (matching the null string)\(dd, | |
146 | a | |
147 | .Em bracket expression | |
148 | (see below), | |
149 | .Ql .\& | |
150 | (matching any single character), | |
151 | .Ql \&^ | |
152 | (matching the null string at the beginning of a line), | |
153 | .Ql \&$ | |
154 | (matching the null string at the end of a line), a | |
155 | .Ql \e | |
156 | followed by one of the characters | |
157 | .Ql ^.[$()|*+?{\e | |
158 | (matching that character taken as an ordinary character), | |
159 | a | |
160 | .Ql \e | |
161 | followed by any other character\(dd | |
162 | (matching that character taken as an ordinary character, | |
163 | as if the | |
164 | .Ql \e | |
165 | had not been present\(dd), | |
166 | or a single character with no other significance (matching that character). | |
167 | A | |
168 | .Ql \&{ | |
169 | followed by a character other than a digit is an ordinary | |
170 | character, not the beginning of a bound\(dd. | |
171 | It is illegal to end an RE with | |
172 | .Ql \e . | |
173 | .Pp | |
174 | A | |
175 | .Em bracket expression | |
176 | is a list of characters enclosed in | |
177 | .Ql [] . | |
178 | It normally matches any single character from the list (but see below). | |
179 | If the list begins with | |
180 | .Ql \&^ , | |
181 | it matches any single character | |
182 | (but see below) | |
183 | .Em not | |
184 | from the rest of the list. | |
185 | If two characters in the list are separated by | |
186 | .Ql \&- , | |
187 | this is shorthand | |
188 | for the full | |
189 | .Em range | |
190 | of characters between those two (inclusive) in the | |
191 | collating sequence, | |
192 | .No e.g. Ql [0-9] | |
193 | in ASCII matches any decimal digit. | |
194 | It is illegal\(dd for two ranges to share an | |
195 | endpoint, | |
196 | .No e.g. Ql a-c-e . | |
197 | Ranges are very collating-sequence-dependent, | |
198 | and portable programs should avoid relying on them. | |
199 | .Pp | |
200 | To include a literal | |
201 | .Ql \&] | |
202 | in the list, make it the first character | |
203 | (following a possible | |
204 | .Ql \&^ ) . | |
205 | To include a literal | |
206 | .Ql \&- , | |
207 | make it the first or last character, | |
208 | or the second endpoint of a range. | |
209 | To use a literal | |
210 | .Ql \&- | |
211 | as the first endpoint of a range, | |
212 | enclose it in | |
213 | .Ql [.\& | |
214 | and | |
215 | .Ql .]\& | |
216 | to make it a collating element (see below). | |
217 | With the exception of these and some combinations using | |
218 | .Ql \&[ | |
219 | (see next paragraphs), all other special characters, including | |
220 | .Ql \e , | |
221 | lose their special significance within a bracket expression. | |
222 | .Pp | |
223 | Within a bracket expression, a collating element (a character, | |
224 | a multi-character sequence that collates as if it were a single character, | |
225 | or a collating-sequence name for either) | |
226 | enclosed in | |
227 | .Ql [.\& | |
228 | and | |
229 | .Ql .]\& | |
230 | stands for the | |
231 | sequence of characters of that collating element. | |
232 | The sequence is a single element of the bracket expression's list. | |
233 | A bracket expression containing a multi-character collating element | |
234 | can thus match more than one character, | |
235 | e.g. if the collating sequence includes a | |
236 | .Ql ch | |
237 | collating element, | |
238 | then the RE | |
239 | .Ql [[.ch.]]*c | |
240 | matches the first five characters | |
241 | of | |
242 | .Ql chchcc . | |
243 | .Pp | |
244 | Within a bracket expression, a collating element enclosed in | |
245 | .Ql [= | |
246 | and | |
247 | .Ql =] | |
248 | is an equivalence class, standing for the sequences of characters | |
249 | of all collating elements equivalent to that one, including itself. | |
250 | (If there are no other equivalent collating elements, | |
251 | the treatment is as if the enclosing delimiters were | |
252 | .Ql [.\& | |
253 | and | |
254 | .Ql .] . ) | |
255 | For example, if | |
256 | .Ql x | |
257 | and | |
258 | .Ql y | |
259 | are the members of an equivalence class, | |
260 | then | |
261 | .Ql [[=x=]] , | |
262 | .Ql [[=y=]] , | |
263 | and | |
264 | .Ql [xy] | |
265 | are all synonymous. | |
266 | An equivalence class may not\(dd be an endpoint | |
267 | of a range. | |
268 | .Pp | |
269 | Within a bracket expression, the name of a | |
270 | .Em character class | |
271 | enclosed in | |
272 | .Ql [: | |
273 | and | |
274 | .Ql :] | |
275 | stands for the list of all characters belonging to that | |
276 | class. | |
277 | Standard character class names are: | |
278 | .Pp | |
279 | .Bl -column "alnum" "digit" "xdigit" -offset indent | |
280 | .It Em "alnum digit punct" | |
281 | .It Em "alpha graph space" | |
282 | .It Em "blank lower upper" | |
283 | .It Em "cntrl print xdigit" | |
284 | .El | |
285 | .Pp | |
286 | These stand for the character classes defined in | |
287 | .Xr ctype 3 . | |
288 | A locale may provide others. | |
289 | A character class may not be used as an endpoint of a range. | |
290 | .Pp | |
291 | There are two special cases\(dd of bracket expressions: | |
292 | the bracket expressions | |
293 | .Ql [[:<:]] | |
294 | and | |
295 | .Ql [[:>:]] | |
296 | match the null string at the beginning and end of a word respectively. | |
297 | A word is defined as a sequence of word characters | |
298 | which is neither preceded nor followed by | |
299 | word characters. | |
300 | A word character is an | |
301 | .Em alnum | |
302 | character (as defined by | |
303 | .Xr ctype 3 ) | |
304 | or an underscore. | |
305 | This is an extension, | |
306 | compatible with but not specified by | |
307 | .St -p1003.2 , | |
308 | and should be used with | |
309 | caution in software intended to be portable to other systems. | |
310 | .Pp | |
311 | In the event that an RE could match more than one substring of a given | |
312 | string, | |
313 | the RE matches the one starting earliest in the string. | |
314 | If the RE could match more than one substring starting at that point, | |
315 | it matches the longest. | |
316 | Subexpressions also match the longest possible substrings, subject to | |
317 | the constraint that the whole match be as long as possible, | |
318 | with subexpressions starting earlier in the RE taking priority over | |
319 | ones starting later. | |
320 | Note that higher-level subexpressions thus take priority over | |
321 | their lower-level component subexpressions. | |
322 | .Pp | |
323 | Match lengths are measured in characters, not collating elements. | |
324 | A null string is considered longer than no match at all. | |
325 | For example, | |
326 | .Ql bb* | |
327 | matches the three middle characters of | |
328 | .Ql abbbc , | |
329 | .Ql (wee|week)(knights|nights) | |
330 | matches all ten characters of | |
331 | .Ql weeknights , | |
332 | when | |
333 | .Ql (.*).*\& | |
334 | is matched against | |
335 | .Ql abc | |
336 | the parenthesized subexpression | |
337 | matches all three characters, and | |
338 | when | |
339 | .Ql (a*)* | |
340 | is matched against | |
341 | .Ql bc | |
342 | both the whole RE and the parenthesized | |
343 | subexpression match the null string. | |
344 | .Pp | |
345 | If case-independent matching is specified, | |
346 | the effect is much as if all case distinctions had vanished from the | |
347 | alphabet. | |
348 | When an alphabetic that exists in multiple cases appears as an | |
349 | ordinary character outside a bracket expression, it is effectively | |
350 | transformed into a bracket expression containing both cases, | |
351 | .No e.g. Ql x | |
352 | becomes | |
353 | .Ql [xX] . | |
354 | When it appears inside a bracket expression, all case counterparts | |
355 | of it are added to the bracket expression, so that (e.g.) | |
356 | .Ql [x] | |
357 | becomes | |
358 | .Ql [xX] | |
359 | and | |
360 | .Ql [^x] | |
361 | becomes | |
362 | .Ql [^xX] . | |
363 | .Pp | |
364 | No particular limit is imposed on the length of REs\(dd. | |
365 | Programs intended to be portable should not employ REs longer | |
366 | than 256 bytes, | |
367 | as an implementation can refuse to accept such REs and remain | |
368 | POSIX-compliant. | |
369 | .Pp | |
370 | Obsolete | |
371 | .Pq Dq basic | |
372 | regular expressions differ in several respects. | |
373 | .Ql \&| | |
374 | is an ordinary character and there is no equivalent | |
375 | for its functionality. | |
376 | .Ql \&+ | |
377 | and | |
378 | .Ql ?\& | |
379 | are ordinary characters, and their functionality | |
380 | can be expressed using bounds | |
381 | .No ( Ql {1,} | |
382 | or | |
383 | .Ql {0,1} | |
384 | respectively). | |
385 | Also note that | |
386 | .Ql x+ | |
387 | in modern REs is equivalent to | |
388 | .Ql xx* . | |
389 | The delimiters for bounds are | |
390 | .Ql \e{ | |
391 | and | |
392 | .Ql \e} , | |
393 | with | |
394 | .Ql \&{ | |
395 | and | |
396 | .Ql \&} | |
397 | by themselves ordinary characters. | |
398 | The parentheses for nested subexpressions are | |
399 | .Ql \e( | |
400 | and | |
401 | .Ql \e) , | |
402 | with | |
403 | .Ql \&( | |
404 | and | |
405 | .Ql \&) | |
406 | by themselves ordinary characters. | |
407 | .Ql \&^ | |
408 | is an ordinary character except at the beginning of the | |
409 | RE or\(dd the beginning of a parenthesized subexpression, | |
410 | .Ql \&$ | |
411 | is an ordinary character except at the end of the | |
412 | RE or\(dd the end of a parenthesized subexpression, | |
413 | and | |
414 | .Ql \&* | |
415 | is an ordinary character if it appears at the beginning of the | |
416 | RE or the beginning of a parenthesized subexpression | |
417 | (after a possible leading | |
418 | .Ql \&^ ) . | |
419 | Finally, there is one new type of atom, a | |
420 | .Em back reference : | |
421 | .Ql \e | |
422 | followed by a non-zero decimal digit | |
423 | .Em d | |
424 | matches the same sequence of characters | |
425 | matched by the | |
426 | .Em d Ns th | |
427 | parenthesized subexpression | |
428 | (numbering subexpressions by the positions of their opening parentheses, | |
429 | left to right), | |
430 | so that (e.g.) | |
431 | .Ql \e([bc]\e)\e1 | |
432 | matches | |
433 | .Ql bb | |
434 | or | |
435 | .Ql cc | |
436 | but not | |
437 | .Ql bc . | |
438 | .Sh SEE ALSO | |
439 | .Xr regex 3 | |
440 | .Rs | |
441 | .%T Regular Expression Notation | |
442 | .%R IEEE Std | |
443 | .%N 1003.2 | |
444 | .%P section 2.8 | |
445 | .Re | |
446 | .Sh BUGS | |
447 | Having two kinds of REs is a botch. | |
448 | .Pp | |
449 | The current | |
450 | .St -p1003.2 | |
451 | spec says that | |
452 | .Ql \&) | |
453 | is an ordinary character in | |
454 | the absence of an unmatched | |
455 | .Ql \&( ; | |
456 | this was an unintentional result of a wording error, | |
457 | and change is likely. | |
458 | Avoid relying on it. | |
459 | .Pp | |
460 | Back references are a dreadful botch, | |
461 | posing major problems for efficient implementations. | |
462 | They are also somewhat vaguely defined | |
463 | (does | |
464 | .Ql a\e(\e(b\e)*\e2\e)*d | |
465 | match | |
466 | .Ql abbbd ? ) . | |
467 | Avoid using them. | |
468 | .Pp | |
469 | .St -p1003.2 | |
470 | specification of case-independent matching is vague. | |
471 | The | |
472 | .Dq one case implies all cases | |
473 | definition given above | |
474 | is the current consensus among implementors as to the right interpretation. | |
475 | .Pp | |
476 | The syntax for word boundaries is incredibly ugly. |
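
Several of the rules in this manual page can be checked against the regex(3) interface listed under SEE ALSO. The following is a minimal sketch, not part of the manual page itself: `re_match` is a hypothetical helper name introduced here for illustration, and the sketch assumes a POSIX `<regex.h>` is available. It compiles a pattern as a modern RE (`REG_EXTENDED`) or an obsolete RE (flags of 0) and copies out the text matched by the whole RE, so the leftmost-longest rule, bounds, back references and case-independent matching can be observed directly. Decisions marked with the dagger in the page may behave differently on other implementations.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of regex(3)): compile `pattern` with
 * `cflags`, search `text`, and copy the text matched by the whole RE
 * into `out` (truncated to fit, always NUL-terminated).
 *
 * Returns 1 on a match, 0 on no match, -1 if the pattern fails to compile.
 * Do not pass REG_NOSUB, since match offsets are needed.
 */
int re_match(const char *pattern, int cflags, const char *text,
             char *out, size_t outsz)
{
    regex_t re;
    regmatch_t m[1];
    size_t len;
    int rc;

    assert(out != NULL && outsz > 0);
    out[0] = '\0';

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;

    /* m[0] receives the offsets of the whole match. */
    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;

    len = (size_t)(m[0].rm_eo - m[0].rm_so);
    if (len >= outsz)
        len = outsz - 1;
    memcpy(out, text + m[0].rm_so, len);
    out[len] = '\0';
    return 1;
}
```

For instance, `re_match("bb*", REG_EXTENDED, "abbbc", buf, sizeof buf)` is expected to leave `bbb` in `buf`, in line with the example in the DESCRIPTION. With flags of 0 the pattern is treated as an obsolete RE, so `a+` matches the two literal characters `a+`, and `\([bc]\)\1` matches `bb` but not `bc`.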