]> git.saurik.com Git - apple/libc.git/blob - regex/regex.3
Libc-498.tar.gz
[apple/libc.git] / regex / regex.3
1 .\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
2 .\" Copyright (c) 1992, 1993, 1994
3 .\" The Regents of the University of California. All rights reserved.
4 .\"
5 .\" This code is derived from software contributed to Berkeley by
6 .\" Henry Spencer.
7 .\"
8 .\" Redistribution and use in source and binary forms, with or without
9 .\" modification, are permitted provided that the following conditions
10 .\" are met:
11 .\" 1. Redistributions of source code must retain the above copyright
12 .\" notice, this list of conditions and the following disclaimer.
13 .\" 2. Redistributions in binary form must reproduce the above copyright
14 .\" notice, this list of conditions and the following disclaimer in the
15 .\" documentation and/or other materials provided with the distribution.
16 .\" 3. All advertising materials mentioning features or use of this software
17 .\" must display the following acknowledgement:
18 .\" This product includes software developed by the University of
19 .\" California, Berkeley and its contributors.
20 .\" 4. Neither the name of the University nor the names of its contributors
21 .\" may be used to endorse or promote products derived from this software
22 .\" without specific prior written permission.
23 .\"
24 .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
25 .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
26 .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
27 .\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
28 .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
29 .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
30 .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
31 .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
32 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
33 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
34 .\" SUCH DAMAGE.
35 .\"
36 .\" @(#)regex.3 8.4 (Berkeley) 3/20/94
37 .\" $FreeBSD: src/lib/libc/regex/regex.3,v 1.17 2004/07/12 11:03:42 tjr Exp $
38 .\"
39 .Dd July 12, 2004
40 .Dt REGEX 3
41 .Os
42 .Sh NAME
43 .Nm regcomp ,
44 .Nm regerror ,
45 .Nm regexec ,
46 .Nm regfree
47 .Nd regular-expression library
48 .Sh LIBRARY
49 .Lb libc
50 .Sh SYNOPSIS
51 .In regex.h
52 .Ft int
53 .Fo regcomp
54 .Fa "regex_t *restrict preg"
55 .Fa "const char *restrict pattern"
56 .Fa "int cflags"
57 .Fc
58 .Ft size_t
59 .Fo regerror
60 .Fa "int errcode"
61 .Fa "const regex_t *restrict preg"
62 .Fa "char *restrict errbuf"
63 .Fa "size_t errbuf_size"
64 .Fc
65 .Ft int
66 .Fo regexec
67 .Fa "const regex_t *restrict preg"
68 .Fa "const char *restrict string"
69 .Fa "size_t nmatch"
70 .Fa "regmatch_t pmatch[restrict]"
71 .Fa "int eflags"
72 .Fc
73 .Ft void
74 .Fo regfree
75 .Fa "regex_t *preg"
76 .Fc
77 .Sh DESCRIPTION
78 These routines implement
79 .St -p1003.2
80 regular expressions
81 .Pq Do RE Dc Ns s ;
82 see
83 .Xr re_format 7 .
84 The
85 .Fn regcomp
86 function
87 compiles an RE, written as a string, into an internal form.
88 .Fn regexec
89 matches that internal form against a string and reports results.
90 .Fn regerror
91 transforms error codes from either into human-readable messages.
92 .Fn regfree
93 frees any dynamically-allocated storage used by the internal form
94 of an RE.
95 .Pp
96 The header
97 .In regex.h
98 declares two structure types,
99 .Ft regex_t
100 and
101 .Ft regmatch_t ,
102 the former for compiled internal forms and the latter for match reporting.
103 It also declares the four functions,
104 a type
105 .Ft regoff_t ,
106 and a number of constants with names starting with
107 .Dq Dv REG_ .
108 .Pp
109 The
110 .Fn regcomp
111 function
112 compiles the regular expression contained in the
113 .Fa pattern
114 string,
115 subject to the flags in
116 .Fa cflags ,
117 and places the results in the
118 .Ft regex_t
119 structure pointed to by
120 .Fa preg .
121 The
122 .Fa cflags
123 argument
124 is the bitwise OR of zero or more of the following flags:
125 .Bl -tag -width REG_EXTENDED
126 .It Dv REG_EXTENDED
127 Compile modern
128 .Pq Dq extended
129 REs,
130 rather than the obsolete
131 .Pq Dq basic
132 REs that
133 are the default.
134 .It Dv REG_BASIC
135 This is a synonym for 0,
136 provided as a counterpart to
137 .Dv REG_EXTENDED
138 to improve readability.
139 .It Dv REG_NOSPEC
140 Compile with recognition of all special characters turned off.
141 All characters are thus considered ordinary,
142 so the
143 .Dq RE
144 is a literal string.
145 This is an extension,
146 compatible with but not specified by
147 .St -p1003.2 ,
148 and should be used with
149 caution in software intended to be portable to other systems.
150 .Dv REG_EXTENDED
151 and
152 .Dv REG_NOSPEC
153 may not be used
154 in the same call to
155 .Fn regcomp .
156 .It Dv REG_ICASE
157 Compile for matching that ignores upper/lower case distinctions.
158 See
159 .Xr re_format 7 .
160 .It Dv REG_NOSUB
161 Compile for matching that need only report success or failure,
162 not what was matched.
163 .It Dv REG_NEWLINE
164 Compile for newline-sensitive matching.
165 By default, newline is a completely ordinary character with no special
166 meaning in either REs or strings.
167 With this flag,
168 .Ql [^
169 bracket expressions and
170 .Ql .\&
171 never match newline,
172 a
173 .Ql ^\&
174 anchor matches the null string after any newline in the string
175 in addition to its normal function,
176 and the
177 .Ql $\&
178 anchor matches the null string before any newline in the
179 string in addition to its normal function.
180 .It Dv REG_PEND
181 The regular expression ends,
182 not at the first NUL,
183 but just before the character pointed to by the
184 .Va re_endp
185 member of the structure pointed to by
186 .Fa preg .
187 The
188 .Va re_endp
189 member is of type
190 .Ft "const char *" .
191 This flag permits inclusion of NULs in the RE;
192 they are considered ordinary characters.
193 This is an extension,
194 compatible with but not specified by
195 .St -p1003.2 ,
196 and should be used with
197 caution in software intended to be portable to other systems.
198 .El
199 .Pp
200 When successful,
201 .Fn regcomp
202 returns 0 and fills in the structure pointed to by
203 .Fa preg .
204 One member of that structure
205 (other than
206 .Va re_endp )
207 is publicized:
208 .Va re_nsub ,
209 of type
210 .Ft size_t ,
211 contains the number of parenthesized subexpressions within the RE
212 (except that the value of this member is undefined if the
213 .Dv REG_NOSUB
214 flag was used).
215 If
216 .Fn regcomp
217 fails, it returns a non-zero error code;
218 see
219 .Sx DIAGNOSTICS .
220 .Pp
221 The
222 .Fn regexec
223 function
224 matches the compiled RE pointed to by
225 .Fa preg
226 against the
227 .Fa string ,
228 subject to the flags in
229 .Fa eflags ,
230 and reports results using
231 .Fa nmatch ,
232 .Fa pmatch ,
233 and the returned value.
234 The RE must have been compiled by a previous invocation of
235 .Fn regcomp .
236 The compiled form is not altered during execution of
237 .Fn regexec ,
238 so a single compiled RE can be used simultaneously by multiple threads.
239 .Pp
240 By default,
241 the NUL-terminated string pointed to by
242 .Fa string
243 is considered to be the text of an entire line, minus any terminating
244 newline.
245 The
246 .Fa eflags
247 argument is the bitwise OR of zero or more of the following flags:
248 .Bl -tag -width REG_STARTEND
249 .It Dv REG_NOTBOL
250 The first character of
251 the string
252 is not the beginning of a line, so the
253 .Ql ^\&
254 anchor should not match before it.
255 This does not affect the behavior of newlines under
256 .Dv REG_NEWLINE .
257 .It Dv REG_NOTEOL
258 The NUL terminating
259 the string
260 does not end a line, so the
261 .Ql $\&
262 anchor should not match before it.
263 This does not affect the behavior of newlines under
264 .Dv REG_NEWLINE .
265 .It Dv REG_STARTEND
266 The string is considered to start at
267 .Fa string
268 +
269 .Fa pmatch Ns [0]. Ns Va rm_so
270 and to have a terminating NUL located at
271 .Fa string
272 +
273 .Fa pmatch Ns [0]. Ns Va rm_eo
274 (there need not actually be a NUL at that location),
275 regardless of the value of
276 .Fa nmatch .
277 See below for the definition of
278 .Fa pmatch
279 and
280 .Fa nmatch .
281 This is an extension,
282 compatible with but not specified by
283 .St -p1003.2 ,
284 and should be used with
285 caution in software intended to be portable to other systems.
286 Note that a non-zero
287 .Va rm_so
288 does not imply
289 .Dv REG_NOTBOL ;
290 .Dv REG_STARTEND
291 affects only the location of the string,
292 not how it is matched.
293 .El
294 .Pp
295 See
296 .Xr re_format 7
297 for a discussion of what is matched in situations where an RE or a
298 portion thereof could match any of several substrings of
299 .Fa string .
300 .Pp
301 Normally,
302 .Fn regexec
303 returns 0 for success and the non-zero code
304 .Dv REG_NOMATCH
305 for failure.
306 Other non-zero error codes may be returned in exceptional situations;
307 see
308 .Sx DIAGNOSTICS .
309 .Pp
310 If
311 .Dv REG_NOSUB
312 was specified in the compilation of the RE,
313 or if
314 .Fa nmatch
315 is 0,
316 .Fn regexec
317 ignores the
318 .Fa pmatch
319 argument (but see below for the case where
320 .Dv REG_STARTEND
321 is specified).
322 Otherwise,
323 .Fa pmatch
324 points to an array of
325 .Fa nmatch
326 structures of type
327 .Ft regmatch_t .
328 Such a structure has at least the members
329 .Va rm_so
330 and
331 .Va rm_eo ,
332 both of type
333 .Ft regoff_t
334 (a signed arithmetic type at least as large as an
335 .Ft off_t
336 and a
337 .Ft ssize_t ) ,
338 containing respectively the offset of the first character of a substring
339 and the offset of the first character after the end of the substring.
340 Offsets are measured from the beginning of the
341 .Fa string
342 argument given to
343 .Fn regexec .
344 An empty substring is denoted by equal offsets,
345 both indicating the character following the empty substring.
346 .Pp
347 The 0th member of the
348 .Fa pmatch
349 array is filled in to indicate what substring of
350 .Fa string
351 was matched by the entire RE.
352 Remaining members report what substring was matched by parenthesized
353 subexpressions within the RE;
354 member
355 .Va i
356 reports subexpression
357 .Va i ,
358 with subexpressions counted (starting at 1) by the order of their opening
359 parentheses in the RE, left to right.
360 Unused entries in the array (corresponding either to subexpressions that
361 did not participate in the match at all, or to subexpressions that do not
362 exist in the RE (that is,
363 .Va i
364 >
365 .Fa preg Ns -> Ns Va re_nsub ) )
366 have both
367 .Va rm_so
368 and
369 .Va rm_eo
370 set to -1.
371 If a subexpression participated in the match several times,
372 the reported substring is the last one it matched.
373 (Note, as an example in particular, that when the RE
374 .Ql "(b*)+"
375 matches
376 .Ql bbb ,
377 the parenthesized subexpression matches each of the three
378 .So Li b Sc Ns s
379 and then
380 an infinite number of empty strings following the last
381 .Ql b ,
382 so the reported substring is one of the empties.)
383 .Pp
384 If
385 .Dv REG_STARTEND
386 is specified,
387 .Fa pmatch
388 must point to at least one
389 .Ft regmatch_t
390 (even if
391 .Fa nmatch
392 is 0 or
393 .Dv REG_NOSUB
394 was specified),
395 to hold the input offsets for
396 .Dv REG_STARTEND .
397 Use for output is still entirely controlled by
398 .Fa nmatch ;
399 if
400 .Fa nmatch
401 is 0 or
402 .Dv REG_NOSUB
403 was specified,
404 the value of
405 .Fa pmatch Ns [0]
406 will not be changed by a successful
407 .Fn regexec .
408 .Pp
409 The
410 .Fn regerror
411 function
412 maps a non-zero
413 .Fa errcode
414 from either
415 .Fn regcomp
416 or
417 .Fn regexec
418 to a human-readable, printable message.
419 If
420 .Fa preg
421 is
422 .No non\- Ns Dv NULL ,
423 the error code should have arisen from use of
424 the
425 .Ft regex_t
426 pointed to by
427 .Fa preg ,
428 and if the error code came from
429 .Fn regcomp ,
430 it should have been the result from the most recent
431 .Fn regcomp
432 using that
433 .Ft regex_t .
434 The
435 .Fn ( regerror
436 may be able to supply a more detailed message using information
437 from the
438 .Ft regex_t . )
439 The
440 .Fn regerror
441 function
442 places the NUL-terminated message into the buffer pointed to by
443 .Fa errbuf ,
444 limiting the length (including the NUL) to at most
445 .Fa errbuf_size
446 bytes.
447 If the whole message won't fit,
448 as much of it as will fit before the terminating NUL is supplied.
449 In any case,
450 the returned value is the size of buffer needed to hold the whole
451 message (including terminating NUL).
452 If
453 .Fa errbuf_size
454 is 0,
455 .Fa errbuf
456 is ignored but the return value is still correct.
457 .Pp
458 If the
459 .Fa errcode
460 given to
461 .Fn regerror
462 is first ORed with
463 .Dv REG_ITOA ,
464 the
465 .Dq message
466 that results is the printable name of the error code,
467 e.g.\&
468 .Dq Dv REG_NOMATCH ,
469 rather than an explanation thereof.
470 If
471 .Fa errcode
472 is
473 .Dv REG_ATOI ,
474 then
475 .Fa preg
476 shall be
477 .No non\- Ns Dv NULL
478 and the
479 .Va re_endp
480 member of the structure it points to
481 must point to the printable name of an error code;
482 in this case, the result in
483 .Fa errbuf
484 is the decimal digits of
485 the numeric value of the error code
486 (0 if the name is not recognized).
487 .Dv REG_ITOA
488 and
489 .Dv REG_ATOI
490 are intended primarily as debugging facilities;
491 they are extensions,
492 compatible with but not specified by
493 .St -p1003.2 ,
494 and should be used with
495 caution in software intended to be portable to other systems.
496 Be warned also that they are considered experimental and changes are possible.
497 .Pp
498 The
499 .Fn regfree
500 function
501 frees any dynamically-allocated storage associated with the compiled RE
502 pointed to by
503 .Fa preg .
504 The remaining
505 .Ft regex_t
506 is no longer a valid compiled RE
507 and the effect of supplying it to
508 .Fn regexec
509 or
510 .Fn regerror
511 is undefined.
512 .Pp
513 None of these functions references global variables except for tables
514 of constants;
515 all are safe for use from multiple threads if the arguments are safe.
516 .Sh IMPLEMENTATION CHOICES
517 There are a number of decisions that
518 .St -p1003.2
519 leaves up to the implementor,
520 either by explicitly saying
521 .Dq undefined
522 or by virtue of them being
523 forbidden by the RE grammar.
524 This implementation treats them as follows.
525 .Pp
526 See
527 .Xr re_format 7
528 for a discussion of the definition of case-independent matching.
529 .Pp
530 There is no particular limit on the length of REs,
531 except insofar as memory is limited.
532 Memory usage is approximately linear in RE size, and largely insensitive
533 to RE complexity, except for bounded repetitions.
534 See
535 .Sx BUGS
536 for one short RE using them
537 that will run almost any system out of memory.
538 .Pp
539 A backslashed character other than one specifically given a magic meaning
540 by
541 .St -p1003.2
542 (such magic meanings occur only in obsolete
543 .Bq Dq basic
544 REs)
545 is taken as an ordinary character.
546 .Pp
547 Any unmatched
548 .Ql [\&
549 is a
550 .Dv REG_EBRACK
551 error.
552 .Pp
553 Equivalence classes cannot begin or end bracket-expression ranges.
554 The endpoint of one range cannot begin another.
555 .Pp
556 .Dv RE_DUP_MAX ,
557 the limit on repetition counts in bounded repetitions, is 255.
558 .Pp
559 A repetition operator
560 .Ql ( ?\& ,
561 .Ql *\& ,
562 .Ql +\& ,
563 or bounds)
564 cannot follow another
565 repetition operator.
566 A repetition operator cannot begin an expression or subexpression
567 or follow
568 .Ql ^\&
569 or
570 .Ql |\& .
571 .Pp
572 .Ql |\&
573 cannot appear first or last in a (sub)expression or after another
574 .Ql |\& ,
575 i.e., an operand of
576 .Ql |\&
577 cannot be an empty subexpression.
578 An empty parenthesized subexpression,
579 .Ql "()" ,
580 is legal and matches an
581 empty (sub)string.
582 An empty string is not a legal RE.
583 .Pp
584 A
585 .Ql {\&
586 followed by a digit is considered the beginning of bounds for a
587 bounded repetition, which must then follow the syntax for bounds.
588 A
589 .Ql {\&
590 .Em not
591 followed by a digit is considered an ordinary character.
592 .Pp
593 .Ql ^\&
594 and
595 .Ql $\&
596 beginning and ending subexpressions in obsolete
597 .Pq Dq basic
598 REs are anchors, not ordinary characters.
599 .Sh SEE ALSO
600 .Xr grep 1 ,
601 .Xr re_format 7
602 .Pp
603 .St -p1003.2 ,
604 sections 2.8 (Regular Expression Notation)
605 and
606 B.5 (C Binding for Regular Expression Matching).
607 .Sh DIAGNOSTICS
608 Non-zero error codes from
609 .Fn regcomp
610 and
611 .Fn regexec
612 include the following:
613 .Pp
614 .Bl -tag -width REG_ECOLLATE -compact
615 .It Dv REG_NOMATCH
616 The
617 .Fn regexec
618 function
619 failed to match
620 .It Dv REG_BADPAT
621 invalid regular expression
622 .It Dv REG_ECOLLATE
623 invalid collating element
624 .It Dv REG_ECTYPE
625 invalid character class
626 .It Dv REG_EESCAPE
627 .Ql \e
628 applied to unescapable character
629 .It Dv REG_ESUBREG
630 invalid backreference number
631 .It Dv REG_EBRACK
632 brackets
633 .Ql "[ ]"
634 not balanced
635 .It Dv REG_EPAREN
636 parentheses
637 .Ql "( )"
638 not balanced
639 .It Dv REG_EBRACE
640 braces
641 .Ql "{ }"
642 not balanced
643 .It Dv REG_BADBR
644 invalid repetition count(s) in
645 .Ql "{ }"
646 .It Dv REG_ERANGE
647 invalid character range in
648 .Ql "[ ]"
649 .It Dv REG_ESPACE
650 ran out of memory
651 .It Dv REG_BADRPT
652 .Ql ?\& ,
653 .Ql *\& ,
654 or
655 .Ql +\&
656 operand invalid
657 .It Dv REG_EMPTY
658 empty (sub)expression
659 .It Dv REG_ASSERT
660 can't happen - you found a bug
661 .It Dv REG_INVARG
662 invalid argument, e.g.\& negative-length string
663 .It Dv REG_ILLSEQ
664 illegal byte sequence (bad multibyte character)
665 .El
666 .Sh HISTORY
667 Originally written by
668 .An Henry Spencer .
669 Altered for inclusion in the
670 .Bx 4.4
671 distribution.
672 .Sh BUGS
673 This is an alpha release with known defects.
674 Please report problems.
675 .Pp
676 The back-reference code is subtle and doubts linger about its correctness
677 in complex cases.
678 .Pp
679 The
680 .Fn regexec
681 function
682 performance is poor.
683 This will improve with later releases.
684 The
685 .Fa nmatch
686 argument
687 exceeding 0 is expensive;
688 .Fa nmatch
689 exceeding 1 is worse.
690 The
691 .Fn regexec
692 function
693 is largely insensitive to RE complexity
694 .Em except
695 that back
696 references are massively expensive.
697 RE length does matter; in particular, there is a strong speed bonus
698 for keeping RE length under about 30 characters,
699 with most special characters counting roughly double.
700 .Pp
701 The
702 .Fn regcomp
703 function
704 implements bounded repetitions by macro expansion,
705 which is costly in time and space if counts are large
706 or bounded repetitions are nested.
707 An RE like, say,
708 .Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
709 will (eventually) run almost any existing machine out of swap space.
710 .Pp
711 There are suspected problems with response to obscure error conditions.
712 Notably,
713 certain kinds of internal overflow,
714 produced only by truly enormous REs or by multiply nested bounded repetitions,
715 are probably not handled well.
716 .Pp
717 Due to a mistake in
718 .St -p1003.2 ,
719 things like
720 .Ql "a)b"
721 are legal REs because
722 .Ql )\&
723 is
724 a special character only in the presence of a previous unmatched
725 .Ql (\& .
726 This can't be fixed until the spec is fixed.
727 .Pp
728 The standard's definition of back references is vague.
729 For example, does
730 .Ql "a\e(\e(b\e)*\e2\e)*d"
731 match
732 .Ql "abbbd" ?
733 Until the standard is clarified,
734 behavior in such cases should not be relied on.
735 .Pp
736 The implementation of word-boundary matching is a bit of a kludge,
737 and bugs may lurk in combinations of word-boundary matching and anchoring.