.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\" The Regents of the University of California. All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\" must display the following acknowledgement:
.\" This product includes software developed by the University of
.\" California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" @(#)regex.3 8.4 (Berkeley) 3/20/94
.\" $FreeBSD: src/lib/libc/regex/regex.3,v 1.17 2004/07/12 11:03:42 tjr Exp $
.\"
.Dd July 12, 2004
.Dt REGEX 3
.Os
.Sh NAME
.Nm regcomp ,
.Nm regexec ,
.Nm regerror ,
.Nm regfree
.Nd regular-expression library
.Sh LIBRARY
.Lb libc
.Sh SYNOPSIS
.In regex.h
.Ft int
.Fo regcomp
.Fa "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
.Fc
.Ft int
.Fo regexec
.Fa "const regex_t * restrict preg" "const char * restrict string"
.Fa "size_t nmatch" "regmatch_t pmatch[restrict]" "int eflags"
.Fc
.Ft size_t
.Fo regerror
.Fa "int errcode" "const regex_t * restrict preg"
.Fa "char * restrict errbuf" "size_t errbuf_size"
.Fc
.Ft void
.Fn regfree "regex_t *preg"
.Sh DESCRIPTION
These routines implement
.St -p1003.2
regular expressions
.Pq Do RE Dc Ns s ;
see
.Xr re_format 7 .
The
.Fn regcomp
function
compiles an RE written as a string into an internal form,
.Fn regexec
matches that internal form against a string and reports results,
.Fn regerror
transforms error codes from either into human-readable messages,
and
.Fn regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.Pp
The header
.In regex.h
declares two structure types,
.Ft regex_t
and
.Ft regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.Ft regoff_t ,
and a number of constants with names starting with
.Dq Dv REG_ .
.Pp
The
.Fn regcomp
function
compiles the regular expression contained in the
.Fa pattern
string,
subject to the flags in
.Fa cflags ,
and places the results in the
.Ft regex_t
structure pointed to by
.Fa preg .
The
.Fa cflags
argument
is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_EXTENDED
.It Dv REG_EXTENDED
Compile modern
.Pq Dq extended
REs,
rather than the obsolete
.Pq Dq basic
REs that
are the default.
.It Dv REG_BASIC
This is a synonym for 0,
provided as a counterpart to
.Dv REG_EXTENDED
to improve readability.
.It Dv REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the
.Dq RE
is a literal string.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Dv REG_EXTENDED
and
.Dv REG_NOSPEC
may not be used
in the same call to
.Fn regcomp .
.It Dv REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.Xr re_format 7 .
.It Dv REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.It Dv REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
.Ql [^
bracket expressions and
.Ql .\&
never match newline,
a
.Ql ^\&
anchor matches the null string after any newline in the string
in addition to its normal function,
and the
.Ql $\&
anchor matches the null string before any newline in the
string in addition to its normal function.
.It Dv REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.Va re_endp
member of the structure pointed to by
.Fa preg .
The
.Va re_endp
member is of type
.Ft "const char *" .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.El
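.Pp
As an illustration of the newline-sensitive matching described for REG_NEWLINE,
the following is a minimal sketch rather than part of the library.
The helper name line_b_matches and the fixed pattern "^b$" are assumptions
chosen for the example; only the regcomp, regexec, and regfree calls and the
REG_ flags come from this page.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the library): report whether the
 * extended RE "^b$" matches somewhere in string when compiled with the
 * given extra cflags.
 * Returns 1 for a match, 0 otherwise, -1 if compilation fails.
 */
static int
line_b_matches(const char *string, int extra_cflags)
{
	regex_t re;
	int rc;

	assert(string != NULL);
	if (regcomp(&re, "^b$", REG_EXTENDED | REG_NOSUB | extra_cflags) != 0)
		return (-1);
	/* REG_NOSUB was used, so nmatch and pmatch are not needed. */
	rc = regexec(&re, string, 0, NULL, 0);
	regfree(&re);
	return (rc == 0 ? 1 : 0);
}
```

Called on a three-line string with "b" as the middle line, the helper is
expected to find a match only when REG_NEWLINE is passed, because without that
flag the anchors apply only to the start and end of the whole string.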
.Pp
When successful,
.Fn regcomp
returns 0 and fills in the structure pointed to by
.Fa preg .
One member of that structure
(other than
.Va re_endp )
is publicized:
.Va re_nsub ,
of type
.Ft size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
.Dv REG_NOSUB
flag was used).
If
.Fn regcomp
fails, it returns a non-zero error code;
see
.Sx DIAGNOSTICS .
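.Pp
The return convention above can be sketched as follows.
This is an illustrative wrapper under stated assumptions: the name
compile_checked and its nsub output parameter are hypothetical and do not
appear in the library.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the library): compile pattern with
 * cflags and, when it is meaningful, report the number of parenthesized
 * subexpressions through *nsub.
 * Returns 0 on success or the non-zero regcomp() error code.
 */
static int
compile_checked(regex_t *preg, const char *pattern, int cflags, size_t *nsub)
{
	int rc;

	assert(preg != NULL && pattern != NULL && nsub != NULL);
	rc = regcomp(preg, pattern, cflags);
	if (rc != 0)
		return (rc);	/* *preg does not hold a compiled RE */
	/* re_nsub is undefined when REG_NOSUB was used, so skip it then. */
	if ((cflags & REG_NOSUB) == 0)
		*nsub = preg->re_nsub;
	return (0);
}
```

A caller would pass the returned code to regerror on failure and call regfree
only after a successful compilation.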
.Pp
The
.Fn regexec
function
matches the compiled RE pointed to by
.Fa preg
against the
.Fa string ,
subject to the flags in
.Fa eflags ,
and reports results using
.Fa nmatch ,
.Fa pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.Fn regcomp .
The compiled form is not altered during execution of
.Fn regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.Pp
By default,
the NUL-terminated string pointed to by
.Fa string
is considered to be the text of an entire line, minus any terminating
newline.
The
.Fa eflags
argument is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_STARTEND
.It Dv REG_NOTBOL
The first character of
the string
is not the beginning of a line, so the
.Ql ^\&
anchor should not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_NOTEOL
The NUL terminating
the string
does not end a line, so the
.Ql $\&
anchor should not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_STARTEND
The string is considered to start at
.Fa string
+
.Fa pmatch Ns [0]. Ns Va rm_so
and to have a terminating NUL located at
.Fa string
+
.Fa pmatch Ns [0]. Ns Va rm_eo
(there need not actually be a NUL at that location),
regardless of the value of
.Fa nmatch .
See below for the definition of
.Fa pmatch
and
.Fa nmatch .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Note that a non-zero
.Va rm_so
does not imply
.Dv REG_NOTBOL ;
.Dv REG_STARTEND
affects only the location of the string,
not how it is matched.
.El
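.Pp
The REG_STARTEND convention described above can be sketched as follows.
The helper name match_range and its off and len parameters are assumptions
made for the example, and the sketch relies on the REG_STARTEND extension
being available in the system's regex.h.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the library): match preg against the
 * len bytes starting at buf + off, without requiring a NUL at
 * buf + off + len.  Relies on the non-portable REG_STARTEND extension.
 * Returns the regexec() result; on success pm[0] holds offsets measured
 * from buf, not from buf + off.
 */
static int
match_range(const regex_t *preg, const char *buf, size_t off, size_t len,
    regmatch_t pm[1])
{
	assert(preg != NULL && buf != NULL && pm != NULL);
	/* pm[0] carries the input range in and the match offsets out. */
	pm[0].rm_so = (regoff_t)off;
	pm[0].rm_eo = (regoff_t)(off + len);
	return (regexec(preg, buf, 1, pm, REG_STARTEND));
}
```

This is useful for searching inside a larger buffer, such as a memory-mapped
file, that is not NUL-terminated at the point where the search should stop.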
.Pp
See
.Xr re_format 7
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.Fa string .
.Pp
Normally,
.Fn regexec
returns 0 for success and the non-zero code
.Dv REG_NOMATCH
for failure.
Other non-zero error codes may be returned in exceptional situations;
see
.Sx DIAGNOSTICS .
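.Pp
One common use of the return value together with REG_NOTBOL is a loop over
successive matches.
The following is a minimal sketch, not a definitive implementation: the helper
name count_matches is hypothetical, the RE is assumed to have been compiled
without REG_NOSUB, and any return value other than 0 simply ends the loop.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the library): count successive,
 * non-overlapping matches of preg in string.  Assumes preg was compiled
 * without REG_NOSUB so that pm[0] is filled in.
 */
static int
count_matches(const regex_t *preg, const char *string)
{
	regmatch_t pm[1];
	const char *p = string;
	int eflags = 0, n = 0;

	assert(preg != NULL && string != NULL);
	while (regexec(preg, p, 1, pm, eflags) == 0) {
		n++;
		if (pm[0].rm_eo == 0) {
			/* Empty match at p: step one byte to make progress. */
			if (*p == '\0')
				break;
			p++;
		} else
			p += pm[0].rm_eo;
		/* Later searches do not start at the beginning of a line. */
		eflags = REG_NOTBOL;
	}
	return (n);
}
```

Because every search after the first passes REG_NOTBOL, an RE such as
.Ql ^a
is counted once in
.Ql aaa
rather than three times.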
.Pp
If
.Dv REG_NOSUB
was specified in the compilation of the RE,
or if
.Fa nmatch
is 0,
.Fn regexec
ignores the
.Fa pmatch
argument (but see below for the case where
.Dv REG_STARTEND
is specified).
Otherwise,
.Fa pmatch
points to an array of
.Fa nmatch
structures of type
.Ft regmatch_t .
Such a structure has at least the members
.Va rm_so
and
.Va rm_eo ,
both of type
.Ft regoff_t
(a signed arithmetic type at least as large as an
.Ft off_t
and a
.Ft ssize_t ) ,
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.Fa string
argument given to
.Fn regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.Pp
The 0th member of the
.Fa pmatch
array is filled in to indicate what substring of
.Fa string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.Va i
reports subexpression
.Va i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array (corresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is,
.Va i
>
.Fa preg Ns -> Ns Va re_nsub ) )
have both
.Va rm_so
and
.Va rm_eo
set to -1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as a particular example, that when the RE
.Ql "(b*)+"
matches
.Ql bbb ,
the parenthesized subexpression matches each of the three
.So Li b Sc Ns s
and then
an infinite number of empty strings following the last
.Ql b ,
so the reported substring is one of the empties.)
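.Pp
The offset conventions above can be sketched with a small helper that copies
one reported substring out of the subject string.
The name copy_match and its buffer parameters are assumptions made for this
example; only the regmatch_t members rm_so and rm_eo come from this page.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the library): copy the substring
 * reported in *m into out, which has room for outsz bytes.
 * Returns 0 on success, -1 if the entry is unused (offsets of -1) or
 * the substring plus its terminating NUL does not fit.
 */
static int
copy_match(const char *string, const regmatch_t *m, char *out, size_t outsz)
{
	size_t len;

	assert(string != NULL && m != NULL && out != NULL);
	if (outsz == 0)
		return (-1);
	out[0] = '\0';
	if (m->rm_so == -1 || m->rm_eo == -1)
		return (-1);	/* subexpression did not participate */
	/* rm_eo is the offset of the first character after the match. */
	len = (size_t)(m->rm_eo - m->rm_so);
	if (len >= outsz)
		return (-1);
	memcpy(out, string + m->rm_so, len);
	out[len] = '\0';
	return (0);
}
```

For the extended RE
.Ql "(a)|(b)"
matched against
.Ql b ,
entry 1 is unused and the helper returns -1 for it, while entry 2 yields
.Ql b .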
.Pp
If
.Dv REG_STARTEND
is specified,
.Fa pmatch
must point to at least one
.Ft regmatch_t
(even if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified),
to hold the input offsets for
.Dv REG_STARTEND .
Use for output is still entirely controlled by
.Fa nmatch ;
if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified,
the value of
.Fa pmatch Ns [0]
will not be changed by a successful
.Fn regexec .
.Pp
The
.Fn regerror
function
maps a non-zero
.Fa errcode
from either
.Fn regcomp
or
.Fn regexec
to a human-readable, printable message.
If
.Fa preg
is
.No non\- Ns Dv NULL ,
the error code should have arisen from use of
the
.Ft regex_t
pointed to by
.Fa preg ,
and if the error code came from
.Fn regcomp ,
it should have been the result from the most recent
.Fn regcomp
using that
.Ft regex_t .
.Po
The
.Fn regerror
function may be able to supply a more detailed message using information
from the
.Ft regex_t .
.Pc
The
.Fn regerror
function
places the NUL-terminated message into the buffer pointed to by
.Fa errbuf ,
limiting the length (including the NUL) to at most
.Fa errbuf_size
bytes.
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.Fa errbuf_size
is 0,
.Fa errbuf
is ignored but the return value is still correct.
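.Pp
The sizing behavior described above allows a two-call pattern: ask for the
required size first, then fetch the message.
The following is a minimal sketch; the helper name regerror_dup is
hypothetical and is not provided by the library.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the library): return a malloc()ed
 * copy of the message for errcode, or NULL if allocation fails.
 * The caller frees the result.
 */
static char *
regerror_dup(int errcode, const regex_t *preg)
{
	size_t need;
	char *msg;

	assert(errcode != 0);
	/* With errbuf_size 0 the buffer is ignored; only the size is returned. */
	need = regerror(errcode, preg, NULL, 0);
	msg = malloc(need);
	if (msg == NULL)
		return (NULL);
	/* need includes the terminating NUL, so the whole message fits. */
	(void)regerror(errcode, preg, msg, need);
	return (msg);
}
```

A fixed-size stack buffer is often sufficient in practice; the two-call form
is only needed when a truncated message is unacceptable.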
.Pp
If the
.Fa errcode
given to
.Fn regerror
is first ORed with
.Dv REG_ITOA ,
the
.Dq message
that results is the printable name of the error code,
e.g.\&
.Dq Dv REG_NOMATCH ,
rather than an explanation thereof.
If
.Fa errcode
is
.Dv REG_ATOI ,
then
.Fa preg
shall be
.No non\- Ns Dv NULL
and the
.Va re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.Fa errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
.Dv REG_ITOA
and
.Dv REG_ATOI
are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.Pp
The
.Fn regfree
function
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.Fa preg .
The remaining
.Ft regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.Fn regexec
or
.Fn regerror
is undefined.
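.Pp
Putting the four functions' lifecycle together, a one-shot match might look
like the sketch below.
The helper name matches_once and its return convention are assumptions made
for illustration; a program that reuses a pattern would normally compile it
once and keep the regex_t.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the library): compile pattern as an
 * extended RE, test it against string, and release the compiled form.
 * Returns 1 for a match, 0 for REG_NOMATCH, -1 for any other error.
 */
static int
matches_once(const char *pattern, const char *string)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && string != NULL);
	/* REG_NOSUB: only success or failure is needed, not offsets. */
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return (-1);
	rc = regexec(&re, string, 0, NULL, 0);
	regfree(&re);	/* re must not be passed to regexec() after this */
	if (rc == 0)
		return (1);
	return (rc == REG_NOMATCH ? 0 : -1);
}
```

Note that regfree is called only after a successful regcomp, and that the
result of regexec is saved before the compiled RE is released.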
.Pp
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.Sh IMPLEMENTATION CHOICES
There are a number of decisions that
.St -p1003.2
leaves up to the implementor,
either by explicitly saying
.Dq undefined
or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.Pp
See
.Xr re_format 7
for a discussion of the definition of case-independent matching.
.Pp
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See
.Sx BUGS
for one short RE using them
that will run almost any system out of memory.
.Pp
A backslashed character other than one specifically given a magic meaning
by
.St -p1003.2
(such magic meanings occur only in obsolete
.Bq Dq basic
REs)
is taken as an ordinary character.
.Pp
Any unmatched
.Ql [\&
is a
.Dv REG_EBRACK
error.
.Pp
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.Pp
.Dv RE_DUP_MAX ,
the limit on repetition counts in bounded repetitions, is 255.
.Pp
A repetition operator
.Ql ( ?\& ,
.Ql *\& ,
.Ql +\& ,
or bounds)
cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow
.Ql ^\&
or
.Ql |\& .
.Pp
.Ql |\&
cannot appear first or last in a (sub)expression or after another
.Ql |\& ,
i.e., an operand of
.Ql |\&
cannot be an empty subexpression.
An empty parenthesized subexpression,
.Ql "()" ,
is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.Pp
A
.Ql {\&
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A
.Ql {\&
.Em not
followed by a digit is considered an ordinary character.
.Pp
.Ql ^\&
and
.Ql $\&
beginning and ending subexpressions in obsolete
.Pq Dq basic
REs are anchors, not ordinary characters.
.Sh SEE ALSO
.Xr grep 1 ,
.Xr re_format 7
.Pp
.St -p1003.2 ,
sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.Sh DIAGNOSTICS
Non-zero error codes from
.Fn regcomp
and
.Fn regexec
include the following:
.Pp
.Bl -tag -width REG_ECOLLATE -compact
.It Dv REG_NOMATCH
The
.Fn regexec
function
failed to match
.It Dv REG_BADPAT
invalid regular expression
.It Dv REG_ECOLLATE
invalid collating element
.It Dv REG_ECTYPE
invalid character class
.It Dv REG_EESCAPE
.Ql \e
applied to unescapable character
.It Dv REG_ESUBREG
invalid backreference number
.It Dv REG_EBRACK
brackets
.Ql "[ ]"
not balanced
.It Dv REG_EPAREN
parentheses
.Ql "( )"
not balanced
.It Dv REG_EBRACE
braces
.Ql "{ }"
not balanced
.It Dv REG_BADBR
invalid repetition count(s) in
.Ql "{ }"
.It Dv REG_ERANGE
invalid character range in
.Ql "[ ]"
.It Dv REG_ESPACE
ran out of memory
.It Dv REG_BADRPT
.Ql ?\& ,
.Ql *\& ,
or
.Ql +\&
operand invalid
.It Dv REG_EMPTY
empty (sub)expression
.It Dv REG_ASSERT
can't happen - you found a bug
.It Dv REG_INVARG
invalid argument, e.g.\& negative-length string
.It Dv REG_ILLSEQ
illegal byte sequence (bad multibyte character)
.El
.Sh HISTORY
Originally written by
.An Henry Spencer .
Altered for inclusion in the
.Bx 4.4
distribution.
.Sh BUGS
This is an alpha release with known defects.
Please report problems.
.Pp
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.Pp
The performance of the
.Fn regexec
function
is poor.
This will improve with later releases.
An
.Fa nmatch
argument
exceeding 0 is expensive;
.Fa nmatch
exceeding 1 is worse.
The
.Fn regexec
function
is largely insensitive to RE complexity
.Em except
that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.Pp
The
.Fn regcomp
function
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
will (eventually) run almost any existing machine out of swap space.
.Pp
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.Pp
Due to a mistake in
.St -p1003.2 ,
things like
.Ql "a)b"
are legal REs because
.Ql )\&
is
a special character only in the presence of a previous unmatched
.Ql (\& .
This can't be fixed until the spec is fixed.
.Pp
The standard's definition of back references is vague.
For example, does
.Ql "a\e(\e(b\e)*\e2\e)*d"
match
.Ql "abbbd" ?
Until the standard is clarified,
behavior in such cases should not be relied on.
.Pp
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.
727 .Pp
728 The implementation of word-boundary matching is a bit of a kludge,
729 and bugs may lurk in combinations of word-boundary matching and anchoring.