1 .\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
2 .\" Copyright (c) 1992, 1993, 1994
3 .\" The Regents of the University of California. All rights reserved.
4 .\"
5 .\" This code is derived from software contributed to Berkeley by
6 .\" Henry Spencer.
7 .\"
8 .\" Redistribution and use in source and binary forms, with or without
9 .\" modification, are permitted provided that the following conditions
10 .\" are met:
11 .\" 1. Redistributions of source code must retain the above copyright
12 .\" notice, this list of conditions and the following disclaimer.
13 .\" 2. Redistributions in binary form must reproduce the above copyright
14 .\" notice, this list of conditions and the following disclaimer in the
15 .\" documentation and/or other materials provided with the distribution.
16 .\" 4. Neither the name of the University nor the names of its contributors
17 .\" may be used to endorse or promote products derived from this software
18 .\" without specific prior written permission.
19 .\"
20 .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
21 .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22 .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
23 .\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
24 .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25 .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
26 .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
27 .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
28 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
29 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
30 .\" SUCH DAMAGE.
31 .\"
32 .\" @(#)regex.3 8.4 (Berkeley) 3/20/94
33 .\" $FreeBSD: src/lib/libc/regex/regex.3,v 1.21 2007/01/09 00:28:04 imp Exp $
34 .\"
35 .Dd August 17, 2005
36 .Dt REGEX 3
37 .Os
38 .Sh NAME
39 .Nm regcomp ,
40 .Nm regerror ,
41 .Nm regexec ,
42 .Nm regfree
43 .Nd regular-expression library
44 .Sh LIBRARY
45 .Lb libc
46 .Sh SYNOPSIS
47 .In regex.h
48 .Ft int
49 .Fo regcomp
50 .Fa "regex_t *restrict preg"
51 .Fa "const char *restrict pattern"
52 .Fa "int cflags"
53 .Fc
54 .Ft size_t
55 .Fo regerror
56 .Fa "int errcode"
57 .Fa "const regex_t *restrict preg"
58 .Fa "char *restrict errbuf"
59 .Fa "size_t errbuf_size"
60 .Fc
61 .Ft int
62 .Fo regexec
63 .Fa "const regex_t *restrict preg"
64 .Fa "const char *restrict string"
65 .Fa "size_t nmatch"
66 .Fa "regmatch_t pmatch[restrict]"
67 .Fa "int eflags"
68 .Fc
69 .Ft void
70 .Fo regfree
71 .Fa "regex_t *preg"
72 .Fc
73 .Sh DESCRIPTION
74 These routines implement
75 .St -p1003.2
76 regular expressions
77 .Pq Do RE Dc Ns s ;
78 see
79 .Xr re_format 7 .
80 The
81 .Fn regcomp
82 function
83 compiles an RE, written as a string, into an internal form.
84 .Fn regexec
85 matches that internal form against a string and reports results.
86 .Fn regerror
87 transforms error codes from either into human-readable messages.
88 .Fn regfree
89 frees any dynamically-allocated storage used by the internal form
90 of an RE.
91 .Pp
92 The header
93 .In regex.h
94 declares two structure types,
95 .Ft regex_t
96 and
97 .Ft regmatch_t ,
98 the former for compiled internal forms and the latter for match reporting.
99 It also declares the four functions,
100 a type
101 .Ft regoff_t ,
102 and a number of constants with names starting with
103 .Dq Dv REG_ .
104 .Pp
105 The
106 .Fn regcomp
107 function
108 compiles the regular expression contained in the
109 .Fa pattern
110 string,
111 subject to the flags in
112 .Fa cflags ,
113 and places the results in the
114 .Ft regex_t
115 structure pointed to by
116 .Fa preg .
117 The
118 .Fa cflags
119 argument
120 is the bitwise OR of zero or more of the following flags:
121 .Bl -tag -width REG_EXTENDED
122 .It Dv REG_EXTENDED
123 Compile modern
124 .Pq Dq extended
125 REs,
126 rather than the obsolete
127 .Pq Dq basic
128 REs that
129 are the default.
130 .It Dv REG_BASIC
131 This is a synonym for 0,
132 provided as a counterpart to
133 .Dv REG_EXTENDED
134 to improve readability.
135 .It Dv REG_NOSPEC
136 Compile with recognition of all special characters turned off.
137 All characters are thus considered ordinary,
138 so the
139 .Dq RE
140 is a literal string.
141 This is an extension,
142 compatible with but not specified by
143 .St -p1003.2 ,
144 and should be used with
145 caution in software intended to be portable to other systems.
146 .Dv REG_EXTENDED
147 and
148 .Dv REG_NOSPEC
149 may not be used
150 in the same call to
151 .Fn regcomp .
152 .It Dv REG_ICASE
153 Compile for matching that ignores upper/lower case distinctions.
154 See
155 .Xr re_format 7 .
156 .It Dv REG_NOSUB
157 Compile for matching that need only report success or failure,
158 not what was matched.
159 .It Dv REG_NEWLINE
160 Compile for newline-sensitive matching.
161 By default, newline is a completely ordinary character with no special
162 meaning in either REs or strings.
163 With this flag,
164 .Ql [^
165 bracket expressions and
166 .Ql .\&
167 never match newline,
168 a
169 .Ql ^\&
170 anchor matches the null string after any newline in the string
171 in addition to its normal function,
172 and the
173 .Ql $\&
174 anchor matches the null string before any newline in the
175 string in addition to its normal function.
176 .It Dv REG_PEND
177 The regular expression ends,
178 not at the first NUL,
179 but just before the character pointed to by the
180 .Va re_endp
181 member of the structure pointed to by
182 .Fa preg .
183 The
184 .Va re_endp
185 member is of type
186 .Ft "const char *" .
187 This flag permits inclusion of NULs in the RE;
188 they are considered ordinary characters.
189 This is an extension,
190 compatible with but not specified by
191 .St -p1003.2 ,
192 and should be used with
193 caution in software intended to be portable to other systems.
194 .El
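.Pp
As a minimal sketch of how these flags combine, the hypothetical helper
.Fn matches
below (an illustration, not part of the library) compiles a pattern with
caller-supplied
.Fa cflags
plus
.Dv REG_NOSUB
and reports only whether the string matched.
The helper name and its return convention are assumptions made for this example.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 on match, 0 on REG_NOMATCH, -1 on any error. */
int matches(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    /* REG_NOSUB: only success or failure is wanted, not offsets. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```
.Ed
Passing
.Dv REG_EXTENDED | REG_ICASE
makes the comparison case-insensitive, and adding
.Dv REG_NEWLINE
lets the anchors match at line boundaries inside a multi-line string.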
195 .Pp
196 When successful,
197 .Fn regcomp
198 returns 0 and fills in the structure pointed to by
199 .Fa preg .
200 One member of that structure
201 (other than
202 .Va re_endp )
203 is publicized:
204 .Va re_nsub ,
205 of type
206 .Ft size_t ,
207 contains the number of parenthesized subexpressions within the RE
208 (except that the value of this member is undefined if the
209 .Dv REG_NOSUB
210 flag was used).
211 If
212 .Fn regcomp
213 fails, it returns a non-zero error code;
214 see
215 .Sx DIAGNOSTICS .
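.Pp
The following sketch reads
.Va re_nsub
after a successful compilation.
The helper
.Fn count_groups
is hypothetical and exists only for illustration; it compiles without
.Dv REG_NOSUB
so that the member is defined.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: stores the subexpression count in *nsub.
 * Returns 0 on success or the non-zero regcomp() error code. */
int count_groups(const char *pattern, size_t *nsub)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && nsub != NULL);
    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;          /* re is not a valid compiled RE here */
    *nsub = re.re_nsub;     /* number of parenthesized subexpressions */
    regfree(&re);
    return 0;
}
```
.Ed
A caller could use the count to size a
.Ft regmatch_t
array with one extra slot for the whole match.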
216 .Pp
217 The
218 .Fn regexec
219 function
220 matches the compiled RE pointed to by
221 .Fa preg
222 against the
223 .Fa string ,
224 subject to the flags in
225 .Fa eflags ,
226 and reports results using
227 .Fa nmatch ,
228 .Fa pmatch ,
229 and the returned value.
230 The RE must have been compiled by a previous invocation of
231 .Fn regcomp .
232 The compiled form is not altered during execution of
233 .Fn regexec ,
234 so a single compiled RE can be used simultaneously by multiple threads.
235 .Pp
236 By default,
237 the NUL-terminated string pointed to by
238 .Fa string
239 is considered to be the text of an entire line, minus any terminating
240 newline.
241 The
242 .Fa eflags
243 argument is the bitwise OR of zero or more of the following flags:
244 .Bl -tag -width REG_STARTEND
245 .It Dv REG_NOTBOL
246 The first character of
247 the string
248 is not the beginning of a line, so the
249 .Ql ^\&
250 anchor should not match before it.
251 This does not affect the behavior of newlines under
252 .Dv REG_NEWLINE .
253 .It Dv REG_NOTEOL
254 The NUL terminating
255 the string
256 does not end a line, so the
257 .Ql $\&
258 anchor should not match before it.
259 This does not affect the behavior of newlines under
260 .Dv REG_NEWLINE .
261 .It Dv REG_STARTEND
262 The string is considered to start at
263 .Fa string
264 +
265 .Fa pmatch Ns [0]. Ns Va rm_so
266 and to have a terminating NUL located at
267 .Fa string
268 +
269 .Fa pmatch Ns [0]. Ns Va rm_eo
270 (there need not actually be a NUL at that location),
271 regardless of the value of
272 .Fa nmatch .
273 See below for the definition of
274 .Fa pmatch
275 and
276 .Fa nmatch .
277 This is an extension,
278 compatible with but not specified by
279 .St -p1003.2 ,
280 and should be used with
281 caution in software intended to be portable to other systems.
282 Note that a non-zero
283 .Va rm_so
284 does not imply
285 .Dv REG_NOTBOL ;
286 .Dv REG_STARTEND
287 affects only the location of the string,
288 not how it is matched.
289 .El
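.Pp
One common use of
.Dv REG_NOTBOL
is to continue a search partway through a line.
The sketch below counts successive matches by restarting
.Fn regexec
just past each match and setting
.Dv REG_NOTBOL
so that
.Ql ^
cannot match at the restart point.
The helper
.Fn count_matches
and its handling of empty matches are assumptions made for this example.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: number of successive matches, or -1 if the
 * pattern does not compile. */
int count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    const char *p = text;
    int eflags = 0;
    int count = 0;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    while (regexec(&re, p, 1, &m, eflags) == 0) {
        size_t step = (size_t)m.rm_eo;

        count++;
        if (m.rm_eo == m.rm_so) {   /* empty match: step past one character */
            if (p[step] == 0)
                break;
            step++;
        }
        p += step;
        eflags = REG_NOTBOL;        /* p no longer starts a line */
    }
    regfree(&re);
    return count;
}
```
.Ed
Without
.Dv REG_NOTBOL ,
a pattern such as
.Ql ^a
would match again at every restart point.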
290 .Pp
291 See
292 .Xr re_format 7
293 for a discussion of what is matched in situations where an RE or a
294 portion thereof could match any of several substrings of
295 .Fa string .
296 .Pp
297 Normally,
298 .Fn regexec
299 returns 0 for success and the non-zero code
300 .Dv REG_NOMATCH
301 for failure.
302 Other non-zero error codes may be returned in exceptional situations;
303 see
304 .Sx DIAGNOSTICS .
305 .Pp
306 If
307 .Dv REG_NOSUB
308 was specified in the compilation of the RE,
309 or if
310 .Fa nmatch
311 is 0,
312 .Fn regexec
313 ignores the
314 .Fa pmatch
315 argument (but see below for the case where
316 .Dv REG_STARTEND
317 is specified).
318 Otherwise,
319 .Fa pmatch
320 points to an array of
321 .Fa nmatch
322 structures of type
323 .Ft regmatch_t .
324 Such a structure has at least the members
325 .Va rm_so
326 and
327 .Va rm_eo ,
328 both of type
329 .Ft regoff_t
330 (a signed arithmetic type at least as large as an
331 .Ft off_t
332 and a
333 .Ft ssize_t ) ,
334 containing respectively the offset of the first character of a substring
335 and the offset of the first character after the end of the substring.
336 Offsets are measured from the beginning of the
337 .Fa string
338 argument given to
339 .Fn regexec .
340 An empty substring is denoted by equal offsets,
341 both indicating the character following the empty substring.
342 .Pp
343 The 0th member of the
344 .Fa pmatch
345 array is filled in to indicate what substring of
346 .Fa string
347 was matched by the entire RE.
348 Remaining members report what substring was matched by parenthesized
349 subexpressions within the RE;
350 member
351 .Va i
352 reports subexpression
353 .Va i ,
354 with subexpressions counted (starting at 1) by the order of their opening
355 parentheses in the RE, left to right.
356 Unused entries in the array (corresponding either to subexpressions that
357 did not participate in the match at all, or to subexpressions that do not
358 exist in the RE (that is,
359 .Va i
360 >
361 .Fa preg Ns -> Ns Va re_nsub ) )
362 have both
363 .Va rm_so
364 and
365 .Va rm_eo
366 set to -1.
367 If a subexpression participated in the match several times,
368 the reported substring is the last one it matched.
369 (Note, for example, that when the RE
370 .Ql "(b*)+"
371 matches
372 .Ql bbb ,
373 the parenthesized subexpression matches each of the three
374 .So Li b Sc Ns s
375 and then
376 an infinite number of empty strings following the last
377 .Ql b ,
378 so the reported substring is one of the empties.)
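.Pp
To illustrate how the offsets are used, the sketch below copies the text of
one subexpression out of the matched string and distinguishes an unused entry
(offsets of -1) from an empty match.
The helper
.Fn group_text ,
the
.Dv MAX_GROUPS
limit, and the return convention are hypothetical.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <string.h>

#define MAX_GROUPS 8    /* illustrative limit, not part of the API */

/* Hypothetical helper: copies subexpression `group` into out.
 * Returns its (possibly truncated) length, -1 if the entry is unused,
 * or -2 on compile failure or no match. */
long group_text(const char *pattern, const char *text, size_t group,
                char *out, size_t outsz)
{
    regex_t re;
    regmatch_t m[MAX_GROUPS];
    long len = -2;

    assert(group < MAX_GROUPS && out != NULL && outsz > 0);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -2;
    if (regexec(&re, text, MAX_GROUPS, m, 0) == 0) {
        if (m[group].rm_so == -1) {
            len = -1;                   /* did not participate */
        } else {
            size_t n = (size_t)(m[group].rm_eo - m[group].rm_so);

            if (n >= outsz)
                n = outsz - 1;          /* truncate to fit the buffer */
            memcpy(out, text + m[group].rm_so, n);
            out[n] = 0;
            len = (long)n;
        }
    }
    regfree(&re);
    return len;
}
```
.Ed
Index 0 selects the whole match; higher indexes select subexpressions in
order of their opening parentheses.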
379 .Pp
380 If
381 .Dv REG_STARTEND
382 is specified,
383 .Fa pmatch
384 must point to at least one
385 .Ft regmatch_t
386 (even if
387 .Fa nmatch
388 is 0 or
389 .Dv REG_NOSUB
390 was specified),
391 to hold the input offsets for
392 .Dv REG_STARTEND .
393 Use for output is still entirely controlled by
394 .Fa nmatch ;
395 if
396 .Fa nmatch
397 is 0 or
398 .Dv REG_NOSUB
399 was specified,
400 the value of
401 .Fa pmatch Ns [0]
402 will not be changed by a successful
403 .Fn regexec .
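.Pp
A sketch of
.Dv REG_STARTEND
in use follows.
It assumes the extension is available on the target system; the helper
.Fn find_in_range
is hypothetical.
The first array element carries the range in, and on success carries the
match offsets out, measured from the start of the buffer.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: offset (from buf) of the first match inside
 * buf[start..end), or -1 if there is no match or an error occurs. */
long find_in_range(const char *pattern, const char *buf,
                   size_t start, size_t end)
{
    regex_t re;
    regmatch_t m;
    long pos = -1;

    assert(pattern != NULL && buf != NULL && start <= end);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    m.rm_so = (regoff_t)start;      /* input: where the string starts */
    m.rm_eo = (regoff_t)end;        /* input: where the virtual NUL sits */
    if (regexec(&re, buf, 1, &m, REG_STARTEND) == 0)
        pos = (long)m.rm_so;        /* output: offset measured from buf */
    regfree(&re);
    return pos;
}
```
.Ed
Because the end of the range acts as the terminator, this approach can search
a region of a buffer that is not NUL-terminated.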
404 .Pp
405 The
406 .Fn regerror
407 function
408 maps a non-zero
409 .Fa errcode
410 from either
411 .Fn regcomp
412 or
413 .Fn regexec
414 to a human-readable, printable message.
415 If
416 .Fa preg
417 is
418 .No non\- Ns Dv NULL ,
419 the error code should have arisen from use of
420 the
421 .Ft regex_t
422 pointed to by
423 .Fa preg ,
424 and if the error code came from
425 .Fn regcomp ,
426 it should have been the result from the most recent
427 .Fn regcomp
428 using that
429 .Ft regex_t .
430 (The
431 .Fn regerror
432 function may be able to supply a more detailed message using information
433 from the
434 .Ft regex_t . )
435 The
436 .Fn regerror
437 function
438 places the NUL-terminated message into the buffer pointed to by
439 .Fa errbuf ,
440 limiting the length (including the NUL) to at most
441 .Fa errbuf_size
442 bytes.
443 If the whole message will not fit,
444 as much of it as will fit before the terminating NUL is supplied.
445 In any case,
446 the returned value is the size of buffer needed to hold the whole
447 message (including terminating NUL).
448 If
449 .Fa errbuf_size
450 is 0,
451 .Fa errbuf
452 is ignored but the return value is still correct.
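.Pp
The size-query behavior described above suggests a two-call pattern: ask for
the required size with a zero-length buffer, allocate, then fetch the message.
The sketch below shows one way to do this; the helper
.Fn regex_error_message
is hypothetical, and the caller is assumed to free the result.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc()ed message for errcode,
 * or NULL if allocation fails.  The caller frees the result. */
char *regex_error_message(int errcode, const regex_t *preg)
{
    size_t need;
    char *msg;

    assert(errcode != 0);
    need = regerror(errcode, preg, NULL, 0);    /* size query only */
    msg = malloc(need);
    if (msg != NULL)
        regerror(errcode, preg, msg, need);     /* fits, NUL included */
    return msg;
}
```
.Ed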
453 .Pp
454 If the
455 .Fa errcode
456 given to
457 .Fn regerror
458 is first ORed with
459 .Dv REG_ITOA ,
460 the
461 .Dq message
462 that results is the printable name of the error code,
463 e.g.\&
464 .Dq Dv REG_NOMATCH ,
465 rather than an explanation thereof.
466 If
467 .Fa errcode
468 is
469 .Dv REG_ATOI ,
470 then
471 .Fa preg
472 shall be
473 .No non\- Ns Dv NULL
474 and the
475 .Va re_endp
476 member of the structure it points to
477 must point to the printable name of an error code;
478 in this case, the result in
479 .Fa errbuf
480 is the decimal digits of
481 the numeric value of the error code
482 (0 if the name is not recognized).
483 .Dv REG_ITOA
484 and
485 .Dv REG_ATOI
486 are intended primarily as debugging facilities;
487 they are extensions,
488 compatible with but not specified by
489 .St -p1003.2 ,
490 and should be used with
491 caution in software intended to be portable to other systems.
492 Be warned also that they are considered experimental and changes are possible.
493 .Pp
494 The
495 .Fn regfree
496 function
497 frees any dynamically-allocated storage associated with the compiled RE
498 pointed to by
499 .Fa preg .
500 The remaining
501 .Ft regex_t
502 is no longer a valid compiled RE
503 and the effect of supplying it to
504 .Fn regexec
505 or
506 .Fn regerror
507 is undefined.
508 .Pp
509 None of these functions references global variables except for tables
510 of constants;
511 all are safe for use from multiple threads if the arguments are safe.
512 .Sh IMPLEMENTATION CHOICES
513 There are a number of decisions that
514 .St -p1003.2
515 leaves up to the implementor,
516 either by explicitly saying
517 .Dq undefined
518 or by virtue of them being
519 forbidden by the RE grammar.
520 This implementation treats them as follows.
521 .Pp
522 See
523 .Xr re_format 7
524 for a discussion of the definition of case-independent matching.
525 .Pp
526 There is no particular limit on the length of REs,
527 except insofar as memory is limited.
528 Memory usage is approximately linear in RE size, and largely insensitive
529 to RE complexity, except for bounded repetitions.
530 See
531 .Sx BUGS
532 for one short RE using them
533 that will run almost any system out of memory.
534 .Pp
535 A backslashed character other than one specifically given a magic meaning
536 by
537 .St -p1003.2
538 (such magic meanings occur only in obsolete
539 .Bq Dq basic
540 REs)
541 is taken as an ordinary character.
542 .Pp
543 Any unmatched
544 .Ql [\&
545 is a
546 .Dv REG_EBRACK
547 error.
548 .Pp
549 Equivalence classes cannot begin or end bracket-expression ranges.
550 The endpoint of one range cannot begin another.
551 .Pp
552 .Dv RE_DUP_MAX ,
553 the limit on repetition counts in bounded repetitions, is 255.
554 .Pp
555 A repetition operator
556 .Ql ( ?\& ,
557 .Ql *\& ,
558 .Ql +\& ,
559 or bounds)
560 cannot follow another
561 repetition operator.
562 A repetition operator cannot begin an expression or subexpression
563 or follow
564 .Ql ^\&
565 or
566 .Ql |\& .
567 .Pp
568 .Ql |\&
569 cannot appear first or last in a (sub)expression or after another
570 .Ql |\& ,
571 i.e., an operand of
572 .Ql |\&
573 cannot be an empty subexpression.
574 An empty parenthesized subexpression,
575 .Ql "()" ,
576 is legal and matches an
577 empty (sub)string.
578 An empty string is not a legal RE.
579 .Pp
580 A
581 .Ql {\&
582 followed by a digit is considered the beginning of bounds for a
583 bounded repetition, which must then follow the syntax for bounds.
584 A
585 .Ql {\&
586 .Em not
587 followed by a digit is considered an ordinary character.
588 .Pp
589 .Ql ^\&
590 and
591 .Ql $\&
592 beginning and ending subexpressions in obsolete
593 .Pq Dq basic
594 REs are anchors, not ordinary characters.
595 .Sh DIAGNOSTICS
596 Non-zero error codes from
597 .Fn regcomp
598 and
599 .Fn regexec
600 include the following:
601 .Pp
602 .Bl -tag -width REG_ECOLLATE -compact
603 .It Dv REG_NOMATCH
604 The
605 .Fn regexec
606 function
607 failed to match
608 .It Dv REG_BADPAT
609 invalid regular expression
610 .It Dv REG_ECOLLATE
611 invalid collating element
612 .It Dv REG_ECTYPE
613 invalid character class
614 .It Dv REG_EESCAPE
615 .Ql \e
616 applied to unescapable character
617 .It Dv REG_ESUBREG
618 invalid backreference number
619 .It Dv REG_EBRACK
620 brackets
621 .Ql "[ ]"
622 not balanced
623 .It Dv REG_EPAREN
624 parentheses
625 .Ql "( )"
626 not balanced
627 .It Dv REG_EBRACE
628 braces
629 .Ql "{ }"
630 not balanced
631 .It Dv REG_BADBR
632 invalid repetition count(s) in
633 .Ql "{ }"
634 .It Dv REG_ERANGE
635 invalid character range in
636 .Ql "[ ]"
637 .It Dv REG_ESPACE
638 ran out of memory
639 .It Dv REG_BADRPT
640 .Ql ?\& ,
641 .Ql *\& ,
642 or
643 .Ql +\&
644 operand invalid
645 .It Dv REG_EMPTY
646 empty (sub)expression
647 .It Dv REG_ASSERT
648 cannot happen - you found a bug
649 .It Dv REG_INVARG
650 invalid argument, e.g.\& negative-length string
651 .It Dv REG_ILLSEQ
652 illegal byte sequence (bad multibyte character)
653 .El
654 .Sh SEE ALSO
655 .Xr grep 1 ,
656 .Xr re_format 7
657 .Pp
658 .St -p1003.2 ,
659 sections 2.8 (Regular Expression Notation)
660 and
661 B.5 (C Binding for Regular Expression Matching).
662 .Sh HISTORY
663 Originally written by
664 .An Henry Spencer .
665 Altered for inclusion in the
666 .Bx 4.4
667 distribution.
668 .Sh BUGS
669 This is an alpha release with known defects.
670 Please report problems.
671 .Pp
672 The back-reference code is subtle and doubts linger about its correctness
673 in complex cases.
674 .Pp
675 The performance of the
676 .Fn regexec
677 function
678 is poor.
679 This will improve with later releases.
680 An
681 .Fa nmatch
682 argument
683 greater than 0 is expensive;
684 .Fa nmatch
685 greater than 1 is worse.
686 The
687 .Fn regexec
688 function
689 is largely insensitive to RE complexity
690 .Em except
691 that back
692 references are massively expensive.
693 RE length does matter; in particular, there is a strong speed bonus
694 for keeping RE length under about 30 characters,
695 with most special characters counting roughly double.
696 .Pp
697 The
698 .Fn regcomp
699 function
700 implements bounded repetitions by macro expansion,
701 which is costly in time and space if counts are large
702 or bounded repetitions are nested.
703 An RE like, say,
704 .Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
705 will (eventually) run almost any existing machine out of swap space.
706 .Pp
707 There are suspected problems with response to obscure error conditions.
708 Notably,
709 certain kinds of internal overflow,
710 produced only by truly enormous REs or by multiply nested bounded repetitions,
711 are probably not handled well.
712 .Pp
713 Due to a mistake in
714 .St -p1003.2 ,
715 things like
716 .Ql "a)b"
717 are legal REs because
718 .Ql )\&
719 is
720 a special character only in the presence of a previous unmatched
721 .Ql (\& .
722 This cannot be fixed until the spec is fixed.
723 .Pp
724 The standard's definition of back references is vague.
725 For example, does
726 .Ql "a\e(\e(b\e)*\e2\e)*d"
727 match
728 .Ql "abbbd" ?
729 Until the standard is clarified,
730 behavior in such cases should not be relied on.
731 .Pp
732 The implementation of word-boundary matching is a bit of a kludge,
733 and bugs may lurk in combinations of word-boundary matching and anchoring.
734 .Pp
735 Word-boundary matching does not work properly in multibyte locales.