.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\" The Regents of the University of California. All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\" must display the following acknowledgement:
.\" This product includes software developed by the University of
.\" California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" @(#)regex.3 8.4 (Berkeley) 3/20/94
.\" $FreeBSD: src/lib/libc/regex/regex.3,v 1.17 2004/07/12 11:03:42 tjr Exp $
.\"
.Dd July 12, 2004
.Dt REGEX 3
.Os
.Sh NAME
.Nm regcomp ,
.Nm regerror ,
.Nm regexec ,
.Nm regfree
.Nd regular-expression library
.Sh LIBRARY
.Lb libc
.Sh SYNOPSIS
.In regex.h
.Ft int
.Fo regcomp
.Fa "regex_t *restrict preg"
.Fa "const char *restrict pattern"
.Fa "int cflags"
.Fc
.Ft size_t
.Fo regerror
.Fa "int errcode"
.Fa "const regex_t *restrict preg"
.Fa "char *restrict errbuf"
.Fa "size_t errbuf_size"
.Fc
.Ft int
.Fo regexec
.Fa "const regex_t *restrict preg"
.Fa "const char *restrict string"
.Fa "size_t nmatch"
.Fa "regmatch_t pmatch[restrict]"
.Fa "int eflags"
.Fc
.Ft void
.Fo regfree
.Fa "regex_t *preg"
.Fc
.Sh DESCRIPTION
These routines implement
.St -p1003.2
regular expressions
.Pq Do RE Dc Ns s ;
see
.Xr re_format 7 .
The
.Fn regcomp
function
compiles an RE, written as a string, into an internal form.
.Fn regexec
matches that internal form against a string and reports results.
.Fn regerror
transforms error codes from either into human-readable messages.
.Fn regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.Pp
The header
.In regex.h
declares two structure types,
.Ft regex_t
and
.Ft regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.Ft regoff_t ,
and a number of constants with names starting with
.Dq Dv REG_ .
.Pp
The
.Fn regcomp
function
compiles the regular expression contained in the
.Fa pattern
string,
subject to the flags in
.Fa cflags ,
and places the results in the
.Ft regex_t
structure pointed to by
.Fa preg .
The
.Fa cflags
argument
is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_EXTENDED
.It Dv REG_EXTENDED
Compile modern
.Pq Dq extended
REs,
rather than the obsolete
.Pq Dq basic
REs that
are the default.
.It Dv REG_BASIC
This is a synonym for 0,
provided as a counterpart to
.Dv REG_EXTENDED
to improve readability.
.It Dv REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the
.Dq RE
is a literal string.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Dv REG_EXTENDED
and
.Dv REG_NOSPEC
may not be used
in the same call to
.Fn regcomp .
.It Dv REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.Xr re_format 7 .
.It Dv REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.It Dv REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
.Ql [^
bracket expressions and
.Ql .\&
never match newline,
a
.Ql ^\&
anchor matches the null string after any newline in the string
in addition to its normal function,
and the
.Ql $\&
anchor matches the null string before any newline in the
string in addition to its normal function.
.It Dv REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.Va re_endp
member of the structure pointed to by
.Fa preg .
The
.Va re_endp
member is of type
.Ft "const char *" .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.El
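.Pp
As a minimal sketch of how the portable flags above combine,
the following helper compiles an extended RE with extra
.Fa cflags
and reports whether it matches.
The name
.Fn matches_with_flags
and its return convention are assumptions made for this illustration;
they are not part of the library.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the library): compile `pattern` as an
 * extended RE with the extra `cflags`, then test it against `text`.
 * Returns 1 on a match, 0 on REG_NOMATCH, -1 on any other error.
 */
int
matches_with_flags(const char *pattern, const char *text, int cflags)
{
	regex_t re;
	int rc;

	/* REG_NOSUB: only success or failure is needed here. */
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | cflags) != 0)
		return (-1);
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	if (rc == 0)
		return (1);
	return (rc == REG_NOMATCH ? 0 : -1);
}
```

With this helper,
.Ql ^b$
is expected to match the three-line string
.Ql "a\enb\enc"
only when
.Dv REG_NEWLINE
is passed, while
.Ql a.b
is expected to match
.Ql "a\enb"
only when it is not.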
.Pp
When successful,
.Fn regcomp
returns 0 and fills in the structure pointed to by
.Fa preg .
One member of that structure
(other than
.Va re_endp )
is publicized:
.Va re_nsub ,
of type
.Ft size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
.Dv REG_NOSUB
flag was used).
If
.Fn regcomp
fails, it returns a non-zero error code;
see
.Sx DIAGNOSTICS .
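.Pp
The return convention and the
.Va re_nsub
member can be sketched as follows.
The helper name
.Fn count_subexpressions
and its use of \-1 as an error value are assumptions made for this example only.

```c
#include <regex.h>

/*
 * Hypothetical helper (not part of the library): return the number of
 * parenthesized subexpressions in the extended RE `pattern`, or -1 if
 * regcomp() rejects it.
 */
long
count_subexpressions(const char *pattern)
{
	regex_t re;
	long nsub;

	/* REG_NOSUB is deliberately not used, so re_nsub is defined. */
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-1);	/* re is not a valid compiled RE here */
	nsub = (long)re.re_nsub;
	regfree(&re);
	return (nsub);
}
```

A caller that needs the specific failure reason would keep the
.Fn regcomp
return value instead of collapsing it to \-1.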
.Pp
The
.Fn regexec
function
matches the compiled RE pointed to by
.Fa preg
against the
.Fa string ,
subject to the flags in
.Fa eflags ,
and reports results using
.Fa nmatch ,
.Fa pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.Fn regcomp .
The compiled form is not altered during execution of
.Fn regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.Pp
By default,
the NUL-terminated string pointed to by
.Fa string
is considered to be the text of an entire line, minus any terminating
newline.
The
.Fa eflags
argument is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_STARTEND
.It Dv REG_NOTBOL
The first character of
the string
is not the beginning of a line, so the
.Ql ^\&
anchor should not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_NOTEOL
The NUL terminating
the string
does not end a line, so the
.Ql $\&
anchor should not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_STARTEND
The string is considered to start at
.Fa string
+
.Fa pmatch Ns [0]. Ns Va rm_so
and to have a terminating NUL located at
.Fa string
+
.Fa pmatch Ns [0]. Ns Va rm_eo
(there need not actually be a NUL at that location),
regardless of the value of
.Fa nmatch .
See below for the definition of
.Fa pmatch
and
.Fa nmatch .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Note that a non-zero
.Va rm_so
does not imply
.Dv REG_NOTBOL ;
.Dv REG_STARTEND
affects only the location of the string,
not how it is matched.
.El
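.Pp
The effect of
.Dv REG_NOTBOL
and
.Dv REG_NOTEOL
can be observed with a small wrapper such as the one below.
This is only a sketch; the name
.Fn exec_with_eflags
and its parameter order are assumptions, not part of the library.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the library): compile `pattern` as an
 * extended RE with the extra `cflags` and run it against `text` with
 * `eflags`.  Returns 0 on a match, REG_NOMATCH on no match, or another
 * non-zero error code from regcomp() or regexec().
 */
int
exec_with_eflags(const char *pattern, const char *text, int cflags,
    int eflags)
{
	regex_t re;
	int rc;

	rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | cflags);
	if (rc != 0)
		return (rc);
	rc = regexec(&re, text, 0, NULL, eflags);
	regfree(&re);
	return (rc);
}
```

For instance,
.Ql ^abc
against
.Ql abc
should succeed with an
.Fa eflags
of 0 and return
.Dv REG_NOMATCH
with
.Dv REG_NOTBOL ,
whereas under
.Dv REG_NEWLINE
the pattern
.Ql ^b
should still match after the embedded newline in
.Ql "a\enb" .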
.Pp
See
.Xr re_format 7
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.Fa string .
.Pp
Normally,
.Fn regexec
returns 0 for success and the non-zero code
.Dv REG_NOMATCH
for failure.
Other non-zero error codes may be returned in exceptional situations;
see
.Sx DIAGNOSTICS .
.Pp
If
.Dv REG_NOSUB
was specified in the compilation of the RE,
or if
.Fa nmatch
is 0,
.Fn regexec
ignores the
.Fa pmatch
argument (but see below for the case where
.Dv REG_STARTEND
is specified).
Otherwise,
.Fa pmatch
points to an array of
.Fa nmatch
structures of type
.Ft regmatch_t .
Such a structure has at least the members
.Va rm_so
and
.Va rm_eo ,
both of type
.Ft regoff_t
(a signed arithmetic type at least as large as an
.Ft off_t
and a
.Ft ssize_t ) ,
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.Fa string
argument given to
.Fn regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
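.Pp
Because offsets are relative to the
.Fa string
passed to each call, a caller that searches repeatedly has to advance its own pointer.
The following sketch counts non-overlapping matches using only entry 0 of the array,
which reports the overall match.
The helper name
.Fn count_matches
and the policy of stepping one character past an empty match are assumptions of this example.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the library): count non-overlapping
 * matches of the extended RE `pattern` in `text`.
 * Returns the count, or -1 if the pattern does not compile.
 */
long
count_matches(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t pm[1];
	const char *p = text;
	long count = 0;
	int eflags = 0;

	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-1);
	while (regexec(&re, p, 1, pm, eflags) == 0) {
		count++;
		/* Offsets are relative to p, the string given to this call. */
		if (pm[0].rm_eo == pm[0].rm_so) {
			/* Empty match: step over one character to make progress. */
			if (p[pm[0].rm_eo] == '\0')
				break;
			p += pm[0].rm_eo + 1;
		} else
			p += pm[0].rm_eo;
		/* Later searches no longer start at the beginning of a line. */
		eflags = REG_NOTBOL;
	}
	regfree(&re);
	return (count);
}
```

Passing
.Dv REG_NOTBOL
on every call after the first keeps a leading
.Ql ^\&
from matching again in the middle of the original string.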
.Pp
The 0th member of the
.Fa pmatch
array is filled in to indicate what substring of
.Fa string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.Va i
reports subexpression
.Va i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array (corresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is,
.Va i
>
.Fa preg Ns -> Ns Va re_nsub ) )
have both
.Va rm_so
and
.Va rm_eo
set to -1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as a particular example, that when the RE
.Ql "(b*)+"
matches
.Ql bbb ,
the parenthesized subexpression matches each of the three
.So Li b Sc Ns s
and then
an infinite number of empty strings following the last
.Ql b ,
so the reported substring is one of the empties.)
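.Pp
The layout of the
.Fa pmatch
array can be explored with a sketch like the following.
The helper
.Fn group_offset ,
the
.Dv MAX_GROUPS
capacity, and the \-2 error value are assumptions made for illustration.

```c
#include <regex.h>
#include <stddef.h>

#define MAX_GROUPS 10	/* arbitrary capacity chosen for this sketch */

/*
 * Hypothetical helper (not part of the library): match the extended RE
 * `pattern` against `text` and return rm_so (or rm_eo if `want_end` is
 * non-zero) of entry `idx`.  Returns -1 for an unused entry and -2 if
 * there is no match, the pattern is invalid, or idx is out of range.
 */
long
group_offset(const char *pattern, const char *text, size_t idx, int want_end)
{
	regex_t re;
	regmatch_t pm[MAX_GROUPS];
	long result = -2;

	if (idx >= MAX_GROUPS)
		return (-2);
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-2);
	/* Entries past re_nsub, and non-participating ones, are set to -1. */
	if (regexec(&re, text, MAX_GROUPS, pm, 0) == 0)
		result = (long)(want_end ? pm[idx].rm_eo : pm[idx].rm_so);
	regfree(&re);
	return (result);
}
```

Matching
.Ql "([a-z]+)=([0-9]+)"
against
.Ql "x: key=42"
should report the span 3 to 9 for entry 0, 3 to 6 for entry 1, and 7 to 9 for entry 2,
with entry 3 unused.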
.Pp
If
.Dv REG_STARTEND
is specified,
.Fa pmatch
must point to at least one
.Ft regmatch_t
(even if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified),
to hold the input offsets for
.Dv REG_STARTEND .
Use for output is still entirely controlled by
.Fa nmatch ;
if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified,
the value of
.Fa pmatch Ns [0]
will not be changed by a successful
.Fn regexec .
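.Pp
Since
.Dv REG_STARTEND
is an extension, portable code may want a fallback.
The sketch below searches only the byte range from
.Va start
up to (but not including)
.Va end ;
it uses
.Dv REG_STARTEND
where the header defines it and otherwise searches a NUL-terminated copy.
The helper name
.Fn find_in_range ,
its return values, and the fallback strategy are all assumptions of this example.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the library): find the extended RE
 * `pattern` within text[start..end).  Returns the offset of the match
 * from text[0], -1 if there is no match, -2 on any other error.
 */
long
find_in_range(const char *pattern, const char *text, size_t start,
    size_t end)
{
	regex_t re;
	regmatch_t pm[1];
	long result = -2;
	int rc;

	if (start > end || end > strlen(text))
		return (-2);
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-2);
#ifdef REG_STARTEND
	/* Input: the range to search.  Output: the match, from text[0]. */
	pm[0].rm_so = (regoff_t)start;
	pm[0].rm_eo = (regoff_t)end;
	rc = regexec(&re, text, 1, pm, REG_STARTEND);
	if (rc == 0)
		result = (long)pm[0].rm_so;
	else if (rc == REG_NOMATCH)
		result = -1;
#else
	{
		/* Fallback: search a NUL-terminated copy of the range. */
		size_t len = end - start;
		char *copy = malloc(len + 1);

		if (copy != NULL) {
			memcpy(copy, text + start, len);
			copy[len] = '\0';
			rc = regexec(&re, copy, 1, pm, 0);
			if (rc == 0)
				result = (long)(start + (size_t)pm[0].rm_so);
			else if (rc == REG_NOMATCH)
				result = -1;
			free(copy);
		}
	}
#endif
	regfree(&re);
	return (result);
}
```

The two branches are intended to agree for patterns without anchors;
with
.Ql ^\&
or
.Ql $\&
the copy-based fallback treats the range boundaries as line boundaries,
which is a simplification of this sketch.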
.Pp
The
.Fn regerror
function
maps a non-zero
.Fa errcode
from either
.Fn regcomp
or
.Fn regexec
to a human-readable, printable message.
If
.Fa preg
is
.No non\- Ns Dv NULL ,
the error code should have arisen from use of
the
.Ft regex_t
pointed to by
.Fa preg ,
and if the error code came from
.Fn regcomp ,
it should have been the result from the most recent
.Fn regcomp
using that
.Ft regex_t .
(The
.Fn regerror
function
may be able to supply a more detailed message using information
from the
.Ft regex_t . )
The
.Fn regerror
function
places the NUL-terminated message into the buffer pointed to by
.Fa errbuf ,
limiting the length (including the NUL) to at most
.Fa errbuf_size
bytes.
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.Fa errbuf_size
is 0,
.Fa errbuf
is ignored but the return value is still correct.
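.Pp
The size-query behavior described above allows a two-pass pattern:
ask for the required size first, then allocate and fetch the message.
The following is a sketch only; the helper name
.Fn regex_error_message
is an assumption, and the caller is assumed to release the result with
.Xr free 3 .

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the library): return a malloc()ed
 * copy of the message for `errcode`, or NULL if allocation fails.
 * `preg` may be NULL.  The caller frees the result.
 */
char *
regex_error_message(int errcode, const regex_t *preg)
{
	size_t needed;
	char *buf;

	/* With errbuf_size 0 the buffer is ignored; only the size returns. */
	needed = regerror(errcode, preg, NULL, 0);
	if (needed == 0)
		return (NULL);
	buf = malloc(needed);
	if (buf == NULL)
		return (NULL);
	(void)regerror(errcode, preg, buf, needed);
	return (buf);
}
```

A fixed-size buffer is often sufficient in practice;
the two-pass form simply guarantees that the message is never truncated.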
.Pp
If the
.Fa errcode
given to
.Fn regerror
is first ORed with
.Dv REG_ITOA ,
the
.Dq message
that results is the printable name of the error code,
e.g.\&
.Dq Dv REG_NOMATCH ,
rather than an explanation thereof.
If
.Fa errcode
is
.Dv REG_ATOI ,
then
.Fa preg
shall be
.No non\- Ns Dv NULL
and the
.Va re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.Fa errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
.Dv REG_ITOA
and
.Dv REG_ATOI
are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.Pp
The
.Fn regfree
function
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.Fa preg .
The remaining
.Ft regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.Fn regexec
or
.Fn regerror
is undefined.
.Pp
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.Sh IMPLEMENTATION CHOICES
There are a number of decisions that
.St -p1003.2
leaves up to the implementor,
either by explicitly saying
.Dq undefined
or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.Pp
See
.Xr re_format 7
for a discussion of the definition of case-independent matching.
.Pp
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See
.Sx BUGS
for one short RE using them
that will run almost any system out of memory.
.Pp
A backslashed character other than one specifically given a magic meaning
by
.St -p1003.2
(such magic meanings occur only in obsolete
.Bq Dq basic
REs)
is taken as an ordinary character.
.Pp
Any unmatched
.Ql [\&
is a
.Dv REG_EBRACK
error.
.Pp
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.Pp
.Dv RE_DUP_MAX ,
the limit on repetition counts in bounded repetitions, is 255.
.Pp
A repetition operator
.Ql ( ?\& ,
.Ql *\& ,
.Ql +\& ,
or bounds)
cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow
.Ql ^\&
or
.Ql |\& .
.Pp
.Ql |\&
cannot appear first or last in a (sub)expression or after another
.Ql |\& ,
i.e., an operand of
.Ql |\&
cannot be an empty subexpression.
An empty parenthesized subexpression,
.Ql "()" ,
is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.Pp
A
.Ql {\&
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A
.Ql {\&
.Em not
followed by a digit is considered an ordinary character.
.Pp
.Ql ^\&
and
.Ql $\&
beginning and ending subexpressions in obsolete
.Pq Dq basic
REs are anchors, not ordinary characters.
.Sh SEE ALSO
.Xr grep 1 ,
.Xr re_format 7
.Pp
.St -p1003.2 ,
sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.Sh DIAGNOSTICS
Non-zero error codes from
.Fn regcomp
and
.Fn regexec
include the following:
.Pp
.Bl -tag -width REG_ECOLLATE -compact
.It Dv REG_NOMATCH
The
.Fn regexec
function
failed to match
.It Dv REG_BADPAT
invalid regular expression
.It Dv REG_ECOLLATE
invalid collating element
.It Dv REG_ECTYPE
invalid character class
.It Dv REG_EESCAPE
.Ql \e
applied to unescapable character
.It Dv REG_ESUBREG
invalid backreference number
.It Dv REG_EBRACK
brackets
.Ql "[ ]"
not balanced
.It Dv REG_EPAREN
parentheses
.Ql "( )"
not balanced
.It Dv REG_EBRACE
braces
.Ql "{ }"
not balanced
.It Dv REG_BADBR
invalid repetition count(s) in
.Ql "{ }"
.It Dv REG_ERANGE
invalid character range in
.Ql "[ ]"
.It Dv REG_ESPACE
ran out of memory
.It Dv REG_BADRPT
.Ql ?\& ,
.Ql *\& ,
or
.Ql +\&
operand invalid
.It Dv REG_EMPTY
empty (sub)expression
.It Dv REG_ASSERT
can't happen - you found a bug
.It Dv REG_INVARG
invalid argument, e.g.\& negative-length string
.It Dv REG_ILLSEQ
illegal byte sequence (bad multibyte character)
.El
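.Pp
A caller would typically turn these codes into text with
.Fn regerror .
The following sketch does so for a failed compilation;
the helper name
.Fn report_compile_error ,
the fixed 128-byte buffer, and the message format are assumptions of this example.

```c
#include <regex.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of the library): try to compile the
 * extended RE `pattern`; on failure, write a diagnostic to `out`.
 * Returns 0 on success or the non-zero regcomp() error code.
 */
int
report_compile_error(const char *pattern, FILE *out)
{
	regex_t re;
	char msg[128];	/* arbitrary size; longer messages are truncated */
	int rc;

	rc = regcomp(&re, pattern, REG_EXTENDED);
	if (rc == 0) {
		regfree(&re);
		return (0);
	}
	/* The code came from the most recent regcomp() on this regex_t. */
	(void)regerror(rc, &re, msg, sizeof(msg));
	fprintf(out, "regcomp(\"%s\") failed: %s\n", pattern, msg);
	return (rc);
}
```

An unbalanced bracket such as
.Ql [a
is expected to yield
.Dv REG_EBRACK ,
and an unbalanced parenthesis such as
.Ql (a
is expected to yield
.Dv REG_EPAREN ;
the message text itself varies between implementations.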
.Sh HISTORY
Originally written by
.An Henry Spencer .
Altered for inclusion in the
.Bx 4.4
distribution.
.Sh BUGS
This is an alpha release with known defects.
Please report problems.
.Pp
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.Pp
The performance of the
.Fn regexec
function
is poor.
This will improve with later releases.
An
.Fa nmatch
argument
exceeding 0 is expensive;
.Fa nmatch
exceeding 1 is worse.
The
.Fn regexec
function
is largely insensitive to RE complexity
.Em except
that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.Pp
The
.Fn regcomp
function
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
will (eventually) run almost any existing machine out of swap space.
.Pp
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.Pp
Due to a mistake in
.St -p1003.2 ,
things like
.Ql "a)b"
are legal REs because
.Ql )\&
is
a special character only in the presence of a previous unmatched
.Ql (\& .
This can't be fixed until the spec is fixed.
.Pp
The standard's definition of back references is vague.
For example, does
.Ql "a\e(\e(b\e)*\e2\e)*d"
match
.Ql "abbbd" ?
Until the standard is clarified,
behavior in such cases should not be relied on.
.Pp
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.
732.Ql "abbbd" ?
733Until the standard is clarified,
734behavior in such cases should not be relied on.
735.Pp
736The implementation of word-boundary matching is a bit of a kludge,
737and bugs may lurk in combinations of word-boundary matching and anchoring.