1.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
2.\" Copyright (c) 1992, 1993, 1994
3.\" The Regents of the University of California. All rights reserved.
4.\"
5.\" This code is derived from software contributed to Berkeley by
6.\" Henry Spencer.
7.\"
8.\" Redistribution and use in source and binary forms, with or without
9.\" modification, are permitted provided that the following conditions
10.\" are met:
11.\" 1. Redistributions of source code must retain the above copyright
12.\" notice, this list of conditions and the following disclaimer.
13.\" 2. Redistributions in binary form must reproduce the above copyright
14.\" notice, this list of conditions and the following disclaimer in the
15.\" documentation and/or other materials provided with the distribution.
16.\" 4. Neither the name of the University nor the names of its contributors
17.\" may be used to endorse or promote products derived from this software
18.\" without specific prior written permission.
19.\"
20.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
21.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
23.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
24.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
26.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
27.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
28.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
29.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
30.\" SUCH DAMAGE.
31.\"
32.\" @(#)regex.3 8.4 (Berkeley) 3/20/94
33.\" $FreeBSD: src/lib/libc/regex/regex.3,v 1.21 2007/01/09 00:28:04 imp Exp $
34.\"
35.Dd August 17, 2005
36.Dt REGEX 3
37.Os
38.Sh NAME
39.Nm regcomp ,
40.Nm regerror ,
41.Nm regexec ,
42.Nm regfree
43.Nd regular-expression library
44.Sh LIBRARY
45.Lb libc
46.Sh SYNOPSIS
47.In regex.h
48.Ft int
49.Fo regcomp
50.Fa "regex_t *restrict preg"
51.Fa "const char *restrict pattern"
52.Fa "int cflags"
53.Fc
54.Ft size_t
55.Fo regerror
56.Fa "int errcode"
57.Fa "const regex_t *restrict preg"
58.Fa "char *restrict errbuf"
59.Fa "size_t errbuf_size"
60.Fc
61.Ft int
62.Fo regexec
63.Fa "const regex_t *restrict preg"
64.Fa "const char *restrict string"
65.Fa "size_t nmatch"
66.Fa "regmatch_t pmatch[restrict]"
67.Fa "int eflags"
68.Fc
69.Ft void
70.Fo regfree
71.Fa "regex_t *preg"
72.Fc
73.Sh DESCRIPTION
74These routines implement
75.St -p1003.2
76regular expressions
77.Pq Do RE Dc Ns s ;
78see
79.Xr re_format 7 .
80The
81.Fn regcomp
82function
83compiles an RE, written as a string, into an internal form.
84.Fn regexec
85matches that internal form against a string and reports results.
86.Fn regerror
87transforms error codes from either into human-readable messages.
88.Fn regfree
89frees any dynamically-allocated storage used by the internal form
90of an RE.
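.Pp
The four routines are normally used together in a compile, execute, report, free cycle.
The following is a minimal sketch of that cycle, not a definitive implementation;
the helper name re_matches and its return convention are assumptions made for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Returns 1 if text matches the extended RE in pattern, 0 if it does not,
 * and -1 on any compile or execution error.
 * re_matches is a hypothetical helper, not part of the library. */
int re_matches(const char *pattern, const char *text)
{
    regex_t re;
    char msg[128];
    int rc;

    assert(pattern != NULL && text != NULL);

    rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    if (rc != 0) {
        /* Nothing to free after a failed regcomp; just report. */
        regerror(rc, &re, msg, sizeof(msg));
        fprintf(stderr, "regcomp: %s\n", msg);
        return -1;
    }

    rc = regexec(&re, text, 0, NULL, 0);
    if (rc != 0 && rc != REG_NOMATCH) {
        regerror(rc, &re, msg, sizeof(msg));
        fprintf(stderr, "regexec: %s\n", msg);
    }

    regfree(&re);               /* release the compiled form */
    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```

A caller would write, for example, re_matches("^ab+c$", "abbbc") and test the result against 1.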
91.Pp
92The header
93.In regex.h
94declares two structure types,
95.Ft regex_t
96and
97.Ft regmatch_t ,
98the former for compiled internal forms and the latter for match reporting.
99It also declares the four functions,
100a type
101.Ft regoff_t ,
102and a number of constants with names starting with
103.Dq Dv REG_ .
104.Pp
105The
106.Fn regcomp
107function
108compiles the regular expression contained in the
109.Fa pattern
110string,
111subject to the flags in
112.Fa cflags ,
113and places the results in the
114.Ft regex_t
115structure pointed to by
116.Fa preg .
117The
118.Fa cflags
119argument
120is the bitwise OR of zero or more of the following flags:
121.Bl -tag -width REG_EXTENDED
122.It Dv REG_EXTENDED
123Compile modern
124.Pq Dq extended
125REs,
126rather than the obsolete
127.Pq Dq basic
128REs that
129are the default.
130.It Dv REG_BASIC
131This is a synonym for 0,
132provided as a counterpart to
133.Dv REG_EXTENDED
134to improve readability.
135.It Dv REG_NOSPEC
136Compile with recognition of all special characters turned off.
137All characters are thus considered ordinary,
138so the
139.Dq RE
140is a literal string.
141This is an extension,
142compatible with but not specified by
143.St -p1003.2 ,
144and should be used with
145caution in software intended to be portable to other systems.
146.Dv REG_EXTENDED
147and
148.Dv REG_NOSPEC
149may not be used
150in the same call to
151.Fn regcomp .
152.It Dv REG_ICASE
153Compile for matching that ignores upper/lower case distinctions.
154See
155.Xr re_format 7 .
156.It Dv REG_NOSUB
157Compile for matching that need only report success or failure,
158not what was matched.
159.It Dv REG_NEWLINE
160Compile for newline-sensitive matching.
161By default, newline is a completely ordinary character with no special
162meaning in either REs or strings.
163With this flag,
164.Ql [^
165bracket expressions and
166.Ql .\&
167never match newline,
168a
169.Ql ^\&
170anchor matches the null string after any newline in the string
171in addition to its normal function,
172and the
173.Ql $\&
174anchor matches the null string before any newline in the
175string in addition to its normal function.
176.It Dv REG_PEND
177The regular expression ends,
178not at the first NUL,
179but just before the character pointed to by the
180.Va re_endp
181member of the structure pointed to by
182.Fa preg .
183The
184.Va re_endp
185member is of type
186.Ft "const char *" .
187This flag permits inclusion of NULs in the RE;
188they are considered ordinary characters.
189This is an extension,
190compatible with but not specified by
191.St -p1003.2 ,
192and should be used with
193caution in software intended to be portable to other systems.
194.El
195.Pp
196When successful,
197.Fn regcomp
198returns 0 and fills in the structure pointed to by
199.Fa preg .
200One member of that structure
201(other than
202.Va re_endp )
203is publicized:
204.Va re_nsub ,
205of type
206.Ft size_t ,
207contains the number of parenthesized subexpressions within the RE
208(except that the value of this member is undefined if the
209.Dv REG_NOSUB
210flag was used).
211If
212.Fn regcomp
213fails, it returns a non-zero error code;
214see
215.Sx DIAGNOSTICS .
216.Pp
217The
218.Fn regexec
219function
220matches the compiled RE pointed to by
221.Fa preg
222against the
223.Fa string ,
224subject to the flags in
225.Fa eflags ,
226and reports results using
227.Fa nmatch ,
228.Fa pmatch ,
229and the returned value.
230The RE must have been compiled by a previous invocation of
231.Fn regcomp .
232The compiled form is not altered during execution of
233.Fn regexec ,
234so a single compiled RE can be used simultaneously by multiple threads.
235.Pp
236By default,
237the NUL-terminated string pointed to by
238.Fa string
239is considered to be the text of an entire line, minus any terminating
240newline.
241The
242.Fa eflags
243argument is the bitwise OR of zero or more of the following flags:
244.Bl -tag -width REG_STARTEND
245.It Dv REG_NOTBOL
246The first character of
247the string
248is not the beginning of a line, so the
249.Ql ^\&
250anchor should not match before it.
251This does not affect the behavior of newlines under
252.Dv REG_NEWLINE .
253.It Dv REG_NOTEOL
254The NUL terminating
255the string
256does not end a line, so the
257.Ql $\&
258anchor should not match before it.
259This does not affect the behavior of newlines under
260.Dv REG_NEWLINE .
261.It Dv REG_STARTEND
262The string is considered to start at
263.Fa string
264+
265.Fa pmatch Ns [0]. Ns Va rm_so
266and to have a terminating NUL located at
267.Fa string
268+
269.Fa pmatch Ns [0]. Ns Va rm_eo
270(there need not actually be a NUL at that location),
271regardless of the value of
272.Fa nmatch .
273See below for the definition of
274.Fa pmatch
275and
276.Fa nmatch .
277This is an extension,
278compatible with but not specified by
279.St -p1003.2 ,
280and should be used with
281caution in software intended to be portable to other systems.
282Note that a non-zero
283.Va rm_so
284does not imply
285.Dv REG_NOTBOL ;
286.Dv REG_STARTEND
287affects only the location of the string,
288not how it is matched.
289.El
290.Pp
291See
292.Xr re_format 7
293for a discussion of what is matched in situations where an RE or a
294portion thereof could match any of several substrings of
295.Fa string .
296.Pp
297Normally,
298.Fn regexec
299returns 0 for success and the non-zero code
300.Dv REG_NOMATCH
301for failure.
302Other non-zero error codes may be returned in exceptional situations;
303see
304.Sx DIAGNOSTICS .
305.Pp
306If
307.Dv REG_NOSUB
308was specified in the compilation of the RE,
309or if
310.Fa nmatch
311is 0,
312.Fn regexec
313ignores the
314.Fa pmatch
315argument (but see below for the case where
316.Dv REG_STARTEND
317is specified).
318Otherwise,
319.Fa pmatch
320points to an array of
321.Fa nmatch
322structures of type
323.Ft regmatch_t .
324Such a structure has at least the members
325.Va rm_so
326and
327.Va rm_eo ,
328both of type
329.Ft regoff_t
330(a signed arithmetic type at least as large as an
331.Ft off_t
332and a
333.Ft ssize_t ) ,
334containing respectively the offset of the first character of a substring
335and the offset of the first character after the end of the substring.
336Offsets are measured from the beginning of the
337.Fa string
338argument given to
339.Fn regexec .
340An empty substring is denoted by equal offsets,
341both indicating the character following the empty substring.
342.Pp
343The 0th member of the
344.Fa pmatch
345array is filled in to indicate what substring of
346.Fa string
347was matched by the entire RE.
348Remaining members report what substring was matched by parenthesized
349subexpressions within the RE;
350member
351.Va i
352reports subexpression
353.Va i ,
354with subexpressions counted (starting at 1) by the order of their opening
355parentheses in the RE, left to right.
356Unused entries in the array (corresponding either to subexpressions that
357did not participate in the match at all, or to subexpressions that do not
358exist in the RE (that is,
359.Va i
360>
361.Fa preg Ns -> Ns Va re_nsub ) )
362have both
363.Va rm_so
364and
365.Va rm_eo
366set to -1.
367If a subexpression participated in the match several times,
368the reported substring is the last one it matched.
369(Note, as an example in particular, that when the RE
370.Ql "(b*)+"
371matches
372.Ql bbb ,
373the parenthesized subexpression matches each of the three
374.So Li b Sc Ns s
375and then
376an infinite number of empty strings following the last
377.Ql b ,
378so the reported substring is one of the empties.)
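.Pp
The reporting rules above can be sketched as follows.
The helpers match_groups and print_groups are hypothetical names;
the array passed in is deliberately longer than re_nsub + 1 so that the unused entries are visible.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Matches an extended RE against text and fills pm[0..n-1].
 * Returns 1 on match, 0 on no match, -1 on error.
 * match_groups is a hypothetical helper, not a library function. */
int match_groups(const char *pattern, const char *text,
                 regmatch_t *pm, size_t n)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL && pm != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, n, pm, 0);
    regfree(&re);
    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}

/* Prints each entry; unused entries have rm_so and rm_eo set to -1. */
void print_groups(const char *text, const regmatch_t *pm, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (pm[i].rm_so == -1)
            printf("%zu: (unused)\n", i);
        else
            printf("%zu: \"%.*s\"\n", i,
                   (int)(pm[i].rm_eo - pm[i].rm_so), text + pm[i].rm_so);
    }
}
```

For the pattern "^([a-z]+)(=([0-9]+))?$", the text "key=42" fills entries 0 through 3,
while the text "key" leaves entries 2 and 3 unused because the optional group did not participate.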
379.Pp
380If
381.Dv REG_STARTEND
382is specified,
383.Fa pmatch
384must point to at least one
385.Ft regmatch_t
386(even if
387.Fa nmatch
388is 0 or
389.Dv REG_NOSUB
390was specified),
391to hold the input offsets for
392.Dv REG_STARTEND .
393Use for output is still entirely controlled by
394.Fa nmatch ;
395if
396.Fa nmatch
397is 0 or
398.Dv REG_NOSUB
399was specified,
400the value of
401.Fa pmatch Ns [0]
402will not be changed by a successful
403.Fn regexec .
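.Pp
Because REG_STARTEND is an extension, portable code may want a fallback.
The sketch below uses the flag when the header defines it and otherwise copies the range into a temporary NUL-terminated buffer;
the helper name find_in_range and the fallback strategy are assumptions made for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Searches buf[start..end) for an extended RE.  On a match, stores offsets
 * relative to buf in *out and returns 1; returns 0 for no match, -1 on error.
 * find_in_range is a hypothetical helper, not a library function. */
int find_in_range(const char *buf, size_t start, size_t end,
                  const char *pattern, regmatch_t *out)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(buf != NULL && pattern != NULL && out != NULL && start <= end);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
#ifdef REG_STARTEND
    m[0].rm_so = (regoff_t)start;   /* input: where the string starts */
    m[0].rm_eo = (regoff_t)end;     /* input: where the virtual NUL sits */
    rc = regexec(&re, buf, 1, m, REG_STARTEND);
#else
    {
        /* Fallback: copy the range, match, then shift the offsets back. */
        char *tmp = malloc(end - start + 1);

        if (tmp == NULL) {
            regfree(&re);
            return -1;
        }
        memcpy(tmp, buf + start, end - start);
        tmp[end - start] = '\0';
        rc = regexec(&re, tmp, 1, m, 0);
        free(tmp);
        if (rc == 0) {
            m[0].rm_so += (regoff_t)start;
            m[0].rm_eo += (regoff_t)start;
        }
    }
#endif
    regfree(&re);
    if (rc == 0) {
        *out = m[0];
        return 1;
    }
    return rc == REG_NOMATCH ? 0 : -1;
}
```

Searching "foo bar foo" for "foo" over the range 4 to 11 reports offsets 8 and 11, measured from the start of the buffer rather than from the start of the range.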
404.Pp
405The
406.Fn regerror
407function
408maps a non-zero
409.Fa errcode
410from either
411.Fn regcomp
412or
413.Fn regexec
414to a human-readable, printable message.
415If
416.Fa preg
417is
418.No non\- Ns Dv NULL ,
419the error code should have arisen from use of
420the
421.Ft regex_t
422pointed to by
423.Fa preg ,
424and if the error code came from
425.Fn regcomp ,
426it should have been the result from the most recent
427.Fn regcomp
428using that
429.Ft regex_t .
430The
431.Fn ( regerror
432may be able to supply a more detailed message using information
433from the
434.Ft regex_t . )
435The
436.Fn regerror
437function
438places the NUL-terminated message into the buffer pointed to by
439.Fa errbuf ,
440limiting the length (including the NUL) to at most
441.Fa errbuf_size
442bytes.
443If the whole message will not fit,
444as much of it as will fit before the terminating NUL is supplied.
445In any case,
446the returned value is the size of buffer needed to hold the whole
447message (including terminating NUL).
448If
449.Fa errbuf_size
450is 0,
451.Fa errbuf
452is ignored but the return value is still correct.
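.Pp
The return value makes a two-call idiom possible: ask for the required size first, then allocate exactly that much.
A minimal sketch, assuming a hypothetical helper named compile_error_message:

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/* Compiles pattern as an extended RE.  On failure, returns a malloc'd copy
 * of the full error message (the caller frees it).  Returns NULL if the
 * pattern compiled successfully or if memory could not be allocated.
 * compile_error_message is a hypothetical helper, not a library function. */
char *compile_error_message(const char *pattern)
{
    regex_t re;
    size_t need;
    char *msg;
    int rc;

    assert(pattern != NULL);
    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc == 0) {
        regfree(&re);
        return NULL;
    }

    /* First call: errbuf_size is 0, so errbuf is ignored and only the
     * required size (including the terminating NUL) is returned. */
    need = regerror(rc, &re, NULL, 0);
    msg = malloc(need);
    if (msg != NULL)
        regerror(rc, &re, msg, need);   /* second call: fetch the text */
    return msg;
}
```

The message text itself varies between implementations, so callers should treat it as display-only.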
453.Pp
454If the
455.Fa errcode
456given to
457.Fn regerror
458is first ORed with
459.Dv REG_ITOA ,
460the
461.Dq message
462that results is the printable name of the error code,
463e.g.\&
464.Dq Dv REG_NOMATCH ,
465rather than an explanation thereof.
466If
467.Fa errcode
468is
469.Dv REG_ATOI ,
470then
471.Fa preg
472shall be
473.No non\- Ns Dv NULL
474and the
475.Va re_endp
476member of the structure it points to
477must point to the printable name of an error code;
478in this case, the result in
479.Fa errbuf
480is the decimal digits of
481the numeric value of the error code
482(0 if the name is not recognized).
483.Dv REG_ITOA
484and
485.Dv REG_ATOI
486are intended primarily as debugging facilities;
487they are extensions,
488compatible with but not specified by
489.St -p1003.2 ,
490and should be used with
491caution in software intended to be portable to other systems.
492Be warned also that they are considered experimental and changes are possible.
493.Pp
494The
495.Fn regfree
496function
497frees any dynamically-allocated storage associated with the compiled RE
498pointed to by
499.Fa preg .
500The remaining
501.Ft regex_t
502is no longer a valid compiled RE
503and the effect of supplying it to
504.Fn regexec
505or
506.Fn regerror
507is undefined.
508.Pp
509None of these functions references global variables except for tables
510of constants;
511all are safe for use from multiple threads if the arguments are safe.
512.Sh IMPLEMENTATION CHOICES
513There are a number of decisions that
514.St -p1003.2
515leaves up to the implementor,
516either by explicitly saying
517.Dq undefined
518or by virtue of them being
519forbidden by the RE grammar.
520This implementation treats them as follows.
521.Pp
522See
523.Xr re_format 7
524for a discussion of the definition of case-independent matching.
525.Pp
526There is no particular limit on the length of REs,
527except insofar as memory is limited.
528Memory usage is approximately linear in RE size, and largely insensitive
529to RE complexity, except for bounded repetitions.
530See
531.Sx BUGS
532for one short RE using them
533that will run almost any system out of memory.
534.Pp
535A backslashed character other than one specifically given a magic meaning
536by
537.St -p1003.2
538(such magic meanings occur only in obsolete
539.Bq Dq basic
540REs)
541is taken as an ordinary character.
542.Pp
543Any unmatched
544.Ql [\&
545is a
546.Dv REG_EBRACK
547error.
548.Pp
549Equivalence classes cannot begin or end bracket-expression ranges.
550The endpoint of one range cannot begin another.
551.Pp
552.Dv RE_DUP_MAX ,
553the limit on repetition counts in bounded repetitions, is 255.
554.Pp
555A repetition operator
556.Ql ( ?\& ,
557.Ql *\& ,
558.Ql +\& ,
559or bounds)
560cannot follow another
561repetition operator.
562A repetition operator cannot begin an expression or subexpression
563or follow
564.Ql ^\&
565or
566.Ql |\& .
567.Pp
568.Ql |\&
569cannot appear first or last in a (sub)expression or after another
570.Ql |\& ,
571i.e., an operand of
572.Ql |\&
573cannot be an empty subexpression.
574An empty parenthesized subexpression,
575.Ql "()" ,
576is legal and matches an
577empty (sub)string.
578An empty string is not a legal RE.
579.Pp
580A
581.Ql {\&
582followed by a digit is considered the beginning of bounds for a
583bounded repetition, which must then follow the syntax for bounds.
584A
585.Ql {\&
586.Em not
587followed by a digit is considered an ordinary character.
588.Pp
589.Ql ^\&
590and
591.Ql $\&
592beginning and ending subexpressions in obsolete
593.Pq Dq basic
594REs are anchors, not ordinary characters.
595.Sh DIAGNOSTICS
596Non-zero error codes from
597.Fn regcomp
598and
599.Fn regexec
600include the following:
601.Pp
602.Bl -tag -width REG_ECOLLATE -compact
603.It Dv REG_NOMATCH
604The
605.Fn regexec
606function
607failed to match
608.It Dv REG_BADPAT
609invalid regular expression
610.It Dv REG_ECOLLATE
611invalid collating element
612.It Dv REG_ECTYPE
613invalid character class
614.It Dv REG_EESCAPE
615.Ql \e
616applied to unescapable character
617.It Dv REG_ESUBREG
618invalid backreference number
619.It Dv REG_EBRACK
620brackets
621.Ql "[ ]"
622not balanced
623.It Dv REG_EPAREN
624parentheses
625.Ql "( )"
626not balanced
627.It Dv REG_EBRACE
628braces
629.Ql "{ }"
630not balanced
631.It Dv REG_BADBR
632invalid repetition count(s) in
633.Ql "{ }"
634.It Dv REG_ERANGE
635invalid character range in
636.Ql "[ ]"
637.It Dv REG_ESPACE
638ran out of memory
639.It Dv REG_BADRPT
640.Ql ?\& ,
641.Ql *\& ,
642or
643.Ql +\&
644operand invalid
645.It Dv REG_EMPTY
646empty (sub)expression
647.It Dv REG_ASSERT
648cannot happen - you found a bug
649.It Dv REG_INVARG
650invalid argument, e.g.\& negative-length string
651.It Dv REG_ILLSEQ
652illegal byte sequence (bad multibyte character)
653.El
654.Sh SEE ALSO
655.Xr grep 1 ,
656.Xr re_format 7
657.Pp
658.St -p1003.2 ,
659sections 2.8 (Regular Expression Notation)
660and
661B.5 (C Binding for Regular Expression Matching).
662.Sh HISTORY
663Originally written by
664.An Henry Spencer .
665Altered for inclusion in the
666.Bx 4.4
667distribution.
668.Sh BUGS
669This is an alpha release with known defects.
670Please report problems.
671.Pp
672The back-reference code is subtle and doubts linger about its correctness
673in complex cases.
674.Pp
675The
676.Fn regexec
677function
678performance is poor.
679This will improve with later releases.
680The
681.Fa nmatch
682argument
683exceeding 0 is expensive;
684.Fa nmatch
685exceeding 1 is worse.
686The
687.Fn regexec
688function
689is largely insensitive to RE complexity
690.Em except
691that back
692references are massively expensive.
693RE length does matter; in particular, there is a strong speed bonus
694for keeping RE length under about 30 characters,
695with most special characters counting roughly double.
696.Pp
697The
698.Fn regcomp
699function
700implements bounded repetitions by macro expansion,
701which is costly in time and space if counts are large
702or bounded repetitions are nested.
703An RE like, say,
704.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
705will (eventually) run almost any existing machine out of swap space.
706.Pp
707There are suspected problems with response to obscure error conditions.
708Notably,
709certain kinds of internal overflow,
710produced only by truly enormous REs or by multiply nested bounded repetitions,
711are probably not handled well.
712.Pp
713Due to a mistake in
714.St -p1003.2 ,
715things like
716.Ql "a)b"
717are legal REs because
718.Ql )\&
719is
720a special character only in the presence of a previous unmatched
721.Ql (\& .
722This cannot be fixed until the spec is fixed.
723.Pp
724The standard's definition of back references is vague.
725For example, does
726.Ql "a\e(\e(b\e)*\e2\e)*d"
727match
728.Ql "abbbd" ?
729Until the standard is clarified,
730behavior in such cases should not be relied on.
731.Pp
732The implementation of word-boundary matching is a bit of a kludge,
733and bugs may lurk in combinations of word-boundary matching and anchoring.
734.Pp
735Word-boundary matching does not work properly in multibyte locales.