.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\" The Regents of the University of California. All rights reserved.
.\" This code is derived from software contributed to Berkeley by
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\" must display the following acknowledgement:
.\" This product includes software developed by the University of
.\" California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" @(#)regex.3 8.4 (Berkeley) 3/20/94
.\" $FreeBSD: src/lib/libc/regex/regex.3,v 1.17 2004/07/12 11:03:42 tjr Exp $
.Nd regular-expression library
.Fa "regex_t *restrict preg"
.Fa "const char *restrict pattern"
.Fa "const regex_t *restrict preg"
.Fa "char *restrict errbuf"
.Fa "size_t errbuf_size"
.Fa "const regex_t *restrict preg"
.Fa "const char *restrict string"
.Fa "regmatch_t pmatch[restrict]"
These routines implement
compiles an RE, written as a string, into an internal form.
matches that internal form against a string and reports results.
transforms error codes from either into human-readable messages.
frees any dynamically-allocated storage used by the internal form
declares two structure types,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
and a number of constants with names starting with
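The compile, match, and free cycle described above can be sketched as follows.
This is a minimal illustration and not part of the library: the helper name
re_matches and its return convention (1, 0, or -1) are assumptions made for
the example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if `text` matches the extended RE
 * `pattern`, 0 if it does not, and -1 if the pattern fails to compile
 * or regexec() reports some other error.
 */
int
re_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* REG_NOSUB: only success or failure is wanted, not match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);

    /* Release the storage held by the compiled internal form. */
    regfree(&re);

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```

A caller that matches many strings against one pattern would normally
compile once and reuse the compiled form; the helper recompiles on every
call only to keep the sketch short.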
compiles the regular expression contained in the
subject to the flags in
and places the results in the
structure pointed to by
is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_EXTENDED
rather than the obsolete
This is a synonym for 0,
provided as a counterpart to
to improve readability.
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
This is an extension,
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
Compile for matching that ignores upper/lower case distinctions.
Compile for matching that need only report success or failure,
not what was matched.
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
bracket expressions and
anchor matches the null string after any newline in the string
in addition to its normal function,
anchor matches the null string before any newline in the
string in addition to its normal function.
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
member of the structure pointed to by
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
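As a hedged illustration of how compilation flags change matching, the
sketch below compiles the same pattern under caller-supplied flags.
The helper name re_matches_flags is hypothetical; only the standard
REG_EXTENDED, REG_ICASE, REG_NEWLINE, and REG_NOSUB flags are assumed.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compiles `pattern` with the given compilation
 * flags and tests it against `text`.
 * Returns 1 on a match, 0 on no match, -1 on any error.
 */
int
re_matches_flags(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* REG_NOSUB is added because only success or failure is reported. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```

With the subject "hello\nworld", the pattern "^world$" is expected to match
only when REG_NEWLINE is included, since without it the anchors apply
only at the ends of the whole string.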
returns 0 and fills in the structure pointed to by
One member of that structure
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
fails, it returns a non-zero error code;
matches the compiled RE pointed to by
subject to the flags in
and reports results using
and the returned value.
The RE must have been compiled by a previous invocation of
The compiled form is not altered during execution of
so a single compiled RE can be used simultaneously by multiple threads.
the NUL-terminated string pointed to by
is considered to be the text of an entire line, minus any terminating
argument is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_STARTEND
The first character of
is not the beginning of a line, so the
anchor should not match before it.
This does not affect the behavior of newlines under
does not end a line, so the
anchor should not match before it.
This does not affect the behavior of newlines under
The string is considered to start at
.Fa pmatch Ns [0]. Ns Va rm_so
and to have a terminating NUL located at
.Fa pmatch Ns [0]. Ns Va rm_eo
(there need not actually be a NUL at that location),
regardless of the value of
See below for the definition of
This is an extension,
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
affects only the location of the string,
not how it is matched.
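The REG_STARTEND convention described above can be sketched as follows.
This is an illustration only: the helper name re_search_range is
hypothetical, and the code assumes a platform whose regex.h defines
REG_STARTEND, which is an extension and may be absent elsewhere.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: searches only bytes [start, end) of `buf` for
 * the compiled RE. No NUL is required at buf[end].
 * Returns the regexec() result; on success *m holds the overall match.
 */
int
re_search_range(const regex_t *re, const char *buf,
    size_t start, size_t end, regmatch_t *m)
{
    assert(re != NULL && buf != NULL && m != NULL && start <= end);

    /* With REG_STARTEND, pmatch[0] carries the input range. */
    m->rm_so = (regoff_t)start;
    m->rm_eo = (regoff_t)end;

    return regexec(re, buf, 1, m, REG_STARTEND);
}
```

For example, searching "abc123def456" for "[0-9]+" over the range 6 to 12
is expected to report the match "456" with offsets measured from the
start of the buffer, not from the start of the range.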
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
returns 0 for success and the non-zero code
Other non-zero error codes may be returned in exceptional situations;
was specified in the compilation of the RE,
argument (but see below for the case where
points to an array of
Such a structure has at least the members
(a signed arithmetic type at least as large as an
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
The 0th member of the
array is filled in to indicate what substring of
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
reports subexpression
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array (corresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is,
.Fa preg Ns -> Ns Va re_nsub ) )
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE
the parenthesized subexpression matches each of the three
an infinite number of empty strings following the last
so the reported substring is one of the empties.)
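The offset reporting described above can be illustrated with a small
sketch that copies one reported substring out of the subject string.
The helper name re_copy_group and its return convention are assumptions
made for the example; it also relies on the POSIX rule that an unused
entry has both offsets set to -1.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copies the text matched by subexpression `idx`
 * (0 means the whole match) into `out` as a NUL-terminated string.
 * Returns the substring length, or -1 if the entry is unused or the
 * output buffer is too small.
 */
long
re_copy_group(const char *subject, const regmatch_t *pmatch, size_t idx,
    char *out, size_t out_size)
{
    regoff_t so, eo;
    size_t len;

    assert(subject != NULL && pmatch != NULL && out != NULL);

    so = pmatch[idx].rm_so;
    eo = pmatch[idx].rm_eo;

    /* Unused entries are reported with both offsets set to -1. */
    if (so == -1 || eo == -1)
        return -1;

    /* rm_eo is the offset of the first character after the substring. */
    len = (size_t)(eo - so);
    if (len + 1 > out_size)
        return -1;

    memcpy(out, subject + so, len);
    out[len] = '\0';
    return (long)len;
}
```

The caller is assumed to pass an array with at least idx + 1 entries that
was filled in by a successful match against the same subject string.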
must point to at least one
to hold the input offsets for
Use for output is still entirely controlled by
will not be changed by a successful
to a human-readable, printable message.
.No non\- Ns Dv NULL ,
the error code should have arisen from use of
and if the error code came from
it should have been the result from the most recent
may be able to supply a more detailed message using information
places the NUL-terminated message into the buffer pointed to by
limiting the length (including the NUL) to at most
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
is ignored but the return value is still correct.
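Because the return value is the buffer size needed for the whole message,
a caller can size the buffer with a first call and fill it with a second.
The following is a minimal sketch of that pattern; the helper name
re_error_string is hypothetical, and the error code is assumed to be one
actually returned by a regcomp or regexec call.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: returns a malloc()ed copy of the message for
 * `errcode`, or NULL if allocation fails. The caller frees the result.
 */
char *
re_error_string(int errcode, const regex_t *preg)
{
    size_t need;
    char *msg;

    /* With a size of 0 the buffer is ignored; only the size is returned. */
    need = regerror(errcode, preg, NULL, 0);

    /* The reported size includes the terminating NUL. */
    assert(need > 0);

    msg = malloc(need);
    if (msg == NULL)
        return NULL;

    (void)regerror(errcode, preg, msg, need);
    return msg;
}
```

A fixed-size buffer is often sufficient in practice; the two-call form is
shown only because it avoids truncating long messages.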
that results is the printable name of the error code,
rather than an explanation thereof.
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
are intended primarily as debugging facilities;
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
frees any dynamically-allocated storage associated with the compiled RE
is no longer a valid compiled RE
and the effect of supplying it to
None of these functions references global variables except for tables
all are safe for use from multiple threads if the arguments are safe.
.Sh IMPLEMENTATION CHOICES
There are a number of decisions that
leaves up to the implementor,
either by explicitly saying
or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
for a discussion of the definition of case-independent matching.
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
for one short RE using them
that will run almost any system out of memory.
A backslashed character other than one specifically given a magic meaning
(such magic meanings occur only in obsolete
is taken as an ordinary character.
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
the limit on repetition counts in bounded repetitions, is 255.
A repetition operator
cannot follow another
A repetition operator cannot begin an expression or subexpression
cannot appear first or last in a (sub)expression or after another
cannot be an empty subexpression.
An empty parenthesized subexpression,
is legal and matches an
An empty string is not a legal RE.
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
followed by a digit is considered an ordinary character.
beginning and ending subexpressions in obsolete
REs are anchors, not ordinary characters.
sections 2.8 (Regular Expression Notation)
B.5 (C Binding for Regular Expression Matching).
Non-zero error codes from
include the following:
.Bl -tag -width REG_ECOLLATE -compact
invalid regular expression
invalid collating element
invalid character class
applied to unescapable character
invalid backreference number
invalid repetition count(s) in
invalid character range in
empty (sub)expression
can't happen - you found a bug
invalid argument, e.g.\& negative-length string
illegal byte sequence (bad multibyte character)
Originally written by
Altered for inclusion in the
This is an alpha release with known defects.
Please report problems.
The back-reference code is subtle and doubts linger about its correctness
This will improve with later releases.
exceeding 0 is expensive;
exceeding 1 is worse.
is largely insensitive to RE complexity
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
will (eventually) run almost any existing machine out of swap space.
There are suspected problems with response to obscure error conditions.
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
are legal REs because
a special character only in the presence of a previous unmatched
This can't be fixed until the spec is fixed.
The standard's definition of back references is vague.
.Ql "a\e(\e(b\e)*\e2\e)*d"
Until the standard is clarified,
behavior in such cases should not be relied on.
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.