.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\" The Regents of the University of California. All rights reserved.
.\" This code is derived from software contributed to Berkeley by
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 4. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" @(#)regex.3 8.4 (Berkeley) 3/20/94
.\" $FreeBSD: src/lib/libc/regex/regex.3,v 1.21 2007/01/09 00:28:04 imp Exp $
.Nd regular-expression library
.Fa "regex_t *restrict preg"
.Fa "const char *restrict pattern"
.Fa "const regex_t *restrict preg"
.Fa "char *restrict errbuf"
.Fa "size_t errbuf_size"
.Fa "const regex_t *restrict preg"
.Fa "const char *restrict string"
.Fa "regmatch_t pmatch[restrict]"
These routines implement
compiles an RE, written as a string, into an internal form.
matches that internal form against a string and reports results.
transforms error codes from either into human-readable messages.
frees any dynamically-allocated storage used by the internal form
declares two structure types,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
and a number of constants with names starting with
compiles the regular expression contained in the
subject to the flags in
and places the results in the
structure pointed to by
is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_EXTENDED
rather than the obsolete
This is a synonym for 0,
provided as a counterpart to
to improve readability.
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
This is an extension,
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
Compile for matching that ignores upper/lower case distinctions.
Compile for matching that need only report success or failure,
not what was matched.
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
bracket expressions and
anchor matches the null string after any newline in the string
in addition to its normal function,
anchor matches the null string before any newline in the
string in addition to its normal function.
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
member of the structure pointed to by
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
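.Pp
The effect of the compilation flags described above can be seen by compiling the same pattern with and without them.
The following is a minimal sketch, not part of the library:
the helper name re_matches_with and its 1/0/-1 return convention are assumptions made for illustration.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compile `pattern` with the caller's cflags
 * (REG_NOSUB is always added, since only success or failure is reported),
 * match it against `text`, and free the compiled form.
 * Returns 1 on a match, 0 on no match, -1 on any error.
 */
int
re_matches_with(const char *pattern, int cflags, const char *text)
{
	regex_t re;
	int rc;

	if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
		return (-1);

	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);

	if (rc == 0)
		return (1);
	return (rc == REG_NOMATCH ? 0 : -1);
}
```

With REG_EXTENDED alone, the pattern "^world$" does not match the two-line string "hello\nworld", because the anchors apply only to the ends of the whole string; adding REG_NEWLINE lets them match around the embedded newline.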
returns 0 and fills in the structure pointed to by
One member of that structure
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
fails, it returns a non-zero error code;
matches the compiled RE pointed to by
subject to the flags in
and reports results using
and the returned value.
The RE must have been compiled by a previous invocation of
The compiled form is not altered during execution of
so a single compiled RE can be used simultaneously by multiple threads.
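.Pp
Because matching does not alter the compiled form, one compilation can serve any number of matches.
The sketch below compiles a pattern once, applies it to several strings, and then frees it;
the helper name count_matching and its return convention are hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: count how many of the `nlines` strings match the
 * extended RE `pattern`.  The RE is compiled once and reused for every
 * string.  Returns the count, or -1 if the pattern does not compile.
 */
int
count_matching(const char *pattern, const char *const *lines, size_t nlines)
{
	regex_t re;
	size_t i;
	int count = 0;

	/* Only success or failure is needed, so request REG_NOSUB. */
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return (-1);

	for (i = 0; i < nlines; i++) {
		if (regexec(&re, lines[i], 0, NULL, 0) == 0)
			count++;
	}

	regfree(&re);	/* the compiled form is invalid after this call */
	return (count);
}
```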
the NUL-terminated string pointed to by
is considered to be the text of an entire line, minus any terminating
argument is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_STARTEND
The first character of
is not the beginning of a line, so the
anchor should not match before it.
This does not affect the behavior of newlines under
does not end a line, so the
anchor should not match before it.
This does not affect the behavior of newlines under
The string is considered to start at
.Fa pmatch Ns [0]. Ns Va rm_so
and to have a terminating NUL located at
.Fa pmatch Ns [0]. Ns Va rm_eo
(there need not actually be a NUL at that location),
regardless of the value of
See below for the definition of
This is an extension,
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
affects only the location of the string,
not how it is matched.
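.Pp
The REG_STARTEND convention described above can be used to search part of a buffer without copying it or NUL-terminating it.
The following is a sketch only:
the helper name search_range is hypothetical, and because REG_STARTEND is an extension the code is guarded so that it still compiles where the flag is absent.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: search only the bytes buf[start] .. buf[end - 1]
 * for the extended RE `pattern`.  On a match, store the offsets of the
 * match (measured from the beginning of buf) and return 1.
 * Returns 0 on no match, -1 on error or if REG_STARTEND is unavailable.
 */
int
search_range(const char *pattern, const char *buf, size_t start, size_t end,
    size_t *match_so, size_t *match_eo)
{
#ifdef REG_STARTEND
	regex_t re;
	regmatch_t pm[1];
	int rc;

	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-1);

	/* Input: pm[0] carries the bounds of the region to search. */
	pm[0].rm_so = (regoff_t)start;
	pm[0].rm_eo = (regoff_t)end;
	rc = regexec(&re, buf, 1, pm, REG_STARTEND);
	regfree(&re);

	if (rc == REG_NOMATCH)
		return (0);
	if (rc != 0)
		return (-1);

	/* Output: pm[0] now carries the offsets of the match itself. */
	*match_so = (size_t)pm[0].rm_so;
	*match_eo = (size_t)pm[0].rm_eo;
	return (1);
#else
	(void)pattern; (void)buf; (void)start; (void)end;
	(void)match_so; (void)match_eo;
	return (-1);
#endif
}
```

Note that the same array element serves as input (the region bounds) and as output (the match offsets), so the bounds must be set again before each call.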
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
returns 0 for success and the non-zero code
Other non-zero error codes may be returned in exceptional situations;
was specified in the compilation of the RE,
argument (but see below for the case where
points to an array of
Such a structure has at least the members
(a signed arithmetic type at least as large as an
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
The 0th member of the
array is filled in to indicate what substring of
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
reports subexpression
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array (corresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is,
.Fa preg Ns -> Ns Va re_nsub ) )
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE
the parenthesized subexpression matches each of the three
an infinite number of empty strings following the last
so the reported substring is one of the empties.)
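.Pp
The offset reporting described above can be explored with a small helper that returns the offsets recorded for one entry of the match array.
This is a sketch under stated assumptions:
the name group_offsets and the fixed cap MAX_GROUPS are hypothetical, and the offsets are widened to long for convenience.

```c
#include <regex.h>
#include <stddef.h>

#define MAX_GROUPS 10	/* illustrative cap, not part of the library */

/*
 * Hypothetical helper: match the extended RE `pattern` against `text`
 * and report the offsets stored in entry `idx` of the match array
 * (0 is the whole match, 1 and up are parenthesized subexpressions).
 * Returns 1 on a match, 0 on no match, -1 on error.
 */
int
group_offsets(const char *pattern, const char *text, size_t idx,
    long *so, long *eo)
{
	regex_t re;
	regmatch_t pm[MAX_GROUPS];
	int rc;

	if (idx >= MAX_GROUPS)
		return (-1);
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-1);

	rc = regexec(&re, text, MAX_GROUPS, pm, 0);
	regfree(&re);
	if (rc != 0)
		return (rc == REG_NOMATCH ? 0 : -1);

	/* Entries for subexpressions that did not participate hold -1. */
	*so = (long)pm[idx].rm_so;
	*eo = (long)pm[idx].rm_eo;
	return (1);
}
```

For the RE "(a|b)+c" matched against "xaabc", entry 0 spans offsets 1 to 5, while entry 1 spans 3 to 4: the subexpression matched three times and only its last match, the "b", is reported.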
must point to at least one
to hold the input offsets for
Use for output is still entirely controlled by
will not be changed by a successful
to a human-readable, printable message.
.No non\- Ns Dv NULL ,
the error code should have arisen from use of
and if the error code came from
it should have been the result from the most recent
may be able to supply a more detailed message using information
places the NUL-terminated message into the buffer pointed to by
limiting the length (including the NUL) to at most
If the whole message will not fit,
as much of it as will fit before the terminating NUL is supplied.
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
is ignored but the return value is still correct.
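.Pp
Since the return value is the buffer size needed for the whole message, a caller can ask for the size first and allocate exactly that much.
A minimal sketch follows;
the helper name re_error_string is hypothetical, and the caller is assumed to free the result.

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return a freshly allocated copy of the message
 * for `errcode`, or NULL if memory cannot be allocated.
 * The caller is responsible for calling free() on the result.
 */
char *
re_error_string(int errcode, const regex_t *preg)
{
	size_t need;
	char *msg;

	/* With a zero-sized buffer only the required size is returned. */
	need = regerror(errcode, preg, NULL, 0);

	msg = malloc(need);
	if (msg == NULL)
		return (NULL);

	/* The second call fills the buffer, including the terminating NUL. */
	(void)regerror(errcode, preg, msg, need);
	return (msg);
}
```

A typical use is to call the helper with the non-zero value returned by a failed compilation, print the message, and free it.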
that results is the printable name of the error code,
rather than an explanation thereof.
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
are intended primarily as debugging facilities;
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
frees any dynamically-allocated storage associated with the compiled RE
is no longer a valid compiled RE
and the effect of supplying it to
None of these functions references global variables except for tables
all are safe for use from multiple threads if the arguments are safe.
.Sh IMPLEMENTATION CHOICES
There are a number of decisions that
leaves up to the implementor,
either by explicitly saying
or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
for a discussion of the definition of case-independent matching.
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
for one short RE using them
that will run almost any system out of memory.
A backslashed character other than one specifically given a magic meaning
(such magic meanings occur only in obsolete
is taken as an ordinary character.
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
the limit on repetition counts in bounded repetitions, is 255.
A repetition operator
cannot follow another
A repetition operator cannot begin an expression or subexpression
cannot appear first or last in a (sub)expression or after another
cannot be an empty subexpression.
An empty parenthesized subexpression,
is legal and matches an
An empty string is not a legal RE.
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
followed by a digit is considered an ordinary character.
beginning and ending subexpressions in obsolete
REs are anchors, not ordinary characters.
Non-zero error codes from
include the following:
.Bl -tag -width REG_ECOLLATE -compact
invalid regular expression
invalid collating element
invalid character class
applied to unescapable character
invalid backreference number
invalid repetition count(s) in
invalid character range in
empty (sub)expression
cannot happen - you found a bug
invalid argument, e.g.\& negative-length string
illegal byte sequence (bad multibyte character)
sections 2.8 (Regular Expression Notation)
B.5 (C Binding for Regular Expression Matching).
Originally written by
Altered for inclusion in the
This is an alpha release with known defects.
Please report problems.
The back-reference code is subtle and doubts linger about its correctness
This will improve with later releases.
exceeding 0 is expensive;
exceeding 1 is worse.
is largely insensitive to RE complexity
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
will (eventually) run almost any existing machine out of swap space.
There are suspected problems with response to obscure error conditions.
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
are legal REs because
a special character only in the presence of a previous unmatched
This cannot be fixed until the spec is fixed.
The standard's definition of back references is vague.
.Ql "a\e(\e(b\e)*\e2\e)*d"
Until the standard is clarified,
behavior in such cases should not be relied on.
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.
Word-boundary matching does not work properly in multibyte locales.