.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"	@(#)regex.3	8.4 (Berkeley) 3/20/94
.\" $FreeBSD: src/lib/libc/regex/regex.3,v 1.17 2004/07/12 11:03:42 tjr Exp $
.\"
.Sh NAME
.Nm regcomp ,
.Nm regexec ,
.Nm regerror ,
.Nm regfree
.Nd regular-expression library
.Sh LIBRARY
.Lb libc
.Sh SYNOPSIS
.In regex.h
.Ft int
.Fo regcomp
.Fa "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
.Fc
.Ft int
.Fo regexec
.Fa "const regex_t * restrict preg" "const char * restrict string"
.Fa "size_t nmatch" "regmatch_t pmatch[restrict]" "int eflags"
.Fc
.Ft size_t
.Fo regerror
.Fa "int errcode" "const regex_t * restrict preg"
.Fa "char * restrict errbuf" "size_t errbuf_size"
.Fc
.Ft void
.Fn regfree "regex_t *preg"
.Sh DESCRIPTION
These routines implement
.St -p1003.2
regular expressions
.Pq Do RE Dc Ns s ;
see
.Xr re_format 7 .
The
.Fn regcomp
function
compiles an RE written as a string into an internal form,
.Fn regexec
matches that internal form against a string and reports results,
.Fn regerror
transforms error codes from either into human-readable messages,
and
.Fn regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.Pp
The header
.In regex.h
declares two structure types,
.Ft regex_t
and
.Ft regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.Ft regoff_t ,
and a number of constants with names starting with
.Dq Dv REG_ .
.Pp
The
.Fn regcomp
function
compiles the regular expression contained in the
.Fa pattern
string,
subject to the flags in
.Fa cflags ,
and places the results in the
.Ft regex_t
structure pointed to by
.Fa preg .
The
.Fa cflags
argument
is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_EXTENDED
.It Dv REG_EXTENDED
Compile modern
.Pq Dq extended
REs,
rather than the obsolete
.Pq Dq basic
REs that
are the default.
.It Dv REG_BASIC
This is a synonym for 0,
provided as a counterpart to
.Dv REG_EXTENDED
to improve readability.
.It Dv REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the
.Dq RE
is a literal string.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Dv REG_EXTENDED
and
.Dv REG_NOSPEC
may not be used
in the same call to
.Fn regcomp .
.It Dv REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.Xr re_format 7 .
.It Dv REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.It Dv REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
.Ql [^
bracket expressions and
.Ql .\&
never match newline,
a
.Ql ^\&
anchor matches the null string after any newline in the string
in addition to its normal function,
and the
.Ql $\&
anchor matches the null string before any newline in the
string in addition to its normal function.
.It Dv REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.Va re_endp
member of the structure pointed to by
.Fa preg .
The
.Va re_endp
member is of type
.Ft "const char *" .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.El
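.Pp
The effect of newline-sensitive compilation on the anchors and on
.Ql .\&
can be checked with a small experiment.
The following is only a sketch under stated assumptions:
.Fn anchored_line_match
is a hypothetical helper, not part of this library,
and it always compiles its pattern in extended syntax.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * anchored_line_match: hypothetical helper, not a library function.
 * Compiles pattern in extended syntax, optionally newline-sensitive,
 * and runs it once against text.  Returns 0 on a match (offsets of the
 * whole match are stored in *m), REG_NOMATCH when nothing matches, or
 * another non-zero code if compilation fails.
 */
int
anchored_line_match(const char *pattern, const char *text,
    int newline_sensitive, regmatch_t *m)
{
	regex_t re;
	int cflags, rc;

	assert(pattern != NULL && text != NULL && m != NULL);
	cflags = REG_EXTENDED;
	if (newline_sensitive)
		cflags |= REG_NEWLINE;
	rc = regcomp(&re, pattern, cflags);
	if (rc != 0)
		return (rc);
	rc = regexec(&re, text, 1, m, 0);
	regfree(&re);
	return (rc);
}
```

Given a three-line text holding the lines a, b, and c,
the pattern
.Ql ^b$
is expected to match only when
.Fa newline_sensitive
is non-zero.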
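.Pp
When successful,
.Fn regcomp
returns 0 and fills in the structure pointed to by
.Fa preg .
One member of that structure
(other than
.Va re_endp )
is publicized:
.Va re_nsub ,
of type
.Ft size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
.Dv REG_NOSUB
flag was used).
If
.Fn regcomp
fails, it returns a non-zero error code;
see
.Sx DIAGNOSTICS .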
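.Pp
As a minimal sketch of the success and failure cases of
.Fn regcomp ,
the following reads the subexpression count after a successful compile.
The name
.Fn count_groups
and the choice of extended syntax are assumptions made for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * count_groups: hypothetical helper, not a library function.
 * Compiles pattern as an extended RE and returns the number of
 * parenthesized subexpressions, or -1 if regcomp() reports an error.
 */
long
count_groups(const char *pattern)
{
	regex_t re;
	long n;

	assert(pattern != NULL);
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-1);	/* re does not hold a valid compiled RE */
	n = (long)re.re_nsub;	/* defined because REG_NOSUB was not used */
	regfree(&re);
	return (n);
}
```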
.Pp
The
.Fn regexec
function
matches the compiled RE pointed to by
.Fa preg
against the
.Fa string ,
subject to the flags in
.Fa eflags ,
and reports results using
.Fa nmatch ,
.Fa pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.Fn regcomp .
The compiled form is not altered during execution of
.Fn regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
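.Pp
Because the compiled form is not altered by matching,
one compilation can serve many calls to
.Fn regexec .
The sketch below compiles once and then tests several strings;
.Fn count_matching
is a hypothetical helper, and the use of
.Dv REG_EXTENDED
with
.Dv REG_NOSUB
is an assumption chosen because only success or failure is needed here.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * count_matching: hypothetical helper, not a library function.
 * Compiles pattern once and counts how many of the nlines strings
 * contain a match.  Returns -1 if the pattern does not compile.
 */
long
count_matching(const char *pattern, const char *const lines[],
    size_t nlines)
{
	regex_t re;
	long hits = 0;
	size_t i;

	assert(pattern != NULL && (lines != NULL || nlines == 0));
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return (-1);
	for (i = 0; i < nlines; i++) {
		/* nmatch is 0, so the pmatch argument is ignored. */
		if (regexec(&re, lines[i], 0, NULL, 0) == 0)
			hits++;
	}
	regfree(&re);
	return (hits);
}
```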
.Pp
By default,
the NUL-terminated string pointed to by
.Fa string
is considered to be the text of an entire line, minus any terminating
newline.
The
.Fa eflags
argument is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_STARTEND
.It Dv REG_NOTBOL
The first character of
the string
is not the beginning of a line, so the
.Ql ^\&
anchor should not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_NOTEOL
The NUL terminating
the string
does not end a line, so the
.Ql $\&
anchor should not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_STARTEND
The string is considered to start at
.Fa string
+
.Fa pmatch Ns [0]. Ns Va rm_so
and to have a terminating NUL located at
.Fa string
+
.Fa pmatch Ns [0]. Ns Va rm_eo
(there need not actually be a NUL at that location),
regardless of the value of
.Fa nmatch .
See below for the definition of
.Fa pmatch
and
.Fa nmatch .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Note that a non-zero
.Va rm_so
does not imply
.Dv REG_NOTBOL ;
.Dv REG_STARTEND
affects only the location of the string,
not how it is matched.
.El
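.Pp
Since
.Dv REG_STARTEND
is an extension, portable code may want a fallback.
The following sketch searches a sub-range of a string, using
.Dv REG_STARTEND
when the header defines it and a temporary copy otherwise.
The helper name
.Fn match_in_range
and the copy-based fallback are assumptions for illustration;
the example deliberately uses an RE without anchors.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * match_in_range: hypothetical helper, not a library function.
 * Searches only text[start..end) with the compiled RE and stores the
 * whole match in *m, with offsets measured from the beginning of text.
 * Returns the regexec() result, or REG_ESPACE if the fallback cannot
 * allocate its temporary copy.
 */
int
match_in_range(const regex_t *re, const char *text, size_t start,
    size_t end, regmatch_t *m)
{
	assert(re != NULL && text != NULL && m != NULL && start <= end);
#ifdef REG_STARTEND
	/* Input offsets are passed in pmatch[0]; no NUL is needed at end. */
	m->rm_so = (regoff_t)start;
	m->rm_eo = (regoff_t)end;
	return (regexec(re, text, 1, m, REG_STARTEND));
#else
	/* Fallback: copy the range, match the copy, then rebase offsets. */
	char *copy;
	int rc;

	copy = malloc(end - start + 1);
	if (copy == NULL)
		return (REG_ESPACE);
	memcpy(copy, text + start, end - start);
	copy[end - start] = 0;
	rc = regexec(re, copy, 1, m, 0);
	free(copy);
	if (rc == 0) {
		m->rm_so += (regoff_t)start;
		m->rm_eo += (regoff_t)start;
	}
	return (rc);
#endif
}
```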
.Pp
See
.Xr re_format 7
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.Fa string .
.Pp
Normally,
.Fn regexec
returns 0 for success and the non-zero code
.Dv REG_NOMATCH
for failure.
Other non-zero error codes may be returned in exceptional situations;
see
.Sx DIAGNOSTICS .
.Pp
If
.Dv REG_NOSUB
was specified in the compilation of the RE,
or if
.Fa nmatch
is 0,
.Fn regexec
ignores the
.Fa pmatch
argument (but see below for the case where
.Dv REG_STARTEND
is specified).
Otherwise,
.Fa pmatch
points to an array of
.Fa nmatch
structures of type
.Ft regmatch_t .
Such a structure has at least the members
.Va rm_so
and
.Va rm_eo ,
both of type
.Ft regoff_t
(a signed arithmetic type at least as large as an
.Ft off_t
and a
.Ft ssize_t ) ,
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.Fa string
argument given to
.Fn regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.Pp
The 0th member of the
.Fa pmatch
array is filled in to indicate what substring of
.Fa string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.Va i
reports subexpression
.Va i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array (corresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is,
.Va i
>
.Fa preg Ns -> Ns Va re_nsub ) )
have both
.Va rm_so
and
.Va rm_eo
set to -1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE
.Ql "(b*)+"
matches
.Ql bbb ,
the parenthesized subexpression matches each of the three
.So Li b Sc Ns s
and then
an infinite number of empty strings following the last
.Ql b ,
so the reported substring is one of the empties.)
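.Pp
The layout of the
.Fa pmatch
array can be illustrated with a short sketch.
The helper
.Fn split_address ,
its fixed pattern, and the array size
.Dv NGROUPS
are all hypothetical and chosen only for this example;
the third subexpression is optional so that an unused entry can be observed.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#define NGROUPS 4	/* whole match plus three subexpressions */

/*
 * split_address: hypothetical helper, not a library function.
 * Looks for user@host with an optional :port suffix in text.
 * On success returns 0 with pm[0] covering the whole match, pm[1] the
 * user, pm[2] the host, and pm[3] the :port part.  Otherwise returns
 * the non-zero code from regcomp() or regexec().
 */
int
split_address(const char *text, regmatch_t pm[NGROUPS])
{
	regex_t re;
	int rc;

	assert(text != NULL && pm != NULL);
	rc = regcomp(&re, "([a-z]+)@([a-z]+)(:[0-9]+)?", REG_EXTENDED);
	if (rc != 0)
		return (rc);
	rc = regexec(&re, text, NGROUPS, pm, 0);
	regfree(&re);
	return (rc);
}
```

A matched substring
.Va i
starts at
.Fa text
+
.Fa pm Ns [i]. Ns Va rm_so
and is
.Fa pm Ns [i]. Ns Va rm_eo
-
.Fa pm Ns [i]. Ns Va rm_so
bytes long; callers should check for -1 before using an entry.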
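.Pp
If
.Dv REG_STARTEND
is specified,
.Fa pmatch
must point to at least one
.Ft regmatch_t
(even if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified),
to hold the input offsets for
.Dv REG_STARTEND .
Use for output is still entirely controlled by
.Fa nmatch ;
if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified,
the value of
.Fa pmatch Ns [0]
will not be changed by a successful
.Fn regexec .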
.Pp
The
.Fn regerror
function
maps a non-zero
.Fa errcode
from either
.Fn regcomp
or
.Fn regexec
to a human-readable, printable message.
If
.Fa preg
is
.No non\- Ns Dv NULL ,
the error code should have arisen from use of
the
.Ft regex_t
pointed to by
.Fa preg ,
and if the error code came from
.Fn regcomp ,
it should have been the result from the most recent
.Fn regcomp
using that
.Ft regex_t .
.Po
The
.Fn regerror
function
may be able to supply a more detailed message using information
from the
.Ft regex_t .
.Pc
The
.Fn regerror
function
places the NUL-terminated message into the buffer pointed to by
.Fa errbuf ,
limiting the length (including the NUL) to at most
.Fa errbuf_size
bytes.
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.Fa errbuf_size
is 0,
.Fa errbuf
is ignored but the return value is still correct.
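.Pp
The size-query behavior of
.Fn regerror
lends itself to a two-call idiom: ask for the required size first,
then fetch the message into a buffer of exactly that size.
The sketch below assumes a hypothetical helper named
.Fn regex_strerror
that returns storage obtained from
.Xr malloc 3 .

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * regex_strerror: hypothetical helper, not a library function.
 * Returns a newly allocated copy of the message for errcode, or NULL
 * if memory cannot be allocated.  The caller must free() the result.
 */
char *
regex_strerror(int errcode, const regex_t *preg)
{
	size_t need;
	char *buf;

	assert(errcode != 0);
	/* With errbuf_size 0 the buffer is ignored; only the size returns. */
	need = regerror(errcode, preg, NULL, 0);
	buf = malloc(need);
	if (buf == NULL)
		return (NULL);
	/* need already counts the terminating NUL, so nothing is cut off. */
	(void)regerror(errcode, preg, buf, need);
	return (buf);
}
```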
.Pp
If the
.Fa errcode
given to
.Fn regerror
is first ORed with
.Dv REG_ITOA ,
the
.Dq message
that results is the printable name of the error code,
e.g.\&
.Dq Dv REG_NOMATCH ,
rather than an explanation thereof.
If
.Fa errcode
is
.Dv REG_ATOI ,
then
.Fa preg
shall be
.No non\- Ns Dv NULL
and the
.Va re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.Fa errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
.Dv REG_ITOA
and
.Dv REG_ATOI
are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.Pp
The
.Fn regfree
function
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.Fa preg .
The remaining
.Ft regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.Fn regexec
or
.Fn regerror
is undefined.
.Pp
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.Sh IMPLEMENTATION CHOICES
There are a number of decisions that
.St -p1003.2
leaves up to the implementor,
either by explicitly saying
.Dq undefined
or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.Pp
See
.Xr re_format 7
for a discussion of the definition of case-independent matching.
.Pp
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See
.Sx BUGS
for one short RE using them
that will run almost any system out of memory.
.Pp
A backslashed character other than one specifically given a magic meaning
by
.St -p1003.2
(such magic meanings occur only in obsolete
.Bq Dq basic
REs)
is taken as an ordinary character.
.Pp
Any unmatched
.Ql [\&
is a
.Dv REG_EBRACK
error.
.Pp
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.Pp
.Dv RE_DUP_MAX ,
the limit on repetition counts in bounded repetitions, is 255.
.Pp
A repetition operator
.Po
.Ql ?\& ,
.Ql *\& ,
.Ql +\& ,
or bounds
.Pc
cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow
.Ql ^\&
or
.Ql |\& .
.Pp
.Ql |\&
cannot appear first or last in a (sub)expression or after another
.Ql |\& ,
i.e., an operand of
.Ql |\&
cannot be an empty subexpression.
An empty parenthesized subexpression,
.Ql "()" ,
is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.Pp
A
.Ql {\&
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A
.Ql {\&
.Em not
followed by a digit is considered an ordinary character.
.Pp
.Ql ^\&
and
.Ql $\&
beginning and ending subexpressions in obsolete
.Pq Dq basic
REs are anchors, not ordinary characters.
.Sh SEE ALSO
.Xr grep 1 ,
.Xr re_format 7
.Pp
.St -p1003.2 ,
sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
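.Sh DIAGNOSTICS
Non-zero error codes from
.Fn regcomp
and
.Fn regexec
include the following:
.Pp
.Bl -tag -width REG_ECOLLATE -compact
.It Dv REG_NOMATCH
The
.Fn regexec
function
failed to match
.It Dv REG_BADPAT
invalid regular expression
.It Dv REG_ECOLLATE
invalid collating element
.It Dv REG_ECTYPE
invalid character class
.It Dv REG_EESCAPE
.Ql \e
applied to unescapable character
.It Dv REG_ESUBREG
invalid backreference number
.It Dv REG_EBRACK
brackets
.Ql "[ ]"
not balanced
.It Dv REG_EPAREN
parentheses
.Ql "( )"
not balanced
.It Dv REG_EBRACE
braces
.Ql "{ }"
not balanced
.It Dv REG_BADBR
invalid repetition count(s) in
.Ql "{ }"
.It Dv REG_ERANGE
invalid character range in
.Ql "[ ]"
.It Dv REG_ESPACE
ran out of memory
.It Dv REG_BADRPT
.Ql ?\& ,
.Ql *\& ,
or
.Ql +\&
operand invalid
.It Dv REG_EMPTY
empty (sub)expression
.It Dv REG_ASSERT
cannot happen; this indicates a bug in the library
.It Dv REG_INVARG
invalid argument, e.g.\& negative-length string
.It Dv REG_ILLSEQ
illegal byte sequence (bad multibyte character)
.El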
.Sh HISTORY
Originally written by
.An Henry Spencer .
Altered for inclusion in the
.Bx 4.4
distribution.
.Sh BUGS
This is an alpha release with known defects.
Please report problems.
.Pp
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.Pp
The performance of
.Fn regexec
is poor.
This will improve with later releases.
The
.Fa nmatch
argument
exceeding 0 is expensive;
.Fa nmatch
exceeding 1 is worse.
The
.Fn regexec
function
is largely insensitive to RE complexity
.Em except
that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.Pp
The
.Fn regcomp
function
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
will (eventually) run almost any existing machine out of swap space.
.Pp
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.Pp
Due to a mistake in
.St -p1003.2 ,
things like
.Ql "a)b"
are legal REs because
.Ql )\&
is
a special character only in the presence of a previous unmatched
.Ql (\& .
This can't be fixed until the spec is fixed.
.Pp
The standard's definition of back references is vague.
For example, does
.Ql "a\e(\e(b\e)*\e2\e)*d"
match
.Ql "abbbd" ?
Until the standard is clarified,
behavior in such cases should not be relied on.
.Pp
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.