.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\" The Regents of the University of California. All rights reserved.
.\" This code is derived from software contributed to Berkeley by
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 4. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" @(#)regex.3 8.4 (Berkeley) 3/20/94
.\" $FreeBSD: src/lib/libc/regex/regex.3,v 1.21 2007/01/09 00:28:04 imp Exp $
.Nd regular-expression library
.Sy (Standards-compliant APIs)
.Fa "regex_t *restrict preg"
.Fa "const char *restrict pattern"
.Fa "const regex_t *restrict preg"
.Fa "char *restrict errbuf"
.Fa "size_t errbuf_size"
.Fa "const regex_t *restrict preg"
.Fa "const char *restrict string"
.Fa "regmatch_t pmatch[restrict]"
.Sy (Non-portable extensions)
.Fa "regex_t *restrict preg"
.Fa "const char *restrict pattern"
.Fa "const regex_t *restrict preg"
.Fa "const char *restrict string"
.Fa "regmatch_t pmatch[restrict]"
.Fa "regex_t *restrict preg"
.Fa "const wchar_t *restrict widepat"
.Fa "const regex_t *restrict preg"
.Fa "const wchar_t *restrict widestr"
.Fa "regmatch_t pmatch[restrict]"
.Fa "regex_t *restrict preg"
.Fa "const wchar_t *restrict widepat"
.Fa "const regex_t *restrict preg"
.Fa "const wchar_t *restrict widestr"
.Fa "regmatch_t pmatch[restrict]"
.Fa "regex_t *restrict preg"
.Fa "const char *restrict pattern"
.Fa "locale_t restrict"
.Fa "regex_t *restrict preg"
.Fa "const char *restrict pattern"
.Fa "locale_t restrict"
.Fa "regex_t *restrict preg"
.Fa "const wchar_t *restrict widepat"
.Fa "locale_t restrict"
.Fa "regex_t *restrict preg"
.Fa "const wchar_t *restrict widepat"
.Fa "locale_t restrict"
These routines implement
.St -p1003.2
regular expressions (REs); see
.Xr re_format 7 .
The
.Fn regcomp
function
compiles an RE, written as a string, into an internal form.
.Fn regexec
matches that internal form against a string and reports results.
.Fn regerror
transforms error codes from either into human-readable messages.
.Fn regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.Pp
The header
.In regex.h
declares two structure types,
.Ft regex_t
and
.Ft regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.Ft regoff_t ,
and a number of constants with names starting with
.Dq Dv REG_ .
.Pp
The
.Fn regcomp
function
compiles the regular expression contained in the
.Fa pattern
string,
subject to the flags in
.Fa cflags ,
and places the results in the
.Ft regex_t
structure pointed to by
.Fa preg .
The
.Fa cflags
argument
is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_EXTENDED
rather than the obsolete
This is a synonym for 0,
provided as a counterpart to
to improve readability.
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
This is an extension,
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
Compile for matching that ignores upper/lower case distinctions.
Compile for matching that need only report success or failure,
not what was matched.
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
bracket expressions and
anchor matches the null string after any newline in the string
in addition to its normal function,
anchor matches the null string before any newline in the
string in addition to its normal function.
is not recognized by any of the wide character or
variants can be used instead of
see EXTENDED APIS below.)
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
member of the structure pointed to by
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
Recognize enhanced regular expression features; see
This is an extension not specified by
and should be used with
caution in software intended to be portable to other systems.
Use minimal (non-greedy) repetitions instead of the normal greedy ones; see
(This only applies when both
This is an extension not specified by
and should be used with
caution in software intended to be portable to other systems.
returns 0 and fills in the structure pointed to by
One member of that structure
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
fails, it returns a non-zero error code;
matches the compiled RE pointed to by
subject to the flags in
and reports results using
and the returned value.
The RE must have been compiled by a previous invocation of
The compiled form is not altered during execution of
so a single compiled RE can be used simultaneously by multiple threads.
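.Pp
As an illustration of the calling sequence described above, the following
minimal sketch compiles an RE, matches it against one string, and frees it.
The helper name
.Fn re_matches
is hypothetical and is not part of this library;
.Dv REG_EXTENDED
and
.Dv REG_NOSUB
are chosen only because the sketch needs a yes-or-no answer.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the library): returns 1 if text
 * matches the extended RE in pattern, 0 if it does not, and -1 if
 * the pattern fails to compile or regexec() reports another error.
 */
int
re_matches(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);

	/* REG_NOSUB: only success or failure is needed, not offsets. */
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return (-1);

	/* With REG_NOSUB, nmatch and pmatch are ignored. */
	rc = regexec(&re, text, 0, NULL, 0);

	/* Release storage held by the compiled form. */
	regfree(&re);

	if (rc == 0)
		return (1);
	return (rc == REG_NOMATCH ? 0 : -1);
}
```
.Ed
.Pp
A caller that matches the same RE against many strings would normally keep the
compiled
.Ft regex_t
and call
.Fn regfree
once at the end, rather than recompiling for each string as this sketch does.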
the NUL-terminated string pointed to by
is considered to be the text of an entire line, minus any terminating
argument is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_STARTEND
The first character of
is not the beginning of a line, so the
anchor should not match before it.
This does not affect the behavior of newlines under
does not end a line, so the
anchor should not match before it.
This does not affect the behavior of newlines under
The string is considered to start at
.Fa pmatch Ns [0]. Ns Va rm_so
and to have a terminating NUL located at
.Fa pmatch Ns [0]. Ns Va rm_eo
(there need not actually be a NUL at that location),
regardless of the value of
See below for the definition of
This is an extension,
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
affects only the location of the string,
not how it is matched.
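.Pp
The following sketch shows one way the
.Dv REG_STARTEND
convention described above might be used to search part of a buffer.
The helper name
.Fn search_range
and its return-value convention are hypothetical;
.Dv REG_STARTEND
itself is an extension, so the sketch assumes a system that provides it.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: search buf[start..end) for the extended RE
 * in pattern using REG_STARTEND.  Returns the offset of the match
 * measured from the start of buf, -1 for no match, -2 on error.
 */
long
search_range(const char *pattern, const char *buf, size_t start, size_t end)
{
	regex_t re;
	regmatch_t pm[1];
	int rc;

	assert(pattern != NULL && buf != NULL && start <= end);

	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-2);

	/* Input offsets: where the string starts and where it ends. */
	pm[0].rm_so = (regoff_t)start;
	pm[0].rm_eo = (regoff_t)end;

	rc = regexec(&re, buf, 1, pm, REG_STARTEND);
	regfree(&re);

	if (rc == REG_NOMATCH)
		return (-1);
	if (rc != 0)
		return (-2);
	/* On success pm[0] has been overwritten with the match offsets. */
	return ((long)pm[0].rm_so);
}
```
.Ed
.Pp
The sketch avoids anchors in its patterns on purpose, since a non-zero
starting offset does not by itself change how the string is matched.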
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
returns 0 for success and the non-zero code
Other non-zero error codes may be returned in exceptional situations;
was specified in the compilation of the RE,
argument (but see below for the case where
points to an array of
Such a structure has at least the members
(a signed arithmetic type at least as large as an
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
The 0th member of the
array is filled in to indicate what substring of
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
reports subexpression
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array (corresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is,
.Fa preg Ns -> Ns Va re_nsub ) )
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE
the parenthesized subexpression matches each of the three
an infinite number of empty strings following the last
so the reported substring is one of the empties.)
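.Pp
To make the match-reporting rules above concrete, the following sketch
collects the offsets for the whole match and for each parenthesized
subexpression.
The helper name
.Fn match_groups
and the constant
.Dv MAXGROUPS
are hypothetical, chosen only for this illustration.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#define MAXGROUPS 8	/* hypothetical cap chosen for this sketch */

/*
 * Hypothetical helper: match text against the extended RE in pattern
 * and store up to MAXGROUPS match offsets in out.  Returns the number
 * of entries that are meaningful (1 + re_nsub, capped), 0 for no
 * match, or -1 on error.
 */
int
match_groups(const char *pattern, const char *text, regmatch_t out[MAXGROUPS])
{
	regex_t re;
	size_t n;
	int rc;

	assert(pattern != NULL && text != NULL && out != NULL);

	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-1);

	/* Entry 0 is the whole match; entries 1..re_nsub are groups. */
	n = re.re_nsub + 1;
	if (n > MAXGROUPS)
		n = MAXGROUPS;

	rc = regexec(&re, text, MAXGROUPS, out, 0);
	regfree(&re);

	if (rc == REG_NOMATCH)
		return (0);
	return (rc == 0 ? (int)n : -1);
}
```
.Ed
.Pp
A subexpression that did not participate in the match is reported with both
offsets set to -1, so callers should test for that before using an entry.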
must point to at least one
to hold the input offsets for
Use for output is still entirely controlled by
will not be changed by a successful
to a human-readable, printable message.
.No non\- Ns Dv NULL ,
the error code should have arisen from use of
and if the error code came from
it should have been the result from the most recent
may be able to supply a more detailed message using information
places the NUL-terminated message into the buffer pointed to by
limiting the length (including the NUL) to at most
If the whole message will not fit,
as much of it as will fit before the terminating NUL is supplied.
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
is ignored but the return value is still correct.
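.Pp
Because the return value reports the space required even when no buffer is
supplied, a message buffer can be sized in two calls.
The following is a minimal sketch of that approach; the helper name
.Fn re_strerror
is hypothetical and is not part of this library.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return a malloc()ed copy of the message for
 * errcode, or NULL if allocation fails.  The caller frees the result.
 */
char *
re_strerror(int errcode, const regex_t *preg)
{
	size_t need;
	char *msg;

	/* A zero-size call only reports the space required. */
	need = regerror(errcode, preg, NULL, 0);
	assert(need > 0);	/* the count includes the terminating NUL */

	msg = malloc(need);
	if (msg != NULL)
		(void)regerror(errcode, preg, msg, need);
	return (msg);
}
```
.Ed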
that results is the printable name of the error code,
rather than an explanation thereof.
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
are intended primarily as debugging facilities;
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
frees any dynamically-allocated storage associated with the compiled RE
is no longer a valid compiled RE
and the effect of supplying it to
None of these functions references global variables except for tables
all are safe for use from multiple threads if the arguments are safe.
These extended APIs are available in Mac OS X 10.8 and beyond, when the
deployment target is 10.8 or later.
It should also be noted that any of the
variants may be used to initialize a
structure that can then be passed to any of the
So it is quite legal to compile a wide character RE and use it to match a
multibyte character string, or vice versa.
routine compiles regular expressions like
but the length of the regular expression string is specified, allowing a string
that is not NUL terminated and/or contains NUL characters.
This is a modern replacement for using
but the length of the string to match is specified, allowing a string
that is not NUL terminated and/or contains NUL characters.
variants take a wide-character
string for the regular expression and string to match.
are variants that allow specifying the wide character string length, and
so allow wide character strings that are not NUL terminated and/or
contain NUL characters.
.Sh INTERACTION WITH THE LOCALE
or one of its variants is run, the regular expression is compiled into an
internal form, which may include specific information about the locale currently
in effect, such as equivalence classes or multi-character collation symbols.
So a reference to the current locale is also stored with the internal form,
is run, it can use the same locale (even if the locale is changed in-between
To provide more direct control over which locale is used,
appended to their names are provided that work just like the variants
except that a locale (via a
variable type) is specified directly.
Note that only variants of
variants just use the reference to the locale stored in the internal form.
.Sh IMPLEMENTATION CHOICES
implementation in Mac OS X 10.8 and later is based on a heavily modified subset
of TRE (http://laurikari.net/tre/).
This provides improved performance, better conformance and additional features.
However, both API and binary compatibility have been maintained with previous
releases, so binaries
built on previous releases should work on 10.8 and later, and binaries built on
10.8 and later should be able to run on previous releases (as long as none of
the new variants or new features are used).
There are a number of decisions that
leaves up to the implementor,
either by explicitly saying
or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
for a discussion of the definition of case-independent matching.
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
for one short RE using them
that will run almost any system out of memory.
A backslashed character other than one specifically given a magic meaning
(such magic meanings occur only in obsolete
is taken as an ordinary character.
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
the limit on repetition counts in bounded repetitions, is 255.
A repetition operator
cannot follow another
repetition operator, except for the use of
for minimal repetition (for enhanced extended REs; see
A repetition operator cannot begin an expression or subexpression
cannot appear first or last in a (sub)expression or after another
cannot be an empty subexpression.
An empty parenthesized subexpression,
is legal and matches an
An empty string is not a legal RE.
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
followed by a digit is considered an ordinary character.
beginning and ending subexpressions in obsolete
REs are anchors, not ordinary characters.
Non-zero error codes from
include the following:
.Bl -tag -width REG_ECOLLATE -compact
invalid regular expression
invalid collating element
invalid character class
applied to unescapable character
invalid backreference number
invalid repetition count(s) in
invalid character range in
empty (sub)expression
cannot happen - you found a bug
invalid argument, e.g.\& negative-length string
illegal byte sequence (bad multibyte character)
sections 2.8 (Regular Expression Notation)
B.5 (C Binding for Regular Expression Matching).
implementation is based on a heavily modified subset of TRE
(http://laurikari.net/tre/), originally written by Ville Laurikari.
Previous releases used an implementation originally written by
and altered for inclusion in the
The beginning-of-line and end-of-line anchors (
are currently implemented so that repetitions cannot be applied to them.
The standards are unclear about whether this is legal, but other
packages do support this case.
It is best to avoid this non-portable (and not really very useful) case.
The back-reference code is subtle and doubts linger about its correctness
variants use one of two internal matching engines.
The normal one has linear worst-case time in the length of the text being
searched, and quadratic worst-case time in the length of the regular
expression.
When back-references are used, a slower, backtracking engine is used.
While all backtracking matching engines suffer from extreme slowness for certain
pathological cases, the normal engine does not suffer from these cases.
It is advised to avoid back-references whenever possible.
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
will (eventually) run almost any existing machine out of swap space.
are legal REs because
a special character only in the presence of a previous unmatched
This cannot be fixed until the spec is fixed.
The standard's definition of back references is vague.
.Ql "a\e(\e(b\e)*\e2\e)*d"
Until the standard is clarified,
behavior in such cases should not be relied on.