.TH REGEX 3 "25 Sept 1997"
.\" one other place knows this name: the SEE ALSO section
.SH NAME
regcomp, regexec, regerror, regfree \- regular-expression library
.SH SYNOPSIS
.ft B
#include <sys/types.h>
.br
#include <regex.h>
.HP 10
int regcomp(regex_t\ *preg, const\ char\ *pattern, int\ cflags);
.HP
int\ regexec(const\ regex_t\ *preg, const\ char\ *string,
size_t\ nmatch, regmatch_t\ pmatch[], int\ eflags);
.HP
size_t\ regerror(int\ errcode, const\ regex_t\ *preg,
char\ *errbuf, size_t\ errbuf_size);
.HP
void\ regfree(regex_t\ *preg);
.ft
.SH DESCRIPTION
These routines implement POSIX 1003.2 regular expressions (``RE''s);
see
.IR regex (7).
.I Regcomp
compiles an RE written as a string into an internal form,
.I regexec
matches that internal form against a string and reports results,
.I regerror
transforms error codes from either into human-readable messages,
and
.I regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.PP
The header
.I <regex.h>
declares two structure types,
.I regex_t
and
.IR regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.IR regoff_t ,
and a number of constants with names starting with ``REG_''.
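.PP
The compile, match, free cycle described above can be sketched as follows.
This is a minimal illustration, not part of the library:
the helper name
.I re_contains
is hypothetical,
and only flags and codes specified by POSIX are assumed.
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the library).
 * Returns 1 if `string` contains a match for the extended RE `pattern`,
 * 0 if it does not, and -1 if the pattern does not compile or
 * regexec reports an error other than REG_NOMATCH.
 */
static int re_contains(const char *pattern, const char *string)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: only success or failure is wanted. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, string, 0, NULL, 0);
    regfree(&re);               /* release the compiled form */

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```
.fi
.PP
A caller would use it as, say, re_contains("ab+c", "xxabbbcxx").
A program that matches the same RE many times would normally
compile it once and keep the compiled form rather than recompiling per call.
.PP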
.PP
.I Regcomp
compiles the regular expression contained in the
.I pattern
string,
subject to the flags in
.IR cflags ,
and places the results in the
.I regex_t
structure pointed to by
.IR preg .
.I Cflags
is the bitwise OR of zero or more of the following flags:
.IP REG_EXTENDED \w'REG_EXTENDED'u+2n
Compile modern (``extended'') REs,
rather than the obsolete (``basic'') REs that
are the default.
.IP REG_BASIC
This is a synonym for 0,
provided as a counterpart to REG_EXTENDED to improve readability.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
.IP REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the ``RE'' is a literal string.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
REG_EXTENDED and REG_NOSPEC may not be used
in the same call to
.IR regcomp .
.IP REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.IR regex (7).
.IP REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.IP REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
`[^' bracket expressions and `.' never match newline,
a `^' anchor matches the null string after any newline in the string
in addition to its normal function,
and the `$' anchor matches the null string before any newline in the
string in addition to its normal function.
.IP REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.I re_endp
member of the structure pointed to by
.IR preg .
The
.I re_endp
member is of type
.IR const\ char\ * .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
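.PP
As an illustration of how the portable compile flags change matching,
the following sketch compiles a pattern with caller-supplied
.I cflags
and reports whether it matches.
The helper name
.I match_cflags
is hypothetical;
the extension flags (REG_BASIC, REG_NOSPEC, REG_PEND) are deliberately
not used, since other systems may not define them.
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the library).
 * Compiles `pattern` with the given cflags (REG_NOSUB is always added)
 * and returns 1 on a match, 0 on REG_NOMATCH, -1 on any error.
 */
static int match_cflags(const char *pattern, const char *string, int cflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, string, 0, NULL, 0);
    regfree(&re);

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```
.fi
.PP
With REG_EXTENDED alone, the pattern `HELLO' does not match `hello';
adding REG_ICASE makes it match.
Likewise `^world$' matches the second line of a two-line string
only when REG_NEWLINE is added.
.PP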
.PP
When successful,
.I regcomp
returns 0 and fills in the structure pointed to by
.IR preg .
One member of that structure
(other than
.IR re_endp )
is publicized:
.IR re_nsub ,
of type
.IR size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
REG_NOSUB flag was used).
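.PP
One use of
.I re_nsub
is sizing the match array:
.I re_nsub
+ 1 entries hold the whole match plus every subexpression.
The sketch below assumes a hypothetical helper name,
.IR match_groups ,
and must not pass REG_NOSUB, since
.I re_nsub
would then be undefined.
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the library).
 * Matches the extended RE `pattern` against `string` and returns a
 * malloc'ed array of re_nsub + 1 regmatch_t entries (entry 0 is the
 * whole match).  The entry count is stored in *count.
 * Returns NULL on compile error, no match, or allocation failure.
 * The caller frees the array.
 */
static regmatch_t *match_groups(const char *pattern, const char *string,
                                size_t *count)
{
    regex_t re;
    regmatch_t *m;
    size_t n;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return NULL;

    n = re.re_nsub + 1;         /* defined because REG_NOSUB was not used */
    m = malloc(n * sizeof *m);
    if (m != NULL && regexec(&re, string, n, m, 0) != 0) {
        free(m);
        m = NULL;
    }
    regfree(&re);

    if (m != NULL)
        *count = n;
    return m;
}
```
.fi
.PP
For the RE `(a+)(b+)(c+)?' the count is 4;
the entry for a subexpression that took no part in the match
holds \-1 in both offsets.
.PP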
If
.I regcomp
fails, it returns a non-zero error code;
see DIAGNOSTICS.
.PP
.I Regexec
matches the compiled RE pointed to by
.I preg
against the
.IR string ,
subject to the flags in
.IR eflags ,
and reports results using
.IR nmatch ,
.IR pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.IR regcomp .
The compiled form is not altered during execution of
.IR regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.PP
By default,
the NUL-terminated string pointed to by
.I string
is considered to be the text of an entire line,
with the NUL indicating the end of the line.
(That is,
any other end-of-line marker is considered to have been removed
and replaced by the NUL.)
The
.I eflags
argument is the bitwise OR of zero or more of the following flags:
.IP REG_NOTBOL \w'REG_STARTEND'u+2n
The first character of
the string
is not the beginning of a line, so the `^' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_NOTEOL
The NUL terminating
the string
does not end a line, so the `$' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_STARTEND
The string is considered to start at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
and to have a terminating NUL located at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
(there need not actually be a NUL at that location),
regardless of the value of
.IR nmatch .
See below for the definition of
.IR pmatch
and
.IR nmatch .
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Note that a non-zero \fIrm_so\fR does not imply REG_NOTBOL;
REG_STARTEND affects only the location of the string,
not how it is matched.
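.PP
A common use of REG_NOTBOL is scanning one line for successive matches:
every call after the first starts in mid-line,
so `^' must not match at the new starting point.
The following is a sketch under that reading,
using only portable flags;
the helper name
.I count_matches
and its handling of empty matches (step forward one character)
are choices made for this illustration.
.nf
```c
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper (not part of the library).
 * Counts successive non-overlapping matches of the extended RE
 * `pattern` in `text`.  Returns -1 if the pattern does not compile.
 */
static int count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    const char *p = text;
    int eflags = 0;
    int count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    while (regexec(&re, p, 1, m, eflags) == 0) {
        count++;
        if (m[0].rm_eo > 0) {
            p += m[0].rm_eo;    /* resume just past this match */
        } else {
            if (*p == 0)        /* empty match at end of string */
                break;
            p++;                /* empty match: step one character */
        }
        eflags = REG_NOTBOL;    /* p no longer starts the line */
    }

    regfree(&re);
    return count;
}
```
.fi
.PP
With this sketch the RE `^a' is counted once in `aaa',
whereas `a' is counted three times.
.PP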
.PP
See
.IR regex (7)
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.IR string .
.PP
Normally,
.I regexec
returns 0 for success and the non-zero code REG_NOMATCH for failure.
Other non-zero error codes may be returned in exceptional situations;
see DIAGNOSTICS.
.PP
If REG_NOSUB was specified in the compilation of the RE,
or if
.I nmatch
is 0,
.I regexec
ignores the
.I pmatch
argument (but see below for the case where REG_STARTEND is specified).
Otherwise,
.I pmatch
points to an array of
.I nmatch
structures of type
.IR regmatch_t .
Such a structure has at least the members
.I rm_so
and
.IR rm_eo ,
both of type
.I regoff_t
(a signed arithmetic type at least as large as an
.I off_t
and a
.IR ssize_t ),
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.I string
argument given to
.IR regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.PP
The 0th member of the
.I pmatch
array is filled in to indicate what substring of
.I string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.I i
reports subexpression
.IR i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array\(emcorresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
.I rm_so
and
.I rm_eo
set to \-1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note in particular that when the RE `(b*)+' matches `bbb',
the parenthesized subexpression matches the three `b's and then
an infinite number of empty strings following the last `b',
so the reported substring is one of the empties.)
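.PP
The offsets in
.I pmatch
can be turned back into text by copying from the original string.
The sketch below copies one subexpression into a caller-supplied buffer
and treats an entry of \-1 as ``unused''.
The helper name
.I extract_group
and the fixed limit MAX_GROUPS are assumptions of this example.
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <string.h>

enum { MAX_GROUPS = 10 };       /* arbitrary limit for this sketch */

/*
 * Hypothetical helper (not part of the library).
 * Copies the text matched by subexpression `group` (0 = whole match)
 * of the extended RE `pattern` in `string` into `out`.
 * Returns the length copied, or -1 if the pattern does not compile,
 * nothing matches, the group is unused, or `out` is too small.
 */
static int extract_group(const char *pattern, const char *string,
                         size_t group, char *out, size_t outsize)
{
    regex_t re;
    regmatch_t m[MAX_GROUPS];
    int len = -1;

    if (group >= MAX_GROUPS)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* Entries beyond re_nsub, and unused groups, come back as -1. */
    if (regexec(&re, string, MAX_GROUPS, m, 0) == 0 &&
        m[group].rm_so != -1) {
        size_t n = (size_t)(m[group].rm_eo - m[group].rm_so);

        if (n < outsize) {
            memcpy(out, string + m[group].rm_so, n);
            out[n] = 0;         /* NUL-terminate the copy */
            len = (int)n;
        }
    }

    regfree(&re);
    return len;
}
```
.fi
.PP
For the RE `([a-z]+)=([0-9]+)' and the string `width=42;',
group 1 yields `width' and group 2 yields `42'.
For `(a)|(b)' matched against `b', group 1 is unused.
.PP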
.PP
If REG_STARTEND is specified,
.I pmatch
must point to at least one
.I regmatch_t
(even if
.I nmatch
is 0 or REG_NOSUB was specified),
to hold the input offsets for REG_STARTEND.
Use for output is still entirely controlled by
.IR nmatch ;
if
.I nmatch
is 0 or REG_NOSUB was specified,
the value of
.IR pmatch [0]
will not be changed by a successful
.IR regexec .
.PP
.I Regerror
maps a non-zero
.I errcode
from either
.I regcomp
or
.I regexec
to a human-readable, printable message.
If
.I preg
is non-NULL,
the error code should have arisen from use of
the
.I regex_t
pointed to by
.IR preg ,
and if the error code came from
.IR regcomp ,
it should have been the result from the most recent
.I regcomp
using that
.IR regex_t .
.RI ( Regerror
may be able to supply a more detailed message using information
from the
.IR regex_t .)
.I Regerror
places the NUL-terminated message into the buffer pointed to by
.IR errbuf ,
limiting the length (including the NUL) to at most
.I errbuf_size
bytes.
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.I errbuf_size
is 0,
.I errbuf
is ignored but the return value is still correct.
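.PP
Because the return value is the size needed for the whole message,
a caller can ask for the size first and allocate exactly that much.
A minimal sketch, with a hypothetical helper name
.IR regex_strerror :
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the library).
 * Returns a malloc'ed copy of the message for `errcode`, or NULL if
 * memory cannot be allocated.  The caller frees the result.
 */
static char *regex_strerror(int errcode, const regex_t *preg)
{
    /* A zero-sized buffer is ignored; only the needed size comes back. */
    size_t need = regerror(errcode, preg, NULL, 0);
    char *buf;

    if (need == 0)
        return NULL;
    buf = malloc(need);
    if (buf != NULL)
        regerror(errcode, preg, buf, need);
    return buf;
}
```
.fi
.PP
A typical call passes the non-zero value returned by
.I regcomp
together with the same
.I regex_t
that was given to it.
.PP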
.PP
If the
.I errcode
given to
.I regerror
is first ORed with REG_ITOA,
the ``message'' that results is the printable name of the error code,
e.g. ``REG_NOMATCH'',
rather than an explanation thereof.
If
.I errcode
is REG_ATOI,
then
.I preg
shall be non-NULL and the
.I re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.I errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
REG_ITOA and REG_ATOI are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.PP
.I Regfree
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.IR preg .
The remaining
.I regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.I regexec
or
.I regerror
is undefined.
.PP
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.SH IMPLEMENTATION CHOICES
There are a number of decisions that 1003.2 leaves up to the implementor,
either by explicitly saying ``undefined'' or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.PP
See
.IR regex (7)
for a discussion of the definition of case-independent matching.
.PP
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See BUGS for one short RE using them
that will run almost any system out of memory.
.PP
A backslashed character other than one specifically given a magic meaning
by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs)
is taken as an ordinary character.
.PP
Any unmatched [ is a REG_EBRACK error.
.PP
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.PP
RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
.PP
A repetition operator (?, *, +, or bounds) cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow `^' or `|'.
.PP
`|' cannot appear first or last in a (sub)expression or after another `|',
i.e. an operand of `|' cannot be an empty subexpression.
An empty parenthesized subexpression, `()', is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.PP
A `{' followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A `{' \fInot\fR followed by a digit is considered an ordinary character.
.PP
`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
REs are anchors, not ordinary characters.
.SH SEE ALSO
grep(1), regex(7)
.PP
POSIX 1003.2, sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.SH DIAGNOSTICS
Non-zero error codes from
.I regcomp
and
.I regexec
include the following:
.PP
.nf
.ta \w'REG_ECOLLATE'u+3n
REG_NOMATCH	regexec() failed to match
REG_BADPAT	invalid regular expression
REG_ECOLLATE	invalid collating element
REG_ECTYPE	invalid character class
REG_EESCAPE	\e applied to unescapable character
REG_ESUBREG	invalid backreference number
REG_EBRACK	brackets [ ] not balanced
REG_EPAREN	parentheses ( ) not balanced
REG_EBRACE	braces { } not balanced
REG_BADBR	invalid repetition count(s) in { }
REG_ERANGE	invalid character range in [ ]
REG_ESPACE	ran out of memory
REG_BADRPT	?, *, or + operand invalid
REG_EMPTY	empty (sub)expression
REG_ASSERT	``can't happen''\(emyou found a bug
REG_INVARG	invalid argument, e.g. negative-length string
.fi
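.PP
Software that must run on systems lacking REG_ITOA and REG_ATOI
can keep its own table of the codes listed above.
The following is a sketch only:
the names
.IR regex_codes ,
.I regex_code_name
and
.I regex_code_from_name
are hypothetical,
and the three codes not specified by POSIX 1003.2 are included only
where the header defines them as macros.
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table; not part of the library. */
static const struct {
    int code;
    const char *name;
} regex_codes[] = {
    { REG_NOMATCH,  "REG_NOMATCH"  }, { REG_BADPAT,  "REG_BADPAT"  },
    { REG_ECOLLATE, "REG_ECOLLATE" }, { REG_ECTYPE,  "REG_ECTYPE"  },
    { REG_EESCAPE,  "REG_EESCAPE"  }, { REG_ESUBREG, "REG_ESUBREG" },
    { REG_EBRACK,   "REG_EBRACK"   }, { REG_EPAREN,  "REG_EPAREN"  },
    { REG_EBRACE,   "REG_EBRACE"   }, { REG_BADBR,   "REG_BADBR"   },
    { REG_ERANGE,   "REG_ERANGE"   }, { REG_ESPACE,  "REG_ESPACE"  },
    { REG_BADRPT,   "REG_BADRPT"   },
#ifdef REG_EMPTY
    { REG_EMPTY,    "REG_EMPTY"    },
#endif
#ifdef REG_ASSERT
    { REG_ASSERT,   "REG_ASSERT"   },
#endif
#ifdef REG_INVARG
    { REG_INVARG,   "REG_INVARG"   },
#endif
};

enum { N_REGEX_CODES = sizeof regex_codes / sizeof regex_codes[0] };

/* Returns the printable name of `code`, or NULL if it is not listed. */
static const char *regex_code_name(int code)
{
    size_t i;

    for (i = 0; i < N_REGEX_CODES; i++)
        if (regex_codes[i].code == code)
            return regex_codes[i].name;
    return NULL;
}

/* Returns the numeric code for `name`, or 0 if it is not recognized. */
static int regex_code_from_name(const char *name)
{
    size_t i;

    for (i = 0; i < N_REGEX_CODES; i++)
        if (strcmp(regex_codes[i].name, name) == 0)
            return regex_codes[i].code;
    return 0;
}
```
.fi
.PP
Returning 0 for an unrecognized name mirrors the REG_ATOI convention
described earlier.
.PP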
.SH HISTORY
Written by Henry Spencer,
henry@zoo.toronto.edu.
.SH BUGS
This is an alpha release with known defects.
Please report problems.
.PP
There is one known functionality bug.
The implementation of internationalization is incomplete:
the locale is always assumed to be the default one of 1003.2,
and only the collating elements etc. of that locale are available.
.PP
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.PP
.I Regexec
performance is poor.
This will improve with later releases.
.I Nmatch
exceeding 0 is expensive;
.I nmatch
exceeding 1 is worse.
.I Regexec
is largely insensitive to RE complexity \fIexcept\fR that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.PP
.I Regcomp
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
will (eventually) run almost any existing machine out of swap space.
.PP
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.PP
Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is
a special character only in the presence of a previous unmatched `('.
This can't be fixed until the spec is fixed.
.PP
The standard's definition of back references is vague.
For example, does
`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
Until the standard is clarified,
behavior in such cases should not be relied on.
.PP
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.