.TH REGEX 3 "25 Sept 1997"
.BY "Henry Spencer"
.de ZR
.\" one other place knows this name: the SEE ALSO section
.IR regex (7) \\$1
..
.SH NAME
regcomp, regexec, regerror, regfree \- regular-expression library
.SH SYNOPSIS
.ft B
.\".na
#include <sys/types.h>
.br
#include <regex.h>
.HP 10
int regcomp(regex_t\ *preg, const\ char\ *pattern, int\ cflags);
.HP
int\ regexec(const\ regex_t\ *preg, const\ char\ *string,
size_t\ nmatch, regmatch_t\ pmatch[], int\ eflags);
.HP
size_t\ regerror(int\ errcode, const\ regex_t\ *preg,
char\ *errbuf, size_t\ errbuf_size);
.HP
void\ regfree(regex_t\ *preg);
.\".ad
.ft
.SH DESCRIPTION
These routines implement POSIX 1003.2 regular expressions (``RE''s);
see
.ZR .
.I Regcomp
compiles an RE written as a string into an internal form,
.I regexec
matches that internal form against a string and reports results,
.I regerror
transforms error codes from either into human-readable messages,
and
.I regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.PP
The header
.I <regex.h>
declares two structure types,
.I regex_t
and
.IR regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.IR regoff_t ,
and a number of constants with names starting with ``REG_''.
.PP
.I Regcomp
compiles the regular expression contained in the
.I pattern
string,
subject to the flags in
.IR cflags ,
and places the results in the
.I regex_t
structure pointed to by
.IR preg .
.I Cflags
is the bitwise OR of zero or more of the following flags:
.IP REG_EXTENDED \w'REG_EXTENDED'u+2n
Compile modern (``extended'') REs,
rather than the obsolete (``basic'') REs that
are the default.
.IP REG_BASIC
This is a synonym for 0,
provided as a counterpart to REG_EXTENDED to improve readability.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
.IP REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the ``RE'' is a literal string.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
REG_EXTENDED and REG_NOSPEC may not be used
in the same call to
.IR regcomp .
.IP REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.ZR .
.IP REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.IP REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
`[^' bracket expressions and `.' never match newline,
a `^' anchor matches the null string after any newline in the string
in addition to its normal function,
and the `$' anchor matches the null string before any newline in the
string in addition to its normal function.
.IP REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.I re_endp
member of the structure pointed to by
.IR preg .
The
.I re_endp
member is of type
.IR const\ char\ * .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
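.PP
As a minimal sketch of how the portable flags above combine,
the hypothetical helper
.I re_matches
(not part of this library) compiles a pattern with caller-supplied
.I cflags
and reports only success or failure.
It assumes a POSIX-conforming
.I <regex.h>
and uses none of the extensions.
.nf
```c
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper: returns 1 on a match, 0 on REG_NOMATCH,
 * and -1 if the pattern fails to compile or matching fails otherwise. */
int re_matches(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;                          /* compile error */
    rc = regexec(&re, text, 0, NULL, 0);    /* nmatch 0: no offsets wanted */
    regfree(&re);
    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```
.fi
.PP
Under these assumptions, the RE `^world$' compiled with REG_EXTENDED
should not match a two-line string whose second line is `world',
but should match once REG_NEWLINE is ORed into the flags.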
.PP
When successful,
.I regcomp
returns 0 and fills in the structure pointed to by
.IR preg .
One member of that structure
(other than
.IR re_endp )
is publicized:
.IR re_nsub ,
of type
.IR size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
REG_NOSUB flag was used).
If
.I regcomp
fails, it returns a non-zero error code;
see DIAGNOSTICS.
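.PP
The return-value check and the
.I re_nsub
member might be used as in the following sketch.
The helper name
.I subexpr_count
is an assumption made for illustration;
only
.IR regcomp ,
.IR regfree ,
and REG_EXTENDED come from the interface described here.
.nf
```c
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper: number of parenthesized subexpressions in an
 * extended RE, or -1 if regcomp rejects the pattern. */
int subexpr_count(const char *pattern)
{
    regex_t re;
    int n;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;              /* failed compile: re holds no valid RE */
    n = (int)re.re_nsub;        /* defined because REG_NOSUB was not used */
    regfree(&re);
    return n;
}
```
.fi
.PP
For the RE `(a)(b(c))' this should report 3,
since subexpressions are counted by their opening parentheses.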
.PP
.I Regexec
matches the compiled RE pointed to by
.I preg
against the
.IR string ,
subject to the flags in
.IR eflags ,
and reports results using
.IR nmatch ,
.IR pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.IR regcomp .
The compiled form is not altered during execution of
.IR regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.PP
By default,
the NUL-terminated string pointed to by
.I string
is considered to be the text of an entire line,
with the NUL indicating the end of the line.
(That is,
any other end-of-line marker is considered to have been removed
and replaced by the NUL.)
The
.I eflags
argument is the bitwise OR of zero or more of the following flags:
.IP REG_NOTBOL \w'REG_STARTEND'u+2n
The first character of
the string
is not the beginning of a line, so the `^' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_NOTEOL
The NUL terminating
the string
does not end a line, so the `$' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_STARTEND
The string is considered to start at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
and to have a terminating NUL located at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
(there need not actually be a NUL at that location),
regardless of the value of
.IR nmatch .
See below for the definition of
.IR pmatch
and
.IR nmatch .
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Note that a non-zero \fIrm_so\fR does not imply REG_NOTBOL;
REG_STARTEND affects only the location of the string,
not how it is matched.
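.PP
One common use of REG_NOTBOL is resuming a search part-way through a line.
The sketch below counts successive non-overlapping matches;
the helper name
.I count_matches
and its handling of empty matches are assumptions of this example,
not behavior defined by the library.
.nf
```c
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper: counts non-overlapping matches of an extended RE
 * in text, or returns -1 if the pattern does not compile. */
int count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    int count = 0;
    int eflags = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    while (regexec(&re, text, 1, &m, eflags) == 0) {
        count++;
        if (m.rm_eo == 0) {
            /* Empty match at the current position: step over one
             * character, or stop at the terminating NUL. */
            if (!*text)
                break;
            text++;
        } else {
            text += m.rm_eo;    /* resume just after this match */
        }
        eflags = REG_NOTBOL;    /* we are no longer at the start of the line */
    }
    regfree(&re);
    return count;
}
```
.fi
.PP
With this approach the RE `^a' should be found once in `aaa';
without REG_NOTBOL on the later calls, each resumed search would
treat its starting point as a new beginning of line.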
.PP
See
.ZR
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.IR string .
.PP
Normally,
.I regexec
returns 0 for success and the non-zero code REG_NOMATCH for failure.
Other non-zero error codes may be returned in exceptional situations;
see DIAGNOSTICS.
.PP
If REG_NOSUB was specified in the compilation of the RE,
or if
.I nmatch
is 0,
.I regexec
ignores the
.I pmatch
argument (but see below for the case where REG_STARTEND is specified).
Otherwise,
.I pmatch
points to an array of
.I nmatch
structures of type
.IR regmatch_t .
Such a structure has at least the members
.I rm_so
and
.IR rm_eo ,
both of type
.I regoff_t
(a signed arithmetic type at least as large as an
.I off_t
and a
.IR ssize_t ),
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.I string
argument given to
.IR regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.PP
The 0th member of the
.I pmatch
array is filled in to indicate what substring of
.I string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.I i
reports subexpression
.IR i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array\(emcorresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
.I rm_so
and
.I rm_eo
set to \-1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, for example, that when the RE `(b*)+' matches `bbb',
the parenthesized subexpression matches the three `b's and then
an infinite number of empty strings following the last `b',
so the reported substring is one of the empties.)
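.PP
The offset reporting described above can be sketched as follows.
The helper name
.IR group_span ,
the fixed capacity MAX_GROUPS, and the use of
.I long
for the results are all assumptions made for this example.
.nf
```c
#include <sys/types.h>
#include <regex.h>

#define MAX_GROUPS 10   /* arbitrary capacity chosen for this sketch */

/* Hypothetical helper: stores the offsets of subexpression `group`
 * (0 = whole match) in *so and *eo.  Returns 0 on a match, 1 on
 * REG_NOMATCH, and -1 on a bad argument or any other error. */
int group_span(const char *pattern, const char *text, int group,
               long *so, long *eo)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    int rc;

    if (group < 0 || group >= MAX_GROUPS)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, MAX_GROUPS, pm, 0);
    regfree(&re);
    if (rc == REG_NOMATCH)
        return 1;
    if (rc != 0)
        return -1;
    *so = (long)pm[group].rm_so;    /* -1 for an unused entry */
    *eo = (long)pm[group].rm_eo;
    return 0;
}
```
.fi
.PP
For the RE `([a-z]+)@([a-z]+)' and the string `user@host',
subexpression 2 should be reported at offsets 5 and 9.
For `(a)|(b)' matched against `b',
subexpression 1 does not participate, so both of its offsets should be \-1.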
.PP
If REG_STARTEND is specified,
.I pmatch
must point to at least one
.I regmatch_t
(even if
.I nmatch
is 0 or REG_NOSUB was specified),
to hold the input offsets for REG_STARTEND.
Use for output is still entirely controlled by
.IR nmatch ;
if
.I nmatch
is 0 or REG_NOSUB was specified,
the value of
.IR pmatch [0]
will not be changed by a successful
.IR regexec .
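.PP
Because REG_STARTEND is an extension, portable code may want a fallback.
The following sketch uses REG_STARTEND when the header defines it
and otherwise matches against a NUL-terminated copy of the range.
The helper name
.I match_in_range
and the fallback strategy are assumptions of this example.
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: searches text[start..end) and returns the offset
 * (from the beginning of text) where the match starts, or -1 if there
 * is no match or an error occurs.  Requires start <= end <= strlen(text). */
long match_in_range(const char *pattern, const char *text,
                    size_t start, size_t end)
{
    regex_t re;
    regmatch_t pm[1];
    long result = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
#ifdef REG_STARTEND
    /* Extension: the input range is passed in through pmatch[0]. */
    pm[0].rm_so = (regoff_t)start;
    pm[0].rm_eo = (regoff_t)end;
    if (regexec(&re, text, 1, pm, REG_STARTEND) == 0)
        result = (long)pm[0].rm_so;
#else
    /* Portable fallback: match against a NUL-terminated copy. */
    {
        char *copy = malloc(end - start + 1);
        if (copy != NULL) {
            memcpy(copy, text + start, end - start);
            copy[end - start] = 0;
            if (regexec(&re, copy, 1, pm, 0) == 0)
                result = (long)(start + (size_t)pm[0].rm_so);
            free(copy);
        }
    }
#endif
    regfree(&re);
    return result;
}
```
.fi
.PP
Either branch keeps the result relative to the original string,
which is how offsets are measured elsewhere in this interface.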
.PP
.I Regerror
maps a non-zero
.I errcode
from either
.I regcomp
or
.I regexec
to a human-readable, printable message.
If
.I preg
is non-NULL,
the error code should have arisen from use of
the
.I regex_t
pointed to by
.IR preg ,
and if the error code came from
.IR regcomp ,
it should have been the result from the most recent
.I regcomp
using that
.IR regex_t .
.RI ( Regerror
may be able to supply a more detailed message using information
from the
.IR regex_t .)
.I Regerror
places the NUL-terminated message into the buffer pointed to by
.IR errbuf ,
limiting the length (including the NUL) to at most
.I errbuf_size
bytes.
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.I errbuf_size
is 0,
.I errbuf
is ignored but the return value is still correct.
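.PP
The size-query behavior suggests a two-call idiom:
ask for the needed size first, then fetch the message.
The helper name
.I regex_error_message
below is hypothetical; the sketch relies only on the
.I regerror
semantics described above.
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: returns a heap-allocated copy of the message for
 * errcode, or NULL if allocation fails.  The caller frees the result. */
char *regex_error_message(int errcode, const regex_t *preg)
{
    size_t need;
    char *buf;

    /* errbuf_size 0: errbuf is ignored and the needed size is returned. */
    need = regerror(errcode, preg, NULL, 0);
    buf = malloc(need);
    if (buf != NULL)
        regerror(errcode, preg, buf, need);
    return buf;
}
```
.fi
.PP
Since the returned size includes the terminating NUL,
the buffer allocated here should hold the whole message without truncation.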
.PP
If the
.I errcode
given to
.I regerror
is first ORed with REG_ITOA,
the ``message'' that results is the printable name of the error code,
e.g. ``REG_NOMATCH'',
rather than an explanation thereof.
If
.I errcode
is REG_ATOI,
then
.I preg
shall be non-NULL and the
.I re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.I errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
REG_ITOA and REG_ATOI are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.PP
.I Regfree
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.IR preg .
The remaining
.I regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.I regexec
or
.I regerror
is undefined.
.PP
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.SH IMPLEMENTATION CHOICES
There are a number of decisions that 1003.2 leaves up to the implementor,
either by explicitly saying ``undefined'' or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.PP
See
.ZR
for a discussion of the definition of case-independent matching.
.PP
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See BUGS for one short RE using them
that will run almost any system out of memory.
.PP
A backslashed character other than one specifically given a magic meaning
by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs)
is taken as an ordinary character.
.PP
Any unmatched [ is a REG_EBRACK error.
.PP
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.PP
RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
.PP
A repetition operator (?, *, +, or bounds) cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow `^' or `|'.
.PP
`|' cannot appear first or last in a (sub)expression or after another `|',
i.e. an operand of `|' cannot be an empty subexpression.
An empty parenthesized subexpression, `()', is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.PP
A `{' followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A `{' \fInot\fR followed by a digit is considered an ordinary character.
.PP
`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
REs are anchors, not ordinary characters.
.SH SEE ALSO
grep(1), regex(7)
.PP
POSIX 1003.2, sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.SH DIAGNOSTICS
Non-zero error codes from
.I regcomp
and
.I regexec
include the following:
.PP
.nf
.ta \w'REG_ECOLLATE'u+3n
REG_NOMATCH	regexec() failed to match
REG_BADPAT	invalid regular expression
REG_ECOLLATE	invalid collating element
REG_ECTYPE	invalid character class
REG_EESCAPE	\e applied to unescapable character
REG_ESUBREG	invalid backreference number
REG_EBRACK	brackets [ ] not balanced
REG_EPAREN	parentheses ( ) not balanced
REG_EBRACE	braces { } not balanced
REG_BADBR	invalid repetition count(s) in { }
REG_ERANGE	invalid character range in [ ]
REG_ESPACE	ran out of memory
REG_BADRPT	?, *, or + operand invalid
REG_EMPTY	empty (sub)expression
REG_ASSERT	``can't happen''\(emyou found a bug
REG_INVARG	invalid argument, e.g. negative-length string
.fi
.SH HISTORY
Written by Henry Spencer,
henry@zoo.toronto.edu.
.SH BUGS
This is an alpha release with known defects.
Please report problems.
.PP
There is one known functionality bug.
The implementation of internationalization is incomplete:
the locale is always assumed to be the default one of 1003.2,
and only the collating elements etc. of that locale are available.
.PP
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.PP
.I Regexec
performance is poor.
This will improve with later releases.
.I Nmatch
exceeding 0 is expensive;
.I nmatch
exceeding 1 is worse.
.I Regexec
is largely insensitive to RE complexity \fIexcept\fR that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.PP
.I Regcomp
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
will (eventually) run almost any existing machine out of swap space.
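.PP
A back-of-the-envelope estimate shows why nesting is so costly.
The sketch below assumes, as a simplification not taken from the source code,
that each level of `{1,n}' multiplies the expanded size by n;
the helper name
.I expansion_copies
is hypothetical.
.nf
```c
/* Rough upper bound on the copies of the innermost operand, assuming
 * each nesting level with upper bound n multiplies the size by n. */
unsigned long long expansion_copies(const unsigned *upper, int depth)
{
    unsigned long long copies = 1;
    int i;

    for (i = 0; i < depth; i++)
        copies *= upper[i];     /* may overflow for very deep nesting */
    return copies;
}
```
.fi
.PP
Under that assumption, five nested bounds of 100 give on the order of
ten thousand million copies of `a'.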
.PP
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.PP
Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is
a special character only in the presence of a previous unmatched `('.
This can't be fixed until the spec is fixed.
.PP
The standard's definition of back references is vague.
For example, does
`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
Until the standard is clarified,
behavior in such cases should not be relied on.
.PP
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.
504`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
505Until the standard is clarified,
506behavior in such cases should not be relied on.
507.PP
508The implementation of word-boundary matching is a bit of a kludge,
509and bugs may lurk in combinations of word-boundary matching and anchoring.