.TH REGEX 3 "25 Sept 1997"
.BY "Henry Spencer"
.de ZR
.\" one other place knows this name: the SEE ALSO section
.IR regex (7) \\$1
..
.SH NAME
regcomp, regexec, regerror, regfree \- regular-expression library
.SH SYNOPSIS
.ft B
.\".na
#include <sys/types.h>
.br
#include <regex.h>
.HP 10
int regcomp(regex_t\ *preg, const\ char\ *pattern, int\ cflags);
.HP
int\ regexec(const\ regex_t\ *preg, const\ char\ *string,
size_t\ nmatch, regmatch_t\ pmatch[], int\ eflags);
.HP
size_t\ regerror(int\ errcode, const\ regex_t\ *preg,
char\ *errbuf, size_t\ errbuf_size);
.HP
void\ regfree(regex_t\ *preg);
.\".ad
.ft
.SH DESCRIPTION
These routines implement POSIX 1003.2 regular expressions (``RE''s);
see
.ZR .
.I Regcomp
compiles an RE written as a string into an internal form,
.I regexec
matches that internal form against a string and reports results,
.I regerror
transforms error codes from either into human-readable messages,
and
.I regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.PP
The header
.I <regex.h>
declares two structure types,
.I regex_t
and
.IR regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.IR regoff_t ,
and a number of constants with names starting with ``REG_''.
.PP
.I Regcomp
compiles the regular expression contained in the
.I pattern
string,
subject to the flags in
.IR cflags ,
and places the results in the
.I regex_t
structure pointed to by
.IR preg .
.I Cflags
is the bitwise OR of zero or more of the following flags:
.IP REG_EXTENDED \w'REG_EXTENDED'u+2n
Compile modern (``extended'') REs,
rather than the obsolete (``basic'') REs that
are the default.
.IP REG_BASIC
This is a synonym for 0,
provided as a counterpart to REG_EXTENDED to improve readability.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
.IP REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the ``RE'' is a literal string.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
REG_EXTENDED and REG_NOSPEC may not be used
in the same call to
.IR regcomp .
.IP REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.ZR .
.IP REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.IP REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
`[^' bracket expressions and `.' never match newline,
a `^' anchor matches the null string after any newline in the string
in addition to its normal function,
and the `$' anchor matches the null string before any newline in the
string in addition to its normal function.
.IP REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.I re_endp
member of the structure pointed to by
.IR preg .
The
.I re_endp
member is of type
.IR const\ char\ * .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
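.PP
As a hedged illustration of the compile flags above,
the following sketch wraps
.I regcomp
and
.I regexec
in a yes/no test.
The helper name
.I re_matches
is hypothetical and not part of this library,
and only flags specified by POSIX 1003.2 are used.
.PP
.nf
```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper: returns 1 if pattern matches text, 0 if regexec()
 * reports anything other than success, -1 if the pattern fails to compile. */
static int re_matches(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    /* REG_NOSUB: only success or failure is wanted, not match offsets. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```
.fi
.PP
Under these assumptions,
calling the helper with REG_EXTENDED\ |\ REG_NEWLINE and the RE `^b'
should succeed on a string holding `a', a newline, and `b',
while the same call without REG_NEWLINE should report no match.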
.PP
When successful,
.I regcomp
returns 0 and fills in the structure pointed to by
.IR preg .
One member of that structure
(other than
.IR re_endp )
is publicized:
.IR re_nsub ,
of type
.IR size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
REG_NOSUB flag was used).
If
.I regcomp
fails, it returns a non-zero error code;
see DIAGNOSTICS.
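.PP
The return convention and the
.I re_nsub
member can be sketched as follows.
This is a minimal example, not a definitive usage pattern;
the helper name
.I re_count_groups
is an assumption made for illustration.
.PP
.nf
```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper: compile an extended RE and return its number of
 * parenthesized subexpressions, or -1 if regcomp() reports an error. */
static long re_count_groups(const char *pattern)
{
    regex_t re;
    long n;

    assert(pattern != NULL);
    /* REG_NOSUB is deliberately not set: re_nsub is undefined with it. */
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    n = (long)re.re_nsub;
    regfree(&re);
    return n;
}
```
.fi
.PP
The count is useful for sizing the
.I pmatch
array passed to
.IR regexec :
.I re_nsub
+ 1 entries cover the whole match plus every subexpression.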
.PP
.I Regexec
matches the compiled RE pointed to by
.I preg
against the
.IR string ,
subject to the flags in
.IR eflags ,
and reports results using
.IR nmatch ,
.IR pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.IR regcomp .
The compiled form is not altered during execution of
.IR regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.PP
By default,
the NUL-terminated string pointed to by
.I string
is considered to be the text of an entire line,
with the NUL indicating the end of the line.
(That is,
any other end-of-line marker is considered to have been removed
and replaced by the NUL.)
The
.I eflags
argument is the bitwise OR of zero or more of the following flags:
.IP REG_NOTBOL \w'REG_STARTEND'u+2n
The first character of
the string
is not the beginning of a line, so the `^' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_NOTEOL
The NUL terminating
the string
does not end a line, so the `$' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_STARTEND
The string is considered to start at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
and to have a terminating NUL located at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
(there need not actually be a NUL at that location),
regardless of the value of
.IR nmatch .
See below for the definition of
.IR pmatch
and
.IR nmatch .
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Note that a non-zero \fIrm_so\fR does not imply REG_NOTBOL;
REG_STARTEND affects only the location of the string,
not how it is matched.
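.PP
One common use of REG_NOTBOL is scanning a line for successive matches:
after the first match,
the remainder handed to
.I regexec
no longer starts at the beginning of the line.
The sketch below assumes a hypothetical helper name,
.IR re_count_matches ,
and steps one character past an empty match so that the loop terminates.
.PP
.nf
```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper: count non-overlapping matches of an extended RE
 * in text, or return -1 if the RE does not compile. */
static int re_count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    const char *p = text;
    int count = 0;
    int eflags = 0;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    while (regexec(&re, p, 1, &m, eflags) == 0) {
        count++;
        if (m.rm_eo == m.rm_so) {
            /* Empty match: stop at the NUL, else skip one character. */
            if (p[m.rm_eo] == 0)
                break;
            p += m.rm_eo + 1;
        } else {
            p += m.rm_eo;
        }
        /* Later calls start mid-line, so `^' must not match there. */
        eflags = REG_NOTBOL;
    }
    regfree(&re);
    return count;
}
```
.fi
.PP
Without REG_NOTBOL on the later calls,
an RE such as `^a' would match again at the start of every remainder.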
.PP
See
.ZR
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.IR string .
.PP
Normally,
.I regexec
returns 0 for success and the non-zero code REG_NOMATCH for failure.
Other non-zero error codes may be returned in exceptional situations;
see DIAGNOSTICS.
.PP
If REG_NOSUB was specified in the compilation of the RE,
or if
.I nmatch
is 0,
.I regexec
ignores the
.I pmatch
argument (but see below for the case where REG_STARTEND is specified).
Otherwise,
.I pmatch
points to an array of
.I nmatch
structures of type
.IR regmatch_t .
Such a structure has at least the members
.I rm_so
and
.IR rm_eo ,
both of type
.I regoff_t
(a signed arithmetic type at least as large as an
.I off_t
and a
.IR ssize_t ),
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.I string
argument given to
.IR regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.PP
The 0th member of the
.I pmatch
array is filled in to indicate what substring of
.I string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.I i
reports subexpression
.IR i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array\(emcorresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
.I rm_so
and
.I rm_eo
set to \-1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE `(b*)+' matches `bbb',
the parenthesized subexpression matches the three `b's and then
an infinite number of empty strings following the last `b',
so the reported substring is one of the empties.)
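.PP
The offset conventions above can be sketched as a helper that copies out
the text of one subexpression.
The name
.I re_group
and the fixed array size MAX_GROUPS are assumptions made for this example,
not part of the library.
.PP
.nf
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <regex.h>

#define MAX_GROUPS 10   /* assumption: enough pmatch slots for this sketch */

/* Hypothetical helper: match an extended RE against text and copy the
 * substring for entry idx (0 = whole match) into out.  Returns the length
 * copied, or -1 on compile error, no match, unused entry, or overflow. */
static int re_group(const char *pattern, const char *text, size_t idx,
                    char *out, size_t outsize)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    int len = -1;

    assert(out != NULL && outsize > 0);
    if (idx >= MAX_GROUPS || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    /* Unused entries come back with rm_so and rm_eo both set to -1. */
    if (regexec(&re, text, MAX_GROUPS, pm, 0) == 0 && pm[idx].rm_so != -1) {
        size_t n = (size_t)(pm[idx].rm_eo - pm[idx].rm_so);
        if (n < outsize) {
            /* Offsets are measured from the start of text. */
            memcpy(out, text + pm[idx].rm_so, n);
            out[n] = 0;
            len = (int)n;
        }
    }
    regfree(&re);
    return len;
}
```
.fi
.PP
A production caller would more likely size the array from
.I re_nsub
+ 1 rather than a fixed constant.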
.PP
If REG_STARTEND is specified,
.I pmatch
must point to at least one
.I regmatch_t
(even if
.I nmatch
is 0 or REG_NOSUB was specified),
to hold the input offsets for REG_STARTEND.
Use for output is still entirely controlled by
.IR nmatch ;
if
.I nmatch
is 0 or REG_NOSUB was specified,
the value of
.IR pmatch [0]
will not be changed by a successful
.IR regexec .
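.PP
A hedged sketch of searching a sub-range with REG_STARTEND follows.
Because the flag is an extension,
the example tests for it with the preprocessor
and otherwise falls back to matching a NUL-terminated copy of the range;
the helper name
.I re_search_range
and the fallback are assumptions made for illustration.
.PP
.nf
```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper: search text[start..end) for an extended RE and
 * return the match offset measured from text, or -1 if there is none. */
static long re_search_range(const char *pattern, const char *text,
                            size_t start, size_t end)
{
    regex_t re;
    regmatch_t pm[1];
    long pos = -1;

    assert(pattern != NULL && text != NULL && start <= end);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
#ifdef REG_STARTEND
    /* pmatch[0] carries the range in; results still count from text. */
    pm[0].rm_so = (regoff_t)start;
    pm[0].rm_eo = (regoff_t)end;
    if (regexec(&re, text, 1, pm, REG_STARTEND) == 0)
        pos = (long)pm[0].rm_so;
#else
    /* Fallback where the extension is missing: match a copy of the range. */
    {
        size_t n = end - start;
        char *copy = malloc(n + 1);

        if (copy != NULL) {
            memcpy(copy, text + start, n);
            copy[n] = 0;
            if (regexec(&re, copy, 1, pm, 0) == 0)
                pos = (long)(start + (size_t)pm[0].rm_so);
            free(copy);
        }
    }
#endif
    regfree(&re);
    return pos;
}
```
.fi
.PP
The REG_STARTEND path avoids the copy and also permits NULs inside the range,
which the fallback does not.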
.PP
.I Regerror
maps a non-zero
.I errcode
from either
.I regcomp
or
.I regexec
to a human-readable, printable message.
If
.I preg
is non-NULL,
the error code should have arisen from use of
the
.I regex_t
pointed to by
.IR preg ,
and if the error code came from
.IR regcomp ,
it should have been the result from the most recent
.I regcomp
using that
.IR regex_t .
.RI ( Regerror
may be able to supply a more detailed message using information
from the
.IR regex_t .)
.I Regerror
places the NUL-terminated message into the buffer pointed to by
.IR errbuf ,
limiting the length (including the NUL) to at most
.I errbuf_size
bytes.
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.I errbuf_size
is 0,
.I errbuf
is ignored but the return value is still correct.
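.PP
The sizing rule above suggests a two-call idiom:
ask for the required size with
.I errbuf_size
set to 0,
then allocate and fetch the message.
The following is a minimal sketch;
the helper name
.I re_error_message
is hypothetical.
.PP
.nf
```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper: return a malloc'd copy of the message for errcode,
 * or NULL if allocation fails.  The caller frees the result. */
static char *re_error_message(int errcode, const regex_t *preg)
{
    size_t need;
    char *msg;

    assert(errcode != 0);
    /* With errbuf_size 0, errbuf is ignored; only the size comes back. */
    need = regerror(errcode, preg, NULL, 0);
    msg = malloc(need);
    if (msg != NULL)
        (void)regerror(errcode, preg, msg, need);
    return msg;
}
```
.fi
.PP
A caller would typically pass the non-zero value returned by
.I regcomp
together with the same
.I regex_t
and print the resulting string.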
.PP
If the
.I errcode
given to
.I regerror
is first ORed with REG_ITOA,
the ``message'' that results is the printable name of the error code,
e.g. ``REG_NOMATCH'',
rather than an explanation thereof.
If
.I errcode
is REG_ATOI,
then
.I preg
shall be non-NULL and the
.I re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.I errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
REG_ITOA and REG_ATOI are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.PP
.I Regfree
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.IR preg .
The remaining
.I regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.I regexec
or
.I regerror
is undefined.
.PP
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.SH IMPLEMENTATION CHOICES
There are a number of decisions that 1003.2 leaves up to the implementor,
either by explicitly saying ``undefined'' or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.PP
See
.ZR
for a discussion of the definition of case-independent matching.
.PP
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See BUGS for one short RE using them
that will run almost any system out of memory.
.PP
A backslashed character other than one specifically given a magic meaning
by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs)
is taken as an ordinary character.
.PP
Any unmatched [ is a REG_EBRACK error.
.PP
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.PP
RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
.PP
A repetition operator (?, *, +, or bounds) cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow `^' or `|'.
.PP
`|' cannot appear first or last in a (sub)expression or after another `|',
i.e. an operand of `|' cannot be an empty subexpression.
An empty parenthesized subexpression, `()', is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.PP
A `{' followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A `{' \fInot\fR followed by a digit is considered an ordinary character.
.PP
`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
REs are anchors, not ordinary characters.
.SH SEE ALSO
grep(1), regex(7)
.PP
POSIX 1003.2, sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.SH DIAGNOSTICS
Non-zero error codes from
.I regcomp
and
.I regexec
include the following:
.PP
.nf
.ta \w'REG_ECOLLATE'u+3n
REG_NOMATCH	regexec() failed to match
REG_BADPAT	invalid regular expression
REG_ECOLLATE	invalid collating element
REG_ECTYPE	invalid character class
REG_EESCAPE	\e applied to unescapable character
REG_ESUBREG	invalid backreference number
REG_EBRACK	brackets [ ] not balanced
REG_EPAREN	parentheses ( ) not balanced
REG_EBRACE	braces { } not balanced
REG_BADBR	invalid repetition count(s) in { }
REG_ERANGE	invalid character range in [ ]
REG_ESPACE	ran out of memory
REG_BADRPT	?, *, or + operand invalid
REG_EMPTY	empty (sub)expression
REG_ASSERT	``can't happen''\(emyou found a bug
REG_INVARG	invalid argument, e.g. negative-length string
.fi
.SH HISTORY
Written by Henry Spencer,
henry@zoo.toronto.edu.
.SH BUGS
This is an alpha release with known defects.
Please report problems.
.PP
There is one known functionality bug.
The implementation of internationalization is incomplete:
the locale is always assumed to be the default one of 1003.2,
and only the collating elements etc. of that locale are available.
.PP
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.PP
.I Regexec
performance is poor.
This will improve with later releases.
.I Nmatch
exceeding 0 is expensive;
.I nmatch
exceeding 1 is worse.
.I Regexec
is largely insensitive to RE complexity \fIexcept\fR that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.PP
.I Regcomp
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
will (eventually) run almost any existing machine out of swap space.
.PP
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.PP
Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is
a special character only in the presence of a previous unmatched `('.
This can't be fixed until the spec is fixed.
.PP
The standard's definition of back references is vague.
For example, does
`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
Until the standard is clarified,
behavior in such cases should not be relied on.
.PP
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.
507 .PP
508 The implementation of word-boundary matching is a bit of a kludge,
509 and bugs may lurk in combinations of word-boundary matching and anchoring.