.TH REGEX 3 "25 Sept 1997"
.BY "Henry Spencer"
.de ZR
.\" one other place knows this name:  the SEE ALSO section
.IR regex (7) \\$1
..
.SH NAME
regcomp, regexec, regerror, regfree \- regular-expression library
.SH SYNOPSIS
.ft B
.\".na
#include <sys/types.h>
.br
#include <regex.h>
.HP 10
int regcomp(regex_t\ *preg, const\ char\ *pattern, int\ cflags);
.HP
int\ regexec(const\ regex_t\ *preg, const\ char\ *string,
size_t\ nmatch, regmatch_t\ pmatch[], int\ eflags);
.HP
size_t\ regerror(int\ errcode, const\ regex_t\ *preg,
char\ *errbuf, size_t\ errbuf_size);
.HP
void\ regfree(regex_t\ *preg);
.\".ad
.ft
.SH DESCRIPTION
These routines implement POSIX 1003.2 regular expressions (``RE''s);
see
.ZR .
.I Regcomp
compiles an RE written as a string into an internal form,
.I regexec
matches that internal form against a string and reports results,
.I regerror
transforms error codes from either into human-readable messages,
and
.I regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
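.PP
The division of labour among the four routines can be sketched as follows.
This is a minimal illustration rather than part of the library:
the helper name \fIre_matches\fR is hypothetical,
and it uses only the documented calls with extended REs.
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if text matches pattern, 0 if not, -1 on error. */
int re_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    /* Compile once; REG_NOSUB because only success or failure is wanted. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);

    /* Release the storage held by the compiled form. */
    regfree(&re);

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```
.fi
A caller that tests many strings against one RE would normally compile once
and call \fIregexec\fR repeatedly before freeing,
rather than recompiling per string as this sketch does.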
.PP
The header
.I <regex.h>
declares two structure types,
.I regex_t
and
.IR regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.IR regoff_t ,
and a number of constants with names starting with ``REG_''.
.PP
.I Regcomp
compiles the regular expression contained in the
.I pattern
string,
subject to the flags in
.IR cflags ,
and places the results in the
.I regex_t
structure pointed to by
.IR preg .
.I Cflags
is the bitwise OR of zero or more of the following flags:
.IP REG_EXTENDED \w'REG_EXTENDED'u+2n
Compile modern (``extended'') REs,
rather than the obsolete (``basic'') REs that
are the default.
.IP REG_BASIC
This is a synonym for 0,
provided as a counterpart to REG_EXTENDED to improve readability.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
.IP REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the ``RE'' is a literal string.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
REG_EXTENDED and REG_NOSPEC may not be used
in the same call to
.IR regcomp .
.IP REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.ZR .
.IP REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.IP REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
`[^' bracket expressions and `.' never match newline,
a `^' anchor matches the null string after any newline in the string
in addition to its normal function,
and the `$' anchor matches the null string before any newline in the
string in addition to its normal function.
.IP REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.I re_endp
member of the structure pointed to by
.IR preg .
The
.I re_endp
member is of type
.IR const\ char\ * .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
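.PP
The effect of the portable compilation flags can be explored with a small
wrapper such as the one below.
It is a sketch only: \fImatch_with_cflags\fR is a hypothetical name,
and the extension flags (REG_BASIC, REG_NOSPEC, REG_PEND) are deliberately
left out because they may not exist on other systems.
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile pattern with the caller's cflags and
 * report 1 for a match, 0 for no match, -1 for any error. */
int match_with_cflags(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    /* REG_NOSUB is added because no match offsets are requested. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```
.fi
For instance, passing REG_EXTENDED|REG_NEWLINE lets `^b$' match the middle
line of a three-line string, which it would not do with REG_EXTENDED alone.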
.PP
When successful,
.I regcomp
returns 0 and fills in the structure pointed to by
.IR preg .
One member of that structure
(other than
.IR re_endp )
is publicized:
.IR re_nsub ,
of type
.IR size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
REG_NOSUB flag was used).
If
.I regcomp
fails, it returns a non-zero error code;
see DIAGNOSTICS.
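.PP
As an illustration of the return value and of
.IR re_nsub ,
the sketch below reports how many parenthesized subexpressions an extended RE
contains.
The name \fIcount_groups\fR is hypothetical.
REG_NOSUB is not used, since \fIre_nsub\fR would then be undefined.
.nf
```c
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper: number of parenthesized subexpressions in an
 * extended RE, or -1 if regcomp rejects the pattern. */
long count_groups(const char *pattern)
{
    regex_t re;
    long n;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;          /* non-zero return: compilation failed */

    n = (long)re.re_nsub;   /* filled in by a successful regcomp */
    regfree(&re);
    return n;
}
```
.fi
A caller can use this count to size the \fIpmatch\fR array as
\fIre_nsub\fR\ +\ 1 entries, one for the whole match plus one per
subexpression.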
.PP
.I Regexec
matches the compiled RE pointed to by
.I preg
against the
.IR string ,
subject to the flags in
.IR eflags ,
and reports results using
.IR nmatch ,
.IR pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.IR regcomp .
The compiled form is not altered during execution of
.IR regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.PP
By default,
the NUL-terminated string pointed to by
.I string
is considered to be the text of an entire line,
with the NUL indicating the end of the line.
(That is,
any other end-of-line marker is considered to have been removed
and replaced by the NUL.)
The
.I eflags
argument is the bitwise OR of zero or more of the following flags:
.IP REG_NOTBOL \w'REG_STARTEND'u+2n
The first character of
the string
is not the beginning of a line, so the `^' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_NOTEOL
The NUL terminating
the string
does not end a line, so the `$' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_STARTEND
The string is considered to start at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
and to have a terminating NUL located at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
(there need not actually be a NUL at that location),
regardless of the value of
.IR nmatch .
See below for the definition of
.I pmatch
and
.IR nmatch .
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Note that a non-zero \fIrm_so\fR does not imply REG_NOTBOL;
REG_STARTEND affects only the location of the string,
not how it is matched.
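.PP
The two portable execution flags can be tried out with a wrapper like the
following.
This is a sketch under stated assumptions:
\fIexec_with_eflags\fR is a hypothetical name, and REG_STARTEND is not
exercised here because it is an extension.
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: match an extended RE against text using the
 * caller's eflags; 1 for a match, 0 for no match, -1 for any error. */
int exec_with_eflags(const char *pattern, const char *text, int eflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    /* eflags is passed at match time, not at compile time. */
    rc = regexec(&re, text, 0, NULL, eflags);
    regfree(&re);

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```
.fi
With `^abc' and the text `abcdef', the call succeeds when \fIeflags\fR is 0
and fails with REG_NOTBOL, because the anchor is no longer allowed to match
before the first character.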
.PP
See
.ZR
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.IR string .
.PP
Normally,
.I regexec
returns 0 for success and the non-zero code REG_NOMATCH for failure.
Other non-zero error codes may be returned in exceptional situations;
see DIAGNOSTICS.
.PP
If REG_NOSUB was specified in the compilation of the RE,
or if
.I nmatch
is 0,
.I regexec
ignores the
.I pmatch
argument (but see below for the case where REG_STARTEND is specified).
Otherwise,
.I pmatch
points to an array of
.I nmatch
structures of type
.IR regmatch_t .
Such a structure has at least the members
.I rm_so
and
.IR rm_eo ,
both of type
.I regoff_t
(a signed arithmetic type at least as large as an
.I off_t
and a
.IR ssize_t ),
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.I string
argument given to
.IR regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
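.PP
Because offsets are relative to the \fIstring\fR argument of each call,
a loop that resumes the search after every match has to add them to its own
cursor.
The sketch below counts non-overlapping matches that way;
\fIcount_matches\fR is a hypothetical name and the handling of empty matches
is one possible choice, not something the library prescribes.
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: count non-overlapping matches of an extended RE
 * in text, or return -1 if the pattern does not compile. */
long count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    const char *p = text;
    long count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* After the first call p is no longer the start of the line,
     * so REG_NOTBOL stops `^' from matching at p. */
    while (regexec(&re, p, 1, m, p == text ? 0 : REG_NOTBOL) == 0) {
        count++;
        if (m[0].rm_eo > 0)
            p += m[0].rm_eo;    /* rm_eo is an offset from p, not text */
        else if (*p)
            p++;                /* empty match at p: step past one char */
        else
            break;              /* empty match at the terminating NUL */
    }

    regfree(&re);
    return count;
}
```
.fi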
.PP
The 0th member of the
.I pmatch
array is filled in to indicate what substring of
.I string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.I i
reports subexpression
.IR i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array\(emcorresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
.I rm_so
and
.I rm_eo
set to \-1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note in particular that when the RE `(b*)+' matches `bbb',
the parenthesized subexpression matches the three `b's and then
an infinite number of empty strings following the last `b',
so the reported substring is one of the empties.)
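.PP
The layout of the \fIpmatch\fR array can be illustrated by copying out the
text of one subexpression.
In this sketch the name \fIextract_group\fR and the fixed array size
MAX_GROUPS are assumptions made for the example;
an entry with \fIrm_so\fR equal to \-1 is treated as ``no such substring''.
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <string.h>

#define MAX_GROUPS 10   /* arbitrary limit chosen for this sketch */

/* Hypothetical helper: copy what subexpression `group' matched into out.
 * Returns the substring length, or -1 on compile error, no match,
 * an unused entry, or a buffer that is too small. */
int extract_group(const char *pattern, const char *text, size_t group,
                  char *out, size_t outsize)
{
    regex_t re;
    regmatch_t m[MAX_GROUPS];
    int result = -1;

    if (group >= MAX_GROUPS || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* m[0] is the whole match; m[i] is subexpression i. */
    if (regexec(&re, text, MAX_GROUPS, m, 0) == 0 && m[group].rm_so != -1) {
        size_t len = (size_t)(m[group].rm_eo - m[group].rm_so);

        if (len < outsize) {
            memcpy(out, text + m[group].rm_so, len);
            out[len] = 0;       /* NUL-terminate the copy */
            result = (int)len;
        }
    }

    regfree(&re);
    return result;
}
```
.fi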
.PP
If REG_STARTEND is specified,
.I pmatch
must point to at least one
.I regmatch_t
(even if
.I nmatch
is 0 or REG_NOSUB was specified),
to hold the input offsets for REG_STARTEND.
Use for output is still entirely controlled by
.IR nmatch ;
if
.I nmatch
is 0 or REG_NOSUB was specified,
the value of
.IR pmatch [0]
will not be changed by a successful
.IR regexec .
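.PP
A possible use of REG_STARTEND is searching a slice of a larger buffer
without copying it.
The sketch below is guarded by a preprocessor test because the flag is an
extension and may be absent elsewhere;
\fImatch_in_range\fR and its \-2 ``unsupported'' return value are
hypothetical.
.nf
```c
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper: search buf[start..end) for an extended RE.
 * Returns 1 and stores the match start (measured from buf) on success,
 * 0 for no match, -1 on error, -2 if REG_STARTEND is unavailable. */
int match_in_range(const char *pattern, const char *buf,
                   regoff_t start, regoff_t end, regoff_t *match_start)
{
#ifdef REG_STARTEND
    regex_t re;
    regmatch_t m[1];
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* pmatch[0] carries the input range in, and the match back out. */
    m[0].rm_so = start;
    m[0].rm_eo = end;
    rc = regexec(&re, buf, 1, m, REG_STARTEND);
    regfree(&re);

    if (rc == 0) {
        *match_start = m[0].rm_so;
        return 1;
    }
    return rc == REG_NOMATCH ? 0 : -1;
#else
    (void)pattern; (void)buf; (void)start; (void)end; (void)match_start;
    return -2;
#endif
}
```
.fi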
.PP
.I Regerror
maps a non-zero
.I errcode
from either
.I regcomp
or
.I regexec
to a human-readable, printable message.
If
.I preg
is non-NULL,
the error code should have arisen from use of
the
.I regex_t
pointed to by
.IR preg ,
and if the error code came from
.IR regcomp ,
it should have been the result from the most recent
.I regcomp
using that
.IR regex_t .
.RI ( Regerror
may be able to supply a more detailed message using information
from the
.IR regex_t .)
.I Regerror
places the NUL-terminated message into the buffer pointed to by
.IR errbuf ,
limiting the length (including the NUL) to at most
.I errbuf_size
bytes.
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.I errbuf_size
is 0,
.I errbuf
is ignored but the return value is still correct.
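.PP
Since the return value is the buffer size required,
\fIregerror\fR can be called twice:
once with a zero \fIerrbuf_size\fR to learn the size,
and once to fetch the message.
The sketch below does this for a failed compilation;
\fIcompile_error_message\fR is a hypothetical name.
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: return a malloc'd message describing why an
 * extended RE failed to compile. Returns NULL if the pattern compiled
 * (or if malloc failed); the caller frees the result. */
char *compile_error_message(const char *pattern)
{
    regex_t re;
    size_t need;
    char *msg;
    int rc = regcomp(&re, pattern, REG_EXTENDED);

    if (rc == 0) {
        regfree(&re);
        return NULL;
    }

    /* First call: errbuf_size 0, so errbuf is ignored and only the
     * required size (including the NUL) is returned. */
    need = regerror(rc, &re, NULL, 0);

    msg = malloc(need);
    if (msg != NULL)
        regerror(rc, &re, msg, need);   /* second call fills the buffer */
    return msg;
}
```
.fi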
.PP
If the
.I errcode
given to
.I regerror
is first ORed with REG_ITOA,
the ``message'' that results is the printable name of the error code,
e.g. ``REG_NOMATCH'',
rather than an explanation thereof.
If
.I errcode
is REG_ATOI,
then
.I preg
shall be non-NULL and the
.I re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.I errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
REG_ITOA and REG_ATOI are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.PP
.I Regfree
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.IR preg .
The remaining
.I regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.I regexec
or
.I regerror
is undefined.
.PP
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.SH IMPLEMENTATION CHOICES
There are a number of decisions that 1003.2 leaves up to the implementor,
either by explicitly saying ``undefined'' or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.PP
See
.ZR
for a discussion of the definition of case-independent matching.
.PP
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See BUGS for one short RE using them
that will run almost any system out of memory.
.PP
A backslashed character other than one specifically given a magic meaning
by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs)
is taken as an ordinary character.
.PP
Any unmatched [ is a REG_EBRACK error.
.PP
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.PP
RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
.PP
A repetition operator (?, *, +, or bounds) cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow `^' or `|'.
.PP
`|' cannot appear first or last in a (sub)expression or after another `|',
i.e. an operand of `|' cannot be an empty subexpression.
An empty parenthesized subexpression, `()', is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.PP
A `{' followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A `{' \fInot\fR followed by a digit is considered an ordinary character.
.PP
`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
REs are anchors, not ordinary characters.
.SH SEE ALSO
grep(1), regex(7)
.PP
POSIX 1003.2, sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.SH DIAGNOSTICS
Non-zero error codes from
.I regcomp
and
.I regexec
include the following:
.PP
.nf
.ta \w'REG_ECOLLATE'u+3n
REG_NOMATCH	regexec() failed to match
REG_BADPAT	invalid regular expression
REG_ECOLLATE	invalid collating element
REG_ECTYPE	invalid character class
REG_EESCAPE	\e applied to unescapable character
REG_ESUBREG	invalid backreference number
REG_EBRACK	brackets [ ] not balanced
REG_EPAREN	parentheses ( ) not balanced
REG_EBRACE	braces { } not balanced
REG_BADBR	invalid repetition count(s) in { }
REG_ERANGE	invalid character range in [ ]
REG_ESPACE	ran out of memory
REG_BADRPT	?, *, or + operand invalid
REG_EMPTY	empty (sub)expression
REG_ASSERT	``can't happen''\(emyou found a bug
REG_INVARG	invalid argument, e.g. negative-length string
.fi
.SH HISTORY
Written by Henry Spencer,
henry@zoo.toronto.edu.
.SH BUGS
This is an alpha release with known defects.
Please report problems.
.PP
There is one known functionality bug.
The implementation of internationalization is incomplete:
the locale is always assumed to be the default one of 1003.2,
and only the collating elements etc. of that locale are available.
.PP
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.PP
.I Regexec
performance is poor.
This will improve with later releases.
.I Nmatch
exceeding 0 is expensive;
.I nmatch
exceeding 1 is worse.
.I Regexec
is largely insensitive to RE complexity \fIexcept\fR that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.PP
.I Regcomp
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
will (eventually) run almost any existing machine out of swap space.
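.PP
The scale of the problem can be estimated with a little arithmetic.
The sketch below assumes, as a simplification that may not reflect the exact
expansion strategy, that each nested bound with upper limit \fIn\fR copies its
operand \fIn\fR times, so the copies multiply across nesting levels.
The name \fIexpansion_estimate\fR is hypothetical.
.nf
```c
#include <stddef.h>

/* Rough estimate, under the simplifying assumption above, of how many
 * copies of the innermost operand result from `depth' nested bounded
 * repetitions whose upper limits are given in upper[]. */
unsigned long long expansion_estimate(const unsigned *upper, size_t depth)
{
    unsigned long long copies = 1;
    size_t i;

    for (i = 0; i < depth; i++)
        copies *= upper[i];     /* each level multiplies the total */
    return copies;
}
```
.fi
For the five nested bounds of 100 in the RE shown, the estimate is
100 to the fifth power, or ten thousand million copies of `a'.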
.PP
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.PP
Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is
a special character only in the presence of a previous unmatched `('.
This can't be fixed until the spec is fixed.
.PP
The standard's definition of back references is vague.
For example, does
`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
Until the standard is clarified,
behavior in such cases should not be relied on.
.PP
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.
 508 The implementation of word-boundary matching is a bit of a kludge,
 509 and bugs may lurk in combinations of word-boundary matching and anchoring.