| 1 | .TH REGEX 3 "25 Sept 1997" |
| 2 | .BY "Henry Spencer" |
| 3 | .de ZR |
| 4 | .\" one other place knows this name: the SEE ALSO section |
| 5 | .IR regex (7) \\$1 |
| 6 | .. |
| 7 | .SH NAME |
| 8 | regcomp, regexec, regerror, regfree \- regular-expression library |
| 9 | .SH SYNOPSIS |
| 10 | .ft B |
| 11 | .\".na |
| 12 | #include <sys/types.h> |
| 13 | .br |
| 14 | #include <regex.h> |
| 15 | .HP 10 |
| 16 | int regcomp(regex_t\ *preg, const\ char\ *pattern, int\ cflags); |
| 17 | .HP |
| 18 | int\ regexec(const\ regex_t\ *preg, const\ char\ *string, |
| 19 | size_t\ nmatch, regmatch_t\ pmatch[], int\ eflags); |
| 20 | .HP |
| 21 | size_t\ regerror(int\ errcode, const\ regex_t\ *preg, |
| 22 | char\ *errbuf, size_t\ errbuf_size); |
| 23 | .HP |
| 24 | void\ regfree(regex_t\ *preg); |
| 25 | .\".ad |
| 26 | .ft |
| 27 | .SH DESCRIPTION |
| 28 | These routines implement POSIX 1003.2 regular expressions (``RE''s); |
| 29 | see |
| 30 | .ZR . |
| 31 | .I Regcomp |
| 32 | compiles an RE written as a string into an internal form, |
| 33 | .I regexec |
| 34 | matches that internal form against a string and reports results, |
| 35 | .I regerror |
| 36 | transforms error codes from either into human-readable messages, |
| 37 | and |
| 38 | .I regfree |
| 39 | frees any dynamically-allocated storage used by the internal form |
| 40 | of an RE. |
| 41 | .PP |
| 42 | The header |
| 43 | .I <regex.h> |
| 44 | declares two structure types, |
| 45 | .I regex_t |
| 46 | and |
| 47 | .IR regmatch_t , |
| 48 | the former for compiled internal forms and the latter for match reporting. |
| 49 | It also declares the four functions, |
| 50 | a type |
| 51 | .IR regoff_t , |
| 52 | and a number of constants with names starting with ``REG_''. |
| 53 | .PP |
| 54 | .I Regcomp |
| 55 | compiles the regular expression contained in the |
| 56 | .I pattern |
| 57 | string, |
| 58 | subject to the flags in |
| 59 | .IR cflags , |
| 60 | and places the results in the |
| 61 | .I regex_t |
| 62 | structure pointed to by |
| 63 | .IR preg . |
| 64 | .I Cflags |
| 65 | is the bitwise OR of zero or more of the following flags: |
| 66 | .IP REG_EXTENDED \w'REG_EXTENDED'u+2n |
| 67 | Compile modern (``extended'') REs, |
| 68 | rather than the obsolete (``basic'') REs that |
| 69 | are the default. |
| 70 | .IP REG_BASIC |
| 71 | This is a synonym for 0, |
| 72 | provided as a counterpart to REG_EXTENDED to improve readability. |
| 73 | This is an extension, |
| 74 | compatible with but not specified by POSIX 1003.2, |
| 75 | and should be used with |
| 76 | caution in software intended to be portable to other systems. |
| 77 | .IP REG_NOSPEC |
| 78 | Compile with recognition of all special characters turned off. |
| 79 | All characters are thus considered ordinary, |
| 80 | so the ``RE'' is a literal string. |
| 81 | This is an extension, |
| 82 | compatible with but not specified by POSIX 1003.2, |
| 83 | and should be used with |
| 84 | caution in software intended to be portable to other systems. |
| 85 | REG_EXTENDED and REG_NOSPEC may not be used |
| 86 | in the same call to |
| 87 | .IR regcomp . |
| 88 | .IP REG_ICASE |
| 89 | Compile for matching that ignores upper/lower case distinctions. |
| 90 | See |
| 91 | .ZR . |
| 92 | .IP REG_NOSUB |
| 93 | Compile for matching that need only report success or failure, |
| 94 | not what was matched. |
| 95 | .IP REG_NEWLINE |
| 96 | Compile for newline-sensitive matching. |
| 97 | By default, newline is a completely ordinary character with no special |
| 98 | meaning in either REs or strings. |
| 99 | With this flag, |
| 100 | `[^' bracket expressions and `.' never match newline, |
| 101 | a `^' anchor matches the null string after any newline in the string |
| 102 | in addition to its normal function, |
| 103 | and the `$' anchor matches the null string before any newline in the |
| 104 | string in addition to its normal function. |
| 105 | .IP REG_PEND |
| 106 | The regular expression ends, |
| 107 | not at the first NUL, |
| 108 | but just before the character pointed to by the |
| 109 | .I re_endp |
| 110 | member of the structure pointed to by |
| 111 | .IR preg . |
| 112 | The |
| 113 | .I re_endp |
| 114 | member is of type |
| 115 | .IR const\ char\ * . |
| 116 | This flag permits inclusion of NULs in the RE; |
| 117 | they are considered ordinary characters. |
| 118 | This is an extension, |
| 119 | compatible with but not specified by POSIX 1003.2, |
| 120 | and should be used with |
| 121 | caution in software intended to be portable to other systems. |
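.PP
The effect of REG_NEWLINE described above can be seen with a minimal sketch.
The helper name re_matches is hypothetical (it is not part of this library),
and the sketch assumes only the POSIX calls documented on this page.

```c
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper: returns 1 if `pattern` matches anywhere in `text`,
   0 if it does not, and -1 if the pattern fails to compile.
   REG_NOSUB is added because only success or failure is wanted. */
int re_matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With the three-character text a, newline, b, the pattern a.b should match
only without REG_NEWLINE, while the anchored patterns ^b and a$ should match
only with it.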
| 122 | .PP |
| 123 | When successful, |
| 124 | .I regcomp |
| 125 | returns 0 and fills in the structure pointed to by |
| 126 | .IR preg . |
| 127 | One member of that structure |
| 128 | (other than |
| 129 | .IR re_endp ) |
| 130 | is publicized: |
| 131 | .IR re_nsub , |
| 132 | of type |
| 133 | .IR size_t , |
| 134 | contains the number of parenthesized subexpressions within the RE |
| 135 | (except that the value of this member is undefined if the |
| 136 | REG_NOSUB flag was used). |
| 137 | If |
| 138 | .I regcomp |
| 139 | fails, it returns a non-zero error code; |
| 140 | see DIAGNOSTICS. |
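.PP
A typical caller checks the value returned by regcomp and, on failure,
turns it into a message with regerror.
The following is a minimal sketch of that pattern; the helper name
compile_re and its parameters are hypothetical.

```c
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper: compile `pattern` into *re.
   On success returns 0; the caller must eventually call regfree(re).
   On failure returns the non-zero regcomp code and stores the
   regerror() text in msg (at most msgsize bytes, NUL included). */
int compile_re(regex_t *re, const char *pattern, int cflags,
               char *msg, size_t msgsize)
{
    int rc = regcomp(re, pattern, cflags);

    if (rc != 0)
        regerror(rc, re, msg, msgsize);
    return rc;
}
```

After a successful call with the extended RE (a)(b), the re_nsub member
should be 2.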
| 141 | .PP |
| 142 | .I Regexec |
| 143 | matches the compiled RE pointed to by |
| 144 | .I preg |
| 145 | against the |
| 146 | .IR string , |
| 147 | subject to the flags in |
| 148 | .IR eflags , |
| 149 | and reports results using |
| 150 | .IR nmatch , |
| 151 | .IR pmatch , |
| 152 | and the returned value. |
| 153 | The RE must have been compiled by a previous invocation of |
| 154 | .IR regcomp . |
| 155 | The compiled form is not altered during execution of |
| 156 | .IR regexec , |
| 157 | so a single compiled RE can be used simultaneously by multiple threads. |
| 158 | .PP |
| 159 | By default, |
| 160 | the NUL-terminated string pointed to by |
| 161 | .I string |
| 162 | is considered to be the text of an entire line, |
| 163 | with the NUL indicating the end of the line. |
| 164 | (That is, |
| 165 | any other end-of-line marker is considered to have been removed |
| 166 | and replaced by the NUL.) |
| 167 | The |
| 168 | .I eflags |
| 169 | argument is the bitwise OR of zero or more of the following flags: |
| 170 | .IP REG_NOTBOL \w'REG_STARTEND'u+2n |
| 171 | The first character of |
| 172 | the string |
| 173 | is not the beginning of a line, so the `^' anchor should not match before it. |
| 174 | This does not affect the behavior of newlines under REG_NEWLINE. |
| 175 | .IP REG_NOTEOL |
| 176 | The NUL terminating |
| 177 | the string |
| 178 | does not end a line, so the `$' anchor should not match before it. |
| 179 | This does not affect the behavior of newlines under REG_NEWLINE. |
| 180 | .IP REG_STARTEND |
| 181 | The string is considered to start at |
| 182 | \fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR |
| 183 | and to have a terminating NUL located at |
| 184 | \fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR |
| 185 | (there need not actually be a NUL at that location), |
| 186 | regardless of the value of |
| 187 | .IR nmatch . |
| 188 | See below for the definition of |
| 189 | .I pmatch |
| 190 | and |
| 191 | .IR nmatch . |
| 192 | This is an extension, |
| 193 | compatible with but not specified by POSIX 1003.2, |
| 194 | and should be used with |
| 195 | caution in software intended to be portable to other systems. |
| 196 | Note that a non-zero \fIrm_so\fR does not imply REG_NOTBOL; |
| 197 | REG_STARTEND affects only the location of the string, |
| 198 | not how it is matched. |
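.PP
One common use of REG_NOTBOL is scanning a string for successive matches:
after the first match the search resumes in the middle of the line, so `^'
must no longer match at the resume point.
The sketch below assumes extended REs and no REG_NEWLINE; the helper name
count_matches is hypothetical.

```c
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper: count non-overlapping matches of `pattern`
   (an extended RE) in `text`.  Returns -1 if the pattern is invalid. */
int count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    const char *p = text;
    int eflags = 0;
    int count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    while (regexec(&re, p, 1, &m, eflags) == 0) {
        count++;
        if (m.rm_eo == m.rm_so) {
            /* Empty match: step over one character to make progress. */
            if (p[m.rm_eo] == '\0')
                break;
            p += m.rm_eo + 1;
        } else {
            p += m.rm_eo;
        }
        eflags = REG_NOTBOL;    /* p no longer points at a line start */
    }
    regfree(&re);
    return count;
}
```

Without REG_NOTBOL on the later calls, a pattern such as ^a would be
reported once per leading `a' rather than once.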
| 199 | .PP |
| 200 | See |
| 201 | .ZR |
| 202 | for a discussion of what is matched in situations where an RE or a |
| 203 | portion thereof could match any of several substrings of |
| 204 | .IR string . |
| 205 | .PP |
| 206 | Normally, |
| 207 | .I regexec |
| 208 | returns 0 for success and the non-zero code REG_NOMATCH for failure. |
| 209 | Other non-zero error codes may be returned in exceptional situations; |
| 210 | see DIAGNOSTICS. |
| 211 | .PP |
| 212 | If REG_NOSUB was specified in the compilation of the RE, |
| 213 | or if |
| 214 | .I nmatch |
| 215 | is 0, |
| 216 | .I regexec |
| 217 | ignores the |
| 218 | .I pmatch |
| 219 | argument (but see below for the case where REG_STARTEND is specified). |
| 220 | Otherwise, |
| 221 | .I pmatch |
| 222 | points to an array of |
| 223 | .I nmatch |
| 224 | structures of type |
| 225 | .IR regmatch_t . |
| 226 | Such a structure has at least the members |
| 227 | .I rm_so |
| 228 | and |
| 229 | .IR rm_eo , |
| 230 | both of type |
| 231 | .I regoff_t |
| 232 | (a signed arithmetic type at least as large as an |
| 233 | .I off_t |
| 234 | and a |
| 235 | .IR ssize_t ), |
| 236 | containing respectively the offset of the first character of a substring |
| 237 | and the offset of the first character after the end of the substring. |
| 238 | Offsets are measured from the beginning of the |
| 239 | .I string |
| 240 | argument given to |
| 241 | .IR regexec . |
| 242 | An empty substring is denoted by equal offsets, |
| 243 | both indicating the character following the empty substring. |
| 244 | .PP |
| 245 | The 0th member of the |
| 246 | .I pmatch |
| 247 | array is filled in to indicate what substring of |
| 248 | .I string |
| 249 | was matched by the entire RE. |
| 250 | Remaining members report what substring was matched by parenthesized |
| 251 | subexpressions within the RE; |
| 252 | member |
| 253 | .I i |
| 254 | reports subexpression |
| 255 | .IR i , |
| 256 | with subexpressions counted (starting at 1) by the order of their opening |
| 257 | parentheses in the RE, left to right. |
| 258 | Unused entries in the array\(emcorresponding either to subexpressions that |
| 259 | did not participate in the match at all, or to subexpressions that do not |
| 260 | exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both |
| 261 | .I rm_so |
| 262 | and |
| 263 | .I rm_eo |
| 264 | set to \-1. |
| 265 | If a subexpression participated in the match several times, |
| 266 | the reported substring is the last one it matched. |
| 267 | (Note in particular that when the RE `(b*)+' matches `bbb', |
| 268 | the parenthesized subexpression matches the three `b's and then |
| 269 | an infinite number of empty strings following the last `b', |
| 270 | so the reported substring is one of the empties.) |
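.PP
The offsets in the pmatch array can be turned into substrings as in the
following minimal sketch.
The helper names match_spans and copy_span are hypothetical; the sketch
relies only on the rm_so and rm_eo members and on the \-1 convention for
unused entries described above.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper: compile `pattern` as an extended RE, match it
   against `text`, and fill pm[0..n-1].  Returns 0 on a match, otherwise
   the non-zero code from regcomp or regexec. */
int match_spans(const char *pattern, const char *text,
                regmatch_t pm[], size_t n)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED);

    if (rc != 0)
        return rc;
    rc = regexec(&re, text, n, pm, 0);
    regfree(&re);
    return rc;
}

/* Hypothetical helper: copy the substring described by *m into buf,
   NUL-terminated on success.  Returns its length, or -1 if the entry
   is unused (rm_so == -1) or the substring does not fit. */
long copy_span(const char *text, const regmatch_t *m,
               char *buf, size_t bufsize)
{
    size_t len;

    if (m->rm_so == -1)
        return -1;
    len = (size_t)(m->rm_eo - m->rm_so);
    if (len + 1 > bufsize)
        return -1;
    memcpy(buf, text + m->rm_so, len);
    buf[len] = '\0';
    return (long)len;
}
```

For the RE (a)|(b) matched against the string b, entry 1 should be unused
and entry 2 should span offsets 0 to 1.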
| 271 | .PP |
| 272 | If REG_STARTEND is specified, |
| 273 | .I pmatch |
| 274 | must point to at least one |
| 275 | .I regmatch_t |
| 276 | (even if |
| 277 | .I nmatch |
| 278 | is 0 or REG_NOSUB was specified), |
| 279 | to hold the input offsets for REG_STARTEND. |
| 280 | Use for output is still entirely controlled by |
| 281 | .IR nmatch ; |
| 282 | if |
| 283 | .I nmatch |
| 284 | is 0 or REG_NOSUB was specified, |
| 285 | the value of |
| 286 | .IR pmatch [0] |
| 287 | will not be changed by a successful |
| 288 | .IR regexec . |
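.PP
REG_STARTEND makes it possible to search part of a buffer that is not
NUL-terminated at the desired end.
The sketch below is one way to wrap that; the helper name search_range is
hypothetical, and because the flag is an extension the code is guarded by
a test for the macro.

```c
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

#ifdef REG_STARTEND
/* Hypothetical helper: search bytes buf[start] .. buf[end-1] with a
   compiled RE.  pmatch[0] carries the range in and the match out; the
   reported offsets are measured from buf, not from buf + start.
   The RE is assumed to have been compiled without REG_NOSUB. */
int search_range(const regex_t *re, const char *buf,
                 size_t start, size_t end, regmatch_t *m)
{
    m->rm_so = (regoff_t)start;
    m->rm_eo = (regoff_t)end;
    return regexec(re, buf, 1, m, REG_STARTEND);
}
#endif
```

On systems that do not define REG_STARTEND, a portable fallback is to copy
the range into a NUL-terminated buffer and adjust the offsets by hand.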
| 289 | .PP |
| 290 | .I Regerror |
| 291 | maps a non-zero |
| 292 | .I errcode |
| 293 | from either |
| 294 | .I regcomp |
| 295 | or |
| 296 | .I regexec |
| 297 | to a human-readable, printable message. |
| 298 | If |
| 299 | .I preg |
| 300 | is non-NULL, |
| 301 | the error code should have arisen from use of |
| 302 | the |
| 303 | .I regex_t |
| 304 | pointed to by |
| 305 | .IR preg , |
| 306 | and if the error code came from |
| 307 | .IR regcomp , |
| 308 | it should have been the result from the most recent |
| 309 | .I regcomp |
| 310 | using that |
| 311 | .IR regex_t . |
| 312 | .RI ( Regerror |
| 313 | may be able to supply a more detailed message using information |
| 314 | from the |
| 315 | .IR regex_t .) |
| 316 | .I Regerror |
| 317 | places the NUL-terminated message into the buffer pointed to by |
| 318 | .IR errbuf , |
| 319 | limiting the length (including the NUL) to at most |
| 320 | .I errbuf_size |
| 321 | bytes. |
| 322 | If the whole message won't fit, |
| 323 | as much of it as will fit before the terminating NUL is supplied. |
| 324 | In any case, |
| 325 | the returned value is the size of buffer needed to hold the whole |
| 326 | message (including terminating NUL). |
| 327 | If |
| 328 | .I errbuf_size |
| 329 | is 0, |
| 330 | .I errbuf |
| 331 | is ignored but the return value is still correct. |
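.PP
Because the returned value is the buffer size needed, a caller can ask for
the size first and then allocate exactly that much.
The following is a minimal sketch of this two-call idiom; the helper name
re_strerror is hypothetical.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper: return a malloc'd copy of the complete message
   for errcode, or NULL if allocation fails.  The caller frees it. */
char *re_strerror(int errcode, const regex_t *preg)
{
    /* First call: errbuf_size of 0, so errbuf is ignored and only the
       required size (terminating NUL included) is returned. */
    size_t need = regerror(errcode, preg, NULL, 0);
    char *buf = malloc(need);

    if (buf != NULL)
        regerror(errcode, preg, buf, need);   /* second call: fill it */
    return buf;
}
```

A fixed-size buffer also works; the message is then truncated to fit but
is still NUL-terminated.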
| 332 | .PP |
| 333 | If the |
| 334 | .I errcode |
| 335 | given to |
| 336 | .I regerror |
| 337 | is first ORed with REG_ITOA, |
| 338 | the ``message'' that results is the printable name of the error code, |
| 339 | e.g. ``REG_NOMATCH'', |
| 340 | rather than an explanation thereof. |
| 341 | If |
| 342 | .I errcode |
| 343 | is REG_ATOI, |
| 344 | then |
| 345 | .I preg |
| 346 | shall be non-NULL and the |
| 347 | .I re_endp |
| 348 | member of the structure it points to |
| 349 | must point to the printable name of an error code; |
| 350 | in this case, the result in |
| 351 | .I errbuf |
| 352 | is the decimal digits of |
| 353 | the numeric value of the error code |
| 354 | (0 if the name is not recognized). |
| 355 | REG_ITOA and REG_ATOI are intended primarily as debugging facilities; |
| 356 | they are extensions, |
| 357 | compatible with but not specified by POSIX 1003.2, |
| 358 | and should be used with |
| 359 | caution in software intended to be portable to other systems. |
| 360 | They are also experimental and may change in later releases. |
| 361 | .PP |
| 362 | .I Regfree |
| 363 | frees any dynamically-allocated storage associated with the compiled RE |
| 364 | pointed to by |
| 365 | .IR preg . |
| 366 | The remaining |
| 367 | .I regex_t |
| 368 | is no longer a valid compiled RE |
| 369 | and the effect of supplying it to |
| 370 | .I regexec |
| 371 | or |
| 372 | .I regerror |
| 373 | is undefined. |
| 374 | .PP |
| 375 | None of these functions references global variables except for tables |
| 376 | of constants; |
| 377 | all are safe for use from multiple threads if the arguments are safe. |
| 378 | .SH IMPLEMENTATION CHOICES |
| 379 | There are a number of decisions that 1003.2 leaves up to the implementor, |
| 380 | either by explicitly saying ``undefined'' or by virtue of their being |
| 381 | forbidden by the RE grammar. |
| 382 | This implementation treats them as follows. |
| 383 | .PP |
| 384 | See |
| 385 | .ZR |
| 386 | for a discussion of the definition of case-independent matching. |
| 387 | .PP |
| 388 | There is no particular limit on the length of REs, |
| 389 | except insofar as memory is limited. |
| 390 | Memory usage is approximately linear in RE size, and largely insensitive |
| 391 | to RE complexity, except for bounded repetitions. |
| 392 | See BUGS for one short RE using them |
| 393 | that will run almost any system out of memory. |
| 394 | .PP |
| 395 | A backslashed character other than one specifically given a magic meaning |
| 396 | by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs) |
| 397 | is taken as an ordinary character. |
| 398 | .PP |
| 399 | Any unmatched [ is a REG_EBRACK error. |
| 400 | .PP |
| 401 | Equivalence classes cannot begin or end bracket-expression ranges. |
| 402 | The endpoint of one range cannot begin another. |
| 403 | .PP |
| 404 | RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255. |
| 405 | .PP |
| 406 | A repetition operator (?, *, +, or bounds) cannot follow another |
| 407 | repetition operator. |
| 408 | A repetition operator cannot begin an expression or subexpression |
| 409 | or follow `^' or `|'. |
| 410 | .PP |
| 411 | `|' cannot appear first or last in a (sub)expression or after another `|', |
| 412 | i.e. an operand of `|' cannot be an empty subexpression. |
| 413 | An empty parenthesized subexpression, `()', is legal and matches an |
| 414 | empty (sub)string. |
| 415 | An empty string is not a legal RE. |
| 416 | .PP |
| 417 | A `{' followed by a digit is considered the beginning of bounds for a |
| 418 | bounded repetition, which must then follow the syntax for bounds. |
| 419 | A `{' \fInot\fR followed by a digit is considered an ordinary character. |
| 420 | .PP |
| 421 | `^' and `$' beginning and ending subexpressions in obsolete (``basic'') |
| 422 | REs are anchors, not ordinary characters. |
| 423 | .SH SEE ALSO |
| 424 | grep(1), regex(7) |
| 425 | .PP |
| 426 | POSIX 1003.2, sections 2.8 (Regular Expression Notation) |
| 427 | and |
| 428 | B.5 (C Binding for Regular Expression Matching). |
| 429 | .SH DIAGNOSTICS |
| 430 | Non-zero error codes from |
| 431 | .I regcomp |
| 432 | and |
| 433 | .I regexec |
| 434 | include the following: |
| 435 | .PP |
| 436 | .nf |
| 437 | .ta \w'REG_ECOLLATE'u+3n |
| 438 | REG_NOMATCH	regexec() failed to match |
| 439 | REG_BADPAT	invalid regular expression |
| 440 | REG_ECOLLATE	invalid collating element |
| 441 | REG_ECTYPE	invalid character class |
| 442 | REG_EESCAPE	\e applied to unescapable character |
| 443 | REG_ESUBREG	invalid backreference number |
| 444 | REG_EBRACK	brackets [ ] not balanced |
| 445 | REG_EPAREN	parentheses ( ) not balanced |
| 446 | REG_EBRACE	braces { } not balanced |
| 447 | REG_BADBR	invalid repetition count(s) in { } |
| 448 | REG_ERANGE	invalid character range in [ ] |
| 449 | REG_ESPACE	ran out of memory |
| 450 | REG_BADRPT	?, *, or + operand invalid |
| 451 | REG_EMPTY	empty (sub)expression |
| 452 | REG_ASSERT	``can't happen''\(emyou found a bug |
| 453 | REG_INVARG	invalid argument, e.g. negative-length string |
| 454 | .fi |
| 455 | .SH HISTORY |
| 456 | Written by Henry Spencer, |
| 457 | henry@zoo.toronto.edu. |
| 458 | .SH BUGS |
| 459 | This is an alpha release with known defects. |
| 460 | Please report problems. |
| 461 | .PP |
| 462 | There is one known functionality bug. |
| 463 | The implementation of internationalization is incomplete: |
| 464 | the locale is always assumed to be the default one of 1003.2, |
| 465 | and only the collating elements etc. of that locale are available. |
| 466 | .PP |
| 467 | The back-reference code is subtle and doubts linger about its correctness |
| 468 | in complex cases. |
| 469 | .PP |
| 470 | .I Regexec |
| 471 | performance is poor. |
| 472 | This will improve with later releases. |
| 473 | .I Nmatch |
| 474 | exceeding 0 is expensive; |
| 475 | .I nmatch |
| 476 | exceeding 1 is worse. |
| 477 | .I Regexec |
| 478 | is largely insensitive to RE complexity \fIexcept\fR that back |
| 479 | references are massively expensive. |
| 480 | RE length does matter; in particular, there is a strong speed bonus |
| 481 | for keeping RE length under about 30 characters, |
| 482 | with most special characters counting roughly double. |
| 483 | .PP |
| 484 | .I Regcomp |
| 485 | implements bounded repetitions by macro expansion, |
| 486 | which is costly in time and space if counts are large |
| 487 | or bounded repetitions are nested. |
| 488 | An RE like, say, |
| 489 | `((((a{1,100}){1,100}){1,100}){1,100}){1,100}' |
| 490 | will (eventually) run almost any existing machine out of swap space. |
| 491 | .PP |
| 492 | There are suspected problems with response to obscure error conditions. |
| 493 | Notably, |
| 494 | certain kinds of internal overflow, |
| 495 | produced only by truly enormous REs or by multiply nested bounded repetitions, |
| 496 | are probably not handled well. |
| 497 | .PP |
| 498 | Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is |
| 499 | a special character only in the presence of a previous unmatched `('. |
| 500 | This can't be fixed until the spec is fixed. |
| 501 | .PP |
| 502 | The standard's definition of back references is vague. |
| 503 | For example, does |
| 504 | `a\e(\e(b\e)*\e2\e)*d' match `abbbd'? |
| 505 | Until the standard is clarified, |
| 506 | behavior in such cases should not be relied on. |
| 507 | .PP |
| 508 | The implementation of word-boundary matching is a bit of a kludge, |
| 509 | and bugs may lurk in combinations of word-boundary matching and anchoring. |