'\"
'\" Copyright (c) 1998 Sun Microsystems, Inc.
'\" Copyright (c) 1999 Scriptics Corporation
'\"
'\" This software is copyrighted by the Regents of the University of
'\" California, Sun Microsystems, Inc., Scriptics Corporation, ActiveState
'\" Corporation and other parties.  The following terms apply to all files
'\" associated with the software unless explicitly disclaimed in
'\" individual files.
'\"
'\" The authors hereby grant permission to use, copy, modify, distribute,
'\" and license this software and its documentation for any purpose, provided
'\" that existing copyright notices are retained in all copies and that this
'\" notice is included verbatim in any distributions. No written agreement,
'\" license, or royalty fee is required for any of the authorized uses.
'\" Modifications to this software may be copyrighted by their authors
'\" and need not follow the licensing terms described here, provided that
'\" the new terms are clearly indicated on the first page of each file where
'\" they apply.
'\"
'\" IN NO EVENT SHALL THE AUTHORS OR DISTRIBUTORS BE LIABLE TO ANY PARTY
'\" FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES
'\" ARISING OUT OF THE USE OF THIS SOFTWARE, ITS DOCUMENTATION, OR ANY
'\" DERIVATIVES THEREOF, EVEN IF THE AUTHORS HAVE BEEN ADVISED OF THE
'\" POSSIBILITY OF SUCH DAMAGE.
'\"
'\" THE AUTHORS AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
'\" INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY,
'\" FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT.  THIS SOFTWARE
'\" IS PROVIDED ON AN "AS IS" BASIS, AND THE AUTHORS AND DISTRIBUTORS HAVE
'\" NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR
'\" MODIFICATIONS.
'\"
'\" GOVERNMENT USE: If you are acquiring this software on behalf of the
'\" U.S. government, the Government shall have only "Restricted Rights"
'\" in the software and related documentation as defined in the Federal
'\" Acquisition Regulations (FARs) in Clause 52.227.19 (c) (2).  If you
'\" are acquiring the software on behalf of the Department of Defense, the
'\" software shall be classified as "Commercial Computer Software" and the
'\" Government shall have only "Restricted Rights" as defined in Clause
'\" 252.227-7013 (c) (1) of DFARs.  Notwithstanding the foregoing, the
'\" authors grant the U.S. Government and others acting in its behalf
'\" permission to use and distribute the software in accordance with the
'\" terms specified in this license.
'\"
'\" RCS: @(#) Id: re_syntax.n,v 1.3 1999/07/14 19:09:36 jpeek Exp
'\"
.so man.macros
.TH re_syntax n "8.1" Tcl "Tcl Built-In Commands"
.BS
.SH NAME
re_syntax \- Syntax of Tcl regular expressions.
.BE

.SH DESCRIPTION
.PP
A \fIregular expression\fR describes strings of characters.
It's a pattern that matches certain strings and doesn't match others.

.SH "DIFFERENT FLAVORS OF REs"
Regular expressions (``RE''s), as defined by POSIX, come in two
flavors: \fIextended\fR REs (``EREs'') and \fIbasic\fR REs (``BREs'').
EREs are roughly those of the traditional \fIegrep\fR, while BREs are
roughly those of the traditional \fIed\fR.  This implementation adds
a third flavor, \fIadvanced\fR REs (``AREs''), basically EREs with
some significant extensions.
.PP
This manual page primarily describes AREs.  BREs mostly exist for
backward compatibility in some old programs; they will be discussed at
the end.  POSIX EREs are almost an exact subset of AREs.  Features of
AREs that are not present in EREs will be indicated.

.SH "REGULAR EXPRESSION SYNTAX"
.PP
Tcl regular expressions are implemented using the package written by
Henry Spencer, based on the 1003.2 spec and some (not quite all) of
the Perl5 extensions (thanks, Henry!).  Much of the description of
regular expressions below is copied verbatim from his manual entry.
.PP
An ARE is one or more \fIbranches\fR,
separated by `\fB|\fR',
matching anything that matches any of the branches.
.PP
A branch is zero or more \fIconstraints\fR or \fIquantified atoms\fR,
concatenated.
It matches a match for the first, followed by a match for the second, etc.;
an empty branch matches the empty string.
.PP
A quantified atom is an \fIatom\fR possibly followed
by a single \fIquantifier\fR.
Without a quantifier, it matches a match for the atom.
The quantifiers,
and what a so-quantified atom matches, are:
.RS 2
.TP 6
\fB*\fR
a sequence of 0 or more matches of the atom
.TP
\fB+\fR
a sequence of 1 or more matches of the atom
.TP
\fB?\fR
a sequence of 0 or 1 matches of the atom
.TP
\fB{\fIm\fB}\fR
a sequence of exactly \fIm\fR matches of the atom
.TP
\fB{\fIm\fB,}\fR
a sequence of \fIm\fR or more matches of the atom
.TP
\fB{\fIm\fB,\fIn\fB}\fR
a sequence of \fIm\fR through \fIn\fR (inclusive) matches of the atom;
\fIm\fR may not exceed \fIn\fR
.TP
\fB*?  +?  ??  {\fIm\fB}?  {\fIm\fB,}?  {\fIm\fB,\fIn\fB}?\fR
\fInon-greedy\fR quantifiers,
which match the same possibilities,
but prefer the smallest number rather than the largest number
of matches (see MATCHING)
.RE
.PP
The forms using
\fB{\fR and \fB}\fR
are known as \fIbound\fRs.
The numbers
\fIm\fR and \fIn\fR are unsigned decimal integers
with permissible values from 0 to 255 inclusive.
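.PP
As a minimal sketch of the quantifiers and bounds described above,
assuming the Tcl \fBregexp\fR command (see regexp(n)) and a made-up
sample string, the greedy, non-greedy and bounded forms can be compared
like this:

```tcl
# Illustrative only: the sample string and variable names are invented.
set s "caaat"
regexp {a+}   $s greedy    ;# greedy: takes as many a's as it can
regexp {a+?}  $s lazy      ;# non-greedy: takes as few a's as it can
regexp {a{2}} $s exact     ;# bound: exactly two a's
set in_bounds [regexp {^a{2,3}$} "aaaa"]   ;# four a's fall outside {2,3}
puts [list $greedy $lazy $exact $in_bounds]
```

Run under \fBtclsh\fR, this prints \fBaaa a aa 0\fR.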
.PP
An atom is one of:
.RS 2
.TP 6
\fB(\fIre\fB)\fR
(where \fIre\fR is any regular expression)
matches a match for
\fIre\fR, with the match noted for possible reporting
.TP
\fB(?:\fIre\fB)\fR
as previous,
but does no reporting
(a ``non-capturing'' set of parentheses)
.TP
\fB()\fR
matches an empty string,
noted for possible reporting
.TP
\fB(?:)\fR
matches an empty string,
without reporting
.TP
\fB[\fIchars\fB]\fR
a \fIbracket expression\fR,
matching any one of the \fIchars\fR (see BRACKET EXPRESSIONS for more detail)
.TP
\fB.\fR
matches any single character
.TP
\fB\e\fIk\fR
(where \fIk\fR is a non-alphanumeric character)
matches that character taken as an ordinary character,
e.g. \fB\e\e\fR matches a backslash character
.TP
\fB\e\fIc\fR
where \fIc\fR is alphanumeric
(possibly followed by other characters),
an \fIescape\fR (AREs only),
see ESCAPES below
.TP
\fB{\fR
when followed by a character other than a digit,
matches the left-brace character `\fB{\fR';
when followed by a digit, it is the beginning of a
\fIbound\fR (see above)
.TP
\fIx\fR
where \fIx\fR is
a single character with no other significance, matches that character.
.RE
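.PP
The difference between capturing and non-capturing parentheses can be
sketched as follows, assuming the Tcl \fBregexp\fR command, whose extra
arguments receive the whole match and then each reported subexpression.
The input string and variable names are hypothetical:

```tcl
# Illustrative only: sample data and variable names are invented.
set s "tel 555-123-4567"
# The middle group is (?:...), so it is matched but not reported;
# only two subexpression variables are filled in.
regexp {(\d+)-(?:\d+)-(\d+)} $s whole first last
puts [list $whole $first $last]
```

This prints \fB555-123-4567 555 4567\fR: the middle group takes part in
the match but does not occupy a subexpression slot.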
.PP
A \fIconstraint\fR matches an empty string when specific conditions
are met.
A constraint may not be followed by a quantifier.
The simple constraints are as follows; some more constraints are
described later, under ESCAPES.
.RS 2
.TP 8
\fB^\fR
matches at the beginning of a line
.TP
\fB$\fR
matches at the end of a line
.TP
\fB(?=\fIre\fB)\fR
\fIpositive lookahead\fR (AREs only), matches at any point
where a substring matching \fIre\fR begins
.TP
\fB(?!\fIre\fB)\fR
\fInegative lookahead\fR (AREs only), matches at any point
where no substring matching \fIre\fR begins
.RE
.PP
The lookahead constraints may not contain back references (see later),
and all parentheses within them are considered non-capturing.
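.PP
A short sketch of the two lookahead constraints, assuming the Tcl
\fBregexp\fR command and invented sample strings.
Because a constraint matches only an empty string, the text examined by
the lookahead is not part of the reported match:

```tcl
# Illustrative only: sample strings and variable names are invented.
regexp {foo(?=bar)} "foobar" ahead           ;# "bar" is checked, not consumed
set neg_bar [regexp {foo(?!bar)} "foobar"]   ;# "bar" follows, so no match
set neg_baz [regexp {foo(?!bar)} "foobaz"]   ;# "bar" does not follow
puts [list $ahead $neg_bar $neg_baz]
```

This prints \fBfoo 0 1\fR.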
.PP
An RE may not end with `\fB\e\fR'.

.SH "BRACKET EXPRESSIONS"
A \fIbracket expression\fR is a list of characters enclosed in `\fB[\|]\fR'.
It normally matches any single character from the list (but see below).
If the list begins with `\fB^\fR',
it matches any single character
(but see below) \fInot\fR from the rest of the list.
.PP
If two characters in the list are separated by `\fB\-\fR',
this is shorthand
for the full \fIrange\fR of characters between those two (inclusive) in the
collating sequence,
e.g.
\fB[0\-9]\fR
in ASCII matches any decimal digit.
Two ranges may not share an
endpoint, so e.g.
\fBa\-c\-e\fR
is illegal.
Ranges are very collating-sequence-dependent,
and portable programs should avoid relying on them.
.PP
To include a literal
\fB]\fR
or
\fB\-\fR
in the list,
the simplest method is to
enclose it in
\fB[.\fR and \fB.]\fR
to make it a collating element (see below).
Alternatively,
make it the first character
(following a possible `\fB^\fR'),
or (AREs only) precede it with `\fB\e\fR'.
Alternatively, for `\fB\-\fR',
make it the last character,
or the second endpoint of a range.
To use a literal
\fB\-\fR
as the first endpoint of a range,
make it a collating element
or (AREs only) precede it with `\fB\e\fR'.
With the exception of these, some combinations using
\fB[\fR
(see next
paragraphs), and escapes,
all other special characters lose their
special significance within a bracket expression.
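.PP
The range, complement and literal-placement rules above can be sketched
as follows, assuming the Tcl \fBregexp\fR command and made-up input.
The last pattern relies on the rules that a leading \fB]\fR and a
trailing \fB\-\fR are taken literally:

```tcl
# Illustrative only: sample strings and variable names are invented.
regexp {[0-9]+}  "abc123" digits   ;# range: any decimal digit
regexp {[^0-9]+} "abc123" others   ;# complemented list: anything else
regexp {[]-]+}   "a]-]b"  punct    ;# literal ] first, literal - last
puts "$digits $others $punct"
```

This prints \fB123 abc ]\-]\fR.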
.PP
Within a bracket expression, a collating element (a character,
a multi-character sequence that collates as if it were a single character,
or a collating-sequence name for either)
enclosed in
\fB[.\fR and \fB.]\fR
stands for the
sequence of characters of that collating element.
The sequence is a single element of the bracket expression's list.
A bracket expression in a locale that has
multi-character collating elements
can thus match more than one character.
.VS 8.2
So (insidiously), a bracket expression that starts with \fB^\fR
can match multi-character collating elements even if none of them
appear in the bracket expression!
(\fINote:\fR Tcl currently has no multi-character collating elements.
This information is only for illustration.)
.PP
For example, assume the collating sequence includes a \fBch\fR
multi-character collating element.
Then the RE \fB[[.ch.]]*c\fR (zero or more \fBch\fP's followed by \fBc\fP)
matches the first five characters of `\fBchchcc\fR'.
Also, the RE \fB[^c]b\fR matches all of `\fBchb\fR'
(because \fB[^c]\fR matches the multi-character \fBch\fR).
.VE 8.2
.PP
Within a bracket expression, a collating element enclosed in
\fB[=\fR
and
\fB=]\fR
is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself.
(If there are no other equivalent collating elements,
the treatment is as if the enclosing delimiters were `\fB[.\fR'\&
and `\fB.]\fR'.)
For example, if
\fBo\fR
and
\fB\o'o^'\fR
are the members of an equivalence class,
then `\fB[[=o=]]\fR', `\fB[[=\o'o^'=]]\fR',
and `\fB[o\o'o^']\fR'\&
are all synonymous.
An equivalence class may not be an endpoint
of a range.
.VS 8.2
(\fINote:\fR
Tcl currently implements only the Unicode locale.
It doesn't define any equivalence classes.
The examples above are just illustrations.)
.VE 8.2
.PP
Within a bracket expression, the name of a \fIcharacter class\fR enclosed
in
\fB[:\fR
and
\fB:]\fR
stands for the list of all characters
(not all collating elements!)
belonging to that
class.
Standard character classes are:
.PP
.RS
.ne 5
.nf
.ta 3c
\fBalpha\fR	A letter.
\fBupper\fR	An upper-case letter.
\fBlower\fR	A lower-case letter.
\fBdigit\fR	A decimal digit.
\fBxdigit\fR	A hexadecimal digit.
\fBalnum\fR	An alphanumeric (letter or digit).
\fBprint\fR	An alphanumeric (same as alnum).
\fBblank\fR	A space or tab character.
\fBspace\fR	A character producing white space in displayed text.
\fBpunct\fR	A punctuation character.
\fBgraph\fR	A character with a visible representation.
\fBcntrl\fR	A control character.
.fi
.RE
.PP
A locale may provide others.
.VS 8.2
(Note that the current Tcl implementation has only one locale:
the Unicode locale.)
.VE 8.2
A character class may not be used as an endpoint of a range.
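.PP
As a brief sketch of character classes, assuming the Tcl \fBregexp\fR
command and an invented input string: a class name appears inside the
brackets of a bracket expression, and several classes can be combined in
one list.

```tcl
# Illustrative only: the sample string and variable names are invented.
set s "42 apples"
regexp {[[:alpha:]]+}          $s letters   ;# one or more letters
regexp {[[:digit:][:space:]]+} $s lead      ;# digits and white space mixed
puts "<$letters> <$lead>"
```

This prints \fB<apples> <42 >\fR; the second match includes the
trailing space.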
.PP
There are two special cases of bracket expressions:
the bracket expressions
\fB[[:<:]]\fR
and
\fB[[:>:]]\fR
are constraints, matching empty strings at
the beginning and end of a word respectively.
'\" note, discussion of escapes below references this definition of word
A word is defined as a sequence of
word characters
that is neither preceded nor followed by
word characters.
A word character is an
\fIalnum\fR
character
or an underscore
(\fB_\fR).
These special bracket expressions are deprecated;
users of AREs should use constraint escapes instead (see below).
.SH ESCAPES
Escapes (AREs only), which begin with a
\fB\e\fR
followed by an alphanumeric character,
come in several varieties:
character entry, class shorthands, constraint escapes, and back references.
A
\fB\e\fR
followed by an alphanumeric character but not constituting
a valid escape is illegal in AREs.
In EREs, there are no escapes:
outside a bracket expression,
a
\fB\e\fR
followed by an alphanumeric character merely stands for that
character as an ordinary character,
and inside a bracket expression,
\fB\e\fR
is an ordinary character.
(The latter is the one actual incompatibility between EREs and AREs.)
.PP
Character-entry escapes (AREs only) exist to make it easier to specify
non-printing and otherwise inconvenient characters in REs:
.RS 2
.TP 5
\fB\ea\fR
alert (bell) character, as in C
.TP
\fB\eb\fR
backspace, as in C
.TP
\fB\eB\fR
synonym for
\fB\e\fR
to help reduce backslash doubling in some
applications where there are multiple levels of backslash processing
.TP
\fB\ec\fIX\fR
(where \fIX\fR is any character) the character whose
low-order 5 bits are the same as those of
\fIX\fR,
and whose other bits are all zero
.TP
\fB\ee\fR
the character whose collating-sequence name
is `\fBESC\fR',
or failing that, the character with octal value 033
.TP
\fB\ef\fR
formfeed, as in C
.TP
\fB\en\fR
newline, as in C
.TP
\fB\er\fR
carriage return, as in C
.TP
\fB\et\fR
horizontal tab, as in C
.TP
\fB\eu\fIwxyz\fR
(where
\fIwxyz\fR
is exactly four hexadecimal digits)
the Unicode character
\fBU+\fIwxyz\fR
in the local byte ordering
.TP
\fB\eU\fIstuvwxyz\fR
(where
\fIstuvwxyz\fR
is exactly eight hexadecimal digits)
reserved for a somewhat-hypothetical Unicode extension to 32 bits
.TP
\fB\ev\fR
vertical tab, as in C
.TP
\fB\ex\fIhhh\fR
(where
\fIhhh\fR
is any sequence of hexadecimal digits)
the character whose hexadecimal value is
\fB0x\fIhhh\fR
(a single character no matter how many hexadecimal digits are used).
.TP
\fB\e0\fR
the character whose value is
\fB0\fR
.TP
\fB\e\fIxy\fR
(where
\fIxy\fR
is exactly two octal digits,
and is not a
\fIback reference\fR (see below))
the character whose octal value is
\fB0\fIxy\fR
.TP
\fB\e\fIxyz\fR
(where
\fIxyz\fR
is exactly three octal digits,
and is not a
back reference (see below))
the character whose octal value is
\fB0\fIxyz\fR
.RE
.PP
Hexadecimal digits are `\fB0\fR'-`\fB9\fR', `\fBa\fR'-`\fBf\fR',
and `\fBA\fR'-`\fBF\fR'.
Octal digits are `\fB0\fR'-`\fB7\fR'.
.PP
The character-entry escapes are always taken as ordinary characters.
For example,
\fB\e135\fR
is
\fB]\fR
in ASCII,
but
\fB\e135\fR
does not terminate a bracket expression.
Beware, however, that some applications (e.g., C compilers) interpret
such sequences themselves before the regular-expression package
gets to see them, which may require doubling (quadrupling, etc.) the `\fB\e\fR'.
.PP
Class-shorthand escapes (AREs only) provide shorthands for certain commonly-used
character classes:
.RS 2
.TP 10
\fB\ed\fR
\fB[[:digit:]]\fR
.TP
\fB\es\fR
\fB[[:space:]]\fR
.TP
\fB\ew\fR
\fB[[:alnum:]_]\fR
(note underscore)
.TP
\fB\eD\fR
\fB[^[:digit:]]\fR
.TP
\fB\eS\fR
\fB[^[:space:]]\fR
.TP
\fB\eW\fR
\fB[^[:alnum:]_]\fR
(note underscore)
.RE
.PP
Within bracket expressions, `\fB\ed\fR', `\fB\es\fR',
and `\fB\ew\fR'\&
lose their outer brackets,
and `\fB\eD\fR', `\fB\eS\fR',
and `\fB\eW\fR'\&
are illegal.
.VS 8.2
(So, for example, \fB[a\-c\ed]\fR is equivalent to \fB[a\-c[:digit:]]\fR.
Also, \fB[a\-c\eD]\fR, which is equivalent to \fB[a\-c^[:digit:]]\fR, is illegal.)
.VE 8.2
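.PP
A small sketch of the class-shorthand escapes, assuming the Tcl
\fBregexp\fR command and invented sample strings.
The second pattern uses \fB\ed\fR inside a bracket expression, where it
behaves as \fB[:digit:]\fR:

```tcl
# Illustrative only: sample strings and variable names are invented.
regexp {\w+}       "  hello_world!" word   ;# \w includes the underscore
regexp {[a-c\d]+}  "xyzab12cd"      run    ;# same list as [a-c[:digit:]]
puts [list $word $run]
```

This prints \fBhello_world ab12c\fR.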
.PP
A constraint escape (AREs only) is a constraint,
matching the empty string if specific conditions are met,
written as an escape:
.RS 2
.TP 6
\fB\eA\fR
matches only at the beginning of the string
(see MATCHING, below, for how this differs from `\fB^\fR')
.TP
\fB\em\fR
matches only at the beginning of a word
.TP
\fB\eM\fR
matches only at the end of a word
.TP
\fB\ey\fR
matches only at the beginning or end of a word
.TP
\fB\eY\fR
matches only at a point that is not the beginning or end of a word
.TP
\fB\eZ\fR
matches only at the end of the string
(see MATCHING, below, for how this differs from `\fB$\fR')
.TP
\fB\e\fIm\fR
(where
\fIm\fR
is a nonzero digit) a \fIback reference\fR, see below
.TP
\fB\e\fImnn\fR
(where
\fIm\fR
is a nonzero digit, and
\fInn\fR
is some more digits,
and the decimal value
\fImnn\fR
is not greater than the number of closing capturing parentheses seen so far)
a \fIback reference\fR, see below
.RE
.PP
A word is defined as in the specification of
\fB[[:<:]]\fR
and
\fB[[:>:]]\fR
above.
Constraint escapes are illegal within bracket expressions.
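.PP
The word-boundary constraint escapes can be sketched as follows,
assuming the Tcl \fBregexp\fR command with its \fB\-all\fR and
\fB\-inline\fR switches and a made-up input string:

```tcl
# Illustrative only: the sample string and variable names are invented.
set s "cat concat cats"
set plain  [regexp -all {cat} $s]                 ;# every occurrence
set words  [regexp -all -inline {\mcat\M} $s]     ;# "cat" as a whole word
set starts [regexp -all {\mcat} $s]               ;# "cat" starting a word
puts [list $plain $words $starts]
```

This prints \fB3 cat 2\fR: all three words contain \fBcat\fR, only the
first is exactly \fBcat\fR, and two begin with it.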
.PP
A back reference (AREs only) matches the same string matched by the parenthesized
subexpression specified by the number,
so that (e.g.)
\fB([bc])\e1\fR
matches
\fBbb\fR
or
\fBcc\fR
but not `\fBbc\fR'.
The subexpression must entirely precede the back reference in the RE.
Subexpressions are numbered in the order of their leading parentheses.
Non-capturing parentheses do not define subexpressions.
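.PP
The back-reference example above can be tried directly; this is a
minimal sketch assuming the Tcl \fBregexp\fR command, with a second,
invented pattern that finds a repeated word:

```tcl
# Illustrative only: the repeated-word string and variable names are invented.
set same  [regexp {([bc])\1} "bb"]    ;# second char repeats the first
set mixed [regexp {([bc])\1} "bc"]    ;# "c" is not what group 1 matched
regexp {(\w+) \1} "hello hello world" pair word
puts [list $same $mixed $pair $word]
```

This prints \fB1 0 {hello hello} hello\fR.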
.PP
There is an inherent historical ambiguity between octal character-entry
escapes and back references, which is resolved by heuristics,
as hinted at above.
A leading zero always indicates an octal escape.
A single non-zero digit, not followed by another digit,
is always taken as a back reference.
A multi-digit sequence not starting with a zero is taken as a back
reference if it comes after a suitable subexpression
(i.e. the number is in the legal range for a back reference),
and otherwise is taken as octal.
.SH "METASYNTAX"
In addition to the main syntax described above, there are some special
forms and miscellaneous syntactic facilities available.
.PP
Normally the flavor of RE being used is specified by
application-dependent means.
However, this can be overridden by a \fIdirector\fR.
If an RE of any flavor begins with `\fB***:\fR',
the rest of the RE is an ARE.
If an RE of any flavor begins with `\fB***=\fR',
the rest of the RE is taken to be a literal string,
with all characters considered ordinary characters.
.PP
An ARE may begin with \fIembedded options\fR:
a sequence
\fB(?\fIxyz\fB)\fR
(where
\fIxyz\fR
is one or more alphabetic characters)
specifies options affecting the rest of the RE.
These supplement, and can override,
any options specified by the application.
The available option letters are:
.RS 2
.TP 3
\fBb\fR
rest of RE is a BRE
.TP 3
\fBc\fR
case-sensitive matching (usual default)
.TP 3
\fBe\fR
rest of RE is an ERE
.TP 3
\fBi\fR
case-insensitive matching (see MATCHING, below)
.TP 3
\fBm\fR
historical synonym for
\fBn\fR
.TP 3
\fBn\fR
newline-sensitive matching (see MATCHING, below)
.TP 3
\fBp\fR
partial newline-sensitive matching (see MATCHING, below)
.TP 3
\fBq\fR
rest of RE is a literal (``quoted'') string, all ordinary characters
.TP 3
\fBs\fR
non-newline-sensitive matching (usual default)
.TP 3
\fBt\fR
tight syntax (usual default; see below)
.TP 3
\fBw\fR
inverse partial newline-sensitive (``weird'') matching (see MATCHING, below)
.TP 3
\fBx\fR
expanded syntax (see below)
.RE
.PP
Embedded options take effect at the
\fB)\fR
terminating the sequence.
They are available only at the start of an ARE,
and may not be used later within it.
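.PP
A minimal sketch of a director and an embedded option, assuming the Tcl
\fBregexp\fR command and invented sample strings.
With the \fB***=\fR director the dot is an ordinary character, and the
\fBi\fR option makes the rest of the ARE case-insensitive:

```tcl
# Illustrative only: sample strings and variable names are invented.
set lit_no  [regexp {***=a.c} "abc"]       ;# literal "a.c" is not in "abc"
set lit_yes [regexp {***=a.c} "xa.cx"]     ;# literal "a.c" is present
set nocase  [regexp {(?i)hello} "HeLLo"]   ;# embedded i option
set plain   [regexp {hello} "HeLLo"]       ;# default is case-sensitive
puts [list $lit_no $lit_yes $nocase $plain]
```

This prints \fB0 1 1 0\fR.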
.PP
In addition to the usual (\fItight\fR) RE syntax, in which all characters are
significant, there is an \fIexpanded\fR syntax,
available in all flavors of RE
with the \fB\-expanded\fR switch, or in AREs with the embedded \fBx\fR option.
In the expanded syntax,
white-space characters are ignored
and all characters between a
\fB#\fR
and the following newline (or the end of the RE) are ignored,
permitting paragraphing and commenting a complex RE.
There are three exceptions to that basic rule:
.RS 2
.PP
a white-space character or `\fB#\fR' preceded by `\fB\e\fR' is retained
.PP
white space or `\fB#\fR' within a bracket expression is retained
.PP
white space and comments are illegal within multi-character symbols
like the ARE `\fB(?:\fR' or the BRE `\fB\e(\fR'
.RE
.PP
Expanded-syntax white-space characters are blank, tab, newline, and
.VS 8.2
any character that belongs to the \fIspace\fR character class.
.VE 8.2
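.PP
The expanded syntax can be sketched as follows, assuming the Tcl
\fBregexp\fR command and the embedded \fBx\fR option.
The date-like input, the layout of the RE and the variable names are
invented for illustration:

```tcl
# Illustrative only: pattern layout, sample data and names are invented.
# White space and "# ..." comments inside the RE are ignored because
# the RE starts with the embedded option (?x).
set re {(?x)
    (\d{4})   # year
    -
    (\d{2})   # month
}
regexp $re "on 2024-05-17" whole year month
puts [list $whole $year $month]
```

This prints \fB2024\-05 2024 05\fR, the same result as the tight form
\fB(\ed{4})\-(\ed{2})\fR.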
.PP
Finally, in an ARE,
outside bracket expressions, the sequence `\fB(?#\fIttt\fB)\fR'
(where
\fIttt\fR
is any text not containing a `\fB)\fR')
is a comment,
completely ignored.
Again, this is not allowed between the characters of
multi-character symbols like `\fB(?:\fR'.
Such comments are more a historical artifact than a useful facility,
and their use is deprecated;
use the expanded syntax instead.
.PP
\fINone\fR of these metasyntax extensions is available if the application
(or an initial
\fB***=\fR
director)
has specified that the user's input be treated as a literal string
rather than as an RE.
.SH MATCHING
In the event that an RE could match more than one substring of a given
string,
the RE matches the one starting earliest in the string.
If the RE could match more than one substring starting at that point,
its choice is determined by its \fIpreference\fR:
either the longest substring, or the shortest.
.PP
Most atoms, and all constraints, have no preference.
A parenthesized RE has the same preference (possibly none) as the RE.
A quantified atom with quantifier
\fB{\fIm\fB}\fR
or
\fB{\fIm\fB}?\fR
has the same preference (possibly none) as the atom itself.
A quantified atom with other normal quantifiers (including
\fB{\fIm\fB,\fIn\fB}\fR
with
\fIm\fR
equal to
\fIn\fR)
prefers longest match.
A quantified atom with other non-greedy quantifiers (including
\fB{\fIm\fB,\fIn\fB}?\fR
with
\fIm\fR
equal to
\fIn\fR)
prefers shortest match.
A branch has the same preference as the first quantified atom in it
which has a preference.
An RE consisting of two or more branches connected by the
\fB|\fR
operator prefers longest match.
.PP
Subject to the constraints imposed by the rules for matching the whole RE,
subexpressions also match the longest or shortest possible substrings,
based on their preferences,
with subexpressions starting earlier in the RE taking priority over
ones starting later.
Note that outer subexpressions thus take priority over
their component subexpressions.
.PP
Note that the quantifiers
\fB{1,1}\fR
and
\fB{1,1}?\fR
can be used to force longest and shortest preference, respectively,
on a subexpression or a whole RE.
.PP
Match lengths are measured in characters, not collating elements.
An empty string is considered longer than no match at all.
For example,
\fBbb*\fR
matches the three middle characters of `\fBabbbc\fR',
\fB(week|wee)(night|knights)\fR
matches all ten characters of `\fBweeknights\fR',
when
\fB(.*).*\fR
is matched against
\fBabc\fR
the parenthesized subexpression
matches all three characters, and
when
\fB(a*)*\fR
is matched against
\fBbc\fR
both the whole RE and the parenthesized
subexpression match an empty string.
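.PP
The examples in the preceding paragraph can be reproduced with a short
sketch, assuming the Tcl \fBregexp\fR command; the variable names are
invented:

```tcl
# Illustrative only: variable names are invented; the REs and inputs
# are the ones quoted in the text above.
regexp {bb*} "abbbc" run
regexp {(week|wee)(night|knights)} "weeknights" whole first second
regexp {(.*).*} "abc" all sub
set empty_ok [regexp {(a*)*} "bc" whole2 sub2]
puts [list $run $whole $first $second $sub $empty_ok [string length $whole2]]
```

This prints \fBbbb weeknights wee knights abc 1 0\fR.
Note that the second RE reports the overall longest match, so the first
group settles for \fBwee\fR in order to let the second match
\fBknights\fR.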
.PP
If case-independent matching is specified,
the effect is much as if all case distinctions had vanished from the
alphabet.
When an alphabetic character that exists in multiple cases appears as an
ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
so that
\fBx\fR
becomes `\fB[xX]\fR'.
When it appears inside a bracket expression, all case counterparts
of it are added to the bracket expression, so that
\fB[x]\fR
becomes
\fB[xX]\fR
and
\fB[^x]\fR
becomes `\fB[^xX]\fR'.
.PP
If newline-sensitive matching is specified, \fB.\fR
and bracket expressions using
\fB^\fR
will never match the newline character
(so that matches will never cross newlines unless the RE
explicitly arranges it)
and
\fB^\fR
and
\fB$\fR
will match the empty string after and before a newline
respectively, in addition to matching at beginning and end of string
respectively.
ARE
\fB\eA\fR
and
\fB\eZ\fR
continue to match beginning or end of string \fIonly\fR.
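.PP
Newline-sensitive matching can be sketched as follows, assuming the Tcl
\fBregexp\fR command and the embedded \fBn\fR option; the three-line
input and variable names are invented:

```tcl
# Illustrative only: the sample text and variable names are invented.
set text "one\ntwo\nthree"
# Default: ^ and $ match only at the ends of the whole string.
set whole_only [regexp -all -inline {^\w+$} $text]
# Newline-sensitive: ^ and $ also match around each newline.
set per_line   [regexp -all -inline {(?n)^\w+$} $text]
# \A still matches only at the very start of the string.
set at_start   [regexp {(?n)\Atwo} $text]
puts [list $whole_only $per_line $at_start]
```

This prints \fB{} {one two three} 0\fR.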
.PP
If partial newline-sensitive matching is specified,
this affects \fB.\fR
and bracket expressions
as with newline-sensitive matching, but not
\fB^\fR
and `\fB$\fR'.
.PP
If inverse partial newline-sensitive matching is specified,
this affects
\fB^\fR
and
\fB$\fR
as with
newline-sensitive matching,
but not \fB.\fR
and bracket expressions.
This isn't very useful but is provided for symmetry.
.SH "LIMITS AND COMPATIBILITY"
No particular limit is imposed on the length of REs.
Programs intended to be highly portable should not employ REs longer
than 256 bytes,
as a POSIX-compliant implementation can refuse to accept such REs.
.PP
The only feature of AREs that is actually incompatible with
POSIX EREs is that
\fB\e\fR
does not lose its special
significance inside bracket expressions.
All other ARE features use syntax which is illegal or has
undefined or unspecified effects in POSIX EREs;
the
\fB***\fR
syntax of directors likewise is outside the POSIX
syntax for both BREs and EREs.
.PP
Many of the ARE extensions are borrowed from Perl, but some have
been changed to clean them up, and a few Perl extensions are not present.
Incompatibilities of note include `\fB\eb\fR', `\fB\eB\fR',
the lack of special treatment for a trailing newline,
the addition of complemented bracket expressions to the things
affected by newline-sensitive matching,
the restrictions on parentheses and back references in lookahead constraints,
and the longest/shortest-match (rather than first-match) matching semantics.
.PP
The matching rules for REs containing both normal and non-greedy quantifiers
have changed since early beta-test versions of this package.
(The new rules are much simpler and cleaner,
but don't work as hard at guessing the user's real intentions.)
.PP
Henry Spencer's original 1986 \fIregexp\fR package,
still in widespread use (e.g., in pre-8.1 releases of Tcl),
implemented an early version of today's EREs.
There are four incompatibilities between \fIregexp\fR's near-EREs
(`RREs' for short) and AREs.
In roughly increasing order of significance:
.PP
.RS
In AREs,
\fB\e\fR
followed by an alphanumeric character is either an
escape or an error,
while in RREs, it was just another way of writing the
alphanumeric.
This should not be a problem because there was no reason to write
such a sequence in RREs.
.PP
\fB{\fR
followed by a digit in an ARE is the beginning of a bound,
while in RREs,
\fB{\fR
was always an ordinary character.
Such sequences should be rare,
and will often result in an error because following characters
will not look like a valid bound.
.PP
In AREs,
\fB\e\fR
remains a special character within `\fB[\|]\fR',
so a literal
\fB\e\fR
within
\fB[\|]\fR
must be written `\fB\e\e\fR'.
\fB\e\e\fR
also gives a literal
\fB\e\fR
within
\fB[\|]\fR
in RREs,
but only truly paranoid programmers routinely doubled the backslash.
.PP
AREs report the longest/shortest match for the RE,
rather than the first found in a specified search order.
This may affect some RREs which were written in the expectation that
the first match would be reported.
(The careful crafting of RREs to optimize the search order for fast
matching is obsolete, because AREs examine all possible matches
in parallel and their performance is largely insensitive to their
complexity.
However, cases where the search order was exploited to deliberately
find a match which was \fInot\fR the longest/shortest will need rewriting.)
.RE

.SH "BASIC REGULAR EXPRESSIONS"
BREs differ from EREs in several respects.  `\fB|\fR', `\fB+\fR',
and
\fB?\fR
are ordinary characters and there is no equivalent
for their functionality.
The delimiters for bounds are
\fB\e{\fR
and `\fB\e}\fR',
with
\fB{\fR
and
\fB}\fR
by themselves ordinary characters.
The parentheses for nested subexpressions are
\fB\e(\fR
and `\fB\e)\fR',
with
\fB(\fR
and
\fB)\fR
by themselves ordinary characters.
\fB^\fR
is an ordinary character except at the beginning of the
RE or the beginning of a parenthesized subexpression,
\fB$\fR
is an ordinary character except at the end of the
RE or the end of a parenthesized subexpression,
and
\fB*\fR
is an ordinary character if it appears at the beginning of the
RE or the beginning of a parenthesized subexpression
(after a possible leading `\fB^\fR').
Finally,
single-digit back references are available,
and
\fB\e<\fR
and
\fB\e>\fR
are synonyms for
\fB[[:<:]]\fR
and
\fB[[:>:]]\fR
respectively;
no other escapes are available.
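.PP
A few of the BRE differences can be sketched as follows, assuming the
Tcl \fBregexp\fR command and the embedded \fBb\fR option described under
METASYNTAX to select the BRE flavor; sample strings and variable names
are invented:

```tcl
# Illustrative only: sample strings and variable names are invented.
regexp {(?b)a\{2\}} "caaat" bound          ;# bounds are written \{ \}
regexp {(?b)a+} "aa+" plus                 ;# + is an ordinary character
set group [regexp {(?b)\(b\)\1} "abbc"]    ;# \( \) group, \1 back reference
puts [list $bound $plus $group]
```

This prints \fBaa a+ 1\fR.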

.SH "SEE ALSO"
RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n)

.SH KEYWORDS
match, regular expression, string