| 1 | .TH REGEX 7 "25 Oct 1995" |
| 2 | .BY "Henry Spencer" |
| 3 | .SH NAME |
| 4 | regex \- POSIX 1003.2 regular expressions |
| 5 | .SH DESCRIPTION |
| 6 | Regular expressions (``RE''s), |
| 7 | as defined in POSIX 1003.2, come in two forms: |
| 8 | modern REs (roughly those of |
| 9 | .IR egrep ; |
| 10 | 1003.2 calls these ``extended'' REs) |
| 11 | and obsolete REs (roughly those of |
| 12 | .IR ed ; |
| 13 | 1003.2 ``basic'' REs). |
| 14 | Obsolete REs mostly exist for backward compatibility in some old programs; |
| 15 | they will be discussed at the end. |
| 16 | 1003.2 leaves some aspects of RE syntax and semantics open; |
| 17 | `\(dg' marks decisions on these aspects that |
| 18 | may not be fully portable to other 1003.2 implementations. |
| 19 | .PP |
| 20 | A (modern) RE is one\(dg or more non-empty\(dg \fIbranches\fR, |
| 21 | separated by `|'. |
| 22 | It matches anything that matches one of the branches. |
| 23 | .PP |
| 24 | A branch is one\(dg or more \fIpieces\fR, concatenated. |
| 25 | It matches a match for the first, followed by a match for the second, etc. |
| 26 | .PP |
| 27 | A piece is an \fIatom\fR possibly followed |
| 28 | by a single\(dg `*', `+', `?', or \fIbound\fR. |
| 29 | An atom followed by `*' matches a sequence of 0 or more matches of the atom. |
| 30 | An atom followed by `+' matches a sequence of 1 or more matches of the atom. |
| 31 | An atom followed by `?' matches a sequence of 0 or 1 matches of the atom. |
| 32 | .PP |
| 33 | A \fIbound\fR is `{' followed by an unsigned decimal integer, |
| 34 | possibly followed by `,' |
| 35 | possibly followed by another unsigned decimal integer, |
| 36 | always followed by `}'. |
| 37 | The integers must lie between 0 and RE_DUP_MAX (255\(dg) inclusive, |
| 38 | and if there are two of them, the first may not exceed the second. |
| 39 | An atom followed by a bound containing one integer \fIi\fR |
| 40 | and no comma matches |
| 41 | a sequence of exactly \fIi\fR matches of the atom. |
| 42 | An atom followed by a bound |
| 43 | containing one integer \fIi\fR and a comma matches |
| 44 | a sequence of \fIi\fR or more matches of the atom. |
| 45 | An atom followed by a bound |
| 46 | containing two integers \fIi\fR and \fIj\fR matches |
| 47 | a sequence of \fIi\fR through \fIj\fR (inclusive) matches of the atom. |
| 48 | .PP |
| 49 | An atom is a regular expression enclosed in `()' (matching a match for the |
| 50 | regular expression), |
| 51 | an empty set of `()' (matching the null string)\(dg, |
| 52 | a \fIbracket expression\fR (see below), `.' |
| 53 | (matching any single character), `^' (matching the null string at the |
| 54 | beginning of a line), `$' (matching the null string at the |
| 55 | end of a line), a `\e' followed by one of the characters |
| 56 | `^.[$()|*+?{\e' |
| 57 | (matching that character taken as an ordinary character), |
| 58 | a `\e' followed by any other character\(dg |
| 59 | (matching that character taken as an ordinary character, |
| 60 | as if the `\e' had not been present\(dg), |
| 61 | or a single character with no other significance (matching that character). |
| 62 | A `{' followed by a character other than a digit is an ordinary |
| 63 | character, not the beginning of a bound\(dg. |
| 64 | It is illegal to end an RE with `\e'. |
| 65 | .PP |
| 66 | A \fIbracket expression\fR is a list of characters enclosed in `[]'. |
| 67 | It normally matches any single character from the list (but see below). |
| 68 | If the list begins with `^', |
| 69 | it matches any single character |
| 70 | (but see below) \fInot\fR from the rest of the list. |
| 71 | If two characters in the list are separated by `\-', this is shorthand |
| 72 | for the full \fIrange\fR of characters between those two (inclusive) in the |
| 73 | collating sequence, |
| 74 | e.g. `[0\-9]' in ASCII matches any decimal digit. |
| 75 | It is illegal\(dg for two ranges to share an |
| 76 | endpoint, e.g. `a\-c\-e'. |
| 77 | Ranges are very collating-sequence-dependent, |
| 78 | and portable programs should avoid relying on them. |
| 79 | .PP |
| 80 | To include a literal `]' in the list, make it the first character |
| 81 | (following a possible `^'). |
| 82 | To include a literal `\-', make it the first or last character, |
| 83 | or the second endpoint of a range. |
| 84 | To use a literal `\-' as the first endpoint of a range, |
| 85 | enclose it in `[.' and `.]' to make it a collating element (see below). |
| 86 | With the exception of these and some combinations using `[' (see next |
| 87 | paragraphs), all other special characters, including `\e', lose their |
| 88 | special significance within a bracket expression. |
| 89 | .PP |
| 90 | Within a bracket expression, a collating element (a character, |
| 91 | a multi-character sequence that collates as if it were a single character, |
| 92 | or a collating-sequence name for either) |
| 93 | enclosed in `[.' and `.]' stands for the |
| 94 | sequence of characters of that collating element. |
| 95 | The sequence is a single element of the bracket expression's list. |
| 96 | A bracket expression containing a multi-character collating element |
| 97 | can thus match more than one character, |
| 98 | e.g. if the collating sequence includes a `ch' collating element, |
| 99 | then the RE `[[.ch.]]*c' matches the first five characters |
| 100 | of `chchcc'. |
| 101 | .PP |
| 102 | Within a bracket expression, a collating element enclosed in `[=' and |
| 103 | `=]' is an equivalence class, standing for the sequences of characters |
| 104 | of all collating elements equivalent to that one, including itself. |
| 105 | (If there are no other equivalent collating elements, |
| 106 | the treatment is as if the enclosing delimiters were `[.' and `.]'.) |
| 107 | For example, if o and \o'o^' are the members of an equivalence class, |
| 108 | then `[[=o=]]', `[[=\o'o^'=]]', and `[o\o'o^']' are all synonymous. |
| 109 | An equivalence class may not\(dg be an endpoint |
| 110 | of a range. |
| 111 | .PP |
| 112 | Within a bracket expression, the name of a \fIcharacter class\fR enclosed |
| 113 | in `[:' and `:]' stands for the list of all characters belonging to that |
| 114 | class. |
| 115 | Standard character class names are: |
| 116 | .PP |
| 117 | .RS |
| 118 | .nf |
| 119 | .ta 3c 6c 9c |
| 120 | alnum digit punct |
| 121 | alpha graph space |
| 122 | blank lower upper |
| 123 | cntrl print xdigit |
| 124 | .fi |
| 125 | .RE |
| 126 | .PP |
| 127 | These stand for the character classes defined in |
| 128 | .IR ctype (3). |
| 129 | A locale may provide others. |
| 130 | A character class may not be used as an endpoint of a range. |
| 131 | .PP |
| 132 | There are two special cases\(dg of bracket expressions: |
| 133 | the bracket expressions `[[:<:]]' and `[[:>:]]' match the null string at |
| 134 | the beginning and end of a word respectively. |
| 135 | A word is defined as a sequence of |
| 136 | word characters |
| 137 | which is neither preceded nor followed by |
| 138 | word characters. |
| 139 | A word character is an |
| 140 | .I alnum |
| 141 | character (as defined by |
| 142 | .IR ctype (3)) |
| 143 | or an underscore. |
| 144 | This is an extension, |
| 145 | compatible with but not specified by POSIX 1003.2, |
| 146 | and should be used with |
| 147 | caution in software intended to be portable to other systems. |
| 148 | .PP |
| 149 | In the event that an RE could match more than one substring of a given |
| 150 | string, |
| 151 | the RE matches the one starting earliest in the string. |
| 152 | If the RE could match more than one substring starting at that point, |
| 153 | it matches the longest. |
| 154 | Subexpressions also match the longest possible substrings, subject to |
| 155 | the constraint that the whole match be as long as possible, |
| 156 | with subexpressions starting earlier in the RE taking priority over |
| 157 | ones starting later. |
| 158 | Note that higher-level subexpressions thus take priority over |
| 159 | their lower-level component subexpressions. |
| 160 | .PP |
| 161 | Match lengths are measured in characters, not collating elements. |
| 162 | A null string is considered longer than no match at all. |
| 163 | For example, |
| 164 | `bb*' matches the three middle characters of `abbbc', |
| 165 | `(wee|week)(knights|nights)' matches all ten characters of `weeknights', |
| 166 | when `(.*).*' is matched against `abc' the parenthesized subexpression |
| 167 | matches all three characters, and |
| 168 | when `(a*)*' is matched against `bc' both the whole RE and the parenthesized |
| 169 | subexpression match the null string. |
| 170 | .PP |
| 171 | If case-independent matching is specified, |
| 172 | the effect is much as if all case distinctions had vanished from the |
| 173 | alphabet. |
| 174 | When an alphabetic that exists in multiple cases appears as an |
| 175 | ordinary character outside a bracket expression, it is effectively |
| 176 | transformed into a bracket expression containing both cases, |
| 177 | e.g. `x' becomes `[xX]'. |
| 178 | When it appears inside a bracket expression, all case counterparts |
| 179 | of it are added to the bracket expression, so that (e.g.) `[x]' |
| 180 | becomes `[xX]' and `[^x]' becomes `[^xX]'. |
| 181 | .PP |
| 182 | No particular limit is imposed on the length of REs\(dg. |
| 183 | Programs intended to be portable should not employ REs longer |
| 184 | than 256 bytes, |
| 185 | as an implementation can refuse to accept such REs and remain |
| 186 | POSIX-compliant. |
| 187 | .PP |
| 188 | Obsolete (``basic'') regular expressions differ in several respects. |
| 189 | `|', `+', and `?' are ordinary characters and there is no equivalent |
| 190 | for their functionality. |
| 191 | The delimiters for bounds are `\e{' and `\e}', |
| 192 | with `{' and `}' by themselves ordinary characters. |
| 193 | The parentheses for nested subexpressions are `\e(' and `\e)', |
| 194 | with `(' and `)' by themselves ordinary characters. |
| 195 | `^' is an ordinary character except at the beginning of the |
| 196 | RE or\(dg the beginning of a parenthesized subexpression, |
| 197 | `$' is an ordinary character except at the end of the |
| 198 | RE or\(dg the end of a parenthesized subexpression, |
| 199 | and `*' is an ordinary character if it appears at the beginning of the |
| 200 | RE or the beginning of a parenthesized subexpression |
| 201 | (after a possible leading `^'). |
| 202 | Finally, there is one new type of atom, a \fIback reference\fR: |
| 203 | `\e' followed by a non-zero decimal digit \fId\fR |
| 204 | matches the same sequence of characters |
| 205 | matched by the \fId\fRth parenthesized subexpression |
| 206 | (numbering subexpressions by the positions of their opening parentheses, |
| 207 | left to right), |
| 208 | so that (e.g.) `\e([bc]\e)\e1' matches `bb' or `cc' but not `bc'. |
| 209 | .SH SEE ALSO |
| 210 | regex(3) |
| 211 | .PP |
| 212 | POSIX 1003.2, section 2.8 (Regular Expression Notation). |
| 213 | .SH HISTORY |
| 214 | Written by Henry Spencer, based on the 1003.2 spec. |
| 215 | .SH BUGS |
| 216 | Having two kinds of REs is a botch. |
| 217 | .PP |
| 218 | The current 1003.2 spec says that `)' is an ordinary character in |
| 219 | the absence of an unmatched `('; |
| 220 | this was an unintentional result of a wording error, |
| 221 | and change is likely. |
| 222 | Avoid relying on it. |
| 223 | .PP |
| 224 | Back references are a dreadful botch, |
| 225 | posing major problems for efficient implementations. |
| 226 | They are also somewhat vaguely defined |
| 227 | (does |
| 228 | `a\e(\e(b\e)*\e2\e)*d' match `abbbd'?). |
| 229 | Avoid using them. |
| 230 | .PP |
| 231 | 1003.2's specification of case-independent matching is vague. |
| 232 | The ``one case implies all cases'' definition given above |
| 233 | is current consensus among implementors as to the right interpretation. |
| 234 | .PP |
| 235 | The syntax for word boundaries is incredibly ugly. |