.TH REGEX 7 "25 Oct 1995"
.BY "Henry Spencer"
.SH NAME
regex \- POSIX 1003.2 regular expressions
.SH DESCRIPTION
Regular expressions (``RE''s),
as defined in POSIX 1003.2, come in two forms:
modern REs (roughly those of
.IR egrep ;
1003.2 calls these ``extended'' REs)
and obsolete REs (roughly those of
.IR ed ;
1003.2 ``basic'' REs).
Obsolete REs mostly exist for backward compatibility in some old programs;
they will be discussed at the end.
1003.2 leaves some aspects of RE syntax and semantics open;
`\(dg' marks decisions on these aspects that
may not be fully portable to other 1003.2 implementations.
.PP
A (modern) RE is one\(dg or more non-empty\(dg \fIbranches\fR,
separated by `|'.
It matches anything that matches one of the branches.
.PP
A branch is one\(dg or more \fIpieces\fR, concatenated.
It matches a match for the first, followed by a match for the second, etc.
.PP
A piece is an \fIatom\fR possibly followed
by a single\(dg `*', `+', `?', or \fIbound\fR.
An atom followed by `*' matches a sequence of 0 or more matches of the atom.
An atom followed by `+' matches a sequence of 1 or more matches of the atom.
An atom followed by `?' matches a sequence of 0 or 1 matches of the atom.
.PP
A \fIbound\fR is `{' followed by an unsigned decimal integer,
possibly followed by `,'
possibly followed by another unsigned decimal integer,
always followed by `}'.
The integers must lie between 0 and RE_DUP_MAX (255\(dg) inclusive,
and if there are two of them, the first may not exceed the second.
An atom followed by a bound containing one integer \fIi\fR
and no comma matches
a sequence of exactly \fIi\fR matches of the atom.
An atom followed by a bound
containing one integer \fIi\fR and a comma matches
a sequence of \fIi\fR or more matches of the atom.
An atom followed by a bound
containing two integers \fIi\fR and \fIj\fR matches
a sequence of \fIi\fR through \fIj\fR (inclusive) matches of the atom.
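.PP
As an illustration of the three bound forms, the following is a minimal
sketch that assumes a POSIX regex(3) implementation providing <regex.h>.
The helper re_matches and the three wrappers are hypothetical names chosen
for this example; they are not part of the library.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the extended RE matches somewhere in
   text, 0 if it does not, and -1 if regcomp() rejects the pattern. */
static int re_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Anchored with ^ and $ so the bound must account for the whole string. */
static int exactly_two(const char *s)  { return re_matches("^a{2}$", s); }
static int two_or_more(const char *s)  { return re_matches("^a{2,}$", s); }
static int two_to_three(const char *s) { return re_matches("^a{2,3}$", s); }
```

Under these assumptions, exactly_two accepts `aa' only, two_or_more accepts
any run of at least two `a' characters, and two_to_three accepts `aa' and
`aaa' but not `aaaa'.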
.PP
An atom is a regular expression enclosed in `()' (matching a match for the
regular expression),
an empty set of `()' (matching the null string)\(dg,
a \fIbracket expression\fR (see below), `.'
(matching any single character), `^' (matching the null string at the
beginning of a line), `$' (matching the null string at the
end of a line), a `\e' followed by one of the characters
`^.[$()|*+?{\e'
(matching that character taken as an ordinary character),
a `\e' followed by any other character\(dg
(matching that character taken as an ordinary character,
as if the `\e' had not been present\(dg),
or a single character with no other significance (matching that character).
A `{' followed by a character other than a digit is an ordinary
character, not the beginning of a bound\(dg.
It is illegal to end an RE with `\e'.
.PP
A \fIbracket expression\fR is a list of characters enclosed in `[]'.
It normally matches any single character from the list (but see below).
If the list begins with `^',
it matches any single character
(but see below) \fInot\fR from the rest of the list.
If two characters in the list are separated by `\-', this is shorthand
for the full \fIrange\fR of characters between those two (inclusive) in the
collating sequence,
e.g. `[0\-9]' in ASCII matches any decimal digit.
It is illegal\(dg for two ranges to share an
endpoint, e.g. `a\-c\-e'.
Ranges are very collating-sequence-dependent,
and portable programs should avoid relying on them.
.PP
To include a literal `]' in the list, make it the first character
(following a possible `^').
To include a literal `\-', make it the first or last character,
or the second endpoint of a range.
To use a literal `\-' as the first endpoint of a range,
enclose it in `[.' and `.]' to make it a collating element (see below).
With the exception of these and some combinations using `[' (see next
paragraphs), all other special characters, including `\e', lose their
special significance within a bracket expression.
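.PP
The placement rules for literal characters inside a bracket expression can
be checked with a short sketch.
It assumes a POSIX regex(3) implementation with <regex.h>; the helper name
re_matches is hypothetical and exists only for this example.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the extended RE matches somewhere in
   text, 0 if it does not, and -1 if regcomp() rejects the pattern. */
static int re_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Patterns illustrating the rules above (C string escaping doubles the
   backslash in the last one):
     "^[]a]$"   a literal ']' placed first in the list
     "^[^]a]$"  a literal ']' placed first after the negating '^'
     "^[a-]$"   a literal '-' placed last in the list
     "^[a\\]$"  a backslash, which is not special inside brackets */
```

For instance, re_matches("^[]a]$", "]") is expected to report a match,
while re_matches("^[^]a]$", "]") is not.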
.PP
Within a bracket expression, a collating element (a character,
a multi-character sequence that collates as if it were a single character,
or a collating-sequence name for either)
enclosed in `[.' and `.]' stands for the
sequence of characters of that collating element.
The sequence is a single element of the bracket expression's list.
A bracket expression containing a multi-character collating element
can thus match more than one character,
e.g. if the collating sequence includes a `ch' collating element,
then the RE `[[.ch.]]*c' matches the first five characters
of `chchcc'.
.PP
Within a bracket expression, a collating element enclosed in `[=' and
`=]' is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself.
(If there are no other equivalent collating elements,
the treatment is as if the enclosing delimiters were `[.' and `.]'.)
For example, if o and \o'o^' are the members of an equivalence class,
then `[[=o=]]', `[[=\o'o^'=]]', and `[o\o'o^']' are all synonymous.
An equivalence class may not\(dg be an endpoint
of a range.
.PP
Within a bracket expression, the name of a \fIcharacter class\fR enclosed
in `[:' and `:]' stands for the list of all characters belonging to that
class.
Standard character class names are:
.PP
.RS
.nf
.ta 3c 6c 9c
alnum	digit	punct
alpha	graph	space
blank	lower	upper
cntrl	print	xdigit
.fi
.RE
.PP
These stand for the character classes defined in
.IR ctype (3).
A locale may provide others.
A character class may not be used as an endpoint of a range.
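.PP
A short sketch of character classes in use follows.
It assumes a POSIX regex(3) implementation with <regex.h> and the default
C locale; the function names re_matches, is_number and is_identifier are
hypothetical and chosen only for this example.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the extended RE matches somewhere in
   text, 0 if it does not, and -1 if regcomp() rejects the pattern. */
static int re_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* One or more characters of class "digit" and nothing else. */
static int is_number(const char *s)
{
    return re_matches("^[[:digit:]]+$", s);
}

/* A letter or underscore, then any number of alnum characters or
   underscores: a class name can be mixed with ordinary list members. */
static int is_identifier(const char *s)
{
    return re_matches("^[[:alpha:]_][[:alnum:]_]*$", s);
}
```

Note that the class name needs its own `[:' and `:]' inside the enclosing
brackets, which is why the patterns contain doubled brackets.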
.PP
There are two special cases\(dg of bracket expressions:
the bracket expressions `[[:<:]]' and `[[:>:]]' match the null string at
the beginning and end of a word respectively.
A word is defined as a sequence of
word characters
which is neither preceded nor followed by
word characters.
A word character is an
.I alnum
character (as defined by
.IR ctype (3))
or an underscore.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
.PP
In the event that an RE could match more than one substring of a given
string,
the RE matches the one starting earliest in the string.
If the RE could match more than one substring starting at that point,
it matches the longest.
Subexpressions also match the longest possible substrings, subject to
the constraint that the whole match be as long as possible,
with subexpressions starting earlier in the RE taking priority over
ones starting later.
Note that higher-level subexpressions thus take priority over
their lower-level component subexpressions.
.PP
Match lengths are measured in characters, not collating elements.
A null string is considered longer than no match at all.
For example,
`bb*' matches the three middle characters of `abbbc',
`(wee|week)(knights|nights)' matches all ten characters of `weeknights',
when `(.*).*' is matched against `abc' the parenthesized subexpression
matches all three characters, and
when `(a*)*' is matched against `bc' both the whole RE and the parenthesized
subexpression match the null string.
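.PP
The first two examples above can be reproduced with a small sketch that
reports where the overall match starts and ends.
It assumes a POSIX regex(3) implementation with <regex.h>; match_span is a
hypothetical helper name, and only the overall match (element 0 of the
regmatch_t array) is inspected here.

```c
#include <regex.h>

/* Hypothetical helper: finds the overall match of an extended RE in text.
   On success stores the byte offsets of the match in *start and *end
   (end is one past the last matched character) and returns 1.
   Returns 0 if the pattern does not compile or does not match. */
static int match_span(const char *pattern, const char *text,
                      int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;
    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;
    *start = (int)m[0].rm_so;
    *end = (int)m[0].rm_eo;
    return 1;
}
```

With `bb*' against `abbbc' the span should be offsets 1 through 4, and with
`(wee|week)(knights|nights)' against `weeknights' it should be 0 through 10:
the second alternative of the first group is chosen because it allows the
longer overall match.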
.PP
If case-independent matching is specified,
the effect is much as if all case distinctions had vanished from the
alphabet.
When an alphabetic that exists in multiple cases appears as an
ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
e.g. `x' becomes `[xX]'.
When it appears inside a bracket expression, all case counterparts
of it are added to the bracket expression, so that (e.g.) `[x]'
becomes `[xX]' and `[^x]' becomes `[^xX]'.
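.PP
The following sketch shows one way to observe these rules.
It assumes a POSIX regex(3) implementation with <regex.h>, where
case-independent matching is requested with the REG_ICASE flag to regcomp();
the helper name icase_matches is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compiles an extended RE with REG_ICASE and returns
   1 if it matches somewhere in text, 0 if it does not, and -1 if
   regcomp() rejects the pattern. */
static int icase_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Under this flag `^x$' and `^[x]$' should both accept `X', while `^[^x]$'
should reject both `x' and `X' but still accept `y'.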
.PP
No particular limit is imposed on the length of REs\(dg.
Programs intended to be portable should not employ REs longer
than 256 bytes,
as an implementation can refuse to accept such REs and remain
POSIX-compliant.
.PP
Obsolete (``basic'') regular expressions differ in several respects.
`|', `+', and `?' are ordinary characters and there is no equivalent
for their functionality.
The delimiters for bounds are `\e{' and `\e}',
with `{' and `}' by themselves ordinary characters.
The parentheses for nested subexpressions are `\e(' and `\e)',
with `(' and `)' by themselves ordinary characters.
`^' is an ordinary character except at the beginning of the
RE or\(dg the beginning of a parenthesized subexpression,
`$' is an ordinary character except at the end of the
RE or\(dg the end of a parenthesized subexpression,
and `*' is an ordinary character if it appears at the beginning of the
RE or the beginning of a parenthesized subexpression
(after a possible leading `^').
Finally, there is one new type of atom, a \fIback reference\fR:
`\e' followed by a non-zero decimal digit \fId\fR
matches the same sequence of characters
matched by the \fId\fRth parenthesized subexpression
(numbering subexpressions by the positions of their opening parentheses,
left to right),
so that (e.g.) `\e([bc]\e)\e1' matches `bb' or `cc' but not `bc'.
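.PP
A minimal sketch of the obsolete syntax follows, again assuming a POSIX
regex(3) implementation with <regex.h>, where omitting REG_EXTENDED from the
regcomp() flags selects basic REs.
The helper name bre_matches is hypothetical.
Backslashes are doubled in the patterns only because of C string escaping.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compiles pattern as an obsolete ("basic") RE by
   passing no REG_EXTENDED flag.  Returns 1 if it matches somewhere in
   text, 0 if it does not, and -1 if regcomp() rejects the pattern. */
static int bre_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, 0) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Patterns illustrating the differences described above:
     "^a+$"               '+' is an ordinary character
     "^(a)$"              '(' and ')' are ordinary characters
     "^a\\{2\\}$"         bound written with backslashed braces
     "^\\([bc]\\)\\1$"    back reference to the first subexpression */
```

For example, bre_matches("^a+$", "a+") should report a match, whereas
bre_matches("^a+$", "aa") should not, since `+' has no repetition meaning
here.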
.SH SEE ALSO
regex(3)
.PP
POSIX 1003.2, section 2.8 (Regular Expression Notation).
.SH HISTORY
Written by Henry Spencer, based on the 1003.2 spec.
.SH BUGS
Having two kinds of REs is a botch.
.PP
The current 1003.2 spec says that `)' is an ordinary character in
the absence of an unmatched `(';
this was an unintentional result of a wording error,
and change is likely.
Avoid relying on it.
.PP
Back references are a dreadful botch,
posing major problems for efficient implementations.
They are also somewhat vaguely defined
(does
`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?).
Avoid using them.
.PP
1003.2's specification of case-independent matching is vague.
The ``one case implies all cases'' definition given above
is current consensus among implementors as to the right interpretation.
.PP
The syntax for word boundaries is incredibly ugly.