X-Git-Url: https://git.saurik.com/wxWidgets.git/blobdiff_plain/a2968d85eb320b948e39972c75d55df98474ff89..4e15d1caa03346c126015019c1fdf093033ef40b:/docs/doxygen/overviews/resyntax.h?ds=sidebyside
diff --git a/docs/doxygen/overviews/resyntax.h b/docs/doxygen/overviews/resyntax.h
index 65e2bf29c4..9f4facbde7 100644
--- a/docs/doxygen/overviews/resyntax.h
+++ b/docs/doxygen/overviews/resyntax.h
@@ -3,33 +3,21 @@
// Purpose: topic overview
// Author: wxWidgets team
// RCS-ID: $Id$
-// Licence: wxWindows license
+// Licence: wxWindows licence
/////////////////////////////////////////////////////////////////////////////
-/*!
+/**
-@page overview_resyntax Syntax of the Built-in Regular Expression Library
+@page overview_resyntax Regular Expressions
+
+@tableofcontents
A regular expression describes strings of characters. It's a pattern
that matches certain strings and doesn't match others.
-@li @ref overview_resyntax_differentflavors
-@li @ref overview_resyntax_syntax
-@li @ref overview_resyntax_bracket
-@li @ref overview_resyntax_escapes
-@li @ref overview_resyntax_metasyntax
-@li @ref overview_resyntax_matching
-@li @ref overview_resyntax_limits
-@li @ref overview_resyntax_bre
-@li @ref overview_resyntax_characters
-
-@seealso
-
-@li #wxRegEx
+@see wxRegEx
-
-
@section overview_resyntax_differentflavors Different Flavors of Regular Expressions
@@ -323,7 +311,7 @@ specific conditions are met, written as an escape:
@endTable
A word is defined as in the specification of [[:@<:]] and
-[[:>:]] above. Constraint escapes are illegal within bracket
+[[:@>:]] above. Constraint escapes are illegal within bracket
expressions.
A back reference (AREs only) matches the same string matched by the
@@ -343,226 +331,194 @@ back reference), and otherwise is taken as octal.
@section overview_resyntax_metasyntax Metasyntax
-In addition to the main syntax described above,
-there are some special forms and miscellaneous syntactic facilities available.
+In addition to the main syntax described above, there are some special forms
+and miscellaneous syntactic facilities available.
+
Normally the flavor of RE being used is specified by application-dependent
means. However, this can be overridden by a @e director. If an RE of any flavor
-begins with '@b ***:', the rest of the RE is an ARE. If an RE of any flavor begins
-with '@b ***=', the rest of the RE is taken to be a literal string, with all
-characters considered ordinary characters.
-An ARE may begin with @e embedded options: a sequence @b (?xyz)
-(where @e xyz is one or more alphabetic characters)
-specifies options affecting the rest of the RE. These supplement, and can
-override, any options specified by the application. The available option
-letters are:
-
-
-
-@b b
-
-rest of RE is a BRE
-
-@b c
-
-case-sensitive matching (usual default)
-
-@b e
-
-rest of RE is an ERE
-
-@b i
-
-case-insensitive matching (see #Matching, below)
-
-@b m
-
-historical synonym for @b n
+begins with <tt>***:</tt>, the rest of the RE is an ARE. If an RE of any
+flavor begins with <tt>***=</tt>, the rest of the RE is taken to be a literal
+string, with all characters considered ordinary characters.
-@b n
-
-newline-sensitive matching (see #Matching, below)
-
-@b p
-
-partial newline-sensitive matching (see #Matching, below)
-
-@b q
-
-rest of RE
-is a literal ("quoted'') string, all ordinary characters
-
-@b s
-
-non-newline-sensitive matching (usual default)
-
-@b t
-
-tight syntax (usual default; see below)
-
-@b w
-
-inverse
-partial newline-sensitive ("weird'') matching (see #Matching, below)
-
-@b x
-
-expanded syntax (see below)
+An ARE may begin with embedded options: a sequence (?xyz)
+(where @e xyz is one or more alphabetic characters) specifies options affecting
+the rest of the RE. These supplement, and can override, any options specified
+by the application. The available option letters are:
+@beginTable
+@row2col{ b , Rest of RE is a BRE. }
+@row2col{ c , Case-sensitive matching (usual default). }
+@row2col{ e , Rest of RE is an ERE. }
+@row2col{ i , Case-insensitive matching (see
+ @ref overview_resyntax_matching, below). }
+@row2col{ m , Historical synonym for @e n. }
+@row2col{ n , Newline-sensitive matching (see
+ @ref overview_resyntax_matching, below). }
+@row2col{ p , Partial newline-sensitive matching (see
+ @ref overview_resyntax_matching, below). }
+@row2col{ q , Rest of RE is a literal ("quoted") string, all ordinary
+ characters. }
+@row2col{ s , Non-newline-sensitive matching (usual default). }
+@row2col{ t , Tight syntax (usual default; see below). }
+@row2col{ w , Inverse partial newline-sensitive ("weird") matching
+ (see @ref overview_resyntax_matching, below). }
+@row2col{ x , Expanded syntax (see below). }
+@endTable
+Embedded options take effect at the <tt>)</tt> terminating the sequence. They
+are available only at the start of an ARE, and may not be used later within it.
-Embedded options take effect at the @b ) terminating the
-sequence. They are available only at the start of an ARE, and may not be
-used later within it.
-In addition to the usual (@e tight) RE syntax, in which
-all characters are significant, there is an @e expanded syntax, available
-in AREs with the embedded
-x option. In the expanded syntax, white-space characters are ignored and
-all characters between a @b # and the following newline (or the end of the
-RE) are ignored, permitting paragraphing and commenting a complex RE. There
-are three exceptions to that basic rule:
+In addition to the usual (@e tight) RE syntax, in which all characters are
+significant, there is an @e expanded syntax, available in AREs with the
+embedded x option. In the expanded syntax, white-space characters are ignored
+and all characters between a @# and the following newline (or the end
+of the RE) are ignored, permitting paragraphing and commenting a complex RE.
+There are three exceptions to that basic rule:
+@li A white-space character or @# preceded by @\ is retained.
+@li White space or @# within a bracket expression is retained.
+@li White space and comments are illegal within multi-character symbols like
+ the ARE (?: or the BRE \(.
-a white-space character or '@b #' preceded
-by '@b \' is retained
-white space or '@b #' within a bracket expression is retained
-white space and comments are illegal within multi-character symbols like
-the ARE '@b (?:' or the BRE '@b \('
+Expanded-syntax white-space characters are blank, tab, newline, and any
+character that belongs to the @e space character class.
+Finally, in an ARE, outside bracket expressions, the sequence <tt>(?@#ttt)</tt>
+(where @e ttt is any text not containing a <tt>)</tt>) is a comment, completely
+ignored. Again, this is not allowed between the characters of multi-character
+symbols like (?:. Such comments are more a historical artifact than a
+useful facility, and their use is deprecated; use the expanded syntax instead.
-Expanded-syntax white-space characters are blank,
-tab, newline, and any character that belongs to the @e space character class.
-Finally, in an ARE, outside bracket expressions, the sequence '@b (?#ttt)' (where
-@e ttt is any text not containing a '@b )') is a comment, completely ignored. Again,
-this is not allowed between the characters of multi-character symbols like
-'@b (?:'. Such comments are more a historical artifact than a useful facility,
-and their use is deprecated; use the expanded syntax instead.
-@e None of these
-metasyntax extensions is available if the application (or an initial @b ***=
-director) has specified that the user's input be treated as a literal string
-rather than as an RE.
+@e None of these metasyntax extensions is available if the application (or an
+initial ***= director) has specified that the user's input be treated
+as a literal string rather than as an RE.
@section overview_resyntax_matching Matching
-In the event that an RE could match more than
-one substring of a given string, the RE matches the one starting earliest
-in the string. If the RE could match more than one substring starting at
-that point, its choice is determined by its @e preference: either the longest
-substring, or the shortest.
-Most atoms, and all constraints, have no preference.
-A parenthesized RE has the same preference (possibly none) as the RE. A
-quantified atom with quantifier @b {m} or @b {m}? has the same preference (possibly
-none) as the atom itself. A quantified atom with other normal quantifiers
-(including @b {m,n} with @e m equal to @e n) prefers longest match. A quantified
-atom with other non-greedy quantifiers (including @b {m,n}? with @e m equal to
-@e n) prefers shortest match. A branch has the same preference as the first
-quantified atom in it which has a preference. An RE consisting of two or
-more branches connected by the @b | operator prefers longest match.
-Subject to the constraints imposed by the rules for matching the whole RE, subexpressions
-also match the longest or shortest possible substrings, based on their
-preferences, with subexpressions starting earlier in the RE taking priority
-over ones starting later. Note that outer subexpressions thus take priority
-over their component subexpressions.
-Note that the quantifiers @b {1,1} and
-@b {1,1}? can be used to force longest and shortest preference, respectively,
-on a subexpression or a whole RE.
-Match lengths are measured in characters,
-not collating elements. An empty string is considered longer than no match
-at all. For example, @b bb* matches the three middle characters
-of '@b abbbc', @b (week|wee)(night|knights)
-matches all ten characters of '@b weeknights', when @b (.*).* is matched against
-@b abc the parenthesized subexpression matches all three characters, and when
-@b (a*)* is matched against @b bc both the whole RE and the parenthesized subexpression
-match an empty string.
-If case-independent matching is specified, the effect
-is much as if all case distinctions had vanished from the alphabet. When
-an alphabetic that exists in multiple cases appears as an ordinary character
-outside a bracket expression, it is effectively transformed into a bracket
-expression containing both cases, so that @b x becomes '@b [xX]'. When it appears
-inside a bracket expression, all case counterparts of it are added to the
-bracket expression, so that @b [x] becomes @b [xX] and @b [^x] becomes '@b [^xX]'.
-If newline-sensitive
-matching is specified, @b . and bracket expressions using @b ^ will never match
-the newline character (so that matches will never cross newlines unless
-the RE explicitly arranges it) and @b ^ and @b $ will match the empty string after
-and before a newline respectively, in addition to matching at beginning
-and end of string respectively. ARE @b \A and @b \Z continue to match beginning
-or end of string @e only.
-If partial newline-sensitive matching is specified,
-this affects @b . and bracket expressions as with newline-sensitive matching,
-but not @b ^ and '@b $'.
-If inverse partial newline-sensitive matching is specified,
-this affects @b ^ and @b $ as with newline-sensitive matching, but not @b . and bracket
+In the event that an RE could match more than one substring of a given string,
+the RE matches the one starting earliest in the string. If the RE could match
+more than one substring starting at that point, the choice is determined by
+its @e preference: either the longest substring, or the shortest.
+
+Most atoms, and all constraints, have no preference. A parenthesized RE has the
+same preference (possibly none) as the RE. A quantified atom with quantifier
+{m} or {m}? has the same preference (possibly none) as the
+atom itself. A quantified atom with other normal quantifiers (including
+{m,n} with @e m equal to @e n) prefers longest match. A quantified
+atom with other non-greedy quantifiers (including {m,n}? with @e m
+equal to @e n) prefers shortest match. A branch has the same preference as the
+first quantified atom in it which has a preference. An RE consisting of two or
+more branches connected by the @c | operator prefers longest match.
+
+Subject to the constraints imposed by the rules for matching the whole RE,
+subexpressions also match the longest or shortest possible substrings, based on
+their preferences, with subexpressions starting earlier in the RE taking
+priority over ones starting later. Note that outer subexpressions thus take
+priority over their component subexpressions.
+
+Note that the quantifiers {1,1} and {1,1}? can be used to
+force longest and shortest preference, respectively, on a subexpression or a
+whole RE.
+
+Match lengths are measured in characters, not collating elements. An empty
+string is considered longer than no match at all. For example, bb*
+matches the three middle characters of "abbbc",
+(week|wee)(night|knights) matches all ten characters of "weeknights",
+when (.*).* is matched against "abc" the parenthesized subexpression
+matches all three characters, and when (a*)* is matched against "bc"
+both the whole RE and the parenthesized subexpression match an empty string.
+
+If case-independent matching is specified, the effect is much as if all case
+distinctions had vanished from the alphabet. When an alphabetic that exists in
+multiple cases appears as an ordinary character outside a bracket expression,
+it is effectively transformed into a bracket expression containing both cases,
+so that @c x becomes @c [xX]. When it appears inside a bracket expression, all
+case counterparts of it are added to the bracket expression, so that @c [x]
+becomes @c [xX] and @c [^x] becomes @c [^xX].
+
+If newline-sensitive matching is specified, "." and bracket expressions using
+"^" will never match the newline character (so that matches will never cross
+newlines unless the RE explicitly arranges it) and "^" and "$" will match the
+empty string after and before a newline respectively, in addition to matching
+at beginning and end of string respectively. ARE @\A and @\Z
+continue to match beginning or end of string @e only.
+
+If partial newline-sensitive matching is specified, this affects "." and
+bracket expressions as with newline-sensitive matching, but not "^" and "$".
+
+If inverse partial newline-sensitive matching is specified, this affects "^"
+and "$" as with newline-sensitive matching, but not "." and bracket
expressions. This isn't very useful but is provided for symmetry.
@section overview_resyntax_limits Limits and Compatibility
-No particular limit is imposed on the length of REs. Programs
-intended to be highly portable should not employ REs longer than 256 bytes,
-as a POSIX-compliant implementation can refuse to accept such REs.
-The only
-feature of AREs that is actually incompatible with POSIX EREs is that @b \
-does not lose its special significance inside bracket expressions. All other
-ARE features use syntax which is illegal or has undefined or unspecified
-effects in POSIX EREs; the @b *** syntax of directors likewise is outside
-the POSIX syntax for both BREs and EREs.
-Many of the ARE extensions are
-borrowed from Perl, but some have been changed to clean them up, and a
-few Perl extensions are not present. Incompatibilities of note include '@b \b',
-'@b \B', the lack of special treatment for a trailing newline, the addition of
-complemented bracket expressions to the things affected by newline-sensitive
-matching, the restrictions on parentheses and back references in lookahead
-constraints, and the longest/shortest-match (rather than first-match) matching
-semantics.
-The matching rules for REs containing both normal and non-greedy
-quantifiers have changed since early beta-test versions of this package.
-(The new rules are much simpler and cleaner, but don't work as hard at guessing
-the user's real intentions.)
+No particular limit is imposed on the length of REs. Programs intended to be
+highly portable should not employ REs longer than 256 bytes, as a
+POSIX-compliant implementation can refuse to accept such REs.
+
+The only feature of AREs that is actually incompatible with POSIX EREs is that
+@\ does not lose its special significance inside bracket expressions.
+All other ARE features use syntax which is illegal or has undefined or
+unspecified effects in POSIX EREs; the *** syntax of directors
+likewise is outside the POSIX syntax for both BREs and EREs.
+
+Many of the ARE extensions are borrowed from Perl, but some have been changed
+to clean them up, and a few Perl extensions are not present. Incompatibilities
+of note include @\b, @\B, the lack of special treatment for a
+trailing newline, the addition of complemented bracket expressions to the
+things affected by newline-sensitive matching, the restrictions on parentheses
+and back references in lookahead constraints, and the longest/shortest-match
+(rather than first-match) matching semantics.
+
+The matching rules for REs containing both normal and non-greedy quantifiers
+have changed since early beta-test versions of this package. The new rules are
+much simpler and cleaner, but don't work as hard at guessing the user's real
+intentions.
+
Henry Spencer's original 1986 @e regexp package, still in widespread use,
-implemented an early version of today's EREs. There are four incompatibilities between @e regexp's
-near-EREs ('RREs' for short) and AREs. In roughly increasing order of significance:
-
-In AREs, @b \ followed by an alphanumeric character is either an escape or
-an error, while in RREs, it was just another way of writing the alphanumeric.
-This should not be a problem because there was no reason to write such
-a sequence in RREs.
-@b { followed by a digit in an ARE is the beginning of
-a bound, while in RREs, @b { was always an ordinary character. Such sequences
-should be rare, and will often result in an error because following characters
-will not look like a valid bound.
-In AREs, @b \ remains a special character
-within '@b []', so a literal @b \ within @b [] must be
-written '@b \\'. @b \\ also gives a literal
-@b \ within @b [] in RREs, but only truly paranoid programmers routinely doubled
-the backslash.
-AREs report the longest/shortest match for the RE, rather
-than the first found in a specified search order. This may affect some RREs
-which were written in the expectation that the first match would be reported.
-(The careful crafting of RREs to optimize the search order for fast matching
-is obsolete (AREs examine all possible matches in parallel, and their performance
-is largely insensitive to their complexity) but cases where the search
-order was exploited to deliberately find a match which was @e not the longest/shortest
-will need rewriting.)
+implemented an early version of today's EREs. There are four incompatibilities
+between @e regexp's near-EREs (RREs for short) and AREs. In roughly increasing
+order of significance:
+
+@li In AREs, @\ followed by an alphanumeric character is either an
+ escape or an error, while in RREs, it was just another way of writing the
+ alphanumeric. This should not be a problem because there was no reason to
+ write such a sequence in RREs.
+@li @c { followed by a digit in an ARE is the beginning of a bound, while in
+ RREs, @c { was always an ordinary character. Such sequences should be rare,
+ and will often result in an error because following characters will not
+ look like a valid bound.
+@li In AREs, @c @\ remains a special character within @c [], so a literal @c @\
+ within @c [] must be written as @\@\. @\@\ also gives a
+ literal @c @\ within @c [] in RREs, but only truly paranoid programmers
+ routinely doubled the backslash.
+@li AREs report the longest/shortest match for the RE, rather than the first
+ found in a specified search order. This may affect some RREs which were
+ written in the expectation that the first match would be reported. The
+ careful crafting of RREs to optimize the search order for fast matching is
+ obsolete (AREs examine all possible matches in parallel, and their
+ performance is largely insensitive to their complexity) but cases where the
+ search order was exploited to deliberately find a match which was @e not
+ the longest/shortest will need rewriting.
@section overview_resyntax_bre Basic Regular Expressions
-BREs differ from EREs in
-several respects. '@b |', '@b +', and @b ? are ordinary characters and there is no equivalent
-for their functionality. The delimiters for bounds
-are @b \{ and '@b \}', with @b { and
-@b } by themselves ordinary characters. The parentheses for nested subexpressions
-are @b \( and '@b \)', with @b ( and @b ) by themselves
-ordinary characters. @b ^ is an ordinary
+BREs differ from EREs in several respects. @c |, @c +, and @c ? are ordinary
+characters and there is no equivalent for their functionality. The delimiters
+for bounds are @c @\{ and @c @\}, with @c { and @c } by themselves ordinary
+characters. The parentheses for nested subexpressions are @c @\( and @c @\),
+with @c ( and @c ) by themselves ordinary characters. @c ^ is an ordinary
character except at the beginning of the RE or the beginning of a parenthesized
-subexpression, @b $ is an ordinary character except at the end of the RE or
-the end of a parenthesized subexpression, and @b * is an ordinary character
-if it appears at the beginning of the RE or the beginning of a parenthesized
-subexpression (after a possible leading '@b ^'). Finally, single-digit back references
-are available, and @b \ and @b \ are synonyms
-for [[:@<:]] and [[:@>:]] respectively;
-no other escapes are available.
+subexpression, @c $ is an ordinary character except at the end of the RE or the
+end of a parenthesized subexpression, and @c * is an ordinary character if it
+appears at the beginning of the RE or the beginning of a parenthesized
+subexpression (after a possible leading ^). Finally, single-digit back
+references are available, and @c @\@< and @c @\@> are synonyms for
+[[:@<:]] and [[:@>:]] respectively; no other escapes are
+available.
@section overview_resyntax_characters Regular Expression Character Names
@@ -694,4 +650,3 @@ Note that the character names are case sensitive.
*/
-