X-Git-Url: https://git.saurik.com/wxWidgets.git/blobdiff_plain/a2968d85eb320b948e39972c75d55df98474ff89..4e15d1caa03346c126015019c1fdf093033ef40b:/docs/doxygen/overviews/resyntax.h diff --git a/docs/doxygen/overviews/resyntax.h b/docs/doxygen/overviews/resyntax.h index 65e2bf29c4..9f4facbde7 100644 --- a/docs/doxygen/overviews/resyntax.h +++ b/docs/doxygen/overviews/resyntax.h @@ -3,33 +3,21 @@ // Purpose: topic overview // Author: wxWidgets team // RCS-ID: $Id$ -// Licence: wxWindows license +// Licence: wxWindows licence ///////////////////////////////////////////////////////////////////////////// -/*! +/** -@page overview_resyntax Syntax of the Built-in Regular Expression Library +@page overview_resyntax Regular Expressions + +@tableofcontents A regular expression describes strings of characters. It's a pattern that matches certain strings and doesn't match others. -@li @ref overview_resyntax_differentflavors -@li @ref overview_resyntax_syntax -@li @ref overview_resyntax_bracket -@li @ref overview_resyntax_escapes -@li @ref overview_resyntax_metasyntax -@li @ref overview_resyntax_matching -@li @ref overview_resyntax_limits -@li @ref overview_resyntax_bre -@li @ref overview_resyntax_characters - -@seealso - -@li #wxRegEx +@see wxRegEx -
- @section overview_resyntax_differentflavors Different Flavors of Regular Expressions @@ -323,7 +311,7 @@ specific conditions are met, written as an escape: @endTable A word is defined as in the specification of [[:@<:]] and -[[:>:]] above. Constraint escapes are illegal within bracket +[[:@>:]] above. Constraint escapes are illegal within bracket expressions. A back reference (AREs only) matches the same string matched by the @@ -343,226 +331,194 @@ back reference), and otherwise is taken as octal. @section overview_resyntax_metasyntax Metasyntax -In addition to the main syntax described above, -there are some special forms and miscellaneous syntactic facilities available. +In addition to the main syntax described above, there are some special forms +and miscellaneous syntactic facilities available. + Normally the flavor of RE being used is specified by application-dependent means. However, this can be overridden by a @e director. If an RE of any flavor -begins with '@b ***:', the rest of the RE is an ARE. If an RE of any flavor begins -with '@b ***=', the rest of the RE is taken to be a literal string, with all -characters considered ordinary characters. -An ARE may begin with @e embedded options: a sequence @b (?xyz) -(where @e xyz is one or more alphabetic characters) -specifies options affecting the rest of the RE. These supplement, and can -override, any options specified by the application. The available option -letters are: - - - -@b b - -rest of RE is a BRE - -@b c - -case-sensitive matching (usual default) - -@b e - -rest of RE is an ERE - -@b i - -case-insensitive matching (see #Matching, below) - -@b m - -historical synonym for @b n +begins with ***:, the rest of the RE is an ARE. If an RE of any +flavor begins with ***=, the rest of the RE is taken to be a literal +string, with all characters considered ordinary characters. -@b n - -newline-sensitive matching (see #Matching, below) - -@b p - -partial newline-sensitive matching (see #Matching, below) - -@b q - -rest of RE -is a literal ("quoted'') string, all ordinary characters - -@b s - -non-newline-sensitive matching (usual default) - -@b t - -tight syntax (usual default; see below) - -@b w - -inverse -partial newline-sensitive ("weird'') matching (see #Matching, below) - -@b x - -expanded syntax (see below) +An ARE may begin with embedded options: a sequence (?xyz) +(where @e xyz is one or more alphabetic characters) specifies options affecting +the rest of the RE. These supplement, and can override, any options specified +by the application. The available option letters are: +@beginTable +@row2col{ b , Rest of RE is a BRE. } +@row2col{ c , Case-sensitive matching (usual default). } +@row2col{ e , Rest of RE is an ERE. } +@row2col{ i , Case-insensitive matching (see + @ref overview_resyntax_matching, below). } +@row2col{ m , Historical synonym for @e n. } +@row2col{ n , Newline-sensitive matching (see + @ref overview_resyntax_matching, below). } +@row2col{ p , Partial newline-sensitive matching (see + @ref overview_resyntax_matching, below). } +@row2col{ q , Rest of RE is a literal ("quoted") string, all ordinary + characters. } +@row2col{ s , Non-newline-sensitive matching (usual default). } +@row2col{ t , Tight syntax (usual default; see below). } +@row2col{ w , Inverse partial newline-sensitive ("weird") matching + (see @ref overview_resyntax_matching, below). } +@row2col{ x , Expanded syntax (see below). } +@endTable +Embedded options take effect at the ) terminating the sequence. They +are available only at the start of an ARE, and may not be used later within it. -Embedded options take effect at the @b ) terminating the -sequence. They are available only at the start of an ARE, and may not be -used later within it. -In addition to the usual (@e tight) RE syntax, in which -all characters are significant, there is an @e expanded syntax, available -in AREs with the embedded -x option. In the expanded syntax, white-space characters are ignored and -all characters between a @b # and the following newline (or the end of the -RE) are ignored, permitting paragraphing and commenting a complex RE. There -are three exceptions to that basic rule: +In addition to the usual (@e tight) RE syntax, in which all characters are +significant, there is an @e expanded syntax, available in AREs with the +embedded x option. In the expanded syntax, white-space characters are ignored +and all characters between a @# and the following newline (or the end +of the RE) are ignored, permitting paragraphing and commenting a complex RE. +There are three exceptions to that basic rule: +@li A white-space character or @# preceded by @\ is retained. +@li White space or @# within a bracket expression is retained. +@li White space and comments are illegal within multi-character symbols like + the ARE (?: or the BRE \(. -a white-space character or '@b #' preceded -by '@b \' is retained -white space or '@b #' within a bracket expression is retained -white space and comments are illegal within multi-character symbols like -the ARE '@b (?:' or the BRE '@b \(' +Expanded-syntax white-space characters are blank, tab, newline, and any +character that belongs to the @e space character class. +Finally, in an ARE, outside bracket expressions, the sequence (?@#ttt) +(where @e ttt is any text not containing a )) is a comment, completely +ignored. Again, this is not allowed between the characters of multi-character +symbols like (?:. Such comments are more a historical artifact than a +useful facility, and their use is deprecated; use the expanded syntax instead. -Expanded-syntax white-space characters are blank, -tab, newline, and any character that belongs to the @e space character class. -Finally, in an ARE, outside bracket expressions, the sequence '@b (?#ttt)' (where -@e ttt is any text not containing a '@b )') is a comment, completely ignored. Again, -this is not allowed between the characters of multi-character symbols like -'@b (?:'. Such comments are more a historical artifact than a useful facility, -and their use is deprecated; use the expanded syntax instead. -@e None of these -metasyntax extensions is available if the application (or an initial @b ***= -director) has specified that the user's input be treated as a literal string -rather than as an RE. +@e None of these metasyntax extensions is available if the application (or an +initial ***= director) has specified that the user's input be treated +as a literal string rather than as an RE. @section overview_resyntax_matching Matching -In the event that an RE could match more than -one substring of a given string, the RE matches the one starting earliest -in the string. If the RE could match more than one substring starting at -that point, its choice is determined by its @e preference: either the longest -substring, or the shortest. -Most atoms, and all constraints, have no preference. -A parenthesized RE has the same preference (possibly none) as the RE. A -quantified atom with quantifier @b {m} or @b {m}? has the same preference (possibly -none) as the atom itself. A quantified atom with other normal quantifiers -(including @b {m,n} with @e m equal to @e n) prefers longest match. A quantified -atom with other non-greedy quantifiers (including @b {m,n}? with @e m equal to -@e n) prefers shortest match. A branch has the same preference as the first -quantified atom in it which has a preference. An RE consisting of two or -more branches connected by the @b | operator prefers longest match. -Subject to the constraints imposed by the rules for matching the whole RE, subexpressions -also match the longest or shortest possible substrings, based on their -preferences, with subexpressions starting earlier in the RE taking priority -over ones starting later. Note that outer subexpressions thus take priority -over their component subexpressions. -Note that the quantifiers @b {1,1} and -@b {1,1}? can be used to force longest and shortest preference, respectively, -on a subexpression or a whole RE. -Match lengths are measured in characters, -not collating elements. An empty string is considered longer than no match -at all. For example, @b bb* matches the three middle characters -of '@b abbbc', @b (week|wee)(night|knights) -matches all ten characters of '@b weeknights', when @b (.*).* is matched against -@b abc the parenthesized subexpression matches all three characters, and when -@b (a*)* is matched against @b bc both the whole RE and the parenthesized subexpression -match an empty string. -If case-independent matching is specified, the effect -is much as if all case distinctions had vanished from the alphabet. When -an alphabetic that exists in multiple cases appears as an ordinary character -outside a bracket expression, it is effectively transformed into a bracket -expression containing both cases, so that @b x becomes '@b [xX]'. When it appears -inside a bracket expression, all case counterparts of it are added to the -bracket expression, so that @b [x] becomes @b [xX] and @b [^x] becomes '@b [^xX]'. -If newline-sensitive -matching is specified, @b . and bracket expressions using @b ^ will never match -the newline character (so that matches will never cross newlines unless -the RE explicitly arranges it) and @b ^ and @b $ will match the empty string after -and before a newline respectively, in addition to matching at beginning -and end of string respectively. ARE @b \A and @b \Z continue to match beginning -or end of string @e only. -If partial newline-sensitive matching is specified, -this affects @b . and bracket expressions as with newline-sensitive matching, -but not @b ^ and '@b $'. -If inverse partial newline-sensitive matching is specified, -this affects @b ^ and @b $ as with newline-sensitive matching, but not @b . and bracket +In the event that an RE could match more than one substring of a given string, +the RE matches the one starting earliest in the string. If the RE could match +more than one substring starting at that point, the choice is determined by +it's @e preference: either the longest substring, or the shortest. + +Most atoms, and all constraints, have no preference. A parenthesized RE has the +same preference (possibly none) as the RE. A quantified atom with quantifier +{m} or {m}? has the same preference (possibly none) as the +atom itself. A quantified atom with other normal quantifiers (including +{m,n} with @e m equal to @e n) prefers longest match. A quantified +atom with other non-greedy quantifiers (including {m,n}? with @e m +equal to @e n) prefers shortest match. A branch has the same preference as the +first quantified atom in it which has a preference. An RE consisting of two or +more branches connected by the @c | operator prefers longest match. + +Subject to the constraints imposed by the rules for matching the whole RE, +subexpressions also match the longest or shortest possible substrings, based on +their preferences, with subexpressions starting earlier in the RE taking +priority over ones starting later. Note that outer subexpressions thus take +priority over their component subexpressions. + +Note that the quantifiers {1,1} and {1,1}? can be used to +force longest and shortest preference, respectively, on a subexpression or a +whole RE. + +Match lengths are measured in characters, not collating elements. An empty +string is considered longer than no match at all. For example, bb* +matches the three middle characters of "abbbc", +(week|wee)(night|knights) matches all ten characters of "weeknights", +when (.*).* is matched against "abc" the parenthesized subexpression +matches all three characters, and when (a*)* is matched against "bc" +both the whole RE and the parenthesized subexpression match an empty string. + +If case-independent matching is specified, the effect is much as if all case +distinctions had vanished from the alphabet. When an alphabetic that exists in +multiple cases appears as an ordinary character outside a bracket expression, +it is effectively transformed into a bracket expression containing both cases, +so that @c x becomes @c [xX]. When it appears inside a bracket expression, all +case counterparts of it are added to the bracket expression, so that @c [x] +becomes @c [xX] and @c [^x] becomes @c [^xX]. + +If newline-sensitive matching is specified, "." and bracket expressions using +"^" will never match the newline character (so that matches will never cross +newlines unless the RE explicitly arranges it) and "^" and "$" will match the +empty string after and before a newline respectively, in addition to matching +at beginning and end of string respectively. ARE @\A and @\Z +continue to match beginning or end of string @e only. + +If partial newline-sensitive matching is specified, this affects "." and +bracket expressions as with newline-sensitive matching, but not "^" and "$". + +If inverse partial newline-sensitive matching is specified, this affects "^" +and "$" as with newline-sensitive matching, but not "." and bracket expressions. This isn't very useful but is provided for symmetry. @section overview_resyntax_limits Limits and Compatibility -No particular limit is imposed on the length of REs. Programs -intended to be highly portable should not employ REs longer than 256 bytes, -as a POSIX-compliant implementation can refuse to accept such REs. -The only -feature of AREs that is actually incompatible with POSIX EREs is that @b \ -does not lose its special significance inside bracket expressions. All other -ARE features use syntax which is illegal or has undefined or unspecified -effects in POSIX EREs; the @b *** syntax of directors likewise is outside -the POSIX syntax for both BREs and EREs. -Many of the ARE extensions are -borrowed from Perl, but some have been changed to clean them up, and a -few Perl extensions are not present. Incompatibilities of note include '@b \b', -'@b \B', the lack of special treatment for a trailing newline, the addition of -complemented bracket expressions to the things affected by newline-sensitive -matching, the restrictions on parentheses and back references in lookahead -constraints, and the longest/shortest-match (rather than first-match) matching -semantics. -The matching rules for REs containing both normal and non-greedy -quantifiers have changed since early beta-test versions of this package. -(The new rules are much simpler and cleaner, but don't work as hard at guessing -the user's real intentions.) +No particular limit is imposed on the length of REs. Programs intended to be +highly portable should not employ REs longer than 256 bytes, as a +POSIX-compliant implementation can refuse to accept such REs. + +The only feature of AREs that is actually incompatible with POSIX EREs is that +@\ does not lose its special significance inside bracket expressions. +All other ARE features use syntax which is illegal or has undefined or +unspecified effects in POSIX EREs; the *** syntax of directors +likewise is outside the POSIX syntax for both BREs and EREs. + +Many of the ARE extensions are borrowed from Perl, but some have been changed +to clean them up, and a few Perl extensions are not present. Incompatibilities +of note include @\b, @\B, the lack of special treatment for a +trailing newline, the addition of complemented bracket expressions to the +things affected by newline-sensitive matching, the restrictions on parentheses +and back references in lookahead constraints, and the longest/shortest-match +(rather than first-match) matching semantics. + +The matching rules for REs containing both normal and non-greedy quantifiers +have changed since early beta-test versions of this package. The new rules are +much simpler and cleaner, but don't work as hard at guessing the user's real +intentions. + Henry Spencer's original 1986 @e regexp package, still in widespread use, -implemented an early version of today's EREs. There are four incompatibilities between @e regexp's -near-EREs ('RREs' for short) and AREs. In roughly increasing order of significance: - -In AREs, @b \ followed by an alphanumeric character is either an escape or -an error, while in RREs, it was just another way of writing the alphanumeric. -This should not be a problem because there was no reason to write such -a sequence in RREs. -@b { followed by a digit in an ARE is the beginning of -a bound, while in RREs, @b { was always an ordinary character. Such sequences -should be rare, and will often result in an error because following characters -will not look like a valid bound. -In AREs, @b \ remains a special character -within '@b []', so a literal @b \ within @b [] must be -written '@b \\'. @b \\ also gives a literal -@b \ within @b [] in RREs, but only truly paranoid programmers routinely doubled -the backslash. -AREs report the longest/shortest match for the RE, rather -than the first found in a specified search order. This may affect some RREs -which were written in the expectation that the first match would be reported. -(The careful crafting of RREs to optimize the search order for fast matching -is obsolete (AREs examine all possible matches in parallel, and their performance -is largely insensitive to their complexity) but cases where the search -order was exploited to deliberately find a match which was @e not the longest/shortest -will need rewriting.) +implemented an early version of today's EREs. There are four incompatibilities +between @e regexp's near-EREs (RREs for short) and AREs. In roughly increasing +order of significance: + +@li In AREs, @\ followed by an alphanumeric character is either an + escape or an error, while in RREs, it was just another way of writing the + alphanumeric. This should not be a problem because there was no reason to + write such a sequence in RREs. +@li @c { followed by a digit in an ARE is the beginning of a bound, while in + RREs, @c { was always an ordinary character. Such sequences should be rare, + and will often result in an error because following characters will not + look like a valid bound. +@li In AREs, @c @\ remains a special character within @c [], so a literal @c @\ + within @c [] must be written as @\@\. @\@\ also gives a + literal @c @\ within @c [] in RREs, but only truly paranoid programmers + routinely doubled the backslash. +@li AREs report the longest/shortest match for the RE, rather than the first + found in a specified search order. This may affect some RREs which were + written in the expectation that the first match would be reported. The + careful crafting of RREs to optimize the search order for fast matching is + obsolete (AREs examine all possible matches in parallel, and their + performance is largely insensitive to their complexity) but cases where the + search order was exploited to deliberately find a match which was @e not + the longest/shortest will need rewriting. @section overview_resyntax_bre Basic Regular Expressions -BREs differ from EREs in -several respects. '@b |', '@b +', and @b ? are ordinary characters and there is no equivalent -for their functionality. The delimiters for bounds -are @b \{ and '@b \}', with @b { and -@b } by themselves ordinary characters. The parentheses for nested subexpressions -are @b \( and '@b \)', with @b ( and @b ) by themselves -ordinary characters. @b ^ is an ordinary +BREs differ from EREs in several respects. @c |, @c +, and @c ? are ordinary +characters and there is no equivalent for their functionality. The delimiters +for bounds are @c @\{ and @c @\}, with @c { and @c } by themselves ordinary +characters. The parentheses for nested subexpressions are @c @\( and @c @\), +with @c ( and @c ) by themselves ordinary characters. @c ^ is an ordinary character except at the beginning of the RE or the beginning of a parenthesized -subexpression, @b $ is an ordinary character except at the end of the RE or -the end of a parenthesized subexpression, and @b * is an ordinary character -if it appears at the beginning of the RE or the beginning of a parenthesized -subexpression (after a possible leading '@b ^'). Finally, single-digit back references -are available, and @b \ and @b \ are synonyms -for [[:@<:]] and [[:@>:]] respectively; -no other escapes are available. +subexpression, @c $ is an ordinary character except at the end of the RE or the +end of a parenthesized subexpression, and @c * is an ordinary character if it +appears at the beginning of the RE or the beginning of a parenthesized +subexpression (after a possible leading ^). Finally, single-digit back +references are available, and @c @\@< and @c @\@> are synonyms for +[[:@<:]] and [[:@>:]] respectively; no other escapes are +available. @section overview_resyntax_characters Regular Expression Character Names @@ -694,4 +650,3 @@ Note that the character names are case sensitive. */ -