From 7fceb615a5c1a2a9445ac85faa0a2a1618b5ab54 Mon Sep 17 00:00:00 2001 From: "Joel E. Denny" Date: Mon, 21 Feb 2011 19:09:24 -0500 Subject: [PATCH] doc: create a new Tuning LR section in the manual. And clean up all other documentation of the features described there. * NEWS (2.5): Tweak wording of lr.type and parse.lac entries a bit, update the cross-references to the manual, and point out that LAC has caveats. Don't be so adamant that IELR+LAC=canonical LR. That is, as the referenced section in the manual documents, LAC does not fix infinite parsing loops on syntax errors. * doc/bison.texinfo: Consistently drop the "(1)" suffix from LALR, IELR, and LR in @cindex. (%define Summary): Condense the entries for lr.default-reductions, lr.keep-unreachable-states, lr.type, and parse.lac into brief summaries, and cross-reference the appropriate subsections of Tuning LR. For parse.lac, mention that it's only implemented for deterministic parsers in C. In parse.error entry, mention LAC, and add cross-reference to the LAC section. (Error Reporting): When mentioning parse.error, mention LAC, and add cross-reference to the LAC section. (Tuning LR): New section with an extended version of the documentation removed from %define Summary. Change all cross-references in the manual to point here instead of there. (Calc++ Parser): When mentioning parse.error, mention LAC, and add cross-reference to the LAC section. (Table of Symbols): In %error-verbose entry, add cross-reference to Error Reporting. (Glossary): Capitalize entry titles consistently. Add definitions for "defaulted state" and "unreachable state". Expand IELR acronym in IELR's entry. (cherry picked from commit 6f04ee6c78ba01f9d8e02dbe2baace0c3bd8f4fd) Conflicts: doc/bison.texinfo --- ChangeLog | 31 ++ NEWS | 44 +-- doc/bison.texinfo | 786 +++++++++++++++++++++++++++++----------------- 3 files changed, 545 insertions(+), 316 deletions(-) diff --git a/ChangeLog b/ChangeLog index b7a5ef3a..5173bd2e 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,34 @@ +2011-03-06 Joel E. Denny + + doc: create a new Tuning LR section in the manual. + And clean up all other documentation of the features described + there. + * NEWS (2.5): Tweak wording of lr.type and parse.lac entries a + bit, update the cross-references to the manual, and point out that + LAC has caveats. Don't be so adamant that IELR+LAC=canonical LR. + That is, as the referenced section in the manual documents, LAC + does not fix infinite parsing loops on syntax errors. + * doc/bison.texinfo: Consistently drop the "(1)" suffix from LALR, + IELR, and LR in @cindex. + (%define Summary): Condense the entries for lr.default-reductions, + lr.keep-unreachable-states, lr.type, and parse.lac into brief + summaries, and cross-reference the appropriate subsections of + Tuning LR. For parse.lac, mention that it's only implemented for + deterministic parsers in C. In parse.error entry, mention LAC, + and add cross-reference to the LAC section. + (Error Reporting): When mentioning parse.error, mention LAC, and + add cross-reference to the LAC section. + (Tuning LR): New section with an extended version of the + documentation removed from %define Summary. Change all + cross-references in the manual to point here instead of there. + (Calc++ Parser): When mentioning parse.error, mention LAC, and add + cross-reference to the LAC section. + (Table of Symbols): In %error-verbose entry, add cross-reference + to Error Reporting. + (Glossary): Capitalize entry titles consistently. Add definitions + for "defaulted state" and "unreachable state". Expand IELR + acronym in IELR's entry. + 2011-02-20 Joel E. Denny doc: add bibliography to manual. diff --git a/NEWS b/NEWS index 2c1ecdfe..93c6e0bf 100644 --- a/NEWS +++ b/NEWS @@ -116,27 +116,27 @@ Bison News %define lr.type ielr %define lr.type canonical-lr - The default reduction optimization in the parser tables can also be - adjusted using `%define lr.default-reductions'. See the documentation - for `%define lr.type' and `%define lr.default-reductions' in the - section `Bison Declaration Summary' in the Bison manual for the - details. + The default-reduction optimization in the parser tables can also be + adjusted using `%define lr.default-reductions'. For details on both + of these features, see the new section `Tuning LR' in the Bison + manual. These features are experimental. More user feedback will help to stabilize them. -** LAC (lookahead correction) for syntax error handling: +** LAC (Lookahead Correction) for syntax error handling: Canonical LR, IELR, and LALR can suffer from a couple of problems upon encountering a syntax error. First, the parser might perform additional parser stack reductions before discovering the syntax - error. Such reductions perform user semantic actions that are + error. Such reductions can perform user semantic actions that are unexpected because they are based on an invalid token, and they cause error recovery to begin in a different syntactic context than the one in which the invalid token was encountered. Second, when - verbose error messages are enabled (with %error-verbose or `#define - YYERROR_VERBOSE'), the expected token list in the syntax error - message can both contain invalid tokens and omit valid tokens. + verbose error messages are enabled (with %error-verbose or the + obsolete `#define YYERROR_VERBOSE'), the expected token list in the + syntax error message can both contain invalid tokens and omit valid + tokens. The culprits for the above problems are %nonassoc, default reductions in inconsistent states, and parser state merging. Thus, @@ -144,11 +144,11 @@ Bison News %nonassoc is used or if default reductions are enabled for inconsistent states. - LAC is a new mechanism within the parsing algorithm that completely - solves these problems for canonical LR, IELR, and LALR without - sacrificing %nonassoc, default reductions, or state mering. When - LAC is in use, canonical LR and IELR behave exactly the same for - both syntactically acceptable and syntactically unacceptable input. + LAC is a new mechanism within the parsing algorithm that solves + these problems for canonical LR, IELR, and LALR without sacrificing + %nonassoc, default reductions, or state merging. When LAC is in + use, canonical LR and IELR behave almost exactly the same for both + syntactically acceptable and syntactically unacceptable input. While LALR still does not support the full language-recognition power of canonical LR and IELR, LAC at least enables LALR's syntax error handling to correctly reflect LALR's language-recognition @@ -159,8 +159,8 @@ Bison News %define parse.lac full - See the documentation for `%define parse.lac' in the section `Bison - Declaration Summary' in the Bison manual for additional details. + See the new section `LAC' in the Bison manual for additional + details including a few caveats. LAC is an experimental feature. More user feedback will help to stabilize it. @@ -314,11 +314,11 @@ Bison News ** Verbose syntax error message fixes: - When %error-verbose or `#define YYERROR_VERBOSE' is specified, - syntax error messages produced by the generated parser include the - unexpected token as well as a list of expected tokens. The effect - of %nonassoc on these verbose messages has been corrected in two - ways, but a complete fix requires LAC, described above: + When %error-verbose or the obsolete `#define YYERROR_VERBOSE' is + specified, syntax error messages produced by the generated parser + include the unexpected token as well as a list of expected tokens. + The effect of %nonassoc on these verbose messages has been corrected + in two ways, but a more complete fix requires LAC, described above: *** When %nonassoc is used, there can exist parser states that accept no tokens, and so the parser does not always require a lookahead token diff --git a/doc/bison.texinfo b/doc/bison.texinfo index 2a0c0ffa..ef25f6cc 100644 --- a/doc/bison.texinfo +++ b/doc/bison.texinfo @@ -266,6 +266,7 @@ The Bison Parser Algorithm * Parser States:: The parser is a finite-state-machine with stack. * Reduce/Reduce:: When two rules are applicable in the same situation. * Mystery Conflicts:: Reduce/reduce conflicts that look unjustified. +* Tuning LR:: How to tune fundamental aspects of LR-based parsing. * Generalized LR Parsing:: Parsing arbitrary context-free grammars. * Memory Management:: What happens when memory is exhausted. How to avoid it. @@ -277,6 +278,13 @@ Operator Precedence * Precedence Examples:: How these features are used in the previous example. * How Precedence:: How they work. +Tuning LR + +* LR Table Construction:: Choose a different construction algorithm. +* Default Reductions:: Disable default reductions. +* LAC:: Correct lookahead sets in the parser states. +* Unreachable States:: Keep unreachable parser states for debugging. + Handling Context Dependencies * Semantic Tokens:: Token parsing can depend on the semantic context. @@ -473,21 +481,19 @@ order to specify the language Algol 60. Any grammar expressed in BNF is a context-free grammar. The input to Bison is essentially machine-readable BNF. -@cindex LALR(1) grammars -@cindex IELR(1) grammars -@cindex LR(1) grammars -There are various important subclasses of context-free grammars. -Although it can handle almost all context-free grammars, Bison is -optimized for what are called LR(1) grammars. -In brief, in these grammars, it must be possible to tell how to parse -any portion of an input string with just a single token of lookahead. -For historical reasons, Bison by default is limited by the additional -restrictions of LALR(1), which is hard to explain simply. -@xref{Mystery Conflicts, ,Mysterious Reduce/Reduce Conflicts}, for -more information on this. -As an experimental feature, you can escape these additional restrictions by -requesting IELR(1) or canonical LR(1) parser tables. -@xref{%define Summary,,lr.type}, to learn how. +@cindex LALR grammars +@cindex IELR grammars +@cindex LR grammars +There are various important subclasses of context-free grammars. Although +it can handle almost all context-free grammars, Bison is optimized for what +are called LR(1) grammars. In brief, in these grammars, it must be possible +to tell how to parse any portion of an input string with just a single token +of lookahead. For historical reasons, Bison by default is limited by the +additional restrictions of LALR(1), which is hard to explain simply. +@xref{Mystery Conflicts, ,Mysterious Reduce/Reduce Conflicts}, for more +information on this. As an experimental feature, you can escape these +additional restrictions by requesting IELR(1) or canonical LR(1) parser +tables. @xref{LR Table Construction}, to learn how. @cindex GLR parsing @cindex generalized LR (GLR) parsing @@ -5352,65 +5358,17 @@ Boolean. @c ================================================== lr.default-reductions @item lr.default-reductions -@cindex default reductions @findex %define lr.default-reductions -@cindex delayed syntax errors -@cindex syntax errors delayed -@cindex LAC -@findex %nonassoc @itemize @bullet @item Language(s): all @item Purpose: Specify the kind of states that are permitted to -contain default reductions. -That is, in such a state, Bison selects the reduction with the largest -lookahead set to be the default parser action and then removes that -lookahead set. -(The ability to specify where default reductions should be used is -experimental. -More user feedback will help to stabilize it.) - -@item Accepted Values: -@itemize -@item @code{all}. -This is the traditional Bison behavior. The main advantage is a -significant decrease in the size of the parser tables. The -disadvantage is that, when the generated parser encounters a -syntactically unacceptable token, the parser might then perform -unnecessary default reductions before it can detect the syntax error. -Such delayed syntax error detection is usually inherent in LALR and -IELR parser tables anyway due to LR state merging (@pxref{%define -Summary,,lr.type}). Furthermore, the use of @code{%nonassoc} can -contribute to delayed syntax error detection even in the case of -canonical LR. As an experimental feature, delayed syntax error -detection can be overcome in all cases by enabling LAC (@pxref{%define -Summary,,parse.lac}, for details, including a discussion of the -effects of delayed syntax error detection). - -@item @code{consistent}. -@cindex consistent states -A consistent state is a state that has only one possible action. -If that action is a reduction, then the parser does not need to request -a lookahead token from the scanner before performing that action. -However, the parser recognizes the ability to ignore the lookahead token -in this way only when such a reduction is encoded as a default -reduction. -Thus, if default reductions are permitted only in consistent states, -then a canonical LR parser that does not employ -@code{%nonassoc} detects a syntax error as soon as it @emph{needs} the -syntactically unacceptable token from the scanner. - -@item @code{accepting}. -@cindex accepting state -In the accepting state, the default reduction is actually the accept -action. -In this case, a canonical LR parser that does not employ -@code{%nonassoc} detects a syntax error as soon as it @emph{reaches} the -syntactically unacceptable token in the input. -That is, it does not perform any extra reductions. -@end itemize +contain default reductions. @xref{Default Reductions}. (The ability to +specify where default reductions should be used is experimental. More user +feedback will help to stabilize it.) +@item Accepted Values: @code{all}, @code{consistent}, @code{accepting} @item Default Value: @itemize @item @code{accepting} if @code{lr.type} is @code{canonical-lr}. @@ -5425,42 +5383,10 @@ That is, it does not perform any extra reductions. @itemize @bullet @item Language(s): all - @item Purpose: Request that Bison allow unreachable parser states to -remain in the parser tables. -Bison considers a state to be unreachable if there exists no sequence of -transitions from the start state to that state. -A state can become unreachable during conflict resolution if Bison disables a -shift action leading to it from a predecessor state. -Keeping unreachable states is sometimes useful for analysis purposes, but they -are useless in the generated parser. - +remain in the parser tables. @xref{Unreachable States}. @item Accepted Values: Boolean - @item Default Value: @code{false} - -@item Caveats: - -@itemize @bullet - -@item Unreachable states may contain conflicts and may use rules not used in -any other state. -Thus, keeping unreachable states may induce warnings that are irrelevant to -your parser's behavior, and it may eliminate warnings that are relevant. -Of course, the change in warnings may actually be relevant to a parser table -analysis that wants to keep unreachable states, so this behavior will likely -remain in future Bison releases. - -@item While Bison is able to remove unreachable states, it is not guaranteed to -remove other kinds of useless states. -Specifically, when Bison disables reduce actions during conflict resolution, -some goto actions may become useless, and thus some additional states may -become useless. -If Bison were to compute which goto actions were useless and then disable those -actions, it could identify such states as unreachable and then remove those -states. -However, Bison does not compute which goto actions are useless. -@end itemize @end itemize @c lr.keep-unreachable-states @@ -5468,87 +5394,15 @@ However, Bison does not compute which goto actions are useless. @item lr.type @findex %define lr.type -@cindex LALR -@cindex IELR -@cindex LR @itemize @bullet @item Language(s): all @item Purpose: Specify the type of parser tables within the -LR(1) family. -(This feature is experimental. +LR(1) family. @xref{LR Table Construction}. (This feature is experimental. More user feedback will help to stabilize it.) -@item Accepted Values: -@itemize -@item @code{lalr}. -While Bison generates LALR parser tables by default for -historical reasons, IELR or canonical LR is almost -always preferable for deterministic parsers. -The trouble is that LALR parser tables can suffer from -mysterious conflicts and thus may not accept the full set of sentences -that IELR and canonical LR accept. -@xref{Mystery Conflicts}, for details. -However, there are at least two scenarios where LALR may be -worthwhile: -@itemize -@cindex GLR with LALR -@item When employing GLR parsers (@pxref{GLR Parsers}), if you -do not resolve any conflicts statically (for example, with @code{%left} -or @code{%prec}), then the parser explores all potential parses of any -given input. -In this case, the use of LALR parser tables is guaranteed not -to alter the language accepted by the parser. -LALR parser tables are the smallest parser tables Bison can -currently generate, so they may be preferable. -Nevertheless, once you begin to resolve conflicts statically, -GLR begins to behave more like a deterministic parser, and so -IELR and canonical LR can be helpful to avoid -LALR's mysterious behavior. - -@item Occasionally during development, an especially malformed grammar -with a major recurring flaw may severely impede the IELR or -canonical LR parser table generation algorithm. -LALR can be a quick way to generate parser tables in order to -investigate such problems while ignoring the more subtle differences -from IELR and canonical LR. -@end itemize - -@item @code{ielr}. -IELR is a minimal LR algorithm. -That is, given any grammar (LR or non-LR), -IELR and canonical LR always accept exactly the same -set of sentences. -However, as for LALR, the number of parser states is often an -order of magnitude less for IELR than for canonical -LR. -More importantly, because canonical LR's extra parser states -may contain duplicate conflicts in the case of non-LR -grammars, the number of conflicts for IELR is often an order -of magnitude less as well. -This can significantly reduce the complexity of developing of a grammar. - -@item @code{canonical-lr}. -@cindex delayed syntax errors -@cindex syntax errors delayed -@cindex LAC -@findex %nonassoc -While inefficient, canonical LR parser tables can be an interesting -means to explore a grammar because they have a property that IELR and -LALR tables do not. That is, if @code{%nonassoc} is not used and -default reductions are left disabled (@pxref{%define -Summary,,lr.default-reductions}), then, for every left context of -every canonical LR state, the set of tokens accepted by that state is -guaranteed to be the exact set of tokens that is syntactically -acceptable in that left context. It might then seem that an advantage -of canonical LR parsers in production is that, under the above -constraints, they are guaranteed to detect a syntax error as soon as -possible without performing any unnecessary reductions. However, IELR -parsers using LAC (@pxref{%define Summary,,parse.lac}) are also able -to achieve this behavior without sacrificing @code{%nonassoc} or -default reductions. -@end itemize +@item Accepted Values: @code{lalr}, @code{ielr}, @code{canonical-lr} @item Default Value: @code{lalr} @end itemize @@ -5596,8 +5450,9 @@ function. @xref{Error Reporting, ,The Error Reporting Function Error messages passed to @code{yyerror} are simply @w{@code{"syntax error"}}. @item @code{verbose} -Error messages report the unexpected token, and possibly the expected -ones. +Error messages report the unexpected token, and possibly the expected ones. +However, this report can often be incorrect when LAC is not enabled +(@pxref{LAC}). @end itemize @item Default Value: @@ -5609,84 +5464,13 @@ ones. @c ================================================== parse.lac @item parse.lac @findex %define parse.lac -@cindex LAC -@cindex lookahead correction @itemize -@item Languages(s): C +@item Languages(s): C (deterministic parsers only) @item Purpose: Enable LAC (lookahead correction) to improve -syntax error handling. - -Canonical LR, IELR, and LALR can suffer -from a couple of problems upon encountering a syntax error. First, the -parser might perform additional parser stack reductions before -discovering the syntax error. Such reductions perform user semantic -actions that are unexpected because they are based on an invalid token, -and they cause error recovery to begin in a different syntactic context -than the one in which the invalid token was encountered. Second, when -verbose error messages are enabled (with @code{%error-verbose} or -@code{#define YYERROR_VERBOSE}), the expected token list in the syntax -error message can both contain invalid tokens and omit valid tokens. - -The culprits for the above problems are @code{%nonassoc}, default -reductions in inconsistent states, and parser state merging. Thus, -IELR and LALR suffer the most. Canonical -LR can suffer only if @code{%nonassoc} is used or if default -reductions are enabled for inconsistent states. - -LAC is a new mechanism within the parsing algorithm that -completely solves these problems for canonical LR, -IELR, and LALR without sacrificing @code{%nonassoc}, -default reductions, or state mering. Conceptually, the mechanism is -straight-forward. Whenever the parser fetches a new token from the -scanner so that it can determine the next parser action, it immediately -suspends normal parsing and performs an exploratory parse using a -temporary copy of the normal parser state stack. During this -exploratory parse, the parser does not perform user semantic actions. -If the exploratory parse reaches a shift action, normal parsing then -resumes on the normal parser stacks. If the exploratory parse reaches -an error instead, the parser reports a syntax error. If verbose syntax -error messages are enabled, the parser must then discover the list of -expected tokens, so it performs a separate exploratory parse for each -token in the grammar. - -There is one subtlety about the use of LAC. That is, when in a -consistent parser state with a default reduction, the parser will not -attempt to fetch a token from the scanner because no lookahead is -needed to determine the next parser action. Thus, whether default -reductions are enabled in consistent states (@pxref{%define -Summary,,lr.default-reductions}) affects how soon the parser detects a -syntax error: when it @emph{reaches} an erroneous token or when it -eventually @emph{needs} that token as a lookahead. The latter -behavior is probably more intuitive, so Bison currently provides no -way to achieve the former behavior while default reductions are fully -enabled. - -Thus, when LAC is in use, for some fixed decision of whether -to enable default reductions in consistent states, canonical -LR and IELR behave exactly the same for both -syntactically acceptable and syntactically unacceptable input. While -LALR still does not support the full language-recognition -power of canonical LR and IELR, LAC at -least enables LALR's syntax error handling to correctly -reflect LALR's language-recognition power. - -Because LAC requires many parse actions to be performed twice, -it can have a performance penalty. However, not all parse actions must -be performed twice. Specifically, during a series of default reductions -in consistent states and shift actions, the parser never has to initiate -an exploratory parse. Moreover, the most time-consuming tasks in a -parse are often the file I/O, the lexical analysis performed by the -scanner, and the user's semantic actions, but none of these are -performed during the exploratory parse. Finally, the base of the -temporary stack used during an exploratory parse is a pointer into the -normal parser state stack so that the stack is never physically copied. -In our experience, the performance penalty of LAC has proven -insignificant for practical grammars. - +syntax error handling. @xref{LAC}. @item Accepted Values: @code{none}, @code{full} - @item Default Value: @code{none} @end itemize @c parse.lac @@ -6328,10 +6112,11 @@ receives one argument. For a syntax error, the string is normally @w{@code{"syntax error"}}. @findex %define parse.error -If you invoke @samp{%define parse.error verbose} in the Bison -declarations section (@pxref{Bison Declarations, ,The Bison Declarations -Section}), then Bison provides a more verbose and specific error message -string instead of just plain @w{@code{"syntax error"}}. +If you invoke @samp{%define parse.error verbose} in the Bison declarations +section (@pxref{Bison Declarations, ,The Bison Declarations Section}), then +Bison provides a more verbose and specific error message string instead of +just plain @w{@code{"syntax error"}}. However, that message sometimes +contains incorrect information if LAC is not enabled (@pxref{LAC}). The parser can detect one other kind of error: memory exhaustion. This can happen when the input contains constructions that are very deeply @@ -6732,6 +6517,7 @@ This kind of parser is known in the literature as a bottom-up parser. * Parser States:: The parser is a finite-state-machine with stack. * Reduce/Reduce:: When two rules are applicable in the same situation. * Mystery Conflicts:: Reduce/reduce conflicts that look unjustified. +* Tuning LR:: How to tune fundamental aspects of LR-based parsing. * Generalized LR Parsing:: Parsing arbitrary context-free grammars. * Memory Management:: What happens when memory is exhausted. How to avoid it. @end menu @@ -7301,6 +7087,7 @@ redirects:redirect @node Mystery Conflicts @section Mysterious Reduce/Reduce Conflicts +@cindex Mysterious Conflicts Sometimes reduce/reduce conflicts can occur that don't look warranted. Here is an example: @@ -7342,8 +7129,8 @@ of lookahead: when a @code{param_spec} is being read, an @code{ID} is a @code{name} if a comma or colon follows, or a @code{type} if another @code{ID} follows. In other words, this grammar is LR(1). -@cindex LR(1) -@cindex LALR(1) +@cindex LR +@cindex LALR However, for historical reasons, Bison cannot by default handle all LR(1) grammars. In this grammar, two contexts, that after an @code{ID} at the beginning @@ -7358,15 +7145,16 @@ contexts, so it makes a single parser state for them both. Combining the two contexts causes a conflict later. In parser terminology, this occurrence means that the grammar is not LALR(1). -For many practical grammars (specifically those that fall into the -non-LR(1) class), the limitations of LALR(1) result in difficulties -beyond just mysterious reduce/reduce conflicts. The best way to fix -all these problems is to select a different parser table generation -algorithm. Either IELR(1) or canonical LR(1) would suffice, but the -former is more efficient and easier to debug during development. -@xref{%define Summary,,lr.type}, for details. (Bison's IELR(1) and -canonical LR(1) implementations are experimental. More user feedback -will help to stabilize them.) +@cindex IELR +@cindex canonical LR +For many practical grammars (specifically those that fall into the non-LR(1) +class), the limitations of LALR(1) result in difficulties beyond just +mysterious reduce/reduce conflicts. The best way to fix all these problems +is to select a different parser table construction algorithm. Either +IELR(1) or canonical LR(1) would suffice, but the former is more efficient +and easier to debug during development. @xref{LR Table Construction}, for +details. (Bison's IELR(1) and canonical LR(1) implementations are +experimental. More user feedback will help to stabilize them.) If you instead wish to work around LALR(1)'s limitations, you can often fix a mysterious conflict by identifying the two parser states @@ -7417,6 +7205,409 @@ return_spec: For a more detailed exposition of LALR(1) parsers and parser generators, @pxref{Bibliography,,DeRemer 1982}. +@node Tuning LR +@section Tuning LR + +The default behavior of Bison's LR-based parsers is chosen mostly for +historical reasons, but that behavior is often not robust. For example, in +the previous section, we discussed the mysterious conflicts that can be +produced by LALR(1), Bison's default parser table construction algorithm. +Another example is Bison's @code{%define parse.error verbose} directive, +which instructs the generated parser to produce verbose syntax error +messages, which can sometimes contain incorrect information. + +In this section, we explore several modern features of Bison that allow you +to tune fundamental aspects of the generated LR-based parsers. Some of +these features easily eliminate shortcomings like those mentioned above. +Others can be helpful purely for understanding your parser. + +Most of the features discussed in this section are still experimental. More +user feedback will help to stabilize them. + +@menu +* LR Table Construction:: Choose a different construction algorithm. +* Default Reductions:: Disable default reductions. +* LAC:: Correct lookahead sets in the parser states. +* Unreachable States:: Keep unreachable parser states for debugging. +@end menu + +@node LR Table Construction +@subsection LR Table Construction +@cindex Mysterious Conflict +@cindex LALR +@cindex IELR +@cindex canonical LR +@findex %define lr.type + +For historical reasons, Bison constructs LALR(1) parser tables by default. +However, LALR does not possess the full language-recognition power of LR. +As a result, the behavior of parsers employing LALR parser tables is often +mysterious. We presented a simple example of this effect in @ref{Mystery +Conflicts}. + +As we also demonstrated in that example, the traditional approach to +eliminating such mysterious behavior is to restructure the grammar. +Unfortunately, doing so correctly is often difficult. Moreover, merely +discovering that LALR causes mysterious behavior in your parser can be +difficult as well. + +Fortunately, Bison provides an easy way to eliminate the possibility of such +mysterious behavior altogether. You simply need to activate a more powerful +parser table construction algorithm by using the @code{%define lr.type} +directive. + +@deffn {Directive} {%define lr.type @var{TYPE}} +Specify the type of parser tables within the LR(1) family. The accepted +values for @var{TYPE} are: + +@itemize +@item @code{lalr} (default) +@item @code{ielr} +@item @code{canonical-lr} +@end itemize + +(This feature is experimental. More user feedback will help to stabilize +it.) +@end deffn + +For example, to activate IELR, you might add the following directive to you +grammar file: + +@example +%define lr.type ielr +@end example + +@noindent For the example in @ref{Mystery Conflicts}, the mysterious +conflict is then eliminated, so there is no need to invest time in +comprehending the conflict or restructuring the grammar to fix it. If, +during future development, the grammar evolves such that all mysterious +behavior would have disappeared using just LALR, you need not fear that +continuing to use IELR will result in unnecessarily large parser tables. +That is, IELR generates LALR tables when LALR (using a deterministic parsing +algorithm) is sufficient to support the full language-recognition power of +LR. Thus, by enabling IELR at the start of grammar development, you can +safely and completely eliminate the need to consider LALR's shortcomings. + +While IELR is almost always preferable, there are circumstances where LALR +or the canonical LR parser tables described by Knuth +(@pxref{Bibliography,,Knuth 1965}) can be useful. Here we summarize the +relative advantages of each parser table construction algorithm within +Bison: + +@itemize +@item LALR + +There are at least two scenarios where LALR can be worthwhile: + +@itemize +@item GLR without static conflict resolution. + +@cindex GLR with LALR +When employing GLR parsers (@pxref{GLR Parsers}), if you do not resolve any +conflicts statically (for example, with @code{%left} or @code{%prec}), then +the parser explores all potential parses of any given input. In this case, +the choice of parser table construction algorithm is guaranteed not to alter +the language accepted by the parser. LALR parser tables are the smallest +parser tables Bison can currently construct, so they may then be preferable. +Nevertheless, once you begin to resolve conflicts statically, GLR behaves +more like a deterministic parser in the syntactic contexts where those +conflicts appear, and so either IELR or canonical LR can then be helpful to +avoid LALR's mysterious behavior. + +@item Malformed grammars. + +Occasionally during development, an especially malformed grammar with a +major recurring flaw may severely impede the IELR or canonical LR parser +table construction algorithm. LALR can be a quick way to construct parser +tables in order to investigate such problems while ignoring the more subtle +differences from IELR and canonical LR. +@end itemize + +@item IELR + +IELR (Inadequacy Elimination LR) is a minimal LR algorithm. That is, given +any grammar (LR or non-LR), parsers using IELR or canonical LR parser tables +always accept exactly the same set of sentences. However, like LALR, IELR +merges parser states during parser table construction so that the number of +parser states is often an order of magnitude less than for canonical LR. +More importantly, because canonical LR's extra parser states may contain +duplicate conflicts in the case of non-LR grammars, the number of conflicts +for IELR is often an order of magnitude less as well. This effect can +significantly reduce the complexity of developing a grammar. + +@item Canonical LR + +@cindex delayed syntax error detection +@cindex LAC +@findex %nonassoc +While inefficient, canonical LR parser tables can be an interesting means to +explore a grammar because they possess a property that IELR and LALR tables +do not. That is, if @code{%nonassoc} is not used and default reductions are +left disabled (@pxref{Default Reductions}), then, for every left context of +every canonical LR state, the set of tokens accepted by that state is +guaranteed to be the exact set of tokens that is syntactically acceptable in +that left context. It might then seem that an advantage of canonical LR +parsers in production is that, under the above constraints, they are +guaranteed to detect a syntax error as soon as possible without performing +any unnecessary reductions. However, IELR parsers that use LAC are also +able to achieve this behavior without sacrificing @code{%nonassoc} or +default reductions. For details and a few caveats of LAC, @pxref{LAC}. +@end itemize + +For a more detailed exposition of the mysterious behavior in LALR parsers +and the benefits of IELR, @pxref{Bibliography,,Denny 2008 March}, and +@ref{Bibliography,,Denny 2010 November}. + +@node Default Reductions +@subsection Default Reductions +@cindex default reductions +@findex %define lr.default-reductions +@findex %nonassoc + +After parser table construction, Bison identifies the reduction with the +largest lookahead set in each parser state. To reduce the size of the +parser state, traditional Bison behavior is to remove that lookahead set and +to assign that reduction to be the default parser action. Such a reduction +is known as a @dfn{default reduction}. + +Default reductions affect more than the size of the parser tables. They +also affect the behavior of the parser: + +@itemize +@item Delayed @code{yylex} invocations. + +@cindex delayed yylex invocations +@cindex consistent states +@cindex defaulted states +A @dfn{consistent state} is a state that has only one possible parser +action. If that action is a reduction and is encoded as a default +reduction, then that consistent state is called a @dfn{defaulted state}. +Upon reaching a defaulted state, a Bison-generated parser does not bother to +invoke @code{yylex} to fetch the next token before performing the reduction. +In other words, whether default reductions are enabled in consistent states +determines how soon a Bison-generated parser invokes @code{yylex} for a +token: immediately when it @emph{reaches} that token in the input or when it +eventually @emph{needs} that token as a lookahead to determine the next +parser action. Traditionally, default reductions are enabled, and so the +parser exhibits the latter behavior. + +The presence of defaulted states is an important consideration when +designing @code{yylex} and the grammar file. That is, if the behavior of +@code{yylex} can influence or be influenced by the semantic actions +associated with the reductions in defaulted states, then the delay of the +next @code{yylex} invocation until after those reductions is significant. +For example, the semantic actions might pop a scope stack that @code{yylex} +uses to determine what token to return. Thus, the delay might be necessary +to ensure that @code{yylex} does not look up the next token in a scope that +should already be considered closed. + +@item Delayed syntax error detection. + +@cindex delayed syntax error detection +When the parser fetches a new token by invoking @code{yylex}, it checks +whether there is an action for that token in the current parser state. The +parser detects a syntax error if and only if either (1) there is no action +for that token or (2) the action for that token is the error action (due to +the use of @code{%nonassoc}). However, if there is a default reduction in +that state (which might or might not be a defaulted state), then it is +impossible for condition 1 to exist. That is, all tokens have an action. +Thus, the parser sometimes fails to detect the syntax error until it reaches +a later state. + +@cindex LAC +@c If there's an infinite loop, default reductions can prevent an incorrect +@c sentence from being rejected. +While default reductions never cause the parser to accept syntactically +incorrect sentences, the delay of syntax error detection can have unexpected +effects on the behavior of the parser. However, the delay can be caused +anyway by parser state merging and the use of @code{%nonassoc}, and it can +be fixed by another Bison feature, LAC. We discuss the effects of delayed +syntax error detection and LAC more in the next section (@pxref{LAC}). +@end itemize + +For canonical LR, the only default reduction that Bison enables by default +is the accept action, which appears only in the accepting state, which has +no other action and is thus a defaulted state. However, the default accept +action does not delay any @code{yylex} invocation or syntax error detection +because the accept action ends the parse. + +For LALR and IELR, Bison enables default reductions in nearly all states by +default. There are only two exceptions. First, states that have a shift +action on the @code{error} token do not have default reductions because +delayed syntax error detection could then prevent the @code{error} token +from ever being shifted in that state. However, parser state merging can +cause the same effect anyway, and LAC fixes it in both cases, so future +versions of Bison might drop this exception when LAC is activated. Second, +GLR parsers do not record the default reduction as the action on a lookahead +token for which there is a conflict. The correct action in this case is to +split the parse instead. + +To adjust which states have default reductions enabled, use the +@code{%define lr.default-reductions} directive. + +@deffn {Directive} {%define lr.default-reductions @var{WHERE}} +Specify the kind of states that are permitted to contain default reductions. +The accepted values of @var{WHERE} are: +@itemize +@item @code{all} (default for LALR and IELR) +@item @code{consistent} +@item @code{accepting} (default for canonical LR) +@end itemize + +(The ability to specify where default reductions are permitted is +experimental. More user feedback will help to stabilize it.) +@end deffn + +FIXME: Because of the exceptions described above, @code{all} is a misnomer. +Rename to @code{full}. + +@node LAC +@subsection LAC +@findex %define parse.lac +@cindex LAC +@cindex lookahead correction + +Canonical LR, IELR, and LALR can suffer from a couple of problems upon +encountering a syntax error. First, the parser might perform additional +parser stack reductions before discovering the syntax error. Such +reductions can perform user semantic actions that are unexpected because +they are based on an invalid token, and they cause error recovery to begin +in a different syntactic context than the one in which the invalid token was +encountered. Second, when verbose error messages are enabled (@pxref{Error +Reporting}), the expected token list in the syntax error message can both +contain invalid tokens and omit valid tokens. + +The culprits for the above problems are @code{%nonassoc}, default reductions +in inconsistent states (@pxref{Default Reductions}), and parser state +merging. Because IELR and LALR merge parser states, they suffer the most. +Canonical LR can suffer only if @code{%nonassoc} is used or if default +reductions are enabled for inconsistent states. + +LAC (Lookahead Correction) is a new mechanism within the parsing algorithm +that solves these problems for canonical LR, IELR, and LALR without +sacrificing @code{%nonassoc}, default reductions, or state merging. You can +enable LAC with the @code{%define parse.lac} directive. + +@deffn {Directive} {%define parse.lac @var{VALUE}} +Enable LAC to improve syntax error handling. +@itemize +@item @code{none} (default) +@item @code{full} +@end itemize +(This feature is experimental. More user feedback will help to stabilize +it. Moreover, it is currently only available for deterministic parsers in +C.) +@end deffn + +Conceptually, the LAC mechanism is straight-forward. Whenever the parser +fetches a new token from the scanner so that it can determine the next +parser action, it immediately suspends normal parsing and performs an +exploratory parse using a temporary copy of the normal parser state stack. +During this exploratory parse, the parser does not perform user semantic +actions. If the exploratory parse reaches a shift action, normal parsing +then resumes on the normal parser stacks. If the exploratory parse reaches +an error instead, the parser reports a syntax error. If verbose syntax +error messages are enabled, the parser must then discover the list of +expected tokens, so it performs a separate exploratory parse for each token +in the grammar. + +There is one subtlety about the use of LAC. That is, when in a consistent +parser state with a default reduction, the parser will not attempt to fetch +a token from the scanner because no lookahead is needed to determine the +next parser action. Thus, whether default reductions are enabled in +consistent states (@pxref{Default Reductions}) affects how soon the parser +detects a syntax error: immediately when it @emph{reaches} an erroneous +token or when it eventually @emph{needs} that token as a lookahead to +determine the next parser action. The latter behavior is probably more +intuitive, so Bison currently provides no way to achieve the former behavior +while default reductions are enabled in consistent states. + +Thus, when LAC is in use, for some fixed decision of whether to enable +default reductions in consistent states, canonical LR and IELR behave almost +exactly the same for both syntactically acceptable and syntactically +unacceptable input. While LALR still does not support the full +language-recognition power of canonical LR and IELR, LAC at least enables +LALR's syntax error handling to correctly reflect LALR's +language-recognition power. + +There are a few caveats to consider when using LAC: + +@itemize +@item Infinite parsing loops. + +IELR plus LAC does have one shortcoming relative to canonical LR. Some +parsers generated by Bison can loop infinitely. LAC does not fix infinite +parsing loops that occur between encountering a syntax error and detecting +it, but enabling canonical LR or disabling default reductions sometimes +does. + +@item Verbose error message limitations. + +Because of internationalization considerations, Bison-generated parsers +limit the size of the expected token list they are willing to report in a +verbose syntax error message. If the number of expected tokens exceeds that +limit, the list is simply dropped from the message. Enabling LAC can +increase the size of the list and thus cause the parser to drop it. Of +course, dropping the list is better than reporting an incorrect list. + +@item Performance. + +Because LAC requires many parse actions to be performed twice, it can have a +performance penalty. However, not all parse actions must be performed +twice. Specifically, during a series of default reductions in consistent +states and shift actions, the parser never has to initiate an exploratory +parse. Moreover, the most time-consuming tasks in a parse are often the +file I/O, the lexical analysis performed by the scanner, and the user's +semantic actions, but none of these are performed during the exploratory +parse. Finally, the base of the temporary stack used during an exploratory +parse is a pointer into the normal parser state stack so that the stack is +never physically copied. In our experience, the performance penalty of LAC +has proven insignificant for practical grammars. +@end itemize + +@node Unreachable States +@subsection Unreachable States +@findex %define lr.keep-unreachable-states +@cindex unreachable states + +If there exists no sequence of transitions from the parser's start state to +some state @var{s}, then Bison considers @var{s} to be an @dfn{unreachable +state}. A state can become unreachable during conflict resolution if Bison +disables a shift action leading to it from a predecessor state. + +By default, Bison removes unreachable states from the parser after conflict +resolution because they are useless in the generated parser. However, +keeping unreachable states is sometimes useful when trying to understand the +relationship between the parser and the grammar. + +@deffn {Directive} {%define lr.keep-unreachable-states @var{VALUE}} +Request that Bison allow unreachable states to remain in the parser tables. +@var{VALUE} must be a Boolean. The default is @code{false}. +@end deffn + +There are a few caveats to consider: + +@itemize @bullet +@item Missing or extraneous warnings. + +Unreachable states may contain conflicts and may use rules not used in any +other state. Thus, keeping unreachable states may induce warnings that are +irrelevant to your parser's behavior, and it may eliminate warnings that are +relevant. Of course, the change in warnings may actually be relevant to a +parser table analysis that wants to keep unreachable states, so this +behavior will likely remain in future Bison releases. + +@item Other useless states. + +While Bison is able to remove unreachable states, it is not guaranteed to +remove other kinds of useless states. Specifically, when Bison disables +reduce actions during conflict resolution, some goto actions may become +useless, and thus some additional states may become useless. If Bison were +to compute which goto actions were useless and then disable those actions, +it could identify such states as unreachable and then remove those states. +However, Bison does not compute which goto actions are useless. +@end itemize + @node Generalized LR Parsing @section Generalized LR (GLR) Parsing @cindex GLR parsing @@ -9500,8 +9691,9 @@ propagated. @end example @noindent -Use the following two directives to enable parser tracing and verbose -error messages. +Use the following two directives to enable parser tracing and verbose error +messages. However, verbose error messages can contain incorrect information +(@pxref{LAC}). @comment file: calc++-parser.yy @example @@ -10908,9 +11100,9 @@ Precedence}. @end deffn @end ifset -@deffn {Directive} %define @var{define-variable} -@deffnx {Directive} %define @var{define-variable} @var{value} -@deffnx {Directive} %define @var{define-variable} "@var{value}" +@deffn {Directive} %define @var{variable} +@deffnx {Directive} %define @var{variable} @var{value} +@deffnx {Directive} %define @var{variable} "@var{value}" Define a variable to adjust Bison's behavior. @xref{%define Summary}. @end deffn @@ -10952,7 +11144,8 @@ token is reset to the token that originally caused the violation. @end deffn @deffn {Directive} %error-verbose -An obsolete directive standing for @samp{%define parse.error verbose}. +An obsolete directive standing for @samp{%define parse.error verbose} +(@pxref{Error Reporting, ,The Error Reporting Function @code{yyerror}}). @end deffn @deffn {Directive} %file-prefix "@var{prefix}" @@ -11305,7 +11498,7 @@ Data type of semantic values; @code{int} by default. @cindex glossary @table @asis -@item Accepting State +@item Accepting state A state whose only action is the accept action. The accepting state is thus a consistent state. @xref{Understanding,,}. @@ -11316,9 +11509,8 @@ by John Backus, and slightly improved by Peter Naur in his 1960-01-02 committee document contributing to what became the Algol 60 report. @xref{Language and Grammar, ,Languages and Context-Free Grammars}. -@item Consistent State -A state containing only one possible action. @xref{%define -Summary,,lr.default-reductions}. +@item Consistent state +A state containing only one possible action. @xref{Default Reductions}. @item Context-free grammars Grammars specified as rules that can be applied regardless of context. @@ -11327,12 +11519,15 @@ expression, integers are allowed @emph{anywhere} an expression is permitted. @xref{Language and Grammar, ,Languages and Context-Free Grammars}. -@item Default Reduction +@item Default reduction The reduction that a parser should perform if the current parser state contains no other action for the lookahead token. In permitted parser -states, Bison declares the reduction with the largest lookahead set to -be the default reduction and removes that lookahead set. -@xref{%define Summary,,lr.default-reductions}. +states, Bison declares the reduction with the largest lookahead set to be +the default reduction and removes that lookahead set. @xref{Default +Reductions}. + +@item Defaulted state +A consistent state with a default reduction. @xref{Default Reductions}. @item Dynamic allocation Allocation of memory that occurs during execution, rather than at @@ -11364,17 +11559,16 @@ A language construct that is (in general) grammatically divisible; for example, `expression' or `declaration' in C@. @xref{Language and Grammar, ,Languages and Context-Free Grammars}. -@item IELR(1) -A minimal LR(1) parser table generation algorithm. That is, given any +@item IELR(1) (Inadequacy Elimination LR(1)) +A minimal LR(1) parser table construction algorithm. That is, given any context-free grammar, IELR(1) generates parser tables with the full -language recognition power of canonical LR(1) but with nearly the same -number of parser states as LALR(1). This reduction in parser states -is often an order of magnitude. More importantly, because canonical -LR(1)'s extra parser states may contain duplicate conflicts in the -case of non-LR(1) grammars, the number of conflicts for IELR(1) is -often an order of magnitude less as well. This can significantly -reduce the complexity of developing of a grammar. @xref{%define -Summary,,lr.type}. +language-recognition power of canonical LR(1) but with nearly the same +number of parser states as LALR(1). This reduction in parser states is +often an order of magnitude. More importantly, because canonical LR(1)'s +extra parser states may contain duplicate conflicts in the case of non-LR(1) +grammars, the number of conflicts for IELR(1) is often an order of magnitude +less as well. This can significantly reduce the complexity of developing a +grammar. @xref{LR Table Construction}. @item Infix operator An arithmetic operator that is placed between the operands on which it @@ -11385,12 +11579,11 @@ A continuous flow of data between devices or programs. @item LAC (Lookahead Correction) A parsing mechanism that fixes the problem of delayed syntax error -detection, which is caused by LR state merging, default reductions, -and the use of @code{%nonassoc}. Delayed syntax error detection -results in unexpected semantic actions, initiation of error recovery -in the wrong syntactic context, and an incorrect list of expected -tokens in a verbose syntax error message. @xref{%define -Summary,,parse.lac}. +detection, which is caused by LR state merging, default reductions, and the +use of @code{%nonassoc}. Delayed syntax error detection results in +unexpected semantic actions, initiation of error recovery in the wrong +syntactic context, and an incorrect list of expected tokens in a verbose +syntax error message. @xref{LAC}. @item Language construct One of the typical usage schemas of the language. For example, one of @@ -11506,6 +11699,11 @@ the lexical analyzer. @xref{Symbols}. A grammar symbol that has no rules in the grammar and therefore is grammatically indivisible. The piece of text it represents is a token. @xref{Language and Grammar, ,Languages and Context-Free Grammars}. + +@item Unreachable state +A parser state to which there does not exist a sequence of transitions from +the parser's start state. A state can become unreachable during conflict +resolution. @xref{Unreachable States}. @end table @node Copying This Manual -- 2.45.2