From 99a9344e7752b06a00db033a203c81db9d58a832 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Mon, 21 Jun 2004 20:55:20 +0000 Subject: [PATCH] New section "Simple GLR Parsers". --- ChangeLog | 4 + doc/bison.texinfo | 211 +++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 212 insertions(+), 3 deletions(-) diff --git a/ChangeLog b/ChangeLog index 3f67a0c9..e2a4defa 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,7 @@ +2004-06-21 Frank Heckenbach + + * doc/bison.texinfo (Simple GLR Parsers): New section. + 2004-06-21 Paul Eggert * NEWS, TODO, doc/bison.texinfo: diff --git a/doc/bison.texinfo b/doc/bison.texinfo index 1feb4073..f4464467 100644 --- a/doc/bison.texinfo +++ b/doc/bison.texinfo @@ -135,7 +135,8 @@ The Concepts of Bison a semantic value (the value of an integer, the name of an identifier, etc.). * Semantic Actions:: Each rule can have an action containing C code. -* GLR Parsers:: Writing parsers for general context-free languages +* GLR Parsers:: Writing parsers for general context-free languages. +* Simple GLR Parsers:: Using GLR in its simplest form. * Locations Overview:: Tracking Locations. * Bison Parser:: What are Bison's input and output, how is the output used? @@ -381,7 +382,8 @@ use Bison or Yacc, we suggest you start by reading this chapter carefully. a semantic value (the value of an integer, the name of an identifier, etc.). * Semantic Actions:: Each rule can have an action containing C code. -* GLR Parsers:: Writing parsers for general context-free languages +* GLR Parsers:: Writing parsers for general context-free languages. +* Simple GLR Parsers:: Using GLR in its simplest form. * Locations Overview:: Tracking Locations. * Bison Parser:: What are Bison's input and output, how is the output used? @@ -860,6 +862,208 @@ will suffice. Otherwise, we suggest %@} @end example +@node Simple GLR Parsers +@section Using @acronym{GLR} in its Simplest Form +@cindex @acronym{GLR} parsing, unambiguous grammars +@cindex generalized @acronym{LR} (@acronym{GLR}) parsing, unambiguous grammars +@findex %glr-parser +@findex %expect-rr +@cindex conflicts +@cindex reduce/reduce conflicts + +The C++ example for @acronym{GLR} (@pxref{GLR Parsers}) explains how to use +the @acronym{GLR} parsing algorithm with some advanced features such as +@samp{%dprec} and @samp{%merge} to handle syntactically ambiguous +grammars. However, the @acronym{GLR} algorithm can also be used in a simpler +way to parse grammars that are unambiguous, but fail to be @acronym{LALR}(1). +Such grammars typically require more than one symbol of lookahead, +or (in rare cases) fall into the category of grammars in which the +@acronym{LALR}(1) algorithm throws away too much information (they are in +@acronym{LR}(1), but not @acronym{LALR}(1), @ref{Mystery Conflicts}). + +Here is an example of this situation, using a problem that +arises in the declaration of enumerated and subrange types in the +programming language Pascal. These declarations look like this: + +@example +type subrange = lo .. hi; +type enum = (a, b, c); +@end example + +@noindent +The original language standard allows only numeric +literals and constant identifiers for the subrange bounds (@samp{lo} +and @samp{hi}), but Extended Pascal (ISO/IEC 10206:1990) and many other +Pascal implementations allow arbitrary expressions there. This gives +rise to the following situation, containing a superfluous pair of +parentheses: + +@example +type subrange = (a) .. b; +@end example + +@noindent +Compare this to the following declaration of an enumerated +type with only one value: + +@example +type enum = (a); +@end example + +@noindent +(These declarations are contrived, but they are syntactically +valid, and more-complicated cases can come up in practical programs.) + +These two declarations look identical until the @samp{..} token. +With normal @acronym{LALR}(1) one-token look-ahead it is not +possible to decide between the two forms when the identifier +@samp{a} is parsed. It is, however, desirable +for a parser to decide this, since in the latter case +@samp{a} must become a new identifier to represent the enumeration +value, while in the former case @samp{a} must be evaluated with its +current meaning, which may be a constant or even a function call. + +You could parse @samp{(a)} as an ``unspecified identifier in parentheses'', +to be resolved later, but this typically requires substantial +contortions in both semantic actions and large parts of the +grammar, where the parentheses are nested in the recursive rules for +expressions. + +You might think of using the lexer to distinguish between the two +forms by returning different tokens for currently defined and +undefined identifiers. But if these declarations occur in a local +scope, and @samp{a} is defined in an outer scope, then both forms +are possible---either locally redefining @samp{a}, or using the +value of @samp{a} from the outer scope. So this approach cannot +work. + +A solution to this problem is to use a @acronym{GLR} parser in its simplest +form, i.e., without using special features such as @samp{%dprec} and +@samp{%merge}. When the @acronym{GLR} parser reaches the critical state, it +simply splits into two branches and pursues both syntax rules +simultaneously. Sooner or later, one of them runs into a parsing +error. If there is a @samp{..} token before the next +@samp{;}, the rule for enumerated types fails since it cannot +accept @samp{..} anywhere; otherwise, the subrange type rule +fails since it requires a @samp{..} token. So one of the branches +fails silently, and the other one continues normally, performing +all the intermediate actions that were postponed during the split. + +If the input is syntactically incorrect, both branches fail and the parser +reports a syntax error as usual. + +The effect of all this is that the parser seems to ``guess'' the +correct branch to take, or in other words, it seems to use more +look-ahead than the underlying @acronym{LALR}(1) algorithm actually allows +for. In this example, @acronym{LALR}(2) would suffice, but also some cases +that are not @acronym{LALR}(@math{k}) for any @math{k} can be handled this way. + +Since there can be only two branches and at least one of them +must fail, you need not worry about merging the branches by +using dynamic precedence or @samp{%merge}. + +Another potential problem of @acronym{GLR} does not arise here, either. In +general, a @acronym{GLR} parser can take quadratic or cubic worst-case time, +and the current Bison parser even takes exponential time and space +for some grammars. In practice, this rarely happens, and for many +grammars it is possible to prove that it cannot happen. In +in the present example, there is only one conflict between two +rules, and the type-declaration context where the conflict +arises cannot be nested. So the number of +branches that can exist at any time is limited by the constant 2, +and the parsing time is still linear. + +So here we have a case where we can use the benefits of @acronym{GLR}, almost +without disadvantages. There are two things to note, though. +First, one should carefully analyze the conflicts reported by +Bison to make sure that @acronym{GLR} splitting is done only where it is +intended to be. A @acronym{GLR} parser splitting inadvertently may cause +problems less obvious than an @acronym{LALR} parser statically choosing the +wrong alternative in a conflict. + +Second, interactions with the lexer (@pxref{Semantic Tokens}) must +be considered with great care. Since a split parser consumes tokens +without performing any actions during the split, the lexer cannot +obtain information via parser actions. Some cases of +lexer interactions can simply be eliminated by using @acronym{GLR}, i.e., +shifting the complications from the lexer to the parser. Remaining +cases have to be checked for safety. + +In our example, it would be safe for the lexer to return tokens +based on their current meanings in some symbol table, because no new +symbols are defined in the middle of a type declaration. Though it +is possible for a parser to define the enumeration +constants as they are parsed, before the type declaration is +completed, it actually makes no difference since they cannot be used +within the same enumerated type declaration. + +Here is a Bison grammar corresponding to the example above. It +parses a vastly simplified form of Pascal type declarations. + +@example +%token TYPE DOTDOT ID + +@group +%left '+' '-' +%left '*' '/' +@end group + +%% + +@group +type_decl: + TYPE ID '=' type ';' +; +@end group + +@group +type: '(' id_list ')' + | expr DOTDOT expr +; +@end group + +@group +id_list: ID + | id_list ',' ID +; +@end group + +@group +expr: '(' expr ')' + | expr '+' expr + | expr '-' expr + | expr '*' expr + | expr '/' expr + | ID +; +@end group +@end example + +When used as a normal @acronym{LALR}(1) grammar, Bison correctly complains +about one reduce/reduce conflict. In the conflicting situation the +parser chooses one of the alternatives, arbitrarily the one +declared first. Therefore the following correct input is not +recognized: + +@example +type t = (a) .. b; +@end example + +The parser can be turned into a @acronym{GLR} parser, while also telling Bison +to be silent about the one known reduce/reduce conflict, simply by +adding these two declarations to the Bison input file: + +@example +%glr-parser +%expect-rr 1 +@end example + +@noindent +No change in the grammar itself is required. Now the +parser recognizes all valid declarations, according to the +limited syntax above, transparently. In fact, the user does not even +notice when the parser splits. + @node Locations Overview @section Locations @cindex location @@ -1290,7 +1494,7 @@ not require it. You can add or change white space as much as you wish. For example, this: @example -exp : NUM | exp exp '+' @{$$ = $1 + $2; @} | @dots{} +exp : NUM | exp exp '+' @{$$ = $1 + $2; @} | @dots{} ; @end example @noindent @@ -1300,6 +1504,7 @@ means the same thing as this: exp: NUM | exp exp '+' @{ $$ = $1 + $2; @} | @dots{} +; @end example @noindent -- 2.47.2