-*- outline -*-

* Coding system independence
Paul notes:

	Currently Bison assumes 8-bit bytes (i.e. that UCHAR_MAX is
	255).  It also assumes that the 8-bit character encoding is
	the same for the invocation of 'bison' as it is for the
	invocation of 'cc', but this is not necessarily true when
	people run bison on an ASCII host and then use cc on an EBCDIC
	host.  I don't think these topics are worth our time
	addressing (unless we find a gung-ho volunteer for EBCDIC or
	PDP-10 ports :-) but they should probably be documented
	somewhere.
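
Until this is documented, the assumptions could at least be checked
where the parser is compiled.  A minimal sketch; the helper name
host_charset_looks_ascii is hypothetical and nothing like it exists in
Bison:

```c
#include <limits.h>

/* Refuse to compile where bytes are not 8 bits wide.  */
#if UCHAR_MAX != 255
# error "this sketch assumes 8-bit bytes (UCHAR_MAX == 255)"
#endif

/* Hypothetical helper: return 1 if the compiler's execution character
   set agrees with ASCII on a few probe characters, 0 otherwise (for
   instance on an EBCDIC host, where 'a' is 129).  */
static int
host_charset_looks_ascii (void)
{
  return 'a' == 97 && 'A' == 65 && '0' == 48 && '+' == 43;
}
```

A few probe characters do not prove the two encodings are identical;
they only catch the gross ASCII/EBCDIC mismatch described above.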

* Using enums instead of int for tokens.
Paul suggests:

   #ifndef YYTOKENTYPE
   # if defined (__STDC__) || defined (__cplusplus)
      /* Put the tokens into the symbol table, so that GDB and other debuggers
         know about them.  */
      enum yytokentype {
        FOO = 256,
        BAR,
        ...
      };
      /* POSIX requires `int' for tokens in interfaces.  */
   #  define YYTOKENTYPE int
   # endif
   #endif
   #define FOO 256
   #define BAR 257
   ...

> I'm in favor of
>
> %token FOO 256
> %token BAR 257
>
> and Bison moves error into 258.

Yes, I think that's a valid extension too, if the user doesn't define
the token number for error.

* Unit rules
Maybe we could expand unit rules, i.e., transform

	exp: arith | bool;
	arith: exp '+' exp;
	bool: exp '&' exp;

into

	exp: exp '+' exp | exp '&' exp;

when there are no actions.  This can significantly speed up the
parsers generated for some grammars.
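
The transformation can be sketched on a toy rule table.  Everything
below (struct rule, expand_unit_rules, prune_unused, the size limits)
is made up for illustration and does not reflect Bison's internal
data structures:

```c
#include <string.h>

#define MAX_RHS 8
#define MAX_RULES 32

/* One grammar rule.  All names here are illustrative.  */
struct rule
{
  const char *lhs;
  const char *rhs[MAX_RHS];
  int nrhs;
  int has_action;               /* nonzero if a semantic action is attached */
};

/* Return 1 if NAME has at least one rule and none of its rules carries
   an action, i.e. NAME is a nonterminal that is safe to inline.  */
static int
inlinable (const struct rule *rules, int n, const char *name)
{
  int i, found = 0;
  for (i = 0; i < n; i++)
    if (strcmp (rules[i].lhs, name) == 0)
      {
        if (rules[i].has_action)
          return 0;
        found = 1;
      }
  return found;
}

/* Copy IN[0..n) to OUT, replacing each action-free unit rule "A: B" by
   one rule "A: rhs" per rule "B: rhs".  This is a single pass: a chain
   such as "A: B" with "B: C" is expanded one level only.  Return the
   number of rules written, or -1 if OUT (capacity MAX_OUT) is too small.  */
static int
expand_unit_rules (const struct rule *in, int n, struct rule *out, int max_out)
{
  int i, j, nout = 0;
  for (i = 0; i < n; i++)
    {
      int unit = (in[i].nrhs == 1 && !in[i].has_action
                  && strcmp (in[i].lhs, in[i].rhs[0]) != 0
                  && inlinable (in, n, in[i].rhs[0]));
      for (j = 0; j < n; j++)
        {
          /* A non-unit rule is copied once (j == i); a unit rule is
             replaced by every rule of its right-hand side symbol.  */
          if (unit ? strcmp (in[j].lhs, in[i].rhs[0]) != 0 : j != i)
            continue;
          if (nout == max_out)
            return -1;
          out[nout] = in[j];
          out[nout].lhs = in[i].lhs;
          nout++;
        }
    }
  return nout;
}

/* Drop, in place, rules whose left-hand side is neither START nor used
   on any right-hand side.  Return the new rule count, or -1 if N
   exceeds MAX_RULES.  */
static int
prune_unused (struct rule *rules, int n, const char *start)
{
  int keep[MAX_RULES];
  int i, j, k, kept = 0;
  if (n > MAX_RULES)
    return -1;
  for (i = 0; i < n; i++)
    {
      keep[i] = strcmp (rules[i].lhs, start) == 0;
      for (j = 0; j < n && !keep[i]; j++)
        for (k = 0; k < rules[j].nrhs; k++)
          if (strcmp (rules[j].rhs[k], rules[i].lhs) == 0)
            keep[i] = 1;
    }
  for (i = 0; i < n; i++)
    if (keep[i])
      rules[kept++] = rules[i];
  return kept;
}
```

Applied to the four rules above with exp as the start symbol, the
expansion followed by the pruning leaves only the two exp rules.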

* Stupid error messages
An example shows it easily:

src/bison/tests % ./testsuite -k calc,location,error-verbose -l
GNU Bison 1.49a test suite test groups:

 NUM: FILENAME:LINE      TEST-GROUP-NAME
      KEYWORDS

  51: calc.at:440        Calculator --locations --yyerror-verbose
  52: calc.at:442        Calculator --defines --locations --name-prefix=calc --verbose --yacc --yyerror-verbose
  54: calc.at:445        Calculator --debug --defines --locations --name-prefix=calc --verbose --yacc --yyerror-verbose
src/bison/tests % ./testsuite 51 -d
## --------------------------- ##
## GNU Bison 1.49a test suite. ##
## --------------------------- ##
 51: calc.at:440       ok
## ---------------------------- ##
## All 1 tests were successful. ##
## ---------------------------- ##
src/bison/tests % cd ./testsuite.dir/51
tests/testsuite.dir/51 % echo "()" | ./calc
1.2-1.3: parse error, unexpected ')', expecting error or "number" or '-' or '('
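
The stupid part is "expecting error": the error token means nothing
to the user.  A sketch of a message builder that skips it; the
function format_expected and its signature are invented here, this is
not how the skeleton builds the message:

```c
#include <stdio.h>
#include <string.h>

/* Write "parse error, unexpected X, expecting A or B ..." into BUF
   (SIZE > 0 bytes), leaving out the internal `error' token.  The
   message is silently truncated if BUF is too small.  */
static void
format_expected (char *buf, size_t size, const char *unexpected,
                 const char *const *expected, int n)
{
  int i, shown = 0;
  snprintf (buf, size, "parse error, unexpected %s", unexpected);
  for (i = 0; i < n; i++)
    {
      size_t len;
      if (strcmp (expected[i], "error") == 0)
        continue;               /* never advertise the error token */
      len = strlen (buf);
      snprintf (buf + len, size - len, "%s%s",
                shown++ ? " or " : ", expecting ", expected[i]);
    }
}
```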

* read_pipe.c
This is not portable (to DOS, for instance).  Implement a more portable
scheme.  Sources of inspiration include GNU diff, and Free Recode.

* Memory leaks in the generator
A round of memory leak clean ups would be most welcome.  Dmalloc,
Checker GCC, Electric Fence, or Valgrind: choose your tool.

* Memory leaks in the parser
The same applies to the generated parsers.  In particular, this is
critical for user data: when aborting a parse, when handling the
error token, etc., we often throw away yylval without giving the user
a chance to clean it up.
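
One possible shape for such a hook, as a sketch only: the parser
calls a user-supplied destructor on each value it discards.  The
names value_t, value_destructor, discard_stack and free_str, and the
token numbers, are all hypothetical:

```c
#include <stdlib.h>

/* Stand-in for the user's YYSTYPE.  */
typedef union
{
  char *str;
  int num;
} value_t;

/* Hypothetical token numbers.  */
enum { TOK_NUM = 258, TOK_STR = 259 };

/* User hook: release whatever VALUE owns, given the symbol kind.  */
typedef void (*value_destructor) (int symbol, value_t *value);

/* What the parser would do when it aborts or pops during error
   recovery: hand every discarded value to DESTROY, top of stack first.  */
static void
discard_stack (const int *symbols, value_t *values, int depth,
               value_destructor destroy)
{
  while (depth > 0)
    {
      depth--;
      if (destroy)
        destroy (symbols[depth], &values[depth]);
    }
}

/* Example user destructor: only TOK_STR values own heap memory.
   FREED counts the calls that released something, for demonstration.  */
static int freed;

static void
free_str (int symbol, value_t *value)
{
  if (symbol == TOK_STR)
    {
      free (value->str);
      value->str = NULL;
      freed++;
    }
}
```

Passing the symbol number along matters because a union alone does
not say which member is live.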

* NEWS
Sort from 1.31 NEWS.

* Prologue
The %union is declared after the user C declarations. It can be
a problem if YYSTYPE is declared after the user part.	[]

Actually, the real problem seems to be that the %union ought to be output
where it was defined.  For instance, in gettext/intl/plural.y, we
have:

	%{
	...
	#include "gettextP.h"
	...
	%}

	%union {
	  unsigned long int num;
	  enum operator op;
	  struct expression *exp;
	}

	%{
	...
	static int yylex PARAMS ((YYSTYPE *lval, const char **pexp));
	...
	%}

Here the first part defines struct expression, the second uses it to
define YYSTYPE, and the last uses YYSTYPE.  Only this order is valid.

* --graph
Show reductions.	[]

* Broken options ?
** %no-lines		[ok]
** %no-parser		[]
** %pure-parser		[]
** %semantic-parser	[]
** %token-table		[]
** Options which could use parse_dquoted_param ().
Maybe transferred to lex.c.
*** %skeleton		[ok]
*** %output		[]
*** %file-prefix	[]
*** %name-prefix	[]

** Skeleton strategy.	[]
Must we keep %no-parser?
	     %token-table?
*** New skeletons.	[]

* src/print_graph.c
Find the best graph parameters.	[]

* doc/bison.texinfo
** Update
information about ERROR_VERBOSE.	[]
** Add explanations about
skeleton muscles.	[]
%skeleton.		[]

* testsuite
** tests/pure-parser.at	[]
New tests.

* Debugging parsers

From Greg McGary:

akim demaille <akim.demaille@epita.fr> writes:

> With great pleasure!  Nonetheless, things which are debatable
> (or not, but just `big') should be discussed in `public': something
> like help- or bug-bison@gnu.org is just fine.  Jesse and I are there,
> but there is also Jim and some other people.

I have no idea whether it qualifies as big or controversial, so I'll
just summarize for you.  I proposed this change years ago and was
surprised that it was met with utter indifference!

This debug feature is for the programs/grammars one develops with
bison, not for debugging bison itself.  I find that the YYDEBUG
output comes in a very inconvenient format for my purposes.
When debugging gcc, for instance, what I want is to see a trace of
the sequence of reductions and the line#s for the semantic actions
so I can follow what's happening.  Single-step in gdb doesn't cut it
because to move from one semantic action to the next takes you through
lots of internal machinery of the parser, which is uninteresting.

The change I made was to the format of the debug output, so that it
comes out in the format of C error messages, digestible by emacs
compile mode, like so:

grammar.y:1234: foo: bar(0x123456) baz(0x345678)

where "foo: bar baz" is the reduction rule, whose semantic action
appears on line 1234 of the bison grammar file grammar.y.  The hex
numbers on the rhs tokens are the parse-stack values associated with
those tokens.  Of course, YYSTYPE might be something totally
incompatible with that representation, but for the most part, YYSTYPE
values are single words (scalars or pointers).  In the case of gcc,
they're most often pointers to tree nodes.  Come to think of it, the
right thing to do is to make the printing of stack values be
user-definable.  It would also be useful to include the filename &
line# of the file being parsed, but the main filename & line# should
continue to be that of grammar.y.

Anyway, this feature has saved my life on numerous occasions.  The way
I customarily use it is to first run bison with the traces on, isolate
the sequence of reductions that interests me, put those traces in a
buffer and force it into compile-mode, then visit each of those lines
in the grammar and set breakpoints with C-x SPACE.  Then, I can run
again under the control of gdb and stop at each semantic action.
With the hex addresses of tree nodes, I can inspect the values
associated with any rhs token.

You like?
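
A sketch of a formatter producing the trace line described above,
assuming stack values fit in an unsigned long.  The function
format_reduction is hypothetical; it is not the YYDEBUG code:

```c
#include <stdio.h>
#include <string.h>

/* Write "FILE:LINE: LHS: RHS0(0x...) RHS1(0x...) ..." into BUF
   (SIZE > 0 bytes), the shape Emacs compile mode understands.
   VALUES holds the stack value of each right-hand side symbol,
   treated as a single word.  Output is truncated if BUF is too small.  */
static void
format_reduction (char *buf, size_t size, const char *file, int line,
                  const char *lhs, const char *const *rhs,
                  const unsigned long *values, int nrhs)
{
  int i;
  snprintf (buf, size, "%s:%d: %s:", file, line, lhs);
  for (i = 0; i < nrhs; i++)
    {
      size_t len = strlen (buf);
      snprintf (buf + len, size - len, " %s(0x%lx)", rhs[i], values[i]);
    }
}
```

A user-definable printer, as suggested above, would replace the
hard-wired 0x%lx conversion with a callback.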

* input synclines
Some users create their foo.y files, and equip them with #line.  Bison
should recognize these, and preserve them.
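
Recognizing such a line is the easy half.  A minimal sketch with
sscanf; parse_syncline and SYNC_FILE_MAX are invented names, and the
matching is looser than a real preprocessor (it would accept
"#line42", for example):

```c
#include <stdio.h>

#define SYNC_FILE_MAX 256

/* Try to read `#line N "FILE"' from TEXT.  Return 2 if both the line
   number and the file name were read, 1 if only the line number was
   (FILE is then left untouched), 0 if TEXT is not a #line directive.  */
static int
parse_syncline (const char *text, int *line, char file[SYNC_FILE_MAX])
{
  /* Blanks in the format match any amount of white space, including
     none; %255[^"] stops at the closing quote or after 255 bytes.  */
  int n = sscanf (text, " # line %d \"%255[^\"]\"", line, file);
  return n > 0 ? n : 0;
}
```

Preserving the directive then means emitting the recorded file name
and line number in the generated parser instead of those of foo.y.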

* BTYacc
See if we can integrate backtracking in Bison.  Contact the BTYacc
maintainers.

* Automaton report
Display more clearly the lookaheads for each item.

* RR conflicts
See if we can use precedence between rules to solve RR conflicts.  See
what POSIX says.

* Precedence
It is unfortunate that there is a total order for precedence.  It
makes it impossible to have modular precedence information.  We should
move to partial orders.
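
A sketch of what a partial order could look like: a "lower than"
relation closed transitively, where two levels from unrelated modules
simply do not compare.  All names (struct prec_order, prec_compare,
etc.) are hypothetical:

```c
#define MAX_LEVELS 16

enum prec_cmp { PREC_UNRELATED, PREC_LOWER, PREC_HIGHER, PREC_SAME };

/* lower[a][b] is nonzero when level A binds less tightly than level B.
   Levels are small integers in [0, n).  */
struct prec_order
{
  int n;
  unsigned char lower[MAX_LEVELS][MAX_LEVELS];
};

/* Record that level A has lower precedence than level B.  */
static void
prec_declare_lower (struct prec_order *po, int a, int b)
{
  po->lower[a][b] = 1;
}

/* Transitive closure (Warshall): a < k and k < b imply a < b.  */
static void
prec_close (struct prec_order *po)
{
  int i, j, k;
  for (k = 0; k < po->n; k++)
    for (i = 0; i < po->n; i++)
      for (j = 0; j < po->n; j++)
        if (po->lower[i][k] && po->lower[k][j])
          po->lower[i][j] = 1;
}

/* Compare two levels; call prec_close first.  */
static enum prec_cmp
prec_compare (const struct prec_order *po, int a, int b)
{
  if (a == b)
    return PREC_SAME;
  if (po->lower[a][b])
    return PREC_LOWER;
  if (po->lower[b][a])
    return PREC_HIGHER;
  return PREC_UNRELATED;
}
```

With a total order PREC_UNRELATED can never happen; here it would
leave the conflict unresolved rather than pick an arbitrary winner.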

* Parsing grammars
Rewrite the reader in Bison.