Commit 3135b4bb authored by POTTIER Francois

Release 3135b4bb.

parent 31fcf82c

# Developer guide

This guide is intended for new Menhir developers and explains how
things work.
For the moment, it contains little information.

## Build Instructions

There are two ways of recompiling Menhir after changing the sources.

To perform a single compilation pass, change into the `src` directory
and type `make`. This produces an executable file named `_stage1/menhir.native`.

To go further and check that Menhir can process its own grammar,
still in the `src` directory, type `make bootstrap`.
This produces an executable file named `_stage2/menhir.native`.

`make bootstrap` occasionally fails for no apparent reason. In that case,
run `make clean` before attempting `make bootstrap` again.

## Testing

To run Menhir's test suite, change into the `test` directory
and type `make test`. The package `functory` is required; install
it first via `opam install functory`.

The subdirectory `test/good` contains a number of correct `.mly` files.
The test suite checks that Menhir accepts these files and
compares the output of `menhir --only-preprocess` against an expected output.
It does not check that Menhir actually produces a working parser.

The subdirectory `test/bad` contains a number of incorrect `.mly` files.
The test suite checks that Menhir rejects these files
and produces the expected error message.

Some performance and correctness checks can be found in the directory `quicktest`;
see [quicktest/README](quicktest/README).

## About the Module Ordering

Some toplevel modules have side effects and must be executed in the
following order:

| Module | Task |
| --------------------- | ---- |
| Settings | parses the command line |
| PreFront | reads the grammar description files |
| TokenType | deals with `--only-tokens` and exits |
| Front | deals with `--depend`, `--infer`, `--only-preprocess`, and exits |
| Grammar | performs a number of analyses of the grammar |
| Lr0 | constructs the LR(0) automaton |
| Slr | determines whether the grammar is SLR |
| Lr1 | constructs the LR(1) automaton |
| Conflict | performs default conflict resolution and explains conflicts |
| Invariant | performs a number of analyses of the automaton |
| Interpret | deals with `--interpret` and exits |
| Back | produces the output and exits |

A few artificial dependencies have been added in the code in order
to ensure that this ordering is respected by `ocamlbuild`.
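
The effect of such an artificial dependency can be illustrated with a minimal sketch. The module names are borrowed from the table, but their contents (`verbose`, `grammar_files`, the `trace` reference) are hypothetical and are not Menhir's actual code. Within a single file, nested modules simply run from top to bottom; the `ignore` lines only show the kind of reference that, across separate compilation units, gives the build tool a dependency from which to derive the link order.

```ocaml
(* Records the order in which the toplevel code of each module runs. *)
let trace : string list ref = ref []

module Settings = struct
  (* Side effect performed at module initialization time. *)
  let () = trace := "Settings" :: !trace
  let verbose = false (* hypothetical setting *)
end

module PreFront = struct
  (* Artificial dependency: this mention of Settings serves no purpose
     other than making PreFront depend on Settings. *)
  let () = ignore Settings.verbose
  let () = trace := "PreFront" :: !trace
  let grammar_files = [ "parser.mly" ] (* hypothetical value *)
end

module Front = struct
  (* Artificial dependency on PreFront, in the same style. *)
  let () = ignore PreFront.grammar_files
  let () = trace := "Front" :: !trace
end

(* The order in which the three modules were initialized. *)
let execution_order = List.rev !trace
```

Here `execution_order` lists the modules in the order of the table. Because a compilation unit must be linked after the units it refers to, a reference of this kind is enough to constrain the order, at the cost of a line of code that looks useless out of context.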
This file contains a series of ideas and remarks that could be in
the TODO list -- except I do not intend to do anything about them,
for now.
* Add an --auto-inline pass, which marks certain symbols %inline according
  to a well-defined strategy. E.g.,
  - mark %inline every symbol that is referenced only once
  - mark %inline every symbol that has only one production
    (unconditional? subject to a size constraint?)
  - mark %inline every symbol whose productions have length <= 1
  Ideally, do this *after* the analyses that guarantee that every symbol
  is reachable and recognizes a nonempty language.
  Check that the auto-%inlined symbol has no %prec annotation.
  Document that --auto-inline requires pure semantic actions.
  (Note that --auto-inline turns midrule into endrule!)
  Should --auto-inline be a command line switch, or a directive %autoinline?
  Consider adding %noinline to prevent --auto-inline from inlining a symbol.
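
  The three candidate strategies above can be sketched as follows. This is
  only an illustration over a toy grammar representation; the types and
  function names are hypothetical and do not correspond to Menhir's internal
  data structures. The sketch ignores %prec annotations, does not exclude
  recursive symbols, and applies no size constraint, all of which a real pass
  would presumably have to handle.

```ocaml
type symbol = T of string | N of string

(* Each nonterminal is paired with the list of its productions. *)
type grammar = (string * symbol list list) list

(* Number of occurrences of the nonterminal [nt] in all right-hand sides. *)
let count_references (g : grammar) (nt : string) : int =
  List.fold_left
    (fun acc (_, prods) ->
      List.fold_left
        (fun acc rhs ->
          acc + List.length (List.filter (fun s -> s = N nt) rhs))
        acc prods)
    0 g

(* A criterion decides whether a nonterminal should be marked %inline. *)
type criterion = grammar -> string -> symbol list list -> bool

let referenced_once : criterion = fun g nt _ -> count_references g nt = 1
let single_production : criterion = fun _ _ prods -> List.length prods = 1
let short_productions : criterion =
  fun _ _ prods -> List.for_all (fun rhs -> List.length rhs <= 1) prods

(* The nonterminals selected by criterion [c]; the start symbol is skipped. *)
let candidates (c : criterion) ~(start : string) (g : grammar) : string list =
  List.filter_map
    (fun (nt, prods) -> if nt <> start && c g nt prods then Some nt else None)
    g

(* A small example grammar. *)
let example : grammar = [
  "main", [ [N "list"; T "EOF"] ];
  "list", [ [N "item"; N "opt"]; [N "list"; N "item"; N "opt"] ];
  "item", [ [T "INT"] ];
  "opt",  [ []; [T "COMMA"] ];
]
```

  On this example, the first criterion selects nothing, the second selects
  item, and the third selects item and opt, which shows how much the choice
  of strategy matters.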
* Incompatibility with ocamlyacc/yacc/bison: these tools guarantee
  to perform a default reduction without looking ahead at the next
  token, whereas Menhir does not.
  (See messages by Tiphaine Turpin from 30/08/2011 on.)
  - Changing this behavior would involve changing both back-ends.
  - Changing this behavior could break existing Menhir parsers.
    (Make it a command line option.)
  - This affects only people who are doing lexical feedback.
  - Suggestion by Frédéric Bour: allow annotating a production with %default
    to indicate that it should always be a default reduction. (Not sure why
    this helps, though.)
  - Think about end-of-stream conflicts, too.
    If there is a default reduction, there is no end-of-stream conflict.
    (Do we report one at the moment?)
    If there is a conflict, why do we arbitrarily solve it in favor of
    looking ahead (eliminating the reduction on #)? What does ocamlyacc do?
* Idea following our work with Jacques-Henri: allow asserting that the
distinction between certain tokens (say, {A, B, C}) cannot influence
which action is chosen. Or (maybe) asserting that no reduction can take
place when the lookahead symbol is one of {A, B, C}. The goal is to ensure
that a lexer hack is robust, i.e., even if the lexer cannot reliably
distinguish between A, B, C, this cannot cause the parser to become
misguided.
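
  One possible reading of these two assertions can be sketched as a check
  over an action table. The table representation below is hypothetical and
  much simpler than Menhir's; in particular, a shift action carries no target
  state, so that shifting A and shifting B count as the same action.

```ocaml
type action = Shift | Reduce of int | Error

(* For each state, the actions of the tokens that have one.
   A token that is absent from the list triggers an error. *)
type table = (string * action) list array

let action (t : table) (state : int) (tok : string) : action =
  match List.assoc_opt tok t.(state) with
  | Some a -> a
  | None -> Error

let states (t : table) : int list = List.init (Array.length t) (fun s -> s)

(* First assertion: the states where the tokens in [group] do not all
   trigger the same action. An empty result means the assertion holds. *)
let distinguishing_states (t : table) (group : string list) : int list =
  match group with
  | [] -> []
  | first :: rest ->
    List.filter
      (fun s ->
        let a = action t s first in
        List.exists (fun tok -> action t s tok <> a) rest)
      (states t)

(* Second assertion: the states where some token in [group] triggers
   a reduction. An empty result means the assertion holds. *)
let reducing_states (t : table) (group : string list) : int list =
  List.filter
    (fun s ->
      List.exists
        (fun tok -> match action t s tok with Reduce _ -> true | _ -> false)
        group)
    (states t)

(* A three-state example. *)
let example : table = [|
  [ "A", Shift; "B", Shift; "C", Shift ];
  [ "A", Reduce 3; "B", Reduce 3; "C", Reduce 3 ];
  [ "A", Reduce 1; "B", Shift ];
|]
```

  On this example, only state 2 distinguishes between A, B and C, whereas
  states 1 and 2 both reduce on one of these tokens, so the second assertion
  is the stricter of the two.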
* Incompatibility with ocamlyacc: Menhir discards the lookahead token when
entering error mode, whereas ocamlyacc doesn't. (Reported by Frédéric Bour.)
* Look into the format of Bison's tables, and see if we could produce such
tables upon demand. This could hopefully / possibly be implemented outside
Menhir using the .cmly API.
* Implement $0, $-1, etc.
  (Along the lines of Bison. They are a potentially dangerous feature, as
  they allow peeking into the stack, which requires the shape of the stack
  to be known / guessed by the programmer.)
  Propose a named syntax, perhaps <x> foo: ...
  where the name x is bound to the value in the topmost stack cell.
  Ensure that the mechanism is type-safe.
  (Is --infer required? I don't think so. Just issue type constraints.)
  (Requires an analysis so that the shape of the stack is known. The
  existing analysis in Invariant may be sufficient.)
  Implement it in both back-ends.
  On top of this mechanism, it is easy to implement mid-rule actions (à la Bison).
  On top of that, it should be easy to implement inherited attributes (à la BtYacc).
  --
  However, my impression so far is that the whole thing is of interest
  mainly to people who are doing lexer hacks. Otherwise, it is much easier
  to just parse, produce a tree, and transform the tree afterwards.
* Try generating a canonical automaton, resolving conflicts, and THEN
minimizing the automaton. Would the result be close to IELR(1)?
(Unfortunately, producing a canonical automaton remains a bit costly:
8 seconds for OCaml's grammar.)
* Why does --canonical --table consume a huge amount of time on a large grammar?
(3m57s for ocaml.mly, versus 16s without --table)
Display how much time is spent compressing tables.
* Explain end-of-stream conflicts, too.
* If one wishes to assign a priority level to a token, without choosing
an associativity status (%left, %right, %nonassoc) one should be allowed
to declare %neutral and obtain an unspecified associativity status
(causing an error if this status is ever consulted).
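
  A minimal sketch of the intended behavior, using the usual yacc-style
  resolution of a shift/reduce conflict by precedence. The names below are
  hypothetical, and an exception stands in for the error that Menhir would
  report while analyzing the grammar.

```ocaml
type assoc = Left | Right | NonAssoc | Neutral

type decision = DoShift | DoReduce | DoError

exception Unspecified_associativity of string

(* Resolves a shift/reduce conflict between a production whose priority
   level is [prod_level] and a token [tok] whose priority level is
   [tok_level]. The associativity status is consulted only when the two
   levels are equal. *)
let resolve ~prod_level ~tok ~tok_level ~tok_assoc : decision =
  if tok_level > prod_level then DoShift
  else if tok_level < prod_level then DoReduce
  else
    match tok_assoc with
    | Left -> DoReduce
    | Right -> DoShift
    | NonAssoc -> DoError
    | Neutral -> raise (Unspecified_associativity tok)
```

  A %neutral token thus behaves like any other token as long as the levels
  differ, and causes an error only in the one situation where its
  associativity would have to be known.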
* Since the table back-end does not use the module Invariant,
it should be possible to save time, in --table mode,
by not running the analysis in Invariant.
That would require making Invariant a functor,
and calling it inside CodeBackend, CoqBackend, and Interpret (complicated).
Or perhaps just running the computation on demand (using lazy)
but that makes timing more difficult.
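
  The on-demand variant can be sketched with OCaml's lazy suspensions. The
  analysis and the two back-ends below are hypothetical stand-ins, not
  Menhir's code; the counter exists only to show when the analysis runs.

```ocaml
(* Counts how many times the stand-in analysis actually runs. *)
let runs = ref 0

(* Hypothetical stand-in for the expensive analysis in Invariant. *)
let analyze () : int array =
  incr runs;
  Array.init 4 (fun state -> state * state)

(* The analysis is suspended: nothing is computed at this point. *)
let invariant : int array Lazy.t = lazy (analyze ())

(* A code-producing back-end forces the suspension when it needs the
   result. The first call pays for the analysis; later calls reuse it. *)
let code_backend_query (state : int) : int =
  (Lazy.force invariant).(state)

(* A table back-end never touches [invariant], so it never pays for it. *)
let table_backend () : string = "tables"
```

  The cost of the analysis is charged to whichever phase happens to force
  the suspension first, which is why timing becomes harder to interpret.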
* The warning "this production is never reduced" is sound but incomplete.
  See never_reduced.mly:
    a: b B | B {}
    b: {}
  where we get a warning that "b -> " is never reduced, but actually (as a
  result) "a -> b B" is never reduced either. Reword the current warning?
  Document the problem? Develop a new warning based on LRijkstra?
  By the same token, some states could be unreachable, without us knowing.
  What should we do about it?
  Note that the incremental API may allow reaching some states that LRijkstra
  would declare unreachable -- so, be careful.
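
  One simple way of extending the warning, short of using LRijkstra, would
  be to propagate the information: if every production of a nonterminal is
  never reduced, then no production that mentions this nonterminal can be
  reduced either. The sketch below is a hypothetical illustration over a toy
  grammar representation, not an existing Menhir analysis, and it remains
  incomplete in the same sense as the current warning.

```ocaml
type symbol = T of string | N of string

type production = { lhs : string; rhs : symbol list }

(* [never_reduced prods initial] closes the list [initial] of productions
   known never to be reduced under the propagation rule: a production
   whose right-hand side mentions a nonterminal all of whose productions
   are never reduced is itself never reduced. *)
let never_reduced (prods : production list) (initial : production list)
    : production list =
  (* A nonterminal is dead if all of its productions are in [dead]. *)
  let dead_nt dead nt =
    List.for_all (fun p -> p.lhs <> nt || List.mem p dead) prods in
  let rec fix dead =
    let newly =
      List.filter
        (fun p ->
          not (List.mem p dead)
          && List.exists
               (function N nt -> dead_nt dead nt | T _ -> false)
               p.rhs)
        prods in
    if newly = [] then dead else fix (dead @ newly)
  in
  fix initial

(* The grammar of never_reduced.mly, as quoted above. *)
let p1 = { lhs = "a"; rhs = [N "b"; T "B"] }
let p2 = { lhs = "a"; rhs = [T "B"] }
let p3 = { lhs = "b"; rhs = [] }

(* Start from the production that the current warning reports. *)
let result = never_reduced [p1; p2; p3] [p3]
```

  Here result contains both "b -> " and "a -> b B", but not "a -> B".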
* Implementing may_reduce by looping over the action table may be too
conservative? There may be situations where there is no reduce action
in the table (because they were killed off by conflict resolution) yet
this state is still capable of reducing this production.
* Read Chen and Pager's paper, "An Extension Of The Unit Production
Elimination Algorithm", and find out whether such an optimization
would be useful (beneficial) in the context of Menhir.
http://www.i3s.unice.fr/langages/pub/these-fortes.ps.gz
The lalr generator produces tables (rather than code) and runs 2-3 times faster than yacc;
it accepts EBNF and explains conflicts better than yacc. See J. Grosch, `Lalr---a generator for
efficient parsers', Software Practice & Experience 20(11):1115--1135, Nov 1990. I have read
the paper; it is of no interest. The conflict reporting algorithm is that of DeRemer
and Pennello (1982).
Jean Gallier and Karl Schimpf wrote a tool named LR1GEN, mentioned
in Gallier's online CV. Where can it be found? See Schimpf's thesis.
The code seems identical to Menhir's as far as Pager's algorithm is
concerned.
[Pager77]
A Practical General Method for Constructing LR(k) Parsers
David Pager
Acta Informatica 7, 1977, p. 249-268
[WeSha81]
LR -- Automatic Parser Generator and LR(1) Parser
Charles Wetherell, Alfred Shannon
IEEE Transactions on Software Engineering SE-7:3, May 1981, p. 274-278
[Ives86]
Unifying View of Recent LALR(1) Lookahead Set Algorithms
Fred Ives
SIGPLAN 1986 Symposium on Compiler Construction, p. 131-135
[BeSchi86]
A Practical Arbitrary Look-ahead LR Parsing Technique
Manuel Bermudez, Karl Schimpf
SIGPLAN 1986 Symposium on Compiler Construction, p. 136-144
[Spector88]
Efficient Full LR(1) Parser Generation
David Spector
SIGPLAN Notices 23:12, Dec 1988, p. 143-150
[Burshteyn94]
Algorithms in Muskok parser generator
Boris Burshteyn
comp.compilers, March 16, 1994
Pfahler: `Optimizing directly executable LR parsers', in Compiler Compilers (1990)
+ Aho and Ullman, `A technique for speeding up LR(k) parsers', SIAM J. Comput. 2(2), June 1973, 106-127.
Aho and Ullman, `Optimization of LR(k) parsers', J. Comput. Syst. Sci 6(6), December 1972, 573-602.
The notion of a "don't care error entry" is apparently important for eliminating "single productions".
Soisalon-Soininen, `Inessential error entries and their use in LR parser optimization'.
Table compression:
Optimization of parser tables for portable compilers (Dencker, Dürre, Heuft)
Minimizing Row Displacement Dispatch Tables (Karel Driesen and Urs Hölzle)
Other tools:
Jacc (Mark Jones) for Java (http://web.cecs.pdx.edu/~mpj/jacc/index.html)
Merr (Clinton Jeffery) for error handling (http://doi.acm.org/10.1145/937563.937566)
(http://unicon.sourceforge.net/merr/)
+ Idea for a speed optimization: keep the suffix of the stack whose shape is known
in extra parameters of the run functions (hence, if all goes well, in
registers). As a result, one allocates only when performing a shift with forgetting, and
one accesses memory only when performing a reduction with rediscovery. This
would yield for free a few optimizations of the kind "if we know that we are about to
reduce right away, then we do not allocate a stack cell (shiftreduce)".
A few assorted links:
Eelco Visser's thesis
http://www.cs.uu.nl/people/visser/ftp/Vis97.ps.gz
COCOM tool set
http://cocom.sourceforge.net/
YACC/M
http://david.tribble.com/yaccm.html
Comp.compilers: Algorithms in Muskok parser generator
http://compilers.iecc.com/comparch/article/94-03-067
Producing tables. See
"Optimization of parser tables for portable compilers",
http://portal.acm.org/citation.cfm?id=1802&coll=portal&dl=ACM
(Can compression replace an error with a reduction?)
Tarjan & Yao, "Storing a Sparse Table",
http://doi.acm.org/10.1145/359168.359175
Bison's approach,
http://www.cs.uic.edu/~spopuri/cparser.html
Error recovery:
See my journal at Dec 7, 2015.
"Comparison of Syntactic Error Handling in LR Parsers" (Degano & Priami, 1995)
"Natural and Flexible Error Recovery for Generated Modular Language Environments" (de Jonge et al., 2012)
Error recovery in a GLR parser; indentation-aware.
See also "Natural and Flexible Error Recovery for Generated Parsers",
available as a 2009 technical report
(http://swerl.tudelft.nl/twiki/pub/Main/TechnicalReports/TUD-SERG-2009-024.pdf).