manual.tex 187 KB
Newer Older
1
2
\def\true{true}
\let\fpacm\true
3
\documentclass[onecolumn,11pt,nocopyrightspace,preprint]{sigplanconf}
4
\usepackage{amstext}
5
\usepackage[T1]{fontenc}
6
\usepackage[utf8]{inputenc}
7
\usepackage{moreverb}
8
\usepackage{tikz}
9
10
\usepackage{xspace}
\usepackage{mymacros}
11
\def\fppdf{true}
12
13
14
\usepackage{fppdf}
\input{macros}
\input{version}
15
16
17
18
% Let Menhir's version number appear at the bottom right of every page.
\makeatletter
\def\@formatyear{\menhirversion}
\makeatother
19

20
21
22
23
24
25
26
27
28
29
30
% Hevea-specific logic; see
%   http://hevea.inria.fr/doc/manual007.html
%   http://hevea.inria.fr/doc/manual008.html
\usepackage{hevea}
\newenvironment{heveapicture}{
\begin{toimage}
}{
\end{toimage}
\imageflush{}
}

POTTIER Francois's avatar
POTTIER Francois committed
31
32
% ------------------------------------------------------------------------------

33
34
35
36
37
38
% Headings.

\title{\menhir Reference Manual\\\normalsize (version \menhirversion)}

\begin{document}

39
\authorinfo{François Pottier\and Yann Régis-Gianas}
40
41
42
43
44
	   {INRIA}
	   {\{Francois.Pottier, Yann.Regis-Gianas\}@inria.fr}

\maketitle

POTTIER Francois's avatar
POTTIER Francois committed
45
% ------------------------------------------------------------------------------
46
47
48
49
50

\clearpage
\tableofcontents
\clearpage

POTTIER Francois's avatar
POTTIER Francois committed
51
% ------------------------------------------------------------------------------
52
53
54
55
56

\section{Foreword}

\menhir is a parser generator. It turns high-level grammar specifications,
decorated with semantic actions expressed in the \ocaml programming
57
language~\cite{ocaml}, into parsers, again expressed in \ocaml. It is
58
59
based on Knuth's LR(1) parser construction technique~\cite{knuth-lr-65}. It is
strongly inspired by its precursors: \yacc~\cite{johnson-yacc-79},
60
\texttt{ML-Yacc}~\cite{tarditi-appel-00}, and \ocamlyacc~\cite{ocaml},
61
62
63
64
65
66
67
68
69
70
but offers a large number of minor and major improvements that make it a more
modern tool.

This brief reference manual explains how to use \menhir. It does not attempt to
explain context-free grammars, parsing, or the LR technique. Readers who have
never used a parser generator are encouraged to read about these ideas
first~\cite{aho-86,appel-tiger-98,hopcroft-motwani-ullman-00}. They are also
invited to have a look at the \distrib{demos} directory in \menhir's
distribution.

71
Potential users of \menhir should be warned that \menhir's feature set is not
72
73
74
75
76
77
78
79
80
81
completely stable. There is a tension between preserving a measure of
compatibility with \ocamlyacc, on the one hand, and introducing new ideas, on
the other hand. Some aspects of the tool, such as the error handling
mechanism, are still potentially subject to incompatible changes: for
instance, in the future, the current error handling mechanism (which is based
on the \error token, see \sref{sec:errors}) could be removed and replaced with
an entirely different mechanism.

There is room for improvement in the tool and in this reference manual. Bug
reports and suggestions are welcome!
82

POTTIER Francois's avatar
POTTIER Francois committed
83
% ------------------------------------------------------------------------------
84
85
86
87
88
89
90

\section{Usage}

\menhir is invoked as follows:
\begin{quote}
\cmenhir \nt{option} \ldots \nt{option} \nt{filename} \ldots \nt{filename}
\end{quote}
91
92
Each of the file names must end with \mly (unless \ocoq is used,
in which case it must end with \vy) and denotes a partial
93
94
grammar specification. These partial grammar specifications are joined
(\sref{sec:split}) to form a single, self-contained grammar specification,
95
which is then processed. The following optional command line switches allow
96
97
98
99
100
101
102
103
controlling many aspects of the process.

\docswitch{\obase \nt{basename}} This switch controls the base name
of the \ml and \mli files that are produced. That is, the tool will produce
files named \nt{basename}\texttt{.ml} and \nt{basename}\texttt{.mli}. Note
that \nt{basename} can contain occurrences of the \texttt{/} character, so it
really specifies a path and a base name. When only one \nt{filename} is
provided on the command line, the default \nt{basename} is obtained by
104
depriving \nt{filename} of its final \mly suffix. When multiple file
105
106
107
names are provided on the command line, no default base name exists, so that
the \obase switch \emph{must} be used.

108
\docswitch{\ocmly} This switch causes \menhir to produce a \cmly file in
109
110
111
addition to its normal operation. This file contains a (binary-form)
representation of the grammar and automaton (see \sref{sec:sdk}).

112
113
114
\docswitch{\ocomment} This switch causes a few comments to be inserted into the
\ocaml code that is written to the \ml file.

115
116
117
118
119
120
\docswitch{\ocompareerrors \nt{filename1} \ocompareerrors \nt{filename2}} Two
such switches must always be used in conjunction so as to specify the names of
two \messages files, \nt{filename1} and \nt{filename2}. Each file is read and
internally translated to a mapping of states to messages. \menhir then checks
that the left-hand mapping is a subset of the right-hand mapping. This feature
is typically used in conjunction with \olisterrors to check that \nt{filename2}
121
is complete (that is, covers all states where an error can occur).
122
123
124
125
For more information, see \sref{sec:errors:new}.

\docswitch{\ocompileerrors \nt{filename}} This switch causes \menhir to read the
file \nt{filename}, which must obey the \messages file format, and to compile
126
it to an \ocaml function that maps a state number to a message. The \ocaml code
127
128
129
130
is sent to the standard output channel. At the same time, \menhir checks that
the collection of input sentences in the file \nt{filename} is correct and
irredundant. For more information, see \sref{sec:errors:new}.

131
132
\docswitch{\ocoq} This switch causes \menhir to produce Coq code. See \sref{sec:coq}.

133
134
135
136
137
138
139
140
141
\docswitch{\ocoqlibpath \nt{path}} This switch allows specifying under what name (or path) the Coq support library
MenhirLib is known to Coq. When \menhir runs in \ocoq mode, the generated
parser contains references to several modules in this library. This path is
used to qualify these references. Its default value is \texttt{MenhirLib}.

\docswitch{\ocoqlibnopath} This switch indicates that references to the Coq library MenhirLib
should \emph{not} be qualified. This was the default behavior of \menhir prior to 2018/05/30.
This switch is provided for compatibility, but normally should not be used.

142
\docswitch{\ocoqnoactions} (Used in conjunction with \ocoq.) This switch
143
causes the semantic actions present in the \vy file to be ignored and
144
replaced with \verb+tt+, the unique inhabitant of Coq's \verb+unit+ type. This
145
feature can be used to test the Coq back-end with a standard grammar, that is, a
146
grammar that contains \ocaml semantic actions. Just rename the file from
147
\mly to \vy and set this switch.
148
149
150
151
152
153
154
155
156

\docswitch{\ocoqnocomplete} (Used in conjunction with \ocoq.) This switch
disables the generation of the proof of completeness of the parser
(\sref{sec:coq}). This can be necessary because the proof of completeness is
possible only if the grammar has no conflict (not even a benign one, in the
sense of \sref{sec:conflicts:benign}). This can be desirable also because, for
a complex grammar, completeness may require a heavy certificate and its
validation by Coq may take time.

157
\docswitch{\odepend} See \sref{sec:build}.
158
159
160
161

\docswitch{\odump} This switch causes a description of the automaton
to be written to the file \nt{basename}\automaton.

POTTIER Francois's avatar
Typo.    
POTTIER Francois committed
162
\docswitch{\oechoerrors \nt{filename}} This switch causes \menhir to
163
164
165
166
read the \messages file \nt{filename} and to produce on the standard output
channel just the input sentences. (That is, all messages, blank lines, and
comments are filtered out.) For more information, see \sref{sec:errors:new}.

167
168
169
170
171
172
173
174
175
176
177
\docswitch{\oexplain} This switch causes conflict explanations to be
written  to the file \nt{basename}\conflicts. See also \sref{sec:conflicts}.

\docswitch{\oexternaltokens \nt{T}} This switch causes the definition of
the \token type to be omitted in \nt{basename}\texttt{.ml} and
\nt{basename}\texttt{.mli}. Instead, the generated parser relies on
the type $T$\texttt{.}\token, where $T$ is an \ocaml module name. It is up to
the user to define module $T$ and to make sure that it exports a suitable
\token type. Module $T$ can be hand-written. It can also be automatically generated
out of a grammar specification using the \oonlytokens switch.

178
\docswitch{\ofixedexc} This switch causes the exception \texttt{Error} to be
179
180
internally defined as a synonym for \texttt{Parsing.Parse\_error}. This means
that an exception handler that catches \texttt{Parsing.Parse\_error} will also
181
catch the generated parser's \texttt{Error}. This helps increase \menhir's
182
compatibility with \ocamlyacc. There is otherwise no reason to use this switch.
183

184
185
186
187
188
189
\docswitch{\ograph} This switch causes a description of the grammar's
dependency graph to be written to the file \nt{basename}\dott. The graph's
vertices are the grammar's nonterminal symbols. There is a directed edge from
vertex $A$ to vertex $B$ if the definition of $A$ refers to $B$. The file is
in a format that is suitable for processing by the \emph{graphviz} toolkit.

190
\docswitch{\oinfer, \oinferwrite, \oinferread} See \sref{sec:build}.
191

POTTIER Francois's avatar
POTTIER Francois committed
192
193
194
195
196
197
\docswitch{\oinspection} This switch requires \otable. It causes \menhir to generate
not only the monolithic and incremental APIs (\sref{sec:monolithic},
\sref{sec:incremental}), but also the inspection API (\sref{sec:inspection}).
Activating this switch causes a few more tables to be produced, resulting in
somewhat larger code size.

198
199
200
\docswitch{\ointerpret} This switch causes \menhir to act as an interpreter,
rather than as a compiler. No \ocaml code is generated. Instead, \menhir
reads sentences off the standard input channel, parses them, and displays
201
202
203
204
205
206
207
outcomes. This switch can be usefully combined with \otrace.
For more information, see \sref{sec:interpret}.

\docswitch{\ointerpreterror} This switch is analogous to \ointerpret, except
\menhir expects every sentence to cause an error on its last token, and
displays information about the state in which the error is detected, in
the \messages file format. For more information, see \sref{sec:errors:new}.
208
209
210
211
212

\docswitch{\ointerpretshowcst} This switch, used in conjunction with \ointerpret,
causes \menhir to display a concrete syntax tree when a sentence is successfully
parsed. For more information, see \sref{sec:interpret}.

213
214
215
216
\docswitch{\olisterrors} This switch causes \menhir to produce (on the standard
output channel) a complete list of input sentences that cause an error, in the
\messages file format. For more information, see \sref{sec:errors:new}.

217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
\docswitch{\ologautomaton \nt{level}} When \nt{level} is nonzero, this switch
causes some information about the automaton to be logged to the standard error
channel.

\docswitch{\ologcode \nt{level}} When \nt{level} is nonzero, this switch
causes some information about the generated \ocaml code to be logged to the
standard error channel.

\docswitch{\ologgrammar \nt{level}} When \nt{level} is nonzero, this switch
causes some information about the grammar to be logged to the standard error
channel. When \nt{level} is 2, the \emph{nullable}, \emph{FIRST}, and
\emph{FOLLOW} tables are displayed.

\docswitch{\onoinline} This switch causes all \dinline keywords in the
grammar specification to be ignored. This is especially useful in order
to understand whether these keywords help solve any conflicts.

234
\docswitch{\onostdlib} This switch instructs \menhir to \emph{not} use
POTTIER Francois's avatar
POTTIER Francois committed
235
its standard library (\sref{sec:library}).
236

237
\docswitch{\oocamlc \nt{command}} See \sref{sec:build}.
238

239
\docswitch{\oocamldep \nt{command}} See \sref{sec:build}.
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256

\docswitch{\oonlypreprocess} This switch causes the grammar specifications
to be transformed up to the point where the automaton's construction can
begin. The grammar specifications whose names are provided on the command line
are joined (\sref{sec:split}); all parameterized nonterminal symbols are
expanded away (\sref{sec:templates}); type inference is performed, if \oinfer
is enabled; all nonterminal symbols marked \dinline are expanded away
(\sref{sec:inline}). This yields a single, monolithic grammar specification,
which is printed on the standard output channel.

\docswitch{\oonlytokens} This switch causes the \dtoken declarations in
the grammar specification to be translated into a definition of the \token
type, which is written to the files \nt{basename}\texttt{.ml} and
\nt{basename}\texttt{.mli}. No code is generated. This is useful when
a single set of tokens is to be shared between several parsers. The directory
\distrib{demos/calc-two} contains a demo that illustrates the use of this switch.

257
\docswitch{\orawdepend} See \sref{sec:build}.
258

POTTIER Francois's avatar
POTTIER Francois committed
259
260
261
262
263
\docswitch{\ostdlib \nt{directory}} This switch controls the directory where
the standard library (\sref{sec:library}) is found. It takes precedence over
both the installation-time directory and the directory that may be specified
via the environment variable \verb+$MENHIR_STDLIB+.

264
265
266
267
268
269
270
\docswitch{\ostrict} This switch causes several warnings about the grammar
and about the automaton to be considered errors. This includes warnings about
useless precedence declarations, non-terminal symbols that produce the empty
language, unreachable non-terminal symbols, productions that are never
reduced, conflicts that are not resolved by precedence declarations, and
end-of-stream conflicts.

271
\docswitch{\oo{suggest-*}} See \sref{sec:build}.
272

273
274
\docswitch{\otable} This switch causes \menhir to use its table-based
back-end, as opposed to its (default) code-based back-end. When \otable is
POTTIER Francois's avatar
POTTIER Francois committed
275
276
used, \menhir produces significantly more compact and somewhat slower parsers.
See \sref{sec:qa} for a speed comparison.
277
278
279
280
281
282
283
284
285
286
287
288

The table-based back-end produces rather compact tables, which are analogous
to those produced by \yacc, \bison, or \ocamlyacc. These tables are not quite
stand-alone: they are exploited by an interpreter, which is shipped as part of
the support library \menhirlib. For this reason, when \otable is used,
\menhirlib must be made visible to the \ocaml compilers, and must be linked
into your executable program. The \texttt{--suggest-*} switches, described
above, help do this.

The code-based back-end compiles the LR automaton directly into a nest of
mutually recursive \ocaml functions. In that case, \menhirlib is not required.

POTTIER Francois's avatar
POTTIER Francois committed
289
290
291
The incremental API (\sref{sec:incremental}) and the inspection API
(\sref{sec:inspection}) are made available only by the table-based back-end.

292
293
294
295
296
297
298
299
300
\docswitch{\otimings} This switch causes internal timing information to
be sent to the standard error channel.

\docswitch{\otrace} This switch causes tracing code to be inserted into
the generated parser, so that, when the parser is run, its actions are
logged to the standard error channel. This is analogous to \texttt{ocamlrun}'s
\texttt{p=1} parameter, except this switch must be enabled at compile time:
one cannot selectively enable or disable tracing at runtime.

301
302
303
304
305
306
307
308
309
310
\docswitch{\oignoreprec} This switch suppresses all warnings about
useless \dleft, \dright, \dnonassoc and \dprec declarations.

\docswitch{\oignoreone \nt{symbol}} This switch suppresses the warning that
is normally emitted when \menhir finds that the terminal symbol \nt{symbol} is
unused.

\docswitch{\oignoreall} This switch suppresses all of the warnings that are
normally emitted when \menhir finds that some terminal symbols are unused.

POTTIER Francois's avatar
Typo.    
POTTIER Francois committed
311
\docswitch{\oupdateerrors \nt{filename}} This switch causes \menhir to
312
313
314
315
316
read the \messages file \nt{filename} and to produce on the standard output
channel a new \messages file that is identical, except the auto-generated
comments have been re-generated. For more information,
see \sref{sec:errors:new}.

317
318
319
\docswitch{\oversion} This switch causes \menhir to print its own version
number and exit.

POTTIER Francois's avatar
POTTIER Francois committed
320
% ------------------------------------------------------------------------------
321
322
323
324
325

\section{Lexical conventions}

The semicolon character (\kw{;}) is treated as insignificant, just like white
space. Thus, rules and producers (for instance) can be separated with
POTTIER Francois's avatar
POTTIER Francois committed
326
semicolons if it is thought that this improves readability. Semicolons can be
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
omitted otherwise.

Identifiers (\nt{id}) coincide with \ocaml identifiers, except they are not
allowed to contain the quote (\kw{'}) character. Following
\ocaml, identifiers that begin with a lowercase letter
(\nt{lid}) or with an uppercase letter (\nt{uid}) are distinguished.

Comments are C-style (surrounded with \kw{/*} and \kw{*/}, cannot be nested),
C++-style (announced by \kw{/$\!$/} and extending until the end of the line), or
\ocaml-style (surrounded with \kw{(*} and \kw{*)}, can be nested). Of course,
inside \ocaml code, only \ocaml-style comments are allowed.

\ocaml type expressions are surrounded with \kangle{and}. Within such expressions,
all references to type constructors (other than the built-in \textit{list}, \textit{option}, etc.)
must be fully qualified.

POTTIER Francois's avatar
POTTIER Francois committed
343
% ------------------------------------------------------------------------------
344
345
346
347
348
349
350
351
352
353
354

\section{Syntax of grammar specifications}

\begin{figure}
\begin{center}
\begin{tabular}{r@{}c@{}l}

\nt{specification} \is
   \sepspacelist{\nt{declaration}}
   \percentpercent
   \sepspacelist{\nt{rule}}
355
   \optional{\percentpercent \textit{\ocaml code}} \\
356
357

\nt{declaration} \is
358
   \dheader{\textit{\ocaml code}} \\
359
360
361
362
363
364
365
&& \dparameter \ocamlparam \\
&& \dtoken \optional{\ocamltype} \sepspacelist{\nt{uid}} \\
&& \dnonassoc \sepspacelist{\nt{uid}} \\
&& \dleft \sepspacelist{\nt{uid}} \\
&& \dright \sepspacelist{\nt{uid}} \\
&& \dtype \ocamltype \sepspacelist{\nt{lid}} \\
&& \dstart \optional{\ocamltype} \sepspacelist{\nt{lid}} \\
366
&& \donerrorreduce \sepspacelist{\nt{lid}} \\
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386

\nt{rule} \is
   \optional{\dpublic} \optional{\dinline}
   \nt{lid}
   \optional{\dlpar\sepcommalist{\nt{id}}\drpar}
   \deuxpoints
   \optional{\barre} \seplist{\ \barre}{\nt{group}} \\

\nt{group} \is
   \seplist{\ \barre}{\nt{production}}
   \daction
   \optional {\dprec \nt{id}} \\

\nt{production} \is
   \sepspacelist{\nt{producer}} \optional {\dprec \nt{id}} \\

\nt{producer} \is
   \optional{\nt{lid} \dequal} \nt{actual} \\

\nt{actual} \is
387
388
   \nt{id} \optional{\dlpar\sepcommalist{\nt{actual}}\drpar} \\
&& \nt{actual} \optional{\dquestion \barre \dplus \barre \dstar} \\
389
&& \seplist{\ \barre}{\nt{group}} % not really allowed everywhere
390
391
392
393
394
395
396
397
398

\end{tabular}
\end{center}
\caption{Syntax of grammar specifications}
\label{fig:syntax}
\end{figure}

The syntax of grammar specifications appears in \fref{fig:syntax}. (For
compatibility with \ocamlyacc, some specifications that do not fully adhere to
399
400
401
this syntax are also accepted.) Attributes are not documented in
\fref{fig:syntax}: see \sref{sec:attributes}.

402
403

\subsection{Declarations}
404
\label{sec:decls}
405
406
407
408
409

A specification file begins with a sequence of declarations, ended by a
mandatory \percentpercent keyword.

\subsubsection{Headers}
POTTIER Francois's avatar
POTTIER Francois committed
410
\label{sec:decls:headers}
411
412
413
414
415
416
417
418
419
420

A header is a piece of \ocaml code, surrounded with \dheader{and}. It is
copied verbatim at the beginning of the \ml file. It typically contains \ocaml
\kw{open} directives and function definitions for use by the semantic
actions. If a single grammar specification file contains multiple headers,
their order is preserved. However, when two headers originate in distinct
grammar specification files, the order in which they are copied to the \ml
file is unspecified.

\subsubsection{Parameters}
POTTIER Francois's avatar
POTTIER Francois committed
421
\label{sec:parameter}
422
423
424
425
426
427

A declaration of the form:
\begin{quote}
\dparameter \ocamlparam
\end{quote}
causes the entire parser to become parameterized over the \ocaml module
428
429
430
431
\nt{uid}, that is, to become an \ocaml functor. The directory
\distrib{demos/calc-param} contains a demo that illustrates the use of this switch.

If a single specification file
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
contains multiple \dparameter declarations, their order is preserved, so that
the module name \nt{uid} introduced by one declaration is effectively in scope
in the declarations that follow. When two \dparameter declarations originate
in distinct grammar specification files, the order in which they are processed
is unspecified. Last, \dparameter declarations take effect before \dheader{$\ldots$},
\dtoken, \dtype, or \dstart declarations are considered, so that the module name
\nt{uid} introduced by a \dparameter declaration is effectively in scope in
\emph{all} \dheader{$\ldots$}, \dtoken, \dtype, or \dstart declarations,
regardless of whether they precede or follow the \dparameter declaration.
This means, in particular, that the side effects of an \ocaml header are
observed only when the functor is applied, not when it is defined.

\subsubsection{Tokens}

A declaration of the form:
\begin{quote}
\dtoken \optional{\ocamltype} $\nt{uid}_1, \ldots, \nt{uid}_n$
\end{quote}
defines the identifiers $\nt{uid}_1, \ldots, \nt{uid}_n$ as tokens, that is,
as terminal symbols in the grammar specification and as data constructors in
the \textit{token} type. If an \ocaml type $t$ is present, then these tokens
are considered to carry a semantic value of type $t$, otherwise they are
considered to carry no semantic value.

\subsubsection{Priority and associativity}
\label{sec:assoc}

A declaration of one of the following forms:
\begin{quote}
\dnonassoc $\nt{uid}_1 \ldots \nt{uid}_n$ \\
\dleft $\nt{uid}_1 \ldots \nt{uid}_n$ \\
\dright $\nt{uid}_1 \ldots \nt{uid}_n$
\end{quote}
465
assigns both a \emph{priority level} and an \emph{associativity status} to
466
467
468
469
470
471
472
473
474
475
476
the symbols $\nt{uid}_1, \ldots, \nt{uid}_n$. The priority level assigned to
$\nt{uid}_1, \ldots, \nt{uid}_n$ is not defined explicitly: instead, it is
defined to be higher than the priority level assigned by the previous
\dnonassoc, \dleft, or \dright declaration, and lower than that assigned by the next
\dnonassoc, \dleft, or \dright declaration. The symbols $\nt{uid}_1, \ldots, \nt{uid}_n$
can be tokens (defined elsewhere by a \dtoken declaration) or dummies (not
defined anywhere). Both can be referred to as part of \dprec annotations.
Associativity status and priority levels allow shift/reduce conflicts to be
silently resolved (\sref{sec:conflicts}).

\subsubsection{Types}
POTTIER Francois's avatar
POTTIER Francois committed
477
\label{sec:type}
478
479
480
481
482
483
484
485
486

A declaration of the form:
\begin{quote}
\dtype \ocamltype $\nt{lid}_1 \ldots \nt{lid}_n$
\end{quote}
assigns an \ocaml type to each of the nonterminal symbols $\nt{lid}_1, \ldots, \nt{lid}_n$.
For start symbols, providing an \ocaml type is mandatory, but is usually done as part of the
\dstart declaration. For other symbols, it is optional. Providing type information can improve
the quality of \ocaml's type error messages.
487
488
489
490
491

A \dtype declaration may concern not only a nonterminal symbol, such as, say,
\texttt{expression}, but also a fully applied parameterized nonterminal
symbol, such as \texttt{list(expression)} or \texttt{separated\_list(COMMA,
  option(expression))}.
492

POTTIER Francois's avatar
POTTIER Francois committed
493
494
495
496
497
498
499
500
The types provided as part of \dtype declarations are copied verbatim to the
\ml and \mli files. In contrast, headers (\sref{sec:decls:headers}) are copied
to the \ml file only. For this reason, the types provided as part of \dtype
declarations must make sense both in the presence and in the absence of these
headers. They should typically be fully qualified types.

% TEMPORARY type information can be mandatory in --coq mode; document?

501
\subsubsection{Start symbols}
POTTIER Francois's avatar
POTTIER Francois committed
502
\label{sec:start}
503
504
505
506
507
508
509
510
511
512
513
514

A declaration of the form:
\begin{quote}
\dstart \optional{\ocamltype} $\nt{lid}_1 \ldots \nt{lid}_n$
\end{quote}
declares the nonterminal symbols $\nt{lid}_1, \ldots, \nt{lid}_n$ to be
start symbols. Each such symbol must be assigned an \ocaml type either as
part of the \dstart declaration or via separate \dtype declarations. Each
of $\nt{lid}_1, \ldots, \nt{lid}_n$ becomes the name of a function whose
signature is published in the \mli file and that can be used to invoke
the parser.

515
516
517
518
519
520
521
522
523
\subsubsection{Extra reductions on error}
\label{sec:onerrorreduce}

A declaration of the form:
\begin{quote}
\donerrorreduce $\nt{lid}_1 \ldots \nt{lid}_n$
\end{quote}
marks the nonterminal symbols $\nt{lid}_1, \ldots, \nt{lid}_n$ as
potentially eligible for reduction when an invalid token is found.
524
525
This may cause one or more extra reduction steps to be performed
before the error is detected.
526
527
528
529
530

More precisely, this declaration affects the automaton as follows. Let us say
that a production $\nt{lid} \rightarrow \ldots$ is ``reducible on error'' if
its left-hand symbol~\nt{lid} appears in a \donerrorreduce declaration. After
the automaton has been constructed and after any conflicts have been resolved,
531
532
533
534
535
536
537
538
539
in every state~$s$, the following algorithm is applied:
\begin{enumerate}
\item Construct the set of all productions that are ready to be reduced in
  state~$s$ and are reducible on error;
\item Test if one of them, say $p$, has higher ``on-error-reduce-priority''
  than every other production in this set;
\item If so, in state~$s$, replace every error action with a reduction of the
  production~$p$.
(In other words, for every terminal symbol~$t$, if the action table
540
says: ``in state~$s$, when the next input symbol is~$t$, fail'', then this
541
542
543
entry is replaced with: ``in state~$s$, when the next input symbol
is~$t$, reduce production~$p$''.)
\end{enumerate}
544

545
546
If step 3 above is executed in state~$s$, then an error can never be detected
in state~$s$, since all error actions in state~$s$ are replaced with reduce
547
548
549
550
actions. Error detection is deferred: at least one reduction takes place
before the error is detected. It is a ``spurious'' reduction: in a canonical
LR(1) automaton, it would not take place.

551
552
553
554
555
556
An \donerrorreduce declaration does not affect the language that is accepted
by the automaton. It does not affect the location where an error is detected.
It is used to control in which state an error is detected. If used wisely, it
can make errors easier to report, because they are detected in a state for
which it is easier to write an accurate diagnostic message
(\sref{sec:errors:diagnostics}).
557
558
559
560
561
562
563
564
565

% This may make the tables bigger (but I have no statistics).
% This makes LRijkstra significantly slower.

Like a \dtype declaration, an \donerrorreduce declaration may concern not only
a nonterminal symbol, such as, say, \texttt{expression}, but also a fully
applied parameterized nonterminal symbol, such as \texttt{list(expression)} or
\texttt{separated\_list(COMMA, option(expression))}.

566
567
568
569
570
571
572
573
574
The ``on-error-reduce-priority'' of a production is that of its left-hand
symbol. The ``on-error-reduce-priority'' of a nonterminal symbol is determined
implicitly by the order of \donerrorreduce declarations. In the declaration
$\donerrorreduce\;\nt{lid}_1 \ldots \nt{lid}_n$, the symbols $\nt{lid}_1, \ldots,
\nt{lid}_n$ have the same ``on-error-reduce-priority''. They have higher
``on-error-reduce-priority'' than the symbols listed in previous
\donerrorreduce declarations, and lower ``on-error-reduce-priority''
than those listed in later \donerrorreduce declarations.

575
576
577
\subsection{Rules}

Following the mandatory \percentpercent keyword, a sequence of rules is
578
579
580
581
582
583
584
585
expected. Each rule defines a nonterminal symbol~\nt{id}.
%
(It is recommended that the name of a nonterminal symbol begin with a lowercase
letter, so it falls in the category \nt{lid}. This is in fact mandatory for the
start symbols.)
In its simplest
form, a rule begins with the nonterminal symbol \nt{id},
followed by a colon character (\deuxpoints),
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
and continues with a sequence of production groups
(\sref{sec:productiongroups}). Each production group is preceded with a
vertical bar character (\barre); the very first bar is optional. The meaning
of the bar is choice: the nonterminal symbol \nt{id} develops to either of the
production groups. We defer explanations of the keyword \dpublic
(\sref{sec:split}), of the keyword \dinline (\sref{sec:inline}), and of the
optional formal parameters $\dlpar\sepcommalist{\nt{id}}\drpar$
(\sref{sec:templates}).

\subsubsection{Production groups}
\label{sec:productiongroups}

In its simplest form, a production group consists of a single production (\sref{sec:productions}),
followed by an \ocaml semantic action (\sref{sec:actions}) and an optional
\dprec annotation (\sref{sec:prec}). A production specifies a sequence of terminal and
nonterminal symbols that should be recognized, and optionally binds
identifiers to their semantic values.

\paragraph{Semantic actions}
\label{sec:actions}

A semantic action is a piece of \ocaml code that is executed in order to
assign a semantic value to the nonterminal symbol with which this production
group is associated. A semantic action can refer to the (already computed)
semantic values of the terminal or nonterminal symbols that appear in the
611
612
613
614
production via the semantic value identifiers bound by the production.

For compatibility with \ocamlyacc, semantic actions can also refer to
unnamed semantic values via positional keywords of the form
Yann Régis-Gianas's avatar
Yann Régis-Gianas committed
615
616
\kw{\$1}, \kw{\$2}, etc.\ This style is discouraged. Furthermore, as
a positional keyword of the form \kw{\$i} is internally rewritten as
617
\nt{\_i}, the user should not use identifiers of the form \nt{\_i}.
618
619
620
621

\paragraph{\dprec annotations}
\label{sec:prec}

622
623
An annotation of the form \dprec \nt{id} indicates that the precedence level
of the production group is the level assigned to the symbol \nt{id} via a
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
previous \dnonassoc, \dleft, or \dright declaration (\sref{sec:assoc}). In the
absence of a
\dprec annotation, the precedence level assigned to each production is the
level assigned to the rightmost terminal symbol that appears in it. It is
undefined if the rightmost terminal symbol has an undefined precedence level
or if the production mentions no terminal symbols at all. The precedence level
assigned to a production is used when resolving shift/reduce conflicts
(\sref{sec:conflicts}).

\paragraph{Multiple productions in a group}

If multiple productions are present in a single group, then the semantic
action and precedence annotation are shared between them. This short-hand
effectively allows several productions to share a semantic action and
precedence annotation without requiring textual duplication. It is legal only
when every production binds exactly the same set of semantic value identifiers
and when no positional semantic value keywords (\kw{\$1}, etc.) are used.

\subsubsection{Productions}
\label{sec:productions}

A production is a sequence of producers (\sref{sec:producers}), optionally
followed by a \dprec annotation (\sref{sec:prec}). If a precedence annotation
is present, it applies to this production alone, not to other productions in
the production group. It is illegal for a production and its production group
to both carry \dprec annotations.

\subsubsection{Producers}
\label{sec:producers}

A producer is an actual (\sref{sec:actual}), optionally preceded with a
binding of a semantic value identifier, of the form \nt{lid} \dequal. The
actual specifies which construction should be recognized and how a semantic
value should be computed for that construction. The identifier \nt{lid}, if
present, becomes bound to that semantic value in the semantic action that
follows. Otherwise, the semantic value can be referred to via a positional
keyword (\kw{\$1}, etc.).

\subsubsection{Actuals}
\label{sec:actual}

665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
In its simplest form, an actual is just a terminal or nonterminal symbol
$\nt{id}$. If it is a parameterized non-terminal symbol (see
\sref{sec:templates}), then it should be applied:
$\nt{id}\dlpar\sepcommalist{\nt{actual}}\drpar$.

An actual may be followed with a modifier (\dquestion, \dplus, or
\dstar). This is explained further on (see \sref{sec:templates} and
\fref{fig:sugar}).

An actual may also be an ``anonymous rule''. In that case, one writes
just the rule's right-hand side, which takes the form $\seplist{\ \barre\
}{\nt{group}}$.
(This form is allowed only as an argument in an application.)
This form is expanded on the fly to a definition of a fresh non-terminal
symbol, which is declared \dinline.
For instance, providing an anonymous rule as an argument to \nt{list}:
\begin{quote}
\begin{tabular}{l}
\nt{list} \dlpar \basic{e} = \nt{expression}; \basic{SEMICOLON} \dpaction{\basic{e}} \drpar
\end{tabular}
\end{quote}
is equivalent to writing this:
\begin{quote}
\begin{tabular}{l}
\nt{list} \dlpar \nt{expression\_SEMICOLON} \drpar
\end{tabular}
\end{quote}
where the non-terminal symbol \nt{expression\_SEMICOLON} is chosen fresh and is defined as follows:
\begin{quote}
\begin{tabular}{l}
\dinline \nt{expression\_SEMICOLON}:
\newprod \basic{e} = \nt{expression}; \basic{SEMICOLON} \dpaction{\basic{e}}
\end{tabular}
\end{quote}
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748

\section{Advanced features}

\subsection{Splitting specifications over multiple files}
\label{sec:split}

\paragraph{Modules}

Grammar specifications can be split over multiple files. When \menhir is
invoked with multiple argument file names, it considers each of these files as
a \emph{partial} grammar specification, and \emph{joins} these partial
specifications in order to obtain a single, complete specification.

This feature is intended to promote a form a modularity. It is hoped that, by
splitting large grammar specifications into several ``modules'', they can be
made more manageable. It is also hoped that this mechanism, in conjunction
with parameterization (\sref{sec:templates}), will promote sharing and reuse.
It should be noted, however, that this is only a weak form of
modularity. Indeed, partial specifications cannot be independently processed
(say, checked for conflicts).  It is necessary to first join them, so as to
form a complete grammar specification, before any kind of grammar analysis can
be done.

This mechanism is, in fact, how \menhir's standard library (\sref{sec:library})
is made available: even though its name does not appear on the command line,
it is automatically joined with the user's explicitly-provided grammar
specifications, making the standard library's definitions globally visible.

A partial grammar specification, or module, contains declarations and rules,
just like a complete one: there is no visible difference. Of course, it can
consist of only declarations, or only rules, if the user so chooses. (Don't
forget the mandatory \percentpercent keyword that separates declarations and
rules. It must be present, even if one of the two sections is empty.)

\paragraph{Private and public nonterminal symbols}

It should be noted that joining is \emph{not} a purely textual process. If two
modules happen to define a nonterminal symbol by the same name, then it is
considered, by default, that this is an accidental name clash. In that case,
each of the two nonterminal symbols is silently renamed so as to avoid the
clash. In other words, by default, a nonterminal symbol defined in module $A$
is considered \emph{private}, and cannot be defined again, or referred to, in
module $B$.

Naturally, it is sometimes desirable to define a nonterminal symbol $N$ in
module $A$ and to refer to it in module $B$. This is permitted if $N$ is
public, that is, if either its definition carries the keyword \dpublic or
$N$ is declared to be a start symbol. A public nonterminal symbol is never
renamed, so it can be referred to by modules other than its defining module.

749
750
751
752
753
In fact, it is permitted to split the definition of a \emph{public} nonterminal
symbol, over multiple modules and/or within a single module.
That is, a public nonterminal symbol $N$ can
have multiple definitions, within one module and/or in distinct modules.
All of these definitions are joined using the choice (\barre) operator. This
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
feature allows splitting a grammar specification in a manner that is
independent of the grammar's structure. For instance, in the grammar of a
programming language, the definition of the nonterminal symbol \nt{expression}
could be split into multiple modules, where one module groups the expression
forms that have to do with arithmetic, one module groups those that concern
function definitions and function calls, one module groups those that concern
object definitions and method calls, and so on.

\paragraph{Tokens aside}

Another use of modularity consists in placing all \dtoken declarations in one
module, and the actual grammar specification in another module. The module
that contains the token definitions can then be shared, making it easier to
define multiple parsers that accept the same type of tokens. (On this topic,
see \distrib{demos/calc-two}.)

\subsection{Parameterizing rules}
\label{sec:templates}

A rule (that is, the definition of a nonterminal symbol) can be parameterized
over an arbitrary number of symbols, which are referred to as formal
parameters.

\paragraph{Example}

For instance, here is the definition of the parameterized
nonterminal symbol \nt{option}, taken from the standard library (\sref{sec:library}):
%
\begin{quote}
\begin{tabular}{l}
\dpublic \basic{option}(\basic{X}):
\newprod \dpaction{\basic{None}}
\newprod \basic{x} = \basic{X} \dpaction{\basic{Some} \basic{x}}
\end{tabular}
\end{quote}
%
This definition states that \nt{option}(\basic{X}) expands to either the empty
string, producing the semantic value \basic{None}, or to the string \basic{X},
producing the semantic value {\basic{Some}~\basic{x}}, where \basic{x} is the
semantic value of \basic{X}. In this definition, the symbol \basic{X} is
abstract: it stands for an arbitrary terminal or nonterminal symbol. The
definition is made public, so \nt{option} can be referred to within client
modules.

A client that wishes to use \nt{option} simply refers to it, together with
an actual parameter -- a symbol that is intended to replace \basic{X}. For
instance, here is how one might define a sequence of declarations, preceded
with optional commas:
%
\begin{quote}
\begin{tabular}{l}
\nt{declarations}:
\newprod \dpaction{[]}
\newprod \basic{ds} = \nt{declarations}; \nt{option}(\basic{COMMA}); \basic{d} = \nt{declaration}
         \dpaction{ \basic{d} :: \basic{ds} }
\end{tabular}
\end{quote}
%
This definition states that \nt{declarations} expands either to the empty
string or to \nt{declarations} followed by an optional comma followed by
\nt{declaration}. (Here, \basic{COMMA} is presumably a terminal symbol.)
When this rule is encountered, the definition of \nt{option} is instantiated:
that is, a copy of the definition, where \basic{COMMA} replaces \basic{X},
is produced. Things behave exactly as if one had written:

\begin{quote}
\begin{tabular}{l}
\basic{optional\_comma}:
\newprod \dpaction{\basic{None}}
\newprod \basic{x} = \basic{COMMA} \dpaction{\basic{Some} \basic{x}} \\

\nt{declarations}:
\newprod \dpaction{[]}
\newprod \basic{ds} = \nt{declarations}; \nt{optional\_comma}; \basic{d} = \nt{declaration}
         \dpaction{ \basic{d} :: \basic{ds} }
\end{tabular}
\end{quote}
%
Note that, even though \basic{COMMA} presumably has been declared as a token
with no semantic value, writing \basic{x}~=~\basic{COMMA} is legal, and binds
\basic{x} to the unit value. This design choice ensures that the definition
of \nt{option} makes sense regardless of the nature of \basic{X}: that is, \basic{X}
can be instantiated with a terminal symbol, with or without a semantic value,
or with a nonterminal symbol.

\paragraph{Parameterization in general}

In general, the definition of a nonterminal symbol $N$ can be
parameterized with an arbitrary number of formal parameters. When
$N$ is referred to within a production, it must be applied
to the same number of actuals. In general, an actual is:
%
\begin{itemize}
\item either a single symbol, which can be a terminal symbol, a nonterminal symbol, or a formal parameter;
\item or an application of such a symbol to a number of actuals.
\end{itemize}

For instance, here is a rule whose single production consists of a single
producer, which contains several, nested actuals. (This example is discussed
again in \sref{sec:library}.)
%
\begin{quote}
\begin{tabular}{l}
\nt{plist}(\nt{X}):
\newprod
  \basic{xs} = \nt{loption}(%
                     \nt{delimited}(%
                       \basic{LPAREN},
                       \nt{separated\_nonempty\_list}(\basic{COMMA}, \basic{X}),
                       \basic{RPAREN}%
                     )%
                   )
    \dpaction{\basic{xs}}
\end{tabular}
\end{quote}

\begin{figure}
\begin{center}
872
\begin{tabular}{r@{\hspace{2mm}}c@{\hspace{2mm}}l}
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\nt{actual}\dquestion & is syntactic sugar for & \nt{option}(\nt{actual}) \\
\nt{actual}\dplus & is syntactic sugar for & \nt{nonempty\_list}(\nt{actual}) \\
\nt{actual}\dstar & is syntactic sugar for & \nt{list}(\nt{actual})
\end{tabular}
\end{center}
\caption{Syntactic sugar for simulating regular expressions}
\label{fig:sugar}
\end{figure}
%
Applications of the parameterized nonterminal symbols \nt{option},
\nt{nonempty\_list}, and \nt{list}, which are defined in
the standard library (\sref{sec:library}), can be written using
a familiar, regular-expression like syntax (\fref{fig:sugar}).

\paragraph{Higher-order parameters}

A formal parameter can itself expect parameters. For instance, here is a rule
that defines the syntax of procedures in an imaginary programming language:
%
\begin{quote}
\begin{tabular}{l}
\nt{procedure}(\nt{list}):
\newprod
\basic{PROCEDURE} \basic{ID} \nt{list}(\nt{formal}) \nt{SEMICOLON} \nt{block} \nt{SEMICOLON} \dpaction{$\ldots$}
\end{tabular}
\end{quote}
%
This rule states that the token \basic{ID}, which represents the name of the
procedure, should be followed with a list of formal parameters. (The
definitions of the nonterminal symbols \nt{formal} and \nt{block} are not
shown.) However, because \nt{list} is a formal parameter, as opposed to a
concrete nonterminal symbol defined elsewhere, this definition does not
specify how the list is laid out: which token, if any, is used to separate, or
terminate, list elements? is the list allowed to be empty? and so on. A more
concrete notion of procedure is obtained by instantiating the formal parameter
\nt{list}: for instance, \nt{procedure}(\nt{plist}), where \nt{plist} is the
parameterized nonterminal symbol defined earlier, is a valid application.

\paragraph{Consistency} Definitions and uses of parameterized nonterminal
symbols are checked for consistency before they are expanded away. In short,
it is checked that, wherever a nonterminal symbol is used, it is supplied with
actual arguments in appropriate number and of appropriate nature. This
guarantees that expansion of parameterized definitions terminates and produces
a well-formed grammar as its outcome.

\subsection{Inlining}
\label{sec:inline}

It is well-known that the following grammar of arithmetic expressions does not
work as expected: that is, in spite of the priority declarations, it has
shift/reduce conflicts.
%
\begin{quote}
\begin{tabular}{l}
\dtoken \kangle{\basic{int}} \basic{INT} \\
\dtoken \basic{PLUS} \basic{TIMES} \\
\dleft \basic{PLUS} \\
\dleft \basic{TIMES} \\ \\
\percentpercent \\ \\
\nt{expression}:
\newprod \basic{i} = \basic{INT}
         \dpaction{\basic{i}}
\newprod \basic{e} = \nt{expression}; \basic{o} = \nt{op}; \basic{f} = \nt{expression}
         \dpaction{\basic{o} \basic{e} \basic{f}} \\
\nt{op}:
\newprod \basic{PLUS} \dpaction{( + )}
\newprod \basic{TIMES} \dpaction{( * )}
\end{tabular}
\end{quote}
%
The trouble is, the precedence level of the production \nt{expression}
$\rightarrow$ \nt{expression} \nt{op} \nt{expression} is undefined, and
there is no sensible way of defining it via a \dprec declaration, since
the desired level really depends upon the symbol that was recognized by
\nt{op}: was it \basic{PLUS} or \basic{TIMES}?

The standard workaround is to abandon the definition of \nt{op} as a
separate nonterminal symbol, and to inline its definition into the
definition of \nt{expression}, like this:
%
\begin{quote}
\begin{tabular}{l}
\nt{expression}:
\newprod \basic{i} = \basic{INT}
         \dpaction{\basic{i}}
\newprod \basic{e} = \nt{expression}; \basic{PLUS}; \basic{f} = \nt{expression}
         \dpaction{\basic{e} + \basic{f}}
\newprod \basic{e} = \nt{expression}; \basic{TIMES}; \basic{f} = \nt{expression}
         \dpaction{\basic{e} * \basic{f}}
\end{tabular}
\end{quote}
%

This avoids the shift/reduce conflict, but gives up some of the original
specification's structure, which, in realistic situations, can be damageable.
Fortunately, \menhir offers a way of avoiding the conflict without manually
transforming the grammar, by declaring that the nonterminal symbol \nt{op}
should be inlined:
%
\begin{quote}
\begin{tabular}{l}
\nt{expression}:
\newprod \basic{i} = \basic{INT}
         \dpaction{\basic{i}}
\newprod \basic{e} = \nt{expression}; \basic{o} = \nt{op}; \basic{f} = \nt{expression}
         \dpaction{\basic{o} \basic{e} \basic{f}} \\
\dinline \nt{op}:
\newprod \basic{PLUS} \dpaction{( + )}
\newprod \basic{TIMES} \dpaction{( * )}
\end{tabular}
\end{quote}
%
The \dinline keyword causes all references to \nt{op} to be replaced with its
definition. In this example, the definition of \nt{op} involves two
productions, one that develops to \basic{PLUS} and one that expands to
\basic{TIMES}, so every production that refers to \nt{op} is effectively
turned into two productions, one that refers to \basic{PLUS} and one that refers to
\basic{TIMES}. After inlining, \nt{op} disappears and \nt{expression} has three
productions: that is, the result of inlining is exactly the manual workaround
shown above.

In some situations, inlining can also help recover a slight efficiency margin.
For instance, the definition:
%
\begin{quote}
\begin{tabular}{l}
\dinline \nt{plist}(\nt{X}):
\newprod
For faster browsing, not all history is shown. View entire blame