manual.tex 205 KB
Newer Older
1
2
\def\true{true}
\let\fpacm\true
3
\documentclass[onecolumn,11pt,nocopyrightspace,preprint]{sigplanconf}
4
\usepackage{amstext}
5
\usepackage[T1]{fontenc}
6
\usepackage[utf8]{inputenc}
7
\usepackage{moreverb}
8
\usepackage{tikz}
9
10
\usepackage{xspace}
\usepackage{mymacros}
11
\def\fppdf{true}
12
13
14
\usepackage{fppdf}
\input{macros}
\input{version}
15
16
17
18
% Let Menhir's version number appear at the bottom right of every page.
\makeatletter
\def\@formatyear{\menhirversion}
\makeatother
19

20
21
22
23
24
25
26
27
28
29
30
% Hevea-specific logic; see
%   http://hevea.inria.fr/doc/manual007.html
%   http://hevea.inria.fr/doc/manual008.html
\usepackage{hevea}
\newenvironment{heveapicture}{
\begin{toimage}
}{
\end{toimage}
\imageflush{}
}

POTTIER Francois's avatar
POTTIER Francois committed
31
32
% ------------------------------------------------------------------------------

33
34
35
36
37
38
% Headings.

\title{\menhir Reference Manual\\\normalsize (version \menhirversion)}

\begin{document}

39
\authorinfo{François Pottier\and Yann Régis-Gianas}
40
41
42
43
44
	   {INRIA}
	   {\{Francois.Pottier, Yann.Regis-Gianas\}@inria.fr}

\maketitle

POTTIER Francois's avatar
POTTIER Francois committed
45
% ------------------------------------------------------------------------------
46
47
48
49
50

\clearpage
\tableofcontents
\clearpage

POTTIER Francois's avatar
POTTIER Francois committed
51
% ------------------------------------------------------------------------------
52
53
54
55
56

\section{Foreword}

\menhir is a parser generator. It turns high-level grammar specifications,
decorated with semantic actions expressed in the \ocaml programming
57
language~\cite{ocaml}, into parsers, again expressed in \ocaml. It is
58
59
based on Knuth's LR(1) parser construction technique~\cite{knuth-lr-65}. It is
strongly inspired by its precursors: \yacc~\cite{johnson-yacc-79},
60
\texttt{ML-Yacc}~\cite{tarditi-appel-00}, and \ocamlyacc~\cite{ocaml},
61
62
63
64
65
66
67
68
69
70
but offers a large number of minor and major improvements that make it a more
modern tool.

This brief reference manual explains how to use \menhir. It does not attempt to
explain context-free grammars, parsing, or the LR technique. Readers who have
never used a parser generator are encouraged to read about these ideas
first~\cite{aho-86,appel-tiger-98,hopcroft-motwani-ullman-00}. They are also
invited to have a look at the \distrib{demos} directory in \menhir's
distribution.

71
Potential users of \menhir should be warned that \menhir's feature set is not
72
73
74
75
76
77
78
79
80
81
completely stable. There is a tension between preserving a measure of
compatibility with \ocamlyacc, on the one hand, and introducing new ideas, on
the other hand. Some aspects of the tool, such as the error handling
mechanism, are still potentially subject to incompatible changes: for
instance, in the future, the current error handling mechanism (which is based
on the \error token, see \sref{sec:errors}) could be removed and replaced with
an entirely different mechanism.

There is room for improvement in the tool and in this reference manual. Bug
reports and suggestions are welcome!
82

POTTIER Francois's avatar
POTTIER Francois committed
83
% ------------------------------------------------------------------------------
84
85
86
87
88
89
90

\section{Usage}

\menhir is invoked as follows:
\begin{quote}
\cmenhir \nt{option} \ldots \nt{option} \nt{filename} \ldots \nt{filename}
\end{quote}
91
92
Each of the file names must end with \mly (unless \ocoq is used,
in which case it must end with \vy) and denotes a partial
93
94
grammar specification. These partial grammar specifications are joined
(\sref{sec:split}) to form a single, self-contained grammar specification,
95
which is then processed. The following optional command line switches allow
96
97
98
99
100
101
102
103
controlling many aspects of the process.

\docswitch{\obase \nt{basename}} This switch controls the base name
of the \ml and \mli files that are produced. That is, the tool will produce
files named \nt{basename}\texttt{.ml} and \nt{basename}\texttt{.mli}. Note
that \nt{basename} can contain occurrences of the \texttt{/} character, so it
really specifies a path and a base name. When only one \nt{filename} is
provided on the command line, the default \nt{basename} is obtained by
104
depriving \nt{filename} of its final \mly suffix. When multiple file
105
106
107
names are provided on the command line, no default base name exists, so that
the \obase switch \emph{must} be used.

108
\docswitch{\ocmly} This switch causes \menhir to produce a \cmly file in
109
110
111
addition to its normal operation. This file contains a (binary-form)
representation of the grammar and automaton (see \sref{sec:sdk}).

112
113
114
\docswitch{\ocomment} This switch causes a few comments to be inserted into the
\ocaml code that is written to the \ml file.

115
116
117
118
119
120
\docswitch{\ocompareerrors \nt{filename1} \ocompareerrors \nt{filename2}} Two
such switches must always be used in conjunction so as to specify the names of
two \messages files, \nt{filename1} and \nt{filename2}. Each file is read and
internally translated to a mapping of states to messages. \menhir then checks
that the left-hand mapping is a subset of the right-hand mapping. This feature
is typically used in conjunction with \olisterrors to check that \nt{filename2}
121
is complete (that is, covers all states where an error can occur).
122
123
124
125
For more information, see \sref{sec:errors:new}.

\docswitch{\ocompileerrors \nt{filename}} This switch causes \menhir to read the
file \nt{filename}, which must obey the \messages file format, and to compile
126
it to an \ocaml function that maps a state number to a message. The \ocaml code
127
128
129
130
is sent to the standard output channel. At the same time, \menhir checks that
the collection of input sentences in the file \nt{filename} is correct and
irredundant. For more information, see \sref{sec:errors:new}.

131
132
\docswitch{\ocoq} This switch causes \menhir to produce Coq code. See \sref{sec:coq}.

133
134
135
136
137
138
139
140
141
\docswitch{\ocoqlibpath \nt{path}} This switch allows specifying under what name (or path) the Coq support library
MenhirLib is known to Coq. When \menhir runs in \ocoq mode, the generated
parser contains references to several modules in this library. This path is
used to qualify these references. Its default value is \texttt{MenhirLib}.

\docswitch{\ocoqlibnopath} This switch indicates that references to the Coq library MenhirLib
should \emph{not} be qualified. This was the default behavior of \menhir prior to 2018/05/30.
This switch is provided for compatibility, but normally should not be used.

142
\docswitch{\ocoqnoactions} (Used in conjunction with \ocoq.) This switch
143
causes the semantic actions present in the \vy file to be ignored and
144
replaced with \verb+tt+, the unique inhabitant of Coq's \verb+unit+ type. This
145
feature can be used to test the Coq back-end with a standard grammar, that is, a
146
grammar that contains \ocaml semantic actions. Just rename the file from
147
\mly to \vy and set this switch.
148
149
150
151
152
153
154
155
156

\docswitch{\ocoqnocomplete} (Used in conjunction with \ocoq.) This switch
disables the generation of the proof of completeness of the parser
(\sref{sec:coq}). This can be necessary because the proof of completeness is
possible only if the grammar has no conflict (not even a benign one, in the
sense of \sref{sec:conflicts:benign}). This can be desirable also because, for
a complex grammar, completeness may require a heavy certificate and its
validation by Coq may take time.

157
\docswitch{\odepend} See \sref{sec:build}.
158
159
160
161

\docswitch{\odump} This switch causes a description of the automaton
to be written to the file \nt{basename}\automaton.

POTTIER Francois's avatar
Typo.    
POTTIER Francois committed
162
\docswitch{\oechoerrors \nt{filename}} This switch causes \menhir to
163
164
165
166
read the \messages file \nt{filename} and to produce on the standard output
channel just the input sentences. (That is, all messages, blank lines, and
comments are filtered out.) For more information, see \sref{sec:errors:new}.

167
168
169
170
171
172
173
174
175
176
177
\docswitch{\oexplain} This switch causes conflict explanations to be
written  to the file \nt{basename}\conflicts. See also \sref{sec:conflicts}.

\docswitch{\oexternaltokens \nt{T}} This switch causes the definition of
the \token type to be omitted in \nt{basename}\texttt{.ml} and
\nt{basename}\texttt{.mli}. Instead, the generated parser relies on
the type $T$\texttt{.}\token, where $T$ is an \ocaml module name. It is up to
the user to define module $T$ and to make sure that it exports a suitable
\token type. Module $T$ can be hand-written. It can also be automatically generated
out of a grammar specification using the \oonlytokens switch.

178
\docswitch{\ofixedexc} This switch causes the exception \texttt{Error} to be
179
180
internally defined as a synonym for \texttt{Parsing.Parse\_error}. This means
that an exception handler that catches \texttt{Parsing.Parse\_error} will also
181
catch the generated parser's \texttt{Error}. This helps increase \menhir's
182
compatibility with \ocamlyacc. There is otherwise no reason to use this switch.
183

184
185
186
187
188
189
\docswitch{\ograph} This switch causes a description of the grammar's
dependency graph to be written to the file \nt{basename}\dott. The graph's
vertices are the grammar's nonterminal symbols. There is a directed edge from
vertex $A$ to vertex $B$ if the definition of $A$ refers to $B$. The file is
in a format that is suitable for processing by the \emph{graphviz} toolkit.

190
\docswitch{\oinfer, \oinferwrite, \oinferread} See \sref{sec:build}.
191

POTTIER Francois's avatar
POTTIER Francois committed
192
193
194
195
196
197
\docswitch{\oinspection} This switch requires \otable. It causes \menhir to generate
not only the monolithic and incremental APIs (\sref{sec:monolithic},
\sref{sec:incremental}), but also the inspection API (\sref{sec:inspection}).
Activating this switch causes a few more tables to be produced, resulting in
somewhat larger code size.

198
199
200
\docswitch{\ointerpret} This switch causes \menhir to act as an interpreter,
rather than as a compiler. No \ocaml code is generated. Instead, \menhir
reads sentences off the standard input channel, parses them, and displays
201
202
203
204
205
206
207
outcomes. This switch can be usefully combined with \otrace.
For more information, see \sref{sec:interpret}.

\docswitch{\ointerpreterror} This switch is analogous to \ointerpret, except
\menhir expects every sentence to cause an error on its last token, and
displays information about the state in which the error is detected, in
the \messages file format. For more information, see \sref{sec:errors:new}.
208
209
210
211
212

\docswitch{\ointerpretshowcst} This switch, used in conjunction with \ointerpret,
causes \menhir to display a concrete syntax tree when a sentence is successfully
parsed. For more information, see \sref{sec:interpret}.

213
214
215
216
\docswitch{\olisterrors} This switch causes \menhir to produce (on the standard
output channel) a complete list of input sentences that cause an error, in the
\messages file format. For more information, see \sref{sec:errors:new}.

217
218
219
220
221
222
223
224
225
226
227
228
229
\docswitch{\ologautomaton \nt{level}} When \nt{level} is nonzero, this switch
causes some information about the automaton to be logged to the standard error
channel.

\docswitch{\ologcode \nt{level}} When \nt{level} is nonzero, this switch
causes some information about the generated \ocaml code to be logged to the
standard error channel.

\docswitch{\ologgrammar \nt{level}} When \nt{level} is nonzero, this switch
causes some information about the grammar to be logged to the standard error
channel. When \nt{level} is 2, the \emph{nullable}, \emph{FIRST}, and
\emph{FOLLOW} tables are displayed.

230
231
232
\docswitch{\onodollars} This switch disallows the use of positional keywords
of the form \kw{\$i}.

233
234
235
236
\docswitch{\onoinline} This switch causes all \dinline keywords in the
grammar specification to be ignored. This is especially useful in order
to understand whether these keywords help solve any conflicts.

237
\docswitch{\onostdlib} This switch instructs \menhir to \emph{not} use
POTTIER Francois's avatar
POTTIER Francois committed
238
its standard library (\sref{sec:library}).
239

240
\docswitch{\oocamlc \nt{command}} See \sref{sec:build}.
241

242
\docswitch{\oocamldep \nt{command}} See \sref{sec:build}.
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259

\docswitch{\oonlypreprocess} This switch causes the grammar specifications
to be transformed up to the point where the automaton's construction can
begin. The grammar specifications whose names are provided on the command line
are joined (\sref{sec:split}); all parameterized nonterminal symbols are
expanded away (\sref{sec:templates}); type inference is performed, if \oinfer
is enabled; all nonterminal symbols marked \dinline are expanded away
(\sref{sec:inline}). This yields a single, monolithic grammar specification,
which is printed on the standard output channel.

\docswitch{\oonlytokens} This switch causes the \dtoken declarations in
the grammar specification to be translated into a definition of the \token
type, which is written to the files \nt{basename}\texttt{.ml} and
\nt{basename}\texttt{.mli}. No code is generated. This is useful when
a single set of tokens is to be shared between several parsers. The directory
\distrib{demos/calc-two} contains a demo that illustrates the use of this switch.

260
\docswitch{\orawdepend} See \sref{sec:build}.
261

Nicolás Ojeda Bär's avatar
Nicolás Ojeda Bär committed
262
\docswitch{\ostdlib \nt{directory}} This switch exists only for
POTTIER Francois's avatar
POTTIER Francois committed
263
backwards compatibility and is ignored. It may be removed in the future.
POTTIER Francois's avatar
POTTIER Francois committed
264

265
266
267
268
269
270
271
\docswitch{\ostrict} This switch causes several warnings about the grammar
and about the automaton to be considered errors. This includes warnings about
useless precedence declarations, non-terminal symbols that produce the empty
language, unreachable non-terminal symbols, productions that are never
reduced, conflicts that are not resolved by precedence declarations, and
end-of-stream conflicts.

272
\docswitch{\oo{suggest-*}} See \sref{sec:build}.
273

274
275
\docswitch{\otable} This switch causes \menhir to use its table-based
back-end, as opposed to its (default) code-based back-end. When \otable is
POTTIER Francois's avatar
POTTIER Francois committed
276
277
used, \menhir produces significantly more compact and somewhat slower parsers.
See \sref{sec:qa} for a speed comparison.
278
279
280
281
282
283
284
285
286
287
288
289

The table-based back-end produces rather compact tables, which are analogous
to those produced by \yacc, \bison, or \ocamlyacc. These tables are not quite
stand-alone: they are exploited by an interpreter, which is shipped as part of
the support library \menhirlib. For this reason, when \otable is used,
\menhirlib must be made visible to the \ocaml compilers, and must be linked
into your executable program. The \texttt{--suggest-*} switches, described
above, help do this.

The code-based back-end compiles the LR automaton directly into a nest of
mutually recursive \ocaml functions. In that case, \menhirlib is not required.

POTTIER Francois's avatar
POTTIER Francois committed
290
291
292
The incremental API (\sref{sec:incremental}) and the inspection API
(\sref{sec:inspection}) are made available only by the table-based back-end.

293
294
295
\docswitch{\otimings} This switch causes internal timing information to
be sent to the standard error channel.

296
297
298
\docswitch{\otimingsto \nt{filename}} This switch causes internal timing
information to be written to the file \nt{filename}.

299
300
301
302
303
304
\docswitch{\otrace} This switch causes tracing code to be inserted into
the generated parser, so that, when the parser is run, its actions are
logged to the standard error channel. This is analogous to \texttt{ocamlrun}'s
\texttt{p=1} parameter, except this switch must be enabled at compile time:
one cannot selectively enable or disable tracing at runtime.

305
306
307
308
309
310
311
312
313
314
\docswitch{\oignoreprec} This switch suppresses all warnings about
useless \dleft, \dright, \dnonassoc and \dprec declarations.

\docswitch{\oignoreone \nt{symbol}} This switch suppresses the warning that
is normally emitted when \menhir finds that the terminal symbol \nt{symbol} is
unused.

\docswitch{\oignoreall} This switch suppresses all of the warnings that are
normally emitted when \menhir finds that some terminal symbols are unused.

POTTIER Francois's avatar
Typo.    
POTTIER Francois committed
315
\docswitch{\oupdateerrors \nt{filename}} This switch causes \menhir to
316
317
318
319
320
read the \messages file \nt{filename} and to produce on the standard output
channel a new \messages file that is identical, except the auto-generated
comments have been re-generated. For more information,
see \sref{sec:errors:new}.

321
322
323
\docswitch{\oversion} This switch causes \menhir to print its own version
number and exit.

POTTIER Francois's avatar
POTTIER Francois committed
324
% ------------------------------------------------------------------------------
325
326
327

\section{Lexical conventions}

POTTIER Francois's avatar
POTTIER Francois committed
328
329
330
331
332
333
334
335
336
337
338
339
340
341
A semicolon character (\kw{;}) \emph{may} appear after a declaration
(\sref{sec:decls}).

An old-style rule (\sref{sec:old:rules}) \emph{may} be terminated with a
semicolon. Also, within an old-style rule, each producer
(\sref{sec:producers}) \emph{may} be terminated with a semicolon.

A new-style rule (\sref{sec:new:rules}) \emph{must not} be terminated with a
semicolon. Within such a rule, the elements of a sequence \emph{must} be
separated with semicolons.

Semicolons are not allowed to appear anywhere except in the places mentioned
above. This is in contrast with \ocamlyacc, which views semicolons as
insignificant, just like whitespace.
342
343
344
345
346
347

Identifiers (\nt{id}) coincide with \ocaml identifiers, except they are not
allowed to contain the quote (\kw{'}) character. Following
\ocaml, identifiers that begin with a lowercase letter
(\nt{lid}) or with an uppercase letter (\nt{uid}) are distinguished.

348
349
350
351
A quoted identifier \nt{qid} is a string enclosed in double quotes.
Such a string cannot contain a double quote or a backslash.
Quoted identifiers are used as token aliases (\sref{sec:tokens}).

352
353
354
355
356
357
358
359
360
Comments are C-style (surrounded with \kw{/*} and \kw{*/}, cannot be nested),
C++-style (announced by \kw{/$\!$/} and extending until the end of the line), or
\ocaml-style (surrounded with \kw{(*} and \kw{*)}, can be nested). Of course,
inside \ocaml code, only \ocaml-style comments are allowed.

\ocaml type expressions are surrounded with \kangle{and}. Within such expressions,
all references to type constructors (other than the built-in \textit{list}, \textit{option}, etc.)
must be fully qualified.

POTTIER Francois's avatar
POTTIER Francois committed
361
% ------------------------------------------------------------------------------
362
363
364

\section{Syntax of grammar specifications}

POTTIER Francois's avatar
POTTIER Francois committed
365
\newcommand{\modifier}{(\,\dquestion \metachoice \dplus \metachoice \dstar\hspace{-.3mm})}
POTTIER Francois's avatar
POTTIER Francois committed
366

367
368
369
370
371
372
373
374
\begin{figure}
\begin{center}
\begin{tabular}{r@{}c@{}l}

\nt{specification} \is
   \sepspacelist{\nt{declaration}}
   \percentpercent
   \sepspacelist{\nt{rule}}
375
   \optional{\percentpercent \textit{\ocaml code}} \\
376
377

\nt{declaration} \is
378
   \dheader{\textit{\ocaml code}} \\
379
&& \dparameter \ocamlparam \\
380
&& \dtoken \optional{\ocamltype} \sepspacelist{\nt{uid} \optional{\nt{qid}}} \\
381
382
383
384
385
&& \dnonassoc \sepspacelist{\nt{uid}} \\
&& \dleft \sepspacelist{\nt{uid}} \\
&& \dright \sepspacelist{\nt{uid}} \\
&& \dtype \ocamltype \sepspacelist{\nt{lid}} \\
&& \dstart \optional{\ocamltype} \sepspacelist{\nt{lid}} \\
386
387
&& \dattribute \sepspacelist{\nt{actual}} \sepspacelist{\nt{attribute}} \\
&& \kw{\%} \nt{attribute} \\ % a grammar-wide attribute
388
&& \donerrorreduce \sepspacelist{\nt{lid}} \\
POTTIER Francois's avatar
POTTIER Francois committed
389
390
391
\nt{attribute} \is
    \kw{[@} \nt{name} \nt{payload} \kw{]}
\\[4mm]
392

POTTIER Francois's avatar
POTTIER Francois committed
393
\emph{old syntax} ---
394
395
396
\nt{rule} \is
   \optional{\dpublic} \optional{\dinline}
   \nt{lid}
POTTIER Francois's avatar
POTTIER Francois committed
397
   \oparams{\nt{id}}
398
   \deuxpoints
POTTIER Francois's avatar
POTTIER Francois committed
399
   \precseplist\barre{\nt{group}} \\
400
401
402
403
404
405
406
407
408
409
410
411
412

\nt{group} \is
   \seplist{\ \barre}{\nt{production}}
   \daction
   \optional {\dprec \nt{id}} \\

\nt{production} \is
   \sepspacelist{\nt{producer}} \optional {\dprec \nt{id}} \\

\nt{producer} \is
   \optional{\nt{lid} \dequal} \nt{actual} \\

\nt{actual} \is
POTTIER Francois's avatar
POTTIER Francois committed
413
414
   \nt{id} \oparams{\nt{actual}} \\
&& \nt{actual} \modifier \\
415
&& \seplist{\ \barre}{\nt{group}} % not really allowed everywhere
POTTIER Francois's avatar
POTTIER Francois committed
416
\\[4mm]
417

POTTIER Francois's avatar
POTTIER Francois committed
418
419
420
421
422
423
\emph{new syntax} ---
\nt{rule} \is
  \optional{\dpublic}
  \dlet
  \nt{lid}
  \oparams{\nt{id}}
POTTIER Francois's avatar
POTTIER Francois committed
424
  (\,\dcolonequal \metachoice \dequalequal\hspace{-.2mm})
POTTIER Francois's avatar
POTTIER Francois committed
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
  \expression
\\
\expression \is
  % a choice between sequence expressions:
  \precseplist\barre\expression
\\&&
  % a sequence expression:
  \optional{\pattern \dequal{}} \expression \dsemi \expression
\\&&
  % a symbol expression:
  \nt{id} \oparams{\expression}
\\&&
  % a symbol expression:
  \expression \modifier
\\&&
  % an action expression:
  \daction
  \optional {\dprec \nt{id}}
    % %prec is in fact allowed to appear before the semantic action,
    % but this is not documented.
\\&&
  % an action expression:
  \dpfaction{\nt{\ocaml id}}
  \optional {\dprec \nt{id}}
\\
\pattern \is
  \nt{lid}
\,\metachoice\,
  \dunderscore
\,\metachoice\,
  \dtilde
\,\metachoice\,
  \tuple\pattern

% The places where attributes can be attached are not shown in this
% figure. This is intentional; let's avoid pollution. Attributes are
% described separately.
462
463
464
465
466
467
\end{tabular}
\end{center}
\caption{Syntax of grammar specifications}
\label{fig:syntax}
\end{figure}

POTTIER Francois's avatar
POTTIER Francois committed
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
The syntax of grammar specifications appears in \fref{fig:syntax}.
The places where attributes can be attached are not shown; they
are documented separately (\sref{sec:attributes}).
%
% (For compatibility with \ocamlyacc, some specifications that do not fully
% adhere to this syntax are also accepted.)
%
A grammar specification begins with a sequence of declarations
(\sref{sec:decls}), ended by a mandatory \percentpercent keyword.
%
Following this keyword, a sequence of rules is expected.
Each rule defines a nonterminal symbol~\nt{lid},
whose name must begin with a lowercase letter.
%
% In reality, in the old syntax, this is enforced only for start symbols.
% In the new syntax, this is enforced for all symbols.
%
% A rule can also *extend* a symbol, but let's not mention that here.
%
A rule is expressed either in the ``old syntax'' (\sref{sec:old:rules}) or in the
``new syntax'' (\sref{sec:new:rules}), which is slightly more elegant and
powerful.
490
491

\subsection{Declarations}
492
\label{sec:decls}
493
494

\subsubsection{Headers}
POTTIER Francois's avatar
POTTIER Francois committed
495
\label{sec:decls:headers}
496
497
498
499
500
501
502
503
504

A header is a piece of \ocaml code, surrounded with \dheader{and}. It is
copied verbatim at the beginning of the \ml file. It typically contains \ocaml
\kw{open} directives and function definitions for use by the semantic
actions. If a single grammar specification file contains multiple headers,
their order is preserved. However, when two headers originate in distinct
grammar specification files, the order in which they are copied to the \ml
file is unspecified.

505
506
507
508
509
510
511
512
513
514
515
It is important to note that the header is copied by \menhir only to the \ml
file, \emph{not} to the \mli file. Therefore, it should not contain
declarations that affect the meaning of the types that appear in the \mli
file. Here are two problems that people commonly run into:
\begin{itemize}
\item Placing an \kw{open} directive that is required for a \dtype declaration
  to make sense. For instance, writing \verb+open Foo+ in the header and
  declaring \verb+%type<t> bar+, where the type \verb+t+ is defined in the
  module \verb+Foo+, will not work. You must write \verb+%type<Foo.t> bar+.
\item Declaring a module alias that affects a (declared or inferred) type. For
  instance, writing \verb+module F = Foo+ in the header and declaring
516
517
518
  \verb+%type<Foo.t> bar+ may not work (from
  2020/05/25 on). The reason is, OCaml may infer that the symbol \verb+bar+ has
  type \verb+F.t+, and Menhir will rely on this information without realizing
519
520
521
522
  that \verb+F+ is a local name, so in the end, the \mli file contains a
  reference to \verb+F.t+ that does not make sense.
\end{itemize}

523
\subsubsection{Parameters}
POTTIER Francois's avatar
POTTIER Francois committed
524
\label{sec:parameter}
525
526
527
528
529
530

A declaration of the form:
\begin{quote}
\dparameter \ocamlparam
\end{quote}
causes the entire parser to become parameterized over the \ocaml module
531
532
533
534
\nt{uid}, that is, to become an \ocaml functor. The directory
\distrib{demos/calc-param} contains a demo that illustrates the use of this switch.

If a single specification file
535
536
537
538
539
540
541
542
543
544
545
546
547
contains multiple \dparameter declarations, their order is preserved, so that
the module name \nt{uid} introduced by one declaration is effectively in scope
in the declarations that follow. When two \dparameter declarations originate
in distinct grammar specification files, the order in which they are processed
is unspecified. Last, \dparameter declarations take effect before \dheader{$\ldots$},
\dtoken, \dtype, or \dstart declarations are considered, so that the module name
\nt{uid} introduced by a \dparameter declaration is effectively in scope in
\emph{all} \dheader{$\ldots$}, \dtoken, \dtype, or \dstart declarations,
regardless of whether they precede or follow the \dparameter declaration.
This means, in particular, that the side effects of an \ocaml header are
observed only when the functor is applied, not when it is defined.

\subsubsection{Tokens}
548
\label{sec:tokens}
549
550
551

A declaration of the form:
\begin{quote}
552
553
554
555
\dtoken
\optional{\ocamltype}
$\nt{uid}_1$ \optional{$\nt{qid}_1$} $\;\ldots\;$
$\nt{uid}_n$ \optional{$\nt{qid}_n$}
556
557
558
\end{quote}
defines the identifiers $\nt{uid}_1, \ldots, \nt{uid}_n$ as tokens, that is,
as terminal symbols in the grammar specification and as data constructors in
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
the \textit{token} type.

If an \ocaml type $t$ is present, then these tokens are considered to carry a
semantic value of type $t$, otherwise they are considered to carry no semantic
value.

If a quoted identifier $\nt{qid}_i$ is present, then it is considered an alias
for the terminal symbol $\nt{uid}_i$. (This feature, known as ``token
aliases'', is borrowed from Bison.)
% https://www.gnu.org/software/bison/manual/html_node/Token-Decl.html#Token-Decl
Throughout the grammar, the quoted identifier $\nt{qid}_i$ is then
synonymous with the identifier $\nt{uid}_i$.
%
For example, if one declares:
\begin{verbatim}
%token PLUS "+"
\end{verbatim}
then the quoted identifier \texttt{"+"} stands for the terminal symbol
\texttt{PLUS} throughout the grammar. An example of the use of token aliases
appears in the directory \distrib{demos/calc-alias}.
%
Token aliases can be used to improve the readability of a grammar. One must
keep in mind, however, that they are just syntactic sugar: they are not
interpreted in any way by Menhir or conveyed to tools like \ocamllex.
%
They could be considered confusing by a reader who mistakenly believes that
they are interpreted as string literals.
586
587
588
589
590
591
592
593
594
595

\subsubsection{Priority and associativity}
\label{sec:assoc}

A declaration of one of the following forms:
\begin{quote}
\dnonassoc $\nt{uid}_1 \ldots \nt{uid}_n$ \\
\dleft $\nt{uid}_1 \ldots \nt{uid}_n$ \\
\dright $\nt{uid}_1 \ldots \nt{uid}_n$
\end{quote}
596
assigns both a \emph{priority level} and an \emph{associativity status} to
597
598
599
600
601
602
603
604
605
606
607
the symbols $\nt{uid}_1, \ldots, \nt{uid}_n$. The priority level assigned to
$\nt{uid}_1, \ldots, \nt{uid}_n$ is not defined explicitly: instead, it is
defined to be higher than the priority level assigned by the previous
\dnonassoc, \dleft, or \dright declaration, and lower than that assigned by the next
\dnonassoc, \dleft, or \dright declaration. The symbols $\nt{uid}_1, \ldots, \nt{uid}_n$
can be tokens (defined elsewhere by a \dtoken declaration) or dummies (not
defined anywhere). Both can be referred to as part of \dprec annotations.
Associativity status and priority levels allow shift/reduce conflicts to be
silently resolved (\sref{sec:conflicts}).

\subsubsection{Types}
POTTIER Francois's avatar
POTTIER Francois committed
608
\label{sec:type}
609
610
611
612
613
614
615
616
617

A declaration of the form:
\begin{quote}
\dtype \ocamltype $\nt{lid}_1 \ldots \nt{lid}_n$
\end{quote}
assigns an \ocaml type to each of the nonterminal symbols $\nt{lid}_1, \ldots, \nt{lid}_n$.
For start symbols, providing an \ocaml type is mandatory, but is usually done as part of the
\dstart declaration. For other symbols, it is optional. Providing type information can improve
the quality of \ocaml's type error messages.
618
619
620
621
622

A \dtype declaration may concern not only a nonterminal symbol, such as, say,
\texttt{expression}, but also a fully applied parameterized nonterminal
symbol, such as \texttt{list(expression)} or \texttt{separated\_list(COMMA,
  option(expression))}.
623

POTTIER Francois's avatar
POTTIER Francois committed
624
625
626
627
628
629
630
631
The types provided as part of \dtype declarations are copied verbatim to the
\ml and \mli files. In contrast, headers (\sref{sec:decls:headers}) are copied
to the \ml file only. For this reason, the types provided as part of \dtype
declarations must make sense both in the presence and in the absence of these
headers. They should typically be fully qualified types.

% TEMPORARY type information can be mandatory in --coq mode; document?

632
\subsubsection{Start symbols}
POTTIER Francois's avatar
POTTIER Francois committed
633
\label{sec:start}
634
635
636
637
638
639
640
641
642
643
644
645

A declaration of the form:
\begin{quote}
\dstart \optional{\ocamltype} $\nt{lid}_1 \ldots \nt{lid}_n$
\end{quote}
declares the nonterminal symbols $\nt{lid}_1, \ldots, \nt{lid}_n$ to be
start symbols. Each such symbol must be assigned an \ocaml type either as
part of the \dstart declaration or via separate \dtype declarations. Each
of $\nt{lid}_1, \ldots, \nt{lid}_n$ becomes the name of a function whose
signature is published in the \mli file and that can be used to invoke
the parser.

646
647
648
649
650
651
652
653
654
655
\subsubsection{Attribute declarations}

Attribute declarations
of the form
\dattribute \sepspacelist{\nt{actual}} \sepspacelist{\nt{attribute}}
and
\kw{\%} \nt{attribute}
are explained
in \sref{sec:attributes}.

656
657
658
659
660
661
662
663
664
\subsubsection{Extra reductions on error}
\label{sec:onerrorreduce}

A declaration of the form:
\begin{quote}
\donerrorreduce $\nt{lid}_1 \ldots \nt{lid}_n$
\end{quote}
marks the nonterminal symbols $\nt{lid}_1, \ldots, \nt{lid}_n$ as
potentially eligible for reduction when an invalid token is found.
665
666
This may cause one or more extra reduction steps to be performed
before the error is detected.
667
668
669
670
671

More precisely, this declaration affects the automaton as follows. Let us say
that a production $\nt{lid} \rightarrow \ldots$ is ``reducible on error'' if
its left-hand symbol~\nt{lid} appears in a \donerrorreduce declaration. After
the automaton has been constructed and after any conflicts have been resolved,
672
673
674
675
676
677
678
679
680
in every state~$s$, the following algorithm is applied:
\begin{enumerate}
\item Construct the set of all productions that are ready to be reduced in
  state~$s$ and are reducible on error;
\item Test if one of them, say $p$, has higher ``on-error-reduce-priority''
  than every other production in this set;
\item If so, in state~$s$, replace every error action with a reduction of the
  production~$p$.
(In other words, for every terminal symbol~$t$, if the action table
681
says: ``in state~$s$, when the next input symbol is~$t$, fail'', then this
682
683
684
entry is replaced with: ``in state~$s$, when the next input symbol
is~$t$, reduce production~$p$''.)
\end{enumerate}
685

686
687
If step 3 above is executed in state~$s$, then an error can never be detected
in state~$s$, since all error actions in state~$s$ are replaced with reduce
688
689
690
691
actions. Error detection is deferred: at least one reduction takes place
before the error is detected. It is a ``spurious'' reduction: in a canonical
LR(1) automaton, it would not take place.

692
693
694
695
696
697
An \donerrorreduce declaration does not affect the language that is accepted
by the automaton. It does not affect the location where an error is detected.
It is used to control in which state an error is detected. If used wisely, it
can make errors easier to report, because they are detected in a state for
which it is easier to write an accurate diagnostic message
(\sref{sec:errors:diagnostics}).
698
699
700
701
702
703
704
705
706

% This may make the tables bigger (but I have no statistics).
% This makes LRijkstra significantly slower.

Like a \dtype declaration, an \donerrorreduce declaration may concern not only
a nonterminal symbol, such as, say, \texttt{expression}, but also a fully
applied parameterized nonterminal symbol, such as \texttt{list(expression)} or
\texttt{separated\_list(COMMA, option(expression))}.

707
708
709
710
711
712
713
714
715
The ``on-error-reduce-priority'' of a production is that of its left-hand
symbol. The ``on-error-reduce-priority'' of a nonterminal symbol is determined
implicitly by the order of \donerrorreduce declarations. In the declaration
$\donerrorreduce\;\nt{lid}_1 \ldots \nt{lid}_n$, the symbols $\nt{lid}_1, \ldots,
\nt{lid}_n$ have the same ``on-error-reduce-priority''. They have higher
``on-error-reduce-priority'' than the symbols listed in previous
\donerrorreduce declarations, and lower ``on-error-reduce-priority''
than those listed in later \donerrorreduce declarations.

POTTIER Francois's avatar
POTTIER Francois committed
716
717
\subsection{Rules---old syntax}
\label{sec:old:rules}
718

719
In its simplest
POTTIER Francois's avatar
POTTIER Francois committed
720
form, a rule begins with the nonterminal symbol \nt{lid},
721
followed by a colon character (\deuxpoints),
722
723
724
725
726
727
and continues with a sequence of production groups
(\sref{sec:productiongroups}). Each production group is preceded with a
vertical bar character (\barre); the very first bar is optional. The meaning
of the bar is choice: the nonterminal symbol \nt{id} develops to either of the
production groups. We defer explanations of the keyword \dpublic
(\sref{sec:split}), of the keyword \dinline (\sref{sec:inline}), and of the
POTTIER Francois's avatar
POTTIER Francois committed
728
optional formal parameters $\tuple{\nt{id}}$
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
(\sref{sec:templates}).

\subsubsection{Production groups}
\label{sec:productiongroups}

In its simplest form, a production group consists of a single production (\sref{sec:productions}),
followed by an \ocaml semantic action (\sref{sec:actions}) and an optional
\dprec annotation (\sref{sec:prec}). A production specifies a sequence of terminal and
nonterminal symbols that should be recognized, and optionally binds
identifiers to their semantic values.

\paragraph{Semantic actions}
\label{sec:actions}

A semantic action is a piece of \ocaml code that is executed in order to
assign a semantic value to the nonterminal symbol with which this production
group is associated. A semantic action can refer to the (already computed)
semantic values of the terminal or nonterminal symbols that appear in the
747
748
749
750
production via the semantic value identifiers bound by the production.

For compatibility with \ocamlyacc, semantic actions can also refer to
unnamed semantic values via positional keywords of the form
751
752
753
\kw{\$1}, \kw{\$2}, etc.\ This style is discouraged.
(It is in fact forbidden if \onodollars is turned on.)
Furthermore, as
Yann Régis-Gianas's avatar
Yann Régis-Gianas committed
754
a positional keyword of the form \kw{\$i} is internally rewritten as
755
\nt{\_i}, the user should not use identifiers of the form \nt{\_i}.
756
757
758
759

\paragraph{\dprec annotations}
\label{sec:prec}

760
761
An annotation of the form \dprec \nt{id} indicates that the precedence level
of the production group is the level assigned to the symbol \nt{id} via a
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
previous \dnonassoc, \dleft, or \dright declaration (\sref{sec:assoc}). In the
absence of a
\dprec annotation, the precedence level assigned to each production is the
level assigned to the rightmost terminal symbol that appears in it. It is
undefined if the rightmost terminal symbol has an undefined precedence level
or if the production mentions no terminal symbols at all. The precedence level
assigned to a production is used when resolving shift/reduce conflicts
(\sref{sec:conflicts}).

\paragraph{Multiple productions in a group}

If multiple productions are present in a single group, then the semantic
action and precedence annotation are shared between them. This short-hand
effectively allows several productions to share a semantic action and
precedence annotation without requiring textual duplication. It is legal only
when every production binds exactly the same set of semantic value identifiers
and when no positional semantic value keywords (\kw{\$1}, etc.) are used.

\subsubsection{Productions}
\label{sec:productions}

A production is a sequence of producers (\sref{sec:producers}), optionally
followed by a \dprec annotation (\sref{sec:prec}). If a precedence annotation
is present, it applies to this production alone, not to other productions in
the production group. It is illegal for a production and its production group
to both carry \dprec annotations.

\subsubsection{Producers}
\label{sec:producers}

A producer is an actual (\sref{sec:actual}), optionally preceded with a
binding of a semantic value identifier, of the form \nt{lid} \dequal. The
actual specifies which construction should be recognized and how a semantic
value should be computed for that construction. The identifier \nt{lid}, if
present, becomes bound to that semantic value in the semantic action that
follows. Otherwise, the semantic value can be referred to via a positional
keyword (\kw{\$1}, etc.).

\subsubsection{Actuals}
\label{sec:actual}

803
804
805
In its simplest form, an actual is just a terminal or nonterminal symbol
$\nt{id}$. If it is a parameterized non-terminal symbol (see
\sref{sec:templates}), then it should be applied:
POTTIER Francois's avatar
POTTIER Francois committed
806
$\nt{id}\tuple{\nt{actual}}$.
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836

An actual may be followed with a modifier (\dquestion, \dplus, or
\dstar). This is explained further on (see \sref{sec:templates} and
\fref{fig:sugar}).

An actual may also be an ``anonymous rule''. In that case, one writes
just the rule's right-hand side, which takes the form $\seplist{\ \barre\
}{\nt{group}}$.
(This form is allowed only as an argument in an application.)
This form is expanded on the fly to a definition of a fresh non-terminal
symbol, which is declared \dinline.
For instance, providing an anonymous rule as an argument to \nt{list}:
\begin{quote}
\begin{tabular}{l}
\nt{list} \dlpar \basic{e} = \nt{expression}; \basic{SEMICOLON} \dpaction{\basic{e}} \drpar
\end{tabular}
\end{quote}
is equivalent to writing this:
\begin{quote}
\begin{tabular}{l}
\nt{list} \dlpar \nt{expression\_SEMICOLON} \drpar
\end{tabular}
\end{quote}
where the non-terminal symbol \nt{expression\_SEMICOLON} is chosen fresh and is defined as follows:
\begin{quote}
\begin{tabular}{l}
\dinline \nt{expression\_SEMICOLON}:
\newprod \basic{e} = \nt{expression}; \basic{SEMICOLON} \dpaction{\basic{e}}
\end{tabular}
\end{quote}
837

POTTIER Francois's avatar
POTTIER Francois committed
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
\subsection{Rules---new syntax}
\label{sec:new:rules}

Please be warned that \textbf{the new syntax is considered experimental}
and is subject to change in the future.
% TEMPORARY à supprimer un jour...

In its simplest form,
a rule takes the form
\dlet \nt{lid} \dcolonequal \expression.
%
Its left-hand side \nt{lid} is a nonterminal symbol;
its right-hand side is an expression.
%
Such a rule defines an ordinary nonterminal symbol,
while the alternate form
\dlet \nt{lid} \dequalequal \expression
defines an \dinline nonterminal symbol (\sref{sec:inline}),
that is, a macro.
%
A rule can be preceded with the keyword \dpublic (\sref{sec:split})
and can be parameterized with a tuple of formal parameters
$\tuple{\nt{id}}$ (\sref{sec:templates}).
%
The various forms of expressions, listed in \fref{fig:syntax}, are:
%
\begin{itemize}
\item
  A \textbf{choice} between several expressions,
  % \precseplist\barre\expression.
  \optional{\barre}
  \expressionsub{1} \barre ${}\ldots{}$ \barre\expressionsub{n}.
  The leading bar is optional.

\item
  A \textbf{sequence} of two expressions,
  \pattern \dequal \expressionsub{1} \dsemi \expressionsub{2}.
  The semantic value produced by \expressionsub{1}
  is decomposed according to the pattern \pattern.
  The \ocaml variables introduced by \pattern
  may appear in a semantic action
  that ends the sequence \expressionsub{2}.

\item
  A sequence \dtilde \dequal \nt{id}${}_1$ \dsemi \expressionsub{2},
  which is sugar for \nt{id}${}_1$ \dequal \nt{id}${}_1$ \dsemi
  \expressionsub{2}.
  This is a \textbf{pun}.
  % This is a special case of the previous form,
  % yet it receives special treatment; this is the
  % only case where ~ represents a deterministically-chosen name.

\item
  A sequence \expressionsub{1} \dsemi \expressionsub{2},
  which is sugar for
  \dunderscore \dequal \expressionsub{1} \dsemi \expressionsub{2}.

\item
  A \textbf{symbol} \nt{id}, possibly applied to
  a tuple of expressions
  \dlpar \expressionsub{1},\ ${}\ldots{}$,\ \expressionsub{n} \drpar.
  It is worth noting that such an expression \emph{can} form the
  end of a sequence: \nt{id} at the end of a sequence stands for
  \nt{x} \dequal \nt{id} \dsemi \dpaction{\nt{x}}
  for some fresh variable \nt{x}.
  Thus, a sequence need not end with a semantic action.

\item
  An expression followed with \dquestion, \dplus, or \dstar. This is sugar
  for the previous form: see \sref{sec:templates} and \fref{fig:sugar}.

\item
  A \textbf{semantic action} \daction,
  possibly followed with a precedence annotation \dprec \nt{id}.
  This \ocaml code can refer to the variables that have been bound
  earlier in the sequence that this semantic action ends.
  These include all variables named by the user as well as all
  variables introduced by a $\dtilde$ pattern as part of a pun.
  % (but not variables introduced by deep ~ patterns)
  The notation $\kw{\$}i$, where $i$ is an integer, is forbidden.

\item
  A \textbf{point-free semantic action} \dpfaction{\nt{\ocaml id}},
  possibly followed with a precedence annotation \dprec~\nt{id}.
  The \ocaml identifier \nt{id} must denote a function or a data
  constructor. It is applied to a tuple of the variables that have
  been bound earlier in the sequence that this semantic action ends.
  Thus, $\dpfaction{\,\nt{id}\,}$ is sugar for
  $\dpaction{\,\nt{id}\;\,(x_1, \ldots, x_n)\,}$,
  where $x_1, \ldots, x_n$ are the variables bound earlier.
  These include all variables named by the user as well as all
  variables introduced by a $\dtilde$ pattern.

\item
  An identity semantic action \dpfidentityaction. This is sugar for
  \dpfaction{\nt{identity}}, where \nt{identity} is \ocaml's
  identity function.
  Therefore, it is sugar for
  $\dpaction{\,(x_1, \ldots, x_n)\,}$,
  where $x_1, \ldots, x_n$ are the variables bound earlier.
\end{itemize}

\begin{comment}
% fp: not sure if this paragraph is helpful.
To some degree, an expression is analogous to an \ocaml expression: it returns
an \ocaml value, and as a side effect, recognizes and consumes a fragment of
the input. In particular, a sequence
%
\pattern \dequal \expressionsub{1} \dsemi \expressionsub{2}
%
is roughly analogous to an \ocaml sequence \verb+let p = e1 in e2+.
% terminal symbol = recognize and consume one input symbol
% nonterminal symbol = procedure call or macro invocation (if %inline)
% semantic action = OCaml code insertion
\end{comment}

The syntax of expressions, as presented in \fref{fig:syntax}, seems more
permissive than it really is. In reality, a choice cannot be nested inside a
sequence;
% (either on the left or on the right)
a sequence cannot be nested in the left-hand side of a sequence; a semantic
action cannot appear in the left-hand side of a sequence. (Thus, there is a
stratification in three levels: choice expressions, sequence expressions, and
atomic expressions, which corresponds roughly to the stratification of rules,
productions, and producers in the old syntax.)
%
Furthermore, an expression between parentheses
$\dlpar \expression \drpar$ is \emph{not} a valid expression.
To surround an expression with parentheses, one must write either
$\nt{midrule}\, \dlpar \expression \drpar$ or
$\nt{endrule}\, \dlpar \expression \drpar$;
see \sref{sec:library} and \fref{fig:standard}.

When a complex expression (e.g., a choice or a sequence) is placed
in parentheses, as in \nt{id}\,\dlpar\expression\drpar, this is
equivalent to using $\nt{id}\,\dlpar\nt{s}\drpar$, where the fresh
symbol~\nt{s} is declared as a synonym for this expression, via the
declaration
\dlet \nt{s} \dequalequal \expression.
This idiom is also known as an anonymous rule (\sref{sec:actual}).

\paragraph{Examples}

As an example of a rule in the new syntax,
the parameterized nonterminal symbol \nt{option},
which is part of Menhir's standard library (\sref{sec:library}),
can be defined as follows:
%
\begin{quote}
  \begin{tabular}{l}
    \dlet \nt{option}(\nt{x}) \dcolonequal \\
    \quad \barre \phantom{\nt{x} \dequal \nt{x} \dsemi{}} \dpaction{\nt{None}} \\
    \quad \barre          \nt{x} \dequal \nt{x} \dsemi{}  \dpaction{\nt{Some x}}
  \end{tabular}
\end{quote}
%
Using a pun, it can also be written as follows:
%
\begin{quote}
  \begin{tabular}{l}
    \dlet \nt{option}(\nt{x}) \dcolonequal \\
    \quad \barre \phantom{\dtilde \dequal \nt{x} \dsemi{}} \dpaction{\nt{None}} \\
    \quad \barre          \dtilde \dequal \nt{x} \dsemi{}  \dpaction{\nt{Some x}}
  \end{tabular}
\end{quote}
%
Using a pun and a point-free semantic action,
it can also be expressed as follows:
%
\begin{quote}
  \begin{tabular}{l}
    \dlet \nt{option}(\nt{x}) \dcolonequal \\
    \quad \barre \phantom{\dtilde \dequal \nt{x} \dsemi{}} \dpaction{\nt{None}} \\
    \quad \barre          \dtilde \dequal \nt{x} \dsemi{}  \dpfaction{\nt{Some}}
  \end{tabular}
\end{quote}
%
As another example, the parameterized symbol $\nt{delimited}$,
also part of Menhir's standard library (\sref{sec:library}),
can be defined in the new syntax as follows:
%
\begin{quote}
  \begin{tabular}{l}
    \dlet \nt{delimited}(\nt{opening}, \nt{x}, \nt{closing}) \dequalequal \\
    \quad \nt{opening} \dsemi
          \dtilde \dequal \nt{x} \dsemi
          \nt{closing} \dsemi
          \dpfidentityaction
  \end{tabular}
\end{quote}
%
The use of $\dequalequal$ indicates that this is a macro,
i.e., an \dinline nonterminal symbol (see \sref{sec:inline}).
The identity semantic action \dpfidentityaction is here synonymous with
\dpaction{\nt{x}}.

Other illustrations of the new syntax can be found in
1035
1036
the directories \distrib{demos/calc-new-syntax}
and \distrib{demos/calc-ast}.
POTTIER Francois's avatar
POTTIER Francois committed
1037

1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
\section{Advanced features}

\subsection{Splitting specifications over multiple files}
\label{sec:split}

\paragraph{Modules}

Grammar specifications can be split over multiple files. When \menhir is
invoked with multiple argument file names, it considers each of these files as
a \emph{partial} grammar specification, and \emph{joins} these partial
specifications in order to obtain a single, complete specification.

This feature is intended to promote a form a modularity. It is hoped that, by
splitting large grammar specifications into several ``modules'', they can be
made more manageable. It is also hoped that this mechanism, in conjunction
with parameterization (\sref{sec:templates}), will promote sharing and reuse.
It should be noted, however, that this is only a weak form of
modularity. Indeed, partial specifications cannot be independently processed
(say, checked for conflicts).  It is necessary to first join them, so as to
form a complete grammar specification, before any kind of grammar analysis can
be done.

This mechanism is, in fact, how \menhir's standard library (\sref{sec:library})
is made available: even though its name does not appear on the command line,
it is automatically joined with the user's explicitly-provided grammar
specifications, making the standard library's definitions globally visible.

A partial grammar specification, or module, contains declarations and rules,
just like a complete one: there is no visible difference. Of course, it can
consist of only declarations, or only rules, if the user so chooses. (Don't
forget the mandatory \percentpercent keyword that separates declarations and
rules. It must be present, even if one of the two sections is empty.)

\paragraph{Private and public nonterminal symbols}

It should be noted that joining is \emph{not} a purely textual process. If two
modules happen to define a nonterminal symbol by the same name, then it is
considered, by default, that this is an accidental name clash. In that case,
each of the two nonterminal symbols is silently renamed so as to avoid the
clash. In other words, by default, a nonterminal symbol defined in module $A$
is considered \emph{private}, and cannot be defined again, or referred to, in
module $B$.

Naturally, it is sometimes desirable to define a nonterminal symbol $N$ in
1082
module $A$ and to refer to it in module~$B$. This is permitted if $N$ is
1083
1084
1085
1086
public, that is, if either its definition carries the keyword \dpublic or
$N$ is declared to be a start symbol. A public nonterminal symbol is never
renamed, so it can be referred to by modules other than its defining module.

1087
1088
1089
1090
In fact, it is permitted to split the definition of a \emph{public} nonterminal
symbol, over multiple modules and/or within a single module.
That is, a public nonterminal symbol $N$ can
have multiple definitions, within one module and/or in distinct modules.
1091
1092
All of these definitions are joined using the choice (\barre) operator.
For instance, in the grammar of a
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
programming language, the definition of the nonterminal symbol \nt{expression}
could be split into multiple modules, where one module groups the expression
forms that have to do with arithmetic, one module groups those that concern
function definitions and function calls, one module groups those that concern
object definitions and method calls, and so on.

\paragraph{Tokens aside}

Another use of modularity consists in placing all \dtoken declarations in one
module, and the actual grammar specification in another module. The module
that contains the token definitions can then be shared, making it easier to
define multiple parsers that accept the same type of tokens. (On this topic,
see \distrib{demos/calc-two}.)

\subsection{Parameterizing rules}
\label{sec:templates}

A rule (that is, the definition of a nonterminal symbol) can be parameterized
over an arbitrary number of symbols, which are referred to as formal
parameters.

\paragraph{Example}

For instance, here is the definition of the parameterized
nonterminal symbol \nt{option}, taken from the standard library (\sref{sec:library}):
%
\begin{quote}
\begin{tabular}{l}
\dpublic \basic{option}(\basic{X}):
\newprod \dpaction{\basic{None}}
\newprod \basic{x} = \basic{X} \dpaction{\basic{Some} \basic{x}}
\end{tabular}
\end{quote}
%
This definition states that \nt{option}(\basic{X}) expands to either the empty
string, producing the semantic value \basic{None}, or to the string \basic{X},
producing the semantic value {\basic{Some}~\basic{x}}, where \basic{x} is the
semantic value of \basic{X}. In this definition, the symbol \basic{X} is
abstract: it stands for an arbitrary terminal or nonterminal symbol. The
definition is made public, so \nt{option} can be referred to within client
modules.

POTTIER Francois's avatar
POTTIER Francois committed
1135
A client who wishes to use \nt{option} simply refers to it, together with
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
an actual parameter -- a symbol that is intended to replace \basic{X}. For
instance, here is how one might define a sequence of declarations, preceded
with optional commas:
%
\begin{quote}
\begin{tabular}{l}
\nt{declarations}:
\newprod \dpaction{[]}
\newprod \basic{ds} = \nt{declarations}; \nt{option}(\basic{COMMA}); \basic{d} = \nt{declaration}
         \dpaction{ \basic{d} :: \basic{ds} }
\end{tabular}
\end{quote}
%
This definition states that \nt{declarations} expands either to the empty
string or to \nt{declarations} followed by an optional comma followed by
\nt{declaration}. (Here, \basic{COMMA} is presumably a terminal symbol.)
When this rule is encountered, the definition of \nt{option} is instantiated:
that is, a copy of the definition, where \basic{COMMA} replaces \basic{X},
is produced. Things behave exactly as if one had written:

\begin{quote}
\begin{tabular}{l}
\basic{optional\_comma}:
\newprod \dpaction{\basic{None}}
\newprod \basic{x} = \basic{COMMA} \dpaction{\basic{Some} \basic{x}} \\

\nt{declarations}:
\newprod \dpaction{[]}
\newprod \basic{ds} = \nt{declarations}; \nt{optional\_comma}; \basic{d} = \nt{declaration}
         \dpaction{ \basic{d} :: \basic{ds} }
\end{tabular}
\end{quote}
%
Note that, even though \basic{COMMA} presumably has been declared as a token
with no semantic value, writing \basic{x}~=~\basic{COMMA} is legal, and binds
\basic{x} to the unit value. This design choice ensures that the definition
of \nt{option} makes sense regardless of the nature of \basic{X}: that is, \basic{X}
can be instantiated with a terminal symbol, with or without a semantic value,
or with a nonterminal symbol.

\paragraph{Parameterization in general}

In general, the definition of a nonterminal symbol $N$ can be
parameterized with an arbitrary number of formal parameters. When
$N$ is referred to within a production, it must be applied
to the same number of actuals. In general, an actual is:
%
\begin{itemize}
\item either a single symbol, which can be a terminal symbol, a nonterminal symbol, or a formal parameter;
\item or an application of such a symbol to a number of actuals.
\end{itemize}

For instance, here is a rule whose single production consists of a single
producer, which contains several, nested actuals. (This example is discussed
again in \sref{sec:library}.)
%
\begin{quote}
\begin{tabular}{l}
\nt{plist}(\nt{X}):
\newprod
  \basic{xs} = \nt{loption}(%
                     \nt{delimited}(%
                       \basic{LPAREN},
                       \nt{separated\_nonempty\_list}(\basic{COMMA}, \basic{X}),
                       \basic{RPAREN}%
                     )%
                   )
    \dpaction{\basic{xs}}
\end{tabular}
\end{quote}

\begin{figure}
\begin{center}
1209
\begin{tabular}{r@{\hspace{2mm}}c@{\hspace{2mm}}l}
1210
1211
1212
1213
1214
\nt{actual}\dquestion & is syntactic sugar for & \nt{option}(\nt{actual}) \\
\nt{actual}\dplus & is syntactic sugar for & \nt{nonempty\_list}(\nt{actual}) \\
\nt{actual}\dstar & is syntactic sugar for & \nt{list}(\nt{actual})
\end{tabular}
\end{center}
1215
\caption{Syntactic sugar for simulating regular expressions, also known as EBNF}
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
\label{fig:sugar}
\end{figure}
%
Applications of the parameterized nonterminal symbols \nt{option},
\nt{nonempty\_list}, and \nt{list}, which are defined in
the standard library (\sref{sec:library}), can be written using
a familiar, regular-expression like syntax (\fref{fig:sugar}).

\paragraph{Higher-order parameters}

A formal parameter can itself expect parameters. For instance, here is a rule
that defines the syntax of procedures in an imaginary programming language:
%
\begin{quote}
\begin{tabular}{l}
\nt{procedure}(\nt{list}):
\newprod
\basic{PROCEDURE} \basic{ID} \nt{list}(\nt{formal}) \nt{SEMICOLON} \nt{block} \nt{SEMICOLON} \dpaction{$\ldots$}
\end{tabular}
\end{quote}
%
This rule states that the token \basic{ID}, which represents the name of the
procedure, should be followed with a list of formal parameters. (The
definitions of the nonterminal symbols \nt{formal} and \nt{block} are not
shown.) However, because \nt{list} is a formal parameter, as opposed to a
concrete nonterminal symbol defined elsewhere, this definition does not
specify how the list is laid out: which token, if any, is used to separate, or
terminate, list elements? is the list allowed to be empty? and so on. A more
concrete notion of procedure is obtained by instantiating the formal parameter
\nt{list}: for instance, \nt{procedure}(\nt{plist}), where \nt{plist} is the
parameterized nonterminal symbol defined earlier, is a valid application.

\paragraph{Consistency} Definitions and uses of parameterized nonterminal
symbols are checked for consistency before they are expanded away. In short,
it is checked that, wherever a nonterminal symbol is used, it is supplied with
actual arguments in appropriate number and of appropriate nature. This
guarantees that expansion of parameterized definitions terminates and produces
a well-formed grammar as its outcome.

\subsection{Inlining}
\label{sec:inline}

It is well-known that the following grammar of arithmetic expressions does not
work as expected: that is, in spite of the priority declarations, it has
shift/reduce conflicts.
%
\begin{quote}
\begin{tabular}{l}
\dtoken \kangle{\basic{int}} \basic{INT} \\
\dtoken \basic{PLUS} \basic{TIMES} \\
\dleft \basic{PLUS} \\
\dleft \basic{TIMES} \\ \\
\percentpercent \\ \\
\nt{expression}:
\newprod \basic{i} = \basic{INT}
         \dpaction{\basic{i}}
\newprod \basic{e} = \nt{expression}; \basic{o} = \nt{op}; \basic{f} = \nt{expression}
         \dpaction{\basic{o} \basic{e} \basic{f}} \\
\nt{op}:
\newprod \basic{PLUS} \dpaction{( + )}
\newprod \basic{TIMES} \dpaction{( * )}
\end{tabular}
\end{quote}
%
The trouble is, the precedence level of the production \nt{expression}
$\rightarrow$ \nt{expression} \nt{op} \nt{expression} is undefined, and
there is no sensible way of defining it via a \dprec declaration, since
the desired level really depends upon the symbol that was recognized by
\nt{op}: was it \basic{PLUS} or \basic{TIMES}?

The standard workaround is to abandon the definition of \nt{op} as a
separate nonterminal symbol, and to inline its definition into the
definition of \nt{expression}, like this:
%
\begin{quote}
\begin{tabular}{l}
\nt{expression}:
\newprod \basic{i} = \basic{INT}
         \dpaction{\basic{i}}
\newprod \basic{e} = \nt{expression}; \basic{PLUS}; \basic{f} = \nt{expression}
         \dpaction{\basic{e} + \basic{f}}
\newprod \basic{e} = \nt{expression}; \basic{TIMES}; \basic{f} = \nt{expression}
         \dpaction{\basic{e} * \basic{f}}
\end{tabular}
\end{quote}
%

This avoids the shift/reduce conflict, but gives up some of the original
specification's structure, which, in realistic situations, can be damageable.
Fortunately, \menhir offers a way of avoiding the conflict without manually
transforming the grammar, by declaring that the nonterminal symbol \nt{op}
should be inlined:
%
\begin{quote}
\begin{tabular}{l}
\nt{expression}:
\newprod \basic{i} = \basic{INT}
         \dpaction{\basic{i}}
\newprod \basic{e} = \nt{expression}; \basic{o} = \nt{op}; \basic{f} = \nt{expression}
         \dpaction{\basic{o} \basic{e} \basic{f}} \\
\dinline \nt{op}:
\newprod \basic{PLUS} \dpaction{( + )}
\newprod \basic{TIMES} \dpaction{( * )}
\end{tabular}
\end{quote}
%
The \dinline keyword causes all references to \nt{op} to be replaced with its
definition. In this example, the definition of \nt{op} involves two
productions, one that develops to \basic{PLUS} and one that expands to
\basic{TIMES}, so every production that refers to \nt{op} is effectively
turned into two productions, one that refers to \basic{PLUS} and one that refers to
\basic{TIMES}. After inlining, \nt{op} disappears and \nt{expression} has three
productions: that is, the result of inlining is exactly the manual workaround
shown above.

In some situations, inlining can also help recover a slight efficiency margin.
For instance, the definition:
%
\begin{quote}
\begin{tabular}{l}
\dinline \nt{plist}(\nt{X}):
\newprod
  \basic{xs} = \nt{loption}(%
                     \nt{delimited}(%
                       \basic{LPAREN},
                       \nt{separated\_nonempty\_list}(\basic{COMMA}, \basic{X}),
                       \basic{RPAREN}%
                     )%
                   )
    \dpaction{\basic{xs}}
\end{tabular}
\end{quote}
%
effectively makes \nt{plist}(\nt{X}) an alias for the right-hand side
\nt{loption}($\ldots$). Without the \dinline keyword, the language
recognized by the grammar would be the same, but the LR automaton
would probably have one more state and would perform one more
reduction at run time.

1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
The \dinline keyword does not affect the computation of positions
(\sref{sec:positions}). The same positions are computed, regardless of
where \dinline keywords are placed.

If the semantic actions have side effects, the \dinline keyword \emph{can}
affect the order in which these side effects take place. In the example of
\nt{op} and \nt{expression} above, if for some reason the semantic action
associated with \nt{op} has a side effect (such as updating a global variable,
or printing a message), then, by inlining \nt{op}, we delay this side effect,
which takes place \emph{after} the second operand has been recognized, whereas
in the absence of inlining it takes place as soon as the operator has been
recognized.

1368
1369
% Du coup, ça change l'ordre des effets, dans cet exemple, de infixe
% à postfixe.
1370

1371
1372
1373
1374
1375
\subsection{The standard library}
\label{sec:library}

\begin{figure}
\begin{center}
POTTIER Francois's avatar
POTTIER Francois committed
1376
\begin{tabular}{lp{51mm}l@{}l}
1377
1378
Name & Recognizes & Produces & Comment \\
\hline\\
1379
1380
% \nt{epsilon}         & $\epsilon$               & \basic{unit} & (inlined) \\
% \\
1381
1382
1383
\nt{endrule}(\nt{X}) & \nt{X} & $\alpha$, if \nt{X} : $\alpha$ & (inlined) \\
\nt{midrule}(\nt{X}) & \nt{X} & $\alpha$, if \nt{X} : $\alpha$ \\
\\
1384
\nt{option}(\nt{X})  & $\epsilon$ \barre \nt{X} & $\alpha$ \basic{option}, if \nt{X} : $\alpha$ & (also \nt{X}\dquestion) \\
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
\nt{ioption}(\nt{X}) & $\epsilon$ \barre \nt{X} & $\alpha$ \basic{option}, if \nt{X} : $\alpha$ & (inlined) \\
\nt{boption}(\nt{X}) & $\epsilon$ \barre \nt{X} & \basic{bool} \\
\nt{loption}(\nt{X}) & $\epsilon$ \barre \nt{X} & $\alpha$ \basic{list}, if \nt{X} : $\alpha$ \nt{list} \\
\\
\nt{pair}(\nt{X}, \nt{Y}) & \nt{X} \nt{Y} & $\alpha\times\beta$, if \nt{X} : $\alpha$ and \nt{Y} : $\beta$ \\
\nt{separated\_pair}(\nt{X}, \nt{sep}, \nt{Y}) & \nt{X} \nt{sep} \nt{Y} & $\alpha\times\beta$,
                                                                 if \nt{X} : $\alpha$ and \nt{Y} : $\beta$ \\
\nt{preceded}(\nt{opening}, \nt{X}) & \nt{opening} \nt{X} & $\alpha$, if \nt{X} : $\alpha$ \\
\nt{terminated}(\nt{X}, \nt{closing}) & \nt{X} \nt{closing} & $\alpha$, if \nt{X} : $\alpha$ \\
\nt{delimited}(\nt{opening}, \nt{X}, \nt{closing}) & \nt{opening} \nt{X} \nt{closing}
                                                   & $\alpha$, if \nt{X} : $\alpha$ \\
\\
\nt{list}(\nt{X})
  & a possibly empty sequence of \nt{X}'s
1399
1400
  & $\alpha$ \basic{list}, if \nt{X} : $\alpha$
  & (also \nt{X}\dstar) \\
1401
1402
\nt{nonempty\_list}(\nt{X})
  & a nonempty sequence of \nt{X}'s
1403
1404
  & $\alpha$ \basic{list}, if \nt{X} : $\alpha$
  & (also \nt{X}\dplus) \\
1405
1406
1407
1408
\nt{separated\_list}(\nt{sep}, \nt{X})
  & a possibly empty sequence of \nt{X}'s separated with \nt{sep}'s
  & $\alpha$ \basic{list}, if \nt{X} : $\alpha$ \\
\nt{separated\_nonempty\_list}(\nt{sep}, \nt{X})
1409
  & a nonempty sequence of \nt{X}'s \hspace{2mm} separated with \nt{sep}'s
1410
1411
  & $\alpha$ \basic{list}, if \nt{X} : $\alpha$ \\

1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
\\
\nt{rev}(\nt{X})
  & \nt{X}
  & $\alpha$ \basic{list}, if \nt{X} : $\alpha$ \basic{list}
  & (inlined)
\\
\nt{flatten}(\nt{X})
  & \nt{X}
  & $\alpha$ \basic{list}, if \nt{X} : $\alpha$ \basic{list} \basic{list}
  & (inlined)
\\
\nt{append}(\nt{X}, \nt{Y})
  & \nt{X} \nt{Y}
  & $\alpha$ \basic{list}, if \nt{X}, \nt{Y} : $\alpha$ \basic{list}
  & (inlined)
\\

1429
1430
\end{tabular}
\end{center}
1431
\caption{Summary of the standard library; see \standardmly for details}
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
1443
1444
1445
1446
1447
1448
1449
1450
1451
1452
1453
1454
1455
1456
1457
1458
1459
1460
1461
1462
1463
1464
1465
\label{fig:standard}
\end{figure}

Once equipped with a rudimentary module system (\sref{sec:split}),
parameterization (\sref{sec:templates}), and inlining (\sref{sec:inline}), it
is straightforward to propose a collection of commonly used definitions, such
as options, sequences, lists, and so on. This \emph{standard library} is
joined, by default, with every grammar specification. A summary of the
nonterminal symbols offered by the standard library appears in
\fref{fig:standard}. See also the short-hands documented in
\fref{fig:sugar}.

By relying on the standard library, a client module can concisely define
more elaborate notions. For instance, the following rule:
%
\begin{quote}
\begin{tabular}{l}
\dinline \nt{plist}(\nt{X}):
\newprod
  \basic{xs} = \nt{loption}(%
                     \nt{delimited}(%
                       \basic{LPAREN},
                       \nt{separated\_nonempty\_list}(\basic{COMMA}, \basic{X}),
                       \basic{RPAREN}%
                     )%
                   )
    \dpaction{\basic{xs}}
\end{tabular}
\end{quote}
%
causes \nt{plist}(\nt{X}) to recognize a list of \nt{X}'s, where the empty
list is represented by the empty string, and a non-empty list is delimited
with parentheses and comma-separated.

1466
The standard library is stored in a file named \standardmly, which is
Nicolás Ojeda Bär's avatar
Nicolás Ojeda Bär committed
1467
embedded inside \menhir when it is built.
POTTIER Francois's avatar
POTTIER Francois committed
1468
%
1469
The command line switch \onostdlib instructs \menhir to \emph{not} load the
POTTIER Francois's avatar
POTTIER Francois committed
1470
1471
standard library.

1472
1473
1474
1475
1476
1477
1478
1479
1480
1481
1482
1483
1484
1485
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505
1506
1507
1508
1509
1510
1511
1512
1513
1514
1515
1516
1517
1518
1519
1520
1521
1522
1523
1524
1525
1526
1527
1528
1529
1530
1531
1532
1533
1534
1535
1536
1537
1538
1539
1540
1541
1542
1543
1544
1545
1546
1547
1548
1549
1550
The meaning of the symbols defined in the standard library
(\fref{fig:standard}) should be clear in most cases. Yet, the
symbols \nt{endrule}(\nt{X}) and \nt{midrule}(\nt{X}) deserve an explanation.
Both take an argument \nt{X}, which typically will be instantiated with an
anonymous rule (\sref{sec:actual}). Both are defined as a synonym
for \nt{X}. In both cases, this allows placing an anonymous subrule in the
middle of a rule.

\newcommand{\AAA}{\nt{cat}}
\newcommand{\BBB}{\nt{dog}}
\newcommand{\CCC}{\nt{cow}}
\newcommand{\XXX}{\nt{xxx}}

For instance, the following is a well-formed production:
%
\[\begin{array}{l}
  \AAA \quad
  \nt{endrule}(\BBB \quad \dpaction{\nt{\ocaml code$_1$}}) \quad
  \CCC \quad
  \dpaction{\nt{\ocaml code$_2$}}
\end{array}\]
%
This production consists of three producers, namely
\AAA{} and
\nt{endrule}(\BBB$\;$\dpaction{\nt{\ocaml code$_1$}}) and
\CCC,
and a semantic action \dpaction{\nt{\ocaml code$_2$}}.
%
Because \nt{endrule}(\nt{X}) is declared as an \dinline synonym for \nt{X},
the expansion of anonymous rules (\sref{sec:actual}), followed with the
expansion of \dinline symbols (\sref{sec:inline}), transforms the above
production into the following:
%
\[\begin{array}{l}
  \AAA \quad
  \BBB \quad
  \CCC \quad
  \dpaction{\nt{\ocaml code$_1$; \ocaml code$_2$}}
\end{array}\]
%
Note that \nt{\ocaml code$_1$} moves to the end of the rule, which means that
this code is executed only after \AAA, \BBB{} and \CCC{} have
been recognized. In this example, the use of \nt{endrule} is rather pointless,
as the expanded code is more concise and clearer than the original code. Still,
\nt{endrule} can be useful when its actual argument is an anonymous rule with
multiple branches.

% Let me *not* show an example. See the comments in standard.mly.

\nt{midrule} is used in exactly the same way as \nt{endrule}, but its expansion
is different. For instance, the following is a well-formed production:
%
\[\begin{array}{l}
  \AAA \quad
  \nt{midrule}(\dpaction{\nt{\ocaml code$_1$}}) \quad
  \CCC \quad
  \dpaction{\nt{\ocaml code$_2$}}
\end{array}\]
%
(There is no \BBB{} in this example; this is intentional.)
Because \nt{midrule}(\nt{X}) is a synonym for \nt{X}, but is not declared
\dinline, the expansion of anonymous rules (\sref{sec:actual}), followed
with the expansion of \dinline symbols (\sref{sec:inline}), transforms the
above production into the following:
%
\[\begin{array}{l}
  \AAA \quad
  \XXX \quad
  \CCC \quad
  \dpaction{\nt{\ocaml code$_2$}}
\end{array}\]
%
where the fresh nonterminal symbol $\XXX$ is separately defined by the
rule $\XXX: \dpaction{\nt{\ocaml code$_1$}}$. Thus, $\XXX$ recognizes
the empty string, and as soon as it is recognized, \nt{\ocaml code$_1$}
is executed. This is known as a ``mid-rule action''.

% https://www.gnu.org/software/bison/manual/html_node/Mid_002dRule-Actions.html

POTTIER Francois's avatar
POTTIER Francois committed
1551
% ------------------------------------------------------------------------------
1552
1553
1554
1555
1556
1557
1558
1559
1560
1561

\section{Conflicts}
\label{sec:conflicts}

When a shift/reduce or reduce/reduce conflict is detected, it is classified as
either benign, if it can be resolved by consulting user-supplied precedence
declarations, or severe, if it cannot. Benign conflicts are not reported.
Severe conflicts are reported and, if the \oexplain switch is on, explained.

\subsection{When is a conflict benign?}
1562
\label{sec:conflicts:benign}
1563
1564
1565
1566
1567
1568
1569
1570
1571
1572
1573
1574
1575
1576
1577
1578
1579
1580
1581
1582
1583
1584
1585
1586
1587
1588
1589
1590
1591
1592
1593
1594
1595
1596
1597
1598
1599
1600
1601
1602
1603
1604
1605
1606
1607
1608
1609
1610
1611
1612
1613
1614
1615
1616
1617
1618
1619
1620
1621
1622
1623
1624
1625
1626
1627
1628
1629
1630
1631
1632
1633
1634
1635
1636
1637
1638
1639
1640
1641
1642
1643
1644
1645
1646
1647
1648
1649
1650
1651
1652
1653
1654
1655
1656
1657
1658
1659
1660
1661
1662
1663
1664
1665
1666
1667
1668
1669
1670
1671
1672
1673
1674
1675
1676
1677
1678
1679
1680
1681
1682
1683
1684
1685
1686
1687
1688
1689
1690
1691
1692

A shift/reduce conflict involves a single token (the one that one might wish
to shift) and one or more productions (those that one might wish to
reduce). When such a conflict is detected, the precedence level
(\sref{sec:assoc}, \sref{sec:prec}) of these entities are looked up and
compared as follows:
\begin{enumerate}
\item if only one production is involved, and if it has higher priority
      than the token, then the conflict is resolved in favor of reduction.
\item if only one production is involved, and if it has the same priority
      as the token, then the associativity status of the token is looked up:
      \begin{enumerate}
      \item if the token was declared nonassociative, then the conflict is
            resolved in favor of neither action, that is, a syntax error
	    will be signaled if this token shows up when this production
	    is about to be reduced;
      \item if the token was declared left-associative, then the conflict
            is resolved in favor of reduction;
      \item if the token was declared right-associative, then the conflict
            is resolved in favor of shifting.
      \end{enumerate}
\item \label{multiway}
      if multiple productions are involved, and if, considered one by one,
      they all cause the conflict to be resolved in the same way (that is,
      either in favor in shifting, or in favor of neither), then the conflict
      is resolved in that way.
\end{enumerate}
In either of these cases, the conflict is considered benign. Otherwise, it is
considered severe. Note that a reduce/reduce conflict is always considered
severe, unless it happens to be subsumed by a benign multi-way shift/reduce
conflict (item~\ref{multiway} above).

\subsection{How are severe conflicts explained?}

When the \odump switch is on, a description of the automaton is written to the
\automaton file. Severe conflicts are shown as part of this description.
Fortunately, there is also a way of understanding conflicts in terms of the
grammar, rather than in terms of the automaton. When the \oexplain switch is
on, a textual explanation is written to the \conflicts file.

\emph{Not all conflicts are explained} in this file: instead, \emph{only one conflict per
automaton state is explained}. This is done partly in the interest of brevity,
but also because Pager's algorithm can create artificial conflicts in a state
that already contains a true LR(1) conflict; thus, one cannot hope in general
to explain all of the conflicts that appear in the automaton. As a result of
this policy, once all conflicts explained in the \conflicts file have been
fixed, one might need to run \menhir again to produce yet more conflict
explanations.

\begin{figure}
\begin{quote}
\begin{tabular}{l}
\dtoken \basic{IF THEN ELSE} \\
\dstart \kangle{\basic{expression}} \nt{expression} \\
\\
\percentpercent \\
\\
\nt{expression}:
\newprod $\ldots$
\newprod \basic{IF b} = \nt{expression} \basic{THEN e} = \nt{expression} \dpaction{$\ldots$}
\newprod \basic{IF b} = \nt{expression} \basic{THEN e} = \nt{expression} \basic{ELSE f} = \nt{expression} \dpaction{$\ldots$}
\newprod $\ldots$
\end{tabular}
\end{quote}
\caption{Basic example of a shift/reduce conflict}
\label{fig:basicshiftreduce}
\end{figure}

\paragraph{How the conflict state is reached}

\fref{fig:basicshiftreduce} shows a grammar specification
with a typical shift/reduce conflict.
%
When this specification is analyzed, the conflict is detected, and an
explanation is written to the \conflicts file. The explanation first indicates
in which state the conflict lies by showing how that state is reached. Here,
it is reached after recognizing the following string of terminal and
nonterminal symbols---the \emph{conflict string}:
%
\begin{quote}
\basic{IF expression THEN IF expression THEN expression}
\end{quote}

Allowing the conflict string to contain both nonterminal and terminal symbols
usually makes it shorter and more readable. If desired, a conflict string
composed purely of terminal symbols could be obtained by replacing each
occurrence of a nonterminal symbol $N$ with an arbitrary $N$-sentence.

The conflict string can be thought of as a path that leads from one of the
automaton's start states to the conflict state.  When multiple such paths
exist, the one that is displayed is chosen shortest.  Nevertheless, it may
sometimes be quite long. In that case, artificially (and temporarily)
declaring some existing nonterminal symbols to be start symbols has the effect
of adding new start states to the automaton and can help produce shorter
conflict strings. Here, \nt{expression} was declared to be a start symbol,
which is why the conflict string is quite short.

In addition to the conflict string, the \conflicts file also states that the
\emph{conflict token} is \basic{ELSE}. That is, when the automaton has recognized
the conflict string and when the lookahead token (the next token on the input
stream) is \basic{ELSE}, a conflict arises. A conflict corresponds to a
choice: the automaton is faced with several possible actions, and does not
know which one should be taken. This indicates that the grammar is not LR(1).
The grammar may or may not be inherently ambiguous.

In our example, the conflict string and the conflict token are enough to
understand why there is a conflict: when two \basic{IF} constructs are nested,
it is ambiguous which of the two constructs the
\basic{ELSE} branch should be associated with. Nevertheless, the \conflicts file
provides further information: it explicitly shows that there exists a
conflict, by proving that two distinct actions are possible. Here, one of
these actions consists in \emph{shifting}, while the other consists in
\emph{reducing}: this is a \emph{shift/reduce} conflict.

A \emph{proof} takes the form of a \emph{partial derivation tree} whose
\emph{fringe} begins with the conflict string, followed by the conflict
token. A derivation tree is a tree whose nodes are labeled with symbols. The
root node carries a start symbol. A node that carries a terminal symbol is
considered a leaf, and has no children. A node that carries a nonterminal
symbol $N$ either is considered a leaf, and has no children; or is not
considered a leaf, and has $n$ children, where $n\geq 0$, labeled
$\nt{x}_1,\ldots,\nt{x}_n$, where $N \rightarrow
\nt{x}_1,\ldots,\nt{x}_n$ is a production. The fringe of a partial
derivation tree is the string of terminal and nonterminal symbols carried by
the tree's leaves. A string of terminal and nonterminal symbols that is the
fringe of some partial derivation tree is a \emph{sentential form}.

\paragraph{Why shifting is legal}

\begin{figure}
1693
\mycommonbaseline
1694
\begin{center}
1695
\begin{heveapicture}
1696
1697
1698
1699
1700
1701
1702
1703
1704
1705
1706
1707
1708
1709
1710
\begin{tikzpicture}[level distance=12mm]
\node { \nt{expression} }
  child { node {\basic{IF}} }
  child { node {\nt{expression}} }
  child { node {\basic{THEN}} }
  child { node {\nt{expression}}
    child { node {\basic{IF}} }
    child { node {\nt{expression}} }
    child { node {\basic{THEN}} }
    child { node {\nt{expression}} }
    child { node {\basic{ELSE}} }
    child { node {\nt{expression}} }
  }
;
\end{tikzpicture}
1711
\end{heveapicture}
1712
1713
1714
1715
1716
1717
1718
1719
1720
1721
1722
1723
1724
1725
1726
1727
1728
1729
1730
1731
1732
1733
1734
1735
1736
1737
1738
1739
1740
1741
1742
1743
1744
1745
1746
1747
1748
1749
1750
1751
\end{center}
\caption{A partial derivation tree that justifies shifting}
\label{fig:shifting:tree}
\end{figure}

\begin{figure}
\begin{center}
\begin{tabbing}
\= \nt{expression} \\
\> \basic{IF} \nt{expression} \basic{THEN} \= \nt{expression} \\
\>                                         \> \basic{IF} \nt{expression} \basic{THEN} \basic{expression}
                                              . \basic{ELSE} \nt{expression}
\end{tabbing}
\end{center}
\caption{A textual version of the tree in \fref{fig:shifting:tree}}
\label{fig:shifting:text}
\end{figure}

In our example, the proof that shifting is possible is the derivation tree
shown in Figures~\ref{fig:shifting:tree} and~\ref{fig:shifting:text}. At the
root of the tree is the grammar's start symbol, \nt{expression}. This symbol
develops into the string \nt{IF expression THEN expression}, which forms the
tree's second level. The second occurrence of \nt{expression} in that string
develops into \nt{IF expression THEN expression ELSE expression}, which forms
the tree's last level. The tree's fringe, a sentential form, is the string
\nt{IF expression THEN IF expression THEN expression ELSE expression}. As
announced earlier, it begins with the conflict string \nt{IF expression THEN
IF expression THEN expression}, followed with the conflict token
\nt{ELSE}.

In \fref{fig:shifting:text}, the end of the conflict string is materialized
with a dot. Note that this dot does not occupy the rightmost position in the
tree's last level. In other words, the conflict token (\basic{ELSE}) itself
occurs on the tree's last level. In practical terms, this means that, after
the automaton has recognized the conflict string and peeked at the conflict
token, it makes sense for it to \emph{shift} that token.

\paragraph{Why reducing is legal}

\begin{figure}
1752
\mycommonbaseline
1753
\begin{center}
1754
\begin{heveapicture}
1755
1756
1757
1758
1759
1760
1761
1762
1763
1764
1765
1766
1767
1768
1769
\begin{tikzpicture}[level distance=12mm]
\node { \nt{expression} }
  child { node {\basic{IF}} }
  child { node {\nt{expression}} }
  child { node {\basic{THEN}} }
  child { node {\nt{expression}}
    child { node {\basic{IF}} }
    child { node {\nt{expression}} }
    child { node {\basic{THEN}} }
    child { node {\nt{expression}} }
  }
  child { node {\basic{ELSE}} }
  child { node {\nt{expression}} }
;
\end{tikzpicture}
1770
\end{heveapicture}
1771
1772
1773
1774
1775
1776
1777
1778
1779
1780
1781
1782
1783
1784
1785
1786
1787
1788
1789
1790
1791
1792
1793
1794
1795
1796
1797
1798
1799
1800
1801
1802
1803
1804
1805
1806
1807
1808
1809
1810
1811
1812
1813
1814
1815
1816
1817
\end{center}
\caption{A partial derivation tree that justifies reducing}
\label{fig:reducing:tree}
\end{figure}

\begin{figure}
\begin{center}
\begin{tabbing}
\= \nt{expression} \\
\> \basic{IF} \nt{expression} \basic{THEN} \= \nt{expression} \basic{ELSE} \nt{expression}
                                                              \sidecomment{lookahead token appears} \\
\>                                         \> \basic{IF} \nt{expression} \basic{THEN} \basic{expression} .
\end{tabbing}
\end{center}
\caption{A textual version of the tree in \fref{fig:reducing:tree}}
\label{fig:reducing:text}
\end{figure}

In our example, the proof that shifting is possible is the derivation tree
shown in Figures~\ref{fig:reducing:tree} and~\ref{fig:reducing:text}. Again,
the sentential form found at the fringe of the tree begins with the conflict
string, followed with the conflict token.

Again, in \fref{fig:reducing:text}, the end of the conflict string is
materialized with a dot. Note that, this time, the dot occupies the rightmost
position in the tree's last level. In other words, the conflict token
(\basic{ELSE}) appeared on an earlier level (here, on the second level).  This
fact is emphasized by the comment \inlinesidecomment{lookahead token appears}
found at the second level. In practical terms, this means that, after the
automaton has recognized the conflict string and peeked at the conflict token,
it makes sense for it to \emph{reduce} the production that corresponds to the
tree's last level---here, the production is \nt{expression} $\rightarrow$
\basic{IF} \nt{expression} \basic{THEN} \basic{expression}.

\paragraph{An example of a more complex derivation tree}

Figures~\ref{fig:xreducing:tree} and~\ref{fig:xreducing:text} show a partial
derivation tree that justifies reduction in a more complex situation. (This
derivation tree is relative to a grammar that is not shown.) Here, the
conflict string is \basic{DATA UIDENT EQUALS UIDENT}; the conflict token is
\basic{LIDENT}.  It is quite clear that the fringe of the tree begins with the
conflict string.  However, in this case, the fringe does not explicitly
exhibit the conflict token. Let us examine the tree more closely and answer
the question: following \basic{UIDENT}, what's the next terminal symbol on the
fringe?

\begin{figure}
1818
\mycommonbaseline
1819
\begin{center}
1820
\begin{heveapicture}
1821
1822
1823
1824
1825
1826
1827
1828
1829
1830
1831
1832
1833
1834
1835
1836
1837
1838
1839
\begin{tikzpicture}[level distance=12mm,level 1/.style={sibling distance=18mm},
                                        level 2/.style={sibling distance=18mm},
                                        level 4/.style={sibling distance=24mm}]]
\node { \nt{decls} }
  child { node {\nt{decl}}
    child { node {\basic{DATA}} }
    child { node {\basic{UIDENT}} }
    child { node {\basic{EQUALS}} }
    child { node {\nt{tycon\_expr}}
      child { node {\nt{tycon\_item}}
        child { node {\basic{UIDENT}} }
        child { node {\nt{opt\_type\_exprs}}
          child { node {} edge from parent [dashed] }
        }
      }
    }
  }
  child { node {\nt{opt\_semi}} }
  child { node {\nt{decls}} }
POTTIER Francois's avatar
Typo.    
POTTIER Francois committed
1840
;
1841
\end{tikzpicture}
1842
\end{heveapicture}
1843
1844
1845
1846
1847
1848
1849
1850
1851
1852
1853
1854
1855
1856
1857
1858
1859
1860
1861
1862
1863
1864
\end{center}
\caption{A partial derivation tree that justifies reducing}
\label{fig:xreducing:tree}
\end{figure}

\begin{figure}
\begin{center}
\begin{tabbing}
\= \nt{decls} \\
\> \nt{decl} \nt{opt\_semi} \nt{decls}
\sidecomment{lookahead token appears because \nt{opt\_semi} can vanish
   and \nt{decls} can begin with \basic{LIDENT}} \\
\> \basic{DATA UIDENT} \basic{EQUALS} \= \nt{tycon\_expr}
\sidecomment{lookahead token is inherited} \\
\> \> \nt{tycon\_item} \sidecomment{lookahead token is inherited} \\
\> \> \basic{UIDENT} \= \nt{opt\_type\_exprs} \sidecomment{lookahead token is inherited} \\
\> \> \> .
\end{tabbing}
\end{center}
\caption{A textual version of the tree in \fref{fig:xreducing:tree}}
\label{fig:xreducing:text}
\end{figure}
POTTIER Francois's avatar
TODO.    
POTTIER Francois committed
1865
% TEMPORARY the HTML rendering of this figure isn't good
1866
1867
1868
1869
1870
1871
1872
1873
1874
1875
1876
1877
1878
1879
1880
1881
1882
1883
1884
1885
1886
1887
1888
1889
1890
1891
1892
1893
1894
1895
1896
1897
1898
1899
1900
1901
1902
1903
1904
1905
1906
1907
1908
1909
1910
1911

First, note that \nt{opt\_type\_exprs} is \emph{not} a leaf node, even though
it has no children. The grammar contains the production $\nt{opt\_type\_exprs}
\rightarrow \epsilon$: the nonterminal symbol \nt{opt\_type\_exprs} develops
to the empty string. (This is made clear in \fref{fig:xreducing:text}, where a
single dot appears immediately below \nt{opt\_type\_exprs}.) Thus,
\nt{opt\_type\_exprs} is not part of the fringe.

Next, note that \nt{opt\_type\_exprs} is the rightmost symbol within its
level. Thus, in order to find the next symbol on the fringe, we have to look
up one level. This is the meaning of the comment \inlinesidecomment{lookahead
token is inherited}. Similarly, \nt{tycon\_item} and \nt{tycon\_expr} appear
rightmost within their level, so we again have to look further up.

This brings us back to the tree's second level. There, \nt{decl} is \emph{not}
the rightmost symbol: next to it, we find \nt{opt\_semi} and \nt{decls}. Does
this mean that \nt{opt\_semi} is the next symbol on the fringe? Yes and no.
\nt{opt\_semi} is a \emph{nonterminal} symbol, but we are really interested in finding
out what the next \emph{terminal} symbol on the fringe could be. The partial
derivation tree shown in Figures~\ref{fig:xreducing:tree}
and~\ref{fig:xreducing:text} does not explicitly answer this question. In
order to answer it, we need to know more about \nt{opt\_semi} and \nt{decls}.

Here, \nt{opt\_semi} stands (as one might have guessed) for an optional
semicolon, so the grammar contains a production $\nt{opt\_semi} \rightarrow
\epsilon$. This is indicated by the comment
\inlinesidecomment{\nt{opt