\def\true{true}
\let\fpacm\true
\documentclass[onecolumn,11pt,nocopyrightspace]{sigplanconf}
\usepackage{amstext}
\usepackage[T1]{fontenc}
\usepackage[latin1]{inputenc}
\usepackage{tikz}
\usepackage{xspace}
\usepackage{mymacros}
\def\fppdf{true}
\usepackage{fppdf}
\input{macros}
\input{version}

% TEMPORARY: indicate that the default behavior (in the absence
% of --only-tokens or --external-tokens) is to generate the
% definition of the token type

% ---------------------------------------------------------------------------------------------------------------------

% Headings.

\title{\menhir Reference Manual\\\normalsize (version \menhirversion)}

\begin{document}

\authorinfo{François Pottier\and Yann Régis-Gianas}
           {INRIA}
           {\{Francois.Pottier, Yann.Regis-Gianas\}@inria.fr}

\maketitle

% ---------------------------------------------------------------------------------------------------------------------

\clearpage
\tableofcontents
\clearpage

% ---------------------------------------------------------------------------------------------------------------------

\section{Foreword}

\menhir is a parser generator. It turns high-level grammar specifications,
decorated with semantic actions expressed in the \ocaml programming
language~\cite{objective-caml}, into parsers, again expressed in \ocaml.
It is based on Knuth's LR(1) parser construction technique~\cite{knuth-lr-65}.
It is strongly inspired by its precursors: \yacc~\cite{johnson-yacc-79},
\texttt{ML-Yacc}~\cite{tarditi-appel-00}, and \ocamlyacc~\cite{objective-caml},
but offers a large number of minor and major improvements that make it a more
modern tool.

This brief reference manual explains how to use \menhir. It does not attempt to
explain context-free grammars, parsing, or the LR technique. Readers who have
never used a parser generator are encouraged to read about these ideas
first~\cite{aho-86,appel-tiger-98,hopcroft-motwani-ullman-00}. They are also
invited to have a look at the \distrib{demos} directory in \menhir's
distribution.

At this stage, potential users should be warned about two facts. First,
\menhir's feature set is not stable. There is a tension between preserving a
measure of compatibility with \ocamlyacc, on the one hand, and introducing new
ideas, on the other hand. Some aspects of the tool, such as the error handling
and recovery mechanism, are still potentially subject to incompatible changes.
Second, the present release is \emph{beta}-quality. There is much room for
improvement in the tool and in this reference manual. Bug reports and
suggestions are welcome!

% ---------------------------------------------------------------------------------------------------------------------

\section{Usage}

\menhir is invoked as follows:
\begin{quote}
\cmenhir \nt{option} \ldots \nt{option} \nt{filename} \ldots \nt{filename}
\end{quote}
Each of the file names must end with \texttt{.mly} and denotes a partial
grammar specification. These partial grammar specifications are joined
(\sref{sec:split}) to form a single, self-contained grammar specification,
which is then processed. A number of optional command line switches allow
controlling many aspects of the process.

\docswitch{\obase \nt{basename}} This switch controls the base name of the \ml
and \mli files that are produced.
That is, the tool will produce files named \nt{basename}\texttt{.ml} and \nt{basename}\texttt{.mli}. Note that \nt{basename} can contain occurrences of the \texttt{/} character, so it really specifies a path and a base name. When only one \nt{filename} is provided on the command line, the default \nt{basename} is obtained by depriving \nt{filename} of its final \texttt{.mly} suffix. When multiple file names are provided on the command line, no default base name exists, so that the \obase switch \emph{must} be used. \docswitch{\ocomment} This switch causes a few comments to be inserted into the \ocaml code that is written to the \ml file. \docswitch{\odepend} This switch causes \menhir to generate dependency information for use in conjunction with \make. When invoked in this mode, \menhir does not generate a parser. Instead, it examines the grammar specification and prints a list of prerequisites for the targets \nt{basename}\texttt{.cm[oix]}, \nt{basename}\texttt{.ml}, and \nt{basename}\texttt{.mli}. This list is intended to be textually included within a \Makefile. It is important to note that \nt{basename}\texttt{.ml} and \nt{basename}\texttt{.mli} can have \texttt{.cm[iox]} prerequisites. This is because, when the \oinfer switch is used, \menhir infers types by invoking \ocamlc, and \ocamlc itself requires the \ocaml modules that the grammar specification depends upon to have been compiled first. The file \distrib{demos/Makefile.shared} helps exploit the \odepend switch. When in \odepend mode, \menhir computes dependencies by invoking \ocamldep. The command that is used to run \ocamldep is controlled by the \oocamldep switch. \docswitch{\odump} This switch causes a description of the automaton to be written to the file \nt{basename}\automaton. \docswitch{\oerrorrecovery} This switch causes error recovery code to be generated. 
Error recovery, also known as re-synchronization, consists in dropping tokens
off the input stream, after an error has been detected, until a token that can
be shifted in the current state is found. This behavior is made optional
because it is seldom exploited and requires extra code in the parser. See also
\sref{sec:errors}.

\docswitch{\oexplain} This switch causes conflict explanations to be written to
the file \nt{basename}\conflicts. See also \sref{sec:conflicts}.

\docswitch{\oexternaltokens \nt{T}} This switch causes the definition of the
\token type to be omitted in \nt{basename}\texttt{.ml} and
\nt{basename}\texttt{.mli}. Instead, the generated parser relies on the type
$T$\texttt{.}\token, where $T$ is an \ocaml module name. It is up to the user
to define module $T$ and to make sure that it exports a suitable \token type.
Module $T$ can be hand-written. It can also be automatically generated out of a
grammar specification using the \oonlytokens switch.

\docswitch{\ofixedexc} This switch causes the exception \texttt{Error} to be
internally defined as a synonym for \texttt{Parsing.Parse\_error}. This means
that an exception handler that catches \texttt{Parsing.Parse\_error} will also
catch the generated parser's \texttt{Error}. This helps increase Menhir's
compatibility with \ocamlyacc. There is otherwise no reason to use this switch.

\docswitch{\ograph} This switch causes a description of the grammar's
dependency graph to be written to the file \nt{basename}\dott. The graph's
vertices are the grammar's nonterminal symbols. There is a directed edge from
vertex $A$ to vertex $B$ if the definition of $A$ refers to $B$. The file is in
a format that is suitable for processing by the \emph{graphviz} toolkit.

\docswitch{\oinfer} This switch causes the semantic actions to be checked for
type consistency \emph{before} the parser is generated. This is done by
invoking the \ocaml compiler. Use of \oinfer is \textbf{strongly recommended},
because it helps obtain consistent, well-located type error messages,
especially when advanced features such as \menhir's standard library or \dinline
keyword are exploited. One downside of \oinfer is that the \ocaml compiler
usually needs to consult a few \texttt{.cm[iox]} files. This means that these
files must have been created first, requiring \Makefile changes and use of the
\odepend switch. The file \distrib{demos/Makefile.shared} suggests how to deal
with this difficulty. A better option is to avoid \make altogether and use
\ocamlbuild, which has built-in knowledge of \menhir. Using \ocamlbuild is also
\textbf{strongly recommended}!

% There is a slight catch with \oinfer. The types inferred by \ocamlc are valid
% in the toplevel context, but can change meaning when inserted into a local
% context.

\docswitch{\ointerpret} This switch causes \menhir to act as an interpreter,
rather than as a compiler. No \ocaml code is generated. Instead, \menhir reads
sentences off the standard input channel, parses them, and displays outcomes.
For more information, see \sref{sec:interpret}.

\docswitch{\ointerpretshowcst} This switch, used in conjunction with
\ointerpret, causes \menhir to display a concrete syntax tree when a sentence
is successfully parsed. For more information, see \sref{sec:interpret}.

\docswitch{\ologautomaton \nt{level}} When \nt{level} is nonzero, this switch
causes some information about the automaton to be logged to the standard error
channel.

\docswitch{\ologcode \nt{level}} When \nt{level} is nonzero, this switch causes
some information about the generated \ocaml code to be logged to the standard
error channel.

\docswitch{\ologgrammar \nt{level}} When \nt{level} is nonzero, this switch
causes some information about the grammar to be logged to the standard error
channel. When \nt{level} is 2, the \emph{nullable}, \emph{FIRST}, and
\emph{FOLLOW} tables are displayed.

\docswitch{\onoinline} This switch causes all \dinline keywords in the grammar
specification to be ignored. This is especially useful in order to understand
whether these keywords help solve any conflicts.

\docswitch{\onostdlib} This switch causes the standard library \emph{not} to be
implicitly joined with the grammar specifications whose names are explicitly
provided on the command line.

\docswitch{\oocamlc \nt{command}} This switch controls how \ocamlc is invoked (when \oinfer is used). It allows setting both the name of the executable and the command line options that are passed to it. \docswitch{\oocamldep \nt{command}} This switch controls how \ocamldep is invoked (when \odepend is used). It allows setting both the name of the executable and the command line options that are passed to it. \docswitch{\oonlypreprocess} This switch causes the grammar specifications to be transformed up to the point where the automaton's construction can begin. The grammar specifications whose names are provided on the command line are joined (\sref{sec:split}); all parameterized nonterminal symbols are expanded away (\sref{sec:templates}); type inference is performed, if \oinfer is enabled; all nonterminal symbols marked \dinline are expanded away (\sref{sec:inline}). This yields a single, monolithic grammar specification, which is printed on the standard output channel. \docswitch{\oonlytokens} This switch causes the \dtoken declarations in the grammar specification to be translated into a definition of the \token type, which is written to the files \nt{basename}\texttt{.ml} and \nt{basename}\texttt{.mli}. No code is generated. This is useful when a single set of tokens is to be shared between several parsers. The directory \distrib{demos/calc-two} contains a demo that illustrates the use of this switch. \docswitch{\orawdepend} This switch is analogous to \odepend, except that \ocamldep's output is not postprocessed by \menhir; it is echoed without change. This switch is \emph{not} suitable for direct use with \make; it is intended for use with \omake, which performs its own postprocessing. \docswitch{\ostrict} This switch causes several warnings about the grammar and about the automaton to be considered errors. 
This includes warnings about useless precedence declarations, non-terminal symbols that produce the empty language, unreachable non-terminal symbols, productions that are never reduced, conflicts that are not resolved by precedence declarations, and end-of-stream conflicts. \docswitch{\osuggestcomp} This switch causes \menhir to print a set of suggested compilation flags, and exit. These flags are intended to be passed to the \ocaml compilers (\ocamlc or \ocamlopt) when compiling and linking the parser generated by \menhir. What are these flags? In the absence of the \otable switch, they are empty. When \otable is set, these flags ensure that \menhirlib is visible to the \ocaml compiler. If the support library \menhirlib was installed via \ocamlfind, a \texttt{-package} directive is issued; otherwise, a \texttt{-I} directive is used. The file \distrib{demos/Makefile.shared} shows how to exploit the \texttt{--suggest-*} switches. \docswitch{\osuggestlinkb} This switch causes \menhir to print a set of suggested link flags, and exit. These flags are intended to be passed to \texttt{ocamlc} when producing a bytecode executable. What are these flags? In the absence of the \otable switch, they are empty. When \otable is set, these flags ensure that \menhirlib is linked in. If the support library \menhirlib was installed via \ocamlfind, a \texttt{-linkpkg} directive is issued; otherwise, the object file \texttt{menhirLib.cmo} is named. The file \distrib{demos/Makefile.shared} shows how to exploit the \texttt{--suggest-*} switches. \docswitch{\osuggestlinko} This switch causes \menhir to print a set of suggested link flags, and exit. These flags are intended to be passed to \texttt{ocamlopt} when producing a native code executable. What are these flags? In the absence of the \otable switch, they are empty. When \otable is set, these flags ensure that \menhirlib is linked in. 
If the support library \menhirlib was installed via \ocamlfind, a \texttt{-linkpkg} directive is issued; otherwise, the object file \texttt{menhirLib.cmx} is named. The file \distrib{demos/Makefile.shared} shows how to exploit the \texttt{--suggest-*} switches. \docswitch{\ostdlib \nt{directory}} This switch controls the directory where the standard library is found. It allows overriding the default directory that is set at installation time. The trailing \texttt{/} character is optional. \docswitch{\otable} This switch causes \menhir to use its table-based back-end, as opposed to its (default) code-based back-end. When \otable is used, \menhir produces significantly more compact, slightly slower parsers. The table-based back-end produces rather compact tables, which are analogous to those produced by \yacc, \bison, or \ocamlyacc. These tables are not quite stand-alone: they are exploited by an interpreter, which is shipped as part of the support library \menhirlib. For this reason, when \otable is used, \menhirlib must be made visible to the \ocaml compilers, and must be linked into your executable program. The \texttt{--suggest-*} switches, described above, help do this. The code-based back-end compiles the LR automaton directly into a nest of mutually recursive \ocaml functions. In that case, \menhirlib is not required. \docswitch{\otimings} This switch causes internal timing information to be sent to the standard error channel. \docswitch{\otrace} This switch causes tracing code to be inserted into the generated parser, so that, when the parser is run, its actions are logged to the standard error channel. This is analogous to \texttt{ocamlrun}'s \texttt{p=1} parameter, except this switch must be enabled at compile time: one cannot selectively enable or disable tracing at runtime. \docswitch{\oversion} This switch causes \menhir to print its own version number and exit. 
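
To make the interplay between \oonlytokens, \oexternaltokens, and \obase more
concrete, here is a sketch of two invocations that share one token type between
a token module and a parser. The file names \texttt{tokens.mly} and
\texttt{grammar.mly} are hypothetical, and the switches are written in their
long command line form. This is only one possible arrangement, not a
prescribed workflow.

\begin{verbatim}
# Writes tokens.ml and tokens.mli, which define the token type only.
menhir --only-tokens tokens.mly

# Writes parser.ml and parser.mli, which rely on the type Tokens.token
# instead of defining their own token type.
menhir --external-tokens Tokens --base parser tokens.mly grammar.mly
\end{verbatim}

In the second command, two file names are given, so no default base name
exists and the base name has to be supplied explicitly.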

% ---------------------------------------------------------------------------------------------------------------------

\section{Lexical conventions}

The semicolon character (\kw{;}) is treated as insignificant, just like white
space. Thus, rules and producers (for instance) can be separated with
semicolons if it is thought that this improves readability. They can be omitted
otherwise.

Identifiers (\nt{id}) coincide with \ocaml identifiers, except they are not
allowed to contain the quote (\kw{'}) character. Following \ocaml, identifiers
that begin with a lowercase letter (\nt{lid}) or with an uppercase letter
(\nt{uid}) are distinguished.

Comments are C-style (surrounded with \kw{/*} and \kw{*/}, cannot be nested),
C++-style (announced by \kw{/$\!$/} and extending until the end of the line),
or \ocaml-style (surrounded with \kw{(*} and \kw{*)}, can be nested). Of
course, inside \ocaml code, only \ocaml-style comments are allowed.

\ocaml type expressions are surrounded with \kangle{and}. Within such
expressions, all references to type constructors (other than the built-in
\textit{list}, \textit{option}, etc.) must be fully qualified.
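
The following fragment is a small sketch that exercises these conventions. The
token and nonterminal names (\texttt{INT}, \texttt{PLUS}, \texttt{EOF},
\texttt{main}, \texttt{expr}) are made up for the purpose of illustration.

\begin{verbatim}
/* A C-style comment. */
// A C++-style comment, which extends to the end of the line.
(* An OCaml-style comment (* which can be nested *). *)

%token <int> INT   /* an OCaml type expression, in angle brackets */
%token PLUS EOF
%start <int> main

%%

main:
  e = expr; EOF
    { e (* inside OCaml code, only this comment style is allowed *) }

expr:
  i = INT
    { i }
| e = expr PLUS i = INT   /* the semicolons are omitted here */
    { e + i }
\end{verbatim}

The two productions of \texttt{expr} differ only in their use of semicolons
between producers; both spellings are read in the same way.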

% ---------------------------------------------------------------------------------------------------------------------

\section{Syntax of grammar specifications}

\begin{figure}
\begin{center}
\begin{tabular}{r@{}c@{}l}
\nt{specification} \is
   \sepspacelist{\nt{declaration}}
   \percentpercent
   \sepspacelist{\nt{rule}}
   \optional{\percentpercent \textit{Objective Caml code}} \\
\nt{declaration} \is
   \dheader{\textit{Objective Caml code}} \\
&& \dparameter \ocamlparam \\
&& \dtoken \optional{\ocamltype} \sepspacelist{\nt{uid}} \\
&& \dnonassoc \sepspacelist{\nt{uid}} \\
&& \dleft \sepspacelist{\nt{uid}} \\
&& \dright \sepspacelist{\nt{uid}} \\
&& \dtype \ocamltype \sepspacelist{\nt{lid}} \\
&& \dstart \optional{\ocamltype} \sepspacelist{\nt{lid}} \\
\nt{rule} \is
   \optional{\dpublic} \optional{\dinline}
   \nt{lid}
   \optional{\dlpar\sepcommalist{\nt{id}}\drpar}
   \deuxpoints
   \optional{\barre} \seplist{\ \barre}{\nt{group}} \\
\nt{group} \is
   \seplist{\ \barre}{\nt{production}}
   \daction
   \optional {\dprec \nt{id}} \\
\nt{production} \is
   \sepspacelist{\nt{producer}} \optional {\dprec \nt{id}} \\
\nt{producer} \is
   \optional{\nt{lid} \dequal} \nt{actual} \\
\nt{actual} \is
   \nt{id} \optional{\dlpar\sepcommalist{\nt{actual}}\drpar}
   \optional{\dquestion \barre \dplus \barre \dstar} \\
\end{tabular}
\end{center}
\caption{Syntax of grammar specifications}
\label{fig:syntax}
\end{figure}

The syntax of grammar specifications appears in \fref{fig:syntax}. (For
compatibility with \ocamlyacc, some specifications that do not fully adhere to
this syntax are also accepted.)

\subsection{Declarations}

A specification file begins with a sequence of declarations, ended by a
mandatory \percentpercent keyword.

\subsubsection{Headers}

A header is a piece of \ocaml code, surrounded with \dheader{and}. It is copied
verbatim at the beginning of the \ml file. It typically contains \ocaml
\kw{open} directives and function definitions for use by the semantic actions.
If a single grammar specification file contains multiple headers, their order is preserved. However, when two headers originate in distinct grammar specification files, the order in which they are copied to the \ml file is unspecified. \subsubsection{Parameters} A declaration of the form: \begin{quote} \dparameter \ocamlparam \end{quote} causes the entire parser to become parameterized over the \ocaml module \nt{uid}, that is, to become an \ocaml functor. If a single specification file contains multiple \dparameter declarations, their order is preserved, so that the module name \nt{uid} introduced by one declaration is effectively in scope in the declarations that follow. When two \dparameter declarations originate in distinct grammar specification files, the order in which they are processed is unspecified. Last, \dparameter declarations take effect before \dheader{$\ldots$}, \dtoken, \dtype, or \dstart declarations are considered, so that the module name \nt{uid} introduced by a \dparameter declaration is effectively in scope in \emph{all} \dheader{$\ldots$}, \dtoken, \dtype, or \dstart declarations, regardless of whether they precede or follow the \dparameter declaration. This means, in particular, that the side effects of an \ocaml header are observed only when the functor is applied, not when it is defined. \subsubsection{Tokens} A declaration of the form: \begin{quote} \dtoken \optional{\ocamltype} $\nt{uid}_1, \ldots, \nt{uid}_n$ \end{quote} defines the identifiers $\nt{uid}_1, \ldots, \nt{uid}_n$ as tokens, that is, as terminal symbols in the grammar specification and as data constructors in the \textit{token} type. If an \ocaml type $t$ is present, then these tokens are considered to carry a semantic value of type $t$, otherwise they are considered to carry no semantic value. 
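
As an illustration, and assuming hypothetical token names, the following
declarations define two tokens that carry a semantic value and five tokens that
carry none:

\begin{verbatim}
%token <int> INT
%token <string> IDENT
%token PLUS TIMES LPAREN RPAREN EOF
\end{verbatim}

Under these declarations, one would expect the \textit{token} type to resemble
the following \ocaml definition. This is only a sketch: the exact layout and
the order of the data constructors in the generated code may differ.

\begin{verbatim}
type token =
  | INT of int
  | IDENT of string
  | PLUS
  | TIMES
  | LPAREN
  | RPAREN
  | EOF
\end{verbatim}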
\subsubsection{Priority and associativity} \label{sec:assoc} A declaration of one of the following forms: \begin{quote} \dnonassoc $\nt{uid}_1 \ldots \nt{uid}_n$ \\ \dleft $\nt{uid}_1 \ldots \nt{uid}_n$ \\ \dright $\nt{uid}_1 \ldots \nt{uid}_n$ \end{quote} attributes both a \emph{priority level} and an \emph{associativity status} to the symbols $\nt{uid}_1, \ldots, \nt{uid}_n$. The priority level assigned to $\nt{uid}_1, \ldots, \nt{uid}_n$ is not defined explicitly: instead, it is defined to be higher than the priority level assigned by the previous \dnonassoc, \dleft, or \dright declaration, and lower than that assigned by the next \dnonassoc, \dleft, or \dright declaration. The symbols $\nt{uid}_1, \ldots, \nt{uid}_n$ can be tokens (defined elsewhere by a \dtoken declaration) or dummies (not defined anywhere). Both can be referred to as part of \dprec annotations. Associativity status and priority levels allow shift/reduce conflicts to be silently resolved (\sref{sec:conflicts}). \subsubsection{Types} A declaration of the form: \begin{quote} \dtype \ocamltype $\nt{lid}_1 \ldots \nt{lid}_n$ \end{quote} assigns an \ocaml type to each of the nonterminal symbols $\nt{lid}_1, \ldots, \nt{lid}_n$. For start symbols, providing an \ocaml type is mandatory, but is usually done as part of the \dstart declaration. For other symbols, it is optional. Providing type information can improve the quality of \ocaml's type error messages. \subsubsection{Start symbols} A declaration of the form: \begin{quote} \dstart \optional{\ocamltype} $\nt{lid}_1 \ldots \nt{lid}_n$ \end{quote} declares the nonterminal symbols $\nt{lid}_1, \ldots, \nt{lid}_n$ to be start symbols. Each such symbol must be assigned an \ocaml type either as part of the \dstart declaration or via separate \dtype declarations. Each of $\nt{lid}_1, \ldots, \nt{lid}_n$ becomes the name of a function whose signature is published in the \mli file and that can be used to invoke the parser. 
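
The declarations described above can be combined as in the following sketch of
a small calculator grammar. All token and nonterminal names are hypothetical.
Because \texttt{TIMES} is declared after \texttt{PLUS}, it receives a higher
priority level; both are left-associative.

\begin{verbatim}
%token <int> INT
%token PLUS TIMES EOF

%left PLUS    /* lower priority */
%left TIMES   /* higher priority */

%start <int> main
%type <int> expr

%%

main:
  e = expr EOF
    { e }

expr:
  i = INT
    { i }
| e1 = expr PLUS e2 = expr
    { e1 + e2 }
| e1 = expr TIMES e2 = expr
    { e1 * e2 }
\end{verbatim}

Here, \texttt{main} is the only start symbol, so the \mli file should publish a
function named \texttt{main}. Following the conventions of \ocamlyacc, one
would expect a declaration along these lines (an assumption, to be checked
against the generated interface):

\begin{verbatim}
val main: (Lexing.lexbuf -> token) -> Lexing.lexbuf -> int
\end{verbatim}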
\subsection{Rules} Following the mandatory \percentpercent keyword, a sequence of rules is expected. Each rule defines a nonterminal symbol~\nt{id}. In its simplest form, a rule begins with \nt{id}, followed by a colon character (\deuxpoints), and continues with a sequence of production groups (\sref{sec:productiongroups}). Each production group is preceded with a vertical bar character (\barre); the very first bar is optional. The meaning of the bar is choice: the nonterminal symbol \nt{id} develops to either of the production groups. We defer explanations of the keyword \dpublic (\sref{sec:split}), of the keyword \dinline (\sref{sec:inline}), and of the optional formal parameters $\dlpar\sepcommalist{\nt{id}}\drpar$ (\sref{sec:templates}). \subsubsection{Production groups} \label{sec:productiongroups} In its simplest form, a production group consists of a single production (\sref{sec:productions}), followed by an \ocaml semantic action (\sref{sec:actions}) and an optional \dprec annotation (\sref{sec:prec}). A production specifies a sequence of terminal and nonterminal symbols that should be recognized, and optionally binds identifiers to their semantic values. \paragraph{Semantic actions} \label{sec:actions} A semantic action is a piece of \ocaml code that is executed in order to assign a semantic value to the nonterminal symbol with which this production group is associated. A semantic action can refer to the (already computed) semantic values of the terminal or nonterminal symbols that appear in the production via the semantic value identifiers bound by the production. For compatibility with \ocamlyacc, semantic actions can also refer to these semantic values via positional keywords of the form \kw{\$1}, \kw{\$2}, etc.\ This style is discouraged. 
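
To contrast the two styles, here is a sketch of a rule whose first two
productions bind semantic value identifiers and whose last production uses
positional keywords instead. The tokens \texttt{INT}, \texttt{PLUS}, and
\texttt{MINUS} are hypothetical.

\begin{verbatim}
expr:
  i = INT
    { i }
| e1 = expr; PLUS; e2 = expr     (* named style *)
    { e1 + e2 }
| expr MINUS expr                (* positional style, discouraged *)
    { $1 - $3 }
\end{verbatim}

In the last production, \texttt{\$2} would denote the semantic value of
\texttt{MINUS}, which is why the right-hand operand is \texttt{\$3}.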
\paragraph{\dprec annotations} \label{sec:prec} An annotation of the form \dprec \nt{uid} indicates that the precedence level of the production group is the level assigned to the symbol \nt{uid} via a previous \dnonassoc, \dleft, or \dright declaration (\sref{sec:assoc}). In the absence of a \dprec annotation, the precedence level assigned to each production is the level assigned to the rightmost terminal symbol that appears in it. It is undefined if the rightmost terminal symbol has an undefined precedence level or if the production mentions no terminal symbols at all. The precedence level assigned to a production is used when resolving shift/reduce conflicts (\sref{sec:conflicts}). \paragraph{Multiple productions in a group} If multiple productions are present in a single group, then the semantic action and precedence annotation are shared between them. This short-hand effectively allows several productions to share a semantic action and precedence annotation without requiring textual duplication. It is legal only when every production binds exactly the same set of semantic value identifiers and when no positional semantic value keywords (\kw{\$1}, etc.) are used. \subsubsection{Productions} \label{sec:productions} A production is a sequence of producers (\sref{sec:producers}), optionally followed by a \dprec annotation (\sref{sec:prec}). If a precedence annotation is present, it applies to this production alone, not to other productions in the production group. It is illegal for a production and its production group to both carry \dprec annotations. \subsubsection{Producers} \label{sec:producers} A producer is an actual (\sref{sec:actual}), optionally preceded with a binding of a semantic value identifier, of the form \nt{lid} \dequal. The actual specifies which construction should be recognized and how a semantic value should be computed for that construction. 
The identifier \nt{lid}, if present, becomes bound to that semantic value in
the semantic action that follows. Otherwise, the semantic value can be referred
to via a positional keyword (\kw{\$1}, etc.).

\subsubsection{Actuals}
\label{sec:actual}

In its simplest form, an actual simply consists of a terminal or nonterminal
symbol. The optional actual parameters $\dlpar\sepcommalist{\nt{actual}}\drpar$
and the optional modifier (\dquestion, \dplus, or \dstar) are explained further
on (see \sref{sec:templates} and \fref{fig:sugar}).

\section{Advanced features}

\subsection{Splitting specifications over multiple files}
\label{sec:split}

\paragraph{Modules} Grammar specifications can be split over multiple files.
When \menhir is invoked with multiple argument file names, it considers each of
these files as a \emph{partial} grammar specification, and \emph{joins} these
partial specifications in order to obtain a single, complete specification.

This feature is intended to promote a form of modularity. It is hoped that, by
splitting large grammar specifications into several ``modules'', they can be
made more manageable. It is also hoped that this mechanism, in conjunction with
parameterization (\sref{sec:templates}), will promote sharing and reuse. It
should be noted, however, that this is only a weak form of modularity. Indeed,
partial specifications cannot be independently processed (say, checked for
conflicts). It is necessary to first join them, so as to form a complete
grammar specification, before any kind of grammar analysis can be done.

This mechanism is, in fact, how \menhir's standard library (\sref{sec:library})
is made available: even though its name does not appear on the command line, it
is automatically joined with the user's explicitly-provided grammar
specifications, making the standard library's definitions globally visible.

A partial grammar specification, or module, contains declarations and rules,
just like a complete one: there is no visible difference.
Of course, it can consist of only declarations, or only rules, if the user so chooses. (Don't forget the mandatory \percentpercent keyword that separates declarations and rules. It must be present, even if one of the two sections is empty.) \paragraph{Private and public nonterminal symbols} It should be noted that joining is \emph{not} a purely textual process. If two modules happen to define a nonterminal symbol by the same name, then it is considered, by default, that this is an accidental name clash. In that case, each of the two nonterminal symbols is silently renamed so as to avoid the clash. In other words, by default, a nonterminal symbol defined in module $A$ is considered \emph{private}, and cannot be defined again, or referred to, in module $B$. Naturally, it is sometimes desirable to define a nonterminal symbol $N$ in module $A$ and to refer to it in module $B$. This is permitted if $N$ is public, that is, if either its definition carries the keyword \dpublic or $N$ is declared to be a start symbol. A public nonterminal symbol is never renamed, so it can be referred to by modules other than its defining module. In fact, it is even permitted to split the definition of a public nonterminal symbol over multiple modules. That is, a public nonterminal symbol $N$ can have multiple definitions in distinct modules. When the modules are joined, the definitions are joined as well, using the choice (\barre) operator. This feature allows splitting a grammar specification in a manner that is independent of the grammar's structure. For instance, in the grammar of a programming language, the definition of the nonterminal symbol \nt{expression} could be split into multiple modules, where one module groups the expression forms that have to do with arithmetic, one module groups those that concern function definitions and function calls, one module groups those that concern object definitions and method calls, and so on. 
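
As a sketch of this style, the definition of \nt{expression} might be split
over two modules as follows. The file names, the tokens (assumed to be
declared elsewhere via \dtoken), and the data constructors \texttt{Add} and
\texttt{App} are all hypothetical; precedence declarations are omitted for
brevity, so this fragment says nothing about how conflicts would be resolved.

\begin{verbatim}
/* File arith.mly: expression forms that have to do with arithmetic. */
%%

%public expression:
  e1 = expression; PLUS; e2 = expression
    { Add (e1, e2) }

/* File calls.mly: expression forms that concern function calls. */
%%

%public expression:
  f = expression; LPAREN; arg = expression; RPAREN
    { App (f, arg) }
\end{verbatim}

When both files are named on the command line, the two definitions are joined
as if their production groups had been written under a single
\nt{expression} rule, separated by \barre.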
\paragraph{Tokens aside}
Another use of modularity consists in placing all \dtoken declarations in one module, and the actual grammar specification in another module. The module that contains the token definitions can then be shared, making it easier to define multiple parsers that accept the same type of tokens. (On this topic, see \distrib{demos/calc-two}.)

\subsection{Parameterizing rules}
\label{sec:templates}

A rule (that is, the definition of a nonterminal symbol) can be parameterized over an arbitrary number of symbols, which are referred to as formal parameters.

\paragraph{Example}
For instance, here is the definition of the parameterized nonterminal symbol \nt{option}, taken from the standard library (\sref{sec:library}):
%
\begin{quote}
\begin{tabular}{l}
\dpublic \basic{option}(\basic{X}):
\newprod \dpaction{\basic{None}}
\newprod \basic{x} = \basic{X} \dpaction{\basic{Some} \basic{x}}
\end{tabular}
\end{quote}
%
This definition states that \nt{option}(\basic{X}) expands to either the empty string, producing the semantic value \basic{None}, or to the string \basic{X}, producing the semantic value {\basic{Some}~\basic{x}}, where \basic{x} is the semantic value of \basic{X}. In this definition, the symbol \basic{X} is abstract: it stands for an arbitrary terminal or nonterminal symbol. The definition is made public, so \nt{option} can be referred to within client modules.

A client that wishes to use \nt{option} simply refers to it, together with an actual parameter -- a symbol that is intended to replace \basic{X}.
For instance, here is how one might define a sequence of declarations, preceded with optional commas:
%
\begin{quote}
\begin{tabular}{l}
\nt{declarations}:
\newprod \dpaction{[]}
\newprod \basic{ds} = \nt{declarations}; \nt{option}(\basic{COMMA}); \basic{d} = \nt{declaration} \dpaction{ \basic{d} :: \basic{ds} }
\end{tabular}
\end{quote}
%
This definition states that \nt{declarations} expands either to the empty string or to \nt{declarations} followed by an optional comma followed by \nt{declaration}. (Here, \basic{COMMA} is presumably a terminal symbol.) When this rule is encountered, the definition of \nt{option} is instantiated: that is, a copy of the definition, where \basic{COMMA} replaces \basic{X}, is produced. Things behave exactly as if one had written:
\begin{quote}
\begin{tabular}{l}
\basic{optional\_comma}:
\newprod \dpaction{\basic{None}}
\newprod \basic{x} = \basic{COMMA} \dpaction{\basic{Some} \basic{x}} \\
\nt{declarations}:
\newprod \dpaction{[]}
\newprod \basic{ds} = \nt{declarations}; \nt{optional\_comma}; \basic{d} = \nt{declaration} \dpaction{ \basic{d} :: \basic{ds} }
\end{tabular}
\end{quote}
%
Note that, even though \basic{COMMA} presumably has been declared as a token with no semantic value, writing \basic{x}~=~\basic{COMMA} is legal, and binds \basic{x} to the unit value. This design choice ensures that the definition of \nt{option} makes sense regardless of the nature of \basic{X}: that is, \basic{X} can be instantiated with a terminal symbol, with or without a semantic value, or with a nonterminal symbol.

\paragraph{Parameterization in general}
In general, the definition of a nonterminal symbol $N$ can be parameterized with an arbitrary number of formal parameters. When $N$ is referred to within a production, it must be applied to the same number of actuals.
In general, an actual is:
%
\begin{itemize}
\item either a single symbol, which can be a terminal symbol, a nonterminal symbol, or a formal parameter;
\item or an application of such a symbol to a number of actuals.
\end{itemize}

For instance, here is a rule whose single production consists of a single producer, which contains several, nested actuals. (This example is discussed again in \sref{sec:library}.)
%
\begin{quote}
\begin{tabular}{l}
\nt{plist}(\nt{X}):
\newprod \basic{xs} = \nt{loption}(%
\nt{delimited}(%
\basic{LPAREN},
\nt{separated\_nonempty\_list}(\basic{COMMA}, \basic{X}),
\basic{RPAREN}%
)%
)
\dpaction{\basic{xs}}
\end{tabular}
\end{quote}

\begin{figure}
\begin{center}
\begin{tabular}{r@{\hskip 2mm}c@{\hskip 2mm}l}
\nt{actual}\dquestion & is syntactic sugar for & \nt{option}(\nt{actual}) \\
\nt{actual}\dplus & is syntactic sugar for & \nt{nonempty\_list}(\nt{actual}) \\
\nt{actual}\dstar & is syntactic sugar for & \nt{list}(\nt{actual})
\end{tabular}
\end{center}
\caption{Syntactic sugar for simulating regular expressions}
\label{fig:sugar}
\end{figure}
%
Applications of the parameterized nonterminal symbols \nt{option}, \nt{nonempty\_list}, and \nt{list}, which are defined in the standard library (\sref{sec:library}), can be written using a familiar, regular-expression-like syntax (\fref{fig:sugar}).

\paragraph{Higher-order parameters}
A formal parameter can itself expect parameters. For instance, here is a rule that defines the syntax of procedures in an imaginary programming language:
%
\begin{quote}
\begin{tabular}{l}
\nt{procedure}(\nt{list}):
\newprod \basic{PROCEDURE} \basic{ID} \nt{list}(\nt{formal}) \basic{SEMICOLON} \nt{block} \basic{SEMICOLON} \dpaction{$\ldots$}
\end{tabular}
\end{quote}
%
This rule states that the token \basic{ID}, which represents the name of the procedure, should be followed with a list of formal parameters. (The definitions of the nonterminal symbols \nt{formal} and \nt{block} are not shown.)
However, because \nt{list} is a formal parameter, as opposed to a concrete nonterminal symbol defined elsewhere, this definition does not specify how the list is laid out: which token, if any, is used to separate, or terminate, list elements? is the list allowed to be empty? and so on. A more concrete notion of procedure is obtained by instantiating the formal parameter \nt{list}: for instance, \nt{procedure}(\nt{plist}), where \nt{plist} is the parameterized nonterminal symbol defined earlier, is a valid application.

\paragraph{Consistency}
Definitions and uses of parameterized nonterminal symbols are checked for consistency before they are expanded away. In short, it is checked that, wherever a nonterminal symbol is used, it is supplied with actual arguments in appropriate number and of appropriate nature. This guarantees that expansion of parameterized definitions terminates and produces a well-formed grammar as its outcome.

\subsection{Inlining}
\label{sec:inline}

It is well-known that the following grammar of arithmetic expressions does not work as expected: that is, in spite of the priority declarations, it has shift/reduce conflicts.
%
\begin{quote}
\begin{tabular}{l}
\dtoken \kangle{\basic{int}} \basic{INT} \\
\dtoken \basic{PLUS} \basic{TIMES} \\
\dleft \basic{PLUS} \\
\dleft \basic{TIMES} \\
\\
\percentpercent \\
\\
\nt{expression}:
\newprod \basic{i} = \basic{INT} \dpaction{\basic{i}}
\newprod \basic{e} = \nt{expression}; \basic{o} = \nt{op}; \basic{f} = \nt{expression} \dpaction{\basic{o} \basic{e} \basic{f}} \\
\nt{op}:
\newprod \basic{PLUS} \dpaction{( + )}
\newprod \basic{TIMES} \dpaction{( * )}
\end{tabular}
\end{quote}
%
The trouble is, the precedence level of the production \nt{expression} $\rightarrow$ \nt{expression} \nt{op} \nt{expression} is undefined, and there is no sensible way of defining it via a \dprec declaration, since the desired level really depends upon the symbol that was recognized by \nt{op}: was it \basic{PLUS} or \basic{TIMES}?
The standard workaround is to abandon the definition of \nt{op} as a separate nonterminal symbol, and to inline its definition into the definition of \nt{expression}, like this:
%
\begin{quote}
\begin{tabular}{l}
\nt{expression}:
\newprod \basic{i} = \basic{INT} \dpaction{\basic{i}}
\newprod \basic{e} = \nt{expression}; \basic{PLUS}; \basic{f} = \nt{expression} \dpaction{\basic{e} + \basic{f}}
\newprod \basic{e} = \nt{expression}; \basic{TIMES}; \basic{f} = \nt{expression} \dpaction{\basic{e} * \basic{f}}
\end{tabular}
\end{quote}
%
This avoids the shift/reduce conflict, but gives up some of the original specification's structure, which, in realistic situations, can be detrimental. Fortunately, \menhir offers a way of avoiding the conflict without manually transforming the grammar, by declaring that the nonterminal symbol \nt{op} should be inlined:
%
\begin{quote}
\begin{tabular}{l}
\nt{expression}:
\newprod \basic{i} = \basic{INT} \dpaction{\basic{i}}
\newprod \basic{e} = \nt{expression}; \basic{o} = \nt{op}; \basic{f} = \nt{expression} \dpaction{\basic{o} \basic{e} \basic{f}} \\
\dinline \nt{op}:
\newprod \basic{PLUS} \dpaction{( + )}
\newprod \basic{TIMES} \dpaction{( * )}
\end{tabular}
\end{quote}
%
The \dinline keyword causes all references to \nt{op} to be replaced with its definition. In this example, the definition of \nt{op} involves two productions, one that expands to \basic{PLUS} and one that expands to \basic{TIMES}, so every production that refers to \nt{op} is effectively turned into two productions, one that refers to \basic{PLUS} and one that refers to \basic{TIMES}. After inlining, \nt{op} disappears and \nt{expression} has three productions: that is, the result of inlining is exactly the manual workaround shown above.

In some situations, inlining can also help recover a slight efficiency margin.
For instance, the definition:
%
\begin{quote}
\begin{tabular}{l}
\dinline \nt{plist}(\nt{X}):
\newprod \basic{xs} = \nt{loption}(%
\nt{delimited}(%
\basic{LPAREN},
\nt{separated\_nonempty\_list}(\basic{COMMA}, \basic{X}),
\basic{RPAREN}%
)%
)
\dpaction{\basic{xs}}
\end{tabular}
\end{quote}
%
effectively makes \nt{plist}(\nt{X}) an alias for the right-hand side \nt{loption}($\ldots$). Without the \dinline keyword, the language recognized by the grammar would be the same, but the LR automaton would probably have one more state and would perform one more reduction at run time.

\subsection{The standard library}
\label{sec:library}

\begin{figure}
\begin{center}
\begin{tabular}{lp{51mm}ll}
Name & Recognizes & Produces & Comment \\
\hline\\
\nt{option}(\nt{X}) & $\epsilon$ \barre \nt{X} & $\alpha$ \basic{option}, if \nt{X} : $\alpha$ \\
\nt{ioption}(\nt{X}) & $\epsilon$ \barre \nt{X} & $\alpha$ \basic{option}, if \nt{X} : $\alpha$ & (inlined) \\
\nt{boption}(\nt{X}) & $\epsilon$ \barre \nt{X} & \basic{bool} \\
\nt{loption}(\nt{X}) & $\epsilon$ \barre \nt{X} & $\alpha$ \basic{list}, if \nt{X} : $\alpha$ \nt{list} \\
\\
\nt{pair}(\nt{X}, \nt{Y}) & \nt{X} \nt{Y} & $\alpha\times\beta$, if \nt{X} : $\alpha$ and \nt{Y} : $\beta$ \\
\nt{separated\_pair}(\nt{X}, \nt{sep}, \nt{Y}) & \nt{X} \nt{sep} \nt{Y} & $\alpha\times\beta$, if \nt{X} : $\alpha$ and \nt{Y} : $\beta$ \\
\nt{preceded}(\nt{opening}, \nt{X}) & \nt{opening} \nt{X} & $\alpha$, if \nt{X} : $\alpha$ \\
\nt{terminated}(\nt{X}, \nt{closing}) & \nt{X} \nt{closing} & $\alpha$, if \nt{X} : $\alpha$ \\
\nt{delimited}(\nt{opening}, \nt{X}, \nt{closing}) & \nt{opening} \nt{X} \nt{closing} & $\alpha$, if \nt{X} : $\alpha$ \\
\\
\nt{list}(\nt{X}) & a possibly empty sequence of \nt{X}'s & $\alpha$ \basic{list}, if \nt{X} : $\alpha$ \\
\nt{nonempty\_list}(\nt{X}) & a nonempty sequence of \nt{X}'s & $\alpha$ \basic{list}, if \nt{X} : $\alpha$ \\
\nt{separated\_list}(\nt{sep}, \nt{X}) & a possibly empty sequence of \nt{X}'s separated with \nt{sep}'s & $\alpha$ \basic{list}, if \nt{X} : $\alpha$ \\
\nt{separated\_nonempty\_list}(\nt{sep}, \nt{X}) & a nonempty sequence of \nt{X}'s separated with \nt{sep}'s & $\alpha$ \basic{list}, if \nt{X} : $\alpha$ \\
\end{tabular}
\end{center}
\caption{Summary of the standard library}
\label{fig:standard}
\end{figure}

Once equipped with a rudimentary module system (\sref{sec:split}), parameterization (\sref{sec:templates}), and inlining (\sref{sec:inline}), it is straightforward to propose a collection of commonly used definitions, such as options, sequences, lists, and so on. This \emph{standard library} is joined, by default, with every grammar specification. A summary of the nonterminal symbols offered by the standard library appears in \fref{fig:standard}. See also the short-hands documented in \fref{fig:sugar}.

By relying on the standard library, a client module can concisely define more elaborate notions. For instance, the following rule:
%
\begin{quote}
\begin{tabular}{l}
\dinline \nt{plist}(\nt{X}):
\newprod \basic{xs} = \nt{loption}(%
\nt{delimited}(%
\basic{LPAREN},
\nt{separated\_nonempty\_list}(\basic{COMMA}, \basic{X}),
\basic{RPAREN}%
)%
)
\dpaction{\basic{xs}}
\end{tabular}
\end{quote}
%
causes \nt{plist}(\nt{X}) to recognize a list of \nt{X}'s, where the empty list is represented by the empty string, and a non-empty list is delimited with parentheses and comma-separated.

% ---------------------------------------------------------------------------------------------------------------------

\section{Conflicts}
\label{sec:conflicts}

When a shift/reduce or reduce/reduce conflict is detected, it is classified as either benign, if it can be resolved by consulting user-supplied precedence declarations, or severe, if it cannot. Benign conflicts are not reported. Severe conflicts are reported and, if the \oexplain switch is on, explained.
\subsection{When is a conflict benign?}

A shift/reduce conflict involves a single token (the one that one might wish to shift) and one or more productions (those that one might wish to reduce). When such a conflict is detected, the precedence levels (\sref{sec:assoc}, \sref{sec:prec}) of these entities are looked up and compared as follows:
\begin{enumerate}
\item if only one production is involved, and if it has higher priority than the token, then the conflict is resolved in favor of reduction.
\item if only one production is involved, and if it has the same priority as the token, then the associativity status of the token is looked up:
\begin{enumerate}
\item if the token was declared nonassociative, then the conflict is resolved in favor of neither action, that is, a syntax error will be signaled if this token shows up when this production is about to be reduced;
\item if the token was declared left-associative, then the conflict is resolved in favor of reduction;
\item if the token was declared right-associative, then the conflict is resolved in favor of shifting.
\end{enumerate}
\item \label{multiway} if multiple productions are involved, and if, considered one by one, they all cause the conflict to be resolved in the same way (that is, either in favor of shifting, or in favor of neither), then the conflict is resolved in that way.
\end{enumerate}
In any of these cases, the conflict is considered benign. Otherwise, it is considered severe. Note that a reduce/reduce conflict is always considered severe, unless it happens to be subsumed by a benign multi-way shift/reduce conflict (item~\ref{multiway} above).

\subsection{How are severe conflicts explained?}

When the \odump switch is on, a description of the automaton is written to the \automaton file. Severe conflicts are shown as part of this description. Fortunately, there is also a way of understanding conflicts in terms of the grammar, rather than in terms of the automaton.
When the \oexplain switch is on, a textual explanation is written to the \conflicts file. \emph{Not all conflicts are explained} in this file: instead, \emph{only one conflict per automaton state is explained}. This is done partly in the interest of brevity, but also because Pager's algorithm can create artificial conflicts in a state that already contains a true LR(1) conflict; thus, one cannot hope in general to explain all of the conflicts that appear in the automaton. As a result of this policy, once all conflicts explained in the \conflicts file have been fixed, one might need to run \menhir again to produce yet more conflict explanations. \begin{figure} \begin{quote} \begin{tabular}{l} \dtoken \basic{IF THEN ELSE} \\ \dstart \kangle{\basic{expression}} \nt{expression} \\