xref: /sqlite-3.40.0/doc/lemon.html (revision dfe4e6bb)
175897234Sdrh<html>
275897234Sdrh<head>
375897234Sdrh<title>The Lemon Parser Generator</title>
475897234Sdrh</head>
575897234Sdrh<body bgcolor=white>
675897234Sdrh<h1 align=center>The Lemon Parser Generator</h1>
775897234Sdrh
89bccde3dSdrh<p>Lemon is an LALR(1) parser generator for C.
99bccde3dSdrhIt does the same job as "bison" and "yacc".
109bccde3dSdrhBut lemon is not a bison or yacc clone.  Lemon
1175897234Sdrhuses a different grammar syntax which is designed to
129bccde3dSdrhreduce the number of coding errors.  Lemon also uses a
139bccde3dSdrhparsing engine that is faster than yacc and
149bccde3dSdrhbison and which is both reentrant and threadsafe.
159bccde3dSdrh(Update: Since the previous sentence was written, bison
169bccde3dSdrhhas also been updated so that it too can generate a
179bccde3dSdrhreentrant and threadsafe parser.)
189bccde3dSdrhLemon also implements features that can be used
1975897234Sdrhto eliminate resource leaks, making is suitable for use
2075897234Sdrhin long-running programs such as graphical user interfaces
2175897234Sdrhor embedded controllers.</p>
2275897234Sdrh
2375897234Sdrh<p>This document is an introduction to the Lemon
2475897234Sdrhparser generator.</p>
2575897234Sdrh
2675897234Sdrh<h2>Theory of Operation</h2>
2775897234Sdrh
2875897234Sdrh<p>The main goal of Lemon is to translate a context free grammar (CFG)
2975897234Sdrhfor a particular language into C code that implements a parser for
3075897234Sdrhthat language.
3175897234SdrhThe program has two inputs:
3275897234Sdrh<ul>
3375897234Sdrh<li>The grammar specification.
3475897234Sdrh<li>A parser template file.
3575897234Sdrh</ul>
3675897234SdrhTypically, only the grammar specification is supplied by the programmer.
3775897234SdrhLemon comes with a default parser template which works fine for most
3875897234Sdrhapplications.  But the user is free to substitute a different parser
3975897234Sdrhtemplate if desired.</p>
4075897234Sdrh
4175897234Sdrh<p>Depending on command-line options, Lemon will generate between
4275897234Sdrhone and three files of outputs.
4375897234Sdrh<ul>
4475897234Sdrh<li>C code to implement the parser.
4575897234Sdrh<li>A header file defining an integer ID for each terminal symbol.
4675897234Sdrh<li>An information file that describes the states of the generated parser
4775897234Sdrh    automaton.
4875897234Sdrh</ul>
4975897234SdrhBy default, all three of these output files are generated.
509bccde3dSdrhThe header file is suppressed if the "-m" command-line option is
519bccde3dSdrhused and the report file is omitted when "-q" is selected.</p>
5275897234Sdrh
539bccde3dSdrh<p>The grammar specification file uses a ".y" suffix, by convention.
5475897234SdrhIn the examples used in this document, we'll assume the name of the
559bccde3dSdrhgrammar file is "gram.y".  A typical use of Lemon would be the
5675897234Sdrhfollowing command:
5775897234Sdrh<pre>
5875897234Sdrh   lemon gram.y
5975897234Sdrh</pre>
609bccde3dSdrhThis command will generate three output files named "gram.c",
619bccde3dSdrh"gram.h" and "gram.out".
6275897234SdrhThe first is C code to implement the parser.  The second
6375897234Sdrhis the header file that defines numerical values for all
6475897234Sdrhterminal symbols, and the last is the report that explains
6575897234Sdrhthe states used by the parser automaton.</p>
6675897234Sdrh
6775897234Sdrh<h3>Command Line Options</h3>
6875897234Sdrh
6975897234Sdrh<p>The behavior of Lemon can be modified using command-line options.
7075897234SdrhYou can obtain a list of the available command-line options together
7175897234Sdrhwith a brief explanation of what each does by typing
7275897234Sdrh<pre>
7375897234Sdrh   lemon -?
7475897234Sdrh</pre>
7575897234SdrhAs of this writing, the following command-line options are supported:
7675897234Sdrh<ul>
779bccde3dSdrh<li><b>-b</b>
789bccde3dSdrhShow only the basis for each parser state in the report file.
799bccde3dSdrh<li><b>-c</b>
809bccde3dSdrhDo not compress the generated action tables.
819bccde3dSdrh<li><b>-D<i>name</i></b>
829bccde3dSdrhDefine C preprocessor macro <i>name</i>.  This macro is useable by
839bccde3dSdrh"%ifdef" lines in the grammar file.
849bccde3dSdrh<li><b>-g</b>
859bccde3dSdrhDo not generate a parser.  Instead write the input grammar to standard
869bccde3dSdrhoutput with all comments, actions, and other extraneous text removed.
879bccde3dSdrh<li><b>-l</b>
88*dfe4e6bbSdrhOmit "#line" directives in the generated parser C code.
899bccde3dSdrh<li><b>-m</b>
909bccde3dSdrhCause the output C source code to be compatible with the "makeheaders"
919bccde3dSdrhprogram.
929bccde3dSdrh<li><b>-p</b>
939bccde3dSdrhDisplay all conflicts that are resolved by
949bccde3dSdrh<a href='#precrules'>precedence rules</a>.
959bccde3dSdrh<li><b>-q</b>
969bccde3dSdrhSuppress generation of the report file.
979bccde3dSdrh<li><b>-r</b>
989bccde3dSdrhDo not sort or renumber the parser states as part of optimization.
999bccde3dSdrh<li><b>-s</b>
1009bccde3dSdrhShow parser statistics before existing.
1019bccde3dSdrh<li><b>-T<i>file</i></b>
1029bccde3dSdrhUse <i>file</i> as the template for the generated C-code parser implementation.
1039bccde3dSdrh<li><b>-x</b>
1049bccde3dSdrhPrint the Lemon version number.
10575897234Sdrh</ul>
10675897234Sdrh
10775897234Sdrh<h3>The Parser Interface</h3>
10875897234Sdrh
10975897234Sdrh<p>Lemon doesn't generate a complete, working program.  It only generates
11075897234Sdrha few subroutines that implement a parser.  This section describes
11175897234Sdrhthe interface to those subroutines.  It is up to the programmer to
11275897234Sdrhcall these subroutines in an appropriate way in order to produce a
11375897234Sdrhcomplete system.</p>
11475897234Sdrh
11575897234Sdrh<p>Before a program begins using a Lemon-generated parser, the program
11675897234Sdrhmust first create the parser.
11775897234SdrhA new parser is created as follows:
11875897234Sdrh<pre>
11975897234Sdrh   void *pParser = ParseAlloc( malloc );
12075897234Sdrh</pre>
12175897234SdrhThe ParseAlloc() routine allocates and initializes a new parser and
12275897234Sdrhreturns a pointer to it.
1239bccde3dSdrhThe actual data structure used to represent a parser is opaque &mdash;
12475897234Sdrhits internal structure is not visible or usable by the calling routine.
12575897234SdrhFor this reason, the ParseAlloc() routine returns a pointer to void
12675897234Sdrhrather than a pointer to some particular structure.
12775897234SdrhThe sole argument to the ParseAlloc() routine is a pointer to the
1289bccde3dSdrhsubroutine used to allocate memory.  Typically this means malloc().</p>
12975897234Sdrh
13075897234Sdrh<p>After a program is finished using a parser, it can reclaim all
13175897234Sdrhmemory allocated by that parser by calling
13275897234Sdrh<pre>
13375897234Sdrh   ParseFree(pParser, free);
13475897234Sdrh</pre>
13575897234SdrhThe first argument is the same pointer returned by ParseAlloc().  The
13675897234Sdrhsecond argument is a pointer to the function used to release bulk
13775897234Sdrhmemory back to the system.</p>
13875897234Sdrh
13975897234Sdrh<p>After a parser has been allocated using ParseAlloc(), the programmer
14075897234Sdrhmust supply the parser with a sequence of tokens (terminal symbols) to
14175897234Sdrhbe parsed.  This is accomplished by calling the following function
14275897234Sdrhonce for each token:
14375897234Sdrh<pre>
14475897234Sdrh   Parse(pParser, hTokenID, sTokenData, pArg);
14575897234Sdrh</pre>
14675897234SdrhThe first argument to the Parse() routine is the pointer returned by
14775897234SdrhParseAlloc().
14875897234SdrhThe second argument is a small positive integer that tells the parse the
14975897234Sdrhtype of the next token in the data stream.
15075897234SdrhThere is one token type for each terminal symbol in the grammar.
15175897234SdrhThe gram.h file generated by Lemon contains #define statements that
15275897234Sdrhmap symbolic terminal symbol names into appropriate integer values.
1539bccde3dSdrhA value of 0 for the second argument is a special flag to the
1549bccde3dSdrhparser to indicate that the end of input has been reached.
15575897234SdrhThe third argument is the value of the given token.  By default,
15675897234Sdrhthe type of the third argument is integer, but the grammar will
15775897234Sdrhusually redefine this type to be some kind of structure.
15875897234SdrhTypically the second argument will be a broad category of tokens
1599bccde3dSdrhsuch as "identifier" or "number" and the third argument will
16075897234Sdrhbe the name of the identifier or the value of the number.</p>
16175897234Sdrh
16275897234Sdrh<p>The Parse() function may have either three or four arguments,
16345f31be8Sdrhdepending on the grammar.  If the grammar specification file requests
16445f31be8Sdrhit (via the <a href='#extraarg'><tt>extra_argument</tt> directive</a>),
16545f31be8Sdrhthe Parse() function will have a fourth parameter that can be
16675897234Sdrhof any type chosen by the programmer.  The parser doesn't do anything
16775897234Sdrhwith this argument except to pass it through to action routines.
16875897234SdrhThis is a convenient mechanism for passing state information down
16975897234Sdrhto the action routines without having to use global variables.</p>
17075897234Sdrh
17175897234Sdrh<p>A typical use of a Lemon parser might look something like the
17275897234Sdrhfollowing:
17375897234Sdrh<pre>
17475897234Sdrh   01 ParseTree *ParseFile(const char *zFilename){
17575897234Sdrh   02    Tokenizer *pTokenizer;
17675897234Sdrh   03    void *pParser;
17775897234Sdrh   04    Token sToken;
17875897234Sdrh   05    int hTokenId;
17975897234Sdrh   06    ParserState sState;
18075897234Sdrh   07
18175897234Sdrh   08    pTokenizer = TokenizerCreate(zFilename);
18275897234Sdrh   09    pParser = ParseAlloc( malloc );
18375897234Sdrh   10    InitParserState(&sState);
18475897234Sdrh   11    while( GetNextToken(pTokenizer, &hTokenId, &sToken) ){
18575897234Sdrh   12       Parse(pParser, hTokenId, sToken, &sState);
18675897234Sdrh   13    }
18775897234Sdrh   14    Parse(pParser, 0, sToken, &sState);
18875897234Sdrh   15    ParseFree(pParser, free );
18975897234Sdrh   16    TokenizerFree(pTokenizer);
19075897234Sdrh   17    return sState.treeRoot;
19175897234Sdrh   18 }
19275897234Sdrh</pre>
19375897234SdrhThis example shows a user-written routine that parses a file of
19475897234Sdrhtext and returns a pointer to the parse tree.
1959bccde3dSdrh(All error-handling code is omitted from this example to keep it
19675897234Sdrhsimple.)
19775897234SdrhWe assume the existence of some kind of tokenizer which is created
19875897234Sdrhusing TokenizerCreate() on line 8 and deleted by TokenizerFree()
19975897234Sdrhon line 16.  The GetNextToken() function on line 11 retrieves the
20075897234Sdrhnext token from the input file and puts its type in the
20175897234Sdrhinteger variable hTokenId.  The sToken variable is assumed to be
20275897234Sdrhsome kind of structure that contains details about each token,
20375897234Sdrhsuch as its complete text, what line it occurs on, etc. </p>
20475897234Sdrh
20575897234Sdrh<p>This example also assumes the existence of structure of type
20675897234SdrhParserState that holds state information about a particular parse.
20775897234SdrhAn instance of such a structure is created on line 6 and initialized
20875897234Sdrhon line 10.  A pointer to this structure is passed into the Parse()
20975897234Sdrhroutine as the optional 4th argument.
21075897234SdrhThe action routine specified by the grammar for the parser can use
21175897234Sdrhthe ParserState structure to hold whatever information is useful and
21275897234Sdrhappropriate.  In the example, we note that the treeRoot field of
21375897234Sdrhthe ParserState structure is left pointing to the root of the parse
21475897234Sdrhtree.</p>
21575897234Sdrh
21675897234Sdrh<p>The core of this example as it relates to Lemon is as follows:
21775897234Sdrh<pre>
21875897234Sdrh   ParseFile(){
21975897234Sdrh      pParser = ParseAlloc( malloc );
22075897234Sdrh      while( GetNextToken(pTokenizer,&hTokenId, &sToken) ){
22175897234Sdrh         Parse(pParser, hTokenId, sToken);
22275897234Sdrh      }
22375897234Sdrh      Parse(pParser, 0, sToken);
22475897234Sdrh      ParseFree(pParser, free );
22575897234Sdrh   }
22675897234Sdrh</pre>
22775897234SdrhBasically, what a program has to do to use a Lemon-generated parser
22875897234Sdrhis first create the parser, then send it lots of tokens obtained by
22975897234Sdrhtokenizing an input source.  When the end of input is reached, the
23075897234SdrhParse() routine should be called one last time with a token type
23175897234Sdrhof 0.  This step is necessary to inform the parser that the end of
23275897234Sdrhinput has been reached.  Finally, we reclaim memory used by the
23375897234Sdrhparser by calling ParseFree().</p>
23475897234Sdrh
23575897234Sdrh<p>There is one other interface routine that should be mentioned
23675897234Sdrhbefore we move on.
23775897234SdrhThe ParseTrace() function can be used to generate debugging output
23875897234Sdrhfrom the parser.  A prototype for this routine is as follows:
23975897234Sdrh<pre>
24075897234Sdrh   ParseTrace(FILE *stream, char *zPrefix);
24175897234Sdrh</pre>
24275897234SdrhAfter this routine is called, a short (one-line) message is written
24375897234Sdrhto the designated output stream every time the parser changes states
24475897234Sdrhor calls an action routine.  Each such message is prefaced using
24575897234Sdrhthe text given by zPrefix.  This debugging output can be turned off
24675897234Sdrhby calling ParseTrace() again with a first argument of NULL (0).</p>
24775897234Sdrh
24875897234Sdrh<h3>Differences With YACC and BISON</h3>
24975897234Sdrh
25075897234Sdrh<p>Programmers who have previously used the yacc or bison parser
25175897234Sdrhgenerator will notice several important differences between yacc and/or
25275897234Sdrhbison and Lemon.
25375897234Sdrh<ul>
25475897234Sdrh<li>In yacc and bison, the parser calls the tokenizer.  In Lemon,
25575897234Sdrh    the tokenizer calls the parser.
25675897234Sdrh<li>Lemon uses no global variables.  Yacc and bison use global variables
25775897234Sdrh    to pass information between the tokenizer and parser.
25875897234Sdrh<li>Lemon allows multiple parsers to be running simultaneously.  Yacc
25975897234Sdrh    and bison do not.
26075897234Sdrh</ul>
26175897234SdrhThese differences may cause some initial confusion for programmers
26275897234Sdrhwith prior yacc and bison experience.
26375897234SdrhBut after years of experience using Lemon, I firmly
26475897234Sdrhbelieve that the Lemon way of doing things is better.</p>
26575897234Sdrh
26645f31be8Sdrh<p><i>Updated as of 2016-02-16:</i>
26745f31be8SdrhThe text above was written in the 1990s.
26845f31be8SdrhWe are told that Bison has lately been enhanced to support the
26945f31be8Sdrhtokenizer-calls-parser paradigm used by Lemon, and to obviate the
27045f31be8Sdrhneed for global variables.</p>
27145f31be8Sdrh
27275897234Sdrh<h2>Input File Syntax</h2>
27375897234Sdrh
27475897234Sdrh<p>The main purpose of the grammar specification file for Lemon is
27575897234Sdrhto define the grammar for the parser.  But the input file also
27675897234Sdrhspecifies additional information Lemon requires to do its job.
27775897234SdrhMost of the work in using Lemon is in writing an appropriate
27875897234Sdrhgrammar file.</p>
27975897234Sdrh
28075897234Sdrh<p>The grammar file for lemon is, for the most part, free format.
28175897234SdrhIt does not have sections or divisions like yacc or bison.  Any
28275897234Sdrhdeclaration can occur at any point in the file.
28375897234SdrhLemon ignores whitespace (except where it is needed to separate
28475897234Sdrhtokens) and it honors the same commenting conventions as C and C++.</p>
28575897234Sdrh
28675897234Sdrh<h3>Terminals and Nonterminals</h3>
28775897234Sdrh
28875897234Sdrh<p>A terminal symbol (token) is any string of alphanumeric
2899bccde3dSdrhand/or underscore characters
29075897234Sdrhthat begins with an upper case letter.
291c8eee5e5SdrhA terminal can contain lowercase letters after the first character,
29275897234Sdrhbut the usual convention is to make terminals all upper case.
29375897234SdrhA nonterminal, on the other hand, is any string of alphanumeric
29475897234Sdrhand underscore characters than begins with a lower case letter.
29575897234SdrhAgain, the usual convention is to make nonterminals use all lower
29675897234Sdrhcase letters.</p>
29775897234Sdrh
29875897234Sdrh<p>In Lemon, terminal and nonterminal symbols do not need to
29975897234Sdrhbe declared or identified in a separate section of the grammar file.
30075897234SdrhLemon is able to generate a list of all terminals and nonterminals
30175897234Sdrhby examining the grammar rules, and it can always distinguish a
30275897234Sdrhterminal from a nonterminal by checking the case of the first
30375897234Sdrhcharacter of the name.</p>
30475897234Sdrh
30575897234Sdrh<p>Yacc and bison allow terminal symbols to have either alphanumeric
30675897234Sdrhnames or to be individual characters included in single quotes, like
30775897234Sdrhthis: ')' or '$'.  Lemon does not allow this alternative form for
30875897234Sdrhterminal symbols.  With Lemon, all symbols, terminals and nonterminals,
30975897234Sdrhmust have alphanumeric names.</p>
31075897234Sdrh
31175897234Sdrh<h3>Grammar Rules</h3>
31275897234Sdrh
31375897234Sdrh<p>The main component of a Lemon grammar file is a sequence of grammar
31475897234Sdrhrules.
31575897234SdrhEach grammar rule consists of a nonterminal symbol followed by
3169bccde3dSdrhthe special symbol "::=" and then a list of terminals and/or nonterminals.
31775897234SdrhThe rule is terminated by a period.
31875897234SdrhThe list of terminals and nonterminals on the right-hand side of the
31975897234Sdrhrule can be empty.
32075897234SdrhRules can occur in any order, except that the left-hand side of the
32175897234Sdrhfirst rule is assumed to be the start symbol for the grammar (unless
32275897234Sdrhspecified otherwise using the <tt>%start</tt> directive described below.)
32375897234SdrhA typical sequence of grammar rules might look something like this:
32475897234Sdrh<pre>
32575897234Sdrh  expr ::= expr PLUS expr.
32675897234Sdrh  expr ::= expr TIMES expr.
32775897234Sdrh  expr ::= LPAREN expr RPAREN.
32875897234Sdrh  expr ::= VALUE.
32975897234Sdrh</pre>
33075897234Sdrh</p>
33175897234Sdrh
3329bccde3dSdrh<p>There is one non-terminal in this example, "expr", and five
3339bccde3dSdrhterminal symbols or tokens: "PLUS", "TIMES", "LPAREN",
3349bccde3dSdrh"RPAREN" and "VALUE".</p>
33575897234Sdrh
33675897234Sdrh<p>Like yacc and bison, Lemon allows the grammar to specify a block
33775897234Sdrhof C code that will be executed whenever a grammar rule is reduced
33875897234Sdrhby the parser.
33975897234SdrhIn Lemon, this action is specified by putting the C code (contained
34075897234Sdrhwithin curly braces <tt>{...}</tt>) immediately after the
34175897234Sdrhperiod that closes the rule.
34275897234SdrhFor example:
34375897234Sdrh<pre>
34475897234Sdrh  expr ::= expr PLUS expr.   { printf("Doing an addition...\n"); }
34575897234Sdrh</pre>
34675897234Sdrh</p>
34775897234Sdrh
34875897234Sdrh<p>In order to be useful, grammar actions must normally be linked to
34975897234Sdrhtheir associated grammar rules.
3509bccde3dSdrhIn yacc and bison, this is accomplished by embedding a "$$" in the
35175897234Sdrhaction to stand for the value of the left-hand side of the rule and
3529bccde3dSdrhsymbols "$1", "$2", and so forth to stand for the value of
35375897234Sdrhthe terminal or nonterminal at position 1, 2 and so forth on the
35475897234Sdrhright-hand side of the rule.
35575897234SdrhThis idea is very powerful, but it is also very error-prone.  The
35675897234Sdrhsingle most common source of errors in a yacc or bison grammar is
35775897234Sdrhto miscount the number of symbols on the right-hand side of a grammar
3589bccde3dSdrhrule and say "$7" when you really mean "$8".</p>
35975897234Sdrh
36075897234Sdrh<p>Lemon avoids the need to count grammar symbols by assigning symbolic
36175897234Sdrhnames to each symbol in a grammar rule and then using those symbolic
36275897234Sdrhnames in the action.
36375897234SdrhIn yacc or bison, one would write this:
36475897234Sdrh<pre>
36575897234Sdrh  expr -> expr PLUS expr  { $$ = $1 + $3; };
36675897234Sdrh</pre>
36775897234SdrhBut in Lemon, the same rule becomes the following:
36875897234Sdrh<pre>
36975897234Sdrh  expr(A) ::= expr(B) PLUS expr(C).  { A = B+C; }
37075897234Sdrh</pre>
37175897234SdrhIn the Lemon rule, any symbol in parentheses after a grammar rule
37275897234Sdrhsymbol becomes a place holder for that symbol in the grammar rule.
37375897234SdrhThis place holder can then be used in the associated C action to
37475897234Sdrhstand for the value of that symbol.<p>
37575897234Sdrh
37675897234Sdrh<p>The Lemon notation for linking a grammar rule with its reduce
37775897234Sdrhaction is superior to yacc/bison on several counts.
37875897234SdrhFirst, as mentioned above, the Lemon method avoids the need to
37975897234Sdrhcount grammar symbols.
38075897234SdrhSecondly, if a terminal or nonterminal in a Lemon grammar rule
38175897234Sdrhincludes a linking symbol in parentheses but that linking symbol
38275897234Sdrhis not actually used in the reduce action, then an error message
38375897234Sdrhis generated.
38475897234SdrhFor example, the rule
38575897234Sdrh<pre>
38675897234Sdrh  expr(A) ::= expr(B) PLUS expr(C).  { A = B; }
38775897234Sdrh</pre>
3889bccde3dSdrhwill generate an error because the linking symbol "C" is used
38975897234Sdrhin the grammar rule but not in the reduce action.</p>
39075897234Sdrh
39175897234Sdrh<p>The Lemon notation for linking grammar rules to reduce actions
39275897234Sdrhalso facilitates the use of destructors for reclaiming memory
39375897234Sdrhallocated by the values of terminals and nonterminals on the
39475897234Sdrhright-hand side of a rule.</p>
39575897234Sdrh
3969bccde3dSdrh<a name='precrules'></a>
39775897234Sdrh<h3>Precedence Rules</h3>
39875897234Sdrh
39975897234Sdrh<p>Lemon resolves parsing ambiguities in exactly the same way as
40075897234Sdrhyacc and bison.  A shift-reduce conflict is resolved in favor
40175897234Sdrhof the shift, and a reduce-reduce conflict is resolved by reducing
40275897234Sdrhwhichever rule comes first in the grammar file.</p>
40375897234Sdrh
40475897234Sdrh<p>Just like in
40575897234Sdrhyacc and bison, Lemon allows a measure of control
40675897234Sdrhover the resolution of paring conflicts using precedence rules.
40775897234SdrhA precedence value can be assigned to any terminal symbol
4089bccde3dSdrhusing the
4099bccde3dSdrh<a href='#pleft'>%left</a>,
4109bccde3dSdrh<a href='#pright'>%right</a> or
4119bccde3dSdrh<a href='#pnonassoc'>%nonassoc</a> directives.  Terminal symbols
41275897234Sdrhmentioned in earlier directives have a lower precedence that
41375897234Sdrhterminal symbols mentioned in later directives.  For example:</p>
41475897234Sdrh
41575897234Sdrh<p><pre>
41675897234Sdrh   %left AND.
41775897234Sdrh   %left OR.
41875897234Sdrh   %nonassoc EQ NE GT GE LT LE.
41975897234Sdrh   %left PLUS MINUS.
42075897234Sdrh   %left TIMES DIVIDE MOD.
42175897234Sdrh   %right EXP NOT.
42275897234Sdrh</pre></p>
42375897234Sdrh
42475897234Sdrh<p>In the preceding sequence of directives, the AND operator is
42575897234Sdrhdefined to have the lowest precedence.  The OR operator is one
42675897234Sdrhprecedence level higher.  And so forth.  Hence, the grammar would
42775897234Sdrhattempt to group the ambiguous expression
42875897234Sdrh<pre>
42975897234Sdrh     a AND b OR c
43075897234Sdrh</pre>
43175897234Sdrhlike this
43275897234Sdrh<pre>
43375897234Sdrh     a AND (b OR c).
43475897234Sdrh</pre>
43575897234SdrhThe associativity (left, right or nonassoc) is used to determine
43675897234Sdrhthe grouping when the precedence is the same.  AND is left-associative
43775897234Sdrhin our example, so
43875897234Sdrh<pre>
43975897234Sdrh     a AND b AND c
44075897234Sdrh</pre>
44175897234Sdrhis parsed like this
44275897234Sdrh<pre>
44375897234Sdrh     (a AND b) AND c.
44475897234Sdrh</pre>
44575897234SdrhThe EXP operator is right-associative, though, so
44675897234Sdrh<pre>
44775897234Sdrh     a EXP b EXP c
44875897234Sdrh</pre>
44975897234Sdrhis parsed like this
45075897234Sdrh<pre>
45175897234Sdrh     a EXP (b EXP c).
45275897234Sdrh</pre>
45375897234SdrhThe nonassoc precedence is used for non-associative operators.
45475897234SdrhSo
45575897234Sdrh<pre>
45675897234Sdrh     a EQ b EQ c
45775897234Sdrh</pre>
45875897234Sdrhis an error.</p>
45975897234Sdrh
46075897234Sdrh<p>The precedence of non-terminals is transferred to rules as follows:
46175897234SdrhThe precedence of a grammar rule is equal to the precedence of the
46275897234Sdrhleft-most terminal symbol in the rule for which a precedence is
46375897234Sdrhdefined.  This is normally what you want, but in those cases where
46475897234Sdrhyou want to precedence of a grammar rule to be something different,
46575897234Sdrhyou can specify an alternative precedence symbol by putting the
46675897234Sdrhsymbol in square braces after the period at the end of the rule and
46775897234Sdrhbefore any C-code.  For example:</p>
46875897234Sdrh
46975897234Sdrh<p><pre>
47075897234Sdrh   expr = MINUS expr.  [NOT]
47175897234Sdrh</pre></p>
47275897234Sdrh
47375897234Sdrh<p>This rule has a precedence equal to that of the NOT symbol, not the
47475897234SdrhMINUS symbol as would have been the case by default.</p>
47575897234Sdrh
47675897234Sdrh<p>With the knowledge of how precedence is assigned to terminal
47775897234Sdrhsymbols and individual
47875897234Sdrhgrammar rules, we can now explain precisely how parsing conflicts
47975897234Sdrhare resolved in Lemon.  Shift-reduce conflicts are resolved
48075897234Sdrhas follows:
48175897234Sdrh<ul>
48275897234Sdrh<li> If either the token to be shifted or the rule to be reduced
48375897234Sdrh     lacks precedence information, then resolve in favor of the
48475897234Sdrh     shift, but report a parsing conflict.
48575897234Sdrh<li> If the precedence of the token to be shifted is greater than
48675897234Sdrh     the precedence of the rule to reduce, then resolve in favor
48775897234Sdrh     of the shift.  No parsing conflict is reported.
48875897234Sdrh<li> If the precedence of the token it be shifted is less than the
48975897234Sdrh     precedence of the rule to reduce, then resolve in favor of the
49075897234Sdrh     reduce action.  No parsing conflict is reported.
49175897234Sdrh<li> If the precedences are the same and the shift token is
49275897234Sdrh     right-associative, then resolve in favor of the shift.
49375897234Sdrh     No parsing conflict is reported.
494d5578433Smistachkin<li> If the precedences are the same the shift token is
49575897234Sdrh     left-associative, then resolve in favor of the reduce.
49675897234Sdrh     No parsing conflict is reported.
49775897234Sdrh<li> Otherwise, resolve the conflict by doing the shift and
49875897234Sdrh     report the parsing conflict.
49975897234Sdrh</ul>
50075897234SdrhReduce-reduce conflicts are resolved this way:
50175897234Sdrh<ul>
50275897234Sdrh<li> If either reduce rule
50375897234Sdrh     lacks precedence information, then resolve in favor of the
50475897234Sdrh     rule that appears first in the grammar and report a parsing
50575897234Sdrh     conflict.
50675897234Sdrh<li> If both rules have precedence and the precedence is different
50775897234Sdrh     then resolve the dispute in favor of the rule with the highest
50875897234Sdrh     precedence and do not report a conflict.
50975897234Sdrh<li> Otherwise, resolve the conflict by reducing by the rule that
51075897234Sdrh     appears first in the grammar and report a parsing conflict.
51175897234Sdrh</ul>
51275897234Sdrh
51375897234Sdrh<h3>Special Directives</h3>
51475897234Sdrh
51575897234Sdrh<p>The input grammar to Lemon consists of grammar rules and special
51675897234Sdrhdirectives.  We've described all the grammar rules, so now we'll
51775897234Sdrhtalk about the special directives.</p>
51875897234Sdrh
51975897234Sdrh<p>Directives in lemon can occur in any order.  You can put them before
52075897234Sdrhthe grammar rules, or after the grammar rules, or in the mist of the
52175897234Sdrhgrammar rules.  It doesn't matter.  The relative order of
52275897234Sdrhdirectives used to assign precedence to terminals is important, but
52375897234Sdrhother than that, the order of directives in Lemon is arbitrary.</p>
52475897234Sdrh
52575897234Sdrh<p>Lemon supports the following special directives:
52675897234Sdrh<ul>
527f2340fc7Sdrh<li><tt>%code</tt>
528f2340fc7Sdrh<li><tt>%default_destructor</tt>
529f2340fc7Sdrh<li><tt>%default_type</tt>
53075897234Sdrh<li><tt>%destructor</tt>
5319bccde3dSdrh<li><tt>%endif</tt>
53275897234Sdrh<li><tt>%extra_argument</tt>
5339bccde3dSdrh<li><tt>%fallback</tt>
5349bccde3dSdrh<li><tt>%ifdef</tt>
5359bccde3dSdrh<li><tt>%ifndef</tt>
53675897234Sdrh<li><tt>%include</tt>
53775897234Sdrh<li><tt>%left</tt>
53875897234Sdrh<li><tt>%name</tt>
53975897234Sdrh<li><tt>%nonassoc</tt>
54075897234Sdrh<li><tt>%parse_accept</tt>
54175897234Sdrh<li><tt>%parse_failure </tt>
54275897234Sdrh<li><tt>%right</tt>
54375897234Sdrh<li><tt>%stack_overflow</tt>
54475897234Sdrh<li><tt>%stack_size</tt>
54575897234Sdrh<li><tt>%start_symbol</tt>
54675897234Sdrh<li><tt>%syntax_error</tt>
5479bccde3dSdrh<li><tt>%token_class</tt>
54875897234Sdrh<li><tt>%token_destructor</tt>
54975897234Sdrh<li><tt>%token_prefix</tt>
55075897234Sdrh<li><tt>%token_type</tt>
55175897234Sdrh<li><tt>%type</tt>
5529bccde3dSdrh<li><tt>%wildcard</tt>
55375897234Sdrh</ul>
55475897234SdrhEach of these directives will be described separately in the
55575897234Sdrhfollowing sections:</p>
55675897234Sdrh
5579bccde3dSdrh<a name='pcode'></a>
558f2340fc7Sdrh<h4>The <tt>%code</tt> directive</h4>
559f2340fc7Sdrh
5609bccde3dSdrh<p>The %code directive is used to specify addition C code that
561f2340fc7Sdrhis added to the end of the main output file.  This is similar to
5629bccde3dSdrhthe <a href='#pinclude'>%include</a> directive except that %include
5639bccde3dSdrhis inserted at the beginning of the main output file.</p>
564f2340fc7Sdrh
565f2340fc7Sdrh<p>%code is typically used to include some action routines or perhaps
5669bccde3dSdrha tokenizer or even the "main()" function
5679bccde3dSdrhas part of the output file.</p>
568f2340fc7Sdrh
5699bccde3dSdrh<a name='default_destructor'></a>
570f2340fc7Sdrh<h4>The <tt>%default_destructor</tt> directive</h4>
571f2340fc7Sdrh
572f2340fc7Sdrh<p>The %default_destructor directive specifies a destructor to
573f2340fc7Sdrhuse for non-terminals that do not have their own destructor
574f2340fc7Sdrhspecified by a separate %destructor directive.  See the documentation
5759bccde3dSdrhon the <a name='#destructor'>%destructor</a> directive below for
5769bccde3dSdrhadditional information.</p>
577f2340fc7Sdrh
578f2340fc7Sdrh<p>In some grammers, many different non-terminal symbols have the
579f2340fc7Sdrhsame datatype and hence the same destructor.  This directive is
580f2340fc7Sdrha convenience way to specify the same destructor for all those
581f2340fc7Sdrhnon-terminals using a single statement.</p>
582f2340fc7Sdrh
5839bccde3dSdrh<a name='default_type'></a>
584f2340fc7Sdrh<h4>The <tt>%default_type</tt> directive</h4>
585f2340fc7Sdrh
586f2340fc7Sdrh<p>The %default_type directive specifies the datatype of non-terminal
587f2340fc7Sdrhsymbols that do no have their own datatype defined using a separate
5889bccde3dSdrh<a href='#ptype'>%type</a> directive.
5899bccde3dSdrh</p>
590f2340fc7Sdrh
5919bccde3dSdrh<a name='destructor'></a>
59275897234Sdrh<h4>The <tt>%destructor</tt> directive</h4>
59375897234Sdrh
59475897234Sdrh<p>The %destructor directive is used to specify a destructor for
59575897234Sdrha non-terminal symbol.
5969bccde3dSdrh(See also the <a href='#token_destructor'>%token_destructor</a>
5979bccde3dSdrhdirective which is used to specify a destructor for terminal symbols.)</p>
59875897234Sdrh
59975897234Sdrh<p>A non-terminal's destructor is called to dispose of the
60075897234Sdrhnon-terminal's value whenever the non-terminal is popped from
60175897234Sdrhthe stack.  This includes all of the following circumstances:
60275897234Sdrh<ul>
60375897234Sdrh<li> When a rule reduces and the value of a non-terminal on
60475897234Sdrh     the right-hand side is not linked to C code.
60575897234Sdrh<li> When the stack is popped during error processing.
60675897234Sdrh<li> When the ParseFree() function runs.
60775897234Sdrh</ul>
60875897234SdrhThe destructor can do whatever it wants with the value of
60975897234Sdrhthe non-terminal, but its design is to deallocate memory
61075897234Sdrhor other resources held by that non-terminal.</p>
61175897234Sdrh
61275897234Sdrh<p>Consider an example:
61375897234Sdrh<pre>
61475897234Sdrh   %type nt {void*}
61575897234Sdrh   %destructor nt { free($$); }
61675897234Sdrh   nt(A) ::= ID NUM.   { A = malloc( 100 ); }
61775897234Sdrh</pre>
61875897234SdrhThis example is a bit contrived but it serves to illustrate how
61975897234Sdrhdestructors work.  The example shows a non-terminal named
6209bccde3dSdrh"nt" that holds values of type "void*".  When the rule for
6219bccde3dSdrhan "nt" reduces, it sets the value of the non-terminal to
62275897234Sdrhspace obtained from malloc().  Later, when the nt non-terminal
62375897234Sdrhis popped from the stack, the destructor will fire and call
62475897234Sdrhfree() on this malloced space, thus avoiding a memory leak.
6259bccde3dSdrh(Note that the symbol "$$" in the destructor code is replaced
62675897234Sdrhby the value of the non-terminal.)</p>
62775897234Sdrh
62875897234Sdrh<p>It is important to note that the value of a non-terminal is passed
62975897234Sdrhto the destructor whenever the non-terminal is removed from the
63075897234Sdrhstack, unless the non-terminal is used in a C-code action.  If
63175897234Sdrhthe non-terminal is used by C-code, then it is assumed that the
6329bccde3dSdrhC-code will take care of destroying it.
6339bccde3dSdrhMore commonly, the value is used to build some
63475897234Sdrhlarger structure and we don't want to destroy it, which is why
63575897234Sdrhthe destructor is not called in this circumstance.</p>
63675897234Sdrh
6379bccde3dSdrh<p>Destructors help avoid memory leaks by automatically freeing
6389bccde3dSdrhallocated objects when they go out of scope.
63975897234SdrhTo do the same using yacc or bison is much more difficult.</p>
64075897234Sdrh
64145f31be8Sdrh<a name="extraarg"></a>
64275897234Sdrh<h4>The <tt>%extra_argument</tt> directive</h4>
64375897234Sdrh
64475897234SdrhThe %extra_argument directive instructs Lemon to add a 4th parameter
64575897234Sdrhto the parameter list of the Parse() function it generates.  Lemon
64675897234Sdrhdoesn't do anything itself with this extra argument, but it does
64775897234Sdrhmake the argument available to C-code action routines, destructors,
64875897234Sdrhand so forth.  For example, if the grammar file contains:</p>
64975897234Sdrh
65075897234Sdrh<p><pre>
65175897234Sdrh    %extra_argument { MyStruct *pAbc }
65275897234Sdrh</pre></p>
65375897234Sdrh
65475897234Sdrh<p>Then the Parse() function generated will have an 4th parameter
6559bccde3dSdrhof type "MyStruct*" and all action routines will have access to
6569bccde3dSdrha variable named "pAbc" that is the value of the 4th parameter
65775897234Sdrhin the most recent call to Parse().</p>
65875897234Sdrh
6599bccde3dSdrh<a name='pfallback'></a>
6609bccde3dSdrh<h4>The <tt>%fallback</tt> directive</h4>
6619bccde3dSdrh
6629bccde3dSdrh<p>The %fallback directive specifies an alternative meaning for one
6639bccde3dSdrhor more tokens.  The alternative meaning is tried if the original token
6649bccde3dSdrhwould have generated a syntax error.
6659bccde3dSdrh
6669bccde3dSdrh<p>The %fallback directive was added to support robust parsing of SQL
6679bccde3dSdrhsyntax in <a href="https://www.sqlite.org/">SQLite</a>.
6689bccde3dSdrhThe SQL language contains a large assortment of keywords, each of which
6699bccde3dSdrhappears as a different token to the language parser.  SQL contains so
6709bccde3dSdrhmany keywords, that it can be difficult for programmers to keep up with
6719bccde3dSdrhthem all.  Programmers will, therefore, sometimes mistakenly use an
6729bccde3dSdrhobscure language keyword for an identifier.  The %fallback directive
6739bccde3dSdrhprovides a mechanism to tell the parser:  "If you are unable to parse
6749bccde3dSdrhthis keyword, try treating it as an identifier instead."
6759bccde3dSdrh
6769bccde3dSdrh<p>The syntax of %fallback is as follows:
6779bccde3dSdrh
6789bccde3dSdrh<blockquote>
6799bccde3dSdrh<tt>%fallback</tt>  <i>ID</i> <i>TOKEN...</i> <b>.</b>
6809bccde3dSdrh</blockquote>
6819bccde3dSdrh
6829bccde3dSdrh<p>In words, the %fallback directive is followed by a list of token names
6839bccde3dSdrhterminated by a period.  The first token name is the fallback token - the
6849bccde3dSdrhtoken to which all the other tokens fall back to.  The second and subsequent
6859bccde3dSdrharguments are tokens which fall back to the token identified by the first
6869bccde3dSdrhargument.
6879bccde3dSdrh
6889bccde3dSdrh<a name='pifdef'></a>
6899bccde3dSdrh<h4>The <tt>%ifdef</tt>, <tt>%ifndef</tt>, and <tt>%endif</tt> directives.</h4>
6909bccde3dSdrh
6919bccde3dSdrh<p>The %ifdef, %ifndef, and %endif directives are similar to
6929bccde3dSdrh#ifdef, #ifndef, and #endif in the C-preprocessor, just not as general.
6939bccde3dSdrhEach of these directives must begin at the left margin.  No whitespace
6949bccde3dSdrhis allowed between the "%" and the directive name.
6959bccde3dSdrh
6969bccde3dSdrh<p>Grammar text in between "%ifdef MACRO" and the next nested "%endif" is
6979bccde3dSdrhignored unless the "-DMACRO" command-line option is used.  Grammar text
6989bccde3dSdrhbetwen "%ifndef MACRO" and the next nested "%endif" is included except when
6999bccde3dSdrhthe "-DMACRO" command-line option is used.
7009bccde3dSdrh
7019bccde3dSdrh<p>Note that the argument to %ifdef and %ifndef must be a single
7029bccde3dSdrhpreprocessor symbol name, not a general expression.  There is no "%else"
7039bccde3dSdrhdirective.
7049bccde3dSdrh
7059bccde3dSdrh
7069bccde3dSdrh<a name='pinclude'></a>
70775897234Sdrh<h4>The <tt>%include</tt> directive</h4>
70875897234Sdrh
70975897234Sdrh<p>The %include directive specifies C code that is included at the
71075897234Sdrhtop of the generated parser.  You can include any text you want --
711f2340fc7Sdrhthe Lemon parser generator copies it blindly.  If you have multiple
7129bccde3dSdrh%include directives in your grammar file, their values are concatenated
7139bccde3dSdrhso that all %include code ultimately appears near the top of the
7149bccde3dSdrhgenerated parser, in the same order as it appeared in the grammer.</p>
71575897234Sdrh
71675897234Sdrh<p>The %include directive is very handy for getting some extra #include
71775897234Sdrhpreprocessor statements at the beginning of the generated parser.
71875897234SdrhFor example:</p>
71975897234Sdrh
72075897234Sdrh<p><pre>
72175897234Sdrh   %include {#include &lt;unistd.h&gt;}
72275897234Sdrh</pre></p>
72375897234Sdrh
72475897234Sdrh<p>This might be needed, for example, if some of the C actions in the
72575897234Sdrhgrammar call functions that are prototyed in unistd.h.</p>
72675897234Sdrh
7279bccde3dSdrh<a name='pleft'></a>
72875897234Sdrh<h4>The <tt>%left</tt> directive</h4>
72975897234Sdrh
7309bccde3dSdrhThe %left directive is used (along with the <a href='#pright'>%right</a> and
7319bccde3dSdrh<a href='#pnonassoc'>%nonassoc</a> directives) to declare precedences of
7329bccde3dSdrhterminal symbols.  Every terminal symbol whose name appears after
7339bccde3dSdrha %left directive but before the next period (".") is
73475897234Sdrhgiven the same left-associative precedence value.  Subsequent
73575897234Sdrh%left directives have higher precedence.  For example:</p>
73675897234Sdrh
73775897234Sdrh<p><pre>
73875897234Sdrh   %left AND.
73975897234Sdrh   %left OR.
74075897234Sdrh   %nonassoc EQ NE GT GE LT LE.
74175897234Sdrh   %left PLUS MINUS.
74275897234Sdrh   %left TIMES DIVIDE MOD.
74375897234Sdrh   %right EXP NOT.
74475897234Sdrh</pre></p>
74575897234Sdrh
74675897234Sdrh<p>Note the period that terminates each %left, %right or %nonassoc
74775897234Sdrhdirective.</p>
74875897234Sdrh
74975897234Sdrh<p>LALR(1) grammars can get into a situation where they require
75075897234Sdrha large amount of stack space if you make heavy use or right-associative
75175897234Sdrhoperators.  For this reason, it is recommended that you use %left
75275897234Sdrhrather than %right whenever possible.</p>
75375897234Sdrh
7549bccde3dSdrh<a name='pname'></a>
75575897234Sdrh<h4>The <tt>%name</tt> directive</h4>
75675897234Sdrh
75775897234Sdrh<p>By default, the functions generated by Lemon all begin with the
7589bccde3dSdrhfive-character string "Parse".  You can change this string to something
75975897234Sdrhdifferent using the %name directive.  For instance:</p>
76075897234Sdrh
76175897234Sdrh<p><pre>
76275897234Sdrh   %name Abcde
76375897234Sdrh</pre></p>
76475897234Sdrh
76575897234Sdrh<p>Putting this directive in the grammar file will cause Lemon to generate
76675897234Sdrhfunctions named
76775897234Sdrh<ul>
76875897234Sdrh<li> AbcdeAlloc(),
76975897234Sdrh<li> AbcdeFree(),
77075897234Sdrh<li> AbcdeTrace(), and
77175897234Sdrh<li> Abcde().
77275897234Sdrh</ul>
77375897234SdrhThe %name directive allows you to generator two or more different
77475897234Sdrhparsers and link them all into the same executable.
77575897234Sdrh</p>
77675897234Sdrh
7779bccde3dSdrh<a name='pnonassoc'></a>
77875897234Sdrh<h4>The <tt>%nonassoc</tt> directive</h4>
77975897234Sdrh
78075897234Sdrh<p>This directive is used to assign non-associative precedence to
7819bccde3dSdrhone or more terminal symbols.  See the section on
7829bccde3dSdrh<a href='#precrules'>precedence rules</a>
7839bccde3dSdrhor on the <a href='#pleft'>%left</a> directive for additional information.</p>
78475897234Sdrh
7859bccde3dSdrh<a name='parse_accept'></a>
78675897234Sdrh<h4>The <tt>%parse_accept</tt> directive</h4>
78775897234Sdrh
78875897234Sdrh<p>The %parse_accept directive specifies a block of C code that is
7899bccde3dSdrhexecuted whenever the parser accepts its input string.  To "accept"
79075897234Sdrhan input string means that the parser was able to process all tokens
79175897234Sdrhwithout error.</p>
79275897234Sdrh
79375897234Sdrh<p>For example:</p>
79475897234Sdrh
79575897234Sdrh<p><pre>
79675897234Sdrh   %parse_accept {
79775897234Sdrh      printf("parsing complete!\n");
79875897234Sdrh   }
79975897234Sdrh</pre></p>
80075897234Sdrh
8019bccde3dSdrh<a name='parse_failure'></a>
80275897234Sdrh<h4>The <tt>%parse_failure</tt> directive</h4>
80375897234Sdrh
80475897234Sdrh<p>The %parse_failure directive specifies a block of C code that
80575897234Sdrhis executed whenever the parser fails complete.  This code is not
80675897234Sdrhexecuted until the parser has tried and failed to resolve an input
80775897234Sdrherror using is usual error recovery strategy.  The routine is
80875897234Sdrhonly invoked when parsing is unable to continue.</p>
80975897234Sdrh
81075897234Sdrh<p><pre>
81175897234Sdrh   %parse_failure {
81275897234Sdrh     fprintf(stderr,"Giving up.  Parser is hopelessly lost...\n");
81375897234Sdrh   }
81475897234Sdrh</pre></p>
81575897234Sdrh
8169bccde3dSdrh<a name='pright'></a>
81775897234Sdrh<h4>The <tt>%right</tt> directive</h4>
81875897234Sdrh
81975897234Sdrh<p>This directive is used to assign right-associative precedence to
8209bccde3dSdrhone or more terminal symbols.  See the section on
8219bccde3dSdrh<a href='#precrules'>precedence rules</a>
8229bccde3dSdrhor on the <a href='#pleft'>%left</a> directive for additional information.</p>
82375897234Sdrh
8249bccde3dSdrh<a name='stack_overflow'></a>
82575897234Sdrh<h4>The <tt>%stack_overflow</tt> directive</h4>
82675897234Sdrh
82775897234Sdrh<p>The %stack_overflow directive specifies a block of C code that
82875897234Sdrhis executed if the parser's internal stack ever overflows.  Typically
82975897234Sdrhthis just prints an error message.  After a stack overflow, the parser
83075897234Sdrhwill be unable to continue and must be reset.</p>
83175897234Sdrh
83275897234Sdrh<p><pre>
83375897234Sdrh   %stack_overflow {
83475897234Sdrh     fprintf(stderr,"Giving up.  Parser stack overflow\n");
83575897234Sdrh   }
83675897234Sdrh</pre></p>
83775897234Sdrh
83875897234Sdrh<p>You can help prevent parser stack overflows by avoiding the use
83975897234Sdrhof right recursion and right-precedence operators in your grammar.
84075897234SdrhUse left recursion and and left-precedence operators instead, to
84175897234Sdrhencourage rules to reduce sooner and keep the stack size down.
84275897234SdrhFor example, do rules like this:
84375897234Sdrh<pre>
84475897234Sdrh   list ::= list element.      // left-recursion.  Good!
84575897234Sdrh   list ::= .
84675897234Sdrh</pre>
84775897234SdrhNot like this:
84875897234Sdrh<pre>
84975897234Sdrh   list ::= element list.      // right-recursion.  Bad!
85075897234Sdrh   list ::= .
85175897234Sdrh</pre>
85275897234Sdrh
8539bccde3dSdrh<a name='stack_size'></a>
85475897234Sdrh<h4>The <tt>%stack_size</tt> directive</h4>
85575897234Sdrh
85675897234Sdrh<p>If stack overflow is a problem and you can't resolve the trouble
85775897234Sdrhby using left-recursion, then you might want to increase the size
85875897234Sdrhof the parser's stack using this directive.  Put an positive integer
85975897234Sdrhafter the %stack_size directive and Lemon will generate a parse
86075897234Sdrhwith a stack of the requested size.  The default value is 100.</p>
86175897234Sdrh
86275897234Sdrh<p><pre>
86375897234Sdrh   %stack_size 2000
86475897234Sdrh</pre></p>
86575897234Sdrh
8669bccde3dSdrh<a name='start_symbol'></a>
86775897234Sdrh<h4>The <tt>%start_symbol</tt> directive</h4>
86875897234Sdrh
86975897234Sdrh<p>By default, the start-symbol for the grammar that Lemon generates
87075897234Sdrhis the first non-terminal that appears in the grammar file.  But you
87175897234Sdrhcan choose a different start-symbol using the %start_symbol directive.</p>
87275897234Sdrh
87375897234Sdrh<p><pre>
87475897234Sdrh   %start_symbol  prog
87575897234Sdrh</pre></p>
87675897234Sdrh
8779bccde3dSdrh<a name='token_destructor'></a>
87875897234Sdrh<h4>The <tt>%token_destructor</tt> directive</h4>
87975897234Sdrh
88075897234Sdrh<p>The %destructor directive assigns a destructor to a non-terminal
88175897234Sdrhsymbol.  (See the description of the %destructor directive above.)
88275897234SdrhThis directive does the same thing for all terminal symbols.</p>
88375897234Sdrh
88475897234Sdrh<p>Unlike non-terminal symbols which may each have a different data type
88575897234Sdrhfor their values, terminals all use the same data type (defined by
88675897234Sdrhthe %token_type directive) and so they use a common destructor.  Other
88775897234Sdrhthan that, the token destructor works just like the non-terminal
88875897234Sdrhdestructors.</p>
88975897234Sdrh
8909bccde3dSdrh<a name='token_prefix'></a>
89175897234Sdrh<h4>The <tt>%token_prefix</tt> directive</h4>
89275897234Sdrh
89375897234Sdrh<p>Lemon generates #defines that assign small integer constants
89475897234Sdrhto each terminal symbol in the grammar.  If desired, Lemon will
89575897234Sdrhadd a prefix specified by this directive
89675897234Sdrhto each of the #defines it generates.
89775897234SdrhSo if the default output of Lemon looked like this:
89875897234Sdrh<pre>
89975897234Sdrh    #define AND              1
90075897234Sdrh    #define MINUS            2
90175897234Sdrh    #define OR               3
90275897234Sdrh    #define PLUS             4
90375897234Sdrh</pre>
90475897234SdrhYou can insert a statement into the grammar like this:
90575897234Sdrh<pre>
90675897234Sdrh    %token_prefix    TOKEN_
90775897234Sdrh</pre>
90875897234Sdrhto cause Lemon to produce these symbols instead:
90975897234Sdrh<pre>
91075897234Sdrh    #define TOKEN_AND        1
91175897234Sdrh    #define TOKEN_MINUS      2
91275897234Sdrh    #define TOKEN_OR         3
91375897234Sdrh    #define TOKEN_PLUS       4
91475897234Sdrh</pre>
91575897234Sdrh
9169bccde3dSdrh<a name='token_type'></a><a name='ptype'></a>
91775897234Sdrh<h4>The <tt>%token_type</tt> and <tt>%type</tt> directives</h4>
91875897234Sdrh
91975897234Sdrh<p>These directives are used to specify the data types for values
92075897234Sdrhon the parser's stack associated with terminal and non-terminal
92175897234Sdrhsymbols.  The values of all terminal symbols must be of the same
92275897234Sdrhtype.  This turns out to be the same data type as the 3rd parameter
92375897234Sdrhto the Parse() function generated by Lemon.  Typically, you will
92475897234Sdrhmake the value of a terminal symbol by a pointer to some kind of
92575897234Sdrhtoken structure.  Like this:</p>
92675897234Sdrh
92775897234Sdrh<p><pre>
92875897234Sdrh   %token_type    {Token*}
92975897234Sdrh</pre></p>
93075897234Sdrh
93175897234Sdrh<p>If the data type of terminals is not specified, the default value
932*dfe4e6bbSdrhis "void*".</p>
93375897234Sdrh
93475897234Sdrh<p>Non-terminal symbols can each have their own data types.  Typically
93575897234Sdrhthe data type  of a non-terminal is a pointer to the root of a parse-tree
93675897234Sdrhstructure that contains all information about that non-terminal.
93775897234SdrhFor example:</p>
93875897234Sdrh
93975897234Sdrh<p><pre>
94075897234Sdrh   %type   expr  {Expr*}
94175897234Sdrh</pre></p>
94275897234Sdrh
94375897234Sdrh<p>Each entry on the parser's stack is actually a union containing
94475897234Sdrhinstances of all data types for every non-terminal and terminal symbol.
94575897234SdrhLemon will automatically use the correct element of this union depending
94675897234Sdrhon what the corresponding non-terminal or terminal symbol is.  But
94775897234Sdrhthe grammar designer should keep in mind that the size of the union
94875897234Sdrhwill be the size of its largest element.  So if you have a single
94975897234Sdrhnon-terminal whose data type requires 1K of storage, then your 100
95075897234Sdrhentry parser stack will require 100K of heap space.  If you are willing
95175897234Sdrhand able to pay that price, fine.  You just need to know.</p>
95275897234Sdrh
9539bccde3dSdrh<a name='pwildcard'></a>
9549bccde3dSdrh<h4>The <tt>%wildcard</tt> directive</h4>
9559bccde3dSdrh
9569bccde3dSdrh<p>The %wildcard directive is followed by a single token name and a
9579bccde3dSdrhperiod.  This directive specifies that the identified token should
9589bccde3dSdrhmatch any input token.
9599bccde3dSdrh
9609bccde3dSdrh<p>When the generated parser has the choice of matching an input against
9619bccde3dSdrhthe wildcard token and some other token, the other token is always used.
9629bccde3dSdrhThe wildcard token is only matched if there are no other alternatives.
9639bccde3dSdrh
96475897234Sdrh<h3>Error Processing</h3>
96575897234Sdrh
96675897234Sdrh<p>After extensive experimentation over several years, it has been
96775897234Sdrhdiscovered that the error recovery strategy used by yacc is about
96875897234Sdrhas good as it gets.  And so that is what Lemon uses.</p>
96975897234Sdrh
97075897234Sdrh<p>When a Lemon-generated parser encounters a syntax error, it
97175897234Sdrhfirst invokes the code specified by the %syntax_error directive, if
97275897234Sdrhany.  It then enters its error recovery strategy.  The error recovery
97375897234Sdrhstrategy is to begin popping the parsers stack until it enters a
97475897234Sdrhstate where it is permitted to shift a special non-terminal symbol
9759bccde3dSdrhnamed "error".  It then shifts this non-terminal and continues
97675897234Sdrhparsing.  But the %syntax_error routine will not be called again
97775897234Sdrhuntil at least three new tokens have been successfully shifted.</p>
97875897234Sdrh
97975897234Sdrh<p>If the parser pops its stack until the stack is empty, and it still
98075897234Sdrhis unable to shift the error symbol, then the %parse_failed routine
98175897234Sdrhis invoked and the parser resets itself to its start state, ready
98275897234Sdrhto begin parsing a new file.  This is what will happen at the very
98375897234Sdrhfirst syntax error, of course, if there are no instances of the
9849bccde3dSdrh"error" non-terminal in your grammar.</p>
98575897234Sdrh
98675897234Sdrh</body>
98775897234Sdrh</html>
988