xref: /sqlite-3.40.0/doc/lemon.html (revision c8eee5e5)
175897234Sdrh<html>
275897234Sdrh<head>
375897234Sdrh<title>The Lemon Parser Generator</title>
475897234Sdrh</head>
575897234Sdrh<body bgcolor=white>
675897234Sdrh<h1 align=center>The Lemon Parser Generator</h1>
775897234Sdrh
875897234Sdrh<p>Lemon is an LALR(1) parser generator for C or C++.
975897234SdrhIt does the same job as ``bison'' and ``yacc''.
1075897234SdrhBut lemon is not another bison or yacc clone.  It
1175897234Sdrhuses a different grammar syntax which is designed to
1275897234Sdrhreduce the number of coding errors.  Lemon also uses a more
1375897234Sdrhsophisticated parsing engine that is faster than yacc and
1475897234Sdrhbison and which is both reentrant and thread-safe.
1575897234SdrhFurthermore, Lemon implements features that can be used
1675897234Sdrhto eliminate resource leaks, making is suitable for use
1775897234Sdrhin long-running programs such as graphical user interfaces
1875897234Sdrhor embedded controllers.</p>
1975897234Sdrh
2075897234Sdrh<p>This document is an introduction to the Lemon
2175897234Sdrhparser generator.</p>
2275897234Sdrh
2375897234Sdrh<h2>Theory of Operation</h2>
2475897234Sdrh
2575897234Sdrh<p>The main goal of Lemon is to translate a context free grammar (CFG)
2675897234Sdrhfor a particular language into C code that implements a parser for
2775897234Sdrhthat language.
2875897234SdrhThe program has two inputs:
2975897234Sdrh<ul>
3075897234Sdrh<li>The grammar specification.
3175897234Sdrh<li>A parser template file.
3275897234Sdrh</ul>
3375897234SdrhTypically, only the grammar specification is supplied by the programmer.
3475897234SdrhLemon comes with a default parser template which works fine for most
3575897234Sdrhapplications.  But the user is free to substitute a different parser
3675897234Sdrhtemplate if desired.</p>
3775897234Sdrh
3875897234Sdrh<p>Depending on command-line options, Lemon will generate between
3975897234Sdrhone and three files of outputs.
4075897234Sdrh<ul>
4175897234Sdrh<li>C code to implement the parser.
4275897234Sdrh<li>A header file defining an integer ID for each terminal symbol.
4375897234Sdrh<li>An information file that describes the states of the generated parser
4475897234Sdrh    automaton.
4575897234Sdrh</ul>
4675897234SdrhBy default, all three of these output files are generated.
4775897234SdrhThe header file is suppressed if the ``-m'' command-line option is
4875897234Sdrhused and the report file is omitted when ``-q'' is selected.</p>
4975897234Sdrh
5075897234Sdrh<p>The grammar specification file uses a ``.y'' suffix, by convention.
5175897234SdrhIn the examples used in this document, we'll assume the name of the
5275897234Sdrhgrammar file is ``gram.y''.  A typical use of Lemon would be the
5375897234Sdrhfollowing command:
5475897234Sdrh<pre>
5575897234Sdrh   lemon gram.y
5675897234Sdrh</pre>
5775897234SdrhThis command will generate three output files named ``gram.c'',
5875897234Sdrh``gram.h'' and ``gram.out''.
5975897234SdrhThe first is C code to implement the parser.  The second
6075897234Sdrhis the header file that defines numerical values for all
6175897234Sdrhterminal symbols, and the last is the report that explains
6275897234Sdrhthe states used by the parser automaton.</p>
6375897234Sdrh
6475897234Sdrh<h3>Command Line Options</h3>
6575897234Sdrh
6675897234Sdrh<p>The behavior of Lemon can be modified using command-line options.
6775897234SdrhYou can obtain a list of the available command-line options together
6875897234Sdrhwith a brief explanation of what each does by typing
6975897234Sdrh<pre>
7075897234Sdrh   lemon -?
7175897234Sdrh</pre>
7275897234SdrhAs of this writing, the following command-line options are supported:
7375897234Sdrh<ul>
7475897234Sdrh<li><tt>-b</tt>
7575897234Sdrh<li><tt>-c</tt>
7675897234Sdrh<li><tt>-g</tt>
7775897234Sdrh<li><tt>-m</tt>
7875897234Sdrh<li><tt>-q</tt>
7975897234Sdrh<li><tt>-s</tt>
8075897234Sdrh<li><tt>-x</tt>
8175897234Sdrh</ul>
8275897234SdrhThe ``-b'' option reduces the amount of text in the report file by
8375897234Sdrhprinting only the basis of each parser state, rather than the full
8475897234Sdrhconfiguration.
8575897234SdrhThe ``-c'' option suppresses action table compression.  Using -c
8675897234Sdrhwill make the parser a little larger and slower but it will detect
8775897234Sdrhsyntax errors sooner.
8875897234SdrhThe ``-g'' option causes no output files to be generated at all.
8975897234SdrhInstead, the input grammar file is printed on standard output but
9075897234Sdrhwith all comments, actions and other extraneous text deleted.  This
9175897234Sdrhis a useful way to get a quick summary of a grammar.
9275897234SdrhThe ``-m'' option causes the output C source file to be compatible
9375897234Sdrhwith the ``makeheaders'' program.
9475897234SdrhMakeheaders is a program that automatically generates header files
9575897234Sdrhfrom C source code.  When the ``-m'' option is used, the header
9675897234Sdrhfile is not output since the makeheaders program will take care
9775897234Sdrhof generated all header files automatically.
9875897234SdrhThe ``-q'' option suppresses the report file.
9975897234SdrhUsing ``-s'' causes a brief summary of parser statistics to be
10075897234Sdrhprinted.  Like this:
10175897234Sdrh<pre>
10275897234Sdrh   Parser statistics: 74 terminals, 70 nonterminals, 179 rules
10375897234Sdrh                      340 states, 2026 parser table entries, 0 conflicts
10475897234Sdrh</pre>
10575897234SdrhFinally, the ``-x'' option causes Lemon to print its version number
106b19a2bc6Sdrhand then stops without attempting to read the grammar or generate a parser.</p>
10775897234Sdrh
10875897234Sdrh<h3>The Parser Interface</h3>
10975897234Sdrh
11075897234Sdrh<p>Lemon doesn't generate a complete, working program.  It only generates
11175897234Sdrha few subroutines that implement a parser.  This section describes
11275897234Sdrhthe interface to those subroutines.  It is up to the programmer to
11375897234Sdrhcall these subroutines in an appropriate way in order to produce a
11475897234Sdrhcomplete system.</p>
11575897234Sdrh
11675897234Sdrh<p>Before a program begins using a Lemon-generated parser, the program
11775897234Sdrhmust first create the parser.
11875897234SdrhA new parser is created as follows:
11975897234Sdrh<pre>
12075897234Sdrh   void *pParser = ParseAlloc( malloc );
12175897234Sdrh</pre>
12275897234SdrhThe ParseAlloc() routine allocates and initializes a new parser and
12375897234Sdrhreturns a pointer to it.
12475897234SdrhThe actual data structure used to represent a parser is opaque --
12575897234Sdrhits internal structure is not visible or usable by the calling routine.
12675897234SdrhFor this reason, the ParseAlloc() routine returns a pointer to void
12775897234Sdrhrather than a pointer to some particular structure.
12875897234SdrhThe sole argument to the ParseAlloc() routine is a pointer to the
12975897234Sdrhsubroutine used to allocate memory.  Typically this means ``malloc()''.</p>
13075897234Sdrh
13175897234Sdrh<p>After a program is finished using a parser, it can reclaim all
13275897234Sdrhmemory allocated by that parser by calling
13375897234Sdrh<pre>
13475897234Sdrh   ParseFree(pParser, free);
13575897234Sdrh</pre>
13675897234SdrhThe first argument is the same pointer returned by ParseAlloc().  The
13775897234Sdrhsecond argument is a pointer to the function used to release bulk
13875897234Sdrhmemory back to the system.</p>
13975897234Sdrh
14075897234Sdrh<p>After a parser has been allocated using ParseAlloc(), the programmer
14175897234Sdrhmust supply the parser with a sequence of tokens (terminal symbols) to
14275897234Sdrhbe parsed.  This is accomplished by calling the following function
14375897234Sdrhonce for each token:
14475897234Sdrh<pre>
14575897234Sdrh   Parse(pParser, hTokenID, sTokenData, pArg);
14675897234Sdrh</pre>
14775897234SdrhThe first argument to the Parse() routine is the pointer returned by
14875897234SdrhParseAlloc().
14975897234SdrhThe second argument is a small positive integer that tells the parse the
15075897234Sdrhtype of the next token in the data stream.
15175897234SdrhThere is one token type for each terminal symbol in the grammar.
15275897234SdrhThe gram.h file generated by Lemon contains #define statements that
15375897234Sdrhmap symbolic terminal symbol names into appropriate integer values.
15475897234Sdrh(A value of 0 for the second argument is a special flag to the
15575897234Sdrhparser to indicate that the end of input has been reached.)
15675897234SdrhThe third argument is the value of the given token.  By default,
15775897234Sdrhthe type of the third argument is integer, but the grammar will
15875897234Sdrhusually redefine this type to be some kind of structure.
15975897234SdrhTypically the second argument will be a broad category of tokens
16075897234Sdrhsuch as ``identifier'' or ``number'' and the third argument will
16175897234Sdrhbe the name of the identifier or the value of the number.</p>
16275897234Sdrh
16375897234Sdrh<p>The Parse() function may have either three or four arguments,
16475897234Sdrhdepending on the grammar.  If the grammar specification file request
16575897234Sdrhit, the Parse() function will have a fourth parameter that can be
16675897234Sdrhof any type chosen by the programmer.  The parser doesn't do anything
16775897234Sdrhwith this argument except to pass it through to action routines.
16875897234SdrhThis is a convenient mechanism for passing state information down
16975897234Sdrhto the action routines without having to use global variables.</p>
17075897234Sdrh
17175897234Sdrh<p>A typical use of a Lemon parser might look something like the
17275897234Sdrhfollowing:
17375897234Sdrh<pre>
17475897234Sdrh   01 ParseTree *ParseFile(const char *zFilename){
17575897234Sdrh   02    Tokenizer *pTokenizer;
17675897234Sdrh   03    void *pParser;
17775897234Sdrh   04    Token sToken;
17875897234Sdrh   05    int hTokenId;
17975897234Sdrh   06    ParserState sState;
18075897234Sdrh   07
18175897234Sdrh   08    pTokenizer = TokenizerCreate(zFilename);
18275897234Sdrh   09    pParser = ParseAlloc( malloc );
18375897234Sdrh   10    InitParserState(&sState);
18475897234Sdrh   11    while( GetNextToken(pTokenizer, &hTokenId, &sToken) ){
18575897234Sdrh   12       Parse(pParser, hTokenId, sToken, &sState);
18675897234Sdrh   13    }
18775897234Sdrh   14    Parse(pParser, 0, sToken, &sState);
18875897234Sdrh   15    ParseFree(pParser, free );
18975897234Sdrh   16    TokenizerFree(pTokenizer);
19075897234Sdrh   17    return sState.treeRoot;
19175897234Sdrh   18 }
19275897234Sdrh</pre>
19375897234SdrhThis example shows a user-written routine that parses a file of
19475897234Sdrhtext and returns a pointer to the parse tree.
19575897234Sdrh(We've omitted all error-handling from this example to keep it
19675897234Sdrhsimple.)
19775897234SdrhWe assume the existence of some kind of tokenizer which is created
19875897234Sdrhusing TokenizerCreate() on line 8 and deleted by TokenizerFree()
19975897234Sdrhon line 16.  The GetNextToken() function on line 11 retrieves the
20075897234Sdrhnext token from the input file and puts its type in the
20175897234Sdrhinteger variable hTokenId.  The sToken variable is assumed to be
20275897234Sdrhsome kind of structure that contains details about each token,
20375897234Sdrhsuch as its complete text, what line it occurs on, etc. </p>
20475897234Sdrh
20575897234Sdrh<p>This example also assumes the existence of structure of type
20675897234SdrhParserState that holds state information about a particular parse.
20775897234SdrhAn instance of such a structure is created on line 6 and initialized
20875897234Sdrhon line 10.  A pointer to this structure is passed into the Parse()
20975897234Sdrhroutine as the optional 4th argument.
21075897234SdrhThe action routine specified by the grammar for the parser can use
21175897234Sdrhthe ParserState structure to hold whatever information is useful and
21275897234Sdrhappropriate.  In the example, we note that the treeRoot field of
21375897234Sdrhthe ParserState structure is left pointing to the root of the parse
21475897234Sdrhtree.</p>
21575897234Sdrh
21675897234Sdrh<p>The core of this example as it relates to Lemon is as follows:
21775897234Sdrh<pre>
21875897234Sdrh   ParseFile(){
21975897234Sdrh      pParser = ParseAlloc( malloc );
22075897234Sdrh      while( GetNextToken(pTokenizer,&hTokenId, &sToken) ){
22175897234Sdrh         Parse(pParser, hTokenId, sToken);
22275897234Sdrh      }
22375897234Sdrh      Parse(pParser, 0, sToken);
22475897234Sdrh      ParseFree(pParser, free );
22575897234Sdrh   }
22675897234Sdrh</pre>
22775897234SdrhBasically, what a program has to do to use a Lemon-generated parser
22875897234Sdrhis first create the parser, then send it lots of tokens obtained by
22975897234Sdrhtokenizing an input source.  When the end of input is reached, the
23075897234SdrhParse() routine should be called one last time with a token type
23175897234Sdrhof 0.  This step is necessary to inform the parser that the end of
23275897234Sdrhinput has been reached.  Finally, we reclaim memory used by the
23375897234Sdrhparser by calling ParseFree().</p>
23475897234Sdrh
23575897234Sdrh<p>There is one other interface routine that should be mentioned
23675897234Sdrhbefore we move on.
23775897234SdrhThe ParseTrace() function can be used to generate debugging output
23875897234Sdrhfrom the parser.  A prototype for this routine is as follows:
23975897234Sdrh<pre>
24075897234Sdrh   ParseTrace(FILE *stream, char *zPrefix);
24175897234Sdrh</pre>
24275897234SdrhAfter this routine is called, a short (one-line) message is written
24375897234Sdrhto the designated output stream every time the parser changes states
24475897234Sdrhor calls an action routine.  Each such message is prefaced using
24575897234Sdrhthe text given by zPrefix.  This debugging output can be turned off
24675897234Sdrhby calling ParseTrace() again with a first argument of NULL (0).</p>
24775897234Sdrh
24875897234Sdrh<h3>Differences With YACC and BISON</h3>
24975897234Sdrh
25075897234Sdrh<p>Programmers who have previously used the yacc or bison parser
25175897234Sdrhgenerator will notice several important differences between yacc and/or
25275897234Sdrhbison and Lemon.
25375897234Sdrh<ul>
25475897234Sdrh<li>In yacc and bison, the parser calls the tokenizer.  In Lemon,
25575897234Sdrh    the tokenizer calls the parser.
25675897234Sdrh<li>Lemon uses no global variables.  Yacc and bison use global variables
25775897234Sdrh    to pass information between the tokenizer and parser.
25875897234Sdrh<li>Lemon allows multiple parsers to be running simultaneously.  Yacc
25975897234Sdrh    and bison do not.
26075897234Sdrh</ul>
26175897234SdrhThese differences may cause some initial confusion for programmers
26275897234Sdrhwith prior yacc and bison experience.
26375897234SdrhBut after years of experience using Lemon, I firmly
26475897234Sdrhbelieve that the Lemon way of doing things is better.</p>
26575897234Sdrh
26675897234Sdrh<h2>Input File Syntax</h2>
26775897234Sdrh
26875897234Sdrh<p>The main purpose of the grammar specification file for Lemon is
26975897234Sdrhto define the grammar for the parser.  But the input file also
27075897234Sdrhspecifies additional information Lemon requires to do its job.
27175897234SdrhMost of the work in using Lemon is in writing an appropriate
27275897234Sdrhgrammar file.</p>
27375897234Sdrh
27475897234Sdrh<p>The grammar file for lemon is, for the most part, free format.
27575897234SdrhIt does not have sections or divisions like yacc or bison.  Any
27675897234Sdrhdeclaration can occur at any point in the file.
27775897234SdrhLemon ignores whitespace (except where it is needed to separate
27875897234Sdrhtokens) and it honors the same commenting conventions as C and C++.</p>
27975897234Sdrh
28075897234Sdrh<h3>Terminals and Nonterminals</h3>
28175897234Sdrh
28275897234Sdrh<p>A terminal symbol (token) is any string of alphanumeric
28375897234Sdrhand underscore characters
28475897234Sdrhthat begins with an upper case letter.
285*c8eee5e5SdrhA terminal can contain lowercase letters after the first character,
28675897234Sdrhbut the usual convention is to make terminals all upper case.
28775897234SdrhA nonterminal, on the other hand, is any string of alphanumeric
28875897234Sdrhand underscore characters than begins with a lower case letter.
28975897234SdrhAgain, the usual convention is to make nonterminals use all lower
29075897234Sdrhcase letters.</p>
29175897234Sdrh
29275897234Sdrh<p>In Lemon, terminal and nonterminal symbols do not need to
29375897234Sdrhbe declared or identified in a separate section of the grammar file.
29475897234SdrhLemon is able to generate a list of all terminals and nonterminals
29575897234Sdrhby examining the grammar rules, and it can always distinguish a
29675897234Sdrhterminal from a nonterminal by checking the case of the first
29775897234Sdrhcharacter of the name.</p>
29875897234Sdrh
29975897234Sdrh<p>Yacc and bison allow terminal symbols to have either alphanumeric
30075897234Sdrhnames or to be individual characters included in single quotes, like
30175897234Sdrhthis: ')' or '$'.  Lemon does not allow this alternative form for
30275897234Sdrhterminal symbols.  With Lemon, all symbols, terminals and nonterminals,
30375897234Sdrhmust have alphanumeric names.</p>
30475897234Sdrh
30575897234Sdrh<h3>Grammar Rules</h3>
30675897234Sdrh
30775897234Sdrh<p>The main component of a Lemon grammar file is a sequence of grammar
30875897234Sdrhrules.
30975897234SdrhEach grammar rule consists of a nonterminal symbol followed by
31075897234Sdrhthe special symbol ``::='' and then a list of terminals and/or nonterminals.
31175897234SdrhThe rule is terminated by a period.
31275897234SdrhThe list of terminals and nonterminals on the right-hand side of the
31375897234Sdrhrule can be empty.
31475897234SdrhRules can occur in any order, except that the left-hand side of the
31575897234Sdrhfirst rule is assumed to be the start symbol for the grammar (unless
31675897234Sdrhspecified otherwise using the <tt>%start</tt> directive described below.)
31775897234SdrhA typical sequence of grammar rules might look something like this:
31875897234Sdrh<pre>
31975897234Sdrh  expr ::= expr PLUS expr.
32075897234Sdrh  expr ::= expr TIMES expr.
32175897234Sdrh  expr ::= LPAREN expr RPAREN.
32275897234Sdrh  expr ::= VALUE.
32375897234Sdrh</pre>
32475897234Sdrh</p>
32575897234Sdrh
32675897234Sdrh<p>There is one non-terminal in this example, ``expr'', and five
32775897234Sdrhterminal symbols or tokens: ``PLUS'', ``TIMES'', ``LPAREN'',
32875897234Sdrh``RPAREN'' and ``VALUE''.</p>
32975897234Sdrh
33075897234Sdrh<p>Like yacc and bison, Lemon allows the grammar to specify a block
33175897234Sdrhof C code that will be executed whenever a grammar rule is reduced
33275897234Sdrhby the parser.
33375897234SdrhIn Lemon, this action is specified by putting the C code (contained
33475897234Sdrhwithin curly braces <tt>{...}</tt>) immediately after the
33575897234Sdrhperiod that closes the rule.
33675897234SdrhFor example:
33775897234Sdrh<pre>
33875897234Sdrh  expr ::= expr PLUS expr.   { printf("Doing an addition...\n"); }
33975897234Sdrh</pre>
34075897234Sdrh</p>
34175897234Sdrh
34275897234Sdrh<p>In order to be useful, grammar actions must normally be linked to
34375897234Sdrhtheir associated grammar rules.
34475897234SdrhIn yacc and bison, this is accomplished by embedding a ``$$'' in the
34575897234Sdrhaction to stand for the value of the left-hand side of the rule and
34675897234Sdrhsymbols ``$1'', ``$2'', and so forth to stand for the value of
34775897234Sdrhthe terminal or nonterminal at position 1, 2 and so forth on the
34875897234Sdrhright-hand side of the rule.
34975897234SdrhThis idea is very powerful, but it is also very error-prone.  The
35075897234Sdrhsingle most common source of errors in a yacc or bison grammar is
35175897234Sdrhto miscount the number of symbols on the right-hand side of a grammar
35275897234Sdrhrule and say ``$7'' when you really mean ``$8''.</p>
35375897234Sdrh
35475897234Sdrh<p>Lemon avoids the need to count grammar symbols by assigning symbolic
35575897234Sdrhnames to each symbol in a grammar rule and then using those symbolic
35675897234Sdrhnames in the action.
35775897234SdrhIn yacc or bison, one would write this:
35875897234Sdrh<pre>
35975897234Sdrh  expr -> expr PLUS expr  { $$ = $1 + $3; };
36075897234Sdrh</pre>
36175897234SdrhBut in Lemon, the same rule becomes the following:
36275897234Sdrh<pre>
36375897234Sdrh  expr(A) ::= expr(B) PLUS expr(C).  { A = B+C; }
36475897234Sdrh</pre>
36575897234SdrhIn the Lemon rule, any symbol in parentheses after a grammar rule
36675897234Sdrhsymbol becomes a place holder for that symbol in the grammar rule.
36775897234SdrhThis place holder can then be used in the associated C action to
36875897234Sdrhstand for the value of that symbol.<p>
36975897234Sdrh
37075897234Sdrh<p>The Lemon notation for linking a grammar rule with its reduce
37175897234Sdrhaction is superior to yacc/bison on several counts.
37275897234SdrhFirst, as mentioned above, the Lemon method avoids the need to
37375897234Sdrhcount grammar symbols.
37475897234SdrhSecondly, if a terminal or nonterminal in a Lemon grammar rule
37575897234Sdrhincludes a linking symbol in parentheses but that linking symbol
37675897234Sdrhis not actually used in the reduce action, then an error message
37775897234Sdrhis generated.
37875897234SdrhFor example, the rule
37975897234Sdrh<pre>
38075897234Sdrh  expr(A) ::= expr(B) PLUS expr(C).  { A = B; }
38175897234Sdrh</pre>
38275897234Sdrhwill generate an error because the linking symbol ``C'' is used
38375897234Sdrhin the grammar rule but not in the reduce action.</p>
38475897234Sdrh
38575897234Sdrh<p>The Lemon notation for linking grammar rules to reduce actions
38675897234Sdrhalso facilitates the use of destructors for reclaiming memory
38775897234Sdrhallocated by the values of terminals and nonterminals on the
38875897234Sdrhright-hand side of a rule.</p>
38975897234Sdrh
39075897234Sdrh<h3>Precedence Rules</h3>
39175897234Sdrh
39275897234Sdrh<p>Lemon resolves parsing ambiguities in exactly the same way as
39375897234Sdrhyacc and bison.  A shift-reduce conflict is resolved in favor
39475897234Sdrhof the shift, and a reduce-reduce conflict is resolved by reducing
39575897234Sdrhwhichever rule comes first in the grammar file.</p>
39675897234Sdrh
39775897234Sdrh<p>Just like in
39875897234Sdrhyacc and bison, Lemon allows a measure of control
39975897234Sdrhover the resolution of paring conflicts using precedence rules.
40075897234SdrhA precedence value can be assigned to any terminal symbol
40175897234Sdrhusing the %left, %right or %nonassoc directives.  Terminal symbols
40275897234Sdrhmentioned in earlier directives have a lower precedence that
40375897234Sdrhterminal symbols mentioned in later directives.  For example:</p>
40475897234Sdrh
40575897234Sdrh<p><pre>
40675897234Sdrh   %left AND.
40775897234Sdrh   %left OR.
40875897234Sdrh   %nonassoc EQ NE GT GE LT LE.
40975897234Sdrh   %left PLUS MINUS.
41075897234Sdrh   %left TIMES DIVIDE MOD.
41175897234Sdrh   %right EXP NOT.
41275897234Sdrh</pre></p>
41375897234Sdrh
41475897234Sdrh<p>In the preceding sequence of directives, the AND operator is
41575897234Sdrhdefined to have the lowest precedence.  The OR operator is one
41675897234Sdrhprecedence level higher.  And so forth.  Hence, the grammar would
41775897234Sdrhattempt to group the ambiguous expression
41875897234Sdrh<pre>
41975897234Sdrh     a AND b OR c
42075897234Sdrh</pre>
42175897234Sdrhlike this
42275897234Sdrh<pre>
42375897234Sdrh     a AND (b OR c).
42475897234Sdrh</pre>
42575897234SdrhThe associativity (left, right or nonassoc) is used to determine
42675897234Sdrhthe grouping when the precedence is the same.  AND is left-associative
42775897234Sdrhin our example, so
42875897234Sdrh<pre>
42975897234Sdrh     a AND b AND c
43075897234Sdrh</pre>
43175897234Sdrhis parsed like this
43275897234Sdrh<pre>
43375897234Sdrh     (a AND b) AND c.
43475897234Sdrh</pre>
43575897234SdrhThe EXP operator is right-associative, though, so
43675897234Sdrh<pre>
43775897234Sdrh     a EXP b EXP c
43875897234Sdrh</pre>
43975897234Sdrhis parsed like this
44075897234Sdrh<pre>
44175897234Sdrh     a EXP (b EXP c).
44275897234Sdrh</pre>
44375897234SdrhThe nonassoc precedence is used for non-associative operators.
44475897234SdrhSo
44575897234Sdrh<pre>
44675897234Sdrh     a EQ b EQ c
44775897234Sdrh</pre>
44875897234Sdrhis an error.</p>
44975897234Sdrh
45075897234Sdrh<p>The precedence of non-terminals is transferred to rules as follows:
45175897234SdrhThe precedence of a grammar rule is equal to the precedence of the
45275897234Sdrhleft-most terminal symbol in the rule for which a precedence is
45375897234Sdrhdefined.  This is normally what you want, but in those cases where
45475897234Sdrhyou want to precedence of a grammar rule to be something different,
45575897234Sdrhyou can specify an alternative precedence symbol by putting the
45675897234Sdrhsymbol in square braces after the period at the end of the rule and
45775897234Sdrhbefore any C-code.  For example:</p>
45875897234Sdrh
45975897234Sdrh<p><pre>
46075897234Sdrh   expr = MINUS expr.  [NOT]
46175897234Sdrh</pre></p>
46275897234Sdrh
46375897234Sdrh<p>This rule has a precedence equal to that of the NOT symbol, not the
46475897234SdrhMINUS symbol as would have been the case by default.</p>
46575897234Sdrh
46675897234Sdrh<p>With the knowledge of how precedence is assigned to terminal
46775897234Sdrhsymbols and individual
46875897234Sdrhgrammar rules, we can now explain precisely how parsing conflicts
46975897234Sdrhare resolved in Lemon.  Shift-reduce conflicts are resolved
47075897234Sdrhas follows:
47175897234Sdrh<ul>
47275897234Sdrh<li> If either the token to be shifted or the rule to be reduced
47375897234Sdrh     lacks precedence information, then resolve in favor of the
47475897234Sdrh     shift, but report a parsing conflict.
47575897234Sdrh<li> If the precedence of the token to be shifted is greater than
47675897234Sdrh     the precedence of the rule to reduce, then resolve in favor
47775897234Sdrh     of the shift.  No parsing conflict is reported.
47875897234Sdrh<li> If the precedence of the token it be shifted is less than the
47975897234Sdrh     precedence of the rule to reduce, then resolve in favor of the
48075897234Sdrh     reduce action.  No parsing conflict is reported.
48175897234Sdrh<li> If the precedences are the same and the shift token is
48275897234Sdrh     right-associative, then resolve in favor of the shift.
48375897234Sdrh     No parsing conflict is reported.
48475897234Sdrh<li> If the precedences are the same the the shift token is
48575897234Sdrh     left-associative, then resolve in favor of the reduce.
48675897234Sdrh     No parsing conflict is reported.
48775897234Sdrh<li> Otherwise, resolve the conflict by doing the shift and
48875897234Sdrh     report the parsing conflict.
48975897234Sdrh</ul>
49075897234SdrhReduce-reduce conflicts are resolved this way:
49175897234Sdrh<ul>
49275897234Sdrh<li> If either reduce rule
49375897234Sdrh     lacks precedence information, then resolve in favor of the
49475897234Sdrh     rule that appears first in the grammar and report a parsing
49575897234Sdrh     conflict.
49675897234Sdrh<li> If both rules have precedence and the precedence is different
49775897234Sdrh     then resolve the dispute in favor of the rule with the highest
49875897234Sdrh     precedence and do not report a conflict.
49975897234Sdrh<li> Otherwise, resolve the conflict by reducing by the rule that
50075897234Sdrh     appears first in the grammar and report a parsing conflict.
50175897234Sdrh</ul>
50275897234Sdrh
50375897234Sdrh<h3>Special Directives</h3>
50475897234Sdrh
50575897234Sdrh<p>The input grammar to Lemon consists of grammar rules and special
50675897234Sdrhdirectives.  We've described all the grammar rules, so now we'll
50775897234Sdrhtalk about the special directives.</p>
50875897234Sdrh
50975897234Sdrh<p>Directives in lemon can occur in any order.  You can put them before
51075897234Sdrhthe grammar rules, or after the grammar rules, or in the mist of the
51175897234Sdrhgrammar rules.  It doesn't matter.  The relative order of
51275897234Sdrhdirectives used to assign precedence to terminals is important, but
51375897234Sdrhother than that, the order of directives in Lemon is arbitrary.</p>
51475897234Sdrh
51575897234Sdrh<p>Lemon supports the following special directives:
51675897234Sdrh<ul>
517f2340fc7Sdrh<li><tt>%code</tt>
518f2340fc7Sdrh<li><tt>%default_destructor</tt>
519f2340fc7Sdrh<li><tt>%default_type</tt>
52075897234Sdrh<li><tt>%destructor</tt>
52175897234Sdrh<li><tt>%extra_argument</tt>
52275897234Sdrh<li><tt>%include</tt>
52375897234Sdrh<li><tt>%left</tt>
52475897234Sdrh<li><tt>%name</tt>
52575897234Sdrh<li><tt>%nonassoc</tt>
52675897234Sdrh<li><tt>%parse_accept</tt>
52775897234Sdrh<li><tt>%parse_failure </tt>
52875897234Sdrh<li><tt>%right</tt>
52975897234Sdrh<li><tt>%stack_overflow</tt>
53075897234Sdrh<li><tt>%stack_size</tt>
53175897234Sdrh<li><tt>%start_symbol</tt>
53275897234Sdrh<li><tt>%syntax_error</tt>
53375897234Sdrh<li><tt>%token_destructor</tt>
53475897234Sdrh<li><tt>%token_prefix</tt>
53575897234Sdrh<li><tt>%token_type</tt>
53675897234Sdrh<li><tt>%type</tt>
53775897234Sdrh</ul>
53875897234SdrhEach of these directives will be described separately in the
53975897234Sdrhfollowing sections:</p>
54075897234Sdrh
541f2340fc7Sdrh<h4>The <tt>%code</tt> directive</h4>
542f2340fc7Sdrh
543f2340fc7Sdrh<p>The %code directive is used to specify addition C/C++ code that
544f2340fc7Sdrhis added to the end of the main output file.  This is similar to
545f2340fc7Sdrhthe %include directive except that %include is inserted at the
546f2340fc7Sdrhbeginning of the main output file.</p>
547f2340fc7Sdrh
548f2340fc7Sdrh<p>%code is typically used to include some action routines or perhaps
549f2340fc7Sdrha tokenizer as part of the output file.</p>
550f2340fc7Sdrh
551f2340fc7Sdrh<h4>The <tt>%default_destructor</tt> directive</h4>
552f2340fc7Sdrh
553f2340fc7Sdrh<p>The %default_destructor directive specifies a destructor to
554f2340fc7Sdrhuse for non-terminals that do not have their own destructor
555f2340fc7Sdrhspecified by a separate %destructor directive.  See the documentation
556f2340fc7Sdrhon the %destructor directive below for additional information.</p>
557f2340fc7Sdrh
558f2340fc7Sdrh<p>In some grammers, many different non-terminal symbols have the
559f2340fc7Sdrhsame datatype and hence the same destructor.  This directive is
560f2340fc7Sdrha convenience way to specify the same destructor for all those
561f2340fc7Sdrhnon-terminals using a single statement.</p>
562f2340fc7Sdrh
563f2340fc7Sdrh<h4>The <tt>%default_type</tt> directive</h4>
564f2340fc7Sdrh
565f2340fc7Sdrh<p>The %default_type directive specifies the datatype of non-terminal
566f2340fc7Sdrhsymbols that do no have their own datatype defined using a separate
567f2340fc7Sdrh%type directive.  See the documentation on %type below for addition
568f2340fc7Sdrhinformation.</p>
569f2340fc7Sdrh
57075897234Sdrh<h4>The <tt>%destructor</tt> directive</h4>
57175897234Sdrh
57275897234Sdrh<p>The %destructor directive is used to specify a destructor for
57375897234Sdrha non-terminal symbol.
57475897234Sdrh(See also the %token_destructor directive which is used to
57575897234Sdrhspecify a destructor for terminal symbols.)</p>
57675897234Sdrh
57775897234Sdrh<p>A non-terminal's destructor is called to dispose of the
57875897234Sdrhnon-terminal's value whenever the non-terminal is popped from
57975897234Sdrhthe stack.  This includes all of the following circumstances:
58075897234Sdrh<ul>
58175897234Sdrh<li> When a rule reduces and the value of a non-terminal on
58275897234Sdrh     the right-hand side is not linked to C code.
58375897234Sdrh<li> When the stack is popped during error processing.
58475897234Sdrh<li> When the ParseFree() function runs.
58575897234Sdrh</ul>
58675897234SdrhThe destructor can do whatever it wants with the value of
58775897234Sdrhthe non-terminal, but its design is to deallocate memory
58875897234Sdrhor other resources held by that non-terminal.</p>
58975897234Sdrh
59075897234Sdrh<p>Consider an example:
59175897234Sdrh<pre>
59275897234Sdrh   %type nt {void*}
59375897234Sdrh   %destructor nt { free($$); }
59475897234Sdrh   nt(A) ::= ID NUM.   { A = malloc( 100 ); }
59575897234Sdrh</pre>
59675897234SdrhThis example is a bit contrived but it serves to illustrate how
59775897234Sdrhdestructors work.  The example shows a non-terminal named
59875897234Sdrh``nt'' that holds values of type ``void*''.  When the rule for
59975897234Sdrhan ``nt'' reduces, it sets the value of the non-terminal to
60075897234Sdrhspace obtained from malloc().  Later, when the nt non-terminal
60175897234Sdrhis popped from the stack, the destructor will fire and call
60275897234Sdrhfree() on this malloced space, thus avoiding a memory leak.
60375897234Sdrh(Note that the symbol ``$$'' in the destructor code is replaced
60475897234Sdrhby the value of the non-terminal.)</p>
60575897234Sdrh
60675897234Sdrh<p>It is important to note that the value of a non-terminal is passed
60775897234Sdrhto the destructor whenever the non-terminal is removed from the
60875897234Sdrhstack, unless the non-terminal is used in a C-code action.  If
60975897234Sdrhthe non-terminal is used by C-code, then it is assumed that the
61075897234SdrhC-code will take care of destroying it if it should really
61175897234Sdrhbe destroyed.  More commonly, the value is used to build some
61275897234Sdrhlarger structure and we don't want to destroy it, which is why
61375897234Sdrhthe destructor is not called in this circumstance.</p>
61475897234Sdrh
61575897234Sdrh<p>By appropriate use of destructors, it is possible to
61675897234Sdrhbuild a parser using Lemon that can be used within a long-running
61775897234Sdrhprogram, such as a GUI, that will not leak memory or other resources.
61875897234SdrhTo do the same using yacc or bison is much more difficult.</p>
61975897234Sdrh
62075897234Sdrh<h4>The <tt>%extra_argument</tt> directive</h4>
62175897234Sdrh
62275897234SdrhThe %extra_argument directive instructs Lemon to add a 4th parameter
62375897234Sdrhto the parameter list of the Parse() function it generates.  Lemon
62475897234Sdrhdoesn't do anything itself with this extra argument, but it does
62575897234Sdrhmake the argument available to C-code action routines, destructors,
62675897234Sdrhand so forth.  For example, if the grammar file contains:</p>
62775897234Sdrh
62875897234Sdrh<p><pre>
62975897234Sdrh    %extra_argument { MyStruct *pAbc }
63075897234Sdrh</pre></p>
63175897234Sdrh
63275897234Sdrh<p>Then the Parse() function generated will have an 4th parameter
63375897234Sdrhof type ``MyStruct*'' and all action routines will have access to
63475897234Sdrha variable named ``pAbc'' that is the value of the 4th parameter
63575897234Sdrhin the most recent call to Parse().</p>
63675897234Sdrh
63775897234Sdrh<h4>The <tt>%include</tt> directive</h4>
63875897234Sdrh
63975897234Sdrh<p>The %include directive specifies C code that is included at the
64075897234Sdrhtop of the generated parser.  You can include any text you want --
641f2340fc7Sdrhthe Lemon parser generator copies it blindly.  If you have multiple
642f2340fc7Sdrh%include directives in your grammar file the value of the last
643f2340fc7Sdrh%include directive overwrites all the others.</p.
64475897234Sdrh
64575897234Sdrh<p>The %include directive is very handy for getting some extra #include
64675897234Sdrhpreprocessor statements at the beginning of the generated parser.
64775897234SdrhFor example:</p>
64875897234Sdrh
64975897234Sdrh<p><pre>
65075897234Sdrh   %include {#include &lt;unistd.h&gt;}
65175897234Sdrh</pre></p>
65275897234Sdrh
65375897234Sdrh<p>This might be needed, for example, if some of the C actions in the
65475897234Sdrhgrammar call functions that are prototyed in unistd.h.</p>
65575897234Sdrh
65675897234Sdrh<h4>The <tt>%left</tt> directive</h4>
65775897234Sdrh
65875897234SdrhThe %left directive is used (along with the %right and
65975897234Sdrh%nonassoc directives) to declare precedences of terminal
66075897234Sdrhsymbols.  Every terminal symbol whose name appears after
66175897234Sdrha %left directive but before the next period (``.'') is
66275897234Sdrhgiven the same left-associative precedence value.  Subsequent
66375897234Sdrh%left directives have higher precedence.  For example:</p>
66475897234Sdrh
66575897234Sdrh<p><pre>
66675897234Sdrh   %left AND.
66775897234Sdrh   %left OR.
66875897234Sdrh   %nonassoc EQ NE GT GE LT LE.
66975897234Sdrh   %left PLUS MINUS.
67075897234Sdrh   %left TIMES DIVIDE MOD.
67175897234Sdrh   %right EXP NOT.
67275897234Sdrh</pre></p>
67375897234Sdrh
67475897234Sdrh<p>Note the period that terminates each %left, %right or %nonassoc
67575897234Sdrhdirective.</p>
67675897234Sdrh
67775897234Sdrh<p>LALR(1) grammars can get into a situation where they require
67875897234Sdrha large amount of stack space if you make heavy use or right-associative
67975897234Sdrhoperators.  For this reason, it is recommended that you use %left
68075897234Sdrhrather than %right whenever possible.</p>
68175897234Sdrh
68275897234Sdrh<h4>The <tt>%name</tt> directive</h4>
68375897234Sdrh
68475897234Sdrh<p>By default, the functions generated by Lemon all begin with the
68575897234Sdrhfive-character string ``Parse''.  You can change this string to something
68675897234Sdrhdifferent using the %name directive.  For instance:</p>
68775897234Sdrh
68875897234Sdrh<p><pre>
68975897234Sdrh   %name Abcde
69075897234Sdrh</pre></p>
69175897234Sdrh
69275897234Sdrh<p>Putting this directive in the grammar file will cause Lemon to generate
69375897234Sdrhfunctions named
69475897234Sdrh<ul>
69575897234Sdrh<li> AbcdeAlloc(),
69675897234Sdrh<li> AbcdeFree(),
69775897234Sdrh<li> AbcdeTrace(), and
69875897234Sdrh<li> Abcde().
69975897234Sdrh</ul>
70075897234SdrhThe %name directive allows you to generator two or more different
70175897234Sdrhparsers and link them all into the same executable.
70275897234Sdrh</p>
70375897234Sdrh
70475897234Sdrh<h4>The <tt>%nonassoc</tt> directive</h4>
70575897234Sdrh
70675897234Sdrh<p>This directive is used to assign non-associative precedence to
70775897234Sdrhone or more terminal symbols.  See the section on precedence rules
70875897234Sdrhor on the %left directive for additional information.</p>
70975897234Sdrh
71075897234Sdrh<h4>The <tt>%parse_accept</tt> directive</h4>
71175897234Sdrh
71275897234Sdrh<p>The %parse_accept directive specifies a block of C code that is
71375897234Sdrhexecuted whenever the parser accepts its input string.  To ``accept''
71475897234Sdrhan input string means that the parser was able to process all tokens
71575897234Sdrhwithout error.</p>
71675897234Sdrh
71775897234Sdrh<p>For example:</p>
71875897234Sdrh
71975897234Sdrh<p><pre>
72075897234Sdrh   %parse_accept {
72175897234Sdrh      printf("parsing complete!\n");
72275897234Sdrh   }
72375897234Sdrh</pre></p>
72475897234Sdrh
72575897234Sdrh
72675897234Sdrh<h4>The <tt>%parse_failure</tt> directive</h4>
72775897234Sdrh
72875897234Sdrh<p>The %parse_failure directive specifies a block of C code that
72975897234Sdrhis executed whenever the parser fails complete.  This code is not
73075897234Sdrhexecuted until the parser has tried and failed to resolve an input
73175897234Sdrherror using is usual error recovery strategy.  The routine is
73275897234Sdrhonly invoked when parsing is unable to continue.</p>
73375897234Sdrh
73475897234Sdrh<p><pre>
73575897234Sdrh   %parse_failure {
73675897234Sdrh     fprintf(stderr,"Giving up.  Parser is hopelessly lost...\n");
73775897234Sdrh   }
73875897234Sdrh</pre></p>
73975897234Sdrh
74075897234Sdrh<h4>The <tt>%right</tt> directive</h4>
74175897234Sdrh
74275897234Sdrh<p>This directive is used to assign right-associative precedence to
74375897234Sdrhone or more terminal symbols.  See the section on precedence rules
74475897234Sdrhor on the %left directive for additional information.</p>
74575897234Sdrh
74675897234Sdrh<h4>The <tt>%stack_overflow</tt> directive</h4>
74775897234Sdrh
74875897234Sdrh<p>The %stack_overflow directive specifies a block of C code that
74975897234Sdrhis executed if the parser's internal stack ever overflows.  Typically
75075897234Sdrhthis just prints an error message.  After a stack overflow, the parser
75175897234Sdrhwill be unable to continue and must be reset.</p>
75275897234Sdrh
75375897234Sdrh<p><pre>
75475897234Sdrh   %stack_overflow {
75575897234Sdrh     fprintf(stderr,"Giving up.  Parser stack overflow\n");
75675897234Sdrh   }
75775897234Sdrh</pre></p>
75875897234Sdrh
75975897234Sdrh<p>You can help prevent parser stack overflows by avoiding the use
76075897234Sdrhof right recursion and right-precedence operators in your grammar.
76175897234SdrhUse left recursion and and left-precedence operators instead, to
76275897234Sdrhencourage rules to reduce sooner and keep the stack size down.
76375897234SdrhFor example, do rules like this:
76475897234Sdrh<pre>
76575897234Sdrh   list ::= list element.      // left-recursion.  Good!
76675897234Sdrh   list ::= .
76775897234Sdrh</pre>
76875897234SdrhNot like this:
76975897234Sdrh<pre>
77075897234Sdrh   list ::= element list.      // right-recursion.  Bad!
77175897234Sdrh   list ::= .
77275897234Sdrh</pre>
77375897234Sdrh
77475897234Sdrh<h4>The <tt>%stack_size</tt> directive</h4>
77575897234Sdrh
77675897234Sdrh<p>If stack overflow is a problem and you can't resolve the trouble
77775897234Sdrhby using left-recursion, then you might want to increase the size
77875897234Sdrhof the parser's stack using this directive.  Put an positive integer
77975897234Sdrhafter the %stack_size directive and Lemon will generate a parse
78075897234Sdrhwith a stack of the requested size.  The default value is 100.</p>
78175897234Sdrh
78275897234Sdrh<p><pre>
78375897234Sdrh   %stack_size 2000
78475897234Sdrh</pre></p>
78575897234Sdrh
78675897234Sdrh<h4>The <tt>%start_symbol</tt> directive</h4>
78775897234Sdrh
78875897234Sdrh<p>By default, the start-symbol for the grammar that Lemon generates
78975897234Sdrhis the first non-terminal that appears in the grammar file.  But you
79075897234Sdrhcan choose a different start-symbol using the %start_symbol directive.</p>
79175897234Sdrh
79275897234Sdrh<p><pre>
79375897234Sdrh   %start_symbol  prog
79475897234Sdrh</pre></p>
79575897234Sdrh
79675897234Sdrh<h4>The <tt>%token_destructor</tt> directive</h4>
79775897234Sdrh
79875897234Sdrh<p>The %destructor directive assigns a destructor to a non-terminal
79975897234Sdrhsymbol.  (See the description of the %destructor directive above.)
80075897234SdrhThis directive does the same thing for all terminal symbols.</p>
80175897234Sdrh
80275897234Sdrh<p>Unlike non-terminal symbols which may each have a different data type
80375897234Sdrhfor their values, terminals all use the same data type (defined by
80475897234Sdrhthe %token_type directive) and so they use a common destructor.  Other
80575897234Sdrhthan that, the token destructor works just like the non-terminal
80675897234Sdrhdestructors.</p>
80775897234Sdrh
80875897234Sdrh<h4>The <tt>%token_prefix</tt> directive</h4>
80975897234Sdrh
81075897234Sdrh<p>Lemon generates #defines that assign small integer constants
81175897234Sdrhto each terminal symbol in the grammar.  If desired, Lemon will
81275897234Sdrhadd a prefix specified by this directive
81375897234Sdrhto each of the #defines it generates.
81475897234SdrhSo if the default output of Lemon looked like this:
81575897234Sdrh<pre>
81675897234Sdrh    #define AND              1
81775897234Sdrh    #define MINUS            2
81875897234Sdrh    #define OR               3
81975897234Sdrh    #define PLUS             4
82075897234Sdrh</pre>
82175897234SdrhYou can insert a statement into the grammar like this:
82275897234Sdrh<pre>
82375897234Sdrh    %token_prefix    TOKEN_
82475897234Sdrh</pre>
82575897234Sdrhto cause Lemon to produce these symbols instead:
82675897234Sdrh<pre>
82775897234Sdrh    #define TOKEN_AND        1
82875897234Sdrh    #define TOKEN_MINUS      2
82975897234Sdrh    #define TOKEN_OR         3
83075897234Sdrh    #define TOKEN_PLUS       4
83175897234Sdrh</pre>
83275897234Sdrh
83375897234Sdrh<h4>The <tt>%token_type</tt> and <tt>%type</tt> directives</h4>
83475897234Sdrh
83575897234Sdrh<p>These directives are used to specify the data types for values
83675897234Sdrhon the parser's stack associated with terminal and non-terminal
83775897234Sdrhsymbols.  The values of all terminal symbols must be of the same
83875897234Sdrhtype.  This turns out to be the same data type as the 3rd parameter
83975897234Sdrhto the Parse() function generated by Lemon.  Typically, you will
84075897234Sdrhmake the value of a terminal symbol by a pointer to some kind of
84175897234Sdrhtoken structure.  Like this:</p>
84275897234Sdrh
84375897234Sdrh<p><pre>
84475897234Sdrh   %token_type    {Token*}
84575897234Sdrh</pre></p>
84675897234Sdrh
84775897234Sdrh<p>If the data type of terminals is not specified, the default value
84875897234Sdrhis ``int''.</p>
84975897234Sdrh
85075897234Sdrh<p>Non-terminal symbols can each have their own data types.  Typically
85175897234Sdrhthe data type  of a non-terminal is a pointer to the root of a parse-tree
85275897234Sdrhstructure that contains all information about that non-terminal.
85375897234SdrhFor example:</p>
85475897234Sdrh
85575897234Sdrh<p><pre>
85675897234Sdrh   %type   expr  {Expr*}
85775897234Sdrh</pre></p>
85875897234Sdrh
85975897234Sdrh<p>Each entry on the parser's stack is actually a union containing
86075897234Sdrhinstances of all data types for every non-terminal and terminal symbol.
86175897234SdrhLemon will automatically use the correct element of this union depending
86275897234Sdrhon what the corresponding non-terminal or terminal symbol is.  But
86375897234Sdrhthe grammar designer should keep in mind that the size of the union
86475897234Sdrhwill be the size of its largest element.  So if you have a single
86575897234Sdrhnon-terminal whose data type requires 1K of storage, then your 100
86675897234Sdrhentry parser stack will require 100K of heap space.  If you are willing
86775897234Sdrhand able to pay that price, fine.  You just need to know.</p>
86875897234Sdrh
86975897234Sdrh<h3>Error Processing</h3>
87075897234Sdrh
87175897234Sdrh<p>After extensive experimentation over several years, it has been
87275897234Sdrhdiscovered that the error recovery strategy used by yacc is about
87375897234Sdrhas good as it gets.  And so that is what Lemon uses.</p>
87475897234Sdrh
87575897234Sdrh<p>When a Lemon-generated parser encounters a syntax error, it
87675897234Sdrhfirst invokes the code specified by the %syntax_error directive, if
87775897234Sdrhany.  It then enters its error recovery strategy.  The error recovery
87875897234Sdrhstrategy is to begin popping the parsers stack until it enters a
87975897234Sdrhstate where it is permitted to shift a special non-terminal symbol
88075897234Sdrhnamed ``error''.  It then shifts this non-terminal and continues
88175897234Sdrhparsing.  But the %syntax_error routine will not be called again
88275897234Sdrhuntil at least three new tokens have been successfully shifted.</p>
88375897234Sdrh
88475897234Sdrh<p>If the parser pops its stack until the stack is empty, and it still
88575897234Sdrhis unable to shift the error symbol, then the %parse_failed routine
88675897234Sdrhis invoked and the parser resets itself to its start state, ready
88775897234Sdrhto begin parsing a new file.  This is what will happen at the very
88875897234Sdrhfirst syntax error, of course, if there are no instances of the
88975897234Sdrh``error'' non-terminal in your grammar.</p>
89075897234Sdrh
89175897234Sdrh</body>
89275897234Sdrh</html>
893