175897234Sdrh<html> 275897234Sdrh<head> 375897234Sdrh<title>The Lemon Parser Generator</title> 475897234Sdrh</head> 575897234Sdrh<body bgcolor=white> 675897234Sdrh<h1 align=center>The Lemon Parser Generator</h1> 775897234Sdrh 89bccde3dSdrh<p>Lemon is an LALR(1) parser generator for C. 99bccde3dSdrhIt does the same job as "bison" and "yacc". 109bccde3dSdrhBut lemon is not a bison or yacc clone. Lemon 1175897234Sdrhuses a different grammar syntax which is designed to 129bccde3dSdrhreduce the number of coding errors. Lemon also uses a 139bccde3dSdrhparsing engine that is faster than yacc and 149bccde3dSdrhbison and which is both reentrant and threadsafe. 159bccde3dSdrh(Update: Since the previous sentence was written, bison 169bccde3dSdrhhas also been updated so that it too can generate a 179bccde3dSdrhreentrant and threadsafe parser.) 189bccde3dSdrhLemon also implements features that can be used 1975897234Sdrhto eliminate resource leaks, making is suitable for use 2075897234Sdrhin long-running programs such as graphical user interfaces 2175897234Sdrhor embedded controllers.</p> 2275897234Sdrh 2375897234Sdrh<p>This document is an introduction to the Lemon 2475897234Sdrhparser generator.</p> 2575897234Sdrh 2675897234Sdrh<h2>Theory of Operation</h2> 2775897234Sdrh 2875897234Sdrh<p>The main goal of Lemon is to translate a context free grammar (CFG) 2975897234Sdrhfor a particular language into C code that implements a parser for 3075897234Sdrhthat language. 3175897234SdrhThe program has two inputs: 3275897234Sdrh<ul> 3375897234Sdrh<li>The grammar specification. 3475897234Sdrh<li>A parser template file. 3575897234Sdrh</ul> 3675897234SdrhTypically, only the grammar specification is supplied by the programmer. 3775897234SdrhLemon comes with a default parser template which works fine for most 3875897234Sdrhapplications. But the user is free to substitute a different parser 3975897234Sdrhtemplate if desired.</p> 4075897234Sdrh 4175897234Sdrh<p>Depending on command-line options, Lemon will generate between 4275897234Sdrhone and three files of outputs. 4375897234Sdrh<ul> 4475897234Sdrh<li>C code to implement the parser. 4575897234Sdrh<li>A header file defining an integer ID for each terminal symbol. 4675897234Sdrh<li>An information file that describes the states of the generated parser 4775897234Sdrh automaton. 4875897234Sdrh</ul> 4975897234SdrhBy default, all three of these output files are generated. 509bccde3dSdrhThe header file is suppressed if the "-m" command-line option is 519bccde3dSdrhused and the report file is omitted when "-q" is selected.</p> 5275897234Sdrh 539bccde3dSdrh<p>The grammar specification file uses a ".y" suffix, by convention. 5475897234SdrhIn the examples used in this document, we'll assume the name of the 559bccde3dSdrhgrammar file is "gram.y". A typical use of Lemon would be the 5675897234Sdrhfollowing command: 5775897234Sdrh<pre> 5875897234Sdrh lemon gram.y 5975897234Sdrh</pre> 609bccde3dSdrhThis command will generate three output files named "gram.c", 619bccde3dSdrh"gram.h" and "gram.out". 6275897234SdrhThe first is C code to implement the parser. The second 6375897234Sdrhis the header file that defines numerical values for all 6475897234Sdrhterminal symbols, and the last is the report that explains 6575897234Sdrhthe states used by the parser automaton.</p> 6675897234Sdrh 6775897234Sdrh<h3>Command Line Options</h3> 6875897234Sdrh 6975897234Sdrh<p>The behavior of Lemon can be modified using command-line options. 7075897234SdrhYou can obtain a list of the available command-line options together 7175897234Sdrhwith a brief explanation of what each does by typing 7275897234Sdrh<pre> 7375897234Sdrh lemon -? 7475897234Sdrh</pre> 7575897234SdrhAs of this writing, the following command-line options are supported: 7675897234Sdrh<ul> 779bccde3dSdrh<li><b>-b</b> 789bccde3dSdrhShow only the basis for each parser state in the report file. 799bccde3dSdrh<li><b>-c</b> 809bccde3dSdrhDo not compress the generated action tables. 819bccde3dSdrh<li><b>-D<i>name</i></b> 829bccde3dSdrhDefine C preprocessor macro <i>name</i>. This macro is useable by 839bccde3dSdrh"%ifdef" lines in the grammar file. 849bccde3dSdrh<li><b>-g</b> 859bccde3dSdrhDo not generate a parser. Instead write the input grammar to standard 869bccde3dSdrhoutput with all comments, actions, and other extraneous text removed. 879bccde3dSdrh<li><b>-l</b> 88*dfe4e6bbSdrhOmit "#line" directives in the generated parser C code. 899bccde3dSdrh<li><b>-m</b> 909bccde3dSdrhCause the output C source code to be compatible with the "makeheaders" 919bccde3dSdrhprogram. 929bccde3dSdrh<li><b>-p</b> 939bccde3dSdrhDisplay all conflicts that are resolved by 949bccde3dSdrh<a href='#precrules'>precedence rules</a>. 959bccde3dSdrh<li><b>-q</b> 969bccde3dSdrhSuppress generation of the report file. 979bccde3dSdrh<li><b>-r</b> 989bccde3dSdrhDo not sort or renumber the parser states as part of optimization. 999bccde3dSdrh<li><b>-s</b> 1009bccde3dSdrhShow parser statistics before existing. 1019bccde3dSdrh<li><b>-T<i>file</i></b> 1029bccde3dSdrhUse <i>file</i> as the template for the generated C-code parser implementation. 1039bccde3dSdrh<li><b>-x</b> 1049bccde3dSdrhPrint the Lemon version number. 10575897234Sdrh</ul> 10675897234Sdrh 10775897234Sdrh<h3>The Parser Interface</h3> 10875897234Sdrh 10975897234Sdrh<p>Lemon doesn't generate a complete, working program. It only generates 11075897234Sdrha few subroutines that implement a parser. This section describes 11175897234Sdrhthe interface to those subroutines. It is up to the programmer to 11275897234Sdrhcall these subroutines in an appropriate way in order to produce a 11375897234Sdrhcomplete system.</p> 11475897234Sdrh 11575897234Sdrh<p>Before a program begins using a Lemon-generated parser, the program 11675897234Sdrhmust first create the parser. 11775897234SdrhA new parser is created as follows: 11875897234Sdrh<pre> 11975897234Sdrh void *pParser = ParseAlloc( malloc ); 12075897234Sdrh</pre> 12175897234SdrhThe ParseAlloc() routine allocates and initializes a new parser and 12275897234Sdrhreturns a pointer to it. 1239bccde3dSdrhThe actual data structure used to represent a parser is opaque — 12475897234Sdrhits internal structure is not visible or usable by the calling routine. 12575897234SdrhFor this reason, the ParseAlloc() routine returns a pointer to void 12675897234Sdrhrather than a pointer to some particular structure. 12775897234SdrhThe sole argument to the ParseAlloc() routine is a pointer to the 1289bccde3dSdrhsubroutine used to allocate memory. Typically this means malloc().</p> 12975897234Sdrh 13075897234Sdrh<p>After a program is finished using a parser, it can reclaim all 13175897234Sdrhmemory allocated by that parser by calling 13275897234Sdrh<pre> 13375897234Sdrh ParseFree(pParser, free); 13475897234Sdrh</pre> 13575897234SdrhThe first argument is the same pointer returned by ParseAlloc(). The 13675897234Sdrhsecond argument is a pointer to the function used to release bulk 13775897234Sdrhmemory back to the system.</p> 13875897234Sdrh 13975897234Sdrh<p>After a parser has been allocated using ParseAlloc(), the programmer 14075897234Sdrhmust supply the parser with a sequence of tokens (terminal symbols) to 14175897234Sdrhbe parsed. This is accomplished by calling the following function 14275897234Sdrhonce for each token: 14375897234Sdrh<pre> 14475897234Sdrh Parse(pParser, hTokenID, sTokenData, pArg); 14575897234Sdrh</pre> 14675897234SdrhThe first argument to the Parse() routine is the pointer returned by 14775897234SdrhParseAlloc(). 14875897234SdrhThe second argument is a small positive integer that tells the parse the 14975897234Sdrhtype of the next token in the data stream. 15075897234SdrhThere is one token type for each terminal symbol in the grammar. 15175897234SdrhThe gram.h file generated by Lemon contains #define statements that 15275897234Sdrhmap symbolic terminal symbol names into appropriate integer values. 1539bccde3dSdrhA value of 0 for the second argument is a special flag to the 1549bccde3dSdrhparser to indicate that the end of input has been reached. 15575897234SdrhThe third argument is the value of the given token. By default, 15675897234Sdrhthe type of the third argument is integer, but the grammar will 15775897234Sdrhusually redefine this type to be some kind of structure. 15875897234SdrhTypically the second argument will be a broad category of tokens 1599bccde3dSdrhsuch as "identifier" or "number" and the third argument will 16075897234Sdrhbe the name of the identifier or the value of the number.</p> 16175897234Sdrh 16275897234Sdrh<p>The Parse() function may have either three or four arguments, 16345f31be8Sdrhdepending on the grammar. If the grammar specification file requests 16445f31be8Sdrhit (via the <a href='#extraarg'><tt>extra_argument</tt> directive</a>), 16545f31be8Sdrhthe Parse() function will have a fourth parameter that can be 16675897234Sdrhof any type chosen by the programmer. The parser doesn't do anything 16775897234Sdrhwith this argument except to pass it through to action routines. 16875897234SdrhThis is a convenient mechanism for passing state information down 16975897234Sdrhto the action routines without having to use global variables.</p> 17075897234Sdrh 17175897234Sdrh<p>A typical use of a Lemon parser might look something like the 17275897234Sdrhfollowing: 17375897234Sdrh<pre> 17475897234Sdrh 01 ParseTree *ParseFile(const char *zFilename){ 17575897234Sdrh 02 Tokenizer *pTokenizer; 17675897234Sdrh 03 void *pParser; 17775897234Sdrh 04 Token sToken; 17875897234Sdrh 05 int hTokenId; 17975897234Sdrh 06 ParserState sState; 18075897234Sdrh 07 18175897234Sdrh 08 pTokenizer = TokenizerCreate(zFilename); 18275897234Sdrh 09 pParser = ParseAlloc( malloc ); 18375897234Sdrh 10 InitParserState(&sState); 18475897234Sdrh 11 while( GetNextToken(pTokenizer, &hTokenId, &sToken) ){ 18575897234Sdrh 12 Parse(pParser, hTokenId, sToken, &sState); 18675897234Sdrh 13 } 18775897234Sdrh 14 Parse(pParser, 0, sToken, &sState); 18875897234Sdrh 15 ParseFree(pParser, free ); 18975897234Sdrh 16 TokenizerFree(pTokenizer); 19075897234Sdrh 17 return sState.treeRoot; 19175897234Sdrh 18 } 19275897234Sdrh</pre> 19375897234SdrhThis example shows a user-written routine that parses a file of 19475897234Sdrhtext and returns a pointer to the parse tree. 1959bccde3dSdrh(All error-handling code is omitted from this example to keep it 19675897234Sdrhsimple.) 19775897234SdrhWe assume the existence of some kind of tokenizer which is created 19875897234Sdrhusing TokenizerCreate() on line 8 and deleted by TokenizerFree() 19975897234Sdrhon line 16. The GetNextToken() function on line 11 retrieves the 20075897234Sdrhnext token from the input file and puts its type in the 20175897234Sdrhinteger variable hTokenId. The sToken variable is assumed to be 20275897234Sdrhsome kind of structure that contains details about each token, 20375897234Sdrhsuch as its complete text, what line it occurs on, etc. </p> 20475897234Sdrh 20575897234Sdrh<p>This example also assumes the existence of structure of type 20675897234SdrhParserState that holds state information about a particular parse. 20775897234SdrhAn instance of such a structure is created on line 6 and initialized 20875897234Sdrhon line 10. A pointer to this structure is passed into the Parse() 20975897234Sdrhroutine as the optional 4th argument. 21075897234SdrhThe action routine specified by the grammar for the parser can use 21175897234Sdrhthe ParserState structure to hold whatever information is useful and 21275897234Sdrhappropriate. In the example, we note that the treeRoot field of 21375897234Sdrhthe ParserState structure is left pointing to the root of the parse 21475897234Sdrhtree.</p> 21575897234Sdrh 21675897234Sdrh<p>The core of this example as it relates to Lemon is as follows: 21775897234Sdrh<pre> 21875897234Sdrh ParseFile(){ 21975897234Sdrh pParser = ParseAlloc( malloc ); 22075897234Sdrh while( GetNextToken(pTokenizer,&hTokenId, &sToken) ){ 22175897234Sdrh Parse(pParser, hTokenId, sToken); 22275897234Sdrh } 22375897234Sdrh Parse(pParser, 0, sToken); 22475897234Sdrh ParseFree(pParser, free ); 22575897234Sdrh } 22675897234Sdrh</pre> 22775897234SdrhBasically, what a program has to do to use a Lemon-generated parser 22875897234Sdrhis first create the parser, then send it lots of tokens obtained by 22975897234Sdrhtokenizing an input source. When the end of input is reached, the 23075897234SdrhParse() routine should be called one last time with a token type 23175897234Sdrhof 0. This step is necessary to inform the parser that the end of 23275897234Sdrhinput has been reached. Finally, we reclaim memory used by the 23375897234Sdrhparser by calling ParseFree().</p> 23475897234Sdrh 23575897234Sdrh<p>There is one other interface routine that should be mentioned 23675897234Sdrhbefore we move on. 23775897234SdrhThe ParseTrace() function can be used to generate debugging output 23875897234Sdrhfrom the parser. A prototype for this routine is as follows: 23975897234Sdrh<pre> 24075897234Sdrh ParseTrace(FILE *stream, char *zPrefix); 24175897234Sdrh</pre> 24275897234SdrhAfter this routine is called, a short (one-line) message is written 24375897234Sdrhto the designated output stream every time the parser changes states 24475897234Sdrhor calls an action routine. Each such message is prefaced using 24575897234Sdrhthe text given by zPrefix. This debugging output can be turned off 24675897234Sdrhby calling ParseTrace() again with a first argument of NULL (0).</p> 24775897234Sdrh 24875897234Sdrh<h3>Differences With YACC and BISON</h3> 24975897234Sdrh 25075897234Sdrh<p>Programmers who have previously used the yacc or bison parser 25175897234Sdrhgenerator will notice several important differences between yacc and/or 25275897234Sdrhbison and Lemon. 25375897234Sdrh<ul> 25475897234Sdrh<li>In yacc and bison, the parser calls the tokenizer. In Lemon, 25575897234Sdrh the tokenizer calls the parser. 25675897234Sdrh<li>Lemon uses no global variables. Yacc and bison use global variables 25775897234Sdrh to pass information between the tokenizer and parser. 25875897234Sdrh<li>Lemon allows multiple parsers to be running simultaneously. Yacc 25975897234Sdrh and bison do not. 26075897234Sdrh</ul> 26175897234SdrhThese differences may cause some initial confusion for programmers 26275897234Sdrhwith prior yacc and bison experience. 26375897234SdrhBut after years of experience using Lemon, I firmly 26475897234Sdrhbelieve that the Lemon way of doing things is better.</p> 26575897234Sdrh 26645f31be8Sdrh<p><i>Updated as of 2016-02-16:</i> 26745f31be8SdrhThe text above was written in the 1990s. 26845f31be8SdrhWe are told that Bison has lately been enhanced to support the 26945f31be8Sdrhtokenizer-calls-parser paradigm used by Lemon, and to obviate the 27045f31be8Sdrhneed for global variables.</p> 27145f31be8Sdrh 27275897234Sdrh<h2>Input File Syntax</h2> 27375897234Sdrh 27475897234Sdrh<p>The main purpose of the grammar specification file for Lemon is 27575897234Sdrhto define the grammar for the parser. But the input file also 27675897234Sdrhspecifies additional information Lemon requires to do its job. 27775897234SdrhMost of the work in using Lemon is in writing an appropriate 27875897234Sdrhgrammar file.</p> 27975897234Sdrh 28075897234Sdrh<p>The grammar file for lemon is, for the most part, free format. 28175897234SdrhIt does not have sections or divisions like yacc or bison. Any 28275897234Sdrhdeclaration can occur at any point in the file. 28375897234SdrhLemon ignores whitespace (except where it is needed to separate 28475897234Sdrhtokens) and it honors the same commenting conventions as C and C++.</p> 28575897234Sdrh 28675897234Sdrh<h3>Terminals and Nonterminals</h3> 28775897234Sdrh 28875897234Sdrh<p>A terminal symbol (token) is any string of alphanumeric 2899bccde3dSdrhand/or underscore characters 29075897234Sdrhthat begins with an upper case letter. 291c8eee5e5SdrhA terminal can contain lowercase letters after the first character, 29275897234Sdrhbut the usual convention is to make terminals all upper case. 29375897234SdrhA nonterminal, on the other hand, is any string of alphanumeric 29475897234Sdrhand underscore characters than begins with a lower case letter. 29575897234SdrhAgain, the usual convention is to make nonterminals use all lower 29675897234Sdrhcase letters.</p> 29775897234Sdrh 29875897234Sdrh<p>In Lemon, terminal and nonterminal symbols do not need to 29975897234Sdrhbe declared or identified in a separate section of the grammar file. 30075897234SdrhLemon is able to generate a list of all terminals and nonterminals 30175897234Sdrhby examining the grammar rules, and it can always distinguish a 30275897234Sdrhterminal from a nonterminal by checking the case of the first 30375897234Sdrhcharacter of the name.</p> 30475897234Sdrh 30575897234Sdrh<p>Yacc and bison allow terminal symbols to have either alphanumeric 30675897234Sdrhnames or to be individual characters included in single quotes, like 30775897234Sdrhthis: ')' or '$'. Lemon does not allow this alternative form for 30875897234Sdrhterminal symbols. With Lemon, all symbols, terminals and nonterminals, 30975897234Sdrhmust have alphanumeric names.</p> 31075897234Sdrh 31175897234Sdrh<h3>Grammar Rules</h3> 31275897234Sdrh 31375897234Sdrh<p>The main component of a Lemon grammar file is a sequence of grammar 31475897234Sdrhrules. 31575897234SdrhEach grammar rule consists of a nonterminal symbol followed by 3169bccde3dSdrhthe special symbol "::=" and then a list of terminals and/or nonterminals. 31775897234SdrhThe rule is terminated by a period. 31875897234SdrhThe list of terminals and nonterminals on the right-hand side of the 31975897234Sdrhrule can be empty. 32075897234SdrhRules can occur in any order, except that the left-hand side of the 32175897234Sdrhfirst rule is assumed to be the start symbol for the grammar (unless 32275897234Sdrhspecified otherwise using the <tt>%start</tt> directive described below.) 32375897234SdrhA typical sequence of grammar rules might look something like this: 32475897234Sdrh<pre> 32575897234Sdrh expr ::= expr PLUS expr. 32675897234Sdrh expr ::= expr TIMES expr. 32775897234Sdrh expr ::= LPAREN expr RPAREN. 32875897234Sdrh expr ::= VALUE. 32975897234Sdrh</pre> 33075897234Sdrh</p> 33175897234Sdrh 3329bccde3dSdrh<p>There is one non-terminal in this example, "expr", and five 3339bccde3dSdrhterminal symbols or tokens: "PLUS", "TIMES", "LPAREN", 3349bccde3dSdrh"RPAREN" and "VALUE".</p> 33575897234Sdrh 33675897234Sdrh<p>Like yacc and bison, Lemon allows the grammar to specify a block 33775897234Sdrhof C code that will be executed whenever a grammar rule is reduced 33875897234Sdrhby the parser. 33975897234SdrhIn Lemon, this action is specified by putting the C code (contained 34075897234Sdrhwithin curly braces <tt>{...}</tt>) immediately after the 34175897234Sdrhperiod that closes the rule. 34275897234SdrhFor example: 34375897234Sdrh<pre> 34475897234Sdrh expr ::= expr PLUS expr. { printf("Doing an addition...\n"); } 34575897234Sdrh</pre> 34675897234Sdrh</p> 34775897234Sdrh 34875897234Sdrh<p>In order to be useful, grammar actions must normally be linked to 34975897234Sdrhtheir associated grammar rules. 3509bccde3dSdrhIn yacc and bison, this is accomplished by embedding a "$$" in the 35175897234Sdrhaction to stand for the value of the left-hand side of the rule and 3529bccde3dSdrhsymbols "$1", "$2", and so forth to stand for the value of 35375897234Sdrhthe terminal or nonterminal at position 1, 2 and so forth on the 35475897234Sdrhright-hand side of the rule. 35575897234SdrhThis idea is very powerful, but it is also very error-prone. The 35675897234Sdrhsingle most common source of errors in a yacc or bison grammar is 35775897234Sdrhto miscount the number of symbols on the right-hand side of a grammar 3589bccde3dSdrhrule and say "$7" when you really mean "$8".</p> 35975897234Sdrh 36075897234Sdrh<p>Lemon avoids the need to count grammar symbols by assigning symbolic 36175897234Sdrhnames to each symbol in a grammar rule and then using those symbolic 36275897234Sdrhnames in the action. 36375897234SdrhIn yacc or bison, one would write this: 36475897234Sdrh<pre> 36575897234Sdrh expr -> expr PLUS expr { $$ = $1 + $3; }; 36675897234Sdrh</pre> 36775897234SdrhBut in Lemon, the same rule becomes the following: 36875897234Sdrh<pre> 36975897234Sdrh expr(A) ::= expr(B) PLUS expr(C). { A = B+C; } 37075897234Sdrh</pre> 37175897234SdrhIn the Lemon rule, any symbol in parentheses after a grammar rule 37275897234Sdrhsymbol becomes a place holder for that symbol in the grammar rule. 37375897234SdrhThis place holder can then be used in the associated C action to 37475897234Sdrhstand for the value of that symbol.<p> 37575897234Sdrh 37675897234Sdrh<p>The Lemon notation for linking a grammar rule with its reduce 37775897234Sdrhaction is superior to yacc/bison on several counts. 37875897234SdrhFirst, as mentioned above, the Lemon method avoids the need to 37975897234Sdrhcount grammar symbols. 38075897234SdrhSecondly, if a terminal or nonterminal in a Lemon grammar rule 38175897234Sdrhincludes a linking symbol in parentheses but that linking symbol 38275897234Sdrhis not actually used in the reduce action, then an error message 38375897234Sdrhis generated. 38475897234SdrhFor example, the rule 38575897234Sdrh<pre> 38675897234Sdrh expr(A) ::= expr(B) PLUS expr(C). { A = B; } 38775897234Sdrh</pre> 3889bccde3dSdrhwill generate an error because the linking symbol "C" is used 38975897234Sdrhin the grammar rule but not in the reduce action.</p> 39075897234Sdrh 39175897234Sdrh<p>The Lemon notation for linking grammar rules to reduce actions 39275897234Sdrhalso facilitates the use of destructors for reclaiming memory 39375897234Sdrhallocated by the values of terminals and nonterminals on the 39475897234Sdrhright-hand side of a rule.</p> 39575897234Sdrh 3969bccde3dSdrh<a name='precrules'></a> 39775897234Sdrh<h3>Precedence Rules</h3> 39875897234Sdrh 39975897234Sdrh<p>Lemon resolves parsing ambiguities in exactly the same way as 40075897234Sdrhyacc and bison. A shift-reduce conflict is resolved in favor 40175897234Sdrhof the shift, and a reduce-reduce conflict is resolved by reducing 40275897234Sdrhwhichever rule comes first in the grammar file.</p> 40375897234Sdrh 40475897234Sdrh<p>Just like in 40575897234Sdrhyacc and bison, Lemon allows a measure of control 40675897234Sdrhover the resolution of paring conflicts using precedence rules. 40775897234SdrhA precedence value can be assigned to any terminal symbol 4089bccde3dSdrhusing the 4099bccde3dSdrh<a href='#pleft'>%left</a>, 4109bccde3dSdrh<a href='#pright'>%right</a> or 4119bccde3dSdrh<a href='#pnonassoc'>%nonassoc</a> directives. Terminal symbols 41275897234Sdrhmentioned in earlier directives have a lower precedence that 41375897234Sdrhterminal symbols mentioned in later directives. For example:</p> 41475897234Sdrh 41575897234Sdrh<p><pre> 41675897234Sdrh %left AND. 41775897234Sdrh %left OR. 41875897234Sdrh %nonassoc EQ NE GT GE LT LE. 41975897234Sdrh %left PLUS MINUS. 42075897234Sdrh %left TIMES DIVIDE MOD. 42175897234Sdrh %right EXP NOT. 42275897234Sdrh</pre></p> 42375897234Sdrh 42475897234Sdrh<p>In the preceding sequence of directives, the AND operator is 42575897234Sdrhdefined to have the lowest precedence. The OR operator is one 42675897234Sdrhprecedence level higher. And so forth. Hence, the grammar would 42775897234Sdrhattempt to group the ambiguous expression 42875897234Sdrh<pre> 42975897234Sdrh a AND b OR c 43075897234Sdrh</pre> 43175897234Sdrhlike this 43275897234Sdrh<pre> 43375897234Sdrh a AND (b OR c). 43475897234Sdrh</pre> 43575897234SdrhThe associativity (left, right or nonassoc) is used to determine 43675897234Sdrhthe grouping when the precedence is the same. AND is left-associative 43775897234Sdrhin our example, so 43875897234Sdrh<pre> 43975897234Sdrh a AND b AND c 44075897234Sdrh</pre> 44175897234Sdrhis parsed like this 44275897234Sdrh<pre> 44375897234Sdrh (a AND b) AND c. 44475897234Sdrh</pre> 44575897234SdrhThe EXP operator is right-associative, though, so 44675897234Sdrh<pre> 44775897234Sdrh a EXP b EXP c 44875897234Sdrh</pre> 44975897234Sdrhis parsed like this 45075897234Sdrh<pre> 45175897234Sdrh a EXP (b EXP c). 45275897234Sdrh</pre> 45375897234SdrhThe nonassoc precedence is used for non-associative operators. 45475897234SdrhSo 45575897234Sdrh<pre> 45675897234Sdrh a EQ b EQ c 45775897234Sdrh</pre> 45875897234Sdrhis an error.</p> 45975897234Sdrh 46075897234Sdrh<p>The precedence of non-terminals is transferred to rules as follows: 46175897234SdrhThe precedence of a grammar rule is equal to the precedence of the 46275897234Sdrhleft-most terminal symbol in the rule for which a precedence is 46375897234Sdrhdefined. This is normally what you want, but in those cases where 46475897234Sdrhyou want to precedence of a grammar rule to be something different, 46575897234Sdrhyou can specify an alternative precedence symbol by putting the 46675897234Sdrhsymbol in square braces after the period at the end of the rule and 46775897234Sdrhbefore any C-code. For example:</p> 46875897234Sdrh 46975897234Sdrh<p><pre> 47075897234Sdrh expr = MINUS expr. [NOT] 47175897234Sdrh</pre></p> 47275897234Sdrh 47375897234Sdrh<p>This rule has a precedence equal to that of the NOT symbol, not the 47475897234SdrhMINUS symbol as would have been the case by default.</p> 47575897234Sdrh 47675897234Sdrh<p>With the knowledge of how precedence is assigned to terminal 47775897234Sdrhsymbols and individual 47875897234Sdrhgrammar rules, we can now explain precisely how parsing conflicts 47975897234Sdrhare resolved in Lemon. Shift-reduce conflicts are resolved 48075897234Sdrhas follows: 48175897234Sdrh<ul> 48275897234Sdrh<li> If either the token to be shifted or the rule to be reduced 48375897234Sdrh lacks precedence information, then resolve in favor of the 48475897234Sdrh shift, but report a parsing conflict. 48575897234Sdrh<li> If the precedence of the token to be shifted is greater than 48675897234Sdrh the precedence of the rule to reduce, then resolve in favor 48775897234Sdrh of the shift. No parsing conflict is reported. 48875897234Sdrh<li> If the precedence of the token it be shifted is less than the 48975897234Sdrh precedence of the rule to reduce, then resolve in favor of the 49075897234Sdrh reduce action. No parsing conflict is reported. 49175897234Sdrh<li> If the precedences are the same and the shift token is 49275897234Sdrh right-associative, then resolve in favor of the shift. 49375897234Sdrh No parsing conflict is reported. 494d5578433Smistachkin<li> If the precedences are the same the shift token is 49575897234Sdrh left-associative, then resolve in favor of the reduce. 49675897234Sdrh No parsing conflict is reported. 49775897234Sdrh<li> Otherwise, resolve the conflict by doing the shift and 49875897234Sdrh report the parsing conflict. 49975897234Sdrh</ul> 50075897234SdrhReduce-reduce conflicts are resolved this way: 50175897234Sdrh<ul> 50275897234Sdrh<li> If either reduce rule 50375897234Sdrh lacks precedence information, then resolve in favor of the 50475897234Sdrh rule that appears first in the grammar and report a parsing 50575897234Sdrh conflict. 50675897234Sdrh<li> If both rules have precedence and the precedence is different 50775897234Sdrh then resolve the dispute in favor of the rule with the highest 50875897234Sdrh precedence and do not report a conflict. 50975897234Sdrh<li> Otherwise, resolve the conflict by reducing by the rule that 51075897234Sdrh appears first in the grammar and report a parsing conflict. 51175897234Sdrh</ul> 51275897234Sdrh 51375897234Sdrh<h3>Special Directives</h3> 51475897234Sdrh 51575897234Sdrh<p>The input grammar to Lemon consists of grammar rules and special 51675897234Sdrhdirectives. We've described all the grammar rules, so now we'll 51775897234Sdrhtalk about the special directives.</p> 51875897234Sdrh 51975897234Sdrh<p>Directives in lemon can occur in any order. You can put them before 52075897234Sdrhthe grammar rules, or after the grammar rules, or in the mist of the 52175897234Sdrhgrammar rules. It doesn't matter. The relative order of 52275897234Sdrhdirectives used to assign precedence to terminals is important, but 52375897234Sdrhother than that, the order of directives in Lemon is arbitrary.</p> 52475897234Sdrh 52575897234Sdrh<p>Lemon supports the following special directives: 52675897234Sdrh<ul> 527f2340fc7Sdrh<li><tt>%code</tt> 528f2340fc7Sdrh<li><tt>%default_destructor</tt> 529f2340fc7Sdrh<li><tt>%default_type</tt> 53075897234Sdrh<li><tt>%destructor</tt> 5319bccde3dSdrh<li><tt>%endif</tt> 53275897234Sdrh<li><tt>%extra_argument</tt> 5339bccde3dSdrh<li><tt>%fallback</tt> 5349bccde3dSdrh<li><tt>%ifdef</tt> 5359bccde3dSdrh<li><tt>%ifndef</tt> 53675897234Sdrh<li><tt>%include</tt> 53775897234Sdrh<li><tt>%left</tt> 53875897234Sdrh<li><tt>%name</tt> 53975897234Sdrh<li><tt>%nonassoc</tt> 54075897234Sdrh<li><tt>%parse_accept</tt> 54175897234Sdrh<li><tt>%parse_failure </tt> 54275897234Sdrh<li><tt>%right</tt> 54375897234Sdrh<li><tt>%stack_overflow</tt> 54475897234Sdrh<li><tt>%stack_size</tt> 54575897234Sdrh<li><tt>%start_symbol</tt> 54675897234Sdrh<li><tt>%syntax_error</tt> 5479bccde3dSdrh<li><tt>%token_class</tt> 54875897234Sdrh<li><tt>%token_destructor</tt> 54975897234Sdrh<li><tt>%token_prefix</tt> 55075897234Sdrh<li><tt>%token_type</tt> 55175897234Sdrh<li><tt>%type</tt> 5529bccde3dSdrh<li><tt>%wildcard</tt> 55375897234Sdrh</ul> 55475897234SdrhEach of these directives will be described separately in the 55575897234Sdrhfollowing sections:</p> 55675897234Sdrh 5579bccde3dSdrh<a name='pcode'></a> 558f2340fc7Sdrh<h4>The <tt>%code</tt> directive</h4> 559f2340fc7Sdrh 5609bccde3dSdrh<p>The %code directive is used to specify addition C code that 561f2340fc7Sdrhis added to the end of the main output file. This is similar to 5629bccde3dSdrhthe <a href='#pinclude'>%include</a> directive except that %include 5639bccde3dSdrhis inserted at the beginning of the main output file.</p> 564f2340fc7Sdrh 565f2340fc7Sdrh<p>%code is typically used to include some action routines or perhaps 5669bccde3dSdrha tokenizer or even the "main()" function 5679bccde3dSdrhas part of the output file.</p> 568f2340fc7Sdrh 5699bccde3dSdrh<a name='default_destructor'></a> 570f2340fc7Sdrh<h4>The <tt>%default_destructor</tt> directive</h4> 571f2340fc7Sdrh 572f2340fc7Sdrh<p>The %default_destructor directive specifies a destructor to 573f2340fc7Sdrhuse for non-terminals that do not have their own destructor 574f2340fc7Sdrhspecified by a separate %destructor directive. See the documentation 5759bccde3dSdrhon the <a name='#destructor'>%destructor</a> directive below for 5769bccde3dSdrhadditional information.</p> 577f2340fc7Sdrh 578f2340fc7Sdrh<p>In some grammers, many different non-terminal symbols have the 579f2340fc7Sdrhsame datatype and hence the same destructor. This directive is 580f2340fc7Sdrha convenience way to specify the same destructor for all those 581f2340fc7Sdrhnon-terminals using a single statement.</p> 582f2340fc7Sdrh 5839bccde3dSdrh<a name='default_type'></a> 584f2340fc7Sdrh<h4>The <tt>%default_type</tt> directive</h4> 585f2340fc7Sdrh 586f2340fc7Sdrh<p>The %default_type directive specifies the datatype of non-terminal 587f2340fc7Sdrhsymbols that do no have their own datatype defined using a separate 5889bccde3dSdrh<a href='#ptype'>%type</a> directive. 5899bccde3dSdrh</p> 590f2340fc7Sdrh 5919bccde3dSdrh<a name='destructor'></a> 59275897234Sdrh<h4>The <tt>%destructor</tt> directive</h4> 59375897234Sdrh 59475897234Sdrh<p>The %destructor directive is used to specify a destructor for 59575897234Sdrha non-terminal symbol. 5969bccde3dSdrh(See also the <a href='#token_destructor'>%token_destructor</a> 5979bccde3dSdrhdirective which is used to specify a destructor for terminal symbols.)</p> 59875897234Sdrh 59975897234Sdrh<p>A non-terminal's destructor is called to dispose of the 60075897234Sdrhnon-terminal's value whenever the non-terminal is popped from 60175897234Sdrhthe stack. This includes all of the following circumstances: 60275897234Sdrh<ul> 60375897234Sdrh<li> When a rule reduces and the value of a non-terminal on 60475897234Sdrh the right-hand side is not linked to C code. 60575897234Sdrh<li> When the stack is popped during error processing. 60675897234Sdrh<li> When the ParseFree() function runs. 60775897234Sdrh</ul> 60875897234SdrhThe destructor can do whatever it wants with the value of 60975897234Sdrhthe non-terminal, but its design is to deallocate memory 61075897234Sdrhor other resources held by that non-terminal.</p> 61175897234Sdrh 61275897234Sdrh<p>Consider an example: 61375897234Sdrh<pre> 61475897234Sdrh %type nt {void*} 61575897234Sdrh %destructor nt { free($$); } 61675897234Sdrh nt(A) ::= ID NUM. { A = malloc( 100 ); } 61775897234Sdrh</pre> 61875897234SdrhThis example is a bit contrived but it serves to illustrate how 61975897234Sdrhdestructors work. The example shows a non-terminal named 6209bccde3dSdrh"nt" that holds values of type "void*". When the rule for 6219bccde3dSdrhan "nt" reduces, it sets the value of the non-terminal to 62275897234Sdrhspace obtained from malloc(). Later, when the nt non-terminal 62375897234Sdrhis popped from the stack, the destructor will fire and call 62475897234Sdrhfree() on this malloced space, thus avoiding a memory leak. 6259bccde3dSdrh(Note that the symbol "$$" in the destructor code is replaced 62675897234Sdrhby the value of the non-terminal.)</p> 62775897234Sdrh 62875897234Sdrh<p>It is important to note that the value of a non-terminal is passed 62975897234Sdrhto the destructor whenever the non-terminal is removed from the 63075897234Sdrhstack, unless the non-terminal is used in a C-code action. If 63175897234Sdrhthe non-terminal is used by C-code, then it is assumed that the 6329bccde3dSdrhC-code will take care of destroying it. 6339bccde3dSdrhMore commonly, the value is used to build some 63475897234Sdrhlarger structure and we don't want to destroy it, which is why 63575897234Sdrhthe destructor is not called in this circumstance.</p> 63675897234Sdrh 6379bccde3dSdrh<p>Destructors help avoid memory leaks by automatically freeing 6389bccde3dSdrhallocated objects when they go out of scope. 63975897234SdrhTo do the same using yacc or bison is much more difficult.</p> 64075897234Sdrh 64145f31be8Sdrh<a name="extraarg"></a> 64275897234Sdrh<h4>The <tt>%extra_argument</tt> directive</h4> 64375897234Sdrh 64475897234SdrhThe %extra_argument directive instructs Lemon to add a 4th parameter 64575897234Sdrhto the parameter list of the Parse() function it generates. Lemon 64675897234Sdrhdoesn't do anything itself with this extra argument, but it does 64775897234Sdrhmake the argument available to C-code action routines, destructors, 64875897234Sdrhand so forth. For example, if the grammar file contains:</p> 64975897234Sdrh 65075897234Sdrh<p><pre> 65175897234Sdrh %extra_argument { MyStruct *pAbc } 65275897234Sdrh</pre></p> 65375897234Sdrh 65475897234Sdrh<p>Then the Parse() function generated will have an 4th parameter 6559bccde3dSdrhof type "MyStruct*" and all action routines will have access to 6569bccde3dSdrha variable named "pAbc" that is the value of the 4th parameter 65775897234Sdrhin the most recent call to Parse().</p> 65875897234Sdrh 6599bccde3dSdrh<a name='pfallback'></a> 6609bccde3dSdrh<h4>The <tt>%fallback</tt> directive</h4> 6619bccde3dSdrh 6629bccde3dSdrh<p>The %fallback directive specifies an alternative meaning for one 6639bccde3dSdrhor more tokens. The alternative meaning is tried if the original token 6649bccde3dSdrhwould have generated a syntax error. 6659bccde3dSdrh 6669bccde3dSdrh<p>The %fallback directive was added to support robust parsing of SQL 6679bccde3dSdrhsyntax in <a href="https://www.sqlite.org/">SQLite</a>. 6689bccde3dSdrhThe SQL language contains a large assortment of keywords, each of which 6699bccde3dSdrhappears as a different token to the language parser. SQL contains so 6709bccde3dSdrhmany keywords, that it can be difficult for programmers to keep up with 6719bccde3dSdrhthem all. Programmers will, therefore, sometimes mistakenly use an 6729bccde3dSdrhobscure language keyword for an identifier. The %fallback directive 6739bccde3dSdrhprovides a mechanism to tell the parser: "If you are unable to parse 6749bccde3dSdrhthis keyword, try treating it as an identifier instead." 6759bccde3dSdrh 6769bccde3dSdrh<p>The syntax of %fallback is as follows: 6779bccde3dSdrh 6789bccde3dSdrh<blockquote> 6799bccde3dSdrh<tt>%fallback</tt> <i>ID</i> <i>TOKEN...</i> <b>.</b> 6809bccde3dSdrh</blockquote> 6819bccde3dSdrh 6829bccde3dSdrh<p>In words, the %fallback directive is followed by a list of token names 6839bccde3dSdrhterminated by a period. The first token name is the fallback token - the 6849bccde3dSdrhtoken to which all the other tokens fall back to. The second and subsequent 6859bccde3dSdrharguments are tokens which fall back to the token identified by the first 6869bccde3dSdrhargument. 6879bccde3dSdrh 6889bccde3dSdrh<a name='pifdef'></a> 6899bccde3dSdrh<h4>The <tt>%ifdef</tt>, <tt>%ifndef</tt>, and <tt>%endif</tt> directives.</h4> 6909bccde3dSdrh 6919bccde3dSdrh<p>The %ifdef, %ifndef, and %endif directives are similar to 6929bccde3dSdrh#ifdef, #ifndef, and #endif in the C-preprocessor, just not as general. 6939bccde3dSdrhEach of these directives must begin at the left margin. No whitespace 6949bccde3dSdrhis allowed between the "%" and the directive name. 6959bccde3dSdrh 6969bccde3dSdrh<p>Grammar text in between "%ifdef MACRO" and the next nested "%endif" is 6979bccde3dSdrhignored unless the "-DMACRO" command-line option is used. Grammar text 6989bccde3dSdrhbetwen "%ifndef MACRO" and the next nested "%endif" is included except when 6999bccde3dSdrhthe "-DMACRO" command-line option is used. 7009bccde3dSdrh 7019bccde3dSdrh<p>Note that the argument to %ifdef and %ifndef must be a single 7029bccde3dSdrhpreprocessor symbol name, not a general expression. There is no "%else" 7039bccde3dSdrhdirective. 7049bccde3dSdrh 7059bccde3dSdrh 7069bccde3dSdrh<a name='pinclude'></a> 70775897234Sdrh<h4>The <tt>%include</tt> directive</h4> 70875897234Sdrh 70975897234Sdrh<p>The %include directive specifies C code that is included at the 71075897234Sdrhtop of the generated parser. You can include any text you want -- 711f2340fc7Sdrhthe Lemon parser generator copies it blindly. If you have multiple 7129bccde3dSdrh%include directives in your grammar file, their values are concatenated 7139bccde3dSdrhso that all %include code ultimately appears near the top of the 7149bccde3dSdrhgenerated parser, in the same order as it appeared in the grammer.</p> 71575897234Sdrh 71675897234Sdrh<p>The %include directive is very handy for getting some extra #include 71775897234Sdrhpreprocessor statements at the beginning of the generated parser. 71875897234SdrhFor example:</p> 71975897234Sdrh 72075897234Sdrh<p><pre> 72175897234Sdrh %include {#include <unistd.h>} 72275897234Sdrh</pre></p> 72375897234Sdrh 72475897234Sdrh<p>This might be needed, for example, if some of the C actions in the 72575897234Sdrhgrammar call functions that are prototyed in unistd.h.</p> 72675897234Sdrh 7279bccde3dSdrh<a name='pleft'></a> 72875897234Sdrh<h4>The <tt>%left</tt> directive</h4> 72975897234Sdrh 7309bccde3dSdrhThe %left directive is used (along with the <a href='#pright'>%right</a> and 7319bccde3dSdrh<a href='#pnonassoc'>%nonassoc</a> directives) to declare precedences of 7329bccde3dSdrhterminal symbols. Every terminal symbol whose name appears after 7339bccde3dSdrha %left directive but before the next period (".") is 73475897234Sdrhgiven the same left-associative precedence value. Subsequent 73575897234Sdrh%left directives have higher precedence. For example:</p> 73675897234Sdrh 73775897234Sdrh<p><pre> 73875897234Sdrh %left AND. 73975897234Sdrh %left OR. 74075897234Sdrh %nonassoc EQ NE GT GE LT LE. 74175897234Sdrh %left PLUS MINUS. 74275897234Sdrh %left TIMES DIVIDE MOD. 74375897234Sdrh %right EXP NOT. 74475897234Sdrh</pre></p> 74575897234Sdrh 74675897234Sdrh<p>Note the period that terminates each %left, %right or %nonassoc 74775897234Sdrhdirective.</p> 74875897234Sdrh 74975897234Sdrh<p>LALR(1) grammars can get into a situation where they require 75075897234Sdrha large amount of stack space if you make heavy use or right-associative 75175897234Sdrhoperators. For this reason, it is recommended that you use %left 75275897234Sdrhrather than %right whenever possible.</p> 75375897234Sdrh 7549bccde3dSdrh<a name='pname'></a> 75575897234Sdrh<h4>The <tt>%name</tt> directive</h4> 75675897234Sdrh 75775897234Sdrh<p>By default, the functions generated by Lemon all begin with the 7589bccde3dSdrhfive-character string "Parse". You can change this string to something 75975897234Sdrhdifferent using the %name directive. For instance:</p> 76075897234Sdrh 76175897234Sdrh<p><pre> 76275897234Sdrh %name Abcde 76375897234Sdrh</pre></p> 76475897234Sdrh 76575897234Sdrh<p>Putting this directive in the grammar file will cause Lemon to generate 76675897234Sdrhfunctions named 76775897234Sdrh<ul> 76875897234Sdrh<li> AbcdeAlloc(), 76975897234Sdrh<li> AbcdeFree(), 77075897234Sdrh<li> AbcdeTrace(), and 77175897234Sdrh<li> Abcde(). 77275897234Sdrh</ul> 77375897234SdrhThe %name directive allows you to generator two or more different 77475897234Sdrhparsers and link them all into the same executable. 77575897234Sdrh</p> 77675897234Sdrh 7779bccde3dSdrh<a name='pnonassoc'></a> 77875897234Sdrh<h4>The <tt>%nonassoc</tt> directive</h4> 77975897234Sdrh 78075897234Sdrh<p>This directive is used to assign non-associative precedence to 7819bccde3dSdrhone or more terminal symbols. See the section on 7829bccde3dSdrh<a href='#precrules'>precedence rules</a> 7839bccde3dSdrhor on the <a href='#pleft'>%left</a> directive for additional information.</p> 78475897234Sdrh 7859bccde3dSdrh<a name='parse_accept'></a> 78675897234Sdrh<h4>The <tt>%parse_accept</tt> directive</h4> 78775897234Sdrh 78875897234Sdrh<p>The %parse_accept directive specifies a block of C code that is 7899bccde3dSdrhexecuted whenever the parser accepts its input string. To "accept" 79075897234Sdrhan input string means that the parser was able to process all tokens 79175897234Sdrhwithout error.</p> 79275897234Sdrh 79375897234Sdrh<p>For example:</p> 79475897234Sdrh 79575897234Sdrh<p><pre> 79675897234Sdrh %parse_accept { 79775897234Sdrh printf("parsing complete!\n"); 79875897234Sdrh } 79975897234Sdrh</pre></p> 80075897234Sdrh 8019bccde3dSdrh<a name='parse_failure'></a> 80275897234Sdrh<h4>The <tt>%parse_failure</tt> directive</h4> 80375897234Sdrh 80475897234Sdrh<p>The %parse_failure directive specifies a block of C code that 80575897234Sdrhis executed whenever the parser fails complete. This code is not 80675897234Sdrhexecuted until the parser has tried and failed to resolve an input 80775897234Sdrherror using is usual error recovery strategy. The routine is 80875897234Sdrhonly invoked when parsing is unable to continue.</p> 80975897234Sdrh 81075897234Sdrh<p><pre> 81175897234Sdrh %parse_failure { 81275897234Sdrh fprintf(stderr,"Giving up. Parser is hopelessly lost...\n"); 81375897234Sdrh } 81475897234Sdrh</pre></p> 81575897234Sdrh 8169bccde3dSdrh<a name='pright'></a> 81775897234Sdrh<h4>The <tt>%right</tt> directive</h4> 81875897234Sdrh 81975897234Sdrh<p>This directive is used to assign right-associative precedence to 8209bccde3dSdrhone or more terminal symbols. See the section on 8219bccde3dSdrh<a href='#precrules'>precedence rules</a> 8229bccde3dSdrhor on the <a href='#pleft'>%left</a> directive for additional information.</p> 82375897234Sdrh 8249bccde3dSdrh<a name='stack_overflow'></a> 82575897234Sdrh<h4>The <tt>%stack_overflow</tt> directive</h4> 82675897234Sdrh 82775897234Sdrh<p>The %stack_overflow directive specifies a block of C code that 82875897234Sdrhis executed if the parser's internal stack ever overflows. Typically 82975897234Sdrhthis just prints an error message. After a stack overflow, the parser 83075897234Sdrhwill be unable to continue and must be reset.</p> 83175897234Sdrh 83275897234Sdrh<p><pre> 83375897234Sdrh %stack_overflow { 83475897234Sdrh fprintf(stderr,"Giving up. Parser stack overflow\n"); 83575897234Sdrh } 83675897234Sdrh</pre></p> 83775897234Sdrh 83875897234Sdrh<p>You can help prevent parser stack overflows by avoiding the use 83975897234Sdrhof right recursion and right-precedence operators in your grammar. 84075897234SdrhUse left recursion and and left-precedence operators instead, to 84175897234Sdrhencourage rules to reduce sooner and keep the stack size down. 84275897234SdrhFor example, do rules like this: 84375897234Sdrh<pre> 84475897234Sdrh list ::= list element. // left-recursion. Good! 84575897234Sdrh list ::= . 84675897234Sdrh</pre> 84775897234SdrhNot like this: 84875897234Sdrh<pre> 84975897234Sdrh list ::= element list. // right-recursion. Bad! 85075897234Sdrh list ::= . 85175897234Sdrh</pre> 85275897234Sdrh 8539bccde3dSdrh<a name='stack_size'></a> 85475897234Sdrh<h4>The <tt>%stack_size</tt> directive</h4> 85575897234Sdrh 85675897234Sdrh<p>If stack overflow is a problem and you can't resolve the trouble 85775897234Sdrhby using left-recursion, then you might want to increase the size 85875897234Sdrhof the parser's stack using this directive. Put an positive integer 85975897234Sdrhafter the %stack_size directive and Lemon will generate a parse 86075897234Sdrhwith a stack of the requested size. The default value is 100.</p> 86175897234Sdrh 86275897234Sdrh<p><pre> 86375897234Sdrh %stack_size 2000 86475897234Sdrh</pre></p> 86575897234Sdrh 8669bccde3dSdrh<a name='start_symbol'></a> 86775897234Sdrh<h4>The <tt>%start_symbol</tt> directive</h4> 86875897234Sdrh 86975897234Sdrh<p>By default, the start-symbol for the grammar that Lemon generates 87075897234Sdrhis the first non-terminal that appears in the grammar file. But you 87175897234Sdrhcan choose a different start-symbol using the %start_symbol directive.</p> 87275897234Sdrh 87375897234Sdrh<p><pre> 87475897234Sdrh %start_symbol prog 87575897234Sdrh</pre></p> 87675897234Sdrh 8779bccde3dSdrh<a name='token_destructor'></a> 87875897234Sdrh<h4>The <tt>%token_destructor</tt> directive</h4> 87975897234Sdrh 88075897234Sdrh<p>The %destructor directive assigns a destructor to a non-terminal 88175897234Sdrhsymbol. (See the description of the %destructor directive above.) 88275897234SdrhThis directive does the same thing for all terminal symbols.</p> 88375897234Sdrh 88475897234Sdrh<p>Unlike non-terminal symbols which may each have a different data type 88575897234Sdrhfor their values, terminals all use the same data type (defined by 88675897234Sdrhthe %token_type directive) and so they use a common destructor. Other 88775897234Sdrhthan that, the token destructor works just like the non-terminal 88875897234Sdrhdestructors.</p> 88975897234Sdrh 8909bccde3dSdrh<a name='token_prefix'></a> 89175897234Sdrh<h4>The <tt>%token_prefix</tt> directive</h4> 89275897234Sdrh 89375897234Sdrh<p>Lemon generates #defines that assign small integer constants 89475897234Sdrhto each terminal symbol in the grammar. If desired, Lemon will 89575897234Sdrhadd a prefix specified by this directive 89675897234Sdrhto each of the #defines it generates. 89775897234SdrhSo if the default output of Lemon looked like this: 89875897234Sdrh<pre> 89975897234Sdrh #define AND 1 90075897234Sdrh #define MINUS 2 90175897234Sdrh #define OR 3 90275897234Sdrh #define PLUS 4 90375897234Sdrh</pre> 90475897234SdrhYou can insert a statement into the grammar like this: 90575897234Sdrh<pre> 90675897234Sdrh %token_prefix TOKEN_ 90775897234Sdrh</pre> 90875897234Sdrhto cause Lemon to produce these symbols instead: 90975897234Sdrh<pre> 91075897234Sdrh #define TOKEN_AND 1 91175897234Sdrh #define TOKEN_MINUS 2 91275897234Sdrh #define TOKEN_OR 3 91375897234Sdrh #define TOKEN_PLUS 4 91475897234Sdrh</pre> 91575897234Sdrh 9169bccde3dSdrh<a name='token_type'></a><a name='ptype'></a> 91775897234Sdrh<h4>The <tt>%token_type</tt> and <tt>%type</tt> directives</h4> 91875897234Sdrh 91975897234Sdrh<p>These directives are used to specify the data types for values 92075897234Sdrhon the parser's stack associated with terminal and non-terminal 92175897234Sdrhsymbols. The values of all terminal symbols must be of the same 92275897234Sdrhtype. This turns out to be the same data type as the 3rd parameter 92375897234Sdrhto the Parse() function generated by Lemon. Typically, you will 92475897234Sdrhmake the value of a terminal symbol by a pointer to some kind of 92575897234Sdrhtoken structure. Like this:</p> 92675897234Sdrh 92775897234Sdrh<p><pre> 92875897234Sdrh %token_type {Token*} 92975897234Sdrh</pre></p> 93075897234Sdrh 93175897234Sdrh<p>If the data type of terminals is not specified, the default value 932*dfe4e6bbSdrhis "void*".</p> 93375897234Sdrh 93475897234Sdrh<p>Non-terminal symbols can each have their own data types. Typically 93575897234Sdrhthe data type of a non-terminal is a pointer to the root of a parse-tree 93675897234Sdrhstructure that contains all information about that non-terminal. 93775897234SdrhFor example:</p> 93875897234Sdrh 93975897234Sdrh<p><pre> 94075897234Sdrh %type expr {Expr*} 94175897234Sdrh</pre></p> 94275897234Sdrh 94375897234Sdrh<p>Each entry on the parser's stack is actually a union containing 94475897234Sdrhinstances of all data types for every non-terminal and terminal symbol. 94575897234SdrhLemon will automatically use the correct element of this union depending 94675897234Sdrhon what the corresponding non-terminal or terminal symbol is. But 94775897234Sdrhthe grammar designer should keep in mind that the size of the union 94875897234Sdrhwill be the size of its largest element. So if you have a single 94975897234Sdrhnon-terminal whose data type requires 1K of storage, then your 100 95075897234Sdrhentry parser stack will require 100K of heap space. If you are willing 95175897234Sdrhand able to pay that price, fine. You just need to know.</p> 95275897234Sdrh 9539bccde3dSdrh<a name='pwildcard'></a> 9549bccde3dSdrh<h4>The <tt>%wildcard</tt> directive</h4> 9559bccde3dSdrh 9569bccde3dSdrh<p>The %wildcard directive is followed by a single token name and a 9579bccde3dSdrhperiod. This directive specifies that the identified token should 9589bccde3dSdrhmatch any input token. 9599bccde3dSdrh 9609bccde3dSdrh<p>When the generated parser has the choice of matching an input against 9619bccde3dSdrhthe wildcard token and some other token, the other token is always used. 9629bccde3dSdrhThe wildcard token is only matched if there are no other alternatives. 9639bccde3dSdrh 96475897234Sdrh<h3>Error Processing</h3> 96575897234Sdrh 96675897234Sdrh<p>After extensive experimentation over several years, it has been 96775897234Sdrhdiscovered that the error recovery strategy used by yacc is about 96875897234Sdrhas good as it gets. And so that is what Lemon uses.</p> 96975897234Sdrh 97075897234Sdrh<p>When a Lemon-generated parser encounters a syntax error, it 97175897234Sdrhfirst invokes the code specified by the %syntax_error directive, if 97275897234Sdrhany. It then enters its error recovery strategy. The error recovery 97375897234Sdrhstrategy is to begin popping the parsers stack until it enters a 97475897234Sdrhstate where it is permitted to shift a special non-terminal symbol 9759bccde3dSdrhnamed "error". It then shifts this non-terminal and continues 97675897234Sdrhparsing. But the %syntax_error routine will not be called again 97775897234Sdrhuntil at least three new tokens have been successfully shifted.</p> 97875897234Sdrh 97975897234Sdrh<p>If the parser pops its stack until the stack is empty, and it still 98075897234Sdrhis unable to shift the error symbol, then the %parse_failed routine 98175897234Sdrhis invoked and the parser resets itself to its start state, ready 98275897234Sdrhto begin parsing a new file. This is what will happen at the very 98375897234Sdrhfirst syntax error, of course, if there are no instances of the 9849bccde3dSdrh"error" non-terminal in your grammar.</p> 98575897234Sdrh 98675897234Sdrh</body> 98775897234Sdrh</html> 988