1<html> 2<head> 3<title>The Lemon Parser Generator</title> 4</head> 5<body bgcolor=white> 6<h1 align=center>The Lemon Parser Generator</h1> 7 8<p>Lemon is an LALR(1) parser generator for C. 9It does the same job as "bison" and "yacc". 10But lemon is not a bison or yacc clone. Lemon 11uses a different grammar syntax which is designed to 12reduce the number of coding errors. Lemon also uses a 13parsing engine that is faster than yacc and 14bison and which is both reentrant and threadsafe. 15(Update: Since the previous sentence was written, bison 16has also been updated so that it too can generate a 17reentrant and threadsafe parser.) 18Lemon also implements features that can be used 19to eliminate resource leaks, making is suitable for use 20in long-running programs such as graphical user interfaces 21or embedded controllers.</p> 22 23<p>This document is an introduction to the Lemon 24parser generator.</p> 25 26<h2>Security Note</h2> 27 28<p>The language parser code created by Lemon is very robust and 29is well-suited for use in internet-facing applications that need to 30safely process maliciously crafted inputs. 31 32<p>The "lemon.exe" command-line tool itself works great when given a valid 33input grammar file and almost always gives helpful 34error messages for malformed inputs. However, it is possible for 35a malicious user to craft a grammar file that will cause 36lemon.exe to crash. 37We do not see this as a problem, as lemon.exe is not intended to be used 38with hostile inputs. 39To summarize:</p> 40 41<ul> 42<li>Parser code generated by lemon → Robust and secure 43<li>The "lemon.exe" command line tool itself → Not so much 44</ul> 45 46<h2>Theory of Operation</h2> 47 48<p>The main goal of Lemon is to translate a context free grammar (CFG) 49for a particular language into C code that implements a parser for 50that language. 51The program has two inputs: 52<ul> 53<li>The grammar specification. 54<li>A parser template file. 55</ul> 56Typically, only the grammar specification is supplied by the programmer. 57Lemon comes with a default parser template which works fine for most 58applications. But the user is free to substitute a different parser 59template if desired.</p> 60 61<p>Depending on command-line options, Lemon will generate between 62one and three files of outputs. 63<ul> 64<li>C code to implement the parser. 65<li>A header file defining an integer ID for each terminal symbol. 66<li>An information file that describes the states of the generated parser 67 automaton. 68</ul> 69By default, all three of these output files are generated. 70The header file is suppressed if the "-m" command-line option is 71used and the report file is omitted when "-q" is selected.</p> 72 73<p>The grammar specification file uses a ".y" suffix, by convention. 74In the examples used in this document, we'll assume the name of the 75grammar file is "gram.y". A typical use of Lemon would be the 76following command: 77<pre> 78 lemon gram.y 79</pre> 80This command will generate three output files named "gram.c", 81"gram.h" and "gram.out". 82The first is C code to implement the parser. The second 83is the header file that defines numerical values for all 84terminal symbols, and the last is the report that explains 85the states used by the parser automaton.</p> 86 87<h3>Command Line Options</h3> 88 89<p>The behavior of Lemon can be modified using command-line options. 90You can obtain a list of the available command-line options together 91with a brief explanation of what each does by typing 92<pre> 93 lemon -? 94</pre> 95As of this writing, the following command-line options are supported: 96<ul> 97<li><b>-b</b> 98Show only the basis for each parser state in the report file. 99<li><b>-c</b> 100Do not compress the generated action tables. 101<li><b>-D<i>name</i></b> 102Define C preprocessor macro <i>name</i>. This macro is useable by 103"%ifdef" lines in the grammar file. 104<li><b>-g</b> 105Do not generate a parser. Instead write the input grammar to standard 106output with all comments, actions, and other extraneous text removed. 107<li><b>-l</b> 108Omit "#line" directives in the generated parser C code. 109<li><b>-m</b> 110Cause the output C source code to be compatible with the "makeheaders" 111program. 112<li><b>-p</b> 113Display all conflicts that are resolved by 114<a href='#precrules'>precedence rules</a>. 115<li><b>-q</b> 116Suppress generation of the report file. 117<li><b>-r</b> 118Do not sort or renumber the parser states as part of optimization. 119<li><b>-s</b> 120Show parser statistics before existing. 121<li><b>-T<i>file</i></b> 122Use <i>file</i> as the template for the generated C-code parser implementation. 123<li><b>-x</b> 124Print the Lemon version number. 125</ul> 126 127<h3>The Parser Interface</h3> 128 129<p>Lemon doesn't generate a complete, working program. It only generates 130a few subroutines that implement a parser. This section describes 131the interface to those subroutines. It is up to the programmer to 132call these subroutines in an appropriate way in order to produce a 133complete system.</p> 134 135<p>Before a program begins using a Lemon-generated parser, the program 136must first create the parser. 137A new parser is created as follows: 138<pre> 139 void *pParser = ParseAlloc( malloc ); 140</pre> 141The ParseAlloc() routine allocates and initializes a new parser and 142returns a pointer to it. 143The actual data structure used to represent a parser is opaque — 144its internal structure is not visible or usable by the calling routine. 145For this reason, the ParseAlloc() routine returns a pointer to void 146rather than a pointer to some particular structure. 147The sole argument to the ParseAlloc() routine is a pointer to the 148subroutine used to allocate memory. Typically this means malloc().</p> 149 150<p>After a program is finished using a parser, it can reclaim all 151memory allocated by that parser by calling 152<pre> 153 ParseFree(pParser, free); 154</pre> 155The first argument is the same pointer returned by ParseAlloc(). The 156second argument is a pointer to the function used to release bulk 157memory back to the system.</p> 158 159<p>After a parser has been allocated using ParseAlloc(), the programmer 160must supply the parser with a sequence of tokens (terminal symbols) to 161be parsed. This is accomplished by calling the following function 162once for each token: 163<pre> 164 Parse(pParser, hTokenID, sTokenData, pArg); 165</pre> 166The first argument to the Parse() routine is the pointer returned by 167ParseAlloc(). 168The second argument is a small positive integer that tells the parse the 169type of the next token in the data stream. 170There is one token type for each terminal symbol in the grammar. 171The gram.h file generated by Lemon contains #define statements that 172map symbolic terminal symbol names into appropriate integer values. 173A value of 0 for the second argument is a special flag to the 174parser to indicate that the end of input has been reached. 175The third argument is the value of the given token. By default, 176the type of the third argument is integer, but the grammar will 177usually redefine this type to be some kind of structure. 178Typically the second argument will be a broad category of tokens 179such as "identifier" or "number" and the third argument will 180be the name of the identifier or the value of the number.</p> 181 182<p>The Parse() function may have either three or four arguments, 183depending on the grammar. If the grammar specification file requests 184it (via the <a href='#extraarg'><tt>extra_argument</tt> directive</a>), 185the Parse() function will have a fourth parameter that can be 186of any type chosen by the programmer. The parser doesn't do anything 187with this argument except to pass it through to action routines. 188This is a convenient mechanism for passing state information down 189to the action routines without having to use global variables.</p> 190 191<p>A typical use of a Lemon parser might look something like the 192following: 193<pre> 194 01 ParseTree *ParseFile(const char *zFilename){ 195 02 Tokenizer *pTokenizer; 196 03 void *pParser; 197 04 Token sToken; 198 05 int hTokenId; 199 06 ParserState sState; 200 07 201 08 pTokenizer = TokenizerCreate(zFilename); 202 09 pParser = ParseAlloc( malloc ); 203 10 InitParserState(&sState); 204 11 while( GetNextToken(pTokenizer, &hTokenId, &sToken) ){ 205 12 Parse(pParser, hTokenId, sToken, &sState); 206 13 } 207 14 Parse(pParser, 0, sToken, &sState); 208 15 ParseFree(pParser, free ); 209 16 TokenizerFree(pTokenizer); 210 17 return sState.treeRoot; 211 18 } 212</pre> 213This example shows a user-written routine that parses a file of 214text and returns a pointer to the parse tree. 215(All error-handling code is omitted from this example to keep it 216simple.) 217We assume the existence of some kind of tokenizer which is created 218using TokenizerCreate() on line 8 and deleted by TokenizerFree() 219on line 16. The GetNextToken() function on line 11 retrieves the 220next token from the input file and puts its type in the 221integer variable hTokenId. The sToken variable is assumed to be 222some kind of structure that contains details about each token, 223such as its complete text, what line it occurs on, etc. </p> 224 225<p>This example also assumes the existence of structure of type 226ParserState that holds state information about a particular parse. 227An instance of such a structure is created on line 6 and initialized 228on line 10. A pointer to this structure is passed into the Parse() 229routine as the optional 4th argument. 230The action routine specified by the grammar for the parser can use 231the ParserState structure to hold whatever information is useful and 232appropriate. In the example, we note that the treeRoot field of 233the ParserState structure is left pointing to the root of the parse 234tree.</p> 235 236<p>The core of this example as it relates to Lemon is as follows: 237<pre> 238 ParseFile(){ 239 pParser = ParseAlloc( malloc ); 240 while( GetNextToken(pTokenizer,&hTokenId, &sToken) ){ 241 Parse(pParser, hTokenId, sToken); 242 } 243 Parse(pParser, 0, sToken); 244 ParseFree(pParser, free ); 245 } 246</pre> 247Basically, what a program has to do to use a Lemon-generated parser 248is first create the parser, then send it lots of tokens obtained by 249tokenizing an input source. When the end of input is reached, the 250Parse() routine should be called one last time with a token type 251of 0. This step is necessary to inform the parser that the end of 252input has been reached. Finally, we reclaim memory used by the 253parser by calling ParseFree().</p> 254 255<p>There is one other interface routine that should be mentioned 256before we move on. 257The ParseTrace() function can be used to generate debugging output 258from the parser. A prototype for this routine is as follows: 259<pre> 260 ParseTrace(FILE *stream, char *zPrefix); 261</pre> 262After this routine is called, a short (one-line) message is written 263to the designated output stream every time the parser changes states 264or calls an action routine. Each such message is prefaced using 265the text given by zPrefix. This debugging output can be turned off 266by calling ParseTrace() again with a first argument of NULL (0).</p> 267 268<h3>Differences With YACC and BISON</h3> 269 270<p>Programmers who have previously used the yacc or bison parser 271generator will notice several important differences between yacc and/or 272bison and Lemon. 273<ul> 274<li>In yacc and bison, the parser calls the tokenizer. In Lemon, 275 the tokenizer calls the parser. 276<li>Lemon uses no global variables. Yacc and bison use global variables 277 to pass information between the tokenizer and parser. 278<li>Lemon allows multiple parsers to be running simultaneously. Yacc 279 and bison do not. 280</ul> 281These differences may cause some initial confusion for programmers 282with prior yacc and bison experience. 283But after years of experience using Lemon, I firmly 284believe that the Lemon way of doing things is better.</p> 285 286<p><i>Updated as of 2016-02-16:</i> 287The text above was written in the 1990s. 288We are told that Bison has lately been enhanced to support the 289tokenizer-calls-parser paradigm used by Lemon, and to obviate the 290need for global variables.</p> 291 292<h2>Input File Syntax</h2> 293 294<p>The main purpose of the grammar specification file for Lemon is 295to define the grammar for the parser. But the input file also 296specifies additional information Lemon requires to do its job. 297Most of the work in using Lemon is in writing an appropriate 298grammar file.</p> 299 300<p>The grammar file for lemon is, for the most part, free format. 301It does not have sections or divisions like yacc or bison. Any 302declaration can occur at any point in the file. 303Lemon ignores whitespace (except where it is needed to separate 304tokens) and it honors the same commenting conventions as C and C++.</p> 305 306<h3>Terminals and Nonterminals</h3> 307 308<p>A terminal symbol (token) is any string of alphanumeric 309and/or underscore characters 310that begins with an upper case letter. 311A terminal can contain lowercase letters after the first character, 312but the usual convention is to make terminals all upper case. 313A nonterminal, on the other hand, is any string of alphanumeric 314and underscore characters than begins with a lower case letter. 315Again, the usual convention is to make nonterminals use all lower 316case letters.</p> 317 318<p>In Lemon, terminal and nonterminal symbols do not need to 319be declared or identified in a separate section of the grammar file. 320Lemon is able to generate a list of all terminals and nonterminals 321by examining the grammar rules, and it can always distinguish a 322terminal from a nonterminal by checking the case of the first 323character of the name.</p> 324 325<p>Yacc and bison allow terminal symbols to have either alphanumeric 326names or to be individual characters included in single quotes, like 327this: ')' or '$'. Lemon does not allow this alternative form for 328terminal symbols. With Lemon, all symbols, terminals and nonterminals, 329must have alphanumeric names.</p> 330 331<h3>Grammar Rules</h3> 332 333<p>The main component of a Lemon grammar file is a sequence of grammar 334rules. 335Each grammar rule consists of a nonterminal symbol followed by 336the special symbol "::=" and then a list of terminals and/or nonterminals. 337The rule is terminated by a period. 338The list of terminals and nonterminals on the right-hand side of the 339rule can be empty. 340Rules can occur in any order, except that the left-hand side of the 341first rule is assumed to be the start symbol for the grammar (unless 342specified otherwise using the <tt>%start</tt> directive described below.) 343A typical sequence of grammar rules might look something like this: 344<pre> 345 expr ::= expr PLUS expr. 346 expr ::= expr TIMES expr. 347 expr ::= LPAREN expr RPAREN. 348 expr ::= VALUE. 349</pre> 350</p> 351 352<p>There is one non-terminal in this example, "expr", and five 353terminal symbols or tokens: "PLUS", "TIMES", "LPAREN", 354"RPAREN" and "VALUE".</p> 355 356<p>Like yacc and bison, Lemon allows the grammar to specify a block 357of C code that will be executed whenever a grammar rule is reduced 358by the parser. 359In Lemon, this action is specified by putting the C code (contained 360within curly braces <tt>{...}</tt>) immediately after the 361period that closes the rule. 362For example: 363<pre> 364 expr ::= expr PLUS expr. { printf("Doing an addition...\n"); } 365</pre> 366</p> 367 368<p>In order to be useful, grammar actions must normally be linked to 369their associated grammar rules. 370In yacc and bison, this is accomplished by embedding a "$$" in the 371action to stand for the value of the left-hand side of the rule and 372symbols "$1", "$2", and so forth to stand for the value of 373the terminal or nonterminal at position 1, 2 and so forth on the 374right-hand side of the rule. 375This idea is very powerful, but it is also very error-prone. The 376single most common source of errors in a yacc or bison grammar is 377to miscount the number of symbols on the right-hand side of a grammar 378rule and say "$7" when you really mean "$8".</p> 379 380<p>Lemon avoids the need to count grammar symbols by assigning symbolic 381names to each symbol in a grammar rule and then using those symbolic 382names in the action. 383In yacc or bison, one would write this: 384<pre> 385 expr -> expr PLUS expr { $$ = $1 + $3; }; 386</pre> 387But in Lemon, the same rule becomes the following: 388<pre> 389 expr(A) ::= expr(B) PLUS expr(C). { A = B+C; } 390</pre> 391In the Lemon rule, any symbol in parentheses after a grammar rule 392symbol becomes a place holder for that symbol in the grammar rule. 393This place holder can then be used in the associated C action to 394stand for the value of that symbol.<p> 395 396<p>The Lemon notation for linking a grammar rule with its reduce 397action is superior to yacc/bison on several counts. 398First, as mentioned above, the Lemon method avoids the need to 399count grammar symbols. 400Secondly, if a terminal or nonterminal in a Lemon grammar rule 401includes a linking symbol in parentheses but that linking symbol 402is not actually used in the reduce action, then an error message 403is generated. 404For example, the rule 405<pre> 406 expr(A) ::= expr(B) PLUS expr(C). { A = B; } 407</pre> 408will generate an error because the linking symbol "C" is used 409in the grammar rule but not in the reduce action.</p> 410 411<p>The Lemon notation for linking grammar rules to reduce actions 412also facilitates the use of destructors for reclaiming memory 413allocated by the values of terminals and nonterminals on the 414right-hand side of a rule.</p> 415 416<a name='precrules'></a> 417<h3>Precedence Rules</h3> 418 419<p>Lemon resolves parsing ambiguities in exactly the same way as 420yacc and bison. A shift-reduce conflict is resolved in favor 421of the shift, and a reduce-reduce conflict is resolved by reducing 422whichever rule comes first in the grammar file.</p> 423 424<p>Just like in 425yacc and bison, Lemon allows a measure of control 426over the resolution of paring conflicts using precedence rules. 427A precedence value can be assigned to any terminal symbol 428using the 429<a href='#pleft'>%left</a>, 430<a href='#pright'>%right</a> or 431<a href='#pnonassoc'>%nonassoc</a> directives. Terminal symbols 432mentioned in earlier directives have a lower precedence that 433terminal symbols mentioned in later directives. For example:</p> 434 435<p><pre> 436 %left AND. 437 %left OR. 438 %nonassoc EQ NE GT GE LT LE. 439 %left PLUS MINUS. 440 %left TIMES DIVIDE MOD. 441 %right EXP NOT. 442</pre></p> 443 444<p>In the preceding sequence of directives, the AND operator is 445defined to have the lowest precedence. The OR operator is one 446precedence level higher. And so forth. Hence, the grammar would 447attempt to group the ambiguous expression 448<pre> 449 a AND b OR c 450</pre> 451like this 452<pre> 453 a AND (b OR c). 454</pre> 455The associativity (left, right or nonassoc) is used to determine 456the grouping when the precedence is the same. AND is left-associative 457in our example, so 458<pre> 459 a AND b AND c 460</pre> 461is parsed like this 462<pre> 463 (a AND b) AND c. 464</pre> 465The EXP operator is right-associative, though, so 466<pre> 467 a EXP b EXP c 468</pre> 469is parsed like this 470<pre> 471 a EXP (b EXP c). 472</pre> 473The nonassoc precedence is used for non-associative operators. 474So 475<pre> 476 a EQ b EQ c 477</pre> 478is an error.</p> 479 480<p>The precedence of non-terminals is transferred to rules as follows: 481The precedence of a grammar rule is equal to the precedence of the 482left-most terminal symbol in the rule for which a precedence is 483defined. This is normally what you want, but in those cases where 484you want to precedence of a grammar rule to be something different, 485you can specify an alternative precedence symbol by putting the 486symbol in square braces after the period at the end of the rule and 487before any C-code. For example:</p> 488 489<p><pre> 490 expr = MINUS expr. [NOT] 491</pre></p> 492 493<p>This rule has a precedence equal to that of the NOT symbol, not the 494MINUS symbol as would have been the case by default.</p> 495 496<p>With the knowledge of how precedence is assigned to terminal 497symbols and individual 498grammar rules, we can now explain precisely how parsing conflicts 499are resolved in Lemon. Shift-reduce conflicts are resolved 500as follows: 501<ul> 502<li> If either the token to be shifted or the rule to be reduced 503 lacks precedence information, then resolve in favor of the 504 shift, but report a parsing conflict. 505<li> If the precedence of the token to be shifted is greater than 506 the precedence of the rule to reduce, then resolve in favor 507 of the shift. No parsing conflict is reported. 508<li> If the precedence of the token it be shifted is less than the 509 precedence of the rule to reduce, then resolve in favor of the 510 reduce action. No parsing conflict is reported. 511<li> If the precedences are the same and the shift token is 512 right-associative, then resolve in favor of the shift. 513 No parsing conflict is reported. 514<li> If the precedences are the same the shift token is 515 left-associative, then resolve in favor of the reduce. 516 No parsing conflict is reported. 517<li> Otherwise, resolve the conflict by doing the shift and 518 report the parsing conflict. 519</ul> 520Reduce-reduce conflicts are resolved this way: 521<ul> 522<li> If either reduce rule 523 lacks precedence information, then resolve in favor of the 524 rule that appears first in the grammar and report a parsing 525 conflict. 526<li> If both rules have precedence and the precedence is different 527 then resolve the dispute in favor of the rule with the highest 528 precedence and do not report a conflict. 529<li> Otherwise, resolve the conflict by reducing by the rule that 530 appears first in the grammar and report a parsing conflict. 531</ul> 532 533<h3>Special Directives</h3> 534 535<p>The input grammar to Lemon consists of grammar rules and special 536directives. We've described all the grammar rules, so now we'll 537talk about the special directives.</p> 538 539<p>Directives in lemon can occur in any order. You can put them before 540the grammar rules, or after the grammar rules, or in the mist of the 541grammar rules. It doesn't matter. The relative order of 542directives used to assign precedence to terminals is important, but 543other than that, the order of directives in Lemon is arbitrary.</p> 544 545<p>Lemon supports the following special directives: 546<ul> 547<li><tt>%code</tt> 548<li><tt>%default_destructor</tt> 549<li><tt>%default_type</tt> 550<li><tt>%destructor</tt> 551<li><tt>%endif</tt> 552<li><tt>%extra_argument</tt> 553<li><tt>%fallback</tt> 554<li><tt>%ifdef</tt> 555<li><tt>%ifndef</tt> 556<li><tt>%include</tt> 557<li><tt>%left</tt> 558<li><tt>%name</tt> 559<li><tt>%nonassoc</tt> 560<li><tt>%parse_accept</tt> 561<li><tt>%parse_failure </tt> 562<li><tt>%right</tt> 563<li><tt>%stack_overflow</tt> 564<li><tt>%stack_size</tt> 565<li><tt>%start_symbol</tt> 566<li><tt>%syntax_error</tt> 567<li><tt>%token_class</tt> 568<li><tt>%token_destructor</tt> 569<li><tt>%token_prefix</tt> 570<li><tt>%token_type</tt> 571<li><tt>%type</tt> 572<li><tt>%wildcard</tt> 573</ul> 574Each of these directives will be described separately in the 575following sections:</p> 576 577<a name='pcode'></a> 578<h4>The <tt>%code</tt> directive</h4> 579 580<p>The %code directive is used to specify addition C code that 581is added to the end of the main output file. This is similar to 582the <a href='#pinclude'>%include</a> directive except that %include 583is inserted at the beginning of the main output file.</p> 584 585<p>%code is typically used to include some action routines or perhaps 586a tokenizer or even the "main()" function 587as part of the output file.</p> 588 589<a name='default_destructor'></a> 590<h4>The <tt>%default_destructor</tt> directive</h4> 591 592<p>The %default_destructor directive specifies a destructor to 593use for non-terminals that do not have their own destructor 594specified by a separate %destructor directive. See the documentation 595on the <a name='#destructor'>%destructor</a> directive below for 596additional information.</p> 597 598<p>In some grammers, many different non-terminal symbols have the 599same datatype and hence the same destructor. This directive is 600a convenience way to specify the same destructor for all those 601non-terminals using a single statement.</p> 602 603<a name='default_type'></a> 604<h4>The <tt>%default_type</tt> directive</h4> 605 606<p>The %default_type directive specifies the datatype of non-terminal 607symbols that do no have their own datatype defined using a separate 608<a href='#ptype'>%type</a> directive. 609</p> 610 611<a name='destructor'></a> 612<h4>The <tt>%destructor</tt> directive</h4> 613 614<p>The %destructor directive is used to specify a destructor for 615a non-terminal symbol. 616(See also the <a href='#token_destructor'>%token_destructor</a> 617directive which is used to specify a destructor for terminal symbols.)</p> 618 619<p>A non-terminal's destructor is called to dispose of the 620non-terminal's value whenever the non-terminal is popped from 621the stack. This includes all of the following circumstances: 622<ul> 623<li> When a rule reduces and the value of a non-terminal on 624 the right-hand side is not linked to C code. 625<li> When the stack is popped during error processing. 626<li> When the ParseFree() function runs. 627</ul> 628The destructor can do whatever it wants with the value of 629the non-terminal, but its design is to deallocate memory 630or other resources held by that non-terminal.</p> 631 632<p>Consider an example: 633<pre> 634 %type nt {void*} 635 %destructor nt { free($$); } 636 nt(A) ::= ID NUM. { A = malloc( 100 ); } 637</pre> 638This example is a bit contrived but it serves to illustrate how 639destructors work. The example shows a non-terminal named 640"nt" that holds values of type "void*". When the rule for 641an "nt" reduces, it sets the value of the non-terminal to 642space obtained from malloc(). Later, when the nt non-terminal 643is popped from the stack, the destructor will fire and call 644free() on this malloced space, thus avoiding a memory leak. 645(Note that the symbol "$$" in the destructor code is replaced 646by the value of the non-terminal.)</p> 647 648<p>It is important to note that the value of a non-terminal is passed 649to the destructor whenever the non-terminal is removed from the 650stack, unless the non-terminal is used in a C-code action. If 651the non-terminal is used by C-code, then it is assumed that the 652C-code will take care of destroying it. 653More commonly, the value is used to build some 654larger structure and we don't want to destroy it, which is why 655the destructor is not called in this circumstance.</p> 656 657<p>Destructors help avoid memory leaks by automatically freeing 658allocated objects when they go out of scope. 659To do the same using yacc or bison is much more difficult.</p> 660 661<a name="extraarg"></a> 662<h4>The <tt>%extra_argument</tt> directive</h4> 663 664The %extra_argument directive instructs Lemon to add a 4th parameter 665to the parameter list of the Parse() function it generates. Lemon 666doesn't do anything itself with this extra argument, but it does 667make the argument available to C-code action routines, destructors, 668and so forth. For example, if the grammar file contains:</p> 669 670<p><pre> 671 %extra_argument { MyStruct *pAbc } 672</pre></p> 673 674<p>Then the Parse() function generated will have an 4th parameter 675of type "MyStruct*" and all action routines will have access to 676a variable named "pAbc" that is the value of the 4th parameter 677in the most recent call to Parse().</p> 678 679<a name='pfallback'></a> 680<h4>The <tt>%fallback</tt> directive</h4> 681 682<p>The %fallback directive specifies an alternative meaning for one 683or more tokens. The alternative meaning is tried if the original token 684would have generated a syntax error. 685 686<p>The %fallback directive was added to support robust parsing of SQL 687syntax in <a href="https://www.sqlite.org/">SQLite</a>. 688The SQL language contains a large assortment of keywords, each of which 689appears as a different token to the language parser. SQL contains so 690many keywords, that it can be difficult for programmers to keep up with 691them all. Programmers will, therefore, sometimes mistakenly use an 692obscure language keyword for an identifier. The %fallback directive 693provides a mechanism to tell the parser: "If you are unable to parse 694this keyword, try treating it as an identifier instead." 695 696<p>The syntax of %fallback is as follows: 697 698<blockquote> 699<tt>%fallback</tt> <i>ID</i> <i>TOKEN...</i> <b>.</b> 700</blockquote> 701 702<p>In words, the %fallback directive is followed by a list of token names 703terminated by a period. The first token name is the fallback token - the 704token to which all the other tokens fall back to. The second and subsequent 705arguments are tokens which fall back to the token identified by the first 706argument. 707 708<a name='pifdef'></a> 709<h4>The <tt>%ifdef</tt>, <tt>%ifndef</tt>, and <tt>%endif</tt> directives.</h4> 710 711<p>The %ifdef, %ifndef, and %endif directives are similar to 712#ifdef, #ifndef, and #endif in the C-preprocessor, just not as general. 713Each of these directives must begin at the left margin. No whitespace 714is allowed between the "%" and the directive name. 715 716<p>Grammar text in between "%ifdef MACRO" and the next nested "%endif" is 717ignored unless the "-DMACRO" command-line option is used. Grammar text 718betwen "%ifndef MACRO" and the next nested "%endif" is included except when 719the "-DMACRO" command-line option is used. 720 721<p>Note that the argument to %ifdef and %ifndef must be a single 722preprocessor symbol name, not a general expression. There is no "%else" 723directive. 724 725 726<a name='pinclude'></a> 727<h4>The <tt>%include</tt> directive</h4> 728 729<p>The %include directive specifies C code that is included at the 730top of the generated parser. You can include any text you want -- 731the Lemon parser generator copies it blindly. If you have multiple 732%include directives in your grammar file, their values are concatenated 733so that all %include code ultimately appears near the top of the 734generated parser, in the same order as it appeared in the grammer.</p> 735 736<p>The %include directive is very handy for getting some extra #include 737preprocessor statements at the beginning of the generated parser. 738For example:</p> 739 740<p><pre> 741 %include {#include <unistd.h>} 742</pre></p> 743 744<p>This might be needed, for example, if some of the C actions in the 745grammar call functions that are prototyed in unistd.h.</p> 746 747<a name='pleft'></a> 748<h4>The <tt>%left</tt> directive</h4> 749 750The %left directive is used (along with the <a href='#pright'>%right</a> and 751<a href='#pnonassoc'>%nonassoc</a> directives) to declare precedences of 752terminal symbols. Every terminal symbol whose name appears after 753a %left directive but before the next period (".") is 754given the same left-associative precedence value. Subsequent 755%left directives have higher precedence. For example:</p> 756 757<p><pre> 758 %left AND. 759 %left OR. 760 %nonassoc EQ NE GT GE LT LE. 761 %left PLUS MINUS. 762 %left TIMES DIVIDE MOD. 763 %right EXP NOT. 764</pre></p> 765 766<p>Note the period that terminates each %left, %right or %nonassoc 767directive.</p> 768 769<p>LALR(1) grammars can get into a situation where they require 770a large amount of stack space if you make heavy use or right-associative 771operators. For this reason, it is recommended that you use %left 772rather than %right whenever possible.</p> 773 774<a name='pname'></a> 775<h4>The <tt>%name</tt> directive</h4> 776 777<p>By default, the functions generated by Lemon all begin with the 778five-character string "Parse". You can change this string to something 779different using the %name directive. For instance:</p> 780 781<p><pre> 782 %name Abcde 783</pre></p> 784 785<p>Putting this directive in the grammar file will cause Lemon to generate 786functions named 787<ul> 788<li> AbcdeAlloc(), 789<li> AbcdeFree(), 790<li> AbcdeTrace(), and 791<li> Abcde(). 792</ul> 793The %name directive allows you to generator two or more different 794parsers and link them all into the same executable. 795</p> 796 797<a name='pnonassoc'></a> 798<h4>The <tt>%nonassoc</tt> directive</h4> 799 800<p>This directive is used to assign non-associative precedence to 801one or more terminal symbols. See the section on 802<a href='#precrules'>precedence rules</a> 803or on the <a href='#pleft'>%left</a> directive for additional information.</p> 804 805<a name='parse_accept'></a> 806<h4>The <tt>%parse_accept</tt> directive</h4> 807 808<p>The %parse_accept directive specifies a block of C code that is 809executed whenever the parser accepts its input string. To "accept" 810an input string means that the parser was able to process all tokens 811without error.</p> 812 813<p>For example:</p> 814 815<p><pre> 816 %parse_accept { 817 printf("parsing complete!\n"); 818 } 819</pre></p> 820 821<a name='parse_failure'></a> 822<h4>The <tt>%parse_failure</tt> directive</h4> 823 824<p>The %parse_failure directive specifies a block of C code that 825is executed whenever the parser fails complete. This code is not 826executed until the parser has tried and failed to resolve an input 827error using is usual error recovery strategy. The routine is 828only invoked when parsing is unable to continue.</p> 829 830<p><pre> 831 %parse_failure { 832 fprintf(stderr,"Giving up. Parser is hopelessly lost...\n"); 833 } 834</pre></p> 835 836<a name='pright'></a> 837<h4>The <tt>%right</tt> directive</h4> 838 839<p>This directive is used to assign right-associative precedence to 840one or more terminal symbols. See the section on 841<a href='#precrules'>precedence rules</a> 842or on the <a href='#pleft'>%left</a> directive for additional information.</p> 843 844<a name='stack_overflow'></a> 845<h4>The <tt>%stack_overflow</tt> directive</h4> 846 847<p>The %stack_overflow directive specifies a block of C code that 848is executed if the parser's internal stack ever overflows. Typically 849this just prints an error message. After a stack overflow, the parser 850will be unable to continue and must be reset.</p> 851 852<p><pre> 853 %stack_overflow { 854 fprintf(stderr,"Giving up. Parser stack overflow\n"); 855 } 856</pre></p> 857 858<p>You can help prevent parser stack overflows by avoiding the use 859of right recursion and right-precedence operators in your grammar. 860Use left recursion and and left-precedence operators instead, to 861encourage rules to reduce sooner and keep the stack size down. 862For example, do rules like this: 863<pre> 864 list ::= list element. // left-recursion. Good! 865 list ::= . 866</pre> 867Not like this: 868<pre> 869 list ::= element list. // right-recursion. Bad! 870 list ::= . 871</pre> 872 873<a name='stack_size'></a> 874<h4>The <tt>%stack_size</tt> directive</h4> 875 876<p>If stack overflow is a problem and you can't resolve the trouble 877by using left-recursion, then you might want to increase the size 878of the parser's stack using this directive. Put an positive integer 879after the %stack_size directive and Lemon will generate a parse 880with a stack of the requested size. The default value is 100.</p> 881 882<p><pre> 883 %stack_size 2000 884</pre></p> 885 886<a name='start_symbol'></a> 887<h4>The <tt>%start_symbol</tt> directive</h4> 888 889<p>By default, the start-symbol for the grammar that Lemon generates 890is the first non-terminal that appears in the grammar file. But you 891can choose a different start-symbol using the %start_symbol directive.</p> 892 893<p><pre> 894 %start_symbol prog 895</pre></p> 896 897<a name='token_destructor'></a> 898<h4>The <tt>%token_destructor</tt> directive</h4> 899 900<p>The %destructor directive assigns a destructor to a non-terminal 901symbol. (See the description of the %destructor directive above.) 902This directive does the same thing for all terminal symbols.</p> 903 904<p>Unlike non-terminal symbols which may each have a different data type 905for their values, terminals all use the same data type (defined by 906the %token_type directive) and so they use a common destructor. Other 907than that, the token destructor works just like the non-terminal 908destructors.</p> 909 910<a name='token_prefix'></a> 911<h4>The <tt>%token_prefix</tt> directive</h4> 912 913<p>Lemon generates #defines that assign small integer constants 914to each terminal symbol in the grammar. If desired, Lemon will 915add a prefix specified by this directive 916to each of the #defines it generates. 917So if the default output of Lemon looked like this: 918<pre> 919 #define AND 1 920 #define MINUS 2 921 #define OR 3 922 #define PLUS 4 923</pre> 924You can insert a statement into the grammar like this: 925<pre> 926 %token_prefix TOKEN_ 927</pre> 928to cause Lemon to produce these symbols instead: 929<pre> 930 #define TOKEN_AND 1 931 #define TOKEN_MINUS 2 932 #define TOKEN_OR 3 933 #define TOKEN_PLUS 4 934</pre> 935 936<a name='token_type'></a><a name='ptype'></a> 937<h4>The <tt>%token_type</tt> and <tt>%type</tt> directives</h4> 938 939<p>These directives are used to specify the data types for values 940on the parser's stack associated with terminal and non-terminal 941symbols. The values of all terminal symbols must be of the same 942type. This turns out to be the same data type as the 3rd parameter 943to the Parse() function generated by Lemon. Typically, you will 944make the value of a terminal symbol by a pointer to some kind of 945token structure. Like this:</p> 946 947<p><pre> 948 %token_type {Token*} 949</pre></p> 950 951<p>If the data type of terminals is not specified, the default value 952is "void*".</p> 953 954<p>Non-terminal symbols can each have their own data types. Typically 955the data type of a non-terminal is a pointer to the root of a parse-tree 956structure that contains all information about that non-terminal. 957For example:</p> 958 959<p><pre> 960 %type expr {Expr*} 961</pre></p> 962 963<p>Each entry on the parser's stack is actually a union containing 964instances of all data types for every non-terminal and terminal symbol. 965Lemon will automatically use the correct element of this union depending 966on what the corresponding non-terminal or terminal symbol is. But 967the grammar designer should keep in mind that the size of the union 968will be the size of its largest element. So if you have a single 969non-terminal whose data type requires 1K of storage, then your 100 970entry parser stack will require 100K of heap space. If you are willing 971and able to pay that price, fine. You just need to know.</p> 972 973<a name='pwildcard'></a> 974<h4>The <tt>%wildcard</tt> directive</h4> 975 976<p>The %wildcard directive is followed by a single token name and a 977period. This directive specifies that the identified token should 978match any input token. 979 980<p>When the generated parser has the choice of matching an input against 981the wildcard token and some other token, the other token is always used. 982The wildcard token is only matched if there are no other alternatives. 983 984<h3>Error Processing</h3> 985 986<p>After extensive experimentation over several years, it has been 987discovered that the error recovery strategy used by yacc is about 988as good as it gets. And so that is what Lemon uses.</p> 989 990<p>When a Lemon-generated parser encounters a syntax error, it 991first invokes the code specified by the %syntax_error directive, if 992any. It then enters its error recovery strategy. The error recovery 993strategy is to begin popping the parsers stack until it enters a 994state where it is permitted to shift a special non-terminal symbol 995named "error". It then shifts this non-terminal and continues 996parsing. But the %syntax_error routine will not be called again 997until at least three new tokens have been successfully shifted.</p> 998 999<p>If the parser pops its stack until the stack is empty, and it still 1000is unable to shift the error symbol, then the %parse_failed routine 1001is invoked and the parser resets itself to its start state, ready 1002to begin parsing a new file. This is what will happen at the very 1003first syntax error, of course, if there are no instances of the 1004"error" non-terminal in your grammar.</p> 1005 1006</body> 1007</html> 1008