1d80f118eSChris Lattner=========================================== 2d80f118eSChris LattnerKaleidoscope: Implementing a Parser and AST 3d80f118eSChris Lattner=========================================== 4d80f118eSChris Lattner 5d80f118eSChris Lattner.. contents:: 6d80f118eSChris Lattner :local: 7d80f118eSChris Lattner 8d80f118eSChris LattnerChapter 2 Introduction 9d80f118eSChris Lattner====================== 10d80f118eSChris Lattner 11d80f118eSChris LattnerWelcome to Chapter 2 of the "`Implementing a language with 12d80f118eSChris LattnerLLVM <index.html>`_" tutorial. This chapter shows you how to use the 13d80f118eSChris Lattnerlexer, built in `Chapter 1 <LangImpl01.html>`_, to build a full 14d80f118eSChris Lattner`parser <http://en.wikipedia.org/wiki/Parsing>`_ for our Kaleidoscope 15d80f118eSChris Lattnerlanguage. Once we have a parser, we'll define and build an `Abstract 16d80f118eSChris LattnerSyntax Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST). 17d80f118eSChris Lattner 18d80f118eSChris LattnerThe parser we will build uses a combination of `Recursive Descent 19d80f118eSChris LattnerParsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and 20d80f118eSChris Lattner`Operator-Precedence 21d80f118eSChris LattnerParsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to 22d80f118eSChris Lattnerparse the Kaleidoscope language (the latter for binary expressions and 23d80f118eSChris Lattnerthe former for everything else). Before we get to parsing though, let's 24d80f118eSChris Lattnertalk about the output of the parser: the Abstract Syntax Tree. 25d80f118eSChris Lattner 26d80f118eSChris LattnerThe Abstract Syntax Tree (AST) 27d80f118eSChris Lattner============================== 28d80f118eSChris Lattner 29d80f118eSChris LattnerThe AST for a program captures its behavior in such a way that it is 30d80f118eSChris Lattnereasy for later stages of the compiler (e.g. code generation) to 31d80f118eSChris Lattnerinterpret. We basically want one object for each construct in the 32d80f118eSChris Lattnerlanguage, and the AST should closely model the language. In 33d80f118eSChris LattnerKaleidoscope, we have expressions, a prototype, and a function object. 34d80f118eSChris LattnerWe'll start with expressions first: 35d80f118eSChris Lattner 36d80f118eSChris Lattner.. code-block:: c++ 37d80f118eSChris Lattner 38d80f118eSChris Lattner /// ExprAST - Base class for all expression nodes. 39d80f118eSChris Lattner class ExprAST { 40d80f118eSChris Lattner public: 41d80f118eSChris Lattner virtual ~ExprAST() {} 42d80f118eSChris Lattner }; 43d80f118eSChris Lattner 44d80f118eSChris Lattner /// NumberExprAST - Expression class for numeric literals like "1.0". 45d80f118eSChris Lattner class NumberExprAST : public ExprAST { 46d80f118eSChris Lattner double Val; 47d80f118eSChris Lattner 48d80f118eSChris Lattner public: 49d80f118eSChris Lattner NumberExprAST(double Val) : Val(Val) {} 50d80f118eSChris Lattner }; 51d80f118eSChris Lattner 52d80f118eSChris LattnerThe code above shows the definition of the base ExprAST class and one 53d80f118eSChris Lattnersubclass which we use for numeric literals. The important thing to note 54d80f118eSChris Lattnerabout this code is that the NumberExprAST class captures the numeric 55d80f118eSChris Lattnervalue of the literal as an instance variable. This allows later phases 56d80f118eSChris Lattnerof the compiler to know what the stored numeric value is. 57d80f118eSChris Lattner 58d80f118eSChris LattnerRight now we only create the AST, so there are no useful accessor 59d80f118eSChris Lattnermethods on them. It would be very easy to add a virtual method to pretty 60d80f118eSChris Lattnerprint the code, for example. Here are the other expression AST node 61d80f118eSChris Lattnerdefinitions that we'll use in the basic form of the Kaleidoscope 62d80f118eSChris Lattnerlanguage: 63d80f118eSChris Lattner 64d80f118eSChris Lattner.. code-block:: c++ 65d80f118eSChris Lattner 66d80f118eSChris Lattner /// VariableExprAST - Expression class for referencing a variable, like "a". 67d80f118eSChris Lattner class VariableExprAST : public ExprAST { 68d80f118eSChris Lattner std::string Name; 69d80f118eSChris Lattner 70d80f118eSChris Lattner public: 71d80f118eSChris Lattner VariableExprAST(const std::string &Name) : Name(Name) {} 72d80f118eSChris Lattner }; 73d80f118eSChris Lattner 74d80f118eSChris Lattner /// BinaryExprAST - Expression class for a binary operator. 75d80f118eSChris Lattner class BinaryExprAST : public ExprAST { 76d80f118eSChris Lattner char Op; 77d80f118eSChris Lattner std::unique_ptr<ExprAST> LHS, RHS; 78d80f118eSChris Lattner 79d80f118eSChris Lattner public: 80d80f118eSChris Lattner BinaryExprAST(char op, std::unique_ptr<ExprAST> LHS, 81d80f118eSChris Lattner std::unique_ptr<ExprAST> RHS) 82d80f118eSChris Lattner : Op(op), LHS(std::move(LHS)), RHS(std::move(RHS)) {} 83d80f118eSChris Lattner }; 84d80f118eSChris Lattner 85d80f118eSChris Lattner /// CallExprAST - Expression class for function calls. 86d80f118eSChris Lattner class CallExprAST : public ExprAST { 87d80f118eSChris Lattner std::string Callee; 88d80f118eSChris Lattner std::vector<std::unique_ptr<ExprAST>> Args; 89d80f118eSChris Lattner 90d80f118eSChris Lattner public: 91d80f118eSChris Lattner CallExprAST(const std::string &Callee, 92d80f118eSChris Lattner std::vector<std::unique_ptr<ExprAST>> Args) 93d80f118eSChris Lattner : Callee(Callee), Args(std::move(Args)) {} 94d80f118eSChris Lattner }; 95d80f118eSChris Lattner 96d80f118eSChris LattnerThis is all (intentionally) rather straight-forward: variables capture 97d80f118eSChris Lattnerthe variable name, binary operators capture their opcode (e.g. '+'), and 98d80f118eSChris Lattnercalls capture a function name as well as a list of any argument 99d80f118eSChris Lattnerexpressions. One thing that is nice about our AST is that it captures 100d80f118eSChris Lattnerthe language features without talking about the syntax of the language. 101d80f118eSChris LattnerNote that there is no discussion about precedence of binary operators, 102d80f118eSChris Lattnerlexical structure, etc. 103d80f118eSChris Lattner 104d80f118eSChris LattnerFor our basic language, these are all of the expression nodes we'll 105d80f118eSChris Lattnerdefine. Because it doesn't have conditional control flow, it isn't 106d80f118eSChris LattnerTuring-complete; we'll fix that in a later installment. The two things 107d80f118eSChris Lattnerwe need next are a way to talk about the interface to a function, and a 108d80f118eSChris Lattnerway to talk about functions themselves: 109d80f118eSChris Lattner 110d80f118eSChris Lattner.. code-block:: c++ 111d80f118eSChris Lattner 112d80f118eSChris Lattner /// PrototypeAST - This class represents the "prototype" for a function, 113d80f118eSChris Lattner /// which captures its name, and its argument names (thus implicitly the number 114d80f118eSChris Lattner /// of arguments the function takes). 115d80f118eSChris Lattner class PrototypeAST { 116d80f118eSChris Lattner std::string Name; 117d80f118eSChris Lattner std::vector<std::string> Args; 118d80f118eSChris Lattner 119d80f118eSChris Lattner public: 120d80f118eSChris Lattner PrototypeAST(const std::string &name, std::vector<std::string> Args) 121d80f118eSChris Lattner : Name(name), Args(std::move(Args)) {} 122d80f118eSChris Lattner 123d80f118eSChris Lattner const std::string &getName() const { return Name; } 124d80f118eSChris Lattner }; 125d80f118eSChris Lattner 126d80f118eSChris Lattner /// FunctionAST - This class represents a function definition itself. 127d80f118eSChris Lattner class FunctionAST { 128d80f118eSChris Lattner std::unique_ptr<PrototypeAST> Proto; 129d80f118eSChris Lattner std::unique_ptr<ExprAST> Body; 130d80f118eSChris Lattner 131d80f118eSChris Lattner public: 132d80f118eSChris Lattner FunctionAST(std::unique_ptr<PrototypeAST> Proto, 133d80f118eSChris Lattner std::unique_ptr<ExprAST> Body) 134d80f118eSChris Lattner : Proto(std::move(Proto)), Body(std::move(Body)) {} 135d80f118eSChris Lattner }; 136d80f118eSChris Lattner 137d80f118eSChris LattnerIn Kaleidoscope, functions are typed with just a count of their 138d80f118eSChris Lattnerarguments. Since all values are double precision floating point, the 139d80f118eSChris Lattnertype of each argument doesn't need to be stored anywhere. In a more 140d80f118eSChris Lattneraggressive and realistic language, the "ExprAST" class would probably 141d80f118eSChris Lattnerhave a type field. 142d80f118eSChris Lattner 143d80f118eSChris LattnerWith this scaffolding, we can now talk about parsing expressions and 144d80f118eSChris Lattnerfunction bodies in Kaleidoscope. 145d80f118eSChris Lattner 146d80f118eSChris LattnerParser Basics 147d80f118eSChris Lattner============= 148d80f118eSChris Lattner 149d80f118eSChris LattnerNow that we have an AST to build, we need to define the parser code to 150d80f118eSChris Lattnerbuild it. The idea here is that we want to parse something like "x+y" 151d80f118eSChris Lattner(which is returned as three tokens by the lexer) into an AST that could 152d80f118eSChris Lattnerbe generated with calls like this: 153d80f118eSChris Lattner 154d80f118eSChris Lattner.. code-block:: c++ 155d80f118eSChris Lattner 1560eaee545SJonas Devlieghere auto LHS = std::make_unique<VariableExprAST>("x"); 1570eaee545SJonas Devlieghere auto RHS = std::make_unique<VariableExprAST>("y"); 158d80f118eSChris Lattner auto Result = std::make_unique<BinaryExprAST>('+', std::move(LHS), 159d80f118eSChris Lattner std::move(RHS)); 160d80f118eSChris Lattner 161d80f118eSChris LattnerIn order to do this, we'll start by defining some basic helper routines: 162d80f118eSChris Lattner 163d80f118eSChris Lattner.. code-block:: c++ 164d80f118eSChris Lattner 165d80f118eSChris Lattner /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current 166d80f118eSChris Lattner /// token the parser is looking at. getNextToken reads another token from the 167d80f118eSChris Lattner /// lexer and updates CurTok with its results. 168d80f118eSChris Lattner static int CurTok; 169d80f118eSChris Lattner static int getNextToken() { 170d80f118eSChris Lattner return CurTok = gettok(); 171d80f118eSChris Lattner } 172d80f118eSChris Lattner 173d80f118eSChris LattnerThis implements a simple token buffer around the lexer. This allows us 174d80f118eSChris Lattnerto look one token ahead at what the lexer is returning. Every function 175d80f118eSChris Lattnerin our parser will assume that CurTok is the current token that needs to 176d80f118eSChris Lattnerbe parsed. 177d80f118eSChris Lattner 178d80f118eSChris Lattner.. code-block:: c++ 179d80f118eSChris Lattner 180d80f118eSChris Lattner 181d80f118eSChris Lattner /// LogError* - These are little helper functions for error handling. 182d80f118eSChris Lattner std::unique_ptr<ExprAST> LogError(const char *Str) { 183d80f118eSChris Lattner fprintf(stderr, "LogError: %s\n", Str); 184d80f118eSChris Lattner return nullptr; 185d80f118eSChris Lattner } 186d80f118eSChris Lattner std::unique_ptr<PrototypeAST> LogErrorP(const char *Str) { 187d80f118eSChris Lattner LogError(Str); 188d80f118eSChris Lattner return nullptr; 189d80f118eSChris Lattner } 190d80f118eSChris Lattner 191d80f118eSChris LattnerThe ``LogError`` routines are simple helper routines that our parser will 192d80f118eSChris Lattneruse to handle errors. The error recovery in our parser will not be the 193d80f118eSChris Lattnerbest and is not particular user-friendly, but it will be enough for our 194d80f118eSChris Lattnertutorial. These routines make it easier to handle errors in routines 195d80f118eSChris Lattnerthat have various return types: they always return null. 196d80f118eSChris Lattner 197d80f118eSChris LattnerWith these basic helper functions, we can implement the first piece of 198d80f118eSChris Lattnerour grammar: numeric literals. 199d80f118eSChris Lattner 200d80f118eSChris LattnerBasic Expression Parsing 201d80f118eSChris Lattner======================== 202d80f118eSChris Lattner 203d80f118eSChris LattnerWe start with numeric literals, because they are the simplest to 204d80f118eSChris Lattnerprocess. For each production in our grammar, we'll define a function 205d80f118eSChris Lattnerwhich parses that production. For numeric literals, we have: 206d80f118eSChris Lattner 207d80f118eSChris Lattner.. code-block:: c++ 208d80f118eSChris Lattner 209d80f118eSChris Lattner /// numberexpr ::= number 210d80f118eSChris Lattner static std::unique_ptr<ExprAST> ParseNumberExpr() { 2110eaee545SJonas Devlieghere auto Result = std::make_unique<NumberExprAST>(NumVal); 212d80f118eSChris Lattner getNextToken(); // consume the number 213d80f118eSChris Lattner return std::move(Result); 214d80f118eSChris Lattner } 215d80f118eSChris Lattner 216d80f118eSChris LattnerThis routine is very simple: it expects to be called when the current 217d80f118eSChris Lattnertoken is a ``tok_number`` token. It takes the current number value, 218d80f118eSChris Lattnercreates a ``NumberExprAST`` node, advances the lexer to the next token, 219d80f118eSChris Lattnerand finally returns. 220d80f118eSChris Lattner 221d80f118eSChris LattnerThere are some interesting aspects to this. The most important one is 222d80f118eSChris Lattnerthat this routine eats all of the tokens that correspond to the 223d80f118eSChris Lattnerproduction and returns the lexer buffer with the next token (which is 224d80f118eSChris Lattnernot part of the grammar production) ready to go. This is a fairly 225d80f118eSChris Lattnerstandard way to go for recursive descent parsers. For a better example, 226d80f118eSChris Lattnerthe parenthesis operator is defined like this: 227d80f118eSChris Lattner 228d80f118eSChris Lattner.. code-block:: c++ 229d80f118eSChris Lattner 230d80f118eSChris Lattner /// parenexpr ::= '(' expression ')' 231d80f118eSChris Lattner static std::unique_ptr<ExprAST> ParseParenExpr() { 232d80f118eSChris Lattner getNextToken(); // eat (. 233d80f118eSChris Lattner auto V = ParseExpression(); 234d80f118eSChris Lattner if (!V) 235d80f118eSChris Lattner return nullptr; 236d80f118eSChris Lattner 237d80f118eSChris Lattner if (CurTok != ')') 238d80f118eSChris Lattner return LogError("expected ')'"); 239d80f118eSChris Lattner getNextToken(); // eat ). 240d80f118eSChris Lattner return V; 241d80f118eSChris Lattner } 242d80f118eSChris Lattner 243d80f118eSChris LattnerThis function illustrates a number of interesting things about the 244d80f118eSChris Lattnerparser: 245d80f118eSChris Lattner 246d80f118eSChris Lattner1) It shows how we use the LogError routines. When called, this function 247d80f118eSChris Lattnerexpects that the current token is a '(' token, but after parsing the 248d80f118eSChris Lattnersubexpression, it is possible that there is no ')' waiting. For example, 249d80f118eSChris Lattnerif the user types in "(4 x" instead of "(4)", the parser should emit an 250d80f118eSChris Lattnererror. Because errors can occur, the parser needs a way to indicate that 251d80f118eSChris Lattnerthey happened: in our parser, we return null on an error. 252d80f118eSChris Lattner 253d80f118eSChris Lattner2) Another interesting aspect of this function is that it uses recursion 254d80f118eSChris Lattnerby calling ``ParseExpression`` (we will soon see that 255d80f118eSChris Lattner``ParseExpression`` can call ``ParseParenExpr``). This is powerful 256d80f118eSChris Lattnerbecause it allows us to handle recursive grammars, and keeps each 257d80f118eSChris Lattnerproduction very simple. Note that parentheses do not cause construction 258d80f118eSChris Lattnerof AST nodes themselves. While we could do it this way, the most 259d80f118eSChris Lattnerimportant role of parentheses are to guide the parser and provide 260d80f118eSChris Lattnergrouping. Once the parser constructs the AST, parentheses are not 261d80f118eSChris Lattnerneeded. 262d80f118eSChris Lattner 263d80f118eSChris LattnerThe next simple production is for handling variable references and 264d80f118eSChris Lattnerfunction calls: 265d80f118eSChris Lattner 266d80f118eSChris Lattner.. code-block:: c++ 267d80f118eSChris Lattner 268d80f118eSChris Lattner /// identifierexpr 269d80f118eSChris Lattner /// ::= identifier 270d80f118eSChris Lattner /// ::= identifier '(' expression* ')' 271d80f118eSChris Lattner static std::unique_ptr<ExprAST> ParseIdentifierExpr() { 272d80f118eSChris Lattner std::string IdName = IdentifierStr; 273d80f118eSChris Lattner 274d80f118eSChris Lattner getNextToken(); // eat identifier. 275d80f118eSChris Lattner 276d80f118eSChris Lattner if (CurTok != '(') // Simple variable ref. 2770eaee545SJonas Devlieghere return std::make_unique<VariableExprAST>(IdName); 278d80f118eSChris Lattner 279d80f118eSChris Lattner // Call. 280d80f118eSChris Lattner getNextToken(); // eat ( 281d80f118eSChris Lattner std::vector<std::unique_ptr<ExprAST>> Args; 282d80f118eSChris Lattner if (CurTok != ')') { 283d80f118eSChris Lattner while (1) { 284d80f118eSChris Lattner if (auto Arg = ParseExpression()) 285d80f118eSChris Lattner Args.push_back(std::move(Arg)); 286d80f118eSChris Lattner else 287d80f118eSChris Lattner return nullptr; 288d80f118eSChris Lattner 289d80f118eSChris Lattner if (CurTok == ')') 290d80f118eSChris Lattner break; 291d80f118eSChris Lattner 292d80f118eSChris Lattner if (CurTok != ',') 293d80f118eSChris Lattner return LogError("Expected ')' or ',' in argument list"); 294d80f118eSChris Lattner getNextToken(); 295d80f118eSChris Lattner } 296d80f118eSChris Lattner } 297d80f118eSChris Lattner 298d80f118eSChris Lattner // Eat the ')'. 299d80f118eSChris Lattner getNextToken(); 300d80f118eSChris Lattner 3010eaee545SJonas Devlieghere return std::make_unique<CallExprAST>(IdName, std::move(Args)); 302d80f118eSChris Lattner } 303d80f118eSChris Lattner 304d80f118eSChris LattnerThis routine follows the same style as the other routines. (It expects 305d80f118eSChris Lattnerto be called if the current token is a ``tok_identifier`` token). It 306d80f118eSChris Lattneralso has recursion and error handling. One interesting aspect of this is 307d80f118eSChris Lattnerthat it uses *look-ahead* to determine if the current identifier is a 308d80f118eSChris Lattnerstand alone variable reference or if it is a function call expression. 309d80f118eSChris LattnerIt handles this by checking to see if the token after the identifier is 310d80f118eSChris Lattnera '(' token, constructing either a ``VariableExprAST`` or 311d80f118eSChris Lattner``CallExprAST`` node as appropriate. 312d80f118eSChris Lattner 313d80f118eSChris LattnerNow that we have all of our simple expression-parsing logic in place, we 314d80f118eSChris Lattnercan define a helper function to wrap it together into one entry point. 315d80f118eSChris LattnerWe call this class of expressions "primary" expressions, for reasons 316d80f118eSChris Lattnerthat will become more clear `later in the 3176bfd10ffSJonathan Roelofstutorial <LangImpl06.html#user-defined-unary-operators>`_. In order to parse an arbitrary 318d80f118eSChris Lattnerprimary expression, we need to determine what sort of expression it is: 319d80f118eSChris Lattner 320d80f118eSChris Lattner.. code-block:: c++ 321d80f118eSChris Lattner 322d80f118eSChris Lattner /// primary 323d80f118eSChris Lattner /// ::= identifierexpr 324d80f118eSChris Lattner /// ::= numberexpr 325d80f118eSChris Lattner /// ::= parenexpr 326d80f118eSChris Lattner static std::unique_ptr<ExprAST> ParsePrimary() { 327d80f118eSChris Lattner switch (CurTok) { 328d80f118eSChris Lattner default: 329d80f118eSChris Lattner return LogError("unknown token when expecting an expression"); 330d80f118eSChris Lattner case tok_identifier: 331d80f118eSChris Lattner return ParseIdentifierExpr(); 332d80f118eSChris Lattner case tok_number: 333d80f118eSChris Lattner return ParseNumberExpr(); 334d80f118eSChris Lattner case '(': 335d80f118eSChris Lattner return ParseParenExpr(); 336d80f118eSChris Lattner } 337d80f118eSChris Lattner } 338d80f118eSChris Lattner 339d80f118eSChris LattnerNow that you see the definition of this function, it is more obvious why 340d80f118eSChris Lattnerwe can assume the state of CurTok in the various functions. This uses 341d80f118eSChris Lattnerlook-ahead to determine which sort of expression is being inspected, and 342d80f118eSChris Lattnerthen parses it with a function call. 343d80f118eSChris Lattner 344d80f118eSChris LattnerNow that basic expressions are handled, we need to handle binary 345d80f118eSChris Lattnerexpressions. They are a bit more complex. 346d80f118eSChris Lattner 347d80f118eSChris LattnerBinary Expression Parsing 348d80f118eSChris Lattner========================= 349d80f118eSChris Lattner 350d80f118eSChris LattnerBinary expressions are significantly harder to parse because they are 351d80f118eSChris Lattneroften ambiguous. For example, when given the string "x+y\*z", the parser 352d80f118eSChris Lattnercan choose to parse it as either "(x+y)\*z" or "x+(y\*z)". With common 353d80f118eSChris Lattnerdefinitions from mathematics, we expect the later parse, because "\*" 354d80f118eSChris Lattner(multiplication) has higher *precedence* than "+" (addition). 355d80f118eSChris Lattner 356d80f118eSChris LattnerThere are many ways to handle this, but an elegant and efficient way is 357d80f118eSChris Lattnerto use `Operator-Precedence 358d80f118eSChris LattnerParsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_. 359d80f118eSChris LattnerThis parsing technique uses the precedence of binary operators to guide 360d80f118eSChris Lattnerrecursion. To start with, we need a table of precedences: 361d80f118eSChris Lattner 362d80f118eSChris Lattner.. code-block:: c++ 363d80f118eSChris Lattner 364d80f118eSChris Lattner /// BinopPrecedence - This holds the precedence for each binary operator that is 365d80f118eSChris Lattner /// defined. 366d80f118eSChris Lattner static std::map<char, int> BinopPrecedence; 367d80f118eSChris Lattner 368d80f118eSChris Lattner /// GetTokPrecedence - Get the precedence of the pending binary operator token. 369d80f118eSChris Lattner static int GetTokPrecedence() { 370d80f118eSChris Lattner if (!isascii(CurTok)) 371d80f118eSChris Lattner return -1; 372d80f118eSChris Lattner 373d80f118eSChris Lattner // Make sure it's a declared binop. 374d80f118eSChris Lattner int TokPrec = BinopPrecedence[CurTok]; 375d80f118eSChris Lattner if (TokPrec <= 0) return -1; 376d80f118eSChris Lattner return TokPrec; 377d80f118eSChris Lattner } 378d80f118eSChris Lattner 379d80f118eSChris Lattner int main() { 380d80f118eSChris Lattner // Install standard binary operators. 381d80f118eSChris Lattner // 1 is lowest precedence. 382d80f118eSChris Lattner BinopPrecedence['<'] = 10; 383d80f118eSChris Lattner BinopPrecedence['+'] = 20; 384d80f118eSChris Lattner BinopPrecedence['-'] = 20; 385d80f118eSChris Lattner BinopPrecedence['*'] = 40; // highest. 386d80f118eSChris Lattner ... 387d80f118eSChris Lattner } 388d80f118eSChris Lattner 389d80f118eSChris LattnerFor the basic form of Kaleidoscope, we will only support 4 binary 390d80f118eSChris Lattneroperators (this can obviously be extended by you, our brave and intrepid 391d80f118eSChris Lattnerreader). The ``GetTokPrecedence`` function returns the precedence for 392d80f118eSChris Lattnerthe current token, or -1 if the token is not a binary operator. Having a 393d80f118eSChris Lattnermap makes it easy to add new operators and makes it clear that the 394d80f118eSChris Lattneralgorithm doesn't depend on the specific operators involved, but it 395d80f118eSChris Lattnerwould be easy enough to eliminate the map and do the comparisons in the 396d80f118eSChris Lattner``GetTokPrecedence`` function. (Or just use a fixed-size array). 397d80f118eSChris Lattner 398d80f118eSChris LattnerWith the helper above defined, we can now start parsing binary 399d80f118eSChris Lattnerexpressions. The basic idea of operator precedence parsing is to break 400d80f118eSChris Lattnerdown an expression with potentially ambiguous binary operators into 401d80f118eSChris Lattnerpieces. Consider, for example, the expression "a+b+(c+d)\*e\*f+g". 402d80f118eSChris LattnerOperator precedence parsing considers this as a stream of primary 403d80f118eSChris Lattnerexpressions separated by binary operators. As such, it will first parse 404d80f118eSChris Lattnerthe leading primary expression "a", then it will see the pairs [+, b] 405d80f118eSChris Lattner[+, (c+d)] [\*, e] [\*, f] and [+, g]. Note that because parentheses are 406d80f118eSChris Lattnerprimary expressions, the binary expression parser doesn't need to worry 407d80f118eSChris Lattnerabout nested subexpressions like (c+d) at all. 408d80f118eSChris Lattner 409d80f118eSChris LattnerTo start, an expression is a primary expression potentially followed by 410d80f118eSChris Lattnera sequence of [binop,primaryexpr] pairs: 411d80f118eSChris Lattner 412d80f118eSChris Lattner.. code-block:: c++ 413d80f118eSChris Lattner 414d80f118eSChris Lattner /// expression 415d80f118eSChris Lattner /// ::= primary binoprhs 416d80f118eSChris Lattner /// 417d80f118eSChris Lattner static std::unique_ptr<ExprAST> ParseExpression() { 418d80f118eSChris Lattner auto LHS = ParsePrimary(); 419d80f118eSChris Lattner if (!LHS) 420d80f118eSChris Lattner return nullptr; 421d80f118eSChris Lattner 422d80f118eSChris Lattner return ParseBinOpRHS(0, std::move(LHS)); 423d80f118eSChris Lattner } 424d80f118eSChris Lattner 425d80f118eSChris Lattner``ParseBinOpRHS`` is the function that parses the sequence of pairs for 426d80f118eSChris Lattnerus. It takes a precedence and a pointer to an expression for the part 427d80f118eSChris Lattnerthat has been parsed so far. Note that "x" is a perfectly valid 428d80f118eSChris Lattnerexpression: As such, "binoprhs" is allowed to be empty, in which case it 429d80f118eSChris Lattnerreturns the expression that is passed into it. In our example above, the 430d80f118eSChris Lattnercode passes the expression for "a" into ``ParseBinOpRHS`` and the 431d80f118eSChris Lattnercurrent token is "+". 432d80f118eSChris Lattner 433d80f118eSChris LattnerThe precedence value passed into ``ParseBinOpRHS`` indicates the 434d80f118eSChris Lattner*minimal operator precedence* that the function is allowed to eat. For 435d80f118eSChris Lattnerexample, if the current pair stream is [+, x] and ``ParseBinOpRHS`` is 436d80f118eSChris Lattnerpassed in a precedence of 40, it will not consume any tokens (because 437d80f118eSChris Lattnerthe precedence of '+' is only 20). With this in mind, ``ParseBinOpRHS`` 438d80f118eSChris Lattnerstarts with: 439d80f118eSChris Lattner 440d80f118eSChris Lattner.. code-block:: c++ 441d80f118eSChris Lattner 442d80f118eSChris Lattner /// binoprhs 443d80f118eSChris Lattner /// ::= ('+' primary)* 444d80f118eSChris Lattner static std::unique_ptr<ExprAST> ParseBinOpRHS(int ExprPrec, 445d80f118eSChris Lattner std::unique_ptr<ExprAST> LHS) { 446d80f118eSChris Lattner // If this is a binop, find its precedence. 447d80f118eSChris Lattner while (1) { 448d80f118eSChris Lattner int TokPrec = GetTokPrecedence(); 449d80f118eSChris Lattner 450d80f118eSChris Lattner // If this is a binop that binds at least as tightly as the current binop, 451d80f118eSChris Lattner // consume it, otherwise we are done. 452d80f118eSChris Lattner if (TokPrec < ExprPrec) 453d80f118eSChris Lattner return LHS; 454d80f118eSChris Lattner 455d80f118eSChris LattnerThis code gets the precedence of the current token and checks to see if 456d80f118eSChris Lattnerif is too low. Because we defined invalid tokens to have a precedence of 457d80f118eSChris Lattner-1, this check implicitly knows that the pair-stream ends when the token 458d80f118eSChris Lattnerstream runs out of binary operators. If this check succeeds, we know 459d80f118eSChris Lattnerthat the token is a binary operator and that it will be included in this 460d80f118eSChris Lattnerexpression: 461d80f118eSChris Lattner 462d80f118eSChris Lattner.. code-block:: c++ 463d80f118eSChris Lattner 464d80f118eSChris Lattner // Okay, we know this is a binop. 465d80f118eSChris Lattner int BinOp = CurTok; 466d80f118eSChris Lattner getNextToken(); // eat binop 467d80f118eSChris Lattner 468d80f118eSChris Lattner // Parse the primary expression after the binary operator. 469d80f118eSChris Lattner auto RHS = ParsePrimary(); 470d80f118eSChris Lattner if (!RHS) 471d80f118eSChris Lattner return nullptr; 472d80f118eSChris Lattner 473d80f118eSChris LattnerAs such, this code eats (and remembers) the binary operator and then 474d80f118eSChris Lattnerparses the primary expression that follows. This builds up the whole 475d80f118eSChris Lattnerpair, the first of which is [+, b] for the running example. 476d80f118eSChris Lattner 477d80f118eSChris LattnerNow that we parsed the left-hand side of an expression and one pair of 478d80f118eSChris Lattnerthe RHS sequence, we have to decide which way the expression associates. 479d80f118eSChris LattnerIn particular, we could have "(a+b) binop unparsed" or "a + (b binop 480d80f118eSChris Lattnerunparsed)". To determine this, we look ahead at "binop" to determine its 481d80f118eSChris Lattnerprecedence and compare it to BinOp's precedence (which is '+' in this 482d80f118eSChris Lattnercase): 483d80f118eSChris Lattner 484d80f118eSChris Lattner.. code-block:: c++ 485d80f118eSChris Lattner 486d80f118eSChris Lattner // If BinOp binds less tightly with RHS than the operator after RHS, let 487d80f118eSChris Lattner // the pending operator take RHS as its LHS. 488d80f118eSChris Lattner int NextPrec = GetTokPrecedence(); 489d80f118eSChris Lattner if (TokPrec < NextPrec) { 490d80f118eSChris Lattner 491d80f118eSChris LattnerIf the precedence of the binop to the right of "RHS" is lower or equal 492d80f118eSChris Lattnerto the precedence of our current operator, then we know that the 493d80f118eSChris Lattnerparentheses associate as "(a+b) binop ...". In our example, the current 494d80f118eSChris Lattneroperator is "+" and the next operator is "+", we know that they have the 495d80f118eSChris Lattnersame precedence. In this case we'll create the AST node for "a+b", and 496d80f118eSChris Lattnerthen continue parsing: 497d80f118eSChris Lattner 498d80f118eSChris Lattner.. code-block:: c++ 499d80f118eSChris Lattner 500d80f118eSChris Lattner ... if body omitted ... 501d80f118eSChris Lattner } 502d80f118eSChris Lattner 503d80f118eSChris Lattner // Merge LHS/RHS. 5040eaee545SJonas Devlieghere LHS = std::make_unique<BinaryExprAST>(BinOp, std::move(LHS), 505d80f118eSChris Lattner std::move(RHS)); 506d80f118eSChris Lattner } // loop around to the top of the while loop. 507d80f118eSChris Lattner } 508d80f118eSChris Lattner 509d80f118eSChris LattnerIn our example above, this will turn "a+b+" into "(a+b)" and execute the 510d80f118eSChris Lattnernext iteration of the loop, with "+" as the current token. The code 511d80f118eSChris Lattnerabove will eat, remember, and parse "(c+d)" as the primary expression, 512d80f118eSChris Lattnerwhich makes the current pair equal to [+, (c+d)]. It will then evaluate 513d80f118eSChris Lattnerthe 'if' conditional above with "\*" as the binop to the right of the 514d80f118eSChris Lattnerprimary. In this case, the precedence of "\*" is higher than the 515d80f118eSChris Lattnerprecedence of "+" so the if condition will be entered. 516d80f118eSChris Lattner 517d80f118eSChris LattnerThe critical question left here is "how can the if condition parse the 518d80f118eSChris Lattnerright hand side in full"? In particular, to build the AST correctly for 519d80f118eSChris Lattnerour example, it needs to get all of "(c+d)\*e\*f" as the RHS expression 520d80f118eSChris Lattnervariable. The code to do this is surprisingly simple (code from the 521d80f118eSChris Lattnerabove two blocks duplicated for context): 522d80f118eSChris Lattner 523d80f118eSChris Lattner.. code-block:: c++ 524d80f118eSChris Lattner 525d80f118eSChris Lattner // If BinOp binds less tightly with RHS than the operator after RHS, let 526d80f118eSChris Lattner // the pending operator take RHS as its LHS. 527d80f118eSChris Lattner int NextPrec = GetTokPrecedence(); 528d80f118eSChris Lattner if (TokPrec < NextPrec) { 529d80f118eSChris Lattner RHS = ParseBinOpRHS(TokPrec+1, std::move(RHS)); 530d80f118eSChris Lattner if (!RHS) 531d80f118eSChris Lattner return nullptr; 532d80f118eSChris Lattner } 533d80f118eSChris Lattner // Merge LHS/RHS. 5340eaee545SJonas Devlieghere LHS = std::make_unique<BinaryExprAST>(BinOp, std::move(LHS), 535d80f118eSChris Lattner std::move(RHS)); 536d80f118eSChris Lattner } // loop around to the top of the while loop. 537d80f118eSChris Lattner } 538d80f118eSChris Lattner 539d80f118eSChris LattnerAt this point, we know that the binary operator to the RHS of our 540d80f118eSChris Lattnerprimary has higher precedence than the binop we are currently parsing. 541d80f118eSChris LattnerAs such, we know that any sequence of pairs whose operators are all 542d80f118eSChris Lattnerhigher precedence than "+" should be parsed together and returned as 543d80f118eSChris Lattner"RHS". To do this, we recursively invoke the ``ParseBinOpRHS`` function 544d80f118eSChris Lattnerspecifying "TokPrec+1" as the minimum precedence required for it to 545d80f118eSChris Lattnercontinue. In our example above, this will cause it to return the AST 546d80f118eSChris Lattnernode for "(c+d)\*e\*f" as RHS, which is then set as the RHS of the '+' 547d80f118eSChris Lattnerexpression. 548d80f118eSChris Lattner 549d80f118eSChris LattnerFinally, on the next iteration of the while loop, the "+g" piece is 550d80f118eSChris Lattnerparsed and added to the AST. With this little bit of code (14 551d80f118eSChris Lattnernon-trivial lines), we correctly handle fully general binary expression 552d80f118eSChris Lattnerparsing in a very elegant way. This was a whirlwind tour of this code, 553d80f118eSChris Lattnerand it is somewhat subtle. I recommend running through it with a few 554d80f118eSChris Lattnertough examples to see how it works. 555d80f118eSChris Lattner 556d80f118eSChris LattnerThis wraps up handling of expressions. At this point, we can point the 557d80f118eSChris Lattnerparser at an arbitrary token stream and build an expression from it, 558d80f118eSChris Lattnerstopping at the first token that is not part of the expression. Next up 559d80f118eSChris Lattnerwe need to handle function definitions, etc. 560d80f118eSChris Lattner 561d80f118eSChris LattnerParsing the Rest 562d80f118eSChris Lattner================ 563d80f118eSChris Lattner 564d80f118eSChris LattnerThe next thing missing is handling of function prototypes. In 565d80f118eSChris LattnerKaleidoscope, these are used both for 'extern' function declarations as 566d80f118eSChris Lattnerwell as function body definitions. The code to do this is 567d80f118eSChris Lattnerstraight-forward and not very interesting (once you've survived 568d80f118eSChris Lattnerexpressions): 569d80f118eSChris Lattner 570d80f118eSChris Lattner.. code-block:: c++ 571d80f118eSChris Lattner 572d80f118eSChris Lattner /// prototype 573d80f118eSChris Lattner /// ::= id '(' id* ')' 574d80f118eSChris Lattner static std::unique_ptr<PrototypeAST> ParsePrototype() { 575d80f118eSChris Lattner if (CurTok != tok_identifier) 576d80f118eSChris Lattner return LogErrorP("Expected function name in prototype"); 577d80f118eSChris Lattner 578d80f118eSChris Lattner std::string FnName = IdentifierStr; 579d80f118eSChris Lattner getNextToken(); 580d80f118eSChris Lattner 581d80f118eSChris Lattner if (CurTok != '(') 582d80f118eSChris Lattner return LogErrorP("Expected '(' in prototype"); 583d80f118eSChris Lattner 584d80f118eSChris Lattner // Read the list of argument names. 585d80f118eSChris Lattner std::vector<std::string> ArgNames; 586d80f118eSChris Lattner while (getNextToken() == tok_identifier) 587d80f118eSChris Lattner ArgNames.push_back(IdentifierStr); 588d80f118eSChris Lattner if (CurTok != ')') 589d80f118eSChris Lattner return LogErrorP("Expected ')' in prototype"); 590d80f118eSChris Lattner 591d80f118eSChris Lattner // success. 592d80f118eSChris Lattner getNextToken(); // eat ')'. 593d80f118eSChris Lattner 5940eaee545SJonas Devlieghere return std::make_unique<PrototypeAST>(FnName, std::move(ArgNames)); 595d80f118eSChris Lattner } 596d80f118eSChris Lattner 597d80f118eSChris LattnerGiven this, a function definition is very simple, just a prototype plus 598d80f118eSChris Lattneran expression to implement the body: 599d80f118eSChris Lattner 600d80f118eSChris Lattner.. code-block:: c++ 601d80f118eSChris Lattner 602d80f118eSChris Lattner /// definition ::= 'def' prototype expression 603d80f118eSChris Lattner static std::unique_ptr<FunctionAST> ParseDefinition() { 604d80f118eSChris Lattner getNextToken(); // eat def. 605d80f118eSChris Lattner auto Proto = ParsePrototype(); 606d80f118eSChris Lattner if (!Proto) return nullptr; 607d80f118eSChris Lattner 608d80f118eSChris Lattner if (auto E = ParseExpression()) 6090eaee545SJonas Devlieghere return std::make_unique<FunctionAST>(std::move(Proto), std::move(E)); 610d80f118eSChris Lattner return nullptr; 611d80f118eSChris Lattner } 612d80f118eSChris Lattner 613d80f118eSChris LattnerIn addition, we support 'extern' to declare functions like 'sin' and 614d80f118eSChris Lattner'cos' as well as to support forward declaration of user functions. These 615d80f118eSChris Lattner'extern's are just prototypes with no body: 616d80f118eSChris Lattner 617d80f118eSChris Lattner.. code-block:: c++ 618d80f118eSChris Lattner 619d80f118eSChris Lattner /// external ::= 'extern' prototype 620d80f118eSChris Lattner static std::unique_ptr<PrototypeAST> ParseExtern() { 621d80f118eSChris Lattner getNextToken(); // eat extern. 622d80f118eSChris Lattner return ParsePrototype(); 623d80f118eSChris Lattner } 624d80f118eSChris Lattner 625d80f118eSChris LattnerFinally, we'll also let the user type in arbitrary top-level expressions 626d80f118eSChris Lattnerand evaluate them on the fly. We will handle this by defining anonymous 627d80f118eSChris Lattnernullary (zero argument) functions for them: 628d80f118eSChris Lattner 629d80f118eSChris Lattner.. code-block:: c++ 630d80f118eSChris Lattner 631d80f118eSChris Lattner /// toplevelexpr ::= expression 632d80f118eSChris Lattner static std::unique_ptr<FunctionAST> ParseTopLevelExpr() { 633d80f118eSChris Lattner if (auto E = ParseExpression()) { 634d80f118eSChris Lattner // Make an anonymous proto. 6350eaee545SJonas Devlieghere auto Proto = std::make_unique<PrototypeAST>("", std::vector<std::string>()); 6360eaee545SJonas Devlieghere return std::make_unique<FunctionAST>(std::move(Proto), std::move(E)); 637d80f118eSChris Lattner } 638d80f118eSChris Lattner return nullptr; 639d80f118eSChris Lattner } 640d80f118eSChris Lattner 641d80f118eSChris LattnerNow that we have all the pieces, let's build a little driver that will 642d80f118eSChris Lattnerlet us actually *execute* this code we've built! 643d80f118eSChris Lattner 644d80f118eSChris LattnerThe Driver 645d80f118eSChris Lattner========== 646d80f118eSChris Lattner 647d80f118eSChris LattnerThe driver for this simply invokes all of the parsing pieces with a 648d80f118eSChris Lattnertop-level dispatch loop. There isn't much interesting here, so I'll just 649d80f118eSChris Lattnerinclude the top-level loop. See `below <#full-code-listing>`_ for full code in the 650d80f118eSChris Lattner"Top-Level Parsing" section. 651d80f118eSChris Lattner 652d80f118eSChris Lattner.. code-block:: c++ 653d80f118eSChris Lattner 654d80f118eSChris Lattner /// top ::= definition | external | expression | ';' 655d80f118eSChris Lattner static void MainLoop() { 656d80f118eSChris Lattner while (1) { 657d80f118eSChris Lattner fprintf(stderr, "ready> "); 658d80f118eSChris Lattner switch (CurTok) { 659d80f118eSChris Lattner case tok_eof: 660d80f118eSChris Lattner return; 661d80f118eSChris Lattner case ';': // ignore top-level semicolons. 662d80f118eSChris Lattner getNextToken(); 663d80f118eSChris Lattner break; 664d80f118eSChris Lattner case tok_def: 665d80f118eSChris Lattner HandleDefinition(); 666d80f118eSChris Lattner break; 667d80f118eSChris Lattner case tok_extern: 668d80f118eSChris Lattner HandleExtern(); 669d80f118eSChris Lattner break; 670d80f118eSChris Lattner default: 671d80f118eSChris Lattner HandleTopLevelExpression(); 672d80f118eSChris Lattner break; 673d80f118eSChris Lattner } 674d80f118eSChris Lattner } 675d80f118eSChris Lattner } 676d80f118eSChris Lattner 677d80f118eSChris LattnerThe most interesting part of this is that we ignore top-level 678d80f118eSChris Lattnersemicolons. Why is this, you ask? The basic reason is that if you type 679d80f118eSChris Lattner"4 + 5" at the command line, the parser doesn't know whether that is the 680d80f118eSChris Lattnerend of what you will type or not. For example, on the next line you 681d80f118eSChris Lattnercould type "def foo..." in which case 4+5 is the end of a top-level 682d80f118eSChris Lattnerexpression. Alternatively you could type "\* 6", which would continue 683d80f118eSChris Lattnerthe expression. Having top-level semicolons allows you to type "4+5;", 684d80f118eSChris Lattnerand the parser will know you are done. 685d80f118eSChris Lattner 686d80f118eSChris LattnerConclusions 687d80f118eSChris Lattner=========== 688d80f118eSChris Lattner 689d80f118eSChris LattnerWith just under 400 lines of commented code (240 lines of non-comment, 690d80f118eSChris Lattnernon-blank code), we fully defined our minimal language, including a 691d80f118eSChris Lattnerlexer, parser, and AST builder. With this done, the executable will 692d80f118eSChris Lattnervalidate Kaleidoscope code and tell us if it is grammatically invalid. 693d80f118eSChris LattnerFor example, here is a sample interaction: 694d80f118eSChris Lattner 695d80f118eSChris Lattner.. code-block:: bash 696d80f118eSChris Lattner 697d80f118eSChris Lattner $ ./a.out 698d80f118eSChris Lattner ready> def foo(x y) x+foo(y, 4.0); 699d80f118eSChris Lattner Parsed a function definition. 700d80f118eSChris Lattner ready> def foo(x y) x+y y; 701d80f118eSChris Lattner Parsed a function definition. 702d80f118eSChris Lattner Parsed a top-level expr 703d80f118eSChris Lattner ready> def foo(x y) x+y ); 704d80f118eSChris Lattner Parsed a function definition. 705d80f118eSChris Lattner Error: unknown token when expecting an expression 706d80f118eSChris Lattner ready> extern sin(a); 707d80f118eSChris Lattner ready> Parsed an extern 708d80f118eSChris Lattner ready> ^D 709d80f118eSChris Lattner $ 710d80f118eSChris Lattner 711d80f118eSChris LattnerThere is a lot of room for extension here. You can define new AST nodes, 712d80f118eSChris Lattnerextend the language in many ways, etc. In the `next 713d80f118eSChris Lattnerinstallment <LangImpl03.html>`_, we will describe how to generate LLVM 714d80f118eSChris LattnerIntermediate Representation (IR) from the AST. 715d80f118eSChris Lattner 716d80f118eSChris LattnerFull Code Listing 717d80f118eSChris Lattner================= 718d80f118eSChris Lattner 719d80f118eSChris LattnerHere is the complete code listing for our running example. Because this 720d80f118eSChris Lattneruses the LLVM libraries, we need to link them in. To do this, we use the 721*72fd1033SSylvestre Ledru`llvm-config <https://llvm.org/cmds/llvm-config.html>`_ tool to inform 722d80f118eSChris Lattnerour makefile/command line about which options to use: 723d80f118eSChris Lattner 724d80f118eSChris Lattner.. code-block:: bash 725d80f118eSChris Lattner 726d80f118eSChris Lattner # Compile 727d80f118eSChris Lattner clang++ -g -O3 toy.cpp `llvm-config --cxxflags` 728d80f118eSChris Lattner # Run 729d80f118eSChris Lattner ./a.out 730d80f118eSChris Lattner 731d80f118eSChris LattnerHere is the code: 732d80f118eSChris Lattner 733147e0ddaSHans Wennborg.. literalinclude:: ../../../examples/Kaleidoscope/Chapter2/toy.cpp 734d80f118eSChris Lattner :language: c++ 735d80f118eSChris Lattner 736d80f118eSChris Lattner`Next: Implementing Code Generation to LLVM IR <LangImpl03.html>`_ 737d80f118eSChris Lattner 738