1d80f118eSChris Lattner===========================================
2d80f118eSChris LattnerKaleidoscope: Implementing a Parser and AST
3d80f118eSChris Lattner===========================================
4d80f118eSChris Lattner
5d80f118eSChris Lattner.. contents::
6d80f118eSChris Lattner   :local:
7d80f118eSChris Lattner
8d80f118eSChris LattnerChapter 2 Introduction
9d80f118eSChris Lattner======================
10d80f118eSChris Lattner
11d80f118eSChris LattnerWelcome to Chapter 2 of the "`Implementing a language with
12d80f118eSChris LattnerLLVM <index.html>`_" tutorial. This chapter shows you how to use the
13d80f118eSChris Lattnerlexer, built in `Chapter 1 <LangImpl01.html>`_, to build a full
14d80f118eSChris Lattner`parser <http://en.wikipedia.org/wiki/Parsing>`_ for our Kaleidoscope
15d80f118eSChris Lattnerlanguage. Once we have a parser, we'll define and build an `Abstract
16d80f118eSChris LattnerSyntax Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST).
17d80f118eSChris Lattner
18d80f118eSChris LattnerThe parser we will build uses a combination of `Recursive Descent
19d80f118eSChris LattnerParsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and
20d80f118eSChris Lattner`Operator-Precedence
21d80f118eSChris LattnerParsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to
22d80f118eSChris Lattnerparse the Kaleidoscope language (the latter for binary expressions and
23d80f118eSChris Lattnerthe former for everything else). Before we get to parsing though, let's
24d80f118eSChris Lattnertalk about the output of the parser: the Abstract Syntax Tree.
25d80f118eSChris Lattner
26d80f118eSChris LattnerThe Abstract Syntax Tree (AST)
27d80f118eSChris Lattner==============================
28d80f118eSChris Lattner
29d80f118eSChris LattnerThe AST for a program captures its behavior in such a way that it is
30d80f118eSChris Lattnereasy for later stages of the compiler (e.g. code generation) to
31d80f118eSChris Lattnerinterpret. We basically want one object for each construct in the
32d80f118eSChris Lattnerlanguage, and the AST should closely model the language. In
33d80f118eSChris LattnerKaleidoscope, we have expressions, a prototype, and a function object.
34d80f118eSChris LattnerWe'll start with expressions first:
35d80f118eSChris Lattner
36d80f118eSChris Lattner.. code-block:: c++
37d80f118eSChris Lattner
38d80f118eSChris Lattner    /// ExprAST - Base class for all expression nodes.
39d80f118eSChris Lattner    class ExprAST {
40d80f118eSChris Lattner    public:
41d80f118eSChris Lattner      virtual ~ExprAST() {}
42d80f118eSChris Lattner    };
43d80f118eSChris Lattner
44d80f118eSChris Lattner    /// NumberExprAST - Expression class for numeric literals like "1.0".
45d80f118eSChris Lattner    class NumberExprAST : public ExprAST {
46d80f118eSChris Lattner      double Val;
47d80f118eSChris Lattner
48d80f118eSChris Lattner    public:
49d80f118eSChris Lattner      NumberExprAST(double Val) : Val(Val) {}
50d80f118eSChris Lattner    };
51d80f118eSChris Lattner
52d80f118eSChris LattnerThe code above shows the definition of the base ExprAST class and one
53d80f118eSChris Lattnersubclass which we use for numeric literals. The important thing to note
54d80f118eSChris Lattnerabout this code is that the NumberExprAST class captures the numeric
55d80f118eSChris Lattnervalue of the literal as an instance variable. This allows later phases
56d80f118eSChris Lattnerof the compiler to know what the stored numeric value is.
57d80f118eSChris Lattner
58d80f118eSChris LattnerRight now we only create the AST, so there are no useful accessor
59d80f118eSChris Lattnermethods on them. It would be very easy to add a virtual method to pretty
60d80f118eSChris Lattnerprint the code, for example. Here are the other expression AST node
61d80f118eSChris Lattnerdefinitions that we'll use in the basic form of the Kaleidoscope
62d80f118eSChris Lattnerlanguage:
63d80f118eSChris Lattner
64d80f118eSChris Lattner.. code-block:: c++
65d80f118eSChris Lattner
66d80f118eSChris Lattner    /// VariableExprAST - Expression class for referencing a variable, like "a".
67d80f118eSChris Lattner    class VariableExprAST : public ExprAST {
68d80f118eSChris Lattner      std::string Name;
69d80f118eSChris Lattner
70d80f118eSChris Lattner    public:
71d80f118eSChris Lattner      VariableExprAST(const std::string &Name) : Name(Name) {}
72d80f118eSChris Lattner    };
73d80f118eSChris Lattner
74d80f118eSChris Lattner    /// BinaryExprAST - Expression class for a binary operator.
75d80f118eSChris Lattner    class BinaryExprAST : public ExprAST {
76d80f118eSChris Lattner      char Op;
77d80f118eSChris Lattner      std::unique_ptr<ExprAST> LHS, RHS;
78d80f118eSChris Lattner
79d80f118eSChris Lattner    public:
80d80f118eSChris Lattner      BinaryExprAST(char op, std::unique_ptr<ExprAST> LHS,
81d80f118eSChris Lattner                    std::unique_ptr<ExprAST> RHS)
82d80f118eSChris Lattner        : Op(op), LHS(std::move(LHS)), RHS(std::move(RHS)) {}
83d80f118eSChris Lattner    };
84d80f118eSChris Lattner
85d80f118eSChris Lattner    /// CallExprAST - Expression class for function calls.
86d80f118eSChris Lattner    class CallExprAST : public ExprAST {
87d80f118eSChris Lattner      std::string Callee;
88d80f118eSChris Lattner      std::vector<std::unique_ptr<ExprAST>> Args;
89d80f118eSChris Lattner
90d80f118eSChris Lattner    public:
91d80f118eSChris Lattner      CallExprAST(const std::string &Callee,
92d80f118eSChris Lattner                  std::vector<std::unique_ptr<ExprAST>> Args)
93d80f118eSChris Lattner        : Callee(Callee), Args(std::move(Args)) {}
94d80f118eSChris Lattner    };
95d80f118eSChris Lattner
96d80f118eSChris LattnerThis is all (intentionally) rather straight-forward: variables capture
97d80f118eSChris Lattnerthe variable name, binary operators capture their opcode (e.g. '+'), and
98d80f118eSChris Lattnercalls capture a function name as well as a list of any argument
99d80f118eSChris Lattnerexpressions. One thing that is nice about our AST is that it captures
100d80f118eSChris Lattnerthe language features without talking about the syntax of the language.
101d80f118eSChris LattnerNote that there is no discussion about precedence of binary operators,
102d80f118eSChris Lattnerlexical structure, etc.
103d80f118eSChris Lattner
104d80f118eSChris LattnerFor our basic language, these are all of the expression nodes we'll
105d80f118eSChris Lattnerdefine. Because it doesn't have conditional control flow, it isn't
106d80f118eSChris LattnerTuring-complete; we'll fix that in a later installment. The two things
107d80f118eSChris Lattnerwe need next are a way to talk about the interface to a function, and a
108d80f118eSChris Lattnerway to talk about functions themselves:
109d80f118eSChris Lattner
110d80f118eSChris Lattner.. code-block:: c++
111d80f118eSChris Lattner
112d80f118eSChris Lattner    /// PrototypeAST - This class represents the "prototype" for a function,
113d80f118eSChris Lattner    /// which captures its name, and its argument names (thus implicitly the number
114d80f118eSChris Lattner    /// of arguments the function takes).
115d80f118eSChris Lattner    class PrototypeAST {
116d80f118eSChris Lattner      std::string Name;
117d80f118eSChris Lattner      std::vector<std::string> Args;
118d80f118eSChris Lattner
119d80f118eSChris Lattner    public:
120d80f118eSChris Lattner      PrototypeAST(const std::string &name, std::vector<std::string> Args)
121d80f118eSChris Lattner        : Name(name), Args(std::move(Args)) {}
122d80f118eSChris Lattner
123d80f118eSChris Lattner      const std::string &getName() const { return Name; }
124d80f118eSChris Lattner    };
125d80f118eSChris Lattner
126d80f118eSChris Lattner    /// FunctionAST - This class represents a function definition itself.
127d80f118eSChris Lattner    class FunctionAST {
128d80f118eSChris Lattner      std::unique_ptr<PrototypeAST> Proto;
129d80f118eSChris Lattner      std::unique_ptr<ExprAST> Body;
130d80f118eSChris Lattner
131d80f118eSChris Lattner    public:
132d80f118eSChris Lattner      FunctionAST(std::unique_ptr<PrototypeAST> Proto,
133d80f118eSChris Lattner                  std::unique_ptr<ExprAST> Body)
134d80f118eSChris Lattner        : Proto(std::move(Proto)), Body(std::move(Body)) {}
135d80f118eSChris Lattner    };
136d80f118eSChris Lattner
137d80f118eSChris LattnerIn Kaleidoscope, functions are typed with just a count of their
138d80f118eSChris Lattnerarguments. Since all values are double precision floating point, the
139d80f118eSChris Lattnertype of each argument doesn't need to be stored anywhere. In a more
140d80f118eSChris Lattneraggressive and realistic language, the "ExprAST" class would probably
141d80f118eSChris Lattnerhave a type field.
142d80f118eSChris Lattner
143d80f118eSChris LattnerWith this scaffolding, we can now talk about parsing expressions and
144d80f118eSChris Lattnerfunction bodies in Kaleidoscope.
145d80f118eSChris Lattner
146d80f118eSChris LattnerParser Basics
147d80f118eSChris Lattner=============
148d80f118eSChris Lattner
149d80f118eSChris LattnerNow that we have an AST to build, we need to define the parser code to
150d80f118eSChris Lattnerbuild it. The idea here is that we want to parse something like "x+y"
151d80f118eSChris Lattner(which is returned as three tokens by the lexer) into an AST that could
152d80f118eSChris Lattnerbe generated with calls like this:
153d80f118eSChris Lattner
154d80f118eSChris Lattner.. code-block:: c++
155d80f118eSChris Lattner
1560eaee545SJonas Devlieghere      auto LHS = std::make_unique<VariableExprAST>("x");
1570eaee545SJonas Devlieghere      auto RHS = std::make_unique<VariableExprAST>("y");
158d80f118eSChris Lattner      auto Result = std::make_unique<BinaryExprAST>('+', std::move(LHS),
159d80f118eSChris Lattner                                                    std::move(RHS));
160d80f118eSChris Lattner
161d80f118eSChris LattnerIn order to do this, we'll start by defining some basic helper routines:
162d80f118eSChris Lattner
163d80f118eSChris Lattner.. code-block:: c++
164d80f118eSChris Lattner
165d80f118eSChris Lattner    /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
166d80f118eSChris Lattner    /// token the parser is looking at.  getNextToken reads another token from the
167d80f118eSChris Lattner    /// lexer and updates CurTok with its results.
168d80f118eSChris Lattner    static int CurTok;
169d80f118eSChris Lattner    static int getNextToken() {
170d80f118eSChris Lattner      return CurTok = gettok();
171d80f118eSChris Lattner    }
172d80f118eSChris Lattner
173d80f118eSChris LattnerThis implements a simple token buffer around the lexer. This allows us
174d80f118eSChris Lattnerto look one token ahead at what the lexer is returning. Every function
175d80f118eSChris Lattnerin our parser will assume that CurTok is the current token that needs to
176d80f118eSChris Lattnerbe parsed.
177d80f118eSChris Lattner
178d80f118eSChris Lattner.. code-block:: c++
179d80f118eSChris Lattner
180d80f118eSChris Lattner
181d80f118eSChris Lattner    /// LogError* - These are little helper functions for error handling.
182d80f118eSChris Lattner    std::unique_ptr<ExprAST> LogError(const char *Str) {
183d80f118eSChris Lattner      fprintf(stderr, "LogError: %s\n", Str);
184d80f118eSChris Lattner      return nullptr;
185d80f118eSChris Lattner    }
186d80f118eSChris Lattner    std::unique_ptr<PrototypeAST> LogErrorP(const char *Str) {
187d80f118eSChris Lattner      LogError(Str);
188d80f118eSChris Lattner      return nullptr;
189d80f118eSChris Lattner    }
190d80f118eSChris Lattner
191d80f118eSChris LattnerThe ``LogError`` routines are simple helper routines that our parser will
192d80f118eSChris Lattneruse to handle errors. The error recovery in our parser will not be the
193d80f118eSChris Lattnerbest and is not particular user-friendly, but it will be enough for our
194d80f118eSChris Lattnertutorial. These routines make it easier to handle errors in routines
195d80f118eSChris Lattnerthat have various return types: they always return null.
196d80f118eSChris Lattner
197d80f118eSChris LattnerWith these basic helper functions, we can implement the first piece of
198d80f118eSChris Lattnerour grammar: numeric literals.
199d80f118eSChris Lattner
200d80f118eSChris LattnerBasic Expression Parsing
201d80f118eSChris Lattner========================
202d80f118eSChris Lattner
203d80f118eSChris LattnerWe start with numeric literals, because they are the simplest to
204d80f118eSChris Lattnerprocess. For each production in our grammar, we'll define a function
205d80f118eSChris Lattnerwhich parses that production. For numeric literals, we have:
206d80f118eSChris Lattner
207d80f118eSChris Lattner.. code-block:: c++
208d80f118eSChris Lattner
209d80f118eSChris Lattner    /// numberexpr ::= number
210d80f118eSChris Lattner    static std::unique_ptr<ExprAST> ParseNumberExpr() {
2110eaee545SJonas Devlieghere      auto Result = std::make_unique<NumberExprAST>(NumVal);
212d80f118eSChris Lattner      getNextToken(); // consume the number
213d80f118eSChris Lattner      return std::move(Result);
214d80f118eSChris Lattner    }
215d80f118eSChris Lattner
216d80f118eSChris LattnerThis routine is very simple: it expects to be called when the current
217d80f118eSChris Lattnertoken is a ``tok_number`` token. It takes the current number value,
218d80f118eSChris Lattnercreates a ``NumberExprAST`` node, advances the lexer to the next token,
219d80f118eSChris Lattnerand finally returns.
220d80f118eSChris Lattner
221d80f118eSChris LattnerThere are some interesting aspects to this. The most important one is
222d80f118eSChris Lattnerthat this routine eats all of the tokens that correspond to the
223d80f118eSChris Lattnerproduction and returns the lexer buffer with the next token (which is
224d80f118eSChris Lattnernot part of the grammar production) ready to go. This is a fairly
225d80f118eSChris Lattnerstandard way to go for recursive descent parsers. For a better example,
226d80f118eSChris Lattnerthe parenthesis operator is defined like this:
227d80f118eSChris Lattner
228d80f118eSChris Lattner.. code-block:: c++
229d80f118eSChris Lattner
230d80f118eSChris Lattner    /// parenexpr ::= '(' expression ')'
231d80f118eSChris Lattner    static std::unique_ptr<ExprAST> ParseParenExpr() {
232d80f118eSChris Lattner      getNextToken(); // eat (.
233d80f118eSChris Lattner      auto V = ParseExpression();
234d80f118eSChris Lattner      if (!V)
235d80f118eSChris Lattner        return nullptr;
236d80f118eSChris Lattner
237d80f118eSChris Lattner      if (CurTok != ')')
238d80f118eSChris Lattner        return LogError("expected ')'");
239d80f118eSChris Lattner      getNextToken(); // eat ).
240d80f118eSChris Lattner      return V;
241d80f118eSChris Lattner    }
242d80f118eSChris Lattner
243d80f118eSChris LattnerThis function illustrates a number of interesting things about the
244d80f118eSChris Lattnerparser:
245d80f118eSChris Lattner
246d80f118eSChris Lattner1) It shows how we use the LogError routines. When called, this function
247d80f118eSChris Lattnerexpects that the current token is a '(' token, but after parsing the
248d80f118eSChris Lattnersubexpression, it is possible that there is no ')' waiting. For example,
249d80f118eSChris Lattnerif the user types in "(4 x" instead of "(4)", the parser should emit an
250d80f118eSChris Lattnererror. Because errors can occur, the parser needs a way to indicate that
251d80f118eSChris Lattnerthey happened: in our parser, we return null on an error.
252d80f118eSChris Lattner
253d80f118eSChris Lattner2) Another interesting aspect of this function is that it uses recursion
254d80f118eSChris Lattnerby calling ``ParseExpression`` (we will soon see that
255d80f118eSChris Lattner``ParseExpression`` can call ``ParseParenExpr``). This is powerful
256d80f118eSChris Lattnerbecause it allows us to handle recursive grammars, and keeps each
257d80f118eSChris Lattnerproduction very simple. Note that parentheses do not cause construction
258d80f118eSChris Lattnerof AST nodes themselves. While we could do it this way, the most
259d80f118eSChris Lattnerimportant role of parentheses are to guide the parser and provide
260d80f118eSChris Lattnergrouping. Once the parser constructs the AST, parentheses are not
261d80f118eSChris Lattnerneeded.
262d80f118eSChris Lattner
263d80f118eSChris LattnerThe next simple production is for handling variable references and
264d80f118eSChris Lattnerfunction calls:
265d80f118eSChris Lattner
266d80f118eSChris Lattner.. code-block:: c++
267d80f118eSChris Lattner
268d80f118eSChris Lattner    /// identifierexpr
269d80f118eSChris Lattner    ///   ::= identifier
270d80f118eSChris Lattner    ///   ::= identifier '(' expression* ')'
271d80f118eSChris Lattner    static std::unique_ptr<ExprAST> ParseIdentifierExpr() {
272d80f118eSChris Lattner      std::string IdName = IdentifierStr;
273d80f118eSChris Lattner
274d80f118eSChris Lattner      getNextToken();  // eat identifier.
275d80f118eSChris Lattner
276d80f118eSChris Lattner      if (CurTok != '(') // Simple variable ref.
2770eaee545SJonas Devlieghere        return std::make_unique<VariableExprAST>(IdName);
278d80f118eSChris Lattner
279d80f118eSChris Lattner      // Call.
280d80f118eSChris Lattner      getNextToken();  // eat (
281d80f118eSChris Lattner      std::vector<std::unique_ptr<ExprAST>> Args;
282d80f118eSChris Lattner      if (CurTok != ')') {
283d80f118eSChris Lattner        while (1) {
284d80f118eSChris Lattner          if (auto Arg = ParseExpression())
285d80f118eSChris Lattner            Args.push_back(std::move(Arg));
286d80f118eSChris Lattner          else
287d80f118eSChris Lattner            return nullptr;
288d80f118eSChris Lattner
289d80f118eSChris Lattner          if (CurTok == ')')
290d80f118eSChris Lattner            break;
291d80f118eSChris Lattner
292d80f118eSChris Lattner          if (CurTok != ',')
293d80f118eSChris Lattner            return LogError("Expected ')' or ',' in argument list");
294d80f118eSChris Lattner          getNextToken();
295d80f118eSChris Lattner        }
296d80f118eSChris Lattner      }
297d80f118eSChris Lattner
298d80f118eSChris Lattner      // Eat the ')'.
299d80f118eSChris Lattner      getNextToken();
300d80f118eSChris Lattner
3010eaee545SJonas Devlieghere      return std::make_unique<CallExprAST>(IdName, std::move(Args));
302d80f118eSChris Lattner    }
303d80f118eSChris Lattner
304d80f118eSChris LattnerThis routine follows the same style as the other routines. (It expects
305d80f118eSChris Lattnerto be called if the current token is a ``tok_identifier`` token). It
306d80f118eSChris Lattneralso has recursion and error handling. One interesting aspect of this is
307d80f118eSChris Lattnerthat it uses *look-ahead* to determine if the current identifier is a
308d80f118eSChris Lattnerstand alone variable reference or if it is a function call expression.
309d80f118eSChris LattnerIt handles this by checking to see if the token after the identifier is
310d80f118eSChris Lattnera '(' token, constructing either a ``VariableExprAST`` or
311d80f118eSChris Lattner``CallExprAST`` node as appropriate.
312d80f118eSChris Lattner
313d80f118eSChris LattnerNow that we have all of our simple expression-parsing logic in place, we
314d80f118eSChris Lattnercan define a helper function to wrap it together into one entry point.
315d80f118eSChris LattnerWe call this class of expressions "primary" expressions, for reasons
316d80f118eSChris Lattnerthat will become more clear `later in the
3176bfd10ffSJonathan Roelofstutorial <LangImpl06.html#user-defined-unary-operators>`_. In order to parse an arbitrary
318d80f118eSChris Lattnerprimary expression, we need to determine what sort of expression it is:
319d80f118eSChris Lattner
320d80f118eSChris Lattner.. code-block:: c++
321d80f118eSChris Lattner
322d80f118eSChris Lattner    /// primary
323d80f118eSChris Lattner    ///   ::= identifierexpr
324d80f118eSChris Lattner    ///   ::= numberexpr
325d80f118eSChris Lattner    ///   ::= parenexpr
326d80f118eSChris Lattner    static std::unique_ptr<ExprAST> ParsePrimary() {
327d80f118eSChris Lattner      switch (CurTok) {
328d80f118eSChris Lattner      default:
329d80f118eSChris Lattner        return LogError("unknown token when expecting an expression");
330d80f118eSChris Lattner      case tok_identifier:
331d80f118eSChris Lattner        return ParseIdentifierExpr();
332d80f118eSChris Lattner      case tok_number:
333d80f118eSChris Lattner        return ParseNumberExpr();
334d80f118eSChris Lattner      case '(':
335d80f118eSChris Lattner        return ParseParenExpr();
336d80f118eSChris Lattner      }
337d80f118eSChris Lattner    }
338d80f118eSChris Lattner
339d80f118eSChris LattnerNow that you see the definition of this function, it is more obvious why
340d80f118eSChris Lattnerwe can assume the state of CurTok in the various functions. This uses
341d80f118eSChris Lattnerlook-ahead to determine which sort of expression is being inspected, and
342d80f118eSChris Lattnerthen parses it with a function call.
343d80f118eSChris Lattner
344d80f118eSChris LattnerNow that basic expressions are handled, we need to handle binary
345d80f118eSChris Lattnerexpressions. They are a bit more complex.
346d80f118eSChris Lattner
347d80f118eSChris LattnerBinary Expression Parsing
348d80f118eSChris Lattner=========================
349d80f118eSChris Lattner
350d80f118eSChris LattnerBinary expressions are significantly harder to parse because they are
351d80f118eSChris Lattneroften ambiguous. For example, when given the string "x+y\*z", the parser
352d80f118eSChris Lattnercan choose to parse it as either "(x+y)\*z" or "x+(y\*z)". With common
353d80f118eSChris Lattnerdefinitions from mathematics, we expect the later parse, because "\*"
354d80f118eSChris Lattner(multiplication) has higher *precedence* than "+" (addition).
355d80f118eSChris Lattner
356d80f118eSChris LattnerThere are many ways to handle this, but an elegant and efficient way is
357d80f118eSChris Lattnerto use `Operator-Precedence
358d80f118eSChris LattnerParsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_.
359d80f118eSChris LattnerThis parsing technique uses the precedence of binary operators to guide
360d80f118eSChris Lattnerrecursion. To start with, we need a table of precedences:
361d80f118eSChris Lattner
362d80f118eSChris Lattner.. code-block:: c++
363d80f118eSChris Lattner
364d80f118eSChris Lattner    /// BinopPrecedence - This holds the precedence for each binary operator that is
365d80f118eSChris Lattner    /// defined.
366d80f118eSChris Lattner    static std::map<char, int> BinopPrecedence;
367d80f118eSChris Lattner
368d80f118eSChris Lattner    /// GetTokPrecedence - Get the precedence of the pending binary operator token.
369d80f118eSChris Lattner    static int GetTokPrecedence() {
370d80f118eSChris Lattner      if (!isascii(CurTok))
371d80f118eSChris Lattner        return -1;
372d80f118eSChris Lattner
373d80f118eSChris Lattner      // Make sure it's a declared binop.
374d80f118eSChris Lattner      int TokPrec = BinopPrecedence[CurTok];
375d80f118eSChris Lattner      if (TokPrec <= 0) return -1;
376d80f118eSChris Lattner      return TokPrec;
377d80f118eSChris Lattner    }
378d80f118eSChris Lattner
379d80f118eSChris Lattner    int main() {
380d80f118eSChris Lattner      // Install standard binary operators.
381d80f118eSChris Lattner      // 1 is lowest precedence.
382d80f118eSChris Lattner      BinopPrecedence['<'] = 10;
383d80f118eSChris Lattner      BinopPrecedence['+'] = 20;
384d80f118eSChris Lattner      BinopPrecedence['-'] = 20;
385d80f118eSChris Lattner      BinopPrecedence['*'] = 40;  // highest.
386d80f118eSChris Lattner      ...
387d80f118eSChris Lattner    }
388d80f118eSChris Lattner
389d80f118eSChris LattnerFor the basic form of Kaleidoscope, we will only support 4 binary
390d80f118eSChris Lattneroperators (this can obviously be extended by you, our brave and intrepid
391d80f118eSChris Lattnerreader). The ``GetTokPrecedence`` function returns the precedence for
392d80f118eSChris Lattnerthe current token, or -1 if the token is not a binary operator. Having a
393d80f118eSChris Lattnermap makes it easy to add new operators and makes it clear that the
394d80f118eSChris Lattneralgorithm doesn't depend on the specific operators involved, but it
395d80f118eSChris Lattnerwould be easy enough to eliminate the map and do the comparisons in the
396d80f118eSChris Lattner``GetTokPrecedence`` function. (Or just use a fixed-size array).
397d80f118eSChris Lattner
398d80f118eSChris LattnerWith the helper above defined, we can now start parsing binary
399d80f118eSChris Lattnerexpressions. The basic idea of operator precedence parsing is to break
400d80f118eSChris Lattnerdown an expression with potentially ambiguous binary operators into
401d80f118eSChris Lattnerpieces. Consider, for example, the expression "a+b+(c+d)\*e\*f+g".
402d80f118eSChris LattnerOperator precedence parsing considers this as a stream of primary
403d80f118eSChris Lattnerexpressions separated by binary operators. As such, it will first parse
404d80f118eSChris Lattnerthe leading primary expression "a", then it will see the pairs [+, b]
405d80f118eSChris Lattner[+, (c+d)] [\*, e] [\*, f] and [+, g]. Note that because parentheses are
406d80f118eSChris Lattnerprimary expressions, the binary expression parser doesn't need to worry
407d80f118eSChris Lattnerabout nested subexpressions like (c+d) at all.
408d80f118eSChris Lattner
409d80f118eSChris LattnerTo start, an expression is a primary expression potentially followed by
410d80f118eSChris Lattnera sequence of [binop,primaryexpr] pairs:
411d80f118eSChris Lattner
412d80f118eSChris Lattner.. code-block:: c++
413d80f118eSChris Lattner
414d80f118eSChris Lattner    /// expression
415d80f118eSChris Lattner    ///   ::= primary binoprhs
416d80f118eSChris Lattner    ///
417d80f118eSChris Lattner    static std::unique_ptr<ExprAST> ParseExpression() {
418d80f118eSChris Lattner      auto LHS = ParsePrimary();
419d80f118eSChris Lattner      if (!LHS)
420d80f118eSChris Lattner        return nullptr;
421d80f118eSChris Lattner
422d80f118eSChris Lattner      return ParseBinOpRHS(0, std::move(LHS));
423d80f118eSChris Lattner    }
424d80f118eSChris Lattner
425d80f118eSChris Lattner``ParseBinOpRHS`` is the function that parses the sequence of pairs for
426d80f118eSChris Lattnerus. It takes a precedence and a pointer to an expression for the part
427d80f118eSChris Lattnerthat has been parsed so far. Note that "x" is a perfectly valid
428d80f118eSChris Lattnerexpression: As such, "binoprhs" is allowed to be empty, in which case it
429d80f118eSChris Lattnerreturns the expression that is passed into it. In our example above, the
430d80f118eSChris Lattnercode passes the expression for "a" into ``ParseBinOpRHS`` and the
431d80f118eSChris Lattnercurrent token is "+".
432d80f118eSChris Lattner
433d80f118eSChris LattnerThe precedence value passed into ``ParseBinOpRHS`` indicates the
434d80f118eSChris Lattner*minimal operator precedence* that the function is allowed to eat. For
435d80f118eSChris Lattnerexample, if the current pair stream is [+, x] and ``ParseBinOpRHS`` is
436d80f118eSChris Lattnerpassed in a precedence of 40, it will not consume any tokens (because
437d80f118eSChris Lattnerthe precedence of '+' is only 20). With this in mind, ``ParseBinOpRHS``
438d80f118eSChris Lattnerstarts with:
439d80f118eSChris Lattner
440d80f118eSChris Lattner.. code-block:: c++
441d80f118eSChris Lattner
442d80f118eSChris Lattner    /// binoprhs
443d80f118eSChris Lattner    ///   ::= ('+' primary)*
444d80f118eSChris Lattner    static std::unique_ptr<ExprAST> ParseBinOpRHS(int ExprPrec,
445d80f118eSChris Lattner                                                  std::unique_ptr<ExprAST> LHS) {
446d80f118eSChris Lattner      // If this is a binop, find its precedence.
447d80f118eSChris Lattner      while (1) {
448d80f118eSChris Lattner        int TokPrec = GetTokPrecedence();
449d80f118eSChris Lattner
450d80f118eSChris Lattner        // If this is a binop that binds at least as tightly as the current binop,
451d80f118eSChris Lattner        // consume it, otherwise we are done.
452d80f118eSChris Lattner        if (TokPrec < ExprPrec)
453d80f118eSChris Lattner          return LHS;
454d80f118eSChris Lattner
455d80f118eSChris LattnerThis code gets the precedence of the current token and checks to see if
456d80f118eSChris Lattnerif is too low. Because we defined invalid tokens to have a precedence of
457d80f118eSChris Lattner-1, this check implicitly knows that the pair-stream ends when the token
458d80f118eSChris Lattnerstream runs out of binary operators. If this check succeeds, we know
459d80f118eSChris Lattnerthat the token is a binary operator and that it will be included in this
460d80f118eSChris Lattnerexpression:
461d80f118eSChris Lattner
462d80f118eSChris Lattner.. code-block:: c++
463d80f118eSChris Lattner
464d80f118eSChris Lattner        // Okay, we know this is a binop.
465d80f118eSChris Lattner        int BinOp = CurTok;
466d80f118eSChris Lattner        getNextToken();  // eat binop
467d80f118eSChris Lattner
468d80f118eSChris Lattner        // Parse the primary expression after the binary operator.
469d80f118eSChris Lattner        auto RHS = ParsePrimary();
470d80f118eSChris Lattner        if (!RHS)
471d80f118eSChris Lattner          return nullptr;
472d80f118eSChris Lattner
473d80f118eSChris LattnerAs such, this code eats (and remembers) the binary operator and then
474d80f118eSChris Lattnerparses the primary expression that follows. This builds up the whole
475d80f118eSChris Lattnerpair, the first of which is [+, b] for the running example.
476d80f118eSChris Lattner
477d80f118eSChris LattnerNow that we parsed the left-hand side of an expression and one pair of
478d80f118eSChris Lattnerthe RHS sequence, we have to decide which way the expression associates.
479d80f118eSChris LattnerIn particular, we could have "(a+b) binop unparsed" or "a + (b binop
480d80f118eSChris Lattnerunparsed)". To determine this, we look ahead at "binop" to determine its
481d80f118eSChris Lattnerprecedence and compare it to BinOp's precedence (which is '+' in this
482d80f118eSChris Lattnercase):
483d80f118eSChris Lattner
484d80f118eSChris Lattner.. code-block:: c++
485d80f118eSChris Lattner
486d80f118eSChris Lattner        // If BinOp binds less tightly with RHS than the operator after RHS, let
487d80f118eSChris Lattner        // the pending operator take RHS as its LHS.
488d80f118eSChris Lattner        int NextPrec = GetTokPrecedence();
489d80f118eSChris Lattner        if (TokPrec < NextPrec) {
490d80f118eSChris Lattner
491d80f118eSChris LattnerIf the precedence of the binop to the right of "RHS" is lower or equal
492d80f118eSChris Lattnerto the precedence of our current operator, then we know that the
493d80f118eSChris Lattnerparentheses associate as "(a+b) binop ...". In our example, the current
494d80f118eSChris Lattneroperator is "+" and the next operator is "+", we know that they have the
495d80f118eSChris Lattnersame precedence. In this case we'll create the AST node for "a+b", and
496d80f118eSChris Lattnerthen continue parsing:
497d80f118eSChris Lattner
498d80f118eSChris Lattner.. code-block:: c++
499d80f118eSChris Lattner
500d80f118eSChris Lattner          ... if body omitted ...
501d80f118eSChris Lattner        }
502d80f118eSChris Lattner
503d80f118eSChris Lattner        // Merge LHS/RHS.
5040eaee545SJonas Devlieghere        LHS = std::make_unique<BinaryExprAST>(BinOp, std::move(LHS),
505d80f118eSChris Lattner                                               std::move(RHS));
506d80f118eSChris Lattner      }  // loop around to the top of the while loop.
507d80f118eSChris Lattner    }
508d80f118eSChris Lattner
509d80f118eSChris LattnerIn our example above, this will turn "a+b+" into "(a+b)" and execute the
510d80f118eSChris Lattnernext iteration of the loop, with "+" as the current token. The code
511d80f118eSChris Lattnerabove will eat, remember, and parse "(c+d)" as the primary expression,
512d80f118eSChris Lattnerwhich makes the current pair equal to [+, (c+d)]. It will then evaluate
513d80f118eSChris Lattnerthe 'if' conditional above with "\*" as the binop to the right of the
514d80f118eSChris Lattnerprimary. In this case, the precedence of "\*" is higher than the
515d80f118eSChris Lattnerprecedence of "+" so the if condition will be entered.
516d80f118eSChris Lattner
517d80f118eSChris LattnerThe critical question left here is "how can the if condition parse the
518d80f118eSChris Lattnerright hand side in full"? In particular, to build the AST correctly for
519d80f118eSChris Lattnerour example, it needs to get all of "(c+d)\*e\*f" as the RHS expression
520d80f118eSChris Lattnervariable. The code to do this is surprisingly simple (code from the
521d80f118eSChris Lattnerabove two blocks duplicated for context):
522d80f118eSChris Lattner
523d80f118eSChris Lattner.. code-block:: c++
524d80f118eSChris Lattner
525d80f118eSChris Lattner        // If BinOp binds less tightly with RHS than the operator after RHS, let
526d80f118eSChris Lattner        // the pending operator take RHS as its LHS.
527d80f118eSChris Lattner        int NextPrec = GetTokPrecedence();
528d80f118eSChris Lattner        if (TokPrec < NextPrec) {
529d80f118eSChris Lattner          RHS = ParseBinOpRHS(TokPrec+1, std::move(RHS));
530d80f118eSChris Lattner          if (!RHS)
531d80f118eSChris Lattner            return nullptr;
532d80f118eSChris Lattner        }
533d80f118eSChris Lattner        // Merge LHS/RHS.
5340eaee545SJonas Devlieghere        LHS = std::make_unique<BinaryExprAST>(BinOp, std::move(LHS),
535d80f118eSChris Lattner                                               std::move(RHS));
536d80f118eSChris Lattner      }  // loop around to the top of the while loop.
537d80f118eSChris Lattner    }
538d80f118eSChris Lattner
539d80f118eSChris LattnerAt this point, we know that the binary operator to the RHS of our
540d80f118eSChris Lattnerprimary has higher precedence than the binop we are currently parsing.
541d80f118eSChris LattnerAs such, we know that any sequence of pairs whose operators are all
542d80f118eSChris Lattnerhigher precedence than "+" should be parsed together and returned as
543d80f118eSChris Lattner"RHS". To do this, we recursively invoke the ``ParseBinOpRHS`` function
544d80f118eSChris Lattnerspecifying "TokPrec+1" as the minimum precedence required for it to
545d80f118eSChris Lattnercontinue. In our example above, this will cause it to return the AST
546d80f118eSChris Lattnernode for "(c+d)\*e\*f" as RHS, which is then set as the RHS of the '+'
547d80f118eSChris Lattnerexpression.
548d80f118eSChris Lattner
549d80f118eSChris LattnerFinally, on the next iteration of the while loop, the "+g" piece is
550d80f118eSChris Lattnerparsed and added to the AST. With this little bit of code (14
551d80f118eSChris Lattnernon-trivial lines), we correctly handle fully general binary expression
552d80f118eSChris Lattnerparsing in a very elegant way. This was a whirlwind tour of this code,
553d80f118eSChris Lattnerand it is somewhat subtle. I recommend running through it with a few
554d80f118eSChris Lattnertough examples to see how it works.
555d80f118eSChris Lattner
556d80f118eSChris LattnerThis wraps up handling of expressions. At this point, we can point the
557d80f118eSChris Lattnerparser at an arbitrary token stream and build an expression from it,
558d80f118eSChris Lattnerstopping at the first token that is not part of the expression. Next up
559d80f118eSChris Lattnerwe need to handle function definitions, etc.
560d80f118eSChris Lattner
561d80f118eSChris LattnerParsing the Rest
562d80f118eSChris Lattner================
563d80f118eSChris Lattner
564d80f118eSChris LattnerThe next thing missing is handling of function prototypes. In
565d80f118eSChris LattnerKaleidoscope, these are used both for 'extern' function declarations as
566d80f118eSChris Lattnerwell as function body definitions. The code to do this is
567d80f118eSChris Lattnerstraight-forward and not very interesting (once you've survived
568d80f118eSChris Lattnerexpressions):
569d80f118eSChris Lattner
570d80f118eSChris Lattner.. code-block:: c++
571d80f118eSChris Lattner
572d80f118eSChris Lattner    /// prototype
573d80f118eSChris Lattner    ///   ::= id '(' id* ')'
574d80f118eSChris Lattner    static std::unique_ptr<PrototypeAST> ParsePrototype() {
575d80f118eSChris Lattner      if (CurTok != tok_identifier)
576d80f118eSChris Lattner        return LogErrorP("Expected function name in prototype");
577d80f118eSChris Lattner
578d80f118eSChris Lattner      std::string FnName = IdentifierStr;
579d80f118eSChris Lattner      getNextToken();
580d80f118eSChris Lattner
581d80f118eSChris Lattner      if (CurTok != '(')
582d80f118eSChris Lattner        return LogErrorP("Expected '(' in prototype");
583d80f118eSChris Lattner
584d80f118eSChris Lattner      // Read the list of argument names.
585d80f118eSChris Lattner      std::vector<std::string> ArgNames;
586d80f118eSChris Lattner      while (getNextToken() == tok_identifier)
587d80f118eSChris Lattner        ArgNames.push_back(IdentifierStr);
588d80f118eSChris Lattner      if (CurTok != ')')
589d80f118eSChris Lattner        return LogErrorP("Expected ')' in prototype");
590d80f118eSChris Lattner
591d80f118eSChris Lattner      // success.
592d80f118eSChris Lattner      getNextToken();  // eat ')'.
593d80f118eSChris Lattner
5940eaee545SJonas Devlieghere      return std::make_unique<PrototypeAST>(FnName, std::move(ArgNames));
595d80f118eSChris Lattner    }
596d80f118eSChris Lattner
597d80f118eSChris LattnerGiven this, a function definition is very simple, just a prototype plus
598d80f118eSChris Lattneran expression to implement the body:
599d80f118eSChris Lattner
600d80f118eSChris Lattner.. code-block:: c++
601d80f118eSChris Lattner
602d80f118eSChris Lattner    /// definition ::= 'def' prototype expression
603d80f118eSChris Lattner    static std::unique_ptr<FunctionAST> ParseDefinition() {
604d80f118eSChris Lattner      getNextToken();  // eat def.
605d80f118eSChris Lattner      auto Proto = ParsePrototype();
606d80f118eSChris Lattner      if (!Proto) return nullptr;
607d80f118eSChris Lattner
608d80f118eSChris Lattner      if (auto E = ParseExpression())
6090eaee545SJonas Devlieghere        return std::make_unique<FunctionAST>(std::move(Proto), std::move(E));
610d80f118eSChris Lattner      return nullptr;
611d80f118eSChris Lattner    }
612d80f118eSChris Lattner
613d80f118eSChris LattnerIn addition, we support 'extern' to declare functions like 'sin' and
614d80f118eSChris Lattner'cos' as well as to support forward declaration of user functions. These
615d80f118eSChris Lattner'extern's are just prototypes with no body:
616d80f118eSChris Lattner
617d80f118eSChris Lattner.. code-block:: c++
618d80f118eSChris Lattner
619d80f118eSChris Lattner    /// external ::= 'extern' prototype
620d80f118eSChris Lattner    static std::unique_ptr<PrototypeAST> ParseExtern() {
621d80f118eSChris Lattner      getNextToken();  // eat extern.
622d80f118eSChris Lattner      return ParsePrototype();
623d80f118eSChris Lattner    }
624d80f118eSChris Lattner
625d80f118eSChris LattnerFinally, we'll also let the user type in arbitrary top-level expressions
626d80f118eSChris Lattnerand evaluate them on the fly. We will handle this by defining anonymous
627d80f118eSChris Lattnernullary (zero argument) functions for them:
628d80f118eSChris Lattner
629d80f118eSChris Lattner.. code-block:: c++
630d80f118eSChris Lattner
631d80f118eSChris Lattner    /// toplevelexpr ::= expression
632d80f118eSChris Lattner    static std::unique_ptr<FunctionAST> ParseTopLevelExpr() {
633d80f118eSChris Lattner      if (auto E = ParseExpression()) {
634d80f118eSChris Lattner        // Make an anonymous proto.
6350eaee545SJonas Devlieghere        auto Proto = std::make_unique<PrototypeAST>("", std::vector<std::string>());
6360eaee545SJonas Devlieghere        return std::make_unique<FunctionAST>(std::move(Proto), std::move(E));
637d80f118eSChris Lattner      }
638d80f118eSChris Lattner      return nullptr;
639d80f118eSChris Lattner    }
640d80f118eSChris Lattner
641d80f118eSChris LattnerNow that we have all the pieces, let's build a little driver that will
642d80f118eSChris Lattnerlet us actually *execute* this code we've built!
643d80f118eSChris Lattner
644d80f118eSChris LattnerThe Driver
645d80f118eSChris Lattner==========
646d80f118eSChris Lattner
647d80f118eSChris LattnerThe driver for this simply invokes all of the parsing pieces with a
648d80f118eSChris Lattnertop-level dispatch loop. There isn't much interesting here, so I'll just
649d80f118eSChris Lattnerinclude the top-level loop. See `below <#full-code-listing>`_ for full code in the
650d80f118eSChris Lattner"Top-Level Parsing" section.
651d80f118eSChris Lattner
652d80f118eSChris Lattner.. code-block:: c++
653d80f118eSChris Lattner
654d80f118eSChris Lattner    /// top ::= definition | external | expression | ';'
655d80f118eSChris Lattner    static void MainLoop() {
656d80f118eSChris Lattner      while (1) {
657d80f118eSChris Lattner        fprintf(stderr, "ready> ");
658d80f118eSChris Lattner        switch (CurTok) {
659d80f118eSChris Lattner        case tok_eof:
660d80f118eSChris Lattner          return;
661d80f118eSChris Lattner        case ';': // ignore top-level semicolons.
662d80f118eSChris Lattner          getNextToken();
663d80f118eSChris Lattner          break;
664d80f118eSChris Lattner        case tok_def:
665d80f118eSChris Lattner          HandleDefinition();
666d80f118eSChris Lattner          break;
667d80f118eSChris Lattner        case tok_extern:
668d80f118eSChris Lattner          HandleExtern();
669d80f118eSChris Lattner          break;
670d80f118eSChris Lattner        default:
671d80f118eSChris Lattner          HandleTopLevelExpression();
672d80f118eSChris Lattner          break;
673d80f118eSChris Lattner        }
674d80f118eSChris Lattner      }
675d80f118eSChris Lattner    }
676d80f118eSChris Lattner
677d80f118eSChris LattnerThe most interesting part of this is that we ignore top-level
678d80f118eSChris Lattnersemicolons. Why is this, you ask? The basic reason is that if you type
679d80f118eSChris Lattner"4 + 5" at the command line, the parser doesn't know whether that is the
680d80f118eSChris Lattnerend of what you will type or not. For example, on the next line you
681d80f118eSChris Lattnercould type "def foo..." in which case 4+5 is the end of a top-level
682d80f118eSChris Lattnerexpression. Alternatively you could type "\* 6", which would continue
683d80f118eSChris Lattnerthe expression. Having top-level semicolons allows you to type "4+5;",
684d80f118eSChris Lattnerand the parser will know you are done.
685d80f118eSChris Lattner
686d80f118eSChris LattnerConclusions
687d80f118eSChris Lattner===========
688d80f118eSChris Lattner
689d80f118eSChris LattnerWith just under 400 lines of commented code (240 lines of non-comment,
690d80f118eSChris Lattnernon-blank code), we fully defined our minimal language, including a
691d80f118eSChris Lattnerlexer, parser, and AST builder. With this done, the executable will
692d80f118eSChris Lattnervalidate Kaleidoscope code and tell us if it is grammatically invalid.
693d80f118eSChris LattnerFor example, here is a sample interaction:
694d80f118eSChris Lattner
695d80f118eSChris Lattner.. code-block:: bash
696d80f118eSChris Lattner
697d80f118eSChris Lattner    $ ./a.out
698d80f118eSChris Lattner    ready> def foo(x y) x+foo(y, 4.0);
699d80f118eSChris Lattner    Parsed a function definition.
700d80f118eSChris Lattner    ready> def foo(x y) x+y y;
701d80f118eSChris Lattner    Parsed a function definition.
702d80f118eSChris Lattner    Parsed a top-level expr
703d80f118eSChris Lattner    ready> def foo(x y) x+y );
704d80f118eSChris Lattner    Parsed a function definition.
705d80f118eSChris Lattner    Error: unknown token when expecting an expression
706d80f118eSChris Lattner    ready> extern sin(a);
707d80f118eSChris Lattner    ready> Parsed an extern
708d80f118eSChris Lattner    ready> ^D
709d80f118eSChris Lattner    $
710d80f118eSChris Lattner
711d80f118eSChris LattnerThere is a lot of room for extension here. You can define new AST nodes,
712d80f118eSChris Lattnerextend the language in many ways, etc. In the `next
713d80f118eSChris Lattnerinstallment <LangImpl03.html>`_, we will describe how to generate LLVM
714d80f118eSChris LattnerIntermediate Representation (IR) from the AST.
715d80f118eSChris Lattner
716d80f118eSChris LattnerFull Code Listing
717d80f118eSChris Lattner=================
718d80f118eSChris Lattner
719d80f118eSChris LattnerHere is the complete code listing for our running example. Because this
720d80f118eSChris Lattneruses the LLVM libraries, we need to link them in. To do this, we use the
721*72fd1033SSylvestre Ledru`llvm-config <https://llvm.org/cmds/llvm-config.html>`_ tool to inform
722d80f118eSChris Lattnerour makefile/command line about which options to use:
723d80f118eSChris Lattner
724d80f118eSChris Lattner.. code-block:: bash
725d80f118eSChris Lattner
726d80f118eSChris Lattner    # Compile
727d80f118eSChris Lattner    clang++ -g -O3 toy.cpp `llvm-config --cxxflags`
728d80f118eSChris Lattner    # Run
729d80f118eSChris Lattner    ./a.out
730d80f118eSChris Lattner
731d80f118eSChris LattnerHere is the code:
732d80f118eSChris Lattner
733147e0ddaSHans Wennborg.. literalinclude:: ../../../examples/Kaleidoscope/Chapter2/toy.cpp
734d80f118eSChris Lattner   :language: c++
735d80f118eSChris Lattner
736d80f118eSChris Lattner`Next: Implementing Code Generation to LLVM IR <LangImpl03.html>`_
737d80f118eSChris Lattner
738