Monday, March 23, 2009

Writing a Compiler with F# Part 3

Now that we can read user input and pipe it into our evaluator, we need to start implementing the main parts of the compiler.  First it's helpful to understand what compilers actually do.  In a rough sense, they take the text files containing the code and output an exe or a dll or something, i.e., they transform human readable text into machine code.  This is a simplified definition, but it will suffice for our purposes.

It may be tempting to go directly from text into code, but for anything beyond the simplest of languages this ends up being very difficult both to write and maintain.  Most compilers divide up the work into several stages that form a pipeline of sorts.  Some common stages are:

Lexing
The raw sequence of characters is transformed into a sequence of Tokens. For example, the Lexer would consume 'w' 'h' 'i' 'l' 'e' from the input file and return the WHILE token

Parsing
The tokens from the Lexer are consumed to produce an Abstract Syntax Tree. An AST is a convenient in-memory representation of the program. More on these in a moment. For example, the Parser might consume VAR IDENTIFIER('i') COLON TYPENAME('int') and produce an instance of our VariableDeclaration class with the Name property set to 'i' and the type set to 'int'
Type Checking
Once we have the AST, we can make several passes over it to check the semantics of the language, including that programs are well typed. For example, when looking at an IfStatement we'd want to make sure that the Expression for the if clause was of type bool.
Other backend phases
We'll talk more about backend phases like instruction selection, register allocation, and optimizations if we ever get to building the backend. Ultimately these phases involve walking over the AST and finally spitting out machine code (assembly language, IL for .NET, etc.)

What you should see here is that the actual textual representation of the language plays a very small part in this whole process. A large portion of the compiler depends solely on the AST. Only the Lexer and Parser really care whether you end statements with semicolons or have syntactically significant whitespace, for example. Because of this, it's useful to start thinking about the AST first, and then worry about Lexing and Parsing.

In F#, the most common way to model ASTs is through the use of discriminated unions. These are roughly the functional programming equivalent of a polymorphic class hierarchy, although much simplified. Let's look at an AST for a very simple language that lets you print arithmetic expressions (provided you only want to use parentheses and addition/subtraction).

type binop =
| Plus
| Minus

type expr =
| Int of int
| BinOp of binop * expr * expr

type stmt =
| Print of expr
| Seq of stmt list

The keyword 'type' introduces the declaration of a discriminated union. Each option is followed by a |. binop, for example, resembles an Enum that can either be Plus or Minus. The main difference between these unions and enums becomes apparent when you look at 'expr': each of the options can carry data. We can have an expr Int that contains an integer value, or a BinOp that contains a binop (remember Plus or Minus) and two other expressions. We would represent "print 5 + 3" in our AST as Print(BinOp(Plus, Int(5), Int(3)))

Discriminated unions really become powerful when combined with pattern matching. Given the AST above we can easily build a simple evaluator. Here's all we need:

let rec evalE (env,fenv) = function
| Int i -> i
| BinOp(Plus, l, r) -> (evalE (env,fenv) l) + (evalE (env,fenv) r)
| BinOp(Minus, l, r) -> (evalE (env,fenv) l) - (evalE (env,fenv) r)

and eval (env,fenv) = function
| Seq stmts -> List.fold_left eval (env,fenv) stmts
| Print e -> printfn "%d" (evalE (env,fenv) e); (env,fenv)

Ok what's going on here? We've defined two mutually recursive functions, 'eval' for evaluating statements, and 'evalE' for evaluating expressions. We can safely ignore (env,fenv) for now; those will be useful when we add variables and functions. The body of 'eval' says: If I'm handed a Seq of statements, fold the eval function over each statement. Since we don't have variables yet this is could easily just be 'for each statement, evaluate it'. It also says: If I'm handed a Print of some expression, evaluate the expression and then print it out. Evaluating expressions works similarly. If I'm handed a BinOp(Plus) of two expressions, evaluate each one and then add them together.

In these relatively few lines of code, we've built a simple arithmetic evaluator. If you type this stuff into fsi, you can call the evaluator by doing something like:

eval (0,0) (Print(BinOp(Plus,Int(5),Int(6))))

Which will evaluate 5 + 6 and print out 11. Next time we see how to build a simple Lexer and Parser for the language so we don't have to type the AST out manually any more.

No comments: