Writing a compiler.

iPhoenix · hi (Posts: 1858)

The most important thing to remember here is to check for token types in order. It is quite common, actually, that a given substring can fit into two token categories. The string "if" is a good example of this. It follows the pattern of a variable name, but it is clearly a reserved keyword. Therefore, we should check for reserved keywords before variable names. This seems obvious, but it is important to remember.
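To make the ordering point concrete, here is a minimal sketch of an ordered lexer in Python. It is not the lexer from this project: the token names, the patterns, and the `tokenize` function are all assumptions made up for illustration. The only thing it is meant to show is that the keyword pattern is tried before the variable-name pattern.

```python
import re

# Hypothetical token table. Order matters: earlier patterns win.
TOKEN_SPEC = [
    ("KEYWORD", re.compile(r"(?:if|else|while)\b")),  # reserved words first
    ("NAME",    re.compile(r"[A-Za-z_]\w*")),         # would also match "if"
    ("NUMBER",  re.compile(r"\d+")),
    ("OP",      re.compile(r"[+\-*/=()]")),
    ("SKIP",    re.compile(r"\s+")),                  # whitespace is dropped
]

def tokenize(source):
    """Return a list of (kind, text) pairs, trying each pattern in order."""
    tokens = []
    pos = 0
    while pos < len(source):
        for kind, pattern in TOKEN_SPEC:
            match = pattern.match(source, pos)
            if match:
                if kind != "SKIP":
                    tokens.append((kind, match.group()))
                pos = match.end()
                break
        else:
            # No pattern matched at this position.
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
    return tokens
```

With this table, `tokenize("if x = 10")` tags "if" as a KEYWORD, while a name such as "iffy" still comes out as a NAME because the keyword pattern requires a word boundary after the keyword.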

Summary: The lexer finds and identifies tokens by looping through the string in a systematic fashion.

Part 4: Context-free grammar

Before I can fully explain how the parser works, I have to explain the glorious world of context-free grammar.

CFGs describe the structure of a language, and do it through a set of nodes and rules.

There are two types of nodes, terminal nodes (in this case, all of our terminal nodes are tokens), and non-terminal nodes.

The rules define the non-terminal nodes as being equivalent to a set of other nodes.
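As a rough illustration of nodes and rules, a grammar can be written down as plain data. The grammar below (sums such as "1 + 2 + 3") and the names `GRAMMAR`, `is_terminal`, and `terminals` are assumptions for this example only, not the format the compiler actually uses.

```python
# Hypothetical grammar for sums like "1 + 2 + 3".
# Keys are non-terminal nodes. Each maps to a list of alternative rules,
# and each rule is a sequence of node names.
GRAMMAR = {
    "expr": [["term", "PLUS", "expr"], ["term"]],
    "term": [["NUMBER"]],
}

def is_terminal(node, grammar=GRAMMAR):
    """A node with no rule of its own is a terminal, i.e. a token kind."""
    return node not in grammar

def terminals(grammar=GRAMMAR):
    """Collect every terminal node mentioned anywhere in the rules."""
    found = set()
    for alternatives in grammar.values():
        for rule in alternatives:
            found.update(node for node in rule if is_terminal(node, grammar))
    return sorted(found)
```

Here "expr" and "term" are non-terminals because they have rules, and "PLUS" and "NUMBER" are terminals because the lexer is expected to produce them directly.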

Summary: I'm terrible at explaining, so just read the Wikipedia article.

Part 5: The metalanguage.

It didn't take me very long to realize that describing all the grammar bits and bobs manually using my relatively verbose method was not really going to work.

To remedy this, the compiler actually contains a smaller compiler that compiles a language describing our context-free grammar into the verbose format the rest of the code requires. This lets me add or tweak language features much faster, and it's great.

My metalanguage is super straightforward: it's basically a simple textual representation of Backus-Naur form, the way most CFGs are normally notated.
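To give a feel for what "compiling" a BNF-style description might involve, here is a small sketch. The `::=` and `|` syntax, the `parse_bnf` function, and the dictionary it returns are assumptions for illustration. The real metalanguage and the verbose format it targets are not shown in this post.

```python
def parse_bnf(text):
    """Turn lines like 'name ::= a b | c' into {name: [[a, b], [c]]}."""
    grammar = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        name, separator, body = line.partition("::=")
        if not separator:
            raise ValueError(f"missing '::=' in rule: {line!r}")
        # Each '|' separates one alternative; each alternative is a
        # whitespace-separated sequence of node names.
        alternatives = [alt.split() for alt in body.split("|")]
        grammar.setdefault(name.strip(), []).extend(alternatives)
    return grammar

EXAMPLE = """
expr ::= term PLUS expr | term
term ::= NUMBER
"""
```

Calling `parse_bnf(EXAMPLE)` gives a dictionary with one entry per non-terminal, which is the kind of structure a parser could then walk.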

Summary: The compiler contains a smaller compiler that implements a metalanguage that describes the main language.

Part 6: Abstract syntax trees.

Summary: Best described by Wikipedia.

Part 7: The parser.

The parser is probably the coolest part of the entire compiler, at least in my opinion.

I'm very new to this stuff, so I chose to implement a recursive descent parser, which (as the name suggests) is recursive.

In lieu of writing some cruddy pseudocode (or worse, trying to explain it!), I'll link to this relatively helpful and straightforward article, even though I had some issues with it.
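For readers who want a quick look at the general shape anyway, here is a minimal recursive descent parser for arithmetic. It is a sketch and not the parser from this compiler: the grammar in the docstring, the nested-tuple tree format, and the assumption that tokens arrive as plain strings are all choices made for this example.

```python
def parse(tokens):
    """Parse a list of token strings into a nested-tuple syntax tree.

    Grammar (an assumption for this example):
        expr   ::= term (("+" | "-") term)*
        term   ::= factor (("*" | "/") factor)*
        factor ::= NUMBER | "(" expr ")"
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def expr():
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    def term():
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def factor():
        token = advance()
        if token == "(":
            node = expr()  # recursion: a factor can contain a whole expr
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token is not None and token.isdigit():
            return ("num", int(token))
        raise SyntaxError(f"unexpected token {token!r}")

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree
```

Each grammar rule becomes one function, and the functions call each other the same way the rules refer to each other. Operator precedence falls out of the nesting: `term` is called from `expr`, so multiplication binds tighter than addition.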

Summary: The parser takes the tokens and organizes them into structures.

Part 8: Code optimization and passes.

Most compilers perform optimizations on the code by transforming it to make it smaller or faster. These transformations may require the code to be reorganized. Here are some examples:

Dead code elimination. Removes code that is never called.
Constant propagation. This looks for variables whose values are never changed, and eliminates the variable lookup overhead by replacing all references to each one with its constant value.
Constant folding. This is one of the more complicated optimizations to perform, but it can have a major impact on size and speed. It removes unnecessary terms in expressions, e.g. 2*(1+x/2)-2 is reduced to just x. Though I have not implemented it yet, I plan to implement some form of small CAS to handle it.
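As a small taste of what such a pass looks like, here is a sketch of the simplest form of constant folding: evaluating subexpressions whose operands are all known numbers. It assumes a hypothetical nested-tuple tree such as ("+", ("var", "x"), ("num", 2)), and it deliberately does not attempt the algebraic simplification described above (reducing 2*(1+x/2)-2 to x), which needs something closer to a CAS.

```python
import operator

# Operators this sketch knows how to evaluate at compile time.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(node):
    """Fold constant subexpressions in a nested-tuple tree, bottom-up."""
    kind = node[0]
    if kind in ("num", "var"):
        return node  # leaves are already as simple as they get
    left, right = fold(node[1]), fold(node[2])
    if kind in OPS and left[0] == "num" and right[0] == "num":
        # Both operands are known, so compute the result now.
        return ("num", OPS[kind](left[1], right[1]))
    return (kind, left, right)
```

For example, folding the tree for x + 2*3 leaves the variable alone and replaces the multiplication with the number 6.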

On top of all this, we search for and organize local variables to assist in the last step, code generation.

Summary: Good compilers optimize the code to make it smaller and faster.

Part 9: Code generation

After all this work, it is finally time to create an output program.

We do this by recursively going through the constructs in the tree, returning compiled versions of each construct at every step.

For constructs that map directly onto a built-in of the output language, this is trivial, but that is often not the case.
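The recursive walk can be sketched in a few lines. This example assumes a hypothetical nested-tuple tree and targets a made-up stack machine with PUSH, LOAD, ADD, SUB, MUL, and DIV instructions. The actual output language of this compiler is different, and real constructs rarely map this neatly.

```python
# Hypothetical mapping from tree operators to stack-machine instructions.
OPCODES = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}

def generate(node):
    """Return a list of instructions for a hypothetical stack machine."""
    kind = node[0]
    if kind == "num":
        return [f"PUSH {node[1]}"]
    if kind == "var":
        return [f"LOAD {node[1]}"]
    if kind not in OPCODES:
        raise ValueError(f"no code generator for construct {kind!r}")
    # Compile both operands first, then the operator that consumes them.
    return generate(node[1]) + generate(node[2]) + [OPCODES[kind]]
```

Each call returns the compiled form of one construct, and the parent simply stitches its children's output together, which is the "returning compiled versions at every step" idea in miniature.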

Summary: Generate code by recursively going through the tree, generating code for each construct one at a time.

Part 10: What I still want to do.

I've spent so much of my time writing and cleaning up the compiler that the language itself is barely implemented. I still need to implement things like loops, conditionals, and more. These won't take much time, hopefully.

I want to continue cleaning the codebase. It's messy. As an extension of this, I want my code to be clean enough I'm not scared of open-sourcing it.

I need to refactor the parser and the metalanguage so that I can automatically generate clean, understandable documentation directly from the metalanguage.

I want to implement a runtime, with shiny automatic memory management.

I want a standard library.

I want to implement arrays, more default types, etc.

Part 11: Eye candy!

Currently, this code: