As a few of you know now, I've made the decision to write my own compiler for a language I'm designing. This thread will detail how the compiler works, how I went about writing the compiler core, how the language works, what I've learned, and more. I've formatted the post like a guide, but I make no promises as to the quality of my guide.

This is quite a long post, so I've split it into multiple parts. I recognize this might get extremely rambly, so I've added a short, concise summary at the end of most sections.

This first post will explain the compiler and how it works at a relatively high level. The second post will explain the language and its semantics (in theory). If necessary, I'll make a third post to explain the runtime and standard library once I write those. This entire project is still very much a work in progress and I cannot wait to continue working on it!

Part 1: Disclaimers and overall design goals.

  • I'm consciously choosing not to use a tool like yacc, even though it's fantastic. This is just a project I thought was cool, so I'm doing it. It's been a great learning experience already, and I can barely compile simple programs.
  • I'm trying not to write a C compiler, but it's been really tempting at times. C is so clever- every time I want to implement a feature, I end up reducing the feature step by step so that it is easier to implement, and I reliably end up at the equivalent in C. I can't really put into words how oddly cool this is to me. I usually write in languages like JavaScript or Python- implementing this compiler has really shown me "oh, so that's why that is like that in <insert name of lower-level language here>".
  • I am, however, trying to make a language that has features I'd like to use. Namely, I want a scripting language that lets me focus on writing my code- all my memory management should be handled for me, I'd like classes with inheritance, I should have a nice standard library, etc etc. I should never have to deal with pointers, but the language should not stop me if I want to do my own management for speed reasons. However, I still want this language to be a nice compilation target.
  • I do know that I should probably design my language before starting work on the compiler, but I am willing to do large refactors if I need to- the things I've learned because I've been simultaneously implementing and designing the compiler and the language have made it totally worth it for this side-project of mine.


Summary: I intentionally took the hard route to learn things, and oh boy have I learned things.

Part 2: A heavily-simplified introductory lesson in WebAssembly.
WebAssembly (WASM) is the output language for the compiler.

WASM has two (technically three, but we're simplifying here) representations: a binary format and a textual format. They are essentially equivalent in structure (in most cases), so it's relatively trivial to make the compiler spit out one representation or the other- the compiler is designed in modules, so I just strap on a different code generation module (explained below). The text representation is much easier to understand and debug, so I have the compiler output it.

The below sections should explain WASM features as I use or talk about them, but I highly recommend having a basic understanding of the language before proceeding. This MDN page explains the basics of the language far better than I ever could.

Summary: Though you don't need to understand WebAssembly/WASM to comprehend this guide, I recommend reading this MDN page because it explains the language well.

Part 3: Compiler overview and roadmap.

Before a compiler reaches your program, it's just random gibberish as far as the computer is concerned. The job of the compiler is to turn this random gibberish into a format that the computer can understand.

This compiler is split into 4 distinct sections:

  • The lexer: The lexer takes the "random gibberish" provided to the compiler and identifies tokens and symbols. On top of this, it assigns each token a meaning. Lexers are not strictly required for a working compiler (and many compilers do not have one), but one is included here to make the job of the next step, the parser, easier. Relatively basic syntax checking is done in this stage.
  • The parser: The parser takes the list of tokens provided by the lexer and identifies language constructs, like expressions, functions, conditions, and more. The output of the parser is an abstract syntax tree (explained in more detail later). In this stage, we do more advanced syntax checking.
  • Code optimization: Here, we perform transformations on the abstract syntax tree to reduce the size and increase the speed of the compiled output. Things like dead code removal, constant folding, and constant propagation (these will be explained later) all happen here. This is the trickiest and most complex part of the compiler, because many different complicated optimizations may be performed, requiring a variety of code representations and conversions. I also search for and organize variables here, to aid in the last step, code generation.
  • Code generation: The code generation phase takes the optimized abstract syntax tree from the previous steps and converts this intermediate representation into a computer-understandable output. A few further optimizations are performed here.


Summary: Before the compiler compiles your code, the computer cannot understand it. The compiler has 4 main parts: a lexer to find and label tokens, a parser to find and organize higher-level language constructs, a code optimizer to compress the code, and a code generator to actually produce an output.

Part 4: The lexer.

Despite not being strictly necessary, the lexer serves an important purpose.

In this case, tokens are merely strings of text that are assigned a meaning. Examples of token meanings include reserved keywords, numeric literals, operators, and more.

Lexers go through the input string, and perform a series of checks to identify tokens. We strip comments and some whitespace here.

Lexer pseudocode:

Code:
while the reading position is less than the length of the input string:
  loop through all the token types in order, stopping if one returns a token

  if a token was returned:
    add the token to the resulting list of tokens, and advance the reading position by the token's length
  otherwise:
    check for a comment or whitespace

    if we found some, skip past it
    otherwise, throw a syntax error because there's an unknown token


The most important thing to remember here is to check for token types in order. It is quite common, actually, that a given substring can fit into two token categories. The string "if" is a good example of this. It follows the pattern of a variable name, but it is clearly a reserved keyword. Therefore, we should check for reserved keywords before variable names. This seems obvious, but it is important to remember.
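To make the ordering concrete, here's a minimal lexer sketch in Python. The token types and patterns are hypothetical (not the compiler's actual ones), but the loop mirrors the pseudocode above, and keywords are deliberately checked before names:

```python
import re

# Hypothetical token types for illustration. Order matters: keywords
# are tried before names, so "if" never lexes as a variable name.
TOKEN_TYPES = [
    ("keyword", re.compile(r"(if|else|return)\b")),
    ("number",  re.compile(r"\d+")),
    ("name",    re.compile(r"[A-Za-z_]\w*")),
    ("op",      re.compile(r"[+\-*=]")),
]
SKIP = re.compile(r"\s+|#[^\n]*")  # whitespace and '#' comments

def lex(source):
    tokens, pos = [], 0
    while pos < len(source):
        for kind, pattern in TOKEN_TYPES:   # try each token type in order
            match = pattern.match(source, pos)
            if match:
                tokens.append((kind, match.group()))
                pos = match.end()
                break
        else:                               # no token type matched
            match = SKIP.match(source, pos)
            if match:
                pos = match.end()           # strip comment/whitespace
            else:
                raise SyntaxError(f"unknown token at position {pos}")
    return tokens

lex("if x1 + 42")
# → [("keyword", "if"), ("name", "x1"), ("op", "+"), ("number", "42")]
```

Note the `\b` word boundary on the keyword pattern, so something like "ifx" still lexes as a name rather than a keyword followed by a name.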

Summary: The lexer finds and identifies tokens by looping through the string in a systematic fashion.

Part 5: Context-free grammars.

Before I can fully explain how the parser works, I have to explain the glorious world of context-free grammar.

CFGs describe the structure of a language through a set of symbols and a set of production rules.

There are two types of symbols: terminal symbols (in this case, all of our terminals are tokens) and non-terminal symbols.

The rules define each non-terminal symbol as being equivalent to some sequence of other symbols. For example, a rule might say that an expression is a term, followed by a plus sign, followed by another expression.

Summary: I'm terrible at explaining- just read the Wikipedia article.

Part 6: The metalanguage.

It didn't take me very long to realize that describing all the grammar bits and bobs manually using my relatively verbose method was not really going to work.

To remedy this, the compiler actually contains a smaller compiler that compiles a language describing our context-free grammar into the verbose format the rest of the code requires. This lets me add or tweak language features much faster, and it's great.

My metalanguage is super straightforward- it's basically a simple textual representation of Backus-Naur form, the way most CFGs are normally notated.
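For illustration only, here's a hypothetical sketch (not the actual metalanguage) of how a BNF-ish rule could be expanded into a more verbose table of alternatives for the rest of the compiler to consume:

```python
# Hypothetical: expand a BNF-ish metalanguage rule into a verbose
# rule table. The "::=" and "|" syntax is illustrative, not the
# compiler's actual metalanguage.
def parse_rule(line):
    lhs, rhs = line.split("::=")
    # Each "|"-separated alternative becomes a list of symbol names.
    alternatives = [alt.split() for alt in rhs.split("|")]
    return lhs.strip(), alternatives

parse_rule("expr ::= term | expr PLUS term")
# → ("expr", [["term"], ["expr", "PLUS", "term"]])
```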

Summary: The compiler contains a smaller compiler that implements a metalanguage that describes the main language.

Part 7: Abstract syntax trees.
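For a quick taste anyway: an AST captures the nested structure of the source rather than its flat text. A hypothetical tuple-based representation (made up for illustration) of the expression 1 + 2 * 3 might look like this- note that multiplication binds tighter, so it sits deeper in the tree:

```python
# Hypothetical tuple-based AST for the expression 1 + 2 * 3.
ast = ("+", ("num", 1), ("*", ("num", 2), ("num", 3)))

def to_string(node):
    """Walk the tree back into a fully parenthesized expression."""
    if node[0] == "num":
        return str(node[1])
    op, left, right = node
    return f"({to_string(left)} {op} {to_string(right)})"

to_string(ast)   # → "(1 + (2 * 3))"
```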

Summary: Best described by Wikipedia.

Part 8: The parser.

The parser is probably the coolest part of the entire compiler, at least in my opinion.

I'm very new to this stuff, so I chose to implement a recursive descent parser, which (as the name suggests) is recursive.

In lieu of writing some cruddy pseudocode (or worse- trying to explain it!), I'll link to this relatively helpful and straightforward article even though I had some issues with it.
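That said, to give a flavor of recursive descent: here's a tiny hypothetical parser (not the compiler's actual one) for the toy grammar expr := term (("+" | "-") term)*, term := NUMBER, working over token strings like a lexer might produce:

```python
# Minimal recursive-descent sketch for a hypothetical toy grammar.
# Each grammar rule becomes a function; rules that reference other
# rules simply call the matching function - hence "recursive".
def parse_expr(tokens, pos=0):
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_term(tokens, pos + 1)
        node = (op, node, right)        # build an AST node as we go
    return node, pos

def parse_term(tokens, pos):
    token = tokens[pos]
    if not token.isdigit():
        raise SyntaxError(f"expected a number, got {token!r}")
    return ("num", int(token)), pos + 1

ast, _ = parse_expr(["1", "+", "2", "-", "3"])
# ast == ("-", ("+", ("num", 1), ("num", 2)), ("num", 3))
```

The while loop makes the operators left-associative, which is why 1 + 2 - 3 groups as (1 + 2) - 3.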

Summary: The parser takes the tokens and organizes them into structures.

Part 9: Code optimization and passes.

Most compilers perform optimizations on the code by transforming it to make it smaller or faster. These transformations may require the code to be reorganized. Here are some examples:


  • Dead code elimination. Removes code that is never called.
  • Constant propagation. This looks for variables whose values are not changed, and eliminates the variable lookup overhead by replacing all references to it with a constant.
  • Constant folding. This evaluates constant subexpressions at compile time, e.g. 2*3+4 becomes 10. I'd like to go further and simplify expressions algebraically too- for example, reducing 2*(1+x/2)-2 to just x. That is one of the more complicated optimizations to perform, but it can have a major impact on size and speed. Though I have not implemented it as of now, I plan to implement some form of small CAS to handle it.
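As a sketch of the simplest of these, here's a hypothetical constant-folding pass over a tuple-shaped AST (the representation is made up for illustration): it recursively folds the operands first, then evaluates any node whose operands are all constants.

```python
# Hypothetical constant-folding pass over a tuple-shaped AST.
# Leaves are ("num", value) or ("name", identifier); interior nodes
# are (op, left, right) with op in "+" or "-".
def fold(node):
    if node[0] in ("num", "name"):
        return node                      # leaves fold to themselves
    op, left, right = node
    left, right = fold(left), fold(right)
    if left[0] == "num" and right[0] == "num":
        value = left[1] + right[1] if op == "+" else left[1] - right[1]
        return ("num", value)            # evaluate at compile time
    return (op, left, right)             # can't fold: keep the node

fold(("+", ("num", 2), ("-", ("num", 5), ("num", 3))))   # → ("num", 4)
```

A subtree containing a variable is left alone, e.g. x + (1 + 2) folds to x + 3 but no further.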


On top of all this, we search for and organize local variables to assist in the last step, code generation.

Summary: Good compilers optimize the code to make it smaller and faster.

Part 10: Code generation.

After all this work, it is finally time to create an output program.

We do this by recursively going through the constructs in the tree, returning compiled versions of each construct at every step.

For constructs with a direct equivalent in the output language, this is trivial- but that is often not the case.
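The trivial case can be sketched like this (hypothetical tuple AST and emitter, not the compiler's actual code). WASM is a stack machine, so a post-order walk works: emit the operands first, then the operator.

```python
# Hypothetical: recursively generate WebAssembly text instructions
# from a tuple-shaped AST. ("num", n) is a constant, ("local", i) is
# a local variable, and (op, left, right) is an arithmetic node.
def emit(node):
    if node[0] == "num":
        return [f"i32.const {node[1]}"]
    if node[0] == "local":
        return [f"get_local {node[1]}"]
    op, left, right = node
    instr = {"+": "i32.add", "-": "i32.sub"}[op]
    return emit(left) + emit(right) + [instr]   # operands, then operator

# Roughly the body of the add42 example shown later: 42 + local 0
emit(("+", ("num", 42), ("local", 0)))
# → ["i32.const 42", "get_local 0", "i32.add"]
```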

Summary: Generate code by recursively going through the tree, generating code for each construct one-at-a-time.

Part 11: What I still want to do.

I've been focusing so much of my time on writing and cleaning up the compiler that the language is barely implemented. I still need to implement things like loops, conditionals, and more. These won't take much time, hopefully.

I want to continue cleaning the codebase. It's messy. As an extension of this, I want my code to be clean enough that I'm not scared of open-sourcing it.

I need to refactor the parser and the metalanguage so that I can automatically generate clean, understandable documentation directly from the metalanguage.

I want to implement a runtime, with shiny automatic memory management.

I want a standard library.

I want to implement arrays, more default types, etc.

Part 12: Eye candy!

Currently, this code:

Code:
module {
  const i32 FOURTY_TWO = 42

  i32 add42(i32 x) {
    local i32 y = const.FOURTY_TWO
    return local.y + local.x
  }
 
  i32 add50(i32 x) {
    add42(local.x)
    local.x = local.y + 8

    return local.x
  }

  export add42 as "addFourtyTwo"
  export add50
}


is compiled into this WebAssembly text representation:


Code:
(module
    (func (param i32) (result i32) (local i32)
        i32.const 42
        set_local 1
        get_local 1
        get_local 0
        i32.add
    )
    (func (param i32) (result i32) (local i32)
        get_local 0
        call 0
        set_local 1
        get_local 1
        i32.const 8
        i32.add
        set_local 0
        get_local 0
    )
    (export "addFourtyTwo" (func 0))
    (export "add50" (func 1))
)


but it's a short ways away from the optimal output:


Code:
(module
    (func (param i32) (result i32) (local i32)
        i32.const 42
        get_local 0
        i32.add
    )
    (func (param i32) (result i32) (local i32)
        get_local 0
        call 0
        i32.const 8
        i32.add
    )
    (export "addFourtyTwo" (func 0))
    (export "add50" (func 1))
)
This is great! I am developing my own too! I will probably post about it today, even though its anatomy isn't complete. The compiler output will be an exe file (No external libs at all), with the option to compile all data resources with it.
  