A Simple Guide to Tokenization and 8xp generation

merthsoft · Administrator Emeritus (Posts: 3820)

You take each character one at a time and traverse down your tree, starting at the top:

The current character is "D", so take the "D" path:

This node has some data in it, but we're not ready to look at that yet--we still have more characters! Take the "i" path next:

I think by now you can see where this is going. You're going to take the "s", "p", and "(space)" paths, and eventually end up here:

Now, there are a couple of things to note here:
1) We're at a leaf node (a node circle with no arrows coming out of it).
2) This node has data.
Using that information, we know that we've successfully found a token, and we can now save that value off somewhere as our current program. We'll notate this as {DE}.

Once we get to this point, we start back up at the top of the trie, but at the next character. So, we're back here:

The current character is "D", so take the "D" path:

Again, this node has data in it, but we need to look at the next character. The next character is another "D". Looking at this trie we can see that there is no "D" path off of the "D" node, so we can't go anywhere. We're stuck at the "D" node! However, this node has data, so let's just assume we've got a token and use that data. Our program is now {DE,44}.

The next step is the same. We're back here:

The current character is "D", so take the "D" path:

It has data, but we need to check the next character. There is no next character, so we know we're good and can add the data again. Our program is {DE,44,44}.
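For anyone who prefers code to pictures, here is a minimal Python sketch of the walk above. The names (`TrieNode`, `build_trie`, `tokenize`) are my own for illustration and not from any library, and the table holds only the two tokens used in this example. One assumption beyond the walk-through: the sketch remembers the last node that had data, so it can still fall back to a shorter token if it gets stuck on a node without data.

```python
class TrieNode:
    """One trie node: child edges keyed by character, plus optional token data."""

    def __init__(self):
        self.children = {}
        self.data = None  # token bytes if a token ends at this node, else None


def build_trie(tokens):
    """Build a trie from a {string: bytes} mapping."""
    root = TrieNode()
    for text, data in tokens.items():
        node = root
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())
        node.data = data
    return root


def tokenize(source, root):
    """Walk the trie from each position and emit the longest token found."""
    out = bytearray()
    pos = 0
    while pos < len(source):
        node = root
        best_data, best_end = None, pos
        i = pos
        # Follow paths until we run out of characters or get stuck.
        while i < len(source) and source[i] in node.children:
            node = node.children[source[i]]
            i += 1
            if node.data is not None:
                best_data, best_end = node.data, i  # remember the last node with data
        if best_data is None:
            raise ValueError(f"no token matches at position {pos}")
        out += best_data
        pos = best_end  # start back at the top of the trie, at the next character
    return bytes(out)


# Only the two tokens from the walk-through; a real table comes from a reference file.
TOKENS = {"Disp ": b"\xDE", "D": b"\x44"}
program = tokenize("Disp DD", build_trie(TOKENS))
print(program.hex(" ").upper())  # DE 44 44
```

The output matches the {DE,44,44} we built by hand.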

Well, that's a wrap! We tokenized our first program! This one was pretty easy, though, so let's look at something a bit harder...

From this point forward, I'm not going to draw the tries, and I'm going to assume you have a reference file for the tokens and their data. If you don't know where to get such a file, download TokenIDE and look in the Tokens directory. TI-84+CSE.xml is the file I'm using as reference while writing this guide. A note about representation: instead of drawing the trie, I will use something like D->i->s->p->(space) to signify a traversal through the trie. Hopefully that's clear.

So, let's look at this program that we're going to tokenize:

merthsoft · Administrator Emeritus (Posts: 3820)

Where "value(i)" returns the string value of the token "i" and "valueExists(i)" returns true if that token exists. This works fairly well for something that only has single-byte tokens. There are some holes in the array, so it's not the most space-efficient method. You can fix that by using a dictionary or a similar data structure.
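As a rough illustration of the dictionary idea, here is a minimal Python sketch. The `TOKENS` table and the `detokenize` name are hypothetical, and the table holds only the two single-byte tokens from the earlier example rather than a full token set.

```python
# Hypothetical single-byte table: only the two tokens from the earlier example.
TOKENS = {0xDE: "Disp ", 0x44: "D"}


def detokenize(data, tokens=TOKENS):
    """Turn single-byte token data back into text, failing on unknown bytes."""
    parts = []
    for offset, byte in enumerate(data):
        if byte not in tokens:  # plays the role of valueExists(i)
            raise ValueError(f"unknown token 0x{byte:02X} at offset {offset}")
        parts.append(tokens[byte])  # plays the role of value(i)
    return "".join(parts)


text = detokenize(bytes([0xDE, 0x44, 0x44]))
print(text)  # Disp DD
```

A dictionary only stores the entries that exist, so the holes cost nothing.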

So this gets us an easy way to detokenize programs: just read the data, look each byte up in the tokens array, and use the string it finds. If it's not found, it's not a valid token, so it's likely not a valid program (or you're missing some tokens in your data set). This only gets us halfway there, though, for real-world calculators. So let's open up the TI-84+CSE.xml file and take a look at what we're actually dealing with. I'll save you the trouble of digging through the whole file and just let you know that there are two-byte tokens. The ones we'll work with today are the string variables:

merthsoft · Administrator Emeritus (Posts: 3820)

Look at that! 0xF7 is both a prefix byte for multi-byte tokens and a token itself! So, how do we handle this?

Easy! We already have a solution that maps streams of values of one type to values of another--namely our Trie maps streams of characters to bytes. Well, now we just need to map streams of bytes to strings! I will intentionally leave the details of this vague; you should be able to take the same theory from the first post and apply it in the other direction. If you're struggling, just let me know and I'll help you out.
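If you'd like a starting point, one possible shape for this is sketched below in Python. It is just one way to do it under a few assumptions: the function names are made up for illustration, the trie is a plain nested dictionary, and the table is a placeholder. In particular, the two 0xF7 entries use stand-in strings rather than real token names, and the Str1 bytes should be checked against your reference file.

```python
def build_byte_trie(tokens):
    """Build a nested-dict trie from a {bytes: string} mapping."""
    root = {"children": {}, "text": None}
    for key, text in tokens.items():
        node = root
        for byte in key:
            node = node["children"].setdefault(byte, {"children": {}, "text": None})
        node["text"] = text
    return root


def detokenize(data, root):
    """Walk the byte trie from each offset and emit the longest token found."""
    parts = []
    pos = 0
    while pos < len(data):
        node = root
        best_text, best_end = None, pos
        i = pos
        while i < len(data) and data[i] in node["children"]:
            node = node["children"][data[i]]
            i += 1
            if node["text"] is not None:
                best_text, best_end = node["text"], i  # last node that had a string
        if best_text is None:
            raise ValueError(f"unknown token at offset {pos}")
        parts.append(best_text)
        pos = best_end
    return "".join(parts)


# Placeholder table. The 0xF7 strings are stand-ins, NOT real token names.
BYTE_TOKENS = {
    b"\xDE": "Disp ",
    b"\x44": "D",
    b"\xAA\x00": "Str1",        # assumed two-byte string variable; verify in the XML
    b"\xF7": "[F7]",            # stand-in: 0xF7 used as a token by itself
    b"\xF7\x00": "[F7 00]",     # stand-in: 0xF7 used as a prefix byte
}

trie = build_byte_trie(BYTE_TOKENS)
print(detokenize(b"\xDE\xAA\x00", trie))  # Disp Str1
print(detokenize(b"\xF7\x44", trie))      # [F7]D
print(detokenize(b"\xF7\x00", trie))      # [F7 00]
```

Because the walk keeps going while a longer match is possible, a byte that is both a prefix and a token resolves to the two-byte token when the second byte fits, and to the single-byte token otherwise.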

Next time we'll talk about the file format--something I'm sure everyone is quite excited for.