I often get question about how programs like TokenIDE and SourceCoder turn a text file into a calculator program, so I figured I'd give a quick little intro to the theory behind how it works.
So, to start out, we're going to talk about the actual tokenization process. Tokenization is a general term, but in this context it refers to transforming text into the binary data that the calculator needs. For example, changing "Disp " to the hex value 0xDE. There are a couple of ways to do this, I'm going to discuss the way that uses a trie data structure. For example, this is what a small sub-trie of the entire trie would look like, this one with just the letter "D" and the token "Disp ":
So, using this trie, let's see how you'd tokenize a program like:
Code:
You take each character one at a time and traverse down your tree, starting at the top:
The current character is "D", so take the "D" path:
This node has some data in it, but we're not ready to look at that yet--we still have more characters! Take the "i" path next:
I think by now you can see where this is going. You're going to take the "s", "p", and "(space)" paths, and eventually end up here:
Now, there are a couple of things to note here:
1) We're at a leaf node (a node circle with no arrows coming out of it).
2) This node has data.
Using that information, we know that we've successfully found a token, and we can now save off that value somewhere to be out current program, we'll notate this as {DE}.
Once we get to this point, we start back up at the top of the trie, but at the next character. So, we're back here:
The current character is "D", so take the "D" path:
Again, this node has data in it, but we need to look at the next character. The next character is another "D". Looking at this trie we can see that there is no "D" path off of the "D" node, so we can't go anywhere. We're stuck at the "D" node! However, this node has data, so let's just assume we've got a token an use that data. Our program is now {DE,44}.
The next step is the same. We're back here:
The current character is "D", so take the "D" path:
It has data, but we need to check the next character. There is no next character, so we know we're good and can add the data again. Our program is {DE,44,44}.
Well, that's a wrap! We tokenized our first program! This one was pretty easy, though, so let's look at something a bit harder...
From this point forward, I'm not going to draw the tries, and I'm going to assume you have a reference file for the tokens and their data. If you don't know where to get such a file, download TokenIDE and look in the Tokens director. TI-84+CSE.xml is the file I'm using as reference while writing this guide. And a note about representation. Instead of drawing the trie, I will use something like D->i->s->p->(space) to signify a traversal through the trie. Hopefully that's clear.
So, let's look at this program that we're going to tokenize:
Code:
We already know how to tokenize Disp so let's skip that and just say our program is {DE} and we're going to start at the quote sign. Using my reference file, I can see that the quote sign is 0x2A. We'll tokenize that and our program is {DE,2A}.
Now comes the tricky part. Above we tokenized Disp so we know at some point we're at the 'p' node in the trie D->i->s->p. Looking at the next character, it's a 'p'. There's no 'p' edge coming off this node. So, what did we do last time we saw something like this? We stayed at the node we were at and used that for the data. However, D->i->s->p doesn't have any data! It's not a valid token. So what's the right thing to do? We should roll back to the last token in the path we're on that had data. So, check backwards. Does D->i->s have data? No, so go back another. D->i? Nope. Does D have data? Yes! So use that and our program is {DE,2A,44}. Because the D was the last letter we tokenized, we need start over at the next character. In this case, the 'i'. Now your tokenizer would start over the process, but we're going to shortcut it and just show you what your program data is now:
{DE,2A,44,BB,B8,BB,C3,BB,C0,BB,BC,BB,B0,BB,C9}.
A discerning reader will note that this seems like too much data. However, that's because lowercase letters are multi-byte tokens. That meas each lowercase letter actually takes up two bytes instead of one, so the "i" node has data BBB8.
Incidentally, TokenIDE has a handy tool to show that binary data of a program:
This can be a handy way to sanity check what you're doing.
That's all for today's episode! Let me know what questions you have--I'm sure there are some parts that are pretty unclear.
Next time I'll talk about the detokenization process, which is very similar, and I'll talk some more about some of the corner cases and caveats you run into when dealing with the calculators' vast token set.
So, to start out, we're going to talk about the actual tokenization process. Tokenization is a general term, but in this context it refers to transforming text into the binary data that the calculator needs. For example, changing "Disp " to the hex value 0xDE. There are a couple of ways to do this, I'm going to discuss the way that uses a trie data structure. For example, this is what a small sub-trie of the entire trie would look like, this one with just the letter "D" and the token "Disp ":
So, using this trie, let's see how you'd tokenize a program like:
Code:
Disp DD
You take each character one at a time and traverse down your tree, starting at the top:
The current character is "D", so take the "D" path:
This node has some data in it, but we're not ready to look at that yet--we still have more characters! Take the "i" path next:
I think by now you can see where this is going. You're going to take the "s", "p", and "(space)" paths, and eventually end up here:
Now, there are a couple of things to note here:
1) We're at a leaf node (a node circle with no arrows coming out of it).
2) This node has data.
Using that information, we know that we've successfully found a token, and we can now save off that value somewhere to be out current program, we'll notate this as {DE}.
Once we get to this point, we start back up at the top of the trie, but at the next character. So, we're back here:
The current character is "D", so take the "D" path:
Again, this node has data in it, but we need to look at the next character. The next character is another "D". Looking at this trie we can see that there is no "D" path off of the "D" node, so we can't go anywhere. We're stuck at the "D" node! However, this node has data, so let's just assume we've got a token an use that data. Our program is now {DE,44}.
The next step is the same. We're back here:
The current character is "D", so take the "D" path:
It has data, but we need to check the next character. There is no next character, so we know we're good and can add the data again. Our program is {DE,44,44}.
Well, that's a wrap! We tokenized our first program! This one was pretty easy, though, so let's look at something a bit harder...
From this point forward, I'm not going to draw the tries, and I'm going to assume you have a reference file for the tokens and their data. If you don't know where to get such a file, download TokenIDE and look in the Tokens director. TI-84+CSE.xml is the file I'm using as reference while writing this guide. And a note about representation. Instead of drawing the trie, I will use something like D->i->s->p->(space) to signify a traversal through the trie. Hopefully that's clear.
So, let's look at this program that we're going to tokenize:
Code:
Disp "Display
We already know how to tokenize Disp so let's skip that and just say our program is {DE} and we're going to start at the quote sign. Using my reference file, I can see that the quote sign is 0x2A. We'll tokenize that and our program is {DE,2A}.
Now comes the tricky part. Above we tokenized Disp so we know at some point we're at the 'p' node in the trie D->i->s->p. Looking at the next character, it's a 'p'. There's no 'p' edge coming off this node. So, what did we do last time we saw something like this? We stayed at the node we were at and used that for the data. However, D->i->s->p doesn't have any data! It's not a valid token. So what's the right thing to do? We should roll back to the last token in the path we're on that had data. So, check backwards. Does D->i->s have data? No, so go back another. D->i? Nope. Does D have data? Yes! So use that and our program is {DE,2A,44}. Because the D was the last letter we tokenized, we need start over at the next character. In this case, the 'i'. Now your tokenizer would start over the process, but we're going to shortcut it and just show you what your program data is now:
{DE,2A,44,BB,B8,BB,C3,BB,C0,BB,BC,BB,B0,BB,C9}.
A discerning reader will note that this seems like too much data. However, that's because lowercase letters are multi-byte tokens. That meas each lowercase letter actually takes up two bytes instead of one, so the "i" node has data BBB8.
Incidentally, TokenIDE has a handy tool to show that binary data of a program:
This can be a handy way to sanity check what you're doing.
That's all for today's episode! Let me know what questions you have--I'm sure there are some parts that are pretty unclear.
Next time I'll talk about the detokenization process, which is very similar, and I'll talk some more about some of the corner cases and caveats you run into when dealing with the calculators' vast token set.