I often get questions about how programs like TokenIDE and SourceCoder turn a text file into a calculator program, so I figured I'd give a quick little intro to the theory behind how it works.

So, to start out, we're going to talk about the actual tokenization process. Tokenization is a general term, but in this context it refers to transforming text into the binary data that the calculator needs. For example, changing "Disp " to the hex value 0xDE. There are a couple of ways to do this; I'm going to discuss the one that uses a trie data structure. For example, this is what a small sub-trie of the entire trie would look like, this one containing just the letter "D" and the token "Disp ":
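
To make this concrete, here's roughly what a node of such a trie could look like in code. This is a minimal C# sketch with names of my own choosing, not TokenIDE's actual internals:

Code:
using System.Collections.Generic;

// Each node has outgoing edges keyed by character, plus the token bytes
// if the path from the root to this node spells out a complete token.
class TrieNode {
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public byte[] TokenData; // null if this prefix isn't a complete token

    public void Insert(string text, byte[] data) {
        TrieNode node = this;
        foreach (char c in text) {
            if (!node.Children.ContainsKey(c))
                node.Children[c] = new TrieNode();
            node = node.Children[c];
        }
        node.TokenData = data;
    }
}

// Building the sub-trie described above:
//   var root = new TrieNode();
//   root.Insert("D", new byte[] { 0x44 });     // "D" is token 0x44
//   root.Insert("Disp ", new byte[] { 0xDE }); // "Disp " is token 0xDE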

So, using this trie, let's see how you'd tokenize a program like:

Code:
Disp DD

You take each character one at a time and traverse down your tree, starting at the top:

The current character is "D", so take the "D" path:

This node has some data in it, but we're not ready to look at that yet--we still have more characters! Take the "i" path next:

I think by now you can see where this is going. You're going to take the "s", "p", and "(space)" paths, and eventually end up here:

Now, there are a couple of things to note here:
1) We're at a leaf node (a circle with no arrows coming out of it).
2) This node has data.
Using that information, we know that we've successfully found a token, and we can now save off that value somewhere to be our current program; we'll notate this as {DE}.

Once we get to this point, we start back up at the top of the trie, but at the next character. So, we're back here:

The current character is "D", so take the "D" path:

Again, this node has data in it, but we need to look at the next character. The next character is another "D". Looking at this trie, we can see that there is no "D" path off of the "D" node, so we can't go anywhere. We're stuck at the "D" node! However, this node has data, so let's just assume we've got a token and use that data. Our program is now {DE,44}.

The next step is the same. We're back here:

The current character is "D", so take the "D" path:

It has data, but we need to check the next character. There is no next character, so we know we're good and can add the data again. Our program is {DE,44,44}.

Well, that's a wrap! We tokenized our first program! This one was pretty easy, though, so let's look at something a bit harder...

From this point forward, I'm not going to draw the tries, and I'm going to assume you have a reference file for the tokens and their data. If you don't know where to get such a file, download TokenIDE and look in the Tokens directory; TI-84+CSE.xml is the file I'm using as reference while writing this guide. And a note about representation: instead of drawing the trie, I will use something like D->i->s->p->(space) to signify a traversal through the trie. Hopefully that's clear.

So, let's look at this program that we're going to tokenize:

Code:
Disp "Display

We already know how to tokenize Disp, so let's skip that and just say our program is {DE}, and we're going to start at the quote sign. Using my reference file, I can see that the quote sign is 0x2A. We'll tokenize that and our program is {DE,2A}.

Now comes the tricky part. Above we tokenized Disp, so we know at some point we're at the 'p' node in the trie D->i->s->p. Looking at the next character, it's an 'l'. There's no 'l' edge coming off this node. So, what did we do last time we saw something like this? We stayed at the node we were at and used that for the data. However, D->i->s->p doesn't have any data! It's not a valid token. So what's the right thing to do? We should roll back to the last node in the path we're on that had data. So, check backwards. Does D->i->s have data? No, so go back another. D->i? Nope. Does D have data? Yes! So use that, and our program is {DE,2A,44}. Because the D was the last letter we tokenized, we need to start over at the next character. In this case, the 'i'. Now your tokenizer would repeat this process, but we're going to shortcut it and just show you what your program data is now:
{DE,2A,44,BB,B8,BB,C3,BB,C0,BB,BC,BB,B0,BB,C9}.
A discerning reader will note that this seems like too much data. However, that's because lowercase letters are multi-byte tokens. That means each lowercase letter actually takes up two bytes instead of one, so the "i" node has data {BB,B8}.
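
Putting all of the rules so far in one place, the core loop of a tokenizer might look roughly like this (a C# sketch built on the TrieNode from earlier; it captures the idea, but it is not TokenIDE's actual source):

Code:
// Walk the trie from the current position, remembering the deepest node
// that had data; when we get stuck, emit that data and restart just past
// it. This is the roll-back rule described above.
List<byte> Tokenize(TrieNode root, string text) {
    var output = new List<byte>();
    int pos = 0;
    while (pos < text.Length) {
        TrieNode node = root;
        byte[] lastData = null; // deepest token data seen on this walk
        int lastEnd = pos;      // index just past that token
        int i = pos;
        while (i < text.Length && node.Children.ContainsKey(text[i])) {
            node = node.Children[text[i]];
            i++;
            if (node.TokenData != null) {
                lastData = node.TokenData;
                lastEnd = i;
            }
        }
        if (lastData == null)
            throw new System.Exception("no token matches at position " + pos);
        output.AddRange(lastData);
        pos = lastEnd; // start over at the character after the token
    }
    return output;
}

Run on the Disp "Display example with a full trie, this produces exactly the byte list above.
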
Incidentally, TokenIDE has a handy tool to show the binary data of a program:

This can be a handy way to sanity check what you're doing.

That's all for today's episode! Let me know what questions you have--I'm sure there are some parts that are pretty unclear.

Next time I'll talk about the detokenization process, which is very similar, and I'll talk some more about some of the corner cases and caveats you run into when dealing with the calculators' vast token set.
Nice write-up Merth! =)

There's another way to view this problem, which is that every edge has both a label (allowing you to follow it), and possibly also an action. Then, instead of placing token data on nodes, you treat emitting the token data as an action on an edge. Then you draw edges which match the empty string from nodes back to your starting point, and assign them the token-emitting actions.

Now, instead of a trie, we have what's called a "finite state machine". Since we can always match an empty string, it's a "non-deterministic finite state machine", which means that our choice at each node isn't determined ahead of time by the input. This is bad, since it means the results might be unpredictable! To resolve this issue, we can add rules to decide which edges are more important in the places where you have to make a choice. A good rule here might be "whenever down arrows are available, favor those over up arrows". This rule is called the "greedy" rule, since you choose to "take as much as possible" before returning to the starting node.

Finite-state machines can be drawn out by hand as graphs, like Merth did with his tries, or they can be written compactly in a format called "regular expressions".
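
For instance, with a typical backtracking regex engine, alternatives in a pattern are tried left to right, so you can encode the greedy rule by listing longer tokens before their prefixes. A hypothetical C# fragment (not from TI-Tokens; just to show the ordering trick):

Code:
using System.Text.RegularExpressions;

// "Disp " must come before "D" in the alternation; otherwise the engine
// matches the single letter and never considers the longer token.
var tokenPattern = new Regex("Disp |D|i|s|p");
foreach (Match m in tokenPattern.Matches("Disp DD"))
    System.Console.WriteLine("token: \"" + m.Value + "\"");
// prints: "Disp ", "D", "D"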

If you want to see what regular expressions for tokenizing TI-Basic look like, you can peek at my TI-Tokens project, which reads one of shaun's TokenIDE XML files and can convert it into a set of regular expressions in formats used for syntax highlighting of TI-Basic by a couple different text editors, or into a C program that will spit the tokens out directly for you.
Thanks, elfprince! Great point about the state machine. There are, of course, a few ways to solve this type of problem, and it's always good to show more than one solution.

That said, let's look into the detokenization process.

So, last time we saw that the string "Disp " gets translated into the token value 0xDE, and that the letter "D" gets translated into the token value 0x44. I haven't spoken about the file format yet (that's to come later), but we'll touch on it briefly to just say that there's a location in the file that has the binary data of the program contiguously, like how we had our data laid out last time. We also looked last time at the multi-byte tokens (such as lowercase letters). However, for right now, we're going to ignore those.

If we pretend that there are no multi-byte tokens, our job is really easy. There are at most 256 tokens ($FF being the largest possible byte value), so you could, in theory, make an array with 256 values, go through the data, and put the strings in there. Something like this:

Code:
string[] tokens = new string[256];
for (int i = 0; i < 256; i++) {
    if (valueExists(i)) {
        tokens[i] = value(i);
    }
}

Where "value(i)" returns the string value of the token "i" and "valueExists(i)" returns true if that exists. This works fairly well for something that only has single-byte tokens. There are some holes in the array, so it's not the most efficient method. You can fix that by using a dictionary or other such data structure.

So this gets us an easy way to detokenize programs. Just read the data, check the tokens array, and use the string it finds. If it's not found, it's not a valid token, so likely not a valid program (or you're missing some tokens in your data set). This only gets us halfway there, though, for real-world calculators. So let's open up the TI-84+CSE.xml file and take a look at what we're actually dealing with. I'll save you the trouble of digging through the whole file and just let you know that there are two-byte tokens. The ones we'll work with today are the string variables:

Code:
  <Token byte="$AA" stringTerminator="false">
    <Token byte="$00" string="Str1" style="Var" />
    <Token byte="$01" string="Str2" style="Var" />
    <Token byte="$02" string="Str3" style="Var" />
    <Token byte="$03" string="Str4" style="Var" />
    <Token byte="$04" string="Str5" style="Var" />
    <Token byte="$05" string="Str6" style="Var" />
    <Token byte="$06" string="Str7" style="Var" />
    <Token byte="$07" string="Str8" style="Var" />
    <Token byte="$08" string="Str9" style="Var" />
    <Token byte="$09" string="Str0" style="Var" />
  </Token>

So from here you can see that Str1 is {AA,00}. This makes our naive solution no longer work. A convenience you might notice, though, is that {AA} by itself isn't a valid token, so if you see that byte, you know you need to go into multi-byte mode. This means, in theory, you could just have another array for each of the multi-byte sections. So if you encounter 0xAA, you just check the next byte and look it up in your 0xAA-array. The TI tokens are pretty nice for this type of solution: there are at most two bytes per token, and the first byte of a two-byte token is always a unique sigil saying "this is a multi-byte token".
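
Here's what that two-table lookup might look like in code. This is a C# sketch; the table contents would come from the XML file, and all the names are my own:

Code:
using System.Collections.Generic;
using System.Text;

// single: one-byte token -> string.
// prefixed: prefix byte (e.g. 0xAA) -> (second byte -> string).
static string Detokenize(byte[] data,
        Dictionary<byte, string> single,
        Dictionary<byte, Dictionary<byte, string>> prefixed) {
    var sb = new StringBuilder();
    for (int i = 0; i < data.Length; i++) {
        byte b = data[i];
        if (prefixed.ContainsKey(b)) {
            i++; // multi-byte token: the next byte indexes the prefix table
            sb.Append(prefixed[b][data[i]]);
        } else {
            sb.Append(single[b]);
        }
    }
    return sb.ToString();
}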

So that should give you a good idea on how to detokenize a file. Let's look at the program we created last time and detokenize it!
{DE,2A,44,BB,B8,BB,C3,BB,C0,BB,BC,BB,B0,BB,C9}
That's our data. So we take each byte one at a time. We see DE. That's a single-byte token, and it's "Disp ", so our current program is Disp . Let's do the next one. 2A, single-byte, quote. Good, now we have Disp ". 44 we know is "D", so we have Disp "D. The next byte is BB. This is a multi-byte signifier, so read the next byte. B8. We have to look up B8 in our BB-array:

Code:
  <Token byte="$BB" stringTerminator="false">
    ...
    <Token byte="$B0" string="a" />
    <Token byte="$B1" string="b" />
    <Token byte="$B2" string="c" />
    <Token byte="$B3" string="d" />
    <Token byte="$B4" string="e" />
    <Token byte="$B5" string="f" />
    <Token byte="$B6" string="g" />
    <Token byte="$B7" string="h" />
    <Token byte="$B8" string="i" />
    <Token byte="$B9" string="j" />
    <Token byte="$BA" string="k" />
    <Token byte="$BC" string="l" />
    <Token byte="$BD" string="m" />
    <Token byte="$BE" string="n" />
    <Token byte="$BF" string="o" />
    <Token byte="$C0" string="p" />
    <Token byte="$C1" string="q" />
    <Token byte="$C2" string="r" />
    <Token byte="$C3" string="s" />
    <Token byte="$C4" string="t" />
    <Token byte="$C5" string="u" />
    <Token byte="$C6" string="v" />
    <Token byte="$C7" string="w" />
    <Token byte="$C8" string="x" />
    <Token byte="$C9" string="y" />
    <Token byte="$CA" string="z" />

So BB followed by B8 is "i". That's good! It matches with what we tokenized. Our program is now Disp "Di. Well, I'm sure you can do the rest to get Disp "Display

So with that you should be able to detokenize any TI-84+CSE (and other models) programs! An astute reader will note that this process is theoretically very similar to the tokenization process, and next time we're going to address that when we also address how TokenIDE handles hybrid library functions, and how the Prizm handles its multi-byte tokens.
Today's topic is going to be very short; I just wanted to address how TokenIDE handles hybrid library functions, and how the Prizm handles its multi-byte tokens. We saw last time how to detokenize a file--that is, read in the data and turn it into a human-readable format. The default TI token set has two-byte tokens in the form of <prefix-byte><other byte>. This means that when we run into <prefix-byte> we can easily look at another table to get the value. We know that if we see that byte, we're looking at a two-byte token, and everything's good. What happens, though, when we can't be sure that the first byte of a multi-byte token is always a multi-byte token? Well, that's how TokenIDE does its hybrid functions, and it's how the Prizm tokens are laid out. Let's look at the Prizm tokens:

Code:
  <Token byte="$F7" string="Break" stringTerminator="false">
    <Token byte="$2F" string="➧DMS" />
    <Token byte="$CD" string="ab/c" />
    <Token byte="$90" string="An Type" />
    <Token byte="$91" string="An+1 Type" />
...


Look at that! 0xF7 is both a multi-byte token prefix and a token ("Break") itself! So, how do we handle this?

Easy! We already have a solution that maps streams of values of one type to values of another--namely, our trie maps streams of characters to bytes. Well, now we just need to map streams of bytes to strings! I will intentionally leave the details of this vague; you should be able to take the same theory from the first post and apply it in the other direction. If you're struggling, just let me know and I'll help you out.
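
As a starting point (the traversal details stay as an exercise), the node type is just the mirror image of the one from the first post, keyed on bytes instead of characters. A sketch with my own names:

Code:
using System.Collections.Generic;

// Detokenization trie: edges are bytes; a node carries the human-readable
// string if the byte sequence leading to it is itself a token. With the
// Prizm table above, the 0xF7 node has a TokenString ("Break") AND children.
class ByteTrieNode {
    public Dictionary<byte, ByteTrieNode> Children = new Dictionary<byte, ByteTrieNode>();
    public string TokenString; // null if this byte sequence isn't a token
}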

Next time we'll talk about the file format--something I'm sure everyone is quite excited for.
  