Marking token boundaries with Unicode tricks - Cemetech | Forum | TI-BASIC [Topic]

Tari · 26 May 2022 05:54:40 am

(This post has also appeared on my web site, reproduced here in an abbreviated form for discussion.)

People who have done some programming using PC tools like SourceCoder or TokenIDE are probably familiar with how sometimes you need to insert backslashes into programs so strings like "pi" don't get incorrectly translated to "π" when you actually do mean to write "pi." Handling this correctly when detokenizing programs is actually somewhat challenging, though there is a well-defined algorithmic solution. While working on some tokenization/detokenization code I arrived at an interesting idea for a way the backslash situation could be somewhat improved using the power of Unicode.

U+200C, the Zero-Width Non-Joiner (ZWNJ), is defined to control how characters can run together. While it is intended for cursive scripts (Arabic, for example) or for controlling when ligatures can form, if you squint at it a ZWNJ could also be seen as a way to indicate that adjacent characters in your source code should be parts of separate tokens!

The basic implication of that idea is that instead of writing "p\i", you could write "p<ZWNJ>i" which is helpful because ZWNJs are not actually visible when rendered. When you expect code to either be translated to tokens by people who are familiar with BASIC (and can predict what your intent is) or only fed through tools that understand the ZWNJ as a backslash replacement, you can avoid visual noise and make code nicer for humans to read.
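To make the "invisible but present" point concrete, here is a minimal Python sketch. It only uses the standard `unicodedata` module; the variable names are mine and purely illustrative.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

plain = "pi"                # would normally be read as the pi token
marked = "p" + ZWNJ + "i"   # renders the same, but carries a boundary mark

# The character is a Unicode "format" character (category Cf), so it
# takes up no visible space when rendered.
print(unicodedata.name(ZWNJ))      # ZERO WIDTH NON-JOINER
print(unicodedata.category(ZWNJ))  # Cf

# The two strings look alike on screen but are different strings.
print(len(plain), len(marked))     # 2 3
print(plain == marked)             # False
```

The flip side is that a human can't tell the two apart by eye either, so an editor that can reveal invisible characters helps when debugging.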

The obvious downside to this idea is that you need tools to understand ZWNJ as a backslash replacement, but updating tools to handle it would be pretty easy: provided a tool understands Unicode input, then it should be easy to handle ZWNJ in the same way as backslash already is.
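As a rough sketch of what "handle ZWNJ the same way as backslash" could look like, here is a toy greedy longest-match tokenizer in Python. The token table and function name are made up for illustration and are not taken from SourceCoder, TokenIDE, or any other real tool; a real tool would also have to deal with literal backslashes, which this ignores.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
BACKSLASH = "\\"
BOUNDARY_MARKS = {BACKSLASH, ZWNJ}

# Hypothetical, tiny token table for illustration only.
# Keys are source text, values are the token each one maps to.
TOKENS = {"pi": "π", "sin(": "sin(", "Disp ": "Disp "}


def tokenize(source, tokens=TOKENS):
    """Greedy longest-match tokenizer where a boundary mark forces a split."""
    longest = max(len(name) for name in tokens)
    out = []
    i = 0
    while i < len(source):
        if source[i] in BOUNDARY_MARKS:
            # Consume the mark itself; it produces no token.
            i += 1
            continue
        # Try the longest candidate first. A candidate that spans a mark
        # can never match, because no token text contains a mark.
        for length in range(min(longest, len(source) - i), 0, -1):
            candidate = source[i:i + length]
            if candidate in tokens:
                out.append(tokens[candidate])
                i += length
                break
        else:
            # No multi-character token starts here: emit a single character.
            out.append(source[i])
            i += 1
    return out


print(tokenize("pi"))           # ['π']
print(tokenize("p\\i"))         # ['p', 'i']
print(tokenize("p" + ZWNJ + "i"))  # ['p', 'i']
```

The only ZWNJ-specific part is the extra entry in `BOUNDARY_MARKS`, which is the sense in which the change should be small for a tool that already reads Unicode input.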

An interesting bonus idea is to insert ZWNJs on every token boundary. This is possible with backslashes too, but doing so makes your code much more difficult to read. The advantage is that retokenizing becomes very easy, because a tokenizer need only split on ZWNJ and find the token exactly matching each split substring. If this kind of ZWNJ-all-the-things can be assumed for a given input, it also means that a program won't unexpectedly develop new interpretations if new tokens are added, because (presumably) the existing exact token strings would continue to exist.
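Here is a minimal Python sketch of the split-and-exact-match idea, assuming every token boundary in the input carries a ZWNJ. The function names and the small set of known token strings are hypothetical, chosen only to show the shape of it.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# Hypothetical set of known token strings, for illustration only.
KNOWN_TOKENS = {"Disp ", "pi", "p", "i", "+"}


def detokenize_marked(token_texts):
    """Join token strings with a ZWNJ on every boundary."""
    return ZWNJ.join(token_texts)


def retokenize_marked(source, known_tokens=KNOWN_TOKENS):
    """Split on ZWNJ and require each piece to be exactly one known token."""
    if not source:
        return []
    pieces = source.split(ZWNJ)
    unknown = [piece for piece in pieces if piece not in known_tokens]
    if unknown:
        raise ValueError(f"not exact token strings: {unknown!r}")
    return pieces


original = ["Disp ", "p", "i", "+", "pi"]
marked = detokenize_marked(original)

# With the marks stripped this reads "Disp pi+pi", which is ambiguous
# on its own; with the marks in place it round-trips exactly.
print(marked.replace(ZWNJ, ""))
print(retokenize_marked(marked) == original)  # True
```

No longest-match logic is involved, so adding a new token such as "ip" to the table later can't change how an already-marked program splits.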

And now it's time for discussion! Does this seem like a compelling thing to support? Other comments? Reply below!

Adriweb · 26 May 2022 06:26:52 am

Jacobly did something similar (?) for his token example on his TI font page, I believe. Interesting, anyway :)