Most efficient text to list compression - Cemetech | Forum

Nik · 27 Oct 2015 12:30:26 pm

I haven't tried this out yet, but it might be interesting:

Like many other programmers before me, I believe, I had to convert a string to a list efficiently at some point. The "seq(inString(" routine was too memory-consuming (check this out if you don't know what I mean).
Some time ago I tried to figure out how to compress the text even more efficiently, taking advantage of the fact that every second digit in such a list entry (the first digit of each character's two-digit code) is never bigger than 2 with the normal set of 26 letters (see below for how to implement more).
Today I had the idea of taking the first digits of the corresponding characters, putting them together and converting that from ternary to decimal. This works because there are only three possible values for each of them, and 3^2 = 9, so two ternary digits fit into one decimal digit. This allows you to compress text lists so that a character takes ~1.03 (9/8.75) bytes!
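
To make the digit trick concrete, here is a rough sketch in Python rather than TI-BASIC, since nothing here has been implemented on-calc. The layout is my own assumption: for every pair of letters it emits one decimal digit that holds both first digits (as a two-digit ternary number), followed by the two second digits. The names `ALPHABET`, `encode` and `decode` are made up for illustration.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # codes 0-25

def encode(text):
    """Return a list of decimal digits, three per pair of letters."""
    codes = [ALPHABET.index(ch) for ch in text]
    digits = []
    for i in range(0, len(codes), 2):
        pair = codes[i:i + 2]
        tens = [c // 10 for c in pair]  # first digits, each 0, 1 or 2
        ones = [c % 10 for c in pair]   # second digits, each 0-9
        if len(tens) == 1:
            tens.append(0)              # pad an odd-length text
        digits.append(tens[0] * 3 + tens[1])  # two ternary digits -> 0..8
        digits.extend(ones)
    return digits

def decode(digits, length):
    """Rebuild the text; `length` is the number of letters stored."""
    codes = []
    i = 0
    while len(codes) < length:
        t0, t1 = divmod(digits[i], 3)   # unpack the two first digits
        codes.append(t0 * 10 + digits[i + 1])
        if len(codes) < length:
            codes.append(t1 * 10 + digits[i + 2])
        i += 3
    return "".join(ALPHABET[c] for c in codes)

packed = encode("HELLO")
print(packed)             # [0, 7, 4, 4, 1, 1, 3, 4]
print(decode(packed, 5))  # HELLO
```

With this layout two letters take three decimal digits instead of four. Packing those digits into actual list elements is left out of the sketch.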

But if you want to go further and 26 characters are not enough, remember that the codes 26-29 are still unused! You might use them as single characters if you only need a few. But you might also use them as prefixes:
Let's say you are making an advanced notepad program. It supports upper- and lowercase, special characters and formatting tags.
0-25 identifies a regular character.
26 & 0-29 holds a lowercase letter.
27 & 0-29 holds a special character.
28 & 0-29 holds a system formatting tag.

So you can hold up to 146 characters: 26 single-space and 120 double-space characters (four prefixes, counting 29 as well, times 30 codes each).
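
The prefix scheme above could look roughly like this, again as a Python sketch under assumptions. Only the code ranges come from the list above. The `SPECIAL` table and all function names are invented for the example, and the formatting-tag prefix (28) is left out to keep it short.

```python
import string

UPPER = string.ascii_uppercase   # codes 0-25, one slot each
LOWER = string.ascii_lowercase   # prefix 26, then a code 0-25
SPECIAL = " .,!?-'\"():;"        # prefix 27; made-up table, at most 30 entries
PREFIX_LOWER, PREFIX_SPECIAL = 26, 27

def encode(text):
    """Turn text into a list of codes, using a prefix for non-uppercase."""
    out = []
    for ch in text:
        if ch in UPPER:
            out.append(UPPER.index(ch))
        elif ch in LOWER:
            out += [PREFIX_LOWER, LOWER.index(ch)]
        elif ch in SPECIAL:
            out += [PREFIX_SPECIAL, SPECIAL.index(ch)]
        else:
            raise ValueError("no code for %r" % ch)
    return out

def decode(codes):
    """Inverse of encode: a prefix code consumes the following code too."""
    out = []
    it = iter(codes)
    for c in it:
        if c < 26:
            out.append(UPPER[c])
        elif c == PREFIX_LOWER:
            out.append(LOWER[next(it)])
        elif c == PREFIX_SPECIAL:
            out.append(SPECIAL[next(it)])
        else:
            raise ValueError("unused prefix %d" % c)
    return "".join(out)

codes = encode("Hi!")
print(codes)          # [7, 26, 8, 27, 3]
print(decode(codes))  # Hi!
```

Every code still stays below 30, so the first-digit trick from earlier should still apply to the resulting list.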

If this is combined with some sort of Huffman encoding, and then possibly a Burrows-Wheeler transform and RLE, it might give you the most efficient way to store text in a list.

I have not implemented this, but I might try in the future!