This is an archived, read-only copy of the United-TI subforum, including posts and topics from May 2003 to April 2012. If you would like to discuss any of the topics in this forum, you can visit Cemetech's Your Projects subforum. Some of these topics may also be directly linked to active Cemetech topics. If you are a Cemetech member with a linked United-TI account, you can link United-TI topics here with your current Cemetech topics.

This forum is locked: you cannot post, reply to, or edit topics. Project Ideas/Start New Projects => Your Projects
Author Message
SniperOfTheNight


Advanced Member


Joined: 24 May 2003
Posts: 260

Posted: 06 Jun 2003 03:38:41 pm    Post subject:

I think that's basically the idea. It's like the eReader they have for the TI-89.
John Barrus


Member


Joined: 25 May 2003
Posts: 131

Posted: 07 Jun 2003 01:00:48 am    Post subject:

I was kinda thinking about word compression and I had a few crazy ideas. Here's one (kinda impractical, but oh well):

We make some kind of 'include' file to be loaded on the calculator. The file would equate different bytes to different words

Example:
01h=The
02h=cow
03h=jumped
04h=over
05h=moon

Now, we make a pc program to convert text to hex using these equates. The output for 'The cow jumped over the moon' would be:
01 02 03 04 01 05

We load the include file and the hex file on the calculator, plus another program to convert the hex back to text.

Well, technically I compressed a six-word sentence down to 6 bytes (75% compression from 24 bytes). But the include file would probably take up massive amounts of space. Plus, there are what, 100,000 words in the English language? Impossible to fit that on a calculator. (But... maybe the PC program could find the most popular words in the text and build the most size-efficient include file specific to that text.)
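A minimal sketch of this word-table idea, in Python for clarity (the real thing would be z80 assembly, and codes here are zero-based rather than starting at 01h; all names are illustrative):

```python
# Word-table compression: each distinct word gets a one-byte code.
def build_table(text):
    table = []
    for word in text.lower().split():
        if word not in table:
            table.append(word)
    if len(table) > 256:
        raise ValueError("one-byte codes only cover 256 distinct words")
    return table

def compress(text, table):
    return bytes(table.index(word) for word in text.lower().split())

def decompress(data, table):
    return " ".join(table[b] for b in data)

text = "The cow jumped over the moon"
table = build_table(text)          # ['the', 'cow', 'jumped', 'over', 'moon']
packed = compress(text, table)
print(packed.hex(" "))             # 00 01 02 03 00 04
print(decompress(packed, table))   # the cow jumped over the moon
```

Note that this sketch folds case and ignores punctuation; a real reader would have to handle both, which is part of why the include file grows.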

Well, that's my impractical idea. What do you think?
John Barrus


Member


Joined: 25 May 2003
Posts: 131

Posted: 07 Jun 2003 01:03:49 am    Post subject:

And I guess the include file could only include 256 different words.
Jeremiah Walgren
General Operations Director


Know-It-All


Joined: 24 May 2003
Posts: 1937

Posted: 07 Jun 2003 01:33:12 am    Post subject:

I thought there were something like 650,000 words in the English language...
John Barrus


Member


Joined: 25 May 2003
Posts: 131

Posted: 07 Jun 2003 09:55:18 am    Post subject:

That'd make it all the more impossible.
But there are some other ideas I'm coming up with...
SniperOfTheNight


Advanced Member


Joined: 24 May 2003
Posts: 260

Posted: 07 Jun 2003 10:31:56 am    Post subject:

It would probably make more sense to use a different equate for each letter. If you wanted to have every word that you would find in a book, that would be way too much work.
Spyderbyte


Advanced Member


Joined: 29 May 2003
Posts: 372

Posted: 07 Jun 2003 11:44:57 am    Post subject:

But then there wouldn't be any compression. You would just end up with a series of numbers as long as, if not longer than, the original word. Since there are 26 letters, some numbers would have to be two digits, which is not only twice as long as the letter, but then you'd have to figure out some way to tell whether the number was 23 or 2 and 3.

But then again, there might be something about how a letter is stored that would make it worthwhile.

Just my two cents' worth.

Spyderbyte
John Barrus


Member


Joined: 25 May 2003
Posts: 131

Posted: 07 Jun 2003 12:27:59 pm    Post subject:

The thing is that all ASCII characters are already equated to a certain byte, so there wouldn't be any compression. But say we equated combinations of two letters to a single byte. We could equate the most common combinations (qu, in, st, sh). Granted, there are 676 possible two-letter combinations, but we could encode only the most used ones. (You won't find jk, xw, qw, and all those other weird combinations in any English words, so there's no purpose in equating them.) That would provide at most 50 percent compression.
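A rough Python sketch of this digraph idea, using byte values 128 and up (unused by 7-bit ASCII) for the pairs; the pair list below is illustrative, not tuned to real letter frequencies:

```python
# Digraph compression: common two-letter pairs become single bytes >= 128.
PAIRS = ["th", "he", "in", "er", "an", "qu", "st", "sh"]

def compress(text):
    out = bytearray()
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in PAIRS:
            out.append(128 + PAIRS.index(pair))   # one byte for the pair
            i += 2
        else:
            out.append(ord(text[i]))              # plain ASCII byte
            i += 1
    return bytes(out)

def decompress(data):
    return "".join(PAIRS[b - 128] if b >= 128 else chr(b) for b in data)

s = "the quick shin"
packed = compress(s)
print(len(s), "->", len(packed))                  # 14 -> 10
assert decompress(packed) == s
```

Since high bytes mark pairs unambiguously, the decoder never has to guess where one code ends and the next begins -- the problem Spyderbyte raised with two-digit numbers.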
Adm.Wiggin
aka Tianon


Know-It-All


Joined: 02 Jun 2003
Posts: 1874

Posted: 07 Jun 2003 12:47:33 pm    Post subject:

Yeah, combos like qu, th, ch, tch (catch), and so on... you could do stuff like that, including three-letter combos...
NETWizz
Byte by bit


Bandwidth Hog


Joined: 20 May 2003
Posts: 2369

Posted: 08 Jun 2003 03:32:40 am    Post subject:

The longer the text file, the better the compression will be with a Ziv (Lempel-Ziv) or Huffman routine.

As for indexing, it may not be easy, because we would probably have to use a two-byte index: there will be more than 256 words.

Also, we would need to make two passes to count the number of times each word is used.

If we use the word "book" only once, we save nothing by compressing it; in fact, we waste room in the index.

Real compression does not really separate words.

e.g.

The quick brown fox jumped over the lazy dog, and the fox jacked a soda.
"x j" can be compressed because it occurs more than once!
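A rough illustration of that point: compressors in the Lempel-Ziv family replace any repeated substring with a short back-reference, not just whole words. This Python sketch (names mine) simply lists the substrings of a given length that occur more than once, i.e. the candidates a real compressor could exploit:

```python
# Find every length-n substring that appears at least twice in the text.
def repeated(text, length):
    seen, hits = set(), set()
    for i in range(len(text) - length + 1):
        chunk = text[i:i + length]
        if chunk in seen:
            hits.add(chunk)        # this substring occurred earlier too
        seen.add(chunk)
    return hits

s = "The quick brown fox jumped over the lazy dog, and the fox jacked a soda."
print(sorted(repeated(s, 5)))
```

Running it shows that "fox j" and " the " both repeat, even though "jumped" and "jacked" are different words -- exactly the kind of match a word-based scheme would miss.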
John Barrus


Member


Joined: 25 May 2003
Posts: 131

Posted: 09 Jun 2003 12:42:35 am    Post subject:

I'm not familiar with the Ziv or Huffman routines. Can you elaborate?

With the two-byte index: yeah, it would provide a larger amount of compression because more words would be equated, but the issue would be space (I'm not sure we could fit 65,536 words and their equates on the calculator). Even so, that's a lot of different words (I probably don't even use that many in a day's worth of conversation).

Basically, I think an idea like this is hopeless, but it would be an interesting way to do things...
John Barrus


Member


Joined: 25 May 2003
Posts: 131

Posted: 09 Jun 2003 01:08:43 am    Post subject:

Ok, I skimmed this from the web on how Huffman works:

"This algorithm, developed by D.A. Huffman, is based on the fact that in an input stream certain tokens occur more often than others. Based on this knowledge, the algorithm builds up a weighted binary tree according to their rate of occurrence. Each element of this tree is assigned a new code word, with the length of the code word being determined by its position in the tree. Therefore, the token which is most frequent and becomes the root of the tree is assigned the shortest code. Each less common element is assigned a longer code word. The least frequent element is assigned a code word which may have become twice as long as the input token. "

My question: What would be the basis for our tokens? Words? Letter combos?
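For comparison, here is a minimal Python sketch of the quoted algorithm with single characters as the tokens (one possible answer; all names are illustrative, and this ignores how the tree itself would be stored alongside the data):

```python
# Huffman coding sketch: build codes from character frequencies.
import heapq
from collections import Counter

def huffman_codes(text):
    # One heap entry per symbol: (weight, tie-breaker, {symbol: code-so-far})
    heap = [(n, i, {ch: ""}) for i, (ch, n) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        wa, _, a = heapq.heappop(heap)   # the two lightest subtrees...
        wb, _, b = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in a.items()}
        merged.update({ch: "1" + code for ch, code in b.items()})
        heapq.heappush(heap, (wa + wb, tie, merged))   # ...become one subtree
        tie += 1
    return heap[0][2]

s = "the cow jumped over the moon"
codes = huffman_codes(s)
bits = sum(len(codes[ch]) for ch in s)
print(bits, "bits, versus", 8 * len(s), "bits as plain 8-bit characters")
```

Frequent characters (space, e, o) get short codes, rare ones get long codes, and because the codes are prefix-free the decoder never needs separators between them.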


Another crazy idea: We don't use ALL of the 256-character set, right? I'd venture to say that we normally use only about 100 characters in normal texts (as in literary novels and such). Well, if we cut those extra 156 characters out, it'd only take 7 bits per character instead of 8 -- it's not much, but it adds up when you have a 20,000-byte text!
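The 7-bit idea is easy to sketch in Python (illustrative only; a calculator version would do the same bit-shifting in assembly):

```python
# Pack 7-bit character codes into a continuous byte stream.
def pack7(text):
    bits = "".join(format(ord(ch) & 0x7F, "07b") for ch in text)
    bits += "0" * (-len(bits) % 8)          # pad out the final byte
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

def unpack7(data, count):
    bits = "".join(format(b, "08b") for b in data)
    return "".join(chr(int(bits[i:i + 7], 2)) for i in range(0, 7 * count, 7))

s = "the cow jumped over the moon"
packed = pack7(s)
assert unpack7(packed, len(s)) == s
print(len(s), "bytes ->", len(packed))      # 28 bytes -> 25
```

That is the promised 12.5% saving, at the cost of the reader having to shift bits across byte boundaries instead of reading whole bytes.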
NETWizz
Byte by bit


Bandwidth Hog


Joined: 20 May 2003
Posts: 2369

Posted: 09 Jun 2003 03:29:55 am    Post subject:

We probably use fewer than 80.

Anyway, we would write the program in assembly, meaning we would simply be working with bytes, 00h to FFh.

Huffman would work, but I do not think the calculator has enough RAM or processing speed to compress a large text file quickly.

Lastly, Ziv uses an easier method than the tree, so it should be easier to make. It also requires less RAM and should run a little faster.

Still, it essentially does the same thing: it replaces the most frequently occurring sequences with shorter symbols.
John Barrus


Member


Joined: 25 May 2003
Posts: 131

Posted: 09 Jun 2003 09:50:07 am    Post subject:

It seems to me that if we used Ziv (or Huffman), we'd still have to build an index of what tokens the shorter symbols represent, right? That would take up more space, too...

Yeah, I guess it would be more around 80 characters. Anyway, we could set up a program to read 7 bits at a time. If there were a character that isn't among the chosen 80, we could have a sign bit thingy like 1111111b, which would signal that a full 8-bit character follows. This way, most of the characters would be 7 bits, and every once in a while there would be a 15-bit character. This method would still provide compression (unless an idiot tried to compress a text full of those other characters -- then it would double the size.)
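A Python sketch of this escape scheme (the character set below is illustrative, and the bit stream is kept as a string for clarity rather than packed into bytes):

```python
# 7-bit codes for common characters; code 1111111b escapes to a full byte.
CHARSET = (" abcdefghijklmnopqrstuvwxyz"
           "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,!?")
ESCAPE = 127  # 1111111b: "a full 8-bit character follows"

def encode(text):
    bits = ""
    for ch in text:
        if ch in CHARSET:
            bits += format(CHARSET.index(ch), "07b")                 # 7 bits
        else:
            bits += format(ESCAPE, "07b") + format(ord(ch), "08b")   # 15 bits
    return bits

def decode(bits):
    out, i = "", 0
    while i + 7 <= len(bits):
        code = int(bits[i:i + 7], 2)
        i += 7
        if code == ESCAPE:
            out += chr(int(bits[i:i + 8], 2))
            i += 8
        else:
            out += CHARSET[code]
    return out

s = "Rare glyphs like ; cost 15 bits"
assert decode(encode(s)) == s
print(len(s) * 8, "bits plain vs", len(encode(s)), "bits packed")   # 248 vs 225
```

The break-even point is easy to see: the scheme wins as long as fewer than one character in eight falls outside the chosen set.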

I wasn't thinking about compressing the texts on the calculator, but rather on a PC. The calculator would be able to read the compressed format without uncompressing it. (Mind you, I have no experience in PC programming, so I don't know who'd write the compression program.)


Last edited by Guest on 09 Jun 2003 09:51:15 am; edited 1 time in total
JoeImp
Enlightened


Active Member


Joined: 24 May 2003
Posts: 747

Posted: 09 Jun 2003 02:26:41 pm    Post subject:

Just tell me what you want done and what the output will be, and I'll get started on it. I'm assuming we're going to be using C++; does anyone else want to use something different? We should maybe get a forum topic on this project if we are really going to get to work on it.
NETWizz
Byte by bit


Bandwidth Hog


Joined: 20 May 2003
Posts: 2369

Posted: 09 Jun 2003 04:13:06 pm    Post subject:

I do not know exactly what we need to do, but I do know that it will be complicated.

I think the simplest compression would be to use bits instead of bytes, since we will not need a whole byte for each character.
Spyderbyte


Advanced Member


Joined: 29 May 2003
Posts: 372

Posted: 09 Jun 2003 04:27:28 pm    Post subject:

I would think 6 bits would be plenty: 36 letters/numbers and 28 other symbols. All you would really need beyond those is punctuation anyway.

Spyderbyte
NETWizz
Byte by bit


Bandwidth Hog


Joined: 20 May 2003
Posts: 2369

Posted: 09 Jun 2003 04:39:55 pm    Post subject:

Yes, 6 bits would be enough.
John Barrus


Member


Joined: 25 May 2003
Posts: 131

Posted: 09 Jun 2003 07:46:40 pm    Post subject:

Hey thanks for the offer Joe!
But first, we should get a couple of things straight:

I'm not sure that we could do 6 bits: I think we'd definitely want 26 uppercase and 26 lowercase letters, 10 numbers, "." and ",", plus a sign bit thingy (anything else?). All those would require at least 7 bits. We could cut out things like Z and X since they are almost never capitalized, but... or we could just leave the numbers off and hope that the compressed text is never math class notes.
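A quick budget check on that character list (reading "upper and lower case letters" as 26 of each, and counting the sign bit thingy as one escape code):

```python
# Does John's proposed character set fit in 6 bits?
letters = 26 + 26   # uppercase + lowercase
digits = 10
punctuation = 2     # "." and ","
escape = 1          # the escape / "sign bit thingy" code
total = letters + digits + punctuation + escape
print(f"{total} symbols; 6 bits gives {2 ** 6}, 7 bits gives {2 ** 7}")
# 65 symbols; 6 bits gives 64, 7 bits gives 128
```

65 symbols overflow a 6-bit code by exactly one, which is why every cut discussed here (capital Z and X, the digits) buys so much: dropping any two symbols makes 6 bits work.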

I guess it is possible to do 6 bits, if we were really careful about which ones were included...
John Barrus


Member


Joined: 25 May 2003
Posts: 131

Posted: 09 Jun 2003 11:19:30 pm    Post subject:

Well, I looked in the TI dev guide for the characters and their hex addresses and, much to my disappointment, the characters are not in any convenient order. Maybe we could just re-index the characters so that the important 60 or so are at the front and all the rest follow? It probably wouldn't affect the size of the reader by more than about 500 bytes.
Page 2 of 3