Okay, I think I am mostly through with generating Huffman trees - but I am stuck with debugging... :/
Could someone take a look at this? (Note: Comments for a line are
always before that line. Just mentioning because there are different preferences to that, and its confusing. Note 2: Deliberately removed all apostrophes (') from the comments because they made the syntax highlighter think they started strings, completely screwing it up.)
Code: .HUFFMAN
ClrHome
"appvHUFFTEMP"->Str0
.Initialize L4 as an area for variables
Fill(L4,256,0)
."Stack" for storing the trees elements - explained later in detail
L4->°STACK
."List" for temporary storage while generating the tree - explained later in detail
L4+2->°LIST
.Pointer to the var to be compressed
L4+4->°FILE
.Size of the var to be compressed
L4+6->°SIZE
.A variable designating the number of still valid entries in the list
L4+8->°COUNTER
.Pointer to the end of the stack
L4+10->°OFFSET
.Generates appvar for the list and stack
If GetCalc(Str0,3584)->STACK
.STACK is 2048 bytes big, list is 1536 bytes big
{L4}^r+2048->LIST
Fill(STACK,3584,0)
.L1 will be used for temporary processing later
Fill(L1,512,0)
.Get the prefixed variable name through input, process for GetCalc(
input->A
Copy(A,L1,length(A))
.The "*10^n" "E" designating hex numbers, not the letter "E"
{L1} and E3F->{L1}
GetCalc(L1)->FILE
If {FILE-2}^r->SIZE
...
Now the LIST structure needs to be explained.
It is a "table" or a list of entries of six bytes each. It will medium-term hold every element of the Huffman tree, basically the list from which the tree is constructed, the one which is sorted.
It stores the characters number of occurrences in bytes 0 and 1 of each list entry, one part of the data in bytes 2 and 3 (This is either a character or a pointer to the stack) and the other part in bytes 4 and 5 (This is either 65535 (impossible for pointers) designating that bytes 2 and 3 are a char, or another pointer to the stack.) If it is not 65535, we know that the list entry is a node which points to the stack, if it is, we know it is a leaf which holds a character.
Note: A character in our case is a byte. Just makes sense using bytes here.
...
.Now, this loop counts every "character" and stores their frequency in the "list" at bytes 0 and 1. The list can then be accessed as follows: "Character" 0000 0000 is entry one, character 0000 0001 is entry two, and so forth.
For(A,0,SIZE-1)
{6*{FILE+A}+LIST}^r++
End
.This loop checks every character whether it occurs at least once. If so, it is properly initialized, if not, it is skipped.
For(A,0,255)
.Caching the long pointer for saving memory
6*A+LIST->P
.If the characters frequency at {P}^r is greater than 0, it occurs in the file
If {P}^r
.Store the character in the following "slot", so it can be checked even after the list was sorted
A->{P+2}^r
.The character, or leaf, "marker"
65535->{P+4}^r
.Update number of list entries
COUNTER++
End
End
.Finish initialization by sorting the array. This is needed because the loop only sorts a part of the array, for optimization. But before the loop starts, the entries can be scattered all over the list, so this is required to get them all to the top.
SORT(256)
.This is the main loop which iterates as long as there is more than one entry in the list left.
While 1
.Now, this was more complicated (at least for me). It adds the lists top (Least occurring) element (4 bytes if skipping the frequency, which will not be needed) to the stack and updates the stacks pointer.
Copy(LIST+2,OFFSET++-1*4+STACK,4)
.Same as above, but for the second list entry. At this point, the lists two top entries are safely stored in the stack and the list is ready to be manipulated.
Copy(LIST+8,OFFSET++-1*4+STACK,4)
.Start generating a node for the tree, which will have the two least occurring elements as children. Add the two childrens frequencies and save to the first entry (overwriting it).
{LIST}^r+{LIST+6}^r->{LIST}^r
.Store the pointer to the first child (in the stack now)
OFFSET-2->{LIST+2}^r
.Store the pointer to the second child (in the stack now)
OFFSET-1->{LIST+4}^r
.Now the first list entry is holding the new node, the second one is cleared so that it does not hold the stacked entry anymore.
Fill(LIST+6,6,0)
.Sort only the part of the list which still holds entries
SORT(COUNTER-1)
EndIf COUNTER--
.Irrelevant debug code removed
.This statement belongs to "If {FILE-2}^r->SIZE", i.e. executed if the size was 0.
Else
Disp "ERR:SIZE"
End
.This statement belongs to "If GetCalc(Str0,3584)->STACK", i.e. executed if the appvar failed to create.
Else
Disp "ERR:MEM"
End
Return
Lbl SORT
.Selection sort implementation, works correctly, not worth commenting further
For(A,0,r1-1)
A->C
For(B,A,r1)
If {6*B+LIST}^r>{6*C+LIST}^r
B->C
End
End
Exch(6*A+LIST,6*C+LIST,6)
End
So, this code should be generating a Huffman tree in the area {STACK}*, but for some reason it a) is not the correct tree if reconstructing the broken entries, and b) It contains weird entries like "43xx something" or both children are 0, which are both impossible.
*The encoding works as follows:
1) The first pointer element is still in {LIST+2}. It can be used as the tree's root.
2) An "Element" are always 2 + 2 = 4 bytes.
3) The first two bytes are either a character or a pointer.
4) If the first two bytes are a character, the second two bytes are 65535, else they are another pointer.
5) A pointer, if multiplied by 4 and added to STACK points to another element.
From this, we can reconstruct a tree by starting from the root element in {LIST+2} and then jumping from one pointer to the next. An element containing pointers is a node, an element containing 65535 in the second two bytes is a leaf, and its value is in the first two bytes.