A Brief Reverse-Engineering Tutorial with the g3p Format - Cemetech | Forum

KermMartian
Site Admin (Posts: 64051)

A Brief Reverse-Engineering Tutorial with the g3p Format
13 Aug 2014 08:13:29 pm

I recently announced that I added support for Casio Prizm pictures (.g3p files) to Cemetech's SourceCoder 3 online calculator programming IDE. The hardest part of creating that new feature was not the code that implements it in SourceCoder, but the reverse-engineering work necessary to understand how to read .g3p files and then generate new .g3p files that the Casio fx-CG10 and fx-CG20 will both accept. At the request of several Cemetech members, I have decided to write a short tutorial showing how I reverse-engineered the .g3p format, which I hope will help you with any new file or data format that you might want to try to understand. The tutorial will be roughly divided into sections explaining what you should have to successfully understand a new format, what existing information will accelerate the process, and how to actually peer into the unknown format.

What You Need
Any good tutorial should tell you the prerequisites before you dive in. In this case, you need examples to work from and tools to help you examine the examples and test your hypotheses. For this particular project, I used these tools and data sources:

As many examples of the file format you're examining as possible. One of these should be a "minimum" instance, in this case a completely blank white image captured on my Prizm and copied to my computer. I also collected a number of 3-bit and 16-bit .g3p images from the community, as well as a collection of different 16-bit .g3p files provided by Casio. Having this variety ensures that you can adequately pick out the constant bytes in the header, body, and footer of the format (if it has those components), as well as discover the source of any unknown data.
A scripting language like Python that will let you quickly apply hypotheses to your existing files as well as generate new files. For the .g3p format, this let me first verify that the .g3* file header worked the way I believed, and later that the de-obfuscated and decompressed image data contained the expected pixels. It also let me modify the image data in existing .g3p files on the fly to see if the resulting image displayed on my Casio Prizm looked as I expected.
The device or program that creates and uses the file format, so that once you get as far as making new files you can test that they are correctly formatted. Your new files must be recognized as valid by the device or program that opens them, and the data you include in the files must be decodable. For this project, the Prizm had to recognize the picture files as valid pictures, and also display their contents properly. The former does not guarantee the latter: I discovered that if a certain checksum was incorrect, the calculator would recognize the file as valid but would display a blank image.
A hex editor and a hexadecimal-capable calculator. For me, this was XVI32 and Windows Calculator, respectively. Other Cemetech members swear by the hex editor HxD.

Once you have all the files and tools you need to begin, you need to also take stock of any existing information you have that will give you clues.

What You Know
You usually need to have some idea of what you're looking at before you begin, and the more information you have, the easier the reverse-engineering process will be. Although not vital, you should generally start by knowing what the file type you're examining contains. If you have a way of viewing its contents elsewhere (in this case, by displaying the pictures on the Prizm), then you know what the file should decode to. You can also use information about similar files and the platform itself to give you additional clues. I worked from some of these clues:

The Prizm's SH3/4 processor is a big-endian 32-bit CPU. Therefore, it was likely that size words in the file would be big-endian 32-bit integers. This was further supported by my previous explorations of the .g3m program format.
Exploring the .g3m program format had given me some experience with what I believed was a common header format on all .g3* files, which turned out to be mostly an accurate belief. Had I investigated existing documentation on the .g1m program format used for the fx9750 and fx9860 calculators, I would have found that it provided similar clues about the 32-byte header on .g3p files.
I knew that all of the .g3p files I was working from contained 3-bit or 16-bit pictures, all 384 pixels wide by 192 pixels tall.
The smallest of the files was far smaller than (384*192)*(3/8) = 27648 bytes, the smallest number of bytes in which an uncompressed 3-bit-color 384x192-pixel image could be stored. I also noticed that the more complex image files were larger. Therefore, I deduced that some form of compression was being used.
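The size argument in the last clue can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name is made up for illustration, and it assumes pixels are packed back to back with no padding, which gives a lower bound rather than the layout the Prizm actually uses.

```python
WIDTH, HEIGHT = 384, 192  # picture dimensions from the list above

def min_uncompressed_bytes(bits_per_pixel):
    """Smallest byte count that can hold every pixel with no padding."""
    total_bits = WIDTH * HEIGHT * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes

print(min_uncompressed_bytes(3))   # lower bound for a 3-bit-color image
print(min_uncompressed_bytes(16))  # lower bound for a 16-bit-color image
```

A 406-byte file such as Pic04.g3p (its size appears in the header discussion below) is nowhere near either bound, which is what points toward compression.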

Reverse-Engineering the .g3p Format
As with any puzzle, reverse-engineering a file format is a process of using the clues and pieces in front of you to build up a progressively larger picture of the data you're examining. For me, this meant first understanding the header, then finding where the image data was stored in the file, then understanding how the data was stored and signed. As with most puzzles, I found myself following red herring clues to incorrect conclusions. I made compromises: in writing code to generate new images, I chose one of the variations on the .g3p format that loads on all fx-CG10 and fx-CG20 calculators but does not require generating a file footer. Without further ado, let me walk you through the process of decoding the header, footer, and body of .g3p picture files. For the sake of brevity, I will be eliding some of the false starts and tedious computation I performed, but I would be happy to clarify any step in the attached discussion thread.

1. Understanding the .g3p Header
I began decoding the .g3p header in a vacuum as an exercise, later cross-referencing my discoveries against my exploration of the .g3m program header to verify its correctness. For your own examination, here are two .g3p headers and one .g3m header:

Quote:

Beach.g3p:
0x00: AA AC BD AF 90 88 9A 8D 82 FF EF FF EF FF DA FE
0x10: FF FF 52 1B 63 00 00 00 00 00 00 00 08 A0 00 00
Pic04.g3p:
0x00: AA AC BD AF 90 88 9A 8D 82 FF EF FF EF FF 28 FE
0x10: FF FF FE 69 B1 00 00 00 00 00 00 00 02 EE 00 00
RPG1.g3m:
0x00: AA AC BD AF 90 88 9A 8D 8A FF EF FF EF FF 42 FE
0x10: FF FF F8 83 CB 01 00 00 00 00 00 00 1E 08 FF FE

It seems that every Prizm file begins with the same 8-byte sequence 0xAA, 0xAC, 0xBD, 0xAF, 0x90, 0x88, 0x9A, 0x8D. In examining other Prizm files, this pattern held true. Next, the byte at offset 0x08 seems to give some indication of the file type. Indeed, in investigating other files, all .g3m programs had 0x8A at that offset, and all .g3p pictures had 0x82 there. Incidentally, .g1m program and picture files happen to have the same 8 header bytes followed by 0xCE.
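A first hypothesis like this is easy to turn into a checker that can be run over every sample file. The sketch below is one possible way to do it; the names `PRIZM_MAGIC`, `FILE_TYPES`, and `identify` are invented here, and the type table only covers the three values mentioned above.

```python
# The 8 bytes every sample file starts with.
PRIZM_MAGIC = bytes([0xAA, 0xAC, 0xBD, 0xAF, 0x90, 0x88, 0x9A, 0x8D])

# Type bytes observed at offset 0x08 in the files discussed above.
FILE_TYPES = {
    0x82: "g3p picture",
    0x8A: "g3m program",
    0xCE: "g1m program or picture",
}

def identify(header):
    """Return a description of the file type, or None if the magic is wrong."""
    if len(header) < 9 or header[:8] != PRIZM_MAGIC:
        return None
    return FILE_TYPES.get(header[8], "unknown type 0x%02X" % header[8])

# First 16 bytes of two of the sample headers quoted above.
beach = bytes.fromhex("AAACBDAF90889A8D82FFEFFFEFFFDAFE")
rpg1 = bytes.fromhex("AAACBDAF90889A8D8AFFEFFFEFFF42FE")
print(identify(beach))
print(identify(rpg1))
```

Returning a string for unknown type bytes, rather than raising, makes it easier to spot new file types while sweeping through a directory of samples.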

Next we have a set of unusual bytes that change from file to file. They seem to fall into a number of groups (the byte at 0x0E, the five bytes from 0x10 through 0x14, and the two bytes at 0x1C and 0x1D), so I have highlighted them in several different colors below:

Quote:

Beach.g3p:
0x00: AA AC BD AF 90 88 9A 8D 82 FF EF FF EF FF DA FE
0x10: FF FF 52 1B 63 00 00 00 00 00 00 00 08 A0 00 00
Pic04.g3p:
0x00: AA AC BD AF 90 88 9A 8D 82 FF EF FF EF FF 28 FE
0x10: FF FF FE 69 B1 00 00 00 00 00 00 00 02 EE 00 00
RPG1.g3m:
0x00: AA AC BD AF 90 88 9A 8D 8A FF EF FF EF FF 42 FE
0x10: FF FF F8 83 CB 01 00 00 00 00 00 00 1E 08 FF FE

I started out with the assumption that the colored bytes, all of which vary from file to file, were "security" bytes based on the size of the file. This turned out to be correct, but if it had been wrong, other possibilities would have been some checksum over the entire file contents, the size of the data portion of the file, or a checksum over the data portion. This assumption was supported by the 4-byte value at offset 0x10 in every file. The sizes of these three files happen to be 44516 bytes (0xADE4 bytes), 406 bytes (0x196 bytes), and 1916 bytes (0x77C bytes), respectively. If you represent each of those hex sizes as a 32-bit big-endian integer and invert every bit, you get 0xFFFF521B, 0xFFFFFE69, and 0xFFFFF883. With that big hint that other fields in the header are related to the file size, let's build a table comparing them to the full inverted size int, the lower two bytes of the size, and the lowest byte of the size. I have added a few additional files for further comparison (note that all values are hex; 0x is omitted for brevity):
Code:

+-------------+----------+-----+------+-------------+------+-------------+--------+-------------+
| Size Int    | Size LSW | LSB | 0x0E | *0E-LSB%100 | 0x14 | *14-LSB%100 | 1C, 1D | *1D-LSB%100 |
+-------------+----------+-----+------+-------------+------+-------------+--------+-------------+
| FF FF 52 1B | 52 1B    | 1B  | DA   |  DA-1B = BF |  63  |  63-1B = 48 | 08 A0  |  A0-1B = 85 |
| FF FF FE 69 | FE 69    | 69  | 28   |  28-69 = BF |  B1  |  B1-69 = 48 | 02 EE  |  EE-69 = 85 |
| FF FF F8 83 | F8 83    | 83  | 42   |  42-83 = BF |  CB  |  CB-83 = 48 | 1E 08  |  08-83 = 85 |
| FF FF F8 78 | F8 78    | 78  | 37   |  37-78 = BF |  C0  |  C0-78 = 48 | 0B FD  |  FD-78 = 85 |
| FF FF 8C B2 | 8C B2    | B2  | 71   |  71-B2 = BF |  FA  |  FA-B2 = 48 | D9 37  |  37-B2 = 85 |
+-------------+----------+-----+------+-------------+------+-------------+--------+-------------+

+-------------+----------+-----+--------+-------------+--------------+-----------+--------------+
| Size Int    | Size LSW | LSB | 1C, 1D | *1D-LSB%100 | *1C-LSW%100  |  LSW sum  | *1C-LSWS%100 |
+-------------+----------+-----+--------+-------------+--------------+-----------+--------------+
| FF FF 52 1B | 52 1B    | 1B  | 08 A0  |  A0-1B = 85 | 08-521B=ADED | 52+1B=06D | 08-06D = 9B  |
| FF FF FE 69 | FE 69    | 69  | 02 EE  |  EE-69 = 85 | 02-FE69=0199 | FE+69=167 | 02-167 = 9B  |
| FF FF F8 83 | F8 83    | 83  | 1E 08  |  08-83 = 85 | 1E-F883=079B | F8+83=17B | 1E-17B = A3  |
| FF FF F8 78 | F8 78    | 78  | 0B FD  |  FD-78 = 85 | 0B-F878=0793 | F8+78=170 | 0B-170 = 9B  |
| FF FF 8C B2 | 8C B2    | B2  | D9 37  |  37-B2 = 85 | D9-8CB2=7427 | 8C+B2=13E | D9-13E = 9B  |
+-------------+----------+-----+--------+-------------+--------------+-----------+--------------+

Yes, those worked out very neatly! The actual reverse-engineering process was much more trial-and-error. I omitted all the attempts I made with incorrect combinations of bytes, like using the LSW or the LSB elsewhere, trying other operations besides addition and subtraction, or not using mod-0x100 (mod-256) math. I also omitted my attempts to use the two bytes at 0x1C as a single word rather than two separate bytes. I left in my mistaken attempts to use the least-significant word (LSW) as a whole word to deduce the process of finding the byte at 0x1C, so you can see how that dead-end led me to consider summing the two bytes of the LSW and mixing that with the byte at 0x1C. So what do these tables tell us? By flipping around the operations that yield the same constants, we now know how to compute the security bytes at offsets 0x0E, 0x14, 0x1C, and 0x1D:

*0x0E = ((LSB of inverted size) + 0xBF) % 0x100 = (*0x13 + 0xBF) % 0x100
*0x14 = ((LSB of inverted size) + 0x48) % 0x100 = (*0x13 + 0x48) % 0x100
*0x1C = ((sum of LSW bytes of inverted size) + 0x9B) % 0x100 = (*0x12 + *0x13 + 0x9B) % 0x100
*0x1D = ((LSB of inverted size) + 0x85) % 0x100 = (*0x13 + 0x85) % 0x100

There is one exception, the middle row of the table, where the result for our .g3m program file shows that the *0x1C constant is 0xA3 for programs instead of 0x9B, as for pictures. In addition, picture files that are not "Casio Provided" yield different constants to generate the bytes at 0x1C and 0x1D. For simplicity, those have been omitted here. See the Cemetech Prizm wiki for full documentation on file headers. Now we have enough information to generate the first 32 bytes of our own .g3p files, so we can move on to understanding the remainder of the format, containing the image data.
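The size-derived fields can be reproduced in a few lines of Python and checked against the three sample headers. This is a sketch rather than SourceCoder's actual code: the function name and the dictionary layout are invented here, and the constants are the ones read off the difference columns of the tables above (0xBF, 0x48, and 0x85 for the LSB-based bytes, 0x9B or 0xA3 for the byte at 0x1C). It does not cover the pictures that are not "Casio Provided", whose constants differ.

```python
import struct

def security_fields(file_size, const_1c=0x9B):
    """Return the size-derived header fields for a file of file_size bytes.

    const_1c is 0x9B for the picture rows of the table and 0xA3 for the
    program row. Keys are header offsets.
    """
    inverted = ~file_size & 0xFFFFFFFF        # flip all 32 bits of the size
    size_bytes = struct.pack(">I", inverted)  # big-endian, stored at 0x10
    lsw_high, lsb = size_bytes[2], size_bytes[3]
    return {
        0x0E: (lsb + 0xBF) % 0x100,
        0x10: size_bytes,
        0x14: (lsb + 0x48) % 0x100,
        0x1C: (lsw_high + lsb + const_1c) % 0x100,
        0x1D: (lsb + 0x85) % 0x100,
    }

samples = [
    ("Beach.g3p", 0xADE4, 0x9B),
    ("Pic04.g3p", 0x196, 0x9B),
    ("RPG1.g3m", 0x77C, 0xA3),
]
for name, size, const_1c in samples:
    f = security_fields(size, const_1c)
    print(name, "0E=%02X 10=%s 14=%02X 1C=%02X 1D=%02X" % (
        f[0x0E], f[0x10].hex().upper(), f[0x14], f[0x1C], f[0x1D]))
```

Comparing the printed values against the header dumps quoted earlier is the same kind of check that catches a wrong constant before a file ever reaches the calculator.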

2. Data and Metadata
Next, we need to find where in the file the image data starts, and what metadata surrounds it. This information will enable us to reliably decode any possible .g3p file we encounter as well as generate new .g3p files that the Prizm will reliably accept. As just mentioned, we will be looking at one specific "Casio Provided" variation of the .g3p format. It's a good one to examine, because it has no special footer fields, and is accepted by both the fx-CG10 and fx-CG20 Prizm variants. There are more complicated CP0100 and CAPTURE formats, but I'll omit those from this reverse-engineering walkthrough. Let's take a look at the next 192 (0xC0) bytes of the first file we've been looking at, the one that is about 45KB long. It contains a few scattered bytes, a long section of empty space, then what looks like some metadata followed by image data. I'm getting ahead of myself; here's the data:

Quote:

0x20: 43 50 00 01 00 00 00 00 00 00 00 00 00 00 00 00
0x30: 00 00 AD C4 00 00 00 01 00 00 AD 2C 00 00 00 00
0x40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x50-0xA0: ... 00 ...
0xB0: 00 00 00 00 00 00 00 00 00 01 00 00 00 00 AD 18
0xC0: 00 00 01 80 00 C0 00 10 01 00 00 00 00 00 AD 14
0xD0: 3C 1B 98 1C A7 45 27 C0 32 8A 8E 3F 5E 3E 8B 56

Further investigation into other variants of the .g3p format revealed that all that empty space in the middle is related to the extra footers that can be tacked onto the end of some variants. Since we won't be discussing those types, we can ignore that blank space and assume those are always zeroes. From examining several files, there are several constants between offsets 0x20 and 0xBC:

0x20: Always contains the six-byte sequence 0x43, 0x50, 0x00, 0x01, 0x00, 0x00.
0x34: Always contains the 32-bit big-endian integer 0x00000001.
0xB8: Always contains the four-byte sequence 0x00, 0x01, 0x00, 0x00.

I told you that this file was about 45KB, or exactly 0xADE4 bytes long. It looks like there are a few four-byte big-endian integer size fields scattered around in that data! Here's what they are:

0x30: File size - 0x20 (Size of file after initial header)
0x38: File size - 0xB8 (0xB8=184 bytes; this value is the size of the image data plus 0x18 bytes of metadata/header at offset 0xB8)
0xBC: File size - 0xCC (this value is the size of the image data plus its 4-byte size field)
0xCC: File size - 0xD0 (this value is the size of the image data that begins at 0xD0; the field itself is part of the image data header)
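Nested size fields like these are a natural thing to validate automatically. The sketch below compares each stored value against what the file length predicts; the function name is made up, and the expectation for 0xCC (file size minus 0xD0) is inferred from the dump above, where 0xADE4 - 0xD0 = 0xAD14. It only applies to the footer-less variant discussed here.

```python
import struct

def check_size_fields(data):
    """Compare each big-endian size field with the value the file length predicts.

    Returns a list of (offset, stored, expected) tuples for fields that disagree.
    """
    file_size = len(data)
    expected = {
        0x30: file_size - 0x20,
        0x38: file_size - 0xB8,
        0xBC: file_size - 0xCC,
        0xCC: file_size - 0xD0,
    }
    mismatches = []
    for offset, want in expected.items():
        (stored,) = struct.unpack_from(">I", data, offset)
        if stored != want:
            mismatches.append((offset, stored, want))
    return mismatches

# Synthetic stand-in for Beach.g3p: correct length, only the size fields set
# to the values visible in the hex dump above.
fake = bytearray(0xADE4)
for offset, value in [(0x30, 0xADC4), (0x38, 0xAD2C), (0xBC, 0xAD18), (0xCC, 0xAD14)]:
    struct.pack_into(">I", fake, offset, value)
print(check_size_fields(bytes(fake)))  # empty list when every field agrees
```

Returning the mismatches instead of raising on the first one makes it easier to see which layer of the "onion" a new file variant changes.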

If you examine the rest of the data after 0xD0, you'll see that it is extremely dense, seemingly random data, presumably corresponding to compressed image data. In fact, if you're extra clever and compute bit probabilities, you'll see that the data has very high entropy (that is, very little redundancy), which is indicative of compressed data. I examined the data between offsets 0xC0 and 0xD2 manually, comparing it between several different .g3p files (those multiple file examples coming in handy again!) and deduced the following fields:

0xC0: (Word) Always 0x00, 0x00. Function unknown.
0xC2: (Word) 0x01, 0x80 (Decimal 384); width of image in pixels.
0xC4: (Word) 0x00, 0xC0 (Decimal 192); height of image in pixels.
0xC6: (Word) 0x00, 0x10 (Decimal 16); bit width. Can be 3 or 16.
0xC8: (Word) Always 0x01, 0x00. Function unknown.
0xCA: (Word) Always 0x00, 0x00. Function unknown.
0xCC: (Int) Size of image data, as mentioned above
0xD0: (Word) Part of image data; 2-byte ID. Always 0x3C1B for "Casio Provided" images with no footers.
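The field list above maps directly onto a single `struct` format string. The following sketch shows one way to unpack it; the tuple type, its field names, and the function name are illustrative choices, and the "unknown" fields are simply carried through so they can be compared across files.

```python
import struct
from collections import namedtuple

ImageHeader = namedtuple(
    "ImageHeader",
    "unknown_c0 width height bit_depth unknown_c8 unknown_ca data_size data_id",
)

def parse_image_header(data, base=0xC0):
    """Unpack the 18 bytes at base: six words, one int, one word, all big-endian."""
    return ImageHeader(*struct.unpack_from(">6HIH", data, base))

# Bytes 0xC0 through 0xD1 of Beach.g3p, copied from the dump above.
sample = bytes.fromhex("0000 0180 00C0 0010 0100 0000 0000AD14 3C1B")
header = parse_image_header(sample, base=0)
print(header.width, header.height, header.bit_depth)
print(hex(header.data_size), hex(header.data_id))
```

The `>` prefix matters: it selects big-endian byte order and also turns off the alignment padding that `struct` would otherwise insert before the 4-byte field.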

If you look through more files, you see other values for that 2-byte ID that forms the first two bytes of the data, including 0x388D, 0x789C, and 0x3E93. They appear to correspond to various CAPTURE formats as well as the more complex Casio Provided images that include information about the graph window. But to avoid getting distracted, we have now reached the first byte of the compressed image data at 0xD2. What should we do?

3. Decompressing the Data
Early in this tutorial, I mentioned that I was using Python to prototype a script to pull apart the .g3p files I was examining and later put them back together. In fact, as I went through the process I have been describing up to here, I was continually expanding that program to display the values of each of the fields I have mentioned, and to emit errors if any of the values, including security and size fields, were not what I expected. Although it may not be what Casio originally intended, I started to view the format as something like an onion, with layers of size ints followed by data nested inside each other. As my Python program grew, it pulled apart each successive layer, ending with the nugget of what I assumed was compressed image data at the end. Once I got here, though, I was stuck. What could I do next? The obvious solution would be to figure out what decompression algorithm was in use. My biggest fear was that a proprietary compression algorithm was in use, that the compressed data was also wrapped in a layer of encryption (perhaps unlocked by a secret hidden deep in the Prizm firmware), or even both. Unfortunately, preliminary inspection revealed no clues that would help me crack this nut.

A bit of non-hexadecimal sleuthing provided the next clue. I found notes from other intrepid reverse-engineering explorers, who had actually gotten nearly as far as I had in understanding the headers and metadata that comprise the beginning of the .g3p format. They uncovered a clue that proved vital: a copyright line in OS documentation referring to the DEFLATE algorithm developed in the 1990s and commonly used as a lightweight but effective compression algorithm. Since it didn't appear that the Prizm OS compressed any other data, it seemed logical to assume that the image data inside the .g3p format was compressed with the DEFLATE algorithm. Unfortunately, feeding the data starting at 0xD2 to the INFLATE algorithm that complements DEFLATE did nothing. I wrote an O(n log n) program that tried cutting off bytes at the beginning and end of the data to no avail; INFLATE still refused to recognize the data as valid. After a few hours of experimentation, I grudgingly accepted that some layer of obfuscation must be applied. I first applied the obvious bit inversion (flipping every bit) that was used for the size integer at offset 0x10; this proved equally fruitless. I then tried inverting only some bits, then later flipping bytes, rotating bytes, or mixing bytes by exchanging groups of bits. By perseverance and my Python program performing exhaustive permutations on the bit mixing and inversion, I succeeded in discovering the key: cutting off the last four bytes of the data (presumably some checksum?) and performing the following steps:

Quote:

.%76543210 === decode ==> ~%21076543
.%bits.... <== encode === .%bits....

In other words, to decode each obfuscated byte, use bits 7-3 as bits 4-0, and bits 2 to 0 as bits 7-5. Then, invert all bits. This yields a chunk of data that can be successfully decompressed by the INFLATE algorithm. The resulting decompressed data contains two pixels per byte for 3-bit-color images and two bytes per pixel for 16-bit-color images. However, remember those final four bytes we snipped off? In order to create new .g3p files of our own, we need to understand what that checksum actually is, or the Prizm will not display the images.
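The decode step described above amounts to rotating each byte right by three bits and then inverting it, and encoding is the reverse. Here is a minimal Python sketch of both directions together with the INFLATE call. All function and table names are invented for illustration, and it assumes the stream is a headerless ("raw") DEFLATE stream once the 2-byte ID and the last four bytes are removed, as the text describes.

```python
import zlib

def decode_byte(b):
    """Bits 7-3 become bits 4-0, bits 2-0 become bits 7-5, then invert."""
    rotated = ((b >> 3) | (b << 5)) & 0xFF
    return rotated ^ 0xFF

def encode_byte(b):
    """Inverse of decode_byte: invert, then rotate left by 3 bits."""
    inverted = b ^ 0xFF
    return ((inverted << 3) | (inverted >> 5)) & 0xFF

# Lookup tables so whole buffers can be converted with bytes.translate().
DECODE_TABLE = bytes(decode_byte(b) for b in range(256))
ENCODE_TABLE = bytes(encode_byte(b) for b in range(256))

def deobfuscate(data):
    return data.translate(DECODE_TABLE)

def obfuscate(data):
    return data.translate(ENCODE_TABLE)

def inflate_raw(stream):
    """Inflate a headerless DEFLATE stream (negative wbits selects raw mode)."""
    return zlib.decompress(stream, -15)

def extract_pixels(image_data):
    """image_data is everything from 0xD0 on: 2-byte ID, stream, 4 trailing bytes."""
    return inflate_raw(deobfuscate(image_data[2:-4]))

# Round trip on made-up pixel data, since no real file is embedded here.
pixels = bytes(range(256)) * 4
compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
raw = compressor.compress(pixels) + compressor.flush()
fake_image_data = b"\x3C\x1B" + obfuscate(raw) + bytes(4)
print(extract_pixels(fake_image_data) == pixels)
```

One observation that follows from the arithmetic, though the post does not state it: applying `decode_byte` to the ID bytes 0x3C and 0x1B gives 0x78 and 0x9C, which are the usual first two bytes of a zlib stream, so the ID may simply be an obfuscated zlib header.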

4. Cracking the Checksum
The checksum required more manual experimentation to understand, but in the end the solution was a very simple one. I started by extracting the stored checksum on the data bytes in each of the files I was examining, and adding that to a table along with the data type ID and the data length. I then tried summing the data in other interesting ways: (1) was the sum of the compressed but unobfuscated bytes in the data section only, (2) included the metadata as well, and (3) used the inverted, obfuscated bytes. I also added another column with the unmixing and inversion process applied to the checksum in case it was obfuscated along with the data. My table at the end of it looked like this:
Code:

Filename        type   Data length   Checksum      Inv CS        sum1          sum2          sum3          Unmixed CS
Pict04.g3p (3)  3E93   00 00 00 3A   04 DF 60 01   FB 20 9F FE   00 00 06 55   00 00 05 3E   00 42 F0 00   7F 04 F3 DF
Pict01.g3p (3)  3E93   00 00 06 2B   69 EA 96 FA   96 15 69 05   00 03 4D 38   00 03 4C 38   00 3D 2A 0C   D2 A2 2D A0
Pict02.g3p (3)  3E93   00 00 0E 29   B8 4E 4C 50   47 B1 B3 AF   00 07 96 4D   00 07 95 33   00 3D 73 61   E8 36 76 F5
Books.g3p (16)  3E93   00 01 70 F5   E3 6C F4 79   1C 93 0B 86   00 BE 43 DF   00 BE 42 C9   01 6C 4C 7B   83 72 61 D0
Bowl~.g3p (16)  3E93   00 00 08 A2   15 83 0D B3                 00 47 26 B6   00 47 26 A6   01 9F 46 37   5D 8F 5E 89
Beach.g3p (16)  3C1B   00 00 AD 14   85 6A D4 D5   7A 95 2B 2A   00 58 9D 59   00 58 9C 59   01 C3 4A D7   4F B2 65 45
Brid~.g3p (16)  3C1B   00 01 56 1A   31 E9 7F 0B                 00 B0 57 2A   00 B0 55 AD   00 E7 03 14   D9 C2 10 9E

It might not be obvious from a first glance, but nothing matched or was even close. I particularly noted that all of the checksums I calculated were relatively small, especially for the short files, whereas the values of the original checksums did not seem correlated to the size of the file. I presumed that a more cumulative sum was in use, one that added the sum of the current checksum and the new byte to the sum on each iteration. Searching for cumulative checksums yielded a checksum commonly used with DEFLATE called an Adler32 checksum. The Adler32 checksum computes a normal summed checksum and a cumulative checksum, and concatenates their bits to form the final checksum. For the .g3p format, this checksum is computed over the raw, uncompressed data, and the checksum is appended before the obfuscation step is performed (and thus is itself obfuscated).
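To show what "a normal summed checksum and a cumulative checksum" means in practice, here is a straightforward Adler-32 written out by hand and compared against the one in Python's `zlib` module. The hand-written version is for illustration only; in real code `zlib.adler32` is the obvious choice. The sample bytes are made up and do not come from any .g3p file.

```python
import zlib

MOD_ADLER = 65521  # largest prime below 2**16

def adler32(data):
    """Adler-32: a is the plain running sum, b is the sum of every value a took."""
    a, b = 1, 0
    for byte in data:
        a = (a + byte) % MOD_ADLER
        b = (b + a) % MOD_ADLER
    return (b << 16) | a  # cumulative sum in the high half, plain sum in the low

sample = bytes([0xFF, 0xFF] * 8)  # e.g. eight white 16-bit pixels
print("%08X" % adler32(sample))
print(adler32(sample) == zlib.adler32(sample))
```

Because `b` accumulates every intermediate value of `a`, even a short input produces a large, scrambled-looking high half, which matches the observation that the stored checksums did not scale with file size.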

Conclusion
Reverse-engineering the .g3p format was time-consuming but fun, and I learned about a new compression algorithm and a new checksumming technique along the way. I will shortly be releasing the full, more technical description of the different .g3p file formats. In the meantime, I hope this tutorial helped you learn a bit more about the techniques, tools, and experimentation inherent in reverse-engineering a format. As always, questions or comments in the attached topic are encouraged.

APotato
Power User (Posts: 354)

13 Aug 2014 08:24:51 pm

That is amazing to me and I am sure it is to many others as well. And Kerm, how do you come up with ideas for these kinds of projects? Do they just pop into your mind or do you spend time brainstorming?

KermMartian
Site Admin (Posts: 64051)

13 Aug 2014 08:43:59 pm

APotato wrote:

That is amazing to me and I am sure it is to many others as well. And Kerm, how do you come up with ideas for these kinds of projects? Do they just pop into your mind or do you spend time brainstorming?

This one was something I've wanted to do for a while, but I held off because of the political and ethical issues mentioned in this article. It ended up being something I turned my attention to mostly based on my wish to have SourceCoder 3 provide as many features as possible. Thank you, I hope you do indeed enjoy this tutorial, and do not hesitate to ask any questions.

Alex
Official Cemetech Site Manager (Posts: 7912)

13 Aug 2014 08:56:17 pm

KermMartian wrote:

APotato wrote:

That is amazing to me and I am sure it is to many others as well. And Kerm, how do you come up with ideas for these kinds of projects? Do they just pop into your mind or do you spend time brainstorming?

This one was something I've wanted to do for a while, but I held off because of the political and ethical issues mentioned in this article.

This doesn't make sense. An article about G3P support now in SC3 encouraged you to do just that? I think APotato wanted to know more in-depth about the overall process of finding out what you want to do and how you go about doing it.

KermMartian
Site Admin (Posts: 64051)

13 Aug 2014 09:19:39 pm

comicIDIOT wrote:

This doesn't make sense. An article about G3P support now in SC3 encouraged you to do just that?

The article to which I linked discusses the reasons why I originally did not reverse-engineer the format despite wanting to. The article also explains why those reasons are now less valid, and therefore why I finally went ahead and followed through on reverse-engineering this format. I believe that made sense.

Quote:

I think APotato wanted to know more in-depth about the overall process of finding out what you want to do and how you go about doing it.

That's a fair question, but it's very hard to explain where my ideas come from and my criteria for picking which projects I focus on. For the ideas, it's everything from what games and programs I see and play and use that inspire me, to interesting new concepts, systems, and ideas that I read about in academic papers and magazines and blogs, to things in my everyday life that I think would be easier or more fun with a particular piece of hardware or software or both. For the decisions, it's usually whatever I feel most motivated on or haven't worked on in a while. My projects tend to go in a round-robin sort of pattern, eventually working their way out of the cycle when they get to a point where I can release them or am sufficiently disenchanted to lay them aside.

Tari
Systems Wrangler (Posts: 3490)

Re: A Brief Reverse-Engineering Tutorial with the g3p Format
14 Aug 2014 08:33:53 am

KermMartian wrote:

By perseverence and my Python program performing exhaustive permutations on the bit mixing and inversion, I succeeded in discovering the key: cutting off the last four bytes of the data (presumably some checksum?) and performing the following steps:

Quote:

.%76543210 === decode ==> ~%21076543
.%bits.... <== encode === .%bits....

Just reading through this section, it seems to me that a closer look at the bitstream of DEFLATE may have yielded an easier solution. For example, you might assume that all compressed payloads would begin with the bits 110 or 010 (inline Huffman trees, possibly only a single block in the stream).

Given the assumption that there's some bit-mixing going on and some variety in data to examine, it would probably be fairly easy to start discarding numerous possibilities. If you wanted to get cleverer yet, you might black-box the encoder (rather than do the work ab initio) by feeding it images expected to generate particular compressed streams.

KermMartian
Site Admin (Posts: 64051)

Re: A Brief Reverse-Engineering Tutorial with the g3p Format
18 Aug 2014 12:15:23 pm

Tari wrote:

Just reading through this section, it seems to me that a closer look at the bitstream of DEFLATE may have yielded an easier solution. For example, you might assume that all compressed payloads would begin with the bits 110 or 010 (inline Huffman trees, possibly only a single block in the stream).

Indeed, my early investigation was guided by looking for those two prefixes. Unfortunately, because I only considered byte-mixing, not bit-mixing, I ended up abandoning that approach, and did not remember to use it to narrow down my investigation once I started experimenting with bit-mixing. Also, I was not originally very confident that it would help, because I thought it was probable there would be additional encoded prologue data before the DEFLATEd image data began. It turned out I was wrong about that.

Tari wrote:

Given the assumption that there's some bit-mixing going on and some variety in data to examine, it would probably be fairly easy to start discarding numerous possibilities. If you wanted to get cleverer yet, you might black-box the encoder (rather than do the work ab initio) by feeding it images expected to generate particular compressed streams.

Yes, that was on my todo list if the bit-mixing test failed to uncover anything that decoded as valid DEFLATE data. Thanks for pointing that out. :)