When writing a DMG (original Game Boy) emulator I thought about how its tiny sprites are often similar. A sprite is a series of bytes with an address pointing to its beginning.
It got me thinking about a kind of compression algorithm that could find byte overlaps, re-pack the sprites, and then adjust the addresses.
This got me thinking: are there any compression algorithms that do not require decompression? You pack them once and then just point at slices.
Then this got me thinking: is that just what unzipping is doing, just eagerly?
That is how basic Lempel-Ziv compression works, which is one of the two fundamental techniques used in data compression. You keep a history of what you've already decompressed, and if you find something you already have, you instead encode a pointer to it. Variants of this (LZW, LZ77, LZSS, and a bajillion custom variants, because game developers like to reinvent their own compression) have been used in many, many games.
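A rough sketch of the idea in Python (an LZSS-style toy, not any real game format; the parameters are made up): literals are emitted as-is, and repeats become (distance, length) pointers into the already-decompressed history.

```python
def compress(data, min_match=3, max_match=18, window=4096):
    out, i = [], 0
    while i < len(data):
        # search the sliding window for the longest match ending before i
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_match and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            out.append(("ref", best_dist, best_len))
            i += best_len
        else:
            out.append(("lit", data[i]))
            i += 1
    return out

def decompress(tokens):
    out = bytearray()
    for t in tokens:
        if t[0] == "lit":
            out.append(t[1])
        else:  # copy length bytes from dist bytes back in our own output
            _, dist, length = t
            for _ in range(length):
                out.append(out[-dist])
    return bytes(out)
```

Note that the decompressor never sees the original data, only its own output so far, which is why overlapping copies (dist smaller than length) still work.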
The other is entropy coding, which means re-writing bytes and pointer codes into variable-length values optimized by frequency: fewer bits for common codes, more bits for uncommon ones. The most basic approach is Huffman coding, which is kind of like variable-length binary phone numbers: no code is a prefix of another, so you always know where one stops.
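A minimal Huffman code builder, to make the phone-number analogy concrete (a textbook sketch, not production code): merge the two lightest nodes until one tree remains, and read codes off the merge path.

```python
import heapq
from collections import Counter

def huffman_codes(data):
    # one weighted leaf per symbol: [frequency, [symbol, code-so-far]]
    heap = [[freq, [sym, ""]] for sym, freq in Counter(data).items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate single-symbol input
        return {heap[0][1][0]: "0"}
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[1:]:  # everything under the lighter node gets a 0 bit
            pair[1] = "0" + pair[1]
        for pair in hi[1:]:  # everything under the heavier node gets a 1 bit
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return {sym: code for sym, code in heap[0][1:]}
```

Running it on "aaaabbc" gives "a" a 1-bit code and "b"/"c" 2-bit codes, and no code is a prefix of another.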
Together, these two techniques are how DEFLATE compression works, which is the algorithm used by zip, gzip, and zlib. zlib is then used in many, many formats, like PNG and PDF.
Trivial "look for dupes and subsets and just change pointers" compression is already standard when building C code. If your program has two strings, "foobar" and "bar", the linker will just store "foobar" in the strings section and point 3 bytes into it for "bar", assuming the right compilation/linking options to make it work.
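A toy version of that linker trick (the function name and NUL-terminated pool layout are just assumptions for illustration): place longer strings first, then let any string that already appears as a terminated suffix become an offset into the pool.

```python
def pack_strings(strings):
    pool = ""
    offsets = {}
    # place longest strings first so shorter suffixes can reuse them
    for s in sorted(strings, key=len, reverse=True):
        idx = pool.find(s + "\0")
        if idx == -1:
            offsets[s] = len(pool)
            pool += s + "\0"  # NUL-terminated, like a C strings section
        else:
            offsets[s] = idx  # point into an existing longer string
    return pool, offsets
```

With ["foobar", "bar"] the pool is just "foobar\0" and "bar" gets offset 3, exactly the suffix-sharing the comment describes. (It only works for suffixes because of the terminator, which is also why real linkers can only merge tails.)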
There was a similar trick used for compressing game backgrounds in Commodore 64 games.
The compressor would create a special font (C64 "charsets" are 8x8 pixels), and slice the picture into 8x8 tiles, where each unique tile maps to a character code (C64 screen codes, rather than ASCII). Although the C64's hires resolution was 320x200 (160x200 in multicolor mode), because of repeating blocks a picture would usually end up needing fewer than 256 unique characters. The actual background would then be drawn using those characters with the special font. There is no eager decompression involved.
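The tile-dedup step can be sketched like this (a generic version, not C64-specific; `bitmap` is assumed to be a pixel grid with dimensions divisible by 8):

```python
def tiles_to_charset(bitmap, tile=8):
    rows, cols = len(bitmap), len(bitmap[0])
    charset, screen, index = [], [], {}
    for ty in range(0, rows, tile):
        row = []
        for tx in range(0, cols, tile):
            # cut out one 8x8 block as a hashable tuple of rows
            block = tuple(tuple(bitmap[ty + y][tx + x] for x in range(tile))
                          for y in range(tile))
            if block not in index:  # first time seen: new "character"
                index[block] = len(charset)
                charset.append(block)
            row.append(index[block])
        screen.append(row)
    return charset, screen
```

`charset` plays the role of the special font and `screen` is the character map that draws the background; repeated blocks cost one byte each instead of eight.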
For the n+1th cart, it was always the remaining cart with the smallest diff against cart n.
Picking the first n was more difficult. A strong hint was the ROM name containing a variant of the word Japan. Otherwise, imagine you had 4 images: you could diff them 16 ways, but half of those ways are mirror images of each other, i.e. a diff of ROM 2->4 is the inverse of a diff of ROM 4->2.
So ideally, the first ancestor would be the image that yields the smallest diffs against the most other images.
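One way to sketch that heuristic (everything here is hypothetical: `diff_size` is a stand-in byte-difference metric, and the voting scheme is just one reading of "most small children"): each image votes for its nearest neighbour, and the image with the most votes becomes the base.

```python
def diff_size(a, b):
    # crude stand-in for a real binary diff: count differing bytes
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def pick_base(images):
    votes = [0] * len(images)
    for i, img in enumerate(images):
        nearest = min((j for j in range(len(images)) if j != i),
                      key=lambda j: diff_size(img, images[j]))
        votes[nearest] += 1
    # the image most often chosen as "closest parent" wins
    return max(range(len(images)), key=lambda i: votes[i])
```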
Iwata famously implemented RLE in Pokemon Gold/Silver. That gains an easy 50% from all the whitespace.
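RLE itself is about as simple as compression gets: a sketch as (count, byte) pairs, where long runs of a single fill byte collapse to two values.

```python
def rle_encode(data):
    out, i = [], 0
    while i < len(data):
        run = 1
        # extend the run while the byte repeats (capped so count fits a byte)
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out.append((run, data[i]))
        i += run
    return out

def rle_decode(pairs):
    return bytes(b for run, b in pairs for _ in range(run))
```

Fifty zero bytes followed by one 0x01 become just [(50, 0), (1, 1)], which is where the easy wins on blank space come from.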
Huffman coding would be the simplest algorithm to compress the repeated patterns in sprites, but I think even loading the decoding table into memory would run into limitations on these old systems. If nothing else it would cause hitching.
Yes. Check out succinct data structures. They are closely related to compression algorithms. The compression algorithms mentioned above (including entropy coding and all LZ variants) need a decompression step. They might improve decompression speed by chunking, at the cost of storage space, but decompression is still needed. This is unrelated to how they remove redundant information from the data (e.g. using pointers). Succinct data structures avoid decompression entirely: again at some cost in storage space, but some operations can be done on the succinct data directly, as it sits on disk.
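A toy example in the succinct spirit (a tiny rank-support structure; real implementations use multiple levels and bit tricks): store the bits plus small precomputed block counts, and answer rank(i), the number of 1s in bits[:i], without expanding anything.

```python
BLOCK = 8  # block size for the precomputed counts (illustrative choice)

class RankBits:
    def __init__(self, bits):
        self.bits = bits
        # cumulative count of 1s at each block boundary
        self.block_rank = [0]
        for i in range(0, len(bits), BLOCK):
            self.block_rank.append(self.block_rank[-1] + sum(bits[i:i + BLOCK]))

    def rank(self, i):
        # whole blocks come from the table, the tail from a short scan
        b = i // BLOCK
        return self.block_rank[b] + sum(self.bits[b * BLOCK:i])
```

The point is that queries run against the stored representation itself; there is never a decompressed copy, which is the property the original question was after.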