By Eyal Gruss
On-chain media storage may require efficient inline text compression for HTML / JS. Here is a custom pipeline that generates stand-alone HTML or JS files embedding self-extracting text, with file sizes of roughly 30% - 40% of the original. These file sizes include the decoder code, which is less than 1.5 kB. The approach is designed and optimized for small texts, but it also performs quite well on large texts.
Method | File format | War and Peace (en) | Micromegas (en) |
---|---|---|---|
Project Gutenberg plain text utf8 | txt | 3.2 MB | 63.7 kB |
7-Zip 9 Ultra PPMd (excluding decoder) | 7z | 746 kB (23%) | 20.8 kB (32%) |
7-Zip 9 Ultra PPMd (self extracting) | exe | 958 kB (29%) | 232 kB (364%) |
ZTML (Base125 using utf8 charset) | html | 982 kB (30%) | 29.2 kB (46%) |
ZTML (crEnc using cp1252 charset) | html | 877 kB (27%) | 26.1 kB (41%) |
The standard simplified pipeline can be run by calling `generate()` or by running `python ztml.py` from the command line. See ztml.py.
crEnc gives better compression but requires setting the HTML or JS charset to cp1252. Base125 is the second best option if one must stick with utf8.
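The gap between the two encodings can be motivated by counting how many byte values form a complete one-byte character under each charset. The following is a rough sketch of that argument only, not ZTML's encoder; `one_byte_symbols` is a hypothetical helper, and Python's `cp1252` codec is used as a stand-in for the browser's charset handling (browsers also accept the few byte values that Python's codec leaves undefined).

```python
import math

def one_byte_symbols(charset: str) -> int:
    """Count byte values that decode to a character on their own in `charset`."""
    usable = 0
    for value in range(256):
        try:
            bytes([value]).decode(charset)
        except UnicodeDecodeError:
            continue  # not a complete one-byte character in this charset
        usable += 1
    return usable

for charset in ("utf8", "cp1252"):
    n = one_byte_symbols(charset)
    # log2(n) is the payload carried per byte if only one-byte symbols are used
    print(f"{charset}: {n} one-byte symbols, {math.log2(n):.2f} payload bits per byte")
```

Under utf8 only the ASCII range fits in a single byte, so an encoder restricted to one-byte characters carries about 7 bits per byte, while a single-byte charset gets close to 8. Characters that must be escaped inside a JS template literal reduce both figures slightly.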
See example.py for a complete example reproducing the above benchmark.
Note: files larger than a few MB might not work on iOS Safari or macOS Safari 15.
ZTML pipeline:
- Text normalization (irreversible; reduce whitespace, substitute unicode punctuation)
- Text condensation (reversible; lowercase with automatic capitalization*, substitute common strings such as "the" and "qu")
- Huffman encoding (with a codebook-free decoder, beneficial even when followed by DEFLATE)
- Burrows–Wheeler transform
- PNG / DEFLATE compression (allowing native decompression, aspect ratio optimized for minimal padding, Zopfli optimization)
- Binary-to-text encoding embedded in JS template literals (Base125 or crEnc)
- Uglification of the generated JS (substitute recurring element, attribute and function names with short aliases)
*Automatic capitalization recovery is currently partial.
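
The middle steps of the list above (Huffman encoding, Burrows–Wheeler transform, DEFLATE) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in ztml.py: it uses an ordinary Huffman code with an explicit codebook rather than a codebook-free decoder, a naive quadratic BWT applied to the Huffman bit string (whether ZTML transforms bits or symbols is an assumption here), and `zlib` in place of PNG with Zopfli. All function names are hypothetical.

```python
import heapq
import zlib
from collections import Counter

def huffman_code(text: str) -> dict[str, str]:
    """Build a plain Huffman code (symbol -> bit string) from symbol counts."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie breaker, {symbol: partial code})
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w0, _, c0 = heapq.heappop(heap)
        w1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (w0 + w1, tie, merged))
        tie += 1
    return heap[0][2]

def huffman_decode(bits: str, code: dict[str, str]) -> str:
    """Decode a bit string with the explicit codebook (prefix-free by construction)."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)

def bwt(data: str, eos: str = "$") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    assert eos not in data
    s = data + eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last: str, eos: str = "$") -> str:
    """Invert the transform by repeatedly prepending the last column and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    row = next(r for r in table if r.endswith(eos))
    return row[:-1]

def compress_sketch(text: str) -> tuple[bytes, dict[str, str]]:
    """Huffman-encode, transform the bit string, then DEFLATE it with zlib."""
    code = huffman_code(text)
    bits = "".join(code[ch] for ch in text)
    return zlib.compress(bwt(bits).encode("ascii"), 9), code

def decompress_sketch(payload: bytes, code: dict[str, str]) -> str:
    """Undo the three steps in reverse order."""
    bits = inverse_bwt(zlib.decompress(payload).decode("ascii"))
    return huffman_decode(bits, code)
```

A round trip such as `decompress_sketch(*compress_sketch("the quiet queen"))` returns the input text. The sketch stores each bit as one ASCII character before DEFLATE to stay short; a real pipeline would pack the bits (into PNG pixels, per the list above) and would need a faster BWT for texts of any size.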