
Extreme inline text compression for HTML / JS. A custom pipeline that generates stand-alone HTML or JS files which embed competitively compressed self-extracting text, with file sizes of 25% - 40% the original.


ZTML

Extreme inline text compression for HTML / JS

Partially made at Stochastic Labs

On-chain media storage can require efficient compression for text embedded inline in HTML / JS. ZTML is a custom pipeline that generates stand-alone HTML or JS files which embed competitively compressed self-extracting text, with file sizes of 25%-40% of the original. These file sizes include the decoder code, which is ~1.5 kB (including auxiliary indices and tables). The approach makes the most sense for, and is optimized for, small texts (tens of kB), but it also performs quite well on large texts. The pipeline includes low-overhead binary-to-text alternatives to Base64, which are also useful for inline images. You can find a very high-level overview in these slides from Reversim Summit 2022.
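To put "low overhead" in perspective, the following is a rough arithmetic sketch and not part of ZTML. It measures Base64 with Python's standard library and applies the overhead percentages quoted in the pipeline breakdown (1.2% for crEnc, 14.7% for Base125) to the same payload size. The 30,000-byte payload and the helper name `encoded_size` are assumptions made for illustration only.

```python
import base64
import os

# Stand-in for already-compressed binary data (size chosen arbitrarily).
payload = os.urandom(30_000)

# Base64 maps every 3 input bytes to 4 ASCII characters.
b64_size = len(base64.b64encode(payload))
b64_overhead = b64_size / len(payload) - 1


def encoded_size(n_bytes: int, overhead: float) -> int:
    """Estimate the encoded size for a given fractional overhead."""
    return round(n_bytes * (1 + overhead))


print(f"Base64:  {b64_size} bytes ({b64_overhead:.1%} overhead, measured)")
print(f"Base125: ~{encoded_size(len(payload), 0.147)} bytes (14.7% as quoted)")
print(f"crEnc:   ~{encoded_size(len(payload), 0.012)} bytes (1.2% as quoted)")
```

The Base125 and crEnc numbers here are estimates derived from the quoted percentages; the actual overhead of an escaping scheme such as crEnc depends on the data.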

Benchmark

| Method | File format | Micromegas (En) | War and Peace (En) |
|---|---|---|---|
| Project Gutenberg plain text utf8 | txt | 63.7 kB | 3.2 MB |
| paq8px_v206fix1 -12RT (excluding decoder) | paq | 13.3 kB (21%) | 575 kB (18%) |
| 7-Zip 22.01 9 Ultra PPMd (excluding decoder) | 7z | 20.8 kB (32%) | 746 kB (23%) |
| 7-Zip 22.01 9 Ultra PPMd (self-extracting) | exe | 232 kB (364%) | 958 kB (29%) |
| Zstandard 1.5.2 -22 --ultra (excluding decoder) | zst | 23.4 kB (37%) | 921 kB (28%) |
| Roadroller 2.1.0 -O2 | js | 26.5 kB (42%) | 1.0 MB (30%) |
| ZTML Base125 | html (utf8) | 26.4 kB (41%) mtf=0 | 902 kB (28%) mtf=80 ect=True |
| ZTML crEnc | html (cp1252) | 23.5 kB (37%) mtf=0 | 803 kB (24%) mtf=80 ect=True |

Installation

git clone https://github.com/eyaler/ztml
pip install -r ztml/requirements.txt

For running validations, you also need to have Chrome, Edge and Firefox installed.

Usage

A standard simplified pipeline can be run by calling ztml() or running python ztml.py from the command line. See ztml.py. Of course, there is also an accessible Google Colab with a simple GUI. Shortcut: bit.ly/ztml1.

crEnc gives better compression but requires setting the HTML or JS charset to cp1252. Base125 is the second-best option if one must stick with utf8.
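The charset requirement is easier to see with a small byte count. The sketch below is an illustration and not ZTML code: it shows that a character in the upper half of a single-byte charset occupies one byte under cp1252 but two bytes under utf8, which is presumably why an encoding that uses nearly all byte values only pays off with a single-byte charset. The helper name `encoded_len` and the sample string are made up for this example.

```python
def encoded_len(text: str, charset: str) -> int:
    """Number of bytes needed to store `text` in the given charset."""
    return len(text.encode(charset))


# 100 characters from the upper half of cp1252 (U+00FF) plus 100 ASCII letters.
sample = "\xff" * 100 + "a" * 100

for charset in ("cp1252", "utf-8"):
    print(f"{charset}: {encoded_len(sample, charset)} bytes")
```

ASCII characters cost one byte either way; only the non-ASCII half of the sample grows under utf8.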

See example.py for a complete example reproducing the ZTML results in the above benchmark, and example_image.py for an inline image encoding example. Outputs of these runs can be accessed at eyalgruss.com/ztml. On top of the built-in validations for Chrome, Edge and Firefox, these were also manually tested on macOS Monterey 12.5 Safari 15.6 and iOS 16.0 Safari.

A quick and dirty way to compress an existing single-page HTML website with embedded inline media is to use raw=True.

Caveats

  1. Files larger than a few MB might not work on iOS Safari or macOS Safari 15.
  2. This solution favors compression ratio over compression and decompression times. Use mtf=None for faster decompression of large files.
  3. For compressing word lists (sorted lexicographically), solutions such as Roadroller do a much better job.

ZTML pipeline breakdown

  1. Text normalization (irreversible; reduce whitespace, substitute Unicode punctuation)
  2. Text condensation (reversible; lowercase with automatic capitalization, substitute common strings such as: the, qu)
  3. Burrows–Wheeler + Move-to-front transforms on text with some optional variants, including some new ones (beneficial for large texts with higher mtf settings)
  4. Huffman encoding (with a codebook-free decoder, beneficial even when followed by DEFLATE)
  5. Burrows–Wheeler transform on bits (beneficial for large texts)
  6. PNG / DEFLATE compression (allowing native decompression, aspect ratio optimized for minimal padding, Zopfli optimization)
  7. Binary-to-text encoding embedded in JS template literals:
    1. crEnc encoding (a yEnc variant, with 1.2% overhead, to be used with single-byte charset)
    2. Base125 encoding (a Base122 variant, with 14.7% overhead, to be used with utf8 charset)
  8. Uglification of the generated JS (substitute recurring element, attribute and function names with short aliases)
  9. Validation of content reproduction on Chrome, Edge and Firefox
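
For readers unfamiliar with step 3, here is a minimal textbook sketch of the Burrows–Wheeler and Move-to-front transforms. It is not ZTML's implementation and does not include the optional variants mentioned above. It uses naive sorted rotations and assumes a NUL end-of-string marker that does not occur in the input; all function names are illustrative.

```python
def bwt(text: str, eos: str = "\0") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    assert eos not in text, "end-of-string marker must not occur in the input"
    s = text + eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last: str, eos: str = "\0") -> str:
    """Invert the transform by repeatedly prepending the last column and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    original = next(row for row in table if row.endswith(eos))
    return original[:-1]


def mtf_encode(text: str, alphabet: list[str]) -> list[int]:
    """Replace each symbol by its current position, then move it to the front."""
    table = list(alphabet)
    indices = []
    for ch in text:
        i = table.index(ch)
        indices.append(i)
        table.insert(0, table.pop(i))
    return indices


def mtf_decode(indices: list[int], alphabet: list[str]) -> str:
    """Inverse of mtf_encode, starting from the same initial alphabet."""
    table = list(alphabet)
    out = []
    for i in indices:
        ch = table.pop(i)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)


transformed = bwt("banana")
alphabet = sorted(set(transformed))
indices = mtf_encode(transformed, alphabet)
print(repr(transformed), indices)
```

The Burrows–Wheeler transform groups identical symbols into runs, and Move-to-front turns those runs into small integers (mostly zeros on longer texts), which a later entropy coder can represent compactly.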

Note: image encoding only uses steps 7 and later.
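
Step 6 mentions an aspect ratio optimized for minimal padding. One plausible reading, sketched below under stated assumptions, is that the data is packed into a rectangular image and the width and height are chosen so that as few filler bytes as possible are needed. This is not ZTML's code: it assumes one byte per pixel, and the `max_dim` limit, the tie-breaking rule (prefer the squarest shape) and the function name `best_dims` are all hypothetical.

```python
def best_dims(n_bytes: int, max_dim: int) -> tuple[int, int]:
    """Pick (width, height) with width * height >= n_bytes and minimal padding.

    Ties are broken in favor of the squarest shape, then the smaller width.
    """
    if n_bytes < 1:
        raise ValueError("n_bytes must be positive")
    candidates = []
    for width in range(1, max_dim + 1):
        height = -(-n_bytes // width)  # ceiling division
        if height <= max_dim:
            padding = width * height - n_bytes
            candidates.append((padding, abs(width - height), width, height))
    if not candidates:
        raise ValueError("data does not fit within max_dim x max_dim")
    _, _, width, height = min(candidates)
    return width, height


width, height = best_dims(17, max_dim=10)
print(width, height, width * height - 17)
```

With 17 bytes and a limit of 10 pixels per side, no exact factorization fits, so the sketch settles for a shape that needs a single padding byte.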

Projects using this
