Huffer is a small compressor and decompressor that uses canonical Huffman codes.
I wrote this as a "toy" project to explore and learn more about the Haskell programming language.
You can use cabal to compile and run huffer.
Compile and run with:
$ cabal run
Just compile with:
$ cabal configure
$ cabal build
Install with:
$ cabal install
You can see the command line arguments by running `huffer help`, which prints:
run with: huffer action [inputs] (to output)
action can be 'encode', 'decode' or 'content'
you have to specify at least one input file (or folder) to encode
you can specify only an input file to decode or list content of
you can specify an output file for encoding (or folder for decoding)
if you don't, huffer will use 'output.huf' for encoding ('.' for decoding)
For example, if you run `huffer encode movies/ clips/ to vids.huf`, it will compress every file contained in the `movies` and `clips` directories (and in every one of their subdirectories) into a single file called `vids.huf`.
You can then run `huffer content vids.huf` to list the files that `vids.huf` contains, or run `huffer decode vids.huf to media` to decompress every file contained in `vids.huf` into the `media` directory.
For each file to compress, huffer will (naively) read the file and count the frequency of every word, where a word is a single byte.
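The counting pass can be sketched in a few lines. This is only an illustration, not Huffer's actual source: the names `countFrequencies` and `fileFrequencies` are hypothetical, and the use of a `Data.Map.Strict` map to hold the counts is an assumption.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.Map.Strict as M
import Data.Word (Word8)

-- | Count how often each byte value occurs in the input.
-- The strict left fold walks the lazy ByteString chunk by chunk,
-- so the whole file never has to be held in memory at once.
countFrequencies :: BL.ByteString -> M.Map Word8 Int
countFrequencies = BL.foldl' bump M.empty
  where
    bump freqs w = M.insertWith (+) w 1 freqs

-- | Hypothetical helper: read a file lazily and count its bytes.
fileFrequencies :: FilePath -> IO (M.Map Word8 Int)
fileFrequencies path = countFrequencies <$> BL.readFile path
```

In GHCi, `countFrequencies (BL.pack [97, 98, 97, 97])` evaluates to `fromList [(97,3),(98,1)]`.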
For each file it will then compute the Huffman code, make it canonical, and finally read the file a second time to compress and write it. Files are processed one after another.
Mostly because of this double read, Huffer is not very fast, but because it uses lazy bytestrings it can compress files of any size in almost constant memory.
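As a rough sketch of the "compute the Huffman code" step, the code length of every word can be derived from the frequency table by repeatedly merging the two lightest subtrees. This uses the textbook construction with a sorted list as the queue; the `Tree` type and the `codeLengths` function are hypothetical names, and giving a lone word a 1-bit code is a choice made for this sketch, not something the description above confirms about Huffer.

```haskell
import Data.List (insertBy, sortBy)
import Data.Ord (comparing)
import qualified Data.Map.Strict as M
import Data.Word (Word8)

-- | A Huffman tree: leaves carry a word, inner nodes have two children.
data Tree = Leaf Word8 | Node Tree Tree

-- | Map every word that occurs to the length (in bits) of its Huffman code.
codeLengths :: M.Map Word8 Int -> M.Map Word8 Int
codeLengths freqs =
  case sortBy (comparing fst) [(n, Leaf w) | (w, n) <- M.toList freqs] of
    []            -> M.empty
    [(_, Leaf w)] -> M.singleton w 1  -- assumption: a lone word gets one bit
    queue         -> M.fromList (depths 0 (build queue))
  where
    -- Merge the two lightest trees until only one is left.
    build :: [(Int, Tree)] -> Tree
    build [(_, t)] = t
    build ((n1, t1) : (n2, t2) : rest) =
      build (insertBy (comparing fst) (n1 + n2, Node t1 t2) rest)
    build [] = error "build: empty queue"

    -- The depth of a leaf is the length of its code.
    depths :: Int -> Tree -> [(Word8, Int)]
    depths d (Leaf w)   = [(w, d)]
    depths d (Node l r) = depths (d + 1) l ++ depths (d + 1) r
```

For the frequencies 5, 2, 1 and 1 this yields code lengths of 1, 2, 3 and 3 bits. Only the lengths need to be kept, because a canonical code can be rebuilt from them alone.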
Huffer stores all the files it compressed into an archive file that starts with a header, structured like this:
Bytes | Description |
---|---|
1 | Body type: defines how the body is stored |
2 | Number of entries: n |
Followed by n entries (one per file), each structured like this:
Bytes | Description |
---|---|
4 | The size (in bytes) the file had originally |
4 | The size (in bytes) the file has after compression: m |
2 | The length of the file path: l |
l | The string containing the path of the file |
The header is followed by the body, which contains, for each of the n files listed in the header (in the same order), m bytes of compressed data.
NOTE: At the moment the Body Type byte is always set to 0 and is otherwise ignored, because there is only one body implementation (others will follow if and when I keep playing with this).
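The header layout above could be read with the `Data.Binary.Get` parser from the `binary` package, roughly as follows. This is a sketch under assumptions: the tables do not state a byte order, so big-endian is assumed here, the path is kept as raw bytes because its encoding is not specified, and the `Entry`, `Header`, `getEntry`, `getHeader` and `parseHeader` names are hypothetical.

```haskell
import Control.Monad (replicateM)
import Data.Binary.Get
  (Get, getByteString, getWord16be, getWord32be, getWord8, runGet)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word32, Word8)

-- | One header entry, describing a single archived file.
data Entry = Entry
  { originalSize   :: Word32         -- size before compression
  , compressedSize :: Word32         -- size after compression (m)
  , entryPath      :: BS.ByteString  -- raw path bytes, encoding unspecified
  } deriving (Eq, Show)

-- | The archive header: body type followed by n entries.
data Header = Header
  { bodyType :: Word8
  , entries  :: [Entry]
  } deriving (Eq, Show)

-- | Parse one entry (assumed big-endian): 4 + 4 + 2 bytes, then the path.
getEntry :: Get Entry
getEntry = do
  orig <- getWord32be
  comp <- getWord32be
  len  <- getWord16be
  path <- getByteString (fromIntegral len)
  pure (Entry orig comp path)

-- | Parse the body type, the entry count n, and then n entries.
getHeader :: Get Header
getHeader = do
  kind  <- getWord8
  count <- getWord16be
  Header kind <$> replicateM (fromIntegral count) getEntry

-- | Run the parser on the start of an archive; trailing bytes are ignored.
parseHeader :: BL.ByteString -> Header
parseHeader = runGet getHeader
```

Note that `runGet` throws an error on truncated input; `runGetOrFail` from the same module is the safer choice for untrusted archives.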
The body of each compressed file starts with 256 bytes, one for every possible word in ascending order of byte value (0 to 255), each holding the number of bits in that word's code (see the wiki page for a better explanation), followed by the actual compressed data.
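Storing only the code lengths is enough because canonical codes can be rebuilt from them. The sketch below uses the standard canonical assignment (words sorted by length, then by byte value, with codes counting upwards); whether Huffer uses exactly this bit convention is an assumption, as is treating a length of 0 as "word does not occur". The names `readLengths` and `canonicalCodes` are hypothetical.

```haskell
import Data.Bits (shiftL)
import Data.List (sortOn)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Map.Strict as M
import Data.Word (Word8)

-- | Read the 256-byte table at the start of a file body: entry i is the
-- code length of word i. Words with length 0 are skipped (assumption).
readLengths :: BL.ByteString -> [(Word8, Int)]
readLengths body =
  [ (w, fromIntegral len)
  | (w, len) <- zip [0 .. 255] (BL.unpack (BL.take 256 body))
  , len > 0
  ]

-- | Assign canonical codes, returning (code value, length in bits) per word.
-- Integer is used because a code may be longer than 64 bits.
canonicalCodes :: [(Word8, Int)] -> M.Map Word8 (Integer, Int)
canonicalCodes lengths = M.fromList (go 0 0 sorted)
  where
    -- Shorter codes first; ties are broken by byte value.
    sorted = sortOn (\(w, len) -> (len, w)) lengths
    go _ _ [] = []
    go code prevLen ((w, len) : rest) =
      -- Pad the running code with zeros whenever the length grows.
      let code' = code `shiftL` (len - prevLen)
      in (w, (code', len)) : go (code' + 1) len rest
```

With lengths 1, 2, 3 and 3 for the words 97, 98, 99 and 100, the resulting codes are `0`, `10`, `110` and `111`.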