November 14th, 2024
In which I investigate a 30-year-old video format.
Imagine if you could watch movies on your computer. With the arrival of CD-ROM discs in the 90s, many game developers were dreaming of this. With hundreds of megabytes to spare, you could finally make your game a cinematic multimedia masterpiece.
One of those game developers was Westwood Studios. Their games such as Command & Conquer entertained their players with carefully produced but unintentionally campy intermission cutscenes.
What concerns us here however, is their in-house “VQA” video format. I got into it in 2020 when the C&C remaster was released with its somewhat strange-looking upscaled high-definition videos. I learned more about VQA when extending its support in the popular ffmpeg command-line tool. Later I also wrote a standalone encoder for it.
In this article I’ll use the VQA format as a vehicle to talk about more general data compression concepts. I think VQA is an interesting example since it is optimized for fast decoding on ’90s CPUs, slow CD-ROM drives, and cheap graphics chips. It shipped in games played by millions of people, but is simple by necessity.
There isn’t anything really difficult here. Every frame gets split into blocks of 4x2 or 4x4 pixels, depending on the game. Common blocks are shared by nearby frames and collected into a small set known as the codebook. Each frame can then be represented using those common blocks instead of the original pixels. Occasionally the codebook blocks get a refresh, for example when a scene cut happens.
VQA videos use both lossy and lossless compression. The lossy part is the vector quantization itself, plus, for the 8-bit videos, quantizing the colors down to a palette. The lossless techniques compress the already smaller lossy stream further: sliding-window (LCW) compression, run-length encoding, splitting the codes into separate byte streams, and skipping unchanged blocks.
Compressors often squeeze the final bitstream even further with an “entropy coding” step such as Huffman coding, which uses fewer bits to represent commonly occurring sequences. In VQA, however, there’s no entropy coding at all; I suppose leaving it out keeps the decoder simple.
OK let’s talk about the details now. We’ll begin with the most important concept.
At its core VQA is a vector quantization codec. Vector quantization is a technical term for a compression scheme where you replace many similar parts with a single representative. It’s the block reuse scheme I just described in the high-level overview section earlier. In compression jargon the blocks are called vectors and the indices that refer to codebook blocks are called codes.
The codebooks can look something like below:
So why call blocks “vectors”? Let’s say we have 4x4 pixel RGB blocks. They consist of 4*4*3 = 48 values each. We can consider the blocks as 48-dimensional vectors if we put all the RGB values of every pixel one after another. Now it’s possible to apply vector operations to the blocks!
So for example the block on the left would become this vector (the exact order doesn’t matter as long as it’s consistent):
[140 140 140 173 173 156 206 206 173 214 214 181 140 140 140 173 173 156
214 206 181 222 222 181 140 140 140 173 173 156 214 214 181 231 231 189
140 140 140 173 173 156 214 214 181 231 222 189]
In practice the only vector operation we are interested in is computing the distance between two points. Now a clustering algorithm such as Lloyd’s algorithm, a.k.a. k-means, can build the codebook for us. Finding the nearest codebook entry for some block can then be done by finding the nearest cluster centroid. In my ClusterVQA encoder I used scikit-learn’s MiniBatchKMeans class and it worked OK.
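As a rough sketch of how this fits together (this isn’t the actual ClusterVQA code, just a hypothetical build_codebook helper using numpy and scikit-learn, assuming all frames are same-sized RGB arrays):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(frames, block_w=4, block_h=4, num_entries=256):
    """Cluster the blocks of all frames into a small shared codebook.

    'frames' is a list of HxWx3 uint8 RGB images whose dimensions are
    multiples of the block size. Returns (codebook, codes_per_frame).
    """
    vectors = []
    for frame in frames:
        h, w, _ = frame.shape
        # Chop the frame into block_h x block_w tiles and flatten each
        # tile into one vector (48-dimensional for 4x4 RGB blocks).
        tiles = (frame
                 .reshape(h // block_h, block_h, w // block_w, block_w, 3)
                 .swapaxes(1, 2)
                 .reshape(-1, block_h * block_w * 3))
        vectors.append(tiles)
    vectors = np.concatenate(vectors).astype(np.float32)

    kmeans = MiniBatchKMeans(n_clusters=num_entries)
    labels = kmeans.fit_predict(vectors)

    # The cluster centroids become the codebook entries; each block's
    # label is the code that refers to its nearest codebook entry.
    codebook = np.clip(np.rint(kmeans.cluster_centers_), 0, 255).astype(np.uint8)
    codes_per_frame = labels.reshape(len(frames), -1)
    return codebook, codes_per_frame

Finding the nearest codebook entry for new blocks is then just kmeans.predict() on their flattened vectors.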
Compressing a high number of colors down to a small palette is called color quantization, but if you look at it, it’s also vector quantization! The vectors are 3-dimensional RGB colors and the codebook is now the palette. This is why, for example, Apple Video can be considered a VQ-based video codec as well, even though it doesn’t use a codebook: it effectively creates small 4-color palettes for 4x4 pixel blocks.
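Sketched the same way (a hypothetical quantize_colors helper, not any particular codec’s code), building a palette is just clustering 3-dimensional vectors:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def quantize_colors(image, palette_size=256):
    """Reduce an HxWx3 uint8 RGB image to at most palette_size colors."""
    pixels = image.reshape(-1, 3).astype(np.float32)
    kmeans = MiniBatchKMeans(n_clusters=palette_size).fit(pixels)
    # The "codebook" is the palette, the "codes" are per-pixel indices.
    palette = np.clip(np.rint(kmeans.cluster_centers_), 0, 255).astype(np.uint8)
    indices = kmeans.predict(pixels).reshape(image.shape[:2]).astype(np.uint8)
    return palette, indices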
Each new frame is a 2D array of 16-bit codes, i.e. block indices into the codebook. These take up most of the video stream, so it makes sense to compress them.
Both codebooks and frame updates are compressed with sliding-window compression. It’s LZ77-style and called “LCW”, also known as “format80” because the byte 0x80 marks the end of data. It’s the same idea as in zip files: if some byte sequence repeats, just refer to it with an (offset, byte_length) pair instead of writing out the whole thing. The only unique LCW feature is a special command that writes a repeated constant instead of copying it from the stream.
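To make the idea concrete, here is a toy decoder for a made-up LCW-like command stream. The command bytes and layout below are invented for illustration (the real format80 packs its commands differently), but it combines the same ingredients: literal bytes, (offset, length) copies from already-decoded output, a constant-fill command, and 0x80 as the end marker.

def toy_lcw_decompress(data: bytes) -> bytes:
    """Decode a made-up LCW-like stream (NOT the real format80 layout).

    Commands:
      0x80                     end of data (like the real format)
      0x01 n  <n bytes>        copy n literal bytes to the output
      0x02 off_lo off_hi len   copy len bytes starting off bytes back
                               in the already-decoded output
      0x03 n value             write the constant byte value n times
    """
    out = bytearray()
    pos = 0
    while True:
        cmd = data[pos]; pos += 1
        if cmd == 0x80:                      # end marker
            break
        elif cmd == 0x01:                    # literal run
            n = data[pos]; pos += 1
            out += data[pos:pos + n]; pos += n
        elif cmd == 0x02:                    # back-reference copy
            off = data[pos] | (data[pos + 1] << 8)
            length = data[pos + 2]; pos += 3
            start = len(out) - off
            for i in range(length):          # byte by byte, so a copy may
                out.append(out[start + i])   # overlap its own output
        elif cmd == 0x03:                    # run-fill with a constant
            n, value = data[pos], data[pos + 1]; pos += 2
            out += bytes([value]) * n
        else:
            raise ValueError(f"unknown command {cmd:#x}")
    return bytes(out)

For example, toy_lcw_decompress(bytes([0x01, 3, 65, 66, 67, 0x03, 4, 33, 0x80])) returns b'ABC!!!!'.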
In Command & Conquer’s frame update data the sliding-window compression seems to result in sizes 55–75% of the original.
Sometimes it’s worthwhile to reorder your data to make it easier to compress. In 8-bit VQA files, the frame updates are not stored simply as a sequence of 16-bit integers but as two streams: the 16-bit values are split into two bytes, and the low bytes come first in the file, followed by the high ones. In the hex dump below, taken right from the middle of a chunk, you can see how different the value distributions are and how the high bytes have more repetition:
000017b0: 5e5e b1a4 d57c 811c 02ba 5248 411c d2a3 ^^...|....RHA...
000017c0: 84c3 82c2 7a4b b59a 500f e6af f138 ed72 ....zK..P....8.r
000017d0: c34e 46b4 29ce 7bab 9508 9fe1 4dbd b948 .NF.).{.....M..H
000017e0: 5ab2 ea48 076a f9f9 ee90 fef9 5916 02ee Z..H.j......Y...
000017f0: 8bf2 69b0 13d6 7ebd de5f 7b35 2a68 abdb ..i...~.._{5*h..
00001800: e67e 2f97 407d 2797 7b97 cbbc 0a88 0a5c .~/.@}'.{......\
00001810: 0b7a d0bc 4a94 1a9c 9a3d 15af ce8b b41d .z..J....=......
00001820: 3c1b 3c36 91bb 69ce d20a eb37 b913 a83b <.<6..i....7...;
00001830: a348 ea48 07f7 f9f9 ee90 1bf9 59de 02ee .H.H........Y...
00001840: 8b7c 18b0 4622 d6bd de24 ef35 389a abab .|..F"...$.58...
00001850: 9446 5697 1a1d 2797 7b7d 6600 5c1d 0a7b .FV...'.{}f.\..{
00001860: 0f01 0000 0500 0004 0003 0001 0005 0202 ................
00001870: 0f02 0401 0004 0000 0000 0001 0000 0000 ................
00001880: 0505 0206 0301 0100 0001 0100 000f 0000 ................
00001890: 010f 0f00 0001 0f00 0f0f 0f0f 0f0f 010f ................
000018a0: 0f0f 0f0f 0505 0101 0001 0101 0103 0100 ................
000018b0: 0f00 070f 0000 0502 0205 0206 0000 0200 ................
000018c0: 0f00 0004 0000 0400 0201 0001 0001 0100 ................
000018d0: 0605 0205 0301 0102 0000 0701 010f 000f ................
000018e0: 0000 0000 0f00 0000 0f0f 0f0f 0f0f 0f0f ................
000018f0: 0f01 0f0f 0603 0002 0401 0000 0106 0401 ................
00001900: 0f00 0106 0001 0706 0502 0000 0402 0f0f ................
Something similar is supported in Apache Parquet, for example.
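The split itself is trivial; here is a sketch of the low-bytes-then-high-bytes idea (the exact chunk layout isn’t reproduced):

import numpy as np

def split_codes(codes: np.ndarray) -> bytes:
    """Store 16-bit codes as all the low bytes followed by all the high bytes."""
    codes = codes.astype(np.uint16).ravel()
    low = (codes & 0xFF).astype(np.uint8)
    high = (codes >> 8).astype(np.uint8)
    return low.tobytes() + high.tobytes()

def join_codes(blob: bytes) -> np.ndarray:
    """Reassemble the two byte streams back into 16-bit codes."""
    half = len(blob) // 2
    low = np.frombuffer(blob[:half], dtype=np.uint8).astype(np.uint16)
    high = np.frombuffer(blob[half:], dtype=np.uint8).astype(np.uint16)
    return low | (high << 8)

As the dump above shows, the high-byte half ends up being mostly small, repetitive values that the sliding-window pass can squeeze well.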
In 15-bit Hi-Color VQAs the frame updates use a run-length encoding (RLE) scheme: for repeated “runs” of the same code, a (code, length) pair is written out instead of the original data. Note that this is simpler than sliding-window compression because it can only repeat a single value.
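A minimal sketch of such a run-length coder over a list of codes (the on-disk layout of the pairs isn’t reproduced here):

def rle_encode(codes):
    """Collapse runs of equal codes into (code, run_length) pairs."""
    pairs = []
    i = 0
    while i < len(codes):
        j = i
        while j < len(codes) and codes[j] == codes[i]:
            j += 1
        pairs.append((codes[i], j - i))
        i = j
    return pairs

def rle_decode(pairs):
    """Expand (code, run_length) pairs back into the original code list."""
    out = []
    for code, length in pairs:
        out.extend([code] * length)
    return out

For example, rle_encode([7, 7, 7, 2, 2, 9]) gives [(7, 3), (2, 2), (9, 1)].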
Frame updates can also skip over some blocks. This way static areas of the screen can be kept unchanged; this is what I refer to as delta encoding. In practice almost all blocks get updated anyway.
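A sketch of how an encoder could pick the blocks to update, assuming the per-frame codes are plain lists (a hypothetical helper, not the actual chunk format):

def delta_update(prev_codes, new_codes):
    """Return (position, code) pairs only for blocks whose code changed.

    Unchanged blocks are skipped, which is what keeps static areas of the
    screen cheap; if nearly everything changes, the update is as big as a
    full refresh.
    """
    return [(i, code)
            for i, (old, code) in enumerate(zip(prev_codes, new_codes))
            if old != code]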
To be honest, I was surprised to learn that 8-bit VQAs don’t do any delta encoding and always update every block. The run-length encoded code updates are additionally compressed with LCW, like before.
The video bitrate seems to be about 1 Mbit/s for the 8-bit videos and 2.5 Mbit/s for the 15-bit ones. For comparison, DVD video bitrates are 3–6 Mbit/s.
How much smaller are the compressed videos than the originals? Assuming each 8-bit video frame would have its own palette, let’s compute the uncompressed bit rates.
Uncompressed bitrate (8-bit)
((256 * 2 + 320*156)*15*8) / 1e6 = 6.05 Mbit/s
Uncompressed bitrate (15-bit)
((640*400*2)*15*8) / 1e6 = 61.44 Mbit/s
So we get 5.9x and 25x compression ratios for 8-bit and 15-bit videos, respectively:
Compression Ratio = Uncompressed data rate / Compressed data rate
Compression Ratio (8-bit) = 6.05 Mbit/s / 1.02 Mbit/s = 5.9x
Compression Ratio (15-bit) = 61.44 Mbit/s / 2.42 Mbit/s = 25x
You’re also supposed to measure image distortion with a metric such as PSNR, but we don’t have access to the originals, so that can’t be done here. Of course I could encode my own VQAs and compare, but that’s too much work :)
Here are some more details on the six test videos I used to compute the numbers above. Apparently the 15-bit “VQA Hi-Color” videos consist of about 96% code indices, while for the 8-bit ones that portion doesn’t get much higher than 80%.
filename      resolution  frames  colors  Mbit/s  codebooks  indices  palettes  size (MiB)
AKIRA.VQA     320x156        845   8-bit    1.01    23.12 %  76.83 %    0.05 %        6.81
SPYCRASH.VQA  320x156        245   8-bit    0.82    24.84 %  75.11 %    0.04 %        1.61
INTRO.VQA     320x156       1933   8-bit    1.22    18.95 %  80.52 %    0.54 %       18.74
nowcnot.vqa   640x400        286  15-bit    2.35     3.05 %  96.95 %         –        5.34
hideseek.vqa  640x400        255  15-bit    2.29     3.37 %  96.63 %         –        4.64
iceskate.vqa  640x400        247  15-bit    2.63     3.90 %  96.10 %         –        5.17
All the videos played at 15 frames per second.
While VQA served its purpose, in Red Alert 2 the videos were encoded with the industry-standard Bink Video.
The last game that shipped with VQA was the NoX action RPG, also from 2000 like RA2. It is rumored that NoX accidentally shipped with the original codebook construction code in it. I haven’t seen it myself but it would be a neat swan song for the format if it were true.
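For reference, here is how a decoder can enumerate the known chunk tags: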
/* catch any unknown header tags, for curiosity */
switch (chunk_tag) {
case CINF_TAG:
case CINH_TAG:
case CIND_TAG:
case LINF_TAG:
case PINF_TAG:
case PINH_TAG:
case PIND_TAG:
case FINF_TAG:
case CMDS_TAG:
case VIEW_TAG:
case ZBUF_TAG:
    break;

default:
    /* anything else is an unknown chunk; note it, for curiosity */
    av_log(s->avctx, AV_LOG_WARNING, "unknown chunk tag 0x%08x\n", chunk_tag);
    break;
}
Note the ZBUF chunks in the list above: the Z-buffers are used to correctly occlude characters behind the looping video environments.