Lzma Specification
Lzma Specification
Lzma Specification
----------------------------------
This specification defines the format of LZMA compressed data and lzma file format.
Notation
--------
The lzma file contains the raw LZMA stream and the header with related properties.
LZMA properties:
The dictionary size in LZMA2 is encoded with just one byte and LZMA2 supports
only reduced set of dictionary sizes:
(2 << 11), (3 << 11),
(2 << 12), (3 << 12),
...
(2 << 30), (3 << 30),
(2 << 31) - 1
The dictionary size can be extracted from encoded value with the following code:
dictSize = (p == 40) ? 0xFFFFFFFF : (((UInt32)2 | ((p) & 1)) << ((p) / 2 + 11));
if (lc + lp > 4)
throw "Unsupported properties: (lc + lp) > 4";
There are some advantages for LZMA decoder with such (lc + lp) value
limitation. It reduces the maximum size of tables allocated by decoder.
And it reduces the complexity of initialization procedure, that can be
important to keep high speed of decoding of big number of small LZMA streams.
It's recommended to use that limitation (lc + lp <= 4) for any new format
that uses LZMA compression. Note that the combinations of "lc" and "lp"
parameters, where (lc + lp > 4), can provide significant improvement in
compression ratio only in some rare cases.
The LZMA properties can be encoded into two bytes in new scheme:
The RAM usage for LZMA decoder is determined by the following parts:
If we decode full LZMA stream to one output buffer in RAM, the decoder
can use that output buffer as sliding window. So the decoder doesn't
need additional buffer allocated for sliding window.
The size of the probability model counter arrays is calculated with the
following formula:
For example, for default LZMA parameters (lp = 0 and lc = 3), the RAM usage is
The maximum RAM state usage is required for decoding the stream with lp = 4
and lc = 8:
But there are some modes of the encoder that require less memory.
LZMA Decoding
=============
The LZMA compression algorithm uses LZ-based compression with Sliding Window
and Range Encoding as entropy coding method.
Sliding Window
--------------
class COutWindow
{
Byte *Buf;
UInt32 Pos;
UInt32 Size;
bool IsFull;
public:
unsigned TotalPos;
COutStream OutStream;
COutWindow(): Buf(NULL) {}
~COutWindow() { delete []Buf; }
void PutByte(Byte b)
{
TotalPos++;
Buf[Pos++] = b;
if (Pos == Size)
{
Pos = 0;
IsFull = true;
}
OutStream.WriteByte(b);
}
Range Decoder
-------------
LZMA stream contains just one very big number in big-endian encoding.
LZMA decoder uses the Range Decoder to extract a sequence of binary
symbols from that big number.
struct CRangeDecoder
{
UInt32 Range;
UInt32 Code;
InputStream *InStream;
bool Corrupted;
}
The notes about UInt32 type for the "Range" and "Code" variables:
If the programming language does not support 32-bit unsigned integer type
(like in case of JAVA language), it's possible to use 32-bit signed integer,
but some code must be changed. For example, it's required to change the code
that uses comparison operations for UInt32 variables in this specification.
(Corrupted == false), if the Range Decoder has not detected any corruption.
(Corrupted == true), if the Range Decoder has detected some corruption.
The reference LZMA Decoder ignores the value of the "Corrupted" variable.
So it continues to decode the stream, even if the corruption can be detected
in the Range Decoder. To provide the full compatibility with output of the
reference LZMA Decoder, another LZMA Decoder implementations must also
ignore the value of the "Corrupted" variable.
The LZMA Encoder is required to create only such LZMA streams, that will not
lead the Range Decoder to states, where the "Corrupted" variable is set to true.
The Range Decoder reads first 5 bytes from input stream to initialize
the state:
bool CRangeDecoder::Init()
{
Corrupted = false;
Range = 0xFFFFFFFF;
Code = 0;
Byte b = InStream->ReadByte();
if (b != 0 || Code == Range)
Corrupted = true;
return b == 0;
}
The LZMA Encoder always writes ZERO in initial byte of compressed stream.
That scheme allows to simplify the code of the Range Encoder in the
LZMA Encoder. If initial byte is not equal to ZERO, the LZMA Decoder must
stop decoding and report error.
After the last bit of data was decoded by Range Decoder, the value of the
"Code" variable must be equal to 0. The LZMA Decoder must check it by
calling the IsFinishedOK() function:
The value of the "Range" variable before each bit decoding can not be smaller
than ((UInt32)1 << 24). The Normalize() function keeps the "Range" value in
described range.
void CRangeDecoder::Normalize()
{
if (Range < kTopValue)
{
Range <<= 8;
Code = (Code << 8) | InStream->ReadByte();
}
}
Notes: if the size of the "Code" variable is larger than 32 bits, it's
required to keep only low 32 bits of the "Code" variable after the change
in Normalize() function.
If the LZMA Stream is not corrupted, the value of the "Code" variable is
always smaller than value of the "Range" variable.
But the Range Decoder ignores some types of corruptions, so the value of
the "Code" variable can be equal or larger than value of the "Range" variable
for some "Corrupted" archives.
LZMA uses Range Encoding only with binary symbols of two types:
1) binary symbols with fixed and equal probabilities (direct bits)
2) binary symbols with predicted probabilities
if (Code == Range)
Corrupted = true;
Normalize();
res <<= 1;
res += t + 1;
}
while (--numBits);
return res;
}
#define kNumBitModelTotalBits 11
It's recommended to use 16-bit unsigned integer type, to store these 11-bit
probability values:
Each probability value must be initialized with value ((1 << 11) / 2),
that represents the state, where probabilities of symbols 0 and 1
are equal to 0.5:
#define INIT_PROBS(p) \
{ for (unsigned i = 0; i < sizeof(p) / sizeof(p[0]); i++) p[i] = PROB_INIT_VAL; }
#define kNumMoveBits 5
LZMA uses a tree of Bit model variables to decode symbol that needs
several bits for storing. There are two versions of such trees in LZMA:
1) the tree that decodes bits from high bit to low bit (the normal scheme).
2) the tree that decodes bits from low bit to high bit (the reverse scheme).
public:
void Init()
{
INIT_PROBS(Probs);
}
LZ part of LZMA
---------------
LZ part of LZMA describes details about the decoding of MATCHES and LITERALS.
The LZMA Decoder uses (1 << (lc + lp)) tables with CProb values, where
each table contains 0x300 CProb values:
CProb *LitProbs;
void CreateLiterals()
{
LitProbs = new CProb[(UInt32)0x300 << (lc + lp)];
}
void InitLiterals()
{
UInt32 num = (UInt32)0x300 << (lc + lp);
for (UInt32 i = 0; i < num; i++)
LitProbs[i] = PROB_INIT_VAL;
}
To select the table for decoding it uses the context that consists of
(lc) high bits from previous literal and (lp) low bits from value that
represents current position in outputStream.
If (State > 7), the Literal Decoder also uses "matchByte" that represents
the byte in OutputStream at position the is the DISTANCE bytes before
current position, where the DISTANCE is the distance in DISTANCE-LENGTH pair
of latest decoded match.
The following code decodes one literal and puts it to Sliding Window buffer:
unsigned symbol = 1;
unsigned litState = ((OutWindow.TotalPos & ((1 << lp) - 1)) << lc) + (prevByte
>> (8 - lc));
CProb *probs = &LitProbs[(UInt32)0x300 * litState];
if (state >= 7)
{
unsigned matchByte = OutWindow.GetByte(rep0 + 1);
do
{
unsigned matchBit = (matchByte >> 7) & 1;
matchByte <<= 1;
unsigned bit = RangeDec.DecodeBit(&probs[((1 + matchBit) << 8) + symbol]);
symbol = (symbol << 1) | bit;
if (matchBit != bit)
break;
}
while (symbol < 0x100);
}
while (symbol < 0x100)
symbol = (symbol << 1) | RangeDec.DecodeBit(&probs[symbol]);
OutWindow.PutByte((Byte)(symbol - 0x100));
}
#define kMatchMinLen 2
The match length decoder can return the values from 0 to 271.
And the corresponded real match length values can be in the range
from 2 to 273.
The following scheme is used for the match length encoding:
LZMA uses bit model variable "Choice" to decode the first selection bit.
If the first selection bit is equal to 0, the decoder uses binary tree
LowCoder[posState] to decode 3-bit zero-based match length (xxx).
If the first selection bit is equal to 1, the decoder uses bit model
variable "Choice2" to decode the second selection bit.
If the second selection bit is equal to 0, the decoder uses binary tree
MidCoder[posState] to decode 3-bit "yyy" value, and zero-based match
length is equal to (yyy + 8).
If the second selection bit is equal to 1, the decoder uses binary tree
HighCoder to decode 8-bit "zzzzzzzz" value, and zero-based
match length is equal to (zzzzzzzz + 16).
class CLenDecoder
{
CProb Choice;
CProb Choice2;
CBitTreeDecoder<3> LowCoder[1 << kNumPosBitsMax];
CBitTreeDecoder<3> MidCoder[1 << kNumPosBitsMax];
CBitTreeDecoder<8> HighCoder;
public:
void Init()
{
Choice = PROB_INIT_VAL;
Choice2 = PROB_INIT_VAL;
HighCoder.Init();
for (unsigned i = 0; i < (1 << kNumPosBitsMax); i++)
{
LowCoder[i].Init();
MidCoder[i].Init();
}
}
CLenDecoder LenDecoder;
CLenDecoder RepLenDecoder;
#define kNumLenToPosStates 4
The distance decoder returns the "dist" value that is zero-based value
of match distance. The real match distance can be calculated with the
following code:
matchDistance = dist + 1;
#define kEndPosModelIndex 14
#define kNumFullDistances (1 << (kEndPosModelIndex >> 1))
#define kNumAlignBits 4
CBitTreeDecoder<6> PosSlotDecoder[kNumLenToPosStates];
CProb PosDecoders[1 + kNumFullDistances - kEndPosModelIndex];
CBitTreeDecoder<kNumAlignBits> AlignDecoder;
void InitDist()
{
for (unsigned i = 0; i < kNumLenToPosStates; i++)
PosSlotDecoder[i].Init();
AlignDecoder.Init();
INIT_PROBS(PosDecoders);
}
At first stage the distance decoder decodes 6-bit "posSlot" value with bit
tree decoder from PosSlotDecoder array. It's possible to get 2^6=64 different
"posSlot" values.
unsigned posSlot = PosSlotDecoder[lenState].Decode(&RangeDec);
The encoding scheme for distance value is shown in the following table:
posSlot (decimal) /
zero-based distance (binary)
0 0
1 1
2 10
3 11
4 10 x
5 11 x
6 10 xx
7 11 xx
8 10 xxx
9 11 xxx
10 10 xxxx
11 11 xxxx
12 10 xxxxx
13 11 xxxxx
14 10 yy zzzz
15 11 yy zzzz
16 10 yyy zzzz
17 11 yyy zzzz
...
62 10 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz
63 11 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz
where
"x ... x" means the sequence of binary symbols encoded with binary tree and
"Reverse" scheme. It uses separated binary tree for each posSlot from 4 to
13.
"y" means direct bit encoded with range coder.
"zzzz" means the sequence of four binary symbols encoded with binary
tree with "Reverse" scheme, where one common binary tree "AlignDecoder"
is used for all posSlot values.
If (posSlot >= 4), the decoder uses "posSlot" value to calculate the value of
the high bits of "dist" value and the number of the low bits.
If (4 <= posSlot < kEndPosModelIndex), the decoder uses bit tree decoders.
(one separated bit tree decoder per one posSlot value) and "Reverse" scheme.
In this implementation we use one CProb array "PosDecoders" that contains
all CProb variables for all these bit decoders.
1) The unpack size is undefined. The LZMA decoder stops decoding after
getting "End of stream" marker.
The input variables for that case:
markerIsMandatory = true
unpackSizeDefined = false
unpackSize contains any value
2) The unpack size is defined and LZMA decoder supports both variants,
where the stream can contain "End of stream" marker or the stream is
finished without "End of stream" marker. The LZMA decoder must detect
any of these situations.
The input variables for that case:
markerIsMandatory = false
unpackSizeDefined = true
unpackSize contains unpack size
3) The unpack size is defined and the LZMA stream must contain
"End of stream" marker
The input variables for that case:
markerIsMandatory = true
unpackSizeDefined = true
unpackSize contains unpack size
The main loop of decoder
------------------------
The following code contains the "end of stream" condition check at the start
of the loop:
1) "Simple Match" - the match with distance value encoded with bit models.
2) "Rep Match" - the match that uses the distance from distance
history table.
3) "Short Rep Match" - the match of single byte length, that uses the latest
distance from distance history table.
The LZMA decoder keeps the history of latest 4 match distances that were used
by decoder. That set of 4 variables contains zero-based match distances and
these variables are initialized with zero values:
The LZMA decoder uses binary model variables to select type of MATCH or LITERAL:
#define kNumStates 12
#define kNumPosBitsMax 4
unsigned state = 0;
The "state" variable is updated after each LITERAL or MATCH with one of the
following functions:
The decoder calculates "state2" variable value to select exact variable from
"IsMatch" and "IsRep0Long" arrays:
The decoder uses the following code flow scheme to select exact
type of LITERAL or MATCH:
IsMatch[state2] decode
0 - the Literal
1 - the Match
IsRep[state] decode
0 - Simple Match
1 - Rep Match
IsRepG0[state] decode
0 - the distance is rep0
IsRep0Long[state2] decode
0 - Short Rep Match
1 - Rep Match 0
1 -
IsRepG1[state] decode
0 - Rep Match 1
1 -
IsRepG2[state] decode
0 - Rep Match 2
1 - Rep Match 3
LITERAL symbol
--------------
If the value "0" was decoded with IsMatch[state2] decoding, we have "LITERAL" type.
DecodeLiteral(state, rep0);
Then the decoder must update the "state" value and "unpackSize" value;
state = UpdateState_Literal(state);
unpackSize--;
Then the decoder must go to the begin of main loop to decode next Match or Literal.
Simple Match
------------
rep3 = rep2;
rep2 = rep1;
rep1 = rep0;
state = UpdateState_Match(state);
rep0 = DecodeDistance(len);
if (rep0 == 0xFFFFFFFF)
return RangeDec.IsFinishedOK() ?
LZMA_RES_FINISHED_WITH_MARKER :
LZMA_RES_ERROR;
Also the decoder must check that "rep0" value is not larger than dictionary size
and is not larger than the number of already decoded bytes:
Rep Match
---------
If the LZMA decoder has decoded the value "1" with IsRep[state] variable,
we have "Rep Match" type.
if (OutWindow.IsEmpty())
return LZMA_RES_ERROR;
If the match type is "Rep Match", the decoder uses one of the 4 variables of
distance history table to get the value of distance for current match.
And there are 4 corresponding ways of decoding flow.
The decoder updates the distance history with the following scheme
depending from type of match:
Then the decoder decodes exact subtype of "Rep Match" using "IsRepG0",
"IsRep0Long",
"IsRepG1", "IsRepG2".
If the subtype is "Short Rep Match", the decoder updates the state, puts
the one byte from window to current position in window and goes to next
MATCH/LITERAL symbol (the begin of main loop):
state = UpdateState_ShortRep(state);
OutWindow.PutByte(OutWindow.GetByte(rep0 + 1));
unpackSize--;
continue;
state = UpdateState_Rep(state);
If we have the match (Simple Match or Rep Match 0/1/2/3), the decoder must
copy the sequence of bytes with calculated match distance and match length.
If uncompressed size is defined, LZMA decoder must check that it doesn't
exceed that specified uncompressed size:
len += kMatchMinLen;
bool isError = false;
if (unpackSizeDefined && unpackSize < len)
{
len = (unsigned)unpackSize;
isError = true;
}
OutWindow.CopyMatch(rep0 + 1, len);
unpackSize -= len;
if (isError)
return LZMA_RES_ERROR;
Then the decoder must go to the begin of main loop to decode next MATCH or LITERAL.
NOTES
-----
References: