You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
1175 lines
36 KiB
1175 lines
36 KiB
LZMA specification (DRAFT version) |
|
---------------------------------- |
|
|
|
Author: Igor Pavlov |
|
Date: 2013-07-28 |
|
|
|
This specification defines the format of LZMA compressed data and lzma file format. |
|
|
|
Notation |
|
-------- |
|
|
|
We use the syntax of C++ programming language. |
|
We use the following types in C++ code: |
|
unsigned - unsigned integer, at least 16 bits in size |
|
int - signed integer, at least 16 bits in size |
|
UInt64 - 64-bit unsigned integer |
|
UInt32 - 32-bit unsigned integer |
|
UInt16 - 16-bit unsigned integer |
|
Byte - 8-bit unsigned integer |
|
bool - boolean type with two possible values: false, true |
|
|
|
|
|
lzma file format |
|
================ |
|
|
|
The lzma file contains the raw LZMA stream and the header with related properties. |
|
|
|
The files in that format use ".lzma" extension. |
|
|
|
The lzma file format layout: |
|
|
|
Offset Size Description |
|
|
|
0 1 LZMA model properties (lc, lp, pb) in encoded form |
|
1 4 Dictionary size (32-bit unsigned integer, little-endian) |
|
5 8 Uncompressed size (64-bit unsigned integer, little-endian) |
|
13 Compressed data (LZMA stream) |
|
|
|
LZMA properties: |
|
|
|
name Range Description |
|
|
|
lc [0, 8] the number of "literal context" bits |
|
lp [0, 4] the number of "literal pos" bits |
|
pb [0, 4] the number of "pos" bits |
|
dictSize [0, 2^32 - 1] the dictionary size |
|
|
|
The following code encodes LZMA properties: |
|
|
|
void EncodeProperties(Byte *properties) |
|
{ |
|
properties[0] = (Byte)((pb * 5 + lp) * 9 + lc); |
|
Set_UInt32_LittleEndian(properties + 1, dictSize); |
|
} |
|
|
|
If the value of dictionary size in properties is smaller than (1 << 12), |
|
the LZMA decoder must set the dictionary size variable to (1 << 12). |
|
|
|
#define LZMA_DIC_MIN (1 << 12) |
|
|
|
unsigned lc, pb, lp; |
|
UInt32 dictSize; |
|
UInt32 dictSizeInProperties; |
|
|
|
void DecodeProperties(const Byte *properties) |
|
{ |
|
unsigned d = properties[0]; |
|
if (d >= (9 * 5 * 5)) |
|
throw "Incorrect LZMA properties"; |
|
lc = d % 9; |
|
d /= 9; |
|
pb = d / 5; |
|
lp = d % 5; |
|
dictSizeInProperties = 0; |
|
for (int i = 0; i < 4; i++) |
|
dictSizeInProperties |= (UInt32)properties[i + 1] << (8 * i); |
|
dictSize = dictSizeInProperties; |
|
if (dictSize < LZMA_DIC_MIN) |
|
dictSize = LZMA_DIC_MIN; |
|
} |
|
|
|
If "Uncompressed size" field contains ones in all 64 bits, it means that |
|
uncompressed size is unknown and there is the "end marker" in stream, |
|
that indicates the end of decoding point. |
|
In opposite case, if the value from "Uncompressed size" field is not |
|
equal to ((2^64) - 1), the LZMA stream decoding must be finished after |
|
specified number of bytes (Uncompressed size) is decoded. And if there |
|
is the "end marker", the LZMA decoder must read that marker also. |
|
|
|
|
|
The new scheme to encode LZMA properties |
|
---------------------------------------- |
|
|
|
If LZMA compression is used for some another format, it's recommended to |
|
use a new improved scheme to encode LZMA properties. That new scheme was |
|
used in xz format that uses the LZMA2 compression algorithm. |
|
The LZMA2 is a new compression algorithm that is based on the LZMA algorithm. |
|
|
|
The dictionary size in LZMA2 is encoded with just one byte and LZMA2 supports |
|
only reduced set of dictionary sizes: |
|
(2 << 11), (3 << 11), |
|
(2 << 12), (3 << 12), |
|
... |
|
(2 << 30), (3 << 30), |
|
(2 << 31) - 1 |
|
|
|
The dictionary size can be extracted from encoded value with the following code: |
|
|
|
dictSize = (p == 40) ? 0xFFFFFFFF : (((UInt32)2 | ((p) & 1)) << ((p) / 2 + 11)); |
|
|
|
Also there is additional limitation (lc + lp <= 4) in LZMA2 for values of |
|
"lc" and "lp" properties: |
|
|
|
if (lc + lp > 4) |
|
throw "Unsupported properties: (lc + lp) > 4"; |
|
|
|
There are some advantages for LZMA decoder with such (lc + lp) value |
|
limitation. It reduces the maximum size of tables allocated by decoder. |
|
And it reduces the complexity of initialization procedure, that can be |
|
important to keep high speed of decoding of big number of small LZMA streams. |
|
|
|
It's recommended to use that limitation (lc + lp <= 4) for any new format |
|
that uses LZMA compression. Note that the combinations of "lc" and "lp" |
|
parameters, where (lc + lp > 4), can provide significant improvement in |
|
compression ratio only in some rare cases. |
|
|
|
The LZMA properties can be encoded into two bytes in new scheme: |
|
|
|
Offset Size Description |
|
|
|
0 1 The dictionary size encoded with LZMA2 scheme |
|
1 1 LZMA model properties (lc, lp, pb) in encoded form |
|
|
|
|
|
The RAM usage |
|
============= |
|
|
|
The RAM usage for LZMA decoder is determined by the following parts: |
|
|
|
1) The Sliding Window (from 4 KiB to 4 GiB). |
|
2) The probability model counter arrays (arrays of 16-bit variables). |
|
3) Some additional state variables (about 10 variables of 32-bit integers). |
|
|
|
|
|
The RAM usage for Sliding Window |
|
-------------------------------- |
|
|
|
There are two main scenarios of decoding: |
|
|
|
1) The decoding of full stream to one RAM buffer. |
|
|
|
If we decode full LZMA stream to one output buffer in RAM, the decoder |
|
can use that output buffer as sliding window. So the decoder doesn't |
|
need additional buffer allocated for sliding window. |
|
|
|
2) The decoding to some external storage. |
|
|
|
If we decode LZMA stream to external storage, the decoder must allocate |
|
the buffer for sliding window. The size of that buffer must be equal |
|
or larger than the value of dictionary size from properties of LZMA stream. |
|
|
|
In this specification we describe the code for decoding to some external |
|
storage. The optimized version of code for decoding of full stream to one |
|
output RAM buffer can require some minor changes in code. |
|
|
|
|
|
The RAM usage for the probability model counters |
|
------------------------------------------------ |
|
|
|
The size of the probability model counter arrays is calculated with the |
|
following formula: |
|
|
|
size_of_prob_arrays = 1846 + 768 * (1 << (lp + lc)) |
|
|
|
Each probability model counter is 11-bit unsigned integer. |
|
If we use 16-bit integer variables (2-byte integers) for these probability |
|
model counters, the RAM usage required by probability model counter arrays |
|
can be estimated with the following formula: |
|
|
|
RAM = 4 KiB + 1.5 KiB * (1 << (lp + lc)) |
|
|
|
For example, for default LZMA parameters (lp = 0 and lc = 3), the RAM usage is |
|
|
|
RAM_lc3_lp0 = 4 KiB + 1.5 KiB * 8 = 16 KiB |
|
|
|
The maximum RAM state usage is required for decoding the stream with lp = 4 |
|
and lc = 8: |
|
|
|
RAM_lc8_lp4 = 4 KiB + 1.5 KiB * 4096 = 6148 KiB |
|
|
|
If the decoder uses LZMA2's limited property condition |
|
(lc + lp <= 4), the RAM usage will be not larger than |
|
|
|
RAM_lc_lp_4 = 4 KiB + 1.5 KiB * 16 = 28 KiB |
|
|
|
|
|
The RAM usage for encoder |
|
------------------------- |
|
|
|
There are many variants for LZMA encoding code. |
|
These variants have different values for memory consumption. |
|
Note that memory consumption for LZMA Encoder can not be |
|
smaller than memory consumption of LZMA Decoder for same stream. |
|
|
|
The RAM usage required by modern effective implementation of |
|
LZMA Encoder can be estimated with the following formula: |
|
|
|
Encoder_RAM_Usage = 4 MiB + 11 * dictionarySize. |
|
|
|
But there are some modes of the encoder that require less memory. |
|
|
|
|
|
LZMA Decoding |
|
============= |
|
|
|
The LZMA compression algorithm uses LZ-based compression with Sliding Window |
|
and Range Encoding as entropy coding method. |
|
|
|
|
|
Sliding Window |
|
-------------- |
|
|
|
LZMA uses Sliding Window compression similar to LZ77 algorithm. |
|
|
|
LZMA stream must be decoded to the sequence that consists |
|
of MATCHES and LITERALS: |
|
|
|
- a LITERAL is a 8-bit character (one byte). |
|
The decoder just puts that LITERAL to the uncompressed stream. |
|
|
|
- a MATCH is a pair of two numbers (DISTANCE-LENGTH pair). |
|
The decoder takes one byte exactly "DISTANCE" characters behind |
|
current position in the uncompressed stream and puts it to |
|
uncompressed stream. The decoder must repeat it "LENGTH" times. |
|
|
|
The "DISTANCE" can not be larger than dictionary size. |
|
And the "DISTANCE" can not be larger than the number of bytes in |
|
the uncompressed stream that were decoded before that match. |
|
|
|
In this specification we use cyclic buffer to implement Sliding Window |
|
for LZMA decoder: |
|
|
|
class COutWindow |
|
{ |
|
Byte *Buf; |
|
UInt32 Pos; |
|
UInt32 Size; |
|
bool IsFull; |
|
|
|
public: |
|
unsigned TotalPos; |
|
COutStream OutStream; |
|
|
|
COutWindow(): Buf(NULL) {} |
|
~COutWindow() { delete []Buf; } |
|
|
|
void Create(UInt32 dictSize) |
|
{ |
|
Buf = new Byte[dictSize]; |
|
Pos = 0; |
|
Size = dictSize; |
|
IsFull = false; |
|
TotalPos = 0; |
|
} |
|
|
|
void PutByte(Byte b) |
|
{ |
|
TotalPos++; |
|
Buf[Pos++] = b; |
|
if (Pos == Size) |
|
{ |
|
Pos = 0; |
|
IsFull = true; |
|
} |
|
OutStream.WriteByte(b); |
|
} |
|
|
|
Byte GetByte(UInt32 dist) const |
|
{ |
|
return Buf[dist <= Pos ? Pos - dist : Size - dist + Pos]; |
|
} |
|
|
|
void CopyMatch(UInt32 dist, unsigned len) |
|
{ |
|
for (; len > 0; len--) |
|
PutByte(GetByte(dist)); |
|
} |
|
|
|
bool CheckDistance(UInt32 dist) const |
|
{ |
|
return dist <= Pos || IsFull; |
|
} |
|
|
|
bool IsEmpty() const |
|
{ |
|
return Pos == 0 && !IsFull; |
|
} |
|
}; |
|
|
|
|
|
In another implementation it's possible to use one buffer that contains |
|
Sliding Window and the whole data stream after uncompressing. |
|
|
|
|
|
Range Decoder |
|
------------- |
|
|
|
LZMA algorithm uses Range Encoding (1) as entropy coding method. |
|
|
|
LZMA stream contains just one very big number in big-endian encoding. |
|
LZMA decoder uses the Range Decoder to extract a sequence of binary |
|
symbols from that big number. |
|
|
|
The state of the Range Decoder: |
|
|
|
struct CRangeDecoder |
|
{ |
|
UInt32 Range; |
|
UInt32 Code; |
|
InputStream *InStream; |
|
|
|
bool Corrupted; |
|
} |
|
|
|
The notes about UInt32 type for the "Range" and "Code" variables: |
|
|
|
It's possible to use 64-bit (unsigned or signed) integer type |
|
for the "Range" and the "Code" variables instead of 32-bit unsigned, |
|
but some additional code must be used to truncate the values to |
|
low 32-bits after some operations. |
|
|
|
If the programming language does not support 32-bit unsigned integer type |
|
(like in case of JAVA language), it's possible to use 32-bit signed integer, |
|
but some code must be changed. For example, it's required to change the code |
|
that uses comparison operations for UInt32 variables in this specification. |
|
|
|
The Range Decoder can be in some states that can be treated as |
|
"Corruption" in LZMA stream. The Range Decoder uses the variable "Corrupted": |
|
|
|
(Corrupted == false), if the Range Decoder has not detected any corruption. |
|
(Corrupted == true), if the Range Decoder has detected some corruption. |
|
|
|
The reference LZMA Decoder ignores the value of the "Corrupted" variable. |
|
So it continues to decode the stream, even if the corruption can be detected |
|
in the Range Decoder. To provide the full compatibility with output of the |
|
reference LZMA Decoder, another LZMA Decoder implementations must also |
|
ignore the value of the "Corrupted" variable. |
|
|
|
The LZMA Encoder is required to create only such LZMA streams, that will not |
|
lead the Range Decoder to states, where the "Corrupted" variable is set to true. |
|
|
|
The Range Decoder reads first 5 bytes from input stream to initialize |
|
the state: |
|
|
|
void CRangeDecoder::Init() |
|
{ |
|
Corrupted = false; |
|
|
|
if (InStream->ReadByte() != 0) |
|
Corrupted = true; |
|
|
|
Range = 0xFFFFFFFF; |
|
Code = 0; |
|
for (int i = 0; i < 4; i++) |
|
Code = (Code << 8) | InStream->ReadByte(); |
|
|
|
if (Code == Range) |
|
Corrupted = true; |
|
} |
|
|
|
The LZMA Encoder always writes ZERO in initial byte of compressed stream. |
|
That scheme allows to simplify the code of the Range Encoder in the |
|
LZMA Encoder. |
|
|
|
After the last bit of data was decoded by Range Decoder, the value of the |
|
"Code" variable must be equal to 0. The LZMA Decoder must check it by |
|
calling the IsFinishedOK() function: |
|
|
|
bool IsFinishedOK() const { return Code == 0; } |
|
|
|
If there is corruption in data stream, there is big probability that |
|
the "Code" value will be not equal to 0 in the Finish() function. So that |
|
check in the IsFinishedOK() function provides very good feature for |
|
corruption detection. |
|
|
|
The value of the "Range" variable before each bit decoding can not be smaller |
|
than ((UInt32)1 << 24). The Normalize() function keeps the "Range" value in |
|
described range. |
|
|
|
#define kTopValue ((UInt32)1 << 24) |
|
|
|
void CRangeDecoder::Normalize() |
|
{ |
|
if (Range < kTopValue) |
|
{ |
|
Range <<= 8; |
|
Code = (Code << 8) | InStream->ReadByte(); |
|
} |
|
} |
|
|
|
Notes: if the size of the "Code" variable is larger than 32 bits, it's |
|
required to keep only low 32 bits of the "Code" variable after the change |
|
in Normalize() function. |
|
|
|
If the LZMA Stream is not corrupted, the value of the "Code" variable is |
|
always smaller than value of the "Range" variable. |
|
But the Range Decoder ignores some types of corruptions, so the value of |
|
the "Code" variable can be equal or larger than value of the "Range" variable |
|
for some "Corrupted" archives. |
|
|
|
|
|
LZMA uses Range Encoding only with binary symbols of two types: |
|
1) binary symbols with fixed and equal probabilities (direct bits) |
|
2) binary symbols with predicted probabilities |
|
|
|
The DecodeDirectBits() function decodes the sequence of direct bits: |
|
|
|
UInt32 CRangeDecoder::DecodeDirectBits(unsigned numBits) |
|
{ |
|
UInt32 res = 0; |
|
do |
|
{ |
|
Range >>= 1; |
|
Code -= Range; |
|
UInt32 t = 0 - ((UInt32)Code >> 31); |
|
Code += Range & t; |
|
|
|
if (Code == Range) |
|
Corrupted = true; |
|
|
|
Normalize(); |
|
res <<= 1; |
|
res += t + 1; |
|
} |
|
while (--numBits); |
|
return res; |
|
} |
|
|
|
|
|
The Bit Decoding with Probability Model |
|
--------------------------------------- |
|
|
|
The task of Bit Probability Model is to estimate probabilities of binary |
|
symbols. And then it provides the Range Decoder with that information. |
|
The better prediction provides better compression ratio. |
|
The Bit Probability Model uses statistical data of previous decoded |
|
symbols. |
|
|
|
That estimated probability is presented as 11-bit unsigned integer value |
|
that represents the probability of symbol "0". |
|
|
|
#define kNumBitModelTotalBits 11 |
|
|
|
Mathematical probabilities can be presented with the following formulas: |
|
probability(symbol_0) = prob / 2048. |
|
probability(symbol_1) = 1 - Probability(symbol_0) = |
|
= 1 - prob / 2048 = |
|
= (2048 - prob) / 2048 |
|
where the "prob" variable contains 11-bit integer probability counter. |
|
|
|
It's recommended to use 16-bit unsigned integer type, to store these 11-bit |
|
probability values: |
|
|
|
typedef UInt16 CProb; |
|
|
|
Each probability value must be initialized with value ((1 << 11) / 2), |
|
that represents the state, where probabilities of symbols 0 and 1 |
|
are equal to 0.5: |
|
|
|
#define PROB_INIT_VAL ((1 << kNumBitModelTotalBits) / 2) |
|
|
|
The INIT_PROBS macro is used to initialize the array of CProb variables: |
|
|
|
#define INIT_PROBS(p) \ |
|
{ for (unsigned i = 0; i < sizeof(p) / sizeof(p[0]); i++) p[i] = PROB_INIT_VAL; } |
|
|
|
|
|
The DecodeBit() function decodes one bit. |
|
The LZMA decoder provides the pointer to CProb variable that contains |
|
information about estimated probability for symbol 0 and the Range Decoder |
|
updates that CProb variable after decoding. The Range Decoder increases |
|
estimated probability of the symbol that was decoded: |
|
|
|
#define kNumMoveBits 5 |
|
|
|
unsigned CRangeDecoder::DecodeBit(CProb *prob) |
|
{ |
|
unsigned v = *prob; |
|
UInt32 bound = (Range >> kNumBitModelTotalBits) * v; |
|
unsigned symbol; |
|
if (Code < bound) |
|
{ |
|
v += ((1 << kNumBitModelTotalBits) - v) >> kNumMoveBits; |
|
Range = bound; |
|
symbol = 0; |
|
} |
|
else |
|
{ |
|
v -= v >> kNumMoveBits; |
|
Code -= bound; |
|
Range -= bound; |
|
symbol = 1; |
|
} |
|
*prob = (CProb)v; |
|
Normalize(); |
|
return symbol; |
|
} |
|
|
|
|
|
The Binary Tree of bit model counters |
|
------------------------------------- |
|
|
|
LZMA uses a tree of Bit model variables to decode symbol that needs |
|
several bits for storing. There are two versions of such trees in LZMA: |
|
1) the tree that decodes bits from high bit to low bit (the normal scheme). |
|
2) the tree that decodes bits from low bit to high bit (the reverse scheme). |
|
|
|
Each binary tree structure supports different size of decoded symbol |
|
(the size of binary sequence that contains value of symbol). |
|
If that size of decoded symbol is "NumBits" bits, the tree structure |
|
uses the array of (2 << NumBits) counters of CProb type. |
|
But only ((2 << NumBits) - 1) items are used by encoder and decoder. |
|
The first item (the item with index equal to 0) in array is unused. |
|
That scheme with unused array's item allows to simplify the code. |
|
|
|
unsigned BitTreeReverseDecode(CProb *probs, unsigned numBits, CRangeDecoder *rc) |
|
{ |
|
unsigned m = 1; |
|
unsigned symbol = 0; |
|
for (unsigned i = 0; i < numBits; i++) |
|
{ |
|
unsigned bit = rc->DecodeBit(&probs[m]); |
|
m <<= 1; |
|
m += bit; |
|
symbol |= (bit << i); |
|
} |
|
return symbol; |
|
} |
|
|
|
template <unsigned NumBits> |
|
class CBitTreeDecoder |
|
{ |
|
CProb Probs[(unsigned)1 << NumBits]; |
|
|
|
public: |
|
|
|
void Init() |
|
{ |
|
INIT_PROBS(Probs); |
|
} |
|
|
|
unsigned Decode(CRangeDecoder *rc) |
|
{ |
|
unsigned m = 1; |
|
for (unsigned i = 0; i < NumBits; i++) |
|
m = (m << 1) + rc->DecodeBit(&Probs[m]); |
|
return m - ((unsigned)1 << NumBits); |
|
} |
|
|
|
unsigned ReverseDecode(CRangeDecoder *rc) |
|
{ |
|
return BitTreeReverseDecode(Probs, NumBits, rc); |
|
} |
|
}; |
|
|
|
|
|
LZ part of LZMA |
|
--------------- |
|
|
|
LZ part of LZMA describes details about the decoding of MATCHES and LITERALS. |
|
|
|
|
|
The Literal Decoding |
|
-------------------- |
|
|
|
The LZMA Decoder uses (1 << (lc + lp)) tables with CProb values, where |
|
each table contains 0x300 CProb values: |
|
|
|
CProb *LitProbs; |
|
|
|
void CreateLiterals() |
|
{ |
|
LitProbs = new CProb[(UInt32)0x300 << (lc + lp)]; |
|
} |
|
|
|
void InitLiterals() |
|
{ |
|
UInt32 num = (UInt32)0x300 << (lc + lp); |
|
for (UInt32 i = 0; i < num; i++) |
|
LitProbs[i] = PROB_INIT_VAL; |
|
} |
|
|
|
To select the table for decoding it uses the context that consists of |
|
(lc) high bits from previous literal and (lp) low bits from value that |
|
represents current position in outputStream. |
|
|
|
If (State > 7), the Literal Decoder also uses "matchByte" that represents |
|
the byte in OutputStream at position the is the DISTANCE bytes before |
|
current position, where the DISTANCE is the distance in DISTANCE-LENGTH pair |
|
of latest decoded match. |
|
|
|
The following code decodes one literal and puts it to Sliding Window buffer: |
|
|
|
void DecodeLiteral(unsigned state, UInt32 rep0) |
|
{ |
|
unsigned prevByte = 0; |
|
if (!OutWindow.IsEmpty()) |
|
prevByte = OutWindow.GetByte(1); |
|
|
|
unsigned symbol = 1; |
|
unsigned litState = ((OutWindow.TotalPos & ((1 << lp) - 1)) << lc) + (prevByte >> (8 - lc)); |
|
CProb *probs = &LitProbs[(UInt32)0x300 * litState]; |
|
|
|
if (state >= 7) |
|
{ |
|
unsigned matchByte = OutWindow.GetByte(rep0 + 1); |
|
do |
|
{ |
|
unsigned matchBit = (matchByte >> 7) & 1; |
|
matchByte <<= 1; |
|
unsigned bit = RangeDec.DecodeBit(&probs[((1 + matchBit) << 8) + symbol]); |
|
symbol = (symbol << 1) | bit; |
|
if (matchBit != bit) |
|
break; |
|
} |
|
while (symbol < 0x100); |
|
} |
|
while (symbol < 0x100) |
|
symbol = (symbol << 1) | RangeDec.DecodeBit(&probs[symbol]); |
|
OutWindow.PutByte((Byte)(symbol - 0x100)); |
|
} |
|
|
|
|
|
The match length decoding |
|
------------------------- |
|
|
|
The match length decoder returns normalized (zero-based value) |
|
length of match. That value can be converted to real length of the match |
|
with the following code: |
|
|
|
#define kMatchMinLen 2 |
|
|
|
matchLen = len + kMatchMinLen; |
|
|
|
The match length decoder can return the values from 0 to 271. |
|
And the corresponded real match length values can be in the range |
|
from 2 to 273. |
|
|
|
The following scheme is used for the match length encoding: |
|
|
|
Binary encoding Binary Tree structure Zero-based match length |
|
sequence (binary + decimal): |
|
|
|
0 xxx LowCoder[posState] xxx |
|
1 0 yyy MidCoder[posState] yyy + 8 |
|
1 1 zzzzzzzz HighCoder zzzzzzzz + 16 |
|
|
|
LZMA uses bit model variable "Choice" to decode the first selection bit. |
|
|
|
If the first selection bit is equal to 0, the decoder uses binary tree |
|
LowCoder[posState] to decode 3-bit zero-based match length (xxx). |
|
|
|
If the first selection bit is equal to 1, the decoder uses bit model |
|
variable "Choice2" to decode the second selection bit. |
|
|
|
If the second selection bit is equal to 0, the decoder uses binary tree |
|
MidCoder[posState] to decode 3-bit "yyy" value, and zero-based match |
|
length is equal to (yyy + 8). |
|
|
|
If the second selection bit is equal to 1, the decoder uses binary tree |
|
HighCoder to decode 8-bit "zzzzzzzz" value, and zero-based |
|
match length is equal to (zzzzzzzz + 16). |
|
|
|
LZMA uses "posState" value as context to select the binary tree |
|
from LowCoder and MidCoder binary tree arrays: |
|
|
|
unsigned posState = OutWindow.TotalPos & ((1 << pb) - 1); |
|
|
|
The full code of the length decoder: |
|
|
|
class CLenDecoder |
|
{ |
|
CProb Choice; |
|
CProb Choice2; |
|
CBitTreeDecoder<3> LowCoder[1 << kNumPosBitsMax]; |
|
CBitTreeDecoder<3> MidCoder[1 << kNumPosBitsMax]; |
|
CBitTreeDecoder<8> HighCoder; |
|
|
|
public: |
|
|
|
void Init() |
|
{ |
|
Choice = PROB_INIT_VAL; |
|
Choice2 = PROB_INIT_VAL; |
|
HighCoder.Init(); |
|
for (unsigned i = 0; i < (1 << kNumPosBitsMax); i++) |
|
{ |
|
LowCoder[i].Init(); |
|
MidCoder[i].Init(); |
|
} |
|
} |
|
|
|
unsigned Decode(CRangeDecoder *rc, unsigned posState) |
|
{ |
|
if (rc->DecodeBit(&Choice) == 0) |
|
return LowCoder[posState].Decode(rc); |
|
if (rc->DecodeBit(&Choice2) == 0) |
|
return 8 + MidCoder[posState].Decode(rc); |
|
return 16 + HighCoder.Decode(rc); |
|
} |
|
}; |
|
|
|
The LZMA decoder uses two instances of CLenDecoder class. |
|
The first instance is for the matches of "Simple Match" type, |
|
and the second instance is for the matches of "Rep Match" type: |
|
|
|
CLenDecoder LenDecoder; |
|
CLenDecoder RepLenDecoder; |
|
|
|
|
|
The match distance decoding |
|
--------------------------- |
|
|
|
LZMA supports dictionary sizes up to 4 GiB minus 1. |
|
The value of match distance (decoded by distance decoder) can be |
|
from 1 to 2^32. But the distance value that is equal to 2^32 is used to |
|
indicate the "End of stream" marker. So real largest match distance |
|
that is used for LZ-window match is (2^32 - 1). |
|
|
|
LZMA uses normalized match length (zero-based length) |
|
to calculate the context state "lenState" do decode the distance value: |
|
|
|
#define kNumLenToPosStates 4 |
|
|
|
unsigned lenState = len; |
|
if (lenState > kNumLenToPosStates - 1) |
|
lenState = kNumLenToPosStates - 1; |
|
|
|
The distance decoder returns the "dist" value that is zero-based value |
|
of match distance. The real match distance can be calculated with the |
|
following code: |
|
|
|
matchDistance = dist + 1; |
|
|
|
The state of the distance decoder and the initialization code: |
|
|
|
#define kEndPosModelIndex 14 |
|
#define kNumFullDistances (1 << (kEndPosModelIndex >> 1)) |
|
#define kNumAlignBits 4 |
|
|
|
CBitTreeDecoder<6> PosSlotDecoder[kNumLenToPosStates]; |
|
CProb PosDecoders[1 + kNumFullDistances - kEndPosModelIndex]; |
|
CBitTreeDecoder<kNumAlignBits> AlignDecoder; |
|
|
|
void InitDist() |
|
{ |
|
for (unsigned i = 0; i < kNumLenToPosStates; i++) |
|
PosSlotDecoder[i].Init(); |
|
AlignDecoder.Init(); |
|
INIT_PROBS(PosDecoders); |
|
} |
|
|
|
At first stage the distance decoder decodes 6-bit "posSlot" value with bit |
|
tree decoder from PosSlotDecoder array. It's possible to get 2^6=64 different |
|
"posSlot" values. |
|
|
|
unsigned posSlot = PosSlotDecoder[lenState].Decode(&RangeDec); |
|
|
|
The encoding scheme for distance value is shown in the following table: |
|
|
|
posSlot (decimal) / |
|
zero-based distance (binary) |
|
0 0 |
|
1 1 |
|
2 10 |
|
3 11 |
|
|
|
4 10 x |
|
5 11 x |
|
6 10 xx |
|
7 11 xx |
|
8 10 xxx |
|
9 11 xxx |
|
10 10 xxxx |
|
11 11 xxxx |
|
12 10 xxxxx |
|
13 11 xxxxx |
|
|
|
14 10 yy zzzz |
|
15 11 yy zzzz |
|
16 10 yyy zzzz |
|
17 11 yyy zzzz |
|
... |
|
62 10 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz |
|
63 11 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz |
|
|
|
where |
|
"x ... x" means the sequence of binary symbols encoded with binary tree and |
|
"Reverse" scheme. It uses separated binary tree for each posSlot from 4 to 13. |
|
"y" means direct bit encoded with range coder. |
|
"zzzz" means the sequence of four binary symbols encoded with binary |
|
tree with "Reverse" scheme, where one common binary tree "AlignDecoder" |
|
is used for all posSlot values. |
|
|
|
If (posSlot < 4), the "dist" value is equal to posSlot value. |
|
|
|
If (posSlot >= 4), the decoder uses "posSlot" value to calculate the value of |
|
the high bits of "dist" value and the number of the low bits. |
|
|
|
If (4 <= posSlot < kEndPosModelIndex), the decoder uses bit tree decoders. |
|
(one separated bit tree decoder per one posSlot value) and "Reverse" scheme. |
|
In this implementation we use one CProb array "PosDecoders" that contains |
|
all CProb variables for all these bit decoders. |
|
|
|
if (posSlot >= kEndPosModelIndex), the middle bits are decoded as direct |
|
bits from RangeDecoder and the low 4 bits are decoded with a bit tree |
|
decoder "AlignDecoder" with "Reverse" scheme. |
|
|
|
The code to decode zero-based match distance: |
|
|
|
unsigned DecodeDistance(unsigned len) |
|
{ |
|
unsigned lenState = len; |
|
if (lenState > kNumLenToPosStates - 1) |
|
lenState = kNumLenToPosStates - 1; |
|
|
|
unsigned posSlot = PosSlotDecoder[lenState].Decode(&RangeDec); |
|
if (posSlot < 4) |
|
return posSlot; |
|
|
|
unsigned numDirectBits = (unsigned)((posSlot >> 1) - 1); |
|
UInt32 dist = ((2 | (posSlot & 1)) << numDirectBits); |
|
if (posSlot < kEndPosModelIndex) |
|
dist += BitTreeReverseDecode(PosDecoders + dist - posSlot, numDirectBits, &RangeDec); |
|
else |
|
{ |
|
dist += RangeDec.DecodeDirectBits(numDirectBits - kNumAlignBits) << kNumAlignBits; |
|
dist += AlignDecoder.ReverseDecode(&RangeDec); |
|
} |
|
return dist; |
|
} |
|
|
|
|
|
|
|
LZMA Decoding modes |
|
------------------- |
|
|
|
There are 2 types of LZMA streams: |
|
|
|
1) The stream with "End of stream" marker. |
|
2) The stream without "End of stream" marker. |
|
|
|
And the LZMA Decoder supports 3 modes of decoding: |
|
|
|
1) The unpack size is undefined. The LZMA decoder stops decoding after |
|
getting "End of stream" marker. |
|
The input variables for that case: |
|
|
|
markerIsMandatory = true |
|
unpackSizeDefined = false |
|
unpackSize contains any value |
|
|
|
2) The unpack size is defined and LZMA decoder supports both variants, |
|
where the stream can contain "End of stream" marker or the stream is |
|
finished without "End of stream" marker. The LZMA decoder must detect |
|
any of these situations. |
|
The input variables for that case: |
|
|
|
markerIsMandatory = false |
|
unpackSizeDefined = true |
|
unpackSize contains unpack size |
|
|
|
3) The unpack size is defined and the LZMA stream must contain |
|
"End of stream" marker |
|
The input variables for that case: |
|
|
|
markerIsMandatory = true |
|
unpackSizeDefined = true |
|
unpackSize contains unpack size |
|
|
|
|
|
The main loop of decoder |
|
------------------------ |
|
|
|
The main loop of LZMA decoder: |
|
|
|
Initialize the LZMA state. |
|
loop |
|
{ |
|
// begin of loop |
|
Check "end of stream" conditions. |
|
Decode Type of MATCH / LITERAL. |
|
If it's LITERAL, decode LITERAL value and put the LITERAL to Window. |
|
If it's MATCH, decode the length of match and the match distance. |
|
Check error conditions, check end of stream conditions and copy |
|
the sequence of match bytes from sliding window to current position |
|
in window. |
|
Go to begin of loop |
|
} |
|
|
|
The reference implementation of LZMA decoder uses "unpackSize" variable |
|
to keep the number of remaining bytes in output stream. So it reduces |
|
"unpackSize" value after each decoded LITERAL or MATCH. |
|
|
|
The following code contains the "end of stream" condition check at the start |
|
of the loop: |
|
|
|
if (unpackSizeDefined && unpackSize == 0 && !markerIsMandatory) |
|
if (RangeDec.IsFinishedOK()) |
|
return LZMA_RES_FINISHED_WITHOUT_MARKER; |
|
|
|
LZMA uses three types of matches: |
|
|
|
1) "Simple Match" - the match with distance value encoded with bit models. |
|
|
|
2) "Rep Match" - the match that uses the distance from distance |
|
history table. |
|
|
|
3) "Short Rep Match" - the match of single byte length, that uses the latest |
|
distance from distance history table. |
|
|
|
The LZMA decoder keeps the history of latest 4 match distances that were used |
|
by decoder. That set of 4 variables contains zero-based match distances and |
|
these variables are initialized with zero values: |
|
|
|
UInt32 rep0 = 0, rep1 = 0, rep2 = 0, rep3 = 0; |
|
|
|
The LZMA decoder uses binary model variables to select type of MATCH or LITERAL: |
|
|
|
#define kNumStates 12 |
|
#define kNumPosBitsMax 4 |
|
|
|
CProb IsMatch[kNumStates << kNumPosBitsMax]; |
|
CProb IsRep[kNumStates]; |
|
CProb IsRepG0[kNumStates]; |
|
CProb IsRepG1[kNumStates]; |
|
CProb IsRepG2[kNumStates]; |
|
CProb IsRep0Long[kNumStates << kNumPosBitsMax]; |
|
|
|
The decoder uses "state" variable value to select exact variable |
|
from "IsRep", "IsRepG0", "IsRepG1" and "IsRepG2" arrays. |
|
The "state" variable can get the value from 0 to 11. |
|
Initial value for "state" variable is zero: |
|
|
|
unsigned state = 0; |
|
|
|
The "state" variable is updated after each LITERAL or MATCH with one of the |
|
following functions: |
|
|
|
unsigned UpdateState_Literal(unsigned state) |
|
{ |
|
if (state < 4) return 0; |
|
else if (state < 10) return state - 3; |
|
else return state - 6; |
|
} |
|
unsigned UpdateState_Match (unsigned state) { return state < 7 ? 7 : 10; } |
|
unsigned UpdateState_Rep (unsigned state) { return state < 7 ? 8 : 11; } |
|
unsigned UpdateState_ShortRep(unsigned state) { return state < 7 ? 9 : 11; } |
|
|
|
The decoder calculates "state2" variable value to select exact variable from |
|
"IsMatch" and "IsRep0Long" arrays: |
|
|
|
unsigned posState = OutWindow.TotalPos & ((1 << pb) - 1); |
|
unsigned state2 = (state << kNumPosBitsMax) + posState; |
|
|
|
The decoder uses the following code flow scheme to select exact |
|
type of LITERAL or MATCH: |
|
|
|
IsMatch[state2] decode |
|
0 - the Literal |
|
1 - the Match |
|
IsRep[state] decode |
|
0 - Simple Match |
|
1 - Rep Match |
|
IsRepG0[state] decode |
|
0 - the distance is rep0 |
|
IsRep0Long[state2] decode |
|
0 - Short Rep Match |
|
1 - Rep Match 0 |
|
1 - |
|
IsRepG1[state] decode |
|
0 - Rep Match 1 |
|
1 - |
|
IsRepG2[state] decode |
|
0 - Rep Match 2 |
|
1 - Rep Match 3 |
|
|
|
|
|
LITERAL symbol |
|
-------------- |
|
If the value "0" was decoded with IsMatch[state2] decoding, we have "LITERAL" type. |
|
|
|
At first the LZMA decoder must check that it doesn't exceed |
|
specified uncompressed size: |
|
|
|
if (unpackSizeDefined && unpackSize == 0) |
|
return LZMA_RES_ERROR; |
|
|
|
Then it decodes literal value and puts it to sliding window: |
|
|
|
DecodeLiteral(state, rep0); |
|
|
|
Then the decoder must update the "state" value and "unpackSize" value; |
|
|
|
state = UpdateState_Literal(state); |
|
unpackSize--; |
|
|
|
Then the decoder must go to the begin of main loop to decode next Match or Literal. |
|
|
|
|
|
Simple Match |
|
------------ |
|
|
|
If the value "1" was decoded with IsMatch[state2] decoding, |
|
we have the "Simple Match" type. |
|
|
|
The distance history table is updated with the following scheme: |
|
|
|
rep3 = rep2; |
|
rep2 = rep1; |
|
rep1 = rep0; |
|
|
|
The zero-based length is decoded with "LenDecoder": |
|
|
|
len = LenDecoder.Decode(&RangeDec, posState); |
|
|
|
The state is update with UpdateState_Match function: |
|
|
|
state = UpdateState_Match(state); |
|
|
|
and the new "rep0" value is decoded with DecodeDistance: |
|
|
|
rep0 = DecodeDistance(len); |
|
|
|
That "rep0" will be used as zero-based distance for current match. |
|
|
|
If the value of "rep0" is equal to 0xFFFFFFFF, it means that we have |
|
"End of stream" marker, so we can stop decoding and check finishing |
|
condition in Range Decoder: |
|
|
|
if (rep0 == 0xFFFFFFFF) |
|
return RangeDec.IsFinishedOK() ? |
|
LZMA_RES_FINISHED_WITH_MARKER : |
|
LZMA_RES_ERROR; |
|
|
|
If uncompressed size is defined, LZMA decoder must check that it doesn't |
|
exceed that specified uncompressed size: |
|
|
|
if (unpackSizeDefined && unpackSize == 0) |
|
return LZMA_RES_ERROR; |
|
|
|
Also the decoder must check that "rep0" value is not larger than dictionary size |
|
and is not larger than the number of already decoded bytes: |
|
|
|
if (rep0 >= dictSize || !OutWindow.CheckDistance(rep0)) |
|
return LZMA_RES_ERROR; |
|
|
|
Then the decoder must copy match bytes as described in |
|
"The match symbols copying" section. |
|
|
|
|
|
Rep Match |
|
--------- |
|
|
|
If the LZMA decoder has decoded the value "1" with IsRep[state] variable, |
|
we have "Rep Match" type. |
|
|
|
At first the LZMA decoder must check that it doesn't exceed |
|
specified uncompressed size: |
|
|
|
if (unpackSizeDefined && unpackSize == 0) |
|
return LZMA_RES_ERROR; |
|
|
|
Also the decoder must return error, if the LZ window is empty: |
|
|
|
if (OutWindow.IsEmpty()) |
|
return LZMA_RES_ERROR; |
|
|
|
If the match type is "Rep Match", the decoder uses one of the 4 variables of |
|
distance history table to get the value of distance for current match. |
|
And there are 4 corresponding ways of decoding flow. |
|
|
|
The decoder updates the distance history with the following scheme |
|
depending from type of match: |
|
|
|
- "Rep Match 0" or "Short Rep Match": |
|
; LZMA doesn't update the distance history |
|
|
|
- "Rep Match 1": |
|
UInt32 dist = rep1; |
|
rep1 = rep0; |
|
rep0 = dist; |
|
|
|
- "Rep Match 2": |
|
UInt32 dist = rep2; |
|
rep2 = rep1; |
|
rep1 = rep0; |
|
rep0 = dist; |
|
|
|
- "Rep Match 3": |
|
UInt32 dist = rep3; |
|
rep3 = rep2; |
|
rep2 = rep1; |
|
rep1 = rep0; |
|
rep0 = dist; |
|
|
|
Then the decoder decodes exact subtype of "Rep Match" using "IsRepG0", "IsRep0Long", |
|
"IsRepG1", "IsRepG2". |
|
|
|
If the subtype is "Short Rep Match", the decoder updates the state, puts |
|
the one byte from window to current position in window and goes to next |
|
MATCH/LITERAL symbol (the begin of main loop): |
|
|
|
state = UpdateState_ShortRep(state); |
|
OutWindow.PutByte(OutWindow.GetByte(rep0 + 1)); |
|
unpackSize--; |
|
continue; |
|
|
|
In other cases (Rep Match 0/1/2/3), it decodes the zero-based |
|
length of match with "RepLenDecoder" decoder: |
|
|
|
len = RepLenDecoder.Decode(&RangeDec, posState); |
|
|
|
Then it updates the state: |
|
|
|
state = UpdateState_Rep(state); |
|
|
|
Then the decoder must copy match bytes as described in |
|
"The Match symbols copying" section. |
|
|
|
|
|
The match symbols copying |
|
------------------------- |
|
|
|
If we have the match (Simple Match or Rep Match 0/1/2/3), the decoder must |
|
copy the sequence of bytes with calculated match distance and match length. |
|
If uncompressed size is defined, LZMA decoder must check that it doesn't |
|
exceed that specified uncompressed size: |
|
|
|
len += kMatchMinLen; |
|
bool isError = false; |
|
if (unpackSizeDefined && unpackSize < len) |
|
{ |
|
len = (unsigned)unpackSize; |
|
isError = true; |
|
} |
|
OutWindow.CopyMatch(rep0 + 1, len); |
|
unpackSize -= len; |
|
if (isError) |
|
return LZMA_RES_ERROR; |
|
|
|
Then the decoder must go to the begin of main loop to decode next MATCH or LITERAL. |
|
|
|
|
|
|
|
NOTES |
|
----- |
|
|
|
This specification doesn't describe the variant of decoder implementation |
|
that supports partial decoding. Such partial decoding case can require some |
|
changes in "end of stream" condition checks code. Also such code |
|
can use additional status codes, returned by decoder. |
|
|
|
This specification uses C++ code with templates to simplify describing. |
|
The optimized version of LZMA decoder doesn't need templates. |
|
Such optimized version can use just two arrays of CProb variables: |
|
1) The dynamic array of CProb variables allocated for the Literal Decoder. |
|
2) The one common array that contains all other CProb variables. |
|
|
|
|
|
References: |
|
|
|
1. G. N. N. Martin, Range encoding: an algorithm for removing redundancy |
|
from a digitized message, Video & Data Recording Conference, |
|
Southampton, UK, July 24-27, 1979.
|
|
|