1*22ce4affSfengbojiangZstandard Compression Format
2*22ce4affSfengbojiang============================
3*22ce4affSfengbojiang
4*22ce4affSfengbojiang### Notices
5*22ce4affSfengbojiang
6*22ce4affSfengbojiangCopyright (c) 2016-2020 Yann Collet, Facebook, Inc.
7*22ce4affSfengbojiang
8*22ce4affSfengbojiangPermission is granted to copy and distribute this document
9*22ce4affSfengbojiangfor any purpose and without charge,
10*22ce4affSfengbojiangincluding translations into other languages
11*22ce4affSfengbojiangand incorporation into compilations,
12*22ce4affSfengbojiangprovided that the copyright notice and this notice are preserved,
13*22ce4affSfengbojiangand that any substantive changes or deletions from the original
14*22ce4affSfengbojiangare clearly marked.
15*22ce4affSfengbojiangDistribution of this document is unlimited.
16*22ce4affSfengbojiang
17*22ce4affSfengbojiang### Version
18*22ce4affSfengbojiang
19*22ce4affSfengbojiang0.3.7 (2020-12-09)
20*22ce4affSfengbojiang
21*22ce4affSfengbojiang
22*22ce4affSfengbojiangIntroduction
23*22ce4affSfengbojiang------------
24*22ce4affSfengbojiang
25*22ce4affSfengbojiangThe purpose of this document is to define a lossless compressed data format,
26*22ce4affSfengbojiangthat is independent of CPU type, operating system,
27*22ce4affSfengbojiangfile system and character set, suitable for
28*22ce4affSfengbojiangfile compression, pipe and streaming compression,
29*22ce4affSfengbojiangusing the [Zstandard algorithm](http://www.zstandard.org).
30*22ce4affSfengbojiangThe text of the specification assumes a basic background in programming
31*22ce4affSfengbojiangat the level of bits and other primitive data representations.
32*22ce4affSfengbojiang
33*22ce4affSfengbojiangThe data can be produced or consumed,
34*22ce4affSfengbojiangeven for an arbitrarily long sequentially presented input data stream,
35*22ce4affSfengbojiangusing only an a priori bounded amount of intermediate storage,
36*22ce4affSfengbojiangand hence can be used in data communications.
37*22ce4affSfengbojiangThe format uses the Zstandard compression method,
38*22ce4affSfengbojiangand optional [xxHash-64 checksum method](http://www.xxhash.org),
39*22ce4affSfengbojiangfor detection of data corruption.
40*22ce4affSfengbojiang
41*22ce4affSfengbojiangThe data format defined by this specification
42*22ce4affSfengbojiangdoes not attempt to allow random access to compressed data.
43*22ce4affSfengbojiang
44*22ce4affSfengbojiangUnless otherwise indicated below,
45*22ce4affSfengbojianga compliant compressor must produce data sets
46*22ce4affSfengbojiangthat conform to the specifications presented here.
47*22ce4affSfengbojiangIt doesn’t need to support all options though.
48*22ce4affSfengbojiang
49*22ce4affSfengbojiangA compliant decompressor must be able to decompress
50*22ce4affSfengbojiangat least one working set of parameters
51*22ce4affSfengbojiangthat conforms to the specifications presented here.
52*22ce4affSfengbojiangIt may also ignore informative fields, such as checksum.
53*22ce4affSfengbojiangWhenever it does not support a parameter defined in the compressed stream,
54*22ce4affSfengbojiangit must produce a non-ambiguous error code and associated error message
55*22ce4affSfengbojiangexplaining which parameter is unsupported.
56*22ce4affSfengbojiang
57*22ce4affSfengbojiangThis specification is intended for use by implementers of software
58*22ce4affSfengbojiangto compress data into Zstandard format and/or decompress data from Zstandard format.
59*22ce4affSfengbojiangThe Zstandard format is supported by an open source reference implementation,
60*22ce4affSfengbojiangwritten in portable C, and available at : https://github.com/facebook/zstd .
61*22ce4affSfengbojiang
62*22ce4affSfengbojiang
63*22ce4affSfengbojiang### Overall conventions
64*22ce4affSfengbojiangIn this document:
65*22ce4affSfengbojiang- square brackets i.e. `[` and `]` are used to indicate optional fields or parameters.
66*22ce4affSfengbojiang- the naming convention for identifiers is `Mixed_Case_With_Underscores`
67*22ce4affSfengbojiang
68*22ce4affSfengbojiang### Definitions
69*22ce4affSfengbojiangContent compressed by Zstandard is transformed into a Zstandard __frame__.
70*22ce4affSfengbojiangMultiple frames can be appended into a single file or stream.
71*22ce4affSfengbojiangA frame is completely independent, has a defined beginning and end,
72*22ce4affSfengbojiangand a set of parameters which tells the decoder how to decompress it.
73*22ce4affSfengbojiang
74*22ce4affSfengbojiangA frame encapsulates one or multiple __blocks__.
75*22ce4affSfengbojiangEach block contains arbitrary content, which is described by its header,
76*22ce4affSfengbojiangand has a guaranteed maximum content size, which depends on frame parameters.
77*22ce4affSfengbojiangUnlike frames, each block depends on previous blocks for proper decoding.
78*22ce4affSfengbojiangHowever, each block can be decompressed without waiting for its successor,
79*22ce4affSfengbojiangallowing streaming operations.
80*22ce4affSfengbojiang
81*22ce4affSfengbojiangOverview
82*22ce4affSfengbojiang---------
83*22ce4affSfengbojiang- [Frames](#frames)
84*22ce4affSfengbojiang  - [Zstandard frames](#zstandard-frames)
85*22ce4affSfengbojiang    - [Blocks](#blocks)
86*22ce4affSfengbojiang      - [Literals Section](#literals-section)
87*22ce4affSfengbojiang      - [Sequences Section](#sequences-section)
88*22ce4affSfengbojiang      - [Sequence Execution](#sequence-execution)
89*22ce4affSfengbojiang  - [Skippable frames](#skippable-frames)
90*22ce4affSfengbojiang- [Entropy Encoding](#entropy-encoding)
91*22ce4affSfengbojiang  - [FSE](#fse)
92*22ce4affSfengbojiang  - [Huffman Coding](#huffman-coding)
93*22ce4affSfengbojiang- [Dictionary Format](#dictionary-format)
94*22ce4affSfengbojiang
95*22ce4affSfengbojiangFrames
96*22ce4affSfengbojiang------
97*22ce4affSfengbojiangZstandard compressed data is made of one or more __frames__.
98*22ce4affSfengbojiangEach frame is independent and can be decompressed independently of other frames.
99*22ce4affSfengbojiangThe decompressed content of multiple concatenated frames is the concatenation of
100*22ce4affSfengbojiangeach frame decompressed content.
101*22ce4affSfengbojiang
102*22ce4affSfengbojiangThere are two frame formats defined by Zstandard:
103*22ce4affSfengbojiang  Zstandard frames and Skippable frames.
104*22ce4affSfengbojiangZstandard frames contain compressed data, while
105*22ce4affSfengbojiangskippable frames contain custom user metadata.
106*22ce4affSfengbojiang
107*22ce4affSfengbojiang## Zstandard frames
108*22ce4affSfengbojiangThe structure of a single Zstandard frame is following:
109*22ce4affSfengbojiang
110*22ce4affSfengbojiang| `Magic_Number` | `Frame_Header` |`Data_Block`| [More data blocks] | [`Content_Checksum`] |
111*22ce4affSfengbojiang|:--------------:|:--------------:|:----------:| ------------------ |:--------------------:|
112*22ce4affSfengbojiang|  4 bytes       |  2-14 bytes    |  n bytes   |                    |     0-4 bytes        |
113*22ce4affSfengbojiang
114*22ce4affSfengbojiang__`Magic_Number`__
115*22ce4affSfengbojiang
116*22ce4affSfengbojiang4 Bytes, __little-endian__ format.
117*22ce4affSfengbojiangValue : 0xFD2FB528
118*22ce4affSfengbojiangNote: This value was selected to be less probable to find at the beginning of some random file.
119*22ce4affSfengbojiangIt avoids trivial patterns (0x00, 0xFF, repeated bytes, increasing bytes, etc.),
120*22ce4affSfengbojiangcontains byte values outside of ASCII range,
121*22ce4affSfengbojiangand doesn't map into UTF8 space.
122*22ce4affSfengbojiangIt reduces the chances that a text file represent this value by accident.
123*22ce4affSfengbojiang
124*22ce4affSfengbojiang__`Frame_Header`__
125*22ce4affSfengbojiang
126*22ce4affSfengbojiang2 to 14 Bytes, detailed in [`Frame_Header`](#frame_header).
127*22ce4affSfengbojiang
128*22ce4affSfengbojiang__`Data_Block`__
129*22ce4affSfengbojiang
130*22ce4affSfengbojiangDetailed in [`Blocks`](#blocks).
131*22ce4affSfengbojiangThat’s where compressed data is stored.
132*22ce4affSfengbojiang
133*22ce4affSfengbojiang__`Content_Checksum`__
134*22ce4affSfengbojiang
135*22ce4affSfengbojiangAn optional 32-bit checksum, only present if `Content_Checksum_flag` is set.
136*22ce4affSfengbojiangThe content checksum is the result
137*22ce4affSfengbojiangof [xxh64() hash function](http://www.xxhash.org)
138*22ce4affSfengbojiangdigesting the original (decoded) data as input, and a seed of zero.
139*22ce4affSfengbojiangThe low 4 bytes of the checksum are stored in __little-endian__ format.
140*22ce4affSfengbojiang
141*22ce4affSfengbojiang### `Frame_Header`
142*22ce4affSfengbojiang
143*22ce4affSfengbojiangThe `Frame_Header` has a variable size, with a minimum of 2 bytes,
144*22ce4affSfengbojiangand up to 14 bytes depending on optional parameters.
145*22ce4affSfengbojiangThe structure of `Frame_Header` is following:
146*22ce4affSfengbojiang
147*22ce4affSfengbojiang| `Frame_Header_Descriptor` | [`Window_Descriptor`] | [`Dictionary_ID`] | [`Frame_Content_Size`] |
148*22ce4affSfengbojiang| ------------------------- | --------------------- | ----------------- | ---------------------- |
149*22ce4affSfengbojiang| 1 byte                    | 0-1 byte              | 0-4 bytes         | 0-8 bytes              |
150*22ce4affSfengbojiang
151*22ce4affSfengbojiang#### `Frame_Header_Descriptor`
152*22ce4affSfengbojiang
153*22ce4affSfengbojiangThe first header's byte is called the `Frame_Header_Descriptor`.
154*22ce4affSfengbojiangIt describes which other fields are present.
155*22ce4affSfengbojiangDecoding this byte is enough to tell the size of `Frame_Header`.
156*22ce4affSfengbojiang
157*22ce4affSfengbojiang| Bit number | Field name                |
158*22ce4affSfengbojiang| ---------- | ----------                |
159*22ce4affSfengbojiang| 7-6        | `Frame_Content_Size_flag` |
160*22ce4affSfengbojiang| 5          | `Single_Segment_flag`     |
161*22ce4affSfengbojiang| 4          | `Unused_bit`              |
162*22ce4affSfengbojiang| 3          | `Reserved_bit`            |
163*22ce4affSfengbojiang| 2          | `Content_Checksum_flag`   |
164*22ce4affSfengbojiang| 1-0        | `Dictionary_ID_flag`      |
165*22ce4affSfengbojiang
166*22ce4affSfengbojiangIn this table, bit 7 is the highest bit, while bit 0 is the lowest one.
167*22ce4affSfengbojiang
168*22ce4affSfengbojiang__`Frame_Content_Size_flag`__
169*22ce4affSfengbojiang
170*22ce4affSfengbojiangThis is a 2-bits flag (`= Frame_Header_Descriptor >> 6`),
171*22ce4affSfengbojiangspecifying if `Frame_Content_Size` (the decompressed data size)
172*22ce4affSfengbojiangis provided within the header.
173*22ce4affSfengbojiang`Flag_Value` provides `FCS_Field_Size`,
174*22ce4affSfengbojiangwhich is the number of bytes used by `Frame_Content_Size`
175*22ce4affSfengbojiangaccording to the following table:
176*22ce4affSfengbojiang
177*22ce4affSfengbojiang|  `Flag_Value`  |    0   |  1  |  2  |  3  |
178*22ce4affSfengbojiang| -------------- | ------ | --- | --- | --- |
179*22ce4affSfengbojiang|`FCS_Field_Size`| 0 or 1 |  2  |  4  |  8  |
180*22ce4affSfengbojiang
181*22ce4affSfengbojiangWhen `Flag_Value` is `0`, `FCS_Field_Size` depends on `Single_Segment_flag` :
182*22ce4affSfengbojiangif `Single_Segment_flag` is set, `FCS_Field_Size` is 1.
183*22ce4affSfengbojiangOtherwise, `FCS_Field_Size` is 0 : `Frame_Content_Size` is not provided.
184*22ce4affSfengbojiang
185*22ce4affSfengbojiang__`Single_Segment_flag`__
186*22ce4affSfengbojiang
187*22ce4affSfengbojiangIf this flag is set,
188*22ce4affSfengbojiangdata must be regenerated within a single continuous memory segment.
189*22ce4affSfengbojiang
190*22ce4affSfengbojiangIn this case, `Window_Descriptor` byte is skipped,
191*22ce4affSfengbojiangbut `Frame_Content_Size` is necessarily present.
192*22ce4affSfengbojiangAs a consequence, the decoder must allocate a memory segment
193*22ce4affSfengbojiangof size equal or larger than `Frame_Content_Size`.
194*22ce4affSfengbojiang
195*22ce4affSfengbojiangIn order to preserve the decoder from unreasonable memory requirements,
196*22ce4affSfengbojianga decoder is allowed to reject a compressed frame
197*22ce4affSfengbojiangwhich requests a memory size beyond decoder's authorized range.
198*22ce4affSfengbojiang
199*22ce4affSfengbojiangFor broader compatibility, decoders are recommended to support
200*22ce4affSfengbojiangmemory sizes of at least 8 MB.
201*22ce4affSfengbojiangThis is only a recommendation,
202*22ce4affSfengbojiangeach decoder is free to support higher or lower limits,
203*22ce4affSfengbojiangdepending on local limitations.
204*22ce4affSfengbojiang
205*22ce4affSfengbojiang__`Unused_bit`__
206*22ce4affSfengbojiang
207*22ce4affSfengbojiangA decoder compliant with this specification version shall not interpret this bit.
208*22ce4affSfengbojiangIt might be used in any future version,
209*22ce4affSfengbojiangto signal a property which is transparent to properly decode the frame.
210*22ce4affSfengbojiangAn encoder compliant with this specification version must set this bit to zero.
211*22ce4affSfengbojiang
212*22ce4affSfengbojiang__`Reserved_bit`__
213*22ce4affSfengbojiang
214*22ce4affSfengbojiangThis bit is reserved for some future feature.
215*22ce4affSfengbojiangIts value _must be zero_.
216*22ce4affSfengbojiangA decoder compliant with this specification version must ensure it is not set.
217*22ce4affSfengbojiangThis bit may be used in a future revision,
218*22ce4affSfengbojiangto signal a feature that must be interpreted to decode the frame correctly.
219*22ce4affSfengbojiang
220*22ce4affSfengbojiang__`Content_Checksum_flag`__
221*22ce4affSfengbojiang
222*22ce4affSfengbojiangIf this flag is set, a 32-bits `Content_Checksum` will be present at frame's end.
223*22ce4affSfengbojiangSee `Content_Checksum` paragraph.
224*22ce4affSfengbojiang
225*22ce4affSfengbojiang__`Dictionary_ID_flag`__
226*22ce4affSfengbojiang
227*22ce4affSfengbojiangThis is a 2-bits flag (`= FHD & 3`),
228*22ce4affSfengbojiangtelling if a dictionary ID is provided within the header.
229*22ce4affSfengbojiangIt also specifies the size of this field as `DID_Field_Size`.
230*22ce4affSfengbojiang
231*22ce4affSfengbojiang|`Flag_Value`    |  0  |  1  |  2  |  3  |
232*22ce4affSfengbojiang| -------------- | --- | --- | --- | --- |
233*22ce4affSfengbojiang|`DID_Field_Size`|  0  |  1  |  2  |  4  |
234*22ce4affSfengbojiang
235*22ce4affSfengbojiang#### `Window_Descriptor`
236*22ce4affSfengbojiang
237*22ce4affSfengbojiangProvides guarantees on minimum memory buffer required to decompress a frame.
238*22ce4affSfengbojiangThis information is important for decoders to allocate enough memory.
239*22ce4affSfengbojiang
240*22ce4affSfengbojiangThe `Window_Descriptor` byte is optional.
241*22ce4affSfengbojiangWhen `Single_Segment_flag` is set, `Window_Descriptor` is not present.
242*22ce4affSfengbojiangIn this case, `Window_Size` is `Frame_Content_Size`,
243*22ce4affSfengbojiangwhich can be any value from 0 to 2^64-1 bytes (16 ExaBytes).
244*22ce4affSfengbojiang
245*22ce4affSfengbojiang| Bit numbers |     7-3    |     2-0    |
246*22ce4affSfengbojiang| ----------- | ---------- | ---------- |
247*22ce4affSfengbojiang| Field name  | `Exponent` | `Mantissa` |
248*22ce4affSfengbojiang
249*22ce4affSfengbojiangThe minimum memory buffer size is called `Window_Size`.
250*22ce4affSfengbojiangIt is described by the following formulas :
251*22ce4affSfengbojiang```
252*22ce4affSfengbojiangwindowLog = 10 + Exponent;
253*22ce4affSfengbojiangwindowBase = 1 << windowLog;
254*22ce4affSfengbojiangwindowAdd = (windowBase / 8) * Mantissa;
255*22ce4affSfengbojiangWindow_Size = windowBase + windowAdd;
256*22ce4affSfengbojiang```
257*22ce4affSfengbojiangThe minimum `Window_Size` is 1 KB.
258*22ce4affSfengbojiangThe maximum `Window_Size` is `(1<<41) + 7*(1<<38)` bytes, which is 3.75 TB.
259*22ce4affSfengbojiang
260*22ce4affSfengbojiangIn general, larger `Window_Size` tend to improve compression ratio,
261*22ce4affSfengbojiangbut at the cost of memory usage.
262*22ce4affSfengbojiang
263*22ce4affSfengbojiangTo properly decode compressed data,
264*22ce4affSfengbojianga decoder will need to allocate a buffer of at least `Window_Size` bytes.
265*22ce4affSfengbojiang
266*22ce4affSfengbojiangIn order to preserve decoder from unreasonable memory requirements,
267*22ce4affSfengbojianga decoder is allowed to reject a compressed frame
268*22ce4affSfengbojiangwhich requests a memory size beyond decoder's authorized range.
269*22ce4affSfengbojiang
270*22ce4affSfengbojiangFor improved interoperability,
271*22ce4affSfengbojiangit's recommended for decoders to support `Window_Size` of up to 8 MB,
272*22ce4affSfengbojiangand it's recommended for encoders to not generate frame requiring `Window_Size` larger than 8 MB.
273*22ce4affSfengbojiangIt's merely a recommendation though,
274*22ce4affSfengbojiangdecoders are free to support larger or lower limits,
275*22ce4affSfengbojiangdepending on local limitations.
276*22ce4affSfengbojiang
277*22ce4affSfengbojiang#### `Dictionary_ID`
278*22ce4affSfengbojiang
279*22ce4affSfengbojiangThis is a variable size field, which contains
280*22ce4affSfengbojiangthe ID of the dictionary required to properly decode the frame.
281*22ce4affSfengbojiang`Dictionary_ID` field is optional. When it's not present,
282*22ce4affSfengbojiangit's up to the decoder to know which dictionary to use.
283*22ce4affSfengbojiang
284*22ce4affSfengbojiang`Dictionary_ID` field size is provided by `DID_Field_Size`.
285*22ce4affSfengbojiang`DID_Field_Size` is directly derived from value of `Dictionary_ID_flag`.
286*22ce4affSfengbojiang1 byte can represent an ID 0-255.
287*22ce4affSfengbojiang2 bytes can represent an ID 0-65535.
288*22ce4affSfengbojiang4 bytes can represent an ID 0-4294967295.
289*22ce4affSfengbojiangFormat is __little-endian__.
290*22ce4affSfengbojiang
291*22ce4affSfengbojiangIt's allowed to represent a small ID (for example `13`)
292*22ce4affSfengbojiangwith a large 4-bytes dictionary ID, even if it is less efficient.
293*22ce4affSfengbojiang
294*22ce4affSfengbojiangA value of `0` has same meaning as no `Dictionary_ID`,
295*22ce4affSfengbojiangin which case the frame may or may not need a dictionary to be decoded,
296*22ce4affSfengbojiangand the ID of such a dictionary is not specified.
297*22ce4affSfengbojiangThe decoder must know this information by other means.
298*22ce4affSfengbojiang
299*22ce4affSfengbojiang#### `Frame_Content_Size`
300*22ce4affSfengbojiang
301*22ce4affSfengbojiangThis is the original (uncompressed) size. This information is optional.
302*22ce4affSfengbojiang`Frame_Content_Size` uses a variable number of bytes, provided by `FCS_Field_Size`.
303*22ce4affSfengbojiang`FCS_Field_Size` is provided by the value of `Frame_Content_Size_flag`.
304*22ce4affSfengbojiang`FCS_Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes.
305*22ce4affSfengbojiang
306*22ce4affSfengbojiang| `FCS_Field_Size` |    Range   |
307*22ce4affSfengbojiang| ---------------- | ---------- |
308*22ce4affSfengbojiang|        0         |   unknown  |
309*22ce4affSfengbojiang|        1         |   0 - 255  |
310*22ce4affSfengbojiang|        2         | 256 - 65791|
311*22ce4affSfengbojiang|        4         | 0 - 2^32-1 |
312*22ce4affSfengbojiang|        8         | 0 - 2^64-1 |
313*22ce4affSfengbojiang
314*22ce4affSfengbojiang`Frame_Content_Size` format is __little-endian__.
315*22ce4affSfengbojiangWhen `FCS_Field_Size` is 1, 4 or 8 bytes, the value is read directly.
316*22ce4affSfengbojiangWhen `FCS_Field_Size` is 2, _the offset of 256 is added_.
317*22ce4affSfengbojiangIt's allowed to represent a small size (for example `18`) using any compatible variant.
318*22ce4affSfengbojiang
319*22ce4affSfengbojiang
320*22ce4affSfengbojiangBlocks
321*22ce4affSfengbojiang-------
322*22ce4affSfengbojiang
323*22ce4affSfengbojiangAfter `Magic_Number` and `Frame_Header`, there are some number of blocks.
324*22ce4affSfengbojiangEach frame must have at least one block,
325*22ce4affSfengbojiangbut there is no upper limit on the number of blocks per frame.
326*22ce4affSfengbojiang
327*22ce4affSfengbojiangThe structure of a block is as follows:
328*22ce4affSfengbojiang
329*22ce4affSfengbojiang| `Block_Header` | `Block_Content` |
330*22ce4affSfengbojiang|:--------------:|:---------------:|
331*22ce4affSfengbojiang|    3 bytes     |     n bytes     |
332*22ce4affSfengbojiang
333*22ce4affSfengbojiang__`Block_Header`__
334*22ce4affSfengbojiang
335*22ce4affSfengbojiang`Block_Header` uses 3 bytes, written using __little-endian__ convention.
336*22ce4affSfengbojiangIt contains 3 fields :
337*22ce4affSfengbojiang
338*22ce4affSfengbojiang| `Last_Block` | `Block_Type` | `Block_Size` |
339*22ce4affSfengbojiang|:------------:|:------------:|:------------:|
340*22ce4affSfengbojiang|    bit 0     |  bits 1-2    |  bits 3-23   |
341*22ce4affSfengbojiang
342*22ce4affSfengbojiang__`Last_Block`__
343*22ce4affSfengbojiang
344*22ce4affSfengbojiangThe lowest bit signals if this block is the last one.
345*22ce4affSfengbojiangThe frame will end after this last block.
346*22ce4affSfengbojiangIt may be followed by an optional `Content_Checksum`
347*22ce4affSfengbojiang(see [Zstandard Frames](#zstandard-frames)).
348*22ce4affSfengbojiang
349*22ce4affSfengbojiang__`Block_Type`__
350*22ce4affSfengbojiang
351*22ce4affSfengbojiangThe next 2 bits represent the `Block_Type`.
352*22ce4affSfengbojiang`Block_Type` influences the meaning of `Block_Size`.
353*22ce4affSfengbojiangThere are 4 block types :
354*22ce4affSfengbojiang
355*22ce4affSfengbojiang|    Value     |      0      |      1      |         2          |     3     |
356*22ce4affSfengbojiang| ------------ | ----------- | ----------- | ------------------ | --------- |
357*22ce4affSfengbojiang| `Block_Type` | `Raw_Block` | `RLE_Block` | `Compressed_Block` | `Reserved`|
358*22ce4affSfengbojiang
359*22ce4affSfengbojiang- `Raw_Block` - this is an uncompressed block.
360*22ce4affSfengbojiang  `Block_Content` contains `Block_Size` bytes.
361*22ce4affSfengbojiang
362*22ce4affSfengbojiang- `RLE_Block` - this is a single byte, repeated `Block_Size` times.
363*22ce4affSfengbojiang  `Block_Content` consists of a single byte.
364*22ce4affSfengbojiang  On the decompression side, this byte must be repeated `Block_Size` times.
365*22ce4affSfengbojiang
366*22ce4affSfengbojiang- `Compressed_Block` - this is a [Zstandard compressed block](#compressed-blocks),
367*22ce4affSfengbojiang  explained later on.
368*22ce4affSfengbojiang  `Block_Size` is the length of `Block_Content`, the compressed data.
369*22ce4affSfengbojiang  The decompressed size is not known,
370*22ce4affSfengbojiang  but its maximum possible value is guaranteed (see below)
371*22ce4affSfengbojiang
372*22ce4affSfengbojiang- `Reserved` - this is not a block.
373*22ce4affSfengbojiang  This value cannot be used with current version of this specification.
374*22ce4affSfengbojiang  If such a value is present, it is considered corrupted data.
375*22ce4affSfengbojiang
376*22ce4affSfengbojiang__`Block_Size`__
377*22ce4affSfengbojiang
378*22ce4affSfengbojiangThe upper 21 bits of `Block_Header` represent the `Block_Size`.
379*22ce4affSfengbojiang
380*22ce4affSfengbojiangWhen `Block_Type` is `Compressed_Block` or `Raw_Block`,
381*22ce4affSfengbojiang`Block_Size` is the size of `Block_Content` (hence excluding `Block_Header`).
382*22ce4affSfengbojiang
383*22ce4affSfengbojiangWhen `Block_Type` is `RLE_Block`, since `Block_Content`’s size is always 1,
384*22ce4affSfengbojiang`Block_Size` represents the number of times this byte must be repeated.
385*22ce4affSfengbojiang
386*22ce4affSfengbojiang`Block_Size` is limited by `Block_Maximum_Size` (see below).
387*22ce4affSfengbojiang
388*22ce4affSfengbojiang__`Block_Content`__ and __`Block_Maximum_Size`__
389*22ce4affSfengbojiang
390*22ce4affSfengbojiangThe size of `Block_Content` is limited by `Block_Maximum_Size`,
391*22ce4affSfengbojiangwhich is the smallest of:
392*22ce4affSfengbojiang-  `Window_Size`
393*22ce4affSfengbojiang-  128 KB
394*22ce4affSfengbojiang
395*22ce4affSfengbojiang`Block_Maximum_Size` is constant for a given frame.
396*22ce4affSfengbojiangThis maximum is applicable to both the decompressed size
397*22ce4affSfengbojiangand the compressed size of any block in the frame.
398*22ce4affSfengbojiang
399*22ce4affSfengbojiangThe reasoning for this limit is that a decoder can read this information
400*22ce4affSfengbojiangat the beginning of a frame and use it to allocate buffers.
401*22ce4affSfengbojiangThe guarantees on the size of blocks ensure that
402*22ce4affSfengbojiangthe buffers will be large enough for any following block of the valid frame.
403*22ce4affSfengbojiang
404*22ce4affSfengbojiang
405*22ce4affSfengbojiangCompressed Blocks
406*22ce4affSfengbojiang-----------------
407*22ce4affSfengbojiangTo decompress a compressed block, the compressed size must be provided
408*22ce4affSfengbojiangfrom `Block_Size` field within `Block_Header`.
409*22ce4affSfengbojiang
410*22ce4affSfengbojiangA compressed block consists of 2 sections :
411*22ce4affSfengbojiang- [Literals Section](#literals-section)
412*22ce4affSfengbojiang- [Sequences Section](#sequences-section)
413*22ce4affSfengbojiang
414*22ce4affSfengbojiangThe results of the two sections are then combined to produce the decompressed
415*22ce4affSfengbojiangdata in [Sequence Execution](#sequence-execution)
416*22ce4affSfengbojiang
417*22ce4affSfengbojiang#### Prerequisites
418*22ce4affSfengbojiangTo decode a compressed block, the following elements are necessary :
419*22ce4affSfengbojiang- Previous decoded data, up to a distance of `Window_Size`,
420*22ce4affSfengbojiang  or beginning of the Frame, whichever is smaller.
421*22ce4affSfengbojiang- List of "recent offsets" from previous `Compressed_Block`.
422*22ce4affSfengbojiang- The previous Huffman tree, required by `Treeless_Literals_Block` type
423*22ce4affSfengbojiang- Previous FSE decoding tables, required by `Repeat_Mode`
424*22ce4affSfengbojiang  for each symbol type (literals lengths, match lengths, offsets)
425*22ce4affSfengbojiang
426*22ce4affSfengbojiangNote that decoding tables aren't always from the previous `Compressed_Block`.
427*22ce4affSfengbojiang
428*22ce4affSfengbojiang- Every decoding table can come from a dictionary.
429*22ce4affSfengbojiang- The Huffman tree comes from the previous `Compressed_Literals_Block`.
430*22ce4affSfengbojiang
431*22ce4affSfengbojiangLiterals Section
432*22ce4affSfengbojiang----------------
433*22ce4affSfengbojiangAll literals are regrouped in the first part of the block.
434*22ce4affSfengbojiangThey can be decoded first, and then copied during [Sequence Execution],
435*22ce4affSfengbojiangor they can be decoded on the flow during [Sequence Execution].
436*22ce4affSfengbojiang
437*22ce4affSfengbojiangLiterals can be stored uncompressed or compressed using Huffman prefix codes.
438*22ce4affSfengbojiangWhen compressed, an optional tree description can be present,
439*22ce4affSfengbojiangfollowed by 1 or 4 streams.
440*22ce4affSfengbojiang
441*22ce4affSfengbojiang| `Literals_Section_Header` | [`Huffman_Tree_Description`] | [jumpTable] | Stream1 | [Stream2] | [Stream3] | [Stream4] |
442*22ce4affSfengbojiang| ------------------------- | ---------------------------- | ----------- | ------- | --------- | --------- | --------- |
443*22ce4affSfengbojiang
444*22ce4affSfengbojiang
445*22ce4affSfengbojiang### `Literals_Section_Header`
446*22ce4affSfengbojiang
447*22ce4affSfengbojiangHeader is in charge of describing how literals are packed.
448*22ce4affSfengbojiangIt's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
449*22ce4affSfengbojiangusing __little-endian__ convention.
450*22ce4affSfengbojiang
451*22ce4affSfengbojiang| `Literals_Block_Type` | `Size_Format` | `Regenerated_Size` | [`Compressed_Size`] |
452*22ce4affSfengbojiang| --------------------- | ------------- | ------------------ | ------------------- |
453*22ce4affSfengbojiang|       2 bits          |  1 - 2 bits   |    5 - 20 bits     |     0 - 18 bits     |
454*22ce4affSfengbojiang
455*22ce4affSfengbojiangIn this representation, bits on the left are the lowest bits.
456*22ce4affSfengbojiang
457*22ce4affSfengbojiang__`Literals_Block_Type`__
458*22ce4affSfengbojiang
459*22ce4affSfengbojiangThis field uses 2 lowest bits of first byte, describing 4 different block types :
460*22ce4affSfengbojiang
461*22ce4affSfengbojiang| `Literals_Block_Type`       | Value |
462*22ce4affSfengbojiang| --------------------------- | ----- |
463*22ce4affSfengbojiang| `Raw_Literals_Block`        |   0   |
464*22ce4affSfengbojiang| `RLE_Literals_Block`        |   1   |
465*22ce4affSfengbojiang| `Compressed_Literals_Block` |   2   |
466*22ce4affSfengbojiang| `Treeless_Literals_Block`   |   3   |
467*22ce4affSfengbojiang
468*22ce4affSfengbojiang- `Raw_Literals_Block` - Literals are stored uncompressed.
469*22ce4affSfengbojiang- `RLE_Literals_Block` - Literals consist of a single byte value
470*22ce4affSfengbojiang        repeated `Regenerated_Size` times.
471*22ce4affSfengbojiang- `Compressed_Literals_Block` - This is a standard Huffman-compressed block,
472*22ce4affSfengbojiang        starting with a Huffman tree description.
473*22ce4affSfengbojiang        See details below.
474*22ce4affSfengbojiang- `Treeless_Literals_Block` - This is a Huffman-compressed block,
475*22ce4affSfengbojiang        using Huffman tree _from previous Huffman-compressed literals block_.
476*22ce4affSfengbojiang        `Huffman_Tree_Description` will be skipped.
477*22ce4affSfengbojiang        Note: If this mode is triggered without any previous Huffman-table in the frame
478*22ce4affSfengbojiang        (or [dictionary](#dictionary-format)), this should be treated as data corruption.
479*22ce4affSfengbojiang
480*22ce4affSfengbojiang__`Size_Format`__
481*22ce4affSfengbojiang
482*22ce4affSfengbojiang`Size_Format` is divided into 2 families :
483*22ce4affSfengbojiang
484*22ce4affSfengbojiang- For `Raw_Literals_Block` and `RLE_Literals_Block`,
485*22ce4affSfengbojiang  it's only necessary to decode `Regenerated_Size`.
486*22ce4affSfengbojiang  There is no `Compressed_Size` field.
487*22ce4affSfengbojiang- For `Compressed_Block` and `Treeless_Literals_Block`,
488*22ce4affSfengbojiang  it's required to decode both `Compressed_Size`
489*22ce4affSfengbojiang  and `Regenerated_Size` (the decompressed size).
490*22ce4affSfengbojiang  It's also necessary to decode the number of streams (1 or 4).
491*22ce4affSfengbojiang
492*22ce4affSfengbojiangFor values spanning several bytes, convention is __little-endian__.
493*22ce4affSfengbojiang
494*22ce4affSfengbojiang__`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :
495*22ce4affSfengbojiang
496*22ce4affSfengbojiang`Size_Format` uses 1 _or_ 2 bits.
497*22ce4affSfengbojiangIts value is : `Size_Format = (Literals_Section_Header[0]>>2) & 3`
498*22ce4affSfengbojiang
499*22ce4affSfengbojiang- `Size_Format` == 00 or 10 : `Size_Format` uses 1 bit.
500*22ce4affSfengbojiang               `Regenerated_Size` uses 5 bits (0-31).
501*22ce4affSfengbojiang               `Literals_Section_Header` uses 1 byte.
502*22ce4affSfengbojiang               `Regenerated_Size = Literals_Section_Header[0]>>3`
503*22ce4affSfengbojiang- `Size_Format` == 01 : `Size_Format` uses 2 bits.
504*22ce4affSfengbojiang               `Regenerated_Size` uses 12 bits (0-4095).
505*22ce4affSfengbojiang               `Literals_Section_Header` uses 2 bytes.
506*22ce4affSfengbojiang               `Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4)`
507*22ce4affSfengbojiang- `Size_Format` == 11 : `Size_Format` uses 2 bits.
508*22ce4affSfengbojiang               `Regenerated_Size` uses 20 bits (0-1048575).
509*22ce4affSfengbojiang               `Literals_Section_Header` uses 3 bytes.
510*22ce4affSfengbojiang               `Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4) + (Literals_Section_Header[2]<<12)`
511*22ce4affSfengbojiang
512*22ce4affSfengbojiangOnly Stream1 is present for these cases.
513*22ce4affSfengbojiangNote : it's allowed to represent a short value (for example `13`)
514*22ce4affSfengbojiangusing a long format, even if it's less efficient.
515*22ce4affSfengbojiang
516*22ce4affSfengbojiang__`Size_Format` for `Compressed_Literals_Block` and `Treeless_Literals_Block`__ :
517*22ce4affSfengbojiang
518*22ce4affSfengbojiang`Size_Format` always uses 2 bits.
519*22ce4affSfengbojiang
520*22ce4affSfengbojiang- `Size_Format` == 00 : _A single stream_.
521*22ce4affSfengbojiang               Both `Regenerated_Size` and `Compressed_Size` use 10 bits (0-1023).
522*22ce4affSfengbojiang               `Literals_Section_Header` uses 3 bytes.
523*22ce4affSfengbojiang- `Size_Format` == 01 : 4 streams.
524*22ce4affSfengbojiang               Both `Regenerated_Size` and `Compressed_Size` use 10 bits (0-1023).
525*22ce4affSfengbojiang               `Literals_Section_Header` uses 3 bytes.
526*22ce4affSfengbojiang- `Size_Format` == 10 : 4 streams.
527*22ce4affSfengbojiang               Both `Regenerated_Size` and `Compressed_Size` use 14 bits (0-16383).
528*22ce4affSfengbojiang               `Literals_Section_Header` uses 4 bytes.
529*22ce4affSfengbojiang- `Size_Format` == 11 : 4 streams.
530*22ce4affSfengbojiang               Both `Regenerated_Size` and `Compressed_Size` use 18 bits (0-262143).
531*22ce4affSfengbojiang               `Literals_Section_Header` uses 5 bytes.
532*22ce4affSfengbojiang
533*22ce4affSfengbojiangBoth `Compressed_Size` and `Regenerated_Size` fields follow __little-endian__ convention.
534*22ce4affSfengbojiangNote: `Compressed_Size` __includes__ the size of the Huffman Tree description
535*22ce4affSfengbojiang_when_ it is present.
536*22ce4affSfengbojiang
537*22ce4affSfengbojiang#### Raw Literals Block
538*22ce4affSfengbojiangThe data in Stream1 is `Regenerated_Size` bytes long,
539*22ce4affSfengbojiangit contains the raw literals data to be used during [Sequence Execution].
540*22ce4affSfengbojiang
541*22ce4affSfengbojiang#### RLE Literals Block
542*22ce4affSfengbojiangStream1 consists of a single byte which should be repeated `Regenerated_Size` times
543*22ce4affSfengbojiangto generate the decoded literals.
544*22ce4affSfengbojiang
545*22ce4affSfengbojiang#### Compressed Literals Block and Treeless Literals Block
546*22ce4affSfengbojiangBoth of these modes contain Huffman encoded data.
547*22ce4affSfengbojiang
548*22ce4affSfengbojiangFor `Treeless_Literals_Block`,
549*22ce4affSfengbojiangthe Huffman table comes from previously compressed literals block,
550*22ce4affSfengbojiangor from a dictionary.
551*22ce4affSfengbojiang
552*22ce4affSfengbojiang
553*22ce4affSfengbojiang### `Huffman_Tree_Description`
554*22ce4affSfengbojiangThis section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`).
555*22ce4affSfengbojiangThe format of the Huffman tree description can be found at [Huffman Tree description](#huffman-tree-description).
556*22ce4affSfengbojiangThe size of `Huffman_Tree_Description` is determined during decoding process,
557*22ce4affSfengbojiangit must be used to determine where streams begin.
558*22ce4affSfengbojiang`Total_Streams_Size = Compressed_Size - Huffman_Tree_Description_Size`.
559*22ce4affSfengbojiang
560*22ce4affSfengbojiang
561*22ce4affSfengbojiang### Jump Table
562*22ce4affSfengbojiangThe Jump Table is only present when there are 4 Huffman-coded streams.
563*22ce4affSfengbojiang
564*22ce4affSfengbojiangReminder : Huffman compressed data consists of either 1 or 4 Huffman-coded streams.
565*22ce4affSfengbojiang
566*22ce4affSfengbojiangIf only one stream is present, it is a single bitstream occupying the entire
567*22ce4affSfengbojiangremaining portion of the literals block, encoded as described within
568*22ce4affSfengbojiang[Huffman-Coded Streams](#huffman-coded-streams).
569*22ce4affSfengbojiang
570*22ce4affSfengbojiangIf there are four streams, `Literals_Section_Header` only provided
571*22ce4affSfengbojiangenough information to know the decompressed and compressed sizes
572*22ce4affSfengbojiangof all four streams _combined_.
573*22ce4affSfengbojiangThe decompressed size of _each_ stream is equal to `(Regenerated_Size+3)/4`,
574*22ce4affSfengbojiangexcept for the last stream which may be up to 3 bytes smaller,
575*22ce4affSfengbojiangto reach a total decompressed size as specified in `Regenerated_Size`.
576*22ce4affSfengbojiang
577*22ce4affSfengbojiangThe compressed size of each stream is provided explicitly in the Jump Table.
578*22ce4affSfengbojiangJump Table is 6 bytes long, and consist of three 2-byte __little-endian__ fields,
579*22ce4affSfengbojiangdescribing the compressed sizes of the first three streams.
580*22ce4affSfengbojiang`Stream4_Size` is computed from total `Total_Streams_Size` minus sizes of other streams.
581*22ce4affSfengbojiang
582*22ce4affSfengbojiang`Stream4_Size = Total_Streams_Size - 6 - Stream1_Size - Stream2_Size - Stream3_Size`.
583*22ce4affSfengbojiang
584*22ce4affSfengbojiangNote: if `Stream1_Size + Stream2_Size + Stream3_Size > Total_Streams_Size`,
585*22ce4affSfengbojiangdata is considered corrupted.
586*22ce4affSfengbojiang
587*22ce4affSfengbojiangEach of these 4 bitstreams is then decoded independently as a Huffman-Coded stream,
588*22ce4affSfengbojiangas described at [Huffman-Coded Streams](#huffman-coded-streams)
589*22ce4affSfengbojiang
590*22ce4affSfengbojiang
591*22ce4affSfengbojiangSequences Section
592*22ce4affSfengbojiang-----------------
593*22ce4affSfengbojiangA compressed block is a succession of _sequences_ .
594*22ce4affSfengbojiangA sequence is a literal copy command, followed by a match copy command.
595*22ce4affSfengbojiangA literal copy command specifies a length.
596*22ce4affSfengbojiangIt is the number of bytes to be copied (or extracted) from the Literals Section.
597*22ce4affSfengbojiangA match copy command specifies an offset and a length.
598*22ce4affSfengbojiang
599*22ce4affSfengbojiangWhen all _sequences_ are decoded,
600*22ce4affSfengbojiangif there are literals left in the _literals section_,
601*22ce4affSfengbojiangthese bytes are added at the end of the block.
602*22ce4affSfengbojiang
603*22ce4affSfengbojiangThis is described in more detail in [Sequence Execution](#sequence-execution).
604*22ce4affSfengbojiang
605*22ce4affSfengbojiangThe `Sequences_Section` regroup all symbols required to decode commands.
606*22ce4affSfengbojiangThere are 3 symbol types : literals lengths, offsets and match lengths.
607*22ce4affSfengbojiangThey are encoded together, interleaved, in a single _bitstream_.
608*22ce4affSfengbojiang
609*22ce4affSfengbojiangThe `Sequences_Section` starts by a header,
610*22ce4affSfengbojiangfollowed by optional probability tables for each symbol type,
611*22ce4affSfengbojiangfollowed by the bitstream.
612*22ce4affSfengbojiang
613*22ce4affSfengbojiang| `Sequences_Section_Header` | [`Literals_Length_Table`] | [`Offset_Table`] | [`Match_Length_Table`] | bitStream |
614*22ce4affSfengbojiang| -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- |
615*22ce4affSfengbojiang
616*22ce4affSfengbojiangTo decode the `Sequences_Section`, it's required to know its size.
617*22ce4affSfengbojiangIts size is deduced from the size of `Literals_Section`:
618*22ce4affSfengbojiang`Sequences_Section_Size = Block_Size - Literals_Section_Size`.
619*22ce4affSfengbojiang
620*22ce4affSfengbojiang
621*22ce4affSfengbojiang#### `Sequences_Section_Header`
622*22ce4affSfengbojiang
623*22ce4affSfengbojiangConsists of 2 items:
624*22ce4affSfengbojiang- `Number_of_Sequences`
625*22ce4affSfengbojiang- Symbol compression modes
626*22ce4affSfengbojiang
627*22ce4affSfengbojiang__`Number_of_Sequences`__
628*22ce4affSfengbojiang
629*22ce4affSfengbojiangThis is a variable size field using between 1 and 3 bytes.
630*22ce4affSfengbojiangLet's call its first byte `byte0`.
631*22ce4affSfengbojiang- `if (byte0 == 0)` : there are no sequences.
632*22ce4affSfengbojiang            The sequence section stops there.
633*22ce4affSfengbojiang            Decompressed content is defined entirely as Literals Section content.
634*22ce4affSfengbojiang            The FSE tables used in `Repeat_Mode` aren't updated.
635*22ce4affSfengbojiang- `if (byte0 < 128)` : `Number_of_Sequences = byte0` . Uses 1 byte.
636*22ce4affSfengbojiang- `if (byte0 < 255)` : `Number_of_Sequences = ((byte0-128) << 8) + byte1` . Uses 2 bytes.
637*22ce4affSfengbojiang- `if (byte0 == 255)`: `Number_of_Sequences = byte1 + (byte2<<8) + 0x7F00` . Uses 3 bytes.
638*22ce4affSfengbojiang
639*22ce4affSfengbojiang__Symbol compression modes__
640*22ce4affSfengbojiang
641*22ce4affSfengbojiangThis is a single byte, defining the compression mode of each symbol type.
642*22ce4affSfengbojiang
643*22ce4affSfengbojiang|Bit number|          7-6            |      5-4       |        3-2           |     1-0    |
644*22ce4affSfengbojiang| -------- | ----------------------- | -------------- | -------------------- | ---------- |
645*22ce4affSfengbojiang|Field name| `Literals_Lengths_Mode` | `Offsets_Mode` | `Match_Lengths_Mode` | `Reserved` |
646*22ce4affSfengbojiang
647*22ce4affSfengbojiangThe last field, `Reserved`, must be all-zeroes.
648*22ce4affSfengbojiang
649*22ce4affSfengbojiang`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the `Compression_Mode` of
650*22ce4affSfengbojiangliterals lengths, offsets, and match lengths symbols respectively.
651*22ce4affSfengbojiang
652*22ce4affSfengbojiangThey follow the same enumeration :
653*22ce4affSfengbojiang
654*22ce4affSfengbojiang|        Value       |         0         |      1     |           2           |       3       |
655*22ce4affSfengbojiang| ------------------ | ----------------- | ---------- | --------------------- | ------------- |
656*22ce4affSfengbojiang| `Compression_Mode` | `Predefined_Mode` | `RLE_Mode` | `FSE_Compressed_Mode` | `Repeat_Mode` |
657*22ce4affSfengbojiang
658*22ce4affSfengbojiang- `Predefined_Mode` : A predefined FSE distribution table is used, defined in
659*22ce4affSfengbojiang          [default distributions](#default-distributions).
660*22ce4affSfengbojiang          No distribution table will be present.
661*22ce4affSfengbojiang- `RLE_Mode` : The table description consists of a single byte, which contains the symbol's value.
662*22ce4affSfengbojiang          This symbol will be used for all sequences.
663*22ce4affSfengbojiang- `FSE_Compressed_Mode` : standard FSE compression.
664*22ce4affSfengbojiang          A distribution table will be present.
665*22ce4affSfengbojiang          The format of this distribution table is described in [FSE Table Description](#fse-table-description).
666*22ce4affSfengbojiang          Note that the maximum allowed accuracy log for literals length and match length tables is 9,
667*22ce4affSfengbojiang          and the maximum accuracy log for the offsets table is 8.
668*22ce4affSfengbojiang          `FSE_Compressed_Mode` must not be used when only one symbol is present,
669*22ce4affSfengbojiang          `RLE_Mode` should be used instead (although any other mode will work).
670*22ce4affSfengbojiang- `Repeat_Mode` : The table used in the previous `Compressed_Block` with `Number_of_Sequences > 0` will be used again,
671*22ce4affSfengbojiang          or if this is the first block, table in the dictionary will be used.
672*22ce4affSfengbojiang          Note that this includes `RLE_mode`, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated.
673*22ce4affSfengbojiang          It also includes `Predefined_Mode`, in which case `Repeat_Mode` will have same outcome as `Predefined_Mode`.
674*22ce4affSfengbojiang          No distribution table will be present.
675*22ce4affSfengbojiang          If this mode is used without any previous sequence table in the frame
676*22ce4affSfengbojiang          (nor [dictionary](#dictionary-format)) to repeat, this should be treated as corruption.
677*22ce4affSfengbojiang
678*22ce4affSfengbojiang#### The codes for literals lengths, match lengths, and offsets.
679*22ce4affSfengbojiang
680*22ce4affSfengbojiangEach symbol is a _code_ in its own context,
681*22ce4affSfengbojiangwhich specifies `Baseline` and `Number_of_Bits` to add.
682*22ce4affSfengbojiang_Codes_ are FSE compressed,
683*22ce4affSfengbojiangand interleaved with raw additional bits in the same bitstream.
684*22ce4affSfengbojiang
685*22ce4affSfengbojiang##### Literals length codes
686*22ce4affSfengbojiang
687*22ce4affSfengbojiangLiterals length codes are values ranging from `0` to `35` included.
688*22ce4affSfengbojiangThey define lengths from 0 to 131071 bytes.
689*22ce4affSfengbojiangThe literals length is equal to the decoded `Baseline` plus
690*22ce4affSfengbojiangthe result of reading `Number_of_Bits` bits from the bitstream,
691*22ce4affSfengbojiangas a __little-endian__ value.
692*22ce4affSfengbojiang
693*22ce4affSfengbojiang| `Literals_Length_Code` |         0-15           |
694*22ce4affSfengbojiang| ---------------------- | ---------------------- |
695*22ce4affSfengbojiang| length                 | `Literals_Length_Code` |
696*22ce4affSfengbojiang| `Number_of_Bits`       |          0             |
697*22ce4affSfengbojiang
698*22ce4affSfengbojiang| `Literals_Length_Code` |  16  |  17  |  18  |  19  |  20  |  21  |  22  |  23  |
699*22ce4affSfengbojiang| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
700*22ce4affSfengbojiang| `Baseline`             |  16  |  18  |  20  |  22  |  24  |  28  |  32  |  40  |
701*22ce4affSfengbojiang| `Number_of_Bits`       |   1  |   1  |   1  |   1  |   2  |   2  |   3  |   3  |
702*22ce4affSfengbojiang
703*22ce4affSfengbojiang| `Literals_Length_Code` |  24  |  25  |  26  |  27  |  28  |  29  |  30  |  31  |
704*22ce4affSfengbojiang| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
705*22ce4affSfengbojiang| `Baseline`             |  48  |  64  |  128 |  256 |  512 | 1024 | 2048 | 4096 |
706*22ce4affSfengbojiang| `Number_of_Bits`       |   4  |   6  |   7  |   8  |   9  |  10  |  11  |  12  |
707*22ce4affSfengbojiang
708*22ce4affSfengbojiang| `Literals_Length_Code` |  32  |  33  |  34  |  35  |
709*22ce4affSfengbojiang| ---------------------- | ---- | ---- | ---- | ---- |
710*22ce4affSfengbojiang| `Baseline`             | 8192 |16384 |32768 |65536 |
711*22ce4affSfengbojiang| `Number_of_Bits`       |  13  |  14  |  15  |  16  |
712*22ce4affSfengbojiang
713*22ce4affSfengbojiang
714*22ce4affSfengbojiang##### Match length codes
715*22ce4affSfengbojiang
716*22ce4affSfengbojiangMatch length codes are values ranging from `0` to `52` included.
717*22ce4affSfengbojiangThey define lengths from 3 to 131074 bytes.
718*22ce4affSfengbojiangThe match length is equal to the decoded `Baseline` plus
719*22ce4affSfengbojiangthe result of reading `Number_of_Bits` bits from the bitstream,
720*22ce4affSfengbojiangas a __little-endian__ value.
721*22ce4affSfengbojiang
722*22ce4affSfengbojiang| `Match_Length_Code` |         0-31            |
723*22ce4affSfengbojiang| ------------------- | ----------------------- |
724*22ce4affSfengbojiang| value               | `Match_Length_Code` + 3 |
725*22ce4affSfengbojiang| `Number_of_Bits`    |          0              |
726*22ce4affSfengbojiang
727*22ce4affSfengbojiang| `Match_Length_Code` |  32  |  33  |  34  |  35  |  36  |  37  |  38  |  39  |
728*22ce4affSfengbojiang| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
729*22ce4affSfengbojiang| `Baseline`          |  35  |  37  |  39  |  41  |  43  |  47  |  51  |  59  |
730*22ce4affSfengbojiang| `Number_of_Bits`    |   1  |   1  |   1  |   1  |   2  |   2  |   3  |   3  |
731*22ce4affSfengbojiang
732*22ce4affSfengbojiang| `Match_Length_Code` |  40  |  41  |  42  |  43  |  44  |  45  |  46  |  47  |
733*22ce4affSfengbojiang| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
734*22ce4affSfengbojiang| `Baseline`          |  67  |  83  |  99  |  131 |  259 |  515 | 1027 | 2051 |
735*22ce4affSfengbojiang| `Number_of_Bits`    |   4  |   4  |   5  |   7  |   8  |   9  |  10  |  11  |
736*22ce4affSfengbojiang
737*22ce4affSfengbojiang| `Match_Length_Code` |  48  |  49  |  50  |  51  |  52  |
738*22ce4affSfengbojiang| ------------------- | ---- | ---- | ---- | ---- | ---- |
739*22ce4affSfengbojiang| `Baseline`          | 4099 | 8195 |16387 |32771 |65539 |
740*22ce4affSfengbojiang| `Number_of_Bits`    |  12  |  13  |  14  |  15  |  16  |
741*22ce4affSfengbojiang
742*22ce4affSfengbojiang##### Offset codes
743*22ce4affSfengbojiang
744*22ce4affSfengbojiangOffset codes are values ranging from `0` to `N`.
745*22ce4affSfengbojiang
746*22ce4affSfengbojiangA decoder is free to limit its maximum `N` supported.
747*22ce4affSfengbojiangRecommendation is to support at least up to `22`.
748*22ce4affSfengbojiangFor information, at the time of this writing.
749*22ce4affSfengbojiangthe reference decoder supports a maximum `N` value of `31`.
750*22ce4affSfengbojiang
751*22ce4affSfengbojiangAn offset code is also the number of additional bits to read in __little-endian__ fashion,
752*22ce4affSfengbojiangand can be translated into an `Offset_Value` using the following formulas :
753*22ce4affSfengbojiang
754*22ce4affSfengbojiang```
755*22ce4affSfengbojiangOffset_Value = (1 << offsetCode) + readNBits(offsetCode);
756*22ce4affSfengbojiangif (Offset_Value > 3) offset = Offset_Value - 3;
757*22ce4affSfengbojiang```
758*22ce4affSfengbojiangIt means that maximum `Offset_Value` is `(2^(N+1))-1`
759*22ce4affSfengbojiangsupporting back-reference distances up to `(2^(N+1))-4`,
760*22ce4affSfengbojiangbut is limited by [maximum back-reference distance](#window_descriptor).
761*22ce4affSfengbojiang
762*22ce4affSfengbojiang`Offset_Value` from 1 to 3 are special : they define "repeat codes".
763*22ce4affSfengbojiangThis is described in more detail in [Repeat Offsets](#repeat-offsets).
764*22ce4affSfengbojiang
765*22ce4affSfengbojiang#### Decoding Sequences
766*22ce4affSfengbojiangFSE bitstreams are read in reverse direction than written. In zstd,
767*22ce4affSfengbojiangthe compressor writes bits forward into a block and the decompressor
768*22ce4affSfengbojiangmust read the bitstream _backwards_.
769*22ce4affSfengbojiang
770*22ce4affSfengbojiangTo find the start of the bitstream it is therefore necessary to
771*22ce4affSfengbojiangknow the offset of the last byte of the block which can be found
772*22ce4affSfengbojiangby counting `Block_Size` bytes after the block header.
773*22ce4affSfengbojiang
774*22ce4affSfengbojiangAfter writing the last bit containing information, the compressor
775*22ce4affSfengbojiangwrites a single `1`-bit and then fills the byte with 0-7 `0` bits of
776*22ce4affSfengbojiangpadding. The last byte of the compressed bitstream cannot be `0` for
777*22ce4affSfengbojiangthat reason.
778*22ce4affSfengbojiang
779*22ce4affSfengbojiangWhen decompressing, the last byte containing the padding is the first
780*22ce4affSfengbojiangbyte to read. The decompressor needs to skip 0-7 initial `0`-bits and
781*22ce4affSfengbojiangthe first `1`-bit it occurs. Afterwards, the useful part of the bitstream
782*22ce4affSfengbojiangbegins.
783*22ce4affSfengbojiang
784*22ce4affSfengbojiangFSE decoding requires a 'state' to be carried from symbol to symbol.
785*22ce4affSfengbojiangFor more explanation on FSE decoding, see the [FSE section](#fse).
786*22ce4affSfengbojiang
787*22ce4affSfengbojiangFor sequence decoding, a separate state keeps track of each
788*22ce4affSfengbojiangliteral lengths, offsets, and match lengths symbols.
789*22ce4affSfengbojiangSome FSE primitives are also used.
790*22ce4affSfengbojiangFor more details on the operation of these primitives, see the [FSE section](#fse).
791*22ce4affSfengbojiang
792*22ce4affSfengbojiang##### Starting states
793*22ce4affSfengbojiangThe bitstream starts with initial FSE state values,
794*22ce4affSfengbojiangeach using the required number of bits in their respective _accuracy_,
795*22ce4affSfengbojiangdecoded previously from their normalized distribution.
796*22ce4affSfengbojiang
797*22ce4affSfengbojiangIt starts by `Literals_Length_State`,
798*22ce4affSfengbojiangfollowed by `Offset_State`,
799*22ce4affSfengbojiangand finally `Match_Length_State`.
800*22ce4affSfengbojiang
801*22ce4affSfengbojiangReminder : always keep in mind that all values are read _backward_,
802*22ce4affSfengbojiangso the 'start' of the bitstream is at the highest position in memory,
803*22ce4affSfengbojiangimmediately before the last `1`-bit for padding.
804*22ce4affSfengbojiang
805*22ce4affSfengbojiangAfter decoding the starting states, a single sequence is decoded
806*22ce4affSfengbojiang`Number_Of_Sequences` times.
807*22ce4affSfengbojiangThese sequences are decoded in order from first to last.
808*22ce4affSfengbojiangSince the compressor writes the bitstream in the forward direction,
809*22ce4affSfengbojiangthis means the compressor must encode the sequences starting with the last
810*22ce4affSfengbojiangone and ending with the first.
811*22ce4affSfengbojiang
812*22ce4affSfengbojiang##### Decoding a sequence
813*22ce4affSfengbojiangFor each of the symbol types, the FSE state can be used to determine the appropriate code.
814*22ce4affSfengbojiangThe code then defines the `Baseline` and `Number_of_Bits` to read for each type.
815*22ce4affSfengbojiangSee the [description of the codes] for how to determine these values.
816*22ce4affSfengbojiang
817*22ce4affSfengbojiang[description of the codes]: #the-codes-for-literals-lengths-match-lengths-and-offsets
818*22ce4affSfengbojiang
819*22ce4affSfengbojiangDecoding starts by reading the `Number_of_Bits` required to decode `Offset`.
820*22ce4affSfengbojiangIt then does the same for `Match_Length`, and then for `Literals_Length`.
821*22ce4affSfengbojiangThis sequence is then used for [sequence execution](#sequence-execution).
822*22ce4affSfengbojiang
823*22ce4affSfengbojiangIf it is not the last sequence in the block,
824*22ce4affSfengbojiangthe next operation is to update states.
825*22ce4affSfengbojiangUsing the rules pre-calculated in the decoding tables,
826*22ce4affSfengbojiang`Literals_Length_State` is updated,
827*22ce4affSfengbojiangfollowed by `Match_Length_State`,
828*22ce4affSfengbojiangand then `Offset_State`.
829*22ce4affSfengbojiangSee the [FSE section](#fse) for details on how to update states from the bitstream.
830*22ce4affSfengbojiang
831*22ce4affSfengbojiangThis operation will be repeated `Number_of_Sequences` times.
832*22ce4affSfengbojiangAt the end, the bitstream shall be entirely consumed,
833*22ce4affSfengbojiangotherwise the bitstream is considered corrupted.
834*22ce4affSfengbojiang
835*22ce4affSfengbojiang#### Default Distributions
836*22ce4affSfengbojiangIf `Predefined_Mode` is selected for a symbol type,
837*22ce4affSfengbojiangits FSE decoding table is generated from a predefined distribution table defined here.
838*22ce4affSfengbojiangFor details on how to convert this distribution into a decoding table, see the [FSE section].
839*22ce4affSfengbojiang
840*22ce4affSfengbojiang[FSE section]: #from-normalized-distribution-to-decoding-tables
841*22ce4affSfengbojiang
842*22ce4affSfengbojiang##### Literals Length
843*22ce4affSfengbojiangThe decoding table uses an accuracy log of 6 bits (64 states).
844*22ce4affSfengbojiang```
845*22ce4affSfengbojiangshort literalsLength_defaultDistribution[36] =
846*22ce4affSfengbojiang        { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
847*22ce4affSfengbojiang          2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
848*22ce4affSfengbojiang         -1,-1,-1,-1 };
849*22ce4affSfengbojiang```
850*22ce4affSfengbojiang
851*22ce4affSfengbojiang##### Match Length
852*22ce4affSfengbojiangThe decoding table uses an accuracy log of 6 bits (64 states).
853*22ce4affSfengbojiang```
854*22ce4affSfengbojiangshort matchLengths_defaultDistribution[53] =
855*22ce4affSfengbojiang        { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
856*22ce4affSfengbojiang          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
857*22ce4affSfengbojiang          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
858*22ce4affSfengbojiang         -1,-1,-1,-1,-1 };
859*22ce4affSfengbojiang```
860*22ce4affSfengbojiang
861*22ce4affSfengbojiang##### Offset Codes
862*22ce4affSfengbojiangThe decoding table uses an accuracy log of 5 bits (32 states),
863*22ce4affSfengbojiangand supports a maximum `N` value of 28, allowing offset values up to 536,870,908 .
864*22ce4affSfengbojiang
865*22ce4affSfengbojiangIf any sequence in the compressed block requires a larger offset than this,
866*22ce4affSfengbojiangit's not possible to use the default distribution to represent it.
867*22ce4affSfengbojiang```
868*22ce4affSfengbojiangshort offsetCodes_defaultDistribution[29] =
869*22ce4affSfengbojiang        { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
870*22ce4affSfengbojiang          1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };
871*22ce4affSfengbojiang```
872*22ce4affSfengbojiang
873*22ce4affSfengbojiang
874*22ce4affSfengbojiangSequence Execution
875*22ce4affSfengbojiang------------------
876*22ce4affSfengbojiangOnce literals and sequences have been decoded,
877*22ce4affSfengbojiangthey are combined to produce the decoded content of a block.
878*22ce4affSfengbojiang
879*22ce4affSfengbojiangEach sequence consists of a tuple of (`literals_length`, `offset_value`, `match_length`),
880*22ce4affSfengbojiangdecoded as described in the [Sequences Section](#sequences-section).
881*22ce4affSfengbojiangTo execute a sequence, first copy `literals_length` bytes
882*22ce4affSfengbojiangfrom the decoded literals to the output.
883*22ce4affSfengbojiang
884*22ce4affSfengbojiangThen `match_length` bytes are copied from previous decoded data.
885*22ce4affSfengbojiangThe offset to copy from is determined by `offset_value`:
886*22ce4affSfengbojiangif `offset_value > 3`, then the offset is `offset_value - 3`.
887*22ce4affSfengbojiangIf `offset_value` is from 1-3, the offset is a special repeat offset value.
888*22ce4affSfengbojiangSee the [repeat offset](#repeat-offsets) section for how the offset is determined
889*22ce4affSfengbojiangin this case.
890*22ce4affSfengbojiang
891*22ce4affSfengbojiangThe offset is defined as from the current position, so an offset of 6
892*22ce4affSfengbojiangand a match length of 3 means that 3 bytes should be copied from 6 bytes back.
893*22ce4affSfengbojiangNote that all offsets leading to previously decoded data
894*22ce4affSfengbojiangmust be smaller than `Window_Size` defined in `Frame_Header_Descriptor`.
895*22ce4affSfengbojiang
896*22ce4affSfengbojiang#### Repeat offsets
897*22ce4affSfengbojiangAs seen in [Sequence Execution](#sequence-execution),
898*22ce4affSfengbojiangthe first 3 values define a repeated offset and we will call them
899*22ce4affSfengbojiang`Repeated_Offset1`, `Repeated_Offset2`, and `Repeated_Offset3`.
900*22ce4affSfengbojiangThey are sorted in recency order, with `Repeated_Offset1` meaning "most recent one".
901*22ce4affSfengbojiang
902*22ce4affSfengbojiangIf `offset_value == 1`, then the offset used is `Repeated_Offset1`, etc.
903*22ce4affSfengbojiang
904*22ce4affSfengbojiangThere is an exception though, when current sequence's `literals_length = 0`.
905*22ce4affSfengbojiangIn this case, repeated offsets are shifted by one,
906*22ce4affSfengbojiangso an `offset_value` of 1 means `Repeated_Offset2`,
907*22ce4affSfengbojiangan `offset_value` of 2 means `Repeated_Offset3`,
908*22ce4affSfengbojiangand an `offset_value` of 3 means `Repeated_Offset1 - 1_byte`.
909*22ce4affSfengbojiang
910*22ce4affSfengbojiangFor the first block, the starting offset history is populated with following values :
911*22ce4affSfengbojiang`Repeated_Offset1`=1, `Repeated_Offset2`=4, `Repeated_Offset3`=8,
912*22ce4affSfengbojiangunless a dictionary is used, in which case they come from the dictionary.
913*22ce4affSfengbojiang
914*22ce4affSfengbojiangThen each block gets its starting offset history from the ending values of the most recent `Compressed_Block`.
915*22ce4affSfengbojiangNote that blocks which are not `Compressed_Block` are skipped, they do not contribute to offset history.
916*22ce4affSfengbojiang
917*22ce4affSfengbojiang[Offset Codes]: #offset-codes
918*22ce4affSfengbojiang
919*22ce4affSfengbojiang###### Offset updates rules
920*22ce4affSfengbojiang
921*22ce4affSfengbojiangDuring the execution of the sequences of a `Compressed_Block`, the
922*22ce4affSfengbojiang`Repeated_Offsets`' values are kept up to date, so that they always represent
923*22ce4affSfengbojiangthe three most-recently used offsets. In order to achieve that, they are
924*22ce4affSfengbojiangupdated after executing each sequence in the following way:
925*22ce4affSfengbojiang
926*22ce4affSfengbojiangWhen the sequence's `offset_value` does not refer to one of the
927*22ce4affSfengbojiang`Repeated_Offsets`--when it has value greater than 3, or when it has value 3
928*22ce4affSfengbojiangand the sequence's `literals_length` is zero--the `Repeated_Offsets`' values
929*22ce4affSfengbojiangare shifted back one, and `Repeated_Offset1` takes on the value of the
930*22ce4affSfengbojiangjust-used offset.
931*22ce4affSfengbojiang
932*22ce4affSfengbojiangOtherwise, when the sequence's `offset_value` refers to one of the
933*22ce4affSfengbojiang`Repeated_Offsets`--when it has value 1 or 2, or when it has value 3 and the
934*22ce4affSfengbojiangsequence's `literals_length` is non-zero--the `Repeated_Offsets` are re-ordered
935*22ce4affSfengbojiangso that `Repeated_Offset1` takes on the value of the used Repeated_Offset, and
936*22ce4affSfengbojiangthe existing values are pushed back from the first `Repeated_Offset` through to
937*22ce4affSfengbojiangthe `Repeated_Offset` selected by the `offset_value`. This effectively performs
938*22ce4affSfengbojianga single-stepped wrapping rotation of the values of these offsets, so that
939*22ce4affSfengbojiangtheir order again reflects the recency of their use.
940*22ce4affSfengbojiang
941*22ce4affSfengbojiangThe following table shows the values of the `Repeated_Offsets` as a series of
942*22ce4affSfengbojiangsequences are applied to them:
943*22ce4affSfengbojiang
944*22ce4affSfengbojiang| `offset_value` | `literals_length` | `Repeated_Offset1` | `Repeated_Offset2` | `Repeated_Offset3` | Comment                 |
945*22ce4affSfengbojiang|:--------------:|:-----------------:|:------------------:|:------------------:|:------------------:|:-----------------------:|
946*22ce4affSfengbojiang|                |                   |                  1 |                  4 |                  8 | starting values         |
947*22ce4affSfengbojiang|           1114 |                11 |               1111 |                  1 |                  4 | non-repeat              |
948*22ce4affSfengbojiang|              1 |                22 |               1111 |                  1 |                  4 | repeat 1; no change     |
949*22ce4affSfengbojiang|           2225 |                22 |               2222 |               1111 |                  1 | non-repeat              |
950*22ce4affSfengbojiang|           1114 |               111 |               1111 |               2222 |               1111 | non-repeat              |
951*22ce4affSfengbojiang|           3336 |                33 |               3333 |               1111 |               2222 | non-repeat              |
952*22ce4affSfengbojiang|              2 |                22 |               1111 |               3333 |               2222 | repeat 2; swap 1 & 2    |
953*22ce4affSfengbojiang|              3 |                33 |               2222 |               1111 |               3333 | repeat 3; rotate 3 to 1 |
954*22ce4affSfengbojiang|              3 |                 0 |               2221 |               2222 |               1111 | insert resolved offset  |
955*22ce4affSfengbojiang|              1 |                 0 |               2222 |               2221 |               3333 | repeat 2                |
956*22ce4affSfengbojiang
957*22ce4affSfengbojiang
958*22ce4affSfengbojiangSkippable Frames
959*22ce4affSfengbojiang----------------
960*22ce4affSfengbojiang
961*22ce4affSfengbojiang| `Magic_Number` | `Frame_Size` | `User_Data` |
962*22ce4affSfengbojiang|:--------------:|:------------:|:-----------:|
963*22ce4affSfengbojiang|   4 bytes      |  4 bytes     |   n bytes   |
964*22ce4affSfengbojiang
965*22ce4affSfengbojiangSkippable frames allow the insertion of user-defined metadata
966*22ce4affSfengbojianginto a flow of concatenated frames.
967*22ce4affSfengbojiang
968*22ce4affSfengbojiangSkippable frames defined in this specification are compatible with [LZ4] ones.
969*22ce4affSfengbojiang
970*22ce4affSfengbojiang[LZ4]:http://www.lz4.org
971*22ce4affSfengbojiang
972*22ce4affSfengbojiangFrom a compliant decoder perspective, skippable frames need just be skipped,
973*22ce4affSfengbojiangand their content ignored, resuming decoding after the skippable frame.
974*22ce4affSfengbojiang
975*22ce4affSfengbojiangIt can be noted that a skippable frame
976*22ce4affSfengbojiangcan be used to watermark a stream of concatenated frames
977*22ce4affSfengbojiangembedding any kind of tracking information (even just an UUID).
978*22ce4affSfengbojiangUsers wary of such possibility should scan the stream of concatenated frames
979*22ce4affSfengbojiangin an attempt to detect such frame for analysis or removal.
980*22ce4affSfengbojiang
981*22ce4affSfengbojiang__`Magic_Number`__
982*22ce4affSfengbojiang
983*22ce4affSfengbojiang4 Bytes, __little-endian__ format.
984*22ce4affSfengbojiangValue : 0x184D2A5?, which means any value from 0x184D2A50 to 0x184D2A5F.
985*22ce4affSfengbojiangAll 16 values are valid to identify a skippable frame.
986*22ce4affSfengbojiangThis specification doesn't detail any specific tagging for skippable frames.
987*22ce4affSfengbojiang
988*22ce4affSfengbojiang__`Frame_Size`__
989*22ce4affSfengbojiang
990*22ce4affSfengbojiangThis is the size, in bytes, of the following `User_Data`
991*22ce4affSfengbojiang(without including the magic number nor the size field itself).
992*22ce4affSfengbojiangThis field is represented using 4 Bytes, __little-endian__ format, unsigned 32-bits.
993*22ce4affSfengbojiangThis means `User_Data` can’t be bigger than (2^32-1) bytes.
994*22ce4affSfengbojiang
995*22ce4affSfengbojiang__`User_Data`__
996*22ce4affSfengbojiang
997*22ce4affSfengbojiangThe `User_Data` can be anything. Data will just be skipped by the decoder.
998*22ce4affSfengbojiang
999*22ce4affSfengbojiang
1000*22ce4affSfengbojiang
1001*22ce4affSfengbojiangEntropy Encoding
1002*22ce4affSfengbojiang----------------
1003*22ce4affSfengbojiangTwo types of entropy encoding are used by the Zstandard format:
1004*22ce4affSfengbojiangFSE, and Huffman coding.
1005*22ce4affSfengbojiangHuffman is used to compress literals,
1006*22ce4affSfengbojiangwhile FSE is used for all other symbols
1007*22ce4affSfengbojiang(`Literals_Length_Code`, `Match_Length_Code`, offset codes)
1008*22ce4affSfengbojiangand to compress Huffman headers.
1009*22ce4affSfengbojiang
1010*22ce4affSfengbojiang
1011*22ce4affSfengbojiangFSE
1012*22ce4affSfengbojiang---
1013*22ce4affSfengbojiangFSE, short for Finite State Entropy, is an entropy codec based on [ANS].
1014*22ce4affSfengbojiangFSE encoding/decoding involves a state that is carried over between symbols,
1015*22ce4affSfengbojiangso decoding must be done in the opposite direction as encoding.
1016*22ce4affSfengbojiangTherefore, all FSE bitstreams are read from end to beginning.
1017*22ce4affSfengbojiangNote that the order of the bits in the stream is not reversed,
1018*22ce4affSfengbojiangwe just read the elements in the reverse order they are written.
1019*22ce4affSfengbojiang
1020*22ce4affSfengbojiangFor additional details on FSE, see [Finite State Entropy].
1021*22ce4affSfengbojiang
1022*22ce4affSfengbojiang[Finite State Entropy]:https://github.com/Cyan4973/FiniteStateEntropy/
1023*22ce4affSfengbojiang
1024*22ce4affSfengbojiangFSE decoding involves a decoding table which has a power of 2 size, and contain three elements:
1025*22ce4affSfengbojiang`Symbol`, `Num_Bits`, and `Baseline`.
1026*22ce4affSfengbojiangThe `log2` of the table size is its `Accuracy_Log`.
1027*22ce4affSfengbojiangAn FSE state value represents an index in this table.
1028*22ce4affSfengbojiang
1029*22ce4affSfengbojiangTo obtain the initial state value, consume `Accuracy_Log` bits from the stream as a __little-endian__ value.
1030*22ce4affSfengbojiangThe next symbol in the stream is the `Symbol` indicated in the table for that state.
1031*22ce4affSfengbojiangTo obtain the next state value,
1032*22ce4affSfengbojiangthe decoder should consume `Num_Bits` bits from the stream as a __little-endian__ value and add it to `Baseline`.
1033*22ce4affSfengbojiang
1034*22ce4affSfengbojiang[ANS]: https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems
1035*22ce4affSfengbojiang
1036*22ce4affSfengbojiang### FSE Table Description
1037*22ce4affSfengbojiangTo decode FSE streams, it is necessary to construct the decoding table.
1038*22ce4affSfengbojiangThe Zstandard format encodes FSE table descriptions as follows:
1039*22ce4affSfengbojiang
1040*22ce4affSfengbojiangAn FSE distribution table describes the probabilities of all symbols
1041*22ce4affSfengbojiangfrom `0` to the last present one (included)
1042*22ce4affSfengbojiangon a normalized scale of `1 << Accuracy_Log` .
1043*22ce4affSfengbojiangNote that there must be two or more symbols with nonzero probability.
1044*22ce4affSfengbojiang
1045*22ce4affSfengbojiangIt's a bitstream which is read forward, in __little-endian__ fashion.
1046*22ce4affSfengbojiangIt's not necessary to know bitstream exact size,
1047*22ce4affSfengbojiangit will be discovered and reported by the decoding process.
1048*22ce4affSfengbojiang
1049*22ce4affSfengbojiangThe bitstream starts by reporting on which scale it operates.
1050*22ce4affSfengbojiangLet's `low4Bits` designate the lowest 4 bits of the first byte :
1051*22ce4affSfengbojiang`Accuracy_Log = low4bits + 5`.
1052*22ce4affSfengbojiang
1053*22ce4affSfengbojiangThen follows each symbol value, from `0` to last present one.
1054*22ce4affSfengbojiangThe number of bits used by each field is variable.
1055*22ce4affSfengbojiangIt depends on :
1056*22ce4affSfengbojiang
1057*22ce4affSfengbojiang- Remaining probabilities + 1 :
1058*22ce4affSfengbojiang  __example__ :
1059*22ce4affSfengbojiang  Presuming an `Accuracy_Log` of 8,
1060*22ce4affSfengbojiang  and presuming 100 probabilities points have already been distributed,
1061*22ce4affSfengbojiang  the decoder may read any value from `0` to `256 - 100 + 1 == 157` (inclusive).
1062*22ce4affSfengbojiang  Therefore, it must read `log2sup(157) == 8` bits.
1063*22ce4affSfengbojiang
1064*22ce4affSfengbojiang- Value decoded : small values use 1 less bit :
1065*22ce4affSfengbojiang  __example__ :
1066*22ce4affSfengbojiang  Presuming values from 0 to 157 (inclusive) are possible,
1067*22ce4affSfengbojiang  255-157 = 98 values are remaining in an 8-bits field.
1068*22ce4affSfengbojiang  They are used this way :
1069*22ce4affSfengbojiang  first 98 values (hence from 0 to 97) use only 7 bits,
1070*22ce4affSfengbojiang  values from 98 to 157 use 8 bits.
1071*22ce4affSfengbojiang  This is achieved through this scheme :
1072*22ce4affSfengbojiang
1073*22ce4affSfengbojiang  | Value read | Value decoded | Number of bits used |
1074*22ce4affSfengbojiang  | ---------- | ------------- | ------------------- |
1075*22ce4affSfengbojiang  |   0 -  97  |   0 -  97     |  7                  |
1076*22ce4affSfengbojiang  |  98 - 127  |  98 - 127     |  8                  |
1077*22ce4affSfengbojiang  | 128 - 225  |   0 -  97     |  7                  |
1078*22ce4affSfengbojiang  | 226 - 255  | 128 - 157     |  8                  |
1079*22ce4affSfengbojiang
1080*22ce4affSfengbojiangSymbols probabilities are read one by one, in order.
1081*22ce4affSfengbojiang
1082*22ce4affSfengbojiangProbability is obtained from Value decoded by following formula :
1083*22ce4affSfengbojiang`Proba = value - 1`
1084*22ce4affSfengbojiang
1085*22ce4affSfengbojiangIt means value `0` becomes negative probability `-1`.
1086*22ce4affSfengbojiang`-1` is a special probability, which means "less than 1".
1087*22ce4affSfengbojiangIts effect on distribution table is described in the [next section].
1088*22ce4affSfengbojiangFor the purpose of calculating total allocated probability points, it counts as one.
1089*22ce4affSfengbojiang
1090*22ce4affSfengbojiang[next section]:#from-normalized-distribution-to-decoding-tables
1091*22ce4affSfengbojiang
1092*22ce4affSfengbojiangWhen a symbol has a __probability__ of `zero`,
1093*22ce4affSfengbojiangit is followed by a 2-bits repeat flag.
1094*22ce4affSfengbojiangThis repeat flag tells how many probabilities of zeroes follow the current one.
1095*22ce4affSfengbojiangIt provides a number ranging from 0 to 3.
1096*22ce4affSfengbojiangIf it is a 3, another 2-bits repeat flag follows, and so on.
1097*22ce4affSfengbojiang
1098*22ce4affSfengbojiangWhen last symbol reaches cumulated total of `1 << Accuracy_Log`,
1099*22ce4affSfengbojiangdecoding is complete.
1100*22ce4affSfengbojiangIf the last symbol makes cumulated total go above `1 << Accuracy_Log`,
1101*22ce4affSfengbojiangdistribution is considered corrupted.
1102*22ce4affSfengbojiang
1103*22ce4affSfengbojiangThen the decoder can tell how many bytes were used in this process,
1104*22ce4affSfengbojiangand how many symbols are present.
1105*22ce4affSfengbojiangThe bitstream consumes a round number of bytes.
1106*22ce4affSfengbojiangAny remaining bit within the last byte is just unused.
1107*22ce4affSfengbojiang
1108*22ce4affSfengbojiang#### From normalized distribution to decoding tables
1109*22ce4affSfengbojiang
1110*22ce4affSfengbojiangThe distribution of normalized probabilities is enough
1111*22ce4affSfengbojiangto create a unique decoding table.
1112*22ce4affSfengbojiang
1113*22ce4affSfengbojiangIt follows the following build rule :
1114*22ce4affSfengbojiang
1115*22ce4affSfengbojiangThe table has a size of `Table_Size = 1 << Accuracy_Log`.
1116*22ce4affSfengbojiangEach cell describes the symbol decoded,
1117*22ce4affSfengbojiangand instructions to get the next state (`Number_of_Bits` and `Baseline`).
1118*22ce4affSfengbojiang
1119*22ce4affSfengbojiangSymbols are scanned in their natural order for "less than 1" probabilities.
1120*22ce4affSfengbojiangSymbols with this probability are being attributed a single cell,
1121*22ce4affSfengbojiangstarting from the end of the table and retreating.
1122*22ce4affSfengbojiangThese symbols define a full state reset, reading `Accuracy_Log` bits.
1123*22ce4affSfengbojiang
1124*22ce4affSfengbojiangThen, all remaining symbols, sorted in natural order, are allocated cells.
1125*22ce4affSfengbojiangStarting from symbol `0` (if it exists), and table position `0`,
1126*22ce4affSfengbojiangeach symbol gets allocated as many cells as its probability.
1127*22ce4affSfengbojiangCell allocation is spreaded, not linear :
1128*22ce4affSfengbojiangeach successor position follows this rule :
1129*22ce4affSfengbojiang
1130*22ce4affSfengbojiang```
1131*22ce4affSfengbojiangposition += (tableSize>>1) + (tableSize>>3) + 3;
1132*22ce4affSfengbojiangposition &= tableSize-1;
1133*22ce4affSfengbojiang```
1134*22ce4affSfengbojiang
1135*22ce4affSfengbojiangA position is skipped if already occupied by a "less than 1" probability symbol.
1136*22ce4affSfengbojiang`position` does not reset between symbols, it simply iterates through
1137*22ce4affSfengbojiangeach position in the table, switching to the next symbol when enough
1138*22ce4affSfengbojiangstates have been allocated to the current one.
1139*22ce4affSfengbojiang
1140*22ce4affSfengbojiangThe process guarantees that the table is entirely filled.
1141*22ce4affSfengbojiangEach cell corresponds to a state value, which contains the symbol being decoded.
1142*22ce4affSfengbojiang
1143*22ce4affSfengbojiangTo add the `Number_of_Bits` and `Baseline` required to retrieve next state,
1144*22ce4affSfengbojiangit's first necessary to sort all occurrences of each symbol in state order.
1145*22ce4affSfengbojiangLower states will need 1 more bit than higher ones.
1146*22ce4affSfengbojiangThe process is repeated for each symbol.
1147*22ce4affSfengbojiang
1148*22ce4affSfengbojiang__Example__ :
1149*22ce4affSfengbojiangPresuming a symbol has a probability of 5,
1150*22ce4affSfengbojiangit receives 5 cells, corresponding to 5 state values.
1151*22ce4affSfengbojiangThese state values are then sorted in natural order.
1152*22ce4affSfengbojiang
1153*22ce4affSfengbojiangNext power of 2 after 5 is 8.
1154*22ce4affSfengbojiangSpace of probabilities must be divided into 8 equal parts.
1155*22ce4affSfengbojiangPresuming the `Accuracy_Log` is 7, it defines a space of 128 states.
1156*22ce4affSfengbojiangDivided by 8, each share is 16 large.
1157*22ce4affSfengbojiang
1158*22ce4affSfengbojiangIn order to reach 8 shares, 8-5=3 lowest states will count "double",
1159*22ce4affSfengbojiangdoubling their shares (32 in width), hence requiring one more bit.
1160*22ce4affSfengbojiang
1161*22ce4affSfengbojiangBaseline is assigned starting from the higher states using fewer bits,
1162*22ce4affSfengbojiangincreasing at each state, then resuming at the first state,
1163*22ce4affSfengbojiangeach state takes its allocated width from Baseline.
1164*22ce4affSfengbojiang
1165*22ce4affSfengbojiang| state value      |   1   |  39   |   77   |  84  |  122   |
1166*22ce4affSfengbojiang| state order      |   0   |   1   |    2   |   3  |    4   |
1167*22ce4affSfengbojiang| ---------------- | ----- | ----- | ------ | ---- | ------ |
1168*22ce4affSfengbojiang| width            |  32   |  32   |   32   |  16  |   16   |
1169*22ce4affSfengbojiang| `Number_of_Bits` |   5   |   5   |    5   |   4  |    4   |
1170*22ce4affSfengbojiang| range number     |   2   |   4   |    6   |   0  |    1   |
1171*22ce4affSfengbojiang| `Baseline`       |  32   |  64   |   96   |   0  |   16   |
1172*22ce4affSfengbojiang| range            | 32-63 | 64-95 | 96-127 | 0-15 | 16-31  |
1173*22ce4affSfengbojiang
1174*22ce4affSfengbojiangDuring decoding, the next state value is determined from current state value,
1175*22ce4affSfengbojiangby reading the required `Number_of_Bits`, and adding the specified `Baseline`.
1176*22ce4affSfengbojiang
1177*22ce4affSfengbojiangSee [Appendix A] for the results of this process applied to the default distributions.
1178*22ce4affSfengbojiang
1179*22ce4affSfengbojiang[Appendix A]: #appendix-a---decoding-tables-for-predefined-codes
1180*22ce4affSfengbojiang
1181*22ce4affSfengbojiang
1182*22ce4affSfengbojiangHuffman Coding
1183*22ce4affSfengbojiang--------------
1184*22ce4affSfengbojiangZstandard Huffman-coded streams are read backwards,
1185*22ce4affSfengbojiangsimilar to the FSE bitstreams.
1186*22ce4affSfengbojiangTherefore, to find the start of the bitstream, it is therefore to
1187*22ce4affSfengbojiangknow the offset of the last byte of the Huffman-coded stream.
1188*22ce4affSfengbojiang
1189*22ce4affSfengbojiangAfter writing the last bit containing information, the compressor
1190*22ce4affSfengbojiangwrites a single `1`-bit and then fills the byte with 0-7 `0` bits of
1191*22ce4affSfengbojiangpadding. The last byte of the compressed bitstream cannot be `0` for
1192*22ce4affSfengbojiangthat reason.
1193*22ce4affSfengbojiang
1194*22ce4affSfengbojiangWhen decompressing, the last byte containing the padding is the first
1195*22ce4affSfengbojiangbyte to read. The decompressor needs to skip 0-7 initial `0`-bits and
1196*22ce4affSfengbojiangthe first `1`-bit it occurs. Afterwards, the useful part of the bitstream
1197*22ce4affSfengbojiangbegins.
1198*22ce4affSfengbojiang
1199*22ce4affSfengbojiangThe bitstream contains Huffman-coded symbols in __little-endian__ order,
1200*22ce4affSfengbojiangwith the codes defined by the method below.
1201*22ce4affSfengbojiang
1202*22ce4affSfengbojiang### Huffman Tree Description
1203*22ce4affSfengbojiang
1204*22ce4affSfengbojiangPrefix coding represents symbols from an a priori known alphabet
1205*22ce4affSfengbojiangby bit sequences (codewords), one codeword for each symbol,
1206*22ce4affSfengbojiangin a manner such that different symbols may be represented
1207*22ce4affSfengbojiangby bit sequences of different lengths,
1208*22ce4affSfengbojiangbut a parser can always parse an encoded string
1209*22ce4affSfengbojiangunambiguously symbol-by-symbol.
1210*22ce4affSfengbojiang
1211*22ce4affSfengbojiangGiven an alphabet with known symbol frequencies,
1212*22ce4affSfengbojiangthe Huffman algorithm allows the construction of an optimal prefix code
1213*22ce4affSfengbojiangusing the fewest bits of any possible prefix codes for that alphabet.
1214*22ce4affSfengbojiang
1215*22ce4affSfengbojiangPrefix code must not exceed a maximum code length.
1216*22ce4affSfengbojiangMore bits improve accuracy but cost more header size,
1217*22ce4affSfengbojiangand require more memory or more complex decoding operations.
1218*22ce4affSfengbojiangThis specification limits maximum code length to 11 bits.
1219*22ce4affSfengbojiang
1220*22ce4affSfengbojiang#### Representation
1221*22ce4affSfengbojiang
1222*22ce4affSfengbojiangAll literal values from zero (included) to last present one (excluded)
1223*22ce4affSfengbojiangare represented by `Weight` with values from `0` to `Max_Number_of_Bits`.
1224*22ce4affSfengbojiangTransformation from `Weight` to `Number_of_Bits` follows this formula :
1225*22ce4affSfengbojiang```
1226*22ce4affSfengbojiangNumber_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
1227*22ce4affSfengbojiang```
1228*22ce4affSfengbojiangThe last symbol's `Weight` is deduced from previously decoded ones,
1229*22ce4affSfengbojiangby completing to the nearest power of 2.
1230*22ce4affSfengbojiangThis power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
1231*22ce4affSfengbojiang`Max_Number_of_Bits` must be <= 11,
1232*22ce4affSfengbojiangotherwise the representation is considered corrupted.
1233*22ce4affSfengbojiang
1234*22ce4affSfengbojiang__Example__ :
1235*22ce4affSfengbojiangLet's presume the following Huffman tree must be described :
1236*22ce4affSfengbojiang
1237*22ce4affSfengbojiang|  literal value   |  0  |  1  |  2  |  3  |  4  |  5  |
1238*22ce4affSfengbojiang| ---------------- | --- | --- | --- | --- | --- | --- |
1239*22ce4affSfengbojiang| `Number_of_Bits` |  1  |  2  |  3  |  0  |  4  |  4  |
1240*22ce4affSfengbojiang
1241*22ce4affSfengbojiangThe tree depth is 4, since its longest elements uses 4 bits
1242*22ce4affSfengbojiang(longest elements are the one with smallest frequency).
1243*22ce4affSfengbojiangValue `5` will not be listed, as it can be determined from values for 0-4,
1244*22ce4affSfengbojiangnor will values above `5` as they are all 0.
1245*22ce4affSfengbojiangValues from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
1246*22ce4affSfengbojiangWeight formula is :
1247*22ce4affSfengbojiang```
1248*22ce4affSfengbojiangWeight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0
1249*22ce4affSfengbojiang```
1250*22ce4affSfengbojiangIt gives the following series of weights :
1251*22ce4affSfengbojiang
1252*22ce4affSfengbojiang| literal value |  0  |  1  |  2  |  3  |  4  |
1253*22ce4affSfengbojiang| ------------- | --- | --- | --- | --- | --- |
1254*22ce4affSfengbojiang|   `Weight`    |  4  |  3  |  2  |  0  |  1  |
1255*22ce4affSfengbojiang
1256*22ce4affSfengbojiangThe decoder will do the inverse operation :
1257*22ce4affSfengbojianghaving collected weights of literal symbols from `0` to `4`,
1258*22ce4affSfengbojiangit knows the last literal, `5`, is present with a non-zero `Weight`.
1259*22ce4affSfengbojiangThe `Weight` of `5` can be determined by advancing to the next power of 2.
1260*22ce4affSfengbojiangThe sum of `2^(Weight-1)` (excluding 0's) is :
1261*22ce4affSfengbojiang`8 + 4 + 2 + 0 + 1 = 15`.
1262*22ce4affSfengbojiangNearest larger power of 2 value is 16.
1263*22ce4affSfengbojiangTherefore, `Max_Number_of_Bits = 4` and `Weight[5] = 16-15 = 1`.
1264*22ce4affSfengbojiang
1265*22ce4affSfengbojiang#### Huffman Tree header
1266*22ce4affSfengbojiang
1267*22ce4affSfengbojiangThis is a single byte value (0-255),
1268*22ce4affSfengbojiangwhich describes how the series of weights is encoded.
1269*22ce4affSfengbojiang
1270*22ce4affSfengbojiang- if `headerByte` < 128 :
1271*22ce4affSfengbojiang  the series of weights is compressed using FSE (see below).
1272*22ce4affSfengbojiang  The length of the FSE-compressed series is equal to `headerByte` (0-127).
1273*22ce4affSfengbojiang
1274*22ce4affSfengbojiang- if `headerByte` >= 128 :
1275*22ce4affSfengbojiang  + the series of weights uses a direct representation,
1276*22ce4affSfengbojiang    where each `Weight` is encoded directly as a 4 bits field (0-15).
1277*22ce4affSfengbojiang  + They are encoded forward, 2 weights to a byte,
1278*22ce4affSfengbojiang    first weight taking the top four bits and second one taking the bottom four.
1279*22ce4affSfengbojiang    * e.g. the following operations could be used to read the weights:
1280*22ce4affSfengbojiang      `Weight[0] = (Byte[0] >> 4), Weight[1] = (Byte[0] & 0xf)`, etc.
1281*22ce4affSfengbojiang  + The full representation occupies `Ceiling(Number_of_Weights/2)` bytes,
1282*22ce4affSfengbojiang    meaning it uses only full bytes even if `Number_of_Weights` is odd.
1283*22ce4affSfengbojiang  + `Number_of_Weights = headerByte - 127`.
1284*22ce4affSfengbojiang    * Note that maximum `Number_of_Weights` is 255-127 = 128,
1285*22ce4affSfengbojiang      therefore, only up to 128 `Weight` can be encoded using direct representation.
1286*22ce4affSfengbojiang    * Since the last non-zero `Weight` is _not_ encoded,
1287*22ce4affSfengbojiang      this scheme is compatible with alphabet sizes of up to 129 symbols,
1288*22ce4affSfengbojiang      hence including literal symbol 128.
1289*22ce4affSfengbojiang    * If any literal symbol > 128 has a non-zero `Weight`,
1290*22ce4affSfengbojiang      direct representation is not possible.
1291*22ce4affSfengbojiang      In such case, it's necessary to use FSE compression.
1292*22ce4affSfengbojiang
1293*22ce4affSfengbojiang
1294*22ce4affSfengbojiang#### Finite State Entropy (FSE) compression of Huffman weights
1295*22ce4affSfengbojiang
1296*22ce4affSfengbojiangIn this case, the series of Huffman weights is compressed using FSE compression.
1297*22ce4affSfengbojiangIt's a single bitstream with 2 interleaved states,
1298*22ce4affSfengbojiangsharing a single distribution table.
1299*22ce4affSfengbojiang
1300*22ce4affSfengbojiangTo decode an FSE bitstream, it is necessary to know its compressed size.
1301*22ce4affSfengbojiangCompressed size is provided by `headerByte`.
1302*22ce4affSfengbojiangIt's also necessary to know its _maximum possible_ decompressed size,
1303*22ce4affSfengbojiangwhich is `255`, since literal values span from `0` to `255`,
1304*22ce4affSfengbojiangand last symbol's `Weight` is not represented.
1305*22ce4affSfengbojiang
1306*22ce4affSfengbojiangAn FSE bitstream starts by a header, describing probabilities distribution.
1307*22ce4affSfengbojiangIt will create a Decoding Table.
1308*22ce4affSfengbojiangFor a list of Huffman weights, the maximum accuracy log is 6 bits.
1309*22ce4affSfengbojiangFor more description see the [FSE header description](#fse-table-description)
1310*22ce4affSfengbojiang
1311*22ce4affSfengbojiangThe Huffman header compression uses 2 states,
1312*22ce4affSfengbojiangwhich share the same FSE distribution table.
1313*22ce4affSfengbojiangThe first state (`State1`) encodes the even indexed symbols,
1314*22ce4affSfengbojiangand the second (`State2`) encodes the odd indexed symbols.
1315*22ce4affSfengbojiang`State1` is initialized first, and then `State2`, and they take turns
1316*22ce4affSfengbojiangdecoding a single symbol and updating their state.
1317*22ce4affSfengbojiangFor more details on these FSE operations, see the [FSE section](#fse).
1318*22ce4affSfengbojiang
1319*22ce4affSfengbojiangThe number of symbols to decode is determined
1320*22ce4affSfengbojiangby tracking bitStream overflow condition:
1321*22ce4affSfengbojiangIf updating state after decoding a symbol would require more bits than
1322*22ce4affSfengbojiangremain in the stream, it is assumed that extra bits are 0.  Then,
1323*22ce4affSfengbojiangsymbols for each of the final states are decoded and the process is complete.
1324*22ce4affSfengbojiang
1325*22ce4affSfengbojiang#### Conversion from weights to Huffman prefix codes
1326*22ce4affSfengbojiang
1327*22ce4affSfengbojiangAll present symbols shall now have a `Weight` value.
1328*22ce4affSfengbojiangIt is possible to transform weights into `Number_of_Bits`, using this formula:
1329*22ce4affSfengbojiang```
1330*22ce4affSfengbojiangNumber_of_Bits = (Weight>0) ? Max_Number_of_Bits + 1 - Weight : 0
1331*22ce4affSfengbojiang```
1332*22ce4affSfengbojiangSymbols are sorted by `Weight`.
1333*22ce4affSfengbojiangWithin same `Weight`, symbols keep natural sequential order.
1334*22ce4affSfengbojiangSymbols with a `Weight` of zero are removed.
1335*22ce4affSfengbojiangThen, starting from lowest `Weight`, prefix codes are distributed in sequential order.
1336*22ce4affSfengbojiang
1337*22ce4affSfengbojiang__Example__ :
1338*22ce4affSfengbojiangLet's presume the following list of weights has been decoded :
1339*22ce4affSfengbojiang
1340*22ce4affSfengbojiang| Literal  |  0  |  1  |  2  |  3  |  4  |  5  |
1341*22ce4affSfengbojiang| -------- | --- | --- | --- | --- | --- | --- |
1342*22ce4affSfengbojiang| `Weight` |  4  |  3  |  2  |  0  |  1  |  1  |
1343*22ce4affSfengbojiang
1344*22ce4affSfengbojiangSorted by weight and then natural sequential order,
1345*22ce4affSfengbojiangit gives the following distribution :
1346*22ce4affSfengbojiang
1347*22ce4affSfengbojiang| Literal          |  3  |  4  |  5  |  2  |  1  |   0  |
1348*22ce4affSfengbojiang| ---------------- | --- | --- | --- | --- | --- | ---- |
1349*22ce4affSfengbojiang| `Weight`         |  0  |  1  |  1  |  2  |  3  |   4  |
1350*22ce4affSfengbojiang| `Number_of_Bits` |  0  |  4  |  4  |  3  |  2  |   1  |
1351*22ce4affSfengbojiang| prefix codes     | N/A | 0000| 0001| 001 | 01  |   1  |
1352*22ce4affSfengbojiang
1353*22ce4affSfengbojiang### Huffman-coded Streams
1354*22ce4affSfengbojiang
1355*22ce4affSfengbojiangGiven a Huffman decoding table,
1356*22ce4affSfengbojiangit's possible to decode a Huffman-coded stream.
1357*22ce4affSfengbojiang
1358*22ce4affSfengbojiangEach bitstream must be read _backward_,
1359*22ce4affSfengbojiangthat is starting from the end down to the beginning.
1360*22ce4affSfengbojiangTherefore it's necessary to know the size of each bitstream.
1361*22ce4affSfengbojiang
1362*22ce4affSfengbojiangIt's also necessary to know exactly which _bit_ is the last one.
1363*22ce4affSfengbojiangThis is detected by a final bit flag :
1364*22ce4affSfengbojiangthe highest bit of latest byte is a final-bit-flag.
1365*22ce4affSfengbojiangConsequently, a last byte of `0` is not possible.
1366*22ce4affSfengbojiangAnd the final-bit-flag itself is not part of the useful bitstream.
1367*22ce4affSfengbojiangHence, the last byte contains between 0 and 7 useful bits.
1368*22ce4affSfengbojiang
1369*22ce4affSfengbojiangStarting from the end,
1370*22ce4affSfengbojiangit's possible to read the bitstream in a __little-endian__ fashion,
1371*22ce4affSfengbojiangkeeping track of already used bits. Since the bitstream is encoded in reverse
1372*22ce4affSfengbojiangorder, starting from the end read symbols in forward order.
1373*22ce4affSfengbojiang
1374*22ce4affSfengbojiangFor example, if the literal sequence "0145" was encoded using above prefix code,
1375*22ce4affSfengbojiangit would be encoded (in reverse order) as:
1376*22ce4affSfengbojiang
1377*22ce4affSfengbojiang|Symbol  |   5  |   4  |  1 | 0 | Padding |
1378*22ce4affSfengbojiang|--------|------|------|----|---|---------|
1379*22ce4affSfengbojiang|Encoding|`0000`|`0001`|`01`|`1`| `00001` |
1380*22ce4affSfengbojiang
1381*22ce4affSfengbojiangResulting in following 2-bytes bitstream :
1382*22ce4affSfengbojiang```
1383*22ce4affSfengbojiang00010000 00001101
1384*22ce4affSfengbojiang```
1385*22ce4affSfengbojiang
1386*22ce4affSfengbojiangHere is an alternative representation with the symbol codes separated by underscore:
1387*22ce4affSfengbojiang```
1388*22ce4affSfengbojiang0001_0000 00001_1_01
1389*22ce4affSfengbojiang```
1390*22ce4affSfengbojiang
1391*22ce4affSfengbojiangReading highest `Max_Number_of_Bits` bits,
1392*22ce4affSfengbojiangit's possible to compare extracted value to decoding table,
1393*22ce4affSfengbojiangdetermining the symbol to decode and number of bits to discard.
1394*22ce4affSfengbojiang
1395*22ce4affSfengbojiangThe process continues up to reading the required number of symbols per stream.
1396*22ce4affSfengbojiangIf a bitstream is not entirely and exactly consumed,
1397*22ce4affSfengbojianghence reaching exactly its beginning position with _all_ bits consumed,
1398*22ce4affSfengbojiangthe decoding process is considered faulty.
1399*22ce4affSfengbojiang
1400*22ce4affSfengbojiang
1401*22ce4affSfengbojiangDictionary Format
1402*22ce4affSfengbojiang-----------------
1403*22ce4affSfengbojiang
1404*22ce4affSfengbojiangZstandard is compatible with "raw content" dictionaries,
1405*22ce4affSfengbojiangfree of any format restriction, except that they must be at least 8 bytes.
1406*22ce4affSfengbojiangThese dictionaries function as if they were just the `Content` part
1407*22ce4affSfengbojiangof a formatted dictionary.
1408*22ce4affSfengbojiang
1409*22ce4affSfengbojiangBut dictionaries created by `zstd --train` follow a format, described here.
1410*22ce4affSfengbojiang
1411*22ce4affSfengbojiang__Pre-requisites__ : a dictionary has a size,
1412*22ce4affSfengbojiang                     defined either by a buffer limit, or a file size.
1413*22ce4affSfengbojiang
1414*22ce4affSfengbojiang| `Magic_Number` | `Dictionary_ID` | `Entropy_Tables` | `Content` |
1415*22ce4affSfengbojiang| -------------- | --------------- | ---------------- | --------- |
1416*22ce4affSfengbojiang
1417*22ce4affSfengbojiang__`Magic_Number`__ : 4 bytes ID, value 0xEC30A437, __little-endian__ format
1418*22ce4affSfengbojiang
1419*22ce4affSfengbojiang__`Dictionary_ID`__ : 4 bytes, stored in __little-endian__ format.
1420*22ce4affSfengbojiang              `Dictionary_ID` can be any value, except 0 (which means no `Dictionary_ID`).
1421*22ce4affSfengbojiang              It's used by decoders to check if they use the correct dictionary.
1422*22ce4affSfengbojiang
1423*22ce4affSfengbojiang_Reserved ranges :_
1424*22ce4affSfengbojiangIf the dictionary is going to be distributed in a public environment,
1425*22ce4affSfengbojiangthe following ranges of `Dictionary_ID` are reserved for some future registrar
1426*22ce4affSfengbojiangand shall not be used :
1427*22ce4affSfengbojiang
1428*22ce4affSfengbojiang    - low range  : <= 32767
1429*22ce4affSfengbojiang    - high range : >= (2^31)
1430*22ce4affSfengbojiang
1431*22ce4affSfengbojiangOutside of these ranges, any value of `Dictionary_ID`
1432*22ce4affSfengbojiangwhich is both `>= 32768` and `< (1<<31)` can be used freely,
1433*22ce4affSfengbojiangeven in public environment.
1434*22ce4affSfengbojiang
1435*22ce4affSfengbojiang
1436*22ce4affSfengbojiang__`Entropy_Tables`__ : follow the same format as tables in [compressed blocks].
1437*22ce4affSfengbojiang              See the relevant [FSE](#fse-table-description)
1438*22ce4affSfengbojiang              and [Huffman](#huffman-tree-description) sections for how to decode these tables.
1439*22ce4affSfengbojiang              They are stored in following order :
1440*22ce4affSfengbojiang              Huffman tables for literals, FSE table for offsets,
1441*22ce4affSfengbojiang              FSE table for match lengths, and FSE table for literals lengths.
1442*22ce4affSfengbojiang              These tables populate the Repeat Stats literals mode and
1443*22ce4affSfengbojiang              Repeat distribution mode for sequence decoding.
1444*22ce4affSfengbojiang              It's finally followed by 3 offset values, populating recent offsets (instead of using `{1,4,8}`),
1445*22ce4affSfengbojiang              stored in order, 4-bytes __little-endian__ each, for a total of 12 bytes.
1446*22ce4affSfengbojiang              Each recent offset must have a value <= dictionary content size, and cannot equal 0.
1447*22ce4affSfengbojiang
1448*22ce4affSfengbojiang__`Content`__ : The rest of the dictionary is its content.
1449*22ce4affSfengbojiang              The content act as a "past" in front of data to compress or decompress,
1450*22ce4affSfengbojiang              so it can be referenced in sequence commands.
1451*22ce4affSfengbojiang              As long as the amount of data decoded from this frame is less than or
1452*22ce4affSfengbojiang              equal to `Window_Size`, sequence commands may specify offsets longer
1453*22ce4affSfengbojiang              than the total length of decoded output so far to reference back to the
1454*22ce4affSfengbojiang              dictionary, even parts of the dictionary with offsets larger than `Window_Size`.
1455*22ce4affSfengbojiang              After the total output has surpassed `Window_Size` however,
1456*22ce4affSfengbojiang              this is no longer allowed and the dictionary is no longer accessible.
1457*22ce4affSfengbojiang
1458*22ce4affSfengbojiang[compressed blocks]: #the-format-of-compressed_block
1459*22ce4affSfengbojiang
1460*22ce4affSfengbojiangIf a dictionary is provided by an external source,
1461*22ce4affSfengbojiangit should be loaded with great care, its content considered untrusted.
1462*22ce4affSfengbojiang
1463*22ce4affSfengbojiang
1464*22ce4affSfengbojiang
1465*22ce4affSfengbojiangAppendix A - Decoding tables for predefined codes
1466*22ce4affSfengbojiang-------------------------------------------------
1467*22ce4affSfengbojiang
1468*22ce4affSfengbojiangThis appendix contains FSE decoding tables
1469*22ce4affSfengbojiangfor the predefined literal length, match length, and offset codes.
1470*22ce4affSfengbojiangThe tables have been constructed using the algorithm as given above in chapter
1471*22ce4affSfengbojiang"from normalized distribution to decoding tables".
1472*22ce4affSfengbojiangThe tables here can be used as examples
1473*22ce4affSfengbojiangto crosscheck that an implementation build its decoding tables correctly.
1474*22ce4affSfengbojiang
1475*22ce4affSfengbojiang#### Literal Length Code:
1476*22ce4affSfengbojiang
1477*22ce4affSfengbojiang| State | Symbol | Number_Of_Bits | Base |
1478*22ce4affSfengbojiang| ----- | ------ | -------------- | ---- |
1479*22ce4affSfengbojiang|     0 |      0 |              4 |    0 |
1480*22ce4affSfengbojiang|     1 |      0 |              4 |   16 |
1481*22ce4affSfengbojiang|     2 |      1 |              5 |   32 |
1482*22ce4affSfengbojiang|     3 |      3 |              5 |    0 |
1483*22ce4affSfengbojiang|     4 |      4 |              5 |    0 |
1484*22ce4affSfengbojiang|     5 |      6 |              5 |    0 |
1485*22ce4affSfengbojiang|     6 |      7 |              5 |    0 |
1486*22ce4affSfengbojiang|     7 |      9 |              5 |    0 |
1487*22ce4affSfengbojiang|     8 |     10 |              5 |    0 |
1488*22ce4affSfengbojiang|     9 |     12 |              5 |    0 |
1489*22ce4affSfengbojiang|    10 |     14 |              6 |    0 |
1490*22ce4affSfengbojiang|    11 |     16 |              5 |    0 |
1491*22ce4affSfengbojiang|    12 |     18 |              5 |    0 |
1492*22ce4affSfengbojiang|    13 |     19 |              5 |    0 |
1493*22ce4affSfengbojiang|    14 |     21 |              5 |    0 |
1494*22ce4affSfengbojiang|    15 |     22 |              5 |    0 |
1495*22ce4affSfengbojiang|    16 |     24 |              5 |    0 |
1496*22ce4affSfengbojiang|    17 |     25 |              5 |   32 |
1497*22ce4affSfengbojiang|    18 |     26 |              5 |    0 |
1498*22ce4affSfengbojiang|    19 |     27 |              6 |    0 |
1499*22ce4affSfengbojiang|    20 |     29 |              6 |    0 |
1500*22ce4affSfengbojiang|    21 |     31 |              6 |    0 |
1501*22ce4affSfengbojiang|    22 |      0 |              4 |   32 |
1502*22ce4affSfengbojiang|    23 |      1 |              4 |    0 |
1503*22ce4affSfengbojiang|    24 |      2 |              5 |    0 |
1504*22ce4affSfengbojiang|    25 |      4 |              5 |   32 |
1505*22ce4affSfengbojiang|    26 |      5 |              5 |    0 |
1506*22ce4affSfengbojiang|    27 |      7 |              5 |   32 |
1507*22ce4affSfengbojiang|    28 |      8 |              5 |    0 |
1508*22ce4affSfengbojiang|    29 |     10 |              5 |   32 |
1509*22ce4affSfengbojiang|    30 |     11 |              5 |    0 |
1510*22ce4affSfengbojiang|    31 |     13 |              6 |    0 |
1511*22ce4affSfengbojiang|    32 |     16 |              5 |   32 |
1512*22ce4affSfengbojiang|    33 |     17 |              5 |    0 |
1513*22ce4affSfengbojiang|    34 |     19 |              5 |   32 |
1514*22ce4affSfengbojiang|    35 |     20 |              5 |    0 |
1515*22ce4affSfengbojiang|    36 |     22 |              5 |   32 |
1516*22ce4affSfengbojiang|    37 |     23 |              5 |    0 |
1517*22ce4affSfengbojiang|    38 |     25 |              4 |    0 |
1518*22ce4affSfengbojiang|    39 |     25 |              4 |   16 |
1519*22ce4affSfengbojiang|    40 |     26 |              5 |   32 |
1520*22ce4affSfengbojiang|    41 |     28 |              6 |    0 |
1521*22ce4affSfengbojiang|    42 |     30 |              6 |    0 |
1522*22ce4affSfengbojiang|    43 |      0 |              4 |   48 |
1523*22ce4affSfengbojiang|    44 |      1 |              4 |   16 |
1524*22ce4affSfengbojiang|    45 |      2 |              5 |   32 |
1525*22ce4affSfengbojiang|    46 |      3 |              5 |   32 |
1526*22ce4affSfengbojiang|    47 |      5 |              5 |   32 |
1527*22ce4affSfengbojiang|    48 |      6 |              5 |   32 |
1528*22ce4affSfengbojiang|    49 |      8 |              5 |   32 |
1529*22ce4affSfengbojiang|    50 |      9 |              5 |   32 |
1530*22ce4affSfengbojiang|    51 |     11 |              5 |   32 |
1531*22ce4affSfengbojiang|    52 |     12 |              5 |   32 |
1532*22ce4affSfengbojiang|    53 |     15 |              6 |    0 |
1533*22ce4affSfengbojiang|    54 |     17 |              5 |   32 |
1534*22ce4affSfengbojiang|    55 |     18 |              5 |   32 |
1535*22ce4affSfengbojiang|    56 |     20 |              5 |   32 |
1536*22ce4affSfengbojiang|    57 |     21 |              5 |   32 |
1537*22ce4affSfengbojiang|    58 |     23 |              5 |   32 |
1538*22ce4affSfengbojiang|    59 |     24 |              5 |   32 |
1539*22ce4affSfengbojiang|    60 |     35 |              6 |    0 |
1540*22ce4affSfengbojiang|    61 |     34 |              6 |    0 |
1541*22ce4affSfengbojiang|    62 |     33 |              6 |    0 |
1542*22ce4affSfengbojiang|    63 |     32 |              6 |    0 |
1543*22ce4affSfengbojiang
1544*22ce4affSfengbojiang#### Match Length Code:
1545*22ce4affSfengbojiang
1546*22ce4affSfengbojiang| State | Symbol | Number_Of_Bits | Base |
1547*22ce4affSfengbojiang| ----- | ------ | -------------- | ---- |
1548*22ce4affSfengbojiang|     0 |      0 |              6 |    0 |
1549*22ce4affSfengbojiang|     1 |      1 |              4 |    0 |
1550*22ce4affSfengbojiang|     2 |      2 |              5 |   32 |
1551*22ce4affSfengbojiang|     3 |      3 |              5 |    0 |
1552*22ce4affSfengbojiang|     4 |      5 |              5 |    0 |
1553*22ce4affSfengbojiang|     5 |      6 |              5 |    0 |
1554*22ce4affSfengbojiang|     6 |      8 |              5 |    0 |
1555*22ce4affSfengbojiang|     7 |     10 |              6 |    0 |
1556*22ce4affSfengbojiang|     8 |     13 |              6 |    0 |
1557*22ce4affSfengbojiang|     9 |     16 |              6 |    0 |
1558*22ce4affSfengbojiang|    10 |     19 |              6 |    0 |
1559*22ce4affSfengbojiang|    11 |     22 |              6 |    0 |
1560*22ce4affSfengbojiang|    12 |     25 |              6 |    0 |
1561*22ce4affSfengbojiang|    13 |     28 |              6 |    0 |
1562*22ce4affSfengbojiang|    14 |     31 |              6 |    0 |
1563*22ce4affSfengbojiang|    15 |     33 |              6 |    0 |
1564*22ce4affSfengbojiang|    16 |     35 |              6 |    0 |
1565*22ce4affSfengbojiang|    17 |     37 |              6 |    0 |
1566*22ce4affSfengbojiang|    18 |     39 |              6 |    0 |
1567*22ce4affSfengbojiang|    19 |     41 |              6 |    0 |
1568*22ce4affSfengbojiang|    20 |     43 |              6 |    0 |
1569*22ce4affSfengbojiang|    21 |     45 |              6 |    0 |
1570*22ce4affSfengbojiang|    22 |      1 |              4 |   16 |
1571*22ce4affSfengbojiang|    23 |      2 |              4 |    0 |
1572*22ce4affSfengbojiang|    24 |      3 |              5 |   32 |
1573*22ce4affSfengbojiang|    25 |      4 |              5 |    0 |
1574*22ce4affSfengbojiang|    26 |      6 |              5 |   32 |
1575*22ce4affSfengbojiang|    27 |      7 |              5 |    0 |
1576*22ce4affSfengbojiang|    28 |      9 |              6 |    0 |
1577*22ce4affSfengbojiang|    29 |     12 |              6 |    0 |
1578*22ce4affSfengbojiang|    30 |     15 |              6 |    0 |
1579*22ce4affSfengbojiang|    31 |     18 |              6 |    0 |
1580*22ce4affSfengbojiang|    32 |     21 |              6 |    0 |
1581*22ce4affSfengbojiang|    33 |     24 |              6 |    0 |
1582*22ce4affSfengbojiang|    34 |     27 |              6 |    0 |
1583*22ce4affSfengbojiang|    35 |     30 |              6 |    0 |
1584*22ce4affSfengbojiang|    36 |     32 |              6 |    0 |
1585*22ce4affSfengbojiang|    37 |     34 |              6 |    0 |
1586*22ce4affSfengbojiang|    38 |     36 |              6 |    0 |
1587*22ce4affSfengbojiang|    39 |     38 |              6 |    0 |
1588*22ce4affSfengbojiang|    40 |     40 |              6 |    0 |
1589*22ce4affSfengbojiang|    41 |     42 |              6 |    0 |
1590*22ce4affSfengbojiang|    42 |     44 |              6 |    0 |
1591*22ce4affSfengbojiang|    43 |      1 |              4 |   32 |
1592*22ce4affSfengbojiang|    44 |      1 |              4 |   48 |
1593*22ce4affSfengbojiang|    45 |      2 |              4 |   16 |
1594*22ce4affSfengbojiang|    46 |      4 |              5 |   32 |
1595*22ce4affSfengbojiang|    47 |      5 |              5 |   32 |
1596*22ce4affSfengbojiang|    48 |      7 |              5 |   32 |
1597*22ce4affSfengbojiang|    49 |      8 |              5 |   32 |
1598*22ce4affSfengbojiang|    50 |     11 |              6 |    0 |
1599*22ce4affSfengbojiang|    51 |     14 |              6 |    0 |
1600*22ce4affSfengbojiang|    52 |     17 |              6 |    0 |
1601*22ce4affSfengbojiang|    53 |     20 |              6 |    0 |
1602*22ce4affSfengbojiang|    54 |     23 |              6 |    0 |
1603*22ce4affSfengbojiang|    55 |     26 |              6 |    0 |
1604*22ce4affSfengbojiang|    56 |     29 |              6 |    0 |
1605*22ce4affSfengbojiang|    57 |     52 |              6 |    0 |
1606*22ce4affSfengbojiang|    58 |     51 |              6 |    0 |
1607*22ce4affSfengbojiang|    59 |     50 |              6 |    0 |
1608*22ce4affSfengbojiang|    60 |     49 |              6 |    0 |
1609*22ce4affSfengbojiang|    61 |     48 |              6 |    0 |
1610*22ce4affSfengbojiang|    62 |     47 |              6 |    0 |
1611*22ce4affSfengbojiang|    63 |     46 |              6 |    0 |
1612*22ce4affSfengbojiang
1613*22ce4affSfengbojiang#### Offset Code:
1614*22ce4affSfengbojiang
1615*22ce4affSfengbojiang| State | Symbol | Number_Of_Bits | Base |
1616*22ce4affSfengbojiang| ----- | ------ | -------------- | ---- |
1617*22ce4affSfengbojiang|     0 |      0 |              5 |    0 |
1618*22ce4affSfengbojiang|     1 |      6 |              4 |    0 |
1619*22ce4affSfengbojiang|     2 |      9 |              5 |    0 |
1620*22ce4affSfengbojiang|     3 |     15 |              5 |    0 |
1621*22ce4affSfengbojiang|     4 |     21 |              5 |    0 |
1622*22ce4affSfengbojiang|     5 |      3 |              5 |    0 |
1623*22ce4affSfengbojiang|     6 |      7 |              4 |    0 |
1624*22ce4affSfengbojiang|     7 |     12 |              5 |    0 |
1625*22ce4affSfengbojiang|     8 |     18 |              5 |    0 |
1626*22ce4affSfengbojiang|     9 |     23 |              5 |    0 |
1627*22ce4affSfengbojiang|    10 |      5 |              5 |    0 |
1628*22ce4affSfengbojiang|    11 |      8 |              4 |    0 |
1629*22ce4affSfengbojiang|    12 |     14 |              5 |    0 |
1630*22ce4affSfengbojiang|    13 |     20 |              5 |    0 |
1631*22ce4affSfengbojiang|    14 |      2 |              5 |    0 |
1632*22ce4affSfengbojiang|    15 |      7 |              4 |   16 |
1633*22ce4affSfengbojiang|    16 |     11 |              5 |    0 |
1634*22ce4affSfengbojiang|    17 |     17 |              5 |    0 |
1635*22ce4affSfengbojiang|    18 |     22 |              5 |    0 |
1636*22ce4affSfengbojiang|    19 |      4 |              5 |    0 |
1637*22ce4affSfengbojiang|    20 |      8 |              4 |   16 |
1638*22ce4affSfengbojiang|    21 |     13 |              5 |    0 |
1639*22ce4affSfengbojiang|    22 |     19 |              5 |    0 |
1640*22ce4affSfengbojiang|    23 |      1 |              5 |    0 |
1641*22ce4affSfengbojiang|    24 |      6 |              4 |   16 |
1642*22ce4affSfengbojiang|    25 |     10 |              5 |    0 |
1643*22ce4affSfengbojiang|    26 |     16 |              5 |    0 |
1644*22ce4affSfengbojiang|    27 |     28 |              5 |    0 |
1645*22ce4affSfengbojiang|    28 |     27 |              5 |    0 |
1646*22ce4affSfengbojiang|    29 |     26 |              5 |    0 |
1647*22ce4affSfengbojiang|    30 |     25 |              5 |    0 |
1648*22ce4affSfengbojiang|    31 |     24 |              5 |    0 |
1649*22ce4affSfengbojiang
1650*22ce4affSfengbojiang
1651*22ce4affSfengbojiang
1652*22ce4affSfengbojiangAppendix B - Resources for implementers
1653*22ce4affSfengbojiang-------------------------------------------------
1654*22ce4affSfengbojiang
1655*22ce4affSfengbojiangAn open source reference implementation is available on :
1656*22ce4affSfengbojianghttps://github.com/facebook/zstd
1657*22ce4affSfengbojiang
1658*22ce4affSfengbojiangThe project contains a frame generator, called [decodeCorpus],
1659*22ce4affSfengbojiangwhich can be used by any 3rd-party implementation
1660*22ce4affSfengbojiangto verify that a tested decoder is compliant with the specification.
1661*22ce4affSfengbojiang
1662*22ce4affSfengbojiang[decodeCorpus]: https://github.com/facebook/zstd/tree/v1.3.4/tests#decodecorpus---tool-to-generate-zstandard-frames-for-decoder-testing
1663*22ce4affSfengbojiang
1664*22ce4affSfengbojiang`decodeCorpus` generates random valid frames.
1665*22ce4affSfengbojiangA compliant decoder should be able to decode them all,
1666*22ce4affSfengbojiangor at least provide a meaningful error code explaining for which reason it cannot
1667*22ce4affSfengbojiang(memory limit restrictions for example).
1668*22ce4affSfengbojiang
1669*22ce4affSfengbojiang
1670*22ce4affSfengbojiangVersion changes
1671*22ce4affSfengbojiang---------------
1672*22ce4affSfengbojiang- 0.3.7 : clarifications for Repeat_Offsets
1673*22ce4affSfengbojiang- 0.3.6 : clarifications for Dictionary_ID
1674*22ce4affSfengbojiang- 0.3.5 : clarifications for Block_Maximum_Size
1675*22ce4affSfengbojiang- 0.3.4 : clarifications for FSE decoding table
1676*22ce4affSfengbojiang- 0.3.3 : clarifications for field Block_Size
1677*22ce4affSfengbojiang- 0.3.2 : remove additional block size restriction on compressed blocks
1678*22ce4affSfengbojiang- 0.3.1 : minor clarification regarding offset history update rules
1679*22ce4affSfengbojiang- 0.3.0 : minor edits to match RFC8478
1680*22ce4affSfengbojiang- 0.2.9 : clarifications for huffman weights direct representation, by Ulrich Kunitz
1681*22ce4affSfengbojiang- 0.2.8 : clarifications for IETF RFC discuss
1682*22ce4affSfengbojiang- 0.2.7 : clarifications from IETF RFC review, by Vijay Gurbani and Nick Terrell
1683*22ce4affSfengbojiang- 0.2.6 : fixed an error in huffman example, by Ulrich Kunitz
1684*22ce4affSfengbojiang- 0.2.5 : minor typos and clarifications
1685*22ce4affSfengbojiang- 0.2.4 : section restructuring, by Sean Purcell
1686*22ce4affSfengbojiang- 0.2.3 : clarified several details, by Sean Purcell
1687*22ce4affSfengbojiang- 0.2.2 : added predefined codes, by Johannes Rudolph
1688*22ce4affSfengbojiang- 0.2.1 : clarify field names, by Przemyslaw Skibinski
1689*22ce4affSfengbojiang- 0.2.0 : numerous format adjustments for zstd v0.8+
1690*22ce4affSfengbojiang- 0.1.2 : limit Huffman tree depth to 11 bits
1691*22ce4affSfengbojiang- 0.1.1 : reserved dictID ranges
1692*22ce4affSfengbojiang- 0.1.0 : initial release
1693