1*22ce4affSfengbojiangZstandard Compression Format 2*22ce4affSfengbojiang============================ 3*22ce4affSfengbojiang 4*22ce4affSfengbojiang### Notices 5*22ce4affSfengbojiang 6*22ce4affSfengbojiangCopyright (c) 2016-2020 Yann Collet, Facebook, Inc. 7*22ce4affSfengbojiang 8*22ce4affSfengbojiangPermission is granted to copy and distribute this document 9*22ce4affSfengbojiangfor any purpose and without charge, 10*22ce4affSfengbojiangincluding translations into other languages 11*22ce4affSfengbojiangand incorporation into compilations, 12*22ce4affSfengbojiangprovided that the copyright notice and this notice are preserved, 13*22ce4affSfengbojiangand that any substantive changes or deletions from the original 14*22ce4affSfengbojiangare clearly marked. 15*22ce4affSfengbojiangDistribution of this document is unlimited. 16*22ce4affSfengbojiang 17*22ce4affSfengbojiang### Version 18*22ce4affSfengbojiang 19*22ce4affSfengbojiang0.3.7 (2020-12-09) 20*22ce4affSfengbojiang 21*22ce4affSfengbojiang 22*22ce4affSfengbojiangIntroduction 23*22ce4affSfengbojiang------------ 24*22ce4affSfengbojiang 25*22ce4affSfengbojiangThe purpose of this document is to define a lossless compressed data format, 26*22ce4affSfengbojiangthat is independent of CPU type, operating system, 27*22ce4affSfengbojiangfile system and character set, suitable for 28*22ce4affSfengbojiangfile compression, pipe and streaming compression, 29*22ce4affSfengbojiangusing the [Zstandard algorithm](http://www.zstandard.org). 30*22ce4affSfengbojiangThe text of the specification assumes a basic background in programming 31*22ce4affSfengbojiangat the level of bits and other primitive data representations. 32*22ce4affSfengbojiang 33*22ce4affSfengbojiangThe data can be produced or consumed, 34*22ce4affSfengbojiangeven for an arbitrarily long sequentially presented input data stream, 35*22ce4affSfengbojiangusing only an a priori bounded amount of intermediate storage, 36*22ce4affSfengbojiangand hence can be used in data communications. 37*22ce4affSfengbojiangThe format uses the Zstandard compression method, 38*22ce4affSfengbojiangand optional [xxHash-64 checksum method](http://www.xxhash.org), 39*22ce4affSfengbojiangfor detection of data corruption. 40*22ce4affSfengbojiang 41*22ce4affSfengbojiangThe data format defined by this specification 42*22ce4affSfengbojiangdoes not attempt to allow random access to compressed data. 43*22ce4affSfengbojiang 44*22ce4affSfengbojiangUnless otherwise indicated below, 45*22ce4affSfengbojianga compliant compressor must produce data sets 46*22ce4affSfengbojiangthat conform to the specifications presented here. 47*22ce4affSfengbojiangIt doesn’t need to support all options though. 48*22ce4affSfengbojiang 49*22ce4affSfengbojiangA compliant decompressor must be able to decompress 50*22ce4affSfengbojiangat least one working set of parameters 51*22ce4affSfengbojiangthat conforms to the specifications presented here. 52*22ce4affSfengbojiangIt may also ignore informative fields, such as checksum. 53*22ce4affSfengbojiangWhenever it does not support a parameter defined in the compressed stream, 54*22ce4affSfengbojiangit must produce a non-ambiguous error code and associated error message 55*22ce4affSfengbojiangexplaining which parameter is unsupported. 56*22ce4affSfengbojiang 57*22ce4affSfengbojiangThis specification is intended for use by implementers of software 58*22ce4affSfengbojiangto compress data into Zstandard format and/or decompress data from Zstandard format. 59*22ce4affSfengbojiangThe Zstandard format is supported by an open source reference implementation, 60*22ce4affSfengbojiangwritten in portable C, and available at : https://github.com/facebook/zstd . 61*22ce4affSfengbojiang 62*22ce4affSfengbojiang 63*22ce4affSfengbojiang### Overall conventions 64*22ce4affSfengbojiangIn this document: 65*22ce4affSfengbojiang- square brackets i.e. `[` and `]` are used to indicate optional fields or parameters. 66*22ce4affSfengbojiang- the naming convention for identifiers is `Mixed_Case_With_Underscores` 67*22ce4affSfengbojiang 68*22ce4affSfengbojiang### Definitions 69*22ce4affSfengbojiangContent compressed by Zstandard is transformed into a Zstandard __frame__. 70*22ce4affSfengbojiangMultiple frames can be appended into a single file or stream. 71*22ce4affSfengbojiangA frame is completely independent, has a defined beginning and end, 72*22ce4affSfengbojiangand a set of parameters which tells the decoder how to decompress it. 73*22ce4affSfengbojiang 74*22ce4affSfengbojiangA frame encapsulates one or multiple __blocks__. 75*22ce4affSfengbojiangEach block contains arbitrary content, which is described by its header, 76*22ce4affSfengbojiangand has a guaranteed maximum content size, which depends on frame parameters. 77*22ce4affSfengbojiangUnlike frames, each block depends on previous blocks for proper decoding. 78*22ce4affSfengbojiangHowever, each block can be decompressed without waiting for its successor, 79*22ce4affSfengbojiangallowing streaming operations. 80*22ce4affSfengbojiang 81*22ce4affSfengbojiangOverview 82*22ce4affSfengbojiang--------- 83*22ce4affSfengbojiang- [Frames](#frames) 84*22ce4affSfengbojiang - [Zstandard frames](#zstandard-frames) 85*22ce4affSfengbojiang - [Blocks](#blocks) 86*22ce4affSfengbojiang - [Literals Section](#literals-section) 87*22ce4affSfengbojiang - [Sequences Section](#sequences-section) 88*22ce4affSfengbojiang - [Sequence Execution](#sequence-execution) 89*22ce4affSfengbojiang - [Skippable frames](#skippable-frames) 90*22ce4affSfengbojiang- [Entropy Encoding](#entropy-encoding) 91*22ce4affSfengbojiang - [FSE](#fse) 92*22ce4affSfengbojiang - [Huffman Coding](#huffman-coding) 93*22ce4affSfengbojiang- [Dictionary Format](#dictionary-format) 94*22ce4affSfengbojiang 95*22ce4affSfengbojiangFrames 96*22ce4affSfengbojiang------ 97*22ce4affSfengbojiangZstandard compressed data is made of one or more __frames__. 98*22ce4affSfengbojiangEach frame is independent and can be decompressed independently of other frames. 99*22ce4affSfengbojiangThe decompressed content of multiple concatenated frames is the concatenation of 100*22ce4affSfengbojiangeach frame decompressed content. 101*22ce4affSfengbojiang 102*22ce4affSfengbojiangThere are two frame formats defined by Zstandard: 103*22ce4affSfengbojiang Zstandard frames and Skippable frames. 104*22ce4affSfengbojiangZstandard frames contain compressed data, while 105*22ce4affSfengbojiangskippable frames contain custom user metadata. 106*22ce4affSfengbojiang 107*22ce4affSfengbojiang## Zstandard frames 108*22ce4affSfengbojiangThe structure of a single Zstandard frame is following: 109*22ce4affSfengbojiang 110*22ce4affSfengbojiang| `Magic_Number` | `Frame_Header` |`Data_Block`| [More data blocks] | [`Content_Checksum`] | 111*22ce4affSfengbojiang|:--------------:|:--------------:|:----------:| ------------------ |:--------------------:| 112*22ce4affSfengbojiang| 4 bytes | 2-14 bytes | n bytes | | 0-4 bytes | 113*22ce4affSfengbojiang 114*22ce4affSfengbojiang__`Magic_Number`__ 115*22ce4affSfengbojiang 116*22ce4affSfengbojiang4 Bytes, __little-endian__ format. 117*22ce4affSfengbojiangValue : 0xFD2FB528 118*22ce4affSfengbojiangNote: This value was selected to be less probable to find at the beginning of some random file. 119*22ce4affSfengbojiangIt avoids trivial patterns (0x00, 0xFF, repeated bytes, increasing bytes, etc.), 120*22ce4affSfengbojiangcontains byte values outside of ASCII range, 121*22ce4affSfengbojiangand doesn't map into UTF8 space. 122*22ce4affSfengbojiangIt reduces the chances that a text file represent this value by accident. 123*22ce4affSfengbojiang 124*22ce4affSfengbojiang__`Frame_Header`__ 125*22ce4affSfengbojiang 126*22ce4affSfengbojiang2 to 14 Bytes, detailed in [`Frame_Header`](#frame_header). 127*22ce4affSfengbojiang 128*22ce4affSfengbojiang__`Data_Block`__ 129*22ce4affSfengbojiang 130*22ce4affSfengbojiangDetailed in [`Blocks`](#blocks). 131*22ce4affSfengbojiangThat’s where compressed data is stored. 132*22ce4affSfengbojiang 133*22ce4affSfengbojiang__`Content_Checksum`__ 134*22ce4affSfengbojiang 135*22ce4affSfengbojiangAn optional 32-bit checksum, only present if `Content_Checksum_flag` is set. 136*22ce4affSfengbojiangThe content checksum is the result 137*22ce4affSfengbojiangof [xxh64() hash function](http://www.xxhash.org) 138*22ce4affSfengbojiangdigesting the original (decoded) data as input, and a seed of zero. 139*22ce4affSfengbojiangThe low 4 bytes of the checksum are stored in __little-endian__ format. 140*22ce4affSfengbojiang 141*22ce4affSfengbojiang### `Frame_Header` 142*22ce4affSfengbojiang 143*22ce4affSfengbojiangThe `Frame_Header` has a variable size, with a minimum of 2 bytes, 144*22ce4affSfengbojiangand up to 14 bytes depending on optional parameters. 145*22ce4affSfengbojiangThe structure of `Frame_Header` is following: 146*22ce4affSfengbojiang 147*22ce4affSfengbojiang| `Frame_Header_Descriptor` | [`Window_Descriptor`] | [`Dictionary_ID`] | [`Frame_Content_Size`] | 148*22ce4affSfengbojiang| ------------------------- | --------------------- | ----------------- | ---------------------- | 149*22ce4affSfengbojiang| 1 byte | 0-1 byte | 0-4 bytes | 0-8 bytes | 150*22ce4affSfengbojiang 151*22ce4affSfengbojiang#### `Frame_Header_Descriptor` 152*22ce4affSfengbojiang 153*22ce4affSfengbojiangThe first header's byte is called the `Frame_Header_Descriptor`. 154*22ce4affSfengbojiangIt describes which other fields are present. 155*22ce4affSfengbojiangDecoding this byte is enough to tell the size of `Frame_Header`. 156*22ce4affSfengbojiang 157*22ce4affSfengbojiang| Bit number | Field name | 158*22ce4affSfengbojiang| ---------- | ---------- | 159*22ce4affSfengbojiang| 7-6 | `Frame_Content_Size_flag` | 160*22ce4affSfengbojiang| 5 | `Single_Segment_flag` | 161*22ce4affSfengbojiang| 4 | `Unused_bit` | 162*22ce4affSfengbojiang| 3 | `Reserved_bit` | 163*22ce4affSfengbojiang| 2 | `Content_Checksum_flag` | 164*22ce4affSfengbojiang| 1-0 | `Dictionary_ID_flag` | 165*22ce4affSfengbojiang 166*22ce4affSfengbojiangIn this table, bit 7 is the highest bit, while bit 0 is the lowest one. 167*22ce4affSfengbojiang 168*22ce4affSfengbojiang__`Frame_Content_Size_flag`__ 169*22ce4affSfengbojiang 170*22ce4affSfengbojiangThis is a 2-bits flag (`= Frame_Header_Descriptor >> 6`), 171*22ce4affSfengbojiangspecifying if `Frame_Content_Size` (the decompressed data size) 172*22ce4affSfengbojiangis provided within the header. 173*22ce4affSfengbojiang`Flag_Value` provides `FCS_Field_Size`, 174*22ce4affSfengbojiangwhich is the number of bytes used by `Frame_Content_Size` 175*22ce4affSfengbojiangaccording to the following table: 176*22ce4affSfengbojiang 177*22ce4affSfengbojiang| `Flag_Value` | 0 | 1 | 2 | 3 | 178*22ce4affSfengbojiang| -------------- | ------ | --- | --- | --- | 179*22ce4affSfengbojiang|`FCS_Field_Size`| 0 or 1 | 2 | 4 | 8 | 180*22ce4affSfengbojiang 181*22ce4affSfengbojiangWhen `Flag_Value` is `0`, `FCS_Field_Size` depends on `Single_Segment_flag` : 182*22ce4affSfengbojiangif `Single_Segment_flag` is set, `FCS_Field_Size` is 1. 183*22ce4affSfengbojiangOtherwise, `FCS_Field_Size` is 0 : `Frame_Content_Size` is not provided. 184*22ce4affSfengbojiang 185*22ce4affSfengbojiang__`Single_Segment_flag`__ 186*22ce4affSfengbojiang 187*22ce4affSfengbojiangIf this flag is set, 188*22ce4affSfengbojiangdata must be regenerated within a single continuous memory segment. 189*22ce4affSfengbojiang 190*22ce4affSfengbojiangIn this case, `Window_Descriptor` byte is skipped, 191*22ce4affSfengbojiangbut `Frame_Content_Size` is necessarily present. 192*22ce4affSfengbojiangAs a consequence, the decoder must allocate a memory segment 193*22ce4affSfengbojiangof size equal or larger than `Frame_Content_Size`. 194*22ce4affSfengbojiang 195*22ce4affSfengbojiangIn order to preserve the decoder from unreasonable memory requirements, 196*22ce4affSfengbojianga decoder is allowed to reject a compressed frame 197*22ce4affSfengbojiangwhich requests a memory size beyond decoder's authorized range. 198*22ce4affSfengbojiang 199*22ce4affSfengbojiangFor broader compatibility, decoders are recommended to support 200*22ce4affSfengbojiangmemory sizes of at least 8 MB. 201*22ce4affSfengbojiangThis is only a recommendation, 202*22ce4affSfengbojiangeach decoder is free to support higher or lower limits, 203*22ce4affSfengbojiangdepending on local limitations. 204*22ce4affSfengbojiang 205*22ce4affSfengbojiang__`Unused_bit`__ 206*22ce4affSfengbojiang 207*22ce4affSfengbojiangA decoder compliant with this specification version shall not interpret this bit. 208*22ce4affSfengbojiangIt might be used in any future version, 209*22ce4affSfengbojiangto signal a property which is transparent to properly decode the frame. 210*22ce4affSfengbojiangAn encoder compliant with this specification version must set this bit to zero. 211*22ce4affSfengbojiang 212*22ce4affSfengbojiang__`Reserved_bit`__ 213*22ce4affSfengbojiang 214*22ce4affSfengbojiangThis bit is reserved for some future feature. 215*22ce4affSfengbojiangIts value _must be zero_. 216*22ce4affSfengbojiangA decoder compliant with this specification version must ensure it is not set. 217*22ce4affSfengbojiangThis bit may be used in a future revision, 218*22ce4affSfengbojiangto signal a feature that must be interpreted to decode the frame correctly. 219*22ce4affSfengbojiang 220*22ce4affSfengbojiang__`Content_Checksum_flag`__ 221*22ce4affSfengbojiang 222*22ce4affSfengbojiangIf this flag is set, a 32-bits `Content_Checksum` will be present at frame's end. 223*22ce4affSfengbojiangSee `Content_Checksum` paragraph. 224*22ce4affSfengbojiang 225*22ce4affSfengbojiang__`Dictionary_ID_flag`__ 226*22ce4affSfengbojiang 227*22ce4affSfengbojiangThis is a 2-bits flag (`= FHD & 3`), 228*22ce4affSfengbojiangtelling if a dictionary ID is provided within the header. 229*22ce4affSfengbojiangIt also specifies the size of this field as `DID_Field_Size`. 230*22ce4affSfengbojiang 231*22ce4affSfengbojiang|`Flag_Value` | 0 | 1 | 2 | 3 | 232*22ce4affSfengbojiang| -------------- | --- | --- | --- | --- | 233*22ce4affSfengbojiang|`DID_Field_Size`| 0 | 1 | 2 | 4 | 234*22ce4affSfengbojiang 235*22ce4affSfengbojiang#### `Window_Descriptor` 236*22ce4affSfengbojiang 237*22ce4affSfengbojiangProvides guarantees on minimum memory buffer required to decompress a frame. 238*22ce4affSfengbojiangThis information is important for decoders to allocate enough memory. 239*22ce4affSfengbojiang 240*22ce4affSfengbojiangThe `Window_Descriptor` byte is optional. 241*22ce4affSfengbojiangWhen `Single_Segment_flag` is set, `Window_Descriptor` is not present. 242*22ce4affSfengbojiangIn this case, `Window_Size` is `Frame_Content_Size`, 243*22ce4affSfengbojiangwhich can be any value from 0 to 2^64-1 bytes (16 ExaBytes). 244*22ce4affSfengbojiang 245*22ce4affSfengbojiang| Bit numbers | 7-3 | 2-0 | 246*22ce4affSfengbojiang| ----------- | ---------- | ---------- | 247*22ce4affSfengbojiang| Field name | `Exponent` | `Mantissa` | 248*22ce4affSfengbojiang 249*22ce4affSfengbojiangThe minimum memory buffer size is called `Window_Size`. 250*22ce4affSfengbojiangIt is described by the following formulas : 251*22ce4affSfengbojiang``` 252*22ce4affSfengbojiangwindowLog = 10 + Exponent; 253*22ce4affSfengbojiangwindowBase = 1 << windowLog; 254*22ce4affSfengbojiangwindowAdd = (windowBase / 8) * Mantissa; 255*22ce4affSfengbojiangWindow_Size = windowBase + windowAdd; 256*22ce4affSfengbojiang``` 257*22ce4affSfengbojiangThe minimum `Window_Size` is 1 KB. 258*22ce4affSfengbojiangThe maximum `Window_Size` is `(1<<41) + 7*(1<<38)` bytes, which is 3.75 TB. 259*22ce4affSfengbojiang 260*22ce4affSfengbojiangIn general, larger `Window_Size` tend to improve compression ratio, 261*22ce4affSfengbojiangbut at the cost of memory usage. 262*22ce4affSfengbojiang 263*22ce4affSfengbojiangTo properly decode compressed data, 264*22ce4affSfengbojianga decoder will need to allocate a buffer of at least `Window_Size` bytes. 265*22ce4affSfengbojiang 266*22ce4affSfengbojiangIn order to preserve decoder from unreasonable memory requirements, 267*22ce4affSfengbojianga decoder is allowed to reject a compressed frame 268*22ce4affSfengbojiangwhich requests a memory size beyond decoder's authorized range. 269*22ce4affSfengbojiang 270*22ce4affSfengbojiangFor improved interoperability, 271*22ce4affSfengbojiangit's recommended for decoders to support `Window_Size` of up to 8 MB, 272*22ce4affSfengbojiangand it's recommended for encoders to not generate frame requiring `Window_Size` larger than 8 MB. 273*22ce4affSfengbojiangIt's merely a recommendation though, 274*22ce4affSfengbojiangdecoders are free to support larger or lower limits, 275*22ce4affSfengbojiangdepending on local limitations. 276*22ce4affSfengbojiang 277*22ce4affSfengbojiang#### `Dictionary_ID` 278*22ce4affSfengbojiang 279*22ce4affSfengbojiangThis is a variable size field, which contains 280*22ce4affSfengbojiangthe ID of the dictionary required to properly decode the frame. 281*22ce4affSfengbojiang`Dictionary_ID` field is optional. When it's not present, 282*22ce4affSfengbojiangit's up to the decoder to know which dictionary to use. 283*22ce4affSfengbojiang 284*22ce4affSfengbojiang`Dictionary_ID` field size is provided by `DID_Field_Size`. 285*22ce4affSfengbojiang`DID_Field_Size` is directly derived from value of `Dictionary_ID_flag`. 286*22ce4affSfengbojiang1 byte can represent an ID 0-255. 287*22ce4affSfengbojiang2 bytes can represent an ID 0-65535. 288*22ce4affSfengbojiang4 bytes can represent an ID 0-4294967295. 289*22ce4affSfengbojiangFormat is __little-endian__. 290*22ce4affSfengbojiang 291*22ce4affSfengbojiangIt's allowed to represent a small ID (for example `13`) 292*22ce4affSfengbojiangwith a large 4-bytes dictionary ID, even if it is less efficient. 293*22ce4affSfengbojiang 294*22ce4affSfengbojiangA value of `0` has same meaning as no `Dictionary_ID`, 295*22ce4affSfengbojiangin which case the frame may or may not need a dictionary to be decoded, 296*22ce4affSfengbojiangand the ID of such a dictionary is not specified. 297*22ce4affSfengbojiangThe decoder must know this information by other means. 298*22ce4affSfengbojiang 299*22ce4affSfengbojiang#### `Frame_Content_Size` 300*22ce4affSfengbojiang 301*22ce4affSfengbojiangThis is the original (uncompressed) size. This information is optional. 302*22ce4affSfengbojiang`Frame_Content_Size` uses a variable number of bytes, provided by `FCS_Field_Size`. 303*22ce4affSfengbojiang`FCS_Field_Size` is provided by the value of `Frame_Content_Size_flag`. 304*22ce4affSfengbojiang`FCS_Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes. 305*22ce4affSfengbojiang 306*22ce4affSfengbojiang| `FCS_Field_Size` | Range | 307*22ce4affSfengbojiang| ---------------- | ---------- | 308*22ce4affSfengbojiang| 0 | unknown | 309*22ce4affSfengbojiang| 1 | 0 - 255 | 310*22ce4affSfengbojiang| 2 | 256 - 65791| 311*22ce4affSfengbojiang| 4 | 0 - 2^32-1 | 312*22ce4affSfengbojiang| 8 | 0 - 2^64-1 | 313*22ce4affSfengbojiang 314*22ce4affSfengbojiang`Frame_Content_Size` format is __little-endian__. 315*22ce4affSfengbojiangWhen `FCS_Field_Size` is 1, 4 or 8 bytes, the value is read directly. 316*22ce4affSfengbojiangWhen `FCS_Field_Size` is 2, _the offset of 256 is added_. 317*22ce4affSfengbojiangIt's allowed to represent a small size (for example `18`) using any compatible variant. 318*22ce4affSfengbojiang 319*22ce4affSfengbojiang 320*22ce4affSfengbojiangBlocks 321*22ce4affSfengbojiang------- 322*22ce4affSfengbojiang 323*22ce4affSfengbojiangAfter `Magic_Number` and `Frame_Header`, there are some number of blocks. 324*22ce4affSfengbojiangEach frame must have at least one block, 325*22ce4affSfengbojiangbut there is no upper limit on the number of blocks per frame. 326*22ce4affSfengbojiang 327*22ce4affSfengbojiangThe structure of a block is as follows: 328*22ce4affSfengbojiang 329*22ce4affSfengbojiang| `Block_Header` | `Block_Content` | 330*22ce4affSfengbojiang|:--------------:|:---------------:| 331*22ce4affSfengbojiang| 3 bytes | n bytes | 332*22ce4affSfengbojiang 333*22ce4affSfengbojiang__`Block_Header`__ 334*22ce4affSfengbojiang 335*22ce4affSfengbojiang`Block_Header` uses 3 bytes, written using __little-endian__ convention. 336*22ce4affSfengbojiangIt contains 3 fields : 337*22ce4affSfengbojiang 338*22ce4affSfengbojiang| `Last_Block` | `Block_Type` | `Block_Size` | 339*22ce4affSfengbojiang|:------------:|:------------:|:------------:| 340*22ce4affSfengbojiang| bit 0 | bits 1-2 | bits 3-23 | 341*22ce4affSfengbojiang 342*22ce4affSfengbojiang__`Last_Block`__ 343*22ce4affSfengbojiang 344*22ce4affSfengbojiangThe lowest bit signals if this block is the last one. 345*22ce4affSfengbojiangThe frame will end after this last block. 346*22ce4affSfengbojiangIt may be followed by an optional `Content_Checksum` 347*22ce4affSfengbojiang(see [Zstandard Frames](#zstandard-frames)). 348*22ce4affSfengbojiang 349*22ce4affSfengbojiang__`Block_Type`__ 350*22ce4affSfengbojiang 351*22ce4affSfengbojiangThe next 2 bits represent the `Block_Type`. 352*22ce4affSfengbojiang`Block_Type` influences the meaning of `Block_Size`. 353*22ce4affSfengbojiangThere are 4 block types : 354*22ce4affSfengbojiang 355*22ce4affSfengbojiang| Value | 0 | 1 | 2 | 3 | 356*22ce4affSfengbojiang| ------------ | ----------- | ----------- | ------------------ | --------- | 357*22ce4affSfengbojiang| `Block_Type` | `Raw_Block` | `RLE_Block` | `Compressed_Block` | `Reserved`| 358*22ce4affSfengbojiang 359*22ce4affSfengbojiang- `Raw_Block` - this is an uncompressed block. 360*22ce4affSfengbojiang `Block_Content` contains `Block_Size` bytes. 361*22ce4affSfengbojiang 362*22ce4affSfengbojiang- `RLE_Block` - this is a single byte, repeated `Block_Size` times. 363*22ce4affSfengbojiang `Block_Content` consists of a single byte. 364*22ce4affSfengbojiang On the decompression side, this byte must be repeated `Block_Size` times. 365*22ce4affSfengbojiang 366*22ce4affSfengbojiang- `Compressed_Block` - this is a [Zstandard compressed block](#compressed-blocks), 367*22ce4affSfengbojiang explained later on. 368*22ce4affSfengbojiang `Block_Size` is the length of `Block_Content`, the compressed data. 369*22ce4affSfengbojiang The decompressed size is not known, 370*22ce4affSfengbojiang but its maximum possible value is guaranteed (see below) 371*22ce4affSfengbojiang 372*22ce4affSfengbojiang- `Reserved` - this is not a block. 373*22ce4affSfengbojiang This value cannot be used with current version of this specification. 374*22ce4affSfengbojiang If such a value is present, it is considered corrupted data. 375*22ce4affSfengbojiang 376*22ce4affSfengbojiang__`Block_Size`__ 377*22ce4affSfengbojiang 378*22ce4affSfengbojiangThe upper 21 bits of `Block_Header` represent the `Block_Size`. 379*22ce4affSfengbojiang 380*22ce4affSfengbojiangWhen `Block_Type` is `Compressed_Block` or `Raw_Block`, 381*22ce4affSfengbojiang`Block_Size` is the size of `Block_Content` (hence excluding `Block_Header`). 382*22ce4affSfengbojiang 383*22ce4affSfengbojiangWhen `Block_Type` is `RLE_Block`, since `Block_Content`’s size is always 1, 384*22ce4affSfengbojiang`Block_Size` represents the number of times this byte must be repeated. 385*22ce4affSfengbojiang 386*22ce4affSfengbojiang`Block_Size` is limited by `Block_Maximum_Size` (see below). 387*22ce4affSfengbojiang 388*22ce4affSfengbojiang__`Block_Content`__ and __`Block_Maximum_Size`__ 389*22ce4affSfengbojiang 390*22ce4affSfengbojiangThe size of `Block_Content` is limited by `Block_Maximum_Size`, 391*22ce4affSfengbojiangwhich is the smallest of: 392*22ce4affSfengbojiang- `Window_Size` 393*22ce4affSfengbojiang- 128 KB 394*22ce4affSfengbojiang 395*22ce4affSfengbojiang`Block_Maximum_Size` is constant for a given frame. 396*22ce4affSfengbojiangThis maximum is applicable to both the decompressed size 397*22ce4affSfengbojiangand the compressed size of any block in the frame. 398*22ce4affSfengbojiang 399*22ce4affSfengbojiangThe reasoning for this limit is that a decoder can read this information 400*22ce4affSfengbojiangat the beginning of a frame and use it to allocate buffers. 401*22ce4affSfengbojiangThe guarantees on the size of blocks ensure that 402*22ce4affSfengbojiangthe buffers will be large enough for any following block of the valid frame. 403*22ce4affSfengbojiang 404*22ce4affSfengbojiang 405*22ce4affSfengbojiangCompressed Blocks 406*22ce4affSfengbojiang----------------- 407*22ce4affSfengbojiangTo decompress a compressed block, the compressed size must be provided 408*22ce4affSfengbojiangfrom `Block_Size` field within `Block_Header`. 409*22ce4affSfengbojiang 410*22ce4affSfengbojiangA compressed block consists of 2 sections : 411*22ce4affSfengbojiang- [Literals Section](#literals-section) 412*22ce4affSfengbojiang- [Sequences Section](#sequences-section) 413*22ce4affSfengbojiang 414*22ce4affSfengbojiangThe results of the two sections are then combined to produce the decompressed 415*22ce4affSfengbojiangdata in [Sequence Execution](#sequence-execution) 416*22ce4affSfengbojiang 417*22ce4affSfengbojiang#### Prerequisites 418*22ce4affSfengbojiangTo decode a compressed block, the following elements are necessary : 419*22ce4affSfengbojiang- Previous decoded data, up to a distance of `Window_Size`, 420*22ce4affSfengbojiang or beginning of the Frame, whichever is smaller. 421*22ce4affSfengbojiang- List of "recent offsets" from previous `Compressed_Block`. 422*22ce4affSfengbojiang- The previous Huffman tree, required by `Treeless_Literals_Block` type 423*22ce4affSfengbojiang- Previous FSE decoding tables, required by `Repeat_Mode` 424*22ce4affSfengbojiang for each symbol type (literals lengths, match lengths, offsets) 425*22ce4affSfengbojiang 426*22ce4affSfengbojiangNote that decoding tables aren't always from the previous `Compressed_Block`. 427*22ce4affSfengbojiang 428*22ce4affSfengbojiang- Every decoding table can come from a dictionary. 429*22ce4affSfengbojiang- The Huffman tree comes from the previous `Compressed_Literals_Block`. 430*22ce4affSfengbojiang 431*22ce4affSfengbojiangLiterals Section 432*22ce4affSfengbojiang---------------- 433*22ce4affSfengbojiangAll literals are regrouped in the first part of the block. 434*22ce4affSfengbojiangThey can be decoded first, and then copied during [Sequence Execution], 435*22ce4affSfengbojiangor they can be decoded on the flow during [Sequence Execution]. 436*22ce4affSfengbojiang 437*22ce4affSfengbojiangLiterals can be stored uncompressed or compressed using Huffman prefix codes. 438*22ce4affSfengbojiangWhen compressed, an optional tree description can be present, 439*22ce4affSfengbojiangfollowed by 1 or 4 streams. 440*22ce4affSfengbojiang 441*22ce4affSfengbojiang| `Literals_Section_Header` | [`Huffman_Tree_Description`] | [jumpTable] | Stream1 | [Stream2] | [Stream3] | [Stream4] | 442*22ce4affSfengbojiang| ------------------------- | ---------------------------- | ----------- | ------- | --------- | --------- | --------- | 443*22ce4affSfengbojiang 444*22ce4affSfengbojiang 445*22ce4affSfengbojiang### `Literals_Section_Header` 446*22ce4affSfengbojiang 447*22ce4affSfengbojiangHeader is in charge of describing how literals are packed. 448*22ce4affSfengbojiangIt's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes, 449*22ce4affSfengbojiangusing __little-endian__ convention. 450*22ce4affSfengbojiang 451*22ce4affSfengbojiang| `Literals_Block_Type` | `Size_Format` | `Regenerated_Size` | [`Compressed_Size`] | 452*22ce4affSfengbojiang| --------------------- | ------------- | ------------------ | ------------------- | 453*22ce4affSfengbojiang| 2 bits | 1 - 2 bits | 5 - 20 bits | 0 - 18 bits | 454*22ce4affSfengbojiang 455*22ce4affSfengbojiangIn this representation, bits on the left are the lowest bits. 456*22ce4affSfengbojiang 457*22ce4affSfengbojiang__`Literals_Block_Type`__ 458*22ce4affSfengbojiang 459*22ce4affSfengbojiangThis field uses 2 lowest bits of first byte, describing 4 different block types : 460*22ce4affSfengbojiang 461*22ce4affSfengbojiang| `Literals_Block_Type` | Value | 462*22ce4affSfengbojiang| --------------------------- | ----- | 463*22ce4affSfengbojiang| `Raw_Literals_Block` | 0 | 464*22ce4affSfengbojiang| `RLE_Literals_Block` | 1 | 465*22ce4affSfengbojiang| `Compressed_Literals_Block` | 2 | 466*22ce4affSfengbojiang| `Treeless_Literals_Block` | 3 | 467*22ce4affSfengbojiang 468*22ce4affSfengbojiang- `Raw_Literals_Block` - Literals are stored uncompressed. 469*22ce4affSfengbojiang- `RLE_Literals_Block` - Literals consist of a single byte value 470*22ce4affSfengbojiang repeated `Regenerated_Size` times. 471*22ce4affSfengbojiang- `Compressed_Literals_Block` - This is a standard Huffman-compressed block, 472*22ce4affSfengbojiang starting with a Huffman tree description. 473*22ce4affSfengbojiang See details below. 474*22ce4affSfengbojiang- `Treeless_Literals_Block` - This is a Huffman-compressed block, 475*22ce4affSfengbojiang using Huffman tree _from previous Huffman-compressed literals block_. 476*22ce4affSfengbojiang `Huffman_Tree_Description` will be skipped. 477*22ce4affSfengbojiang Note: If this mode is triggered without any previous Huffman-table in the frame 478*22ce4affSfengbojiang (or [dictionary](#dictionary-format)), this should be treated as data corruption. 479*22ce4affSfengbojiang 480*22ce4affSfengbojiang__`Size_Format`__ 481*22ce4affSfengbojiang 482*22ce4affSfengbojiang`Size_Format` is divided into 2 families : 483*22ce4affSfengbojiang 484*22ce4affSfengbojiang- For `Raw_Literals_Block` and `RLE_Literals_Block`, 485*22ce4affSfengbojiang it's only necessary to decode `Regenerated_Size`. 486*22ce4affSfengbojiang There is no `Compressed_Size` field. 487*22ce4affSfengbojiang- For `Compressed_Block` and `Treeless_Literals_Block`, 488*22ce4affSfengbojiang it's required to decode both `Compressed_Size` 489*22ce4affSfengbojiang and `Regenerated_Size` (the decompressed size). 490*22ce4affSfengbojiang It's also necessary to decode the number of streams (1 or 4). 491*22ce4affSfengbojiang 492*22ce4affSfengbojiangFor values spanning several bytes, convention is __little-endian__. 493*22ce4affSfengbojiang 494*22ce4affSfengbojiang__`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ : 495*22ce4affSfengbojiang 496*22ce4affSfengbojiang`Size_Format` uses 1 _or_ 2 bits. 497*22ce4affSfengbojiangIts value is : `Size_Format = (Literals_Section_Header[0]>>2) & 3` 498*22ce4affSfengbojiang 499*22ce4affSfengbojiang- `Size_Format` == 00 or 10 : `Size_Format` uses 1 bit. 500*22ce4affSfengbojiang `Regenerated_Size` uses 5 bits (0-31). 501*22ce4affSfengbojiang `Literals_Section_Header` uses 1 byte. 502*22ce4affSfengbojiang `Regenerated_Size = Literals_Section_Header[0]>>3` 503*22ce4affSfengbojiang- `Size_Format` == 01 : `Size_Format` uses 2 bits. 504*22ce4affSfengbojiang `Regenerated_Size` uses 12 bits (0-4095). 505*22ce4affSfengbojiang `Literals_Section_Header` uses 2 bytes. 506*22ce4affSfengbojiang `Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4)` 507*22ce4affSfengbojiang- `Size_Format` == 11 : `Size_Format` uses 2 bits. 508*22ce4affSfengbojiang `Regenerated_Size` uses 20 bits (0-1048575). 509*22ce4affSfengbojiang `Literals_Section_Header` uses 3 bytes. 510*22ce4affSfengbojiang `Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4) + (Literals_Section_Header[2]<<12)` 511*22ce4affSfengbojiang 512*22ce4affSfengbojiangOnly Stream1 is present for these cases. 513*22ce4affSfengbojiangNote : it's allowed to represent a short value (for example `13`) 514*22ce4affSfengbojiangusing a long format, even if it's less efficient. 515*22ce4affSfengbojiang 516*22ce4affSfengbojiang__`Size_Format` for `Compressed_Literals_Block` and `Treeless_Literals_Block`__ : 517*22ce4affSfengbojiang 518*22ce4affSfengbojiang`Size_Format` always uses 2 bits. 519*22ce4affSfengbojiang 520*22ce4affSfengbojiang- `Size_Format` == 00 : _A single stream_. 521*22ce4affSfengbojiang Both `Regenerated_Size` and `Compressed_Size` use 10 bits (0-1023). 522*22ce4affSfengbojiang `Literals_Section_Header` uses 3 bytes. 523*22ce4affSfengbojiang- `Size_Format` == 01 : 4 streams. 524*22ce4affSfengbojiang Both `Regenerated_Size` and `Compressed_Size` use 10 bits (0-1023). 525*22ce4affSfengbojiang `Literals_Section_Header` uses 3 bytes. 526*22ce4affSfengbojiang- `Size_Format` == 10 : 4 streams. 527*22ce4affSfengbojiang Both `Regenerated_Size` and `Compressed_Size` use 14 bits (0-16383). 528*22ce4affSfengbojiang `Literals_Section_Header` uses 4 bytes. 529*22ce4affSfengbojiang- `Size_Format` == 11 : 4 streams. 530*22ce4affSfengbojiang Both `Regenerated_Size` and `Compressed_Size` use 18 bits (0-262143). 531*22ce4affSfengbojiang `Literals_Section_Header` uses 5 bytes. 532*22ce4affSfengbojiang 533*22ce4affSfengbojiangBoth `Compressed_Size` and `Regenerated_Size` fields follow __little-endian__ convention. 534*22ce4affSfengbojiangNote: `Compressed_Size` __includes__ the size of the Huffman Tree description 535*22ce4affSfengbojiang_when_ it is present. 536*22ce4affSfengbojiang 537*22ce4affSfengbojiang#### Raw Literals Block 538*22ce4affSfengbojiangThe data in Stream1 is `Regenerated_Size` bytes long, 539*22ce4affSfengbojiangit contains the raw literals data to be used during [Sequence Execution]. 540*22ce4affSfengbojiang 541*22ce4affSfengbojiang#### RLE Literals Block 542*22ce4affSfengbojiangStream1 consists of a single byte which should be repeated `Regenerated_Size` times 543*22ce4affSfengbojiangto generate the decoded literals. 544*22ce4affSfengbojiang 545*22ce4affSfengbojiang#### Compressed Literals Block and Treeless Literals Block 546*22ce4affSfengbojiangBoth of these modes contain Huffman encoded data. 547*22ce4affSfengbojiang 548*22ce4affSfengbojiangFor `Treeless_Literals_Block`, 549*22ce4affSfengbojiangthe Huffman table comes from previously compressed literals block, 550*22ce4affSfengbojiangor from a dictionary. 551*22ce4affSfengbojiang 552*22ce4affSfengbojiang 553*22ce4affSfengbojiang### `Huffman_Tree_Description` 554*22ce4affSfengbojiangThis section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`). 555*22ce4affSfengbojiangThe format of the Huffman tree description can be found at [Huffman Tree description](#huffman-tree-description). 556*22ce4affSfengbojiangThe size of `Huffman_Tree_Description` is determined during decoding process, 557*22ce4affSfengbojiangit must be used to determine where streams begin. 558*22ce4affSfengbojiang`Total_Streams_Size = Compressed_Size - Huffman_Tree_Description_Size`. 559*22ce4affSfengbojiang 560*22ce4affSfengbojiang 561*22ce4affSfengbojiang### Jump Table 562*22ce4affSfengbojiangThe Jump Table is only present when there are 4 Huffman-coded streams. 563*22ce4affSfengbojiang 564*22ce4affSfengbojiangReminder : Huffman compressed data consists of either 1 or 4 Huffman-coded streams. 565*22ce4affSfengbojiang 566*22ce4affSfengbojiangIf only one stream is present, it is a single bitstream occupying the entire 567*22ce4affSfengbojiangremaining portion of the literals block, encoded as described within 568*22ce4affSfengbojiang[Huffman-Coded Streams](#huffman-coded-streams). 569*22ce4affSfengbojiang 570*22ce4affSfengbojiangIf there are four streams, `Literals_Section_Header` only provided 571*22ce4affSfengbojiangenough information to know the decompressed and compressed sizes 572*22ce4affSfengbojiangof all four streams _combined_. 573*22ce4affSfengbojiangThe decompressed size of _each_ stream is equal to `(Regenerated_Size+3)/4`, 574*22ce4affSfengbojiangexcept for the last stream which may be up to 3 bytes smaller, 575*22ce4affSfengbojiangto reach a total decompressed size as specified in `Regenerated_Size`. 576*22ce4affSfengbojiang 577*22ce4affSfengbojiangThe compressed size of each stream is provided explicitly in the Jump Table. 578*22ce4affSfengbojiangJump Table is 6 bytes long, and consist of three 2-byte __little-endian__ fields, 579*22ce4affSfengbojiangdescribing the compressed sizes of the first three streams. 580*22ce4affSfengbojiang`Stream4_Size` is computed from total `Total_Streams_Size` minus sizes of other streams. 581*22ce4affSfengbojiang 582*22ce4affSfengbojiang`Stream4_Size = Total_Streams_Size - 6 - Stream1_Size - Stream2_Size - Stream3_Size`. 583*22ce4affSfengbojiang 584*22ce4affSfengbojiangNote: if `Stream1_Size + Stream2_Size + Stream3_Size > Total_Streams_Size`, 585*22ce4affSfengbojiangdata is considered corrupted. 586*22ce4affSfengbojiang 587*22ce4affSfengbojiangEach of these 4 bitstreams is then decoded independently as a Huffman-Coded stream, 588*22ce4affSfengbojiangas described at [Huffman-Coded Streams](#huffman-coded-streams) 589*22ce4affSfengbojiang 590*22ce4affSfengbojiang 591*22ce4affSfengbojiangSequences Section 592*22ce4affSfengbojiang----------------- 593*22ce4affSfengbojiangA compressed block is a succession of _sequences_ . 594*22ce4affSfengbojiangA sequence is a literal copy command, followed by a match copy command. 595*22ce4affSfengbojiangA literal copy command specifies a length. 596*22ce4affSfengbojiangIt is the number of bytes to be copied (or extracted) from the Literals Section. 597*22ce4affSfengbojiangA match copy command specifies an offset and a length. 598*22ce4affSfengbojiang 599*22ce4affSfengbojiangWhen all _sequences_ are decoded, 600*22ce4affSfengbojiangif there are literals left in the _literals section_, 601*22ce4affSfengbojiangthese bytes are added at the end of the block. 602*22ce4affSfengbojiang 603*22ce4affSfengbojiangThis is described in more detail in [Sequence Execution](#sequence-execution). 604*22ce4affSfengbojiang 605*22ce4affSfengbojiangThe `Sequences_Section` regroup all symbols required to decode commands. 606*22ce4affSfengbojiangThere are 3 symbol types : literals lengths, offsets and match lengths. 607*22ce4affSfengbojiangThey are encoded together, interleaved, in a single _bitstream_. 608*22ce4affSfengbojiang 609*22ce4affSfengbojiangThe `Sequences_Section` starts by a header, 610*22ce4affSfengbojiangfollowed by optional probability tables for each symbol type, 611*22ce4affSfengbojiangfollowed by the bitstream. 612*22ce4affSfengbojiang 613*22ce4affSfengbojiang| `Sequences_Section_Header` | [`Literals_Length_Table`] | [`Offset_Table`] | [`Match_Length_Table`] | bitStream | 614*22ce4affSfengbojiang| -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- | 615*22ce4affSfengbojiang 616*22ce4affSfengbojiangTo decode the `Sequences_Section`, it's required to know its size. 617*22ce4affSfengbojiangIts size is deduced from the size of `Literals_Section`: 618*22ce4affSfengbojiang`Sequences_Section_Size = Block_Size - Literals_Section_Size`. 619*22ce4affSfengbojiang 620*22ce4affSfengbojiang 621*22ce4affSfengbojiang#### `Sequences_Section_Header` 622*22ce4affSfengbojiang 623*22ce4affSfengbojiangConsists of 2 items: 624*22ce4affSfengbojiang- `Number_of_Sequences` 625*22ce4affSfengbojiang- Symbol compression modes 626*22ce4affSfengbojiang 627*22ce4affSfengbojiang__`Number_of_Sequences`__ 628*22ce4affSfengbojiang 629*22ce4affSfengbojiangThis is a variable size field using between 1 and 3 bytes. 630*22ce4affSfengbojiangLet's call its first byte `byte0`. 631*22ce4affSfengbojiang- `if (byte0 == 0)` : there are no sequences. 632*22ce4affSfengbojiang The sequence section stops there. 633*22ce4affSfengbojiang Decompressed content is defined entirely as Literals Section content. 634*22ce4affSfengbojiang The FSE tables used in `Repeat_Mode` aren't updated. 635*22ce4affSfengbojiang- `if (byte0 < 128)` : `Number_of_Sequences = byte0` . Uses 1 byte. 636*22ce4affSfengbojiang- `if (byte0 < 255)` : `Number_of_Sequences = ((byte0-128) << 8) + byte1` . Uses 2 bytes. 637*22ce4affSfengbojiang- `if (byte0 == 255)`: `Number_of_Sequences = byte1 + (byte2<<8) + 0x7F00` . Uses 3 bytes. 638*22ce4affSfengbojiang 639*22ce4affSfengbojiang__Symbol compression modes__ 640*22ce4affSfengbojiang 641*22ce4affSfengbojiangThis is a single byte, defining the compression mode of each symbol type. 642*22ce4affSfengbojiang 643*22ce4affSfengbojiang|Bit number| 7-6 | 5-4 | 3-2 | 1-0 | 644*22ce4affSfengbojiang| -------- | ----------------------- | -------------- | -------------------- | ---------- | 645*22ce4affSfengbojiang|Field name| `Literals_Lengths_Mode` | `Offsets_Mode` | `Match_Lengths_Mode` | `Reserved` | 646*22ce4affSfengbojiang 647*22ce4affSfengbojiangThe last field, `Reserved`, must be all-zeroes. 648*22ce4affSfengbojiang 649*22ce4affSfengbojiang`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the `Compression_Mode` of 650*22ce4affSfengbojiangliterals lengths, offsets, and match lengths symbols respectively. 651*22ce4affSfengbojiang 652*22ce4affSfengbojiangThey follow the same enumeration : 653*22ce4affSfengbojiang 654*22ce4affSfengbojiang| Value | 0 | 1 | 2 | 3 | 655*22ce4affSfengbojiang| ------------------ | ----------------- | ---------- | --------------------- | ------------- | 656*22ce4affSfengbojiang| `Compression_Mode` | `Predefined_Mode` | `RLE_Mode` | `FSE_Compressed_Mode` | `Repeat_Mode` | 657*22ce4affSfengbojiang 658*22ce4affSfengbojiang- `Predefined_Mode` : A predefined FSE distribution table is used, defined in 659*22ce4affSfengbojiang [default distributions](#default-distributions). 660*22ce4affSfengbojiang No distribution table will be present. 661*22ce4affSfengbojiang- `RLE_Mode` : The table description consists of a single byte, which contains the symbol's value. 662*22ce4affSfengbojiang This symbol will be used for all sequences. 663*22ce4affSfengbojiang- `FSE_Compressed_Mode` : standard FSE compression. 664*22ce4affSfengbojiang A distribution table will be present. 665*22ce4affSfengbojiang The format of this distribution table is described in [FSE Table Description](#fse-table-description). 666*22ce4affSfengbojiang Note that the maximum allowed accuracy log for literals length and match length tables is 9, 667*22ce4affSfengbojiang and the maximum accuracy log for the offsets table is 8. 668*22ce4affSfengbojiang `FSE_Compressed_Mode` must not be used when only one symbol is present, 669*22ce4affSfengbojiang `RLE_Mode` should be used instead (although any other mode will work). 670*22ce4affSfengbojiang- `Repeat_Mode` : The table used in the previous `Compressed_Block` with `Number_of_Sequences > 0` will be used again, 671*22ce4affSfengbojiang or if this is the first block, table in the dictionary will be used. 672*22ce4affSfengbojiang Note that this includes `RLE_mode`, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated. 673*22ce4affSfengbojiang It also includes `Predefined_Mode`, in which case `Repeat_Mode` will have same outcome as `Predefined_Mode`. 674*22ce4affSfengbojiang No distribution table will be present. 675*22ce4affSfengbojiang If this mode is used without any previous sequence table in the frame 676*22ce4affSfengbojiang (nor [dictionary](#dictionary-format)) to repeat, this should be treated as corruption. 677*22ce4affSfengbojiang 678*22ce4affSfengbojiang#### The codes for literals lengths, match lengths, and offsets. 679*22ce4affSfengbojiang 680*22ce4affSfengbojiangEach symbol is a _code_ in its own context, 681*22ce4affSfengbojiangwhich specifies `Baseline` and `Number_of_Bits` to add. 682*22ce4affSfengbojiang_Codes_ are FSE compressed, 683*22ce4affSfengbojiangand interleaved with raw additional bits in the same bitstream. 684*22ce4affSfengbojiang 685*22ce4affSfengbojiang##### Literals length codes 686*22ce4affSfengbojiang 687*22ce4affSfengbojiangLiterals length codes are values ranging from `0` to `35` included. 688*22ce4affSfengbojiangThey define lengths from 0 to 131071 bytes. 689*22ce4affSfengbojiangThe literals length is equal to the decoded `Baseline` plus 690*22ce4affSfengbojiangthe result of reading `Number_of_Bits` bits from the bitstream, 691*22ce4affSfengbojiangas a __little-endian__ value. 692*22ce4affSfengbojiang 693*22ce4affSfengbojiang| `Literals_Length_Code` | 0-15 | 694*22ce4affSfengbojiang| ---------------------- | ---------------------- | 695*22ce4affSfengbojiang| length | `Literals_Length_Code` | 696*22ce4affSfengbojiang| `Number_of_Bits` | 0 | 697*22ce4affSfengbojiang 698*22ce4affSfengbojiang| `Literals_Length_Code` | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 699*22ce4affSfengbojiang| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | 700*22ce4affSfengbojiang| `Baseline` | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 | 701*22ce4affSfengbojiang| `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 | 702*22ce4affSfengbojiang 703*22ce4affSfengbojiang| `Literals_Length_Code` | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 704*22ce4affSfengbojiang| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | 705*22ce4affSfengbojiang| `Baseline` | 48 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | 706*22ce4affSfengbojiang| `Number_of_Bits` | 4 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 707*22ce4affSfengbojiang 708*22ce4affSfengbojiang| `Literals_Length_Code` | 32 | 33 | 34 | 35 | 709*22ce4affSfengbojiang| ---------------------- | ---- | ---- | ---- | ---- | 710*22ce4affSfengbojiang| `Baseline` | 8192 |16384 |32768 |65536 | 711*22ce4affSfengbojiang| `Number_of_Bits` | 13 | 14 | 15 | 16 | 712*22ce4affSfengbojiang 713*22ce4affSfengbojiang 714*22ce4affSfengbojiang##### Match length codes 715*22ce4affSfengbojiang 716*22ce4affSfengbojiangMatch length codes are values ranging from `0` to `52` included. 717*22ce4affSfengbojiangThey define lengths from 3 to 131074 bytes. 718*22ce4affSfengbojiangThe match length is equal to the decoded `Baseline` plus 719*22ce4affSfengbojiangthe result of reading `Number_of_Bits` bits from the bitstream, 720*22ce4affSfengbojiangas a __little-endian__ value. 721*22ce4affSfengbojiang 722*22ce4affSfengbojiang| `Match_Length_Code` | 0-31 | 723*22ce4affSfengbojiang| ------------------- | ----------------------- | 724*22ce4affSfengbojiang| value | `Match_Length_Code` + 3 | 725*22ce4affSfengbojiang| `Number_of_Bits` | 0 | 726*22ce4affSfengbojiang 727*22ce4affSfengbojiang| `Match_Length_Code` | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 728*22ce4affSfengbojiang| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | 729*22ce4affSfengbojiang| `Baseline` | 35 | 37 | 39 | 41 | 43 | 47 | 51 | 59 | 730*22ce4affSfengbojiang| `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 | 731*22ce4affSfengbojiang 732*22ce4affSfengbojiang| `Match_Length_Code` | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 733*22ce4affSfengbojiang| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | 734*22ce4affSfengbojiang| `Baseline` | 67 | 83 | 99 | 131 | 259 | 515 | 1027 | 2051 | 735*22ce4affSfengbojiang| `Number_of_Bits` | 4 | 4 | 5 | 7 | 8 | 9 | 10 | 11 | 736*22ce4affSfengbojiang 737*22ce4affSfengbojiang| `Match_Length_Code` | 48 | 49 | 50 | 51 | 52 | 738*22ce4affSfengbojiang| ------------------- | ---- | ---- | ---- | ---- | ---- | 739*22ce4affSfengbojiang| `Baseline` | 4099 | 8195 |16387 |32771 |65539 | 740*22ce4affSfengbojiang| `Number_of_Bits` | 12 | 13 | 14 | 15 | 16 | 741*22ce4affSfengbojiang 742*22ce4affSfengbojiang##### Offset codes 743*22ce4affSfengbojiang 744*22ce4affSfengbojiangOffset codes are values ranging from `0` to `N`. 745*22ce4affSfengbojiang 746*22ce4affSfengbojiangA decoder is free to limit its maximum `N` supported. 747*22ce4affSfengbojiangRecommendation is to support at least up to `22`. 748*22ce4affSfengbojiangFor information, at the time of this writing. 749*22ce4affSfengbojiangthe reference decoder supports a maximum `N` value of `31`. 750*22ce4affSfengbojiang 751*22ce4affSfengbojiangAn offset code is also the number of additional bits to read in __little-endian__ fashion, 752*22ce4affSfengbojiangand can be translated into an `Offset_Value` using the following formulas : 753*22ce4affSfengbojiang 754*22ce4affSfengbojiang``` 755*22ce4affSfengbojiangOffset_Value = (1 << offsetCode) + readNBits(offsetCode); 756*22ce4affSfengbojiangif (Offset_Value > 3) offset = Offset_Value - 3; 757*22ce4affSfengbojiang``` 758*22ce4affSfengbojiangIt means that maximum `Offset_Value` is `(2^(N+1))-1` 759*22ce4affSfengbojiangsupporting back-reference distances up to `(2^(N+1))-4`, 760*22ce4affSfengbojiangbut is limited by [maximum back-reference distance](#window_descriptor). 761*22ce4affSfengbojiang 762*22ce4affSfengbojiang`Offset_Value` from 1 to 3 are special : they define "repeat codes". 763*22ce4affSfengbojiangThis is described in more detail in [Repeat Offsets](#repeat-offsets). 764*22ce4affSfengbojiang 765*22ce4affSfengbojiang#### Decoding Sequences 766*22ce4affSfengbojiangFSE bitstreams are read in reverse direction than written. In zstd, 767*22ce4affSfengbojiangthe compressor writes bits forward into a block and the decompressor 768*22ce4affSfengbojiangmust read the bitstream _backwards_. 769*22ce4affSfengbojiang 770*22ce4affSfengbojiangTo find the start of the bitstream it is therefore necessary to 771*22ce4affSfengbojiangknow the offset of the last byte of the block which can be found 772*22ce4affSfengbojiangby counting `Block_Size` bytes after the block header. 773*22ce4affSfengbojiang 774*22ce4affSfengbojiangAfter writing the last bit containing information, the compressor 775*22ce4affSfengbojiangwrites a single `1`-bit and then fills the byte with 0-7 `0` bits of 776*22ce4affSfengbojiangpadding. The last byte of the compressed bitstream cannot be `0` for 777*22ce4affSfengbojiangthat reason. 778*22ce4affSfengbojiang 779*22ce4affSfengbojiangWhen decompressing, the last byte containing the padding is the first 780*22ce4affSfengbojiangbyte to read. The decompressor needs to skip 0-7 initial `0`-bits and 781*22ce4affSfengbojiangthe first `1`-bit it occurs. Afterwards, the useful part of the bitstream 782*22ce4affSfengbojiangbegins. 783*22ce4affSfengbojiang 784*22ce4affSfengbojiangFSE decoding requires a 'state' to be carried from symbol to symbol. 785*22ce4affSfengbojiangFor more explanation on FSE decoding, see the [FSE section](#fse). 786*22ce4affSfengbojiang 787*22ce4affSfengbojiangFor sequence decoding, a separate state keeps track of each 788*22ce4affSfengbojiangliteral lengths, offsets, and match lengths symbols. 789*22ce4affSfengbojiangSome FSE primitives are also used. 790*22ce4affSfengbojiangFor more details on the operation of these primitives, see the [FSE section](#fse). 791*22ce4affSfengbojiang 792*22ce4affSfengbojiang##### Starting states 793*22ce4affSfengbojiangThe bitstream starts with initial FSE state values, 794*22ce4affSfengbojiangeach using the required number of bits in their respective _accuracy_, 795*22ce4affSfengbojiangdecoded previously from their normalized distribution. 796*22ce4affSfengbojiang 797*22ce4affSfengbojiangIt starts by `Literals_Length_State`, 798*22ce4affSfengbojiangfollowed by `Offset_State`, 799*22ce4affSfengbojiangand finally `Match_Length_State`. 800*22ce4affSfengbojiang 801*22ce4affSfengbojiangReminder : always keep in mind that all values are read _backward_, 802*22ce4affSfengbojiangso the 'start' of the bitstream is at the highest position in memory, 803*22ce4affSfengbojiangimmediately before the last `1`-bit for padding. 804*22ce4affSfengbojiang 805*22ce4affSfengbojiangAfter decoding the starting states, a single sequence is decoded 806*22ce4affSfengbojiang`Number_Of_Sequences` times. 807*22ce4affSfengbojiangThese sequences are decoded in order from first to last. 808*22ce4affSfengbojiangSince the compressor writes the bitstream in the forward direction, 809*22ce4affSfengbojiangthis means the compressor must encode the sequences starting with the last 810*22ce4affSfengbojiangone and ending with the first. 811*22ce4affSfengbojiang 812*22ce4affSfengbojiang##### Decoding a sequence 813*22ce4affSfengbojiangFor each of the symbol types, the FSE state can be used to determine the appropriate code. 814*22ce4affSfengbojiangThe code then defines the `Baseline` and `Number_of_Bits` to read for each type. 815*22ce4affSfengbojiangSee the [description of the codes] for how to determine these values. 816*22ce4affSfengbojiang 817*22ce4affSfengbojiang[description of the codes]: #the-codes-for-literals-lengths-match-lengths-and-offsets 818*22ce4affSfengbojiang 819*22ce4affSfengbojiangDecoding starts by reading the `Number_of_Bits` required to decode `Offset`. 820*22ce4affSfengbojiangIt then does the same for `Match_Length`, and then for `Literals_Length`. 821*22ce4affSfengbojiangThis sequence is then used for [sequence execution](#sequence-execution). 822*22ce4affSfengbojiang 823*22ce4affSfengbojiangIf it is not the last sequence in the block, 824*22ce4affSfengbojiangthe next operation is to update states. 825*22ce4affSfengbojiangUsing the rules pre-calculated in the decoding tables, 826*22ce4affSfengbojiang`Literals_Length_State` is updated, 827*22ce4affSfengbojiangfollowed by `Match_Length_State`, 828*22ce4affSfengbojiangand then `Offset_State`. 829*22ce4affSfengbojiangSee the [FSE section](#fse) for details on how to update states from the bitstream. 830*22ce4affSfengbojiang 831*22ce4affSfengbojiangThis operation will be repeated `Number_of_Sequences` times. 832*22ce4affSfengbojiangAt the end, the bitstream shall be entirely consumed, 833*22ce4affSfengbojiangotherwise the bitstream is considered corrupted. 834*22ce4affSfengbojiang 835*22ce4affSfengbojiang#### Default Distributions 836*22ce4affSfengbojiangIf `Predefined_Mode` is selected for a symbol type, 837*22ce4affSfengbojiangits FSE decoding table is generated from a predefined distribution table defined here. 838*22ce4affSfengbojiangFor details on how to convert this distribution into a decoding table, see the [FSE section]. 839*22ce4affSfengbojiang 840*22ce4affSfengbojiang[FSE section]: #from-normalized-distribution-to-decoding-tables 841*22ce4affSfengbojiang 842*22ce4affSfengbojiang##### Literals Length 843*22ce4affSfengbojiangThe decoding table uses an accuracy log of 6 bits (64 states). 844*22ce4affSfengbojiang``` 845*22ce4affSfengbojiangshort literalsLength_defaultDistribution[36] = 846*22ce4affSfengbojiang { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 847*22ce4affSfengbojiang 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1, 848*22ce4affSfengbojiang -1,-1,-1,-1 }; 849*22ce4affSfengbojiang``` 850*22ce4affSfengbojiang 851*22ce4affSfengbojiang##### Match Length 852*22ce4affSfengbojiangThe decoding table uses an accuracy log of 6 bits (64 states). 853*22ce4affSfengbojiang``` 854*22ce4affSfengbojiangshort matchLengths_defaultDistribution[53] = 855*22ce4affSfengbojiang { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 856*22ce4affSfengbojiang 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 857*22ce4affSfengbojiang 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1, 858*22ce4affSfengbojiang -1,-1,-1,-1,-1 }; 859*22ce4affSfengbojiang``` 860*22ce4affSfengbojiang 861*22ce4affSfengbojiang##### Offset Codes 862*22ce4affSfengbojiangThe decoding table uses an accuracy log of 5 bits (32 states), 863*22ce4affSfengbojiangand supports a maximum `N` value of 28, allowing offset values up to 536,870,908 . 864*22ce4affSfengbojiang 865*22ce4affSfengbojiangIf any sequence in the compressed block requires a larger offset than this, 866*22ce4affSfengbojiangit's not possible to use the default distribution to represent it. 867*22ce4affSfengbojiang``` 868*22ce4affSfengbojiangshort offsetCodes_defaultDistribution[29] = 869*22ce4affSfengbojiang { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 870*22ce4affSfengbojiang 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 }; 871*22ce4affSfengbojiang``` 872*22ce4affSfengbojiang 873*22ce4affSfengbojiang 874*22ce4affSfengbojiangSequence Execution 875*22ce4affSfengbojiang------------------ 876*22ce4affSfengbojiangOnce literals and sequences have been decoded, 877*22ce4affSfengbojiangthey are combined to produce the decoded content of a block. 878*22ce4affSfengbojiang 879*22ce4affSfengbojiangEach sequence consists of a tuple of (`literals_length`, `offset_value`, `match_length`), 880*22ce4affSfengbojiangdecoded as described in the [Sequences Section](#sequences-section). 881*22ce4affSfengbojiangTo execute a sequence, first copy `literals_length` bytes 882*22ce4affSfengbojiangfrom the decoded literals to the output. 883*22ce4affSfengbojiang 884*22ce4affSfengbojiangThen `match_length` bytes are copied from previous decoded data. 885*22ce4affSfengbojiangThe offset to copy from is determined by `offset_value`: 886*22ce4affSfengbojiangif `offset_value > 3`, then the offset is `offset_value - 3`. 887*22ce4affSfengbojiangIf `offset_value` is from 1-3, the offset is a special repeat offset value. 888*22ce4affSfengbojiangSee the [repeat offset](#repeat-offsets) section for how the offset is determined 889*22ce4affSfengbojiangin this case. 890*22ce4affSfengbojiang 891*22ce4affSfengbojiangThe offset is defined as from the current position, so an offset of 6 892*22ce4affSfengbojiangand a match length of 3 means that 3 bytes should be copied from 6 bytes back. 893*22ce4affSfengbojiangNote that all offsets leading to previously decoded data 894*22ce4affSfengbojiangmust be smaller than `Window_Size` defined in `Frame_Header_Descriptor`. 895*22ce4affSfengbojiang 896*22ce4affSfengbojiang#### Repeat offsets 897*22ce4affSfengbojiangAs seen in [Sequence Execution](#sequence-execution), 898*22ce4affSfengbojiangthe first 3 values define a repeated offset and we will call them 899*22ce4affSfengbojiang`Repeated_Offset1`, `Repeated_Offset2`, and `Repeated_Offset3`. 900*22ce4affSfengbojiangThey are sorted in recency order, with `Repeated_Offset1` meaning "most recent one". 901*22ce4affSfengbojiang 902*22ce4affSfengbojiangIf `offset_value == 1`, then the offset used is `Repeated_Offset1`, etc. 903*22ce4affSfengbojiang 904*22ce4affSfengbojiangThere is an exception though, when current sequence's `literals_length = 0`. 905*22ce4affSfengbojiangIn this case, repeated offsets are shifted by one, 906*22ce4affSfengbojiangso an `offset_value` of 1 means `Repeated_Offset2`, 907*22ce4affSfengbojiangan `offset_value` of 2 means `Repeated_Offset3`, 908*22ce4affSfengbojiangand an `offset_value` of 3 means `Repeated_Offset1 - 1_byte`. 909*22ce4affSfengbojiang 910*22ce4affSfengbojiangFor the first block, the starting offset history is populated with following values : 911*22ce4affSfengbojiang`Repeated_Offset1`=1, `Repeated_Offset2`=4, `Repeated_Offset3`=8, 912*22ce4affSfengbojiangunless a dictionary is used, in which case they come from the dictionary. 913*22ce4affSfengbojiang 914*22ce4affSfengbojiangThen each block gets its starting offset history from the ending values of the most recent `Compressed_Block`. 915*22ce4affSfengbojiangNote that blocks which are not `Compressed_Block` are skipped, they do not contribute to offset history. 916*22ce4affSfengbojiang 917*22ce4affSfengbojiang[Offset Codes]: #offset-codes 918*22ce4affSfengbojiang 919*22ce4affSfengbojiang###### Offset updates rules 920*22ce4affSfengbojiang 921*22ce4affSfengbojiangDuring the execution of the sequences of a `Compressed_Block`, the 922*22ce4affSfengbojiang`Repeated_Offsets`' values are kept up to date, so that they always represent 923*22ce4affSfengbojiangthe three most-recently used offsets. In order to achieve that, they are 924*22ce4affSfengbojiangupdated after executing each sequence in the following way: 925*22ce4affSfengbojiang 926*22ce4affSfengbojiangWhen the sequence's `offset_value` does not refer to one of the 927*22ce4affSfengbojiang`Repeated_Offsets`--when it has value greater than 3, or when it has value 3 928*22ce4affSfengbojiangand the sequence's `literals_length` is zero--the `Repeated_Offsets`' values 929*22ce4affSfengbojiangare shifted back one, and `Repeated_Offset1` takes on the value of the 930*22ce4affSfengbojiangjust-used offset. 931*22ce4affSfengbojiang 932*22ce4affSfengbojiangOtherwise, when the sequence's `offset_value` refers to one of the 933*22ce4affSfengbojiang`Repeated_Offsets`--when it has value 1 or 2, or when it has value 3 and the 934*22ce4affSfengbojiangsequence's `literals_length` is non-zero--the `Repeated_Offsets` are re-ordered 935*22ce4affSfengbojiangso that `Repeated_Offset1` takes on the value of the used Repeated_Offset, and 936*22ce4affSfengbojiangthe existing values are pushed back from the first `Repeated_Offset` through to 937*22ce4affSfengbojiangthe `Repeated_Offset` selected by the `offset_value`. This effectively performs 938*22ce4affSfengbojianga single-stepped wrapping rotation of the values of these offsets, so that 939*22ce4affSfengbojiangtheir order again reflects the recency of their use. 940*22ce4affSfengbojiang 941*22ce4affSfengbojiangThe following table shows the values of the `Repeated_Offsets` as a series of 942*22ce4affSfengbojiangsequences are applied to them: 943*22ce4affSfengbojiang 944*22ce4affSfengbojiang| `offset_value` | `literals_length` | `Repeated_Offset1` | `Repeated_Offset2` | `Repeated_Offset3` | Comment | 945*22ce4affSfengbojiang|:--------------:|:-----------------:|:------------------:|:------------------:|:------------------:|:-----------------------:| 946*22ce4affSfengbojiang| | | 1 | 4 | 8 | starting values | 947*22ce4affSfengbojiang| 1114 | 11 | 1111 | 1 | 4 | non-repeat | 948*22ce4affSfengbojiang| 1 | 22 | 1111 | 1 | 4 | repeat 1; no change | 949*22ce4affSfengbojiang| 2225 | 22 | 2222 | 1111 | 1 | non-repeat | 950*22ce4affSfengbojiang| 1114 | 111 | 1111 | 2222 | 1111 | non-repeat | 951*22ce4affSfengbojiang| 3336 | 33 | 3333 | 1111 | 2222 | non-repeat | 952*22ce4affSfengbojiang| 2 | 22 | 1111 | 3333 | 2222 | repeat 2; swap 1 & 2 | 953*22ce4affSfengbojiang| 3 | 33 | 2222 | 1111 | 3333 | repeat 3; rotate 3 to 1 | 954*22ce4affSfengbojiang| 3 | 0 | 2221 | 2222 | 1111 | insert resolved offset | 955*22ce4affSfengbojiang| 1 | 0 | 2222 | 2221 | 3333 | repeat 2 | 956*22ce4affSfengbojiang 957*22ce4affSfengbojiang 958*22ce4affSfengbojiangSkippable Frames 959*22ce4affSfengbojiang---------------- 960*22ce4affSfengbojiang 961*22ce4affSfengbojiang| `Magic_Number` | `Frame_Size` | `User_Data` | 962*22ce4affSfengbojiang|:--------------:|:------------:|:-----------:| 963*22ce4affSfengbojiang| 4 bytes | 4 bytes | n bytes | 964*22ce4affSfengbojiang 965*22ce4affSfengbojiangSkippable frames allow the insertion of user-defined metadata 966*22ce4affSfengbojianginto a flow of concatenated frames. 967*22ce4affSfengbojiang 968*22ce4affSfengbojiangSkippable frames defined in this specification are compatible with [LZ4] ones. 969*22ce4affSfengbojiang 970*22ce4affSfengbojiang[LZ4]:http://www.lz4.org 971*22ce4affSfengbojiang 972*22ce4affSfengbojiangFrom a compliant decoder perspective, skippable frames need just be skipped, 973*22ce4affSfengbojiangand their content ignored, resuming decoding after the skippable frame. 974*22ce4affSfengbojiang 975*22ce4affSfengbojiangIt can be noted that a skippable frame 976*22ce4affSfengbojiangcan be used to watermark a stream of concatenated frames 977*22ce4affSfengbojiangembedding any kind of tracking information (even just an UUID). 978*22ce4affSfengbojiangUsers wary of such possibility should scan the stream of concatenated frames 979*22ce4affSfengbojiangin an attempt to detect such frame for analysis or removal. 980*22ce4affSfengbojiang 981*22ce4affSfengbojiang__`Magic_Number`__ 982*22ce4affSfengbojiang 983*22ce4affSfengbojiang4 Bytes, __little-endian__ format. 984*22ce4affSfengbojiangValue : 0x184D2A5?, which means any value from 0x184D2A50 to 0x184D2A5F. 985*22ce4affSfengbojiangAll 16 values are valid to identify a skippable frame. 986*22ce4affSfengbojiangThis specification doesn't detail any specific tagging for skippable frames. 987*22ce4affSfengbojiang 988*22ce4affSfengbojiang__`Frame_Size`__ 989*22ce4affSfengbojiang 990*22ce4affSfengbojiangThis is the size, in bytes, of the following `User_Data` 991*22ce4affSfengbojiang(without including the magic number nor the size field itself). 992*22ce4affSfengbojiangThis field is represented using 4 Bytes, __little-endian__ format, unsigned 32-bits. 993*22ce4affSfengbojiangThis means `User_Data` can’t be bigger than (2^32-1) bytes. 994*22ce4affSfengbojiang 995*22ce4affSfengbojiang__`User_Data`__ 996*22ce4affSfengbojiang 997*22ce4affSfengbojiangThe `User_Data` can be anything. Data will just be skipped by the decoder. 998*22ce4affSfengbojiang 999*22ce4affSfengbojiang 1000*22ce4affSfengbojiang 1001*22ce4affSfengbojiangEntropy Encoding 1002*22ce4affSfengbojiang---------------- 1003*22ce4affSfengbojiangTwo types of entropy encoding are used by the Zstandard format: 1004*22ce4affSfengbojiangFSE, and Huffman coding. 1005*22ce4affSfengbojiangHuffman is used to compress literals, 1006*22ce4affSfengbojiangwhile FSE is used for all other symbols 1007*22ce4affSfengbojiang(`Literals_Length_Code`, `Match_Length_Code`, offset codes) 1008*22ce4affSfengbojiangand to compress Huffman headers. 1009*22ce4affSfengbojiang 1010*22ce4affSfengbojiang 1011*22ce4affSfengbojiangFSE 1012*22ce4affSfengbojiang--- 1013*22ce4affSfengbojiangFSE, short for Finite State Entropy, is an entropy codec based on [ANS]. 1014*22ce4affSfengbojiangFSE encoding/decoding involves a state that is carried over between symbols, 1015*22ce4affSfengbojiangso decoding must be done in the opposite direction as encoding. 1016*22ce4affSfengbojiangTherefore, all FSE bitstreams are read from end to beginning. 1017*22ce4affSfengbojiangNote that the order of the bits in the stream is not reversed, 1018*22ce4affSfengbojiangwe just read the elements in the reverse order they are written. 1019*22ce4affSfengbojiang 1020*22ce4affSfengbojiangFor additional details on FSE, see [Finite State Entropy]. 1021*22ce4affSfengbojiang 1022*22ce4affSfengbojiang[Finite State Entropy]:https://github.com/Cyan4973/FiniteStateEntropy/ 1023*22ce4affSfengbojiang 1024*22ce4affSfengbojiangFSE decoding involves a decoding table which has a power of 2 size, and contain three elements: 1025*22ce4affSfengbojiang`Symbol`, `Num_Bits`, and `Baseline`. 1026*22ce4affSfengbojiangThe `log2` of the table size is its `Accuracy_Log`. 1027*22ce4affSfengbojiangAn FSE state value represents an index in this table. 1028*22ce4affSfengbojiang 1029*22ce4affSfengbojiangTo obtain the initial state value, consume `Accuracy_Log` bits from the stream as a __little-endian__ value. 1030*22ce4affSfengbojiangThe next symbol in the stream is the `Symbol` indicated in the table for that state. 1031*22ce4affSfengbojiangTo obtain the next state value, 1032*22ce4affSfengbojiangthe decoder should consume `Num_Bits` bits from the stream as a __little-endian__ value and add it to `Baseline`. 1033*22ce4affSfengbojiang 1034*22ce4affSfengbojiang[ANS]: https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems 1035*22ce4affSfengbojiang 1036*22ce4affSfengbojiang### FSE Table Description 1037*22ce4affSfengbojiangTo decode FSE streams, it is necessary to construct the decoding table. 1038*22ce4affSfengbojiangThe Zstandard format encodes FSE table descriptions as follows: 1039*22ce4affSfengbojiang 1040*22ce4affSfengbojiangAn FSE distribution table describes the probabilities of all symbols 1041*22ce4affSfengbojiangfrom `0` to the last present one (included) 1042*22ce4affSfengbojiangon a normalized scale of `1 << Accuracy_Log` . 1043*22ce4affSfengbojiangNote that there must be two or more symbols with nonzero probability. 1044*22ce4affSfengbojiang 1045*22ce4affSfengbojiangIt's a bitstream which is read forward, in __little-endian__ fashion. 1046*22ce4affSfengbojiangIt's not necessary to know bitstream exact size, 1047*22ce4affSfengbojiangit will be discovered and reported by the decoding process. 1048*22ce4affSfengbojiang 1049*22ce4affSfengbojiangThe bitstream starts by reporting on which scale it operates. 1050*22ce4affSfengbojiangLet's `low4Bits` designate the lowest 4 bits of the first byte : 1051*22ce4affSfengbojiang`Accuracy_Log = low4bits + 5`. 1052*22ce4affSfengbojiang 1053*22ce4affSfengbojiangThen follows each symbol value, from `0` to last present one. 1054*22ce4affSfengbojiangThe number of bits used by each field is variable. 1055*22ce4affSfengbojiangIt depends on : 1056*22ce4affSfengbojiang 1057*22ce4affSfengbojiang- Remaining probabilities + 1 : 1058*22ce4affSfengbojiang __example__ : 1059*22ce4affSfengbojiang Presuming an `Accuracy_Log` of 8, 1060*22ce4affSfengbojiang and presuming 100 probabilities points have already been distributed, 1061*22ce4affSfengbojiang the decoder may read any value from `0` to `256 - 100 + 1 == 157` (inclusive). 1062*22ce4affSfengbojiang Therefore, it must read `log2sup(157) == 8` bits. 1063*22ce4affSfengbojiang 1064*22ce4affSfengbojiang- Value decoded : small values use 1 less bit : 1065*22ce4affSfengbojiang __example__ : 1066*22ce4affSfengbojiang Presuming values from 0 to 157 (inclusive) are possible, 1067*22ce4affSfengbojiang 255-157 = 98 values are remaining in an 8-bits field. 1068*22ce4affSfengbojiang They are used this way : 1069*22ce4affSfengbojiang first 98 values (hence from 0 to 97) use only 7 bits, 1070*22ce4affSfengbojiang values from 98 to 157 use 8 bits. 1071*22ce4affSfengbojiang This is achieved through this scheme : 1072*22ce4affSfengbojiang 1073*22ce4affSfengbojiang | Value read | Value decoded | Number of bits used | 1074*22ce4affSfengbojiang | ---------- | ------------- | ------------------- | 1075*22ce4affSfengbojiang | 0 - 97 | 0 - 97 | 7 | 1076*22ce4affSfengbojiang | 98 - 127 | 98 - 127 | 8 | 1077*22ce4affSfengbojiang | 128 - 225 | 0 - 97 | 7 | 1078*22ce4affSfengbojiang | 226 - 255 | 128 - 157 | 8 | 1079*22ce4affSfengbojiang 1080*22ce4affSfengbojiangSymbols probabilities are read one by one, in order. 1081*22ce4affSfengbojiang 1082*22ce4affSfengbojiangProbability is obtained from Value decoded by following formula : 1083*22ce4affSfengbojiang`Proba = value - 1` 1084*22ce4affSfengbojiang 1085*22ce4affSfengbojiangIt means value `0` becomes negative probability `-1`. 1086*22ce4affSfengbojiang`-1` is a special probability, which means "less than 1". 1087*22ce4affSfengbojiangIts effect on distribution table is described in the [next section]. 1088*22ce4affSfengbojiangFor the purpose of calculating total allocated probability points, it counts as one. 1089*22ce4affSfengbojiang 1090*22ce4affSfengbojiang[next section]:#from-normalized-distribution-to-decoding-tables 1091*22ce4affSfengbojiang 1092*22ce4affSfengbojiangWhen a symbol has a __probability__ of `zero`, 1093*22ce4affSfengbojiangit is followed by a 2-bits repeat flag. 1094*22ce4affSfengbojiangThis repeat flag tells how many probabilities of zeroes follow the current one. 1095*22ce4affSfengbojiangIt provides a number ranging from 0 to 3. 1096*22ce4affSfengbojiangIf it is a 3, another 2-bits repeat flag follows, and so on. 1097*22ce4affSfengbojiang 1098*22ce4affSfengbojiangWhen last symbol reaches cumulated total of `1 << Accuracy_Log`, 1099*22ce4affSfengbojiangdecoding is complete. 1100*22ce4affSfengbojiangIf the last symbol makes cumulated total go above `1 << Accuracy_Log`, 1101*22ce4affSfengbojiangdistribution is considered corrupted. 1102*22ce4affSfengbojiang 1103*22ce4affSfengbojiangThen the decoder can tell how many bytes were used in this process, 1104*22ce4affSfengbojiangand how many symbols are present. 1105*22ce4affSfengbojiangThe bitstream consumes a round number of bytes. 1106*22ce4affSfengbojiangAny remaining bit within the last byte is just unused. 1107*22ce4affSfengbojiang 1108*22ce4affSfengbojiang#### From normalized distribution to decoding tables 1109*22ce4affSfengbojiang 1110*22ce4affSfengbojiangThe distribution of normalized probabilities is enough 1111*22ce4affSfengbojiangto create a unique decoding table. 1112*22ce4affSfengbojiang 1113*22ce4affSfengbojiangIt follows the following build rule : 1114*22ce4affSfengbojiang 1115*22ce4affSfengbojiangThe table has a size of `Table_Size = 1 << Accuracy_Log`. 1116*22ce4affSfengbojiangEach cell describes the symbol decoded, 1117*22ce4affSfengbojiangand instructions to get the next state (`Number_of_Bits` and `Baseline`). 1118*22ce4affSfengbojiang 1119*22ce4affSfengbojiangSymbols are scanned in their natural order for "less than 1" probabilities. 1120*22ce4affSfengbojiangSymbols with this probability are being attributed a single cell, 1121*22ce4affSfengbojiangstarting from the end of the table and retreating. 1122*22ce4affSfengbojiangThese symbols define a full state reset, reading `Accuracy_Log` bits. 1123*22ce4affSfengbojiang 1124*22ce4affSfengbojiangThen, all remaining symbols, sorted in natural order, are allocated cells. 1125*22ce4affSfengbojiangStarting from symbol `0` (if it exists), and table position `0`, 1126*22ce4affSfengbojiangeach symbol gets allocated as many cells as its probability. 1127*22ce4affSfengbojiangCell allocation is spreaded, not linear : 1128*22ce4affSfengbojiangeach successor position follows this rule : 1129*22ce4affSfengbojiang 1130*22ce4affSfengbojiang``` 1131*22ce4affSfengbojiangposition += (tableSize>>1) + (tableSize>>3) + 3; 1132*22ce4affSfengbojiangposition &= tableSize-1; 1133*22ce4affSfengbojiang``` 1134*22ce4affSfengbojiang 1135*22ce4affSfengbojiangA position is skipped if already occupied by a "less than 1" probability symbol. 1136*22ce4affSfengbojiang`position` does not reset between symbols, it simply iterates through 1137*22ce4affSfengbojiangeach position in the table, switching to the next symbol when enough 1138*22ce4affSfengbojiangstates have been allocated to the current one. 1139*22ce4affSfengbojiang 1140*22ce4affSfengbojiangThe process guarantees that the table is entirely filled. 1141*22ce4affSfengbojiangEach cell corresponds to a state value, which contains the symbol being decoded. 1142*22ce4affSfengbojiang 1143*22ce4affSfengbojiangTo add the `Number_of_Bits` and `Baseline` required to retrieve next state, 1144*22ce4affSfengbojiangit's first necessary to sort all occurrences of each symbol in state order. 1145*22ce4affSfengbojiangLower states will need 1 more bit than higher ones. 1146*22ce4affSfengbojiangThe process is repeated for each symbol. 1147*22ce4affSfengbojiang 1148*22ce4affSfengbojiang__Example__ : 1149*22ce4affSfengbojiangPresuming a symbol has a probability of 5, 1150*22ce4affSfengbojiangit receives 5 cells, corresponding to 5 state values. 1151*22ce4affSfengbojiangThese state values are then sorted in natural order. 1152*22ce4affSfengbojiang 1153*22ce4affSfengbojiangNext power of 2 after 5 is 8. 1154*22ce4affSfengbojiangSpace of probabilities must be divided into 8 equal parts. 1155*22ce4affSfengbojiangPresuming the `Accuracy_Log` is 7, it defines a space of 128 states. 1156*22ce4affSfengbojiangDivided by 8, each share is 16 large. 1157*22ce4affSfengbojiang 1158*22ce4affSfengbojiangIn order to reach 8 shares, 8-5=3 lowest states will count "double", 1159*22ce4affSfengbojiangdoubling their shares (32 in width), hence requiring one more bit. 1160*22ce4affSfengbojiang 1161*22ce4affSfengbojiangBaseline is assigned starting from the higher states using fewer bits, 1162*22ce4affSfengbojiangincreasing at each state, then resuming at the first state, 1163*22ce4affSfengbojiangeach state takes its allocated width from Baseline. 1164*22ce4affSfengbojiang 1165*22ce4affSfengbojiang| state value | 1 | 39 | 77 | 84 | 122 | 1166*22ce4affSfengbojiang| state order | 0 | 1 | 2 | 3 | 4 | 1167*22ce4affSfengbojiang| ---------------- | ----- | ----- | ------ | ---- | ------ | 1168*22ce4affSfengbojiang| width | 32 | 32 | 32 | 16 | 16 | 1169*22ce4affSfengbojiang| `Number_of_Bits` | 5 | 5 | 5 | 4 | 4 | 1170*22ce4affSfengbojiang| range number | 2 | 4 | 6 | 0 | 1 | 1171*22ce4affSfengbojiang| `Baseline` | 32 | 64 | 96 | 0 | 16 | 1172*22ce4affSfengbojiang| range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 | 1173*22ce4affSfengbojiang 1174*22ce4affSfengbojiangDuring decoding, the next state value is determined from current state value, 1175*22ce4affSfengbojiangby reading the required `Number_of_Bits`, and adding the specified `Baseline`. 1176*22ce4affSfengbojiang 1177*22ce4affSfengbojiangSee [Appendix A] for the results of this process applied to the default distributions. 1178*22ce4affSfengbojiang 1179*22ce4affSfengbojiang[Appendix A]: #appendix-a---decoding-tables-for-predefined-codes 1180*22ce4affSfengbojiang 1181*22ce4affSfengbojiang 1182*22ce4affSfengbojiangHuffman Coding 1183*22ce4affSfengbojiang-------------- 1184*22ce4affSfengbojiangZstandard Huffman-coded streams are read backwards, 1185*22ce4affSfengbojiangsimilar to the FSE bitstreams. 1186*22ce4affSfengbojiangTherefore, to find the start of the bitstream, it is therefore to 1187*22ce4affSfengbojiangknow the offset of the last byte of the Huffman-coded stream. 1188*22ce4affSfengbojiang 1189*22ce4affSfengbojiangAfter writing the last bit containing information, the compressor 1190*22ce4affSfengbojiangwrites a single `1`-bit and then fills the byte with 0-7 `0` bits of 1191*22ce4affSfengbojiangpadding. The last byte of the compressed bitstream cannot be `0` for 1192*22ce4affSfengbojiangthat reason. 1193*22ce4affSfengbojiang 1194*22ce4affSfengbojiangWhen decompressing, the last byte containing the padding is the first 1195*22ce4affSfengbojiangbyte to read. The decompressor needs to skip 0-7 initial `0`-bits and 1196*22ce4affSfengbojiangthe first `1`-bit it occurs. Afterwards, the useful part of the bitstream 1197*22ce4affSfengbojiangbegins. 1198*22ce4affSfengbojiang 1199*22ce4affSfengbojiangThe bitstream contains Huffman-coded symbols in __little-endian__ order, 1200*22ce4affSfengbojiangwith the codes defined by the method below. 1201*22ce4affSfengbojiang 1202*22ce4affSfengbojiang### Huffman Tree Description 1203*22ce4affSfengbojiang 1204*22ce4affSfengbojiangPrefix coding represents symbols from an a priori known alphabet 1205*22ce4affSfengbojiangby bit sequences (codewords), one codeword for each symbol, 1206*22ce4affSfengbojiangin a manner such that different symbols may be represented 1207*22ce4affSfengbojiangby bit sequences of different lengths, 1208*22ce4affSfengbojiangbut a parser can always parse an encoded string 1209*22ce4affSfengbojiangunambiguously symbol-by-symbol. 1210*22ce4affSfengbojiang 1211*22ce4affSfengbojiangGiven an alphabet with known symbol frequencies, 1212*22ce4affSfengbojiangthe Huffman algorithm allows the construction of an optimal prefix code 1213*22ce4affSfengbojiangusing the fewest bits of any possible prefix codes for that alphabet. 1214*22ce4affSfengbojiang 1215*22ce4affSfengbojiangPrefix code must not exceed a maximum code length. 1216*22ce4affSfengbojiangMore bits improve accuracy but cost more header size, 1217*22ce4affSfengbojiangand require more memory or more complex decoding operations. 1218*22ce4affSfengbojiangThis specification limits maximum code length to 11 bits. 1219*22ce4affSfengbojiang 1220*22ce4affSfengbojiang#### Representation 1221*22ce4affSfengbojiang 1222*22ce4affSfengbojiangAll literal values from zero (included) to last present one (excluded) 1223*22ce4affSfengbojiangare represented by `Weight` with values from `0` to `Max_Number_of_Bits`. 1224*22ce4affSfengbojiangTransformation from `Weight` to `Number_of_Bits` follows this formula : 1225*22ce4affSfengbojiang``` 1226*22ce4affSfengbojiangNumber_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0 1227*22ce4affSfengbojiang``` 1228*22ce4affSfengbojiangThe last symbol's `Weight` is deduced from previously decoded ones, 1229*22ce4affSfengbojiangby completing to the nearest power of 2. 1230*22ce4affSfengbojiangThis power of 2 gives `Max_Number_of_Bits`, the depth of the current tree. 1231*22ce4affSfengbojiang`Max_Number_of_Bits` must be <= 11, 1232*22ce4affSfengbojiangotherwise the representation is considered corrupted. 1233*22ce4affSfengbojiang 1234*22ce4affSfengbojiang__Example__ : 1235*22ce4affSfengbojiangLet's presume the following Huffman tree must be described : 1236*22ce4affSfengbojiang 1237*22ce4affSfengbojiang| literal value | 0 | 1 | 2 | 3 | 4 | 5 | 1238*22ce4affSfengbojiang| ---------------- | --- | --- | --- | --- | --- | --- | 1239*22ce4affSfengbojiang| `Number_of_Bits` | 1 | 2 | 3 | 0 | 4 | 4 | 1240*22ce4affSfengbojiang 1241*22ce4affSfengbojiangThe tree depth is 4, since its longest elements uses 4 bits 1242*22ce4affSfengbojiang(longest elements are the one with smallest frequency). 1243*22ce4affSfengbojiangValue `5` will not be listed, as it can be determined from values for 0-4, 1244*22ce4affSfengbojiangnor will values above `5` as they are all 0. 1245*22ce4affSfengbojiangValues from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`. 1246*22ce4affSfengbojiangWeight formula is : 1247*22ce4affSfengbojiang``` 1248*22ce4affSfengbojiangWeight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0 1249*22ce4affSfengbojiang``` 1250*22ce4affSfengbojiangIt gives the following series of weights : 1251*22ce4affSfengbojiang 1252*22ce4affSfengbojiang| literal value | 0 | 1 | 2 | 3 | 4 | 1253*22ce4affSfengbojiang| ------------- | --- | --- | --- | --- | --- | 1254*22ce4affSfengbojiang| `Weight` | 4 | 3 | 2 | 0 | 1 | 1255*22ce4affSfengbojiang 1256*22ce4affSfengbojiangThe decoder will do the inverse operation : 1257*22ce4affSfengbojianghaving collected weights of literal symbols from `0` to `4`, 1258*22ce4affSfengbojiangit knows the last literal, `5`, is present with a non-zero `Weight`. 1259*22ce4affSfengbojiangThe `Weight` of `5` can be determined by advancing to the next power of 2. 1260*22ce4affSfengbojiangThe sum of `2^(Weight-1)` (excluding 0's) is : 1261*22ce4affSfengbojiang`8 + 4 + 2 + 0 + 1 = 15`. 1262*22ce4affSfengbojiangNearest larger power of 2 value is 16. 1263*22ce4affSfengbojiangTherefore, `Max_Number_of_Bits = 4` and `Weight[5] = 16-15 = 1`. 1264*22ce4affSfengbojiang 1265*22ce4affSfengbojiang#### Huffman Tree header 1266*22ce4affSfengbojiang 1267*22ce4affSfengbojiangThis is a single byte value (0-255), 1268*22ce4affSfengbojiangwhich describes how the series of weights is encoded. 1269*22ce4affSfengbojiang 1270*22ce4affSfengbojiang- if `headerByte` < 128 : 1271*22ce4affSfengbojiang the series of weights is compressed using FSE (see below). 1272*22ce4affSfengbojiang The length of the FSE-compressed series is equal to `headerByte` (0-127). 1273*22ce4affSfengbojiang 1274*22ce4affSfengbojiang- if `headerByte` >= 128 : 1275*22ce4affSfengbojiang + the series of weights uses a direct representation, 1276*22ce4affSfengbojiang where each `Weight` is encoded directly as a 4 bits field (0-15). 1277*22ce4affSfengbojiang + They are encoded forward, 2 weights to a byte, 1278*22ce4affSfengbojiang first weight taking the top four bits and second one taking the bottom four. 1279*22ce4affSfengbojiang * e.g. the following operations could be used to read the weights: 1280*22ce4affSfengbojiang `Weight[0] = (Byte[0] >> 4), Weight[1] = (Byte[0] & 0xf)`, etc. 1281*22ce4affSfengbojiang + The full representation occupies `Ceiling(Number_of_Weights/2)` bytes, 1282*22ce4affSfengbojiang meaning it uses only full bytes even if `Number_of_Weights` is odd. 1283*22ce4affSfengbojiang + `Number_of_Weights = headerByte - 127`. 1284*22ce4affSfengbojiang * Note that maximum `Number_of_Weights` is 255-127 = 128, 1285*22ce4affSfengbojiang therefore, only up to 128 `Weight` can be encoded using direct representation. 1286*22ce4affSfengbojiang * Since the last non-zero `Weight` is _not_ encoded, 1287*22ce4affSfengbojiang this scheme is compatible with alphabet sizes of up to 129 symbols, 1288*22ce4affSfengbojiang hence including literal symbol 128. 1289*22ce4affSfengbojiang * If any literal symbol > 128 has a non-zero `Weight`, 1290*22ce4affSfengbojiang direct representation is not possible. 1291*22ce4affSfengbojiang In such case, it's necessary to use FSE compression. 1292*22ce4affSfengbojiang 1293*22ce4affSfengbojiang 1294*22ce4affSfengbojiang#### Finite State Entropy (FSE) compression of Huffman weights 1295*22ce4affSfengbojiang 1296*22ce4affSfengbojiangIn this case, the series of Huffman weights is compressed using FSE compression. 1297*22ce4affSfengbojiangIt's a single bitstream with 2 interleaved states, 1298*22ce4affSfengbojiangsharing a single distribution table. 1299*22ce4affSfengbojiang 1300*22ce4affSfengbojiangTo decode an FSE bitstream, it is necessary to know its compressed size. 1301*22ce4affSfengbojiangCompressed size is provided by `headerByte`. 1302*22ce4affSfengbojiangIt's also necessary to know its _maximum possible_ decompressed size, 1303*22ce4affSfengbojiangwhich is `255`, since literal values span from `0` to `255`, 1304*22ce4affSfengbojiangand last symbol's `Weight` is not represented. 1305*22ce4affSfengbojiang 1306*22ce4affSfengbojiangAn FSE bitstream starts by a header, describing probabilities distribution. 1307*22ce4affSfengbojiangIt will create a Decoding Table. 1308*22ce4affSfengbojiangFor a list of Huffman weights, the maximum accuracy log is 6 bits. 1309*22ce4affSfengbojiangFor more description see the [FSE header description](#fse-table-description) 1310*22ce4affSfengbojiang 1311*22ce4affSfengbojiangThe Huffman header compression uses 2 states, 1312*22ce4affSfengbojiangwhich share the same FSE distribution table. 1313*22ce4affSfengbojiangThe first state (`State1`) encodes the even indexed symbols, 1314*22ce4affSfengbojiangand the second (`State2`) encodes the odd indexed symbols. 1315*22ce4affSfengbojiang`State1` is initialized first, and then `State2`, and they take turns 1316*22ce4affSfengbojiangdecoding a single symbol and updating their state. 1317*22ce4affSfengbojiangFor more details on these FSE operations, see the [FSE section](#fse). 1318*22ce4affSfengbojiang 1319*22ce4affSfengbojiangThe number of symbols to decode is determined 1320*22ce4affSfengbojiangby tracking bitStream overflow condition: 1321*22ce4affSfengbojiangIf updating state after decoding a symbol would require more bits than 1322*22ce4affSfengbojiangremain in the stream, it is assumed that extra bits are 0. Then, 1323*22ce4affSfengbojiangsymbols for each of the final states are decoded and the process is complete. 1324*22ce4affSfengbojiang 1325*22ce4affSfengbojiang#### Conversion from weights to Huffman prefix codes 1326*22ce4affSfengbojiang 1327*22ce4affSfengbojiangAll present symbols shall now have a `Weight` value. 1328*22ce4affSfengbojiangIt is possible to transform weights into `Number_of_Bits`, using this formula: 1329*22ce4affSfengbojiang``` 1330*22ce4affSfengbojiangNumber_of_Bits = (Weight>0) ? Max_Number_of_Bits + 1 - Weight : 0 1331*22ce4affSfengbojiang``` 1332*22ce4affSfengbojiangSymbols are sorted by `Weight`. 1333*22ce4affSfengbojiangWithin same `Weight`, symbols keep natural sequential order. 1334*22ce4affSfengbojiangSymbols with a `Weight` of zero are removed. 1335*22ce4affSfengbojiangThen, starting from lowest `Weight`, prefix codes are distributed in sequential order. 1336*22ce4affSfengbojiang 1337*22ce4affSfengbojiang__Example__ : 1338*22ce4affSfengbojiangLet's presume the following list of weights has been decoded : 1339*22ce4affSfengbojiang 1340*22ce4affSfengbojiang| Literal | 0 | 1 | 2 | 3 | 4 | 5 | 1341*22ce4affSfengbojiang| -------- | --- | --- | --- | --- | --- | --- | 1342*22ce4affSfengbojiang| `Weight` | 4 | 3 | 2 | 0 | 1 | 1 | 1343*22ce4affSfengbojiang 1344*22ce4affSfengbojiangSorted by weight and then natural sequential order, 1345*22ce4affSfengbojiangit gives the following distribution : 1346*22ce4affSfengbojiang 1347*22ce4affSfengbojiang| Literal | 3 | 4 | 5 | 2 | 1 | 0 | 1348*22ce4affSfengbojiang| ---------------- | --- | --- | --- | --- | --- | ---- | 1349*22ce4affSfengbojiang| `Weight` | 0 | 1 | 1 | 2 | 3 | 4 | 1350*22ce4affSfengbojiang| `Number_of_Bits` | 0 | 4 | 4 | 3 | 2 | 1 | 1351*22ce4affSfengbojiang| prefix codes | N/A | 0000| 0001| 001 | 01 | 1 | 1352*22ce4affSfengbojiang 1353*22ce4affSfengbojiang### Huffman-coded Streams 1354*22ce4affSfengbojiang 1355*22ce4affSfengbojiangGiven a Huffman decoding table, 1356*22ce4affSfengbojiangit's possible to decode a Huffman-coded stream. 1357*22ce4affSfengbojiang 1358*22ce4affSfengbojiangEach bitstream must be read _backward_, 1359*22ce4affSfengbojiangthat is starting from the end down to the beginning. 1360*22ce4affSfengbojiangTherefore it's necessary to know the size of each bitstream. 1361*22ce4affSfengbojiang 1362*22ce4affSfengbojiangIt's also necessary to know exactly which _bit_ is the last one. 1363*22ce4affSfengbojiangThis is detected by a final bit flag : 1364*22ce4affSfengbojiangthe highest bit of latest byte is a final-bit-flag. 1365*22ce4affSfengbojiangConsequently, a last byte of `0` is not possible. 1366*22ce4affSfengbojiangAnd the final-bit-flag itself is not part of the useful bitstream. 1367*22ce4affSfengbojiangHence, the last byte contains between 0 and 7 useful bits. 1368*22ce4affSfengbojiang 1369*22ce4affSfengbojiangStarting from the end, 1370*22ce4affSfengbojiangit's possible to read the bitstream in a __little-endian__ fashion, 1371*22ce4affSfengbojiangkeeping track of already used bits. Since the bitstream is encoded in reverse 1372*22ce4affSfengbojiangorder, starting from the end read symbols in forward order. 1373*22ce4affSfengbojiang 1374*22ce4affSfengbojiangFor example, if the literal sequence "0145" was encoded using above prefix code, 1375*22ce4affSfengbojiangit would be encoded (in reverse order) as: 1376*22ce4affSfengbojiang 1377*22ce4affSfengbojiang|Symbol | 5 | 4 | 1 | 0 | Padding | 1378*22ce4affSfengbojiang|--------|------|------|----|---|---------| 1379*22ce4affSfengbojiang|Encoding|`0000`|`0001`|`01`|`1`| `00001` | 1380*22ce4affSfengbojiang 1381*22ce4affSfengbojiangResulting in following 2-bytes bitstream : 1382*22ce4affSfengbojiang``` 1383*22ce4affSfengbojiang00010000 00001101 1384*22ce4affSfengbojiang``` 1385*22ce4affSfengbojiang 1386*22ce4affSfengbojiangHere is an alternative representation with the symbol codes separated by underscore: 1387*22ce4affSfengbojiang``` 1388*22ce4affSfengbojiang0001_0000 00001_1_01 1389*22ce4affSfengbojiang``` 1390*22ce4affSfengbojiang 1391*22ce4affSfengbojiangReading highest `Max_Number_of_Bits` bits, 1392*22ce4affSfengbojiangit's possible to compare extracted value to decoding table, 1393*22ce4affSfengbojiangdetermining the symbol to decode and number of bits to discard. 1394*22ce4affSfengbojiang 1395*22ce4affSfengbojiangThe process continues up to reading the required number of symbols per stream. 1396*22ce4affSfengbojiangIf a bitstream is not entirely and exactly consumed, 1397*22ce4affSfengbojianghence reaching exactly its beginning position with _all_ bits consumed, 1398*22ce4affSfengbojiangthe decoding process is considered faulty. 1399*22ce4affSfengbojiang 1400*22ce4affSfengbojiang 1401*22ce4affSfengbojiangDictionary Format 1402*22ce4affSfengbojiang----------------- 1403*22ce4affSfengbojiang 1404*22ce4affSfengbojiangZstandard is compatible with "raw content" dictionaries, 1405*22ce4affSfengbojiangfree of any format restriction, except that they must be at least 8 bytes. 1406*22ce4affSfengbojiangThese dictionaries function as if they were just the `Content` part 1407*22ce4affSfengbojiangof a formatted dictionary. 1408*22ce4affSfengbojiang 1409*22ce4affSfengbojiangBut dictionaries created by `zstd --train` follow a format, described here. 1410*22ce4affSfengbojiang 1411*22ce4affSfengbojiang__Pre-requisites__ : a dictionary has a size, 1412*22ce4affSfengbojiang defined either by a buffer limit, or a file size. 1413*22ce4affSfengbojiang 1414*22ce4affSfengbojiang| `Magic_Number` | `Dictionary_ID` | `Entropy_Tables` | `Content` | 1415*22ce4affSfengbojiang| -------------- | --------------- | ---------------- | --------- | 1416*22ce4affSfengbojiang 1417*22ce4affSfengbojiang__`Magic_Number`__ : 4 bytes ID, value 0xEC30A437, __little-endian__ format 1418*22ce4affSfengbojiang 1419*22ce4affSfengbojiang__`Dictionary_ID`__ : 4 bytes, stored in __little-endian__ format. 1420*22ce4affSfengbojiang `Dictionary_ID` can be any value, except 0 (which means no `Dictionary_ID`). 1421*22ce4affSfengbojiang It's used by decoders to check if they use the correct dictionary. 1422*22ce4affSfengbojiang 1423*22ce4affSfengbojiang_Reserved ranges :_ 1424*22ce4affSfengbojiangIf the dictionary is going to be distributed in a public environment, 1425*22ce4affSfengbojiangthe following ranges of `Dictionary_ID` are reserved for some future registrar 1426*22ce4affSfengbojiangand shall not be used : 1427*22ce4affSfengbojiang 1428*22ce4affSfengbojiang - low range : <= 32767 1429*22ce4affSfengbojiang - high range : >= (2^31) 1430*22ce4affSfengbojiang 1431*22ce4affSfengbojiangOutside of these ranges, any value of `Dictionary_ID` 1432*22ce4affSfengbojiangwhich is both `>= 32768` and `< (1<<31)` can be used freely, 1433*22ce4affSfengbojiangeven in public environment. 1434*22ce4affSfengbojiang 1435*22ce4affSfengbojiang 1436*22ce4affSfengbojiang__`Entropy_Tables`__ : follow the same format as tables in [compressed blocks]. 1437*22ce4affSfengbojiang See the relevant [FSE](#fse-table-description) 1438*22ce4affSfengbojiang and [Huffman](#huffman-tree-description) sections for how to decode these tables. 1439*22ce4affSfengbojiang They are stored in following order : 1440*22ce4affSfengbojiang Huffman tables for literals, FSE table for offsets, 1441*22ce4affSfengbojiang FSE table for match lengths, and FSE table for literals lengths. 1442*22ce4affSfengbojiang These tables populate the Repeat Stats literals mode and 1443*22ce4affSfengbojiang Repeat distribution mode for sequence decoding. 1444*22ce4affSfengbojiang It's finally followed by 3 offset values, populating recent offsets (instead of using `{1,4,8}`), 1445*22ce4affSfengbojiang stored in order, 4-bytes __little-endian__ each, for a total of 12 bytes. 1446*22ce4affSfengbojiang Each recent offset must have a value <= dictionary content size, and cannot equal 0. 1447*22ce4affSfengbojiang 1448*22ce4affSfengbojiang__`Content`__ : The rest of the dictionary is its content. 1449*22ce4affSfengbojiang The content act as a "past" in front of data to compress or decompress, 1450*22ce4affSfengbojiang so it can be referenced in sequence commands. 1451*22ce4affSfengbojiang As long as the amount of data decoded from this frame is less than or 1452*22ce4affSfengbojiang equal to `Window_Size`, sequence commands may specify offsets longer 1453*22ce4affSfengbojiang than the total length of decoded output so far to reference back to the 1454*22ce4affSfengbojiang dictionary, even parts of the dictionary with offsets larger than `Window_Size`. 1455*22ce4affSfengbojiang After the total output has surpassed `Window_Size` however, 1456*22ce4affSfengbojiang this is no longer allowed and the dictionary is no longer accessible. 1457*22ce4affSfengbojiang 1458*22ce4affSfengbojiang[compressed blocks]: #the-format-of-compressed_block 1459*22ce4affSfengbojiang 1460*22ce4affSfengbojiangIf a dictionary is provided by an external source, 1461*22ce4affSfengbojiangit should be loaded with great care, its content considered untrusted. 1462*22ce4affSfengbojiang 1463*22ce4affSfengbojiang 1464*22ce4affSfengbojiang 1465*22ce4affSfengbojiangAppendix A - Decoding tables for predefined codes 1466*22ce4affSfengbojiang------------------------------------------------- 1467*22ce4affSfengbojiang 1468*22ce4affSfengbojiangThis appendix contains FSE decoding tables 1469*22ce4affSfengbojiangfor the predefined literal length, match length, and offset codes. 1470*22ce4affSfengbojiangThe tables have been constructed using the algorithm as given above in chapter 1471*22ce4affSfengbojiang"from normalized distribution to decoding tables". 1472*22ce4affSfengbojiangThe tables here can be used as examples 1473*22ce4affSfengbojiangto crosscheck that an implementation build its decoding tables correctly. 1474*22ce4affSfengbojiang 1475*22ce4affSfengbojiang#### Literal Length Code: 1476*22ce4affSfengbojiang 1477*22ce4affSfengbojiang| State | Symbol | Number_Of_Bits | Base | 1478*22ce4affSfengbojiang| ----- | ------ | -------------- | ---- | 1479*22ce4affSfengbojiang| 0 | 0 | 4 | 0 | 1480*22ce4affSfengbojiang| 1 | 0 | 4 | 16 | 1481*22ce4affSfengbojiang| 2 | 1 | 5 | 32 | 1482*22ce4affSfengbojiang| 3 | 3 | 5 | 0 | 1483*22ce4affSfengbojiang| 4 | 4 | 5 | 0 | 1484*22ce4affSfengbojiang| 5 | 6 | 5 | 0 | 1485*22ce4affSfengbojiang| 6 | 7 | 5 | 0 | 1486*22ce4affSfengbojiang| 7 | 9 | 5 | 0 | 1487*22ce4affSfengbojiang| 8 | 10 | 5 | 0 | 1488*22ce4affSfengbojiang| 9 | 12 | 5 | 0 | 1489*22ce4affSfengbojiang| 10 | 14 | 6 | 0 | 1490*22ce4affSfengbojiang| 11 | 16 | 5 | 0 | 1491*22ce4affSfengbojiang| 12 | 18 | 5 | 0 | 1492*22ce4affSfengbojiang| 13 | 19 | 5 | 0 | 1493*22ce4affSfengbojiang| 14 | 21 | 5 | 0 | 1494*22ce4affSfengbojiang| 15 | 22 | 5 | 0 | 1495*22ce4affSfengbojiang| 16 | 24 | 5 | 0 | 1496*22ce4affSfengbojiang| 17 | 25 | 5 | 32 | 1497*22ce4affSfengbojiang| 18 | 26 | 5 | 0 | 1498*22ce4affSfengbojiang| 19 | 27 | 6 | 0 | 1499*22ce4affSfengbojiang| 20 | 29 | 6 | 0 | 1500*22ce4affSfengbojiang| 21 | 31 | 6 | 0 | 1501*22ce4affSfengbojiang| 22 | 0 | 4 | 32 | 1502*22ce4affSfengbojiang| 23 | 1 | 4 | 0 | 1503*22ce4affSfengbojiang| 24 | 2 | 5 | 0 | 1504*22ce4affSfengbojiang| 25 | 4 | 5 | 32 | 1505*22ce4affSfengbojiang| 26 | 5 | 5 | 0 | 1506*22ce4affSfengbojiang| 27 | 7 | 5 | 32 | 1507*22ce4affSfengbojiang| 28 | 8 | 5 | 0 | 1508*22ce4affSfengbojiang| 29 | 10 | 5 | 32 | 1509*22ce4affSfengbojiang| 30 | 11 | 5 | 0 | 1510*22ce4affSfengbojiang| 31 | 13 | 6 | 0 | 1511*22ce4affSfengbojiang| 32 | 16 | 5 | 32 | 1512*22ce4affSfengbojiang| 33 | 17 | 5 | 0 | 1513*22ce4affSfengbojiang| 34 | 19 | 5 | 32 | 1514*22ce4affSfengbojiang| 35 | 20 | 5 | 0 | 1515*22ce4affSfengbojiang| 36 | 22 | 5 | 32 | 1516*22ce4affSfengbojiang| 37 | 23 | 5 | 0 | 1517*22ce4affSfengbojiang| 38 | 25 | 4 | 0 | 1518*22ce4affSfengbojiang| 39 | 25 | 4 | 16 | 1519*22ce4affSfengbojiang| 40 | 26 | 5 | 32 | 1520*22ce4affSfengbojiang| 41 | 28 | 6 | 0 | 1521*22ce4affSfengbojiang| 42 | 30 | 6 | 0 | 1522*22ce4affSfengbojiang| 43 | 0 | 4 | 48 | 1523*22ce4affSfengbojiang| 44 | 1 | 4 | 16 | 1524*22ce4affSfengbojiang| 45 | 2 | 5 | 32 | 1525*22ce4affSfengbojiang| 46 | 3 | 5 | 32 | 1526*22ce4affSfengbojiang| 47 | 5 | 5 | 32 | 1527*22ce4affSfengbojiang| 48 | 6 | 5 | 32 | 1528*22ce4affSfengbojiang| 49 | 8 | 5 | 32 | 1529*22ce4affSfengbojiang| 50 | 9 | 5 | 32 | 1530*22ce4affSfengbojiang| 51 | 11 | 5 | 32 | 1531*22ce4affSfengbojiang| 52 | 12 | 5 | 32 | 1532*22ce4affSfengbojiang| 53 | 15 | 6 | 0 | 1533*22ce4affSfengbojiang| 54 | 17 | 5 | 32 | 1534*22ce4affSfengbojiang| 55 | 18 | 5 | 32 | 1535*22ce4affSfengbojiang| 56 | 20 | 5 | 32 | 1536*22ce4affSfengbojiang| 57 | 21 | 5 | 32 | 1537*22ce4affSfengbojiang| 58 | 23 | 5 | 32 | 1538*22ce4affSfengbojiang| 59 | 24 | 5 | 32 | 1539*22ce4affSfengbojiang| 60 | 35 | 6 | 0 | 1540*22ce4affSfengbojiang| 61 | 34 | 6 | 0 | 1541*22ce4affSfengbojiang| 62 | 33 | 6 | 0 | 1542*22ce4affSfengbojiang| 63 | 32 | 6 | 0 | 1543*22ce4affSfengbojiang 1544*22ce4affSfengbojiang#### Match Length Code: 1545*22ce4affSfengbojiang 1546*22ce4affSfengbojiang| State | Symbol | Number_Of_Bits | Base | 1547*22ce4affSfengbojiang| ----- | ------ | -------------- | ---- | 1548*22ce4affSfengbojiang| 0 | 0 | 6 | 0 | 1549*22ce4affSfengbojiang| 1 | 1 | 4 | 0 | 1550*22ce4affSfengbojiang| 2 | 2 | 5 | 32 | 1551*22ce4affSfengbojiang| 3 | 3 | 5 | 0 | 1552*22ce4affSfengbojiang| 4 | 5 | 5 | 0 | 1553*22ce4affSfengbojiang| 5 | 6 | 5 | 0 | 1554*22ce4affSfengbojiang| 6 | 8 | 5 | 0 | 1555*22ce4affSfengbojiang| 7 | 10 | 6 | 0 | 1556*22ce4affSfengbojiang| 8 | 13 | 6 | 0 | 1557*22ce4affSfengbojiang| 9 | 16 | 6 | 0 | 1558*22ce4affSfengbojiang| 10 | 19 | 6 | 0 | 1559*22ce4affSfengbojiang| 11 | 22 | 6 | 0 | 1560*22ce4affSfengbojiang| 12 | 25 | 6 | 0 | 1561*22ce4affSfengbojiang| 13 | 28 | 6 | 0 | 1562*22ce4affSfengbojiang| 14 | 31 | 6 | 0 | 1563*22ce4affSfengbojiang| 15 | 33 | 6 | 0 | 1564*22ce4affSfengbojiang| 16 | 35 | 6 | 0 | 1565*22ce4affSfengbojiang| 17 | 37 | 6 | 0 | 1566*22ce4affSfengbojiang| 18 | 39 | 6 | 0 | 1567*22ce4affSfengbojiang| 19 | 41 | 6 | 0 | 1568*22ce4affSfengbojiang| 20 | 43 | 6 | 0 | 1569*22ce4affSfengbojiang| 21 | 45 | 6 | 0 | 1570*22ce4affSfengbojiang| 22 | 1 | 4 | 16 | 1571*22ce4affSfengbojiang| 23 | 2 | 4 | 0 | 1572*22ce4affSfengbojiang| 24 | 3 | 5 | 32 | 1573*22ce4affSfengbojiang| 25 | 4 | 5 | 0 | 1574*22ce4affSfengbojiang| 26 | 6 | 5 | 32 | 1575*22ce4affSfengbojiang| 27 | 7 | 5 | 0 | 1576*22ce4affSfengbojiang| 28 | 9 | 6 | 0 | 1577*22ce4affSfengbojiang| 29 | 12 | 6 | 0 | 1578*22ce4affSfengbojiang| 30 | 15 | 6 | 0 | 1579*22ce4affSfengbojiang| 31 | 18 | 6 | 0 | 1580*22ce4affSfengbojiang| 32 | 21 | 6 | 0 | 1581*22ce4affSfengbojiang| 33 | 24 | 6 | 0 | 1582*22ce4affSfengbojiang| 34 | 27 | 6 | 0 | 1583*22ce4affSfengbojiang| 35 | 30 | 6 | 0 | 1584*22ce4affSfengbojiang| 36 | 32 | 6 | 0 | 1585*22ce4affSfengbojiang| 37 | 34 | 6 | 0 | 1586*22ce4affSfengbojiang| 38 | 36 | 6 | 0 | 1587*22ce4affSfengbojiang| 39 | 38 | 6 | 0 | 1588*22ce4affSfengbojiang| 40 | 40 | 6 | 0 | 1589*22ce4affSfengbojiang| 41 | 42 | 6 | 0 | 1590*22ce4affSfengbojiang| 42 | 44 | 6 | 0 | 1591*22ce4affSfengbojiang| 43 | 1 | 4 | 32 | 1592*22ce4affSfengbojiang| 44 | 1 | 4 | 48 | 1593*22ce4affSfengbojiang| 45 | 2 | 4 | 16 | 1594*22ce4affSfengbojiang| 46 | 4 | 5 | 32 | 1595*22ce4affSfengbojiang| 47 | 5 | 5 | 32 | 1596*22ce4affSfengbojiang| 48 | 7 | 5 | 32 | 1597*22ce4affSfengbojiang| 49 | 8 | 5 | 32 | 1598*22ce4affSfengbojiang| 50 | 11 | 6 | 0 | 1599*22ce4affSfengbojiang| 51 | 14 | 6 | 0 | 1600*22ce4affSfengbojiang| 52 | 17 | 6 | 0 | 1601*22ce4affSfengbojiang| 53 | 20 | 6 | 0 | 1602*22ce4affSfengbojiang| 54 | 23 | 6 | 0 | 1603*22ce4affSfengbojiang| 55 | 26 | 6 | 0 | 1604*22ce4affSfengbojiang| 56 | 29 | 6 | 0 | 1605*22ce4affSfengbojiang| 57 | 52 | 6 | 0 | 1606*22ce4affSfengbojiang| 58 | 51 | 6 | 0 | 1607*22ce4affSfengbojiang| 59 | 50 | 6 | 0 | 1608*22ce4affSfengbojiang| 60 | 49 | 6 | 0 | 1609*22ce4affSfengbojiang| 61 | 48 | 6 | 0 | 1610*22ce4affSfengbojiang| 62 | 47 | 6 | 0 | 1611*22ce4affSfengbojiang| 63 | 46 | 6 | 0 | 1612*22ce4affSfengbojiang 1613*22ce4affSfengbojiang#### Offset Code: 1614*22ce4affSfengbojiang 1615*22ce4affSfengbojiang| State | Symbol | Number_Of_Bits | Base | 1616*22ce4affSfengbojiang| ----- | ------ | -------------- | ---- | 1617*22ce4affSfengbojiang| 0 | 0 | 5 | 0 | 1618*22ce4affSfengbojiang| 1 | 6 | 4 | 0 | 1619*22ce4affSfengbojiang| 2 | 9 | 5 | 0 | 1620*22ce4affSfengbojiang| 3 | 15 | 5 | 0 | 1621*22ce4affSfengbojiang| 4 | 21 | 5 | 0 | 1622*22ce4affSfengbojiang| 5 | 3 | 5 | 0 | 1623*22ce4affSfengbojiang| 6 | 7 | 4 | 0 | 1624*22ce4affSfengbojiang| 7 | 12 | 5 | 0 | 1625*22ce4affSfengbojiang| 8 | 18 | 5 | 0 | 1626*22ce4affSfengbojiang| 9 | 23 | 5 | 0 | 1627*22ce4affSfengbojiang| 10 | 5 | 5 | 0 | 1628*22ce4affSfengbojiang| 11 | 8 | 4 | 0 | 1629*22ce4affSfengbojiang| 12 | 14 | 5 | 0 | 1630*22ce4affSfengbojiang| 13 | 20 | 5 | 0 | 1631*22ce4affSfengbojiang| 14 | 2 | 5 | 0 | 1632*22ce4affSfengbojiang| 15 | 7 | 4 | 16 | 1633*22ce4affSfengbojiang| 16 | 11 | 5 | 0 | 1634*22ce4affSfengbojiang| 17 | 17 | 5 | 0 | 1635*22ce4affSfengbojiang| 18 | 22 | 5 | 0 | 1636*22ce4affSfengbojiang| 19 | 4 | 5 | 0 | 1637*22ce4affSfengbojiang| 20 | 8 | 4 | 16 | 1638*22ce4affSfengbojiang| 21 | 13 | 5 | 0 | 1639*22ce4affSfengbojiang| 22 | 19 | 5 | 0 | 1640*22ce4affSfengbojiang| 23 | 1 | 5 | 0 | 1641*22ce4affSfengbojiang| 24 | 6 | 4 | 16 | 1642*22ce4affSfengbojiang| 25 | 10 | 5 | 0 | 1643*22ce4affSfengbojiang| 26 | 16 | 5 | 0 | 1644*22ce4affSfengbojiang| 27 | 28 | 5 | 0 | 1645*22ce4affSfengbojiang| 28 | 27 | 5 | 0 | 1646*22ce4affSfengbojiang| 29 | 26 | 5 | 0 | 1647*22ce4affSfengbojiang| 30 | 25 | 5 | 0 | 1648*22ce4affSfengbojiang| 31 | 24 | 5 | 0 | 1649*22ce4affSfengbojiang 1650*22ce4affSfengbojiang 1651*22ce4affSfengbojiang 1652*22ce4affSfengbojiangAppendix B - Resources for implementers 1653*22ce4affSfengbojiang------------------------------------------------- 1654*22ce4affSfengbojiang 1655*22ce4affSfengbojiangAn open source reference implementation is available on : 1656*22ce4affSfengbojianghttps://github.com/facebook/zstd 1657*22ce4affSfengbojiang 1658*22ce4affSfengbojiangThe project contains a frame generator, called [decodeCorpus], 1659*22ce4affSfengbojiangwhich can be used by any 3rd-party implementation 1660*22ce4affSfengbojiangto verify that a tested decoder is compliant with the specification. 1661*22ce4affSfengbojiang 1662*22ce4affSfengbojiang[decodeCorpus]: https://github.com/facebook/zstd/tree/v1.3.4/tests#decodecorpus---tool-to-generate-zstandard-frames-for-decoder-testing 1663*22ce4affSfengbojiang 1664*22ce4affSfengbojiang`decodeCorpus` generates random valid frames. 1665*22ce4affSfengbojiangA compliant decoder should be able to decode them all, 1666*22ce4affSfengbojiangor at least provide a meaningful error code explaining for which reason it cannot 1667*22ce4affSfengbojiang(memory limit restrictions for example). 1668*22ce4affSfengbojiang 1669*22ce4affSfengbojiang 1670*22ce4affSfengbojiangVersion changes 1671*22ce4affSfengbojiang--------------- 1672*22ce4affSfengbojiang- 0.3.7 : clarifications for Repeat_Offsets 1673*22ce4affSfengbojiang- 0.3.6 : clarifications for Dictionary_ID 1674*22ce4affSfengbojiang- 0.3.5 : clarifications for Block_Maximum_Size 1675*22ce4affSfengbojiang- 0.3.4 : clarifications for FSE decoding table 1676*22ce4affSfengbojiang- 0.3.3 : clarifications for field Block_Size 1677*22ce4affSfengbojiang- 0.3.2 : remove additional block size restriction on compressed blocks 1678*22ce4affSfengbojiang- 0.3.1 : minor clarification regarding offset history update rules 1679*22ce4affSfengbojiang- 0.3.0 : minor edits to match RFC8478 1680*22ce4affSfengbojiang- 0.2.9 : clarifications for huffman weights direct representation, by Ulrich Kunitz 1681*22ce4affSfengbojiang- 0.2.8 : clarifications for IETF RFC discuss 1682*22ce4affSfengbojiang- 0.2.7 : clarifications from IETF RFC review, by Vijay Gurbani and Nick Terrell 1683*22ce4affSfengbojiang- 0.2.6 : fixed an error in huffman example, by Ulrich Kunitz 1684*22ce4affSfengbojiang- 0.2.5 : minor typos and clarifications 1685*22ce4affSfengbojiang- 0.2.4 : section restructuring, by Sean Purcell 1686*22ce4affSfengbojiang- 0.2.3 : clarified several details, by Sean Purcell 1687*22ce4affSfengbojiang- 0.2.2 : added predefined codes, by Johannes Rudolph 1688*22ce4affSfengbojiang- 0.2.1 : clarify field names, by Przemyslaw Skibinski 1689*22ce4affSfengbojiang- 0.2.0 : numerous format adjustments for zstd v0.8+ 1690*22ce4affSfengbojiang- 0.1.2 : limit Huffman tree depth to 11 bits 1691*22ce4affSfengbojiang- 0.1.1 : reserved dictID ranges 1692*22ce4affSfengbojiang- 0.1.0 : initial release 1693