1. Compression algorithm (deflate)

The deflation algorithm used by gzip (also zip and zlib) is a variation of
LZ77 (Lempel-Ziv 1977, see reference below). It finds duplicated strings in
the input data. The second occurrence of a string is replaced by a pointer
to the previous string, in the form of a pair (distance, length). Distances
are limited to 32K bytes, and lengths are limited to 258 bytes. When a
string does not occur anywhere in the previous 32K bytes, it is emitted as
a sequence of literal bytes. (In this description, `string' must be taken
as an arbitrary sequence of bytes, and is not restricted to printable
characters.)

Literals or match lengths are compressed with one Huffman tree, and match
distances are compressed with another tree. The trees are stored in a
compact form at the start of each block. The blocks can have any size
(except that the compressed data for one block must fit in available
memory). A block is terminated when deflate() determines that it would be
useful to start another block with fresh trees. (This is somewhat similar
to the behavior of LZW-based _compress_.)

Duplicated strings are found using a hash table. All input strings of
length 3 are inserted in the hash table. A hash index is computed for the
next 3 bytes. If the hash chain for this index is not empty, all strings in
the chain are compared with the current input string, and the longest match
is selected.

The hash chains are searched starting with the most recent strings, to
favor small distances and thus take advantage of the Huffman encoding. The
hash chains are singly linked. There are no deletions from the hash chains;
the algorithm simply discards matches that are too old.

To avoid a worst-case situation, very long hash chains are arbitrarily
truncated at a certain length, determined by a runtime option (the level
parameter of deflateInit). So deflate() does not always find the longest
possible match but generally finds a match which is long enough.

deflate() also defers the selection of matches with a lazy evaluation
mechanism. After a match of length N has been found, deflate() searches for
a longer match at the next input byte. If a longer match is found, the
previous match is truncated to a length of one (thus producing a single
literal byte) and the process of lazy evaluation begins again. Otherwise,
the original match is kept, and the next match search is attempted only N
steps later.
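
Here is a minimal sketch, in C, of the hash-chain search described above.
It is only an illustration, not zlib's actual longest_match(): the hash
function, the table sizes and the use of position 0 as an empty-chain
marker are simplifications, and the whole input is assumed to fit inside a
single 32K window.

#include <stdio.h>

#define MIN_MATCH   3        /* strings of length 3 are hashed            */
#define MAX_MATCH   258      /* deflate never emits a longer match        */
#define MAX_DIST    32768u   /* matches farther back than 32K are ignored */
#define HASH_SIZE   32768u   /* illustrative hash table size              */

static unsigned head[HASH_SIZE];     /* newest position for each hash      */
static unsigned prev_pos[MAX_DIST];  /* singly linked chains of older ones */

/* Illustrative hash of the next 3 bytes (zlib uses a rolling variant). */
static unsigned hash3(const unsigned char *p)
{
    return ((p[0] << 10) ^ (p[1] << 5) ^ p[2]) & (HASH_SIZE - 1);
}

/* Link position pos into its chain; return the most recent previous
   position with the same hash (0 doubles as "empty" in this sketch). */
static unsigned insert_string(const unsigned char *buf, unsigned pos)
{
    unsigned h = hash3(buf + pos);
    unsigned older = head[h];
    prev_pos[pos] = older;
    head[h] = pos;
    return older;
}

/* Walk the chain, newest (smallest distance) first.  max_chain models the
   level-dependent chain truncation described above. */
static unsigned longest_match(const unsigned char *buf, unsigned len,
                              unsigned pos, unsigned max_chain,
                              unsigned *match_pos)
{
    unsigned best = MIN_MATCH - 1;
    unsigned cur  = insert_string(buf, pos);

    while (cur != 0 && pos - cur < MAX_DIST && max_chain-- != 0) {
        unsigned n = 0;
        unsigned max = len - pos < MAX_MATCH ? len - pos : MAX_MATCH;
        while (n < max && buf[cur + n] == buf[pos + n])
            n++;
        if (n > best) { best = n; *match_pos = cur; }
        cur = prev_pos[cur];         /* older occurrence, larger distance  */
    }
    return best;                     /* < MIN_MATCH means "emit a literal" */
}

int main(void)
{
    const unsigned char buf[] = "xabcdefabcdef";
    unsigned where = 0, pos, n;

    for (pos = 0; pos < 7; pos++)    /* index everything before the second */
        insert_string(buf, pos);     /* occurrence of "abcdef"             */
    n = longest_match(buf, sizeof(buf) - 1, 7, 32, &where);
    printf("length %u at distance %u\n", n, 7 - where);  /* 6 and 6        */
    return 0;
}

A real implementation would then apply the lazy-evaluation rule on top of
this: run the same search again at pos + 1 and keep whichever match is
longer, emitting a (distance, length) pair when the winner is at least
MIN_MATCH bytes long and a literal byte otherwise.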
The lazy match evaluation is also subject to a runtime parameter. If the
current match is long enough, deflate() reduces the search for a longer
match, thus speeding up the whole process. If compression ratio is more
important than speed, deflate() attempts a complete second search even if
the first match is already long enough.

The lazy match evaluation is not performed for the fastest compression
modes (level parameter 1 to 3). For these fast modes, new strings are
inserted in the hash table only when no match was found, or when the match
is not too long. This degrades the compression ratio but saves time since
there are both fewer insertions and fewer searches.
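
All of the trade-offs above (chain truncation, lazy matching, the fast
modes) are selected through the single level argument of deflateInit(). As
a rough illustration of where that parameter enters the picture, here is a
hedged sketch of the call sequence; the helper name deflate_buffer and the
buffer sizes are made up for this example, and real code should size the
output with deflateBound() and be prepared to call deflate() more than
once.

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Compress one in-memory buffer at a caller-chosen level (1 = fastest,
   no lazy matching; 9 = best compression).  Returns the compressed size,
   or 0 on failure. */
static unsigned long deflate_buffer(const unsigned char *in,
                                    unsigned long in_len,
                                    unsigned char *out,
                                    unsigned long out_len, int level)
{
    z_stream strm;
    memset(&strm, 0, sizeof(strm));          /* zalloc/zfree/opaque = Z_NULL */

    if (deflateInit(&strm, level) != Z_OK)   /* level picks the chain length */
        return 0;                            /* and the lazy-match policy    */

    strm.next_in   = (unsigned char *)in;
    strm.avail_in  = (uInt)in_len;
    strm.next_out  = out;
    strm.avail_out = (uInt)out_len;

    int ret = deflate(&strm, Z_FINISH);      /* one call: all input, finish  */
    unsigned long produced = strm.total_out;
    deflateEnd(&strm);
    return ret == Z_STREAM_END ? produced : 0;
}

int main(void)
{
    const char *text = "blah blah blah blah blah blah blah blah!";
    unsigned char out[1024];                 /* ample for this tiny input    */

    unsigned long n = deflate_buffer((const unsigned char *)text,
                                     strlen(text), out, sizeof(out), 9);
    printf("%lu bytes in, %lu bytes out\n", (unsigned long)strlen(text), n);
    return 0;
}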

2. Decompression algorithm (inflate)

2.1 Introduction

The key question is how to represent a Huffman code (or any prefix code) so
that you can decode fast. The most important characteristic is that shorter
codes are much more common than longer codes, so pay attention to decoding
the short codes fast, and let the long codes take longer to decode.

inflate() sets up a first level table that covers some number of bits of
input less than the length of the longest code. It gets that many bits from
the stream and looks them up in the table. The table tells whether the next
code is that many bits or fewer and, if so, how many bits it is and what
value it decodes to; otherwise the entry points to a next-level table, for
which inflate() grabs more bits and tries to decode a longer code.

How many bits to use for the first lookup is a tradeoff between the time it
takes to decode and the time it takes to build the table. If building the
table took no time (and if you had infinite memory), then there would only
be a first level table covering all the way to the longest code. However,
building the table ends up taking a lot longer for more bits since short
codes are replicated many times in such a table. What inflate() does is
simply to make the number of bits in the first table a variable, and then
to set that variable for the maximum speed.

For inflate, which has 286 possible codes for the literal/length tree, the
size of the first table is nine bits. The distance tree has 30 possible
values, and the size of its first table is six bits. Note that for each of
those cases, the table ended up one bit longer than the ``average'' code
length, i.e. the code length of an approximately flat code, which would be
a little more than eight bits for 286 symbols and a little less than five
bits for 30 symbols.


2.2 More details on the inflate table lookup

Ok, you want to know what this cleverly obfuscated inflate tree actually
looks like. You are correct that it's not a Huffman tree. It is simply a
lookup table for the first, let's say, nine bits of a Huffman symbol. The
symbol could be as short as one bit or as long as 15 bits. If a particular
symbol is shorter than nine bits, then that symbol's translation is
duplicated in all those entries that start with that symbol's bits. For
example, if the symbol is four bits, then it's duplicated 32 times in a
nine-bit table. If a symbol is nine bits long, it appears in the table
once.

If the symbol is longer than nine bits, then that entry in the table points
to another similar table for the remaining bits. Again, there are
duplicated entries as needed. The idea is that most of the time the symbol
will be short and there will only be one table lookup. (That's the whole
idea behind data compression in the first place.) For the less frequent
long symbols, there will be two lookups. If you had a compression method
with really long symbols, you could have as many levels of lookups as is
efficient. For inflate, two is enough.

So a table entry either points to another table (in which case nine bits in
the above example are gobbled), or it contains the translation for the
symbol and the number of bits to gobble. Then you start again with the next
ungobbled bit.

You may wonder: why not just have one lookup table for however many bits
the longest symbol is? The reason is that if you do that, you end up
spending more time filling in duplicate symbol entries than you do actually
decoding, at least for deflate's output, which generates new trees every
several tens of kbytes. You can imagine that filling in a 2^15 entry table
for a 15-bit code would take too long if you're only decoding several
thousand symbols. At the other extreme, you could make a new table for
every bit in the code. In fact, that's essentially a Huffman tree. But then
you spend too much time traversing the tree while decoding, even for short
symbols.

So the number of bits for the first lookup table is a trade-off between the
time to fill out the table and the time spent looking at the second level
and above of the table.
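
To put numbers on that replication cost, using the figures from the text (a
four-bit symbol is duplicated 32 times in a nine-bit table), here is a
small sketch in C. The leaf layout and the most-significant-bit-first
indexing are conveniences borrowed from the scaled-down example that
follows; inftrees.c stores its entries and orders its bits differently.

#include <stdio.h>

struct leaf { int sym; int len; };    /* decoded value and bits to gobble */

static struct leaf root9[1 << 9];     /* a 9-bit first-level table        */

/* Enter one code of "len" bits into a table indexed by "root_bits" bits
   (len <= root_bits).  Every index whose top "len" bits equal "code" gets
   a copy, so a short code costs 2^(root_bits-len) writes -- the reason a
   very wide first-level table is expensive to build. */
static void enter_short_code(struct leaf *table, int root_bits,
                             unsigned code, int len, int sym)
{
    unsigned copies = 1u << (root_bits - len);
    unsigned first  = code << (root_bits - len);
    for (unsigned i = 0; i < copies; i++)
        table[first + i] = (struct leaf){ sym, len };
}

int main(void)
{
    enter_short_code(root9, 9, 0xC, 4, 'Q');      /* the 4-bit code 1100   */
    printf("%d copies, e.g. root9[384] = '%c', root9[415] = '%c'\n",
           1 << (9 - 4), root9[384].sym, root9[415].sym);
    return 0;                                     /* 32 copies, 'Q', 'Q'   */
}

Entering a one-bit code into the same nine-bit table would cost 256 writes,
which is why inflate keeps the first-level table no wider than it has to
be.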

Here is an example, scaled down:

The code being decoded, with 10 symbols, from 1 to 6 bits long:

A: 0
B: 10
C: 1100
D: 11010
E: 11011
F: 11100
G: 11101
H: 11110
I: 111110
J: 111111

Let's make the first table three bits long (eight entries):

000: A,1
001: A,1
010: A,1
011: A,1
100: B,2
101: B,2
110: -> table X (gobble 3 bits)
111: -> table Y (gobble 3 bits)

Each entry is what the bits decode as and how many bits that is, i.e. how
many bits to gobble. Or the entry points to another table, with the number
of bits to gobble implicit in the size of the table.

Table X is two bits long since the longest code starting with 110 is five
bits long:

00: C,1
01: C,1
10: D,2
11: E,2

Table Y is three bits long since the longest code starting with 111 is six
bits long:

000: F,2
001: F,2
010: G,2
011: G,2
100: H,2
101: H,2
110: I,3
111: J,3

So what we have here are three tables with a total of 20 entries that had
to be constructed. That's compared to 64 entries for a single table. Or
compared to 16 entries for a Huffman tree (six two-entry tables and one
four-entry table). Assuming that the code ideally represents the
probability of the symbols, it takes on the average 1.25 lookups per
symbol. That's compared to one lookup for the single table, or 1.66 lookups
per symbol for the Huffman tree.
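
Expressed as data plus a two-level decode step, the example above could
look like the following C sketch. The entry layout, the '0'/'1' string used
as a bit source and the most-significant-bit-first order are conveniences
of this toy; inflate's real tables built by inftrees.c pack their entries
and read their bits differently.

#include <stdio.h>

/* A table entry: either a leaf ("bits" input bits decode to "sym"), or a
   link to a second-level table indexed by the next "sub_bits" bits. */
struct entry {
    char sym;                  /* decoded symbol (leaves only)             */
    int  bits;                 /* bits to gobble for a leaf                */
    const struct entry *sub;   /* second-level table, or NULL for a leaf   */
    int  sub_bits;             /* index width of that table                */
};

/* Tables X and Y from the example above (each entry gives the symbol and
   the extra bits beyond the three already gobbled by the root table). */
static const struct entry tab_x[4] = {           /* codes starting 110     */
    {'C',1,0,0}, {'C',1,0,0}, {'D',2,0,0}, {'E',2,0,0}
};
static const struct entry tab_y[8] = {           /* codes starting 111     */
    {'F',2,0,0}, {'F',2,0,0}, {'G',2,0,0}, {'G',2,0,0},
    {'H',2,0,0}, {'H',2,0,0}, {'I',3,0,0}, {'J',3,0,0}
};

/* The three-bit root table: 20 entries in total across the three tables. */
static const struct entry root[8] = {
    {'A',1,0,0}, {'A',1,0,0}, {'A',1,0,0}, {'A',1,0,0},
    {'B',2,0,0}, {'B',2,0,0}, {0,3,tab_x,2}, {0,3,tab_y,3}
};

/* Read up to n bits from a '0'/'1' string, most significant bit first,
   padding with zeros past the end (nothing is consumed). */
static int peek(const char *bits, int n)
{
    int v = 0;
    while (n-- > 0) {
        v <<= 1;
        if (*bits) v |= (*bits++ == '1');
    }
    return v;
}

/* Decode one symbol: one lookup for short codes, two for long ones. */
static char decode_one(const char **bits)
{
    const struct entry *e = &root[peek(*bits, 3)];
    if (e->sub) {                        /* long code: gobble the root     */
        *bits += 3;                      /* index, finish in the second-   */
        e = &e->sub[peek(*bits, e->sub_bits)];   /* level table            */
    }
    *bits += e->bits;
    return e->sym;
}

int main(void)
{
    const char *in = "0101100111110";    /* A (0), B (10), C (1100),       */
    while (*in)                          /* I (111110)                     */
        putchar(decode_one(&in));
    putchar('\n');                       /* prints ABCI                    */
    return 0;
}

The short codes A and B cost one lookup each, while C and I go through a
second-level table, which is where the average of 1.25 lookups per symbol
quoted above comes from.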

There, I think that gives you a picture of what's going on. For inflate,
the meaning of a particular symbol is often more than just a letter. It can
be a byte (a "literal"), or it can be either a length or a distance, which
indicates a base value and a number of extra bits to fetch after the code,
to be added to the base value. Or it might be the special end-of-block
code. The data structures created in inftrees.c try to encode all that
information compactly in the tables.


Jean-loup Gailly        Mark Adler
[email protected]         [email protected]


References:

[LZ77] Ziv J., Lempel A., ``A Universal Algorithm for Sequential Data
Compression,'' IEEE Transactions on Information Theory, Vol. 23, No. 3,
pp. 337-343, 1977.

``DEFLATE Compressed Data Format Specification'' available at
http://tools.ietf.org/html/rfc1951