Some Assembly Code

gzip optimizations

These diff files modify gzip 1.2.4 to use optimized assembly code for the deflation algorithm.

The Pentium version of this patch is here mostly for educational purposes, as the speed increase it provides is all but unnoticeable. On the Pentium, the major bottleneck in the deflation algorithm is L1-cache contention. The gzip code already contains a fair i386-oriented assembly optimization, and so the few extra cycles squeezed out of the inner loop by my version are negligible when weighed against the mountains of cycles lost to L2-cache memory accesses. Under certain circumstances, the patch can provide a speed increase of 25%, but it typically winds up being more like 5%, which is less than the variance between successive executions. (And, to add insult to my injury, the existing gzip assembly code is much more compact than mine.)

The patch for the Pentium Pro fares much better than this, however, for two reasons. First, the PPro has a much smaller penalty for L1-cache misses. And second, the existing assembly code runs afoul of the painful partial register stall inside the inner loop. My modifications give gzip a 20% speedup (more or less) at the default compression, and as much as a 40% speedup at level 9 compression. And most of the speedup can be attributed to changing a single assembly instruction.
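For readers unfamiliar with the inner loop in question, here is a minimal C sketch of the kind of hash-chain match search that deflate performs. It is not the code from gzip, zlib, or these patches: the names (hash3, build_chains, longest_match_sketch), the hash function, and the table sizes are all made up for illustration, and real implementations add refinements (a sliding window, comparing two bytes at a time) that are omitted here. The point is only to show where the time goes: each step along the chain probes a different, usually distant, part of the buffer, which is where the cache misses come from.

```c
#include <assert.h>
#include <string.h>

#define HASH_SIZE 256u   /* illustrative; real tables are much larger */
#define MIN_MATCH 3u
#define MAX_MATCH 258u
#define NIL 0u           /* chain terminator; position 0 is never a candidate */

/* Hypothetical 3-byte hash, chosen for brevity rather than quality. */
static unsigned hash3(const unsigned char *p)
{
    return (p[0] * 31u + p[1] * 7u + p[2]) % HASH_SIZE;
}

/* Link positions 1..end-1 into per-hash chains. head[h] is the most recent
 * position with hash h; prev[i] is the one before position i. The caller
 * must ensure buf[end + 1] is readable. */
static void build_chains(const unsigned char *buf, unsigned end,
                         unsigned short *head, unsigned short *prev)
{
    unsigned i;

    memset(head, 0, HASH_SIZE * sizeof *head);
    for (i = 1; i < end; i++) {
        unsigned h = hash3(buf + i);
        prev[i] = head[h];
        head[h] = (unsigned short)i;
    }
}

/* Walk the chain starting at cur_match, looking for the longest match with
 * the string at buf[strstart]. Returns the match length (0 if no match of
 * at least MIN_MATCH was found) and stores its position in *match_start. */
static unsigned longest_match_sketch(const unsigned char *buf, unsigned n,
                                     const unsigned short *prev,
                                     unsigned strstart, unsigned cur_match,
                                     unsigned max_chain, unsigned *match_start)
{
    const unsigned char *scan = buf + strstart;
    unsigned max_len, best_len = MIN_MATCH - 1;

    assert(strstart < n);
    max_len = n - strstart;
    if (max_len > MAX_MATCH)
        max_len = MAX_MATCH;
    if (max_len < MIN_MATCH)
        return 0;

    while (cur_match != NIL && max_chain-- != 0) {
        const unsigned char *match = buf + cur_match;
        unsigned len = 0;

        /* Quick reject: a candidate can only beat the best match so far if
         * it agrees at offset best_len. This probe into older data is the
         * access most likely to miss the cache. */
        if (match[best_len] == scan[best_len] && match[0] == scan[0]) {
            while (len < max_len && match[len] == scan[len])
                len++;
            if (len > best_len) {
                best_len = len;
                *match_start = cur_match;
                if (len >= max_len)
                    break;          /* cannot do any better */
            }
        }
        cur_match = prev[cur_match];    /* follow the chain to an older position */
    }
    return best_len >= MIN_MATCH ? best_len : 0;
}
```

To try it, build the chains for everything before the current position, then start the search at head[hash3(buf + strstart)]. The max_chain argument plays the role of the compression level: a larger value means more candidates examined per position, which is why a faster inner loop pays off most at level 9.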

zlib optimizations

These diff files modify zlib 1.1.2 to use optimized assembly code for the deflation algorithm. Note that the content of these patches has been incorporated into the standard zlib distribution starting with version 1.1.3. (But see also below.)

My Pentium patch fares a little better here, mainly since it is being measured against optimized C code instead of assembly. (The only i386 assembly code provided for zlib is written for Win32, using MASM.) The effectiveness is still not great -- even though my inner loop is almost half the size of that produced by gcc, there is still the same number of L1-cache misses. The speedup, on average, is approximately 5-10% with the default compression level, and approximately 15% with compression at level 9 (though occasionally more).

For the Pentium Pro patch, the speedup is about the same as for gzip (i.e., 20-40%), although this will depend on how well the C compiler optimized the original zlib that you test my version against. (Somewhat ironically, the C code actually survives the move to the PPro better than gzip's assembly, since a compiler is less likely to output code that causes a partial register stall.)

The diff file creates the assembly source file, modifies the Makefile appropriately, and alters the configure script to use the assembly version by default. (In other words, you will need to run ./configure after applying this patch.)

zlib optimizations, revisited

(added April 2007)

zlib 1.2.3 patch for the Pentium Pro and later

This diff file modifies zlib 1.2.3 to use optimized assembly code for the deflation algorithm.

I've allowed these assembly routines to languish for several years now, believing that the code generated by gcc had caught up with them, somewhere between version 2.95 of gcc and the wholesale rearchitecting of the Pentium 4. However, I recently learned that, despite what I believed, this code still has some life in it. On the Pentium 4 and AMD64 chips, for example, it continues to run about 8% faster than the code produced by gcc 4.1. Somehow my code managed to avoid all of the pitfalls of the newer chips and keep its inner loop spinning as fast as it did on the Pentium Pro.

So, in acknowledgement of its continuing usefulness, I've changed the license to be the same as that of zlib proper. (In fact, the above diff does nothing but change the license in the source comments and update the README.686 file.)

Share and Enjoy. Contact me if you have any questions or comments.

Brian Raiter