[BusyBox] Re: Where's the bzip2 compressed linux-kernel patch?

Tue Oct 21 01:42:49 UTC 2003

On Monday 20 October 2003 09:33, Jörn Engel wrote:
> On Mon, 20 October 2003 04:41:57 -0500, Rob Landley wrote:
> > On Monday 20 October 2003 03:31, Jörn Engel wrote:
> > > On Sat, 18 October 2003 15:38:35 -0500, Rob Landley wrote:
> > >
> > > Does anyone really need the "small mode"?  If not, I consider it a
> > > feature to not support it. :)
> >
> > The "fast mode" walks a 4 megabyte array (okay, 900k*4=3.6 megabytes) in
> > a way that makes the cpu cache break down and cry, and actually brings
> > dram bank switching latency into the picture.  (It's about the worst case
> > scenario for memory access; fetching whole cache lines for randomly
> > ordered 4 byte reads...)  Anything that doesn't do that seems like it's
> > at least worth a look, but it's not a high priority just now...;)
>
> I should take a closer look at the code sometime.  I've written a very
> data-local implementation of the block sorting myself.  My problem was
> to sort *really big* blocks, bigger than the L4-cache (aka main
> memory).  Locality really helps when suffering from swap latency. :)

Feel free. :)

The true block sorting stuff is really compression side only.  The 
decompression already has all this sorted.  Undoing the burrows-wheeler 
transform does require a kind of sort, but the end result isn't a sorted 
array but a linked list showing what order the sorted array would have been 
in, and they kind of fake their way around it with a one-pass thing that's 
actually fairly neat.  (That's the "fast" algorithm.  The slow one I haven't 
thought through yet.)

The sorting side has FOUR DIFFERENT sorting algorithms, done entirely for 
speed: it tries each one and falls back to the next after a certain work 
threshold.  I'm shying away from that and cleaning up other bits at the 
moment, trying to figure out what "cleaned up" means in this context...

> 4MB should also not matter too much, as a memory requirement.  Either
> you need compression anyway, or you are early on in the boot sequence
> with lots of free memory.

Basically, yeah.  These days even embedded devices have 8 or 16 megs, and 
bzip's temp buffer is freed as soon as the decompression finishes.  In-place 
decompression would be needed to work in 8 megs, but I'm not sure how big a 
deal that really is...

> > Any future uber-xeon that has a 4 megabyte L2 cache is going to do bunzip
> > _really_ fast. :)
>
> Wouldn't running busybox on that machine be kinda pointless maybe? ;)

A simple, straightforward, understandable implementation of an algorithm is 
generally preferable to a craptacular pile of cruft.  It took the FSF 833 
lines of C to implement "cat".  Yes, really.  I am not in favor of this.

> > It turns out Manuel Novoa is busy beating the source code to death with a
> > profiler, which is why I'm waiting on the bzip-kernel patch.  He's sped
> > it up another 10% or so since last time (extracting the kernel tarball
> > took 22 seconds in the original, and he's got it down under 18 now), but
> > hasn't checked the result into busybox's CVS yet.  (He's still tweaking.)
>
> Nice to see such work. :)

I'm all for it.  He's much better at it profiler-based optimization than I am, 
I'm quite happy to stay out of his way and go off to what I do well, which is 
mucking out the next section of brick-wall unintellgibility...

> > That's what I'm working on now.  Yesterday, I glued all the bits
> > necessary to bzip compress stdin to stdout into one 95k kilobyte c file
> > (which compiled to a ~50k executable).  So far this evening I've trimmed
> > that down to 79k of source and 48k of executable, and I'm still at the
> > "crapectomy" stage.  Not actually doing serious reorganization type
> > cleanup yet, just taking a scalpel (okay, vi) and removing bits that
> > shouldn't BE there.  (The line "typedef BZFILE void;" should not occur in
> > any program anywhere ever.  It just shouldn't.  "#define BZ_API(func)
> > func" and "#define BZ_EXTERN extern" aren't really a good sign either,
> > although they were the result of making the code compile on windows. 
> > Unwinding this kind of thing is very time consuming.)
>
> I've started doing the same thing, but got distracted at some point of
> time.  Look here:
> http://wohnheim.fh-wedel.de/~joern/software/kernel/je/25/
>
> All the cleanup patches might look quite familiar to you. ;)

Im replying to my laptop's outbox at the moment, I'll take a look when I'm 
online...

Right now I'm mentally nibbling at the edge of the block sorting machinery, 
and trying to figure out where the speed/complexity tradeoffs really are.  
(I'll might just clean up what's here first, to expose it to to Manuel's 
profiling.  Or I might just use a single bog-standard quicksort and add the 
complexity back if the speed goes in the toilet.  Or, it's quite possible I 
don't actually understand what it's DOING yet.  That's actually pretty 
likely, I'm only about 1/4 of the way through it, there's all sorts of 
subtlety buried in this stuff sometimes...)

> My killer application for this was and still is jffs2.  But tests
> showed, that bzip2 actually compresses worse than gzip when running on
> chunks of <=4k.  I've pretty much figured out the reason by now and am
> just writing on a paper with my findings.  Hope that IBM allows me to
> follow up on that and create a now compression suite that is actually
> better than bzip2.  We'll see.

There's a number of little places bzip2 could be tightened up (use 64 symbols 
instead of 50 for the chunk size, what IS the deal with that discarded symbol 
at the beginning of each block and the fact the first three symbols don't 
qualify as a run...) but backwards compatability is probably going to doom 
any such effort.

> > I'm going to take the shepard patch and make a 2.6 version with the new
> > bunzip code as soon as Manuel gets tired of making it faster.  But he
> > hasn't yet, and I've got plenty of other things to do in the meantime...
>
> Sure.  I've had the same plan, and already started, but found some
> questions that are still unanswered:
> - With zlib, bzlib and possibly more, is lib/* still a good place?

Not a clue.

> - How can we reduce the memory consumption of say 1000 open ipsec
>   connections, eating up a zlib workspace each?

My lack of caring about ipsec is fairly deep.  It's at the wrong layer.  
Encryption is inherently stateful.  The packet layer isn't supposed to be 
stateful.  The TCP/IP SESSION layer is supposed to be stateful...

You're dealing with a fundmanetally misdesigned protocol, I'm afraid.  Fixing 
it probably isn't actually possible, the best you can do is mitigate the 
damage.

> - How much code can be shared amoung different compression algorithms?

The CRC algorithms are similar.  There's only a few ways the basic CRC usualy 
gets mangled: big endian or little endian, initialize to 0 or ffffffff, and ~ 
the result at the end or not...

If I remember, gzip is big endian and bzip is little endian.  Initializing the 
table is about four lines of code, and trying to sharing the data at run time 
sounds like an amazing pain that's BOUND to hurt throughput...

> - There are users that will compress block-based, others need
>   streaming support.  Can these two classes share code?  How?

Bzip is designed to work on large quantities of data.  It's not a packet level 
protocol, and isn't likely to become one.

> And so on.  This is definitely 2.7 work, but it doesn't hurt to start
> somewhere and fix problems on the way.  Feel free to port your
> reimplementation to the kernel, less work for me. :)
>
> Jörn

Rob