Here's a nutshell version of the whole sordid story:
One, I was overconfident in our preexisting and new unit tests for this work; there were failures even in pre-JDK6u23 cases of reading real ARCs/WARCs. I believe those are now all fixed, but will test more on a wider variety of real ARCs/WARCs before sounding the all-clear.
Two, I had thought the tests were passing in JDK6u24 – but in fact an older JDK was being used. Testing the right JDK revealed...
Three, there's a deeper problem with JDK6u23-JDK6u24: the GZIPInputStream no longer handles GZIP members with optional 'extra fields' correctly. (Traditionally Alexa used one particular extra field to mark their ARCs, and we've continued that practice. More recently some are using an extra field to hint how to do a long-skip over the current member.) The JDK6u23-24 bug looks like a bit of sloppy editing by someone fixing the prior bug; as a result I expect very little 'natural' GZIP data with extra fields can be read with the JDK6u23-24 GZIPInputStream (and maliciously-crafted GZIP data could decompress to totally different data in Java compared to standard GUNZIP!). I've reported the issue to Oracle; a bug record may appear here soon: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7022417 . Notably, this bug does not appear in the OpenJDK7/JDK7 preview release.
Four, the packaging limitations and the private and package-protected modifiers in GZIPInputStream and InflaterInputStream make many otherwise-easy tactics for patching around the bug in the affected JDKs 6u23-24-?? difficult. I think that pulling both those classes from OpenJDK7 into a local package will serve as a workaround, and if those classes (changed only in their name, home package, and imports) remain GPL-with-classpath-exception, bundling them in our distribution should be OK. Thus we'd have consistent, as-designed-for-6u23-and-later behavior no matter what the underlying JRE/JDK.
Five, this JDK7 behavior avoids all of the exceptions reproduced so far, but seems to sometimes skip a member boundary, and thus a full record, when iterating through with the old ArchiveReader code. So that's still an issue I need to investigate and address.
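For anyone who wants to poke at point Three themselves, here is a minimal sketch of a GZIP member carrying an 'extra field', built by hand and then read back through java.util.zip.GZIPInputStream. The class name, helper names, and the 'X','Y' subfield are made up for illustration (this is not the marker our ARC writers emit), and the code is written against a current JDK rather than JDK6. On a JDK without the bug the payload should round-trip; going by the description above, the affected 6u23-24 stream would be expected to choke on input of this shape.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPInputStream;

public class GzipExtraFieldDemo {

    /** Builds one GZIP member whose header has the FEXTRA flag set. */
    static byte[] gzipMemberWithExtra(byte[] payload, byte[] extra) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0x1f);            // ID1
        out.write(0x8b);            // ID2
        out.write(8);               // CM = deflate
        out.write(0x04);            // FLG = FEXTRA only
        writeLE(out, 0, 4);         // MTIME (unset)
        out.write(0);               // XFL
        out.write(0xff);            // OS = unknown
        writeLE(out, extra.length, 2);          // XLEN
        out.write(extra, 0, extra.length);      // extra field bytes

        // Raw deflate body (nowrap = true, so no zlib header/trailer).
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        DeflaterOutputStream body = new DeflaterOutputStream(out, deflater);
        body.write(payload);
        body.finish();
        deflater.end();

        // Trailer: CRC32 and size of the uncompressed data, little-endian.
        CRC32 crc = new CRC32();
        crc.update(payload);
        writeLE(out, crc.getValue(), 4);
        writeLE(out, payload.length, 4);
        return out.toByteArray();
    }

    /** Writes the low 'bytes' bytes of v in little-endian order. */
    static void writeLE(ByteArrayOutputStream out, long v, int bytes) {
        for (int i = 0; i < bytes; i++) {
            out.write((int) (v >>> (8 * i)) & 0xff);
        }
    }

    /** Decompresses with the JDK's own GZIPInputStream. */
    static byte[] gunzip(byte[] gz) throws IOException {
        GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz));
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical subfield: id 'X','Y', length 4, data 1 2 3 4.
        byte[] extra = { 'X', 'Y', 4, 0, 1, 2, 3, 4 };
        byte[] payload = "hello, extra field".getBytes(StandardCharsets.UTF_8);
        byte[] member = gzipMemberWithExtra(payload, extra);
        System.out.println(new String(gunzip(member), StandardCharsets.UTF_8));
    }
}
```

Feeding the same bytes to command-line gunzip gives a reference result to compare against.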
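On point Five, the property the reader has to preserve is "one GZIP member per record, none skipped". As a rough cross-check (not the ArchiveReader code, and not a proposed fix), the sketch below walks concatenated members in a byte array using a raw Inflater and counts exactly how many compressed bytes each member consumed, so the next header is found by arithmetic rather than by whatever the stream happened to buffer. The class and method names are hypothetical, and it assumes the whole file fits in memory.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;

public class GzipMemberWalker {

    /** Returns the decompressed content of each GZIP member, in order. */
    static List<byte[]> readMembers(byte[] data) throws DataFormatException {
        List<byte[]> members = new ArrayList<byte[]>();
        int pos = 0;
        while (pos < data.length) {
            pos = skipHeader(data, pos);
            Inflater inflater = new Inflater(true); // raw deflate, no zlib wrapper
            inflater.setInput(data, pos, data.length - pos);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && !inflater.finished() && inflater.needsInput()) {
                    throw new DataFormatException("truncated member");
                }
                out.write(buf, 0, n);
            }
            // Deflate bytes actually consumed, plus the 8-byte CRC32/ISIZE trailer.
            pos += (int) inflater.getBytesRead() + 8;
            inflater.end();
            members.add(out.toByteArray());
        }
        return members;
    }

    /** Skips a member header, including optional fields; returns the body offset. */
    static int skipHeader(byte[] d, int pos) throws DataFormatException {
        if ((d[pos] & 0xff) != 0x1f || (d[pos + 1] & 0xff) != 0x8b || d[pos + 2] != 8) {
            throw new DataFormatException("not a GZIP member at offset " + pos);
        }
        int flg = d[pos + 3] & 0xff;
        pos += 10;                                   // fixed part of the header
        if ((flg & 0x04) != 0) {                     // FEXTRA: XLEN then XLEN bytes
            int xlen = (d[pos] & 0xff) | ((d[pos + 1] & 0xff) << 8);
            pos += 2 + xlen;
        }
        if ((flg & 0x08) != 0) { while (d[pos++] != 0) { } }  // FNAME, NUL-terminated
        if ((flg & 0x10) != 0) { while (d[pos++] != 0) { } }  // FCOMMENT, NUL-terminated
        if ((flg & 0x02) != 0) { pos += 2; }                  // FHCRC
        return pos;
    }

    /** Test helper: compresses a string as a single GZIP member. */
    static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(out);
        gz.write(s.getBytes(StandardCharsets.UTF_8));
        gz.close();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        file.write(gzip("record one"));
        file.write(gzip("record two"));
        for (byte[] m : readMembers(file.toByteArray())) {
            System.out.println(new String(m, StandardCharsets.UTF_8));
        }
    }
}
```

Comparing the member count from a walk like this against the record count the reader reports is one way to spot a skipped boundary.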