Heritrix / HER-1865

JDK6u23 breaks GzippedInputStream & W/ARCReaders with different GZIP handling

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Heritrix 3.0.0
    • Fix Version/s: Heritrix 3.1.0-beta
    • Component/s: Release Notes
    • Labels: None

      Description

      JDK6u23 fixed a longstanding bug in Snoracle's GZIPInputStream, which would stop reading a concatenation of many GZIP members after the first member. Our workaround, GzippedInputStream, awkwardly gave us the ability to continue reading past each member boundary, and in fact to find the (compressed) boundary offsets for the benefit of W/ARC record range indexing, but it depended on the old buggy behavior.
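
      To make the behavior change concrete, here is a minimal sketch (not Heritrix code) that concatenates two GZIP members in memory and reads them back through the stock java.util.zip.GZIPInputStream. The class and method names (ConcatenatedGzipDemo, gzipMember, readAll) are hypothetical. On JDK6u23 and later the read is expected to continue across the member boundary; on older JDKs it would stop after the first member.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; not part of Heritrix.
final class ConcatenatedGzipDemo {

    /** Compresses text into one complete GZIP member (header, data, trailer). */
    static byte[] gzipMember(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(buf);
        gz.write(text.getBytes("UTF-8"));
        gz.close(); // writes the trailer and finishes the member
        return buf.toByteArray();
    }

    /** Decompresses everything the stock GZIPInputStream is willing to return. */
    static String readAll(byte[] gzipped) throws IOException {
        InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        in.close();
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws IOException {
        // A W/ARC-style file: independent GZIP members written back to back.
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        file.write(gzipMember("record one\n"));
        file.write(gzipMember("record two\n"));
        // Post-JDK6u23 behavior: both records; pre-JDK6u23: only the first.
        System.out.print(readAll(file.toByteArray()));
    }
}
```

      Because the new stream silently crosses member boundaries, a caller can no longer tell from it where one compressed record ends and the next begins, which is the information the range indexing needs.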

      We need a way to get the compressed offsets with the new GZIPInputStream behavior; it's likely to be different but easier. Ideally, we need an approach/codebase that works on both pre- and post-JDK6u23 systems without operator intervention, with the other classes (W/ARC reading and random access) working in either era. (Our systems mostly haven't moved past JDK6u22 yet, but partners have started to, and we may soon.)
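
      One JDK-independent way to recover the compressed offsets is to bypass GZIPInputStream and drive java.util.zip.Inflater directly. The following is only a sketch of that idea under simplifying assumptions (whole input held in memory, well-formed members, no CRC verification); it is not the approach Heritrix adopted, and the names GzipMemberOffsets, memberOffsets and skipHeader are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;

// Hypothetical sketch; not part of Heritrix.
final class GzipMemberOffsets {

    /** Returns the compressed start offset of every GZIP member in data. */
    static List<Integer> memberOffsets(byte[] data) throws DataFormatException {
        List<Integer> offsets = new ArrayList<Integer>();
        byte[] sink = new byte[8192]; // decompressed bytes are discarded
        int pos = 0;
        while (pos < data.length) {
            offsets.add(pos);
            int body = skipHeader(data, pos);
            Inflater inflater = new Inflater(true); // true = raw deflate data
            try {
                inflater.setInput(data, body, data.length - body);
                while (!inflater.finished()) {
                    int n = inflater.inflate(sink);
                    if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                        throw new DataFormatException("truncated member at " + pos);
                    }
                }
                // Bytes the inflater did not consume belong to the trailer
                // and to any members that follow.
                int consumed = (data.length - body) - inflater.getRemaining();
                pos = body + consumed + 8; // skip 8-byte trailer (CRC32 + ISIZE)
            } finally {
                inflater.end();
            }
        }
        return offsets;
    }

    /** Returns the offset of the deflate data following the header at pos. */
    static int skipHeader(byte[] d, int pos) throws DataFormatException {
        if (pos + 10 > d.length || (d[pos] & 0xff) != 0x1f
                || (d[pos + 1] & 0xff) != 0x8b || d[pos + 2] != 8) {
            throw new DataFormatException("no GZIP header at " + pos);
        }
        int flags = d[pos + 3] & 0xff;
        int p = pos + 10; // fixed part of the header
        if ((flags & 4) != 0) { // FEXTRA: 2-byte little-endian length + payload
            p += 2 + ((d[p] & 0xff) | ((d[p + 1] & 0xff) << 8));
        }
        if ((flags & 8) != 0) { while (d[p++] != 0) { } }  // FNAME, NUL-terminated
        if ((flags & 16) != 0) { while (d[p++] != 0) { } } // FCOMMENT, NUL-terminated
        if ((flags & 2) != 0) { p += 2; }                  // FHCRC
        return p;
    }

    /** Test helper: compresses text into one GZIP member. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(buf);
        gz.write(text.getBytes("UTF-8"));
        gz.close();
        return buf.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        file.write(gzip("first record\n"));
        file.write(gzip("second record\n"));
        System.out.println(memberOffsets(file.toByteArray()));
    }
}
```

      Since this relies only on Inflater and the GZIP framing, it should behave the same on either side of JDK6u23; a production version would stream the input rather than buffer it and would verify each trailer.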

        Activity

        Erik Hetzner added a comment -

        See HER-1878.

        Gordon Mohr added a comment -

        With HER-1878 (hopefully) fixed, as well as HER-1881 (which showed up with the new code), and this code having received some more use in Wayback indexing, I'm going to consider this fixed, and let future problems (if any) get new issue numbers.

        Søren Vejrup Carlsen added a comment -

        Would it be difficult to implement this fix in the H1 branch as well?

        Will Johnson added a comment -

        Is there any fix (or suggestions for an approach) possible for 2.0.2? Looking at the code, it seems like the design of the ArchiveReaderFactory relies on the broken-up streams that are no longer provided per the spec. Also, it seems like copying the OpenJDK code will cause Heritrix (and therefore my code) to be GPL, since you're making a modified work and not linking.

        Gordon Mohr added a comment -

        Regarding use in prior versions: the bug only affects reading, so you could move to the H3 codebase for (W)ARC-reading while leaving crawling on whatever version is convenient. (We would also welcome a contributed backport.)

        Regarding 2.0.2 specifically: I'd highly recommend moving to H3. There have been many fixes and improvements, and no further H2.0.x releases are expected.

        Regarding OpenJDK reuse: No code reuse can cause any other code's license to change automatically. Only the author(s) make the choice of license. Improper reuse/relicensing could open a project to allegations that they do not have permission to reuse the GPL code in a particular fashion, which might then have to be cured (and relicensing to GPL is sometimes but not always a possible cure).

        I am not a lawyer, but I believe our reuse is in accordance with the copyrights of Oracle and its related affiliates, and with the GPL-with-classpath-exception licensing of OpenJDK code. The two changed classes, OpenJDK7InflaterInputStream (from InflaterInputStream) and OpenJDK7GZIPInputStream (from GZIPInputStream), remain code licensed under the GPL with the classpath exception. Other code only refers/links to that code, in the exact same manner as the OpenJDK versions would be linked (when running on OpenJDK). If Oracle's/OpenJDK's lawyers have an alternative interpretation, we would adjust our use. (For example, if necessary we could put those 2 classes into their own more-clearly-distinct GPL-with-classpath-licensed package/library.) But the whole issue will likely become moot when the JDK6 bugs are fixed or JDK7 use becomes the norm, and this backport-plus-hackery is no longer a necessary workaround.


          People

          • Assignee: Gordon Mohr
          • Reporter: Gordon Mohr
          • Votes: 0
          • Watchers: 3
