0-byte arc.gz file

Description

Indexing the LoC E04 collection, I ran into a 0-byte arc.gz file:
http://locdata950.us.archive.org:19972/0/E04-CRAWL-50-20040918065656-00870-crawling001.archive.org.arc.gz

java.util.zip.GZIPInputStream throws an EOFException while reading the gzip header (see the exception trace at the bottom), and ARCReaderFactory propagates it instead of returning a reader.
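The failure can probably be reproduced without the ARC classes at all. The following is a minimal sketch, assuming the only relevant factor is that the input has no bytes; the class and method names (EmptyGzipRepro, failsWithEof) are made up for illustration and are not part of Heritrix or NutchWAX.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

final class EmptyGzipRepro {
    /**
     * Hypothetical helper: returns true if constructing a GZIPInputStream
     * over the given bytes fails with EOFException while reading the header.
     */
    static boolean failsWithEof(byte[] data) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return false; // header was read successfully
        } catch (EOFException e) {
            return true;  // stream ended before the gzip header was complete
        }
    }

    public static void main(String[] args) throws IOException {
        // A 0-byte input stands in for the 0-byte arc.gz file.
        System.out.println(failsWithEof(new byte[0]));
    }
}
```

If this behaves the same way on the JVM in use, the EOFException comes from the JDK constructor itself, before any ARC-specific parsing starts.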

Hard to say if this is:

  • a bug in the ARCReader(Factory) code

  • a bug in our repository, which allowed a 0-byte arc.gz file to be ingested

  • a bug in my code, which does not handle this exception

java.io.EOFException
at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:207)
at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:197)
at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:136)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
at org.archive.io.GzippedInputStream.<init>(GzippedInputStream.java:103)
at org.archive.io.GzippedInputStream.<init>(GzippedInputStream.java:90)
at org.archive.io.arc.ARCReaderFactory$CompressedARCReader.<init>(ARCReaderFactory.java:367)
at org.archive.io.arc.ARCReaderFactory.getArchiveReader(ARCReaderFactory.java:140)
at org.archive.io.arc.ARCReaderFactory.get(ARCReaderFactory.java:131)
at org.archive.io.ArchiveReaderFactory.getArchiveReader(ArchiveReaderFactory.java:181)
at org.archive.io.ArchiveReaderFactory.getArchiveReader(ArchiveReaderFactory.java:217)
at org.archive.io.ArchiveReaderFactory.get(ArchiveReaderFactory.java:200)
at org.archive.io.ArchiveReaderFactory.getArchiveReader(ArchiveReaderFactory.java:99)
at org.archive.io.ArchiveReaderFactory.getArchiveReader(ArchiveReaderFactory.java:93)
at org.archive.io.ArchiveReaderFactory.get(ArchiveReaderFactory.java:88)
at org.archive.nutchwax.Importer.map(Importer.java:196)
at org.archive.nutchwax.Importer.map(Importer.java:98)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
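
If the fix ends up on the caller side (the third option above), one possible workaround is to screen files before handing them to ArchiveReaderFactory. This is only a sketch under that assumption: ArcGzGuard and isReadableGzip are hypothetical names, the check uses nothing but the JDK, and it only verifies that the gzip header can be read, not that the ARC content is valid.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.util.zip.GZIPInputStream;

final class ArcGzGuard {
    /**
     * Hypothetical pre-check: returns false for missing or 0-byte files and
     * for files whose gzip header cannot be read; true otherwise.
     */
    static boolean isReadableGzip(File f) {
        if (!f.isFile() || f.length() == 0) {
            return false; // nothing to index in an empty or missing file
        }
        try (InputStream raw = Files.newInputStream(f.toPath());
             GZIPInputStream gz = new GZIPInputStream(raw)) {
            return true; // the constructor has already parsed the gzip header
        } catch (IOException e) {
            // EOFException and ZipException are both subclasses of IOException.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        File empty = File.createTempFile("empty", ".arc.gz");
        empty.deleteOnExit();
        // An importer could log and skip the file here instead of failing the map task.
        System.out.println(isReadableGzip(empty));
    }
}
```

Skipping the file in the importer would keep the job running, but it would not explain how a 0-byte file got into the repository in the first place.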

Environment

None

Status

Assignee

Unassigned

Reporter

Aaron Binns

Priority

Major