Failure to parse WAT files due to gzip boundary

Description

While processing WAT files for 20th century find, I ran into many errors in the WAT reader caused by gzip boundary problems.

A few examples, from hdfs:/user/vinay/wat/20thcf/1999:

gzip.GZIPFormatException: Not aligned at gzip start: robots-918458593.arc.wat.gz at offset 1559
gzip.GZIPFormatException: Not aligned at gzip start: robots-918458593.arc.wat.gz at offset 1559
gzip.GZIPFormatException: Not aligned at gzip start: robots_19990123130846-917135024.arc.wat.gz at offset 743
gzip.GZIPFormatException: Not aligned at gzip start: robots_19990123130846-917135024.arc.wat.gz at offset 743
gzip.GZIPFormatException: Not aligned at gzip start: robots_19990123130846-917135024.arc.wat.gz at offset 743
gzip.GZIPFormatException: Not aligned at gzip start: robots-19990917-937632725.arc.wat.gz at offset 786
gzip.GZIPFormatException: Not aligned at gzip start: robots-19990917-937632725.arc.wat.gz at offset 786
gzip.GZIPFormatException: Not aligned at gzip start: robots-19990917-937640941.arc.wat.gz at offset 2409
gzip.GZIPFormatException: Not aligned at gzip start: robots-19990917-937640941.arc.wat.gz at offset 2409

See the attached files for the full lists of errors from the Hadoop logs.
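
To help narrow down where a file goes out of alignment, here is a minimal diagnostic sketch in Python. It is not the WAT reader's code, and the function names are hypothetical. It assumes each .wat.gz file is meant to be a sequence of concatenated gzip members (one per record), walks the members in order, and reports the first offset at which a complete member cannot be read.

```python
import zlib

GZIP_MAGIC = b"\x1f\x8b\x08"  # ID1, ID2, CM=deflate (RFC 1952)


def is_gzip_start(data: bytes, offset: int) -> bool:
    """Return True if a gzip member header appears to begin at `offset`."""
    return data[offset:offset + 3] == GZIP_MAGIC


def member_offsets(data: bytes):
    """Walk concatenated gzip members and return (offsets, bad_offset).

    offsets    -- start offsets of the members that were read completely
    bad_offset -- first offset where a complete member could not be read,
                  or None if the whole buffer is well aligned
    """
    offsets = []
    pos = 0
    while pos < len(data):
        if not is_gzip_start(data, pos):
            return offsets, pos          # no gzip header where one is expected
        d = zlib.decompressobj(wbits=31)  # 31 = expect gzip header and trailer
        try:
            d.decompress(data[pos:])     # slicing copies; fine for a diagnostic
        except zlib.error:
            return offsets, pos          # corrupt member
        if not d.eof:
            return offsets, pos          # truncated member
        offsets.append(pos)
        # Bytes left over after this member's trailer start the next member.
        pos = len(data) - len(d.unused_data)
    return offsets, None
```

Running member_offsets on the bytes of one of the failing files (copied out of HDFS first) and comparing bad_offset with the offset in the exception would show whether the file itself has stray bytes between members or whether the reader is seeking to the wrong position.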

Environment

None

Status

Assignee

Brad Tofel

Reporter

Aaron Binns

Labels

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Components

Priority

Major