NutchWAX requires very long timeouts on remotely hosted ARC files

Description

When importing ARC files into NutchWAX, a server hosting the ARCs with a timeout value of less than 20 minutes will frequently time out.

(The 20-minute figure is based on trial and error; with the timeout set to 10 minutes, we have seen timeouts occur.)

Presumably this is because NutchWAX processes ARC files on the fly while fetching them, rather than prefetching them to disk and reading from there.
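
As an illustration of the on-the-fly behavior (a minimal sketch based on the archive-commons API as I understand it, not actual NutchWAX code; the URL is made up), the reader iterates over records straight off the open HTTP connection, so all per-record import work counts against the server's timeout:

import org.archive.io.ArchiveRecord;
import org.archive.io.arc.ARCReader;
import org.archive.io.arc.ARCReaderFactory;

public class StreamingReadSketch
{
    public static void main( String[] args ) throws Exception
    {
        // ARCReaderFactory accepts an http:// URL as well as a local path;
        // given a URL, records are read directly off the response stream.
        ARCReader reader = ARCReaderFactory.get( "http://example.org/arcs/sample.arc.gz" );
        try
        {
            for ( ArchiveRecord record : reader )
            {
                // Parsing and indexing happen here, while the HTTP
                // connection above is still open.  If this work is slow,
                // the server times out and the read fails with
                // "Premature EOF".
            }
        }
        finally
        {
            reader.close();
        }
    }
}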

Environment

None

Activity

Aaron Binns
January 29, 2009, 7:49 PM

Yes and no, depending on how you access your ARC files. If they are on local disk, or on an NFS mount or another mechanism that looks like a local disk, then you can set up your manifests to read from disk.

However, if you are using HTTP URLs, then yes, increasing the timeout value will help avoid problems if the network is slow in transferring the ARC files, especially if they are very large. Even though Heritrix tries to limit ARC files to 100MB by default, there are cases where, say, a 700MB CD-ROM image is crawled by Heritrix and you wind up with a huge ARC file. In that case, the time to transfer the ARC file across the network could exceed the timeout value.

In our big indexing jobs, we use 'rsync' rather than 'http', just because we ran into some weird problems with 'http' where the connection wasn't closing properly and the client was waiting for an EOF that never came. We didn't have this problem with 'rsync', so we've been using it since. With 'rsync' the possibility of a timeout is probably even higher than with 'http', since with 'rsync' the entire ARC file is transferred to '/tmp' before NutchWAX starts importing it. Again, if that transfer takes too long, the timeout will kick in and the task will be killed.

We tend to set our timeouts very high to avoid these types of problems, and we haven't run into any case where we wished the timeout were lower.

Erik Hetzner
January 29, 2009, 8:35 PM

Hi Aaron -

I think I was a little unclear. I don't think the problem is that the network is slow. A 10-minute timeout should be far more than sufficient to fetch the ~100 MB ARCs we target as well: that is only about 170 KB/s sustained (on our network).

Unfortunately I do not have any logs showing what is happening right now. I am currently building an index from an HTTP server with a 10-minute timeout to recreate the error.

The problem seems to be that the NutchWAX importer processes the archived URIs as it fetches the ARC from the server. This appears to add considerable overhead to the fetch, which triggers timeouts on the server.

Maybe I am misreading this problem. I will attach a stacktrace when I have one available.

-Erik

Erik Hetzner
January 29, 2009, 8:38 PM

I meant to add that increasing the timeout on the server is a workaround, but my suggestion is that it might be a useful improvement to pre-fetch the ARC files to /tmp, as is apparently done with rsync. That way the NutchWAX importing overhead would not trigger a timeout on servers with a standard timeout value.

-Erik
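
To illustrate the suggestion above (a minimal sketch only, not the eventual patch; the class and method names are made up), pre-fetching would copy the ARC to a local temp file, so that the HTTP connection is held open only for the raw byte transfer:

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

public class PrefetchSketch
{
    /**
     * Copy a remote ARC to a local temp file.  The HTTP connection is
     * open only for the duration of the raw byte transfer, not for the
     * (much slower) import that follows.
     */
    public static File prefetch( String url ) throws Exception
    {
        File tmp = File.createTempFile( "nutchwax-arc-", ".arc.gz" );
        InputStream  in  = new URL( url ).openStream();
        OutputStream out = new FileOutputStream( tmp );
        try
        {
            byte[] buf = new byte[8192];
            for ( int n; ( n = in.read( buf ) ) != -1; )
            {
                out.write( buf, 0, n );
            }
        }
        finally
        {
            in.close();
            out.close();
        }
        return tmp;
    }
}

The importer would then read the temp file as a local ARC and delete it when finished, so slow import work can no longer outlast the server's timeout.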

Erik Hetzner
February 10, 2009, 1:17 AM

Here is a stack trace from a failed indexing job:

2009-02-09 11:49:51,291 ERROR nutchwax.Importer - Import fail : http://www.uncf.org/DrLomaxIntv.flv
java.io.IOException: Premature EOF
at sun.net.www.http.ChunkedInputStream.readAheadBlocking(ChunkedInputStream.java:538)
at sun.net.www.http.ChunkedInputStream.readAhead(ChunkedInputStream.java:582)
at sun.net.www.http.ChunkedInputStream.read(ChunkedInputStream.java:669)
at java.io.FilterInputStream.read(FilterInputStream.java:111)
at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:2196)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
at org.archive.io.RepositionableInputStream.read(RepositionableInputStream.java:81)
at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:214)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:134)
at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:87)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:208)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:681)
at org.archive.nutchwax.Importer.readBytes(Importer.java:574)
at org.archive.nutchwax.Importer.importRecord(Importer.java:254)
at org.archive.nutchwax.Importer.map(Importer.java:205)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132)
2009-02-09 11:49:51,298 INFO nutchwax.Importer - Completed ARC: http://xxxxxxx/jobs/20090202054639076/arcs/CDL-20090202065644-00006-oriole.ucop.edu-00074452.arc.gz
2009-02-09 11:49:51,298 WARN mapred.LocalJobRunner - job_local_1
java.lang.RuntimeException: java.io.IOException: Premature EOF
at org.archive.io.GzippedInputStream$1.hasNext(GzippedInputStream.java:225)
at org.archive.io.arc.ARCReaderFactory$CompressedARCReader$1.innerHasNext(ARCReaderFactory.java:399)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.hasNext(ArchiveReader.java:492)
at org.archive.nutchwax.ArcReader$ArcIterator.hasNext(ArcReader.java:134)
at org.archive.nutchwax.Importer.map(Importer.java:198)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132)
Caused by: java.io.IOException: Premature EOF
at sun.net.www.http.ChunkedInputStream.readAheadBlocking(ChunkedInputStream.java:538)
at sun.net.www.http.ChunkedInputStream.readAhead(ChunkedInputStream.java:582)
at sun.net.www.http.ChunkedInputStream.read(ChunkedInputStream.java:669)
at java.io.FilterInputStream.read(FilterInputStream.java:111)
at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:2196)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
at org.archive.io.RepositionableInputStream.read(RepositionableInputStream.java:81)
at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:214)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:134)
at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:87)
at java.util.zip.InflaterInputStream.skip(InflaterInputStream.java:184)
at org.archive.io.GzippedInputStream.gotoEOR(GzippedInputStream.java:187)
at org.archive.io.GzippedInputStream$1.hasNext(GzippedInputStream.java:217)
... 7 more
2009-02-09 11:49:51,609 FATAL nutchwax.Importer - Importer:
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:894)
at org.archive.nutchwax.Importer.run(Importer.java:666)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.archive.nutchwax.Importer.main(Importer.java:702)

Erik Hetzner
April 2, 2009, 11:02 PM

Attached is a diff which (I believe) fetches a remotely located HTTP URI to a temp file before processing it, then deletes the temp file when finished. It can be toggled on or off with a setting.

Background: We have had to increase the timeouts on our HTTP servers to 60 minutes to avoid problems fetching ARC files, because currently the ARCReader keeps a stream open to the HTTP server while NutchWAX is importing the ARC, which causes problems if NutchWAX takes a long time to import it.

Part of our problem is that we have increased the maximum size of the ARC files we process; but importing can be slow for many reasons, and it seems to me that pre-fetching the ARCs to a local directory should be an option.
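
As a sketch of what such an option could look like (the configuration key and helper names here are hypothetical, not necessarily what the attached diff uses), the toggle could be an ordinary Hadoop configuration boolean:

import java.io.File;
import org.apache.hadoop.conf.Configuration;

public class PrefetchToggleSketch
{
    // Hypothetical property name; off by default to preserve current behavior.
    private static final String PREFETCH_KEY = "nutchwax.importer.prefetch";

    public static void importArc( Configuration conf, String arcUrl ) throws Exception
    {
        if ( conf.getBoolean( PREFETCH_KEY, false ) )
        {
            // Copy to a local temp file first (see PrefetchSketch above),
            // import from disk, and delete the copy when finished.
            File local = PrefetchSketch.prefetch( arcUrl );
            try     { doImport( local.getPath() ); }
            finally { local.delete(); }
        }
        else
        {
            // Current behavior: stream directly off the open HTTP connection.
            doImport( arcUrl );
        }
    }

    private static void doImport( String arcLocation ) { /* importer work goes here */ }
}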

Obsolete

Assignee

Aaron Binns

Reporter

Erik Hetzner

Labels

None

Issue Category

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Priority

Major