Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment:

      Description

      I have a seed list of nearly 1024 entries. Not
      surprisingly, Heritrix behaves a little oddly with
      that many seeds. First, crawls with either 0.6.0 or
      the latest CVS build fail almost immediately because
      too many files are opened, after which neither socket
      operations nor file logging can proceed.
      A typical exception:

      java.io.FileNotFoundException: /crawl/heritrix/heritrix-0.6.0/jobs/crs-20040427190708335/disk/scratch/bphc.hrsa.gov.ff0 (Too many open files)
          at java.io.FileOutputStream.open(Native Method)
          at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
          at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
          at org.archive.io.FlipFileOutputStream.<init>(FlipFileOutputStream.java:69)
          at org.archive.io.DiskBackedByteQueue.initializeStreams(DiskBackedByteQueue.java:67)
          at org.archive.util.DiskQueue.<init>(DiskQueue.java:100)
          at org.archive.util.DiskBackedQueue.<init>(DiskBackedQueue.java:59)
          at org.archive.crawler.basic.KeyedQueue.<init>(KeyedQueue.java:76)
          at org.archive.crawler.basic.Frontier.keyedQueueFor(Frontier.java:927)
          at org.archive.crawler.basic.Frontier.scheduleForRetry(Frontier.java:1333)
          at org.archive.crawler.basic.Frontier.finished(Frontier.java:676)
          at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:200)
          at org.archive.crawler.framework.ToeThread.run(ToeThread.java:124)

      You can get past that by allowing a larger number of
      open files for the process (which requires running
      Heritrix with root privilege), as in:

        (ulimit -n 4096; JAVA_OPTS=-Xmx320 bin/heritrix -p 9876)
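
The workaround above can be sketched as a small wrapper that checks the limits before launching anything. This is only an illustration, not part of Heritrix: `run_with_nofile` is a hypothetical helper name, and the sketch assumes a bash-like shell whose `ulimit` builtin accepts `-S`, `-H` and `-n`. It raises (or lowers) the soft limit inside a subshell, so the calling shell keeps its own limit, and it reports the case where the requested value exceeds the hard limit, which is the case that needs root.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Heritrix): run a command with the soft
# open-file limit set to the first argument.
run_with_nofile() {
    local want=$1
    shift
    local hard
    hard=$(ulimit -H -n)
    # Going above the hard limit is the step that needs root; report it clearly.
    if [ "$hard" != "unlimited" ] && [ "$want" -gt "$hard" ]; then
        echo "requested $want exceeds hard limit $hard; raising it needs root" >&2
        return 1
    fi
    # The subshell confines the new limit to the command being launched.
    ( ulimit -S -n "$want" && "$@" )
}

# Print the limits the current shell would pass on to a child process.
echo "soft limit: $(ulimit -S -n)"
echo "hard limit: $(ulimit -H -n)"
```

With `JAVA_OPTS` exported beforehand, `run_with_nofile 4096 bin/heritrix -p 9876` would play the same role as the command in the report, provided the hard limit is at least 4096.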

People

    • Assignee: Michael Stack (Inactive)
    • Reporter: Michael Stack (Inactive)