
In addition to logs, the following files are generated.  Some of the information in them is also available in the WUI.

surts.dump

This file contains the SURT form of the seed URIs.
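For readers unfamiliar with the SURT form, the following is a minimal sketch of the idea: the host labels are reversed and comma-separated so that URIs sort by domain hierarchy. The function name to_surt is hypothetical, and the sketch ignores ports, user info, and the canonicalization rules Heritrix applies, so it should not be read as the crawler's own implementation or as the exact contents of the dump file.

```python
from urllib.parse import urlsplit


def to_surt(uri: str) -> str:
    """Return a simplified SURT form of `uri` (illustrative only).

    Assumption: only scheme, host, path, and query are handled.
    Ports, user info, and Heritrix's canonicalization are not.
    """
    parts = urlsplit(uri)
    host = parts.hostname or ""  # urlsplit lowercases the hostname
    # Reverse the dot-separated labels and join them with commas,
    # keeping a trailing comma before the closing parenthesis.
    reversed_host = ",".join(reversed(host.split("."))) + ","
    path = parts.path or "/"
    surt = f"{parts.scheme}://({reversed_host}){path}"
    if parts.query:
        surt += "?" + parts.query
    return surt


if __name__ == "__main__":
    print(to_surt("http://www.archive.org/"))  # http://(org,archive,www,)/
```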

negative-surts.dump

This file contains the SURT form of URIs that are to be excluded from the crawl.

heritrix_out.log

This file captures output to standard out and standard error.  Most of the output consists of low-level exceptions and logging information.

This file is created in the same directory as the Heritrix jar file.  It is not associated with any one job, but contains output from all jobs run by the crawler.

Below is sample output from this file:

crawl-report.txt

This file contains useful metrics about completed jobs.  The report is created by the StatisticsTracker bean.  This file is written at the end of the crawl.

Below is sample output from this file:

hosts-report.txt

This file contains an overview of the hosts that were crawled.  It also displays the number of documents crawled and the bytes downloaded per host.

This file is created by the StatisticsTracker bean and is written at the end of the crawl.

Below is sample output from this file:

mimetype-report.txt

This file contains a report displaying the number of documents and the amount of data downloaded per MIME type.

This file is created by the StatisticsTracker bean and is written at the end of the crawl.

Below is sample output from this report:

processors-report.txt

This file contains the processors report.  The processors report shows the activity of each Heritrix processor.  For more information on processors see Processing Chains.  It is written at the end of the crawl.

Below is sample output from this report:

responsecode-report.txt

This file contains a report displaying the number of documents downloaded per status code.  It covers successful codes only.  For failure codes see the crawl.log file.

This file is created by the StatisticsTracker bean and is written at the end of the crawl.

Below is sample output from this report:

seeds-report.txt

This file contains the crawling status of each seed.

This file is created by the StatisticsTracker bean and is written at the end of the crawl.

Below is sample output from this report:

frontier-summary-report.txt

This report contains a breakdown of frontier activity on a per-thread basis.  For each thread running, the status of the frontier queue can be examined.

Below is sample output from this report.

source-report.txt

This report contains a line item for each host, which includes the seed from which the host was reached.

Below is a sample of this report:

Note

  • The sourceTagSeeds property of the TextSeedModule bean must be set to true for this report to be generated.

threads-report.txt

This report contains the list of threads that were active at the end of the crawl.  Detailed information about each thread is also available.

WARC files

Assuming you are using the WARC writer that comes with Heritrix, a number of WARC files will be generated containing crawled content.

You can specify the storage location of WARC files by setting the directory value of the WARCWriterProcessor bean.

WARC files are named using the following convention:

[prefix][12 digit timestamp][series padded to 5 digits][crawler hostname].warc.gz

The WARCWriterProcessor contains the prefix setting.  By default it is IAH.
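The convention above can be sketched as a small helper. This is an illustration under stated assumptions, not Heritrix code: the function name legacy_warc_name is hypothetical, the 12-digit timestamp is assumed to be year, month, day, hour, and minute, and the hyphen separator between fields is an assumption that the convention line does not spell out.

```python
from datetime import datetime


def legacy_warc_name(prefix: str, when: datetime, series: int,
                     hostname: str, sep: str = "-") -> str:
    """Build a file name following the documented field order.

    Assumptions: `sep` (hyphen) and the yyyyMMddHHmm layout of the
    12-digit timestamp are guesses made for illustration.
    """
    timestamp = when.strftime("%Y%m%d%H%M")  # 4+2+2+2+2 = 12 digits
    serial = f"{series:05d}"                 # series padded to 5 digits
    return sep.join([prefix, timestamp, serial, hostname]) + ".warc.gz"


if __name__ == "__main__":
    # "IAH" is the default prefix named in the text; the host is made up.
    print(legacy_warc_name("IAH", datetime(2011, 7, 5, 14, 30), 3, "crawler01"))
```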

WARC files with an .open suffix are in the process of being written to by Heritrix.  There may be multiple open WARCs at any given time.

WARC files with an .invalid suffix indicate problems writing to the file.  This may be the result of a bad disk or a full disk.  On an I/O problem, Heritrix closes the problematic WARC file and gives it an .invalid suffix.  These files should be checked for coherence.
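When inspecting a WARC directory, it can help to group files by the suffixes described above. The following is a minimal sketch; the function name classify_warcs and the returned dictionary layout are hypothetical, and the sketch only looks at file names, not at file contents.

```python
from pathlib import Path


def classify_warcs(directory: str) -> dict:
    """Group file names in `directory` by WARC state, based on suffix.

    Returns a dict with keys "open", "invalid", and "complete".
    Files that do not look like WARCs are ignored.
    """
    states = {"open": [], "invalid": [], "complete": []}
    for path in sorted(Path(directory).iterdir()):
        name = path.name
        if name.endswith(".open"):
            states["open"].append(name)      # still being written
        elif name.endswith(".invalid"):
            states["invalid"].append(name)   # closed after an I/O problem
        elif name.endswith((".warc.gz", ".warc")):
            states["complete"].append(name)
    return states
```

Any names reported under "invalid" are the ones to check for coherence before downstream processing.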

As of Heritrix 3.1, the LowDiskPauseProcessor bean has been replaced by the DiskSpaceMonitor bean.  When writing WARC files, the DiskSpaceMonitor checks the available space on the configured paths and pauses the crawl if free space has dropped below the defined threshold.  In the example below, the path /warcs is monitored.  If free space drops below 500MB, the crawls writing to the /warcs directory are paused.
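The bean configuration itself is not reproduced here. Purely to illustrate the kind of check described above, the following Python sketch reports which monitored paths have fallen below a free-space threshold. It is not the DiskSpaceMonitor implementation: the function name paths_below_threshold is hypothetical, and treating a megabyte as 2**20 bytes is an assumption.

```python
import shutil

MIB = 1024 * 1024  # assumption: one "MB" is counted as 2**20 bytes


def paths_below_threshold(paths, threshold_mib, usage=shutil.disk_usage):
    """Return the paths whose free space is below `threshold_mib`.

    `usage` defaults to shutil.disk_usage and can be swapped out
    for testing. A caller would pause the crawl if the result is
    non-empty.
    """
    low = []
    for path in paths:
        free_mib = usage(path).free // MIB
        if free_mib < threshold_mib:
            low.append(path)
    return low
```

With the values from the text, a call such as paths_below_threshold(["/warcs"], 500) would return ["/warcs"] once free space under /warcs drops below the threshold.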

As of Heritrix 3.1, the naming convention for WARC files has changed.  Instead of specifying the formula for ARC/WARC naming in code and using a supplied 'prefix' and 'suffix', a template with variable interpolation may be used.  The configured 'prefix' remains an available variable, as well as other useful local machine, crawl, and writer properties. The default template is:

The template adds the local process ID and a 17-digit timestamp.  The timestamp is provided by a service that ensures each timestamp is at least 1 millisecond later than any previously issued value.  The new default convention minimizes the likelihood of ARC/WARC name collisions, even when many crawls are launched or running simultaneously on the same local machine with the same file name prefix.  Although the generated names are long, they are very likely to be unique under normal conditions.  Changing the template is not recommended unless the alternate naming system is certain to generate unique names as well.  This matters because downstream tools that index ARCs/WARCs often assume that file names are unique.
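The uniqueness guarantee on the timestamp can be sketched as follows. This is an illustration of the idea rather than the Heritrix service: the class name UniqueTimestamps, the method next17, and the UTC yyyyMMddHHmmssSSS layout of the 17 digits are assumptions.

```python
import threading
import time
from datetime import datetime, timezone


class UniqueTimestamps:
    """Issue 17-digit timestamps that always increase by at least 1 ms."""

    def __init__(self, clock=None):
        self._lock = threading.Lock()
        self._last_ms = 0
        # `clock` returns milliseconds since the epoch; injectable for tests.
        self._clock = clock or (lambda: int(time.time() * 1000))

    def next17(self) -> str:
        with self._lock:
            now_ms = self._clock()
            if now_ms <= self._last_ms:
                # The clock has not advanced: step past the last value.
                now_ms = self._last_ms + 1
            self._last_ms = now_ms
        seconds, millis = divmod(now_ms, 1000)
        stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
        return stamp.strftime("%Y%m%d%H%M%S") + f"{millis:03d}"
```

Two calls made within the same millisecond still yield distinct values, which is what keeps file names from colliding when several writers start at once.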
