
Heritrix is the Internet Archive's open-source, extensible, scalable, archival-quality Web crawler.

This document explains how to install, configure, and use Heritrix to crawl the Web.  It assumes the reader has a general understanding of computing concepts such as HTTP and URIs.


This document is intended for Heritrix administrators and other technical staff who want to crawl the Internet using Heritrix.


The information in this guide is for Heritrix 3.0 unless otherwise noted.  Sections that provide information about Heritrix 3.1 are marked by an "As of Heritrix 3.1" clause.

This page has moved to Introduction on the GitHub wiki.