Heritrix

Introduction

This used to be the public wiki for the Heritrix archival crawler project. The contents of this wiki have been migrated to the Heritrix 3 GitHub project wiki.

Webmasters!

Heritrix is designed to respect the robots.txt exclusion directives and META robots tags, and collect material at a measured, adaptive pace unlikely to disrupt normal website activity.
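For webmasters who want to check how a robots.txt file would be read for a given User-Agent, here is a minimal sketch using Python's standard urllib.robotparser module. It is not Heritrix's own parser, so results may differ in edge cases. The helper name is_allowed, the example.com URLs, and the sample rules are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Uses Python's standard-library parser, not Heritrix's own logic.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: keep archive.org_bot out of /private/ only.
SAMPLE_ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /private/

User-agent: *
Disallow:
"""

if __name__ == "__main__":
    for path in ("/private/report.html", "/public/index.html"):
        url = "http://example.com" + path
        print(path, is_allowed(SAMPLE_ROBOTS_TXT, "archive.org_bot", url))
```

With the sample rules, the /private/ path is refused for archive.org_bot while other paths, and other crawlers, remain allowed.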

The Internet Archive crawls with the User-Agent archive.org_bot. If you notice our crawler behaving poorly, please email us at archive-crawler-agent@lists.sourceforge.net.

(If you see a different User-Agent in your logs that still says 'heritrix', it may be someone else using this open-source software. In such a case, even if we can't directly change how your site is crawled, we are happy to help you interpret your logs and identify, contact, or block the source of any troublesome crawling.)
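As a starting point for interpreting logs, the following sketch tallies the distinct User-Agent strings that mention 'heritrix'. It assumes the server writes the common "combined" access log format, in which the User-Agent is the last quoted field on each line. The function name and the sample log lines, including the User-Agent strings in them, are hypothetical.

```python
import re
from collections import Counter

# Assumption: combined log format, User-Agent is the last quoted field.
_LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def heritrix_user_agents(log_lines):
    """Count User-Agent strings containing 'heritrix' (case-insensitive)."""
    counts = Counter()
    for line in log_lines:
        match = _LAST_QUOTED_FIELD.search(line)
        if match and "heritrix" in match.group(1).lower():
            counts[match.group(1)] += 1
    return counts


# Hypothetical log lines for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2020:00:00:01 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; heritrix/3.x +http://example.org/bot)"',
    '192.0.2.10 - - [01/Jan/2020:00:00:05 +0000] "GET /a HTTP/1.1" 200 128 '
    '"-" "Mozilla/5.0 (compatible; heritrix/3.x +http://example.org/bot)"',
    '198.51.100.7 - - [01/Jan/2020:00:00:09 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    for agent, hits in heritrix_user_agents(SAMPLE_LOG).most_common():
        print(hits, agent)
```

The contact URL or address that an operator includes in the User-Agent string is usually the quickest way to identify who is running a given crawl.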

Note: Heritrix does not yet support the wildcard extension to robots.txt, so pattern characters such as * within Disallow paths are not interpreted as wildcards.