Heritrix2

Heritrix 2 is no longer in development

Heritrix 2 was a forward-looking successor to Heritrix 1. Its custom settings architecture has since been migrated to Heritrix 3, where it is built on Spring. We recommend Heritrix 1 for legacy crawl projects and Heritrix 3 for new projects.

Heritrix 2

The major changes in Heritrix 2 are a refactored settings architecture, both internal and in the UI, which provides a basis for new functionality, and new options for controlling the priority of URLs, sites, and other content groupings.

Releases

Date               Release Notes    Misc
2008 November 7    2.0.2            bugfixes
2008 August 7      2.0.1            bugfixes & small new features
2008 February 20   2.0.0
2007 December 6    2.0.0-RC1
2007 December      2.0.0-Beta
2007 October       2.0.0-Alpha-2
2007 July          2.0.0-Alpha-1    Release announcement

Links to the source code for each release can be found in the corresponding release notes.

All code is hosted on SourceForge: http://sourceforge.net/projects/archive-crawler/

Work on Heritrix 2 occurs in the SVN area trunk/heritrix2.

User Documentation

We don't have a lot of documentation yet, but it will grow over time.

Moving to 2.x - for Crawl Operators
2.0 Tutorial
HOWTO Launch Heritrix
Heritrix 2.0.0 First Impressions

Developer Documentation

We have even less developer documentation, but intend to expand it over time.

Moving to 2.x - for Developers
Setting up the new Heritrix in Eclipse

Other Documentation

Below is a list of all the child pages of this one. Some of those are also linked above, in the appropriate section, but some haven't been categorized yet.