WARC File Format
The WARC file format is the successor to the ARC format, which has been used for many years to store the Internet Archive's web captures. Small example ARC and WARC (v0.17) files from a shallow (~2 hops) Heritrix crawl of the www.archive.org website are attached to this wiki page. Larger, more representative ARC and WARC files are easy to create with any recent release of Heritrix.
Compared to ARC, WARC adds:
- an expandable set of header fields per record
- optional new record types for data and metadata beyond HTTP responses (which were all that ARC recorded)
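To make the per-record headers and record types concrete, here is a minimal Python sketch of reading records from an uncompressed WARC stream. It assumes the record layout from ISO 28500: a version line, `Name: value` header lines, a blank line, then `Content-Length` bytes of content. The function name `read_warc_record` and the `SAMPLE` record are hypothetical and for illustration only. Header field names in earlier drafts (such as v0.17) may differ, and WARC files are often gzip-compressed, which this sketch does not handle.

```python
import io


def read_warc_record(stream):
    """Read one record from a binary stream.

    Returns (version, headers, payload), or None at end of stream.
    Simplification: folded (multi-line) header values are not supported.
    """
    # Skip the blank lines that separate consecutive records.
    line = stream.readline()
    while line in (b"\r\n", b"\n"):
        line = stream.readline()
    if not line:
        return None

    version = line.strip().decode("utf-8")
    if not version.startswith("WARC/"):
        raise ValueError("not a WARC record: %r" % version)

    # Header block: any number of "Name: value" lines, ended by a blank line.
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()

    # The content block is exactly Content-Length bytes long.
    payload = stream.read(int(headers["Content-Length"]))
    return version, headers, payload


# Hypothetical hand-written record, not taken from a real crawl.
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: metadata\r\n"
    b"WARC-Date: 2009-05-01T00:00:00Z\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

if __name__ == "__main__":
    stream = io.BytesIO(SAMPLE)
    record = read_warc_record(stream)
    while record is not None:
        version, headers, payload = record
        print(version, headers.get("WARC-Type"), len(payload))
        record = read_warc_record(stream)
```

Because the headers are read into a dictionary without a fixed schema, a reader written this way tolerates additional fields, and the `WARC-Type` value lets a consumer skip record types it does not understand.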
In May 2009, the proposed WARC standard was approved as ISO 28500:2009. The latest versions of Heritrix output WARC files that conform to this standard, as described at http://bibnum.bnf.fr/WARC/ (latest draft as of November 2008).