
WARC File Format

The WARC file format is a successor to the ARC format. (The ARC format has been used for many years to store the Internet Archive's web captures.) Small example ARC and WARC (v0.17) files from a shallow (~2 hops) Heritrix crawl of the www.archive.org website are attached to this wiki page. It is easy to create larger, more representative ARC and WARC files using any recent release of Heritrix.

Attached files:

  IAH-20080430204825-00000-blackbook.arc.gz: example ARC file of shallow www.archive.org crawl (modified Jul 07, 2008 by Gordon Mohr)
  IAH-20080430204825-00000-blackbook.warc.gz: example WARC file of shallow www.archive.org crawl (modified Jul 07, 2008 by Gordon Mohr)

Compared to ARC, note that WARC adds:

  1. an extensible set of header fields per record, so the amount of header information can vary from record to record
  2. optional new record types for data and metadata other than HTTP responses (which was all that ARC recorded)
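
To make these two additions concrete, here is a minimal sketch of a reader that walks the records of a WARC stream and returns each record's header fields and content block. It assumes the general record layout of a version line (such as WARC/0.17 or WARC/1.0), a variable number of "Name: value" header lines, a blank line, and then Content-Length bytes of content. The names iter_warc_records and build_record are hypothetical helpers written for this illustration, not part of Heritrix or any library, and the records that build_record produces are deliberately simplified (they omit fields a real record carries, such as the record ID and date).

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (version, headers, payload) for each record in a binary WARC stream.

    Simplifications (assumptions of this sketch): folded header lines are not
    handled, and a repeated header name keeps only its last value.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line.strip() == b"":
            continue  # blank lines separate one record from the next
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError(f"expected a WARC version line, got {version!r}")

        # The header block is open-ended: read named fields until a blank line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The content block is read by length, never by scanning for a delimiter.
        payload = stream.read(int(headers["Content-Length"]))
        yield version, headers, payload


def build_record(record_type, payload, version="WARC/1.0"):
    """Build a simplified record for demonstration (not a complete, valid record)."""
    head = (
        f"{version}\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


# Two record types in one stream: crawl metadata, then an HTTP response.
sample = build_record("warcinfo", b"software: example\r\n") + build_record(
    "response", b"HTTP/1.1 200 OK\r\n\r\nhello"
)
records = list(iter_warc_records(io.BytesIO(sample)))
record_types = [headers["WARC-Type"] for _, headers, _ in records]
```

The same function should work on a compressed file by passing it a stream opened with gzip.open(path, "rb"), for example on the attached IAH-20080430204825-00000-blackbook.warc.gz. Python's gzip module reads concatenated gzip members as one continuous stream, so this works whether the file is compressed as a whole or record by record.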

ISO Standard

In May 2009, the proposed WARC standard was approved as ISO 28500:2009. The latest versions of Heritrix write WARC files that conform to this standard, as described at http://bibnum.bnf.fr/WARC/ (latest draft as of November 2008).
