The WARC file format offers a convention for concatenating multiple resources, each consisting of a set of simple text headers and an arbitrary data block, into one long file. It allows for recording content beyond the primary content stored in ARCs, such as metadata and duplicate detection events.

The goal of this document is to facilitate the creation and exchange of web archive metadata by establishing a convention on the use and meaning of existing WARC header fields, and by defining new fields to use in the metadata record block along the lines of the ones already described in the WARC file specification (WARC).

The WARC file being described is a simple concatenation of one or more metadata records. Each WARC record consists of a record header followed by a record content block and two newlines. The file begins with a warcinfo record indicating the date of creation of the metadata records and the name and version of the software used to generate them.

Metadata Record Header

The record header consists of a first line declaring the record to be in the WARC format with a given version number, followed by a variable number of line-oriented named fields terminated by a blank line.

Named fields in the metadata record header

| Header Field | Description |
| --- | --- |
| WARC-Type | The type of WARC record. Set to 'metadata' |
| WARC-Target-URI | The original URI of the primary content |
| WARC-Date | A 14-digit timestamp that represents the instant of data capture of the primary content |
| WARC-Record-ID | An identifier assigned to the current record that is globally unique for its period of intended use |
| WARC-Refers-To | The WARC-Record-ID of the primary WARC record being described. In the case of ARC records, the identifier is a combination of ARC filename and file-offset (e.g. <urn:arc:foo.arc.gz:3492>) |
| Content-Type | The MIME type of the information contained in the metadata record's block. Set to 'application/json' |
| Content-Length | The number of octets in the metadata record's block |
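
As an illustration of the header fields above, the following is a minimal sketch of serializing one metadata record. The `WARC/1.0` version line, the CRLF line endings, and the `urn:uuid` scheme for WARC-Record-ID are assumptions taken from common WARC practice rather than from this page, and the function names `build_metadata_record` and `arc_record_id` are hypothetical.

```python
import json
import uuid


def arc_record_id(filename: str, offset: int) -> str:
    """Build the ARC-style identifier described for WARC-Refers-To."""
    return f"<urn:arc:{filename}:{offset}>"


def build_metadata_record(target_uri: str, date14: str,
                          refers_to: str, block: dict) -> bytes:
    """Serialize a 'metadata' record: header, blank line, JSON block, two newlines."""
    body = json.dumps(block).encode("utf-8")
    fields = [
        ("WARC-Type", "metadata"),
        ("WARC-Target-URI", target_uri),
        ("WARC-Date", date14),
        # Assumption: a random UUID URN is unique enough for the record ID.
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Refers-To", refers_to),
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),  # octets, not characters
    ]
    # Assumption: WARC/1.0 version line and CRLF line endings.
    header = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in fields) + "\r\n"
    return header.encode("utf-8") + body + b"\r\n\r\n"


record = build_metadata_record(
    "http://example.com/",
    "20110101000000",
    arc_record_id("foo.arc.gz", 3492),
    {"Container": {}, "Envelope": {}},
)
```

Content-Length is computed from the encoded JSON bytes, so non-ASCII metadata values are counted in octets as the field definition requires.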

Metadata Record Content Block

The metadata record block uses the “application/json” format (nested JSON) to describe metadata of the primary ARC / WARC record. The metadata is organized into two blocks, Container and Envelope. All metadata fields are optional.

Container

Structure of nested metadata:

Container-Metadata

| Metadata Field | Description |
| --- | --- |
| Filename | Filename of the ARC / WARC file where the record is stored |
| Compressed | Indicates if the file is compressed |
| Offset | File offset of the record in the file |
| Digest | Parameter indicating the algorithm name and calculated value of a digest applied to the file |
| Gzip-Metadata | The Gzip metadata block |

Gzip-Metadata

See RFC 1952 for details on the gzip format.

| Metadata Field | Description |
| --- | --- |
| Header-Length | Indicates octet length of the Gzip header |
| Footer-Length | Indicates octet length of the Gzip footer. Always 8 |
| Deflate-Length | Indicates octet length of the total deflated gzip member, including Header and Footer data |
| Inflated-CRC | Indicates inflated CRC |
| Inflated-Length | Indicates inflated length |
| F-Extra | F-Extra fields and values |
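
The Gzip-Metadata fields can be derived from a single gzip member roughly as follows. This is a sketch under stated assumptions: the header is walked according to the RFC 1952 layout, the CRC and length are reported as plain integers (this page does not fix a representation), F-Extra is not decoded, and the function names are hypothetical.

```python
import gzip
import struct
import zlib


def gzip_header_length(data: bytes) -> int:
    """Octet length of the gzip member header at the start of data (RFC 1952)."""
    if data[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip member")
    flags = data[3]
    pos = 10  # fixed part: ID1 ID2 CM FLG MTIME(4) XFL OS
    if flags & 0x04:  # FEXTRA: 2-byte little-endian length, then that many octets
        (xlen,) = struct.unpack_from("<H", data, pos)
        pos += 2 + xlen
    if flags & 0x08:  # FNAME: zero-terminated string
        pos = data.index(b"\x00", pos) + 1
    if flags & 0x10:  # FCOMMENT: zero-terminated string
        pos = data.index(b"\x00", pos) + 1
    if flags & 0x02:  # FHCRC: 2-byte header CRC
        pos += 2
    return pos


def gzip_metadata(data: bytes) -> dict:
    """Describe the first gzip member in data; later octets are ignored."""
    inflater = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect gzip framing
    inflated = inflater.decompress(data) + inflater.flush()
    if not inflater.eof:
        raise ValueError("truncated gzip member")
    member_length = len(data) - len(inflater.unused_data)
    return {
        "Header-Length": gzip_header_length(data),
        "Footer-Length": 8,  # CRC32 (4 octets) + ISIZE (4 octets)
        "Deflate-Length": member_length,  # header + deflate data + footer
        "Inflated-CRC": zlib.crc32(inflated),
        "Inflated-Length": len(inflated),
    }


member = gzip.compress(b"hello")
meta = gzip_metadata(member + b"next member would start here")
```

Because the decompressor stops at the end of the first member, the same function can be applied at a record's offset in a multi-member `.arc.gz` or `.warc.gz` file.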

Envelope

Structure of nested metadata:

Envelope-Metadata

| Metadata Field | Description |
| --- | --- |
| Format | Indicates if the record is an ARC / WARC record |
| ARC-Header-Length / WARC-Header-Length | Number of octets in the record's header |
| ARC-Header-Metadata / WARC-Header-Metadata | The header metadata block |
| Payload-Metadata | The payload metadata block |
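
To make the nesting of Container and Envelope concrete, here is a small sketch that assembles the top-level JSON structure. The helper name `make_block`, the sample values, and the choice of JSON value types (a boolean for Compressed, integers for offsets and lengths) are assumptions made for illustration; this page does not prescribe them.

```python
import json


def make_block(filename, compressed, offset, fmt,
               header_length, header_metadata, payload_metadata=None):
    """Assemble the nested metadata block for one ARC or WARC record."""
    if fmt not in ("ARC", "WARC"):
        raise ValueError("fmt must be 'ARC' or 'WARC'")
    return {
        "Container": {
            "Filename": filename,
            "Compressed": compressed,
            "Offset": offset,
        },
        "Envelope": {
            "Format": fmt,
            # The key prefix follows the record format: ARC-... or WARC-...
            f"{fmt}-Header-Length": header_length,
            f"{fmt}-Header-Metadata": header_metadata,
            "Payload-Metadata": payload_metadata or {},
        },
    }


block = make_block("foo.arc.gz", True, 3492, "ARC", 70,
                   {"Date": "20110101000000"})
print(json.dumps(block, indent=2))
```

Since all metadata fields are optional, a producer could equally omit any key it cannot populate instead of emitting an empty value.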

ARC-Header-Metadata

Metadata fields from ARC record's header 

| Metadata Field | Description |
| --- | --- |
| Date | 14-digit timestamp that represents the instant of data capture |
| Content-Length | Number of octets in the record's block (Declared) |
| Content-Type | MIME type of the information contained in the record's block (Declared) |
| IP-Address | The numeric Internet address contacted to retrieve content |
| Target-URI | Original URI whose capture gave rise to the information content in the record |
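
As a sketch of how these fields might be filled in, the function below maps an ARC record header line to ARC-Header-Metadata. It assumes the ARC version 1 URL-record layout of five space-separated fields (URL, IP address, archive date, content type, archive length); the function name is hypothetical, and declared values are kept as strings exactly as they appear in the header.

```python
def arc_header_metadata(header_line: str) -> dict:
    """Map an ARC v1 URL-record header line to ARC-Header-Metadata fields."""
    parts = header_line.strip().split(" ")
    if len(parts) != 5:
        raise ValueError("expected 5 space-separated fields, got %d" % len(parts))
    uri, ip_address, date, content_type, length = parts
    return {
        "Target-URI": uri,
        "IP-Address": ip_address,
        "Date": date,
        "Content-Type": content_type,  # declared, not verified
        "Content-Length": length,      # declared, not verified
    }


arc_meta = arc_header_metadata(
    "http://example.com/ 192.0.2.1 20110101000000 text/html 1234\n"
)
```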

WARC-Header-Metadata

Metadata fields from WARC record's header

The metadata record block includes a metadata field for each WARC named field that appears in the original WARC record's header.

| Metadata Field | Description |
| --- | --- |
| WARC-Type | Type of WARC record |
| WARC-Record-ID | Identifier assigned to the WARC record |
| WARC-Date | 14-digit timestamp that represents the instant of data capture |
| Content-Length | Number of octets in the record's block (Declared) |
| Content-Type | MIME type of the information contained in the record's block (Declared) |
| WARC-Concurrent-To | WARC-Record-IDs of any records created as part of the same capture event as the record |
| WARC-Block-Digest | Parameter indicating the algorithm name and calculated value of a digest applied to the full block of the record |
| WARC-Payload-Digest | Parameter indicating the algorithm name and calculated value of a digest applied to the payload referred to or contained by the record, which is not necessarily equivalent to the record block. The payload of an application/http block is its 'entity-body' (per RFC2616). In contrast to WARC-Block-Digest, the WARC-Payload-Digest field may also be used for data not actually present in the current record block or when a record is segmented. |
| WARC-IP-Address | The numeric Internet address contacted to retrieve content |
| WARC-Refers-To | WARC-Record-ID of a single record for which the record holds additional content |
| WARC-Target-URI | Original URI whose capture gave rise to the information content in the record |
| WARC-Truncated | Indicates truncation of content block with the reason of truncation |
| WARC-Warcinfo-ID | WARC-Record-ID of associated 'warcinfo' record |
| WARC-Filename | Filename containing the 'warcinfo' record. Applicable only for 'warcinfo' records |
| WARC-Profile | URI signifying the kind of analysis and handling applied in the 'revisit' record. Applicable only for 'revisit' records |
| WARC-Identified-Payload-Type | Content type of the record's payload as determined by an independent check |
| WARC-Segment-Origin-ID | Identifies the starting record in a series of segmented records whose content blocks are reassembled to obtain a logically complete content block. Applicable only for 'continuation' records |
| WARC-Segment-Number | Reports the record's relative ordering in a sequence of segmented records |
| WARC-Segment-Total-Length | Reports the total length of all segment content blocks when concatenated together. Applicable only for 'continuation' records |
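
Since each named field in the original WARC header becomes one metadata field, the mapping can be sketched as a simple header parser. The assumptions here are CRLF line endings, no folded continuation lines, a header length that counts the terminating blank line, and collecting repeated fields such as WARC-Concurrent-To into a list. The function name is hypothetical.

```python
def parse_warc_header(raw: bytes) -> dict:
    """Parse a WARC record header into Envelope-level metadata.

    raw must start at the version line; octets after the blank line are ignored.
    """
    head, sep, _ = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line terminating the header")
    lines = head.decode("utf-8").split("\r\n")
    if not lines[0].startswith("WARC/"):
        raise ValueError("missing WARC version line")
    fields = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        value = value.strip()
        if name in fields:
            # Repeated field: keep every value, in order of appearance.
            previous = fields[name]
            fields[name] = previous + [value] if isinstance(previous, list) else [previous, value]
        else:
            fields[name] = value
    return {
        "Format": "WARC",
        # Assumption: the length includes the version line and the blank line.
        "WARC-Header-Length": len(head) + len(sep),
        "WARC-Header-Metadata": fields,
    }


raw_record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Concurrent-To: <urn:uuid:a>\r\n"
    b"WARC-Concurrent-To: <urn:uuid:b>\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello\r\n\r\n"
)
envelope = parse_warc_header(raw_record)
```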

Payload-Metadata

| Metadata Field | Description |
| --- | --- |
| Actual-Content-Type | MIME type of the information contained in the record's block as determined by an independent check (Actual/Detected) |
| Actual-Content-Length | Number of octets in the record's block (Actual) |
| Block-Digest | Parameter indicating the algorithm name and calculated value of a digest applied to the full block of the record |
| Trailing-Slop-Length | Number of trailing slop bytes |
| Metadata | The record-type specific metadata block (HTTP-Response-Metadata, DNS-Response-Metadata etc.) |
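
A digest "parameter" pairs an algorithm name with the calculated value. The sketch below computes Actual-Content-Length and Block-Digest for a record block, assuming the `algorithm:value` form with a Base32-encoded SHA-1, which is a common convention in WARC files but is not mandated by this page. The function name is hypothetical, and Actual-Content-Type and Trailing-Slop-Length are left out because their detection rules are not defined here.

```python
import base64
import hashlib


def block_payload_metadata(block: bytes) -> dict:
    """Compute the measured (not declared) length and digest of a record block."""
    sha1 = hashlib.sha1(block).digest()
    return {
        "Actual-Content-Length": len(block),
        # Assumption: "sha1:" label followed by the Base32 form of the digest.
        "Block-Digest": "sha1:" + base64.b32encode(sha1).decode("ascii"),
    }


payload_meta = block_payload_metadata(b"hello")
```

Comparing Actual-Content-Length with the declared Content-Length from the header metadata is one way a consumer could spot truncated or malformed records.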

HTTP-Response-Metadata

| Metadata Field | Description |
| --- | --- |
| Message | Indicates version of HTTP along with status and reason of HTTP response |
| Header-Length | Indicates length of HTTP headers |
| Headers | HTTP Header fields and associated values |
| Entity-Length | Indicates length of HTTP Entity in octets |
| Entity-Digest | Indicates the algorithm name and calculated value of a digest applied to the HTTP Entity |
| Entity-Content-Type | MIME type of the HTTP Entity |
| Entity-Trailing-Slop-Length | Number of trailing slop bytes in HTTP Entity |
| Entity-Transfer-Encoding | Indicates the Transfer-Encoding value |
| Metadata | The record's content-type specific metadata block (HTML-Metadata, PDF-Metadata etc.) |
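
The following sketch derives several HTTP-Response-Metadata fields from the block of an HTTP response record. It assumes that Header-Length counts the status line, the header lines, and the terminating blank line, that the entity is measured as stored (no transfer decoding), and that digests use the `sha1:` plus Base32 form. The function name is hypothetical; Entity-Content-Type and Entity-Trailing-Slop-Length are omitted.

```python
import base64
import hashlib


def http_response_metadata(block: bytes) -> dict:
    """Split an HTTP response block into message, headers and entity metadata."""
    head, sep, entity = block.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no end of HTTP headers found")
    lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    digest = base64.b32encode(hashlib.sha1(entity).digest()).decode("ascii")
    meta = {
        "Message": lines[0],  # e.g. "HTTP/1.1 200 OK"
        "Header-Length": len(head) + len(sep),
        "Headers": headers,
        "Entity-Length": len(entity),
        "Entity-Digest": "sha1:" + digest,
    }
    # Header names are case-insensitive, so look the field up accordingly.
    for name, value in headers.items():
        if name.lower() == "transfer-encoding":
            meta["Entity-Transfer-Encoding"] = value
    return meta


http_block = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\nhello"
http_meta = http_response_metadata(http_block)
```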

HTML-Metadata

| Metadata Field | Description |
| --- | --- |
| Head | Attributes and values of HTML head elements: title, base, style, link, meta and script |
| Links | Indicates the absolute URI of an outgoing link from the capture, the URI of the link as it appears on the page, the type of outgoing link (link, embed, redirect or other), XPath-suffix of link (best-effort), the alt attribute and anchor-text (truncated to 100 bytes) |
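
A rough sketch of collecting Links entries with Python's standard HTML parser is shown below. The key names (`url`, `absolute`, `type`, `text`, `alt`) are hypothetical, since this page describes the content of each entry but not its JSON keys. Only `a`, `img` and `script` elements are handled, and the XPath suffix and redirect detection are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect outgoing links of a captured page, resolved against its URI."""

    def __init__(self, base_uri: str):
        super().__init__()
        self.base_uri = base_uri
        self.links = []
        self._open_anchor = None  # link entry for the <a> currently open

    def _add(self, url: str, kind: str) -> dict:
        link = {"url": url, "absolute": urljoin(self.base_uri, url), "type": kind}
        self.links.append(link)
        return link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._open_anchor = self._add(attrs["href"], "link")
            self._open_anchor["text"] = ""
        elif tag in ("img", "script") and attrs.get("src"):
            link = self._add(attrs["src"], "embed")
            if attrs.get("alt"):
                link["alt"] = attrs["alt"]

    def handle_data(self, data):
        if self._open_anchor is not None:
            self._open_anchor["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._open_anchor is not None:
            # Truncate anchor text to 100 bytes, dropping any split character.
            raw = self._open_anchor["text"].strip().encode("utf-8")[:100]
            self._open_anchor["text"] = raw.decode("utf-8", errors="ignore")
            self._open_anchor = None


extractor = LinkExtractor("http://example.com/dir/page.html")
extractor.feed('<a href="../about">About <b>us</b></a>'
               '<img src="/logo.png" alt="Logo">')
extractor.close()
```

Keeping both the URI as written and its absolute form lets a consumer reproduce the page's own markup while still joining links against other captures.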


2 Comments

  1. For link data, is it possible to readily identify whether or not a URI is within the same domain? I.e., in addition to capturing "link/embed/redirect", could the domain level be captured?

    1. Can you give an example of what you'd want?