Web Archive Metadata File Specification
The WARC file format offers a convention for concatenating multiple resources, each consisting of a set of simple text headers and an arbitrary data block, into one long file. It allows for recording content beyond the primary content stored in ARCs, such as metadata and duplicate detection events.
The goal of this document is to facilitate the creation and exchange of web archive metadata by establishing a convention on the use and meaning of existing WARC header fields, and by defining new fields to use in the metadata record block along the lines of the ones already described in the WARC file specification (WARC).
The WARC file being described is a simple concatenation of one or more metadata records. Each WARC record consists of a record header followed by a record content block and two newlines. The file begins with a warcinfo record indicating the date of creation of the metadata records and the name and version of the software used to generate them.
Metadata Record Header
The record header consists of a first line declaring the record to be in the WARC format with a given version number, followed by a variable number of line-oriented named fields, terminated by a blank line.
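The header layout described above can be sketched as a small parser. This is a minimal illustration, not part of the specification: the function name `parse_record_header` is hypothetical, the sketch accepts both CRLF and bare LF line endings, and it assumes each field name appears at most once (a repeated field overwrites the earlier value).

```python
def parse_record_header(raw: bytes) -> tuple[str, dict[str, str]]:
    """Split a raw record header into its version line and named fields.

    Hypothetical helper: `raw` must start at the version line. Anything
    after the blank line that terminates the header is ignored.
    """
    # Latin-1 maps every byte to a character, so a binary block after the
    # header cannot make decoding fail.
    text = raw.decode("iso-8859-1")
    lines = text.replace("\r\n", "\n").split("\n")

    version = lines[0]
    if not version.startswith("WARC/"):
        raise ValueError(f"not a WARC record header: {version!r}")

    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line == "":
            break  # the blank line terminates the header
        # Split on the first colon only: values such as
        # <urn:uuid:...> contain colons themselves.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields
```

The returned dictionary can then be checked against the named fields listed below, for example `fields["WARC-Type"] == "metadata"`.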
Named fields in the metadata record header
| Header Field | Description |
|---|---|
| WARC-Type | The type of WARC record. Set to 'metadata' |
| WARC-Target-URI | The original URI of the primary content |
| WARC-Date | A 14-digit timestamp that represents the instant of data capture of the primary content |
| WARC-Record-ID | An identifier assigned to the current record that is globally unique for its period of intended use |
| WARC-Refers-To | The WARC-Record-ID of the primary WARC record being described |
| Content-Type | The MIME type of the information contained in the metadata record's block. Set to 'application/json' |
| Content-Length | The number of octets in the metadata record's block |
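As an illustration of how the header fields above fit together, the following sketch serializes one metadata record. It is not a reference implementation: the function name `build_metadata_record` is hypothetical, and the version line `WARC/1.0` and the CRLF line endings are assumptions taken from the WARC file format rather than from this document. The capture date is passed in as a string so that the sketch takes no position on its format.

```python
import json
import uuid


def build_metadata_record(target_uri: str, capture_date: str,
                          refers_to: str, metadata: dict) -> bytes:
    """Serialize one metadata record: header, JSON block, two newlines."""
    # The block is the JSON document; Content-Length counts its octets.
    block = json.dumps(metadata, sort_keys=True).encode("utf-8")

    # A fresh UUID URN serves as a globally unique record identifier.
    record_id = f"<urn:uuid:{uuid.uuid4()}>"

    header_lines = [
        "WARC/1.0",  # assumption: version declared on the first line
        "WARC-Type: metadata",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {capture_date}",
        f"WARC-Record-ID: {record_id}",
        f"WARC-Refers-To: {refers_to}",
        "Content-Type: application/json",
        f"Content-Length: {len(block)}",
    ]
    # A blank line terminates the header; two newlines follow the block.
    header = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    return header + block + b"\r\n\r\n"
```

Because Content-Length is computed from the encoded block rather than from the Python string, it stays correct when the metadata contains non-ASCII characters.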
Metadata Record Content Block
The metadata record block makes use of the “application/json” format (nested JSON) to describe metadata of the primary ARC or WARC record. The metadata is organized into two blocks, Container and Envelope. All the metadata fields are optional.
{ "Container": {}, "Envelope": {} }
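Since every metadata field is optional, a reader should not assume that either top-level block is present. A minimal sketch of defensive access follows; the function name `split_block` is hypothetical.

```python
import json


def split_block(block: bytes) -> tuple[dict, dict]:
    """Return the (Container, Envelope) objects of a metadata record block.

    A missing block is returned as an empty dict, because all metadata
    fields are optional.
    """
    doc = json.loads(block)
    return doc.get("Container", {}), doc.get("Envelope", {})
```

Nested fields can be read the same way, for example `container.get("Filename")`, which yields `None` when the field is absent instead of raising an error.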
Container
Structure of nested metadata:
"Container": { "Gzip-Metadata": {} },
Container-Metadata
"Container": {
  "Compressed": true,
  "Filename": "ARCHIVEIT-2197-MONTHLY-UOYNUH-20110331102334-00374-crawling212.us.archive.org-6682.warc.gz",
  "Offset": "0",
  "Gzip-Metadata": {}
},
| Metadata Field | Description |
|---|---|
| Filename | Filename of the ARC / WARC file where the record is stored |
| Compressed | Indicates if the file is compressed |
| Offset | File offset of the record in the file |
| Digest | Parameter indicating the algorithm name and calculated value of a digest applied to the file |
| Gzip-Metadata | The Gzip metadata block |
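The Filename and Offset fields are enough to locate a record in a compressed file. The sketch below assumes that each record is stored as its own gzip member starting at Offset, which is how the example values read but is not stated explicitly here. The function name `read_record_at` is hypothetical.

```python
import zlib


def read_record_at(fileobj, offset: int, chunk_size: int = 65536) -> bytes:
    """Inflate the single gzip member that starts at `offset`.

    `fileobj` is a seekable binary file object. Offset is stored as a
    string in the metadata, so callers would pass int(container["Offset"]).
    """
    fileobj.seek(offset)
    # MAX_WBITS | 16 tells zlib to expect gzip framing (header and footer).
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    parts = []
    # `eof` becomes True once the member's footer has been consumed, so
    # the loop stops at the record boundary, not at the end of the file.
    while not decomp.eof:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            raise EOFError("file ended inside a gzip member")
        parts.append(decomp.decompress(chunk))
    return b"".join(parts)
```

For an uncompressed file ("Compressed": false) the same offset would be used with a plain read instead of a decompressor.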
Gzip-Metadata
See RFC 1952 for details on the gzip format.
"Gzip-Metadata": {
  "Deflate-Length": "170",
  "F-Extra": [
    { "Name": "LX", "Value": "\u0000\u0000\u0000\u0000" }
  ],
  "Footer-Length": "8",
  "Header-Length": "20",
  "Inflated-CRC": "1583730560",
  "Inflated-Length": "175"
},
| Metadata Field | Description |
|---|---|
| Header-Length | Octet length of the gzip header |
| Footer-Length | Octet length of the gzip footer (always 8) |
| Deflate-Length | Octet length of the total deflated gzip member, including header and footer data |
| Inflated-CRC | CRC-32 of the inflated (uncompressed) data |
| Inflated-Length | Octet length of the inflated (uncompressed) data |
| F-Extra | Extra fields of the gzip header and their values |
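Several of these values can be read directly from a gzip member, because RFC 1952 places the CRC-32 and the length of the uncompressed data (modulo 2^32) in the last 8 octets as little-endian integers. The sketch below derives them and checks them against inflated data. The function names are hypothetical, and rendering the values as decimal strings is an assumption based on the example above.

```python
import struct
import zlib


def gzip_footer_metadata(member: bytes) -> dict[str, str]:
    """Derive footer-based Gzip-Metadata fields from one complete member."""
    footer = member[-8:]
    # RFC 1952: CRC32 then ISIZE, both 32-bit little-endian.
    crc32, isize = struct.unpack("<II", footer)
    return {
        "Deflate-Length": str(len(member)),  # header + deflate data + footer
        "Footer-Length": str(len(footer)),
        "Inflated-CRC": str(crc32),
        "Inflated-Length": str(isize),
    }


def matches_inflated(meta: dict[str, str], inflated: bytes) -> bool:
    """Check inflated data against the recorded CRC and length."""
    return (int(meta["Inflated-CRC"]) == zlib.crc32(inflated)
            and int(meta["Inflated-Length"]) == len(inflated) % 2**32)
```

Header-Length and F-Extra require parsing the variable-length gzip header and are left out of this sketch.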
Envelope
Structure of nested metadata:
For ARC records:
"Envelope": {
  "ARC-Header-Metadata": {},
  "Payload-Metadata": {}
}
For WARC records:
"Envelope": {
  "WARC-Header-Metadata": {},
  "Payload-Metadata": {}
}
Envelope-Metadata
For WARC records:
"Envelope": {
  "Format": "WARC",
  "Payload-Metadata": {},
  "WARC-Header-Length": "298",
  "WARC-Header-Metadata": {}
}
For ARC records:
"Envelope": {
  "Format": "ARC",
  "Payload-Metadata": {},
  "ARC-Header-Length": "298",
  "ARC-Header-Metadata": {}
}
| Metadata Field | Description |
|---|---|
| Format | Indicates whether the record is an ARC or a WARC record |
| ARC-Header-Length / WARC-Header-Length | Number of octets in the record's header |
| ARC-Header-Metadata / WARC-Header-Metadata | The header metadata block |
| Payload-Metadata | The payload metadata block |
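Because the field names depend on the record format, a consumer typically reads Format first and derives the other keys from it. A minimal sketch, with the hypothetical function name `header_info`:

```python
from typing import Optional


def header_info(envelope: dict) -> tuple[Optional[str], Optional[int], dict]:
    """Return (format, header length in octets, header metadata block).

    All fields are optional, so missing values come back as None or {}.
    """
    fmt = envelope.get("Format")
    if fmt not in ("ARC", "WARC"):
        return fmt, None, {}
    # The key prefix follows the format: ARC-Header-* or WARC-Header-*.
    length = envelope.get(f"{fmt}-Header-Length")
    fields = envelope.get(f"{fmt}-Header-Metadata", {})
    # Lengths are stored as decimal strings in the examples.
    return fmt, (int(length) if length is not None else None), fields
```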
ARC-Header-Metadata
"ARC-Header-Metadata": {
  "Content-Length": "77",
  "Content-Type": "text/plain",
  "Date": "2011020300000000000",
  "IP-Address": "0.0.0.0",
  "Target-URI": "filedesc://ARCHIVEIT-2367-JIMDEMINT-EXTRACTED-00008.arc"
}
Metadata fields from ARC record's header
| Metadata Field | Description |
|---|---|
| Date | 14-digit timestamp that represents the instant of data capture |
| Content-Length | Number of octets in the record's block (Declared) |
| Content-Type | MIME type of the information contained in the record's block (Declared) |
| IP-Address | The numeric Internet address contacted to retrieve content |
| Target-URI | Original URI whose capture gave rise to the information content in the record |
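A 14-digit timestamp has the layout YYYYMMDDhhmmss. The following sketch converts one to a `datetime`. The function name `parse_arc_date` is hypothetical, and treating the value as UTC is an assumption. Values of any other length, such as the 19-digit Date in the example above, are rejected rather than guessed at.

```python
from datetime import datetime, timezone


def parse_arc_date(value: str) -> datetime:
    """Parse a 14-digit YYYYMMDDhhmmss timestamp (assumed to be UTC)."""
    if len(value) != 14 or not value.isdigit():
        raise ValueError(f"expected a 14-digit timestamp, got {value!r}")
    parsed = datetime.strptime(value, "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)
```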
WARC-Header-Metadata
"WARC-Header-Metadata": {
  "Content-Length": "729",
  "Content-Type": "application/warc-fields",
  "WARC-Date": "2011-03-31T10:23:34Z",
  "WARC-Filename": "ARCHIVEIT-2197-MONTHLY-UOYNUH-20110331102334-00374-crawling212.us.archive.org-6682.warc.gz",
  "WARC-Record-ID": "<urn:uuid:c2f7f2c6-d50e-4182-95be-9330e548675c>",
  "WARC-Type": "warcinfo"
}
Metadata fields from WARC record's header
The metadata record block includes a metadata field for each WARC named field that appears in the original WARC record's header.
| Metadata Field | Description |
|---|---|
| WARC-Type | Type of WARC record |
| WARC-Record-ID | Identifier assigned to the WARC record |
| WARC-Date | Timestamp that represents the instant of data capture (in the example above, the ISO 8601 form 2011-03-31T10:23:34Z) |
| Content-Length | Number of octets in the record's block (Declared) |
| Content-Type | MIME type of the information contained in the record's block (Declared) |
| WARC-Concurrent-To | WARC-Record-IDs of any records created as part of the same capture event as the record |
| WARC-Block-Digest | Parameter indicating the algorithm name and calculated value of a digest applied to the full block of the record |
| WARC-Payload-Digest | Parameter indicating the algorithm name and calculated value of a digest applied to the payload referred to or contained by the record |
| WARC-IP-Address | The numeric Internet address contacted to retrieve content |
| WARC-Refers-To | WARC-Record-ID of a single record for which the record holds additional content |
| WARC-Target-URI | Original URI whose capture gave rise to the information content in the record |
| WARC-Truncated | Indicates truncation of content block with the reason of truncation |
| WARC-Warcinfo-ID | WARC-Record-ID of associated 'warcinfo' record |
| WARC-Filename | Filename containing the 'warcinfo' record. Applicable only for 'warcinfo' records |
| WARC-Profile | URI signifying the kind of analysis and handling applied in the 'revisit' record. Applicable only for 'revisit' records |
| WARC-Identified-Payload-Type | Content type of the record's payload as determined by an independent check |
| WARC-Segment-Origin-ID | Identifies the starting record in a series of segmented records whose content blocks are reassembled to obtain a logically complete content block |
| WARC-Segment-Number | Reports the record's relative ordering in a sequence of segmented records |
| WARC-Segment-Total-Length | Reports the total length of all segment content blocks when concatenated together. Applicable only for 'continuation' records |
Payload-Metadata
"Payload-Metadata": {
  "Actual-Content-Length": "4055",
  "Actual-Content-Type": "application/http; msgtype=response",
  "Block-Digest": "sha1:43AHHMT33HMV3ZMLCTTVTN6VZMKNEG6Z",
  "HTTP-Response-Metadata": {},
  "Trailing-Slop-Length": "1"
}
| Metadata Field | Description |
|---|---|
| Actual-Content-Type | MIME type of the information contained in the record's block as determined by an independent check (Actual/Detected) |
| Actual-Content-Length | Number of octets in the record's block (Actual) |
| Block-Digest | Parameter indicating the algorithm name and calculated value of a digest applied to the full block of the record |
| Trailing-Slop-Length | Number of trailing slop bytes |
| Metadata | The record-type specific metadata block (HTTP-Response-Metadata, DNS-Response-Metadata etc.) |
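A digest value has the form algorithm:value. The example value above is 32 characters long, which is consistent with a Base32-encoded SHA-1 digest. The sketch below computes and checks a Block-Digest under that assumption. Both function names are hypothetical, and other algorithms or encodings are reported as unverifiable, not as mismatches.

```python
import base64
import hashlib
from typing import Optional


def sha1_base32_digest(block: bytes) -> str:
    """Format the SHA-1 of `block` as 'sha1:' plus its Base32 encoding."""
    raw = hashlib.sha1(block).digest()  # 20 octets -> 32 Base32 characters
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def check_block_digest(payload_metadata: dict, block: bytes) -> Optional[bool]:
    """Compare the recorded Block-Digest with one computed from `block`.

    Returns None when the field is absent or uses an algorithm this
    sketch does not handle.
    """
    recorded = payload_metadata.get("Block-Digest")
    if recorded is None:
        return None
    algorithm, _, value = recorded.partition(":")
    if algorithm.lower() != "sha1":
        return None
    expected = sha1_base32_digest(block).partition(":")[2]
    return value.upper() == expected
```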
HTTP-Response-Metadata
"HTTP-Response-Metadata": {
  "Entity-Content-Type": "text/html",
  "Entity-Digest": "sha1:JOFQGAO2VTRGCZOSWX6NYAOC36WP6MWL",
  "Entity-Length": "5643",
  "HTML-Metadata": {},
  "Headers": {
    "Accept-Ranges": "bytes",
    "Connection": "close",
    "Content-Length": "5643",
    "Content-Type": "text/html; charset=ISO-8859-1",
    "Date": "Tue, 02 Nov 2004 11:25:30 GMT",
    "Server": "Apache"
  },
  "Headers-Length": "180",
  "Response-Message": {
    "Reason": "OK",
    "Status": "200",
    "Version": "HTTP/1.1"
  }
}
| Metadata Field | Description |
|---|---|
| Response-Message | Indicates version of HTTP along with status and reason of HTTP response |
| Headers-Length | Indicates length of HTTP headers in octets |
| Headers | HTTP Header fields and associated values |
| Entity-Length | Indicates length of HTTP Entity in octets |
| Entity-Digest | Indicates the algorithm name and calculated value of a digest applied to the HTTP Entity |
| Entity-Content-Type | MIME type of the HTTP Entity |
| Entity-Trailing-Slop-Length | Number of trailing slop bytes in HTTP Entity |
| Entity-Transfer-Encoding | Indicates the Transfer-Encoding value |
| Metadata | The record's content-type specific metadata block (HTML-Metadata, PDF-Metadata etc.) |
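The message and header fields can be derived from the raw HTTP response as sketched below, using the key names from the example above. The function name `http_response_metadata` is hypothetical. The sketch assumes CRLF line endings and assumes that the header length counts the status line and the terminating blank line; under those assumptions the header lines of the example add up to its recorded value of 180 octets.

```python
def http_response_metadata(raw: bytes) -> dict:
    """Extract status line, headers and header length from an HTTP response."""
    head, sep, _entity = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line terminating the HTTP headers")

    lines = head.decode("iso-8859-1").split("\r\n")
    # Status line: version, status code, reason phrase (which may hold spaces).
    parts = lines[0].split(" ", 2) + ["", ""]
    version, status, reason = parts[:3]

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()  # a repeated header overwrites

    return {
        "Headers": headers,
        # Assumption: the length covers status line, headers and blank line.
        "Headers-Length": str(len(head) + len(sep)),
        "Response-Message": {
            "Reason": reason,
            "Status": status,
            "Version": version,
        },
    }
```

The entity-related fields (Entity-Length, Entity-Digest and so on) would be computed from the octets that follow the blank line.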
HTML-Metadata
"HTML-Metadata": {
  "Head": {
    "Metas": [
      {
        "content": "Jim DeMint - U.S. Senate South Carolina",
        "name": "description"
      },
      {
        "content": "demint, jim deMint, senate, south carolina, republican, sc, gop, conservative, congressman demint, congress, representative, u.s. senate, working with president bush, 4th district, jimdemint.com, election, 2004, campaign, politics, issues, social security, trade, transportation, taxes, small business, limited government, family values, health care, fritz hollings",
        "name": "keywords"
      }
    ],
    "Title": "Jim DeMint - U.S. Senate"
  },
  "Links": [
    {
      "path": "TABLE@/background",
      "url": "/demint_images/top_bg1.gif"
    },
    {
      "path": "A@/href",
      "text": "clicking here.",
      "url": "http://jimdemint.com/demint_contents/issues/jobs/"
    }
  ]
}
| Metadata Field | Description |
|---|---|
| Head | Attributes and values of HTML head elements: title, base, style, link, meta and script |
| Links | Indicates the absolute URI of an outgoing link from the capture, the URI of the link as it appears on the page, |
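The "url" values in the example above appear as written on the page, so some are relative. A consumer can resolve them against the URI of the capture, as in the sketch below. The function name `absolute_links` and the `base_uri` parameter are hypothetical; in practice the base would presumably be the record's target URI. The sketch ignores any base element recorded under Head.

```python
from urllib.parse import urljoin


def absolute_links(html_metadata: dict, base_uri: str) -> list[str]:
    """Resolve each link's "url" against `base_uri`, keeping page order."""
    return [
        urljoin(base_uri, link["url"])
        for link in html_metadata.get("Links", [])
        if "url" in link  # all fields are optional, so skip bare entries
    ]
```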