
Web Archive Transformation (WAT) files feature key metadata elements that represent every crawled resource and are derived from a collection’s (W)ARC files. 

WAT Dataset

The Web Archive Transformation (WAT) specification describes a structured way of storing metadata from (W)ARC files. The structure of the extracted metadata is optimized for data analysis.

  • WAT files are (W)ARC metadata records, composed of key metadata such as provenance/capture information, essential text and link data, and other information, extracted from (W)ARCs for every resource.
  • WAT files map one-to-one to (W)ARC files, so each (W)ARC file in a collection has a corresponding WAT file.
  • WAT files are roughly 5%-20% of the size of their corresponding (W)ARC files.
  • WAT files store the metadata extracted from (W)ARC files in an easily analyzed format.
  • WAT files encode this metadata as JavaScript Object Notation (JSON).

For more detailed information, please refer to the WAT file specification.

WAT Data Structure

Each WAT record has a brief header that identifies its URL via "WARC-Target-URI", its corresponding (W)ARC file via "WARC-Refers-To", and other mapping information.
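As a minimal sketch of how such a header might be read, the snippet below splits the header lines into name/value pairs. The function name `parse_wat_header` is hypothetical, and the header text is an abridged copy of the example record on this page; the sketch assumes each field sits on its own line in `Name: value` form.

```python
def parse_wat_header(header_text: str) -> dict:
    """Split a WAT record header into a dict of field name -> value."""
    lines = header_text.strip().splitlines()
    fields = {"version": lines[0].strip()}   # first line, e.g. "WARC/1.0"
    for line in lines[1:]:
        if not line.strip():
            break                            # a blank line ends the header
        # Split on the first colon only, since values (URLs, URNs) contain colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields


# Abridged header taken from the example record on this page.
header = """WARC/1.0
WARC-Type: metadata
WARC-Target-URI: http://wwwc.house.gov/smbiz/press/108th/2003/030604aNew.asp
WARC-Refers-To: <urn:arc:web_con035-20061130122647-00842-crawling021.us.archive.org.arc:798>
Content-Type: application/json
Content-Length: 4254
"""

fields = parse_wat_header(header)
print(fields["WARC-Target-URI"])   # URL of the crawled resource
print(fields["WARC-Refers-To"])    # pointer back to the source (W)ARC record
```

Splitting on the first colon only is deliberate: both the target URI and the "WARC-Refers-To" value contain further colons.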

The extracted metadata payload that follows the header is encoded in JSON. For example, the WAT record for the archived URL https://web.archive.org/web/20061130122508/http://wwwc.house.gov/smbiz/press/108th/2003/030604aNew.asp is shown below (it can also be downloaded and viewed in a text editor).

 WAT example record

WARC/1.0
WARC-Type: metadata
WARC-Target-URI: http://wwwc.house.gov/smbiz/press/108th/2003/030604aNew.asp
WARC-Date: 2006-11-14T05:34:48Z
WARC-Record-ID: <urn:uuid:f60b604c-5b4f-4a82-859a-9bddf97f834e>
WARC-Refers-To: <urn:arc:web_con035-20061130122647-00842-crawling021.us.archive.org.arc:798>
Content-Type: application/json
Content-Length: 4254

{
  "Envelope": {
    "Format": "ARC",
    "ARC-Header-Metadata": {
      "Date": "20061114053448",
      "Content-Length": "27455",
      "Content-Type": "text\/html",
      "Target-URI": "http:\/\/wwwc.house.gov\/smbiz\/press\/108th\/2003\/030604aNew.asp",
      "IP-Address": "143.228.146.10"
    },
    "ARC-Header-Length": "106",
    "Payload-Metadata": {
      "Trailing-Slop-Length": "1",
      "Actual-Content-Type": "application\/http; msgtype=response",
      "HTTP-Response-Metadata": {
        "Headers": {
          "Date": "Tue, 14 Nov 2006 05:34:48 GMT",
          "Content-Length": "27206",
          "Expires": "Tue, 14 Nov 2006 05:24:48 GMT",
          "Connection": "close",
          "Content-Type": "text\/html",
          "Server": "U.S. House of Representatives",
          "X-Powered-By": "ASP.NET",
          "Cache-Control": "private"
        },
        "Headers-Length": "249",
        "Entity-Length": "27206",
        "Entity-Trailing-Slop-Bytes": "0",
        "Response-Message": {
          "Status": "200",
          "Version": "HTTP\/1.1",
          "Reason": "OK"
        },
        "HTML-Metadata": {
          "Links": [
            {
              "text": "Oversight Plan",
              "path": "A@\/href",
              "url": "..\/..\/..\/oversightPlan\/oversight_plan.asp"
            },
            {
              "text": "Special Projects",
              "path": "A@\/href",
              "url": "..\/..\/..\/specialProjects\/special_projects_for_108th_congress.asp"
            },
            {
              "text": "Committee Rules",
              "path": "A@\/href",
              "url": "..\/..\/..\/committeeRules\/committee_rules.asp"
            },
            {
              "text": "Chairmans Biography",
              "path": "A@\/href",
              "url": "..\/..\/..\/chairmansBiography\/chairmansBiography.asp"
            },
            {
              "text": "Committee Members",
              "path": "A@\/href",
              "url": "..\/..\/..\/committeeMembers\/committeeMembers.asp"
            },
            {
              "text": "Budget Views and Estimates",
              "path": "A@\/href",
              "url": "..\/..\/..\/budgetViewsAndEstimates\/budgetViewsAndEstimates.asp"
            },
            {
              "path": "IMG@\/src",
              "url": "..\/..\/..\/images\/smallerHeader.jpg"
            },
            {
              "text": "Home",
              "path": "A@\/href",
              "url": "..\/..\/..\/default.asp"
            },
            {
              "text": "About The Committee",
              "path": "A@\/href",
              "url": "..\/..\/..\/aboutTheCommittee.asp"
            },
            {
              "text": "Press Releases",
              "path": "A@\/href",
              "url": "..\/..\/asp_display_all_press_releases.asp?year=2006"
            },
            {
              "text": "Resources",
              "path": "A@\/href",
              "url": "..\/..\/..\/resources\/asp_display_resources.asp"
            },
            {
              "text": "Calendar of Events",
              "path": "A@\/href",
              "url": "..\/..\/..\/calendarOfEvents\/asp_calendar_of_upcoming_events.asp"
            },
            {
              "text": "Hearings",
              "path": "A@\/href",
              "url": "..\/..\/..\/hearings\/databaseDrivenHearingsSystem\/displayHearings.asp?congress=109"
            },
            {
              "text": "Subcommittees",
              "path": "A@\/href",
              "url": "..\/..\/..\/subcommittees\/subcommittees_main.asp"
            },
            {
              "text": "Small Business Facts",
              "path": "A@\/href",
              "url": "..\/..\/..\/smallBusinessFacts\/smallBusinessFacts.asp"
            },
            {
              "text": "Newsletters",
              "path": "A@\/href",
              "url": "..\/..\/..\/newsletters\/asp_display_newsletters.asp?year=2006"
            },
            {
              "text": "Legislation",
              "path": "A@\/href",
              "url": "..\/..\/..\/legislation\/legislation_for_109th_congress.asp"
            },
            {
              "text": "Contact &amp; Location Details",
              "path": "A@\/href",
              "url": "..\/..\/..\/contactDetails\/contactDetails.asp"
            },
            {
              "text": "Search The Site",
              "path": "A@\/href",
              "url": "..\/..\/..\/search_the_website\/search_the_website.asp"
            },
            {
              "text": "Minority Site",
              "path": "A@\/href",
              "url": "http:\/\/www.house.gov\/smbiz\/democrats\/"
            },
            {
              "text": "Printer Friendly Version",
              "path": "A@\/href",
              "url": "..\\..\\..\\PFV\\030604a.asp"
            },
            {
              "path": "FORM@\/action",
              "method": "post",
              "url": "\/smbiz\/search_the_website\/search_results.asp"
            },
            {
              "text": "No hearings scheduled",
              "path": "A@\/href",
              "url": "\/smbiz\/calendarOfEvents\/asp_calendar_event_detail.asp?eventId=113"
            }
          ],
          "Head": {
            "Link": [
              {
                "path": "LINK@\/href",
                "rel": "stylesheet",
                "type": "text\/css",
                "url": "..\/..\/..\/css\/styles.css"
              }
            ],
            "Metas": [
              {
                "content": "text\/html; charset=iso-8859-1",
                "http-equiv": "Content-Type"
              },
              {
                "content": "US",
                "name": "DC.Coverage.Spatial"
              },
              {
                "content": "United States (C,V)",
                "name": "DC.Coverage.Spatial"
              },
              {
                "content": "United States. Congress. House of Representatives. Small Business Committee",
                "name": "DC.Creator"
              },
              {
                "content": "Small Business Committee, United States House of Representatives",
                "http-equiv": "Owner"
              },
              {
                "content": "United States Government work under 17 USC secs. 105, 403",
                "name": "DC.Rights"
              }
            ],
            "Title": "House of Representatives &gt; Small Business Committee &gt; Press Releases For 2003"
          }
        },
        "Entity-Digest": "sha1:XRAFRBPLOTVUUKE4ZB6CKOQR5V2JUAUK"
      },
      "Block-Digest": "sha1:ZQYVZSRFFTEYEZO4UCLGFCDPPQR62Z4U",
      "Actual-Content-Length": "27455"
    }
  },
  "Container": {
    "Compressed": true,
    "Gzip-Metadata": {
      "Footer-Length": "8",
      "Deflate-Length": "6381",
      "Header-Length": "10",
      "Inflated-CRC": "1808622068",
      "Inflated-Length": "27562"
    },
    "Offset": "798",
    "Filename": "NARA-109TH-CONGRESS-2006-20061114053449-00505-crawling021.us.archive.org.arc.gz"
  }
}
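
To illustrate how the JSON payload of a record like the one above could be used for link analysis, here is a short sketch that pulls the entries out of "HTML-Metadata" and resolves each relative URL against the record's "Target-URI". The function name `extract_links` and the trimmed `sample` payload are illustrative assumptions; the key names are the ones that appear in the example record, which was derived from an ARC file, so records derived from other formats may use different envelope keys.

```python
import json
from urllib.parse import urljoin


def extract_links(payload: str) -> list:
    """Return (path, absolute_url, text) tuples from a WAT JSON payload."""
    envelope = json.loads(payload)["Envelope"]
    # Assumption: an ARC-derived record, as in the example on this page.
    base = envelope["ARC-Header-Metadata"]["Target-URI"]
    html = (
        envelope["Payload-Metadata"]
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    links = []
    for link in html.get("Links", []):
        # "text" is absent for some entries (e.g. IMG@/src), so default to "".
        links.append((link["path"], urljoin(base, link["url"]), link.get("text", "")))
    return links


# Trimmed copy of the example payload; a raw string keeps the JSON "\/" escapes intact.
sample = r'''
{
  "Envelope": {
    "Format": "ARC",
    "ARC-Header-Metadata": {
      "Target-URI": "http:\/\/wwwc.house.gov\/smbiz\/press\/108th\/2003\/030604aNew.asp"
    },
    "Payload-Metadata": {
      "HTTP-Response-Metadata": {
        "HTML-Metadata": {
          "Links": [
            {"text": "Home", "path": "A@\/href", "url": "..\/..\/..\/default.asp"},
            {"path": "IMG@\/src", "url": "..\/..\/..\/images\/smallerHeader.jpg"},
            {"text": "Minority Site", "path": "A@\/href", "url": "http:\/\/www.house.gov\/smbiz\/democrats\/"}
          ]
        }
      }
    }
  }
}
'''

links = extract_links(sample)
for path, url, text in links:
    print(path, url, text)
```

The "path" value (for example "A@/href" or "IMG@/src") records which element and attribute the URL came from, so it can be used to separate hyperlinks from embedded resources.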

WAT Dataset Creation Process

WATs are generated via a Hadoop-based processing pipeline that uses Apache Pig, Java, and Python scripts. Downloaded WAT datasets map one-to-one to (W)ARCs and are similarly packed as individually compressed records concatenated into a single file.
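
The sketch below shows one way to iterate over records packed this way, assuming gzip compression (as indicated by the "Gzip-Metadata" block in the example record) and the header, blank line, payload layout of that record. Python's `gzip` module reads concatenated gzip members as one continuous stream. The helper names `iter_wat_records` and `make_record`, and the two tiny synthetic records, are hypothetical and exist only to make the example self-contained.

```python
import gzip
import io


def iter_wat_records(stream):
    """Yield (headers, payload_bytes) for each record in a decompressed WAT stream."""
    while True:
        line = stream.readline()
        if not line:
            return                       # end of stream
        if not line.strip():
            continue                     # skip blank separator lines between records
        headers = {"version": line.strip().decode("utf-8")}
        while True:
            line = stream.readline()
            if not line.strip():
                break                    # blank line (or end of stream) ends the header
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload size is taken from the record's own Content-Length field.
        length = int(headers.get("Content-Length", 0))
        yield headers, stream.read(length)


def make_record(uri: str, payload: bytes) -> bytes:
    """Build a small synthetic record for demonstration purposes only."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: metadata\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


# Compress each record as its own gzip member, then concatenate the members.
blob = b"".join(
    gzip.compress(make_record(uri, body))
    for uri, body in [
        ("http://example.com/a", b'{"Envelope": {}}'),
        ("http://example.com/b", b'{"Container": {}}'),
    ]
)

with gzip.open(io.BytesIO(blob), "rb") as stream:
    records = list(iter_wat_records(stream))

for headers, payload in records:
    print(headers["WARC-Target-URI"], len(payload))
```

For a downloaded file, the same loop could be run by passing the file path to `gzip.open` in place of the in-memory buffer.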

 
