Web Archive Transformation (WAT) Specification, Utilities, and Usage Overview

Overview

The Web Archive Transformation (WAT) specification describes a structured way of storing metadata extracted from web archives.  WAT utilities extract metadata from WARC files and organize it in a structure optimized for data analysis, which makes it practical to build analysis reports over very large data sets.

Audience

This document is intended for users who want to generate WAT data using the Internet Archive's WAT utilities and then analyze that data.

Terminology

The following terms and technologies are used throughout this document.

Hadoop

Hadoop is a software framework for processing data in large-scale distributed environments.  For more information about Hadoop, visit http://hadoop.apache.org/.

Hadoop File System (HDFS)

HDFS is a distributed file system that works within the Hadoop framework to store and provide access to files.

Pig

Pig is a high-level data flow language.  It provides concise expressions for complex data transformations.  Pig can submit multiple MapReduce jobs to a cluster and manages intermediate data and cleanup.  For more information about Pig, visit http://pig.apache.org/.

Web ARChive File Format (WARC)

The WARC format specifies a method for combining multiple digital resources and related information into an archival file.  WARC files are generated by Web crawling software such as Heritrix.

JavaScript Object Notation (JSON)

JSON is a lightweight data-interchange format.  For more information about JSON, visit http://www.json.org/.

CDX File

A CDX file is an index into a WARC file.  An index is a file that efficiently maps one piece of information, such as a URL, to another.  A CDX file maps the combination of a URL and a timestamp to the content of that URL captured at that time.  For example, a CDX file might map the URL http://myurl.org/index.html and the timestamp March 3, 2001 at 4:33pm GMT to the content of the Web page captured at that moment.  The Web page itself is stored in the WARC file.  The content of the Web page might look like this:

<html><body>My URL Home Page</body></html>
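
For illustration only, a CDX line for that capture might look something like the line below.  The values are invented, and the exact set and order of columns depend on the CDX format that the indexing tool emits; typically the line pairs the URL key and a 14-digit timestamp (here 20010303163300) with the record's size, offset, and file name in the WARC.

http://myurl.org/index.html 20010303163300 http://myurl.org/index.html text/html 200 EXAMPLESHA1DIGEST - - 512 2048 mywarcfile.warc.gz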

Why WAT?

Web Archive Transformation (WAT) is a specification for structuring metadata generated by Web crawls.  The WAT specification simplifies analysis of the large datasets produced by Web crawling.  WAT utilities extract metadata from WARC files and structure it in a form that can be analyzed efficiently in a distributed processing environment such as Hadoop.  WAT data is encoded as JavaScript Object Notation (JSON).  The WAT File Specification can be found here.

Why JSON?

JavaScript Object Notation (JSON) is a common data format that allows metadata to be structured as a nested hierarchy.  This nested structure can be read directly by analysis tools such as Pig, which eliminates the non-functional details of Hadoop programming such as intermediate data creation and resource cleanup.
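
As a rough illustration of that nesting, the sketch below shows a simplified WAT record built from the column paths used later in this document (Envelope.WARC-Header-Metadata.WARC-Target-URI and related fields); the values are invented and a real record carries many more fields.

{
  "Envelope": {
    "WARC-Header-Metadata": {
      "WARC-Target-URI": "http://myurl.org/index.html"
    },
    "Payload-Metadata": {
      "HTTP-Response-Metadata": {
        "HTML-Metadata": {
          "Head": { "Title": "My URL Home Page" },
          "Links": [ { "url": "http://myurl.org/about.html" } ]
        }
      }
    }
  }
}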

WAT Utilities

Utility programs are made available through the WAT library (a Java jar file).  They produce structured metadata that is optimized for data analysis.  To install the command line utilities:

  1. Download the latest version of the WAT library at archive-metadata-extractor.jar
  2. Ensure that Java version 5 or greater is installed on the machine designated to run the WAT utilities
  3. Ensure that the WAT library (for example, archive-metadata-extractor-20110430.jar) is in the PATH
  4. Ensure that the WARC files to be used for WAT generation are accessible

Example Usage

The WAT utilities generate JSON data from compressed (gzip) or uncompressed ARC or WARC files.  The JSON data is written to STDOUT in compressed (gzip) format.  The ARC or WARC file can be a local file, an HTTP-accessible file (http://), or a Hadoop File System (HDFS) accessible file (hdfs://).

The following command extracts the metadata from an uncompressed WARC file and writes it to a compressed WAT file.

java -jar archive-metadata-extractor.jar -wat mywarcfile.warc > mywatfile.wat.gz

The following command extracts metadata from a compressed WARC file and writes it to a compressed WAT file.

java -jar archive-metadata-extractor.jar -wat mywarcfile.warc.gz > mywatfile.wat.gz

The following command extracts data from a compressed WARC file and displays the data in WAT JSON format.

java -jar archive-metadata-extractor.jar -wat mywarcfile.warc.gz
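
Because the input can also be given as a URL, the same extraction works against a file stored in HDFS.  The host, port, and path below are placeholders, and the command assumes the Hadoop configuration is reachable from the machine running it.

java -jar archive-metadata-extractor.jar -wat hdfs://namenode.example.org:8020/user/me/mywarcfile.warc.gz > mywatfile.wat.gz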

The following command creates a CDX file from a compressed WARC file.  Note that the first column in the CDX file, URL-KEY, is not canonicalized.

java -jar archive-metadata-extractor.jar -cdx mywarcfile.warc.gz > mycdxfile.cdx

Using WAT and Pig for Data Analysis

A common WAT use case is to generate reports that show link relationships among pages captured by Web crawls.  The procedure below creates a simple two-column report that pairs each resolved link (to-URL) with the page on which it was found (from-URL).

  1. Set up the Hadoop environment
  2. Create a Pig script that will perform the data analysis and reporting
  3. Create WAT file(s) from the WARC file(s) to be analyzed
  4. Put the WAT file(s) into HDFS
  5. Run the Pig script
  6. View the report

The archive-meta-extractor.jar library contains a custom Pig function for loading the data to be analyzed.  The load function reads the JSON data from WAT files and produces a set of columns that can be used to create reports.  The columns that are made available can be referenced using a simple column naming syntax.  The syntax is based on the names of metadata elements in the WAT specification, which can be found here.

For example, to produce a report that displays the title and URL of all the pages crawled, use the following column names:

  • Envelope.WARC-Header-Metadata.WARC-Target-URI: the URL of the crawled page
  • Envelope.Payload-Metadata.HTTP-Response-Metadata.HTML-Metadata.Head.Title: the title of the crawled page
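
A minimal Pig sketch using just these two columns is shown here; it assumes the same loader jar that is registered in the full script later in this document, and a WAT file or directory path in place of /path/to/wats.

-- register the jar containing the custom loader (path and version are examples)
REGISTER /home/user/archive-meta-extractor-20110413.jar;

-- load only the page URL and title from the WAT file(s)
Pages = LOAD '/path/to/wats' USING
org.archive.hadoop.ArchiveJSONViewLoader('Envelope.WARC-Header-Metadata.WARC-Target-URI',
'Envelope.Payload-Metadata.HTTP-Response-Metadata.HTML-Metadata.Head.Title')
AS (page_url, title);

-- print the (URL, title) pairs
DUMP Pages;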

Some column names represent arrays of data.  For example, all the links that were discovered in a crawled page can be referenced as:  Envelope.HTML.@Links.url

The custom Pig loading function creates a cross-product of the values in the specified columns.  For example, if a crawled page contains three links, the loader emits three rows, each pairing the page's URL with one of the link URLs.

Running the Pig Script

The following Pig script loads the WAT file(s) and generates a report of the links embedded in each crawled page; in the output, the resolved link (to-URL) appears in the first column, followed by the URL of the page in which it was found (from-URL).  The WAT file name and the output directory are passed to the Pig script as parameters (SRC_SPEC and TGT_DIR).

InLinks.pig

%default SRC_SPEC '/home/user/test-wats/';
%default TGT_DIR '/home/user/pig-out-to-from-sorted';

SET pig.splitCombination 'false';

-- Load Internet Archive Pig utility jar:
REGISTER /home/user/archive-meta-extractor-20110413.jar;

-- alias short-hand for IA 'resolve()' UDF:
DEFINE resolve org.archive.hadoop.func.URLResolverFunc();

-- load data from SRC_SPEC:
Orig = LOAD '$SRC_SPEC' USING
org.archive.hadoop.ArchiveJSONViewLoader('Envelope.WARC-Header-Metadata.WARC-Target-URI',
'Envelope.Payload-Metadata.HTTP-Response-Metadata.HTML-Metadata.Head.Base','Envelope.Payload-Metadata.HTTP-Response-Metadata.HTML-Metadata.@Links.url')
AS (page_url,html_base,relative);

-- discard lines without links
LinksOnly = FILTER Orig BY relative != '';

-- fabricate new 1st column, which is the resolved to-URL, followed by the from-URL:
ResolvedLinks = FOREACH LinksOnly GENERATE FLATTEN(resolve(page_url,html_base,relative)) AS (resolved), page_url;

-- this will include all the fields, for debug:
--ResolvedLinks = FOREACH LinksOnly GENERATE FLATTEN(resolve(page_url,html_base,relative)) AS (resolved), page_url, html_base, relative;

SortedLinks = ORDER ResolvedLinks BY resolved, page_url;

STORE SortedLinks INTO '$TGT_DIR' USING PigStorage();

To run the Pig data analysis script, create the source data files (WAT files), use the Hadoop utilities to put them into HDFS, and then call the Pig script to process the WAT data files into a report.

# create the WAT files from compressed WARC files
java -jar /home/user/archive-meta-extractor-20110413.jar -wat /tmp/foo.warc.gz > /tmp/foo.wat.gz
java -jar /home/user/archive-meta-extractor-20110413.jar -wat /tmp/bar.warc.gz > /tmp/bar.wat.gz

# put the WAT files into HDFS
hadoop fs -mkdir /home/user/wats/
hadoop fs -put /tmp/foo.wat.gz /home/user/wats/
hadoop fs -put /tmp/bar.wat.gz /home/user/wats/

# run the Pig script to extract the data from the WAT files and process it into a report
pig -param SRC_SPEC=/home/user/wats/ \
                            -param TGT_DIR=/home/user/reports/ \
                            /home/user/pig-scripts/InLinks.pig

# view the output of the Pig script
hadoop fs -cat /home/user/reports/part-r-00000 | less

Example Output of Pig Script

http://about.dc.gov/index.asp   http://app.calendar.dc.gov/dayView.aspx?thisDate=12.3.2010&cdlClasses=22
http://about.dc.gov/index.asp   http://app.calendar.dc.gov/dayView.aspx?thisDate=12.3.2010&cdlClasses=24
http://about.dc.gov/index.asp   http://app.calendar.dc.gov/dayView.aspx?thisDate=12.3.2010&cdlClasses=26
http://about.dc.gov/index.asp   http://app.calendar.dc.gov/dayView.aspx?thisDate=8.21.2010
http://about.dc.gov/index.asp   http://app.calendar.dc.gov/dayView.aspx?thisDate=8.23.2010&cdlClasses=26

11 Comments

  1. I have run the WAT generator on a couple of small to mid-sized warcs and it all went well.

    When I tried it on a 1 GB warc file, it failed with the exception below. All my warcs (small, mid, and big) are generated from H3.0.

    Any pointer as to why the exception?

    2011-10-04 12:00:29,860 [Thread-13] WARN  org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
    org.archive.format.gzip.GZIPFormatException: Not aligned at gzip start
            at org.archive.format.gzip.GZIPMemberSeries.getNextMember(GZIPMemberSeries.java:229)
            at org.archive.resource.gzip.GZIPResourceContainer.getNext(GZIPResourceContainer.java:42)
            at org.archive.resource.TransformingResourceProducer.getNext(TransformingResourceProducer.java:13)
            at org.archive.extract.ExtractingResourceProducer.getNext(ExtractingResourceProducer.java:26)
            at org.archive.hadoop.ResourceRecordReader.nextKeyValue(ResourceRecordReader.java:110)
            at org.archive.hadoop.ArchiveMetadataLoader.getNext(ArchiveMetadataLoader.java:46)
            at org.archive.hadoop.ArchiveJSONViewLoader.getNext(ArchiveJSONViewLoader.java:55)
            at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187)
            at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
            at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
            at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
            at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
            at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
    2011-10-04 12:00:34,792 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_local_0001 has failed! Stop running all dependent jobs

    1. I must add, apparently the problem is not the size but the way the warc is written.

      I tested another warc of size 1.1 GB and it runs fine.

      Now the question is why those specific warcs are failing. (I tried two warcs for a job, and both throw the same exception as above.)

      1. Seems to be related to this issue: ACC-112.

        Perhaps you could add your own comments there too?

  2. Is the Java code behind the archive-meta-extractor.jar open source? If so, could you point me to where it is located?

    1. As far as I can tell, the code has recently moved to here.

      1. Thanks, Andrew. I've also found the ArchiveJSONViewLoader code that Pig uses to load the WATs. Naively, I was expecting it all to be in one project. Thanks!

        1. It seems that the archive-commons project has been split into a number of separate projects, so you might find that that version is no longer kept up to date.

    2. Follow-up, it appears the Java code behind the archive-meta-extractor.jar is hosted at: https://github.com/internetarchive/archive-commons and that the version there is newer.

      In particular, the jar on this page was giving me GZIPFormatExceptions (similar to what Pranay posted, but when running the Pig script, not when generating the WATs). Compiling the code on GitHub (mvn package) and using the generated archive-commons-jar-with-dependencies.jar fixed this error and generated the report as expected.

  3. To extract WARC files with current code instead of the legacy archive-metadata-extractor.jar, clone the ia-web-commons repo from GitHub and do an "mvn package". You can then create WAT files from WARC files with:

    java -cp target/ia-web-commons-jar-with-dependencies.jar org.archive.extract.ResourceExtractor -wat /my/warcs/foo.warc.gz > foo.wat

    I hope this helps someone. It took quite some time to hunt this down... (wink)

  4. Clone the https://github.com/internetarchive/ia-web-commons and https://github.com/internetarchive/ia-hadoop-tools repositories, and run mvn install for each of them.

    You can then use the ia-hadoop-tools-jar-with-dependencies.jar under ia-hadoop-tools/target/ to build WATs like so: 

    java -Xmx2048m -jar ia-hadoop-tools-jar-with-dependencies.jar extractor -wat foo.warc.gz > foo.wat.gz