The Web Archive Transformation (WAT) specification describes a structured way of storing metadata. WAT utilities are used to extract metadata from WARC files. The structure of the extracted metadata is optimized for data analysis. WAT data can be used to efficiently create data analysis reports based on very large data sets.
The audience of this document is users who want to generate WAT data using the Internet Archive's WAT utilities and analyze the data.
Hadoop is a software framework for processing data in large-scale distributed environments. For more information about Hadoop, visit http://hadoop.apache.org/.
Hadoop File System (HDFS)
HDFS is a distributed file system that works within the Hadoop framework to store and provide access to files.
Pig is a high-level data flow language. It provides a concise way to express complex data transformations. Pig can submit multiple MapReduce jobs to a cluster, and it manages intermediate data and cleanup. For more information about Pig, visit http://pig.apache.org/.
Web ARChive File Format (WARC)
The WARC format specifies a method for combining multiple digital resources and related information into an archival file. WARC files are generated by Web crawling software such as Heritrix.
JSON is a lightweight data-interchange format. For more information about JSON, visit http://www.json.org/.
A CDX file is an index into a WARC file. An index is a file that efficiently maps a specific piece of information, such as a URL, into another piece of information. A CDX file maps the combination of a URL and timestamp into the resulting content of the URL captured at that time. For example, a CDX file might map the URL, http://myurl.org/index.html, and the timestamp March 3, 2001 at 4:33pm GMT into the content of the Web page that was captured at that time. The Web page exists in the WARC file. The content of the Web page might look like this:
Utility programs are made available through the WAT library (Java jar file). They produce structured metadata that is optimized for data analysis. To install the command line utilities:
- Download the latest version of the WAT library at archive-metadata-extractor.jar
- Ensure that Java version 5 or greater is installed on the machine designated to run the WAT utilities
- Ensure that the WAT library (for example, archive-metadata-extractor-20110430.jar) is in the PATH
- Ensure that the WARC files to be used for WAT generation are accessible
The WAT utilities generate JSON data from compressed (gzipped) or uncompressed ARC or WARC files. The JSON data is written to STDOUT in compressed (GZIP) format. The ARC or WARC file can be a local file, an HTTP-accessible file (http://), or a Hadoop File System (HDFS) accessible file (hdfs://).
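Because the utilities write gzip-compressed output while inputs may or may not be compressed, a downstream tool often has to handle both cases. The following Python sketch shows one common way to do that: check for the two-byte gzip signature before decoding. It is only an illustration and does not reflect how the WAT utilities are implemented; the function name `maybe_decompress` and the sample record are hypothetical.

```python
import gzip
import json

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_decompress(raw: bytes) -> bytes:
    """Return the payload unchanged, or decompressed if it is gzip data."""
    if raw.startswith(GZIP_MAGIC):
        return gzip.decompress(raw)
    return raw


# Hypothetical record, used only to exercise the function.
sample = {"Envelope": {"Format": "WARC"}}
plain = json.dumps(sample).encode("utf-8")
compressed = gzip.compress(plain)

# Both forms decode to the same JSON object.
decoded_plain = json.loads(maybe_decompress(plain))
decoded_gzip = json.loads(maybe_decompress(compressed))
```

The same check works on the first bytes of a file read from disk, so a script does not need to rely on the file extension to decide whether to decompress.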
The following command extracts the metadata from an uncompressed WARC file and writes it to a compressed WAT file.
The following command extracts metadata from a compressed WARC file and writes it to a compressed WAT file.
The following command extracts data from a compressed WARC file and displays the data in WAT JSON format.
The following command creates a CDX file from a compressed WARC file. Note that the first column in the CDX file, URL-KEY, is not canonicalized.
Using WAT and Pig for Data Analysis
A common WAT use case is to generate reports that show relationships between links that are accessed and captured by Web crawls. The procedure for creating a simple report with two columns - a URL and the pages that the URL links to - is described below.
- Set up the Hadoop environment
- Create the Pig script that will perform the data analysis and reporting
- Create WAT file(s) from the WARC file(s) to be analyzed
- Put the WAT file(s) into HDFS
- Run Pig script
- View report
The archive-metadata-extractor.jar library contains a custom Pig function for loading the data to be analyzed. The load function reads the JSON data from WAT files and produces a set of columns that can be used to create reports. The columns that are made available can be referenced using a simple column naming syntax. The syntax is based on the names of metadata elements in the WAT specification, which can be found here.
For example, to produce a report that displays the title and URL of all the pages crawled, use the following column names:
The URL of the crawled page
The Title of the crawled page
Some column names represent arrays of data. For example, all the links that were discovered in a crawled page can be referenced as: Envelope.HTML.@Links.url
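To make the column naming syntax more concrete, the following Python sketch resolves a dotted column name against a nested JSON record. It assumes that a segment prefixed with "@" names an array and that the rest of the name is applied to every element of that array; this reading is an assumption made for illustration, not a statement of how the load function is implemented. The `resolve` function and the sample record are hypothetical.

```python
from typing import Any, List


def resolve(record: Any, path: str) -> List[Any]:
    """Collect every value reachable through a dotted column name.

    Assumption: a segment prefixed with "@" names a list, and the
    remaining segments are applied to each element of that list.
    """
    values = [record]
    for segment in path.split("."):
        is_array = segment.startswith("@")
        key = segment.lstrip("@")
        next_values = []
        for value in values:
            if not isinstance(value, dict) or key not in value:
                continue  # a missing element contributes nothing
            child = value[key]
            if is_array and isinstance(child, list):
                next_values.extend(child)  # fan out over array elements
            else:
                next_values.append(child)
        values = next_values
    return values


# Hypothetical record shaped to match the column name in the text.
record = {
    "Envelope": {
        "HTML": {
            "Links": [
                {"url": "http://example.org/a"},
                {"url": "http://example.org/b"},
            ]
        }
    }
}

links = resolve(record, "Envelope.HTML.@Links.url")
```

In this sketch a single column name can therefore yield several values for one record, one per array element.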
The custom Pig loading function creates a cross-product of the values in the specified columns, so a page with several links produces one row for each link.
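The effect of a cross-product can be illustrated with a small Python sketch that uses `itertools.product`. It only demonstrates the general idea of combining every value of one column with every value of the others; the function name `cross_product_rows` and the sample URLs are hypothetical, and the sketch makes no claim about how the Pig load function handles edge cases.

```python
from itertools import product
from typing import List, Sequence, Tuple


def cross_product_rows(*columns: Sequence[str]) -> List[Tuple[str, ...]]:
    """Combine every value of each column with every value of the others."""
    return list(product(*columns))


# Hypothetical values extracted for two columns of a single crawled page.
page_urls = ["http://example.org/index.html"]
link_urls = ["http://example.org/a", "http://example.org/b"]

rows = cross_product_rows(page_urls, link_urls)
for page, link in rows:
    print(page, link, sep="\t")
```

With one page URL and two links, the result is two rows, each pairing the page URL with one of its links. In a plain cross-product, an empty column yields no rows at all.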
Running the Pig Script
The following Pig script loads a WAT file and generates a report displaying the URL of each page crawled and the links embedded in that page. The WAT file name and the output directory are passed as parameters (SRC_SPEC and TGT_DIR) to the Pig script.
To run the Pig data analysis script, create the source data files (WAT files) and then use the Hadoop utilities to put the source data files into HDFS. Call the Pig script to process the WAT data files.
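The steps above can be sketched as a small Python helper that assembles the two command lines involved: `hadoop fs -put` to copy a file into HDFS, and `pig -param` to pass SRC_SPEC and TGT_DIR to the script. The file names, HDFS paths, and the script name `link-report.pig` are hypothetical placeholders, and the helper defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import List


def hdfs_put_command(local_path: str, hdfs_dir: str) -> List[str]:
    """Build the command that copies a local file into HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_dir]


def pig_command(script: str, src_spec: str, tgt_dir: str) -> List[str]:
    """Build the command that runs a Pig script with its two parameters."""
    return [
        "pig",
        "-param", f"SRC_SPEC={src_spec}",
        "-param", f"TGT_DIR={tgt_dir}",
        script,
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command, or execute it when dry_run is False."""
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)


# Hypothetical file names and paths, for illustration only.
commands = [
    hdfs_put_command("example.warc.wat.gz", "/user/example/wat/"),
    pig_command(
        "link-report.pig",
        "/user/example/wat/example.warc.wat.gz",
        "/user/example/report",
    ),
]

run_all(commands)  # dry run: prints the commands without executing them
```

Calling `run_all(commands, dry_run=False)` would execute the commands, which requires the `hadoop` and `pig` executables to be available on the machine.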
Example Output of Pig Script