Web Archive Analysis Workshop

Set up

 Installation
 

1) Install the latest version of Java and set JAVA_HOME

Linux:

export JAVA_HOME=/usr

OS X:

export JAVA_HOME=$(/usr/libexec/java_home)
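
A quick, optional check that JAVA_HOME points at a working Java installation:

# Verify the Java installation referenced by JAVA_HOME
$JAVA_HOME/bin/java -version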

 

2) Install Python

Linux:

sudo apt-get install python-pip

OS X:

curl https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py | python
sudo easy_install pip
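
A quick, optional check that pip installed correctly and is on the PATH:

pip --version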

 

3) Install Hadoop (a version 2 bundle that also includes Apache Pig)

curl -O http://archive.org/~vinay/archive-analysis/hadoop-2-local-mode.tar.gz
tar xfz hadoop-2-local-mode.tar.gz

# set PIG_HOME and other env variables to run in local mode
source hadoop-2-local-mode/setup-env.sh
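
setup-env.sh is expected to export PIG_HOME and the other variables needed for local-mode runs; a quick check that the bundled Pig responds:

# Confirm the environment was set and that Pig starts (prints its version and exits)
echo $PIG_HOME
$PIG_HOME/bin/pig -version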

 

4) Download the Web Archive Analysis Project

git clone --depth=1 https://github.com/vinaygoel/archive-analysis.git
export PROJECT_DIR=`pwd`/archive-analysis/
 
curl -o $PROJECT_DIR/lib/ia-hadoop-tools-jar-with-dependencies.jar http://archive.org/~vinay/archive-analysis/ia-libraries/hadoop-2.x/ia-hadoop-tools-jar-with-dependencies.jar
curl -o $PROJECT_DIR/lib/ia-porky-jar-with-dependencies.jar http://archive.org/~vinay/archive-analysis/ia-libraries/hadoop-2.x/ia-porky-jar-with-dependencies.jar
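
Before launching any jobs, a quick check that both library jars downloaded completely:

ls -lh $PROJECT_DIR/lib/ia-hadoop-tools-jar-with-dependencies.jar $PROJECT_DIR/lib/ia-porky-jar-with-dependencies.jar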

 

5) Download the workshop sample dataset (5.0 GB): sample WARC files and the corresponding Waimea-generated parsed text data

curl -O http://archive.org/~vinay/archive-analysis/sample-dataset.tar.gz
tar xfz sample-dataset.tar.gz
export DATA_DIR=`pwd`/sample-dataset/
 
# For an even smaller sample dataset (single WARC file with corresponding parsed text)
# download http://archive.org/~vinay/archive-analysis/sample-dataset-single.tar.gz (986 MB)
# curl -O http://archive.org/~vinay/archive-analysis/sample-dataset-single.tar.gz
# tar xfz sample-dataset-single.tar.gz
# export DATA_DIR=`pwd`/sample-dataset-single/
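
The commands in the rest of the workshop expect WARC files and parsed text under the paths below; listing them is an easy way to confirm that DATA_DIR is set correctly:

ls $DATA_DIR/crawl-data/warcs/
ls $DATA_DIR/derived-data/parsed/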

 

6) Launch jobs from the workshop project directory

cd $PROJECT_DIR

 

7) Optional: Cluster mode settings for Pig and Giraph

To run the workshop in Cluster mode, set the following variables:

export DATA_DIR=<location of the dataset in HDFS>
export HADOOP_HOME=<location of the installed hadoop software>
export HADOOP_BIN=$HADOOP_HOME/bin/hadoop
export PIG_HOME=<location of the installed Pig software>
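
For example (illustrative values only; substitute the paths for your own cluster):

# Hypothetical cluster-mode values -- adjust for your environment
export DATA_DIR=hdfs:///user/$USER/sample-dataset
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_BIN=$HADOOP_HOME/bin/hadoop
export PIG_HOME=/usr/local/pig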

To run Giraph in Local mode, please refer to the Giraph Quick Start Guide

 Generate CDX files from WARC data

For the key and field definitions, see the SURT and CDX legend references.
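
As a rough illustration of the SURT form (the references above are authoritative), the hostname is reversed so that captures from the same host and domain sort together, for example:

# Illustrative SURT canonicalization
# http://example.com/about  ->  com,example)/about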

Local mode

./hadoop-streaming/cdx/generate-cdx-job-local.sh $DATA_DIR/crawl-data/warcs/ $DATA_DIR/derived-data/cdx/

Cluster mode

$HADOOP_BIN jar lib/ia-hadoop-tools-jar-with-dependencies.jar CDXGenerator $DATA_DIR/derived-data/cdx/ $DATA_DIR/crawl-data/warcs/*.warc.gz
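
In either mode the generated CDX files land under $DATA_DIR/derived-data/cdx/. In Local mode a quick look at the first few lines confirms the output (in Cluster mode, use $HADOOP_BIN fs -ls and fs -cat instead):

ls $DATA_DIR/derived-data/cdx/
head -3 $DATA_DIR/derived-data/cdx/*.cdx
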
 Generate parsed text data from WARC data

Local mode

# Use the provided dataset under $DATA_DIR/derived-data/parsed/
# To generate it yourself, set up Hadoop in pseudo-distributed mode and run
$HADOOP_BIN jar lib/jbs.jar org.archive.jbs.Parse -conf etc/job-parse.xml $DATA_DIR/derived-data/parsed/ $DATA_DIR/crawl-data/warcs/*.warc.gz

Cluster mode

$HADOOP_BIN jar lib/jbs.jar org.archive.jbs.Parse -conf etc/job-parse.xml $DATA_DIR/derived-data/parsed/ $DATA_DIR/crawl-data/warcs/*.warc.gz
 Generate WAT files from WARC data

WAT file specification 

Local mode

./hadoop-streaming/wats/generate-wat-job-local.sh $DATA_DIR/crawl-data/warcs/ $DATA_DIR/derived-data/wats/

Cluster mode

$HADOOP_BIN jar lib/ia-hadoop-tools-jar-with-dependencies.jar WATGenerator $DATA_DIR/derived-data/wats/ $DATA_DIR/crawl-data/warcs/*.warc.gz

Link Extraction

 Extract links from WARC data (from any available WARC metadata records that contain outlinks)

Local mode

# Clone the warctools repo
git clone https://github.com/internetarchive/warctools.git
 
# Extract links
./hadoop-streaming/warc-metadata-outlinks/generate-outlinks-job-local.sh $DATA_DIR/crawl-data/warcs/ $DATA_DIR/derived-data/warc-metadata-outlinks/
 
# SURT canonicalize the extracted links
$PIG_HOME/bin/pig -x local -p I_LINKS_DIR=$DATA_DIR/derived-data/warc-metadata-outlinks/ -p O_CANON_LINKS_DIR=$DATA_DIR/derived-data/canon-links-from-warc-metadata.gz/ pig/graph/canonicalize-links-from-warcs.pig

Cluster mode

# Extract links
$HADOOP_BIN jar lib/ia-hadoop-tools-jar-with-dependencies.jar WARCMetadataRecordGenerator $DATA_DIR/derived-data/warc-metadata-outlinks/ $DATA_DIR/crawl-data/warcs/*.warc.gz
 
# SURT canonicalize the extracted links
$PIG_HOME/bin/pig -p I_LINKS_DIR=$DATA_DIR/derived-data/warc-metadata-outlinks/*.metadata -p O_CANON_LINKS_DIR=$DATA_DIR/derived-data/canon-links-from-warc-metadata.gz/ pig/graph/canonicalize-links-from-warcs.pig
 Extract links from parsed text data

Local mode

# Extract SURT canonicalized links
$PIG_HOME/bin/pig -x local -p I_PARSED_DATA_DIR=$DATA_DIR/derived-data/parsed/ -p O_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-parsed-captures.gz pig/parsed-captures/extract-surt-canon-links-with-anchor-from-parsed-captures.pig

Cluster mode

# Extract SURT canonicalized links
$PIG_HOME/bin/pig -p I_PARSED_DATA_DIR=$DATA_DIR/derived-data/parsed/ -p O_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-parsed-captures.gz pig/parsed-captures/extract-surt-canon-links-with-anchor-from-parsed-captures.pig
 Extract links from WAT data

Local mode

# Extract SURT canonicalized links
$PIG_HOME/bin/pig -x local -p I_WATS_DIR=$DATA_DIR/derived-data/wats/*.wat.gz -p O_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-wats.gz pig/wats/extract-surt-canon-links-with-anchor-from-warc-wats.pig

Cluster mode

# Extract SURT canonicalized links
$PIG_HOME/bin/pig -p I_WATS_DIR=$DATA_DIR/derived-data/wats/*.wat.gz -p O_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-wats.gz pig/wats/extract-surt-canon-links-with-anchor-from-warc-wats.pig

Text Extraction (Titles)

 Extract titles from parsed text data

Local mode

$PIG_HOME/bin/pig -x local -p I_PARSED_DATA_DIR=$DATA_DIR/derived-data/parsed/ -p O_URL_TITLE_DIR=$DATA_DIR/derived-data/url.title-from-parsed-captures.gz/ pig/parsed-captures/extract-surt-canon-urls-with-only-titles-from-parsed-captures.pig

Cluster mode

$PIG_HOME/bin/pig -p I_PARSED_DATA_DIR=$DATA_DIR/derived-data/parsed/ -p O_URL_TITLE_DIR=$DATA_DIR/derived-data/url.title-from-parsed-captures.gz/ pig/parsed-captures/extract-surt-canon-urls-with-only-titles-from-parsed-captures.pig
 Extract titles from WAT data

Local mode

$PIG_HOME/bin/pig -x local -p I_WATS_DIR=$DATA_DIR/derived-data/wats/*.wat.gz -p O_URL_TITLE_DIR=$DATA_DIR/derived-data/url.title-from-wats.gz/ pig/wats/extract-surt-canon-urls-with-only-titles-from-warc-wats.pig

Cluster mode

$PIG_HOME/bin/pig -p I_WATS_DIR=$DATA_DIR/derived-data/wats/*.wat.gz -p O_URL_TITLE_DIR=$DATA_DIR/derived-data/url.title-from-wats.gz/ pig/wats/extract-surt-canon-urls-with-only-titles-from-warc-wats.pig

Text Extraction (Meta text: title, meta keywords, description text)

 Extract meta text from parsed text data

Local mode

$PIG_HOME/bin/pig -x local -p I_PARSED_DATA_DIR=$DATA_DIR/derived-data/parsed/ -p O_METATEXT_DATA_DIR=$DATA_DIR/derived-data/metatext-from-parsed-captures.gz/ pig/parsed-captures/extract-surt-canon-urls-with-metatext-from-parsed-captures.pig

Cluster mode

$PIG_HOME/bin/pig -p I_PARSED_DATA_DIR=$DATA_DIR/derived-data/parsed/ -p O_METATEXT_DATA_DIR=$DATA_DIR/derived-data/metatext-from-parsed-captures.gz/ pig/parsed-captures/extract-surt-canon-urls-with-metatext-from-parsed-captures.pig
 Extract meta text from WAT data

Local mode

$PIG_HOME/bin/pig -x local -p I_WATS_DIR=$DATA_DIR/derived-data/wats/*.wat.gz -p O_METATEXT_DATA_DIR=$DATA_DIR/derived-data/metatext-from-wats.gz/ pig/wats/extract-surt-canon-urls-with-metatext-from-warc-wats.pig

Cluster mode

$PIG_HOME/bin/pig -p I_WATS_DIR=$DATA_DIR/derived-data/wats/*.wat.gz -p O_METATEXT_DATA_DIR=$DATA_DIR/derived-data/metatext-from-wats.gz/ pig/wats/extract-surt-canon-urls-with-metatext-from-warc-wats.pig

Entity Extraction (using Stanford NER)

 Extract entities from parsed text data

Local mode

$PIG_HOME/bin/pig -x local -p I_PARSED_DATA_DIR=$DATA_DIR/derived-data/parsed/ -p O_ENTITIES_DIR=$DATA_DIR/derived-data/entities-from-parsed-captures.gz/ pig/parsed-captures/extract-entities-from-parsed-captures.pig

Cluster mode

# Copy the classifier file $PROJECT_DIR/lib/english.all.3class.distsim.crf.ser.gz into HDFS and set LOCATION_OF_NER_CLASSIFIER_FILE_IN_HDFS to its HDFS directory.
# The classifier is shipped to the workers via the distributed cache

$PIG_HOME/bin/pig -Dmapred.cache.files="$LOCATION_OF_NER_CLASSIFIER_FILE_IN_HDFS/english.all.3class.distsim.crf.ser.gz#english.all.3class.distsim.crf.ser.gz" -Dmapred.create.symlink=yes -p I_NER_CLASSIFIER_FILE=english.all.3class.distsim.crf.ser.gz  -p I_PARSED_DATA_DIR=$DATA_DIR/derived-data/parsed/ -p O_ENTITIES_DIR=$DATA_DIR/derived-data/entities-from-parsed-captures.gz/ pig/parsed-captures/extract-entities-from-parsed-captures.pig
 

Graph Generation

 Generate Host Graph from links data

Local mode

$PIG_HOME/bin/pig -x local -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p O_HOST_GRAPH_DIR=$DATA_DIR/derived-data/graph/host.graph/ pig/graph/convert-link-data-to-host-graph-with-counts.pig

Cluster mode

$PIG_HOME/bin/pig -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p O_HOST_GRAPH_DIR=$DATA_DIR/derived-data/graph/host.graph/ pig/graph/convert-link-data-to-host-graph-with-counts.pig
 Generate Domain Graph from links data

Local mode

$PIG_HOME/bin/pig -x local -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p O_TOP_DOMAIN_GRAPH_DIR=$DATA_DIR/derived-data/graph/top-domain.graph/ pig/graph/convert-link-data-to-top-domain-graph-with-counts.pig

Cluster mode

$PIG_HOME/bin/pig -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p O_TOP_DOMAIN_GRAPH_DIR=$DATA_DIR/derived-data/graph/top-domain.graph/ pig/graph/convert-link-data-to-top-domain-graph-with-counts.pig
 Generate an Archival Web Graph (ID-Map and ID-Graph) from links data

Local mode

# Generate an ID-Map that assigns a unique integer ID to each URL (because the URLs are SURT-sorted, pages from the same host get numerically close IDs)
$PIG_HOME/bin/pig -x local -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p O_ID_MAP_DIR=$DATA_DIR/derived-data/graph/id.map/ pig/graph/generate-id-map.pig

# Alternate version: Generate an ID-Map where the ID assigned to each URL is a 64-bit fingerprint generated from the URL
# $PIG_HOME/bin/pig -x local -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p O_ID_MAP_DIR=$DATA_DIR/derived-data/graph/id.map/ pig/graph/generate-id-map-using-fingerprints.pig 

# Generate an ID-Graph that lists the destination IDs for each source ID along with the timestamp info
$PIG_HOME/bin/pig -x local -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p I_ID_MAP_DIR=$DATA_DIR/derived-data/graph/id.map/ -p O_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/id.graph/ pig/graph/convert-link-data-to-id-graph.pig

Cluster mode

# Generate an ID-Map that maps a unique integer ID to each URL
$PIG_HOME/bin/pig -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p O_ID_MAP_DIR=$DATA_DIR/derived-data/graph/id.map/ pig/graph/generate-id-map.pig

# Alternate version: Generate an ID-Map where the ID assigned to each URL is a 64-bit fingerprint generated from the URL
# $PIG_HOME/bin/pig -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p O_ID_MAP_DIR=$DATA_DIR/derived-data/graph/id.map/ pig/graph/generate-id-map-using-fingerprints.pig 

# Generate an ID-Graph that lists the destination IDs for each source ID along with the timestamp info
$PIG_HOME/bin/pig -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p I_ID_MAP_DIR=$DATA_DIR/derived-data/graph/id.map/ -p O_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/id.graph/ pig/graph/convert-link-data-to-id-graph.pig

 Generate a Link Graph (ID-Map and SortedInt-ID-Graph) from (SRC, DST) pairs with no timestamps

Local mode

# Generate an ID-Map that maps a unique integer ID to each resource, and SortedIntID-Graph that maps a set of sorted integer destination IDs to each source ID
$PIG_HOME/bin/pig -x local -p I_LINKS_DATA_NO_TS_DIR=$DATA_DIR/derived-data/graph/host.graph/ -p O_ID_MAP_DIR=$DATA_DIR/derived-data/graph/host-id.map/ -p O_ID_SORTEDINT_GRAPH_NO_TS_DIR=$DATA_DIR/derived-data/graph/host-id-sortedint.graph/ pig/graph/convert-link-data-no-ts-to-sorted-id-data-no-ts.pig

Cluster mode

# Generate an ID-Map that maps a unique integer ID to each resource, and SortedIntID-Graph that maps a set of sorted integer destination IDs to each source ID
$PIG_HOME/bin/pig -p I_LINKS_DATA_NO_TS_DIR=$DATA_DIR/derived-data/graph/host.graph/ -p O_ID_MAP_DIR=$DATA_DIR/derived-data/graph/host-id.map/ -p O_ID_SORTEDINT_GRAPH_NO_TS_DIR=$DATA_DIR/derived-data/graph/host-id-sortedint.graph/ pig/graph/convert-link-data-no-ts-to-sorted-id-data-no-ts.pig

CDX Analysis

 Example: Generate breakdown of MIME types by year from CDX data

Local mode

$PIG_HOME/bin/pig -x local -p I_CDX_DIR=$DATA_DIR/derived-data/cdx/*.cdx -p O_MIME_BREAKDOWN_DIR=$DATA_DIR/derived-data/mime-breakdown/ pig/cdx/mimeBreakdown.pig

Cluster mode

$PIG_HOME/bin/pig -p I_CDX_DIR=$DATA_DIR/derived-data/cdx/*.cdx -p O_MIME_BREAKDOWN_DIR=$DATA_DIR/derived-data/mime-breakdown/ pig/cdx/mimeBreakdown.pig
 CDX Warehouse - Apache Hive

Apache Hive

Install Apache Hive, then refer to the CDX Warehouse documentation.
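
A minimal, illustrative sketch of querying CDX data with Hive, assuming the common space-delimited 11-field CDX format (N b a m s k r M S V g) and that DATA_DIR points at the dataset in HDFS; the table and column names below are hypothetical, and the actual CDX Warehouse setup may differ:

# Hedged sketch only -- the column names are made up for this example; adjust to your CDX field layout
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS cdx (
  urlkey STRING, capture_ts STRING, original_url STRING, mime STRING,
  status STRING, digest STRING, redirect STRING, meta STRING,
  length STRING, file_offset STRING, filename STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '$DATA_DIR/derived-data/cdx/';
SELECT mime, COUNT(*) AS captures FROM cdx GROUP BY mime;"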

Text Analysis

 Example: Extract top 50 terms (by TF-IDF) per URL from meta text data

Local mode

$PIG_HOME/bin/pig -x local -p I_METATEXT_DIR=$DATA_DIR/derived-data/metatext-from-* -p O_URL_METATEXT_TOPTERMS_DIR=$DATA_DIR/derived-data/url.topmetatext.gz pig/text/metatext-topN-tfidf.pig

Cluster mode

# Copy $PROJECT_DIR/pig/text/stop-words.txt into HDFS and set LOCATION_OF_STOP_WORDS_FILE_IN_HDFS to its HDFS directory.
# The stop-words list is shipped to the workers via the distributed cache
$PIG_HOME/bin/pig -Dmapred.cache.files="$LOCATION_OF_STOP_WORDS_FILE_IN_HDFS/stop-words.txt#stop-words.txt" -Dmapred.create.symlink=yes -p I_STOP_WORDS_FILE=stop-words.txt -p I_METATEXT_DIR=$DATA_DIR/derived-data/metatext-from-* -p O_URL_METATEXT_TOPTERMS_DIR=$DATA_DIR/derived-data/url.topmetatext.gz/ pig/text/metatext-topN-tfidf.pig
 Example: Extract top 50 terms (by TF-IDF) per URL from link text data (anchor text)

Local mode

$PIG_HOME/bin/pig -x local -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p O_URL_ANCHORTEXT_TOPTERMS_DIR=$DATA_DIR/derived-data/url.topanchortext.gz pig/text/anchortext-topN-tfidf.pig

Cluster mode

# Copy $PROJECT_DIR/pig/text/stop-words.txt into HDFS and set LOCATION_OF_STOP_WORDS_FILE_IN_HDFS to its HDFS directory.
# The stop-words list is shipped to the workers via the distributed cache
$PIG_HOME/bin/pig -Dmapred.cache.files="$LOCATION_OF_STOP_WORDS_FILE_IN_HDFS/stop-words.txt#stop-words.txt" -Dmapred.create.symlink=yes -p I_STOP_WORDS_FILE=stop-words.txt -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p O_URL_ANCHORTEXT_TOPTERMS_DIR=$DATA_DIR/derived-data/url.topanchortext.gz pig/text/anchortext-topN-tfidf.pig
 Example: Prepare parsed text for analysis with Apache Mahout

Apache Mahout

Local mode

$PIG_HOME/bin/pig -x local -p I_PARSED_DATA_DIR=$DATA_DIR/derived-data/parsed/ -p O_URL_CONTENT_SEQ_DIR=$DATA_DIR/derived-data/parsed-captures-content-for-mahout.seq/ pig/parsed-captures/prepare-content-for-mahout-with-filter.pig

Cluster mode

# Copy $PROJECT_DIR/pig/text/stop-words.txt into HDFS and set LOCATION_OF_STOP_WORDS_FILE_IN_HDFS to its HDFS directory.
# The stop-words list is shipped to the workers via the distributed cache
$PIG_HOME/bin/pig -Dmapred.cache.files="$LOCATION_OF_STOP_WORDS_FILE_IN_HDFS/stop-words.txt#stop-words.txt" -Dmapred.create.symlink=yes -p I_STOP_WORDS_FILE=stop-words.txt -p I_PARSED_DATA_DIR=$DATA_DIR/derived-data/parsed/ -p O_URL_CONTENT_SEQ_DIR=$DATA_DIR/derived-data/parsed-captures-content-for-mahout.seq/ pig/parsed-captures/prepare-content-for-mahout-with-filter.pig

Now you can run the Mahout seq2sparse command on the produced output, followed by any of the available clustering and classification algorithms.
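
For example, a hedged sketch of those follow-on Mahout steps (the flags below are from Mahout's 0.x command line and may differ in your version; check mahout seq2sparse --help):

# Vectorize the SequenceFile produced above (TF-IDF weighting)
mahout seq2sparse -i $DATA_DIR/derived-data/parsed-captures-content-for-mahout.seq/ -o $DATA_DIR/derived-data/mahout-vectors/ -ow

# Cluster the TF-IDF vectors with k-means (k and iteration count are illustrative)
mahout kmeans -i $DATA_DIR/derived-data/mahout-vectors/tfidf-vectors/ -c $DATA_DIR/derived-data/mahout-kmeans-seeds/ -o $DATA_DIR/derived-data/mahout-kmeans/ -k 20 -x 10 -ow -cl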

Link Analysis with Apache Pig

 Example: Compute the number of incoming and outgoing links per URL

Local mode

$PIG_HOME/bin/pig -x local -p I_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/id.graph/ -p O_DEGREE_ANALYSIS_DIR=$DATA_DIR/derived-data/graph/degree-analysis/ pig/graph/degree-analysis.pig

Cluster mode

$PIG_HOME/bin/pig -p I_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/id.graph/ -p O_DEGREE_ANALYSIS_DIR=$DATA_DIR/derived-data/graph/degree-analysis/ pig/graph/degree-analysis.pig
 Example: Compute the degree stats per Host/Domain from Host/Domain Graph

Local mode

$PIG_HOME/bin/pig -x local -p I_HOST_GRAPH_DIR=$DATA_DIR/derived-data/graph/host.graph/ -p O_HOST_DEGREE_STATS_DIR=$DATA_DIR/derived-data/graph/host-degree-stats/ pig/graph/host-degree-stats.pig

Cluster mode

$PIG_HOME/bin/pig -p I_HOST_GRAPH_DIR=$DATA_DIR/derived-data/graph/host.graph/ -p O_HOST_DEGREE_STATS_DIR=$DATA_DIR/derived-data/graph/host-degree-stats/ pig/graph/host-degree-stats.pig
 Example: PageRank

Not recommended. The MapReduce implementation (Pig) is several orders of magnitude slower than the BSP implementation (Giraph).

If needed, refer to the Pig scripts for preparing the PageRank graph and for the PageRank implementation itself.

 Example: Find common links using Link Graph (ID-Map, SortedInt-ID-Graph)

Problem: Given a set of input resources, find the set of resources that all of the input resources link to.

The following commands use FindAndIntersectionsUsingPForDeltaDocIdSetUDF(). This UDF uses the Kamikaze library to build docIdSets from the sorted integer destination-ID sets and then performs efficient intersections of these sets to find the destinations the sources have in common. It works only for integer IDs smaller than Java's Integer.MAX_VALUE (2^31 - 1).

 

Local mode

# Set I_SRC_RESOURCES_DIR to the data containing the list of resources for which we need to find common/shared links
$PIG_HOME/bin/pig -x local -p I_SRC_RESOURCES_DIR=$I_SRC_RESOURCES_DIR -p I_ID_MAP_DIR=$DATA_DIR/derived-data/graph/host-id.map/ -p I_ID_SORTEDINT_GRAPH_NO_TS_DIR=$DATA_DIR/derived-data/graph/host-id-sortedint.graph/ -p O_COMMON_LINKS_RESOURCES_DIR=$DATA_DIR/derived-data/graph/common-links/ pig/graph/find-links-in-common.pig

Cluster mode

# Set I_SRC_RESOURCES_DIR to the data containing the list of resources for which we need to find common/shared links
$PIG_HOME/bin/pig -p I_SRC_RESOURCES_DIR=$I_SRC_RESOURCES_DIR -p I_ID_MAP_DIR=$DATA_DIR/derived-data/graph/host-id.map/ -p I_ID_SORTEDINT_GRAPH_NO_TS_DIR=$DATA_DIR/derived-data/graph/host-id-sortedint.graph/ -p O_COMMON_LINKS_RESOURCES_DIR=$DATA_DIR/derived-data/graph/common-links/ pig/graph/find-links-in-common.pig

Link Analysis with Apache Giraph (Cluster mode)

 Example: Compute the number of incoming and outgoing links per URL

Cluster mode - Hadoop 0.20.2

# Prepare links graph for Giraph from ID-Graph data using Pig
$PIG_HOME/bin/pig -p I_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/id.graph/ -p O_PR_TAB_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/pagerank-id.graph/ pig/graph/prepare-tab-delimited-pagerank-graph.pig
 
# Giraph settings
export IA_GIRAPH_JAR=$PROJECT_DIR/lib/giraph-ia-1.0.0-for-hadoop-0.20.2-jar-with-dependencies.jar
export LVIF=org.archive.giraph.VertexWithDoubleValueLongDoubleFloatTextInputFormat
export OF=org.apache.giraph.io.formats.IdWithValueTextOutputFormat
export NUMWORKERS=10

# Indegree
$HADOOP_BIN jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner org.archive.giraph.InDegreeCountVertex -vif $LVIF -vip $DATA_DIR/derived-data/graph/pagerank-id.graph/part* -of $OF -op $DATA_DIR/derived-data/graph/id.indegree/ -w $NUMWORKERS

# Outdegree
$HADOOP_BIN jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner org.archive.giraph.OutDegreeCountVertex -vif $LVIF -vip $DATA_DIR/derived-data/graph/pagerank-id.graph/part* -of $OF -op $DATA_DIR/derived-data/graph/id.outdegree/ -w $NUMWORKERS

Cluster mode - Hadoop 2.x

# Prepare links graph for Giraph from ID-Graph data using Pig
$PIG_HOME/bin/pig -p I_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/id.graph/ -p O_PR_TAB_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/pagerank-id.graph/ pig/graph/prepare-tab-delimited-pagerank-graph.pig
 
# Giraph settings
export IA_GIRAPH_JAR=$PROJECT_DIR/lib/giraph-ia-1.1.0-SNAPSHOT-for-hadoop-2.0.5-alpha-jar-with-dependencies.jar
export LVIF=org.archive.giraph.VertexWithDoubleValueLongDoubleFloatTextInputFormat
export OF=org.apache.giraph.io.formats.IdWithValueTextOutputFormat
export NUMWORKERS=10
 
# ZooKeeper and JobTracker settings 
export ZOOKEEPER_OPTS_STRING="-ca giraph.zkList=$ZOOKEEPER_HOST:$ZOOKEEPER_PORT"
export JOBTRACKER_OPTS_STRING="-Dmapred.job.tracker=$JOBTRACKER_HOST:$JOBTRACKER_PORT"

# Indegree
$HADOOP_BIN jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner $JOBTRACKER_OPTS_STRING org.archive.giraph.InDegreeCountComputation -vif $LVIF -vip $DATA_DIR/derived-data/graph/pagerank-id.graph/part* -of $OF -op $DATA_DIR/derived-data/graph/id.indegree/ -w $NUMWORKERS $ZOOKEEPER_OPTS_STRING

# Outdegree
$HADOOP_BIN jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner $JOBTRACKER_OPTS_STRING org.archive.giraph.OutDegreeCountComputation -vif $LVIF -vip $DATA_DIR/derived-data/graph/pagerank-id.graph/part* -of $OF -op $DATA_DIR/derived-data/graph/id.outdegree/ -w $NUMWORKERS $ZOOKEEPER_OPTS_STRING
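
Because IdWithValueTextOutputFormat writes plain text (one vertex ID and its value per line), the results can be spot-checked directly from HDFS:

$HADOOP_BIN fs -cat $DATA_DIR/derived-data/graph/id.indegree/part* | head
$HADOOP_BIN fs -cat $DATA_DIR/derived-data/graph/id.outdegree/part* | head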

 Example: PageRank

Cluster mode - Hadoop 0.20.2

# Prepare links graph for Giraph from ID-Graph data using Pig
$PIG_HOME/bin/pig -p I_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/id.graph/ -p O_PR_TAB_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/pagerank-id.graph/ pig/graph/prepare-tab-delimited-pagerank-graph.pig
 
# Giraph settings
export IA_GIRAPH_JAR=$PROJECT_DIR/lib/giraph-ia-1.0.0-for-hadoop-0.20.2-jar-with-dependencies.jar
export LVIF=org.archive.giraph.VertexWithDoubleValueLongDoubleFloatTextInputFormat
export OF=org.apache.giraph.io.formats.IdWithValueTextOutputFormat
export NUMWORKERS=10
 
# PageRank
$HADOOP_BIN jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner org.archive.giraph.PageRankVertex -ca PageRankVertex.jump_probability=0.15f -ca PageRankVertex.max_supersteps=15 -vif $LVIF -vip $DATA_DIR/derived-data/graph/pagerank-id.graph/part* -of $OF -op $DATA_DIR/derived-data/graph/id.prscore/ -w $NUMWORKERS -mc org.archive.giraph.PageRankVertex\$PageRankVertexMasterCompute

# Assign ranks by PageRank score using Pig
$PIG_HOME/bin/pig -p I_ID_PRSCORE_DIR=$DATA_DIR/derived-data/graph/id.prscore/ -p O_ID_PRRANK_DIR=$DATA_DIR/derived-data/graph/id.prrank/ pig/graph/assign-pagerank-rank.pig

Cluster mode - Hadoop 2.x

# Prepare links graph for Giraph from ID-Graph data using Pig
$PIG_HOME/bin/pig -p I_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/id.graph/ -p O_PR_TAB_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/pagerank-id.graph/ pig/graph/prepare-tab-delimited-pagerank-graph.pig
 
# Giraph settings
export IA_GIRAPH_JAR=$PROJECT_DIR/lib/giraph-ia-1.1.0-SNAPSHOT-for-hadoop-2.0.5-alpha-jar-with-dependencies.jar
export LVIF=org.archive.giraph.VertexWithDoubleValueLongDoubleFloatTextInputFormat
export OF=org.apache.giraph.io.formats.IdWithValueTextOutputFormat
export NUMWORKERS=10
 
# ZooKeeper and JobTracker settings 
export ZOOKEEPER_OPTS_STRING="-ca giraph.zkList=$ZOOKEEPER_HOST:$ZOOKEEPER_PORT"
export JOBTRACKER_OPTS_STRING="-Dmapred.job.tracker=$JOBTRACKER_HOST:$JOBTRACKER_PORT"
 
# PageRank
$HADOOP_BIN jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner $JOBTRACKER_OPTS_STRING org.archive.giraph.PageRankComputation -ca PageRankComputation.jumpProbability=0.15f -ca PageRankComputation.maxSupersteps=15 -vif $LVIF -vip $DATA_DIR/derived-data/graph/pagerank-id.graph/part* -of $OF -op $DATA_DIR/derived-data/graph/id.prscore/ -w $NUMWORKERS -mc org.archive.giraph.PageRankComputation\$PageRankComputationMasterCompute $ZOOKEEPER_OPTS_STRING

# Assign ranks by PageRank score using Pig
$PIG_HOME/bin/pig -p I_ID_PRSCORE_DIR=$DATA_DIR/derived-data/graph/id.prscore/ -p O_ID_PRRANK_DIR=$DATA_DIR/derived-data/graph/id.prrank/ pig/graph/assign-pagerank-rank.pig

 Example: Weighted PageRank (Host/Domain Graph)

Cluster mode  - Hadoop 0.20.2

# Prepare links graph for Giraph from Host/Domain graph using Pig
$PIG_HOME/bin/pig -p I_HOST_GRAPH_DIR=$DATA_DIR/derived-data/graph/host.graph/ -p O_PR_TAB_HOST_GRAPH_DIR=$DATA_DIR/derived-data/graph/pagerank-host.graph/ pig/graph/prepare-tab-delimited-weighted-pagerank-graph.pig
 
# Giraph settings
export IA_GIRAPH_JAR=$PROJECT_DIR/lib/giraph-ia-1.0.0-for-hadoop-0.20.2-jar-with-dependencies.jar
export TVIF=org.archive.giraph.VertexWithDoubleValueTextDoubleFloatTextInputFormat
export OF=org.apache.giraph.io.formats.IdWithValueTextOutputFormat
export NUMWORKERS=10
 
# Weighted PageRank
$HADOOP_BIN jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner org.archive.giraph.WeightedPageRankVertex -ca WeightedPageRankVertex.jump_probability=0.15f -ca WeightedPageRankVertex.max_supersteps=15 -vif $TVIF -vip $DATA_DIR/derived-data/graph/pagerank-host.graph/part* -of $OF -op $DATA_DIR/derived-data/graph/host.prscore/ -w $NUMWORKERS -mc org.archive.giraph.WeightedPageRankVertex\$WeightedPageRankVertexMasterCompute

# Assign ranks by PageRank score using Pig
$PIG_HOME/bin/pig -p I_ID_PRSCORE_DIR=$DATA_DIR/derived-data/graph/host.prscore/ -p O_ID_PRRANK_DIR=$DATA_DIR/derived-data/graph/host.prrank/ pig/graph/assign-pagerank-rank.pig

Cluster mode  - Hadoop 2.x

# Prepare links graph for Giraph from Host/Domain graph using Pig
$PIG_HOME/bin/pig -p I_HOST_GRAPH_DIR=$DATA_DIR/derived-data/graph/host.graph/ -p O_PR_TAB_HOST_GRAPH_DIR=$DATA_DIR/derived-data/graph/pagerank-host.graph/ pig/graph/prepare-tab-delimited-weighted-pagerank-graph.pig
 
# Giraph settings
export IA_GIRAPH_JAR=$PROJECT_DIR/lib/giraph-ia-1.1.0-SNAPSHOT-for-hadoop-2.0.5-alpha-jar-with-dependencies.jar
export TVIF=org.archive.giraph.VertexWithDoubleValueTextDoubleFloatTextInputFormat
export OF=org.apache.giraph.io.formats.IdWithValueTextOutputFormat
export NUMWORKERS=10
 
# ZooKeeper and JobTracker settings 
export ZOOKEEPER_OPTS_STRING="-ca giraph.zkList=$ZOOKEEPER_HOST:$ZOOKEEPER_PORT"
export JOBTRACKER_OPTS_STRING="-Dmapred.job.tracker=$JOBTRACKER_HOST:$JOBTRACKER_PORT"
 
# Weighted PageRank
$HADOOP_BIN jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner $JOBTRACKER_OPTS_STRING org.archive.giraph.WeightedPageRankComputation -ca WeightedPageRankComputation.jumpProbability=0.15f -ca WeightedPageRankComputation.maxSupersteps=15 -vif $TVIF -vip $DATA_DIR/derived-data/graph/pagerank-host.graph/part* -of $OF -op $DATA_DIR/derived-data/graph/host.prscore/ -w $NUMWORKERS -mc org.archive.giraph.WeightedPageRankComputation\$WeightedPageRankComputationMasterCompute $ZOOKEEPER_OPTS_STRING

# Assign ranks by PageRank score using Pig
$PIG_HOME/bin/pig -p I_ID_PRSCORE_DIR=$DATA_DIR/derived-data/graph/host.prscore/ -p O_ID_PRRANK_DIR=$DATA_DIR/derived-data/graph/host.prrank/ pig/graph/assign-pagerank-rank.pig

Crawl Log Analysis

 Example: Extract path from crawler for each URL using Apache Giraph (Cluster mode)

Cluster mode - Hadoop 0.20.2

# Generate crawler hop info:
# If you have the Heritrix crawl logs for the collection (stored under $DATA_DIR/crawl-data/crawl-logs/), run
$PIG_HOME/bin/pig -p I_CRAWLLOG_DATA_DIR=$DATA_DIR/crawl-data/crawl-logs/ -p O_CRAWLLOG_ID_MAP_DIR=$DATA_DIR/derived-data/crawler/crawllogid.map -p O_CRAWLLOG_ID_ONEHOP_DIR=$DATA_DIR/derived-data/crawler/crawllogid.onehop -p O_CRAWLLOG_LINKS_DATA_DIR=$DATA_DIR/derived-data/crawler/links-from-crawllog.gz pig/crawl-logs/generate-crawler-hops-from-crawllog.pig

# Otherwise, if you have WARCs whose metadata records contain crawl hop info,
# generate hopinfo files
$HADOOP_BIN jar lib/ia-hadoop-tools-jar-with-dependencies.jar WARCMetadataRecordGenerator -hopinfo $DATA_DIR/derived-data/warc-metadata-hopinfo/ $DATA_DIR/crawl-data/warcs/*.warc.gz

# And then run
$PIG_HOME/bin/pig -p I_WARCMETADATAHOPINFO_DATA_DIR=$DATA_DIR/derived-data/warc-metadata-hopinfo/*.metadata -p O_CRAWLLOG_ID_MAP_DIR=$DATA_DIR/derived-data/crawler/crawllogid.map -p O_CRAWLLOG_ID_ONEHOP_DIR=$DATA_DIR/derived-data/crawler/crawllogid.onehop -p O_CRAWLLOG_LINKS_DATA_DIR=$DATA_DIR/derived-data/crawler/links-from-crawllog.gz pig/crawl-logs/generate-crawler-hops-from-warc-metadata-records-hopinfo.pig

# Finally, run the Giraph job that generates the path from the crawler for each URL
export IA_GIRAPH_JAR=$PROJECT_DIR/lib/giraph-ia-1.0.0-for-hadoop-0.20.2-jar-with-dependencies.jar
export CVIF=org.archive.giraph.VertexWithTextValueLongTextTextTextInputFormat
export OF=org.apache.giraph.io.formats.IdWithValueTextOutputFormat
export NUMWORKERS=10
 
$HADOOP_BIN jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner org.archive.giraph.LabelPathVertex -vif $CVIF -vip $DATA_DIR/derived-data/crawler/crawllogid.onehop/part* -of $OF -op $DATA_DIR/derived-data/crawler/crawllogid.hoppathfromcrawler/ -w $NUMWORKERS

Cluster mode - Hadoop 2.x

# Generate crawler hop info:
# If you have the Heritrix crawl logs for the collection (stored under $DATA_DIR/crawl-data/crawl-logs/), run
$PIG_HOME/bin/pig -p I_CRAWLLOG_DATA_DIR=$DATA_DIR/crawl-data/crawl-logs/ -p O_CRAWLLOG_ID_MAP_DIR=$DATA_DIR/derived-data/crawler/crawllogid.map -p O_CRAWLLOG_ID_ONEHOP_DIR=$DATA_DIR/derived-data/crawler/crawllogid.onehop -p O_CRAWLLOG_LINKS_DATA_DIR=$DATA_DIR/derived-data/crawler/links-from-crawllog.gz pig/crawl-logs/generate-crawler-hops-from-crawllog.pig

# Otherwise, if you have WARCs whose metadata records contain crawl hop info,
# generate hopinfo files
$HADOOP_BIN jar lib/ia-hadoop-tools-jar-with-dependencies.jar WARCMetadataRecordGenerator -hopinfo $DATA_DIR/derived-data/warc-metadata-hopinfo/ $DATA_DIR/crawl-data/warcs/*.warc.gz

# And then run
$PIG_HOME/bin/pig -p I_WARCMETADATAHOPINFO_DATA_DIR=$DATA_DIR/derived-data/warc-metadata-hopinfo/*.metadata -p O_CRAWLLOG_ID_MAP_DIR=$DATA_DIR/derived-data/crawler/crawllogid.map -p O_CRAWLLOG_ID_ONEHOP_DIR=$DATA_DIR/derived-data/crawler/crawllogid.onehop -p O_CRAWLLOG_LINKS_DATA_DIR=$DATA_DIR/derived-data/crawler/links-from-crawllog.gz pig/crawl-logs/generate-crawler-hops-from-warc-metadata-records-hopinfo.pig

# Finally, run the Giraph job that generates the path from the crawler for each URL
export IA_GIRAPH_JAR=$PROJECT_DIR/lib/giraph-ia-1.1.0-SNAPSHOT-for-hadoop-2.0.5-alpha-jar-with-dependencies.jar
export CVIF=org.archive.giraph.VertexWithTextValueLongTextTextTextInputFormat
export OF=org.apache.giraph.io.formats.IdWithValueTextOutputFormat
export NUMWORKERS=10
 
# ZooKeeper and JobTracker settings 
export ZOOKEEPER_OPTS_STRING="-ca giraph.zkList=$ZOOKEEPER_HOST:$ZOOKEEPER_PORT"
export JOBTRACKER_OPTS_STRING="-Dmapred.job.tracker=$JOBTRACKER_HOST:$JOBTRACKER_PORT"
 
$HADOOP_BIN jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner $JOBTRACKER_OPTS_STRING org.archive.giraph.LabelPathComputation -vif $CVIF -vip $DATA_DIR/derived-data/crawler/crawllogid.onehop/part* -of $OF -op $DATA_DIR/derived-data/crawler/crawllogid.hoppathfromcrawler/ -w $NUMWORKERS $ZOOKEEPER_OPTS_STRING

 Crawl Log Warehouse - Apache Hive

Data Extraction

 Repackage a subset of WARC data into new WARC files

Local mode

# Choose the subset of WARC records to extract by compiling a list of tab-separated offsets and file locations (HTTP/HDFS locations); an illustrative line format is shown after this block
# Use CDX/WAT data and Pig/Hive to compile this list (set I_OFFSET_SRCFILEPATH to the location of this list on local disk)
 
# Run GZRange-Client to repackage these records into WARC files
java -jar lib/ia-hadoop-tools-jar-with-dependencies.jar gzrange-client $DATA_DIR/subset-data/warcs/ $I_OFFSET_SRCFILEPATH
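
For reference, each line of the offset/filelocation list pairs a byte offset with the location of the source WARC file, separated by a tab; the values below are purely illustrative:

# Illustrative contents of $I_OFFSET_SRCFILEPATH (offset<TAB>location)
# 3042      http://archive.org/.../example-00000.warc.gz
# 917460    hdfs:///data/warcs/example-00001.warc.gz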

Cluster mode

# Choose the subset of WARC records to extract by compiling a list of tab-separated offsets and file locations (HTTP/HDFS locations)
# Use CDX/WAT data and Pig/Hive to compile this list (set I_OFFSET_SRCFILEPATH to the location of this list in HDFS)
 
# Then, use this list to prepare task files for extraction
$PIG_HOME/bin/pig -p I_EXTRACTED_FILE_PREFIX=MY-COLLECTION -p I_RECORDS_PER_EXTRACTED_FILE=10000 -p I_OFFSET_SRCFILEPATH=$I_OFFSET_SRCFILEPATH -p O_TASK_FILE_FOR_EXTRACTION=$DATA_DIR/subset-data/extraction.input pig/extraction/prepare-taskfile-for-extraction.pig

# Now run the ArchiveFileExtractor job to repackage these records into WARC files
NUMMAPPERS=15
$HADOOP_BIN jar lib/ia-hadoop-tools-jar-with-dependencies.jar ArchiveFileExtractor -mappers $NUMMAPPERS $DATA_DIR/subset-data/extraction.input $DATA_DIR/subset-data/warcs/
 

Additional Datasets

 US Congressional 109th End of Term collection