
Set up

 Background
 Installation
 

1) Install the latest version of Java and set JAVA_HOME

Linux:

export JAVA_HOME=/usr

OS X:

export JAVA_HOME=$(/usr/libexec/java_home)
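
If you want to verify the setting, the following should print a Java version and the JAVA_HOME path:

java -version
echo $JAVA_HOME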

 

2) Install Python and pip

Linux:

sudo apt-get install python-pip

OS X:

sudo easy_install pip
curl https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py | python
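
Optionally, verify that pip is now available:

pip --version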

 

3) Install Hadoop (version 2, which includes Apache Pig)

curl -O http://archive.org/~vinay/archive-analysis/hadoop-2-local-mode.tar.gz
tar xfz hadoop-2-local-mode.tar.gz

# set PIG_HOME and other env variables to run in local mode
source hadoop-2-local-mode/setup-env.sh
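
To confirm that the local-mode environment was picked up (assuming setup-env.sh exports PIG_HOME, as the comment above notes), print the Pig version:

$PIG_HOME/bin/pig -version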

 

4) Download the Web Archive Analysis Project

git clone --depth=1 https://github.com/vinaygoel/archive-analysis.git
export PROJECT_DIR=`pwd`/archive-analysis/
 
curl -o $PROJECT_DIR/lib/ia-hadoop-tools-jar-with-dependencies.jar http://archive.org/~vinay/archive-analysis/ia-libraries/hadoop-2.x/ia-hadoop-tools-jar-with-dependencies.jar
curl -o $PROJECT_DIR/lib/ia-porky-jar-with-dependencies.jar http://archive.org/~vinay/archive-analysis/ia-libraries/hadoop-2.x/ia-porky-jar-with-dependencies.jar

 

5) Download the workshop sample dataset (5.0 GB): sample WARC files and the corresponding Waimea-generated parsed text data

curl -O http://archive.org/~vinay/archive-analysis/sample-dataset.tar.gz
tar xfz sample-dataset.tar.gz
export DATA_DIR=`pwd`/sample-dataset/
 
# For an even smaller sample dataset (single WARC file with corresponding parsed text)
# download http://archive.org/~vinay/archive-analysis/sample-dataset-single.tar.gz (986 MB)
# curl -O http://archive.org/~vinay/archive-analysis/sample-dataset-single.tar.gz
# tar xfz sample-dataset-single.tar.gz
# export DATA_DIR=`pwd`/sample-dataset-single/
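
As an optional sanity check, the layout the later steps expect is WARCs under crawl-data/warcs/ and Waimea parsed text under derived-data/parsed/:

ls $DATA_DIR/crawl-data/warcs/ | head
ls $DATA_DIR/derived-data/parsed/ | head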

 

6) Launch jobs from the workshop project directory

cd $PROJECT_DIR

 

7) Optional: Cluster mode settings for Pig and Giraph

To run the workshop in Cluster mode, set the following:

export DATA_DIR=<location of the dataset in HDFS>
export HADOOP_HOME=<location of the installed hadoop software>
export HADOOP_BIN=$HADOOP_HOME/bin/hadoop
export PIG_HOME=<location of the installed Pig software>

To run Giraph in Local mode, please refer to the Giraph Quick Start Guide

 Generate CDX files from WARC data

SURT CDX legend 

Local mode

./hadoop-streaming/cdx/generate-cdx-job-local.sh $DATA_DIR/crawl-data/warcs/ $DATA_DIR/derived-data/cdx/
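
In local mode the CDX output is written to local disk, so it can be inspected directly (output file names may vary by job):

head -3 $DATA_DIR/derived-data/cdx/*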

Cluster mode

$HADOOP_BIN jar lib/ia-hadoop-tools-jar-with-dependencies.jar CDXGenerator $DATA_DIR/derived-data/cdx/ $DATA_DIR/crawl-data/warcs/*.warc.gz
 Generate parsed text data from WARC data

Local mode

# See the provided dataset: $DATA_DIR/derived-data/parsed/
# If you want to generate it yourself, please set up Hadoop in pseudo-distributed mode and run
$HADOOP_BIN jar lib/jbs.jar org.archive.jbs.Parse -conf etc/job-parse.xml $DATA_DIR/derived-data/parsed/ $DATA_DIR/crawl-data/warcs/*.warc.gz

Cluster mode

$HADOOP_BIN jar lib/jbs.jar org.archive.jbs.Parse -conf etc/job-parse.xml $DATA_DIR/derived-data/parsed/ $DATA_DIR/crawl-data/warcs/*.warc.gz
 Generate WAT files from WARC data

WAT file specification 

Local mode

./hadoop-streaming/wats/generate-wat-job-local.sh $DATA_DIR/crawl-data/warcs/ $DATA_DIR/derived-data/wats/
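
WAT files are gzipped, so in local mode the generated metadata records can be peeked at with:

gunzip -c $DATA_DIR/derived-data/wats/*.wat.gz | head -40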

Cluster mode

$HADOOP_BIN jar lib/ia-hadoop-tools-jar-with-dependencies.jar WATGenerator $DATA_DIR/derived-data/wats/ $DATA_DIR/crawl-data/warcs/*.warc.gz

Link Extraction

 Extract links from WARC data (any available WARC metadata records with outlinks)

Local mode

# Clone the warctools repo
git clone https://github.com/internetarchive/warctools.git
 
# Extract links
./hadoop-streaming/warc-metadata-outlinks/generate-outlinks-job-local.sh $DATA_DIR/crawl-data/warcs/ $DATA_DIR/derived-data/warc-metadata-outlinks/
 
# SURT canonicalize the extracted links
$PIG_HOME/bin/pig -x local -p I_LINKS_DIR=$DATA_DIR/derived-data/warc-metadata-outlinks/ -p O_CANON_LINKS_DIR=$DATA_DIR/derived-data/canon-links-from-warc-metadata.gz/ pig/graph/canonicalize-links-from-warcs.pig

Cluster mode

# Extract links
$HADOOP_BIN jar lib/ia-hadoop-tools-jar-with-dependencies.jar WARCMetadataRecordGenerator $DATA_DIR/derived-data/warc-metadata-outlinks/ $DATA_DIR/crawl-data/warcs/*.warc.gz
 
# SURT canonicalize the extracted links
$PIG_HOME/bin/pig -p I_LINKS_DIR=$DATA_DIR/derived-data/warc-metadata-outlinks/*.metadata -p O_CANON_LINKS_DIR=$DATA_DIR/derived-data/canon-links-from-warc-metadata.gz/ pig/graph/canonicalize-links-from-warcs.pig
 Extract links from parsed text data

Local mode

# Extract SURT canonicalized links
$PIG_HOME/bin/pig -x local -p I_PARSED_DATA_DIR=$DATA_DIR/derived-data/parsed/ -p O_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-parsed-captures.gz pig/parsed-captures/extract-surt-canon-links-with-anchor-from-parsed-captures.pig

Cluster mode

# Extract SURT canonicalized links
$PIG_HOME/bin/pig -p I_PARSED_DATA_DIR=$DATA_DIR/derived-data/parsed/ -p O_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-parsed-captures.gz pig/parsed-captures/extract-surt-canon-links-with-anchor-from-parsed-captures.pig
 Extract links from WAT data

Local mode

# Extract SURT canonicalized links
$PIG_HOME/bin/pig -x local -p I_WATS_DIR=$DATA_DIR/derived-data/wats/*.wat.gz -p O_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-wats.gz pig/wats/extract-surt-canon-links-with-anchor-from-warc-wats.pig

Cluster mode

# Extract SURT canonicalized links
$PIG_HOME/bin/pig -p I_WATS_DIR=$DATA_DIR/derived-data/wats/*.wat.gz -p O_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-wats.gz pig/wats/extract-surt-canon-links-with-anchor-from-warc-wats.pig

Text Extraction (Titles)

 Extract titles from parsed text data

Local mode

$PIG_HOME/bin/pig -x local -p I_PARSED_DATA_DIR=$DATA_DIR/derived-data/parsed/ -p O_URL_TITLE_DIR=$DATA_DIR/derived-data/url.title-from-parsed-captures.gz/ pig/parsed-captures/extract-surt-canon-urls-with-only-titles-from-parsed-captures.pig

Cluster mode

$PIG_HOME/bin/pig -p I_PARSED_DATA_DIR=$DATA_DIR/derived-data/parsed/ -p O_URL_TITLE_DIR=$DATA_DIR/derived-data/url.title-from-parsed-captures.gz/ pig/parsed-captures/extract-surt-canon-urls-with-only-titles-from-parsed-captures.pig
 Extract titles from WAT data

Local mode

$PIG_HOME/bin/pig -x local -p I_WATS_DIR=$DATA_DIR/derived-data/wats/*.wat.gz -p O_URL_TITLE_DIR=$DATA_DIR/derived-data/url.title-from-wats.gz/ pig/wats/extract-surt-canon-urls-with-only-titles-from-warc-wats.pig

Cluster mode

$PIG_HOME/bin/pig -p I_WATS_DIR=$DATA_DIR/derived-data/wats/*.wat.gz -p O_URL_TITLE_DIR=$DATA_DIR/derived-data/url.title-from-wats.gz/ pig/wats/extract-surt-canon-urls-with-only-titles-from-warc-wats.pig

Text Extraction (Meta text: title, meta keywords, description text)

 Extract meta text from parsed text data

Local mode

$PIG_HOME/bin/pig -x local -p I_PARSED_DATA_DIR=$DATA_DIR/derived-data/parsed/ -p O_METATEXT_DATA_DIR=$DATA_DIR/derived-data/metatext-from-parsed-captures.gz/ pig/parsed-captures/extract-surt-canon-urls-with-metatext-from-parsed-captures.pig

Cluster mode

$PIG_HOME/bin/pig -p I_PARSED_DATA_DIR=$DATA_DIR/derived-data/parsed/ -p O_METATEXT_DATA_DIR=$DATA_DIR/derived-data/metatext-from-parsed-captures.gz/ pig/parsed-captures/extract-surt-canon-urls-with-metatext-from-parsed-captures.pig
 Extract meta text from WAT data

Local mode

$PIG_HOME/bin/pig -x local -p I_WATS_DIR=$DATA_DIR/derived-data/wats/*.wat.gz -p O_METATEXT_DATA_DIR=$DATA_DIR/derived-data/metatext-from-wats.gz/ pig/wats/extract-surt-canon-urls-with-metatext-from-warc-wats.pig

Cluster mode

$PIG_HOME/bin/pig -p I_WATS_DIR=$DATA_DIR/derived-data/wats/*.wat.gz -p O_METATEXT_DATA_DIR=$DATA_DIR/derived-data/metatext-from-wats.gz/ pig/wats/extract-surt-canon-urls-with-metatext-from-warc-wats.pig

Entity Extraction (using Stanford NER)

 Extract entities from parsed text data

Local mode

$PIG_HOME/bin/pig -x local -p I_PARSED_DATA_DIR=$DATA_DIR/derived-data/parsed/ -p O_ENTITIES_DIR=$DATA_DIR/derived-data/entities-from-parsed-captures.gz/ pig/parsed-captures/extract-entities-from-parsed-captures.pig

Cluster mode

# Store the classifier available under $PROJECT_DIR/lib/english.all.3class.distsim.crf.ser.gz into HDFS, and set LOCATION_OF_NER_CLASSIFIER_FILE_IN_HDFS to the HDFS directory that contains it.
# The classifier will be supplied to the workers via the distributed cache.

$PIG_HOME/bin/pig -Dmapred.cache.files="$LOCATION_OF_NER_CLASSIFIER_FILE_IN_HDFS/english.all.3class.distsim.crf.ser.gz#english.all.3class.distsim.crf.ser.gz" -Dmapred.create.symlink=yes -p I_NER_CLASSIFIER_FILE=english.all.3class.distsim.crf.ser.gz  -p I_PARSED_DATA_DIR=$DATA_DIR/derived-data/parsed/ -p O_ENTITIES_DIR=$DATA_DIR/derived-data/entities-from-parsed-captures.gz/ pig/parsed-captures/extract-entities-from-parsed-captures.pig
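
For example, the classifier could be staged in HDFS along these lines before running the job above (the HDFS path here is only illustrative; use whatever location suits your cluster):

$HADOOP_BIN fs -mkdir -p /user/$USER/ner-classifier
$HADOOP_BIN fs -put $PROJECT_DIR/lib/english.all.3class.distsim.crf.ser.gz /user/$USER/ner-classifier/
export LOCATION_OF_NER_CLASSIFIER_FILE_IN_HDFS=/user/$USER/ner-classifier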
 

Graph Generation

 Generate Host Graph from links data

Local mode

$PIG_HOME/bin/pig -x local -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p O_HOST_GRAPH_DIR=$DATA_DIR/derived-data/graph/host.graph/ pig/graph/convert-link-data-to-host-graph-with-counts.pig

Cluster mode

$PIG_HOME/bin/pig -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p O_HOST_GRAPH_DIR=$DATA_DIR/derived-data/graph/host.graph/ pig/graph/convert-link-data-to-host-graph-with-counts.pig
 Generate Domain Graph from links data

Local mode

$PIG_HOME/bin/pig -x local -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p O_TOP_DOMAIN_GRAPH_DIR=$DATA_DIR/derived-data/graph/top-domain.graph/ pig/graph/convert-link-data-to-top-domain-graph-with-counts.pig

Cluster mode

$PIG_HOME/bin/pig -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p O_TOP_DOMAIN_GRAPH_DIR=$DATA_DIR/derived-data/graph/top-domain.graph/ pig/graph/convert-link-data-to-top-domain-graph-with-counts.pig
 Generate an Archival Web Graph (ID-Map and ID-Graph) from links data

Local mode

# Generate an ID-Map that maps a unique integer ID to each URL (SURT-sorted URLs allow pages belonging to the same host to receive "closer" IDs)
$PIG_HOME/bin/pig -x local -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p O_ID_MAP_DIR=$DATA_DIR/derived-data/graph/id.map/ pig/graph/generate-id-map.pig

# Alternate version: Generate an ID-Map where the ID assigned to each URL is a 64-bit fingerprint generated from the URL
# $PIG_HOME/bin/pig -x local -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p O_ID_MAP_DIR=$DATA_DIR/derived-data/graph/id.map/ pig/graph/generate-id-map-using-fingerprints.pig 

# Generate an ID-Graph that lists the destination IDs for each source ID along with the timestamp info
$PIG_HOME/bin/pig -x local -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p I_ID_MAP_DIR=$DATA_DIR/derived-data/graph/id.map/ -p O_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/id.graph/ pig/graph/convert-link-data-to-id-graph.pig

Cluster mode

# Generate an ID-Map that maps a unique integer ID to each URL
$PIG_HOME/bin/pig -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p O_ID_MAP_DIR=$DATA_DIR/derived-data/graph/id.map/ pig/graph/generate-id-map.pig

# Alternate version: Generate an ID-Map where the ID assigned to each URL is a 64-bit fingerprint generated from the URL
# $PIG_HOME/bin/pig -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p O_ID_MAP_DIR=$DATA_DIR/derived-data/graph/id.map/ pig/graph/generate-id-map-using-fingerprints.pig 

# Generate an ID-Graph that lists the destination IDs for each source ID along with the timestamp info
$PIG_HOME/bin/pig -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p I_ID_MAP_DIR=$DATA_DIR/derived-data/graph/id.map/ -p O_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/id.graph/ pig/graph/convert-link-data-to-id-graph.pig

 Generate a Link Graph (ID-Map and SortedInt-ID-Graph) from (SRC,DST) pairs with no timestamp

Local mode

# Generate an ID-Map that maps a unique integer ID to each resource, and SortedIntID-Graph that maps a set of sorted integer destination IDs to each source ID
$PIG_HOME/bin/pig -x local -p I_LINKS_DATA_NO_TS_DIR=$DATA_DIR/derived-data/graph/host.graph/ -p O_ID_MAP_DIR=$DATA_DIR/derived-data/graph/host-id.map/ -p O_ID_SORTEDINT_GRAPH_NO_TS_DIR=$DATA_DIR/derived-data/graph/host-id-sortedint.graph/ pig/graph/convert-link-data-no-ts-to-sorted-id-data-no-ts.pig

Cluster mode

# Generate an ID-Map that maps a unique integer ID to each resource, and SortedIntID-Graph that maps a set of sorted integer destination IDs to each source ID
$PIG_HOME/bin/pig -p I_LINKS_DATA_NO_TS_DIR=$DATA_DIR/derived-data/graph/host.graph/ -p O_ID_MAP_DIR=$DATA_DIR/derived-data/graph/host-id.map/ -p O_ID_SORTEDINT_GRAPH_NO_TS_DIR=$DATA_DIR/derived-data/graph/host-id-sortedint.graph/ pig/graph/convert-link-data-no-ts-to-sorted-id-data-no-ts.pig

CDX Analysis

 Example: Generate breakdown of MIME types by year from CDX data

Local mode

$PIG_HOME/bin/pig -x local -p I_CDX_DIR=$DATA_DIR/derived-data/cdx/*.cdx -p O_MIME_BREAKDOWN_DIR=$DATA_DIR/derived-data/mime-breakdown/ pig/cdx/mimeBreakdown.pig
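
In local mode the resulting breakdown can then be inspected directly (this assumes the script writes plain-text part files; adjust if the output is compressed):

cat $DATA_DIR/derived-data/mime-breakdown/part* | head -20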

Cluster mode

$PIG_HOME/bin/pig -p I_CDX_DIR=$DATA_DIR/derived-data/cdx/*.cdx -p O_MIME_BREAKDOWN_DIR=$DATA_DIR/derived-data/mime-breakdown/ pig/cdx/mimeBreakdown.pig
 CDX Warehouse - Apache Hive

Apache Hive

Install Hive. Refer to CDX Warehouse

Text Analysis

 Example: Extract top 50 terms (by TF-IDF) per URL from meta text data

Local mode

$PIG_HOME/bin/pig -x local -p I_METATEXT_DIR=$DATA_DIR/derived-data/metatext-from-* -p O_URL_METATEXT_TOPTERMS_DIR=$DATA_DIR/derived-data/url.topmetatext.gz pig/text/metatext-topN-tfidf.pig

Cluster mode

# Store the stop words list available under $PROJECT_DIR/pig/text/stop-words.txt into HDFS, and set LOCATION_OF_STOP_WORDS_FILE_IN_HDFS to the HDFS directory that contains it.
# The stop words list will be supplied to the workers via the distributed cache.
$PIG_HOME/bin/pig -Dmapred.cache.files="$LOCATION_OF_STOP_WORDS_FILE_IN_HDFS/stop-words.txt#stop-words.txt" -Dmapred.create.symlink=yes -p I_STOP_WORDS_FILE=stop-words.txt -p I_METATEXT_DIR=$DATA_DIR/derived-data/metatext-from-* -p O_URL_METATEXT_TOPTERMS_DIR=$DATA_DIR/derived-data/url.topmetatext.gz/ pig/text/metatext-topN-tfidf.pig
 Example: Extract top 50 terms (by TF-IDF) per URL from link text data (anchor text)

Local mode

$PIG_HOME/bin/pig -x local -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p O_URL_ANCHORTEXT_TOPTERMS_DIR=$DATA_DIR/derived-data/url.topanchortext.gz pig/text/anchortext-topN-tfidf.pig

Cluster mode

# Store the stop words list available under $PROJECT_DIR/pig/text/stop-words.txt into HDFS, and set LOCATION_OF_STOP_WORDS_FILE_IN_HDFS to the HDFS directory that contains it.
# The stop words list will be supplied to the workers via the distributed cache.
$PIG_HOME/bin/pig -Dmapred.cache.files="$LOCATION_OF_STOP_WORDS_FILE_IN_HDFS/stop-words.txt#stop-words.txt" -Dmapred.create.symlink=yes -p I_STOP_WORDS_FILE=stop-words.txt -p I_LINKS_DATA_DIR=$DATA_DIR/derived-data/canon-links-from-* -p O_URL_ANCHORTEXT_TOPTERMS_DIR=$DATA_DIR/derived-data/url.topanchortext.gz pig/text/anchortext-topN-tfidf.pig
 Example: Prepare parsed text for analysis with Apache Mahout

Apache Mahout

Local mode

$PIG_HOME/bin/pig -x local -p I_PARSED_DATA_DIR=$DATA_DIR/derived-data/parsed/ -p O_URL_CONTENT_SEQ_DIR=$DATA_DIR/derived-data/parsed-captures-content-for-mahout.seq/ pig/parsed-captures/prepare-content-for-mahout-with-filter.pig

Cluster mode

# Store the stop words list available under $PROJECT_DIR/pig/text/stop-words.txt into HDFS, and set LOCATION_OF_STOP_WORDS_FILE_IN_HDFS to the HDFS directory that contains it.
# The stop words list will be supplied to the workers via the distributed cache.
$PIG_HOME/bin/pig -Dmapred.cache.files="$LOCATION_OF_STOP_WORDS_FILE_IN_HDFS/stop-words.txt#stop-words.txt" -Dmapred.create.symlink=yes -p I_STOP_WORDS_FILE=stop-words.txt -p I_PARSED_DATA_DIR=$DATA_DIR/derived-data/parsed/ -p O_URL_CONTENT_SEQ_DIR=$DATA_DIR/derived-data/parsed-captures-content-for-mahout.seq/ pig/parsed-captures/prepare-content-for-mahout-with-filter.pig

Now you can run the Mahout seq2sparse command on the produced output, followed by any of the available clustering and classification algorithms.
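
A hedged sketch of that follow-on step (flags and output locations may vary between Mahout versions, and the vector/cluster directories below are illustrative names): seq2sparse converts the sequence files into sparse TF-IDF vectors, and k-means is shown as one example of a clustering algorithm run on them.

mahout seq2sparse -i $DATA_DIR/derived-data/parsed-captures-content-for-mahout.seq/ -o $DATA_DIR/derived-data/mahout-vectors/ -wt tfidf
mahout kmeans -i $DATA_DIR/derived-data/mahout-vectors/tfidf-vectors/ -c $DATA_DIR/derived-data/mahout-kmeans-seeds/ -o $DATA_DIR/derived-data/mahout-kmeans-clusters/ -k 20 -x 10 -cl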

Link Analysis with Apache Pig

 Example: Compute the number of incoming and outgoing links per URL

Local mode

$PIG_HOME/bin/pig -x local -p I_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/id.graph/ -p O_DEGREE_ANALYSIS_DIR=$DATA_DIR/derived-data/graph/degree-analysis/ pig/graph/degree-analysis.pig

Cluster mode

$PIG_HOME/bin/pig -p I_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/id.graph/ -p O_DEGREE_ANALYSIS_DIR=$DATA_DIR/derived-data/graph/degree-analysis/ pig/graph/degree-analysis.pig
 Example: Compute the degree stats per Host/Domain from Host/Domain Graph

Local mode

$PIG_HOME/bin/pig -x local -p I_HOST_GRAPH_DIR=$DATA_DIR/derived-data/graph/host.graph/ -p O_HOST_DEGREE_STATS_DIR=$DATA_DIR/derived-data/graph/host-degree-stats/ pig/graph/host-degree-stats.pig

Cluster mode

$PIG_HOME/bin/pig -p I_HOST_GRAPH_DIR=$DATA_DIR/derived-data/graph/host.graph/ -p O_HOST_DEGREE_STATS_DIR=$DATA_DIR/derived-data/graph/host-degree-stats/ pig/graph/host-degree-stats.pig
 Example: PageRank

Not recommended. The MapReduce implementation (Pig) is several orders of magnitude slower than the BSP implementation (Giraph).

Refer to the "prepare PageRank graph" and "PageRank implementation using Pig" scripts.

 Example: Find common links using Link Graph (ID-Map, SortedInt-ID-Graph)

Problem: Given a set of input resources, find the set of links that all of these input resources link to.

The following code makes use of FindAndIntersectionsUsingPForDeltaDocIdSetUDF(). This UDF uses the Kamikaze library to build docIdSets from the sorted integer destination ID sets, and then performs an efficient intersection of these sets to find all the destinations in common. It only works for integer IDs with values less than Java's Integer.MAX_VALUE (2^31 - 1).

 

Local mode

# Set I_SRC_RESOURCES_DIR to the data containing the list of resources for which we need to find common/shared links
$PIG_HOME/bin/pig -x local -p I_SRC_RESOURCES_DIR=$I_SRC_RESOURCES_DIR -p I_ID_MAP_DIR=$DATA_DIR/derived-data/graph/host-id.map/ -p I_ID_SORTEDINT_GRAPH_NO_TS_DIR=$DATA_DIR/derived-data/graph/host-id-sortedint.graph/ -p O_COMMON_LINKS_RESOURCES_DIR=$DATA_DIR/derived-data/graph/common-links/ pig/graph/find-links-in-common.pig

Cluster mode

# Set I_SRC_RESOURCES_DIR to the data containing the list of resources for which we need to find common/shared links
$PIG_HOME/bin/pig -p I_SRC_RESOURCES_DIR=$I_SRC_RESOURCES_DIR -p I_ID_MAP_DIR=$DATA_DIR/derived-data/graph/host-id.map/ -p I_ID_SORTEDINT_GRAPH_NO_TS_DIR=$DATA_DIR/derived-data/graph/host-id-sortedint.graph/ -p O_COMMON_LINKS_RESOURCES_DIR=$DATA_DIR/derived-data/graph/common-links/ pig/graph/find-links-in-common.pig

Link Analysis with Apache Giraph (Cluster mode)

 Example: Compute the number of incoming and outgoing links per URL

Cluster mode - Hadoop 0.20.2

# Prepare links graph for Giraph from ID-Graph data using Pig
$PIG_HOME/bin/pig -p I_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/id.graph/ -p O_PR_TAB_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/pagerank-id.graph/ pig/graph/prepare-tab-delimited-pagerank-graph.pig
 
# Giraph settings
export IA_GIRAPH_JAR=$PROJECT_DIR/lib/giraph-ia-1.0.0-for-hadoop-0.20.2-jar-with-dependencies.jar
export LVIF=org.archive.giraph.VertexWithDoubleValueLongDoubleFloatTextInputFormat
export OF=org.apache.giraph.io.formats.IdWithValueTextOutputFormat
export NUMWORKERS=10

# Indegree
$HADOOP_BIN jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner org.archive.giraph.InDegreeCountVertex -vif $LVIF -vip $DATA_DIR/derived-data/graph/pagerank-id.graph/part* -of $OF -op $DATA_DIR/derived-data/graph/id.indegree/ -w $NUMWORKERS

# Outdegree
$HADOOP_BIN jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner org.archive.giraph.OutDegreeCountVertex -vif $LVIF -vip $DATA_DIR/derived-data/graph/pagerank-id.graph/part* -of $OF -op $DATA_DIR/derived-data/graph/id.outdegree/ -w $NUMWORKERS

Cluster mode - Hadoop 2.x

# Prepare links graph for Giraph from ID-Graph data using Pig
$PIG_HOME/bin/pig -p I_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/id.graph/ -p O_PR_TAB_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/pagerank-id.graph/ pig/graph/prepare-tab-delimited-pagerank-graph.pig
 
# Giraph settings
export IA_GIRAPH_JAR=$PROJECT_DIR/lib/giraph-ia-1.1.0-SNAPSHOT-for-hadoop-2.0.5-alpha-jar-with-dependencies.jar
export LVIF=org.archive.giraph.VertexWithDoubleValueLongDoubleFloatTextInputFormat
export OF=org.apache.giraph.io.formats.IdWithValueTextOutputFormat
export NUMWORKERS=10
 
# ZooKeeper and JobTracker settings 
export ZOOKEEPER_OPTS_STRING="-ca giraph.zkList=$ZOOKEEPER_HOST:$ZOOKEEPER_PORT"
export JOBTRACKER_OPTS_STRING="-Dmapred.job.tracker=$JOBTRACKER_HOST:$JOBTRACKER_PORT"

# Indegree
$HADOOP_BIN jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner $JOBTRACKER_OPTS_STRING org.archive.giraph.InDegreeCountComputation -vif $LVIF -vip $DATA_DIR/derived-data/graph/pagerank-id.graph/part* -of $OF -op $DATA_DIR/derived-data/graph/id.indegree/ -w $NUMWORKERS $ZOOKEEPER_OPTS_STRING

# Outdegree
$HADOOP_BIN jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner $JOBTRACKER_OPTS_STRING org.archive.giraph.OutDegreeCountComputation -vif $LVIF -vip $DATA_DIR/derived-data/graph/pagerank-id.graph/part* -of $OF -op $DATA_DIR/derived-data/graph/id.outdegree/ -w $NUMWORKERS $ZOOKEEPER_OPTS_STRING

 Example: PageRank

Cluster mode - Hadoop 0.20.2

# Prepare links graph for Giraph from Host/Domain Graph data using Pig
$PIG_HOME/bin/pig -p I_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/id.graph/ -p O_PR_TAB_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/pagerank-id.graph/ pig/graph/prepare-tab-delimited-pagerank-graph.pig
 
# Giraph settings
export IA_GIRAPH_JAR=$PROJECT_DIR/lib/giraph-ia-1.0.0-for-hadoop-0.20.2-jar-with-dependencies.jar
export LVIF=org.archive.giraph.VertexWithDoubleValueLongDoubleFloatTextInputFormat
export OF=org.apache.giraph.io.formats.IdWithValueTextOutputFormat
export NUMWORKERS=10
 
# PageRank
$HADOOP_BIN jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner org.archive.giraph.PageRankVertex -ca PageRankVertex.jump_probability=0.15f -ca PageRankVertex.max_supersteps=15 -vif $LVIF -vip $DATA_DIR/derived-data/graph/pagerank-id.graph/part* -of $OF -op $DATA_DIR/derived-data/graph/id.prscore/ -w $NUMWORKERS -mc org.archive.giraph.PageRankVertex\$PageRankVertexMasterCompute

# Assign ranks by PageRank score using Pig
$PIG_HOME/bin/pig -p I_ID_PRSCORE_DIR=$DATA_DIR/derived-data/graph/id.prscore/ -p O_ID_PRRANK_DIR=$DATA_DIR/derived-data/graph/id.prrank/ pig/graph/assign-pagerank-rank.pig

Cluster mode - Hadoop 2.x

# Prepare links graph for Giraph from Host/Domain Graph data using Pig
$PIG_HOME/bin/pig -p I_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/id.graph/ -p O_PR_TAB_ID_GRAPH_DIR=$DATA_DIR/derived-data/graph/pagerank-id.graph/ pig/graph/prepare-tab-delimited-pagerank-graph.pig
 
# Giraph settings
export IA_GIRAPH_JAR=$PROJECT_DIR/lib/giraph-ia-1.1.0-SNAPSHOT-for-hadoop-2.0.5-alpha-jar-with-dependencies.jar
export LVIF=org.archive.giraph.VertexWithDoubleValueLongDoubleFloatTextInputFormat
export OF=org.apache.giraph.io.formats.IdWithValueTextOutputFormat
export NUMWORKERS=10
 
# ZooKeeper and JobTracker settings 
export ZOOKEEPER_OPTS_STRING="-ca giraph.zkList=$ZOOKEEPER_HOST:$ZOOKEEPER_PORT"
export JOBTRACKER_OPTS_STRING="-Dmapred.job.tracker=$JOBTRACKER_HOST:$JOBTRACKER_PORT"
 
# PageRank
$HADOOP_BIN jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner $JOBTRACKER_OPTS_STRING org.archive.giraph.PageRankComputation -ca PageRankComputation.jumpProbability=0.15f -ca PageRankComputation.maxSupersteps=15 -vif $LVIF -vip $DATA_DIR/derived-data/graph/pagerank-id.graph/part* -of $OF -op $DATA_DIR/derived-data/graph/id.prscore/ -w $NUMWORKERS -mc org.archive.giraph.PageRankComputation\$PageRankComputationMasterCompute $ZOOKEEPER_OPTS_STRING

# Assign ranks by PageRank score using Pig
$PIG_HOME/bin/pig -p I_ID_PRSCORE_DIR=$DATA_DIR/derived-data/graph/id.prscore/ -p O_ID_PRRANK_DIR=$DATA_DIR/derived-data/graph/id.prrank/ pig/graph/assign-pagerank-rank.pig

 Example: Weighted PageRank (Host/Domain Graph)

Cluster mode  - Hadoop 0.20.2

# Prepare links graph for Giraph from Host/Domain graph using Pig
$PIG_HOME/bin/pig -p I_HOST_GRAPH_DIR=$DATA_DIR/derived-data/graph/host.graph/ -p O_PR_TAB_HOST_GRAPH_DIR=$DATA_DIR/derived-data/graph/pagerank-host.graph/ pig/graph/prepare-tab-delimited-weighted-pagerank-graph.pig
 
# Giraph settings
export IA_GIRAPH_JAR=$PROJECT_DIR/lib/giraph-ia-1.0.0-for-hadoop-0.20.2-jar-with-dependencies.jar
export TVIF=org.archive.giraph.VertexWithDoubleValueTextDoubleFloatTextInputFormat
export OF=org.apache.giraph.io.formats.IdWithValueTextOutputFormat
export NUMWORKERS=10
 
# Weighted PageRank
$HADOOP_BIN jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner org.archive.giraph.WeightedPageRankVertex -ca WeightedPageRankVertex.jump_probability=0.15f -ca WeightedPageRankVertex.max_supersteps=15 -vif $TVIF -vip $DATA_DIR/derived-data/graph/pagerank-host.graph/part* -of $OF -op $DATA_DIR/derived-data/graph/host.prscore/ -w $NUMWORKERS -mc org.archive.giraph.WeightedPageRankVertex\$WeightedPageRankVertexMasterCompute

# Assign ranks by PageRank score using Pig
$PIG_HOME/bin/pig -p I_ID_PRSCORE_DIR=$DATA_DIR/derived-data/graph/host.prscore/ -p O_ID_PRRANK_DIR=$DATA_DIR/derived-data/graph/host.prrank/ pig/graph/assign-pagerank-rank.pig

Cluster mode  - Hadoop 2.x

# Prepare links graph for Giraph from Host/Domain graph using Pig
$PIG_HOME/bin/pig -p I_HOST_GRAPH_DIR=$DATA_DIR/derived-data/graph/host.graph/ -p O_PR_TAB_HOST_GRAPH_DIR=$DATA_DIR/derived-data/graph/pagerank-host.graph/ pig/graph/prepare-tab-delimited-weighted-pagerank-graph.pig
 
# Giraph settings
export IA_GIRAPH_JAR=$PROJECT_DIR/lib/giraph-ia-1.1.0-SNAPSHOT-for-hadoop-2.0.5-alpha-jar-with-dependencies.jar
export TVIF=org.archive.giraph.VertexWithDoubleValueTextDoubleFloatTextInputFormat
export OF=org.apache.giraph.io.formats.IdWithValueTextOutputFormat
export NUMWORKERS=10
 
# ZooKeeper and JobTracker settings 
export ZOOKEEPER_OPTS_STRING="-ca giraph.zkList=$ZOOKEEPER_HOST:$ZOOKEEPER_PORT"
export JOBTRACKER_OPTS_STRING="-Dmapred.job.tracker=$JOBTRACKER_HOST:$JOBTRACKER_PORT"
 
# Weighted PageRank
$HADOOP_BIN jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner $JOBTRACKER_OPTS_STRING org.archive.giraph.WeightedPageRankComputation -ca WeightedPageRankComputation.jumpProbability=0.15f -ca WeightedPageRankComputation.maxSupersteps=15 -vif $TVIF -vip $DATA_DIR/derived-data/graph/pagerank-host.graph/part* -of $OF -op $DATA_DIR/derived-data/graph/host.prscore/ -w $NUMWORKERS -mc org.archive.giraph.WeightedPageRankComputation\$WeightedPageRankComputationMasterCompute $ZOOKEEPER_OPTS_STRING

# Assign ranks by PageRank score using Pig
$PIG_HOME/bin/pig -p I_ID_PRSCORE_DIR=$DATA_DIR/derived-data/graph/host.prscore/ -p O_ID_PRRANK_DIR=$DATA_DIR/derived-data/graph/host.prrank/ pig/graph/assign-pagerank-rank.pig

Crawl Log Analysis

 Example: Extract path from crawler for each URL using Apache Giraph (Cluster mode)

Cluster mode - Hadoop 0.20.2

# Generate crawler hops info:
# If you have the Heritrix crawl logs for the collection (stored under $DATA_DIR/crawl-data/crawl-logs/), run
$PIG_HOME/bin/pig -p I_CRAWLLOG_DATA_DIR=$DATA_DIR/crawl-data/crawl-logs/ -p O_CRAWLLOG_ID_MAP_DIR=$DATA_DIR/derived-data/crawler/crawllogid.map -p O_CRAWLLOG_ID_ONEHOP_DIR=$DATA_DIR/derived-data/crawler/crawllogid.onehop -p O_CRAWLLOG_LINKS_DATA_DIR=$DATA_DIR/derived-data/crawler/links-from-crawllog.gz pig/crawl-logs/generate-crawler-hops-from-crawllog.pig

# Else, if you have WARCs with metadata records containing crawl hop info, 
# Generate Hopinfo files
$HADOOP_BIN jar lib/ia-hadoop-tools-jar-with-dependencies.jar WARCMetadataRecordGenerator -hopinfo $DATA_DIR/derived-data/warc-metadata-hopinfo/ $DATA_DIR/crawl-data/warcs/*.warc.gz

# And then run
$PIG_HOME/bin/pig -p I_WARCMETADATAHOPINFO_DATA_DIR=$DATA_DIR/derived-data/warc-metadata-hopinfo/*.metadata -p O_CRAWLLOG_ID_MAP_DIR=$DATA_DIR/derived-data/crawler/crawllogid.map -p O_CRAWLLOG_ID_ONEHOP_DIR=$DATA_DIR/derived-data/crawler/crawllogid.onehop -p O_CRAWLLOG_LINKS_DATA_DIR=$DATA_DIR/derived-data/crawler/links-from-crawllog.gz pig/crawl-logs/generate-crawler-hops-from-warc-metadata-records-hopinfo.pig

# followed by the Giraph job that generates the path from the crawler for each URL
export IA_GIRAPH_JAR=$PROJECT_DIR/lib/giraph-ia-1.0.0-for-hadoop-0.20.2-jar-with-dependencies.jar
export CVIF=org.archive.giraph.VertexWithTextValueLongTextTextTextInputFormat
export OF=org.apache.giraph.io.formats.IdWithValueTextOutputFormat
export NUMWORKERS=10
 
$HADOOP_BIN jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner org.archive.giraph.LabelPathVertex -vif $CVIF -vip $DATA_DIR/derived-data/crawler/crawllogid.onehop/part* -of $OF -op $DATA_DIR/derived-data/crawler/crawllogid.hoppathfromcrawler/ -w $NUMWORKERS

Cluster mode - Hadoop 2.x

# Generate crawler hops info:
# If you have the Heritrix crawl logs for the collection (stored under $DATA_DIR/crawl-data/crawl-logs/), run
$PIG_HOME/bin/pig -p I_CRAWLLOG_DATA_DIR=$DATA_DIR/crawl-data/crawl-logs/ -p O_CRAWLLOG_ID_MAP_DIR=$DATA_DIR/derived-data/crawler/crawllogid.map -p O_CRAWLLOG_ID_ONEHOP_DIR=$DATA_DIR/derived-data/crawler/crawllogid.onehop -p O_CRAWLLOG_LINKS_DATA_DIR=$DATA_DIR/derived-data/crawler/links-from-crawllog.gz pig/crawl-logs/generate-crawler-hops-from-crawllog.pig

# Else, if you have WARCs with metadata records containing crawl hop info, 
# Generate Hopinfo files
$HADOOP_BIN jar lib/ia-hadoop-tools-jar-with-dependencies.jar WARCMetadataRecordGenerator -hopinfo $DATA_DIR/derived-data/warc-metadata-hopinfo/ $DATA_DIR/crawl-data/warcs/*.warc.gz

# And then run
$PIG_HOME/bin/pig -p I_WARCMETADATAHOPINFO_DATA_DIR=$DATA_DIR/derived-data/warc-metadata-hopinfo/*.metadata -p O_CRAWLLOG_ID_MAP_DIR=$DATA_DIR/derived-data/crawler/crawllogid.map -p O_CRAWLLOG_ID_ONEHOP_DIR=$DATA_DIR/derived-data/crawler/crawllogid.onehop -p O_CRAWLLOG_LINKS_DATA_DIR=$DATA_DIR/derived-data/crawler/links-from-crawllog.gz pig/crawl-logs/generate-crawler-hops-from-warc-metadata-records-hopinfo.pig

# followed by the Giraph job that generates the path from the crawler for each URL
export IA_GIRAPH_JAR=$PROJECT_DIR/lib/giraph-ia-1.1.0-SNAPSHOT-for-hadoop-2.0.5-alpha-jar-with-dependencies.jar
export CVIF=org.archive.giraph.VertexWithTextValueLongTextTextTextInputFormat
export OF=org.apache.giraph.io.formats.IdWithValueTextOutputFormat
export NUMWORKERS=10
 
# ZooKeeper and JobTracker settings 
export ZOOKEEPER_OPTS_STRING="-ca giraph.zkList=$ZOOKEEPER_HOST:$ZOOKEEPER_PORT"
export JOBTRACKER_OPTS_STRING="-Dmapred.job.tracker=$JOBTRACKER_HOST:$JOBTRACKER_PORT"
 
$HADOOP_BIN jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner $JOBTRACKER_OPTS_STRING org.archive.giraph.LabelPathComputation -vif $CVIF -vip $DATA_DIR/derived-data/crawler/crawllogid.onehop/part* -of $OF -op $DATA_DIR/derived-data/crawler/crawllogid.hoppathfromcrawler/ -w $NUMWORKERS $ZOOKEEPER_OPTS_STRING

 Crawl Log Warehouse - Apache Hive

Data Extraction

 Repackage a subset of WARC data into new WARC files

Local Mode

# Generate a subset of WARC data to be extracted by compiling a list of tab-separated offsets and file locations (HTTP/HDFS locations)
# Use CDX/WAT data and Pig/Hive to compile this list (set I_OFFSET_SRCFILEPATH to the location of this list on local disk)
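
# As an illustrative sketch only (this assumes the standard 11-field CDX layout produced above,
# with the compressed record offset in field 10 and the WARC file name in field 11; the URL
# prefix is a placeholder for wherever your WARC files are actually served from):
awk -v prefix="http://my-warc-server/warcs/" '!/^ CDX/ {print $10 "\t" prefix $11}' $DATA_DIR/derived-data/cdx/*.cdx > offset-srcfilepath.txt
export I_OFFSET_SRCFILEPATH=`pwd`/offset-srcfilepath.txt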
 
# Run GZRange-Client to repackage these records into WARC files
java -jar lib/ia-hadoop-tools-jar-with-dependencies.jar gzrange-client $DATA_DIR/subset-data/warcs/ $I_OFFSET_SRCFILEPATH

Cluster mode

# Generate a subset of WARC data to be extracted by compiling a list of tab-separated offsets and file locations (HTTP/HDFS locations)
# Use CDX/WAT data and Pig/Hive to compile this list (set I_OFFSET_SRCFILEPATH to the location of this list in HDFS)
 
# Then, use this list to prepare task files for extraction
$PIG_HOME/bin/pig -p I_EXTRACTED_FILE_PREFIX=MY-COLLECTION -p I_RECORDS_PER_EXTRACTED_FILE=10000 -p I_OFFSET_SRCFILEPATH=$I_OFFSET_SRCFILEPATH -p O_TASK_FILE_FOR_EXTRACTION=$DATA_DIR/subset-data/extraction.input pig/extraction/prepare-taskfile-for-extraction.pig

# Now, run the ArchiveFileExtractor job to repackage these records into WARC files
NUMMAPPERS=15
$HADOOP_BIN jar lib/ia-hadoop-tools-jar-with-dependencies.jar ArchiveFileExtractor -mappers $NUMMAPPERS $DATA_DIR/subset-data/extraction.input $DATA_DIR/subset-data/warcs/
 

Additional Datasets

 US Congressional 109th End of Term collection

10 Comments

  1. Unknown User (drahcos)

    Hi,

    when I try the "Generate parsed text data from WARC data" example in "Set up" I get a NullPointerException, but I don't know why. I tried local and cluster mode with the original and with other WARC files. Here is the error message:

    Exception in thread "main" java.lang.NullPointerException
        at org.archive.jbs.Parse.run(Parse.java:394)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.archive.jbs.Parse.main(Parse.java:452)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

    Any ideas?

  2. Unknown User (drahcos)

    btw.,

    why is it that in "Generate parsed text data from WARC data" the commands for local and cluster mode are the same?

    Best regards,

    Richard

     

  3. Hi Richard,

    The parsed text generation code is from the Waimea project (https://github.com/internetarchive/waimea / https://github.com/internetarchive/jbs), which builds parsed text data from (W)ARCs stored only in HDFS.

    So, local mode is not supported unless you set up a pseudo-distributed Hadoop installation on your local machine. That's the reason the command is the same for both local (pseudo-distributed) and cluster mode.

    Are you getting the NullPointerException while processing files stored in HDFS? 

    Thanks,

    Vinay

     

  4. Unknown User (drahcos)

    Hi Vinay,

    I only tried the example with the small dataset and with my own. So no HDFS was involved.

    To be honest, I don't know how to use it with HDFS, since a path is needed and I can only access HDFS via the hdfs commands. I'm mostly working with Pig scripts and they automatically use the HDFS files. Could you tell me what I need to change in your example in order to use HDFS?

    Thanks,

    Richard

     

    1. When you run this command, use the $HADOOP_BIN that's configured to use your cluster:

      export HADOOP_HOME=<location of the installed hadoop software>
      export HADOOP_BIN=$HADOOP_HOME/bin/hadoop

      ($HADOOP_HOME/bin will be the same directory from which you invoke the hdfs command to access files.)

      Then $DATA_DIR will refer to a location in HDFS where you have placed the WARC files.

      $HADOOP_BIN jar lib/jbs.jar org.archive.jbs.Parse -conf etc/job-parse.xml $DATA_DIR/derived-data/parsed/ $DATA_DIR/crawl-data/warcs/*.warc.gz

  5. Unknown User (drahcos)

    Thanks! It works now.

    What I want to do is to analyze the text with Apache Mahout. I know there is the section

    " Example: Prepare parsed text for analysis with Apache Mahout"

    but I would like to take a look at the text before I work with Mahout. Is there a way to do that? The output of

    "Generate parsed text data from WARC data"

    is another warc.gz file, but when I want to access it via a Pig script I get the following error log:

    Backend error message
    ---------------------
    java.io.IOException: incorrect header check
        at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
        at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:221)
        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:89)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:83)
        at java.io.InputStream.read(InputStream.java:101)
        at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:147)
        at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:221)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
        at org.apache.hadoop.mapreduce.task.MapContex

    Pig Stack Trace
    ---------------
    ERROR 2997: Encountered IOException. incorrect header check

    java.io.IOException: incorrect header check
        at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
        at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:221)
        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:89)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:83)
        at java.io.InputStream.read(InputStream.java:101)
        at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:147)
        at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:221)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
    ================================================================================
    Pig Stack Trace
    ---------------
    ERROR 2244: Job failed, hadoop does not return any error message

    org.apache.pig.backend.executionengine.ExecException: ERROR 2244: Job failed, hadoop does not return any error message
        at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:145)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
        at org.apache.pig.Main.run(Main.java:604)
        at org.apache.pig.Main.main(Main.java:157)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
    ================================================================================

    The pigscript I used simply takes the file as input and output with no further change.

    1. This is the one confusing thing about the parsed text: the output files are named the same as the input WARC files, but they're actually sequence files (http://wiki.apache.org/hadoop/SequenceFile) containing the parsed text data, not WARC files (take a look at the code: https://github.com/aaronbinns/jbs/blob/master/src/java/org/archive/jbs/Parse.java).

      To take a look at the content of this generated parsed text data / sequence files, you can use the "mahout seqdumper" command. 

  6. Unknown User (drahcos)

    Thanks a lot!

    Now I can do exactly what I wanted.

  7. Unknown User (drahcos)

    Hi,

    I used "Generate parsed text data from WARC data" on more than 7TB and the job failed because of memory issues. Then I parted the data into 4 parts (so ca 25% each) 

    and the job works fine until it reaches 99.xx%. After that it gets stuck every time. About 3-4 attempts keep staying at 0.00% with the status "running". I killed the attempts so they could

    restart which works fine but the new attempts also stay at 0.00%. This all wouldn't be much of a problem if the reduce part would run until that part. A bit of data loss is ok for me. 

    I tried setting mapred.reduce.slowstart.completed.maps to 0.4 in etc/job-parse.xml and when I check the job.xml I can see that it worked but still the reduce part stays at 0.00%

    all the time. Is there a way I can make reduce start early in this tool?

    Best regards,

    Richard

    1. Use mapred.task.timeout (to kill long-running map tasks), mapred.map.max.attempts (to set the maximum number of attempts), and mapred.max.map.failures.percent (to tolerate some failed map tasks). This will allow your job to complete despite some failed map tasks.