
Set up

 Background

Presentation: (Slides)

 Installation
 

1) Download the Web Archive Analysis Project

git clone --depth=1 https://github.com/vinaygoel/archive-analysis.git
export PROJECT_DIR=`pwd`/archive-analysis/
 
curl -o $PROJECT_DIR/lib/ia-hadoop-tools-jar-with-dependencies.jar http://archive.org/~vinay/archive-analysis/ia-libraries/hadoop-2.x/ia-hadoop-tools-jar-with-dependencies.jar
curl -o $PROJECT_DIR/lib/ia-porky-jar-with-dependencies.jar http://archive.org/~vinay/archive-analysis/ia-libraries/hadoop-2.x/ia-porky-jar-with-dependencies.jar

 

2) Launch jobs from the workshop project directory

cd $PROJECT_DIR
 Datasets
 URL-Timestamp-Checksum for every .gov resource in IA

Records containing three fields for every .gov resource in the Internet Archive: URL (SURT canonicalized URL), timestamp (14-digit) and checksum (SHA-1 digest)

There are a total of 2,655,307,483 (2.6 B) records covering the period from 1995 through the end of September 2013. This data is available under /dataset/gov/url-ts-checksum/ and can be queried using Hive.

hive> describe url_ts_checksum;
OK
url                 	string              	None
ts                  	string              	None
checksum            	string              	None
Time taken: 2.38 seconds, Fetched: 3 row(s)
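
For example, a quick sketch of a query against this table, counting captures per year for a single SURT URL (the year is simply the first four characters of the 14-digit timestamp):

hive> SELECT substr(ts, 1, 4) AS capture_year, count(*) AS num_captures
    > FROM url_ts_checksum
    > WHERE url = "gov,house,speaker)/blog"
    > GROUP BY substr(ts, 1, 4)
    > ORDER BY capture_year;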
 ARC/WARC files generated by extracting every unique URL-Checksum .gov resource

(W)ARC files were generated by extracting every unique <URL,checksum> .gov resource in the Internet Archive (1995 through the end of September 2013). For each unique <URL,checksum> pair, one capture (and hence its timestamp) was chosen at random.

There are 1,111,698,054 (1.1 B) such unique <URL,checksum> captures extracted into a total of 153,311 ARC files and 68,649 WARC files.

The total size of this extracted (W)ARC dataset is 85.93 TB; it is available under /dataset/gov/extracted-data/

The file that lists the locations for each (W)ARC file is /dataset/gov/extracted-data-path-index.txt
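
To locate specific files, the path index can be inspected directly from HDFS. A minimal sketch, assuming the index maps each (W)ARC filename to its storage location, one entry per line:

# Peek at the index format
hdfs dfs -cat /dataset/gov/extracted-data-path-index.txt | head -5
 
# Look up the locations of files matching a name pattern (the pattern shown is illustrative)
hdfs dfs -cat /dataset/gov/extracted-data-path-index.txt | grep "MIME-APPLICATION-ARCS-PART-00000"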

 CDX data derived from the generated ARC/WARC files

CDX data derived from the (W)ARC dataset is available under /dataset/gov/extracted-data-cdx/ and can be queried using Hive.

hive> describe extracted_data_cdx;
OK
url                 	string              	None
ts                  	string              	None
origurl             	string              	None
mime                	string              	None
rescode             	string              	None
checksum            	string              	None
redirecturl         	string              	None
meta                	string              	None
compressedsize      	string              	None
offset              	string              	None
filename            	string              	None
Time taken: 0.123 seconds, Fetched: 11 row(s)
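
As a quick sanity check on the table, a sketch of a simple aggregation that counts captures per HTTP response code:

hive> SELECT rescode, count(*) AS num_captures
    > FROM extracted_data_cdx
    > GROUP BY rescode
    > ORDER BY num_captures DESC;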
 Parsed Text data derived from the generated ARC/WARC files

Parsed text data derived from the (W)ARC dataset is available under /dataset-derived/gov/parsed/. The total size of this data is 2.31 TB.

The file that lists the locations for the parsed text files for each source (W)ARC file is /dataset-derived/gov/parsed-data-path-index.txt

Note: Parsed text generation failed for 90 of the 221,961 (W)ARC files.

 Links from Parsed Text

All links from the parsed text dataset have been extracted.

There are a total of 58,832,436,888 (58.8 billion) links that are available under /dataset-derived/gov/link-analysis/src-ts-dst-anchor-linktype/
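
A minimal Pig sketch for working with this data, assuming the tab-separated field order matches the directory name (src, ts, dst, anchor, linktype); verify the layout against the data before relying on it:

-- Load the extracted links and count outgoing links per source URL
links = LOAD '/dataset-derived/gov/link-analysis/src-ts-dst-anchor-linktype/' AS (src:chararray, ts:chararray, dst:chararray, anchor:chararray, linktype:chararray);
by_src = GROUP links BY src;
out_counts = FOREACH by_src GENERATE group AS src, COUNT(links) AS num_links;
STORE out_counts INTO '<output_src_link_counts_dir>';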

 Expanded Archival Web Graph (Per Year)

The per year Archival Web Graph was generated and then expanded by combining the links extracted from the parsed text collection with the URL-Timestamp-Checksum dataset, in order to find all instances of links from .gov resources across time (the Expanded Graph).

Id-Map: ID-Map that maps a unique integer ID to each URL (/dataset-derived/gov/link-analysis/id.map/)

Expanded Id-Graph (per year): ID-Graph that lists the destination IDs for each source ID along with the timestamp info (/dataset-derived/gov/link-analysis/expanded.id.graph-by-year/)
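
A minimal Pig sketch for resolving URLs of interest to their integer IDs via the ID-Map, assuming each id.map line is a tab-separated <URL, ID> pair (check the actual column order before use); the resulting IDs can then be looked up in the Expanded Id-Graph:

-- Join a (hypothetical) list of SURT URLs of interest against the ID-Map
idmap = LOAD '/dataset-derived/gov/link-analysis/id.map/' AS (url:chararray, id:long);
urls = LOAD '<urls_of_interest_dir>' AS (url:chararray);
resolved = JOIN urls BY url, idmap BY url;
ids = FOREACH resolved GENERATE idmap::id AS id, urls::url AS url;
STORE ids INTO '<output_resolved_ids_dir>';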

 Expanded Host Graph (Per Year)

The Expanded per year Archival Web Graph was converted into a Host Graph (srcHost dstHost numLinks). 

Expanded Host-Graph (per year): The source and destination host along with the number of links from the source to the destination. (/dataset-derived/gov/link-analysis/expanded.host.graph-by-year/)
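
A minimal Pig sketch over this data, assuming tab-separated <srcHost, dstHost, numLinks> lines as described above (the per-year partitioning of the directory, here shown as a hypothetical <year_subdir>, should be checked before use):

-- Rank destination hosts by total number of incoming links for one year's graph
host_graph = LOAD '/dataset-derived/gov/link-analysis/expanded.host.graph-by-year/<year_subdir>' AS (srcHost:chararray, dstHost:chararray, numLinks:long);
by_dst = GROUP host_graph BY dstHost;
in_links = FOREACH by_dst GENERATE group AS dstHost, SUM(host_graph.numLinks) AS total_in_links;
ranked = ORDER in_links BY total_in_links DESC;
STORE ranked INTO '<output_top_inlinked_hosts_dir>';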


 Expanded Domain Graph (Per Year)

The Expanded per year Archival Web Graph was converted into a Top Domain Graph (srcDomain dstDomain numLinks). 

Expanded Domain-Graph (per year): The source and destination domain (top-private-domain) along with the number of links from the source to the destination. (/dataset-derived/gov/link-analysis/expanded.top-domain.graph-by-year/)


Analyzing subsets of the dataset

 Use Hive to find files of interest

Example Goal: To analyze captures of http://speaker.house.gov/blog from the year 2009

First, convert the URL into its SURT representation (using $PROJECT_DIR/pig/misc/surt-canonicalize-lines.pig). In this case, the SURT URL is gov,house,speaker)/blog

Next, find all records from the year 2009 in the url_ts_checksum Hive table:

$ hive
 
hive > INSERT OVERWRITE LOCAL DIRECTORY '/tmp/speaker-blog-2009' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' SELECT url,ts,checksum from url_ts_checksum where url="gov,house,speaker)/blog" and ts rlike '^2009.*';

Now, you have the <URL-checksum> tuples from the year 2009. You can then use this information to query the extracted_data_cdx Hive table to find the CDX lines of interest.

You can run this in a single Hive query:

$ hive
 
hive > INSERT OVERWRITE LOCAL DIRECTORY '/tmp/speaker-blog-2009-cdx' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
     > SELECT extracted_data_cdx.* from
     > (SELECT url,checksum from url_ts_checksum where url="gov,house,speaker)/blog" and ts rlike '^2009.*') captures
     > JOIN extracted_data_cdx ON captures.url = extracted_data_cdx.url AND captures.checksum = extracted_data_cdx.checksum;
    

Now, the local directory /tmp/speaker-blog-2009-cdx will contain CDX lines for every <URL-checksum> pair where the URL is gov,house,speaker)/blog and the checksum was seen in 2009.

You can extract the filename field from this CDX output to find the (W)ARC files (and, by extension, the parsed text files) containing all the records of interest. Note that these files might also contain records that are not of interest.

Now, these files can be provided as input to your analysis jobs.
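
A minimal sketch of the filename-extraction step described above: the filename is the 11th (last) field of the tab-delimited CDX output (see the extracted_data_cdx schema above), so the distinct (W)ARC files and their locations can be pulled out with standard shell tools:

# Collect the distinct (W)ARC filenames referenced by the CDX lines
cut -f11 /tmp/speaker-blog-2009-cdx/* | sort -u > /tmp/speaker-blog-2009-files.txt
 
# Look up the locations of those files in the path index
hdfs dfs -cat /dataset/gov/extracted-data-path-index.txt | grep -F -f /tmp/speaker-blog-2009-files.txt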

 Repackage a subset of (W)ARC data into new (W)ARC files
# Generate a subset of WARC data to be extracted by compiling a list of tab-separated offsets and file locations (HTTP/HDFS locations)
# Use CDX/WAT data and Pig/Hive to compile this list (set I_OFFSET_SRCFILEPATH to the location of this list in HDFS)
 
# Then, use this list to prepare task files for extraction
pig -p I_EXTRACTED_FILE_PREFIX=MY-COLLECTION -p I_RECORDS_PER_EXTRACTED_FILE=10000 -p I_OFFSET_SRCFILEPATH=$I_OFFSET_SRCFILEPATH -p O_TASK_FILE_FOR_EXTRACTION=extraction.input pig/extraction/prepare-taskfile-for-extraction.pig

# Now, run the ArchiveFileExtractor job to repackage these records into new WARC files
NUMMAPPERS=15
hadoop jar lib/ia-hadoop-tools-jar-with-dependencies.jar ArchiveFileExtractor -mappers $NUMMAPPERS extraction.input <output_dir>
 

CDX Analysis

 Generate CDX files from WARC data
hadoop jar lib/ia-hadoop-tools-jar-with-dependencies.jar CDXGenerator <output_cdx_dir> <list_of_(W)ARC files>
 
For example:
hadoop jar lib/ia-hadoop-tools-jar-with-dependencies.jar CDXGenerator /user/vinay/sample-cdx/ /dataset/gov/extracted-data/arcs/bucket-0/DOTGOV-EXTRACTION-1995-FY2013-MIME-APPLICATION-ARCS-PART-00000-00000*.arc.gz
 

CDX data for the collection has already been generated; it is available under /dataset/gov/extracted-data-cdx/ and can be queried using Hive (table: extracted_data_cdx).


 Generate breakdown of MIME types by year from CDX data
pig -p I_CDX_DIR=<cdx_dir> -p O_MIME_BREAKDOWN_DIR=<output_mime_breakdown_dir> pig/cdx/mimeBreakdown.pig

MIME breakdown for the complete (W)ARC dataset:

Top Level MIME       Number of extracted captures
MIME-APPLICATION     57,629,049
MIME-AUDIO           679,015
MIME-IMAGE           64,269,884
MIME-OTHER           10,253,645
MIME-TEXT            978,456,588
MIME-VIDEO           409,873
 CDX Warehouse - Apache Hive


$ hive
 
# Example queries

hive> SELECT url,offset,filename from extracted_data_cdx where rescode = "200";
 
# Store results of query into a local file
 
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/results.del' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' SELECT url,ts,rescode,mime from extracted_data_cdx where url="gov,tn,capitol)/"  limit 100;

Extraction From Parsed Text

 Generate parsed text data from WARC data
hadoop jar lib/jbs.jar org.archive.jbs.Parse -conf etc/job-parse.xml <output_parsed_text_dir> <list_of_(W)ARC_files>
 
For example:
hadoop jar lib/jbs.jar org.archive.jbs.Parse -conf etc/job-parse.xml /user/vinay/sample-parsed/ /dataset/gov/extracted-data/arcs/bucket-0/DOTGOV-EXTRACTION-1995-FY2013-MIME-APPLICATION-ARCS-PART-00000-00000*.arc.gz
 

Parsed text data for the collection has already been generated and is available under /dataset-derived/gov/parsed/.

The file that lists the locations for the parsed text files for each source (W)ARC file is /dataset-derived/gov/parsed-data-path-index.txt

 Extract titles from parsed text data
pig -p I_PARSED_DATA_DIR=<parsed_data_dir> -p O_URL_TITLE_DIR=<output_url_title_dir> pig/parsed-captures/extract-surt-canon-urls-with-only-titles-from-parsed-captures.pig
 Extract meta text from parsed text data
pig -p I_PARSED_DATA_DIR=<parsed_data_dir> -p O_METATEXT_DATA_DIR=<output_metatext_dir> pig/parsed-captures/extract-surt-canon-urls-with-metatext-from-parsed-captures.pig
 Extract links from parsed text data
# Extract SURT canonicalized links
pig -p I_PARSED_DATA_DIR=<parsed_data_dir> -p O_LINKS_DATA_DIR=<output_links_data_dir> pig/parsed-captures/extract-surt-canon-links-with-anchor-from-parsed-captures.pig
 Extract entities from parsed text data
# Store the english.all.3class.distsim.crf.ser.gz available under $PROJECT_DIR/lib/english.all.3class.distsim.crf.ser.gz into HDFS. Set LOCATION_OF_NER_CLASSIFIER_FILE_IN_HDFS.
# This classifier will be supplied as a distributed cache to the workers

pig -Dmapred.cache.files="$LOCATION_OF_NER_CLASSIFIER_FILE_IN_HDFS/english.all.3class.distsim.crf.ser.gz#english.all.3class.distsim.crf.ser.gz" -Dmapred.create.symlink=yes -p I_NER_CLASSIFIER_FILE=english.all.3class.distsim.crf.ser.gz  -p I_PARSED_DATA_DIR=<parsed_data_dir> -p O_ENTITIES_DIR=<output_entities_data_dir> pig/parsed-captures/extract-entities-from-parsed-captures.pig

Graph Generation

 Generate Host Graph from links data
pig -p I_LINKS_DATA_DIR=<links_data_dir> -p O_HOST_GRAPH_DIR=<output_host_graph> pig/graph/convert-link-data-to-host-graph-with-counts.pig
 Generate Domain Graph from links data
pig -p I_LINKS_DATA_DIR=<links_data_dir> -p O_TOP_DOMAIN_GRAPH_DIR=<output_top_domain_graph> pig/graph/convert-link-data-to-top-domain-graph-with-counts.pig
 Generate an Archival Web Graph (ID-Map and ID-Graph) from links data
# Generate an ID-Map that maps a unique integer ID to each URL
pig -p I_LINKS_DATA_DIR=<links_data_dir> -p O_ID_MAP_DIR=<output_id.map_dir> pig/graph/generate-id-map.pig

# Alternate version: Generate an ID-Map where the ID assigned to each URL is a 64-bit fingerprint generated from the URL
# pig -p I_LINKS_DATA_DIR=<links_data_dir> -p O_ID_MAP_DIR=<output_id.map_dir> pig/graph/generate-id-map-using-fingerprints.pig 

# Generate an ID-Graph that lists the destination IDs for each source ID along with the timestamp info
pig -p I_LINKS_DATA_DIR=<links_data_dir> -p I_ID_MAP_DIR=<id.map_dir> -p O_ID_GRAPH_DIR=<output_id.graph_dir> pig/graph/convert-link-data-to-id-graph.pig
 Generate a Link Graph (ID-Map and SortedInt-ID-Graph) from (SRC,DST) - no timestamp
# Generate an ID-Map that maps a unique integer ID to each resource, and SortedIntID-Graph that maps a set of sorted integer destination IDs to each source ID
pig -p I_LINKS_DATA_NO_TS_DIR=<links_data_no_ts_dir> -p O_ID_MAP_DIR=<output_id.map_dir> -p O_ID_SORTEDINT_GRAPH_NO_TS_DIR=<output_id_sortedint_graph_no_ts_dir> pig/graph/convert-link-data-no-ts-to-sorted-id-data-no-ts.pig

Text Analysis

 Example: Extract top 50 terms (by TF-IDF) per URL from meta text data
# The stop-words.txt available under $PROJECT_DIR/pig/text/stop-words.txt is also in HDFS under /user/vinay/. 
# This stop words list will be supplied as a distributed cache to the workers

pig -Dmapred.cache.files="/user/vinay/stop-words.txt#stop-words.txt" -Dmapred.create.symlink=yes -p I_STOP_WORDS_FILE=stop-words.txt -p I_METATEXT_DIR=<metatext_dir> -p O_URL_METATEXT_TOPTERMS_DIR=<output_url_metatext_topterms_dir> pig/text/metatext-topN-tfidf.pig
 Example: Extract top 50 terms (by TF-IDF) per URL from link text data (anchor text)
# The stop-words.txt available under $PROJECT_DIR/pig/text/stop-words.txt is also in HDFS under /user/vinay/. 
# This stop words list will be supplied as a distributed cache to the workers

pig -Dmapred.cache.files="/user/vinay/stop-words.txt#stop-words.txt" -Dmapred.create.symlink=yes -p I_STOP_WORDS_FILE=stop-words.txt -p I_LINKS_DATA_DIR=<links_data_dir> -p O_URL_ANCHORTEXT_TOPTERMS_DIR=<output_url_topanchortext_dir> pig/text/anchortext-topN-tfidf.pig
 Example: Prepare parsed text for analysis with Apache Mahout


# The stop-words.txt available under $PROJECT_DIR/pig/text/stop-words.txt is also in HDFS under /user/vinay/. 
# This stop words list will be supplied as a distributed cache to the workers

pig -Dmapred.cache.files="/user/vinay/stop-words.txt#stop-words.txt" -Dmapred.create.symlink=yes -p I_STOP_WORDS_FILE=stop-words.txt -p I_PARSED_DATA_DIR=<parsed_data_dir> -p O_URL_CONTENT_SEQ_DIR=<output_parsed_captures_content_for_mahout.seq> pig/parsed-captures/prepare-content-for-mahout-with-filter.pig

Now you can run the Mahout seq2sparse command on the produced output, followed by any of the available clustering and classification algorithms.
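
A minimal sketch of those follow-on steps, assuming a standard Mahout command-line installation (flags and defaults vary by Mahout version):

# Convert the sequence files of parsed content into sparse TF-IDF vectors
mahout seq2sparse -i <output_parsed_captures_content_for_mahout.seq> -o <mahout_vectors_dir> -wt tfidf -nv -ow
 
# Example: cluster the TF-IDF vectors with k-means (20 clusters, 10 iterations, cosine distance)
mahout kmeans -i <mahout_vectors_dir>/tfidf-vectors -c <initial_clusters_dir> -o <kmeans_output_dir> -k 20 -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cl -ow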

Link Analysis with Apache Pig

 Example: Compute the number of incoming and outgoing links per URL
pig -p I_ID_GRAPH_DIR=<id.graph_dir> -p O_DEGREE_ANALYSIS_DIR=<output_degree_analysis_dir> pig/graph/degree-analysis.pig
 Example: Compute the degree stats per Host/Domain from Host/Domain Graph
pig -p I_HOST_GRAPH_DIR=<host_graph_dir> -p O_HOST_DEGREE_STATS_DIR=<output_host_degree_stats> pig/graph/host-degree-stats.pig
 Example: Find common links using Link Graph (ID-Map, SortedInt-ID-Graph)

Problem: Given a set of input resources, find the set of links that are linked-to by all of these input resources.

The following code makes use of FindAndIntersectionsUsingPForDeltaDocIdSetUDF(). This UDF uses the Kamikaze library to build DocIdSets from the sorted integer destination-ID sets and then efficiently intersects these sets to find the destinations that all inputs have in common. It only works for integer IDs with values smaller than Java's Integer.MAX_VALUE (2^31 - 1).

 

# Set I_SRC_RESOURCES_DIR to the HDFS directory containing the list of resources for which we need to find common/shared links

pig -p I_SRC_RESOURCES_DIR=$I_SRC_RESOURCES_DIR -p I_ID_MAP_DIR=<id.map_dir> -p I_ID_SORTEDINT_GRAPH_NO_TS_DIR=<output_id_sortedint_graph_no_ts_dir>  -p O_COMMON_LINKS_RESOURCES_DIR=<output_common_links_resources_dir> pig/graph/find-links-in-common.pig

Link Analysis with Apache Giraph

 Example: Compute the number of incoming and outgoing links per URL
# Prepare links graph for Giraph from ID-Graph data using Pig
pig -p I_ID_GRAPH_DIR=<id.graph_dir> -p O_PR_TAB_ID_GRAPH_DIR=<output_pagerank_id.graph_dir> pig/graph/prepare-tab-delimited-pagerank-graph.pig
 
# Giraph settings
export IA_GIRAPH_JAR=$PROJECT_DIR/lib/giraph-ia-1.1.0-SNAPSHOT-for-hadoop-2.0.5-alpha-jar-with-dependencies.jar
export LVIF=org.archive.giraph.VertexWithDoubleValueLongDoubleFloatTextInputFormat
export OF=org.apache.giraph.io.formats.IdWithValueTextOutputFormat
export NUMWORKERS=10
 
# ZooKeeper and JobTracker settings 
export ZOOKEEPER_OPTS_STRING="-ca giraph.zkList=$ZOOKEEPER_HOST:$ZOOKEEPER_PORT"
export JOBTRACKER_OPTS_STRING="-Dmapred.job.tracker=$JOBTRACKER_HOST:$JOBTRACKER_PORT"

# Indegree
hadoop jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner $JOBTRACKER_OPTS_STRING org.archive.giraph.InDegreeCountComputation -vif $LVIF -vip <pagerank_id.graph_dir>/part* -of $OF -op <output_id.indegree_dir> -w $NUMWORKERS $ZOOKEEPER_OPTS_STRING

# Outdegree
hadoop jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner $JOBTRACKER_OPTS_STRING org.archive.giraph.OutDegreeCountComputation -vif $LVIF -vip <pagerank_id.graph_dir>/part* -of $OF -op <output_id.outdegree_dir> -w $NUMWORKERS $ZOOKEEPER_OPTS_STRING
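
IdWithValueTextOutputFormat writes one id<TAB>value line per vertex by default, so the highest in-degree vertices can be listed with a quick sort (a sketch; the integer IDs can be mapped back to URLs via the ID-Map):

hdfs dfs -cat <output_id.indegree_dir>/part* | sort -t$'\t' -k2,2gr | head -20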

 Example: PageRank
# Prepare links graph for Giraph from ID-Graph data using Pig
pig -p I_ID_GRAPH_DIR=<id.graph_dir> -p O_PR_TAB_ID_GRAPH_DIR=<output_pagerank_id.graph_dir> pig/graph/prepare-tab-delimited-pagerank-graph.pig
 
# Giraph settings
export IA_GIRAPH_JAR=$PROJECT_DIR/lib/giraph-ia-1.1.0-SNAPSHOT-for-hadoop-2.0.5-alpha-jar-with-dependencies.jar
export LVIF=org.archive.giraph.VertexWithDoubleValueLongDoubleFloatTextInputFormat
export OF=org.apache.giraph.io.formats.IdWithValueTextOutputFormat
export NUMWORKERS=10
 
# ZooKeeper and JobTracker settings 
export ZOOKEEPER_OPTS_STRING="-ca giraph.zkList=$ZOOKEEPER_HOST:$ZOOKEEPER_PORT"
export JOBTRACKER_OPTS_STRING="-Dmapred.job.tracker=$JOBTRACKER_HOST:$JOBTRACKER_PORT"
 
# PageRank
hadoop jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner $JOBTRACKER_OPTS_STRING org.archive.giraph.PageRankComputation -ca PageRankComputation.jumpProbability=0.15f -ca PageRankComputation.maxSupersteps=15 -vif $LVIF -vip <pagerank_id.graph_dir>/part* -of $OF -op <output_id.prscore_dir> -w $NUMWORKERS -mc org.archive.giraph.PageRankComputation\$PageRankComputationMasterCompute $ZOOKEEPER_OPTS_STRING

# Assign ranks by PageRank score using Pig
pig -p I_ID_PRSCORE_DIR=<id.prscore_dir> -p O_ID_PRRANK_DIR=<output_id.prrank_dir> pig/graph/assign-pagerank-rank.pig

 Example: Weighted PageRank (Host/Domain Graph)
# Prepare links graph for Giraph from Host/Domain graph using Pig
pig -p I_HOST_GRAPH_DIR=<host_graph_dir> -p O_PR_TAB_HOST_GRAPH_DIR=<output_pagerank_host_graph_dir> pig/graph/prepare-tab-delimited-weighted-pagerank-graph.pig
 
# Giraph settings
export IA_GIRAPH_JAR=$PROJECT_DIR/lib/giraph-ia-1.1.0-SNAPSHOT-for-hadoop-2.0.5-alpha-jar-with-dependencies.jar
export TVIF=org.archive.giraph.VertexWithDoubleValueTextDoubleFloatTextInputFormat
export OF=org.apache.giraph.io.formats.IdWithValueTextOutputFormat
export NUMWORKERS=10
 
# ZooKeeper and JobTracker settings 
export ZOOKEEPER_OPTS_STRING="-ca giraph.zkList=$ZOOKEEPER_HOST:$ZOOKEEPER_PORT"
export JOBTRACKER_OPTS_STRING="-Dmapred.job.tracker=$JOBTRACKER_HOST:$JOBTRACKER_PORT"
 
# Weighted PageRank
hadoop jar $IA_GIRAPH_JAR org.apache.giraph.GiraphRunner $JOBTRACKER_OPTS_STRING org.archive.giraph.WeightedPageRankComputation -ca WeightedPageRankComputation.jumpProbability=0.15f -ca WeightedPageRankComputation.maxSupersteps=15 -vif $TVIF -vip <pagerank_host_graph_dir>/part* -of $OF -op <output_host.prscore_dir> -w $NUMWORKERS -mc org.archive.giraph.WeightedPageRankComputation\$WeightedPageRankComputationMasterCompute $ZOOKEEPER_OPTS_STRING

# Assign ranks by PageRank score using Pig
pig -p I_ID_PRSCORE_DIR=<output_host.prscore_dir> -p O_ID_PRRANK_DIR=<output_host.prrank_dir> pig/graph/assign-pagerank-rank.pig
