Must reads

Nice to reads

Information on Java with respect to Heritrix/crawling

These studies are available on the archive-crawler Yahoo Group files page. They may be outdated with respect to current Java and our implementation choices:

  • G. B. Reddy, Study of synch vs. asynch IO in Java
  • G. B. Reddy, Study of multi-threaded DNS performance in Java

Others

Relevant specifications

Attachments

File | Description | Modified
crawler-requirements-2003-03.htm | Original March 2003 crawler requirement that gave rise to Heritrix project | Jun 06, 2008 by Gordon Mohr
Mohr-et-al-2004.pdf | An Introduction to Heritrix - from the Heritrix FAQ | Jan 13, 2009 by siznax
1998-Cho-efficient.pdf | Efficient Crawling Through URL Ordering | Jan 12, 2009 by siznax
1999-Heydon-javalimits.pdf | Performance Limitations of the Java Core Libraries | Jan 12, 2009 by siznax
1999-Hirai-webbase.pdf | WebBase: A repository of web pages | Jan 12, 2009 by siznax
1999-Mercator.pdf | Mercator: A Scalable, Extensible Web Crawler | Jan 12, 2009 by siznax
2000-Broder-webgraph.pdf | Graph structure in the web | Jan 12, 2009 by siznax
2000-Cho-incremental.pdf | The Evolution of the Web and Implications for an Incremental Crawler | Jan 12, 2009 by siznax
2001-Abiteboul-crawlorder.pdf | Computing web page importance without storing the graph of the web | Jan 12, 2009 by siznax
2001-Arasu-search.pdf | Searching the Web | Jan 12, 2009 by siznax
2001-Najork-breadthfirst.pdf | Breadth-First Search Crawling Yields High-Quality Pages | Jan 12, 2009 by siznax
2001-Najork-highperf.pdf | High-Performance Web Crawling | Jan 12, 2009 by siznax
2002-Guillaume-webgraph.pdf | The Web Graph: an Overview | Jan 12, 2009 by siznax
2008-IRLBot.pdf | IRLbot: Scaling to 6 Billion Pages and Beyond (2008) | Jan 12, 2009 by siznax
2008-Olston-recrawl.pdf | Recrawl Scheduling Based on Information Longevity | Jan 12, 2009 by siznax
2002-Shkapenyuk-polybot.pdf | Polybot: Design and Implementation of a High-Performance Distributed Web Crawler | Jan 13, 2009 by siznax