- Heydon, A.; Najork, M. Mercator: A Scalable, Extensible Web Crawler (wayback (http://web.archive.org/web/*/http://research.compaq.com/SRC/mercator/papers/www/paper.html)), 1999
- Heydon, A.; Najork, M. High-Performance Web Crawling, 2001
- Kimpton; Stata; Mohr. Internet Archive Crawler Requirements Analysis for Library Consortium, 2003
- Lee, H; Leonard, D; Wang, X; Loguinov, D. IRLbot: Scaling to 6 Billion Pages and Beyond (new from WWW2008)
Nice to read
- Najork, M.; Wiener, J. Breadth-First Search Crawling Yields High-Quality Pages, 2001
- Cho, J.; Garcia-Molina, H.; Page, L. Efficient Crawling Through URL Ordering, 1998
- Abiteboul, S.; Preda, M.; Cobena, G. Computing web page importance without storing the graph of the web (extended abstract), 2001
- Olston, C.; Pandey, S. Recrawl Scheduling Based on Information Longevity (new from WWW2008)
Information on Java with respect to Heritrix/crawling
- Heydon, A.; Najork, M. Performance Limitations of the Java Core Libraries (may not reflect latest Java issues; Heritrix uses a high-performance DNS package)
Find these (also may be outdated with respect to current Java and our implementation choices) at the archive-crawler Yahoo Group files page:
- Reddy, G. B. Study of synchronous vs. asynchronous I/O in Java
- Reddy, G. B. Study of multi-threaded DNS performance in Java
- Archive-crawler group files
- Cho, J.; Garcia-Molina, H. The Evolution of the Web and Implications for an Incremental Crawler, Conf. on Very Large Data Bases, 2000
- Focused Crawling: The Quest for Topic-specific Portals
- Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery, 1999, WWW8
- Intelligent Crawling on the World Wide Web with Arbitrary Predicates, 2001, WWW10
- Web Crawling High-Quality Metadata using RDF and Dublin Core, 2002, WWW11
- Stanford WebBase Project
- An Introduction to Heritrix, Mohr et al., 4th International Web Archiving Workshop, 2004
- RFC 2616: Hypertext Transfer Protocol - HTTP/1.1
- Clarifying the fundamentals of HTTP, by Jeffrey Mogul, a co-author of RFC 2616.
- RFC 3986: Uniform Resource Identifiers (URI): Generic Syntax.
- HTML 4.01 specification (from W3C).
- Although robots.txt is important for crawling, it has never been officially ratified as an RFC. The de facto minimal spec lives at robotstxt.org. Search engines have made a number of ad hoc extensions; Google recently shared some information about how GoogleBot implements the Robots Exclusion Protocol.
- RFC 1034: Domain Names - Concepts and Facilities
- RFC 1035: Domain Names - Implementation and Specification
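As an aside on RFC 3986 above: its generic syntax decomposes a URI into scheme, authority, path, query, and fragment. Java's built-in `java.net.URI` (documented against the earlier RFC 2396 grammar, but with the same component model) can illustrate the decomposition; the URL below is just a made-up example.

```java
import java.net.URI;

public class UriComponents {
    public static void main(String[] args) {
        // Hypothetical example URI, decomposed per the generic syntax.
        URI u = URI.create("http://example.org:8080/a/b?x=1#frag");
        System.out.println(u.getScheme());    // http
        System.out.println(u.getAuthority()); // example.org:8080
        System.out.println(u.getHost());      // example.org
        System.out.println(u.getPath());      // /a/b
        System.out.println(u.getQuery());     // x=1
        System.out.println(u.getFragment());  // frag
    }
}
```

Note that `java.net.URI` does strict RFC-grammar parsing only; a production crawler such as Heritrix additionally needs URI canonicalization (case-folding, percent-encoding normalization, etc.) beyond what this class provides.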
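To make the robots.txt discussion above concrete: the minimal robotstxt.org rules boil down to grouping `Disallow` records under a matching `User-agent` record and testing URL paths by prefix match. The sketch below (hypothetical helper names, not Heritrix's actual implementation, which handles far more cases and the ad hoc extensions) shows that minimal reading only.

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsSketch {
    // Collect Disallow path prefixes from groups whose User-agent record
    // is "*" or matches the given agent. Simplified: a real parser must
    // prefer the most specific matching group over the "*" group.
    public static List<String> disallowedPrefixes(String robotsTxt, String agent) {
        List<String> rules = new ArrayList<>();
        boolean applies = false;
        for (String line : robotsTxt.split("\n")) {
            int hash = line.indexOf('#');            // strip comments
            if (hash >= 0) line = line.substring(0, hash);
            line = line.trim();
            if (line.isEmpty()) continue;
            String lower = line.toLowerCase();
            if (lower.startsWith("user-agent:")) {
                String ua = line.substring("user-agent:".length()).trim();
                applies = ua.equals("*") || ua.equalsIgnoreCase(agent);
            } else if (applies && lower.startsWith("disallow:")) {
                String path = line.substring("disallow:".length()).trim();
                if (!path.isEmpty()) rules.add(path); // empty Disallow = allow all
            }
        }
        return rules;
    }

    // A path is allowed unless it starts with any disallowed prefix.
    public static boolean allowed(String path, List<String> rules) {
        for (String prefix : rules) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /tmp/\n";
        List<String> rules = disallowedPrefixes(robots, "ExampleBot");
        System.out.println(allowed("/index.html", rules));   // true
        System.out.println(allowed("/cgi-bin/query", rules)); // false
    }
}
```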