
The Internet Archive has a long tradition of providing domain-scale web harvests, often on behalf of National Libraries. As a key provider of web archiving technologies and services, the Internet Archive has made available open source software for crawling and access, enabling national bodies to undertake web archiving locally.


Example projects:



National Libraries

The Internet Archive has worked with national libraries and archives since 1998, performing domain-scale crawling, often full country code top-level domain (ccTLD) harvests of over 1 billion URLs. We have worked with partners such as the Library of Congress, the National Library of Australia, the National Library of Israel, the National Library of New Zealand, the National Library of Spain, the National Library of Luxembourg, the Swiss National Library, the National Library of Sweden, and the National Library of Ireland, as well as national archives such as the U.S. National Archives and Records Administration.

Example crawl report from a national library partner

Harvesting tools


We use the open-source Heritrix software to perform web crawling for harvests, along with Umbra and Brozzler, browser-based tools that allow the crawler to imitate human interactions with the Web, such as executing JavaScript by clicking or hovering the mouse over different Web page elements and scrolling down a page. This allows for the discovery and archiving of content generated by user actions. Heritrix, Umbra, and Brozzler were all developed at the Internet Archive, and our engineers continue to lead development of these tools.
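
To make the browser-based approach concrete, here is a minimal sketch of JavaScript-aware link discovery. It is illustrative only, not the Umbra or Brozzler implementation: it assumes Selenium with a local headless Chrome install, and the target URL is a placeholder.

# Illustrative sketch of browser-based link discovery, not the actual Umbra/Brozzler code.
# Assumes Selenium plus a local Chrome/chromedriver install; the URL is a placeholder.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

def discover_outlinks(url):
    options = Options()
    options.add_argument("--headless")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Scroll to the bottom so lazily loaded content is rendered before harvesting.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Collect links that may only exist after JavaScript has executed.
        anchors = driver.find_elements(By.CSS_SELECTOR, "a[href]")
        return {a.get_attribute("href") for a in anchors}
    finally:
        driver.quit()

if __name__ == "__main__":
    for link in sorted(discover_outlinks("https://example.org/")):
        print(link)

The production tools do considerably more, including recording the captured traffic into WARC files, but the scroll-then-harvest pattern shown here is the core idea behind archiving user-action generated content.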

 

Government Web Harvesting

A web archive portal page designed by the Internet Archive for the U.S. National Archives and Records Administration.


  • The End of Term project is a collaborative effort to archive the entire United States federal government web presence every four years, at times of administration transition. Harvests have been conducted in 2008, 2012, and 2016, and the project has collected over 300 TB of data.

  • Many of our domain crawls also specifically target the .gov domain

Grant-funded Crawls and Special Projects

News Measures Research Project

  • Crawled and archived the homepages of 663 local news websites representing 100 communities across the United States. Seven one-day crawls were run from July through September, capturing over 2.2 TB of unique data and 16 million URLs.

Wikipedia  

  • Rescued more than a million broken Wikipedia outlinks, replacing them with links to archived pages from the Wayback Machine (a snapshot-lookup sketch follows this list).
  • Automatically archived all links created by Wikipedia users.
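
As a rough sketch of the lookup step behind that link-rescue work (the production tooling is more involved), the example below asks the public Wayback Machine availability API for the closest archived snapshot of a URL; the sample URL and timestamp are placeholders.

# Minimal sketch: look up the closest archived snapshot for a (possibly dead) URL
# via the public Wayback Machine availability API. Error handling is kept minimal.
import json
import urllib.parse
import urllib.request

def closest_snapshot(url, timestamp=None):
    params = {"url": url}
    if timestamp:  # optional YYYYMMDD target date
        params["timestamp"] = timestamp
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen("https://archive.org/wayback/available?" + query) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    # Return the Wayback URL if a usable capture exists, otherwise None.
    if closest and closest.get("available"):
        return closest["url"]
    return None

if __name__ == "__main__":
    print(closest_snapshot("http://example.com/old-page.html", "20160101"))

A bot can then substitute the returned web.archive.org URL for the dead link in the article.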

WordPress

  • We worked with Automattic to get a feed of new posts made to WordPress.com blogs and self-hosted WordPress sites. We crawl the posts themselves, as well as all of their outlinks and embedded content, about 3,000,000 URLs per day (an illustrative sketch of this pipeline follows below).
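
To illustrate that pipeline, here is a minimal sketch that reads a feed of new posts and collects each post plus its outlinks and embedded resources as candidate crawl URLs. The feed URL and the libraries used (feedparser, BeautifulSoup) are assumptions for the example, not our production stack.

# Illustrative sketch of a feed-driven crawl queue; the feed URL and library choices
# are placeholders for the example, not the production pipeline.
import urllib.request
from urllib.parse import urljoin

import feedparser
from bs4 import BeautifulSoup

FEED_URL = "https://example.wordpress.com/feed/"  # placeholder feed of new posts

def crawl_candidates(feed_url):
    urls = set()
    for entry in feedparser.parse(feed_url).entries:
        post_url = entry.link
        urls.add(post_url)  # the post itself
        with urllib.request.urlopen(post_url) as resp:
            soup = BeautifulSoup(resp.read(), "html.parser")
        # Outlinks and embedded content referenced by the post.
        for a in soup.find_all("a", href=True):
            urls.add(urljoin(post_url, a["href"]))
        for tag in soup.find_all(["img", "script", "iframe"], src=True):
            urls.add(urljoin(post_url, tag["src"]))
    return urls

if __name__ == "__main__":
    for url in sorted(crawl_candidates(FEED_URL)):
        print(url)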