Indexing and Crawl Analysis Services


The Internet Archive has developed a number of production-level post-crawl indexing and analytical services that can also be provided to domain harvesting partners upon request.

Full-text Search Indexing

The Internet Archive can provide full-text search services in two ways: by generating extracted full-text files for partners to use in their own locally provisioned search infrastructure, or as a fully hosted full-text search service maintained by IA as part of our Elasticsearch deployment, with endpoints that can be embedded in local websites or portals.
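To illustrate the first delivery mode, the sketch below assumes a hypothetical extracted full-text format of one JSON object per line (the field names and layout here are illustrative, not IA's actual delivery schema) and shows how a partner might build a minimal inverted index from it for local search:

```python
from collections import defaultdict
import json

# Hypothetical extracted full-text delivery: one JSON object per line.
# Field names ("url", "text") are assumptions for this sketch.
jsonl = """\
{"url": "https://example.gov/a", "text": "annual budget report for the fiscal year"}
{"url": "https://example.gov/b", "text": "committee hearing schedule and budget notes"}
"""

def build_index(lines):
    """Map each lowercase token to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for line in lines.splitlines():
        doc = json.loads(line)
        for token in doc["text"].lower().split():
            index[token].add(doc["url"])
    return index

index = build_index(jsonl)
print(sorted(index["budget"]))  # both pages mention "budget"
```

A production deployment would of course hand these records to a real search engine such as Elasticsearch rather than an in-memory dictionary, but the data flow is the same: extracted text in, tokenized index out.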

An example of a fully hosted instance: https://www.webharvest.gov/ -- we (IA) designed, built, and maintain this website as part of our web archiving of all U.S. Congressional websites under contract with the U.S. National Archives and Records Administration (NARA).

An example of a partially hosted instance: http://eotarchive.cdlib.org/ -- IA handles all full-text search and replay of archived pages, while the California Digital Library (CDL) hosts the website front-end portal.

Derivative Datasets

Derivative datasets can be generated from the WARC files of a domain crawl. These datasets capture key metadata points, such as meta-tags or named entities, and are often valuable for partner internal use or for researchers in data mining and computational research efforts. The derivative datasets that can be generated from domain crawls are WAT files (web metadata) and WANE files (named entities).
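WAT records carry their metadata as JSON. The sketch below uses a simplified payload whose nesting mirrors the common Envelope / Payload-Metadata / HTML-Metadata layout of WAT files (real records carry many more fields) and shows how a researcher might pull out the meta-tags:

```python
import json

# Simplified WAT-style JSON payload; the nesting is modeled on the
# typical WAT Envelope layout, trimmed down for illustration.
wat_payload = json.loads("""
{
  "Envelope": {
    "WARC-Header-Metadata": {"WARC-Target-URI": "https://example.org/"},
    "Payload-Metadata": {
      "HTTP-Response-Metadata": {
        "HTML-Metadata": {
          "Head": {
            "Title": "Example Page",
            "Metas": [
              {"name": "description", "content": "A sample page"},
              {"name": "keywords", "content": "sample, example"}
            ]
          }
        }
      }
    }
  }
}
""")

def extract_metas(payload):
    """Pull the <meta> name/content pairs out of a WAT-style JSON payload."""
    head = (payload.get("Envelope", {})
                   .get("Payload-Metadata", {})
                   .get("HTTP-Response-Metadata", {})
                   .get("HTML-Metadata", {})
                   .get("Head", {}))
    return {m["name"]: m["content"] for m in head.get("Metas", []) if "name" in m}

print(extract_metas(wat_payload))
```

In practice one would iterate over a full WAT file with a WARC-aware reader and apply this extraction to each record's JSON payload.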

Language Identification

Language identification works by analyzing IA’s historical web archive to identify sites in a particular language that are not hosted on the ccTLD of that language’s primary country. As the largest global-scale web archiving institution, we have the ability to mine global web crawls, as well as historical content in the Wayback Machine, and apply language detection for the discovery of websites in specific languages. Our language identification pipeline is built on the same CLD2 language detection libraries used by the Chrome browser, with improvements added by IA engineers for CLD2’s use on web archives. We currently provide language identification services to a number of national library domain harvesting partners.

This analysis can be done pre-crawl to aid in content discovery, or applied after a crawl has run, allowing a partner to identify the languages represented within their own domain harvest.
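The off-ccTLD discovery step can be sketched as follows. This is a minimal illustration, assuming canned (url, detected-language) pairs and a tiny ccTLD-to-language table; a real pipeline would run CLD2-based detection over crawl content and use a complete ccTLD mapping:

```python
from urllib.parse import urlparse

# Illustrative ccTLD -> primary-language table (assumption for this sketch).
CCTLD_LANG = {"fr": "fr", "de": "de", "is": "is", "dk": "da"}

# (url, detected language) pairs, as a detection pass might emit them.
detections = [
    ("https://journal.example.fr/", "fr"),
    ("https://saga.example.com/", "is"),   # Icelandic content off the .is ccTLD
    ("https://nyheder.example.dk/", "da"),
]

def off_cctld(pairs, lang):
    """URLs detected as `lang` that are not hosted under that language's ccTLD."""
    hits = []
    for url, detected in pairs:
        if detected != lang:
            continue
        tld = urlparse(url).hostname.rsplit(".", 1)[-1]
        if CCTLD_LANG.get(tld) != lang:
            hits.append(url)
    return hits

print(off_cctld(detections, "is"))  # ['https://saga.example.com/']
```

The `.fr` and `.dk` sites are filtered out because their detected language matches their ccTLD; only the Icelandic site on a generic TLD is flagged for discovery.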

Extended CDX

In addition to the standard CDX files generated as part of the domain harvesting service, a special “Extended CDX” can also be produced. These CDX files contain three additional fields: a language identification entry (per 3.3); a SHA256 checksum (cryptographically stronger than the SHA1 in standard CDX); and a Simhash value, which allows for similarity and near-duplicate detection between files (web archive deduplication currently uses exact-match SHA1 comparison).
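To show why Simhash enables near-duplicate detection where an exact SHA1 match cannot, here is a toy sketch of both extra checksums. The Simhash here is a deliberately simplified 64-bit version over whitespace tokens (production implementations typically use shingling and feature weighting):

```python
import hashlib

def sha256_digest(payload: bytes) -> str:
    """SHA256 of a record payload, as carried in the Extended CDX."""
    return hashlib.sha256(payload).hexdigest()

def simhash(text: str, bits: int = 64) -> int:
    """Toy Simhash: hash each token, then take the sign of the per-bit sums."""
    vector = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(vector) if v > 0)

def hamming(a: int, b: int) -> int:
    """Bits that differ between two Simhash values; small means similar."""
    return bin(a ^ b).count("1")

a = simhash("the quick brown fox jumps over the lazy dog")
b = simhash("the quick brown fox jumped over the lazy dog")
c = simhash("completely unrelated page about tax forms")
print(hamming(a, b), hamming(a, c))  # near-duplicates differ in far fewer bits
```

Two captures whose text differs by a single word produce identical Simhash values in most bit positions, so a small Hamming-distance threshold catches them as near-duplicates, whereas their SHA1/SHA256 digests are entirely different.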