This page links to technical documentation describing how each dataset is generated, its format and structure, its relative size, and how it relates to, and is affected by, the computational processes involved in web archiving. Each dataset also has a corresponding page that includes example use cases, outlines some of the types of analysis possible, and presents sample data visualizations created using the dataset.
Types of Datasets Currently Available
WAT: Web Archive Transformation files feature key metadata elements that represent every crawled resource in a collection and are derived from a collection’s WARC files.
LGA: Longitudinal Graph Analysis files feature a complete list, with timestamps, of which URIs link to which other URIs within an entire collection.
WANE: Web Archive Named Entities files are generated with named-entity recognition tools and list all the people, places, and organizations mentioned in each URI in a collection, along with a timestamp of each URI capture.
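As a rough illustration of the kind of analysis a WANE-style dataset allows, the sketch below tallies entity mentions across a handful of records. It assumes, purely for illustration, that each record is one JSON object per line with `url`, `timestamp`, and `named_entities` fields; these field names, the sample data, and the `count_entities` helper are hypothetical, so consult the dataset's own documentation page for the actual schema.

```python
import json
from collections import Counter

# Hypothetical sample records. The field names ("url", "timestamp",
# "named_entities") are assumptions for illustration, not a documented schema.
SAMPLE_LINES = [
    '{"url": "http://example.org/a", "timestamp": "20140101000000", '
    '"named_entities": {"persons": ["Ada Lovelace"], '
    '"organizations": ["Royal Society"], "locations": ["London"]}}',
    '{"url": "http://example.org/b", "timestamp": "20140102000000", '
    '"named_entities": {"persons": ["Ada Lovelace"], '
    '"organizations": [], "locations": ["London", "Paris"]}}',
]


def count_entities(lines):
    """Tally entity mentions per category across line-delimited JSON records."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        for category, names in record.get("named_entities", {}).items():
            counts.setdefault(category, Counter()).update(names)
    return counts


if __name__ == "__main__":
    for category, counter in count_entities(SAMPLE_LINES).items():
        print(category, counter.most_common(3))
```

The same function could be fed the lines of a real file by passing an open file handle instead of `SAMPLE_LINES`, provided the file really is line-delimited JSON with the assumed fields.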