WANE Overview and Technical Details

WANE (Web Archive Named Entities) uses named-entity recognition tools to list the people, places, and organizations mentioned in the content captured at each valid URL in a collection, along with the timestamp of that URL's capture.

WANE Dataset

The WANE dataset is generated using the Stanford Named Entity Recognizer software (http://nlp.stanford.edu/software/CRF-NER.shtml) to extract named entities from each textual resource in a collection.

  • The analyzer uses an English-language, 3-class classifier model to extract names that correspond to recognized Persons, Organizations, and Locations.
  • WANE files map one-to-one to (W)ARC files, so each (W)ARC file in a collection has a corresponding WANE file.
  • WANE files are less than 1% of the size of their corresponding (W)ARC files.

WANE Data Structure

WANE data is structured as one JSON object per line. Each object holds four fields: the URL ("url"), the capture timestamp ("timestamp"), the content digest ("digest"), and the named entities ("named_entities"), which contains the arrays "persons", "organizations", and "locations".
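As an illustration of this structure, the following Python sketch parses a single WANE line into a flat dictionary. The function name parse_wane_line and every value in the sample record (URL, timestamp format, digest string, entity names) are made up for the example; only the field names come from the description above.

```python
import json


def parse_wane_line(line):
    """Parse one WANE JSON line into a flat dictionary of its fields."""
    record = json.loads(line)
    # "named_entities" holds the three entity arrays; default to empty lists
    # so that a record with a missing array does not raise an error.
    entities = record.get("named_entities", {})
    return {
        "url": record.get("url"),
        "timestamp": record.get("timestamp"),
        "digest": record.get("digest"),
        "persons": entities.get("persons", []),
        "organizations": entities.get("organizations", []),
        "locations": entities.get("locations", []),
    }


# Hypothetical record: all values below are invented for illustration.
sample_line = json.dumps({
    "url": "http://example.com/news/story.html",
    "timestamp": "20140101120000",
    "digest": "sha1:EXAMPLEDIGEST",
    "named_entities": {
        "persons": ["Jane Doe"],
        "organizations": ["Example Corp"],
        "locations": ["San Francisco"],
    },
})

parsed = parse_wane_line(sample_line)
print(parsed["url"], parsed["persons"], parsed["locations"])
```

Flattening the nested "named_entities" object this way is only a convenience for downstream tabular tools; the original nesting can be kept if preferred.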

WANE Dataset Creation Process

The WANE dataset encodes entities from every textual document in a collection that was captured with an HTTP 200 response code. WANE datasets are generated by a Hadoop-based production pipeline that uses Stanford's NER tool, Apache Pig, Java, and Python scripts.

Downloaded WANE files map one-to-one to (W)ARC files and are similarly packed as concatenated, compressed records. When no entities are recognized anywhere in a (W)ARC file, the corresponding WANE file is zero bytes long. These zero-byte files preserve the exact WANE-to-(W)ARC mapping.

Entity recognition relies on the Stanford NER classifier. More on Stanford's NER can be found at http://nlp.stanford.edu/software/CRF-NER.shtml.
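A minimal sketch of reading a downloaded WANE file in Python follows. It assumes the compression is gzip, as is common for (W)ARC files; the text above says only "compressed", so this should be verified against the actual files. The function name iter_wane_records is hypothetical. Python's gzip module reads concatenated gzip members as one continuous stream, and the sketch skips zero-byte files explicitly.

```python
import gzip
import json
import os


def iter_wane_records(path):
    """Yield one parsed JSON record per line of a WANE file.

    Assumption: the file is gzip-compressed, possibly as several
    concatenated gzip members.
    """
    # A zero-byte WANE file means no entities were found in the
    # corresponding (W)ARC file, so there is nothing to read.
    if os.path.getsize(path) == 0:
        return
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # ignore blank lines between records
                yield json.loads(line)
```

A caller could then loop over the records of one file, for example `for record in iter_wane_records(path): print(record["url"])`, where `path` points to a local WANE file.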