WANE Overview and Technical Details

WANE (Web Archive Named Entities) uses named-entity recognition tools to generate a list of the people, places, and organizations mentioned in each valid URL in a collection, along with the timestamp of that URL's capture.

WANE Dataset

The WANE dataset is generated using the Stanford Named Entity Recognizer software (http://nlp.stanford.edu/software/CRF-NER.shtml) to extract named entities from each textual resource in a collection.

  • The analyzer uses an English-model, 3-class classifier to extract names that correspond to recognized persons, organizations, and locations.
  • WANE files map one-to-one to (W)ARC files, so each of a collection's (W)ARC files has a corresponding WANE file.
  • WANE files are less than 1% of the size of their corresponding (W)ARC files.

WANE Data Structure

Each line of a WANE file is a JSON object with four fields: the URL ("url"), the capture timestamp ("timestamp"), a content digest ("digest"), and the named entities ("named_entities"), which holds arrays of "persons", "organizations", and "locations".

WANE Records

{"url":"http://dissonantwinstonsmith.wordpress.com/2014/08/24/im-sick-of/?like_comment=79&_wpnonce=0fc57aa499&replytocom=93","timestamp":"20141019212346","named_entities":{"locations":["North County","America","St. Louis County St. Louis County Police St. Louis County","St. Louis","WordPress.com","Middle East"],"organizations":["Dissonant Winston Smith Dissonant Winston Smith Menu Skip","Twitter Facebook Google","Google","Facebook","Wal-Mart","CNN","Bearcats"],"persons":["Stell","Tom Jackson","Smith","Pamela Fillingim","Darren Wilson Eric Fowler Eric Vickers Ferguson Ferguson","Ferguson","Rob Crawford","Kley","Erin Miller","darren wilson","Mike","Daniel Garrelts","Darren Wilson","Rath","Ellis Wyatt","Nick","Wilson","Mike Browns","Trayvon","Jane Jacoby","Kley Potter","Mike Brown","Michael","Michael Brown","Angela","Pablo","Jon Stewart","George Zimmerman Jamilah Nasheed KTVI","mike brown","Heather","Pamela fillingim","pamela fillingim","Susan"]},"digest":"sha1:747IKFWUCVQVXY7TX2NMYFL422T4TRQX"}


{"url":"http://www.studlife.com/archives/Sports/2006/07/25/MaintainingsomeSouthBendontheSouthHowtobeatrueWUsportsfan/","timestamp":"20141019212348","named_entities":{"locations":["Miami","Virginia","Fort Lauderdale","Wash.","Va.","Blacksburg","St. Louis","Clayton","Fla.","Chapel Hill","Michigan State","Austin","North Carolina","Michigan"],"organizations":["Student Life Archives","Edition Student Life Breaking News Alerts Student Life Weekly Digest Student Life","University of Texas","Washington University","Facebook","Gators","UF","Virginia Tech"],"persons":["Scott Kaufman-Ross","Jim Druckenmiller","Ann Arbor","Sagartz","Scott Stern","Michael Vick"]},"digest":"sha1:QBXYTSBSEMRYTL47FSNPZ3JNC4Q3WCSZ"}

WANE Dataset Creation Process

The WANE dataset encodes entities extracted from every textual document in a collection that was captured with a 200 HTTP response code. WANE datasets are generated by a Hadoop-based production pipeline that uses Stanford's NER tool, Apache Pig, Java, and Python scripts. Downloaded WANE datasets map one-to-one to their (W)ARC files and are similarly packed as concatenated, compressed records. When no entities are discovered in an entire (W)ARC file, the corresponding WANE file will be 0 bytes; these zero-byte files preserve the exact WANE-to-WARC mapping. Entity recognition quality is determined by the Stanford NER classifier; more on Stanford's NER can be found at http://nlp.stanford.edu/software/CRF-NER.shtml.
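The points above suggest a simple way to consume a downloaded WANE file: stream it record by record, treating a zero-byte file as "no entities found." This sketch assumes gzip compression and uses a made-up filename and record purely for illustration:

```python
import gzip
import json
import os
import tempfile

def read_wane(path):
    """Yield one parsed WANE record per line; a 0-byte file yields nothing."""
    if os.path.getsize(path) == 0:  # zero-byte WANE file: no entities in the (W)ARC
        return
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Demo with throwaway files standing in for real WANE downloads.
tmpdir = tempfile.mkdtemp()
wane_path = os.path.join(tmpdir, "example.wane.gz")  # hypothetical filename
with gzip.open(wane_path, "wt", encoding="utf-8") as f:
    f.write(json.dumps({"url": "http://example.com/",
                        "timestamp": "20141019212346",
                        "named_entities": {"persons": ["Jane Jacoby"],
                                           "organizations": [],
                                           "locations": ["St. Louis"]},
                        "digest": "sha1:XXXX"}) + "\n")

# A zero-byte WANE file, as produced when a whole (W)ARC yields no entities.
empty_path = os.path.join(tmpdir, "empty.wane.gz")
open(empty_path, "wb").close()

records = list(read_wane(wane_path))
print(len(records), records[0]["named_entities"]["persons"])
print(list(read_wane(empty_path)))
```

Checking the file size before opening it avoids passing an empty file to the gzip reader, which keeps the zero-byte convention from raising a decompression error.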