WANE files give researchers the named entities found in each text resource in a web archive, along with the origin URL and capture timestamp. Tracking the people, places, and organizations mentioned in a collection over time can reveal the ebb and flow of these named entities within a specific timespan, domain, group of sites, or archival web collection. Extracting entities from a web archive also allows these collections to be augmented by linking the entities to external knowledge bases and resources (such as Wikipedia), better connecting web archives to current trends in the library linked data community.
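As a rough illustration of tracking entities over time, the sketch below groups mentions by capture month. It assumes each WANE record is one JSON object per line with `url`, `timestamp` (14-digit `YYYYMMDDhhmmss`), and a `named_entities` object holding `persons`, `organizations`, and `locations` lists. These field names, the sample records, and the `mentions_by_month` helper are assumptions made for illustration and should be checked against real WANE files.

```python
import json
from collections import Counter, defaultdict

# Hypothetical sample records, one JSON object per line. The field names
# ("url", "timestamp", "named_entities", "persons", ...) are assumptions.
SAMPLE_LINES = [
    '{"url": "http://example.org/a", "timestamp": "20141003120000", '
    '"named_entities": {"persons": ["Jane Doe"], "organizations": ["Example Org"], "locations": ["Geneva"]}}',
    '{"url": "http://example.org/b", "timestamp": "20141017090000", '
    '"named_entities": {"persons": ["Jane Doe", "John Roe"], "organizations": [], "locations": ["Geneva"]}}',
    '{"url": "http://example.org/c", "timestamp": "20141102080000", '
    '"named_entities": {"persons": ["John Roe"], "organizations": ["Example Org"], "locations": []}}',
]


def mentions_by_month(lines, category):
    """Map each capture month (YYYYMM) to a Counter of entity -> number of captures."""
    timeline = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        month = record["timestamp"][:6]  # first six digits: year and month
        entities = record.get("named_entities", {}).get(category, [])
        # set() counts an entity at most once per captured resource
        timeline[month].update(set(entities))
    return timeline


if __name__ == "__main__":
    for month, counts in sorted(mentions_by_month(SAMPLE_LINES, "persons").items()):
        print(month, dict(counts))
```

Counting each entity once per capture keeps a single page that repeats a name many times from dominating the monthly totals; counting every occurrence is an equally reasonable choice, depending on the research question.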
While named entity recognition has been used in digitized text collections, its potential is still mostly unexplored for web archives. We expect a richer set of use cases and visualizations to emerge as we continue our internal research and analysis of WANE datasets and as we see novel uses of this data by researchers and users.
Named Entities in the Human Rights Collection
From one month (October 2014) of crawling over 300GB of data, we used the WANE dataset to represent the top people, organizations, and locations within this collection. Further analysis across time would allow for the studying of
Top Entities in the Ferguson Collection
The WANE dataset from four months of crawling the Internet Archive's collaborative collection of URLs related to events in Ferguson, MO featured over 650,000 named entities. The top ten person names included both expected and unexpected results.
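A top-ten list like the one described above could be produced with a simple tally. The following is a minimal sketch, not the method used for the Ferguson analysis; it assumes one JSON record per line with a `named_entities` object keyed by category, and both the `top_entities` helper and the placeholder records are hypothetical.

```python
import json
from collections import Counter


def top_entities(lines, category, n=10):
    """Return the n most frequent entities in a category as (name, count) pairs."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Counts every listed occurrence across all records (assumed field names)
        counts.update(record.get("named_entities", {}).get(category, []))
    return counts.most_common(n)


# Made-up placeholder records, serialized one JSON object per line.
DEMO_RECORDS = [
    {"url": "http://example.org/1", "timestamp": "20140901000000",
     "named_entities": {"persons": ["Person A", "Person B"]}},
    {"url": "http://example.org/2", "timestamp": "20140902000000",
     "named_entities": {"persons": ["Person A", "Person C"]}},
    {"url": "http://example.org/3", "timestamp": "20140903000000",
     "named_entities": {"persons": ["Person A", "Person B"]}},
]
DEMO_LINES = [json.dumps(r) for r in DEMO_RECORDS]

if __name__ == "__main__":
    for name, count in top_entities(DEMO_LINES, "persons", n=2):
        print(f"{count:>4}  {name}")
```

In practice, raw tallies like this tend to need cleanup (for example, merging spelling variants of the same name) before the ranking is meaningful.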
We will document other representative analytical use cases and corresponding visualizations, from our internal work and that of researchers, here and via the Archive-It blog, presentations, and papers. More coming soon!
Mining Entities in the Human Rights Collection