
Longitudinal Graph Analysis (LGA) files provide a complete, timestamped list of which URLs link to which other URLs across an entire web archive collection.

LGA Dataset

The LGA dataset is an archival web graph file that encodes linking activity.

  • LGA datasets are built from the entirety of a web archive, from its origin through the present.
  • LGA files are roughly 1% of the size of a collection's aggregate (W)ARC files.
  • LGA files are generated quarterly as part of the ARS service (monthly generation is possible). Because LGA is an aggregate historical dataset, each generated dataset is the complete representation of linking activity for a collection.
  • The LGA dataset consists of two files, the ID-Map file and the ID-Graph file, packaged together in a single compressed archive.
    • ID-Map:
      • Contains one line for each URL in a collection and assigns a UID (unique identifier) to each URL.
      • Each line contains a JSON object with three fields: the URL's UID ("id"), the URL ("url"), and the URL in SURT form ("surt_url").
    • ID-Graph: 
      • Each line contains a JSON object with three fields: the URL's UID ("id"), the timestamp of the capture of this URL ("timestamp"), and the set of UIDs of the URLs that this URL linked to at that timestamp ("outlink_ids").
  • Examples of the file structures are below.
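
To illustrate how the two files relate, the following minimal Python sketch joins them: it builds a UID-to-URL lookup from ID-Map lines and uses it to turn the numeric "outlink_ids" of an ID-Graph line back into URLs. The sample records, URLs, and function names are hypothetical and invented for illustration; only the field names ("id", "url", "surt_url", "timestamp", "outlink_ids") come from the description above.

```python
import json

# Hypothetical sample lines, invented for illustration only.
ID_MAP_LINES = [
    '{"id": 1, "url": "http://example.org/", "surt_url": "org,example)/"}',
    '{"id": 2, "url": "http://example.org/about", "surt_url": "org,example)/about"}',
    '{"id": 3, "url": "http://example.com/", "surt_url": "com,example)/"}',
]
ID_GRAPH_LINES = [
    '{"timestamp": "20150206180648", "id": 1, "outlink_ids": [2, 3]}',
]


def load_id_map(lines):
    """Return a dict mapping each UID to its URL."""
    return {rec["id"]: rec["url"] for rec in map(json.loads, lines)}


def resolve_outlinks(graph_lines, id_to_url):
    """Yield (timestamp, source URL, list of target URLs) per ID-Graph line."""
    for rec in map(json.loads, graph_lines):
        source = id_to_url.get(rec["id"])
        # .get() returns None for any UID missing from the lookup.
        targets = [id_to_url.get(uid) for uid in rec["outlink_ids"]]
        yield rec["timestamp"], source, targets


id_to_url = load_id_map(ID_MAP_LINES)
resolved = list(resolve_outlinks(ID_GRAPH_LINES, id_to_url))
```

For a large collection the in-memory dictionary may be too big; a key-value store or a database join would be a reasonable substitute for the same lookup.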

LGA Data Structure

ID-Map: As an example, here are 5 URLs.

ID-Graph: Within the LGA ID-Graph file, the linking activity of these URLs over time is represented as shown below (this is a truncated example).

 ID-Graph:

{"timestamp":"20150209052911","id":294869,"outlink_ids":[31,31366,62596,91594,91595,129599,145417,148627,149589,160328,184019,215031,215097,246317,277534,277535,277668,277678,277679,278400,337582,338468,344883,351501,387404,451563,451869,482599,527910,539088,603042,607449,626478,631649,645718,679335,701233,704678,706981,731522,737257,737423,737436,737459,737476,737483,737503,737514,737539,772101,803348,803349,803350,803373,803374,803490,803565,803586,803590]}

{"timestamp":"20150206180648","id":294870,"outlink_ids":[62596,110007,129599,145417,148627,215031,277534,277535,277668,277678,277679,737423,737436,737459,737476,737483,737503,737514,803348,803349,803373,803374,803490,803565,803586,803590]}

{"timestamp":"20150206180648","id":294871,"outlink_ids":[32,31367,62596,80320,85633,129599,142565,145417,147278,148627,158724,167740,184020,215031,215098,246318,277534,277535,277668,277678,277679,283442,333767,336392,344927,360135,389065,393803,406981,458500,461298,469625,479158,511763,532042,543389,544607,563105,645234,660313,661415,677324,699626,733736,737423,737436,737459,737476,737483,737503,737514,737540,770012,772102,803348,803349,803373,803374,803490,803565,803586,803590]}

{"timestamp":"20150206180648","id":294872,"outlink_ids":[33,31368,62596,116600,122696,129599,144914,145417,148627,158724,165672,215031,215099,246319,277534,277535,277668,277678,277679,290755,344933,361182,362630,363626,431796,480690,487540,502930,527604,549291,567719,568093,576986,579038,596072,601635,628678,647925,665468,714856,735164,737423,737436,737459,737476,737483,737503,737514,737541,770081,772103,803348,803349,803350,803373,803374,803490,803565,803586,803590]}

{"timestamp":"20150206180648","id":294873,"outlink_ids":[34,31369,62596,116601,122697,129599,141749,145417,148627,149589,154853,159119,174201,178621,182304,184021,215031,215100,246320,277534,277535,277668,277678,277679,288058,330762,338469,344936,350564,394972,397510,405107,533865,544590,554078,565156,566969,600005,616247,631683,638182,645695,648188,682109,715133,722270,737423,737436,737459,737476,737483,737503,737514,737542,772104,803348,803349,803373,803374,803490,803565,803586,803590]}
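
Each ID-Graph line can be expanded into one timestamped edge per outlink. The sketch below does this in Python for a shortened copy of the first record above (its outlink list is cut to three IDs for brevity). It assumes the timestamp is a 14-digit YYYYMMDDhhmmss string, as the examples suggest; the time zone is not stated on this page, so the parsed value is left without one. The function name to_edges is hypothetical.

```python
import json
from datetime import datetime

# Shortened copy of the first example record (outlink list cut to three IDs).
line = '{"timestamp":"20150209052911","id":294869,"outlink_ids":[31,31366,62596]}'


def to_edges(graph_line):
    """Return a list of (source UID, target UID, capture time) tuples."""
    rec = json.loads(graph_line)
    # Assumption: 14-digit YYYYMMDDhhmmss timestamp, parsed as a naive datetime.
    captured = datetime.strptime(rec["timestamp"], "%Y%m%d%H%M%S")
    return [(rec["id"], target, captured) for target in rec["outlink_ids"]]


edges = to_edges(line)
```

An edge list in this shape can be loaded into most graph tools, and filtering on the capture time gives the link graph as of a particular crawl.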

LGA Dataset Creation Process

The LGA dataset encodes linking activity from every textual document in a collection that returned an HTTP 200 response code. LGA datasets are generated by a Hadoop-based production pipeline using Apache Pig, Java, and Python scripts. The dataset downloads as a compressed .tgz file containing two compressed .gz files, the ID-Map and the ID-Graph.
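
Because the download is a .tgz holding .gz members, the records can be streamed without unpacking everything to disk. The following Python sketch uses only the standard library. The member file names inside the archive are not given on this page, so the caller passes a name suffix; both the suffix and the function name iter_lga_records are assumptions made for illustration.

```python
import gzip
import json
import tarfile


def iter_lga_records(tgz_path, member_suffix):
    """Yield JSON objects from the first archive member whose name ends
    with member_suffix (a hypothetical name; check your own download)."""
    with tarfile.open(tgz_path, mode="r:gz") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(member_suffix):
                raw = archive.extractfile(member)
                # The member is itself gzip-compressed: decompress on the fly.
                with gzip.open(raw, mode="rt", encoding="utf-8") as lines:
                    for text in lines:
                        text = text.strip()
                        if text:  # skip blank separator lines
                            yield json.loads(text)
                return
```

A call such as iter_lga_records("collection-lga.tgz", "id-map.gz") would then yield one dictionary per line; both names in that call are placeholders.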

 
