OS Wayback API Documentation

This page is changing rapidly; please check back once in a while!

Request types within Wayback

  • Capture Requests - returns information about the various captures of a specific URL
  • URL Requests - returns information about URLs captured that begin with a particular prefix
  • Replay Requests - returns a specific resource from the archive based on a URL plus a date

Capture Request URL format

http://ia360911.us.archive.org:9090/wayback/xmlquery?type=urlquery&url={URL}&startdate={DATE}&enddate={DATE}
  • url - the URL for which data should be returned. Ex. http://www.yahoo.com/
  • startdate (optional) - the earliest date boundary for which data should be returned. Partial timestamps (see below) are assumed to mean the earliest possible date given the partial timestamp.
  • enddate (optional) - the latest date boundary for which data should be returned. Partial timestamps (see below) are assumed to mean the latest possible date given the partial timestamp.
<wayback>
  <request>
    <resultsrequested>1000</resultsrequested>
    <startdate>19960101000000</startdate>
    <numresults>2</numresults>
    <type>urlquery</type>
    <enddate>20090605003233</enddate>
    <firstreturned>0</firstreturned>
    <url>enigmahistory.org/</url>
    <numreturned>2</numreturned>
    <resultstype>resultstypecapture</resultstype>
  </request>
  <results>
    <result>
      <capturedate>20020805101003</capturedate>
      <file>IA-WORLDWARS-ia400119.20080802024501.arc.gz</file>
      <urlkey>enigmahistory.org/</urlkey>
      <redirecturl>-</redirecturl>
      <url>http://www.enigmahistory.org:80/</url>
      <digest>54JBFQDKPNUQNUKYI4QT6DECP22VESEQ</digest>
      <compressedoffset>67083729</compressedoffset>
      <httpresponsecode>200</httpresponsecode>
      <mimetype>text/html</mimetype>
    </result>
    <result>
      <capturedate>20021128012534</capturedate>
      <file>IA-WORLDWARS-ia400119.20080802045222.arc.gz</file>
      <urlkey>enigmahistory.org/</urlkey>
      <redirecturl>-</redirecturl>
      <url>http://www.enigmahistory.org:80/</url>
      <digest>54JBFQDKPNUQNUKYI4QT6DECP22VESEQ</digest>
      <compressedoffset>17656542</compressedoffset>
      <httpresponsecode>200</httpresponsecode>
      <mimetype>text/html</mimetype>
    </result>
  </results>
</wayback>
  • wayback.request.url - canonicalized lookup version of requested URL (see Canonicalization below)
  • wayback.request.firstreturned - in paginated responses, record number of first result returned, zero-based
  • wayback.request.enddate - end date boundary of request, or end of current year if omitted
  • wayback.request.resultstype - string literal "resultstypecapture"
  • wayback.request.resultsrequested - maximum number of records to return in a single request
  • wayback.request.numresults - total number of results matching the query
  • wayback.request.type - string literal "urlquery"
  • wayback.request.startdate - start date boundary of request, or default for Wayback installation if omitted
  • wayback.request.numreturned - number of actual results returned in response
  • wayback.results.result.url - the closest representation of the original request URL that can be reconstructed using only data from the index
  • wayback.results.result.file - name of ARC/WARC file holding this resource
  • wayback.results.result.httpresponsecode - the server's HTTP response code to the original request
  • wayback.results.result.digest - MD5 or SHA1 digest of the HTTP payload of this resource
  • wayback.results.result.capturedate - 14-digit timestamp when this resource was captured
  • wayback.results.result.urlkey - canonicalized version of the original capture URL
  • wayback.results.result.compressedoffset - offset within the ARC/WARC file where this capture begins
  • wayback.results.result.mimetype - MIME type of capture, as reported by the server's HTTP response headers
  • wayback.results.result.redirecturl - URL which this capture redirects to, or "-" if it does not redirect
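
As a rough illustration of reading these fields, the following Python sketch parses a trimmed copy of the sample response above using only the standard library. The function name parse_capture_response is hypothetical and not part of Wayback. All values are kept as strings, exactly as they appear in the XML, except that a redirecturl of "-" is mapped to None.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample urlquery response shown above.
SAMPLE = """<wayback>
  <request>
    <numresults>2</numresults>
    <type>urlquery</type>
    <url>enigmahistory.org/</url>
    <numreturned>2</numreturned>
  </request>
  <results>
    <result>
      <capturedate>20020805101003</capturedate>
      <redirecturl>-</redirecturl>
      <httpresponsecode>200</httpresponsecode>
      <mimetype>text/html</mimetype>
    </result>
    <result>
      <capturedate>20021128012534</capturedate>
      <redirecturl>-</redirecturl>
      <httpresponsecode>200</httpresponsecode>
      <mimetype>text/html</mimetype>
    </result>
  </results>
</wayback>"""


def parse_capture_response(xml_text):
    """Return (request_fields, captures) as plain dicts of strings."""
    root = ET.fromstring(xml_text)
    request = {child.tag: child.text for child in root.find("request")}
    captures = [
        {child.tag: child.text for child in result}
        for result in root.findall("results/result")
    ]
    # A redirecturl of "-" means the capture does not redirect.
    for capture in captures:
        if capture.get("redirecturl") == "-":
            capture["redirecturl"] = None
    return request, captures


request, captures = parse_capture_response(SAMPLE)
print(request["numresults"], [c["capturedate"] for c in captures])
```

Comparing numreturned against numresults in the request block is one way a client could notice that a response was truncated by resultsrequested.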

Live web example:

http://ia360911.us.archive.org:9090/wayback/xmlquery?type=urlquery&url=http://www.enigmahistory.org/
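
The query URL format can also be assembled programmatically. The sketch below is one possible way to do it in Python; build_query_url is a hypothetical helper, and the host and port are simply copied from the examples on this page, so substitute those of your own installation. Leaving ":" and "/" unescaped in the url parameter mirrors the examples here and is an assumption about what the server accepts.

```python
from urllib.parse import urlencode

# Host, port and path copied from the examples on this page.
BASE = "http://ia360911.us.archive.org:9090/wayback/xmlquery"


def build_query_url(url, query_type="urlquery", startdate=None, enddate=None,
                    base=BASE):
    """Build an xmlquery URL; startdate and enddate are optional timestamps."""
    params = {"type": query_type, "url": url}
    if startdate is not None:
        params["startdate"] = startdate
    if enddate is not None:
        params["enddate"] = enddate
    # safe=":/" keeps the target URL readable, as in the examples on this page.
    return base + "?" + urlencode(params, safe=":/")


print(build_query_url("http://www.enigmahistory.org/"))
print(build_query_url("http://www.enigmahistory.org/", "prefixquery",
                      startdate="2002", enddate="2007"))
```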

URL Request URL format

http://ia360911.us.archive.org:9090/wayback/xmlquery?type=prefixquery&url={URL}&startdate={DATE}&enddate={DATE}
  • url - the URL prefix for which data should be returned. Ex. http://www.yahoo.com/
  • startdate (optional) - the earliest date boundary for which data should be returned. Partial timestamps (see below) are assumed to mean the earliest possible date given the partial timestamp.
  • enddate (optional) - the latest date boundary for which data should be returned. Partial timestamps (see below) are assumed to mean the latest possible date given the partial timestamp.
<wayback>
  <request>
    <resultsrequested>1000</resultsrequested>
    <startdate>19960101000000</startdate>
    <numresults>53</numresults>
    <type>prefixquery</type>
    <enddate>20090605004205</enddate>
    <firstreturned>0</firstreturned>
    <url>enigmahistory.org/</url>
    <numreturned>53</numreturned>
    <resultstype>resultstypeurl</resultstype>
  </request>
  <results>
    <result>
      <numcaptures>35</numcaptures>
      <lastcapturets>20070814013921</lastcapturets>
      <numversions>1</numversions>
      <firstcapturets>20020805101003</firstcapturets>
      <urlkey>enigmahistory.org/</urlkey>
      <originalurl>http://www.enigmahistory.org:80/</originalurl>
    </result>
    <result>
      <numcaptures>18</numcaptures>
      <lastcapturets>20080108235127</lastcapturets>
      <numversions>1</numversions>
      <firstcapturets>20021212183427</firstcapturets>
      <urlkey>enigmahistory.org/booksreviews.html</urlkey>
      <originalurl>http://www.enigmahistory.org:80/booksreviews.html</originalurl>
    </result>
    ...
  </results>
</wayback>
  • wayback.request.* - same as urlquery definition, except:
  • wayback.request.type - string literal "prefixquery"
  • wayback.request.resultstype - string literal "resultstypeurl"
  • wayback.results.result.urlkey - canonicalized version of the original capture URL
  • wayback.results.result.numversions - number of unique digests across all captures of this URL
  • wayback.results.result.numcaptures - total number of captures of this URL
  • wayback.results.result.firstcapturets - timestamp of first capture of this URL within requested date boundaries
  • wayback.results.result.originalurl - the closest representation of the original request URL that can be reconstructed using only data from the index
  • wayback.results.result.lastcapturets - timestamp of last capture of this URL within requested date boundaries
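
A prefixquery response can be read in much the same way as a urlquery response. The Python sketch below turns each result element into a small record with numeric counts; the UrlSummary class and parse_prefix_response function are hypothetical names chosen for illustration, and the sample is a trimmed copy of the response above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Trimmed copy of the sample prefixquery response shown above.
SAMPLE = """<wayback>
  <results>
    <result>
      <numcaptures>35</numcaptures>
      <lastcapturets>20070814013921</lastcapturets>
      <numversions>1</numversions>
      <firstcapturets>20020805101003</firstcapturets>
      <urlkey>enigmahistory.org/</urlkey>
      <originalurl>http://www.enigmahistory.org:80/</originalurl>
    </result>
    <result>
      <numcaptures>18</numcaptures>
      <lastcapturets>20080108235127</lastcapturets>
      <numversions>1</numversions>
      <firstcapturets>20021212183427</firstcapturets>
      <urlkey>enigmahistory.org/booksreviews.html</urlkey>
      <originalurl>http://www.enigmahistory.org:80/booksreviews.html</originalurl>
    </result>
  </results>
</wayback>"""


@dataclass
class UrlSummary:
    urlkey: str
    originalurl: str
    numcaptures: int
    numversions: int
    firstcapturets: str
    lastcapturets: str


def parse_prefix_response(xml_text):
    """Return one UrlSummary per result element."""
    root = ET.fromstring(xml_text)
    return [
        UrlSummary(
            urlkey=result.findtext("urlkey"),
            originalurl=result.findtext("originalurl"),
            numcaptures=int(result.findtext("numcaptures")),
            numversions=int(result.findtext("numversions")),
            firstcapturets=result.findtext("firstcapturets"),
            lastcapturets=result.findtext("lastcapturets"),
        )
        for result in root.findall("results/result")
    ]


summaries = parse_prefix_response(SAMPLE)
for s in summaries:
    print(s.urlkey, s.numcaptures, s.firstcapturets, s.lastcapturets)
```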

Live web example:

http://ia360911.us.archive.org:9090/wayback/xmlquery?type=prefixquery&url=http://www.enigmahistory.org/

Replay Request URL format

http://ia360911.us.archive.org:9090/wayback/replay?url={URL}&date={DATE}
  • url - the URL which should be replayed. Ex. http://www.yahoo.com/
  • date - the capture date, given as a timestamp, identifying the particular version of the URL to be returned. Partial timestamps are interpreted as the earliest capture.

If the specified date does not exactly match a capture date for the URL, the client will be redirected to the closest date that was actually captured.
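
To make the idea of "closest date" concrete, here is a small Python sketch that picks the nearest capture from a list of capture timestamps, such as the capturedate values returned by a capture request. It assumes "closest" means the smallest absolute time difference; this page does not say how the server breaks ties or computes the redirect target, so treat closest_capture as a hypothetical client-side approximation only.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # 14-digit YYYYMMDDHHmmss


def closest_capture(requested, capture_dates):
    """Return the capture timestamp nearest in time to the requested one."""
    target = datetime.strptime(requested, TIMESTAMP_FORMAT)
    return min(
        capture_dates,
        key=lambda ts: abs(datetime.strptime(ts, TIMESTAMP_FORMAT) - target),
    )


# Capture dates taken from the sample capture response on this page.
captures = ["20020805101003", "20021128012534"]
print(closest_capture("20030101000000", captures))
```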

Documents returned may be altered depending on the configuration of the Wayback installation. In some cases the resource is returned with no modifications at all; in others, HTTP headers or HTML content may be altered to enhance the in-browser replay experience.

Live web example:

http://ia360911.us.archive.org:9090/wayback/replay?date=20070814013921&url=http://www.enigmahistory.org/

Timestamp format

Timestamps are a 14-digit representation of a specific second in time, represented in UTC:

YYYYMMDDHHmmss
  • YYYY - Year. ex. 1999, 2004.
  • MM - month, 01 = Jan, 12 = Dec
  • DD - day of month, 1-based.
  • HH - hour of day, 0-based. 01 = 1 AM, 13 = 1 PM
  • mm - minute of hour, 0-based.
  • ss - second of minute, 0-based.

If a timestamp has fewer than 14 digits, it will be interpreted as either the earliest or latest possible moment, depending on the context. The timestamp "1999" interpreted as the earliest date becomes "19990101000000". The timestamp "1999" interpreted as the latest date becomes "19991231235959".
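
The expansion rule can be sketched in a few lines of Python. expand_timestamp is a hypothetical helper, not Wayback's own code, and it assumes a partial timestamp always ends on a field boundary (4, 6, 8, 10, 12 or 14 digits); how the server treats other lengths is not stated here.

```python
import calendar


def expand_timestamp(partial, latest=False):
    """Pad a partial timestamp to 14 digits (earliest or latest moment)."""
    if not (partial.isdigit() and len(partial) in (4, 6, 8, 10, 12, 14)):
        raise ValueError("expected 4, 6, 8, 10, 12 or 14 digits: %r" % partial)
    year = partial[:4]
    month = partial[4:6] or ("12" if latest else "01")
    if len(partial) >= 8:
        day = partial[6:8]
    elif latest:
        # Last day of the month, taking leap years into account.
        day = "%02d" % calendar.monthrange(int(year), int(month))[1]
    else:
        day = "01"
    given_time = partial[8:]  # whatever part of HHmmss was supplied
    default_time = "235959" if latest else "000000"
    return year + month + day + given_time + default_time[len(given_time):]


print(expand_timestamp("1999"))               # 19990101000000
print(expand_timestamp("1999", latest=True))  # 19991231235959
```

The only non-obvious case is the latest day of a month, which depends on the month and year; for example, "200402" expands to the 29th when interpreted as the latest date.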

URL Canonicalization

Wayback performs several URL normalization, or canonicalization, operations on URLs before they are inserted into the Wayback index. The same operations are performed on URLs before searching the Wayback index. Examples of canonicalization operations are:

  • removal of leading "www." from hostnames
  • lowercasing host and/or path
  • collapsing redundant path components, ex. "/images/../foo.gif" = "/foo.gif", "/./foo.gif" = "/foo.gif"
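
The three operations listed above can be approximated in Python as follows. This is only a sketch of the listed examples, not Wayback's actual canonicalizer, which may apply further rules. The function name canonicalize is hypothetical. Dropping the scheme and port 80 is an assumption based on the sample responses on this page, where http://www.enigmahistory.org:80/ has the urlkey enigmahistory.org/. Lowercasing both host and path, and leaving any query string untouched, are also assumptions.

```python
import posixpath
from urllib.parse import urlsplit


def canonicalize(url):
    """Approximate a Wayback-style lookup key for a URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]  # remove leading "www."
    if parts.port not in (None, 80):
        host += ":%d" % parts.port  # assumption: only port 80 is dropped
    path = parts.path or "/"
    collapsed = posixpath.normpath(path)  # resolves "/./" and "/../"
    if path.endswith("/") and collapsed != "/":
        collapsed += "/"  # normpath strips the trailing slash; restore it
    key = host + collapsed.lower()
    if parts.query:
        key += "?" + parts.query  # assumption: query string kept as-is
    return key


print(canonicalize("http://www.enigmahistory.org:80/"))
print(canonicalize("http://Example.com/images/../foo.gif"))
```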

Replay Extension Opportunities:

Requests can be made to the Wayback service to retrieve original documents exactly as captured by making the request in ArchivalUrl format and adding the "identitycontext=yes" request parameter to a replay request URL. Note that the original HTTP headers are also maintained, so you may need to handle some strange HTTP interactions, depending on the HTTP client library you are using.

Additionally, Wayback allows insertion of arbitrary content within the wayback service by creating a .JSP file. Different .JSP files can be configured for various document types by MIME type, or by matching fragments of the original capture URL.

When an appropriate document is retrieved from Wayback, the .JSP file will be executed on the Wayback host, and the output of the .JSP file will be inserted into the original content, and returned to the end user. Information about the context of the request causing the .JSP to execute is available to the code within the .JSP, including:

Wayback includes several example implementations of replay .JSP files:

  • Disclaimer.jsp - inserts a banner in replayed HTML pages giving information about the particular capture.
  • Timeline.jsp - inserts a navigation element at the top of each replayed HTML page allowing navigation to other versions of the same URL.
  • ClientSideReplay.jsp - inserts a SCRIPT tag which causes JavaScript to execute while the page loads. The JavaScript is responsible for modifying image and anchor URLs within the page so they direct back into the Wayback service.
  • DebugBanner.jsp - inserts a horribly ugly banner at the top of replayed HTML pages which shows some of the information available to replay .JSP pages.

Getting started with Wayback .JSP files

  • Check out the Project Home
  • Install Apache Tomcat
  • Download Wayback
  • Download a sample .ARC file
  • Build Wayback with Maven and install the .WAR file as ROOT.war (you can interrupt the build once the webapp is built...)
  • Create /tmp/wayback/files1/
  • Place the sample ARC file into /tmp/wayback/files1/
  • Start Tomcat
  • Test your install with http://localhost.archive.org:8080/wayback/*/http://peagreenboat.com/
  • Modify .JSP files, mess around with wayback.xml, ArchivalURLReplay.xml, restart Tomcat, rinse, repeat
  • Win cool stuff on Amazon (Dev8D folks only : )

Adding features deeper than the .JSP layer (modifying and building the source)

  • Get Eclipse Ganymede (optional, if you want to make "deeper changes" in the Wayback codebase)
  • Install the m2eclipse plugin (optional, if you want Eclipse, and if you want Eclipse Java navigation...)
  • Install Maven 2
  • Check out the Wayback source from SVN
  • Put on your Java Hat and hack away
  • mvn install (you can interrupt it once the web app is built; the Hadoop build is sloooow)
  • Drop your new .WAR file into Tomcat
  • Restart Tomcat