Wayback Installation and Configuration Guide

Overview

This document provides step-by-step installation and configuration instructions for the Internet Archive's Wayback Machine.  It is intended for partners who want to set up access to their collections with a minimum of technical effort.  The instructions are written for Unix.  Deploying Wayback in Windows environments is not currently supported.

Audience

This document is intended for partners who want to view their collections of Internet content using the Wayback software.  The reader should have a basic understanding of the Unix command line, the ability to download files from the Internet, and knowledge of networking concepts such as HTTP, hostnames, ports, and IP addresses.

Collection Delivery

This document assumes two methods of collection delivery.  A collection can either be delivered on physical media, such as a hard disk, or by download.  In the case of download, the partner logs into an Internet Archive server and transfers the files in the collection to their local computing resources.  The procedure for transferring files is beyond the scope of this document.  This document describes configuring the Wayback software after the collection files have been delivered.

Collection Files

Collections are made up of two types of files: CDX files and WARC files.

| File Type | Description |
| --- | --- |
| WARC File | A WARC file contains archived Internet content.  Data such as Web pages, PDF files, and images are stored in WARC files. |
| CDX File | A CDX file is an index into a WARC file.  An index is a file that efficiently maps a specific piece of information, such as a URL, into another piece of information.  A CDX file maps the combination of a URL and timestamp into the content of the URL captured at that time.  CDX files are automatically generated by the Wayback Web application.  They are not delivered on the physical media or through file transfer. |

For example, a CDX file might map the URL http://myurl.org/index.html and the timestamp March 3, 2001 at 4:33pm GMT into the content of the Web page that was captured at that time.  The Web page itself is stored in the WARC file.  The content of the Web page might look like this:

    <html><body>My URL Home Page</body></html>
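
To make the idea of an index concrete, the following shell sketch looks up a URL and timestamp in a small index and prints where the capture is stored.  The four-column layout, the WARC file name, and the byte offsets are illustrative assumptions, not the actual CDX layout that Wayback generates.  The timestamp 20010303163300 is March 3, 2001 at 4:33pm GMT written as 14 digits.

```shell
#!/bin/sh
# Simplified, hypothetical index: URL, 14-digit timestamp, WARC file, byte offset.
# Real CDX files contain more fields; this only illustrates the lookup idea.
index_file=$(mktemp)
cat > "$index_file" <<'EOF'
http://myurl.org/index.html 20010303163300 example-0001.warc.gz 1024
http://myurl.org/about.html 20010303163412 example-0001.warc.gz 8192
EOF

# lookup_capture URL TIMESTAMP: print "warc-file offset" for an exact match,
# and return non-zero when the index has no such capture.
lookup_capture() {
    awk -v url="$1" -v ts="$2" \
        '$1 == url && $2 == ts { print $3, $4; found = 1 }
         END { exit (found ? 0 : 1) }' "$index_file"
}

result=$(lookup_capture "http://myurl.org/index.html" 20010303163300)
echo "$result"    # prints: example-0001.warc.gz 1024
```

The point of the sketch is that the index holds only a pointer (file and offset), while the page content itself stays in the WARC file.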

Concepts

The concepts of collections and access points are fundamental to understanding how the Wayback software operates.  Also, understanding these concepts helps to clarify the installation steps.

Collections

Conceptually, a collection is a set of topically related Internet content.  For example, a collection can represent all the Web pages related to the California state government.  Technically, a collection is a repository of physical artifacts: WARC files that contain the archived Internet content and CDX files that index it.  For more information on WARC and CDX files, see Collection Files.

Access Points

An access point is a Web-accessible view into a collection and is made available through a URL.  An access point has many configurable properties.  For example, an access point can be configured to allow access to only a specific set of archived files.  Multiple access points can point to the same collection.  This is useful when different levels of security or different formatting options are required for different sets of users.

This document describes installing and generating the physical artifacts of a collection and the configuration of access points that point to these physical artifacts.

Unix System Requirements

| Requirement | Description |
| --- | --- |
| Unix | A computer running a Unix-like operating system.  This document is based on Ubuntu, a Linux distribution.  Ubuntu can be downloaded from http://www.ubuntu.com. |
| Java Runtime Environment (JRE) 1.5 or greater | JRE 1.5 or greater for Unix.  The JRE for Unix can be downloaded from http://www.java.com/en/download/index.jsp. |
| Tomcat 6.0 | Tomcat is a Java Servlet container.  Wayback runs as a Servlet application within Tomcat.  Tomcat 6.0 can be downloaded from http://tomcat.apache.org/download-60.cgi. |
| WARC files | The WARC files to be viewed through Wayback must be accessible to Wayback from a local or remotely-mounted file system.  These files usually exist on the physical media (hard disk) that is delivered to the partner.  They may also be delivered via file transfer from an Internet Archive server. |
| Wayback 1.6.0 | Wayback consists of a Java Servlet application and related files.  The Servlet application runs within the Tomcat Servlet container.  Wayback can be downloaded from http://sourceforge.net/projects/archive-access/files/.  Version 1.6.0 is the recommended version of Wayback. |
| Unix administrator access | Installing software and hardware requires a user with 'root' permissions. |
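
Some of these requirements can be checked from the command line before installation begins.  The following is a minimal sketch, assuming the JRE location is held in the JAVA_HOME environment variable and the WARC directory is passed as an argument.  The function name check_requirements is hypothetical and is not part of Wayback.

```shell
#!/bin/sh
# check_requirements WARC_DIR
# Hypothetical helper: reports missing prerequisites on stderr and returns
# non-zero if any check fails. JAVA_HOME is expected to point at the JRE.
check_requirements() {
    warc_dir=$1
    status=0
    if [ -z "${JAVA_HOME:-}" ] || [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "JRE not found: set JAVA_HOME to the JRE installation directory" >&2
        status=1
    fi
    if [ ! -d "$warc_dir" ] || [ ! -r "$warc_dir" ]; then
        echo "WARC directory is not readable: $warc_dir" >&2
        status=1
    fi
    return "$status"
}
```

The function prints nothing and returns 0 when both checks pass.  It does not verify the JRE version or the Tomcat installation.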

Unix Conventions

| Convention | Description |
| --- | --- |
| /home/user | In this document the /home/user directory is used for installation of all software.  For production installation use the appropriate installation location for your environment.  Check with your system administrator for the appropriate installation location. |
| jre-6u24-linux-i586.bin | In this document the file jre-6u24-linux-i586.bin is a placeholder for the Java Runtime Environment (JRE) installation file.  The installation file name will vary depending on the version of Java you choose. |
| /home/user/jre1.6.0_24 | In this document the directory /home/user/jre1.6.0_24 is the default JRE installation directory.  The installation directory name will vary depending on the version of Java you choose. |
| apache-tomcat-6.0.32.tar.gz | In this document the file apache-tomcat-6.0.32.tar.gz is a placeholder for the Tomcat 6.0 installation file.  The installation file name will vary depending on the version of Tomcat 6.0 you choose. |
| /home/user/apache-tomcat-6.0.32 | In this document the directory /home/user/apache-tomcat-6.0.32 is the default Tomcat installation directory.  The installation directory name will vary depending on the version of Tomcat you choose.  In this document, the directory is also referred to as <tomcat_install_dir>. |
| wayback-1.6.0.tar.gz | In this document the file wayback-1.6.0.tar.gz is the default Wayback installation file. |
| wayback-1.6.0.war | In this document the file wayback-1.6.0.war is the default Wayback Web application file. |
| /file_system/warcs | This is the file system and directory storing the delivered WARC files.  WARC files are delivered in compressed (gzip) format. |
| /my_wayback_big_file_system/collection_files_index | This is the file system and directory used to store CDX files.  CDX files are generated automatically from WARC files that are read by the Wayback Servlet application. |
| wayback.myserver.org:8080 | This is the hostname and port of the server running the Wayback Web application. |

The following diagram provides a simple logical overview of a Wayback Web application installation.

URLs and Web Applications

Apache Tomcat is responsible for running the Wayback Web application.  Web applications are usually packaged in a file with a .war extension.  For example, version 1.6.0 of the Wayback Web application is packaged in a file named wayback-1.6.0.war.  Installing a Web application on Tomcat involves copying the .war file into the <tomcat_install_dir>/webapps directory.  When Tomcat starts, it reads the .war files in the webapps directory and makes their functionality available to users via URLs.

The default URL used to access a Web application usually depends on the name of the Web application.  For example, a Web application named myapplication.war will be accessible at the URL, http://localhost:8080/myapplication.  The following table illustrates the relationship between URLs and Web application names.  The information in the table assumes that Tomcat is running on a computer named localhost on port 8080.

| Web Application File Name | URL |
| --- | --- |
| myapplication.war | http://localhost:8080/myapplication |
| myapplication2.war | http://localhost:8080/myapplication2 |
| myapplication3.war | http://localhost:8080/myapplication3 |

To simplify the URL that is used to access Wayback, the Wayback .war file can be installed as the "root" application.  The "root" Web application is accessible at the URL '/'.  For example, on a computer named localhost running Tomcat on port 8080, the URL to access Wayback would be http://localhost:8080/.

The "root" application is named ROOT.war.  To make a Web application .war file the "root" application, it must be renamed to ROOT.war.  The default procedure for installing Wayback, which is described below, assumes that the user will run Wayback as the "root" application.
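
The naming rule described in this section can be sketched as a small shell function.  This is only an illustration of the default mapping between .war file names and URLs; the function name war_url and the default base URL are assumptions made for the example.

```shell
#!/bin/sh
# war_url WAR_FILE [BASE_URL]
# Prints the default URL at which Tomcat serves the given .war file.
# ROOT.war maps to "/"; any other name maps to "/<name without .war>".
war_url() {
    base=${2:-http://localhost:8080}
    name=$(basename "$1" .war)
    if [ "$name" = "ROOT" ]; then
        echo "$base/"
    else
        echo "$base/$name"
    fi
}

war_url myapplication.war    # prints: http://localhost:8080/myapplication
war_url ROOT.war             # prints: http://localhost:8080/
```

Under this rule, leaving the file named wayback-1.6.0.war would place Wayback at a URL ending in /wayback-1.6.0, which is why the installation procedure renames it to ROOT.war.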

The following diagram describes the physical layout of a basic Wayback installation.  Note that the diagram shows a very specific Wayback setup. Other configurations can be created depending on your environment.  For example, the Wayback software can be installed on the delivered physical media (hard disk) containing the WARC files.

The installation instructions assume the collection is delivered on physical media (a hard drive or drives) or through file transfer.  In either case, the destination directory of the WARC files is assumed to be /file_system/warcs.

Unix Installation

  1. Install the physical media containing WARC files or download the WARC files from an Internet Archive server.  Installing physical media involves attaching the hard drive delivered from Internet Archive to a computer and mounting the pre-configured file system. The procedure for installing physical media is beyond the scope of this document.  For file transfer options, please contact your Internet Archive representative.
  2. Ensure that the file system on which the Wayback Web application will be running can access the delivered WARC files.
  3. Download the JRE into /home/user. The JRE can be downloaded from http://www.java.com/en/download/index.jsp.
  4. Install the JRE.

    cd /home/user
    chmod +x jre-6u24-linux-i586.bin
    ./jre-6u24-linux-i586.bin
  5. Set the JAVA_HOME environment variable.  A best practice is to put this command in your startup script, such as .bashrc.

    export JAVA_HOME=/home/user/jre1.6.0_24
  6. Download Tomcat into /home/user.  Tomcat can be downloaded from http://tomcat.apache.org/download-60.cgi.
  7. Install Tomcat.

    cd /home/user
    gzip -d apache-tomcat-6.0.32.tar.gz
    tar xvf apache-tomcat-6.0.32.tar
  8. Start Tomcat and test that it is working.

    /home/user/apache-tomcat-6.0.32/bin/startup.sh

    Open the URL: http://wayback.myserver.org:8080.  The page should look similar to the image below.  If it does, Tomcat is working.

  9. Shut down Tomcat.

    /home/user/apache-tomcat-6.0.32/bin/shutdown.sh
  10. Download the latest version of Wayback.  Wayback can be downloaded from http://sourceforge.net/projects/archive-access/files/.
  11. Expand the Wayback software archive into /home/user.

    cd /home/user
    gzip -d wayback-1.6.0.tar.gz
    tar xvf wayback-1.6.0.tar
  12. Remove all the files under the Tomcat webapps directory.

    rm -rf /home/user/apache-tomcat-6.0.32/webapps/*
  13. Install the Wayback .war file into Tomcat.  Make sure that Tomcat is not running.

    cp /home/user/wayback/wayback-1.6.0.war /home/user/apache-tomcat-6.0.32/webapps
  14. Rename the Wayback .war file so that it is the Tomcat "root" Web application.

    cd /home/user/apache-tomcat-6.0.32/webapps
    mv wayback-1.6.0.war ROOT.war
  15. Start Tomcat.

    /home/user/apache-tomcat-6.0.32/bin/startup.sh
  16. Open the URL: http://wayback.myserver.org:8080/.  The page should look similar to the image below.  If it does, you have successfully installed the Wayback Web application.
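
Steps 12 through 14 can be combined into one helper.  The following is a sketch under stated assumptions, not an official installer: the function name install_root_war is hypothetical, and it expects the Tomcat installation directory and the path to the Wayback .war file as arguments.  As in step 13, Tomcat should not be running when it is used.

```shell
#!/bin/sh
# install_root_war TOMCAT_DIR WAR_FILE
# Clears TOMCAT_DIR/webapps and installs WAR_FILE there as ROOT.war
# (steps 12 to 14). Run it only while Tomcat is stopped.
install_root_war() {
    tomcat_dir=${1:?usage: install_root_war TOMCAT_DIR WAR_FILE}
    war_file=${2:?usage: install_root_war TOMCAT_DIR WAR_FILE}
    webapps="$tomcat_dir/webapps"

    # Refuse to continue unless both inputs exist, so rm never sees a bad path.
    [ -d "$webapps" ] || { echo "no webapps directory: $webapps" >&2; return 1; }
    [ -f "$war_file" ] || { echo "no such .war file: $war_file" >&2; return 1; }

    rm -rf "${webapps:?}"/*
    cp "$war_file" "$webapps/ROOT.war"
}
```

With the conventions used in this document, the call would be install_root_war /home/user/apache-tomcat-6.0.32 /home/user/wayback/wayback-1.6.0.war.  Copying directly to the name ROOT.war replaces the separate copy and rename steps.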

Unix Configuration of a Single AccessPoint

Configuration involves modifying the Wayback installation so that it can read the delivered WARC files and create the CDX files associated with the WARCs.

  1.  Shut down Tomcat if it is not already shut down.

    /home/user/apache-tomcat-6.0.32/bin/shutdown.sh
  2.  Modify the wayback.xml configuration file.  The wayback.xml file is located in the /home/user/apache-tomcat-6.0.32/webapps/ROOT/WEB-INF directory.  The following edits must be made to the wayback.xml file.

    | XML Value Name | Description |
    | --- | --- |
    | wayback.basedir | This is the name of the directory that will contain the generated CDX files. |
    | wayback.urlprefix | This is the URL that is used to access the Wayback Web application. |
    | 8080:wayback | All instances of the text "8080:wayback" must be replaced by the port number on which your instance of Tomcat is running.  For example, if your instance of Tomcat is running on port 8096 then the text "8080:wayback" must be changed to "8096". |

    These attributes can be seen in the following XML snippets from the wayback.xml file.

    ...  
    <bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
        <property name="properties">
          <value>
            wayback.basedir=/tmp/wayback
            wayback.urlprefix=http://localhost.archive.org:8080/
         </value>
        </property>
    </bean>
    ...
    ...  
    <bean name="8080:wayback" class="org.archive.wayback.webapp.AccessPoint">
        <property name="serveStatic" value="true" />
        <property name="bounceToReplayPrefix" value="false" />
        <property name="bounceToQueryPrefix" value="false" />
    ...
    1.  Change the wayback.basedir value to the name of an empty directory that will contain the CDX files.  This directory should be located on a file system that is large enough to store all the CDX files generated from the WARC files.  A conservative estimate is that CDX files will take up 5% of the space used by WARC files.
    2.  Change the wayback.urlprefix value to the name of the Wayback Web application URL.
    3.  Search for the value "8080:wayback" and replace it with the port number on which your Tomcat instance is running.  On Unix, you can use the vi editor to modify the file using the following commands.  This example assumes Tomcat is running on port 8080.

      cd /home/user/apache-tomcat-6.0.32/webapps/ROOT/WEB-INF
      vi wayback.xml
      :%s/8080:wayback/8080/g
      :wq
      

      A modified wayback.xml is shown below.

      ...
        <bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
          <property name="properties">
            <value>
              wayback.basedir=/my_wayback_big_file_system/collection_files_index
              wayback.urlprefix=http://wayback.myserver.org:8080/
           </value>
          </property>
        </bean>
      ...
      
      ...
        <bean name="8080" class="org.archive.wayback.webapp.AccessPoint">
          <property name="serveStatic" value="true" />
          <property name="bounceToReplayPrefix" value="false" />
          <property name="bounceToQueryPrefix" value="false" />
      ...
  3. Modify the BDBCollection.xml file.  The BDBCollection.xml file is located in the /home/user/apache-tomcat-6.0.32/webapps/ROOT/WEB-INF directory.  The section of BDBCollection.xml to modify is shown below.

    ...  
    <bean id="datadirs" class="org.springframework.beans.factory.config.ListFactoryBean">
        <property name="sourceList">
          <list>
            <bean class="org.archive.wayback.resourcestore.resourcefile.DirectoryResourceFileSource">
              <property name="name" value="files1" />
              <property name="prefix" value="/tmp/wayback/files1/" />
            </bean>
            <bean class="org.archive.wayback.resourcestore.resourcefile.DirectoryResourceFileSource">
              <property name="name" value="files2" />
              <property name="prefix" value="/tmp/wayback/files2/" />
            </bean>
          </list>
        </property>
    </bean>
    
    ...
    

    The property element named prefix must reflect the location of the delivered WARC files.  The default BDBCollection.xml reads WARC files from two directories, /tmp/wayback/files1 and /tmp/wayback/files2.  Change each of these directories to the actual location of the WARC files in your environment.  If only a single directory contains the WARC files, remove one of the entries represented by the following XML:

    ...
    <bean class="org.archive.wayback.resourcestore.resourcefile.DirectoryResourceFileSource">
    ...

    For example, assume that in your environment the WARC files are located in /file_system/warcs.  The modified XML snippet is shown below.

    ...
      <bean id="datadirs" class="org.springframework.beans.factory.config.ListFactoryBean">
        <property name="sourceList">
          <list>
            <bean class="org.archive.wayback.resourcestore.resourcefile.DirectoryResourceFileSource">
              <property name="name" value="warcfiles" />
              <property name="prefix" value="/file_system/warcs" />
            </bean>
          </list>
        </property>
      </bean>
    ...
  4. Start Tomcat.

    /home/user/apache-tomcat-6.0.32/bin/startup.sh
  5. Open the Wayback Web application URL: http://wayback.myserver.org:8080/
  6. Search for a URL by entering the URL in the "Enter Web Address" field and clicking on the "Take Me Back" button.
  7. The available versions page will appear showing the versions of the URL that are archived. Click on a version (date).
  8. The archived page will be displayed.
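
The three wayback.xml edits in step 2 can also be applied without an interactive editor.  The following sed-based sketch assumes that each property sits on its own line, as in the snippets above, and that the supplied values contain no "|" or "&" characters.  The function name configure_wayback_xml is hypothetical.

```shell
#!/bin/sh
# configure_wayback_xml FILE BASEDIR URLPREFIX PORT
# Applies the three wayback.xml edits from step 2 without opening an editor.
# Output goes to a temporary file first, so a failed sed leaves FILE untouched.
configure_wayback_xml() {
    file=$1
    basedir=$2
    urlprefix=$3
    port=$4
    tmp_file="$file.tmp.$$"
    sed -e "s|wayback\.basedir=.*|wayback.basedir=$basedir|" \
        -e "s|wayback\.urlprefix=.*|wayback.urlprefix=$urlprefix|" \
        -e "s|8080:wayback|$port|g" \
        "$file" > "$tmp_file" && mv "$tmp_file" "$file"
}
```

With the conventions used in this document, the arguments would be the path to wayback.xml, /my_wayback_big_file_system/collection_files_index, http://wayback.myserver.org:8080/, and 8080.  Keeping a backup copy of wayback.xml before running such a helper is a sensible precaution.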

Adding New Archived Content to a Collection

In some cases, additional content is created for a collection.  This can occur if a partner initiates new crawl jobs that capture updated versions of collection content.  For example, a collection may contain all the Web pages of a specific site as of 2/1/2001.  If the same Web site is re-captured on 3/1/2001, the new content must be integrated with the existing content.  The following instructions describe the procedure for updating a collection with new content.  Note that the following procedure can be executed while Tomcat is running.

  1. Receive the new content from Internet Archive via physical media or file transfer.  For the purposes of this document, the new content is assumed to exist in a directory named /file_system/newarcs.
  2. Copy the new WARC files from the /file_system/newarcs directory into the /file_system/warcs directory, which is the directory that is configured in Wayback for reading WARC files.

    cp /file_system/newarcs/* /file_system/warcs

    The new WARC files are detected by Wayback and indexed into CDX files in the /my_wayback_big_file_system/collection_files_index directory.
    The new content should now be accessible from the Wayback search page.
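
A plain copy overwrites any existing file that has the same name.  The following sketch shows one more cautious way to add new files, assuming that files already present in the WARC directory should never be replaced.  The function name add_new_warcs is hypothetical; its arguments are the directory holding the new files and the WARC directory configured in Wayback.

```shell
#!/bin/sh
# add_new_warcs NEW_DIR WARC_DIR
# Copies files from NEW_DIR into WARC_DIR, skipping any name that already
# exists there so that existing collection files are never overwritten.
add_new_warcs() {
    new_dir=$1
    warc_dir=$2
    for src in "$new_dir"/*; do
        [ -f "$src" ] || continue    # skip subdirectories and empty globs
        dest="$warc_dir/$(basename "$src")"
        if [ -e "$dest" ]; then
            echo "skipping existing file: $dest" >&2
        else
            cp "$src" "$dest" && echo "added: $dest"
        fi
    done
}
```

Skipped names are reported on standard error, so they can be reviewed separately from the list of files that were added.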

Configuring Multiple Access Points

Create multiple access points if you want to access the same collection or multiple collections using different URLs.  For example, if you have two collections made up of three WARC files per collection, you can configure the Wayback software so that each collection is accessed through a different URL.

Limiting Access

Wayback can limit access to archived Internet content using various criteria, such as HTTP user credentials and administrative file lists.  For information on limiting access to Internet content, go to http://archive-access.sourceforge.net/projects/wayback/administrator_manual.html and search for the phrase "Excluding Documents within an AccessPoint".