Heritrix 3.x API Guide

In case of SSL error

If you get an error like this from curl:
error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert internal error
try adding the argument --sslv3 (short form -3) to your curl commands.
See http://tech.groups.yahoo.com/group/archive-crawler/message/7456

Introduction

This manual describes the REST application programming interface (API) of the Heritrix Web crawler.  Heritrix is the Internet Archive's open source, extensible, Web-scale, archival-quality Web crawler. For more information about Heritrix, visit http://crawler.archive.org/.

This document is intended for application developers and administrators interested in controlling the Heritrix Web crawler through its REST API.

Conventions and Assumptions

The following conventions are used in this document.

Convention | Description
(identifier) | An identifier surrounded by parentheses indicates a user-defined value. For example, (heritrixhost) indicates a user-defined hostname that is running Heritrix.
[identifier1,identifier2,...] | Multiple identifiers surrounded by brackets indicate a predefined set of values. For example, [on,off] indicates a set of values consisting of the literals "on" and "off".

The following curl parameters are used when invoking the API.

curl Parameter | Description
-v | Verbose. Writes a detailed account of the request and response to standard error.
-d | Data. The name/value pairs that are sent in the body of a POST.
-k | Insecure. Allows connections to SSL sites without verifying the server's certificate.
-u | User. Submits a username and password to authenticate the HTTP request.
--anyauth | Any authentication type. Lets curl determine which authentication method the server requires and use it.
--location | Follows HTTP redirects. This option is used so that API calls that return data (such as HTML) will not halt upon receipt of a redirect code (such as an HTTP 303).
-H | Sets the value of an HTTP header. For example, "Accept: application/xml".

It is assumed that the reader has a working knowledge of HTTP and of Heritrix functionality. The examples also assume that Heritrix is run with an administrative username and password of "admin".
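
Because the same curl options recur in every example in this guide, it can be convenient to wrap them in a small shell function. The following is a minimal sketch and not part of Heritrix: the function name heritrix_post and the variables HERITRIX_URL and HERITRIX_AUTH are hypothetical, and their defaults simply mirror the localhost:8443 address and admin:admin credentials assumed in the examples.

```shell
# Hypothetical helper (not part of Heritrix): POST form data to the REST API
# with the curl options described above and request an XML response.
#   $1  path below the base URL, e.g. /engine or /engine/job/myjob
#   $2  form data, e.g. action=rescan
# HERITRIX_URL and HERITRIX_AUTH are assumed variable names; their defaults
# mirror the host and credentials used in this guide's examples.
heritrix_post() {
  curl -d "$2" -k -u "${HERITRIX_AUTH:-admin:admin}" --anyauth --location \
    -H "Accept: application/xml" \
    "${HERITRIX_URL:-https://localhost:8443}$1"
}
```

With this function defined, heritrix_post /engine action=rescan issues the same request as a hand-written curl command that uses these options, minus the verbose output from -v.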

REST

Representational State Transfer (REST) is a software architecture for distributed hypermedia systems such as the World Wide Web (WWW). REST is built on the concept of representations of resources. A resource can be any coherent and meaningful concept that may be addressed, and each resource is identified by a URI. The representation of the resource is typically a document that captures the current or intended state of the resource. An example of a representation of a resource is an HTML page.

Heritrix uses REST to expose its functionality. The REST implementation used by Heritrix is Restlet. Restlet implements the concepts defined by REST, including resources and representations. It also provides a REST container that processes RESTful requests. The container is the Noelios Restlet Engine. For detailed information on Restlet, visit http://www.restlet.org/.

Heritrix Restlet API

Heritrix exposes its REST functionality through HTTPS. The HTTPS protocol is used to send requests to retrieve or modify configuration settings and manage crawl jobs.

Requirements for API Invocation

Any client that supports HTTPS can be used to invoke the Heritrix API. The most common clients are command-line tools such as curl and wget. These tools are typically found in Unix environments but can also be run on Windows by installing Cygwin, a free Unix-like environment for Windows.

API Format

The format used to describe each API is as follows.

Name | Description
API Name | The name assigned to the API. The name is a single word or short phrase that encapsulates the purpose of the API call.
URI | The URI to call when invoking the API.
Description | The description of the API. The description provides a detailed overview of what the API accomplishes and when the API should be called.
HTTP Method | The HTTP method to use when invoking the API.
HTTP Data | The name/value pairs that are submitted with the HTTP request.
HTML Example | An example call to the API. The curl command line utility is the HTTPS client used in the examples. The call returns HTML output.
XML Example | An example call to the API that returns XML output. The curl command line utility is the HTTPS client used in the examples.

API

Create New Job

URI

https://(heritrixhost):8443/engine

Description

This API creates a new crawl job configuration. It uses the default configuration provided by the profile-defaults profile.

HTTP Method

POST

HTTP Data

Name | Value | Description
createpath | (jobname) | The name of the job.
action | create | The action to invoke.

HTML Example
curl -v -d "createpath=myjob&action=create" -k -u admin:admin --anyauth --location https://localhost:8443/engine
XML Example
curl -v -d "createpath=myjob&action=create" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine

Add Job Directory

URI

https://(heritrixhost):8443/engine

Description

This API adds a new job directory to the Heritrix configuration. The directory must contain a CXML configuration file.

HTTP Method

POST

HTTP Data

Name | Value | Description
addpath | (job directory to add) | The job directory to add.
action | add | The action to invoke.

HTML Example
curl -v -d "action=add&addpath=/Users/hstern/job" -k -u admin:admin --anyauth --location https://localhost:8443/engine
XML Example
curl -v -d "action=add&addpath=/Users/hstern/job" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine

Build Job Configuration

URI

https://(heritrixhost):8443/engine/job/(jobname)

Description

This API builds the job configuration for the chosen job. It reads an XML descriptor file and uses Spring to build the Java objects that are necessary for running the crawl. A job must be built before it can be launched.

HTTP Method

POST

HTTP Data

Name | Value | Description
action | build | The action to invoke.

HTML Example
curl -v -d "action=build" -k -u admin:admin --anyauth --location https://localhost:8443/engine/job/myjob
XML Example
curl -v -d "action=build" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob

Launch Job

URI

https://(heritrixhost):8443/engine/job/(jobname)

Description

This API launches a crawl job. The job can be launched in the "paused" state or the "unpaused" state. If launched in the "unpaused" state the job will immediately begin crawling.

HTTP Method

POST

HTTP Data

Name | Value | Description
action | launch | The action to invoke.

HTML Example
curl -v -d "action=launch" -k -u admin:admin --anyauth --location https://localhost:8443/engine/job/myjob
XML Example
curl -v -d "action=launch" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob
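
Creating, building, and launching a job are separate calls, and a job has to be built before it can be launched. The sequence can be sketched as a short shell script. The function names heritrix_action and start_job are hypothetical, the host and credentials are those assumed throughout this guide, and the job name is assumed to need no URL encoding. The --fail option is a standard curl flag that is not used elsewhere in this guide; it makes curl exit with a non-zero status on an HTTP error response, so the chain stops at the first call that fails at the HTTP level.

```shell
# Hypothetical helper: POST one action and discard the response body.
#   $1  path below https://localhost:8443, e.g. /engine/job/myjob
#   $2  form data, e.g. action=build
heritrix_action() {
  curl --fail --silent --show-error --output /dev/null \
    -d "$2" -k -u admin:admin --anyauth --location \
    "https://localhost:8443$1"
}

# Create, build, and launch a job in that order.
# Each step runs only if the previous curl call exited successfully.
#   $1  job name, e.g. myjob (assumed to contain only URL-safe characters)
start_job() {
  heritrix_action /engine "createpath=$1&action=create" &&
  heritrix_action "/engine/job/$1" "action=build" &&
  heritrix_action "/engine/job/$1" "action=launch"
}
```

Calling start_job myjob issues the three requests in order. A non-zero exit status only indicates a failed HTTP request; it does not by itself confirm that the crawl is running.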

Rescan Job Directory

URI

https://(heritrixhost):8443/engine

Description

This API rescans the main job directory and returns an HTML page containing all the job names. It also returns information about the jobs, such as the location of the job configuration file and the number of job launches.

HTTP Method

POST

HTTP Data

Name | Value | Description
action | rescan | The action to invoke.

HTML Example
curl -v -d "action=rescan" -k -u admin:admin --anyauth --location https://localhost:8443/engine
XML Example
curl -v -d "action=rescan" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine

Pause Job

URI

https://(heritrixhost):8443/engine/job/(jobname)

Description

This API pauses an unpaused job. No crawling will occur while a job is paused.

HTTP Method

POST

HTTP Data

Name | Value | Description
action | pause | The action to invoke.

HTML Example
curl -v -d "action=pause" -k -u admin:admin --anyauth --location https://localhost:8443/engine/job/myjob
XML Example
curl -v -d "action=pause" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob

Unpause Job

URI

https://(heritrixhost):8443/engine/job/(jobname)

Description

This API unpauses a paused job. Crawling will resume (or begin, in the case of a job launched in the paused state) if possible.

HTTP Method

POST

HTTP Data

Name | Value | Description
action | unpause | The action to invoke.

HTML Example
curl -v -d "action=unpause" -k -u admin:admin --anyauth --location https://localhost:8443/engine/job/myjob
XML Example
curl -v -d "action=unpause" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob

Terminate Job

URI

https://(heritrixhost):8443/engine/job/(jobname)

Description

This API terminates a running job.

HTTP Method

POST

HTTP Data

Name | Value | Description
action | terminate | The action to invoke.

HTML Example
curl -v -d "action=terminate" -k -u admin:admin --anyauth --location https://localhost:8443/engine/job/myjob
XML Example
curl -v -d "action=terminate" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob

Teardown Job

URI

https://(heritrixhost):8443/engine/job/(jobname)

Description

This API removes the Spring code that is used to run the job. Once a job is torn down it must be rebuilt in order to run.

HTTP Method

POST

HTTP Data

Name | Value | Description
action | teardown | The action to invoke.

HTML Example
curl -v -d "action=teardown" -k -u admin:admin --anyauth --location https://localhost:8443/engine/job/myjob
XML Example
curl -v -d "action=teardown" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob

Copy Job

URI

https://(heritrixhost):8443/engine/job/(jobname)

Description

This API copies an existing job configuration to a new job configuration. If the asProfile parameter is set to "on" (the "as profile" checkbox in the Web interface), then the job configuration is copied as a non-runnable profile configuration.

HTTP Method

POST

HTTP Data

Name | Value | Description
copyTo | (new job or profile configuration name) | The name of the new job or profile configuration.
asProfile | [on] | Whether to copy the job as a runnable configuration or as a non-runnable profile. "on" means the job will be copied as a profile. If the asProfile parameter is omitted, the job will be copied as a runnable configuration.

HTML Example
curl -v -d "copyTo=mycopy&asProfile=on" -k -u admin:admin --anyauth --location https://localhost:8443/engine/job/myjob
XML Example
curl -v -d "copyTo=mycopy&asProfile=on" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob
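
Both examples above pass asProfile=on and therefore create a profile. To copy the job as a runnable configuration, the parameter is left out altogether. The sketch below builds the form data for either case; copy_job_data is a hypothetical helper name, and the configuration name is assumed to need no URL encoding.

```shell
# Hypothetical helper: build the form data for the Copy Job call.
#   $1  name of the new job or profile configuration
#   $2  optional; pass "profile" to copy as a non-runnable profile
copy_job_data() {
  if [ "${2:-}" = profile ]; then
    # asProfile=on requests a non-runnable profile copy.
    printf 'copyTo=%s&asProfile=on\n' "$1"
  else
    # Omitting asProfile requests a runnable job copy.
    printf 'copyTo=%s\n' "$1"
  fi
}
```

The result can be passed to curl with -d "$(copy_job_data mycopy)" in place of the literal data string used in the examples.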

Checkpoint Job

URI

https://(heritrixhost):8443/engine/job/(jobname)

Description

This API checkpoints the chosen job. Checkpointing writes the current state of a crawl to the file system so that the crawl can be recovered if it fails.

HTTP Method

POST

HTTP Data

Name | Value | Description
action | checkpoint | The action to invoke.

HTML Example
curl -v -d "action=checkpoint" -k -u admin:admin --anyauth --location https://localhost:8443/engine/job/myjob
XML Example
curl -v -d "action=checkpoint" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob

Execute Shell Script in Job

URI

https://(heritrixhost):8443/engine/job/(jobname)/script

Description

This API executes a script in the context of the chosen job, using one of the available script engines. The script can be written in BeanShell, ECMAScript (JavaScript), Groovy, or AppleScript.

HTTP Method

POST

HTTP Data

Name | Value | Description
engine | [beanshell,js,groovy,AppleScriptEngine] | The script engine to use.
script | (code to execute) | The script code to execute.

HTML Example
curl -v -d "engine=beanshell&script=System.out.println%28%22test%22%29%3B" -k -u admin:admin --anyauth --location https://localhost:8443/engine/job/myjob/script
XML Example
curl -v -d "engine=beanshell&script=System.out.println%28%22test%22%29%3B" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob/script
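
The script is sent as form data, so characters such as parentheses, quotation marks, and semicolons have to be percent-encoded, as in the examples above. The following sketch shows one way to produce that encoding in a POSIX shell. The function name urlencode is hypothetical, and the function is only intended for ASCII input.

```shell
# Hypothetical helper: percent-encode a string for use as a form value.
# Letters, digits, and the characters . ~ _ - are kept; every other
# character is replaced by %XX. Intended for ASCII input only.
urlencode() {
  remaining=$1
  encoded=
  while [ -n "$remaining" ]; do
    rest=${remaining#?}              # everything after the first character
    char=${remaining%"$rest"}        # the first character itself
    case $char in
      [a-zA-Z0-9.~_-]) encoded=$encoded$char ;;
      # printf with a leading quote yields the character's numeric code
      *) encoded=$encoded$(printf '%%%02X' "'$char") ;;
    esac
    remaining=$rest
  done
  printf '%s\n' "$encoded"
}
```

For the script used in the examples, urlencode 'System.out.println("test");' prints System.out.println%28%22test%22%29%3B. As an alternative, curl's --data-urlencode option performs the encoding itself, for example --data-urlencode 'script=System.out.println("test");' together with -d "engine=beanshell".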

Submitting a CXML Job Configuration File

URI

https://(heritrixhost):8443/engine/job/(jobname)/jobdir/crawler-beans.cxml

Description

This API submits the contents of a CXML file for a chosen job. CXML files are the configuration files used to control a crawl job. Each job has a single CXML file.

HTTP Method

PUT

HTTP Data

(CXML file content)

The XML-based text of the CXML file.

Example
curl -v -T my-crawler-beans.cxml -k -u admin:admin --anyauth --location https://localhost:8443/engine/job/myjob/jobdir/crawler-beans.cxml
API Response

On success, the Heritrix REST API will return an HTTP 200 with no body.
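
Because a successful upload returns HTTP 200 with an empty body, a script has to inspect the status code rather than the output. The sketch below captures the code with curl's -w '%{http_code}' option, a standard curl feature that is not otherwise used in this guide. The function name put_cxml is hypothetical, and the host and credentials are those assumed throughout.

```shell
# Hypothetical helper: upload a CXML file and succeed only on HTTP 200.
#   $1  local CXML file, e.g. my-crawler-beans.cxml
#   $2  job name, e.g. myjob
put_cxml() {
  # -s and -o /dev/null suppress normal output; -w prints only the status.
  status=$(curl -s -o /dev/null -w '%{http_code}' -T "$1" \
    -k -u admin:admin --anyauth --location \
    "https://localhost:8443/engine/job/$2/jobdir/crawler-beans.cxml")
  if [ "$status" = 200 ]; then
    echo "uploaded $1 to job $2"
  else
    echo "upload failed with HTTP status $status" >&2
    return 1
  fi
}
```

For example, put_cxml my-crawler-beans.cxml myjob && echo done prints "done" only if the server answered with 200.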