Glossary of Web Archiving Terms

Active Collection/Seed  Collections or seed URLs that are scheduled for crawling.

Inactive Collection/Seed  Collections or seeds that are not scheduled for crawling. When a collection and/or seed is marked inactive, the URLs crawled from that seed are not deleted and remain searchable and viewable.

WARC Record  Represents the capture of a distinct URL. It records the archive date, content type, and archive length, as well as the raw byte stream.

WARC File  Made up of disaggregated WARC records (coming from different hosts) and usually 1 GB in size. WARC is an open source format developed by the Internet Archive and an ISO standard (ISO 28500). You can learn more about it here.
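
For illustration, a WARC file can be read record by record with generic tools. The short Python sketch below assumes the third-party warcio library and a local file named example.warc.gz, neither of which is part of Archive-It; it prints the fields mentioned above for each response record: target URL, archive date, content type, and length.

    from warcio.archiveiterator import ArchiveIterator

    # Iterate over every record in a (gzipped) WARC file and print the
    # target URL, archive date, content type, and record length.
    with open('example.warc.gz', 'rb') as stream:      # hypothetical file name
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':          # skip warcinfo/request records
                continue
            headers = record.rec_headers
            print(headers.get_header('WARC-Target-URI'),
                  headers.get_header('WARC-Date'),
                  record.http_headers.get_header('Content-Type'),
                  headers.get_header('Content-Length'))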

Capture The process of copying digital information from the web to a repository for storage and archival purposes.

Collection A group of resources related by common ownership or a common theme, subject matter and/or domain. A web collection consists of one or more crawls that harvest a group of related URLs. Collections are managed and maintained by an organization or institution.

Crawl  A web capture operation conducted by a crawler. "Crawl" can also refer to the archived content associated with a capture.

Crawl Budget  The number of documents and amount of data to be collected, as defined by the subscription level.

Crawl Day  This is how the Archive-It team refers to a 24-hour block of active crawling as part of your subscription; in other words, when the crawler embarks on its assigned daily, weekly, monthly, or quarterly crawl per your instructions.

Crawl Frequency  The rate at which you set your seeds to be crawled. The frequency is set on a per-seed basis and can be one time, twice daily, daily, weekly, monthly, bi-monthly, quarterly, semiannual, or annual.

Crawler  Software that explores the web and collects data about its contents. A crawler can also be configured to capture web-based resources. It starts a capture process from a seed list of entry-point URLs (EPUs).
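
As an illustration of the concept only (not of how Heritrix or Archive-It's crawler actually works), the hypothetical Python sketch below starts from a small seed list of entry-point URLs, fetches each page, extracts its links, and follows them until a page limit is reached.

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collect the href value of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href' and value:
                        self.links.append(value)

    def crawl(seed_list, max_pages=25):
        """Breadth-first crawl starting from the entry-point URLs (seeds)."""
        queue = list(seed_list)
        seen = set()
        while queue and len(seen) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode('utf-8', 'replace')
            except Exception:
                continue  # unreachable or non-HTML resource
            parser = LinkExtractor()
            parser.feed(html)
            # Resolve relative links against the current page and enqueue them.
            queue.extend(urljoin(url, link) for link in parser.links)
        return seen

    # Example seed list (hypothetical starting point).
    print(crawl(['https://www.archive.org/']))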

Curation Process  Collection development for web-published materials includes the selection, curation, and preservation processes. In this context, the curation process involves the description, organization, presentation, maintenance, and de-selection of the materials in the collection.

Curators, or others responsible for building collections of web-based resources, specify seed lists for specific crawls.

Digital Archive A digital collection for which an institution has agreed to accept long-term responsibility for preserving the resources in the collection and for providing continual access to those resources in keeping with an archive's user access policies.

Digital Collection A collection consisting entirely of born-digital or digitized materials.

Document  A document, or web document, is a resource on the World Wide Web that has a distinct web address. It could be an embedded image, a whole web page, a PDF, or any other component of a web page. A document can be of any MIME type.

Domain  The domain is the root of a host name, for example .com, .gov, .org, etc.

Dublin Core  The metadata standard used by Archive-It and available for you to catalog your collection and each seed chosen. This standard has 15 fields that can be used to describe any kind of digital artifact, in this case an archived web page. Click here to get a detailed explanation of each field or here to learn more about the Dublin Core Metadata Initiative® (http://www.dublincore.org).
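
As a rough illustration, the 15 Dublin Core elements can be pictured as a simple key-value record. The values below are invented examples for a hypothetical archived seed, not Archive-It output.

    # The 15 elements of the Dublin Core Metadata Element Set, applied to a
    # hypothetical archived seed. Unused elements are simply left empty.
    dublin_core_record = {
        'Title':       'City Council Meeting Minutes',
        'Creator':     'Example City Clerk',
        'Subject':     'local government',
        'Description': 'Archived copy of the city council minutes page.',
        'Publisher':   'Example City',
        'Contributor': '',
        'Date':        '2013-05-01',
        'Type':        'Text',
        'Format':      'text/html',
        'Identifier':  'http://www.example.gov/minutes',
        'Source':      '',
        'Language':    'en',
        'Relation':    '',
        'Coverage':    '',
        'Rights':      '',
    }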

Dynamic Web Page  A web page created automatically by software at the web server. The page may be (a) personalized for the user based on identification via login or based on cookies stored on the user's computer, (b) tailored to fulfill a specific request made by the user, or (c) code-generated (e.g., using PHP, JSP, ASP, or XML). Information used for personalization or tailoring of pages may be retrieved in real time from a database or other data store.

Harvest Another name for the act of capturing web content as a part of crawling.

Heritrix  The name of the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or missaid as heratrix/heritix/heretix/heratix) is an archaic word for heiress (a woman who inherits). Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.

Host  A single networked machine, usually designated by its Internet hostname (e.g., archive.org). The hostname can be identical to a URL's domain name, but is not always.

Internet Archive A non-profit digital library seeking to provide universal access to all knowledge. Archive-It is a subscription service of the Internet Archive (www.archive.org).

MIME  Stands for Multipurpose Internet Mail Extensions, a specification for formatting non-text content to be sent over the Internet. A MIME file can be just about any kind of non-text file, e.g., gif, jpg, html, etc. When using Archive-It you will get a MIME report of all the different types of files archived.
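
As a rough picture of what a MIME report summarizes, the sketch below tallies MIME types for a small, made-up list of captured documents; the URLs and types are invented examples.

    from collections import Counter

    # (URL, MIME type) pairs for a handful of hypothetical captures.
    captures = [
        ('http://www.example.gov/',           'text/html'),
        ('http://www.example.gov/logo.gif',   'image/gif'),
        ('http://www.example.gov/photo.jpg',  'image/jpeg'),
        ('http://www.example.gov/report.pdf', 'application/pdf'),
        ('http://www.example.gov/about.html', 'text/html'),
    ]

    # Count captures per MIME type, much like the per-type totals in a MIME report.
    for mime_type, count in Counter(mime for _, mime in captures).most_common():
        print(mime_type, count)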

Nutch An open source search engine utilized by Archive-It to make archived websites text searchable.

One Hop Off  A crawling protocol that captures a document one link outside your crawl's scope if there is a link to it from an in-scope page scheduled to be crawled. Currently this feature is turned off for all Archive-It crawling.
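
The rule can be written down in a few lines. In the hypothetical sketch below, a URL is captured either because it matches the crawl's scope or because it is linked directly from an in-scope page (one hop off); links found on those one-hop pages are not followed any further.

    def should_capture(url, linked_from, in_scope):
        """One-hop-off rule: capture in-scope URLs, plus any URL linked
        directly from an in-scope page (but go no further than that)."""
        return in_scope(url) or (linked_from is not None and in_scope(linked_from))

    def in_scope(url):
        """Hypothetical scope: everything under www.example.gov is in scope."""
        return url.startswith('http://www.example.gov/')

    print(should_capture('http://www.example.gov/page', None, in_scope))         # True
    print(should_capture('http://other.org/doc.pdf',
                         'http://www.example.gov/page', in_scope))               # True (one hop off)
    print(should_capture('http://other.org/elsewhere',
                         'http://other.org/doc.pdf', in_scope))                  # False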

Persistent Name A unique name assigned to a web-based resource that will remain unchanged regardless of movement of the resource from one location to another or changes to the resource's URL. Persistent names are resolved by a third party that maintains a map of the persistent name to the current URL of the resource.
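
A persistent name resolver can be pictured as a lookup table maintained by a third party, mapping the unchanging name to whatever URL currently serves the resource. The sketch below is a hypothetical illustration of that mapping, not a real resolver service.

    # Hypothetical resolver table: persistent name -> current URL.
    resolver = {
        'example:report-2013': 'http://www.example.gov/reports/2013.pdf',
    }

    def resolve(persistent_name):
        """Return the current URL for a persistent name, if it is registered."""
        return resolver.get(persistent_name)

    # If the resource moves, only the resolver entry changes; the name does not.
    resolver['example:report-2013'] = 'http://archive.example.gov/2013.pdf'
    print(resolve('example:report-2013'))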

Repository The physical storage location and medium for one or more digital archives. A repository may contain an active copy of an archive (i.e. one that is accessed by end users) or a mirror copy of an archive for disaster recovery.

Seed A URL appearing in a seed list as one of the starting addresses a web crawler uses to capture content. Also called a targeted URL.

Seed List One or more starting point URLs from which a web crawler begins capturing web resources.

SOLR  An open source search platform that provides metadata-based search for Archive-It.

Starting Point URL Also known as a seed URL.

Sub-domain  A named section of a host that appears before the root web address, for example crawler.archive.org (crawler is the sub-domain).
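
To show how a URL, host, domain, and sub-domain relate, the short sketch below pulls the host out of a URL with Python's standard library and splits it into a rough domain and sub-domain (real domain rules are more involved, e.g. for country-code suffixes).

    from urllib.parse import urlsplit

    url = 'https://crawler.archive.org/index.html'
    host = urlsplit(url).hostname       # 'crawler.archive.org'
    labels = host.split('.')
    domain = '.'.join(labels[-2:])      # 'archive.org' (rough approximation)
    sub_domain = '.'.join(labels[:-2])  # 'crawler'
    print(host, domain, sub_domain)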

Umbra 

URL (Uniform Resource Locator) The location of a resource on the web.

Wayback Machine  Also known as the Internet Archive's general web archive, the Wayback Machine is a piece of software that makes archived websites browsable as if they were on the live web.

Web Archive  A collection of web-published materials for which an institution has either made arrangements or accepted long-term responsibility for preservation and access, in keeping with the archive's user access policies. Some of these materials may also exist in other forms, but the web archive captures the web versions for posterity.

Web Archive Service  Enables curators to build collections of web-published materials that are stored in local and/or remote repositories. The service includes a set of tools for selection, curation, and preservation of the archives. It also includes repositories for storage, preservation services (e.g., replication, emulation, and persistent naming), and administrative services (e.g., templates for collection strategies, content provider agreements, and repository provider agreements).

Web-published materials Web-published materials are accessed and presented via the World Wide Web. The materials span the cultural heritage spectrum and include a range of material types from text documents to streaming video to interactive experiences. Web-published materials are both dynamic and transient. They are at risk of disappearing. Web archives preserve web-published materials.

Website A website is a collection of related web resources, usually as grouped by some common addressing – as when all resources on a single host, or group of related hosts, are considered a 'website'.
