If you do not find the answers you need from the topics below, please use the "Submit a Question" link in your account to send your question(s) to the Archive-It team.
Crawls & Reports
What is a document?
A document is any file on the web that has a distinct URL. Images, PDFs, articles, etc. are all considered separate documents. Archive-It accounts include a budget stating the total number of documents that can be archived per subscription year.
What is the difference between a host and a website?
A website is a collection of related web resources, usually grouped under a common address, ex. www.archive.org. However, a website can be composed of a number of different hosts, depending on where the information being served comes from, because a single address can bind together information from several sources. For example, a page on www.archive-it.org could have an RLG logo on it, and the HTML that creates the page could link to that image directly on RLG servers. If that page were archived, the reports would show that one document was archived from the rlg.org host. This one document would be the RLG logo.
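To make the host/website distinction concrete, here is a minimal Python sketch that counts documents per host for a list of captured URLs. The URLs and the function name documents_per_host are hypothetical and only for illustration; this is not how Archive-It builds its reports.

```python
from collections import Counter
from urllib.parse import urlsplit


def documents_per_host(urls):
    """Count distinct URLs (documents) under each host name."""
    counts = Counter()
    for url in set(urls):  # a document is a distinct URL, so drop duplicates
        host = urlsplit(url).hostname
        if host:
            counts[host.lower()] += 1
    return dict(counts)


# Hypothetical capture list: one page, its stylesheet, and a logo
# that is served from a different host.
captured = [
    "http://www.archive-it.org/index.html",
    "http://www.archive-it.org/css/site.css",
    "http://rlg.org/images/logo.gif",
]
report = documents_per_host(captured)
```

In this sketch the single website www.archive-it.org produces entries for two hosts, which mirrors the RLG logo example above.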
What is my collection number?
There are two easy ways to find the collection number for your Archive-It collection:
1) Log in to the Archive-It Application and go to Access > Wayback. Select the collection you want to know the number for from the drop-down list. Then click any seed URL from the collection. Once the Wayback Machine opens in your browser, examine the URL in the address bar at the top of the screen. The collection number occurs just after http://wayback.archive-it.org/ in the URL. For example, the collection number for this URL in the Wayback Machine http://wayback.archive-it.org/194/*/http://census.state.nc.us/ is 194.
2) Go to the public collection page for your collection. If your collection is public, you can find it by typing the name of the collection in the search box next to "Explore Collections" on www.archive-it.org. You can also see a full list of public collections on Archive-it.org: http://archive-it.org/explore?show=Collections
Once you find your collection page, examine its URL: The collection number is the number at the end of the URL. For example, the collection number for this collection http://www.archive-it.org/collections/866 is 866.
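If you need to pull collection numbers out of many URLs, the two rules above can be sketched in Python as follows. The function name collection_number and the regular expressions are our own illustration, based only on the two URL shapes shown in this answer; other URL shapes are not handled.

```python
import re
from typing import Optional

# Collection number directly after the Wayback host name.
_WAYBACK = re.compile(r"^https?://wayback\.archive-it\.org/(\d+)/")
# Collection number at the end of a public collection page URL.
_COLLECTION = re.compile(r"^https?://(?:www\.)?archive-it\.org/collections/(\d+)/?$")


def collection_number(url: str) -> Optional[int]:
    """Return the collection number found in the URL, or None if there is none."""
    for pattern in (_WAYBACK, _COLLECTION):
        match = pattern.match(url)
        if match:
            return int(match.group(1))
    return None
```

For example, collection_number("http://wayback.archive-it.org/194/*/http://census.state.nc.us/") returns 194, and collection_number("http://www.archive-it.org/collections/866") returns 866.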
How do I upload a logo to my Archive-It Account?
Inside the Archive-It application, click on the "admin" link in the upper right corner of the screen. From there, click the "logo" tab.
On the add logo page, you can easily select and upload your institution's logo. This logo will appear on your Archive-It partner and collection pages, ex: http://www.archive-it.org/organizations/67 and http://www.archive-it.org/collections/194. Additionally, the logo will appear in the Wayback Machine on calendar view pages (these are the pages that show dates of capture for a given page), ex: http://wayback.archive-it.org/194/*/http://census.state.nc.us/
If you would like a different logo or image associated with a specific collection within your account, you can upload that image by going to the Collection Management screen for that collection and clicking on "Edit Collection Metadata" in the right column and then clicking on the "Logo" tab. This logo will show up on your Archive-It collection page as well as the Wayback Machine calendar pages for that collection.
In either case, the maximum logo size is 150 x 150 pixels. Larger logos will be automatically resized.
What is the Difference between the General Archive (sometimes called the Wayback Machine) and Archive-It?
- The General Archive is a complimentary resource for the community. Archive-It is a paid subscription service.
- Archive-It allows users to curate, scope and manage their own focused or topical collections. Users control how deep and how often a site is crawled, and they can exclude content from being crawled, catalog with metadata at the collection, seed and document level, and so on. Archive-It also provides an option for users to ignore robots.txt on a host-by-host basis.
- Archive-It collections attribute archived web pages to a specific collection and the organization that captured it.
- Full text search (basic and advanced) is available with Archive-It collections, and there are no plans currently to provide this for the General Archive.
- The General Archive crawls do not include Umbra, so many social media sites (Flickr, Twitter, Instagram, Vimeo, Facebook, etc.) are not captured.
- Archive-It provides technical support throughout the process and helps our users with scoping (and other) issues.
- Archive-It partners are able to get a backup copy of their data, which is not available for content collected as part of the General Archive.
- By default, content that is captured through the Archive-It service and public on the Archive-It website does appear in the General Archive within 24 hours. However:
- Archived content that is private in Archive-It remains inaccessible in the General Archive.
- If an organization would like its public content in Archive-It to be embargoed from browsing in the General Archive for a period of time (3 months, 6 months, etc.), this is easy to arrange by notifying a Partner Specialist.
- All trial training archived content remains private.
How can I change my collection name?
Just click on your collection name at the top of the Collection Management page or click the 'Edit' link next to the collection name. This will allow you to type in a new collection name. When you are finished, click the "Update Name" button.
What is the difference between active vs. inactive seeds or collections?
Active Seeds and Collections
A seed or collection is considered "active" when it is scheduled for crawling.
Inactive Seeds and Collections
A collection is considered inactive if it is not scheduled for crawling. Inactive collections can still be accessed by the public.
Dormant collections are not scheduled for crawling and a partner can have unlimited numbers of dormant collections. Dormant collections are publicly accessible.
How can I add more websites or seeds to my collection?
You can always add more seeds or websites to your collection. From the Collection Management page, click the "Add Seeds" link under the "Collection Management" column. You will then be able to enter the new seeds, make sure they are verified, and assign them a crawl frequency.
How do I move seeds from one collection to another?
There is no automatic way to move a seed from one collection to another. To do so, disable the seed you want to move and then add it to the new collection. Note that this does not move the archived files associated with the seed. There is currently no way to move archived files from one collection to another.
How do I change the crawl frequency of my seeds?
There are two ways to change the crawl frequency of a seed. If you want to change the frequency of only one or two seeds, click on "Seed Management" for the collection you wish to edit, then click on the "[Settings]" link for the seed you would like to edit. Then select the new frequency you would like. Your change will be saved automatically.
If you would like to change the frequency of many of your seeds, click "Bulk Edit" on the upper right corner of the Seed Management page. Select "All" or just the seeds you wish to edit, and then click "Next". From the list of changes you can make select "Change frequency" and click "Next" again. From this list select the new frequency you would like and finish by clicking "Next" one last time. Your changes will be automatically saved.
Click on Seed Management for more help.
How do I add metadata to my seeds?
Click the "Add" link in the 'Metadata' column on the Seed Management page for each seed to add metadata to your seeds individually. If you want to apply the same metadata to more than one seed, use the "bulk edit seeds" link found above your seed URL list on the Seed Management page. Select the seeds you would like to add metadata to and click "Next". Then select "Edit metadata" from the list of options and click "Next" again. Finally, add the desired metadata and click "Next" one last time to save your changes.
Are any metadata fields required?
The only required metadata field is the collection description, because it appears on the public www.archive-it.org website. All other metadata fields both at the seed and collection level can be used completely at your institution's discretion and per your policies. Archive-It allows partners to input metadata for collections, seeds, and individual URLs using the Dublin Core Metadata Element Set.
How do I export the metadata I've added to my seeds and collections?
You can export your metadata per collection or for your entire account using XML feeds. You should see "XML" buttons in the following places:
-Partner Home page: the orange button here will create an XML feed of all the metadata you have entered for all your seeds in all your collections
-Collection Management page: in the upper right side of the page you will see an orange "XML" link that will create a feed of just the metadata from the collection you are currently viewing.
Note: Make sure the XML setting is enabled for your account by going to Admin and clicking the box next to Enable XML Settings.
How do I remove seeds from my public collection page?
Partners can easily remove seeds from their collection pages listed on www.archive-it.org (ex: http://www.archive-it.org/collections/866 ).
From the seed management page inside the application, click the "edit" link next to the seed you would like to remove from your public collection page.
Under the settings tab, you should see "show on public site" and a check box. When the check box is selected, the seed appears on your public collection page. To remove the seed, uncheck the box.
If you would like to remove more than one seed from your public collection page, you can select "Bulk Edit" on the Seed Management page. Then select the seeds you would like to remove and click "Next". Finally select "Make invisible on public site" from the list of options and click "Next" to save your changes.
How do I know how big a crawl will be?
To find out how large a web site or collection is, run a test crawl. To run a test crawl, select the seeds you want to test and then click the "Run Test Crawl" button located on the top right-hand corner of the Seed Management page.
The resulting crawl will not collect any actual data, but will generate all the normal reports and crawl statistics. You can then analyze your data by seed/collection and make any necessary adjustments. When you are ready to run a production crawl to capture data, change the frequency of your seeds as necessary. Click to learn more about test crawls.
How do I re-start crawling or start a crawl early?
To restart a crawl, first decide which seed frequencies you want re-captured in which of your collections. You can choose to restart just one crawling frequency in one of your collections, or all seed frequencies in all of your collections (thus ALL of your seeds would be re-captured), or any combination in between. First, make sure the collection you want to crawl is active. Then, from the Collection Management page, click Start Crawl Now next to the crawl frequency you want to re-crawl. Crawling should begin immediately.
All future crawling will be rescheduled from the date/time the re-start begins. For example, if you have a quarterly crawl scheduled for July 1, 2010 and you restart that crawl on June 1, 2010, the seeds you have set to crawl will be captured immediately. They will then be automatically re-scheduled for September 1, 2010.
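The rescheduling arithmetic in that example can be sketched in Python. This is only an illustration of counting forward from the restart date: the frequency names and month counts in FREQUENCY_MONTHS are assumptions made for the sketch, and the Archive-It scheduler may compute dates differently.

```python
import calendar
from datetime import date

# Assumed mapping from frequency name to months between crawls (illustrative only).
FREQUENCY_MONTHS = {"monthly": 1, "quarterly": 3, "annual": 12}


def add_months(start: date, months: int) -> date:
    """Return the date `months` later, clamping the day to the month's length."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def next_crawl(restart: date, frequency: str) -> date:
    """Next scheduled crawl, counted from the date the crawl was restarted."""
    return add_months(restart, FREQUENCY_MONTHS[frequency])


rescheduled = next_crawl(date(2010, 6, 1), "quarterly")
```

Here rescheduled is September 1, 2010, which matches the example above.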
How do I stop a seed from being crawled?
If you no longer want a specific seed to be crawled, you can deactivate it by going to the 'Settings' area for that seed and clicking the 'Deactivate' button. Once the seed has been deactivated, you can easily activate it again by clicking the "[Activate]" button in the same place.
If you would like to activate or deactivate more than one seed at a time, click on the "Bulk Edit" button on the Seed Management screen and select the seeds you would like to edit. On the next screen, select either "Activate" or "Deactivate" and click "Next" to save.
How do I stop a collection from being crawled?
To stop a complete collection from crawling and cancel any scheduled crawls, click the "Deactivate" link in the upper right corner of the Collection Management page. The collection will then appear in the "Inactive" menu under "Collections" in the navigation bar.
Note: If you have a crawl currently running, this will not stop the current crawl. To do so, go to Crawls > Current on the menu bar and click the "stop" link on the right side.
If a crawl is currently running on a collection, and I deactivate it, will it end the current crawl immediately or will it finish the crawl for that collection?
If you have a crawl running and you deactivate the collection, the crawl will continue. Deactivating a collection does not stop a crawl that has already started.
How do I know what I crawled?
There are two ways to view and understand the content archived in your collection.
To get an authoritative listing of what was captured, use the Host Report. The 'URLs' column in that report lets you view exactly which URLs were captured from each host. You can find more information on reports here.
You can also browse the archived websites directly to get a feel for your collection and its contents. From inside the application, go to the "Access" link on the navigation bar and select the "Wayback" option in the menu. Now enter the seed URL you want to see and click "Go". To view the seed URLs for a specific collection, select it from the drop-down box under the search box. You can also do a keyword text search under "Search" while logged into the application or from the public Archive-It site (www.archive-it.org). Enter any keyword and the search engine will search all the text on an archived web page.
Collections can also be browsed from the public site (www.archive-it.org). From the homepage, partners can access their collections by searching for a specific collection in the "Explore Collections" search box or for their organization in the "Explore Collecting Organizations" search box. They can also view collection information by searching or browsing from the Explore page: http://www.archive-it.org/explore
Why didn't some pages get archived?
There are a few reasons why specific pages within a seed site would not get archived:
-robots.txt: parts of the site could be blocked from our crawler by a robots.txt. Crawling for Archive-It is done with the user-agent archive.org_bot. You can check to see if your seed has blocked the Heritrix web crawler by going to www.yourseed.com/robots.txt. Learn more about robots.txt
-not linked: Heritrix can only follow links to pages in the seed site. If there are parts of the site that are not linked to from a page that is in scope, they will not be archived. If you know that you want to archive a specific page that is not linked to from anywhere else, please list it as a seed.
-Connection Error: sometimes Heritrix will not be able to connect to the site you want archived either because of an error on the host side or because they have forbidden access. When this occurs an error notice will be logged on your seed status report. Go here to see what an error code means.
-Out of Scope: A url might not have been archived because it was out of scope for the crawl. You may want to review how scoping works to understand why something would be out of scope. You can also review the 'out of scope' column of the Host Report to see what urls were deemed 'out of scope' and so were not captured. If a url you want to archive is 'out of scope' you may want to expand your scope using expand scope rules. Note: one common reason for something being out of scope is if the page you are trying to archive is part of a subdomain of your seed, which by default will not be automatically harvested. Subdomains are directories named to the left of the seed site, ex. crawler.archive.org (crawler is the subdomain). If you want to be sure to crawl subdomains, you need to list them as separate seeds on your seed list or expand the scope of your seeds using scope expansion rules.
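As a rough way to spot the subdomain case described in the last item, the Python sketch below checks whether a URL's host is a subdomain of a seed's host. It is a simplified illustration only and does not reproduce Archive-It's actual scoping rules; the function names are our own.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Lower-cased host name of a URL, or an empty string if there is none."""
    return (urlsplit(url).hostname or "").lower()


def is_subdomain_of_seed(url: str, seed: str) -> bool:
    """True when the URL's host sits to the left of the seed's host name."""
    host, seed_host = host_of(url), host_of(seed)
    if not host or not seed_host:
        return False
    return host != seed_host and host.endswith("." + seed_host)
```

With the seed http://archive.org/, the URL http://crawler.archive.org/docs is flagged as a subdomain, so it would need its own seed or a scope expansion rule, while http://archive.org/about is not flagged.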
What do all the result messages mean in the Seed status report?
Error 403 - The site owner has forbidden access to our crawler (this is different than being blocked by robots.txt)
Error 404 - The seed URL wasn't found. This could be a typo in the seed URL, a web server misconfiguration, or the page may simply not exist.
Blocked (robots.txt) - The site owner has blocked our crawler (user agent: archive.org_bot); learn more about robots.txt here.
Redirected - The seed URL has redirected the crawler to a different web address. When this happens the new address is considered the seed URL and it appears in your seed status report.
Unknown - There are many numbered derivations of this error result and they come directly from Heritrix. You can find a complete breakdown here.
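If you want to check ahead of time whether a site's robots.txt blocks the archive.org_bot user agent, a minimal Python sketch using the standard library's urllib.robotparser is shown below. The robots.txt text and URLs are made up for illustration, and Python's parser may not interpret every rule exactly the way the crawler does, so treat the result as a hint rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, url: str, user_agent: str = "archive.org_bot") -> bool:
    """Return True if the given robots.txt text disallows `url` for `user_agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: blocks /private/ for archive.org_bot only.
sample_robots = """\
User-agent: archive.org_bot
Disallow: /private/

User-agent: *
Disallow:
"""

blocked = is_blocked(sample_robots, "http://www.yourseed.com/private/page.html")
allowed = not is_blocked(sample_robots, "http://www.yourseed.com/index.html")
```

To test a real site, you would paste in the text found at www.yourseed.com/robots.txt in place of sample_robots.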
Why are there strange hosts listed in my hosts report?
Websites can be composed of elements from a number of different locations. If they are embedded elements on a page they are captured in the archiving process. If there are some particularly odd ones, feel free to contact the Archive-It partner specialist who can help track down an example for you.
How can I block individual hosts within a domain from crawling?
All the hosts archived in your collections are listed on your Hosts report as well as the downloadable version of the Seed Source report. If you find hosts you do not want to archive, go to the Modify Crawl Scope link on the left-hand column of the Collection Management page.
Under the "Host Constraints" tab, enter the URL for the host that you do not want to crawl and click "Add". Now click the box beside the host URL to block the host completely. (You also have the option to block the host using a regular expression or to enter the maximum number of documents you want crawled from the host).
Also on the Modify Crawl Scope area in the 'Crawl Limits' tab, you can restrict the number of documents you archive total per crawl (per frequency). Click to learn more about host constraints.
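To preview what a host block or a regular expression block would remove, you can try the rule against URLs from your reports before entering it. The Python sketch below is only an approximation under our own assumptions: the function name, the example URLs, and the matching semantics (exact host match, regular expression searched anywhere in the URL) are illustrative and may differ from how the application applies host constraints.

```python
import re
from urllib.parse import urlsplit


def allowed_urls(urls, blocked_hosts=(), blocked_patterns=()):
    """Return the URLs that survive the given host and regex blocks."""
    blocked = {host.lower() for host in blocked_hosts}
    compiled = [re.compile(pattern) for pattern in blocked_patterns]
    kept = []
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        if host in blocked:
            continue  # whole host is blocked
        if any(pattern.search(url) for pattern in compiled):
            continue  # URL matches a blocking regular expression
        kept.append(url)
    return kept


# Hypothetical URLs taken from a Hosts report.
candidates = [
    "http://www.example.org/index.html",
    "http://ads.example.net/banner.gif",
    "http://www.example.org/calendar/2010/06/",
]
kept = allowed_urls(
    candidates,
    blocked_hosts=["ads.example.net"],
    blocked_patterns=[r"/calendar/"],
)
```

In this example only http://www.example.org/index.html remains after both rules are applied.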
Will content that has been previously crawled and has not changed since the previous capture affect my account's document/data budgets?
Please see this page explaining Data De-Duplication for more details.
Why am I sometimes directed to live pages when I am viewing archives?
If you browse archived documents without using proxy mode, be aware that files within a web page may be inadvertently redirected to the live web.
When browsing your archives, please be sure to do the following:
A) Click a few links on your archived seed sites in the Wayback Machine to make sure it was archived and that its display appears to be normal.
B) Check streaming media files (video and sound recordings) to make sure they were archived successfully. (It is especially important to use proxy mode when checking streaming media files).
C) If there are specific files (URLs) essential to your collection, make sure they were archived successfully by browsing through your archives or searching directly for the URL in question (on Access > Wayback in the application).
To set your browser to proxy mode, you need to make a manual adjustment to your web browser's settings. Once you have adjusted your browser you will only be able to view material from your archived collection, and not any content from the live web. To view sites on the live web you will need to adjust your browser back to its original settings. Below you can find complete set up instructions and some tools that will make using proxy mode very easy.
See Wayback Machine, Proxy Mode for more information.
How does the text search engine work?
Archive-It uses a special bundling of the Nutch search engine called NutchWAX (web archiving extensions). Nutch indexes every word on every archived page. Results are ranked using two signals. First, Nutch compares the number of times your search term appears in a document with the number of times the term appears in the overall corpus of archived pages. Second, Nutch keeps track of how many pages refer to a document and what the anchor text is for those referrals.
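The first signal, weighing a term's frequency in one document against how common it is across the corpus, can be illustrated with a generic TF-IDF style score in Python. This is a textbook-style sketch and not the actual NutchWAX scoring formula; the sample texts are made up, and the second signal (referring pages and anchor text) is not modeled here.

```python
import math
from collections import Counter


def tf_idf(term: str, document: str, corpus: list) -> float:
    """Score a term: frequent in the document, rare in the corpus scores higher."""
    term = term.lower()
    words = document.lower().split()
    if not words:
        return 0.0
    term_frequency = Counter(words)[term] / len(words)
    containing = sum(1 for text in corpus if term in text.lower().split())
    # Smoothed inverse document frequency, so unseen terms do not divide by zero.
    inverse_doc_frequency = math.log((1 + len(corpus)) / (1 + containing)) + 1
    return term_frequency * inverse_doc_frequency


# Hypothetical page texts standing in for archived documents.
pages = [
    "census data for north carolina",
    "census results and census maps",
    "library news and events",
]
scores = [tf_idf("census", page, pages) for page in pages]
```

In this sketch the second page, which mentions "census" twice, scores highest for that term, and the third page, which never mentions it, scores zero.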
Why do I sometimes see a capture date listed more than once when viewing the available captures for a url?
You may sometimes see a date listed multiple times when viewing the calendar page that lists all available capture dates for a URL.
Most often this is because two very similar, but slightly different, URLs were discovered at the time the URL was crawled. The Wayback software tries to be "smart" and piece together URLs that are more or less "the same" URL. For example, you may see this happen with URLs with or without www in them (http://www.mysite.com vs. http://mysite.com). Another common example is versions of URLs with or without a slash on the end (http://www.mysite.com/ versus http://www.mysite.com).
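The idea of treating near-identical URLs as "the same" can be sketched in Python as follows. This simplified grouping key only handles the two cases mentioned above (a leading www and a trailing slash) and is an illustration, not the Wayback software's actual matching logic; the function name canonical_key is our own.

```python
from urllib.parse import urlsplit


def canonical_key(url: str) -> str:
    """Build a grouping key that ignores a leading 'www.' and a trailing slash."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    path = parts.path.rstrip("/")
    return host + path


variants = [
    "http://www.mysite.com",
    "http://mysite.com",
    "http://www.mysite.com/",
]
keys = {canonical_key(url) for url in variants}
```

All three variants collapse to the single key mysite.com, which is why captures of each variant can show up together on one calendar page.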