Response codes 404 and 500 due to invalid URIs

Description

Breakdown of the response codes for pivot.cos.com:

Count Status Code
2 1 - dns
825 200 - OK
135 302 - Found
52 404 - NOT_FOUND
45 500 - Internal Server Error
2938 -5000 - OUT_OF_SCOPE (due to additions to the blacklist)
1 -9998 - ROBOTS_PRECLUDED
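
A breakdown like the one above can be reproduced by counting the second whitespace-separated field (the fetch status code) of each crawl.log line. The following is a minimal sketch and not necessarily how the counts in this report were produced; the function name `tally_status_codes` is hypothetical, and the sample input is just the two log lines quoted later in this report.

```python
from collections import Counter


def tally_status_codes(lines):
    """Count fetch status codes (second field) in crawl.log lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or truncated lines
        counts[fields[1]] += 1
    return counts


# Sample input: the two crawl.log lines quoted in this report.
sample = [
    "2012-01-11T10:03:04.814Z 404 14072 "
    "http://pivot.cos.com/Search_Tips_Funding.pdf LLL "
    "http://pivot.cos.com/support text/html #1078 "
    "20120111100302625+686 sha1:RVK4XA5M4GB4IQO4M75CN3VOQDXGBZ6Q - -",
    "2012-01-11T16:17:15.660Z 500 13872 "
    "http://pivot.cos.com/profiles/all LLLLL "
    "http://pivot.cos.com/profiles/0D240B55CE47B01D01B3FB139DC37342?h= "
    "text/html #361 20120111161713581+1572 "
    "sha1:JVMAFRIZM2D7Y6SFANFXLYN6KEXKLNPY - -",
]

counts = tally_status_codes(sample)
for code, n in sorted(counts.items()):
    print(n, code)
```

Against a full crawl.log, the same function could be fed the open file object instead of the sample list.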

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Example of 404 issue:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

2012-01-11T10:03:04.814Z 404 14072 http://pivot.cos.com/Search_Tips_Funding.pdf LLL http://pivot.cos.com/support text/html #1078 20120111100302625+686 sha1:RVK4XA5M4GB4IQO4M75CN3VOQDXGBZ6Q - -

Below is the snippet of page source from which Heritrix generated the bad URL. The line that produced the link above is marked. The actual URL is http://pivot.cos.com/guides/Search_Tips_Funding.pdf

.
.
.

<tr>
<td>
<h4 class="support-cat">Finding Funding Opportunities</h4>
</td>
<td>
<div class="accordion">
<div class="header">
<span><h4>Support guides:&nbsp;7</h4></span>

</div>
<div class="content">
<ul class="support-guides">
<li><a href="COS_Pivot_Funding_Homepage_and_Searching.pdf" class="guide-link">COS Pivot Funding Homepage and Searching</a></li>
<li><a href="Quick_Search_Funding.pdf" class="guide-link">Quick Search</a></li>
<li><a href="Advanced_Search_Funding.pdf" class="guide-link">Advanced Search</a></li>
<li><a href="Field_Descriptions.pdf" class="guide-link">Field Descriptions</a></li>

<li><a href="Search_Tips_Funding.pdf" class="guide-link">Search Tips</a></li> <--------------------- This is the link that produced the 404 above
<li><a href="Navigating_Your_Search_Results_Funding.pdf" class="guide-link">Navigating Your Search Results</a></li>
<li><a href="Managing_Individual_Funding_Opportunities.pdf" class="guide-link">Managing Individual Funding Opportunities</a></li>
</ul>
</div>
</div>
</td>

<td>
<h4><a href="/faqs#funding" class="faq-link">Funding Opportunity FAQs</a></h4>
</td>
</tr>

.
.
.
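
To make the 404 concrete, the sketch below resolves the marked href against the referring page using standard relative-reference resolution (Python's `urllib.parse.urljoin`). The result matches the URI in the crawl.log line. The `assumed_base` value is an assumption, not something shown in the snippet: it illustrates that a base ending in `/guides/` would yield the working URL. This does not settle whether the page declares such a base elsewhere or rewrites the links by script.

```python
from urllib.parse import urljoin

# Referring page and href, both taken from the report above.
page = "http://pivot.cos.com/support"
href = "Search_Tips_Funding.pdf"

# Plain resolution against the referring page: the last path segment
# ("support") is replaced by the relative reference.
crawled = urljoin(page, href)

# ASSUMPTION: a base URL ending in /guides/ (for example via a <base>
# element or script). The snippet in this report does not show one.
assumed_base = "http://pivot.cos.com/guides/"
intended = urljoin(assumed_base, href)

print(crawled)
print(intended)
```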
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Example of 500 issue:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

2012-01-11T16:17:15.660Z 500 13872 http://pivot.cos.com/profiles/all LLLLL http://pivot.cos.com/profiles/0D240B55CE47B01D01B3FB139DC37342?h= text/html #361 20120111161713581+1572 sha1:JVMAFRIZM2D7Y6SFANFXLYN6KEXKLNPY - -

<div id="profile-tabs" class="init-hide"><div class="grid init-hide"><div class="tab-header gradient dropshadow init-hide"><ul class="inline init-hide"><li id='pubs'><a href="#pubs_tab"><span>Publications: </span>2</a></li><li id='webcontent'><a href="#webcontent_tab"><span>Web: </span>1</a></li><li id='colleagues'><a href="#colleagues_tab"><span>Colleagues: </span>60</a></li></ul></div></div><div id='pubs_tab' class='grid init-hide'> <form class="tab-options-form" action="/profiles/0D240B55CE47B01D01B3FB139DC37342/pubs">
<input name="page" type="hidden" value="1" />

<div class="grid-25">
<div class="profile-filter">
<h2>Publications</h2>

<h3>Types</h3>

<ul>
<li class='active'><a href="all" class="pub_type">All - 2</a></li> <------- This link caused the problem.
<li class="open"><a>Articles</a>

<ul>
<li><a href="peerreviewed" class="pub_type">peer reviewed - 2</a></li>
<li><a href="article" class="pub_type">other - 2</a></li>
</ul>
</li>

</ul>

<input id="pub_type" name="pub_type" type="hidden" value="all" />

<fieldset>

<div class="field">
<label for="list-search"><span>Search</span> within this list</label>
<input class="text rounded" id="nested_query" name="nested_query" type="text" />
<a href="#" class="button-link type1">Go</a>
</div>
</fieldset>
</div>
</div>

<div class="grid-75">

<div class="page-navigator">
<div class="pagination top">
<span id='pagination-top-span'>1-2 of 2 results</span>
</div>
</div>
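
The link between the marked `href="all"` and the 500 log line can be checked mechanically: take the fetched URI (fourth field) and the referring URI (sixth field) from the crawl.log line and test whether resolving the href against the referrer gives the fetched URI. This is a minimal sketch; the helper name `href_explains_log_line` is hypothetical.

```python
from urllib.parse import urljoin

# The 500 crawl.log line quoted above.
LOG_LINE = (
    "2012-01-11T16:17:15.660Z 500 13872 "
    "http://pivot.cos.com/profiles/all LLLLL "
    "http://pivot.cos.com/profiles/0D240B55CE47B01D01B3FB139DC37342?h= "
    "text/html #361 20120111161713581+1572 "
    "sha1:JVMAFRIZM2D7Y6SFANFXLYN6KEXKLNPY - -"
)


def href_explains_log_line(log_line, href):
    """Return True if href, resolved against the referrer, is the fetched URI."""
    fields = log_line.split()
    uri, via = fields[3], fields[5]  # fetched URI, referring URI
    return urljoin(via, href) == uri


matches = href_explains_log_line(LOG_LINE, "all")
print(matches)
```

Resolution drops the referrer's last path segment and its query string, so `all` lands directly under `/profiles/`, which is the URI that returned the 500.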

Environment

java version "1.6.0_29"
Java(TM) SE Runtime Environment (build 1.6.0_29-b11)
Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode)

Linux version 2.6.18-194.17.4.el5xen (mockbuild@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)) #1 SMP Mon Oct 25 16:36:31 EDT 2010

Status

Assignee

Unassigned

Reporter

David Pane

Labels

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Affects versions

Heritrix 3.1.1

Priority

Major