Heritrix ignores robots.txt

Description

When crawling http://fedora.clarin-d.uni-saarland.de, Heritrix 3.2.0 ignores the site's robots.txt, which reads:

User-agent: *
Disallow: /ajax
Disallow: /cqpweb
Disallow: /cqpweb-3.1.4
Disallow: /entwicklung
Disallow: /examples
Disallow: /fedora
Disallow: /fedora-demo
Disallow: /gecco/libs
Disallow: /grug/ressources
Disallow: /hub
Allow: /hub/index
Allow: /hub/browse
Allow: /hub/faq
Allow: /hub/help
Allow: /hub/impressum
Allow: /hub/resource
Disallow: /hub/resource/add
Disallow: /hub/resource/create
Allow: /hub/terms
Allow: /hub/whatsnew
Disallow: /hub-static
Disallow: /hurz
Disallow: /oaiprovider
Disallow: /piwik
Disallow: /poldilemma/downloads
Disallow: /poldilemma/ressources
Disallow: /ressources
Disallow: /sacoco/ressources
Disallow: /shib
Disallow: /Shibboleth.sso
Disallow: /showcases
Disallow: /sru
Disallow: /sru2
Disallow: /sru3
Disallow: /sru4
Disallow: /sru5
Disallow: /static
Disallow: /test
Disallow: /unserwiki

The crawl.log shows:

2016-06-01T22:53:47.851Z 1 72 dns:fedora.clarin-d.uni-saarland.de LLRLLLLLLLELLP http://fedora.clarin-d.uni-saarland.de/unserwiki/doku.php?id=workshops:hcl2016 text/dns #093 20160601225347837+14 sha1:UDT5L3WDFCUWMI6IVNJI2JEIHXOKRVEF - -
2016-06-01T22:53:51.497Z 200 822 http://fedora.clarin-d.uni-saarland.de/robots.txt LLRLLLLLLLELLP http://fedora.clarin-d.uni-saarland.de/unserwiki/doku.php?id=workshops:hcl2016 text/plain #101 20160601225351470+25 sha1:GYJFUPPXHHDRTJR6FY7YMZ7ULNDNB57H - -
2016-06-01T22:54:34.873Z -9998 - http://fedora.clarin-d.uni-saarland.de/unserwiki/doku.php?id=workshops:hcl2016 LLRLLLLLLLELL http://www.sfb1102.uni-saarland.de/?tribe_events=workshop-historical-corpus-linguistics-methods-and-applications unknown #084 - - - 3t
2016-06-01T22:54:40.362Z -9998 - http://fedora.clarin-d.uni-saarland.de/unserwiki/doku.php?id=corpus_tutorial:dgfs2014 LLLEPRRLRLLRLRLLLRLL https://www.linguistics.ruhr-uni-bochum.de/dgfs-cl/tutorien.shtml unknown #015 - - - -
2016-06-01T22:54:41.931Z -9998 - http://fedora.clarin-d.uni-saarland.de/unserwiki/doku.php?id=teaching:ws_2014-15:ps_skt LLRLLLLLLLLLLL http://fr46.uni-saarland.de/index.php?id=3450 unknown #069 - - - -
2016-06-01T22:54:42.901Z -9998 - http://fedora.clarin-d.uni-saarland.de/unserwiki/doku.php?id=teaching:hs_corpus_linguistics_sose2014 LLRLLLLLLLLLLL http://fr46.uni-saarland.de/index.php?id=3450 unknown #064 - - - -
2016-06-01T22:54:43.989Z -9998 - http://fedora.clarin-d.uni-saarland.de/unserwiki/doku.php?id=teaching:ws_2015-16:ue_wwc& LLRLLLLLLLLLL http://fr46.uni-saarland.de/index.php?id=kermes unknown #013 - - - -
2016-06-01T22:54:44.982Z -9998 - http://fedora.clarin-d.uni-saarland.de/unserwiki/doku.php?id=teaching:ss_2015:hs_comparing_corpora LLRLLLLLLLLLL http://fr46.uni-saarland.de/index.php?id=kermes unknown #063 - - - -
2016-06-01T22:54:45.846Z -9998 - http://fedora.clarin-d.uni-saarland.de/unserwiki/doku.php?id=teaching:ws_2014-15:hs_diachronic_cl LLRLLLLLLLLLL http://fr46.uni-saarland.de/index.php?id=kermes unknown #079 - - - -
2016-06-01T22:54:46.524Z -9998 - http://fedora.clarin-d.uni-saarland.de/unserwiki/doku.php?id=teaching:sose_2014:hs_corpus_linguistics_sose2014 LLRLLLLLLLLLL http://fr46.uni-saarland.de/index.php?id=kermes unknown #036 - - - -
2016-06-01T22:54:47.208Z -9998 - http://fedora.clarin-d.uni-saarland.de/unserwiki/doku.php?id=teaching:v_intro_ling LLRLLLLLLLLLL http://fr46.uni-saarland.de/index.php?id=kermes unknown #021 - - - -
2016-06-01T22:54:48.724Z 200 822 https://fedora.clarin-d.uni-saarland.de/robots.txt LLRLLLLLLLLLLLP https://fedora.clarin-d.uni-saarland.de/cqpweb/ text/plain #036 20160601225448557+166 sha1:GYJFUPPXHHDRTJR6FY7YMZ7ULNDNB57H - -
2016-06-01T22:54:52.079Z 200 13802 https://fedora.clarin-d.uni-saarland.de/cqpweb/ LLRLLLLLLLLLLL http://fr46.uni-saarland.de/lsteich/ClarindDS2013/links.php text/html #022 20160601225451899+158 sha1:TOQ7VEOYHP4VKHO4MQA5H63SXNWFOPE3 - 2t
2016-06-01T22:54:55.413Z 200 4733 https://fedora.clarin-d.uni-saarland.de/cqpweb/css/CQPweb.css LLRLLLLLLLLLLLE https://fedora.clarin-d.uni-saarland.de/cqpweb/ text/css #005 20160601225455292+120 sha157JW3K7VYTKHDOHCT7AYRSPHVLDXRGI - -
2016-06-01T22:54:59.146Z 200 95786 https://fedora.clarin-d.uni-saarland.de/cqpweb/jsc/jquery.js LLRLLLLLLLLLLLE https://fedora.clarin-d.uni-saarland.de/cqpweb/ application/javascript #084 20160601225458859+278 sha1:23A7IGLS3YD3BG72MPJOKD42WQPMG4V5 - -
2016-06-01T22:55:02.332Z 200 1194 https://fedora.clarin-d.uni-saarland.de/cqpweb/jsc/always.js LLRLLLLLLLLLLLE https://fedora.clarin-d.uni-saarland.de/cqpweb/ application/javascript #123 20160601225502232+99 sha1:FSNS5ACU3OJ6WDKPELRFPWMQPXARGAD7 - -
2016-06-01T22:55:06.929Z 200 7492 https://fedora.clarin-d.uni-saarland.de/cqpweb/css/img/UdSLogo.png LLRLLLLLLLLLLLE https://fedora.clarin-d.uni-saarland.de/cqpweb/ image/png #143 20160601225506669+259 sha1:5JKEQ7F7YZ3EOQCJBMNGF76EHQSSRF77 - -
2016-06-01T22:55:10.925Z 200 2894 https://fedora.clarin-d.uni-saarland.de/cqpweb/css/img/ocwb-logo.transparent.gif LLRLLLLLLLLLLLE https://fedora.clarin-d.uni-saarland.de/cqpweb/ image/gif #032 20160601225510719+205 sha1:YUM3CSLVBPFSZTAOX7XSB23PAHKXCVRE - -
2016-06-01T22:55:15.033Z 404 1193 https://fedora.clarin-d.uni-saarland.de/jsc/wz_tooltip.js LLRLLLLLLLLLLLE https://fedora.clarin-d.uni-saarland.de/cqpweb/ text/html #062 20160601225514851+181 sha1QKVXSRSXBG2WDGZQGSIQBFCXJIDTQNI - -
2016-06-01T22:55:19.002Z -9998 - http://fedora.clarin-d.uni-saarland.de/unserwiki/doku.php?id=teaching:ss_2015:v_fachkommunikation_sose2014 LLRLLLLLLLLLLL http://fr46.uni-saarland.de/index.php?id=3796&L=O unknown #123 - - - -
2016-06-01T22:55:20.221Z -9998 - http://fedora.clarin-d.uni-saarland.de/unserwiki/doku.php?id=teaching:hs_cohesion_in_english LLRLLLLLLLLLLL http://fr46.uni-saarland.de/index.php?id=3756&L=O unknown #087 - - - -

As the log shows, after fetching robots.txt a second time (this time over HTTPS), Heritrix crawls https://fedora.clarin-d.uni-saarland.de/cqpweb/ even though that path is disallowed by these lines in robots.txt:

User-agent: *
Disallow: /ajax
Disallow: /cqpweb
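The expected result can be cross-checked independently of Heritrix. The following is a minimal sketch using Python's standard urllib.robotparser on an excerpt of the rules above. The helper name is_allowed and the shortened rule list are assumptions made for illustration, and Python's parser is not Heritrix's implementation, so this only confirms what a conforming parser should decide for the two paths seen in the log.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the robots.txt quoted in this report (not the full file).
ROBOTS_EXCERPT = """\
User-agent: *
Disallow: /ajax
Disallow: /cqpweb
Disallow: /unserwiki
"""

# Parse the rules from memory instead of fetching them over the network.
parser = RobotFileParser()
parser.parse(ROBOTS_EXCERPT.splitlines())


def is_allowed(url: str, user_agent: str = "*") -> bool:
    """Hypothetical helper: True if the excerpted rules permit fetching url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://fedora.clarin-d.uni-saarland.de/cqpweb/",
        "http://fedora.clarin-d.uni-saarland.de/unserwiki/doku.php?id=workshops:hcl2016",
    ):
        print(url, "allowed" if is_allowed(url) else "disallowed")
```

Both URLs should be reported as disallowed, which matches the report's claim that the fetch of /cqpweb/ violates the rules.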

Environment

None

Status

Assignee

Unassigned

Reporter

Robert Jäschke

Labels

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Affects versions

Priority

Minor