-------- Forwarded Message --------
Subject: Archive org bot hitting not existing urls
Date: Wed, 26 Jul 2017 15:35:00 +0000
From: Marius Vaitkus <email@example.com>
Your bot is constantly hitting non-existent URLs at https://www.wix.com/app-market
It looks like, instead of following actual URLs, it parses random keys inside the site and appends them to the current location. An example of such a URL:
Mozilla/5.0 (compatible; heritrix/3.3.0-SNAPSHOT-20140702-2247 +http://archive.org/details/archive.org_bot)
Also, hitting a section of a site with around 600 requests per minute, which results in about 3 times more traffic than this endpoint usually gets, is not very reasonable.
It would be really helpful if your bot stopped doing that, so that we don't need to apply additional measures to block it.
Could you please check again? As I said, wix.com/app-market* uses plain HTML links:
As an example of a strange hit, I saw the archive.org bot (Mozilla/5.0 (compatible; heritrix/3.3.0-SNAPSHOT-20140702-2247 +http://archive.org/details/archive.org_bot)) accessing:
If I view the source of the referrer, I only see permissionsRequest.agree as a key in a JSON object of translations, so it looks like your crawler tries to guess URLs rather than only crawling links that are already in the page.
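To illustrate the behaviour described above, here is a minimal sketch of how a bare string such as permissionsRequest.agree turns into a request under the current location when it is treated as a relative link. This is not the crawler's actual code. The referrer URL is a hypothetical placeholder (the real one is not quoted in this message), and speculative_url is an illustrative helper built on Python's standard urljoin.

```python
from urllib.parse import urljoin

# Hypothetical referrer page under the app market (an assumption for
# illustration; the real referrer is not quoted in this message).
referrer = "https://www.wix.com/app-market/example-app/overview"

# Key seen in the JSON translations object on the referrer page.
candidate = "permissionsRequest.agree"


def speculative_url(base: str, token: str) -> str:
    """Resolve a string as if it were a relative link on the page at `base`."""
    return urljoin(base, token)


guessed = speculative_url(referrer, candidate)
print(guessed)
```

Under standard relative-URL resolution the string replaces the last path segment of the referrer, so every such key found on a page yields one more non-existent URL under wix.com/app-market.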