Bad requests from the Reed Elsevier search engine

Joor Loohuis, December 3, 2009.

Over the last few weeks we've been seeing an ever-increasing number of strange requests from a crawler running on IP addresses used by Reed-Elsevier in Australia. Time for some online forensics.


We've been hosting many websites for quite a few years now, so we've seen our share of strange requests in the server logs. Search engine crawlers in particular tend to stand out, typically because they parse a web page a bit differently from the browsers the page was written for. This usually results in broken requests for JavaScript code and other peculiarities, but nothing really remarkable.

Recently, however, we've started to see a new kind of request error. One particular crawler not only requests pages based on the contents of the href attribute of the links in the HTML, but also makes requests for the link text itself. For example, for a piece of HTML like this:

<a href="/index.html">back to the home page</a>

requests would be made for the URLs

/index.html
/back%20to%20the%20home%20page
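
How the second request comes about is anyone's guess, but for illustration, a link extractor that captures both the href attribute and the anchor text, and then treats the anchor text as a path, would produce exactly this pair. The following is a minimal, purely hypothetical sketch; neither the regular expression nor anything else in it is taken from the actual crawler:

# Hypothetical illustration: a link extractor that also treats the anchor
# text as a URL reproduces the request pattern we observe.
import re
from urllib.parse import quote

html = '<a href="/index.html">back to the home page</a>'

# Capture both the href attribute and the anchor text of each link.
link_re = re.compile(r'<a\s+href="([^"]+)">([^<]+)</a>')

for href, text in link_re.findall(html):
    print(href)               # /index.html
    print('/' + quote(text))  # /back%20to%20the%20home%20page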

What the writers of the crawler hope to achieve by making the second request is beyond us. Some of the characteristics of these requests are:

  • Most requests we see come from IP addresses 203.6.203.78 and 203.6.203.79, which are assigned to Reed Elsevier Australia Pty Ltd. Some requests come from IP address 139.130.192.34, which resolves to reedel20.lnk.telstra.net.
  • The crawler does not request robots.txt. It also doesn't set a user agent.
  • We've observed the crawler happily making one or two requests per second, sustained for minutes on end.

To demonstrate, here's a piece of access log:

203.6.203.79 - - [28/Nov/2009:14:40:09 +0100] "GET /links.htm HTTP/1.0" 200 3317 "-" "-"
203.6.203.79 - - [28/Nov/2009:14:40:10 +0100] "GET /links HTTP/1.0" 404 203 "-" "-"
203.6.203.79 - - [28/Nov/2009:14:40:11 +0100] "GET /contact.htm HTTP/1.0" 200 2264 "-" "-"
203.6.203.79 - - [28/Nov/2009:14:40:11 +0100] "GET /contact HTTP/1.0" 404 205 "-" "-"
203.6.203.79 - - [28/Nov/2009:14:40:12 +0100] "GET /stappenplan.htm HTTP/1.0" 200 2139 "-" "-"
203.6.203.79 - - [28/Nov/2009:14:40:13 +0100] "GET /stappenplan HTTP/1.0" 404 209 "-" "-"
203.6.203.79 - - [28/Nov/2009:14:40:13 +0100] "GET /inventarisatie.htm HTTP/1.0" 200 3045 "-" "-"
203.6.203.79 - - [28/Nov/2009:14:40:14 +0100] "GET /inventarisatie HTTP/1.0" 404 212 "-" "-"
203.6.203.79 - - [28/Nov/2009:14:40:15 +0100] "GET /nieuwe_wensen.htm HTTP/1.0" 200 2907 "-" "-"
203.6.203.79 - - [28/Nov/2009:14:40:16 +0100] "GET /nieuwe%20wensen HTTP/1.0" 404 211 "-" "-"

As you can see, requests are coming in continuously, and 404s are a dime a dozen.
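
For anyone who wants to check their own logs for this kind of traffic, a rough sketch along the following lines will do; it assumes the Apache-style log layout shown above and the IP addresses listed earlier, so adjust both to your own situation. Feed it an access log on standard input and it prints a count per status code for the crawler's requests.

# Rough sketch: tally the HTTP status codes of requests made by the
# crawler's IP addresses in an Apache-style access log read from stdin.
import re
import sys
from collections import Counter

CRAWLER_IPS = {"203.6.203.78", "203.6.203.79", "139.130.192.34"}

LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"\S+ (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)

statuses = Counter()
for line in sys.stdin:
    m = LOG_RE.match(line)
    if m and m.group("ip") in CRAWLER_IPS:
        statuses[m.group("status")] += 1

for status, count in sorted(statuses.items()):
    print(status, count)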

According to Bots vs Browsers, the IP addresses mentioned above are also used by CatchBot, which is indeed a crawler used by Reed-Elsevier for their online business. The crawler we see, however, does not identify itself as CatchBot, and it doesn't appear to be very well-behaved either.
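
As a side note, the reverse DNS of these addresses is easy to verify yourself. Here is a quick sketch using nothing but the Python standard library; what it prints depends on the PTR records at the moment you run it, so it may no longer match what we saw (reedel20.lnk.telstra.net):

# Look up the reverse DNS (PTR record) for each of the IP addresses above.
import socket

for ip in ("203.6.203.78", "203.6.203.79", "139.130.192.34"):
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        host = "(no PTR record)"
    print(ip, "->", host)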

Feedback

It would be interesting to find out more about this crawler and what it hopes to achieve. Our initial thought was that we might be able to do a better job of profiling the companies we host, so we contacted the people from Catch to see if they cared to elaborate on the matter. Their initial response was:

The crawler idents as "catchbot" and leaves the URL catchbot.com in the
customer logs. We are crawling to confirm information about your company
from the pubicly accessible web site.
We have requested that your domain is excluded from being crawled.

which is not helpful at all. It sounds like a canned response, despite the Freudian spelling error. It also makes us wonder which domain has been excluded, since we didn't mention any of the 20 or 30 domains we've seen this crawler visit so far. Unfortunately, the people at CatchBot haven't come up with an acceptable explanation in response to our further requests for information. We haven't given up yet. To be continued.

Theories?

It's very likely that similar requests are made to numerous other domains. If you have a clue or a theory as to what's going on here, please let us know in the talkback below.


Talkback


Re: Bad requests from the Reed Elsevier search engine (anonyme, 2009-12-04 13:18 CET)
Hello,
Thanks for all this info but I am not sophisticated enough to have any theory on the matter. However, from the number of 404 errors they are generating in my logs, I simply banned their IP from accessing my domain in the future.
Re: Bad requests from the Reed Elsevier search engine (Joor Loohuis, 2009-12-04 15:57 CET)
> Thanks for all this info but I am not sophisticated enough to have
> any theory on the matter. However, from the number of 404 errors
> they are generating in my logs, I simply banned their IP from
> accessing my domain in the future.

That is one way of dealing with it, but we are not in a position to decide for our corporate customers whether to block certain crawlers. On the contrary, we'd rather find out what the point is behind the strange requests, so that we can either improve the rankings of our customers, or get Reed-Elsevier (or Hot Frog, or whoever) to fix the problem.

Well, doesn't that sound politically correct or what ;)
Re: Bad requests from the Reed Elsevier search engine (Jos Visser, 2009-12-07 07:30 CET)
Just goes to show that writing a decent crawler is not that easy :-)