Bad requests from the Reed Elsevier search engine
Joor Loohuis, December 3, 2009
Over the last few weeks we have been seeing an ever-increasing number of strange requests from a crawler that runs on IP addresses used by Reed Elsevier in Australia. Time for some online forensics.
Tags: crawler, HTTP, search engine, SEO
We've been hosting many websites for quite a few years now, so we've seen our share of strange requests in the server logs. Search engine crawlers in particular tend to stand out, typically because they parse a webpage a bit differently from the browsers the page was written for. This usually results in broken requests for JavaScript code and other peculiarities, but nothing really remarkable.
Recently, we have started to see a new kind of request error. One particular crawler not only requests pages based on the contents of the href attribute of the links in the HTML, but also makes requests for the text of the link itself. For example, for a piece of HTML looking like this:
<a href="/index.html">back to the home page</a>
requests would be made for URLs
/index.html
/back%20to%20the%20home%20page
What the writers of the crawler hope to achieve by making the second request is beyond us. Some of the characteristics of these requests are:
- Most requests we see come from IP addresses 203.6.203.78 and 203.6.203.79, which are assigned to Reed Elsevier Australia Pty Ltd. Some requests come from IP address 139.130.192.34, which resolves to reedel20.lnk.telstra.net.
- The crawler does not request robots.txt. It also doesn't send a User-Agent header.
- We've observed the crawler sustaining one or two requests per second for minutes at a time.
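To make the pattern easier to recognise in your own logs, here is a minimal Python sketch that derives both paths from a piece of HTML. It assumes the crawler simply percent-encodes the link text and uses it as a root-relative path. That matches the example above, but it is our guess, not something we know about the crawler's internals. The names LinkCollector and suspected_requests are ours.

```python
from html.parser import HTMLParser
from urllib.parse import quote


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor elements."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the anchor currently open, if any
        self._text = []    # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def suspected_requests(html):
    """Return the paths a crawler behaving like this one might request."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    paths = []
    for href, text in parser.links:
        paths.append(href)
        # Assumption: the link text is percent-encoded and used as a path.
        paths.append("/" + quote(text))
    return paths


print(suspected_requests('<a href="/index.html">back to the home page</a>'))
# ['/index.html', '/back%20to%20the%20home%20page']
```

Running this over the pages of a site gives a list of bogus paths to look for in its access log.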
To demonstrate, here's an excerpt from an access log:
203.6.203.79 - - [28/Nov/2009:14:40:09 +0100] "GET /links.htm HTTP/1.0" 200 3317 "-" "-"
203.6.203.79 - - [28/Nov/2009:14:40:10 +0100] "GET /links HTTP/1.0" 404 203 "-" "-"
203.6.203.79 - - [28/Nov/2009:14:40:11 +0100] "GET /contact.htm HTTP/1.0" 200 2264 "-" "-"
203.6.203.79 - - [28/Nov/2009:14:40:11 +0100] "GET /contact HTTP/1.0" 404 205 "-" "-"
203.6.203.79 - - [28/Nov/2009:14:40:12 +0100] "GET /stappenplan.htm HTTP/1.0" 200 2139 "-" "-"
203.6.203.79 - - [28/Nov/2009:14:40:13 +0100] "GET /stappenplan HTTP/1.0" 404 209 "-" "-"
203.6.203.79 - - [28/Nov/2009:14:40:13 +0100] "GET /inventarisatie.htm HTTP/1.0" 200 3045 "-" "-"
203.6.203.79 - - [28/Nov/2009:14:40:14 +0100] "GET /inventarisatie HTTP/1.0" 404 212 "-" "-"
203.6.203.79 - - [28/Nov/2009:14:40:15 +0100] "GET /nieuwe_wensen.htm HTTP/1.0" 200 2907 "-" "-"
203.6.203.79 - - [28/Nov/2009:14:40:16 +0100] "GET /nieuwe%20wensen HTTP/1.0" 404 211 "-" "-"
As you can see, requests are coming in continuously, and 404s are a dime a dozen.
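If you want to check your own logs for the same behaviour, the following Python sketch may help. It assumes log lines in the Apache "combined" format, as in the excerpt above, and counts per IP address the total requests, the 404s, the requests without a user agent, and whether robots.txt was ever requested. The function name summarize and the regular expression are our own, and the sample data is simply the first four lines of the excerpt.

```python
import re
from collections import defaultdict
from datetime import datetime

# Apache "combined" log format, matching the excerpt above (an assumption
# about your server configuration).
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def summarize(lines):
    """Collect per-IP counters from an iterable of access log lines."""
    stats = defaultdict(lambda: {
        "requests": 0, "not_found": 0, "no_agent": 0,
        "robots_txt": False, "first": None, "last": None,
    })
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that are not in the assumed format
        entry = stats[match["ip"]]
        # %b expects English month abbreviations (the default C locale).
        when = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
        entry["requests"] += 1
        if match["status"] == "404":
            entry["not_found"] += 1
        if match["agent"] in ("", "-"):
            entry["no_agent"] += 1
        if match["path"] == "/robots.txt":
            entry["robots_txt"] = True
        if entry["first"] is None or when < entry["first"]:
            entry["first"] = when
        if entry["last"] is None or when > entry["last"]:
            entry["last"] = when
    return dict(stats)


SAMPLE = """\
203.6.203.79 - - [28/Nov/2009:14:40:09 +0100] "GET /links.htm HTTP/1.0" 200 3317 "-" "-"
203.6.203.79 - - [28/Nov/2009:14:40:10 +0100] "GET /links HTTP/1.0" 404 203 "-" "-"
203.6.203.79 - - [28/Nov/2009:14:40:11 +0100] "GET /contact.htm HTTP/1.0" 200 2264 "-" "-"
203.6.203.79 - - [28/Nov/2009:14:40:11 +0100] "GET /contact HTTP/1.0" 404 205 "-" "-"
"""

for ip, entry in summarize(SAMPLE.splitlines()).items():
    seconds = (entry["last"] - entry["first"]).total_seconds()
    print(ip, entry["requests"], "requests,", entry["not_found"], "not found,",
          entry["no_agent"], "without user agent, over", seconds, "seconds")
```

An IP address with a high share of 404s, no user agent, and no robots.txt request is a likely candidate for the crawler described here, although other badly behaved clients will show up as well.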
According to Bots vs Browsers, the IP addresses mentioned above are also used by CatchBot, which is indeed a crawler used by Reed Elsevier for their online business. The crawler we see, however, does not identify itself as CatchBot, and doesn't appear to be very well-behaved either.
Feedback
It would be interesting to find out more about this crawler, and what it hopes to achieve. Our initial thought was that we might be able to do better in profiling the companies we host, so we contacted the people at Catch to ask whether they would care to elaborate on the matter. Their initial response was:
The crawler idents as "catchbot" and leaves the URL catchbot.com in the customer logs. We are crawling to confirm information about your company from the pubicly accessible web site. We have requested that your domain is excluded from being crawled.
which is not helpful at all. It sounds like a canned response, despite the Freudian spelling error. It also makes us wonder which domain has been excluded, since we didn't mention any of the 20 or 30 domains we've seen this crawler visit so far. Unfortunately, the people at CatchBot haven't come up with an acceptable explanation in response to our further requests for information. We haven't given up yet. To be continued.
Theories?
It's very likely that similar requests are made to numerous other domains. If you have a clue or a theory as to what's going on here, please let us know in the talkback below.
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Netherlands License.










