Bad requests from the Reed Elsevier search engine
December 3, 2009
Over the last few weeks we have seen an ever-increasing number of strange requests from a crawler that runs on IP addresses used by Reed Elsevier in Australia. Time for some online forensics.
Recently, we have started to see a new kind of request error. One particular crawler not only requests pages based on the contents of the href attribute of the links in the HTML, but also makes requests for the link text itself. For example, for a piece of HTML looking like this:
<a href="/index.html">back to the home page</a>
requests would be made for URLs
What the writers of the crawler hope to achieve by making the second request is beyond us. Some of the characteristics of these requests are:
- Most requests we see come from IP addresses 126.96.36.199 and 188.8.131.52, which are assigned to Reed Elsevier Australia Pty Ltd. Some requests come from IP address 184.108.40.206, which resolves to reedel20.lnk.telstra.net.
- The crawler does not request robots.txt. It also doesn't set a user agent.
- We've observed the crawler happily making one or two requests per second, sustained for minutes at a time.
To demonstrate, here's a piece of access log:
220.127.116.11 - - [28/Nov/2009:14:40:09 +0100] "GET /links.htm HTTP/1.0" 200 3317 "-" "-"
18.104.22.168 - - [28/Nov/2009:14:40:10 +0100] "GET /links HTTP/1.0" 404 203 "-" "-"
22.214.171.124 - - [28/Nov/2009:14:40:11 +0100] "GET /contact.htm HTTP/1.0" 200 2264 "-" "-"
126.96.36.199 - - [28/Nov/2009:14:40:11 +0100] "GET /contact HTTP/1.0" 404 205 "-" "-"
188.8.131.52 - - [28/Nov/2009:14:40:12 +0100] "GET /stappenplan.htm HTTP/1.0" 200 2139 "-" "-"
184.108.40.206 - - [28/Nov/2009:14:40:13 +0100] "GET /stappenplan HTTP/1.0" 404 209 "-" "-"
220.127.116.11 - - [28/Nov/2009:14:40:13 +0100] "GET /inventarisatie.htm HTTP/1.0" 200 3045 "-" "-"
18.104.22.168 - - [28/Nov/2009:14:40:14 +0100] "GET /inventarisatie HTTP/1.0" 404 212 "-" "-"
22.214.171.124 - - [28/Nov/2009:14:40:15 +0100] "GET /nieuwe_wensen.htm HTTP/1.0" 200 2907 "-" "-"
126.96.36.199 - - [28/Nov/2009:14:40:16 +0100] "GET /nieuwe%20wensen HTTP/1.0" 404 211 "-" "-"
As you can see, requests are coming in continuously, and 404s are a dime a dozen.
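If you want to check your own logs for the same pattern, something along these lines may help. It is only a rough sketch, assuming your server writes the Apache "combined" log format shown in the excerpt; the names parse_line and summarize_anonymous are made up for this example and are not part of any tool we use. It counts the requests that arrive without a user agent, how many of those ended in a 404, and the time span they cover.

```python
import re
from datetime import datetime

# Assumed layout (Apache "combined" format):
# host ident user [time] "method path protocol" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict for one log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # Month abbreviations such as "Nov" assume an English/C locale.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


def summarize_anonymous(lines):
    """Summarize the requests that were sent without a user agent."""
    total = 0
    not_found = 0
    times = []
    for line in lines:
        entry = parse_line(line)
        # Skip unparseable lines and requests that do identify themselves.
        if entry is None or entry["agent"] not in ("", "-"):
            continue
        total += 1
        if entry["status"] == 404:
            not_found += 1
        times.append(entry["time"])
    seconds = (max(times) - min(times)).total_seconds() if times else 0.0
    return {"requests": total, "not_found": not_found, "seconds": seconds}
```

Feeding it the lines of an access log (for example, an open file object) gives a quick impression of how much of the traffic is anonymous and how much of that hits non-existent pages. A high share of 404s on paths that look like link texts would match what we describe here, but it is a hint, not proof.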
According to Bots vs Browsers, the IP addresses mentioned above are also used for CatchBot, which is indeed a crawler used by Reed Elsevier for their online business. The crawler we see, however, does not identify itself as CatchBot, and also doesn't appear to be very well-behaved.
It would be interesting to find out more about this crawler, and what it hopes to achieve. Our initial thought was that we might be able to do better in profiling the companies we host, so we contacted the people from Catch to see if they cared to elaborate on the matter. Their initial response was:
The crawler idents as "catchbot" and leaves the URL catchbot.com in the customer logs. We are crawling to confirm information about your company from the pubicly accessible web site. We have requested that your domain is excluded from being crawled.
which is not helpful at all. It sounds like a canned response, despite the Freudian spelling error. Also, it makes us wonder which domain has been excluded, since we didn't mention any of the 20 or 30 domains we've seen this crawler visit so far. Unfortunately, the people at CatchBot haven't come up with an acceptable explanation in response to our further requests for information. We haven't given up yet. To be continued.
It's very likely that similar requests are made to numerous other domains. If you have a clue or a theory as to what's going on here, please let us know in the talkback below.