How bots can kill a server

ict / computers April 13th, 2008

I’m running a pretty friendly little (LAMP) server here, hosting domains, email and web sites for friends and my own little side projects. Some sites are straight-up HTML and some are dynamic PHP sites, running WordPress or Joomla.

Last week I noticed a major dip in the server’s performance, which got worse and worse as the weekend approached. A quick scan revealed that my Apache processes were taking up around 90-95% of the CPU capacity, with no direct sign of which site was responsible. It took me 30 minutes to write and run a script that disabled the hosted sites one by one and checked the CPU load after each. The ‘culprit’ appeared to be moqub.com, a (Dutch) blog run by a friend of mine with a steady following of readers.
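The elimination approach can be sketched in a few lines of Python. This is only an illustration, not the script that was actually used: the function name `find_cpu_hog`, the 50% threshold and the three callbacks are all assumptions, and how a site is disabled or CPU load is read depends entirely on the server setup.

```python
from typing import Callable, Iterable, Optional


def find_cpu_hog(
    sites: Iterable[str],
    disable: Callable[[str], None],
    enable: Callable[[str], None],
    measure_cpu: Callable[[], float],
    threshold: float = 50.0,  # assumed cut-off, in percent
) -> Optional[str]:
    """Disable each site in turn and return the first one whose absence
    drops CPU usage below `threshold` percent, or None if no site does."""
    for site in sites:
        disable(site)
        try:
            cpu = measure_cpu()
        finally:
            enable(site)  # always restore the site, even if measuring fails
        if cpu < threshold:
            return site
    return None
```

On a real server the callbacks would wrap whatever mechanism is available for switching a virtual host off and on and for sampling the CPU load; those details are left out because they differ from one setup to the next.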

As I had recently upgraded her WordPress to version 2.5, my first suspicion was that the new software was messing up my server… but then again, my own site also runs WordPress 2.5, and if I put some stress on this blog the CPU still ticks over quietly instead of stressing out as it did with her site. But still… her WordPress was the result of upgrade upon upgrade, each version extended with lots of plugins, so I just couldn’t be sure. I set up a fresh site on the server, did a clean WordPress 2.5 install, imported Moqub’s posts, comments and links, added the theme (after verifying its 2.5 compatibility) and diverted the visitors to the new site. Hey presto, the CPU was back up to 90% again, even when I disabled all customisations and the extra theme. Yikes.

As I now knew that the problem was not with WordPress itself, I started digging around in the logs. I found an extraordinary number of requests from a single IP address, identifying itself as a “Microsoft Search Bot 4.0”. Well, that should be easy to fix: I built a custom robots.txt that should have shushed all robot traffic except for GoogleBot, but to no avail. The bot never even tried to read the text file; it just went straight for the content, running several threads at once at high speed and thus maxing out the CPU.

A little research showed the IP address belonged to the Provincial Central Library in Drenthe (a province in the north-east of The Netherlands). This was no coincidence, as Moqub writes about libraries and the use of information systems in them. Still, as their robot misbehaved, I had no alternative but to completely block the IP address the robot was coming from. Ahh… peace and quiet on the server at last.
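Spotting a single noisy client like this comes down to tallying requests per address in the access log. Here is a rough Python sketch, assuming the log is in Apache's combined format (client address first, user agent as the last quoted field); the name `top_clients` and the regular expression are my own illustration, and a user agent containing escaped quotes would trip it up.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumes combined log format: the client address is the first field
# and the user agent is the last double-quoted field on the line.
LOG_LINE = re.compile(r'^(?P<ip>\S+) .*"(?P<agent>[^"]*)"\s*$')


def top_clients(lines: Iterable[str], n: int = 5) -> List[Tuple[Tuple[str, str], int]]:
    """Return the n busiest (address, user agent) pairs with their request counts."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:  # silently skip lines that do not look like log entries
            counts[(match.group("ip"), match.group("agent"))] += 1
    return counts.most_common(n)
```

Passing an open log file to `top_clients` works because a file object iterates over its lines; a bot hammering the site should float straight to the top of the list.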

Now the question bugs me: why do robots still misbehave and completely ignore the robots.txt file, accepted (as far as I know) as the de facto standard for blocking or guiding robot traffic? And this was no home-brew script; this was a Microsoft robot. Am I just being silly and naive in expecting “professional” software to behave according to the rules?
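For what it’s worth, this is roughly what playing by the rules would look like. The robots.txt below is a guess at a file that admits only Googlebot (the actual file is not reproduced in this post), and Python’s standard `urllib.robotparser` stands in for a well-behaved crawler deciding whether it may fetch a page. The helper name `allowed` is made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents: allow Googlebot everywhere, turn every other robot away.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler that honours robots.txt may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse the text directly, no network access
    return parser.can_fetch(user_agent, url)
```

With these rules, `allowed("Googlebot", "http://moqub.com/")` is true and `allowed("Microsoft Search Bot 4.0", "http://moqub.com/")` is false. Of course, this only describes what a compliant robot would conclude; it does nothing against one that never asks for the file.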

The lesson for me here was that badly behaved scripts can really mess up your server, especially if they decide to dig into dynamically generated pages. And there really is not a whole lot you can do about it if they decide to completely ignore the standards in place.