Search engine crawlers have a miserable cache hitrate

I want to migrate some load off my main webserver/database, so I was investigating which of our pages render the slowest. While doing that, I discovered that most of our rendering time comes from just a few client IP addresses, which turned out to be search indexers. Grouping all requests by user agent (“robots”, which is only googlebot and bingbot, and “humans”, which is everybody else), I get:

Robot reqs: 8062    total_sz: 85MB    avg_sz: 10kB  avg_upstr_time: 1925ms  total_upstr_time: 15521s  cache_hitrate: 51.3%
Human reqs: 414898  total_sz: 5132MB  avg_sz: 12kB  avg_upstr_time: 32ms    total_upstr_time: 13520s  cache_hitrate: 98.7%

So the average robot request is some 60 times slower to render than the average human one. This is because the crawlers spend most of their time loading old pages that nobody else cares about, which are never in cache and so incur heavy disk I/O in our database.
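
For anyone curious how I'd reproduce that grouping, here is a rough sketch of the aggregation, not the exact script I ran: it assumes a tab-separated access log with user agent, response size in bytes, upstream response time in seconds, and upstream cache status fields, and it matches robots on the Googlebot/bingbot user-agent substrings.

```python
# Minimal sketch: classify access-log lines as "robot" (Googlebot/bingbot) or
# "human" and aggregate request count, size, upstream time, and cache hit rate.
# The tab-separated field layout below is an assumption about the log format.
import csv
import sys
from collections import defaultdict

ROBOT_MARKERS = ("Googlebot", "bingbot")  # assumed user-agent substrings

def classify(user_agent: str) -> str:
    return "robot" if any(m in user_agent for m in ROBOT_MARKERS) else "human"

def main(path: str) -> None:
    stats = defaultdict(lambda: {"reqs": 0, "bytes": 0, "upstr": 0.0, "hits": 0})
    with open(path, newline="") as f:
        # assumed fields per line: user agent, bytes sent, upstream time, cache status
        for ua, size, upstream_time, cache_status in csv.reader(f, delimiter="\t"):
            s = stats[classify(ua)]
            s["reqs"] += 1
            s["bytes"] += int(size)
            # "-" means the request never reached an upstream (served from cache)
            s["upstr"] += 0.0 if upstream_time == "-" else float(upstream_time)
            s["hits"] += cache_status == "HIT"

    for group, s in stats.items():
        print(f"{group.capitalize()} reqs: {s['reqs']} "
              f"total_sz: {s['bytes'] / 1e6:.0f}MB "
              f"avg_sz: {s['bytes'] / s['reqs'] / 1e3:.0f}kB "
              f"avg_upstr_time: {1000 * s['upstr'] / s['reqs']:.0f}ms "
              f"total_upstr_time: {s['upstr']:.0f}s "
              f"cache_hitrate: {100 * s['hits'] / s['reqs']:.1f}%")

if __name__ == "__main__":
    main(sys.argv[1])
```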

I plan to create a read-only database replica and a second webserver dedicated to handling requests from these search crawlers. That will stop the caches on our primary server from being wasted on old content that no humans want to see.
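
If the site is fronted by a reverse proxy like nginx (which the upstream-time and cache-status numbers suggest, though I'm assuming that here), the split could be as simple as a user-agent map feeding two upstream pools. The pool names, addresses, and regex below are placeholders, not our real config:

```nginx
# Sketch only (belongs inside the http {} block): route known crawler
# user agents to a dedicated backend backed by the read-only replica.
map $http_user_agent $backend_pool {
    default                primary_backend;
    ~*(googlebot|bingbot)  crawler_backend;
}

upstream primary_backend {
    server 10.0.0.10:8080;   # main webserver, primary database
}

upstream crawler_backend {
    server 10.0.0.20:8080;   # second webserver, read-only replica
}

server {
    listen 80;
    location / {
        # proxy_pass with a variable resolves against the upstream names above
        proxy_pass http://$backend_pool;
    }
}
```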
