A few months ago I posted about page-store.com and their annoying way of crawling my domain. The short version is that, over the course of about a day, they visited every page on this domain, including the gallery with around a thousand pictures. Yep, the high-resolution versions too. The result? Page-store.com used 1.5 GB of my bandwidth, whereas the runner-up used 11 MB! Did I mention they ignored robots.txt?
Tonight I spotted this in my Apache logs (address munging is mine):
220.127.116.11 - - [17/Feb/2008:22:15:27 +0100] "GET /gallery/view_comments.php?set_albumName=album03 HTTP/1.0" 200 6382 "http://break-left.org/gallery/album03" "zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl[at]powerset.com,email:paul[at]page-store.com]"
What’s up with this? Is Paul Pedersen crawling on behalf of the new (self-appointed and buzzword-compliant) web search outfit Powerset.com? From an Amazon IP address? To be fair, the crawler went about its business like any decent crawler should, and they claim to honour robots.txt.
Powerset’s team includes neither Mr. Pedersen nor anyone from Amazon.
To be unfair, they also claim that whenever you see their spider rummaging about in your site, you should consider yourself among the chosen ones. Didn’t page-store.com notice they’ve been blocked here since last July?
Anyway, the majority of material on this domain is written in Norwegian, and I’d really like to see how they handle my sometimes offbeat “structures and nuances of natural language”. Well, powerset.com, consider yourselves among the chosen ones too.
Oh, you’ll notice.