Powerset.com or page-store.com? Who’s to know

A few months ago I posted about page-store.com and their annoying way of crawling my domain. The short version is that, over the course of about a day, they visited every page in this domain, including the gallery with around a thousand pictures. Yep, the high-resolution versions too. The result? Page-store.com spent 1.5 GB of my bandwidth, whereas the runner-up used 11 MB! Did I mention they ignored robots.txt?

Tonight I spotted this in my Apache logs (address munging is mine):

67.202.45.112 - - [17/Feb/2008:22:15:27 +0100] "GET /gallery/view_comments.php?set_albumName=album03 HTTP/1.0" 200 6382 "http://break-left.org/gallery/album03" "zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl[at]powerset.com,email:paul[at]page-store.com]"
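For anyone who wants to pick a line like this apart, here is a minimal sketch in Python. It assumes the entry follows Apache's "combined" log format, which is what the line above appears to be. The names COMBINED, parse and contacts are my own for illustration and not part of any Apache tooling.

```python
import re

# Assumed layout (Apache "combined" format):
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# The log line quoted above, split across string literals for readability.
LINE = (
    '67.202.45.112 - - [17/Feb/2008:22:15:27 +0100] '
    '"GET /gallery/view_comments.php?set_albumName=album03 HTTP/1.0" 200 6382 '
    '"http://break-left.org/gallery/album03" '
    '"zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) '
    '[email:crawl[at]powerset.com,email:paul[at]page-store.com]"'
)


def parse(line):
    """Return the fields of one combined-format line as a dict, or None."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


def contacts(agent):
    """Pull the munged 'email:name[at]domain' contacts out of a user agent."""
    return re.findall(r'email:([\w.-]+)\[at\]([\w.-]+)', agent)


entry = parse(LINE)
print(entry["host"], entry["status"], entry["size"])
print(contacts(entry["agent"]))
```

The second print is the interesting one: it lists the contact addresses the crawler announces, and here they belong to two different domains.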

What’s up with this? Is Paul Pedersen crawling on behalf of the new (self-appointed and buzzword-compliant) web search outfit Powerset.com? From an Amazon IP address? To be fair, the crawler went about its business like any decent crawler should, and they claim to honour robots.txt.

Powerset’s team includes neither Mr. Pedersen nor anyone from Amazon.

To be unfair, they also claim that whenever you see their spider rummaging about in your site, you should consider yourself among the chosen ones. Didn’t page-store.com notice they’ve been blocked here since last July?

Anyway, the majority of material on this domain is written in Norwegian, and I’d really like to see how they handle my sometimes offbeat “structures and nuances of natural language”. Well, powerset.com, consider yourselves among the chosen ones too.

Oh, you’ll notice.

2 Responses to “Powerset.com or page-store.com? Who’s to know”

  1. Paul Pedersen Says:

    Page-store is doing some outsourced crawl for a number of search companies. We strive to be maximally polite in the crawl, and we always try to consider the problem from the page provider’s point of view. My premise is that it’s better to have one good crawl from a known agent like page-store/zermelo, rather than a multitude of separate crawls from miscellaneous startups. It reduces bandwidth usage, and it multiplies the page exposure. Hopefully, since we specialize, we will learn to do the best crawl in the world - although we obviously have lots of work ahead of us. Right now we’re at the initial stage of a new crawl, bootstrapping a large seed URL set. We began the crawl with the top 1M traffic-ranked sites, plus pages referred from (and to) en.wikipedia.org, plus a selection of news, blog, university and government web sites. We’re listening to suggestions on all aspects of this problem, particularly how to make the crawl more open.

  2. tw Says:

    Sorry for taking so long to respond to this. Life happens.

    Mr. Pedersen: if you strive to be open and polite about your crawling, please consider adding a little more information to page-store.com. There is absolutely no information there that would help calm down a webmaster who sees your spider in the logs. Since your crawl was indeed much (much) better than the initial rampant gate-crashing here last July, I can only assume you are who you say you are and do what you say you do. Even so, nobody should have to trust a stranger’s claims on the internet. There is no information on whether you are honouring the robots exclusion protocol, no info on the crawler/spider, no information on what your business is all about and no way to contact you for technical stuff save for the sales address. I’m not a prospective customer and have no need for what you can sell me. I only want to know a little more about how you’re making money off of my content.

    The only way I even found out about your crawler was because this domain had *all* crawlers banned when you came around.

    By the way: what happens to the material you crawled when you *didn’t* honour robots.txt? You must have picked up a lot of things you weren’t meant to see, including a lot of stuff from here that isn’t meant to be in any search engine.

    Tor Willy