feb17

Powerset.com or page-store.com? Who’s to know

A few months ago I posted about page-store.com and their annoying way of crawling my domain. The short version is that over the course of about a day they visited every page in this domain, including the gallery with around a thousand pictures. Yep, the high-resolution versions too. The result? Page-store.com spent 1.5 GB of my bandwidth, whereas the runner-up used 11 MB! Did I mention they ignored robots.txt?

Tonight I spotted this in my Apache logs (address munging is mine):

67.202.45.112 - - [17/Feb/2008:22:15:27 +0100] "GET /gallery/view_comments.php?set_albumName=album03 HTTP/1.0" 200 6382 "http://break-left.org/gallery/album03" "zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl[at]powerset.com,email:paul[at]page-store.com]"

What’s up with this? Is Paul Pedersen crawling on behalf of the new (self-appointed and buzzword-compliant) web search outfit Powerset.com? From an Amazon IP address? To be fair, the crawler went about its business like any decent crawler should, and they claim to honour robots.txt.

Powerset’s team includes neither Mr. Pedersen nor anyone from Amazon.

To be unfair, they also claim that whenever you see their spider rummaging about in your site, you should consider yourself among the chosen ones. Didn’t page-store.com notice they’ve been blocked here since last July?

Anyway, the majority of material on this domain is written in Norwegian, and I’d really like to see how they handle my sometimes offbeat “structures and nuances of natural language”. Well, powerset.com, consider yourselves among the chosen ones too.

Oh, you’ll notice.
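
For anyone who wants to pick visits like this out of their own logs, here is a minimal sketch in Python. It assumes the standard Apache "combined" log format, which is what the line quoted above looks like, and it sums the bytes served per user-agent string. The names `COMBINED`, `parse_line`, `bytes_per_agent` and `SAMPLE` are made up for this illustration, and the pattern is deliberately simple: it does not handle escaped quotes inside a field.

```python
import re
from collections import Counter

# Assumed layout (Apache "combined" format):
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Apache writes "-" when no body bytes were sent; count that as zero.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


def bytes_per_agent(lines):
    """Sum the response sizes per user-agent string, skipping unparsable lines."""
    totals = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None:
            totals[entry["agent"]] += entry["size"]
    return totals


# The log line quoted in this post, used as sample input.
SAMPLE = (
    '67.202.45.112 - - [17/Feb/2008:22:15:27 +0100] '
    '"GET /gallery/view_comments.php?set_albumName=album03 HTTP/1.0" 200 6382 '
    '"http://break-left.org/gallery/album03" '
    '"zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) '
    '[email:crawl[at]powerset.com,email:paul[at]page-store.com]"'
)

if __name__ == "__main__":
    for agent, total in bytes_per_agent([SAMPLE]).most_common():
        print(total, agent)
```

Fed a whole access log instead of the single sample line, `most_common()` lists the hungriest agents first, which is roughly how a 1.5 GB visitor would stand out from an 11 MB runner-up.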


2 comments on “Powerset.com or page-store.com? Who’s to know”


  1. Paul Pedersen

    Said the following on 18 February 2008 at 9:57:

    Page-store is doing some outsourced crawl for a number of search companies. We strive to be maximally polite in the crawl, and we always try to consider the problem from the page provider’s point of view. My premise is that it’s better to have one good crawl from a known agent like page-store/zermelo, rather than a multitude of separate crawls from miscellaneous startups. It reduces bandwidth usage, and it multiplies the page exposure. Hopefully, since we specialize, we will learn to do the best crawl in the world - although we obviously have lots of work ahead of us. Right now we’re at the initial stage of a new crawl, bootstrapping a large seed URL set. We began the crawl with the top 1M traffic ranked sites, plus pages referred from (and to) en.wikipedia.org, plus a selection of news, blog, university and government web sites. We’re listening to suggestions on all aspects of this problem, particularly how to make the crawl more open.

  2. tw

    Said the following on 25 February 2008 at 0:36:

    Sorry for taking so long to respond to this. Life happens.

    Mr. Pedersen: if you strive to be open and polite about your crawling, please consider adding a little more information to page-store.com. There is absolutely no information there that would contribute to calming down a webmaster seeing your spider in the logs. Since your crawl was indeed much (much) better than the initial rampant gate-crashing here last July, I can only assume you are and do what you say. Nobody should have to trust a stranger’s claims on the internet. There is no information on whether you are honouring the robot exclusion protocol, no info on the crawler/spider, no information on what your business is all about and no way to contact you for technical stuff save for the sales address. I’m not a prospective customer and have no need for what you can sell me. I only want to know a little more about how you’re making money off of my content.

    The only way I even found out about your crawler was because this domain had *all* crawlers banned when you came around.

    By the way: what happens to the material you crawled when you *didn’t* honour robots.txt? You must have picked up a lot of things you weren’t meant to see, including a lot of stuff from here that isn’t meant to be in any search engine.

    Tor Willy



About the blog

Sugerør?

Yes, you see, I’m running around over here while you sit perfectly still over there, looking through a drinking straw (a sugerør). Now and then I pass through your somewhat limited field of view. Afterwards you may not be quite sure what you caught sight of. A glimpse through a drinking straw. The title is taken from a series of English and Norwegian stories I drafted in 2001-2004, in which I played the observer toward people I met here & there. They were called “Through Sucking Straws”. These unfinished stories have never been published anywhere on the web and probably never will be. I probably know more about myself than about others, and I have less time to write well now than I used to.

So this blog (or whatever it is) has no agenda beyond containing my words. A week without posts only means that it was a week of entirely different things. Everything in this blog is © Break Left Outings unless otherwise indicated by quotation marks or a quotation dash.

Bits and pieces about my old red Italian can be found in a separate blog. Everything about it is in English, partly because I don’t know Italian, and partly because there are hardly any of its kind in Norway.

Tor Willy

Sugerør appears courtesy of Break Left Outings.