Oh yes, this is going to be about … well … what newspapers and HR consultants call technology, but really, it’s about web server logs. Rummaging through Apache logs isn’t about technology, it’s about craft. The various tools I use to read and filter the logs are nothing but that, and the actual reading and interpretation of them is up to my brain. Thus, it’s a craft. Tools + brain + experience = craft.
Rants aside, this is what it’s all about: search engines. Or crawlers or spiders or robots. Or, the people behind them. A random 10-minute snapshot of the access logs of any of the 12 websites I’m involved with shows a lot of hits from crawlers/spiders/search things/robots. Some of them visit the same content over and over again, checking and rechecking for updates that will never come. Each time, they transfer the entire article, page or script. Byte by byte, day after day, week after week. Yahoo, Baidu, Findlinks, Googlebot, Blekko, OOZ, Yanga, PicSearch, Microsoft (I forget what they call their robot this week), the annoying Cyveillance content spies and probably dozens of others. They all want the content of every web page in the entire universe stored in their index so that 16-year-old Juha from Finland can rest assured that there are no bikini girls on any of “my” 12 sites or that the contact page for a customer of mine doesn’t, in fact, contain Linux drivers for the $5 rebranded Realtek network card they sold at the discount store last week. Special note to the Cyveillance guys: nope, we don’t do replicas of watches or fake insanely expensive jeans. Doesn’t matter what you think. We still don’t.
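For the curious, here is roughly what such a snapshot tally can look like. This is only a sketch, not the exact tooling I use: it assumes the logs are in Apache's "combined" format, and the list of user-agent hints and the function name tally_crawlers are made up for illustration. Adjust both to whatever your own logs show.

```python
import re
from collections import Counter

# Apache "combined" log format; the user agent is the last quoted field.
# Assumption: no escaped quotes inside the quoted fields.
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Hypothetical substrings that suggest a robot; not an authoritative list.
CRAWLER_HINTS = ("googlebot", "baiduspider", "yahoo", "findlinks",
                 "blekko", "bot", "spider", "crawl")


def tally_crawlers(lines):
    """Count hits and bytes sent per crawler-looking user agent."""
    hits = Counter()
    bytes_sent = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that are not in combined format
        agent = match.group("agent")
        if not any(hint in agent.lower() for hint in CRAWLER_HINTS):
            continue  # probably a human, or a robot in disguise
        hits[agent] += 1
        size = match.group("size")
        # Apache logs "-" when no body was sent (e.g. a 304 response).
        bytes_sent[agent] += int(size) if size.isdigit() else 0
    return hits, bytes_sent
```

Feed it the lines of a log file (for example `tally_crawlers(open("access.log"))`, where the file name is a placeholder) and sort the counters to see who is hogging the driveway. User agents can be faked, so treat the result as a hint, not a verdict.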
Some crawl on their own behalf, while others sell indexed content to the highest bidder. Some only look for pictures, while some only look for pages linked to from somewhere out there containing special or not-so-special terms. But they all have to transfer every single accessible web page across the internet in order to crunch them, because that’s how HTTP works.
Why do they do this?
Because they want to compete with Google, that’s why. It’s free enterprise and healthy competition, right?
Apart from the sneaky bastards over at Cyveillance, I have few problems with any one of them. It’s only a problem because they show up in droves and flocks and mobs, drooling over the possibly useful content like people at a flea market. They’re like the journalists in the movies who beleaguer the residence of some newly minted hero or someone who’s having his or her 15 minutes, only to find that the only person at home is the nanny. So they check back later, and then later yet, using up the driveway bandwidth like they have some kind of inalienable right to it. It’s the internet, I can do what I want! Woohoo!
Only they can’t.
Here’s search engine optimization as seen from a network guy’s viewpoint: my virtual driveways are now combined into a single gravelled, bumpy and uncomfortable path 48 Kbit/s wide, and you’re all being directed there as soon as I find you. Yes, you all get to share the same ridiculously narrow pipe. You know, I don’t want to engineer robot traffic into bandwidth requirements. That’s not my business, it’s yours.
And don’t step on the grass. Please.