Archive for the ‘English’ Category

17-Mar-2009 11:22 UTC

Tuesday, March 17th, 2009
like crawling home after a long night of misadventures, burrowing into the backseat of a cab, squinting your eyes at the light buzzing across the horizon, tying your shoes, and thinking hard about blankets.

I wish.

15-Mar-2009 14:43 UTC

Sunday, March 15th, 2009

I’m sitting (planted, really) in bed, smoking, listening to Olsen Olsen from Sigur Rós’ 1999 album Ágætis byrjun, reading a somewhat interesting article on Kvenland while waiting for a 283 MB uplink transfer to a customer’s server, having a cold which includes stomach tingling and a runny nose, drinking cold coffee, trying to work but can’t. The sun shines outside the window, and the Olsen Olsen Da Capo is impressive in its chaotic song-theme-cum-background-noise. I guess I should be outside, but I have promises to keep and miles to go before I sleep.

I have one more coffee to drink on the Lido, but it won’t be today.

Never mind the bow. That’s how he always plays the guitar.

Forty-eight kilobits per second

Thursday, October 16th, 2008

What happens when you go "Hold on. I'm just going to ..."

Oh yes, this is going to be about … well … what newspapers and HR consultants call technology, but really, it’s about web server logs. Rummaging through Apache logs isn’t about technology, it’s about craft. The various tools I use to read and filter the logs are nothing but tools, and the actual reading and interpretation of them is up to my brain. Thus, it’s a craft. Tools + brain + experience = craft.

Rants aside, this is what it’s all about: search engines. Or crawlers or spiders or robots. Or, the people behind them. A random 10-minute snapshot of the access logs of any of the 12 websites I’m involved with shows a lot of hits from crawlers/spiders/search things/robots. Some of them visit the same content over and over again, checking and rechecking for updates that will never come. Each time, they transfer the entire article, page or script. Byte by byte, day after day, week after week. Yahoo, Baidu, Findlinks, Googlebot, Blekko, OOZ, Yanga, PicSearch, Microsoft (I forget what they call their robot this week), the annoying Cyveillance content spies and probably dozens of others. They all want the content of every web page in the entire universe stored in their index so that 16-year-old Juha from Finland can rest assured that there are no bikini girls on any of “my” 12 sites or that the contact page for a customer of mine doesn’t, in fact, contain Linux drivers for the $5 rebranded Realtek network card they sold at the discount store last week. Special note to the Cyveillance guys: nope, we don’t do replicas of watches or fake insanely expensive jeans. Doesn’t matter what you think. We still don’t.

Some crawl on their own behalf, while others sell indexed content to the highest bidder. Some only look for pictures, while some only look for pages linked to from somewhere out there containing special or not-so-special terms. But they all have to transfer every single accessible web page across the internet in order to crunch them, because that’s how HTTP works.

Why do they do this?

Because they want to compete with Google, that’s why. It’s free enterprise and healthy competition, right?

Apart from the sneaky bastards over at Cyveillance, I have few problems with any one of them. It’s only a problem because they show up in droves and flocks and mobs, drooling over the possibly useful content like people at a flea market. It’s like in the movies when journalists beleaguer the residence of some newly minted hero or someone who’s having his or her 15 minutes, only to find that the only person at home is the nanny. So they check back later, and then later yet, using up the driveway bandwidth like they have some kind of inalienable right to it. It’s the internet, I can do what I want! Woohoo!

Only they can’t.

Here’s some search engine optimization as seen from a network guy’s viewpoint: my virtual driveways are now combined into a single gravelled, bumpy and uncomfortable path 48 Kbit/s wide, and you’re all being directed there as soon as I find you. Yes, you all get to share the same ridiculously narrow pipe. You know, I don’t want to engineer robot traffic into bandwidth requirements. That’s not my business, it’s yours.

And don’t step on the grass. Please.
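
For illustration, a minimal sketch of how such a shared narrow pipe could be set up with ipfw and dummynet on FreeBSD. This is not the actual configuration behind the post: the pipe number, the rule numbers and the netblocks fed to it are placeholders. The function only prints the commands, so they can be read through before anything is run as root.

```sh
#!/bin/sh
# Print ipfw/dummynet commands that squeeze a list of crawler netblocks
# through one shared 48 Kbit/s pipe. Nothing is executed here; feed the
# output to sh (as root, with dummynet loaded) once it looks right.
# PIPE and RULE_BASE are arbitrary numbers chosen for this sketch.

PIPE=1
RULE_BASE=2000

crawler_pipe_rules() {
    # One pipe, shared by everybody who matches.
    echo "ipfw pipe ${PIPE} config bw 48Kbit/s"
    n=${RULE_BASE}
    # Read one CIDR block per line from stdin; skip blanks and comments.
    while read -r net rest; do
        case "${net}" in
            ''|'#'*) continue ;;
        esac
        # Throttle what the server sends back to the crawler.
        echo "ipfw add ${n} pipe ${PIPE} ip from me to ${net} out"
        n=$((n + 1))
    done
}
```

Something like `crawler_pipe_rules < crawlers.txt` (with a hypothetical crawlers.txt holding one netblock per line) prints the pipe definition followed by one rule per netblock. Because every rule points at the same pipe, all matching crawlers compete for the same 48 Kbit/s.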

Putting the Spamhaus DROP list in FreeBSD’s ipfw

Sunday, March 16th, 2008

Are you aware of the Spamhaus DROP list? According to the ladies and gentlemen of Spamhaus:

DROP (Don’t Route Or Peer) is an advisory “drop all traffic” list, consisting of stolen ‘zombie’ netblocks and netblocks controlled entirely by professional spammers. DROP is a tiny sub-set of the SBL designed for use by firewalls and routing equipment.

So DROP is simply a short-ish list of CIDR netblocks and Spamhaus SBL references, and that is something we can definitely use in our FreeBSD ipfw rules. There are a couple of Perl scripts in the DROP FAQ, but neither of them is suitable for generating ipfw rules, so I went ahead and made my own script. Yup, a good old-fashioned shell script. I don’t speak Perl (not very well, anyway), and a shell script can be made equally no-nonsense and portable in my opinion. I operate several internet-facing FreeBSD servers, and they all use this script to generate ipfw rules.
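
As a rough idea of what such a conversion involves (a sketch only, not the script referred to above), the function below turns DROP list text into ipfw commands. It assumes the list has one "CIDR ; SBLnnnn" entry per line and comment lines starting with ";". The table number and rule number are arbitrary choices for the example.

```sh
#!/bin/sh
# Turn Spamhaus DROP list text (read from stdin) into ipfw commands.
# Assumed input format: "CIDR ; SBLnnnn" per line, with comment lines
# starting with ";". TABLE and RULE are arbitrary numbers for this sketch.

TABLE=1
RULE=1000

drop_to_ipfw() {
    # Start from an empty table so delisted netblocks disappear.
    echo "ipfw -q table ${TABLE} flush"
    while read -r cidr rest; do
        case "${cidr}" in
            ''|';'*) continue ;;        # blank line or comment
            *[!0-9./]*) continue ;;     # not a plain IPv4 CIDR, skip it
            */*) echo "ipfw -q table ${TABLE} add ${cidr}" ;;
        esac
    done
    # One rule covers the whole table; the parentheses are quoted so the
    # printed line survives being fed to sh.
    echo "ipfw -q add ${RULE} deny ip from 'table(${TABLE})' to any"
}
```

Fed a saved copy of the list, for example `drop_to_ipfw < drop.txt`, it prints a flush, one table entry per netblock and a single deny rule. Using a lookup table rather than one rule per netblock keeps the ruleset short, which is a design choice of this sketch and not necessarily what the original script does.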

Powerset.com or page-store.com? Who’s to know

Sunday, February 17th, 2008

A few months ago I posted about page-store.com and their annoying way of crawling my domain. The short version is that over the course of about a day they visited every page in this domain, including the gallery with around a thousand pictures. Yep, the high resolution versions too. The result? Page-store.com spent 1.5 GB of my bandwidth, whereas the runner-up used 11 MB! Did I mention they ignored robots.txt?
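
Per-crawler bandwidth figures like these can be pulled out of an Apache log with a few lines of shell. The following is a minimal sketch, assuming the "combined" log format and no embedded double quotes in the request, referer or user agent fields; the function name is made up for the example.

```sh
#!/bin/sh
# Sum the bytes sent per user agent in an Apache "combined" format log
# read from stdin, biggest consumer first. Assumes the request line,
# referer and user agent contain no embedded double quotes.

bytes_per_agent() {
    awk -F'"' '
        {
            # $3 looks like " 200 6382 ": status code, then bytes sent.
            split($3, f, " ")
            total[$6] += f[2]      # a "-" byte count adds 0
        }
        END {
            for (ua in total)
                printf "%.0f\t%s\n", total[ua], ua
        }
    ' | sort -rn
}
```

Running `bytes_per_agent < access.log` prints one line per user agent with the byte total in the first column. Grouping by client address instead of user agent would work the same way with a different key.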

Tonight I spotted this in my Apache logs (address munging is mine):

67.202.45.112 - - [17/Feb/2008:22:15:27 +0100] "GET /gallery/view_comments.php?set_albumName=album03 HTTP/1.0" 200 6382 "http://break-left.org/gallery/album03" "zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl[at]powerset.com,email:paul[at]page-store.com]"

What’s up with this? Is Paul Pedersen crawling on behalf of the new (self-appointed and buzzword compliant) websearch outfit Powerset.com? From an Amazon IP address? To be fair, the crawler went about its business like any decent crawler should, and they claim to honour robots.txt.

Powerset’s team includes neither Mr. Pedersen nor anyone from Amazon.

To be unfair, they also claim that whenever you see their spider rummaging about in your site, you should consider yourself among the chosen ones. Didn’t page-store.com notice they’ve been blocked here since last July?

Anyway, the majority of material on this domain is written in Norwegian, and I’d really like to see how they handle my sometimes offbeat “structures and nuances of natural language”. Well, powerset.com, consider yourselves among the chosen ones too.

Oh, you’ll notice.

qmail and the Cisco PIX

Friday, January 4th, 2008

A spanner in the works.

Last night a customer of mine complained that some of his contacts couldn’t send him mail. Apparently, his contact received the following a few days after sending my customer something:

Hi. This is the qmail-send program at mail.example.com.
I'm afraid I wasn't able to deliver your message to the following addresses.
This is a permanent error; I've given up. Sorry it didn't work out.
<my.customers.email@example.tld>:
Connected to 255.12.255.14 but connection died. (#4.4.2) I'm not going to try again; this message has been in the queue too long.

Asleep at the Cisco

Whoa. Look at that. Another qmail installation just like my customer’s! The sending MTA’s IP address was nowhere to be found in the usual smtp or rblsmtpd logs, nor in the firewall logs. The DNSBL clearing house openrbl.org had nothing on them either. In other words, the traffic from the sending MTA didn’t even reach the various daemons handling high-level communications (qmail-smtpd, rblsmtpd etc.), but was turned back at the door, so to speak.

The next thing I did was to set up a tcpdump and have them send another mail to my customer’s mail address:

tcpdump -f src net 255.12.255.0/24

Sure enough, after a three-way handshake the communication from the sending MTA stopped:

10:47:24.118697 IP mail.example.com.50065 > 10.10.0.2.smtp: S 1000285782:1000285782(0) win 5840 <mss 1380,sackOK,timestamp 1764882384 0,nop,wscale 9>
10:47:24.166551 IP mail.example.com.50065 > 10.10.0.2.smtp: . ack 1709646105 win 12 <nop,nop,timestamp 1764882431 1755154237>
10:47:24.172664 IP mail.example.com.50065 > 10.10.0.2.smtp: R 0:12(12) ack 1 win 32832 <nop,nop,timestamp 1755154289 1764882431>

Now, between the FreeBSD server and the internet proper, there is a Cisco PIX firewall running PIX OS 7. This was the only place left to check, since the customer’s mail server did receive mail from other places without any (reported) problems. The PIX has no ACLs whatsoever and simply works as a NAT with port-holes.

After we turned off “Inspect ESMTP” in the PIX, the mails from the offending (for lack of a better word) MTA were received without a hitch.

Now, I don’t really know why Cisco puts this functionality into a firewall, but it does not belong there in my (not so) humble opinion. Why would a firewall need to inspect SMTP traffic? How do we know Cisco have understood all of the relevant RFCs and implemented the checks correctly? How do we even know why the PIX decided to block the entire connection after just three packets? Is it a bug in qmail (or qmqp, as the sending MTA seems to use) or in the PIX software? Why do I even need to poke around in the firewall after port 25 has been opened?

Oh, well. I realise, the usual allegations of qmail “not being a real mail server” (hello, Vernon) and the PIX “not being a real firewall” aside, that a PIX probably isn’t a good choice for a professional-use mail server, but that’s what they have.