HTTP conditional GET with Python urllib2

by Stii

When aggregating or reading crap loads of RSS feeds, it makes little or no sense to download every feed every time you check, when most feeds are updated only once a day. To give you an idea, at Afrigator the feeds add up to about half a gig (500 MB), so fetching them all every hour consumes 12 GB of data in 24 hours. All of that simply to get about 2000 new blog posts per day.

To take load off the system and cut down on data transfer, you can do an HTTP conditional GET. You send the ETag and Last-Modified values you saved from your last fetch back to the server in the If-None-Match and If-Modified-Since request headers. If the feed has not changed since then, the server answers with a 304 Not Modified status and no body, so only the headers come down the wire, which is a fraction of the data, and you can just ignore the feed. If it has changed, you get the full feed as usual and process it.
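To make the server's side of that exchange concrete, here is a rough sketch of the decision it has to make. It is a simplification and not any particular server's code: the function name `is_not_modified` and its parameters are made up for this example, and it splits the If-None-Match list on commas, which ignores some edge cases.

```python
from email.utils import parsedate_to_datetime


def _strip_weak(tag):
    """Drop the weak-validator prefix so W/"x" and "x" compare equal."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def is_not_modified(request_headers, current_etag=None, current_last_modified=None):
    """Return True if the server could answer 304 Not Modified.

    Hypothetical helper: request_headers is a plain dict of the
    client's headers, the other two are the feed's current validators.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # When If-None-Match is sent it wins; If-Modified-Since is ignored.
        if if_none_match.strip() == "*":
            return True
        if current_etag is None:
            return False
        candidates = [_strip_weak(t) for t in if_none_match.split(",")]
        return _strip_weak(current_etag) in candidates

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None and current_last_modified is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
            modified = parsedate_to_datetime(current_last_modified)
            return modified <= since
        except (TypeError, ValueError):
            # Unparseable or mismatched dates: play safe, send the full feed.
            return False

    # No validators sent: this is an ordinary GET.
    return False
```

The ETag check comes first because it is the more precise of the two validators; the date comparison is only a fallback for clients that did not send one.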

...
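req = urllib2.Request(url)

# Send the validators saved from the previous fetch, if there are any.
if etag:
    req.add_header("If-None-Match", etag)
if lastmodified:
    req.add_header("If-Modified-Since", lastmodified)

# NotModifiedHandler (not shown here) is expected to hand back a 304
# as a normal response rather than raising an error.
opener = urllib2.build_opener(NotModifiedHandler())
url_handle = opener.open(req)

if hasattr(url_handle, 'code') and url_handle.code == 304:
    # Feed unchanged since the last fetch: nothing to do.
    return
else:
    headers = url_handle.info()
    new_etag = headers.getheader("ETag")
    new_last_modified = headers.getheader("Last-Modified")

    if new_etag is not None and new_last_modified is not None:
        store_new_etag(new_etag, new_last_modified, self.id)

    # Get the content and write it to file.
    content = url_handle.read()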
...
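
The snippet leans on a `NotModifiedHandler` that is not shown. Here is one way it might look, written for Python 3, where urllib2's functionality lives in `urllib.request`. Treat it as a sketch under assumptions: the handler body is my guess at what such a class does, and `build_conditional_request` and `fetch_if_changed` are made-up helper names for this example.

```python
import urllib.request
from urllib.response import addinfourl


class NotModifiedHandler(urllib.request.BaseHandler):
    """Hand a 304 back as a normal response instead of raising HTTPError."""

    def http_error_304(self, req, fp, code, msg, headers):
        # Wrap the empty body so the caller can inspect .code and .info().
        return addinfourl(fp, headers, req.full_url, code)


def build_conditional_request(url, etag=None, last_modified=None):
    """Attach validators only if we actually have them stored."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (content, etag, last_modified); content is None on a 304."""
    opener = urllib.request.build_opener(NotModifiedHandler())
    response = opener.open(build_conditional_request(url, etag, last_modified))
    if response.code == 304:
        # Unchanged: keep the validators we already had.
        return None, etag, last_modified
    headers = response.info()
    return response.read(), headers.get("ETag"), headers.get("Last-Modified")
```

Without a handler like this the opener raises an `HTTPError` for the 304, so the other option is to catch that exception and check its `code`.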

If you’re interested in the more technical aspects of what happens, see this brilliant post. If you plan to build a feed reader at all, you need to use this technique. If you don’t, you will not only kill your own bandwidth, but everybody else’s too. If you built your own blogging platform, you need to make sure that you add the necessary ETag and Last-Modified headers to your RSS feed. I’ll tell you next time how to do that. If you are on WordPress, Blogger or Movable Type it should be fine.