Quick Python Tip: Socket Timeouts for Page Scrapers

It's not well documented, but you can set a timeout for urllib, urllib2, and the like by setting the default timeout in the socket module. So if your cron scripts keep hanging because some resource you want to scrape never responds, add the following to your script:


import socket

# The timeout is a float: the number of seconds to wait.
timeout = 2.5
socket.setdefaulttimeout(timeout)

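Any subsequent call to urllib, or to any other module built on socket, will now raise an IOError if the remote end does not respond before the timeout is reached. Note that the timeout applies to each blocking socket operation (connect, send, receive), not to the request as a whole.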

How you handle IOError is up to you. :-)
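
If you want a starting point, here is one rough way to do it: catch the error, log it, and try again a couple of times before giving up. This is only a sketch, not the one true way. It assumes Python 3, where urllib2's urlopen lives in urllib.request and IOError is an alias of OSError. The names fetch and retries are made up for this example.

```python
import socket
import urllib.request

# Same default timeout as above, in seconds.
socket.setdefaulttimeout(2.5)


def fetch(url, retries=2):
    """Return the response body as bytes, or None if every attempt fails.

    "fetch" and "retries" are hypothetical names used for illustration.
    """
    for attempt in range(retries + 1):
        try:
            return urllib.request.urlopen(url).read()
        except IOError as exc:
            # In Python 3 this catches socket.timeout and urllib.error.URLError,
            # since both are subclasses of OSError (IOError is an alias of it).
            print("attempt %d failed: %r" % (attempt + 1, exc))
    return None
```

A cron script can then check for None and skip that resource instead of hanging. Keep in mind that each retry can wait up to the full timeout again, so the worst case is roughly (retries + 1) times the timeout per blocking operation.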

Posted by deryck on March 4, 2008
