Spider

2013/04/20 Python

License: (CC 3.0) BY-NC-SA

Reference

  1. 淺談coroutine與gevent
  2. Beautiful Soup 中文文档
  3. Queue — A synchronized queue class
  4. gevent For the Working Python Developer - Written by the Gevent Community

Workflow

argparse

  1. parser = argparse.ArgumentParser
  2. parser.add_argument(‘-u’, ‘–url’)
  3. args = parser.parse_args(sys.argv[1:])

urllib2

  1. page = urllib2.urlopen(url)
  2. content = page.read()
  3. page.close()

BeautifulSoup

  1. bs = BeautifulSoup.BeautifulSoup(content)
  2. links = bs(‘a’)

re

  1. pattern = re.compile(r’^(http:// https://)’)
  2. http_links = filter(pattern.search, links)

Queue

  1. queue = Queue.Queue()
  2. queue.put(link, [block[, timeout]])
  3. queue.get([block[, timeout]])

gevent

  1. thread = gevent.spawn(f, *f_args)
  2. thread.join()

sqlite3

  1. conn = sqlite3.connect(‘file.db’)
  2. curs = conn.cursor()
  3. curs.execute()
  4. conn.commit()
  5. curs.close()
  6. conn.close()

logging

  1. logging module is thread safe because it uses threading locks
  2. logger = logging.getLogger(name)
  3. lh = logging.FileHandler(filename)
  4. logger.addHandler(lh)

Trouble Shoot

urllib2.URLError:

urlopen error [Errno 67] request timed out

This problem raises after gevent is imported, directly use urllib2 is ok, even though it is really slow because of my network traffic. However, my localhost blog can be accessed without any problem. this issue showes that gevent version 0.13.6’s monkey.patch truly has some problems. But gevent 1.0 has fix this problem, however, pypi only get 0.13.8, so i decide to give up this solution right now.

validate an url

Actually, i think it is very hard to verify if an url is valid, you can find some discusion on stackoverflow. So i will not check the url via re module, i just complete some malformed url to a right format, and try open it, if it successes, then it definitely a valid url.

relative url

Using urllib2.urlparse.urljoin(base, malformed_url)

Search

    Table of Contents