License: (CC 3.0) BY-NC-SA
Reference
- A Brief Discussion of Coroutines and gevent (in Chinese)
- Beautiful Soup documentation (Chinese translation)
- Queue — A synchronized queue class
- gevent For the Working Python Developer - Written by the Gevent Community
Workflow
argparse
- parser = argparse.ArgumentParser()
- parser.add_argument('-u', '--url')
- args = parser.parse_args(sys.argv[1:])
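The three argparse calls above can be put together roughly as follows. This is a minimal sketch: the function names `build_parser` and `parse_cli`, the description text, and the `required`/`help` arguments are additions for illustration and are not part of the original notes.

```python
import argparse
import sys


def build_parser():
    """Build a parser with the single -u/--url option from the notes."""
    parser = argparse.ArgumentParser(description="Tiny crawler (illustrative)")
    # 'required' and 'help' are additions for this sketch, not from the notes.
    parser.add_argument("-u", "--url", required=True, help="start URL to crawl")
    return parser


def parse_cli(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and return the namespace."""
    if argv is None:
        argv = sys.argv[1:]
    return build_parser().parse_args(argv)


# Passing an explicit list makes the parser easy to try without a shell.
args = parse_cli(["-u", "http://example.com"])
print(args.url)
```

Calling `parse_args()` with no argument reads `sys.argv[1:]` by default, so the explicit slice in the notes is optional.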
urllib2
- page = urllib2.urlopen(url)
- content = page.read()
- page.close()
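The open/read/close sequence above leaks the connection if `read()` raises. One possible way to guard against that is `contextlib.closing`, sketched below. The helper name `fetch` and the 10-second default timeout are assumptions, and the import fallback is there only so the sketch also runs on Python 3, where `urllib2` became `urllib.request`.

```python
from contextlib import closing

try:
    from urllib2 import urlopen          # Python 2, as in the notes
except ImportError:
    from urllib.request import urlopen   # Python 3 equivalent (assumption)


def fetch(url, timeout=10):
    """Return the raw body of url; the response is closed even on error."""
    with closing(urlopen(url, timeout=timeout)) as page:
        return page.read()
```

A call such as `content = fetch(args.url)` would then replace the three separate steps.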
BeautifulSoup
- bs = BeautifulSoup.BeautifulSoup(content)
- links = bs('a')
re
- pattern = re.compile(r'^(http://|https://)')
- http_links = filter(pattern.search, links)
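As a small illustration of the filtering step, the sketch below applies the pattern to plain strings. It assumes the `href` values have already been pulled out of the `<a>` tags, since `pattern.search` expects a string rather than a tag object. The sample URLs are hypothetical, and `^https?://` is just a shorter spelling of the same alternation.

```python
import re

# Matches absolute links that start with http:// or https://
pattern = re.compile(r"^https?://")

# Hypothetical href values, as they might come out of the <a> tags.
hrefs = [
    "http://example.com/",
    "https://example.com/about",
    "/relative/path",
    "mailto:someone@example.com",
    "ftp://example.com/file",
]

# A list comprehension behaves the same on Python 2 and 3,
# whereas filter() returns a lazy iterator on Python 3.
http_links = [href for href in hrefs if pattern.search(href)]
print(http_links)
```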
Queue
- queue = Queue.Queue()
- queue.put(link[, block[, timeout]])
- queue.get([block[, timeout]])
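To show how `put` and `get` work together, here is a minimal producer/consumer sketch using a plain thread. The names `links`, `seen`, and `worker`, the sample URLs, and the 0.5-second timeout are all assumptions made for the example.

```python
try:
    import Queue as queue   # Python 2 name, as in the notes
except ImportError:
    import queue            # Python 3 name

import threading

links = queue.Queue()
seen = []


def worker():
    """Consume links until the queue stays empty for half a second."""
    while True:
        try:
            # Wait up to 0.5 s for work; Empty means the queue has drained.
            link = links.get(block=True, timeout=0.5)
        except queue.Empty:
            return
        seen.append(link)
        links.task_done()


for url in ["http://example.com/a", "http://example.com/b"]:
    links.put(url)

t = threading.Thread(target=worker)
t.start()
links.join()   # blocks until every item has been marked task_done()
t.join()
print(seen)
```

With a single worker the items come out in the order they were put in, because `Queue.Queue` is first-in, first-out.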
gevent
- thread = gevent.spawn(f, *f_args)
- thread.join()
sqlite3
- conn = sqlite3.connect('file.db')
- curs = conn.cursor()
- curs.execute()
- conn.commit()
- curs.close()
- conn.close()
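The sqlite3 steps above might look like this end to end. This is only a sketch: the `pages` table, its columns, and the sample row are hypothetical, and an in-memory database stands in for 'file.db' so that nothing is written to disk.

```python
import sqlite3

# ":memory:" keeps the sketch self-contained; the notes use 'file.db'.
conn = sqlite3.connect(":memory:")
curs = conn.cursor()

# Table and column names are hypothetical.
curs.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, content TEXT)")
# "?" placeholders let sqlite3 handle quoting of the values.
curs.execute(
    "INSERT INTO pages VALUES (?, ?)",
    ("http://example.com/", "<html></html>"),
)
conn.commit()

curs.execute("SELECT url FROM pages")
rows = curs.fetchall()

curs.close()
conn.close()
print(rows)
```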
logging
- The logging module is thread-safe because it uses threading locks
- logger = logging.getLogger(name)
- lh = logging.FileHandler(filename)
- logger.addHandler(lh)
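A minimal sketch of the logging setup above follows. The logger name "crawler", the log format, and the temporary log path are assumptions chosen for illustration.

```python
import logging
import os
import tempfile

# A temporary file stands in for the real log path (assumption).
log_path = os.path.join(tempfile.mkdtemp(), "crawler.log")

logger = logging.getLogger("crawler")   # hypothetical logger name
logger.setLevel(logging.INFO)

lh = logging.FileHandler(log_path)
lh.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(lh)

logger.info("fetched %s", "http://example.com/")

# Close and detach the handler so the file can be read back safely.
lh.close()
logger.removeHandler(lh)

with open(log_path) as f:
    log_text = f.read()
print(log_text)
```

Without the `setLevel` call the logger inherits the root logger's WARNING level, and the `info` message would be dropped.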
Troubleshooting
urllib2.URLError: <urlopen error [Errno 67] request timed out>
This problem appears once gevent is imported. Using urllib2 directly works, although it is really slow because of my network traffic. My localhost blog, however, can be accessed without any problem. This suggests that the monkey patching in gevent 0.13.6 has real problems. gevent 1.0 has fixed this, but PyPI only offers 0.13.8, so I have decided to give up on this solution for now.
validate a URL
I think it is very hard to verify whether a URL is valid; you can find some discussion of this on Stack Overflow. So I do not check the URL with the re module. Instead, I complete a malformed URL into a proper form and try to open it. If that succeeds, it is definitely a valid URL.
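The "complete it, then try to open it" idea could be sketched as below. The helper names `complete_url` and `is_openable`, the default `http` scheme, the timeout, and the set of exceptions treated as failure are all assumptions, not the original implementation. The import fallbacks are there so the sketch also runs on Python 3.

```python
try:
    from urllib2 import urlopen, URLError      # Python 2, as in the notes
    from httplib import HTTPException
except ImportError:
    from urllib.request import urlopen         # Python 3 equivalents (assumption)
    from urllib.error import URLError
    from http.client import HTTPException


def complete_url(url, default_scheme="http"):
    """Add a scheme to inputs such as 'example.com' or '//example.com'."""
    url = url.strip()
    if url.startswith("//"):
        return default_scheme + ":" + url
    if "://" not in url:
        return default_scheme + "://" + url
    return url


def is_openable(url, opener=urlopen, timeout=10):
    """Return True if the completed URL can actually be opened."""
    try:
        page = opener(complete_url(url), timeout=timeout)
    except (URLError, HTTPException, ValueError, IOError):
        return False
    page.close()
    return True
```

The `opener` parameter exists only so the check can be exercised without network access; in normal use `is_openable("example.com")` would go through `urlopen`.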
relative URL
Use urllib2.urlparse.urljoin(base, malformed_url) to resolve a relative URL against the base URL of the page.
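A few cases show how `urljoin` treats different kinds of links. The base URL and the links are hypothetical, and the import fallback is there so the sketch also runs on Python 3, where the function lives in `urllib.parse`.

```python
try:
    from urlparse import urljoin        # Python 2 (also reachable as urllib2.urlparse)
except ImportError:
    from urllib.parse import urljoin    # Python 3 equivalent (assumption)

base = "http://example.com/blog/post.html"   # hypothetical page URL

print(urljoin(base, "img/a.png"))            # http://example.com/blog/img/a.png
print(urljoin(base, "../about"))             # http://example.com/about
print(urljoin(base, "/feed"))                # http://example.com/feed
print(urljoin(base, "http://other.org/x"))   # http://other.org/x
```

An already absolute link is returned unchanged, so every `href` can be passed through `urljoin` without checking first whether it is relative.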