As this answer explains, multithreading can work well for web scraping because most of the time is spent waiting on network responses rather than using the CPU. I am trying to understand the behavior of this multithreaded crawler:
import urllib
import re
import time
from threading import Thread
import MySQLdb
import mechanize
import readability
from bs4 import BeautifulSoup
from readability.readability import Document
import urlparse
class MultiScrape:
    # Shared (class-level) state: all threads append to these lists
    visited = []
    urls = []
    glob_visited = []
    depth = 0
    counter = 0
    threadlist = []
    root = ""

    def __init__(self, url, depth):
        self.glob_visited.append(url)
        self.depth = depth
        self.root = url

    def run(self):
        while self.counter < self.depth:
            # Move URLs discovered in the last pass into this pass's work list
            for w in self.glob_visited:
                if w not in self.visited:
                    self.visited.append(w)
                    self.urls.append(w)
            self.glob_visited = []
            # Scrape each URL at this depth in its own thread
            for r in self.urls:
                try:
                    t = Thread(target=self.scrapeStep, args=(r,))
                    self.threadlist.append(t)
                    t.start()
                except:
                    nnn = True
            # Wait for every thread at this depth to finish
            for g in self.threadlist:
                g.join()
            self.counter += 1
        return self.visited

    def scrapeStep(self, root):
        result_urls = []
        br = mechanize.Browser()
        br.set_handle_robots(False)
        br.addheaders = [('User-agent', 'Firefox')]
        try:
            br.open(root)
            for link in br.links():
                newurl = urlparse.urljoin(link.base_url, link.url)
                # Keep only links that stay on the root site
                if urlparse.urlparse(link.base_url).hostname.replace("www.", "") in self.root:
                    result_urls.append(newurl)
        except:
            err = True
        # Append this thread's results to the shared list
        for res in result_urls:
            self.glob_visited.append(res)
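For reference, I instantiate and run the crawler roughly like this (the seed URL and the depth of 2 are just placeholders):

    scraper = MultiScrape("http://example.com", 2)   # seed URL and depth are placeholders
    pages = scraper.run()                            # returns the list of visited URLs
    print pages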
My most basic question is how Python manages to maintain the shared self.glob_visited list given the global interpreter lock (GIL). My understanding is that each threaded call of scrapeStep builds its own local list of results, and these are merged back into self.glob_visited when g.join() is called in the run function. Is that correct? Would Python behave the same way if I added a global list holding the HTML of the pages scraped? What if the HTML were stored instead in a global dictionary?
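To make that part of the question concrete, here is a minimal standalone sketch of the two storage variants I have in mind; it is not part of the crawler above, and the names fetch, page_html_list, and page_html_dict are just placeholders:

    from threading import Thread
    import urllib2

    page_html_list = []     # variant 1: shared list of HTML strings
    page_html_dict = {}     # variant 2: shared dict keyed by URL

    def fetch(url):
        html = urllib2.urlopen(url).read()
        page_html_list.append(html)    # each thread appends its page's HTML
        page_html_dict[url] = html     # or stores it under its URL

    urls = ["http://example.com", "http://example.org"]
    threads = [Thread(target=fetch, args=(u,)) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()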
Finally, as I understand it, this script only makes use of a single CPU. So if I have multiple CPUs, could I call this crawler with multiprocessing to speed up the crawl?
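For example, I imagine splitting seed URLs across worker processes with something like the sketch below, where crawl_site and start_urls are hypothetical names and each process runs its own MultiScrape instance. Is that a sensible way to combine the two?

    from multiprocessing import Pool

    def crawl_site(url):
        # each worker process runs its own MultiScrape on one seed URL
        scraper = MultiScrape(url, 2)   # depth of 2 chosen arbitrarily
        return scraper.run()

    if __name__ == "__main__":
        start_urls = ["http://example.com", "http://example.org"]
        pool = Pool(processes=4)        # e.g. one worker per CPU
        results = pool.map(crawl_site, start_urls)
        pool.close()
        pool.join()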