
As this answer explains, multithreading can work well for web scraping because most of a thread's time is spent waiting for network responses rather than using the CPU. I am trying to understand the behavior of this multithreaded crawler:

import urllib
import re
import time
from threading import Thread
import MySQLdb
import mechanize
import readability
from bs4 import BeautifulSoup
from readability.readability import Document
import urlparse


class MultiScrape:
    visited = []
    urls = []
    glob_visited = []
    depth = 0
    counter = 0
    threadlist = []
    root = ""

    def __init__(self, url, depth):
       self.glob_visited.append(url)
       self.depth = depth
       self.root = url

    def run(self):
        while self.counter < self.depth:
            for w in self.glob_visited:
                if w not in self.visited:
                    self.visited.append(w)
                    self.urls.append(w)
            self.glob_visited = []       
            for r in self.urls:
                try: 
                    t = Thread(target=self.scrapeStep, args=(r,))
                    self.threadlist.append(t)
                    t.start()            
                except:
                    nnn = True 
            for g in self.threadlist:
                g.join() 
            self.counter+=1
        return self.visited  

    def scrapeStep(self,root):
        result_urls = []
        br = mechanize.Browser()
        br.set_handle_robots(False)
        br.addheaders = [('User-agent', 'Firefox')]
        try:
            br.open(root)
            for link in br.links():
                newurl =  urlparse.urljoin(link.base_url,link.url)
                if urlparse.urlparse(link.base_url).hostname.replace("www.","") in self.root:
                    result_urls.append(newurl)   
        except:
            err = True     
        for res in result_urls:
            self.glob_visited.append(res)

My most basic question is how Python manages to maintain the self.glob_visited list given the global interpreter lock (GIL). My understanding is that each threaded call of scrapeStep maintains its own list, which are combined with g.join() in the run function. Is that correct? Would Python behave the same way if I added a global list of the HTML of the pages scraped? What if the HTML were stored instead in a global dictionary? Finally, as I understand this script, it only makes use of a single CPU. So if I have multiple CPUs I could call this function using multiprocessing to speed up the crawl?

Michael

1 Answer


My understanding is that each threaded call of scrapeStep maintains its own list, which are combined with g.join(). Is that correct?

Nope, actually every thread shares the same self.glob_visited list. The call to g.join() just makes your program block until the thread object g has finished. The self.glob_visited.append operation each thread performs is thread-safe, because in CPython the GIL guarantees that a single append runs atomically, so concurrent appends can't corrupt the list. It doesn't seem like the order in which items are added matters either, so no explicit locking is required.
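As a quick illustration (not part of the original crawler), a few threads appending to a single shared list all end up writing to the same object:

import threading

shared = []   # one list object, visible to every thread

def worker(n):
    # list.append is atomic in CPython, so the GIL alone is enough protection here
    shared.append(n)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print sorted(shared)   # [0, 1, ..., 9] -- every append landed in the same list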

Finally, as I understand this script, it only makes use of a single CPU.

Only one CPU executes Python bytecode at a time because of the GIL, though the operating system may schedule the threads onto different CPUs at different times.

So if I have multiple CPUs I could call this function using multiprocessing to speed up the crawl?

Multiprocessing would allow all the non-I/O operations to run in parallel across CPUs, rather than having their execution interleaved with only one ever running at a time. However, it does require some implementation changes, because the glob_visited list can't be shared across processes the way it can be shared across threads. You would probably need to use a multiprocessing.Manager() to create a proxy list object that can be shared between processes, or have each worker process return a list of URLs to the main process and have the main process join the lists together.
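If you wanted to keep the shared-list style, an (untested) sketch of the multiprocessing.Manager() option might look like the following; the scrape_into helper and the example URLs are placeholders rather than part of the original code:

import multiprocessing

def scrape_into(shared_list, url):
    # stand-in for the real scrapeStep logic: collect the links found on `url`
    result_urls = []
    # ... mechanize/urlparse scraping would go here ...
    shared_list.extend(result_urls)   # the ListProxy forwards this to the manager process

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    glob_visited = manager.list()   # proxy list that every process can safely append to
    urls = ["http://example.com/a", "http://example.com/b"]   # hypothetical URLs
    procs = [multiprocessing.Process(target=scrape_into, args=(glob_visited, u))
             for u in urls]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print list(glob_visited)   # copy the proxy contents back into an ordinary list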

Here's an (untested) example showing how you could implement the latter approach, using a multiprocessing.Pool:

import multiprocessing

# scrapeStep can be a top-level function now, since it
# doesn't use anything from the MultiScrape class.
# pool.map only passes a single argument, so each URL to scrape
# is packed together with the root URL into a tuple.
def scrapeStep(args):
    root, site_root = args
    result_urls = []
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.addheaders = [('User-agent', 'Firefox')]
    try:
        br.open(root)
        for link in br.links():
            newurl = urlparse.urljoin(link.base_url, link.url)
            if urlparse.urlparse(link.base_url).hostname.replace("www.", "") in site_root:
                result_urls.append(newurl)
    except:
        err = True

    # return result_urls directly, rather than appending it to a shared list
    return result_urls

class MultiScrape:
   ... # Snipped a bunch of stuff here
   def run(self):
        while self.counter < self.depth:
            for w in self.glob_visited:
                if w not in self.visited:
                    self.visited.append(w)
                    self.urls.append(w)
            self.glob_visited = []
            pool = multiprocessing.Pool() # Create cpu_count() workers in the pool
            # scrapeStep is no longer a method; pair each URL with the root URL
            results = pool.map(scrapeStep, [(u, self.root) for u in self.urls])
            pool.close()   # release the worker processes before the next iteration
            pool.join()
            # results contains a list of lists, let's flatten it
            self.glob_visited = [item for sublist in results for item in sublist]
            self.counter+=1
        return self.visited  
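
For completeness, here's a hypothetical way to drive the modified class; the start URL and depth are just placeholders:

if __name__ == '__main__':
    # the __main__ guard matters once multiprocessing is involved, because
    # worker processes re-import this module (always required on Windows)
    scraper = MultiScrape("http://example.com", 2)   # hypothetical start URL and depth
    pages = scraper.run()
    print "visited %d pages" % len(pages)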
dano