
I am trying to crawl a long list of websites. Some of the websites in the start_urls list redirect (301). I want Scrapy to crawl the redirect targets as if they were also on the allowed_domains list (which they are not). For example, example.com is on my start_urls list and allowed_domains list, and example.com redirects to foo.com; I want to crawl foo.com.

DEBUG: Redirecting (301) to <GET http://www.foo.com/> from <GET http://www.example.com>
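
For context, my spider setup looks roughly like this (the names here are illustrative):

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["example.com"]        # foo.com is not listed here
    start_urls = ["http://www.example.com"]  # 301-redirects to www.foo.com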

I tried dynamically adding to allowed_domains in the parse_start_url method and returning a Request object, so that Scrapy would go back and scrape the redirected website once its domain was on the allowed_domains list, but I still get:

 DEBUG: Filtered offsite request to 'www.foo.com'

Here is my attempt to dynamically add allowed_domains:

import tldextract
from scrapy.http import Request

def parse_start_url(self, response):
    # Registered domain of the URL we actually landed on (after redirects)
    domain = tldextract.extract(str(response.request.url)).registered_domain
    if domain not in self.allowed_domains:
        self.allowed_domains.append(domain)
        return Request(response.url, callback=self.parse_callback)
    else:
        return self.parse_it(response, 1)

My other idea was to write a function in the offsite.py spider middleware that dynamically adds allowed_domains for redirected websites originating from start_urls, but I have not been able to get that solution to work either.

  • Have you tried using Scrapy + Selenium WebDriver? There is a possible solution here: http://stackoverflow.com/questions/27775048/python-selenium-possible-to-cancel-redirect-on-driver-get/27783398#27783398 – aberna Jan 16 '15 at 20:09

1 Answer


I figured out the answer to my own question.

I edited the offsite middleware so that it refreshes its allowed-domains regex before filtering, and I dynamically add to the allowed_domains list in the parse_start_url method. (The stock OffsiteMiddleware compiles allowed_domains into a host regex once, when the spider opens, so appending to the list later has no effect by itself.)

I added this function to OffsiteMiddleware

def update_regex(self, spider):
    # Recompile the host regex from the spider's current allowed_domains
    self.host_regex = self.get_host_regex(spider)

I also edited this function inside OffsiteMiddleware

def should_follow(self, request, spider):
    # Custom code: refresh the regex so domains added at runtime are honored
    self.update_regex(spider)

    regex = self.host_regex
    # hostname can be None for wrong urls (like javascript links)
    host = urlparse_cached(request).hostname or ''
    return bool(regex.search(host))

Lastly, for my use case, I added this code to my spider

def parse_start_url(self, response):
    # Record the registered domain we were redirected to (requires tldextract)
    domain = tldextract.extract(str(response.request.url)).registered_domain
    if domain not in self.allowed_domains:
        self.allowed_domains.append(domain)
    return self.parse_it(response, 1)

This code adds the redirected domain for any start_urls that get redirected, and the spider will then crawl those redirected sites.
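
If you would rather not edit Scrapy's source in place, the same two changes can live in a subclass that you enable in settings.py. Here is a minimal sketch, assuming a project named myproject with the subclass in myproject/middlewares.py (both names are illustrative; the built-in middleware path below matches current Scrapy versions, older versions use scrapy.contrib.spidermiddleware.offsite instead):

# myproject/middlewares.py
from scrapy.spidermiddlewares.offsite import OffsiteMiddleware

class DynamicOffsiteMiddleware(OffsiteMiddleware):
    def update_regex(self, spider):
        # Recompile the host regex from the spider's current allowed_domains
        self.host_regex = self.get_host_regex(spider)

    def should_follow(self, request, spider):
        # Refresh the regex before the stock filtering logic runs
        self.update_regex(spider)
        return super(DynamicOffsiteMiddleware, self).should_follow(request, spider)

# settings.py
SPIDER_MIDDLEWARES = {
    # Disable the built-in offsite middleware and swap in the subclass
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
    'myproject.middlewares.DynamicOffsiteMiddleware': 500,  # default priority
}

Recompiling the regex on every request is slightly wasteful, but it keeps the logic simple; you could instead recompile only when allowed_domains actually changes.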

  • Hi 12Ryan12. Thank you very much for this; it is exactly what I was looking for. A couple of questions if you don't mind: Is 'parse_it' your main parse function inside the spider? Also, where do you call the parse_start_url function from? Lastly, where do you edit the OffsiteMiddleware? Thanks in advance! – Mike Nedelko Jul 02 '16 at 14:51
  • Hi @12Ryan12. I managed to locate the OffsiteMiddleware and make the necessary adjustments. Where I am getting confused is the implementation of the 'parse_start_url' method. I am also wondering where parse_it(response, 1) comes from and what the '1' argument is doing there. Any advice would be highly appreciated :). – Mike Nedelko Jul 02 '16 at 16:31