I am trying to crawl a long list of websites. Some of the sites in my start_urls list redirect (301). I want Scrapy to crawl the redirect targets as if they were also in allowed_domains (which they are not). For example, example.com is in both my start_urls and allowed_domains, and example.com redirects to foo.com. I want to crawl foo.com.
DEBUG: Redirecting (301) to <GET http://www.foo.com/> from <GET http://www.example.com>
I tried dynamically appending to allowed_domains in the parse_start_url method and returning a Request object, so that Scrapy would go back and scrape the redirected site once its domain is on the allowed list, but I still get:
DEBUG: Filtered offsite request to 'www.foo.com'
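Reading Scrapy's offsite.py, the allowed-domains check appears to be a host regex compiled once when the spider opens, which would explain why appending to allowed_domains afterwards has no effect. A stdlib-only sketch of that behaviour (build_host_regex is my own name for illustration, not Scrapy's API):

```python
import re

# Stand-in for how the offsite middleware filters hosts: a regex built
# from allowed_domains and compiled ONCE at spider-open time.
def build_host_regex(allowed_domains):
    pattern = r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in allowed_domains)
    return re.compile(pattern)

allowed = ["example.com"]
host_regex = build_host_regex(allowed)   # compiled when the spider opens

allowed.append("foo.com")                # my dynamic append at crawl time...
print(bool(host_regex.search("www.foo.com")))                 # False: still filtered
print(bool(build_host_regex(allowed).search("www.foo.com")))  # True only after a rebuild
```

So the append itself succeeds, but the stale compiled regex is what keeps filtering the request.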
Here is my attempt to dynamically add allowed_domains:
import tldextract
from scrapy.http import Request

def parse_start_url(self, response):
    # Registered domain of the (possibly redirected) request URL
    domain = tldextract.extract(str(response.request.url)).registered_domain
    if domain not in self.allowed_domains:
        self.allowed_domains.append(domain)
        # Re-issue the request now that the domain has been added
        return Request(response.url, callback=self.parse_callback)
    else:
        return self.parse_it(response, 1)
My other idea was to create a function in the offsite spider middleware (offsite.py) that dynamically adds allowed_domains for redirected websites originating from start_urls, but I have not been able to get that solution to work either.
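The middleware idea amounts to recompiling the host filter whenever it is consulted. A sketch with a stdlib-only stand-in for Scrapy's OffsiteMiddleware (the real class in scrapy/spidermiddlewares/offsite.py checks a Request object, not a bare host string; the names here only mirror it):

```python
import re

class OffsiteMiddlewareStandIn:
    # Mimics the relevant bits of Scrapy's offsite middleware.
    def spider_opened(self, spider):
        self.host_regex = self.get_host_regex(spider)  # compiled once

    def get_host_regex(self, spider):
        domains = "|".join(re.escape(d) for d in spider.allowed_domains)
        return re.compile(r"^(.*\.)?(%s)$" % domains)

    def should_follow(self, host, spider):
        return bool(self.host_regex.search(host))

class DynamicOffsiteMiddleware(OffsiteMiddlewareStandIn):
    def should_follow(self, host, spider):
        # Recompile so domains appended to allowed_domains at runtime count.
        self.host_regex = self.get_host_regex(spider)
        return super().should_follow(host, spider)

class FakeSpider:
    allowed_domains = ["example.com"]

spider = FakeSpider()
mw = DynamicOffsiteMiddleware()
mw.spider_opened(spider)
spider.allowed_domains.append("foo.com")   # added after spider open
print(mw.should_follow("www.foo.com", spider))   # True, thanks to the rebuild
```

In a real project this subclass would replace the stock middleware in SPIDER_MIDDLEWARES, but I have not managed to make the equivalent change work against Scrapy itself.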