I need to crawl 200 different websites for a project, and I'd like to start the crawler once and then have it work on its own for the next few hours. The URLs will be in a txt or csv file. I have tried two slightly different approaches so far. First attempt:
class MySpider(CrawlSpider):
    name = 'spiderName'

    read_urls = open('../../urls.txt', 'r')
    for url in read_urls.readlines():
        url = url.strip()
        allowed_domains = [url[4:]]
        start_urls = ['http://' + url]
    read_urls.close()

    rules = (Rule(LinkExtractor(allow = ('', )), callback = 'parse_stuff', follow = True),)

    def parse_stuff(self, response):
        hxs = Selector(response)
        sites = hxs.xpath('//html')
        items_main = []
        for site in sites:
            loader = ItemLoader(item = Items_Main(), response = response)
            loader.add_xpath('a_title', '//head/title/text()')
            ...
            items_main.append(loader.load_item())
        return items_main
Here it picks up only the last URL in the txt file, but otherwise it works properly and I'm able to restrict allowed_domains.
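If I understand it correctly, the class body runs that loop only once and each pass rebinds allowed_domains and start_urls, so only the last assignment survives on the class. A minimal sketch (not my working code, not tested) of how I think the class body could collect every URL into lists instead, using the same file path and spider layout as above:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class MySpider(CrawlSpider):
        name = 'spiderName'

        # Build the lists while the class body executes, appending on every
        # iteration instead of overwriting the previous value.
        allowed_domains = []
        start_urls = []
        with open('../../urls.txt', 'r') as f:
            for line in f:
                url = line.strip()
                if not url:
                    continue
                allowed_domains.append(url[4:])      # drop the leading 'www.' as before
                start_urls.append('http://' + url)   # Scrapy needs a scheme on every start URL

        rules = (Rule(LinkExtractor(allow = ('', )), callback = 'parse_stuff', follow = True),)

        def parse_stuff(self, response):
            # same item loading as in the first attempt
            ...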
The second attempt, as found here on Stack Overflow, is basically the same except for start_urls = [url.strip() for url in read_urls.readlines()], which gives me the following error: raise ValueError('Missing scheme in request url: %s' % self._url).
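My guess is that the error comes from the lines in urls.txt being bare hosts like www.example.com (a made-up example), so that list comprehension hands Scrapy URLs without a scheme. A short, untested sketch of how I think the reading part would have to look instead, assuming every line starts with 'www.':

    # Assumes urls.txt holds bare hosts such as 'www.example.com', one per line.
    with open('../../urls.txt', 'r') as f:
        hosts = [line.strip() for line in f if line.strip()]

    allowed_domains = [host[4:] for host in hosts]      # drop the 'www.' prefix as before
    start_urls = ['http://' + host for host in hosts]   # prepend the scheme so Scrapy accepts the request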