
I need to crawl 200 different websites for a project, and I'd like to start the crawler once and then let it work on its own for the next few hours. The URLs will be in a txt or csv file. I have tried two slightly different approaches so far. First attempt:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.selector import Selector

# Items_Main is defined in the project's items.py


class MySpider(CrawlSpider):

    name = 'spiderName'

    read_urls = open('../../urls.txt', 'r')
    for url in read_urls.readlines():
        url = url.strip()
        allowed_domains = [url[4:]]        # drops the first 4 characters, i.e. a leading 'www.'
        start_urls = ['http://' + url]

    read_urls.close()

    rules = (Rule(LinkExtractor(allow=('',)), callback='parse_stuff', follow=True),)

    def parse_stuff(self, response):
        hxs = Selector(response)
        sites = hxs.xpath('//html')
        items_main = []

        for site in sites:
            loader = ItemLoader(item=Items_Main(), response=response)
            loader.add_xpath('a_title', '//head/title/text()')
            ...
            items_main.append(loader.load_item())
            return items_main

Here it only picks up the last URL in the txt file, but apart from that it works properly and I'm able to restrict the allowed_domains.

Second attempt, as found here on Stack Overflow, is basically the same except for start_urls = [url.strip() for url in read_urls.readlines()], which gives me the following error: raise ValueError('Missing scheme in request url: %s' % self._url).
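For reference, the second attempt presumably boils down to the following sketch (the rest of the spider unchanged); the ValueError is raised because the stripped lines carry no http:// scheme. The allowed_domains line is an assumption based on the first attempt:

read_urls = open('../../urls.txt', 'r')
# Collects every line, but the bare hostnames have no 'http://' scheme,
# which is what triggers the "Missing scheme in request url" error.
start_urls = [url.strip() for url in read_urls.readlines()]
allowed_domains = [url[4:] for url in start_urls]   # assumed, as in the first attempt
read_urls.close()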

1 Answer


You overwrite the lists on every pass of your for loop, so only the last URL survives.

Initialize the lists before the loop and append to them inside the loop.

allowed_domains = []
start_urls = []
for url in read_urls.readlines():
    url = url.strip()
    allowed_domains.append(url[4:])        # one domain per line, minus the leading 'www.'
    start_urls.append('http://' + url)
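For completeness, an equivalent sketch that reads the file inside a with block so it is closed automatically; this assumes urls.txt holds one bare, 'www.'-prefixed hostname per line:

# Same result as the loop above, but the file is closed automatically.
with open('../../urls.txt') as read_urls:
    hosts = [line.strip() for line in read_urls if line.strip()]

allowed_domains = [host[4:] for host in hosts]       # drop the leading 'www.'
start_urls = ['http://' + host for host in hosts]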
  • Thanks, I get all URLs now. But it opens the pipeline only once for all URLs, so I'm getting just one XML file as output instead of one file per URL. – Niklas Aug 04 '15 at 10:41
  • In this case, alter your pipeline (or write a custom one if you use a default) so that it opens a new XML file for each item and writes the contents to that file; a sketch follows after these comments. Alternatively, you could call your spider with each starting URL from a script, shifting the file reading from the spider to the caller. – GHajba Aug 04 '15 at 11:01
  • If your question is answered, consider accepting this answer and posting a new question for your follow-up. Maybe you should have a look at this [SO question](http://stackoverflow.com/questions/23868784/separate-output-file-for-every-url-given-in-start-urls-list-of-spider-in-scrapy). – Frank Martin Aug 04 '15 at 11:35
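A minimal sketch of the per-item-file idea from the comments, assuming the item carries the page URL in a field (here called page_url, which is hypothetical) so output files can be grouped by domain; the class name is also hypothetical and it would be enabled via the ITEM_PIPELINES setting:

from urllib.parse import urlparse  # Python 3; on Python 2 use: from urlparse import urlparse

from scrapy.exporters import XmlItemExporter


class PerDomainXmlPipeline(object):
    """Hypothetical pipeline that writes one XML file per crawled domain."""

    def open_spider(self, spider):
        self.exporters = {}  # maps domain -> (file handle, exporter)

    def process_item(self, item, spider):
        # 'page_url' is an assumed item field holding the URL of the scraped page.
        domain = urlparse(item['page_url']).netloc
        if domain not in self.exporters:
            handle = open('%s.xml' % domain, 'wb')
            exporter = XmlItemExporter(handle)
            exporter.start_exporting()
            self.exporters[domain] = (handle, exporter)
        self.exporters[domain][1].export_item(item)
        return item

    def close_spider(self, spider):
        for handle, exporter in self.exporters.values():
            exporter.finish_exporting()
            handle.close()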