
I'm using Scrapy to scrape a website. The item pages that I want to scrape look like: http://www.somepage.com/itempage/&page=x, where x is any number from 1 to 100. Thus, I have an SgmlLinkExtractor Rule with a callback function specified for any page matching this pattern.

The website does not have a list page with all the items, so I want to somehow tell Scrapy to scrape those URLs (from 1 to 100). Someone else seemed to have the same issue, but couldn't figure it out.

Does anyone have a solution?


2 Answers


You could list all the known URLs in your Spider class' start_urls attribute:

from scrapy.spider import BaseSpider

class SomepageSpider(BaseSpider):
    name = 'somepage.com'
    allowed_domains = ['somepage.com']
    # Generate all 100 known item URLs up front
    start_urls = ['http://www.somepage.com/itempage/&page=%s' % page for page in xrange(1, 101)]

    def parse(self, response):
        # ...

If it's just a one-time thing, you can create a local HTML file (e.g. file:///c:/somefile.html) with all the links, start scraping from that file, and add somepage.com to the allowed domains.
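
As a minimal sketch, you could generate such a file with a few lines of Python (the URL pattern is taken from the question; the output path is just an example, adjust it for your OS):

# Write a local HTML file containing one link per item page.
links = '\n'.join(
    '<a href="http://www.somepage.com/itempage/&page=%s">page %s</a>' % (page, page)
    for page in xrange(1, 101)
)
with open('c:/somefile.html', 'w') as f:
    f.write('<html><body>\n%s\n</body></html>' % links)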

Alternatively, in the parse callback you can yield a new Request for the next URL to be scraped.
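
A minimal sketch of that approach, assuming the &page=x pattern from the question (the way the page number is parsed out of the URL and the 100-page cutoff are illustrative, not part of the original answer):

from scrapy.spider import BaseSpider
from scrapy.http import Request

class SomepageSpider(BaseSpider):
    name = 'somepage.com'
    allowed_domains = ['somepage.com']
    start_urls = ['http://www.somepage.com/itempage/&page=1']

    def parse(self, response):
        # ... extract the item from this page ...

        # Queue the next page until page 100 has been scraped.
        page = int(response.url.rsplit('=', 1)[1])
        if page < 100:
            yield Request('http://www.somepage.com/itempage/&page=%s' % (page + 1),
                          callback=self.parse)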
