
I'm using Scrapy to scrape a website. The item pages that I want to scrape look like: http://www.somepage.com/itempage/&page=x, where x is any number from 1 to 100. Thus, I have an SgmlLinkExtractor Rule with a callback function specified for any page matching this pattern.

The website does not have a list page with all the items, so I want to somehow tell Scrapy to scrape those URLs (from 1 to 100). Someone else seemed to have the same issue, but couldn't figure it out.

Does anyone have a solution?


2 Answers


You could list all the known URLs in your Spider class' start_urls attribute:

from scrapy.spider import BaseSpider

class SomepageSpider(BaseSpider):
    name = 'somepage.com'
    allowed_domains = ['somepage.com']
    # Generate all 100 known item URLs up front
    start_urls = ['http://www.somepage.com/itempage/&page=%s' % page for page in xrange(1, 101)]

    def parse(self, response):
        # ...

If it's just a one-time thing, you can create a local HTML file (e.g. file:///c:/somefile.html) with all the links, start scraping from that file, and add somepage.com to the allowed domains.
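
As a minimal sketch, you could generate such a file with a few lines of Python (the URL pattern is taken from the question; the output path is just an example, adjust it for your OS):

# Write a local HTML file containing one link per item page.
links = '\n'.join(
    '<a href="http://www.somepage.com/itempage/&page=%s">page %s</a>' % (page, page)
    for page in xrange(1, 101)
)
with open('c:/somefile.html', 'w') as f:
    f.write('<html><body>\n%s\n</body></html>' % links)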

Alternatively, in the parse callback you can yield a new Request for the next URL to be scraped.
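
A minimal sketch of that approach, assuming the &page=x pattern from the question (the way the page number is parsed out of the URL and the 100-page cutoff are illustrative, not part of the original answer):

from scrapy.spider import BaseSpider
from scrapy.http import Request

class SomepageSpider(BaseSpider):
    name = 'somepage.com'
    allowed_domains = ['somepage.com']
    start_urls = ['http://www.somepage.com/itempage/&page=1']

    def parse(self, response):
        # ... extract the item from this page ...

        # Queue the next page until page 100 has been scraped.
        page = int(response.url.rsplit('=', 1)[1])
        if page < 100:
            yield Request('http://www.somepage.com/itempage/&page=%s' % (page + 1),
                          callback=self.parse)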
