
How can I fetch a URL from inside my spider and extract something from the page via HtmlXPathSelector? The URL is something I want to supply as a string in the code, not a link to follow.

I tried something like this:

import urllib2
from scrapy.selector import HtmlXPathSelector

req = urllib2.Request('http://www.example.com/' + some_string + '/')
req.add_header('User-Agent', 'Mozilla/5.0')
response = urllib2.urlopen(req)
hxs = HtmlXPathSelector(response)

but it throws an exception:

[Failure instance: Traceback: <type 'exceptions.AttributeError'>: addinfourl instance has no attribute 'encoding'
ria

2 Answers


You will need to construct a scrapy.http.HtmlResponse object with the body=urllib2.urlopen(req).read() - but why exactly do you need to use urllib2 instead of returning the request with a callback?
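A minimal sketch of that construction, assuming the Scrapy API of that era (Python 2, `scrapy.http.HtmlResponse`, `HtmlXPathSelector`) and a hypothetical `some_string` value:

```python
import urllib2
from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector

some_string = 'foo'  # hypothetical value supplied in code
url = 'http://www.example.com/' + some_string + '/'

req = urllib2.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0')
body = urllib2.urlopen(req).read()

# Wrap the raw bytes in an HtmlResponse so the selector gets the
# .encoding attribute it expects (addinfourl objects lack it).
response = HtmlResponse(url=url, body=body)
hxs = HtmlXPathSelector(response)
titles = hxs.select('//title/text()').extract()
```

The `AttributeError` in the question comes from passing a raw urllib2 response to the selector; wrapping the body in an `HtmlResponse` is what supplies the missing `encoding`.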

Pablo Hoffman
  • I don't know how to make a "request with a callback" to an URL which is not linked anywhere on the page I am scraping. I just want to query a URL I supply in a string, not following any links, inside my Scrapy script. – ria Jan 12 '11 at 08:58
  • Thanks, however now I ended up parsing the URL with BeautifulSoup because I couldn't make it work with HtmlXPathSelector. – ria Jan 12 '11 at 09:00
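For the record, the "request with a callback" Pablo mentions does not require the URL to be linked anywhere on a scraped page — a sketch, assuming the old `BaseSpider` API and a hypothetical `some_string`:

```python
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

class MySpider(BaseSpider):
    name = 'example'
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        some_string = 'foo'  # hypothetical value supplied in code
        # The URL is built from a string, not followed from the page;
        # Scrapy schedules it like any other request.
        yield Request('http://www.example.com/' + some_string + '/',
                      callback=self.parse_extra)

    def parse_extra(self, response):
        hxs = HtmlXPathSelector(response)
        return hxs.select('//title/text()').extract()
```

Scrapy then downloads the URL through its own machinery and hands `parse_extra` a proper response object, so no urllib2 or manual `HtmlResponse` construction is needed.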

Scrapy is not explicit about how to unit test spiders; I don't recommend using Scrapy to crawl data if you want unit tests for each spider.

Chandler.Huang