Update
This is embarrassing, but it turned out that the problem with my original pipeline was that I'd forgotten to activate it in my settings. eLRuLL was right anyway, though.
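For anyone who ends up here with the same mistake: a pipeline only runs once it's registered in settings.py. A minimal sketch, using my project's module path (adjust to yours):

```python
# settings.py (sketch) -- a pipeline class is only active once it is
# listed here; the number sets its order relative to other pipelines.
ITEM_PIPELINES = {
    'ngamedallions.pipelines.NgamedallionsPipeline': 300,
}
```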
I'm at the stage where I have a working spider that can consistently retrieve the information I'm interested in and push it out in the format I want. My (hopefully) final stumbling block is applying a more reasonable naming convention to the files saved by my images pipeline. The SHA1 hash works, but I find it really unpleasant to work with.
I'm having trouble interpreting the documentation to figure out how to change the naming system, and I didn't have any luck blindly applying this solution. In the course of my scrape, I'm already pulling down a unique identifier for each page; I'd like to use it to name the images, since there's only one per page.
The images pipeline also doesn't seem to respect the fields_to_export section of my pipeline. I'd like to suppress the image URLs to give myself a cleaner, more readable output. If anyone has an idea how to do that, I'd be very grateful.
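One thing I'm unsure about: if the CSV comes from the feed exporter (scrapy crawl -o items.csv) rather than from my pipeline, then my pipeline's fields_to_export never applies, and the column list would instead come from the FEED_EXPORT_FIELDS setting. A sketch of that, using the fields my spider already collects:

```python
# settings.py (sketch) -- the feed exporter takes its column list from
# this setting, not from a custom pipeline's fields_to_export.
FEED_EXPORT_FIELDS = ['accession', 'title', 'date', 'inscription']
```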
The unique identifier I'd like to pull out of my parse is the one captured by CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()'). You'll find my spider and my pipelines below.
Spider:
import re

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Identity, Join, TakeFirst

from ngamedallions.items import NgamedallionsItem

URL = "http://www.nga.gov/content/ngaweb/Collection/art-object-page.%d"
starting_number = 1315
number_of_pages = 1311

class NGASpider(CrawlSpider):
    name = 'ngamedallions'
    allowed_domains = ['nga.gov']
    start_urls = [URL % i + '.html' for i in range(starting_number, number_of_pages, -1)]

    rules = (
        Rule(LinkExtractor(allow=('art-object-page.*', 'objects/*')),
             callback='parse_CatalogRecord', follow=True),
    )

    def parse_CatalogRecord(self, response):
        CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
        CatalogRecord.default_output_processor = TakeFirst()
        CatalogRecord.image_urls_out = Identity()

        keywords = "reverse|obverse and (medal|medallion)"
        notkey = "Image Not Available"
        n = re.compile('.*(%s).*' % notkey, re.IGNORECASE | re.MULTILINE | re.UNICODE)
        r = re.compile('.*(%s).*' % keywords, re.IGNORECASE | re.MULTILINE | re.UNICODE)
        if not n.search(response.body_as_unicode()):
            if r.search(response.body_as_unicode()):
                CatalogRecord.add_xpath('title', './/dl[@class="artwork-details"]/dt[@class="title"]/text()')
                CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()')
                CatalogRecord.add_xpath('inscription', './/div[@id="inscription"]/p/text()', Join(), re='[A-Z]+')
                CatalogRecord.add_xpath('image_urls', './/img[@class="mainImg"]/@src')
                CatalogRecord.add_xpath('date', './/dt[@class="title"]', re=r'(\d+-\d+)')
                return CatalogRecord.load_item()
Pipelines:
from scrapy import signals
from scrapy.exporters import CsvItemExporter

class NgamedallionsPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = ['accession', 'title', 'date', 'inscription']
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item