
Update

This is embarrassing, but it turned out that the problem with my original pipeline was that I'd forgotten to activate it in my settings. eLRuLL was right anyway, though.
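
For anyone else who makes the same mistake: a pipeline only runs if it's registered in `ITEM_PIPELINES` in `settings.py`. A minimal sketch, assuming the default project layout (the dotted path is my guess at where the class lives):

# settings.py
# Pipelines are disabled unless listed here; the number sets run order.
# The dotted path below assumes the pipeline class lives in
# ngamedallions/pipelines.py.
ITEM_PIPELINES = {
    'ngamedallions.pipelines.NgamedallionsPipeline': 300,
}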


I'm at the stage where I have a working spider that can consistently retrieve the information I'm interested in and push it out in the format I want. My (hopefully) final stumbling block is applying a more reasonable naming convention to the files saved by my images pipeline. The default SHA1 hash names work, but I find them really unpleasant to work with.

I'm having trouble interpreting the documentation to figure out how to change the naming system, and I didn't have any luck blindly applying this solution. In the course of my scrape, I'm already pulling down a unique identifier for each page; I'd like to use it to name the images, since there's only one per page.

The images pipeline also doesn't seem to respect the `fields_to_export` list in my CSV pipeline. I'd like to suppress the image URLs to give myself a cleaner, more readable output. If anyone has an idea how to do that, I'd be very grateful.

The unique identifier I'd like to pull out of my parse is the accession number, loaded with `CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()')`. You'll find my spider and my pipelines below.

Spider:

import re

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, Join, Identity

# assumes the default project layout for the item definition
from ngamedallions.items import NgamedallionsItem

URL = "http://www.nga.gov/content/ngaweb/Collection/art-object-page.%d"
starting_number = 1315
number_of_pages = 1311


class NGASpider(CrawlSpider):
    name = 'ngamedallions'
    allowed_domains = ['nga.gov']
    # count down from starting_number; range's stop value is exclusive
    start_urls = [URL % i + '.html'
                  for i in range(starting_number, number_of_pages, -1)]
    rules = (
        Rule(LinkExtractor(allow=('art-object-page.*', 'objects/*')),
             callback='parse_CatalogRecord', follow=True),
    )

    def parse_CatalogRecord(self, response):
        CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
        CatalogRecord.default_output_processor = TakeFirst()
        # keep image_urls as a list instead of collapsing to the first value
        CatalogRecord.image_urls_out = Identity()
        keywords = "reverse|obverse and (medal|medallion)"
        notkey = "Image Not Available"
        n = re.compile('.*(%s).*' % notkey, re.IGNORECASE | re.MULTILINE | re.UNICODE)
        r = re.compile('.*(%s).*' % keywords, re.IGNORECASE | re.MULTILINE | re.UNICODE)
        if not n.search(response.body_as_unicode()):
            if r.search(response.body_as_unicode()):
                CatalogRecord.add_xpath('title', './/dl[@class="artwork-details"]/dt[@class="title"]/text()')
                CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()')
                CatalogRecord.add_xpath('inscription', './/div[@id="inscription"]/p/text()', Join(), re='[A-Z]+')
                CatalogRecord.add_xpath('image_urls', './/img[@class="mainImg"]/@src')
                CatalogRecord.add_xpath('date', './/dt[@class="title"]', re=r'(\d+-\d+)')

                return CatalogRecord.load_item()
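
One note on the `start_urls` line above: `range`'s stop value is exclusive, so counting down from `starting_number` yields pages 1315 through 1312, not 1311. A quick interpreter check of the list it builds:

>>> URL = "http://www.nga.gov/content/ngaweb/Collection/art-object-page.%d"
>>> [URL % i + '.html' for i in range(1315, 1311, -1)]
['http://www.nga.gov/content/ngaweb/Collection/art-object-page.1315.html',
 'http://www.nga.gov/content/ngaweb/Collection/art-object-page.1314.html',
 'http://www.nga.gov/content/ngaweb/Collection/art-object-page.1313.html',
 'http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html']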

Pipelines:

from scrapy import signals
from scrapy.exporters import CsvItemExporter


class NgamedallionsPipeline(object):
    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = ['accession', 'title', 'date', 'inscription']
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
  • about the `fields_to_export`, please try passing the list to the constructor: `CsvItemExporter(file, fields_to_export=['accession', ...])` – eLRuLL May 05 '16 at 12:48
  • No luck. I'm getting the same output as before. The items.csv that I produce gives me, in order: inscription, title, accession, image_urls, images, and date. Could part of the problem be some conflict with the image pipeline, which has its own mechanics tucked away behind the scenes? – Tric May 05 '16 at 15:14

1 Answer


Regarding renaming the images written to disk, here's one way to do it:

  1. add something to `meta` in the image Requests generated by the pipeline by overriding `get_media_requests()`
  2. override `file_path()` and use that info from `meta`

Example custom ImagesPipeline:

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class NgaImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # use 'accession' as name for the image when it's downloaded
        return [scrapy.Request(x, meta={'image_name': item["accession"]})
                for x in item.get('image_urls', [])]

    # build the file path from the name chosen in get_media_requests();
    # note the extension is hardcoded to .jpg here
    def file_path(self, request, response=None, info=None):
        return '%s.jpg' % request.meta['image_name']
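
To put the subclass to use, register it in `settings.py` in place of the stock ImagesPipeline. A minimal sketch, assuming the class lives in `ngamedallions/pipelines.py` (the dotted path and the storage folder are assumptions):

# settings.py
ITEM_PIPELINES = {
    # assumed dotted path to the subclass above
    'ngamedallions.pipelines.NgaImagesPipeline': 1,
}
# base directory the pipeline writes image files into (assumed folder name)
IMAGES_STORE = 'images'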

Regarding exported fields, the suggestion from @eLRuLL worked for me:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import signals
from scrapy.exporters import CsvItemExporter


class NgaCsvPipeline(object):
    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        ofile = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = ofile
        self.exporter = CsvItemExporter(ofile,
            fields_to_export=['accession', 'title', 'date', 'inscription'])
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        ofile = self.files.pop(spider)
        ofile.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
– paul trmbrth