
I'm attempting to rename the images that are downloaded by my Scrapy 0.24 spider. Right now the downloaded images are stored with a SHA1 hash of their URLs as the file names. I'd like to instead name them the value I extract with item['model']. This question from 2011 outlines what I want, but the answers are for previous versions of Scrapy and don't work with the latest version.

Once I manage to get this working I'll also need to make sure I account for different images being downloaded with the same filename. So I'll need to download each image to its own uniquely named folder, presumably based on the original URL.
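For the folder part, this is roughly what I have in mind (untested sketch; the helper name and the use of a truncated SHA1 of the URL are my own invention, not anything from the Scrapy docs):

```python
import hashlib

def unique_image_path(url):
    # Hypothetical helper: put each image in its own folder named after
    # a short hash of the original URL, so two different images with the
    # same filename can't collide.
    url_hash = hashlib.sha1(url.encode('utf-8')).hexdigest()[:10]
    filename = url.split('/')[-1]
    return 'full/%s/%s' % (url_hash, filename)
```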

Here is a copy of the code I am using in my pipeline. I got this code from a more recent answer in the link above, but it's not working for me. Nothing errors out, and the images are downloaded as normal; my extra code doesn't seem to have any effect on the filenames, as they still appear as SHA1 hashes.

pipelines.py

class AllenheathPipeline(object):
    def process_item(self, item, spider):
        return item

import scrapy
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.http import Request
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):

    # Name the downloaded version
    def file_path(self, request, response=None, info=None):
        item = request.meta['item']  # This way you can use anything from the item, not just the URL.
        image_guid = request.url.split('/')[-1]
        return 'full/%s' % (image_guid)

    # Name the thumbnail version
    def thumb_path(self, request, thumb_id, response=None, info=None):
        image_guid = thumb_id + request.url.split('/')[-1]
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        #yield Request(item['images'])  # This would need meta added; I don't know how to put it in one line :-)
        for image in item['images']:
            yield Request(image)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

settings.py

BOT_NAME = 'allenheath'

SPIDER_MODULES = ['allenheath.spiders']
NEWSPIDER_MODULE = 'allenheath.spiders'

ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}

IMAGES_STORE = 'c:/allenheath/images'

products.py (my spider)

import scrapy
import urlparse

from allenheath.items import ProductItem
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

class productsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["http://www.allen-heath.com/"]
    start_urls = [
        "http://www.allen-heath.com/ahproducts/ilive-80/",
        "http://www.allen-heath.com/ahproducts/ilive-112/"
    ]

    def parse(self, response):
        for sel in response.xpath('/html'):
            item = ProductItem()
            item['model'] = sel.css('#prodsingleouter > div > div > h2::text').extract() # The value I'd like to use to name my images.
            item['shortdesc'] = sel.css('#prodsingleouter > div > div > h3::text').extract()
            item['desc'] = sel.css('#tab1 #productcontent').extract()
            item['series'] = sel.css('#pagestrip > div > div > a:nth-child(3)::text').extract()
            item['imageorig'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['image_urls'] = sel.css('#tab1 #productcontent .col-sm-9 img').xpath('./@src').extract()
            item['image_urls'] = [urlparse.urljoin(response.url, url) for url in item['image_urls']]
            yield item

items.py

import scrapy

class ProductItem(scrapy.Item):
    model = scrapy.Field()
    itemcode = scrapy.Field()
    shortdesc = scrapy.Field()
    desc = scrapy.Field()
    series = scrapy.Field()
    imageorig = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

Here's a pastebin of the output I get from the command prompt when I run the spider: http://pastebin.com/ir7YZFqf

Any help would be greatly appreciated!

peterh
jkupczak
  • Have you enabled your pipeline? – bosnjak Apr 29 '15 at 20:47
  • Yes I have. Updated the question to show my `settings.py` file. – jkupczak Apr 29 '15 at 20:53
  • 1
    You mention `item['model']` in your question, but I dont see it in your code. Are you sure you are using the right field to generate the name? Can you paste the console output? – bosnjak Apr 30 '15 at 07:04
  • Woops, I've gone ahead and added my spider code that shows the value I'd like to name my files with. And here's a link to a paste bin of the output I get in the command prompt: http://pastebin.com/ir7YZFqf – jkupczak Apr 30 '15 at 16:51

2 Answers


The pipelines.py:

from scrapy.pipelines.images import ImagesPipeline
from scrapy.http import Request
from scrapy.exceptions import DropItem
from scrapy import log

class MyImagesPipeline(ImagesPipeline):

    #Name download version
    def file_path(self, request, response=None, info=None):
        image_guid = request.meta['model'][0]
        log.msg(image_guid, level=log.DEBUG)
        return 'full/%s' % (image_guid)

    #Name thumbnail version
    def thumb_path(self, request, thumb_id, response=None, info=None):
        image_guid = thumb_id + request.url.split('/')[-1]
        log.msg(image_guid, level=log.DEBUG)
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        yield Request(item['image_urls'][0], meta=item)

Your settings.py is wrong: it registers the stock ImagesPipeline instead of your custom one. Use this instead:

ITEM_PIPELINES = {'allenheath.pipelines.MyImagesPipeline': 1}

For thumbnails to work, add this to settings.py:

IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (100, 100),
}
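If you want the file named after `item['model']` instead of the URL tail, the path logic can be pulled out into a small helper like this (a sketch under my assumptions: `model` is a one-element list, as produced by `.extract()`, and you'd call this from `file_path` with `request.meta` and `request.url`):

```python
def model_file_path(meta, url):
    # Build the storage path from the item's 'model' field (a one-element
    # list from .extract()) plus the original file extension from the URL.
    # Works because get_media_requests passes meta=item, so the whole
    # item is available on the request.
    model = meta['model'][0]
    extension = url.split('.')[-1]
    return 'full/%s.%s' % (model, extension)
```

Note this keeps one file per model value; if a page yields several images for the same model, you'd still need to make the names unique.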
ilovecomputer
ljk321
  • Thanks for the answer. I swapped in your line and removed mine. Unfortunately it breaks the spider. No data/text is downloaded and no images are downloaded anymore when using your answer. – jkupczak May 04 '15 at 14:41
  • Can you post all the code, including `items.py`? I'm kind of having trouble running your code. – ljk321 May 05 '15 at 00:25
  • The reason it broke the spider is that your pipeline was broken. I've edited my answer. Hope it helps. – ljk321 May 05 '15 at 10:52
  • Your edit has cleared up the errors and now I get data and images downloaded. The only problem is that I need to name the image using `item['itemcode']`. How would I do that? – jkupczak May 05 '15 at 20:43
  • Where is this `itemcode`? I can't seem to find it in the `item`. – ljk321 May 06 '15 at 00:28
  • Woops. That was a typo, I meant `image_guid = item['model']` instead of `image_guid = request.url.split('/')[-1]`. Using `item['model']` breaks the spider. – jkupczak May 06 '15 at 00:44
  • I only take care of the full image. Dealing with thumbnails would be just the same. – ljk321 May 06 '15 at 00:52
  • That did the trick skyline, thank you. I still need to organize them in unique folders to prevent duplicate filenames from happening. But your answer solves my biggest issue. – jkupczak May 06 '15 at 00:59
  • @skyline75489 thanks for your answer. However, I have a question: where do I find documentation regarding your suggestions? I would like to write code with more understanding of what exactly I'm doing and why I should override exactly this or that method. Any link would be appreciated. Thanks a lot! – drsealks Jul 05 '15 at 19:36
  • 1
    @sof_dff My main reference is the source code here https://github.com/scrapy/scrapy/blob/master/scrapy%2Fpipelines%2Fimages.py . Sometimes the official doc is just not enough for this kind of situation. – ljk321 Jul 06 '15 at 00:58

Since the URL hash guarantees a unique identifier, you could write each item's value together with its URL hash to a separate file as you crawl.

After the crawl is done, loop over that file and do the renaming, using a Counter dictionary to append a number whenever several items share the same value.
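Roughly like this (a sketch with made-up sample data; in practice the pairs would come from the file you wrote during the crawl):

```python
import os
from collections import Counter

def plan_renames(mapping):
    # mapping: list of (sha1_filename, model_value) pairs collected during
    # the crawl. Returns a dict of old name -> new name, appending a
    # counter when several images share the same model value.
    seen = Counter()
    renames = {}
    for old_name, model in mapping:
        seen[model] += 1
        ext = os.path.splitext(old_name)[1]
        if seen[model] == 1:
            renames[old_name] = model + ext
        else:
            renames[old_name] = '%s-%d%s' % (model, seen[model], ext)
    return renames
```

You would then walk the result and call `os.rename` on each pair.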

PascalVKooten
  • 20,643
  • 17
  • 103
  • 160
  • Thank you for the suggestion. I'd really rather get this all done in one command though. I suppose this will be my fallback if I can't figure it out. – jkupczak May 04 '15 at 14:48