
I am using an image pipeline to download images from different websites.

The images are downloaded successfully to the folder I defined, but I am unable to give each downloaded image a name of my choice before it is saved to disk.

Here is my code:

pipelines.py

class jellyImagesPipeline(ImagesPipeline):

    def image_key(self, url, item):
        name = item['image_name']
        return 'full/%s.jpg' % (name)

    def get_media_requests(self, item, info):
        print 'Entered get_media_request'
        for image_url in item['image_urls']:
            yield Request(image_url)

Image_spider.py

def getImage(self, response):
    item = JellyfishItem()
    item['image_urls'] = [response.url]
    item['image_name'] = response.meta['image_name']
    return item

What changes do I need to make in my code?

Update 1


pipelines.py

class jellyImagesPipeline(ImagesPipeline):

    def image_custom_key(self, response):
        print '\n\n image_custom_key \n\n'
        name = response.meta['image_name'][0]
        img_key = 'full/%s.jpg' % (name)
        print "custom image key:", img_key
        return img_key
        
    def get_images(self, response, request, info):
        print "\n\n get_images \n\n"
        for key, image, buf, in super(jellyImagesPipeline, self).get_images(response, request, info):
            yield key, image, buf

        
        key = self.image_custom_key(response)
        orig_image = Image.open(StringIO(response.body))
        image, buf = self.convert_image(orig_image)
        yield key, image, buf
   
    def get_media_requests(self, item, info):
        print "\n\nget_media_requests\n"
        return [Request(x, meta={'image_name': item["image_name"]})
                for x in item.get('image_urls', [])]

Update 2


def image_key(self, image_name):
    print 'entered into image_key'
    name = 'homeshop/%s.jpg' % (image_name)
    print name
    return name

def get_images(self, request):
    print '\nEntered into get_images'
    key = self.image_key(request.url)
    yield key

def get_media_requests(self, item, info):
    print '\n\nEntered media_request'
    print item['image_name']
    yield Request(item['image_urls'][0], meta=dict(image_name=item['image_name']))

def item_completed(self, results, item, info):
    print '\n\nentered into item_completed\n'
    print 'Name : ', item['image_urls']
    print item['image_name']
    for tuple in results:
        print tuple

                 
        
Binit Singh
  • What's in `response.meta['image_name']`? Does it depend on the URL only, or maybe on the image's `@alt` or `@title`? – paul trmbrth Aug 06 '13 at 14:11
  • `response.meta['image_name']` is retrieved from a MySQL table; it doesn't depend on the URL at all. – Binit Singh Aug 07 '13 at 04:40
  • As Scrapy has evolved, a simpler solution has become possible; see [Scrapy image download how to use custom filename](http://stackoverflow.com/a/22263951/775066). – sumid Mar 08 '14 at 01:51

1 Answer


In pipelines.py

from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.http import Request
from PIL import Image
from cStringIO import StringIO
import re

class jellyImagesPipeline(ImagesPipeline):

    # matches the hash-named original, e.g. "full/<hex digest>.jpg"
    CONVERTED_ORIGINAL = re.compile(r'^full/[0-9a-f]+\.jpg$')

    # name information coming from the spider, in each item
    # add this information to Requests() for individual images downloads
    # through "meta" dictionary
    def get_media_requests(self, item, info):
        print "get_media_requests"
        return [Request(x, meta={'image_name': item["image_name"]})
                for x in item.get('image_urls', [])]

    # this is where the image is extracted from the HTTP response
    def get_images(self, response, request, info):
        print "get_images"

        for key, image, buf, in super(jellyImagesPipeline, self).get_images(response, request, info):
            if self.CONVERTED_ORIGINAL.match(key):
                key = self.change_filename(key, response)
            yield key, image, buf

    def change_filename(self, key, response):
        return "full/%s.jpg" % response.meta['image_name'][0]
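
The renaming step above can be tried outside Scrapy. The following is a minimal sketch in Python 3 (the thread's own code is Python 2). It assumes that the default key is "full/" plus the SHA-1 hex digest of the request URL, which is consistent with the 40-character hex names that appear in the logs. `default_image_key` and `first_name` are hypothetical helpers that are not part of Scrapy. The sketch also shows why the `[0]` index only makes sense when `image_name` is a list: on a plain string it would return just the first character.

```python
import hashlib
import re

# Same idea as CONVERTED_ORIGINAL in the pipeline: "full/<hex digest>.jpg".
CONVERTED_ORIGINAL = re.compile(r'^full/[0-9a-f]+\.jpg$')


def default_image_key(url):
    """Assumed default naming: SHA-1 hex digest of the request URL."""
    digest = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'full/%s.jpg' % digest


def first_name(image_name):
    """Hypothetical helper: accept a string or a list and return one name."""
    if isinstance(image_name, (list, tuple)):
        return image_name[0]
    return image_name


def change_filename(key, image_name):
    """Rename only the converted original; leave every other key untouched."""
    if CONVERTED_ORIGINAL.match(key):
        return 'full/%s.jpg' % first_name(image_name)
    return key


key = default_image_key('http://www.python.org/images/python-logo.gif')
print(key)                                    # hash-based default key
print(change_filename(key, ['python-logo']))  # name passed as a list
print(change_filename(key, 'python-logo'))    # name passed as a string
print(change_filename('thumbs/small/abc123.jpg', 'python-logo'))  # unchanged
```

If the name comes from the database as a plain string, either wrap it in a list in the spider or drop the `[0]` in `change_filename`.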

In settings.py, make sure you have

ITEM_PIPELINES = ['jelly.pipelines.jellyImagesPipeline']
IMAGES_STORE = '/path/to/where/you/want/to/store/images'
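
The list form above matches the Scrapy version used in this answer. Newer Scrapy releases expect `ITEM_PIPELINES` to be a dict that maps each pipeline path to an integer order. A sketch of the equivalent settings, assuming the same project name `jelly`; the order value 1 is an arbitrary choice:

```python
# settings.py (sketch for newer Scrapy releases)
# Lower order values run earlier when several pipelines are enabled.
ITEM_PIPELINES = {
    'jelly.pipelines.jellyImagesPipeline': 1,
}
IMAGES_STORE = '/path/to/where/you/want/to/store/images'
```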

Example spider: it fetches the images from Python.org's homepage and sets, in `image_name`, the name that each saved image should get. Names can also follow the site structure, i.e. be placed in a folder called www.python.org.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
import urlparse

class CustomItem(Item):
    image_urls = Field()
    image_name = Field()  # must match the key used by the spider and the pipeline
    images = Field()

class ImageSpider(BaseSpider):
    name = "customimg"
    allowed_domains = ["www.python.org"]
    start_urls = ['http://www.python.org']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//img')
        items = []
        for site in sites:
            item = CustomItem()
            item['image_urls'] = [urlparse.urljoin(response.url, u) for u in site.select('@src').extract()]
            # the name information for your image
            item['image_name'] = ['whatever_you_want']
            items.append(item)
        return items
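
To make the saved names follow the site structure instead of a fixed string, the name could be derived from each image URL. A minimal Python 3 sketch (the spider above is Python 2, where `urljoin` lives in the `urlparse` module); `name_from_url` is a hypothetical helper, not something Scrapy provides:

```python
import posixpath
from urllib.parse import urljoin, urlparse


def name_from_url(image_url):
    """Build "<host>/<path without extension>" from an absolute image URL."""
    parsed = urlparse(image_url)
    path_without_ext, _ext = posixpath.splitext(parsed.path)
    return parsed.netloc + path_without_ext


page_url = 'http://www.python.org'
# Resolve a relative @src against the page URL, as the spider does.
absolute = urljoin(page_url, '/images/python-logo.gif')
print(absolute)
print(name_from_url(absolute))
```

In the spider this would be used as `item['image_name'] = [name_from_url(url)]`, one item per image, so that the pipeline's `[0]` index still applies. Whether nested folders are created under `IMAGES_STORE` depends on the storage backend, so that part is an assumption to verify.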
IK KLX
paul trmbrth
  • Thanks for answering my question, but it doesn't solve my problem. The image name is still not changed. Please help me dig into this further. – Binit Singh Aug 07 '13 at 05:03
  • I edited my answer with tested code. You have 2 options: either create another image with a new name, or change the name of the original JPEG-converted image from the built-in `ImagesPipeline` – paul trmbrth Aug 07 '13 at 10:43
  • Thanks for the answer, @paul. I really appreciate your effort. I would like to go with the second option you suggested, i.e. changing the name of the original JPEG-converted image from the built-in `ImagesPipeline`. – Binit Singh Aug 07 '13 at 12:04
  • I have updated the code in my question according to your suggestion, but the program does not enter `get_images` or `image_custom_key` at all, so the image is downloaded with its name unchanged. – Binit Singh Aug 07 '13 at 12:35
  • Have you set `ITEM_PIPELINES = ['yourprojectname.pipelines.jellyImagesPipeline']` in your settings.py file? Also, make sure the pipeline name is consistent: in your edited question code, `JellyImagesPipeline` starts with 'j' in one place and with 'J' in another (the `super()` call) – paul trmbrth Aug 07 '13 at 12:37
  • One more piece of information I would like to share with you: I have a `mysql` table that contains the `url` of each image on the `CDN` and the `name` I would like to use when renaming the images. – Binit Singh Aug 07 '13 at 12:39
  • Yes, I have set `ITEM_PIPELINES =[jelly.pipelines.jellyImagesPipeline]` – Binit Singh Aug 07 '13 at 12:41
  • I have added the log file to my question for your better understanding. – Binit Singh Aug 07 '13 at 12:59
  • The logs may show "full/deecb6c02e37af96ffe4879836aedc51301841c5.jpg" but the real filename should be ok. At least that's what I observe. Also make sure you delete the already downloaded images in your local store, otherwise the images will not be downloaded again (see `uptodate` in the log) – paul trmbrth Aug 07 '13 at 13:02
  • But why does it not show `get images` in the log file? It means it is not entering that function. – Binit Singh Aug 07 '13 at 13:09
  • Sorry, but your new code also gives the same result. Please try it on your PC. – Binit Singh Aug 07 '13 at 13:10
  • `get_images` will get called if you haven't got the image locally (check whether you have "uptodate" in the logs; if you do, `get_images` was not called). Remove your local "full/....jpg" images and it should be called. It works for me. If you earn +1 in reputation, we could chat, which would be more efficient I think – paul trmbrth Aug 07 '13 at 13:17
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/34998/discussion-between-paul-t-and-binit) – paul trmbrth Aug 07 '13 at 13:21
  • I think you need to change `image_name` to `image_names` in this line `return "full/%s.jpg" % response.meta['image_name'][0]` – DucCuong Jun 25 '15 at 03:48
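
The advice in the comments to delete previously downloaded images, so that they are fetched again and `get_images` runs, could be sketched as follows. This is only an illustration: `clear_downloaded_originals` is a hypothetical helper, and it assumes the originals sit as `.jpg` files in a `full` folder under `IMAGES_STORE`. The demonstration runs against a throwaway directory rather than a real store.

```python
import glob
import os
import tempfile


def clear_downloaded_originals(images_store):
    """Delete full/*.jpg under the store and return the removed paths."""
    removed = []
    for path in glob.glob(os.path.join(images_store, 'full', '*.jpg')):
        os.remove(path)
        removed.append(path)
    return removed


# Demonstration against a temporary directory, not a real image store.
store = tempfile.mkdtemp()
os.makedirs(os.path.join(store, 'full'))
open(os.path.join(store, 'full', 'deecb6c0.jpg'), 'w').close()
print(len(clear_downloaded_originals(store)))
```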