
First, I should say I am still a bit of a Django/Python noob. I am in the midst of a project that allows users to enter a URL; the site scrapes the content from that page and returns any images over a certain size, plus the page title tag, so the user can then pick which image they want to use on their profile. A pretty standard scenario, I assume. I have this working by using Selenium (headless Chrome) to grab the destination page content, some Python to determine each file's size, and a Django view that renders it all into a template. The image the user selects is then downloaded and stored locally.

However, I seriously doubt the scalability of this. It's currently just running locally, and I am very concerned about how it would cope with lots of users running requests at the same time. I am firing up that headless Chrome browser every time a request is made, which doesn't sound efficient, and I am having to download each image just to determine its size and decide whether it's large enough. One example took 12 seconds from submitting the URL to displaying the results, whereas the same destination URL put through www.kit.com (which has very similar web scraping functionality) took 3 seconds.

I have not provided any code, as the code I have does what it should; it's the approach that I think is wrong. To summarise, what I want is:

  • To allow a user to enter a URL and for it to return all images (or just the URLs to those images) from that page over a certain size (width/height), and the page title.

  • For this to be the most efficient solution, taking into account that it would be run concurrently by many users at once.

  • For it to work in a Django (2.0) / Python (3+) environment.

I am not completely against using a third-party service's API if one exists, but it would be my least preferred option.

Any help/pointers would be much appreciated.

Zeb

1 Answer


You can use two Python solutions in your case:

1) BeautifulSoup - there is a good answer on how to download images using it. You just have to make it a separate function and pass the site in as an argument. It is also very easy to parse only the image links, as you said, depending on the speed you need (obviously downloading the files themselves, especially when there are a lot of them, will be much slower than just collecting the links). This tool is only for parsing and scraping the content of the page.
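A minimal sketch of that first approach, assuming a requests + BeautifulSoup + Pillow stack; the `scrape_images` helper name and the size thresholds are illustrative, not from the question:

```python
# Rough sketch, not production code: fetch a page, collect <img> URLs,
# keep only images over an assumed size threshold, and return the title.
from io import BytesIO
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from PIL import Image

MIN_WIDTH, MIN_HEIGHT = 300, 300  # illustrative threshold


def scrape_images(page_url):
    response = requests.get(page_url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    title = soup.title.get_text(strip=True) if soup.title else ""
    big_images = []
    for img in soup.find_all("img", src=True):
        img_url = urljoin(page_url, img["src"])  # resolve relative src
        img_bytes = requests.get(img_url, timeout=10).content
        try:
            width, height = Image.open(BytesIO(img_bytes)).size
        except OSError:  # not a raster image Pillow can read (e.g. SVG)
            continue
        if width >= MIN_WIDTH and height >= MIN_HEIGHT:
            big_images.append(img_url)
    return title, big_images
```

Note that most raster formats store their dimensions in the first few hundred bytes, so if the full downloads are what is slow, you could stream each response and feed chunks to `PIL.ImageFile.Parser` until it can report a size, instead of downloading every file completely.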

2) Scrapy - this is a much more powerful tool, a full framework. With it you can connect your spider to Django models and handle images much more efficiently using its built-in images pipeline. It is far more flexible, with many features for working with scraped data, though I am not sure whether you need it in your project or whether it is overkill for your case.
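For a rough idea of what that looks like, here is a minimal, untested spider sketch; the spider name and thresholds are made up, but `IMAGES_MIN_WIDTH`/`IMAGES_MIN_HEIGHT` are real Scrapy settings that make the images pipeline drop small images for you:

```python
# Spider sketch - run with: scrapy crawl page_images -a start_url=<URL>
import scrapy


class PageImagesSpider(scrapy.Spider):
    name = "page_images"

    def __init__(self, start_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url] if start_url else []

    def parse(self, response):
        yield {
            "title": response.css("title::text").get(),
            # the images pipeline downloads everything in image_urls
            "image_urls": [
                response.urljoin(src)
                for src in response.css("img::attr(src)").getall()
            ],
        }
```

And the settings that switch the pipeline on:

```python
# settings.py (excerpt) - the images pipeline needs Pillow installed
ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
IMAGES_STORE = "media/scraped"  # assumed storage path
IMAGES_MIN_WIDTH = 300   # pipeline skips images narrower than this
IMAGES_MIN_HEIGHT = 300  # ...or shorter than this
```

Triggering a crawl from a Django view is a separate question (scrapyd or `scrapy.crawler.CrawlerProcess` are common routes), so treat this purely as the spider side.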

Also, my advice is to run the spider in some background task queue such as Celery and fetch the result via AJAX, because parsing the content may take some time, so don't make the user wait for the response.
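A sketch of that pattern, assuming Celery is already wired into the Django project; the task and view names are made up, and `scrape_images` is the hypothetical helper from the BeautifulSoup sketch above:

```python
# Sketch only: run the scrape in a Celery task and let the front end
# poll for the result instead of blocking the request.
from celery import shared_task
from celery.result import AsyncResult
from django.http import JsonResponse

from .scraper import scrape_images  # wherever your helper lives


@shared_task
def scrape_page(url):
    title, image_urls = scrape_images(url)
    return {"title": title, "image_urls": image_urls}


def start_scrape(request):
    # kick off the job and hand the task id back to the browser
    task = scrape_page.delay(request.GET["url"])
    return JsonResponse({"task_id": task.id})


def scrape_status(request, task_id):
    # the AJAX side polls this until "done" is true
    result = AsyncResult(task_id)
    if result.ready():
        return JsonResponse({"done": True, "result": result.get()})
    return JsonResponse({"done": False})
```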

P.S. You can even combine those two tools in some cases :)

Chiefir
  • It's `BeautifulSoup`, not `BeautifulSoap` – Rachit kapadia May 09 '18 at 10:18
  • Is there a benefit to using Beautiful Soup, given that I can use [Selenium's find functionality](http://selenium-python.readthedocs.io/locating-elements.html) to find all the image tags in the scraped data? – Zeb May 09 '18 at 10:48
  • @DrakeRamoray you don't have to use Selenium with BS; BS has that functionality built in. In the answer I referred to, the first part is for scraping image URLs, which is what you are asking for, and the second part is for downloading the images from those URLs. But sure, you can use BS with Selenium if you need to, though I don't know for what purpose :) – Chiefir May 09 '18 at 10:54
  • @Rachitkapadia, I sense meme material. – Sean Francis N. Ballais May 09 '18 at 10:59
  • @Chiefir OK. I was using BS before but moved to Selenium because of issues with JavaScript-rendered pages. It was a while ago, but if memory serves, doesn't BS rely on the requests library, which is a dead end as far as JavaScript-rendered pages are concerned? In any case, if this project works out it'll need to do some very heavy lifting, so I think I need to bite the bullet and dive into your suggestion of Scrapy. – Zeb May 09 '18 at 11:55
  • @DrakeRamoray You didn't mention that you were going to scrape images from JS-based sites. You are right, in that case BS does not help. Scrapy can deal with JS, but only in limited cases; I think your solution will be a combination of Selenium and Scrapy. I can't give more advice because I have never used Selenium. P.S. If you are going to scrape a 50:50 mix of static and JS-based sites, it may still be an option to write a BS spider for the static ones, because it seems to me it would be faster than Selenium, but I might be mistaken here :) – Chiefir May 09 '18 at 12:20