
I want to use an AWS Lambda function to scrape a website. The crawler is written in Python and uses the Scrapy library, installed via pip.

To run the Lambda function I built a zip of the dependencies (here only Scrapy) on the public Amazon Linux AMI version amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2, as per the AWS documentation, added my handler code to the archive, and uploaded it to create the Lambda function.
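
Roughly, the packaging step looked like this (a simplified sketch; the directory and zip names are just illustrative):

# on the Amazon Linux AMI, so compiled wheels match the Lambda runtime
mkdir package
pip install scrapy -t package          # install dependencies into the package directory
cp my_lambda_function.py package/      # add the handler code
cd package
zip -r ../lambda-deployment.zip *      # zip the contents, not the directory itself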

Now, when I invoke the Lambda function, it gives me the following error:

cannot import name certificate_transparency: ImportError
Traceback (most recent call last):
  File "/var/task/my_lambda_function.py", line 120, in my_lambda_handler
    return get_data_from_scrapy(username, password)
  File "/var/task/my_lambda_function.py", line 104, in get_data_from_scrapy
    process.crawl(MyScrapyFunction)
  File "/var/task/scrapy/crawler.py", line 167, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "/var/task/scrapy/crawler.py", line 195, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "/var/task/scrapy/crawler.py", line 200, in _create_crawler
    return Crawler(spidercls, self.settings)
  File "/var/task/scrapy/crawler.py", line 52, in __init__
    self.extensions = ExtensionManager.from_crawler(self)
  File "/var/task/scrapy/middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/var/task/scrapy/middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "/var/task/scrapy/utils/misc.py", line 44, in load_object
    mod = import_module(module)
  File "/usr/lib64/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/var/task/scrapy/extensions/memusage.py", line 16, in <module>
    from scrapy.mail import MailSender
  File "/var/task/scrapy/mail.py", line 22, in <module>
    from twisted.internet import defer, reactor, ssl
  File "/var/task/twisted/internet/ssl.py", line 59, in <module>
    from OpenSSL import SSL
  File "/var/task/OpenSSL/__init__.py", line 8, in <module>
    from OpenSSL import crypto, SSL
  File "/var/task/OpenSSL/crypto.py", line 12, in <module>
    from cryptography import x509
  File "/var/task/cryptography/x509/__init__.py", line 7, in <module>
    from cryptography.x509 import certificate_transparency
ImportError: cannot import name certificate_transparency

These are the dependency/library versions (all latest at the time) that I'm using:

  • pip 9.0.1
  • Scrapy==1.4.0
  • pyOpenSSL==17.5.0
  • lxml==4.1.1
  • cryptography==2.1.4

Any help would be appreciated. Thanks in advance.


3 Answers


I would not use AWS Lambda for such complicated tasks. Why did you choose it? If it is because it is free, you have several better options:

  • AWS gives new accounts one year of free-tier access to many of its services.
  • AWS Lightsail gives you a free month for the minimum plan.
  • PythonAnywhere.com offers you a free account. I tried Scrapy on PythonAnywhere and it works perfectly. Just please note that the "continuous" running time is up to 2 hours for free accounts and 6 hours for paid accounts (according to their Support).
  • ScrapingHub.com gives you one free crawler. Check the video called "Deploying Scrapy Spider to ScrapingHub" - it is available for free preview in the course "Scrapy: Powerful Web Scraping & Crawling with Python".

I hope this helps. If you have questions, please let me know.

  • I just wanted to scrape a website and dump the data into a database. I agree with you that I shouldn't use Lambda for scraping, and I'm not using it now. But I'm still curious why Lambda didn't work even though all the dependencies were compiled on the Amazon Linux AMI. – Chitresh Sinha Jan 04 '18 at 18:01

I don't know if you ever ended up solving this, but the issue arises from the lxml library. It requires C dependencies to build properly, which gives Lambda a plethora of problems since they're dependent on the OS. I'm deploying Scrapy to AWS with the Serverless framework, and I used two things to solve it: the serverless-python-requirements plugin and the dockerizePip: non-linux setting. This forces Serverless to build the package in a Docker container, which provides the correct binaries. Note that this is also the solution for getting NumPy, SciPy, Pandas, etc., in addition to lxml, to work on AWS Lambda. Here's a blog post that I followed to get it working: https://serverless.com/blog/serverless-python-packaging/
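
The relevant part of serverless.yml looks roughly like this (a sketch; the service name, function name, and runtime are placeholders to adapt to your project):

service: scrapy-crawler            # placeholder service name

provider:
  name: aws
  runtime: python2.7               # match the runtime your handler targets

plugins:
  - serverless-python-requirements

custom:
  pythonRequirements:
    dockerizePip: non-linux        # build dependencies inside a Lambda-compatible Docker container

functions:
  crawl:
    handler: my_lambda_function.my_lambda_handler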

Serverless is nice if you don't want to deal with making the zip file yourself. If you do, here's a Stack Overflow link that shows how you can solve the problem with lxml: AWS Lambda not importing LXML


As Ivan mentioned, the issue here arises from the C dependencies that the Python packages require.

Fortunately, AWS publishes an amazonlinux Docker image that is nearly identical to the AMI that Lambda functions use; here is an article that I used myself and that explains this in more detail.

Here is the Docker configuration I used to build my Scrapy project and package it for Lambda:

FROM amazonlinux:latest
RUN yum -y install git \
    gcc \
    openssl-devel \
    bzip2-devel \
    libffi \
    libffi-devel \
    python3-devel \
    python37 \
    zip \
    unzip \
    && yum clean all

RUN python3 -m pip install --upgrade pip 

COPY src /io

CMD sh /io/package.sh

and here is the package.sh file:

#!/bin/bash

mkdir holder 
python3 -m pip install scrapy OTHER-REPOS -t holder
rm -f /packages/lambda.zip
cp -r /io/* holder
cd holder
zip -r /packages/lambda.zip *

and this is how I build the image and run it with a volume, to get the deployment package zip file after it finishes:

docker build -t TAG_NAME_HERE .
docker run --rm -v ${PWD}/deployment_package:/packages -t TAG_NAME_HERE
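
From there you can upload the zip to Lambda as usual, for example with the AWS CLI (the function name here is just a placeholder):

aws lambda update-function-code \
    --function-name my-scrapy-crawler \
    --zip-file fileb://deployment_package/lambda.zip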

Hope this helps.
