1

I am trying to create a web scraper in Python that goes through all the products of an AliExpress supplier. My problem is that when I run it without logging in, I am eventually redirected to the login page. I added a login section to my code, but it does not help. I would appreciate any suggestions.

My code:

import requests
from bs4 import BeautifulSoup
import re
import sys
from lxml import html


def go_through_paginator(link):
    source_code = requests.get(link, data=payload,  headers = dict(referer = link))
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    print(soup)
    for page in soup.findAll ('div', {'class' : 'ui-pagination-navi util-left'}):
        for next_page in page.findAll ('a', {'class' : 'ui-pagination-next'}):
            next_page_link="https:" + next_page.get('href')
            print (next_page_link)
            gather_all_products (next_page_link)

def gather_all_products (url):
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for item in soup.findAll ('a', {'class' : 'pic-rind'}):
        product_link=item.get('href')
    go_through_paginator(url)


payload = {
    "loginId": "EMAIL", 
    "password": "LOGIN",
}

LOGIN_URL='https://login.aliexpress.com/buyer.htm?spm=2114.12010608.1000002.4.EihgQ5&return=https%3A%2F%2Fwww.aliexpress.com%2Fstore%2F1816376%3Fspm%3D2114.10010108.0.0.fs2frD&random=CAB39130D12E432D4F5D75ED04DC0A84'

session_requests = requests.session()
source_code = session_requests.get(LOGIN_URL)
source_code = session_requests.post(LOGIN_URL, data = payload)


URL='https://www.aliexpress.com/store/1816376?spm=2114.10010108.0.0.fs2frD'

source_code = requests.get(URL, data=payload,  headers = dict(referer = URL))
plain_text = source_code.text
soup = BeautifulSoup(plain_text)

for L1 in soup.findAll ('li', {'id' : 'product-nav'}):
    for L1_link in L1.findAll('a', {'class' : 'nav-link'}):
        link = "https:" + L1_link.get('href') 
        gather_all_products(link)

And this is the AliExpress login URL: https://login.aliexpress.com/buyer.htm?spm=2114.12010608.1000002.4.EihgQ5&return=https%3A%2F%2Fwww.aliexpress.com%2Fstore%2F1816376%3Fspm%3D2114.10010108.0.0.fs2frD&random=CAB39130D12E432D4F5D75ED04DC0A84

MattDMo
  • Are you doing anything with the cookie they send back? They're probably authenticating off of that. Your cookie probably needs to be in the header, but it looks like your header is just the URL? – Alexander Kleinhans Jan 13 '17 at 22:34
  • I would probably diff the request headers sent when logged in and when logged out, using something like this, and then set them however the site expects. https://stackoverflow.com/questions/4423061/view-http-headers-in-google-chrome – Alexander Kleinhans Jan 13 '17 at 22:37

2 Answers

0

Try setting the xman_t and intl_common_forever cookies to the values returned in the response cookies.

I tried this to grab all the product information directly. Before I set xman_t and intl_common_forever, AliExpress only let me grab 7 products. After I set them, I successfully grabbed 50 products.

Hopefully this helps you scrape their products.
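
As a rough illustration of the cookie tip above, here is a minimal sketch. The helper name `pick_cookies` and the usage lines in the trailing comment are assumptions made for illustration; only the two cookie names come from this answer.

```python
# Cookie names taken from the answer above; nothing else about them is assumed.
COOKIE_NAMES = ("xman_t", "intl_common_forever")


def pick_cookies(response_cookies, names=COOKIE_NAMES):
    """Return {name: value} for the named cookies found in response_cookies.

    response_cookies can be any mapping with .get(), for example a plain
    dict or the cookie jar attached to a requests response.
    """
    picked = {}
    for name in names:
        value = response_cookies.get(name)
        if value is not None:
            picked[name] = value
    return picked


# Hypothetical usage with requests (not executed here; store_url is a placeholder):
#   first = requests.get(store_url)
#   session = requests.Session()
#   session.cookies.update(pick_cookies(first.cookies))
```

Note that a `requests.Session` already stores the cookies from its own responses, so explicit copying like this mainly matters when the first response was fetched outside the session.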

0

Your problem is a bit too complex for Stack Overflow, but let's look at a few tips that can make a huge difference.

What is happening here is that AliExpress suspects you of being a robot and asks you to log in to proceed.

To avoid that, you need to fortify your scraper a bit. Let's take a look at some common tips for AliExpress.
First, you need to sanitize your URLs and remove tracking parameters:

URL='https://www.aliexpress.com/store/1816376?spm=2114.10010108.0.0.fs2frD'
#                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# we don't need this parameter, it just helps AliExpress to identify us
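
One way to do this programmatically is sketched below with the standard library. The function name `strip_tracking` and the `TRACKING_PARAMS` set are illustrative assumptions; `spm` is the only tracking parameter identified here, so extend the set if you spot others.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: `spm` is the only parameter we know to be tracking-related.
TRACKING_PARAMS = {"spm"}


def strip_tracking(url, tracking=TRACKING_PARAMS):
    """Return url with the query parameters named in `tracking` removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in tracking
    ]
    # Rebuild the URL with only the remaining query parameters.
    return urlunsplit(parts._replace(query=urlencode(kept)))


clean_url = strip_tracking(
    "https://www.aliexpress.com/store/1816376?spm=2114.10010108.0.0.fs2frD"
)
print(clean_url)  # https://www.aliexpress.com/store/1816376
```

The same helper can be applied to every pagination and category link before it is requested.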

Then, we should fortify our requests Session with browser-like headers:

BASE_HEADERS = {
    # let's use the headers of Chrome 96 on Windows to blend in:
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    # note: "br" responses are only decoded if a brotli package is installed
    "accept-encoding": "gzip, deflate, br",
}
# requests.session() takes no arguments, so set the headers on the session itself
session = requests.session()
session.headers.update(BASE_HEADERS)

These two tips will decrease login requests significantly!
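
It is also worth checking that every request actually goes through the session. In the question's code the login uses `session_requests`, but the later pages are fetched with plain `requests.get`, which does not share that session's cookies or headers. A minimal sketch, where `make_session` and `fetch` are hypothetical helper names:

```python
import requests


def make_session(headers):
    """Create a session that sends the given default headers on every request."""
    session = requests.Session()
    session.headers.update(headers)
    return session


def fetch(session, url):
    """Fetch a page through the shared session so its cookies are reused."""
    response = session.get(url, timeout=30)
    response.raise_for_status()
    return response.text
```

With this in place, calls such as `requests.get(link, ...)` in the scraper would become `fetch(session, link)`, passing the same session object to each function.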

Though that's just the tip of the iceberg. For more, see the blog tutorial I wrote, How to Scrape AliExpress.

Granitosaurus