I went through some threads and cannot find a solution.
If I scrape Amazon with Selenium and BeautifulSoup, everything works fine. But as soon as I activate "headless" mode, my output changes and I need to enter a captcha in order to continue (which, naturally, is not scrape-friendly).
My goal is to avoid being detected as a bot (not only on Amazon, but on every other page too). This works as long as I scrape in non-headless mode, but that is heavy on my resources!
One idea: is it possible that a headless browser does not accept cookies, scripts, and images? How can I enable them?
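What I suspect so far: headless Chrome reports a user agent containing "HeadlessChrome" and starts with a small default window (800x600), so a site can spot it easily. A minimal sketch of the option tweaks I want to try (the user-agent string here is only an example, not the one my Chrome actually sends):

from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
# Override the headless user agent with a normal desktop one (example string, adjust to your Chrome version):
chrome_options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                            'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36')
# Headless Chrome defaults to an 800x600 window; use a realistic size:
chrome_options.add_argument('--window-size=1920,1080')
# Stop Chrome from exposing the navigator.webdriver automation flag (newer Chrome versions):
chrome_options.add_argument('--disable-blink-features=AutomationControlled')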
Here is my code:
import os
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def seleniumhtml_url(link):
    dic = {}
    dirname = os.path.dirname(__file__)
    filepath = os.path.join(dirname, 'chromedriver')
    chrome_options = Options()
    chrome_options.add_argument('--incognito')
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-extensions')
    chrome_options.add_argument('start-maximized')
    chrome_options.add_argument('disable-infobars')
    # executable_path is optional; if not specified, chromedriver is searched on PATH
    driver = webdriver.Chrome(executable_path=filepath, options=chrome_options)
    driver.get(link)
    time.sleep(3)  # give the page a moment to load
    html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    driver.quit()
    soup = BeautifulSoup(html, 'lxml')
    html = text_from_html(soup)  # my helper that extracts the visible text
    dic["html"] = html
    return dic
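To see what the headless browser actually reports, I print the user agent from inside the driver (just a quick check, not part of the scraper itself):

driver = webdriver.Chrome(executable_path=filepath, options=chrome_options)
# With --headless this prints something like "... HeadlessChrome/80.0.3987.132 ...",
# which is an easy signal for a site to block on.
print(driver.execute_script("return navigator.userAgent"))
driver.quit()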
The output is:
Enter the characters you see below Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies. Type the characters you see in this image: Try different image Continue shopping Conditions of Use Privacy Policy © 1996-2014, Amazon.com, Inc. or its affiliates
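For now I can at least detect the block by checking the scraped text for the captcha wording from the output above (a crude guard; the product URL is just a hypothetical example):

result = seleniumhtml_url('https://www.amazon.com/dp/EXAMPLE')  # hypothetical product URL
if "Type the characters you see in this image" in result["html"]:
    print("Got the captcha page instead of the product page")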