
I am using requests and bs4 to scrape some data from a Chinese website that also has an English version. I wrote this to see if I get the right data:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://dotamax.com/hero/rate/')
soup = BeautifulSoup(page.content, "lxml")
for i in soup.find_all('span'):
    print(i.text)

And I do. The only problem is that the text is in Chinese, although it is in English when I look at the page source. Why do I get Chinese instead of English, and how can I fix that?

Chen Guevara

2 Answers


The website appears to check the GET request for an Accept-Language header. If the request doesn't include one, it serves the Chinese version. The fix is easy: pass custom headers, as described in the requests documentation.

import requests
from bs4 import BeautifulSoup

headers = {'Accept-Language': 'en-US,en;q=0.8'}

page = requests.get('http://dotamax.com/hero/rate/', headers=headers)
soup = BeautifulSoup(page.content, "lxml")
for i in soup.find_all('span'):
    print(i.text)

produces:

Anti-Mage
Axe
Bane
Bloodseeker
Crystal Maiden
Drow Ranger
...

etc.
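To make the header value above less opaque: `en-US,en;q=0.8` lists languages in order of preference, and `q=` is a weight between 0 and 1 that defaults to 1 when omitted. The following is a minimal sketch of how such a value can be read, assuming a well-formed header; the function name `parse_accept_language` is hypothetical, and how dotamax.com actually interprets the header is not known.

```python
def parse_accept_language(value):
    """Split an Accept-Language value into (language, weight) pairs,
    ordered from most to least preferred. Assumes well-formed input."""
    prefs = []
    for part in value.split(','):
        part = part.strip()
        if not part:
            continue
        # "en;q=0.8" -> language "en", parameter "q=0.8"
        lang, _, params = part.partition(';')
        params = params.strip()
        # A language without an explicit q= has weight 1.0
        weight = float(params[2:]) if params.startswith('q=') else 1.0
        prefs.append((lang.strip(), weight))
    # sorted() is stable, so equal weights keep their original order
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


print(parse_accept_language('en-US,en;q=0.8'))
# [('en-US', 1.0), ('en', 0.8)]
```

So the request says "US English preferred, any English acceptable", which is enough for the server to stop falling back to Chinese.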

Usually, when a page looks different in your browser than in the content returned by requests, the cause is the type of request or the headers being sent. One web-scraping tip I wish I had learned much earlier: press F12 and open the "Network" tab in Chrome or Firefox to see the exact request and headers your browser sends, which is very useful for debugging.

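One way to act on the Network tab tip is to copy the request headers your browser shows and compare them with what requests sends by default. This is only a sketch: the helper name `headers_missing_from_requests` and the browser header values are made up for illustration, so substitute whatever your own Network tab displays.

```python
import requests


def headers_missing_from_requests(browser_headers):
    """Return the browser headers that requests does not send by default."""
    # default_headers() is what requests sends when no headers are given
    defaults = requests.utils.default_headers()
    return {
        name: value
        for name, value in browser_headers.items()
        if name not in defaults  # lookup is case-insensitive
    }


# Hypothetical values, as if copied from the browser's Network tab
browser_headers = {
    'User-Agent': 'Mozilla/5.0 (example)',
    'Accept': 'text/html',
    'Accept-Language': 'en-US,en;q=0.8',
}

missing = headers_missing_from_requests(browser_headers)
print(missing)
```

Any header in the result is a candidate explanation for the difference. After a real request, `page.request.headers` shows the headers that were actually sent.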

n1c9

You have to tell the server which language you prefer via the HTTP headers:

    import requests
    from bs4 import BeautifulSoup
    headers = {
        'Accept-Language': 'en-US'
    }
    page = requests.get('http://dotamax.com/hero/rate/', headers=headers)
    soup = BeautifulSoup(page.content, "html5lib")
    for i in soup.find_all('span'):
        print(i.text)
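If you scrape more than one page, it may be easier to set the header once on a `requests.Session` so every request carries it. The sketch below assumes that approach; `make_session` is a hypothetical helper, and the request is only prepared, not sent, to check which header would go out.

```python
import requests


def make_session(language='en-US,en;q=0.8'):
    """Create a session that sends Accept-Language with every request."""
    session = requests.Session()
    session.headers.update({'Accept-Language': language})
    return session


session = make_session()

# Build the request without sending it, to inspect the merged headers
request = requests.Request('GET', 'http://dotamax.com/hero/rate/')
prepared = session.prepare_request(request)
print(prepared.headers['Accept-Language'])
# en-US,en;q=0.8
```

With the session in place, `session.get('http://dotamax.com/hero/rate/')` replaces the `requests.get(...)` call and no `headers=` argument is needed.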
kiviak