
I am new to Python programming and web scraping. I can get the relevant information from the website, but it comes back as a list with a single element that contains all of the text. The problem is that I cannot remove the unwanted parts from this one-element list, and I am not sure whether that is possible at all. Is there any way to create a Python dictionary like the example below?

{'Kabul': 'River Kabul', 'Tirana': 'River Tirane', ...}

Any help will be really appreciated. Thanks in advance.

from bs4 import BeautifulSoup
import urllib.request

url = "https://sites.google.com/site/worldfactsinc/rivers-of-the-world-s-capital-cities"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'}
req = urllib.request.Request(url, headers=headers)
resp = urllib.request.urlopen(req)
html = resp.read()

soup = BeautifulSoup(html, "html.parser")
attr = {"class":"sites-layout-tile sites-tile-name-content-1"}
rivers = soup.find_all(["table", "tr", "td","div","div","div"], attrs=attr)

data = [div.text for div in rivers]

print(data[0])
Andersson
  • What's this one element list look like? What's the actual data you get returned (or at least an example subset)? – MCBama Dec 04 '17 at 19:52
  • COUNTRY - CAPITAL CITY - RIVER                    A       Afghanistan - Kabul - River Kabul.  Albania - Tirana - River Tirane.  Andorra - Andorra La Vella - The Gran Valira. Argentina - Buenos Aries - River Plate. – user8838477 Dec 04 '17 at 19:55
  • I don't think that's the right way to get the elements one by one. See https://stackoverflow.com/questions/15951591/python-beautiful-soup-searching-result-strings – Aakash Verma Dec 04 '17 at 20:04
  • @user8838477, are you looking for a `urllib` + `BeautifulSoup` solution only? – Andersson Dec 04 '17 at 20:08
  • Not necessarily, anything that works. – user8838477 Dec 04 '17 at 20:12

3 Answers


If you can find a cleaner way to pull the data from the web page, use it. If not, the following turns the single string into a usable, modifiable dictionary:

web_ele = ['COUNTRY - CAPITAL CITY - RIVER A Afghanistan - Kabul - River Kabul. Albania - Tirana - River Tirane. Andorra - Andorra La Vella - The Gran Valira. Argentina - Buenos Aries - River Plate. ']

web_ele[0] = web_ele[0].replace('COUNTRY - CAPITAL CITY - RIVER A ', '')
rows = web_ele[0].split('.')

data_dict = {}
for row in rows:
  data = row.split(' - ')
  if len(data) == 3:
    data_dict[data[0].strip()] = {
      'Capital':data[1].strip(),
      'River':data[2].strip(),
    }

print(data_dict)
# output: {'Afghanistan': {'Capital': 'Kabul', 'River': 'River Kabul'}, 'Albania': {'Capital': 'Tirana', 'River': 'River Tirane'}, 'Andorra': {'Capital': 'Andorra La Vella', 'River': 'The Gran Valira'}, 'Argentina': {'Capital': 'Buenos Aries', 'River': 'River Plate'}}

You'll probably have to account for the section letters ('A', 'B', 'C', ...) that appear in your string. The header should only appear the one time, but if it does show up again you can strip it out the same way.
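One way to handle those section letters, and to get the city-to-river dictionary the question asks for, might look like the sketch below. It assumes every row has the form "Country - Capital - River." and that a section letter is a single capital letter followed by whitespace. The `parse_rivers` name and the "Exampleland" row are made up for illustration and are not taken from the page.

```python
import re

HEADER = "COUNTRY - CAPITAL CITY - RIVER"

def parse_rivers(text):
    """Return {capital: river} from flat 'Country - Capital - River.' text."""
    text = text.replace(HEADER, "")
    result = {}
    for row in text.split("."):
        # Drop a lone section letter ("A", "B", ...) in front of the country name.
        row = re.sub(r"^\s*[A-Z]\s+(?=[A-Z])", "", row)
        parts = row.split(" - ", 2)
        if len(parts) == 3:
            result[parts[1].strip()] = parts[2].strip()
    return result

sample = (
    "COUNTRY - CAPITAL CITY - RIVER      A    Afghanistan - Kabul - River Kabul. "
    "Albania - Tirana - River Tirane. "
    "B Exampleland - Sample City - River Demo. "  # made-up row for illustration
)
print(parse_rivers(sample))
```

Splitting on '.' will still cut a row short if a name itself contains a period, so this is only a starting point.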

Again, I would probably suggest finding a cleaner way to pull your data but this will get you something to work with.

MCBama

Code:

from bs4 import BeautifulSoup
import urllib.request

url = "https://sites.google.com/site/worldfactsinc/rivers-of-the-world-s-capital-cities"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'}
req = urllib.request.Request(url, headers=headers)
resp = urllib.request.urlopen(req)
html = resp.read()

soup = BeautifulSoup(html, "html.parser")
rivers = soup.select_one("td.sites-layout-tile.sites-tile-name-content-1")

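# Split each non-empty row on '-' and keep everything after the country name.
# The [4:-4] slice drops the first and last four matches, which are presumably
# not data rows on this page.
data = [
    div.text.split('-')[1:]
    for div in rivers.find_all('div', style='font-size:small')
    if div.text.strip()
    ][4:-4]
# Each remaining item must be a [city, river] pair for the unpacking to work.
data = {k.strip():v.strip() for k,v in data}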

print(data)

Steps:

  • Select the container tag ('td.sites-layout-tile.sites-tile-name-content-1')
  • Find all <div style='font-size:small'> children tags, select the text and split by '-'.
  • Create a dictionary from the items in data.
t.m.adam

Another way to get the required result (a dictionary of city: river pairs) is to use requests and lxml, as below:

import requests
from lxml import html

url = "https://sites.google.com/site/worldfactsinc/rivers-of-the-world-s-capital-cities"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'}
req = requests.get(url, headers=headers)
source = html.fromstring(req.content)

xpath = '//b[.="COUNTRY - CAPITAL CITY - RIVER"]/following::div[b and following-sibling::hr]'
rivers = [item.text_content().strip() for item in source.xpath(xpath) if item.text_content().strip()]
rivers_dict = {}

for river in rivers:
    rivers_dict[river.split("-")[1].strip()] = river.split("-")[2].strip()

print(rivers_dict)

Output:

{'Asuncion': 'River Paraguay.', 'La Paz': 'River Choqueapu.', 'Kinshasa': 'River Congo.', ...}

...147 items total
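The values in the output above keep their trailing period, and splitting on a bare "-" would misread any row whose country or river name contains a hyphen. A minimal sketch of a stricter row parser follows. It assumes the separator on the page is exactly " - " with ordinary spaces. The `city_river` helper and the sample rows are illustrative, not copied from the site.

```python
def city_river(row):
    """Return (city, river) for a 'Country - Capital - River.' row, else None."""
    parts = [part.strip() for part in row.split(" - ")]
    if len(parts) != 3:
        return None  # blank or malformed row
    _country, city, river = parts
    return city, river.rstrip(".")

# Illustrative rows; the second has a hyphen inside the country name.
rows = [
    "Paraguay - Asuncion - River Paraguay.",
    "Guinea-Bissau - Bissau - River Geba.",
    "   ",
]

rivers_dict = dict(pair for pair in map(city_river, rows) if pair)
print(rivers_dict)
```

In the answer's loop, `city_river(river)` could stand in for the two `river.split("-")` calls, with `None` results skipped.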

Andersson