
I am new to Beautiful Soup / Selenium in Python. I am trying to get contacts / emails from a list of URLs:

listOfURLs=['https://oooo.com/Number=xxxxx', 'https://oooo.com/Number/yyyyyy', 'https://oooo.com/Number/zzzzzz']

HTML I am parsing:

<div class="row classicdiv" id="renderContacInfo">
  <div class="col-md-2" style="word-break: break-word;">
    <h6>Contact</h6>
    <h5>Israa S</h5>
  </div>
  <div class="col-md-2" style="word-break: break-word;">
    <h6>Email</h6>
    <h5>israa.s@xxxx.com <br/>
    </h5>
  </div>
  <div class="col-md-2" style="word-break: break-word;">
    <h6>Alternate Email</h6>
    <h5></h5>
  </div>
  <div class="col-md-2">
    <h6>Primary Phone</h6>
    <h5>1--1</h5>
  </div>
  <div class="col-md-2">
    <h6>Alternate Phone</h6>
    <h5>
    </h5>
  </div>
</div>

I am trying to loop over the list of URLs, but I am only able to get the soup from the first URL in the list.

The code I have written:

from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome(chrome_driver_path)
driver.implicitly_wait(300) 
driver.maximize_window()
driver.get(url)
driver.implicitly_wait(30)
content=driver.page_source
soup=BeautifulSoup(content,'html.parser')
contact_text=soup.findAll("div",{"id":"renderContacInfo"})
output1=''
output2=''
print(contact_text)
time.sleep(100)

for tx in contact_text:
    time.sleep(100)
    output1+=tx.find(text="Email").findNext('h5').text
    output2+=tx.find(text="Contact").findNext('h5').text

My questions:

  1. How do I iterate the loop through the list of URLs I have?
  2. How do I filter the Email and Contact out of the soup HTML?
  3. Expected output:

URL                            Contact     Email

https://oooo.com/Number=xxxxx  xxxxxxxx    xxxx@xxx.com

https://oooo.com/Number=yyyyy  yyyyyyyy    yyyy@yyy.com

  • you need an outer loop _for url in listOfURLs:_ – QHarr Mar 31 '20 at 16:33
  • @QHarr I like your suggestion of an outer loop for url. Could we also do the iteration like it was done at this question: /60908216/how-to-handle-multiple-urls-in-beautifultsoup-and-convert-the-data-into-datafram/60908470#comment107771591_60908470 ? That could be another approach, one that I am trying to follow at this question: https://stackoverflow.com/questions/60954426/writing-a-loop-beautifulsoup-and-lxml-for-getting-page-content-in-a-page-to-pag Ideas!? – zero Mar 31 '20 at 21:19

2 Answers


Something like this should do it. I removed all the implicit waits (by the way, if you want to go that route, you should set one once, at the top of your script when you instantiate your driver; also, yours are very long!).

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(chrome_driver_path)
listOfURLs=['https://oooo.com/Number=xxxxx', 'https://oooo.com/Number/yyyyyy', 'https://oooo.com/Number/zzzzzz']
result=[]
for url in listOfURLs:
    driver.get(url)
    content = driver.page_source
    soup = BeautifulSoup(content, 'html.parser')
    contact_text = soup.findAll("div", {"id": "renderContacInfo"})

    for tx in contact_text:
        output1=tx.find(text="Contact").findNext('h5').text
        output2=tx.find(text="Email").findNext('h5').text
        output=f"{url} {output1} {output2}"
        result.append(output)

driver.quit()

`result` is a list which will contain all the collected output in the form url + contact + email.

  • Thanks for the answer, it worked for me. The only thing I've noticed is that I get output from `result` when I use `print(result)`, but I get `[]` when I use `return result`. Any idea why this is happening with lists in particular? – Israa El-Sakka Apr 02 '20 at 09:14
  • Glad it helped. Make sure you use `return result` inside the scope of a function, i.e. wrap your code in a `def`, `return result` at the end, and pay attention to indentation. – 0buz Apr 02 '20 at 10:06
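To illustrate 0buz's comment, here is a minimal sketch of the same extraction wrapped in a function that returns its result instead of printing it. It runs against a static copy of the question's HTML rather than a live driver, and the function name `extract_contacts` is illustrative, not from the original code:

```python
from bs4 import BeautifulSoup

def extract_contacts(html):
    """Parse one page's HTML and return a list of (contact, email) pairs."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for tx in soup.findAll("div", {"id": "renderContacInfo"}):
        contact = tx.find(text="Contact").findNext('h5').text.strip()
        email = tx.find(text="Email").findNext('h5').text.strip()
        results.append((contact, email))
    # Returned, not printed, so the caller actually receives the list.
    return results
```

Calling `extract_contacts(driver.page_source)` inside the URL loop and collecting the returned pairs gives the same `result` list as the answer above, while keeping the parsing logic testable on its own.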

As @QHarr suggested, use an outer loop over the URLs. Use the regular expression module `re` to search for the text.

import re
from bs4 import BeautifulSoup
from selenium import webdriver

listOfURLs=['https://oooo.com/Number=xxxxx', 'https://oooo.com/Number/yyyyyy', 'https://oooo.com/Number/zzzzzz']

driver = webdriver.Chrome(chrome_driver_path)
driver.maximize_window()
driver.implicitly_wait(30)

for url in listOfURLs:
    driver.get(url)
    content = driver.page_source
    soup = BeautifulSoup(content, 'html.parser')
    print(url)
    print(soup.find('h6', text=re.compile("Contact")).find_next('h5').text)
    print(soup.find('h6', text=re.compile("Email")).find_next('h5').text)

driver.quit()
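On the question's markup, the `re.compile` lookups above behave like this (a minimal sketch against a static HTML string, no browser needed). Note that `find()` returns the first match in document order, so `re.compile("Email")` finds the primary `<h6>Email</h6>` before `<h6>Alternate Email</h6>`:

```python
import re
from bs4 import BeautifulSoup

html = """
<div id="renderContacInfo">
  <div><h6>Contact</h6><h5>Israa S</h5></div>
  <div><h6>Email</h6><h5>israa.s@xxxx.com</h5></div>
  <div><h6>Alternate Email</h6><h5></h5></div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
# First <h6> whose text matches the pattern, then the next <h5> sibling-wise.
contact = soup.find('h6', text=re.compile("Contact")).find_next('h5').text
email = soup.find('h6', text=re.compile("Email")).find_next('h5').text
print(contact, email)  # Israa S israa.s@xxxx.com
```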
  • Hi there dear KunduK, many many thanks for the solution with the loop. This is very interesting. Mille grazie, yours zero – zero Mar 31 '20 at 20:13
  • Hello dear KunduK, many many thanks for the answer: in this question you show much of what I need in my own question, visible here at this site: questions/60954426/writing-a-loop-beautifulsoup-and-lxml-for-getting-page-content-in-a-page-to-pag It would be great if you could take a look: techniques such as gathering several pieces of info from a page, collecting them in one output, and then iterating over a list of URLs. I am trying to apply these techniques to my question. I would be glad if you could have a look at the above-mentioned question and lend me a helping hand. Thanks a lot in advance! Yours, zero. – zero Mar 31 '20 at 20:43
  • Dear KunduK, again, I like your answer and I am willing to click on the upvote button, but all I see at the moment is the so-called timeline. Perhaps I will find what you mean and advise me to do. And perhaps you have some ideas for my question; I have just added the goals and aims to the question. Many many thanks in advance. By the way: I have learned from you in the past few days. ;) – zero Mar 31 '20 at 21:03