
I just started programming. My task is to extract data from an HTML page into Excel, using Python 3.7. My problem is that I have a website with more URLs inside, and behind those URLs there are more URLs again. I need the data behind the third URL. My first question: how can I tell the program to follow only specific links from one particular ul, rather than every ul on the page?

from bs4 import BeautifulSoup
import urllib.request
import re

page = urllib.request.urlopen("file").read()

soup = BeautifulSoup(page, "html.parser")

print(soup.prettify())

# Collect the hrefs of links whose URL contains "katalog_"
for link in soup.find_all("a", href=re.compile("katalog_")):
    links = link.get("href")
    if "katalog" in links:
        for link in soup.find_all("a", href=re.compile("alle_")):
            links = link.get("href")

print(soup.get_text())

1 Answer


There are many ways. One is to use find_all and be specific about the tags, like "a", just as you did. If that's the only option, then filter the output with a regular expression. You can refer to this thread: Python BeautifulSoup Extract specific URLs. Also, please show us either the link or the HTML structure of the links you want to extract, so we can see the differences between the URLs.
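For instance, to pick links only from one particular ul rather than every ul on the page, you can locate that list first and then search inside it. A minimal sketch against made-up HTML (the class names here are hypothetical; yours will differ):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page: only the links in the "katalog" list should be picked up
html = """
<ul class="katalog">
  <li><a href="katalog_a.html">A</a></li>
  <li><a href="katalog_b.html">B</a></li>
</ul>
<ul class="other">
  <li><a href="katalog_c.html">C</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Narrow the search to the one <ul> first, then match hrefs inside it
katalog_ul = soup.find("ul", {"class": "katalog"})
links = [a["href"] for a in katalog_ul.find_all("a", href=re.compile("katalog_"))]
print(links)  # ['katalog_a.html', 'katalog_b.html']
```

The link in the second ul also matches the regex, but it is never seen because find_all only runs on the first list.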

PS: Sorry, I can't comment because I have <50 reputation, or I would have.

Updated answer based on understanding:

from bs4 import BeautifulSoup
import urllib.request

page = urllib.request.urlopen("https://www.bsi.bund.de/DE/Themen/ITGrundschutz/ITGrundschutzKompendium/itgrundschutzKompendium_node.html").read()
soup = BeautifulSoup(page, "html.parser")

# First level: follow only the navigation links that point to "bausteine"
for firstlink in soup.find_all("a", {"class": "RichTextIntLink NavNode"}):
    firstlinks = firstlink.get("href")
    if "bausteine" in firstlinks:
        # Drop any ";..." suffix from the href and build an absolute URL
        bausteinelinks = "https://www.bsi.bund.de/" + firstlinks.split(';')[0]
        response = urllib.request.urlopen(bausteinelinks).read()
        soup = BeautifulSoup(response, "html.parser")
        # Second level: the "Basepage" link on each Baustein page
        secondlink = "https://www.bsi.bund.de/" + soup.find("a", {"class": "RichTextIntLink Basepage"})["href"].split(';')[0]
        res = urllib.request.urlopen(secondlink).read()
        soup = BeautifulSoup(res, "html.parser")
        # Third level: print the text of the content area
        listoftext = soup.find_all("div", {"id": "content"})
        for text in listoftext:
            print(text.text)
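To get the extracted text into Excel rather than just printing it, one option is to collect the strings and write a CSV file, which Excel opens directly. A sketch using only the standard library (the file name and the sample rows are made up; pandas with to_excel would be an alternative but needs extra packages installed):

```python
import csv

# Suppose the scraping loop above collected its output here instead of printing
extracted = [
    ("Baustein A", "Beschreibung ... text of page A ..."),
    ("Baustein B", "Beschreibung ... text of page B ..."),
]

# Write one row per scraped page; Excel can open .csv files directly
with open("bausteine.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["page", "text"])
    writer.writerows(extracted)
```

Inside the scraping loop you would append `(bausteinelinks, text.text)` to `extracted` instead of calling print.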
Sin Han Jinn
  • https://www.bsi.bund.de/DE/Themen/ITGrundschutz/ITGrundschutzKompendium/itgrundschutzKompendium_node.html – Moritz Nagel Oct 22 '19 at 06:58
  • Starting from that website, I want to open the URLs under "Bausteine", open the URL behind each of those, and then export the data on the final page. – Moritz Nagel Oct 22 '19 at 06:58
  • OK, with re.compile added I can narrow it down to the first 10 URLs I need. That helped. – Moritz Nagel Oct 22 '19 at 07:10
  • I'm still unsure which links you are looking for; I've updated my guess of an answer, check it out. – Sin Han Jinn Oct 22 '19 at 07:17
  • I have edited my post and by that extracted the URLs I was looking for. Now I need a way to open the found URLs. If I understand correctly, your solution only prints them. – Moritz Nagel Oct 22 '19 at 07:21
  • You mean you want to extract more URLs from the new URLs? – Sin Han Jinn Oct 22 '19 at 07:23
  • I think that should work just like the first URL, but I don't know how I can export specific data on that last site. – Moritz Nagel Oct 22 '19 at 07:34
  • You just have to do the same thing: request twice. – Sin Han Jinn Oct 22 '19 at 07:40
  • First of all, thanks for your work, but I don't want to print the links; I want to print the actual text, starting with "Beschreibung" and everything up to "Weitere Informationen". Can I do that? – Moritz Nagel Oct 22 '19 at 07:49
  • Finally I got it. Of course you can; I will update you soon. – Sin Han Jinn Oct 22 '19 at 07:53
  • Maybe I can clarify my problem. Can you give me your email? I can send you the HTML files so you can see my problem. – Moritz Nagel Oct 22 '19 at 07:53
  • Hey Moritz, I've shown an example of how to print text just like you asked. It's just an example; to clean the text or transfer it to an Excel or text file, you would have to explore more. – Sin Han Jinn Oct 22 '19 at 08:12
  • If this answered your question, do mark it as the answer. Thanks. – Sin Han Jinn Oct 22 '19 at 08:22
  • One last problem: your code works perfectly for that website, but the HTML files I have saved on my computer are a little different. Maybe I can send them to you and you can take a quick look? – Moritz Nagel Oct 22 '19 at 08:25