
Update: the thread now includes an image of one of the more than 6600 target pages, e.g. https://europa.eu/youth/volunteering/organisation/48592 - see below for the screenshot, an explanation of the goals, and a description of the data we want to collect.

I am pretty new to data work in the field of volunteering services. Any help is appreciated. I have learned a lot in the past few days from coding heroes such as αԋɱҽԃ αмєяιcαη and KunduK.

Basically, our goal is to create a quick overview of a set of free volunteering opportunities in Europe. I have the list of URLs I want to use to fetch the data, and I can already do it for one URL. I am currently taking a hands-on approach to diving into Python programming: I have several parser parts that already work - see the overview of several pages below. By the way, I guess we should gather the info with pandas and store it in a CSV...

...and so forth and so forth. Note: not every URL and id is backed by a content page, so we need an incremental n+1 approach: count through the pages one by one, incrementing the id, and skip the gaps - see the sketch below.
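A minimal sketch of that incremental check, assuming an id without a content page simply does not return a normal 200 response (the id range here is only an example):

import requests

base = "https://europa.eu/youth/volunteering/organisation/{}"

with requests.Session() as session:
    for org_id in range(48592, 48600):          # example range only
        response = session.get(base.format(org_id))
        if response.status_code != 200:
            continue                            # no content page behind this id: skip and try n + 1
        print(org_id, "->", len(response.content), "bytes of content")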

See examples:

Approach: I used a CSS selector; XPath and CSS selectors do the same job, and with both BeautifulSoup and lxml we can use either, or mix them with find() and find_all().

So I run this mini-approach here:

from bs4 import BeautifulSoup
import requests

url = 'https://europa.eu/youth/volunteering/organisation/50160'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')

# grab the organisation name via a CSS selector
tag_info = soup.select('.col-md-12 > p:nth-child(3) > i:nth-child(1)')
print(tag_info[0].text)

Output: Norwegian Judo Federation
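For comparison, as mentioned above, the CSS selector can also be replaced by (or mixed with) find()/find_all(). A rough equivalent of the selector, assuming the page keeps the .col-md-12 container with the organisation name inside an <i> in the third paragraph:

from bs4 import BeautifulSoup
import requests

url = 'https://europa.eu/youth/volunteering/organisation/50160'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')

# rough find()/find_all() equivalent of '.col-md-12 > p:nth-child(3) > i:nth-child(1)'
container = soup.find(class_='col-md-12')
paragraphs = container.find_all('p', recursive=False)   # direct <p> children only
print(paragraphs[2].find('i').text)                     # third paragraph, first <i>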

Mini-approach 2:

from lxml import html
import requests

url = 'https://europa.eu/youth/volunteering/organisation/50160'
response = requests.get(url)
tree = html.fromstring(response.content)

# grab the first <p> whose text contains 'Norwegian' via XPath
tag_info = tree.xpath("//p[contains(text(),'Norwegian')]")
print(tag_info[0].text)

Output: Norwegian Judo Federation (NJF) is a center organisation for Norwegian Judo clubs. NJF has 65 member clubs, which have about 4500 active members. 73 % of the members are between ages of 3 and 19. NJF is organized in The Norwegian Olympic and Paralympic Committee and Confederation of Sports (NIF). We are a member organisation in European Judo Union (EJU) and International Judo Federation (IJF). NJF offers and organizes a wide range of educational opportunities to our member clubs.

...and so forth. What I am trying to achieve: the aim is to gather all the interesting information from all 6800 pages - that means fields such as the following (a minimal CSV-row sketch follows the list):

  • the URL of the page, and all parts of the page that are marked in red in the screenshot
  • Name of Organisation
  • Address
  • DESCRIPTION OF ORGANISATION
  • Role
  • Expiring date
  • Scope
  • Last updated
  • Organisation Topics (not noted on every page; only occasionally)

(screenshot of one of the target pages, with the relevant parts marked in red)
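Given the field list above, one way to keep the rows tidy later on is csv.DictWriter with fieldnames mirroring that list. A minimal sketch; parse_page is a hypothetical helper standing in for whatever selectors end up extracting each field:

import csv

FIELDNAMES = ["URL", "Name of Organisation", "Address", "Description of Organisation",
              "Role", "Expiring date", "Scope", "Last updated", "Organisation Topics"]

def parse_page(url, soup):
    # hypothetical helper: return a dict keyed by FIELDNAMES,
    # filled with whatever the selectors find on this page
    return {"URL": url}

with open("organisations.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES, restval="N/A")
    writer.writeheader()
    # for each parsed page: writer.writerow(parse_page(url, soup))

The restval argument fills fields that a page does not provide (such as the occasional Organisation Topics) with "N/A".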

...and then iterate to the next page, getting all the information in the same way, and so forth. So, to get some more experience, I try a next step: gathering the info from all of the pages. Note: we've got 6926 pages.


The question is, regarding the URLs: how do we find out which is the first and which is the last URL? Idea: what if we iterate from zero to 10 000, using the numbers in the URLs!?

import requests
from bs4 import BeautifulSoup
import pandas as pd

numbers = [48592, 50160]


def Main(url):
    with requests.Session() as req:
        for num in numbers:
            response = req.get(url.format(num))
            soup = BeautifulSoup(response.content, 'lxml')
            tag_info = soup.select('.col-md-12 > p:nth-child(3) > i:nth-child(1)')
            print(tag_info[0].text)


Main("https://europa.eu/youth/volunteering/organisation/{}/")

But here I run into issues. I guess I have overlooked something while combining the ideas of the parts above. Again, I guess we should gather the info with pandas and store it in a CSV...
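For what it's worth, here is a hedged sketch of how the pieces above could fit together: iterating over a list of ids, skipping ids without a usable page, and collecting the rows with pandas before writing the CSV. It drops the trailing slash used in the attempt above and reuses the selector from the first mini-approach; the id list is only an example:

import requests
from bs4 import BeautifulSoup
import pandas as pd

numbers = [48592, 50160]                        # example ids only
base = "https://europa.eu/youth/volunteering/organisation/{}"

rows = []
with requests.Session() as session:
    for num in numbers:
        response = session.get(base.format(num))
        if response.status_code != 200:
            continue                            # id without a content page
        soup = BeautifulSoup(response.content, 'lxml')
        tag_info = soup.select('.col-md-12 > p:nth-child(3) > i:nth-child(1)')
        if not tag_info:
            continue                            # selector found nothing on this page
        rows.append({"id": num,
                     "url": base.format(num),
                     "name": tag_info[0].text.strip()})

df = pd.DataFrame(rows)
df.to_csv("organisations.csv", index=False)     # gather with pandas, store as CSV
print(df)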

  • I can smell my own coding style here :P. You have to share a screenshot of the desired output from the website. – αԋɱҽԃ αмєяιcαη Mar 31 '20 at 22:50
  • Hello dear αԋɱҽԃ αмєяιcαη... you were right: I am a fan of your coding approach, and I like your ideas and how you work things out. Wait - I will create a screenshot of the desired output; I just need approx. 60 minutes, then I will add this info to the thread. Meanwhile, many many thanks for your reply and for being here ;) great to see you. – zero Mar 31 '20 at 23:08
  • @αԋɱҽԃ αмєяιcαη: I have now added an image of one of the pages. The pages are all in the same fashion - only the organisation topics are not on every page. I guess we can work with some of the great approaches of yours that I have seen in the past few days, where you looped over a number of pages - e.g. the DAAD pages, a German site for college programs in Germany, which you sampled and gathered into pandas, I guess... Look forward to hearing from you... greetings, zero ;) – zero Mar 31 '20 at 23:52
  • I am trying to understand your target here. Where does your loop start - from which id to which? – αԋɱҽԃ αмєяιcαη Apr 01 '20 at 01:48

1 Answer

import requests
from bs4 import BeautifulSoup
import re
import csv
from tqdm import tqdm


first = "https://europa.eu/youth/volunteering/organisations_en?page={}"
second = "https://europa.eu/youth/volunteering/organisation/{}_en"


def catch(url):
    with requests.Session() as req:
        pages = []
        print("Loading All IDS\n")
        for page in tqdm(range(0, 347)):
            r = req.get(url.format(page))
            soup = BeautifulSoup(r.content, 'html.parser')
            numbers = [item.get("href").split("/")[-1].split("_")[0] for item in soup.findAll(
                "a", href=re.compile("^/youth/volunteering/organisation/"), class_="btn btn-default")]
            pages.extend(numbers)  # collect the ids from every listing page, not just the last one
        return pages


def parse(url):
    links = catch(first)
    with requests.Session() as req:
        with open("Data.csv", 'w', newline="", encoding="UTF-8") as f:
            writer = csv.writer(f)
            writer.writerow(["Name", "Address", "Site", "Phone",
                             "Description", "Scope", "Rec", "Send", "PIC", "OID", "Topic"])
            print("\nParsing Now... \n")
            for link in tqdm(links):
                r = req.get(url.format(link))
                soup = BeautifulSoup(r.content, 'html.parser')
                task = soup.find("section", class_="col-sm-12").contents
                name = task[1].text
                add = task[3].find(
                    "i", class_="fa fa-location-arrow fa-lg").parent.text.strip()
                try:
                    site = task[3].find("a", class_="link-default").get("href")
                except:
                    site = "N/A"
                try:
                    phone = task[3].find(
                        "i", class_="fa fa-phone").next_element.strip()
                except:
                    phone = "N/A"
                desc = task[3].find(
                    "h3", class_="eyp-project-heading underline").find_next("p").text
                scope = task[3].findAll("span", class_="pull-right")[1].text
                rec = task[3].select("tbody td")[1].text
                send = task[3].select("tbody td")[-1].text
                pic = task[3].select(
                    "span.vertical-space")[0].text.split(" ")[1]
                oid = task[3].select(
                    "span.vertical-space")[-1].text.split(" ")[1]
                topic = [item.next_element.strip() for item in task[3].select(
                    "i.fa.fa-check.fa-lg")]
                writer.writerow([name, add, site, phone, desc,
                                 scope, rec, send, pic, oid, "".join(topic)])


parse(second)

Note: I've tested with the first 10 pages. If you are looking for more speed, I advise you to use concurrent.futures, and if there's any error, use try/except.
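Not part of the answer above, but as a hedged illustration of the concurrent.futures idea: a ThreadPoolExecutor can fetch the organisation pages in parallel, with get_row() as a hypothetical stand-in for the per-page parsing done in parse() (worker count and names are illustrative only):

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed

second = "https://europa.eu/youth/volunteering/organisation/{}_en"

def get_row(session, link):
    # hypothetical worker: fetch one organisation page and return
    # whatever fields parse() would extract for this link
    r = session.get(second.format(link))
    soup = BeautifulSoup(r.content, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else "N/A"
    return (link, title)

def run(links, workers=10):
    results = []
    with requests.Session() as session:
        with ThreadPoolExecutor(max_workers=workers) as executor:
            futures = [executor.submit(get_row, session, link) for link in links]
            for future in as_completed(futures):
                try:
                    results.append(future.result())
                except Exception:
                    pass                        # any network/parsing error: skip this page
    return results

# usage sketch: run(catch(first)) would parallelise the fetch step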

  • Wow, I am impressed - this is outstanding! I just came back to the office and saw your solution - not tested yet, but I will run it this afternoon. Regarding the number of pages, **note**: we've got 6926 pages - see https://europa.eu/youth/volunteering/organisations_en#open (I have added a tiny image to the thread above). The **question is** - regarding the URLs, how to find out which is the first and which is the last URL. **Idea**: what if we iterate from zero to 10 000, with the numbers of the URLs!? What do you think? Look forward to hearing from you! Many thanks for everything!! – zero Apr 01 '20 at 10:26
  • @zero! You don't need to iterate from `0` to `10,000`. The first function, `catch`, loops over 347 listing pages; each page returns 20 `id`s, so `20 * 347` = `6940`. You get 6926 because the last page includes only `6` ids, which means `6940 - 14` = 6926. – αԋɱҽԃ αмєяιcαη Apr 01 '20 at 10:37
  • @zero, for the record, here is the full list of [ids](https://paste.centos.org/view/raw/0cd4ba85) in sorted format. – αԋɱҽԃ αмєяιcαη Apr 01 '20 at 11:13
  • Good day - this is more than expected: many many thanks to you. You have helped me a lot ;) – zero Apr 01 '20 at 11:55
  • Many thanks again - your ideas and your pythonic genius are needed here. I need to fetch a bunch of URLs: https://stackoverflow.com/questions/61106309/fetching-multiple-urls-with-beautifulsoup-gathering-meta-data-in-wp-plugins - this one goes over my head: a. how to fetch all the existing URLs, b. how to fetch the meta-data of each plugin... guess that you have a solution... ;) – zero Apr 08 '20 at 20:19
  • Many thanks. One question though: the scraper fetches and parses 347 pages, but it yields a dataset of only 20 rows. Where do I change the code to change this behaviour? Look forward to hearing from you - regards – zero May 04 '21 at 16:35