
Update: the thread now includes an image of one of the more than 6600 target pages, e.g. https://europa.eu/youth/volunteering/organisation/48592 - see below for the screenshot, an explanation of the goals, and a description of the data we want to collect.

I am pretty new to data work in the field of volunteering services. Any help is appreciated. I have learned a lot in the past few days from coding heroes such as αԋɱҽԃ αмєяιcαη and KunduK.

Basically, our goal is to create a quick overview of a set of free volunteering opportunities in Europe. I have the list of URLs I want to use to fetch the data, and I can already do it for one URL. I am currently taking a hands-on approach to diving into Python programming: I have several parser parts that already work - see the overview of several pages below. By the way, I guess we should gather the info with pandas and store it in a CSV...

...and so forth and so forth. Note: not every URL and id is backed by a content page, so we need an incremental n+1 approach: count through the pages one by one, incrementing the id, and skip the gaps - see the sketch below.
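A minimal sketch of that incremental check, assuming an id without a content page simply does not return a normal 200 response (the id range here is only an example):

import requests

base = "https://europa.eu/youth/volunteering/organisation/{}"

with requests.Session() as session:
    for org_id in range(48592, 48600):          # example range only
        response = session.get(base.format(org_id))
        if response.status_code != 200:
            continue                            # no content page behind this id: skip and try n + 1
        print(org_id, "->", len(response.content), "bytes of content")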

See examples:

Approach: I used a CSS selector; XPath and CSS selectors do the same job, and with both BeautifulSoup and lxml we can use either, or mix them with find() and find_all().

So I run this mini-approach here:

from bs4 import BeautifulSoup
import requests

url = 'https://europa.eu/youth/volunteering/organisation/50160'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')

# grab the organisation name via a CSS selector
tag_info = soup.select('.col-md-12 > p:nth-child(3) > i:nth-child(1)')
print(tag_info[0].text)

Output: Norwegian Judo Federation
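For comparison, as mentioned above, the CSS selector can also be replaced by (or mixed with) find()/find_all(). A rough equivalent of the selector, assuming the page keeps the .col-md-12 container with the organisation name inside an <i> in the third paragraph:

from bs4 import BeautifulSoup
import requests

url = 'https://europa.eu/youth/volunteering/organisation/50160'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')

# rough find()/find_all() equivalent of '.col-md-12 > p:nth-child(3) > i:nth-child(1)'
container = soup.find(class_='col-md-12')
paragraphs = container.find_all('p', recursive=False)   # direct <p> children only
print(paragraphs[2].find('i').text)                     # third paragraph, first <i>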

Mini-approach 2:

from lxml import html
import requests

url = 'https://europa.eu/youth/volunteering/organisation/50160'
response = requests.get(url)
tree = html.fromstring(response.content)

# grab the first <p> whose text contains 'Norwegian' via XPath
tag_info = tree.xpath("//p[contains(text(),'Norwegian')]")
print(tag_info[0].text)

Output: Norwegian Judo Federation (NJF) is a center organisation for Norwegian Judo clubs. NJF has 65 member clubs, which have about 4500 active members. 73 % of the members are between ages of 3 and 19. NJF is organized in The Norwegian Olympic and Paralympic Committee and Confederation of Sports (NIF). We are a member organisation in European Judo Union (EJU) and International Judo Federation (IJF). NJF offers and organizes a wide range of educational opportunities to our member clubs.

...and so forth. What I am trying to achieve: the aim is to gather all the interesting information from all 6800 pages - that means fields such as the following (a minimal CSV-row sketch follows the list):

  • the URL of the page, and all parts of the page that are marked in red in the screenshot
  • Name of Organisation
  • Address
  • DESCRIPTION OF ORGANISATION
  • Role
  • Expiring date
  • Scope
  • Last updated
  • Organisation Topics (not noted on every page; only occasionally)

(screenshot of one of the target pages, with the relevant parts marked in red)
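Given the field list above, one way to keep the rows tidy later on is csv.DictWriter with fieldnames mirroring that list. A minimal sketch; parse_page is a hypothetical helper standing in for whatever selectors end up extracting each field:

import csv

FIELDNAMES = ["URL", "Name of Organisation", "Address", "Description of Organisation",
              "Role", "Expiring date", "Scope", "Last updated", "Organisation Topics"]

def parse_page(url, soup):
    # hypothetical helper: return a dict keyed by FIELDNAMES,
    # filled with whatever the selectors find on this page
    return {"URL": url}

with open("organisations.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES, restval="N/A")
    writer.writeheader()
    # for each parsed page: writer.writerow(parse_page(url, soup))

The restval argument fills fields that a page does not provide (such as the occasional Organisation Topics) with "N/A".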

...and then iterate to the next page, getting all the information in the same way, and so forth. So, to get some more experience, I try a next step: gathering the info from all of the pages. Note: we've got 6926 pages.


The question is, regarding the URLs: how do we find out which is the first and which is the last URL? Idea: what if we iterate from zero to 10 000, using the numbers in the URLs!?

import requests
from bs4 import BeautifulSoup
import pandas as pd

numbers = [48592, 50160]


def Main(url):
    with requests.Session() as req:
        for num in numbers:
            response = req.get(url.format(num))
            soup = BeautifulSoup(response.content, 'lxml')
            tag_info = soup.select('.col-md-12 > p:nth-child(3) > i:nth-child(1)')
            print(tag_info[0].text)


Main("https://europa.eu/youth/volunteering/organisation/{}/")

But here I run into issues. I guess I have overlooked something while combining the ideas of the parts above. Again, I guess we should gather the info with pandas and store it in a CSV...
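For what it's worth, here is a hedged sketch of how the pieces above could fit together: iterating over a list of ids, skipping ids without a usable page, and collecting the rows with pandas before writing the CSV. It drops the trailing slash used in the attempt above and reuses the selector from the first mini-approach; the id list is only an example:

import requests
from bs4 import BeautifulSoup
import pandas as pd

numbers = [48592, 50160]                        # example ids only
base = "https://europa.eu/youth/volunteering/organisation/{}"

rows = []
with requests.Session() as session:
    for num in numbers:
        response = session.get(base.format(num))
        if response.status_code != 200:
            continue                            # id without a content page
        soup = BeautifulSoup(response.content, 'lxml')
        tag_info = soup.select('.col-md-12 > p:nth-child(3) > i:nth-child(1)')
        if not tag_info:
            continue                            # selector found nothing on this page
        rows.append({"id": num,
                     "url": base.format(num),
                     "name": tag_info[0].text.strip()})

df = pd.DataFrame(rows)
df.to_csv("organisations.csv", index=False)     # gather with pandas, store as CSV
print(df)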

  • I can smell my own coding style here :P. You have to share a screenshot of the desired output from the website. – αԋɱҽԃ αмєяιcαη Mar 31 '20 at 22:50
  • Hello dear αԋɱҽԃ αмєяιcαη... you were right: I am a fan of your coding approach, and I like your ideas and how you work things out. Wait - I will create a screenshot of the desired output; I just need approx. 60 minutes, then I will add this info to the thread. Meanwhile, many many thanks for your reply and for being here ;) great to see you. – zero Mar 31 '20 at 23:08
  • @αԋɱҽԃ αмєяιcαη: I have now added an image of one of the pages. The pages are all in the same fashion - only the organisation topics are not on every page. I guess we can work with some of the great approaches of yours that I have seen in the past few days, where you looped over a number of pages - e.g. the DAAD pages, a German site for college programs in Germany, which you sampled and gathered into pandas, I guess... Look forward to hearing from you... greetings, zero ;) – zero Mar 31 '20 at 23:52
  • I am trying to understand your target here. Where does your loop start - from which id to which? – αԋɱҽԃ αмєяιcαη Apr 01 '20 at 01:48

1 Answer

import requests
from bs4 import BeautifulSoup
import re
import csv
from tqdm import tqdm


first = "https://europa.eu/youth/volunteering/organisations_en?page={}"
second = "https://europa.eu/youth/volunteering/organisation/{}_en"


def catch(url):
    with requests.Session() as req:
        pages = []
        print("Loading All IDS\n")
        for page in tqdm(range(0, 347)):
            r = req.get(url.format(page))
            soup = BeautifulSoup(r.content, 'html.parser')
            numbers = [item.get("href").split("/")[-1].split("_")[0] for item in soup.findAll(
                "a", href=re.compile("^/youth/volunteering/organisation/"), class_="btn btn-default")]
            pages.extend(numbers)  # collect the ids from every listing page, not just the last one
        return pages


def parse(url):
    links = catch(first)
    with requests.Session() as req:
        with open("Data.csv", 'w', newline="", encoding="UTF-8") as f:
            writer = csv.writer(f)
            writer.writerow(["Name", "Address", "Site", "Phone",
                             "Description", "Scope", "Rec", "Send", "PIC", "OID", "Topic"])
            print("\nParsing Now... \n")
            for link in tqdm(links):
                r = req.get(url.format(link))
                soup = BeautifulSoup(r.content, 'html.parser')
                task = soup.find("section", class_="col-sm-12").contents
                name = task[1].text
                add = task[3].find(
                    "i", class_="fa fa-location-arrow fa-lg").parent.text.strip()
                try:
                    site = task[3].find("a", class_="link-default").get("href")
                except:
                    site = "N/A"
                try:
                    phone = task[3].find(
                        "i", class_="fa fa-phone").next_element.strip()
                except:
                    phone = "N/A"
                desc = task[3].find(
                    "h3", class_="eyp-project-heading underline").find_next("p").text
                scope = task[3].findAll("span", class_="pull-right")[1].text
                rec = task[3].select("tbody td")[1].text
                send = task[3].select("tbody td")[-1].text
                pic = task[3].select(
                    "span.vertical-space")[0].text.split(" ")[1]
                oid = task[3].select(
                    "span.vertical-space")[-1].text.split(" ")[1]
                topic = [item.next_element.strip() for item in task[3].select(
                    "i.fa.fa-check.fa-lg")]
                writer.writerow([name, add, site, phone, desc,
                                 scope, rec, send, pic, oid, "".join(topic)])


parse(second)

Note: I've tested with the first 10 pages. If you are looking for more speed, I advise you to use concurrent.futures, and if there's any error, use try/except.
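Not part of the answer above, but as a hedged illustration of the concurrent.futures idea: a ThreadPoolExecutor can fetch the organisation pages in parallel, with get_row() as a hypothetical stand-in for the per-page parsing done in parse() (worker count and names are illustrative only):

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed

second = "https://europa.eu/youth/volunteering/organisation/{}_en"

def get_row(session, link):
    # hypothetical worker: fetch one organisation page and return
    # whatever fields parse() would extract for this link
    r = session.get(second.format(link))
    soup = BeautifulSoup(r.content, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else "N/A"
    return (link, title)

def run(links, workers=10):
    results = []
    with requests.Session() as session:
        with ThreadPoolExecutor(max_workers=workers) as executor:
            futures = [executor.submit(get_row, session, link) for link in links]
            for future in as_completed(futures):
                try:
                    results.append(future.result())
                except Exception:
                    pass                        # any network/parsing error: skip this page
    return results

# usage sketch: run(catch(first)) would parallelise the fetch step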

  • Wow, I am impressed - this is outstanding! I just came back to the office and saw your solution - not tested yet, but I will run it this afternoon. Regarding the number of pages, **note**: we've got 6926 pages - see https://europa.eu/youth/volunteering/organisations_en#open (I have added a tiny image to the thread above). The **question is** - regarding the URLs, how to find out which is the first and which is the last URL. **Idea**: what if we iterate from zero to 10 000, with the numbers of the URLs!? What do you think? Look forward to hearing from you! Many thanks for everything!! – zero Apr 01 '20 at 10:26
  • @zero! You don't need to iterate from `0` to `10,000`. The first function, `catch`, loops over 347 listing pages; each page returns 20 `id`s, so `20 * 347` = `6940`. You get 6926 because the last page includes only `6` ids, which means `6940 - 14` = 6926. – αԋɱҽԃ αмєяιcαη Apr 01 '20 at 10:37
  • @zero, for the record, here is the full list of [ids](https://paste.centos.org/view/raw/0cd4ba85) in sorted format. – αԋɱҽԃ αмєяιcαη Apr 01 '20 at 11:13
  • Good day - this is more than expected: many many thanks to you. You have helped me a lot ;) – zero Apr 01 '20 at 11:55
  • Many thanks again - your ideas and your pythonic genius are needed here. I need to fetch a bunch of URLs: https://stackoverflow.com/questions/61106309/fetching-multiple-urls-with-beautifulsoup-gathering-meta-data-in-wp-plugins - this one goes over my head: a. how to fetch all the existing URLs, b. how to fetch the meta-data of each plugin... guess that you have a solution... ;) – zero Apr 08 '20 at 20:19
  • Many thanks. One question though: the scraper fetches and parses 347 pages, but it yields a dataset of only 20 rows. Where do I change the code to change this behaviour? Look forward to hearing from you - regards – zero May 04 '21 at 16:35