
I scraped the data with the following code:

# personal skills
skills = soup.findAll("li", {"data-ng-repeat": "xxxskillDetailsxxx"})
for i in skills: 
    print(str(i.get_text())) # The output has 5 different skills, for example.

# languages
languages = soup.findAll("li", {"data-ng-repeat": "xxxlanguagesxxx"})
for n in languages: 
    print(str(n.get_text())) # The output has 3 different languages, for instance.

The code above works when I just print the results. However, when I use the following code to collect the data so that I can later save it as a DataFrame, only the last element is kept. That is, only the last skill and the last language are saved.

data=[]

for url in df.urls[:10]:
    webdriver.get(url)  
    time.sleep(5)
    soup = BeautifulSoup(webdriver.page_source, 'html.parser')
    
    # personal skills
    skills = soup.findAll("li", {"data-ng-repeat": "xxxskillDetailsxxx"})
    for i in skills: 
        data.append(str(i.get_text()))

    # languages
    languages = soup.findAll("li", {"data-ng-repeat": "xxxlanguagesxxx"})
    for n in languages: 
        data.append(str(n.get_text()))

print(data) # Output: only the last skill output and the last language output are printed.

How can I save all the skills and languages into two separate columns, with the values within each column separated by commas?

I searched online for a while but did not find a good solution. Any suggestions are highly appreciated. Thank you.

betahat
  • Edit your question so it's self-contained and has a minimal reproducible example. – baduker Mar 10 '21 at 14:21
  • Look at https://stackoverflow.com/questions/37965638/appending-to-list-saves-only-the-last-item-in-python-3 – DMart Mar 10 '21 at 18:29
  • Could you clarify? You want two columns, skills and languages, with the skills/languages for each row separated by commas? E.g.: |URL 1| Programming, Speaking, Excel| English, German| – Rusty Robot Mar 11 '21 at 03:31
  • @RustyRobot yes, that is correct. Thank you in advance. – betahat Mar 11 '21 at 12:23

1 Answer


I've made a few changes to your code, and think I have what you are looking for.

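import time

from bs4 import BeautifulSoup

# `webdriver` is the already-started Selenium driver and `df` is the DataFrame of URLs from the question.
data = []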

for url in df.urls[:10]:
    webdriver.get(url)
    time.sleep(5)
    soup = BeautifulSoup(webdriver.page_source, 'html.parser')

    # personal skills
    skills_elements = soup.findAll("li", {"data-ng-repeat": "xxxskillDetailsxxx"})
    skills = []
    for i in skills_elements:
        skills.append(str(i.get_text()))

    # languages
    language_elements = soup.findAll("li", {"data-ng-repeat": "xxxlanguagesxxx"})
    languages = []
    for n in language_elements:
        languages.append(str(n.get_text()))
        
    data.append({
        'url':url,
        'skills': ','.join(skills),
        'languages': ','.join(languages),
    })

print(data)
print(data[0]['skills'])

Skills and languages are now stored in their own lists on each iteration. At the end of each iteration, the join method joins all the elements in each list into a single string separated by commas.
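To see the per-page lists and the join in isolation, here is a small sketch that runs without a browser. The text values are made up, and `collect_row` is a hypothetical helper name, not something from your code. I've also added a `strip()` on each item as an assumption, since `get_text()` often returns surrounding whitespace; drop it if you don't need it.

```python
def collect_row(url, skill_texts, language_texts):
    """Build one row for a single page.

    skill_texts / language_texts stand in for the strings you would get
    from element.get_text() on each matching <li>.
    """
    # Assumption: trim whitespace and skip empty items before joining.
    skills = [text.strip() for text in skill_texts if text.strip()]
    languages = [text.strip() for text in language_texts if text.strip()]
    return {
        "url": url,
        "skills": ",".join(skills),
        "languages": ",".join(languages),
    }


# Made-up example input for one page.
row = collect_row(
    "https://example.com/profile/1",
    [" Programming ", "Excel\n", ""],
    ["English", " German"],
)
print(row["skills"])
print(row["languages"])
```

Because the lists are created fresh for each page, the skills of one URL never leak into the row of the next one.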

The end result is a list of dictionaries. Each dictionary contains the url, a string of skills, and a string of languages.
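Since you mentioned saving this as a dataframe later: a list of dictionaries like this can be passed straight to pandas. A minimal sketch, assuming pandas is installed; the rows below are made-up stand-ins for the scraped `data`, and the file name in the comment is only a placeholder.

```python
import pandas as pd

# Made-up rows in the same shape as the `data` list built in the answer.
data = [
    {
        "url": "https://example.com/profile/1",
        "skills": "Programming,Excel",
        "languages": "English,German",
    },
    {
        "url": "https://example.com/profile/2",
        "skills": "Speaking",
        "languages": "English",
    },
]

# Each dictionary becomes one row; the keys become the column names.
result = pd.DataFrame(data)
print(result)

# Optional: write the table to disk ("profiles.csv" is a placeholder name).
# result.to_csv("profiles.csv", index=False)
```

That gives one row per URL, with skills and languages in two separate comma-separated columns.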

Rusty Robot