
I have used the scholarly package and its search-by-author-name method on the author names generated in question 3 to get the author profiles, including all the citation information for all of the professors. I was able to load the data into a final dataframe, with NA values for those who do not have a Google Scholar profile. However, there is an issue: the citation information for approximately 8 authors does not match what is shown on the Google Scholar website, because the scholarly package is retrieving the citation information of other authors with the same name. I believe I can fix this by using the search_author_id function, but the question is how to get the author_ids of all the professors in the first place.

Any help would be appreciated.

Cheers, Yash

Yash Ahuja

1 Answer


This solution may not be suitable if you want to stay with the scholarly package; it uses beautifulsoup instead.

The author IDs are located inside the name element, in the <a> tag's href attribute. Here's how we can grab them:

# assumes that the request has already been sent and soup created

link = soup.select_one('.gs_ai_name a')['href']

# https://stackoverflow.com/a/6633693/15164646
# partition() splits the href into three parts around "user=";
# everything AFTER "user=" is the author ID.
before_keyword, keyword, author_id = link.partition('user=')

# author_id -> RlANTZEAAAAJ
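As an alternative to string partitioning, the ID can also be pulled out of the URL's query string with the standard library's urllib.parse (the link value below is a hypothetical example of what the selector returns):

from urllib.parse import urlparse, parse_qs

# hypothetical example href, shaped like the ones the selector above returns
link = '/citations?hl=en&user=RlANTZEAAAAJ'

# parse_qs() maps each query parameter to a list of its values
author_id = parse_qs(urlparse(link).query)['user'][0]

# author_id -> RlANTZEAAAAJ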

Code that goes a bit beyond the scope of your question (full example in the online IDE under the bs4 folder -> get_profiles.py):

from bs4 import BeautifulSoup
import requests, lxml, os

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

proxies = {
  'http': os.getenv('HTTP_PROXY')
}

html = requests.get('https://scholar.google.com/citations?view_op=view_org&hl=en&org=9834965952280547731', headers=headers, proxies=proxies).text
soup = BeautifulSoup(html, 'lxml')

for result in soup.select('.gs_ai_chpr'):
  name_tag = result.select_one('.gs_ai_name a')
  name = name_tag.text
  link = name_tag['href']

  # https://stackoverflow.com/a/6633693/15164646
  # everything after "user=" in the href is the author ID
  before_keyword, keyword, author_id = link.partition('user=')

  affiliations = result.select_one('.gs_ai_aff').text
  email = result.select_one('.gs_ai_eml').text

  try:
    interests = result.select_one('.gs_ai_one_int').text
  except AttributeError:
    # some profiles list no research interests
    interests = None

  # "Cited by 107516" -> "107516"
  cited_by = result.select_one('.gs_ai_cby').text.split(' ')[2]

  print(f'{name}\nhttps://scholar.google.com{link}\n{author_id}\n{affiliations}\n{email}\n{interests}\n{cited_by}\n')

Output:

Jeong-Won Lee
https://scholar.google.com/citations?hl=en&user=D41VK7AAAAAJ
D41VK7AAAAAJ
Samsung Medical Center
Verified email at samsung.com
Gynecologic oncology
107516
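To tie this back to the question: once you have an author_id, you can pass it to scholarly's search_author_id instead of searching by name, which avoids picking up other authors with the same name. A minimal sketch, assuming a recent scholarly release (the API has changed between versions):

from scholarly import scholarly

# ID taken from the output above
author = scholarly.search_author_id('D41VK7AAAAAJ')

# fill() fetches the remaining profile sections, e.g. citation counts
scholarly.fill(author)

print(author['name'], author['citedby'])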

Alternatively, you can do the same thing with the Google Scholar Profiles API from SerpApi, without having to solve CAPTCHAs, find proxies, or maintain the parser over time.

It's a paid API with a free plan.

Code to integrate:

from serpapi import GoogleSearch
import os

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    "api_key": os.getenv("API_KEY"),      # your serpapi API key
    "engine": "google_scholar_profiles",  # search engine
    "mauthors": "samsung"                 # search query
}

search = GoogleSearch(params)             # where data extraction happens
results = search.get_dict()               # JSON -> Python dictionary

for result in results.get('profiles', []):  # empty list if no profiles matched
    name = result.get('name')
    email = result.get('email')
    author_id = result.get('author_id')
    affiliation = result.get('affiliations')
    cited_by = result.get('cited_by')
    # guard against profiles with no listed interests
    first_interest = (result.get('interests') or [{}])[0]
    interests = first_interest.get('title')
    interests_link = first_interest.get('link')

    # print inside the loop so every profile is shown, not only the last one
    print(f'{name}\n{email}\n{author_id}\n{affiliation}\n{cited_by}\n{interests}\n{interests_link}\n')

Part of the output:

Jeong-Won Lee
Verified email at samsung.com
D41VK7AAAAAJ
Samsung Medical Center
107516
Gynecologic oncology
https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:gynecologic_oncology
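Since the question's end goal is a final dataframe, the fields parsed by either approach can be accumulated as rows and handed to pandas, which turns None into NA for authors without a profile. A minimal sketch with hypothetical row data:

import pandas as pd

# hypothetical rows as they might be collected inside either loop above;
# None becomes NaN in the resulting dataframe
rows = [
    {'name': 'Jeong-Won Lee', 'author_id': 'D41VK7AAAAAJ', 'cited_by': 107516},
    {'name': 'Some Professor', 'author_id': None, 'cited_by': None},
]

df = pd.DataFrame(rows)
print(df)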

Disclaimer, I work for SerpApi.

Dmitriy Zub
  • How to scrape the data from the following pages? – Mohamed Hachaichi Apr 15 '22 at 10:34
  • @Mohamed-IslemHachaichi the ones from the output? – Dmitriy Zub Apr 15 '22 at 10:48
  • Let's say I want to scrape GS profiles that belong to a given university (assuming there are 120 profiles), but Google Scholar displays only 10 profiles per page; how do I jump to the next page? – Mohamed Hachaichi Apr 15 '22 at 11:15
  • @Mohamed-IslemHachaichi Got it! I wrote a dedicated [scrape Google Scholar Profiles from a certain University in Python](https://serpapi.com/blog/scrape-google-scholar-profiles-from-a-certain-university-in-python/#full_code) blog post about it, using a custom solution via [`parsel`](https://parsel.readthedocs.io/) and SerpApi solution. Let me know if it's what you were looking for. – Dmitriy Zub Apr 15 '22 at 11:26
  • Hi Dmitriy, thanks for the suggestion. But I still can't manage to delete the "label" from the params. Can you try to retrieve data from "EM Normandie"? I can't use your code! Please can you help ASAP. Many thanks. – Mohamed Hachaichi Apr 15 '22 at 12:24
  • @Mohamed-IslemHachaichi you have to delete `label` function argument to make it work. I've [created a Github gist with `parsel` solution that shows an output](https://gist.github.com/dimitryzub/eb00fe06d32e81178878fcd4f1c35e16), and it's working great both locally and on replit, with the query you provided. Let me know if it works. – Dmitriy Zub Apr 15 '22 at 13:22
  • Yes, it does work now. Do you have any idea on how to retrieve data from ResearchGate? – Mohamed Hachaichi Apr 18 '22 at 17:11
  • @Mohamed-IslemHachaichi awesome! What exactly are you looking to parse from researchgate? Can you share the link? – Dmitriy Zub Apr 19 '22 at 04:16
  • The link: (https://www.researchgate.net/institution/EM-Normandie-Business-School/members), but then I want to go to each profile and scrape: Research Interest, Citations, and h-index. – Mohamed Hachaichi Apr 19 '22 at 06:42
  • @Mohamed-IslemHachaichi Thank you. For now, I don't have any suggestions for you. I'll make a blog post about scraping all institution members and their profiles (whole page) very soon. What are your thoughts on having a ResearchGate API on SerpApi? – Dmitriy Zub Apr 19 '22 at 07:50
  • Hey Dmitriy, let's say I want to collect the article titles of each author; how to do so? I've already consulted your blog but the code does not work! It yields nothing... – Mohamed Hachaichi Apr 28 '22 at 10:58
  • Hey @Mohamed-IslemHachaichi, the code from Scrape Google Scholar Profiles from a certain University blog post works. If you want to parse titles from author pages, this code will not work, this is correct behavior as there are other selectors and elements. [Check out this gist to scrape all publications from the authors' page](https://github.com/dimitryzub/stackoverflow-answers-archive/blob/main/answers/Dmitriy/twitter_answers/google/google_scholar/google_scholar_authors/python/google-scholar-extract-all-author-publications.py). – Dmitriy Zub Apr 28 '22 at 16:09
  • Hey @Dmitriy Zub, can we also get the year of publication of each article, along with the article titles? I collected the article titles, but I can't find a way to add the year for each article. – Mohamed Hachaichi May 04 '22 at 19:45
  • Hey @MohamedHachaichi, I've [added year extraction to the script](https://github.com/dimitryzub/stackoverflow-answers-archive/blob/main/answers/Dmitriy/twitter_answers/google/google_scholar/google_scholar_authors/python/google-scholar-extract-all-author-publications.py) and a SerpApi solution. If you don't have a proxies/captcha solver, an API solution is the way to go since it solves this problem. At some point, you will run into a problem where requests are blocked. – Dmitriy Zub May 05 '22 at 08:07
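For the title-and-year extraction discussed in the last few comments, here is a minimal sketch using the scholarly package (an assumption on my side, since the linked gist takes a different approach; field names like 'pub_year' match recent scholarly releases):

from scholarly import scholarly

# same hypothetical author ID as in the answer's output
author = scholarly.search_author_id('D41VK7AAAAAJ')

# fetch only the publications section of the profile
scholarly.fill(author, sections=['publications'])

for pub in author['publications']:
    bib = pub.get('bib', {})
    print(bib.get('title'), bib.get('pub_year'))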