
I'm trying to scrape the "Competitors" table for every country in the last two Olympic Games from Wikipedia (e.g. https://en.wikipedia.org/wiki/2022_Winter_Olympics) and combine the results into a data frame. I can get as far as building the list of URLs for each country, but when I start scraping I run into a problem: the "Competitors" table sits at a different position on each page (sometimes it is the first table, sometimes the second), and I can't find a unique title to identify it (e.g. https://en.wikipedia.org/wiki/Spain_at_the_2022_Winter_Olympics). I tried to adapt the code from "Scraping a table from a section in Wikipedia", but I can't figure it out. Any help would be appreciated.

Thanks!

  • Code please: *"I can get to the point where I have the list of URLs for each country"* – Rui Barradas Apr 20 '22 at 15:22
  • You'll note that "Sport" is a column name in all of the competitor tables. There is only one place (Norway) where "Sport" is a column name of more than one table - there you would have to code in an idiosyncratic way to solve the problem. – DaveArmstrong Apr 20 '22 at 16:14

1 Answer


This should do it:

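library(rvest)
library(dplyr)

# Collect the links to the per-country pages from the main Games article.
# The selector is position-based (table:nth-child(107)), so it depends on the page layout.
h <- read_html("https://en.wikipedia.org/wiki/2022_Winter_Olympics")
links <- h %>% html_elements(css = "#mw-content-text > div.mw-parser-output > table:nth-child(107) > tbody > tr:nth-child(2) > td > div > ul") %>% 
  html_elements("li a") %>% 
  html_attr("href") 

# Drop footnote anchors; grepl() stays safe even when nothing matches.
links <- links[!grepl("#cite", links, fixed = TRUE)]

comps <- list()
for(i in seq_along(links)){
  r <- read_html(paste0("https://en.wikipedia.org", links[i]))
  ctry <- gsub("/wiki/(.*)_at_the_2022_Winter_Olympics", "\\1", links[i])
  tabs <- r %>% html_table()
  # Position of the "Sport" column in each table (0 if the table has none)
  sport <- sapply(tabs, function(x){g <- grep("Sport", colnames(x)); if(length(g) == 0) 0 else g[1]})
  # The competitors table is the one with "Sport" as its first column
  ind <- which(sport == 1)
  # Norway has more than one such table, so its index is hard-coded
  if(grepl("Norway", links[i], fixed = TRUE)){
    ind <- 7  
  }
  comps[[i]] <- tabs[[ind]] %>% 
    select(Sport, Men, Women, Total) %>% 
    mutate(across(c(Men, Women, Total), as.numeric), 
           country = ctry)
}

# Stack all countries and drop the per-country "Total" rows
comps <- bind_rows(comps) %>% 
  filter(Sport != "Total")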
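
To see the table-selection step in isolation, here is a minimal sketch that runs without any network access. The helper name `pick_competitors` and the two small data frames are made up for illustration; they stand in for the list that `html_table()` returns for one country page. It assumes the competitors table is the only one whose first column is named "Sport".

```r
# Hypothetical helper: return the first table whose first column is "Sport",
# or NULL when no table matches.
pick_competitors <- function(tabs) {
  is_match <- vapply(tabs, function(x) {
    ncol(x) > 0 && identical(colnames(x)[1], "Sport")
  }, logical(1))
  if (!any(is_match)) return(NULL)
  tabs[[which(is_match)[1]]]
}

# Made-up stand-ins for the tables scraped from one page
tabs <- list(
  data.frame(Medal = "Gold", Name = "A"),
  data.frame(Sport = c("Alpine skiing", "Total"),
             Men = c(1, 1), Women = c(1, 1), Total = c(2, 2))
)

comp <- pick_competitors(tabs)
# Drop the summary row, as in the full loop
comp <- comp[comp$Sport != "Total", ]
```

Because the helper simply takes the first match, a page with several matching tables (Norway, in this case) would still need a manual override like the one in the loop.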
DaveArmstrong