
I'd like to scrape the content of a page once the province (and the commune) are selected.
The following code correctly outputs the provinces and their values.

library(rvest)

page <- read_html(x = "https://www.solferinoesanmartino.it/progetto-torelli/progetto-torelli-risultati/")

text <- page %>% html_nodes(xpath = '//select[@name="provincia"]/option') %>% html_text()
values <- page %>% html_nodes(xpath = '//select[@name="provincia"]/option') %>% html_attr("value")

Res <- data.frame(text = text, values = values, stringsAsFactors = FALSE)
Res 
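
The first few rows look like this (matching the page source excerpt further down; the list continues for all provinces):

                 text values
1 Seleziona Provincia      0
2                   -     74
3           AGRIGENTO     75
4         ALESSANDRIA     19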

Now I'd like to access the page for each value. For example, this retrieves the option text for value=19:

text <- page %>% html_nodes(xpath="//*/option[@value = '19']")%>% html_text() 
text

The relevant part of the page source is the following:

<div class="row results_form_search">
        <form role="search" method="POST" class="search-form" action="/progetto-torelli/progetto-torelli-risultati/" id="search_location">
            <input type="hidden" name="comune_from" value="" />
            <div class="form-row">
                <input type="text" name="cognome" placeholder="Cognome" autocomplete="off" value="">
                <select name="provincia">
                    <option value="0" selected>Seleziona Provincia</option>
                                        <option value="74"
                        >-
</option>
                                        <option value="75"
                        >AGRIGENTO
</option>
                                        <option value="19"
                        >ALESSANDRIA

This is where the content that I want to scrape might be.

    <div class="row">
        <ul class="listing_search">
        </ul>
    </div>

Thank you so much for your advice!

  • As I see it, you might need to use Selenium, which I recommend using from Python. BTW, I am impressed by your clear XPaths :) https://selenium-python.readthedocs.io/ or https://cran.r-project.org/web/packages/RSelenium/vignettes/basics.html – polkas Jun 13 '21 at 21:08
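
If you do go the RSelenium route suggested in that comment, a minimal sketch might look like the block below. It assumes a Selenium server is already running on localhost:4444 (e.g. via Docker); note that selecting an option only populates the comune dropdown, so the search itself would still need to be submitted.

library(RSelenium)
library(rvest)

remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444L, browserName = "firefox")
remDr$open()
remDr$navigate("https://www.solferinoesanmartino.it/progetto-torelli/progetto-torelli-risultati/")

# pick the province with value 19 (ALESSANDRIA); this triggers the XHR
# that fills the comune dropdown
opt <- remDr$findElement(using = "css selector", 'select[name="provincia"] option[value="19"]')
opt$clickElement()

# parse whatever the page now contains with rvest
page <- read_html(remDr$getPageSource()[[1]])
remDr$close()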

1 Answer


RSelenium may end up being the way to go. However, if you can insert some judicious waits, or chunk your requests so the server isn't swamped, you can use rvest and make the same requests the page does.

You first need to generate all the combinations of province and comune (filtering out unwanted values). This can be done by making the same XHR requests the page makes: use the value attribute of each option within the province select to request the comune dropdown options and their associated values.

You then make a further request for each combination pair to get the page content you would see after making those selections from the dropdowns manually and pressing CERCA.

Pauses are needed: there are 10,389 valid combinations by my reckoning, and if you attempt to make all those requests one after another, on top of the initial requests, the server will cut off the connection.
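
If you go the pause route, purrr's rate helpers can bolt a fixed delay onto the request function defined below; a sketch (tune the delay to taste):

slow_get_page <- purrr::slowly(get_page, rate = purrr::rate_delay(5)) # wait 5s before each call
results <- purrr::map2_dfr(combined$id0, combined$id3, .f = slow_get_page)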

Another option would be to chunk `combined` into smaller data frames, make the requests for each chunk at timed intervals, and then combine the results.

library(rvest)
library(dplyr)
library(purrr)

get_provincias <- function(link) {
  # read the province <select>, keeping real provinces only (the :not()
  # filters drop the placeholder, "-" and separator entries)
  nodes <- read_html(link) %>%
    html_nodes('[name="provincia"] > option:not([selected]):not(:contains("-")):not(:contains("\u0085"))')

  df <- data.frame(
    Provincia = nodes %>% html_text(trim = TRUE),
    id0 = nodes %>% html_attr("value")
  )

  return(df)
}

get_comunes <- function(id) {
  # XHR endpoint the page calls to populate the comune dropdown;
  # the trailing "_" parameter is a cache-busting timestamp
  link <- sprintf(
    "https://www.solferinoesanmartino.it/db-torelli/_get_comuni.php?id0=%s&id1=0&_=%i",
    id,
    as.numeric(as.POSIXct(Sys.Date(), format = "%Y-%m-%d"))
  )
  nodes <- read_html(link) %>% html_nodes('option:not([value="0"])')

  df <- data.frame(
    id0 = id,
    Comune = nodes %>% html_text(trim = TRUE),
    id3 = nodes %>% html_attr("value")
  )
  return(df)
}

get_page <- function(prov_id, com_id) {
  # XHR endpoint behind the CERCA button; id2 = province, id3 = comune
  link <- sprintf(
    "https://www.solferinoesanmartino.it/db-torelli/_get_soldati.php?id0=1&id1=&id2=%s&id3=%s&_=%i",
    prov_id,
    com_id,
    as.numeric(as.POSIXct(Sys.Date(), format = "%Y-%m-%d"))
  )

  page <- read_html(link)
  # store the parsed document in a list column, one page per row
  return(tibble(id3 = com_id, page = list(page)))
}
 
provincias <- get_provincias("https://www.solferinoesanmartino.it/progetto-torelli/progetto-torelli-risultati")

comunes <- map_df(provincias$id0, get_comunes) %>% filter(Comune != "-")

combined <- dplyr::right_join(provincias, comunes, by = "id0")

# length(combined$Comune) ->  10389

results <- map2_dfr(combined$id0, combined$id3, .f = get_page)

final <- dplyr::inner_join(combined, results, by = "id3")
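
To pull fields out of the stored pages afterwards, something along these lines should work (a sketch; .listing_name and .listing_level are the classes used in the longer version below, and tidyr::unnest() expands the list columns to one row per person):

final <- final %>%
  mutate(
    cognome = purrr::map(page, ~ html_nodes(.x, ".listing_name") %>% html_text(trim = TRUE)),
    livello = purrr::map(page, ~ html_nodes(.x, ".listing_level") %>% html_text(trim = TRUE))
  ) %>%
  tidyr::unnest(c(cognome, livello))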

Below is a longer version, with the additional info you requested, where I played around with adding pauses. I still found that I could run everything up to and including

combined <- dplyr::right_join(provincias, comunes, by = "id0")

in one go. But after that I needed to chunk the requests into batches of about 2,000, with 20-30 minutes in between. You can try tweaking the timings below. I ended up using the commented-out section to run each batch, leaving a pause of 30 minutes in between.

Some things to consider:

It seems that you can have comune values like `…` which still return listings. With that in mind, you may wish to remove the `:not` parts of this:

html_nodes('[name="provincia"] > option:not([selected]):not(:contains("-")):not(:contains("\u0085"))')

as I assumed that was filtering out invalid results.
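
Dropping those filters would leave something like:

html_nodes('[name="provincia"] > option:not([selected])')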

Next, you might consider writing a helper function with httr's RETRY to make the requests with backoff/retry, rather than using fixed pauses.

The core retry call might look like this:

httr::RETRY(
  "GET",
  <request url>,
  times = 3,
  pause_min = 20 * 60,
  pause_base = 20 * 60
)
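
Wrapped into a drop-in replacement for read_html, that might look like the sketch below (read_html_retry is a name I've made up; the pauses are deliberately long to match the throttling described here):

read_html_retry <- function(url) {
  resp <- httr::RETRY(
    "GET", url,
    times = 3,
    pause_min = 20 * 60,
    pause_base = 20 * 60
  )
  httr::stop_for_status(resp) # fail loudly if all retries are exhausted
  read_html(httr::content(resp, as = "text", encoding = "UTF-8"))
}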

Anyway, those are some ideas. Even without the server cutting the connection, when using waits I still found it started to throttle requests, meaning some requests took quite a long time to complete. Optimizing this could take a lot of time and effort; I spent a good few days playing around with chunk sizes and waits.


library(rvest)
library(dplyr)
library(purrr)

get_provincias <- function(link) {
  # read the province <select>, keeping real provinces only (the :not()
  # filters drop the placeholder, "-" and separator entries)
  nodes <- read_html(link) %>%
    html_nodes('[name="provincia"] > option:not([selected]):not(:contains("-")):not(:contains("\u0085"))')

  df <- data.frame(
    Provincia = nodes %>% html_text(trim = TRUE),
    id0 = nodes %>% html_attr("value")
  )

  return(df)
}

get_comunes <- function(id) {
  # XHR endpoint the page calls to populate the comune dropdown;
  # the trailing "_" parameter is a cache-busting timestamp
  link <- sprintf(
    "https://www.solferinoesanmartino.it/db-torelli/_get_comuni.php?id0=%s&id1=0&_=%i",
    id,
    as.numeric(as.POSIXct(Sys.Date(), format = "%Y-%m-%d"))
  )
  nodes <- read_html(link) %>% html_nodes('option:not([value="0"])')

  df <- data.frame(
    id0 = id,
    Comune = nodes %>% html_text(trim = TRUE),
    id3 = nodes %>% html_attr("value")
  )
  return(df)
}

get_data <- function(prov_id, com_id) {
  # XHR endpoint behind the CERCA button; id2 = province, id3 = comune
  link <- sprintf(
    "https://www.solferinoesanmartino.it/db-torelli/_get_soldati.php?id0=1&id1=&id2=%s&id3=%s&_=%i",
    prov_id,
    com_id,
    as.numeric(as.POSIXct(Sys.Date(), format = "%Y-%m-%d"))
  )
  page <- read_html(link)

  df <- data.frame(
    cognome = page %>% html_nodes(".listing_name") %>% html_text(trim = TRUE),
    livello = page %>% html_nodes(".listing_level") %>% html_text(trim = TRUE),
    id3 = com_id, # for a later join back on comune
    id0 = prov_id
  )
  Sys.sleep(.25) # pause 0.25 sec between requests
  return(df)
}

get_chunks <- function(df, chunk_size) { # adapted from @BenBolker https://stackoverflow.com/a/7060331
  n <- nrow(df)
  r <- rep(1:ceiling(n / chunk_size), each = chunk_size)[1:n]
  d <- split(df, r)
  return(d)
}

write_rows <- function(df, filename) {
  flag <- file.exists(filename)
  df2 <- purrr::map2_dfr(df$id0, df$id3, .f = get_data)

  write.table(df2,
    file = filename, sep = ",",
    append = flag, # append after the first chunk
    quote = FALSE, col.names = !flag, # headers only on the first write
    row.names = FALSE
  )
  Sys.sleep(60 * 10) # 10 minute pause between chunks
}

provincias <- get_provincias("https://www.solferinoesanmartino.it/progetto-torelli/progetto-torelli-risultati")

Sys.sleep(60*5)

comunes <- map_df(provincias$id0, get_comunes) %>% filter(Comune != "-")

Sys.sleep(60*10)

combined <- dplyr::right_join(provincias, comunes, by = "id0")

Sys.sleep(60*10)

chunked <- get_chunks(combined, 2000) # https://stackoverflow.com/questions/7060272/split-up-a-dataframe-by-number-of-rows

filename <- "prov_com_cog_liv.csv"

map(chunked, ~ write_rows(.x, filename))

## #### test case #####################

# df <- chunked[[6]]
# 
#   flag <- file.exists(filename)
#   
#   df2 <- map2_dfr(df$id0, df$id3, .f = get_data)
#   
#   write.table(df2,
#     file = filename, sep = ",",
#     append = flag,
#     quote = F, col.names = !flag,
#     row.names = F
#   )
####################################

results <- read.csv(filename)

final <- dplyr::right_join(combined, results, by = "id3")
  • Thank you so much for your efforts, @QHarr! I haven't had time to look into it yet, but will do so asap! – natalieee Jun 15 '21 at 14:18
  • Sorry for my late reply! Thanks for your great help, that's almost what I'd like to get - a table showing provincia, comune, name of the person (cognome in Italian) `".listing_name"` and level of the person `".listing_level"`. I can see in your code that you have already included something about printing the name of the person `page %>% html_nodes(".listing_name") %>% html_text()`. I tried it out for a certain prov(ince ID) and com(une ID) and it worked perfectly fine. Could you give me a hint how to loop through all the com(une IDs) and return the respective names and levels of the person? – natalieee Jun 19 '21 at 08:50
  • Ok.... so I retrieved all results but am now investigating further what was returned. – QHarr Jun 21 '21 at 21:34
  • What do you want doing with some odd values e.g. CHIETI where comune is blank; COSENZA, LECCE and TORINO where comune is `…` ? They do return results as the ids are not missing. – QHarr Jun 22 '21 at 05:32
  • Thanks for getting back to me! I'm fine with comune being blank or `…` because they have IDs assigned, as you also said. – natalieee Jun 22 '21 at 08:50
  • I may need to revisit the exclusion I applied then with `filter(Comune != "-")` - currently I have 229,964 results with that exclusion. – QHarr Jun 22 '21 at 12:31
  • That sounds good, even with the filter! Could you tell me how you managed to retrieve the names and levels of the people from the comunes? – natalieee Jun 26 '21 at 08:30
  • Sorry for the delay. I wanted to give a nice clean run end to end version but am now tired and have so much work to do. Hope the above helps. It shows how to retrieve the additional info and gives ideas for further development. I did manage to retrieve all results but it does take some time, for the reasons mentioned above. – QHarr Jun 30 '21 at 02:27
  • Thank you so much for your help! I made some little adjustments and it works very well now! – natalieee Jul 03 '21 at 12:32
  • You are most welcome. It is an interesting problem in terms of timings! – QHarr Jul 03 '21 at 14:16