I'm trying to read just a couple of fields from each of a bunch of XML files. I wrote a little function that extracts the fields I need and returns them as a character vector:

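library(xml2)   # read_xml(), xml_text()
library(rvest)  # xml_node() (assumed source; it is an older rvest alias)
library(glue)   # glue(), used further down

id_dir <- function(d) {
  xml <- read_xml(d)
  id <- xml_text(xml_node(xml, 'AwardID'))
  dir <- xml_text(xml_node(xml, 'Abbreviation'))
  phone <- xml_text(xml_node(xml, 'PhoneNumber'))
  # Order of the result: id, phone, dir
  return(c(id, phone, dir))
}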

But when I wrap it with ldply the following happens:

setwd('xmls/2017')
files <- list.files()[1:100]
sev_data <- plyr::ldply(files, id_dir)

Error in read_xml.character(d) : xmlParseEntityRef: no name [68]

This happens despite the fact that the following code works as intended:

id_dir(glue('xmls/2017/{files[1]}'))

"1700003" "5746317432" "MPS"

I've tried poking around SO for quite a while now, but mostly I'm seeing people talking about PHP and stuff that is most likely irrelevant.

For reproducibility here are a couple of files I'm reading in.

Justin
[The error suggests that at least one of the XML files is invalid](https://stackoverflow.com/questions/7604436/xmlparseentityref-no-name-warnings-while-loading-xml-into-a-php-file), likely due to ampersand characters. – neilfws Feb 01 '21 at 02:18

1 Answer

Your function works as expected, which can be verified with the example files you shared.

id_dir('https://raw.githubusercontent.com/jdollman/stackoverflow/data/1700229.xml')
#[1] "1700229"    "8659743466" "MPS"

id_dir('https://raw.githubusercontent.com/jdollman/stackoverflow/data/1715157.xml')
#[1] "1715157"    "5705773510" "BIO"

So the issue is in how you are passing the files to id_dir. I don't use plyr since it was retired long ago in favour of dplyr. I would just use lapply here.

Another issue could be that your directory contains other files that are not XML. You can pass a pattern to list.files to select only the '.xml' files. Try:

setwd('xmls/2017')
# head() avoids NA entries when the directory holds fewer than 100 files
files <- head(list.files(pattern = '\\.xml$'), 100)
sev_data <- lapply(files, id_dir)
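
If one of the files really is malformed (for example, an unescaped ampersand, as the comment on the question suggests), the lapply call above will still stop at that file. Below is a rough sketch of one way to find the offending files, not a definitive fix. It assumes the xml2 package; read_awards is a hypothetical helper name, and id_dir is rewritten here with xml_find_first and XPath only so that the block is self-contained. The column names id, phone, dir and file are also my own choice.

```r
library(xml2)

# Same fields as id_dir() in the question, written with xml2 only.
# xml_find_first() returns a missing node when nothing matches, so
# xml_text() then yields NA rather than raising an error.
id_dir <- function(d) {
  xml <- read_xml(d)
  id <- xml_text(xml_find_first(xml, './/AwardID'))
  dir <- xml_text(xml_find_first(xml, './/Abbreviation'))
  phone <- xml_text(xml_find_first(xml, './/PhoneNumber'))
  c(id, phone, dir)
}

# Hypothetical helper: parse every .xml file under `path`, reporting
# parse failures instead of stopping at the first one.
read_awards <- function(path) {
  files <- list.files(path, pattern = '\\.xml$', full.names = TRUE)
  rows <- lapply(files, function(f) {
    tryCatch(
      id_dir(f),
      error = function(e) {
        message('Could not parse ', f, ': ', conditionMessage(e))
        rep(NA_character_, 3)  # placeholder row for the bad file
      }
    )
  })
  # One row per file, three columns (id, phone, dir).
  m <- matrix(as.character(unlist(rows)), ncol = 3, byrow = TRUE)
  out <- as.data.frame(m, stringsAsFactors = FALSE)
  names(out) <- c('id', 'phone', 'dir')
  out$file <- basename(files)
  out
}
```

Calling read_awards('xmls/2017') from the project root (without setwd) should then give one row per file. Rows where id is NA point at files that either failed to parse or have no AwardID element, and the message printed for each failure shows the parser error for that file.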
Ronak Shah