
I have a large XML file that I can't parse completely in R due to memory limits. I would like to extract just some specific columns. I found similar questions that were asked before:

How to read large (~20 GB) xml file in R?
Storing specific XML node values with R's xmlEventParse

I can't get it to work with my data, though: it runs, but no data is returned. I did try to adjust the suggested solutions to my XML, but it still does not work, which might be down to my lack of XML knowledge. Below is an example of my XML data, where cl, clssc, clp, clpssc and primclp are the columns. How can I extract only cl and clssc without parsing the whole document first? (A sketch of the pattern I have been trying to adapt follows the sample.)

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<abc:abc xmlns:abc="http://abc/abc" xsi:schemaLocation="http://abc/abc lala_20Q2.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <chcp>
    <cl>2000000</cl>
    <clssc>10934</clssc>
    <clp>200000</clp>
    <clpssc>10934</clpssc>
    <primclp>Y</primclp>
  </chcp>
  <chcp>
    <cl>2000000</cl>
    <clssc>10934</clssc>
    <clp>200000</clp>
    <clpssc>10934</clpssc>
    <primclp>Y</primclp>
  </chcp>
  <chcp>
    <cl>2000000</cl>
    <clssc>10934</clssc>
    <clp>2000000</clp>
    <clpssc>10934</clpssc>
    <primclp>Y</primclp>
  </chcp>
</abc:abc>
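
The suggested solutions I have been trying to adapt follow the branches pattern of XML::xmlEventParse, sketched here against the sample above (big.xml and the handler name are placeholders):

library(XML)

# accumulators for the two columns of interest
cl_vals    <- character()
clssc_vals <- character()

# called once per complete <chcp> subtree; only the two wanted children
# are stored, the rest of the document is never held in memory
chcp_handler <- function(node) {
  cl_vals    <<- c(cl_vals,    xmlValue(node[["cl"]]))
  clssc_vals <<- c(clssc_vals, xmlValue(node[["clssc"]]))
}

xmlEventParse("big.xml", handlers = list(), branches = list(chcp = chcp_handler))

data.frame(cl = as.numeric(cl_vals), clssc = as.numeric(clssc_vals))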
MLEN
  • It would help if you could provide your solution which did not work out, and we can have a look together and try to fix it. Importantly, the implementation must have streaming capabilities (e.g. XSLT 3.0) - but I guess the answers from the linked questions already fulfil this. – Aydin K. Oct 26 '21 at 11:25
  • By the way, why go the hard way (XML parsing of a huge XML)? A simple filtering step via grep should also do the trick (in case the XML is pretty printed, meaning it has newlines per element), or? – Aydin K. Oct 26 '21 at 11:27
  • @AydinK. What do you mean by "grep"? Indeed it's pretty printed, but with pattern matching I would still need to load the whole file into R, which is >20 GB. Or? – MLEN Oct 26 '21 at 11:58
  • It might be simpler to just [grep from bash](https://unix.stackexchange.com/questions/322280/grep-only-those-lines-with-active-active-whose-next-line-contains-speci) into another file, instead of specifically loading it all to R? – runr Oct 26 '21 at 12:07
  • @runr bash is something new for me, so it does not sound easier to me. I will take a look, though. – MLEN Oct 26 '21 at 12:11
  • If you're able to convert your XML to CSV, you would be able to use the chunking feature of `readr` exemplified in this [question/answer](https://stackoverflow.com/a/58606056/3073340). I do not believe that `xml2` or similar xml focused libraries for R have a similar feature yet... – sgdata Oct 28 '21 at 00:18
  • @sgoley I can't. What I am doing now is reading the XML in chunks using read_lines from readr, filtering what I need and then compiling everything. – MLEN Oct 28 '21 at 06:39
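
For reference, a minimal sketch of the chunked read_lines approach described in the last comment, assuming the XML stays pretty printed with one element per line (big.xml is a placeholder):

library(readr)
library(stringr)

kept <- list()
read_lines_chunked(
  "big.xml",
  callback = SideEffectChunkCallback$new(function(lines, pos) {
    # keep only the <cl> and <clssc> lines of each chunk
    kept[[length(kept) + 1]] <<- str_trim(str_subset(lines, "<cl>|<clssc>"))
  }),
  chunk_size = 100000
)
result <- unlist(kept) # filtered lines, ready for tag stripping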

2 Answers


The disk.frame package is made to handle medium-sized data. It enables batch conversion of data into .fst files, with speedy I/O through the fst package and fast data manipulation through data.table. Here, the dtplyr interface to the data.table package is used for the final wrangling.

Step 1: preparing the input file*

1.1 Create a folder and place your .xml file in there.

1.2 Remove the first two lines and the last line, so that you are left with a .xml file that has this structure (a way to script this from within R is sketched after the example):

  <chcp>
    <cl>2000000</cl>
    <clssc>10934</clssc>
    <clp>200000</clp>
    <clpssc>10934</clpssc>
    <primclp>Y</primclp>
  </chcp>
  <chcp>
    <cl>2000000</cl>
    <clssc>10934</clssc>
    <clp>200000</clp>
    <clpssc>10934</clpssc>
    <primclp>Y</primclp>
  </chcp>
  <chcp>
    <cl>2000000</cl>
    <clssc>10934</clssc>
    <clp>2000000</clp>
    <clpssc>10934</clpssc>
    <primclp>Y</primclp>
  </chcp>
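
If you prefer not to edit the file by hand, 1.2 can be scripted (a sketch with placeholder file names; note that readLines() holds the whole file in memory, so for a file of many gigabytes a command-line tool such as sed is the safer choice):

x <- readLines("b_raw.xml")                # the original export
writeLines(x[3:(length(x) - 1)], "b.xml")  # drop the first two lines and the last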

Step 2: set up disk.frame

library(tidyverse)
library(disk.frame)

path <- file.path(file.choose()) # path to your folder (contains the .xml; will receive the .fst)
setup_disk.frame(workers = 10) # adjust this to your machine
options(future.globals.maxSize = Inf)
old <- getOption("scipen") ; options(scipen = 100) # prevent scientific numbers later

Step 3: convert .xml to .fst

l <- csv_to_disk.frame( # works for .xml too
  paste0(path, "b.xml"), # replace b.xml with your file name (in the folder)
  outdir = paste0(path, "combined.df"),
  in_chunk_size = 7, # 7 lines, i.e. one <chcp> record, per chunk
  backend = "data.table", header = FALSE)

This gives an object l of class "disk.frame" "disk.frame.folder" in your R environment. You should now have a subfolder "combined.df" in your specified directory, containing a set of .fst files.
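
You can sanity-check the conversion before moving on (a quick look, using the object l from above):

nchunks(l)            # number of .fst chunks written
head(get_chunk(l, 1)) # peek at the first chunk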

Step 4: read in the required columns & tidy data

cbind(
  get_chunk(l, 1) %>%
    `[`(seq(2, nrow(.), 2)), # even rows of the first chunk
  get_chunk(l, 2) %>%
    `[`(seq(1, nrow(.), 2))  # odd rows of the second chunk
) %>%
  rename("cl" = 1, "clssc" = 2) %>%
  mutate(across(.fns = parse_number)) %>% # strip the XML tags, keep the numbers
  as_tibble() # omit this to keep a data.table

# A tibble: 3 x 2
       cl clssc
    <dbl> <dbl>
1 2000000 10934
2 2000000 10934
3 2000000 10934

Finally, don't forget to revert the scientific notation option with options(scipen = old).

*Note: Step 1 can likely be avoided by playing with the chunk sizes or through some manipulation of the .xml from within R. This I don't know how to do (yet).

Note 2: I recommend reading the disk.frame documentation carefully for tips on how to set it up properly for your machine.

Donald Seinen

On a Windows machine

Here is an approach that uses the output of the Windows findstr command to import data with data.table::fread(). It filters the data with the Windows counterpart of grep before it is loaded into R, so you will not run into memory problems nearly as soon.

Location of the XML: e:/testdata.xml

Further explanation is in the code comments below.

library(data.table)
# Import output from windows findstr-command
#  assumes location of data is e:/testdata.xml
#  !! use \\ in the path, else findstr does not understand it !!
DT <- data.table::fread(cmd = 'findstr "<clssc> <cl>" e:\\testdata.xml', 
                        sep = "\n", col.names = "line", header = FALSE )
#                    line
# 1:     <cl>2000000</cl>
# 2: <clssc>10934</clssc>
# 3:     <cl>2000000</cl>
# 4: <clssc>10934</clssc>
# 5:     <cl>2000000</cl>
# 6: <clssc>10934</clssc>

# Extract data from raw line
DT[, name  := gsub("^<(.+?)>.*$", "\\1", line)]
DT[, value := gsub("^.*>([0-9]+?)<.*$", "\\1", line)]
#                    line  name   value
# 1:     <cl>2000000</cl>    cl 2000000
# 2: <clssc>10934</clssc> clssc   10934
# 3:     <cl>2000000</cl>    cl 2000000
# 4: <clssc>10934</clssc> clssc   10934
# 5:     <cl>2000000</cl>    cl 2000000
# 6: <clssc>10934</clssc> clssc   10934

# Build some id's
DT[, id := rowid(name)]

# Cast to wide format
dcast(DT, id ~ name, value.var = "value")

#    id      cl clssc
# 1:  1 2000000 10934
# 2:  2 2000000 10934
# 3:  3 2000000 10934
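
Note that value is still character after the gsub() extraction. If you need numeric columns, convert after casting (a small addition, not part of the original pipeline):

wide <- dcast(DT, id ~ name, value.var = "value")
wide[, c("cl", "clssc") := lapply(.SD, as.numeric), .SDcols = c("cl", "clssc")]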

On a Unix machine

I cannot test this, since I only use Windows at this location. Replace the cmd = '...' part (the findstr command and its regex) with your system's grep command.
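
Something along these lines should work (untested from here; the path is a placeholder):

DT <- data.table::fread(cmd = "grep -E '<cl>|<clssc>' /path/to/testdata.xml",
                        sep = "\n", col.names = "line", header = FALSE)

The rest of the pipeline (the gsub() extraction, rowid() and dcast()) stays the same.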

Wimpel