
What's the easiest way to read text from a printed data.frame into a data.frame when there are string values containing spaces that interfere with read.table? For instance, this data.frame excerpt does not pose a problem:

     candname party elecVotes
1 BarackObama     D       365
2  JohnMcCain     R       173

I can paste it into a read.table call without a problem:

dat <- read.table(text = "     candname party elecVotes
1 BarackObama     D       365
2  JohnMcCain     R       173", header = TRUE)

But if the data has strings with spaces like this:

      candname party elecVotes
1 Barack Obama     D       365
2  John McCain     R       173

Then read.table throws an error as it interprets "Barack" and "Obama" as two separate variables.
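To make the failure concrete, here is a minimal sketch that reproduces it. The names `txt`, `result`, and `field_counts` are only illustrative; the error is caught with `tryCatch` so the script keeps running, and `count.fields` shows how many whitespace-separated fields each line is split into.

```r
txt <- "      candname party elecVotes
1 Barack Obama     D       365
2  John McCain     R       173"

# read.table fails here; capture the error message instead of stopping
result <- tryCatch(
  read.table(text = txt, header = TRUE),
  error = function(e) conditionMessage(e)
)
print(result)

# Fields per line under whitespace splitting: the header line has fewer
# fields than the data lines, because each name is split in two
field_counts <- count.fields(textConnection(txt))
print(field_counts)
```

The mismatch between the header's field count and the data rows' field count is what read.table cannot reconcile.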

Sam Firke
  • Does your data have a different delimiter than a space, perhaps a tab? Is your data fixed-width? There needs to be some structure to your data in order for R to read it properly. – MrFlick May 28 '15 at 05:00
  • I'm interested in handling snippets of data posted on SO, like this: http://stackoverflow.com/questions/30494359/subset-based-on-repeated-values-in-row-and-conditional-in-column-in-r/ Or others like it, with or without line numbers. I may have asked this too narrowly with my toy example. – Sam Firke May 28 '15 at 13:19
  • If you edit that question, you'll see that the data actually contains tabs, but when the question is "rendered" as HTML, the tabs are converted to spaces. "Good" data should have a proper delimiter, and data should be shared in a [reproducible format](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Trying to work around bad formatting is a fool's errand because there will always be ways to break it. – MrFlick May 28 '15 at 18:34
  • Ah I see, thanks. My example here has been addressed and I've accepted the answer, but my greater interest is in using data like that. How would you get that linked question's data into R - click edit, then copy & paste the tab-delimited data into `read.table`? Perhaps I should post this as a new question. – Sam Firke May 28 '15 at 19:35
  • Well, I would ask the OP to share the data in a more [reproducible format](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). A `dput()` of a data.frame is much easier to import into R. It also preserves the column classes that the user created. Otherwise you can edit the question and attempt to copy the data. Other than that, you'll need to start making strong (dangerous) assumptions about the data in order to import it. – MrFlick May 28 '15 at 19:44
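
As a rough illustration of the `dput()` suggestion in the comments above, the sketch below round-trips a small data.frame through its `dput()` text. The object names `dat`, `txt`, and `dat2` are made up for the example; in practice the poster would paste the `dput()` output into the question and the reader would assign it directly.

```r
dat <- data.frame(
  candname  = c("Barack Obama", "John McCain"),
  party     = c("D", "R"),
  elecVotes = c(365L, 173L),
  stringsAsFactors = FALSE
)

# The text a poster would paste into a question
txt <- paste(capture.output(dput(dat)), collapse = "\n")
cat(txt, "\n")

# The reader rebuilds the object by evaluating that text
dat2 <- eval(parse(text = txt))
str(dat2)
```

Because the text is R code rather than a printed table, spaces inside strings cause no ambiguity and the column classes survive the trip.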

1 Answer


Read the file into L, remove the row numbers and use sub with the indicated regular expression to insert commas between the remaining fields. (Note that "\\d" matches any digit and "\\S" matches any non-whitespace character.) Now re-read it using read.csv:

Lines <- "      candname party elecVotes
1 Barack Obama     D       365
2  John McCain     R       173"

# L <- readLines("myfile")  # read file; for demonstration use next line instead
L <- readLines(textConnection(Lines))

L2 <- sub("^ *\\d+ *", "", L)  # remove row numbers
read.csv(text = sub("^ *(.*\\S) +(\\S+) +(\\S+)$", "\\1,\\2,\\3", L2), as.is = TRUE)

giving:

      candname party elecVotes
1 Barack Obama     D       365
2  John McCain     R       173
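
If this comes up often, the steps above could be wrapped in a small helper. This is only a sketch: `read_printed_df` is a hypothetical name, not part of any package. It assumes exactly three columns, that only the first column may contain spaces, and that no value contains a comma. The `trimws` call is an extra precaution not in the code above.

```r
# Hypothetical helper: parse a printed three-column data.frame whose
# first column may contain spaces (assumption: no commas in any value)
read_printed_df <- function(text) {
  L <- readLines(textConnection(text))
  L2 <- trimws(sub("^ *\\d+ *", "", L))  # drop row numbers and outer spaces
  csv <- sub("^ *(.*\\S) +(\\S+) +(\\S+)$", "\\1,\\2,\\3", L2)
  read.csv(text = csv, as.is = TRUE)
}

Lines <- "      candname party elecVotes
1 Barack Obama     D       365
2  John McCain     R       173"

dat <- read_printed_df(Lines)
str(dat)
```

A table with a different number of columns, or with spaces in a column other than the first, would need a different pattern.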

Here is a visualization of the regular expression:

^ *(.*\S) +(\S+) +(\S+)$

[Figure: regular expression visualization (Debuggex Demo)]
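
To see what each capture group picks up, the pattern can be applied to a single line with `regexec` and `regmatches`. This is a small sketch; the variable names are illustrative and the input is one data row with the row number already removed.

```r
line <- "Barack Obama     D       365"
pattern <- "^ *(.*\\S) +(\\S+) +(\\S+)$"

# regexec returns the full match plus one entry per capture group
m <- regmatches(line, regexec(pattern, line))[[1]]
groups <- m[-1]  # drop the full match, keep the three groups
print(groups)
# Group 1 is the name (spaces allowed), groups 2 and 3 are the
# last two space-free fields: "Barack Obama", "D", "365"
```

The last two groups can only match the final two space-free tokens, so everything before them, including internal spaces, ends up in the first group.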

G. Grothendieck