2

How do I list the data files in a folder and store their filenames without their extensions as factors in a dataframe? In other words: How do I create a character vector from a list of filenames omitting the '.csv' extension and store this vector as a list of factors in a dataframe after creating that dataframe from those files?

My ultimate goal is to store the names of the files containing my data as StudyIDs, as factors in a dataframe. I think this is an extremely simple task, but I have not discovered the formatting required for the regular expression, or whether some interaction between sapply and gsub changes the formatting.

Two folders, 'planned' and 'blurred', each contain files named 1.csv, 2.csv, etc., with sometimes non-sequential numbers. Specifically, I would like to obtain the factors "Blurred 1", "Planned 1", "Blurred 2", "Planned 2", etc., to label the data imported from these files by Study ID (number) and category (planned or blurred).

The code I've tried in RStudio 1.0.143, with a comment on what happens:

# Create a vector of the files to process
filenames <- list.files(path = '../Desktop/data/',full.names=TRUE,recursive=TRUE) 
# We parse the path to find the terminating filename which contains the StudyID.
FileEndings <- basename(filenames)
# We store this filename as the StudyID
regmatches('.csv',FileEndings,invert=TRUE) -> StudyID   # Error: ‘x’ and ‘m’ must have the same length
lapply(FileEndings,grep('.csv',invert=TRUE)) -> StudyID # Error: argument "x" is missing, with no default
sapply(FileEndings,grep,'.csv',invert=TRUE) -> StudyID; StudyID # Wrong: Gives named integer vector of 1's
sapply(FileEndings,grep,'.csv',invert=TRUE,USE.NAMES=FALSE) -> StudyID; StudyID # Wrong: Gives integer vector of 1's
sapply(FileEndings,gsub,'.csv',ignore.case=TRUE,invert=TRUE,USE.NAMES=FALSE) -> StudyID; StudyID # Error: unused argument (invert = TRUE)
sapply(FileEndings,gsub,'.csv','',ignore.case=TRUE,USE.NAMES=FALSE) -> StudyID; StudyID # Wrong: vector of ""
sapply(FileEndings,gsub,'[:alnum:].csv','[:alnum:]',ignore.case=TRUE,USE.NAMES=FALSE) -> StudyID; StudyID # Wrong: vector of "[:alnum:]"
sapply(FileEndings,gsub,'[[:alnum:]].csv','[[:alnum:]]',ignore.case=TRUE,USE.NAMES=FALSE) -> StudyID; StudyID # Wrong: vector of "[[:alnum:]]"
sapply(FileEndings,gsub,'[:alnum:]\.csv','[:alnum:]',ignore.case=TRUE,USE.NAMES=FALSE) -> StudyID; StudyID # Error: '\.' is an unrecognized escape

The documentation has not answered this question, and multiple webpages online provide overly simplistic examples that do not address this problem. I will continue searching, but I hope you will provide the solution to expedite this work and help future users. Thank you.

DBinJP
3 Answers

7

There is a built-in function in the tools package for this: file_path_sans_ext.
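As a rough sketch of how this could be combined with the folder name to get labels like "Blurred 1": the paths below are made-up stand-ins for the output of `list.files(..., full.names = TRUE, recursive = TRUE)`, and the names `ids`, `category`, `labels`, and `studies` are only illustrative.

```r
# Hypothetical paths; in practice these would come from list.files()
filenames <- c("../Desktop/data/blurred/1.csv",
               "../Desktop/data/planned/1.csv",
               "../Desktop/data/blurred/2.csv")

# Drop the directory part, then the extension: "1", "1", "2"
ids <- tools::file_path_sans_ext(basename(filenames))

# The immediate parent folder gives the category: "blurred", "planned", ...
category <- basename(dirname(filenames))

# Capitalise the first letter of the category and join it with the ID
labels <- paste(paste0(toupper(substring(category, 1, 1)),
                       substring(category, 2)),
                ids)

# Keep the full path for importing and store the label as a factor
studies <- data.frame(path = filenames,
                      StudyID = factor(labels, levels = unique(labels)),
                      stringsAsFactors = FALSE)
```

Using `tools::` avoids having to attach the package, and keeping `path` in the data frame means the full names are still available for whatever function imports the files.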

Hong Ooi
  • Thank you for showing me [this function](https://stat.ethz.ch/R-manual/R-devel/library/tools/html/fileutils.html). It does indeed eliminate the need for sapply and gsub, or using regular expressions to determine the name of the file. I had to [load the package](https://stat.ethz.ch/R-manual/R-devel/library/base/html/library.html) 'tools': Is there a better way to load multiple packages than library(c(tidyr,tools))? – DBinJP Jul 12 '17 at 06:40
  • @DBinJP FWIW, you could also do `tools::file_path_sans_ext`. – lukeA Jul 12 '17 at 07:03
  • Just to second @lukeA: Try to avoid cluttering your namespace by loading many packages with `library()`. This may lead to name conflicts, e.g., see the warnings when running `library(tidyverse)`. – Uwe Jul 12 '17 at 07:53
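
On the side question about loading several packages at once: `library()` attaches one package per call, so `library(c(tidyr,tools))` does not work. If attaching really is wanted, one common idiom is to loop over the package names with `character.only = TRUE`. A minimal sketch, where the names in `pkgs` are just examples of packages that ship with R:

```r
# Example package names; replace with the ones actually needed
pkgs <- c("tools", "stats")

# Attach each package in turn; invisible() hides the list lapply returns
invisible(lapply(pkgs, library, character.only = TRUE))
```

The `pkg::function` form from the comments above remains the lighter option when only one or two functions are needed.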
1

I think you missed the $ in your regex for specifically replacing the file ending. What about

gsub(filenames, pattern="\\.csv$", replacement="")  # dot escaped so it matches a literal "."

This should truncate the file ending.

If you want to get rid of the path, too, then you could do a similar substitution for the path:

gsub(filenames, pattern="^.*AAPM2017//", replacement="")
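
A small sketch of the anchored substitution on made-up paths (the values in `filenames` are assumptions, not the asker's files). It also illustrates why the `sapply(FileEndings, gsub, '.csv', '')` attempt in the question returns `""`: `sapply` hands each element to `gsub` as its first argument, which is `pattern`, so the filename becomes the pattern, `'.csv'` the replacement, and `''` the string being searched. Since `gsub` and `sub` are already vectorised over `x`, no `sapply` is needed.

```r
# Hypothetical paths standing in for the output of list.files()
filenames <- c("../Desktop/data/blurred/1.csv",
               "../Desktop/data/planned/12.csv")
file_endings <- basename(filenames)

# Vectorised call: the dot is escaped and $ anchors the match to the end
study_id <- sub("\\.csv$", "", file_endings, ignore.case = TRUE)

# What the sapply attempt effectively runs for the element "1.csv":
# pattern = "1.csv", replacement = ".csv", x = ""
sapply_equivalent <- gsub("1.csv", ".csv", "", ignore.case = TRUE)
```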
mondano
0

If you intend to use basename, you might as well leave out the full.names argument from list.files (it is FALSE by default). I'm not entirely clear on your question, but does the following code help?

filenames <- list.files(path = 'DIRECTORY/',recursive=TRUE) 
csvfiles <- filenames[grep(".csv", filenames)] # grep to find pattern matches
finalnames <- sub("(.*)\\.csv","",csvfiles) # sub to replace the pattern
Evan Friedland
  • I edited the post to explicitly ask the question. The full.names argument is needed to pass the filenames character vector to a function used to import the data in those files. The code you've given results in a character vector with only "" for each element. – DBinJP Jul 12 '17 at 05:32
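
For what it's worth, the empty strings come from the replacement argument: the pattern `(.*)\\.csv` matches the whole filename, and replacing that match with `""` leaves nothing. A possible fix, sketched here on made-up names (`csvfiles` below is hypothetical), is to put the captured group back with `"\\1"`:

```r
# Hypothetical relative paths, as list.files(recursive = TRUE) might return
csvfiles <- c("blurred/1.csv", "planned/2.csv")

# Original call: the entire match is replaced by "", so nothing is left
empty <- sub("(.*)\\.csv", "", csvfiles)

# Backreference \\1 keeps whatever the parentheses captured
kept <- sub("(.*)\\.csv$", "\\1", csvfiles)
```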