I write many R scripts producing output that is later used as input to another R script. For example, I may write script a.R, save the resulting table as an a-result.rda object, write script b.R, import the a-result.rda object, modify the table, maybe plot some graphs, save the new table as table.csv, and so on.
I find it cumbersome to identify the R script (here a.R) from which a particular table.csv (indirectly) originated.
So far, my solution is a shortcut function in my .bashrc that allows me to pinpoint the immediate parent of an output object:
findinR(){ find . -iname '*.R' -exec grep -HF -- "$1" {} + ; }
If I am looking for the origin of table.csv, I can do findinR table.csv, and as output I will get the path of the b.R script, along with the corresponding write.csv(.. line. Then, in the b.R script I notice that table.csv is a modified version of a-result.rda, so I do findinR a-result.rda and I find script a.R. This works for 'shallow' I/O dependencies, but becomes extremely tedious when dealing with more layers.
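The manual back-tracing described above could be partly automated along the following lines. This is only a rough sketch under strong assumptions: the function name trace_origin is hypothetical, file names must appear as quoted literals on the same line as the call that reads or writes them, and "writer" and "reader" calls are guessed from their names (save, saveRDS, write.*, load, readRDS, read.*). File names built with paste() will not be found.

```shell
# Hypothetical helper: walk backwards from an output file to the scripts
# that appear to write it, then to the inputs those scripts appear to read.
trace_origin() {
  local target="$1" depth="${2:-0}" script input
  [ "$depth" -gt 10 ] && return 0          # crude guard against cycles
  find . -iname '*.R' -print | sort | while IFS= read -r script; do
    # Keep only scripts where a writer call mentions the target on the same line
    grep -F -- "$target" "$script" |
      grep -qE '(save|saveRDS|write\.[[:alnum:]]+)[[:space:]]*\(' || continue
    # Indent by depth; "X <- script" reads as "X appears to be written by script"
    printf '%*s%s <- %s\n' "$((depth * 2))" '' "$target" "$script"
    # Quoted file names on reader-call lines are treated as that script's inputs
    grep -E '(load|readRDS|read\.[[:alnum:]]+)[[:space:]]*\(' "$script" |
      grep -oE "[\"'][^\"']+\.[[:alnum:]]+[\"']" | tr -d "\"'" | sort -u |
      while IFS= read -r input; do
        trace_origin "$input" "$((depth + 1))"
      done
  done
}
```

For the a.R / b.R example, running trace_origin table.csv in the project directory would print something like:

    table.csv <- ./b.R
      a-result.rda <- ./a.R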
Does anyone use or know of a text-based system or software that would allow me to produce (automatically, or at least semi-automatically) I/O flowcharts or pipelines, so that I can record the history of the files generated during an analysis?
EDIT: Some additional, potentially crucial details:
- I am not particularly interested in reporting tools or rerunning the entire analysis in one go (for smaller projects, I use knitr, but for more complex dependency-heavy workflows it's too cumbersome). I work with genomic data, which makes it impossible to rerun parts of the analysis.
- The input / output scripts/files have unstandardized names. I almost never repeat the same steps, so templates wouldn't help much.
- The final output can be anything, not only a .csv file, but most often it's .rds or .rdata/.rda.
- The only thing I need is a (semi-)automatic way to record the workflow, not necessarily rerun it with a different input.
EDIT2: I tried automatically generating txt files by grepping lines that include save(), load(), write.csv(), read.csv(), pdf(), etc. However, I often use paste to generate my file names, so the corresponding lines in the code are not descriptive enough to identify the file.
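A minimal sketch of such a grep-based listing is shown here, to make the limitation concrete. The function name list_r_io and the output file name io-calls.txt are hypothetical, and the set of calls searched for is just the ones named above plus saveRDS() and readRDS().

```shell
# Hypothetical helper: list every line in the R scripts under a directory
# that calls one of a few common I/O functions, as script:line-number:code.
list_r_io() {
  local dir="${1:-.}"   # directory to search; defaults to the current one
  find "$dir" -iname '*.R' -exec \
    grep -HnE '(^|[^[:alnum:]_.])(save|load|saveRDS|readRDS|write\.csv|read\.csv|pdf)[[:space:]]*\(' {} +
}
```

Running list_r_io . > io-calls.txt gives one line per I/O call. A line such as write.csv(x, paste(prefix, "csv", sep = ".")) is listed, but the actual file name cannot be read off it, which is exactly the problem with this approach.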