Have Snakemake recognize complete files upon relaunch

Question

I have created this Snakemake workflow. The pipeline works well; however, if any rule fails and I relaunch, Snakemake doesn't recognize all of the completed files. For instance, Sample A runs all the way through and creates every file required by rule all, but Sample B fails at rule AnnotateUMI. When I relaunch, Snakemake wants to rerun all jobs for both A and B instead of just B. What do I need to change to get this to work?

sampleIDs = ['A', 'B']
  
rule all:
    input:
        expand('PATH/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam', sampleID=sampleIDs),
        expand('PATH/bams/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam.bai', sampleID=sampleIDs),
        expand('/PATH/logfiles/{sampleID}_removed.txt', sampleID=sampleIDs)

# Some tools require unzipped fastqs
rule AnnotateUMI:
    # Modify each run
    input: 'PATH/{sampleID}_unisamp_L001_001.star_rg_added.sorted.dmark.bam'
    # Modify each run
    output: 'PATH/{sampleID}_L001_001.star_rg_added.sorted.dmark.bam.UMI.bam'
    # Modify each run
    params: 'PATH/{sampleID}_unisamp_L001_UMI.fastq.gz'
    threads: 36
    run:
         # Each user needs to set tool path
         shell('java -Xmx220g -jar PATH/fgbio-2.0.0.jar AnnotateBamWithUmis \
         -i {input} \
         -f {params} \
         -o {output}')


rule SortSam:
    input: rules.AnnotateUMI.output
    # Modify each run
    output: 'PATH/{sampleID}_Qsorted.MarkUMI.bam'
    params:
    threads: 32
    run:
         # Each user needs to set tool path
         shell('java -Xmx110g -jar PATH/picard.jar SortSam \
         INPUT={input} \
         OUTPUT={output} \
         SORT_ORDER=queryname')


rule MItag:
    input: rules.SortSam.output
    # Modify each run
    output: 'PATH/{sampleID}_Qsorted.MarkUMI.MQ.bam'
    params:
    threads: 32
    run:
         # Each user needs to set tool path
         shell('java -Xmx220g -jar PATH/fgbio-2.0.0.jar SetMateInformation \
         -i {input} \
         -o {output}')


rule GroupUMI:
    input: rules.MItag.output
    # Modify each run
    output: 'PATH/{sampleID}_grouped.Qsorted.MarkUMI.MQ.bam'
    params:
    threads: 32
    run:
         # Each user needs to set tool path
         shell('java -Xmx220g -jar PATH/fgbio-2.0.0.jar GroupReadsByUmi \
         -i {input} \
         -s adjacency \
         -e 1 \
         -m 20 \
         -o {output}')


rule ConcensusUMI:
    input: rules.GroupUMI.output
    # Modify each run
    output: 'PATH/{sampleID}_concensus.Qunsorted.MarkUMI.MQ.bam'
    params:
    threads: 32
    run:
         # Each user needs to set tool path
         shell('java -Xmx220g -jar PATH/fgbio-2.0.2.jar CallMolecularConsensusReads \
         --input={input} \
         --min-reads=1 \
         --output={output}')


rule STARmap:
    input: rules.ConcensusUMI.output
    # Modify each run
    output:
        log = 'PATH/{sampleID}_UMI_Concensus_Log.final.out',
        bam = 'PATH/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam'
    # Modify each run
    params: 'PATH/{sampleID}_UMI_Concensus_'
    threads: 32
    run:
         # Each user needs to set genome path
         shell('STAR \
         --runThreadN {threads} \
         --readFilesIn {input} \
         --readFilesType SAM PE \
         --readFilesCommand samtools view -h \
         --genomeDir PATH/STAR_hg19_v2.7.5c \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMunmapped Within \
         --limitBAMsortRAM 220000000000 \
         --outFileNamePrefix {params}')


rule Index:
    input: rules.STARmap.output.bam
    # Modify each run
    output: 'PATH/{sampleID}_UMI_Concensus_Aligned.sortedByCoord.out.bam.bai'
    params:
    threads: 32
    run:
         shell('samtools index {input}')


rule BamRemove:
    input:
        AnnotateUMI_BAM = rules.AnnotateUMI.output,
        # Modify each run and include in future version to delete
        #AnnotateUMI_BAI = 'PATH/{sampleID}_L001_001.star_rg_added.sorted.dmark.bam.UMI.bai',
        SortSam = rules.SortSam.output,
        MItag = rules.MItag.output,
        GroupUMI = rules.GroupUMI.output,
        ConcensusUMI = rules.ConcensusUMI.output,
        STARmap = rules.STARmap.output.bam,
        Index = rules.Index.output
    # Modify each run
    output: touch('PATH/logfiles/{sampleID}_removed.txt')
    threads: 32
    run:
        shell('rm {input.AnnotateUMI_BAM} {input.SortSam} {input.MItag} {input.GroupUMI} {input.ConcensusUMI}')

When this happens, could you try running the workflow with `-n` for a dry run and, if you have snakemake 7.7.0 or older, `-r` to get the reason for why it executes each rule? (From the docs it looks like 7.8.0 or higher should automatically give you the reason.) That might help you figure out why it reruns the other sample(s) too. — KeyboardCat, Sep 02 '22 at 06:06
@KeyboardCat, for the failed samples, BamRemove is reported as missing, so it makes sense to relaunch those. However, for samples whose BamRemove output already exists, the reason given is: Input files updated by another job. This is what causes the relaunch. How do I fix this? — Genetics, Sep 02 '22 at 13:58
Does it say which job? (And which version of Snakemake are you using, by the way? There have been some recent changes to rerun behavior.) — KeyboardCat, Sep 02 '22 at 14:03
@KeyboardCat, I am running Snakemake 7.7.0. The jobs named under "Input files updated by another job" are all the rules upstream of this rule. Their outputs are all deleted by this rule when it produces its own output, so essentially Snakemake wants to recreate the missing files. How do I get Snakemake to see that the BamRemove output exists and ignore the missing upstream outputs, so that it does not relaunch? — Genetics, Sep 02 '22 at 14:21
That *should* predate changes to rerun behavior, but others have come across a similar issue with 7.6.2 here: https://stackoverflow.com/q/72363539/15704972 Does downgrading solve the issue for you? — KeyboardCat, Sep 05 '22 at 06:06
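
One possible way around the behavior described in the comments, offered as a sketch rather than a confirmed fix for this workflow: BamRemove takes the intermediate BAMs as inputs and then deletes them, so on relaunch Snakemake sees missing inputs and schedules the upstream rules again. Snakemake's temp() flag is intended for this situation. Outputs wrapped in temp() are deleted by Snakemake itself once every job that consumes them has finished, and they are not expected to exist when the downstream outputs are already present. The fragment below assumes the BamRemove rule and its '_removed.txt' target in rule all are dropped, and it uses the shell: directive instead of run: with shell(...). Only the first two rules are shown; the paths and commands are copied from the question.

```snakemake
# Sketch only: intermediate BAMs are wrapped in temp() so that Snakemake
# removes them after all consumer jobs finish (assumption: the BamRemove
# rule and its '_removed.txt' target in rule all have been deleted).

rule AnnotateUMI:
    input: 'PATH/{sampleID}_unisamp_L001_001.star_rg_added.sorted.dmark.bam'
    # temp() marks this BAM as an intermediate file
    output: temp('PATH/{sampleID}_L001_001.star_rg_added.sorted.dmark.bam.UMI.bam')
    params: 'PATH/{sampleID}_unisamp_L001_UMI.fastq.gz'
    threads: 36
    shell:
        'java -Xmx220g -jar PATH/fgbio-2.0.0.jar AnnotateBamWithUmis '
        '-i {input} '
        '-f {params} '
        '-o {output}'


rule SortSam:
    # Input is written out as a plain path here to keep the sketch explicit
    input: 'PATH/{sampleID}_L001_001.star_rg_added.sorted.dmark.bam.UMI.bam'
    output: temp('PATH/{sampleID}_Qsorted.MarkUMI.bam')
    threads: 32
    shell:
        'java -Xmx110g -jar PATH/picard.jar SortSam '
        'INPUT={input} '
        'OUTPUT={output} '
        'SORT_ORDER=queryname'

# MItag, GroupUMI and ConcensusUMI would follow the same pattern.
# STARmap and Index keep plain (non-temp) outputs, because their
# BAM and BAI files are the final targets requested by rule all.
```

With this layout a dry run (`snakemake -n`) after a partial failure should list jobs only for the samples whose final BAM or index is missing, although that has not been verified against this particular pipeline or against Snakemake 7.7.0. If the intermediates are needed for debugging, the `--notemp` command-line flag tells Snakemake to keep them.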
