Snakemake workflow fails on completion with temporary files

Open ElDeveloper opened this issue 3 years ago • 0 comments

Describe the Bug

Running a Snakemake workflow with temporary output files makes the whole run fail. Snakemake deletes temporary files while the workflow is running, so by the time the S3 copy step happens those files are no longer there.

Steps to Reproduce

  • Run a Snakemake workflow in which multiple jobs declare temp() outputs.

Relevant Logs

For example see this output for a workflow I am running:

Wed, 11 May 2022 09:28:39 -0700	unlocking
Wed, 11 May 2022 09:28:39 -0700	removing lock
Wed, 11 May 2022 09:28:39 -0700	removing lock
Wed, 11 May 2022 09:28:39 -0700	removed all locks
Wed, 11 May 2022 09:28:39 -0700	Snakmake outputs are:
Wed, 11 May 2022 09:28:40 -0700	Building DAG of jobs...
Wed, 11 May 2022 09:28:41 -0700	Updating job all.
Wed, 11 May 2022 09:28:44 -0700	output_file	date	rule	version	log-file(s)	status	plan
Wed, 11 May 2022 09:28:44 -0700	classified/strain.tsv	Wed May 11 16:28:29 2022	tabulate	-		ok	no update
Wed, 11 May 2022 09:28:44 -0700	classified/species.tsv	Wed May 11 16:28:29 2022	tabulate	-		ok	no update
Wed, 11 May 2022 09:28:44 -0700	classified/genus.tsv	Wed May 11 16:28:29 2022	tabulate	-		ok	no update
Wed, 11 May 2022 09:28:44 -0700	classified/family.tsv	Wed May 11 16:28:29 2022	tabulate	-		ok	no update
Wed, 11 May 2022 09:28:44 -0700	qc/Exp01DRfullBTxA_GAACTGAGCG-CGCTCCACGA_L001_R1_001.fastq.gz	-	qc	-	logs/qc/Exp01DRfullBTxA_GAACTGAGCG-CGCTCCACGA_L001	missing	no update
Wed, 11 May 2022 09:28:44 -0700	qc/Exp01DRfullBTxA_GAACTGAGCG-CGCTCCACGA_L001_R2_001.fastq.gz	-	qc	-	logs/qc/Exp01DRfullBTxA_GAACTGAGCG-CGCTCCACGA_L001	missing	no update

[...]

Wed, 11 May 2022 09:28:45 -0700	copying outputs efs with s3
Wed, 11 May 2022 09:28:45 -0700	attempt 1 at copying classified/strain.tsv to s3://[redacted]/strain.tsv
upload: classified/strain.tsv to s3://[redacted]/strain.tsv
Wed, 11 May 2022 09:28:46 -0700	attempt 1 at copying classified/species.tsv to s3://[redacted]/species.tsv
upload: classified/species.tsv to s3://[redacted]/species.tsv
Wed, 11 May 2022 09:28:47 -0700	attempt 1 at copying classified/genus.tsv to s3://[redacted]/genus.tsv
upload: classified/genus.tsv to s3://[redacted]/genus.tsv
Wed, 11 May 2022 09:28:48 -0700	attempt 1 at copying classified/family.tsv to s3://[redacted]/family.tsv
upload: classified/family.tsv to s3://[redacted]/family.tsv
Wed, 11 May 2022 09:28:49 -0700	attempt 1 at copying qc/Exp01DRfullBTxA_GAACTGAGCG-CGCTCCACGA_L001_R1_001.fastq.gz to s3://[redacted]/Exp01DRfullBTxA_GAACTGAGCG-CGCTCCACGA_L001_R1_001.fastq.gz
Wed, 11 May 2022 09:28:49 -0700	The user-provided path qc/Exp01DRfullBTxA_GAACTGAGCG-CGCTCCACGA_L001_R1_001.fastq.gz does not exist.
Wed, 11 May 2022 09:28:49 -0700	=== Running Cleanup ===
Wed, 11 May 2022 09:28:49 -0700	=== Bye! ===

In my Snakemake file I have a rule that marks qc outputs as temporary, for example:

    output:
        qcfwd=temp("qc/{sample}_R1_001.fastq.gz"),
        qcrev=temp("qc/{sample}_R2_001.fastq.gz"),

While the workflow is running, Snakemake reports that these files are deleted, for example:

Wed, 11 May 2022 08:49:49 -0700	Removing temporary output file qc/Exp01DRfullBTxA_GAACTGAGCG-CGCTCCACGA_L001_R1_001.fastq.gz.

Expected Behavior

The copy step should skip temporary files. In the context of Snakemake, those are files that we can "live without". Besides, I wouldn't want these intermediate files copied back to my S3 bucket.
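One way the copy step could decide what to skip is to read the status column of the summary table shown in the logs above (it looks like `snakemake --summary` output) and drop every row marked `missing`. This is only a rough sketch of that idea: the function name `copyable_outputs` and the sample path `qc/sample_R1_001.fastq.gz` are made up for illustration, and I am assuming the table is tab-separated with the header shown in the log.

```python
import csv
import io


def copyable_outputs(summary_text):
    """Return output paths whose status is not 'missing'.

    Assumes tab-separated text with the header seen in the logs:
    output_file, date, rule, version, log-file(s), status, plan.
    """
    reader = csv.DictReader(
        io.StringIO(summary_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [row["output_file"] for row in reader if row.get("status") != "missing"]


# Hypothetical two-row summary modeled on the log above.
summary = (
    "output_file\tdate\trule\tversion\tlog-file(s)\tstatus\tplan\n"
    "classified/strain.tsv\tWed May 11 16:28:29 2022\ttabulate\t-\t\tok\tno update\n"
    "qc/sample_R1_001.fastq.gz\t-\tqc\t-\tlogs/qc/sample\tmissing\tno update\n"
)

print(copyable_outputs(summary))  # ['classified/strain.tsv']
```

Filtering on status alone would also hide a non-temporary output that is missing for some other reason, so a real fix would probably want to log whatever it skips.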

Actual Behavior

snakemake.aws.sh tries to copy temporary files and fails when it can't find them.
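A more defensive alternative would be to check that each path still exists right before copying it, and to record the ones that are gone instead of aborting. The sketch below is in Python rather than the shell used by snakemake.aws.sh, and it is not the script's actual logic: `copy_existing` and the `upload` callable are hypothetical stand-ins for whatever performs the S3 copy.

```python
import os
import tempfile


def copy_existing(paths, upload):
    """Call upload(path) for each path that exists; collect the rest as skipped."""
    copied, skipped = [], []
    for path in paths:
        if os.path.exists(path):
            upload(path)
            copied.append(path)
        else:
            # Most likely a temp() output that Snakemake already removed.
            skipped.append(path)
    return copied, skipped


# Small demonstration with one real file and one that was never created.
with tempfile.TemporaryDirectory() as workdir:
    kept = os.path.join(workdir, "strain.tsv")
    open(kept, "w").close()
    gone = os.path.join(workdir, "qc_R1_001.fastq.gz")
    uploaded = []  # stands in for the S3 upload
    copied, skipped = copy_existing([kept, gone], uploaded.append)

print(len(copied), len(skipped))
```

With this approach the run would finish and the skipped list could be written to the log, so a missing file is still visible without failing the whole workflow.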

Screenshots

Additional Context

  • Operating System:
  • AGC Version: 1.4.0
  • Was AGC setup with a custom bucket: no
  • Was AGC setup with a custom VPC: no

ElDeveloper, May 11 '22 17:05