Data directory is disorganized

Open lovettse opened this issue 6 years ago • 2 comments

Expected behavior

- Clean organization of input/output data, separated by sample (and preferably by project)
- Separation of input/output data from reference data

Actual behavior

Unstructured data directory containing all input and output data

The data directory makes organization difficult and would quickly become extremely cluttered with regular use. One possible structure that would help:

data/ref
data/PROJECT/seq/SAMPLE
data/PROJECT/analysis/SAMPLE

This is related to issue #1 in that the requirement to run all analyses from the same location exacerbates this problem by forcing every analysis for every sample to land in the same place.
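
As a rough illustration of the layout proposed above, the paths for one sample could be derived like this. This is only a sketch: the helper name `sample_dirs`, the `root` default, and the example project and sample names are hypothetical and are not part of metscale.

```python
from pathlib import PurePosixPath


def sample_dirs(project, sample, root="data"):
    """Build the proposed per-project, per-sample directory paths.

    Returns the shared reference directory plus the sequence and
    analysis directories for one sample. Nothing is created on disk.
    """
    base = PurePosixPath(root)
    return {
        "ref": base / "ref",
        "seq": base / project / "seq" / sample,
        "analysis": base / project / "analysis" / sample,
    }


# Hypothetical project and sample names, for illustration only.
dirs = sample_dirs("projA", "S1")
print(dirs["seq"])
print(dirs["analysis"])
```

Keeping the reference directory outside any project means it can be shared, while each sample's inputs and outputs stay isolated.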

lovettse, Mar 27 '19 19:03

When a rule is executed, Snakemake automatically creates any output directories that do not exist yet. To organize the output, just give each rule its own sub-directory in its output paths.
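
In plain Python, the effect is roughly what the sketch below does by hand. It is not Snakemake's actual implementation; the function name `ensure_parent_dirs` and the example path are made up for illustration.

```python
import os
import tempfile


def ensure_parent_dirs(output_paths):
    """Create the parent directory of every declared output file, if missing."""
    for path in output_paths:
        parent = os.path.dirname(path)
        if parent:  # a bare file name has no parent directory to create
            os.makedirs(parent, exist_ok=True)


# Demonstrate inside a throwaway working directory.
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "S1", "analysis", "spades.30", "asm.fasta")
    ensure_parent_dirs([target])
    created = os.path.isdir(os.path.dirname(target))

print(created)
```

Because Snakemake handles this step before the job runs, a rule only has to declare where its files should go.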

For example, I subsample the input reads at different coverages and run SPAdes on each to see which coverage works best. In the rule below, Snakemake automatically creates a separate output sub-directory for each sample and coverage combination.

import os
import shutil

rule spades:
    input:
        read1='{sample}/seq/filtered_{sample}_R1.fastq.gz.cov{cov}.fastq',
        read2='{sample}/seq/filtered_{sample}_R2.fastq.gz.cov{cov}.fastq'
    output:
        '{sample}/analysis/spades.{cov}/asm.fasta'
    message: "Running spades "
    run:
        # SPAdes writes into a directory, so derive it from the declared output file
        out_dir = os.path.dirname(output[0])
        shell("spades.py -t {config[threads]} -m {config[memory]} -1 {input.read1} -2 {input.read2}  -o {out_dir}")
        # Copy the scaffolds to the file name declared as the rule's output
        shutil.copyfile(out_dir + "/scaffolds.fasta", output[0])

P.S. This was written before Snakemake 5.2, which supports declaring a directory as an output (via the directory() flag). As of Snakemake 5.2 you wouldn't need the os.path.dirname call.
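
To see which output files the one-sub-directory-per-sample-and-coverage scheme produces, the target paths can be enumerated in plain Python. This is a sketch only: the function name `spades_targets`, the sample names, and the coverage values are hypothetical. Inside a Snakefile, Snakemake's `expand` helper is the usual way to build such a list.

```python
from itertools import product


def spades_targets(samples, coverages):
    """List one assembly output path per (sample, coverage) combination."""
    return [
        f"{sample}/analysis/spades.{cov}/asm.fasta"
        for sample, cov in product(samples, coverages)
    ]


# Hypothetical samples and coverages, for illustration only.
targets = spades_targets(["S1", "S2"], [30, 50])
for path in targets:
    print(path)
```

Requesting these paths as targets is what causes the spades rule to run once per combination, each in its own directory.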

dsommer, Mar 29 '19 21:03

To extend this example further, below is the rule that follows the spades rule. The coverage_eval_filter rule doesn't need to know the exact name of the spades output directory, thanks to Snakemake's wildcards: it dynamically matches '{assembler}.{cov}' to the correct output directory. Hopefully some of this helps with managing your output organization.

rule coverage_eval_filter:
    input:
        asm='{sample}/analysis/{assembler}.{cov}/asm.fasta',
        read1='{sample}/seq/filtered_{sample}_R1.fastq.gz.cov{cov}.fastq',
        read2='{sample}/seq/filtered_{sample}_R2.fastq.gz.cov{cov}.fastq'
    params:
        path="{sample}/analysis/{assembler}.{cov}/"
    output:
        '{sample}/analysis/{assembler}.{cov}/score',
        '{sample}/analysis/{assembler}.{cov}/asm.fasta.bam'
    shell:
        'coverage_eval.sh {input} {params.path} {config[threads]}'

rule spades:
    input:
        read1='{sample}/seq/filtered_{sample}_R1.fastq.gz.cov{cov}.fastq',
        read2='{sample}/seq/filtered_{sample}_R2.fastq.gz.cov{cov}.fastq'
    output:
        '{sample}/analysis/spades.{cov}/asm.fasta'
    message: "Running spades "
    run:
        out_dir = os.path.dirname(output[0])
        shell("spades.py -t {config[threads]} -m {config[memory]} -1 {input.read1} -2 {input.read2}  -o {out_dir}")
        shutil.copyfile(out_dir + "/scaffolds.fasta", output[0])
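
The wildcard matching described above can be approximated in plain Python. The sketch below is a simplified illustration, not Snakemake's real code. It assumes each wildcard matches the regular expression `.+` (Snakemake's documented default) and that a repeated wildcard must take the same value each time. The function names and the example path are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Turn a '{name}'-style path pattern into a compiled regular expression."""
    # re.split with a capture group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    seen = set()
    pieces = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pieces.append(re.escape(part))       # literal path text
        elif part in seen:
            pieces.append(f"(?P={part})")        # repeated wildcard: same value
        else:
            seen.add(part)
            pieces.append(f"(?P<{part}>.+)")     # first use: capture the value
    return re.compile("".join(pieces))


def match_wildcards(pattern, path):
    """Return the wildcard values if path fits pattern, else None."""
    m = pattern_to_regex(pattern).fullmatch(path)
    return m.groupdict() if m else None


asm_pattern = "{sample}/analysis/{assembler}.{cov}/asm.fasta"
print(match_wildcards(asm_pattern, "S1/analysis/spades.30/asm.fasta"))
```

Here the path written by the spades rule fits the input pattern of coverage_eval_filter with assembler set to spades, which is why the downstream rule never has to name the spades directory explicitly.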

dsommer, Apr 01 '19 17:04