add dragen module - v2

Open marrip opened this issue 1 year ago • 7 comments

PR checklist

Closes #4026

  • [ ] This comment contains a description of changes (with reason).
  • [ ] If you've fixed a bug or added code that should be tested, add tests!
  • [ ] If you've added a new tool - have you followed the module conventions in the contribution docs
  • [ ] If necessary, include test data in your PR.
  • [ ] Remove all TODO statements.
  • [ ] Emit the versions.yml file.
  • [ ] Follow the naming conventions.
  • [ ] Follow the parameters requirements.
  • [ ] Follow the input/output options guidelines.
  • [x] Add a resource label
  • [ ] Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • [ ] nf-core modules test <MODULE> --profile docker
      • [ ] nf-core modules test <MODULE> --profile singularity
      • [ ] nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • [ ] nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • [ ] nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • [ ] nf-core subworkflows test <SUBWORKFLOW> --profile conda

old PR: #4111

marrip • Aug 27 '24 12:08

This is only the first draft of the main.nf so far. I still have to add a proper stub section that covers a number of possible use cases. Tests and meta info are also still missing. I have some questions that I am not sure how to solve. Dragen accepts input in the form of list files (text files that contain paths to other files), which makes it difficult to handle the input properly. Here is the list of potentially problematic inputs:

| flag | description |
| --- | --- |
| build-sys-noise-vcfs-list | Text file containing the paths of normal VCFs. Specify the full VCF file paths. List one file per line. |
| cnv-normals-list | Specify text file that contains paths to the list of reference target counts files to be used as a panel of normals (new line separated). |
| config-file | Configuration file - could potentially set everything, no control |
| explify-sample-list | Path to .tsv file with sample names, associated FASTQs, etc. Columns: `SampleID BatchID RunID ControlFlag FastQs`. Example row: `MySample MyBatch MyRun POS /path/to/fastq1.gz /path/to/fastq1.gz` |
| fastq-list | Specifies CSV file that contains a list of FASTQ files to process. |
| fastq-list-sample-id | Specifies the sample ID for the list of FASTQ files specified by fastq-list. |
| ht-graph-vcf-list | Path to the text file containing the list of VCF files to be used to build the custom multigenome hash table |
| imputation-phase-input-list | Alternative to imputation-phase-input |
| imputation-phase-sample-type-list | Input file list of sample types where each line contains sample name NAME followed by TYPE. Required when imputing regions with ploidy that depends on this sample type. |
| input-batch-list | The path to a file containing a list of msVCF files to be merged, with the path to each file on a separate line. All the files listed must have been generated from the same global census file and all batches pertaining to that global census must be included in the merge. |
| input-census-list | File specifying list of census files for input (applicable when aggregate-censuses is true) |
| intermediate-results-dir | Specifies directory to store intermediate results in (eg, sort partitions). |
| ph-concat-all-input-list | PH concat all input msVCF file list (output files from phase rare step, in ascending position order), this option is exclusive with --ph-concat-all-input-list-sites-only |
| ph-concat-all-input-list-sites-only | Provides a .txt file with list of VCF containing all the haplotyped sites. The VCF files provided are the output files of Phase Rare step, in ascending position order, sex chromosomes at the end. |
| ph-ligate-common-input-list | PH ligate common input msVCF file list (output files from phase common step, in ascending position order) |
| ph-phase-common-input-list | PH phase common input msVCF file list |
| tumor-fastq-list | Inputs a CSV file containing a list of FASTQ files for the mapper, aligner, and somatic variant caller. |
| tumor-fastq-list-sample-id | Specifies the sample ID for the list of FASTQ files specified by tumor-fastq-list. |

One possible solution would be to provide the input files as a list, write the list file inside the module, and then supply it via the appropriate flag. Another would be to provide the input files together with a list that contains only the file names without paths, assuming dragen can handle relative file paths and the files are in the working directory. For some of the list flags there are alternative flags that allow specifying multiple files directly on the CLI, while for others the list file is the only way to supply input.
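
A minimal sketch of these two options, for illustration only and not part of the PR: the list file is generated from the input files and the matching flag is appended to the command. It is written in Python rather than Nextflow so it can be run on its own. The helper name `write_list_file`, the file names, and the `--flag value` argument form are assumptions; whether dragen resolves bare file names relative to the working directory is the open question raised above.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def write_list_file(
    inputs: Iterable[str],
    list_path: str,
    flag: str,
    basename_only: bool = False,
) -> List[str]:
    """Write one input path per line and return the CLI fragment for `flag`.

    With basename_only=True only the file names are written, which assumes
    the files are staged in the working directory and that the tool accepts
    relative paths (an assumption, not verified here).
    """
    lines = [Path(p).name if basename_only else str(p) for p in inputs]
    if not lines:
        raise ValueError(f"no input files given for --{flag}")
    # Newline-separated, one file per line, as the flag descriptions require.
    Path(list_path).write_text("\n".join(lines) + "\n")
    return [f"--{flag}", list_path]


if __name__ == "__main__":
    # Hypothetical panel-of-normals files; the paths are placeholders.
    normals = ["/data/n1.target.counts.gz", "/data/n2.target.counts.gz"]
    with tempfile.TemporaryDirectory() as workdir:
        list_file = str(Path(workdir) / "cnv_normals.txt")
        args = write_list_file(normals, list_file, "cnv-normals-list",
                               basename_only=True)
        print(" ".join(args))
        print(Path(list_file).read_text(), end="")
```

The same helper would cover the other newline-separated list flags in the table. The CSV and TSV inputs (fastq-list, tumor-fastq-list, explify-sample-list) carry extra columns and would need their own writers.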

fyi @asr081

marrip • Aug 27 '24 12:08

These look like very separate use-cases and I would say it would make sense to make individual submodules for each one.

SPPearce • Aug 27 '24 14:08

> These look like very separate use-cases and I would say it would make sense to make individual submodules for each one.

When I started working on this at the hackathon 2023 in Barcelona I had a long discussion with @FriederikeHanssen who argued for everything to be combined as dragen doesn't have subcommands. Dragen basically allows you to run most of these options together and I assume that users would like to be able to do so in the module as well. If we come to an agreement on this I would be happy to split it into submodules because the whole thing is very complex 😜

marrip • Aug 28 '24 06:08

> You have ~40 input channels, I just don't see how that is going to be usable or maintainable

I agree, it is difficult. But then we need to discuss where to draw the line for those use cases. There are different "pipelines" according to the documentation but even those are rather massive, especially the DNA one which might be the most interesting.

marrip • Aug 28 '24 07:08

> > You have ~40 input channels, I just don't see how that is going to be usable or maintainable
>
> I agree, it is difficult. But then we need to discuss where to draw the line for those use cases. There are different "pipelines" according to the documentation but even those are rather massive, especially the DNA one which might be the most interesting.

Maybe we focus on just supporting the most common pipelines? What do you want to use it for?

SPPearce • Aug 28 '24 07:08

> > > You have ~40 input channels, I just don't see how that is going to be usable or maintainable
> >
> > I agree, it is difficult. But then we need to discuss where to draw the line for those use cases. There are different "pipelines" according to the documentation but even those are rather massive, especially the DNA one which might be the most interesting.
>
> Maybe we focus on just supporting the most common pipelines? What do you want to use it for?

For us, the calling of small and structural variants as well as CNVs in germline and tumor might be most relevant. RNA might also be important.

marrip • Aug 28 '24 07:08

> When I started working on this at the hackathon 2023 in Barcelona I had a long discussion with @FriederikeHanssen who argued for everything to be combined as dragen doesn't have subcommands.

Yep, sorry, I underestimated the sheer number of things you could throw at it. I agree with Simon that it should be split up to keep it maintainable at all.

FriederikeHanssen • Aug 28 '24 14:08

@marrip Are you planning to work on this one?

We are doing the nf-core spring cleaning and are trying to close PRs that are not used anymore. They can be reopened in the future if needed.

luisas • Mar 11 '25 12:03

Yes, I am currently working on it. Sorry for the delay.

marrip • Mar 12 '25 07:03

@marrip No worries :)

Let us know if you need any help!

luisas • Mar 12 '25 07:03

Finally, the reduced version for germline: https://github.com/nf-core/modules/pull/8823

marrip • Jul 29 '25 06:07