add dragen module - v2
PR checklist
Closes #4026
- [ ] This comment contains a description of changes (with reason).
- [ ] If you've fixed a bug or added code that should be tested, add tests!
- [ ] If you've added a new tool - have you followed the module conventions in the contribution docs
- [ ] If necessary, include test data in your PR.
- [ ] Remove all TODO statements.
- [ ] Emit the `versions.yml` file.
- [ ] Follow the naming conventions.
- [ ] Follow the parameters requirements.
- [ ] Follow the input/output options guidelines.
- [x] Add a resource `label`
- [ ] Use BioConda and BioContainers if possible to fulfil software requirements.
- Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
  - For modules:
    - [ ] `nf-core modules test <MODULE> --profile docker`
    - [ ] `nf-core modules test <MODULE> --profile singularity`
    - [ ] `nf-core modules test <MODULE> --profile conda`
  - For subworkflows:
    - [ ] `nf-core subworkflows test <SUBWORKFLOW> --profile docker`
    - [ ] `nf-core subworkflows test <SUBWORKFLOW> --profile singularity`
    - [ ] `nf-core subworkflows test <SUBWORKFLOW> --profile conda`
old PR: #4111
This is only the first draft of the main.nf so far. I still have to add a proper stub section that covers a number of possible use cases. Tests and meta info are also still missing. I have some questions that I am not sure how to solve. Dragen accepts input in the form of files that list other files, which makes it difficult to handle the input properly. Here is the list of potentially problematic inputs:
| flag | description |
|---|---|
| build-sys-noise-vcfs-list | Text file containing the paths of normal VCFs. Specify the full VCF file paths. List one file per line. |
| cnv-normals-list | Specify text file that contains paths to the list of reference target counts files to be used as a panel of normals (new line separated). |
| config-file | Configuration file - could potentially set everything, no control |
| explify-sample-list | Path to .tsv file with sample names, associated FASTQs, etc. Columns: SampleID BatchID RunID ControlFlag FastQs. Example row: MySample MyBatch MyRun POS /path/to/fastq1.gz /path/to/fastq1.gz |
| fastq-list | Specifies CSV file that contains a list of FASTQ files to process. |
| fastq-list-sample-id | Specifies the sample ID for the list of FASTQ files specified by fastq-list. |
| ht-graph-vcf-list | Path to the text file containing the list of VCF files to be used to build the custom multigenome hash table |
| imputation-phase-input-list | Alternative to imputation-phase-input |
| imputation-phase-sample-type-list | Input file list of sample types where each line contains sample name NAME followed by TYPE. Required when imputing regions with ploidy that depends on this sample type. |
| input-batch-list | The path to a file containing a list of msVCF files to be merged, with the path to each file on a separate line. All the files listed must have been generated from the same global census file and all batches pertaining to that global census must be included in the merge. |
| input-census-list | File specifying list of census files for input (applicable when aggregate-censuses is true) |
| intermediate-results-dir | Specifies directory to store intermediate results in (eg, sort partitions). |
| ph-concat-all-input-list | PH concat all input msVCF file list (output files from phase rare step, in ascending position order), this option is exclusive with --ph-concat-all-input-list-sites-only |
| ph-concat-all-input-list-sites-only | Provides a .txt file with list of VCF containing all the haplotyped sites. The VCF files provided are the output files of Phase Rare step, in ascending position order, sex chromosomes at the end. |
| ph-ligate-common-input-list | PH ligate common input msVCF file list (output files from phase common step, in ascending position order) |
| ph-phase-common-input-list | PH phase common input msVCF file list |
| tumor-fastq-list | Inputs a CSV file containing a list of FASTQ files for the mapper, aligner, and somatic variant caller. |
| tumor-fastq-list-sample-id | Specifies the sample ID for the list of FASTQ files specified by tumor-fastq-list. |
Possible solutions could be to provide the input files as a list, write the list file inside the module, and then supply it with the appropriate flag. Another possibility would be to provide the input files together with a list file that contains only the file names without paths, assuming dragen can handle relative file paths and the files are in the working directory. For some of the list flags there are alternative flags that allow specifying multiple files directly on the command line, while for others the list file is the only way to supply input.
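To make the "write the list in the module" idea concrete, here is a minimal sketch in Python, for illustration only (a real module would do the equivalent in the Nextflow script block). The helper name `write_list_file` and the example file names are hypothetical. Whether dragen resolves bare file names relative to the working directory is an untested assumption, so the helper can also write the paths unchanged.

```python
import tempfile
from pathlib import Path


def write_list_file(flag, files, list_path, use_basenames=True):
    """Write one entry per line to list_path and return the CLI arguments.

    flag          -- list-style option name without leading dashes
    files         -- input files that the module received
    use_basenames -- write only file names (ASSUMPTION: the tool resolves
                     them relative to the working directory); otherwise
                     write the paths exactly as given
    """
    entries = [Path(f).name if use_basenames else str(f) for f in files]
    Path(list_path).write_text("\n".join(entries) + "\n")
    return [f"--{flag}", str(list_path)]


# Hypothetical inputs standing in for the files staged by the workflow.
normals = ["/data/normals/n1.target.counts.gz", "/data/normals/n2.target.counts.gz"]

# Write the list into a scratch directory so the example leaves no stray files.
workdir = Path(tempfile.mkdtemp())
list_path = workdir / "cnv_normals.txt"

args = write_list_file("cnv-normals-list", normals, list_path)
print(" ".join(args))
print(list_path.read_text(), end="")
```

The same helper would cover the other one-path-per-line flags in the table. The CSV and TSV inputs (fastq-list, explify-sample-list) need extra columns and would have to be handled separately.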
fyi @asr081
These look like very separate use-cases and I would say it would make sense to make individual submodules for each one.
When I started working on this at the hackathon 2023 in Barcelona I had a long discussion with @FriederikeHanssen who argued for everything to be combined as dragen doesn't have subcommands. Dragen basically allows you to run most of these options together and I assume that users would like to be able to do so in the module as well. If we come to an agreement on this I would be happy to split it into submodules because the whole thing is very complex 😜
You have ~40 input channels, I just don't see how that is going to be usable or maintainable
I agree, it is difficult. But then we need to discuss where to draw the line for those use cases. There are different "pipelines" according to the documentation but even those are rather massive, especially the DNA one which might be the most interesting.
Maybe we focus on just supporting the most common pipelines? What do you want to use it for?
For us, the calling of small and structural variants as well as CNVs in germline and tumor might be most relevant. RNA might also be important.
> When I started working on this at the hackathon 2023 in Barcelona I had a long discussion with @FriederikeHanssen who argued for everything to be combined as dragen doesn't have subcommands.
Yep, sorry, I underestimated the sheer number of things you could throw at it. I agree with Simon that, to keep it maintainable at all, it should be split up.
@marrip Are you planning to work on this one?
We are doing the nf-core spring cleaning and are trying to close PRs that are not used anymore. They can be reopened in the future if needed.
Yes, I am currently working on it. Sorry for the delay.
@marrip No worries :)
Let us know if you need any help!
Finally, here is the reduced version for germline: https://github.com/nf-core/modules/pull/8823