[FEATURE] A standing list of file extensions/file types that should be compressed
Is your feature request related to a problem? Please describe
Somewhat related to a problem, but at the same time not really. I totally agree that compressed files should be used.
Where applicable, the usage and generation of compressed files SHOULD be enforced as input and output, respectively:
*.fastq.gz and NOT *.fastq
*.bam and NOT *.sam
I'm just not sure where to draw the line. The big ones (FASTQ and BAM) are straightforward, but I think it becomes more difficult for things like BLAST results, genome annotations, log files, etc.
I started a conversation in Slack, but I think putting it here might be better.
Describe the solution you'd like
I'm wondering if there could be a standing list of extensions that should be compressed. Here's an example using Prokka outputs:
.err failed annotations
.faa proteins fasta
.ffn genes fasta
.fna contigs
.fsa contigs
.gbk genbank file
.gff gff3 annotations
.log prokka outputs
.sqn sequin file
.tbl tbl2asn file
.tsv tsv of annotations
.txt annotation stats
I think most of these should be compressed, especially the FASTA ones. This is where I think a standing list of extensions, or maybe file types (e.g. sequences in FASTA format), would be useful as a guide and make the choice easier for submitters.
Here's a working list (big overlap with Prokka):
.aln alignment
.fa fasta
.faa proteins fasta
.fasta fasta
.fastq fastq
.fq fastq
.ffn genes fasta
.fna contigs
.fsa contigs
.gbk genbank file
.gfa assembly graph
.gff gff3 annotations
.sqn sequin file
.tbl tbl2asn file
.vcf variants
And how they should be compressed.
Some files apparently benefit from being compressed with bgzip (from the htslib toolkit), but that requires including it in the containers, which is extra effort to maintain (see https://github.com/nf-core/modules/pull/1360#discussion_r816570773).
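If the list ends up recording which files should be bgzip-compressed rather than plain gzip-compressed, a check could tell the two apart without htslib. bgzip writes BGZF, which is still valid gzip but carries a `BC` subfield in the gzip extra field of each block. Below is a minimal sketch of that header check in plain Python; the function name `is_bgzf` is hypothetical, and it only inspects the first block header, so treat it as an illustration rather than a validator.

```python
import struct


def is_bgzf(header: bytes) -> bool:
    """Return True if the bytes start with a BGZF (bgzip) block header."""
    # gzip magic bytes (1f 8b) followed by the deflate method (08)
    if len(header) < 12 or header[:3] != b"\x1f\x8b\x08":
        return False
    # The FEXTRA flag (0x04) must be set for an extra field to exist
    if not header[3] & 0x04:
        return False
    (xlen,) = struct.unpack("<H", header[10:12])
    extra = header[12:12 + xlen]
    pos = 0
    # Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False
```

Reading the first 64 bytes or so of a file in binary mode and passing them to `is_bgzf` should be enough for files written by bgzip, since it places the `BC` subfield first.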
If we have such a list, we could add a function to pytest-workflow which warns/fails if such a file was generated and not compressed.
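To make the idea concrete, here is a rough sketch of such a check in plain Python. It is not pytest-workflow code: the function name `find_uncompressed` and the `SHOULD_COMPRESS` set are hypothetical, and the set is simply the working list from this issue, not an agreed standard.

```python
from pathlib import Path

# Extensions taken from the working list in this issue (an assumption,
# not an agreed standard).
SHOULD_COMPRESS = {
    ".aln", ".fa", ".faa", ".fasta", ".fastq", ".fq", ".ffn", ".fna",
    ".fsa", ".gbk", ".gfa", ".gff", ".sqn", ".tbl", ".vcf",
}


def find_uncompressed(outdir):
    """Return sorted paths under outdir whose final extension is on the list.

    A file such as sample.fastq.gz has the final suffix ".gz", so it is
    not reported; sample.fastq is.
    """
    return sorted(
        p for p in Path(outdir).rglob("*")
        if p.is_file() and p.suffix.lower() in SHOULD_COMPRESS
    )
```

A test could then do `assert not find_uncompressed(workdir)` to fail, or just print the returned paths to warn.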
Hi there!
We’ve noticed there hasn’t been much activity here. Are you still planning on working on this? If not, you can ignore this message and we’ll close your issue in about 2 weeks. If you think this is still relevant, you can also add it to the hackathon2023 project board.
Cheers, the nf-core maintainers
@rpetit3 Hello! Where did you envisage this list being placed? On the website or just in this ticket for now, until it's made part of pytest-workflow?
We can probably keep this open, but I think this needs to go on the website / docs, @mashehu
Would this be better as a linting warning? Something nf-core tools could test for?
I would argue that just having this as a list on the website is not going to help much, as it would be quite easy to skim over it and miss it. Having a file of file extensions that nf-core tools references to produce warnings might be more useful.
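As a sketch of what that could look like, assuming the extensions file uses the same "extension, then description" layout as the lists earlier in this issue: the functions `parse_extension_list` and `warn_if_uncompressed` are hypothetical names, and none of this is the actual nf-core tools lint API.

```python
import warnings
from pathlib import Path


def parse_extension_list(text):
    """Parse lines like '.gff gff3 annotations' into {'.gff': 'gff3 annotations'}."""
    extensions = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments (the '#' comment syntax is an assumption)
        if not line or line.startswith("#"):
            continue
        ext, _, description = line.partition(" ")
        extensions[ext.lower()] = description.strip()
    return extensions


def warn_if_uncompressed(filename, extensions):
    """Emit a warning and return True if filename's final extension is listed."""
    suffix = Path(filename).suffix.lower()
    if suffix in extensions:
        warnings.warn(
            f"{filename}: {extensions[suffix]} ({suffix}) should be compressed"
        )
        return True
    return False
```

Keeping the list in a plain file means it can be updated without touching the linting code, and the same file could be rendered on the website.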