[FEATURE] A standing list of file extensions/file types that should be compressed

Open rpetit3 opened this issue 4 years ago • 6 comments

Is your feature request related to a problem? Please describe

Somewhat related to a problem, but at the same time not really. I totally agree that compressed files should be used.

Where applicable, the usage and generation of compressed files SHOULD be enforced as input and output, respectively:

*.fastq.gz and NOT *.fastq
*.bam and NOT *.sam

I'm just not sure where to draw the line. The big ones (FASTQ and BAM) are straightforward, but I think it becomes more difficult for things like BLAST results, genome annotations, log files, etc.

I started a conversation in Slack, but think putting it here might be better.

Describe the solution you'd like

I'm wondering if there could be a standing list of extensions that should be compressed. Here's an example using Prokka outputs:

.err   failed annotations
.faa   proteins fasta
.ffn   genes fasta
.fna   contigs
.fsa   contigs
.gbk   genbank file
.gff   gff3 annotations
.log   prokka outputs
.sqn   sequin file
.tbl   tbl2asn file
.tsv   tsv of annotations
.txt   annotation stats 

I think most of these should be compressed, especially the FASTA ones. This is where I think a standing list of extensions, or maybe file types (e.g. sequences in FASTA format), would be useful as a guide and make the choice easier on submitters.

Here's a working list (big overlap with Prokka):

.aln   alignment
.fa    fasta
.faa   proteins fasta
.fasta fasta
.fastq fastq
.fq    fastq
.ffn   genes fasta
.fna   contigs
.fsa   contigs
.gbk   genbank file
.gfa   assembly graph
.gff   gff3 annotations
.sqn   sequin file
.tbl   tbl2asn file
.vcf   variants

rpetit3 avatar Aug 19 '21 16:08 rpetit3

And how they should be compressed. Some files apparently benefit from being compressed with bgzip (from the htslib toolkit), but that requires including it in the containers, which is extra effort to maintain (see https://github.com/nf-core/modules/pull/1360#discussion_r816570773).
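
To tell the two flavours apart without pulling htslib into a container, something like the following Python sketch might do. It assumes the BGZF layout from the SAM/BAM specification (a gzip member whose extra field carries a `BC` subfield of length 2); the function name `is_bgzf` is made up for illustration and is not part of any existing tool.

```python
import struct

GZIP_MAGIC = b"\x1f\x8b"
FEXTRA = 0x04  # gzip header flag: an "extra" field is present


def is_bgzf(header: bytes) -> bool:
    """Return True if `header` (the first bytes of a file) starts a BGZF block."""
    # The fixed gzip header is 10 bytes, followed by a 2-byte XLEN when FEXTRA is set.
    if len(header) < 12 or header[:2] != GZIP_MAGIC or header[2] != 8:
        return False
    if not header[3] & FEXTRA:
        return False  # no extra field, so this cannot be BGZF
    (xlen,) = struct.unpack_from("<H", header, 10)
    extra = header[12:12 + xlen]
    # Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False
```

Reading the first few hundred bytes of a file in binary mode and passing them to `is_bgzf` should be enough in practice, since the extra field sits right after the fixed header.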

mahesh-panchal avatar Mar 02 '22 09:03 mahesh-panchal

If we have such a list, we could add a function to pytest-workflow which warns/fails if such a file was generated and not compressed.
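
A rough sketch of what such a check could look like, in plain Python. This is not pytest-workflow's API; `SHOULD_BE_COMPRESSED` and `find_uncompressed` are hypothetical names, and the extension set is only an illustrative subset of the working list above.

```python
from pathlib import Path

# Illustrative subset of the working list in this issue; not an agreed standard.
SHOULD_BE_COMPRESSED = {
    ".fa", ".faa", ".fasta", ".fastq", ".fq", ".ffn", ".fna", ".gff", ".vcf",
}


def find_uncompressed(outdir):
    """Return sorted paths under `outdir` whose final extension is on the list."""
    offenders = []
    for path in Path(outdir).rglob("*"):
        # `suffix` is only the last extension, so "reads.fastq.gz" yields ".gz"
        # and is not flagged, while "reads.fastq" yields ".fastq" and is.
        if path.is_file() and path.suffix.lower() in SHOULD_BE_COMPRESSED:
            offenders.append(path)
    return sorted(offenders)
```

A test could then assert that `find_uncompressed(workdir)` is empty, or only print the offenders if a warning is preferred over a failure.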

grst avatar Mar 07 '22 20:03 grst

Hi there!

We’ve noticed there hasn’t been much activity here. Are you still planning on working on this? If not, you can ignore this message and we’ll close your issue in about 2 weeks. If you think this is still relevant, you can also add it to the hackathon2023 project board.

Cheers, the nf-core maintainers

jasmezz avatar Mar 07 '23 11:03 jasmezz

@rpetit3 Hello! Where did you envisage this list being placed? On the website or just in this ticket for now, until it's made part of pytest-workflow?

lukbut avatar Mar 28 '23 09:03 lukbut

We can probably keep this open, but this needs to go on the website / docs I think @mashehu

famosab avatar Mar 13 '25 09:03 famosab

Would this be better as a linting warning? Something nf-core tools could test for?

I would argue that just having this as a list on the website is not going to help much, as it would be easy to skim over and miss. A file of extensions that nf-core tools references and produces warnings for might be more useful.
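
To make the idea concrete, here is a minimal Python sketch. It assumes the extensions file uses the same "extension, whitespace, description" layout as the lists earlier in this issue; `parse_extension_list` and `lint_outputs` are hypothetical helpers, not existing nf-core tools functions.

```python
import warnings
from pathlib import PurePath


def parse_extension_list(text):
    """Parse lines such as '.vcf   variants' into {'.vcf': 'variants'}."""
    extensions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        extensions[parts[0].lower()] = parts[1].strip() if len(parts) > 1 else ""
    return extensions


def lint_outputs(filenames, extensions):
    """Warn once per uncompressed file and return the flagged names."""
    flagged = []
    for name in filenames:
        suffix = PurePath(name).suffix.lower()
        if suffix in extensions:
            label = extensions[suffix] or suffix
            warnings.warn(f"{name}: {label} output is not compressed")
            flagged.append(name)
    return flagged
```

Keeping the list in a data file rather than in code would let it be updated without touching the linter itself.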

mahesh-panchal avatar Mar 28 '25 08:03 mahesh-panchal