Using bgz extension instead of gz for bgzipped files
This would make it very clear what type of file is being dealt with. Additionally, it can avoid issues with tools like vim, which can read bgzipped files saved with .gz extensions and then resave them as regular gzip.
I believe this needs only minimal changes: to bgzip.c, to auto-append .bgz instead of .gz, and to tabix.c, to auto-detect files with .bgz extensions.
I can submit a pull request for these if needed.
bgzipped files are gzip compliant. Why not tell vim to gzip everything with bgzip instead?
vim was just an example, not the motivation. The more general motivation is that bgzip is slightly different from gzip, and there doesn't appear to be a reason to conflate the two, especially since sometimes it's preferable to have pure gzip (no index will be built, or the file has nothing to do with genomics).
On Wed, Nov 18, 2015 at 8:39 AM, Warren Kretzschmar wrote:
> bgzipped files are gzip compliant. Why not tell vim to gzip everything with bgzip instead?
Yep, you're right. Even different implementations of the same standard (for example .lz and .lzma here) appear to get different file endings.
For bgzip, we could make the ".bgz" extension an optional (non-default) filename suffix.
For the rest of htslib/samtools/bcftools, we should make sure that ".bgz" files are recognized as bgzip for input and output filenames.
.bgz files are still not recognized by bgzip. I had to rename all my files from .bgz to .gz, which is not intuitive, since I haven't found this detail documented anywhere. I guess I'm not the only one who downloads files with a ".bgz" extension from databases (e.g. gnomAD VCFs), and I guess it would be an easy modification to make.
Thank you
Why do you need to rename your files? The suffix name does not matter for the operation of the program.
I have to rename them from ".bgz" to ".gz" so that bgzip can work; otherwise I get the error "unknown suffix -- ignored". As I said, they are named ".bgz" on gnomAD, and since the tool is called bgzip (not gzip), it would make sense to me for the extension to be ".bgz".
That's likely a bug in bgzip as it ought to be using the magic number instead of filename.
However, I see similar logic in tabix, which only works on (for example) foo.bed.gz and wouldn't accept foo.bed.bgz. This is why renaming files to your own suffixes is problematic, and I'd be reluctant to tinker with this. Even if we change it in htslib, it'll cause problems for people using old installs, and we have no idea how many other applications out there assume .gz instead of .bgz. I agree bgz would have been better, but IMO this ship sailed long ago.
I agree that it should not use the filename...
My first idea was to have bgzip work on ".gz" but also on ".bgz", so that old installs, as you said, would still work with ".gz".
Yes, it should use the magic number, this fails:
bgzip -d test.bgz
[bgzip] test.bgz: unknown suffix -- ignored
As a quick workaround, use
gunzip -c test.bgz
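For anyone scripting the same workaround, here is a minimal Python sketch. It relies on the point made earlier in the thread that bgzipped files are gzip compliant, so the standard-library gzip module can read them whatever the suffix. The function name decompress_any_suffix is made up for this example and is not part of htslib.

```python
import gzip
import shutil


def decompress_any_suffix(src, dst):
    """Decompress a gzip-compatible file (BGZF included), whatever its suffix."""
    # gzip.open inspects the file contents rather than the file name, and it
    # reads every concatenated gzip member, which is how BGZF blocks are stored.
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
```

Usage would be something like decompress_any_suffix("test.bgz", "test"), with the output name chosen by the caller.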
This code in bgzip is checking that the file is compressed, hence in a position to be decompressed. Doing that via filename-extension checking code is ancient, from before we had easy magic-number sniffing infrastructure.
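To illustrate what magic-number sniffing means here, a rough Python sketch follows. This is not the htslib code; it only follows the published gzip and BGZF header layouts: gzip starts with bytes 1f 8b and compression method 8, and BGZF additionally sets the FEXTRA flag and carries a "BC" extra subfield of length 2. The function name sniff_compression and its return strings are invented for this example.

```python
import struct

GZIP_MAGIC = b"\x1f\x8b"


def sniff_compression(path):
    """Return 'bgzf', 'gzip', or 'none' based on the file's leading bytes."""
    with open(path, "rb") as fh:
        head = fh.read(12)
        # Not gzip at all: too short, wrong magic, or not deflate (CM == 8).
        if len(head) < 12 or head[:2] != GZIP_MAGIC or head[2] != 8:
            return "none"
        # FLG.FEXTRA (bit 2) is not set, so this cannot be BGZF.
        if not head[3] & 0x04:
            return "gzip"
        (xlen,) = struct.unpack("<H", head[10:12])
        extra = fh.read(xlen)
    # Walk the extra subfields looking for BGZF's 'BC' subfield of length 2.
    pos = 0
    while pos + 4 <= len(extra):
        si = extra[pos:pos + 2]
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if si == b"BC" and slen == 2:
            return "bgzf"
        pos += 4 + slen
    return "gzip"
```

The file name never enters the decision, so foo.bed.gz and foo.bed.bgz with identical contents get the same answer.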
[Edit: the similar logic in tabix.c — in file_type() — is just a shortcut: if the extension heuristic doesn't trigger, the code sniffs the file contents. So I think tabix is fine and would accept foo.bed.bgz just fine; certainly it queries .bgz files downloaded from gnomAD happily.]
We'd now be in a position to move bgzip's is-it-compressed test to after bgzf_open() and use bgzf_compression() instead — compare the tbx.c part of ec1d68e23ce75c7040ec895dfc0fffb1a2acb22c.
The code that strips .gz off the end of the input filename to construct an output filename would also need generalising to handle bgzip -d foo.bed.bgz, but that's not insurmountable.
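A sketch of what the generalised suffix stripping could look like, written in Python for brevity rather than as a patch to bgzip.c. The suffix list and the function name decompressed_name are assumptions made for illustration only.

```python
import os

# Hypothetical suffix list for illustration; not taken from bgzip.c.
COMPRESSED_SUFFIXES = (".gz", ".bgz")


def decompressed_name(path):
    """Return the output name for decompressing `path`, or None if the
    name has no recognised suffix to strip."""
    for suffix in COMPRESSED_SUFFIXES:
        # Require a non-empty stem so a file literally named ".gz" is rejected.
        if path.endswith(suffix) and len(os.path.basename(path)) > len(suffix):
            return path[:-len(suffix)]
    return None
```

With this, foo.bed.gz and foo.bed.bgz would both map to foo.bed. A name with neither suffix returns None, and the caller still has to decide what to do in that case.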
@pd3 Yes, in the end it looks like this is the best solution.
@jmarshall Yes, that's the idea. I'll try to do it on my own.
It seems like this was never done...? When using bgzip -d after installing the latest version of htslib, I still have to change the extension of the file that I am decompressing from .bgz -> .gz