htslib

Using bgz extension instead of gz for bgzipped files

Open hguturu opened this issue 11 years ago • 13 comments

Would make it very clear what type of file is being dealt with. Additionally, it can avoid issues such as vim reading bgzipped files saved with .gz extensions and resaving them as regular gzip.

I believe this needs only minimal changes: bgzip.c to auto-append .bgz instead of .gz, and tabix.c to auto-detect files with .bgz extensions.

I can submit a pull request for these if needed.

hguturu avatar Sep 09 '14 17:09 hguturu

bgzipped files are gzip compliant. Why not tell vim to gzip everything with bgzip instead?

winni2k avatar Nov 18 '15 16:11 winni2k

vim was just an example, not the motivation. The more general motivation is that bgzip is slightly different from gzip, and there doesn't appear to be a reason to conflate the two, especially since sometimes it's preferred to have pure gzip (no index will be built, or the file has nothing to do with genomics).

hguturu avatar Nov 18 '15 18:11 hguturu

Yep, you're right. Even different implementations of the same standard (for example .lz and .lzma here) appear to get different file endings.

winni2k avatar Nov 19 '15 14:11 winni2k

For bgzip, we could make ".bgz" an optional (non-default) output filename extension.

For the rest of htslib/samtools/bcftools, we should make sure that ".bgz" files are recognized as bgzip for input and output filenames.

jrandall avatar Jan 25 '16 15:01 jrandall

.bgz files are still not recognized by bgzip. I had to rename all my files from .bgz to .gz, which is not intuitive since I haven't found any documentation of this detail. I guess I'm not the only one who downloads files with a ".bgz" extension from databases (e.g. gnomAD VCFs), and I guess it's an easy modification to make.

Thank you

dprat avatar Apr 18 '18 07:04 dprat

Why do you need to rename your files? The suffix name does matter for the operation of the program.

pd3 avatar Apr 18 '18 08:04 pd3

I have to rename them from ".bgz" to ".gz" so that bgzip will work; otherwise I get the error "unknown suffix -- ignored". As I said, they are named ".bgz" on gnomAD, and since the tool is called bgzip (not gzip), it would make sense (to me) for the extension to be ".bgz".

dprat avatar Apr 18 '18 09:04 dprat

That's likely a bug in bgzip as it ought to be using the magic number instead of filename.

However, I see similar logic in tabix, which only works on (for example) foo.bed.gz and wouldn't accept foo.bed.bgz. This is why renaming files to your own suffixes is problematic, and I'd be reluctant to tinker with this. Even if we change it in htslib, it'll cause problems for people using old installs, and we have no idea how many other applications out there are assuming .gz instead of .bgz. I agree bgz would have been better, but IMO this ship sailed long ago.

jkbonfield avatar Apr 18 '18 09:04 jkbonfield

I agree that it should not use the filename...

My first idea was for bgzip to accept ".bgz" as well as ".gz", so that, as you said, old installs would still work with ".gz".

dprat avatar Apr 18 '18 09:04 dprat

Yes, it should use the magic number, this fails:

bgzip -d test.bgz
[bgzip] test.bgz: unknown suffix -- ignored

As a quick workaround, use

gunzip -c test.bgz

pd3 avatar Apr 18 '18 09:04 pd3
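
For readers unfamiliar with the "magic number" idea mentioned above, here is a minimal sketch in C of what sniffing the first bytes of a file for a BGZF block header could look like. It is not htslib's actual code: the function name `looks_like_bgzf` is hypothetical, and the sketch assumes the 'BC' extra subfield comes first in the gzip extra field, which is how bgzip writes its blocks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of htslib): returns 1 if the first bytes
 * of a file look like a BGZF block header, 0 otherwise. */
int looks_like_bgzf(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (len < 16) return 0;                          /* need the fixed header part */
    if (buf[0] != 0x1f || buf[1] != 0x8b) return 0;  /* gzip magic number */
    if (buf[2] != 8) return 0;                       /* CM: deflate */
    if ((buf[3] & 4) == 0) return 0;                 /* FLG.FEXTRA must be set */
    /* Simplifying assumption: the 'BC' subfield is the first extra subfield. */
    if (buf[12] != 'B' || buf[13] != 'C') return 0;
    if (buf[14] != 2 || buf[15] != 0) return 0;      /* SLEN == 2, little-endian */
    return 1;
}
```

A check like this depends only on file contents, so it would give the same answer for test.gz and test.bgz, and it would also distinguish a bgzipped file from one that an editor has resaved as plain gzip.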

This code in bgzip is checking that the file is compressed, hence in a position to be decompressed. Doing that via filename-extension checking code is ancient, from before we had easy magic-number sniffing infrastructure.

[Edit: the similar logic in tabix.c — in file_type() — is just a shortcut: if the extension heuristic doesn't trigger, the code sniffs the file contents. So I think tabix is fine and would accept foo.bed.bgz just fine; certainly it queries .bgz files downloaded from gnomAD happily.]
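
The "shortcut first, sniff otherwise" pattern described in the edit above can be illustrated roughly as follows. This is only a sketch and not tabix's file_type(): the enum, `ends_with`, and `guess_format` are hypothetical names, and the content check recognises only VCF (via its `##fileformat=VCF` first line).

```c
#include <assert.h>
#include <string.h>

enum fmt { FMT_UNKNOWN, FMT_VCF, FMT_BED };

/* Returns 1 if s ends with suffix, 0 otherwise. */
static int ends_with(const char *s, const char *suffix)
{
    size_t n = strlen(s), m = strlen(suffix);
    return n >= m && strcmp(s + n - m, suffix) == 0;
}

/* Hypothetical sketch: name is the file name, first_line is the first
 * line of the decompressed content. */
enum fmt guess_format(const char *name, const char *first_line)
{
    assert(name != NULL && first_line != NULL);
    /* Shortcut: trust well-known extensions without looking at the data. */
    if (ends_with(name, ".vcf.gz")) return FMT_VCF;
    if (ends_with(name, ".bed.gz")) return FMT_BED;
    /* Fallback: the extension told us nothing, so sniff the content. */
    if (strncmp(first_line, "##fileformat=VCF", 16) == 0) return FMT_VCF;
    return FMT_UNKNOWN;
}
```

With this shape, a name such as gnomad.vcf.bgz misses the shortcut but is still identified as VCF from its first line, which matches the behaviour reported for gnomAD downloads.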

We'd now be in a position to move bgzip's is-it-compressed test to after bgzf_open() and use bgzf_compression() instead — compare the tbx.c part of ec1d68e23ce75c7040ec895dfc0fffb1a2acb22c.

The code that strips .gz off the end of the input filename to construct an output filename would also need generalising to handle bgzip -d foo.bed.bgz, but that's not insurmountable.

jmarshall avatar Apr 18 '18 10:04 jmarshall
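
As a rough illustration of generalising the suffix-stripping step, the sketch below derives an output filename by removing either ".gz" or ".bgz". It is not the code in bgzip.c: `strip_compression_suffix` is a hypothetical helper, and returning NULL for an unrecognised suffix is an assumption about how a caller might want to handle that case.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from bgzip.c): returns a newly allocated copy
 * of name with a trailing ".bgz" or ".gz" removed, or NULL if name has no
 * known suffix or allocation fails. The caller frees the result. */
char *strip_compression_suffix(const char *name)
{
    static const char *const suffixes[] = { ".bgz", ".gz" };
    assert(name != NULL);
    size_t n = strlen(name);
    for (size_t i = 0; i < sizeof suffixes / sizeof suffixes[0]; i++) {
        size_t s = strlen(suffixes[i]);
        /* Require at least one character before the suffix. */
        if (n > s && strcmp(name + n - s, suffixes[i]) == 0) {
            char *out = malloc(n - s + 1);
            if (out == NULL) return NULL;
            memcpy(out, name, n - s);
            out[n - s] = '\0';
            return out;
        }
    }
    return NULL;
}
```

Under these assumptions, both foo.bed.gz and foo.bed.bgz map to foo.bed, so existing ".gz" workflows are unaffected while ".bgz" inputs gain a sensible output name.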

@pd3 Yes, in the end it looks like this is the best solution.

@jmarshall Yes, that's the idea. I'll try to do it on my own.

dprat avatar Apr 18 '18 13:04 dprat

It seems like this was never done...? When using bgzip -d after installing the latest version of htslib, I still have to change the extension of the file that I am decompressing from .bgz -> .gz

alam-shahul avatar Aug 05 '19 18:08 alam-shahul