Skip to last record using tabix

Open CreRecombinase opened this issue 5 years ago • 2 comments

I have many large (indexed) VCFs of the form ${CHROM}_${CHUNK}vcf.gz and was looking for a quick way to get the coordinates spanned by each file. I know that, given a region, the index can be used to skip to the chunks overlapping that region, but is the reverse possible? Can I use the last entry in the index to get the offset to the last chunk?

CreRecombinase, Feb 13 '21 01:02

> was looking for a quick way to get the coordinates spanned by the file.

Do you mean something like:

chr1    61772    17129271
chr2    262      6221917

> Can I use the last entry in the index to get the offset to the last chunk?

This is a different request. Do you actually need the file offset? It wouldn't make much sense to have it displayed by tabix, but it could be returned by a HTSlib method.

valeriuo, Apr 20 '21 13:04

In my use case I know that the file doesn't span multiple chromosomes, but yes, that's the idea. My (admittedly poor) understanding of the tabix format (for bcf/vcf files) is that it stores the (genomic) coordinate of the first record in each chunk.

> This is a different request. Do you actually need the file offset? It wouldn't make much sense to have it displayed by tabix, but it could be returned by a HTSlib method.

I agree that having tabix export the file offset of the last chunk would be a weird piece of functionality, and I was thinking it would make more sense as an HTSlib method. Now that you mention it, though, I feel like a tabix view or tabix export that dumped the contents of the index file as JSON (or something similar) could be useful in a lot of settings.

CreRecombinase, Apr 20 '21 16:04
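
For anyone landing here with the same question, below is a rough stdlib-only Python sketch of both ideas from this thread: dumping the contents of a .tbi as JSON, and using the last chunk to jump to the final record. It is not an HTSlib or tabix feature. The function names (`parse_tbi`, `last_chunk`, `read_chunk`, `last_record`) are made up for illustration. As far as I understand the tabix format, the index does not store record coordinates: each bin holds chunks as pairs of virtual file offsets (compressed block offset << 16 | offset within the block), and there is a linear index with one entry per 16 kb window. The sketch assumes the default TBI layout and that bin 37450 is the metadata pseudo-bin HTSlib writes, so it is skipped.

```python
import gzip
import json
import struct
import zlib

META_BIN = 37450  # assumed: pseudo-bin holding per-reference metadata, not real chunks


def parse_tbi(data):
    """Parse *decompressed* .tbi bytes into {ref_name: {"chunks": [...], "n_intv": int}}."""
    if data[:4] != b"TBI\x01":
        raise ValueError("not a tabix index")
    # n_ref, format, col_seq, col_beg, col_end, meta, skip, l_nm
    header = struct.unpack_from("<8i", data, 4)
    n_ref, l_nm = header[0], header[7]
    pos = 36
    names = data[pos:pos + l_nm].split(b"\x00")[:n_ref]
    pos += l_nm
    index = {}
    for name in names:
        (n_bin,) = struct.unpack_from("<i", data, pos)
        pos += 4
        chunks = []
        for _ in range(n_bin):
            bin_id, n_chunk = struct.unpack_from("<Ii", data, pos)
            pos += 8
            for _ in range(n_chunk):
                beg, end = struct.unpack_from("<QQ", data, pos)
                pos += 16
                if bin_id != META_BIN:
                    chunks.append((beg, end))
        (n_intv,) = struct.unpack_from("<i", data, pos)
        pos += 4 + 8 * n_intv  # skip the linear index offsets themselves
        index[name.decode()] = {"chunks": chunks, "n_intv": n_intv}
    return index


def split_voffset(voff):
    """Split a virtual offset into (compressed block offset, offset inside the block)."""
    return voff >> 16, voff & 0xFFFF


def last_chunk(chunks):
    """Chunk that starts latest in the file."""
    return max(chunks, key=lambda c: c[0])


def read_chunk(raw, beg, end):
    """Decompress the bytes between two virtual offsets; raw is bytes or an mmap of the .gz."""
    (c_beg, u_beg), (c_end, u_end) = split_voffset(beg), split_voffset(end)
    pos, out = c_beg, []
    while pos <= c_end and pos < len(raw):
        buf = raw[pos:pos + 65536]  # a BGZF block is at most 64 KiB
        d = zlib.decompressobj(31)  # 31 = expect a gzip header
        block = d.decompress(buf)
        if not d.eof:
            raise ValueError("truncated block")
        lo = u_beg if pos == c_beg else 0
        hi = u_end if pos == c_end else len(block)
        out.append(block[lo:hi])
        pos += len(buf) - len(d.unused_data)
    return b"".join(out)


def last_record(raw, chunks):
    """Last line stored in the latest chunk, or None if that chunk is empty."""
    lines = read_chunk(raw, *last_chunk(chunks)).splitlines()
    return lines[-1].decode() if lines else None
```

With a real file this would be used roughly as `idx = parse_tbi(gzip.open(tbi_path).read())`, then `json.dumps(idx)` for the export and `last_record(mmap_of_vcf, idx[chrom]["chunks"])` for the final record, where `tbi_path`, `mmap_of_vcf` and `chrom` are placeholders. Two caveats: the last record in file order has the largest start position but not necessarily the largest end position, and `n_intv * 16384` should only be read as a coarse upper bound on the coordinates covered.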