Add ability to strip sequences/qualities from SAM/BAM files
When working with very large SAM files, it is often convenient to remove sequence and quality information to reduce storage and improve I/O.
Following from this, it would be convenient to have a stripSeqQual function that replaces the two fields with *.
This could also be a flag on the write function, applied when saving. I am not 100% sure which is the best model.
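To make the proposed operation concrete, here is a minimal sketch in Python of what stripping would do to a single SAM line. It is only an illustration, not NGLess code: the name `strip_seq_qual` is hypothetical, and it assumes plain tab-separated SAM text where SEQ and QUAL are the 10th and 11th mandatory columns and header lines start with `@`.

```python
def strip_seq_qual(line: str) -> str:
    """Replace SEQ and QUAL (columns 10 and 11) of a SAM record with '*'.

    Hypothetical helper for illustration; not part of NGLess.
    """
    # Header lines start with '@' and carry no sequence data.
    if line.startswith("@"):
        return line
    # Remember whether the line ended with a newline so it can be restored.
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    # A SAM alignment record has 11 mandatory fields; leave anything else alone.
    if len(fields) < 11:
        return line
    fields[9] = "*"   # SEQ
    fields[10] = "*"  # QUAL
    # Optional tags (column 12 onwards) are kept unchanged.
    return "\t".join(fields) + newline


record = "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\n"
print(strip_seq_qual(record), end="")
```

Both fields are replaced together because a record without a stored sequence cannot carry per-base qualities.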
Hi, I would like to work on this.
So I've written a stripSeqQual function in the Data.Sam module. Where should that function be called?
Luis's suggestion above would be to have an extra argument on the write() function.
Something like::
data = input("file.sam")
write(data, ofile="output.sam", remove_sequence_qualities=True)
My initial idea was to have it as part of a select block. So:
data = input("file.sam")
newdata = select(data) using |mr|:
    mr = mr.filter(min_match_size=45, min_identity_pc=90, action={unmatch})
    mr.remove_sequence_qualities()
write(newdata, ofile="output.sam")
The second interface covers a few more use cases, but we didn't reach a decision on which to implement.
@luispedro thoughts?
The write function already has a format_flags argument, so it could be write(newdata, ofile="...", format_flags={no_qualities}).
@unode: what use-cases do you see with the second interface? I am not against it, but the write version is more straightforward to code and can be very fast (interpreting blocks is still a bit slow).
The main case I envision is optimization. In a long pipeline over SAM/BAM data that doesn't require qualities, removing them early could speed up processing by reducing I/O.
I had a couple of such cases in the past but wouldn't call it a frequent use-case.
In principle, we could move the stripping to earlier in the pipeline as an optimization later without changing the user-visible interface.
OK, so we go with write(newdata, ofile="...", format_flags={no_qualities}) and revisit in the future if necessary.
@sureyeaah can you also add a line to https://github.com/ngless-toolkit/ngless/blob/59576e0c17d90e031ae24c9e6ae0f556a2b7c37b/ChangeLog#L2 in your pull request? Thanks