
[FEATURE REQUEST]: skip over/ignore short sequences

Open ptrebert opened this issue 2 years ago • 3 comments

Is this a feature request for FCS-adaptor or FCS-GX?
FCS-adaptor (I am using v0.4.0)

Describe the problem you'd like to be solved
Don't fail on sequences <10 bp

Describe the solution you'd like
Please add a CLI switch to simply skip over/ignore sequences that are shorter than 10 bp

Describe alternatives you've considered
Checking/filtering all input sequences beforehand, which implies that each sequence file is processed at least twice (checking and then adaptor scanning)

Thanks

ptrebert, Jan 11 '24 08:01

Can I ask in what context you are working with sequences <10 bp? NCBI GenBank submissions have length requirements (200 bp for genome sequences, 10 bp for others), hence the validation check included here. If these aren't intended for submission to NCBI archives, that's fine, and we can consider adding this as an optional flag in a future release. For now, my best suggestion is a workaround: extract the short sequences, set them aside while running FCS-adaptor on the longer sequences, then add them back in.
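The workaround above could be sketched roughly as follows. This is only an illustration, not part of FCS-adaptor: the function names `read_fasta` and `split_by_length` are hypothetical, the 10 bp threshold is taken from this thread, and sequences are written back unwrapped (one line per record).

```python
import io
from typing import Iterator, TextIO, Tuple

MIN_LEN = 10  # assumption: the 10 bp minimum mentioned in this thread


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(src: TextIO, long_out: TextIO, short_out: TextIO,
                    min_len: int = MIN_LEN) -> Tuple[int, int]:
    """Write records with len >= min_len to long_out, the rest to short_out.

    Returns (number kept, number set aside).
    """
    kept = set_aside = 0
    for header, seq in read_fasta(src):
        if len(seq) >= min_len:
            long_out.write(f">{header}\n{seq}\n")
            kept += 1
        else:
            short_out.write(f">{header}\n{seq}\n")
            set_aside += 1
    return kept, set_aside


if __name__ == "__main__":
    # Tiny in-memory demo; real use would pass open file handles instead.
    demo = io.StringIO(">ctg1\nACGTACGTACGT\n>junk\nACG\n")
    long_out, short_out = io.StringIO(), io.StringIO()
    print(split_by_length(demo, long_out, short_out))
```

The file of longer records would then be passed to FCS-adaptor, and the set-aside records appended to the cleaned output afterwards if they are still wanted.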

etvedte, Jan 11 '24 16:01

I am not "working" with these sequences; I strongly assume that they are garbage contained in a handful of the genome assemblies I am analyzing. If that length requirement exists because of strict filter criteria for submissions, maybe a cleaner solution would be a strict setting, on by default, that simply labels too-short sequences for removal, analogous to flagging any remnants of adaptor sequences.

In my case, of course, I am now pre-filtering the assembly FASTAs.

ptrebert, Jan 17 '24 15:01

Point taken. We will consider this for our next release.

etvedte, Jan 18 '24 13:01