Does subseq use random seed?
Each subseq execution outputs sequences in a different order, even with a fixed FASTA and a fixed BED input. I can't find this information in the help or the manual.
In my specific case the output order is important because it is related to the expected positions.
for run in {1..4}; do
seqkit subseq --quiet --bed covid19_sample-positions.bed merged-gnm.fa -j 12 | head && echo "" ;
done
>covid19_sample-conserved
GCACTAACTAAGTTCCTAACCACTT
>covid19_sample-negative
GGGCATAGTAAGGCAGTTT
>covid19_sample-uncertain
TACAGTCGTTGGTCCTCG
>covid19_sample-antisense
CTTTACTGTGGACCGTGGCA
>covid19_sample-deep
GGCAGTAATAGGTCTTAATACCA

>covid19_sample-conserved
GCACTAACTAAGTTCCTAACCACTT
>covid19_sample-negative
GGGCATAGTAAGGCAGTTT
>covid19_sample-uncertain
TACAGTCGTTGGTCCTCG
>covid19_sample-antisense
CTTTACTGTGGACCGTGGCA
>covid19_sample-deep
GGCAGTAATAGGTCTTAATACCA

>covid19_sample-deep
GGCAGTAATAGGTCTTAATACCA
>covid19_sample-deep
GGCAGTAATAGGTCTTAATACCA
>covid19_sample-conserved
GCACTAACTAAGTTCCTAACCACTT
>covid19_sample-conserved
GCACTAACTAAGTTCCTAACCACTT
>covid19_sample-lysis
GTCGTCCGATCTTTTAACCGGTA

>covid19_sample-antisense
CTTTACTGTGGACCGTGGCA
>covid19_sample-deep
GGCAGTAATAGGTCTTAATACCA
>covid19_sample-F2
ACTAGCCCACTAAACTCAG
>covid19_sample-conserved
GCACTAACTAAGTTCCTAACCACTT
>covid19_sample-lysis
GTCGTCCGATCTTTTAACCGGTA
Personal computer
OS: Ubuntu MATE 22.04.4 LTS x86_64
Kernel: 6.5.0-26-generic
Shell: bash 5.1.16
Seqkit Version: 2.8.0
mamba version: 1.5.5
Yes. For plain text FASTA input, the order of output records is random.
Try compressing the merged-gnm.fa file and using merged-gnm.fa.gz as the input.
Execution time increases but it works fine. Thank you!
Can we have that info on the manual? Also, would it be nice adding control to subseq seed, right?
Thanks a lot!
gzip merged-gnm.fa
for run in {1..4}; do
seqkit subseq --quiet --bed covid19_sample-positions.bed merged-gnm.fa.gz -j 12 | head && echo "" ;
done
>covid19_sample-conserved
GCACTAACTAAGTTCCTAACCACTT
>covid19_sample-negative
GGGCATAGTAAGGCAGTTT
>covid19_sample-uncertain
TACAGTCGTTGGTCCTCG
>covid19_sample-antisense
CTTTACTGTGGACCGTGGCA
>covid19_sample-deep
GGCAGTAATAGGTCTTAATACCA

>covid19_sample-conserved
GCACTAACTAAGTTCCTAACCACTT
>covid19_sample-negative
GGGCATAGTAAGGCAGTTT
>covid19_sample-uncertain
TACAGTCGTTGGTCCTCG
>covid19_sample-antisense
CTTTACTGTGGACCGTGGCA
>covid19_sample-deep
GGCAGTAATAGGTCTTAATACCA

>covid19_sample-conserved
GCACTAACTAAGTTCCTAACCACTT
>covid19_sample-negative
GGGCATAGTAAGGCAGTTT
>covid19_sample-uncertain
TACAGTCGTTGGTCCTCG
>covid19_sample-antisense
CTTTACTGTGGACCGTGGCA
>covid19_sample-deep
GGCAGTAATAGGTCTTAATACCA

>covid19_sample-conserved
GCACTAACTAAGTTCCTAACCACTT
>covid19_sample-negative
GGGCATAGTAAGGCAGTTT
>covid19_sample-uncertain
TACAGTCGTTGGTCCTCG
>covid19_sample-antisense
CTTTACTGTGGACCGTGGCA
>covid19_sample-deep
GGCAGTAATAGGTCTTAATACCA
Can we have that info on the manual?
Added.
Attention:
1. When extracting with BED/GTF from plain text FASTA files, the order of output sequences
is random. To keep the order, just compress the FASTA file (input.fasta) and use the
compressed one (input.fasta.gz) as the input.
2. Use "seqkit grep" for extracting subsets of sequences.
"seqtk subseq seqs.fasta id.txt" is equivalent to
"seqkit grep -f id.txt seqs.fasta"
would it be nice adding control to subseq seed
BED records are stored in a map/hash, whose iteration order is random and can't be controlled by a seed number. The logic is to quickly extract the FASTA sequences that appear in the BED file via the FASTA index.
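Since the order comes from iterating a map, another option besides compressing the input is to restore a known order after extraction. The following is only a sketch and not a seqkit feature: `reorder_fasta` and `ids.txt` are hypothetical names, and it assumes the first word of each output header is the ID you want to order by, with `ids.txt` listing those IDs one per line in the desired order. Records sharing an ID keep their relative order, and records whose ID is not listed are dropped.

```shell
# Reorder FASTA records read from stdin to follow the ID order given in file $1.
# Usage: some_command | reorder_fasta ids.txt
reorder_fasta() {
    awk '
        # First input (the ID list): remember each distinct ID in order of appearance.
        NR == FNR {
            if (!($1 in seen)) { seen[$1] = 1; order[++n] = $1 }
            next
        }
        # Header line: the ID is the first word without the leading ">".
        /^>/ {
            id = substr($1, 2)
            rec[id] = rec[id] $0 "\n"
            next
        }
        # Sequence line: append it to the record currently being read.
        id != "" { rec[id] = rec[id] $0 "\n" }
        # Print the collected records in the order of the ID list.
        END {
            for (i = 1; i <= n; i++)
                if (order[i] in rec) printf "%s", rec[order[i]]
        }
    ' "$1" -
}
```

For example, `seqkit subseq --quiet --bed covid19_sample-positions.bed merged-gnm.fa -j 12 | reorder_fasta ids.txt`. It holds the whole output in memory, so it suits small extractions like the one in this thread better than very large ones.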