Does subseq use random seed?

Open Aciole-David opened this issue 1 year ago • 3 comments

Does subseq use a random seed? Each subseq run outputs the sequences in a different, seemingly random order, even with a fixed FASTA and a fixed BED input. I can't find this information in the help or the manual.

In my specific case the output order is important because it is related to the expected positions.

for run in {1..4}; do
seqkit subseq --quiet --bed covid19_sample-positions.bed merged-gnm.fa -j 12 | head && echo "" ;
done

>covid19_sample-conserved
GCACTAACTAAGTTCCTAACCACTT
>covid19_sample-negative
GGGCATAGTAAGGCAGTTT
>covid19_sample-uncertain
TACAGTCGTTGGTCCTCG
>covid19_sample-antisense
CTTTACTGTGGACCGTGGCA
>covid19_sample-deep
GGCAGTAATAGGTCTTAATACCA

>covid19_sample-conserved
GCACTAACTAAGTTCCTAACCACTT
>covid19_sample-negative
GGGCATAGTAAGGCAGTTT
>covid19_sample-uncertain
TACAGTCGTTGGTCCTCG
>covid19_sample-antisense
CTTTACTGTGGACCGTGGCA
>covid19_sample-deep
GGCAGTAATAGGTCTTAATACCA

>covid19_sample-deep
GGCAGTAATAGGTCTTAATACCA
>covid19_sample-deep
GGCAGTAATAGGTCTTAATACCA
>covid19_sample-conserved
GCACTAACTAAGTTCCTAACCACTT
>covid19_sample-conserved
GCACTAACTAAGTTCCTAACCACTT
>covid19_sample-lysis
GTCGTCCGATCTTTTAACCGGTA

>covid19_sample-antisense
CTTTACTGTGGACCGTGGCA
>covid19_sample-deep
GGCAGTAATAGGTCTTAATACCA
>covid19_sample-F2
ACTAGCCCACTAAACTCAG
>covid19_sample-conserved
GCACTAACTAAGTTCCTAACCACTT
>covid19_sample-lysis
GTCGTCCGATCTTTTAACCGGTA

Environment (personal computer):
OS: Ubuntu MATE 22.04.4 LTS x86_64
Kernel: 6.5.0-26-generic
Shell: bash 5.1.16
SeqKit version: 2.8.0
mamba version: 1.5.5

Aciole-David, Apr 03 '24 18:04

Yes. For plain text FASTA input, the order of output records is random.

Try compressing the merged-gnm.fa file and use merged-gnm.fa.gz as the input.

shenwei356, Apr 03 '24 19:04

Execution time increases but it works fine. Thank you!

Can we have that info in the manual? Also, it would be nice to add control over the subseq seed, right?

Thanks a lot!

gzip merged-gnm.fa

for run in {1..4}; do
seqkit subseq --quiet --bed covid19_sample-positions.bed merged-gnm.fa.gz -j 12 | head && echo "" ;
done

>covid19_sample-conserved
GCACTAACTAAGTTCCTAACCACTT
>covid19_sample-negative
GGGCATAGTAAGGCAGTTT
>covid19_sample-uncertain
TACAGTCGTTGGTCCTCG
>covid19_sample-antisense
CTTTACTGTGGACCGTGGCA
>covid19_sample-deep
GGCAGTAATAGGTCTTAATACCA

>covid19_sample-conserved
GCACTAACTAAGTTCCTAACCACTT
>covid19_sample-negative
GGGCATAGTAAGGCAGTTT
>covid19_sample-uncertain
TACAGTCGTTGGTCCTCG
>covid19_sample-antisense
CTTTACTGTGGACCGTGGCA
>covid19_sample-deep
GGCAGTAATAGGTCTTAATACCA

>covid19_sample-conserved
GCACTAACTAAGTTCCTAACCACTT
>covid19_sample-negative
GGGCATAGTAAGGCAGTTT
>covid19_sample-uncertain
TACAGTCGTTGGTCCTCG
>covid19_sample-antisense
CTTTACTGTGGACCGTGGCA
>covid19_sample-deep
GGCAGTAATAGGTCTTAATACCA

>covid19_sample-conserved
GCACTAACTAAGTTCCTAACCACTT
>covid19_sample-negative
GGGCATAGTAAGGCAGTTT
>covid19_sample-uncertain
TACAGTCGTTGGTCCTCG
>covid19_sample-antisense
CTTTACTGTGGACCGTGGCA
>covid19_sample-deep
GGCAGTAATAGGTCTTAATACCA
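
Comparing the first few records of each run by eye works for a small file; for larger outputs, a checksum comparison may be easier. The following is a minimal sketch and not part of seqkit: `same_output_each_run` is a hypothetical helper name, and it only compares standard output, so it says nothing about exit codes or messages on stderr.

```shell
# Hypothetical helper (not a seqkit feature): run a command several times and
# compare a checksum of each run's standard output against the first run.
# Usage: same_output_each_run RUNS COMMAND [ARGS...]
same_output_each_run() {
  runs=$1
  shift

  # The checksum of the first run serves as the reference.
  reference=$("$@" | cksum)

  i=2
  while [ "$i" -le "$runs" ]; do
    current=$("$@" | cksum)
    if [ "$current" != "$reference" ]; then
      echo "run $i differs from run 1" >&2
      return 1
    fi
    i=$((i + 1))
  done

  echo "all $runs runs produced identical output"
}
```

For example, `same_output_each_run 4 seqkit subseq --quiet --bed covid19_sample-positions.bed merged-gnm.fa.gz -j 12` would be expected to report identical output for the compressed input, while the same call on a plain-text FASTA file should report a difference whenever the record order changes between runs.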

Aciole-David, Apr 04 '24 02:04

Can we have that info in the manual?

Added.

Attention:
  1. When extracting with BED/GTF from plain text FASTA files, the order of output sequences
     are random. To keep the order, just compress the FASTA file (input.fasta) and use the
     compressed one (input.fasta.gz) as the input.
  2. Use "seqkit grep" for extracting subsets of sequences.
     "seqtk subseq seqs.fasta id.txt" equals to
     "seqkit grep -f id.txt seqs.fasta"

it would be nice to add control over the subseq seed

BED records are stored in a map/hash, whose iteration order is random and can't be controlled by a seed number. The logic is to quickly extract the FASTA sequences that appear in the BED file via the FASTA index.
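
Because the order comes from map iteration rather than from a seeded generator, one option when the input has to stay uncompressed is to normalize the record order after extraction. The sketch below is an illustration under stated assumptions rather than anything seqkit provides: `sort_fasta_records` is a hypothetical helper, it assumes that headers and sequence lines contain no tab characters, and it sorts records by header text, which gives a reproducible order but not the original BED or FASTA order.

```shell
# Hypothetical helper (not a seqkit feature): read FASTA from stdin and print
# the records sorted by header line, so repeated runs give the same order.
# Assumption: no tab characters occur in headers or sequence lines.
sort_fasta_records() {
  awk '
    # A header starts a new record: print the previous one, then start over.
    /^>/ { if (rec != "") print rec; rec = $0; next }
    # Join each sequence line onto its record with a tab.
    { rec = rec "\t" $0 }
    # Print the last record.
    END { if (rec != "") print rec }
  ' | LC_ALL=C sort | tr '\t' '\n'
}
```

It can be appended to an extraction pipeline, for example `seqkit subseq --quiet --bed covid19_sample-positions.bed merged-gnm.fa -j 12 | sort_fasta_records`. When the output has to follow the input order, compressing the FASTA file as described above remains the documented approach.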

shenwei356, Apr 07 '24 09:04