seqtk icon indicating copy to clipboard operation
seqtk copied to clipboard

Allow for appending to sequence id with RENAME

Open deprekate opened this issue 10 years ago • 2 comments

With the RENAME function you replace the sequence id with an incrementing integer (and optional prefix).

It would be nice to be able to just append the incrementing integer to the sequence id. Other wise a separate file is needed to map an original read name to the new read name.

deprekate avatar Apr 23 '15 23:04 deprekate

@deprekate (I like that name!)

Why would you want to add an integer to the original sequence id ? The original sequence ID is already unique?

tseemann avatar Aug 18 '16 15:08 tseemann

Exactly. Ideally one of the properties of a fasta/q files is that all ids are unique. However often times people unknowingly create improper fasta/q files.

One of the main ways they do this is when they combine paired end data into a single file, without renaming the read ids of each end with the proper syntax: >read_id_1.1 and >read_id_1.2 or >read_id_1/1 and >read_id_1/2

NCBI is partly to blame for this as their SRA tool fastq-dump, does not output the ids correctly; with the suffix.

$ ./fastq-dump --split-spot --clip --maxSpotId 1 ERR712664
Read 1 spots for ERR712664
Written 1 spots for ERR712664

$ cat ERR712664.fastq 
@ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
TCTTCGCTGTTACCTCTGGAAAGAATCGCCTCGCGAAAACGCCGCCCATTTTCACGCGTTAATCCGCCCTGCTCAACAAACCACTGGTAACCATCATCGGCCAACATTTGCGTCCACAGATAAGCGTAATAACCTGCGGCATATCCGCCACCAAAAATATGGGCGAAATAACTGCTGCGATAGCGTGGCGGTATAGCAGGAAGATCCATATTTTCCGCCACCAGCGCCCGCAATTCGCAATCATCGACAT
+ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
A1>A11>A1>D3AABBE11111000BA0AAAE000////B/AA/AAE/BFGGF2@/>/>>//BFF/>EEE0FFF11B1/0?F/FF10B12BF1FF1FFB/<B//?/?FG2?/?@@/?101?111<->A.<11>FF.<--<-;0CGF.CA?AG.....90;00.:----:00;C09FB0--99-:-;--9--9-9//9//-----99B/9/:BBBB9-9;-;A--9-99@-9--9;:----:9/99//;-;
@ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
TTACACGAAGCAATGCTGGTTGTCGTTGTTTTCGAATTGCGGGCGCTGGTGGCGGACCATATGGCTCTTCCTGCTATACCGCCACGCTCTCGCAGCCGTTATTTCGCCCCTATTTTTGTTGCCGCATATGCCGCAGGTTATTACGCTTCTCTGTTGTCGCCCATGTTGGCCGCTGCTGGTTACCTGTGGTTTGTTGTGCCGGCCGGATTTACGCGTGATCCTTGGCGGCGTTTTCGCGTGGCGTTTCTTT
+ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
11111@1111111A1A311010AA00A00AABA000AA1B///A/AA//B0/B//////@1B21/0>BF>EE11B1>2>2/<</<//</?<////<//??/??FD0<<<-<.<111=-00.0<---..00<:-----.:0::09..9;.9900900/9.....//000./9-----;//99/////9/-999------//--------9//9------///;////--------/-----------///9

You will notice that in this fastq file there are two reads with the same id (@ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841)


You can dump each read end into two different files:

$ ./fastq-dump --split-files --clip --maxSpotId 1 ERR712664
Read 1 spots for ERR712664
Written 1 spots for ERR712664

$ cat ERR712664_1.fastq 
@ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
TCTTCGCTGTTACCTCTGGAAAGAATCGCCTCGCGAAAACGCCGCCCATTTTCACGCGTTAATCCGCCCTGCTCAACAAACCACTGGTAACCATCATCGGCCAACATTTGCGTCCACAGATAAGCGTAATAACCTGCGGCATATCCGCCACCAAAAATATGGGCGAAATAACTGCTGCGATAGCGTGGCGGTATAGCAGGAAGATCCATATTTTCCGCCACCAGCGCCCGCAATTCGCAATCATCGACAT
+ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
A1>A11>A1>D3AABBE11111000BA0AAAE000////B/AA/AAE/BFGGF2@/>/>>//BFF/>EEE0FFF11B1/0?F/FF10B12BF1FF1FFB/<B//?/?FG2?/?@@/?101?111<->A.<11>FF.<--<-;0CGF.CA?AG.....90;00.:----:00;C09FB0--99-:-;--9--9-9//9//-----99B/9/:BBBB9-9;-;A--9-99@-9--9;:----:9/99//;-;

$ cat ERR712664_2.fastq 
@ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
TTACACGAAGCAATGCTGGTTGTCGTTGTTTTCGAATTGCGGGCGCTGGTGGCGGACCATATGGCTCTTCCTGCTATACCGCCACGCTCTCGCAGCCGTTATTTCGCCCCTATTTTTGTTGCCGCATATGCCGCAGGTTATTACGCTTCTCTGTTGTCGCCCATGTTGGCCGCTGCTGGTTACCTGTGGTTTGTTGTGCCGGCCGGATTTACGCGTGATCCTTGGCGGCGTTTTCGCGTGGCGTTTCTTT
+ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
11111@1111111A1A311010AA00A00AABA000AA1B///A/AA//B0/B//////@1B21/0>BF>EE11B1>2>2/<</<//</?<////<//??/??FD0<<<-<.<111=-00.0<---..00<:-----.:0::09..9;.9900900/9.....//000./9-----;//99/////9/-999------//--------9//9------///;////--------/-----------///9

And now you need to rename each read and then combine the files, ie:

perl -pi -e 's/ /.1 /' file_1.fq
perl -pi -e 's/ /.2 /' file_2.fq
cat file_1 file_2 > file.fq

But people don't always follow these steps.


What I have to do now is after running seqtk seq -a in.fq.gz > out.fa, I then have to parse the out.fa file again and add an incrementing integer before the first space (actually I never enabled this, i just output an error message and make them fix their data).

It would be nice if seqtk could do this adding a rolling integer. Seqtk does allow for a complete rename with an incrementing integer, but then if I want to map a read at the end of my pipeline I need to keep track of which RENAME value points to which original read id.

(also thanks ya my handle is pretty nifty, a play on words/myname)

deprekate avatar Aug 19 '16 01:08 deprekate