cell barcodes missing in output after samtools sort by tag
samtools version: samtools 1.9 Using htslib 1.9 Copyright (C) 2018 Genome Research Ltd.
operating system: I don't know which node my job was run on, but looking through my emails it appears all of them are CentOS 6. Output of uname -a: Linux node60.uhnh4h.cluster 2.6.32-642.15.1.el6.x86_64 #1 SMP Fri Feb 24 14:31:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Problem:
I wanted to sort a BAM file by cell barcode. The file comes from a 10X scRNA experiment, and the tags are further described on the 10X website.
The cell barcode is stored in the CB tag, and entries in my input file look like this:
D00353:193:CAU8LANXX:3:1203:11737:15281 16 1 11775 1 98M * 0 0 TGACTGCGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCATGTAGTTTAAACGAGATTGCCAGCACCGGGTATCATTCACCATTTTTCTTTTCGTTA FGGGGGGGGGGGGGGGGGG>GEGGGGGGGGGGGGGGGGGGGGGFGGGGGFGGGGGGGGGGGGGGGGGF@=1GGGCGGGB<C>EGGGGGGGGGGB@CBB NH:i:4 HI:i:1 AS:i:96 nM:i:0 RE:A:I CR:Z:TGACGGCTCGGAAATA CY:Z:CCCCCGGGGGGDGGGG CB:Z:TGACGGCTCGGAAATA-1 UR:Z:TTTTGCCTTT UY:Z:GGGGGGGGGG UB:Z:TTTTGCCTTT BC:Z:GCTACCTG QT:Z:CCCCCGGG RG:Z:G620_T:MissingLibrary:1:CAVC9ANXX:3-42CBADEC
Here's the command:
samtools sort -t CB -@ $threads -m $max_mem -o $CB_sorted_bam $bam_file
$threads was set to 1 and $max_mem to 20G.
samtools sort ran without error.
Here's the output from my script (print statements showing start and end):
samtools: sorting by cell barcode then position
Thu Jun 27 16:37:52 EDT 2019
finished sorting by cell barcode
Thu Jun 27 18:16:20 EDT 2019
Here's the stderr:
[bam_sort_core] merging from 4 files and 1 in-memory blocks...
The output file has entries like this:
D00353:192:CAVC9ANXX:4:2115:7079:80990 272 1 14471 0 98M * 0 0 CAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGGCAGACAGAAGTCCCCGCCCCAGCTGTGTGGCCTCAAGCCAGCCTTCCGCTCCTTGAAGCTGG .BGE.<GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGDDGC<GFF>BGGGGGGGF=1>C>ECBC@BBGGE;=0FEGDGFF=100><3 NH:i:5 HI:i:2 AS:i:96 nM:i:0 RE:A:I CR:Z:TGAGAGGATTGCAGGA CY:Z:CCCCCGGGGGGGGGFG UR:Z:GTGTGATCTT UY:Z:GDGGGGGGGG UB:Z:GTGTGATCTT BC:Z:AAAGTGCT QT:Z:CCCCCGGG RG:Z:G620_T:MissingLibrary:1:CAVC9ANXX:4
Note that there is no CB tag now. Is this expected behavior of sort -t?
Thanks
No, it's not expected. What version of samtools are you using, and would it be possible for you to make a minimal file that reproduces the problem?
Having looked at this a bit more closely, I've noticed that the FLAG on your example output record is 272, which means REVERSE,SECONDARY. So it's possible that whatever produced the alignments didn't put CB:Z: tags on the secondary alignments. If that is the case, you will see those records at the beginning of the sorted output file, because the -t CB option puts alignments with a missing CB:Z: tag before any that have it.
One way to check is to run samtools view $bam_file | grep 'D00353:192:CAVC9ANXX:4:2115:7079:80990' on your input file and see which records for this read name include a CB:Z: tag.
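If the grep output is hard to read because the records are so long, a small awk filter can reduce each matching record to its FLAG and a yes/no for the CB tag. This is only a sketch: `cb_for_read` is a hypothetical helper name, and it assumes tab-separated SAM text on stdin, as produced by samtools view.

```shell
# Hypothetical helper (not part of samtools): for one read name, print the
# FLAG of each record and whether it carries a CB:Z: tag.
# Reads tab-separated SAM text on stdin.
cb_for_read() {
    awk -F'\t' -v name="$1" '
        $1 == name {
            has_cb = "no"
            # Optional tags start at column 12 of a SAM record.
            for (i = 12; i <= NF; i++)
                if ($i ~ /^CB:Z:/) has_cb = "yes"
            printf "%s\tflag=%d\tCB=%s\n", $1, $2, has_cb
        }'
}
```

It could be used as samtools view $bam_file | cb_for_read 'D00353:192:CAVC9ANXX:4:2115:7079:80990'. Unlike grep, this matches the read name exactly in the first column, so it will not pick up other records that merely contain the same string.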
If the secondary alignments are missing CB:Z:, you'll need to either filter them out with something like samtools view -F 0x100, or ask the author of the software that produced the alignment file to ensure the tags get added to all of the alignment records and not just the primary ones.
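Before deciding between those two options, it may help to tally the whole input file and confirm that the records without CB:Z: really are confined to secondary alignments. The sketch below does that with awk on SAM text. `cb_tally` is a hypothetical helper name, and "non_secondary" here simply means the 0x100 bit is not set in FLAG.

```shell
# Hypothetical helper (not part of samtools): count records by whether the
# SECONDARY bit (0x100) is set in FLAG and whether a CB:Z: tag is present.
# Reads tab-separated SAM text on stdin.
cb_tally() {
    awk -F'\t' '
        /^@/ { next }   # skip header lines, if any were included
        {
            # Test bit 0x100 with integer arithmetic; portable awk has no bitwise AND.
            kind = (int($2 / 256) % 2 == 1) ? "secondary" : "non_secondary"
            cb = "without_CB"
            # Optional tags start at column 12 of a SAM record.
            for (i = 12; i <= NF; i++)
                if ($i ~ /^CB:Z:/) cb = "with_CB"
            n[kind "\t" cb]++
        }
        END {
            # Print all four combinations in a fixed order, including zero counts.
            split("non_secondary secondary", kinds, " ")
            split("with_CB without_CB", tags, " ")
            for (k = 1; k <= 2; k++)
                for (t = 1; t <= 2; t++)
                    printf "%s\t%s\t%d\n", kinds[k], tags[t], n[kinds[k] "\t" tags[t]]
        }'
}
```

Run as samtools view $bam_file | cb_tally. If the only non-zero "without_CB" count is on the secondary row, filtering and sorting in one pass should be possible with something like samtools view -b -F 0x100 $bam_file | samtools sort -t CB -o $CB_sorted_bam - (untested here; it assumes your samtools build reads BAM from stdin via "-"). If non-secondary records also lack the tag, dropping secondary alignments alone will not remove every untagged record.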