cell barcodes missing in output after samtools sort by tag

Open whitleyo opened this issue 6 years ago • 2 comments

samtools version: samtools 1.9 Using htslib 1.9 Copyright (C) 2018 Genome Research Ltd.

operating system: I don't know which node my job ran on, but looking through my emails, all of them appear to be CentOS 6.

uname -a: Linux node60.uhnh4h.cluster 2.6.32-642.15.1.el6.x86_64 #1 SMP Fri Feb 24 14:31:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Problem:

I wanted to sort a BAM file by cell barcode. The file comes from a 10X scRNA experiment (the tags are further described at the 10X website), and its entries look like the following, with the cell barcode in the CB tag:

D00353:193:CAU8LANXX:3:1203:11737:15281 16 1 11775 1 98M * 0 0 TGACTGCGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCATGTAGTTTAAACGAGATTGCCAGCACCGGGTATCATTCACCATTTTTCTTTTCGTTA FGGGGGGGGGGGGGGGGGG>GEGGGGGGGGGGGGGGGGGGGGGFGGGGGFGGGGGGGGGGGGGGGGGF@=1GGGCGGGB<C>EGGGGGGGGGGB@CBB NH:i:4 HI:i:1 AS:i:96 nM:i:0 RE:A:I CR:Z:TGACGGCTCGGAAATA CY:Z:CCCCCGGGGGGDGGGG CB:Z:TGACGGCTCGGAAATA-1 UR:Z:TTTTGCCTTT UY:Z:GGGGGGGGGG UB:Z:TTTTGCCTTT BC:Z:GCTACCTG QT:Z:CCCCCGGG RG:Z:G620_T:MissingLibrary:1:CAVC9ANXX:3-42CBADEC

Here's the command:

samtools sort -t CB -@ $threads -m $max_mem -o $CB_sorted_bam $bam_file

$threads was set to 1 and $max_mem to 20G.

samtools sort ran without error

Here's the output from my script (print statements showing start and end):

samtools: sorting by cell barcode then position
Thu Jun 27 16:37:52 EDT 2019
finished sorting by cell barcode
Thu Jun 27 18:16:20 EDT 2019

Here's the stderr:

[bam_sort_core] merging from 4 files and 1 in-memory blocks...

The output file has entries like this:

D00353:192:CAVC9ANXX:4:2115:7079:80990 272 1 14471 0 98M * 0 0 CAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGGCAGACAGAAGTCCCCGCCCCAGCTGTGTGGCCTCAAGCCAGCCTTCCGCTCCTTGAAGCTGG .BGE.<GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGDDGC<GFF>BGGGGGGGF=1>C>ECBC@BBGGE;=0FEGDGFF=100><3 NH:i:5 HI:i:2 AS:i:96 nM:i:0 RE:A:I CR:Z:TGAGAGGATTGCAGGA CY:Z:CCCCCGGGGGGGGGFG UR:Z:GTGTGATCTT UY:Z:GDGGGGGGGG UB:Z:GTGTGATCTT BC:Z:AAAGTGCT QT:Z:CCCCCGGG RG:Z:G620_T:MissingLibrary:1:CAVC9ANXX:4

Note that there is no CB tag now. Is this expected behavior of sort -t?

Thanks

whitleyo, Jun 28 '19 14:06

No, it's not expected. What version of samtools are you using, and would it be possible for you to make a minimal file that reproduces the problem?

daviesrob, Jun 28 '19 16:06

Having looked at this a bit more closely, I've noticed that the FLAG on your output file example is 272, which means REVERSE,SECONDARY. So it's possible that whatever produced the alignment didn't put CB:Z: tags on the secondary alignments. If that is the case, you will see those records at the beginning of the sorted output file, because the -t CB option puts alignments with a missing CB:Z: tag before any that have it.
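To make the FLAG reading above concrete, here is a minimal sketch in plain POSIX shell that splits a FLAG integer into its bits. The function name decode_flag is hypothetical; the bit names follow the ones samtools uses, and running samtools flags 272 should report the same names.

```shell
# Decode a SAM FLAG integer into comma-separated flag names.
# decode_flag is a hypothetical helper, not part of samtools.
decode_flag() {
    flag=$1
    bit=1      # value of the bit currently being tested (1, 2, 4, ...)
    out=""
    for name in PAIRED PROPER_PAIR UNMAP MUNMAP REVERSE MREVERSE \
                READ1 READ2 SECONDARY QCFAIL DUP SUPPLEMENTARY; do
        if [ $(( flag & bit )) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
        bit=$(( bit * 2 ))
    done
    printf '%s\n' "$out"
}

decode_flag 272   # prints: REVERSE,SECONDARY
```

Since 272 = 256 + 16, the record is a secondary alignment (0x100) on the reverse strand (0x10), whereas the input example with FLAG 16 is a primary alignment.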

To check, look at your input file using samtools view $bam_file | grep 'D00353:192:CAVC9ANXX:4:2115:7079:80990' and see which lines for this read name include a CB:Z: tag.
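As a variation on that grep, the sketch below prints only the FLAG and a yes/no for the CB:Z: tag for each record of one read, which is easier to scan than full SAM lines. The function name cb_report is hypothetical; it assumes tab-separated SAM text on standard input, such as the output of samtools view.

```shell
# Print "FLAG<TAB>yes|no" for every record of one read name, where yes/no
# says whether the record carries a CB:Z: tag. cb_report is a hypothetical
# helper that reads SAM text (no header needed) on stdin.
#
# Assumed usage, with samtools on PATH:
#   samtools view "$bam_file" | cb_report 'D00353:192:CAVC9ANXX:4:2115:7079:80990'
cb_report() {
    awk -F '\t' -v name="$1" '
        $1 == name {
            has_cb = "no"
            # Optional tags start at column 12 of a SAM record.
            for (i = 12; i <= NF; i++)
                if ($i ~ /^CB:Z:/) has_cb = "yes"
            print $2 "\t" has_cb
        }'
}
```

If the hypothesis holds, the primary record for the read (for example FLAG 0 or 16) would show yes, and the records with the 0x100 bit set (for example 256 or 272) would show no.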

If the secondary alignments are missing CB:Z: then you'll need to either filter them out with something like samtools view -F 0x100 or ask the author of the software that produced the alignment file to ensure the tags get added to all of the alignment records and not just the primary ones.
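Before filtering, it may be worth tallying the whole file to see whether the missing tags really line up with secondary alignments. The sketch below is one way to do that; the function name cb_tally and its output labels are hypothetical, and it assumes SAM text on standard input.

```shell
# Count records by primary/secondary status and by presence of a CB:Z: tag.
# cb_tally is a hypothetical helper that reads SAM text on stdin.
#
# Assumed usage, with samtools on PATH:
#   samtools view "$bam_file" | cb_tally
cb_tally() {
    awk -F '\t' '
        /^@/ { next }    # skip header lines if samtools view -h was used
        {
            # Bit 0x100 (256) of the FLAG marks a secondary alignment.
            sec = (int($2 / 256) % 2 == 1) ? "secondary" : "primary"
            cb = "without_CB"
            for (i = 12; i <= NF; i++)
                if ($i ~ /^CB:Z:/) { cb = "with_CB"; break }
            n[sec, cb]++
        }
        END {
            split("primary secondary", kinds, " ")
            split("with_CB without_CB", tags, " ")
            for (k = 1; k <= 2; k++)
                for (t = 1; t <= 2; t++)
                    printf "%s\t%s\t%d\n", kinds[k], tags[t], n[kinds[k], tags[t]]
        }'
}
```

If the "primary without_CB" count is also non-zero, the secondary alignments would not be the whole explanation, and filtering with -F 0x100 alone would still leave untagged records at the start of the sorted output.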

daviesrob, Jul 01 '19 13:07