SeqLib icon indicating copy to clipboard operation
SeqLib copied to clipboard

SeqLib::BamWriter.WriteRecord() seems to be truncating the qname field

Open bkohrn opened this issue 7 years ago • 2 comments

When I create a new bam record, the new bam record seems to truncate the read name at 23 characters. I am using the following code (note that this is part of a larger program; I just stopped it in the middle to check the file quickly):

SeqLib::BamReader in_bam_file;
SeqLib::BamWriter temp_bam = SeqLib::BamWriter(SeqLib::SAM);
in_bam_file.Open(input);
bool testOpen = temp_bam.Open(prefix + ".temp.bam");
temp_bam.SetHeader(in_bam_file.Header());
temp_bam.WriteHeader(); 
SeqLib::BamRecord line;
SeqLib::BamRecord temp_read1_entry;
SeqLib::BamRecord temp_bam_entry;
while (in_bam_file.GetNextRecord(line)) {
    if (paired_end_count % 2 == 1) {
        temp_read1_entry = line;
    }
    if (paired_end_count % 2 == 0) {
        std::string tagName;
        SeqLib::BamRecord temp_bam_entry;
        if (temp_read1_entry.Sequence().substr(0,taglen) + temp_read1_entry.Sequence().substr(taglen + spacerlen, loclen) > line.Sequence().substr(0,taglen) + line.Sequence().substr(taglen + spacerlen, loclen)) {
            tagName = temp_read1_entry.Sequence().substr(0,taglen) + \
            temp_read1_entry.Sequence().substr(taglen + spacerlen, loclen) + \
            line.Sequence().substr(0,taglen) + \
            line.Sequence().substr(taglen + spacerlen, loclen) + "#ab";
        }
        else if (temp_read1_entry.Sequence().substr(0,taglen) + temp_read1_entry.Sequence().substr(taglen + spacerlen, loclen) < line.Sequence().substr(0,taglen) + line.Sequence().substr(taglen + spacerlen, loclen)) {
            tagName = line.Sequence().substr(0,taglen) + \
            line.Sequence().substr(taglen + spacerlen, loclen) + \
            temp_read1_entry.Sequence().substr(0,taglen) + \
            temp_read1_entry.Sequence().substr(taglen + spacerlen, loclen) + "#ba";
        }
        else if (temp_read1_entry.Sequence().substr(0,taglen) + temp_read1_entry.Sequence().substr(taglen + spacerlen, loclen) == line.Sequence().substr(0,taglen) + line.Sequence().substr(taglen + spacerlen, loclen)) {
            paired_end_count++;
            continue;
        }
        // Write entries for Read 1
        std::string tempQualities = temp_read1_entry.Qualities().substr(taglen + spacerlen);
        temp_bam_entry = temp_read1_entry;
        temp_bam_entry.AddZTag("X?", temp_read1_entry.Qname());
        temp_bam_entry.SetQname("abcdefghijklmnopqrstuvwxyz");
        temp_bam_entry.SetSequence(temp_read1_entry.Sequence().substr(taglen + spacerlen));
        temp_bam_entry.SetQualities(tempQualities, 33);
        temp_bam.WriteRecord(temp_bam_entry);
       
        // Write entries for Read 2
        tempQualities = line.Qualities().substr(taglen + spacerlen);
        temp_bam_entry = line;
        temp_bam_entry.AddZTag("X?", line.Qname());
        temp_bam_entry.SetQname("abcdefghijklmnopqrstuvwxyz");
        temp_bam_entry.SetSequence(line.Sequence().substr(taglen + spacerlen));
        temp_bam_entry.SetQualities(tempQualities, 33);
        temp_bam.WriteRecord(temp_bam_entry);
        temp_bam.Close();
        abort();

When I look at the resulting file, I see this:

@HD VN:1.5 SO:queryname @RG ID:A SM:testNonRef abcdefghijklmnopqrstuvw 77 * 0 0 * * 0 0 GGGGTGCCCCCAAAGCGGGCCCGTGGGGGCCTCTGGGATGATGATGATGCACCTCGGCCATCCCAATTCGAGGAGGACCTGGCA DGHIIIIIIIIIIHIIIIIHHHIIHIIIGIIIIIIIIIHIIIIIIIIIIIIIIFIIGIIIGIIIIIHHIIIHIIIIIIGIHHII RG:Z:A X?:Z:HISEQ_0537:1:1101:10000:20114#ACGCTC abcdefghijklmnopqrstuvw 141 * 0 0 * * 0 0 AAGAGAATGAGGGTCAAGGGTCAGGGCCAGAATCACTGTAGGCCACACAGGCTGTACGGGGTGGAAGGGCAGTCCTTTGTAGGA DGHHIIIIIIIHIHHIIIIIIIIIGHIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIGHIIIIIHIHHIIIIIFHFGHHIG RG:Z:A X?:Z:HISEQ_0537:1:1101:10000:20114#ACGCTC

Based on my understanding, the read name should be "abcdefghijklmnopqrstuvwxyz", but it isn't.

Thanks for any help you can give, Brendan

bkohrn avatar Dec 07 '18 01:12 bkohrn

That looks like a valid BAM to me, albeit unaligned to a genome. The 77 * 0 0 * * 0 0 part are columns 2-9 of the BAM/SAM specification. see section 1.4.

EDIT: Ignore above, I see what you're saying now, it's missing xyz

martinjvickers avatar Dec 07 '18 09:12 martinjvickers

I'm not entirely sure what is causing it for you, but I've just edited a program I made to see if SetQname is truncating but it doesn't seem to be;

BamRecordVector results;
bwa_ga.AlignSequence(seq, id, results, hardclip, secondary_cutoff, secondary_cap);

for(auto& i : results)
{
   i.SetQname("abcdefghijklmnopqrstuvwxyz");
   writer.WriteRecord(i);
}
abcdefghijklmnopqrstuvwxyz      16      chr4    5360249 60      101M    *       0       0       TAAATTAGGTTTGGATTTTGTGATAAGTTTAATTAATTATATTTATTTTAATGTTACGATGTTATATTATANTTTAGNATGCACTTAATNNNAGTNCNNTA   *       NA:i:2  NM:i:9  SB:i:0  AS:i:80

martinjvickers avatar Dec 07 '18 10:12 martinjvickers