Issue running expand-cigar.py
Hi,
I'm following the assembly workflow for MinION data presented in this preprint from Nick Loman, Joshua Quick and Jared Simpson. To assess read accuracy, they use your expand-cigar.py and count-errors.py scripts.
I am having problems running expand-cigar.py on my position-sorted BAM file (generated using BWA-MEM with -x ont2d). I using python version 2.7.5.
Here is the error I get:
$ python ../poreathon/analysis/expand-cigar.py --bam pass.fastq.bwa.sorted.bam --fasta ../../data/ReferenceSequence_GBJ02459.1_phagelambda/GBJ02459.1_enterobacteria_phage_lambda_genome.fna >output.txt
Traceback (most recent call last):
File "../poreathon/analysis/expand-cigar.py", line 112, in <module>
main(args)
File "../poreathon/analysis/expand-cigar.py", line 62, in main
curr_chrom_seq = get_chrom(fa, bam.getrname(curr_chrom_id))
File "pysam/calignmentfile.pyx", line 570, in pysam.calignmentfile.AlignmentFile.getrname (pysam/calignmentfile.c:7987)
ValueError: reference_id -1 out of range 0<=tid<1
Do you have any recommendations for a python version, or any dependencies?
Thanks! Kathleen
Hi Kathleen,
This is an error being raised by pysam and it seems to be indicating that a chromosome label in one of the alignments is not found in the BAM header. I would suggest checking that you have a valid BAM header.
Hi Aaron,
This is my BAM header. Do you think there will be an issue with the pipes?
$ samtools view -H pass.fastq.bwa.sorted.bam
@HD VN:1.3 SO:coordinate
@SQ SN:gi|215104|gb|J02459.1|LAMCG LN:48502
@PG ID:bwa PN:bwa VN:0.7.12-r1039 CL:bwa mem -t 8 -x ont2d lambda_phage.fna pass.fastq
Not sure, what do the chromosome labels for your alignments look like:
samtools view kathleen.bam | cut -f 1 | sort | uniq -c
If I run the command you suggested, I get the read names:
$ samtools view pass.fastq.bwa.sorted.bam | cut -f 1 | sort | uniq -c
1 ch128_file14
1 ch128_file15
<truncated>
I assume you meant cut -f 3? The SAM specification, section 1.4 states that field 3 is the RNAME, or reference sequence name.
$ samtools view pass.fastq.bwa.sorted.bam | cut -f 3 | sort | uniq -c
1 *
806 gi|215104|gb|J02459.1|LAMCG
There is one unmapped read, and the other reads have the same chromosome name as the BAM header.
I have tried running expand-cigar.py and count-errors.py with the unmapped read removed from the BAM (using samtools view -F 4), and the first script executes fine. The second script, count-errors.py ends in this message:
$ python ../poreathon/analysis/count-errors.py pass.fastq.bwa.sorted.nounmapped.expandedcigar.bam >pass.fastq.bwa.sorted.nounmapped.expandedcigar.bam.errors.txt
Traceback (most recent call last):
File "../poreathon/analysis/count-errors.py", line 70, in <module>
read_type = query.split('_')[4]
IndexError: list index out of range
I have also tried removing the supplemental alignments from the BAM file that already has the unmapped read removed. As with the previous case, expand-cigar.py works, but count-errors.py dies with an error.
I will now try changing the chromosome name to have no pipes, and re-running the scripts.
Please let me know if you have any suggestions.
Okay, thanks for the update. It looks like count-errors.py expects sequence names to have at least 5 "_" separated components. Based on your commands above, it looks like you have just two components (e.g., "ch128_file14"). You could modify the script to account for this, and perhaps you will be on your way. I will modify the expand-cigar.py script to ignore unmapped reads.
Can you give an example of what your input sequence names look like, as well as what portion of the name the splitting command keeps? I want to make sure that my modifications to your script will be consistent with its normal behavior.
Thanks again, Kathleen
I wanted to follow up again on which field I should be keeping when I modify the script. In the example of "ch128_file14", do I need "ch128" or "file14"?
Thanks, Kathleen