nanopore-scripts Issue running expand-cigar.py

Hi,

I'm following the assembly workflow for MinION data presented in this preprint from Nick Loman, Joshua Quick and Jared Simpson. To assess read accuracy, they use your expand-cigar.py and count-errors.py scripts.

I am having problems running expand-cigar.py on my position-sorted BAM file (generated using BWA-MEM with -x ont2d). I using python version 2.7.5.

Here is the error I get:

$ python ../poreathon/analysis/expand-cigar.py --bam pass.fastq.bwa.sorted.bam --fasta ../../data/ReferenceSequence_GBJ02459.1_phagelambda/GBJ02459.1_enterobacteria_phage_lambda_genome.fna >output.txt
Traceback (most recent call last):
  File "../poreathon/analysis/expand-cigar.py", line 112, in <module>
    main(args)
  File "../poreathon/analysis/expand-cigar.py", line 62, in main
    curr_chrom_seq = get_chrom(fa, bam.getrname(curr_chrom_id))
  File "pysam/calignmentfile.pyx", line 570, in pysam.calignmentfile.AlignmentFile.getrname (pysam/calignmentfile.c:7987)
ValueError: reference_id -1 out of range 0<=tid<1

Do you have any recommendations for a python version, or any dependencies?

Thanks! Kathleen

Apr 28 '15 21:04 kthlnktng

Hi Kathleen,

This is an error being raised by pysam and it seems to be indicating that a chromosome label in one of the alignments is not found in the BAM header. I would suggest checking that you have a valid BAM header.

Apr 30 '15 15:04 arq5x

Hi Aaron,

This is my BAM header. Do you think there will be an issue with the pipes?

$ samtools view -H pass.fastq.bwa.sorted.bam
@HD VN:1.3  SO:coordinate
@SQ SN:gi|215104|gb|J02459.1|LAMCG  LN:48502
@PG ID:bwa  PN:bwa  VN:0.7.12-r1039 CL:bwa mem -t 8 -x ont2d lambda_phage.fna pass.fastq

Apr 30 '15 21:04 kthlnktng

Not sure, what do the chromosome labels for your alignments look like:

samtools view kathleen.bam | cut -f 1 | sort | uniq -c

May 04 '15 13:05 arq5x

If I run the command you suggested, I get the read names:

$ samtools view pass.fastq.bwa.sorted.bam | cut -f 1 | sort | uniq -c
      1 ch128_file14
      1 ch128_file15
<truncated>

I assume you meant cut -f 3? The SAM specification, section 1.4 states that field 3 is the RNAME, or reference sequence name.

$ samtools view pass.fastq.bwa.sorted.bam | cut -f 3  | sort | uniq -c
      1 *
    806 gi|215104|gb|J02459.1|LAMCG

There is one unmapped read, and the other reads have the same chromosome name as the BAM header.

I have tried running expand-cigar.py and count-errors.py with the unmapped read removed from the BAM (using samtools view -F 4), and the first script executes fine. The second script, count-errors.py ends in this message:

$ python ../poreathon/analysis/count-errors.py pass.fastq.bwa.sorted.nounmapped.expandedcigar.bam >pass.fastq.bwa.sorted.nounmapped.expandedcigar.bam.errors.txt
Traceback (most recent call last):
  File "../poreathon/analysis/count-errors.py", line 70, in <module>
    read_type = query.split('_')[4]
IndexError: list index out of range

I have also tried removing the supplemental alignments from the BAM file that already has the unmapped read removed. As with the previous case, expand-cigar.py works, but count-errors.py dies with an error.

I will now try changing the chromosome name to have no pipes, and re-running the scripts.

Please let me know if you have any suggestions.

May 04 '15 17:05 kthlnktng

Okay, thanks for the update. It looks like count-errors.py expects sequence names to have at least 5 "_" separated components. Based on your commands above, it looks like you have just two components (e.g., "ch128_file14"). You could modify the script to account for this, and perhaps you will be on your way. I will modify the expand-cigar.py script to ignore unmapped reads.

May 06 '15 00:05 arq5x

Can you give an example of what your input sequence names look like, as well as what portion of the name the splitting command keeps? I want to make sure that my modifications to your script will be consistent with its normal behavior.

Thanks again, Kathleen

May 06 '15 18:05 kthlnktng

I wanted to follow up again on which field I should be keeping when I modify the script. In the example of "ch128_file14", do I need "ch128" or "file14"?

Thanks, Kathleen

Jun 03 '15 20:06 kthlnktng