StrainFinder Pre-processing issue

Using the AMPHORA files in the paper, and applying it to my own data, I've encountered an error with python 5.filter_np.py

python 5.filter_np.py --aln all_alignments.cPickle --map /home1/cseto/StrainFinderFiles/bact.meta.txt --samples samples.txt > filter_np.log Traceback (most recent call last): File "5.filter_np.py", line 53, in n = sum([(np.shape(x[contig])[1] - 2*args.tlen) for contig in contigs]) KeyError: '259536'

Running through the python code line by line, I find that the error is attributable to some dictionary weirdness.

Where

cmap

{'BACT_942': ['259536'], .....}

and

contigs=cmap[genome]

where genome=='BACT_852'

cmap['BACT_852'] ['411477']

Failure mode is

n = sum([(np.shape(x[contig])[1] - 2*args.tlen) for contig in contigs])

Where x is a dictionary

[[0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.], ..., [0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.]]]), 'BACT_361_infC;2048079;2048570;+;': array([[[0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.], ..., [0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.]],

[[0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.], ..., [0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.]]])}

and X['411477'] returns "KeyError" because these numbers don't point to anything,

Instead, it is actually "BACT_361_infC;2048079;2048570;+;"

x['BACT_361_infC;2048079;2048570;+;']

array([[[0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.], ..., [0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.]],

[[0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.], ..., [0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.]]])

I suspect I may have simply set something up wrong, in which case some help would be appreciated.

As far as I can tell, 5 is looking for contig numbers, but gene_files.txt that is used by 5 has duplicate first two columns; possibly derived from file 2 where it generates the gene_file by stripping sequence information from the input.

whereas in file 4, the kp.txt files, the first column seems to match the first column of gene_files.txt but the second contains contig information. But this is the only place that contains contig information, and the preceding all_alignments.cPickle can't have contig information because none of the preceding files seem to supply it.

For the dictionary x in 4, it is using an entry that doesn't have the information that 5 is looking to put in. Do I modify 5 to put in BACT_361, etc?

Jan 11 '21 00:01 ctseto

Commands: I rolled my own R1 and R2.fastq and pulled files from the dropbox. First: bwa index /home1/cseto/StrainFinderFiles/all.amphora_genes.fst

python 0.run.py --fastqs list --ref /home1/cseto/StrainFinderFiles/all.amphora_genes.fst --map /home1/cseto/StrainFinderFiles/bact.meta.txt bwa mem -a /home1/cseto/StrainFinderFiles/all.amphora_genes.fst One.fastq > One.sam bwa mem -a /home1/cseto/StrainFinderFiles/all.amphora_genes.fst Two.fastq > Two.sam python 1.filter_sam.py One.sam 90 40 > One.filter.sam python 1.filter_sam.py Two.sam 90 40 > Two.filter.sam samtools view -bS -F 4 -o One.bam One.filter.sam samtools sort One.bam -o One.sorted samtools index One.sorted.bam samtools view -bS -F 4 -o Two.bam Two.filter.sam samtools sort Two.bam -o Two.sorted samtools index Two.sorted.bam python 2.make_gene_file.py --fst /home1/cseto/StrainFinderFiles/all.amphora_genes.fst --out gene_file.txt perl 3.kpileup.pl One One.sorted.bam gene_file.txt 20 0 10 > One.kp.txt perl 3.kpileup.pl Two Two.sorted.bam gene_file.txt 20 0 10 > Two.kp.txt python 4.kp2np.py --samples samples.txt --gene_file gene_file.txt --out all_alignments.cPickle python 5.filter_np.py --aln all_alignments.cPickle --map /home1/cseto/StrainFinderFiles/bact.meta.txt --samples samples.txt --tlen 0 --faln 0.5 --mcov 10 --dcov 1.5 --npos None > filter_np.log

Jan 11 '21 00:01 ctseto

When using the example_files that come with the github:

python 0.run.py --fastqs ./AMPHORA/list --ref ./example_files/ref.fasta --map ./example_files/map.txt bwa mem -a ./example_files/ref.fasta One.fastq > One.sam bwa mem -a ./example_files/ref.fasta Two.fastq > Two.sam python 1.filter_sam.py One.sam 90 40 > One.filter.sam python 1.filter_sam.py Two.sam 90 40 > Two.filter.sam samtools view -bS -F 4 -o One.bam One.filter.sam samtools sort One.bam -o One.sorted samtools index One.sorted.bam samtools view -bS -F 4 -o Two.bam Two.filter.sam samtools sort Two.bam -o Two.sorted samtools index Two.sorted.bam python 2.make_gene_file.py --fst ./example_files/ref.fasta --out gene_file.txt perl 3.kpileup.pl One One.sorted.bam gene_file.txt 20 0 10 > One.kp.txt perl 3.kpileup.pl Two Two.sorted.bam gene_file.txt 20 0 10 > Two.kp.txt python 4.kp2np.py --samples samples.txt --gene_file gene_file.txt --out all_alignments.cPickle python 5.filter_np.py --aln all_alignments.cPickle --map ./example_files/map.txt --samples samples.txt --tlen 0 --faln 0.5 --mcov 10 --dcov 1.5 --npos None > filter_np.log KeyError: 'lacto_pgk'

Some kind of parsing issue?

Jan 11 '21 00:01 ctseto