To simulate mutations on chrY, I need to know whether the number of mutations on chrY is greater than zero using SPMG.
Dear all,
By default, SigProfilerSimulator simulates with gender='female'. I want to check whether there are mutations on chrY and, if so, run SigProfilerSimulator with gender='male'.
To do this, I called SPMG with chrom_based=True. However, I couldn't see any chromosome-based key in the returned matrices, even though I called SPMG with chrom_based=True.
Please have a look below:
matrices = matGen.SigProfilerMatrixGeneratorFunc(jobname, genome, inputDir, plot=False, seqInfo=seqInfo)
# matrices.keys(): dict_keys(['6144', '384', '1536', '96', '6', '24', '4608', '288', '18', 'DINUC', 'ID'])
chrom_based_matrices = matGen.SigProfilerMatrixGeneratorFunc(jobname, genome, inputDir, chrom_based=True, plot=False, seqInfo=seqInfo)
# chrom_based_matrices.keys(): dict_keys(['6144', '384', '1536', '96', '6', '24', '4608', '288', '18'])
My question is whether there is a way to find out the number of mutations on chrY by calling SPMG.
Thanks, Burcak
Dear @burcakotlu,
In the output directory when chrom_based is True there will be the file for the Y chromosome named similarly to example_project.SBS6.all.chrY. Each column represents a sample, so by checking whether that column is non-zero, you can determine whether there are mutations on chrY or not.
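The check described in the reply above could be sketched roughly as follows. This is only an illustration, not part of SPMG: the helper name `chry_counts_per_sample` is hypothetical, and the file layout (tab-separated, a header row with the mutation-type label followed by sample names, then one row of counts per mutation type) is an assumption that should be verified against your own output files.

```python
import csv
from pathlib import Path


def chry_counts_per_sample(matrix_path):
    """Return {sample_name: total mutation count} for one chromosome-based matrix file.

    Assumed layout (not confirmed in this thread): tab-separated text, a header
    row whose first cell labels the mutation-type column and whose remaining
    cells are sample names, then one row of counts per mutation type.
    """
    with Path(matrix_path).open(newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        samples = header[1:]
        totals = dict.fromkeys(samples, 0)
        for row in reader:
            if not row:
                continue  # tolerate blank lines
            # Column i of every row belongs to sample i; add it to that sample's total.
            for sample, value in zip(samples, row[1:]):
                totals[sample] += int(float(value))
    return totals


# Hypothetical usage; the path follows the naming pattern mentioned above.
# counts = chry_counts_per_sample("output/SBS/example_project.SBS6.all.chrY")
# has_chry_mutations = any(n > 0 for n in counts.values())
```

A sample whose total is zero has no mutations of that type on chrY; the same check would need to be repeated for the other mutation types if they matter.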
Dear @mdbarnesUCSD,
Thanks for the explanation. I was checking those files. I have the latest versions of the tools: SigProfilerMatrixGenerator 1.2.30 and SigProfilerSimulator 1.1.6
Chrom-based files were created without "chrom_based=True" with the following call: matrices = matGen.SigProfilerMatrixGeneratorFunc(jobname, genome, inputDir, plot=False, seqInfo=True)
For testing purposes, I used the following 2 vcf files:
PD39500a.caveman_strelka2_filtered.consensus_snv.vcf
PD39500a.pindel_strelka2_filtered.consensus_indel.vcf
which can be found under
/tscc/lustre/restricted/alexandrov-ddn/users/burcak/SigProfilerTopographyRuns/Mutographs_ESCC_552/test_samples
I called SPMG for these 2 vcf files. The returned matrices didn't have any key for indels. The keys are as follows: 6144, 384, 1536, 96, 6, 24, 4608, 288. But one of the vcf files contains indels, and the SPMG call did write indel output under /tscc/lustre/restricted/alexandrov-ddn/users/burcak/SigProfilerTopographyRuns/Mutographs_ESCC_552/test_samples/output/ID.
Minor: Why are chrom_based files generated even if chrom_based isn't set to True? Is it due to seqInfo=True?
Major: Why can't I get any key for indels in the returned matrices after the SPMG call?
Thanks, Burcak
Thanks for sharing the command and files for reproducing the issue. I tested with chrom_based=False and a standard matrix was returned and no chrom_based files were generated. I suspect the chrom_based files exist in your environment because they were generated during a previous run where chrom_based=True.
There is some inconsistent behavior with how the indel matrices are returned, though the matrices are written to file. I observed that when chrom_based=False the ID matrix is returned, but when chrom_based=True there is no ID matrix returned.
The issue originates from this line for indels (and also doublet base substitutions): SigProfilerMatrixGeneratorFunc.py - line 2838
- To resolve, remove single indentation
SigProfilerMatrixGeneratorFunc.py - line 2078
- To resolve, remove double indentation
We will include a patch for these in the next release.
Dear all,
I deleted all the directories and made a clean run with the following call: matrices = matGen.SigProfilerMatrixGeneratorFunc(jobname, genome, inputDir, plot=False, seqInfo=True, chrom_based=False)
DEBUG mutation_types: ['SBS', 'DBS', 'ID']
DEBUG filepath: /tscc/lustre/restricted/alexandrov-ddn/users/burcak/SigProfilerTopographyRuns/Mutographs_ESCC_552/test_samples/output/SBS/test_samples.SBS96.all.chrY
DEBUG filepath: /tscc/lustre/restricted/alexandrov-ddn/users/burcak/SigProfilerTopographyRuns/Mutographs_ESCC_552/test_samples/output/DBS/test_samples.DBS78.all.chrY
DEBUG filepath: /tscc/lustre/restricted/alexandrov-ddn/users/burcak/SigProfilerTopographyRuns/Mutographs_ESCC_552/test_samples/output/ID/test_samples.ID83.all.chrY
DEBUG chrY_num_of_mutations: 0
The matrices have keys for the SBS, DBS and ID mutation types. Chrom-based files are written at some point, but apparently later, so when I check the number of mutations on chrY, the chrY files cannot be found. The chrom-based files may be written because of seqInfo=True.
I deleted all the directories and made a clean run with the following call: matrices = matGen.SigProfilerMatrixGeneratorFunc(jobname, genome, inputDir, plot=False, seqInfo=True, chrom_based=True)
DEBUG mutation_types: ['SBS']
DEBUG filepath: /tscc/lustre/restricted/alexandrov-ddn/users/burcak/SigProfilerTopographyRuns/Mutographs_ESCC_552/test_samples/output/SBS/test_samples.SBS96.all.chrY
DEBUG filepath: /tscc/lustre/restricted/alexandrov-ddn/users/burcak/SigProfilerTopographyRuns/Mutographs_ESCC_552/test_samples/output/SBS/test_samples.SBS96.all.chrY exists
DEBUG chrY_num_of_mutations: 110
The matrices have keys only for the SBS mutation types; there is no key for the DBS or ID mutation types. Chrom-based files are written, so when I check the number of mutations on chrY, the chrY files can be found, but only for SBS.
I need the matrices for all mutation types that are present in the input, and I need access to the chromosome-based files for all of them, so that I can tell whether there are mutations on chrY.
Thanks, Burcak
It seems that there are two things that may be happening.
If you want a matrix file for the mutations on chromosome Y, then you will need to run with chrom_based=True. In this case, the chromosome-based matrix will not be returned in memory, so you will need to read it from the file output/SBS/project_name.SBS96.all.chrY.
If you want information on the mutation context for each mutation on chromosome Y, you will need to run seqInfo=True. This will produce output/vcf_files/SNV/Y_seqinfo.txt. This file will be generated regardless of whether there are mutations on chromosome Y or not.
The parameters chrom_based and seqInfo are independent of each other.
Please re-open if you are still encountering issues.
If SPMG is run with seqInfo=True and chrom_based=True, do the matrices returned by SPMG have keys for all mutation types present in the input (SBS, DBS, and ID)?
Also, can we get the number of mutations on ChrY by reading the corresponding files immediately after the SPMG call?
Thanks
The v1.2.31 release resolves the issue of the matrices not being returned. The number of mutations on chrY can be determined by reading the corresponding files immediately after the SPMG call. Thanks!
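For reference, reading the chrY files right after the SPMG call and picking the gender for the simulator might look like the sketch below. The function names `count_chry_mutations` and `choose_gender` are hypothetical, the file names follow the DEBUG output earlier in this thread, and the tab-separated layout with one header row is an assumption. Missing files are treated as zero mutations.

```python
import csv
from pathlib import Path

# Matrix names taken from the DEBUG output earlier in this thread.
CHRY_MATRICES = {"SBS": "SBS96", "DBS": "DBS78", "ID": "ID83"}


def count_chry_mutations(output_dir, project):
    """Sum all counts in the chrY matrices for SBS, DBS and ID.

    Assumes tab-separated files with a header row and the mutation type in
    the first column. A file that does not exist contributes 0.
    """
    total = 0
    for mutation_type, context in CHRY_MATRICES.items():
        path = Path(output_dir) / mutation_type / f"{project}.{context}.all.chrY"
        if not path.exists():
            continue
        with path.open(newline="") as handle:
            reader = csv.reader(handle, delimiter="\t")
            next(reader, None)  # skip the header row of sample names
            for row in reader:
                total += sum(int(float(value)) for value in row[1:])
    return total


def choose_gender(output_dir, project):
    """Return 'male' if any chrY mutation was counted, otherwise 'female'."""
    return "male" if count_chry_mutations(output_dir, project) > 0 else "female"
```

The returned string could then be passed as the gender argument of the SigProfilerSimulator call.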
I have tested the v1.2.31 release, and it works correctly on my side. Many thanks.
When we run the matrix generator with seqInfo=True and chrom_based=True, the genome-wide ".all" files are not created. For example, under output/DBS/ there are files named "jobname.DBS78.all.chr1" up to "jobname.DBS78.all.chrY", but there is no file "jobname.DBS78.all".
Is it possible to provide those files?
Thanks, Burcak
I believe that the non-chromosome-based matrices are being returned in memory, and these can be saved to a file. Alternatively, matrix generation can be run again with chrom_based=False. If neither of these solutions works, please let me know.
It would be great to get the non-chromosome-based matrices (*.all files) when SPMG is called with chrom_based=True. Paths to these matrices are required for SigProfilerAssignment calls when probability files are not provided.
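Saving the in-memory matrices could be done along these lines. This is a minimal sketch, assuming each value in the returned dict is a pandas DataFrame indexed by mutation type (so it has a `to_csv` method). The helper name `save_matrices`, the output file naming, and the "MutationType" index label are illustrative assumptions; they are not guaranteed to match the files SPMG writes itself.

```python
from pathlib import Path


def save_matrices(matrices, output_dir, project):
    """Write each returned matrix to <output_dir>/<project>.<key>.all as tab-separated text.

    `matrices` is the dict returned by the matrix generator call. The keys
    (e.g. '96', 'ID') are used verbatim in the file names, which is a
    hypothetical naming scheme and differs from SPMG's own names such as SBS96.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for key, matrix in matrices.items():
        path = out / f"{project}.{key}.all"
        # Assumes a pandas DataFrame; index_label names the mutation-type column.
        matrix.to_csv(path, sep="\t", index_label="MutationType")
        written.append(path)
    return written
```

The returned list of paths can be passed on to whichever downstream call needs a matrix file.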
Thanks, Burcak