Novel/Unannotated Isoforms
Hi there,
I'm trying to look at novel/unannotated isoforms detected and quantified by Bambu, and I'd like some clarification and/or advice. I get an extended annotation GTF file, and upon inspecting it, I see that it contains all the Ensembl transcript IDs that were in my reference GTF, plus several hundred Ensembl transcript IDs that were not in my reference GTF. Where is Bambu getting these Ensembl transcript IDs from? I was hoping unannotated transcripts would be given some kind of separate ID instead of being associated with existing annotations.
Hi there,
Sorry for the late response. Bambu shouldn't be able to create new Ensembl transcript IDs. All novel transcripts are assigned a transcript ID of the form tx.XXX, where XXX is an arbitrary number. They may be assigned an Ensembl gene ID if they overlap an annotated gene, but these IDs should already be present in the annotation GTF file.
If this doesn't answer your question, you could send me the files, or a screenshot of an example, so that I can look a bit deeper.
Hey Andre,
No worries, I got that part figured out. But I do have another question: I've been trying out different cutoffs for de novo isoform detection, and no matter how low I set the stringency, Bambu doesn't seem to detect more than ~20 or so unannotated transcripts in a given sample. Do you have recommendations for how to optimize discovery-mode settings to best detect unannotated transcripts?
Best,
Alex
Hi Alex,
There are many things that could influence this. Which parameters are you using for your Bambu run, and is this a standard long-read run or a custom library preparation (i.e., using primers to target specific genes)?
I would recommend just using the NDR (novel discovery rate) parameter and, if you want a more sensitive output, increasing the NDR. In a human sample we generally use between 0.1 and 0.2; however, if you are using a different species with less comprehensive annotations, you might want to increase this value. The maximum this value can be set to is 1, which means every novel transcript that has at least 2 reads will be annotated.
In the coming patch (see the BambuRevisionManuscript branch) we are also introducing a recommended-NDR feature, which will recommend an NDR value for your sample in the event that many annotations are missing.
Hey Andre,
Below is an example of how I'm running Bambu. For "x", I tried 0.01, 0.05, 0.10 (default), 0.25, 0.50, 0.75 and 1.00. The "bam_file" is from a standard long-read run. The results were identical across all NDR levels: only annotated transcripts were detected and quantified. When I instead tried adjusting minGeneFrac and minReadCount, it did detect some unannotated transcripts, but not many. Would you recommend a specific combination of arguments to tune for a more sensitive novel discovery rate?
```r
txdb <- makeTxDbFromGFF(gtf_file)
se_bambu <- bambu(reads = bam_file, annotations = txdb,
                  genome = genome_fasta_file, discovery = TRUE,
                  opt.discovery = list(NDR = x))
```
Thank you,
Alex
Hi Alex,
This is very strange; if the NDR is 1, you should be getting loads of novel transcripts, as that is maximum sensitivity.
Could you let me know how you changed minGeneFrac and minReadCount to get some unannotated transcripts? This might give me a clue as to what is occurring.
Could you also please try a run where you replace makeTxDbFromGFF(gtf_file) with prepareAnnotations(gtf_file) as a sanity check for me?
In the likely event this doesn't make a difference, could you add rcOutDir = "/some/output/directory/here" to your arguments. This should output an .rds file which contains all the read classes for your sample (essentially all possible candidates before any filtering). If you could send that to me I could have a look what might be going on.
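Concretely, the two suggestions above might look like the following sketch, reusing the variable names from the earlier call (the rcOutDir path is a placeholder):

```r
library(bambu)

# Sanity check: load annotations with bambu's own helper instead of makeTxDbFromGFF
annotations <- prepareAnnotations(gtf_file)

# rcOutDir writes the unfiltered read classes to an .rds file for later inspection
se_bambu <- bambu(reads = bam_file,
                  annotations = annotations,
                  genome = genome_fasta_file,
                  discovery = TRUE,
                  rcOutDir = "/some/output/directory/here",
                  opt.discovery = list(NDR = 1))
```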
Hey Andre,
I was getting novel transcripts with min.readCount <=5, and with min.readFractionByGene <= 0.05 I got just a few novel transcripts. I am now trying out min.readFractionByGene = 0.05 with different NDRs (0.25, 0.50, 0.75, 1.00) to see how that does. I've also added the rcOutDir argument so I'll send you one of the outputs when it's finished. The reason I'm using makeTxDbFromGFF instead of prepareAnnotations is because we're using a custom annotation file (human) and we found that it was instead importing the standard annotation.
Thanks,
Alex
Hi Alex,
The default threshold for min.readCount is 2, meaning only transcripts with 1 full-length read are filtered out via this threshold. Your threshold of 5 is therefore less sensitive than the default. (Note that it is >=, not <=.) The default min.readFractionByGene is also 0.05.
Let me know once you can upload the rcOutDir output; it gets produced before the quantification step, so it may be in the directory even before the run finishes.
I also wanted to ask about what happened with your custom annotation file when using prepareAnnotations, as it is not expected behavior for it to import anything other than the .gtf that is provided to it. How did you know it loaded a standard annotation and not your custom annotation?
Could you please check the annotations object you created for me with mcols(annotations), and double check that TXNAME and GENEID are not the same, and that there are multiple rows with different TXNAMEs that share a GENEID. We have had reports that makeTxDbFromGFF sometimes causes all transcript and gene names to be the same, resulting in no multi-isoform genes. This is a big problem for Bambu and might be the crux of the issue.
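A minimal sketch of that check, assuming annotations is the object returned by prepareAnnotations (a GRangesList whose mcols carry TXNAME and GENEID columns):

```r
# Inspect the transcript-to-gene mapping in the annotations object
meta <- mcols(annotations)
head(meta[, c("TXNAME", "GENEID")])

# TXNAME and GENEID should not be identical; ideally this is FALSE
any(meta$TXNAME == meta$GENEID)

# Count genes with more than one isoform; this should be well above zero
sum(table(meta$GENEID) > 1)
```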
Thanks, I hope we will be able to get this working for you soon!
Andre
Hey Andre,
Right, I tried several different min.readCount and min.readFractionByGene cutoffs from more to less stringent, and as I said, I was able to detect just a few novel transcripts at those cutoffs and below. It is strange that I didn't get any when increasing the NDR, even compared to the default cutoffs.
With the custom annotation, when I used prepareAnnotations I noticed the resulting txdb object contained transcript IDs that were not in my custom annotation, which is why I switched to using makeTxDbFromGFF, which works just fine for me (no problems with duplicate transcript/gene IDs, and the resulting txdb object contained multi-isoform genes).
Here's the link to one of the rcOutDir outputs with NDR = 1.0: https://drive.google.com/file/d/1KPyQG2K5XXXii03kzkZouEVNiFQRqPoE/view?usp=sharing
Thank you,
Alex
Hi Alex,
Thanks for sending the rcFile. I haven't finished investigating this, but I will update you with what I have seen so far. Bambu reports 523982 read classes (with 2 or more reads): 14283 are already annotated, and the remaining 509699 are novel candidates.
When I look at the transcript probability scores of these read classes, nearly all the novel candidates score very low, with only 2725 read classes scoring > 0.5.
After extending annotations (where additional filtering is performed, including the NDR and min.readFractionByGene thresholds), I am left with 874 novel genes in the final output with default thresholds and NDR = 1.
This was with my own human genome and annotation file (which might explain the different results). While this is higher than what you are reporting, it is still lower than I would expect, so I am going to investigate further and get back to you.
Kind Regards, Andre Sim
Do you think it's possible that many/all of the novel transcripts just aren't being included in the output from writeBambuOutput()? Because I'm primarily looking at the transcript counts and the GTF file given from that command. Also, is there a way to manually filter the rcFile myself?
Best,
Alex
Hi Alex,
Sorry again for the belated response. I was a bit busy the last week and didn't get a chance to investigate this further until today.
I found that the majority of candidate novel transcripts in this sample appear to be subsets of known transcripts, that is, transcripts that share a continuous run of exons with a known transcript and could also be produced by degradation of that transcript. By default Bambu does not output subset transcripts due to the high chance that they are false positives. However, if I set opt.discovery = list(remove.subsetTx = FALSE) and output them at an NDR of 1, I get 14606 novel transcripts.
As a side experiment I also set opt.discovery = list(min.readFractionByGene = 0). This resulted in a boost of ~5000 novel transcripts at an NDR of 1. These would be transcripts that have at least 2 reads but stem from a gene whose very high expression dwarfs the expression of these novel isoform candidates. You can combine these two options like so: opt.discovery = list(remove.subsetTx = FALSE, min.readFractionByGene = 0), which at an NDR of 1 results in 335863 novel transcripts, accounting for nearly every possible candidate.
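Putting those options together, a more permissive discovery run could look like the following sketch (reusing the earlier variable names, with NDR set to 0.2 per the recommendation of 0.1-0.2 rather than 1):

```r
se_bambu <- bambu(reads = bam_file,
                  annotations = annotations,
                  genome = genome_fasta_file,
                  discovery = TRUE,
                  opt.discovery = list(NDR = 0.2,
                                       remove.subsetTx = FALSE,      # keep subset transcripts
                                       min.readFractionByGene = 0))  # keep low-fraction isoforms
```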
Therefore, if you are looking to boost the number of novel transcripts, you could consider allowing subset transcripts to be output as well (I would recommend an NDR of 0.1-0.2, not 1).
To answer your question: it is technically possible to filter the rcFile yourself, but it is not very user friendly, and the quantification statistics are not yet included at that step. You can view the rcFile data with rowData(rcFile).
I believe it is unlikely that writeBambuOutput() is the cause. To check this, you can view the annotations before they are output by looking at the rowRanges of the se object returned by bambu, i.e. rowRanges(se). This should have the same number of annotations as the output GTF file.
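Both checks could be run like this (a sketch; the .rds filename is a placeholder for whatever file rcOutDir produced):

```r
# Read classes before any filtering/quantification (placeholder filename)
rcFile <- readRDS("/some/output/directory/here/sample_readClasses.rds")
rowData(rcFile)

# Annotations attached to the bambu output; the count should match the GTF
length(rowRanges(se_bambu))
```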
Let me know if this doesn't solve your issue and I can reopen this.
Kind Regards, Andre Sim
Awesome, thanks for checking all that, this is very helpful!
Closed #300 as completed.