bambu Help replicating results from another paper

Dear Bambu team,

First of all, I have to say that I really like this tool! and your guidance is very helpful!

Now, my issue: I am comparing bambu vs FLAIR using the published data in this article. https://www.nature.com/articles/s41467-024-48615-4#Fig1 . I am trying to reproduce the figure 1E using bambu and Squanti3.

A colleague ran the alignment part and FLAIR and I am running Bambu. We are not even close to replicating their results. Interestingly for Bambu, there are no novel transcripts/genes reported, not even forcing NDR, i.e., 0.3 Screenshot 2024-11-14 at 14 17 59 . how is this possible? For FLAIR Screenshot 2024-11-14 at 14 19 23 we are getting some novel transcripts but still need to get similar results.

How is it possible that bambu is not getting any novel transcript? could I do something to enhance my code? I am adding the bam file I am using for this. The reference we are using the human gencode.v38 files. DRR465296.aligned.corrected_all_corrected_chr21.bam.gz

My code looks like:

library(bambu)

BAMlist <- "DRR465296.aligned.corrected_all_corrected_chr21.bam"
fa.file <- "GCA_000001405.15_GRCh38_no_alt_analysis_set.fna"
gtf.file <-  "gencode.v38.annotation.gtf"
bambuAnnotations <- prepareAnnotations(gtf.file)

se = bambu(reads = BAMlist, annotations = bambuAnnotations, genome = fa.file, ncore = 14)
seGene <- transcriptToGeneExpression(se)

writeBambuOutput(se, path = "MY_PATH", prefix = "fPMBC_DefNDR_NewRef_")

Thanks for any help here! Kind regards, Niko

Nov 14 '24 13:11 NikoLichi

Hi @NikoLichi ,

Thanks for reporting this to us.

I had a look at the data, it seems that the bam file is too small to have novel transcripts that pass the default filters in bambu. If you would want to obtain expected results, we would recommend you to try with a larger bam file.

If you really need to run it on a small bam file, you might want to relax a bit on the discovery threshold, for example, set NDR =1, min.readCount = 1, and fitReadClassModel = FALSE.

Let us know if this helps. Thank you Warm regards, Ying

Nov 15 '24 03:11 cying111

Dear @cying111,

Thanks a lot for your reply and hints. I agree that the sample test is very small... However, we decided to go with this test, but see PS below.

Continuing with your input, I ran 3 bambu tests, adding each one of the parameters you mentioned. Please see the attached figure in which these 15-22 novel transcripts are about 0.01% NIC (Novel In Catalog).

Only activating NDR=1 gave 15 novel transcripts (no novel genes).
NDR=1 and min.readCount = 1, gave 22 novel transcripts (no novel genes). 3.NDR =1, min.readCount = 1, and fitReadClassModel = FALSE There was no change from the previous step.

So this is better but still not comparable with the initial graph I sent you from FLAIR. Do you have any other suggestions for increasing the number of novelties?

All the best, NiKo

PS. I am running another test with a larger cohort (and larger BAM files ~1Millon reads per sample in 12 samples) re-training the bambu model. I hope these bambu options can help to increase the number of novelties and make a fully comparable test.

Nov 18 '24 11:11 NikoLichi

Hi @NikoLichi ,

Thanks for getting back with these updates.

For your question to get a higher number of novel transcripts, there are a few additional parameters you can adjust, including remove.subsetTx = FALSE, min.readFractionByGene = 0.01. If these still don't yield a sufficient number of transcripts, you could also consider relaxing min.exonOverlap.

However, please note that using these relaxed parameters may work well for smaller samples, but for larger datasets, it could lead to less specificity and potentially become infeasible. I recommend to adjust these parameters accordingly.

Thank you and looking forward to hearing back from you! Warm regards, Ying

Nov 19 '24 03:11 cying111

Hi @cying111,

Thanks again for this extra input! I am learning a lot from bambu, so thanks! I definitely agree with you that some of these parameters should not be changed for larger datasets.

I did try the min.readFractionByGene = 0.01 and indeed there are many more novel transcripts (64 more), but the remove.subsetTx = FALSE may seen a little bit too error-prone and decided not to used.

Additionally, my colleagues and I realized that there are many single-exon transcripts from FLAIR. So I gave a try to the single-exon option min.txScore.singleExon = 0 and the number was very comparable. However, when using the quantification tool (sqanti3 https://github.com/ConesaLab/SQANTI3). I needed to filter may of the novel genes and transcripts because the lack the strand option in the GTF.... Then, I also tried setting up stranded=TRUE, but the results did not change.

Since we would like to include the single-exon option in our analyses, (1) do you recommend a way to have the strand information with bambu? And (2) perhaps what would be a good filtering threshold after including them with NDR=1?

Warm regards, Niko

Nov 28 '24 14:11 NikoLichi

Hi @NikoLichi ,

Sorry for the late response! I'm glad you figured out how to include the single-exon option.

Regarding strand information, when the strand is "*", it means Bambu is uncertain about the strand of the novel transcript. In this case, you can assign "+" or "-" based on the gene strand. If the strand of the novel gene is unclear, I recommend assuming it as "+" as what I usually would do. Hope this helps!

For your second question, when NDR = 1, you can use other estimates from Bambu to identify more confident novel transcripts, such as read count, full-length read count, or unique count. Adjusting these thresholds can help determine the most reasonable criteria.

Let me know if you need additional help!

Warm regards, Ying

Jan 31 '25 08:01 cying111