
# Pre-compute available?

RachelKarchin opened this issue · 14 comments

Beautiful work! Have you done a pre-compute? If so, we'd love to add OpenSpliceAI as an annotator to OpenCRAVAT.

RachelKarchin · Mar 25 '25

Dear Professor Karchin,

Thank you for reaching out. I am excited about integrating OpenSpliceAI as an annotator for OpenCRAVAT and believe it is a great fit for the task. I have sent you an email with further details and look forward to our discussion.

Best, Kuan-Hao

Kuanhao-Chao · Mar 26 '25

Hi @Kuanhao-Chao, this is fantastic work! I'd be very interested in including the precomputed OpenSpliceAI scores in GeneBe as well. If precomputed scores are available or in progress, I’d appreciate it if you could let me know how best to access or coordinate on them.

Looking forward to seeing how this develops!

pstawinski · May 09 '25

Can I ask what you mean by pre-compute? Do you mean the pre-trained weights or the predicted results? I am currently working on reproducing the SpliceAI model, and we guarantee the same output as the original TensorFlow implementation.

ZhiyuanChen · May 12 '25

Dear @ZhiyuanChen ,

Illumina provides precomputed SpliceAI scores for all exonic SNVs and small indels. Generating these scores requires significant computational resources, but they are very useful for end users, who then don't need to compute scores for the most common variant types themselves, saving both time and effort. By the way, I believe the limited popularity of Pangolin and CI-SpliceAI is due in part to the lack of precomputed scores.

pstawinski · May 12 '25

> Dear @ZhiyuanChen,
>
> Illumina provides precomputed SpliceAI scores for all exonic SNVs and small indels. Generating these scores requires significant computational resources, but they are very useful for end users, who don't need to compute scores for the most common types of variants themselves—saving both time and effort. Btw. I believe the limited popularity of Pangolin and CI-SpliceAI is due to the lack of precomputed scores.

Thank you for providing this additional context; it has been very helpful. SpliceAI is indeed a very computationally intensive model -- I was surprised to see that it takes 70+ GFLOPs for such a tiny model.

However, despite the relatively high computational cost, its memory footprint is very small: one can run it on almost any PC, and generating a prediction takes less than one second, which means only about ten minutes for a thousand sequences. May I ask how many sequences you need precomputed?
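The arithmetic above can be sanity-checked with a quick sketch (the per-prediction latency is an assumed figure for illustration, not a measured benchmark):

```python
# Back-of-the-envelope inference runtime, assuming a hypothetical
# per-prediction latency of 0.6 s on a single machine.
SECONDS_PER_PREDICTION = 0.6

def estimated_minutes(n_sequences: int) -> float:
    """Estimated wall-clock minutes to score n_sequences one at a time."""
    return n_sequences * SECONDS_PER_PREDICTION / 60

print(f"1,000 sequences: ~{estimated_minutes(1_000):.0f} min")
print(f"3.4e9 SNVs: ~{estimated_minutes(3_433_384_833) / (60 * 24):.0f} days")
```

At single-machine speed a thousand sequences are trivial, but the billions of variants in Illumina's precomputed files would take decades of serial compute, which is exactly why the scores are shared rather than recomputed per user.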

ZhiyuanChen · May 12 '25

In a single human whole-genome sequencing experiment, you typically identify several million variants. Even after excluding variants that are far from transcripts and very common in the population, you're still left with hundreds of thousands of variants in a single sample. Most of these remaining variants are still relatively common, but their allele frequencies are not high enough to filter them out confidently.

Unless there are precomputed pathogenicity scores or the user has built some form of cache, the scores must be recomputed for each variant from scratch. This is why precomputed scores are so valuable: they reduce the number of variants that require analysis-time scoring from hundreds of thousands to just thousands per sample.
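The caching idea can be sketched as a simple lookup-with-fallback; the variant key layout and the `compute_score` callback below are hypothetical placeholders, not an actual OpenCRAVAT or GeneBe API:

```python
from typing import Callable, Dict, List, Tuple

VariantKey = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)

def score_variants(
    variants: List[VariantKey],
    precomputed: Dict[VariantKey, float],
    compute_score: Callable[[VariantKey], float],
) -> Dict[VariantKey, float]:
    """Use the precomputed table where possible; only cache misses
    (typically rare or novel variants) fall through to model inference."""
    scores: Dict[VariantKey, float] = {}
    for v in variants:
        hit = precomputed.get(v)
        scores[v] = hit if hit is not None else compute_score(v)
    return scores

# Toy usage: two of three variants are precomputed; one needs inference.
table = {("chr1", 100, "A", "G"): 0.91, ("chr1", 200, "C", "T"): 0.02}
result = score_variants(
    [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 5, "G", "GA")],
    table,
    compute_score=lambda v: 0.5,  # stand-in for actual model inference
)
```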

Ideally, it would be beneficial to have precomputed scores for all SNVs within the exome (possibly restricted to MANE transcripts), including nearby intronic regions (perhaps a few hundred bases on either side). Including the 5′ UTR would also be helpful. If feasible, small insertions and deletions (InDels) would be valuable as well—such as single-base insertions and deletions of 1, 2, or 3 bases—similar to what Illumina has done.
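The SNV portion of the variant set described above (every substitution across each exon plus flanking intronic bases) is straightforward to enumerate; a minimal sketch with toy coordinates, assuming the reference sequence has already been extracted from a FASTA:

```python
BASES = "ACGT"

def enumerate_snvs(chrom: str, start: int, ref_seq: str):
    """Yield (chrom, pos, ref, alt) for every possible SNV across ref_seq.
    In practice ref_seq would cover the exon plus a few hundred intronic
    bases on each side, pulled from the reference genome."""
    for offset, ref in enumerate(ref_seq):
        for alt in BASES:
            if alt != ref:
                yield (chrom, start + offset, ref, alt)

# Each position has 3 alternate alleles, so a 3 bp window yields 9 SNVs.
snvs = list(enumerate_snvs("chr1", 1000, "ACG"))
```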

Here are the variant counts in Illumina’s SpliceAI precomputed score files:

```shell
$ for i in *; do echo $i; lbzip2 -cd $i | wc -l; done
spliceai_scores.masked.indel.hg38.vcf.gz.tsv.bz2
9155870805
spliceai_scores.masked.snv.hg38.vcf.gz.tsv.bz2
3433384833
```

pstawinski · May 12 '25

Hi @pstawinski , thanks for flagging this. We’re actively extending our precomputed OpenSpliceAI scores to cover all exonic SNVs, nearby intronic regions, etc. We’ll let you know as soon as the precomputed scores are available!

Kuanhao-Chao · May 13 '25

> In a single human whole-genome sequencing experiment, you typically identify several million variants. Even after excluding variants that are far from transcripts and very common in the population, you're still left with hundreds of thousands of variants in a single sample. Most of these remaining variants are still relatively common, but their allele frequencies are not high enough to filter them out confidently.
>
> Unless there are precomputed pathogenicity scores or the user has built some form of cache, the scores must be recomputed for each variant from scratch. This is why precomputed scores are so valuable: they reduce the number of variants that require analysis-time scoring from hundreds of thousands to just thousands per sample.
>
> Ideally, it would be beneficial to have precomputed scores for all SNVs within the exome (possibly restricted to MANE transcripts), including nearby intronic regions (perhaps a few hundred bases on either side). Including the 5′ UTR would also be helpful. If feasible, small insertions and deletions (InDels) would be valuable as well—such as single-base insertions and deletions of 1, 2, or 3 bases—similar to what Illumina has done.
>
> Here are the variant counts in Illumina’s SpliceAI precomputed score files:
>
> ```shell
> $ for i in *; do echo $i; lbzip2 -cd $i | wc -l; done
> spliceai_scores.masked.indel.hg38.vcf.gz.tsv.bz2
> 9155870805
> spliceai_scores.masked.snv.hg38.vcf.gz.tsv.bz2
> 3433384833
> ```

Thank you for the detailed response; it has been very valuable to me, as I'm not familiar with the downstream use cases. What about the accuracy of the results? SpliceAI is an older model, so its results are less promising. In our internal testing, SpliceAI achieves only about 0.4792 AUPRC. I haven't had the opportunity to test the new OpenSpliceAI by @Kuanhao-Chao (thank you for this tool; I'm sure it will be very useful), but most deep learning methods (like Ernie-RNA) can easily reach 0.55 or higher while running much faster (at least 3x). Is there a "cut-off" value above which scores are considered accurate enough, or is it simply the higher the better?

ZhiyuanChen · May 13 '25

> Is there any "cut-off" value to consider they are accurate enough, or are they the higher the better?

This question goes far beyond the scope of our current discussion. The topic is quite complex—typically, validation occurs when an expert evaluates whether the reported scores align with expectations and are consistent with other sources of evidence, such as sample phenotype, mode of inheritance, etc. The use of computational scores is just one of many pieces of information taken into account.

pstawinski · May 13 '25

> The use of computational scores is just one of many pieces of information taken into account.

Thank you so much!

As AI4Bio attracts increasing attention, I believe there will be dozens (if not hundreds) of new models in the next few years. So my question is really about how we can make these models more accessible.

That's why I'm more interested in how users would work with these scores. If users are reading sequences one by one, then all we need is a pipeline that runs in the background -- it would take only a few seconds for a modern network to predict dozens of variants. If users are processing variants in bulk with other programs, then I suppose precomputes would be mandatory. This raises a series of problems: how do we generate these precomputes? How do we store and distribute them?

ZhiyuanChen · May 13 '25

I assume we are still discussing in the context of OpenSpliceAI, the model capable of computing the consequences of mutations. For clinicians or researchers working with variants, the result should ideally be a numeric score accompanied by an interpretation. In the case of splicing, I would expect a floating-point number representing the strength of the change at the canonical splice site, along with measurements of the impact on potential alternative splice sites that may emerge after the mutation. A number alone is not sufficient—there must be a clear explanation of how to interpret the result.

More generally, the score should reflect the impact on the function of the region of interest, so that experts can classify the variant as Benign, Likely benign, Uncertain significance, Likely pathogenic, or Pathogenic.
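For concreteness, SpliceAI-style delta scores summarize exactly this: how the per-nucleotide splice-site probabilities shift between the reference and alternate alleles. A minimal sketch of that summarization (the probability arrays are toy values, not model output; the real model reports acceptor and donor tracks separately):

```python
def delta_scores(ref_probs, alt_probs):
    """Return (gain, loss): the largest increase and largest decrease in
    splice-site probability anywhere in the window after the mutation."""
    gain = max(a - r for r, a in zip(ref_probs, alt_probs))
    loss = max(r - a for r, a in zip(ref_probs, alt_probs))
    return max(gain, 0.0), max(loss, 0.0)

# A variant that strengthens one candidate site and weakens the canonical one:
gain, loss = delta_scores([0.05, 0.90, 0.10], [0.60, 0.20, 0.10])
```

A large `loss` at the canonical site paired with a large `gain` nearby is the classic signature of a cryptic splice site taking over, which is the kind of interpretation layer requested above.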

Please take a look at the [SpliceAI Lookup tool](https://spliceailookup.broadinstitute.org/#variant=chrX-154966484-T-A&hg=38&bc=basic&distance=50&mask=1&ra=0), which nicely illustrates the outputs of several scoring models. Also see this related [publication](https://www.biorxiv.org/content/10.1101/2024.09.17.611902v1.full).

Regarding distribution:

There is currently no canonical way of distributing precomputed scores. One emerging initiative is GeneBe Hub—an open platform where variant scores can be stored along with tools for easy discovery, retrieval, conversion, and annotation of VCF files using these databases (https://genebe.net/hub). Disclaimer: I am the author of this project; it is a work in progress, and feedback is most welcome.
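Once a precomputed table exists, the annotation step such a hub enables is mechanically simple: join each VCF record against the table and write the score into the INFO column. A sketch (the `OSAI_DS` INFO key is a hypothetical placeholder, not an agreed-upon field name):

```python
def annotate_vcf_line(line: str, scores: dict, info_key: str = "OSAI_DS") -> str:
    """Append a precomputed score to a VCF record's INFO column, if present."""
    if line.startswith("#"):
        return line  # headers pass through unchanged
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
    score = scores.get((chrom, pos, ref, alt))
    if score is not None:
        fields[7] = (f"{info_key}={score}" if fields[7] == "."
                     else f"{fields[7]};{info_key}={score}")
    return "\t".join(fields)

record = "chr1\t100\t.\tA\tG\t50\tPASS\t."
annotated = annotate_vcf_line(record, {("chr1", 100, "A", "G"): 0.91})
```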

pstawinski · May 15 '25

Sorry for the confusion. What I was asking goes beyond the scope of OpenSpliceAI. There have been many models (SpliceAI, OpenSpliceAI, etc.) for splice-site prediction, and I believe there will be many more in the near future. For example, SpliceBERT is a dedicated model for splicing, and many foundation models (Uni-RNA, Ernie-RNA, RNAGenesis, AIDO.RNA, etc.) have reported results on splicing. Soon, the time it takes to generate precomputed scores will exceed the time it takes for a new model to be released.

ZhiyuanChen · May 16 '25

Hi all, jumping into this conversation around precomputed scores. As pointed out by @pstawinski, for high-throughput contexts, precomputed scores (optionally exposed as an Ensembl VEP plugin) are a must-have. My question is: if one were to perform the required analysis, which model would be recommended?

sounkou-bioinfo · Jun 24 '25

> My question is if one were to perform the required analysis, which model would be recommended?

I have not had the opportunity to test the performance of OpenSpliceAI, but the original SpliceAI model does not perform well in certain cases. MMSplice or DeltaSplice may be better.

ZhiyuanChen · Jun 24 '25