
Feature Request: Generate tgMap from GTF when building Index

Open ACastanza opened this issue 5 years ago • 3 comments

Currently alevin requires a transcript to gene mapping file to be provided with the --tgMap flag. A similar file is required when computing gene level quantification downstream of Salmon.

This file must exactly match the transcript IDs used when building the initial salmon index. Furthermore, a GTF file containing the transcript-to-gene mappings is typically provided when building that index.

To ensure that transcripts are mapped to genes consistently, it would be ideal if salmon, when constructing an index with a GTF file provided, also produced a tgMap file from the GTF and wrote it to the index output directory. alevin could then detect this file automatically, and users could pass it manually to other downstream tools. Generating the file automatically, and thus making the --tgMap flag optional, would ensure that subsequent transcript-to-gene mappings exactly match the original quantification. It could also open the door to a future option for salmon to produce gene-level quantifications automatically.
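A minimal sketch of the mapping step described above, written as a standalone Python helper rather than as anything salmon currently does. The function names `tx2gene_from_gtf` and `write_tgmap` are hypothetical, and the sketch assumes a GTF whose ninth column carries `gene_id` and `transcript_id` attributes in the usual `key "value";` form. It writes the two-column, tab-separated layout (transcript ID, then gene ID) that --tgMap expects.

```python
import re
from typing import Dict, Iterable

# Matches GTF attributes of the form: key "value";
_ATTR = re.compile(r'(\S+) "([^"]*)"')


def tx2gene_from_gtf(lines: Iterable[str]) -> Dict[str, str]:
    """Collect transcript_id -> gene_id pairs from GTF lines (hypothetical helper)."""
    mapping: Dict[str, str] = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header / comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        attrs = dict(_ATTR.findall(fields[8]))
        tx = attrs.get("transcript_id")
        gene = attrs.get("gene_id")
        # Gene-level records have no (or an empty) transcript_id; skip them.
        if tx and gene:
            mapping.setdefault(tx, gene)  # keep the first gene seen per transcript
    return mapping


def write_tgmap(mapping: Dict[str, str], path: str) -> None:
    """Write one 'transcript<TAB>gene' pair per line."""
    with open(path, "w") as out:
        for tx, gene in mapping.items():
            out.write(f"{tx}\t{gene}\n")
```

Usage would be along the lines of `write_tgmap(tx2gene_from_gtf(open("annotation.gtf")), "tgMap.tsv")`, where both file names are placeholders. Note that the IDs in the GTF attributes may not match the FASTA headers used for indexing character for character (for example, version suffixes or extra header fields), which is exactly the kind of mismatch this request aims to rule out.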

ACastanza avatar Dec 01 '20 22:12 ACastanza

Thanks @ACastanza, I think it's a good idea. I have marked it as a feature request, and we'll update you here once we have made some progress toward the next release.

A bit tangential, but I find refgenie very useful, as it has pre-built salmon indices along with the other relevant metadata (such as the GTF used to generate the tgMap file) needed by salmon/alevin for quantification. I agree, though, that saving the tgMap while indexing with a GTF would be great for consistency.

k3yavi avatar Dec 03 '20 17:12 k3yavi

Yes, I'm aware of refgenie. However, for the hg38 salmon indices I was unable to identify which specific transcriptome source (and which version of that source) was used to build them. Additionally, my use case here isn't entirely personal. I work for GSEA-MSigDB and GenePattern, and we're in the process of improving the end-to-end analysis pipeline we offer to users. One of the things we've been working on is wrapping the Salmon indexer, Salmon quant, and Alevin into GenePattern modules, so that we can offer them to users who may want to run them on arbitrary transcriptomes in addition to the ones we provide specifically for GSEA compatibility. We encountered this issue when considering potential sources of inconsistency at different points in the pipeline.

ACastanza avatar Dec 03 '20 18:12 ACastanza

Got it, thanks for the heads up. I'd probably reach out to the refgenie people about the hg38 specific versions.

It makes sense to support providing the GTF at indexing time. My only concern is that making the GTF mandatory might restrict the overall workflow somewhat, because a user might not always have the full GTF available for every use case. We can always make the GTF an optional input for indexing, though. Adding the support should not be too difficult, but it will certainly add a new logic path that would need thorough testing.

We'll certainly keep you updated on the feature as we progress, although it may take some time to get back to you. For your pipeline, one option would be to explicitly save the GTF in the salmon index folder after indexing. It's definitely not a very computer-science-friendly solution, but it will help maintain consistency while we work on the feature.
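The stop-gap above could be sketched roughly as follows. This is only an illustration: the helper name `stash_gtf_in_index` and the file name `annotation.gtf` are hypothetical, salmon does not look for such a file, and the sketch assumes that an extra file placed in the index directory is simply ignored by salmon.

```python
import shutil
from pathlib import Path


def stash_gtf_in_index(gtf_path, index_dir, name="annotation.gtf"):
    """Copy the GTF used for indexing next to the index files (hypothetical helper)."""
    index_dir = Path(index_dir)
    if not index_dir.is_dir():
        raise NotADirectoryError(f"not an index directory: {index_dir}")
    dest = index_dir / name
    # copy2 also preserves timestamps, which helps when auditing which GTF was used.
    shutil.copy2(gtf_path, dest)
    return dest
```

A wrapper module would call this once, right after the indexing step finishes, and later derive the tgMap from the stashed copy rather than from a separately supplied file.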

k3yavi avatar Dec 03 '20 18:12 k3yavi