
Adding an RNA sequence class?

Open kathyxchen opened this issue 7 years ago • 1 comment

This is something I'd like to consider implementing, in the hope that it could be used in one of the examples for the paper. It might not be necessary, but I do want to discuss it. @evancofer, what do you think we'd need to train a model on RNA data? For now, I'm only asking with respect to a potential RNA Sequence class. (When you have coordinate data, do you also get the full sequence from a FASTA file?)

kathyxchen, Apr 25 '18 21:04

The answer differs slightly depending on whether we want mRNA, pre-mRNA, and so on. However, as long as we use transcript or gene coordinates, things are simple. If we want mRNAs, the simplest solution is to use a separate FASTA file that lists transcripts, with "transcript" in place of "chrom", and with coordinates defined within each transcript. This doesn't require significantly altering the genome type. If we want pre-mRNA, we just include the intronic regions in the FASTA file of genes.
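As a rough illustration of the transcript-FASTA idea above, here is a minimal sketch in plain Python. It is not selene's API: `read_fasta`, `get_sequence`, the record IDs `tx1`/`tx2`, and the sequences are all hypothetical, and the coordinates are assumed to be 0-based and half-open within each transcript.

```python
from io import StringIO


def read_fasta(handle):
    """Parse FASTA text into a dict mapping record ID -> sequence string."""
    records = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if name is not None:
                records[name] = "".join(chunks)
            # The record ID is the first whitespace-delimited token.
            name = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records


def get_sequence(records, transcript, start, end):
    """Return the 0-based, half-open slice [start, end) of one transcript.

    The transcript ID plays the role that "chrom" plays for a genome.
    """
    seq = records[transcript]
    if not 0 <= start <= end <= len(seq):
        raise ValueError(f"coordinates out of range for {transcript}")
    return seq[start:end]


# Hypothetical transcript FASTA: one record per mRNA, made-up sequences.
TRANSCRIPT_FASTA = """>tx1 hypothetical mRNA
AUGGCCAUUG
UAAUGGGCC
>tx2 hypothetical mRNA
AUGAAACCCGGGUAG
"""

transcripts = read_fasta(StringIO(TRANSCRIPT_FASTA))
start_codon = get_sequence(transcripts, "tx1", 0, 3)
```

Because lookups are keyed by record ID and a coordinate pair, the same access pattern should work for a pre-mRNA FASTA whose records include intronic sequence.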

The real difficulty occurs when we want to use genomic coordinates rather than gene coordinates. In this case, we have to keep the gene definitions as well as the genome in memory, and transform the genomic coordinates into gene coordinates on the fly. This seems like it would require a fast coordinate or interval map, so that we can randomly access a coordinate or region and pull out the corresponding gene definition.
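One possible shape for such a coordinate map is sketched below, using a sorted list of gene starts per chromosome and binary search. Everything here is an assumption for illustration: the class name `GeneIntervalMap`, the gene tuples, and the gene IDs are hypothetical, coordinates are 0-based and half-open, and genes on the same chromosome are assumed not to overlap (overlapping genes would call for an interval tree instead).

```python
from bisect import bisect_right
from collections import defaultdict


class GeneIntervalMap:
    """Map genomic positions to (gene_id, offset within gene).

    Assumes non-overlapping genes per chromosome and 0-based,
    half-open [start, end) intervals.
    """

    def __init__(self, genes):
        # genes: iterable of (chrom, start, end, strand, gene_id)
        by_chrom = defaultdict(list)
        for chrom, start, end, strand, gene_id in genes:
            by_chrom[chrom].append((start, end, strand, gene_id))
        # Sort by start so each lookup can use binary search.
        self._genes = {c: sorted(v) for c, v in by_chrom.items()}
        self._starts = {
            c: [g[0] for g in v] for c, v in self._genes.items()
        }

    def to_gene_coordinate(self, chrom, pos):
        """Return (gene_id, offset) or None if pos is intergenic."""
        starts = self._starts.get(chrom)
        if not starts:
            return None
        # Index of the last gene whose start is <= pos.
        i = bisect_right(starts, pos) - 1
        if i < 0:
            return None
        start, end, strand, gene_id = self._genes[chrom][i]
        if pos >= end:
            return None
        # On the minus strand, offsets count from the gene's 3' genomic end.
        offset = pos - start if strand == "+" else end - 1 - pos
        return gene_id, offset


# Hypothetical gene definitions.
gene_map = GeneIntervalMap([
    ("chr1", 100, 200, "+", "geneA"),
    ("chr1", 300, 400, "-", "geneB"),
])
hit = gene_map.to_gene_coordinate("chr1", 150)
```

Each lookup costs O(log n) in the number of genes on the chromosome, which keeps random access cheap when sampling positions during training.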

evancofer, Apr 30 '18 14:04