Sequence metadata

Open igboyes opened this issue 6 years ago • 1 comment

  • [ ] Pull all metadata from Genbank when autofilling a sequence

    It would also be nice to extract all available metadata for a sequence in addition to the host (isolate, isolation_source, collection_date, note, subtype, serotype, etc.), as well as any annotations.

  • [ ] Add metadata field for sequence creation and editing
  • [ ] Consider popover or other unintrusive method for displaying metadata in analysis results

Relates to #1548.

igboyes · Nov 13 '19 21:11

I have a bit of code that might help:

from typing import Any, Dict

import pandas as pd
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord


def genbank_source_metadata(rec: SeqRecord) -> Dict[str, Any]:
    """Get the source feature qualifiers of a SeqRecord as a dict."""
    # The source feature is usually first, but look it up by type to be safe.
    source = next((f for f in rec.features if f.type == 'source'), None)
    if source is None:
        return {}
    # Biopython stores qualifier values as lists; unwrap single-item lists.
    return {k: v[0] if isinstance(v, list) and len(v) == 1 else v
            for k, v in source.qualifiers.items()}


def genbank_metadata(genbank: str) -> pd.DataFrame:
    """Parse genome metadata from a Genbank file into a Pandas DataFrame."""
    # One row per record (indexed by record ID), one column per qualifier.
    df_metadata = pd.DataFrame({rec.id: genbank_source_metadata(rec)
                                for rec in SeqIO.parse(genbank, 'genbank')}).transpose()
    # Prefer the isolate name where present, falling back to the strain.
    if 'isolate' in df_metadata and 'strain' in df_metadata:
        df_metadata['strain'] = df_metadata['isolate']\
            .combine_first(df_metadata['strain'])
    return df_metadata

The resulting Pandas DataFrame can be converted to a dict with to_dict() or to JSON with to_json().
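As a minimal sketch of that conversion: the sequence IDs and qualifier values below are made up for illustration, and orient="index" is just one reasonable choice (it keys the output by sequence ID, which seems like a natural fit for per-sequence metadata).

```python
import json

import pandas as pd

# Made-up metadata table: one row per sequence ID, one column per qualifier.
df = pd.DataFrame(
    {
        "isolate": ["A1", None],
        "host": ["Solanum tuberosum", "Vitis vinifera"],
    },
    index=["SEQ_1", "SEQ_2"],
)

# orient="index" produces {sequence_id: {qualifier: value}}.
as_dict = df.to_dict(orient="index")

# Same layout as a JSON string; missing values are written as null.
as_json = df.to_json(orient="index")

print(as_dict["SEQ_1"])
print(json.loads(as_json)["SEQ_2"])
```

The dict form is probably the easier one to merge into an existing document before storing it, while the JSON string is handy for a quick dump or an API response.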

I probably overuse Pandas in my projects, but I just find it super handy and quick for data table manipulations.

It would also be great to automatically parse and clean up the extracted metadata, for example by converting all dates to ISO 8601 format. Submitter-provided metadata from NCBI is usually a mess.
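A rough sketch of what the date cleanup could look like, using only the standard library. The function name normalize_collection_date and the list of accepted formats are my own assumptions for illustration, not anything from Virtool or Biopython; real collection_date values can also be ranges or free text, which this sketch simply returns as None. Note that %b relies on English month abbreviations, so it assumes the default C locale.

```python
from datetime import datetime
from typing import Optional

# (input format, output format) pairs. The output keeps the same precision
# as the input, so a year-only date does not gain an invented month or day.
_DATE_FORMATS = [
    ("%d-%b-%Y", "%Y-%m-%d"),  # e.g. 13-Nov-2019
    ("%b-%Y", "%Y-%m"),        # e.g. Nov-2019
    ("%Y-%m-%d", "%Y-%m-%d"),  # already ISO, full date
    ("%Y-%m", "%Y-%m"),        # already ISO, year and month
    ("%Y", "%Y"),              # year only
]


def normalize_collection_date(value: Optional[str]) -> Optional[str]:
    """Return an ISO 8601 date string, or None if the value is not recognized.

    Hypothetical helper: anything that does not match one of the formats
    above (ranges, free text, empty values) yields None.
    """
    if not value:
        return None
    text = value.strip()
    for in_fmt, out_fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(text, in_fmt).strftime(out_fmt)
        except ValueError:
            continue
    return None


for raw in ["13-Nov-2019", "Nov-2019", "2019", "unknown"]:
    print(raw, "->", normalize_collection_date(raw))
```

Returning None for unrecognized values, rather than guessing, would let the UI flag those fields for manual editing.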

In the UI it would be really great to allow users to edit the metadata values and potentially add their own fields and values.

peterk87 · Nov 13 '19 22:11