Cache PAG serialization

Open · SamStudio8 opened this issue on May 28, 2020 · 5 comments

After three months of Majora-ing I think we have discovered an interesting flaw in the process model. I think it's important that we're able to model the concepts of samples, tubes, boxes, files, directories and the processes that are applied to them. It means we can quickly return information on particular artifacts and more easily model how to create and validate such artifacts through the API. It makes natural sense to send and receive information about these real world items through the API with structures that try to represent them.

Yet, when it comes to analyses, we most often want to dump our knowledge about these carefully crafted objects into a gigantic unstructured flat file to tabulate, count and plot things of interest. It's not impossible to do this - we can already unroll all the links between artifacts and processes to traverse the process tree model that is central to how Majora records the journal of an artifact.

The two issues with this are:

  • The unravelling is quite slow, likely owing to the suboptimal implementation (given my Django learning curve and time constraints) and the sheer number of models involved
  • The unravelling is quite inflexible. Currently the API supports unravelling Published Artifact Groups and Sequencing Runs, and not much else. The serializers for the latter are even a special-case implementation, built specifically to flatten metadata and metrics for the artifacts leading up to a sequencing run.

The first is not hugely problematic, as we request this data from the database infrequently. However, the latter is why I'm writing this issue. I want users to be able to request specific information ("columns") of metadata pertaining to any group of artifacts in the system - ideally in a fast and simple fashion.

This led me to think more about what the PAG really represents: if you think about it, the Published Artifact Group is a brief highlight reel of the journey an artifact has taken through its analysis lifespan (eg. for the Covid work, a PAG shows the sample and its FASTA - skipping everything in-between). We can formalise the idea of binding everything (including that in-between part) by specifically linking all the processes that were performed onto the Published Artifact Group.

I've previously discussed this idea and first thought about collecting all the processes from the start of the process tree to the end (eg. a sample, through to its FASTA) and adding these to a process_set on the Published Artifact Group. One could then ask all the processes in this group to serialize themselves, potentially with some context (eg. "these columns only"). We can formalise this slightly better by adding a concrete idea of a "journal" as a many-to-many relation on the Artifact and Process-related models.

That is, we still maintain the audit linkage of what processes were applied to which artifacts and when. But once the result of such a journey is final and a Published Artifact Group is minted, we can collect all those processes and label them with a specific journal_id. This means we can fetch all the processes related to a PAG/journal and serialise them without walking the tree.
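To make that concrete, here's a minimal sketch of what the journal M:M could look like, using simplified stand-ins rather than Majora's real models (all names below are hypothetical):

```python
from django.db import models

class MajoraArtifactProcess(models.Model):
    """Stand-in for the existing process model; the audit links it
    already carries are untouched by the journal idea."""
    when = models.DateTimeField(auto_now_add=True)

class PublishedArtifactGroup(models.Model):
    """Stand-in for the PAG model."""
    published_name = models.CharField(max_length=128, unique=True)
    # The "journal": once the PAG is minted, every process on the journey
    # from sample to FASTA gets labelled with it.
    process_set = models.ManyToManyField(
        MajoraArtifactProcess,
        related_name="journals",
        blank=True,
    )

def journal_processes(pag):
    # Serialisation becomes a single query rather than a tree traversal.
    return pag.process_set.order_by("when")
```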

If this still doesn't suffice, a post_save hook for the PAG could serialize all the information and store it as JSON in PostgreSQL, or something along those lines.
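A rough sketch of that fallback, assuming a JSONField cache on the PAG and a hypothetical flatten_journal() helper that does the one-off flattening:

```python
from django.db.models.signals import post_save
from django.dispatch import receiver

# Assumes the PAG model gains something like:
#   cached_metadata = models.JSONField(default=dict, blank=True)
# (django.contrib.postgres.fields.JSONField on older Django versions)

@receiver(post_save, sender=PublishedArtifactGroup)
def cache_pag_serialization(sender, instance, created, **kwargs):
    if not created:
        return
    # flatten_journal is hypothetical: walk the journal once at mint
    # time and return a flat dict of metadata and metrics.
    instance.cached_metadata = flatten_journal(instance)
    # update_fields keeps this save cheap and stops the signal from
    # doing the work twice (created is False on the second save).
    instance.save(update_fields=["cached_metadata"])
```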

SamStudio8 avatar May 28 '20 14:05 SamStudio8

We need a solution to this. I think the current idea that would cause the least amount of collateral damage would be adding an M:M relationship between a PAG and the Processes it encapsulates. We can then write a migration to enumerate every PAG, collect its artifacts, recurse through their process trees and add every process to the PAG (a sketch of such a migration follows the list below).

I think it is gross, but it has the following pros:

  • If we don't like it, it will be easy to remove
  • It will touch very little code as the heavy lifting will be done in the migration
  • It will work
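
A hedged sketch of what that backfill migration might look like; the app label, relation names and the walk_process_tree() helper (left unimplemented here, as the real traversal is Majora-specific) are all assumptions:

```python
from django.db import migrations

def walk_process_tree(artifact):
    # Placeholder: recurse the in/out process links from this artifact
    # and yield every process encountered. Majora-specific, not shown.
    raise NotImplementedError

def backfill_pag_processes(apps, schema_editor):
    PAG = apps.get_model("majora2", "PublishedArtifactGroup")
    for pag in PAG.objects.iterator():
        processes = set()
        for artifact in pag.tagged_artifacts.all():  # relation name assumed
            processes.update(walk_process_tree(artifact))
        pag.process_set.add(*processes)

class Migration(migrations.Migration):
    dependencies = [("majora2", "XXXX_add_pag_process_m2m")]  # placeholder
    operations = [
        migrations.RunPython(backfill_pag_processes, migrations.RunPython.noop),
    ]
```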

SamStudio8 avatar Jul 01 '20 14:07 SamStudio8

I took a different direction as an experiment, and as an excuse to continue my battle with DRF. You can send a leaf_cls GET param when listing or fetching PAGs, which will check the PAG for artifacts of that class, pick one, and grab its process tree. If this works, we'll go ahead and write a migration to link those processes into the PAG model proper.
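Roughly what that looks like (a sketch, not the actual view code; the serializer, relation and process_tree() names are assumptions):

```python
from rest_framework import viewsets
from rest_framework.response import Response

class PAGViewSet(viewsets.ReadOnlyModelViewSet):
    queryset = PublishedArtifactGroup.objects.all()
    serializer_class = PAGSerializer  # assumed to exist elsewhere

    def retrieve(self, request, *args, **kwargs):
        pag = self.get_object()
        context = self.get_serializer_context()
        leaf_cls = request.query_params.get("leaf_cls")
        if leaf_cls:
            # Nominate one artifact of the requested class and hand its
            # process tree to the serializer via the context.
            leaf = next(
                (a for a in pag.tagged_artifacts.all()
                 if a.__class__.__name__ == leaf_cls),
                None,
            )
            if leaf is not None:
                context["process_tree"] = leaf.process_tree()  # hypothetical
        serializer = self.get_serializer_class()(pag, context=context)
        return Response(serializer.data)
```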

SamStudio8 avatar Jul 02 '20 12:07 SamStudio8

Alright, I've taken a new approach that works for now. We get all the artifact IDs in the PAG, look for any ProcessRecord that starts or finishes with one of those artifacts, expand all the artifacts out, and repeat. This gets all the ProcessRecords we need (for now) in scope for serialisation. It's much nicer than the bullshit nominated-artifact approach I came up with yesterday. It works really well, and better still, it seems to involve fewer DB hits too.
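In Django terms the expansion is a fixed-point loop over the artifact set; a minimal sketch, with the field names (tagged_artifacts, in_artifact, out_artifact) being assumptions:

```python
from django.db.models import Q

def collect_process_records(pag):
    """Start from the PAG's artifacts, pull in every ProcessRecord that
    starts or finishes with one of them, add the artifacts those records
    touch, and repeat until nothing new turns up."""
    artifact_ids = set(pag.tagged_artifacts.values_list("id", flat=True))
    record_ids = set()
    frontier = set(artifact_ids)
    while frontier:
        hits = ProcessRecord.objects.filter(
            Q(in_artifact__id__in=frontier) | Q(out_artifact__id__in=frontier)
        ).exclude(id__in=record_ids)
        frontier = set()
        for rec in hits:
            record_ids.add(rec.id)
            for a_id in (rec.in_artifact_id, rec.out_artifact_id):
                if a_id is not None and a_id not in artifact_ids:
                    artifact_ids.add(a_id)
                    frontier.add(a_id)
    return ProcessRecord.objects.filter(id__in=record_ids)
```

Each iteration costs one query over the whole frontier rather than one per artifact, which is presumably where the fewer DB hits come from.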

SamStudio8 avatar Jul 02 '20 18:07 SamStudio8

This is shockingly insightful for something written back in May. Indeed, this problem is a core design issue with Majora (and would need considerable thought in any new version, #44). The way I see it, there are two parts to Majora's job:

  • Maintaining a thorough, interconnected history of artifacts and the processes that manipulate them: requiring fast indexes and heavy use of relational keys
  • Dumper-trucking everything we know about a sample into a flat file for analysis (bonus points if that can be filtered for particular rows and fields)

This dual model of storage will need to maintain an SQL-like structure for the first part, and I think the ideas I've touched upon in the past about pre-serialising PAGs out to JSON (and perhaps, one day, a separate large key-value database) will solve the second part. In the near future (time permitting) I think I'll experiment with adding JSON to each PAG and using the dynamic part of the v3 DRF API (likely removing the DRF part) to serve the dynamic queries.
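The serving side of that could then be as simple as filtering the cached dicts. A sketch building on the hypothetical cached_metadata field from earlier (the view name and example fields are also assumptions):

```python
from django.http import JsonResponse

def pag_metadata(request):
    # e.g. GET /v3/pag-metadata/?fields=central_sample_id,run_name
    fields = request.GET.get("fields")
    wanted = set(fields.split(",")) if fields else None
    rows = []
    for pag in PublishedArtifactGroup.objects.all():
        row = pag.cached_metadata or {}
        if not row:
            continue  # PAG not yet cached; skip rather than walk the tree
        if wanted is not None:
            row = {k: v for k, v in row.items() if k in wanted}
        rows.append(row)
    return JsonResponse({"pags": rows})
```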

SamStudio8 avatar Jan 12 '21 15:01 SamStudio8

As part of fast prep for mass ENA consensus submissions (https://github.com/COG-UK/dipi-group/issues/11), I've hacked the "original" GET PAG endpoint to allow a mode to override the behaviour of the celery task. This works absolutely brilliantly: it's blazing fast AND still satisfies the requirements of the ocarina struct, which means I can just drop it in to work there. However, this is completely against the design ethos of Majora (flexibility and genericism). Delving into this for ENA submissions has also made me realise I never solved the dual-endpoint problem, whereby the GET APIs for sequencing runs and PAGs have a lot of overlap but (currently) no shared code; this has only worsened with the recent need to add highly specific code to speed those two processes up. I think the long-term solution is to deploy the cache idea discussed here, then bring back the dynamic v3 API to handle JSON munging and API responses.
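For context, the shape of the hack is roughly this (heavily hedged; every name here is hypothetical, not the actual endpoint code):

```python
from django.http import JsonResponse

def get_pag(request):
    if request.GET.get("mode") == "fast":
        # Fast path: a purpose-built query that skips the generic
        # machinery but still emits the struct ocarina expects.
        return JsonResponse(fast_pag_payload(request.GET.dict()))  # hypothetical
    # Default path: defer to the generic (flexible, slow) celery task.
    task = serialize_pags_task.delay(request.GET.dict())  # hypothetical task
    return JsonResponse({"task_id": task.id})
```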

SamStudio8 avatar Jan 13 '21 17:01 SamStudio8