MegaQC Single MultiQC report mode

Once we have --noauth (#38), we can have a command line flag to run MegaQC with a single MultiQC JSON file. This would be useful when people generate huuuuge MultiQC runs (e.g. single cell) whose reports are not very usable. Here, they could quickly install MegaQC and then interactively look into the data.

Workflow:

pip install multiqc megaqc
multiqc huge_project/data
megaqc run --multiqc-json multiqc_data/multiqc_data.json

MegaQC could ignore the usual installation (if any is present) and create a SQLite db file in /tmp. Everything will be lost on exit, so a warning should probably be added to the header.
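A minimal sketch of the throwaway-database idea, using only the Python standard library. This is not MegaQC code: the function name load_into_temp_db, the report_data table, and its one-row-per-top-level-key layout are all hypothetical, and the sketch assumes the JSON file's top level is an object.

```python
import atexit
import json
import os
import sqlite3
import tempfile


def load_into_temp_db(json_path):
    """Load one MultiQC JSON file into a throwaway SQLite database."""
    # Create the database file in the system temp directory (e.g. /tmp).
    fd, db_path = tempfile.mkstemp(prefix="megaqc_", suffix=".sqlite")
    os.close(fd)
    conn = sqlite3.connect(db_path)

    def cleanup():
        # Everything is lost on exit: close the connection, delete the file.
        conn.close()
        if os.path.exists(db_path):
            os.remove(db_path)

    atexit.register(cleanup)

    with open(json_path) as fh:
        data = json.load(fh)

    # Hypothetical schema: one row per top-level key, value stored as JSON text.
    conn.execute("CREATE TABLE report_data (key TEXT PRIMARY KEY, value TEXT)")
    conn.executemany(
        "INSERT INTO report_data VALUES (?, ?)",
        [(key, json.dumps(value)) for key, value in data.items()],
    )
    conn.commit()
    return conn, db_path
```

Registering the cleanup with atexit is what makes the session transient, which is why the warning in the header would matter.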

May 12 '18 09:05 ewels

Dear Phil,

First of all thanks so much for the amazing tools! In our lab we are still in the discovery mode, but they are already making our life much easier.

I wanted to suggest a feature for Multi/MegaQC which goes along these lines, and luckily found this issue. It is indeed true that, when a mid- to large-sized pipeline produces a lot of output for each analysed sample and/or when there is a certain degree of multiplexing in sequencing runs, a MultiQC report generated from the complete data set can be overloaded and somewhat hard to interpret. For our pipelines an ideal scenario would be to run MultiQC sample-wise and then aggregate the most important metrics into a dataset-wide report. To me it looks very similar to what you sketched here, although I am not completely sure about the possible implementation.

So my question (suggestion) is how much of a deviation it would be to save the resulting 2nd level report as a "static" MultiQC-like report, instead of a transient session backed by a SQLite file. Perhaps, at the cost of some code reshuffling/duplication, this feature of hierarchical reports would fit better into MultiQC itself.

Let me know what you think.

Best regards,

Pavlo

Nov 20 '18 22:11 lutsik

Hi Pavlo,

Great to hear that you're finding MultiQC useful! A few people have suggested roughly what you're saying here - I don't think it will ever be possible with just MultiQC because it's a deceptively large amount of work to do. MegaQC kind of already does it, but you end up with a whole interactive website instead of a single static report.

An alternative (and the main reason I've never pursued the "make a MultiQC report from MultiQC reports" idea) is just to re-run MultiQC as much as you need. So, run it on your sample level and then run it again on the whole dataset. You can use different config files to exclude certain samples and change which modules / sections / table columns etc. are included in the report. My guess is that you should end up with roughly what you're after using this approach without having to code any new software.

In regards to this issue - I guess it could be expanded to accept multiple MultiQC JSON files for a single-run approach too, that should work. No promises for when it will be implemented though sorry (MegaQC has been progressing pretty slowly sadly due to me juggling too many projects).

Shout if you have any questions,

Phil

Nov 20 '18 22:11 ewels

Thanks for the prompt reaction, Phil.

I see your point, and it is fair enough. I guess a "recursive" MultiQC plugin (digesting and presenting MultiQC data to itself) could then be a possible solution for us to pursue, since the major aim is to summarise metrics from several "within-sample" batches. We will come back to you as soon as we have gained some first experience.

Best regards,

Pavlo

Nov 20 '18 23:11 lutsik

Sounds good! Out of curiosity, I'm still not totally clear on why you can't re-run MultiQC across the samples you need?

MultiQC has support for custom code in plugins. I've made an example at https://github.com/MultiQC/example-plugin which may be useful. The data that MultiQC parses is saved in multiqc_data files, plus everything should be in multiqc_data/multiqc_data.json (used by MegaQC). There is also an option to save the plot data to files if required.
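As a starting point for a script or plugin, the saved MultiQC JSON file can be read directly. A rough sketch, assuming the file has a top-level key report_general_stats_data holding a list of {sample: {metric: value}} dicts; that layout is an assumption and may differ between MultiQC versions, so check your own file first. The function names are made up for illustration.

```python
import json
from collections import defaultdict


def general_stats_by_sample(report):
    """Flatten the assumed general-stats blocks into {sample: {metric: value}}."""
    merged = defaultdict(dict)
    # Assumed layout: a list with one {sample: {metric: value}} dict per module.
    for block in report.get("report_general_stats_data", []):
        for sample, metrics in block.items():
            merged[sample].update(metrics)
    return dict(merged)


def load_general_stats(json_path):
    """Read a MultiQC JSON file from disk and return its per-sample stats."""
    with open(json_path) as fh:
        return general_stats_by_sample(json.load(fh))
```

Calling load_general_stats once per report and combining the results would be one way to approach the multiple-JSON case mentioned earlier in the thread.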

Phil

Nov 20 '18 23:11 ewels

Sorry, I am still a MultiQC beginner, so I am not picking it up that fast.

As mentioned, running on everything is not a problem, but it often inflates the number of samples by a factor that depends on the degree of multiplexing, the number of additional sequencing runs, or data splitting for the purpose of parallelisation. As a result the readability of a report suffers a bit for a data set with many samples. I have custom scripts for dealing with this (e.g. manually merging flagstat output), although I might be overlooking a simpler solution.
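The kind of merging meant here can be sketched as follows, assuming the per-lane counts have already been parsed into plain dicts. The "<sample>_L<lane>" naming convention and both function names are hypothetical. Only raw counts are summed; percentages would have to be recomputed from the merged counts.

```python
from collections import Counter, defaultdict


def strip_lane(name):
    """Map a per-lane name to its sample name.

    Hypothetical naming convention: "<sample>_L<lane>", e.g. "s1_L001".
    Names without a "_L" suffix are returned unchanged.
    """
    return name.rsplit("_L", 1)[0]


def merge_counts(per_lane, group_key=strip_lane):
    """Sum count metrics over all entries that map to the same sample."""
    merged = defaultdict(Counter)
    for name, counts in per_lane.items():
        # Counter.update adds the values instead of overwriting them.
        merged[group_key(name)].update(counts)
    return {sample: dict(counts) for sample, counts in merged.items()}


per_lane = {
    "s1_L001": {"total": 100, "mapped": 90},
    "s1_L002": {"total": 50, "mapped": 40},
    "s2_L001": {"total": 10, "mapped": 5},
}
merged = merge_counts(per_lane)
# merged["s1"] == {"total": 150, "mapped": 130}
```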

Thanks for the reference to the plugin docs, we will definitely make use of it.

Pavlo

Nov 20 '18 23:11 lutsik

No worries - everyone's setup is different, it's just interesting to hear people's use cases. If you need to merge results before parsing then that is difficult for MultiQC to handle, as there needs to be logic involved as to how to do that step. A MultiQC plugin is probably a good way to go then, as you can interface with other systems (we have a plugin that fetches sample information from our LIMS for example).

Phil

Nov 21 '18 11:11 ewels