TopoStats Further Modularisation

Is your feature request related to a problem? Please describe.

TopoStats has been made more modular with the merging of recent PRs #540 and #600 (thanks @ns-rse). It would be good to have further work so that each of the processing steps can be run independently.

This could make it easier to add features and perhaps to test them. It would also let people use TopoStats in a more custom manner, and eliminate the unnecessary re-processing that has been a large obstacle for several users with large datasets. For example, users could skip flattening and load pre-processed images along with their pixel_to_nanometre scaling factor.

I'll update this as I explore possible implementations. The issue is rather complicated because of the amount of work we do in the background (e.g. checking that the config is valid, updating the plotting dictionary, and passing optionally plotted images around).

Jun 12 '23 11:06 SylviaWhittle

Definitely agree with this. If we can make it so each step can be run in isolation it will give users greater flexibility. That was the thinking behind #517, and #540 puts the framework in place to extend it.

Some random thoughts...

Configuration Files

Config validation isn't that much of an overhead. It's done once and is useful because (if we set the validation up correctly) it tells users early on if there are problematic values in their configuration.
I can see a case for splitting configuration files into a per-module basis. I would advocate against this though as it could be a source of confusion. Have one config to rule them all, but each topostats <command> should have a comprehensive suite of arguments so that any individual command can be modified at run-time.
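
To illustrate the "one config, per-command arguments" idea, here is a minimal sketch using argparse sub-commands. It is not the current TopoStats CLI: the sub-command names, the option names, and the DEFAULT_CONFIG dictionary are all assumptions made up for the example. It only shows how values given at run-time could override the matching section of a single configuration.

```python
import argparse

# Hypothetical defaults standing in for values read from the single config file.
DEFAULT_CONFIG = {
    "filter": {"threshold_method": "std_dev", "row_alignment_quantile": 0.5},
    "plotting": {"cmap": "nanoscope", "image_format": "png"},
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one sub-command per processing step (names are illustrative)."""
    parser = argparse.ArgumentParser(prog="topostats")
    subparsers = parser.add_subparsers(dest="command", required=True)

    filter_parser = subparsers.add_parser("filter", help="Flatten raw images.")
    filter_parser.add_argument("--threshold-method", help="Thresholding method.")
    filter_parser.add_argument("--row-alignment-quantile", type=float, help="Quantile for row alignment.")

    plot_parser = subparsers.add_parser("plotting", help="Plot previously processed images.")
    plot_parser.add_argument("--cmap", help="Colour map.")
    plot_parser.add_argument("--image-format", help="Output image format.")
    return parser


def merge_config(config: dict, args: argparse.Namespace) -> dict:
    """Return a copy of the command's config section with any run-time options applied."""
    section = dict(config[args.command])
    for key, value in vars(args).items():
        # Options the user did not pass are None and leave the config value untouched.
        if key != "command" and value is not None:
            section[key] = value
    return section


if __name__ == "__main__":
    parsed = build_parser().parse_args()
    print(merge_config(DEFAULT_CONFIG, parsed))
```

Because every option defaults to None, the configuration file stays the single source of truth and only explicitly passed arguments change anything.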

Intermediary Files

For this to work we would need to save the output from each stage to make it available to the subsequent stage (these could be saved as Numpy .npy or Python's more general pickles .pkl).
We would have to save these in a consistent location so that the overall configuration file, which specifies the filetype to be looked at, can be used to work out where to find the files each stage needs.
Or we add more complexity to the configuration file (or perhaps better still just to the command line arguments) to say what files to load.
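
As a rough sketch of the "consistent location" option, the helpers below pickle each stage's output to output_dir/<stage>/<image_name>.pkl and load it back for the next stage. The function names, the directory layout, and the dictionary keys are assumptions for illustration, not anything that exists in TopoStats today. The image here can be any picklable object, including a Numpy array.

```python
import pickle
from pathlib import Path


def stage_path(output_dir, stage: str, image_name: str) -> Path:
    """Return the (hypothetical) conventional location for one stage's output."""
    return Path(output_dir) / stage / f"{image_name}.pkl"


def save_stage(output_dir, stage: str, image_name: str, image, pixel_to_nm_scaling: float) -> Path:
    """Pickle an image together with its scaling factor so a later stage can reuse it."""
    path = stage_path(output_dir, stage, image_name)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump({"image": image, "pixel_to_nm_scaling": pixel_to_nm_scaling}, handle)
    return path


def load_stage(output_dir, stage: str, image_name: str) -> dict:
    """Load a previous stage's output, failing clearly if that stage has not been run."""
    path = stage_path(output_dir, stage, image_name)
    if not path.exists():
        raise FileNotFoundError(f"No '{stage}' output for '{image_name}' at {path}")
    with path.open("rb") as handle:
        return pickle.load(handle)
```

A later step such as plotting or masking could then call load_stage(output_dir, "filter", image_name) rather than re-flattening the raw data. Pickles should only be loaded from locations the user trusts.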

Arguments

Each command should have fully documented arguments so users can see what their options are.

Output

Should output over-write what already exists or not? I can see cases for not doing so, which would allow results to be compared. How to handle this? One option is to leave it to the user to specify the output and make sure they don't over-write their existing results. They could do this by changing values in the configuration file or by using command line options.
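
One possible way to handle this, sketched below, is to refuse to write into a non-empty output directory unless the user explicitly asks for it. The function name and the overwrite flag are hypothetical and only show the shape of such a guard.

```python
from pathlib import Path


def prepare_output_dir(output_dir, overwrite: bool = False) -> Path:
    """Create the output directory, refusing to reuse a non-empty one unless overwrite is set."""
    path = Path(output_dir)
    if path.is_dir() and any(path.iterdir()) and not overwrite:
        raise FileExistsError(
            f"{path} already contains results; choose another output directory or allow over-writing."
        )
    path.mkdir(parents=True, exist_ok=True)
    return path
```

With a guard like this, existing results are kept by default and comparing runs just means pointing each run at a different output directory.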

Jun 12 '23 14:06 ns-rse

From a heavy user perspective this would be very useful. For a complex dataset I usually have to iterate the analysis at each step (processing, image plotting, masking etc.) and can have different parameters for each part of the analysis. This can currently be very time consuming, especially when I need to use the 'all' image set. TopoStats is also filling up my laptop storage very quickly, since I currently just change the name of the output file for each run I want to compare and so get everything multiple times.

It would be great if once the raw data had been flattened I could simply use that flattened data to make new images (different scales, colours, formats etc.) and optimise masking from the flattened image rather than having to completely reprocess everything.

Jun 13 '23 10:06 derollins

Feels like we're well on the way to achieving this with the refactoring @SylviaWhittle undertook which was merged in #613.

What more needs doing to round this out? Is it better documentation, or tutorials on how to use the new command line interface (CLI)? Are we missing modules for anything, perhaps the tracing/skeletonisation aspects which are blocked by the refactoring that is a work in (slow) progress?

Sep 21 '23 22:09 ns-rse