BioSimSpace Feature analysis

Hello! This is the PR relating to the updated analysis in BSS. I have tested this for all three engine outputs via the following, with work_dir being the directory of the run and leg being either the bound or free leg:

pmf, overlap_matrix = BSS.FreeEnergy.Relative.analyse(f"{work_dir}/{leg}", 'MBAR')

and this also works if the estimator is set to TI. This proceeds by first extracting the data from the output files depending on the engine, and then carrying out the equilibrium detection, statistical inefficiency, and auto MBAR or TI using alchemlyb.

I also tested it for starting FreeEnergy runs with SOMD and directly analysing the outputs in BSS:


bound_run = BSS.FreeEnergy.Relative(
    system_bound,
    freenrg_protocol,
    engine=f"{engine_query}",
    work_dir=f'{work_dir}/bound'
)
bound_run.run()

free_run = BSS.FreeEnergy.Relative(
    system_free,
    freenrg_protocol,
    engine=f"{engine_query}",
    work_dir=f'{work_dir}/free'
)
free_run.run()

pmf_bound, overlap_bound = bound_run.analyse()
pmf_free,  overlap_free  = free_run.analyse()
free_nrg_binding = BSS.FreeEnergy.Relative.difference(pmf_bound, pmf_free)

if TI is to be used with .analyse(), it needs to be set during the BSS.FreeEnergy.Relative stage, as otherwise MBAR is the default. Otherwise it can also be used using the whole BSS.FreeEnergy.Relative.analyse(f"{work_dir}/{leg}", 'TI') for that run and that works fine.

Checking the overlap for MBAR works using:

free_run.check_overlap()

or

BSS.FreeEnergy.Relative.check_overlap(free_overlap, estimator='MBAR')
BSS.FreeEnergy.Relative.check_overlap(free_overlap) # or this as MBAR is the default

I also added the plotting from alchemlyb for the overlap and dhdl as well, eg:

free_run.plot()

or

overlap_plot = BSS.FreeEnergy.Relative.plot(overlap_free, estimator='MBAR', work_dir=f'{work_dir}/free')

When no work dir is defined, the output isn't saved so for jupyter notebooks it would only show it. The estimator must match the estimator of the anaylse used, as otherwise the data format is not correct for MBAR/TI. If no estimator is set, it will infer the type of data and if MBAR/TI plotting should be used.

Please let me know any things I should change, or if there are any code mistakes, or any ideas for improvements!

Jul 22 '22 09:07 annamherz

Thanks, @annamherz! I should be able to take a look at this in detail early next week.

Jul 22 '22 11:07 lohedges

Sorry for the delay with this. I'll try to make some comments today/tomorrow. Don't worry about making any edits before the workshop, we can deal with them afterwards when we think about merging things into devel. It might be the case the workshop has highlighted some other things that need editing/improving too.

Cheers.

Sep 07 '22 08:09 lohedges

Just added some comments. This is still a WIP so I'll add more if I think of anything.

Thanks again for this. It will be good to catch up after the workshop to discuss any further refinements based on feedback.

Sep 07 '22 10:09 lohedges

Thank you so much for all the feedback 😊 I'm also currently doing some large scale analysis to check the reproducibility between the statistical inefficiency / autoequilibration detection and other analysis protocols, so I think after this I'll have a more clear idea of what the best way to handle these with BSS would be as well.

Sep 08 '22 12:09 annamherz

Another potential further thing for analysis, that is implemented for now very roughly in the workshop for the sake of the absolute analysis (6aaae504d04d1d3bdef0b3bdf69caccf6ace2017 , fixed in d084e58658878811f48c77cda22f5dcba00da6e8, warnings in 6ac77fb5496a9e3a7b19d839e593b3ad3a1832bd) is a way to use the the previous ('native') analysis for SOMD so that previous results using older BSS can still be reproduced. I'll still test this some more and I think for the main analysis feature if we keep it the implementation of it would need some more thought as I don't think the way I've put it in right now is ideal.

Sep 08 '22 14:09 annamherz

Another potential further thing for analysis, that is implemented for now very roughly in the workshop for the sake of the absolute analysis (6aaae50 , fixed in d084e58, warnings in 6ac77fb) is a way to use the the previous ('native') analysis for SOMD so that previous results using older BSS can still be reproduced. I'll still test this some more and I think for the main analysis feature if we keep it the implementation of it would need some more thought as I don't think the way I've put it in right now is ideal.

I've merged the changes made for the workshop so that they're together with the analysis stuff.

I'm also currently doing some large scale analysis to check the reproducibility between the statistical inefficiency / autoequilibration detection and other analysis protocols, so I think after this I'll have a more clear idea of what the best way to handle these with BSS would be as well.

After doing this, it seems like the statistical inefficiency / subsampling (for the SOMD native) does not improve agreement of the results with the experimental, and also makes the alchemlyb and native analysis methods less consistent with each other, so I've turned it off for now as well.

Sep 16 '22 15:09 annamherz