
HDF backend slow

Open AndreWaehlisch opened this issue 4 years ago • 3 comments

General information:

  • emcee version: 3.0.2
  • platform: opensuse
  • installation method (pip/conda/source/other?): pip

I am looking at your saving example at https://emcee.readthedocs.io/en/v3.0.2/tutorials/monitor/

Problem description:

I very much like the flexibility offered by the HDF backend: being able to save my chain to a file and continue at a later point in time (especially as a backup for long computations when my CPU node dies). However, when I have a fast log_prob function, the overhead of opening/writing/closing the HDF file on each iteration seems disproportionately high and the overall computation is painfully slow.

Expected behavior:

Perhaps an easy solution would be an option to only save the chain state on every n-th iteration (where n is some adjustable number or n is calculated based on the relative progress). This may save some overhead by only opening/closing the HDF file once in a while.

AndreWaehlisch avatar Jun 24 '21 18:06 AndreWaehlisch

Yes - this is a nice proposal and I'd be happy to review such a PR, but I'm not immediately sure how painful it would be to implement.

If you're planning on thinning your chain eventually, you could use the thin_by parameter for sample or run_mcmc (see here), but your suggestion would be much better in general!

dfm avatar Jun 24 '21 19:06 dfm

I was about to suggest this as well; I even opened a discussion on the emcee Google Groups - link - and in the fourth reply a user gave a small example of how to run emcee as an iterator, which I was already doing, and I was trying to figure out how to periodically write the chain to disk.

One thing that I had in mind was saving the chain at the same time it would compute the autocorrelation time, in parallel, because for longer chains that computation can take a while and that way you would make the most out of that time.

In the mean time I'll keep trying to figure that out myself, but having this by default would be great!

jpmvferreira avatar Aug 28 '21 19:08 jpmvferreira

I've been digging through the source code: each time a step is computed it is saved to the backend using the method save_step. If the backend is an HDF file, the file is opened on each new step and the contents are written to it.

The same thing happens when getting a value from this backend via the method get_value: the file is opened, the relevant entry is read into memory, and then used to perform whatever computation it was called for.

I was trying to modify the sample function in ensemble.py, but I realized that if I only saved periodically and the user then tried to, say, get the autocorrelation time, the file would be outdated and the result would be wrong. So all modifications should be in the backend itself, right?

If so, a way to communicate the buffer to the backend would have to be added, and at the moment there doesn't seem to be any way of doing that.
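Just to make the idea concrete, here is a minimal, emcee-independent sketch of such a buffer: steps accumulate in memory and are only flushed to the (expensive) store every n calls, with an explicit flush for when a read needs an up-to-date file. The class and method names are hypothetical, not emcee API:

```python
class BufferedStore:
    """Hypothetical buffer between a sampler and a slow backend.

    `write_fn` stands in for the expensive open/write/close of the
    HDF file; it receives the list of buffered steps at flush time.
    """

    def __init__(self, write_fn, flush_every=100):
        self.write_fn = write_fn
        self.flush_every = flush_every
        self._buffer = []

    def save_step(self, step):
        # Called on every iteration, but only hits the disk once
        # per `flush_every` steps.
        self._buffer.append(step)
        if len(self._buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        # Must be called before any read (e.g. computing the
        # autocorrelation time), so the file is never stale.
        if self._buffer:
            self.write_fn(self._buffer)
            self._buffer = []


# Usage: count how many "disk writes" happen for 250 steps.
writes = []
store = BufferedStore(writes.append, flush_every=100)
for i in range(250):
    store.save_step(i)
store.flush()  # flush the remaining 50 steps
print(len(writes))  # 3 writes instead of 250
```

The key design point is the explicit flush-before-read, which avoids exactly the stale-file problem described above.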

jpmvferreira avatar Aug 30 '21 13:08 jpmvferreira