
Better way to store Tensorboard logs in Keepsake

Open andreasjansson opened this issue 4 years ago • 3 comments

At the moment you have to put Tensorboard logs in the path passed to experiment.checkpoint. This has two main drawbacks:

  • Checkpoint sizes grow with each checkpoint as TB logs expand
  • To view the Tensorboard logs for the whole experiment you have to check out the last checkpoint

We should probably have some way of tracking experiment-level files that change over the course of an experiment and aren't necessarily tied to specific checkpoints.

This could apply to use cases other than Tensorboard as well.
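To make the first drawback concrete, here is a back-of-envelope sketch (pure Python, no Keepsake calls; the per-checkpoint log growth figure is an illustrative assumption): if every checkpoint snapshots the entire, growing Tensorboard log directory, total stored bytes grow quadratically with the number of checkpoints.

```python
# Sketch: if checkpoint i copies i * bytes_per_step of accumulated logs,
# total storage is the sum 1 + 2 + ... + n times bytes_per_step.
# The numbers below are assumptions, not Keepsake measurements.

def total_stored(num_checkpoints: int, bytes_per_step: int) -> int:
    """Total bytes stored across all checkpoint snapshots of the log dir."""
    return sum(i * bytes_per_step for i in range(1, num_checkpoints + 1))

if __name__ == "__main__":
    # 100 checkpoints, logs growing 1 MB per checkpoint: the log directory
    # itself ends at 100 MB, but the checkpoints store roughly 5 GB.
    print(total_stored(100, 1_000_000))  # 5050000000
```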

andreasjansson · Mar 19 '21

One potential approach would be to add a post-experiment path parameter to experiment.init, in addition to the current (pre-experiment) path parameter. The current path parameter uploads before the experiment runs; the new parameter would upload once the experiment finishes running. Additionally, it would be good, though not strictly necessary, to trigger that upload if the experiment stops "early" for whatever reason, such as a keyboard interrupt or some other forced early stopping.
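The behavior described above can be sketched with a try/finally block, which fires on normal completion and on early stops alike. Note this is a sketch of the proposal, not Keepsake's actual implementation: the post-experiment parameter does not exist yet, and the upload list here is a stand-in for a real upload.

```python
# Sketch: upload a hypothetical post-experiment path when training ends,
# whether it finishes normally or is interrupted. `uploaded` stands in
# for Keepsake's real upload machinery.

def run_experiment(train_fn, post_path: str, uploaded: list) -> None:
    try:
        train_fn()
    finally:
        # Runs on normal completion AND on early stops such as
        # KeyboardInterrupt, so post_path is uploaded either way.
        uploaded.append(post_path)

def interrupted_training():
    raise KeyboardInterrupt  # simulate the user hitting Ctrl-C mid-run

if __name__ == "__main__":
    uploads = []
    try:
        run_experiment(interrupted_training, "tensorboard-logs/", uploads)
    except KeyboardInterrupt:
        pass
    print(uploads)  # ['tensorboard-logs/']
```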

sjakkampudi · Mar 19 '21

I've been keeping an eye on Keepsake for a few days now, and this would definitely be a great feature that would get me to go all in, especially if there's a way to monitor experiments from multiple computers in a single Tensorboard by syncing them. It'd be nice if Tensorboard logs could just be synced along with the checkpoints, rather than a copy being stored in each one. Then you could access a shared folder (which all experiments can write to) through Keepsake, as keepsake.logs or something along those lines, sync that folder to some machine periodically, and run Tensorboard on it.

pseeth · Mar 24 '21

One potential design for this could be something along the lines of:

experiment = keepsake.init(path=".", params=..., logs_path="./tensorboard-logs/")

where logs_path is stored in the experiment and synced to the same remote directory on each experiment.checkpoint(), as opposed to the checkpoint path, which is uploaded to a new checkpoint directory each time.

This would create a logs folder somewhere in the remote storage directory. On each experiment.checkpoint(), the local logs_path would be uploaded to a remote logs folder shared across all checkpoints of the experiment. That could naively upload the entire directory on each checkpoint, or it could be smarter and sync based on file hashes (though that's probably an optimization that could be added later).
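The hash-based variant mentioned above could look something like this sketch: on each checkpoint, only files whose content hash has changed since the last sync get re-uploaded. sync_logs and the remote dict are illustrative stand-ins, not Keepsake internals.

```python
# Sketch: incremental log sync keyed on content hashes. Unchanged files
# are skipped; new or modified files are "uploaded" (copied into a dict
# standing in for remote storage).
import hashlib
from pathlib import Path

def sync_logs(logs_dir: Path, remote: dict, seen_hashes: dict) -> list:
    """Upload changed/new files under logs_dir; return uploaded relative paths."""
    uploaded = []
    for f in sorted(logs_dir.rglob("*")):
        if not f.is_file():
            continue
        data = f.read_bytes()
        digest = hashlib.sha256(data).hexdigest()
        rel = str(f.relative_to(logs_dir))
        if seen_hashes.get(rel) != digest:
            remote[rel] = data  # stand-in for the actual upload
            seen_hashes[rel] = digest
            uploaded.append(rel)
    return uploaded
```

A second call after modifying one file would upload only that file, so the per-checkpoint cost scales with what changed rather than with the total size of the log directory.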

What do you think @bfirsh?

andreasjansson · Mar 25 '21