
Heartbeat problems when running with caffe/pytorch

Open ssudholt opened this issue 8 years ago • 6 comments

Background: I'm running my experiments using Caffe. I had a weird error where my experiments would not terminate and heartbeat signals were lost without notice. It seems this is due to the main thread calling into C++ code through boost.python bindings, during which the heartbeat thread simply stops responding.

I was able to fix the problem by changing the IntervalTimer class https://github.com/IDSIA/sacred/blob/d71cf7c6294e8b7d0c2a3992bfdf1eaf4cf00652/sacred/utils.py#L443 to inherit from multiprocessing.Process instead of threading.Thread.
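For reference, the change amounts to something like the sketch below. The class layout mirrors sacred's `IntervalTimer` (a timer with a `create` factory and a stop event), but the exact signatures here are an approximation of the linked revision, not a verbatim copy. The key point is that a `multiprocessing.Process` runs in a separate interpreter, so it keeps firing even while the parent's GIL is held inside native code:

```python
import multiprocessing


class IntervalTimer(multiprocessing.Process):
    """Call `func` every `interval` seconds until the stop event is set.

    Process-based variant of a thread-based interval timer: because it
    lives in its own interpreter, it is not starved when the parent
    process blocks inside C/C++ extension code without releasing the GIL.
    """

    @classmethod
    def create(cls, func, interval=10):
        stop_event = multiprocessing.Event()
        timer = cls(stop_event, func, interval)
        return stop_event, timer

    def __init__(self, event, func, interval=10.0):
        # daemon=True so a crashed main process does not leave the
        # timer process running forever
        super().__init__(daemon=True)
        self.stopped = event
        self.func = func
        self.interval = interval

    def run(self):
        # Event.wait() returns False on timeout and True once the
        # event is set, so this loop ticks until stopped.
        while not self.stopped.wait(self.interval):
            self.func()
```

Note the caveat from the EDIT above: unlike threads, the child process does not share memory with the parent, so `func` can no longer mutate the experiment's state directly — any data has to cross the process boundary explicitly (e.g. via `multiprocessing` queues or shared values).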

If there's interest in adding this, I can open a PR. The change should not break any existing functionality and would allow running experiments that use the boost.python interface.

EDIT: Just noticed that passing objects between processes is not as easy as I thought. I'll have to work out some more magic to make this work.

ssudholt avatar Feb 24 '18 10:02 ssudholt

@ssudholt any updates on this? I'm facing issues which seem very similar when running pytorch.

talesa avatar May 10 '18 23:05 talesa

I hacked together a very ugly fix using multiprocessing. I'm still losing heartbeats, and sometimes the experiments take hours (or days) until they finally terminate. Unfortunately, I haven't gotten to the bottom of what the actual problem is. But it's interesting that this happens in PyTorch as well; I thought it was a Caffe-only problem due to the use of boost.python.

ssudholt avatar May 11 '18 07:05 ssudholt

Could you post a minimal example to reproduce this problem? Also, one thing worth trying is --capture=sys, since the default FD-based capturing is known to sometimes cause problems.
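For anyone following along, the suggested flag is passed on the experiment's command line (the script name below is a hypothetical placeholder for your own sacred experiment):

```shell
# Run the experiment with sys-level stdout/stderr capturing
# instead of the default file-descriptor-based capturing.
python my_experiment.py --capture=sys
```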

Qwlouse avatar May 16 '18 14:05 Qwlouse

I was trying to come up with an MWE, but unfortunately this behavior only shows in my larger experiments. I should add that I store quite a lot in the info dict (at least an ndarray of shape (13000, 4000)).

My best guess at the moment is that the heartbeat thread is being starved and only gets scheduled during my evaluations at the end of an epoch, which spend relatively more time "in the Python code" compared to the rest of the run, which mostly sits in the underlying C++ functions. Is it at all possible that all heartbeats queue up and then get dequeued in the heartbeat thread every 10 seconds after the experiment finishes using the C++ functions? This would be my only explanation so far for why termination takes forever.

I'm going to try to come up with a minimal example that shows this behavior. In the meantime, I'll try out the capture flag and report the results, thanks for the tip. @talesa did you get any further with this?

ssudholt avatar May 19 '18 09:05 ssudholt

I'm aiming to write a more thorough description of the problem and my attempts at debugging it at the beginning of June. I've started drafting an issue here: https://gist.github.com/talesa/6a78447bda17b3d85ebe2e311cac61da, but I want to do some more debugging before I bother you about it, @Qwlouse. I might draft a PR if I figure something out.

I think part of my problems with heartbeats is that MongoObserver was failing, because this call to MongoDB
https://github.com/IDSIA/sacred/blob/576803abd2fa4d5945ceedbd0bf5a7db6daac0d8/sacred/observers/mongo.py#L217 does not have its errors handled the way it is done in save() https://github.com/IDSIA/sacred/blob/576803abd2fa4d5945ceedbd0bf5a7db6daac0d8/sacred/observers/mongo.py#L245-L252. If pymongo.errors.AutoReconnect is not caught at all, I think PyMongo does not reconnect automatically.
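The fix talesa is pointing at would be to wrap the unguarded call in the same kind of retry loop that save() uses. Here is a minimal, library-agnostic sketch of that pattern; `with_reconnect` and `op` are illustrative names (not sacred's or PyMongo's API), and the built-in `ConnectionError` stands in for `pymongo.errors.AutoReconnect`:

```python
import time


def with_reconnect(op, retries=3, delay=0.1, exc_types=(ConnectionError,)):
    """Call `op()` and retry it when a transient connection error is raised.

    `exc_types` is the tuple of exceptions treated as retryable; with
    PyMongo you would pass (pymongo.errors.AutoReconnect,) here.
    Re-raises the last error if all attempts fail.
    """
    for attempt in range(retries):
        try:
            return op()
        except exc_types:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(delay)  # brief backoff before trying again
```

In sacred's case, wrapping the update call at mongo.py#L217 this way would give PyMongo a chance to re-establish the connection instead of silently dropping the heartbeat.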

talesa avatar May 21 '18 14:05 talesa

I should've added that I'm running a custom PostgreSQL observer. The problem seems to have nothing to do with the capture mode: I've tried both sys and no, and both yielded the same results.

ssudholt avatar May 23 '18 18:05 ssudholt