Occasional crash in distributed training and prediction of DLT models due to pystan

Open ggerogiokas opened this issue 3 years ago • 1 comments

When I run roughly 10,000 different time series, some of the runs crash. The crashes typically come down to one of two errors:

"pickle data was truncated" or "Ran out of input".

Both seem to be related to Stan compilation issues.
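
For context, both messages are what Python's pickle module raises when it is handed an incomplete byte stream. Below is a minimal sketch using only the standard library. Whether a cached Stan model is actually being read by one worker while another is still writing it is an assumption here, not something confirmed in this thread; the payload and the helper name try_load are made up for illustration.

```python
import pickle

# Hypothetical payload standing in for a pickled, compiled model.
payload = pickle.dumps({"model": list(range(1000))})


def try_load(data):
    """Return (exception type name, message) if unpickling fails, else None."""
    try:
        pickle.loads(data)
    except (pickle.UnpicklingError, EOFError) as exc:
        return type(exc).__name__, str(exc)
    return None


empty_result = try_load(b"")             # an empty file or stream
partial_result = try_load(payload[:20])  # a stream cut off part-way through
full_result = try_load(payload)          # the intact stream loads without error

print(empty_result)
print(partial_result)
print(full_result)
```

If this is the cause, anything that guarantees the cached file is fully written before other processes read it should remove both errors.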

Any tips on how to avoid these issues?

I am running on Ubuntu with Python 3.8, so I don't think it's an OS issue.

ggerogiokas avatar Sep 13 '22 07:09 ggerogiokas

Can you provide some data or an object snapshot from when the issue happens? @ggerogiokas

edwinnglabs avatar Sep 20 '22 21:09 edwinnglabs

Hi @edwinnglabs

I managed to find a workaround. Every time the cluster starts up, I run every flavour (DLT, ETS, LGT) of orbit model once. That seems to cache all the Stan models I need, and there are no longer any Stan compilation errors when I run multiple orbit models in parallel.

Now I have issues getting good CPU utilisation, but I guess I can raise that in a new issue.
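
The warm-up step could be sketched roughly as follows. This is a minimal sketch, not the exact code used on the cluster: the helper name warm_up, the column names "ds" and "y", and the choice of estimator are all assumptions for illustration. The helper only assumes that each factory returns an object with a fit(df=...) method, which orbit models provide.

```python
def warm_up(model_factories, df):
    """Fit each model once, serially, before any parallel work starts.

    model_factories: dict mapping a label to a zero-argument callable
                     that builds a fresh model (hypothetical convention).
    df: a small DataFrame that every model can be fitted on.
    Returns the labels of the models that were fitted, in order.
    """
    warmed = []
    for name, factory in model_factories.items():
        model = factory()   # build the model in this single process
        model.fit(df=df)    # first fit triggers any compilation and caching
        warmed.append(name)
    return warmed


# Possible usage with orbit (column names and estimator are assumptions):
#
# from orbit.models import DLT, ETS, LGT
#
# factories = {
#     "DLT": lambda: DLT(response_col="y", date_col="ds", estimator="stan-map"),
#     "ETS": lambda: ETS(response_col="y", date_col="ds", estimator="stan-map"),
#     "LGT": lambda: LGT(response_col="y", date_col="ds", estimator="stan-map"),
# }
# warm_up(factories, small_df)  # run once per node at cluster start-up
```

The warm-up has to run on each machine that will later fit models, and it has to finish before the parallel jobs are launched; otherwise workers can still race on the same cache.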

ggerogiokas avatar Sep 27 '22 15:09 ggerogiokas