Solver error crashes Julia session
Within my research code I hit a very confusing behavior of PATH that caused my entire Julia session to silently close. Debugging with GDB gave no backtrace, and I did not get any error on the Julia side -- the session just closed.
Upon further inspection of the solver log, it looks like this is happening due to a PATH internal error that is not gracefully handled on the Julia side; the log ends with,
** SOLVER ERROR **
Lemke: invertible basis could not be computed.
You may avoid this error by increasing the
lemke_rank_deficiency_iterations option.
then the Julia session crashes.
Can this be caught within PATHSolver.jl and converted to a recoverable error? Or is this considered a Julia bug that should be filed upstream?
Unfortunately, I have no compact reproducer for this issue yet.
Copying comment from #71:
During my hunt for #70 I noticed some weird behavior which looks to me like PATH has some global state, either in C or in Julia. Specifically, #70 only showed up if I solved another specific MCP before it. I suspect that some solver object is not fully reset, leading to this interdependence between solves. I'm calling the solver through the MCP C API, so I wouldn't expect any side effects between calls, as I'm not carrying over any solver object. This issue is part of the reason why I cannot give a compact reproducer for #70 -- it only shows up if I run my research code in a specific sequence.
I suspect these two issues are related. Currently, we use a very basic interface to the C API:
https://github.com/chkwon/PATHSolver.jl/blob/2087cc0669fa9e1a6faf994bd6b942fadb324a40/src/C_API.jl#L575-L633
For example, we don't use the workspace feature, so I assume it's using some global workspace.
We also don't try to catch any errors in https://github.com/chkwon/PATHSolver.jl/blob/2087cc0669fa9e1a6faf994bd6b942fadb324a40/src/C_API.jl#L626
There are probably additional functions in the C API that we could wrap and update how we call PATH, but I don't have time to look into the best practices. PRs accepted :smile:
Thank you for the swift reply. Regarding your comment
> we don't try to catch any errors in ...
It seems to me that Julia should throw them as normal Julia errors automatically (https://docs.julialang.org/en/v1/manual/calling-c-and-fortran-code/). Or are you aware of anything special that needs to be done to catch errors from ccalls?
Oops. I led you astray with "catch." I didn't mean try-catch.
ccalls don't throw Julia errors. There might be something in the C API for handling errors. Or we might need to look at the status and do something different if it didn't solve correctly.
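To make the point about ccalls concrete, here is a minimal sketch that does not use the PATH C API at all. It assumes a POSIX system where libc exports `chdir`. The path and the `checked_chdir` wrapper are hypothetical names introduced only for illustration.

```julia
# Assumes a POSIX system where libc exports `chdir`; the path is a placeholder
# that is expected not to exist.
const MISSING_DIR = "/nonexistent-dir-for-ccall-demo"

# The raw ccall reports failure only through its return value (-1 here).
# No Julia exception is raised.
raw_status = ccall(:chdir, Cint, (Cstring,), MISSING_DIR)

# Hypothetical wrapper: inspect the status and turn a failure into a
# recoverable Julia error.
function checked_chdir(path::AbstractString)
    status = ccall(:chdir, Cint, (Cstring,), path)
    status == 0 || error("chdir failed with status $status for path $path")
    return nothing
end

# The wrapper's error can be handled with an ordinary try/catch.
caught = try
    checked_chdir(MISSING_DIR)
    false
catch err
    err isa ErrorException
end
```

This pattern only works when the C function actually returns. If the native code terminates the process before returning, no status is ever handed back to Julia and there is nothing to check.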
Thanks for the clarification.
FWIW, my debugging suggests that for the error above we actually never hit any code after the solve line. So there is no way to look at the status on the Julia end.
If PATH is aborting ungracefully, then there's not much we can do. I do wonder if using an explicit workspace would fix things though.
~~Are you aware if the API in Path.h is preferred / more recent than the one in MCP_Interface.h?~~
Never mind, I guess they are doing totally different things. I'll try and see if a workspace helps. Thanks for the suggestions.
I implemented the workspace feature and it does seem to change the behavior. Unfortunately, in my setting it now reliably crashes with the error reported above. So from that observation one can only conclude that workspace allocation changes the behavior of the first run (which seems scary in itself). It remains unclear whether subsequent runs are now independent of one another, because I never get to a second run with my original reproducer. I'll try to poke a bit more before I submit a PR.
~~Update: Even with the workspace feature I'm seeing the same problem of getting different solutions depending on which problem PATH has been invoked on before~~ (see below)
Another update on this front: the reported source of non-determinism in #71 was almost surely a mistake on my side, because I am wrapping the PATH solver in another struct whose RNG I forgot to reset. So I'm fairly confident that that was a fluke. The issue with the error above persists, however.
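For anyone hitting a similar apparent dependence between solves, here is a minimal sketch of that kind of mistake. It does not call PATH. `RandomizedSolver` and `initial_guess` are hypothetical stand-ins, and the sketch assumes the wrapper's RNG is used to draw initial guesses, which may differ from the actual setup.

```julia
using Random

# Hypothetical wrapper: stands in for a struct that owns a solver plus an RNG.
struct RandomizedSolver
    rng::MersenneTwister
end

# Stand-in for a solve call: it only draws an initial guess from the RNG.
initial_guess(s::RandomizedSolver, n::Int) = rand(s.rng, n)

solver = RandomizedSolver(MersenneTwister(1))

a = initial_guess(solver, 3)   # first run
b = initial_guess(solver, 3)   # second run: RNG state has advanced, so b != a

Random.seed!(solver.rng, 1)    # reset the RNG before the next run
c = initial_guess(solver, 3)   # reproduces the first run: c == a
```

Without the `Random.seed!` call, each run depends on how many random numbers earlier runs consumed, which can look like global state inside the solver.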
Are you using PATH with JuMP? Or directly? Do you have a reproducible example?
I'm calling PATHSolver.solve_mcp directly, in a fashion similar to what I published here. Unfortunately, I don't have a reproducible example that I can share. So far, the only context in which I was able to reproduce the issue was a long-running simulation in code that I'm unable to share at this point. If you are keen, I can share a private repo, but even there it takes ~5min of compilation/setup to get to the error. I will keep trying to isolate a simple MCP that triggers this.
It'd be interesting to revisit this on PATHSolver#master with the new options I've been adding:
https://github.com/chkwon/PATHSolver.jl/blob/606c5f271a6f3ac3d00ce3660ec83a93cc87841a/src/C_API.jl#L716-L727
Nice! Thank you for adding all those great features. I'll give this a spin -- ideally this weekend -- and post an update here!
Quick update: I'll probably have to get back to this in early February. Sorry for the delay.
There have been a bunch of updates since this issue, but without a reproducible example there's not much we can do.
I'm closing for now, but please re-open if you have a reproducible example on the latest version.