PATHSolver.jl

Solver error crashes Julia session

Open lassepe opened this issue 3 years ago • 14 comments

Within my research code I hit a very confusing behavior of PATH that caused my entire Julia session to just silently close. Debugging with GDB gave no backtrace and I did not get any error on the Julia side -- the session just closed.

Upon further inspection of the solver log, it looks like this is happening due to a PATH internal error that is not gracefully handled on the Julia side; the log ends with,

  ** SOLVER ERROR **
Lemke: invertible basis could not be computed.
       You may avoid this error by increasing the
       lemke_rank_deficiency_iterations option.

then the Julia session crashes.

Can this be caught within PATHSolver.jl and converted to a recoverable error? Or is this considered a Julia bug that should be filed upstream?

lassepe avatar Nov 10 '22 15:11 lassepe

Unfortunately, I have no compact reproducer for this issue yet.

lassepe avatar Nov 10 '22 15:11 lassepe

Copying comment from #71:

During my hunt for #70 I noticed some weird behavior which looks to me like PATH has some global state either in C or in Julia. Specifically, #70 only showed up if I solved another specific MCP before it. I suspect that some solver object is not fully reset, leading to this interdependence between solves. I'm calling the solver through the MCP C API, so I wouldn't expect any side effects between calls as I'm not carrying over any solver object. This issue is part of the reason why I cannot give a compact reproducer for #70 -- it only shows up if I run my research code in a specific sequence.

odow avatar Nov 10 '22 20:11 odow

I suspect these two issues are related. Currently, we use a very basic interface to the C API:

https://github.com/chkwon/PATHSolver.jl/blob/2087cc0669fa9e1a6faf994bd6b942fadb324a40/src/C_API.jl#L575-L633

For example, we don't use the workspace feature, so I assume it's using some global workspace.

We also don't try to catch any errors in https://github.com/chkwon/PATHSolver.jl/blob/2087cc0669fa9e1a6faf994bd6b942fadb324a40/src/C_API.jl#L626

There are probably additional functions in the C API that we could wrap and update how we call PATH, but I don't have time to look into the best practices. PRs accepted :smile:

odow avatar Nov 10 '22 20:11 odow

Thank you for the swift reply. Regarding your comment

> we don't try to catch any errors in ...

It seems to me that Julia should throw them as normal Julia errors automatically (https://docs.julialang.org/en/v1/manual/calling-c-and-fortran-code/). Or are you aware of any special things that would need to be done to catch the errors of ccalls?

lassepe avatar Nov 10 '22 21:11 lassepe

Oops. I led you astray with "catch." I didn't mean try-catch.

ccalls don't throw Julia errors. There might be something in the C API for handling errors. Or we might need to look at the status and do something different if it didn't solve correctly.
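
To illustrate the point about ccalls, here is a minimal sketch that is not PATHSolver.jl code. It assumes a POSIX libc that provides `access`, and `checked_access` is a hypothetical helper name. The C call reports failure through its return value, and it is the Julia wrapper that has to turn that status into an exception:

```julia
# Hypothetical helper, not part of PATHSolver.jl. Assumes a POSIX libc.
# `access(path, 0)` returns 0 if `path` exists and -1 otherwise.
function checked_access(path::AbstractString)
    status = ccall(:access, Cint, (Cstring, Cint), path, 0)
    # The failed C call did not throw; the wrapper must convert the status.
    status == 0 || error("access failed for $(repr(path)) (errno = $(Libc.errno()))")
    return nothing
end
```

A check like this only helps when the C function actually returns. If the library aborts the process, no status ever reaches the Julia side.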

odow avatar Nov 10 '22 21:11 odow

Thanks for the clarification.

FWIW, my debugging suggests that for the error above we actually never hit any code after the solve line. So there is no way to look at the status on the Julia end.

lassepe avatar Nov 10 '22 21:11 lassepe

If PATH is aborting ungracefully, then there's not much we can do. I do wonder if using an explicit workspace would fix things though.
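
One possible workaround, sketched here under assumptions and not something PATHSolver.jl provides, is to run the risky call in a separate Julia process so that an abort only kills that child. `survives_in_child` is a hypothetical helper, and passing results back to the parent (for example through a file) is left out:

```julia
# Hypothetical helper: run `code` in a fresh Julia process and report whether
# that process exited cleanly. A crash there leaves the parent session alive.
function survives_in_child(code::AbstractString)
    cmd = `$(Base.julia_cmd()) --startup-file=no -e $code`
    # `success` returns false for a nonzero exit code or a fatal signal.
    return success(cmd)
end
```

For example, `survives_in_child("exit(3)")` returns `false` while the calling session keeps running. The cost is a fresh Julia startup and package load for every isolated call.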

odow avatar Nov 10 '22 22:11 odow

~~Are you aware if the API in Path.h is preferred / more recent than the one in MCP_Interface.h?~~

lassepe avatar Nov 10 '22 22:11 lassepe

Never mind, I guess they are doing totally different things. I'll try and see if a workspace helps. Thanks for the suggestions.

lassepe avatar Nov 10 '22 22:11 lassepe

I implemented the workspace feature and it indeed seems to change the behavior. In my setting, unfortunately, it now reliably crashes with the error reported above. So from that observation one can only conclude that workspace allocation changes the behavior of the first run (which seems scary in itself). It remains unclear if subsequent runs are now independent from one another because I never get to a second run with my original reproducer. I'll try to poke a bit more before I submit a PR.

lassepe avatar Nov 11 '22 09:11 lassepe

~~Update: Even with the workspace feature I'm seeing the same problem of getting different solutions depending on which problem PATH has been invoked on before~~ (see below)

lassepe avatar Nov 11 '22 11:11 lassepe

Another update on this front: the reported source of non-determinism in #71 was almost surely a mistake on my side because I am wrapping the PATH solver in another struct whose RNG I forgot to reset. So I'm fairly confident that that was a fluke. The issue with the error above persists, however.

lassepe avatar Nov 11 '22 19:11 lassepe

Are you using PATH with JuMP? Or directly? Do you have a reproducible example?

odow avatar Nov 11 '22 20:11 odow

I'm calling PATHSolver.solve_mcp directly in a fashion similar to what I published here. Unfortunately, I don't have a reproducible example that seems shareable. The only context in which I was able to reproduce the issue so far was in a long-running simulation using code that I'm unable to share at this point. If you are keen, I can share a private repo, but even there it takes ~5min of compilation/setup to get to the error. I will keep trying to isolate a simple MCP that triggers this.

lassepe avatar Nov 11 '22 21:11 lassepe

It'd be interesting to revisit this on the PATHSolver#master with the new options I've been adding: https://github.com/chkwon/PATHSolver.jl/blob/606c5f271a6f3ac3d00ce3660ec83a93cc87841a/src/C_API.jl#L716-L727

odow avatar Jan 03 '23 00:01 odow

Nice! Thank you for adding all those great features. I'll give this a spin -- ideally this weekend -- and post an update here!

lassepe avatar Jan 03 '23 10:01 lassepe

Quick update: I'll probably have to get back to this in early February. Sorry for the delay.

lassepe avatar Jan 11 '23 08:01 lassepe

There have been a bunch of updates since this issue, but without a reproducible example there's not much we can do.

I'm closing for now, but please re-open if you have a reproducible example on the latest version.

odow avatar Aug 02 '23 01:08 odow