[Issue]: msgpack symbols in migraphx conflict with the ones in rocroller
Problem Description
The msgpackc-cxx symbols in rocroller (from a TheRock build) conflict with the ones in libmigraphx.so. Both are shared libraries that export these symbols publicly, so whichever one is loaded second binds to the other's copies and can break. In my case rocroller was loaded first, and I see the following error when trying to load a compiled binary mxr file (with migraphx::load(path, options)):
MIGraphX Error: /workspace/AMDMIGraphX/src/value.cpp:371: at: Not an object for field: version
exception: Failed to call function
From LD_DEBUG=bindings:
3813: binding file /opt/rocm/lib/../lib/migraphx/lib/libmigraphx.so.2013000 [0] to /opt/rocm/lib/librocroller.so.1 [0]: normal symbol `_ZTIN7msgpack2v113size_overflowE'
3813: binding file /opt/rocm/lib/../lib/migraphx/lib/libmigraphx.so.2013000 [0] to /opt/rocm/lib/librocroller.so.1 [0]: normal symbol `_ZN7msgpack2v117bin_size_overflowD0Ev'
3813: binding file /opt/rocm/lib/../lib/migraphx/lib/libmigraphx.so.2013000 [0] to /opt/rocm/lib/librocroller.so.1 [0]: normal symbol `_ZN7msgpack2v111parse_errorD0Ev'
3813: binding file /opt/rocm/lib/../lib/migraphx/lib/libmigraphx.so.2013000 [0] to /opt/rocm/lib/librocroller.so.1 [0]: normal symbol `_ZTSN7msgpack2v117str_size_overflowE'
3813: binding file /opt/rocm/lib/../lib/migraphx/lib/libmigraphx.so.2013000 [0] to /opt/rocm/lib/librocroller.so.1 [0]: normal symbol `_ZN7msgpack2v110type_errorD0Ev'
A quick hacky fix for me was:
diff --git a/src/msgpack.cpp b/src/msgpack.cpp
index 2ee49071a..b497714c1 100644
--- a/src/msgpack.cpp
+++ b/src/msgpack.cpp
@@ -23,7 +23,10 @@
*/
#include <migraphx/msgpack.hpp>
#include <migraphx/serialize.hpp>
+
+#pragma GCC visibility push(hidden)
#include <msgpack.hpp>
+#pragma GCC visibility pop
namespace migraphx {
inline namespace MIGRAPHX_INLINE_NS {
@@ -52,6 +55,7 @@ static void msgpack_chunk_for_each(Iterator start, Iterator last, F f)
} // namespace MIGRAPHX_INLINE_NS
} // namespace migraphx
+#pragma GCC visibility push(hidden)
namespace msgpack {
MSGPACK_API_VERSION_NAMESPACE(MSGPACK_DEFAULT_API_NS)
{
@@ -206,6 +210,7 @@ MSGPACK_API_VERSION_NAMESPACE(MSGPACK_DEFAULT_API_NS)
} // namespace adaptor
} // MSGPACK_API_VERSION_NAMESPACE(MSGPACK_DEFAULT_API_NS)
} // namespace msgpack
+#pragma GCC visibility pop
namespace migraphx {
inline namespace MIGRAPHX_INLINE_NS {
Things started to work after this patch was applied.
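To show the idea behind the patch in isolation, here is a minimal sketch. It assumes GCC or Clang on an ELF platform; the `vendored` namespace and `mylib_parse` are hypothetical stand-ins for `<msgpack.hpp>` and the library's public API, not anything from migraphx.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a vendored header-only dependency (in the issue
// this role is played by <msgpack.hpp>). Everything declared between push and
// pop gets hidden visibility, so the typeinfo, vtable, and destructor symbols
// of parse_error are not exported from the shared library.
#pragma GCC visibility push(hidden)
namespace vendored {
struct parse_error : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

inline int parse_digit(char c)
{
    if(c < '0' || c > '9')
        throw parse_error("not a digit");
    return c - '0';
}
} // namespace vendored
#pragma GCC visibility pop

// The library's own API keeps default visibility and stays exported.
__attribute__((visibility("default"))) int mylib_parse(const std::string& s)
{
    try
    {
        int value = 0;
        for(char c : s)
            value = value * 10 + vendored::parse_digit(c);
        return value;
    }
    catch(const vendored::parse_error&)
    {
        // The hidden exception type is thrown and caught inside the library,
        // so it never has to be matched across a shared-object boundary.
        return -1;
    }
}
```

If this is built as a shared library, `nm -DC` on the result should list `mylib_parse` but nothing from `vendored`, so another library's copy of the same dependency can no longer be bound in its place.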
Aside from msgpack, I also see other symbol conflicts: libsqlite3 is pulled in from several places (/opt/rocm/lib/rocm_sysdeps/lib/librocm_sysdeps_sqlite3.so and /usr/lib/x86_64-linux-gnu/libsqlite3.so.0 for me).
Related to the recent llvm symbol conflict fix in migraphx_gpu @pfultz2
Operating System
Ubuntu Any
CPU
Ryzen
GPU
Other
Other
Any
ROCm Version
ROCm 6.0.0
Steps to Reproduce
Run your migraphx app with:
LD_PRELOAD=/opt/rocm/lib/librocroller.so.1 migraphx_app
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response
What is rocroller?
Is TheRock building msgpack and distributing shared libraries of msgpack with rocm so other components that need msgpack will use that version? If so, migraphx should just use the shared version instead of linking in a different static library version.
More context would help us understand what's going on and how to solve the issue.
I don't know what rocroller is or does exactly. In TheRock it seems to be built for libBLASLt.
At this moment migraphx is not integrated into TheRock so I'm building it on top of it. I'm not aware of all the components in ROCm to synchronise them properly -- there are too many to keep track of and understand intimately to figure out what might conflict -- a human can't do it.
In an ideal world we could synchronise everything and use single shared libraries (and single shared headers -- note that msgpackc-cxx is header-only). In practice everyone builds and bundles their own static libraries or headers for stability and links them publicly into their shared libraries, and these kinds of things happen.
Another problem is that you won't get clear errors from this kind of conflict, just random runtime errors, and it's a pain to track them down.
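One way to make such a conflict visible from inside the process, besides LD_DEBUG=bindings, is to ask the dynamic linker which loaded object a symbol resolves to. The following is a rough sketch that assumes Linux with glibc; `providing_object` is a hypothetical helper name, not part of any ROCm or migraphx API.

```cpp
#include <dlfcn.h> // dlsym, dladdr, Dl_info (glibc extensions)
#include <string>

// Return the path of the loaded object that the global symbol scope resolves
// `mangled_name` to, or an empty string if the symbol cannot be found.
std::string providing_object(const char* mangled_name)
{
    // RTLD_DEFAULT searches in the default load order, which is also the
    // order that decides which copy of a duplicated symbol wins.
    void* addr = dlsym(RTLD_DEFAULT, mangled_name);
    if(addr == nullptr)
        return {};

    Dl_info info{};
    if(dladdr(addr, &info) == 0 || info.dli_fname == nullptr)
        return {};
    return info.dli_fname;
}
```

Called with a mangled name from the log above, for example `providing_object("_ZTIN7msgpack2v113size_overflowE")`, after both libraries are loaded, it would be expected to report the rocroller library in the setup described here. With older glibc versions the program may need to be linked with `-ldl`.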
So this is the reality. This happens. If your shared library exports generic symbols that are linked into it but are not part of your own API, they will eventually get overridden by something else that uses its own version of that API, and unexpected things will start to happen. The only solution is to hide them.