[BUG] oiiotool crash/undefined behavior at exit
Describe the bug In Windows, oiiotool returns 0 from main, but the process returns 3 to the shell. This is because the c runtime abort function is terminating the process due to an error and it returns 3.
In Linux a few random times I got segfaults at exit. Valgrind reports errors of the form:
==56783== Command: oiiotool --info testrender.tif
==56783==
testrender.tif : 320 x 240, 4 channel, uint8 tiff
==56783== Invalid read of size 8
==56783== at 0x78610E: OpenImageIO_v2_4_0::ImageBufImpl::clear() (src/libOpenImageIO/imagebuf.cpp:703)
==56783== by 0x785D95: OpenImageIO_v2_4_0::ImageBufImpl::~ImageBufImpl() (src/libOpenImageIO/imagebuf.cpp:485)
==56783== by 0x78357D: OpenImageIO_v2_4_0::ImageBuf::impl_deleter(OpenImageIO_v2_4_0::ImageBufImpl*) (src/libOpenImageIO/imagebuf.cpp:333)
==56783== by 0x786F6E: ~unique_ptr (unique_ptr.h:292)
==56783== by 0x786F6E: OpenImageIO_v2_4_0::ImageBuf::~ImageBuf() (src/libOpenImageIO/imagebuf.cpp:569)
==56783== by 0x4288D1: std::_Sp_counted_ptr<OpenImageIO_v2_4_0::ImageBuf*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (shared_ptr_base.h:377)
==56783== by 0x49CD2B: _M_release (shared_ptr_base.h:155)
==56783== by 0x49CD2B: ~__shared_count (shared_ptr_base.h:730)
==56783== by 0x49CD2B: ~__shared_ptr (shared_ptr_base.h:1169)
==56783== by 0x49CD2B: _Destroy<std::shared_ptr<OpenImageIO_v2_4_0::ImageBuf> > (stl_construct.h:98)
==56783== by 0x49CD2B: __destroy<std::shared_ptr<OpenImageIO_v2_4_0::ImageBuf> *> (stl_construct.h:108)
==56783== by 0x49CD2B: _Destroy<std::shared_ptr<OpenImageIO_v2_4_0::ImageBuf> *> (stl_construct.h:136)
==56783== by 0x49CD2B: _Destroy<std::shared_ptr<OpenImageIO_v2_4_0::ImageBuf> *, std::shared_ptr<OpenImageIO_v2_4_0::ImageBuf> > (stl_construct.h:206)
==56783== by 0x49CD2B: ~vector (stl_vector.h:677)
==56783== by 0x49CD2B: ~SubimageRec (src/oiiotool/oiiotool.h:375)
==56783== by 0x49CD2B: _Destroy<OpenImageIO_v2_4_0::OiioTool::SubimageRec> (stl_construct.h:98)
==56783== by 0x49CD2B: __destroy<OpenImageIO_v2_4_0::OiioTool::SubimageRec *> (stl_construct.h:108)
==56783== by 0x49CD2B: _Destroy<OpenImageIO_v2_4_0::OiioTool::SubimageRec *> (stl_construct.h:136)
==56783== by 0x49CD2B: _Destroy<OpenImageIO_v2_4_0::OiioTool::SubimageRec *, OpenImageIO_v2_4_0::OiioTool::SubimageRec> (stl_construct.h:206)
==56783== by 0x49CD2B: ~vector (stl_vector.h:677)
==56783== by 0x49CD2B: OpenImageIO_v2_4_0::OiioTool::ImageRec::~ImageRec() (src/oiiotool/oiiotool.h:415)
==56783== by 0x49DB81: std::_Sp_counted_ptr<OpenImageIO_v2_4_0::OiioTool::ImageRec*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (shared_ptr_base.h:377)
==56783== by 0x48DE1C: _M_release (shared_ptr_base.h:155)
==56783== by 0x48DE1C: ~__shared_count (shared_ptr_base.h:730)
==56783== by 0x48DE1C: ~__shared_ptr (shared_ptr_base.h:1169)
==56783== by 0x48DE1C: OpenImageIO_v2_4_0::OiioTool::Oiiotool::~Oiiotool() (src/oiiotool/oiiotool.h:69)
==56783== by 0x5CB7C98: __run_exit_handlers (in /usr/lib64/libc-2.17.so)
==56783== by 0x5CB7CE6: exit (in /usr/lib64/libc-2.17.so)
==56783== by 0x5CA050B: (below main) (in /usr/lib64/libc-2.17.so)
==56783== Address 0x78d4200 is 0 bytes inside a block of size 66,176 free'd
==56783== at 0x4C2B06D: free (vg_replace_malloc.c:538)
==56783== by 0x863B4A: std::_Sp_counted_deleter<OpenImageIO_v2_4_0::pvt::ImageCacheImpl*, void (*)(OpenImageIO_v2_4_0::pvt::ImageCacheImpl*), std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::
_M_dispose() (shared_ptr_base.h:471)
==56783== by 0x85B87B: _M_release (shared_ptr_base.h:155)
==56783== by 0x85B87B: ~__shared_count (shared_ptr_base.h:730)
==56783== by 0x85B87B: std::__shared_ptr<OpenImageIO_v2_4_0::pvt::ImageCacheImpl, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() (shared_ptr_base.h:1169)
==56783== by 0x5CB7C98: __run_exit_handlers (in /usr/lib64/libc-2.17.so)
==56783== by 0x5CB7CE6: exit (in /usr/lib64/libc-2.17.so)
==56783== by 0x5CA050B: (below main) (in /usr/lib64/libc-2.17.so)
==56783== Block was alloc'd at
==56783== at 0x4C2C375: memalign (vg_replace_malloc.c:906)
==56783== by 0x4C2C43F: posix_memalign (vg_replace_malloc.c:1070)
==56783== by 0xABA1CB: OpenImageIO_v2_4_0::aligned_malloc(unsigned long, unsigned long) (src/libutil/sysutil.cpp:637)
==56783== by 0x85B5CD: aligned_new<OpenImageIO_v2_4_0::pvt::ImageCacheImpl> (src/include/OpenImageIO/platform.h:572)
==56783== by 0x85B5CD: OpenImageIO_v2_4_0::ImageCache::create(bool) (src/libtexture/imagecache.cpp:3899)
==56783== by 0x47DFF8: main (src/oiiotool/oiiotool.cpp:7010)
This happens because of incorrect static order initialization (and consequently destruction), at least in my build of OIIO using static libraries. There's the static Oiiotool ot in oiiotool.cpp and there's static std::shared_ptr<ImageCacheImpl> shared_image_cache; in imagecache.cpp. If shared_image_cache is destroyed first, because it was initialized after ot (maybe by chance, maybe by design) it will leave the ot with a dangling pointer to the shared cache which of course will fail in undefined ways when we try to later on access it.
Ignoring the issue of mixing shared pointers with naked pointers, I can imagine two possible fixes:
- Get rid of the global scope
otand instead create and delete it manually (or through a smart pointer) inmain(). You could still keep a global scopeOiiotool* otso that the code changes would be minimal (.becomes->, etc.). - If
otcan't be modified for some reason, we can due something a bit hacky that looks like:
index e3c87ad0..86778f47 100644
--- a/src/libOpenImageIO/imagebuf.cpp
+++ b/src/libOpenImageIO/imagebuf.cpp
@@ -28,7 +28,7 @@ OIIO_STRONG_PARAM_TYPE(DoLock, bool);
static atomic_ll IB_local_mem_current;
-
+extern bool shared_image_cache_destroyed;
ROI
get_roi(const ImageSpec& spec)
@@ -243,6 +243,8 @@ public:
// Invalidate the file in our imagecache and the shared one
void invalidate(ustring filename, bool force)
{
+ if (shared_image_cache_destroyed)
+ return; // this will leak private cache if private cache is also held in a global scope object...
ImageCache* shared_imagecache = ImageCache::create(true);
OIIO_DASSERT(shared_imagecache);
if (m_imagecache)
@@ -700,7 +702,8 @@ void
ImageBufImpl::clear()
{
if (storage() == ImageBuf::IMAGECACHE && m_imagecache && !m_name.empty()) {
- m_imagecache->close(m_name);
+ if (!shared_image_cache_destroyed)
+ m_imagecache->close(m_name);
invalidate(m_name, false);
}
free_pixels();
diff --git a/src/libtexture/imagecache.cpp b/src/libtexture/imagecache.cpp
index 60c07f90..ea7ccf16 100644
--- a/src/libtexture/imagecache.cpp
+++ b/src/libtexture/imagecache.cpp
@@ -43,8 +43,14 @@ spin_mutex ImageCacheImpl::m_perthread_info_mutex;
} // namespace pvt
+bool shared_image_cache_destroyed = false; // this must be before shared_image_cache
namespace { // anonymous
+struct SharedImageCacheDestroyedReporter {
+ SharedImageCacheDestroyedReporter(bool& ref) : m_ref(ref) {}
+ ~SharedImageCacheDestroyedReporter() {m_ref = true;}
+ bool& m_ref;
+} SharedImageCacheDestroyedReporterInstance(shared_image_cache_destroyed);
static std::shared_ptr<ImageCacheImpl> shared_image_cache;
static spin_mutex shared_image_cache_mutex;
I gave the second option a try, because it's what first came to mind and seemed like fewer changes and it fixes the valgrind errors, non-zero returns, and crashes. However, I would suggest going with option 1 since it's less hacky. Note that option 1 only solves the problem in oiiotool while option 2 could maybe avoid problems in other code bases that do something similar to oiiotool. What do others prefer?
To Reproduce
oiiotool --info somefile.tif or probably any other oiiotool command that loads a file should do this. valgrind on linux always shows the errors for me. However, I only tested static libraries, so it's possible a shared library might by chance have the initialization of the two variables in a different order and you won't get any bugs. It could even be that your static libs are linked together in a different way from mine and so you don't get this problem...
Platform information:
- OIIO branch/version: I'm running a fork that has 1e488aed merged in. So only a few weeks old.
IF YOU ALREADY HAVE A CODE FIX: There is no need to file a separate issue, please just go straight to making a pull request.
I do have a code fix shown above, but I don't know if it's the fix you'd want (I don't even know if it's what I'd want). Let me know what you think is the right approach.
FWIW I'm running into the same issue while trying to run OIIO tests on current master (2.5) -- the python test runner gets 3 as exit code, even if main() returns zero.
See #3591
Does that fully address this issue? Or are there other structural issues we should be thinking about?