metall SEGFAULT through nullptr access in `object_cache::priv_make_room_for_new

In object_cache::priv_make_room_for_new_blocks there is an assertion that the oldest block must not be nullptr https://github.com/LLNL/metall/blob/1f6c027680ea89bf92bd9a5255e006c044aa078f/include/metall/kernel/object_cache.hpp#L742.

However, this assertion is not always true which causes a segfault on one of the following lines when oldest_block is accessed.

I have not yet figured out what exactly causes this but maybe you have an idea. I will try to narrow the bug down further and comment on this issue if I find something new.

Stack trace: grafik

Sep 17 '24 08:09 liss-h

@liss-h Hmm, I'll look into that and get back to you.

Sep 19 '24 21:09 KIwabuchi

Could you modify Metall's code as follows and run the program again for me?

Insert #define METALL_DISABLE_OBJECT_CACHE to https://github.com/LLNL/metall/blob/1f6c027680ea89bf92bd9a5255e006c044aa078f/include/metall/kernel/segment_allocator.hpp#L35

Insert the code below to https://github.com/LLNL/metall/blob/1f6c027680ea89bf92bd9a5255e006c044aa078f/include/metall/kernel/chunk_directory.hpp#L212

    if (!m_table[chunk_no].slot_occupancy.get(num_slots, slot_no)) {
      logger::out(logger::level::error, __FILE__, __LINE__,
                  "Duplicate deallocation.");
      return;
    }

Sep 19 '24 22:09 KIwabuchi

Disabling the object cache seems to have fixed the issue, the message however is not printed.

(I still haven't figured out how to reproduce it in a minimal example yet)

Sep 23 '24 11:09 liss-h

Thanks for trying it. That means that there is likely an issue in the object cache. I keep investigating.

Sep 25 '24 19:09 KIwabuchi

@liss-h Could you add the following codes on the specified lines and run the program again? Please remove the code I mentioned in my previous comment beforehand. I want to double-check if the object cache is the issue or if another location is.

#include <unordered_set>
std::unordered_set<std::ptrdiff_t> g_allocated_offsets;

https://github.com/LLNL/metall/blob/1f6c027680ea89bf92bd9a5255e006c044aa078f/include/metall/kernel/segment_allocator.hpp#L39

g_allocated_offsets.insert(offset);

https://github.com/LLNL/metall/blob/1f6c027680ea89bf92bd9a5255e006c044aa078f/include/metall/kernel/segment_allocator.hpp#L157

    if (g_allocated_offsets.count(offset) == 0) {
      // error log
      logger::out(logger::level::error, __FILE__, __LINE__,
                  "Deallocating an offset that was not allocated");
      return;
    } else {
      g_allocated_offsets.erase(offset);
    }

https://github.com/LLNL/metall/blob/1f6c027680ea89bf92bd9a5255e006c044aa078f/include/metall/kernel/segment_allocator.hpp#L209

Sep 26 '24 01:09 KIwabuchi

Finally came around to testing that, it does indeed print "Deallocating an offset that was not allocated" a few times close to the beginning. The messages in the screenshot are the only ones, the rest of the log is just more of the same grafik

Sep 30 '24 12:09 liss-h

Thanks for update!

As for the 3 in my previous post, can you please change the function as below and insert at https://github.com/LLNL/metall/blob/1f6c027680ea89bf92bd9a5255e006c044aa078f/include/metall/kernel/segment_allocator.hpp#L212.

if (g_allocated_offsets.count(offset) == 0) {
  // error log
  logger::out(logger::level::error, __FILE__, __LINE__,
              "Deallocating an offset that was not allocated chunk_no: " +
                  std::to_string(chunk_no) +
                  ", bin_no: " + std::to_string(bin_no) +
                  ", offset: " + std::to_string(offset));
  return;
} else {
  g_allocated_offsets.erase(offset);
}

Sep 30 '24 23:09 KIwabuchi

grafik

Oct 01 '24 11:10 liss-h

Thanks for the update!

Could you modify item 2 in my previous post to as follows?

if (g_allocated_offsets.count(offset) > 0) {
  logger::out(logger::level::error, __FILE__, __LINE__, "Duplicated offset detected. offset: " + std::to_string(offset));
  std::abort();
}
g_allocated_offsets.insert(offset); // from the previous post

Specifically, please insert the code above here: https://github.com/LLNL/metall/blob/1f6c027680ea89bf92bd9a5255e006c044aa078f/include/metall/kernel/segment_allocator.hpp#L157

BTW, if you happen to have a bug reproducer, I can run it on my side. I tried to reproduce the issue on my side also, but no luck so far...

Oct 04 '24 20:10 KIwabuchi

The program does not abort.

And sorry I don't have a minimal example I could send you. Currently we basically need to run it through the whole codebase with 2 specific datasets, so far I've not found another way to reproduce it.

Oct 08 '24 13:10 liss-h

Thanks for the update.

The result suggests that Metall somehow received deallocation requests of memory regions it A) did not allocate or B) deallocated already. Could you send me the line of your code where you construct a Metall manager object?

And sorry I don't have a minimal example I could send you. No problem! I just did not want to disturb you many times until we fixed the issue.

FYI, I'll be on vacation from tomorrow until later next week. My response will be stopped until I come back.

Oct 10 '24 22:10 KIwabuchi

We are currently constructing it through our FFI (very minor differences/patches to the one in metall)

https://github.com/dice-group/metall-ffi/blob/7a5b59bdfe71a9bdc4f0f308577f4987a43b536c/src/dice/ffi/metall.cpp#L6-L48

Nov 05 '24 08:11 liss-h

Hi Liss,

I was able to identify and fix the bug: https://github.com/LLNL/metall/pull/337 The PR has been merged into the develop branch already. Thanks again for the issue report!

Nov 11 '24 01:11 KIwabuchi

Wow thank you. Impressive that you were able to find it with so little info. We will try it out at some point this week probably

Nov 11 '24 14:11 liss-h

Sounds good! Let me know if you see the issue still...

Nov 11 '24 18:11 KIwabuchi

Yup, as far as I can tell the issue is gone 👍

Nov 18 '24 14:11 liss-h

SEGFAULT through nullptr access in `object_cache::priv_make_room_for_new_blocks`