SEGFAULT through nullptr access in `object_cache::priv_make_room_for_new_blocks`
In object_cache::priv_make_room_for_new_blocks there is an assertion that the oldest block must not be nullptr https://github.com/LLNL/metall/blob/1f6c027680ea89bf92bd9a5255e006c044aa078f/include/metall/kernel/object_cache.hpp#L742.
However, this assertion is not always true which causes a segfault on one of the following lines when oldest_block is accessed.
I have not yet figured out what exactly causes this but maybe you have an idea. I will try to narrow the bug down further and comment on this issue if I find something new.
Stack trace:
@liss-h Hmm, I'll look into that and get back to you.
Could you modify Metall's code as follows and run the program again for me?
Insert #define METALL_DISABLE_OBJECT_CACHE to https://github.com/LLNL/metall/blob/1f6c027680ea89bf92bd9a5255e006c044aa078f/include/metall/kernel/segment_allocator.hpp#L35
Insert the code below to https://github.com/LLNL/metall/blob/1f6c027680ea89bf92bd9a5255e006c044aa078f/include/metall/kernel/chunk_directory.hpp#L212
if (!m_table[chunk_no].slot_occupancy.get(num_slots, slot_no)) {
logger::out(logger::level::error, __FILE__, __LINE__,
"Duplicate deallocation.");
return;
}
Disabling the object cache seems to have fixed the issue, the message however is not printed.
(I still haven't figured out how to reproduce it in a minimal example yet)
Thanks for trying it. That means that there is likely an issue in the object cache. I keep investigating.
@liss-h Could you add the following codes on the specified lines and run the program again? Please remove the code I mentioned in my previous comment beforehand. I want to double-check if the object cache is the issue or if another location is.
#include <unordered_set>
std::unordered_set<std::ptrdiff_t> g_allocated_offsets;
https://github.com/LLNL/metall/blob/1f6c027680ea89bf92bd9a5255e006c044aa078f/include/metall/kernel/segment_allocator.hpp#L39
g_allocated_offsets.insert(offset);
https://github.com/LLNL/metall/blob/1f6c027680ea89bf92bd9a5255e006c044aa078f/include/metall/kernel/segment_allocator.hpp#L157
if (g_allocated_offsets.count(offset) == 0) {
// error log
logger::out(logger::level::error, __FILE__, __LINE__,
"Deallocating an offset that was not allocated");
return;
} else {
g_allocated_offsets.erase(offset);
}
https://github.com/LLNL/metall/blob/1f6c027680ea89bf92bd9a5255e006c044aa078f/include/metall/kernel/segment_allocator.hpp#L209
Finally came around to testing that, it does indeed print "Deallocating an offset that was not allocated" a few times close to the beginning. The messages in the screenshot are the only ones, the rest of the log is just more of the same
Thanks for update!
As for the 3 in my previous post, can you please change the function as below and insert at https://github.com/LLNL/metall/blob/1f6c027680ea89bf92bd9a5255e006c044aa078f/include/metall/kernel/segment_allocator.hpp#L212.
if (g_allocated_offsets.count(offset) == 0) {
// error log
logger::out(logger::level::error, __FILE__, __LINE__,
"Deallocating an offset that was not allocated chunk_no: " +
std::to_string(chunk_no) +
", bin_no: " + std::to_string(bin_no) +
", offset: " + std::to_string(offset));
return;
} else {
g_allocated_offsets.erase(offset);
}
Thanks for the update!
Could you modify item 2 in my previous post to as follows?
if (g_allocated_offsets.count(offset) > 0) {
logger::out(logger::level::error, __FILE__, __LINE__, "Duplicated offset detected. offset: " + std::to_string(offset));
std::abort();
}
g_allocated_offsets.insert(offset); // from the previous post
Specifically, please insert the code above here: https://github.com/LLNL/metall/blob/1f6c027680ea89bf92bd9a5255e006c044aa078f/include/metall/kernel/segment_allocator.hpp#L157
BTW, if you happen to have a bug reproducer, I can run it on my side. I tried to reproduce the issue on my side also, but no luck so far...
The program does not abort.
And sorry I don't have a minimal example I could send you. Currently we basically need to run it through the whole codebase with 2 specific datasets, so far I've not found another way to reproduce it.
Thanks for the update.
The result suggests that Metall somehow received deallocation requests of memory regions it A) did not allocate or B) deallocated already. Could you send me the line of your code where you construct a Metall manager object?
And sorry I don't have a minimal example I could send you. No problem! I just did not want to disturb you many times until we fixed the issue.
FYI, I'll be on vacation from tomorrow until later next week. My response will be stopped until I come back.
We are currently constructing it through our FFI (very minor differences/patches to the one in metall)
https://github.com/dice-group/metall-ffi/blob/7a5b59bdfe71a9bdc4f0f308577f4987a43b536c/src/dice/ffi/metall.cpp#L6-L48
Hi Liss,
I was able to identify and fix the bug: https://github.com/LLNL/metall/pull/337 The PR has been merged into the develop branch already. Thanks again for the issue report!
Wow thank you. Impressive that you were able to find it with so little info. We will try it out at some point this week probably
Sounds good! Let me know if you see the issue still...
Yup, as far as I can tell the issue is gone 👍