foundationdb

Segmentation fault under C/C++ API binding

Open • royguo opened this issue 5 months ago • 1 comment

I am using the C API (libfdb_c.so) in my C++ application:

  1. I created a singleton accessor for FDBCppBinding:
static std::shared_ptr<FDBCppBinding> GetInstance(FDBConfig config) {
    static std::shared_ptr<FDBCppBinding> instance(new FDBCppBinding(std::move(config)));
    TLOG(INFO, "FDBCppBinding instance returned");
    return instance;
  }
  2. Here's how I initialize FDBCppBinding:
  fdb_error_t error = fdb_select_api_version(FDB_API_VERSION);
  if (error != 0) {
    TLOG(ERROR, "Failed to select FDB API version {}: {}", FDB_API_VERSION, error);
    return FDBError::kUnknownError;
  }
  TLOG(INFO, "Selected FDB API version: {}", FDB_API_VERSION);


  setNetworkOptions();


  error = fdb_setup_network();
  if (error != 0) {
    TLOG(ERROR, "FDB Network setup failed: {}", error);
    return FDBError::kNetworkError;
  }
  LOG_INFO("FDB Network setup completed");


  startNetworkThread();
  3. Inside startNetworkThread:
network_future_ = std::async(std::launch::async, [this]() {
    network_running_ = true;
    TLOG(INFO, "Starting global FDB network thread");
    auto error = fdb_run_network();
    network_running_.store(false);
    TLOG(INFO, "FDB network thread exited with error: {}", error);
  });

The problem is that when the application's unit test (UT) exits, there is a segmentation fault:

Missing separate debuginfos, use: dnf debuginfo-install foundationdb-clients-7.1.38-1.x86_64 glibc-2.38-29.tl4.x86_64 libgcc-12.3.1.5-2.tl4.x86_64 libstdc++-12.3.1.5-2.tl4.x86_64 numactl-devel-2.0.16-5.tl4.x86_64
(gdb) bt
#0  0x00000000047bb6c0 in ?? ()
#1  0x00007ffff69d576d in Transaction::createTrLogInfoProbabilistically(Database const&) () from /lib64/libfdb_c.so
#2  0x00007ffff69fbe86 in Transaction::Transaction(Database const&, Optional<Standalone<StringRef> > const&) ()
   from /lib64/libfdb_c.so
#3  0x00007ffff6d8772c in ReadYourWritesTransaction::ReadYourWritesTransaction(Database const&, Optional<Standalone<StringRef> >) () from /lib64/libfdb_c.so
#4  0x00007ffff6fcdd78 in internal_thread_helper::DoOnMainThreadVoidActorState<ThreadSafeTransaction::ThreadSafeTransaction(DatabaseContext*, ISingleThreadTransaction::Type, Optional<Standalone<StringRef> >)::{lambda()#1}, internal_thread_helper::DoOnMainThreadVoidActor<ThreadSafeTransaction::ThreadSafeTransaction(DatabaseContext*, ISingleThreadTransaction::Type, Optional<Standalone<StringRef> >)::{lambda()#1}> >::a_body1cont1(Void const&, int) [clone .constprop.0] [clone .isra.0] () from /lib64/libfdb_c.so
#5  0x00007ffff6fcde68 in ActorCallback<internal_thread_helper::DoOnMainThreadVoidActor<ThreadSafeTransaction::ThreadSafeTransaction(DatabaseContext*, ISingleThreadTransaction::Type, Optional<Standalone<StringRef> >)::{lambda()#1}>, 0, Void>::fire(Void const&) () from /lib64/libfdb_c.so
#6  0x00007ffff715c3f8 in N2::Net2::run() () from /lib64/libfdb_c.so
#7  0x00007ffff69a12c2 in runNetwork() () from /lib64/libfdb_c.so
#8  0x00007ffff6fce1bf in ThreadSafeApi::runNetwork() () from /lib64/libfdb_c.so
#9  0x00007ffff6948f91 in MultiVersionApi::runNetwork() () from /lib64/libfdb_c.so
#10 0x00007ffff691e9fa in fdb_run_network () from /lib64/libfdb_c.so
#11 0x0000000001fccef0 in operator() (__closure=0x47a4a68)
    at /data00/kuankuan/tcqa-table/src/ms/fdb/fdb_cpp_binding.cc:569
#12 std::__invoke_impl<void, tcqa::table::fdb::FDBCppBinding::startNetworkThread()::<lambda()> > (__f=...)
    at /usr/include/c++/12/bits/invoke.h:61
#13 std::__invoke<tcqa::table::fdb::FDBCppBinding::startNetworkThread()::<lambda()> > (__fn=...)

FDB version: 7.1.38 (I've also tried 7.3.59)

royguo, Sep 07 '25 09:09

Hi @royguo,

I did some research on this issue. My understanding is as follows.

The Root Cause

Looking at the crash location in NativeAPI.actor.cpp, lines 6784-6785:

Reference<TransactionLogInfo> Transaction::createTrLogInfoProbabilistically(const Database& cx) {
    if (!cx->isError()) {
        double sampleRate = cx->globalConfig->get<double>(...);  // ← CRASH HERE

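Frame #0 of the backtrace is at 0x00000000047bb6c0, an address with no symbol that lies outside libfdb_c.so. That pattern suggests a call through a dangling pointer: cx (the Database object) or cx->globalConfig has most likely already been destroyed, while the network thread is still running and trying to access it.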

The Critical Issues in Your Code

  1. Missing fdb_stop_network() call - You never signal the network thread to stop
  2. Database handles are destroyed while the network thread is still running - The network thread can still touch your Database objects after they are gone
  3. Static destruction order problem - Your singleton's static instance is destroyed during program exit while the network thread is active

The Correct Cleanup Order

From analyzing FDB test code (unit_tests.cpp, ryw_benchmark.c, etc.), the proper sequence is:

// 1. Destroy all transactions and database handles FIRST
fdb_database_destroy(db);

// 2. Then stop the network
fdb_stop_network();

// 3. Wait for network thread to complete
network_thread.join();
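
The ordering above can be tried out without linking against libfdb_c. This is only a sketch: FakeNetwork and shutdown_in_order are hypothetical names invented for illustration, with run() standing in for fdb_run_network() and stop() standing in for fdb_stop_network().

```cpp
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the FDB network loop; not part of libfdb_c.
class FakeNetwork {
public:
    // Blocks until stop() has been called, similar in spirit to fdb_run_network().
    void run() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return stop_requested_; });
    }

    // Asks run() to return, similar in spirit to fdb_stop_network().
    void stop() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            stop_requested_ = true;
        }
        cv_.notify_all();
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    bool stop_requested_ = false;
};

// Performs the three shutdown steps in order and records each one.
std::vector<std::string> shutdown_in_order() {
    std::vector<std::string> steps;
    FakeNetwork network;
    std::thread network_thread([&network] { network.run(); });

    // 1. Release everything that depends on the network
    //    (fdb_database_destroy in real code).
    steps.push_back("destroy handles");

    // 2. Ask the loop to exit.
    network.stop();
    steps.push_back("stop network");

    // 3. Wait for the loop to return before tearing down anything else.
    network_thread.join();
    steps.push_back("join thread");
    return steps;
}
```

The point of the sketch is that join() only returns after run() has returned, so nothing the loop might touch is destroyed while it is still executing.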

The Complete Fix for Your Code

Here's what you need to implement in your FDBCppBinding class:

1. Add proper member variables to track resources

class FDBCppBinding {
private:
    FDBDatabase* db_ = nullptr;  // Track database handle
    std::future<void> network_future_;
    std::atomic<bool> network_running_{false};
    std::atomic<bool> network_stopped_{false};
    
    // ... other members
};

2. Implement proper cleanup in the destructor

~FDBCppBinding() {
    TLOG(INFO, "FDBCppBinding destructor called");
    cleanup();
}

void cleanup() {
    if (network_stopped_.load()) {
        return;  // Already cleaned up
    }
    
    // Step 1: Destroy all database handles FIRST
    if (db_ != nullptr) {
        TLOG(INFO, "Destroying FDB database handle");
        fdb_database_destroy(db_);
        db_ = nullptr;
    }
    
    // Step 2: Stop the network thread
    if (network_running_.load()) {
        TLOG(INFO, "Stopping FDB network");
        fdb_error_t error = fdb_stop_network();
        if (error != 0) {
            TLOG(ERROR, "Failed to stop FDB network: {}", error);
        }
    }
    
    // Step 3: Wait for the network thread to complete
    if (network_future_.valid()) {
        try {
            TLOG(INFO, "Waiting for FDB network thread to join");
            network_future_.wait();
            TLOG(INFO, "FDB network thread stopped successfully");
        } catch (const std::exception& e) {
            TLOG(ERROR, "Exception while waiting for network thread: {}", e.what());
        }
    }
    
    network_running_.store(false);
    network_stopped_.store(true);
}
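
Since cleanup() can be reached from both an explicit teardown call and the destructor, it may be worth making the "already cleaned up" check a single atomic step instead of a separate load() and store(). A minimal sketch of that idea, assuming a hypothetical ShutdownOnce helper (not part of FDB or of the class above):

```cpp
#include <atomic>

// Hypothetical helper: lets exactly one caller perform the shutdown work.
class ShutdownOnce {
public:
    // Returns true only for the first caller; every later caller gets false.
    bool try_begin() { return !started_.exchange(true); }

private:
    std::atomic<bool> started_{false};
};

// Calls the guarded cleanup `calls` times and reports how often it actually ran.
int count_cleanups(int calls) {
    ShutdownOnce once;
    int performed = 0;
    for (int i = 0; i < calls; ++i) {
        if (once.try_begin()) {
            // Real code would destroy handles, stop the network, and wait here.
            ++performed;
        }
    }
    return performed;
}
```

Note that a second caller returns immediately even if the first one is still in the middle of shutting down, so this only guards against doing the work twice, not against using the binding during shutdown.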

3. Fix the singleton pattern

Option A: Use explicit cleanup (unit tests can use this approach)

class FDBCppBinding {
private:
    static std::shared_ptr<FDBCppBinding> instance_;
    static std::mutex instance_mutex_;
    
public:
    static std::shared_ptr<FDBCppBinding> GetInstance(FDBConfig config) {
        std::lock_guard<std::mutex> lock(instance_mutex_);
        
        if (!instance_) {
            instance_.reset(new FDBCppBinding(std::move(config)));
            TLOG(INFO, "FDBCppBinding instance created");
        }
        
        return instance_;
    }
    
    // Call this explicitly in your unit test teardown
    static void DestroyInstance() {
        std::lock_guard<std::mutex> lock(instance_mutex_);
        
        if (instance_) {
            TLOG(INFO, "Explicitly destroying FDBCppBinding instance");
            instance_->cleanup();
            instance_.reset();
        }
    }
};

// In .cpp file
std::shared_ptr<FDBCppBinding> FDBCppBinding::instance_ = nullptr;
std::mutex FDBCppBinding::instance_mutex_;

4. Call it in your unit test teardown

// At the end of your unit tests or before program exit
FDBCppBinding::DestroyInstance();
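
If the test runner has early returns, a small scope guard is one way to make sure the teardown call is not skipped. This is a sketch under assumptions: TeardownGuard and run_tests_with_teardown are hypothetical names, and the counter stands in for the real FDBCppBinding::DestroyInstance() call.

```cpp
#include <functional>
#include <utility>

// Hypothetical scope guard: runs the given callable when it goes out of scope.
class TeardownGuard {
public:
    explicit TeardownGuard(std::function<void()> fn) : fn_(std::move(fn)) {}
    ~TeardownGuard() {
        if (fn_) fn_();
    }
    TeardownGuard(const TeardownGuard&) = delete;
    TeardownGuard& operator=(const TeardownGuard&) = delete;

private:
    std::function<void()> fn_;
};

// Stand-in for a test main: returns how many times the teardown ran.
int run_tests_with_teardown() {
    int teardown_calls = 0;
    {
        // In real code the lambda would call FDBCppBinding::DestroyInstance().
        TeardownGuard guard([&teardown_calls] { ++teardown_calls; });
        // ... run the tests here ...
    }  // The guard fires here, well before static destruction at program exit.
    return teardown_calls;
}
```

Declaring the guard at the top of main() (or of the test environment's setup) keeps the teardown tied to normal scope exit instead of static destruction order.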

I also checked the fix against these points:

  1. Database handles destroyed first - No more access to destroyed objects
  2. Network thread properly signaled - fdb_stop_network() tells it to exit
  3. Synchronized shutdown - .wait() ensures network thread completes before proceeding
  4. Explicit cleanup - No reliance on static destruction order

According to FDB Documentation

From api-c.rst line 203:

"You must ... wait for fdb_run_network() to return before allowing your program to exit, or else the behavior is undefined."

If you'd like, I can implement this fix locally and open a PR for you to review.

abinesha312, Nov 04 '25 17:11