CoreNeuron

Efficient setup of random123 streams on GPU

Open · olupton opened this issue on Jul 09 '21 · 2 comments

[copy/paste of internal issue created by @pramodk]

In the GPU implementation of synapse models we need to set up Random123 streams on the GPU; currently this is done in bbcore_read for every synapse instance:

static void bbcore_read(double* x, int* d, int* xx, int* offset, _threadargsproto_) {
    assert(!_p_rng);
    uint32_t* di = ((uint32_t*)d) + *offset;
    if (di[0] != 0 || di[1] != 0) {
        nrnran123_State** pv = (nrnran123_State**)(&_p_rng);
        *pv = nrnran123_newstream(di[0], di[1]);
    }
    *offset += 2;
}

For this we launch a kernel that performs serial compute, i.e. it just initializes a single stream:

/* nrnran123 streams are created from a CPU launcher routine */
nrnran123_State* nrnran123_newstream(uint32_t id1, uint32_t id2) {
    nrnran123_State* s;

    cudaMalloc((void**)&s, sizeof(nrnran123_State));
    cudaMemset(s, 0, sizeof(nrnran123_State)); /* zero the device-side state, not the host pointer */

    /* one <<<1,1>>> kernel launch plus a device synchronize per stream */
    nrnran123_setup_cuda_newstream<<<1, 1>>>(s, id1, id2);
    cudaDeviceSynchronize();

    return s;
}

This is terribly slow and inefficient! See the sample profile from a small 5 ms simulation: nvvp_cuda_stream_random123
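One way to reduce this overhead would be to collect the (id1, id2) pairs on the host and create all streams with a single allocation and a single kernel launch. A rough sketch along those lines follows; the batched helper and the per-stream device setup function are hypothetical names, not the existing CoreNeuron API, and nrnran123_State is assumed to come from coreneuron's Random123 header:

#include <stdint.h>
#include <cuda_runtime.h>

/* Hypothetical batched setup: one cudaMalloc and one kernel launch for all
 * streams, instead of a malloc + <<<1,1>>> launch + synchronize per synapse. */

__global__ void setup_streams_kernel(nrnran123_State* states,
                                     const uint32_t* id1,
                                     const uint32_t* id2,
                                     size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        /* assumed device-side equivalent of nrnran123_setup_cuda_newstream,
         * acting on the i-th element of a contiguous array */
        nrnran123_setup_stream(&states[i], id1[i], id2[i]);
    }
}

/* Returns a device array of n initialized streams; id1/id2 are host arrays. */
nrnran123_State* nrnran123_newstreams_batched(const uint32_t* id1,
                                              const uint32_t* id2,
                                              size_t n) {
    nrnran123_State* states;
    cudaMalloc((void**)&states, n * sizeof(nrnran123_State));
    cudaMemset(states, 0, n * sizeof(nrnran123_State));

    uint32_t* d_id1;
    uint32_t* d_id2;
    cudaMalloc((void**)&d_id1, n * sizeof(uint32_t));
    cudaMalloc((void**)&d_id2, n * sizeof(uint32_t));
    cudaMemcpy(d_id1, id1, n * sizeof(uint32_t), cudaMemcpyHostToDevice);
    cudaMemcpy(d_id2, id2, n * sizeof(uint32_t), cudaMemcpyHostToDevice);

    const int block = 128;
    const int grid = (int)((n + block - 1) / block);
    setup_streams_kernel<<<grid, block>>>(states, d_id1, d_id2, n);
    cudaDeviceSynchronize();

    cudaFree(d_id1);
    cudaFree(d_id2);
    return states;
}

With something like this, each bbcore_read would only store a pointer (or index) into the returned array instead of calling nrnran123_newstream itself.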

olupton · Jul 09 '21

Is this partially improved by https://github.com/BlueBrain/CoreNeuron/pull/595?

pramodk · Jul 26 '21

I think it was improved, but #595 caused a large number of small unified memory allocations, which were still quite slow. #702 improves the situation further.
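For reference, a common way to avoid a large number of small unified-memory allocations is to carve all per-instance streams out of one pooled managed allocation. A rough sketch of that idea (names are illustrative, not the actual change in #702; nrnran123_State is again assumed from coreneuron's header):

#include <assert.h>
#include <stddef.h>
#include <cuda_runtime.h>

/* Illustrative pool: one cudaMallocManaged call for all streams, handed out
 * sequentially, instead of one managed allocation per stream. */
static nrnran123_State* stream_pool = NULL;
static size_t pool_capacity = 0;
static size_t pool_used = 0;

void stream_pool_init(size_t total_streams) {
    cudaMallocManaged((void**)&stream_pool, total_streams * sizeof(nrnran123_State));
    pool_capacity = total_streams;
    pool_used = 0;
}

/* Replacement for the per-instance allocation inside nrnran123_newstream. */
nrnran123_State* stream_pool_next() {
    assert(pool_used < pool_capacity);
    return &stream_pool[pool_used++];
}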

olupton · Dec 01 '21