CoreNeuron
Efficient setup of Random123 streams on GPU
[copy/paste of internal issue created by @pramodk]
In the GPU implementation of synapse models, we need to set up Random123 streams on the GPU; today this is done in bbcore_read for every synapse instance:
```cpp
static void bbcore_read(double* x, int* d, int* xx, int* offset, _threadargsproto_) {
    assert(!_p_rng);
    /* each instance deserializes two uint32 stream ids */
    uint32_t* di = ((uint32_t*)d) + *offset;
    if (di[0] != 0 || di[1] != 0) {
        nrnran123_State** pv = (nrnran123_State**)(&_p_rng);
        *pv = nrnran123_newstream(di[0], di[1]);
    }
    *offset += 2;
}
```
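For context, the (id1, id2) pair consumed above is written out on the CPU side by the matching bbcore_write during serialization. The exact code is mod-file specific; this is only a minimal sketch to make the two-uint32 layout concrete, using the nrnran123_getids accessor to retrieve a stream's identifiers:

```cpp
static void bbcore_write(double* x, int* d, int* xx, int* offset, _threadargsproto_) {
    if (d) {
        uint32_t* di = ((uint32_t*)d) + *offset;
        if (_p_rng) {
            nrnran123_State** pv = (nrnran123_State**)(&_p_rng);
            /* store id1 and id2 for bbcore_read to consume */
            nrnran123_getids(*pv, di, di + 1);
        } else {
            di[0] = 0; /* bbcore_read interprets (0, 0) as "no stream" */
            di[1] = 0;
        }
    }
    *offset += 2;
}
```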
To create each stream, we launch a single-thread kernel from a CPU launcher routine, i.e. the setup is entirely serial on the GPU:
```cpp
/* nrnran123 streams are created from a CPU launcher routine */
nrnran123_State* nrnran123_newstream(uint32_t id1, uint32_t id2) {
    nrnran123_State* s;
    cudaMalloc((void**)&s, sizeof(nrnran123_State));
    cudaMemset(s, 0, sizeof(nrnran123_State)); /* zero the device state (s, not &s) */
    /* one <<<1,1>>> launch per stream */
    nrnran123_setup_cuda_newstream<<<1, 1>>>(s, id1, id2);
    cudaDeviceSynchronize();
    return s;
}
```
Each synapse instance thus pays for its own cudaMalloc, cudaMemset, single-thread kernel launch, and full cudaDeviceSynchronize. This is terribly slow and inefficient! See the sample profile for a small 5 ms simulation:

[profiler screenshot from the original issue]
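The obvious direction for a fix is to batch the setup: collect all (id1, id2) pairs on the host, make one device allocation for all states, and initialize every stream in parallel from a single kernel launch. A minimal sketch under simplified assumptions (State, setup_all_streams, and newstreams_batched are illustrative stand-ins, not CoreNeuron's actual types or the eventual fix):

```cpp
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

// Simplified stand-in for CoreNeuron's nrnran123_State (the real struct
// wraps a Philox4x32 counter/key; only the ids matter for this sketch).
struct State {
    std::uint32_t id1;
    std::uint32_t id2;
    std::uint32_t which;  // index of the next cached value in the real type
};

// One thread initializes one stream; ids holds consecutive (id1, id2) pairs.
__global__ void setup_all_streams(State* states, const std::uint32_t* ids,
                                  std::size_t n) {
    std::size_t i = blockIdx.x * (std::size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        states[i].id1 = ids[2 * i];
        states[i].id2 = ids[2 * i + 1];
        states[i].which = 0;
    }
}

// CPU launcher: one allocation, one copy, one kernel launch, and one
// synchronization for the whole population, instead of one of each per
// synapse instance.
State* newstreams_batched(const std::uint32_t* host_ids, std::size_t n) {
    State* states = nullptr;
    std::uint32_t* dev_ids = nullptr;
    cudaMalloc((void**)&states, n * sizeof(State));
    cudaMalloc((void**)&dev_ids, 2 * n * sizeof(std::uint32_t));
    cudaMemcpy(dev_ids, host_ids, 2 * n * sizeof(std::uint32_t),
               cudaMemcpyHostToDevice);
    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);
    setup_all_streams<<<blocks, threads>>>(states, dev_ids, n);
    cudaDeviceSynchronize();
    cudaFree(dev_ids);
    return states;
}
```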
Is this partially improved by https://github.com/BlueBrain/CoreNeuron/pull/595?
I think it was improved, but #595 caused a large number of small unified memory allocations, which were still quite slow. #702 improves the situation further.
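For illustration, one generic way to avoid many small unified-memory allocations (a pooling sketch under simplified assumptions, not the actual #702 change) is to grab a single managed slab up front and hand out per-stream slots from it:

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// One cudaMallocManaged call backs all n objects, so the driver tracks a
// single allocation instead of one per stream; slot i is simply base + i.
template <typename T>
T* managed_slab(std::size_t n) {
    T* base = nullptr;
    cudaMallocManaged(&base, n * sizeof(T));
    return base;
}

// Hypothetical usage: give synapse instance i the pointer &pool[i] instead
// of calling cudaMallocManaged once per stream:
// nrnran123_State* pool = managed_slab<nrnran123_State>(n_streams);
```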