An odd error occurs when I allocate GPU memory in a loop
I use 'cudaMalloc' to allocate GPU memory at the beginning of each loop iteration, and 'cudaFree' to release it at the end of each iteration. (The pointer variable is declared outside the loop.)
The loop runs about 4 times. The first 3 iterations appear to succeed, but at the start of the fourth iteration it suddenly stops and returns error code 719.
I read the official guide, but I have no clue why this error would occur when allocating memory.
BTW, it succeeds on Windows 10 with a GTX 1070, but fails on Ubuntu 18.04 on a Jetson Nano.
/**
* An exception occurred on the device while executing a kernel. Common
* causes include dereferencing an invalid device pointer and accessing
* out of bounds shared memory. Less common cases can be system specific - more
* information about these cases can be found in the system specific user guide.
* This leaves the process in an inconsistent state and any further CUDA work
* will return the same error. To continue using CUDA, the process must be terminated
* and relaunched.
*/
cudaErrorLaunchFailure = 719,
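And this is the error-checking code; I use it to wrap every CUDA function call:
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <cuda_runtime.h>

// Print msg and the CUDA error, then exit, if a CUDA call did not succeed.
void validate(cudaError_t error, const char* msg) {
    if (error != cudaSuccess) {
        std::cout << msg;
        fprintf(stderr, "Failed to execute cuda function: %s! Error code: %d\n",
                cudaGetErrorString(error), error);
        exit(EXIT_FAILURE);
    }
}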
Here is the console output:
Alloc the s_GPU Array (this is the msg)
Failed to execute cuda function: unspecified launch failure! Error code: 719
Also, the allocation sizes are: first iteration 154900 * sizeof(float2), second iteration 280142 * sizeof(float2), third iteration 406426 * sizeof(float2), fourth iteration 428178 * sizeof(float2). The error occurs on the fourth allocation.
Without understanding what you're doing it's very difficult to speculate; there are many possibilities. For instance, are you freeing memory before your kernel finishes with it? Accessing the array out of bounds with a misbehaving kernel? I don't think this looks like a memory allocation failure per se. My gut feeling is that the allocation fails because the device is not in a good state for some other reason (usually a misbehaving kernel).
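One way to narrow this down: kernel launches are asynchronous, so an error raised while a kernel runs is reported by a later CUDA call, which may well be the next 'cudaMalloc'. Checking right after each launch should point at the kernel that actually failed. Below is a minimal sketch of that pattern, not your actual code. The kernel `fill`, the macro `CUDA_CHECK`, and the launch configuration are all made up for illustration; only the element counts come from your post.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): report the failing
// call site and exit.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s failed: %s (%d)\n",            \
                    __FILE__, __LINE__, #call,                        \
                    cudaGetErrorString(err_), (int)err_);             \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the real kernel. The guard keeps the extra
// threads in the last block from writing past the end of the array.
__global__ void fill(float2* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = make_float2(0.0f, 0.0f);
    }
}

int main() {
    // Element counts taken from the post.
    const size_t counts[] = {154900, 280142, 406426, 428178};
    const int iterations = sizeof(counts) / sizeof(counts[0]);
    float2* s_gpu = nullptr;  // declared outside the loop, as in the post

    for (int it = 0; it < iterations; ++it) {
        const size_t n = counts[it];
        CUDA_CHECK(cudaMalloc(&s_gpu, n * sizeof(float2)));

        const unsigned threads = 256;  // assumed block size
        const unsigned blocks = (unsigned)((n + threads - 1) / threads);
        fill<<<blocks, threads>>>(s_gpu, n);

        // Catches invalid launch configurations.
        CUDA_CHECK(cudaGetLastError());
        // Waits for the kernel, so a fault during execution is reported
        // here instead of at the next cudaMalloc.
        CUDA_CHECK(cudaDeviceSynchronize());

        CUDA_CHECK(cudaFree(s_gpu));
        s_gpu = nullptr;
    }

    printf("all iterations completed\n");
    return 0;
}
```

If the error moves from the 'cudaMalloc' line to the synchronize call after one of your kernels, that kernel is the likely culprit. Running the program under cuda-memcheck (or compute-sanitizer on newer toolkits), if it is available on your platform, may also help locate an out-of-bounds access.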