Running with four GPUs
The following is a tested script (and its output) using four GTX 1080 Ti cards on an eight-core Xeon platform with 256 GB of RAM. The question is how to start the four GPUs in parallel. The print(Sys.time()) calls suggest that the second GPU starts only after the first GPU completes its math (how can I be sure of this?). I tried foreach with %dopar%, but it is slower and I have to export (clusterExport) all four working matrices to each worker, which is a waste of memory. I named the vcl matrices after each GPU in case further operations are needed downstream in the future.
library(gpuR) # for detectGPUs(), setContext(), vclMatrix()
system.time({
ORDER = 384*(2^5)
B = matrix(rnorm(ORDER^2), nrow=ORDER) # about 3 GB; max GPU memory is 11 GB
Bm <- as.matrix(B)
Cm <- Bm
print(Bm[40,50])
for (i in 1:detectGPUs()) {
setContext(i) # select each of the four GPUs
assign(paste("gpuA.", i, sep=""), vclMatrix(Bm)) # copy the two input matrices to GPU i
assign(paste("gpuB.", i, sep=""), vclMatrix(Cm))
assign(paste("gpuD.", i, sep=""), # elementwise product computed on GPU i
       get(paste("gpuA.", i, sep="")) * get(paste("gpuB.", i, sep="")) * 0.1 * i)
out <- paste("gpuD.", i, sep="")
Bm <- get(out)[,] # copy resulting matrix back to main memory
print(Bm[40,50])
print(Sys.time())
}
})
[1] 0.9357006
[1] 0.08755357
[1] "2018-06-01 14:11:04 CDT"
[1] 0.01638479
[1] "2018-06-01 14:11:18 CDT"
[1] 0.004599376
[1] "2018-06-01 14:11:33 CDT"
[1] 0.001721456
[1] "2018-06-01 14:11:47 CDT"
user system elapsed
58.544 16.441 74.948
@AurelioG the for loop you have written is sequential, so each iteration waits for the previous one to complete. In theory it would be nice not to require the copy back to main memory, but the package is not thread safe, which prevents that, so the asynchronous calls with vclMatrix are moot. The only way to use the GPUs concurrently is to use a foreach loop.
You shouldn't necessarily need to copy the 'working matrices' to each worker. I believe that depends on the parallel backend; backends like doMC provide 'shared-memory' parallelism via forking.
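A minimal sketch of that approach (hypothetical: the variable names and the one-GPU-per-worker mapping are my assumptions, and it requires a fork-based backend on Linux/macOS so that Bm and Cm are inherited by the workers instead of being exported with clusterExport):

```r
library(gpuR)
library(foreach)
library(doMC)   # fork-based backend: children inherit Bm and Cm from the parent

registerDoMC(cores = detectGPUs())

results <- foreach(i = seq_len(detectGPUs())) %dopar% {
  setContext(i)                  # bind this worker to GPU i
  gpuA <- vclMatrix(Bm)          # Bm/Cm come from the forked parent, no export step
  gpuB <- vclMatrix(Cm)
  gpuD <- gpuA * gpuB * 0.1 * i  # elementwise product on GPU i
  gpuD[,]                        # pull the result back to host memory
}
```

Whether forked children can safely share the parent's OpenCL state is exactly the thread-safety concern raised above, so treat this as an experiment rather than a recipe.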
This is essentially a reference to issue #24
All,
I thought I would share my solution to this problem. Note that my use of the "RLinuxModules" package is NOT necessary if you are not working on a shared Linux system managed with "environment modules".
The idea is to launch R sub-processes from within the main R session, changing the GPU context for each R sub-process (e.g. "setContext"), computing the results in each sub-process, and then returning those results back to the main R process.
This example is accessing Tesla K80 cards on a remote compute node in our university HPC center; a screenshot of nvidia-smi polling is provided to show simultaneous 100% utilization of all GPU processors on the cards.
Best regards,
Rich [email protected]
##########################################
Multiple background R processes using 2 separate GPU cards, accessing the 2 GPU processors per card, for parallel usage of all 4 GPU processors.
Ref:
- https://github.com/cdeterman/gpuR
- https://github.com/r-lib/callr
- https://github.com/larsgr/RLinuxModules
R-code below:
######################
library(callr)
library(gpuR)
library(RLinuxModules)

rp.1 <- r_bg(function() {
  library(RLinuxModules)
  module("load cudnn/5.1/cuda75 cuda/75/toolkit/7.5.18 cuda/75/profiler/7.5.18 cuda/75/nsight/7.5.18 cuda/75/blas/7.5.18 cuda/75/fft/7.5.18")
  library(gpuR)
  setContext(1)                # first GPU processor
  b <- 8000
  gpuA <- gpuMatrix(rnorm(b*b), nrow=b, ncol=b, type="double")
  gpuC <- gpuMatrix(rnorm(b*b), nrow=b, ncol=b, type="double")
  gpuB <- gpuA %*% gpuA
  gpuB.inv <- solve(gpuB)
  gpuB.inv <- gpuB.inv %*% gpuC
  gpuB.inv[]                   # return the result to the parent as a plain matrix
})

rp.2 <- r_bg(function() {
  library(RLinuxModules)
  module("load cudnn/5.1/cuda75 cuda/75/toolkit/7.5.18 cuda/75/profiler/7.5.18 cuda/75/nsight/7.5.18 cuda/75/blas/7.5.18 cuda/75/fft/7.5.18")
  library(gpuR)
  setContext(2)
  b <- 8000
  gpuA <- gpuMatrix(rnorm(b*b), nrow=b, ncol=b, type="double")
  gpuC <- gpuMatrix(rnorm(b*b), nrow=b, ncol=b, type="double")
  gpuB <- gpuA %*% gpuA
  gpuB.inv <- solve(gpuB)
  gpuB.inv <- gpuB.inv %*% gpuC
  gpuB.inv[]
})

rp.3 <- r_bg(function() {
  library(RLinuxModules)
  module("load cudnn/5.1/cuda75 cuda/75/toolkit/7.5.18 cuda/75/profiler/7.5.18 cuda/75/nsight/7.5.18 cuda/75/blas/7.5.18 cuda/75/fft/7.5.18")
  library(gpuR)
  setContext(3)
  b <- 8000
  gpuA <- gpuMatrix(rnorm(b*b), nrow=b, ncol=b, type="double")
  gpuC <- gpuMatrix(rnorm(b*b), nrow=b, ncol=b, type="double")
  gpuB <- gpuA %*% gpuA
  gpuB.inv <- solve(gpuB)
  gpuB.inv <- gpuB.inv %*% gpuC
  gpuB.inv[]
})

rp.4 <- r_bg(function() {
  library(RLinuxModules)
  module("load cudnn/5.1/cuda75 cuda/75/toolkit/7.5.18 cuda/75/profiler/7.5.18 cuda/75/nsight/7.5.18 cuda/75/blas/7.5.18 cuda/75/fft/7.5.18")
  library(gpuR)
  setContext(4)
  b <- 8000
  gpuA <- gpuMatrix(rnorm(b*b), nrow=b, ncol=b, type="double")
  gpuC <- gpuMatrix(rnorm(b*b), nrow=b, ncol=b, type="double")
  gpuB <- gpuA %*% gpuA
  gpuB.inv <- solve(gpuB)
  gpuB.inv <- gpuB.inv %*% gpuC
  gpuB.inv[]
})
Check to see if the background processes are still alive
rp.1$is_alive()
rp.2$is_alive()
rp.3$is_alive()
rp.4$is_alive()
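If you prefer to block until all four sub-processes have finished rather than polling is_alive(), callr's background process objects also expose a wait() method (a small sketch; the list name is mine):

```r
procs <- list(rp.1, rp.2, rp.3, rp.4)
invisible(lapply(procs, function(p) p$wait()))  # block until each background R process exits
```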
Retrieve results
gpuB.inv.mat.1 <- rp.1$get_result()
gpuB.inv.mat.1[1:5, 1:5]

gpuB.inv.mat.2 <- rp.2$get_result()
gpuB.inv.mat.2[1:5, 1:5]

gpuB.inv.mat.3 <- rp.3$get_result()
gpuB.inv.mat.3[1:5, 1:5]

gpuB.inv.mat.4 <- rp.4$get_result()
gpuB.inv.mat.4[1:5, 1:5]
####################################################
Nvidia polling using "watch -n 1 nvidia-smi":
[screenshot: nvidia-smi showing simultaneous 100% utilization on all four GPU processors]