SYCL implementation
This is the SYCL/DPC++ port. It currently depends on two features that are not widely available:
- USM (unified shared memory), which serves the same purpose as cudaMallocManaged. It is part of SYCL 2020, but only Intel DPC++ on Intel back-ends supports it today. Codeplay ComputeCpp has started implementing it, but I don't think it is finished yet and I didn't test it.
- sycl::intel::experimental::printf, which is, as one might expect, an Intel extension to support printf. There is an alternative extension in Codeplay ComputeCpp, but I didn't bother with that because of the previous issue.
The following output appears to be identical to that of GCC OpenMP, but please let me know what other verification I need to do.
Today, the Intel DPC++ implementation works with the host and Gen9 GPU devices, but not with the CPU device, because of a known Intel OpenCL issue that is in the process of being fixed.
jrhammon@jrhammon-nuc:~/QUICKSILVER/src$ QS_DEVICE=GPU ./qs
Copyright (c) 2016
Lawrence Livermore National Security, LLC
All Rights Reserved
Quicksilver Version : 2020-Feb-4-22:35:56
Quicksilver Git Hash : af27b3dcce08933786cb526e2f8a0bbe99d99b07
MPI Version : 3.0
Number of MPI ranks : 1
Number of OpenMP Threads: 1
Number of OpenMP CPUs : 1
Simulation:
dt: 1e-08
fMax: 0.1
inputFile:
energySpectrum:
boundaryCondition: reflect
loadBalance: 0
cycleTimers: 0
debugThreads: 0
lx: 100
ly: 100
lz: 100
nParticles: 1000000
batchSize: 0
nBatches: 10
nSteps: 10
nx: 10
ny: 10
nz: 10
seed: 1029384756
xDom: 0
yDom: 0
zDom: 0
eMax: 20
eMin: 1e-09
nGroups: 230
lowWeightCutoff: 0.001
bTally: 1
fTally: 1
cTally: 1
coralBenchmark: 0
crossSectionsOut:
Geometry:
material: sourceMaterial
shape: brick
xMax: 100
xMin: 0
yMax: 100
yMin: 0
zMax: 100
zMin: 0
Material:
name: sourceMaterial
mass: 1000
nIsotopes: 10
nReactions: 9
sourceRate: 1e+10
totalCrossSection: 1
absorptionCrossSection: flat
fissionCrossSection: flat
scatteringCrossSection: flat
absorptionCrossSectionRatio: 1
fissionCrossSectionRatio: 0.1
scatteringCrossSectionRatio: 1
CrossSection:
name: flat
A: 0
B: 0
C: 0
D: 0
E: 1
nuBar: 2.4
is gpu
Building partition 0
Building partition 1
Building partition 2
Building partition 3
done building
Building MC_Domain 0
Building MC_Domain 1
Building MC_Domain 2
Building MC_Domain 3
Starting Consistency Check
Finished Consistency Check
Finished initMesh
Using SYCL device
cycle start source rr split absorb scatter fission produce collisn escape census num_seg scalar_flux cycleInit cycleTracking cycleFinalize
0 0 100000 0 900000 1078182 1076792 107133 257364 2262107 0 72049 2670386 2.264064e+08 4.442000e-02 6.090415e+00 0.000000e+00
1 72049 100000 0 828008 1107255 1106235 110306 264657 2323796 0 47153 2719702 2.438830e+08 4.589800e-02 3.454263e+00 0.000000e+00
2 47153 100000 0 852712 1086097 1088696 108334 259738 2283127 0 65172 2687840 2.435394e+08 4.463300e-02 3.493513e+00 0.000000e+00
3 65172 100000 68015 834785 1017555 1018659 101778 244593 2137992 0 57202 2517378 2.450517e+08 4.562600e-02 3.246188e+00 0.000000e+00
4 57202 100000 62214 842934 1020418 1019522 101687 244038 2141627 0 59855 2522163 2.434017e+08 4.571000e-02 3.285100e+00 0.000000e+00
5 59855 100000 56726 840345 1029994 1029682 103183 247672 2162859 0 57969 2545713 2.451216e+08 4.450600e-02 3.400606e+00 0.000000e+00
6 57969 100000 52439 841925 1032190 1032180 102801 246877 2167171 0 59341 2551468 2.446226e+08 4.578000e-02 3.692880e+00 0.000000e+00
7 59341 100000 59663 840635 1023444 1022593 102792 246649 2148829 0 60726 2531066 2.441845e+08 4.598600e-02 3.360731e+00 0.000000e+00
8 60726 100000 68187 839357 1013501 1014287 101238 243112 2129026 0 60269 2508491 2.440307e+08 4.525700e-02 3.606579e+00 0.000000e+00
9 60269 100000 71159 839953 1012439 1011892 101368 243262 2125699 0 58518 2500968 2.444142e+08 4.538600e-02 3.415050e+00 0.000000e+00
Timer Cumulative Cumulative Cumulative Cumulative Cumulative Cumulative
Name number microSecs microSecs microSecs microSecs Efficiency
of calls min avg max stddev Rating
main 1 3.750e+07 3.750e+07 3.750e+07 0.000e+00 100.00
cycleInit 10 4.532e+05 4.532e+05 4.532e+05 0.000e+00 100.00
cycleTracking 10 3.705e+07 3.705e+07 3.705e+07 0.000e+00 100.00
cycleTracking_Kernel 992 3.699e+07 3.699e+07 3.699e+07 0.000e+00 100.00
cycleTracking_MPI 1083 5.392e+04 5.392e+04 5.392e+04 0.000e+00 100.00
cycleTracking_Test_Done 0 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.00
cycleFinalize 20 1.564e+03 1.564e+03 1.564e+03 0.000e+00 100.00
Figure Of Merit 6.952e+05 [Num Segments / Cycle Tracking Time]
The Makefile is set to use DPC++, so we will want to change that before merging into the main branch, but I'm leaving it that way for testing purposes.
I am also leaving in a bit of debug code that I think is useful until we figure out the right way to control SYCL device dispatch. Because SYCL is a pluripotent back-end, it isn't obvious how to do this. Intel DPC++ allows one to set the default device with an environment variable, but I wanted to put that control in QS itself (the QS_DEVICE variable used in the runs in this description) for debugging purposes.
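For readers who want a picture of what such a debug switch might look like, here is a minimal sketch in plain C++, not the code in this branch. It assumes the variable is read once at startup and that the accepted values are exactly the uppercase strings HOST, CPU and GPU. The enum DeviceChoice and both function names are hypothetical and exist only for illustration. The SYCL side is deliberately left out: a real build would construct its queue from whichever device selector matches the returned value.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical device categories, for illustration only
// (not taken from the Quicksilver source).
enum class DeviceChoice { Default, Host, Cpu, Gpu };

// Map the text of an environment variable to a device category.
// Assumption: values are matched exactly and in uppercase, as in
// "QS_DEVICE=GPU ./qs". Anything else, including an unset variable,
// falls back to Default so the runtime can pick a device itself.
DeviceChoice parse_device_choice(const char* value) {
    if (value == nullptr) {
        return DeviceChoice::Default;
    }
    const std::string s(value);
    if (s == "HOST") return DeviceChoice::Host;
    if (s == "CPU")  return DeviceChoice::Cpu;
    if (s == "GPU")  return DeviceChoice::Gpu;
    return DeviceChoice::Default;
}

// Read the choice from the process environment.
// std::getenv returns nullptr when the variable is not set.
DeviceChoice device_choice_from_env(const char* var_name = "QS_DEVICE") {
    return parse_device_choice(std::getenv(var_name));
}
```

Calling device_choice_from_env() once during initialization and switching on the result keeps the dispatch decision in one place, which should make it easy to delete once a proper mechanism is agreed on.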
I have verified that the SYCL implementation also runs correctly on NVIDIA (Pascal).
There is a compiler bug with printf on the device that requires me to disable 4 instances of device printf in the source, but otherwise the code is identical.
Compiler
clang version 12.0.0 (https://github.com/intel/llvm.git b1cf776e91a0f8f99397a5c3668cceda19b1b000)
Hardware
$ nvidia-smi
Fri Oct 9 12:35:04 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 On | 00000000:3B:00.0 Off | N/A |
| 24% 38C P8 8W / 180W | 7MiB / 8119MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3309 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
Host execution
$ QS_DEVICE=HOST ./qs
Copyright (c) 2016
Lawrence Livermore National Security, LLC
All Rights Reserved
Quicksilver Version : 2020-Jun-27-09:44:01
Quicksilver Git Hash : 7794a66f2a3d1ea6446519615dcadb7a15c2a707
MPI Version : 3.0
Number of MPI ranks : 1
Number of OpenMP Threads: 1
Number of OpenMP CPUs : 1
Simulation:
dt: 1e-08
fMax: 0.1
inputFile:
energySpectrum:
boundaryCondition: reflect
loadBalance: 0
cycleTimers: 0
debugThreads: 0
lx: 100
ly: 100
lz: 100
nParticles: 1000000
batchSize: 0
nBatches: 10
nSteps: 10
nx: 10
ny: 10
nz: 10
seed: 1029384756
xDom: 0
yDom: 0
zDom: 0
eMax: 20
eMin: 1e-09
nGroups: 230
lowWeightCutoff: 0.001
bTally: 1
fTally: 1
cTally: 1
coralBenchmark: 0
crossSectionsOut:
Geometry:
material: sourceMaterial
shape: brick
xMax: 100
xMin: 0
yMax: 100
yMin: 0
zMax: 100
zMin: 0
Material:
name: sourceMaterial
mass: 1000
nIsotopes: 10
nReactions: 9
sourceRate: 1e+10
totalCrossSection: 1
absorptionCrossSection: flat
fissionCrossSection: flat
scatteringCrossSection: flat
absorptionCrossSectionRatio: 1
fissionCrossSectionRatio: 0.1
scatteringCrossSectionRatio: 1
CrossSection:
name: flat
A: 0
B: 0
C: 0
D: 0
E: 1
nuBar: 2.4
is host
Building partition 0
Building partition 1
Building partition 2
Building partition 3
done building
Building MC_Domain 0
Building MC_Domain 1
Building MC_Domain 2
Building MC_Domain 3
Starting Consistency Check
Finished Consistency Check
Finished initMesh
Using SYCL device
cycle start source rr split absorb scatter fission produce collisn escape census num_seg scalar_flux cycleInit cycleTracking cycleFinalize
0 0 100000 0 900000 1078182 1076792 107133 257364 2262107 0 72049 2670386 2.264064e+08 8.100800e-02 2.926682e+00 0.000000e+00
1 72049 100000 0 828008 1107255 1106235 110306 264657 2323796 0 47153 2719702 2.438830e+08 1.334280e-01 3.022620e+00 0.000000e+00
2 47153 100000 0 852712 1086097 1088696 108334 259738 2283127 0 65172 2687840 2.435394e+08 1.117280e-01 2.942065e+00 0.000000e+00
3 65172 100000 68015 834785 1017555 1018659 101778 244593 2137992 0 57202 2517378 2.450517e+08 1.087580e-01 2.757061e+00 0.000000e+00
4 57202 100000 62214 842934 1020418 1019522 101687 244038 2141627 0 59855 2522163 2.434017e+08 1.091730e-01 2.750167e+00 0.000000e+00
5 59855 100000 56726 840345 1029994 1029682 103183 247672 2162859 0 57969 2545713 2.451216e+08 8.949400e-02 2.795324e+00 0.000000e+00
6 57969 100000 52439 841925 1032190 1032180 102801 246877 2167171 0 59341 2551468 2.446226e+08 8.452000e-02 2.786467e+00 0.000000e+00
7 59341 100000 59663 840635 1023444 1022593 102792 246649 2148829 0 60726 2531066 2.441845e+08 8.383200e-02 2.782119e+00 1.000000e-06
8 60726 100000 68187 839357 1013501 1014287 101238 243112 2129026 0 60269 2508491 2.440307e+08 8.301600e-02 2.735566e+00 1.000000e-06
9 60269 100000 71159 839953 1012439 1011892 101368 243262 2125699 0 58518 2500968 2.444142e+08 8.994500e-02 2.722267e+00 1.000000e-06
Timer Cumulative Cumulative Cumulative Cumulative Cumulative Cumulative
Name number microSecs microSecs microSecs microSecs Efficiency
of calls min avg max stddev Rating
main 1 2.920e+07 2.920e+07 2.920e+07 0.000e+00 100.00
cycleInit 10 9.749e+05 9.749e+05 9.749e+05 0.000e+00 100.00
cycleTracking 10 2.822e+07 2.822e+07 2.822e+07 0.000e+00 100.00
cycleTracking_Kernel 992 2.810e+07 2.810e+07 2.810e+07 0.000e+00 100.00
cycleTracking_MPI 1083 1.171e+05 1.171e+05 1.171e+05 0.000e+00 100.00
cycleTracking_Test_Done 0 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.00
cycleFinalize 20 1.387e+03 1.387e+03 1.387e+03 0.000e+00 100.00
Figure Of Merit 9.126e+05 [Num Segments / Cycle Tracking Time]
NVIDIA execution
$ QS_DEVICE=GPU /usr/local/cuda-11.0/bin/nvprof ./qs
==44046== NVPROF is profiling process 44046, command: ./qs
Copyright (c) 2016
Lawrence Livermore National Security, LLC
All Rights Reserved
Quicksilver Version : 2020-Jun-27-09:44:01
Quicksilver Git Hash : 7794a66f2a3d1ea6446519615dcadb7a15c2a707
MPI Version : 3.0
Number of MPI ranks : 1
Number of OpenMP Threads: 1
Number of OpenMP CPUs : 1
Simulation:
dt: 1e-08
fMax: 0.1
inputFile:
energySpectrum:
boundaryCondition: reflect
loadBalance: 0
cycleTimers: 0
debugThreads: 0
lx: 100
ly: 100
lz: 100
nParticles: 1000000
batchSize: 0
nBatches: 10
nSteps: 10
nx: 10
ny: 10
nz: 10
seed: 1029384756
xDom: 0
yDom: 0
zDom: 0
eMax: 20
eMin: 1e-09
nGroups: 230
lowWeightCutoff: 0.001
bTally: 1
fTally: 1
cTally: 1
coralBenchmark: 0
crossSectionsOut:
Geometry:
material: sourceMaterial
shape: brick
xMax: 100
xMin: 0
yMax: 100
yMin: 0
zMax: 100
zMin: 0
Material:
name: sourceMaterial
mass: 1000
nIsotopes: 10
nReactions: 9
sourceRate: 1e+10
totalCrossSection: 1
absorptionCrossSection: flat
fissionCrossSection: flat
scatteringCrossSection: flat
absorptionCrossSectionRatio: 1
fissionCrossSectionRatio: 0.1
scatteringCrossSectionRatio: 1
CrossSection:
name: flat
A: 0
B: 0
C: 0
D: 0
E: 1
nuBar: 2.4
is gpu
Building partition 0
Building partition 1
Building partition 2
Building partition 3
done building
Building MC_Domain 0
Building MC_Domain 1
Building MC_Domain 2
Building MC_Domain 3
Starting Consistency Check
Finished Consistency Check
Finished initMesh
Using SYCL device
cycle start source rr split absorb scatter fission produce collisn escape census num_seg scalar_flux cycleInit cycleTracking cycleFinalize
0 0 100000 0 900000 1078182 1076792 107133 257364 2262107 0 72049 2670386 2.264064e+08 8.140800e-02 3.494830e-01 4.400000e-05
1 72049 100000 0 828008 1107254 1106235 110306 264657 2323795 0 47154 2719703 2.438830e+08 1.319410e-01 3.669860e-01 4.200000e-05
2 47154 100000 0 852712 1086097 1088693 108332 259734 2283122 0 65171 2687830 2.435392e+08 1.292360e-01 3.583770e-01 4.100000e-05
3 65171 100000 68010 834788 1017561 1018651 101777 244591 2137989 0 57202 2517372 2.450488e+08 1.285070e-01 3.940940e-01 4.400000e-05
4 57202 100000 62214 842933 1020417 1019520 101687 244038 2141624 0 59855 2522155 2.434013e+08 1.233870e-01 3.430330e-01 4.200000e-05
5 59855 100000 56726 840345 1029994 1029682 103183 247672 2162859 0 57969 2545713 2.451217e+08 1.251520e-01 9.301870e-01 4.900000e-05
6 57969 100000 52439 841925 1032190 1032180 102801 246877 2167171 0 59341 2551472 2.446224e+08 1.254430e-01 4.115140e-01 4.300000e-05
7 59341 100000 59663 840635 1023444 1022593 102792 246649 2148829 0 60726 2531066 2.441844e+08 1.263230e-01 3.548820e-01 4.100000e-05
8 60726 100000 68187 839357 1013501 1014287 101238 243112 2129026 0 60269 2508492 2.440307e+08 1.245810e-01 1.525691e+00 4.200000e-05
9 60269 100000 71159 839953 1012439 1011892 101368 243262 2125699 0 58518 2500968 2.444142e+08 1.244990e-01 3.747170e-01 4.200000e-05
Timer Cumulative Cumulative Cumulative Cumulative Cumulative Cumulative
Name number microSecs microSecs microSecs microSecs Efficiency
of calls min avg max stddev Rating
main 1 6.647e+06 6.647e+06 6.647e+06 0.000e+00 100.00
cycleInit 10 1.220e+06 1.220e+06 1.220e+06 0.000e+00 100.00
cycleTracking 10 5.409e+06 5.409e+06 5.409e+06 0.000e+00 100.00
cycleTracking_Kernel 992 5.028e+06 5.028e+06 5.028e+06 0.000e+00 100.00
cycleTracking_MPI 1083 3.471e+05 3.471e+05 3.471e+05 0.000e+00 100.00
cycleTracking_Test_Done 0 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.00
cycleFinalize 20 3.918e+03 3.918e+03 3.918e+03 0.000e+00 100.00
Figure Of Merit 4.762e+06 [Num Segments / Cycle Tracking Time]
==44046== Profiling application: ./qs
==44046== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 5.00802s 195 25.682ms 175.84us 1.15108s _ZTSZZ13cycleTrackingP10MonteCarloENKUlRN2cl4sycl7handlerEE276_52clES4_EUlNS2_4itemILi1ELb1EEEE277_66
API calls: 88.88% 5.00789s 195 25.681ms 175.10us 1.15110s cuEventSynchronize
7.76% 437.11ms 2 218.55ms 205.62ms 231.49ms cuCtxCreate
2.40% 135.24ms 2 67.618ms 61.238ms 73.998ms cuCtxDestroy
0.45% 25.116ms 393 63.908us 4.7490us 20.552ms cuMemAllocManaged
0.39% 21.747ms 393 55.335us 8.9590us 950.74us cuMemFree
0.06% 3.3578ms 195 17.219us 8.1610us 101.59us cuLaunchKernel
0.03% 1.9149ms 1 1.9149ms 1.9149ms 1.9149ms cuModuleLoadDataEx
0.01% 495.29us 1 495.29us 495.29us 495.29us cuModuleUnload
0.01% 351.42us 1579 222ns 162ns 3.0640us cuCtxGetCurrent
0.01% 326.05us 783 416ns 174ns 13.519us cuDeviceGetAttribute
0.00% 272.17us 197 1.3810us 590ns 12.086us cuEventCreate
0.00% 224.69us 197 1.1400us 708ns 5.0690us cuEventRecord
0.00% 156.83us 197 796ns 412ns 7.6580us cuEventDestroy
0.00% 154.58us 393 393ns 230ns 7.6360us cuPointerGetAttribute
0.00% 25.972us 2 12.986us 11.569us 14.403us cuStreamCreate
0.00% 24.368us 2 12.184us 7.1720us 17.196us cuStreamDestroy
0.00% 8.0820us 2 4.0410us 2.6660us 5.4160us cuStreamSynchronize
0.00% 4.0600us 2 2.0300us 1.4350us 2.6250us cuCtxSynchronize
0.00% 2.7130us 1 2.7130us 2.7130us 2.7130us cuDeviceGetPCIBusId
0.00% 2.1270us 4 531ns 307ns 748ns cuCtxSetCurrent
0.00% 2.0830us 2 1.0410us 612ns 1.4710us cuModuleGetFunction
0.00% 1.5310us 3 510ns 191ns 1.0640us cuDeviceGetCount
0.00% 1.1070us 2 553ns 461ns 646ns cuCtxPopCurrent
0.00% 733ns 2 366ns 199ns 534ns cuDeviceGet
==44046== Unified Memory profiling result:
Device "GeForce GTX 1080 (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
28880 75.789KB 4.0000KB 0.9961MB 2.087399GB 214.8492ms Host To Device
13520 151.98KB 4.0000KB 0.9961MB 1.959534GB 169.6310ms Device To Host
9274 - - - - 860.9489ms Gpu page fault groups
366 4.0000KB 4.0000KB 4.0000KB 1.429688MB - Memory thrashes
Total CPU Page faults: 8063
Total CPU thrashes: 366
CUDA execution
For reference, here is the execution with CUDA 11.0 and the master branch. There are some issues with the preprocessor logic that break the sycl-clean branch when compiling for CUDA. I do not know what is wrong there yet.
$ QS_DEVICE=GPU /usr/local/cuda-11.0/bin/nvprof ./qs
Copyright (c) 2016
Lawrence Livermore National Security, LLC
All Rights Reserved
Quicksilver Version : 2020-Feb-4-22:35:56
Quicksilver Git Hash : af27b3dcce08933786cb526e2f8a0bbe99d99b07
MPI Version : 3.0
Number of MPI ranks : 1
Number of OpenMP Threads: 1
Number of OpenMP CPUs : 1
Simulation:
dt: 1e-08
fMax: 0.1
inputFile:
energySpectrum:
boundaryCondition: reflect
loadBalance: 0
cycleTimers: 0
debugThreads: 0
lx: 100
ly: 100
lz: 100
nParticles: 1000000
batchSize: 0
nBatches: 10
nSteps: 10
nx: 10
ny: 10
nz: 10
seed: 1029384756
xDom: 0
yDom: 0
zDom: 0
eMax: 20
eMin: 1e-09
nGroups: 230
lowWeightCutoff: 0.001
bTally: 1
fTally: 1
cTally: 1
coralBenchmark: 0
crossSectionsOut:
Geometry:
material: sourceMaterial
shape: brick
xMax: 100
xMin: 0
yMax: 100
yMin: 0
zMax: 100
zMin: 0
Material:
name: sourceMaterial
mass: 1000
nIsotopes: 10
nReactions: 9
sourceRate: 1e+10
totalCrossSection: 1
absorptionCrossSection: flat
fissionCrossSection: flat
scatteringCrossSection: flat
absorptionCrossSectionRatio: 1
fissionCrossSectionRatio: 0.1
scatteringCrossSectionRatio: 1
CrossSection:
name: flat
A: 0
B: 0
C: 0
D: 0
E: 1
nuBar: 2.4
==71046== NVPROF is profiling process 71046, command: ./qs
Building partition 0
Building partition 1
Building partition 2
Building partition 3
done building
Building MC_Domain 0
Building MC_Domain 1
Building MC_Domain 2
Building MC_Domain 3
Starting Consistency Check
Finished Consistency Check
Finished initMesh
cycle start source rr split absorb scatter fission produce collisn escape census num_seg scalar_flux cycleInit cycleTracking cycleFinalize
0 0 100000 0 900000 1078182 1076792 107133 257364 2262107 0 72049 2670386 2.264064e+08 9.041900e-02 4.416690e-01 4.400000e-05
1 72049 100000 0 828008 1107255 1106235 110306 264657 2323796 0 47153 2719702 2.438830e+08 1.405880e-01 4.555290e-01 4.300000e-05
2 47153 100000 0 852712 1086097 1088696 108334 259738 2283127 0 65172 2687840 2.435394e+08 1.361020e-01 4.307970e-01 4.100000e-05
3 65172 100000 68015 834785 1017555 1018659 101778 244593 2137992 0 57202 2517378 2.450517e+08 1.369190e-01 4.161950e-01 4.200000e-05
4 57202 100000 62214 842934 1020418 1019522 101687 244038 2141627 0 59855 2522163 2.434017e+08 1.338490e-01 4.129190e-01 4.200000e-05
5 59855 100000 56726 840345 1029994 1029682 103183 247672 2162859 0 57969 2545713 2.451216e+08 1.335090e-01 4.229580e-01 4.200000e-05
6 57969 100000 52439 841925 1032190 1032180 102801 246877 2167171 0 59341 2551468 2.446226e+08 1.345140e-01 4.111950e-01 4.300000e-05
7 59341 100000 59663 840635 1023444 1022593 102792 246649 2148829 0 60726 2531066 2.441845e+08 1.343080e-01 4.148660e-01 4.200000e-05
8 60726 100000 68187 839357 1013501 1014287 101238 243112 2129026 0 60269 2508491 2.440307e+08 1.345430e-01 4.114840e-01 4.400000e-05
9 60269 100000 71159 839953 1012439 1011892 101368 243262 2125699 0 58518 2500968 2.444142e+08 1.341150e-01 4.092590e-01 4.200000e-05
Timer Cumulative Cumulative Cumulative Cumulative Cumulative Cumulative
Name number microSecs microSecs microSecs microSecs Efficiency
of calls min avg max stddev Rating
main 1 5.552e+06 5.552e+06 5.552e+06 0.000e+00 100.00
cycleInit 10 1.309e+06 1.309e+06 1.309e+06 0.000e+00 100.00
cycleTracking 10 4.227e+06 4.227e+06 4.227e+06 0.000e+00 100.00
cycleTracking_Kernel 992 3.889e+06 3.889e+06 3.889e+06 0.000e+00 100.00
cycleTracking_MPI 1083 3.375e+05 3.375e+05 3.375e+05 0.000e+00 100.00
cycleTracking_Test_Done 0 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.00
cycleFinalize 20 3.602e+03 3.602e+03 3.602e+03 0.000e+00 100.00
Figure Of Merit 6.093e+06 [Num Segments / Cycle Tracking Time]
==71046== Profiling application: ./qs
==71046== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 3.84139s 195 19.699ms 254.05us 38.108ms CycleTrackingKernel(MonteCarlo*, int, ParticleVault*, ParticleVault*)
0.00% 1.3760us 1 1.3760us 1.3760us 1.3760us _GLOBAL__N__48_tmpxft_00011231_00000000_7_cudaFunctions_cpp1_ii_2bb9853e::WarmUpKernel(void)
API calls: 92.43% 3.84205s 196 19.602ms 10.971us 38.107ms cudaDeviceSynchronize
6.96% 289.13ms 393 735.69us 5.5720us 284.39ms cudaMallocManaged
0.50% 20.979ms 393 53.380us 9.7610us 742.77us cudaFree
0.09% 3.8650ms 196 19.719us 8.9830us 202.60us cudaLaunchKernel
0.01% 272.25us 101 2.6950us 183ns 170.40us cuDeviceGetAttribute
0.01% 236.49us 1 236.49us 236.49us 236.49us cuDeviceTotalMem
0.00% 63.885us 195 327ns 253ns 3.6680us cudaPeekAtLastError
0.00% 26.808us 1 26.808us 26.808us 26.808us cuDeviceGetName
0.00% 2.8380us 1 2.8380us 2.8380us 2.8380us cuDeviceGetPCIBusId
0.00% 2.8210us 1 2.8210us 2.8210us 2.8210us cudaSetDevice
0.00% 1.5900us 3 530ns 236ns 1.0870us cuDeviceGetCount
0.00% 873ns 2 436ns 192ns 681ns cuDeviceGet
0.00% 447ns 1 447ns 447ns 447ns cudaGetDeviceCount
0.00% 371ns 1 371ns 371ns 371ns cuDeviceGetUuid
==71046== Unified Memory profiling result:
Device "GeForce GTX 1080 (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
31624 69.160KB 4.0000KB 0.9961MB 2.085815GB 217.8466ms Host To Device
13544 151.74KB 4.0000KB 0.9961MB 1.960014GB 169.6492ms Device To Host
8236 - - - - 916.3543ms Gpu page fault groups
258 4.0000KB 4.0000KB 4.0000KB 1.007813MB - Memory thrashes
Total CPU Page faults: 8087
Total CPU thrashes: 258