SYCL implementation

Open jeffhammond opened this issue 5 years ago • 2 comments

This is the SYCL/DPC++ port. It currently depends on two features that are not widely available:

USM (unified shared memory), which serves the same purpose as cudaMallocManaged. It is part of SYCL 2020, but today only Intel DPC++ on Intel back-ends supports it. Codeplay ComputeCpp has started implementing it, but I don't think it is finished yet and I have not tested it.
sycl::intel::experimental::printf, which is, as one might expect, an Intel extension that supports printf in device code. Codeplay ComputeCpp has an alternative extension, but I didn't bother with it because of the previous issue.

The following output appears to be identical to that of GCC OpenMP, but please let me know what other verification I need to do.

Today, the Intel DPC++ implementation is working with the host and Gen9 GPU devices, but not the CPU device because of an Intel OpenCL issue that is known and in the process of being fixed.

jrhammon@jrhammon-nuc:~/QUICKSILVER/src$ QS_DEVICE=GPU ./qs 
Copyright (c) 2016
Lawrence Livermore National Security, LLC
All Rights Reserved
Quicksilver Version     : 2020-Feb-4-22:35:56
Quicksilver Git Hash    : af27b3dcce08933786cb526e2f8a0bbe99d99b07
MPI Version             : 3.0
Number of MPI ranks     : 1
Number of OpenMP Threads: 1
Number of OpenMP CPUs   : 1

Simulation:
   dt: 1e-08
   fMax: 0.1
   inputFile: 
   energySpectrum: 
   boundaryCondition: reflect
   loadBalance: 0
   cycleTimers: 0
   debugThreads: 0
   lx: 100
   ly: 100
   lz: 100
   nParticles: 1000000
   batchSize: 0
   nBatches: 10
   nSteps: 10
   nx: 10
   ny: 10
   nz: 10
   seed: 1029384756
   xDom: 0
   yDom: 0
   zDom: 0
   eMax: 20
   eMin: 1e-09
   nGroups: 230
   lowWeightCutoff: 0.001
   bTally: 1
   fTally: 1
   cTally: 1
   coralBenchmark: 0
   crossSectionsOut:

Geometry:
   material: sourceMaterial
   shape: brick
   xMax: 100
   xMin: 0
   yMax: 100
   yMin: 0
   zMax: 100
   zMin: 0

Material:
   name: sourceMaterial
   mass: 1000
   nIsotopes: 10
   nReactions: 9
   sourceRate: 1e+10
   totalCrossSection: 1
   absorptionCrossSection: flat
   fissionCrossSection: flat
   scatteringCrossSection: flat
   absorptionCrossSectionRatio: 1
   fissionCrossSectionRatio: 0.1
   scatteringCrossSectionRatio: 1

CrossSection:
   name: flat
   A: 0
   B: 0
   C: 0
   D: 0
   E: 1
   nuBar: 2.4
is gpu
Building partition 0
Building partition 1
Building partition 2
Building partition 3
done building
Building MC_Domain 0
Building MC_Domain 1
Building MC_Domain 2
Building MC_Domain 3
Starting Consistency Check
Finished Consistency Check
Finished initMesh
Using SYCL device
cycle           start       source           rr        split       absorb      scatter      fission      produce      collisn       escape       census      num_seg   scalar_flux      cycleInit  cycleTracking  cycleFinalize
       0            0       100000            0       900000      1078182      1076792       107133       257364      2262107            0        72049      2670386  2.264064e+08   4.442000e-02   6.090415e+00   0.000000e+00
       1        72049       100000            0       828008      1107255      1106235       110306       264657      2323796            0        47153      2719702  2.438830e+08   4.589800e-02   3.454263e+00   0.000000e+00
       2        47153       100000            0       852712      1086097      1088696       108334       259738      2283127            0        65172      2687840  2.435394e+08   4.463300e-02   3.493513e+00   0.000000e+00
       3        65172       100000        68015       834785      1017555      1018659       101778       244593      2137992            0        57202      2517378  2.450517e+08   4.562600e-02   3.246188e+00   0.000000e+00
       4        57202       100000        62214       842934      1020418      1019522       101687       244038      2141627            0        59855      2522163  2.434017e+08   4.571000e-02   3.285100e+00   0.000000e+00
       5        59855       100000        56726       840345      1029994      1029682       103183       247672      2162859            0        57969      2545713  2.451216e+08   4.450600e-02   3.400606e+00   0.000000e+00
       6        57969       100000        52439       841925      1032190      1032180       102801       246877      2167171            0        59341      2551468  2.446226e+08   4.578000e-02   3.692880e+00   0.000000e+00
       7        59341       100000        59663       840635      1023444      1022593       102792       246649      2148829            0        60726      2531066  2.441845e+08   4.598600e-02   3.360731e+00   0.000000e+00
       8        60726       100000        68187       839357      1013501      1014287       101238       243112      2129026            0        60269      2508491  2.440307e+08   4.525700e-02   3.606579e+00   0.000000e+00
       9        60269       100000        71159       839953      1012439      1011892       101368       243262      2125699            0        58518      2500968  2.444142e+08   4.538600e-02   3.415050e+00   0.000000e+00

Timer                       Cumulative   Cumulative   Cumulative   Cumulative   Cumulative   Cumulative
Name                            number    microSecs    microSecs    microSecs    microSecs   Efficiency
                              of calls          min          avg          max       stddev       Rating
main                                 1    3.750e+07    3.750e+07    3.750e+07    0.000e+00       100.00
cycleInit                           10    4.532e+05    4.532e+05    4.532e+05    0.000e+00       100.00
cycleTracking                       10    3.705e+07    3.705e+07    3.705e+07    0.000e+00       100.00
cycleTracking_Kernel               992    3.699e+07    3.699e+07    3.699e+07    0.000e+00       100.00
cycleTracking_MPI                 1083    5.392e+04    5.392e+04    5.392e+04    0.000e+00       100.00
cycleTracking_Test_Done              0    0.000e+00    0.000e+00    0.000e+00    0.000e+00         0.00
cycleFinalize                       20    1.564e+03    1.564e+03    1.564e+03    0.000e+00       100.00
Figure Of Merit              6.952e+05 [Num Segments / Cycle Tracking Time]

Jun 27 '20 16:06 jeffhammond

The Makefile is set to use DPC++, so we will want to change that before merging into the main branch, but I'm leaving it that way for testing purposes.

I am also leaving in a bit of debug code that I think is useful until we figure out the right way to control SYCL device dispatch. Because SYCL is a pluripotent back-end, it isn't obvious how to do this. Intel DPC++ allows one to set the default device with an environment variable, but I wanted to push control into QS itself for debugging purposes.

Jun 27 '20 16:06 jeffhammond

I have verified that the SYCL implementation also runs correctly on NVIDIA (Pascal).

A compiler bug with device-side printf required me to disable four instances of device printf in the source; otherwise the code is identical.

Compiler

clang version 12.0.0 (https://github.com/intel/llvm.git b1cf776e91a0f8f99397a5c3668cceda19b1b000)

Hardware

$ nvidia-smi
Fri Oct  9 12:35:04 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 00000000:3B:00.0 Off |                  N/A |
| 24%   38C    P8     8W / 180W |      7MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3309      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

Host execution

$ QS_DEVICE=HOST ./qs 
Copyright (c) 2016
Lawrence Livermore National Security, LLC
All Rights Reserved
Quicksilver Version     : 2020-Jun-27-09:44:01
Quicksilver Git Hash    : 7794a66f2a3d1ea6446519615dcadb7a15c2a707
MPI Version             : 3.0
Number of MPI ranks     : 1
Number of OpenMP Threads: 1
Number of OpenMP CPUs   : 1

Simulation:
   dt: 1e-08
   fMax: 0.1
   inputFile: 
   energySpectrum: 
   boundaryCondition: reflect
   loadBalance: 0
   cycleTimers: 0
   debugThreads: 0
   lx: 100
   ly: 100
   lz: 100
   nParticles: 1000000
   batchSize: 0
   nBatches: 10
   nSteps: 10
   nx: 10
   ny: 10
   nz: 10
   seed: 1029384756
   xDom: 0
   yDom: 0
   zDom: 0
   eMax: 20
   eMin: 1e-09
   nGroups: 230
   lowWeightCutoff: 0.001
   bTally: 1
   fTally: 1
   cTally: 1
   coralBenchmark: 0
   crossSectionsOut:

Geometry:
   material: sourceMaterial
   shape: brick
   xMax: 100
   xMin: 0
   yMax: 100
   yMin: 0
   zMax: 100
   zMin: 0

Material:
   name: sourceMaterial
   mass: 1000
   nIsotopes: 10
   nReactions: 9
   sourceRate: 1e+10
   totalCrossSection: 1
   absorptionCrossSection: flat
   fissionCrossSection: flat
   scatteringCrossSection: flat
   absorptionCrossSectionRatio: 1
   fissionCrossSectionRatio: 0.1
   scatteringCrossSectionRatio: 1

CrossSection:
   name: flat
   A: 0
   B: 0
   C: 0
   D: 0
   E: 1
   nuBar: 2.4
is host
Building partition 0
Building partition 1
Building partition 2
Building partition 3
done building
Building MC_Domain 0
Building MC_Domain 1
Building MC_Domain 2
Building MC_Domain 3
Starting Consistency Check
Finished Consistency Check
Finished initMesh
Using SYCL device
cycle           start       source           rr        split       absorb      scatter      fission      produce      collisn       escape       census      num_seg   scalar_flux      cycleInit  cycleTracking  cycleFinalize
       0            0       100000            0       900000      1078182      1076792       107133       257364      2262107            0        72049      2670386  2.264064e+08   8.100800e-02   2.926682e+00   0.000000e+00
       1        72049       100000            0       828008      1107255      1106235       110306       264657      2323796            0        47153      2719702  2.438830e+08   1.334280e-01   3.022620e+00   0.000000e+00
       2        47153       100000            0       852712      1086097      1088696       108334       259738      2283127            0        65172      2687840  2.435394e+08   1.117280e-01   2.942065e+00   0.000000e+00
       3        65172       100000        68015       834785      1017555      1018659       101778       244593      2137992            0        57202      2517378  2.450517e+08   1.087580e-01   2.757061e+00   0.000000e+00
       4        57202       100000        62214       842934      1020418      1019522       101687       244038      2141627            0        59855      2522163  2.434017e+08   1.091730e-01   2.750167e+00   0.000000e+00
       5        59855       100000        56726       840345      1029994      1029682       103183       247672      2162859            0        57969      2545713  2.451216e+08   8.949400e-02   2.795324e+00   0.000000e+00
       6        57969       100000        52439       841925      1032190      1032180       102801       246877      2167171            0        59341      2551468  2.446226e+08   8.452000e-02   2.786467e+00   0.000000e+00
       7        59341       100000        59663       840635      1023444      1022593       102792       246649      2148829            0        60726      2531066  2.441845e+08   8.383200e-02   2.782119e+00   1.000000e-06
       8        60726       100000        68187       839357      1013501      1014287       101238       243112      2129026            0        60269      2508491  2.440307e+08   8.301600e-02   2.735566e+00   1.000000e-06
       9        60269       100000        71159       839953      1012439      1011892       101368       243262      2125699            0        58518      2500968  2.444142e+08   8.994500e-02   2.722267e+00   1.000000e-06

Timer                       Cumulative   Cumulative   Cumulative   Cumulative   Cumulative   Cumulative
Name                            number    microSecs    microSecs    microSecs    microSecs   Efficiency
                              of calls          min          avg          max       stddev       Rating
main                                 1    2.920e+07    2.920e+07    2.920e+07    0.000e+00       100.00
cycleInit                           10    9.749e+05    9.749e+05    9.749e+05    0.000e+00       100.00
cycleTracking                       10    2.822e+07    2.822e+07    2.822e+07    0.000e+00       100.00
cycleTracking_Kernel               992    2.810e+07    2.810e+07    2.810e+07    0.000e+00       100.00
cycleTracking_MPI                 1083    1.171e+05    1.171e+05    1.171e+05    0.000e+00       100.00
cycleTracking_Test_Done              0    0.000e+00    0.000e+00    0.000e+00    0.000e+00         0.00
cycleFinalize                       20    1.387e+03    1.387e+03    1.387e+03    0.000e+00       100.00
Figure Of Merit              9.126e+05 [Num Segments / Cycle Tracking Time]

NVIDIA execution

$ QS_DEVICE=GPU /usr/local/cuda-11.0/bin/nvprof ./qs 
==44046== NVPROF is profiling process 44046, command: ./qs
Copyright (c) 2016
Lawrence Livermore National Security, LLC
All Rights Reserved
Quicksilver Version     : 2020-Jun-27-09:44:01
Quicksilver Git Hash    : 7794a66f2a3d1ea6446519615dcadb7a15c2a707
MPI Version             : 3.0
Number of MPI ranks     : 1
Number of OpenMP Threads: 1
Number of OpenMP CPUs   : 1

Simulation:
   dt: 1e-08
   fMax: 0.1
   inputFile: 
   energySpectrum: 
   boundaryCondition: reflect
   loadBalance: 0
   cycleTimers: 0
   debugThreads: 0
   lx: 100
   ly: 100
   lz: 100
   nParticles: 1000000
   batchSize: 0
   nBatches: 10
   nSteps: 10
   nx: 10
   ny: 10
   nz: 10
   seed: 1029384756
   xDom: 0
   yDom: 0
   zDom: 0
   eMax: 20
   eMin: 1e-09
   nGroups: 230
   lowWeightCutoff: 0.001
   bTally: 1
   fTally: 1
   cTally: 1
   coralBenchmark: 0
   crossSectionsOut:

Geometry:
   material: sourceMaterial
   shape: brick
   xMax: 100
   xMin: 0
   yMax: 100
   yMin: 0
   zMax: 100
   zMin: 0

Material:
   name: sourceMaterial
   mass: 1000
   nIsotopes: 10
   nReactions: 9
   sourceRate: 1e+10
   totalCrossSection: 1
   absorptionCrossSection: flat
   fissionCrossSection: flat
   scatteringCrossSection: flat
   absorptionCrossSectionRatio: 1
   fissionCrossSectionRatio: 0.1
   scatteringCrossSectionRatio: 1

CrossSection:
   name: flat
   A: 0
   B: 0
   C: 0
   D: 0
   E: 1
   nuBar: 2.4
is gpu
Building partition 0
Building partition 1
Building partition 2
Building partition 3
done building
Building MC_Domain 0
Building MC_Domain 1
Building MC_Domain 2
Building MC_Domain 3
Starting Consistency Check
Finished Consistency Check
Finished initMesh
Using SYCL device
cycle           start       source           rr        split       absorb      scatter      fission      produce      collisn       escape       census      num_seg   scalar_flux      cycleInit  cycleTracking  cycleFinalize
       0            0       100000            0       900000      1078182      1076792       107133       257364      2262107            0        72049      2670386  2.264064e+08   8.140800e-02   3.494830e-01   4.400000e-05
       1        72049       100000            0       828008      1107254      1106235       110306       264657      2323795            0        47154      2719703  2.438830e+08   1.319410e-01   3.669860e-01   4.200000e-05
       2        47154       100000            0       852712      1086097      1088693       108332       259734      2283122            0        65171      2687830  2.435392e+08   1.292360e-01   3.583770e-01   4.100000e-05
       3        65171       100000        68010       834788      1017561      1018651       101777       244591      2137989            0        57202      2517372  2.450488e+08   1.285070e-01   3.940940e-01   4.400000e-05
       4        57202       100000        62214       842933      1020417      1019520       101687       244038      2141624            0        59855      2522155  2.434013e+08   1.233870e-01   3.430330e-01   4.200000e-05
       5        59855       100000        56726       840345      1029994      1029682       103183       247672      2162859            0        57969      2545713  2.451217e+08   1.251520e-01   9.301870e-01   4.900000e-05
       6        57969       100000        52439       841925      1032190      1032180       102801       246877      2167171            0        59341      2551472  2.446224e+08   1.254430e-01   4.115140e-01   4.300000e-05
       7        59341       100000        59663       840635      1023444      1022593       102792       246649      2148829            0        60726      2531066  2.441844e+08   1.263230e-01   3.548820e-01   4.100000e-05
       8        60726       100000        68187       839357      1013501      1014287       101238       243112      2129026            0        60269      2508492  2.440307e+08   1.245810e-01   1.525691e+00   4.200000e-05
       9        60269       100000        71159       839953      1012439      1011892       101368       243262      2125699            0        58518      2500968  2.444142e+08   1.244990e-01   3.747170e-01   4.200000e-05

Timer                       Cumulative   Cumulative   Cumulative   Cumulative   Cumulative   Cumulative
Name                            number    microSecs    microSecs    microSecs    microSecs   Efficiency
                              of calls          min          avg          max       stddev       Rating
main                                 1    6.647e+06    6.647e+06    6.647e+06    0.000e+00       100.00
cycleInit                           10    1.220e+06    1.220e+06    1.220e+06    0.000e+00       100.00
cycleTracking                       10    5.409e+06    5.409e+06    5.409e+06    0.000e+00       100.00
cycleTracking_Kernel               992    5.028e+06    5.028e+06    5.028e+06    0.000e+00       100.00
cycleTracking_MPI                 1083    3.471e+05    3.471e+05    3.471e+05    0.000e+00       100.00
cycleTracking_Test_Done              0    0.000e+00    0.000e+00    0.000e+00    0.000e+00         0.00
cycleFinalize                       20    3.918e+03    3.918e+03    3.918e+03    0.000e+00       100.00
Figure Of Merit              4.762e+06 [Num Segments / Cycle Tracking Time]
==44046== Profiling application: ./qs
==44046== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  5.00802s       195  25.682ms  175.84us  1.15108s  _ZTSZZ13cycleTrackingP10MonteCarloENKUlRN2cl4sycl7handlerEE276_52clES4_EUlNS2_4itemILi1ELb1EEEE277_66
      API calls:   88.88%  5.00789s       195  25.681ms  175.10us  1.15110s  cuEventSynchronize
                    7.76%  437.11ms         2  218.55ms  205.62ms  231.49ms  cuCtxCreate
                    2.40%  135.24ms         2  67.618ms  61.238ms  73.998ms  cuCtxDestroy
                    0.45%  25.116ms       393  63.908us  4.7490us  20.552ms  cuMemAllocManaged
                    0.39%  21.747ms       393  55.335us  8.9590us  950.74us  cuMemFree
                    0.06%  3.3578ms       195  17.219us  8.1610us  101.59us  cuLaunchKernel
                    0.03%  1.9149ms         1  1.9149ms  1.9149ms  1.9149ms  cuModuleLoadDataEx
                    0.01%  495.29us         1  495.29us  495.29us  495.29us  cuModuleUnload
                    0.01%  351.42us      1579     222ns     162ns  3.0640us  cuCtxGetCurrent
                    0.01%  326.05us       783     416ns     174ns  13.519us  cuDeviceGetAttribute
                    0.00%  272.17us       197  1.3810us     590ns  12.086us  cuEventCreate
                    0.00%  224.69us       197  1.1400us     708ns  5.0690us  cuEventRecord
                    0.00%  156.83us       197     796ns     412ns  7.6580us  cuEventDestroy
                    0.00%  154.58us       393     393ns     230ns  7.6360us  cuPointerGetAttribute
                    0.00%  25.972us         2  12.986us  11.569us  14.403us  cuStreamCreate
                    0.00%  24.368us         2  12.184us  7.1720us  17.196us  cuStreamDestroy
                    0.00%  8.0820us         2  4.0410us  2.6660us  5.4160us  cuStreamSynchronize
                    0.00%  4.0600us         2  2.0300us  1.4350us  2.6250us  cuCtxSynchronize
                    0.00%  2.7130us         1  2.7130us  2.7130us  2.7130us  cuDeviceGetPCIBusId
                    0.00%  2.1270us         4     531ns     307ns     748ns  cuCtxSetCurrent
                    0.00%  2.0830us         2  1.0410us     612ns  1.4710us  cuModuleGetFunction
                    0.00%  1.5310us         3     510ns     191ns  1.0640us  cuDeviceGetCount
                    0.00%  1.1070us         2     553ns     461ns     646ns  cuCtxPopCurrent
                    0.00%     733ns         2     366ns     199ns     534ns  cuDeviceGet

==44046== Unified Memory profiling result:
Device "GeForce GTX 1080 (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
   28880  75.789KB  4.0000KB  0.9961MB  2.087399GB  214.8492ms  Host To Device
   13520  151.98KB  4.0000KB  0.9961MB  1.959534GB  169.6310ms  Device To Host
    9274         -         -         -           -  860.9489ms  Gpu page fault groups
     366  4.0000KB  4.0000KB  4.0000KB  1.429688MB           -  Memory thrashes
Total CPU Page faults: 8063
Total CPU thrashes: 366

CUDA execution

For reference, here is the execution with CUDA 11.0 and the master branch. There are some issues with preprocessor logic that break the sycl-clean branch when compiling for CUDA. I do not know what is wrong there yet.

$ QS_DEVICE=GPU /usr/local/cuda-11.0/bin/nvprof ./qs 
Copyright (c) 2016
Lawrence Livermore National Security, LLC
All Rights Reserved
Quicksilver Version     : 2020-Feb-4-22:35:56
Quicksilver Git Hash    : af27b3dcce08933786cb526e2f8a0bbe99d99b07
MPI Version             : 3.0
Number of MPI ranks     : 1
Number of OpenMP Threads: 1
Number of OpenMP CPUs   : 1

Simulation:
   dt: 1e-08
   fMax: 0.1
   inputFile: 
   energySpectrum: 
   boundaryCondition: reflect
   loadBalance: 0
   cycleTimers: 0
   debugThreads: 0
   lx: 100
   ly: 100
   lz: 100
   nParticles: 1000000
   batchSize: 0
   nBatches: 10
   nSteps: 10
   nx: 10
   ny: 10
   nz: 10
   seed: 1029384756
   xDom: 0
   yDom: 0
   zDom: 0
   eMax: 20
   eMin: 1e-09
   nGroups: 230
   lowWeightCutoff: 0.001
   bTally: 1
   fTally: 1
   cTally: 1
   coralBenchmark: 0
   crossSectionsOut:

Geometry:
   material: sourceMaterial
   shape: brick
   xMax: 100
   xMin: 0
   yMax: 100
   yMin: 0
   zMax: 100
   zMin: 0

Material:
   name: sourceMaterial
   mass: 1000
   nIsotopes: 10
   nReactions: 9
   sourceRate: 1e+10
   totalCrossSection: 1
   absorptionCrossSection: flat
   fissionCrossSection: flat
   scatteringCrossSection: flat
   absorptionCrossSectionRatio: 1
   fissionCrossSectionRatio: 0.1
   scatteringCrossSectionRatio: 1

CrossSection:
   name: flat
   A: 0
   B: 0
   C: 0
   D: 0
   E: 1
   nuBar: 2.4
==71046== NVPROF is profiling process 71046, command: ./qs
Building partition 0
Building partition 1
Building partition 2
Building partition 3
done building
Building MC_Domain 0
Building MC_Domain 1
Building MC_Domain 2
Building MC_Domain 3
Starting Consistency Check
Finished Consistency Check
Finished initMesh
cycle           start       source           rr        split       absorb      scatter      fission      produce      collisn       escape       census      num_seg   scalar_flux      cycleInit  cycleTracking  cycleFinalize
       0            0       100000            0       900000      1078182      1076792       107133       257364      2262107            0        72049      2670386  2.264064e+08   9.041900e-02   4.416690e-01   4.400000e-05
       1        72049       100000            0       828008      1107255      1106235       110306       264657      2323796            0        47153      2719702  2.438830e+08   1.405880e-01   4.555290e-01   4.300000e-05
       2        47153       100000            0       852712      1086097      1088696       108334       259738      2283127            0        65172      2687840  2.435394e+08   1.361020e-01   4.307970e-01   4.100000e-05
       3        65172       100000        68015       834785      1017555      1018659       101778       244593      2137992            0        57202      2517378  2.450517e+08   1.369190e-01   4.161950e-01   4.200000e-05
       4        57202       100000        62214       842934      1020418      1019522       101687       244038      2141627            0        59855      2522163  2.434017e+08   1.338490e-01   4.129190e-01   4.200000e-05
       5        59855       100000        56726       840345      1029994      1029682       103183       247672      2162859            0        57969      2545713  2.451216e+08   1.335090e-01   4.229580e-01   4.200000e-05
       6        57969       100000        52439       841925      1032190      1032180       102801       246877      2167171            0        59341      2551468  2.446226e+08   1.345140e-01   4.111950e-01   4.300000e-05
       7        59341       100000        59663       840635      1023444      1022593       102792       246649      2148829            0        60726      2531066  2.441845e+08   1.343080e-01   4.148660e-01   4.200000e-05
       8        60726       100000        68187       839357      1013501      1014287       101238       243112      2129026            0        60269      2508491  2.440307e+08   1.345430e-01   4.114840e-01   4.400000e-05
       9        60269       100000        71159       839953      1012439      1011892       101368       243262      2125699            0        58518      2500968  2.444142e+08   1.341150e-01   4.092590e-01   4.200000e-05

Timer                       Cumulative   Cumulative   Cumulative   Cumulative   Cumulative   Cumulative
Name                            number    microSecs    microSecs    microSecs    microSecs   Efficiency
                              of calls          min          avg          max       stddev       Rating
main                                 1    5.552e+06    5.552e+06    5.552e+06    0.000e+00       100.00
cycleInit                           10    1.309e+06    1.309e+06    1.309e+06    0.000e+00       100.00
cycleTracking                       10    4.227e+06    4.227e+06    4.227e+06    0.000e+00       100.00
cycleTracking_Kernel               992    3.889e+06    3.889e+06    3.889e+06    0.000e+00       100.00
cycleTracking_MPI                 1083    3.375e+05    3.375e+05    3.375e+05    0.000e+00       100.00
cycleTracking_Test_Done              0    0.000e+00    0.000e+00    0.000e+00    0.000e+00         0.00
cycleFinalize                       20    3.602e+03    3.602e+03    3.602e+03    0.000e+00       100.00
Figure Of Merit              6.093e+06 [Num Segments / Cycle Tracking Time]
==71046== Profiling application: ./qs
==71046== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  3.84139s       195  19.699ms  254.05us  38.108ms  CycleTrackingKernel(MonteCarlo*, int, ParticleVault*, ParticleVault*)
                    0.00%  1.3760us         1  1.3760us  1.3760us  1.3760us  _GLOBAL__N__48_tmpxft_00011231_00000000_7_cudaFunctions_cpp1_ii_2bb9853e::WarmUpKernel(void)
      API calls:   92.43%  3.84205s       196  19.602ms  10.971us  38.107ms  cudaDeviceSynchronize
                    6.96%  289.13ms       393  735.69us  5.5720us  284.39ms  cudaMallocManaged
                    0.50%  20.979ms       393  53.380us  9.7610us  742.77us  cudaFree
                    0.09%  3.8650ms       196  19.719us  8.9830us  202.60us  cudaLaunchKernel
                    0.01%  272.25us       101  2.6950us     183ns  170.40us  cuDeviceGetAttribute
                    0.01%  236.49us         1  236.49us  236.49us  236.49us  cuDeviceTotalMem
                    0.00%  63.885us       195     327ns     253ns  3.6680us  cudaPeekAtLastError
                    0.00%  26.808us         1  26.808us  26.808us  26.808us  cuDeviceGetName
                    0.00%  2.8380us         1  2.8380us  2.8380us  2.8380us  cuDeviceGetPCIBusId
                    0.00%  2.8210us         1  2.8210us  2.8210us  2.8210us  cudaSetDevice
                    0.00%  1.5900us         3     530ns     236ns  1.0870us  cuDeviceGetCount
                    0.00%     873ns         2     436ns     192ns     681ns  cuDeviceGet
                    0.00%     447ns         1     447ns     447ns     447ns  cudaGetDeviceCount
                    0.00%     371ns         1     371ns     371ns     371ns  cuDeviceGetUuid

==71046== Unified Memory profiling result:
Device "GeForce GTX 1080 (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
   31624  69.160KB  4.0000KB  0.9961MB  2.085815GB  217.8466ms  Host To Device
   13544  151.74KB  4.0000KB  0.9961MB  1.960014GB  169.6492ms  Device To Host
    8236         -         -         -           -  916.3543ms  Gpu page fault groups
     258  4.0000KB  4.0000KB  4.0000KB  1.007813MB           -  Memory thrashes
Total CPU Page faults: 8087
Total CPU thrashes: 258

Oct 09 '20 16:10 jeffhammond