Accelerate resample_by_picking through better ordering
resample_by_picking routinely shows up at the top of our GPU profiles. Here's an example from a run (mirgecom wave-eager, nel_1d = 24, 3D, order 3):
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 15.95% 1.72898s 28160 61.398us 3.4240us 133.22us resample_by_picking
14.57% 1.57934s 26499 59.599us 4.4480us 117.47us multiply
14.41% 1.56223s 9698 161.09us 16.000us 543.23us diff
12.79% 1.38640s 21601 64.182us 4.4150us 118.56us axpbyz
11.66% 1.26342s 5283 239.15us 120.54us 529.85us grudge_assign_0
8.81% 954.48ms 23428 40.741us 1.6640us 81.888us axpb
7.93% 859.20ms 10560 81.363us 60.831us 135.04us resample_by_mat
7.84% 849.96ms 1760 482.93us 481.79us 541.57us face_mass
2.16% 233.96ms 62 3.7735ms 1.3440us 11.178ms [CUDA memcpy DtoH]
1.58% 171.51ms 2235 76.738us 1.6640us 87.391us [CUDA memcpy DtoD]
1.19% 128.99ms 3523 36.612us 19.328us 37.952us [CUDA memset]
0.49% 53.375ms 440 121.31us 120.67us 136.42us grudge_assign_1
0.49% 53.364ms 440 121.28us 120.67us 136.19us grudge_assign_2
0.07% 7.2597ms 127 57.162us 1.1520us 847.90us [CUDA memcpy HtoD]
0.03% 3.5939ms 12 299.49us 13.696us 529.66us nodes_0
0.01% 1.4677ms 6 244.62us 14.112us 528.73us actx_special_sqrt
0.00% 359.81us 6 59.967us 5.0880us 115.14us divide
0.00% 136.06us 1 136.06us 136.06us 136.06us actx_special_exp
It's especially striking that it's at the top of the list given that it touches only lower-dimensional (surface rather than volume) data. multiply has a similar number of calls, but it touches volume data, and it still completes more quickly.
I think there are two opportunities here that we could try:
- Currently, the kernel has an indirection on both the read and the write end (see the (very simple) source). For surjective/onto connections, we can do away with the indirection on the write side by appropriately sorting the source indices.
- Even for non-surjective connections, it's likely that we would benefit from sorting by the write index, to keep the writes as coalesced as possible. (See the sketch after this list.)
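To make the idea concrete, here is a minimal numpy sketch of the reordering; to_element_indices and from_element_indices are hypothetical stand-ins for a batch's pick lists, with made-up values:

import numpy as np

# Illustrative stand-ins for a batch's pick lists (hypothetical values):
to_element_indices = np.array([3, 0, 2, 1])
from_element_indices = np.array([7, 5, 6, 4])

# The kernel conceptually performs
#     result[to_element_indices[i]] = ary[from_element_indices[i]]
# Sorting the source indices by the target index turns the write side
# into a plain contiguous store when the connection is onto, i.e. when
# to_element_indices is a permutation of 0..n-1:
order = np.argsort(to_element_indices)
from_sorted = from_element_indices[order]
# Now: result[i] = ary[from_sorted[i]]  (no indirection on the write)
# For non-surjective connections, the same sort still groups writes to
# nearby addresses, which should help keep them coalesced.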
IMO it's likely that this will have a benefit (but I obviously can't guarantee it). I think it's worth trying.
cc @lukeolson
Still experimenting with this, but so far I'm getting the sense that it's something other than the indirection that's causing resample_by_picking to be slow.
First I tried pre-sorting by batch.to_element_indices, and that didn't seem to have any effect. Next I tried progressively stripping down the loop (ignoring correctness) to see whether I could simplify it enough to make it disappear from the profile results. I stripped it down to this:
# Imports for context; make_loopy_program is assumed to be meshmode's
# array-context wrapper around loopy.
import numpy as np
import loopy as lp
from meshmode.array_context import make_loopy_program

knl = make_loopy_program(
        """{[iel, idof]:
            0<=iel<nelements and
            0<=idof<n_to_nodes}""",
        "result[iel, idof] = ary[iel, idof]",
        [
            lp.GlobalArg("result", None,
                shape="nelements_result, n_to_nodes",
                offset=lp.auto),
            lp.GlobalArg("ary", None,
                shape="nelements_vec, n_from_nodes",
                offset=lp.auto),
            lp.ValueArg("nelements_result", np.int32),
            lp.ValueArg("nelements_vec", np.int32),
            lp.ValueArg("n_from_nodes", np.int32),
            "...",
        ],
        name="resample_by_picking")
(i.e., no indirection), and it still looks about the same in the profiling results.
(Note: nelements here should be the number of entries in batch.to_element_indices, but looking at the generated code I can't tell if it is. More on this below.)
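For reference, here's roughly how it's being invoked (a sketch; result_ary and src_ary are illustrative names for the device arrays):

# result and ary are passed as device arrays, so loopy can deduce their
# shape parameters; nelements has to be supplied explicitly (omitting it
# gives the "can't be deduced" error mentioned below).
actx.call_loopy(
        knl,
        result=result_ary,
        ary=src_ary,
        nelements=len(batch.to_element_indices))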
Before:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 16.16% 149.58ms 908 164.73us 17.120us 498.11us diff
15.78% 146.09ms 2560 57.065us 4.6710us 123.14us resample_by_picking
12.37% 114.46ms 2499 45.802us 5.5670us 92.512us multiply
11.44% 105.84ms 493 214.68us 106.69us 482.94us grudge_assign_0
10.60% 98.111ms 2001 49.031us 5.5670us 92.192us axpbyz
7.81% 72.307ms 960 75.319us 55.200us 124.96us resample_by_mat
7.76% 71.836ms 2228 32.242us 3.3280us 66.976us axpb
7.44% 68.819ms 72 955.81us 1.3440us 3.0474ms [CUDA memcpy DtoH]
7.09% 65.586ms 160 409.91us 395.33us 465.12us face_mass
1.46% 13.493ms 235 57.417us 3.8720us 68.735us [CUDA memcpy DtoD]
0.60% 5.5327ms 50 110.65us 106.78us 124.29us grudge_assign_1
0.48% 4.4395ms 40 110.99us 106.98us 124.26us grudge_assign_2
0.41% 3.8396ms 129 29.764us 1.0240us 518.59us [CUDA memcpy HtoD]
0.35% 3.2791ms 12 273.25us 14.016us 482.05us nodes_0
0.14% 1.3395ms 6 223.25us 14.368us 481.44us actx_special_sqrt
0.03% 292.25us 323 904ns 832ns 1.6640us [CUDA memset]
0.03% 276.19us 6 46.031us 6.4000us 85.567us divide
0.02% 213.18us 10 21.318us 20.960us 21.792us reduce_kernel_stage1
0.01% 124.99us 1 124.99us 124.99us 124.99us actx_special_exp
0.01% 72.831us 20 3.6410us 3.2960us 4.4160us reduce_kernel_stage2
After:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 15.15% 164.78ms 908 181.47us 17.087us 500.35us diff
14.83% 161.36ms 2560 63.032us 4.7990us 122.34us resample_by_picking
10.91% 118.63ms 493 240.63us 123.46us 484.73us grudge_assign_0
10.60% 115.31ms 2499 46.142us 6.3670us 92.160us multiply
9.09% 98.829ms 2001 49.389us 6.3670us 93.216us axpbyz
9.03% 98.180ms 3592 27.333us 1.4720us 3.1643ms [CUDA memcpy DtoH]
7.38% 80.274ms 960 83.618us 63.136us 124.64us resample_by_mat
6.77% 73.647ms 160 460.30us 459.17us 463.71us face_mass
6.68% 72.666ms 2228 32.614us 3.3280us 66.400us axpb
3.57% 38.881ms 7040 5.5220us 3.2640us 8.4480us take
3.21% 34.891ms 3649 9.5610us 1.0240us 494.24us [CUDA memcpy HtoD]
1.25% 13.577ms 235 57.775us 3.8400us 68.383us [CUDA memcpy DtoD]
0.57% 6.1862ms 50 123.72us 123.49us 124.48us grudge_assign_1
0.45% 4.9483ms 40 123.71us 123.52us 124.22us grudge_assign_2
0.30% 3.2806ms 12 273.39us 14.112us 482.46us nodes_0
0.12% 1.3401ms 6 223.34us 14.304us 481.57us actx_special_sqrt
0.03% 312.64us 323 967ns 928ns 1.6640us [CUDA memset]
0.03% 277.79us 6 46.298us 6.4640us 86.272us divide
0.02% 215.10us 10 21.510us 21.152us 22.048us reduce_kernel_stage1
0.01% 125.09us 1 125.09us 125.09us 125.09us actx_special_exp
0.01% 81.535us 20 4.0760us 3.7120us 4.6080us reduce_kernel_stage2
(this was for 10 timesteps with nel_1d = 24, 3D, order 3).
I checked the loop sizes (nelements*n_to_nodes):
min_size = 10580
max_size = 719440
avg_size = 365010.0
compared to 1.46M dofs in the volume discretization.
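(Roughly how these were tallied; a sketch assuming meshmode's connection/group/batch layout, with the per-batch node count taken from result_unit_nodes, which is an assumption:)

# Gather nelements * n_to_nodes for each picking batch of the connection.
sizes = [
    len(batch.to_element_indices)
    * batch.result_unit_nodes.shape[1]  # n_to_nodes (assumed accessor)
    for grp in connection.groups
    for batch in grp.batches]
print(min(sizes), max(sizes), sum(sizes) / len(sizes))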
Here is the code generated for the above loop:
#define lid(N) ((int) get_local_id(N))
#define gid(N) ((int) get_group_id(N))
#if __OPENCL_C_VERSION__ < 120
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#endif
__kernel void __attribute__ ((reqd_work_group_size(16, 1, 1))) resample_by_picking(__global double *__restrict__ result, int const result_offset, __global double const *__restrict__ ary, int const ary_offset, int const nelements_result, int const nelements_vec, int const n_from_nodes, int const n_to_nodes, int const nelements)
{
  if (-1 + -1 * lid(0) + n_to_nodes >= 0)
    for (int idof_outer = 0; idof_outer <= -1 + -1 * lid(0) + (15 + n_to_nodes + 15 * lid(0)) / 16; ++idof_outer)
      result[result_offset + n_to_nodes * gid(0) + 16 * idof_outer + lid(0)] = ary[ary_offset + n_from_nodes * gid(0) + 16 * idof_outer + lid(0)];
}
I'm a little bit confused that it's not using nelements (I'm passing it in explicitly to actx.call_loopy; I get an error saying it can't be deduced if I omit it). Any ideas?
Thanks for experimenting with this!
> I'm a little bit confused that it's not using nelements (I'm passing it in explicitly to actx.call_loopy; I get an error saying it can't be deduced if I omit it). Any ideas?
It's OK that that doesn't appear in the kernel. It gets used in the "wrapper" code that computes the OpenCL "grid" bounds. Right now, the (simple) work decomposition is one group per element.
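(Conceptually, something like this happens in the generated wrapper; the names below are illustrative, not loopy's actual internals:)

# Hedged sketch of the launch-geometry computation that lives in the
# wrapper rather than in the kernel body, which is why nelements never
# appears in the kernel source.
nelements = 2560                      # illustrative
local_size = (16, 1, 1)               # from reqd_work_group_size above
num_groups = (nelements, 1, 1)        # one work-group per element
global_size = tuple(g * l for g, l in zip(num_groups, local_size))
# The kernel is then enqueued over (global_size, local_size), e.g. with
# pyopencl's cl.enqueue_nd_range_kernel.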
As for removing the indirections, it's super confusing to me that that's not having an effect. It has to... that's a whole bunch of extra memory accesses (the index reads) that are now not happening, and somehow that's not measurable? Wha?
The only explanation I can think of is that there's something else very wrong with this kernel. Speaking of work decomposition, one element per group is stupid (i.e. much too small), especially for surface kernels. (cc @nchristensen) Was this 2D? Nvidia's visual profiler can produce a nice report of "what's wrong with your kernel". It might be worth looking at what it has to say.
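For what it's worth, a coarser decomposition could be prototyped with loopy transforms along these lines (a sketch only; the block sizes are illustrative, not tuned):

import loopy as lp

# Give each work-group a block of 16 elements instead of one; the dof
# dimension stays on the fastest-varying lane index.
knl = lp.split_iname(knl, "iel", 16, outer_tag="g.0", inner_tag="l.1")
knl = lp.split_iname(knl, "idof", 16, inner_tag="l.0")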
> It's OK that that doesn't appear in the kernel. It gets used in the "wrapper" code that computes the OpenCL "grid" bounds. Right now, the (simple) work decomposition is one group per element.
Ah, ok.
> The only explanation I can think of is that there's something else very wrong with this kernel. Speaking of work decomposition, one element per group is stupid (i.e. much too small), especially for surface kernels. (cc @nchristensen) Was this 2D? Nvidia's visual profiler can produce a nice report of "what's wrong with your kernel". It might be worth looking at what it has to say.
It's 3D. I'll take a look into the visual profiler thing.
Note that the visual profiler (nvvp) isn't very usable from Lassen; the network latency for graphical apps is too high. It is usually better to generate a profile with nvprof, scp it to your local machine, and import it into your local nvvp.
Note that for the deep analysis, nvvp/nvprof can be flaky for Python applications. I've had luck with the following command:
$ export PYOPENCL_CTX=':1'
$ nvprof -f --analysis-metrics -o foo.nvvp --replay-mode application python examples/wave-eager.py
This will create (or overwrite) the output profile in foo.nvvp.
Note that the deep analysis (the --analysis-metrics argument) is extremely slow, so you need to use small input parameters.
EDIT: You will need an updated CUDA version:
$ ml load cuda/11
I was finally able to get some decent nvprof results on Lassen:
Command:
$ nvprof --kernels resample_by_picking -m all --replay-mode application python ./wave-eager.py
Output:
Kernel: resample_by_picking
Invocations Metric Name Metric Description Min Max Avg
2 inst_per_warp Instructions per warp 274.000000 274.000000 274.000000
2 branch_efficiency Branch Efficiency 80.00% 80.00% 80.00%
2 warp_execution_efficiency Warp Execution Efficiency 20.57% 20.57% 20.57%
2 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 19.71% 19.71% 19.71%
1 inst_replay_overhead Instruction Replay Overhead 0.095238 0.095238 0.095238
2 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
2 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
1 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
2 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
1 gld_transactions_per_request Global Load Transactions Per Request 1.500000 1.500000 1.500000
3 gst_transactions_per_request Global Store Transactions Per Request 1.000000 1.000000 1.000000
2 shared_store_transactions Shared Store Transactions 0 0 0
2 shared_load_transactions Shared Load Transactions 0 0 0
1 local_load_transactions Local Load Transactions 0 0 0
2 local_store_transactions Local Store Transactions 0 0 0
1 gld_transactions Global Load Transactions 180 180 180
3 gst_transactions Global Store Transactions 30 30 30
2 sysmem_read_transactions System Memory Read Transactions 0 0 0
1 sysmem_write_transactions System Memory Write Transactions 5 5 5
2 l2_read_transactions L2 Read Transactions 231 564 397
1 l2_write_transactions L2 Write Transactions 50 50 50
3 dram_read_transactions Device Memory Read Transactions 0 0 0
3 dram_write_transactions Device Memory Write Transactions 32 40 36
1 global_hit_rate Global Hit Rate in unified l1/tex 0.00% 0.00% 0.00%
1 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
2640 gld_requested_throughput Requested Global Load Throughput 423.86MB/s 5.9027GB/s 3.9078GB/s
2640 gst_requested_throughput Requested Global Store Throughput 211.93MB/s 2.9520GB/s 1.9715GB/s
1 gld_throughput Global Load Throughput 1.3629GB/s 1.3629GB/s 1.3629GB/s
3 gst_throughput Global Store Throughput 232.60MB/s 234.51MB/s 233.24MB/s
2 local_memory_overhead Local Memory Overhead 0.00% 0.00% 0.00%
2 tex_cache_hit_rate Unified Cache Hit Rate 0.00% 0.00% 0.00%
3 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 100.00% 100.00% 100.00%
2 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 93.33% 100.00% 96.67%
3 dram_read_throughput Device Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
3 dram_write_throughput Device Memory Write Throughput 248.11MB/s 312.68MB/s 279.88MB/s
3 tex_cache_throughput Unified cache to Multiprocessor throughput 4.5430GB/s 4.5803GB/s 4.5554GB/s
3 l2_tex_read_throughput L2 Throughput (Texture Reads) 1.0306GB/s 1.3629GB/s 1.2527GB/s
2 l2_tex_write_throughput L2 Throughput (Texture Writes) 232.60MB/s 234.51MB/s 233.55MB/s
2 l2_read_throughput L2 Throughput (Reads) 1.7634GB/s 4.2705GB/s 3.0220GB/s
1 l2_write_throughput L2 Throughput (Writes) 387.67MB/s 387.67MB/s 387.67MB/s
2 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 sysmem_write_throughput System Memory Write Throughput 38.767MB/s 38.767MB/s 38.767MB/s
1 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
2 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
2 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
2 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 gld_efficiency Global Memory Load Efficiency 33.33% 33.33% 33.33%
3 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
3 tex_cache_transactions Unified cache to Multiprocessor transactions 150 150 150
2640 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
2640 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
2640 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
2640 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
2640 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
2640 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
2640 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 0 0 0
2640 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
2640 flop_count_sp_special Floating Point Operations(Single Precision Special) 0 0 0
2 inst_executed Instructions Executed 1890 8220 5055
2 inst_issued Instructions Issued 2070 2070 2070
3 dram_utilization Device Memory Utilization Low (1) Low (1) Low (1)
1 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
2 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 3.17% 17.88% 10.53%
2 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 14.07% 16.96% 15.52%
2 stall_memory_dependency Issue Stall Reasons (Data Request) 35.32% 41.15% 38.23%
2 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
2 stall_sync Issue Stall Reasons (Synchronization) 0.00% 0.00% 0.00%
2 stall_other Issue Stall Reasons (Other) 0.21% 0.27% 0.24%
2 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 31.85% 37.70% 34.77%
2 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 0.00% 0.00% 0.00%
2 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
2640 inst_fp_32 FP Instructions(Single) 0 0 0
2640 inst_fp_64 FP Instructions(Double) 0 0 0
2640 inst_integer Integer Instructions 7560 113400 70690
2640 inst_bit_convert Bit-Convert Instructions 0 0 0
2640 inst_control Control-Flow Instructions 480 7200 4581
2640 inst_compute_ld_st Load/Store Instructions 600 9000 5727
2640 inst_misc Misc Instructions 600 9000 5727
2640 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
2 issue_slots Issue Slots 2070 2070 2070
1 cf_issued Issued Control-Flow Instructions 150 150 150
1 cf_executed Executed Control-Flow Instructions 150 150 150
2 ldst_issued Issued Load/Store Instructions 210 210 210
2 ldst_executed Executed Load/Store Instructions 210 210 210
2 atomic_transactions Atomic Transactions 0 0 0
2 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
2 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
2 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
3 l2_tex_read_transactions L2 Transactions (Texture Reads) 135 180 165
2 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 0.67% 0.74% 0.71%
2 stall_not_selected Issue Stall Reasons (Not Selected) 0.00% 0.00% 0.00%
2 l2_tex_write_transactions L2 Transactions (Texture Writes) 30 30 30
2640 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
2640 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
2640 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
2640 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
2640 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 384 4992 413
2640 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 128 1664 137
2640 nvlink_total_response_data_received NVLink Total Response Data Received 1856 3376 1893
2640 nvlink_user_response_data_received NVLink User Response Data Received 1280 2080 1302
2640 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
2640 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
2640 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
2640 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
2640 inst_fp_16 HP Instructions(Half) 0 0 0
2 ipc Executed IPC 0.040125 0.062150 0.051138
2 issued_ipc Issued IPC 0.043946 0.052494 0.048220
2 issue_slot_utilization Issue Slot Utilization 1.10% 1.31% 1.21%
2 sm_efficiency Multiprocessor Activity 8.98% 16.06% 12.52%
2 achieved_occupancy Achieved Occupancy 0.015624 0.015625 0.015624
2 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.043946 0.052494 0.048220
2 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
1 l2_utilization L2 Cache Utilization Low (1) Low (1) Low (1)
3 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
2 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
1 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
2 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 special_fu_utilization Special Function Unit Utilization Idle (0) Idle (0) Idle (0)
3 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (1) Low (1) Low (1)
1 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
3 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
3 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
3 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
2 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
1 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
2 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
2 inst_executed_global_loads Warp level instructions for global loads 120 120 120
2 inst_executed_local_loads Warp level instructions for local loads 0 0 0
2 inst_executed_shared_loads Warp level instructions for shared loads 0 0 0
3 inst_executed_surface_loads Warp level instructions for surface loads 0 0 0
3 inst_executed_global_stores Warp level instructions for global stores 30 30 30
3 inst_executed_local_stores Warp level instructions for local stores 0 0 0
2 inst_executed_shared_stores Warp level instructions for shared stores 0 0 0
3 inst_executed_surface_stores Warp level instructions for surface stores 0 0 0
3 inst_executed_global_atomics Warp level instructions for global atom and atom cas 0 0 0
3 inst_executed_global_reductions Warp level instructions for global reductions 0 0 0
3 inst_executed_surface_atomics Warp level instructions for surface atom and atom cas 0 0 0
3 inst_executed_surface_reductions Warp level instructions for surface reductions 0 0 0
2 inst_executed_shared_atomics Warp level shared instructions for atom and atom CAS 0 0 0
3 inst_executed_tex_ops Warp level instructions for texture 0 0 0
3 dram_read_bytes Total bytes read from DRAM to L2 cache 0 0 0
3 dram_write_bytes Total bytes written from L2 cache to DRAM 1024 1280 1152
2 global_load_requests Total number of global load requests from Multiprocessor 120 120 120
2 local_load_requests Total number of local load requests from Multiprocessor 0 0 0
5 surface_load_requests Total number of surface load requests from Multiprocessor 0 0 0
3 global_store_requests Total number of global store requests from Multiprocessor 30 30 30
3 local_store_requests Total number of local store requests from Multiprocessor 0 0 0
3 surface_store_requests Total number of surface store requests from Multiprocessor 0 0 0
2 global_atomic_requests Total number of global atomic requests from Multiprocessor 0 0 0
2 global_reduction_requests Total number of global reduction requests from Multiprocessor 0 0 0
5 surface_atomic_requests Total number of surface atomic requests from Multiprocessor 0 0 0
5 surface_reduction_requests Total number of surface reduction requests from Multiprocessor 0 0 0
2 l2_global_load_bytes Bytes read from L2 for misses in L1 for global loads 4320 5760 5040
1 l2_local_load_bytes Bytes read from L2 for misses in L1 for local loads 0 0 0
5 l2_surface_load_bytes Bytes read from L2 for misses in L1 for surface loads 0 0 0
5 l2_global_atomic_store_bytes Bytes written to L2 from L1 for global atomics 0 0 0
3 l2_local_global_store_bytes Bytes written to L2 from L1 for local and global stores. 960 960 960
5 l2_surface_store_bytes Bytes read from L2 for misses in L1 for surface stores 0 0 0
2 sysmem_read_bytes System Memory Read Bytes 0 0 0
1 sysmem_write_bytes System Memory Write Bytes 160 160 160
2 l2_tex_hit_rate L2 Cache Hit Rate 99.05% 100.00% 99.52%
2 texture_load_requests Total number of texture Load requests from Multiprocessor 0 0 0
1 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
Edit: full profile on porter: profile.nvvp.gz