Accelerate resample_by_picking through better ordering
resample_by_picking routinely shows up at the top of our GPU profiles. Here's an example from a run (mirgecom wave-eager, nel_1d = 24, 3D, order 3):
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 15.95% 1.72898s 28160 61.398us 3.4240us 133.22us resample_by_picking
14.57% 1.57934s 26499 59.599us 4.4480us 117.47us multiply
14.41% 1.56223s 9698 161.09us 16.000us 543.23us diff
12.79% 1.38640s 21601 64.182us 4.4150us 118.56us axpbyz
11.66% 1.26342s 5283 239.15us 120.54us 529.85us grudge_assign_0
8.81% 954.48ms 23428 40.741us 1.6640us 81.888us axpb
7.93% 859.20ms 10560 81.363us 60.831us 135.04us resample_by_mat
7.84% 849.96ms 1760 482.93us 481.79us 541.57us face_mass
2.16% 233.96ms 62 3.7735ms 1.3440us 11.178ms [CUDA memcpy DtoH]
1.58% 171.51ms 2235 76.738us 1.6640us 87.391us [CUDA memcpy DtoD]
1.19% 128.99ms 3523 36.612us 19.328us 37.952us [CUDA memset]
0.49% 53.375ms 440 121.31us 120.67us 136.42us grudge_assign_1
0.49% 53.364ms 440 121.28us 120.67us 136.19us grudge_assign_2
0.07% 7.2597ms 127 57.162us 1.1520us 847.90us [CUDA memcpy HtoD]
0.03% 3.5939ms 12 299.49us 13.696us 529.66us nodes_0
0.01% 1.4677ms 6 244.62us 14.112us 528.73us actx_special_sqrt
0.00% 359.81us 6 59.967us 5.0880us 115.14us divide
0.00% 136.06us 1 136.06us 136.06us 136.06us actx_special_exp
It's especially striking that it's at the top of the list given that it touches only lower-dimensional (surface rather than volume) data. multiply has a similar number of calls, but it touches volume data, and it still completes more quickly.
I think there are two opportunities here that we could try:
- Currently, the kernel has an indirection on both the read and the write end (see the (very simple) source). For surjective/onto connections, we can do away with the indirection on the write side by appropriately sorting the source indices.
- Even for non-surjective connections, it's likely that we would benefit from sorting by the write index, to keep the writes as coalesced as possible. (See the sketch after this list.)
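To make the idea concrete, here is a minimal numpy sketch of the reordering; to_element_indices and from_element_indices are hypothetical stand-ins for a batch's pick lists, with made-up values:

import numpy as np

# Illustrative stand-ins for a batch's pick lists (hypothetical values):
to_element_indices = np.array([3, 0, 2, 1])
from_element_indices = np.array([7, 5, 6, 4])

# The kernel conceptually performs
#     result[to_element_indices[i]] = ary[from_element_indices[i]]
# Sorting the source indices by the target index turns the write side
# into a plain contiguous store when the connection is onto, i.e. when
# to_element_indices is a permutation of 0..n-1:
order = np.argsort(to_element_indices)
from_sorted = from_element_indices[order]
# Now: result[i] = ary[from_sorted[i]]  (no indirection on the write)
# For non-surjective connections, the same sort still groups writes to
# nearby addresses, which should help keep them coalesced.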
IMO it's likely that this will have a benefit (but I obviously can't guarantee it). I think it's worth trying.
cc @lukeolson
Still experimenting with this, but so far I'm getting the sense that it's something other than the indirection that's causing resample_by_picking to be slow.
First I tried pre-sorting by batch.to_element_indices, and that didn't seem to have any effect. Next I tried progressively stripping down the loop (ignoring correctness) to see whether I could simplify it enough to make it disappear from the profile results. I stripped it down to this:
# Imports for context; make_loopy_program is assumed to be meshmode's
# array-context wrapper around loopy.
import numpy as np
import loopy as lp
from meshmode.array_context import make_loopy_program

knl = make_loopy_program(
        """{[iel, idof]:
            0<=iel<nelements and
            0<=idof<n_to_nodes}""",
        "result[iel, idof] = ary[iel, idof]",
        [
            lp.GlobalArg("result", None,
                shape="nelements_result, n_to_nodes",
                offset=lp.auto),
            lp.GlobalArg("ary", None,
                shape="nelements_vec, n_from_nodes",
                offset=lp.auto),
            lp.ValueArg("nelements_result", np.int32),
            lp.ValueArg("nelements_vec", np.int32),
            lp.ValueArg("n_from_nodes", np.int32),
            "...",
        ],
        name="resample_by_picking")
(i.e., no indirection), and it still looks about the same in the profiling results.
(Note: nelements here should be the number of entries in batch.to_element_indices, but looking at the generated code I can't tell if it is. More on this below.)
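For reference, here's roughly how it's being invoked (a sketch; result_ary and src_ary are illustrative names for the device arrays):

# result and ary are passed as device arrays, so loopy can deduce their
# shape parameters; nelements has to be supplied explicitly (omitting it
# gives the "can't be deduced" error mentioned below).
actx.call_loopy(
        knl,
        result=result_ary,
        ary=src_ary,
        nelements=len(batch.to_element_indices))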
Before:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 16.16% 149.58ms 908 164.73us 17.120us 498.11us diff
15.78% 146.09ms 2560 57.065us 4.6710us 123.14us resample_by_picking
12.37% 114.46ms 2499 45.802us 5.5670us 92.512us multiply
11.44% 105.84ms 493 214.68us 106.69us 482.94us grudge_assign_0
10.60% 98.111ms 2001 49.031us 5.5670us 92.192us axpbyz
7.81% 72.307ms 960 75.319us 55.200us 124.96us resample_by_mat
7.76% 71.836ms 2228 32.242us 3.3280us 66.976us axpb
7.44% 68.819ms 72 955.81us 1.3440us 3.0474ms [CUDA memcpy DtoH]
7.09% 65.586ms 160 409.91us 395.33us 465.12us face_mass
1.46% 13.493ms 235 57.417us 3.8720us 68.735us [CUDA memcpy DtoD]
0.60% 5.5327ms 50 110.65us 106.78us 124.29us grudge_assign_1
0.48% 4.4395ms 40 110.99us 106.98us 124.26us grudge_assign_2
0.41% 3.8396ms 129 29.764us 1.0240us 518.59us [CUDA memcpy HtoD]
0.35% 3.2791ms 12 273.25us 14.016us 482.05us nodes_0
0.14% 1.3395ms 6 223.25us 14.368us 481.44us actx_special_sqrt
0.03% 292.25us 323 904ns 832ns 1.6640us [CUDA memset]
0.03% 276.19us 6 46.031us 6.4000us 85.567us divide
0.02% 213.18us 10 21.318us 20.960us 21.792us reduce_kernel_stage1
0.01% 124.99us 1 124.99us 124.99us 124.99us actx_special_exp
0.01% 72.831us 20 3.6410us 3.2960us 4.4160us reduce_kernel_stage2
After:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 15.15% 164.78ms 908 181.47us 17.087us 500.35us diff
14.83% 161.36ms 2560 63.032us 4.7990us 122.34us resample_by_picking
10.91% 118.63ms 493 240.63us 123.46us 484.73us grudge_assign_0
10.60% 115.31ms 2499 46.142us 6.3670us 92.160us multiply
9.09% 98.829ms 2001 49.389us 6.3670us 93.216us axpbyz
9.03% 98.180ms 3592 27.333us 1.4720us 3.1643ms [CUDA memcpy DtoH]
7.38% 80.274ms 960 83.618us 63.136us 124.64us resample_by_mat
6.77% 73.647ms 160 460.30us 459.17us 463.71us face_mass
6.68% 72.666ms 2228 32.614us 3.3280us 66.400us axpb
3.57% 38.881ms 7040 5.5220us 3.2640us 8.4480us take
3.21% 34.891ms 3649 9.5610us 1.0240us 494.24us [CUDA memcpy HtoD]
1.25% 13.577ms 235 57.775us 3.8400us 68.383us [CUDA memcpy DtoD]
0.57% 6.1862ms 50 123.72us 123.49us 124.48us grudge_assign_1
0.45% 4.9483ms 40 123.71us 123.52us 124.22us grudge_assign_2
0.30% 3.2806ms 12 273.39us 14.112us 482.46us nodes_0
0.12% 1.3401ms 6 223.34us 14.304us 481.57us actx_special_sqrt
0.03% 312.64us 323 967ns 928ns 1.6640us [CUDA memset]
0.03% 277.79us 6 46.298us 6.4640us 86.272us divide
0.02% 215.10us 10 21.510us 21.152us 22.048us reduce_kernel_stage1
0.01% 125.09us 1 125.09us 125.09us 125.09us actx_special_exp
0.01% 81.535us 20 4.0760us 3.7120us 4.6080us reduce_kernel_stage2
(this was for 10 timesteps with nel_1d = 24, 3D, order 3).
I checked the loop sizes (nelements*n_to_nodes):
min_size = 10580
max_size = 719440
avg_size = 365010.0
compared to 1.46M dofs in the volume discretization.
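(Roughly how these were tallied; a sketch assuming meshmode's connection/group/batch layout, with the per-batch node count taken from result_unit_nodes, which is an assumption:)

# Gather nelements * n_to_nodes for each picking batch of the connection.
sizes = [
    len(batch.to_element_indices)
    * batch.result_unit_nodes.shape[1]  # n_to_nodes (assumed accessor)
    for grp in connection.groups
    for batch in grp.batches]
print(min(sizes), max(sizes), sum(sizes) / len(sizes))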
Here is the code generated for the above loop:
#define lid(N) ((int) get_local_id(N))
#define gid(N) ((int) get_group_id(N))
#if __OPENCL_C_VERSION__ < 120
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#endif
__kernel void __attribute__ ((reqd_work_group_size(16, 1, 1))) resample_by_picking(__global double *__restrict__ result, int const result_offset, __global double const *__restrict__ ary, int const ary_offset, int const nelements_result, int const nelements_vec, int const n_from_nodes, int const n_to_nodes, int const nelements)
{
  if (-1 + -1 * lid(0) + n_to_nodes >= 0)
    for (int idof_outer = 0; idof_outer <= -1 + -1 * lid(0) + (15 + n_to_nodes + 15 * lid(0)) / 16; ++idof_outer)
      result[result_offset + n_to_nodes * gid(0) + 16 * idof_outer + lid(0)] = ary[ary_offset + n_from_nodes * gid(0) + 16 * idof_outer + lid(0)];
}
I'm a little bit confused that it's not using nelements (I'm passing it in explicitly to actx.call_loopy; I get an error saying it can't be deduced if I omit it). Any ideas?
Thanks for experimenting with this!
> I'm a little bit confused that it's not using nelements (I'm passing it in explicitly to actx.call_loopy; I get an error saying it can't be deduced if I omit it). Any ideas?
It's OK that that doesn't appear in the kernel. It gets used in the "wrapper" code that computes the OpenCL "grid" bounds. Right now, the (simple) work decomposition is one group per element.
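(Conceptually, something like this happens in the generated wrapper; the names below are illustrative, not loopy's actual internals:)

# Hedged sketch of the launch-geometry computation that lives in the
# wrapper rather than in the kernel body, which is why nelements never
# appears in the kernel source.
nelements = 2560                      # illustrative
local_size = (16, 1, 1)               # from reqd_work_group_size above
num_groups = (nelements, 1, 1)        # one work-group per element
global_size = tuple(g * l for g, l in zip(num_groups, local_size))
# The kernel is then enqueued over (global_size, local_size), e.g. with
# pyopencl's cl.enqueue_nd_range_kernel.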
As for removing the indirections, it's super confusing to me that that's not having an effect. It has to... that's a whole bunch of extra memory accesses (the index reads) that are now not happening, and somehow that's not measurable? Wha?
The only explanation I can think of is that there's something else very wrong with this kernel. Speaking of work decomposition, one element per group is stupid (i.e. much too small), especially for surface kernels. (cc @nchristensen) Was this 2D? Nvidia's visual profiler can produce a nice report of "what's wrong with your kernel". It might be worth looking at what it has to say.
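For what it's worth, a coarser decomposition could be prototyped with loopy transforms along these lines (a sketch only; the block sizes are illustrative, not tuned):

import loopy as lp

# Give each work-group a block of 16 elements instead of one; the dof
# dimension stays on the fastest-varying lane index.
knl = lp.split_iname(knl, "iel", 16, outer_tag="g.0", inner_tag="l.1")
knl = lp.split_iname(knl, "idof", 16, inner_tag="l.0")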
> It's OK that that doesn't appear in the kernel. It gets used in the "wrapper" code that computes the OpenCL "grid" bounds. Right now, the (simple) work decomposition is one group per element.
Ah, ok.
> The only explanation I can think of is that there's something else very wrong with this kernel. Speaking of work decomposition, one element per group is stupid (i.e. much too small), especially for surface kernels. (cc @nchristensen) Was this 2D? Nvidia's visual profiler can produce a nice report of "what's wrong with your kernel". It might be worth looking at what it has to say.
It's 3D. I'll take a look into the visual profiler thing.
Note that the visual profiler (nvvp) isn't very usable from Lassen; the network latency for graphical apps is too high. It is usually better to generate a profile with nvprof, scp it to your local machine, and import it into your local nvvp.
Note that for the deep analysis, nvvp/nvprof can be flaky for Python applications. I've had luck with the following command:
$ export PYOPENCL_CTX=':1'
$ nvprof -f --analysis-metrics -o foo.nvvp --replay-mode application python examples/wave-eager.py
This will create (or overwrite) the output profile in foo.nvvp.
Note that the deep analysis (the --analysis-metrics argument) is extremely slow, so you need to use small input parameters.
EDIT: You will need an updated CUDA version:
$ ml load cuda/11
I was finally able to get some decent nvprof results on Lassen:
Command:
$ nvprof --kernels resample_by_picking -m all --replay-mode application python ./wave-eager.py
Output:
Kernel: resample_by_picking
Invocations Metric Name Metric Description Min Max Avg
2 inst_per_warp Instructions per warp 274.000000 274.000000 274.000000
2 branch_efficiency Branch Efficiency 80.00% 80.00% 80.00%
2 warp_execution_efficiency Warp Execution Efficiency 20.57% 20.57% 20.57%
2 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 19.71% 19.71% 19.71%
1 inst_replay_overhead Instruction Replay Overhead 0.095238 0.095238 0.095238
2 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
2 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
1 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
2 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
1 gld_transactions_per_request Global Load Transactions Per Request 1.500000 1.500000 1.500000
3 gst_transactions_per_request Global Store Transactions Per Request 1.000000 1.000000 1.000000
2 shared_store_transactions Shared Store Transactions 0 0 0
2 shared_load_transactions Shared Load Transactions 0 0 0
1 local_load_transactions Local Load Transactions 0 0 0
2 local_store_transactions Local Store Transactions 0 0 0
1 gld_transactions Global Load Transactions 180 180 180
3 gst_transactions Global Store Transactions 30 30 30
2 sysmem_read_transactions System Memory Read Transactions 0 0 0
1 sysmem_write_transactions System Memory Write Transactions 5 5 5
2 l2_read_transactions L2 Read Transactions 231 564 397
1 l2_write_transactions L2 Write Transactions 50 50 50
3 dram_read_transactions Device Memory Read Transactions 0 0 0
3 dram_write_transactions Device Memory Write Transactions 32 40 36
1 global_hit_rate Global Hit Rate in unified l1/tex 0.00% 0.00% 0.00%
1 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
2640 gld_requested_throughput Requested Global Load Throughput 423.86MB/s 5.9027GB/s 3.9078GB/s
2640 gst_requested_throughput Requested Global Store Throughput 211.93MB/s 2.9520GB/s 1.9715GB/s
1 gld_throughput Global Load Throughput 1.3629GB/s 1.3629GB/s 1.3629GB/s
3 gst_throughput Global Store Throughput 232.60MB/s 234.51MB/s 233.24MB/s
2 local_memory_overhead Local Memory Overhead 0.00% 0.00% 0.00%
2 tex_cache_hit_rate Unified Cache Hit Rate 0.00% 0.00% 0.00%
3 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 100.00% 100.00% 100.00%
2 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 93.33% 100.00% 96.67%
3 dram_read_throughput Device Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
3 dram_write_throughput Device Memory Write Throughput 248.11MB/s 312.68MB/s 279.88MB/s
3 tex_cache_throughput Unified cache to Multiprocessor throughput 4.5430GB/s 4.5803GB/s 4.5554GB/s
3 l2_tex_read_throughput L2 Throughput (Texture Reads) 1.0306GB/s 1.3629GB/s 1.2527GB/s
2 l2_tex_write_throughput L2 Throughput (Texture Writes) 232.60MB/s 234.51MB/s 233.55MB/s
2 l2_read_throughput L2 Throughput (Reads) 1.7634GB/s 4.2705GB/s 3.0220GB/s
1 l2_write_throughput L2 Throughput (Writes) 387.67MB/s 387.67MB/s 387.67MB/s
2 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 sysmem_write_throughput System Memory Write Throughput 38.767MB/s 38.767MB/s 38.767MB/s
1 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
2 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
2 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
2 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 gld_efficiency Global Memory Load Efficiency 33.33% 33.33% 33.33%
3 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
3 tex_cache_transactions Unified cache to Multiprocessor transactions 150 150 150
2640 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
2640 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
2640 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
2640 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
2640 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
2640 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
2640 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 0 0 0
2640 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
2640 flop_count_sp_special Floating Point Operations(Single Precision Special) 0 0 0
2 inst_executed Instructions Executed 1890 8220 5055
2 inst_issued Instructions Issued 2070 2070 2070
3 dram_utilization Device Memory Utilization Low (1) Low (1) Low (1)
1 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
2 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 3.17% 17.88% 10.53%
2 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 14.07% 16.96% 15.52%
2 stall_memory_dependency Issue Stall Reasons (Data Request) 35.32% 41.15% 38.23%
2 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
2 stall_sync Issue Stall Reasons (Synchronization) 0.00% 0.00% 0.00%
2 stall_other Issue Stall Reasons (Other) 0.21% 0.27% 0.24%
2 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 31.85% 37.70% 34.77%
2 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 0.00% 0.00% 0.00%
2 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
2640 inst_fp_32 FP Instructions(Single) 0 0 0
2640 inst_fp_64 FP Instructions(Double) 0 0 0
2640 inst_integer Integer Instructions 7560 113400 70690
2640 inst_bit_convert Bit-Convert Instructions 0 0 0
2640 inst_control Control-Flow Instructions 480 7200 4581
2640 inst_compute_ld_st Load/Store Instructions 600 9000 5727
2640 inst_misc Misc Instructions 600 9000 5727
2640 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
2 issue_slots Issue Slots 2070 2070 2070
1 cf_issued Issued Control-Flow Instructions 150 150 150
1 cf_executed Executed Control-Flow Instructions 150 150 150
2 ldst_issued Issued Load/Store Instructions 210 210 210
2 ldst_executed Executed Load/Store Instructions 210 210 210
2 atomic_transactions Atomic Transactions 0 0 0
2 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
2 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
2 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
3 l2_tex_read_transactions L2 Transactions (Texture Reads) 135 180 165
2 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 0.67% 0.74% 0.71%
2 stall_not_selected Issue Stall Reasons (Not Selected) 0.00% 0.00% 0.00%
2 l2_tex_write_transactions L2 Transactions (Texture Writes) 30 30 30
2640 nvlink_total_nratom_data_transmitted NVLink Total Nratom Data Transmitted 0 0 0
2640 nvlink_user_nratom_data_transmitted NVLink User Nratom Data Transmitted 0 0 0
2640 nvlink_total_ratom_data_transmitted NVLink Total Ratom Data Transmitted 0 0 0
2640 nvlink_user_ratom_data_transmitted NVLink User Ratom Data Transmitted 0 0 0
2640 nvlink_total_write_data_transmitted NVLink Total Write Data Transmitted 384 4992 413
2640 nvlink_user_write_data_transmitted NVLink User Write Data Transmitted 128 1664 137
2640 nvlink_total_response_data_received NVLink Total Response Data Received 1856 3376 1893
2640 nvlink_user_response_data_received NVLink User Response Data Received 1280 2080 1302
2640 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
2640 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
2640 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
2640 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
2640 inst_fp_16 HP Instructions(Half) 0 0 0
2 ipc Executed IPC 0.040125 0.062150 0.051138
2 issued_ipc Issued IPC 0.043946 0.052494 0.048220
2 issue_slot_utilization Issue Slot Utilization 1.10% 1.31% 1.21%
2 sm_efficiency Multiprocessor Activity 8.98% 16.06% 12.52%
2 achieved_occupancy Achieved Occupancy 0.015624 0.015625 0.015624
2 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.043946 0.052494 0.048220
2 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
1 l2_utilization L2 Cache Utilization Low (1) Low (1) Low (1)
3 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
2 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
1 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
2 tex_fu_utilization Texture Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 special_fu_utilization Special Function Unit Utilization Idle (0) Idle (0) Idle (0)
3 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (1) Low (1) Low (1)
1 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
3 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
3 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
3 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
2 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
1 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
2 stall_sleeping Issue Stall Reasons (Sleeping) 0.00% 0.00% 0.00%
2 inst_executed_global_loads Warp level instructions for global loads 120 120 120
2 inst_executed_local_loads Warp level instructions for local loads 0 0 0
2 inst_executed_shared_loads Warp level instructions for shared loads 0 0 0
3 inst_executed_surface_loads Warp level instructions for surface loads 0 0 0
3 inst_executed_global_stores Warp level instructions for global stores 30 30 30
3 inst_executed_local_stores Warp level instructions for local stores 0 0 0
2 inst_executed_shared_stores Warp level instructions for shared stores 0 0 0
3 inst_executed_surface_stores Warp level instructions for surface stores 0 0 0
3 inst_executed_global_atomics Warp level instructions for global atom and atom cas 0 0 0
3 inst_executed_global_reductions Warp level instructions for global reductions 0 0 0
3 inst_executed_surface_atomics Warp level instructions for surface atom and atom cas 0 0 0
3 inst_executed_surface_reductions Warp level instructions for surface reductions 0 0 0
2 inst_executed_shared_atomics Warp level shared instructions for atom and atom CAS 0 0 0
3 inst_executed_tex_ops Warp level instructions for texture 0 0 0
3 dram_read_bytes Total bytes read from DRAM to L2 cache 0 0 0
3 dram_write_bytes Total bytes written from L2 cache to DRAM 1024 1280 1152
2 global_load_requests Total number of global load requests from Multiprocessor 120 120 120
2 local_load_requests Total number of local load requests from Multiprocessor 0 0 0
5 surface_load_requests Total number of surface load requests from Multiprocessor 0 0 0
3 global_store_requests Total number of global store requests from Multiprocessor 30 30 30
3 local_store_requests Total number of local store requests from Multiprocessor 0 0 0
3 surface_store_requests Total number of surface store requests from Multiprocessor 0 0 0
2 global_atomic_requests Total number of global atomic requests from Multiprocessor 0 0 0
2 global_reduction_requests Total number of global reduction requests from Multiprocessor 0 0 0
5 surface_atomic_requests Total number of surface atomic requests from Multiprocessor 0 0 0
5 surface_reduction_requests Total number of surface reduction requests from Multiprocessor 0 0 0
2 l2_global_load_bytes Bytes read from L2 for misses in L1 for global loads 4320 5760 5040
1 l2_local_load_bytes Bytes read from L2 for misses in L1 for local loads 0 0 0
5 l2_surface_load_bytes Bytes read from L2 for misses in L1 for surface loads 0 0 0
5 l2_global_atomic_store_bytes Bytes written to L2 from L1 for global atomics 0 0 0
3 l2_local_global_store_bytes Bytes written to L2 from L1 for local and global stores. 960 960 960
5 l2_surface_store_bytes Bytes read from L2 for misses in L1 for surface stores 0 0 0
2 sysmem_read_bytes System Memory Read Bytes 0 0 0
1 sysmem_write_bytes System Memory Write Bytes 160 160 160
2 l2_tex_hit_rate L2 Cache Hit Rate 99.05% 100.00% 99.52%
2 texture_load_requests Total number of texture Load requests from Multiprocessor 0 0 0
1 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
Edit: full profile on porter: profile.nvvp.gz