radeon_compute_profiler
Missing basic counters from full list of performance counters
I'm running the current version of RCP (5.6) on a Radeon VII. When I ask for the list of available performance counters, it's incomplete: it only gives the derived counters. The basic counters are nowhere to be found, although they are clearly needed to compute the derived ones. However, when I ask rocprofiler (also the current version), which I understand is what RCP is based on, for a list of metrics, they are all there.
rcprof -l
OpenCL performance counters:
The list of valid counters for Graphics IP v6 based graphics cards:
Wavefronts, VALUInsts, SALUInsts, VFetchInsts, SFetchInsts,
VWriteInsts, LDSInsts, GDSInsts, VALUUtilization, VALUBusy,
SALUBusy, FetchSize, WriteSize, CacheHit, MemUnitBusy,
MemUnitStalled, WriteUnitStalled, LDSBankConflict
...
HSA performance counters:
The list of valid counters for Graphics IP v8 based graphics cards:
Wavefronts, VALUInsts, SALUInsts, VFetchInsts, SFetchInsts,
VWriteInsts, FlatVMemInsts, LDSInsts, FlatLDSInsts, GDSInsts,
VALUUtilization, VALUBusy, SALUBusy, FetchSize, WriteSize,
CacheHit, MemUnitBusy, MemUnitStalled, WriteUnitStalled, LDSBankConflict
The list of valid counters for Vega based graphics cards:
Wavefronts, VALUInsts, SALUInsts, VFetchInsts, SFetchInsts,
VWriteInsts, FlatVMemInsts, LDSInsts, FlatLDSInsts, GDSInsts,
VALUUtilization, VALUBusy, SALUBusy, FetchSize, WriteSize,
L2CacheHit, MemUnitBusy, MemUnitStalled, WriteUnitStalled, LDSBankConflict
rpl_run.sh --list-basic
RPL: on '190801_110408' from '/home/ddpruitt/rocm' in '/home/ddpruitt/HIP/samples/0_Intro/square'
ROCProfiler: rc-file '/home/ddpruitt/rpl_rc.xml'
Basic HW counters:
gpu-agent0 : GRBM_COUNT : Tie High - Count Number of Clocks
block GRBM has 2 counters
gpu-agent0 : GRBM_GUI_ACTIVE : The GUI is Active
block GRBM has 2 counters
gpu-agent0 : SQ_WAVES : Count number of waves sent to SQs. (per-simd, emulated, global)
block SQ has 8 counters
gpu-agent0 : SQ_INSTS_VALU : Number of VALU instructions issued. (per-simd, emulated)
block SQ has 8 counters
gpu-agent0 : SQ_INSTS_VMEM_WR : Number of VMEM write instructions issued (including FLAT). (per-simd, emulated)
block SQ has 8 counters
gpu-agent0 : SQ_INSTS_VMEM_RD : Number of VMEM read instructions issued (including FLAT). (per-simd, emulated)
block SQ has 8 counters
gpu-agent0 : SQ_INSTS_SALU : Number of SALU instructions issued. (per-simd, emulated)
block SQ has 8 counters
gpu-agent0 : SQ_INSTS_SMEM : Number of SMEM instructions issued. (per-simd, emulated)
block SQ has 8 counters
gpu-agent0 : SQ_INSTS_FLAT : Number of FLAT instructions issued. (per-simd, emulated)
block SQ has 8 counters
...
rpl_run.sh --list-derived
RPL: on '190801_110411' from '/home/ddpruitt/rocm' in '/home/ddpruitt/HIP/samples/0_Intro/square'
ROCProfiler: rc-file '/home/ddpruitt/rpl_rc.xml'
Derived metrics:
gpu-agent0 : TA_BUSY_avr : TA block is busy. Average over TA instances.
TA_BUSY_avr = avr(TA_TA_BUSY,16)
gpu-agent0 : TA_BUSY_max : TA block is busy. Max over TA instances.
TA_BUSY_max = max(TA_TA_BUSY,16)
gpu-agent0 : TA_BUSY_min : TA block is busy. Min over TA instances.
TA_BUSY_min = min(TA_TA_BUSY,16)
gpu-agent0 : TA_FLAT_READ_WAVEFRONTS_sum : Number of flat opcode reads processed by the TA. Sum over TA instances.
TA_FLAT_READ_WAVEFRONTS_sum = sum(TA_FLAT_READ_WAVEFRONTS,16)
gpu-agent0 : TA_FLAT_WRITE_WAVEFRONTS_sum : Number of flat opcode writes processed by the TA. Sum over TA instances.
TA_FLAT_WRITE_WAVEFRONTS_sum = sum(TA_FLAT_WRITE_WAVEFRONTS,16)
gpu-agent0 : TCC_HIT_sum : Number of cache hits. Sum over TCC instances.
TCC_HIT_sum = sum(TCC_HIT,16)
gpu-agent0 : TCC_MISS_sum : Number of cache misses. Sum over TCC instances.
TCC_MISS_sum = sum(TCC_MISS,16)
gpu-agent0 : TCC_EA_RDREQ_32B_sum : Number of 32-byte TCC/EA read requests. Sum over TCC instances.
TCC_EA_RDREQ_32B_sum = sum(TCC_EA_RDREQ_32B,16)
gpu-agent0 : TCC_EA_RDREQ_sum : Number of TCC/EA read requests (either 32-byte or 64-byte). Sum over TCC instances.
TCC_EA_RDREQ_sum = sum(TCC_EA_RDREQ,16)
gpu-agent0 : TCC_EA_WRREQ_sum : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Sum over TCC instances.
TCC_EA_WRREQ_sum = sum(TCC_EA_WRREQ,16)
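To illustrate why the basic counters matter, here is a minimal Python sketch that reads derived-metric expressions in the `NAME = op(COUNTER,N)` form shown in the `--list-derived` output and reports which basic counters each one depends on. The function names `parse_derived` and `evaluate` are hypothetical, not part of RCP or rocprofiler. The meaning of `avr`, `max`, `min`, and `sum` is assumed from the metric descriptions ("Average over TA instances", and so on), not taken from the rocprofiler source.

```python
import re
from statistics import mean

# Matches expression lines such as "TCC_HIT_sum = sum(TCC_HIT,16)".
# Description lines ("gpu-agent0 : ...") do not match and are skipped.
EXPR_RE = re.compile(r"^\s*(\w+)\s*=\s*(avr|max|min|sum)\((\w+),\s*(\d+)\)\s*$")

# Assumed semantics of each reduction, inferred from the metric descriptions.
REDUCERS = {"avr": mean, "max": max, "min": min, "sum": sum}


def parse_derived(text):
    """Return {derived_name: (op, basic_counter, instances)}."""
    metrics = {}
    for line in text.splitlines():
        m = EXPR_RE.match(line)
        if m:
            name, op, basic, n = m.groups()
            metrics[name] = (op, basic, int(n))
    return metrics


def evaluate(metric, per_instance_values):
    """Reduce one basic counter's per-instance values into a derived value."""
    op, basic, n = metric
    if len(per_instance_values) != n:
        raise ValueError(f"{basic}: expected {n} values, got {len(per_instance_values)}")
    return REDUCERS[op](per_instance_values)


# A few lines copied from the listing above.
listing = """\
gpu-agent0 : TA_BUSY_avr : TA block is busy. Average over TA instances.
TA_BUSY_avr = avr(TA_TA_BUSY,16)
gpu-agent0 : TCC_HIT_sum : Number of cache hits. Sum over TCC instances.
TCC_HIT_sum = sum(TCC_HIT,16)
"""

metrics = parse_derived(listing)
basic_needed = sorted({basic for _, basic, _ in metrics.values()})
print(basic_needed)
```

Every derived metric in the listing resolves to a basic counter (`TA_TA_BUSY`, `TCC_HIT`, ...), which is why I would expect RCP to be able to list those as well.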