radeon_compute_profiler

Missing basic counters from full list of performance counters

Status: Open. daviddpruitt opened this issue 6 years ago · 0 comments

I'm running the current version of RCP (5.6) on a Radeon VII. When I ask for the list of available performance counters, the list is incomplete: it contains only derived counters. The basic counters are nowhere to be found, even though the derived counters are clearly computed from them. However, when I ask rocprofiler (also the current version), which I understand RCP is based on, for its list of metrics, the basic counters are all there.

rcprof -l
OpenCL performance counters:
The list of valid counters for Graphics IP v6 based graphics cards:
Wavefronts, VALUInsts, SALUInsts, VFetchInsts, SFetchInsts,
VWriteInsts, LDSInsts, GDSInsts, VALUUtilization, VALUBusy,
SALUBusy, FetchSize, WriteSize, CacheHit, MemUnitBusy,
MemUnitStalled, WriteUnitStalled, LDSBankConflict

...

HSA performance counters:
The list of valid counters for Graphics IP v8 based graphics cards:
Wavefronts, VALUInsts, SALUInsts, VFetchInsts, SFetchInsts,
VWriteInsts, FlatVMemInsts, LDSInsts, FlatLDSInsts, GDSInsts,
VALUUtilization, VALUBusy, SALUBusy, FetchSize, WriteSize,
CacheHit, MemUnitBusy, MemUnitStalled, WriteUnitStalled, LDSBankConflict


The list of valid counters for Vega based graphics cards:
Wavefronts, VALUInsts, SALUInsts, VFetchInsts, SFetchInsts,
VWriteInsts, FlatVMemInsts, LDSInsts, FlatLDSInsts, GDSInsts,
VALUUtilization, VALUBusy, SALUBusy, FetchSize, WriteSize,
L2CacheHit, MemUnitBusy, MemUnitStalled, WriteUnitStalled, LDSBankConflict
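To check the report against this output, the comma-separated lists printed by `rcprof -l` can be split into one name per line and searched for a basic counter such as `SQ_WAVES`. This is a minimal sketch: the helper name `rcprof_counter_names` is hypothetical, and it assumes the layout shown above, where counter names are single words separated by commas and header lines contain spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each counter name from `rcprof -l` output once,
# one per line. Counter names are single words separated by commas; header
# lines and "..." contain spaces or dots, so they fail the name pattern and
# are dropped. LC_ALL=C keeps the sort order locale-independent.
rcprof_counter_names() {
  tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -E '^[A-Za-z0-9_]+$' \
    | LC_ALL=C sort -u
}
```

For example, `rcprof -l | rcprof_counter_names | grep -x SQ_WAVES` prints nothing when that counter is absent from RCP's list.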

rpl_run.sh --list-basic
RPL: on '190801_110408' from '/home/ddpruitt/rocm' in '/home/ddpruitt/HIP/samples/0_Intro/square'
ROCProfiler: rc-file '/home/ddpruitt/rpl_rc.xml'
Basic HW counters:

  gpu-agent0 : GRBM_COUNT : Tie High - Count Number of Clocks
      block GRBM has 2 counters

  gpu-agent0 : GRBM_GUI_ACTIVE : The GUI is Active
      block GRBM has 2 counters

  gpu-agent0 : SQ_WAVES : Count number of waves sent to SQs. (per-simd, emulated, global)
      block SQ has 8 counters

  gpu-agent0 : SQ_INSTS_VALU : Number of VALU instructions issued. (per-simd, emulated)
      block SQ has 8 counters

  gpu-agent0 : SQ_INSTS_VMEM_WR : Number of VMEM write instructions issued (including FLAT). (per-simd, emulated)
      block SQ has 8 counters

  gpu-agent0 : SQ_INSTS_VMEM_RD : Number of VMEM read instructions issued (including FLAT). (per-simd, emulated)
      block SQ has 8 counters

  gpu-agent0 : SQ_INSTS_SALU : Number of SALU instructions issued. (per-simd, emulated)
      block SQ has 8 counters

  gpu-agent0 : SQ_INSTS_SMEM : Number of SMEM instructions issued. (per-simd, emulated)
      block SQ has 8 counters

  gpu-agent0 : SQ_INSTS_FLAT : Number of FLAT instructions issued. (per-simd, emulated)
      block SQ has 8 counters

...
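The `rpl_run.sh` listings print one entry per line in the form `gpu-agent0 : NAME : description`, so the counter names can be pulled out with a short filter, for example to confirm that rocprofiler exposes basic counters such as `SQ_WAVES`. This is a sketch only: the function name `rpl_counter_names` is hypothetical, and it assumes the ` : ` separator seen in this output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the NAME field of every "gpu-agentN : NAME : ..."
# line. Indented detail lines such as "block SQ has 8 counters" and the
# RPL/ROCProfiler banner lines contain no " : " separator after a
# gpu-agent prefix, so they are skipped.
rpl_counter_names() {
  awk -F ' : ' '$1 ~ /^[[:space:]]*gpu-agent[0-9]+$/ { print $2 }'
}
```

For example, `rpl_run.sh --list-basic | rpl_counter_names | grep -x SQ_WAVES` should print `SQ_WAVES` on this system, given the listing above. The same filter also applies to the `--list-derived` output, which uses the same layout.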

rpl_run.sh --list-derived
RPL: on '190801_110411' from '/home/ddpruitt/rocm' in '/home/ddpruitt/HIP/samples/0_Intro/square'
ROCProfiler: rc-file '/home/ddpruitt/rpl_rc.xml'
Derived metrics:

  gpu-agent0 : TA_BUSY_avr : TA block is busy. Average over TA instances.
      TA_BUSY_avr = avr(TA_TA_BUSY,16)

  gpu-agent0 : TA_BUSY_max : TA block is busy. Max over TA instances.
      TA_BUSY_max = max(TA_TA_BUSY,16)

  gpu-agent0 : TA_BUSY_min : TA block is busy. Min over TA instances.
      TA_BUSY_min = min(TA_TA_BUSY,16)

  gpu-agent0 : TA_FLAT_READ_WAVEFRONTS_sum : Number of flat opcode reads processed by the TA. Sum over TA instances.
      TA_FLAT_READ_WAVEFRONTS_sum = sum(TA_FLAT_READ_WAVEFRONTS,16)

  gpu-agent0 : TA_FLAT_WRITE_WAVEFRONTS_sum : Number of flat opcode writes processed by the TA. Sum over TA instances.
      TA_FLAT_WRITE_WAVEFRONTS_sum = sum(TA_FLAT_WRITE_WAVEFRONTS,16)

  gpu-agent0 : TCC_HIT_sum : Number of cache hits. Sum over TCC instances.
      TCC_HIT_sum = sum(TCC_HIT,16)

  gpu-agent0 : TCC_MISS_sum : Number of cache misses. Sum over TCC instances.
      TCC_MISS_sum = sum(TCC_MISS,16)

  gpu-agent0 : TCC_EA_RDREQ_32B_sum : Number of 32-byte TCC/EA read requests. Sum over TCC instances.
      TCC_EA_RDREQ_32B_sum = sum(TCC_EA_RDREQ_32B,16)

  gpu-agent0 : TCC_EA_RDREQ_sum : Number of TCC/EA read requests (either 32-byte or 64-byte). Sum over TCC instances.
      TCC_EA_RDREQ_sum = sum(TCC_EA_RDREQ,16)

  gpu-agent0 : TCC_EA_WRREQ_sum : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Sum over TCC instances.
      TCC_EA_WRREQ_sum = sum(TCC_EA_WRREQ,16)
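Each derived metric in this listing is an aggregate of a single basic counter over its block instances, e.g. `TCC_HIT_sum = sum(TCC_HIT,16)`, so computing it requires reading that basic counter. The sketch below parses such a formula line and applies the aggregate to per-instance values. It is illustrative only: the helper names `parse_formula` and `aggregate` are hypothetical, and it assumes that `avr` means the arithmetic mean and that the second argument (16 here) is the number of instances; neither assumption is taken from rocprofiler's source.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split a formula line such as
#   "TCC_HIT_sum = sum(TCC_HIT,16)"
# into "<op> <basic counter> <instances>", i.e. "sum TCC_HIT 16".
parse_formula() {
  sed -nE 's/^[[:space:]]*[A-Za-z0-9_]+ = ([a-z]+)\(([A-Za-z0-9_]+),([0-9]+)\)[[:space:]]*$/\1 \2 \3/p'
}

# Hypothetical helper: aggregate N per-instance values (one per line on stdin)
# with OP = sum | avr | max | min.
# ASSUMPTION: avr is the arithmetic mean over the N instances.
aggregate() {
  awk -v op="$1" -v n="$2" '
    { v = $1 + 0; s += v }              # force numeric values
    NR == 1 || v < mn { mn = v }        # running minimum
    NR == 1 || v > mx { mx = v }        # running maximum
    END {
      if (NR != n) { print "expected " n " values, got " NR > "/dev/stderr"; exit 1 }
      if (op == "sum") print s
      else if (op == "avr") print s / n
      else if (op == "max") print mx
      else if (op == "min") print mn
      else { print "unknown op: " op > "/dev/stderr"; exit 1 }
    }'
}
```

For example, `parse_formula` turns `TA_BUSY_avr = avr(TA_TA_BUSY,16)` into `avr TA_TA_BUSY 16`, and `seq 1 16 | aggregate avr 16` prints `8.5`. Real per-instance values would have to come from a profiling run, which is not shown here.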

daviddpruitt, Aug 01 '19 15:08