Enrich cluster GPU utilization report

Open suiguoxin opened this issue 4 years ago • 0 comments

Cluster Utilization in One Week

Cluster Level

| | GPU*Days Used | GPU*Days Provided | GPU*Days Capacity | Average Number of GPU Cards Provided | Max Number of GPU Cards | GPU Card Occupied Rate | GPU Utilization |
|---|---|---|---|---|---|---|---|
| Overall | 1000 | 1200 | 1500 | 150 | 200 | 70% | 50% |
| VC0 (V100) | 200 | 250 | 250 | 25 | 20 | 80% | 60% |
| VC1 (K80) | 800 | 1250 | 1250 | 100 | 100 | 60% | 40% |

User Level

| | Number of Users | Number of Active Users |
|---|---|---|
| Overall | 50 | 15 |
| VC0 (V100) | 20 | 10 |
| VC1 (K80) | 40 | 10 |

Top 10 Users (Ordered by Resources Used)

| User name | Resources Used (GPU*hour) | GPU Utilization | Number of Submitted Jobs | Number of Jobs Run |
|---|---|---|---|---|
| user0 | 3000 | 40% | 5 | 10 |
| user1 | 2000 | 50% | 20 | 50 |

Job Level

| | Number of Jobs Run | Number of Submitted Jobs | Retried Number / Rate | Succeeded Number / Rate | Failed Number / Rate | Stopped Number / Rate | Failure Exit Code | Average Waiting Time | Max Waiting Time | Average Running Time | Max Running Time | Long-Running Rate (> 48 hours) | Short-Running Rate (< 30 minutes) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Overall | 300 | 200 | 60 / 20% | 60 / 20% | 60 / 20% | 180 / 60% | -210(10), 220(5), 404(3) | 54 minutes | 80 minutes | 35 minutes | 5 hours | 30% | 30% |
| VC0 (V100) | 100 | 100 | 20 / 20% | 20 / 20% | 20 / 20% | 60 / 60% | -210(5), 220(5), 404(3) | 2 hours 15 minutes | 80 minutes | 3 hours 3 minutes | 5 hours | 70% | 10% |
| VC1 (K80) | 200 | 100 | 40 / 20% | 40 / 20% | 40 / 20% | 120 / 60% | -210(5) | 3 minutes | 5 minutes | 3 minutes | 30 minutes | 10% | 70% |

Notes:

  • Retried / succeeded / failed / stopped rates are calculated over completed jobs only
  • Succeeded rate + failed rate + stopped rate = 1
  • Waiting time of running jobs = current time - job submitted time
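
The notes above could be computed roughly as in the following sketch. It is only an illustration: the job record layout (`status`, `retries`) and the function names are hypothetical and are not taken from the existing report code.

```python
from datetime import datetime, timedelta

# Terminal states counted as "completed" in the notes above.
COMPLETED = ("SUCCEEDED", "FAILED", "STOPPED")


def completion_rates(jobs):
    """Return {label: (count, rate)} computed over completed jobs only.

    `jobs` is a list of dicts with hypothetical keys:
      "status"  - job state string, e.g. "SUCCEEDED" or "RUNNING"
      "retries" - number of retries (optional, defaults to 0)
    """
    completed = [j for j in jobs if j["status"] in COMPLETED]
    total = len(completed)

    def rate(count):
        # Avoid division by zero when no job has completed yet.
        return count / total if total else 0.0

    result = {}
    for status in COMPLETED:
        count = sum(1 for j in completed if j["status"] == status)
        result[status] = (count, rate(count))

    # Retried rate uses the same denominator (completed jobs).
    retried = sum(1 for j in completed if j.get("retries", 0) > 0)
    result["RETRIED"] = (retried, rate(retried))
    return result


def waiting_time_so_far(submitted: datetime, now: datetime) -> timedelta:
    """Waiting time as defined in the note: current time - submitted time."""
    return now - submitted
```

Because every completed job is in exactly one of the three terminal states, the succeeded, failed, and stopped rates sum to 1 whenever at least one job has completed. The retried rate is independent of that sum, since a retried job also ends in one of the three states.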

Top 10 Jobs (Ordered by Resources Used)

| Job name | Resources Used (GPU*hour) | GPU Number | VC | Job Duration | Job Start Time | Job Status |
|---|---|---|---|---|---|---|
| job0 | 3000 | 40 | VC1 (K80) | 1 day 10.7 hours | 21-03-26 03:14:19 | RUNNING |
| job1 | 2000 | 12 | VC0 (V100) | 23.5 hours | 21-03-27 03:14:19 | SUCCEEDED |

suiguoxin, Mar 30 '21 08:03