Guoxin

Results 8 issues of Guoxin

bug
nnidev
v2.9.1

## Design
API: GET api/v2/alerts
Response: 200
``` json
[ { "labels": { "alertname": "NodeFilesystemUsage", "severity": "warn" ... }, "annotations": { "summary": "Free space in /dev/sdc1 from 10.151.40.40:9100 is less...
```

pai-dev

## Motivation
Some jobs may fail unexpectedly. If users are notified when their jobs fail, they can handle the issue in time. This will...

pai-dev

### Motivation
Support other OIDC modes, such as LDAP

### Implementation
#### RestServer
- Write a new controller like AzureAD/msgraph; refer to https://github.com/microsoft/pai/blob/master/src/rest-server/src/controllers/v2/azureAD.js
- Add the corresponding router: https://github.com/microsoft/pai/blob/accc5fda3d19df0ff7eba2411e9937d8ff6ed4b3/src/rest-server/src/routes/authn.js#L47
-...

pai-dev

Cluster Utilization in One Week
### Cluster Level
| | GPU*Days Used | GPU*Days Provided | GPU*Days Capability | Average Number of GPU Cards Provided | Max Number of GPU Cards...

pai-dev

## Motivation
Jobs fail at different stages. Users have to retrain their models from scratch if they fail to save the checkpoints or result files properly. If they...

pai-dev