Kai Zhang

Results: 7 issues by Kai Zhang

Besides submitting training jobs, users also want to create a development box containing Jupyter, math libraries, and frameworks, in order to develop and debug algorithms before starting to train in...

kind/feature
lifecycle/stale

It would be helpful for data scientists to use a command like "arena create data imagenet-full" to create, index, and manage different training datasets for different training jobs. Then when use...

kind/feature

CPU and memory resource info is useful for both non-GPU and GPU jobs. It should be included in "arena top ".

kind/feature
lifecycle/stale

More and more use cases require specifying customized labels, tolerations, securityContext, priorityClass, etc. on a job's underlying Pods. arena should provide a unified mechanism to meet these customization requirements.

kind/feature
lifecycle/stale

"arena get jobname -e" only shows the events of the chief worker pod, while some meaningful job-level events should also be shown, e.g. when ResourceQuotas are enabled, if job...

kind/feature
lifecycle/stale

Supporting lookup of the command details of previous jobs would help diagnose client problems and serve as a reference for re-runs.

kind/feature
lifecycle/stale

The arena version command should include more information: the versions of the charts, the apiVersion of tfjob, mpijob, etc., and the release versions of the corresponding operators deployed in the cluster. It's helpful to quickly identify...

kind/feature
lifecycle/stale