gpu-manager
Initialization takes too long during the MNIST test
In my MNIST training session, the time from "Creating TensorFlow device" to the first training step is far too long (about 5 minutes before the task is actually running).
2022-11-23 08:15:22.173334: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2022-11-23 08:15:22.173363: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1
2022-11-23 08:15:22.173375: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0: Y Y
2022-11-23 08:15:22.173384: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1: Y Y
2022-11-23 08:15:22.173402: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
2022-11-23 08:15:22.173450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla V100-PCIE-16GB, pci bus id: 0000:44:00.0, compute capability: 7.0)
[------------- ABOUT 2 MINUTES PASS HERE ---------------------]
Initialized!
[------------- ABOUT 3 MINUTES PASS HERE ---------------------]
Step 0 (epoch 0.00), 2118.7 ms
Minibatch loss: 8.334, learning rate: 0.010000
Minibatch error: 85.9%
Validation error: 84.5%
duration between initialized and running is %d s 210.556521893
duration between initialized and running is %d s 210.559849024
duration between initialized and running is %d s 210.563081026
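To narrow down where the time goes, each phase can be timed separately. Below is a minimal sketch of how I could instrument the script; the helper names `phase` and `report` are hypothetical, and the `time.sleep` calls are placeholders standing in for the real session creation and first training step. It assumes Python 3 (`time.perf_counter`).

```python
import time
from contextlib import contextmanager


@contextmanager
def phase(name, results):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[name] = time.perf_counter() - start


def report(results):
    """Format each recorded phase as '<name>: <seconds> s'."""
    return ["%s: %.3f s" % (name, secs) for name, secs in results.items()]


if __name__ == "__main__":
    timings = {}
    with phase("create_session", timings):
        time.sleep(0.01)  # placeholder: session/device creation goes here
    with phase("first_step", timings):
        time.sleep(0.01)  # placeholder: the first training step goes here
    print("\n".join(report(timings)))
```

Running this inside and outside the vcuda container should show which phase accounts for the extra minutes.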
Base environment
Device: Tesla V100-PCIE-16GB
Driver Version: 470.141.03
CUDA Version: 11.4
System ENV
KUBE: v1.23.10
RUNC: 1.1.1
Containerd: v1.6.4
OS Kernel: Linux 3.10.0-1160.el7.x86_64
OS version: CentOS Linux 7 (Core)
CPU: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
Pod Resource:
kind: Deployment
metadata:
  labels:
    k8s-app: vcuda-test
    qcloud-app: vcuda-test
  name: vcuda-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: vcuda-test
  template:
    metadata:
      labels:
        k8s-app: vcuda-test
        qcloud-app: vcuda-test
    spec:
      containers:
      - command:
        - sleep
        - 360000s
        env:
        - name: PATH
          value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
        image: <internal-repository>/tensorflow-gputest:0.2
        imagePullPolicy: IfNotPresent
        name: tensorflow-test
        resources:
          limits:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "200"
            tencent.com/vcuda-memory: "30"
          requests:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "200"
            tencent.com/vcuda-memory: "30"
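
For reference, this is how I read the vcuda values in the manifest above. It is only a sketch: it assumes that 100 `vcuda-core` units correspond to one full GPU and that one `vcuda-memory` unit is 256 MiB, which is my understanding of the gpu-manager README and may not hold for every version. The function name `describe_vcuda` is hypothetical.

```python
CORE_UNITS_PER_GPU = 100  # assumption: 100 vcuda-core units = one full GPU
MEMORY_UNIT_MIB = 256     # assumption: one vcuda-memory unit = 256 MiB


def describe_vcuda(core_units, memory_units):
    """Convert vcuda resource units into (GPU count, memory in GiB)."""
    gpus = core_units / CORE_UNITS_PER_GPU
    memory_gib = memory_units * MEMORY_UNIT_MIB / 1024
    return gpus, memory_gib


if __name__ == "__main__":
    # Values taken from the Deployment above.
    print(describe_vcuda(200, 30))
```

Under those assumptions the pod requests two GPUs' worth of compute, which matches the two Tesla V100 devices in the log.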