The initialization time is too long during the MNIST test

Open Natelu opened this issue 3 years ago • 0 comments

In my MNIST training session, initialization takes too long: about 5 minutes pass between the "Creating TensorFlow device" log lines and the first training step actually running.

2022-11-23 08:15:22.173334: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2022-11-23 08:15:22.173363: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1
2022-11-23 08:15:22.173375: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y Y
2022-11-23 08:15:22.173384: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   Y Y
2022-11-23 08:15:22.173402: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
2022-11-23 08:15:22.173450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla V100-PCIE-16GB, pci bus id: 0000:44:00.0, compute capability: 7.0)
[------------- TOOK ABOUT 2 MINUTES ---------------------]
Initialized!
[------------- TOOK ABOUT 3 MINUTES ---------------------]
Step 0 (epoch 0.00), 2118.7 ms
Minibatch loss: 8.334, learning rate: 0.010000
Minibatch error: 85.9%
Validation error: 84.5%
duration between initialized and running is %d s 210.556521893
duration between initialized and running is %d s 210.559849024
duration between initialized and running is %d s 210.563081026
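To narrow down where the time goes, each phase can be timed separately instead of eyeballing the gaps in the log. The following is only a minimal sketch: the `phase` helper, the `format_duration` helper, and the two `time.sleep` calls are hypothetical stand-ins for the real session creation and first training step, not code from the test image. It also formats the duration with `%.3f`; the literal `%d` in the log lines above suggests the value was printed next to the format string rather than substituted into it, though that is a guess.

```python
import time
from contextlib import contextmanager

# Maps a phase name to its measured wall-clock duration in seconds.
timings = {}


@contextmanager
def phase(name):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start


def format_duration(name, seconds):
    """Substitute the value into the format string instead of printing it alongside."""
    return "duration of %s is %.3f s" % (name, seconds)


# Hypothetical stand-ins: replace the sleeps with the real calls being measured.
with phase("device creation -> Initialized!"):
    time.sleep(0.01)  # placeholder for session creation and variable initialization

with phase("Initialized! -> Step 0"):
    time.sleep(0.02)  # placeholder for the first training step

for name, seconds in timings.items():
    print(format_duration(name, seconds))
```

Running the same measurement once with and once without the vcuda resource limits should show whether the delay is tied to the virtualized GPU setup or to the workload itself.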

Base environment

Device: Tesla V100-PCIE-16GB
Driver Version: 470.141.03
CUDA Version: 11.4

System ENV

KUBE: v1.23.10
RUNC: 1.1.1
Containerd: v1.6.4
OS Kernel: Linux 3.10.0-1160.el7.x86_64
OS version: CentOS Linux 7 (Core)
CPU: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz

Pod Resource:

kind: Deployment
metadata:
  labels:
    k8s-app: vcuda-test
    qcloud-app: vcuda-test
  name: vcuda-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: vcuda-test
  template:
    metadata:
      labels:
        k8s-app: vcuda-test
        qcloud-app: vcuda-test
    spec:
      containers:
      - command:
        - sleep
        - 360000s
        env:
        - name: PATH
          value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
        image: <internal-repository>/tensorflow-gputest:0.2
        imagePullPolicy: IfNotPresent
        name: tensorflow-test
        resources:
          limits:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "200"
            tencent.com/vcuda-memory: "30"
          requests:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "200"
            tencent.com/vcuda-memory: "30"
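For readers unfamiliar with the vcuda resource names, the request above can be translated into physical terms. This is a rough sketch that assumes the unit conventions as I recall them from the gpu-manager documentation (100 `vcuda-core` units per full GPU, 256 MiB per `vcuda-memory` unit); both constants and the helper name `describe_vcuda_request` are assumptions and should be checked against the deployed version.

```python
# Assumed unit conventions (verify against the gpu-manager docs for your version).
CORE_UNITS_PER_GPU = 100  # assumption: 100 vcuda-core units = one full GPU
MEMORY_UNIT_MIB = 256     # assumption: one vcuda-memory unit = 256 MiB


def describe_vcuda_request(vcuda_core, vcuda_memory):
    """Convert vcuda resource units into (GPU count, device memory in MiB)."""
    gpus = vcuda_core / CORE_UNITS_PER_GPU
    memory_mib = vcuda_memory * MEMORY_UNIT_MIB
    return gpus, memory_mib


# Values taken from the Deployment above.
gpus, memory_mib = describe_vcuda_request(200, 30)
print("%.1f GPU(s), %d MiB (%.2f GiB) of device memory"
      % (gpus, memory_mib, memory_mib / 1024))
```

Under those assumptions the pod asks for two full GPUs, which matches the two V100 devices in the log, and 7680 MiB of device memory.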

Natelu, Nov 23 '22 09:11