The initialization time is too long during mnist test

Open Natelu opened this issue 3 years ago • 0 comments

Initialization takes too long in my mnist training session: about 5 minutes pass between the `Creating TensorFlow device` log lines and the first training step.

2022-11-23 08:15:22.173334: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2022-11-23 08:15:22.173363: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1
2022-11-23 08:15:22.173375: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y Y
2022-11-23 08:15:22.173384: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   Y Y
2022-11-23 08:15:22.173402: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
2022-11-23 08:15:22.173450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla V100-PCIE-16GB, pci bus id: 0000:44:00.0, compute capability: 7.0)
[------------- about 2 min elapse here ---------------------]
Initialized!
[------------- about 3 min elapse here ---------------------]
Step 0 (epoch 0.00), 2118.7 ms
Minibatch loss: 8.334, learning rate: 0.010000
Minibatch error: 85.9%
Validation error: 84.5%
duration between initialized and running is %d s 210.556521893
duration between initialized and running is %d s 210.559849024
duration between initialized and running is %d s 210.563081026
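To narrow down where the time goes, each phase can be timed separately. The following is only a minimal sketch using the Python standard library: the `timed` helper, the `durations` dict, and the phase names are hypothetical and not part of the original test script, and the `time.sleep` calls stand in for the real work (session creation with variable initialization, and the first training step). The literal `%d s` in the log above suggests the format string was printed without being applied, so the sketch formats the value explicitly.

```python
import time
from contextlib import contextmanager

# Hypothetical helper for illustration; not taken from the mnist test script.
durations = {}

@contextmanager
def timed(phase):
    """Record the wall-clock duration of the enclosed block under `phase`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations[phase] = time.perf_counter() - start
        # Apply the format string with % so the number is substituted.
        print("%s took %.3f s" % (phase, durations[phase]))

# Placeholders: replace the sleeps with session creation / initialization
# and with the first training step, respectively.
with timed("device creation -> Initialized!"):
    time.sleep(0.01)

with timed("Initialized! -> Step 0"):
    time.sleep(0.01)
```

Comparing the two recorded durations with and without the `tencent.com/vcuda-*` resource limits would show whether the delay is tied to the vcuda setup or to TensorFlow startup itself.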

Base environment

Device: Tesla V100-PCIE-16GB
Driver Version: 470.141.03
CUDA Version: 11.4

System ENV

KUBE: v1.23.10
RUNC: 1.1.1
Containerd: v1.6.4
OS Kernel: Linux 3.10.0-1160.el7.x86_64
OS version: CentOS Linux 7 (Core)
CPU: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz

Pod Resource:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: vcuda-test
    qcloud-app: vcuda-test
  name: vcuda-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: vcuda-test
  template:
    metadata:
      labels:
        k8s-app: vcuda-test
        qcloud-app: vcuda-test
    spec:
      containers:
      - command:
        - sleep
        - 360000s
        env:
        - name: PATH
          value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
        image: <internal-repository>/tensorflow-gputest:0.2
        imagePullPolicy: IfNotPresent
        name: tensorflow-test
        resources:
          limits:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "200"
            tencent.com/vcuda-memory: "30"
          requests:
            cpu: "4"
            memory: 8Gi
            tencent.com/vcuda-core: "200"
            tencent.com/vcuda-memory: "30"

Nov 23 '22 09:11 Natelu
