TGAN crashing at Epoch 1

Open nabarunaguha opened this issue 6 years ago • 3 comments

Hi, I have been facing this issue for some time and have not been able to fix it.

  • Python version: 3.7
  • Operating System: Linux
  • TensorFlow version: 1.14.0
  • CUDA version: 10.0

Description

I keep getting the warning shown below, and then execution crashes at Epoch 1. TGAN also seems to run on the CPU instead of the GPU.

What I Did

import tensorflow as tf
if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
    print("Please install GPU version of TF")

This shows that TensorFlow is using the GPU fine:

2019-10-03 13:11:01.720688: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-10-03 13:11:01.768834: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2596780000 Hz
2019-10-03 13:11:01.771431: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56157647a930 executing computations on platform Host. Devices:
2019-10-03 13:11:01.771460: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-10-03 13:11:01.772877: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-10-03 13:11:04.249822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:04:00.0
2019-10-03 13:11:04.250926: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:05:00.0
2019-10-03 13:11:04.251999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:09:00.0
2019-10-03 13:11:04.253103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:0a:00.0
2019-10-03 13:11:04.254193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 4 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:85:00.0
2019-10-03 13:11:04.255276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 5 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:86:00.0
2019-10-03 13:11:04.255566: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-03 13:11:04.256938: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-10-03 13:11:04.258142: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-10-03 13:11:04.258427: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-10-03 13:11:04.260019: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-10-03 13:11:04.261283: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-10-03 13:11:04.265096: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-03 13:11:04.277832: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2019-10-03 13:11:04.277873: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-03 13:11:04.284987: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-03 13:11:04.285005: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1 2 3 4 5
2019-10-03 13:11:04.285013: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N Y Y Y N N
2019-10-03 13:11:04.285018: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   Y N Y Y N N
2019-10-03 13:11:04.285023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2:   Y Y N Y N N
2019-10-03 13:11:04.285028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 3:   Y Y Y N N N
2019-10-03 13:11:04.285033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 4:   N N N N N Y
2019-10-03 13:11:04.285040: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 5:   N N N N Y N
2019-10-03 13:11:04.293727: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:0 with 7647 MB memory) -> physical GPU (device: 0, name: Tesla M60, pci bus id: 0000:04:00.0, compute capability: 5.2)
2019-10-03 13:11:04.296282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:1 with 7647 MB memory) -> physical GPU (device: 1, name: Tesla M60, pci bus id: 0000:05:00.0, compute capability: 5.2)
2019-10-03 13:11:04.298803: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:2 with 7647 MB memory) -> physical GPU (device: 2, name: Tesla M60, pci bus id: 0000:09:00.0, compute capability: 5.2)
2019-10-03 13:11:04.301310: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:3 with 7647 MB memory) -> physical GPU (device: 3, name: Tesla M60, pci bus id: 0000:0a:00.0, compute capability: 5.2)
2019-10-03 13:11:04.303979: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:4 with 7647 MB memory) -> physical GPU (device: 4, name: Tesla M60, pci bus id: 0000:85:00.0, compute capability: 5.2)
2019-10-03 13:11:04.306456: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:5 with 7647 MB memory) -> physical GPU (device: 5, name: Tesla M60, pci bus id: 0000:86:00.0, compute capability: 5.2)
2019-10-03 13:11:04.310204: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56157ab4cab0 executing computations on platform CUDA. Devices:
2019-10-03 13:11:04.310223: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla M60, Compute Capability 5.2
2019-10-03 13:11:04.310229: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (1): Tesla M60, Compute Capability 5.2
2019-10-03 13:11:04.310234: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (2): Tesla M60, Compute Capability 5.2
2019-10-03 13:11:04.310239: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (3): Tesla M60, Compute Capability 5.2
2019-10-03 13:11:04.310244: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (4): Tesla M60, Compute Capability 5.2
2019-10-03 13:11:04.310249: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (5): Tesla M60, Compute Capability 5.2
2019-10-03 13:11:04.314251: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:04:00.0
2019-10-03 13:11:04.315484: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:05:00.0
2019-10-03 13:11:04.316567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:09:00.0
2019-10-03 13:11:04.317632: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:0a:00.0
2019-10-03 13:11:04.318705: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 4 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:85:00.0
2019-10-03 13:11:04.319780: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 5 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:86:00.0
2019-10-03 13:11:04.319806: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-03 13:11:04.319820: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-10-03 13:11:04.319833: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-10-03 13:11:04.319846: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-10-03 13:11:04.319859: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-10-03 13:11:04.319872: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-10-03 13:11:04.319885: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-03 13:11:04.332488: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2019-10-03 13:11:04.332811: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-03 13:11:04.332823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1 2 3 4 5
2019-10-03 13:11:04.332830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N Y Y Y N N
2019-10-03 13:11:04.332835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   Y N Y Y N N
2019-10-03 13:11:04.332840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2:   Y Y N Y N N
2019-10-03 13:11:04.332845: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 3:   Y Y Y N N N
2019-10-03 13:11:04.332850: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 4:   N N N N N Y
2019-10-03 13:11:04.332856: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 5:   N N N N Y N
2019-10-03 13:11:04.340711: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:0 with 7647 MB memory) -> physical GPU (device: 0, name: Tesla M60, pci bus id: 0000:04:00.0, compute capability: 5.2)
2019-10-03 13:11:04.341796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:1 with 7647 MB memory) -> physical GPU (device: 1, name: Tesla M60, pci bus id: 0000:05:00.0, compute capability: 5.2)
2019-10-03 13:11:04.342889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:2 with 7647 MB memory) -> physical GPU (device: 2, name: Tesla M60, pci bus id: 0000:09:00.0, compute capability: 5.2)
2019-10-03 13:11:04.343989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:3 with 7647 MB memory) -> physical GPU (device: 3, name: Tesla M60, pci bus id: 0000:0a:00.0, compute capability: 5.2)
2019-10-03 13:11:04.345103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:4 with 7647 MB memory) -> physical GPU (device: 4, name: Tesla M60, pci bus id: 0000:85:00.0, compute capability: 5.2)
2019-10-03 13:11:04.346189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:5 with 7647 MB memory) -> physical GPU (device: 5, name: Tesla M60, pci bus id: 0000:86:00.0, compute capability: 5.2)
Default GPU Device: /device:GPU:0

I set the gpu argument of TGANModel to '/GPU:0' and also tried '/device:GPU:0'.

But I get the same warning, and the crash still happens while running the first epoch.

I also uninstalled and re-installed tensorflow-gpu and TGAN just to check, but to no avail.

Regards, Nabaruna

nabarunaguha, Oct 03 '19 11:10

Hi @nabarunaguha

Would you mind sharing a short code snippet that shows the exact arguments that you use when creating the TGAN instance and calling the fit and sample methods?

We will then try to reproduce the error to be able to assist you better.

Also, regarding the GPU usage, please check this other issue: https://github.com/DAI-Lab/TGAN/issues/34

So, basically, the gpu argument is now ignored, and all that matters with regard to GPU usage is whether you have installed tensorflow or tensorflow-gpu.
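
As a rough way to check which of the two packages is actually installed, something like the sketch below can help. It is not part of TGAN; the helper names `installed_tf_packages` and `tf_cuda_status` are made up for illustration, and the `importlib_metadata` fallback assumes that backport is installed on Python 3.7. If both packages are listed, they share the same `tensorflow` import name, so the build that gets imported may not be the one you expect.

```python
# Sketch only: report which TensorFlow distributions pip knows about.
try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    import importlib_metadata as metadata  # assumed backport for Python 3.7


def installed_tf_packages(names=("tensorflow", "tensorflow-gpu")):
    """Return {distribution name: version} for each listed package found."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in this environment; skip it.
            pass
    return found


def tf_cuda_status():
    """Return (built_with_cuda, gpu_available) for the imported TensorFlow.

    Uses the TF 1.x test helpers; TensorFlow is imported lazily so the
    package check above still works when TensorFlow is missing.
    """
    import tensorflow as tf

    return tf.test.is_built_with_cuda(), tf.test.is_gpu_available()
```

Calling `installed_tf_packages()` returns a dict such as one with a single `tensorflow-gpu` key when only the GPU build is installed; `tf_cuda_status()` then confirms whether the imported build was compiled with CUDA and can see a GPU.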

csala, Oct 04 '19 10:10

Hi @csala ,

Yeah sure, here are my arguments.

from tgan.model import TGANModel

tgan = TGANModel(
    continuous_columns,
    output='output',
    gpu='/device:GPU:0',
    max_epoch=5,
    steps_per_epoch=150,
    save_checkpoints=False,
    restore_session=False,
    batch_size=50,
    z_dim=50,
    noise=0.2,
    l2norm=0.00001,
    learning_rate=0.001,
    num_gen_rnn=100,
    num_gen_feature=100,
    num_dis_layers=1,
    num_dis_hidden=100,
    optimizer='AdamOptimizer'
)

tgan.fit(data)
model_path = '/home/naguha/ModelSave/ModelCheck.pkl'
num_samples = 20868
samples = tgan.sample(num_samples)
export_csv = samples.to_csv(r'/home/naguha/Samples_TGAN.csv', index=None, header=True)

And I installed tensorflow-gpu==1.14

nabarunaguha, Oct 04 '19 10:10

Hello, any news on this issue?

lablebi96, Apr 16 '21 09:04