
build_image_dataset.py crashes for delf training

Open avidullu opened this issue 3 years ago • 4 comments

Running instructions from https://github.com/tensorflow/models/blob/master/research/delf/delf/python/training/README.md#prepare-the-data-for-training without any GPUs on a Google Cloud VM encounters an error.

Below is the command with the error:

python3 build_image_dataset.py \
  --train_csv_path=$LANDMARK_DATA/train/train.csv \
  --train_clean_csv_path=$LANDMARK_DATA/train/train_clean.csv \
  --train_directory=$LANDMARK_DATA/train//// \
  --output_directory=$LANDMARK_DATA/tfrecord/ \
  --num_shards=128 \
  --generate_train_validation_splits \
  --validation_split_size=0.2 \
  --test_csv_path=$LANDMARK_DATA/train/test.csv \
  --test_directory=$LANDMARK_DATA/test////

2022-01-27 22:07:03.277440: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-01-27 22:07:03.277495: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-01-27 22:07:05.252430: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-01-27 22:07:05.252502: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-01-27 22:07:05.252525: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (gcsfuse-experiment): /proc/driver/nvidia/version does not exist
/home/avidullu/mldata/cvdfoundation/google-landmark/train/train_clean.csv
Traceback (most recent call last):
  File "build_image_dataset.py", line 491, in <module>
    app.run(main)
  File "/home/avidullu/.local/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/avidullu/.local/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "build_image_dataset.py", line 485, in main
    FLAGS.seed)
  File "build_image_dataset.py", line 439, in _build_train_tfrecord_dataset
    image_dir)
  File "build_image_dataset.py", line 144, in _get_clean_train_image_files_and_labels
    df = pd.read_csv(csv_file)
  File "/home/avidullu/.local/lib/python3.7/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/avidullu/.local/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/avidullu/.local/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 482, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/avidullu/.local/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 811, in __init__
    self._engine = self._make_engine(self.engine)
  File "/home/avidullu/.local/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/home/avidullu/.local/lib/python3.7/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 51, in __init__
    self._open_handles(src, kwds)
  File "/home/avidullu/.local/lib/python3.7/site-packages/pandas/io/parsers/base_parser.py", line 229, in _open_handles
    errors=kwds.get("encoding_errors", "strict"),
  File "/home/avidullu/.local/lib/python3.7/site-packages/pandas/io/common.py", line 724, in get_handle
    newline="",
AttributeError: 'GFile' object has no attribute 'readable'

The code at https://github.com/tensorflow/models/blob/a033df775262a1a48420649a216a16b687bc39f6/research/delf/delf/python/training/build_image_dataset.py#L143 seems to open the CSV in binary mode for reading. After removing the 'b' there and at L116, the script makes progress.
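A minimal standard-library sketch of the failure mode, assuming the traceback above means what it appears to: pandas wraps a binary handle in a text layer, and that wrapping asks the handle for `readable()`. The class `BinaryHandleWithoutReadable`, the helper names, and the sample rows are hypothetical stand-ins for illustration; they are not taken from `build_image_dataset.py`, TensorFlow, or pandas.

```python
import csv
import io

# Made-up sample rows, only for illustration.
CSV_BYTES = b"id,landmark_id\nabc123,7\n"


class BinaryHandleWithoutReadable:
    """Hypothetical binary file object that offers read() but not readable()."""

    def __init__(self, data):
        self._buffer = io.BytesIO(data)

    def read(self, size=-1):
        return self._buffer.read(size)


def wrap_as_text(handle):
    """Try to put a text layer over a binary handle.

    Returns the error message if the wrapping fails, otherwise None.
    """
    try:
        io.TextIOWrapper(handle, encoding="utf-8", newline="")
    except AttributeError as err:
        return str(err)
    return None


def read_rows(text_handle):
    """Parse CSV rows from a handle that already yields str."""
    return list(csv.DictReader(text_handle))


# Binary handle without readable(): the text wrapper cannot be built.
error_message = wrap_as_text(BinaryHandleWithoutReadable(CSV_BYTES))

# Text handle: no wrapping is needed, so parsing proceeds.
rows = read_rows(io.StringIO(CSV_BYTES.decode("utf-8")))

print(error_message)
print(rows)
```

If this reading is right, opening the CSV in text mode (dropping the 'b') avoids the wrapping step entirely, which would match the observation that the script makes progress after the change.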

avidullu avatar Jan 27 '22 22:01 avidullu

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
What is the top-level directory of the model you are using
Have I written custom code
OS Platform and Distribution
TensorFlow installed from
TensorFlow version
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce

tensorflowbutler avatar Jan 29 '22 04:01 tensorflowbutler

Responding inline

Q. What is the top-level directory of the model you are using A. models/research/delf

Q. Have I written custom code? A. No. Ran the commands as mentioned in research/delf/delf/python/training/README.md

Q. OS Platform and Distribution A. GCP VM with Debian image. 32 CPU and 128GB RAM

Q. TensorFlow installed from? A. From the script mentioned in /research/delf/INSTALL_INSTRUCTIONS.md

Q. TensorFlow version A. 2.7

Q. Bazel version A. 5.0

Q. CUDA/cuDNN version A. NA (no GPU on hardware)

Q. GPU model A. NA (no GPU on hardware)

Q. Exact command to reproduce A. python3 build_image_dataset.py --train_csv_path=$LANDMARK_DATA/train/train.csv --train_clean_csv_path=$LANDMARK_DATA/train/train_clean.csv --train_directory=$LANDMARK_DATA/train/ --output_directory=$LANDMARK_DATA/tfrecord/ --num_shards=128 --generate_train_validation_splits --validation_split_size=0.2 --test_csv_path=$LANDMARK_DATA/train/test.csv --test_directory=$LANDMARK_DATA/test/

avidullu avatar Jan 30 '22 07:01 avidullu

Thanks for reporting this!

@dan-anghel , I think you wrote this part of the code. Do you remember if the 'b' is really necessary when reading the CSV? What @avidullu reported makes sense to me, although I remember you had run the code several times, which seems to contradict it.

andrefaraujo avatar Feb 03 '22 16:02 andrefaraujo

@andrefaraujo

> Thanks for reporting this!
>
> @dan-anghel , I think you wrote this part of the code. Do you remember if the 'b' is really necessary when reading the CSV?

The CSV shouldn't be opened in binary mode. Removing the 'b' solves the problem for me.

khatchad avatar Aug 07 '24 14:08 khatchad