tfds does not work with HDFS since 4.1.0

[Open] yangw1234 opened this issue 3 years ago · 0 comments


Short description: tfds with an HDFS data_dir does not work since 4.1.0.

Environment information

  • Operating System: linux

  • Python version: 3.7.10

  • tensorflow-datasets/tfds-nightly version: 4.1.0 ~ 4.5.2

  • tensorflow/tf-nightly version: 2.5.3

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)?

Reproduction instructions

Env:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${JAVA_HOME}/jre/lib/amd64/server:/opt/cloudera/parcels/CDH-5.15.2-1.cdh5.15.2.p0.3/lib64/
export CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath --glob)

Code:

import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_io as tfio

ratings = tfds.load("movielens/100k-ratings", split="train", data_dir="hdfs://ip:port/datasets")
for elem in ratings:
    print(elem)
    break
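
As a sanity check, the tensorflow-io HDFS plugin can be exercised directly through tf.io.gfile, independently of tfds. A minimal probe along these lines (ip, port, and the path are placeholders, as above):

import tensorflow as tf
import tensorflow_io as tfio  # importing tensorflow_io provides the hdfs:// filesystem scheme

# Placeholder URI, same as in the snippet above.
path = "hdfs://ip:port/datasets"
print(tf.io.gfile.exists(path))   # True if the directory is reachable
print(tf.io.gfile.listdir(path))  # contents of the dataset directory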

In 4.0.0, the above code works. In 4.1.0 and 4.2.0, the code results in the following error:

22/04/11 11:19:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 395, in try_reraise
    yield
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/load.py", line 172, in builder
    return cls(**builder_kwargs)  # pytype: disable=not-instantiable
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 941, in __init__
    super().__init__(**kwargs)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 180, in __init__
    self.info.read_from_directory(self._data_dir)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_info.py", line 366, in read_from_directory
    parsed_proto = read_from_json(json_filename)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_info.py", line 536, in read_from_json
    json_str = utils.as_path(path).read_text()
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/utils/type_utils.py", line 171, in read_text
    return f.read()
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow/python/lib/io/file_io.py", line 117, in read
    self._preread_check()
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow/python/lib/io/file_io.py", line 80, in _preread_check
    compat.path_to_str(self.__name), 1024 * 512)
tensorflow.python.framework.errors_impl.NotFoundError: hdfs:/172.16.0.105:8020/yina/dcn_2/movielens/100k-ratings/0.1.0/dataset_info.json; No such file or directory

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test_dataset.py", line 5, in <module>
    ratings = tfds.load("movielens/100k-ratings", split="train", data_dir="hdfs://172.16.0.105:8020/yina/dcn_2")
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/load.py", line 325, in load
    dbuilder = builder(name, data_dir=data_dir, try_gcs=try_gcs, **builder_kwargs)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/load.py", line 172, in builder
    return cls(**builder_kwargs)  # pytype: disable=not-instantiable
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 397, in try_reraise
    reraise(e, *args, **kwargs)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 363, in reraise
    raise exception from e
RuntimeError: NotFoundError: Failed to construct dataset movielens: hdfs:/172.16.0.105:8020/yina/dcn_2/movielens/100k-ratings/0.1.0/dataset_info.json; No such file or directory
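
Note the URI in the error message: hdfs:/172.16.0.105:8020/... with a single slash after the scheme. A plausible reading of the trace (an inference from the stack above, not confirmed against the tfds source) is that utils.as_path, which entered this code path in 4.1.0, wraps the string in a pathlib-style object, and POSIX path normalization collapses the "//" after the scheme, producing a URI the HDFS filesystem cannot resolve. The standard library shows the same normalization:

import pathlib

# PurePosixPath collapses repeated interior slashes, so the scheme
# separator "hdfs://" degrades to "hdfs:/".
p = pathlib.PurePosixPath("hdfs://172.16.0.105:8020/yina/dcn_2")
print(p)  # hdfs:/172.16.0.105:8020/yina/dcn_2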

In 4.3.0, 4.4.0, and 4.5.2, the code results in the following error:

22/04/11 11:17:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 397, in try_reraise
    yield
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/load.py", line 177, in builder
    return cls(**builder_kwargs)  # pytype: disable=not-instantiable
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 923, in __init__
    super().__init__(**kwargs)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 182, in __init__
    self.info.read_from_directory(self._data_dir)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_info.py", line 377, in read_from_directory
    parsed_proto = read_from_json(json_filename)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_info.py", line 556, in read_from_json
    json_str = utils.as_path(path).read_text()
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/utils/generic_path.py", line 100, in as_path
    return _URI_PREFIXES_TO_CLS[uri_splits[0] + '://'](path)  # pytype: disable=bad-return-type
KeyError: 'hdfs://'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test_dataset.py", line 5, in <module>
    ratings = tfds.load("movielens/100k-ratings", split="train", data_dir="hdfs://172.16.0.105:8020/yina/dcn_2")
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/load.py", line 330, in load
    dbuilder = builder(name, data_dir=data_dir, try_gcs=try_gcs, **builder_kwargs)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/load.py", line 177, in builder
    return cls(**builder_kwargs)  # pytype: disable=not-instantiable
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 399, in try_reraise
    reraise(e, *args, **kwargs)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 365, in reraise
    raise exception from e
RuntimeError: KeyError: Failed to construct dataset movielens: 'hdfs://'
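
This trace is more direct: generic_path.py keeps a dict mapping known URI prefixes to path classes and indexes it with the scheme of the incoming string, so any unregistered scheme raises KeyError before TensorFlow's filesystem layer (and with it the tensorflow-io HDFS plugin) is ever consulted. A stripped-down sketch of the failing lookup, reconstructed from the traceback above; the registry contents and classes here are illustrative placeholders, not the actual tfds ones:

import pathlib

# Illustrative stand-in for _URI_PREFIXES_TO_CLS in generic_path.py.
# The real registry maps prefixes such as "gs://" to GPath-style classes;
# "hdfs://" has no entry, hence the KeyError.
_URI_PREFIXES_TO_CLS = {
    "gs://": pathlib.PurePosixPath,  # placeholder path class
}

def as_path(path: str):
    # Mirrors the lookup at generic_path.py line 100 in the trace above.
    if "://" in path:
        uri_splits = path.split("://", maxsplit=1)
        return _URI_PREFIXES_TO_CLS[uri_splits[0] + "://"](path)
    return pathlib.PurePosixPath(path)

as_path("hdfs://172.16.0.105:8020/yina/dcn_2")  # raises KeyError: 'hdfs://'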


Expected behavior: tfds.load reads the dataset from the HDFS data_dir and iterates over it, as it does in 4.0.0.

yangw1234 · Apr 11 '22