tfds does not work with HDFS since 4.1.0
Short description
tfds with an HDFS data_dir does not work since 4.1.0.
Environment information
- Operating System: linux
- Python version: 3.7.10
- tensorflow-datasets/tfds-nightly version: 4.1.0 ~ 4.5.2
- tensorflow/tf-nightly version: 2.5.3
- Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)?
Reproduction instructions
Env:
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${JAVA_HOME}/jre/lib/amd64/server:/opt/cloudera/parcels/CDH-5.15.2-1.cdh5.15.2.p0.3/lib64/
    export CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath --glob)
Code:
    import tensorflow as tf
    import tensorflow_datasets as tfds
    import tensorflow_io as tfio

    ratings = tfds.load("movielens/100k-ratings", split="train", data_dir="hdfs://ip:port/datasets")
    for elem in ratings:
        print(elem)
        break
In 4.0.0, the above code works. In 4.1.0 and 4.2.0, the code results in the following error:
22/04/11 11:19:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 395, in try_reraise
    yield
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/load.py", line 172, in builder
    return cls(**builder_kwargs) # pytype: disable=not-instantiable
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 941, in __init__
    super().__init__(**kwargs)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 180, in __init__
    self.info.read_from_directory(self._data_dir)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_info.py", line 366, in read_from_directory
    parsed_proto = read_from_json(json_filename)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_info.py", line 536, in read_from_json
    json_str = utils.as_path(path).read_text()
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/utils/type_utils.py", line 171, in read_text
    return f.read()
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow/python/lib/io/file_io.py", line 117, in read
    self._preread_check()
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow/python/lib/io/file_io.py", line 80, in _preread_check
    compat.path_to_str(self.__name), 1024 * 512)
tensorflow.python.framework.errors_impl.NotFoundError: hdfs:/172.16.0.105:8020/yina/dcn_2/movielens/100k-ratings/0.1.0/dataset_info.json; No such file or directory
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "test_dataset.py", line 5, in <module>
    ratings = tfds.load("movielens/100k-ratings", split="train", data_dir="hdfs://172.16.0.105:8020/yina/dcn_2")
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/load.py", line 325, in load
    dbuilder = builder(name, data_dir=data_dir, try_gcs=try_gcs, **builder_kwargs)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/load.py", line 172, in builder
    return cls(**builder_kwargs) # pytype: disable=not-instantiable
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 397, in try_reraise
    reraise(e, *args, **kwargs)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 363, in reraise
    raise exception from e
RuntimeError: NotFoundError: Failed to construct dataset movielens: hdfs:/172.16.0.105:8020/yina/dcn_2/movielens/100k-ratings/0.1.0/dataset_info.json; No such file or directory
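Note that the path in the error above starts with hdfs:/ (one slash), while the data_dir passed to tfds.load starts with hdfs:// (two slashes). One possible explanation, which is an assumption and not confirmed from the tfds source in this report, is that the URI goes through a pathlib-style path object somewhere behind utils.as_path, because pathlib collapses repeated slashes. The sketch below uses only the standard library to show that effect on the same URI:

```python
from pathlib import PurePosixPath

# data_dir exactly as passed to tfds.load in the traceback above
data_dir = "hdfs://172.16.0.105:8020/yina/dcn_2"

# Build the dataset_info.json path the way a pathlib-based helper might.
# (Assumption: this mimics, but is not, the actual tfds code path.)
info_path = PurePosixPath(data_dir) / "movielens" / "100k-ratings" / "0.1.0" / "dataset_info.json"

# pathlib treats "hdfs:" as an ordinary path component and drops the
# empty component between the two slashes, so "//" becomes "/".
collapsed = str(info_path)
print(collapsed)
```

The printed string matches the path in the NotFoundError, which is no longer a valid hdfs:// URI for the filesystem layer to resolve.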
In 4.3.0, 4.4.0, and 4.5.2, the code results in the following error:
22/04/11 11:17:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 397, in try_reraise
    yield
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/load.py", line 177, in builder
    return cls(**builder_kwargs) # pytype: disable=not-instantiable
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 923, in __init__
    super().__init__(**kwargs)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 182, in __init__
    self.info.read_from_directory(self._data_dir)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_info.py", line 377, in read_from_directory
    parsed_proto = read_from_json(json_filename)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_info.py", line 556, in read_from_json
    json_str = utils.as_path(path).read_text()
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/utils/generic_path.py", line 100, in as_path
    return _URI_PREFIXES_TO_CLS[uri_splits[0] + '://'](path) # pytype: disable=bad-return-type
KeyError: 'hdfs://'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "test_dataset.py", line 5, in <module>
    ratings = tfds.load("movielens/100k-ratings", split="train", data_dir="hdfs://172.16.0.105:8020/yina/dcn_2")
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/load.py", line 330, in load
    dbuilder = builder(name, data_dir=data_dir, try_gcs=try_gcs, **builder_kwargs)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/load.py", line 177, in builder
    return cls(**builder_kwargs) # pytype: disable=not-instantiable
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 399, in try_reraise
    reraise(e, *args, **kwargs)
  File "/root/anaconda3/envs/yang_tf2_6/lib/python3.7/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 365, in reraise
    raise exception from e
RuntimeError: KeyError: Failed to construct dataset movielens: 'hdfs://'
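The KeyError above comes from the dictionary lookup shown in the traceback, _URI_PREFIXES_TO_CLS[uri_splits[0] + '://'], which fails when the URI scheme has no entry in the mapping. The sketch below imitates that lookup pattern with a hypothetical registry. The names URI_PREFIXES_TO_CLS and as_path here are stand-ins, and the registered prefixes (gs://, s3://) are assumptions for illustration, since the real mapping's contents do not appear in this report.

```python
from pathlib import PurePosixPath

# Hypothetical registry: the prefixes listed here are assumed for illustration
# only and are not taken from the tfds source.
URI_PREFIXES_TO_CLS = {"gs://": PurePosixPath, "s3://": PurePosixPath}


def as_path(path: str) -> PurePosixPath:
    """Pick a path class from the URI prefix, mirroring the traceback's lookup."""
    uri_splits = path.split("://", maxsplit=1)
    if len(uri_splits) > 1:
        # Raises KeyError when the scheme is not registered.
        return URI_PREFIXES_TO_CLS[uri_splits[0] + "://"](path)
    return PurePosixPath(path)


def lookup_error(path: str):
    """Return the repr of the KeyError raised for `path`, or None on success."""
    try:
        as_path(path)
    except KeyError as e:
        return repr(e)
    return None


print(lookup_error("hdfs://172.16.0.105:8020/yina/dcn_2"))
print(lookup_error("gs://bucket/datasets"))
```

Under these assumptions, any scheme missing from the mapping fails at construction time, before any file is read, which is consistent with the error being raised inside as_path.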