Unable to load MNIST dataset on Colab, running TPU v2.8

Open jeromemassot opened this issue 1 year ago • 4 comments

Short description Unable to load MNIST dataset on Colab, running TPU v2.8

Environment information

  • Operating System: Colab

  • Python version: 3.10.12

  • tensorflow-datasets/tfds-nightly version: 4.9.7

  • tensorflow/tf-nightly version: 2.15.0

  • JAX version: 0.4.33

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)?

Reproduction instructions

import tensorflow_datasets as tfds

data_dir = '/tmp/tfds'
# batch_size=-1 loads each split as a single full batch
mnist_data, info = tfds.load(name="mnist", batch_size=-1, data_dir=data_dir, with_info=True)
mnist_data = tfds.as_numpy(mnist_data)
data_train, data_test = mnist_data['train'], mnist_data['test']

Logs

TypeError                                 Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/tensorflow/python/data/util/structure.py](https://localhost:8080/#) in normalize_element(element, element_signature)
    104         if spec is None:
--> 105           spec = type_spec_from_value(t, use_fallback=False)
    106       except TypeError:

[/usr/local/lib/python3.10/dist-packages/tensorflow/python/data/util/structure.py](https://localhost:8080/#) in type_spec_from_value(element, use_fallback)
    513 
--> 514   raise TypeError("Could not build a `TypeSpec` for {} with type {}".format(
    515       element,

TypeError: Could not build a `TypeSpec` for ['gs://tfds-data/datasets/mnist/3.0.1/mnist-test.tfrecord-00000-of-00001'] with type list

During handling of the above exception, another exception occurred:

FailedPreconditionError                   Traceback (most recent call last)
[<ipython-input-16-6a5f245a8c89>](https://localhost:8080/#) in <cell line: 1>()
----> 1 mnist_data, info = tfds.load(name="mnist", batch_size=-1, with_info=True, try_gcs=True)
      2 mnist_data = tfds.as_numpy(mnist_data)
      3 data_train, data_test = mnist_data['train'], mnist_data['test']

[/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/logging/__init__.py](https://localhost:8080/#) in __call__(self, function, instance, args, kwargs)
    174     metadata = self._start_call()
    175     try:
--> 176       return function(*args, **kwargs)
    177     except Exception:
    178       metadata.mark_error()

[/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/load.py](https://localhost:8080/#) in load(name, split, data_dir, batch_size, shuffle_files, download, as_supervised, decoders, read_config, with_info, builder_kwargs, download_and_prepare_kwargs, as_dataset_kwargs, try_gcs)
    671   as_dataset_kwargs.setdefault('read_config', read_config)
    672 
--> 673   ds = dbuilder.as_dataset(**as_dataset_kwargs)
    674   if with_info:
    675     return ds, dbuilder.info

[/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/logging/__init__.py](https://localhost:8080/#) in __call__(self, function, instance, args, kwargs)
    174     metadata = self._start_call()
    175     try:
--> 176       return function(*args, **kwargs)
    177     except Exception:
    178       metadata.mark_error()

[/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/dataset_builder.py](https://localhost:8080/#) in as_dataset(self, split, batch_size, shuffle_files, decoders, read_config, as_supervised)
   1024         as_supervised=as_supervised,
   1025     )
-> 1026     all_ds = tree.map_structure(build_single_dataset, split)
   1027     return all_ds
   1028 

[/usr/local/lib/python3.10/dist-packages/tree/__init__.py](https://localhost:8080/#) in map_structure(func, *structures, **kwargs)
    433     assert_same_structure(structures[0], other, check_types=check_types)
    434   return unflatten_as(structures[0],
--> 435                       [func(*args) for args in zip(*map(flatten, structures))])
    436 
    437 

[/usr/local/lib/python3.10/dist-packages/tree/__init__.py](https://localhost:8080/#) in <listcomp>(.0)
    433     assert_same_structure(structures[0], other, check_types=check_types)
    434   return unflatten_as(structures[0],
--> 435                       [func(*args) for args in zip(*map(flatten, structures))])
    436 
    437 

[/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/dataset_builder.py](https://localhost:8080/#) in _build_single_dataset(self, split, batch_size, shuffle_files, decoders, read_config, as_supervised)
   1042 
   1043     # Build base dataset
-> 1044     ds = self._as_dataset(
   1045         split=split,
   1046         shuffle_files=shuffle_files,

[/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/dataset_builder.py](https://localhost:8080/#) in _as_dataset(self, split, decoders, read_config, shuffle_files)
   1496     )
   1497     decode_fn = functools.partial(features.decode_example, decoders=decoders)
-> 1498     return reader.read(
   1499         instructions=split,
   1500         split_infos=self.info.splits.values(),

[/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/reader.py](https://localhost:8080/#) in read(self, instructions, split_infos, read_config, shuffle_files, disable_shuffling, decode_fn)
    428       )
    429 
--> 430     return tree.map_structure(_read_instruction_to_ds, instructions)
    431 
    432   def read_files(

[/usr/local/lib/python3.10/dist-packages/tree/__init__.py](https://localhost:8080/#) in map_structure(func, *structures, **kwargs)
    433     assert_same_structure(structures[0], other, check_types=check_types)
    434   return unflatten_as(structures[0],
--> 435                       [func(*args) for args in zip(*map(flatten, structures))])
    436 
    437 

[/usr/local/lib/python3.10/dist-packages/tree/__init__.py](https://localhost:8080/#) in <listcomp>(.0)
    433     assert_same_structure(structures[0], other, check_types=check_types)
    434   return unflatten_as(structures[0],
--> 435                       [func(*args) for args in zip(*map(flatten, structures))])
    436 
    437 

[/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/reader.py](https://localhost:8080/#) in _read_instruction_to_ds(instruction)
    420     def _read_instruction_to_ds(instruction):
    421       file_instructions = splits_dict[instruction].file_instructions
--> 422       return self.read_files(
    423           file_instructions,
    424           read_config=read_config,

[/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/reader.py](https://localhost:8080/#) in read_files(self, file_instructions, read_config, shuffle_files, disable_shuffling, decode_fn)
    460 
    461     # Read serialized example (eventually with `tfds_id`)
--> 462     ds = _read_files(
    463         file_instructions=file_instructions,
    464         read_config=read_config,

[/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/reader.py](https://localhost:8080/#) in _read_files(file_instructions, read_config, shuffle_files, disable_shuffling, file_format)
    265   )
    266 
--> 267   instruction_ds = tf.data.Dataset.from_tensor_slices(tensor_inputs)
    268 
    269   # On distributed environments, we can shard per-file if a

[/usr/local/lib/python3.10/dist-packages/tensorflow/python/data/ops/dataset_ops.py](https://localhost:8080/#) in from_tensor_slices(tensors, name)
    823     # pylint: disable=g-import-not-at-top,protected-access
    824     from tensorflow.python.data.ops import from_tensor_slices_op
--> 825     return from_tensor_slices_op._from_tensor_slices(tensors, name)
    826     # pylint: enable=g-import-not-at-top,protected-access
    827 

[/usr/local/lib/python3.10/dist-packages/tensorflow/python/data/ops/from_tensor_slices_op.py](https://localhost:8080/#) in _from_tensor_slices(tensors, name)
     23 
     24 def _from_tensor_slices(tensors, name=None):
---> 25   return _TensorSliceDataset(tensors, name=name)
     26 
     27 

[/usr/local/lib/python3.10/dist-packages/tensorflow/python/data/ops/from_tensor_slices_op.py](https://localhost:8080/#) in __init__(self, element, is_files, name)
     31   def __init__(self, element, is_files=False, name=None):
     32     """See `Dataset.from_tensor_slices` for details."""
---> 33     element = structure.normalize_element(element)
     34     batched_spec = structure.type_spec_from_value(element)
     35     self._tensors = structure.to_batched_tensor_list(batched_spec, element)

[/usr/local/lib/python3.10/dist-packages/tensorflow/python/data/util/structure.py](https://localhost:8080/#) in normalize_element(element, element_signature)
    108         # the value. As a fallback try converting the value to a tensor.
    109         normalized_components.append(
--> 110             ops.convert_to_tensor(t, name="component_%d" % i))
    111       else:
    112         # To avoid a circular dependency between dataset_ops and structure,

[/usr/local/lib/python3.10/dist-packages/tensorflow/python/profiler/trace.py](https://localhost:8080/#) in wrapped(*args, **kwargs)
    181         with Trace(trace_name, **trace_kwargs):
    182           return func(*args, **kwargs)
--> 183       return func(*args, **kwargs)
    184 
    185     return wrapped

[/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py](https://localhost:8080/#) in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, dtype_hint, ctx, accepted_result_types)
    694   # TODO(b/142518781): Fix all call-sites and remove redundant arg
    695   preferred_dtype = preferred_dtype or dtype_hint
--> 696   return tensor_conversion_registry.convert(
    697       value, dtype, name, as_ref, preferred_dtype, accepted_result_types
    698   )

[/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/tensor_conversion_registry.py](https://localhost:8080/#) in convert(value, dtype, name, as_ref, preferred_dtype, accepted_result_types)
    232 
    233     if ret is None:
--> 234       ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
    235 
    236     if ret is NotImplemented:

[/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py](https://localhost:8080/#) in _constant_tensor_conversion_function(v, dtype, name, as_ref)
    333                                          as_ref=False):
    334   _ = as_ref
--> 335   return constant(v, dtype=dtype, name=name)
    336 
    337 # Register the conversion function for the "unconvertible" types

[/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/weak_tensor_ops.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
    140   def wrapper(*args, **kwargs):
    141     if not ops.is_auto_dtype_conversion_enabled():
--> 142       return op(*args, **kwargs)
    143     bound_arguments = signature.bind(*args, **kwargs)
    144     bound_arguments.apply_defaults()

[/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py](https://localhost:8080/#) in constant(value, dtype, shape, name)
    269     ValueError: if called on a symbolic tensor.
    270   """
--> 271   return _constant_impl(value, dtype, shape, name, verify_shape=False,
    272                         allow_broadcast=True)
    273 

[/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py](https://localhost:8080/#) in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
    282       with trace.Trace("tf.constant"):
    283         return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
--> 284     return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    285 
    286   const_tensor = ops._create_graph_constant(  # pylint: disable=protected-access

[/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py](https://localhost:8080/#) in _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    294 ) -> ops._EagerTensorBase:
    295   """Creates a constant on the current device."""
--> 296   t = convert_to_eager_tensor(value, ctx, dtype)
    297   if shape is None:
    298     return t

[/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py](https://localhost:8080/#) in convert_to_eager_tensor(value, ctx, dtype)
    100     except AttributeError:
    101       dtype = dtypes.as_dtype(dtype).as_datatype_enum
--> 102   ctx.ensure_initialized()
    103   return ops.EagerTensor(value, ctx.device_name, dtype)
    104 

[/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/context.py](https://localhost:8080/#) in ensure_initialized(self)
    601         pywrap_tfe.TFE_ContextOptionsSetJitCompileRewrite(
    602             opts, self._jit_compile_rewrite)
--> 603         context_handle = pywrap_tfe.TFE_NewContext(opts)
    604       finally:
    605         pywrap_tfe.TFE_DeleteContextOptions(opts)

FailedPreconditionError: ioctl failed; [0000:00:06.0 PE0 C2 MC-1 TN0] Failed to set number of simple DMA addresses

Expected behavior MNIST dataset to be loaded.

jeromemassot avatar Jan 14 '25 01:01 jeromemassot

The error is perhaps linked to batch_size=-1, which means that you're loading all data at once. Could you try some smaller values?

fineguy avatar Jan 20 '25 11:01 fineguy
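
For reference, a minimal sketch of what loading with a smaller batch size could look like, instead of materialising a whole split with batch_size=-1. The helper names `expected_batches` and `iter_mnist_batches` are made up for illustration and are not part of tfds; the data_dir is the one from the reproduction above. tfds is imported lazily so the batch-count helper can be used without it.

```python
import math


def expected_batches(num_examples, batch_size):
    """Number of batches produced when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)


def iter_mnist_batches(batch_size=32, data_dir="/tmp/tfds", split="train"):
    """Yield (images, labels) NumPy batches for one MNIST split.

    Hypothetical helper: it only wraps tfds.load and tfds.as_numpy.
    """
    # Imported lazily so that the module can be loaded without tfds installed.
    import tensorflow_datasets as tfds

    ds = tfds.load(
        name="mnist",
        split=split,
        batch_size=batch_size,
        data_dir=data_dir,
    )
    # tfds.as_numpy turns the batched tf.data.Dataset into an iterable of
    # dicts of NumPy arrays, keyed by feature name.
    for batch in tfds.as_numpy(ds):
        yield batch["image"], batch["label"]
```

The MNIST train split has 60,000 examples, so `expected_batches(60_000, 32)` gives the number of batches to expect from `iter_mnist_batches()` with the defaults. Note that this still goes through tf.data, so it does not avoid the TensorFlow context initialisation seen in the traceback.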

Hi fineguy2, I changed the batch size to 32 and the problem persists.

jeromemassot avatar Feb 02 '25 23:02 jeromemassot

The framework versions installed by default on my Colab have changed:

TPU v2.8:

  • python: 3.11.11
  • jax: 0.4.33
  • TensorFlow: 2.18.0
  • tfds: 4.9.7

No error is reported any more, but the Colab session crashes every time.

TPU v5e:

Same framework versions. Now I get an error:


TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.11/dist-packages/tensorflow/python/data/util/structure.py in normalize_element(element, element_signature)
    104         if spec is None:
--> 105           spec = type_spec_from_value(t, use_fallback=False)
    106       except TypeError:

29 frames

TypeError: Could not build a TypeSpec for ['/tmp/tfds/mnist/3.0.1/mnist-test.tfrecord-00000-of-00001'] with type list

During handling of the above exception, another exception occurred:

InternalError                             Traceback (most recent call last)
/usr/local/lib/python3.11/dist-packages/tensorflow/python/eager/context.py in ensure_initialized(self)
    724             opts, self._jit_compile_rewrite
    725         )
--> 726         context_handle = pywrap_tfe.TFE_NewContext(opts)
    727       finally:
    728         pywrap_tfe.TFE_DeleteContextOptions(opts)

InternalError: RET_CHECK failure (platforms/asic_sw/driver/common/internal/vfio_device_access.cc:161) !static_map_->contains(iommu_group_path_) /dev/vfio/0 already exists in the map and should only exist once!

jeromemassot avatar Feb 02 '25 23:02 jeromemassot

Is this issue still present? It seems to stem from tensorflow.data and not tensorflow_datasets.

fineguy avatar May 06 '25 08:05 fineguy
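
One way to check whether tensorflow_datasets is involved at all is to run a tf.data call without it. In both tracebacks above the final exception is raised from `ensure_initialized` (via `TFE_NewContext`) while `tf.data.Dataset.from_tensor_slices` converts a list of file names to a tensor, so a sketch like the following should hit the same code path. The function name `check_tf_eager_context` and the placeholder file name are made up for illustration; this is a diagnostic under those assumptions, not a fix.

```python
def check_tf_eager_context():
    """Try a tiny tf.data pipeline without tfds and report what happens.

    Returns (ok, message). Hypothetical helper for diagnosis only.
    """
    try:
        import tensorflow as tf
    except ImportError as exc:
        return False, f"tensorflow is not importable: {exc}"

    try:
        # Same kind of call that tfds makes in reader._read_files:
        # a Python list of strings is converted to a tensor, which forces
        # TensorFlow to initialise its eager context (and its devices).
        ds = tf.data.Dataset.from_tensor_slices(["placeholder-file-name"])
        first = next(iter(ds.as_numpy_iterator()))
    except Exception as exc:  # e.g. FailedPreconditionError, InternalError
        return False, f"{type(exc).__name__}: {exc}"

    return True, f"tf.data works without tfds, first element: {first!r}"


ok, message = check_tf_eager_context()
print("OK" if ok else "FAILED", "-", message)
```

If this fails in a fresh Colab TPU runtime with the same FailedPreconditionError or InternalError, the problem is reproducible with TensorFlow alone and tfds can be left out of the report. If the runtime crashes outright, as described for TPU v2.8, the `except` branch will not get a chance to run.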