Transform component on Databricks with Azure Storage mount as pipeline root: error when removing tmp directory
- Environment in which the code is executed: Databricks Notebook
- TensorFlow version: 2.5.0
- TFX Version: 1.0.0
- Python version: 3.8.8
Running TFX on Databricks with an Azure Storage mount works fine up until the Transform component. The "tmp" directory doesn't exist (the output is written to `.temp_path` instead). As a result, when `tf.io.gfile.rmtree(base_temp_dir)` is called in the Transform component, it errors out.
High-level reproduction:

```python
dbutils.fs.mount(source="/path/to/storage", mount_point="/dbfs/mnt/tfxdirectory")
db_config = tfx.orchestration.metadata.mysql_metadata_connection_config()
context = InteractiveContext(
    pipeline_root="/dbfs/mnt/tfxdirectory/",
    metadata_connection_config=db_config,
)

transform = tfx.components.Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=os.path.abspath(_taxi_transform_module_file),
)
context.run(transform)
```
Info log:

```
INFO:tensorflow:Assets written to: /dbfs/mnt/tfxdirectory/Transform/transform_graph/5/.temp_path/tftransform_tmp/5638d1f2d4d8472bb27ad29ffebea23c/assets
```
Error:

```
RuntimeError: tensorflow.python.framework.errors_impl.NotFoundError: /dbfs/mnt/tfxdirectory/Transform/transform_graph/5/transform_tmp; No such file or directory [while running 'WriteTransformFn/PublishMetadataAndTransformFn']
```
Directory structure (`transform_tmp` is missing; the files live under `.temp_path/tftransform_tmp` instead):

```
!tree -a {/dbfs/mnt/tfxdirectory}

/dbfs/mnt/tfxdirectory/Transform/transform_graph/5/
├── .temp_path
│   └── tftransform_tmp
│       ├── 5638d1f2d4d8472bb27ad29ffebea23c
│       │   ├── .tft_metadata
│       │   │   └── schema.pbtxt
│       │   ├── assets
│       │   │   ├── vocab_compute_and_apply_vocabulary_1_vocabulary
│       │   │   └── vocab_compute_and_apply_vocabulary_vocabulary
│       │   ├── saved_model.pb
│       │   └── variables
│       │       ├── variables.data-00000-of-00001
│       │       └── variables.index
```
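The failure mode above can be reproduced in isolation with the standard library (a sketch only: `shutil.rmtree` stands in for `tf.io.gfile.rmtree`, and the directory names mirror the tree above; this is not the actual TFX code path):

```python
import os
import shutil
import tempfile

# Scratch directory standing in for .../Transform/transform_graph/5/
base = tempfile.mkdtemp()

# Transform's output actually lands under ".temp_path/tftransform_tmp" ...
os.makedirs(os.path.join(base, ".temp_path", "tftransform_tmp"))

# ... but the cleanup targets a sibling "transform_tmp" that was never
# created, so the recursive delete fails with a not-found error.
missing = os.path.join(base, "transform_tmp")
caught = None
try:
    shutil.rmtree(missing)
except FileNotFoundError as exc:
    caught = exc

print("cleanup failed:", caught is not None)  # cleanup failed: True

shutil.rmtree(base)  # remove the scratch directory
```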
I wonder whether there are subtle differences in filesystem behavior when an Azure Storage mount is used. My reading of [1] suggests that something should be written, but if it's not guaranteed to be flushed, it could turn into a race condition later.
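One crude way to test that suspicion would be a visibility probe run against the mount (a sketch; `probe_visibility` and the probe filename are made up here, not part of TFX or Databricks):

```python
import os
import tempfile

def probe_visibility(root: str) -> bool:
    """Write a small file under `root`, fsync it, and report whether it is
    immediately visible in a directory listing. On a local POSIX filesystem
    this should always be True; pointing `root` at the mount (e.g.
    "/dbfs/mnt/tfxdirectory") would test the delayed-flush hypothesis.
    """
    path = os.path.join(root, "visibility_probe.txt")
    with open(path, "w") as f:
        f.write("probe")
        f.flush()
        os.fsync(f.fileno())  # force the write to storage
    visible = "visibility_probe.txt" in os.listdir(root)
    os.remove(path)
    return visible

print(probe_visibility(tempfile.mkdtemp()))  # local filesystem: True
```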
Note that we don't use azure storage mount ourselves so this would be quite difficult to reproduce from our side.
[1] https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py#L83
What I do find interesting is that the other components (ExampleGen, StatisticsGen, etc.) work as expected with the Azure Storage mount. In addition, I went through the repository but couldn't figure out where the `.temp_path` part in `Transform/transform_graph/5/.temp_path/tftransform_tmp/5638d1f2d4d8472bb27ad29ffebea23c` comes from. I'm not sure it is a race condition. I tried creating the directory beforehand, but in that case it doesn't work either.
We basically expect the following to write and flush: https://github.com/tensorflow/transform/blob/a183f6848d266e399e8cd2a2d37111411e8bd4e4/tensorflow_transform/tf_metadata/metadata_io.py#L122
I'm surprised that the directory doesn't even exist though because of this: https://github.com/tensorflow/transform/blob/a183f6848d266e399e8cd2a2d37111411e8bd4e4/tensorflow_transform/tf_metadata/metadata_io.py#L119
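For reference, the create-if-missing pattern is idempotent on a local filesystem (a stdlib sketch: `os.makedirs` stands in for the gfile-based directory creation in `metadata_io.py`; the open question is whether the Azure mount honors the same guarantee):

```python
import os
import tempfile

root = tempfile.mkdtemp()
target = os.path.join(root, ".temp_path", "tftransform_tmp")

# Create-if-missing: on POSIX the directory reliably exists afterwards,
# and repeating the call is a harmless no-op.
os.makedirs(target, exist_ok=True)
os.makedirs(target, exist_ok=True)  # idempotent second call

print(os.path.isdir(target))  # True
```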
Do we know the extent of TF's support of azure storage mounts?
@Sruinard is this an error that reproduces in 100% of runs, or only some?
Yes, the error reproduces 100% of the time...
@zoyahav Can you PTAL?
@Sruinard,
This line in `tensorflow_transform/tf_metadata/metadata_io.py` should create the directory for the files being written if it is not already present.
Please make sure your Azure Storage account grants write access. Ref: Authorize access to data in Azure Storage. Thank you!
This issue has been marked stale because it has had no recent activity for 14 days. It will be closed if no further activity occurs. Thank you.
This issue was closed due to lack of activity after being marked stale for the past 7 days.