Transform component on databricks with azure storage mount as pipeline root error when removing tmp directory

Open Sruinard opened this issue 4 years ago • 4 comments

  • Environment in which the code is executed: Databricks Notebook
  • TensorFlow version: 2.5.0
  • TFX Version: 1.0.0
  • Python version: 3.8.8

Running TFX on Databricks with an Azure Storage mount works fine up until the Transform component. The "tmp" directory doesn't exist (the output is written under .temp_path instead). As a result, when tf.io.gfile.rmtree(base_temp_dir) is called in the Transform component, it errors out.

high level reproduction:

dbutils.fs.mount(source="/path/to/storage", mount_point="/dbfs/mnt/tfxdirectory")
db_config = tfx.orchestration.metadata.mysql_metadata_connection_config()
context = InteractiveContext(pipeline_root="/dbfs/mnt/tfxdirectory/", metadata_connection_config=db_config)

transform = tfx.components.Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=os.path.abspath(_taxi_transform_module_file),
)
context.run(transform)

info

INFO:tensorflow:Assets written to: /dbfs/mnt/tfxdirectory/Transform/transform_graph/5/.temp_path/tftransform_tmp/5638d1f2d4d8472bb27ad29ffebea23c/assets

error

RuntimeError: tensorflow.python.framework.errors_impl.NotFoundError: /dbfs/mnt/tfxdirectory/Transform/transform_graph/5/transform_tmp; No such file or directory [while running 'WriteTransformFn/PublishMetadataAndTransformFn']

directory structure (transform_tmp is missing; the output is under .temp_path/tftransform_tmp instead)

!tree -a {/dbfs/mnt/tfxdirectory}
/dbfs/mnt/tfxdirectory/Transform/transform_graph/5/
├── .temp_path
│   └── tftransform_tmp
│       ├── 5638d1f2d4d8472bb27ad29ffebea23c
│       │   ├── .tft_metadata
│       │   │   └── schema.pbtxt
│       │   ├── assets
│       │   │   ├── vocab_compute_and_apply_vocabulary_1_vocabulary
│       │   │   └── vocab_compute_and_apply_vocabulary_vocabulary
│       │   ├── saved_model.pb
│       │   └── variables
│       │       ├── variables.data-00000-of-00001
│       │       └── variables.index
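
One way to confirm where the temporary output actually landed is to walk the output directory, hidden entries included. The following is a minimal sketch using only the standard library; find_dirs_named is a hypothetical helper (not part of TFX or TensorFlow Transform), and the directory names are taken from the log and error message above.

```python
import os


def find_dirs_named(root, names):
    """Return sorted paths of directories under `root` whose basename is in `names`.

    os.walk does not skip dot-directories, so hidden ones such as
    `.temp_path` are visited as well.
    """
    matches = []
    for dirpath, dirnames, _filenames in os.walk(root):
        for dirname in dirnames:
            if dirname in names:
                matches.append(os.path.join(dirpath, dirname))
    return sorted(matches)
```

For example, find_dirs_named("/dbfs/mnt/tfxdirectory/Transform/transform_graph/5", {"transform_tmp", "tftransform_tmp"}) would, given the tree above, presumably report only the entry under .temp_path, which would match the NotFoundError for transform_tmp.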

Sruinard avatar Aug 04 '21 06:08 Sruinard

I wonder whether there are subtle differences in filesystem behavior when an Azure storage mount is used. My reading of [1] suggests that something should be written, but if it's not guaranteed to be flushed, then it could be a race condition later.

Note that we don't use azure storage mount ourselves so this would be quite difficult to reproduce from our side.

[1] https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py#L83
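
To probe the flush hypothesis independently of TFX, one could write a file through the mount, force a flush, and then poll until the file becomes visible. This is only a sketch under assumptions: write_and_confirm is a hypothetical helper, it uses Python's local file API rather than the tf.io.gfile calls that TensorFlow Transform uses, and whether a given mount honors os.fsync is itself an assumption.

```python
import os
import time


def write_and_confirm(path, data, timeout_s=5.0, poll_s=0.1):
    """Write `data` to `path`, flush it, and poll until the file is visible.

    Returns True if the file exists before the timeout, False otherwise.
    """
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "w") as f:
        f.write(data)
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to push it to the storage layer
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

If this returns False on the mount, or only returns True after a noticeable delay, that would point toward delayed visibility rather than a missing write.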

zhitaoli avatar Aug 04 '21 17:08 zhitaoli

What I do think is interesting is that the other components work as expected with the Azure storage mount (i.e. ExampleGen, StatisticsGen, etc). In addition, I went through the repository but couldn't figure out where the ".temp" part in Transform/transform_graph/5/.temp_path/tftransform_tmp/5638d1f2d4d8472bb27ad29ffebea23c comes from. I'm not sure if it is a race condition. I tried to create the directory beforehand, but in that case it still doesn't work.

Sruinard avatar Aug 04 '21 18:08 Sruinard

We basically expect the following to write and flush: https://github.com/tensorflow/transform/blob/a183f6848d266e399e8cd2a2d37111411e8bd4e4/tensorflow_transform/tf_metadata/metadata_io.py#L122

I'm surprised that the directory doesn't even exist though because of this: https://github.com/tensorflow/transform/blob/a183f6848d266e399e8cd2a2d37111411e8bd4e4/tensorflow_transform/tf_metadata/metadata_io.py#L119

Do we know the extent of TF's support of azure storage mounts?

@Sruinard is this an error that reproduces in 100% of runs or some?

zoyahav avatar Aug 10 '21 09:08 zoyahav

Yes, the error reproduces 100% of the time...

Sruinard avatar Aug 17 '21 05:08 Sruinard

@zoyahav Can you PTAL?

gowthamkpr avatar Oct 27 '22 04:10 gowthamkpr

@Sruinard,

This line in tensorflow_transform/tf_metadata/metadata_io.py should create a directory for writing files if the directory is not present.

Please make sure your azure storage has write access. Ref: Authorize access to data in Azure Storage. Thank you!
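
A quick check of write access on the mounted path could look like the sketch below. It assumes the mount is reachable through the local file API (as /dbfs paths are in the reproduction above); probe_write_access is a hypothetical helper, not a TFX or Azure API.

```python
import os
import uuid


def probe_write_access(directory):
    """Create, read back, and delete a small file under `directory`.

    Returns (ok, detail), where `detail` describes the failure if any.
    """
    probe_dir = os.path.join(directory, f".write_probe_{uuid.uuid4().hex}")
    probe_file = os.path.join(probe_dir, "probe.txt")
    try:
        os.makedirs(probe_dir)
        with open(probe_file, "w") as f:
            f.write("ok")
        with open(probe_file) as f:
            content = f.read()
        # Clean up so the probe leaves nothing behind on success.
        os.remove(probe_file)
        os.rmdir(probe_dir)
    except OSError as e:
        return False, f"{type(e).__name__}: {e}"
    if content != "ok":
        return False, "read-back mismatch"
    return True, "ok"
```

Calling probe_write_access("/dbfs/mnt/tfxdirectory") before running the pipeline would exercise create, write, read, and delete; a failure at any step is returned with the exception type so it can be compared against the NotFoundError above.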

singhniraj08 avatar Dec 22 '22 10:12 singhniraj08

This issue has been marked stale because it has no recent activity since 14 days. It will be closed if no further activity occurs. Thank you.

github-actions[bot] avatar Mar 26 '23 01:03 github-actions[bot]

This issue was closed due to lack of activity after being marked stale for past 7 days.

github-actions[bot] avatar Apr 05 '23 01:04 github-actions[bot]
