Allow Sharded TFRecord Writing
In response to #5160 and #5676, expose `num_shards` in the Beam IO `WriteToTFRecord` call.
This leverages `custom_config`. I believe `custom_config` is always present whenever there are `exec_properties`, in which case the one-liner on line 83 is acceptable.
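As a sketch of the plumbing (the helper name and exact dict shape here are my own assumptions based on the discussion, not the actual TFX internals), the executor could pull `num_shards` out of `custom_config` and forward it to Beam's `WriteToTFRecord`:

```python
# Hypothetical helper: resolve num_shards from a component's custom_config.
# Beam's WriteToTFRecord treats num_shards=0 as "let the runner decide",
# so 0 is a safe default when the user sets nothing.

def resolve_num_shards(exec_properties: dict) -> int:
    custom_config = exec_properties.get('custom_config') or {}
    return int(custom_config.get('num_shards', 0))

# The executor would then pass the value through, roughly:
#   beam.io.WriteToTFRecord(output_path, file_name_suffix='.gz',
#                           num_shards=resolve_num_shards(exec_properties))

print(resolve_num_shards({'custom_config': {'num_shards': 5}}))  # 5
print(resolve_num_shards({}))                                    # 0
```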
Hi @AlexanderLavelle , thank you for your contribution! Could you please add a test related to the change? This will help us review the code and ensure that it is working as expected.
Hi @AlexanderLavelle Any update on this PR? Please. Thank you!
@roseayeon @gbaned
I looked at the tests, and I think I will need to run this one to confirm the following pseudocode:
```python
exec_properties['num_shards'] = 5
...
# Check BigQuery example gen outputs.
train_output_files = [
    os.path.join(examples.uri, 'Split-train',
                 f'data_tfrecord-{n:05d}-of-00005.gz')
    for n in range(exec_properties['num_shards'])
]
eval_output_files = [
    os.path.join(examples.uri, 'Split-eval',
                 f'data_tfrecord-{n:05d}-of-00005.gz')
    for n in range(exec_properties['num_shards'])
]
for file in train_output_files:
    self.assertTrue(fileio.exists(file))
for file in eval_output_files:
    self.assertTrue(fileio.exists(file))
```
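The expected filenames above can be built with a small self-contained helper. This assumes Beam's default shard name template `-SSSSS-of-NNNNN` (five-digit shard index and shard count); the helper name is hypothetical:

```python
import os

def expected_shard_paths(base_uri: str, split: str, num_shards: int) -> list:
    """Build the paths Beam's WriteToTFRecord produces under its
    default '-SSSSS-of-NNNNN' shard name template."""
    return [
        os.path.join(base_uri, f'Split-{split}',
                     f'data_tfrecord-{n:05d}-of-{num_shards:05d}.gz')
        for n in range(num_shards)
    ]

paths = expected_shard_paths('/tmp/examples', 'train', 5)
print(paths[0])   # .../Split-train/data_tfrecord-00000-of-00005.gz
print(paths[-1])  # .../Split-train/data_tfrecord-00004-of-00005.gz
```

Note that with `num_shards=5` the count field is `00005`, so the `-of-00004` suffix in the original pseudocode would never match.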
Hi @AlexanderLavelle Any update on this PR? Please. Thank you!
@gbaned unfortunately I have not had time to run the unit test as posited -- however, this change generally just lets the user reach an argument already provided by Apache Beam itself...
Hi @roseayeon Can you please review this PR? Thank you!
Apologies for the delay in my response.
Although it is a minor change, we need to include a test for it. I'll work on my end to add an internal test and will update you on whether this code can be merged.
Thanks
Hi @roseayeon Any update on this PR? Please. Thank you!