Allow Sharded TFRecord Writing

Open AlexanderLavelle opened this issue 2 years ago • 17 comments

In response to 5160 and 5676, expose num_shards in Beam IO's WriteToTFRecord.

This leverages custom_config. I think custom_config is always present whenever exec_properties is, in which case the one-liner on line 83 is acceptable.
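
A minimal sketch of the lookup described above, not the code in this PR. It assumes custom_config arrives in the executor properties either as a JSON-serialized string or as an already-parsed dict, and that a missing value should fall back to 0, which is understood here to be Beam's default for num_shards (the runner picks the shard count). The helper name resolve_num_shards is hypothetical.

```python
import json
from typing import Any, Dict


def resolve_num_shards(exec_properties: Dict[str, Any], default: int = 0) -> int:
    """Return the num_shards value found in custom_config, or `default`.

    Assumption: custom_config is a JSON string or a dict; anything else
    (missing key, None, JSON null) falls back to `default`.
    """
    raw = exec_properties.get('custom_config')
    if not raw:
        return default
    # Accept both the serialized and the already-parsed form.
    config = json.loads(raw) if isinstance(raw, str) else raw
    if not isinstance(config, dict):
        return default
    value = config.get('num_shards')
    return default if value is None else int(value)
```

The resolved value could then be passed as the num_shards argument of beam.io.WriteToTFRecord, leaving current behavior unchanged when the user sets nothing.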

AlexanderLavelle avatar Aug 01 '23 14:08 AlexanderLavelle

Hi @AlexanderLavelle, thank you for your contribution! Could you please add a test for the change? This will help us review the code and ensure that it works as expected.

roseayeon avatar Aug 08 '23 01:08 roseayeon

Hi @AlexanderLavelle, is there any update on this PR? Thank you!

gbaned avatar Aug 18 '23 10:08 gbaned

@roseayeon @gbaned

I looked at the tests, and I think I will need to run this one to confirm the following pseudocode:


exec_properties['num_shards'] = 5
num_shards = exec_properties['num_shards']

...

# Check BigQuery example gen outputs. Beam's default shard names follow
# <prefix>-SSSSS-of-NNNNN, so the total in each name must match num_shards.
  train_output_files = [
      os.path.join(examples.uri, 'Split-train',
                   f'data_tfrecord-{n:05d}-of-{num_shards:05d}.gz')
      for n in range(num_shards)]
  eval_output_files = [
      os.path.join(examples.uri, 'Split-eval',
                   f'data_tfrecord-{n:05d}-of-{num_shards:05d}.gz')
      for n in range(num_shards)]

  # Every expected shard must exist in both splits.
  for file in train_output_files + eval_output_files:
    self.assertTrue(fileio.exists(file))
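
For reference, a small standalone sketch of how the expected shard paths could be generated for any shard count. It assumes Beam's default shard name template (-SSSSS-of-NNNNN, zero-padded to five digits) and the data_tfrecord prefix and .gz suffix used in the existing test. The helper name expected_shard_paths is hypothetical.

```python
import os
from typing import List


def expected_shard_paths(output_dir: str,
                         num_shards: int,
                         prefix: str = 'data_tfrecord',
                         suffix: str = '.gz') -> List[str]:
    """Build the file paths a sharded write is expected to produce.

    Assumes the default '-SSSSS-of-NNNNN' template: shard index and
    total shard count are both zero-padded to five digits.
    """
    return [
        os.path.join(output_dir,
                     f'{prefix}-{n:05d}-of-{num_shards:05d}{suffix}')
        for n in range(num_shards)
    ]
```

With num_shards = 5 this yields five paths per split, from data_tfrecord-00000-of-00005.gz through data_tfrecord-00004-of-00005.gz.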

AlexanderLavelle avatar Aug 19 '23 10:08 AlexanderLavelle

Hi @AlexanderLavelle, is there any update on this PR? Thank you!

gbaned avatar Oct 27 '23 04:10 gbaned

@gbaned Unfortunately, I have not had time to run the unit test as proposed. However, this change just lets the user reach an argument that Apache Beam itself already provides...

AlexanderLavelle avatar Oct 29 '23 15:10 AlexanderLavelle

Hi @roseayeon, can you please review this PR? Thank you!

gbaned avatar Nov 07 '23 06:11 gbaned

Hi @roseayeon, can you please review this PR? Thank you!

gbaned avatar Nov 15 '23 05:11 gbaned

Hi @roseayeon, can you please review this PR? Thank you!

gbaned avatar Nov 27 '23 04:11 gbaned

Hi @roseayeon, can you please review this PR? Thank you!

gbaned avatar Dec 15 '23 05:12 gbaned

Hi @roseayeon, can you please review this PR? Thank you!

gbaned avatar Dec 27 '23 04:12 gbaned

Hi @roseayeon, can you please review this PR? Thank you!

gbaned avatar Jan 10 '24 05:01 gbaned

Apologies for the delay in my response.

Although this is a minor change, we need to include a test for it. I'll add an internal test on my end and will let you know whether this code can be merged.

Thanks

roseayeon avatar Feb 22 '24 02:02 roseayeon

Hi @roseayeon, is there any update on this PR? Thank you!

gbaned avatar Apr 03 '24 06:04 gbaned