beam icon indicating copy to clipboard operation
beam copied to clipboard

[Feature Request]: Optimize the number of shards for FileIO

Open liferoad opened this issue 1 year ago • 0 comments

What would you like to happen?

FileIO supports https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/FileIO.Write.html#withAutoSharding, which usually works well to split large bundles to smaller ones and then write the shards to files. However, in some cases, the shard sizes could be small like few KBs (1-10KBs) and the number of shards could be huge (e.g., 1M).

This issue could be mitigated by adding Reshuffle before writing the files. However, we could add this as the builtin step when detecting too many bundles with small sizes.

Issue Priority

Priority: 3 (nice-to-have improvement)

Issue Components

  • [ ] Component: Python SDK
  • [X] Component: Java SDK
  • [ ] Component: Go SDK
  • [ ] Component: Typescript SDK
  • [X] Component: IO connector
  • [ ] Component: Beam YAML
  • [ ] Component: Beam examples
  • [ ] Component: Beam playground
  • [ ] Component: Beam katas
  • [ ] Component: Website
  • [ ] Component: Spark Runner
  • [ ] Component: Flink Runner
  • [ ] Component: Samza Runner
  • [ ] Component: Twister2 Runner
  • [ ] Component: Hazelcast Jet Runner
  • [ ] Component: Google Cloud Dataflow Runner

liferoad avatar May 13 '24 20:05 liferoad