[Datasets] Improve documentation for map_batches()

Open jianoaix opened this issue 3 years ago • 0 comments

This extracts learnings from Data oncall, where we saw users confused about two aspects of map_batches():

  • UDF needs to be picklable: this has been an implicit requirement so far, and we should document it explicitly. Source issue: https://discuss.ray.io/t/cannot-pickle-batchinfermodel-when-ds-map-batches-batchinfermodel/7553/2
  • Execution model for a single block: the batches yielded from a single block are executed serially (although each batch may leverage SIMD to vectorize execution), so to improve parallelism the recommendation is to increase the number of blocks (because blocks are parallelizable). Source: https://discuss.ray.io/t/dataset-support-concurrency-in-one-block-when-using-map-batches/7440/4
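
As a starting point for the docs, here is a rough sketch of both points in plain Python, with no Ray dependency. It is a toy model, not Ray's implementation: the standard `pickle` module stands in for Ray's serializer, a thread pool stands in for Ray tasks, and every name in it (`UnpicklableUDF`, `PicklableUDF`, `is_picklable`, `run_block`, `run_dataset`) is made up for illustration.

```python
import pickle
import threading
from concurrent.futures import ThreadPoolExecutor


class UnpicklableUDF:
    """Hypothetical UDF that creates a lock in __init__; locks cannot be pickled."""

    def __init__(self):
        self._lock = threading.Lock()

    def __call__(self, batch):
        with self._lock:
            return [x * 2 for x in batch]


class PicklableUDF:
    """Same logic, but the lock is created on first call, so a fresh instance pickles."""

    def __init__(self):
        self._lock = None

    def __call__(self, batch):
        if self._lock is None:
            self._lock = threading.Lock()
        with self._lock:
            return [x * 2 for x in batch]


def is_picklable(obj):
    """Return True if obj survives pickle.dumps (a stand-in for Ray's serializer)."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


def double(batch):
    """Trivial batch UDF used for the execution-model part of the sketch."""
    return [x * 2 for x in batch]


def split(items, size):
    """Cut a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_block(block, udf, batch_size):
    """Toy model: the batches of ONE block are processed one after another."""
    out = []
    for batch in split(block, batch_size):
        out.extend(udf(batch))
    return out


def run_dataset(blocks, udf, batch_size):
    """Toy model: different blocks run concurrently (threads stand in for tasks)."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda b: run_block(b, udf, batch_size), blocks)
        return [row for block in results for row in block]


if __name__ == "__main__":
    print(is_picklable(UnpicklableUDF()))
    print(is_picklable(PicklableUDF()))
    rows = list(range(8))
    # One block of 8 rows: four batches of 2, all run serially.
    print(run_dataset([rows], double, batch_size=2))
    # Four blocks of 2 rows: same output, but the blocks can run concurrently.
    print(run_dataset(split(rows, 2), double, batch_size=2))
```

In this model the only way to get more concurrency is to pass more blocks to `run_dataset`, which mirrors the recommendation above. The picklability check is only approximate, since Ray's serializer accepts some objects that plain `pickle` rejects, but unpicklable members such as locks are the kind of thing the docs could call out.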

jianoaix, Sep 20 '22 20:09