[WIP][Datasets] Enable lazy execution by default
Signed-off-by: Cheng Su [email protected]
Why are these changes needed?
This PR is to enable lazy execution by default. See https://github.com/ray-project/enhancements/pull/19 for motivation. The change includes:
- Change the `Dataset` constructor to `Dataset.__init__(lazy: bool = True)`. Also remove the `defer_execution` field, as it's no longer needed.
- `read_api.py`: `read_datasource()` returns a lazy `Dataset`, while still eagerly computing the first input block.
- Add `ds.fully_executed()` calls to the required unit tests, to make sure they pass.
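To illustrate the intended semantics, here is a toy sketch of lazy-by-default execution with an explicit `fully_executed()` trigger. This is not Ray's actual `Dataset` implementation; the class, stage handling, and block representation below are simplified stand-ins.

```python
# Toy illustration of lazy-by-default execution. NOT Ray's implementation;
# ToyDataset is a hypothetical stand-in for ray.data.Dataset.
class ToyDataset:
    def __init__(self, blocks, lazy: bool = True):
        self._blocks = blocks      # input blocks (lists of rows)
        self._stages = []          # recorded-but-unexecuted transforms
        self._lazy = lazy
        self._executed = None      # cached snapshot after execution

    def map(self, fn):
        if self._lazy:
            # Lazy mode: record the stage; nothing runs yet.
            self._stages.append(fn)
            return self
        # Eager mode: apply immediately.
        self._blocks = [[fn(row) for row in b] for b in self._blocks]
        return self

    def fully_executed(self):
        # Execute all pending stages and cache the resulting blocks.
        if self._executed is None:
            blocks = self._blocks
            for fn in self._stages:
                blocks = [[fn(row) for row in b] for b in blocks]
            self._executed = blocks
        return self._executed


calls = []

def double(x):
    calls.append(x)
    return x * 2

ds = ToyDataset([[1, 2], [3]]).map(double)
assert calls == []                 # lazy: nothing has executed yet
out = ds.fully_executed()
assert out == [[2, 4], [6]]
assert calls == [1, 2, 3]          # stages ran exactly once, on demand
```

This is why the unit tests need the added `ds.fully_executed()` calls: with `lazy=True` as the default, a transformation alone no longer triggers any computation.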
TODO:
- [ ] Fix all unit tests
- [ ] Update documentation
Related issue number
Checks
- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
Nice. Shall we also update the documentation in the same change?
> Shall we also update the documentation in the same change?
@ericl - I am thinking of doing it in a separate PR for easier doc review (assuming there will be more code changes for fixing unit tests). But I can also do it in the same PR if people prefer.
@c21 We should also audit all benchmarks that involve Datasets, to ensure that setup operations that were previously executed eagerly are still executed eagerly, so we're not accidentally including e.g. reading or setup transformations when we're trying to benchmark a single downstream operation. The .map_batches() benchmarks come to mind.
> We should also audit all benchmarks that involve Datasets, to ensure that setup operations that were previously executed eagerly are still executed eagerly, so we're not accidentally including e.g. reading or setup transformations when we're trying to benchmark a single downstream operation. The .map_batches() benchmarks come to mind.
@clarkzinzow - thanks, will go over all nightly tests.
Currently, any execution will "cache" a snapshot of the final stage of blocks. Should we change this behavior to only "cache" on a call to fully_executed() or add an explicit cache() action?
Discussed offline with @ericl; we will postpone this for now, given the impact is low.
The str-form of Datasets will now frequently include `num_rows=?` and `schema=Unknown schema`. Can we change this to `num_rows=<Pending execution>` and `schema=<Pending execution>` for clarity? Better yet, we could improve the str-form to show the pending stages that will be executed.
Will do it in a separate PR.
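As a sketch of the suggestion above, a `__repr__` could surface pending execution explicitly. This is a hypothetical illustration; Ray's real `Dataset.__repr__` differs, and `PendingDataset` is a made-up stand-in.

```python
# Hypothetical sketch of a clearer str-form for an unexecuted dataset,
# per the review suggestion. NOT Ray's actual Dataset.__repr__.
class PendingDataset:
    def __init__(self, num_rows=None, schema=None, executed=False):
        self.num_rows = num_rows
        self.schema = schema
        self.executed = executed

    def __repr__(self):
        # Before execution, row count and schema are genuinely unknown,
        # so say so explicitly instead of printing "?" / "Unknown schema".
        rows = self.num_rows if self.executed else "<Pending execution>"
        schema = self.schema if self.executed else "<Pending execution>"
        return f"Dataset(num_rows={rows}, schema={schema})"


print(PendingDataset())
print(PendingDataset(num_rows=3, schema="{x: int}", executed=True))
```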
All CI tests passed. The failed nightly tests will be addressed in https://github.com/ray-project/ray/pull/31460 .
FWIW, getting rid of the first-block reading would also help with integrating fully streaming execution (right now it actually breaks streaming). How about we do that as a separate follow-up PR, though? I think this PR is already a pretty extensive change, and we should generally avoid mixing complex changes.
> FWIW, getting rid of the first-block reading would also help with integrating fully streaming execution (right now it actually breaks streaming). How about we do that as a separate follow-up PR, though? I think this PR is already a pretty extensive change, and we should generally avoid mixing complex changes.
SGTM. @clarkzinzow - WDYT?
btw, I will make the `batch_predictor.predict` change in this PR.
> How about we do that as a separate follow-up PR, though?
@ericl @c21 As long as it's done as a P0 follow-up PR that we're sure will get in before the next release, that sounds good to me! I would normally say that we shouldn't enable lazy execution by default in master until we have the fully lazy semantics, but since we're actively iterating on the execution model and we have a good bit of time before the release, we can be pragmatic here.
In hindsight, it probably would have been better to do the following sequence of PRs:
1. Move semi-lazy reading from eagerly computing the first block in `read_datasource` to a metadata-peeking and streaming-read optimization in the `ExecutionPlan`, which would keep the existing "eagerly compute first block" semantics for eager mode and would make lazy mode fully lazy.
2. Enable lazy execution by default.
3. Port the metadata-peeking optimization to plan optimization + the streaming executor, since the policy of "only compute as many blocks as needed for the operation" is really just a pushdown optimization rule + streaming execution.
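The "only compute as many blocks as needed" policy can be sketched as a pull-based pipeline. This is a toy illustration under assumed names: `read_block` and `streamed_blocks` are hypothetical stand-ins for Ray read tasks and the streaming executor, not real APIs.

```python
# Sketch of "only compute as many blocks as needed" as a pull-based
# pipeline. read_block / streamed_blocks are hypothetical stand-ins
# for Ray read tasks; real streaming execution differs.
executed = []

def read_block(i):
    executed.append(i)              # record which read tasks actually ran
    return list(range(i * 2, i * 2 + 2))

def streamed_blocks(num_tasks):
    # Blocks are produced lazily, one per pull from the consumer.
    for i in range(num_tasks):
        yield read_block(i)

# Peeking metadata (e.g. the schema) only needs the first block, so only
# one of the 100 read tasks executes; the rest remain pending.
first = next(streamed_blocks(100))
assert first == [0, 1]
assert executed == [0]
```

Under this framing, the eager first-block read in `read_datasource` becomes just the degenerate case of pulling one block, which is why it ports naturally to an optimization rule plus streaming execution.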
> As long as it's done as a P0 follow-up PR that we're sure will get in before the next release, that sounds good to me! I would normally say that we shouldn't enable lazy execution by default in master until we have the fully lazy semantics, but since we're actively iterating on the execution model and we have a good bit of time before the release, we can be pragmatic here.
@clarkzinzow - yeah, agreed. The TODOs for this PR (1. remove the behavior of eagerly computing the first block for reads, 2. improve the str/repr of `Dataset`, 3. update the documentation) are definitely P0, and I will work on them immediately next week.