
[WIP][Datasets] Enable lazy execution by default

Open c21 opened this issue 3 years ago • 2 comments

Signed-off-by: Cheng Su [email protected]

Why are these changes needed?

This PR is to enable lazy execution by default. See https://github.com/ray-project/enhancements/pull/19 for motivation. The change includes:

  • Change Dataset constructor: Dataset.__init__(lazy: bool = True). Also remove defer_execution field, as it's no longer needed.
  • read_api.py:read_datasource() returns a lazy Dataset, while still eagerly computing the first input block.
  • Add ds.fully_executed() calls to required unit tests, to make sure they are passing.
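The change list above can be illustrated with a toy sketch of the eager-vs-lazy semantics. This is not Ray's actual implementation; the `Dataset`, `lazy` flag, and `fully_executed()` names are taken from the description above, but the bodies are purely illustrative.

```python
# Toy sketch of Dataset.__init__(lazy: bool = True): with lazy=True
# (the new default), read/compute tasks stay pending until
# fully_executed() is called; lazy=False materializes immediately.

class Dataset:
    def __init__(self, read_tasks, lazy=True):
        self._pending = list(read_tasks)  # un-materialized read tasks
        self._blocks = None               # materialized blocks, once executed
        if not lazy:
            self.fully_executed()

    @property
    def executed(self):
        return self._blocks is not None

    def fully_executed(self):
        if self._blocks is None:
            self._blocks = [task() for task in self._pending]
        return self

ds = Dataset([lambda: [1, 2], lambda: [3, 4]])  # lazy by default
assert not ds.executed                          # nothing ran yet
ds.fully_executed()
assert ds.executed
```

This also shows why the unit tests mentioned above need explicit `fully_executed()` calls: assertions on materialized blocks no longer hold right after construction.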

TODO:

  • [ ] Fix all unit tests
  • [ ] Update documentation

Related issue number

Checks

  • [ ] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [ ] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

c21 avatar Dec 22 '22 01:12 c21

Nice. Shall we also update the documentation in the same change?

ericl avatar Dec 22 '22 01:12 ericl

Shall we also update the documentation in the same change?

@ericl - I am thinking of doing it in a separate PR for easier doc review (assuming there are more code changes needed to fix unit tests), but I can also do it in the same PR if people prefer.

c21 avatar Dec 22 '22 01:12 c21

@c21 We should also audit all benchmarks that involve Datasets, to ensure that setup operations that were previously executed eagerly are still executed eagerly, so we're not accidentally including e.g. reading or setup transformations when we're trying to benchmark a single downstream operation. The .map_batches() benchmarks come to mind.
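The benchmark concern above can be sketched as follows. This is a hypothetical timing harness, not an actual Ray benchmark; `LazyDataset`, `read_datasource`, and `map_batches` here are stand-in toys that only mimic the lazy-read behavior being discussed.

```python
# With lazy execution on by default, setup work (e.g. the read) must be
# forced eagerly before the timed region, or the benchmark for a single
# downstream op accidentally includes the read cost.
import time

def read_datasource():
    time.sleep(0.01)  # pretend this is an expensive read
    return [1, 2, 3]

class LazyDataset:
    def __init__(self):
        self._blocks = None

    def fully_executed(self):
        if self._blocks is None:
            self._blocks = read_datasource()
        return self

    def map_batches(self, fn):
        self.fully_executed()  # lazy: the read happens here if not forced
        return [fn(b) for b in self._blocks]

ds = LazyDataset().fully_executed()  # force setup OUTSIDE the timed region
start = time.perf_counter()
out = ds.map_batches(lambda x: x * 2)
elapsed = time.perf_counter() - start
assert out == [2, 4, 6]
assert elapsed < 0.01  # read cost excluded from the measured op
```

Without the `fully_executed()` call before `start`, the 10 ms "read" would land inside the timed region and skew the `map_batches` number.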

clarkzinzow avatar Jan 04 '23 18:01 clarkzinzow

We should also audit all benchmarks that involve Datasets, to ensure that setup operations that were previously executed eagerly are still executed eagerly, so we're not accidentally including e.g. reading or setup transformations when we're trying to benchmark a single downstream operation. The .map_batches() benchmarks come to mind.

@clarkzinzow - thanks, will go over all nightly tests.

c21 avatar Jan 05 '23 01:01 c21

Currently, any execution will "cache" a snapshot of the final stage of blocks. Should we change this behavior to only "cache" on a call to fully_executed() or add an explicit cache() action?

Discussed offline with @ericl; we will postpone this, given the impact is low.
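The two caching policies under discussion can be contrasted in a minimal sketch. `Plan`, `execute`, and `cache` here are illustrative names, not Ray's `ExecutionPlan` API.

```python
# Current behavior: any execution implicitly caches a snapshot of the
# final stage of blocks. The alternative raised above is to cache only
# on an explicit cache()/fully_executed() call.

class Plan:
    def __init__(self, compute):
        self._compute = compute
        self._snapshot = None  # cached blocks from a prior execution

    def execute(self, cache_on_execute=True):
        blocks = self._snapshot if self._snapshot is not None else self._compute()
        if cache_on_execute:
            self._snapshot = blocks  # current behavior: implicit cache
        return blocks

    def cache(self):  # proposed explicit action
        self._snapshot = self.execute()
        return self

calls = []
plan = Plan(lambda: calls.append(1) or [1, 2, 3])
plan.execute()
plan.execute()
assert len(calls) == 1  # second run reused the implicit snapshot
```

Under the explicit-`cache()` alternative, `execute(cache_on_execute=False)` would recompute each time unless the user opted in.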

The str-form of Datasets will include "num_rows=?" and "schema=Unknown schema" a lot now. Can we change this to num_rows=<Pending execution> and schema=<Pending execution> for clarity? Better yet, we could improve the str-form to show the pending stages that will be executed.

Will do it in a separate PR.
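The repr suggestion above amounts to something like the following sketch. This is not the actual `Dataset.__repr__`; it only demonstrates the proposed `<Pending execution>` placeholder.

```python
# Sketch: show "<Pending execution>" for unknown row counts and schema
# instead of the current "num_rows=?" / "schema=Unknown schema".

class Dataset:
    def __init__(self, num_rows=None, schema=None):
        self._num_rows = num_rows
        self._schema = schema

    def __repr__(self):
        rows = self._num_rows if self._num_rows is not None else "<Pending execution>"
        schema = self._schema if self._schema is not None else "<Pending execution>"
        return f"Dataset(num_rows={rows}, schema={schema})"

assert repr(Dataset()) == (
    "Dataset(num_rows=<Pending execution>, schema=<Pending execution>)"
)
assert repr(Dataset(3, "{a: int64}")) == "Dataset(num_rows=3, schema={a: int64})"
```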

c21 avatar Jan 05 '23 01:01 c21

All CI tests passed. The failed nightly tests will be addressed in https://github.com/ray-project/ray/pull/31460 .

c21 avatar Jan 05 '23 20:01 c21

FWIW, getting rid of the first block reading would also help with integrating fully streaming execution (right now that actually breaks streaming). How about we do that as a separate follow-up PR from this one, though? I think this PR is already a pretty extensive change, and we should generally avoid mixing complex changes.

ericl avatar Jan 06 '23 00:01 ericl

FWIW, getting rid of the first block reading would also help with integrating fully streaming execution (right now that actually breaks streaming). How about we do that as a separate follow-up PR from this one, though? I think this PR is already a pretty extensive change, and we should generally avoid mixing complex changes.

SGTM. @clarkzinzow - WDYT?

btw I will make the batch_predictor.predict change in this PR.

c21 avatar Jan 06 '23 00:01 c21

How about we do that as a separate PR followup from this though?

@ericl @c21 As long as it's done as a P0 follow-up PR that we're sure will get in before the next release, that sounds good to me! I would normally say that we shouldn't enable lazy execution by default in master until we have the fully lazy semantics, but since we're actively iterating on the execution model and we have a good bit of time before the release, we can be pragmatic here.

In hindsight, it probably would have been better to do the following sequence of PRs:

  1. Move semi-lazy reading from an eager computing of the first block in read_datasource to a metadata peeking and streaming read optimization in the ExecutionPlan, which would keep the existing "eagerly compute first block" semantics for eager mode and would make lazy mode fully lazy.
  2. Enable lazy execution by default.
  3. Port metadata peeking optimization to plan optimization + streaming executor, since the policy of "only compute as many blocks as needed for operation" is really just a pushdown optimization rule + streaming execution.
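The "metadata peeking" idea in step 1 can be sketched in miniature. The `LazyRead` class and `peek_schema` method below are hypothetical, not the actual `ExecutionPlan` code; they only show the policy of computing just the first block to answer metadata queries while leaving the remaining reads lazy.

```python
# Sketch: a metadata query (here, a schema peek) materializes only the
# first block; all other read tasks stay pending, which is what keeps
# lazy mode fully lazy aside from the single peeked block.

class LazyRead:
    def __init__(self, read_tasks):
        self._tasks = list(read_tasks)
        self._materialized = {}  # block index -> computed block

    def _block(self, i):
        if i not in self._materialized:
            self._materialized[i] = self._tasks[i]()
        return self._materialized[i]

    def peek_schema(self):
        first = self._block(0)  # compute only as many blocks as needed
        return type(first[0]).__name__

reads = LazyRead([lambda: [1, 2], lambda: [3, 4]])
assert reads.peek_schema() == "int"
assert list(reads._materialized) == [0]  # the second read task never ran
```

As step 3 notes, this "only compute as many blocks as needed" policy generalizes to a pushdown optimization rule once a streaming executor exists.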

clarkzinzow avatar Jan 06 '23 16:01 clarkzinzow

As long as it's done as a P0 follow-up PR that we're sure will get in before the next release, that sounds good to me! I would normally say that we shouldn't enable lazy execution by default in master until we have the fully lazy semantics, but since we're actively iterating on the execution model and we have a good bit of time before the release, we can be pragmatic here.

@clarkzinzow - yeah, agreed. The TODOs for this PR ((1) remove the behavior of eagerly computing the first block for reads, (2) improve the str/repr of Dataset, (3) update documentation) are definitely P0, and I will work on them immediately next week.

c21 avatar Jan 06 '23 18:01 c21