kedro-plugins icon indicating copy to clipboard operation
kedro-plugins copied to clipboard

feat(datasets): Add option to async load and save in PartitionedDatasets

Open puneeter opened this issue 1 year ago • 5 comments

Description

  • This PR provides the user to load and save PartitionedDataset asynchronously for partitions provided.
  • PartitionedDatasets already provide a way to do lazy loading, which solves for memory complexity. With this PR the time complexity is also reduced if the user wants to save/load these partitions in parallel with the help of use_async argument.

Development notes

  • Additional use_async argument to PartitionedDataset constructor is used to control the async load/save.
  • Based on this argument, _save and _load methods call different private functions.
  • Leveraged existing tests for PartitionedDataset by parameterizing value for use_async using @pytest.mark.parametrize("use_async", [True, False])

Checklist

  • [x] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • [x] Updated the documentation to reflect the code changes
  • [x] Added a description of this change in the relevant RELEASE.md file
  • [x] Added tests to cover my changes

puneeter avatar May 23 '24 17:05 puneeter

Hi @puneeter, can you please provide a description and any relevant development notes on the PR? This will make it easier for the team to review.

merelcht avatar May 24 '24 11:05 merelcht

Hi @puneeter, can you please provide a description and any relevant development notes on the PR? This will make it easier for the team to review.

I updated the description. Please let me know if it needs any refactoring.

puneeter avatar May 24 '24 11:05 puneeter

Would need team's help to point to the right documentation to be changed because of this change. Maybe: docs/source/data/partitioned_and_incremental_datasets.md?

puneeter avatar Oct 18 '24 11:10 puneeter

Hey @puneeter, sorry for the long delay. Indeed, partitioned_and_incremental_datasets.md corresponds to https://docs.kedro.org/en/0.19.10/data/partitioned_and_incremental_datasets.html

In the end, is the usage similar to what I wrote here https://github.com/kedro-org/kedro-plugins/pull/696#discussion_r1616675036 or is it different?

Aside from that, I'll leave one more comment

astrojuanlu avatar Jan 14 '25 15:01 astrojuanlu

@puneeter I see all the tests have been modified to take the use_async argument, but is there a way to also check that the async functionality is working?

merelcht avatar Jan 16 '25 09:01 merelcht

My comments on asyncio.run above were premonitory, because we ended up finding a actual example of such breakage in a separate PR https://github.com/kedro-org/kedro/issues/4611

I think this had good intentions but it might actually be difficult to do. @puneeter do you mind if we turn this PR into an issue and we tackle it at some other time?

astrojuanlu avatar May 13 '25 16:05 astrojuanlu

I'm going to close this PR as there's not been any response from the author for a while. If anyone is interested in trying something like this in the future, please open an issue.

merelcht avatar May 26 '25 11:05 merelcht