feat(datasets): Add option to async load and save in PartitionedDatasets
Description
- This PR lets the user load and save a `PartitionedDataset` asynchronously for the partitions provided.
- `PartitionedDataset` already supports lazy loading, which addresses memory usage. With this PR, wall-clock time is also reduced if the user chooses to save/load the partitions in parallel via the `use_async` argument.
Development notes
- An additional `use_async` argument to the `PartitionedDataset` constructor controls the async load/save.
- Based on this argument, the `_save` and `_load` methods call different private functions.
- Leveraged existing tests for `PartitionedDataset` by parameterizing the value of `use_async` using `@pytest.mark.parametrize("use_async", [True, False])`.
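The dispatch described in the notes above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ToyPartitionedDataset`, `_sync_load`, `_async_load`, and `make_loader` are hypothetical names, and the toy class loads partitions eagerly for brevity, whereas the real `PartitionedDataset` is lazy.

```python
import asyncio
from typing import Callable, Dict


class ToyPartitionedDataset:
    """Hypothetical stand-in for illustration; NOT Kedro's PartitionedDataset."""

    def __init__(self, partitions: Dict[str, Callable[[], str]], use_async: bool = False):
        self._partitions = partitions
        self._use_async = use_async

    def load(self) -> Dict[str, str]:
        # Pick a private code path based on the constructor flag.
        if self._use_async:
            return asyncio.run(self._async_load())
        return self._sync_load()

    def _sync_load(self) -> Dict[str, str]:
        # Load partitions one after another.
        return {name: loader() for name, loader in self._partitions.items()}

    async def _async_load(self) -> Dict[str, str]:
        # Run each blocking loader in a worker thread and wait for all of them.
        names = list(self._partitions)
        results = await asyncio.gather(
            *(asyncio.to_thread(self._partitions[name]) for name in names)
        )
        return dict(zip(names, results))


def make_loader(name: str) -> Callable[[], str]:
    # Assumed loader: a real one would read a file or remote object.
    def loader() -> str:
        return f"data-for-{name}"
    return loader


PARTITIONS = {name: make_loader(name) for name in ("a", "b", "c")}
sync_result = ToyPartitionedDataset(PARTITIONS, use_async=False).load()
async_result = ToyPartitionedDataset(PARTITIONS, use_async=True).load()
```

Both paths return the same mapping, which is why parametrizing the existing tests over `use_async` verifies correctness of the result but does not by itself show that the loads ran concurrently.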
Checklist
- [x] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
- [x] Updated the documentation to reflect the code changes
- [x] Added a description of this change in the relevant `RELEASE.md` file
- [x] Added tests to cover my changes
Hi @puneeter, can you please provide a description and any relevant development notes on the PR? This will make it easier for the team to review.
I updated the description. Please let me know if it needs any refactoring.
I would need the team's help to identify the right documentation to update for this change. Maybe: docs/source/data/partitioned_and_incremental_datasets.md?
Hey @puneeter, sorry for the long delay. Indeed, partitioned_and_incremental_datasets.md corresponds to https://docs.kedro.org/en/0.19.10/data/partitioned_and_incremental_datasets.html
In the end, is the usage similar to what I wrote here https://github.com/kedro-org/kedro-plugins/pull/696#discussion_r1616675036 or is it different?
Aside from that, I'll leave one more comment
@puneeter I see all the tests have been modified to take the use_async argument, but is there a way to also check that the async functionality is working?
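One possible way to check that loads really overlap, sketched under assumptions: count how many loads are in flight at the same time and compare the peak between the two modes. `load_all` and `load_one` are hypothetical helpers written for this illustration and do not exist in the PR or in Kedro.

```python
import asyncio
from typing import List, Tuple


async def load_all(names: List[str], use_async: bool) -> Tuple[List[str], int]:
    in_flight = 0  # loads currently running
    peak = 0       # highest number of simultaneous loads observed

    async def load_one(name: str) -> str:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stands in for real I/O
        in_flight -= 1
        return name.upper()

    if use_async:
        # All loads are started before any of them finishes.
        results = list(await asyncio.gather(*(load_one(n) for n in names)))
    else:
        # Each load completes before the next one starts.
        results = [await load_one(n) for n in names]
    return results, peak


names = ["p1", "p2", "p3"]
concurrent_results, concurrent_peak = asyncio.run(load_all(names, use_async=True))
sequential_results, sequential_peak = asyncio.run(load_all(names, use_async=False))
```

A test along these lines could assert that the peak equals the number of partitions in async mode and 1 otherwise, which avoids relying on wall-clock timings.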
My comments on asyncio.run above were premonitory, because we ended up finding an actual example of such breakage in a separate PR https://github.com/kedro-org/kedro/issues/4611
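For context, the general failure mode with `asyncio.run` can be reproduced in a few lines. It is not necessarily the exact failure in the linked issue. A synchronous method that calls `asyncio.run` internally works from plain scripts but raises `RuntimeError` when the caller already has an event loop running, as is the case in Jupyter notebooks. `sync_load` and `load_partition` are hypothetical names.

```python
import asyncio
from typing import Optional


async def load_partition() -> str:
    return "partition-data"


def sync_load() -> str:
    # Hypothetical synchronous entry point that hides asyncio.run inside it.
    coro = load_partition()
    try:
        return asyncio.run(coro)
    except RuntimeError:
        coro.close()  # avoid a "coroutine was never awaited" warning
        raise


async def caller_with_running_loop() -> Optional[str]:
    # Here an event loop is already running, so asyncio.run refuses to start.
    try:
        sync_load()
    except RuntimeError as exc:
        return str(exc)
    return None


outside_loop = sync_load()  # fine: no loop is running yet
inside_loop_error = asyncio.run(caller_with_running_loop())
```

Here `outside_loop` holds the loaded value, while `inside_loop_error` holds the error message raised because `asyncio.run` cannot be called from a running event loop.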
I think this had good intentions but it might actually be difficult to do. @puneeter do you mind if we turn this PR into an issue and we tackle it at some other time?
I'm going to close this PR as there's not been any response from the author for a while. If anyone is interested in trying something like this in the future, please open an issue.