config: support checkout_jobs

Open: JohnAtl opened this issue 2 years ago • 1 comment

Bug Report

checkout: slow checkouts

Description

Checkout copies all files in parallel, which saturates the disk and leads to excessive checkout times. For example, at the time of writing, lsof shows 331 files open for the dvc process.

Reproduce

dvc pull

Expected

Moderate parallelization that respects the jobs: parameter in .dvc/config, or some similar parameter.
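
To illustrate what "parallelization in moderation" could look like, here is a minimal sketch using only the Python standard library. It is not DVC's checkout code: `copy_bounded` and its `jobs` argument are hypothetical names chosen for this example. The point is only that a fixed-size worker pool caps how many files are being copied at the same time.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_bounded(pairs, jobs=4):
    """Copy (src, dst) pairs using at most `jobs` worker threads.

    Hypothetical helper for illustration; not part of DVC.
    """
    def _copy(pair):
        src, dst = pair
        # Create the destination directory if it does not exist yet.
        Path(dst).parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)
        return dst

    # max_workers bounds the number of copies in flight at once.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # map() yields results in the same order as the input pairs.
        return list(pool.map(_copy, pairs))
```

With `jobs=4`, no more than four copies run concurrently, regardless of how many files are queued.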

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 3.11.1 (pip)
-------------------------
Platform: Python 3.10.10 on Linux-6.1.0-11-amd64-x86_64-with-glibc2.36
Subprojects:
	dvc_data = 2.10.1
	dvc_objects = 0.24.1
	dvc_render = 0.5.3
	dvc_task = 0.3.0
	scmrepo = 1.1.0
Supports:
	http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
	ssh (sshfs = 2023.7.0)
Config:
	Global: /home/john/.config/dvc
	System: /etc/xdg/dvc

Additional Information (if any):

https://discuss.dvc.org/t/is-jobs-n-ignored-on-local-stores/1768

JohnAtl · Sep 13 '23 15:09

I had a quick look at it and still need to dig deeper, but at first sight the jobs parameter (which is renamed batch_size at some point) seems to get lost between here: https://github.com/iterative/dvc-data/blob/aea2be100b0cf4c8bcdb1dc0755bcee10bff296c/src/dvc_data/hashfile/transfer.py#L224-L237

and here

https://github.com/iterative/dvc-data/blob/aea2be100b0cf4c8bcdb1dc0755bcee10bff296c/src/dvc_data/hashfile/transfer.py#L58-L67

It may be reused later somewhere via **kwargs, but I haven't had the time to look deeper into it yet.
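
To make the suspicion concrete, here is a purely hypothetical sketch of how a named parameter can get dropped in a call chain. None of these functions are the dvc-data code linked above; `copy_batch`, `transfer_dropping`, and `transfer_forwarding` are made-up names that only show the pattern.

```python
def copy_batch(pairs, batch_size=None, **kwargs):
    """Hypothetical low-level step; returns the batch size it would use."""
    return batch_size


def transfer_dropping(pairs, jobs=None, **kwargs):
    """Accepts `jobs` as a named argument but never forwards it."""
    # `jobs` is bound to the named parameter, so it is NOT in **kwargs.
    return copy_batch(pairs, **kwargs)


def transfer_forwarding(pairs, jobs=None, **kwargs):
    """Forwards `jobs` explicitly under its new name."""
    return copy_batch(pairs, batch_size=jobs, **kwargs)


print(transfer_dropping([], jobs=4))    # None
print(transfer_forwarding([], jobs=4))  # 4
```

If something like the first variant is what happens here, the limit would be silently ignored; that still needs to be checked against the actual code.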

rmic avatar Nov 26 '24 09:11 rmic