S3 Paths change local cache dir after pickling
Hey,
thanks for the awesome library!
Bug
I noticed that an S3Path loses the local cache dir of its attached S3Client when pickled. The S3Client itself appears not to be pickleable (boto sessions are not pickleable), yet an S3Path with an explicitly specified client pickles without any error or warning. This is a major problem for me in multiprocessing, where pickling is unavoidable: files are then downloaded again in the workers instead of being served from the cached version.
I saw https://github.com/drivendataorg/cloudpathlib/issues/450, but I encounter the error with both version 0.19 and 0.20 (Python 3.10.16).
Steps to reproduce
```python
import pickle

import cloudpathlib as cl


# get s3 client from somewhere (e.g. a pytest fixture)
def test_s3_pickled_client(s3_client: cl.S3Client) -> None:
    path = cl.S3Path("s3://dummy", client=s3_client)
    path_re = pickle.loads(pickle.dumps(path))
    assert path_re.client._local_cache_dir == path.client._local_cache_dir  # fails
```
Environment
boto3 1.36.5
botocore 1.36.5
cloudpathlib 0.20.0
Thanks @Haydnspass, did you see these docs on pickling/unpickling? https://cloudpathlib.drivendata.org/stable/authentication/#pickling-cloudpath-objects
Another option is to set the env var for the local cache dir: https://cloudpathlib.drivendata.org/stable/caching/#keeping-the-cache-around
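Concretely, that would look something like this (the env var name is taken from the caching docs linked above; the path itself is just an example):

```python
import os

# Must be set before any S3Client is constructed (e.g. at the top of the
# script, or in the environment that launches the worker processes) so that
# the parent client and any unpickled child clients resolve the same
# cache directory instead of each creating a fresh temp dir.
os.environ["CLOUDPATHLIB_LOCAL_CACHE_DIR"] = "/tmp/cloudpathlib-cache"
```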
Let me know if either of these options addresses your scenario.
@pjbull thanks for the docs, indeed I missed that. Trying to set the client as the default did not work for me, though I don't know whether that's due to some extra magic in multiprocessing (via joblib) or something I did wrong.
For the moment I use this workaround, subclassing S3Path:

```python
from typing import Any

import cloudpathlib


class S3Path(cloudpathlib.S3Path):
    """Temporary subclass that carries the cache dir through pickling."""

    def __getstate__(self) -> dict[str, Any]:
        state = super().__getstate__()
        state["_client_cache_dir"] = self.client._local_cache_dir if self.client is not None else None
        return state

    def __setstate__(self, state: dict[str, Any]) -> None:
        cache_dir = state.pop("_client_cache_dir", None)
        self.__dict__ = state
        if cache_dir is not None:
            self.client._local_cache_dir = cache_dir
```
I'll follow up
@Haydnspass Do you have a minimally reproducible example we could work with to see what's happening with the multiprocessing? There might be potential workarounds or enhancements we want to do.
@pjbull sorry for getting back to you late.
Hmm, indeed I thought my parallel processing with joblib was the problem, but that does not appear to be the case in this minimal example:
```python
from pathlib import Path

import cloudpathlib as cl
from joblib import Parallel, delayed


def local_cache_dir(path_s3: cl.S3Path) -> Path:
    return Path(path_s3.fspath)


def test_cache_dir(s3_client: cl.S3Client):
    s3_client.set_as_default_client()
    path_s3 = cl.S3Path("[s3 path]")
    assert str(s3_client._local_cache_dir) in str(local_cache_dir(path_s3).parent)
    out = Parallel(n_jobs=1)(delayed(local_cache_dir)(path_s3) for _ in range(1))[0]
    assert str(s3_client._local_cache_dir) in str(out)
```
So I figure it could be a combination of joblib and PyTorch multiprocessing, but I will have to investigate further.
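One caveat about the snippet above: with `n_jobs=1`, joblib falls back to its sequential backend and calls the function in the parent process, so `path_s3` is never actually pickled and a cache-dir reset cannot surface there. To exercise the same path a process-based backend would take, the round trip can be forced explicitly (a sketch, no joblib required):

```python
import pickle


def roundtrip(obj):
    """Approximate what a process-based backend does to each argument."""
    return pickle.loads(pickle.dumps(obj))


# e.g.: assert that local_cache_dir(roundtrip(path_s3)) still points into
# s3_client._local_cache_dir -- this fails the same way as the pickle test.
print(roundtrip({"n_jobs": 1}))  # plain objects come back unchanged
```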