
S3 Paths change local cache dir after pickling

Open Haydnspass opened this issue 11 months ago • 4 comments

Hey,

thanks for the awesome library!

Bug

I noticed that an S3Path changes the local cache dir of its attached S3Client when it is pickled and unpickled. The S3Client itself is not pickleable (boto sessions aren't), but an S3Path, even one with an explicitly specified client, pickles without any error or warning. This is a major problem for me in multiprocessing, where pickling is unavoidable: after unpickling, files are downloaded again instead of being read from the cache.

I saw https://github.com/drivendataorg/cloudpathlib/issues/450, but I encounter the error on both version 0.19 and 0.20 (Python 3.10.16).

Steps to reproduce

import pickle

import cloudpathlib as cl


# s3_client comes from somewhere else, e.g. a pytest fixture
def test_s3_pickled_client(s3_client: cl.S3Client) -> None:
    path = cl.S3Path("s3://dummy", client=s3_client)
    path_re = pickle.loads(pickle.dumps(path))

    assert path_re.client._local_cache_dir == path.client._local_cache_dir  # fails

Environment

boto3                     1.36.5
botocore                  1.36.5
cloudpathlib              0.20.0

Haydnspass avatar Feb 25 '25 12:02 Haydnspass

Thanks @Haydnspass, did you see these docs on pickling/unpickling? https://cloudpathlib.drivendata.org/stable/authentication/#pickling-cloudpath-objects

Another option is to set the env var for the local cache dir: https://cloudpathlib.drivendata.org/stable/caching/#keeping-the-cache-around

Let me know if either of these options addresses your scenario.
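For readers landing here, a minimal sketch of the env-var route, assuming the `CLOUDPATHLIB_LOCAL_CACHE_DIR` variable described on the linked caching page (the path below is just illustrative):

```python
import os

# Pin the cache location before any cloudpathlib client is created, so every
# client -- including ones constructed in worker processes after unpickling --
# resolves to the same directory. Variable name per the linked caching docs;
# the directory itself is an arbitrary example.
os.environ["CLOUDPATHLIB_LOCAL_CACHE_DIR"] = "/tmp/cloudpathlib-cache"
```

Worker processes normally inherit the parent's environment, so setting this once at startup (before creating any clients) is usually enough.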

pjbull avatar Feb 25 '25 18:02 pjbull

@pjbull thanks for the docs, indeed I missed that. Setting the client as the default did not work for me, though; I don't know whether that's some extra magic going on in multiprocessing (via joblib) or whether I did something wrong.

For the moment I use this workaround by subclassing S3Path:

from typing import Any

import cloudpathlib


class S3Path(cloudpathlib.S3Path):
    """Temporary child class for fixing the cache dir."""

    def __getstate__(self) -> dict[str, Any]:
        state = super().__getstate__()
        # Stash the client's cache dir so it survives the pickle round trip
        state["_client_cache_dir"] = self.client._local_cache_dir if self.client is not None else None
        return state

    def __setstate__(self, state: dict[str, Any]) -> None:
        # Pop the stashed value so it doesn't linger as a stray attribute
        cache_dir = state.pop("_client_cache_dir", None)
        self.__dict__ = state
        if cache_dir is not None:
            self.client._local_cache_dir = cache_dir

I'll follow up.

Haydnspass avatar Feb 27 '25 09:02 Haydnspass

@Haydnspass Do you have a minimally reproducible example we could work with to see what's happening with the multiprocessing? There might be potential workarounds or enhancements we want to do.

pjbull avatar Feb 27 '25 22:02 pjbull

@pjbull sorry for getting back to you late.

Hmm, indeed I thought my parallel processing via joblib was the problem, but that appears not to be the case in this minimal example:

from pathlib import Path

import cloudpathlib as cl
from joblib import Parallel, delayed


def local_cache_dir(path_s3: cl.S3Path) -> Path:
    # .fspath resolves to the file's location in the client's local cache
    return Path(path_s3.fspath)


def test_cache_dir(s3_client: cl.S3Client):
    s3_client.set_as_default_client()
    path_s3 = cl.S3Path("[s3 path]")

    assert str(s3_client._local_cache_dir) in str(local_cache_dir(path_s3).parent)

    out = Parallel(n_jobs=1)(delayed(local_cache_dir)(path_s3) for _ in range(1))[0]
    assert str(s3_client._local_cache_dir) in str(out)

So I figure it could be a combination of joblib and PyTorch multiprocessing, but I will have to investigate further.

Haydnspass avatar Mar 28 '25 09:03 Haydnspass