
Ability to use fsspec.generic.rsync across different filesystems

Open f4hy opened this issue 2 years ago • 10 comments

The rsync method looks like exactly what I have been looking for, but I am not sure how one would use it to, say, sync data between two different S3 buckets which require different credentials.

What I would want is something like

fs1_args = {'client_kwargs': {'aws_access_key_id': 'foo', 'aws_secret_access_key': 'bar'}}
fs2_args = {'client_kwargs': {'aws_access_key_id': 'baz', 'aws_secret_access_key': 'qux'}}
fs1 = fsspec.filesystem("s3", **fs1_args)
fs2 = fsspec.filesystem("s3", **fs2_args)
fsspec.generic.rsync("s3://bucket1/somepath", "s3://bucket2/somepath", from_fs=fs1, to_fs=fs2)

but the rsync method only takes an fs=, not a to_fs and from_fs. So how is one supposed to pass in both? Why does rsync take only one filesystem if it is meant to be able to copy across systems?

f4hy avatar Oct 24 '23 02:10 f4hy

This could be documented better... You may even be right that rsync is the most important thing in the whole module, and so other APIs should be tailored to make it as simple as possible.

The fs in question would be an instance of GenericFileSystem from the same module, which handles all the operations rsync needs by dispatch to other backends. It provides various ways to map from URL to filesystem instance, keyed by the protocol of each URL. So, if you wanted to copy s3->s3 with different instances, you would need to define a lookup for two different protocol strings.

You could for instance do

fsspec.generic._generic_fs["s3_source"] = fsspec.filesystem("s3", ....)
fsspec.generic._generic_fs["s3_target"] = fsspec.filesystem("s3", ....)
generic = fsspec.filesystem("generic", default_method="generic")

fsspec.generic.rsync(source, target, fs=generic)

where the URLs in source start with "s3_source://" and the ones in target with "s3_target://". Obviously this is more complicated than it might be for the case of two instances of the same backend!
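The lookup idea can be sketched in plain Python (a toy model of the mechanism, not fsspec's actual code; all names here are illustrative): the generic layer keeps a registry from protocol string to filesystem instance and dispatches on the URL prefix, which is why two differently-configured instances of the same backend need two distinct protocol strings.

```python
# Simplified model of GenericFileSystem's dispatch (illustrative only, not
# fsspec's actual code): a registry keyed by protocol string, so two entries
# can point at differently-configured instances of the same backend.
registry = {}

def resolve(url):
    """Return (filesystem, path) for a URL like 's3_source://bucket/key'."""
    protocol, sep, path = url.partition("://")
    if not sep:
        raise ValueError(f"no protocol in {url!r}")
    return registry[protocol], path

registry["s3_source"] = "fs-with-source-credentials"  # stand-in for an fs object
registry["s3_target"] = "fs-with-target-credentials"

fs, path = resolve("s3_source://bucket1/somepath")
print(fs, path)  # fs-with-source-credentials bucket1/somepath
```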

martindurant avatar Oct 24 '23 14:10 martindurant

ok wow, ya that's exactly what I wanted to do, but it's really not clear from the docs how to do that. So the idea is to define a new "backend" and then mangle my URIs to use the new s3_source or s3_target.

I guess the downside of this is that now I can't use the URIs that I would use in other places. Ideally I want both inputs to rsync to be s3://, since that's what the URI is (remember the I = identifier).

It seems far cleaner to have the API I proposed above, with a source_fs and target_fs, instead of having to mangle the URIs. Any reason not to also support that?

Glad there is a workaround though. Will try this out.

f4hy avatar Oct 24 '23 15:10 f4hy

Any reason not to also support that?

No reason, except that the generic filesystem came first, and so reused code rather than tailor something that was easier to use. Would you like to work on this?

martindurant avatar Oct 24 '23 15:10 martindurant

Ya, if this would be of value I'd love to contribute. I'll draft up a PR.

f4hy avatar Oct 24 '23 15:10 f4hy

I am still missing something about how the solution you mentioned can work. Trying to use the generic filesystem to load a different s3 config doesn't seem to work as expected.

fsspec.generic._generic_fs["s3_source"] = fsspec.filesystem("s3", ....)
generic = fsspec.filesystem("generic", default_method="generic")
generic.find('s3_source://mybucket/path/')

throws an error about Invalid bucket name "s3_source:": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]*:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-.]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"

since the whole path now gets passed to s3fs, which tries to validate it, and s3_source:// isn't a valid bucket name. So my guess is that the prefix needs to be stripped from what gets passed down by generic?

f4hy avatar Oct 24 '23 17:10 f4hy

Ah, that is annoying! I guess "s3" and "s3a" would work, since those are different, but still prefixes that we know and expect. You are right that a better solution is certainly warranted!
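Why "s3" and "s3a" sidestep the error can be sketched with a toy model of the backend's protocol stripping (illustrative, not s3fs's actual code): the backend removes only the prefixes it recognizes, so an invented prefix like "s3_source://" survives into the path and later fails bucket-name validation.

```python
KNOWN_PREFIXES = ("s3://", "s3a://")  # prefixes the S3 backend expects

def strip_known_protocol(path):
    """Toy model: strip only the prefixes the backend recognizes."""
    for prefix in KNOWN_PREFIXES:
        if path.startswith(prefix):
            return path[len(prefix):]
    return path  # unknown prefixes pass through untouched

print(strip_known_protocol("s3a://bucket2/somepath"))  # bucket2/somepath
print(strip_known_protocol("s3_source://mybucket/"))   # s3_source://mybucket/
```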

martindurant avatar Oct 24 '23 17:10 martindurant

I think the issue is that _strip_protocol() is broken for generic in this case.

fsspec.generic._generic_fs["s3_source"] = fsspec.filesystem("s3", ....)
generic = fsspec.filesystem("generic", default_method="generic")
generic._strip_protocol('s3_source://mybucket/')

doesn't give what you expect: it gives 's3://s3_source://mybucket', which is not right.

f4hy avatar Oct 24 '23 17:10 f4hy

ya, https://github.com/fsspec/filesystem_spec/blob/master/fsspec/generic.py#L177 is wrong. It tries to strip the protocol using s3fs, which would remove an s3://, but it needs to remove the s3_source://. Let me see if there is an easy fix.
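One sketch of the kind of fix (hypothetical; not necessarily the change that was actually made): have the generic layer strip its own registry key from the URL before dispatching, instead of delegating to the backend's _strip_protocol, which only knows its own protocol.

```python
# Hypothetical registry; the values are stand-ins for filesystem instances.
GENERIC_REGISTRY = {"s3_source": "fs1", "s3_target": "fs2"}

def generic_strip(url):
    """Strip the generic registry key (e.g. 's3_source'), not the backend's
    protocol, so the backend never sees the invented prefix."""
    protocol, sep, rest = url.partition("://")
    if sep and protocol in GENERIC_REGISTRY:
        return rest
    return url

print(generic_strip("s3_source://mybucket/path"))  # mybucket/path
print(generic_strip("s3://mybucket/path"))         # s3://mybucket/path
```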

f4hy avatar Oct 24 '23 17:10 f4hy

You can set the instance's value of protocol; that might be enough. But using "s3" and "s3a" would be a workaround for the moment.

martindurant avatar Oct 24 '23 17:10 martindurant

ok, s3 and s3a are different protocols though. Also not sure how this would work with things like HDFS, if you wanted to copy from one HDFS cluster to another.

but ok, interesting that there is a workaround of just using s3 and s3a for the two. I will still propose a PR to implement an rsync-like method that works without the generic filesystem and uses two explicit filesystems.
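A minimal sketch of what such a two-filesystem API might look like (hypothetical signature and names, not fsspec's actual rsync), with a tiny in-memory stand-in filesystem so the example is self-contained and runnable without S3 credentials:

```python
import io

def rsync_between(src_fs, src_path, dst_fs, dst_path):
    """Hypothetical sketch: copy every file under src_path on src_fs to
    dst_path on dst_fs, taking the two filesystems explicitly."""
    for f in src_fs.find(src_path):
        rel = f[len(src_path):].lstrip("/")
        with src_fs.open(f, "rb") as r, dst_fs.open(f"{dst_path.rstrip('/')}/{rel}", "wb") as w:
            w.write(r.read())

class _WriteBuf(io.BytesIO):
    """Buffer that commits its bytes to the store on close."""
    def __init__(self, store, key):
        super().__init__()
        self._store, self._key = store, key
    def close(self):
        self._store[self._key] = self.getvalue()
        super().close()

class DictFS:
    """Toy in-memory filesystem, just enough to exercise the sketch."""
    def __init__(self):
        self.files = {}
    def find(self, path):
        return sorted(k for k in self.files if k.startswith(path))
    def open(self, path, mode="rb"):
        if "w" in mode:
            return _WriteBuf(self.files, path)
        return io.BytesIO(self.files[path])

src, dst = DictFS(), DictFS()  # stand-ins for two differently-credentialed instances
src.files["bucket1/somepath/a.txt"] = b"hello"
rsync_between(src, "bucket1/somepath", dst, "bucket2/somepath")
print(dst.files)  # {'bucket2/somepath/a.txt': b'hello'}
```

No URI mangling is needed here because each filesystem object carries its own configuration, which is the point of the proposal.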

f4hy avatar Oct 24 '23 19:10 f4hy