command/storage: add versioning support
This commit adds versioning support to s5cmd.
- Added `--all-versions` flag to `ls`, `rm`, `du` and `select` subcommands to apply the operation to all versions of the objects.
- Added `--version-id` flag to `cat`, `cp`/`mv`, `rm`, `du` and `select` subcommands to apply the operation to a specific version of the object.
- Added `bucket-version` command to configure bucket versioning. Given a bucket name alone, it returns the versioning status of the bucket. Bucket versioning can be configured with the `set` flag which only accepts.
- Added `--raw` flag to `cat` and `select` subcommands. It disables the wildcard operations.
Note: Google Cloud Storage uses a different approach for versioning. So with the current implementation, s5cmd cannot use or retrieve generation numbers. However, the `bucket-version` command and the `du` command with the `all-versions` flag work as expected since they do not use version ids.
Fixes: #218 Fixes: #386
Status as of July 26 (Outdated):
- add `all-versions` flag to the following subcommands:
  - [x] ls
  - [ ] rm (only with wildcards, does not delete delete markers)
  - [x] du
- add `version-id` flag to the following subcommands:
  - [ ] cp/mv
  - [x] cat
  - [x] rm
  - [x] du
- format outputs
  - [ ] ls ...
Background
You may refer to https://github.com/peak/s5cmd/issues/386#issuecomment-1176069705 for background of changes
Current problem (I'm trying to solve):
rm uses the expandSource method, which handles keys differently depending on whether wildcards are used. So it doesn't work when all versions of a particular key are to be deleted: it just puts a delete marker, though rm successfully deletes objects when wildcards are used.
To fix this, we need to pass the value of the "all-versions" flag to expandSource. Hence I propose to:
- put the value of the all-versions flag as a field on the URL (instead of passing it to the s3 object and using it in the s3.List method).
- change the type of the src & dst fields of commands (Copy, Delete etc.) to URL (from string) accordingly.
Alternatively, we can add new parameters to expandSource method to pass all-versions flag.
Note: You can refer to Kucukaslan@df94602 to see what kind of changes I intend to make in the code, with rm.go and Delete as an example.
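The proposal above can be sketched roughly as follows. This is a minimal illustration, not the actual s5cmd code: the `URL` struct, its `AllVersions` and `VersionID` fields, the `object` type, and this `expandSource` signature are simplified stand-ins, and the remote listing is faked with an in-memory slice.

```go
package main

import (
	"fmt"
	"strings"
)

// URL is a simplified, hypothetical stand-in for s5cmd's url.URL.
type URL struct {
	Path        string
	AllVersions bool   // assumed field, set from the --all-versions flag
	VersionID   string // assumed field, set from the --version-id flag
}

// hasWildcard reports whether the path contains a glob character.
func (u URL) hasWildcard() bool { return strings.ContainsAny(u.Path, "*?") }

// object is a hypothetical listing entry: a key plus one of its versions.
type object struct {
	Key       string
	VersionID string
}

// matches is a crude matcher for this sketch: a wildcard pattern matches
// by the prefix before the first glob character, otherwise by equality.
func matches(pattern, key string) bool {
	if i := strings.IndexAny(pattern, "*?"); i >= 0 {
		return strings.HasPrefix(key, pattern[:i])
	}
	return pattern == key
}

// expandSource returns the objects an operation should act on. A plain key
// without AllVersions is returned as-is (deleting it would only add a delete
// marker). With AllVersions set, even a plain key goes through the listing,
// so every version of it is returned.
func expandSource(src URL, listing []object) []object {
	if !src.hasWildcard() && !src.AllVersions {
		return []object{{Key: src.Path, VersionID: src.VersionID}}
	}
	var out []object
	seen := map[string]bool{}
	for _, o := range listing {
		if !matches(src.Path, o.Key) {
			continue
		}
		if src.AllVersions {
			out = append(out, o)
			continue
		}
		if !seen[o.Key] { // without AllVersions, report each key once
			seen[o.Key] = true
			out = append(out, object{Key: o.Key})
		}
	}
	return out
}

func main() {
	listing := []object{{"bucket/key", "v1"}, {"bucket/key", "v2"}, {"bucket/other", "v3"}}
	for _, o := range expandSource(URL{Path: "bucket/key", AllVersions: true}, listing) {
		fmt.Println(o.Key, o.VersionID)
	}
}
```

Carrying the flag on the URL keeps the `expandSource` parameter list unchanged, which is the main difference from the alternative of adding a new parameter.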
Example usage syntax
s5cmd ls --all-versions s3://bucket/
s5cmd rm --all-versions "s3://bucket/*"
s5cmd rm --all-versions s3://bucket/key
s5cmd du --all-versions "s3://bucket/*"
s5cmd cat --version-id smUtf8Thng s3://bucket/key
s5cmd du --version-id smUtf8Thng s3://bucket/key
s5cmd rm --version-id smUtf8Thng s3://bucket/key
Up to date status
I've made the changes to the Command objects and url.URL I mentioned earlier.
Implementation
- add `all-versions` flag to the following subcommands:
  - [x] ls (including delete markers)
  - [x] rm (including delete markers)
  - [x] du
  - [x] select
- add `version-id` flag to the following subcommands:
  - [x] cp/mv
  - [x] cat
  - [x] rm
  - [x] du
  - [x] select
- Added `bucket-version` command to configure bucket versioning.
  - [x] get status
  - [x] set bucket versioning
Output formats
- [x] cp/mv: Didn't change. It is ambiguous whether the version-id to be printed belongs to the source or the destination.
- [x] cat: Didn't change. It should print the content of the file.
- [x] du: Didn't change. It is generally used for multiple objects and returns their total disk usage. `ls` can be used with `all-versions` to see the size of each version.
- [x] select: Didn't change. It should print the result of the query.
- `all-versions` flag:
  - [x] ls (including delete markers)
    Example
      2022/08/10 09:53:03 3171 log/log.go 3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1J0109VB3PCPEL0PO=
      2022/08/10 09:53:28 23 log/log.go 3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1K01TF8AJ9OV6NF7O=
      {"key":"s3://mcks5cmd/log/log.go","etag":"b96979fea4ce57766596e47d1b6cc5e1","last_modified":"2022-08-10T09:53:03.124Z","type":"file","size":3171,"storage_class":"STANDARD","version_id":"3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1J0109VB3PCPEL0PO="}
      {"key":"s3://mcks5cmd/log/log.go","etag":"05f2faf2442033698d1aa6778ca70c1b","last_modified":"2022-08-10T09:53:28.325Z","type":"file","size":23,"storage_class":"STANDARD","version_id":"3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1K01TF8AJ9OV6NF7O="}
  - [x] rm (including delete markers)
    Example
      rm s3://mcks5cmd/log/log.go 3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1J0109VB3PCPEL0PO=
      rm s3://mcks5cmd/log/log.go 3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1N03KIF6K5MOK5168=
      {"operation":"rm","success":true,"source":"s3://mcks5cmd/log/log.go","version_id":"3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1J0109VB3PCPEL0PO="}
      {"operation":"rm","success":true,"source":"s3://mcks5cmd/log/log.go","version_id":"3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1N03KIF6K5MOK5168="}
- `version-id`:
  - [x] rm
    Example
      rm s3://mcks5cmd/log/log.go 3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1K01TF8AJ9OV6NF7O=
      {"operation":"rm","success":true,"source":"s3://mcks5cmd/log/log.go","version_id":"3/60O30C1G60O30C1G60O30C1G60O30C1G60O30C1G60O30C1J0109VB3PCPEL0PO"}
Tests
- [x] prepare versioning setup for gofakes3
- testing `all-versions` flag
  - [x] ls & rm
  - [x] du
- testing `version-id` flag
  - [x] cp/mv
  - [x] cat
  - [x] rm
  - [x] du
- command validations
  - [x] cannot use both of the flags together
  - [x] these flags are only meaningful for remote files
  - [x] make validation checks reusable
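The validation rules listed above could be captured in one reusable function, roughly like this. It is a sketch under assumptions: `versioningFlags`, `isRemote`, `validateVersioning`, and the error messages are hypothetical names for illustration, and remote paths are assumed to start with `s3://`.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// versioningFlags holds the two versioning flags (hypothetical type).
type versioningFlags struct {
	AllVersions bool
	VersionID   string
}

var (
	errBothFlags = errors.New("all-versions and version-id flags cannot be used together")
	errLocalPath = errors.New("versioning flags are only supported for remote objects")
)

// isRemote is a simplified check; this sketch assumes remote URLs start with "s3://".
func isRemote(src string) bool { return strings.HasPrefix(src, "s3://") }

// validateVersioning is a single check that each subcommand could call
// from its own validation step.
func validateVersioning(f versioningFlags, src string) error {
	if f.AllVersions && f.VersionID != "" {
		return errBothFlags
	}
	if (f.AllVersions || f.VersionID != "") && !isRemote(src) {
		return errLocalPath
	}
	return nil
}

func main() {
	fmt.Println(validateVersioning(versioningFlags{AllVersions: true, VersionID: "smUtf8Thng"}, "s3://bucket/key"))
	fmt.Println(validateVersioning(versioningFlags{VersionID: "smUtf8Thng"}, "dir/file.txt"))
	fmt.Println(validateVersioning(versioningFlags{AllVersions: true}, "s3://bucket/*"))
}
```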
Google Cloud
Warning: Google Cloud Storage uses a different approach for versioning. So with the current implementation, s5cmd cannot use or retrieve generation numbers. However, with the `all-versions` flag, `du` works as expected since it does not use version ids, and `ls` lists object metadata except the generation numbers etc.
Commentary & Known Issues & Discussion Topics
- The `gofakes3` package that we use in our tests supports versioning only with the in-memory backend, so I've added another method to set up the fake server.
- There was a bug when I tried to delete from the gofakes3 server using `s5cmd rm`. Despite using the `version-id`/`all-versions` flags, the server did not permanently delete the corresponding objects and just added delete markers to them. Interestingly:
  - this bug does not happen when I use `aws s3api delete-object` to connect to the gofakes3 server.
  - this bug does not happen when I use `s5cmd rm` to connect to a real AWS S3 server.
  - other subcommands of s5cmd work as expected. I'm currently trying to identify the root cause of this bug and to fix it.

  Note: It turned out that gofakes3 does not support multidelete for versioned objects. At the moment we've fixed it in igungor's fork with https://github.com/igungor/gofakes3/pull/6. We also have a PR to fix it upstream: https://github.com/johannesboyne/gofakes3/pull/69.
- I do not distinguish between objects and delete markers when the `all-versions` flag is used.
  - Should we show the distinction in outputs?
  - Should we require yet another flag to take delete markers into account (and ignore them otherwise)?

  Note: We continue not to distinguish between objects and delete markers in this case. No special flag.
- Both S3 keys and object version IDs have a maximum length of 1024 bytes (UTF-8 string). Aligning the VersionID and Key columns in the output might therefore require a lot of whitespace, especially because we don't know in advance what their respective maximum lengths will be. Should we apply adaptive alignment? I mean: each column is aligned according to the longest element so far.

  Note: We will only left-align the key with a fixed width of "50" (?) characters and append the versionID (prefixed with a space) to it. `aws s3api` prints out JSON; `mc` has an example output here.
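The fixed-width alignment described in the note above can be sketched as follows. The function name `formatLine` and the constant `keyWidth` are hypothetical, and the width of 50 is the tentative value from the note, not a confirmed choice.

```go
package main

import "fmt"

// keyWidth is the tentative fixed column width for keys.
const keyWidth = 50

// formatLine left-aligns the key in a keyWidth-wide column and appends the
// version ID after a single space. Keys longer than keyWidth are not
// truncated; they simply push the version ID further to the right.
func formatLine(key, versionID string) string {
	return fmt.Sprintf("%-*s %s", keyWidth, key, versionID)
}

func main() {
	fmt.Println(formatLine("log/log.go", "smUtf8Thng"))
	fmt.Println(formatLine("a/much/longer/key/that/still/fits.txt", "smUtf8Thng"))
}
```

Compared with adaptive alignment, a fixed width needs no knowledge of the longest key, so lines can be printed as they arrive from the listing.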
Request for Comments
Configuring bucket versioning
Warning: We decided to add a "`bucket-version` command to configure bucket versioning. Removed bucket versioning related logic from the version command."
add `set` and `get` flags to the `version` subcommand
Alternatively we can remove the get flag and use this syntax:
$ s5cmd version
v0.0.0-dev
$ s5cmd version s3://bucket
Bucket versioning for "bucket" is "Enabled"
$ s5cmd version --set Enabled s3://mcks5cmd
Bucket versioning for "bucket" is set to "Enabled"
ps. At the moment to get bucket versioning we need to write:
$ s5cmd version --get s3://bucket
JSON marshalling storage.Object to display versionID
Warning We decided to add a VersionId field to storage.Object, just for this use case.
The JSON output should include version ids. But we marshal the storage.Object type (https://github.com/peak/s5cmd/blob/3a49799e064477c49c252d4e807cc66de685c913/command/ls.go#L294), which does not have a versionID field (https://github.com/peak/s5cmd/blob/3a49799e064477c49c252d4e807cc66de685c913/storage/storage.go#L105-L114).
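As a rough illustration of the decision to add a version field to the marshalled type, the sketch below uses a trimmed-down stand-in for `storage.Object`. The struct here is hypothetical and has only three fields; the `version_id` key matches the example `ls` output, while the `omitempty` option is an assumption made so that unversioned objects keep their current JSON shape.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Object is a trimmed-down, hypothetical stand-in for storage.Object.
type Object struct {
	Key       string `json:"key"`
	Size      int64  `json:"size"`
	VersionID string `json:"version_id,omitempty"` // assumed tag; omitted when empty
}

// toJSON marshals an Object into its JSON line representation.
func toJSON(o Object) (string, error) {
	b, err := json.Marshal(o)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	objects := []Object{
		{Key: "s3://bucket/key", Size: 23, VersionID: "smUtf8Thng"},
		{Key: "s3://bucket/key", Size: 23},
	}
	for _, o := range objects {
		line, err := toJSON(o)
		if err != nil {
			panic(err)
		}
		fmt.Println(line)
	}
}
```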
On the Google Cloud Storage
It has generation numbers analogous to S3 Version Ids.
In its REST API it uses a generation tag, while AWS S3 uses a VersionId tag. So the (Un)Marshalers in the AWS SDK do not handle the generation tag. Hence the SDK can neither read the generation number from the response nor send it with the request.
I don't think that intervening into Marshaler's logic via request handlers would be an acceptable/practical solution, even if it were to be possible.
As a last resort I've tried to modify the AWS SDK to add Generation fields to the relevant types[^fork]. It helped to read generation numbers in the List request without breaking anything else, that is, ls --all-versions worked with GCS.
However, it still failed to use those generation numbers in requests, that is, the --version-id flag and rm/cp... --all-versions did not work. I've tried a few other modifications to the SDK, but none of them worked with GCS without breaking AWS S3.
Even if these attempts had been successful, upstream would not have accepted these changes and we would need to use a custom fork.
ps. I've used the first version of AWS-SDK-GO, but I'm not optimistic that using v2 (or its middlewares) would make any difference.
RFC: How should s5cmd act when versioning flags are used with Google Cloud endpoints?
Note: Only the `bucket-version` command and the `du` command with the `--all-versions` flag work accurately with GCS.
- Should it print an error and cancel the operation?
- Should it print a warning and continue to the operation even though the result would not be the one user expected?
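To make the two options concrete, here is a small sketch that supports either behaviour behind a switch. Everything in it is an assumption for discussion: the function names, the `strict` parameter, the messages, and detecting GCS by comparing the endpoint hostname with `storage.googleapis.com`. The caller is expected to pass `versioningRequested` as false for the cases that are known to work (`bucket-version`, `du --all-versions`).

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// gcsHost is the hostname assumed to identify Google Cloud Storage.
const gcsHost = "storage.googleapis.com"

var errGCSVersioning = errors.New("versioning flags are not supported with Google Cloud Storage")

// isGoogleEndpoint reports whether the endpoint URL points at GCS.
func isGoogleEndpoint(endpoint string) bool {
	u, err := url.Parse(endpoint)
	return err == nil && u.Hostname() == gcsHost
}

// checkVersioningSupport implements both options from the RFC:
// with strict=true it returns an error so the operation can be cancelled,
// with strict=false it returns a warning and lets the operation continue.
func checkVersioningSupport(endpoint string, versioningRequested, strict bool) (string, error) {
	if !versioningRequested || !isGoogleEndpoint(endpoint) {
		return "", nil
	}
	if strict {
		return "", errGCSVersioning
	}
	return "warning: versioning flags may not behave as expected with Google Cloud Storage", nil
}

func main() {
	warning, err := checkVersioningSupport("https://storage.googleapis.com", true, false)
	fmt.Println(warning, err)
	_, err = checkVersioningSupport("https://storage.googleapis.com", true, true)
	fmt.Println(err)
}
```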
[^fork]: The attempt may be seen here.