experimental traits to more cleanly extend to key-value stores
This is a (very small) first step of an experimental implementation of traits to attempt to address some of the problems with this issue.
Currently, FilePathsBase makes quite a lot of assumptions about the semantics of paths, I think it's fair to say that it assumes that paths are either POSIX compatible or windows. This is arguably unnecessarily restrictive to use cases which mirror these very closely.
In this draft I had created the traits DirectoriesType and PermissionsType. The former is to indicate whether directories are their own objects in a file system or whether they are merely implied by the path structure, the latter being the case for key-value stores such as S3. PermissionsType is whether file or directory permissions are supported by the file system.
An objection I anticipate to this proposal is that key-value stores are not file systems and it is a mistake to treat them like one. This is a wise objection, however there seem to be a number of cases (notably AWS S3, but also other AWS key-value stores) in which the key-value store is so commonly treated as a file system that using methods such as those provided by FilePathsBase is unavoidable. Indeed, the idea that S3 is a file system seems to be deeply ingrained in much of the "big data" ecosystem (horrific though that ecosystem might be). Therefore, I think it is at least worth asking whether FilePathsBase can be made general enough to handle those cases more seamlessly.
The "demo" I have in this draft defines methods in terms of intermediate methods which check the interface, so that attempts to use unimplemented methods will result in method errors. The real work to be done is re-writing some of the fall-back methods so that they make sense with the provided traits (I haven't done even a single example of this yet).
Obviously, I have done very little work here, but I wanted to gauge interest and receptiveness to this idea before I put any real time into this.
Codecov Report
Merging #159 (b9bbf3a) into master (534b301) will decrease coverage by
0.29%. The diff coverage is0.00%.
@@ Coverage Diff @@
## master #159 +/- ##
==========================================
- Coverage 91.91% 91.61% -0.30%
==========================================
Files 12 12
Lines 1224 1228 +4
==========================================
Hits 1125 1125
- Misses 99 103 +4
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/FilePathsBase.jl | 71.42% <0.00%> (-28.58%) |
:arrow_down: |
| src/path.jl | 87.66% <0.00%> (-0.29%) |
:arrow_down: |
| src/system.jl | 91.33% <0.00%> (-0.62%) |
:arrow_down: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact),ø = not affected,? = missing dataPowered by Codecov. Last update 534b301...b9bbf3a. Read the comment docs.
I'm growing a little skeptical of my own idea. On one hand, I definitely stand by the idea that this package assumes too much of the underlying file system implementations, but on the other, I think we should be very careful not to completely overhaul this package based on S3 alone. Yes, in principle most of what would be done for S3 could be extended to other key-value stores, but S3 is the only use case that I feel certain will be used.
I think it would be productive if we could collectively come up with some more concrete examples of how this package might be used in the future. If those are too hard to come by, maybe generalizing this is a bad idea.
I think a few obvious examples include:
- Other cloud providers like azure
- FTP client paths https://github.com/invenia/FTPClient.jl/issues/80
- SSH filepaths
- Navigating and manipulating archives like zip or tar files
- Better integration into FileTrees.jl https://github.com/shashi/FileTrees.jl
One thing that keeps bugging me is that there's a weird overlap between Dictionaries.jl, FilePathsBase.jl, FileTrees.jl and AxisSets.jl in which we often want to just describe algorithms for search/navigating prefixes against arbitrary objects (e.g., local files, remote objects, in-memory julia types). I wonder if some of the S3 issues would be better served with a kind of index API which must be implemented independent of authorization or read/write or permissions logic?
Ok, that's definitely more examples than I had come up with. Also @omus suggested something like this in the other thread, I think not realizing I had done this.
My interest in this is renewed, I will work on it some more some time this week.
I've revisited this today. On further reflection, it does seem that extending the interface this way is a good idea especially for AWSS3, but it is slightly more complicated than I had hoped.
The main breaking change I propose to the interface as of now is that exists should always return false on directories for key-value stores. We can provide a separate function indicating whether or not there exist any files at all in a certain set of keys.
The main breaking change I propose to the interface as of now is that exists should always return false on directories for key-value stores. We can provide a separate function indicating whether or not there exist any files at all in a certain set of keys.
So the issue is that s3://mybucket/my/prefix/ is a perfectly valid object/filename on S3, separate from it being used as a prefix, so saying it doesn't exist seems equally dishonest. I think if you don't want to handle that case then it should be a method error.
I don't know yet... regardless I don't think there's going to be any avoiding some breaking changes (though I think they can be minor).