CRUD updates, dataset mutation and BlobTree API updates
This is a big batch of changes, implementing:

- A big rewrite of the `BlobTree` API to make it more coherent and simpler
- Allowing a dataset to be opened for mutation with `open(write=true)` (see the sketch below)
- A CRUD interface for modifying `DataProject`
- Changes to the data driver interface to support all this

A lot of these changes are intertwined, so I've put all this here as a draft, but I'll probably need to break it apart into separate PRs.
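As a taste of how the mutation piece fits together, here's a hypothetical sketch (the dataset name and the exact placement of the `write` keyword are illustrative assumptions, not copied from the PR):

```julia
using DataSets

# Open a dataset for mutation rather than read-only access.
# Assumes a tree-typed dataset named "my_dataset" exists in the
# active data project (name and call shape are assumptions).
open(dataset("my_dataset"), write=true) do tree
    newfile(tree, "results/summary.txt") do io
        println(io, "all done")
    end
end
```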
## BlobTree
`BlobTree` now has a largely dictionary-like interface:

- List keys (i.e., file and directory names): `keys(tree)`
- List keys and values: `pairs(tree)`
- Query keys: `haskey(tree, path)`
- Traverse the tree: `tree[path]`
- Add new content: `newdir(tree, path)`, `newfile(tree, path)`
- Delete content: `delete!(tree, path)`
Here `path` is either a relative path of `RelPath` type, or an `AbstractString` (in which case it will be split on `/` to become a relative path).
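Putting those together, a minimal sketch (the file names here are invented for illustration):

```julia
using DataSets

tree = newdir()                    # fresh temporary tree
newfile(tree, "docs/a.txt") do io  # intermediate directories are created
    println(io, "hello")
end

keys(tree)                  # lists the top-level name "docs"
haskey(tree, "docs/a.txt")  # true; the string is split on '/'
tree["docs"]["a.txt"]       # traverse one level at a time...
tree["docs/a.txt"]          # ...or with a whole relative path
delete!(tree, "docs/a.txt") # remove the file again
```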
Unlike `Dict`, iterating a `BlobTree` currently yields values rather than key-value pairs. This has some benefits, for example making it easy to map processing across the files in a directory.
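For instance, continuing the sketch above (and assuming `read(::Blob, String)` is supported, analogous to reading from an `IO`):

```julia
# Iteration yields the child Blobs/BlobTrees themselves, so per-file
# processing needs no extra key bookkeeping:
texts = [read(file, String) for file in tree["docs"]]
```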
- Property access:
  - `isdir()`, `isfile()`: determine whether a child of the tree is a directory or a file.
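For instance, with the `dir` tree constructed in the example below:

```julia
isdir(dir["1"])         # true: "1" is a subdirectory
isfile(dir["b-1.txt"])  # true: "b-1.txt" is a plain file
```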
### Example
You can create a new temporary `BlobTree` via the `newdir()` function and fill it with combinations of `newfile()` and `newdir()`:
```julia
julia> dir = newdir()
       for i = 1:3
           newfile(dir, "$i/a.txt") do io
               println(io, "Content of a")
           end
           newfile(dir, "b-$i.txt") do io
               println(io, "Content of b")
           end
       end
       dir
📂 Tree @ /tmp/jl_Sp6wMF
 📁 1
 📁 2
 📁 3
 📄 b-1.txt
 📄 b-2.txt
 📄 b-3.txt
```
You can also get access to a `BlobTree` by using `DataSets.from_path()` with a local directory name. For example:

```julia
julia> using DataSets

julia> open(DataSets.from_path(joinpath(pkgdir(DataSets), "src")))
📂 Tree @ ~/.julia/dev/DataSets/src
 📄 DataSet.jl
 📄 DataSets.jl
 📄 DataTomlStorage.jl
 ...
```
## AbstractDataProject interface additions
To support CRUD of datasets (#31) within data projects, the data project interface needs much more flexibility. I've added:
- `DataSets.create()` to create datasets (still needs some refinement, in particular the keyword parameters)
- `Base.setindex!()` to add a dataset to a project
- `DataSets.delete()` to delete datasets
- Implementations for `StackedDataProject`, `AbstractTOMLDataProject` and `TOMLDataProject`
Relatedly, I've added `DataSets.from_path()` to create a standalone `DataSet` from data on the local filesystem, inferring the type as `Blob` or `BlobTree`. This can be passed as a source to `create()` to make a copy.
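A rough sketch of how these pieces combine (hedged: `create()`'s keyword parameters are still being refined per the note above, so the `source` keyword and the paths here are placeholders):

```julia
using DataSets

proj = DataSets.load_project("Data.toml")

# Create a new dataset, copying in local data via from_path().
# The `source` keyword name is a placeholder:
DataSets.create(proj, "my_new_data";
                source=DataSets.from_path("/path/to/local/data"))

# Add an existing standalone DataSet to the project under a name:
proj["raw_data"] = DataSets.from_path("/path/to/other/data")

# Remove a dataset from the project again:
DataSets.delete(proj, "my_new_data")
```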
Still TODO here is `DataSets.config` (or some such) to update the metadata of a `DataSet`. (Alternatively, should the dataset know its owning data project and call back into that when it's updated?)
## Low level AbstractDataDriver interface
The low level driver interface is currently (in 0.2.6) just a function taking a user-defined callback.
However, to support CRUD operations for `DataProject` it needs to be expanded quite a bit, in particular to be able to create and delete storage in the storage backend. This PR adds `AbstractDataDriver` and, so far, a single implementation `FileSystemDriver`, with implementations of:

- `open_dataset` to do what the current function-based API does
- `close_dataset` to clean up any dataset resources, also indicating whether the close happened due to an exception
- `create_storage` to initialize storage
- `delete_storage` to remove storage
This interface is probably still a bit half-baked and needs some refinement.
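To make the intended shape concrete, here's a stub of a hypothetical third-party driver against this interface (all signatures are inferred from the function names above and may not match the PR exactly):

```julia
using DataSets

# Purely illustrative; none of these method signatures are final.
struct MyArchiveDriver <: DataSets.AbstractDataDriver end

function DataSets.open_dataset(driver::MyArchiveDriver, dataset, write::Bool)
    # Return the Blob/BlobTree-style storage for `dataset`, opened
    # read-only or for mutation depending on `write`.
end

function DataSets.close_dataset(storage, exc=nothing)
    # Release any resources held by `storage`; `exc` carries the
    # exception which interrupted use of the dataset, if any.
end

function DataSets.create_storage(driver::MyArchiveDriver, name; kws...)
    # Initialize backend storage for a new dataset and return the
    # storage config to be recorded in the data project.
end

function DataSets.delete_storage(driver::MyArchiveDriver, config)
    # Remove the backend storage described by `config`.
end
```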