Is this package still being maintained?
Seems like a great package for handling large datasets.
disk.frame (diskframe.com) and arrow can handle large datasets. I haven't looked at arrow recently, though.
Hi @waynelapierre, thanks for asking! Yes, with disk.frame you can handle larger-than-memory datasets if that's what you need. To bring that functionality into the fstlib library (the C++ backend of fst), and to make fst work more like a real database, I had a couple of ideas for this package that might be worth exploring:
- we define a data set (table) as one or more `fst` files in a separate folder
- the users can access these tables, but the underlying structure (multiple `fst` files) is transparent to them
- a set of such tables forms a local database that can be accessed using the `dplyr` and/or `data.table` interfaces (see the sketch after this list)
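As a rough illustration, here is a minimal sketch of that folder-as-table layout using only the existing fst API. The folder name, the chunk file naming scheme, and the `read_table()` helper are hypothetical, not part of fst:

```r
library(fst)

# hypothetical layout: each chunk of the table "sales" is one fst file
# inside a folder dedicated to that table
dir.create("sales", showWarnings = FALSE)
write_fst(data.frame(id = 1:5,  amount = runif(5)), "sales/part_1.fst")
write_fst(data.frame(id = 6:10, amount = runif(5)), "sales/part_2.fst")

# hypothetical helper: present the folder as a single table by reading
# all chunks and row-binding them (a real implementation would do this
# lazily, behind a dplyr / data.table interface)
read_table <- function(folder) {
  files <- list.files(folder, pattern = "\\.fst$", full.names = TRUE)
  do.call(rbind, lapply(files, read_fst))
}

sales <- read_table("sales")  # users see one table, not the chunk files
```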
The big advantage of using folders instead of single files is that on-disk sorting and merging require storage of temporary (fst) files anyway, and those fit naturally into a folder layout. Also, operations like row binding or adding columns can be done on multiple files without the need to physically copy data (see the sketch below). And with multiple files we can have more threads working on IO, which would speed up read and write times (and this should work even if one of the arguments of, for example, a merge is an in-memory table).
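To make the no-copy claim concrete, here is a hedged continuation of the sketch above, under the same hypothetical folder layout. Appending rows amounts to writing one extra chunk file, and `threads_fst()` (an existing fst function) already controls how many threads fst uses:

```r
library(fst)

# appending rows to "sales" = writing a new chunk file; none of the
# existing chunk files is rewritten or copied
write_fst(data.frame(id = 11:15, amount = runif(5)), "sales/part_3.fst")

# fst already parallelizes compression and decompression; with multiple
# files, whole chunks could additionally be handled by independent threads
threads_fst(8)  # use 8 threads for fst operations
```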
These are just some ideas that could speed up fst when faster PCIe 5.0 SSDs hit the market later this year, and that could solve some feature requests on fst that cannot really be solved effectively with single-file datasets 😸