Is this package still being maintained?

Open waynelapierre opened this issue 2 years ago • 2 comments

Seems like a great package for handling large datasets.

waynelapierre avatar Feb 06 '23 02:02 waynelapierre

diskframe.com and arrow can handle large datasets. I haven't looked at arrow recently though.

xiaodaigh avatar Feb 06 '23 10:02 xiaodaigh

Hi @waynelapierre, thanks for asking! Yes, with diskframe you can handle larger-than-memory datasets if that's what you need. To include this functionality in the fstlib library (the C++ backend of fst), and to make fst work more like a real database, I had a couple of ideas for this package that might be worth exploring:

  • we define a data set (table) as one or more fst files in a separate folder
  • the users can access these tables but the underlying structure (multiple fst files) is transparent to them
  • a set of such tables form a local database that can be accessed using the dplyr and / or data.table interfaces.
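
To make the folder-as-table idea concrete, here is a minimal sketch in Python. It is not fst or fstlib code: plain CSV files stand in for fst files, and the `FolderTable` class, its methods, and the `chunk_*.csv` naming are all hypothetical names chosen for illustration. It only shows how callers could read a table without knowing how many files sit behind it.

```python
import csv
import tempfile
from pathlib import Path


class FolderTable:
    """Hypothetical table stored as one or more chunk files in a folder.

    CSV files are a stand-in for fst files in this sketch.
    """

    def __init__(self, folder):
        self.folder = Path(folder)
        self.folder.mkdir(parents=True, exist_ok=True)

    def _chunks(self):
        # Sorted so that row order is stable across calls.
        return sorted(self.folder.glob("chunk_*.csv"))

    def append(self, rows, columns):
        """Row-bind by writing a new chunk file; existing chunks are untouched."""
        path = self.folder / f"chunk_{len(self._chunks()):05d}.csv"
        with path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=columns)
            writer.writeheader()
            writer.writerows(rows)

    def rows(self):
        """Iterate over all rows as if the table were a single file."""
        for path in self._chunks():
            with path.open(newline="") as f:
                yield from csv.DictReader(f)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        table = FolderTable(Path(tmp) / "sales")
        table.append([{"id": "1", "amount": "10"}], ["id", "amount"])
        table.append([{"id": "2", "amount": "25"}], ["id", "amount"])
        ids = [row["id"] for row in table.rows()]
        return ids, len(table._chunks())
```

Calling `demo()` writes two chunk files and reads them back as one table. Appending never rewrites earlier chunks, which is the property that would let row binding avoid a physical copy.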

The big advantage of using folders instead of single files is that on-disk sorting and merging require storing temporary (fst) files, and a folder gives those files a natural home. Also, operations like row binding or adding columns can be done on multiple files without the need to physically copy data. And with multiple files we can have more threads working on IO, which would speed up read and write times. This should work even if one of the arguments (of a merge, for example) is an in-memory table.
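
As a rough illustration of the multi-threaded IO point, the Python sketch below reads every chunk file in a folder on its own worker thread and stitches the rows back together in order. It is an assumption-laden stand-in, not fst behaviour: CSV files replace fst files, and `read_chunk`, `read_table_parallel`, and the `chunk_*.csv` naming are hypothetical. It makes no claim about actual speed-ups.

```python
import csv
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_chunk(path):
    """Read one chunk file into a list of row dicts."""
    with path.open(newline="") as f:
        return list(csv.DictReader(f))


def read_table_parallel(folder, max_workers=4):
    """Read all chunk files in a folder concurrently (hypothetical helper)."""
    paths = sorted(Path(folder).glob("chunk_*.csv"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so row order is preserved
        # even though the files are read by different threads.
        chunks = pool.map(read_chunk, paths)
        return [row for chunk in chunks for row in chunk]


def demo_parallel():
    with tempfile.TemporaryDirectory() as tmp:
        # Write three small chunk files standing in for fst files.
        for i in range(3):
            with (Path(tmp) / f"chunk_{i:05d}.csv").open("w", newline="") as f:
                writer = csv.writer(f)
                writer.writerow(["id"])
                writer.writerow([i])
        return [row["id"] for row in read_table_parallel(tmp)]
```

Because each file is an independent unit of work, the number of workers can be tuned to the storage device without changing how callers see the table.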

These are just some ideas that could speed up fst when faster PCIe 5.0 SSDs hit the market later this year, and that could solve some feature requests on fst that cannot really be solved effectively with single-file datasets 😸

MarcusKlik avatar Feb 06 '23 10:02 MarcusKlik