Large datasets

Open tafia opened this issue 8 years ago • 3 comments

First thanks for the library!

What is the recommended approach for writing large datasets (e.g. 20+ GB CSV files)? Is there any way to stream reads and writes?

I am having a hard time finding documentation on how to use it. The only documentation I found uses data frames. I am not an expert in R, but I think data frames are in-memory only.

Also, I would ideally like to use it in a Rust program, which means I'll probably need to write a Rust binding for the required parts. Happy to share it if you want!

tafia avatar Dec 07 '17 03:12 tafia

Hi @tafia, first of all, congratulations on filing the first issue in the fstlib repository :-)

Up till now, the core C++ code of R's fst package was part of the R package itself. I've now published the library as a separate component so that it can be used from languages other than R.

As you noticed, I have yet to write documentation on the fstlib API and will do so in the coming months. In short, with the fstlib library you can and will be able to:

  • Write in-memory datasets to the file using the fst format
  • Have random access to that fst file, both row-wise and column-wise
  • Use custom type-specific compression on each column in the fst file
  • Very fast multi-threaded compression of memory blocks
  • Very fast multi-threaded hashing of memory blocks
  • Add new datasets to existing fst files (row binding) [future expansion, but the format is ready]
  • Add new columns to existing fst files (column binding) [future expansion, but the format is ready]
  • Retrieve data using on-the-fly subsetting (e.g. YEAR == 2016) without any memory overhead [future expansion, but the format is ready]
  • On-the-fly ('chunked') operations on data in a fst file; this is like applying map-reduce type algorithms to chunked data. This will be a fully multi-threaded feature. [future expansion]

The future expansion features will be developed in the coming period using the R package as a technology driver.
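To make the 'chunked' map-reduce idea concrete, here is a rough sketch in plain Rust. It is only an illustration of the pattern and makes no fstlib calls: `chunked_map_reduce` is a hypothetical helper, the data is assumed to already sit in an in-memory slice, and spawning one thread per chunk is a simplification (a real implementation would more likely use a bounded thread pool and read chunks from the file).

```rust
use std::thread;

/// Hypothetical helper (not part of fstlib): applies `map_fn` to each
/// fixed-size chunk of `data` on its own scoped thread, then combines the
/// per-chunk results with `reduce_fn`. Returns `None` for empty input.
pub fn chunked_map_reduce<T, R, M, F>(
    data: &[T],
    chunk_len: usize,
    map_fn: M,
    reduce_fn: F,
) -> Option<R>
where
    T: Sync,
    R: Send,
    M: Fn(&[T]) -> R + Sync,
    F: Fn(R, R) -> R,
{
    assert!(chunk_len > 0, "chunk_len must be positive");
    let map_fn = &map_fn; // share the mapper by reference across threads

    // Map phase: start every worker first, then join them, so that the
    // chunks are processed concurrently.
    let partials: Vec<R> = thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || map_fn(chunk)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    });

    // Reduce phase: fold the partial results in chunk order.
    partials.into_iter().reduce(reduce_fn)
}
```

For example, `chunked_map_reduce(&values, 1_000_000, |c| c.iter().sum::<u64>(), |a, b| a + b)` would sum a `Vec<u64>` one million elements at a time; only the small per-chunk results are held while reducing.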

IO operations using fstlib are designed to be as fast as possible, typically exceeding (thanks to compression) the maximum speed of an NVMe SSD. At the same time, the library will be very small, so it can easily be included in other packages or components.

Having a rust binding would be great!

MarcusKlik avatar Dec 07 '17 09:12 MarcusKlik

first of all, congratulations on filing the first issue in the fstlib repository :-)

🥇

As you noticed, I have yet to write documentation on the fstlib API and will do so in the coming months.

You sure have a lot of work to do! I certainly don't want to bother you too much. For the moment, I'll split my input file into as many chunks as necessary.
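A minimal sketch of that splitting step, using only the Rust standard library. The function name `split_csv` and the `chunk_NNNNN.csv` file naming are made up for illustration, and it assumes one record per line (no quoted fields containing newlines) with a header on the first line:

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, BufWriter, Write};
use std::path::{Path, PathBuf};

/// Streams `input` line by line and writes chunk files holding at most
/// `rows_per_chunk` data rows each, repeating the header in every chunk.
/// Returns the paths of the chunk files in the order they were written.
pub fn split_csv(
    input: &Path,
    out_dir: &Path,
    rows_per_chunk: usize,
) -> io::Result<Vec<PathBuf>> {
    assert!(rows_per_chunk > 0, "rows_per_chunk must be positive");
    let mut lines = BufReader::new(File::open(input)?).lines();
    let header = match lines.next() {
        Some(line) => line?,
        None => return Ok(Vec::new()), // empty input: nothing to write
    };

    let mut paths = Vec::new();
    let mut writer: Option<BufWriter<File>> = None;
    let mut rows_in_chunk = 0;

    for line in lines {
        let line = line?;
        // Open a new chunk for the first row and whenever the current one is full.
        if writer.is_none() || rows_in_chunk == rows_per_chunk {
            if let Some(mut full) = writer.take() {
                full.flush()?;
            }
            let path = out_dir.join(format!("chunk_{:05}.csv", paths.len()));
            let mut w = BufWriter::new(File::create(&path)?);
            writeln!(w, "{}", header)?;
            paths.push(path);
            writer = Some(w);
            rows_in_chunk = 0;
        }
        if let Some(w) = writer.as_mut() {
            writeln!(w, "{}", line)?;
        }
        rows_in_chunk += 1;
    }
    if let Some(mut last) = writer.take() {
        last.flush()?;
    }
    Ok(paths)
}
```

Only one line is held in memory at a time, so the input size does not matter; each resulting chunk can then be loaded and written out on its own.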

For the moment, I am mainly interested in creating fst files (writing in-memory datasets and saving them to disk). There are examples in tests drive, so I guess that if I manage to get Rust bindings working, it should be enough for me.

tafia avatar Dec 07 '17 10:12 tafia

That's great, please let me know if you need anything. The Visual Studio 2017 solution contains 4 projects:

  • Project fstcpp: this is a very basic implementation of a fstlib wrapper in C++ (let's say the C++ variant of the R package).
  • Project fstlib: that's the fstlib library.
  • Project fstlibtest: a Google test project to test basic functionality. Currently I mostly use this to track and debug issues that arise from the R package users. Eventually, this will be the main test repository for fstlib.
  • Project googletests: the Google library for writing unit tests

Unfortunately, I have no experience with Rust, but if you can make a wrapper for C++ code, then you should have no problems. It would be nice if you could put your work in a GitHub repository, so that we can learn from the process!

MarcusKlik avatar Dec 07 '17 11:12 MarcusKlik