Large datasets
First, thanks for the library!
What is the recommended approach for writing large datasets (e.g. 20+ GB CSV files)? Is there any way to stream reads and writes?
I have a hard time finding documentation on how to use it. The only one I found uses data frames. I am not an expert on R but I think it is in memory only.
Also, I would ideally like to use it in a Rust program, which means I'll probably need to write a Rust binding for the required parts. Happy to share it if you want!
Hi @tafia, first of all, congratulations on filing the first issue in the fstlib repository :-)
Up till now, the core C++ code of R's fst package was part of the R package itself. But now, I've published the library as a separate component to enable implementations in languages other than R.
As you noticed, I have yet to write documentation on the fstlib API and will do so in the coming months. In short, with the fstlib library you can and will be able to:
- Write in-memory datasets to the file using the `fst` format
- Have random access to that `fst` file, both row- and column-wise
- Use custom type-specific compression on each column in the `fst` file
- Very fast multi-threaded compression of memory blocks
- Very fast multi-threaded hashing of memory blocks
- Add new datasets to existing `fst` files (row-binding) (future expansion, but format is ready)
- Add new columns to existing `fst` files (column binding) (future expansion, but format is ready)
- Retrieve data using on-the-fly sub-setting (e.g. YEAR == 2016) without any memory overhead (future expansion, but format is ready)
- On-the-fly ('chunked') operations on data in a `fst` file, this is like applying map-reduce type algorithms on chunked data. This will be a fully multi-threaded feature. (future expansion)
The future expansion features will be developed in the coming period using the R package as a technology driver.
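To make the 'chunked' map-reduce idea a bit more concrete, here is a minimal Rust sketch of the general pattern: split the rows into fixed-size chunks, map each chunk on its own thread, then reduce the partial results. This is not fstlib code and says nothing about how the feature will be implemented there; the function name `chunked_map_reduce` and its parameters are made up for illustration, and it works on an in-memory slice rather than a `fst` file.

```rust
use std::thread;

/// Applies `map` to each chunk of `chunk_rows` elements on its own scoped
/// thread, then folds the per-chunk results together with `reduce`.
/// Returns `None` when `data` is empty.
///
/// HYPOTHETICAL helper for illustration only; not part of fstlib.
/// One thread per chunk keeps the sketch short; a real implementation
/// would more likely use a fixed-size thread pool.
pub fn chunked_map_reduce<T, R, M, F>(
    data: &[T],
    chunk_rows: usize,
    map: M,
    reduce: F,
) -> Option<R>
where
    T: Sync,
    R: Send,
    M: Fn(&[T]) -> R + Sync,
    F: Fn(R, R) -> R,
{
    assert!(chunk_rows > 0, "chunk_rows must be positive");
    let map = &map; // share one reference to the map function across threads

    // Map phase: every chunk is processed independently.
    let partials: Vec<R> = thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_rows)
            .map(|chunk| s.spawn(move || map(chunk)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    });

    // Reduce phase: combine the partial results in chunk order.
    partials.into_iter().reduce(reduce)
}
```

For example, counting the rows where a year column equals 2016 would use a `map` that counts matches within one chunk and a `reduce` that adds two counts.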
IO operations using fstlib are designed to be as fast as possible, typically topping (thanks to compression) the maximum speed of an NVMe SSD drive. At the same time, the library will be very small, so it can easily be included in other packages or components.
Having a rust binding would be great!
> first of all, congratulations on filing the first issue in the fstlib repository :-)
🥇
> As you noticed, I have yet to write documentation on the fstlib API and will do so in the coming months.
You sure have a lot of work to do! I certainly don't want to bother you too much. For the moment, I'll split my input file into as many chunks as necessary.
Right now, I am mainly interested in creating fst files (writing in-memory datasets and saving them to disk). There are examples in tests drive, so I guess if I manage to get Rust bindings working, it should be enough for me.
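For reference, splitting a large CSV file without loading it into memory could look roughly like the sketch below. It streams the input line by line and repeats the header in every chunk. The function name `split_csv` and the `chunk_00000.csv` naming scheme are my own choices, and it assumes UTF-8 input with one record per line (no quoted fields containing newlines).

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, BufWriter, Write};
use std::path::{Path, PathBuf};

/// Streams `input` line by line and writes files of at most
/// `rows_per_chunk` data rows into `out_dir`, repeating the header line
/// at the top of every chunk. Returns the paths of the files written.
///
/// Assumptions: UTF-8 input, one record per line.
pub fn split_csv(
    input: &Path,
    out_dir: &Path,
    rows_per_chunk: usize,
) -> io::Result<Vec<PathBuf>> {
    assert!(rows_per_chunk > 0, "rows_per_chunk must be positive");

    let mut lines = BufReader::new(File::open(input)?).lines();
    let header = match lines.next() {
        Some(line) => line?,
        None => return Ok(Vec::new()), // empty input: nothing to write
    };

    let mut paths = Vec::new();
    let mut writer: Option<BufWriter<File>> = None;
    let mut rows_in_chunk = 0;

    for line in lines {
        let line = line?;
        // Start a new chunk file on the first row and whenever the
        // current chunk is full.
        if writer.is_none() || rows_in_chunk == rows_per_chunk {
            if let Some(mut full) = writer.take() {
                full.flush()?;
            }
            let path = out_dir.join(format!("chunk_{:05}.csv", paths.len()));
            let mut w = BufWriter::new(File::create(&path)?);
            writeln!(w, "{}", header)?;
            paths.push(path);
            writer = Some(w);
            rows_in_chunk = 0;
        }
        if let Some(w) = writer.as_mut() {
            writeln!(w, "{}", line)?;
        }
        rows_in_chunk += 1;
    }

    if let Some(mut last) = writer.take() {
        last.flush()?;
    }
    Ok(paths)
}
```

Only one line is held in memory at a time, so the size of the input file does not matter; each resulting chunk can then be loaded and written out separately.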
That's great, please let me know if you need anything. The Visual Studio 2017 solution contains 4 projects:
- Project `fstcpp`: this is a very basic implementation of a `fstlib` wrapper in C++ (let's say the C++ variant of the `R` package).
- Project `fstlib`: that's the `fstlib` library.
- Project `fstlibtest`: a Google Test project to test basic functionality. Currently I mostly use this to track and debug issues that arise from the `R` package users. Eventually, this will be the main test repository for `fstlib`.
- Project `googletests`: the Google library for writing unit tests

Unfortunately, I have no experience with Rust but if you can make a wrapper for C++ code, then you should have no problems. It would be nice if you could have your work in a GitHub repository, so that we can learn from the process!
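As a rough illustration of what wrapping C++ code from Rust usually involves: Rust cannot call C++ methods directly, so a common route is a small C-compatible shim (functions with C linkage) around the C++ library, plus a safe Rust wrapper on top. The sketch below shows only the Rust side of that pattern. Everything in it is hypothetical: `shim_write_i32_column`, its signature, and its status codes are invented for illustration and are not part of fstlib. The shim is also written in Rust here purely so the example compiles without linking anything; in a real binding it would be a declaration in an `extern "C"` block, resolved against the compiled shim at link time.

```rust
use std::ffi::{c_char, c_int, CStr, CString};

// Stand-in for a C shim around the C++ library.
// HYPOTHETICAL: not an fstlib function. It writes nothing and only
// validates its arguments: 0 = success, -1 = null pointer, -2 = empty path.
unsafe extern "C" fn shim_write_i32_column(
    path: *const c_char,
    data: *const i32,
    len: usize,
) -> c_int {
    if path.is_null() || data.is_null() {
        return -1;
    }
    // The caller promises `path` is NUL-terminated and `data` points to
    // `len` readable i32 values.
    let path = unsafe { CStr::from_ptr(path) };
    let _values = unsafe { std::slice::from_raw_parts(data, len) };
    if path.to_bytes().is_empty() {
        return -2;
    }
    0
}

/// Errors surfaced by the safe wrapper.
#[derive(Debug, PartialEq)]
pub enum WriteError {
    /// The path contained an interior NUL byte and cannot become a C string.
    InvalidPath,
    /// The shim reported a non-zero status code.
    Backend(c_int),
}

/// Safe wrapper: converts Rust types to C-compatible ones, performs the
/// unsafe call, and turns the status code into a `Result`.
pub fn write_i32_column(path: &str, values: &[i32]) -> Result<(), WriteError> {
    let c_path = CString::new(path).map_err(|_| WriteError::InvalidPath)?;
    // SAFETY: `c_path` is a valid NUL-terminated string and `values`
    // points to `values.len()` initialized i32 values for the whole call.
    let status =
        unsafe { shim_write_i32_column(c_path.as_ptr(), values.as_ptr(), values.len()) };
    if status == 0 {
        Ok(())
    } else {
        Err(WriteError::Backend(status))
    }
}
```

The point of the split is that all `unsafe` stays inside one thin layer, so the rest of a Rust program only sees an ordinary function returning a `Result`.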