Loading large .csv files is broken
Loading large .csv files with 0.1.4 results in a blank page. There are no error messages, no progress bar, and no progress in the CLI log either. (py 3.12.2, clean env, macOS).
It seems the only way to get data-formulator working currently is to use the example data or to copy-paste small amounts.
Wondering what size of CSV you are testing with? Would like to take a look!
Roughly 2M rows and 200 columns, coming out at about 500MB. Also, I noticed Firefox will load it but become very unresponsive, while Edge (and I assume Chrome) shows a blank page.
If working with larger datasets is a target of this project, I suggest replacing the pandas DataFrame approach with DuckDB loading of CSVs in memory, and having the LLM agent write SQL queries instead. This would solve the performance issue and also open a path toward integrating *sql databases in the future.
You are right, Data Formulator is not able to handle data of that size --- partly due to using pandas, and partly because the dataset lives in the frontend most of the time, which makes UI rendering unscalable.
I have tested datasets of ~10MB, which work (though already not super smooth). Some way to integrate with a DB on the server side is a good potential approach to address this data-size issue.
If it can't handle big data, why should I use it?
We would love to expand support for large data with a backend DB (so that computation is done with SQL etc.) while the frontend renders samples or aggregated results for plotting.
The current application is geared toward exploring smaller data (e.g., experiment data), or samples or aggregates of larger datasets.
Would love to learn about some of your application scenarios --- and we'd also love more dev bandwidth to support DB integration (our dev team is currently a little small to bring it all alive...)
Some updates: we are looking into supporting virtual tables via backend databases (both cloud and local). It will take some time though. :)
The current pull request (on the dev branch) supports working with large datasets using the DATABASE option upon entering the app. Feel free to play with it now if you're feeling adventurous --- I'll test it a bit, clean it up, and merge it next week.
https://github.com/microsoft/data-formulator/pull/146