
DataStore to CSV service, for download of large resources.

Open cphsolutionslab opened this issue 12 years ago • 8 comments

Explanation:

I would like to use the DataStore via the API as primary data-source. This works without a problem already.

However, if people want to download the entire resource as a CSV via /dump/, it only returns 100K records (this limit is hardcoded into CKAN). Generating the CSV file also takes quite a long time.

I have resources with more than 10 million rows and would like to offer a complete CSV download, but raising the hardcoded 100K row limit puts a lot of pressure on the system. It would be very nice to have a feature where writes through the DataStore API would also update a corresponding CSV file for download, so downloads wouldn't need to generate the file on the fly.

cphsolutionslab avatar Apr 08 '14 06:04 cphsolutionslab

+1

andresmrm avatar Apr 08 '14 11:04 andresmrm

This sounds most useful

davidread avatar May 20 '14 13:05 davidread

+1 - think this seems sensible. Shouldn't be hard to link this with the filestore so one auto-pushes there ...

rufuspollock avatar May 20 '14 15:05 rufuspollock

This sounds like it could use a Celery-like service, which we're discussing in #66.

nigelbabu avatar Jul 04 '14 06:07 nigelbabu

Bumping this up now that we have background jobs: http://docs.ckan.org/en/latest/maintaining/background-tasks.html.

We could instead run a native Postgres COPY command asynchronously, which is much faster, doesn't need to load everything into memory, and skips the 100,000 row limit.
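A minimal sketch of what the COPY-based dump could look like, assuming direct psycopg2 access to the DataStore database (`copy_sql` and `dump_table_to_file` are hypothetical helper names, not existing CKAN functions):

```python
def copy_sql(table_name):
    """Build a COPY statement that streams a DataStore table as CSV.

    The table name is double-quoted (with embedded quotes doubled) because
    DataStore tables are named after resource IDs, and COPY does not accept
    ordinary query parameters for identifiers.
    """
    quoted = '"' + table_name.replace('"', '""') + '"'
    return 'COPY {} TO STDOUT WITH (FORMAT CSV, HEADER)'.format(quoted)


def dump_table_to_file(conn, table_name, path):
    """Stream a table straight to a CSV file without loading rows in memory.

    `conn` is assumed to be a psycopg2 connection to the DataStore database;
    copy_expert hands the server-generated CSV directly to the file object.
    """
    with open(path, 'wb') as f:
        with conn.cursor() as cur:
            cur.copy_expert(copy_sql(table_name), f)
```

A background job worker could call `dump_table_to_file` and then push the result to the filestore, as suggested above.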

We have several clients who want to use the Datastore transactionally for large resources, and the current dump mechanism is presenting a problem.

If we do implement it as a background task, I also suggest putting a caching mechanism in place so that this relatively expensive process is not started unnecessarily when the table has not changed.
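One hedged sketch of such a cache: compute a cheap fingerprint of the table and only regenerate the dump when it changes. Everything here is hypothetical (the helper names, the in-process cache, and the fingerprint heuristic itself, which would miss in-place updates):

```python
import hashlib

# Hypothetical in-process cache mapping table name -> fingerprint of the
# last dump generated. A real implementation would persist this, e.g. in
# the database or alongside the dump file.
_last_dumped = {}


def table_fingerprint(cursor, table_name):
    """Return a cheap change indicator for a DataStore table.

    Uses row count plus the maximum `_id` (DataStore adds an `_id` column).
    This is a heuristic, not a guarantee: rows updated in place would not
    change the fingerprint.
    """
    quoted = '"' + table_name.replace('"', '""') + '"'
    cursor.execute('SELECT COUNT(*), MAX(_id) FROM {}'.format(quoted))
    count, max_id = cursor.fetchone()
    key = '{}:{}'.format(count, max_id).encode('utf-8')
    return hashlib.sha1(key).hexdigest()


def needs_regeneration(cursor, table_name):
    """True if the table looks changed since the last dump we recorded."""
    fp = table_fingerprint(cursor, table_name)
    if _last_dumped.get(table_name) == fp:
        return False
    _last_dumped[table_name] = fp
    return True
```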

I'd also add a way to export to formats other than CSV (e.g. JSON, XLSX, XML), since other data portal solutions allow files uploaded as CSV to be downloaded in different formats. Perhaps using tools like https://github.com/lukasmartinelli/pgclimb

cc @wardi @amercader

jqnatividad avatar Dec 01 '16 23:12 jqnatividad

Streaming data on request is a nice approach too. That gives you live data and doesn't multiply your storage requirements.

edit: I've found openpyxl dumps XLSX data quite efficiently and has constant memory overhead with its write_only=True mode. CSV, JSON and XML can be streamed easily too.
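For CSV, the streaming approach can be as simple as a generator that yields one encoded line at a time, so memory use stays constant regardless of table size. A minimal sketch (the `stream_csv` name and the iterable row source are assumptions for illustration; in practice `rows` would be a server-side cursor fetching from the DataStore in batches):

```python
import csv
import io


def stream_csv(fields, rows):
    """Yield a CSV dump line by line with constant memory overhead.

    `fields` is the header row; `rows` is any iterable of row sequences.
    Each yielded chunk is one CSV-formatted line, ready to be written to
    a streaming HTTP response.
    """
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(fields)
    yield buf.getvalue()
    for row in rows:
        # Reset the buffer so each yield contains only the newest line.
        buf.seek(0)
        buf.truncate()
        writer.writerow(row)
        yield buf.getvalue()
```

The same pattern works for JSON and XML by emitting one record's serialization per iteration.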

wardi avatar Dec 02 '16 00:12 wardi

https://github.com/frictionlessdata/jsontableschema-py and https://github.com/frictionlessdata/tabulator-py can be used for various aspects of this.

pwalsh avatar Dec 02 '16 06:12 pwalsh

Here's a simple fix that reduces memory usage and allows large CSV dumps from datastore: https://github.com/ckan/ckan/pull/3344

wardi avatar Dec 02 '16 23:12 wardi