
Support Apache Arrow tables in database clients

Open domoritz opened this issue 3 years ago • 6 comments

https://github.com/observablehq/stdlib/blob/6058924f39cb437cf627e5621d493846ebcf6ec7/src/duckdb.js#L58 introduces a copy that may not be needed. As soon as Arrow is supported as an output format, it would be good to remove this call.

domoritz avatar Nov 11 '22 20:11 domoritz

Retitled this issue to describe the more generic problem: we want to support Apache Arrow tables as a tabular data representation throughout database clients, SQL cells, and data table cells.

mbostock avatar Nov 11 '22 21:11 mbostock

https://github.com/apache/arrow/pull/34939 adds an indexed access proxy for Arrow, but its performance isn't great compared to properly adopting Arrow. It would be great to have Arrow support throughout the different clients and cells.

domoritz avatar Apr 18 '23 22:04 domoritz
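
For readers unfamiliar with the idea, here is a minimal sketch of what an indexed access proxy over columnar data can look like. It is not the implementation from the Arrow pull request: `Columns` and `rowProxy` are hypothetical names, and plain typed arrays stand in for Arrow vectors. It also hints at one plausible source of overhead, since every `rows[i]` access goes through a proxy trap and allocates a fresh row object.

```typescript
// Hypothetical stand-in for an Arrow table: named columns of equal length.
type Columns = Record<string, ArrayLike<number>>;
type Row = Record<string, number>;

// Wrap columns in a Proxy so that rows[i] builds a row object on demand
// instead of materializing every row up front.
function rowProxy(columns: Columns, length: number): ReadonlyArray<Row> {
  const names = Object.keys(columns);
  const handler: ProxyHandler<object> = {
    get(_target, key) {
      if (key === "length") return length;
      // Numeric indices arrive as strings such as "0", "1", ...
      if (typeof key === "string" && /^\d+$/.test(key)) {
        const i = Number(key);
        if (i >= length) return undefined;
        const row: Row = {};
        for (const name of names) row[name] = columns[name][i];
        return row; // allocated on every access
      }
      return undefined;
    },
  };
  return new Proxy({}, handler) as unknown as ReadonlyArray<Row>;
}

const columns: Columns = {
  x: Float64Array.of(1, 2, 3),
  y: Float64Array.of(10, 20, 30),
};
const rows = rowProxy(columns, 3);
```

Code that only needs `rows.length` and `rows[i].x` keeps working, but code that reads the columns directly skips both the trap and the per-row allocation.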

Now that Arrow is used in a lot more places, I think it may be a good time to revisit this issue. The extra copies add overhead in many places, and I think it would be super awesome if we could just pass Arrow columns directly into Plot (https://github.com/observablehq/plot/issues/191) without making extra copies.

domoritz avatar Mar 20 '24 21:03 domoritz

FWIW, Framework’s DuckDBClient (as of 1.3) returns Apache Arrow tables without materializing array-of-objects. So there’s that.

mbostock avatar Mar 20 '24 23:03 mbostock

Oh nice. I guess you can't just remove the toArray call here for backwards compatibility?

How good is Arrow/columnar data support in Plot these days?

domoritz avatar Mar 21 '24 00:03 domoritz

That’s correct, it wouldn’t be backwards-compatible so I don’t think we are likely to change the behavior in Observable notebooks any time soon. (But eventually we’ll have a way to version control the Observable standard library, and port improvements from Observable Framework back to notebooks.)

Plot uses columnar data internally, so I would rate support as excellent, but we don’t yet have the shorthand syntax so it’s cumbersome to avoid materializing the array-of-objects — you have to pass the column vectors in yourself for each channel. https://github.com/observablehq/plot/issues/191 covers making the syntax more convenient.

mbostock avatar Mar 21 '24 00:03 mbostock
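
To illustrate "pass the column vectors in yourself for each channel", here is a rough sketch of assembling per-channel columns without creating row objects. `ColumnarResult` and `markArguments` are hypothetical helpers, typed arrays stand in for Arrow columns, and the block deliberately does not import Plot. The assumption (worth checking against Plot's documentation) is that a mark accepts an array-like `{length}` object as data and arrays as channel values.

```typescript
// Hypothetical stand-in for a query result kept in columnar form.
interface ColumnarResult {
  numRows: number;
  columns: Record<string, Float64Array>;
}

// Build a (data, options) pair for a mark: one column per channel,
// passed by reference so that no array of objects is materialized.
function markArguments(
  result: ColumnarResult,
  channels: Record<string, string>, // channel name -> column name
): { data: { length: number }; options: Record<string, Float64Array> } {
  const options: Record<string, Float64Array> = {};
  for (const [channel, columnName] of Object.entries(channels)) {
    const column = result.columns[columnName];
    if (column === undefined) throw new Error(`unknown column: ${columnName}`);
    options[channel] = column; // same typed array, no copy
  }
  return { data: { length: result.numRows }, options };
}

const result: ColumnarResult = {
  numRows: 3,
  columns: {
    date: Float64Array.of(0, 1, 2),
    close: Float64Array.of(101.5, 102.25, 99.75),
  },
};
const { data, options } = markArguments(result, { x: "date", y: "close" });
// Assumption: with Plot loaded, these would be passed along as Plot.lineY(data, options).
```

Having to spell out each channel like this is the cumbersome part; the shorthand discussed in the linked Plot issue would presumably make a helper of this kind unnecessary.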