Cube crashes when storing pre-aggregation with DuckDB (type `hugeInt`)
**Describe the bug**
Cube Store does not support the `HUGEINT` type used by DuckDB.
I first encountered the bug with the on-prem Cube, but I have been able to reproduce it on Cube Cloud.
Both the Cube files and screenshots come from the cloud version.
When using Cube with the DuckDB driver, Cube fails at creating the (external) pre-aggregation.
**To Reproduce**
Steps to reproduce the behavior:
- Create a parquet file on S3 containing at least one integer column
- Connect Cube to the S3 account
- Create a schema where one measure is a `sum` over the integer column
- Add the measure to a pre-aggregation
- Cube will fail when trying to create the pre-aggregation table with a `hugeInt` column to store the data (see the sketch just below)
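For reference, the widening seems to happen in DuckDB itself rather than in Cube: `SUM` over an `INTEGER` expression is typed `HUGEINT`. A minimal sketch using the `duckdb` Node package (the query is illustrative, not from my actual setup):

```typescript
import * as duckdb from 'duckdb';

// Minimal sketch: DuckDB widens integer sums, so SUM over an INTEGER
// expression is typed HUGEINT (128-bit) regardless of the input width.
const db = new duckdb.Database(':memory:');

db.all("SELECT typeof(SUM(1::INTEGER)) AS sum_type", (err, rows) => {
  if (err) throw err;
  console.log(rows); // expected: [ { sum_type: 'HUGEINT' } ]
});
```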
I have also created a small cube definition showcasing the bug.
**Expected behavior**
The driver should cast the `hugeInt` to the corresponding MySQL type.
**Screenshots**
Schema:
Playground:
Pre-aggregation error:
**Stack trace**
```
Error: Error during create table: CREATE TABLE prod_pre_aggregations.spam_main_w1hqvvny_l2pn4q2e_1ig1fr6 (`spam__org` varchar(255), `spam__metric` HUGEINT) LOCATION ?: Custom type 'HUGEINT' is not supported
    at WebSocket.<anonymous> (/cube/node_modules/@cubejs-backend/cubestore-driver/src/WebSocketConnection.ts:121:30)
    at WebSocket.emit (node:events:513:28)
    at Receiver.receiverOnMessage (/cube/node_modules/ws/lib/websocket.js:1008:20)
    at Receiver.emit (node:events:513:28)
    at Receiver.dataMessage (/cube/node_modules/ws/lib/receiver.js:502:14)
    at Receiver.getData (/cube/node_modules/ws/lib/receiver.js:435:17)
    at Receiver.startLoop (/cube/node_modules/ws/lib/receiver.js:143:22)
    at Receiver._write (/cube/node_modules/ws/lib/receiver.js:78:10)
    at writeOrBuffer (node:internal/streams/writable:391:12)
    at _write (node:internal/streams/writable:332:10)
    at Receiver.Writable.write (node:internal/streams/writable:336:10)
    at TLSSocket.socketOnData (/cube/node_modules/ws/lib/websocket.js:1102:35)
    at TLSSocket.emit (node:events:513:28)
    at addChunk (node:internal/streams/readable:315:12)
    at readableAddChunk (node:internal/streams/readable:289:9)
    at TLSSocket.Readable.push (node:internal/streams/readable:228:10)
    at TLSWrap.onStreamRead (node:internal/stream_base_commons:190:23)
```
**Minimally reproducible Cube Schema**
```yaml
cubes:
  - name: spam
    pre_aggregations:
      - name: main
        dimensions:
          - CUBE.org
        measures:
          - CUBE.metric
    sql: SELECT 'foo' AS org, 10::hugeInt AS metric
    dimensions:
      - name: org
        sql: org
        type: string
        primary_key: true
    measures:
      - name: metric
        sql: metric
        type: sum
```
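For completeness, any query covering the pre-aggregated members is enough to trigger the build and the error above. A sketch using `@cubejs-client/core` (the API URL and token are placeholders, not from my actual setup):

```typescript
import cubejs from '@cubejs-client/core';

// Placeholders only: substitute your own API secret-signed token and URL.
const cubeApi = cubejs('CUBEJS_TOKEN', {
  apiUrl: 'http://localhost:4000/cubejs-api/v1',
});

// Requesting the measure/dimension pair covered by the `main` pre-aggregation
// forces Cube to build it, which is where the HUGEINT error surfaces.
cubeApi
  .load({
    measures: ['spam.metric'],
    dimensions: ['spam.org'],
  })
  .then((resultSet) => console.log(resultSet.tablePivot()))
  .catch((e) => console.error(e));
```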
**Version**: v0.33.53
**Additional context**
docker-compose file used to run Cube:
```yaml
version: "3.9"
services:
  cube:
    network_mode: host
    image: cubejs/cube:v0.33.53
    environment:
      - CUBEJS_DB_TYPE=duckdb
      - CUBEJS_DB_DUCKDB_S3_ACCESS_KEY_ID=<REDACTED>
      - CUBEJS_DB_DUCKDB_S3_SECRET_ACCESS_KEY=<REDACTED>
      - CUBEJS_DB_DUCKDB_S3_ENDPOINT=s3.ca-central-1.amazonaws.com
      - CUBEJS_DB_DUCKDB_S3_REGION=ca-central-1
      - CUBEJS_API_SECRET=<REDACTED>
      - CUBEJS_CONCURRENCY=8
      - CUBEJS_DEV_MODE=true
    volumes:
      - .:/cube/conf
```
@hugoJuhel You can cast to int as a workaround.
@paveltiunov, thanks for your suggestion.
I tried that in Cube Cloud yesterday, but even when cast to an int in the schema definition, the pre-aggregation still casts it back to a `hugeInt`.
Here is the YAML definition I have been using on Cube Cloud.
```yaml
cubes:
  - name: spam
    pre_aggregations:
      - name: main
        dimensions:
          - CUBE.org
        measures:
          - CUBE.metric
    sql: SELECT 'foo' AS org, 10::int AS metric
    dimensions:
      - name: org
        sql: org
        type: string
        primary_key: true
    measures:
      - name: metric
        sql: metric::int
        type: sum
```
The only workaround I have found so far is casting my field to a float type.
any updates?
@dyegoaurelio Not yet.
I can confirm that the behavior is still the same (hugeInt-related error) on Cube v0.35.81 (latest in v0.35.x).
However, on v0.36.0 through v0.36.7 (latest), the query that should trigger a pre-aggregation build just never finishes.
I had a look at this...
If I understand how pre-aggregations work in this scenario, Cube needs to convert database-specific types into types that Cube Store understands, and looking at the code, the conversion happens in `toGenericType()`.
I created some tests locally that go something like the following, within `DuckDBDriver.test.ts`:
```typescript
await driver.query('CREATE TABLE test.issue_7127_test (org VARCHAR, metric INTEGER);', []);
await driver.query('INSERT INTO test.issue_7127_test VALUES (\'foo\', 10::int), (\'bar\', 20::int);', []);

// ....

test('hugeInt handling for sum aggregation with cast', async () => {
  // I hope this simulates the exact scenario from the issue - a sum on a column cast to int
  const result = await driver.query('SELECT org, SUM(metric::int) as total FROM test.issue_7127_test GROUP BY org ORDER BY org', []);

  // This part passes: we inserted just 2 rows with 2 different orgs, so SUM
  // effectively returns their individual values because we're grouping by org.
  expect(result).toEqual([
    { org: 'bar', total: '20' },
    { org: 'foo', total: '10' }
  ]);

  // Let's check whether DuckDB actually returns a type of hugeint, using typeof():
  // https://duckdb.org/docs/stable/sql/functions/utility.html#typeofexpression
  const typeInfo = await driver.query<{ type: string }>(
    "SELECT typeof(SUM(metric::int)) as type FROM test.issue_7127_test",
    []
  );

  // Yes, I get a type of hugeint even though we cast metric to int.
  expect(typeInfo[0].type.toLowerCase()).toEqual('hugeint');

  // Now, since it is a hugeint, toGenericType('hugeint') should map it to bigint.
  // At this point it fails.
  expect(driver.toGenericType('hugeint')).toEqual('bigint');
});
```
So, if I got that right, the fix can be as simple as this, in `DuckDBDriver.ts`:
```typescript
const DuckDBToGenericType: Record<string, GenericDataBaseType> = {
  // DATE_TRUNC returns DATE, but Cube Store still doesn't support the DATE type.
  // DuckDB's driver transforms date/timestamp into a Date object, but HydrationStream
  // converts any Date object to an ISO timestamp; that's why it's safe to use timestamp here.
  date: 'timestamp',
  hugeint: 'bigint', // <------ just add this mapping
};
```
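For context on why the missing entry matters: as far as I can tell, when the map has no entry, the raw database type falls through to Cube Store unchanged, which matches the `Custom type 'HUGEINT' is not supported` error above. A simplified, hypothetical sketch of that lookup (the real `toGenericType()` implementation in the Cube codebase may differ):

```typescript
type GenericDataBaseType = string;

// Hypothetical simplification of the driver's type-mapping lookup;
// the actual BaseDriver/DuckDBDriver code may differ in detail.
const DuckDBToGenericType: Record<string, GenericDataBaseType> = {
  date: 'timestamp',
  hugeint: 'bigint', // the proposed fix
};

function toGenericType(columnType: string): GenericDataBaseType {
  const normalized = columnType.toLowerCase();
  // Without a 'hugeint' entry, the raw type falls through to Cube Store,
  // which rejects it: "Custom type 'HUGEINT' is not supported".
  return DuckDBToGenericType[normalized] ?? normalized;
}

// With the mapping: toGenericType('HUGEINT') === 'bigint'.
// Without it: toGenericType('HUGEINT') === 'hugeint', which Cube Store rejects.
```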
I can raise a PR for this if my analysis makes sense.