
Cube crashes when storing pre-aggregation with DuckDB (type `hugeInt`)

Open hugoJuhel opened this issue 2 years ago • 5 comments

Describe the bug

Cube Store does not support the `HUGEINT` type used by DuckDB.

I first encountered the bug with on-prem Cube, but I have been able to reproduce it on Cube Cloud. Both the cube files and screenshots below come from the cloud version.

When using Cube with the DuckDB driver, Cube fails to create the (external) pre-aggregation.

To Reproduce Steps to reproduce the behavior:

  1. Create a parquet file on S3 containing at least one integer column
  2. Connect Cube to the S3 account
  3. Create a schema where one measure is a sum over the integer column
  4. Add the measure to a pre-aggregation
  5. Cube fails when trying to create the pre-aggregation table with a `hugeInt` column to store the data

I have also created a small cube definition showcasing the bug.

Expected behavior The driver should cast the `hugeInt` to the corresponding MySQL-compatible type.

Screenshots

Schema: (image)

Playground: (image)

Pre-aggregation error: (image)

Stack trace

Error: Error during create table: CREATE TABLE prod_pre_aggregations.spam_main_w1hqvvny_l2pn4q2e_1ig1fr6 (`spam__org` varchar(255), `spam__metric` HUGEINT) LOCATION ?: Custom type 'HUGEINT' is not supported
    at WebSocket.<anonymous> (/cube/node_modules/@cubejs-backend/cubestore-driver/src/WebSocketConnection.ts:121:30)
    at WebSocket.emit (node:events:513:28)
    at Receiver.receiverOnMessage (/cube/node_modules/ws/lib/websocket.js:1008:20)
    at Receiver.emit (node:events:513:28)
    at Receiver.dataMessage (/cube/node_modules/ws/lib/receiver.js:502:14)
    at Receiver.getData (/cube/node_modules/ws/lib/receiver.js:435:17)
    at Receiver.startLoop (/cube/node_modules/ws/lib/receiver.js:143:22)
    at Receiver._write (/cube/node_modules/ws/lib/receiver.js:78:10)
    at writeOrBuffer (node:internal/streams/writable:391:12)
    at _write (node:internal/streams/writable:332:10)
    at Receiver.Writable.write (node:internal/streams/writable:336:10)
    at TLSSocket.socketOnData (/cube/node_modules/ws/lib/websocket.js:1102:35)
    at TLSSocket.emit (node:events:513:28)
    at addChunk (node:internal/streams/readable:315:12)
    at readableAddChunk (node:internal/streams/readable:289:9)
    at TLSSocket.Readable.push (node:internal/streams/readable:228:10)
    at TLSWrap.onStreamRead (node:internal/stream_base_commons:190:23)

Minimal reproducible Cube schema

cubes: 
  - name: spam
    pre_aggregations:
      - name: main
        dimensions:
          - CUBE.org
        measures:
          - CUBE.metric

    sql: SELECT 'foo' AS org, 10::hugeInt AS metric
    dimensions:
      - name: org
        sql: org
        type: string
        primary_key: true
    measures:
      - name: metric
        sql: metric
        type: sum

Version: v0.33.53

Additional context Dockerfile used to run the Cube

version: "3.9"

services:

  cube:
    network_mode: host
    image: cubejs/cube:v0.33.53
    environment:
      - CUBEJS_DB_TYPE=duckdb
      - CUBEJS_DB_DUCKDB_S3_ACCESS_KEY_ID=<REDACTED>
      - CUBEJS_DB_DUCKDB_S3_SECRET_ACCESS_KEY=<REDACTED>
      - CUBEJS_DB_DUCKDB_S3_ENDPOINT=s3.ca-central-1.amazonaws.com
      - CUBEJS_DB_DUCKDB_S3_REGION=ca-central-1
      - CUBEJS_API_SECRET=<REDACTED>
      - CUBEJS_CONCURRENCY=8
      - CUBEJS_DEV_MODE=true

    volumes:
      - .:/cube/conf

hugoJuhel avatar Sep 12 '23 19:09 hugoJuhel

If you are interested in working on this issue, please leave a comment below and we will be happy to assign the issue to you. If this is the first time you are contributing a Pull Request to Cube.js, please check our contribution guidelines. You can also post any questions while contributing in the #contributors channel in the Cube.js Slack.

github-actions[bot] avatar Sep 13 '23 20:09 github-actions[bot]

@hugoJuhel You can cast to int as a workaround.

paveltiunov avatar Sep 13 '23 20:09 paveltiunov

@hugoJuhel You can cast to int as a workaround.

@paveltiunov , thanks for your suggestion.

I tried that in Cube Cloud yesterday but, even when cast to an int in the schema definition, the pre-aggregation still casts it back to a hugeInt.

Here is the YAML definition I have been using on Cube Cloud.

cubes: 
  - name: spam
    pre_aggregations:
      - name: main
        dimensions:
          - CUBE.org
        measures:
          - CUBE.metric

    sql: SELECT 'foo' AS org, 10::int AS metric
    dimensions:
      - name: org
        sql: org
        type: string
        primary_key: true
    measures:
      - name: metric
        sql: metric::int
        type: sum

The only workaround I have found so far is casting my field to a float type.
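For reference, the float workaround amounts to a measure definition like the following (just a sketch: in DuckDB, SUM over a DOUBLE column stays DOUBLE, so Cube Store never sees a hugeInt; the exact cast spelling here is an assumption):

```yaml
    measures:
      - name: metric
        # casting to double keeps SUM(metric) typed as DOUBLE instead of HUGEINT
        sql: metric::double
        type: sum
```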

hugoJuhel avatar Sep 14 '23 13:09 hugoJuhel

any updates?

dyegoaurelio avatar Oct 03 '24 14:10 dyegoaurelio

@dyegoaurelio Not yet.

I can confirm that the behavior is still the same (the hugeInt-related error) on Cube v0.35.81 (latest in v0.35.x).

However, on v0.36.0 through v0.36.7 (latest), the query that should trigger a pre-aggregation build simply never finishes:

Screenshot (2024-10-09): (image)

igorlukanin avatar Oct 08 '24 22:10 igorlukanin

I had a look at this. If I understood correctly, pre-aggregations in this scenario need to convert database-specific types into types that Cube Store understands, and, looking at the code, the conversion happens in toGenericType().

I created some tests locally that go something like this:

within DuckDBDriver.test.ts

await driver.query('CREATE TABLE test.issue_7127_test (org VARCHAR, metric INTEGER);', []);
await driver.query('INSERT INTO test.issue_7127_test VALUES (\'foo\', 10::int), (\'bar\', 20::int);', []);
....
test('hugeInt handling for sum aggregation with cast', async () => {
    // This should simulate the exact scenario from the issue: a sum on a column cast to int
    const result = await driver.query('SELECT org, SUM(metric::int) as total FROM test.issue_7127_test GROUP BY org ORDER BY org', []);
    
    // This part passes...
    // We inserted just 2 rows with 2 different orgs, so this pretty much just returns
    // their values even though we use SUM, because we're grouping by org
    expect(result).toEqual([
      { org: 'bar', total: '20' },
      { org: 'foo', total: '10' }
    ]);
    
    // Let's check whether DuckDB actually returns a type of hugeint, using typeof()
    // https://duckdb.org/docs/stable/sql/functions/utility.html#typeofexpression
    const typeInfo = await driver.query<{ type: string }>(
      "SELECT typeof(SUM(metric::int)) as type FROM test.issue_7127_test",
      []
    );
    
    // Yes, I get a type of hugeint even if we cast metric to int
    expect(typeInfo[0].type.toLowerCase()).toEqual('hugeint');
    
    // Now, since it is a hugeint, toGenericType('hugeint') should support this and map it to bigint.
    // At this point the test fails
    expect(driver.toGenericType('hugeint')).toEqual('bigint');
  });

So, if I got that right, the fix can be as simple as:

DuckDBDriver.ts

const DuckDBToGenericType: Record<string, GenericDataBaseType> = {
  // DATE_TRUNC returns DATE, but Cube Store still doesn't support DATE type
  // DuckDB's driver transform date/timestamp to Date object, but HydrationStream converts any Date object to ISO timestamp
  // That's why It's safe to use timestamp here
  date: 'timestamp',
  hugeint: 'bigint',     // <------ just add this mapping
};
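To make the intended behavior concrete, here is a minimal stand-in for that lookup (a sketch, not the actual driver code: the fallback to the raw type name is a simplification, since the real driver defers to the base driver's generic type mapping for anything not in the DuckDB-specific map):

```typescript
type GenericDataBaseType = string;

// DuckDB-specific overrides, mirroring the mapping proposed above
const DuckDBToGenericType: Record<string, GenericDataBaseType> = {
  date: 'timestamp',
  hugeint: 'bigint', // the proposed new mapping
};

// Simplified stand-in for DuckDBDriver.toGenericType: consult the
// DuckDB-specific map first, otherwise fall back to the lowercased
// raw type name (the real driver falls back to BaseDriver instead).
function toGenericType(columnType: string): GenericDataBaseType {
  const normalized = columnType.toLowerCase();
  return DuckDBToGenericType[normalized] ?? normalized;
}

console.log(toGenericType('HUGEINT')); // → bigint
console.log(toGenericType('DATE'));    // → timestamp
console.log(toGenericType('VARCHAR')); // → varchar (falls through unchanged)
```

With this in place, the failing assertion in the test above (`toGenericType('hugeint')` returning `'bigint'`) would pass, and the pre-aggregation DDL would use a type Cube Store accepts.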

I can raise a PR for this if my analysis makes sense.

karlomedallo avatar Mar 18 '25 13:03 karlomedallo