Struct not supported in nested data

Open onyalcin opened this issue 1 year ago • 0 comments

I hit an issue while trying to pull a nested data source structured as follows:

{
  "starred_at": "2020-07-05T01:34:55Z",
  "user": {
    "login": "onyalcin",
    "id": 7300802
  }
}

Received the following error, showing that STRUCT is not supported:

Failed to pull com.github.kamu-cli.stargazers:
  0: Internal error
  1: This feature is not implemented: Unsupported SQL type Custom(ObjectName([Ident { value: "STRUCT", quote_style: None }]), ["id", "BIGINT", "login", "STRING"])

Currently we have to add a preprocessing step with jq to work around this, which is overly complex. Could STRUCT types be supported during the read phase in DataFusion?
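For context, here is a minimal sketch of what that flattening step has to do, written in Python rather than jq so the logic is easier to follow. The helper name `flatten_record` and the `_` separator are my own choices for illustration and are not part of kamu-cli or DataFusion. It assumes the response is a JSON array of records shaped like the sample above.

```python
import json


def flatten_record(record: dict, sep: str = "_") -> dict:
    """Hypothetical helper: lift nested object fields to the top level.

    {"user": {"id": 1}} becomes {"user_id": 1}, so the read schema
    only needs scalar column types.
    """
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Recurse so deeper nesting is also flattened
            for sub_key, sub_value in flatten_record(value, sep).items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


if __name__ == "__main__":
    # Sample payload copied from the record shown in this issue
    raw = """
    [
      {
        "starred_at": "2020-07-05T01:34:55Z",
        "user": {"login": "onyalcin", "id": 7300802}
      }
    ]
    """
    # Emit newline-delimited JSON with flat columns
    for record in json.loads(raw):
        print(json.dumps(flatten_record(record)))
```

With output flattened this way, the read schema would presumably only need scalar columns such as `user_id BIGINT` and `user_login STRING`. Having to maintain a step like this for every nested source is the overhead I would like to avoid.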

Below is the dataset definition I worked with:

kind: DatasetSnapshot
version: 1
content:
  name: com.github.kamu-cli.stargazers
  kind: Root
  metadata:
    - kind: SetPollingSource
      fetch:
        kind: Url
        url: https://api.github.com/repos/kamu-data/kamu-cli/stargazers
        headers:
          - name: User-Agent
            value: kamu
          - name: Accept
            value: application/vnd.github.star+json
      read:
        kind: Json
        schema:
          - starred_at TIMESTAMP
          - user STRUCT(id BIGINT, login STRING)
      preprocess:
        kind: Sql
        engine: datafusion
        query: |
          SELECT
            starred_at as event_time,
            user.id as user_id,
            user.login as user_name
          FROM input
      merge:
        kind: Snapshot
        primaryKey:
          - event_time
          - user_id
    - kind: SetInfo
      description: Stars of the selected github repository.

onyalcin • Jul 01 '24 00:07