dbt-databricks icon indicating copy to clipboard operation
dbt-databricks copied to clipboard

Add support for `INSERT ... BY NAME`

Open flvndh opened this issue 2 years ago • 4 comments

Describe the feature

Starting from Databricks Runtime 13.3 and above, it is possible to use the BY NAME clause instead of an explicit column list in an INSERT statement (documentation).

The feature would add support for this clause as part of the insert_overwrite strategy.

Additional context

This feature would be very handy for ingesting files from cloud storage.

Let's say that we have this model that inserts new partitions every day:

{{ config(materialized="incremental", incremental_strategy="insert_overwrite", partition_by=["creation_date"]) }}

SELECT *
FROM READ_FILES('/Volumes/my_catalog/my_schema/my_volume/{{ var("partition") }}/data.csv', format => 'csv')

The first time (partition=2023/09/06) the ingested data looks like this:

creation_date,a,b
2023-09-06,1,1
2023-09-06,2,2

The table produced has the following schema:

creation_date: date, a: int, b: int

The second time (partition=2023/09/07) the ingested data looks like this:

creation_date,a
2023-09-07,3
2023-09-07,4

The column b is missing. At this point, the current adapter would fail because it would execute this command:

INSERT OVERWRITE TABLE my_catalog.my_schema.my_table
PARTITION (creation_date)
SELECT
    creation_date,
    a,
    b
FROM my_table__dbt_tmp -- A view corresponding the compiled model

but b doesn't exist in the view.

This would not be the case if we instead executed this:

INSERT OVERWRITE TABLE my_catalog.my_schema.my_table
PARTITION (creation_date) BY NAME -- By name this time
SELECT *
FROM my_table__dbt_tmp

The value for b would just be NULL.

Who will this benefit?

People who wants to ingest "bronze" data (with schema evolution in mind) even in the absence of some columns.

Related issues

#334

flvndh avatar Sep 06 '23 15:09 flvndh

@flvndh just want to say thank you for writing a great issue. I'm still getting to grips with managing this project, and issues like this, where the desired changes to the config and the generated SQL are spelled out, are tremendously helpful.

benc-db avatar Sep 06 '23 17:09 benc-db

I've added the help wanted tag. @flvndh right now we are focused on preparing for 1.7.0 compatibility with dbt-core, so this work will get delayed for at least a couple of months. Please consider submitting a PR with your solution if you wish to speed up the timeline.

benc-db avatar Oct 03 '23 16:10 benc-db

I'd be glad to contribute this. As a first approach, I'd modify this macro:

https://github.com/databricks/dbt-databricks/blob/c901f8c52ae25a658cf71b6ef08cff0ffd9d9540/dbt/include/databricks/macros/materializations/incremental/strategies.sql#L24-L32

to something like:

insert overwrite table {{ target_relation }}
{{ partition_cols(label="partition") }} by name -- Using 'by name'
select * from {{ source_relation }} -- Removing columns listing

in fact getting rid of the columns listing. I'm just wondering if we should keep the current implementation for DBR < 13.3? If yes, then we would have to get that version. I see in the documentation that there is the current_version function in DBSQL available as of DBR 11.0. Is it an other way that works across more DBR versions?

Is it anything else I should consider?

flvndh avatar Oct 16 '23 14:10 flvndh

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue.

github-actions[bot] avatar Apr 14 '24 02:04 github-actions[bot]

Implemented in 1.11.0

benc-db avatar Nov 19 '25 18:11 benc-db