Add support for `INSERT ... BY NAME`
Describe the feature
Starting with Databricks Runtime 13.3, it is possible to use the `BY NAME` clause instead of an explicit column list in an `INSERT` statement (documentation).
This feature would add support for that clause as part of the `insert_overwrite` incremental strategy.
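For illustration, `BY NAME` matches the query's output columns to the target table's columns by name rather than by position. A minimal sketch (table and column names are placeholders):

```sql
-- Sketch: the query lists columns in a different order and omits column b;
-- BY NAME matches on column names instead of position.
INSERT INTO my_catalog.my_schema.my_table BY NAME
SELECT 1 AS a, DATE'2023-09-06' AS creation_date;
```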
Additional context
This feature would be very handy for ingesting files from cloud storage.
Let's say that we have this model that inserts new partitions every day:
```sql
{{ config(
    materialized="incremental",
    incremental_strategy="insert_overwrite",
    partition_by=["creation_date"]
) }}

SELECT *
FROM READ_FILES('/Volumes/my_catalog/my_schema/my_volume/{{ var("partition") }}/data.csv', format => 'csv')
```
The first time (partition=2023/09/06) the ingested data looks like this:
```
creation_date,a,b
2023-09-06,1,1
2023-09-06,2,2
```
The table produced has the following schema:
```
creation_date: date, a: int, b: int
```
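For reference, that table would roughly correspond to the following DDL (a sketch; the exact statement dbt generates may differ):

```sql
CREATE TABLE my_catalog.my_schema.my_table (
  creation_date DATE,
  a INT,
  b INT
)
PARTITIONED BY (creation_date);
```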
The second time (partition=2023/09/07) the ingested data looks like this:
```
creation_date,a
2023-09-07,3
2023-09-07,4
```
The column `b` is missing. At this point, the current adapter would fail because it would execute this command:

```sql
INSERT OVERWRITE TABLE my_catalog.my_schema.my_table
PARTITION (creation_date)
SELECT
  creation_date,
  a,
  b
FROM my_table__dbt_tmp -- A view corresponding to the compiled model
```

but `b` doesn't exist in the view.
This would not be the case if we instead executed this:
```sql
INSERT OVERWRITE TABLE my_catalog.my_schema.my_table
PARTITION (creation_date) BY NAME -- BY NAME this time
SELECT *
FROM my_table__dbt_tmp
```

The value for `b` would just be `NULL`.
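If that reading is right, the table after the second run would contain something like:

```
creation_date  a  b
2023-09-06     1  1
2023-09-06     2  2
2023-09-07     3  NULL
2023-09-07     4  NULL
```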
Who will this benefit?
People who want to ingest "bronze" data (with schema evolution in mind) even when some columns are missing from the incoming data.
Related issues
#334
@flvndh just want to say thank you for writing a great issue. I'm still getting to grips with managing this project, and issues like this, where the desired changes to the config and the generated SQL are spelled out, are tremendously helpful.
I've added the help wanted tag. @flvndh right now we are focused on preparing for 1.7.0 compatibility with dbt-core, so this work will get delayed for at least a couple of months. Please consider submitting a PR with your solution if you wish to speed up the timeline.
I'd be glad to contribute this. As a first approach, I'd modify this macro:
https://github.com/databricks/dbt-databricks/blob/c901f8c52ae25a658cf71b6ef08cff0ffd9d9540/dbt/include/databricks/macros/materializations/incremental/strategies.sql#L24-L32
to something like:
```sql
insert overwrite table {{ target_relation }}
{{ partition_cols(label="partition") }} by name -- Using 'by name'
select * from {{ source_relation }} -- No explicit column listing
```
in effect getting rid of the column listing. I'm just wondering whether we should keep the current implementation for DBR < 13.3. If so, we would have to detect the runtime version. I see in the documentation that the `current_version` function is available in DBSQL as of DBR 11.0. Is there another way that works across more DBR versions?
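As a first idea, the gating could look roughly like this. This is a sketch only: the `compare_dbr_version` call is an assumption about what the adapter exposes, and the exact detection mechanism (e.g. parsing `current_version()`) would need to be confirmed.

```sql
{# Sketch: use the BY NAME syntax only when the runtime is at least 13.3 #}
{% if adapter.compare_dbr_version(13, 3) >= 0 %}
    insert overwrite table {{ target_relation }}
    {{ partition_cols(label="partition") }} by name
    select * from {{ source_relation }}
{% else %}
    {# keep the existing column-listing implementation for DBR < 13.3 #}
    ...
{% endif %}
```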
Is there anything else I should consider?
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue.
Implemented in 1.11.0