python-bigquery-pandas icon indicating copy to clipboard operation
python-bigquery-pandas copied to clipboard

Adding `if_exists="override" Mode

Open ismailsimsek opened this issue 4 years ago • 5 comments

Is your feature request related to a problem? Please describe. is there mode to do truncate load for existing table if_exists="override", it could be useful to avoid dropping table

Describe the solution you'd like being able to override data in existing table.

ismailsimsek avatar Sep 29 '21 12:09 ismailsimsek

Can you share a bit more about your use case? You'd like to replace the data, but keep the table intact? I suppose this is related to your other request about preserving clustering/partitioning?

tswast avatar Sep 29 '21 22:09 tswast

correct i like to replace the data, but keep the table intact. dropping table always has risk of losing table metadata, and could cause failing queries for the time between drop and create. override is safer operation.

ismailsimsek avatar Sep 30 '21 06:09 ismailsimsek

What if we call it "truncate" to match what the API is actually doing? If/when we support non-load jobs, we can run a DML "truncate" operation first via SQL.

Edit: On second thought, the "truncate" write disposition can ignore the existing schema. I think this safer replace requires a DML drop rows, followed by an append rows job, even without using streaming.

tswast avatar Sep 30 '21 17:09 tswast

could cause failing queries for the time between drop and create

To avoid these problems, a single "truncate" load job would work, but does risk changing the schema accidentally. I think some combination of writing to a staging table, then a SQL merge statement to delete existing rows and replace with rows from the staging table would be required to do this atomically without changing the table metadata.

tswast avatar Dec 23 '21 18:12 tswast

could cause failing queries for the time between drop and create

To avoid these problems, a single "truncate" load job would work, but does risk changing the schema accidentally. I think some combination of writing to a staging table, then a SQL merge statement to delete existing rows and replace with rows from the staging table would be required to do this atomically without changing the table metadata.

Could such an implementation also be considered for the replace mode? The time between "drop" and "create" can be significant for larger tables and results in the failure of downstream queries.

Could the replace mode be re-implemented with a more atomic solution, i.e.:

  1. Write to a staging location
  2. Drop the existing table
  3. Rename the staging location to the target location

What would be the downsides?

jkukul avatar Jun 27 '22 09:06 jkukul

I like the idea here, but based on current workloads and priorities, going to close this issue as will not fix.

chalmerlowe avatar Jun 08 '23 15:06 chalmerlowe