PyAirbyte icon indicating copy to clipboard operation
PyAirbyte copied to clipboard

Feature Request - make cache compatible with RAW Json Airbyte format

Open maver1ck opened this issue 1 year ago • 10 comments

Hi, Current format of Cache is in line with Airbyte normalized v2 format. This has some drawback like lack of compatibility between PyAirbyte and Airbyte. Also schema evolution is not supported.

Adding Airbyte raw json format support may fix those issues.

maver1ck avatar Apr 03 '24 19:04 maver1ck

@maver1ck - Thanks for logging this suggestion. 🙏

We're actually launching an update today or tomorrow which will add a couple new features:

  1. Missing columns will be auto-added if they are detected missing. (Very basic schema evolution.)
  2. PyAirbyte will add support for the following 3 _airbyte_* columns: _airbyte_raw_id, _airbyte_extracted_at, and _airbyte_meta. This will make PyAirbyte closer to the Dv2 table conventions.

New docs page here show a reference for those not-yet-released features:

  • https://airbytehq.github.io/PyAirbyte/airbyte/records.html

This is also in the context of all Airbyte destinations currently on their way to the "v2" standard.

What do you think of this path forward? Any concerns?

aaronsteers avatar Apr 03 '24 22:04 aaronsteers

Decided to expand our docs to better cover this topic:

  • https://github.com/airbytehq/PyAirbyte/pull/170

aaronsteers avatar Apr 03 '24 22:04 aaronsteers

v0.9.0 is now live 🚀 : https://github.com/airbytehq/PyAirbyte/releases/tag/v0.9.0

aaronsteers avatar Apr 03 '24 23:04 aaronsteers

@aaronsteers It may help but I really want to get access to _airbyte_data column as we currently are doing normalization by ourselves using raw data.

maver1ck avatar Apr 04 '24 14:04 maver1ck

@maver1ck - I've been thinking about this more, and I'm not sure how we'd implement it. And as noted above, this could be hard to prioritize because destinations are moving to V2 style more broadly across Airbyte.

Can you help me with a few follow-up questions?

  1. Which destination(s) are you using today and looking to match with?
  2. Do you have a preference for the all-data-in-one-column approach, or is it mostly just for compatibility reasons that you are requesting?
  3. Do you use append-only mode when loading, or do you use deduping as well?

And to add a bit more tech detail regarding my question point 3 above: we likely wouldn't be able to support deduping/merge if data was only in the 'raw' column. Merge and update functions will generally require the Primary key columns to be declared as top-level columns (at least, in order to perform well), and similarly with the incremental "cursor" columns - the database can optimize filtering more effectively if those are also defined as top-level columns.

aaronsteers avatar Apr 05 '24 16:04 aaronsteers

Hi @aaronsteers We have multiple reasons for all-data-in-one-column approach.

  1. Compatibility reasons. Changing source schema doesn't require changing db schema.
  2. Datatypes. We're using dbt to do json to proper table mapping and we can set up proper MySQL types (for example varchar(255) instead of text.
  3. We're using append only. Deduplication is done by dbt.

maver1ck avatar Apr 23 '24 18:04 maver1ck

@maver1ck - Which destination connector(s) are you using today in Airbyte Cloud/OSS? And do you have the ability to run docker in your python runtimes?

We have a potential path forward, which would be to add support for destinations:

  • https://github.com/airbytehq/PyAirbyte/issues/197

Combined with a community contribution WIP:

  • #209

aaronsteers avatar May 07 '24 22:05 aaronsteers

This came up in slack in conversation with contributor @SebastienN15 :

  • https://airbytehq.slack.com/archives/C06FZ238P8W/p1715112018711929?thread_ts=1714742768.802949&cid=C06FZ238P8W

And I raised in that slack thread some inherent limitations with the raw json format:

There are some other data engineering challenges that this raises:

  1. Primary keys and cursor keys aren't indexible or optimized for querying and partition elimination.
  2. Because of #1, deduping (merge upserts) is impossible or least less efficient.
  3. Compression is worse because columns aren't typed, leading to more storage usage and worse read/write performance overall.

aaronsteers avatar May 07 '24 22:05 aaronsteers

Hi @aaronsteers I'm using MySQL as destination. I will look into this thread and I will come back to you.

maver1ck avatar May 08 '24 14:05 maver1ck

Hi @aaronsteers I rechecked V2 destinations as they are new addition to MySQL sink.

From what I understand there is still possibility to use raw data stored in airbyte_internal schema. (and we even can Disable Final Tables.)

So my proposal is to mimic this behaviour in cache. Having both:

  • raw tables
  • final tables

maver1ck avatar May 09 '24 08:05 maver1ck