plugin-sdk icon indicating copy to clipboard operation
plugin-sdk copied to clipboard

feat: Added a CQ ID Salt to the CQ ID generation to support overlapping sync

Open yvardhineni opened this issue 2 years ago • 10 comments

ISSUE

https://github.com/cloudquery/cloudquery/issues/14590 Right now the behaviour of overlapping syncs of same resource is undefined as they always refer to the same primary keys while writing

PROPOSAL

Accept a custom hash salt as a source spec which can be used as a custom salt while generating deterministic cq_ids. This way we can get different cq_ids across different syncs even for the same resource.

This holds true when we use deterministic_cq_id: true && pk_mode: _cq_id_only

TODO

Need to update source spec and other parts of Plugin SDKs to support this

yvardhineni avatar Aug 10 '23 05:08 yvardhineni

kind: source
spec:
  # Source spec section
  name: "azure-<salt-1>"
  path: "cloudquery/azure"
  version: "v9.3.0"
  destinations: ["postgresql"]
  tables: ["azure_compute_virtual_machines"]
  deterministic_id: true
  cq_id_salt: salt-1 #proposed
  spec:
    # Optional parameters
   subscriptions: [1,2,3]
--
kind: source
spec:
  # Source spec section
  name: "azure-<salt-2>"
  path: "cloudquery/azure"
  version: "v9.3.0"
  destinations: ["postgresql"]
  deterministic_id: true
  cq_id_salt: salt-1 #proposed
  tables: ["azure_compute_virtual_machines"]
  spec:
    # Optional parameters
    subscriptions: [2]

Here we can see that subscription id is an overlapping resource across configs, here the duplication of data is expected as that can be one of the use case of User and it will be consistent across parallel runs too as we are taking pk_mode: cq_id_only

and also to note it won't alter the existing behaviour

how one can use the data without dedicated queries for the overlapping resources (those will get duplicated correct?)

We can make the queries scoped to _cq_source_name as it is different for the different syncs as defined here

this way we can bring the consistency for overlapping syncs

yvardhineni avatar Aug 10 '23 10:08 yvardhineni

this way we can bring the consistency for overlapping syncs

Won't it be better in this case to drop the second source configuration? What's the reason to have it?

erezrokah avatar Aug 10 '23 11:08 erezrokah

Agreed, but when we want to run as them as two separate sync jobs and when it runs concurrently they are overriding the source names which is leading to undesired behaviour.

yvardhineni avatar Aug 10 '23 12:08 yvardhineni

This is highly inconsistent when we have the granular permission on resources and those permissions are on overlapped resources.

For e.g. We have a set of creds-1 which have the access to azure resources R1, R2, R3 Then we have a set of creds-2 which have access to azure resources R3,R4,R5

When we run with these creds we will end up in inconsistency

yvardhineni avatar Aug 10 '23 12:08 yvardhineni

Ah OK I think I get it. So the use case for having overlapping syncs is having overlapping credentials?

erezrokah avatar Aug 10 '23 12:08 erezrokah

Yeah, thats one such case

yvardhineni avatar Aug 10 '23 12:08 yvardhineni

@erezrokah @yevgenypats could you please help with further steps on this

yvardhineni avatar Aug 11 '23 05:08 yvardhineni

@erezrokah @yevgenypats could you please help with further steps on this

Hi, @yvardhineni, thanks for this PR! This sprint we are working on other priorities at the moment (SDKs for Python/Javascript/Java), but we will review this as soon as we are available. What is the urgency of this? Is it for internal use or commercial offering? If this is urgent - take a look at our support so we can prioritize some of the PRs reviews especially like this one that needs additional care and time. Alternatively, you can maintain a private/public fork so you don't get blocked until we merge it in.

yevgenypats avatar Aug 11 '23 07:08 yevgenypats

Codecov Report

Patch and project coverage have no change.

Comparison is base (70d12e4) 48.66% compared to head (c72d99e) 48.66%.

:exclamation: Current head c72d99e differs from pull request most recent head 9bf5699. Consider uploading reports for the commit 9bf5699 to get more accurate results

Additional details and impacted files
@@           Coverage Diff            @@
##             main    cloudquery/plugin-sdk#1141    +/-   ##
========================================
  Coverage   48.66%   48.66%            
========================================
  Files          86       85     -1     
  Lines        8012     7841   -171     
========================================
- Hits         3899     3816    -83     
+ Misses       3760     3690    -70     
+ Partials      353      335    -18     
Files Changed Coverage Δ
scheduler/scheduler.go 53.22% <0.00%> (-0.88%) :arrow_down:
scheduler/scheduler_dfs.go 60.89% <0.00%> (ø)
schema/resource.go 0.00% <0.00%> (ø)

... and 5 files with indirect coverage changes

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Aug 11 '23 09:08 codecov[bot]

@yevgenypats Could you please update your ETA to look at this PR? Please let us know if there is any concerns with the current approach

yvardhineni avatar Aug 16 '23 09:08 yvardhineni

Sorry for the delayed response @yvardhineni, closing per my comment in https://github.com/cloudquery/cloudquery/issues/14590#issuecomment-2276290200

Let's use the issue for a discussion if needed

erezrokah avatar Aug 08 '24 17:08 erezrokah