feat: Added a CQ ID Salt to the CQ ID generation to support overlapping sync
ISSUE
https://github.com/cloudquery/cloudquery/issues/14590 Right now the behaviour of overlapping syncs of same resource is undefined as they always refer to the same primary keys while writing
PROPOSAL
Accept a custom hash salt as a source spec which can be used as a custom salt while generating deterministic cq_ids. This way we can get different cq_ids across different syncs even for the same resource.
This holds true when we use deterministic_cq_id: true && pk_mode: _cq_id_only
TODO
Need to update source spec and other parts of Plugin SDKs to support this
kind: source
spec:
# Source spec section
name: "azure-<salt-1>"
path: "cloudquery/azure"
version: "v9.3.0"
destinations: ["postgresql"]
tables: ["azure_compute_virtual_machines"]
deterministic_id: true
cq_id_salt: salt-1 #proposed
spec:
# Optional parameters
subscriptions: [1,2,3]
--
kind: source
spec:
# Source spec section
name: "azure-<salt-2>"
path: "cloudquery/azure"
version: "v9.3.0"
destinations: ["postgresql"]
deterministic_id: true
cq_id_salt: salt-1 #proposed
tables: ["azure_compute_virtual_machines"]
spec:
# Optional parameters
subscriptions: [2]
Here we can see that subscription id is an overlapping resource across configs, here the duplication of data is expected as that can be one of the use case of User and it will be consistent across parallel runs too as we are taking pk_mode: cq_id_only
and also to note it won't alter the existing behaviour
how one can use the data without dedicated queries for the overlapping resources (those will get duplicated correct?)
We can make the queries scoped to _cq_source_name as it is different for the different syncs as defined here
this way we can bring the consistency for overlapping syncs
this way we can bring the consistency for overlapping syncs
Won't it be better in this case to drop the second source configuration? What's the reason to have it?
Agreed, but when we want to run as them as two separate sync jobs and when it runs concurrently they are overriding the source names which is leading to undesired behaviour.
This is highly inconsistent when we have the granular permission on resources and those permissions are on overlapped resources.
For e.g. We have a set of creds-1 which have the access to azure resources R1, R2, R3 Then we have a set of creds-2 which have access to azure resources R3,R4,R5
When we run with these creds we will end up in inconsistency
Ah OK I think I get it. So the use case for having overlapping syncs is having overlapping credentials?
Yeah, thats one such case
@erezrokah @yevgenypats could you please help with further steps on this
@erezrokah @yevgenypats could you please help with further steps on this
Hi, @yvardhineni, thanks for this PR! This sprint we are working on other priorities at the moment (SDKs for Python/Javascript/Java), but we will review this as soon as we are available. What is the urgency of this? Is it for internal use or commercial offering? If this is urgent - take a look at our support so we can prioritize some of the PRs reviews especially like this one that needs additional care and time. Alternatively, you can maintain a private/public fork so you don't get blocked until we merge it in.
Codecov Report
Patch and project coverage have no change.
Comparison is base (
70d12e4) 48.66% compared to head (c72d99e) 48.66%.
:exclamation: Current head c72d99e differs from pull request most recent head 9bf5699. Consider uploading reports for the commit 9bf5699 to get more accurate results
Additional details and impacted files
@@ Coverage Diff @@
## main cloudquery/plugin-sdk#1141 +/- ##
========================================
Coverage 48.66% 48.66%
========================================
Files 86 85 -1
Lines 8012 7841 -171
========================================
- Hits 3899 3816 -83
+ Misses 3760 3690 -70
+ Partials 353 335 -18
| Files Changed | Coverage Δ | |
|---|---|---|
| scheduler/scheduler.go | 53.22% <0.00%> (-0.88%) |
:arrow_down: |
| scheduler/scheduler_dfs.go | 60.89% <0.00%> (ø) |
|
| schema/resource.go | 0.00% <0.00%> (ø) |
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
@yevgenypats Could you please update your ETA to look at this PR? Please let us know if there is any concerns with the current approach
Sorry for the delayed response @yvardhineni, closing per my comment in https://github.com/cloudquery/cloudquery/issues/14590#issuecomment-2276290200
Let's use the issue for a discussion if needed