Synthesizing Relational Data with Linked Sources in Different Domains
Perhaps this is a naive question, but could SDV be used to synthesize relational data from sources located in different domains that can't be joined directly? Consider the following:
In Domain A, we have:
| ID | VarX |
|---|---|
| 1 | A |
| 2 | B |
| 3 | C |
And in Domain B, we have:
| ID | VarY |
|---|---|
| 1 | 7 |
| 2 | 8 |
| 3 | 9 |
Data from A and B cannot cross domains directly, but we would like to get the synthetics:
| ID | VarX | VarY |
|---|---|---|
| 4 | A | 7 |
| 5 | B | 8 |
| 6 | C | 9 |
Hello,
I'm not sure I understand the use case fully. What is the logic for joining them (indirectly)? What does it mean that they cannot cross domains?
The SDV is able to model data across primary/foreign key relationships in a multi-table setting. If referential integrity is maintained (the keys from one table match those in another), then it should be possible to synthesize those two original tables.
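As a rough illustration of what "referential integrity" means here, the following plain-Python sketch checks whether every key in one table has a match in the other, using the two tables from the question. It does not call SDV at all, and `orphaned_keys` is a hypothetical helper name chosen for this example. It also assumes both tables are available in one place, which is exactly what the cross-domain scenario may rule out.

```python
# Tables from the example above, as lists of rows (plain Python, no SDV calls).
table_a = [{"ID": 1, "VarX": "A"}, {"ID": 2, "VarX": "B"}, {"ID": 3, "VarX": "C"}]
table_b = [{"ID": 1, "VarY": 7}, {"ID": 2, "VarY": 8}, {"ID": 3, "VarY": 9}]


def orphaned_keys(parent_rows, child_rows, key="ID"):
    """Return child key values that have no matching row in the parent table.

    Hypothetical helper for illustration; not part of SDV.
    """
    parent_keys = {row[key] for row in parent_rows}
    child_keys = {row[key] for row in child_rows}
    return sorted(child_keys - parent_keys)


# An empty list means every ID in table_b also appears in table_a.
missing = orphaned_keys(table_a, table_b)
integrity_ok = not missing
print(integrity_ok, missing)
```

If the check reports orphaned keys, those rows would need to be fixed or dropped before treating `ID` as a key that links the two tables.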
Hi @npatki,
Two parties may hold information about the same identities but, for privacy reasons, may not want to join their data directly in an unprotected way. This paper on federated learning summarizes the challenge well.
Hi @nviets, thanks for your response!
We have some ideas about the federated learning problem - could you share some more details about your use case so that we can understand it better and suggest an approach?
Hi @katxiao, a common industry scenario might be when one company is evaluating the data of another. Both have information on a shared set of identities, but, for regulatory or legal reasons, neither can share information directly. Federated learning might solve the model-fitting problem through collaborative training, but could this approach be extended to assemble synthetic identities with the complete set of features in just one company's domain?
Hi @nviets, thanks for the example scenario. We will follow up on this thread with a suggested approach!