dataset symlinks provided
Signed-off-by: Pawel Leszczynski [email protected]
Problem
A SymlinkDatasetFacet has been introduced in spec recently (https://github.com/OpenLineage/OpenLineage/pull/936) and it allows a dataset to be identified by multiple (name, namespace) tuples. We need to modify Marquez to handle it, as currently many joins to dataset table are done directly based on name and namespace value.
Part of: #2066
Solution
This PR contains:
- a refactor of current Marquez database model to allow alternative dataset names by creating an extra
dataset_symlinkstable with dataset name. - removal of dataset's name from dataset table.
- implementation of
SymlinkDatasetFacetlogic.
Note: All database schema changes require discussion. Please link the issue for context.
Checklist
- [x] You've signed-off your work
- [ ] Your changes are accompanied by tests (if relevant)
- [ ] Your change contains a small diff and is self-contained
- [ ] You've updated any relevant documentation (if relevant)
- [ ] You've updated the
CHANGELOG.mdwith details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary) - [ ] You've versioned your
.sqldatabase schema migration according to Flyway's naming convention (if relevant) - [ ] You've included a header in any source code files (if relevant)
Codecov Report
Merging #2087 (60c423d) into main (bb3d163) will increase coverage by
0.19%. The diff coverage is97.82%.
@@ Coverage Diff @@
## main #2087 +/- ##
============================================
+ Coverage 75.30% 75.49% +0.19%
- Complexity 1038 1045 +7
============================================
Files 203 206 +3
Lines 4883 4925 +42
Branches 399 399
============================================
+ Hits 3677 3718 +41
Misses 763 763
- Partials 443 444 +1
| Impacted Files | Coverage Δ | |
|---|---|---|
| api/src/main/java/marquez/db/Columns.java | 81.81% <ø> (ø) |
|
| ...a/marquez/db/mappers/DatasetSymlinksRowMapper.java | 90.00% <90.00%> (ø) |
|
| api/src/main/java/marquez/db/DatasetDao.java | 98.64% <100.00%> (+0.11%) |
:arrow_up: |
| ...pi/src/main/java/marquez/db/DatasetSymlinkDao.java | 100.00% <100.00%> (ø) |
|
| api/src/main/java/marquez/db/OpenLineageDao.java | 95.41% <100.00%> (+0.24%) |
:arrow_up: |
| ...main/java/marquez/db/models/DatasetSymlinkRow.java | 100.00% <100.00%> (ø) |
|
| ...main/java/marquez/service/models/LineageEvent.java | 84.12% <100.00%> (+1.07%) |
:arrow_up: |
:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more
@pawel-big-lebowski let me try to summarize changes to model: after this change, internally datasets table don't represent "full" dataset, since name is dropped - getting dataset identifier requires joining with symlinks table. datasets table still contains namespace_name - it's the same namespace name that is used by primary symlink.
Primary symlink is the one that is being send in regular dataset name, rather the ones in SymlinksDatasetFacet. That suggest question: what happens if two jobs produce the same dataset, and they have the identifiers in different order; that is one has namespace X and name A as primary, and second one has namespace Y and name B? Does the first one "win" and dataset is forever known as (X, A) pair?
@pawel-big-lebowski let me try to summarize changes to model: after this change, internally
datasetstable don't represent "full" dataset, since name is dropped - gettingdataset identifierrequires joining with symlinks table.datasetstable still containsnamespace_name- it's the same namespace name that is used byprimarysymlink.
Yes, you're right. The reason behind droping a name column is to make sure that no-one now or later tries to retrieve dataset by filtering name column in dataset table. Whenever this is required, dataset_symlinks table should be joined to make sure we look for alternative names.
The namespace_name column existed before together with namespace_uuid field. In my opinion, it should be removed but I found out it's out of scope for this PR.
Primary symlink is the one that is being send in regular dataset name, rather the ones in
SymlinksDatasetFacet. That suggest question: what happens if two jobs produce the same dataset, and they have the identifiers in different order; that is one has namespace X and name A as primary, and second one has namespace Y and name B? Does the first one "win" and dataset is forever known as (X, A) pair?
Yes, the first (namespace/name) becomes primary and I think it's the best we can do.
Given the problem the dataset symlinking aims to solve is to resolve lineage for jobs that may be accessing datasets by different names, I'm wondering if we can scope the symlinks functionality to just the lineage API. As is, its impact is pervasive and, as we saw with the job symlinks, that can have unintended consequences. I think we can start by scoping this to just the lineage API, then adjust other APIs as needed.
In order to confirm that, one should know all the Marquez features to make sure datasets are not joined by name whenever symlinks should be applied. This creates an assumption which is not stated explicitly, so is error prone in future. That's why I am in favour of an assumption that dataset_symlinks should be always use to retrieve a dataset by name. If a dataset has two different names, it gets returned to matter how we request it.
Performance wise, this should not affect postgresql. It's just an extra join of index column to retrieve extra column (not used for filtering) and an uncorrelated subquery to get possible symlinks run before existing queries. Postgres is smart and joins on foreign keys, which are just applied on result rows, should be fast. Full-scans digging jsonb columns are slow, we know that and there is proposal to fix it.
To sum up, there is a tradeoff "(1) not clean assumption in DB modelling" vs "(2) DB performance risk that should not happen but cannot be confirmed due to lack of perf tests". On one hand, I would go with (2) as it gives great chance to make it clean in future. On the other hand, I don't mind going with (1) as I haven't worked that much with Marquez DB perf issues.
@julienledem great point. I talked with @pawel-big-lebowski and we aligned to use datasets_view to provide joined data from datasets and datasets_symlinks. This way, most of the SQL busywork behind this change will be opaque to queries.
With that, only issue is that "namespaces" aren't unique anymore - methods like
List<Dataset> findAll(String namespaceName, int limit, int offset)
still need to filter by is_primary field to not return duplicates.
To sum up, there is a tradeoff "(1) not clean assumption in DB modelling" vs "(2) DB performance risk that should not happen but cannot be confirmed due to lack of perf tests". On one hand, I would go with (2) as it gives great chance to make it clean in future. On the other hand, I don't mind going with (1) as I haven't worked that much with Marquez DB perf issues.
@pawel-big-lebowski I think @collado-mike's comment: "As is, its impact is pervasive and, as we saw with the job symlinks, that can have unintended consequences" wasn't necessarily performance related (though there was follow up work on job symlinks needed to improve performance and implementation), but rather unexpected behavior given the number of queries modified. That said, I agree with you that "dataset_symlinks should be always use to retrieve a dataset by name", but would prefer we use datasets_view (introduced in https://github.com/MarquezProject/marquez/pull/2032).
Note, we're hoping to add load testing to CI at some point to help with DB performance insight, see #2047
The namespace_name column existed before together with namespace_uuid field. In my opinion, it should be removed but I found out it's out of scope for this PR.
@pawel-big-lebowski mind opening an issue to capture this change as follow up work?
@collado-mike , @mobuchowski and @wslulciuc, thank you for great reviews and feedback.
Changes applied according to that:
- will reuse a
dataset_viewintroduced by Maciej, - won't drop
namecolumn indatasetstable to avoid risky migrations and limit amount of possible unexpected behavior changes.