marquez icon indicating copy to clipboard operation
marquez copied to clipboard

dataset symlinks provided

Open pawel-big-lebowski opened this issue 3 years ago • 8 comments

Signed-off-by: Pawel Leszczynski [email protected]

Problem

A SymlinkDatasetFacet has been introduced in spec recently (https://github.com/OpenLineage/OpenLineage/pull/936) and it allows a dataset to be identified by multiple (name, namespace) tuples. We need to modify Marquez to handle it, as currently many joins to dataset table are done directly based on name and namespace value.

Part of: #2066

Solution

This PR contains:

  • a refactor of current Marquez database model to allow alternative dataset names by creating an extra dataset_symlinks table with dataset name.
  • removal of dataset's name from dataset table.
  • implementation of SymlinkDatasetFacet logic.

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

  • [x] You've signed-off your work
  • [ ] Your changes are accompanied by tests (if relevant)
  • [ ] Your change contains a small diff and is self-contained
  • [ ] You've updated any relevant documentation (if relevant)
  • [ ] You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • [ ] You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • [ ] You've included a header in any source code files (if relevant)

pawel-big-lebowski avatar Aug 25 '22 08:08 pawel-big-lebowski

Codecov Report

Merging #2087 (60c423d) into main (bb3d163) will increase coverage by 0.19%. The diff coverage is 97.82%.

@@             Coverage Diff              @@
##               main    #2087      +/-   ##
============================================
+ Coverage     75.30%   75.49%   +0.19%     
- Complexity     1038     1045       +7     
============================================
  Files           203      206       +3     
  Lines          4883     4925      +42     
  Branches        399      399              
============================================
+ Hits           3677     3718      +41     
  Misses          763      763              
- Partials        443      444       +1     
Impacted Files Coverage Δ
api/src/main/java/marquez/db/Columns.java 81.81% <ø> (ø)
...a/marquez/db/mappers/DatasetSymlinksRowMapper.java 90.00% <90.00%> (ø)
api/src/main/java/marquez/db/DatasetDao.java 98.64% <100.00%> (+0.11%) :arrow_up:
...pi/src/main/java/marquez/db/DatasetSymlinkDao.java 100.00% <100.00%> (ø)
api/src/main/java/marquez/db/OpenLineageDao.java 95.41% <100.00%> (+0.24%) :arrow_up:
...main/java/marquez/db/models/DatasetSymlinkRow.java 100.00% <100.00%> (ø)
...main/java/marquez/service/models/LineageEvent.java 84.12% <100.00%> (+1.07%) :arrow_up:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

codecov[bot] avatar Aug 26 '22 10:08 codecov[bot]

@pawel-big-lebowski let me try to summarize changes to model: after this change, internally datasets table don't represent "full" dataset, since name is dropped - getting dataset identifier requires joining with symlinks table. datasets table still contains namespace_name - it's the same namespace name that is used by primary symlink.

Primary symlink is the one that is being send in regular dataset name, rather the ones in SymlinksDatasetFacet. That suggest question: what happens if two jobs produce the same dataset, and they have the identifiers in different order; that is one has namespace X and name A as primary, and second one has namespace Y and name B? Does the first one "win" and dataset is forever known as (X, A) pair?

mobuchowski avatar Aug 30 '22 14:08 mobuchowski

@pawel-big-lebowski let me try to summarize changes to model: after this change, internally datasets table don't represent "full" dataset, since name is dropped - getting dataset identifier requires joining with symlinks table. datasets table still contains namespace_name - it's the same namespace name that is used by primary symlink.

Yes, you're right. The reason behind droping a name column is to make sure that no-one now or later tries to retrieve dataset by filtering name column in dataset table. Whenever this is required, dataset_symlinks table should be joined to make sure we look for alternative names.

The namespace_name column existed before together with namespace_uuid field. In my opinion, it should be removed but I found out it's out of scope for this PR.

Primary symlink is the one that is being send in regular dataset name, rather the ones in SymlinksDatasetFacet. That suggest question: what happens if two jobs produce the same dataset, and they have the identifiers in different order; that is one has namespace X and name A as primary, and second one has namespace Y and name B? Does the first one "win" and dataset is forever known as (X, A) pair?

Yes, the first (namespace/name) becomes primary and I think it's the best we can do.

pawel-big-lebowski avatar Aug 30 '22 16:08 pawel-big-lebowski

Given the problem the dataset symlinking aims to solve is to resolve lineage for jobs that may be accessing datasets by different names, I'm wondering if we can scope the symlinks functionality to just the lineage API. As is, its impact is pervasive and, as we saw with the job symlinks, that can have unintended consequences. I think we can start by scoping this to just the lineage API, then adjust other APIs as needed.

In order to confirm that, one should know all the Marquez features to make sure datasets are not joined by name whenever symlinks should be applied. This creates an assumption which is not stated explicitly, so is error prone in future. That's why I am in favour of an assumption that dataset_symlinks should be always use to retrieve a dataset by name. If a dataset has two different names, it gets returned to matter how we request it.

Performance wise, this should not affect postgresql. It's just an extra join of index column to retrieve extra column (not used for filtering) and an uncorrelated subquery to get possible symlinks run before existing queries. Postgres is smart and joins on foreign keys, which are just applied on result rows, should be fast. Full-scans digging jsonb columns are slow, we know that and there is proposal to fix it.

To sum up, there is a tradeoff "(1) not clean assumption in DB modelling" vs "(2) DB performance risk that should not happen but cannot be confirmed due to lack of perf tests". On one hand, I would go with (2) as it gives great chance to make it clean in future. On the other hand, I don't mind going with (1) as I haven't worked that much with Marquez DB perf issues.

pawel-big-lebowski avatar Aug 31 '22 07:08 pawel-big-lebowski

@julienledem great point. I talked with @pawel-big-lebowski and we aligned to use datasets_view to provide joined data from datasets and datasets_symlinks. This way, most of the SQL busywork behind this change will be opaque to queries.

With that, only issue is that "namespaces" aren't unique anymore - methods like

List<Dataset> findAll(String namespaceName, int limit, int offset)

still need to filter by is_primary field to not return duplicates.

mobuchowski avatar Aug 31 '22 09:08 mobuchowski

To sum up, there is a tradeoff "(1) not clean assumption in DB modelling" vs "(2) DB performance risk that should not happen but cannot be confirmed due to lack of perf tests". On one hand, I would go with (2) as it gives great chance to make it clean in future. On the other hand, I don't mind going with (1) as I haven't worked that much with Marquez DB perf issues.

@pawel-big-lebowski I think @collado-mike's comment: "As is, its impact is pervasive and, as we saw with the job symlinks, that can have unintended consequences" wasn't necessarily performance related (though there was follow up work on job symlinks needed to improve performance and implementation), but rather unexpected behavior given the number of queries modified. That said, I agree with you that "dataset_symlinks should be always use to retrieve a dataset by name", but would prefer we use datasets_view (introduced in https://github.com/MarquezProject/marquez/pull/2032).

Note, we're hoping to add load testing to CI at some point to help with DB performance insight, see #2047

wslulciuc avatar Sep 07 '22 07:09 wslulciuc

The namespace_name column existed before together with namespace_uuid field. In my opinion, it should be removed but I found out it's out of scope for this PR.

@pawel-big-lebowski mind opening an issue to capture this change as follow up work?

wslulciuc avatar Sep 07 '22 08:09 wslulciuc

@collado-mike , @mobuchowski and @wslulciuc, thank you for great reviews and feedback.

Changes applied according to that:

  • will reuse a dataset_view introduced by Maciej,
  • won't drop name column in datasets table to avoid risky migrations and limit amount of possible unexpected behavior changes.

pawel-big-lebowski avatar Sep 09 '22 10:09 pawel-big-lebowski