datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

MINOR: Add branching model to contrib guide

Open andygrove opened this issue 3 years ago • 21 comments

Which issue does this PR close?

N/A

Rationale for this change

It is getting challenging to create PRs that depend on recent changes in dependencies such as arrow and sqlparser that have many breaking changes.

What changes are included in this PR?

New docs

Are there any user-facing changes?

No

andygrove avatar Aug 18 '22 16:08 andygrove

@alamb @tustvold @avantgardnerio @jdye64 WDYT?

andygrove avatar Aug 18 '22 16:08 andygrove

I think this makes a lot of sense. Doesn't seem like too much maintenance overhead to have to maintain and merge a few more branches and really helps unblock new feature development. I am for this.

jdye64 avatar Aug 18 '22 16:08 jdye64

I have created a sqlparser-0.21 branch so we can try this out and see how well it works.

andygrove avatar Aug 18 '22 16:08 andygrove

I think as both crates have a fairly regular release cadence this is fine. The risk would be if any of these branches live longer than a couple of weeks, as then the opportunity for a potentially non-trivial, error-prone final integration increases.

FWIW I have typically just maintained draft PRs that build on each other for this purpose, and then rebased them individually to get them merged following the release.

One thing that might be worth clarifying is how the final merge occurs, I presume we would rebase the integration branch and then do a fast forward merge? I ask as currently the only option would be a squash merge, which would lose any history from the integration branch (which may not be a problem but should be made explicit)

tustvold avatar Aug 18 '22 16:08 tustvold

I think as both crates have a fairly regular release cadence this is fine. The risk would be if any of these branches live longer than a couple of weeks, as then the opportunity for a potentially non-trivial, error-prone final integration increases.

Agreed. I guess I am now volunteering to take on sqlparser releases on a more frequent schedule. The release process is mostly automated and there is no voting since it is outside of Apache, so this shouldn't be too much work.

FWIW I have typically just maintained draft PRs that build on each other for this purpose, and then rebased them individually to get them merged following the release.

This works fine when one author is making the changes but recently there are more contributors making independent changes to sqlparser so this is harder to manage.

One thing that might be worth clarifying is how the final merge occurs, I presume we would rebase the integration branch and then do a fast forward merge? I ask as currently the only option would be a squash merge, which would lose any history from the integration branch (which may not be a problem but should be made explicit)

Good points. I will address this.

andygrove avatar Aug 18 '22 17:08 andygrove

One thing that might be worth clarifying is how the final merge occurs, I presume we would rebase the integration branch and then do a fast forward merge?

I'm not sure if this idea is doable. If it is, I suspect we'd want to:

  1. make a commit in datafusion that's one ahead of master that only updates the arrow version= to ref=
  2. active work being done still goes on master
  3. we periodically rebase that commit/branch onto master
  4. If there's no breaking changes to arrow, the build keeps working and it's not a big deal
  5. If there are breaking changes to the arrow API then datafusion changes are required
  6. those changes also are done on the next_release branch
  7. but now it's a pain because if you want to start building a new feature, do you do it off master or next_release?
  8. By logical inference, anything you do this way with arrow you also have to do with sqlparser and up the chain to ballista :(

A possible solution is to do the reverse:

  1. we keep master always pointed at a ref= to arrow
  2. we periodically cut release branches as we have been
  3. when we cut the release branch, we point it at a version of arrow

That way we can keep closer sync between the two repos, and @tustvold , myself, and anyone else doing repo-spanning features don't have to duplicate effort of fixing builds. Are there other possibilities? I don't know. These are all gross, as dependency hell tends to be an unsolved problem unless there are stable APIs.

avantgardnerio avatar Aug 18 '22 18:08 avantgardnerio

One more option:

We have one PR for upgrading to the next version of sqlparser/arrow Others create PRs against that PR Once the release is available, we point to that release then merge the PR into master

I think this preserves contributor information.

andygrove avatar Aug 18 '22 20:08 andygrove

next_release

I think the intention is not to have a single next_release branch, but a staging point for the features that require the specific upstream change. Changes that don't will continue to target master (or potentially another integration branch). IMO this is a reasonable evolution of the process we've effectively adopted thus far, using a dedicated branch instead of a draft PR, to allow for easier collaboration. I therefore think this makes sense and is workable.

A possible solution is to do the reverse

The downside with having git refs in master, which we used to have previously for arrow, is repos that track datafusion by git sha then have a non-trivial job tracking the correct upstream versions, and are prevented from using patch to override dependencies. It's also not clear what happens if it comes time to cut a release, but there is still a pinned git sha on master?

tustvold avatar Aug 18 '22 21:08 tustvold

a reasonable evolution of the process we've effectively adopted thus far

I like evolution.

  1. Who does it?
  2. When do they do it?
  3. What's the naming convention?

avantgardnerio avatar Aug 18 '22 21:08 avantgardnerio

Very good questions, I guess the first person creates a draft PR and if another person wants to collaborate on the work they flag it and a maintainer can create the branch?? Although they probably could also submit PRs to the fork? 🤔

tustvold avatar Aug 18 '22 21:08 tustvold

  1. Who: the first person who needs it?
  2. When: when they need it?
  3. PR to non-master branch v[next version here]

?

avantgardnerio avatar Aug 18 '22 21:08 avantgardnerio

PR to non-master branch v[next version here]

Probably want the name of the dependency in there somewhere

tustvold avatar Aug 18 '22 22:08 tustvold

name of the dependency

I was thinking the reverse - if datafusion pre-25-release wants sqlparser 57, the datafusion branch name is v25.

There might be multiple dependency upgrades (arrow-rs, sqlparser, chrono) and we wouldn't want branches for each one.

avantgardnerio avatar Aug 18 '22 22:08 avantgardnerio

There might be multiple dependency upgrades (arrow-rs, sqlparser, chrono) and we wouldn't want branches for each one.

So long as they're independent I see no reason not to, it keeps the branches small and short lived?

TBC I view this process as an exceptional case where there is a complex upgrade requiring multiple contributors. If this is a common occurrence, something is wrong, and there is a mismatch between the rate of change and the release cadence

tustvold avatar Aug 18 '22 22:08 tustvold

So the feature branch is now merged via this PR: https://github.com/apache/arrow-datafusion/pull/3200/commits

Each commit is one PR that was merged to this branch so we have the history preserved, or am I missing something?

andygrove avatar Aug 19 '22 13:08 andygrove

Each commit is one PR that was merged to this branch so we have the history preserved, or am I missing something?

When you PR that branch into master it will be squashed into a single commit?

tustvold avatar Aug 19 '22 13:08 tustvold

squashed into a single commit

It is an option in github to rebase or squash and rebase.

avantgardnerio avatar Aug 19 '22 14:08 avantgardnerio

Indeed, but not one that is currently enabled for this repo :smile:

image

tustvold avatar Aug 19 '22 14:08 tustvold

Indeed, but not one that is currently enabled for this repo smile

Ah, I see. We can enable rebase easily enough.

https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features#Git.asf.yamlfeatures-Mergebuttons

andygrove avatar Aug 19 '22 16:08 andygrove

Shall we merge this document in now that we have run it successfully once?

alamb avatar Sep 03 '22 09:09 alamb

Marking as draft to signify this PR has planned changes (and make it easier to find PRs that are in need of review). Please mark "ready for review" when it is next ready or if this change was a mistake.

alamb avatar Sep 12 '22 11:09 alamb