MINOR: Add branching model to contrib guide
Which issue does this PR close?
N/A
Rationale for this change
It is getting challenging to create PRs that depend on recent changes in dependencies such as arrow and sqlparser that have many breaking changes.
What changes are included in this PR?
New docs
Are there any user-facing changes?
No
@alamb @tustvold @avantgardnerio @jdye64 WDYT?
I think this makes a lot of sense. Doesn't seem like too much maintenance overhead to have to maintain and merge a few more branches and really helps unblock new feature development. I am for this.
I have created a sqlparser-0.21 branch so we can try this out and see how well it works.
I think as both crates have a fairly regular release cadence this is fine. The risk would be if any of these branches live longer than a couple of weeks, as then the opportunity for a potentially non-trivial, error-prone final integration increases.
FWIW I have typically just maintained draft PRs that build on each other for this purpose, and then rebased them individually to get them merged following the release.
One thing that might be worth clarifying is how the final merge occurs, I presume we would rebase the integration branch and then do a fast forward merge? I ask as currently the only option would be a squash merge, which would lose any history from the integration branch (which may not be a problem but should be made explicit)
I think as both crates have a fairly regular release cadence this is fine. The risk would be if any of these branches live longer than a couple of weeks, as then the opportunity for a potentially non-trivial, error-prone final integration increases.
Agreed. I guess I am now volunteering to take on sqlparser releases on a more frequent schedule. The release process is mostly automated and there is no voting since it is outside of Apache, so this shouldn't be too much work.
FWIW I have typically just maintained draft PRs that build on each other for this purpose, and then rebased them individually to get them merged following the release.
This works fine when one author is making the changes but recently there are more contributors making independent changes to sqlparser so this is harder to manage.
One thing that might be worth clarifying is how the final merge occurs, I presume we would rebase the integration branch and then do a fast forward merge? I ask as currently the only option would be a squash merge, which would lose any history from the integration branch (which may not be a problem but should be made explicit)
Good points. I will address this.
One thing that might be worth clarifying is how the final merge occurs, I presume we would rebase the integration branch and then do a fast forward merge?
I'm not sure if this idea is doable. If it is, I suspect we'd want to:
- make a commit in datafusion that's one ahead of master that only updates the arrow
version=toref= - active work being done still goes on master
- we periodically rebase that commit/branch onto master
- If there's no breaking changes to arrow, the build keeps working and it's not a big deal
- If there are breaking changes to the arrow API then datafusion changes are required
- those changes also are done on the
next_releasebranch - but now it's a pain because if you want to start building a new feature, do you do it off
masterornext_release? - By logical inference, anything you do this way with
arrowyou also have to do withsqlparserand up the chain toballista:(
A possible solution is to do the reverse:
- we keep master always pointed at a
ref=to arrow - we periodically cut release branches as we have been
- when we cut the release branch, we point it at a version of arrow
That way we can keep closer sync between the two repos, and @tustvold , myself, and anyone else doing repo-spanning features don't have to duplicate effort of fixing builds. Are there other possibilities? I don't know. These are all gross, as dependency hell tends to be an unsolved problem unless there are stable APIs.
One more option:
We have one PR for upgrading to the next version of sqlparser/arrow Others create PRs against that PR Once the release is available, we point to that release then merge the PR into master
I think this preserves contributor information.
next_release
I think the intention is not to have a single next_release branch, but a staging point for the features that require the specific upstream change. Changes that don't will continue to target master (or potentially another integration branch). IMO this is a reasonable evolution of the process we've effectively adopted thus far, using a dedicated branch instead of a draft PR, to allow for easier collaboration. I therefore think this makes sense and is workable.
A possible solution is to do the reverse
The downside with having git refs in master, which we used to have previously for arrow, is repos that track datafusion by git sha then have a non-trivial job tracking the correct upstream versions, and are prevented from using patch to override dependencies. It's also not clear what happens if it comes time to cut a release, but there is still a pinned git sha on master?
a reasonable evolution of the process we've effectively adopted thus far
I like evolution.
- Who does it?
- When do they do it?
- What's the naming convention?
Very good questions, I guess the first person creates a draft PR and if another person wants to collaborate on the work they flag it and a maintainer can create the branch?? Although they probably could also submit PRs to the fork? 🤔
- Who: the first person who needs it?
- When: when they need it?
- PR to non-master branch
v[next version here]
?
PR to non-master branch v[next version here]
Probably want the name of the dependency in there somewhere
name of the dependency
I was thinking the reverse - if datafusion pre-25-release wants sqlparser 57, the datafusion branch name is v25.
There might be multiple dependency upgrades (arrow-rs, sqlparser, chrono) and we wouldn't want branches for each one.
There might be multiple dependency upgrades (arrow-rs, sqlparser, chrono) and we wouldn't want branches for each one.
So long as they're independent I see no reason not to, it keeps the branches small and short lived?
TBC I view this process as an exceptional case where there is a complex upgrade requiring multiple contributors. If this is a common occurrence, something is wrong, and there is a mismatch between the rate of change and the release cadence
So the feature branch is now merged via this PR: https://github.com/apache/arrow-datafusion/pull/3200/commits
Each commit is one PR that was merged to this branch so we have the history preserved, or am I missing something?
Each commit is one PR that was merged to this branch so we have the history preserved, or am I missing something?
When you PR that branch into master it will be squashed into a single commit?
squashed into a single commit
It is an option in github to rebase or squash and rebase.
Indeed, but not one that is currently enabled for this repo :smile:

Indeed, but not one that is currently enabled for this repo smile
Ah, I see. We can enable rebase easily enough.
https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features#Git.asf.yamlfeatures-Mergebuttons
Shall we merge this document in now that we have run it successfully once?
Marking as draft to signify this PR has planned changes (and make it easier to find PRs that are in need of review). Please mark "ready for review" when it is next ready or if this change was a mistake.