Ensure lifecycle tasks wait for messages to be pushed

Open francoisferrand opened this issue 1 year ago • 6 comments

The lifecycle task pushes new entries to the bucket topic, but may commit before the pushed entry is itself committed, which allows multiple lifecycle iterations to happen in parallel.
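As a rough illustration of the behaviour the title asks for, here is a minimal sketch in which an entry is only reported as done (the point at which its offset could be committed) once every message it pushed has been delivered. `FakeProducer`, `processEntry` and the `events` log are hypothetical names invented for this sketch; they are not Backbeat's classes, and the real producer and commit logic will differ.

```javascript
// Hypothetical stand-in for a producer whose delivery is confirmed asynchronously.
class FakeProducer {
    constructor(events) {
        this.events = events;
    }

    // Returns a promise that resolves once the message is "delivered".
    send(message) {
        return new Promise(resolve => setImmediate(() => {
            this.events.push(`delivered:${message}`);
            resolve();
        }));
    }
}

// Pushes all continuation messages, waits for every delivery to be
// confirmed, and only then records the commit for the entry's offset.
async function processEntry(entry, producer, events) {
    const pending = entry.continuations.map(msg => producer.send(msg));
    await Promise.all(pending);
    events.push(`commit:${entry.offset}`);
}
```

Without the `await Promise.all(pending)` line, the commit would be recorded before any delivery, which is the ordering described in this issue.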

Issue: BB-641

francoisferrand avatar Dec 18 '24 13:12 francoisferrand

Hello francoisferrand,

My role is to assist you with the merge of this pull request. Please type @bert-e help to get information on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval :star:
/bypass_build_status Bypass the build and test status :star:
/bypass_commit_size Bypass the check on the size of the changeset TBA :star:
/bypass_incompatible_branch Bypass the check on the source branch prefix :star:
/bypass_jira_check Bypass the Jira issue check :star:
/bypass_peer_approval Bypass the pull request peers' approval :star:
/bypass_leader_approval Bypass the pull request leaders' approval :star:
/approve Instruct Bert-E that the author has approved the pull request. :writing_hand:
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

bert-e avatar Dec 18 '24 13:12 bert-e

Codecov Report

Attention: Patch coverage is 85.71429% with 5 lines in your changes missing coverage. Please review.

Project coverage is 55.34%. Comparing base (61b9e9a) to head (0c257d7).

Files with missing lines Patch % Lines
extensions/lifecycle/tasks/LifecycleTaskV2.js 88.46% 3 Missing :warning:
extensions/lifecycle/tasks/LifecycleTask.js 77.77% 2 Missing :warning:
Additional details and impacted files

Files with missing lines Coverage Δ
extensions/lifecycle/tasks/LifecycleTask.js 83.30% <77.77%> (+0.11%) :arrow_up:
extensions/lifecycle/tasks/LifecycleTaskV2.js 89.74% <88.46%> (+0.85%) :arrow_up:

... and 4 files with indirect coverage changes

Components Coverage Δ
Bucket Notification 18.51% <ø> (ø)
Core Library 61.90% <ø> (-0.23%) :arrow_down:
Ingestion 67.53% <ø> (ø)
Lifecycle 47.15% <85.71%> (+0.24%) :arrow_up:
Oplog Populator 84.20% <ø> (ø)
Replication 51.01% <ø> (-0.04%) :arrow_down:
Bucket Scanner 85.60% <ø> (ø)
@@                 Coverage Diff                 @@
##           development/8.6    #2603      +/-   ##
===================================================
- Coverage            55.40%   55.34%   -0.06%     
===================================================
  Files                  198      198              
  Lines                12915    12928      +13     
===================================================
  Hits                  7155     7155              
- Misses                5750     5763      +13     
  Partials                10       10              
Flag Coverage Δ
api:retry 9.62% <0.00%> (-0.01%) :arrow_down:
api:routes 9.51% <0.00%> (-0.01%) :arrow_down:
bucket-scanner 85.60% <ø> (ø)
ingestion 12.45% <0.00%> (-0.02%) :arrow_down:
lib 7.51% <0.00%> (-0.01%) :arrow_down:
lifecycle 19.44% <85.71%> (+0.08%) :arrow_up:
notification 0.88% <0.00%> (-0.01%) :arrow_down:
replication 18.87% <0.00%> (-0.13%) :arrow_down:
unit 5.13% <0.00%> (-0.01%) :arrow_down:

Flags with carried forward coverage won't be shown.

codecov[bot] avatar Dec 18 '24 14:12 codecov[bot]

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically create the integration branches.

bert-e avatar Dec 19 '24 10:12 bert-e

> The message is already considered as locally consumed even before it reached the queue processor queue

That is a fair point (and may actually help on another issue), but I don't really see how this is a problem for this change: handling an entry by the bucket processor typically takes at least one second already (scanning & checking the state of every object), so we face this discrepancy anyway...

This change is simply about ensuring that we keep the "slot" until the entry is "fully" processed, instead of leaving many things pending, which can be an issue especially since we are pushing listing continuation messages.

What am I missing here?

> If we decide to block or wait synchronously for that delivery report every time we send a message, it will impact performance and throughput.

In theory, yes. In practice, however, since we are processing up to 1000 entries at a time, I wonder if this has a real impact: most of the reports would be received in the time we process each entry (except for very small buckets, in which case throughput may not be so important).

It is certainly a trade-off, but consistent processing seems important as well. Or do you think it is completely safe to leave all these messages dangling and already start processing the next message(s)?

francoisferrand avatar Jan 21 '25 20:01 francoisferrand

> What am I missing here?

My understanding was that the goal of this PR is to prevent multiple lifecycle iterations (triggered by Conductor) from running in parallel. I just pointed out that the lag is based on the “locally consumed” offset rather than on a processed or stored offset. So even if we wait for an entry to be fully processed, it won't stop the bucket-lifecycle topic lag from being zero while there are still other bucket messages in the pipeline.
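To make the distinction concrete, here is a small sketch of how a lag computed from the locally consumed offset can read zero while messages are still in the pipeline. The function names, field names and numbers are assumptions chosen for illustration only; they are not taken from Backbeat or from any Kafka client API.

```javascript
// Lag as seen by a monitor that only looks at the locally consumed offset.
function consumedLag({ highWatermark, consumedOffset }) {
    return Math.max(0, highWatermark - consumedOffset);
}

// Messages already fetched by the consumer but not yet fully processed.
function inFlight({ consumedOffset, processedOffset }) {
    return Math.max(0, consumedOffset - processedOffset);
}

// Hypothetical partition state: every message has been fetched,
// but 25 of them are still being processed.
const partition = { highWatermark: 120, consumedOffset: 120, processedOffset: 95 };

console.log(consumedLag(partition)); // 0
console.log(inFlight(partition));    // 25
```

Under these assumptions, a check that only watches `consumedLag` would see an idle topic even though 25 bucket messages have not finished processing.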

> most of the reports would be received in the time we process each entry

Regarding the internal lifecycle listing, it does not necessarily return 1,000 objects; it only includes those that meet the specified criteria (prefix, age, etc.) from the next 10,000 entries. We might even end up with a listing response containing only a few objects, or none at all. NOTE: This 10,000-entry limit helps avoid placing excessive load on Metadata by preventing the evaluation of an unbounded number of entries.
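A simplified sketch of that behaviour, assuming an in-memory array in place of Metadata: each call scans at most 10,000 entries and returns only those matching the criteria, so a page can hold far fewer than 1,000 objects. `listLifecycleCandidates` and its parameters are hypothetical and do not reflect the actual listing API.

```javascript
// Scans at most `maxScanned` entries starting at index `marker` and returns
// up to `maxKeys` keys matching the prefix and minimum age.
function listLifecycleCandidates(entries, options = {}) {
    const {
        marker = 0,
        prefix = '',
        minAgeDays = 0,
        maxScanned = 10000,
        maxKeys = 1000,
    } = options;
    const contents = [];
    const end = Math.min(entries.length, marker + maxScanned);
    let i = marker;
    for (; i < end && contents.length < maxKeys; i++) {
        const entry = entries[i];
        if (entry.key.startsWith(prefix) && entry.ageDays >= minAgeDays) {
            contents.push(entry.key);
        }
    }
    // `nextMarker` is where the following call should resume, or null when done.
    return { contents, nextMarker: i < entries.length ? i : null };
}

// 25,000 entries, of which only one in every 5,000 is under "logs/".
const entries = Array.from({ length: 25000 }, (_, i) => ({
    key: `${i % 5000 === 0 ? 'logs/' : 'data/'}${i}`,
    ageDays: 30,
}));

const page = listLifecycleCandidates(entries, { prefix: 'logs/', minAgeDays: 7 });
console.log(page.contents.length, page.nextMarker); // 2 10000
```

In this sketch the first page evaluates 10,000 entries yet yields only two objects, so the time spent processing a page is not a reliable proxy for the number of messages it pushes.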

nicolas2bert avatar Jan 22 '25 09:01 nicolas2bert

Incorrect fix version

The Fix Version/s in issue BB-641 contains:

  • 8.6.56

  • 9.0.19

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 8.6.57

  • 9.0.19

  • 9.1.1

Please check the Fix Version/s of BB-641, or the target branch of this pull request.

bert-e avatar Nov 03 '25 08:11 bert-e