
[HUDI-7503] Compaction and LogCompaction executions should start a heartbeat on every attempt and block concurrent executions of same plan

Open kbuci opened this issue 1 year ago • 2 comments

Change Logs

Updated compact and logcompact to start a heartbeat (within a transaction) before attempting to execute a plan. If multiple writers attempt to execute the same compact/logcompact plan at the same time, only one of them will proceed; the rest will see that a heartbeat has already been started, fail with an exception, and abort.
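As a rough illustration of the guard described above (not the code in this PR), the check-and-start step can be sketched in Python. All names here (`PlanHeartbeats`, `HeartbeatAlreadyActive`, `execute_plan`) are hypothetical, and an in-memory dict plus a `threading.Lock` stand in for Hudi's heartbeat files and table-level transaction:

```python
import threading
import time


class HeartbeatAlreadyActive(Exception):
    """Raised when another writer is already executing the same plan."""


class PlanHeartbeats:
    """Toy in-memory stand-in for a heartbeat store guarded by a table lock."""

    def __init__(self, timeout_s=60.0, clock=time.monotonic):
        self._lock = threading.Lock()  # stands in for the transaction
        self._last_beat = {}           # plan instant time -> last heartbeat time
        self._timeout_s = timeout_s
        self._clock = clock

    def is_active(self, instant):
        # A heartbeat counts as active only if it was refreshed recently.
        last = self._last_beat.get(instant)
        return last is not None and (self._clock() - last) < self._timeout_s

    def start_or_fail(self, instant):
        # The check and the start happen under one lock, so two writers
        # cannot both pass the check for the same plan.
        with self._lock:
            if self.is_active(instant):
                raise HeartbeatAlreadyActive(instant)
            self._last_beat[instant] = self._clock()

    def stop(self, instant):
        with self._lock:
            self._last_beat.pop(instant, None)


def execute_plan(heartbeats, instant, run):
    """Run a compaction/log compaction plan only if no one else is running it."""
    heartbeats.start_or_fail(instant)
    try:
        return run(instant)
    finally:
        heartbeats.stop(instant)
```

In this sketch a second writer calling `execute_plan` for the same instant while the first is still running gets `HeartbeatAlreadyActive` and never touches the plan. A heartbeat older than the timeout is treated as expired, so a later attempt can take over a plan whose previous executor died.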

Impact

Without this change, if multiple jobs are launched at the same time and target the same compact/logcompact plan (due to a non-Hudi user-side configuration/orchestration issue), then one job can execute the plan and create an inflight/commit instant file while the other jobs roll it back (deleting inflight instant files or data files). This can leave the timeline in a corrupted state.

Risk level (write none, low, medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default values of configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.

Contributor's checklist

  • [ ] Read through contributor's guide
  • [ ] Change Logs and Impact were stated clearly
  • [ ] Adequate tests were added if applicable
  • [ ] CI passed

kbuci • Apr 05 '24 19:04

CI report:

  • e1a6e4a24083dd8871a2fc3fbb289e1a6192593a Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

hudi-bot • Apr 09 '24 00:04

@kbuci Please include a heartbeat for clustering commits as well. Also, treat clustering and logcompaction as removable plans, so that rollback for them can happen in the ingestion itself. Given that, how do we create heartbeats?

@suryaprasanna Do you recall why logcompaction execution/rollback differs from compaction? Specifically, unlike compaction execution:

  • log compaction won't retry a failed/inflight log compact plan and will instead completely roll it back
  • Lazy clean rollback of failed writes is allowed to roll back log compact instants

I assume this is because we want to avoid a "stuck" log compact plan preventing compaction from being scheduled, but I just wanted to confirm, since @nsivabalan had the same question as well.

The reason I ask is that (as per my understanding) this behavior of log compact will make it trickier for us to schedule log compact plans and reliably defer execution to an async job. This is because, if clean's failedWritesRollback can roll back log compact instants, it can roll back a log compact plan (.requested file) before it has the chance to be "picked up" and executed by an async job. We can handle this situation by adding heartbeating (the same way as compact) and updating failedWritesRollback to only try rolling back a log compact instant if it is inflight (and ignore it if it's just in .requested state), though that still has the following consequence we should keep in mind:

  • If the async job is disabled or delayed (due to a configuration or orchestration issue), the log compact plan (.requested file) will remain in the timeline
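To make the proposed failedWritesRollback rule concrete, here is a minimal Python sketch of the filter. It is only an illustration of the idea in this comment, not Hudi code: the `Instant` class, the state and action strings, and the `has_active_heartbeat` callback are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instant:
    timestamp: str
    action: str  # e.g. "logcompaction" (hypothetical label)
    state: str   # "REQUESTED", "INFLIGHT" or "COMPLETED" (hypothetical labels)


def log_compactions_to_roll_back(instants, has_active_heartbeat):
    """Pick the log compaction instants that a lazy rollback may remove.

    Assumed rule: roll back only INFLIGHT log compactions whose heartbeat
    is no longer active. REQUESTED plans are left for an async job.
    """
    eligible = []
    for inst in instants:
        if inst.action != "logcompaction":
            continue  # other actions are out of scope for this sketch
        if inst.state != "INFLIGHT":
            continue  # skip plans that were scheduled but never picked up
        if has_active_heartbeat(inst.timestamp):
            continue  # a writer is still executing this plan
        eligible.append(inst)
    return eligible
```

Under this rule a plan in .requested state is never rolled back by clean, which is exactly why it stays in the timeline if the async job never runs.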

@suryaprasanna @nsivabalan @danny0405 (tagging all commenters) I was wondering if you had any opinions/suggestions on this?

kbuci • Apr 17 '24 23:04