KAFKA-14073 Logging the reason for Snapshot
When a snapshot is taken, it is for one of the following reasons:
- Max bytes were exceeded
- Metadata version was changed

Once the snapshot process starts, it logs the reason that initiated it.
Updated the existing tests to include the code changes required to log the reason. I was not able to check the logs when running tests; could someone guide me on how to enable logs when running a specific test case?
Example logs after the changes:
[2022-07-25 14:34:39,769] INFO [Controller 3000] Generating a snapshot that includes (epoch=1, offset=0) after 91 committed bytes since the last snapshot, because max bytes exceeded. (org.apache.kafka.controller.QuorumController:1328)
[2022-07-25 14:40:51,783] INFO [BrokerMetadataSnapshotter id=2] Creating a new snapshot at offset 5 because metadata version changed... (kafka.server.metadata.BrokerMetadataSnapshotter:66)
Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)
@dengziming since you raised the JIRA ticket for KAFKA-14073, would you please review this PR?
@dengziming I have made the changes requested. I have also made the changes in `QuorumController.maybeGenerateSnapshot` to log the reason for the snapshot being generated.
`maybeGenerateSnapshot` seemed a better place than `QuorumController.SnapshotGeneratorManager.createSnapshotGenerator` since it already had logs for starting a snapshot.
Also, it seems like `QuorumController` only creates a snapshot when max bytes are exceeded; I couldn't find it calling the snapshot generator for the metadata version changed reason.
Hi @dengziming, could you review this once more? I have made the requested changes.
@dengziming I have made the requested changes and added the sample logs in the description of the PR.
Got it, working on these changes. Taking a bit of time to go through the code.
Hi @jsancio, I have defined an enum in the `raft` module for `SnapshotReason` and have used it in place of the string messages that were being passed previously.
I am working on making the changes for `RaftClient` and `KafkaMetadataLog`. These changes will log the snapshot reason in the `createNewSnapshot` function of `KafkaMetadataLog`.
I had two implementation queries; could you please help with them?
- How do we handle multiple reasons for starting a snapshot in an enum?
- I would also need to change the function signature of `createNewSnapshot()` in `KafkaMetadataLog` to accommodate the new `SnapshotReason` parameter. Below is the new signature. Would this be okay?
Optional<RawSnapshotWriter> createNewSnapshot(OffsetAndEpoch snapshotId, SnapshotReason reason);
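For illustration, a minimal `SnapshotReason` enum along these lines could carry a human-readable message for the log line. The constant names and message strings below are assumptions for this sketch, not the actual code in the PR:

```java
// Hypothetical sketch of a SnapshotReason enum with a log-friendly message.
// Constant names and wording are illustrative, not the merged Kafka code.
enum SnapshotReason {
    MAX_BYTES_EXCEEDED("max bytes exceeded"),
    METADATA_VERSION_CHANGED("metadata version changed"),
    UNKNOWN("unknown reason");

    private final String message;

    SnapshotReason(String message) {
        this.message = message;
    }

    @Override
    public String toString() {
        return message;
    }
}
```

Passing an enum constant instead of a free-form string keeps the set of reasons closed and the log wording consistent across call sites.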
Thanks for the changes @ashmeet13 they look good in general.
1. How do we handle multiple reasons for starting a snapshot in an enum?
Instead of using `Optional[SnapshotReason]` we could use `Set[SnapshotReason]`. Depending on the layer, we can associate different meanings with this set. When returned from `def shouldSnapshot()`, the caller can assume that an empty set means it should not generate a snapshot.
2. I would also need to change the function signature of `createNewSnapshot()` in `KafkaMetadataLog` to accommodate the new `SnapshotReason` parameter. Below is the new signature. Would this be okay?

Optional<RawSnapshotWriter> createNewSnapshot(OffsetAndEpoch snapshotId, SnapshotReason reason);
How about:
Optional<RawSnapshotWriter> createNewSnapshot(OffsetAndEpoch snapshotId, Set<SnapshotReason> reasons);
If the set `reasons` is empty then the implementation of `createNewSnapshot` can assume that the reason is unknown.
What do you think?
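As a rough sketch of the `Set<SnapshotReason>` idea above: an empty set maps to an unknown reason, and otherwise the reasons are joined into one string for the log message. The enum constants and the `stringFromReasons` helper are hypothetical names for illustration, not the actual Kafka API:

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of turning a Set<SnapshotReason> into a single log-friendly string.
// An empty set means the reason is unknown. All names are hypothetical.
class SnapshotReasonSetDemo {
    enum SnapshotReason { MAX_BYTES_EXCEEDED, METADATA_VERSION_CHANGED }

    static String stringFromReasons(Set<SnapshotReason> reasons) {
        if (reasons.isEmpty()) {
            return "unknown reason";
        }
        // Join multiple reasons so one snapshot can report all of its triggers.
        return reasons.stream()
                      .map(Enum::name)
                      .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        Set<SnapshotReason> reasons =
            EnumSet.of(SnapshotReason.MAX_BYTES_EXCEEDED, SnapshotReason.METADATA_VERSION_CHANGED);
        System.out.println("Creating snapshot because: " + stringFromReasons(reasons));
    }
}
```

Using a set rather than a single value also answers the first question in the thread: a snapshot triggered by both conditions at once simply carries both reasons.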
Thanks @jsancio, this is helpful. I'll go ahead and make these changes!
@ashmeet13 ,
I thought about this some more and I don't think we should include the reason for the snapshot in the raft module and types. In other words, I think we should:
- Move the `SnapshotReason` type from `raft/src/main/java/org/apache/kafka/raft` to `metadata/src/main/java/org/apache/kafka/metadata/util`.
- Keep the old signature for `createNewSnapshot`. I suspect that we can undo all of the changes you made to the `raft` module.
I think this now because the reasons for creating a snapshot are specific to metadata and not raft. We have an interest in using the raft module and the KRaft protocol to solve other problems, and I don't think we will have the same reasons for wanting to generate a snapshot in those use cases.
We should still use the `SnapshotReason` type; I am just concerned about where it should live. Thanks, and excuse my earlier misguided advice.
Hi @jsancio, just one more doubt: does this mean we don't want to log the reason within `KafkaMetadataLog` and keep the logging as it is right now within `BrokerMetadataSnapshotter` and `QuorumController`?
Yes. That's correct. Let's keep the logging we have now and extend the message to include the reason(s).
Hi @jsancio, I have reverted the changes made to `RaftClient` and kept the logging within `BrokerMetadataSnapshotter` and `QuorumController`.
I have two questions; it would really be helpful if you could guide me on these:
1. We also take a snapshot in `org.apache.kafka.raft` within the class `ReplicatedCounter` under the `handleCommit` function. I am facing an issue importing `SnapshotReason` from the `metadata/src/main/java/org/apache/kafka/metadata/util` package. Could you guide me on how I could fix this?
2. We take snapshots at two different places within `QuorumController`. For one I was able to figure out the reason being `MaxBytesExceeded`. There is another function, `beginWritingSnapshot`, which also initiates a snapshot; I was not able to figure out who calls this function and what the reason would be in this scenario.
Hi @dengziming @jsancio, can you help with a review of this PR? If it looks okay to merge, I can clean up the PR and remove the temporary comments I added.
Thank you @dengziming for the review. Made the changes to remove the comment I had added and fixed the typo in the log message.
Thanks @dengziming for the review. @jsancio could you please have a look at this PR?
Hi @jsancio, bumping this PR up in case it got missed. Could you please have a look at this?
Hi @jsancio, I have reverted the changes made to `RaftClient` and kept the logging within `BrokerMetadataSnapshotter` and `QuorumController`. I have two questions; it would really be helpful if you could guide me on these:
1. We also take a snapshot in `org.apache.kafka.raft` within the class `ReplicatedCounter` under the `handleCommit` function. I am facing an issue in being able to import the `SnapshotReason` from the `metadata/src/main/java/org/apache/kafka/metadata/util` package. Could you guide me on how I could fix this?
`ReplicatedCounter` is mainly used to test the KRaft implementation. We should probably consider moving this type to `raft/src/test`.
2. We take snapshots at two different places within `QuorumController` - One I was able to figure out the reason being `MaxBytesExceeded`. There is another function `beginWritingSnapshot` which also initiates a snapshot - I was not able to figure out who calls this function and what would the reason be in this scenario?
It looks like this method is only used by tests. I think it is okay for this to have an "unknown reason" for now.
Thank you @jsancio for the review. Have made the changes.
For the following comment -
It looks like this method is only used by tests. I think it is okay for this to have an "unknown reason" for now.
Currently the function under discussion does no logging, hence I haven't added `UnknownReason`. Should I update the function to log it?
Edit: I have gone ahead and added the log the same way we log in the actual function within `QuorumController`. I can remove it if necessary.
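To illustrate the "unknown reason" fallback discussed above, a caller with no concrete reason (such as the test-only `beginWritingSnapshot` path) could still produce a meaningful log line by defaulting an empty reason set to a single unknown entry. The names below are assumptions for this sketch, not the actual PR code:

```java
import java.util.EnumSet;
import java.util.Set;

// Sketch of defaulting to UNKNOWN when no specific snapshot reason is
// available, e.g. a test-only trigger path. All names are hypothetical.
class UnknownReasonDemo {
    enum SnapshotReason { MAX_BYTES_EXCEEDED, METADATA_VERSION_CHANGED, UNKNOWN }

    static Set<SnapshotReason> orUnknown(Set<SnapshotReason> reasons) {
        // Empty input means the caller could not name a reason.
        return reasons.isEmpty() ? EnumSet.of(SnapshotReason.UNKNOWN) : reasons;
    }

    public static void main(String[] args) {
        // A caller with no concrete reason still logs something sensible.
        System.out.println("Generating a snapshot because: "
            + orUnknown(EnumSet.noneOf(SnapshotReason.class)));
    }
}
```

This keeps every snapshot log line uniform: the reason field is always present, even when the trigger is a test-only code path.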
Hi @jsancio could you please review this PR? Sorry for the multiple changes this has required.
Thanks for the changes and excuse the delays. LGTM in general. Restarted the build as it seems the JVM 17 and Scala 2.13 configuration failed to build.
Thank you @jsancio, I see that the build for the JVM 17 and Scala 2.13 configuration has timed out again. What can be done from my end to fix this?