RATIS-244. Skip snapshot file if corresponding MD5 file is missing
What changes were proposed in this pull request?
Snapshot is not an atomic operation and in some cases there might be corrupted snapshots. Every snapshot has an associated MD5 file with it, if this file is missing we can skip the snapshot.
This PR adds the check for the existence of MD5 file and skips any snapshot which doesn't have it.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/RATIS-244
How was this patch tested?
Patch was tested manually by running the unit tests.
@devabhishekpal , could you also add some tests to TestRaftStorage?
@devabhishekpal , thanks for the update! The change looks good but there are test failures. Please take a look.
It seems the tests are failing because in findLatestSnapshot() we are doing:
getSingleFileSnapshotInfos(dir, true).iterator()
Now if we create an initial snapshot with md5 file, this will return null from the check:
if (!i.hasNext()) {
return null;
}
We now create a second snapshot with no MD5 hash, this will still return null as the iterator still has one element. Hence when we run a scenario as below:
try {
createSnapshot(simpleStateMachineStorage, 1, 100, true);
simpleStateMachineStorage.loadLatestSnapshot();
createSnapshot(simpleStateMachineStorage, 1, 200, false);
simpleStateMachineStorage.loadLatestSnapshot();
SingleFileSnapshotInfo latest = simpleStateMachineStorage.getLatestSnapshot();
Assertions.assertNotNull(latest);
}
latest is always returned as null even though there are two snapshots present. Previous behaviour would have been that it would have returned one snapshot.
This is also causing gRPC test failures as SimpleStateMachineStorage.findLatestSnapshot() filters out every snapshot that lacks an MD5 file as we are passing (requireMd5 == true). So on the follower that just copied the snapshot, findLatestSnapshot now returns null - loadLatestSnapshot() never updates the cache, and the snapshot install is effectively ignored.
Hence follower never recognises the installed snapshot, it fails to catch up and doesn’t participate in the new configuration. setConfiguration therefore keeps retrying and times-out
@szetszwo do you think we might need to change the test case? Or should the behaviour be changed?
latest is always returned as null even though there are two snapshots present.
It was a bug previously that it was using filename but not path. Is it still happening after the fix?
Yup, we are still seeing the same issue
... we are still seeing the same issue
Could you debug it and find out why?
createSnapshot(simpleStateMachineStorage, 1, 100, true);
@devabhishekpal , your test creating fake MD5 file is causing exception
2025-11-29 13:43:15,570 [main] WARN impl.SimpleStateMachineStorage (SimpleStateMachineStorage.java:loadLatestSnapshot(262)) - Failed to updateLatestSnapshot from target/test/data/81dc258c/TestRaftStorage/TestRaftStorage.testCreateSnapshot/sm
java.io.IOException: Invalid MD5 file target/test/data/81dc258c/TestRaftStorage/TestRaftStorage.testCreateSnapshot/sm/snapshot.1_100.md5: the content "null" does not match the pattern ([0-9a-f]{32}) [ *](.+)
at org.apache.ratis.util.MD5FileUtil.lambda$readStoredMd5ForFile$0(MD5FileUtil.java:99)
at java.base/java.util.Optional.orElseThrow(Optional.java:408)
at org.apache.ratis.util.MD5FileUtil.readStoredMd5ForFile(MD5FileUtil.java:98)
at org.apache.ratis.statemachine.impl.SimpleStateMachineStorage.findLatestSnapshot(SimpleStateMachineStorage.java:223)
at org.apache.ratis.statemachine.impl.SimpleStateMachineStorage.loadLatestSnapshot(SimpleStateMachineStorage.java:258)
at org.apache.ratis.server.storage.TestRaftStorage.testCreateSnapshot(TestRaftStorage.java:101)