ratis icon indicating copy to clipboard operation
ratis copied to clipboard

RATIS-244. Skip snapshot file if corresponding MD5 file is missing

Open devabhishekpal opened this issue 2 months ago • 7 comments

What changes were proposed in this pull request?

Snapshot is not an atomic operation and in some cases there might be corrupted snapshots. Every snapshot has an associated MD5 file with it, if this file is missing we can skip the snapshot.

This PR adds the check for the existence of MD5 file and skips any snapshot which doesn't have it.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/RATIS-244

How was this patch tested?

Patch was tested manually by running the unit tests.

devabhishekpal avatar Nov 20 '25 20:11 devabhishekpal

@devabhishekpal , could you also add some tests to TestRaftStorage?

szetszwo avatar Nov 21 '25 02:11 szetszwo

@devabhishekpal , thanks for the update! The change looks good but there are test failures. Please take a look.

szetszwo avatar Nov 21 '25 19:11 szetszwo

It seems the tests are failing because in findLatestSnapshot() we are doing:

getSingleFileSnapshotInfos(dir, true).iterator()

Now if we create an initial snapshot with md5 file, this will return null from the check:

if (!i.hasNext()) {
      return null;
}

We now create a second snapshot with no MD5 hash, this will still return null as the iterator still has one element. Hence when we run a scenario as below:

try {
      createSnapshot(simpleStateMachineStorage, 1, 100, true);
      simpleStateMachineStorage.loadLatestSnapshot();

      createSnapshot(simpleStateMachineStorage, 1, 200, false);
      simpleStateMachineStorage.loadLatestSnapshot();

      SingleFileSnapshotInfo latest = simpleStateMachineStorage.getLatestSnapshot();
      Assertions.assertNotNull(latest);
}

latest is always returned as null even though there are two snapshots present. Previous behaviour would have been that it would have returned one snapshot.

This is also causing gRPC test failures as SimpleStateMachineStorage.findLatestSnapshot() filters out every snapshot that lacks an MD5 file as we are passing (requireMd5 == true). So on the follower that just copied the snapshot, findLatestSnapshot now returns null - loadLatestSnapshot() never updates the cache, and the snapshot install is effectively ignored. Hence follower never recognises the installed snapshot, it fails to catch up and doesn’t participate in the new configuration. setConfiguration therefore keeps retrying and times-out

@szetszwo do you think we might need to change the test case? Or should the behaviour be changed?

devabhishekpal avatar Nov 22 '25 09:11 devabhishekpal

latest is always returned as null even though there are two snapshots present.

It was a bug previously that it was using filename but not path. Is it still happening after the fix?

szetszwo avatar Nov 26 '25 20:11 szetszwo

Yup, we are still seeing the same issue

devabhishekpal avatar Nov 26 '25 21:11 devabhishekpal

... we are still seeing the same issue

Could you debug it and find out why?

szetszwo avatar Nov 29 '25 21:11 szetszwo

createSnapshot(simpleStateMachineStorage, 1, 100, true);

@devabhishekpal , your test creating fake MD5 file is causing exception

2025-11-29 13:43:15,570 [main] WARN  impl.SimpleStateMachineStorage (SimpleStateMachineStorage.java:loadLatestSnapshot(262)) - Failed to updateLatestSnapshot from target/test/data/81dc258c/TestRaftStorage/TestRaftStorage.testCreateSnapshot/sm
java.io.IOException: Invalid MD5 file target/test/data/81dc258c/TestRaftStorage/TestRaftStorage.testCreateSnapshot/sm/snapshot.1_100.md5: the content "null" does not match the pattern ([0-9a-f]{32}) [ *](.+)
	at org.apache.ratis.util.MD5FileUtil.lambda$readStoredMd5ForFile$0(MD5FileUtil.java:99)
	at java.base/java.util.Optional.orElseThrow(Optional.java:408)
	at org.apache.ratis.util.MD5FileUtil.readStoredMd5ForFile(MD5FileUtil.java:98)
	at org.apache.ratis.statemachine.impl.SimpleStateMachineStorage.findLatestSnapshot(SimpleStateMachineStorage.java:223)
	at org.apache.ratis.statemachine.impl.SimpleStateMachineStorage.loadLatestSnapshot(SimpleStateMachineStorage.java:258)
	at org.apache.ratis.server.storage.TestRaftStorage.testCreateSnapshot(TestRaftStorage.java:101)

szetszwo avatar Nov 29 '25 21:11 szetszwo