HDDS-11068. Move SstFiltered flag to another SnapshotProperties table
What changes were proposed in this pull request?
SstFiltering service directly updates the snapshotInfo and doesn't go through DoubleBufferFlush to update the DB. This interferes with SnapshotInfo updated during SnapshotPurgeRequest.
Consider the case:
S1 <- S2(SstFiltered) <- S3 (Sst unfiltered)
At T1 if S2 is purged and submitted to OM ratis but it hasn't been flushed to the rocksDb but just present in the cache which makes the chain.
S1 <- S3 (Sst Unfiltered in cache)
S2(SstFiltered, deleted in Cache but present in table)
Suppose at T2 SstFiltering service runs on the chain and marks S3 to be SstFiltered which would be directly flushed to DB
S1 <- S3 (Sst filtered in DB)
S1 <- S2 (Present in DB but not present in cache)
Now if the OM is restarted, snapshot chain would have gotten corrupted since both S2 & S3 are pointing S1 in the DB.
Changes made in the patch:
- Create a snapshot property table in Snapshot DB which would store all properties allied to a snapshot specific to one OM. This would hold a sstFiltered flag correspoding to the snapshotId which would be set when the SstFiltering service marks the snapshot as filtered.
- SnapshotPropertyManager which would be responsible to maintain a full cache for the table in memory and handling updates to the db.
- In OmSnapshotPurgeRequest mark the boolean property snapshotPurged, a background service will be running in OM which would wait for the double buffer flush for the SnapshotPurgeRequest which would remove the entry from the table. Post which the service will clean up the property entry corresponding to the snapshot.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-11068
How was this patch tested?
Exisiting unit tests and additional unit tests.
This is a doubtful approach the fix.The main problem here is updating the Snapshot Info without going through the RATIS transaction. Removing the dead entries from RocksDB is acceptable but touching anything in RocksDB that is relevant to active object store should not happen outside RATIS transaction.
Adding another table or restructuring the table is a disk layout change and will impact upgrades too.
I have been thinking about this end to end holistically. And I believe the fundamental problem is mixing metadata that is updated by RATIS with Metadata that is updated by a OM-node-local-service.
The fix posted here will indeed take care of this problem and segregate update to metadata such that SFS filtering service will only update local table.
After this change, Garbage collection service can no longer check for SST-filtered flag reliably. Because this flag is only local to the OM node. GC is run on leader and it's possible that the flag is true on leader node. But other nodes may still not have the SFS service run yet and their SST-filtered flag could be false. cc : @aswinshakil
After this change, Garbage collection service can no longer check for SST-filtered flag reliably. Because this flag is only local to the OM node. GC is run on leader and it's possible that the flag is true on leader node. But other nodes may still not have the SFS service run yet and their SST-filtered flag could be false. cc : @aswinshakil
@prashantpogde that should be fine. The worst that can happen is that we update the snapshot info table or move keys from one table to another while the sst filtering service is running for the snapshotted table. This patch will just ignore that particular run of sst filtering service in case of any errors. I believe there should be no errors since sst filtering service will only delete files which don't fall in the bucket for the key table, while garbage collection will be running on deletedKeyTable. As per @hemantk-12 jira https://issues.apache.org/jira/browse/HDDS-9135 we need not have the condition at the first place.
Thanks @swamirishi .
I remember last year I suggested offline that after SST filtering we could simply write an empty file named e.g. filtered inside that snapshot DB directory. Mostly because SST filtering is independent for each OM anyways.
But we also have transactionInfoTable, so a new table for SST filtering would also do.