HDDS-10300. InstallSnapshot may fail if OM metadata dir and OM DB dir are in different local storage partitions.
What changes were proposed in this pull request?
When we hard link files between different disk/partition we face this Invalid cross-device link error. Because we can't create hard links between different disks. In this patch, Instead of creating a hard link we move the file to the target directory.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10300
How was this patch tested?
Existing Tests
Copying seems like a bad solution since we will lose efficiency of snapshots space and diff. Additionally there is no indication to the user that this will happen if they set up their configurations in this way, and the state machines, although functionally the same, will be different across OM replicas when only one node installs a snapshot.
Why not use either of these simpler options:
- Rebuild hardlinks after moving the DB from the snapshot install directory to the OM DB directory.
- Deprecate configs for setting snapshot install location to a different directory and put it directly in the OM DB directory. I don't see an advantage to doing the initial download in a different place.
@errose28 Thanks for taking a look at it. For the first approach, With moving and creating a hardlink we would be doing double the work. Instead, why not move to the correct directory and avoid the hard linking part altogether?
I agree with 2nd, I'm not sure why the snapshot download directory is in a different disk. Ideally, both should be in the same disk.
@errose28 regarding point#2
2. Deprecate configs for setting snapshot install location to a different directory and put it directly in the OM DB directory. I don't see an advantage to doing the initial download in a different place.
@aswinshakil, @swamirishi and I had a discussion about it and we don't see any problem if candidate dir just resides inside ozone.om.db.dirs.
I looked into code and config for the snapshot dir is ozone.om.ratis.snapshot.dir and if it is not set, it falls back to ozone.metadata.dirs. There is no call out or recommendation to have ozone.om.ratis.snapshot.dir or ozone.metadata.dirs in the same partition as ozone.om.db.dirs. Original PR#703 discussion.
If we can get to an agreement that ozone.om.ratis.snapshot.dir should be in the same partition as ozone.om.db.dirs. I think we should add it to release notes and make it a general recommendation. Or we can simply deprecate ozone.om.ratis.snapshot.dir config as you said by moving snapshot dir inside ozone.om.db.dirs.
I was talking about this with @aswinshakil and there is another problem we should consider: the size of the Ratis snapshot might be far greater than the size of the final OM DB. This is because the Ratis snapshot has "inflated" all the hardlinks, so if there are many filesystem snapshots on dense buckets, the OM DB device needs an undetermined amount of space greater than the current DB in order to install the ratis snapshot. If this space is not available and we remove the configs, there is no good way out of the situation. We already don't have great handling to make sure there is always room to install a ratis snapshot, but the hardlink inflation makes the problem worse.
One idea to mitigate the problem:
- Allow ratis snapshot to be installed wherever is configured, as it is currently.
- Copy files one-by-one to the OM DB dir. If the file is in the hardlink list, create a link instead of a new file.
- This keeps space required on the OM DB device to no more than double that of the current DB, regardless of the amount of snapshots and hardlinks. It still does not make sure there is always space, but it is not worse than the current system behavior.
Discussed with @hemantk-12 and @swamirishi offline, Right now installSnapshotFromLeader is synchronized so this is where the follower downloads the checkpoint(candidate) from the leader and moves it to the actual OM DB. There will be only one download happening at any point in time.
This can happen in the same partition because we would be doing the same thing if we downloaded it in a different partition and moved that into OM DB's partition. Instead, download the checkpoint(candidate) in the OM DB partition.
We need to do a few things, deprecate configs ozone.om.ratis.snapshot.dir which allows the user to specify different partitions and also don't fall back to ozone.metadata.dirs. Instead, use ozone.om.db.dirs to download the checkpoint and avoid this problem altogether.
Regarding this previous comment, it turns out that the snapshot install only copies one copy of each SST file needed, and then builds the hardlinks from there. That means the Ratis snapshot will not be bigger than the OM DB and we should be able to put it right on the main DB disk without any more space concerns than what we already have.
I'm +1 for deprecating the configs.
@errose28 @aswinshakil Hi, the issue we encountered is similar, and we would like to know if this issue will continue to progress according to the discussion above?
Hi @weimingdiit based on the discussion here I think we would like to simplify the snapshot install configuration by deprecating ozone.om.ratis.snapshot.dir and putting the snapshot directly in ozone.om.db.dir, or ozone.om.metadata.dir if the DB dir is not configured. This makes sure that hardlinks and directory renames always work. I'll close this PR since I don't think it lines up with the current proposal and I'll update the Jira. This should be a minor change if you want to take it on.