
HDDS-10984. Tool to restore SCM certificates from RocksDB.

Open sadanand48 opened this issue 1 year ago • 4 comments

What changes were proposed in this pull request?

Add a tool to restore SCM certs from RocksDB. See the JIRA description for context.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-10984

How was this patch tested?

Manually deleted the certs and restored them using the command below, then verified their integrity by comparing the md5sum of the deleted certs with that of the regenerated certs.

[~]# ozone debug -conf ozone-conf/ozone-site.xml cert-recover --db=/var/lib/hadoop-ozone/scm/data/scm.db
24/06/06 09:50:08 INFO codec.RepeatedOmKeyInfoCodec: RepeatedOmKeyInfoCodec ignorePipeline = true
24/06/06 09:50:08 INFO codec.OmKeyInfoCodec: OmKeyInfoCodec ignorePipeline = true
24/06/06 09:50:08 INFO codec.OmKeyInfoCodec: OmKeyInfoCodec ignorePipeline = true
24/06/06 09:50:08 INFO codec.OmKeyInfoCodec: OmKeyInfoCodec ignorePipeline = true
24/06/06 09:50:08 INFO codec.OmKeyInfoCodec: OmKeyInfoCodec ignorePipeline = true
24/06/06 09:50:08 INFO codec.OmKeyInfoCodec: OmKeyInfoCodec ignorePipeline = true
All Certs in DB : [8372456399380787, 1, 8370741018375358, 8372457017676890]
Host: xxxx.xxxx.xxxx.site
Sub cert serialID for this host: 8370741018375358
Root cert serialID: 1
Writing certs to path : /var/lib/hadoop-ozone/scm/ozone-metadata/scm/sub-ca/certs
Writing root certs to path : /var/lib/hadoop-ozone/scm/ozone-metadata/scm/ca/certs
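
The md5sum comparison described above could be scripted roughly as follows. This is only a sketch: it assumes the original certs were copied to a backup directory before deletion, and the function names `md5_of` and `certs_match` are hypothetical, not part of Ozone.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file (the same value `md5sum` prints)."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def certs_match(backup_dir: Path, restored_dir: Path) -> dict[str, bool]:
    """Compare each file in backup_dir with the same-named file in restored_dir.

    Returns a mapping of file name -> True if the restored file exists and
    has the same MD5 digest as the backed-up copy, False otherwise.
    """
    result = {}
    for original in sorted(backup_dir.iterdir()):
        if not original.is_file():
            continue
        restored = restored_dir / original.name
        result[original.name] = (
            restored.is_file() and md5_of(original) == md5_of(restored)
        )
    return result
```

Calling `certs_match(Path("/tmp/certs-backup"), Path(".../sub-ca/certs"))` (both paths are placeholders) would report, per file, whether the regenerated cert is byte-identical to the deleted one.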

sadanand48 avatar Jun 06 '24 10:06 sadanand48

@sadanand48 This tool will not help us if the private key of the SCM is also deleted along with the certificate.

If someone has deleted the metadata directory by accident, the private key is also lost. There is no point in recovering the certificate in such cases.

nandakumar131 avatar Jun 06 '24 16:06 nandakumar131

If this is making modifications to SCM state, it should be under ozone repair instead of ozone debug.

errose28 avatar Jun 06 '24 18:06 errose28

This tool will not help us if the private key of the SCM is also deleted along with the certificate.

Thanks @nandakumar131. Right, I didn't think of this. Without the private key, the SCM won't be able to issue certs to new roles (OM/DN) that are added to the cluster, as it won't be able to sign their certs. If the private key is lost, I guess we have to regenerate new keys and certs and go through the same init/bootstrap flow again. Should I close this, or do you think it may be of some use?

sadanand48 avatar Jun 07 '24 06:06 sadanand48

@sadanand48, this will be useful if the private key is intact and the user deletes the certificates by accident. I'm not able to think of a situation where this could happen.

Since this change is not modifying any existing code/behaviour we can keep it. Please move the command under ozone repair as @errose28 suggested.

nandakumar131 avatar Jun 07 '24 07:06 nandakumar131

Thanks @fapifta, @nandakumar131 for the reviews. I was exploring point 2 mentioned by @fapifta:

Have you considered the possibility to add this as an automatic recovery possibility that happens during CertificateClients initWithRecovery() call, that is called also from SCMs? Isn't that more feasible?

I think if we implement this, there would be no need for a separate tool. I have pushed a draft skeleton of how the changes would look. However, I faced the following issues:

  1. If we integrate the logic to read RocksDB and persist certs from RocksDB to local disk into CertificateClient, it needs a dependency on ozone-tools in order to use SCMDBDefinition and many other classes. Is it OK to add this dependency, as I did in the current revision of the patch?
  2. Should this recovery (pulling certs from the DB when they are missing) be controlled by a config?
  3. "close all BouncyCastle usages to a separate module": this is also violated, because keeping the main logic inside the tools module makes it necessary to import the security classes in the RecoverSCMCertificate class.

I will try to think over the problems I mentioned and how to fix them but I just wanted to get feedback on the approach that the current patch takes.

sadanand48 avatar Jul 11 '24 08:07 sadanand48

Sorry for the long silence here @sadanand48, I was pretty much flooded with things before my summer break, and could not get back to you on this one.

I think there was a misunderstanding here, on my end. Now that I am reevaluating this whole thing, I came to the realization that we are talking about a scenario where all three SCMs lost their certificates, and therefore the Ratis server could not come up, which also means that their SCMSecurityProtocolServers have not started either...

In this case unfortunately the certificate client can not really be helpful, as that communicates with the SCMSecurityProtocolServer of the leader SCM... And that is why the automatic recovery options could not do anything about this, as SCM could not get to the point where the recovery is meaningful.

So all in all we would need a command line tool that directly accesses RocksDB on the SCM hosts, similar to what you posted the first time. I would suggest the following approach: the tool should take the path to the SCM's VERSION file and the path to the SCM's RocksDB, then emit the certificate based on the serial ID found in the VERSION file. It can error out if the VERSION file does not contain the serial ID, as that most likely means that security was never bootstrapped, and there can be no mistake about which SCM the certificate belongs to. The certificate is stored in the DB as a PEM-encoded string converted to bytes, so we can dump this byte stream to a PEM file without any further ado.
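
The suggested flow can be sketched as below. This is a language-neutral illustration in Python, not the actual tool. It assumes the VERSION file is a properties-style `key=value` file, that the serial ID sits under a key named `scmCertSerialId` (an assumed name), and it uses a plain mapping `cert_table` as a stand-in for the certificate table that the real tool would read from the SCM's RocksDB.

```python
from pathlib import Path
from typing import Mapping

# Assumption: key under which the VERSION file stores the SCM cert serial ID.
SERIAL_ID_KEY = "scmCertSerialId"


def read_properties(version_file: Path) -> dict[str, str]:
    """Parse a simple key=value file, skipping blank lines and # comments."""
    props = {}
    for line in version_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def recover_certificate(
    version_file: Path, cert_table: Mapping[str, bytes], out_file: Path
) -> str:
    """Write the PEM bytes for this SCM's serial ID to out_file.

    cert_table is a hypothetical stand-in for the RocksDB certificate table,
    mapping serial ID -> PEM-encoded certificate bytes.
    """
    serial_id = read_properties(version_file).get(SERIAL_ID_KEY)
    if not serial_id:
        # No serial ID most likely means security was never bootstrapped.
        raise ValueError(f"{version_file} has no {SERIAL_ID_KEY} entry")
    pem_bytes = cert_table.get(serial_id)
    if pem_bytes is None:
        raise KeyError(f"no certificate with serial ID {serial_id} in the DB")
    # The stored value is already PEM text as bytes, so dump it unchanged.
    out_file.write_bytes(pem_bytes)
    return serial_id
```

Because the serial ID comes from the host's own VERSION file, the sketch never has to guess which of the certs in the DB belongs to this SCM.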

Sorry for the extra work caused by my misunderstanding of the actual state in which the tool is useful.

fapifta avatar Aug 26 '24 15:08 fapifta

I came to the realization that we are talking about a scenario where all three SCMs lost their certificates, and therefore the Ratis server could not come up, which also means that their SCMSecurityProtocolServers have not started either...

Right @fapifta, we cannot go through the usual way of fetching the SCM cert info via Ratis here, which is why I implemented a hacky way to do it in my last updated patch. In the ideal flow, the SCMSecurityClient is initialised in the SCM startup code, which is run by the SCM node itself. Each SCM has its own RocksDB, and if we are guaranteed that the certs are persisted to each RocksDB, I can just get the path of scm.db from the SCM configs, open the SCM RocksDB instance in read mode, read the cert, and write it to a local file.

I guess this still has problems because

  1. We are assuming that every SCM's RocksDB is up to date and in sync with the others, which might not be true, as a normal write only needs to be committed by a majority.
  2. Also, while implementing this I ran into many issues with dependencies, BouncyCastle violations, etc.

The problem here, as mentioned by @nandakumar131, is very peculiar: it is restricted to the user accidentally deleting their SCM certs but not the private keys, which is quite rare. To handle such rare cases, the tool might be helpful instead of trying to automate the logic. Thanks for the inputs, I will revert the last commit to only add the tool.

sadanand48 avatar Aug 29 '24 19:08 sadanand48

Thanks, @fapifta, @nandakumar131, @errose28 for the reviews and @sadanand48 for the patch!

aryangupta1998 avatar Sep 13 '24 06:09 aryangupta1998