HIVE-12679: Allow users to be able to specify an implementation of IMetaStoreClient via HiveConf
What changes were proposed in this pull request?
Make it possible to replace the default IMetaStoreClient with a custom one.
It is the third time to try to send this patch. Austin originally opened HIVE-12679 in 2016, and @moomindani tried to send his patch in #1402, and then I took over it again.
Why are the changes needed?
In some environments, we want to connect to another data catalog rather than the native Hive Metastore. Looks like, AWS Glue Data Catalog is one of the cases. We also have a similar case for some historical reasons.
As we can see, more than 40 people are watching HIVE-12679, and some of them have asked us about the progress of this ticket. I'm sure it is worth maintaining this feature.
Does this PR introduce any user-facing change?
This doesn't change the original behavior at all.
Is the change a dependency upgrade?
No
How was this patch tested?
I tested HiveServer2 still works with this patch.
We have a data connector for bridging over different resources for this purpose, do we consider using this feature? https://issues.apache.org/jira/browse/HIVE-24396
We have a data connector for bridging over different resources for this purpose, do we consider using this feature? https://issues.apache.org/jira/browse/HIVE-24396
If I understand correctly, it requires us to have a Hive Metastore as a single source. The expected environment of HIVE-12679 doesn't have a true Hive Metastore but does have another data catalog whose protocol is incompatible with Hive Metastore.
@okumin Could you please address the comment in the original pull request? The limitation with this approach is we don't get the features implemented in SessionHiveMetastoreClient (Example: handling of temp tables). This will lead to the issue of temp tables not getting cleaned up when the session is closed. Please see the comments from @thejasmn and @alanfgates on HIVE-12679.
@okumin Could you please address the comment in the original pull request? The limitation with this approach is we don't get the features implemented in SessionHiveMetastoreClient (Example: handling of temp tables). This will lead to the issue of temp tables not getting cleaned up when the session is closed. Please see the comments from @thejasmn and @alanfgates on HIVE-12679.
@ganeshashree Quickly checking SessionHiveMetastoreClient, it has the following additional features.
- Temporary table management in most methods
- Query cache management
- TX id management
It sounds like a good idea to make it easy to compose the features with a custom metastore client. I am personally thinking we can work on it in another ticket and I will definitely try it. In my mind, I wonder how about having the following entities.
-
IMetastoreClient- It is the interface of any metastore client.
-
HiveMetaStoreClient- It is the implementation of
IMetastoreClientto talk to Hive Metastore
- It is the implementation of
-
SessionHiveMetaStoreClient- It is a proxy to add the above three features to any
IMetastoreClient
- It is a proxy to add the above three features to any
-
RetryingMetaStoreClient- It is a dynamic proxy to add retry capability to an
IMetastoreClientrelying onHiveMetastoreClient. Practically, it is dedicated forSessionHiveMetaStoreClienton the top ofHiveMetaStoreClient
- It is a dynamic proxy to add retry capability to an
As proposed in the JIRA ticket, I expect we can implement SessionHiveMetaStoreClient by composition rather than inheritance like below.
class SessionHiveMetaStoreClient implement IMetastoreClient {
private IMetastoreClient underlying;
@Override
protected void create_table(CreateTableRequest request) throws
InvalidObjectException, MetaException, NoSuchObjectException, TException {
org.apache.hadoop.hive.metastore.api.Table tbl = request.getTable();
if (tbl.isTemporary()) {
createTempTable(tbl);
return;
}
underlying.create_table(request);
}
}
If we follow the current API design using HiveMetaStoreClientFactory, the responsibility of the factory will be just "create an IMetastoreClient". Users may wrap their custom clients with SessionHiveMetaStoreClient if they need to support temp tables, or may wrap them with RetryingMetaStoreClient if the custom clients depend on Thrift. It is up to the users. I guess it would give us the best flexibility(I guess all the past authors have not needed the features provided by SessionHiveMetaStoreClient and it could be possible that someone will want to customize such session-related features for some requirements of their platforms).
@wecharyu and @deniskuzZ may have different opinions about how to generate an instance. So, this is just my opinion, though.
@okumin Thank you for addressing this! I vote for users to specify an implementation of IMetaStoreClient via HiveConf and use that in SessionHiveMetastoreClient instead of specifying a factory class. This way, it will be easy for users to just implement their version of IMetaStoreClient and also get all the features implemented in SessionHiveMetastoreClient along with RetryingMetaStoreClient. I am also thinking of using the specified implementation of IMetaStoreClient for delegating the calls in SessionHiveMetastoreClient (same as you mentioned). I would let senior community members to review this approach and give their feedback.
@ganeshashree Thanks! I think this point is still a bit controversial.
I vote for users to specify an implementation of IMetaStoreClient via HiveConf and use that in SessionHiveMetastoreClient instead of specifying a factory class
As the third owner of this patch, to be honest, I currently prefer to keep the current design using a factory class for the following reasons.
- Is it useful to always add traits of
RetryingMetaStoreClient,SessionHiveMetastoreClient, andHiveMetaStoreClientWithLocalCache?-
RetryingMetaStoreClientis meaningless or even harmful to a custom client since it is coupled with Thrift or Kerberos. I am not 100% sure the features ofSessionHiveMetastoreClientwill never conflict with all custom clients now or in the future.HiveMetaStoreClientWithLocalCacheseems to be tightly coupled withHiveMetaStoreClient. The design using a factory class can enable us to handle any situation since users can decorate their custom clients as they like
-
- Who is the target users we help?
- Existing users historically have depended on the current design for years. If we adopt a new design, they have to maintain the unmerge patch. Otherwise, they have to ask their customers to take action on Hive 4 upgrade. Of course, we can choose the best way for not existing users but new users. I believe existing users have ported the unapproved patch on their own responsibility
Anyway, I hope we agree with the following points.
- Some of us are really willing to have a feature to replace
IMetaStoreClientas we see many watchers on HIVE-12679 - It seems to be useful if we can reuse traits of
SessionHiveMetastoreClient. I started working on it on HIVE-27473
As for if we should configure our custom clients via a factory or not, I would also say it is finally up to the community members. Sorry for my long opinions 🙇
I'm checking the HiveMetaStoreClient family and I found it is not trivial. HiveMetaStoreClient, HiveMetaStoreClientWithLocalCache, and SessionHiveMetaStoreClient are tightly coupled by dynamic dispatching throw APIs defined in not IMetaStoreClient but HiveMetaStoreClient. It means they depend on each other in a bi-directional manner, and we can not simply replace the dependencies with compositions.
- https://github.com/apache/hive/blob/rel/release-4.0.0-alpha-2/ql/src/java/org/apache/hadoop/hive/ql/metadata/SessionHiveMetaStoreClient.java#L167C18-L175
- https://github.com/apache/hive/blob/rel/release-4.0.0-alpha-2/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java#L1430
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.
Let me implement the alternative idea and ask the community to conclude that later
@deniskuzZ @wecharyu @dengzhhu653 @ganeshashree Sorry for the late response.
I tried to implement the suggested option and summarize the current points. To be honest, I still prefer the original option from the point of view of both usability and convenience of existing users. https://gist.github.com/okumin/30b058b14db1b099ba37ba7dc257fe8e
Should we ask more users about their opinions using the mailing list?
@deniskuzZ @wecharyu @dengzhhu653 @ganeshashree Sorry for the late response.
I tried to implement the suggested option and summarize the current points. To be honest, I still prefer the original option from the point of view of both usability and convenience of existing users. https://gist.github.com/okumin/30b058b14db1b099ba37ba7dc257fe8e
Should we ask more users about their opinions using the mailing list?
hi @okumin, sure, why not. Also, it would be great to include HMS folks as reviewers: @dengzhhu653 , @nrg4878 , @saihemanth-cloudera
Thanks. I sent an e-mail to the dev ML for visibility. https://lists.apache.org/thread/607swrj6pq4g6q052tyo4l304vb091m2
Thanks @okumin for the great work! some ideas:
-
Can we just expose SessionHiveMetaStoreClient in the Hive class? https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L5839 I think the
HiveMetaStoreClientWithLocalCachecan also benefit other data catalogs and it won't hurt the temp table by design.
I vote for users to specify an implementation of IMetaStoreClient via HiveConf and use that in SessionHiveMetastoreClient instead of specifying a factory class.
This makes sense to me, the SessionHiveMetastoreClient acts as a wrapper for the IMetaStoreClient implementation.
- IDataConnectorProvider in HMS can plugin different datasources nowadays, example: https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/dataconnector/jdbc/PostgreSQLConnectorProvider.java, which possible makes HMS a central meta repository across the organization, we can add a Glue connector for HMS talking to Glue data catalog.
@dengzhhu653 Thanks for the meaningful insight!
- Can we just expose SessionHiveMetaStoreClient in the Hive class?
- This makes sense to me, the SessionHiveMetastoreClient acts as a wrapper for the IMetaStoreClient implementation
If I understand correctly, it is a problem resolved by HIVE-27473. I agree users might want to use adaptors to add some capabilities. In addition, they can be optional in my opinion. I mean if a user wants to use the convenient adaptors, they can use it in the following way.
IMetaStoreClient customMetaStoreClient = ...;
IMetaStoreClient metaStoreClientWithCache = new MetaStoreClientWithCache(customMetaStoreClient);
IMetaStoreClient metaStoreClientWithTmpTable = new MetaStoreClientWithTmpTable(metaStoreClientWithCache);
If the user doesn't need any or all of the additional capabilities or features, they don't have to wrap their client. Everything, e.g. what to support or how to implement, is up to the user as long as their plugin is integrated with Hive via the primitive interface, IMetaStoreClient. I expect this doesn't sacrifice convenience so much, keeping things explicit and flexible. If we don't prefer it, I guess we should resolve HIVE-27473 first(and it would not be trivial as SessionHiveMetastoreClient or HiveMetaStoreClientWithLocalCache are tightly coupled with HiveMetaStoreClient).
- IDataConnectorProvider in HMS can plugin different datasources nowadays
It seems to be a really lovely feature. I also guess it may not fit into our use case. That's because we would need to verify the security, availability, and performance of both HMS and our metadata service. To be fair, I am very confident with the capabilities of HMS. But we still always need to keep all endpoints well-tested. Anyway, I think it is a strong option and I listed it as an alternative. Thanks!
@zabetak
What is a bit worrisome is that there are lots of calls to RetryingMetaStoreClient.getProxy methods outside the Hive class itself. It seems that the factory will not have any effect on those. Isn't this a problem?
I didn't notice that point, and I am listing use cases...
- RetryingMetaStoreClient.getProxy
- UpgradeTool
- HiveClientCache of HCatalog
- Constructor of HiveMetaStoreClient, including inheritance
- HiveClientCache
- MsckOperation & Msck
- HiveStrictManagedMigration
- PartitionManagementTask
- SmokeTest
- Constructor of SessionHiveMetaStoreClient
- Hive.java
Some might not be needed for a 3rd party metastore but I feel some could be valuable. Now, I guess it could be better to think of usages outside Hive.java, too. I can PoC once we can agree with the gluing API...
@okumin,
their plugin is integrated with Hive via the primitive interface, IMetaStoreClient
That's right, there plugin client is wrapped by cache and session client, example:
IMetaStoreClient client = new SessionHiveMetaStoreClient(new RetryingMetastoreClient(new HiveMetaStoreClient(....)));
We can put the user client under the SessionHiveMetaStoreClient, e.g, new SessionHiveMetaStoreClient(new CatalogMetaClient(...))
In SessionHiveMetaStoreClient we need to change the call super.create_table(request) to delegate.create_table(request), that I think that it's worthful and wouldn't introduce too much complexity.
@dengzhhu653 Thanks. I hope we are on the same page, meaning the purpose of HIVE-12679 is to make it possible to integrate another metastore with Hive via IMetaStoreClient. And we can decouple how to make it easy to implement IMetaStoreClient. Please let us know if this point is wrong.
I'd like to move to HIVE-27473 if we want to discuss the second problem and we agree that HIVE-12679 doesn't depend on HIVE-27473. That's because we might be confused when we talk about two complex problems in a single place.
Hi @okumin, sorry for that! What I mean is that no matter what the underlying client is, we should expose SessionHiveMetaStoreClient to the HiveServer2, as this would not break the temporary table or the meta cache in my opinion.
For the gluing API, I have no strong reason to against it, I'm thinking:
- How to load it on runtime and separate it from other pluggable clients?
- Where we can put these pluggable clients?
@dengzhhu653 OK. In your opinion, we should always wrap any IMetaStoreClient with SessionHiveMetaStoreClient. I personally thought it could be up to the owner of a custom client. But I also understand your point. In that case, HIVE-27473 will be a blocker of this PR. Should we work on it first and then revisit here?
PS: I'd be glad if someone could join the discussion of HIVE-27473 since I have not found a very smart way to make them integrated not by inheritance but by composition. Thanks.
@dengzhhu653 OK. In your opinion, we should always wrap any IMetaStoreClient with SessionHiveMetaStoreClient. I personally thought it could be up to the owner of a custom client. But I also understand your point. In that case, HIVE-27473 will be a blocker of this PR. Should we work on it first and then revisit here?
I think so, let's hear other opinions. Thanks.
Just thinking out loud. Would it be a better model to federate these external catalogs thru HMS instead of swapping out the IMetastoreClient implementation? Changing the implementation makes it only work with that source where as if we build a Connector for Glue (see HIVE-24396), we can get HMS to pull metadata from Glue and present it to HS2 as if it were local.
@nrg4878 Thanks. I think you are mentioning an alternative in this list and it mostly sounds great. In HIVE-12679, I assume the environment where they need to maintain a single source of truth other than HMS.
@okumin this thread is pretty long. Could you please share a summary about the current status? Do you need any support?
@aturoczy
- The base discussion is here
- @dengzhhu653 suggests we always wrap an
IMetaStoreClientwith the composable version ofSessionHiveMetaStoreClient. If we second the idea, HIVE-27473 is a blocker of HIVE-12679(or we can merge both into one ticket if we agree it is a hard requirement) - @zabetak wonders if
IMetaStoreClientwould be used in other places thanHive.java. It is true and I think we can resolve that point just by brute-forcing all references
So, we can make an advance if we find a solution for HIVE-27473, if we prove HIVE-27473 is not practically impossible, or if we agree that we will go without HIVE-27473. I'm trying to tackle HIVE-27473 as it sounds useful anyway, but I have not found an elegant way.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.
Just a note. I am working on HIVE-27473 and will revive this PR once we've agreed or disagreed with the feasibility of HIVE-27473
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.







