hive icon indicating copy to clipboard operation
hive copied to clipboard

HIVE-12679: Allow users to be able to specify an implementation of IMetaStoreClient via HiveConf

Open okumin opened this issue 2 years ago • 26 comments

What changes were proposed in this pull request?

Make it possible to replace the default IMetaStoreClient with a custom one.

It is the third time to try to send this patch. Austin originally opened HIVE-12679 in 2016, and @moomindani tried to send his patch in #1402, and then I took over it again.

Why are the changes needed?

In some environments, we want to connect to another data catalog rather than the native Hive Metastore. Looks like, AWS Glue Data Catalog is one of the cases. We also have a similar case for some historical reasons.

As we can see, more than 40 people are watching HIVE-12679, and some of them have asked us about the progress of this ticket. I'm sure it is worth maintaining this feature.

Does this PR introduce any user-facing change?

This doesn't change the original behavior at all.

Is the change a dependency upgrade?

No

How was this patch tested?

I tested HiveServer2 still works with this patch.

okumin avatar Jun 21 '23 15:06 okumin

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 13 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

sonarqubecloud[bot] avatar Jun 22 '23 08:06 sonarqubecloud[bot]

We have a data connector for bridging over different resources for this purpose, do we consider using this feature? https://issues.apache.org/jira/browse/HIVE-24396

dengzhhu653 avatar Jun 22 '23 13:06 dengzhhu653

We have a data connector for bridging over different resources for this purpose, do we consider using this feature? https://issues.apache.org/jira/browse/HIVE-24396

If I understand correctly, it requires us to have a Hive Metastore as a single source. The expected environment of HIVE-12679 doesn't have a true Hive Metastore but does have another data catalog whose protocol is incompatible with Hive Metastore.

okumin avatar Jun 22 '23 16:06 okumin

@okumin Could you please address the comment in the original pull request? The limitation with this approach is we don't get the features implemented in SessionHiveMetastoreClient (Example: handling of temp tables). This will lead to the issue of temp tables not getting cleaned up when the session is closed. Please see the comments from @thejasmn and @alanfgates on HIVE-12679.

ganeshashree avatar Jun 27 '23 04:06 ganeshashree

@okumin Could you please address the comment in the original pull request? The limitation with this approach is we don't get the features implemented in SessionHiveMetastoreClient (Example: handling of temp tables). This will lead to the issue of temp tables not getting cleaned up when the session is closed. Please see the comments from @thejasmn and @alanfgates on HIVE-12679.

@ganeshashree Quickly checking SessionHiveMetastoreClient, it has the following additional features.

  • Temporary table management in most methods
  • Query cache management
  • TX id management

It sounds like a good idea to make it easy to compose the features with a custom metastore client. I am personally thinking we can work on it in another ticket and I will definitely try it. In my mind, I wonder how about having the following entities.

  • IMetastoreClient
    • It is the interface of any metastore client.
  • HiveMetaStoreClient
    • It is the implementation of IMetastoreClient to talk to Hive Metastore
  • SessionHiveMetaStoreClient
    • It is a proxy to add the above three features to any IMetastoreClient
  • RetryingMetaStoreClient
    • It is a dynamic proxy to add retry capability to an IMetastoreClient relying on HiveMetastoreClient. Practically, it is dedicated for SessionHiveMetaStoreClient on the top of HiveMetaStoreClient

As proposed in the JIRA ticket, I expect we can implement SessionHiveMetaStoreClient by composition rather than inheritance like below.

class SessionHiveMetaStoreClient implement IMetastoreClient {
  private IMetastoreClient underlying;

  @Override
  protected void create_table(CreateTableRequest request) throws
      InvalidObjectException, MetaException, NoSuchObjectException, TException {
    org.apache.hadoop.hive.metastore.api.Table tbl = request.getTable();
    if (tbl.isTemporary()) {
      createTempTable(tbl);
      return;
    }
    underlying.create_table(request);
  }
}

If we follow the current API design using HiveMetaStoreClientFactory, the responsibility of the factory will be just "create an IMetastoreClient". Users may wrap their custom clients with SessionHiveMetaStoreClient if they need to support temp tables, or may wrap them with RetryingMetaStoreClient if the custom clients depend on Thrift. It is up to the users. I guess it would give us the best flexibility(I guess all the past authors have not needed the features provided by SessionHiveMetaStoreClient and it could be possible that someone will want to customize such session-related features for some requirements of their platforms).

@wecharyu and @deniskuzZ may have different opinions about how to generate an instance. So, this is just my opinion, though.

okumin avatar Jun 28 '23 16:06 okumin

@okumin Thank you for addressing this! I vote for users to specify an implementation of IMetaStoreClient via HiveConf and use that in SessionHiveMetastoreClient instead of specifying a factory class. This way, it will be easy for users to just implement their version of IMetaStoreClient and also get all the features implemented in SessionHiveMetastoreClient along with RetryingMetaStoreClient. I am also thinking of using the specified implementation of IMetaStoreClient for delegating the calls in SessionHiveMetastoreClient (same as you mentioned). I would let senior community members to review this approach and give their feedback.

ganeshashree avatar Jun 29 '23 07:06 ganeshashree

@ganeshashree Thanks! I think this point is still a bit controversial.

I vote for users to specify an implementation of IMetaStoreClient via HiveConf and use that in SessionHiveMetastoreClient instead of specifying a factory class

As the third owner of this patch, to be honest, I currently prefer to keep the current design using a factory class for the following reasons.

  • Is it useful to always add traits of RetryingMetaStoreClient, SessionHiveMetastoreClient, and HiveMetaStoreClientWithLocalCache?
    • RetryingMetaStoreClient is meaningless or even harmful to a custom client since it is coupled with Thrift or Kerberos. I am not 100% sure the features of SessionHiveMetastoreClient will never conflict with all custom clients now or in the future. HiveMetaStoreClientWithLocalCache seems to be tightly coupled with HiveMetaStoreClient. The design using a factory class can enable us to handle any situation since users can decorate their custom clients as they like
  • Who is the target users we help?
    • Existing users historically have depended on the current design for years. If we adopt a new design, they have to maintain the unmerge patch. Otherwise, they have to ask their customers to take action on Hive 4 upgrade. Of course, we can choose the best way for not existing users but new users. I believe existing users have ported the unapproved patch on their own responsibility

Anyway, I hope we agree with the following points.

  • Some of us are really willing to have a feature to replace IMetaStoreClient as we see many watchers on HIVE-12679
  • It seems to be useful if we can reuse traits of SessionHiveMetastoreClient. I started working on it on HIVE-27473

As for if we should configure our custom clients via a factory or not, I would also say it is finally up to the community members. Sorry for my long opinions 🙇

okumin avatar Jun 29 '23 10:06 okumin

I'm checking the HiveMetaStoreClient family and I found it is not trivial. HiveMetaStoreClient, HiveMetaStoreClientWithLocalCache, and SessionHiveMetaStoreClient are tightly coupled by dynamic dispatching throw APIs defined in not IMetaStoreClient but HiveMetaStoreClient. It means they depend on each other in a bi-directional manner, and we can not simply replace the dependencies with compositions.

  • https://github.com/apache/hive/blob/rel/release-4.0.0-alpha-2/ql/src/java/org/apache/hadoop/hive/ql/metadata/SessionHiveMetaStoreClient.java#L167C18-L175
  • https://github.com/apache/hive/blob/rel/release-4.0.0-alpha-2/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java#L1430

okumin avatar Jul 01 '23 13:07 okumin

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.

github-actions[bot] avatar Sep 01 '23 00:09 github-actions[bot]

Let me implement the alternative idea and ask the community to conclude that later

okumin avatar Sep 04 '23 12:09 okumin

@deniskuzZ @wecharyu @dengzhhu653 @ganeshashree Sorry for the late response.

I tried to implement the suggested option and summarize the current points. To be honest, I still prefer the original option from the point of view of both usability and convenience of existing users. https://gist.github.com/okumin/30b058b14db1b099ba37ba7dc257fe8e

Should we ask more users about their opinions using the mailing list?

okumin avatar Oct 05 '23 10:10 okumin

@deniskuzZ @wecharyu @dengzhhu653 @ganeshashree Sorry for the late response.

I tried to implement the suggested option and summarize the current points. To be honest, I still prefer the original option from the point of view of both usability and convenience of existing users. https://gist.github.com/okumin/30b058b14db1b099ba37ba7dc257fe8e

Should we ask more users about their opinions using the mailing list?

hi @okumin, sure, why not. Also, it would be great to include HMS folks as reviewers: @dengzhhu653 , @nrg4878 , @saihemanth-cloudera

deniskuzZ avatar Oct 12 '23 12:10 deniskuzZ

Thanks. I sent an e-mail to the dev ML for visibility. https://lists.apache.org/thread/607swrj6pq4g6q052tyo4l304vb091m2

okumin avatar Oct 13 '23 15:10 okumin

Thanks @okumin for the great work! some ideas:

  1. Can we just expose SessionHiveMetaStoreClient in the Hive class? https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L5839 I think the HiveMetaStoreClientWithLocalCache can also benefit other data catalogs and it won't hurt the temp table by design.

I vote for users to specify an implementation of IMetaStoreClient via HiveConf and use that in SessionHiveMetastoreClient instead of specifying a factory class.

This makes sense to me, the SessionHiveMetastoreClient acts as a wrapper for the IMetaStoreClient implementation.

  1. IDataConnectorProvider in HMS can plugin different datasources nowadays, example: https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/dataconnector/jdbc/PostgreSQLConnectorProvider.java, which possible makes HMS a central meta repository across the organization, we can add a Glue connector for HMS talking to Glue data catalog.

dengzhhu653 avatar Oct 18 '23 04:10 dengzhhu653

@dengzhhu653 Thanks for the meaningful insight!

  1. Can we just expose SessionHiveMetaStoreClient in the Hive class?
  2. This makes sense to me, the SessionHiveMetastoreClient acts as a wrapper for the IMetaStoreClient implementation

If I understand correctly, it is a problem resolved by HIVE-27473. I agree users might want to use adaptors to add some capabilities. In addition, they can be optional in my opinion. I mean if a user wants to use the convenient adaptors, they can use it in the following way.

IMetaStoreClient customMetaStoreClient = ...;
IMetaStoreClient metaStoreClientWithCache = new MetaStoreClientWithCache(customMetaStoreClient);
IMetaStoreClient metaStoreClientWithTmpTable = new MetaStoreClientWithTmpTable(metaStoreClientWithCache);

If the user doesn't need any or all of the additional capabilities or features, they don't have to wrap their client. Everything, e.g. what to support or how to implement, is up to the user as long as their plugin is integrated with Hive via the primitive interface, IMetaStoreClient. I expect this doesn't sacrifice convenience so much, keeping things explicit and flexible. If we don't prefer it, I guess we should resolve HIVE-27473 first(and it would not be trivial as SessionHiveMetastoreClient or HiveMetaStoreClientWithLocalCache are tightly coupled with HiveMetaStoreClient).

  1. IDataConnectorProvider in HMS can plugin different datasources nowadays

It seems to be a really lovely feature. I also guess it may not fit into our use case. That's because we would need to verify the security, availability, and performance of both HMS and our metadata service. To be fair, I am very confident with the capabilities of HMS. But we still always need to keep all endpoints well-tested. Anyway, I think it is a strong option and I listed it as an alternative. Thanks!

@zabetak

What is a bit worrisome is that there are lots of calls to RetryingMetaStoreClient.getProxy methods outside the Hive class itself. It seems that the factory will not have any effect on those. Isn't this a problem?

I didn't notice that point, and I am listing use cases...

  • RetryingMetaStoreClient.getProxy
    • UpgradeTool
    • HiveClientCache of HCatalog
  • Constructor of HiveMetaStoreClient, including inheritance
    • HiveClientCache
    • MsckOperation & Msck
    • HiveStrictManagedMigration
    • PartitionManagementTask
    • SmokeTest
  • Constructor of SessionHiveMetaStoreClient
    • Hive.java

Some might not be needed for a 3rd party metastore but I feel some could be valuable. Now, I guess it could be better to think of usages outside Hive.java, too. I can PoC once we can agree with the gluing API...

okumin avatar Oct 19 '23 14:10 okumin

@okumin,

their plugin is integrated with Hive via the primitive interface, IMetaStoreClient

That's right, there plugin client is wrapped by cache and session client, example:

IMetaStoreClient client = new SessionHiveMetaStoreClient(new RetryingMetastoreClient(new HiveMetaStoreClient(....)));

We can put the user client under the SessionHiveMetaStoreClient, e.g, new SessionHiveMetaStoreClient(new CatalogMetaClient(...)) In SessionHiveMetaStoreClient we need to change the call super.create_table(request) to delegate.create_table(request), that I think that it's worthful and wouldn't introduce too much complexity.

dengzhhu653 avatar Oct 20 '23 06:10 dengzhhu653

@dengzhhu653 Thanks. I hope we are on the same page, meaning the purpose of HIVE-12679 is to make it possible to integrate another metastore with Hive via IMetaStoreClient. And we can decouple how to make it easy to implement IMetaStoreClient. Please let us know if this point is wrong. I'd like to move to HIVE-27473 if we want to discuss the second problem and we agree that HIVE-12679 doesn't depend on HIVE-27473. That's because we might be confused when we talk about two complex problems in a single place.

okumin avatar Oct 20 '23 10:10 okumin

Hi @okumin, sorry for that! What I mean is that no matter what the underlying client is, we should expose SessionHiveMetaStoreClient to the HiveServer2, as this would not break the temporary table or the meta cache in my opinion.

For the gluing API, I have no strong reason to against it, I'm thinking:

  1. How to load it on runtime and separate it from other pluggable clients?
  2. Where we can put these pluggable clients?

dengzhhu653 avatar Oct 21 '23 01:10 dengzhhu653

@dengzhhu653 OK. In your opinion, we should always wrap any IMetaStoreClient with SessionHiveMetaStoreClient. I personally thought it could be up to the owner of a custom client. But I also understand your point. In that case, HIVE-27473 will be a blocker of this PR. Should we work on it first and then revisit here?

PS: I'd be glad if someone could join the discussion of HIVE-27473 since I have not found a very smart way to make them integrated not by inheritance but by composition. Thanks.

okumin avatar Oct 21 '23 02:10 okumin

@dengzhhu653 OK. In your opinion, we should always wrap any IMetaStoreClient with SessionHiveMetaStoreClient. I personally thought it could be up to the owner of a custom client. But I also understand your point. In that case, HIVE-27473 will be a blocker of this PR. Should we work on it first and then revisit here?

I think so, let's hear other opinions. Thanks.

dengzhhu653 avatar Oct 22 '23 00:10 dengzhhu653

Just thinking out loud. Would it be a better model to federate these external catalogs thru HMS instead of swapping out the IMetastoreClient implementation? Changing the implementation makes it only work with that source where as if we build a Connector for Glue (see HIVE-24396), we can get HMS to pull metadata from Glue and present it to HS2 as if it were local.

nrg4878 avatar Oct 24 '23 17:10 nrg4878

@nrg4878 Thanks. I think you are mentioning an alternative in this list and it mostly sounds great. In HIVE-12679, I assume the environment where they need to maintain a single source of truth other than HMS.

okumin avatar Oct 25 '23 02:10 okumin

@okumin this thread is pretty long. Could you please share a summary about the current status? Do you need any support?

aturoczy avatar Nov 08 '23 12:11 aturoczy

@aturoczy

  • The base discussion is here
  • @dengzhhu653 suggests we always wrap an IMetaStoreClient with the composable version of SessionHiveMetaStoreClient. If we second the idea, HIVE-27473 is a blocker of HIVE-12679(or we can merge both into one ticket if we agree it is a hard requirement)
  • @zabetak wonders if IMetaStoreClient would be used in other places than Hive.java. It is true and I think we can resolve that point just by brute-forcing all references

So, we can make an advance if we find a solution for HIVE-27473, if we prove HIVE-27473 is not practically impossible, or if we agree that we will go without HIVE-27473. I'm trying to tackle HIVE-27473 as it sounds useful anyway, but I have not found an elegant way.

okumin avatar Nov 13 '23 11:11 okumin

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.

github-actions[bot] avatar Jan 13 '24 00:01 github-actions[bot]

Just a note. I am working on HIVE-27473 and will revive this PR once we've agreed or disagreed with the feasibility of HIVE-27473

okumin avatar Jan 22 '24 02:01 okumin

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.

github-actions[bot] avatar Oct 09 '24 00:10 github-actions[bot]