[TASK][MEDIUM] Support Amazon EMR Serverless on AWS
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Search before asking
- [X] I have searched in the issues and found no similar issues.
Describe the feature
Support Amazon EMR Serverless as a Kyuubi Spark engine to minimize operational cost and achieve a truly serverless Spark SQL experience. Amazon EMR Serverless is not supported yet, because it offers no JDBC connection.
Motivation
Achieve a truly serverless Spark SQL deployment on the AWS cloud.
Describe the solution
Amazon EMR Serverless makes it easy for users to run Spark without configuring, managing, and scaling clusters or servers.
Additional context
No response
Are you willing to submit PR?
- [ ] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
- [X] No. I cannot submit a PR at this time.
Hello @davidshtian, Thanks for finding the time to report the issue! We really appreciate the community's efforts to improve Apache Kyuubi.
Does deploying Kyuubi on Amazon EMR satisfy this? AFAIK, cloud vendors like Tencent Cloud and Aliyun, who provide similar EMR services, have provided JDBC through Kyuubi.
> Does deploying Kyuubi on Amazon EMR satisfy this? AFAIK, cloud vendors like Tencent cloud and Aliyun, who provide similar EMR services, have provided JDBC though Kyuubi
Thanks for your response~
Kyuubi can be deployed on an AWS EMR cluster (EMR on EC2), but that still requires managing and operating the cluster, while EMR Serverless is fully managed and serverless, bringing more flexibility and covering more scenarios. Since EMR Serverless has no JDBC connection and uses the AWS API to submit jobs, it would be better to have support for adapting EMR Serverless to Kyuubi. Thanks~
> EMR Serverless has no JDBC connection and it uses AWS API to submit the job
Sorry for being late.
I have never used AWS. Correct me if I am wrong: do you mean that an AWS EMR cluster does not support spark-submit?
Hi @yaooqinn,
> do you mean that an AWS EMR cluster does not support spark-submit?
EMR (on EC2) does support submitting via the spark-submit utility. However, EMR Serverless does not.
For EMR Serverless, one can submit jobs by using either the AWS SDK or the AWS CLI.
It looks something like this:
```shell
aws emr-serverless start-job-run \
    --application-id <EMR_Serverless_App_Id> \
    --execution-role-arn arn:aws:iam::012345678901:role/my-cool-emr-exec-role \
    --job-driver 'sparkSubmit={entryPoint=s3://my-bucket/script_to_be_executed.py}' \
    --configuration-overrides '{"monitoringConfiguration": {"managedPersistenceMonitoringConfiguration": {"enabled": true}, "cloudWatchLoggingConfiguration": {"enabled": true, "logGroupName": "/aws/emr-serverless/my-logs"}}}'
```
It looks like we need a specific version of org.apache.kyuubi.engine.ProcBuilder for AWS. Also cc @pan3793
To support AWS EMR Serverless Spark, we need to implement the following interfaces in Kyuubi:
- org.apache.kyuubi.engine.ApplicationOperation (for querying and canceling jobs)
- org.apache.kyuubi.engine.ProcBuilder (for submitting jobs)
According to the EMR docs, I think we can use the aws emr-serverless CLI to implement that.
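As a rough illustration (not Kyuubi code; the function and parameter names below are invented), a ProcBuilder-style component could assemble the submit command for the aws emr-serverless CLI like this:

```python
import json

def build_submit_command(app_id, role_arn, entry_point, spark_params=""):
    """Assemble the argv for `aws emr-serverless start-job-run`.

    Hypothetical sketch: only the CLI subcommand and flag names come from
    the AWS docs quoted earlier in the thread; everything else is made up.
    """
    spark_submit = {"entryPoint": entry_point}
    if spark_params:
        spark_submit["sparkSubmitParameters"] = spark_params
    return [
        "aws", "emr-serverless", "start-job-run",
        "--application-id", app_id,
        "--execution-role-arn", role_arn,
        "--job-driver", json.dumps({"sparkSubmit": spark_submit}),
    ]

cmd = build_submit_command(
    "00abc123example",
    "arn:aws:iam::012345678901:role/my-emr-exec-role",
    "s3://my-bucket/kyuubi-engine.py",
)
print(" ".join(cmd[:3]))  # aws emr-serverless start-job-run
```

In a real implementation the resulting argv would be handed to the process launcher the same way the existing spark-submit command line is.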
One concern is the integration tests. I'm not an AWS user; it seems that LocalStack does not support AWS EMR Serverless either? I'm afraid that functionality without CI verification is fragile.
BTW, do GCP and Azure have similar services?
@pan3793, @yaooqinn
> I'm not an AWS user, seems that localstack also does not support AWS EMR Serverless?
You are right, LocalStack doesn't support the endpoints for EMR Serverless yet.
> I'm afraid that functionality without CI verification is fragile.
I agree. Does the Apache Software Foundation have AWS accounts that can be used for CI/CD purposes? If so, that would be the fastest way to address this, as I think we may not be the first project needing this type of integration test.
If it doesn't, you may want to try the AWS promotional credits for Open Source projects. More info at https://aws.amazon.com/blogs/opensource/aws-promotional-credits-open-source-projects/
Regardless of the option you go with, count on me to help with the AWS implementation details (IAM permissions, service configuration, SDK, CLI options and so on); that can speed up development of this feature.
It looks like ASF Infra doesn't have AWS resources for CI/CD. https://infra.apache.org/build-supported-services.html
Another option would be to mock the responses expected from the AWS services involved.
I've seen that done before in another project in which LocalStack didn't support the required service. Would it address the CI tests concern?
> Another option would be to mock responses expected from the AWS services involved.
This is a necessary step for local dev. AWS promotional credits might be necessary for setting up the integration tests.
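For the mocking approach, here is a minimal sketch, assuming a boto3-style client with a get_job_run method (per the EMR Serverless API reference); the EMR Serverless state names follow the JobRun docs, while the target states on the right are purely illustrative, not Kyuubi's actual enum:

```python
from unittest.mock import MagicMock

# EMR Serverless job-run states (per the GetJobRun API docs) mapped to a
# generic application state an ApplicationOperation-like layer might expose.
STATE_MAP = {
    "SUBMITTED": "PENDING",
    "PENDING": "PENDING",
    "SCHEDULED": "PENDING",
    "RUNNING": "RUNNING",
    "SUCCESS": "FINISHED",
    "FAILED": "FAILED",
    "CANCELLED": "KILLED",
}

def query_job_state(client, app_id, job_run_id):
    """Query a job run's state via a (possibly mocked) EMR Serverless client."""
    resp = client.get_job_run(applicationId=app_id, jobRunId=job_run_id)
    return STATE_MAP.get(resp["jobRun"]["state"], "UNKNOWN")

# In CI, the client is a stub returning canned responses; no AWS account needed.
mock_client = MagicMock()
mock_client.get_job_run.return_value = {"jobRun": {"state": "RUNNING"}}
print(query_job_state(mock_client, "app-1", "job-1"))  # RUNNING
```

The same pattern works for start-job-run and cancel-job-run responses, so the state-handling logic gets CI coverage even without a live AWS endpoint.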
cc @zhaohehuhu, you may be interested in this feature
Yup. I'm going to implement this feature.
@zhaohehuhu just checking in to see if you need any help with the AWS side of things
> @zhaohehuhu just checking in to see if you need any help with the AWS side of things
Thanks. It's going well so far. I have already finished the draft code and decided to do a round of testing. @PauloMigAlmeida
@zhaohehuhu If that helps, this is the terraform code that can provision an EMR Serverless cluster with the right permissions https://gist.github.com/PauloMigAlmeida/5cebf3efcd0f105d73646a6a9e8cc2f3
Instructions on how to deploy and run it are in the gist too.
> @zhaohehuhu If that helps, this is the terraform code that can provision an EMR Serverless cluster with the right permissions https://gist.github.com/PauloMigAlmeida/5cebf3efcd0f105d73646a6a9e8cc2f3
> Instructions on how to deploy and run it are in the gist too.
Thanks, I may contact you if needed.
@pan3793 plz assign it to me.
@PauloMigAlmeida I deployed a Kyuubi server on EC2. When the Kyuubi server talks to the Spark engine in EMR Serverless, it always reports a connection timeout. The Kyuubi server and EMR Serverless are already in the same VPC. Do you have any idea what might be wrong?
@zhaohehuhu EMR Serverless doesn't run in a customer-managed VPC. It's accessible via an API that should be invoked using one of the AWS SDKs instead.
API method: https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_StartJobRun.html AWS SDK for Java: https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/emrserverless/EmrServerlessClient.html#startJobRun(software.amazon.awssdk.services.emrserverless.model.StartJobRunRequest)
PS: It requires that an EMR Serverless application is created first.
Does this help?
Thanks @PauloMigAlmeida. EMR Serverless runs in a VPC managed by AWS; I just wonder whether it is possible for the Kyuubi service to talk to the Spark engine in EMR Serverless through the Thrift protocol.
@zhaohehuhu I'm almost certain that it isn't supported, but let me check that internally first and I will come back to you with an answer tomorrow.
On a separate note, if communicating via thrift isn't possible, is there any alternative that could be explored instead?
Thanks. It looks like it's hard for Kyuubi service running on EC2 or others to access Amazon EMR Serverless Spark. I will discuss it with @pan3793.
@zhaohehuhu I got hold of an EMR Serverless specialist internally. Thrift communication isn't possible at this moment on EMR Serverless =/
@PauloMigAlmeida do you know the exact restriction? TCP inbound traffic or something?
@pan3793 It seems that we don't have a Thrift server running on those nodes for the serverless offering.
@PauloMigAlmeida Kyuubi uses Thrift as the internal RPC protocol between the Kyuubi server and the Spark driver; it automatically bootstraps a Thrift server on the Spark driver. So the question is: does EMR Serverless allow Thrift (a TCP-based protocol) traffic between the outside and the inside? I suppose it should work; the Jupyter Notebook case runs in a similar way (not Thrift, but presumably another TCP-based protocol). https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/interactive-workloads.html
I had an offline talk with @zhaohehuhu. According to his feedback, the driver successfully launched the Thrift RPC server and registered it to ZooKeeper, so the Kyuubi server got the Thrift RPC server address but could NOT establish the connection. We have seen similar issues reported on another public cloud vendor, caused by dual NICs (https://github.com/apache/kyuubi/issues/6296); I'm not sure what the exact issue is on AWS EMR Serverless.
@pan3793 got your point now.
Does the thrift communication initiate from the EMR serverless to the kyuubi server? Or is it the other way around?
In case, it's the former:
Out of curiosity, what are the security group rules for both the Kyuubi server and EMR serverless (with VPC)?
I'm aware that EMR serverless can establish connections within a VPC
https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/vpc-access.html
The example above was for integration with databases, but the setup should be relatively the same if the flow is from EMR serverless to the Kyuubi server.
@PauloMigAlmeida the connection is initiated from the Kyuubi server to the Spark driver. The detailed steps are:
- When a new connection comes in, Kyuubi looks up ZooKeeper to find a reusable Spark application. If none is found, it performs a spark-submit to launch a new Spark app (a Kyuubi-customized Spark app, called the Kyuubi Spark SQL engine).
- After the Spark driver starts, it launches a Thrift RPC server and registers itself to ZooKeeper, so that the Kyuubi server knows how to connect to this RPC server.
- The Kyuubi server connects to the Spark driver and forwards queries to it.
- The Spark application self-terminates (and deregisters from ZooKeeper) after an idle (no active connections) timeout.
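The steps above can be sketched as follows (all names hypothetical; a plain dict stands in for the ZooKeeper namespace, and launching is simulated rather than doing a real spark-submit):

```python
registry = {}  # stands in for the ZooKeeper namespace: user -> (host, port)

def launch_engine(user):
    # In Kyuubi this would be a real spark-submit; here we just pretend the
    # driver started, opened a Thrift server, and registered its address.
    registry[user] = ("driver-host", 10009)

def open_session(user):
    if user not in registry:        # no reusable engine found in the registry
        launch_engine(user)         # so launch a new Kyuubi Spark SQL engine
    host, port = registry[user]     # engine address read back from registry
    return f"thrift://{host}:{port}"  # server connects and forwards queries

print(open_session("alice"))  # thrift://driver-host:10009
```

A second open_session for the same user reuses the registered engine, which mirrors the engine-sharing behavior described above; the timeout-reported problem is that the final connect step fails even though registration succeeded.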