[TASK][MEDIUM] Support Amazon EMR Serverless on AWS
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Search before asking
- [X] I have searched in the issues and found no similar issues.
Describe the feature
Support Amazon EMR Serverless as a Kyuubi Spark engine to minimize operational cost and achieve a truly serverless Spark SQL experience. Amazon EMR Serverless is not supported yet, because it offers no JDBC connection.
Motivation
Achieve a truly serverless Spark SQL deployment on the AWS cloud.
Describe the solution
Amazon EMR Serverless makes it easy for users to run Spark without configuring, managing, and scaling clusters or servers.
Additional context
No response
Are you willing to submit PR?
- [ ] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
- [X] No. I cannot submit a PR at this time.
Hello @davidshtian, Thanks for finding the time to report the issue! We really appreciate the community's efforts to improve Apache Kyuubi.
Does deploying Kyuubi on Amazon EMR satisfy this? AFAIK, cloud vendors like Tencent Cloud and Aliyun, who provide similar EMR services, have provided JDBC through Kyuubi.
> Does deploying Kyuubi on Amazon EMR satisfy this? AFAIK, cloud vendors like Tencent cloud and Aliyun, who provide similar EMR services, have provided JDBC though Kyuubi
Thanks for your response~
Kyuubi can be deployed on an AWS EMR cluster (EMR on EC2), but that still requires managing and operating the cluster, while EMR Serverless is fully managed and serverless, bringing more flexibility and covering more scenarios. Since EMR Serverless has no JDBC connection and uses the AWS API to submit jobs, it would be better to have support for adapting EMR Serverless to Kyuubi. Thanks~
> EMR Serverless has no JDBC connection and it uses AWS API to submit the job
Sorry for being late.
I have never used AWS. Correct me if I am wrong: do you mean that an AWS EMR cluster does not support spark-submit?
Hi @yaooqinn,
> do you mean that an AWS EMR cluster does not support spark-submit?
EMR (on EC2) does support submitting via the spark-submit utility. However, EMR Serverless does not.
For EMR Serverless, one can submit jobs by using either the AWS SDK or the AWS CLI.
It looks something like this:
```shell
aws emr-serverless start-job-run \
    --application-id <EMR_Serverless_App_Id> \
    --execution-role-arn arn:aws:iam::012345678901:role/my-cool-emr-exec-role \
    --job-driver 'sparkSubmit={entryPoint=s3://my-bucket/script_to_be_executed.py}' \
    --configuration-overrides '{"monitoringConfiguration": {"managedPersistenceMonitoringConfiguration": {"enabled": true}, "cloudWatchLoggingConfiguration": {"enabled": true, "logGroupName": "/aws/emr-serverless/my-logs"}}}'
```
It looks like we need a specific version of org.apache.kyuubi.engine.ProcBuilder for AWS. Also cc @pan3793
To support AWS EMR Serverless Spark, we need to implement the following interfaces in Kyuubi:
- org.apache.kyuubi.engine.ApplicationOperation (for querying and canceling jobs)
- org.apache.kyuubi.engine.ProcBuilder (for submitting jobs)
According to the EMR docs, I think we can use the aws emr-serverless CLI to implement that.
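As a rough illustration (not Kyuubi code; the function and parameter names below are invented), a ProcBuilder-style component could assemble the submit command for the aws emr-serverless CLI like this:

```python
import json

def build_submit_command(app_id, role_arn, entry_point, spark_params=""):
    """Assemble the argv for `aws emr-serverless start-job-run`.

    Hypothetical sketch: only the CLI subcommand and flag names come from
    the AWS docs quoted earlier in the thread; everything else is made up.
    """
    spark_submit = {"entryPoint": entry_point}
    if spark_params:
        spark_submit["sparkSubmitParameters"] = spark_params
    return [
        "aws", "emr-serverless", "start-job-run",
        "--application-id", app_id,
        "--execution-role-arn", role_arn,
        "--job-driver", json.dumps({"sparkSubmit": spark_submit}),
    ]

cmd = build_submit_command(
    "00abc123example",
    "arn:aws:iam::012345678901:role/my-emr-exec-role",
    "s3://my-bucket/kyuubi-engine.py",
)
print(" ".join(cmd[:3]))  # aws emr-serverless start-job-run
```

In a real implementation the resulting argv would be handed to the process launcher the same way the existing spark-submit command line is.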
One concern is the integration tests. I'm not an AWS user; it seems that LocalStack does not support AWS EMR Serverless either? I'm afraid that functionality without CI verification is fragile.
BTW, do GCP and Azure have similar services?
@pan3793, @yaooqinn
> I'm not an AWS user, seems that localstack also does not support AWS EMR Serverless?
You are right, LocalStack doesn't support the endpoints for EMR Serverless yet.
> I'm afraid that functionality without CI verification is fragile.
I agree. Does the Apache Software Foundation have AWS accounts that can be used for CI/CD purposes? If so, that would be the fastest way to address this, as I think we may not be the first project needing this type of integration test.
If it doesn't, you may want to try the AWS promotional credits for Open Source projects. More info at https://aws.amazon.com/blogs/opensource/aws-promotional-credits-open-source-projects/
Regardless of the option you go with, count on me to help with the AWS implementation details (IAM permissions, service configuration, SDK, CLI options and so on); that can speed up development of this feature.
It looks like ASF Infra doesn't have AWS resources for CI/CD. https://infra.apache.org/build-supported-services.html
Another option would be to mock the responses expected from the AWS services involved.
I've seen that done before in another project in which LocalStack didn't support the required service. Would it address the CI tests concern?
> Another option would be to mock responses expected from the AWS services involved.
This is a necessary step for local dev. AWS promotional credits might be necessary for setting up the integration tests.
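For the mocking approach, here is a minimal sketch, assuming a boto3-style client with a get_job_run method (per the EMR Serverless API reference); the EMR Serverless state names follow the JobRun docs, while the target states on the right are purely illustrative, not Kyuubi's actual enum:

```python
from unittest.mock import MagicMock

# EMR Serverless job-run states (per the GetJobRun API docs) mapped to a
# generic application state an ApplicationOperation-like layer might expose.
STATE_MAP = {
    "SUBMITTED": "PENDING",
    "PENDING": "PENDING",
    "SCHEDULED": "PENDING",
    "RUNNING": "RUNNING",
    "SUCCESS": "FINISHED",
    "FAILED": "FAILED",
    "CANCELLED": "KILLED",
}

def query_job_state(client, app_id, job_run_id):
    """Query a job run's state via a (possibly mocked) EMR Serverless client."""
    resp = client.get_job_run(applicationId=app_id, jobRunId=job_run_id)
    return STATE_MAP.get(resp["jobRun"]["state"], "UNKNOWN")

# In CI, the client is a stub returning canned responses; no AWS account needed.
mock_client = MagicMock()
mock_client.get_job_run.return_value = {"jobRun": {"state": "RUNNING"}}
print(query_job_state(mock_client, "app-1", "job-1"))  # RUNNING
```

The same pattern works for start-job-run and cancel-job-run responses, so the state-handling logic gets CI coverage even without a live AWS endpoint.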
cc @zhaohehuhu, you may be interested in this feature
Yup. I'm going to implement this feature.
@zhaohehuhu just checking in to see if you need any help with the AWS side of things
> @zhaohehuhu just checking in to see if you need any help with the AWS side of things
Thanks. It's going well so far. I have already finished the draft code and decided to do a round of testing. @PauloMigAlmeida
@zhaohehuhu If that helps, this is the terraform code that can provision an EMR Serverless cluster with the right permissions https://gist.github.com/PauloMigAlmeida/5cebf3efcd0f105d73646a6a9e8cc2f3
Instructions on how to deploy and run it are in the gist too.
> @zhaohehuhu If that helps, this is the terraform code that can provision an EMR Serverless cluster with the right permissions https://gist.github.com/PauloMigAlmeida/5cebf3efcd0f105d73646a6a9e8cc2f3
> Instructions on how to deploy and run it are in the gist too.
Thanks, I may contact you if needed.
@pan3793 plz assign it to me.
@PauloMigAlmeida I deployed a Kyuubi server on EC2. When the Kyuubi server talks to the Spark engine in EMR Serverless, it always reports a connection timeout. The Kyuubi server and EMR Serverless are already in the same VPC. Do you have any idea what might be wrong?
@zhaohehuhu EMR Serverless doesn't run in a customer-managed VPC. It's accessible via an API that should be invoked using one of the AWS SDKs instead.
API method: https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_StartJobRun.html AWS SDK for Java: https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/emrserverless/EmrServerlessClient.html#startJobRun(software.amazon.awssdk.services.emrserverless.model.StartJobRunRequest)
PS: It requires that an EMR Serverless application is created first.
Does this help?
Thanks @PauloMigAlmeida. EMR Serverless runs in a VPC managed by AWS; I just wonder whether it is possible for the Kyuubi service to talk to the Spark engine in EMR Serverless through the Thrift protocol.
@zhaohehuhu I'm almost certain that it isn't supported, but let me check that internally first and I will come back to you with an answer tomorrow.
On a separate note, if communicating via thrift isn't possible, is there any alternative that could be explored instead?
Thanks. It looks like it's hard for Kyuubi service running on EC2 or others to access Amazon EMR Serverless Spark. I will discuss it with @pan3793.
@zhaohehuhu I got hold of an EMR Serverless specialist internally. Thrift communication isn't possible at this moment on EMR Serverless =/
@PauloMigAlmeida do you know the exact restriction? TCP inbound traffic or something?
@pan3793 It seems that we don't have a Thrift server running on those nodes for the serverless offering.
@PauloMigAlmeida Kyuubi uses Thrift as the internal RPC protocol between the Kyuubi server and the Spark driver; it automatically bootstraps a Thrift server on the Spark driver. So the question is: does EMR Serverless allow Thrift (a TCP-based protocol) traffic between the outside and the inside? I suppose it should work; the Jupyter Notebook case runs in a similar way (not Thrift, but presumably another TCP-based protocol). https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/interactive-workloads.html
I had an offline talk with @zhaohehuhu. According to his feedback, the driver successfully launched the Thrift RPC server and registered it to ZooKeeper, so the Kyuubi server got the Thrift RPC server address but could NOT establish the connection. We have seen similar issues reported on another public cloud vendor, caused by dual NICs (https://github.com/apache/kyuubi/issues/6296); I'm not sure what the exact issue is on AWS EMR Serverless.
@pan3793 got your point now.
Does the thrift communication initiate from the EMR serverless to the kyuubi server? Or is it the other way around?
In case, it's the former:
Out of curiosity, what are the security group rules for both the Kyuubi server and EMR serverless (with VPC)?
I'm aware that EMR serverless can establish connections within a VPC
https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/vpc-access.html
The example above was for integration with databases, but the setup should be relatively the same if the flow is from EMR serverless to the Kyuubi server.
@PauloMigAlmeida the connection is initiated from the Kyuubi server to the Spark driver. The detailed steps are:
- When a new connection comes in, Kyuubi looks up ZooKeeper to find a reusable Spark application. If none is found, it performs a spark-submit to launch a new Spark app (a Kyuubi-customized Spark app, called the Kyuubi Spark SQL engine).
- After the Spark driver starts, it launches a Thrift RPC server and registers itself to ZooKeeper, so that the Kyuubi server knows how to connect to this RPC server.
- The Kyuubi server connects to the Spark driver and forwards queries to it.
- The Spark application self-terminates (and deregisters from ZooKeeper) after an idle (no active connections) timeout.
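The steps above can be sketched as follows (all names hypothetical; a plain dict stands in for the ZooKeeper namespace, and launching is simulated rather than doing a real spark-submit):

```python
registry = {}  # stands in for the ZooKeeper namespace: user -> (host, port)

def launch_engine(user):
    # In Kyuubi this would be a real spark-submit; here we just pretend the
    # driver started, opened a Thrift server, and registered its address.
    registry[user] = ("driver-host", 10009)

def open_session(user):
    if user not in registry:        # no reusable engine found in the registry
        launch_engine(user)         # so launch a new Kyuubi Spark SQL engine
    host, port = registry[user]     # engine address read back from registry
    return f"thrift://{host}:{port}"  # server connects and forwards queries

print(open_session("alice"))  # thrift://driver-host:10009
```

A second open_session for the same user reuses the registered engine, which mirrors the engine-sharing behavior described above; the timeout-reported problem is that the final connect step fails even though registration succeeded.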