Adds support to connections with Thrift over HTTP transport.
- Supports HTTP transport for Thrift protocol.
- Three types of authentication supported: NONE, NOSASL, BASIC and KERBEROS.
- BASIC authentication is useful when the Thrift HTTP interface is behind a proxy (e.g. in Azure HDInsight clusters).
This PR solves issue https://github.com/dropbox/PyHive/issues/69. Let me know what you think and if it looks good for you I'll invest time in the tests that are missing in this PR. :)
Codecov Report
Merging #325 into master will increase coverage by
0.44%. The diff coverage is91.71%.
@@ Coverage Diff @@
## master #325 +/- ##
==========================================
+ Coverage 93.23% 93.67% +0.44%
==========================================
Files 14 14
Lines 1523 1677 +154
Branches 165 185 +20
==========================================
+ Hits 1420 1571 +151
+ Misses 75 74 -1
- Partials 28 32 +4
| Impacted Files | Coverage Δ | |
|---|---|---|
| pyhive/hive.py | 89.34% <86.76%> (+1.94%) |
:arrow_up: |
| pyhive/tests/test_hive.py | 96.81% <95.50%> (-0.93%) |
:arrow_down: |
| pyhive/tests/test_common.py | 100.00% <0.00%> (ø) |
|
| pyhive/tests/test_presto.py | 100.00% <0.00%> (ø) |
|
| pyhive/tests/test_sqlalchemy_hive.py | 100.00% <0.00%> (ø) |
|
| pyhive/sqlalchemy_hive.py | 86.51% <0.00%> (+0.06%) |
:arrow_up: |
| pyhive/sqlalchemy_presto.py | 91.52% <0.00%> (+0.07%) |
:arrow_up: |
| pyhive/presto.py | 86.50% <0.00%> (+0.51%) |
:arrow_up: |
| pyhive/common.py | 94.53% <0.00%> (+0.56%) |
:arrow_up: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact),ø = not affected,? = missing dataPowered by Codecov. Last update c448ed9...97fe26e. Read the comment docs.
will take a pass on it next week, unfortunately can't get to this earlier
@joaopedroantonio I'm aligned with the approach here, let's add the unit tests & add the test plan in the PR description if you don't mind.
Thank you for the feedback @bkyryliuk, will start working on the tests between this weekend and start of next week.
Any update on this?
Any update on this?
Oh well, tbh this got lost on my backlog, thanks for the bump. :) Will try to wrap it up this week.
Added unit tests for different HTTP scenarios and fixed a few issues in the process.
@bkyryliuk the codecov checks failed but I guess that's mostly because I moved the binary transport implementation and there are a few line/branch misses. Do you think it's relevant to cover those misses or can we move forward as it is?
@bkyryliuk, any idea when you might be able to pick this up? :)
@bkyryliuk bump. :)
Hello again everyone! Can you give some feedback on this PR? Thanks! (cc @bkyryliuk )
I've tested this PR and this is working nicely for our HiveServer2 service running over http. Awesome work @joaopedroantonio - would love to see this get merged
For our particular use-case we need a few more features:
- Support for connecting to a
httpsendpoint - Support for passing through custom http headers
- Support for passing through custom cookie values
I don't think these should block this PR though, I'd be happy to attempt to add these features once this initial version is in, unless of course you want to add them :)
@gthomas-slack sorry for the late reply! I'm glad this was helpful! I'll take a better look at your comments over the weekend and I'm happy to include your suggestions if they aren't too complex. :)
For thrift=0.13.0 (current version in conda/pip) it doesn't work with the included TCookieHttpClient wrapper class, and I get the following error in the connect constructor when executing the first "use default" query:
File "/root/superset/pyhive/pyhive/hive.py", line 189, in __init__
cursor.execute('USE `{}`'.format(database))
File "/root/superset/pyhive/pyhive/hive.py", line 459, in execute
response = self._connection.client.ExecuteStatement(req)
File "/root/superset/pyhive/TCLIService/TCLIService.py", line 280, in ExecuteStatement
return self.recv_ExecuteStatement()
File "/root/superset/pyhive/TCLIService/TCLIService.py", line 292, in recv_ExecuteStatement
(fname, mtype, rseqid) = iprot.readMessageBegin()
File "/root/superset/pyhive/env/lib/python3.7/site-packages/thrift/protocol/TBinaryProtocol.py", line 148, in readMessageBegin
name = self.trans.readAll(sz)
File "/root/superset/pyhive/env/lib/python3.7/site-packages/thrift/transport/TTransport.py", line 68, in readAll
raise EOFError()
After replacing TCookieHttpClient with the original THttpClient it seems to work ok for http connections (no SSL support included).
I believe the parameters name for the HTTP Thrift should be renamed to match the (simpler) names used in the original JDBC Thrift URI. Also 'binary', 'http', 'sasl' are not named "protocols" but transport "modes" according to JDBC/Hive documentation Connection URL When HiveServer2 Is Running in HTTP Mode
Example Thrift JDBC over HTTPS URI:
jdbc:hive2://{host}:{port}/{database};transportMode=http;httpPath=/hive;ssl=true;sslTrustStore=truststore.jks;trustStorePassword=changeit
Based on this it seems the names should be:
-
thrift_transport_protocol->transport_mode(transportMode in JDBC/Thrift) -
http_path->http_path(httpPath in JDBC/Thrift)
The remaining parameters needed to support HTTPS (HTTP over SSL) should be:
-
ssl = True/False(could also be handled withtransport_mode=httpsbut this seems cleaner). -
cacert,ca_certorcafile(Python uses plain PEM cert files instead of Java keystores/truststores - OPTIONAL:
cert_file,key_fileandkey_pass- to support SSL client authentication (as intended in #306)
Hi @joaopedroantonio . Bumping this. Thank you for working on this! Could you pls look at the new feedback. Thank you
I've tested this with my Spark Thrift HiveServer2 deployment behind a load balancer on Kubernetes and it works great. I'm looking forward to using this to connect Superset to my Spark Thrift server and would like to see this accepted soon. Thanks @joaopedroantonio and everyone in the thread.
For anyone who needs to use http/https HiveServer2 in production right now, It's actually possible with the current release of PyHive, as the HiveServer2 Connection class exposes a thrift_transport which allows you to pass in a custom configured transport to use: https://github.com/dropbox/PyHive/blob/master/pyhive/hive.py#L112-L113
Here is a basic example using HTTPS and adding a custom authentication header:
from pyhive import hive
import base64
import thrift.transport.THttpClient
def thrift_http_transport():
transport = thrift.transport.THttpClient.THttpClient(uri_or_host='https://my-hiveserver2.com:443/cliservice')
auth_credentials = '{}:{}'.format('test', 'test').encode('UTF-8')
auth_credentials_base64 = base64.standard_b64encode(auth_credentials).decode('UTF-8')
transport.setCustomHeaders(
{
'Authorization': 'Basic {}'.format(auth_credentials_base64), # HiveServer2 BASIC auth
'X-Auth-Token': 'xxx' # Custom header to auth with some kind of middleware
}
)
return transport
conn = hive.connect(thrift_transport=thrift_http_transport())
cursor = conn.cursor()
cursor.execute("""SELECT SUM(1) from model.dim_date""")
data = cursor.fetchall()
cursor.close()
conn.close()
print(data)
Can we use http protocol with SQLAlchemy?
Does anyone have an example for http connection with kerberos auth ?
FYI there is currently a bug using HTTP transport connections when using thrift 0.15.0, some details here:
https://github.com/dropbox/PyHive/issues/417#issuecomment-961257993
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.
:x: Joao Antonio
:x: joaopedroantonio
Joao Antonio seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.
Hi @joaopedroantonio are there any updates on this?