PyHive icon indicating copy to clipboard operation
PyHive copied to clipboard

Adds support to connections with Thrift over HTTP transport.

Open joaopedroantonio opened this issue 5 years ago • 25 comments

  • Supports HTTP transport for Thrift protocol.
  • Three types of authentication supported: NONE, NOSASL, BASIC and KERBEROS.
  • BASIC authentication is useful when the Thrift HTTP interface is behind a proxy (e.g. in Azure HDInsight clusters).

joaopedroantonio avatar Apr 04 '20 15:04 joaopedroantonio

This PR solves issue https://github.com/dropbox/PyHive/issues/69. Let me know what you think and if it looks good for you I'll invest time in the tests that are missing in this PR. :)

joaopedroantonio avatar Apr 04 '20 15:04 joaopedroantonio

Codecov Report

Merging #325 into master will increase coverage by 0.44%. The diff coverage is 91.71%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #325      +/-   ##
==========================================
+ Coverage   93.23%   93.67%   +0.44%     
==========================================
  Files          14       14              
  Lines        1523     1677     +154     
  Branches      165      185      +20     
==========================================
+ Hits         1420     1571     +151     
+ Misses         75       74       -1     
- Partials       28       32       +4     
Impacted Files Coverage Δ
pyhive/hive.py 89.34% <86.76%> (+1.94%) :arrow_up:
pyhive/tests/test_hive.py 96.81% <95.50%> (-0.93%) :arrow_down:
pyhive/tests/test_common.py 100.00% <0.00%> (ø)
pyhive/tests/test_presto.py 100.00% <0.00%> (ø)
pyhive/tests/test_sqlalchemy_hive.py 100.00% <0.00%> (ø)
pyhive/sqlalchemy_hive.py 86.51% <0.00%> (+0.06%) :arrow_up:
pyhive/sqlalchemy_presto.py 91.52% <0.00%> (+0.07%) :arrow_up:
pyhive/presto.py 86.50% <0.00%> (+0.51%) :arrow_up:
pyhive/common.py 94.53% <0.00%> (+0.56%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update c448ed9...97fe26e. Read the comment docs.

codecov[bot] avatar Apr 04 '20 18:04 codecov[bot]

will take a pass on it next week, unfortunately can't get to this earlier

bkyryliuk avatar Apr 16 '20 23:04 bkyryliuk

@joaopedroantonio I'm aligned with the approach here, let's add the unit tests & add the test plan in the PR description if you don't mind.

bkyryliuk avatar Apr 28 '20 16:04 bkyryliuk

Thank you for the feedback @bkyryliuk, will start working on the tests between this weekend and start of next week.

joaopedroantonio avatar Apr 30 '20 16:04 joaopedroantonio

Any update on this?

MinuMichaelOlassayil avatar Jun 16 '20 00:06 MinuMichaelOlassayil

Any update on this?

Oh well, tbh this got lost on my backlog, thanks for the bump. :) Will try to wrap it up this week.

joaopedroantonio avatar Jun 16 '20 08:06 joaopedroantonio

Added unit tests for different HTTP scenarios and fixed a few issues in the process.

@bkyryliuk the codecov checks failed but I guess that's mostly because I moved the binary transport implementation and there are a few line/branch misses. Do you think it's relevant to cover those misses or can we move forward as it is?

joaopedroantonio avatar Jun 26 '20 16:06 joaopedroantonio

@bkyryliuk, any idea when you might be able to pick this up? :)

joaopedroantonio avatar Jul 13 '20 08:07 joaopedroantonio

@bkyryliuk bump. :)

joaopedroantonio avatar Aug 21 '20 13:08 joaopedroantonio

Hello again everyone! Can you give some feedback on this PR? Thanks! (cc @bkyryliuk )

joaopedroantonio avatar Oct 15 '20 18:10 joaopedroantonio

I've tested this PR and this is working nicely for our HiveServer2 service running over http. Awesome work @joaopedroantonio - would love to see this get merged

gthomas-slack avatar Nov 13 '20 21:11 gthomas-slack

For our particular use-case we need a few more features:

  • Support for connecting to a https endpoint
  • Support for passing through custom http headers
  • Support for passing through custom cookie values

I don't think these should block this PR though, I'd be happy to attempt to add these features once this initial version is in, unless of course you want to add them :)

gthomas-slack avatar Nov 13 '20 22:11 gthomas-slack

@gthomas-slack sorry for the late reply! I'm glad this was helpful! I'll take a better look at your comments over the weekend and I'm happy to include your suggestions if they aren't too complex. :)

joaopedroantonio avatar Nov 18 '20 23:11 joaopedroantonio

For thrift=0.13.0 (current version in conda/pip) it doesn't work with the included TCookieHttpClient wrapper class, and I get the following error in the connect constructor when executing the first "use default" query:

  File "/root/superset/pyhive/pyhive/hive.py", line 189, in __init__
    cursor.execute('USE `{}`'.format(database))
  File "/root/superset/pyhive/pyhive/hive.py", line 459, in execute
    response = self._connection.client.ExecuteStatement(req)
  File "/root/superset/pyhive/TCLIService/TCLIService.py", line 280, in ExecuteStatement
    return self.recv_ExecuteStatement()
  File "/root/superset/pyhive/TCLIService/TCLIService.py", line 292, in recv_ExecuteStatement
    (fname, mtype, rseqid) = iprot.readMessageBegin()
  File "/root/superset/pyhive/env/lib/python3.7/site-packages/thrift/protocol/TBinaryProtocol.py", line 148, in readMessageBegin
    name = self.trans.readAll(sz)
  File "/root/superset/pyhive/env/lib/python3.7/site-packages/thrift/transport/TTransport.py", line 68, in readAll
    raise EOFError()

After replacing TCookieHttpClient with the original THttpClient it seems to work ok for http connections (no SSL support included).

lvazquez avatar Nov 19 '20 21:11 lvazquez

I believe the parameters name for the HTTP Thrift should be renamed to match the (simpler) names used in the original JDBC Thrift URI. Also 'binary', 'http', 'sasl' are not named "protocols" but transport "modes" according to JDBC/Hive documentation Connection URL When HiveServer2 Is Running in HTTP Mode

Example Thrift JDBC over HTTPS URI: jdbc:hive2://{host}:{port}/{database};transportMode=http;httpPath=/hive;ssl=true;sslTrustStore=truststore.jks;trustStorePassword=changeit

Based on this it seems the names should be:

  • thrift_transport_protocol -> transport_mode (transportMode in JDBC/Thrift)
  • http_path -> http_path (httpPath in JDBC/Thrift)

The remaining parameters needed to support HTTPS (HTTP over SSL) should be:

  • ssl = True/False (could also be handled with transport_mode=https but this seems cleaner).
  • cacert, ca_cert or cafile (Python uses plain PEM cert files instead of Java keystores/truststores
  • OPTIONAL: cert_file, key_file and key_pass - to support SSL client authentication (as intended in #306)

lvazquez avatar Nov 19 '20 21:11 lvazquez

Hi @joaopedroantonio . Bumping this. Thank you for working on this! Could you pls look at the new feedback. Thank you

kent-pawar avatar Dec 10 '20 21:12 kent-pawar

I've tested this with my Spark Thrift HiveServer2 deployment behind a load balancer on Kubernetes and it works great. I'm looking forward to using this to connect Superset to my Spark Thrift server and would like to see this accepted soon. Thanks @joaopedroantonio and everyone in the thread.

mhconradt avatar Dec 14 '20 04:12 mhconradt

For anyone who needs to use http/https HiveServer2 in production right now, It's actually possible with the current release of PyHive, as the HiveServer2 Connection class exposes a thrift_transport which allows you to pass in a custom configured transport to use: https://github.com/dropbox/PyHive/blob/master/pyhive/hive.py#L112-L113

Here is a basic example using HTTPS and adding a custom authentication header:

from pyhive import hive
import base64
import thrift.transport.THttpClient


def thrift_http_transport():
    transport = thrift.transport.THttpClient.THttpClient(uri_or_host='https://my-hiveserver2.com:443/cliservice')

    auth_credentials = '{}:{}'.format('test', 'test').encode('UTF-8')
    auth_credentials_base64 = base64.standard_b64encode(auth_credentials).decode('UTF-8')
    transport.setCustomHeaders(
        {
            'Authorization': 'Basic {}'.format(auth_credentials_base64),  # HiveServer2 BASIC auth
            'X-Auth-Token': 'xxx' # Custom header to auth with some kind of middleware
        }
    )
    return transport


conn = hive.connect(thrift_transport=thrift_http_transport())
cursor = conn.cursor()
cursor.execute("""SELECT SUM(1) from model.dim_date""")
data = cursor.fetchall()
cursor.close()
conn.close()
print(data)

gthomas-slack avatar Dec 14 '20 17:12 gthomas-slack

Can we use http protocol with SQLAlchemy?

mithunjmistry avatar Jun 21 '21 21:06 mithunjmistry

Does anyone have an example for http connection with kerberos auth ?

ajprabhu09 avatar Oct 13 '21 10:10 ajprabhu09

FYI there is currently a bug using HTTP transport connections when using thrift 0.15.0, some details here: https://github.com/dropbox/PyHive/issues/417#issuecomment-961257993

gthomas-slack avatar Nov 04 '21 17:11 gthomas-slack

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

:x: Joao Antonio
:x: joaopedroantonio


Joao Antonio seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

CLAassistant avatar Apr 16 '22 21:04 CLAassistant

Hi @joaopedroantonio are there any updates on this?

danjampro avatar Nov 28 '22 20:11 danjampro