SNOW-704114 Implement `DataFrame.summary()` (#629)
Please answer these questions before submitting your pull request. Thanks!
1. What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

   Fixes #SNOW-704114
2. Fill out the following pre-review checklist:
   - [x] I am adding a new automated test(s) to verify correctness of my new code
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
3. Please describe how your code solves the related issue.

   This change matches the `pandas.DataFrame.describe()` function by showing percentiles, which previously were not shown (illustrated below).
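For reference, a minimal illustration of the percentile rows pandas includes in `describe()` output by default:

```python
import pandas as pd

# pandas.DataFrame.describe() shows the 25%/50%/75% percentiles
# alongside count/mean/std/min/max by default.
print(pd.DataFrame({"a": [1, 2, 3, 4]}).describe())
#               a
# count  4.000000
# mean   2.500000
# std    1.290994
# min    1.000000
# 25%    1.750000
# 50%    2.500000
# 75%    3.250000
# max    4.000000
```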
CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅
Hello @suenalaba , thank you for contributing! Could you please also update CHANGELOG.md to indicate this improvement to df.describe?
I have read the CLA Document and I hereby sign the CLA
Hello @sfc-gh-stan , I've updated CHANGELOG.md to reflect this updated feature for df.describe().
Hello @suenalaba , #735 is fixed, could you please fix the integration test and doctest failures and re-request review?
Hi @sfc-gh-stan , I understand that #735 is fixed. However, even for files I did not touch, I am still not able to run any of the integration tests (only the unit tests work). It seems that there are still some issues with the connection parameters.
The following error message appears consistently:
```
@pytest.fixture(scope="session")
def connection(db_parameters):
    ret = db_parameters
>   with snowflake.connector.connect(
        user=ret.get("user"),
        password=ret.get("password"),
        host=ret.get("host"),
        port=ret.get("port"),
        database=ret.get("database"),
        account=ret.get("account"),
        protocol=ret.get("protocol"),
        role=ret.get("role"),
    ) as con:

tests/integ/conftest.py:57:
../../opt/anaconda3/envs/snowpark-dv/lib/python3.8/site-packages/snowflake/connector/__init__.py:50: in Connect
    return SnowflakeConnection(**kwargs)
../../opt/anaconda3/envs/snowpark-dv/lib/python3.8/site-packages/snowflake/connector/connection.py:304: in __init__
    self.connect(**kwargs)
../../opt/anaconda3/envs/snowpark-dv/lib/python3.8/site-packages/snowflake/connector/connection.py:571: in connect
    self.__open_connection()

self = <snowflake.connector.connection.SnowflakeConnection object at 0x7fe7e2ffd4f0>

    def __open_connection(self):
        """Opens a new network connection."""
        self.converter = self._converter_class(
            use_numpy=self._numpy, support_negative_year=self._support_negative_year
        )
        proxy.set_proxies(
            self.proxy_host, self.proxy_port, self.proxy_user, self.proxy_password
        )
        self._rest = SnowflakeRestful(
            host=self.host,
            port=self.port,
            protocol=self._protocol,
            inject_client_pause=self._inject_client_pause,
            connection=self,
        )
        logger.debug("REST API object was created: %s:%s", self.host, self.port)
        if "SF_OCSP_RESPONSE_CACHE_SERVER_URL" in os.environ:
            logger.debug(
                "Custom OCSP Cache Server URL found in environment - %s",
                os.environ["SF_OCSP_RESPONSE_CACHE_SERVER_URL"],
            )
>       if self.host.endswith(".privatelink.snowflakecomputing.com"):
E       AttributeError: 'NoneType' object has no attribute 'endswith'

../../opt/anaconda3/envs/snowpark-dv/lib/python3.8/site-packages/snowflake/connector/connection.py:742: AttributeError
```
I've tested my connection parameters elsewhere and they work completely fine. Any idea why it is still failing?
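For context, the trace shows that `self.host` is `None` by the time the connector runs its PrivateLink check. A guard along these lines would fail fast with a clearer error; this is only a sketch, and `REQUIRED_KEYS` / `validate_db_parameters` are hypothetical names, not part of the test suite:

```python
# Hypothetical pre-flight check for db_parameters (not part of this PR).
REQUIRED_KEYS = ("account", "user", "password", "host")  # illustrative subset

def validate_db_parameters(params: dict) -> None:
    # Reject absent or empty values so the connector never sees host=None.
    missing = [k for k in REQUIRED_KEYS if not params.get(k)]
    if missing:
        raise ValueError(f"Missing connection parameters: {missing}")
```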
Hi @suenalaba , I opened #750 to address the issue you are encountering. Feel free to pull the changes to unblock your development. On a separate note, I had a discussion with @sfc-gh-gfrere on SNOW-704114. We concluded it's better to implement DataFrame.summary() to display the percentiles instead of modifying the existing DataFrame.describe(), to avoid an unexpected behavior change. Similar to DataFrame.describe, DataFrame.summary should display the count, mean, stddev, min, max, and approximate quartiles for the specified list of columns (a rough sketch follows below). Sorry for the change of requirements; would you like to address this new feature request? Please don't hesitate to ask if you have any questions.
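A minimal sketch of what such a `summary()` could look like, built on Snowpark's `approx_percentile` aggregate; the function name, output shape, and double cast are assumptions, not the final implementation:

```python
from snowflake.snowpark.functions import (
    approx_percentile,
    count,
    lit,
    max as max_,
    mean,
    min as min_,
    stddev,
)
from snowflake.snowpark.types import DoubleType

def summary(df, cols):
    """Return one row per statistic (count, mean, stddev, min,
    approximate quartiles, max) for the given columns."""
    rows = []
    for label, agg in [
        ("count", count),
        ("mean", mean),
        ("stddev", stddev),
        ("min", min_),
        ("25%", lambda c: approx_percentile(c, 0.25)),
        ("50%", lambda c: approx_percentile(c, 0.5)),
        ("75%", lambda c: approx_percentile(c, 0.75)),
        ("max", max_),
    ]:
        # One single-row DataFrame per statistic, cast to a common type
        # so the rows can be unioned.
        rows.append(
            df.select(
                lit(label).alias("summary"),
                *[agg(c).cast(DoubleType()).alias(c) for c in cols],
            )
        )
    result = rows[0]
    for row in rows[1:]:
        result = result.union_all(row)
    return result
```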
Hello @sfc-gh-stan, I've pulled the changes in #750 and they now allow me to run the integration tests, thanks for that!
I am unable to see the discussion on SNOW-704114 as I do not have access, but I understand the rationale you mentioned regarding the unexpected behavior change.
I've created a new df.summary() function based on my understanding of your description, and I've also addressed the failing integration tests you mentioned earlier.
I've added those in my latest commits; kindly let me know if there is any way I can improve on it!
Hi @suenalaba , thank you for addressing the new requirements! I meant to link #629 . Your changes look good to me; let's make them more concise by (1) using lambdas and (2) extracting and reusing the duplicate code between DataFrame.describe and DataFrame.summary, roughly as sketched below.
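For example, building on the sketch above (again illustrative only; `_stats_rows` and the stat tables are hypothetical names, not the actual code, and row ordering is ignored for brevity):

```python
from snowflake.snowpark.functions import (
    approx_percentile, count, lit, max as max_, mean, min as min_, stddev,
)

# Shared helper: each entry maps a statistic label to a callable that
# produces the aggregate Column for a given column name.
def _stats_rows(df, cols, stats):
    result = None
    for label, agg in stats.items():
        row = df.select(lit(label).alias("summary"),
                        *[agg(c).alias(c) for c in cols])
        result = row if result is None else result.union_all(row)
    return result

# describe() and summary() then differ only in the statistics requested.
DESCRIBE_STATS = {"count": count, "mean": mean, "stddev": stddev,
                  "min": min_, "max": max_}
SUMMARY_STATS = {**DESCRIBE_STATS,
                 "25%": lambda c: approx_percentile(c, 0.25),
                 "50%": lambda c: approx_percentile(c, 0.5),
                 "75%": lambda c: approx_percentile(c, 0.75)}
```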
Hey @sfc-gh-stan , thanks for your suggestion. I've cleaned it up a little and used lambdas to make it more concise! Kindly let me know if this is good or if there are any further changes needed.
Hi @suenalaba , I think some merge gates are failing because the session parameter APPROX_PERCENTILE_EXACT_IF_POSSIBLE is set to TRUE by default in some deployments; there is an ongoing parameter rollout. Could you please change the tests to accept both values here? Your code LGTM otherwise.
Hey, @sfc-gh-stan is there anywhere I can find more information on the APPROX_PERCENTILE_EXACT_IF_POSSIBLE parameter?
I'm not sure if there's any public documentation, but you could flip the parameter on and off with `alter session set APPROX_PERCENTILE_EXACT_IF_POSSIBLE = true/false`.
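In Snowpark Python, assuming an existing `session`, that toggle looks like:

```python
# Flip the server-side parameter for the current session only.
session.sql(
    "alter session set APPROX_PERCENTILE_EXACT_IF_POSSIBLE = true"
).collect()
```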
@sfc-gh-stan the parameter allows APPROX_PERCENTILE to use the full population when memory is not filled. This was an optimization, since we previously used the sparse representation unnecessarily on small input sizes.
@sfc-gh-stan where exactly do you guys set this parameter? Is it in any test configuration files?
These are Snowflake parameters, set on the server side. From a client, they are set at runtime by issuing SQL commands like `alter account/user/session <parameter> = <val>`. In Snowpark Python, this would be `session.sql("...")`.
The merge gate test failures are caused by different Snowflake servers having different default values for this parameter due to the ongoing parameter rollout. The value of this parameter affects the results of `approx_percentile`. For now, I'm suggesting we change the tests to accept both results (when the parameter is true or false), as sketched below.
Please let me know if you have any other questions 🙂
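A rough sketch of that test change (the test name, data, and expected values are illustrative placeholders, not the actual test):

```python
from snowflake.snowpark.functions import col

def test_summary_median(session):
    df = session.create_dataframe([[1], [2], [3], [4]], schema=["a"])
    median = df.summary().where(col("summary") == "50%").collect()[0]["A"]
    # Accept both the exact result (parameter = true) and the approximate
    # result (parameter = false); 2.5 and 2.0 are illustrative placeholders.
    assert median in (2.5, 2.0)
```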
@sfc-gh-stan Hi Sophie, sorry for the late reply. Apparently my free trial expired, and I'm having some issues setting up my payment on Snowflake. Is it possible for you to help make the above changes to close this issue?