SNOW-704114 Implement `DataFrame.summary()` (#629)
Please answer these questions before submitting your pull request. Thanks!
1. What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

   Fixes #SNOW-704114
2. Fill out the following pre-review checklist:
   - [x] I am adding a new automated test(s) to verify correctness of my new code
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
3. Please describe how your code solves the related issue.

   This change matches the `pandas.DataFrame.describe()` function by showing percentiles, which previously were not shown (illustrated below).
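For reference, a minimal illustration of the percentile rows pandas includes in `describe()` output by default:

```python
import pandas as pd

# pandas.DataFrame.describe() shows the 25%/50%/75% percentiles
# alongside count/mean/std/min/max by default.
print(pd.DataFrame({"a": [1, 2, 3, 4]}).describe())
#               a
# count  4.000000
# mean   2.500000
# std    1.290994
# min    1.000000
# 25%    1.750000
# 50%    2.500000
# 75%    3.250000
# max    4.000000
```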
CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅
Hello @suenalaba , thank you for contributing! Could you please also update CHANGELOG.md to indicate this improvement to df.describe?
I have read the CLA Document and I hereby sign the CLA
Hello @sfc-gh-stan , I've updated CHANGELOG.md to reflect this updated feature for df.describe().
Hello @suenalaba , #735 is fixed, could you please fix the integration test and doctest failures and re-request review?
Hi @sfc-gh-stan , I understand that #735 is fixed. However, even for files I did not touch, I am still not able to run any of the integration tests (only the unit tests work). It seems that there are still some issues with the connection parameters.
The following error message appears consistently:
```
@pytest.fixture(scope="session")
def connection(db_parameters):
    ret = db_parameters
>   with snowflake.connector.connect(
        user=ret.get("user"),
        password=ret.get("password"),
        host=ret.get("host"),
        port=ret.get("port"),
        database=ret.get("database"),
        account=ret.get("account"),
        protocol=ret.get("protocol"),
        role=ret.get("role"),
    ) as con:

tests/integ/conftest.py:57:
../../opt/anaconda3/envs/snowpark-dv/lib/python3.8/site-packages/snowflake/connector/__init__.py:50: in Connect
    return SnowflakeConnection(**kwargs)
../../opt/anaconda3/envs/snowpark-dv/lib/python3.8/site-packages/snowflake/connector/connection.py:304: in __init__
    self.connect(**kwargs)
../../opt/anaconda3/envs/snowpark-dv/lib/python3.8/site-packages/snowflake/connector/connection.py:571: in connect
    self.__open_connection()

self = <snowflake.connector.connection.SnowflakeConnection object at 0x7fe7e2ffd4f0>

    def __open_connection(self):
        """Opens a new network connection."""
        self.converter = self._converter_class(
            use_numpy=self._numpy, support_negative_year=self._support_negative_year
        )
        proxy.set_proxies(
            self.proxy_host, self.proxy_port, self.proxy_user, self.proxy_password
        )
        self._rest = SnowflakeRestful(
            host=self.host,
            port=self.port,
            protocol=self._protocol,
            inject_client_pause=self._inject_client_pause,
            connection=self,
        )
        logger.debug("REST API object was created: %s:%s", self.host, self.port)
        if "SF_OCSP_RESPONSE_CACHE_SERVER_URL" in os.environ:
            logger.debug(
                "Custom OCSP Cache Server URL found in environment - %s",
                os.environ["SF_OCSP_RESPONSE_CACHE_SERVER_URL"],
            )
>       if self.host.endswith(".privatelink.snowflakecomputing.com"):
E       AttributeError: 'NoneType' object has no attribute 'endswith'

../../opt/anaconda3/envs/snowpark-dv/lib/python3.8/site-packages/snowflake/connector/connection.py:742: AttributeError
```
I've tested my connection parameters elsewhere and they work completely fine. Any idea why it is still failing?
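For context, the trace shows that `self.host` is `None` by the time the connector runs its PrivateLink check. A guard along these lines would fail fast with a clearer error; this is only a sketch, and `REQUIRED_KEYS` / `validate_db_parameters` are hypothetical names, not part of the test suite:

```python
# Hypothetical pre-flight check for db_parameters (not part of this PR).
REQUIRED_KEYS = ("account", "user", "password", "host")  # illustrative subset

def validate_db_parameters(params: dict) -> None:
    # Reject absent or empty values so the connector never sees host=None.
    missing = [k for k in REQUIRED_KEYS if not params.get(k)]
    if missing:
        raise ValueError(f"Missing connection parameters: {missing}")
```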
Hi @suenalaba , I opened #750 to address the issue you are encountering. Feel free to pull the changes to unblock your development. On a separate note, I had a discussion with @sfc-gh-gfrere on SNOW-704114. We concluded it's better to implement DataFrame.summary() to display the percentiles instead of modifying the existing DataFrame.describe(), to avoid an unexpected behavior change. Similar to DataFrame.describe, DataFrame.summary should display the count, mean, stddev, min, max, and approximate quartiles for the specified list of columns (a rough sketch follows below). Sorry for the change of requirements; would you like to address this new feature request? Please don't hesitate to ask if you have any questions.
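A minimal sketch of what such a `summary()` could look like, built on Snowpark's `approx_percentile` aggregate; the function name, output shape, and double cast are assumptions, not the final implementation:

```python
from snowflake.snowpark.functions import (
    approx_percentile,
    count,
    lit,
    max as max_,
    mean,
    min as min_,
    stddev,
)
from snowflake.snowpark.types import DoubleType

def summary(df, cols):
    """Return one row per statistic (count, mean, stddev, min,
    approximate quartiles, max) for the given columns."""
    rows = []
    for label, agg in [
        ("count", count),
        ("mean", mean),
        ("stddev", stddev),
        ("min", min_),
        ("25%", lambda c: approx_percentile(c, 0.25)),
        ("50%", lambda c: approx_percentile(c, 0.5)),
        ("75%", lambda c: approx_percentile(c, 0.75)),
        ("max", max_),
    ]:
        # One single-row DataFrame per statistic, cast to a common type
        # so the rows can be unioned.
        rows.append(
            df.select(
                lit(label).alias("summary"),
                *[agg(c).cast(DoubleType()).alias(c) for c in cols],
            )
        )
    result = rows[0]
    for row in rows[1:]:
        result = result.union_all(row)
    return result
```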
Hello @sfc-gh-stan, I've pulled the changes in #750 and they now allow me to run the integration tests, thanks for that!
I am unable to see the discussion on SNOW-704114 as I do not have access, but I understand the rationale you mentioned regarding the unexpected behavior change.
I've created a new df.summary() function based on my understanding of your description, and I've also addressed the failing integration tests you mentioned earlier.
I've added those in my latest commits; kindly let me know if there is any way I can improve on it!
Hi @suenalaba , thank you for addressing the new requirements! I meant to link #629 . Your changes look good to me; let's make them more concise by (1) using lambdas and (2) extracting and reusing the duplicate code between DataFrame.describe and DataFrame.summary, roughly as sketched below.
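For example, building on the sketch above (again illustrative only; `_stats_rows` and the stat tables are hypothetical names, not the actual code, and row ordering is ignored for brevity):

```python
from snowflake.snowpark.functions import (
    approx_percentile, count, lit, max as max_, mean, min as min_, stddev,
)

# Shared helper: each entry maps a statistic label to a callable that
# produces the aggregate Column for a given column name.
def _stats_rows(df, cols, stats):
    result = None
    for label, agg in stats.items():
        row = df.select(lit(label).alias("summary"),
                        *[agg(c).alias(c) for c in cols])
        result = row if result is None else result.union_all(row)
    return result

# describe() and summary() then differ only in the statistics requested.
DESCRIBE_STATS = {"count": count, "mean": mean, "stddev": stddev,
                  "min": min_, "max": max_}
SUMMARY_STATS = {**DESCRIBE_STATS,
                 "25%": lambda c: approx_percentile(c, 0.25),
                 "50%": lambda c: approx_percentile(c, 0.5),
                 "75%": lambda c: approx_percentile(c, 0.75)}
```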
Hey @sfc-gh-stan , thanks for your suggestion. I've cleaned it up a little and used lambdas to make it more concise! Kindly let me know if this is good or if there are any further changes needed.
Hi @suenalaba , I think some merge gates are failing because the session parameter APPROX_PERCENTILE_EXACT_IF_POSSIBLE is set to TRUE by default in some deployments; there is an ongoing parameter rollout. Could you please change the tests to accept both values here? Your code LGTM otherwise.
Hey, @sfc-gh-stan is there anywhere I can find more information on the APPROX_PERCENTILE_EXACT_IF_POSSIBLE parameter?
I'm not sure if there's any public documentation, but you could flip the parameter on and off with `alter session set APPROX_PERCENTILE_EXACT_IF_POSSIBLE = true/false`.
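In Snowpark Python, assuming an existing `session`, that toggle looks like:

```python
# Flip the server-side parameter for the current session only.
session.sql(
    "alter session set APPROX_PERCENTILE_EXACT_IF_POSSIBLE = true"
).collect()
```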
@sfc-gh-stan the parameter allows APPROX_PERCENTILE to use the full population when memory is not filled. This was an optimization, since we previously used the sparse representation unnecessarily on small input sizes.
@sfc-gh-stan where exactly do you guys set this parameter? Is it in any test configuration files?
These are Snowflake parameters, set on the server side. From a client, they are set at runtime by issuing SQL commands like `alter account/user/session <parameter> = <val>`. In Snowpark Python, this would be `session.sql("...")`.
The merge gate test failures are caused by different Snowflake servers having different default values for this parameter due to the ongoing parameter rollout. The value of this parameter affects the results of `approx_percentile`. For now, I'm suggesting we change the tests to accept both results (when the parameter is true or false), as sketched below.
Please let me know if you have any other questions 🙂
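A rough sketch of that test change (the test name, data, and expected values are illustrative placeholders, not the actual test):

```python
from snowflake.snowpark.functions import col

def test_summary_median(session):
    df = session.create_dataframe([[1], [2], [3], [4]], schema=["a"])
    median = df.summary().where(col("summary") == "50%").collect()[0]["A"]
    # Accept both the exact result (parameter = true) and the approximate
    # result (parameter = false); 2.5 and 2.0 are illustrative placeholders.
    assert median in (2.5, 2.0)
```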
@sfc-gh-stan Hi Sophie, sorry for the late reply. Apparently my free trial expired, and I'm having some issues setting up my payment on Snowflake. Is it possible for you to help make the above changes to close this issue?