
Consider Polars behavior for columns with only np.nan values

Open petetronic opened this issue 1 year ago • 1 comments

Following on from #4307, and the associated fix #4329, we should review how a Polars series containing only np.nan values should be summarized. Our treatment for Polars differs from Pandas:

Polars with np.nan series

import polars as pl
import numpy as np
pl_nan = pl.DataFrame({"missing": pl.Series([np.nan] * 5, dtype=pl.Float64)})
[Screenshot 2024-08-14 at 10 13 33 AM]

Pandas with np.nan series

import pandas as pd
import numpy as np

pd_nan = pd.DataFrame({"missing": pd.Series([np.nan] * 5, dtype="float64")})
[Screenshot 2024-08-14 at 10 14 52 AM]

petetronic avatar Aug 14 '24 14:08 petetronic

There's some reasoning for this behavior in the Polars docs:

https://docs.pola.rs/user-guide/expressions/missing-data/#notanumber-or-nan-values

Basically, unlike pandas, they don't treat NaNs as missing data. It seems like NaN is the expected result for the summary stats here. Note also that %missing is 0. If we changed how the summary stats treat NaN, we would want to change the behavior of %missing too.

dfalbel avatar Aug 20 '24 13:08 dfalbel

I'm closing as wontfix: in Polars, NaN is not considered missing. From the docs:

NaN values are considered to be a type of floating point data and are not considered to be missing data in Polars. This means:

- NaN values are not counted with the function null_count; and
- NaN values are filled when you use the specialised function fill_nan method but are not filled with the function fill_null.

Polars has the functions is_nan and fill_nan, which work in a similar way to the functions is_null and fill_null. Unlike with missing data, Polars does not hold any metadata regarding the NaN values, so the function is_nan entails actual computation.

One further difference between the values null and NaN is that numerical aggregating functions, like mean and sum, skip the missing values when computing the result, whereas the value NaN is considered for the computation and typically propagates into the result. If desirable, this behavior can be avoided by replacing the occurrences of the value NaN with the value null:

When the values are all None, the results are consistent

[Image]

wesm avatar Dec 06 '24 17:12 wesm