
Casting strings to double using `with { it.toDouble()}` and `toDouble()` gives different results

Open devcrocod opened this issue 2 years ago • 4 comments

Reproduce

  1. Take the ramen dataset: https://www.kaggle.com/code/sujan97/complete-analysis-of-ramen-ratings/input
val df = DataFrame.readCSV("ramen-ratings.csv").renameToCamelCase()
  2. Convert the stars column to a double type:
df.filter { !stars.startsWith("Un") }.convert { stars }.toDouble()

Expected

df.filter { !stars.startsWith("Un") }.convert { stars }.with { it.toDouble() }

result:

   review#          brand                                  variety style country stars topTen
 0    2580      New Touch                 T's Restaurant Tantanmen   Cup   Japan  3,75   null
 1    2579       Just Way Noodles Spicy Hot Sesame Spicy Hot Se...  Pack  Taiwan  1,00   null
 2    2578         Nissin            Cup Noodles Chicken Vegetable   Cup     USA  2,25   null
 3    2577        Wei Lih            GGE Ramen Snack Tomato Flavor  Pack  Taiwan  2,75   null
 4    2576 Ching's Secret                          Singapore Curry  Pack   India  3,75   null

Actual

   review#          brand                                  variety style country stars topTen
 0    2580      New Touch                 T's Restaurant Tantanmen   Cup   Japan 375,0   null
 1    2579       Just Way Noodles Spicy Hot Sesame Spicy Hot Se...  Pack  Taiwan   1,0   null
 2    2578         Nissin            Cup Noodles Chicken Vegetable   Cup     USA 225,0   null
 3    2577        Wei Lih            GGE Ramen Snack Tomato Flavor  Pack  Taiwan 275,0   null
 4    2576 Ching's Secret                          Singapore Curry  Pack   India 375,0   null

Version and Environment

Name: kotlin-jupyter-kernel, Version: 0.11.0.385

dataframe version: 0.12.1

devcrocod avatar Jan 22 '24 13:01 devcrocod

Thanks @devcrocod

zaleslaw avatar Jan 22 '24 14:01 zaleslaw

I'm sorry, I cannot reproduce it directly. It returns the same result for me.

It might be a locale thing (as I see your Doubles have "," instead of "."). `convert` relies on `parse` to parse Strings. It defaults to your system locale, which here interprets "," as the decimal separator and "." as the thousands separator.

This may differ from the stdlib's `String.toDouble()` function, which you call in the other case. I feel like this is intended behavior, though a bit unfortunate in this example.
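The difference can be sketched on the plain JVM without DataFrame. The snippet below is only an illustration of locale-aware versus locale-independent parsing: `parseWithLocale` is a hypothetical helper built on `java.text.NumberFormat`, and it is an assumption that this mirrors what `convert` does internally.

```kotlin
import java.text.NumberFormat
import java.util.Locale

// Hypothetical helper (not part of DataFrame): parse a number using the
// separators of the given locale.
fun parseWithLocale(text: String, locale: Locale): Double =
    NumberFormat.getInstance(locale).parse(text).toDouble()

fun main() {
    // The stdlib function ignores the locale and always expects "." as the decimal separator.
    println("3.75".toDouble())                        // 3.75

    // In a German locale "." is the grouping separator, so it is skipped.
    println(parseWithLocale("3.75", Locale.GERMANY))  // 375.0

    // In a US locale "." is the decimal separator.
    println(parseWithLocale("3.75", Locale.US))       // 3.75
}
```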

Since you're trying to parse a String, I'd recommend using `parse`, as you can define extra `ParserOptions`, such as a `Locale`.
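The idea of pinning the locale explicitly, rather than inheriting the system default, can be sketched as follows. This is a plain-JVM illustration and deliberately does not use DataFrame's API: `parseStars` and the sample values are hypothetical, chosen only to resemble the stars column.

```kotlin
import java.text.NumberFormat
import java.text.ParsePosition
import java.util.Locale

// Hypothetical helper: parse each value with an explicitly chosen locale,
// returning null for values that are not fully numeric (e.g. "Unrated").
fun parseStars(values: List<String>, locale: Locale): List<Double?> {
    val format = NumberFormat.getInstance(locale)
    return values.map { text ->
        val position = ParsePosition(0)
        val number = format.parse(text, position)
        // Accept the result only if the whole string was consumed.
        if (number != null && position.index == text.length) number.toDouble() else null
    }
}

fun main() {
    // Illustrative values, not taken from the dataset.
    val stars = listOf("3.75", "1", "2.25", "Unrated")
    println(parseStars(stars, Locale.US))       // [3.75, 1.0, 2.25, null]
    println(parseStars(stars, Locale.GERMANY))  // [375.0, 1.0, 225.0, null]
}
```

With the locale fixed to one that uses "." as the decimal separator, the result no longer depends on the machine the notebook runs on.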

Jolanrensen avatar Jan 22 '24 20:01 Jolanrensen

Yes, this is a problem specifically with the locale. But I expect to get the same result from both `df.filter { !stars.startsWith("Un") }.convert { stars }.with { it.toDouble() }` and `df.filter { !stars.startsWith("Un") }.convert { stars }.toDouble()`.

Because in my opinion, toDouble() is just a shortcut for with.

devcrocod avatar Jan 23 '24 14:01 devcrocod

> Yes, this is a problem specifically with the locale. But I expect to get the same result from both `df.filter { !stars.startsWith("Un") }.convert { stars }.with { it.toDouble() }` and `df.filter { !stars.startsWith("Un") }.convert { stars }.toDouble()`.
>
> Because in my opinion, toDouble() is just a shortcut for with.

I know, it should, but I'd argue our solution is "better" as it takes the locale into account. It's the stdlib `toDouble()` function that should change, but that's not something we can do.

Jolanrensen avatar Jan 23 '24 16:01 Jolanrensen