machinelearning icon indicating copy to clipboard operation
machinelearning copied to clipboard

Data.Analysis : float/double values don't convert properly with . decimals in csv file

Open chriss2401 opened this issue 5 years ago • 7 comments

If I have a csv dataframe that looks like this:

Date;Value 12-04-1989;20,7 03-11-1990;22,1

Then my two float values will properly be loaded when I call DataFrame.LoadCsv (with values of 20.7 and 22.1)

But if my separator is the default one (',') and my floats are with a dot instead, they will get ignored and I will get 207 and 221 as values.

The issue comes from here : https://github.com/dotnet/corefxlab/blob/master/src/Microsoft.Data.Analysis/DataFrame.cs#L488

Since if you write object value = Convert.ChangeType("20.7", typeof(double)); you will get 207 as a result.

chriss2401 avatar May 30 '20 17:05 chriss2401

Apparently you need to add CultureInfo.InvariantCulture to the function if your double/float has dots instead of commas:

https://stackoverflow.com/questions/7725020/how-to-use-convert-changetype-when-conversiontype-is-decimal-and-input-is-40

So a potential fix could be to check whether the type of the column is a double/float and call ChangeType in the right way depending on whether the value(s) is described with dots or commas. I can open a branch for this if the maintainers agree with this approach.

chriss2401 avatar Jun 01 '20 10:06 chriss2401

So a potential fix could be to check whether the type of the column is a double/float and call ChangeType in the right way depending on whether the value(s) is described with dots or commas. I can open a branch for this if the maintainers agree with this approach.

We may also think about accepting a CultureInfo in the public API. That way the caller can specify what culture to use to parse numbers. The caller is going to know best whether they want CurrentCulture or InvariantCulture.

@chriss2401 - if you have a fix and test cases, feel free to send a PR and @pgovind and I can take a look.

Note also dotnet/corefxlab#2927 is re-doing how LoadCsv is implemented. So we may want to consider that as well.

eerhardt avatar Jun 05 '20 22:06 eerhardt

@eerhardt Good idea. I can implement this in a couple of days unless you guys decide to add it to dotnet/corefxlab#2927 , which in that case I can just close the issue.

chriss2401 avatar Jun 06 '20 10:06 chriss2401

I think taking a CultureInfo in the public API is the right approach. @chriss2401 : If you have a fix and unit tests, I don't mind accepting it into either DataFrame.LoadCsv or dotnet/corefxlab#2927 (feel free to push a commit to that PR). If you make a new PR, I'll port your fix to dotnet/corefxlab#2927 when I go through there next :)

pgovind avatar Jun 08 '20 18:06 pgovind

Sounds good @pgovind , I'll push this and dotnet/corefxlab#2902 in dotnet/corefxlab#2927 in the next couple of days. I think I will clean up some duplicate code I saw as well that I saw in that PR :)

chriss2401 avatar Jun 08 '20 19:06 chriss2401

Will this ever be fixed? For people outside of the US, the DataFrame.LoadCsv method doesn't exist in practice.

Oblomoff avatar Dec 29 '21 21:12 Oblomoff

I wouldn't mind re-taking a look at this at some point, just don't know exactly when that would be :)

chriss2401 avatar Jan 03 '22 11:01 chriss2401