datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

feat: allow math functions to parse input from Utf8

Open caicancai opened this issue 1 year ago • 10 comments

Is your feature request related to a problem or challenge?

log10('-Infinity'); log2('-Infinity'); power('Infinity',2)

As above, using the + infinity function in some math functions will report an error in arrow-datafusion, but mysql and posgtres will not report an error. I think the + infinity parameter is meaningful in mathematics.

--postgres
postgres=# select power('Infinity', -2);
 power 
-------
     0
(1 row)postgres=# select log('Infinity', 2);
 log 
-----
   0
(1 row)postgres=# select log10('Infinity');
  log10   
----------
 Infinity
(1 row)postgres=# select power('Infinity', 2);
  power   
----------
 Infinity
(1 row)

--mysql
mysql> select log10('+Infinity');
+--------------------+
| log10('+Infinity') |
+--------------------+
| NULL |
+--------------------+
1 row in set, 2 warnings (0.00 sec)mysql> select power('Infinity', 2);
+----------------------+
| power('Infinity', 2) |
+----------------------+
| 0 |
+----------------------+
1 row in set, 1 warning (0.00 sec)mysql> select power('-Infinity', 2);
+-----------------------+
| power('-Infinity', 2) |
+-----------------------+
| 0 |
+-----------------------+
1 row in set, 1 warning (0.00 sec)mysql> select log10('-Infinity');
+--------------------+
| log10('-Infinity') |
+--------------------+
| NULL |
+--------------------+
1 row in set, 2 warnings (0.00 sec)

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

caicancai avatar Feb 21 '24 08:02 caicancai

@alamb Maybe I need some opinions from you. In my opinion, this is also a complicated issue.

caicancai avatar Feb 22 '24 01:02 caicancai

Datafusion does actually support this it seems:

❯ select log(arrow_cast('+Infinity', 'Float32'));
+------------------------+
| log(Utf8("+Infinity")) |
+------------------------+
| inf                    |
+------------------------+
1 row in set. Query took 0.002 seconds.

❯ select log(arrow_cast('-Infinity', 'Float32'));
+------------------------+
| log(Utf8("-Infinity")) |
+------------------------+
| NaN                    |
+------------------------+
1 row in set. Query took 0.002 seconds.

❯

Could you elaborate on which math functions are throwing error in cases of infinity input?

Jefffrey avatar Feb 22 '24 11:02 Jefffrey

Datafusion does actually support this it seems:

❯ select log(arrow_cast('+Infinity', 'Float32'));
+------------------------+
| log(Utf8("+Infinity")) |
+------------------------+
| inf                    |
+------------------------+
1 row in set. Query took 0.002 seconds.

❯ select log(arrow_cast('-Infinity', 'Float32'));
+------------------------+
| log(Utf8("-Infinity")) |
+------------------------+
| NaN                    |
+------------------------+
1 row in set. Query took 0.002 seconds.

❯

Could you elaborate on which math functions are throwing error in cases of infinity input?

Thank you for your reply. It seems that you need to do some cast conversion. Can't you log2('-Infinity') directly?

caicancai avatar Feb 22 '24 13:02 caicancai

Yeah that would seem reasonable. I guess it would involve changing the signatures of the math functions to also accept Utf8 data type, then try parse into float (or something like that)

Jefffrey avatar Feb 22 '24 19:02 Jefffrey

I've updated the issue title to better reflect this :+1:

(Since less of an issue with infinity, and more just a general data type issue with math functions)

Jefffrey avatar Feb 22 '24 20:02 Jefffrey

I've updated the issue title to better reflect this 👍

(Since less of an issue with infinity, and more just a general data type issue with math functions)

Very happy to see the way arrow-datafusion handles this, it's a way I would like to see, I think doing a cast is a very rigorous behavior

caicancai avatar Feb 23 '24 00:02 caicancai

@Jefffrey But I think this cast should be an implicit conversion. If not, maybe we should indicate this in the tests and documentation.

caicancai avatar Feb 23 '24 05:02 caicancai

@Jefffrey But I think this cast should be an implicit conversion. If not, maybe we should indicate this in the tests and documentation.

Yes, I think this is what I was trying to convey, that we should allow log('100.0') to be implicitly converted to log(100.0) since we can already cast from Utf8 to Float. I just might be wrong on the exact implementation details as I'm not as familiar with the internals of the function related code :slightly_smiling_face:

Jefffrey avatar Feb 23 '24 08:02 Jefffrey

@Jefffrey But I think this cast should be an implicit conversion. If not, maybe we should indicate this in the tests and documentation.

Yes, I think this is what I was trying to convey, that we should allow log('100.0') to be implicitly converted to log(100.0) since we can already cast from Utf8 to Float. I just might be wrong on the exact implementation details as I'm not as familiar with the internals of the function related code 🙂

Maybe I want to try it, but I'm not sure I can pull it off

caicancai avatar Feb 23 '24 08:02 caicancai

@Jefffrey I will try to add some tests first and then find the problem. At present, the math function should have some negative tests that need to be improved.

caicancai avatar Feb 23 '24 08:02 caicancai

@Jefffrey But I think this cast should be an implicit conversion. If not, maybe we should indicate this in the tests and documentation.

Yes, I think this is what I was trying to convey, that we should allow log('100.0') to be implicitly converted to log(100.0) since we can already cast from Utf8 to Float. I just might be wrong on the exact implementation details as I'm not as familiar with the internals of the function related code 🙂

I think this process is called "coercion" in the datafusion codebase.

So we would need to add a coercion from DataType::Utf8 to float perhaps

alamb avatar Feb 26 '24 18:02 alamb

I think this process is called "coercion" in the datafusion codebase.

So we would need to add a coercion from DataType::Utf8 to float perhaps

double or float?

caicancai avatar Feb 27 '24 07:02 caicancai

So we would need to add a coercion from DataType::Utf8 to float perhaps

double or float?

I think we would probably need the coercion for both Float32 and Float64

alamb avatar Feb 27 '24 13:02 alamb

@alamb I have some ideas about type conversion. I am trying out my ideas on calcite. If it makes sense, I will transplant it to arrow-datafusion. This may take some time.

caicancai avatar Mar 01 '24 14:03 caicancai

@alamb I have some ideas about type conversion. I am trying out my ideas on calcite. If it makes sense, I will transplant it to arrow-datafusion. This may take some time.

Thanks @caicancai -- BTW I find following the example of existing systems is helpful in cases like this (there is no need to reinvent different coercion rules)

For example, postgres appears to happily coerce UTF8 -> float:

postgres=# select sqrt('1.4');
        sqrt
--------------------
 1.1832159566199232
(1 row)

alamb avatar Mar 01 '24 17:03 alamb