datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Order of Interval Addition Should Affect Final Output

Open vbarua opened this issue 1 year ago • 6 comments

Describe the bug

In various engines, the order in which intervals are added to dates can affect the final value. This is especially noticeable with leap years.

Datafusion appears to constant fold these intervals, which throws away the operation order.

> EXPLAIN SELECT
         DATE '2019-02-28' + INTERVAL '1 YEAR' + INTERVAL '1 DAY' AS FEB,
         DATE '2019-02-28' + INTERVAL '1 DAY' + INTERVAL '1 YEAR' AS MAR;
+---------------+----------------------------------------------------------------------+
| plan_type     | plan                                                                 |
+---------------+----------------------------------------------------------------------+
| logical_plan  | Projection: Date32("2020-02-29") AS feb, Date32("2020-02-29") AS mar |
|               |   EmptyRelation                                                      |
| physical_plan | ProjectionExec: expr=[2020-02-29 as feb, 2020-02-29 as mar]          |
|               |   PlaceholderRowExec                                                 |
|               |                                                                      |
+---------------+----------------------------------------------------------------------+

To Reproduce

Testing via datafusion-cli

> SELECT
  DATE '2019-02-28' + INTERVAL '1 YEAR' + INTERVAL '1 DAY' AS FEB,
  DATE '2019-02-28' + INTERVAL '1 DAY' + INTERVAL '1 YEAR' AS MAR;
+------------+------------+
| feb        | mar        |
+------------+------------+
| 2020-02-29 | 2020-02-29 |
+------------+------------+

Expected behavior

Due to leap year shenanigans, adding the year before the day results in a different date than adding the day before the year.

Trino emits

trino> SELECT
    ->   DATE '2019-02-28' + INTERVAL '1' YEAR + INTERVAL '1' DAY AS FEB,
    ->   DATE '2019-02-28' + INTERVAL '1' DAY + INTERVAL '1' YEAR AS MAR;
    FEB     |    MAR     
------------+------------
 2020-02-29 | 2020-03-01 
(1 row)

as does Postgres and Snowflake (based on their documentation for interval types where this example came from)

Additional context

No response

vbarua avatar Jun 21 '24 21:06 vbarua

🤔 maybe we could turn of constant folding for adding intervals

alamb avatar Jun 21 '24 22:06 alamb

take

Lordworms avatar Jun 21 '24 23:06 Lordworms

This is caused by the adding ordering since the sqlparser would always generate a AST like image by only disabling constant folding for Interval would not help here since when we evaluate the PhysicalExpr, it would still do the calculation of the later two components first. I think a change in sqlparser would help? (like we generate the BinaryOP AST with first two components as left leaf and the last one as right leaf) not sure the best way to do it.

Lordworms avatar Jun 23 '24 01:06 Lordworms

I agree with @Lordworms the root cause is related to the precidence rules in sqlparser which control if expressions like

  DATE '2019-02-28' + INTERVAL '1 YEAR' + INTERVAL '1 DAY' AS FEB,

are parsed like this

  ((DATE '2019-02-28' + INTERVAL '1 YEAR') + INTERVAL '1 DAY' AS FEB),

Or like

  (DATE '2019-02-28' + (INTERVAL '1 YEAR' + INTERVAL '1 DAY' AS FEB)),

Users who want to control the order of expression evaluation can use ( and ) to explicitly control the evaluation order.

alamb avatar Jun 23 '24 11:06 alamb

I agree with @Lordworms the root cause is related to the precidence rules in sqlparser which control if expressions like

  DATE '2019-02-28' + INTERVAL '1 YEAR' + INTERVAL '1 DAY' AS FEB,

are parsed like this

  ((DATE '2019-02-28' + INTERVAL '1 YEAR') + INTERVAL '1 DAY' AS FEB),

Or like

  (DATE '2019-02-28' + (INTERVAL '1 YEAR' + INTERVAL '1 DAY' AS FEB)),

Users who want to control the order of expression evaluation can use ( and ) to explicitly control the evaluation order.

Thanks so much. Should I fix it in sqlparser?

Lordworms avatar Jun 23 '24 17:06 Lordworms

Thanks so much. Should I fix it in sqlparser?

I think that is probably the best way -- though it is likely pretty tricky (likely related to operator precidence). I think the correct first step would be to do some research and see how such arithmetic is parsed by other databases

alamb avatar Jun 23 '24 19:06 alamb