datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

LAG window function schema issue. Non null conflict.

Open comphead opened this issue 1 year ago • 1 comments

Describe the bug

select lag(a) over () as x1
        from 
        (select 2 id, 'b' a union all select 1 id, null a union all select 3 id, null);

Arrow error: Invalid argument error: Column 'a' is declared as non-nullable but contains null values

To Reproduce

Above

Expected behavior

Query should work

Additional context

We faced this issue not once, latest is https://github.com/apache/arrow-datafusion/issues/4828 Found when working on https://github.com/apache/arrow-datafusion/pull/9221

comphead avatar Feb 23 '24 00:02 comphead

When I execute this query. I recognized, that sometimes it works, sometimes it fails. It seems like a strange bug.

mustafasrepo avatar Feb 23 '24 07:02 mustafasrepo

Weird thing, it might be a bigger problem behind

DataFusion CLI v36.0.0
❯ select lag(a) over () as x1
        from 
        (select 2 id, 'b' a union all select 1 id, null a union all select 3 id, null);
Arrow error: Invalid argument error: Column 'a' is declared as non-nullable but contains null values
❯ select lag(a) over () as x1
        from 
        (select 2 id, 'b' a union all select 1 id, null a union all select 3 id, null);
Arrow error: Invalid argument error: Column 'a' is declared as non-nullable but contains null values
❯ select lag(a) over () as x1
        from 
        (select 2 id, 'b' a union all select 1 id, null a union all select 3 id, null);
+----+
| x1 |
+----+
|    |
| b  |
|    |
+----+
3 rows in set. Query took 0.013 seconds.

Will dive into it really quick

comphead avatar Mar 08 '24 03:03 comphead

@comphead, I debugged this behaviour also. It seems that during concatenation in the BoundedWindowAggExec. We were effectively using the schema of the first batch in the partition (The schema of the first batch may not reflect the worst case scenario. In other words, RecordBatches emitted out of UnionExec will not always have same schema with the schema of the UnionExec. They will have the schema of the corresponding input of the UnionExec. As in this example, first batch may come from one of the input branches of the union with is_nullable=false). I have filed the PR9508 to fix this issue. Please note that, possible inconsistency between schema of the UnionExec and RecordBatches emitted from it stil persists. However, in the PR we no longer use RecordBatch schema, but input.schema to be on the safe side.

If you have time, I would appreciate your review.

mustafasrepo avatar Mar 08 '24 14:03 mustafasrepo