`generate_series` doesn't respect memory limit
Describe the bug
You can trivially cause DataFusion to use any amount of memory by simply running:
select generate_series(9876543210);
Memory management functionality, e.g. MemoryPool, doesn't seem to have any effect.
To Reproduce
Run datafusion-cli with a memory limit, then run generate_series:
datafusion-cli -m 1g -c 'select generate_series(9876543210);'
Memory immediately jumps to ~20GB. (Note that this is not limited to datafusion-cli.)
This query also hangs indefinitely, but in production we see pods being OOM-killed for queries like this.
Expected behavior
generate_series should either be streamed so it uses very little memory, or should be killed/constrained by the memory pool.
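To illustrate the "streamed" option, here is a minimal sketch in plain Rust (standard library only) of producing an inclusive integer series in fixed-size chunks, so that only one chunk is in memory at a time. `ChunkedSeries` and `chunk_len` are hypothetical names made up for this example; this is not how DataFusion implements `generate_series`, just the shape of the idea.

```rust
/// Hypothetical iterator (not a DataFusion type) that yields the inclusive
/// range `start..=end` as a sequence of chunks of at most `chunk_len` values.
pub struct ChunkedSeries {
    next: i64,
    end: i64,
    chunk_len: usize,
    done: bool,
}

impl ChunkedSeries {
    pub fn new(start: i64, end: i64, chunk_len: usize) -> Self {
        assert!(chunk_len > 0, "chunk_len must be non-zero");
        // An empty range (start > end) produces no chunks at all.
        Self { next: start, end, chunk_len, done: start > end }
    }
}

impl Iterator for ChunkedSeries {
    type Item = Vec<i64>;

    fn next(&mut self) -> Option<Vec<i64>> {
        if self.done {
            return None;
        }
        // Count the remaining values in i128 so that ranges spanning most of
        // the i64 domain cannot overflow.
        let remaining = (self.end as i128 - self.next as i128 + 1) as u128;
        let take = remaining.min(self.chunk_len as u128) as usize;
        let last = self.next + (take as i64 - 1);

        // Only this chunk is materialized; earlier chunks are already gone.
        let chunk: Vec<i64> = (self.next..=last).collect();

        if last == self.end {
            self.done = true;
        } else {
            self.next = last + 1;
        }
        Some(chunk)
    }
}
```

With something like `ChunkedSeries::new(0, 9876543210, 8192)`, peak memory is bounded by one 8192-element chunk regardless of how large the end value is, assuming the consumer does not collect all chunks itself.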
Additional context
Same presumably applies to the range UDF.
cc @davidhewitt
To be clear, our actual problem is not particularly with generate_series, although this behaviour is ugly.
The underlying problem we're having is that lots of queries seem to exceed the memory pool limits; generate_series is just one example of the problem.
I actually recall thinking about that a bit when I was adding sub-day support to generate_series. Thanks for filing this issue.
The underlying problem we're having is that lots of queries seem to exceed the memory pool limits; generate_series is just one example of the problem.
I have noticed a similar issue with external aggregation https://github.com/apache/datafusion/issues/12937
Currently only Aggregation, Sort, and Sort-merge-join support spilling, and it's likely they don't actually release their reserved memory after writing intermediates to disk. Other accumulating operators should keep track of their memory and return an error if they exceed the memory budget.
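As a rough sketch of what "keep track of memory and return an error" could look like, here is a standalone Rust example (standard library only) that reserves the bytes for a series against a budget before allocating anything. `Budget`, `BudgetExceeded`, and `series_with_budget` are hypothetical names invented for this illustration; they are not DataFusion's `MemoryPool` API, and the series is assumed to start at 0 and include `end`.

```rust
use std::fmt;
use std::mem::size_of;

/// Hypothetical error returned when a reservation would exceed the budget.
#[derive(Debug, PartialEq)]
pub struct BudgetExceeded {
    pub requested: usize,
    pub available: usize,
}

impl fmt::Display for BudgetExceeded {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "requested {} bytes but only {} are available",
            self.requested, self.available
        )
    }
}

impl std::error::Error for BudgetExceeded {}

/// Hypothetical byte budget; a stand-in for a real memory pool.
pub struct Budget {
    limit: usize,
    used: usize,
}

impl Budget {
    pub fn new(limit: usize) -> Self {
        Self { limit, used: 0 }
    }

    pub fn used(&self) -> usize {
        self.used
    }

    /// Reserve `bytes`, or fail without changing the budget.
    pub fn try_grow(&mut self, bytes: usize) -> Result<(), BudgetExceeded> {
        match self.used.checked_add(bytes) {
            Some(total) if total <= self.limit => {
                self.used = total;
                Ok(())
            }
            _ => Err(BudgetExceeded {
                requested: bytes,
                available: self.limit - self.used,
            }),
        }
    }

    /// Give back `bytes`, e.g. after the buffer has been dropped or spilled.
    pub fn shrink(&mut self, bytes: usize) {
        self.used = self.used.saturating_sub(bytes);
    }
}

/// Build `0..=end`, but only after the budget has accepted the reservation.
pub fn series_with_budget(end: i64, budget: &mut Budget) -> Result<Vec<i64>, BudgetExceeded> {
    // Compute the size in u128 so very large `end` values cannot overflow.
    let len = (end as i128 + 1).max(0) as u128;
    let bytes = len * size_of::<i64>() as u128;
    let bytes = if bytes > usize::MAX as u128 { usize::MAX } else { bytes as usize };

    // Fail fast: nothing is allocated if the reservation is rejected.
    budget.try_grow(bytes)?;
    Ok((0..=end).collect())
}
```

Under these assumptions, a call such as `series_with_budget(9876543210, &mut Budget::new(1 << 30))` returns an error immediately instead of allocating, and the caller is expected to call `shrink` once the returned buffer is no longer held.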
Do you have any idea what kind of query/operators might not respect memory limit? 🤔
It seems that memory used by a function (UDF) is not managed by the memory pool. That may be the intended design of the memory pool: https://github.com/apache/datafusion/blob/9eca7d165c3ddcd3449e833df9391b8216e0f5bc/datafusion/execution/src/memory_pool/mod.rs#L54-L58
It seems that memory used by a function (UDF) is not managed by the memory pool. That may be the intended design of the memory pool.
I think that's true but quite unfortunate: clearly generate_series can use enormous amounts of memory, it should be possible to track.