
`generate_series` doesn't respect memory limit

Open samuelcolvin opened this issue 1 year ago • 3 comments

Describe the bug

You can trivially cause DataFusion to use any amount of memory by simply running:

select generate_series(9876543210);

Memory management functionality, e.g. MemoryPool, doesn't seem to have any effect.

To Reproduce

Run datafusion-cli with a memory limit, then run generate_series:

 datafusion-cli -m 1g -c 'select generate_series(9876543210);'

Memory immediately jumps to ~20GB. (note this is not limited to datafusion-cli)
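For a sense of scale, here is a back-of-the-envelope estimate of what fully materializing this series would need. It is only a sketch under two assumptions that the issue does not state: the one-argument form counts from 0 to the argument inclusive, and each element is an 8-byte Int64. The helper names are made up for illustration.

```rust
/// Number of values in the inclusive series start..=end with a positive step.
/// Returns 0 when the range is empty.
fn series_len(start: i64, end: i64, step: i64) -> u128 {
    assert!(step > 0, "this sketch only handles positive steps");
    if end < start {
        return 0;
    }
    // Widen to i128 so that `end - start` cannot overflow.
    let span = (end as i128) - (start as i128);
    (span / step as i128) as u128 + 1
}

/// Bytes needed to hold every value at once, assuming 8-byte (Int64) elements
/// and ignoring any per-array overhead (both are assumptions).
fn materialized_bytes(start: i64, end: i64, step: i64) -> u128 {
    series_len(start, end, step) * std::mem::size_of::<i64>() as u128
}
```

Under those assumptions, `materialized_bytes(0, 9_876_543_210, 1)` comes to roughly 79 GB, far above the 1g limit passed to datafusion-cli.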

This query also hangs indefinitely, but in production we see pods being OOM-killed for queries like this.

Expected behavior

generate_series should either be streamed so it uses very little memory, or should be killed/constrained by the memory pool.
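As a rough illustration of the "streamed" option, the sketch below produces the series in fixed-size batches so that only one batch is resident at a time. It uses only the Rust standard library; `SeriesBatches` and its fields are hypothetical names, not DataFusion types, and a real implementation would emit Arrow record batches rather than `Vec<i64>`.

```rust
/// Yields the inclusive series start..=end in fixed-size chunks, so at most
/// `batch_size` values are held in memory at a time.
/// (Hypothetical type for illustration; not part of DataFusion.)
struct SeriesBatches {
    current: i64,
    end: i64,
    done: bool,
    batch_size: usize,
}

impl SeriesBatches {
    fn new(start: i64, end: i64, batch_size: usize) -> Self {
        assert!(batch_size > 0, "batch_size must be non-zero");
        Self { current: start, end, done: start > end, batch_size }
    }
}

impl Iterator for SeriesBatches {
    type Item = Vec<i64>;

    fn next(&mut self) -> Option<Vec<i64>> {
        if self.done {
            return None;
        }
        let mut batch = Vec::with_capacity(self.batch_size);
        while batch.len() < self.batch_size {
            batch.push(self.current);
            if self.current == self.end {
                // Last value emitted; stop before incrementing past `end`.
                self.done = true;
                break;
            }
            self.current += 1; // cannot overflow: current < end <= i64::MAX
        }
        Some(batch)
    }
}
```

With a batch size of 8192, each call to `next()` allocates about 64 KiB regardless of how large `end` is, which is the property the expected behavior asks for.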

Additional context

Same presumably applies to the range UDF.

cc @davidhewitt

samuelcolvin avatar Oct 14 '24 10:10 samuelcolvin

To be clear, our actual problem is not particularly with generate_series, although this behaviour is ugly.

The underlying problem we're having is that lots of queries seem to exceed the memory pool limits; generate_series is just one example of the problem.

samuelcolvin avatar Oct 14 '24 10:10 samuelcolvin

I actually recall thinking about that a bit when I was adding sub-day support to generate_series. Thanks for filing this issue.

Omega359 avatar Oct 15 '24 01:10 Omega359

> The underlying problem we're having is that lots of queries seem to exceed the memory pool limits; generate_series is just one example of the problem.

I have noticed a similar issue with external aggregation: https://github.com/apache/datafusion/issues/12937

Currently only Aggregation, Sort, and Sort-merge-join support spilling, and it's likely they don't actually release their tracked memory after writing intermediates to disk. Other accumulating operators should keep track of their memory and return an error if they exceed the memory budget.
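The "keep track of memory and return an error" behaviour can be sketched in a few lines. This is a standalone illustration using only the Rust standard library: `Budget`, `OutOfBudget`, `try_reserve`, `release`, and `accumulate` are hypothetical names chosen for this example and are not DataFusion's MemoryPool API.

```rust
/// Hypothetical stand-in for a memory pool: a fixed byte limit plus a running total.
struct Budget {
    limit: usize,
    used: usize,
}

/// Error returned when a reservation would exceed the limit.
#[derive(Debug, PartialEq)]
struct OutOfBudget {
    requested: usize,
    available: usize,
}

impl Budget {
    fn new(limit: usize) -> Self {
        Self { limit, used: 0 }
    }

    /// Reserve `bytes`, or fail without changing the running total.
    fn try_reserve(&mut self, bytes: usize) -> Result<(), OutOfBudget> {
        let available = self.limit - self.used;
        if bytes > available {
            return Err(OutOfBudget { requested: bytes, available });
        }
        self.used += bytes;
        Ok(())
    }

    /// Give bytes back, e.g. after spilling intermediates to disk.
    fn release(&mut self, bytes: usize) {
        self.used = self.used.saturating_sub(bytes);
    }
}

/// An accumulating operator that reserves memory before buffering each value.
/// It tracks the logical size of the data, not the allocator's real capacity.
fn accumulate(
    values: impl Iterator<Item = i64>,
    budget: &mut Budget,
) -> Result<Vec<i64>, OutOfBudget> {
    let mut buffer = Vec::new();
    for v in values {
        budget.try_reserve(std::mem::size_of::<i64>())?;
        buffer.push(v);
    }
    Ok(buffer)
}
```

With a 1 KiB budget, `accumulate(0..9_876_543_210, &mut budget)` fails after 128 values instead of growing without bound. The `release` call is the step that a spilling operator would need after writing to disk for the budget to reflect what is actually held.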

Do you have any idea what kind of query/operators might not respect memory limit? 🤔

2010YOUY01 avatar Oct 15 '24 11:10 2010YOUY01

It seems that the memory used by functions (UDFs) is not managed by the memory pool. This may be the intended design of the memory pool: https://github.com/apache/datafusion/blob/9eca7d165c3ddcd3449e833df9391b8216e0f5bc/datafusion/execution/src/memory_pool/mod.rs#L54-L58

getChan avatar Jan 01 '25 07:01 getChan

> It seems that the memory used by functions (UDFs) is not managed by the memory pool. This may be the intended design of the memory pool.

I think that's true but quite unfortunate: generate_series can clearly use enormous amounts of memory, so it should be possible to track.

adriangb avatar Oct 30 '25 20:10 adriangb