sql icon indicating copy to clipboard operation
sql copied to clipboard

[BUG] Memory Exhaustion for Multiple Filtering Operations in PPL

Open RyanL1997 opened this issue 2 months ago • 3 comments

Problem Description

OpenSearch SQL experienced severe memory exhaustion and Out-of-Memory (OOM) errors when executing PPL queries with 10+ consecutive filtering operations (regex and/or where clauses), regardless of dataset size. The issue manifested as:

  • Memory consumption growing exponentially with the number of filtering operations (regex/where)
  • GC overhead limit exceeded errors
  • Query failures even with small datasets
  • System becoming unresponsive during query execution

Example Failing Queries

Example 1: Multiple regex operations

source=accounts
| regex firstname="^[A-Z][a-z]+"
| regex lastname="^[A-Z][a-z]+$"
| regex address="\\d+\\s+[A-Za-z]+\\s+(Street|Lane|Avenue|Place|Court|Terrace)"
| regex employer="^[A-Z][a-z]+"
| regex email="[a-z]+[a-z]+@[a-z]+\\.(com|net|org)$"
| regex city="^[A-Z][a-z]+$"
| regex state="^[A-Z]{2}$"
| regex gender="^(M|F)$"
| regex firstname=".*[aeiouAEIOU].*"
| regex lastname=".*[bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ].*"
| regex address=".*[0-9]{3}.*"
| regex employer=".*[aeiouAEIOU]$"
| regex email=".*[a-z]{5,}@.*"
| regex city=".*[aeiouyAEIOUY]{2}.*"
| regex address="^[0-9]+\\s.*"
| regex lastname="[A-Z][a-z]*[aeiouy]$"
| where age >= 25 AND age <= 40
| fields firstname, lastname, email, employer, address, city, state
| head 10

Example 2: Multiple where operations

source=accounts
| where age > 20
| where age < 50
| where balance > 1000
| where balance < 50000
| where account_number > 0
| where account_number < 1000
| where age != 25
| where age != 30
| where age != 35
| where balance != 5000
| where balance != 10000
| where account_number != 100
| where account_number != 200
| where age > 21
| where age < 49
| where balance > 1001
| fields firstname, lastname, age, balance
| head 10

Example Failures in Log

[2025-11-20T20:55:54,723][INFO ][o.o.s.p.PPLService       ] [integTest-0] [9d49584e-83ad-484b-9fc8-5489de5517a5] Incoming request source=table | where identifier > *** | where identifier < *** | where identifier > *** | where identifier < *** | where identifier > *** | where identifier < *** | where identifier != *** | where identifier != *** | where identifier != *** | where identifier != *** | where identifier != *** | where identifier != *** | where identifier != *** | where identifier > *** | where identifier < *** | where identifier > *** | fields + identifier,identifier,identifier,identifier | head 10
[2025-11-20T20:56:04,871][INFO ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][123] overhead, spent [372ms] collecting in the last [1s]
[2025-11-20T20:56:05,878][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][124] overhead, spent [686ms] collecting in the last [1s]
[2025-11-20T20:56:07,136][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][125] overhead, spent [1.1s] collecting in the last [1.2s]
[2025-11-20T20:56:08,412][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][126] overhead, spent [1.2s] collecting in the last [1.2s]
[2025-11-20T20:56:09,536][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][127] overhead, spent [1s] collecting in the last [1.1s]
[2025-11-20T20:56:10,601][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][128] overhead, spent [1s] collecting in the last [1s]
[2025-11-20T20:56:11,753][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][129] overhead, spent [1s] collecting in the last [1.1s]
[2025-11-20T20:56:12,914][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][130] overhead, spent [1.1s] collecting in the last [1.1s]
[2025-11-20T20:56:13,920][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][131] overhead, spent [967ms] collecting in the last [1s]
...

Root Cause Analysis

Initial Misdiagnosis

Initially suspected that Pattern.compile() calls were causing memory issues due to lack of caching. However, implementing pattern caching did not resolve the problem.

Actual Root Cause

The real issue was identified by examining the Calcite query planning process:

  1. Deep Filter RelNode Chains: Each regex or where command in the PPL query created a separate Filter RelNode during the analysis phase, resulting in a deeply nested chain:

    Filter(regex15)
    └─ Filter(regex14)
       └─ Filter(regex13)
          └─ ... (10+ levels deep)
             └─ Scan(table)
    
  2. Calcite FilterMergeRule Explosion: During the optimization phase, Calcite's FilterMergeRule attempts to merge these Filter nodes. However, the intersection with various push-down rules caused a combinatorial explosion:

    • The rule tried to generate all possible combinations of filter orderings
    • With 10+ filters, this created millions of intermediate RelNode objects
    • Each intermediate object consumed memory
    • The planning queue grew exponentially
  3. Memory Exhaustion: The memory was exhausted not during query execution, but during the query planning/optimization phase itself, before any data was processed.

Key Insight

The problem was architectural: too many separate Filter RelNodes triggered exponential growth in Calcite's optimization queue due to the intersection of FilterMergeRule and push-down optimization rules.

RyanL1997 avatar Nov 21 '25 05:11 RyanL1997

Thanks for analysis. Does calcite SQL has similar Deep Filter RelNode Chains? Does this issue exist becuase we skip SQL validation stage?

penghuo avatar Nov 21 '25 16:11 penghuo

Does calcite SQL has similar Deep Filter RelNode Chains? Does this issue exist becuase we skip SQL validation stage?

Hi @penghuo .That is a good questions.

I quickly tried something like this:

SELECT age FROM accounts
  WHERE age > 30 AND age < 40 AND balance > 10000

SQL parser automatically combines these into a single Filter node with an AND expression. SQL naturally groups all WHERE conditions together.

The issue exists because of how PPL chains filters sequentially:

  source=accounts
  | where age > 30
  | where age < 40
  | where balance > 10000

PPL's pipeline syntax creates separate filter operations - each | where is a distinct command. Without filter accumulation, this creates:

  LogicalFilter(age > 30)
    LogicalFilter(age < 40)
      LogicalFilter(balance > 10000)
        IndexScan

This is a chain of 3 nested Filter nodes, which causes exponential memory growth with many filters

RyanL1997 avatar Nov 21 '25 17:11 RyanL1997

My question is if SQL contain subquery, how does calcite handle it?

"select name from (\n"
        + "  select *\n"
        + "  from dept\n"
        + "  where deptno = 10)\n"
        + "where name = 'A'\n"

I verify in calcite, 1) SqlValidator does not normalize the query 2) RelOptRules merge filter correctlly.

  @Test void testMergeFilter2() {
    final String sql = "select name from (\n"
        + "  select *\n"
        + "  from dept\n"
        + "  where deptno = 10)\n"
        + "where name = 'A'\n";
    sql(sql)
        .withRule(CoreRules.FILTER_PROJECT_TRANSPOSE,
            CoreRules.FILTER_MERGE)
        .check();
  }

penghuo avatar Nov 21 '25 17:11 penghuo