[BUG] Circuit breaker getting triggered when multiple PPL queries are fired in parallel.
Description
This issue carries over an experiment that @paulstn and I ran in OSD's new Discover experience. When the Discover page loads, OSD fires a separate query for each index field to calculate the presentation percentage.
- Similar issue https://github.com/opensearch-project/sql/issues/4584
Example queries:
# query 1
source = demo-ai-logs-otel-v1*
| WHERE `@timestamp` >= '2025-07-29 22:53:20.604' AND `@timestamp` <= '2025-10-27 22:53:20.604'
| where isnotnull(`attributes.State`)
| stats count() as field_count, distinct_count(`attributes.State`) as distinct_count
# query 2
source = demo-ai-logs-otel-v1*
| WHERE `@timestamp` >= '2025-07-29 22:53:20.604' AND `@timestamp` <= '2025-10-27 22:53:20.604'
| where isnotnull(`attributes.address`)
| stats count() as field_count, distinct_count(`attributes.address`) as distinct_count
...
Example errors:
...
{
"text": "{\"statusCode\":400,\"error\":\"Bad Request\",\"message\":\"{\\n \\\"error\\\": {\\n \\\"reason\\\": \\\"Error occurred in OpenSearch engine: all shards failed\\\",\\n
\\\"details\\\": \\\"Shard[0]: [demo-ai-logs-otel-v1-00001/oIkyldJtSHm6Cwgnj-aqxA] QueryShardException[failed to create query:
Failed to compile inline script [rO0ABXNyADRvcmcub3BlbnNlYXJjaC5zcWwuZXhwcmVzc2lvbi5mdW5jdGlvbi5GdW5jdGlvbkRTTCQyPc501CEBPWwCAAVMAA12YWwkYXJn
...
...
dGltZS5TZXKVXYS6GyJIsgwAAHhwdwYHAANVVEN4c3EAfgBNdw0CAAAAAGj/9+AtYkYHeH5xAH4AG3QAB0JPT0xFQU4=] using lang
[opensearch_query_expression]];
nested: CircuitBreakingException[[script] Too many dynamic script compilations within, max: [75/5m]; please use indexed,
or scripts with parameters instead; this limit can be changed by the [script.context.filter.max_compilations_rate] setting];\\\\n\\\\n
For more details, please send request for Json format to see the raw response from OpenSearch engine.\\\",\\n
\\\"type\\\": \\\"SearchPhaseExecutionException\\\"\\n },\\n \\\"status\\\": 400\\n}\"}"
},
"redirectURL": "",
"headersSize": 800,
"bodySize": 3641
},
...
Env
- OS: 3.3
Screenshots
Potential Work Around
Instead of firing multiple queries from the frontend, send a single query.
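A minimal sketch of what a combined query could look like, written as a small Python helper that generates the PPL text. The function name `build_combined_field_stats_query` and the `field_count_N` / `distinct_count_N` aliases are hypothetical, and the sketch assumes that `count(field)` skips null values so the per-field `isnotnull()` filter can be dropped. That assumption should be verified against the PPL engine in use. Values are interpolated without escaping, so this is for illustration only.

```python
from typing import Sequence


def build_combined_field_stats_query(
    index_pattern: str, start: str, end: str, fields: Sequence[str]
) -> str:
    """Build one PPL query that collects count and distinct count for every field."""
    if not fields:
        raise ValueError("at least one field is required")

    aggregations = []
    for position, field in enumerate(fields):
        quoted = f"`{field}`"
        # Assumption: count(field) ignores nulls, replacing the isnotnull() filter.
        aggregations.append(f"count({quoted}) as field_count_{position}")
        aggregations.append(f"distinct_count({quoted}) as distinct_count_{position}")

    return "\n".join(
        [
            f"source = {index_pattern}",
            f"| where `@timestamp` >= '{start}' AND `@timestamp` <= '{end}'",
            "| stats " + ", ".join(aggregations),
        ]
    )


if __name__ == "__main__":
    print(
        build_combined_field_stats_query(
            "demo-ai-logs-otel-v1*",
            "2025-07-29 22:53:20.604",
            "2025-10-27 22:53:20.604",
            ["attributes.State", "attributes.address"],
        )
    )
```

With this shape the number of requests stays at one regardless of the field count, although the aggregation cost still grows with the number of fields.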
Exit Criteria:
For the SQL plugin, here are a couple of action items:
- [ ] Investigate the current default limit of script.context.filter.max_compilations_rate
- [ ] Implement the fix/enhancement to address the above issue
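For the investigation item, the following sketch builds the JSON body that could be sent to the cluster settings API (`PUT _cluster/settings`) to adjust the limit named in the error message. It assumes the setting can be updated dynamically on this OpenSearch version. The helper names and the `150/5m` value are hypothetical and not a recommendation, and the rate validation is deliberately simplified.

```python
import json
import re
from typing import Tuple

# Setting name taken from the CircuitBreakingException message above.
SETTING_KEY = "script.context.filter.max_compilations_rate"


def parse_rate(rate: str) -> Tuple[int, str]:
    """Split a rate such as '75/5m' into (compilations, window)."""
    # Simplified check: an integer count, a slash, then an integer with s/m/h/d.
    match = re.fullmatch(r"(\d+)/(\d+[smhd])", rate)
    if match is None:
        raise ValueError(f"unexpected rate format: {rate!r}")
    return int(match.group(1)), match.group(2)


def build_settings_body(rate: str, persistent: bool = True) -> str:
    """Return the JSON body for a cluster settings update with the given rate."""
    parse_rate(rate)  # reject malformed values before building the request
    scope = "persistent" if persistent else "transient"
    return json.dumps({scope: {SETTING_KEY: rate}})


if __name__ == "__main__":
    print(parse_rate("75/5m"))            # the limit reported in the error
    print(build_settings_body("150/5m"))  # illustrative value only
```

Raising the limit only moves the threshold; the number of compiled scripts per query stays the same.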
Adding some additional input from my perspective here:
I don't think this is a reliable design for this experience in general. The current design in OSD fires n queries, where n depends on the number of defined index fields. In other words, once the number of index fields is large enough, it will eventually cause a performance issue, even with a single query that includes all the fields.
@qianheng-aws is working on the optimization in https://github.com/opensearch-project/sql/issues/4757, but we don't plan to fix the same issue in v2. Can you confirm which PPL execution engine is running in OSD's new Discover? The query appears to be running on the v2 engine. Take this query:
source = demo-ai-logs-otel-v1*
| WHERE `@timestamp` >= '2025-07-29 22:53:20.604' AND `@timestamp` <= '2025-10-27 22:53:20.604'
| where isnotnull(`attributes.State`)
| stats count() as field_count, distinct_count(`attributes.State`) as distinct_count
In V2, the predicates `@timestamp` >= '2025-07-29 22:53:20.604', `@timestamp` <= '2025-10-27 22:53:20.604', and isnotnull(`attributes.State`) together create 3 scripts. In V3 they create 0.
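As a rough back-of-the-envelope check, the 3 scripts per query can be set against the `75/5m` limit from the error message. The sketch below treats the limit as a simple budget per window and assumes the worst case, where every script is compiled fresh with no script cache hits. Both are simplifying assumptions, and the function name is hypothetical.

```python
def max_queries_before_trip(compilation_limit: int, scripts_per_query: int) -> int:
    """Number of queries that fit within the compilation budget for one window."""
    if scripts_per_query <= 0:
        raise ValueError("scripts_per_query must be positive")
    return compilation_limit // scripts_per_query


if __name__ == "__main__":
    # 75 compilations per 5 minutes, 3 scripts per V2 query: 75 // 3 == 25
    print(max_queries_before_trip(75, 3))
```

Under these assumptions, an index with more than 25 fields would be enough to trip the breaker on a single Discover page load.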
> But we don't plan to fix the same issue in v2. Can you confirm which is the PPL execution engine running in OSD's new discover? It seems the query is running on v2 engine.

The `using lang [opensearch_query_expression]` fragment in the error message indicates that the v2 engine's script is being used. cc @LantaoJin @RyanL1997
> Can you confirm which is the PPL execution engine running in OSD's new discover?

Hi @LantaoJin and @qianheng-aws, thanks for the info. Yes, let's confirm with @paulstn on this. Let's also confirm the design currently in production, since this issue was originally reported against a WIP feature in OSD.
The workaround mentioned in the issue description, reducing the number of queries, is not viable; the best option is still to send out multiple queries. I just noticed that the circuit breaker is hit only when the query runs on V2, but I want to note that supporting this case as well would be ideal.