Flaky test: TestPrometheusCompatibilityQueryFuzz
Seen on #6650
query_fuzz_test.go:1649:
Error Trace: /home/runner/work/cortex/cortex/integration/query_fuzz_test.go:1649
/home/runner/work/cortex/cortex/integration/query_fuzz_test.go:1534
Error: finished query fuzzing tests
Test: TestPrometheusCompatibilityQueryFuzz
Messages: 1 test cases failed
I looked at all failures of this test on master in the last month. One looked like a resource issue on the CI workers. For the rest, I recorded the failing queries below:
query_fuzz_test.go:1786: case 1068 # of samples mismatch.
range query: (
sort_desc(bottomk(4, acos({__name__="test_series_a",job=~"te.*"} offset -1m31s)))
^
{__name__="test_series_a",status_code!~".*00"} @ end()
)
query_fuzz_test.go:1786: case 1717 # of samples mismatch.
range query: -(topk(3, hour({__name__="test_series_b"})) % -{__name__="test_series_a"})
query_fuzz_test.go:1786: case 1020 # of samples mismatch.
range query: rad(
(
{__name__="test_series_a"} offset 28s
>= bool
bottomk without (status_code, series) (
4,
({__name__="test_series_a"} < bool {__name__="test_series_b"} @ end() offset -1m45s)
)
)
)
query_fuzz_test.go:1786: case 1357 # of samples mismatch.
range query: -(
bottomk by (status_code, job, __name__) (
1,
(
{__name__="test_series_a",series!="2",status_code!~"4.*"} offset -2m37s
<= bool
{__name__="test_series_a"} offset -3m33s
)
)
== bool
(
scalar({__name__="test_series_a",series!~".*"} offset -2m25s)
> bool
cos({__name__="test_series_a"} offset -4m34s)
)
)
query_fuzz_test.go:1786: case 1959 # of samples mismatch.
range query: (
ceil(bottomk(4, round({__name__="test_series_a"} offset -4m22s, 5)))
and
min_over_time({__name__="test_series_b"} offset -4m50s[1h:1m] offset -21s)
)
query_fuzz_test.go:1786: case 1196 # of samples mismatch.
range query: -(
-{__name__="test_series_a"}
%
bottomk(5, ({__name__="test_series_b"} > bool {__name__="test_series_a"} offset -2m56s))
)
query_fuzz_test.go:1786: case 1509 # of samples mismatch.
range query: (
topk without (status_code, series, job) (
3,
-({__name__="test_series_a",series=~".*"} - {__name__="test_series_a"} offset -2m38s)
)
<= bool
predict_linear(
{__name__="test_series_b"} offset 2m59s[1h:1m],
scalar(-{__name__="test_series_a",status_code="502"} offset 4m23s)
)
)
query_fuzz_test.go:1824: case 1119 # of samples mismatch.
range query: cosh(
(
{__name__="test_series_a",job="test"} offset -31s
+
bottomk by (job, __name__) (3, tanh({__name__="test_series_b",status_code=~".*00"}))
)
)
All queries above use either topk or bottomk, composed with some other operation or function.
In https://github.com/cortexproject/cortex/pull/6350, we introduced special equality checks for queries that use bottomk or topk to reduce test failures because these queries are not deterministic.
Looking at the printed test results for one of the failing queries, I see that different results are returned from Prometheus itself on different iterations of the same test using the same query.
If Prometheus itself is not deterministic for these queries, I think we should adjust our equality logic for these tests. I propose the following:
- Exclude all bottomk/topk queries from this test.
- Create a new test that only uses bottomk/topk queries. For each query, it should run the query 100 times on Cortex and Prometheus. Each result from Cortex is considered correct iff it matches at least one result from Prometheus.
I reproduced this in a standalone Prometheus server with one of the queries above. Setting a fixed end time and clicking "Execute" multiple times (while changing nothing else) yields different results.
The query time is set to over 24 hours in the past, so new data is not affecting it.
I still don't understand why this is nondeterministic.
I managed to reduce the suspect query to this:
topk(3, sgn({__name__="prometheus_tsdb_compaction_chunk_samples_bucket"}))
The above expression when used in a range query (via /api/v1/query_range) with fixed start/end/step returns different results almost every time the query is run against Prometheus.
When the above expression is used in an instant query (via Prometheus UI) with fixed evaluation time, the results are stable and do not change from one execution to the next.
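For anyone who wants to repeat the range-query check without clicking through the UI, here is a minimal Go sketch. It sends the same /api/v1/query_range request several times and counts how many different response bodies come back. The server address, the timestamps, the step and the helper names (rangeQueryURL, countDistinct) are assumptions for illustration only. Comparing raw bodies is deliberately crude: it flags any difference, including a pure reordering of series.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// rangeQueryURL builds a query_range URL with fixed start, end and step.
func rangeQueryURL(base, query, start, end, step string) string {
	v := url.Values{}
	v.Set("query", query)
	v.Set("start", start)
	v.Set("end", end)
	v.Set("step", step)
	return base + "/api/v1/query_range?" + v.Encode()
}

// countDistinct calls fetch n times and returns how many different
// response bodies were seen (compared by SHA-256 hash).
func countDistinct(fetch func() ([]byte, error), n int) (int, error) {
	seen := map[[32]byte]struct{}{}
	for i := 0; i < n; i++ {
		body, err := fetch()
		if err != nil {
			return 0, err
		}
		seen[sha256.Sum256(body)] = struct{}{}
	}
	return len(seen), nil
}

func main() {
	// Assumed values: a local Prometheus and an arbitrary fixed one-hour window.
	u := rangeQueryURL(
		"http://localhost:9090",
		`topk(3, sgn({__name__="prometheus_tsdb_compaction_chunk_samples_bucket"}))`,
		"1700000000", "1700003600", "60",
	)
	client := &http.Client{Timeout: 5 * time.Second}
	fetch := func() ([]byte, error) {
		resp, err := client.Get(u)
		if err != nil {
			return nil, err
		}
		defer resp.Body.Close()
		return io.ReadAll(resp.Body)
	}
	n, err := countDistinct(fetch, 20)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("distinct responses:", n)
}
```

A count greater than 1 for a fixed window in the past would be consistent with the behavior described above.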
Removing the sgn() from the expression yields stable results. Because of this, I suspect that when multiple series have identical values, the order of the returned series is unstable. Where exactly this sorting is (not) happening, I'm not sure yet.