cortex Flaky test: TestPrometheusCompatibilityQueryFuzz

Seen on #6650

    query_fuzz_test.go:1649: 
        	Error Trace:	/home/runner/work/cortex/cortex/integration/query_fuzz_test.go:1649
        	            				/home/runner/work/cortex/cortex/integration/query_fuzz_test.go:1534
        	Error:      	finished query fuzzing tests
        	Test:       	TestPrometheusCompatibilityQueryFuzz
        	Messages:   	1 test cases failed

Mar 17 '25 23:03 dsabsay

I looked at all failures of this test on master in the last month. One looked like a resource issue on CI workers. For the rest, I recorded the failing queries below:

    query_fuzz_test.go:1786: case 1068 # of samples mismatch.
        range query: (
            sort_desc(bottomk(4, acos({__name__="test_series_a",job=~"te.*"} offset -1m31s)))
          ^
            {__name__="test_series_a",status_code!~".*00"} @ end()
        )

    query_fuzz_test.go:1786: case 1717 # of samples mismatch.
        range query: -(topk(3, hour({__name__="test_series_b"})) % -{__name__="test_series_a"})

    query_fuzz_test.go:1786: case 1020 # of samples mismatch.
        range query: rad(
          (
              {__name__="test_series_a"} offset 28s
            >= bool
              bottomk without (status_code, series) (
                4,
                ({__name__="test_series_a"} < bool {__name__="test_series_b"} @ end() offset -1m45s)
              )
          )
        )

    query_fuzz_test.go:1786: case 1357 # of samples mismatch.
        range query: -(
            bottomk by (status_code, job, __name__) (
              1,
              (
                  {__name__="test_series_a",series!="2",status_code!~"4.*"} offset -2m37s
                <= bool
                  {__name__="test_series_a"} offset -3m33s
              )
            )
          == bool
            (
                scalar({__name__="test_series_a",series!~".*"} offset -2m25s)
              > bool
                cos({__name__="test_series_a"} offset -4m34s)
            )
        )

    query_fuzz_test.go:1786: case 1959 # of samples mismatch.
        range query: (
            ceil(bottomk(4, round({__name__="test_series_a"} offset -4m22s, 5)))
          and
            min_over_time({__name__="test_series_b"} offset -4m50s[1h:1m] offset -21s)
        )

    query_fuzz_test.go:1786: case 1196 # of samples mismatch.
        range query: -(
            -{__name__="test_series_a"}
          %
            bottomk(5, ({__name__="test_series_b"} > bool {__name__="test_series_a"} offset -2m56s))
        )

    query_fuzz_test.go:1786: case 1509 # of samples mismatch.
        range query: (
            topk without (status_code, series, job) (
              3,
              -({__name__="test_series_a",series=~".*"} - {__name__="test_series_a"} offset -2m38s)
            )
          <= bool
            predict_linear(
              {__name__="test_series_b"} offset 2m59s[1h:1m],
              scalar(-{__name__="test_series_a",status_code="502"} offset 4m23s)
            )
        )

    query_fuzz_test.go:1824: case 1119 # of samples mismatch.
        range query: cosh(
          (
              {__name__="test_series_a",job="test"} offset -31s
            +
              bottomk by (job, __name__) (3, tanh({__name__="test_series_b",status_code=~".*00"}))
          )
        )

All queries above use either topk or bottomk, composed with some other operation or function.

In https://github.com/cortexproject/cortex/pull/6350, we introduced special equality checks for queries that use bottomk or topk to reduce test failures because these queries are not deterministic.

Looking at the printed test results for one of the failing queries, I see that different results are returned from Prometheus itself on different iterations of the same test using the same query.

If Prometheus itself is not deterministic for these queries, I think we should adjust our equality logic for these tests. I propose the following:

Exclude all bottomk/topk queries from this test.
Create a new test that only uses bottomk/topk queries. For each query, it should run the query 100 times on Cortex and Prometheus. Each result from Cortex is considered correct iff it matches at least one result from Prometheus.
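The acceptance rule proposed above could be sketched roughly as follows. This is only an illustration: `Sample`, `Result`, `matchesAny`, and `allAccepted` are hypothetical names, not types or helpers from the Cortex test suite, and exact equality via `reflect.DeepEqual` is an assumption (the real test would likely need tolerance for floats and NaN).

```go
package main

import (
	"fmt"
	"reflect"
)

// Sample is a hypothetical, simplified stand-in for one series in a query result.
type Sample struct {
	Labels string
	Values []float64
}

// Result is one full query response (an ordered list of series).
type Result []Sample

// matchesAny reports whether got is exactly equal to at least one candidate.
func matchesAny(got Result, candidates []Result) bool {
	for _, c := range candidates {
		if reflect.DeepEqual(got, c) {
			return true
		}
	}
	return false
}

// allAccepted reports whether every Cortex result matches
// at least one of the results observed from Prometheus.
func allAccepted(cortex, prom []Result) bool {
	for _, r := range cortex {
		if !matchesAny(r, prom) {
			return false
		}
	}
	return true
}

func main() {
	// Two different answers seen from Prometheus for the same query.
	prom := []Result{
		{{Labels: `{series="1"}`, Values: []float64{1, 1}}},
		{{Labels: `{series="2"}`, Values: []float64{1, 1}}},
	}
	// Cortex returned the second variant, so it is accepted.
	cortex := []Result{
		{{Labels: `{series="2"}`, Values: []float64{1, 1}}},
	}
	fmt.Println(allAccepted(cortex, prom)) // true
}
```

With 100 runs per side, this check is cheap compared to the queries themselves. It cannot tell a real Cortex bug from a valid result that Prometheus never happened to return in those 100 runs.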

Jun 03 '25 01:06 dsabsay

I reproduced this in a standalone Prometheus server with one of the queries above. Setting a fixed end time and clicking "Execute" multiple times (while changing nothing else) yields different results.

The query time is more than 24 hours in the past, so newly ingested data does not affect the result.

I still don't understand why this is nondeterministic.

Jun 03 '25 02:06 dsabsay

I managed to reduce the suspect query to this:

topk(3, sgn({__name__="prometheus_tsdb_compaction_chunk_samples_bucket"}))

The above expression when used in a range query (via /api/v1/query_range) with fixed start/end/step returns different results almost every time the query is run against Prometheus.

When the above expression is used in an instant query (via Prometheus UI) with fixed evaluation time, the results are stable and do not change from one execution to the next.

Removing the sgn() from the expression yields stable results. sgn() maps every value to -1, 0, or 1, so many series end up with identical values. Because of this, I suspect that when multiple series have identical values, the order of the returned series is unstable. Where exactly this sorting is (not) happening, I'm not sure yet.

Jun 03 '25 04:06 dsabsay