java-driver icon indicating copy to clipboard operation
java-driver copied to clipboard

Make MetricsITBase tests serial instead of parallelizable (CASSJAVA-19)

Open Bouncheck opened this issue 6 months ago • 1 comments

This change fixes random failures by removing MicrometerMetricsIT, MicroProfileMetricsIT and DropwizardMetricsIT from the ParallelizableTests category. It is not the only solution but seems to be the simplest workaround.

The reasoning is as follows: should_evict_down_node_metrics_when_timeout_fires() has two places where it manipulates AbstractMetricUpdater.MIN_EXPIRE_AFTER in order to set DefaultDriverOption.METRICS_NODE_EXPIRE_AFTER to 1 second during session initialization. Otherwise MIN_EXPIRE_AFTER would not allow for that and warning log would be printed about using higher value than what user wanted to set. It lowers it before session initialization: https://github.com/apache/cassandra-java-driver/blob/17ebe6092e2877d8c524e07489c4c3d005cfeea5/integration-tests/src/test/java/com/datastax/oss/driver/core/metrics/MetricsITBase.java#L157 And sets it back to 5 mintues at the end of the test: https://github.com/apache/cassandra-java-driver/blob/17ebe6092e2877d8c524e07489c4c3d005cfeea5/integration-tests/src/test/java/com/datastax/oss/driver/core/metrics/MetricsITBase.java#L186 The code comment in AbstractMetricUpdater also mentions that this variable is intentionally not made final for testing purposes. I believe this is what causes the intermittent failures. When those three tests that extend this class run in parallel it is possible that test B sets the minimum back to 5 minutes after the test A lowers it to 1 second but before test A initializes its session. The test waits around 10 seconds for the change to happen, but in such case it would need to wait for 5 minutes. Warnings with this pattern should be visible when the test fails on the CI:

c.d.o.d.i.c.m.AbstractMetricUpdater - [s6] Value too low for advanced.metrics.node.expire-after: PT1S. Forcing to PT5M instead.

I can confirm that this is what was happening when I was testing this locally, but I did not try to search for the logs on cassandra-java-driver's CI.

Bouncheck avatar Jul 23 '25 10:07 Bouncheck

Here's also a handy script for reproducing this issue. I think I needed around 20 runs on version before the change to trigger the bug. (note: exit code 0 does not mean no test failed in general, but here it's sufficient)

#!/bin/bash

mvn clean install -P fast

echo "Running tests in a loop until failure detection..."
counter=1

while true; do
    echo "=============================================="
    echo "Attempt #$counter"
    echo "=============================================="

    # Run the test command and show output in real-time while also capturing it
    mvn -pl integration-tests integration-test -Dit.test='MicrometerMetricsIT, MicroProfileMetricsIT, DropwizardMetricsIT' | tee /tmp/maven_output.log

    # Capture the exit status of the tee command (which will be the exit status of the Maven command)
    maven_status=${PIPESTATUS[0]}

    # Check for explicit error containing the method name in the captured output
    if grep -q "\[ERROR\].*should_evict_down_node_metrics_when_timeout_fires" /tmp/maven_output.log; then
        echo ""
        echo "Found test failure on attempt #$counter"
        echo "Failing test output:"
        grep -A 5 -B 2 "should_evict_down_node_metrics_when_timeout_fires" /tmp/maven_output.log
        break
    fi

    # Also check standard Maven error exit code
    if [ $maven_status -ne 0 ]; then
        echo ""
        echo "Test command failed on attempt #$counter, but the specific error wasn't found."
        break
    fi

    echo ""
    echo "Test passed. Continuing..."
    ((counter++))

    # Optional: small delay between runs
    sleep 1
done

echo ""
echo "Tests failed after $counter attempts."

Bouncheck avatar Jul 23 '25 11:07 Bouncheck