Off-by-one errors in Java APM Integration Tests
Over the past year, we have repeatedly seen APM Integration tests fail because a single event that we believe was sent is not received, which causes an assertion to fail.
Error
AssertionError: queried for [('processor.event', 'transaction'), ('service.name', ['springapp'])], expected 1000, got 999
Examples
https://apm-ci.elastic.co/job/apm-integration-tests/job/7.x/751/testReport/junit/tests.agent/test_java/Integration_Tests___All___test_concurrent_req_java_spring/
https://apm-ci.elastic.co/job/apm-integration-tests/job/7.x/738/testReport/junit/tests.agent/test_java/Integration_Tests___All___test_concurrent_req_java_spring/
Discussion
I would like to discuss either finding a fix for the underlying issue or, if we agree that this condition does not represent an actual software defect, adjusting the assertion to accept either 999 or 1000.
If we could do this, it would go a long way toward increasing the overall stability of the APM Integration Test suite which is used by many teams to verify the functionality of the APM suite of products.
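To make the second option concrete, here is a minimal sketch of a tolerant count check. The helper name `assert_count_within_tolerance` and its `max_missing` parameter are hypothetical and are not part of the existing test suite; the sketch assumes we would accept at most one missing event and never accept more events than expected.

```python
def assert_count_within_tolerance(actual, expected, max_missing=1):
    """Pass when at most `max_missing` events are absent; never accept extras.

    Hypothetical helper, not taken from the integration test suite.
    """
    lowest_accepted = expected - max_missing
    assert lowest_accepted <= actual <= expected, (
        "expected between {} and {}, got {}".format(lowest_accepted, expected, actual)
    )


# Both counts seen in the failing builds would pass with the default tolerance.
assert_count_within_tolerance(999, 1000)
assert_count_within_tolerance(1000, 1000)
```

Keeping the upper bound strict means duplicated events would still fail the test, so the relaxation only hides the single missing event described above.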
Thanks!
cc: @elastic/observablt-robots
This has been the case for years, thanks for taking the initiative to address it!
Is it specific only to Java? If so, should we still assume the event is sent and not received?
To understand whether all events are sent, we could set the agent to log_level=debug and ingest the test logs somewhere, which would let us count log events.
Any serialization and send attempt should log (expecting 1000):
Receiving TRANSACTION event (sequence <ring-buffer-slot-number>)
Any serialization/write failure should log:
Failed to handle event of type TRANSACTION with this error: ...
Failure to create a connection with the APM Server should log:
Failed to get APM server connection, dropping event: ...
Failure to send should log:
Error sending data to APM server: ...
In addition, at shutdown we log:
Reported events: <num-events>
Dropped events: <num-events>
where the dropped events should contain failures to send.
Once we know whether the right number of events was sent, we will know whether to investigate upstream (agent tracing logic) or downstream (APM Server).
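As a rough illustration of that counting, the sketch below tallies the message fragments listed above from an iterable of log lines. The function name `summarize_agent_log` and the sample lines are made up for illustration, and the timestamp and level prefix a real agent log would carry are not modelled; only the quoted message fragments are matched.

```python
import re
from collections import Counter

# Message fragments quoted in this thread; the surrounding log format is assumed.
PATTERNS = {
    "received": re.compile(r"Receiving TRANSACTION event"),
    "handle_failed": re.compile(r"Failed to handle event of type TRANSACTION"),
    "no_connection": re.compile(r"Failed to get APM server connection, dropping event"),
    "send_error": re.compile(r"Error sending data to APM server"),
}
SUMMARY = re.compile(r"(Reported|Dropped) events: (\d+)")


def summarize_agent_log(lines):
    """Count matching messages and pick up the shutdown totals (hypothetical helper)."""
    counts = Counter()
    totals = {}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
        match = SUMMARY.search(line)
        if match:
            totals[match.group(1).lower()] = int(match.group(2))
    return counts, totals


# Made-up sample lines, only to show the shape of the result.
sample = [
    "Receiving TRANSACTION event (sequence 17)",
    "Receiving TRANSACTION event (sequence 18)",
    "Error sending data to APM server: connection reset",
    "Reported events: 2",
    "Dropped events: 1",
]
counts, totals = summarize_agent_log(sample)
print(dict(counts), totals)
```

For a failing run we would expect `counts["received"]` to be 1000. If it is, and the failure counters and dropped total are zero, the missing event was most likely lost downstream of the agent.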
Thanks, @eyalkoren. This is an excellent proposal. (As always!)
I have submitted https://github.com/elastic/apm-integration-testing/pull/1302 to crank up the log level to debug for a few days to see if we can catch this issue happening. If I can, I'll return to this issue with logs for us to look at.
I believe this is no longer an issue. I will leave it open for a couple of days and then close it if no concerns are raised.
Closing this. Feel free to reopen if it's still an issue.