azure-sdk-for-java

[BUG] Cosmos hangs forever with CosmosEndToEndOperationLatencyPolicyConfig set

Open lnist opened this issue 1 year ago • 4 comments

Describe the bug Certain operations cause the Cosmos SDK to hang forever and certain operations do not respect the timeout set by CosmosEndToEndOperationLatencyPolicyConfig.

It seems the hangs occur for operations that span partitions.

To Reproduce See this example repository and test: https://github.com/lnist/cosmos-sdk-hang/blob/main/src/test/java/cosmosTimeouts.java

In the test you need to fill in the connection string and master key for cosmos.

The test utilizes WireMock to simulate a delay in accessing the cosmos backend. For this a self-signed certificate is used, since the Cosmos SDK insists on using HTTPS.

If you execute the tests then they are all expected to fail due to timeout from the Cosmos SDK. That does not happen.

The readAllContainers and properties tests both return the desired data, but it takes longer than the configured timeout of 1 second. They should fail instead.

The readNonDefaultPartitionKey, count, readAll, and writeBulk tests all respect the timeout of 1 second if the DELAY parameter is set to 2_000, but they hang forever (until the test timeout of 1 minute) if the DELAY parameter is set to 10_000.

Note: The code includes a couple of configurations that I think are redundant, but they were used during extensive testing, so I did not want to change them. A quick test without them seems to indicate the issues are present with default parameters (except of course for the CosmosEndToEndOperationLatencyPolicyConfig)
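
As a rough illustration of the test setup described above, the sketch below uses only JDK classes: a local `HttpServer` stands in for the WireMock proxy and delays its response, while a `java.net.http.HttpClient` request carries a timeout. This is not the code from the linked repository and it does not touch the Cosmos SDK. The class name `DelayedBackendSketch`, the method `timesOut`, and the plain-HTTP loopback endpoint are assumptions made for the example.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

final class DelayedBackendSketch {

    /**
     * Starts a local server that waits delayMillis before answering, sends one
     * request with the given timeout, and reports whether the request timed out.
     */
    static boolean timesOut(long delayMillis, Duration timeout) {
        HttpServer server;
        try {
            // Port 0 lets the OS pick a free port (hypothetical stand-in for the proxy).
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        server.createContext("/", exchange -> {
            try {
                Thread.sleep(delayMillis); // simulated backend latency
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            byte[] body = "{}".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        try {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/"))
                .timeout(timeout) // per-request timeout, analogous to the 1 second budget
                .build();
            client.send(request, HttpResponse.BodyHandlers.ofString());
            return false; // a response arrived within the timeout
        } catch (HttpTimeoutException e) {
            return true; // the client gave up, which is what the tests expect
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            server.stop(0);
        }
    }
}
```

With a delay longer than the timeout, `timesOut` should return `true`. The behavior reported in this issue corresponds to the client neither returning nor failing in that situation.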

Code Snippet Add the code snippet that causes the issue.

Expected behavior The API uses the configured timeout.

Setup (please complete the following information):

  • OS: Windows 11
  • IDE: IntelliJ
  • Library/Libraries: com.azure:azure-cosmos:4.61.1
  • Java version: 21
  • App Server/Environment: jupiter test runner
  • Frameworks: N/A

Information Checklist Kindly make sure that you have added all the following information above and check off the required fields; otherwise we will treat the issue as an incomplete report

  • [x] Bug Description Added
  • [x] Repro Steps Added
  • [x] Setup information Added

lnist avatar Jun 24 '24 14:06 lnist

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @kushagraThapar @pjohari-ms @TheovanKraay.

github-actions[bot] avatar Jun 24 '24 14:06 github-actions[bot]

@tvaron3 please take a look at this, thanks!

kushagraThapar avatar Jun 24 '24 20:06 kushagraThapar

@tvaron3: Do you need any information from us? This still reproduces.

lnist avatar Aug 29 '24 14:08 lnist

@lnist - I recall we already solved some of these in the latest version. Are you testing with the latest version or still with 4.61.1? @nehrao1 - can you please investigate whether this still reproduces on the latest version of the SDK?

kushagraThapar avatar Aug 29 '24 19:08 kushagraThapar

@kushagraThapar and @nehrao1 : I have just reproduced the issues in the originally supplied example on version 4.63.2, so they are not fixed.

One change in the new SDK version is that the readAllContainers test seems to return very quickly, indicating that some of the information is now cached earlier. Apart from this, the behavior is the same for all other tests.

lnist avatar Aug 30 '24 07:08 lnist

I have confirmed the issue persists with 4.63.3 (2024-09-10).

lnist avatar Sep 24 '24 13:09 lnist

Thanks @lnist for confirming with the latest version. I looked at the sample code you provided and wanted to clarify that the end-to-end operation policy config does not apply to all SDK operations. It applies only to document operations, so it is expected that your database and container operations do not follow the timeout. We will update our documentation to reflect this behavior; apologies for that. Regarding the hang, I am curious whether this is only seen through WireMock, because we have not received complaints from any other customers so far, and this release has been thoroughly stress tested since we are preparing for the upcoming holiday season.

Could you please use this fault injection framework, which was developed specifically to test delays, response timeouts, and high-latency scenarios: https://learn.microsoft.com/en-us/java/api/overview/azure/cosmos-test-readme?view=azure-java-preview It injects faults in the SDK, so you can test all of these scenarios with it.

kushagraThapar avatar Sep 24 '24 19:09 kushagraThapar

@kushagraThapar : We have made the tests because we previously observed hangs in production, so it is not just with WireMock. We use the sync container, and the testing framework you provide does, as far as I can tell, only support the async container.

Since WireMock is just an HTTP proxy, even if the issue can only be reproduced with WireMock, please consider whether any badly behaving route between the client and the Cosmos database could cause similar issues.

Regarding documentation: if the end to end policy does not affect all operations, then what are the timeouts for those operations?
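
Until the timeouts for those operations are clarified, one possible caller-side guard is to run the blocking call on a separate thread and bound the wait. The sketch below uses only `java.util.concurrent`; it is not part of the Cosmos SDK and is not a fix for the hang. The names `DeadlineGuard` and `callWithDeadline` are hypothetical, and the `Supplier` stands in for whatever blocking call needs a hard deadline.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

final class DeadlineGuard {

    // Daemon threads, so a call that never returns cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "deadline-guard");
        t.setDaemon(true);
        return t;
    });

    /**
     * Runs blockingCall on a worker thread and waits at most deadline for its result.
     * Throws TimeoutException if the deadline passes first.
     */
    static <T> T callWithDeadline(Supplier<T> blockingCall, Duration deadline)
            throws TimeoutException {
        Callable<T> task = blockingCall::get;
        Future<T> future = POOL.submit(task);
        try {
            return future.get(deadline.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort only: interrupts the worker, which the call may ignore.
            future.cancel(true);
            throw e;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("operation failed", e.getCause());
        }
    }
}
```

This bounds only how long the caller waits. The underlying operation may keep running and holding resources after the deadline, so it is a stopgap rather than a substitute for the SDK honoring its own timeout.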

lnist avatar Sep 25 '24 02:09 lnist

@lnist apologies for the delay here, I was on vacation. I tried reproducing the issue, and it is reproducible on my Windows machine. I will debug more to see what is going on there. One thing to add is that our sync container is actually based on the async container under the hood; everything in the Cosmos Java SDK is async in nature. Since you have also observed these issues in production, we will continue to investigate them.

kushagraThapar avatar Oct 29 '24 22:10 kushagraThapar

@dibahlfi can you please take a look at this, thanks!

kushagraThapar avatar Oct 09 '25 00:10 kushagraThapar