
[QUERY] tryReceive operations generated from DefaultJmsListenerContainer of spring-jms

yiliuTo opened this issue 3 years ago · 3 comments

Query/Question When using the @JmsListener API with the spring-cloud-azure-starter-servicebus-jms library, customers observed an ongoing, high volume of tryReceive requests sent from the spring-cloud-azure-starter-servicebus-jms library to Azure Service Bus. In the customer's application, the request rate is about 250/min. We need to investigate the root cause of this behavior.

Why is this not a Bug or a feature Request? The @JmsListener API with spring-cloud-azure-starter-servicebus-jms by default leverages DefaultJmsListenerContainer to pull messages from Service Bus, so background pull requests are expected. Currently we need more information from the customer about their project environment and structure to reproduce the issue and locate the cause.

Setup (please complete the following information if applicable):

  • OS: [e.g. iOS]
  • IDE: [e.g. IntelliJ]
  • Library/Libraries: [e.g. com.azure:azure-core:1.16.0 (groupId:artifactId:version)]

Information Checklist Kindly make sure that you have added all the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.

  • [x] Query Added
  • [x] Setup information Added

yiliuTo avatar Jul 29 '22 08:07 yiliuTo

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

ghost avatar Aug 16 '22 08:08 ghost

When using two applications to connect to the Service Bus, the requests are close to 250/min. For each additional application, the requests increase by about 120.

As shown below:

[image: Service Bus request metrics chart]

hui1110 avatar Sep 22 '22 02:09 hui1110

Add the following property to the configuration:

spring.jms.listener.receive-timeout=

Timeout to use for receiving calls. Use -1 for a no-wait receive or 0 for no timeout at all. The latter is only feasible if not running within a transaction manager and is generally discouraged since it prevents clean shutdown.

When there are no messages and the timeout is set to 0 or a larger value, the number of Service Bus requests drops to close to 0. However, the number of pulled messages is not the same as the number of requests.
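
For reference, the same timeout can also be set programmatically on the listener container factory. Below is a minimal sketch, assuming a Spring Boot application where the ConnectionFactory is auto-configured by the starter; the bean name jmsListenerContainerFactory and the 30-second value are illustrative only, not recommendations:

import javax.jms.ConnectionFactory;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jms.config.DefaultJmsListenerContainerFactory;

@Configuration(proxyBeanMethods = false)
public class ListenerTimeoutConfiguration {

    @Bean
    DefaultJmsListenerContainerFactory jmsListenerContainerFactory(ConnectionFactory connectionFactory) {
        DefaultJmsListenerContainerFactory factory = new DefaultJmsListenerContainerFactory();
        factory.setConnectionFactory(connectionFactory);
        // Programmatic equivalent of spring.jms.listener.receive-timeout, in milliseconds.
        factory.setReceiveTimeout(30_000L);
        return factory;
    }
}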

hui1110 avatar Sep 26 '22 09:09 hui1110

Used arthas to monitor the number of executions of the schedule method; it is the same as the number of requests on the Service Bus.

hui1110 avatar Oct 09 '22 02:10 hui1110

  • [ ] reproduce issue
  • [ ] expected behavior

stliu avatar Nov 08 '22 07:11 stliu

Used the following environments to try to reproduce the issue.

Environment 1

  • Spring Boot version: 2.7.3
  • Java version: 11
  • spring-cloud-azure-starter-servicebus-jms: 4.4.1
  • Windows 11 (8 CPUs, 16384 MB RAM)

Environment 2

  • Spring Boot version: 2.7.3
  • Java version: 11
  • spring-cloud-azure-starter-servicebus-jms: 4.4.1
  • Azure Kubernetes Service

Project used: https://github.com/Azure-Samples/azure-spring-boot-samples/tree/spring-cloud-azure_v4.4.1/servicebus/spring-cloud-azure-starter-servicebus-jms/servicebus-jms-topic

Expect

In the receive() method, the timeout defaults to 1s, and in the message-pulling method performPullIfRequired(long timeout, boolean treatAsPullConsumer), the timeout is the validity period of the pull request. Viewing the Metrics in Service Bus, the number of requests should therefore be around 60 per minute.

Check

The number of Successful Requests is also the number of pull requests.

  • After starting the Spring Boot application on Windows, the Service Bus Metrics show around 90. As shown below: [image: Service Bus request metrics chart]

  • After deploying the Spring Boot application to Azure Kubernetes Service, the Service Bus Metrics show around 80. As shown below:

[image: Service Bus request metrics chart]

hui1110 avatar Nov 10 '22 05:11 hui1110

@hui1110 I still have some doubts towards your reproducing information:

  1. Since there is an "Environment 1", can I expect other, different environments?
  2. Please confirm whether the version of spring-cloud-azure-starter-servicebus-jms you are using is 4.3.0 or 4.4.1.
  3. Please confirm whether you deployed the project to ASA or AKS. I believe we decided to use AKS for the reproduction.
  4. Assuming you used ASA, could you provide the configuration information of your ASA service?
  5. For your local test env, please provide the CPU count and memory.
  6. The expected behavior is not clear: what does the timeout mean? Could you provide a complete description? And why is it related to the number of requests?
  7. For the chart of Service Bus metrics, there are multiple metrics shown; which one stands for the tryReceive calls we are talking about?
  8. Assuming you are using ASA, it's OK not to provide another chart showing the number, since you might not have that env right now. But please delete the incomplete sentence "As shown below:".

yiliuTo avatar Nov 11 '22 02:11 yiliuTo

Cause analysis

Under the hood, @JmsListener is a polling listener built for repeated polling attempts.

The listener runs an ongoing polling loop, each iteration invoking JMS MessageConsumer.receive() to poll the local consumer for messages to consume. By default, each poll operation triggers the local consumer to send 2 pull requests to the message broker:

[1343427689:1] -> Flow{nextIncomingId=1, incomingWindow=2047, nextOutgoingId=420, outgoingWindow=2147483647, handle=1, deliveryCount=209, linkCredit=1, available=null, drain=false, echo=false, properties=null}
[1343427689:1] -> Flow{nextIncomingId=1, incomingWindow=2047, nextOutgoingId=420, outgoingWindow=2147483647, handle=1, deliveryCount=209, linkCredit=1, available=null, drain=true, echo=false, properties=null}
[1343427689:1] <- Flow{nextIncomingId=420, incomingWindow=5000, nextOutgoingId=1, outgoingWindow=2047, handle=1, deliveryCount=210, linkCredit=0, available=0, drain=true, echo=false, properties=null}

The semantics of the two flow frames (pull requests) are illustrated in the image below: [image: flow-frame semantics]

The concrete polling process is decided by several properties, including receiveTimeout, prefetchSize and receiveLocalOnly (a simplified sketch of the resulting loop follows this list):

  • Configuring prefetchSize as zero makes the consumer a pull consumer, which is also the default. For each poll, the pull consumer sends pull requests to Service Bus.
  • For a consumer with non-zero prefetch, receiveLocalOnly or receiveNoWaitLocalOnly determines whether the receive operation polls the local consumer only. A false value, which is the default, can result in pull requests being sent to the server.
  • The configuration of receiveTimeout decides how long each pull request blocks, so it affects the frequency of pull requests sent to the server. The default value is 1 second.
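
To make the loop concrete, here is a simplified sketch of what the container's polling boils down to for a pull consumer (prefetch = 0). This is not the actual Spring DefaultMessageListenerContainer or Qpid JMS code; the queue name parameter and the timeout constant are illustrative:

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Session;

public final class PollingLoopSketch {

    private static final long RECEIVE_TIMEOUT_MILLIS = 1_000L; // default receive timeout

    public static void poll(ConnectionFactory connectionFactory, String queueName) throws Exception {
        try (Connection connection = connectionFactory.createConnection()) {
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageConsumer consumer = session.createConsumer(session.createQueue(queueName));
            while (!Thread.currentThread().isInterrupted()) {
                // With prefetch = 0, each receive() makes the local consumer send pull
                // requests (flow frames) to Service Bus, whether or not messages exist.
                Message message = consumer.receive(RECEIVE_TIMEOUT_MILLIS);
                if (message != null) {
                    // hand the message over to the @JmsListener method
                }
                // null means the timeout expired with no message; the loop polls again immediately
            }
        }
    }
}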

The complete flow is described in the below picture:

[image: jms-pull flow diagram]

In the customer's case, which is also the default configuration, case 2.1 applies: for each receive() call, the client sends pull requests to Service Bus. Therefore, the client keeps sending pull requests in a loop even when there are no messages in Service Bus. The frequency depends on the CPU status, performance and how many tasks are being handled, which influence how fast the loop runs. That's why Yonghui could reproduce a similar number of requests as the customer on Azure, while the request number changed when running on his own local machine.

yiliuTo avatar Dec 22 '22 06:12 yiliuTo

Solution

There are 2 solutions for dealing with the issue:

Solution 1. Change to push consumer and local-check only.

By changing the mode to push, the consumer becomes an Asynchronous Notification consumer: it doesn't pull messages from the broker, but maintains a target amount of link credit. The amount is decided by a prefetch property. As Service Bus (the sender) pushes messages, the sender's link credit decreases, and when it falls below a threshold, the client sends a request to the server to increase the sender's link credit back to the desired target amount.

To accomplish this, users can add the following configuration. First, configure the prefetch number as non-zero, which configures the consumer as non-pull. There are several prefetch properties, each controlling a different Service Bus entity type; users should choose the one(s) that apply to their case:

spring.jms.servicebus.prefetch.all=<Fallback value for prefetch option in this Service Bus namespace>
spring.jms.servicebus.prefetch.queue-prefetch=<The number of prefetch for queue.>
spring.jms.servicebus.prefetch.queue-browser-prefetch=<The number of prefetch for queue browser.>
spring.jms.servicebus.prefetch.topic-prefetch=<The number of prefetch for topic.>
spring.jms.servicebus.prefetch.durable-topic-prefetch=<The number of prefetch for durable topic.>

Second, configure local-only checking by adding a configuration class for the factory customizer:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration(proxyBeanMethods = false)
public class CustomJmsConfiguration {

    // ServiceBusJmsConnectionFactoryCustomizer is provided by the Spring Cloud Azure JMS autoconfiguration.
    @Bean
    ServiceBusJmsConnectionFactoryCustomizer customizer() {
        return factory -> {
            factory.setReceiveLocalOnly(true);
            // Use setReceiveNoWaitLocalOnly(true) below instead if you have set the property
            // spring.jms.listener.receive-timeout to a negative value; otherwise keep setReceiveLocalOnly(true) above.
            //factory.setReceiveNoWaitLocalOnly(true);
        };
    }
}

The value of prefetch affects how fast messages are dispatched to the consumer's local buffer. Users should adjust the value according to their consuming performance and message volume. A suitable value can speed up consumption, while too large a prefetch can cause locally buffered messages to become outdated and be dispatched again. For low message volumes, where each message takes a long time to process, the prefetch should be set to 1. This ensures that a consumer only processes one message at a time.

Solution 2. Change the receive timeout to extend the blocking time of each pull.

The receive timeout property decides how long the consumer blocks waiting on each pull request. So, by extending the timeout, users who keep the pull mode can reduce the pulling frequency and therefore the number of pull requests. In an extreme case, users can set the strategy to wait indefinitely until a message arrives, which means the consumer only pulls again after consuming a message; when there are no messages on the server, it simply blocks and waits.

To accomplish this, users can configure the property below, which is of java.time.Duration type and defaults to 1 second.

spring.jms.listener.receive-timeout=

The meaning of the value is:

Setting the receive-timeout to 0 means the pull blocks indefinitely until a message is dispatched. Setting it to a positive value means the pull blocks for up to the timeout amount of time. Setting it to a negative value means the pull is a no-wait receive: it returns a message immediately, or null if none is available.
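
Conceptually, the three cases map onto the standard JMS MessageConsumer receive variants roughly as sketched below (simplified; the listener container applies the configured value internally):

import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageConsumer;

public final class ReceiveTimeoutSemantics {

    public static Message receiveWith(MessageConsumer consumer, long receiveTimeoutMillis) throws JMSException {
        if (receiveTimeoutMillis < 0) {
            return consumer.receiveNoWait();               // negative -> no-wait receive
        }
        if (receiveTimeoutMillis == 0) {
            return consumer.receive();                     // 0 -> block until a message arrives
        }
        return consumer.receive(receiveTimeoutMillis);     // positive -> block up to the timeout
    }
}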

Note, a high timeout value has a side effect: it also extends the time the main thread spends blocked, so the container will be less responsive to stop() calls; the container can only stop between receive() calls.

Besides, since the container can only send requests after the receive-timeout elapses, if the interval is longer than 10 minutes Service Bus will close the link, and the listener, which by default uses a CachingConnectionFactory, will no longer be able to send/receive. So if you require a high receive-timeout, please use the JmsPoolConnectionFactory alongside it.
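
A minimal sketch of wrapping a ConnectionFactory in a pooled one is shown below, assuming pooled-jms (org.messaginghub:pooled-jms) is on the classpath; how the pooled factory is wired into your Spring context depends on your setup, and the pool size is illustrative. Also check whether your starter version already exposes pool-related configuration properties before wiring this manually.

import javax.jms.ConnectionFactory;

import org.messaginghub.pooled.jms.JmsPoolConnectionFactory;

public final class PooledConnectionFactorySketch {

    // Wraps the delegate so that pooled connections are managed and re-created,
    // instead of a single cached connection going stale after the broker closes an idle link.
    public static ConnectionFactory pooled(ConnectionFactory delegate) {
        JmsPoolConnectionFactory pool = new JmsPoolConnectionFactory();
        pool.setConnectionFactory(delegate);
        pool.setMaxConnections(2); // illustrative pool size
        return pool;
    }
}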

yiliuTo avatar Dec 22 '22 07:12 yiliuTo

Hints about how to read the request metrics in Azure Portal for Service Bus:

  1. Difference between sum and count: Referring to the doc, the sum aggregation means how many requests were collected in the last interval, while count presents how many measurements were captured in the interval. So the sum can be calculated by adding the results of each measurement that happened in the interval. For example, if three 1-minute measurements record 80, 90 and 85 requests, a 3-minute interval shows a sum of 255 and a count of 3.

  2. What is the aggregation interval? Service Bus sends data to the metrics pipeline at 1-minute aggregation. But when you view the metrics in the portal, the metrics are aggregated to fit the total time range you choose. For example, if you pick a 30-day duration, the portal may have to use daily aggregation to fit the samples in the graph; if you pick a 24-hour duration, it may pick hourly aggregation. The aggregation window changes automatically based on the duration. A quick way to figure out the interval of your current chart is to view it as a scatter chart; the interval between two adjacent records is the actual aggregation interval.

yiliuTo avatar Dec 27 '22 05:12 yiliuTo

Closing this issue since both the cause and the solution are clear now.

yiliuTo avatar Dec 29 '22 06:12 yiliuTo