[🐛 Bug]: Low concurrency
What happened?
V4 grid node utilization is very low compared to V3. In V3, if you had 100 grid nodes ready and registered to the hub and then ran 100 tests in parallel, all 100 grid nodes were immediately put to work. In V4 you can do the same and see only about 12 nodes being utilized; 88 nodes never get tests routed to them. Note: the queue in this scenario is 0.
Setting a higher "--max-threads" value has no effect, and in fact we cannot find this parameter being used anywhere in the Selenium code.
How can we reproduce the issue?
We see this consistently with V4.1.2 using Java (OpenJDK 11) and the latest TestNG. We ruled out TestNG as the cause and verified that when 100 tests were requested in parallel on an empty grid, the queue immediately went to 100.
The grid is started as distributed, with the hub and nodes on separate AWS VMs.
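The reproduction can be sketched with plain java.util.concurrent and no Selenium dependency. Here `requestSession` is a hypothetical stand-in for creating a RemoteWebDriver session against the Grid and running the test body; the point is the client-side pattern of 100 simultaneous session requests:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelSessionRepro {
    static final AtomicInteger active = new AtomicInteger();
    static final AtomicInteger peak = new AtomicInteger();

    // Hypothetical stand-in for "new RemoteWebDriver(gridUrl, options)" plus the test body.
    static void requestSession(CyclicBarrier allBusy) throws Exception {
        int now = active.incrementAndGet();
        peak.accumulateAndGet(now, Math::max); // record peak observed concurrency
        allBusy.await();                       // hold until all 100 sessions overlap
        active.decrementAndGet();
    }

    public static void main(String[] args) throws Exception {
        int parallel = 100;
        ExecutorService pool = Executors.newFixedThreadPool(parallel);
        CyclicBarrier allBusy = new CyclicBarrier(parallel);
        CountDownLatch done = new CountDownLatch(parallel);
        for (int i = 0; i < parallel; i++) {
            pool.submit(() -> {
                try { requestSession(allBusy); } catch (Exception ignored) {}
                done.countDown();
            });
        }
        done.await();
        pool.shutdown();
        // Against V3 this pattern drove ~100 nodes at once; against V4.1.2 only ~12.
        System.out.println("peak concurrency: " + peak.get());
    }
}
```

With an idle grid of 100 nodes, the client side reaches a peak of 100 in-flight requests, which matches the queue count of 100 observed above.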
Relevant log output
Full logs of this event are very large, will try to attach in comments...
Operating System
Ubuntu 20.04
Selenium version
Selenium v4.1.2
What are the browser(s) and version(s) where you see this issue?
Chrome 96
What are the browser driver(s) and version(s) where you see this issue?
96.0.4664.45
Are you using Selenium Grid?
4.1.2
@csgbn, thank you for creating this issue. We will troubleshoot it as soon as we can.
Info for maintainers
Triage this issue by using labels.
If information is missing, add a helpful comment and then the I-issue-template label.
If the issue is a question, add the I-question label.
If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.
If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.
After troubleshooting the issue, please add the R-awaiting answer label.
Thank you!
Full logs of this event:
https://fhlsdhfkjshfsf6867f87dsfidsf7sd6fds.s3.us-west-1.amazonaws.com/distributor-gridlog.log
https://fhlsdhfkjshfsf6867f87dsfidsf7sd6fds.s3.us-west-1.amazonaws.com/event-bus-gridlog.log
https://fhlsdhfkjshfsf6867f87dsfidsf7sd6fds.s3.us-west-1.amazonaws.com/router-gridlog.log
https://fhlsdhfkjshfsf6867f87dsfidsf7sd6fds.s3.us-west-1.amazonaws.com/sessionqueue-gridlog.log
https://fhlsdhfkjshfsf6867f87dsfidsf7sd6fds.s3.us-west-1.amazonaws.com/sessions-gridlog.log
Can you share how you deploy the Grid, please?
Also, what do your tests look like? How long do they last, and which browsers are you using?
The grid is deployed distributed (hub and nodes on separate VMs):
(sudo -u ubuntu /usr/bin/java -jar /home/ubuntu/selenium-server-4.1.2.jar event-bus --log "/home/ubuntu/event-bus-gridlog.log" --log-level "FINE" --allow-cors true --tracing false --max-threads 6096 >/home/ubuntu/event-bus-stdout.log 2>&1 &) &&
(sudo -u ubuntu /usr/bin/java -jar /home/ubuntu/selenium-server-4.1.2.jar sessions --log "/home/ubuntu/sessions-gridlog.log" --log-level "FINE" --allow-cors true --tracing false --max-threads 6096 >/home/ubuntu/sessions-stdout.log 2>&1 &) &&
(sudo -u ubuntu /usr/bin/java -jar /home/ubuntu/selenium-server-4.1.2.jar sessionqueue --log "/home/ubuntu/sessionqueue-gridlog.log" --log-level "FINE" --bind-bus false --allow-cors true --tracing false --max-threads 6096 --session-request-timeout 600 >/home/ubuntu/sessionqueue-stdout.log 2>&1 &) &&
(sudo -u ubuntu /usr/bin/java -jar /home/ubuntu/selenium-server-4.1.2.jar distributor --log "/home/ubuntu/distributor-gridlog.log" --log-level "FINE" --allow-cors true --tracing false --max-threads 6096 --session-request-timeout 600 --healthcheck-interval 120 --sessions http://localhost:5556 --sessionqueue http://localhost:5559 --bind-bus false >/home/ubuntu/distributor-stdout.log 2>&1 &) &&
(sudo -u ubuntu /usr/bin/java -jar /home/ubuntu/selenium-server-4.1.2.jar router --log "/home/ubuntu/router-gridlog.log" --log-level "FINE" --allow-cors true --tracing false --max-threads 6096 --session-request-timeout 600 --healthcheck-interval 120 --sessions http://localhost:5556 --distributor http://localhost:5553 --sessionqueue http://localhost:5559 --port 44144 >/home/ubuntu/router-stdout.log 2>&1 &);
The logs are from 200 tests (each under 1 minute, simply opening a test site that has no issues processing 100 requests in parallel) executed with 100 in parallel. Chrome 96 on Ubuntu 20.04 (Linux). The hub VM has 8 CPUs/16 GB memory and was never observed to lack resources. The router runs on custom port 44144.
Note: in scenarios where the tests are longer, around 5-10 minutes, we have seen a slow but steady increase in concurrency, roughly 1-3 nodes every 5-10 minutes. In the V3 grid with the same test suite, concurrency reached 100% within seconds.
Correction: the hub VM for this test had 4 CPUs/8 GB, and the large --max-threads value of 6096 came from our V3 grid deployment (jettyMaxThreads). We have experimented with different --max-threads values, with no effect on concurrency.
Is there an equivalent Netty option for V3's jettyMaxThreads?
I was checking the code, and more precisely how the Distributor and the NewSessionQueue interact.
In the LocalDistributor, we have the NewSessionRunnable running in a single-threaded executor. It gets the available stereotypes from the Nodes.
Then it asks the NewSessionQueue for the next available request, passing all the available stereotypes.
The NewSessionQueue gets the request, filters all the queued requests based on the passed stereotypes, and returns the first match.
Therefore, the Distributor processes requests one by one, which is not ideal when there are many session requests in the queue. An improvement would be to have the getNextAvailable method return all the available requests to the Distributor. With that, the Distributor could process them in batches with the sessionCreatorExecutor.
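The proposed change can be sketched with plain java.util.concurrent. The names `getNextAvailable` and `sessionCreatorExecutor` mirror the code under discussion, but this is a simplified model of the idea, not the actual LocalDistributor implementation, and real requests carry capabilities/stereotypes rather than strings:

```java
import java.util.*;
import java.util.concurrent.*;

public class BatchDispatchSketch {
    // Current behavior: one matching request per polling pass.
    static Optional<String> getNext(Deque<String> queue) {
        return Optional.ofNullable(queue.pollFirst());
    }

    // Proposed behavior: return every request matching the free stereotypes,
    // so the distributor can fan them all out to the pool in a single pass.
    static List<String> getNextAvailable(Deque<String> queue) {
        List<String> batch = new ArrayList<>();
        for (String r; (r = queue.pollFirst()) != null; ) batch.add(r);
        return batch;
    }

    static int dispatchBatch(Deque<String> queue) throws Exception {
        ExecutorService sessionCreatorExecutor =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<String>> created = new ArrayList<>();
        for (String request : getNextAvailable(queue)) {
            created.add(sessionCreatorExecutor.submit(() -> "session-for-" + request));
        }
        for (Future<String> f : created) f.get(); // wait for all session creations
        sessionCreatorExecutor.shutdown();
        return created.size();
    }

    public static void main(String[] args) throws Exception {
        Deque<String> queue = new ConcurrentLinkedDeque<>();
        for (int i = 0; i < 100; i++) queue.add("request-" + i);
        System.out.println("sessions started in one pass: " + dispatchBatch(queue));
    }
}
```

The single-threaded NewSessionRunnable would still poll, but each pass would drain every matchable request instead of one, letting the pool create sessions concurrently.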
One more thing to notice is that the sessionCreatorExecutor uses newFixedThreadPool with the number of available processors. So if the Hub/Distributor is given only 1-2 CPUs, session creation speed will suffer.
Not sure if we should make this value configurable, using Runtime.getRuntime().availableProcessors() as the default?
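A configurable pool size with the processor count as the default could look like the sketch below. The config hook and its name are hypothetical, illustrating the suggestion rather than an actual Selenium flag:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DistributorPoolConfig {
    // Hypothetical config hook: use the configured value if present,
    // otherwise fall back to the CPU count, as the current code does.
    static int threadPoolSize(Integer configured) {
        return configured != null ? configured : Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        int size = threadPoolSize(null); // no override -> availableProcessors()
        ExecutorService sessionCreatorExecutor = Executors.newFixedThreadPool(size);
        System.out.println("sessionCreatorExecutor pool size: " + size);
        sessionCreatorExecutor.shutdown();
    }
}
```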
This is true. Session creation speed depends on the number of CPUs available. If the number of threads exceeds the available processors, it will not be as efficient: more threads cause more context switching, which is not optimal. We can probably call this out in the DistributorFlags class where we define the configurable value.
I'd say we can do this in a separate PR. Just to avoid too many changes at the same time. Well, unless you already have it for the upcoming release :)
I agree. That was the plan. To have a separate PR for the flag. The PR for the fix is in progress. Hopefully should be able to get it out soon.
We have fixed this in 4.5.3 but wanted to wait until 4.6 to double-check. Closing.
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.