selenium icon indicating copy to clipboard operation
selenium copied to clipboard

Low concurrency[🐛 Bug]:

Open csgbn opened this issue 4 years ago • 11 comments

What happened?

V4 grid node utilization is very low compared to V3. In V3 if you had 100 grid nodes ready and registered to the hub and then ran 100 tests in parallel, all 100 grid nodes was immediately put to work but in V4 you can do the same and see only about 12 nodes being utilized. 88 nodes never got tests routed to them. Note: that the queue in this scenario is 0.

Setting a higher "--max-threads" value has no effect and actually we cannot see this parameter being used in the selenium code at all?

How can we reproduce the issue?

We see this consistently with V4.1.2 using java openjdk 11 and latest TestNG. We ruled out TestNG as the issue and verified that when requesting 100 tests in parallel on an empty grid, the queue immediately went to 100.

The grid is started as distributed, hub and nodes on separate AWS VM's.

Relevant log output

Full logs of this event are very large, will try to attach in comments...

Operating System

Ubuntu 20.04

Selenium version

Selenium v4.1.2

What are the browser(s) and version(s) where you see this issue?

Chrome 96

What are the browser driver(s) and version(s) where you see this issue?

96.0.4664.45

Are you using Selenium Grid?

4.1.2

csgbn avatar Feb 05 '22 19:02 csgbn

@csgbn, thank you for creating this issue. We will troubleshoot it as soon as we can.


Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

github-actions[bot] avatar Feb 05 '22 19:02 github-actions[bot]

Full logs of this event:

https://fhlsdhfkjshfsf6867f87dsfidsf7sd6fds.s3.us-west-1.amazonaws.com/distributor-gridlog.log https://fhlsdhfkjshfsf6867f87dsfidsf7sd6fds.s3.us-west-1.amazonaws.com/event-bus-gridlog.log https://fhlsdhfkjshfsf6867f87dsfidsf7sd6fds.s3.us-west-1.amazonaws.com/router-gridlog.log https://fhlsdhfkjshfsf6867f87dsfidsf7sd6fds.s3.us-west-1.amazonaws.com/sessionqueue-gridlog.log https://fhlsdhfkjshfsf6867f87dsfidsf7sd6fds.s3.us-west-1.amazonaws.com/sessions-gridlog.log

csgbn avatar Feb 05 '22 19:02 csgbn

Can you share how you deploy the Grid, please?

diemol avatar Feb 06 '22 18:02 diemol

Also, how are your tests looking like? How long do they last and what browsers are you using?

diemol avatar Feb 07 '22 08:02 diemol

The grid is deployed distributed (hub and nodes on separate vm's):

(sudo -u ubuntu /usr/bin/java -jar /home/ubuntu/selenium-server-4.1.2.jar event-bus --log "/home/ubuntu/event-bus-gridlog.log" --log-level "FINE" --allow-cors true --tracing false --max-threads 6096 >/home/ubuntu/event-bus-stdout.log 2>&1 &) &&
(sudo -u ubuntu /usr/bin/java -jar /home/ubuntu/selenium-server-4.1.2.jar sessions --log "/home/ubuntu/sessions-gridlog.log" --log-level "FINE" --allow-cors true --tracing false --max-threads 6096 >/home/ubuntu/sessions-stdout.log 2>&1 &) &&
(sudo -u ubuntu /usr/bin/java -jar /home/ubuntu/selenium-server-4.1.2.jar sessionqueue --log "/home/ubuntu/sessionqueue-gridlog.log" --log-level "FINE" --bind-bus false --allow-cors true --tracing false --max-threads 6096 --session-request-timeout 600 >/home/ubuntu/sessionqueue-stdout.log 2>&1 &) &&
(sudo -u ubuntu /usr/bin/java -jar /home/ubuntu/selenium-server-4.1.2.jar distributor --log "/home/ubuntu/distributor-gridlog.log" --log-level "FINE" --allow-cors true --tracing false --max-threads 6096 --session-request-timeout 600 --healthcheck-interval 120 --sessions http://localhost:5556 --sessionqueue http://localhost:5559 --bind-bus false >/home/ubuntu/distributor-stdout.log 2>&1 &) &&	
(sudo -u ubuntu /usr/bin/java -jar /home/ubuntu/selenium-server-4.1.2.jar router --log "/home/ubuntu/router-gridlog.log" --log-level "FINE" --allow-cors true --tracing false --max-threads 6096 --session-request-timeout 600 --healthcheck-interval 120 --sessions http://localhost:5556 --distributor http://localhost:5553 --sessionqueue http://localhost:5559 --port 44144 >/home/ubuntu/router-stdout.log 2>&1 &);

The logs are from 200 tests (less than 1 minute per test and simply opens a test site that have no issues processing 100 requests in parallel ) was executed 100 parallel. Chrome 96 on Linux Ubuntu 20.04. Hub VM has 8 cpus/16GB memory and never observed to lack resources. The router runs on a custom port 44144.

Note: we have seen scenarios when the tests are longer like 5-10 minutes, and then there is a slow but steady increase in the concurrency, like 1-3 nodes every 5-10 minutes... In v3 grid with the same test suite the concurrency was 100% in seconds.

csgbn avatar Feb 07 '22 19:02 csgbn

Correction, the hub VM for this test was a 4CPU/8GB and the large max-threads 6096 came from our V3 grid deployment of (jettyMaxThreads). We have though experimented with different max-threads with no effect on concurrency.

Is there are an equivalent Netty option for v3 jettyMaxThreads?

csgbn avatar Feb 10 '22 17:02 csgbn

I was checking the code, and more precisely how the Distributor and the NewSessionQueue interact.

In the LocalDistributor we have the NewSessionRunnable running in a single threaded executor. It gets the available stereotypes from the Nodes.

Then, it asks the NewSessionQueue for the next available request, and passes all the available stereotypes.

The NewSessionQueue gets the request, and filters all the queued requests based on the passed stereotypes. Then it returns the first of those..

Therefore, the Distributor is processing the requests one by one, which is not ideal when you have many session requests in the queue. An improvement would be to have the getNextAvailable method to return all the available requests to the Distributor. With that, the Distributor could process them in batches with the sessionCreatorExecutor.

diemol avatar Aug 16 '22 11:08 diemol

One more thing to notice is that the sessionCreatorExecutor uses newFixedThreadPool with the amount of available processors. So if the Hub/Distributor is only given 1-2 CPUs, it will affect session creation speed.

Not sure if we should leave this value configurable and using Runtime.getRuntime().availableProcessors() as a default?

diemol avatar Aug 19 '22 10:08 diemol

One more thing to notice is that the sessionCreatorExecutor uses newFixedThreadPool with the amount of available processors. So if the Hub/Distributor is only given 1-2 CPUs, it will affect session creation speed.

Not sure if we should leave this value configurable and using Runtime.getRuntime().availableProcessors() as a default?

This is true. The session creation is dependent on the number of CPUs available. Ideally, if the number of threads is more than the available processors it will not be as efficient. The higher number of threads causes greater context switching which is not optimal. We can probably call this out in the DistributorFlags class where we define the configurable value.

pujagani avatar Aug 25 '22 04:08 pujagani

One more thing to notice is that the sessionCreatorExecutor uses newFixedThreadPool with the amount of available processors. So if the Hub/Distributor is only given 1-2 CPUs, it will affect session creation speed. Not sure if we should leave this value configurable and using Runtime.getRuntime().availableProcessors() as a default?

This is true. The session creation is dependent on the number of CPUs available. Ideally, if the number of threads is more than the available processors it will not be as efficient. The higher number of threads causes greater context switching which is not optimal. We can probably call this out in the DistributorFlags class where we define the configurable value.

I'd say we can do this in a separate PR. Just to avoid too many changes at the same time. Well, unless you already have it for the upcoming release :)

diemol avatar Aug 30 '22 08:08 diemol

I agree. That was the plan. To have a separate PR for the flag. The PR for the fix is in progress. Hopefully should be able to get it out soon.

pujagani avatar Aug 30 '22 09:08 pujagani

We have fixed this in 4.5.3 but wanted to wait until 4.6 to double check. Closing.

diemol avatar Nov 07 '22 11:11 diemol

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

github-actions[bot] avatar Dec 08 '22 01:12 github-actions[bot]