Windows Agent Thread Leak when using WinRM
Jenkins and plugins versions report
Environment
ec2-plugin 1856.vf40220e7a_75f
What Operating System are you using (both controller, and any agents involved in the problem)?
The issue applies to provisioning Windows agents via WinRM
Reproduction steps
I’m troubleshooting a controller that experiences slower performance and increasing memory usage over time.
Thread dump analysis showed thousands of “input copy” threads being held, all with the same stack:
TIMED_WAITING "input copy: java -jar C:\Windows\Temp\remoting.jar -workDir C:\Jenkins"
java.base/java.lang.Object.wait(Native Method)
hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:181)
hudson.plugins.ec2.win.winrm.WindowsProcess$2.run(WindowsProcess.java:131)
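To track how the leak grows over time without taking a full thread dump, the threads can be counted by name prefix. The following is a minimal sketch, assuming it runs inside the controller JVM; the class and method names (InputCopyThreadCounter, countThreadsWithPrefix) are made up for illustration and are not part of Jenkins or the plugin.

```java
// Hypothetical helper, not part of Jenkins or ec2-plugin.
class InputCopyThreadCounter {

    // Counts live threads in this JVM whose name starts with the given prefix.
    static long countThreadsWithPrefix(String prefix) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.getName().startsWith(prefix))
                .count();
    }

    public static void main(String[] args) {
        // "input copy" is the thread name prefix seen in the dump above.
        System.out.println("input copy threads: " + countThreadsWithPrefix("input copy"));
    }
}
```

A count that keeps climbing between agent launches and terminations would be consistent with the leak described here.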
Looking at the logs, I found two scenarios where WindowsProcess leaks resources:
setChannel failure
- Configure EC2 cloud with Windows template (WinRM connection) and trigger agent provisioning
- connection.execute() creates WindowsProcess with input/output threads
- setChannel() begins channel negotiation
- During negotiation, agent or network fails
- EOFException or IOException thrown while reading remoting protocol
- Exception thrown before onClosed listener registration completes
- No cleanup callback, so destroy() is never called
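One possible way to close this gap is to destroy the process whenever channel setup fails before the cleanup listener exists. The following is a minimal sketch of that idea only; LeakyProcess, ChannelSetup and LaunchGuardSketch are hypothetical stand-ins, not the plugin's or Jenkins' actual classes, and the real launcher code may need a different shape.

```java
import java.io.EOFException;
import java.io.IOException;

// Hypothetical stand-in for WindowsProcess; only tracks whether cleanup ran.
class LeakyProcess {
    private boolean destroyed = false;

    void destroy() { destroyed = true; } // the real method would close pipes and stop threads

    boolean isDestroyed() { return destroyed; }
}

// Hypothetical stand-in for setChannel(): negotiates, then registers the onClosed listener.
interface ChannelSetup {
    void negotiate(LeakyProcess process) throws IOException;
}

class LaunchGuardSketch {

    // Runs channel setup; if it does not complete, no listener exists yet,
    // so the process is destroyed here instead.
    static void launch(LeakyProcess process, ChannelSetup setup) throws IOException {
        boolean channelEstablished = false;
        try {
            setup.negotiate(process);
            channelEstablished = true;
        } finally {
            if (!channelEstablished) {
                process.destroy();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        LeakyProcess failed = new LeakyProcess();
        try {
            launch(failed, p -> { throw new EOFException("unexpected stream termination"); });
        } catch (IOException expected) {
            // the negotiation failure still propagates to the caller
        }
        System.out.println("destroyed after failed negotiation: " + failed.isDestroyed());

        LeakyProcess ok = new LeakyProcess();
        launch(ok, p -> { });
        System.out.println("destroyed after successful negotiation: " + ok.isDestroyed());
    }
}
```

The flag plus finally block covers any exception type thrown during negotiation, while leaving a successfully launched process to be cleaned up by its listener.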
Cleanup failure in destroy
- Configure retention strategy that terminates instances (idle timeout)
- Configure EC2 cloud with Windows template (WinRM connection) and trigger agent provisioning
- Windows agent launches successfully via WinRM and reaches online state, processes work
- Agent becomes idle for configured timeout (e.g., 5 minutes)
- EC2 retention strategy terminates instance
- Channel closes on controller side
- onClosed() listener is triggered and calls process.destroy()
- destroy() attempts client.signal() to terminate WinRM shell
- Instance already terminated: WinRM port 5985 unreachable
- client.signal() throws ConnectException: Connection refused
- Exception prevents rest of destroy() from executing:
  - Pipes never closed
  - Threads never interrupted
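The steps above suggest making local cleanup independent of the remote signal call. The following is a minimal sketch under that assumption: ShellSignaller and ResilientProcessSketch are hypothetical stand-ins (not the plugin's WinRMClient or WindowsProcess), and java.io piped streams are used here in place of the remoting pipes so the example runs on its own.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.net.ConnectException;

// Hypothetical stand-in for the WinRM client call; not the plugin's WinRMClient.
interface ShellSignaller {
    void signal() throws IOException;
}

// Hypothetical, simplified process wrapper; not the plugin's WindowsProcess.
class ResilientProcessSketch {
    private final ShellSignaller client;
    private final PipedOutputStream callersStdin = new PipedOutputStream();
    private final PipedInputStream toCallersStdin = connect(callersStdin);
    private final Thread inputThread;

    ResilientProcessSketch(ShellSignaller client) {
        this.client = client;
        this.inputThread = new Thread(this::copyInput, "input copy: sketch");
        this.inputThread.setDaemon(true);
        this.inputThread.start();
    }

    private static PipedInputStream connect(PipedOutputStream out) {
        try {
            return new PipedInputStream(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Blocks in read() until data arrives, the pipe closes, or the thread is interrupted.
    private void copyInput() {
        try {
            while (toCallersStdin.read() != -1) {
                // a real implementation would forward the byte to the remote shell
            }
        } catch (IOException e) {
            // pipe closed or thread interrupted: leave the copy loop
        }
    }

    // The remote signal is best effort; local cleanup runs regardless of its outcome.
    void destroy() {
        try {
            client.signal();
        } catch (IOException | RuntimeException e) {
            System.err.println("signal failed, continuing cleanup: " + e);
        } finally {
            inputThread.interrupt();
            closeQuietly(callersStdin);
            closeQuietly(toCallersStdin);
        }
    }

    private static void closeQuietly(Closeable c) {
        try {
            c.close();
        } catch (IOException ignored) {
            // nothing further can be done for this stream
        }
    }

    // Waits up to timeoutMillis for the copy thread to finish and reports whether it did.
    boolean inputThreadExited(long timeoutMillis) {
        try {
            inputThread.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !inputThread.isAlive();
    }

    public static void main(String[] args) {
        ResilientProcessSketch p = new ResilientProcessSketch(() -> {
            throw new ConnectException("Connection refused");
        });
        p.destroy();
        System.out.println("input thread exited: " + p.inputThreadExited(2000));
    }
}
```

Catching RuntimeException as well as IOException matters here because the stack trace in this report shows signal() failing with an unchecked RuntimeIOException.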
Expected Results
WindowsProcess is always destroyed and properly cleans up all of its resources.
Actual Results
In both scenarios, the WindowsProcess#inputThread remains blocked at toCallersStdin.read().
setChannel failure
INFO: Connection allowed after the host key has been verified
ERROR: unexpected stream termination
java.io.EOFException: unexpected stream termination
at hudson.remoting.ChannelBuilder.negotiate(ChannelBuilder.java:478)
at hudson.remoting.ChannelBuilder.build(ChannelBuilder.java:422)
at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:440)
at PluginClassLoader for ec2//hudson.plugins.ec2.ssh.EC2UnixLauncher.launchRemotingAgent(EC2UnixLauncher.java:456)
at PluginClassLoader for ec2//hudson.plugins.ec2.ssh.EC2UnixLauncher.launchScript(EC2UnixLauncher.java:405)
at PluginClassLoader for ec2//hudson.plugins.ec2.EC2ComputerLauncher.launch(EC2ComputerLauncher.java:55)
at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:297)
at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
Cleanup failure in destroy
SEVERE hudson.remoting.Channel#terminate: Listener hudson.plugins.ec2.win.EC2WindowsLauncher$1@6acc2cda propagated an exception for channel hudson.remoting.Channel@7d94a3db:EC2 (aws_ec2_cloud_identity) - identity-pythontestbox01 (i-0c3e42d7494b3bcdd)s close: {2}
java.io.IOException: Attempted read from closed stream.
at PluginClassLoader for apache-httpcomponents-client-4-api//org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:165)
at PluginClassLoader for apache-httpcomponents-client-4-api//org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135)
at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:287)
at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:330)
at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:190)
at java.base/java.io.InputStreamReader.read(InputStreamReader.java:177)
at java.base/java.io.Reader.read(Reader.java:250)
at PluginClassLoader for apache-httpcomponents-client-4-api//org.apache.http.util.EntityUtils.toString(EntityUtils.java:227)
at PluginClassLoader for apache-httpcomponents-client-4-api//org.apache.http.util.EntityUtils.toString(EntityUtils.java:308)
at PluginClassLoader for ec2//hudson.plugins.ec2.win.winrm.WinRMClient.sendRequest(WinRMClient.java:327)
Caused: hudson.plugins.ec2.win.winrm.RuntimeIOException: I/O Exception Attempted read from closed stream.
at PluginClassLoader for ec2//hudson.plugins.ec2.win.winrm.WinRMClient.sendRequest(WinRMClient.java:342)
at PluginClassLoader for ec2//hudson.plugins.ec2.win.winrm.WinRMClient.sendRequest(WinRMClient.java:251)
at PluginClassLoader for ec2//hudson.plugins.ec2.win.winrm.WinRMClient.signal(WinRMClient.java:121)
at PluginClassLoader for ec2//hudson.plugins.ec2.win.winrm.WindowsProcess.destroy(WindowsProcess.java:89)
at PluginClassLoader for ec2//hudson.plugins.ec2.win.EC2WindowsLauncher$1.onClosed(EC2WindowsLauncher.java:106)
at hudson.remoting.Channel.terminate(Channel.java:1219)
at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1438)
at hudson.remoting.Channel$1.handle(Channel.java:664)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:86)
Anything else?
Workarounds
- Use Windows SSH launcher (Eliminates Issue): Completely bypasses WinRM and WindowsProcess, eliminating this leak
- Reduce launch timeout (Mitigate): The plugin defaults launchTimeout to ~24.8 days, which at a 10-second retry interval allows ~214,000 retry attempts per agent, increasing leak accumulation
- Adjust retention strategy (Mitigate): Increase idle timeout to reduce frequency of agent terminations, reducing cleanup failures
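The retry figure in the launch timeout workaround follows from dividing the timeout by the retry interval. The following is a small sketch of that arithmetic, using only the ~24.8 days and 10 seconds quoted in this report; RetryBudgetSketch and maxAttempts are illustrative names, and the 1-hour comparison value is an arbitrary example, not a recommended setting.

```java
import java.time.Duration;

// Hypothetical helper for the arithmetic only; not plugin code.
class RetryBudgetSketch {

    // Upper bound on retry attempts that fit into the launch timeout.
    static long maxAttempts(Duration launchTimeout, Duration retryInterval) {
        return launchTimeout.toMillis() / retryInterval.toMillis();
    }

    public static void main(String[] args) {
        Duration reportedDefault = Duration.ofMinutes(35_712); // 24.8 days * 1440 min/day
        Duration retryInterval = Duration.ofSeconds(10);

        System.out.println(maxAttempts(reportedDefault, retryInterval));     // prints 214272
        System.out.println(maxAttempts(Duration.ofHours(1), retryInterval)); // prints 360
    }
}
```

Each attempt that fails during channel negotiation can leave a process behind, so lowering the timeout bounds the number of leaked threads a single agent can produce.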
Refs
- https://github.com/jenkinsci/ec2-plugin/pull/383
- CloudBees internal reference
Are you interested in contributing a fix?
Yes