Windows Agent Thread Leak when using WinRM

Open • apuig opened this issue 4 months ago • 0 comments

Jenkins and plugins versions report

Environment
ec2-plugin 1856.vf40220e7a_75f

What Operating System are you using (both controller, and any agents involved in the problem)?

The issue applies to provisioning Windows agents via WinRM

Reproduction steps

I’m troubleshooting a controller that experiences slower performance and increasing memory usage over time.

A thread dump analysis showed that thousands of “input copy” threads are being held:

TIMED_WAITING  "input copy: java -jar C:\Windows\Temp\remoting.jar -workDir C:\Jenkins"
java.base/java.lang.Object.wait(Native Method)
hudson.remoting.FastPipedInputStream.read(FastPipedInputStream.java:181)
hudson.plugins.ec2.win.winrm.WindowsProcess$2.run(WindowsProcess.java:131)
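
To track how this number grows over time, the matching threads can be counted by name. The following is a minimal sketch that uses only the standard Thread.getAllStackTraces() API; the class and method names are hypothetical and not part of the plugin.

```java
final class InputCopyThreadCount {
    /** Counts live threads in this JVM whose name starts with the given prefix. */
    static long countThreadsWithPrefix(String prefix) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.getName().startsWith(prefix))
                .count();
    }

    public static void main(String[] args) {
        // "input copy" is the thread name prefix seen in the dump above.
        System.out.println("input copy threads: " + countThreadsWithPrefix("input copy"));
    }
}
```

A count that only increases across agent launches and terminations would be consistent with the leak described here.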

Looking at the logs, I found two scenarios where WindowsProcess leaks resources:

setChannel failure

  1. Configure EC2 cloud with Windows template (WinRM connection) and trigger agent provisioning
  2. connection.execute() creates WindowsProcess with input/output threads
  3. setChannel() begins channel negotiation
  4. During negotiation, agent or network fails
  5. EOFException or IOException thrown while reading remoting protocol
  6. Exception thrown before onClosed listener registration completes
  7. No cleanup callback, so destroy() is never called
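
One possible way to close this gap is to destroy the process whenever channel setup does not complete. The sketch below only illustrates that pattern: FakeProcess, ChannelSetup and launch() are hypothetical stand-ins, not the plugin's actual classes, and where such a guard belongs in the real launcher is an assumption.

```java
import java.io.IOException;

final class LaunchCleanupSketch {
    /** Hypothetical stand-in for WindowsProcess; it only records whether destroy() ran. */
    static final class FakeProcess {
        boolean destroyed;

        void destroy() {
            destroyed = true;
        }
    }

    /** Hypothetical stand-in for the setChannel() step, which may fail during negotiation. */
    interface ChannelSetup {
        void run(FakeProcess process) throws IOException;
    }

    /** Runs channel setup and destroys the process if setup does not complete. */
    static void launch(FakeProcess process, ChannelSetup setChannel) throws IOException {
        boolean connected = false;
        try {
            setChannel.run(process);
            connected = true; // from here on, an onClosed listener would own the cleanup
        } finally {
            if (!connected) {
                process.destroy(); // no listener was registered, so clean up here
            }
        }
    }
}
```

The finally block covers checked and unchecked failures alike, so the process is released regardless of which exception interrupts the negotiation.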

Cleanup failure in destroy

  1. Configure retention strategy that terminates instances (idle timeout)
  2. Configure EC2 cloud with Windows template (WinRM connection) and trigger agent provisioning
  3. Windows agent launches successfully via WinRM and reaches online state, processes work
  4. Agent becomes idle for configured timeout (e.g., 5 minutes)
  5. EC2 retention strategy terminates instance
  6. Channel closes on controller side
  7. onClosed() listener is triggered and calls process.destroy()
  8. destroy() attempts client.signal() to terminate WinRM shell
  9. Instance is already terminated, so WinRM port 5985 is unreachable
  10. client.signal() throws ConnectException: Connection refused
  11. Exception prevents rest of destroy() from executing:
    • Pipes never closed
    • Threads never interrupted
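
A defensive destroy() could treat the remote signal as best effort and always release local resources afterwards. The sketch below shows that idea under stated assumptions: DestroySketch and RemoteSignal are hypothetical, and java.io piped streams stand in for the FastPipedInputStream seen in the thread dump.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

final class DestroySketch {
    /** Hypothetical stand-in for client.signal(), which fails once the instance is gone. */
    interface RemoteSignal {
        void send() throws IOException;
    }

    private final RemoteSignal signal;
    private final PipedOutputStream stdinWriter = new PipedOutputStream();
    private final PipedInputStream stdinReader;
    final Thread inputThread;

    DestroySketch(RemoteSignal signal) throws IOException {
        this.signal = signal;
        PipedInputStream reader = new PipedInputStream(stdinWriter);
        this.stdinReader = reader;
        // Mimics the "input copy" thread: blocks in read() until the pipe is closed.
        this.inputThread = new Thread(() -> {
            try {
                while (reader.read() != -1) {
                    // a real implementation would forward the bytes to the remote shell
                }
            } catch (IOException e) {
                // pipe closed or thread interrupted: fall through and exit
            }
        }, "input copy: sketch");
        inputThread.setDaemon(true);
        inputThread.start();
    }

    void destroy() {
        try {
            signal.send(); // best effort: the remote side may already be unreachable
        } catch (IOException | RuntimeException e) {
            // a real implementation would log this; cleanup must still happen
        } finally {
            closeQuietly(stdinWriter);
            closeQuietly(stdinReader);
            inputThread.interrupt();
        }
    }

    private static void closeQuietly(Closeable c) {
        try {
            c.close();
        } catch (IOException ignored) {
            // nothing useful can be done if closing fails
        }
    }
}
```

With this shape, a refused connection during the signal step no longer prevents the pipes from being closed or the thread from being interrupted.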

Expected Results

WindowsProcess is always destroyed and properly cleans up all of its resources.

Actual Results

In both scenarios, WindowsProcess#inputThread remains blocked at toCallersStdin.read().

setChannel failure

 INFO: Connection allowed after the host key has been verified
 ERROR: unexpected stream termination
 java.io.EOFException: unexpected stream termination
 	at hudson.remoting.ChannelBuilder.negotiate(ChannelBuilder.java:478)
 	at hudson.remoting.ChannelBuilder.build(ChannelBuilder.java:422)
 	at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:440)
 	at PluginClassLoader for ec2//hudson.plugins.ec2.ssh.EC2UnixLauncher.launchRemotingAgent(EC2UnixLauncher.java:456)
 	at PluginClassLoader for ec2//hudson.plugins.ec2.ssh.EC2UnixLauncher.launchScript(EC2UnixLauncher.java:405)
 	at PluginClassLoader for ec2//hudson.plugins.ec2.EC2ComputerLauncher.launch(EC2ComputerLauncher.java:55)
 	at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:297)
 	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
 	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
 	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
 	at java.base/java.lang.Thread.run(Thread.java:840)

Cleanup failure in destroy

SEVERE	hudson.remoting.Channel#terminate: Listener hudson.plugins.ec2.win.EC2WindowsLauncher$1@6acc2cda propagated an exception for channel hudson.remoting.Channel@7d94a3db:EC2 (aws_ec2_cloud_identity) - identity-pythontestbox01 (i-0c3e42d7494b3bcdd)s close: {2}
java.io.IOException: Attempted read from closed stream.
	at PluginClassLoader for apache-httpcomponents-client-4-api//org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:165)
	at PluginClassLoader for apache-httpcomponents-client-4-api//org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135)
	at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:287)
	at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:330)
	at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:190)
	at java.base/java.io.InputStreamReader.read(InputStreamReader.java:177)
	at java.base/java.io.Reader.read(Reader.java:250)
	at PluginClassLoader for apache-httpcomponents-client-4-api//org.apache.http.util.EntityUtils.toString(EntityUtils.java:227)
	at PluginClassLoader for apache-httpcomponents-client-4-api//org.apache.http.util.EntityUtils.toString(EntityUtils.java:308)
	at PluginClassLoader for ec2//hudson.plugins.ec2.win.winrm.WinRMClient.sendRequest(WinRMClient.java:327)
Caused: hudson.plugins.ec2.win.winrm.RuntimeIOException: I/O Exception Attempted read from closed stream.
	at PluginClassLoader for ec2//hudson.plugins.ec2.win.winrm.WinRMClient.sendRequest(WinRMClient.java:342)
	at PluginClassLoader for ec2//hudson.plugins.ec2.win.winrm.WinRMClient.sendRequest(WinRMClient.java:251)
	at PluginClassLoader for ec2//hudson.plugins.ec2.win.winrm.WinRMClient.signal(WinRMClient.java:121)
	at PluginClassLoader for ec2//hudson.plugins.ec2.win.winrm.WindowsProcess.destroy(WindowsProcess.java:89)
	at PluginClassLoader for ec2//hudson.plugins.ec2.win.EC2WindowsLauncher$1.onClosed(EC2WindowsLauncher.java:106)
	at hudson.remoting.Channel.terminate(Channel.java:1219)
	at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1438)
	at hudson.remoting.Channel$1.handle(Channel.java:664)
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:86)

Anything else?

Workarounds

  • Use Windows SSH launcher (eliminates the issue): bypasses WinRM and WindowsProcess entirely, so this leak cannot occur

  • Reduce launch timeout (mitigates): the plugin defaults launchTimeout to ~24.8 days, which with a 10-second retry interval allows ~214,000 launch attempts per agent, so leaked threads accumulate

  • Adjust retention strategy (mitigates): increase the idle timeout so agents are terminated less often, which reduces how often cleanup fails
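
The figures in the launch timeout workaround can be reproduced with a short calculation. This sketch assumes the default timeout equals Integer.MAX_VALUE milliseconds, which matches the ~24.8 days quoted above but is not confirmed here; the class and constant names are hypothetical.

```java
final class LaunchTimeoutMath {
    // Assumption: the default launch timeout is Integer.MAX_VALUE milliseconds.
    static final long TIMEOUT_MS = Integer.MAX_VALUE;
    // Retry interval of 10 seconds, as stated in the workaround above.
    static final long RETRY_INTERVAL_MS = 10_000L;

    /** Timeout expressed in days. */
    static double timeoutDays() {
        return TIMEOUT_MS / (1000.0 * 60 * 60 * 24);
    }

    /** Upper bound on launch attempts that fit in one timeout window. */
    static long maxAttempts() {
        return TIMEOUT_MS / RETRY_INTERVAL_MS;
    }

    public static void main(String[] args) {
        System.out.printf("timeout: %.2f days, max attempts: %d%n", timeoutDays(), maxAttempts());
    }
}
```

Under that assumption the bound works out to 214,748 attempts, in line with the ~214,000 figure.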

Refs

Are you interested in contributing a fix?

Yes

apuig, Dec 10 '25 15:12