Getting OSError: Socket is closed when the agent tries to use code execution
OS: Ubuntu
Docker version: Docker version 28.1.1, build 4eba377
Image build: sha256:f1ce5b265fc3f196a5fd0a2477a7d88997b6d1ec1bf46c50cbe1964933472ca8
Error:

```
Traceback (most recent call last):
  File "/a0/agent.py", line 332, in monologue
    tools_result = await self.process_tools(agent_response)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/a0/agent.py", line 683, in process_tools
    response = await tool.execute(**tool_args)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/a0/python/tools/code_execution_tool.py", line 43, in execute
    response = await self.execute_terminal_command(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/a0/python/tools/code_execution_tool.py", line 141, in execute_terminal_command
    return await self.terminal_session(session, command, reset)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/a0/python/tools/code_execution_tool.py", line 186, in terminal_session
    raise e
  File "/a0/python/tools/code_execution_tool.py", line 172, in terminal_session
    self.state.shells[session].send_command(command)
  File "/a0/python/helpers/shell_ssh.py", line 82, in send_command
    self.shell.send(self.last_command)
  File "/opt/venv/lib/python3.11/site-packages/paramiko/channel.py", line 799, in send
    return self._send(s, m)
           ^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/paramiko/channel.py", line 1196, in _send
    raise socket.error("Socket is closed")
OSError: Socket is closed
```
My dockerd daemon is started with `dockerd --data-root ~/somedockerdir/`, which doesn't follow the standard systemd socket setup. I'm still not sure whether this caused the issue, but I just installed a0 on another VPS that uses systemd, and it works there.
I can confirm this, using gpt-5 and gpt-4o as the chat model.
A workaround that sometimes helps is to ask Agent0ai to always use a fresh terminal and to create a keep-alive wrapper when using the terminal/code execution tool (a rough sketch of what such a keep-alive could hook into follows).
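For context on the keep-alive idea: paramiko exposes a transport-level keepalive, so a wrapper could enable it right after connecting. This is only a sketch under assumptions; the host, port, and credentials below are placeholders, and it does not show how a0's shell_ssh.py actually manages its client and channel.

```python
import paramiko

# Placeholder connection details; a0's shell_ssh.py uses its own settings.
HOST, PORT, USER, PASSWORD = "localhost", 22, "root", "change-me"

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(HOST, port=PORT, username=USER, password=PASSWORD)

# Ask the SSH transport to send a keepalive packet every 30 seconds so an idle
# session is not silently dropped by NATs, firewalls, or the peer.
transport = client.get_transport()
if transport is not None:
    transport.set_keepalive(30)

# A shell channel opened on this transport now benefits from the keepalives;
# commands are sent with channel.send() as usual.
channel = client.invoke_shell()
channel.send(b"echo keepalive-enabled\n")
```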
The workaround doesn't work consistently. I fell back to older versions, but version M v0.9.1.1 (25-07-09 08:41) also started giving this error.
I'd like to stress that this is not a once-in-a-while thing; it has been more or less constant for about 7 days now (on the :latest version).
Please find the root cause and fix ASAP :)
Meanwhile, while waiting for a fix, you could try asking your agent0ai instance to follow this:
"Here’s a battle-tested playbook to keep sockets open and stable in automated tests, from TCP up through app tooling. Think of each layer as a character in the ensemble: client, protocol, and infrastructure must all agree to “stay in the scene,” or someone will exit early with a socket closed.
Core principles
- Reuse connections, don't recreate them: pool and keep sessions alive per host.
- Match timeouts across layers: client < proxy/LB < server idle, so the server/proxy closes first only if truly idle.
- Send gentle heartbeats for long idles: protocol-native pings (WS/gRPC) or minimal HEAD/GET to prevent NAT/LB timeouts.
- Backoff and retry on stale/closed: treat "socket closed" as a first-class transient; retry idempotent ops with jitter (see the retry sketch below).
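To make the backoff-and-retry principle concrete, here is a minimal Python sketch; the exception type, attempt count, and delays are illustrative assumptions rather than part of the quoted playbook.

```python
import random
import time

def retry_transient(fn, attempts=3, base_delay=0.3):
    """Retry an idempotent callable on transient socket errors,
    with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except OSError:  # covers ConnectionResetError, "Socket is closed", etc.
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)

# Usage (idempotent calls only):
# retry_transient(lambda: session.get(url, timeout=(5, 30)))
```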
Client-side practices (HTTP/HTTPS)
- Python (requests): use one Session per target host, mounted with an HTTPAdapter and tuned pools. Keep-alive is built in when reusing the Session; pool with pool_connections=20, pool_maxsize=100; timeouts (connect=5, read=30); retries via urllib3 Retry(total=3, backoff_factor=0.3, status_forcelist=[502, 503, 504]). A shared-session sketch for test suites follows this list.
- Java (OkHttp): a single OkHttpClient shared across tests, with ConnectionPool(maxIdleConnections=100, keepAliveDuration=5, TimeUnit.MINUTES) and retryOnConnectionFailure(true).
- Node.js (http/https/got/axios): use a keep-alive agent, e.g. new http.Agent({ keepAlive: true, maxSockets: 100, maxFreeSockets: 20, timeout: 60000 }); share the agent instance and set request timeouts and retries for idempotent calls.
- CLI (curl): reuses connections by default when possible. You can explicitly enable TCP keepalive if your curl supports it with --keepalive-time; disable only with --no-keepalive. Use --http1.1 or --http2 based on server behavior; measure.
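As one way to realize "a single shared client across tests" in Python, here is a sketch assuming pytest as the test runner and requests as the client; the fixture name and example URL are invented for illustration.

```python
import pytest
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

@pytest.fixture(scope="session")
def http():
    """One pooled, keep-alive Session shared by every test in the run."""
    retry = Retry(total=3, backoff_factor=0.3, status_forcelist=[502, 503, 504])
    adapter = HTTPAdapter(pool_connections=20, pool_maxsize=100, max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    yield session
    session.close()

def test_health(http):
    # Example endpoint; reusing the same Session keeps connections warm.
    resp = http.get("https://example.com/", timeout=(5, 30))
    assert resp.status_code == 200
```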
Protocol choices and hints
- HTTP/1.1 keep-alive: ensure Connection: keep-alive and reuse the same host+scheme+port.
- HTTP/2 multiplexing: often more stable under many requests; if intermediaries are flaky on H2, fall back to HTTP/1.1 in tests.
- WebSockets: send ping/pong at a safe cadence (e.g., 15–30 s) below intermediary idle timeouts; handle close frames and reconnect (see the sketch below).
- gRPC: enable keepalive pings on both client and server; align with LB/NAT policies to avoid being mistaken for abuse.
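For the WebSocket ping/pong point, a minimal sketch using the third-party websockets library; the URL is a placeholder and the 20 s cadence is an assumed value you would keep below your intermediaries' idle timeouts.

```python
import asyncio
import websockets

async def run():
    # ping_interval/ping_timeout make the client send protocol-level pings,
    # which keeps NATs and load balancers from treating the socket as idle.
    async with websockets.connect(
        "wss://example.com/stream", ping_interval=20, ping_timeout=20
    ) as ws:
        await ws.send("hello")
        print(await ws.recv())

asyncio.run(run())
```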
Infrastructure and gateway tuning (the usual suspects)
- Load balancer/proxy (NGINX/Envoy/HAProxy/ALB/Cloudflare, etc.): increase idle timeouts above your longest expected quiet period. Limit requests per connection if needed, but set it high enough (e.g., keepalive_requests) to benefit from pooling. Ensure upstream keepalive is also enabled between the proxy and the app servers. Example NGINX: keepalive_timeout 75s; keepalive_requests 1000; upstream backend { server app:8080; keepalive 64; }
- NAT/firewall timeouts: NATs often drop idle TCP in a few minutes. Either send application pings more frequently than the idle timeout, or increase the NAT idle timeout where allowed (a heartbeat sketch follows this list).
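One way to implement "application pings more frequently than the idle timeout" is a small background heartbeat on the shared session; a sketch, assuming requests and a hypothetical /health endpoint.

```python
import threading
import requests

def start_heartbeat(session: requests.Session, url: str, interval_s: float = 60.0):
    """Periodically issue a cheap HEAD request on the shared session so the
    pooled connection never sits idle long enough to be dropped."""
    stop = threading.Event()

    def beat():
        while not stop.wait(interval_s):
            try:
                session.head(url, timeout=(5, 10))
            except requests.RequestException:
                pass  # heartbeat failures are non-fatal; real traffic will retry

    threading.Thread(target=beat, daemon=True).start()
    return stop  # call stop.set() to end the heartbeat

# Usage: stop = start_heartbeat(session, "https://example.com/health", 60)
```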
Host TCP keepalive (Linux)
- Enable and tune conservatively in long-lived tests (system-wide or per-socket).
- Check: sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
- Example (ephemeral testing values): sysctl -w net.ipv4.tcp_keepalive_time=60; sysctl -w net.ipv4.tcp_keepalive_intvl=15; sysctl -w net.ipv4.tcp_keepalive_probes=4
- Note: TCP keepalive is a liveness probe for dead peers, not an application ping; it is still helpful to prevent silent blackholes. A per-socket Python example follows this list.
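The same keepalive knobs can also be set per socket from Python; a sketch assuming Linux (TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT are Linux-specific constants) with a placeholder endpoint.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Turn on TCP keepalive for this socket only.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Linux-specific tuning: first probe after 60 s idle, then every 15 s,
# and give up after 4 unanswered probes.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 15)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 4)

sock.connect(("example.com", 443))  # placeholder endpoint
```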
Automation tool specifics
- JMeter: enable "Use KeepAlive" on the HTTP Request sampler; add an HTTP Cache Manager; use a single HTTP client implementation; align "Idle Connection Cleanup" with the test length.
- Gatling: connection reuse is the default; size the connection pool and the warm-up; prefer HTTP/2 where stable.
- k6: reuses connections by default; tune batch/concurrency to keep pools hot; use thresholds for the reuse rate.
- Locust: shared HttpUser client sessions; keep-alive is on by default via requests; set connection pool sizes with custom adapters if needed (sketch below).
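For the Locust item, a sketch of enlarging the client's pool with a custom adapter; the host and endpoint are placeholders, and mounting an HTTPAdapter on self.client relies on Locust's HttpSession being a requests.Session subclass.

```python
from locust import HttpUser, task, between
from requests.adapters import HTTPAdapter

class ApiUser(HttpUser):
    host = "https://example.com"  # placeholder target
    wait_time = between(1, 3)

    def on_start(self):
        # HttpSession subclasses requests.Session, so we can mount a larger
        # pool to keep many concurrent keep-alive connections warm.
        adapter = HTTPAdapter(pool_connections=20, pool_maxsize=100)
        self.client.mount("https://", adapter)
        self.client.mount("http://", adapter)

    @task
    def get_root(self):
        self.client.get("/", timeout=(5, 30))
```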
Resilience patterns for tests
- Connection warm-up: fire a pre-flight batch to establish pools and TLS sessions.
- Idempotent retry: on ECONNRESET/"socket closed", retry with exponential backoff and jitter.
- Circuit breaking: short-circuit repeated failures to avoid storming the infrastructure (a minimal breaker sketch follows this list).
- Telemetry: log connection reuse rate, average requests per connection, error classes (RST vs FIN), and idle durations.
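A minimal circuit-breaker sketch for a test harness; the failure threshold and cool-down values are illustrative assumptions.

```python
import time

class CircuitBreaker:
    """Open the circuit after a burst of failures and reject calls until a
    cool-down has elapsed, so a flaky dependency is not hammered."""

    def __init__(self, max_failures=5, reset_after_s=30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open; skipping call")
            # Cool-down elapsed: half-open, allow one trial call.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn()
        except OSError:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

# Usage: breaker.call(lambda: session.get(url, timeout=(5, 30)))
```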
Quick Linux-first diagnostics when sockets close
- Confirm server intent: curl -vI https://host/path and inspect the Connection and Keep-Alive headers.
- Trace a failing call: curl -vL URL 2>&1 | tee /root/diag_curl.txt
- Check local sockets: ss -tanp | grep ESTAB; lsof -iTCP -sTCP:ESTABLISHED
- Packet level: tcpdump -i any -w /root/diag.pcap host and tcp port 443 (then review in Wireshark).
- TLS/session resumption: test with openssl s_client -servername host -connect host:443 -brief
Common anti-patterns (avoid these)
- Creating a new client per request: destroys pooling and exhausts ephemeral ports.
- Mismatched timeouts: e.g., client read timeout > LB idle timeout → mid-stream closes.
- Silent long idles: no heartbeats behind aggressive NATs/LBs.
- Infinite retries on non-idempotent requests.
- Disabling keepalive globally to "fix" flaky tests; it hides real issues and increases load.
Minimal code recipes

Python (requests) pool:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
adapter = HTTPAdapter(pool_connections=20, pool_maxsize=100,
                      max_retries=Retry(total=3, backoff_factor=0.3, status_forcelist=[502, 503, 504]))
session.mount('https://', adapter)
session.mount('http://', adapter)
resp = session.get(url, timeout=(5, 30))
```

Node.js keep-alive agent:

```js
const agent = new https.Agent({ keepAlive: true, maxSockets: 100, maxFreeSockets: 20, timeout: 60000 });
axios.create({ httpsAgent: agent, timeout: 30000 });
```

OkHttp (Java):

```java
OkHttpClient client = new OkHttpClient.Builder()
    .connectionPool(new ConnectionPool(100, 5, TimeUnit.MINUTES))
    .retryOnConnectionFailure(true)
    .build();
```
Governance and risk notes
- Keepalive pings increase background traffic; coordinate with NetSec and comply with WAF/LB policies.
- For load/perf runs, size pools to avoid connection churn, but cap them to protect servers.
- Document the chosen timeouts and keepalive cadences in your test strategy for traceability."
Ask it to include this as a behaviour change.