
Connection Error Mid Training

Open Jgmedina95 opened this issue 2 months ago • 0 comments

Training was going fine for at least 90 steps, and then this error appeared. Any idea how I can prevent a complete failure when there are connection issues? This run was on an 8xH100 instance from PrimeIntellect.

Generating rollouts (train): 75%|███████▌ | 96/128 [01:31<00:12, 2.63it/s]
2025-12-02 03:58:17 - verifiers.envs.CrystalRelaxationMultiTurnEnv - ERROR - Error getting model response: Connection error.
2025-12-01 22:58:17
2025-12-01 22:58:17 Exiting...
2025-12-01 22:58:18 2025-12-02 03:58:17.856 | ERROR | asyncio.events:_run:88 - An error has been caught in function '_run', process 'MainProcess' (11969), thread 'MainThread' (139832729312128):
2025-12-01 22:58:18 Traceback (most recent call last):
2025-12-01 22:58:18

2025-12-01 22:58:18 File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
2025-12-01 22:58:18 yield
2025-12-01 22:58:18 File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 394, in handle_async_request
2025-12-01 22:58:18 resp = await self._pool.handle_async_request(req)
2025-12-01 22:58:18 │ │ │ └ <Request [b'POST']>
2025-12-01 22:58:18 │ │ └ <function AsyncConnectionPool.handle_async_request at 0x7f2b63f58540>
2025-12-01 22:58:18 │ └ <AsyncConnectionPool [Requests: 0 active, 0 queued | Connections: 0 active, 0 idle]>
2025-12-01 22:58:18 └ <httpx.AsyncHTTPTransport object at 0x7f2b5685e3f0>
2025-12-01 22:58:18 File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_async/connection_pool.py", line 256, in handle_async_request
2025-12-01 22:58:18 raise exc from None
2025-12-01 22:58:18 File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_async/connection_pool.py", line 236, in handle_async_request
2025-12-01 22:58:18 response = await connection.handle_async_request(
2025-12-01 22:58:18 │ └ <function AsyncHTTPConnection.handle_async_request at 0x7f2b63f4b740>
2025-12-01 22:58:18 └ <AsyncHTTPConnection [CONNECTION FAILED]>
2025-12-01 22:58:18 File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_async/connection.py", line 101, in handle_async_request
2025-12-01 22:58:18 raise exc
2025-12-01 22:58:18 File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_async/connection.py", line 78, in handle_async_request
2025-12-01 22:58:18 stream = await self._connect(request)
2025-12-01 22:58:18 │ │ └ <Request [b'POST']>
2025-12-01 22:58:18 │ └ <function AsyncHTTPConnection._connect at 0x7f2b63f4b7e0>
2025-12-01 22:58:18 └ <AsyncHTTPConnection [CONNECTION FAILED]>
2025-12-01 22:58:18 File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_async/connection.py", line 124, in _connect
2025-12-01 22:58:18 stream = await self._network_backend.connect_tcp(**kwargs)
2025-12-01 22:58:18 │ │ │ └ {'host': 'localhost', 'port': 8000, 'local_address': None, 'timeout': 1200, 'socket_options': None}
2025-12-01 22:58:18 │ │ └ <function AutoBackend.connect_tcp at 0x7f2b63f498a0>
2025-12-01 22:58:18 │ └ <httpcore._backends.auto.AutoBackend object at 0x7f2b5685e2d0>
2025-12-01 22:58:18 └ <AsyncHTTPConnection [CONNECTION FAILED]>
2025-12-01 22:58:18 File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_backends/auto.py", line 31, in connect_tcp
2025-12-01 22:58:18 return await self._backend.connect_tcp(
2025-12-01 22:58:18 │ │ └ <function AnyIOBackend.connect_tcp at 0x7f2b63f5b2e0>
2025-12-01 22:58:18 │ └ <httpcore.AnyIOBackend object at 0x7f2a87ad3110>
2025-12-01 22:58:18 └ <httpcore._backends.auto.AutoBackend object at 0x7f2b5685e2d0>
2025-12-01 22:58:18 File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_backends/anyio.py", line 113, in connect_tcp
2025-12-01 22:58:18 with map_exceptions(exc_map):
2025-12-01 22:58:18 │ └ {<class 'TimeoutError'>: <class 'httpcore.ConnectTimeout'>, <class 'OSError'>: <class 'httpcore.ConnectError'>, <class 'anyio...
2025-12-01 22:58:18 └ <function map_exceptions at 0x7f2b640c6b60>
2025-12-01 22:58:18 File "/usr/local/lib/python3.12/contextlib.py", line 158, in __exit__
2025-12-01 22:58:18 self.gen.throw(value)
2025-12-01 22:58:18 │ │ │ └ OSError('All connection attempts failed')
2025-12-01 22:58:18 │ │ └ <method 'throw' of 'generator' objects>
2025-12-01 22:58:18 │ └ <generator object map_exceptions at 0x7f26c33f7c40>
2025-12-01 22:58:18 └ <contextlib._GeneratorContextManager object at 0x7f26e0f529c0>
2025-12-01 22:58:18 File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
2025-12-01 22:58:18 raise to_exc(exc) from exc
2025-12-01 22:58:18 └ <class 'httpcore.ConnectError'>
2025-12-01 22:58:18

2025-12-01 22:58:18 httpcore.ConnectError: All connection attempts failed 2025-12-01 22:58:18

2025-12-01 22:58:18

2025-12-01 22:58:18 The above exception was the direct cause of the following exception: 2025-12-01 22:58:18

2025-12-01 22:58:18

2025-12-01 22:58:18 Traceback (most recent call last): 2025-12-01 22:58:18

2025-12-01 22:58:18 File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/openai/_base_client.py", line 1529, in request
2025-12-01 22:58:18 response = await self._client.send(
2025-12-01 22:58:18 │ │ └ <function AsyncClient.send at 0x7f2ba5d2f060>
2025-12-01 22:58:18 │ └ <httpx.AsyncClient object at 0x7f2b56a18440>
2025-12-01 22:58:18 └ <openai.AsyncOpenAI object at 0x7f2b56a183b0>
2025-12-01 22:58:18 File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1629, in send
2025-12-01 22:58:18 response = await self._send_handling_auth(
2025-12-01 22:58:18 │ └ <function AsyncClient._send_handling_auth at 0x7f2ba5d2f100>
2025-12-01 22:58:18 └ <httpx.AsyncClient object at 0x7f2b56a18440>
2025-12-01 22:58:18 File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1657, in _send_handling_auth
2025-12-01 22:58:18 response = await self._send_handling_redirects(
2025-12-01 22:58:18 │ └ <function AsyncClient._send_handling_redirects at 0x7f2ba5d2f1a0>
2025-12-01 22:58:18 └ <httpx.AsyncClient object at 0x7f2b56a18440>
2025-12-01 22:58:18 File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1694, in _send_handling_redirects
2025-12-01 22:58:18 response = await self._send_single_request(request)
2025-12-01 22:58:18 │ │ └ <Request('POST', 'http://localhost:8000/v1/chat/completions')>
2025-12-01 22:58:18 │ └ <function AsyncClient._send_single_request at 0x7f2ba5d2f240>
2025-12-01 22:58:18 └ <httpx.AsyncClient object at 0x7f2b56a18440>
2025-12-01 22:58:18 File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpx/_client.py", line 1730, in _send_single_request
2025-12-01 22:58:18 response = await transport.handle_async_request(request)
2025-12-01 22:58:18 │ │ └ <Request('POST', 'http://localhost:8000/v1/chat/completions')>
2025-12-01 22:58:18 │ └ <function AsyncHTTPTransport.handle_async_request at 0x7f2ba5d23ec0>
2025-12-01 22:58:18 └ <httpx.AsyncHTTPTransport object at 0x7f2b5685e3f0>
2025-12-01 22:58:18 File "/alloc/prime-rl/.venv/lib/python3.12/site-packages/httpx/_transports/default.py", line 393, in handle_async_request
2025-12-01 22:58:18 with map_httpcore_exceptions():
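
For anyone reading the trace: the OpenAI client failed to open a TCP connection to `localhost:8000` (`httpcore.ConnectError: All connection attempts failed`), which suggests the local inference server was unreachable at that moment rather than an external network drop. One possible mitigation, as a minimal sketch and not something verifiers or prime-rl is known to provide, is to retry the model call with exponential backoff instead of exiting on the first failure. The helper name `call_with_retries` and all of its parameters are hypothetical and exist only for illustration.

```python
import asyncio
import random


async def call_with_retries(
    make_call,                      # zero-argument callable returning an awaitable
    *,
    max_attempts=5,                 # hypothetical default, tune for your setup
    base_delay=1.0,                 # seconds before the first retry
    max_delay=30.0,                 # upper bound on any single wait
    retry_on=(ConnectionError, OSError),
    sleep=asyncio.sleep,            # injectable so tests can skip real waiting
):
    """Await make_call(), retrying on the given exceptions with backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: re-raise so the caller can decide what to do.
                raise
            # Exponential backoff (1x, 2x, 4x, ...) capped at max_delay,
            # scaled by random jitter so parallel rollouts do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            await sleep(delay * random.uniform(0.5, 1.0))
```

With the OpenAI client, the exception to pass would be `retry_on=(openai.APIConnectionError,)`, since that error does not derive from the built-in `ConnectionError`. The client constructor also accepts a `max_retries` argument that may be enough on its own. Either way, retries only help if the server at `localhost:8000` comes back; if it crashed, its own logs are the place to look.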

Jgmedina95 · Dec 02 '25 04:12