SimpleAsyncHTTPClient randomly blocking on "ValueError: IOStream is not idle; cannot convert to SSL"
Hi, I'm using SimpleAsyncHTTPClient to perform a few hundred requests per minute. It had been working fine until I started encountering the error below. The error happens randomly (I haven't managed to reproduce it), and once it occurs, the SimpleAsyncHTTPClient fails on all subsequent requests ("Timeout while connecting"). A process restart solves the problem until the error occurs again. Any idea where this comes from or how to handle it?
I tried catching the error, closing the client, and creating a new instance, but the new client still can't perform HTTP requests until the program is restarted.
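For reference, the catch/close/recreate attempt looks roughly like this (a simplified sketch, not my real code; make_client and fetch_with_recovery are just names for the sketch, and the real error handling is more involved):

from tornado import gen
from tornado.httpclient import AsyncHTTPClient

def make_client():
    # Same initialization as the snippet further down in this post.
    client = AsyncHTTPClient(force_instance=True)
    client.initialize(max_clients=200)
    return client

client = make_client()

@gen.coroutine
def fetch_with_recovery(url):
    global client
    try:
        response = yield client.fetch(url)
    except ValueError:
        # "IOStream is not idle; cannot convert to SSL": close the broken
        # client and build a fresh one. In practice the fresh instance still
        # fails with "Timeout while connecting" until the process restarts.
        client.close()
        client = make_client()
        raise
    raise gen.Return(response)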
ERROR:tornado.application:Exception in callback <functools.partial object at 0x7f6a3894c3c0>
Traceback (most recent call last):
File "/home/ubuntu/env/local/lib/python2.7/site-packages/tornado/ioloop.py", line 758, in _run_callback
ret = callback()
File "/home/ubuntu/env/local/lib/python2.7/site-packages/tornado/stack_context.py", line 300, in null_wrapper
return fn(*args, **kwargs)
File "/home/ubuntu/env/local/lib/python2.7/site-packages/tornado/ioloop.py", line 779, in _discard_future_result
future.result()
File "/home/ubuntu/env/local/lib/python2.7/site-packages/tornado/concurrent.py", line 261, in result
raise_exc_info(self._exc_info)
File "/home/ubuntu/env/local/lib/python2.7/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/home/ubuntu/env/local/lib/python2.7/site-packages/tornado/simple_httpclient.py", line 272, in run
max_buffer_size=self.max_buffer_size)
File "/home/ubuntu/env/local/lib/python2.7/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/home/ubuntu/env/local/lib/python2.7/site-packages/tornado/concurrent.py", line 261, in result
raise_exc_info(self._exc_info)
File "/home/ubuntu/env/local/lib/python2.7/site-packages/tornado/gen.py", line 1147, in run
yielded = self.gen.send(value)
File "/home/ubuntu/env/local/lib/python2.7/site-packages/tornado/tcpclient.py", line 242, in connect
server_hostname=host)
File "/home/ubuntu/env/local/lib/python2.7/site-packages/tornado/iostream.py", line 1368, in start_tls
raise ValueError("IOStream is not idle; cannot convert to SSL")
ValueError: IOStream is not idle; cannot convert to SSL
I'm using tornado 5.1.1 on Python 2.7.15 and here's a snippet of how I initialize the HTTPClient:
client = AsyncHTTPClient(force_instance=True)
client.initialize(max_clients=200)
Hmm, I haven't seen anything like that and I'm not sure how it could happen. Are you doing anything with multiple threads, by any chance?
You're not supposed to call initialize yourself. Instead, pass keyword arguments to the constructor:
client = AsyncHTTPClient(force_instance=True, max_clients=200)
I don't think that could break anything like this, though.
No multiple threads with my tornado process. I have a couple of celery eventlet workers on the same machine though. Do you think it's causing some interference?
Good catch on the initialize method, but the problem appeared a couple of days before I switched to that syntax.
Yes, eventlet could be the problem. Tornado is incompatible with the monkey-patching features of eventlet and gevent. You can use these libraries in the same process as Tornado only if you avoid their monkey-patching features, but it looks like celery in eventlet mode does use monkey patching.
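If you want to rule that out, a quick check from inside the Tornado process is to ask eventlet itself whether the stdlib has been patched (a sketch; it assumes eventlet is importable in that process at all):

# Run once at startup of the Tornado process.
try:
    import eventlet.patcher
except ImportError:
    print("eventlet is not installed in this process, so it cannot have patched anything")
else:
    for mod in ("socket", "select", "thread", "time"):
        print("%s monkey-patched: %s" % (mod, eventlet.patcher.is_monkey_patched(mod)))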
So even though the processes are completely separate (tornado vs celery eventlet worker), celery patches still apply to the tornado process?
In a separate process it's fine. I misunderstood and thought you were running the celery workers inside the tornado process.
We ran into something similar today:
- OSError (ssl in do_handshake): "[Errno 0] Error"
- ValueError (tornado.iostream in start_tls): "IOStream is not idle; cannot convert to SSL"
- Huge spike in response time leading to timeouts (running on Heroku dynos)
- Seemed to recover on its own after 40min
Anything I can share with you that would help you investigate?
Some more data.
Heroku
We run Tornado on Heroku using server.start(0)
- Each dyno has 4-8 Tornado processes.
- A single Heroku dyno was affected
- This began roughly one minute after the dyno started
- The dyno had received <10 requests when it began
- Once it started, the rate increased until almost all requests to that dyno were timing out
- I don't think the dyno restarted, so the issue appears to have resolved itself after ~40min
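For context, the startup is essentially the stock multi-process pattern with server.start(0) mentioned above, roughly like this trimmed-down sketch (the handler and default port are placeholders; the real app binds to Heroku's $PORT):

import os

import tornado.ioloop
import tornado.web
from tornado.httpserver import HTTPServer

class HealthHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("ok")

def main():
    app = tornado.web.Application([(r"/health", HealthHandler)])
    server = HTTPServer(app)
    server.bind(int(os.environ.get("PORT", 8000)))
    server.start(0)  # fork one Tornado process per CPU core (the 4-8 per dyno above)
    tornado.ioloop.IOLoop.current().start()

if __name__ == "__main__":
    main()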
Sentry
We saw 3 exceptions spike in Sentry:
- OSError (ssl in do_handshake): "[Errno 0] Error"
- ValueError (tornado.iostream in start_tls): "IOStream is not idle; cannot convert to SSL"
- HTTPTimeoutError: "Timeout while connecting" (in an http_client.fetch line)
Which Tornado and Python versions? (Tornado 6 silences the "errno 0" log messages) Are you still using that monitoring package that does a lot of monkey-patching? (which one was it again?)
As far as I know the errno 0 thing is just harmless log spam (it's supposed to be an SSLError("EOF occurred in violation of protocol") but it's not raised correctly). But one thing I know about it is that it can be triggered by some kinds of port scans. Could there be some sort of DDoS or fuzzer attacking you?
What's the stack trace for the "iostream not idle" message? Do the client timeouts happen on particular URLs or domains?
Hi Ben, long time no see! 👋🏻
- We're on Python 3.8.3 and Tornado 6.0.4
- We're no longer using the monkey-patching monitoring package (new-relic).
- I did not see any spikes in throughput, so I don't have any evidence of port-scanning etc.
We see occasional timeouts when rendering large PDFs (which is not surprising). However, there were no PDF requests to that instance around the time this started.
The whole instance locked up, ramping up to almost 100% timeouts and only recovering after 40 minutes. One explanation would be that it simply fell behind, and the queued-up requests then timed out in succession until the request pattern allowed it to catch up. However, that wouldn't explain the spike in both the SSL and IOStream errors.
Below is the stack trace for the IOStream error. I can give you access to Sentry too if that'd be helpful:
ValueError: IOStream is not idle; cannot convert to SSL
File "tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "tornado/simple_httpclient.py", line 286, in <lambda>
gen.convert_yielded(self.run()), lambda f: f.result()
File "tornado/simple_httpclient.py", line 330, in run
stream = await self.tcp_client.connect(
File "tornado/tcpclient.py", line 293, in connect
stream = await stream.start_tls(
File "tornado/iostream.py", line 1296, in start_tls
raise ValueError("IOStream is not idle; cannot convert to SSL")
Hi Nick!
The fact that you're seeing "errno 0" in Tornado 6 means that it's getting raised in a different spot than I've seen before (it's not caught by https://github.com/tornadoweb/tornado/commit/e63980fb6b41ac5f8fda278331229fbcffd7f46d). Do you have the stack trace for that one?
I suspect this is related to some TLS 1.3 issues that we haven't tracked down yet (#2536). Disabling TLS 1.3 (like #2725) might prevent this.
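If you want to experiment with that, one way to do it on the client side is to pass an SSLContext capped at TLS 1.2 through the request's ssl_options (a sketch only; the URL is a placeholder, and whether this actually avoids the bug is exactly what's unproven):

import ssl

from tornado.httpclient import AsyncHTTPClient, HTTPRequest

# Default client context, but refuse to negotiate TLS 1.3.
ssl_ctx = ssl.create_default_context()
ssl_ctx.maximum_version = ssl.TLSVersion.TLSv1_2

async def fetch_without_tls13(url):
    client = AsyncHTTPClient()
    return await client.fetch(HTTPRequest(url, ssl_options=ssl_ctx))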
Did CPU utilization go to 100% during this? If it happens again it would be great to get a profile (thinking of https://github.com/bdarnell/plop or https://github.com/nylas/nylas-perftools but neither has been updated in a long time so I'm not sure if there's something more current).
I do have a stack trace!
OSError: [Errno 0] Error
File "tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "tornado/simple_httpclient.py", line 286, in <lambda>
gen.convert_yielded(self.run()), lambda f: f.result()
File "tornado/simple_httpclient.py", line 330, in run
stream = await self.tcp_client.connect(
File "tornado/tcpclient.py", line 293, in connect
stream = await stream.start_tls(
File "tornado/iostream.py", line 1417, in _do_ssl_handshake
self.socket.do_handshake()
File "ssl.py", line 1309, in do_handshake
self._sslobj.do_handshake()
Since we're on Heroku, I have limited visibility into CPU utilization, but I did not see a spike in "dyno load".
It hasn't happened again, but I'll take a look at those profiling tools.
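In the meantime, a lighter-weight thing I'm considering is registering a faulthandler dump on a signal at startup, so if a dyno wedges again we can at least see where every thread is sitting (a sketch; on Heroku the signal would presumably have to be sent from a heroku ps:exec shell):

import faulthandler
import signal

# Dump the stack of every thread to stderr when the process gets SIGUSR1.
# Cheap enough to leave enabled in production.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Or, for unattended snapshots, dump automatically every 60 seconds:
# faulthandler.dump_traceback_later(60, repeat=True)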