SimpleAsyncHTTPClient randomly blocking on "ValueError: IOStream is not idle; cannot convert to SSL"
Hi, I'm using SimpleAsyncHTTPClient to perform a few hundred requests per minute. It had been working fine until I started encountering the error below. The error happens randomly (I haven't managed to reproduce it), and once it occurs, the SimpleAsyncHTTPClient fails on all subsequent requests ("Timeout while connecting"). A process restart solves the problem until the error occurs again. Any idea where this comes from or how to handle it?
I tried catching the error, closing the client, and creating a new instance, but the new client still can't perform HTTP requests until the program is restarted.
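For reference, the catch/close/recreate attempt looks roughly like this (a simplified sketch, not my real code; make_client and fetch_with_recovery are just names for the sketch, and the real error handling is more involved):

from tornado import gen
from tornado.httpclient import AsyncHTTPClient

def make_client():
    # Same initialization as the snippet further down in this post.
    client = AsyncHTTPClient(force_instance=True)
    client.initialize(max_clients=200)
    return client

client = make_client()

@gen.coroutine
def fetch_with_recovery(url):
    global client
    try:
        response = yield client.fetch(url)
    except ValueError:
        # "IOStream is not idle; cannot convert to SSL": close the broken
        # client and build a fresh one. In practice the fresh instance still
        # fails with "Timeout while connecting" until the process restarts.
        client.close()
        client = make_client()
        raise
    raise gen.Return(response)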
ERROR:tornado.application:Exception in callback <functools.partial object at 0x7f6a3894c3c0>
Traceback (most recent call last):
File "/home/ubuntu/env/local/lib/python2.7/site-packages/tornado/ioloop.py", line 758, in _run_callback
ret = callback()
File "/home/ubuntu/env/local/lib/python2.7/site-packages/tornado/stack_context.py", line 300, in null_wrapper
return fn(*args, **kwargs)
File "/home/ubuntu/env/local/lib/python2.7/site-packages/tornado/ioloop.py", line 779, in _discard_future_result
future.result()
File "/home/ubuntu/env/local/lib/python2.7/site-packages/tornado/concurrent.py", line 261, in result
raise_exc_info(self._exc_info)
File "/home/ubuntu/env/local/lib/python2.7/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/home/ubuntu/env/local/lib/python2.7/site-packages/tornado/simple_httpclient.py", line 272, in run
max_buffer_size=self.max_buffer_size)
File "/home/ubuntu/env/local/lib/python2.7/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/home/ubuntu/env/local/lib/python2.7/site-packages/tornado/concurrent.py", line 261, in result
raise_exc_info(self._exc_info)
File "/home/ubuntu/env/local/lib/python2.7/site-packages/tornado/gen.py", line 1147, in run
yielded = self.gen.send(value)
File "/home/ubuntu/env/local/lib/python2.7/site-packages/tornado/tcpclient.py", line 242, in connect
server_hostname=host)
File "/home/ubuntu/env/local/lib/python2.7/site-packages/tornado/iostream.py", line 1368, in start_tls
raise ValueError("IOStream is not idle; cannot convert to SSL")
ValueError: IOStream is not idle; cannot convert to SSL
I'm using tornado 5.1.1 on Python 2.7.15 and here's a snippet of how I initialize the HTTPClient:
client = AsyncHTTPClient(force_instance=True)
client.initialize(max_clients=200)
Hmm, I haven't seen anything like that and I'm not sure how it could happen. Are you doing anything with multiple threads, by any chance?
You're not supposed to call initialize yourself. Instead, pass keyword arguments to the constructor:
client = AsyncHTTPClient(force_instance=True, max_clients=200)
I don't think that could break anything like this, though.
No multiple threads with my tornado process. I have a couple of celery eventlet workers on the same machine though. Do you think it's causing some interference?
Good catch on the initialize method, but the problem appeared a couple of days before I switched to that syntax.
Yes, eventlet could be the problem. Tornado is incompatible with the monkey-patching features of eventlet and gevent. You can use these libraries in the same process as Tornado only if you avoid their monkey-patching features, but it looks like celery in eventlet mode does use monkey patching.
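If you want to rule that out, a quick check from inside the Tornado process is to ask eventlet itself whether the stdlib has been patched (a sketch; it assumes eventlet is importable in that process at all):

# Run once at startup of the Tornado process.
try:
    import eventlet.patcher
except ImportError:
    print("eventlet is not installed in this process, so it cannot have patched anything")
else:
    for mod in ("socket", "select", "thread", "time"):
        print("%s monkey-patched: %s" % (mod, eventlet.patcher.is_monkey_patched(mod)))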
So even though the processes are completely separate (tornado vs celery eventlet worker), celery patches still apply to the tornado process?
In a separate process it's fine. I misunderstood and thought you were running the celery workers inside the tornado process.
We ran into something similar today:
- OSError (ssl in do_handshake): "[Errno 0] Error"
- ValueError (tornado.iostream in start_tls): "IOStream is not idle; cannot convert to SSL"
- Huge spike in response time leading to timeouts (running on Heroku dynos)
- Seemed to recover on its own after 40min
Anything I can share with you that would help you investigate?
Some more data.
Heroku
We run Tornado on Heroku using server.start(0)
- Each dyno has 4-8 Tornado processes.
- A single Heroku dyno was affected
- This began roughly one minute after the dyno started
- The dyno had received <10 requests when it began
- Once it started, the rate increased until almost all requests to that dyno were timing out
- I don't think the dyno restarted, so the issue appears to have resolved itself after ~40min
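For context, the startup is essentially the stock multi-process pattern with server.start(0) mentioned above, roughly like this trimmed-down sketch (the handler and default port are placeholders; the real app binds to Heroku's $PORT):

import os

import tornado.ioloop
import tornado.web
from tornado.httpserver import HTTPServer

class HealthHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("ok")

def main():
    app = tornado.web.Application([(r"/health", HealthHandler)])
    server = HTTPServer(app)
    server.bind(int(os.environ.get("PORT", 8000)))
    server.start(0)  # fork one Tornado process per CPU core (the 4-8 per dyno above)
    tornado.ioloop.IOLoop.current().start()

if __name__ == "__main__":
    main()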
Sentry
We saw 3 exceptions spike in Sentry:
- OSError (ssl in do_handshake): "[Errno 0] Error"
- ValueError (tornado.iostream in start_tls): "IOStream is not idle; cannot convert to SSL"
- HTTPTimeoutError: "Timeout while connecting" (in an http_client.fetch line)
Which Tornado and Python versions? (Tornado 6 silences the "errno 0" log messages) Are you still using that monitoring package that does a lot of monkey-patching? (which one was it again?)
As far as I know the errno 0 thing is just harmless log spam (it's supposed to be an SSLError("EOF occurred in violation of protocol") but it's not raised correctly). But one thing I know about it is that it can be triggered by some kinds of port scans. Could there be some sort of DDoS or fuzzer attacking you?
What's the stack trace for the "iostream not idle" message? Do the client timeouts happen on particular URLs or domains?
Hi Ben, long time no see! 👋🏻
- We're on Python 3.8.3 and Tornado 6.0.4
- We're no longer using the monkey-patching monitoring package (new-relic).
- I did not see any spikes in throughput, so I don't have any evidence of port-scanning etc.
We see occasional timeouts when rendering large PDFs (which is not surprising). However, there were no PDF requests to that instance around the time this started.
The whole instance locked up, ramping up to almost 100% timeouts and only recovering after 40 minutes. One explanation would be that it simply fell behind, and the queued-up requests then timed out in succession until the request pattern allowed it to catch up. However, that wouldn't explain the spike in both the SSL and IOStream errors.
Below is the stack trace for the IOStream error. I can give you access to Sentry too if that'd be helpful:
ValueError: IOStream is not idle; cannot convert to SSL
File "tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "tornado/simple_httpclient.py", line 286, in <lambda>
gen.convert_yielded(self.run()), lambda f: f.result()
File "tornado/simple_httpclient.py", line 330, in run
stream = await self.tcp_client.connect(
File "tornado/tcpclient.py", line 293, in connect
stream = await stream.start_tls(
File "tornado/iostream.py", line 1296, in start_tls
raise ValueError("IOStream is not idle; cannot convert to SSL")
Hi Nick!
The fact that you're seeing "errno 0" in Tornado 6 means that it's getting raised in a different spot than I've seen before (it's not caught by https://github.com/tornadoweb/tornado/commit/e63980fb6b41ac5f8fda278331229fbcffd7f46d). Do you have the stack trace for that one?
I suspect this is related to some TLS 1.3 issues that we haven't tracked down yet (#2536). Disabling TLS 1.3 (like #2725) might prevent this.
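If you want to experiment with that, one way to do it on the client side is to pass an SSLContext capped at TLS 1.2 through the request's ssl_options (a sketch only; the URL is a placeholder, and whether this actually avoids the bug is exactly what's unproven):

import ssl

from tornado.httpclient import AsyncHTTPClient, HTTPRequest

# Default client context, but refuse to negotiate TLS 1.3.
ssl_ctx = ssl.create_default_context()
ssl_ctx.maximum_version = ssl.TLSVersion.TLSv1_2

async def fetch_without_tls13(url):
    client = AsyncHTTPClient()
    return await client.fetch(HTTPRequest(url, ssl_options=ssl_ctx))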
Did CPU utilization go to 100% during this? If it happens again it would be great to get a profile (thinking of https://github.com/bdarnell/plop or https://github.com/nylas/nylas-perftools but neither has been updated in a long time so I'm not sure if there's something more current).
I do have a stack trace!
OSError: [Errno 0] Error
File "tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "tornado/simple_httpclient.py", line 286, in <lambda>
gen.convert_yielded(self.run()), lambda f: f.result()
File "tornado/simple_httpclient.py", line 330, in run
stream = await self.tcp_client.connect(
File "tornado/tcpclient.py", line 293, in connect
stream = await stream.start_tls(
File "tornado/iostream.py", line 1417, in _do_ssl_handshake
self.socket.do_handshake()
File "ssl.py", line 1309, in do_handshake
self._sslobj.do_handshake()
Since we're on Heroku, I have limited visibility into CPU utilization, but I did not see a spike in "dyno load".
It hasn't happened again, but I'll take a look at those profiling tools.
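In the meantime, a lighter-weight thing I'm considering is registering a faulthandler dump on a signal at startup, so if a dyno wedges again we can at least see where every thread is sitting (a sketch; on Heroku the signal would presumably have to be sent from a heroku ps:exec shell):

import faulthandler
import signal

# Dump the stack of every thread to stderr when the process gets SIGUSR1.
# Cheap enough to leave enabled in production.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Or, for unattended snapshots, dump automatically every 60 seconds:
# faulthandler.dump_traceback_later(60, repeat=True)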