
Crash in grpc ("Too many open files")

Open amk43 opened this issue 5 years ago • 2 comments

Posting here as recommended by YC support.

We use Ubuntu 18.04.3, python 3.6.9, yandexcloud 0.34.0, grpc 1.28.1.

Our application continuously starts and stops instances in YC, making at most a few hundred API requests an hour (probably fewer). After running this way for some time (perhaps a couple of days), the application inevitably crashes with a stack trace like this:

Traceback (most recent call last):
  File "./dispatcher.py", line 86, in runInstance
    disks = ysdk.client(DiskServiceStub).List(ListDisksRequest(folder_id = CONF['folder_id'])).disks
  File "/home/ubuntu/.local/lib/python3.6/site-packages/grpc/_interceptor.py", line 221, in __call__
    compression=compression)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/grpc/_interceptor.py", line 257, in _with_call
    return call.result(), call
  File "/home/ubuntu/.local/lib/python3.6/site-packages/grpc/_channel.py", line 333, in result
    raise self
  File "/home/ubuntu/.local/lib/python3.6/site-packages/grpc/_interceptor.py", line 247, in continuation
    compression=new_compression)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/grpc/_channel.py", line 837, in with_call
    return _end_unary_response_blocking(state, call, True, None)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/grpc/_channel.py", line 729, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Getting metadata from plugin failed with error: HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/token (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc975e55a58>: Failed to establish a new connection: [Errno 24] Too many open files',))"
        debug_error_string = "{"created":"@1590184822.768147799","description":"Getting metadata from plugin failed with error: HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/token (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc975e55a58>: Failed to establish a new connection: [Errno 24] Too many open files',))","file":"src/core/lib/security/credentials/plugin/plugin_credentials.cc","file_line":79,"grpc_status":14}"
>

Before the crash the grpc library also outputs error messages, e.g.:

E0522 22:00:22.710785667   14565 ev_epollex_linux.cc:1458]   pollset_set_add_pollset: {"created":"@1590184822.710768750","description":"Too many open files","errno":24,"file":"src/core/lib/iomgr/wakeup_fd_eventfd.cc","file_line":38,"os_error":"Too many open files","syscall":"eventfd"}
E0522 22:00:26.276082489   14563 ev_epollex_linux.cc:1306]   pollset_add_fd: {"created":"@1590184826.276050019","description":"pollset_transition_pollable_from_empty_to_fd","file":"src/core/lib/iomgr/ev_epollex_linux.cc","file_line":325,"referenced_errors":[{"created":"@1590184826.276048606","description":"get_fd_pollable","file":"src/core/lib/iomgr/ev_epollex_linux.cc","file_line":325,"referenced_errors":[{"created":"@1590184826.276041326","description":"Too many open files","errno":24,"file":"src/core/lib/iomgr/wakeup_fd_eventfd.cc","file_line":38,"os_error":"Too many open files","syscall":"eventfd"}]}]}
E0522 22:00:27.901743430   14560 ev_epollex_linux.cc:1458]   pollset_set_add_pollset: {"created":"@1590184827.901723028","description":"Too many open files","errno":24,"file":"src/core/lib/iomgr/wakeup_fd_eventfd.cc","file_line":38,"os_error":"Too many open files","syscall":"eventfd"}
E0522 22:00:29.869932962   14563 ev_epollex_linux.cc:1306]   pollset_add_fd: {"created":"@1590184829.869899748","description":"pollset_transition_pollable_from_empty_to_fd","file":"src/core/lib/iomgr/ev_epollex_linux.cc","file_line":325,"referenced_errors":[{"created":"@1590184829.869898650","description":"get_fd_pollable","file":"src/core/lib/iomgr/ev_epollex_linux.cc","file_line":325,"referenced_errors":[{"created":"@1590184829.869897060","description":"Too many open files","errno":24,"file":"src/core/lib/iomgr/ev_epollex_linux.cc","file_line":568,"os_error":"Too many open files","syscall":"epoll_create1"}]}]}
E0522 22:00:33.867603041   27147 ev_epollex_linux.cc:1408]   assertion failed: i != pss->pollset_count

This may be caused by a known problem in grpc. E.g. see https://github.com/grpc/grpc/issues/15759 and related issues.
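To check whether descriptors really accumulate over time, the process can log its own open-descriptor count at intervals. The following is only a minimal sketch, assuming Linux (it reads `/proc/self/fd`); the helper name `count_open_fds` is hypothetical and is not part of yandexcloud or grpc.

```python
import os


def count_open_fds():
    """Return the number of file descriptors open in the current process.

    Linux only: each entry in /proc/self/fd is one open descriptor.
    The directory handle used for the listing itself is included, so the
    value is offset by a constant; what matters for leak hunting is the trend.
    """
    return len(os.listdir("/proc/self/fd"))


if __name__ == "__main__":
    # Hypothetical usage: call this periodically (e.g. once per dispatch loop)
    # and watch whether the number keeps growing between API calls.
    print("open file descriptors:", count_open_fds())
```

A count that grows steadily while the request rate stays flat would point to a leak rather than to a limit that is simply too low.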

As a workaround, we tried setting the nofile OS limit to a very high value. This results in the following behavior: over the course of several days (or weeks), the average CPU load of the application grows (presumably because of an ever-growing number of open files) until it hits 100% and the app becomes completely unresponsive.
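For reference, the nofile limit can also be inspected and adjusted from inside the process with Python's standard `resource` module. This is a sketch under assumptions, not what we actually ran: the function name `set_nofile_soft_limit` is hypothetical, and the soft limit can only be raised up to the hard limit without extra privileges.

```python
import resource


def set_nofile_soft_limit(target):
    """Set the soft RLIMIT_NOFILE to `target`, capped at the hard limit.

    Returns the soft limit that is in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard == resource.RLIM_INFINITY:
        new_soft = target
    else:
        # An unprivileged process cannot exceed the hard limit.
        new_soft = min(target, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("nofile soft/hard limits:", soft, hard)
```

Raising the limit only postpones descriptor exhaustion if descriptors are leaking, which matches the slow degradation described above.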

It should be noted that when using the AWS EC2 SDK/cloud for instance management in an otherwise identical app under a very similar load, no issues of this kind occur. This suggests that the problem is specific to the YC SDK.

amk43, Jul 22 '20 23:07

Is the issue still relevant? Can you provide a minimal example that reproduces the problem?

l0kix2, Feb 11 '22 08:02

Hi! I cannot test this on the latest version right now. We are currently using an old version (0.60.0) and the issue is still there. I am pretty sure this is caused by a gRPC issue, which does not appear to be fixed. See e.g. https://github.com/grpc/grpc/issues/20418. I will get back to you if I can confirm this on an up-to-date version of yandexcloud.

amk43, Feb 17 '22 23:02