[Core] error happened in _raylet.so, thread_proxy when using ray.init

Open guangzlu opened this issue 1 year ago • 0 comments

What happened + What you expected to happen

I can't initialize Ray with num_cpus greater than 10, even though multiprocessing.cpu_count() reports 192 CPUs on the machine. This blocks me from using vLLM on multiple GPUs.
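As a quick sanity check on the CPU count mentioned above, here is a minimal sketch that compares multiprocessing.cpu_count() with the set of CPUs the current process is actually allowed to run on. The helper name cpu_views is hypothetical, and os.sched_getaffinity is only available on some platforms (Linux among them), so the code falls back to cpu_count() elsewhere. A difference between the two numbers would not by itself explain the crash; it only shows whether the container restricts CPU affinity.

```python
import multiprocessing
import os


def cpu_views():
    """Return CPU counts as seen through two standard-library views.

    cpu_views is a hypothetical helper name used only for this sketch.
    """
    # Logical CPUs reported for the whole machine.
    total = multiprocessing.cpu_count()

    # CPUs this process may be scheduled on (not available on every platform).
    if hasattr(os, "sched_getaffinity"):
        usable = len(os.sched_getaffinity(0))
    else:
        usable = total

    return {"cpu_count": total, "affinity": usable}


if __name__ == "__main__":
    print(cpu_views())
```

If "affinity" is much smaller than "cpu_count", the process is pinned to a subset of the CPUs, which may be worth mentioning in the report.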

error log:

import ray
ray.init(num_cpus=20)
2024-05-23 06:21:54,029 INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
RayContext(dashboard_url='127.0.0.1:8265', python_version='3.10.13', ray_version='2.9.3', ray_commit='62655e11ed76509b78654b60be67bc59f8f3460a', protocol_version=None)
[2024-05-23 06:21:57,911 E 3961 4396] logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): thread: Resource temporarily unavailable [system:11]
(pid=4413) [2024-05-23 06:21:57,911 E 4413 5132] logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): thread: Resource temporarily unavailable [system:11]
(pid=4408) [2024-05-23 06:21:58,299 E 4408 5033] logging.cc:104: Stack trace:
(pid=4408) /opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0xfebc9a) [0x7fb1f0d84c9a] ray::operator<<()
(pid=4408) /opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0xfee3d8) [0x7fb1f0d873d8] ray::TerminateHandler()
(pid=4408) /opt/conda/envs/py_3.10/bin/../lib/libstdc++.so.6(+0xb057e) [0x7fb1efc8257e] __cxxabiv1::__terminate()
(pid=4408) /opt/conda/envs/py_3.10/bin/../lib/libstdc++.so.6(+0xb05d0) [0x7fb1efc825d0] __cxxabiv1::__unexpected()
(pid=4408) /opt/conda/envs/py_3.10/bin/../lib/libstdc++.so.6(__cxa_rethrow+0) [0x7fb1efc827c2] __cxa_rethrow
(pid=4408) /opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0x5572c2) [0x7fb1f02f02c2] boost::throw_exception<>()
(pid=4408) /opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0x10d7fbb) [0x7fb1f0e70fbb] boost::asio::detail::do_throw_error()
(pid=4408) /opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0x10d89db) [0x7fb1f0e719db] boost::asio::detail::posix_thread::start_thread()
(pid=4408) /opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0x10d8e3c) [0x7fb1f0e71e3c] boost::asio::thread_pool::thread_pool()
(pid=4408) /opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0xa16fc4) [0x7fb1f07affc4] ray::rpc::(anonymous namespace)::_GetServerCallExecutor()
(pid=4408) /opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray3rpc21GetServerCallExecutorEv+0x9) [0x7fb1f07b0059] ray::rpc::GetServerCallExecutor()
(pid=4408) /opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvN3ray6StatusESt8functionIFvvEES4_EZNS0_3rpc14ServerCallImplINS6_24CoreWorkerServiceHandlerENS6_25GetCoreWorkerStatsRequestENS6_23GetCoreWorkerStatsReplyELNS6_8AuthTypeE0EE17HandleRequestImplEbEUlS1_S4_S4_E0_E9_M_invokeERKSt9_Any_dataOS1_OS4_SJ_+0x128) [0x7fb1f04d55b8] std::_Function_handler<>::_M_invoke()
(pid=4408) /opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker24HandleGetCoreWorkerStatsENS_3rpc25GetCoreWorkerStatsRequestEPNS2_23GetCoreWorkerStatsReplyESt8functionIFvNS_6StatusES6_IFvvEES9_EE+0x899) [0x7fb1f0517a79] ray::core::CoreWorker::HandleGetCoreWorkerStats()
(pid=4408) /opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray3rpc14ServerCallImplINS0_24CoreWorkerServiceHandlerENS0_25GetCoreWorkerStatsRequestENS0_23GetCoreWorkerStatsReplyELNS0_8AuthTypeE0EE17HandleRequestImplEb+0x104) [0x7fb1f050d864] ray::rpc::ServerCallImpl<>::HandleRequestImpl()
(pid=4408) /opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0xa287ae) [0x7fb1f07c17ae] EventTracker::RecordExecution()
(pid=4408) /opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0xa21b9e) [0x7fb1f07bab9e] std::_Function_handler<>::_M_invoke()
(pid=4408) /opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0xa22016) [0x7fb1f07bb016] boost::asio::detail::completion_handler<>::do_complete()
(pid=4408) /opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0x10d564b) [0x7fb1f0e6e64b] boost::asio::detail::scheduler::do_run_one()
(pid=4408) /opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0x10d6fc9) [0x7fb1f0e6ffc9] boost::asio::detail::scheduler::run()
(pid=4408) /opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0x10d76d2) [0x7fb1f0e706d2] boost::asio::io_context::run()
(pid=4408) /opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker12RunIOServiceEv+0xc9) [0x7fb1f04eeb49] ray::core::CoreWorker::RunIOService()
(pid=4408) /opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0xb158c0) [0x7fb1f08ae8c0] thread_proxy
(pid=4408) /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fb1f1aceac3]
(pid=4408) /lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7fb1f1b60850]
(pid=4408)
(pid=4408) *** SIGABRT received at time=1716445318 on cpu 62 ***
(pid=4408) PC: @ 0x7fb1f1ad09fc (unknown) pthread_kill
(pid=4408) @ 0x7fb1f1a7c520 (unknown) (unknown)
(pid=4408) [2024-05-23 06:21:58,299 E 4408 5033] logging.cc:361: *** SIGABRT received at time=1716445318 on cpu 62 ***
(pid=4408) [2024-05-23 06:21:58,299 E 4408 5033] logging.cc:361: PC: @ 0x7fb1f1ad09fc (unknown) pthread_kill
(pid=4408) [2024-05-23 06:21:58,299 E 4408 5033] logging.cc:361: @ 0x7fb1f1a7c520 (unknown) (unknown)
(pid=4408) Fatal Python error: Aborted
(pid=4408)
(pid=4408)
(pid=4408) Extension modules: msgpack._cmsgpack, google.protobuf.pyext._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, charset_normalizer.md, uvloop.loop, ray._raylet (total: 9)
[2024-05-23 06:21:58,341 E 3961 4396] logging.cc:104: Stack trace:
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0xfebc9a) [0x7f4eac6c4c9a] ray::operator<<()
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0xfee3d8) [0x7f4eac6c73d8] ray::TerminateHandler()
/opt/conda/envs/py_3.10/bin/../lib/libstdc++.so.6(+0xb057e) [0x7f4eab5c257e] __cxxabiv1::__terminate()
/opt/conda/envs/py_3.10/bin/../lib/libstdc++.so.6(+0xb05d0) [0x7f4eab5c25d0] __cxxabiv1::__unexpected()
/opt/conda/envs/py_3.10/bin/../lib/libstdc++.so.6(__cxa_rethrow+0) [0x7f4eab5c27c2] __cxa_rethrow
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0x5572c2) [0x7f4eabc302c2] boost::throw_exception<>()
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0x10d7fbb) [0x7f4eac7b0fbb] boost::asio::detail::do_throw_error()
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0x10d89db) [0x7f4eac7b19db] boost::asio::detail::posix_thread::start_thread()
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0x10d8e3c) [0x7f4eac7b1e3c] boost::asio::thread_pool::thread_pool()
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0xa16fc4) [0x7f4eac0effc4] ray::rpc::(anonymous namespace)::_GetServerCallExecutor()
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray3rpc21GetServerCallExecutorEv+0x9) [0x7f4eac0f0059] ray::rpc::GetServerCallExecutor()
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvN3ray6StatusESt8functionIFvvEES4_EZNS0_3rpc14ServerCallImplINS6_24CoreWorkerServiceHandlerENS6_25GetCoreWorkerStatsRequestENS6_23GetCoreWorkerStatsReplyELNS6_8AuthTypeE0EE17HandleRequestImplEbEUlS1_S4_S4_E0_E9_M_invokeERKSt9_Any_dataOS1_OS4_SJ_+0x128) [0x7f4eabe155b8] std::_Function_handler<>::_M_invoke()
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker24HandleGetCoreWorkerStatsENS_3rpc25GetCoreWorkerStatsRequestEPNS2_23GetCoreWorkerStatsReplyESt8functionIFvNS_6StatusES6_IFvvEES9_EE+0x899) [0x7f4eabe57a79] ray::core::CoreWorker::HandleGetCoreWorkerStats()
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray3rpc14ServerCallImplINS0_24CoreWorkerServiceHandlerENS0_25GetCoreWorkerStatsRequestENS0_23GetCoreWorkerStatsReplyELNS0_8AuthTypeE0EE17HandleRequestImplEb+0x104) [0x7f4eabe4d864] ray::rpc::ServerCallImpl<>::HandleRequestImpl()
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0xa287ae) [0x7f4eac1017ae] EventTracker::RecordExecution()
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0xa21b9e) [0x7f4eac0fab9e] std::_Function_handler<>::_M_invoke()
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0xa22016) [0x7f4eac0fb016] boost::asio::detail::completion_handler<>::do_complete()
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0x10d564b) [0x7f4eac7ae64b] boost::asio::detail::scheduler::do_run_one()
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0x10d6fc9) [0x7f4eac7affc9] boost::asio::detail::scheduler::run()
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0x10d76d2) [0x7f4eac7b06d2] boost::asio::io_context::run()
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker12RunIOServiceEv+0xc9) [0x7f4eabe2eb49] ray::core::CoreWorker::RunIOService()
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/ray/_raylet.so(+0xb158c0) [0x7f4eac1ee8c0] thread_proxy
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f4ead483ac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f4ead515850]

*** SIGABRT received at time=1716445318 on cpu 35 ***
PC: @ 0x7f4ead4859fc (unknown) pthread_kill
@ 0x7f4ead431520 (unknown) (unknown)
[2024-05-23 06:21:58,342 E 3961 4396] logging.cc:361: *** SIGABRT received at time=1716445318 on cpu 35 ***
[2024-05-23 06:21:58,342 E 3961 4396] logging.cc:361: PC: @ 0x7f4ead4859fc (unknown) pthread_kill
[2024-05-23 06:21:58,342 E 3961 4396] logging.cc:361: @ 0x7f4ead431520 (unknown) (unknown)
Fatal Python error: Aborted

Extension modules: msgpack._cmsgpack, google.protobuf.pyext._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, charset_normalizer.md, uvloop.loop, ray._raylet, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, grpc._cython.cygrpc (total: 17)
Aborted (core dumped)

Versions / Dependencies

ray, version 2.9.3

Here are my ulimit settings in the container:

real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 8255160
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 65536
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Here are my ulimit settings on bare metal:

core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 8255160
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
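The "[system:11]" in the log is errno 11 (EAGAIN on Linux), raised while Boost.Asio was starting a thread. Since "max user processes" is unlimited in both ulimit listings, other limits on thread creation may be worth checking too. The sketch below is one way to collect them from Python; it is not a confirmed diagnosis. The file paths are assumptions about a typical Linux layout (cgroup v1 or v2) and any that do not exist are reported as None. The names LIMIT_FILES, read_limit, format_rlimit and thread_limits are hypothetical helpers for this example.

```python
import resource
from pathlib import Path

# Candidate limit files. Which ones exist depends on the kernel and on the
# cgroup version in use (these paths are assumptions, not guaranteed).
LIMIT_FILES = {
    "kernel threads-max": "/proc/sys/kernel/threads-max",
    "kernel pid_max": "/proc/sys/kernel/pid_max",
    "cgroup v2 pids.max": "/sys/fs/cgroup/pids.max",
    "cgroup v1 pids.max": "/sys/fs/cgroup/pids/pids.max",
}


def read_limit(path):
    """Return the stripped file contents, or None if the file is unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def format_rlimit(value):
    """Render an rlimit value the way `ulimit` does."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def thread_limits():
    """Collect limits that can cap the number of threads a process may create."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    report = {
        "RLIMIT_NPROC soft": format_rlimit(soft),
        "RLIMIT_NPROC hard": format_rlimit(hard),
    }
    for label, path in LIMIT_FILES.items():
        report[label] = read_limit(path)
    return report


if __name__ == "__main__":
    for label, value in thread_limits().items():
        print(f"{label}: {value}")
```

Running this both inside the container and on bare metal, and comparing the two outputs, may show a limit that `ulimit -a` does not display.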

Reproduction script

import ray
ray.init(num_cpus=20)

Issue Severity

High: It blocks me from completing my task.

guangzlu, May 23 '24 06:05