Container confs_9999_python_1 always restarts
Describe the bug: When I use docker compose to deploy the cluster, the container confs_9999_python_1 always restarts.
I experienced this issue in 1.4.2 as well (master branch). I notice the service on port 9380 is not running. Is there a known issue?
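A quick sanity check (container names assumed from the default compose project; adjust to yours) is to see whether the container is stuck in a restart loop and whether anything ever binds 9380:

docker ps --filter name=python --format '{{.Names}}\t{{.Status}}'   # a restart loop shows up as a very short uptime
docker exec confs-9999_python_1 ss -lntp | grep 9380                # only works while the container is up; use netstat if ss is missing from the image

If fate_flow never binds 9380, the startup traceback below explains why: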
docker logs -f confs-10000_python_1
static conf path: /data/projects/fate/eggroll/conf/eggroll.properties
Traceback (most recent call last):
File "./fate_flow/fate_flow_server.py", line 94, in <module>
session_utils.init_session_for_flow_server()
File "/data/projects/fate/python/fate_flow/utils/session_utils.py", line 60, in init_session_for_flow_server
options={"eggroll.session.processors.per.node": 1})
File "/data/projects/fate/python/arch/api/session.py", line 112, in init
RuntimeInstance.SESSION = builder.build_session()
File "/data/projects/fate/python/arch/api/impl/based_2x/build.py", line 38, in build_session
persistent_engine=self._persistent_engine, options=self._options)
File "/data/projects/fate/python/arch/api/impl/based_2x/session.py", line 45, in build_session
eggroll_session = build_eggroll_session(work_mode=work_mode, job_id=job_id, options=options)
File "/data/projects/fate/python/arch/api/impl/based_2x/session.py", line 36, in build_eggroll_session
return session_init(session_id=job_id, options=options)
File "/data/projects/fate/eggroll/python/eggroll/core/session.py", line 32, in session_init
er_session = ErSession(session_id=session_id, options=options)
File "/data/projects/fate/eggroll/python/eggroll/core/session.py", line 113, in __init__
self.__session_meta = self._cluster_manager_client.get_or_create_session(session_meta)
File "/data/projects/fate/eggroll/python/eggroll/core/client.py", line 176, in get_or_create_session
serdes_type=self.__serdes_type))
File "/data/projects/fate/eggroll/python/eggroll/core/client.py", line 230, in __check_processors
raise ValueError(f"processor in session meta is not valid: {session_meta}")
ValueError: processor in session meta is not valid: <ErSessionMeta(id=session_used_by_fate_flow_server_1cc6ed2cd97e11eaaee70242c0a70006, name=, status=ERROR, tag=, processors=[***, len=2], options=[{'eggroll.session.processors.per.node': '1', 'eggroll.session.deploy.mode': 'cluster'}]) at 0x7f6dd686b2b0>
Corresponding log from the cluster manager:
clustermanager_1 | [ERROR][191522][2020-08-11 05:47:19,157][grpc-server-4670-0,pid:1,tid:15][c.w.e.c.e.h.DefaultLoggingErrorHandler:109] -
clustermanager_1 | java.lang.reflect.InvocationTargetException: null
clustermanager_1 | at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_262]
clustermanager_1 | at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_262]
clustermanager_1 | at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_262]
clustermanager_1 | at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_262]
clustermanager_1 | at com.webank.eggroll.core.command.CommandRouter$.dispatch(CommandRouter.scala:153) ~[eggroll-core-2.0.1.jar:?]
clustermanager_1 | at com.webank.eggroll.core.command.CommandService.$anonfun$call$1(CommandService.scala:50) ~[eggroll-core-2.0.1.jar:?]
clustermanager_1 | at com.webank.eggroll.core.grpc.server.GrpcServerWrapper.wrapGrpcServerRunnable(GrpcServerWrapper.java:43) [eggroll-core-2.0.1.jar:?]
clustermanager_1 | at com.webank.eggroll.core.command.CommandService.call(CommandService.scala:41) [eggroll-core-2.0.1.jar:?]
clustermanager_1 | at com.webank.eggroll.core.command.CommandServiceGrpc$MethodHandlers.invoke(CommandServiceGrpc.java:209) [eggroll-core-2.0.1.jar:?]
clustermanager_1 | at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:171) [grpc-stub-1.22.2.jar:1.22.2]
clustermanager_1 | at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331) [grpc-core-1.22.2.jar:1.22.2]
clustermanager_1 | at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:808) [grpc-core-1.22.2.jar:1.22.2]
clustermanager_1 | at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) [grpc-core-1.22.2.jar:1.22.2]
clustermanager_1 | at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) [grpc-core-1.22.2.jar:1.22.2]
clustermanager_1 | at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]
clustermanager_1 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]
clustermanager_1 | at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]
clustermanager_1 | Caused by: com.webank.eggroll.core.error.ErSessionException: unable to start all processors for session id: 'session_used_by_fate_flow_server_b219df38db9511eaa0630242c0a70006'. Please check corresponding bootstrap logs at 'logs//session_used_by_fate_flow_server_b219df38db9511eaa0630242c0a70006' to check the reasons. Details:
clustermanager_1 | =================
clustermanager_1 | total processors: 2,
clustermanager_1 | started count: 1,
clustermanager_1 | not started count: 1,
clustermanager_1 | current active processors per node: TreeMap(nodemanager -> 1),
clustermanager_1 | not started processors and their nodes: TreeMap(3 -> nodemanager)
clustermanager_1 | at com.webank.eggroll.core.resourcemanager.SessionManagerService.getOrCreateSession(SessionManager.scala:172) ~[eggroll-core-2.0.1.jar:?]
clustermanager_1 | ... 17 more
clustermanager_1 | [INFO ][191646][2020-08-11 05:47:19,281][grpc-server-4670-0,pid:1,tid:15][c.w.e.c.c.CommandService:72] - [COMMAND] received v1/cluster-manager/session/getOrCreateSession
clustermanager_1 | [INFO ][191757][2020-08-11 05:47:19,392][grpc-server-4670-0,pid:1,tid:15][c.w.e.c.c.CommandService:72] - [COMMAND] received v1/cluster-manager/session/getOrCreateSession
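The exception already names the next place to look: the per-session bootstrap logs on the node manager. A hedged way to pull them (container name and session id here are examples taken from this thread; substitute your own):

docker exec confs-9999_nodemanager_1 sh -c 'cat /data/projects/fate/eggroll/logs/session_used_by_fate_flow_server_b219df38db9511eaa0630242c0a70006/bootstrap-egg_pair-*.err'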
Solution: edit confs-xx/confs/eggroll/conf/eggroll.properties and change this line:
eggroll.resourcemanager.bootstrap.roll_pair_master.javahome=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64/jre
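The correct path depends on the JDK build inside your image; one way to discover it in the clustermanager container (container name is an example):

docker exec confs-9999_clustermanager_1 sh -c 'readlink -f "$(which java)"'
# prints e.g. /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64/jre/bin/java
# drop the trailing /bin/java to get the value for ...javahome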
I have solved this. The cause was that the mysql container had exited. I also disabled SELinux.
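For anyone hitting the same pair of causes, a minimal check for both (on an SELinux-enabled host, with the compose project names from this thread):

docker ps -a --filter name=mysql --format '{{.Names}}\t{{.Status}}'   # 'Exited' here means eggroll's meta store is gone
docker logs --tail 50 confs-9999_mysql_1                              # shows why mysql exited
getenforce                                                            # 'Enforcing' means SELinux may be blocking volume access
sudo setenforce 0                                                     # permissive until reboot; edit /etc/selinux/config to persist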
I have the same problem; it is a headache.
Same log but a different cause; basically an eggroll subprocess error:
My cluster runs with:
- k3s
- a self-built image on arm64v8
- based on code release 1.8.0
Log files from pod nodemanager-0-xxx, container nodemanager-0-eggrollpair, under path logs//202209150646164580550_secure_add_example_0_0_guest_10001:
/data/projects/fate/eggroll/logs/202209150646164580550_secure_add_example_0_0_guest_10001 $ ll
total 28
-rw-r--r-- 1 root root 115 Sep 15 06:49 bootstrap-egg_pair-51.err
-rw-r--r-- 1 root root 11150 Sep 15 06:49 bootstrap-egg_pair-51.out
-rw-r--r-- 1 root root 791 Sep 15 06:46 egg_pair-51.err
-rw-r--r-- 1 root root 0 Sep 15 06:46 egg_pair-51.out
-rw-r--r-- 1 root root 44 Sep 15 06:46 pid.txt
-rw-r--r-- 1 root root 425 Sep 15 06:46 strace-51.log
Content of egg_pair-51.err:
bin/roll_pair/egg_pair_bootstrap.sh: line 196: None/bin/python: No such file or directory
Content of bootstrap-egg_pair-51.err:
egg_pair_bootstrap.sh: unrecognized option '--python-venv'
bin/roll_pair/egg_pair_bootstrap.sh: line 166: None/bin/activate: No such file or directory
bin/roll_pair/egg_pair_bootstrap.sh: line 181: None/bin/python: No such file or directory
strace: attach: ptrace(PTRACE_SEIZE, 135): No such process
kill: not enough arguments
Fix: the None/bin/python and None/bin/activate errors suggest the ${venv} variable in the bootstrap script expanded to the literal string None, i.e. no virtualenv path ever reached the script. Edit bin/roll_pair/egg_pair_bootstrap.sh inside each running pod container (pod: container):
- nodemanager-0-xxxx: nodemanager-0
- nodemanager-1-xxxx: nodemanager-1
replacing the venv branch with a hardcoded interpreter:
PYTHON=`which python`
# if [[ -z ${venv} ]]; then
# PYTHON=`which python`
# else
# source ${venv}/bin/activate
# PYTHON=${venv}/bin/python
# fi
Alternatively, edit the same file in the source tree and rebuild the Docker image.
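A less invasive variant (an assumption based on how ${venv} reaches the bootstrap script, not something verified in this thread) is to give eggroll a real interpreter path in conf/eggroll.properties so the script never receives the literal None; these property names appear in eggroll 2.x's default eggroll.properties, and the paths below are examples:

eggroll.resourcemanager.bootstrap.egg_pair.venv=/data/projects/python/venv
eggroll.resourcemanager.bootstrap.egg_pair.pythonpath=/data/projects/fate/python

With a valid venv set, the original source ${venv}/bin/activate branch works unmodified. To apply the in-pod patch instead, kubectl exec -it nodemanager-0-xxxx -c nodemanager-0 -- sh opens a shell in the right container.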