Python 3.9 Support
Describe the bug
I have several Seldon deployments across multiple versions. The pods are a 3-step graph with three different containers, using the python wrapper.
Occasionally, one of the containers fails to start and enters a CrashLoopBackOff state, after which it either recovers or keeps crashing, seemingly at random. Sometimes the only solution is to kill the container outright.
The error is the following:
2021-11-03 14:37:40,657 - seldon_core.microservice:main:206 - INFO: Starting microservice.py:main
2021-11-03 14:37:40,657 - seldon_core.microservice:main:207 - INFO: Seldon Core version: 1.11.2
2021-11-03 14:37:40,660 - seldon_core.microservice:main:362 - INFO: Parse JAEGER_EXTRA_TAGS []
2021-11-03 14:37:40,660 - seldon_core.microservice:load_annotations:158 - INFO: Found annotation buidId:38
2021-11-03 14:37:40,661 - seldon_core.microservice:load_annotations:158 - INFO: Found annotation buildJob:factory/webhooks/update-fused-seldon
2021-11-03 14:37:40,661 - seldon_core.microservice:load_annotations:158 - INFO: Found annotation deployment_version:v1
2021-11-03 14:37:40,661 - seldon_core.microservice:load_annotations:158 - INFO: Found annotation kubernetes.io/config.seen:2021-11-03T14:37:24.189445793Z
2021-11-03 14:37:40,661 - seldon_core.microservice:load_annotations:158 - INFO: Found annotation kubernetes.io/config.source:api
2021-11-03 14:37:40,661 - seldon_core.microservice:load_annotations:158 - INFO: Found annotation predictor_version:v1
2021-11-03 14:37:40,661 - seldon_core.microservice:load_annotations:158 - INFO: Found annotation prometheus.io/path:/prometheus
2021-11-03 14:37:40,661 - seldon_core.microservice:load_annotations:158 - INFO: Found annotation prometheus.io/scrape:true
2021-11-03 14:37:40,661 - seldon_core.microservice:load_annotations:158 - INFO: Found annotation sidecar.istio.io/inject:true
2021-11-03 14:37:40,661 - seldon_core.microservice:load_annotations:158 - INFO: Found annotation v1:
2021-11-03 14:37:40,661 - seldon_core.microservice:main:365 - INFO: Annotations: {'buidId': '38', 'buildJob': 'factory/webhooks/update-fused-seldon', 'deployment_version': 'v1', 'kubernetes.io/config.seen': '2021-11-03T14:37:24.189445793Z', 'kubernetes.io/config.source': 'api', 'predictor_version': 'v1', 'prometheus.io/path': '/prometheus', 'prometheus.io/scrape': 'true', 'sidecar.istio.io/inject': 'true', 'v1': ''}
2021-11-03 14:37:40,661 - seldon_core.microservice:main:369 - INFO: Importing Model
2021-11-03 14:37:43,548 - Model:__init__:13 - INFO: Clean transformer initiated
Process SyncManager-1:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.9/multiprocessing/managers.py", line 583, in _run_server
server = cls._Server(registry, address, authkey, serializer)
File "/usr/local/lib/python3.9/multiprocessing/managers.py", line 156, in __init__
self.listener = Listener(address=address, backlog=16)
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 453, in __init__
self._listener = SocketListener(address, family, backlog)
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 596, in __init__
self._socket.bind(address)
OSError: [Errno 98] Address already in use
Traceback (most recent call last):
File "/usr/local/bin/seldon-core-microservice", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.9/site-packages/seldon_core/microservice.py", line 385, in main
seldon_metrics = SeldonMetrics(worker_id_func=os.getpid)
File "/usr/local/lib/python3.9/site-packages/seldon_core/metrics.py", line 89, in __init__
self._manager = Manager()
File "/usr/local/lib/python3.9/multiprocessing/context.py", line 57, in Manager
m.start()
File "/usr/local/lib/python3.9/multiprocessing/managers.py", line 558, in start
self._address = reader.recv()
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 255, in recv
buf = self._recv_bytes()
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
buf = self._recv(4)
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
raise EOFError
EOFError
Exception ignored in: <function SeldonMetrics.__del__ at 0x7f76680523a0>
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/seldon_core/metrics.py", line 96, in __del__
self._manager.shutdown()
AttributeError: 'SeldonMetrics' object has no attribute '_manager'
Initially, it looked like a concurrency bug (since it happens randomly in any of the three containers, and not always), so I followed the documentation and set env flags on every container to reduce possible concurrency problems:
```yaml
- env:
  - name: LOG_LEVEL_ENV
    value: 'DEBUG'
  - name: FLASK_DEBUG
    value: '1'
  - name: GUNICORN_WORKERS
    value: '1'
  - name: GUNICORN_THREADS
    value: '1'
  - name: FLASK_SINGLE_THREADED
    value: '1'
  - name: SELDON_DEBUG
    value: '1'
```
I verified that the flags are indeed being applied.
I'm also following the `def load():` pattern from the python wrapper documentation.
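A minimal sketch of that pattern (the class name matches the `Model:__init__` line in the log above; it assumes the wrapper calls `load()` before serving, as the docs describe):

```python
# Sketch of the python-wrapper "load()" pattern; MODEL_NAME would point
# at this class. Heavy initialisation is deferred from __init__ to load().
class Model:
    def __init__(self):
        self._model = None  # keep __init__ light

    def load(self):
        self._model = object()  # placeholder for the real model loading

    def predict(self, X, features_names=None):
        # echo the input; a real model would run inference here
        return X
```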
So right now I'm out of ideas on how to debug this further and am looking for suggestions.
To reproduce
- Deploy seldon with a graph defined by 3 different images
- Restart the pod multiple times until the error happens
Environment
- Cloud Provider: GKE
#### Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.3", GitCommit:"c92036820499fedefec0f847e2054d824aea6cd1", GitTreeState:"clean", BuildDate:"2021-10-27T18:41:28Z", GoVersion:"go1.16.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.13-gke.1900", GitCommit:"ee714a7b695ca42b9bd0c8fe2c0159024cdcba5e", GitTreeState:"clean", BuildDate:"2021-08-11T09:19:42Z", GoVersion:"go1.15.13b5", Compiler:"gc", Platform:"linux/amd64"}
#### Seldon Images:
value: docker.io/seldonio/engine:1.11.2
value: docker.io/seldonio/seldon-core-executor:1.11.2
image: docker.io/seldonio/seldon-core-operator:1.11.2
Model Details
- kubectl get seldondeployment: fulldep.txt
- kubectl kustomize deployment: full_deploy.txt
Hard to say. This looks suspicious: OSError: [Errno 98] Address already in use. Is this happening during a CrashLoopBackOff of the Pod? Or under high load?
The CrashLoopBackOff happens at container level, without high load
The failing section appears to be at startup.
Here, when the microservice is starting: https://github.com/SeldonIO/seldon-core/blob/136e77a924518f0aabb399e340bfa8d8f947b00f/python/seldon_core/microservice.py#L390
it invokes the multiprocessing library: https://github.com/SeldonIO/seldon-core/blob/136e77a924518f0aabb399e340bfa8d8f947b00f/python/seldon_core/metrics.py#L89
which is what is unable to bind to the address for listening.
(...)
self._socket.bind(address)
OSError: [Errno 98] Address already in use
Now, I cannot figure out why multiprocessing cannot bind, or which address is already in use. SocketListener appears to be reusing the same address across containers. Is this a multiprocessing lib bug?
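A quick diagnostic sketch (not from the wrapper itself; something you could run inside each container) to see which address the Manager binds to:

```python
# Start a throwaway multiprocessing Manager and print the address it
# listens on; comparing this output across the pod's containers would
# show whether they pick the same address.
from multiprocessing.managers import SyncManager

m = SyncManager()
m.start()
print(repr(m.address))  # e.g. a ('localhost', port) pair or a unix-socket name
m.shutdown()
```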
This seems quite strange - not really an issue I have seen before. What is your current implementation? I have tried a couple of different variations but can't seem to replicate the same issue, which suggests it's most likely specific to the cluster where you are running these models; you may indeed be running into denied port access, in which case you would need to check your network permissions.
This is an example we used to try to replicate:
```
kubectl apply -f - << END
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: seldon-model
spec:
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - image: seldonio/mock_classifier:1.11.2
          name: classifier
          env:
          - name: LOG_LEVEL_ENV
            value: 'DEBUG'
          - name: FLASK_DEBUG
            value: '1'
          - name: GUNICORN_WORKERS
            value: '1'
          - name: GUNICORN_THREADS
            value: '1'
          - name: FLASK_SINGLE_THREADED
            value: '1'
          - name: SELDON_DEBUG
            value: '1'
        - image: seldonio/mock_classifier:1.11.2
          name: classifier-2
          env:
          - name: LOG_LEVEL_ENV
            value: 'DEBUG'
          - name: FLASK_DEBUG
            value: '1'
          - name: GUNICORN_WORKERS
            value: '1'
          - name: GUNICORN_THREADS
            value: '1'
          - name: FLASK_SINGLE_THREADED
            value: '1'
          - name: SELDON_DEBUG
            value: '1'
    graph:
      children:
      - endpoint:
          type: REST
        name: classifier-2
        type: MODEL
        children: []
      endpoint:
        type: REST
      name: classifier
      type: MODEL
    name: default
    replicas: 1
END
```
If you can replicate with a simple model we would be able to validate from our side and identify what the fix is, but it does seem to point to this most likely being your network.
Thank you for your code snippet. I tried to replicate the issue using it, and I couldn't trigger it.
After a few more rounds of experimentation, I ended up recreating the seldonio/mock_classifier:1.11.2 image like so:
```dockerfile
FROM python:3.7.12-slim
RUN pip install --no-cache-dir seldon-core==1.11.2 numpy
WORKDIR /microservice
COPY MeanClassifier.py /microservice/MeanClassifier.py
ENV MODEL_NAME MeanClassifier
ENV SERVICE_TYPE MODEL
ENV SELDON_LOG_LEVEL DEBUG
ENV GUNICORN_WORKERS 1
CMD exec seldon-core-microservice $MODEL_NAME --service-type $SERVICE_TYPE --log-level ${SELDON_LOG_LEVEL} --debug
```
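For completeness, a minimal MeanClassifier.py that would satisfy this Dockerfile (a sketch; the thread doesn't include the file, and any wrapper class should do since the crash happens at startup, before serving):

```python
# Minimal stand-in for the MeanClassifier.py the Dockerfile copies in.
import logging

import numpy as np

logger = logging.getLogger(__name__)


class MeanClassifier:
    def predict(self, X, features_names=None):
        logger.info("Predict called")
        return np.mean(np.asarray(X))  # toy "mean" response
```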
Which also didn't trigger the issue. However, upgrading the Dockerfile's Python version from
FROM python:3.7.12-slim
to
FROM python:3.9.7-slim
does trigger the issue.
I've also tested python:3.8.12-slim, which didn't trigger the issue.
It looks like the Python version is related to seldon-core-microservice not being able to start. There must be some incompatibility with Python 3.9.x, or a Python bug.
Can you please verify if using this Dockerfile image with python:3.9.7-slim triggers the issue on your side?
(one or two Seldon pod restarts are enough to trigger it)
Ah ok that provides a bit more insight into why the issue must be happening. One thing that I have to say is that we are currently looking to add "maintained support" for Python 3.8 via https://github.com/SeldonIO/seldon-core/pull/3595, which is still a preliminary step towards 3.9 support. We have validated that 3.8 is now supported, as you can see in that PR, so our recommendation for now would be to use a 3.8 Python base image instead. We will be looking to add support for 3.9 in the near future, but as a current workaround that may be the best way forward. Are you able to proceed with this suggestion?
Yes. I'll downgrade to python 3.8.x as a workaround
I've also confirmed that the metrics Manager:
https://github.com/SeldonIO/seldon-core/blob/136e77a924518f0aabb399e340bfa8d8f947b00f/python/seldon_core/metrics.py#L89
is choosing the same address across multiple containers, leading to one of them failing to start because the address is already bound by another container in the same pod.
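For illustration (assuming Python 3.9 on Linux): the address multiprocessing picks for its listener is derived only from the current pid and a per-process counter, and containers in a pod share the network namespace where abstract unix sockets live, so two wrapper processes that happen to get the same pid compute the identical name:

```python
# Illustration, assuming Python 3.9 on Linux: the listener name depends
# only on os.getpid() and a counter that starts at zero in every process,
# so two containers whose wrapper runs with the same pid (common in a pod)
# produce the same abstract-socket name in the shared network namespace.
from multiprocessing.connection import arbitrary_address

print(arbitrary_address("AF_UNIX"))  # e.g. '\x00listener-1-0'
```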
Ok great. Interesting, I think it does sound like a reasonable formal fix to ensure that ports are chosen uniquely - however, having said that, we would have to make sure that support for Python 3.9 in general is covered as a broader effort that may involve more than just this fix. I'll update this issue to focus primarily on this initial step of ensuring non-overlapping ports.
Hi, any news on this issue? I just discovered it when hitting exactly the same problem trying a ROUTER setup with containers running Python 3.9.13.
This issue is caused by this commit, which has recently been removed on master but has not yet been released.
If you don't want to downgrade Python like me, you can monkey patch multiprocessing.connection.arbitrary_address like:
```python
import os  # needed for the AF_PIPE branch below
import tempfile
from multiprocessing import connection, util
from multiprocessing.connection import _mmap_counter


def arbitrary_address(family):
    """
    Return an arbitrary free address for the given family
    """
    if family == "AF_INET":
        return ("localhost", 0)
    elif family == "AF_UNIX":
        # Restore the pre-3.9 behaviour: a socket file in the temp dir
        # instead of an abstract socket name shared across the pod.
        return tempfile.mktemp(prefix="listener-", dir=util.get_temp_dir())
    elif family == "AF_PIPE":
        return tempfile.mktemp(
            prefix=r"\\.\pipe\pyc-%d-%d-" % (os.getpid(), next(_mmap_counter)), dir=""
        )
    else:
        raise ValueError("unrecognized family")


# Replace the module-level function so Listener picks up the patched version.
connection.arbitrary_address = arbitrary_address
```
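For the patch to take effect it has to run before seldon-core-microservice creates the SeldonMetrics Manager. Judging by the startup log above, the model module is imported (microservice.py:main line 369) before SeldonMetrics is constructed (line 385), so placing this snippet at module level of your model file should be early enough.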
Please use MLServer, which supports 3.9 and 3.10.
Upgrading to Python 3.9.16 solves this (for those googling this :) )