Python 3.9 Support
Describe the bug
I have several Seldon deployments across multiple versions. The pods are a 3-step graph with three different containers, using the python wrapper.
Occasionally, one of the containers fails to start and enters a CrashLoopBackOff state, after which it either recovers or keeps crashing, seemingly at random. Sometimes the only solution is to kill the container outright.
The error is the following:
2021-11-03 14:37:40,657 - seldon_core.microservice:main:206 - INFO: Starting microservice.py:main
2021-11-03 14:37:40,657 - seldon_core.microservice:main:207 - INFO: Seldon Core version: 1.11.2
2021-11-03 14:37:40,660 - seldon_core.microservice:main:362 - INFO: Parse JAEGER_EXTRA_TAGS []
2021-11-03 14:37:40,660 - seldon_core.microservice:load_annotations:158 - INFO: Found annotation buidId:38
2021-11-03 14:37:40,661 - seldon_core.microservice:load_annotations:158 - INFO: Found annotation buildJob:factory/webhooks/update-fused-seldon
2021-11-03 14:37:40,661 - seldon_core.microservice:load_annotations:158 - INFO: Found annotation deployment_version:v1
2021-11-03 14:37:40,661 - seldon_core.microservice:load_annotations:158 - INFO: Found annotation kubernetes.io/config.seen:2021-11-03T14:37:24.189445793Z
2021-11-03 14:37:40,661 - seldon_core.microservice:load_annotations:158 - INFO: Found annotation kubernetes.io/config.source:api
2021-11-03 14:37:40,661 - seldon_core.microservice:load_annotations:158 - INFO: Found annotation predictor_version:v1
2021-11-03 14:37:40,661 - seldon_core.microservice:load_annotations:158 - INFO: Found annotation prometheus.io/path:/prometheus
2021-11-03 14:37:40,661 - seldon_core.microservice:load_annotations:158 - INFO: Found annotation prometheus.io/scrape:true
2021-11-03 14:37:40,661 - seldon_core.microservice:load_annotations:158 - INFO: Found annotation sidecar.istio.io/inject:true
2021-11-03 14:37:40,661 - seldon_core.microservice:load_annotations:158 - INFO: Found annotation v1:
2021-11-03 14:37:40,661 - seldon_core.microservice:main:365 - INFO: Annotations: {'buidId': '38', 'buildJob': 'factory/webhooks/update-fused-seldon', 'deployment_version': 'v1', 'kubernetes.io/config.seen': '2021-11-03T14:37:24.189445793Z', 'kubernetes.io/config.source': 'api', 'predictor_version': 'v1', 'prometheus.io/path': '/prometheus', 'prometheus.io/scrape': 'true', 'sidecar.istio.io/inject': 'true', 'v1': ''}
2021-11-03 14:37:40,661 - seldon_core.microservice:main:369 - INFO: Importing Model
2021-11-03 14:37:43,548 - Model:__init__:13 - INFO: Clean transformer initiated
Process SyncManager-1:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.9/multiprocessing/managers.py", line 583, in _run_server
server = cls._Server(registry, address, authkey, serializer)
File "/usr/local/lib/python3.9/multiprocessing/managers.py", line 156, in __init__
self.listener = Listener(address=address, backlog=16)
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 453, in __init__
self._listener = SocketListener(address, family, backlog)
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 596, in __init__
self._socket.bind(address)
OSError: [Errno 98] Address already in use
Traceback (most recent call last):
File "/usr/local/bin/seldon-core-microservice", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.9/site-packages/seldon_core/microservice.py", line 385, in main
seldon_metrics = SeldonMetrics(worker_id_func=os.getpid)
File "/usr/local/lib/python3.9/site-packages/seldon_core/metrics.py", line 89, in __init__
self._manager = Manager()
File "/usr/local/lib/python3.9/multiprocessing/context.py", line 57, in Manager
m.start()
File "/usr/local/lib/python3.9/multiprocessing/managers.py", line 558, in start
self._address = reader.recv()
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 255, in recv
buf = self._recv_bytes()
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
buf = self._recv(4)
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
raise EOFError
EOFError
Exception ignored in: <function SeldonMetrics.__del__ at 0x7f76680523a0>
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/seldon_core/metrics.py", line 96, in __del__
self._manager.shutdown()
AttributeError: 'SeldonMetrics' object has no attribute '_manager'
Initially, it looked like a concurrency bug (since it happens randomly in any of the three containers, and not always), so I followed the documentation and set env flags on every container to reduce possible concurrency problems:
```yaml
- env:
  - name: LOG_LEVEL_ENV
    value: 'DEBUG'
  - name: FLASK_DEBUG
    value: '1'
  - name: GUNICORN_WORKERS
    value: '1'
  - name: GUNICORN_THREADS
    value: '1'
  - name: FLASK_SINGLE_THREADED
    value: '1'
  - name: SELDON_DEBUG
    value: '1'
```
I verified that the flags are indeed being applied.
I'm also following the `def load():` pattern from the python wrapper documentation.
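A minimal sketch of that pattern (the class name matches the `Model:__init__` line in the log above; it assumes the wrapper calls `load()` before serving, as the docs describe):

```python
# Sketch of the python-wrapper "load()" pattern; MODEL_NAME would point
# at this class. Heavy initialisation is deferred from __init__ to load().
class Model:
    def __init__(self):
        self._model = None  # keep __init__ light

    def load(self):
        self._model = object()  # placeholder for the real model loading

    def predict(self, X, features_names=None):
        # echo the input; a real model would run inference here
        return X
```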
So right now I'm out of ideas on how to debug this further and am looking for suggestions.
To reproduce
- Deploy seldon with a graph defined by 3 different images
- Restart the pod multiple times until the error happens
Environment
- Cloud Provider: GKE
#### Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.3", GitCommit:"c92036820499fedefec0f847e2054d824aea6cd1", GitTreeState:"clean", BuildDate:"2021-10-27T18:41:28Z", GoVersion:"go1.16.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.13-gke.1900", GitCommit:"ee714a7b695ca42b9bd0c8fe2c0159024cdcba5e", GitTreeState:"clean", BuildDate:"2021-08-11T09:19:42Z", GoVersion:"go1.15.13b5", Compiler:"gc", Platform:"linux/amd64"}
#### Seldon Images:
value: docker.io/seldonio/engine:1.11.2
value: docker.io/seldonio/seldon-core-executor:1.11.2
image: docker.io/seldonio/seldon-core-operator:1.11.2
Model Details
- kubectl get seldondeployment: fulldep.txt
- kubectl kustomize deployment: full_deploy.txt
Hard to say. This looks suspicious: OSError: [Errno 98] Address already in use. Is this happening during a CrashLoopBackOff of the Pod? Or under high load?
The CrashLoopBackOff happens at container level, without high load
The failing section appears to be at startup.
Here, when the microservice is starting: https://github.com/SeldonIO/seldon-core/blob/136e77a924518f0aabb399e340bfa8d8f947b00f/python/seldon_core/microservice.py#L390
it invokes the multiprocessing library: https://github.com/SeldonIO/seldon-core/blob/136e77a924518f0aabb399e340bfa8d8f947b00f/python/seldon_core/metrics.py#L89
which is what is unable to bind to the address for listening.
(...)
self._socket.bind(address)
OSError: [Errno 98] Address already in use
Now, I cannot figure out why multiprocessing cannot bind, or which address is already in use. SocketListener appears to be reusing the same address across containers. Is this a multiprocessing lib bug?
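A quick diagnostic sketch (not from the wrapper itself; something you could run inside each container) to see which address the Manager binds to:

```python
# Start a throwaway multiprocessing Manager and print the address it
# listens on; comparing this output across the pod's containers would
# show whether they pick the same address.
from multiprocessing.managers import SyncManager

m = SyncManager()
m.start()
print(repr(m.address))  # e.g. a ('localhost', port) pair or a unix-socket name
m.shutdown()
```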
This seems quite strange - not really an issue I have seen before. What is your current implementation? I have tried a couple of different variations but can't seem to replicate the same issue, which suggests it's most likely specific to the cluster where you are running these models; you may indeed be running into denied port access, in which case you would need to check your network permissions.
This is an example we used to try to replicate:
```
kubectl apply -f - << END
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: seldon-model
spec:
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - image: seldonio/mock_classifier:1.11.2
          name: classifier
          env:
          - name: LOG_LEVEL_ENV
            value: 'DEBUG'
          - name: FLASK_DEBUG
            value: '1'
          - name: GUNICORN_WORKERS
            value: '1'
          - name: GUNICORN_THREADS
            value: '1'
          - name: FLASK_SINGLE_THREADED
            value: '1'
          - name: SELDON_DEBUG
            value: '1'
        - image: seldonio/mock_classifier:1.11.2
          name: classifier-2
          env:
          - name: LOG_LEVEL_ENV
            value: 'DEBUG'
          - name: FLASK_DEBUG
            value: '1'
          - name: GUNICORN_WORKERS
            value: '1'
          - name: GUNICORN_THREADS
            value: '1'
          - name: FLASK_SINGLE_THREADED
            value: '1'
          - name: SELDON_DEBUG
            value: '1'
    graph:
      children:
      - endpoint:
          type: REST
        name: classifier-2
        type: MODEL
        children: []
      endpoint:
        type: REST
      name: classifier
      type: MODEL
    name: default
    replicas: 1
END
```
If you can replicate with a simple model we would be able to validate from our side and identify what the fix is, but it does seem to point to this most likely being your network.
Thank you for your code snippet. I tried to replicate the issue using it, and I couldn't trigger it.
After a few more rounds of experimentation, I ended up recreating the seldonio/mock_classifier:1.11.2 image like so:
```dockerfile
FROM python:3.7.12-slim
RUN pip install --no-cache-dir seldon-core==1.11.2 numpy
WORKDIR /microservice
COPY MeanClassifier.py /microservice/MeanClassifier.py
ENV MODEL_NAME MeanClassifier
ENV SERVICE_TYPE MODEL
ENV SELDON_LOG_LEVEL DEBUG
ENV GUNICORN_WORKERS 1
CMD exec seldon-core-microservice $MODEL_NAME --service-type $SERVICE_TYPE --log-level ${SELDON_LOG_LEVEL} --debug
```
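For completeness, a minimal MeanClassifier.py that would satisfy this Dockerfile (a sketch; the thread doesn't include the file, and any wrapper class should do since the crash happens at startup, before serving):

```python
# Minimal stand-in for the MeanClassifier.py the Dockerfile copies in.
import logging

import numpy as np

logger = logging.getLogger(__name__)


class MeanClassifier:
    def predict(self, X, features_names=None):
        logger.info("Predict called")
        return np.mean(np.asarray(X))  # toy "mean" response
```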
Which also didn't trigger the issue. However, upgrading the Dockerfile's Python version from
FROM python:3.7.12-slim
to
FROM python:3.9.7-slim
does trigger the issue.
I've also tested python:3.8.12-slim, which didn't trigger the issue.
It looks like the Python version is related to seldon-core-microservice not being able to start. There must be some incompatibility with Python 3.9.x, or a Python bug.
Can you please verify if using this Dockerfile image with python:3.9.7-slim triggers the issue on your side?
(one or two Seldon pod restarts are enough to trigger it)
Ah ok that provides a bit more insight into why the issue must be happening. One thing that I have to say is that we are currently looking to add "maintained support" for Python 3.8 via https://github.com/SeldonIO/seldon-core/pull/3595, which is still a preliminary step towards 3.9 support. We have validated that 3.8 is now supported, as you can see in that PR, so our recommendation for now would be to use a 3.8 Python base image instead. We will be looking to add support for 3.9 in the near future, but as a current workaround that may be the best way forward. Are you able to proceed with this suggestion?
Yes. I'll downgrade to python 3.8.x as a workaround
I've also confirmed that the metrics Manager:
https://github.com/SeldonIO/seldon-core/blob/136e77a924518f0aabb399e340bfa8d8f947b00f/python/seldon_core/metrics.py#L89
is choosing the same address across multiple containers, leading to one of them failing to start because the address is already bound by another container in the same pod.
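For illustration (assuming Python 3.9 on Linux): the address multiprocessing picks for its listener is derived only from the current pid and a per-process counter, and containers in a pod share the network namespace where abstract unix sockets live, so two wrapper processes that happen to get the same pid compute the identical name:

```python
# Illustration, assuming Python 3.9 on Linux: the listener name depends
# only on os.getpid() and a counter that starts at zero in every process,
# so two containers whose wrapper runs with the same pid (common in a pod)
# produce the same abstract-socket name in the shared network namespace.
from multiprocessing.connection import arbitrary_address

print(arbitrary_address("AF_UNIX"))  # e.g. '\x00listener-1-0'
```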
Ok great. Interesting, I think it does sound like a reasonable formal fix to ensure that ports are chosen uniquely - however, having said that, we would have to make sure that support for Python 3.9 in general is covered as a broader effort that may involve more than just this fix. I'll update this issue to focus primarily on this initial step of ensuring non-overlapping ports.
Hi, any news on this issue? I just discovered it when hitting exactly the same problem trying a ROUTER setup with containers running Python 3.9.13.
This issue is caused by this commit, which has recently been removed on master but has not yet been released.
If you don't want to downgrade Python like me, you can monkey patch multiprocessing.connection.arbitrary_address like:
```python
import os  # needed for the AF_PIPE branch below
import tempfile
from multiprocessing import connection, util
from multiprocessing.connection import _mmap_counter


def arbitrary_address(family):
    """
    Return an arbitrary free address for the given family
    """
    if family == "AF_INET":
        return ("localhost", 0)
    elif family == "AF_UNIX":
        # Restore the pre-3.9 behaviour: a socket file in the temp dir
        # instead of an abstract socket name shared across the pod.
        return tempfile.mktemp(prefix="listener-", dir=util.get_temp_dir())
    elif family == "AF_PIPE":
        return tempfile.mktemp(
            prefix=r"\\.\pipe\pyc-%d-%d-" % (os.getpid(), next(_mmap_counter)), dir=""
        )
    else:
        raise ValueError("unrecognized family")


# Replace the module-level function so Listener picks up the patched version.
connection.arbitrary_address = arbitrary_address
```
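For the patch to take effect it has to run before seldon-core-microservice creates the SeldonMetrics Manager. Judging by the startup log above, the model module is imported (microservice.py:main line 369) before SeldonMetrics is constructed (line 385), so placing this snippet at module level of your model file should be early enough.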
Please use MLServer, which supports 3.9 and 3.10.
Upgrading to Python 3.9.16 solves this (for those googling this :) )