azure-sdk-for-python AksEndpoint Versioning error (syncing with inference cluster)

Package Name: azureml
Package Version: 1.39.0
Operating System: Linux
Python Version: 3.9.10

Describe the bug

When two endpoints are deployed through AksEndpoint with the same version name (for example, blue), one endpoint overrides the other in the k8s cluster.

Both endpoints are shown as healthy in ML studio once deployment has completed.
The deployment logs disappear from ML studio.
In the Azure Portal view of the Kubernetes service, only one blue appears under Services and Ingresses.
The URL for the blue version differs from what was expected.

To Reproduce

Steps to reproduce the behavior:

Create one endpoint named test1 with version blue, attached to the inference cluster test-cluster.
Create another endpoint named test2 with version blue, attached to the same inference cluster test-cluster.
Check the status of the endpoints in ML studio.
Go to Azure Portal, find the inference cluster test-cluster, and go to Services and Ingresses.
Check the URL for the specific version blue.
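
The clash in the steps above comes from two endpoints on the same cluster sharing a version name. Below is a minimal sketch of a pre-deployment check, assuming a planned rollout is represented as (endpoint name, version name) pairs. The helper `find_shared_version_names` is hypothetical and not part of azureml; it only flags the condition described in this report.

```python
from collections import defaultdict


def find_shared_version_names(planned):
    """Map each version name used by more than one endpoint to those endpoints.

    `planned` is an iterable of (endpoint_name, version_name) pairs for a
    single inference cluster (an assumed representation, not an SDK type).
    """
    endpoints_by_version = defaultdict(set)
    for endpoint_name, version_name in planned:
        endpoints_by_version[version_name].add(endpoint_name)
    # Keep only the version names that appear under two or more endpoints.
    return {
        version: sorted(endpoints)
        for version, endpoints in endpoints_by_version.items()
        if len(endpoints) > 1
    }


planned = [("test1", "blue"), ("test2", "blue")]
print(find_shared_version_names(planned))
```

An empty result means no version name is reused across endpoints in the plan; it does not say anything about what is already deployed on the cluster.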

Deployment script:

import os

from azureml.core import Workspace
from azureml.core.authentication import ServicePrincipalAuthentication
from azureml.core.compute import ComputeTarget
from azureml.core.environment import Environment
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AksEndpoint
from dotenv import load_dotenv

load_dotenv(override=True)
subscription_id = os.getenv("subscriptionId")
tenant_id = os.getenv("tenantId")
client_id = os.getenv("clientId")
client_secret = os.getenv("clientSecret")
workspace = os.getenv("workspace")
resource_group = os.getenv("resource_group")

authentication = ServicePrincipalAuthentication(
    tenant_id=tenant_id,
    service_principal_id=client_id,
    service_principal_password=client_secret,
)
workspace = Workspace.get(
    name=workspace,
    subscription_id=subscription_id,
    resource_group=resource_group,
    auth=authentication,
)

deployment_config = AksEndpoint.deploy_configuration(
    cpu_cores=2,
    memory_gb=2,
    description="test",
    traffic_percentile=100,
    version_name="blue",
)

myenv = Environment.from_conda_specification(
    name="myenv",
    file_path="./env.yaml",
)

inference_config = InferenceConfig(
    entry_script="test_inference.py",
    source_directory="./deploy",
    environment=myenv,
)

deployment_target = ComputeTarget(workspace=workspace, name="test")

webservice = Model.deploy(
    workspace=workspace,
    name="test1",
    models=[],
    inference_config=inference_config,
    deployment_config=deployment_config,
    deployment_target=deployment_target,
    overwrite=True,
)

webservice.wait_for_deployment(show_output=True)

webservice = Model.deploy(
    workspace=workspace,
    name="test2",
    models=[],
    inference_config=inference_config,
    deployment_config=deployment_config,
    deployment_target=deployment_target,
    overwrite=True,
)

webservice.wait_for_deployment(show_output=True)

test_inference.py:

import json
import logging


def init():
    # No model is loaded; this script only echoes the request payload.
    logging.info("Init Complete")


def run(raw_data):
    try:
        logging.info("Request received")
        data = json.loads(raw_data)["data"]

        return data

    except Exception as e:
        error = str(e)
        return error

Expected behavior

Different endpoints attached to the same inference cluster should be able to use the same version name.
The scoring URIs should be /api/v1/service/test1/blue/score and /api/v1/service/test2/blue/score, whereas currently both test1 and test2 get /api/v1/service/blue/score.
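
To make the difference concrete, here is a small sketch that builds both path shapes quoted above and counts the distinct routes. The two helper functions are hypothetical illustrations of the paths in this report, not azureml APIs, and they do not describe how the cluster actually constructs its routes.

```python
def expected_scoring_path(endpoint_name, version_name):
    """Path shape the report expects: endpoint name and version name."""
    return f"/api/v1/service/{endpoint_name}/{version_name}/score"


def observed_scoring_path(version_name):
    """Path shape the report observes: version name only."""
    return f"/api/v1/service/{version_name}/score"


deployments = [("test1", "blue"), ("test2", "blue")]

expected = {expected_scoring_path(e, v) for e, v in deployments}
observed = {observed_scoring_path(v) for _, v in deployments}

# Two deployments yield two expected paths but a single observed path,
# so both endpoints end up behind the same route.
print(sorted(expected))
print(sorted(observed))
```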

Screenshots

View from ML studio / Endpoints
View from Azure Portal / Services and Ingresses
View from ML studio / Endpoints / test1
View from ML studio / Endpoints / test2

Dec 28 '22 22:12 hao-happify

Hi @hao-happify, thank you for opening an issue! I'll tag some folks who should be able to help, and we'll get back to you as soon as possible. @luigiw @azureml-github

Dec 28 '22 22:12 mccoyp

Hello @mccoyp, is there any update on this? I also found that, with Python SDK v1, the service goes down when I add another version, which causes the safe rollout to fail. Is this because versioning and safe rollout are no longer supported in v1? I couldn't find the safe rollout documentation for v1, since the docs have all been upgraded to v2.

Jan 19 '23 15:01 hao-happify

@hao-happify thank you for the ping on this; I'll alert the ML team again and get in contact directly. @luigiw @azureml-github

Jan 19 '23 18:01 mccoyp

@hao-happify As a product improvement, we have released a new version of azureml-fe, and we are now transparently upgrading azureml-fe from v1 to v2 in our customers' AKS clusters. However, the v1 AKSEndpoint, which was only in preview and never reached GA, has been deprecated and will not be supported with the new features/capabilities. As a result, azureml-fe v2 doesn't support routing traffic to the v1 AKSEndpoint.

We recommend that you stop using the v1 AKSEndpoint (which never reached GA) and avoid building production workloads on a preview feature, to prevent incompatibilities.

Note that there are three options to mitigate this issue:

If you'd like to use endpoints, you can migrate directly to our v2 stack and use SDK/CLI v2 to create the online endpoint in v2.
If you'd like to stay on v1 while using the new azureml-fe v2, you have to use the v1 webservice instead of the deprecated v1 endpoint, since azureml-fe v2 doesn't support v1 endpoint traffic routing.
If you'd like to continue using the v1 AKSEndpoint, we can roll azureml-fe back to v1 in your clusters. In this case, the azureml-fe version will be pinned at v1, which means you will not get performance improvements or new feature support in the future. To roll back azureml-fe, please send an email to [email protected], stating that you want to pin your fe version to v1 and providing the AKS cluster resource ID.
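
For the second option, one possible sketch of adapting the deployment script from the original report: it drops the two endpoint-specific arguments that script passes (`traffic_percentile` and `version_name`) and builds the configuration with `AksWebservice.deploy_configuration` from azureml-core v1 instead of `AksEndpoint.deploy_configuration`. The helper names `webservice_config_kwargs` and `build_webservice_config` are hypothetical, and this has not been verified against a cluster running azureml-fe v2.

```python
# Arguments the original script passes to AksEndpoint.deploy_configuration
# that only make sense for versioned endpoints.
ENDPOINT_ONLY_ARGS = {"traffic_percentile", "version_name"}


def webservice_config_kwargs(endpoint_kwargs):
    """Return a copy of endpoint_kwargs without the endpoint-only arguments."""
    return {k: v for k, v in endpoint_kwargs.items() if k not in ENDPOINT_ONLY_ARGS}


def build_webservice_config(endpoint_kwargs):
    """Build an AksWebservice deployment configuration (needs azureml-core v1)."""
    # Imported lazily so the pure helper above works without the SDK installed.
    from azureml.core.webservice import AksWebservice

    return AksWebservice.deploy_configuration(**webservice_config_kwargs(endpoint_kwargs))


endpoint_kwargs = {
    "cpu_cores": 2,
    "memory_gb": 2,
    "description": "test",
    "traffic_percentile": 100,
    "version_name": "blue",
}
print(webservice_config_kwargs(endpoint_kwargs))
```

The object returned by `build_webservice_config(endpoint_kwargs)` would then be passed as `deployment_config` to the same `Model.deploy` calls, with test1 and test2 deployed as separate webservices rather than as endpoint versions.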

Jan 20 '23 08:01 jiaochenlu

Hi @hao-happify. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text “/unresolve” to remove the “issue-addressed” label and continue the conversation.

Jan 20 '23 15:01 ghost

@jiaochenlu

Thank you for the explanation and the recommendation. I will try to upgrade to v2 instead.

Jan 20 '23 15:01 hao-happify

Hi @hao-happify, since you haven’t asked that we “/unresolve” the issue, we’ll close this out. If you believe further discussion is needed, please add a comment “/unresolve” to reopen the issue.

Jan 27 '23 16:01 ghost