azure-sdk-for-python icon indicating copy to clipboard operation
azure-sdk-for-python copied to clipboard

AksEndpoint Versioning error (syncing with inference cluster)

Open hao-happify opened this issue 3 years ago • 1 comments

  • Package Name: azureml
  • Package Version: 1.39.0
  • Operating System: Linux
  • Python Version: 3.9.10

Describe the bug While deploying two endpoints through AksEndpoint with the same version name, for example blue, one endpoint would override the other endpoint in k8s cluster.

  1. The state of the endpoints are shown healthy after fully deployment in ML studio.
  2. The deployment logs would be gone in ML studio.
  3. In kubernetes service Azure Portal, there would be only one blue in Services and Ingresses.
  4. The url link for the blue version is different from expected.

To Reproduce Steps to reproduce the behavior:

  1. Create one endpoint naming test1 with version blue, attaching to the inference cluster test-cluster.
  2. Create another endpoint naming test2 with version blue, attaching to the inference cluster test-cluster.
  3. Check status of the endpoints in ML studio.
  4. Go to Azure Portal, find the inference cluster test-cluster, and go to Services and Ingresses.
  5. Check the url with specific version blue
  • Deployment script:
import os

from azureml.core import Workspace
from azureml.core.authentication import ServicePrincipalAuthentication
from azureml.core.compute import ComputeTarget
from azureml.core.environment import Environment
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AksEndpoint, LocalWebservice, Webservice
from azureml.exceptions import ComputeTargetException
from dotenv import dotenv_values, load_dotenv

load_dotenv(override=True)
subscription_id = os.getenv("subscriptionId")
tenant_id = os.getenv("tenantId")
client_id = os.getenv("clientId")
client_secret = os.getenv("clientSecret")
workspace = os.getenv("workspace")
resource_group = os.getenv("resource_group")

authentication = ServicePrincipalAuthentication(
    tenant_id=tenant_id,
    service_principal_id=client_id,
    service_principal_password=client_secret,
)
workspace = Workspace.get(
    name=workspace,
    subscription_id=subscription_id,
    resource_group=resource_group,
    auth=authentication,
)

deployment_config = AksEndpoint.deploy_configuration(
    cpu_cores=2,
    memory_gb=2,
    description="test",
    traffic_percentile=100,
    version_name="blue",
)

myenv = Environment.from_conda_specification(
    name="myenv",
    file_path="./env.yaml",
)

inference_config = InferenceConfig(
    entry_script="test_inference.py",
    source_directory="./deploy",
    environment=myenv,
)

deployment_target = ComputeTarget(workspace=workspace, name="test")

webservice = Model.deploy(
    workspace=workspace,
    name="test1",
    models=[],
    inference_config=inference_config,
    deployment_config=deployment_config,
    deployment_target=deployment_target,
    overwrite=True,
)

webservice.wait_for_deployment(show_output=True)

webservice = Model.deploy(
    workspace=workspace,
    name="test2",
    models=[],
    inference_config=inference_config,
    deployment_config=deployment_config,
    deployment_target=deployment_target,
    overwrite=True,
)

webservice.wait_for_deployment(show_output=True)
  • test_inference.py:
import os
import logging
import pickle
import json
import numpy


def init():
    global model
    logging.info("Init Complete")


def run(raw_data):
    try:
        logging.info("Request received")
        data = json.loads(raw_data)["data"]

        return data

    except Exception as e:
        error = str(e)
        return error

Expected behavior

  • Different endpoints should be able to have the same version name attaching to the same inference cluster.
  • Scoring uri should be /api/v1/service/test1/blue/score and /api/v1/service/test2/blue/score while current setting is /api/v1/service/blue/score for test1 and test2.

Screenshots

  • View from ML studio / Endpoints view_from_endpoints
  • View from Azure Portal / Services and Ingresses view_from_cluster
  • View from ML studio / Endpoints / test1 Screen Shot 2022-12-28 at 5 07 59 PM
  • View from ML studio / Endpoints / test2 Screen Shot 2022-12-28 at 5 08 11 PM

hao-happify avatar Dec 28 '22 22:12 hao-happify

Hi @hao-happify, thank you for opening an issue! I'll tag some folks who should be able to help, and we'll get back to you as soon as possible. @luigiw @azureml-github

mccoyp avatar Dec 28 '22 22:12 mccoyp

Hello, @mccoyp any update on this? I also found while using Python SDK v1, the service is down when I add another version, causing the safe rollout failure. Is it because the versioning and safe-rollout are no longer supported by V1? I couldn't find the safe rollout document in v1 since it's all upgraded to v2.

hao-happify avatar Jan 19 '23 15:01 hao-happify

@hao-happify thank you for the ping on this; I'll alert the ML team again and get in contact directly. @luigiw @azureml-github

mccoyp avatar Jan 19 '23 18:01 mccoyp

@hao-happify For product improvement, we has released the new version of azureml-fe, and we are now transparently upgrading the azureml-fe from v1 to v2 in our customers' AKS clusters. However, the AKSEndpoint in v1 which is only previewed but not GAed before, had deprecated and will not be support with the new fetaures/capabilitlies. So the azureml-fe v2 doesn't support routing traffic on AKSEnpoint in v1.

We recommend you to stop using the v1 AKSEndpoint (which is not GAed), and do not build production on a previewed feature to prevent incompatibilities.

Notice that there are three options to mitigate this issue:

  1. If you'd like to use endpoint, you can directly migrate to our v2 stack to use SDK/CLI V2 to creat the online-endpoint in v2.
  2. If you'd like to stay on v1 while using the new azuremlfe-f2 v2, you have to use v1 webservice instead of the deprecated v1 endpoint, since scoring f2 v2 doesn't support v1 endpoint traffic routing.
  3. If you'd like to continue using the v1 AKSEndpoint, we can roll back the azureml-fe to be v1 in your clusters. In this case, the version of azureml-fe will be pinned at v1, means you can not gain the performance improvement and new feature support in the future.  To roll back the azureml-fe, please send an email to [email protected], indicating that you want to pin your fe version to be v1, while providing the AKS cluster resource ID to us. 

jiaochenlu avatar Jan 20 '23 08:01 jiaochenlu

Hi @hao-happify. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text “/unresolve” to remove the “issue-addressed” label and continue the conversation.

ghost avatar Jan 20 '23 15:01 ghost

@jiaochenlu

Thank you for the explanation and the recommendation. I will try to upgrade to v2 instead.

hao-happify avatar Jan 20 '23 15:01 hao-happify

Hi @hao-happify, since you haven’t asked that we “/unresolve” the issue, we’ll close this out. If you believe further discussion is needed, please add a comment “/unresolve” to reopen the issue.

ghost avatar Jan 27 '23 16:01 ghost