Using a custom SM image based on sagemaker-distribution hangs and fails in SM studio
I am not sure where to report this, but since the docs outline how to build a custom image, I will try here.
I build this custom image, push it to ECR, register it as a SageMaker image, and create an app image config, as described in the docs.
My Dockerfile looks like this:
FROM --platform=linux/amd64 public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu
USER $ROOT
RUN apt-get clean
# dependencies for building python and having opencv
RUN apt-get update && \
apt-get install -y gcc g++ python3-dev ffmpeg libsm6 libxext6 && \
rm -rf /var/lib/apt/lists/* && \
apt-get clean
USER $MAMBA_USER
# copy the environment.yml file into the container
COPY --chown=$MAMBA_USER:$MAMBA_USER processing/environment.yml /tmp/environment.yml
# Use micromamba to install the dependencies from the environment.yml file
RUN micromamba install -y -n base -f /tmp/environment.yml && \
micromamba clean --all --yes
The only difference I can see in the logs, compared with a working image, is in the two lines marked below (both at 2024-02-09T10:23:34.006+01:00):
2024-02-09T10:23:34.006+01:00 [I 2024-02-09 09:23:33.875 ServerApp] Loading SageMaker Studio EMR server extension 0.1.9
2024-02-09T10:23:34.006+01:00 [I 2024-02-09 09:23:33.876 ServerApp] sagemaker_jupyterlab_emr_extension | extension was successfully loaded.
2024-02-09T10:23:34.006+01:00 [I 2024-02-09 09:23:33.876 ServerApp] Loading SageMaker JupyterLab server extension 0.2.0
2024-02-09T10:23:34.006+01:00 [I 2024-02-09 09:23:33.877 ServerApp] sagemaker_jupyterlab_extension | extension was successfully loaded.
2024-02-09T10:23:34.006+01:00 [I 2024-02-09 09:23:33.877 ServerApp] Loading SageMaker JupyterLab common server extension 0.1.9
2024-02-09T10:23:34.006+01:00 [I 2024-02-09 09:23:33.877 ServerApp] sagemaker_jupyterlab_extension_common | extension was successfully loaded.
2024-02-09T10:23:34.006+01:00 [I 2024-02-09 09:23:33.878 ServerApp] Serving notebooks from local directory: /home/sagemaker-user
this line -------> 2024-02-09T10:23:34.006+01:00 [I 2024-02-09 09:23:33.878 ServerApp] Jupyter Server 2.10.0 is running at: <-------- this line
2024-02-09T10:23:34.006+01:00 [I 2024-02-09 09:23:33.878 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
2024-02-09T10:23:34.006+01:00 [W 2024-02-09 09:23:33.882 ServerApp] No web browser found: Error('could not locate runnable browser').
this line -------> 2024-02-09T10:23:34.006+01:00 [C 2024-02-09 09:23:33.882 ServerApp] To access the server, open this file in a browser: file:///home/sagemaker-user/.local/share/jupyter/runtime/jpserver-1-open.html Or copy and paste one of these URLs: <------ and this line
2024-02-09T10:23:34.006+01:00 INFO: State start
2024-02-09T10:23:34.006+01:00 INFO: Scheduler at: inproc://169.255.254.1/1/1
2024-02-09T10:23:34.006+01:00 INFO: dashboard at: http://169.255.254.1:8787/status
2024-02-09T10:23:34.006+01:00 INFO: Registering Worker plugin shuffle
2024-02-09T10:23:34.006+01:00 INFO: Start worker at: inproc://169.255.254.1/1/4
2024-02-09T10:23:34.006+01:00 INFO: Listening to: inproc169.255.254.1
2024-02-09T10:23:34.006+01:00 INFO: Worker name: 0
2024-02-09T10:23:34.006+01:00 INFO: dashboard at: 169.255.254.1:39899
2024-02-09T10:23:34.006+01:00 INFO: Waiting to connect to: inproc://169.255.254.1/1/1
2024-02-09T10:23:34.006+01:00 INFO: -------------------------------------------------
2024-02-09T10:23:34.006+01:00 INFO: Threads: 2
2024-02-09T10:23:34.006+01:00 INFO: Memory: 3.78 GiB
2024-02-09T10:23:34.006+01:00 INFO: Local Directory: /tmp/dask-scratch-space/worker-ylr01t6j
2024-02-09T10:23:35.259+01:00 INFO: -------------------------------------------------
2024-02-09T10:23:35.260+01:00 INFO: Register worker <WorkerState 'inproc://169.255.254.1/1/4', name: 0, status: init, memory: 0, processing: 0>
2024-02-09T10:23:35.260+01:00 INFO: Starting worker compute stream, inproc://169.255.254.1/1/4
2024-02-09T10:23:35.260+01:00 INFO: Starting established connection to inproc://169.255.254.1/1/5
2024-02-09T10:23:35.260+01:00 INFO: Starting Worker plugin shuffle
2024-02-09T10:23:35.260+01:00 INFO: Registered to: inproc://169.255.254.1/1/1
2024-02-09T10:23:35.260+01:00 INFO: -------------------------------------------------
2024-02-09T10:23:35.260+01:00 INFO: Starting established connection to inproc://169.255.254.1/1/1
2024-02-09T10:23:35.260+01:00 INFO: Receive client connection: Client-e58dc3c4-c72c-11ee-8001-6efbcde7e649
2024-02-09T10:23:35.510+01:00 INFO: Starting established connection to inproc://169.255.254.1/1/6
2024-02-09T10:23:39.515+01:00 [I 2024-02-09 09:23:35.316 ServerApp] Skipped non-installed server(s): bash-language-server, dockerfile-language-
In the working images, each of these lines is followed by a correctly configured server URL.
I am in VPC-only mode for the domain, but I don't see how that should change anything, since the sagemaker-distribution image itself works fine.
I would appreciate any pointers.
Can you provide the packages in your environment.yml? And are these logs from the /aws/sagemaker/studio CloudWatch log group?
Yes, these logs are from the /aws/sagemaker/studio log group.
This is a bare-bones example of an environment.yml that fails for me:
name: base
channels:
  - conda-forge
dependencies:
  - python==3.10
  - jupyterlab
  - pip
  - pip:
    - ipykernel
    - sagemaker
But is it correct that this should be able to work as a custom JupyterLab image in the new Studio as well?
If it helps, I can provide my config too:
aws sagemaker describe-image
{
"CreationTime": 1707914687.086,
"DisplayName": "prod-sagemaker-image",
"ImageArn": "arn:aws:sagemaker:some-arn",
"ImageName": "image_name_1",
"ImageStatus": "CREATED",
"LastModifiedTime": 1707914688.064
},
aws sagemaker describe-app-image-config
{
"AppImageConfigArn": "arn:aws:sagemaker:some_app_image_config_arn",
"AppImageConfigName": "sagemaker-app-image-config",
"CreationTime": 1707403360.291,
"LastModifiedTime": 1707405437.631,
"KernelGatewayImageConfig": {
"KernelSpecs": [
{
"Name": "python3",
"DisplayName": "yesyes"
}
],
"FileSystemConfig": {
"MountPath": "/home/sagemaker-user",
"DefaultUid": 1000,
"DefaultGid": 100
}
},
"JupyterLabAppImageConfig": {
"ContainerConfig": {
"ContainerEntrypoint": [
"jupyter-lab"
]
}
}
},
aws sagemaker describe-domain
{
"DomainArn": "arn:aws:sagemaker:some_domain_arn",
"DomainId": "d-mmo40dnf710s",
"DomainName": "sagemaker-domain",
"HomeEfsFileSystemId": "fs-",
"SingleSignOnManagedApplicationInstanceId": "ins-",
"SingleSignOnApplicationArn": "arn:aws:sso::application/",
"Status": "InService",
"CreationTime": 1707688181.443,
"LastModifiedTime": 1707913859.291,
"AuthMode": "SSO",
"DefaultUserSettings": {
"ExecutionRole": "arn:aws:iam::some_role_arn",
"SecurityGroups": [
"sg-"
],
"JupyterServerAppSettings": {
"LifecycleConfigArns": []
},
"KernelGatewayAppSettings": {
"CustomImages": [
{
"ImageName": "prod-sagemaker-image",
"ImageVersionNumber": 1,
"AppImageConfigName": "sagemaker-app-image-config"
}
],
"LifecycleConfigArns": []
},
"CodeEditorAppSettings": {
"LifecycleConfigArns": []
},
"JupyterLabAppSettings": {
"DefaultResourceSpec": {
"InstanceType": "ml.t3.medium"
},
"CustomImages": [
{
"ImageName": "prod-sagemaker-image",
"ImageVersionNumber": 1,
"AppImageConfigName": "sagemaker-app-image-config"
}
],
"LifecycleConfigArns": [
"arn:aws:sagemaker:lifecycle_arn"
]
},
"SpaceStorageSettings": {
"DefaultEbsStorageSettings": {
"DefaultEbsVolumeSizeInGb": 5,
"MaximumEbsVolumeSizeInGb": 100
}
},
"DefaultLandingUri": "studio::",
"StudioWebPortal": "ENABLED"
},
"DomainSettings": {
"SecurityGroupIds": [
"sg-"
],
"DockerSettings": {
"EnableDockerAccess": "ENABLED",
"VpcOnlyTrustedAccounts": []
}
},
"AppNetworkAccessType": "VpcOnly",
"SubnetIds": [
"subnet-",
"subnet-"
],
"VpcId": "vpc-",
"AppSecurityGroupManagement": "Customer",
"DefaultSpaceSettings": {
"ExecutionRole": "arn:aws:iam::some_role_arn",
"SecurityGroups": [
"sg-"
],
"JupyterServerAppSettings": {
"DefaultResourceSpec": {
"SageMakerImageArn": "arn:aws:sagemaker:eu-north-1:243637512696:image/jupyter-server-3",
"InstanceType": "system"
}
}
}
}
Our team has struggled with this as well.
I tried my best to reproduce your image based on the Dockerfile and env.yml and was able to get it to work.
The main difference is that instead of relying on the app-image-config property:
"JupyterLabAppImageConfig": { "ContainerConfig": { "ContainerEntrypoint": [ "jupyter-lab" ] } }
we define the ENTRYPOINT and CMD directly in our Dockerfile, in accordance with https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-jl-image-specifications.html.
We did this because we had a hard time getting "ContainerEntrypoint" to work.
Below is the Dockerfile I used (the micromamba config is due to our proxy):
FROM --platform=linux/amd64 public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu
USER $ROOT
RUN apt-get clean
# dependencies for building python and having opencv
RUN apt-get update && \
apt-get install -y gcc g++ python3-dev ffmpeg libsm6 libxext6 && \
rm -rf /var/lib/apt/lists/* && \
apt-get clean
USER $MAMBA_USER
# copy the environment.yml file into the container
COPY --chown=$MAMBA_USER:$MAMBA_USER env_help.yml /tmp/environment.yml
RUN micromamba config prepend channels "CONDA-FORGE-PROXY" && \
micromamba config prepend channels "CONDA-PROXY" && \
micromamba config set channel_alias "CONDA-PROXY" && \
micromamba config set channel_priority flexible && \
micromamba config set pip_interop_enabled True && \
micromamba config set ssl_verify /etc/ssl/certs/ca-certificates.crt
# Use micromamba to install the dependencies from the environment.yml file
RUN micromamba install -y -n base -f /tmp/environment.yml && \
micromamba clean --all --yes
ENTRYPOINT ["jupyter-lab"]
CMD ["--ServerApp.ip=0.0.0.0", "--ServerApp.port=8888", "--ServerApp.allow_origin=*", "--ServerApp.token=''", "--ServerApp.base_url=/jupyterlab/default"]
Your logs suggest that the CMD portion of this is missing, since they do not contain the last two lines shown here:
2024-02-19T16:01:54.014-05:00 [I 2024-02-19 21:01:53.893 ServerApp] Serving notebooks from local directory: /home/sagemaker-user
2024-02-19T16:01:54.014-05:00 [I 2024-02-19 21:01:53.893 ServerApp] Jupyter Server 2.10.0 is running at:
2024-02-19T16:01:54.014-05:00 [I 2024-02-19 21:01:53.893 ServerApp] http://default:8888/jupyterlab/default/lab
2024-02-19T16:01:54.014-05:00 [I 2024-02-19 21:01:53.894 ServerApp] http://127.0.0.1:8888/jupyterlab/default/lab
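If you want to check an exported log for this difference without reading it by eye, something like the following rough Python sketch could help. It is only an illustration: the function name `served_urls` and the two sample lists are made up for this example (the sample lines are copied from the logs in this thread), and it assumes the URL lines come directly after the "is running at:" line, as they do above.

```python
import re

# Matches an http(s) URL anywhere in a log line.
URL_RE = re.compile(r"https?://\S+")
RUNNING_MARKER = "is running at:"


def served_urls(log_lines):
    """Return the URLs logged directly after the 'is running at:' line.

    An empty list means the server announced no URL, which is the
    symptom seen with the failing image in this thread.
    """
    urls = []
    collecting = False
    for line in log_lines:
        if RUNNING_MARKER in line:
            collecting = True
            continue
        if collecting:
            match = URL_RE.search(line)
            if match is None:
                break  # the first line without a URL ends the listing
            urls.append(match.group(0))
    return urls


# Sample lines copied from the working and the failing logs above.
working = [
    "[I 2024-02-19 21:01:53.893 ServerApp] Jupyter Server 2.10.0 is running at:",
    "[I 2024-02-19 21:01:53.893 ServerApp] http://default:8888/jupyterlab/default/lab",
    "[I 2024-02-19 21:01:53.894 ServerApp] http://127.0.0.1:8888/jupyterlab/default/lab",
]
failing = [
    "[I 2024-02-09 09:23:33.878 ServerApp] Jupyter Server 2.10.0 is running at:",
    "[I 2024-02-09 09:23:33.878 ServerApp] Use Control-C to stop this server "
    "and shut down all kernels (twice to skip confirmation).",
]

print(served_urls(working))
print(served_urls(failing))
```

An empty result for your image, next to two URLs under /jupyterlab/default for a working one, would point at the server being started without the CMD arguments.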
NOTE:
I pushed the image to ECR and then used the console to create the image and attach it to the domain.
We use CDK for our actual deployments.
Also, your app image config must contain at least an empty {} for "JupyterLabAppImageConfig", even if you decide to stop using it to set the entrypoint.
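For reference, here is a minimal Python sketch of what such a request could look like if you create the app image config from code rather than the console. The helper name `jupyterlab_app_image_config` is hypothetical, and the commented-out boto3 call is an assumption about how you would send it; we did not deploy it this way ourselves. The config name is the one from your describe-app-image-config output.

```python
import json


def jupyterlab_app_image_config(name, entrypoint=None):
    """Build keyword arguments for a CreateAppImageConfig request.

    "JupyterLabAppImageConfig" is always present. It stays an empty
    dict when no entrypoint is given, so the ENTRYPOINT and CMD baked
    into the image are the only startup definition.
    """
    jupyterlab_config = {}
    if entrypoint:
        jupyterlab_config["ContainerConfig"] = {
            "ContainerEntrypoint": list(entrypoint)
        }
    return {
        "AppImageConfigName": name,
        "JupyterLabAppImageConfig": jupyterlab_config,
    }


request = jupyterlab_app_image_config("sagemaker-app-image-config")
print(json.dumps(request, indent=2))

# Assumption: with boto3 installed and credentials configured, the
# request could be sent like this (not executed in this sketch):
#   import boto3
#   boto3.client("sagemaker").create_app_image_config(**request)
```

Passing `entrypoint=["jupyter-lab"]` reproduces the "ContainerEntrypoint" setup from your original config, which is the variant we could not get to work reliably.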