--image and --min-version filters may not work on irregular NGC tags
I am using the following configuration in my CronJob YAML file:
data:
  ngc-update.sh: |
    #!/bin/bash
    ngc_replicator \
      --project=nvidia \
      --min-version=$(date +"%y.%m" -d "1 month ago") \
      --py-version=py3 \
      --image=tensorflow --image=pytorch --image=tensorrt --image=mxnet --image=digits --image=cuda --image=nvhpc --image=rapidsai \
      --no-exporter \
      --registry-url=mgmt01.cluster.local:31500
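For reference, the `--min-version` value here is just the previous month's YY.MM prefix computed with GNU `date`; on the day of the log below it works out as follows:

```bash
# Evaluates to the previous month's YY.MM prefix; on 2020-12-28 this prints 20.11,
# so tags like 20.11-* and 20.12-* are the ones I expect to pass the filter.
date +"%y.%m" -d "1 month ago"
```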
Based on the logs, it executes and determines the following images to be fetched:
2020-12-28 03:02:01,711 - ngc_replicator.ngc_replicator - 289 - INFO - images to be fetched: defaultdict(<class 'dict'>,
{ 'nvidia/digits': { '20.11-tensorflow-py3': { 'docker_id': '2020-11-20T02:46:37.875Z',
'registry': 'nvcr.io'},
'20.12-tensorflow-py3': { 'docker_id': '2020-12-18T03:42:35.815Z',
'registry': 'nvcr.io'}},
'nvidia/l4t-pytorch': { 'r32.4.2-pth1.2-py3': { 'docker_id': '2020-04-29T23:10:39.028Z',
'registry': 'nvcr.io'},
'r32.4.2-pth1.3-py3': { 'docker_id': '2020-04-29T23:11:07.724Z',
'registry': 'nvcr.io'},
'r32.4.2-pth1.4-py3': { 'docker_id': '2020-04-29T23:11:35.269Z',
'registry': 'nvcr.io'},
'r32.4.2-pth1.5-py3': { 'docker_id': '2020-04-29T23:12:04.055Z',
'registry': 'nvcr.io'},
'r32.4.3-pth1.6-py3': { 'docker_id': '2020-07-07T23:55:54.218Z',
'registry': 'nvcr.io'},
'r32.4.4-pth1.6-py3': { 'docker_id': '2020-10-21T21:27:22.926Z',
'registry': 'nvcr.io'}},
'nvidia/l4t-tensorflow': { 'r32.4.2-tf1.15-py3': { 'docker_id': '2020-04-29T22:23:48.073Z',
'registry': 'nvcr.io'},
'r32.4.3-tf1.15-py3': { 'docker_id': '2020-07-07T22:40:06.178Z',
'registry': 'nvcr.io'},
'r32.4.3-tf2.2-py3': { 'docker_id': '2020-07-07T22:40:40.409Z',
'registry': 'nvcr.io'},
'r32.4.4-tf1.15-py3': { 'docker_id': '2020-10-21T21:29:06.077Z',
'registry': 'nvcr.io'},
'r32.4.4-tf2.3-py3': { 'docker_id': '2020-10-21T22:36:26.793Z',
'registry': 'nvcr.io'}},
'nvidia/mxnet': { '20.11-py3': { 'docker_id': '2020-11-20T02:47:47.932Z',
'registry': 'nvcr.io'},
'20.12-py3': { 'docker_id': '2020-12-18T03:42:53.893Z',
'registry': 'nvcr.io'}},
'nvidia/pytorch': { '20.11-py3': { 'docker_id': '2020-11-20T02:46:27.312Z',
'registry': 'nvcr.io'},
'20.12-py3': { 'docker_id': '2020-12-18T03:52:53.213Z',
'registry': 'nvcr.io'}},
'nvidia/tensorflow': { '20.11-tf1-py3': { 'docker_id': '2020-11-20T02:49:23.047Z',
'registry': 'nvcr.io'},
'20.11-tf2-py3': { 'docker_id': '2020-11-20T02:51:56.543Z',
'registry': 'nvcr.io'},
'20.12-tf1-py3': { 'docker_id': '2020-12-18T03:54:53.111Z',
'registry': 'nvcr.io'},
'20.12-tf2-py3': { 'docker_id': '2020-12-18T03:45:48.862Z',
'registry': 'nvcr.io'}},
'nvidia/tensorrt': { '20.11-py3': { 'docker_id': '2020-11-20T02:47:41.008Z',
'registry': 'nvcr.io'},
'20.12-py3': { 'docker_id': '2020-12-18T03:44:24.218Z',
'registry': 'nvcr.io'}}})
There are a few things I noticed that didn't work well:
- It also fetches `l4t-pytorch` and `l4t-tensorflow`, which I didn't specify in the YAML above
- It didn't fetch the `cuda`, `nvhpc` and `rapidsai` images
- Even though I specified `--min-version` to be at least 1 month back, it also captures `l4t-pytorch` and `l4t-tensorflow` tags from much older releases (i.e. 2020-04, 2020-07 and 2020-10)
For item no. 2 above, I suspect this is because the `cuda` or `rapidsai` images on NGC don't follow the usual tag naming convention (e.g. `20.11-xx` or `20.12-xx`).
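A toy illustration of why irregular tags could defeat this kind of filtering (this is only my guess at the behaviour, not the replicator's actual code): if the tag's leading portion is compared against a YY.MM minimum as a string, a tag like `r32.4.4-pth1.6-py3` neither matches the convention nor sorts below the minimum, so it slips through no matter how old it is.

```bash
# Hypothetical YY.MM prefix comparison - NOT the replicator's real logic.
min_version="20.11"
for tag in 20.10-py3 20.12-py3 r32.4.4-pth1.6-py3; do
  prefix="${tag%%-*}"                  # "20.10", "20.12", "r32.4.4"
  if [[ "${prefix}" < "${min_version}" ]]; then
    echo "drop ${tag}"                 # only well-formed old tags get dropped
  else
    echo "keep ${tag}"                 # "r32.4.4" sorts after "20.11" lexically, so it is kept
  fi
done
```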
This is a showstopper for me at my company. It looks like we are going to have to create our own tool. Sad that no one is addressing this.
I believe the closest workaround possible at this moment would be some scripting around their NGC CLI. I am thinking of at least the following steps (a rough sketch follows this list):
- Parse the output of `ngc registry image list` to filter down to only the images you are going to replicate
- For each of those images, download the image (`ngc registry image pull`), push it to your private registry (`docker push`), and optionally delete the local copies to free up space (`docker rmi`)
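A minimal sketch of the second step, assuming a hand-maintained allow list instead of parsing `ngc registry image list` output, and using plain `docker pull`/`docker push` for the transfer; the registry URL and tags below are only placeholders:

```bash
#!/bin/bash
set -euo pipefail

# Assumed values - substitute your own private registry and image list.
PRIVATE_REGISTRY="mgmt01.cluster.local:31500"
IMAGES=(
  "nvidia/tensorflow:20.12-tf2-py3"
  "nvidia/pytorch:20.12-py3"
  "nvidia/cuda:11.1.1-devel-ubuntu20.04"   # example of a tag that doesn't follow the YY.MM convention
)

for image in "${IMAGES[@]}"; do
  src="nvcr.io/${image}"
  dst="${PRIVATE_REGISTRY}/${image}"

  docker pull "${src}"           # fetch from NGC
  docker tag "${src}" "${dst}"   # retag for the private registry
  docker push "${dst}"           # mirror it
  docker rmi "${src}" "${dst}"   # optionally free local disk space
done
```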
Hi Andi - A friend at Nvidia (Hi Adam) just put together a patch for my issue of strict filtering. Take a look; he just uploaded it this evening.