ngc-container-replicator

--image and --min-version filters may not work on irregular NGC tags

Open andiariffin opened this issue 5 years ago • 3 comments

I am using the following configuration in my CronJob YAML file:

data:
  ngc-update.sh: |
    #!/bin/bash
    ngc_replicator                                        \
      --project=nvidia                                    \
      --min-version=$(date +"%y.%m" -d "1 month ago")     \
      --py-version=py3                                    \
      --image=tensorflow --image=pytorch --image=tensorrt --image=mxnet --image=digits --image=cuda --image=nvhpc --image=rapidsai \
      --no-exporter                                       \
      --registry-url=mgmt01.cluster.local:31500

Based on the logs, it appears to run with the following images selected for fetching:

2020-12-28 03:02:01,711 - ngc_replicator.ngc_replicator - 289 - INFO - images to be fetched: defaultdict(<class 'dict'>,
            {   'nvidia/digits': {   '20.11-tensorflow-py3': {   'docker_id': '2020-11-20T02:46:37.875Z',
                                                                 'registry': 'nvcr.io'},
                                     '20.12-tensorflow-py3': {   'docker_id': '2020-12-18T03:42:35.815Z',
                                                                 'registry': 'nvcr.io'}},
                'nvidia/l4t-pytorch': {   'r32.4.2-pth1.2-py3': {   'docker_id': '2020-04-29T23:10:39.028Z',
                                                                    'registry': 'nvcr.io'},
                                          'r32.4.2-pth1.3-py3': {   'docker_id': '2020-04-29T23:11:07.724Z',
                                                                    'registry': 'nvcr.io'},
                                          'r32.4.2-pth1.4-py3': {   'docker_id': '2020-04-29T23:11:35.269Z',
                                                                    'registry': 'nvcr.io'},
                                          'r32.4.2-pth1.5-py3': {   'docker_id': '2020-04-29T23:12:04.055Z',
                                                                    'registry': 'nvcr.io'},
                                          'r32.4.3-pth1.6-py3': {   'docker_id': '2020-07-07T23:55:54.218Z',
                                                                    'registry': 'nvcr.io'},
                                          'r32.4.4-pth1.6-py3': {   'docker_id': '2020-10-21T21:27:22.926Z',
                                                                    'registry': 'nvcr.io'}},
                'nvidia/l4t-tensorflow': {   'r32.4.2-tf1.15-py3': {   'docker_id': '2020-04-29T22:23:48.073Z',
                                                                       'registry': 'nvcr.io'},
                                             'r32.4.3-tf1.15-py3': {   'docker_id': '2020-07-07T22:40:06.178Z',
                                                                       'registry': 'nvcr.io'},
                                             'r32.4.3-tf2.2-py3': {   'docker_id': '2020-07-07T22:40:40.409Z',
                                                                      'registry': 'nvcr.io'},
                                             'r32.4.4-tf1.15-py3': {   'docker_id': '2020-10-21T21:29:06.077Z',
                                                                       'registry': 'nvcr.io'},
                                             'r32.4.4-tf2.3-py3': {   'docker_id': '2020-10-21T22:36:26.793Z',
                                                                      'registry': 'nvcr.io'}},
                'nvidia/mxnet': {   '20.11-py3': {   'docker_id': '2020-11-20T02:47:47.932Z',
                                                     'registry': 'nvcr.io'},
                                    '20.12-py3': {   'docker_id': '2020-12-18T03:42:53.893Z',
                                                     'registry': 'nvcr.io'}},
                'nvidia/pytorch': {   '20.11-py3': {   'docker_id': '2020-11-20T02:46:27.312Z',
                                                       'registry': 'nvcr.io'},
                                      '20.12-py3': {   'docker_id': '2020-12-18T03:52:53.213Z',
                                                       'registry': 'nvcr.io'}},
                'nvidia/tensorflow': {   '20.11-tf1-py3': {   'docker_id': '2020-11-20T02:49:23.047Z',
                                                              'registry': 'nvcr.io'},
                                         '20.11-tf2-py3': {   'docker_id': '2020-11-20T02:51:56.543Z',
                                                              'registry': 'nvcr.io'},
                                         '20.12-tf1-py3': {   'docker_id': '2020-12-18T03:54:53.111Z',
                                                              'registry': 'nvcr.io'},
                                         '20.12-tf2-py3': {   'docker_id': '2020-12-18T03:45:48.862Z',
                                                              'registry': 'nvcr.io'}},
                'nvidia/tensorrt': {   '20.11-py3': {   'docker_id': '2020-11-20T02:47:41.008Z',
                                                        'registry': 'nvcr.io'},
                                       '20.12-py3': {   'docker_id': '2020-12-18T03:44:24.218Z',
                                                        'registry': 'nvcr.io'}}})

There are a few things that I noticed didn't work well:

  1. It also fetches l4t-pytorch and l4t-tensorflow, which I didn't specify in the YAML above
  2. It didn't fetch the cuda, nvhpc and rapidsai images
  3. Even though I set --min-version to one month ago, it also picks up l4t-pytorch and l4t-tensorflow tags from much older releases (i.e. 2020-04, 2020-07, and 2020-10)

For item 2 above, I suspect the cause is that the cuda or rapidsai images on NGC don't follow the usual tag naming convention (e.g. 20.11-xx or 20.12-xx).
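
To make the expected behavior concrete, here is a minimal Bash sketch of what strict filtering could look like. It is not taken from ngc_replicator; the helper names image_wanted and tag_meets_min are hypothetical, and the sketch assumes that only tags beginning with YY.MM should be compared against --min-version.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; not part of ngc_replicator.

# Succeeds only when the last path component of the repository equals one of
# the wanted image names exactly, so "nvidia/l4t-pytorch" does not match "pytorch".
image_wanted() {
  local repo="$1"; shift
  local name="${repo##*/}" wanted
  for wanted in "$@"; do
    [[ "$name" == "$wanted" ]] && return 0
  done
  return 1
}

# Succeeds only when the tag starts with YY.MM and that version is at least
# the given minimum. Tags in any other shape (e.g. "r32.4.2-pth1.2-py3")
# are rejected instead of slipping past the version check.
tag_meets_min() {
  local tag="$1" min="$2" tag_num min_num
  [[ "$tag" =~ ^([0-9]{2})\.([0-9]{2})(-|$) ]] || return 1
  tag_num="${BASH_REMATCH[1]}${BASH_REMATCH[2]}"
  [[ "$min" =~ ^([0-9]{2})\.([0-9]{2})$ ]] || return 1
  min_num="${BASH_REMATCH[1]}${BASH_REMATCH[2]}"
  # 10# forces base 10 so values with a leading zero are not read as octal.
  (( 10#$tag_num >= 10#$min_num ))
}
```

With these checks, image_wanted nvidia/l4t-pytorch pytorch fails, and tag_meets_min r32.4.2-pth1.2-py3 20.11 fails as well, which is the behavior I expected for items 1 and 3. Rejecting every tag that is not in YY.MM form would also keep skipping images with other tag schemes, so item 2 would still need a separate rule for those images.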

andiariffin, Dec 28 '20 05:12

This is a showstopper for me at my company. It looks like we are going to have to create our own tool. It's sad that no one is addressing this.

blairjj, Jun 28 '21 19:06

I believe the closest workaround at the moment is some scripting around the NGC CLI. I am thinking of at least the following steps:

  • Parse the output of ngc registry image list and keep only the images you want to replicate
  • For each of those images, download it (ngc registry image pull), push it to your private registry (docker push), and optionally delete the local copy to free up space (docker rmi)
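
The second step could be sketched roughly as below. This is an untested outline under several assumptions: the image list is hard-coded rather than parsed from ngc registry image list, the registry URL is the one from the CronJob in the original post, and ngc registry image pull is assumed to leave the image in the local Docker store under nvcr.io/<org>/<image>:<tag>. The run and mirror_image helpers and the DRY_RUN variable are made up for this example.

```shell
#!/bin/bash
set -euo pipefail

# Assumption: private registry taken from the CronJob in the original post.
REGISTRY_URL="${REGISTRY_URL:-mgmt01.cluster.local:31500}"
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [[ "$DRY_RUN" == "1" ]]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Pull one image from NGC, retag it for the private registry, push it,
# then remove both local tags to free disk space.
mirror_image() {
  local ref="$1"                      # e.g. nvidia/pytorch:20.12-py3
  local src="nvcr.io/${ref}"          # assumed local name after the pull
  local dst="${REGISTRY_URL}/${ref}"
  run ngc registry image pull "$ref"
  run docker tag "$src" "$dst"
  run docker push "$dst"
  run docker rmi "$src" "$dst"
}

# Assumption: this list would come from filtering `ngc registry image list`.
for ref in nvidia/pytorch:20.12-py3 nvidia/tensorrt:20.12-py3; do
  mirror_image "$ref"
done
```

Running it as-is only prints the four commands per image, which makes it easy to check the source and destination names before letting it pull anything.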

andiariffin, Jun 28 '21 23:06

Hi Andi - A friend at NVIDIA (Hi Adam) just put together a patch for my issue of strict filtering. Take a look; he uploaded it just this evening.

blairjj, Jun 29 '21 00:06