
local_docker scheduler unable to set gpu correctly

Open ryxli opened this issue 1 year ago • 0 comments

🐛 Bug

The DeviceRequest capabilities in the docker scheduler should be set to "gpu", not "compute". See https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L308

                    c.kwargs["device_requests"] = [
                        DeviceRequest(
                            count=resource.gpu,
                            capabilities=[["compute"]],
                        )
                    ]
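For illustration, here is a minimal sketch of the request shape I would expect, written with plain dicts so it runs without the docker package installed. The helper name `gpu_device_requests` is hypothetical, and the `Count` / `Capabilities` key names assume the Docker Engine API's `DeviceRequests` field layout, which is my understanding of what `DeviceRequest` serializes to. It is not the scheduler's actual code.

```python
from typing import Any, Dict, List


def gpu_device_requests(gpu_count: int) -> List[Dict[str, Any]]:
    """Hypothetical helper: build Engine-API-style device requests for GPUs."""
    if gpu_count <= 0:
        # No GPUs requested, so no device request is needed.
        return []
    return [
        {
            # Number of GPUs to expose to the container.
            "Count": gpu_count,
            # Assumption from this report: "gpu" is the capability that
            # makes the devices visible, rather than "compute".
            "Capabilities": [["gpu"]],
        }
    ]


if __name__ == "__main__":
    print(gpu_device_requests(2))
```

As far as I can tell, this is the same capability the docker CLI requests when a container is started with `--gpus`, which would explain why the devices show up once "compute" is replaced.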

Module (check all that apply):

  • [ ] torchx.spec
  • [ ] torchx.component
  • [ ] torchx.apps
  • [ ] torchx.runtime
  • [ ] torchx.cli
  • [x] torchx.schedulers
  • [ ] torchx.pipelines
  • [ ] torchx.aws
  • [ ] torchx.examples
  • [ ] other

To Reproduce

Steps to reproduce the behavior:

  1. Start any container with the local_docker scheduler on a machine with an NVIDIA GPU.
  2. Run nvidia-smi inside the container to verify that the container does not detect the GPU.
pretrain/0 
pretrain/0 =============
pretrain/0 == PyTorch ==
pretrain/0 =============
pretrain/0 
pretrain/0 NVIDIA Release 23.12 (build 76438008)
pretrain/0 PyTorch Version 2.2.0a0+81ea7a4
pretrain/0 
pretrain/0 Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
pretrain/0 
pretrain/0 Copyright (c) 2014-2023 Facebook Inc.
pretrain/0 Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
pretrain/0 Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
pretrain/0 Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
pretrain/0 Copyright (c) 2011-2013 NYU                      (Clement Farabet)
pretrain/0 Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
pretrain/0 Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
pretrain/0 Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
pretrain/0 Copyright (c) 2015      Google Inc.
pretrain/0 Copyright (c) 2015      Yangqing Jia
pretrain/0 Copyright (c) 2013-2016 The Caffe contributors
pretrain/0 All rights reserved.
pretrain/0 
pretrain/0 Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
pretrain/0 
pretrain/0 This container image and its contents are governed by the NVIDIA Deep Learning Container License.
pretrain/0 By pulling and using the container, you accept the terms and conditions of this license:
pretrain/0 https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
pretrain/0 
pretrain/0 Failed to detect NVIDIA driver version.

Expected behavior

If the device capability is set to "gpu", the GPU devices should be visible inside the container and the NVIDIA driver should be detected.

After changing "compute" to "gpu", it works as expected:

pretrain/0 
pretrain/0 =============
pretrain/0 == PyTorch ==
pretrain/0 =============
pretrain/0 
pretrain/0 NVIDIA Release 23.12 (build 76438008)
pretrain/0 PyTorch Version 2.2.0a0+81ea7a4
pretrain/0 
pretrain/0 Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
pretrain/0 
pretrain/0 Copyright (c) 2014-2023 Facebook Inc.
pretrain/0 Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
pretrain/0 Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
pretrain/0 Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
pretrain/0 Copyright (c) 2011-2013 NYU                      (Clement Farabet)
pretrain/0 Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
pretrain/0 Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
pretrain/0 Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
pretrain/0 Copyright (c) 2015      Google Inc.
pretrain/0 Copyright (c) 2015      Yangqing Jia
pretrain/0 Copyright (c) 2013-2016 The Caffe contributors
pretrain/0 All rights reserved.
pretrain/0 
pretrain/0 Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
pretrain/0 
pretrain/0 This container image and its contents are governed by the NVIDIA Deep Learning Container License.
pretrain/0 By pulling and using the container, you accept the terms and conditions of this license:
pretrain/0 https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
pretrain/0 
pretrain/0 NOTE: CUDA Forward Compatibility mode ENABLED.
pretrain/0   Using CUDA 12.3 driver version 545.23.08 with kernel driver version 535.129.03.
pretrain/0   See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
pretrain/0 

Environment

  • torchx version (e.g. 0.1.0rc1): 0.6.0
  • Python version: 3.10
  • OS (e.g., Linux): AL2
  • How you installed torchx (conda, pip, source, docker): pip
  • Docker image and tag (if using docker):
  • Git commit (if installed from source):
  • Execution environment (on-prem, AWS, GCP, Azure etc):
  • Any other relevant information:

Additional context

ryxli, Feb 15 '24 02:02