
Question: request gpushare on the same GPU

Open xhejtman opened this issue 4 years ago • 3 comments

Hello,

Is it possible for several pods to request a GPU share on any card, as long as it is the same card for all of them? E.g., if you have a StatefulSet consisting of an Xserver container and an application container, those two need to share the same GPU card. I would request something like 1 GiB of memory for each of the containers; however, if I have more than one GPU per node, I have no guarantee that they will use the same device, right?
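For reference, the kind of spec I have in mind looks roughly like this (pod, container, and image names are just illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: xserver-app # hypothetical name
spec:
  containers:
  - name: xserver
    image: my-xserver:latest # hypothetical image
    resources:
      limits:
        # GiB
        aliyun.com/gpu-mem: 1
  - name: app
    image: my-app:latest # hypothetical image
    resources:
      limits:
        # GiB
        aliyun.com/gpu-mem: 1

Both containers request their own slice, but nothing in the spec ties the two slices to the same physical device.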

xhejtman avatar Aug 07 '21 11:08 xhejtman

By default, the gpushare scheduler tries to allocate pods to GPU cards with a "binpack first" policy. Binpack means that multiple pods requesting GPU memory are packed onto the same card of the same node, in order to leave as many GPU cards as possible free for "big" jobs. In that case, yes, your two pods may well share the same GPU card. However, this is a best-effort policy: the two pods can still be allocated to different cards, if the card where pod1 was placed doesn't have enough free memory for pod2. In that case pod2 will be placed on another card.
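If I remember correctly, the scheduler extender records the index of the card it picked in a pod annotation, so you can verify whether two pods ended up on the same card with something like this (annotation name from memory, please double-check against the scheduler extender code):

# Assumes the gpushare scheduler extender annotates scheduled pods
# with the index of the allocated GPU card.
kubectl get pod <pod-name> -o jsonpath='{.metadata.annotations.ALIYUN_COM_GPU_MEM_IDX}'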

wsxiaozhang avatar Nov 05 '21 07:11 wsxiaozhang

Is there any chance of extending this plugin so that it would be possible to request allocation from the same physical card? It would be useful for StatefulSet deployments where you might need to share the same physical GPU among all containers.

xhejtman avatar Nov 05 '21 13:11 xhejtman

I'm trying to use this plugin with your example code, and it doesn't seem to work as declared.

ALIYUN_COM_GPU_MEM_DEV=14
ALIYUN_COM_GPU_MEM_CONTAINER=2
2021-11-11 22:22:59.635156: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2021-11-11 22:22:59.675793: E tensorflow/core/common_runtime/direct_session.cc:170] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE
/usr/local/lib/python3.5/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
0.1
Traceback (most recent call last):
  File "/app/main.py", line 40, in <module>
    train(fraction)
  File "/app/main.py", line 23, in train
    sess = tf.Session(config=config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1482, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 622, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

This situation occurs when I increase the number of replicas to two:

apiVersion: apps/v1
kind: Deployment

metadata:
  name: binpack-1
  labels:
    app: binpack-1

spec:
  replicas: 2

  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-1

  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-1

    spec:
      containers:
      - name: binpack-1
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # GiB
            aliyun.com/gpu-mem: 2

I have three cards with 14 GiB each. However, I am not able to run two copies of this software. Why?
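In case it helps with debugging: the per-card allocations can be inspected with the kubectl plugin from the gpushare-scheduler-extender repo (assuming it is installed):

# Assumes the kubectl-inspect-gpushare plugin is installed;
# -d prints per-GPU allocation details for each node.
kubectl inspect gpushare -d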

swood avatar Nov 11 '21 22:11 swood