
Using Lynx 70B: CUDA out of memory

Open sjay8 opened this issue 1 year ago • 1 comments

Hello! I'm running NeMo Guardrails on Google Colab with a T4 GPU. However, when I deploy Lynx 70B with this command: !python -m vllm.entrypoints.openai.api_server --port 5000 --model 'PatronusAI/Patronus-Lynx-70B-Instruct'

I get a CUDA out-of-memory error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU

Does anyone know what I can do?

sjay8 avatar Aug 01 '24 18:08 sjay8

@varjoshi: can you provide some guidance? Thanks!

drazvan avatar Aug 06 '24 15:08 drazvan

You cannot load a 70B model on a T4 with 16 GB of VRAM. Some guidance on VRAM size vs. model size (written for Llama 3.1, but similar for other models) is here: https://huggingface.co/blog/llama31
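A rough back-of-envelope check makes this concrete. This sketch estimates the VRAM needed just to hold the model weights (it ignores KV cache, activations, and CUDA overhead, so real usage is higher); the function name and numbers are illustrative, not from any library:

```python
# Rough lower-bound VRAM estimate for model weights alone.
# Ignores KV cache, activations, and framework overhead.
def weight_vram_gb(num_params_billion: float, bytes_per_param: float) -> float:
    """Approximate GB of VRAM needed to store the weights."""
    return num_params_billion * bytes_per_param

fp16 = weight_vram_gb(70, 2.0)    # fp16/bf16: 2 bytes per parameter
int4 = weight_vram_gb(70, 0.5)    # 4-bit quantization: ~0.5 bytes per parameter

print(f"70B @ fp16: ~{fp16:.0f} GB")  # far beyond a 16 GB T4
print(f"70B @ 4-bit: ~{int4:.0f} GB") # still too large for a single T4
```

Even aggressively quantized, a 70B model needs several times the T4's 16 GB, so you would need multiple larger GPUs (vLLM supports sharding across GPUs via its tensor-parallel option) or a much smaller model.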

trebedea avatar Sep 14 '24 20:09 trebedea