Humza Sami
@HamidShojanazeri Please check
> NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.api_server --tensor-parallel-size 2 --host 127.0.0.1

@MasKong Can you elaborate on this a bit? This is my simple codebase and I want to use GPUs 1 and 3....
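As a minimal sketch (not from the thread itself) of how `CUDA_VISIBLE_DEVICES` selects GPUs: setting it to `1,3` exposes physical GPUs 1 and 3, which the process then sees renumbered as devices 0 and 1. The same applies to the vLLM command quoted above.

```python
import os

# Must be set before CUDA is initialized; physical GPUs 1 and 3 then
# appear to this process as cuda:0 and cuda:1.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,3"

import torch

print(torch.cuda.device_count())      # expected: 2
print(torch.cuda.get_device_name(0))  # physical GPU 1
print(torch.cuda.get_device_name(1))  # physical GPU 3
```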
Use `` as the end-of-string token in generation.
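The token itself is missing above (for Llama-family models it is typically `</s>`, exposed as `tokenizer.eos_token`). A minimal sketch, assuming a Hugging Face tokenizer/model pair, of wiring the stop token into generation via `eos_token_id`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "codellama/CodeLlama-7b-hf"  # assumption: any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("def fib(n):", return_tensors="pt").to(model.device)
# Generation stops as soon as the end-of-string token is produced.
out = model.generate(**inputs, max_new_tokens=128, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```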
Please share your cuDNN and CUDA toolkit versions along with your GPU model. Although the following code block detects the GPU automatically:

```python
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)
```
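A quick way to collect those versions, using standard `torch` introspection calls:

```python
import torch

print("torch:", torch.__version__)
print("CUDA toolkit (as built):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))
```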
```python
# add_special_tokens expects a dict, not a bare list
tokenizer.add_special_tokens({"additional_special_tokens": ["[BOST]"]})
```
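If a new token is added this way, the model's embedding matrix usually needs to grow with it; a minimal follow-up, assuming a standard Hugging Face model object:

```python
# Resize input (and tied output) embeddings to cover the new vocabulary entry.
model.resize_token_embeddings(len(tokenizer))
```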
In the Huggingface generation pipeline, are you using the Instruct prompt format? `[INST] user message 1 [/INST] response 1 [INST] user message 2 [/INST] response 2` In example_chat_completion.py, this...
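As a rough sketch of assembling that multi-turn format in code (assuming the Llama-2-style template; the `<s>`/`</s>` markers and the omission of a system prompt are simplifications of the full format):

```python
def build_llama2_prompt(turns):
    """turns: list of (user_message, assistant_response) pairs; pass None
    as the last response to ask the model for the next reply."""
    prompt = ""
    for user_msg, response in turns:
        prompt += f"<s>[INST] {user_msg} [/INST]"
        if response is not None:
            prompt += f" {response} </s>"
    return prompt

prompt = build_llama2_prompt([
    ("user message 1", "response 1"),
    ("user message 2", None),
])
print(prompt)
```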
Can you share your inference code?
@for-just-we It would be helpful if you posted your inference code here. In the meantime, could you try this **code snippet** and check whether it produces better results? ``` from...
As far as I know, `codellama-Python` is not meant for infilling; please refer to its documentation. This model is not fine-tuned on an infilling dataset; it is fine-tuned only on next...
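By contrast, the base CodeLlama checkpoints (without the `-Python` suffix) were trained with an infilling objective. A sketch of the `<FILL_ME>` convention from the Hugging Face CodeLlama integration, assuming `codellama/CodeLlama-7b-hf`:

```python
import torch
from transformers import CodeLlamaTokenizer, LlamaForCausalLM

# Base checkpoint (not -Python), which supports infilling.
model_name = "codellama/CodeLlama-7b-hf"
tokenizer = CodeLlamaTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# <FILL_ME> marks the hole; the tokenizer rewrites the prompt into the
# model's prefix/suffix/middle infilling format.
prompt = '''def remove_non_ascii(s: str) -> str:
    """ <FILL_ME>
    return result
'''
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(model.device)
generated = model.generate(input_ids, max_new_tokens=128)
filling = tokenizer.batch_decode(
    generated[:, input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(prompt.replace("<FILL_ME>", filling))
```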
This base model can produce a total of 4096 tokens. You can set `max_new_tokens` up to 4096, but note that this 4096-token budget includes the prompt tokens as well.
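A small sketch of budgeting generation against that window (assuming a Hugging Face tokenizer/model pair with a 4096-token context):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CONTEXT_WINDOW = 4096  # shared by the prompt and the generated tokens

model_name = "codellama/CodeLlama-7b-hf"  # assumption: any 4096-context model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("def quicksort(arr):", return_tensors="pt").to(model.device)
prompt_len = inputs["input_ids"].shape[1]

# Cap generation so prompt + new tokens never exceed the context window.
out = model.generate(**inputs, max_new_tokens=CONTEXT_WINDOW - prompt_len)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```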