Humza Sami

Results: 17 comments of Humza Sami

> `NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.api_server --tensor-parallel-size 2 --host 127.0.0.1`

@MasKong Can you elaborate on this a bit? This is my simple codebase and I want to use GPUs 1 and 3....
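In case it helps, here is a minimal sketch of selecting GPUs 1 and 3 with vLLM's offline `LLM` API; the checkpoint name is illustrative, and I am assuming the environment variables must be set before vLLM initializes CUDA:

```python
import os

# Restrict this process to GPUs 1 and 3; must happen before CUDA is initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,3"
os.environ["NCCL_P2P_DISABLE"] = "1"  # same workaround as the quoted command

from vllm import LLM, SamplingParams

# Illustrative model name; substitute your own checkpoint.
llm = LLM(model="codellama/CodeLlama-7b-hf", tensor_parallel_size=2)
outputs = llm.generate(["def fibonacci(n):"], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```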

Use `</s>` as the end-of-string token in the generation.
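Concretely, this can be passed to `generate` through `eos_token_id`; a minimal sketch with `transformers`, where the checkpoint name is illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "codellama/CodeLlama-7b-hf"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("def add(a, b):", return_tensors="pt")
# Stop decoding once the </s> (EOS) token is emitted.
output = model.generate(
    **inputs,
    max_new_tokens=64,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```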

Please share your cuDNN and CUDA toolkit versions along with your GPU model. Although the following code block automatically detects the GPU:

```python
import torch
import transformers

# device_map="auto" lets accelerate place the model on the available GPUs.
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)
```
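If it helps, a quick way to collect those version details (a sketch using PyTorch's built-in introspection):

```python
import torch

print("CUDA toolkit:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```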

```python
# add_special_tokens expects a dict of special-token kwargs, not a bare list.
tokenizer.add_special_tokens({"additional_special_tokens": ["[BOST]"]})
```
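If the model is then expected to use the new token, the embedding matrix usually has to grow to the new vocabulary size; a sketch, assuming `model` is the `transformers` model paired with this tokenizer:

```python
# Grow the embedding matrix to cover the newly added [BOST] token.
model.resize_token_embeddings(len(tokenizer))
```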

In the Hugging Face generation pipeline, are you using the instruct prompt format? `[INST] user message 1 [/INST] response 1 [INST] user message 2 [/INST] response 2` In example_chat_completion.py, this...
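A rough sketch of assembling such a multi-turn prompt string (the exact template, including BOS/EOS markers, varies by model, so treat this as an approximation):

```python
def build_inst_prompt(turns):
    """Join (user, assistant) pairs into an [INST]-style prompt string."""
    return " ".join(
        f"[INST] {user} [/INST] {assistant}" for user, assistant in turns
    )

print(build_inst_prompt([
    ("user message 1", "response 1"),
    ("user message 2", "response 2"),
]))
```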

@for-just-we It would be helpful if you posted your inference code here. In the meantime, could you try this **code snippet** and check whether it produces better results? ``` from...

As far as I know, `codellama-Python` is not for infilling. Please refer to its documentation. ![image](https://github.com/facebookresearch/codellama/assets/63999516/5937eab9-271f-4abe-b18e-85ccb2388492) This model is not fine-tuned on an infilling dataset; it is fine-tuned only on next...
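For comparison, the CodeLlama variants that do support infilling can be driven through the `transformers` tokenizer's `<FILL_ME>` placeholder; a hedged sketch, assuming the base (non-Python) checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "codellama/CodeLlama-7b-hf"  # an infilling-capable variant
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# <FILL_ME> marks the span the model should fill between prefix and suffix.
prompt = 'def remove_non_ascii(s: str) -> str:\n    """ <FILL_ME>\n    return result\n'
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
output = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated (infilled) tokens.
print(tokenizer.batch_decode(output[:, input_ids.shape[1]:], skip_special_tokens=True)[0])
```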

This base model can handle 4096 tokens in total. You can set `max_new_tokens` to 4096, but note that the 4096-token limit includes the tokens of the prompt as well.
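To stay within that window, one option is to subtract the prompt length when setting `max_new_tokens`; a minimal sketch, with an illustrative checkpoint name:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "codellama/CodeLlama-7b-hf"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "def quicksort(arr):"
inputs = tokenizer(prompt, return_tensors="pt")
prompt_len = inputs["input_ids"].shape[1]

# The 4096-token window covers prompt + completion, so budget for both.
output = model.generate(**inputs, max_new_tokens=4096 - prompt_len)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```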