twaka
### Description
When using a GPT model on a T4 GPU with Triton server, setting request_prompt_lengths leaks the previous inference's response: in the second request, the response contains the tokens...
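A hedged repro sketch of the request pattern this issue describes, using the standard tritonclient HTTP API. The tensor names (`input_ids`, `input_lengths`, `request_output_len`, `request_prompt_lengths`, `output_ids`), dtypes, output length, and the model name `gpt` are assumptions modeled on a typical FasterTransformer-style config, not details from the original report:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

def generate(token_ids, prompt_len):
    tensors = {
        "input_ids": np.array([token_ids], dtype=np.uint32),
        "input_lengths": np.array([[len(token_ids)]], dtype=np.uint32),
        "request_output_len": np.array([[32]], dtype=np.uint32),
        # the parameter this issue points at:
        "request_prompt_lengths": np.array([[prompt_len]], dtype=np.uint32),
    }
    inputs = []
    for name, arr in tensors.items():
        inp = httpclient.InferInput(name, list(arr.shape), "UINT32")
        inp.set_data_from_numpy(arr)
        inputs.append(inp)
    return client.infer("gpt", inputs).as_numpy("output_ids")

first = generate([13, 17, 42, 99], prompt_len=4)
second = generate([7, 8], prompt_len=2)
# Reported bug: `second` contains trailing tokens from `first`'s response.
```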
**Describe the bug** When using DeepSpeed inference 0.9.0 or later, generating with a batch size different from the previous generation causes a RuntimeError. For example, when the first generation input is ['Hello'] and the second generation input...
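A minimal repro sketch under stated assumptions: an off-the-shelf GPT-2 checkpoint via transformers, DeepSpeed's kernel-injection inference path, and illustrative generate arguments. None of these specifics come from the original report beyond the change in batch size between calls:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Wrap the model with DeepSpeed's inference engine (kernel injection).
engine = deepspeed.init_inference(model, dtype=torch.float16,
                                  replace_with_kernel_inject=True)

def generate(prompts):
    batch = tok(prompts, return_tensors="pt", padding=True).to("cuda")
    return engine.module.generate(**batch, max_new_tokens=8)

generate(["Hello"])                 # first call, batch size 1: works
generate(["Hello", "How are you"])  # second call, batch size 2: RuntimeError on 0.9.0+
```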
**Describe the bug** fp_quantizer is not built correctly in a non-JIT installation. **To Reproduce** Steps to reproduce the behavior: ``` DS_BUILD_FP_QUANTIZER=1 pip install deepspeed ``` The install succeeds, but ``` from...
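A hedged way to check whether the op was actually prebuilt, assuming fp_quantizer follows DeepSpeed's usual op-builder pattern (as CPUAdamBuilder does) and is exposed as `FPQuantizerBuilder`; that builder name is an assumption, not confirmed by the report:

```python
# Assumption: fp_quantizer exposes an op builder like DeepSpeed's other ops.
from deepspeed.ops.op_builder import FPQuantizerBuilder

builder = FPQuantizerBuilder()
print(builder.is_compatible())  # can this op build/run in this environment?
module = builder.load()         # loads the prebuilt extension, else JIT-compiles
print(module)
```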
I have a couple of questions about tie_embeddings, but I don't have much experience with Lightning, so I'm sorry if I'm mistaken. 1. In litgpt/pretrain.py, `model.transformer.wte.weight = model.lm_head.weight` is applied when...
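For context, a self-contained sketch (independent of litgpt and Lightning) of what that assignment does: the token embedding and the output head share a single Parameter, so a gradient step through either updates both:

```python
import torch
import torch.nn as nn

vocab_size, dim = 100, 16
wte = nn.Embedding(vocab_size, dim)
lm_head = nn.Linear(dim, vocab_size, bias=False)

# Tie the weights: both modules now hold the same Parameter object,
# mirroring `model.transformer.wte.weight = model.lm_head.weight`.
wte.weight = lm_head.weight

logits = lm_head(wte(torch.tensor([3, 7])))
assert wte.weight.data_ptr() == lm_head.weight.data_ptr()
```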
### Bug description
When using `litgpt generate` on models with softcapping, `build_mask_cache` creates the mask as `torch.bool` https://github.com/Lightning-AI/litgpt/blob/ef9647cfa7cd73e03b0e29126bfe8b42cae509eb/litgpt/model.py#L465 and it is then added to the attention scores. https://github.com/Lightning-AI/litgpt/blob/ef9647cfa7cd73e03b0e29126bfe8b42cae509eb/litgpt/model.py#L309 Therefore, the attention mask is not accounted...
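A minimal illustration of the failure mode in plain PyTorch (not litgpt's code): adding a `torch.bool` mask to float scores only contributes 0/1 per position, whereas the intended masking puts `-inf` at disallowed positions:

```python
import torch

scores = torch.zeros(1, 4, 4)               # dummy attention scores
bool_mask = torch.ones(4, 4).tril().bool()  # causal mask as torch.bool

wrong = scores + bool_mask  # bool is promoted to 0.0 / 1.0, masking nothing
right = scores.masked_fill(~bool_mask, float("-inf"))

print(torch.softmax(wrong, dim=-1)[0, 0])  # future positions still get weight
print(torch.softmax(right, dim=-1)[0, 0])  # future positions zeroed out
```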