anterart comments

Results: 9 comments by anterart

Dynamic LoRA switching

I would like to have this feature as well 🙂

Unexpected inference results from Flan-T5 XXL converted to ctranslate2 with version 4.2.1 and 4.1.1 (using tensor parallel)

I also experienced an issue with the 4.2.1 Translator. Inference with the Translator in 4.2.1 produced poor results. I didn't inspect the output itself; I just looked at my metrics, which...

[Feature]: Wait for the first model to return from cooldown instead of failing request

@ishaan-jaff It's possible that the first deployment that was added to cooldown is still in cooldown state. I think that the Router should wait until one of the deployments is...

[Feature]: Wait for the first model to return from cooldown instead of failing request

The router can check how much time is left until one of the deployments leaves the cooldown state, then use that deployment exactly when that happens. This...
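The idea in the comment above can be illustrated with a minimal sketch. It is not LiteLLM's `Router` implementation: the `cooldown_until` mapping and the helpers `next_available` and `wait_for_deployment` are hypothetical names, and the sketch assumes the caller already knows the timestamp at which each deployment leaves cooldown.

```python
import time
from typing import Callable, Dict, Optional, Tuple


def next_available(
    cooldown_until: Dict[str, float], now: float
) -> Optional[Tuple[str, float]]:
    """Pick the deployment whose cooldown ends first.

    `cooldown_until` is a hypothetical mapping of deployment name to the
    clock value at which its cooldown ends. Returns (name, seconds_to_wait),
    or None when no deployments are known.
    """
    if not cooldown_until:
        return None
    name = min(cooldown_until, key=cooldown_until.get)
    # A deployment whose cooldown already ended needs no wait.
    return name, max(0.0, cooldown_until[name] - now)


def wait_for_deployment(
    cooldown_until: Dict[str, float],
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Block until the earliest deployment leaves cooldown, then return its name."""
    choice = next_available(cooldown_until, clock())
    if choice is None:
        raise ValueError("no deployments configured")
    name, delay = choice
    if delay > 0:
        sleep(delay)
    return name
```

Passing `clock` and `sleep` as parameters is only a convenience for testing the waiting logic without real delays; an async router would presumably use `asyncio.sleep` instead.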

[Feature]: Wait for the first model to return from cooldown instead of failing request

@krrishdholakia I'm working with my Azure OpenAI deployments of the gpt4-turbo model. My rate limit on it is 80k TPM; if I raise the `max_parallel_requests`...

[Feature]: Wait for the first model to return from cooldown instead of failing request

@krrishdholakia @ishaan-jaff In the meantime, could you suggest something? Let's say all the deployments of a model are in cooldown. In this case, I want to know how much...

[Feature]: Wait for the first model to return from cooldown instead of failing request

I would love to get something like that, @krrishdholakia. Because currently, what I do when I get a `RateLimitError` or a `ValueError` with the text `No deployments available for selected...

[Feature]: Wait for the first model to return from cooldown instead of failing request

@ishaan-jaff This exception returns the `cooldown_time` value that I pass when creating the `Router` instance. It's not the updated cooldown time: if, say, `x` seconds have passed, it will not show...

[Feature]: Wait for the first model to return from cooldown instead of failing request

@ishaan-jaff I checked again, and you're correct, it does return the actual time left until it goes out of cooldown.