ladi-pomsar
This may also spark another debate: at the moment, the experiments are executed via .ipynb notebooks. I don't think that is an approach that could ever actually be used in...
Hi, thank you for doing this. The motivation behind this is that when one is working on a secure system, it is important to open only as many ports as are needed,...
@ESWZY Were you able to use TensorFlow Federated with clients communicating over the internet?
This might be related to [Issue 341](https://github.com/huggingface/text-embeddings-inference/issues/341). Try using the [cpu-latest](https://github.com/huggingface/text-embeddings-inference/pkgs/container/text-embeddings-inference/275472037?tag=cpu-latest) tag instead of 1.5.
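For reference, a minimal sketch of switching to that tag; the model id, port mapping, and volume below are placeholders for illustration, not values taken from this thread:

```bash
# Sketch only: run the cpu-latest image of text-embeddings-inference
# instead of the 1.5 tag. Model id, port, and volume are placeholders.
docker pull ghcr.io/huggingface/text-embeddings-inference:cpu-latest
docker run -p 8080:80 -v $PWD/data:/data \
  ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
  --model-id BAAI/bge-reranker-v2-m3
```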
> facing similar error when trying cpu-latest image with bge-reranker-v2-m3;
>
> ```
> 2024-11-19T06:44:07.554912Z INFO text_embeddings_router: router/src/main.rs:175: Args { model_id: "BAA*/***-********-*2-m3", revision: None, tokenization_workers: None, dtype: None, pooling: None,...
> ```
I did double-check; this issue is indeed caused by the lack of flash attention support on V100s. There is no such problem on the Ada generation, but once you turn flash attention off,...
This doesn't seem to be the case with a flash attention-enabled Ada generation GPU, so it appears to be specific to the lack of flash attention.
For anyone wondering about this: it is caused by pad_token not being present in Llama's tokenizer_config.json. Something as simple as adding "pad_token": "" to the end of...
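As an illustration only (not the exact snippet from the truncated comment above), one way to make that edit with jq; the token value "</s>" is an assumption and should be replaced with a token that already exists in the model's vocabulary:

```bash
# Sketch only: append a "pad_token" entry to tokenizer_config.json.
# "</s>" is a placeholder assumption, not the value from the original comment.
jq '. + {"pad_token": "</s>"}' tokenizer_config.json > tmp.json \
  && mv tmp.json tokenizer_config.json
```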
> @sywangyi : thank you for pointing this out. I missed this warning. Indeed `ghcr.io/huggingface/text-generation-inference:latest-intel-xpu` works for me. This also correlates with @ladi-pomsar's assumption that this issue is specific to...
Hello, I can confirm. This also breaks offline deployments utilizing HF_HUB_OFFLINE=1. When HF_HUB_OFFLINE=0:

```bash
releasellm.internal | 2025-03-21T16:03:00.370524Z WARN text_generation_launcher: Could not import Flash Attention enabled models:...
```
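For comparison, a minimal sketch of the kind of offline deployment being described, assuming the model weights were already downloaded into the mounted volume; the paths, model id, and port are placeholders, not values from this deployment:

```bash
# Sketch only: offline TGI deployment with HF_HUB_OFFLINE=1.
# Assumes the model was pre-downloaded into /opt/models on the host;
# all paths and the model id are placeholders.
docker run --gpus all -p 8080:80 \
  -v /opt/models:/data \
  -e HF_HUB_OFFLINE=1 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data/my-model
```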