Guancheng Fu
## Description To enable vLLM tensor parallelism. Related PR: https://github.com/analytics-zoo/vllm/pull/17
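For context, a minimal sketch of what tensor-parallel inference looks like with the stock vLLM API; the `tensor_parallel_size` parameter is standard vLLM, and the analytics-zoo fork linked above may expose this differently.

```python
# Sketch only: stock vLLM API; the analytics-zoo fork may differ.
from vllm import LLM, SamplingParams

# tensor_parallel_size=2 shards the model weights across two devices.
llm = LLM(model="facebook/opt-13b", tensor_parallel_size=2)
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["What is tensor parallelism?"], params)
for out in outputs:
    print(out.outputs[0].text)
```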
## Description Add the related `enable_xetla` interface to `optimize_model`. This is only a draft for now.
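A minimal sketch of how the draft interface might be used; the exact keyword name `enable_xetla` on `optimize_model` is an assumption taken from this PR's description, and the package path may still be `bigdl.llm` depending on the release.

```python
# Hypothetical usage of the draft interface described above; the
# enable_xetla keyword is an assumption based on this PR's description.
from transformers import AutoModelForCausalLM
from ipex_llm import optimize_model  # formerly bigdl.llm

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# enable_xetla would presumably route supported ops through XeTLA kernels.
model = optimize_model(model, low_bit="sym_int4", enable_xetla=True)
```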
Details: https://github.com/analytics-zoo/nano/issues/1246#issuecomment-2046881777 This problem happens with transformers versions greater than 4.36.0. It can be solved either by setting `optimize_model=False` or by using `transformers==4.34.0`. I guess the problem might be here:...
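A hedged sketch of the two workarounds, assuming the ipex-llm transformers-style loader (which accepts an `optimize_model` keyword); the alternative is simply pinning the library version.

```python
# Workaround 1: disable the optimization when loading
# (assumes the ipex-llm transformers-style loader).
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    load_in_4bit=True,
    optimize_model=False,  # skip the path that breaks on transformers > 4.36.0
)

# Workaround 2: pin the library version instead:
#   pip install transformers==4.34.0
```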
## Why are these changes needed? The previous prompt format for ChatGLM3 was not correct and yielded wrong output; after the change, the output is correct. ## Related issue number (if applicable) None...
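For reference, a minimal sketch of the role-tag format ChatGLM3 expects; this follows the model's published chat template, and whether it matches this PR's exact template is an assumption.

```python
# Sketch of ChatGLM3's role-tag chat format (assumption: matches the
# model's published template; the PR's exact template is not shown above).
def format_chatglm3(system, history, prompt):
    text = f"<|system|>\n{system}"
    for user_msg, assistant_msg in history:
        text += f"\n<|user|>\n{user_msg}\n<|assistant|>\n{assistant_msg}"
    text += f"\n<|user|>\n{prompt}\n<|assistant|>"
    return text

print(format_chatglm3("You are a helpful assistant.", [], "What is AI?"))
```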
## Description Add a vLLM quickstart. It can be previewed at: http://10.239.44.83:8001/doc/LLM/Quickstart/vLLM_quickstart.html#serving-using-ipex-llm-and-vllm
## Why are these changes needed? Description: [ipex-llm](https://github.com/intel-analytics/ipex-llm) is a library for running LLMs on Intel CPU/XPU (from laptop to GPU to cloud) using INT4/FP4/INT8/FP8 with very low latency (for...
## Description This PR adds internal oneCCL support for tensor parallelism (TP). It also changes the `oneccl_bind_pt` used for the image.
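A hedged sketch of how oneCCL typically plugs into PyTorch distributed for TP: importing `oneccl_bindings_for_pytorch` (the `oneccl_bind_pt` package) registers the `ccl` backend. This PR's internal wiring is not shown above and may differ.

```python
# Sketch: initializing torch.distributed with the oneCCL backend, as
# oneccl_bind_pt is typically used; this PR's internals may differ.
import os
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  (registers the "ccl" backend)

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(
    backend="ccl",
    rank=int(os.environ.get("RANK", "0")),
    world_size=int(os.environ.get("WORLD_SIZE", "1")),
)
# Tensor-parallel layers can now use dist.all_reduce over this group.
```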
Hi, I am running some XPU workloads and found that different compute-runtime versions lead to different XPU memory usage. When using version https://github.com/intel/compute-runtime/releases/tag/23.17.26241.22, the memory usage on Arc A770...
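A hedged sketch of how the per-runtime comparison could be made from the framework side, assuming intel-extension-for-pytorch's XPU backend, whose `torch.xpu` memory API mirrors `torch.cuda`; driver-level usage controlled by the compute runtime can additionally be checked with tools such as xpu-smi.

```python
# Sketch: reading allocator stats on XPU (assumes intel-extension-for-pytorch;
# torch.xpu mirrors the torch.cuda memory API).
import torch
import intel_extension_for_pytorch  # noqa: F401  (enables the xpu device)

x = torch.randn(4096, 4096, device="xpu")
y = x @ x
torch.xpu.synchronize()
print(f"allocated: {torch.xpu.memory_allocated() / 1024**2:.1f} MiB")
print(f"reserved:  {torch.xpu.memory_reserved() / 1024**2:.1f} MiB")
```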
## Description Delete obsolete code for vLLM. ### 1. Why the change? ### 2. User API changes ### 3. Summary of the change ### 4. How to test? - [...
### Describe the issue We encountered a performance regression that we think might be related to intel-extension-for-pytorch. Specifically, we found that the performance of the `gemm_kernel` is inconsistent across...
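A hedged micro-benchmark sketch for reproducing such a comparison, assuming intel-extension-for-pytorch on XPU; it times a plain `torch.matmul`, which dispatches to the GEMM kernels in question, rather than calling any internal kernel directly.

```python
# Sketch: timing GEMM on XPU (assumes intel-extension-for-pytorch).
# Synchronize around the timed region so kernel time, not launch time,
# is what gets measured.
import time
import torch
import intel_extension_for_pytorch  # noqa: F401

a = torch.randn(4096, 4096, device="xpu", dtype=torch.float16)
b = torch.randn(4096, 4096, device="xpu", dtype=torch.float16)

for _ in range(5):          # warm-up (JIT/kernel caches)
    torch.matmul(a, b)
torch.xpu.synchronize()

t0 = time.perf_counter()
for _ in range(50):
    torch.matmul(a, b)
torch.xpu.synchronize()
print(f"mean GEMM time: {(time.perf_counter() - t0) / 50 * 1e3:.3f} ms")
```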