Guancheng Fu
## Description To enable vLLM tensor parallelism. Related PR: https://github.com/analytics-zoo/vllm/pull/17
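For context, a minimal sketch of what tensor-parallel inference looks like with the stock vLLM API; the `tensor_parallel_size` parameter is standard vLLM, and the analytics-zoo fork linked above may expose this differently.

```python
# Sketch only: stock vLLM API; the analytics-zoo fork may differ.
from vllm import LLM, SamplingParams

# tensor_parallel_size=2 shards the model weights across two devices.
llm = LLM(model="facebook/opt-13b", tensor_parallel_size=2)
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["What is tensor parallelism?"], params)
for out in outputs:
    print(out.outputs[0].text)
```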
## Description Add the related `enable_xetla` interface to `optimize_model`. This is only a draft for now.
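A minimal sketch of how the draft interface might be used; the exact keyword name `enable_xetla` on `optimize_model` is an assumption taken from this PR's description, and the package path may still be `bigdl.llm` depending on the release.

```python
# Hypothetical usage of the draft interface described above; the
# enable_xetla keyword is an assumption based on this PR's description.
from transformers import AutoModelForCausalLM
from ipex_llm import optimize_model  # formerly bigdl.llm

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# enable_xetla would presumably route supported ops through XeTLA kernels.
model = optimize_model(model, low_bit="sym_int4", enable_xetla=True)
```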
Details: https://github.com/analytics-zoo/nano/issues/1246#issuecomment-2046881777 This problem happens with transformers versions greater than 4.36.0. It can be solved either by setting `optimize_model=False` or by using `transformers==4.34.0`. I guess the problem might be here:...
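A hedged sketch of the two workarounds, assuming the ipex-llm transformers-style loader (which accepts an `optimize_model` keyword); the alternative is simply pinning the library version.

```python
# Workaround 1: disable the optimization when loading
# (assumes the ipex-llm transformers-style loader).
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    load_in_4bit=True,
    optimize_model=False,  # skip the path that breaks on transformers > 4.36.0
)

# Workaround 2: pin the library version instead:
#   pip install transformers==4.34.0
```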
## Why are these changes needed? The previous prompt format for ChatGLM3 was not correct and yielded wrong output; after the change, the output is correct. ## Related issue number (if applicable) None...
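For reference, a minimal sketch of the role-tag format ChatGLM3 expects; this follows the model's published chat template, and whether it matches this PR's exact template is an assumption.

```python
# Sketch of ChatGLM3's role-tag chat format (assumption: matches the
# model's published template; the PR's exact template is not shown above).
def format_chatglm3(system, history, prompt):
    text = f"<|system|>\n{system}"
    for user_msg, assistant_msg in history:
        text += f"\n<|user|>\n{user_msg}\n<|assistant|>\n{assistant_msg}"
    text += f"\n<|user|>\n{prompt}\n<|assistant|>"
    return text

print(format_chatglm3("You are a helpful assistant.", [], "What is AI?"))
```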
## Description Add a vLLM quickstart. It can be previewed at: http://10.239.44.83:8001/doc/LLM/Quickstart/vLLM_quickstart.html#serving-using-ipex-llm-and-vllm
## Why are these changes needed? Description: [ipex-llm](https://github.com/intel-analytics/ipex-llm) is a library for running LLMs on Intel CPU/XPU (from laptop to GPU to cloud) using INT4/FP4/INT8/FP8 with very low latency (for...
## Description This PR adds internal oneCCL support for tensor parallelism (TP). It also changes the `oneccl_bind_pt` used for the image.
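A hedged sketch of how oneCCL typically plugs into PyTorch distributed for TP: importing `oneccl_bindings_for_pytorch` (the `oneccl_bind_pt` package) registers the `ccl` backend. This PR's internal wiring is not shown above and may differ.

```python
# Sketch: initializing torch.distributed with the oneCCL backend, as
# oneccl_bind_pt is typically used; this PR's internals may differ.
import os
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  (registers the "ccl" backend)

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(
    backend="ccl",
    rank=int(os.environ.get("RANK", "0")),
    world_size=int(os.environ.get("WORLD_SIZE", "1")),
)
# Tensor-parallel layers can now use dist.all_reduce over this group.
```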
Hi, I am running some XPU workloads and found that different compute-runtime versions lead to different XPU memory usage. When using version https://github.com/intel/compute-runtime/releases/tag/23.17.26241.22, the memory usage on Arc A770...
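A hedged sketch of how the per-runtime comparison could be made from the framework side, assuming intel-extension-for-pytorch's XPU backend, whose `torch.xpu` memory API mirrors `torch.cuda`; driver-level usage controlled by the compute runtime can additionally be checked with tools such as xpu-smi.

```python
# Sketch: reading allocator stats on XPU (assumes intel-extension-for-pytorch;
# torch.xpu mirrors the torch.cuda memory API).
import torch
import intel_extension_for_pytorch  # noqa: F401  (enables the xpu device)

x = torch.randn(4096, 4096, device="xpu")
y = x @ x
torch.xpu.synchronize()
print(f"allocated: {torch.xpu.memory_allocated() / 1024**2:.1f} MiB")
print(f"reserved:  {torch.xpu.memory_reserved() / 1024**2:.1f} MiB")
```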
## Description Delete obsolete code for vLLM. ### 1. Why the change? ### 2. User API changes ### 3. Summary of the change ### 4. How to test? - [...
### Describe the issue We encountered a performance regression that we think might be related to intel-extension-for-pytorch. Specifically, we found that the performance of the `gemm_kernel` is inconsistent across...
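A hedged micro-benchmark sketch for reproducing such a comparison, assuming intel-extension-for-pytorch on XPU; it times a plain `torch.matmul`, which dispatches to the GEMM kernels in question, rather than calling any internal kernel directly.

```python
# Sketch: timing GEMM on XPU (assumes intel-extension-for-pytorch).
# Synchronize around the timed region so kernel time, not launch time,
# is what gets measured.
import time
import torch
import intel_extension_for_pytorch  # noqa: F401

a = torch.randn(4096, 4096, device="xpu", dtype=torch.float16)
b = torch.randn(4096, 4096, device="xpu", dtype=torch.float16)

for _ in range(5):          # warm-up (JIT/kernel caches)
    torch.matmul(a, b)
torch.xpu.synchronize()

t0 = time.perf_counter()
for _ in range(50):
    torch.matmul(a, b)
torch.xpu.synchronize()
print(f"mean GEMM time: {(time.perf_counter() - t0) / 50 * 1e3:.3f} ms")
```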