
A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).

Results: 6 swiftLLM issues, sorted by recently updated.

## Basic Test

1. Offline mode: `python3 examples/offline.py --model-path ./models/Llama-3.2-1B` ![image](https://github.com/user-attachments/assets/30c8166b-9580-4f70-a36c-2baf7ed1eafb)
2. Online mode: `python3 examples/online.py --model-path ./models/Llama-3.2-1B` ![image](https://github.com/user-attachments/assets/02e66a52-efb8-4586-963c-60ecadb66699)
3. API server: `python3 swiftllm/server/api_server.py --model-path ./models/Llama-3.2-1B/ --host 0.0.0.0 --port 8082` ![image](https://github.com/user-attachments/assets/dba819bc-d3ae-4712-8e12-3bcd45505cd3)...
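
Once the API server from step 3 is up, it can be queried over HTTP. A minimal sketch follows; the endpoint path (`/generate`) and the payload fields are assumptions for illustration, not swiftLLM's documented API, so check `swiftllm/server/api_server.py` for the actual routes and request schema:

```python
# Sketch of querying the api_server started above.
# NOTE: the "/generate" route and the JSON fields are assumptions,
# not values taken from swiftLLM's source.
import requests

resp = requests.post(
    "http://0.0.0.0:8082/generate",  # host/port from the command above
    json={"prompt": "Hello, my name is", "max_tokens": 32},  # hypothetical fields
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```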

Could you please provide the relevant code for performance testing? During my own testing, performance seems to be worse than vLLM's.
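
While waiting for the official benchmark code, one coarse way to compare against vLLM is to time the offline example over a fixed workload. This is a generic timing sketch, not the repo's benchmarking harness, and it only yields a rough end-to-end number:

```python
# Rough end-to-end timing: run the offline example and measure wall-clock
# time. Compare against an equivalent vLLM run on the same prompts/GPU.
import subprocess
import time

start = time.perf_counter()
subprocess.run(
    ["python3", "examples/offline.py", "--model-path", "./models/Llama-3.2-1B"],
    check=True,
)
elapsed = time.perf_counter() - start
print(f"offline.py finished in {elapsed:.1f} s")
```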

Hello, on my first attempt to run swiftLLM with Llama-3.2-1B, I get the following error: ![image](https://github.com/user-attachments/assets/f69bdd76-ed7f-4b55-a626-d2b756245d61) Python version: 3.9.20, torch version: 2.4.0.
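
For bug reports like this one, a quick environment dump usually helps maintainers reproduce the issue. A minimal sketch using standard PyTorch calls:

```python
# Environment dump to attach to bug reports.
import sys
import torch

print("Python:", sys.version)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("CUDA version:", torch.version.cuda)
```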

The current code doesn't seem to support parallelism. Is there any plan to add support for it?

Even though the `--max-batch-size` default value is 512, I could not get the batch size to exceed 100. I also ran it on GPUs with much more VRAM (from 6 GB to 48 GB), changed...
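
If the batch size stalls near 100 even with more VRAM, one thing worth checking is whether the scheduler is KV-cache-bound rather than limited by the flag. A back-of-envelope sketch of that reasoning follows; every model and memory number below is an illustrative assumption about Llama-3.2-1B, not a value read from swiftLLM's code:

```python
# Back-of-envelope: with a paged KV cache, the *effective* batch size is
# often capped by free KV-cache memory, not by --max-batch-size.
# All numbers are assumptions for illustration.
num_layers     = 16   # assumed for Llama-3.2-1B
num_kv_heads   = 8    # assumed
head_dim       = 64   # assumed
bytes_per_elem = 2    # fp16

# K and V, per token, summed across all layers
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

free_kv_cache_bytes = 40 * 1024**3  # e.g. ~40 GiB left after weights (assumed)
avg_seq_len         = 2048          # assumed workload

max_tokens = free_kv_cache_bytes // kv_bytes_per_token
print("KV bytes per token:", kv_bytes_per_token)              # 32768
print("max concurrent sequences ~", max_tokens // avg_seq_len)
```

If a calculation like this yields far more than 100 sequences, the cap is more likely a scheduler limit than a memory limit.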

I saw these lines in the source code: https://github.com/interestingLSY/swiftLLM/blob/682cf9a28f97f7490409981a2f181528f377eb5d/swiftllm/worker/model.py#L116-L122 After `forward`, some memory has been released, for example memory for intermediate activations, input IDs, etc., so could the...
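
For context on the question, the release of intermediate-activation memory after `forward` is generic PyTorch behavior: once the last reference to a tensor is dropped, its storage returns to the caching allocator. A minimal sketch (not swiftLLM's actual code path):

```python
# Generic PyTorch illustration: intermediate tensors free their memory once
# all references are dropped; empty_cache() then returns cached blocks to
# the driver. This is not swiftLLM's code, just the underlying mechanism.
import torch

assert torch.cuda.is_available()
torch.cuda.reset_peak_memory_stats()

x = torch.randn(4096, 4096, device="cuda")
y = x @ x                      # intermediate activation
print("allocated after forward:", torch.cuda.memory_allocated() // 2**20, "MiB")

del y                          # drop the only reference to the activation
torch.cuda.empty_cache()       # return cached blocks to the driver
print("allocated after release:", torch.cuda.memory_allocated() // 2**20, "MiB")
```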