swiftLLM
A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).
## Basic Test

1. offline mode: `python3 examples/offline.py --model-path ./models/Llama-3.2-1B`
2. online mode: `python3 examples/online.py --model-path ./models/Llama-3.2-1B`
3. api_server: `python3 swiftllm/server/api_server.py --model-path ./models/Llama-3.2-1B/ --host 0.0.0.0 --port 8082` (see the client sketch below)
...
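Once the API server is up, a quick sanity check might look like the sketch below. Note that the `/generate` endpoint and the JSON fields are assumptions for illustration, not swiftLLM's documented API; check the server code for the actual routes.

```python
# Hypothetical client sketch: the "/generate" endpoint name and the
# request/response fields are assumptions, not swiftLLM's documented API.
import requests

resp = requests.post(
    "http://0.0.0.0:8082/generate",            # host/port from the command above
    json={"prompt": "Hello, my name is", "max_tokens": 32},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```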
Could you please provide the relevant code for performance testing? In my own tests, performance seems to be worse than vLLM's.
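Absent an official benchmark script in this excerpt, a rough apples-to-apples comparison can be made with a wall-clock throughput harness like the sketch below; `generate` here is a hypothetical stand-in for each engine's batch-generation entry point (e.g. whatever `examples/offline.py` calls internally), not a swiftLLM API.

```python
# Minimal throughput harness sketch. `generate` is a hypothetical stand-in
# for an engine's batch-generation entry point; wire it to swiftLLM's and
# vLLM's offline APIs respectively to compare like for like.
import time

def measure_throughput(generate, prompts, max_tokens=128):
    start = time.perf_counter()
    outputs = generate(prompts, max_tokens=max_tokens)   # list of generated token lists
    elapsed = time.perf_counter() - start
    total_tokens = sum(len(tokens) for tokens in outputs)
    return total_tokens / elapsed                        # tokens per second

# Example usage:
# tps = measure_throughput(my_generate_fn, ["Hello"] * 64)
# print(f"{tps:.1f} tok/s")
```

Keeping the prompt set, `max_tokens`, and batch size identical across both engines matters more than the harness itself; differing sampling settings can easily dominate the measured gap.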
Hello, on my first try running swiftLLM with Llama-3.2-1B, I got the following error. Python version: 3.9.20, torch version: 2.4.0
The current code doesn't seem to support parallelism. Is there any plan to support it in the future?
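For context on what such support would involve: tensor parallelism typically shards each weight matrix across GPUs and combines the partial results with a collective. A schematic sketch of column-parallel sharding for a linear layer is shown below; this is illustration only, not swiftLLM code, and it assumes a `torch.distributed` process group has already been initialized.

```python
# Schematic tensor-parallel sketch (illustration only, not swiftLLM code).
# Each rank keeps a column shard of the weight and all-gathers the outputs.
# Assumes dist.init_process_group(...) has already been called.
import torch
import torch.distributed as dist

def column_parallel_linear(x, full_weight, rank, world_size):
    # Shard the output dimension of the weight across ranks.
    shard = full_weight.chunk(world_size, dim=0)[rank]   # (out/world, in)
    local_out = x @ shard.T                              # partial output columns
    gathered = [torch.empty_like(local_out) for _ in range(world_size)]
    dist.all_gather(gathered, local_out)                 # collect all shards
    return torch.cat(gathered, dim=-1)                   # full (batch, out) output
```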
Even though the default value of `--max-batch-size` is 512, I could not get the batch size to exceed 100. I also ran it on GPUs with much more VRAM (from 6 GB to 48 GB), changed...
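One plausible explanation (an assumption, not confirmed from the code) is that the effective batch size is capped by how many KV-cache slots fit in the remaining GPU memory, regardless of `--max-batch-size`. A back-of-the-envelope check, using assumed Llama-3.2-1B config values:

```python
# Back-of-the-envelope KV-cache capacity check. The model numbers below are
# assumptions for Llama-3.2-1B in fp16; substitute the real config values.
layers, kv_heads, head_dim = 16, 8, 64
bytes_per_elem = 2                              # fp16
max_seq_len = 2048

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem

def max_concurrent_seqs(free_vram_gb):
    per_seq = kv_bytes_per_token * max_seq_len  # worst case: full-length sequences
    return int(free_vram_gb * 1024**3 // per_seq)

for gb in (6, 48):
    print(gb, "GB ->", max_concurrent_seqs(gb), "sequences")
```

Under these assumptions, roughly 6 GB of free VRAM supports on the order of 100 worst-case sequences, which would match the observed cap; if the allocator budgets KV memory from a fixed fraction of total VRAM rather than from free VRAM, a larger card would not necessarily raise the limit.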
I saw these lines in the source code: https://github.com/interestingLSY/swiftLLM/blob/682cf9a28f97f7490409981a2f181528f377eb5d/swiftllm/worker/model.py#L116-L122 After `forward`, some memory has been released, for example the memory for intermediate activations, the input ids, etc., so could the...
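For reference on the general PyTorch behavior (not the linked swiftLLM code specifically): intermediate activations from a no-grad forward pass are freed back to the caching allocator as soon as the tensors go out of scope, and `torch.cuda.empty_cache()` further releases cached blocks to the CUDA driver. A minimal sketch:

```python
# Sketch of how intermediate-activation memory behaves after a forward pass.
# `model` is assumed to return logits of shape (batch, seq, vocab).
import torch

@torch.inference_mode()
def forward_once(model, input_ids):
    logits = model(input_ids)
    return logits[:, -1, :].clone()   # keep only what's needed downstream

# Intermediates allocated inside forward_once are freed (returned to
# PyTorch's caching allocator) when the function returns;
# torch.cuda.empty_cache() additionally hands the cached blocks back
# to the CUDA driver so other processes can use them.
```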