Add vLLM benchmark
Hi,
I was playing around with a vLLM benchmark and wanted to get your opinion on how best to deal with large files in benchmarks. The vLLM benchmark serves a model (facebook/opt-125m) and uses vLLM's own benchmark tooling to query it with a widely used dataset (ShareGPT). For development, I just mounted the model and dataset into the container-under-test, but that seems impractical for CI and for other devs.
Should I embed the model files into the benchmark image? Is there another way you would prefer?
Please don't do an actual review yet. Just wanted to get some early feedback. I still have to understand metricsviz and how to incorporate it.
Heya, this is cool. I wrote the gVisor benchmark for ollama, so here are a few thoughts on writing benchmarks for these large beasts.
All the benchmarks are structured to have everything bundled in the image, so there's no need for internet connectivity at runtime. As you can tell, that can become a problem for huge models. Thankfully, opt-125m is tiny; it's definitely at the scale where this isn't a concern, so you can just bundle it in the image. (Generally, anything under 1GiB is ~negligible.)
That being said, I think it would be valuable to benchmark a larger model too. From my own benchmarking, I've found that larger models have much lower overhead relative to unsandboxed performance, presumably because a larger fraction of the time is spent waiting for the GPU to do its thing rather than on the cost of shoveling tokens back and forth from it. Let me know if that's your next step, because then it starts getting more complicated: the incremental image size has to be accounted for on the benchmarking VMs Google uses to run these. Either way, the first step would be to get a small vLLM image checked in before adding a bigger model.
I'd also note that how the model files are made available to the model server has an impact on model loading performance. This is similar to why RunWithDifferentFilesystems in test/benchmarks/fs/fsbench/fsbench.go exists. If you care about model loading performance, consider using it.
You probably don't need to care that much about metricsviz; it's not going to be useful or reliable for a large benchmark like this. You can still plumb it in, though, probably on the server container side. Just call metricsviz.FromContainerLogs(ctx, b, serverCtr) in a defer statement that runs before serverCtr gets deleted.
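To make the defer ordering concrete, here's a rough sketch of what I mean. Not prescriptive: the package name, import paths, and names (BenchmarkVLLM, serverCtr, image layout) just mirror what the other GPU benchmarks do, from memory, and may need adjusting.

```go
// Rough sketch only; paths and names follow the other GPU benchmarks
// from memory and may need adjusting.
package vllm_test

import (
	"context"
	"testing"

	"gvisor.dev/gvisor/pkg/test/dockerutil"
	"gvisor.dev/gvisor/test/metricsviz"
)

func BenchmarkVLLM(b *testing.B) {
	ctx := context.Background()
	serverCtr := dockerutil.MakeContainer(ctx, b)
	// Deferred calls run last-in-first-out: CleanUp (which deletes the
	// container) is deferred first...
	defer serverCtr.CleanUp(ctx)
	// ...so this one runs earlier, while the container and its logs
	// still exist.
	defer metricsviz.FromContainerLogs(ctx, b, serverCtr)

	// ...spawn the vLLM server, wait for readiness, run the client...
}
```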
Awesome. Thanks for the tips! It will take me a few days to get the PR ready.
Interesting point about the fs. I want to isolate the inference loop as much as possible and exclude the loading time from this benchmark; I will document that. The benchmark waits for the inference server to become ready, and by that point the model has already been loaded. I guess benchmarking model loading in a separate benchmark would make sense(?).
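Concretely, this is roughly what I have in mind for excluding load time. It's a standard-library-only sketch; the /health endpoint and port 8000 are assumptions based on vLLM's OpenAI-compatible server defaults and may not match how the server container ends up being exposed.

```go
// Poll the server until it reports ready; only then start timing.
// Needs only context, net/http, testing, and time from the standard library.
func waitUntilReady(ctx context.Context, b *testing.B, baseURL string) {
	b.Helper()
	for {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/health", nil)
		if err != nil {
			b.Fatalf("building readiness request: %v", err)
		}
		resp, err := http.DefaultClient.Do(req)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return // The server only reports healthy once the model is loaded.
			}
		}
		select {
		case <-ctx.Done():
			b.Fatalf("server never became ready: %v", ctx.Err())
		case <-time.After(time.Second):
		}
	}
}

// In the benchmark body, after starting the server container:
//   waitUntilReady(ctx, b, "http://"+serverIP.String()+":8000")
//   b.ResetTimer() // model loading is now excluded from the measurement
```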
I would indeed like to also have a "vllm-large" benchmark once this is merged, but I will probably play around with this for a while first. Just in case: I thought about llama3-7b. Do you know if the license and the approval-based access will be a problem for storing the model weights in images? Alternatively, I'd probably go for mistral-7b. Both should be ~15GB in size. Will that work?
I am not an expert here, but all we'd be doing is checking in a Dockerfile that contains a command that downloads the model weights when run by a user. I believe that's not "distribution"; at least, it feels no different from linking to an external website that hosts the model weights. So I don't think there's a problem with having that file checked in.
So I think this could be reviewed now. Let me know what you think. The idea of the test is (rough client-side sketch after the list):
- load facebook/opt-125m model in server ctr
- run benchmark_serving.py against that model
- use ShareGPT_V3_unfiltered_cleaned_split dataset as input
- only start timing after the model has loaded
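Simplified client side for the above; the image name, the dataset path inside the image, and the exact benchmark_serving.py flags are illustrative and depend on how the gpu/vllm image ends up being built and which vLLM version it bundles.

```go
// serverCtr is the already-running vLLM server container; readiness and
// timing are handled by the caller. Paths and flags are illustrative.
func runServingBenchmark(ctx context.Context, b *testing.B, serverCtr *dockerutil.Container) string {
	serverIP, err := serverCtr.FindIP(ctx, false /* ipv6 */)
	if err != nil {
		b.Fatalf("finding server IP: %v", err)
	}
	clientCtr := dockerutil.MakeContainer(ctx, b)
	defer clientCtr.CleanUp(ctx)
	out, err := clientCtr.Run(ctx, dockerutil.RunOpts{Image: "gpu/vllm"},
		"python3", "benchmark_serving.py",
		"--model", "facebook/opt-125m",
		"--host", serverIP.String(),
		"--port", "8000",
		"--dataset", "/data/ShareGPT_V3_unfiltered_cleaned_split.json",
	)
	if err != nil {
		b.Fatalf("benchmark_serving.py failed: %v\noutput:\n%s", err, out)
	}
	return out // throughput/latency numbers get parsed out of this and reported via b.ReportMetric
}
```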
Thanks for the detailed feedback! Please do not re-review yet. I am still experimenting with the new changes.
Thanks for the thorough review! Squashed the changes and rebased onto master.
This is failing BuildKite presubmits because the newly-added image is too large for it... Can you add the image name (gpu/vllm) to NON_TEST_IMAGES here?
This is a regex pattern, so you may need to make it gpu/ollama/bench|gpu/vllm, and then add quotes around the $(NON_TEST_IMAGES) on the line below. You can verify that it works by running make list-all-test-images and checking that the vLLM image is no longer included.
Merged! Thanks a lot for the contribution and for putting up with all these presubmit checks.