
Interest in a WebGPU backend for virtio-gpu?

Open tareksander opened this issue 11 months ago • 7 comments

I was researching the various ways of getting graphics/compute acceleration into a VM, and apart from partitioning a GPU with hardware support, they all seem susceptible to various potential escape vectors.

My first idea was to define a safe subset of Vulkan and use checks on the host, but I realized that's pretty much what WebGPU does, with corporate-backed implementations that are battle-tested in browsers.

AFAIK it would be a valid Vulkan implementation to pretty much ignore all synchronization and let it all be handled implicitly by WebGPU, so using Vulkan as an API in the VM would still be possible with driver support. The main point is that the safety is handled on the host, and not in a WebGPU layer inside the guest, where raw access to e.g. Vulkan could still lead to a compromised system.

Before I go and do the work though, I wanted to check whether safe GPU acceleration in VMs is even something other people want, too.

tareksander avatar Jun 02 '25 17:06 tareksander

@dorindabassey @mtjhrc WDYT?

stefano-garzarella avatar Jun 03 '25 07:06 stefano-garzarella

Hi, thanks for opening this.

> My first idea was to define a safe subset of Vulkan and use checks on the host, but I realized that's pretty much what WebGPU does, with corporate-backed implementations that are battle-tested in browsers.

I wasn't familiar with WebGPU until now, but what you describe makes sense: defining a safe subset of GPU access. I want to read more about it.

> AFAIK it would be a valid Vulkan implementation to pretty much ignore all synchronization and let it all be handled implicitly by WebGPU, so using Vulkan as an API in the VM would still be possible with driver support. The main point is that the safety is handled on the host, and not in a WebGPU layer inside the guest, where raw access to e.g. Vulkan could still lead to a compromised system.

> Before I go and do the work though, I wanted to check whether safe GPU acceleration in VMs is even something other people want, too.

Speaking of Vulkan, I started considering extending vhost-device-gpu to support compute workloads (e.g., running parts of LLM inference) using Vulkan compute shaders. If there's a way to leverage WebGPU as the host API backend instead of raw Vulkan, and expose only a safe subset of compute to the guest, then yes, safe GPU acceleration in VMs is something we would want. But I'm a bit concerned about the performance overhead of WebGPU compared to raw Vulkan, especially for compute-heavy workloads like matrix multiplication in LLM inference.

In the current virglrenderer backend, for example, the guest sends low-level GPU commands to the host, which executes them via OpenGL. IIUC you plan to use a Vulkan backend wrapped in WebGPU; how would it work in that case? Let us know if you have a POC for this.

dorindabassey avatar Jun 04 '25 10:06 dorindabassey

> Before I go and do the work though, I wanted to check whether safe GPU acceleration in VMs is even something other people want, too.

Yes, this sounds interesting!

> AFAIK it would be a valid Vulkan implementation to pretty much ignore all synchronization and let it all be handled implicitly by WebGPU, so using Vulkan as an API in the VM would still be possible with driver support.

I don't know much about WebGPU though, but to my understanding it is a higher-level abstraction than Vulkan, so I am not really sure how feasible it is to implement Vulkan on top of WebGPU. I doubt WebGPU itself would be that bad for performance, but I am not so sure about the performance implications of a Vulkan-on-WebGPU implementation. Do you know the details of how WebGPU and Vulkan differ? I feel like you could possibly run into some hard-to-emulate scenarios.

> apart from partitioning a GPU with hardware support

If you want to restrict the attack surface, you could also consider DRM native context. This is similar to hardware partitioning of the GPU, but sits at the kernel side of the GPU drivers instead. The idea there is that you don't use the host Vulkan implementation; instead, you expose the ioctls used by the GPU driver's Vulkan implementation to the guest. This is only as secure as the specific GPU driver though... (and the VM has to have a driver for the host GPU)

mtjhrc avatar Jun 04 '25 12:06 mtjhrc

> But then I'm a bit concerned about the performance overhead of WebGPU compared to raw Vulkan, especially for compute-heavy workloads like matrix multiplication in LLM inference.

The overhead depends on what you do and on the device capabilities, e.g. how much of an overhead robust buffer access has. Additionally, WebGPU lacks many features that aren't supported everywhere, like the cooperative matrix extensions from Nvidia that may be of interest for LLMs.

> In the current virglrenderer backend, for example, the guest sends low-level GPU commands to the host, which executes them via OpenGL. IIUC you plan to use a Vulkan backend wrapped in WebGPU; how would it work in that case?

It would function mostly like Venus, probably with protocol definitions generated from WebIDL. The other neat thing is that the backend doesn't need to be Vulkan: WebGPU also supports Metal and DirectX. I'd probably use the wgpu crate for the backend. Vulkan would be the exposed API inside the VM (in addition to WebGPU of course), since I don't think many existing applications use WebGPU yet.

> I don't know much about WebGPU though, but to my understanding it is a higher-level abstraction than Vulkan, so I am not really sure how feasible it is to implement Vulkan on top of WebGPU. I doubt WebGPU itself would be that bad for performance, but I am not so sure about the performance implications of a Vulkan-on-WebGPU implementation. Do you know the details of how WebGPU and Vulkan differ? I feel like you could possibly run into some hard-to-emulate scenarios.

WebGPU is comparable to Vulkan 1.0 with automatic (and mandatory) synchronization. It's designed as a safe lowest common denominator between Vulkan, Metal, and DirectX 12. Since Vulkan doesn't guarantee strong implicit synchronization but doesn't forbid it either, a conformant implementation of at least Vulkan 1.0 should be possible.

> If you want to restrict the attack surface, you could also consider DRM native context. This is similar to hardware partitioning of the GPU, but sits at the kernel side of the GPU drivers instead. The idea there is that you don't use the host Vulkan implementation; instead, you expose the ioctls used by the GPU driver's Vulkan implementation to the guest. This is only as secure as the specific GPU driver though... (and the VM has to have a driver for the host GPU)

This sounds like it provides relatively raw access to the GPU, which would still be vulnerable in a lot of scenarios if the firmware doesn't provide strong protection, like reading GPU memory belonging to the host or other VMs (or is the whole GPU only available to the VM?). More complex scenarios could include embedding malware inside the GPU firmware to gain persistence. Depending on how sensitive your data is, that is something you need to consider.

tareksander avatar Jun 05 '25 12:06 tareksander

> Vulkan would be the exposed API inside the VM (in addition to WebGPU of course), since I don't think many existing applications use WebGPU yet.

Well, even if they use WebGPU, they wouldn't automatically be able to use the implementation you exposed to the VM: you would also need to add support to e.g. the wgpu crate, no? Or, in general, invent a standard for loading WebGPU drivers (or is there one already?).

You plan to do the Vulkan-to-WebGPU translation in the guest, right? So would you integrate it into Mesa? In that case you should probably also ask there.

> It would function mostly like Venus, probably with protocol definitions generated from WebIDL.

Sure, that part (exposing the API to the guest) seems relatively straightforward; actually being able to use it in the guest seems more difficult.

This would be good to implement in the rutabaga_gfx crate, which this implementation consumes. That crate lives in the crosvm repository though, but hopefully they (Google; @gurchetansingh) might move it to a better upstream place (under rust-vmm or Mesa would make sense). libkrun, which I'm also working on, vendors rutabaga with downstream patches.

> This sounds like it provides relatively raw access to the GPU, which would still be vulnerable in a lot of scenarios if the firmware doesn't provide strong protection

Well, they shouldn't be able to, but of course there might be bugs. I was just making the point that this is a possible lower-level layer for enforcing security (which it already should enforce). Without root, an application shouldn't be able to flash the GPU firmware, and the same applies here: the VMM process itself shouldn't be able to do that.

> (or is the whole GPU only available to the VM?)

No, it's shared.

mtjhrc avatar Jun 06 '25 12:06 mtjhrc

WebGPU is implemented on top of Vulkan, at least for webgpu.rs (bindings to ash) and Dawn (used in the Chrome browser, and will be in AOSP, BTW). So if you do gfxstream_vk and/or Venus, you pretty much get WebGPU for free.

Regarding security, the trend is to move more responsibilities to the kernel or system-driver level (in the microkernel case). Extra user-space security is always good, but more user-space layers add more performance cost. So the idea is "fix the kernel and make sure it's secure".

You've heard about Rust in the kernel for this exact reason. So personally, I wouldn't say WebGPU API passthrough is where virtualization or para-virtualization is headed.

> but hopefully they (Google; @gurchetansingh) might move it to some better upstream place (under rust-vmm or Mesa would make sense)

Definitely a goal to move out of crosvm, to a place that does not require a CLA.

I think the idea is that some crates will live on Mesa3D, some in rust-vmm.

The next incremental step is:

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35210

Reviews welcome to make the process move faster!

gurchetansingh avatar Jun 06 '25 14:06 gurchetansingh

FYI: https://gitlab.freedesktop.org/freedesktop/freedesktop/-/issues/2459 for a possible new place for rutabaga_gfx on fd.o

gurchetansingh avatar Jul 30 '25 22:07 gurchetansingh