RFC: virtqueue interface
I have been considering how to rewrite the virtio-queue interface to make it more unified, safer and easier to understand. Since @mustermeiszer is working on this as well, I figured I'd post some notes here. Feedback very welcome.
Requirements:
- Usability
  - virtio-fs needs to send and receive in the same transmission (packed queue)
  - virtio-fs wants to hand in multiple buffers, which get concatenated into a single descriptor chain. This way we can easily prepend a header to the data.
  - virtio-net needs a filled receive queue, since the device can initiate transmissions on its own. Once a buffer is used, the same descriptor is inserted into the queue again.
  - virtio-net also wants to recycle transmit descriptors for speed. On initialization, it allocates all tx buffers and constructs descriptor chains for them. When we want to send, we can simply place the correct index into the ring.
- Memory Safety
  - While a transfer is ongoing, the buffer should not be readable/writable by the initiator
  - Check whether the given buffers are physically contiguous; if not, split them into multiple descriptors on page boundaries
  - `mem::forget` is safe, so we should consider what happens when destructors don't run
Proposal:
Use `TransferToken`s, which take ownership of the buffers that are to be transferred. On creation, a token builds the descriptor chain. It can then be placed into a virtqueue, upon which it is converted into a `Transfer`. A `Transfer` only gives back ownership of (or references to) the buffers once the transfer is complete.
To accommodate the virtio-net use case of recycling buffers, we have two modes (illustrated in the sketch below):
- consume the `Transfer` and return the buffers. This frees up the descriptor chain.
- extract only a reference to the buffers, use it, and later 'recycle' the transfer by re-inserting it into the queue.
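A rough usage sketch from the driver's point of view, based on the interface proposed further down; the `send_request` wrapper, buffer sizes and error handling are made up for illustration, and lifetime details are glossed over:

```rust
// Hypothetical driver-side usage of the proposed TransferToken/Transfer flow.
fn send_request(vq: &Virtq<'_>, request: Box<[u8]>, response_len: usize) {
    // One send buffer and one receive buffer, each as Box<[Box<[u8]>]>.
    let send = Some(vec![request].into_boxed_slice());
    let recv = Some(vec![vec![0u8; response_len].into_boxed_slice()].into_boxed_slice());

    // Build the descriptor chain, place it into the queue, wait for completion.
    let token = vq.prepare_transfer(send, recv).unwrap();
    let transfer = token.send();
    let token = transfer.wait();

    // Mode 1: consume the token and take back ownership of the buffers.
    let (_send_bufs, recv_bufs) = token.consume();
    let _response = &recv_bufs.unwrap()[0];
    // Mode 2 (recycling) would instead keep the token and call token.send() again.
}
```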
Questions:
- Is the interface sufficient for virtio-net? AFAIK it should be, but I have not looked too deeply yet.
- Do we need 'fire and forget' functionality, where we are allowed to drop an ongoing transfer, which is then automatically freed once it completes? Currently neither virtio-fs nor virtio-net needs it. We could default to warning + leaking.
- Doing it the way I propose below is slightly unsafe: when a transfer is dropped with `mem::forget`, we do NOT free the descriptors associated with the transfer, so we permanently exhaust some of them. They are limited by the virtqueue length, so doing this often would block all transfers. But I think this is okay in our case, since we can just be careful not to use `mem::forget`. The only other option I see is doing more housekeeping within the virtqueue itself. I feel this is too complex (and likely slower?), since we would have to keep track of all transfer tokens in the virtqueue.
- Is the interface `prepare_transfer(send, recv) -> Result` appropriate for a split queue?
- The function argument types (`Option<Box<[Box<[u8]>]>>`) look a bit wild. We could use a newtype with `.into()` to make it look cleaner (see the sketch after this list).
- Should we use boxed arrays or vectors for storing the buffers (`Box<[Box<[u8]>]>` vs `Vec<Vec<u8>>` vs `Vec<Box<[u8]>>`)? Both store an array with its size on the heap; a vector has an additional capacity field. I feel that boxes are better since we can easily convert a vec to a box with `.into_boxed_slice()`, but not the other way around.
- We could, if we wanted to be extra safe, unmap the pages the buffers are stored in while a transfer is happening. This way, userspace (or other kernel code) cannot write to buffers if it has kept 'illegal' references around. This could prevent a VM escape via a broken device. But since this is not guaranteed by the spec, it should not be a problem. I know virtio-fs safeguards against changing buffer contents by copying the headers before parsing them.
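For the newtype idea, a minimal sketch of what I mean; the `BufferList` name is just a placeholder:

```rust
// Hypothetical newtype wrapping the buffer list, so call sites can write
// `bufs.into()` instead of spelling out Option<Box<[Box<[u8]>]>>.
pub struct BufferList(Box<[Box<[u8]>]>);

impl From<Vec<Vec<u8>>> for BufferList {
    fn from(bufs: Vec<Vec<u8>>) -> Self {
        BufferList(
            bufs.into_iter()
                .map(Vec::into_boxed_slice)
                .collect::<Vec<_>>()
                .into_boxed_slice(),
        )
    }
}
```

`prepare_transfer` could then take `Option<BufferList>` (or `impl Into<BufferList>`) instead of the nested boxes.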
Interface
```rust
use alloc::boxed::Box; // Box and Rc come from `alloc` in a no_std kernel
use alloc::rc::Rc;
use core::cell::RefCell;

/// A prepared transfer that can be inserted into a virtqueue
pub struct TransferToken<'a> {
    /// Memory that is to be sent: multiple, virtually contiguous u8 arrays
    send_bufs: Option<Box<[Box<[u8]>]>>,
    /// Memory that is to be received into: multiple, virtually contiguous u8 arrays
    recv_bufs: Option<Box<[Box<[u8]>]>>,
    /// Descriptor chain that references the send/recv bufs
    desc_chain: Rc<RefCell<VirtqDescriptorChain>>,
    /// Virtq on which the transfer is being done
    virtq: &'a Virtq<'a>,
}

/// A single transfer inside a virtqueue
pub struct Transfer<'a> {
    /// The virtq index is not enough, since it wraps around easily. Because
    /// transfers do not have to be handled in order, this could cause
    /// collisions, so we use a large id.
    id: u64,
    /// Token that was converted into this transfer
    token: TransferToken<'a>,
}

impl<'a> TransferToken<'a> {
    /// Places the already prepared descriptor chain into the virtqueue, returns a Transfer.
    ///
    /// Since Transfers give back the token once they are completed, we can insert the same
    /// token with the same descriptor chain multiple times if we want.
    pub fn send(self) -> Transfer<'a> {
        unimplemented!()
    }

    /// Returns mutable references to the send and receive buffers.
    pub fn buf_ref_mut(&mut self) -> (Option<&mut [Box<[u8]>]>, Option<&mut [Box<[u8]>]>) {
        unimplemented!()
    }

    /// Consumes the token, returning ownership of the buffers
    pub fn consume(self) -> (Option<Box<[Box<[u8]>]>>, Option<Box<[Box<[u8]>]>>) {
        unimplemented!()
    }
}

impl<'a> Drop for TransferToken<'a> {
    fn drop(&mut self) {
        // Free the descriptor chain in the virtq, so it can be reused by new transfers
        unimplemented!()
    }
}

impl<'a> Transfer<'a> {
    /// Checks whether the transfer is completed yet. If it is, returns the TransferToken.
    pub fn check(&self) -> Option<TransferToken<'a>> {
        // Checks the current id against the virtq's last received id.
        unimplemented!()
    }

    /// Waits until the transfer is completed. Returns the TransferToken used.
    pub fn wait(&self) -> TransferToken<'a> {
        unimplemented!()
    }
}

impl<'a> Drop for Transfer<'a> {
    fn drop(&mut self) {
        // If a transfer is dropped while it is ongoing, do NOT free the memory; leak it instead.
        // We could register it in the virtq, so it can be freed later.
        unimplemented!()
    }
}

impl<'a> Virtq<'a> {
    /// Places send and recv buffers in a descriptor chain, returns a TransferToken.
    /// The transfer is not placed in the queue yet.
    /// Specifying both send and receive is only possible in a packed queue, else Err.
    /// For a split queue, send/recv have to match the queue mode, else Err.
    ///
    /// Checks whether the buffers are physically contiguous and splits them into
    /// multiple descriptors if not.
    pub fn prepare_transfer(
        &'a self,
        send: Option<Box<[Box<[u8]>]>>,
        recv: Option<Box<[Box<[u8]>]>>,
    ) -> Result<TransferToken<'a>, ()> {
        unimplemented!()
    }

    /// Places send and recv buffers in a descriptor chain, returns a TransferToken.
    /// The transfer is not placed in the queue yet.
    /// Specifying both send and receive is only possible in a packed queue, else Err.
    /// For a split queue, send/recv have to match the queue mode, else Err.
    ///
    /// Does NOT check whether the buffers are physically contiguous! The caller has to guarantee this!
    pub unsafe fn prepare_transfer_unchecked(
        &'a self,
        send: Option<Box<[Box<[u8]>]>>,
        recv: Option<Box<[Box<[u8]>]>>,
    ) -> TransferToken<'a> {
        unimplemented!()
    }
}
```
Yes, it should also work for virtio-net. At least the rx buffers require a filled queue. In the case of tx buffers, it is not so important, but it is easier to implement.
Fire and forget sounds good. But when do you forget? For instance, you have to be sure that virtio-net's tx buffers are successfully sent.
- Is the interface prepare_transfer(send, recv) -> Result appropriate for a split queue?
👍
- the function argument types (Option<Box<[Box<[u8]>]>>) look a bit wild. We could use a newtype with .into() to make it look cleaner
👍
- should we use boxed arrays, or vectors for storing the buffers? (`Box<[Box<[u8]>]>` vs `Vec<Vec<u8>>` vs `Vec<Box<[u8]>>`). Both store an array with size on the heap, a vector has an additional capacity field. I feel that boxes are better since we can easily convert vec to box with `.into_boxed_slice()`, but not the other way around.
Hm, I am fine with both solutions. But I agree that boxes seem to be better.
- We could, if we wanted to be extra safe, unmap the pages the buffers are stored in while a transfer is happening. This way, userspace (or other kernel code) cannot write to buffers if it has kept 'illegal' references around. This could prevent a VM escape via a broken device. But since this is not guaranteed by the spec, it should not be a problem. I know virtio-fs safeguards against changing buffer contents by copying the headers before parsing them.
I have another solution. We should discuss it in a video conference; I am not sure if my solution is valid.
I am not yet at the packed queue implementation, which is why I don't have an exact picture of the requirements yet. I already have a few thoughts, and we should probably discuss everything in detail in a video conference, but I really like the idea.
A few remarks:
- Doesn't the spec say that split queues can provide write and read descriptors in one descriptor chain (see 2.6.5)? This should allow send and receive in the same transmission, or am I missing something?
- As the descriptor layout differs between split and packed queues, we need some kind of wrapper (an enum or a struct holding "raw" descriptors) to be able to use the interface for both queues.
- I wonder if it is possible for the `prepare_send()` functions to accept a `Vec<dyn ToBeSerialzableTo_u8>` and the minimal and maximal number of buffers these structures are allowed to be destructured into.
- Is it necessary to give back ownership of the buffers to the driver? Can't we just return a reference to the driver? I thought this might reduce the risk of drivers manipulating the buffers after the descriptor chain has been created.
Regarding the wrapped descriptors (I thought about the same concept as described below) and the interface for both virtqueues under a common type, I am currently thinking about something like:
```rust
pub enum Virtqueue {
    PackedVqueue(PackedVQ),
    SplitVqueue(SplitVQ),
}

impl Virtqueue {
    pub fn get_id(&self) -> u16 {
        match self {
            Virtqueue::PackedVqueue(vq) => vq.get_id(),
            Virtqueue::SplitVqueue(vq) => vq.get_id(),
        }
    }
}
```
In this case ALL function calls to a queue would need to pass through a match statement like the one above. I have the feeling that this is not optimal, but I cannot think of a different solution.
I also wondered if trait objects would work, with something like:
```rust
pub struct Virtqueue {
    queue: dyn VirtQueueInterface,
}

trait VirtQueueInterface {}
```
Where both split and packed virtqueue would implement the interface.
But this way, no associated types are possible, which would be nice for returning different types from the interface.
Also, creating a queue via `new() -> Self` would only work if we boxed it into `new() -> Box<Self>`. This also feels quite suboptimal.
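To illustrate the associated-type limitation (the trait shape here is made up, not the actual interface):

```rust
// A trait with an associated type can still be used as a trait object, but the
// trait object type has to name the associated type explicitly, which defeats
// the purpose of hiding the concrete queue type.
trait VirtQueueInterface {
    type Token;
    fn prepare(&mut self) -> Self::Token;
}

// Does not compile (error[E0191]: the associated type `Token` must be specified):
// struct Virtqueue {
//     queue: Box<dyn VirtQueueInterface>,
// }

// This compiles, but the token type leaks into `Virtqueue` again:
struct Virtqueue {
    queue: Box<dyn VirtQueueInterface<Token = u16>>,
}
```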
fire and forget sounds good. But when do you forget?
We could do something like:
- someone sends a transfer and drop()s it before completion
- we override drop() to register the transfer as a dropped transfer in the virtq. On receiving some event, the virtq checks this list to see if cleanup is necessary (rough sketch below).
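Something like this, assuming a hypothetical `register_dropped()` method on the virtq (this would replace the leak-on-drop behaviour from the interface sketch above):

```rust
impl<'a> Drop for Transfer<'a> {
    fn drop(&mut self) {
        // If the device has not reported the buffers as used yet, hand the
        // transfer id over to the virtq instead of freeing anything here. The
        // virtq can release the descriptor chain (and the buffers owned by the
        // token) once it sees this id in the used ring.
        if self.check().is_none() {
            self.token.virtq.register_dropped(self.id); // hypothetical method
        }
    }
}
```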
Are there actual use cases for fire-and-forget though? virtio-net and virtio-fs don't seem to need that feature, although this would provide some 'cheap' async, where write()s to the filesystem can return early.
One problem with my proposal is that we lack good zero-copy support. Consider the read(fd, buf, len) syscall: We might want to place the userland buffer directly into the virtq, but then the kernel never has ownership. Placing pointers provided by userland into DMA directly is slightly scary though, so we would need to add appropriate checks there. Is this something we want to support?
Hermit-sys/smoltcp are 'cheating' here, since they request a tx_buffer from the kernel, instead of providing one to it. The default read()/write() POSIX semantics do not allow this.
Doesn't the spec say that split queues can provide write and read descriptors in one descriptor chain?
You are right, I got a bit confused. Split queues are the ones I had implemented; packed queues are still missing. Both support this read/write chaining.
As the descriptor layout differs between split and packed queues, we need some kind of wrapper (an enum or a struct holding "raw" descriptors) to be able to use the interface for both queues.
Sounds good. Another idea is to use a trait and then do something like `Virtq<DescriptorType>`. I would avoid `dyn`, since it prevents some optimizations and uses slow dynamic function dispatch. Due to the lack of trait fields, we would have to use setters/getters for descriptor fields, but Rust's monomorphization and inlining should make this quite fast at the expense of code size. I am not sure how often we would hit these 'match enum' branches anyway. Rough sketch below.
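This is roughly what I have in mind; the trait name, methods and exact field layouts are placeholders, not the real virtio structs:

```rust
// Descriptor access is abstracted behind a trait; Virtq is monomorphised over
// the concrete layout, so there is no dynamic dispatch on the hot path.
trait VirtqDescriptor {
    fn set_addr(&mut self, addr: u64);
    fn set_len(&mut self, len: u32);
    fn set_flags(&mut self, flags: u16);
}

#[repr(C)]
struct SplitDescriptor {
    addr: u64,
    len: u32,
    flags: u16,
    next: u16,
}

impl VirtqDescriptor for SplitDescriptor {
    fn set_addr(&mut self, addr: u64) { self.addr = addr; }
    fn set_len(&mut self, len: u32) { self.len = len; }
    fn set_flags(&mut self, flags: u16) { self.flags = flags; }
}

// A PackedDescriptor would implement the same trait with its own layout.
struct Virtq<D: VirtqDescriptor> {
    descriptors: Vec<D>,
}
```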
I wonder if it is possible for the `prepare_send()` functions to accept a `Vec<dyn ToBeSerialzableTo_u8>` and the minimal and maximal number of buffers these structures are allowed to be destructured into.
The goal of this would be the construction of descriptor chains without specifying the concrete buffers/lengths, which can then be filled in later? I don't see how that helps us; why not just defer the transfer creation until the buffers are known?
Is it necessary to give back ownership of the buffers to the driver? Can't we just return a reference to the driver? I thought this might reduce the risk of drivers manipulating the buffers after the descriptor chain has been created.
Handing out ownership vs. mutable references does not make much of a difference. If we want to support buffer recycling, the driver has to have a way of writing into the buffers.