qemu-block
Re: [PATCH v1 2/3] io: Add zerocopy and errqueue


From: Peter Xu
Subject: Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
Date: Tue, 31 Aug 2021 16:27:04 -0400

On Tue, Aug 31, 2021 at 01:57:33PM +0100, Daniel P. Berrangé wrote:
> On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > send calls. It does so by avoiding copying user data into kernel buffers.
> > 
> > To make it work, three steps are needed:
> > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > 3 - Process the socket's error queue, dealing with any error
> 
> AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> 
> It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> from a synchronous call to an asynchronous call.
> 
> It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> until an asynchronous completion notification has been received from
> the socket error queue. These notifications are not required to
> arrive in-order, even for a TCP stream, because the kernel hangs on
> to the buffer if a re-transmit is needed.
> 
> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> 
>   "Page pinning also changes system call semantics. It temporarily 
>    shares the buffer between process and network stack. Unlike with
>    copying, the process cannot immediately overwrite the buffer 
>    after system call return without possibly modifying the data in 
>    flight. Kernel integrity is not affected, but a buggy program
>    can possibly corrupt its own data stream."
> 
> AFAICT, the design added in this patch does not provide any way
> to honour these requirements around buffer lifetime.
> 
> I can't see how we can introduce MSG_ZEROCOPY in any seamless
> way. The buffer lifetime requirements imply need for an API
> design that is fundamentally different for asynchronous usage,
> with a callback to notify when the write has finished/failed.

Regarding buffer reuse: it does have deep implications for how long the buffer
must stay available, and that is not obvious at all.  Just to mention that the
initial user of this work will make sure all zero-copy buffers are guest pages
only (as it's only used in multifd), so they should stay valid for the whole
migration process.

I think asking for a complete design still makes sense.  E.g., for precopy,
before we flush device states and complete the migration, we may want to at
least have a final ack on all the zero-copies of guest pages to guarantee they
are flushed.

IOW, we need to make sure the last piece of the migration stream lands after
the guest pages, so that the dest VM always contains the latest page data when
it starts.  So far I don't see how the current code guarantees that.

In short, we may want to at least have a way to make sure that all zero-copied
buffers are no longer in use and have been sent by the time some function
returns (e.g., qio_channel_flush()).  That may require some accounting of when
we called sendmsg(MSG_ZEROCOPY); meanwhile, we would need to read out the
ee_data field of the SO_EE_ORIGIN_ZEROCOPY message when we do recvmsg() on the
error queue, and keep that information somewhere too.

Some other side notes that came to mind:

qio_channel_writev_full() may not be suitable for async operations, as the
name "full" implies synchronous behaviour to me.  So maybe we can add a new
helper for zero copy on the channel?

We may also want a new QIOChannelFeature, e.g. QIO_CHANNEL_FEATURE_ZEROCOPY,
and then fail qio_channel_writev_zerocopy() (or whatever name we come up with)
if that bit is not set in the QIO channel features.
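
As a rough illustration of that feature check, using stand-in types rather
than the real QIOChannel (every Demo*/demo_* name below is made up for this
sketch, and ENOTSUP is just one possible way to report the failure):

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Stand-in for QIOChannelFeature with the proposed extra bit. */
typedef enum {
    DEMO_FEATURE_FD_PASS,
    DEMO_FEATURE_SHUTDOWN,
    DEMO_FEATURE_ZEROCOPY, /* the new feature suggested above */
} DemoChannelFeature;

/* Stand-in for a channel: a feature bitmask plus an optional backend hook. */
typedef struct DemoChannel DemoChannel;
struct DemoChannel {
    unsigned int features;
    ssize_t (*writev_zerocopy)(DemoChannel *ioc,
                               const struct iovec *iov, size_t niov);
};

static int demo_channel_has_feature(const DemoChannel *ioc,
                                    DemoChannelFeature feature)
{
    return (ioc->features & (1u << feature)) != 0;
}

/* Fail early unless the channel advertises zero copy and implements it. */
static ssize_t demo_channel_writev_zerocopy(DemoChannel *ioc,
                                            const struct iovec *iov,
                                            size_t niov)
{
    if (!demo_channel_has_feature(ioc, DEMO_FEATURE_ZEROCOPY) ||
        !ioc->writev_zerocopy) {
        errno = ENOTSUP;
        return -1;
    }
    return ioc->writev_zerocopy(ioc, iov, niov);
}

/* Dummy backend for the example: only adds up the iovec lengths. */
static ssize_t demo_count_bytes(DemoChannel *ioc,
                                const struct iovec *iov, size_t niov)
{
    ssize_t total = 0;
    size_t i;

    (void)ioc;
    for (i = 0; i < niov; i++) {
        total += (ssize_t)iov[i].iov_len;
    }
    return total;
}
```

With this shape, callers that cannot honour the buffer lifetime rules never
reach the zero-copy path by accident, since they have to ask for it by name
and the channel has to opt in.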

Thanks,

-- 
Peter Xu



