From: Leonardo Brás
Subject: Re: [PATCH v4 3/4] migration: zero-copy flush only at the end of bitmap scanning
Date: Tue, 21 Jun 2022 00:26:06 -0300
User-agent: Evolution 3.44.2

On Mon, 2022-06-20 at 11:23 +0200, Juan Quintela wrote:
> Leonardo Bras <leobras@redhat.com> wrote:
> > When sending memory pages with MSG_ZEROCOPY, it's necessary to flush
> > to make sure all dirty pages are sent before a future version of them
> > happens to be sent.
> > 
> > Currently, the flush happens every time at the end of ram_save_iterate(),
> > which usually happens around 20x per second, due to a timeout.
> > 
> > Change so it flushes only after a whole scan of the dirty bitmap,
> > so it never sends a newer version of a page before an older one, while
> > avoiding unnecessary overhead.
> > 
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> 
> I agree that the previous one is too many flushes, but this one leaves too
> much memory uncommitted, and that is important because otherwise we
> have an unlimited amount of dirty memory.

I don't quite understand what you meant here.
What does it mean for memory to be uncommitted in this context?
I don't see how we get an unlimited amount of dirty memory here.

Zero-copy is not supposed to copy the memory pages, so suppose all pages are dirty
and enqueued to send while our network interface is stalling the send:
the most extra memory we would have allocated is VM RAM page count x sizeof(iov),
because at that point we flush.
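
As a rough, purely illustrative calculation (the numbers are mine, not from the
patch): for a hypothetical 1 TiB guest with 4 KiB pages and 16-byte struct iovec
entries, the worst case is

    1 TiB / 4 KiB           = 268,435,456 pages
    268,435,456 * 16 bytes  = 4 GiB of iovec bookkeeping

so the bookkeeping grows with guest size, but it stays bounded and is much
smaller than the guest RAM itself.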

Unless... are you referring to locked memory (as in dirty enqueued memory -> locked
memory)?
That would be a point: if we enqueue more memory than the locked-memory limit
our user supports, the migration will fail.

But that would mean a very fast CPU (lots of sendmsg() calls) and a very slow
network interface.

> 
> > +/*
> > + * Set zero_copy_flush = true for every multifd channel
> > + *
> > + * When using zero-copy, it's necessary to flush the pages before any of
> > + * the pages can be sent again, so we'll make sure the new version of the
> > + * pages will always arrive _later_ than the old pages.
> > + *
> > + * Should be called only after we finished one whole scanning of
> > + * all the dirty bitmaps.
> > + */
> > +int multifd_zero_copy_flush(void)
> > +{
> > +    int i;
> > +    Error *local_err = NULL;
> > +
> > +    if (!migrate_use_multifd()) {
> > +        return 0;
> > +    }
> > +
> > +    for (i = 0; i < migrate_multifd_channels(); i++) {
> > +        MultiFDSendParams *p = &multifd_send_state->params[i];
> > +        int ret;
> > +
> > +        ret = qio_channel_flush(p->c, &local_err);
> > +        if (ret < 0) {
> > +            error_report_err(local_err);
> > +            return ret;
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> 
> 
> Here you flush every channel. You only want to do this at the end of a range.

Yes, the idea is to flush after a full scan of the dirty-bitmap.

> 
> 
> >  int multifd_send_sync_main(QEMUFile *f)
> >  {
> >      int i;
> > -    bool flush_zero_copy;
> >  
> >      if (!migrate_use_multifd()) {
> >          return 0;
> > @@ -581,19 +613,6 @@ int multifd_send_sync_main(QEMUFile *f)
> >          }
> >      }
> >  
> > -    /*
> > -     * When using zero-copy, it's necessary to flush the pages before any of
> > -     * the pages can be sent again, so we'll make sure the new version of the
> > -     * pages will always arrive _later_ than the old pages.
> > -     *
> > -     * Currently we achieve this by flushing the zero-page requested writes
> > -     * per ram iteration, but in the future we could potentially optimize it
> > -     * to be less frequent, e.g. only after we finished one whole scanning of
> > -     * all the dirty bitmaps.
> > -     */
> > -
> > -    flush_zero_copy = migrate_use_zero_copy_send();
> > -
> >      for (i = 0; i < migrate_multifd_channels(); i++) {
> >          MultiFDSendParams *p = &multifd_send_state->params[i];
> >  
> > @@ -615,17 +634,6 @@ int multifd_send_sync_main(QEMUFile *f)
> >          ram_counters.transferred += p->packet_len;
> >          qemu_mutex_unlock(&p->mutex);
> >          qemu_sem_post(&p->sem);
> > -
> > -        if (flush_zero_copy && p->c) {
> > -            int ret;
> > -            Error *err = NULL;
> > -
> > -            ret = qio_channel_flush(p->c, &err);
> > -            if (ret < 0) {
> > -                error_report_err(err);
> > -                return -1;
> > -            }
> > -        }
> 
> This synchronization already happens once every iteration through all
> ram.
> 
> </me checks how>
> 
> And lo and behold, it doesn't.
> 
> The problem here is that we are calling multifd_send_sync_main() in the
> wrong place, i.e. we are being too conservative.
> 
> We need to call multifd_send_sync_main() just before doing
> migration_bitmap_sync().  The reason that we need to call that function
> is exactly the same that we need to flush for zero_copy.
> 
> So, what we need to change is remove the call to
> multifd_send_sync_main(), not how it handles zero_copy.

So, IIUC, multifd has been syncing in a conservative way (i.e. more often than
actually needed), and we need to do the same improvement (sync less) for multifd
too, instead of just changing it for zero-copy. Is that correct?

> 
> >      }
> >      for (i = 0; i < migrate_multifd_channels(); i++) {
> >          MultiFDSendParams *p = &multifd_send_state->params[i];
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 5f5e37f64d..514584e44f 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -2288,6 +2288,13 @@ static int ram_find_and_save_block(RAMState *rs)
> >      rs->last_seen_block = pss.block;
> >      rs->last_page = pss.page;
> >  
> > +    if (pss.complete_round && migrate_use_zero_copy_send()) {
> > +        int ret = multifd_zero_copy_flush();
> > +        if (ret < 0) {
> > +            return ret;
> > +        }
> > +    }
> > +
> 
> This place is not right either, because we can have a sync in the middle
> for other reasons.

What do you mean by "in the middle" here? 
Does it mean we should not have multifd code in this function?

> 
> We call migration_bitmap_sync() in save_live_pending(), and that is not
> when we finish the complete_round.

Agree. 
We could add/set a flag in multifd, say in the above area, and then sync/flush
in the correct place (in the future).
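
Something like this rough sketch, maybe (the helper names below are invented
here just to illustrate the idea, they don't exist in the tree):

    /* in ram_find_and_save_block(), once a whole pass over the bitmap finished */
    if (pss.complete_round && migrate_use_zero_copy_send()) {
        multifd_set_zero_copy_flush_pending();        /* invented helper */
    }

    /* later, at the real sync point, e.g. inside multifd_send_sync_main() */
    if (multifd_zero_copy_flush_pending()) {          /* invented helper */
        int ret = multifd_zero_copy_flush();

        if (ret < 0) {
            return ret;
        }
        multifd_clear_zero_copy_flush_pending();      /* invented helper */
    }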

But in my experience debugging my code, I found that the loop in
ram_save_iterate() will not stop/break at "bitmap scan end", so we may end up
sending the same page twice before a flush, which is what we want to avoid.

> 
> Having discussed this, what I asked about in the past is that you are keeping too
> much dirty memory with zero_copy.  When you have a multi-terabyte guest, in
> a single round you "potentially" have dirty memory on each channel of:
> 
>    total_amount_memory / number of channels.

If dirty memory -> locked memory, yes, max_locked_memory == guest_ram.

> 
> In a multi-terabyte guest, this is more than likely going to be dozens of
> gigabytes.  As far as I know there is no card/driver that will
> benefit from so many pages in zero_copy, and the kernel will move to
> synchronous copy at some point.  (In older threads, Daniel showed how to
> test for this case).

The idea of MSG_ZEROCOPY is to avoid copying the page contents over to
kernel space. I don't recall any reason a network card would fall back to copying
instead of keeping zero-copy, unless it does not support it (scatter-gather
not supported, for example), but in that case it would never use zero-copy in
the first place.
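
For reference, this is my understanding of the kernel interface; a minimal
sketch based on the kernel's Documentation/networking/msg_zerocopy.rst (not on
QEMU code), assuming SO_ZEROCOPY was already enabled on the connected socket
and with error handling stripped:

    #include <linux/errqueue.h>
    #include <stdio.h>
    #include <sys/socket.h>

    static ssize_t send_zerocopy(int fd, const void *buf, size_t len)
    {
        /* The pages backing 'buf' get pinned, not copied; they must not be
         * reused until the matching completion is read from the error queue. */
        return send(fd, buf, len, MSG_ZEROCOPY);
    }

    static void reap_one_completion(int fd)
    {
        char control[128];
        struct msghdr msg = {
            .msg_control = control,
            .msg_controllen = sizeof(control),
        };
        struct cmsghdr *cm;

        /* Completions arrive on the socket error queue, not the data path */
        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
            return;
        }
        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
            struct sock_extended_err *serr = (void *)CMSG_DATA(cm);

            if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
                continue;
            }
            /* serr->ee_info..ee_data is the range of send calls completed */
            if (serr->ee_code == SO_EE_CODE_ZEROCOPY_COPIED) {
                /* The kernel reports it fell back to copying these sends */
                fprintf(stderr, "zerocopy send was copied by the kernel\n");
            }
        }
    }

(The SO_EE_CODE_ZEROCOPY_COPIED bit in the completion is, as far as I can tell,
how the kernel reports that it ended up copying after all.)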

IIRC, as long as we have enough locked memory, there should be no problem for
the driver to keep using zero-copy send.

Well, unless you mean the gain of using zero-copy would be irrelevant compared
to the cost of locked memory in some scenarios, say a very powerful CPU with a slow
network interface, where using zero-copy would not change the migration speed.
But that is not a scenario zero-copy is meant for anyway.

> 
> What I proposed is that you check in the migration_send_thread() how
> much memory you have written since the last synchronization.  Once it
> is big enough (I don't know the limits for the card; the docs that you
> showed suggested the size is a few megabytes), you just sync at that
> point and continue.

Would this help us keep the locked memory under control? Yes.
But it would come at the cost of flushing *much* more often than needed.

I mean, it would only be useful if the network card is completely stalling the
send, or if the CPU is scheduling more zero-copy sends than the network card can
process.

The point here is: we don't have to keep track of how much has been sent, but
rather how much is enqueued (i.e. locked memory). Maybe there is a cheap way
to track this value, and we could flush when we detect it's too high.
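
Something like this (completely hypothetical, the field and the limit are
invented; it would sit in multifd_send_thread() right after the zero-copy
writev):

    /* account for what the kernel has pinned but we have not flushed yet */
    p->zc_enqueued_bytes += p->next_packet_size;   /* invented field */

    if (p->zc_enqueued_bytes > zc_locked_limit) {  /* e.g. a fraction of RLIMIT_MEMLOCK */
        if (qio_channel_flush(p->c, &local_err) < 0) {
            break;   /* handle like the other channel errors */
        }
        p->zc_enqueued_bytes = 0;                  /* everything enqueued so far completed */
    }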

> 
> You still need to synchronize all threads at bitmap sync, but as said,
> that is handled by multifd_send_sync_main(), and we should fix that
> instead of changing where zero_copy flushes.

Ok, so we keep the zero-copy flush inside multifd_send_sync_main() and change
where it's called so that it matches the end of the dirty-bitmap scan.
Is that correct?
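
Just to be sure I got the placement right, something roughly like this (a sketch
only, I haven't checked which call sites actually need it):

    /* once per pass over guest RAM, just before starting a new bitmap scan */
    ret = multifd_send_sync_main(f);    /* channel sync + (with this series) the
                                         * zero-copy flush                      */
    if (ret < 0) {
        return ret;
    }
    migration_bitmap_sync_precopy(rs);  /* then begin the next dirty-bitmap pass */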

> 
> /* Removed not relevant bits of the function */
> 
> static void *multifd_send_thread(void *opaque)
> {
>     size_t zero_copy_bytes_sent = 0;
> 
>     ...
> 
>     while (true) {
> 
>             ....
> 
>             trace_multifd_send(p->id, packet_num, p->normal_num, p->zero_num,
>                                p->flags, p->next_packet_size);
> 
>             if (use_zero_copy_send) {
>                 /* Send header first, without zerocopy */
>                 ret = qio_channel_write_all(p->c, (void *)p->packet,
>                                             p->packet_len, &local_err);
>                 if (ret != 0) {
>                     break;
>                 }
> 
> ****** Note aside
> 
> Did you answer my question here about what happens when you do:
> 
> write_with_zero_copy(1MB); write_without_zero_copy(4KB);
> write_with_zero_copy(1MB); write_without_zero_copy(4KB);
> 
> My guess is that write_without_zero_copy(4KB) will just flush the
> socket.  I can't think how it can work without doing that.

To be honest, I did not follow the MSG_ZEROCOPY code to this extent. 

But it depends: if write_without_zero_copy() uses MSG_DONTWAIT, we could have
everything being sent in a non-blocking way.

But of course, taking into account that in multifd we are using a blocking
sendmsg(), I also agree that it is probably flushing the packets previously sent
with MSG_ZEROCOPY (or we would have issues on the receiving side).

But that's an assumption. I still need to dig into the code in that respect to be
sure. I will work on that ASAP.

> If so, we are not getting what we want.

In fact yes, we are.
We are avoiding copying the buffer, which is what we intend to do, and avoiding
the cost of the 'flush' syscall happening too often. 

If we think multifd is supposed to have blocking sendmsg(), then we are doing
exactly that here, with zero-copy for the heavy part, which is ideal IMHO.

We could try adding MSG_DONTWAIT to the sendmsg() in multifd (which would take
the blocking / synchronous send away), but I have no idea what impact it would
have on multifd's inner workings.

> 
>             } else {
>                 /* Send header using the same writev call */
>                 p->iov[0].iov_len = p->packet_len;
>                 p->iov[0].iov_base = p->packet;
>             }
> 
>             ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
>                                               0, p->write_flags, &local_err);
>             if (ret != 0) {
>                 break;
>             }
> 
>             qemu_mutex_lock(&p->mutex);
>             p->num_packets++;
>             p->total_normal_pages += p->normal_num;
>             p->total_zero_pages += p->zero_num;
>             p->pages->num = 0;
>             p->pages->block = NULL;
>             p->sent_bytes += p->packet_len;
>             p->sent_bytes += p->next_packet_size;
> 
> **** Addition
>             zero_copy_bytes_sent += p->packet_len + p->next_packet_size;
> 
>             p->pending_job--;
>             qemu_mutex_unlock(&p->mutex);
> ***** Addition
>             if (zero_copy_bytes_sent > Threshold) { // 2MB/4MB?  I don't know
>                     ret = qio_channel_flush(p->c, &local_err);
>                     // Handle error somehow
>                     zero_copy_bytes_sent = 0;
>                     //  If you want to be a pro, just change the
>                     // Threshold depending on what the kernel answers.
>                     // If it has to revert to a synchronous send, just
>                     // decrease the threshold, otherwise increase it.
>             }
> 
>             if (p->flags & MULTIFD_FLAG_SYNC) {
>                 qemu_sem_post(&p->sem_sync);
>             }
>             qemu_sem_post(&multifd_send_state->channels_ready);
>         } else if (p->quit) {
>             qemu_mutex_unlock(&p->mutex);
>             break;
>         } else {
>             qemu_mutex_unlock(&p->mutex);
>             /* sometimes there are spurious wakeups */
>         }
>     }
>     .............
> }
> 
> What do you think?

It's a good idea to keep locked memory at bay, but I think flushing every N
bytes sent is too often and will kill the performance, even more so if we
assume the non-zerocopy blocking sendmsg() is already flushing for us.

> 
> Later, Juan.
> 

Thanks Juan!

Best regards,
Leo



