Re: [PATCH RFC 9/9] KVM: Dirty ring support
From: Dr. David Alan Gilbert
Subject: Re: [PATCH RFC 9/9] KVM: Dirty ring support
Date: Thu, 26 Mar 2020 14:14:36 +0000
User-agent: Mutt/1.13.3 (2020-01-12)
* Peter Xu (address@hidden) wrote:
> On Wed, Mar 25, 2020 at 08:41:44PM +0000, Dr. David Alan Gilbert wrote:
>
> [...]
>
> > > +enum KVMReaperState {
> > > + KVM_REAPER_NONE = 0,
> > > + /* The reaper is sleeping */
> > > + KVM_REAPER_WAIT,
> > > + /* The reaper is reaping for dirty pages */
> > > + KVM_REAPER_REAPING,
> > > +};
> >
> > That probably needs to be KVMDirtyRingReaperState
> > given there are many things that could be reaped.
>
> Sure.
>
> >
> > > +/*
> > > + * KVM reaper instance, responsible for collecting the KVM dirty bits
> > > + * via the dirty ring.
> > > + */
> > > +struct KVMDirtyRingReaper {
> > > + /* The reaper thread */
> > > + QemuThread reaper_thr;
> > > + /*
> > > + * Used to tell the reaper thread to wake up. This should be used
> > > + * as a generic interface to kick the reaper thread, e.g. from
> > > + * vcpu threads when they get an exit due to ring full.
> > > + */
> > > + EventNotifier reaper_event;
> >
> > I think I'd just use a simple semaphore for this type of thing.
>
> I'm actually uncertain which is cheaper...
>
> Besides, I wanted to poll two handles at the same time below
> (in kvm_dirty_ring_reaper_thread). I don't know how to do that with
> a semaphore. Could it do that?
If you're OK with EventNotifier, stick with it; it's just that I'm used
to doing this kind of thing with a semaphore, e.g. a flag then the
semaphore - but that's fine.
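For reference, the flag-then-semaphore pattern I had in mind is roughly
the following (just a sketch - the reaper_kick_* fields and reaper_sem
are invented names, and I honestly don't know which of the two ends up
cheaper):

    /* kicker side, e.g. a vcpu thread that hit ring full */
    atomic_set(&r->reaper_kick_full, 1);
    qemu_sem_post(&r->reaper_sem);

    /* reaper thread main loop */
    while (true) {
        qemu_sem_wait(&r->reaper_sem);
        if (atomic_xchg(&r->reaper_kick_full, 0)) {
            /* some vcpu hit ring full - reap and reset the rings */
        }
        if (atomic_xchg(&r->reaper_kick_flush, 0)) {
            /* someone asked for a synchronous flush */
        }
    }

One semaphore covers any number of wakeup reasons; the flags just tell
the thread why it was woken.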
> [...]
>
> > > @@ -412,6 +460,18 @@ int kvm_init_vcpu(CPUState *cpu)
> > > (void *)cpu->kvm_run + s->coalesced_mmio * PAGE_SIZE;
> > > }
> > >
> > > + if (s->kvm_dirty_gfn_count) {
> > > + cpu->kvm_dirty_gfns = mmap(NULL, s->kvm_dirty_ring_size,
> > > + PROT_READ | PROT_WRITE, MAP_SHARED,
> >
> > Is the MAP_SHARED required?
>
> Yes it's required. It's the same when we map the per-vcpu kvm_run.
>
> If we use MAP_PRIVATE, it'll be in a COW fashion - the first time the
> userspace writes to the dirty gfns, it'll copy the current dirty
> ring page in the kernel, and from then on QEMU will never be able to
> see what the kernel writes to the dirty gfn pages. MAP_SHARED means
> the userspace and the kernel share exactly the same page(s).
OK, worth a comment.
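Something along these lines above the mmap() would do - wording is up
to you, it's just your explanation folded into a comment:

    /*
     * The dirty gfn ring is written by the kernel and consumed by
     * QEMU, so it has to be mapped MAP_SHARED: with MAP_PRIVATE the
     * first userspace write would COW the page and QEMU would stop
     * seeing what the kernel publishes in the ring.
     */
    cpu->kvm_dirty_gfns = mmap(NULL, s->kvm_dirty_ring_size,
                               PROT_READ | PROT_WRITE, MAP_SHARED,
                               cpu->kvm_fd,
                               PAGE_SIZE * KVM_DIRTY_LOG_PAGE_OFFSET);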
> >
> > > + cpu->kvm_fd,
> > > + PAGE_SIZE * KVM_DIRTY_LOG_PAGE_OFFSET);
> > > + if (cpu->kvm_dirty_gfns == MAP_FAILED) {
> > > + ret = -errno;
> > > + DPRINTF("mmap'ing vcpu dirty gfns failed\n");
> >
> > Include errno?
>
> Will do.
>
> [...]
>
> > > +static uint64_t kvm_dirty_ring_reap(KVMState *s)
> > > +{
> > > + KVMMemoryListener *kml;
> > > + int ret, i, locked_count = s->nr_as;
> > > + CPUState *cpu;
> > > + uint64_t total = 0;
> > > +
> > > + /*
> > > + * We need to lock all kvm slots for all address spaces here,
> > > + * because:
> > > + *
> > > + * (1) We need to mark dirty for dirty bitmaps in multiple slots
> > > + * and for tons of pages, so it's better to take the lock here
> > > + * once rather than once per page. And more importantly,
> > > + *
> > > + * (2) We must _NOT_ publish dirty bits to the other threads
> > > + * (e.g., the migration thread) via the kvm memory slot dirty
> > > + * bitmaps before correctly re-protecting those dirtied pages.
> > > + * Otherwise we risk data corruption if the page data is read
> > > + * in the other thread before we do the reset below.
> > > + */
> > > + for (i = 0; i < s->nr_as; i++) {
> > > + kml = s->as[i].ml;
> > > + if (!kml) {
> > > + /*
> > > + * This is tricky - we grow s->as[] dynamically now. Take
> > > + * care of that case. We also assume that as[] fills up
> > > + * one by one starting from zero. Without this, we race
> > > + * with register_smram_listener.
> > > + *
> > > + * TODO: make all these prettier...
> > > + */
> > > + locked_count = i;
> > > + break;
> > > + }
> > > + kvm_slots_lock(kml);
> > > + }
> > > +
> > > + CPU_FOREACH(cpu) {
> > > + total += kvm_dirty_ring_reap_one(s, cpu);
> > > + }
> > > +
> > > + if (total) {
> > > + ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
> > > + assert(ret == total);
> > > + }
> > > +
> > > + /* Unlock whatever locks that we have locked */
> > > + for (i = 0; i < locked_count; i++) {
> > > + kvm_slots_unlock(s->as[i].ml);
> > > + }
> > > +
> > > + CPU_FOREACH(cpu) {
> > > + if (cpu->kvm_dirty_ring_full) {
> > > + qemu_sem_post(&cpu->kvm_dirty_ring_avail);
> > > + }
> >
> > Why do you need to wait until here - couldn't you release
> > each vcpu after you've reaped it?
>
> We probably still need to wait. Even after we've reaped all the dirty
> bits, we've only marked the pages as "collected"; the buffers only
> become available again once the kernel re-protects those pages (when
> the above KVM_RESET_DIRTY_RINGS completes). Before that, resuming the
> vcpu could make it exit again with the same ring-full event.
Ah OK.
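So, just to check my understanding, the vcpu side of a ring-full exit
ends up roughly like this (I'm guessing at the exact exit-reason name;
the fields are the ones from this patch):

    case KVM_EXIT_DIRTY_RING_FULL:
        cpu->kvm_dirty_ring_full = true;
        /* kick the reaper so it collects and resets all the rings */
        event_notifier_set(&s->reaper.reaper_event);
        /*
         * Don't re-enter the guest until KVM_RESET_DIRTY_RINGS has
         * re-protected the pages, otherwise we'd take the same
         * ring-full exit straight away.
         */
        qemu_sem_wait(&cpu->kvm_dirty_ring_avail);
        cpu->kvm_dirty_ring_full = false;
        ret = 0;
        break;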
> [...]
>
> > > +static int kvm_dirty_ring_reaper_init(KVMState *s)
> > > +{
> > > + struct KVMDirtyRingReaper *r = &s->reaper;
> > > + int ret;
> > > +
> > > + ret = event_notifier_init(&r->reaper_event, false);
> > > + assert(ret == 0);
> > > + ret = event_notifier_init(&r->reaper_flush_event, false);
> > > + assert(ret == 0);
> > > + qemu_sem_init(&r->reaper_flush_sem, 0);
> > > +
> > > + qemu_thread_create(&r->reaper_thr, "kvm-reaper",
> > > + kvm_dirty_ring_reaper_thread,
> > > + s, QEMU_THREAD_JOINABLE);
> >
> > That's marked as joinable - does it ever get joined?
> > If the reaper thread does exit on error (I can only see the poll
> > failure?) - what happens elsewhere - will things still try and kick it?
>
> The reaper thread is certainly not designed to fail. If it does fail,
> it'll exit without being joined; the alternative would be to abort()
> directly in the thread, which doesn't seem any better...
>
> Regarding "why not join() it": I think it's simply because we don't
> have an operation corresponding to AccelClass.init_machine() and
> kvm_init(). :-) From that POV, I think I'll keep it JOINABLE in the
> hope that someday the destroy_machine() hook will be ready and we'll
> be able to join it.
OK, that's fine.
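(And when/if an accel teardown hook does appear, I assume the join side
would be something as simple as the sketch below - the reaper_quit flag
is invented:)

    static void kvm_dirty_ring_reaper_destroy(KVMState *s)
    {
        struct KVMDirtyRingReaper *r = &s->reaper;

        atomic_set(&r->reaper_quit, 1);       /* ask the thread to exit */
        event_notifier_set(&r->reaper_event); /* wake it up */
        qemu_thread_join(&r->reaper_thr);
    }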
Dave
> Thanks,
>
> --
> Peter Xu
>
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK