Re: [Qemu-devel] [PATCH 3/3] replay: introduce block devices record/repl


From: Kevin Wolf
Subject: Re: [Qemu-devel] [PATCH 3/3] replay: introduce block devices record/replay
Date: Fri, 12 Feb 2016 14:58:20 +0100
User-agent: Mutt/1.5.21 (2010-09-15)

On 12.02.2016 at 14:19, Pavel Dovgalyuk wrote:
> > From: Kevin Wolf [mailto:address@hidden
> > On 10.02.2016 at 13:51, Pavel Dovgalyuk wrote:
> > > > From: Kevin Wolf [mailto:address@hidden
> > > > On 10.02.2016 at 13:05, Pavel Dovgalyuk wrote:
> > > > > > On 09.02.2016 at 12:52, Pavel Dovgalyuk wrote:
> > > > > > > > From: Kevin Wolf [mailto:address@hidden
> > > > > > > > But even this doesn't feel completely right, because block
> > > > > > > > drivers are already layered and there is no need to hardcode
> > > > > > > > something optional (and rarely used) in the hot code path that
> > > > > > > > could just be another layer.
> > > > > > > >
> > > > > > > > I assume that you know beforehand if you want to replay
> > > > > > > > something, so requiring you to configure your block devices
> > > > > > > > with a replay driver on top of the stack seems reasonable
> > > > > > > > enough.
> > > > > > >
> > > > > > > I cannot use block drivers for this. When driver functions are
> > > > > > > used, QEMU is already using coroutines (and has probably started
> > > > > > > bottom halves). Coroutines make execution non-deterministic.
> > > > > > > That's why we have to intercept the blk_aio_ functions, which are
> > > > > > > called deterministically.
> > > > > >
> > > > > > What does "deterministic" mean in this context, i.e. what are your
> > > > > > exact requirements?
> > > > >
> > > > > "Deterministic" means that the replayed execution should run exactly
> > > > > the same guest instructions in the same sequence as in the recording
> > > > > session.
> > > >
> > > > Okay. I think with this we can do better than what you have now.
> > > >
> > > > > > I don't think that coroutines introduce anything non-deterministic
> > > > > > per se. Depending on what you mean by it, the block layer code
> > > > > > paths in block.c may contain problematic code.
> > > > >
> > > > > They are non-deterministic if we need instruction-level accuracy.
> > > > > Thread switching (and therefore callbacks and BH execution) is
> > > > > non-deterministic.
> > > >
> > > > Thread switching depends on an external event (the kernel scheduler
> > > > deciding to switch), so agreed, if a thread switch ever influences what
> > > > the guest sees, that would be a problem.
> > > >
> > > > Generally, however, callbacks and BHs don't involve a thread switch at
> > > > all (BHs can be invoked from a different thread in theory, but we have
> > > > very few of those cases and they shouldn't be visible for the guest).
> > > > The same is true for coroutines, which are semantically equivalent to
> > > > callbacks.
> > > >
> > > > > In two different executions these callbacks may happen at different
> > > > > moments of time (counting in number of executed instructions).
> > > > > All operations with virtual devices (including memory, interrupt
> > > > > controller, and disk drive controller) should happen at deterministic
> > > > > moments of time to be replayable.
> > > >
> > > > Right, so let's talk about what this external non-deterministic event
> > > > really is.
> > > >
> > > > I think the only thing whose timing is unknown in the block layer is the
> > > > completion of I/O requests. This non-determinism comes from the time the
> > > > I/O syscalls made by the lowest layer (usually raw-posix) take.
> > >
> > > Right.
> > >
> > > > This means that we can add logic to remove the non-determinism at the
> > > > point of our choice between raw-posix and the guest device emulation. A
> > > > block driver on top is as good as anything else.
> > > >
> > > > While recording, this block driver would just pass the request to the
> > > > next lower layer (starting a request is deterministic, so it doesn't
> > > > need to be logged) and log the request once it completes. While
> > > > replaying, the completion of requests is delayed until we read it in
> > > > the log; if we read it in the log and the request hasn't completed yet,
> > > > we do a busy wait for it (while(!completed) aio_poll();).
> > >
> > > I tried serializing all bottom halves and worker thread callbacks in a
> > > previous version of the patches. That code was much more complicated
> > > and error-prone than the current version. We had to classify all bottom
> > > halves into recorded and non-recorded ones (because sometimes they are
> > > used for QEMU's own purposes, not the guest's).
> > >
> > > However, I don't understand yet which layer you are proposing as the
> > > candidate for record/replay. What functions should be changed?
> > > I would like to investigate this approach, but I don't get it yet.
> > 
> > At the core, I wouldn't change any existing function, but introduce a
> > new block driver. You could copy raw_bsd.c for a start and then tweak
> > it. Leave out functions that you don't want to support, and add the
> > necessary magic to .bdrv_co_readv/writev.
> > 
> > Something like this (can probably be generalised for more than just
> > reads as the part after the bdrv_co_readv() call should be the same for
> > reads, writes and any other request types):
> > 
> > int blkreplay_co_readv()
> > {
> >     BlockReplayState *s = bs->opaque;
> >     int reqid = s->reqid++;
> > 
> >     bdrv_co_readv(bs->file, ...);
> > 
> >     if (mode == record) {
> >         log(reqid, time);
> >     } else {
> >         assert(mode == replay);
> >         bool *done = req_replayed_list_get(reqid);
> >         if (done) {
> >             *done = true;
> >         } else {
> point A
> >             req_completed_list_insert(reqid, qemu_coroutine_self());
> >             qemu_coroutine_yield();
> >         }
> >     }
> > }
> > 
> > /* called by replay.c */
> > int blkreplay_run_event()
> > {
> >     if (mode == replay) {
> >         co = req_completed_list_get(e.reqid);
> >         if (co) {
> >             qemu_coroutine_enter(co);
> >         } else {
> >             bool done = false;
> >             req_replayed_list_insert(e.reqid, &done);
> point B
> >             /* wait synchronously for completion */
> >             while (!done) {
> >                 aio_poll();
> >             }
> >         }
> >     }
> > }
> 
> One more question about coroutines.
> Are race conditions possible in this sample?
> In replay mode we may call readv and reach point A.
> At the same time, we will reach point B in another thread.
> Then readv will yield and nobody will resume it?

There are two aspects to this:

* Real multithreading doesn't exist in the block layer. All block driver
  functions are only called with the mutex in the AioContext held. There
  is exactly one AioContext per BDS, so no two threads can possibly be
  operating on the same BDS at the same time.

* Coroutines are different from threads in that they aren't preemptive.
  They are only interrupted in places where they explicitly yield.
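
To illustrate why the two orderings around points A and B both resolve
when nothing can preempt the code between the lookup and the insert, here
is a toy single-threaded model in C. It is only a sketch, not QEMU code:
the arrays, poll_once() and replay_demo() are hypothetical stand-ins for
the two lists, aio_poll() and the callers in the pseudocode quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_REQS 8

/* Hypothetical stand-ins for the two lists in the quoted sketch. */
bool parked[MAX_REQS];    /* request finished, waiting for its log event */
bool *waiter[MAX_REQS];   /* log event arrived, waiting for the request  */
int delivered[MAX_REQS];  /* times the completion was handed to the guest */

/* I/O for reqid finished (tail of blkreplay_co_readv in replay mode). */
void request_completed(int reqid)
{
    assert(reqid >= 0 && reqid < MAX_REQS);
    if (waiter[reqid]) {
        *waiter[reqid] = true;   /* event side is polling: release it */
        waiter[reqid] = NULL;
    } else {
        parked[reqid] = true;    /* "point A": park until the event shows up */
    }
}

/* Stand-in for aio_poll(): lets at most one pending completion run. */
int pending_io = -1;
void poll_once(void)
{
    if (pending_io >= 0) {
        int reqid = pending_io;
        pending_io = -1;
        request_completed(reqid);
    }
}

/* The replay log says reqid completes now (blkreplay_run_event). */
void run_event(int reqid)
{
    assert(reqid >= 0 && reqid < MAX_REQS);
    if (parked[reqid]) {
        parked[reqid] = false;   /* resume the parked request */
    } else {
        bool done = false;
        waiter[reqid] = &done;   /* "point B" */
        while (!done) {
            poll_once();         /* completion runs only from here */
        }
    }
    delivered[reqid]++;
}

/* Runs both orderings; returns 2 when each request is delivered once. */
int replay_demo(void)
{
    request_completed(0);        /* order 1: I/O first, then log event */
    run_event(0);

    pending_io = 1;              /* order 2: log event first, I/O later */
    run_event(1);

    return delivered[0] + delivered[1];
}
```

In this model the completion side can only run from inside poll_once(),
so the check and the insert at either point are never interleaved with
the other side. That is the property the two bullets above rely on.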

Of course, in order for this to work, we actually need to take the mutex
before calling blkreplay_run_event(), which is called directly from the
replay code (which runs in the mainloop thread? Or vcpu?).

So I think you need to have an aio_context_acquire(bs->aio_context) and
aio_context_release(bs->aio_context) around the function; either here or
in the calling replay code.
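
As a rough sketch of the second option (locking in the calling replay
code), something along these lines. The struct definitions and the bodies
of aio_context_acquire()/aio_context_release() below are simplified
stand-ins so that the fragment compiles on its own; the real ones come
from QEMU's headers. replay_block_event() is a hypothetical name for the
caller, and blkreplay_run_event() is given a bs argument here only for
illustration.

```c
#include <assert.h>

/* Stand-in for QEMU's AioContext: only tracks how often it is locked. */
typedef struct AioContext {
    int lock_depth;
} AioContext;

/* Stand-in for BlockDriverState, reduced to the one field used here. */
typedef struct BlockDriverState {
    AioContext *aio_context;
} BlockDriverState;

/* Simplified stand-ins for the real QEMU locking functions. */
void aio_context_acquire(AioContext *ctx)
{
    ctx->lock_depth++;
}

void aio_context_release(AioContext *ctx)
{
    assert(ctx->lock_depth > 0);
    ctx->lock_depth--;
}

int events_run;

/* Placeholder body; it expects the caller to hold the context lock. */
void blkreplay_run_event(BlockDriverState *bs)
{
    assert(bs->aio_context->lock_depth > 0);
    events_run++;
}

/* Hypothetical entry point used by the replay code. */
void replay_block_event(BlockDriverState *bs)
{
    AioContext *ctx = bs->aio_context;

    aio_context_acquire(ctx);
    blkreplay_run_event(bs);
    aio_context_release(ctx);
}
```

Whether the lock is taken here or inside blkreplay_run_event() itself
mostly depends on which thread the replay code runs in, which is still
an open question above.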

Kevin
