qemu-devel


From: Kevin Wolf
Subject: Re: [Qemu-devel] [Qemu-block] Some question about savem/qcow2 incremental snapshot
Date: Fri, 11 May 2018 19:25:31 +0200
User-agent: Mutt/1.9.1 (2017-09-22)

On 10.05.2018 at 10:26, Stefan Hajnoczi wrote:
> On Wed, May 09, 2018 at 07:54:31PM +0200, Max Reitz wrote:
> > On 2018-05-09 12:16, Stefan Hajnoczi wrote:
> > > On Tue, May 08, 2018 at 05:03:09PM +0200, Kevin Wolf wrote:
> > >> On 08.05.2018 at 16:41, Eric Blake wrote:
> > >>> On 12/25/2017 01:33 AM, He Junyan wrote:
> > >> 2. Make the nvdimm device use the QEMU block layer so that it is backed
> > >>    by a non-raw disk image (such as a qcow2 file representing the
> > >>    content of the nvdimm) that supports snapshots.
> > >>
> > >>    This part is hard because it requires some completely new
> > >>    infrastructure such as mapping clusters of the image file to guest
> > >>    pages, and doing cluster allocation (including the copy on write
> > >>    logic) by handling guest page faults.
> > >>
> > >> I think it makes sense to invest some effort into such interfaces, but
> > >> be prepared for a long journey.
> > > 
> > > I like the suggestion but it needs to be followed up with a concrete
> > > design that is feasible and fair for Junyan and others to implement.
> > > Otherwise the "long journey" is really just a way of rejecting this
> > > feature.
> > > 
> > > Let's discuss the details of using the block layer for NVDIMM and try to
> > > come up with a plan.
> > > 
> > > The biggest issue with using the block layer is that persistent memory
> > > applications use load/store instructions to directly access data.  This
> > > is fundamentally different from the block layer, which transfers blocks
> > > of data to and from the device.
> > > 
> > > Because of block DMA, QEMU is able to perform processing at each block
> > > driver graph node.  This doesn't exist for persistent memory because
> > > software does not trap I/O.  Therefore the concept of filter nodes
> > > doesn't make sense for persistent memory - we certainly do not want to
> > > trap every I/O because performance would be terrible.
> > > 
> > > Another difference is that persistent memory I/O is synchronous.
> > > Load/store instructions execute quickly.  Perhaps we could use KVM async
> > > page faults in cases where QEMU needs to perform processing, but again
> > > the performance would be bad.
> > 
> > Let me first say that I have no idea how the interface to NVDIMM looks.
> > I just assume it works pretty much like normal RAM (so the interface is
> > just that it’s a part of the physical address space).
> > 
> > Also, it sounds a bit like you are already discarding my idea, but here
> > goes anyway.
> > 
> > Would it be possible to introduce a buffering block driver that presents
> > the guest an area of RAM/NVDIMM through an NVDIMM interface (so I
> > suppose as part of the guest address space)?  For writing, we’d keep a
> > dirty bitmap on it, and then we’d asynchronously move the dirty areas
> > through the block layer, so basically like mirror.  On flushing, we’d
> > block until everything is clean.
> > 
> > For reading, we’d follow a COR/stream model, basically, where everything
> > is unpopulated in the beginning and everything is loaded through the
> > block layer both asynchronously all the time and on-demand whenever the
> > guest needs something that has not been loaded yet.
> > 
> > Now I notice that that looks pretty much like a backing file model where
> > we constantly run both a stream and a commit job at the same time.
> > 
> > The user could decide how much memory to use for the buffer, so it could
> > either hold everything or be partially unallocated.
> > 
> > You’d probably want to back the buffer by NVDIMM normally, so that
> > nothing is lost on crashes (though this would imply that for partial
> > allocation the buffering block driver would need to know the mapping
> > between the area in real NVDIMM and its virtual representation of it).
> > 
> > Just my two cents while scanning through qemu-block to find emails that
> > don’t actually concern me...
> 
> The guest kernel already implements this - it's the page cache and the
> block layer!
> 
> Doing it in QEMU with dirty memory logging enabled is less efficient
> than doing it in the guest.
> 
> That's why I said it's better to just use block devices than to
> implement buffering.
> 
> I'm saying that persistent memory emulation on top of the iscsi:// block
> driver (for example) does not make sense.  It could be implemented but
> the performance wouldn't be better than block I/O and the
> complexity/code size in QEMU isn't justified IMO.

I think it could make sense if you put everything together.

The primary motivation to use this would of course be that you can
directly map the guest clusters of a qcow2 file into the guest. We'd
potentially fault on the first access, but once it's mapped, you get raw
speed. You're right about flushing, and I was indeed thinking of
Pankaj's work there; maybe I should have been more explicit about that.
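
To make the "fault on first access, then raw speed" idea a bit more
concrete, here is a toy model in Python. It is only a sketch under loud
assumptions: LazyClusterMap, CLUSTER_SIZE and the faults counter are
names made up for illustration, nothing here is QEMU or qcow2 code, and
a real implementation would install page table mappings in a fault
handler instead of filling a dict.

```python
CLUSTER_SIZE = 64 * 1024  # assumed cluster size, chosen only for the example


class LazyClusterMap:
    """Toy model: a guest cluster is mapped the first time it is touched."""

    def __init__(self, image):
        # image: dict of guest cluster index -> bytearray, standing in for
        # the allocated clusters of an image file (hypothetical layout).
        self.image = image
        self.mapped = {}   # clusters the "guest" can already reach directly
        self.faults = 0    # how often the slow path was taken

    def _fault(self, index):
        # Slow path, taken once per cluster. An unallocated cluster is
        # allocated zero-filled here, which is where COW logic would live.
        self.faults += 1
        cluster = self.image.setdefault(index, bytearray(CLUSTER_SIZE))
        self.mapped[index] = cluster
        return cluster

    def _cluster(self, offset):
        index = offset // CLUSTER_SIZE
        cluster = self.mapped.get(index)
        if cluster is None:
            cluster = self._fault(index)
        return cluster, offset % CLUSTER_SIZE

    def load(self, offset):
        cluster, pos = self._cluster(offset)
        return cluster[pos]

    def store(self, offset, value):
        cluster, pos = self._cluster(offset)
        cluster[pos] = value
```

With an empty image, a store to offset 10 followed by a load from the
same offset takes the slow path once; only an access to a different
cluster faults again.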

Buffering in QEMU might be useful when you want to run a block job on
the device. Block jobs are usually temporary, and temporarily lower
performance may well be acceptable when the alternative is that you
can't perform block jobs at all.
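
For illustration, the buffered mode could look roughly like the
following Python toy, which tracks dirty chunks and writes them back on
flush, as in the dirty bitmap idea quoted above. Again this is a sketch,
not a design: DirtyBuffer, chunk_size and the write_back callback are
hypothetical names, and write_back merely stands in for a write through
the block layer.

```python
class DirtyBuffer:
    """Toy model: stores land in a buffer, dirty chunks are written back later."""

    def __init__(self, size, chunk_size, write_back):
        self.buf = bytearray(size)
        self.chunk_size = chunk_size
        self.write_back = write_back  # assumed callable(offset, data)
        self.dirty = set()            # indices of chunks not yet written back

    def store(self, offset, data):
        if offset < 0 or offset + len(data) > len(self.buf):
            raise ValueError("store outside of buffer")
        self.buf[offset:offset + len(data)] = data
        # Mark every chunk touched by this store as dirty.
        first = offset // self.chunk_size
        last = (offset + len(data) - 1) // self.chunk_size
        self.dirty.update(range(first, last + 1))

    def flush(self):
        # Returns only once every dirty chunk has been written back.
        for chunk in sorted(self.dirty):
            start = chunk * self.chunk_size
            end = start + self.chunk_size
            self.write_back(start, bytes(self.buf[start:end]))
        self.dirty.clear()
```

A two-byte store that straddles a chunk boundary marks both chunks
dirty, and flush() hands each of them to write_back exactly once. The
toy writes back synchronously; the asynchronous background copying is
left out to keep it short.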

If we want to offer something nvdimm-like not only for the extreme
"performance only, no features" case, but as a viable option for the
average user, we need to be fast in the normal case and allow the use of
any block layer feature without having to restart the VM with a
different storage device, even if that comes at a performance penalty.

On iscsi, you still don't gain anything compared to just using a block
device, but support for that might just happen as a side effect when you
implement the interesting features.

Kevin
