Re: [Qemu-devel] [PATCH v2 1/1] NBD proto: add WRITE_ZEROES extension
From: Paolo Bonzini
Subject: Re: [Qemu-devel] [PATCH v2 1/1] NBD proto: add WRITE_ZEROES extension
Date: Thu, 31 Mar 2016 16:40:35 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0
On 31/03/2016 16:27, Alex Bligh wrote:
> > > IE why not always permit trimming PROVIDED the data always reads back
> > > as zero? This would be far simpler.
> >
> > Because trimming can make future operations more expensive and cause
> > fragmentation (which may not be as bad as it used to be at the media
> > level, but it is still somewhat bad at the filesystem level).
> >
> > So if you want a fully-provisioned file, the simplest way to do so is to
> > write zeroes to it, and trimming is undesirable.
> But isn't the server in a better position to know this than the
> client?
There are at least three possible states for a sector:
- hole (thin-provisioned)
- allocated as data (disk contains actual zeroes)
- allocated as unwritten (blocks reserved on backing storage, reads as
  zeroes but the disk may not contain actual zeroes)
It's always okay for the backend to convert a zero block to an unwritten
extent; it's generally not okay for a backend to take a request to
create an unwritten extent and instead create a hole.
It's all an "as if" situation. The server must provide the semantics
requested by the client. For example, writing to a hole could cause
ENOSPC; writing to an unwritten extent could not. The server might know
better, because it certainly is in a better position to know how to
fulfill the client's request.
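To make the "as if" rule concrete, here is a minimal sketch of the three states and of which substitutions a backend could make. The names (SectorState, write_may_enospc, backend_may_substitute) are hypothetical, and the model assumes that the only guarantee distinguishing the states is whether a later write can fail with ENOSPC, since all three read back as zeroes.

```python
from enum import Enum


class SectorState(Enum):
    HOLE = "hole"            # thin-provisioned, no space reserved
    DATA_ZEROES = "data"     # allocated, disk contains actual zeroes
    UNWRITTEN = "unwritten"  # allocated, reads as zeroes


def write_may_enospc(state):
    """Only a hole has no space reserved, so only a hole can hit ENOSPC."""
    return state is SectorState.HOLE


def backend_may_substitute(requested, actual):
    """True if creating `actual` still honours what `requested` promised.

    Assumption of this sketch: the substitution is acceptable unless it
    drops the "a later write cannot hit ENOSPC" guarantee.
    """
    return not (write_may_enospc(actual) and not write_may_enospc(requested))


# Zero block -> unwritten extent: allowed.
print(backend_may_substitute(SectorState.DATA_ZEROES, SectorState.UNWRITTEN))
# Unwritten extent -> hole: not allowed.
print(backend_may_substitute(SectorState.UNWRITTEN, SectorState.HOLE))
```

Under this model a request for a hole may be answered with allocated space, which treats trimming as a hint the server is free to ignore.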
But even if it's just a hint, it makes sense for NBD to provide it.
It's not a coincidence that this hint exists at all levels: SCSI has an
UNMAP bit that can be set in the WRITE SAME command (and it has UNMAP
which matches NBD's TRIM); the fallocate system call has
FALLOC_FL_ZERO_RANGE and FALLOC_FL_PUNCH_HOLE (plus Linux has the
BLKDISCARD ioctl which again matches NBD's TRIM for block devices).
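As an illustration of the fallocate pair, here is a rough sketch that calls it through ctypes. It assumes 64-bit Linux with glibc and a filesystem that supports both modes; the helper names (zero_range, punch_hole) are made up for this example, and the flag values are copied from <linux/falloc.h>.

```python
import ctypes
import os

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02
FALLOC_FL_ZERO_RANGE = 0x10


def _fallocate(fd, mode, offset, length):
    # Assumption: off_t is 64 bits wide (64-bit Linux).
    libc = ctypes.CDLL(None, use_errno=True)
    libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                               ctypes.c_int64, ctypes.c_int64]
    libc.fallocate.restype = ctypes.c_int
    if libc.fallocate(fd, mode, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def zero_range(fd, offset, length):
    """Range reads back as zeroes and stays allocated."""
    _fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, length)


def punch_hole(fd, offset, length):
    """Range reads back as zeroes and is deallocated.

    PUNCH_HOLE must be combined with KEEP_SIZE.
    """
    _fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, length)
```

Both calls leave the range reading as zeroes; the difference is only whether the space stays reserved, which is the same distinction the NBD flag is meant to carry.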
> EG if the server has a back end implementation (as I suspect
> Ceph on qemu-nbd does)
Ceph doesn't, but gluster does.
> which never actually stores all zero blocks,
> it won't make a difference, and conceivably you're generating a whole
> pile of I/O to avoid sparseness when sparseness might be faster. Take
> for example a persistent memory interface, where fragmentation is
> irrelevant, and writing piles of zeroes to memory is a waste of time.
It certainly isn't a waste of time if your intention is to scrub data
belonging to a previous tenant, before giving access to someone else!
If you have a metadata layer above, then you can handle the command there
(that's why we're adding it); if you haven't, you do have to write the
zeroes.
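A toy sketch of the two cases, just to show the difference in I/O. ToyBackend, BLOCK and every other name here are hypothetical and not part of any NBD server; the model is block-aligned and keeps everything in memory.

```python
BLOCK = 512  # hypothetical block size


class ToyBackend:
    def __init__(self, nblocks, has_metadata):
        self.has_metadata = has_metadata
        # Pretend every block still holds a previous tenant's data.
        self.blocks = [b"\xaa" * BLOCK for _ in range(nblocks)]
        self.zero_blocks = set()  # blocks recorded as zero in metadata
        self.bytes_written = 0    # payload I/O actually issued

    def write_zeroes(self, first, count):
        for i in range(first, first + count):
            if self.has_metadata:
                # Handle the command in the metadata layer: no payload I/O.
                self.zero_blocks.add(i)
            else:
                # No metadata layer: the zeroes really have to be written.
                self.blocks[i] = bytes(BLOCK)
                self.bytes_written += BLOCK

    def read(self, i):
        if i in self.zero_blocks:
            return bytes(BLOCK)
        return self.blocks[i]


with_meta = ToyBackend(4, has_metadata=True)
without_meta = ToyBackend(4, has_metadata=False)
for backend in (with_meta, without_meta):
    backend.write_zeroes(1, 2)
    print(backend.read(1) == bytes(BLOCK), backend.bytes_written)
```

Either way the client reads zeroes afterwards; only the backend without metadata pays for it in writes.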
Paolo