Re: [Qemu-devel] [PATCH v2 1/1] NBD proto: add WRITE_ZEROES extension

From: Paolo Bonzini
Subject: Re: [Qemu-devel] [PATCH v2 1/1] NBD proto: add WRITE_ZEROES extension
Date: Thu, 31 Mar 2016 16:40:35 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0

On 31/03/2016 16:27, Alex Bligh wrote:
> > > IE why not always permit trimming PROVIDED the data always reads back
> > > as zero? This would be far simpler.
> > 
> > Because trimming can make future operations more expensive and cause
> > fragmentation (which may not be as bad as it used to be at the media
> > level, but it is still somewhat bad at the filesystem level).
> > 
> > So if you want a fully-provisioned file, the simplest way to do so is to
> > write zeroes to it, and trimming is undesirable.
> But isn't the server in a better position to know this than the
> client?

There are at least three possible states for a sector:

- hole (thin-provisioned)

- allocated as data (disk contains actual zeroes)

- allocated as unwritten (blocks reserved on backing storage, reads as
zeroes but the disk may not contain actual zeroes)

It's always okay for the backend to convert a zero block to an unwritten
extent; it's generally not okay for a backend to take a request to
create an unwritten extent and instead create a hole.

It's all an "as if" situation. The server must provide the semantics
requested by the client.  For example, writing to a hole could cause
ENOSPC, writing to an unwritten extent could not.  The server might know
better, because it certainly is in a better position to know how to
fulfill the client's request.
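
To make the "as if" rule concrete, here is a minimal sketch of the three
states and of which substitutions a backend could make. The names
(SectorState, write_can_hit_enospc, backend_may_substitute) are hypothetical
and not part of NBD or any server. Only two cases come from the discussion
above (zero block -> unwritten extent is okay, unwritten extent -> hole is
not); the general rule used here, "a substitution must not add an ENOSPC
risk the client did not ask for", is an assumed reading of it.

```python
from enum import Enum


class SectorState(Enum):
    HOLE = "hole"            # thin-provisioned: no blocks reserved
    DATA = "data"            # allocated, disk contains actual zeroes
    UNWRITTEN = "unwritten"  # blocks reserved, reads as zeroes


def write_can_hit_enospc(state: SectorState) -> bool:
    """Assumption: only a hole needs new space on a later write."""
    return state is SectorState.HOLE


def backend_may_substitute(requested: SectorState, actual: SectorState) -> bool:
    """Hypothetical "as if" check.

    The backend may pick a different state as long as it does not
    introduce an ENOSPC risk that the requested state did not have.
    """
    return not write_can_hit_enospc(actual) or write_can_hit_enospc(requested)


# Zero block converted to an unwritten extent: fine.
print(backend_may_substitute(SectorState.DATA, SectorState.UNWRITTEN))
# Unwritten extent requested, hole created instead: not fine.
print(backend_may_substitute(SectorState.UNWRITTEN, SectorState.HOLE))
```

Under this model the server is free to choose how to fulfill a request, but
not to weaken the guarantee the client asked for.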

But even if it's just a hint, it makes sense for NBD to provide it.
It's not a coincidence that this hint exists at all levels: SCSI has an
UNMAP bit that can be set in the WRITE SAME command (and it has UNMAP
which matches NBD's TRIM); the fallocate system call has
FALLOC_FL_ZERO_RANGE and FALLOC_FL_PUNCH_HOLE (and there is the
BLKDISCARD ioctl which again matches NBD's TRIM for block devices).
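
As an illustration of the same hint at the file level, a server backed by a
regular file on Linux could map a "write zeroes" request onto fallocate
roughly as below. This is only a sketch: the may_trim parameter and both
helper names are hypothetical, the flag values are those from
<linux/falloc.h>, and whether a given filesystem supports each mode is not
checked here.

```python
import ctypes
import ctypes.util
import os

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02
FALLOC_FL_ZERO_RANGE = 0x10


def zeroing_mode(may_trim: bool) -> int:
    """Pick a fallocate mode for a zeroing request (hypothetical helper)."""
    if may_trim:
        # Deallocate the range so it becomes a hole, like TRIM.
        return FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    # Keep the blocks reserved; filesystems typically do this by
    # converting the range to unwritten extents.
    return FALLOC_FL_ZERO_RANGE


def zero_range(fd: int, offset: int, length: int, may_trim: bool) -> None:
    """Zero [offset, offset + length) of fd via fallocate(2), Linux only."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                               ctypes.c_long, ctypes.c_long]
    libc.fallocate.restype = ctypes.c_int
    if libc.fallocate(fd, zeroing_mode(may_trim), offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

A caller that wants a fully-provisioned file would pass may_trim=False; if
the filesystem rejects the mode, the OSError tells the caller to fall back
to writing actual zeroes.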

> EG if the server has a back end implementation (as I suspect
> Ceph on qemu-nbd does)

Ceph doesn't, but gluster does.

> which never actually stores all zero blocks,
> it won't make a difference, and conceivably you're generating a whole
> pile of I/O to avoid sparseness when sparseness might be faster. Take
> for example a persistent memory interface, where fragmentation is
> irrelevant, and writing piles of zeroes to memory is a waste of time.

It certainly isn't a waste of time if your intention is to scrub data
belonging to a previous tenant, before giving access to someone else!
If you have a metadata layer above then you can handle the command there
(that's why we're adding it); if you haven't you do have to write the

