qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension


From: Pavel Borzenkov
Subject: Re: [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension
Date: Thu, 24 Mar 2016 15:30:06 +0300
User-agent: Mutt/1.5.24 (2015-08-30)

On Wed, Mar 23, 2016 at 10:27:00AM -0600, Eric Blake wrote:
> On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> > From: Pavel Borzenkov <address@hidden>
> > 
> > With the availability of sparse storage formats, it is often needed to
> > query status of a particular LBA range and read only those blocks of
> > data that are actually present on the block device.
> 
> The acronym LBA is not used elsewhere in the NBD spec; should we spell
> it out at least once?
> 
> > 
> > To provide such information, the patch adds GET_LBA_STATUS extension
> > with one new NBD_CMD_GET_LBA_STATUS command.
> > 
> > There exists a concept of data dirtiness, which is required during, for
> > example, incremental block device backup. To express this concept via
> > NBD protocol, this patch also adds additional mode of operation to
> > NBD_CMD_GET_LBA_STATUS command.
> > 
> > Since NBD protocol has no notion of block size, and to mimic SCSI "GET
> > LBA STATUS" command more closely, it has been chosen to return a list of
> > extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a
> > bitmap.
> > 
> > Signed-off-by: Pavel Borzenkov <address@hidden>
> > Reviewed-by: Roman Kagan <address@hidden>
> > Signed-off-by: Denis V. Lunev <address@hidden>
> > CC: Wouter Verhelst <address@hidden>
> > CC: Paolo Bonzini <address@hidden>
> > CC: Kevin Wolf <address@hidden>
> > CC: Stefan Hajnoczi <address@hidden>
> > ---
> >  doc/proto.md | 82 
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 82 insertions(+)
> > 
> 
> >  
> > +### `GET_LBA_STATUS` extension
> > +
> > +With the availability of sparse storage formats, it is often needed to 
> > query
> > +status of a particular LBA range and read only those blocks of data that 
> > are
> > +actually present on the block device.
> > +
> > +Some storage formats and operations over such formats express a concept of
> > +data dirtiness. Whether the operation is block device mirroring,
> > +incremental block device backup or any other operation with a concept of
> > +data dirtiness, they all share a need to provide a list of LBA ranges
> > +that this particular operation treats as dirty.
> > +
> > +To provide such class of information, `GET_LBA_STATUS` extension adds new
> > +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with
> > +their respective states.
> > +
> > +* `NBD_CMD_GET_LBA_STATUS` (7)
> > +
> > +    An LBA range status query request. Length and offset define the range
> > +    of interest. The server MUST reply with a reply header, followed
> > +    immediately by the following data:
> > +
> > +      - 32 bits, length of parameter data that follow (unsigned)
> 
> Is this length the number of descriptors, or the number of bytes
> occupied by those descriptors?  It looks like bytes (that is, with the
> current layout, this field should be a multiple of 14 unless an error is
> returned and the data is bogus).
> 
> I guess 32 bits is sufficient: transmission commands are limited to
> 32-bit length, and we are unlikely to have more than one extent per 512
> bytes of length, so even if we have a header for every single sector
> (worst-case for alternating clean/dirty sectors), as long as the
> smallest granularity of an extent is larger than the extent field, the
> 'length of parameter data' in bytes is still sufficient.
> 
> > +      - zero or more LBA status descriptors, each having the following
> 
> zero or more? [1]
> 
> > +        structure:
> > +
> > +        * 64 bits, offset (unsigned)
> > +        * 32 bits, length (unsigned)
> > +        * 16 bits, status (unsigned)
> 
> An array of these status descriptors is packed on the wire, while the
> typical C layout of an array of these structures will have padding to
> reach a nice 8-byte alignment.  Should 'status' be a 32-bit field, so
> that clients and servers do not have to pack/unpack between 14 bytes on
> the wire and 16 bytes in efficient array handling, at the expense of
> more network traffic?
> 
> Conversely, it would be possible to send less data over the wire, as
> long as we require that all LBA status descriptors cover consecutive
> offsets.  That is, having the server reply with offsets is pointless,
> since they can be reconstructed on the client by starting with the
> offset in the client's request, then adding length from each status
> field.  Is less network traffic desirable?

I think it's better to explicitly send the start of each LBA extent.
So I'll go with changing 'status' field to 32 bits to avoid
packing/unpacking issues.

> 
> > +
> > +    unless an error condition has occurred.
> > +
> > +    If an error occurs, the server SHOULD set the appropriate error code
> > +    in the error field. The server MUST then either close the
> > +    connection, or send *length of parameter data* bytes of data
> > +    (which MAY be invalid).
> > +
> > +    The type of information required by the client is passed to server in 
> > the
> > +    command flags field. If the server does not implement requested type or
> > +    have no means to express it, it MUST NOT return an error, but instead 
> > MUST
> > +    return a single LBA status descriptor with *offset* and *length* equal 
> > to
> > +    the *offset* and *length* from request, and *status* set to `0`.
> 
> [1] So in what situations will we ever return an array of zero status
> fields? On an error?  Should we make it clear that the server MUST NOT
> return 0 status fields except on an error?
> 
> Do we want to require that the server MUST reply with enough extents to
> sum up to the length of the client's request, or are we permitting a
> short reply?

While the "GET LBA STATUS" command in SCSI allows partial reply, I
believe it'd better to keep things simple and require that the server
must either return a list of extents that covers the whole requested
range, or an error.

> 
> > +
> > +    The following request types are currently defined for the command:
> > +
> > +    1. Block provisioning state
> > +
> > +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> > +    field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return
> 
> Here, you spell it '0x0'; in the previous patch, you spelled the command
> flag as 'bit 1' - does that mean that Block provisioning state is the
> default when no command flags are sent?  What if we later add other
> flags but still want block provisioning mode?  Wouldn't it be better to
> state that this mode is entered when the NBD_FLAG_GET_DIRTY flag is
> clear, without regards to the state of the other flags, than allowing
> this mode only when all 16 flags are zero?
> 
> For example, should we allow a flag that states that the client is
> interested only in allocated/unallocated, and that the server may
> coalesce NBD_STATE_ZEROED extents as if they were NBD_STATE_ALLOCATED
> for fewer extents reported and thus potentially less network traffic?

Actually, for this command I treat 'command flags' field not as a set of
flags, but rather as a plain number representing required mode of
operation. Probably, not a good idea as it doesn't match the rest of the
commands.

I went this way because I didn't want to allow clients to request
several modes simultaneously (e.g. provisioning + dirtiness) in the same
request. This makes server side implementation harder.

I think I'll just switch to bits to match the rest of the commands and
will add a note, that server should return EINVAL in case several modes
are requested simultaneously.

> 
> > +    the provisioning state of the device. The following provisionnig states
> 
> s/provisionnig/provisioning/
> 
> > +    are defined for the command:
> > +
> > +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block 
> > device;
> > +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
> > +        and contains zeroes;
> > +      - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the
> > +        block device. A client MUST NOT make any assumptions about the
> > +        contents of the extent.
> 
> Can NBD_STATE_ALLOCATED and NBD_STATE_DEALLOCATED both be set at the
> same time, or is that an error on the server?  What do we know about an
> extent that has neither NBD_STATE_ALLOCATED nor NBD_STATE_DEALLOCATED set?
> 
> /me re-reads
> 
> Oh, you are using this as the _entire_ 16-bit status value, rather than
> as bits 0, 1, and 2 as flags.

Yes

> 
> But I think you have two binary flags (four possible states) here: it is
> quite conceivable to have a server on top of a SCSI device, where we
> know that the extent is unallocated in SCSI, but where the server will
> guarantee that it reads as all zeroes (possibly because the server
> bypasses SCSI on the NBD read commands when it knows SCSI is
> unallocated).  That is, if we set this up as two flags:
> 
> 0x1 - allocated
> 0x2 - reads as 0
> 
> then we can express four states:
> 
> 0x0 - LBA extent not present, client MUST NOT make assumptions about
> contents, and reads should not be attempted
> 0x1 - LBA extent allocated, reads will succeed but no guarantee on contents
> 0x2 - LBA extent not present, but client can treat the extent as zeroes
> and reads will succeed
> 0x3 - LBA extent present, client can treat the extent as zeroes and
> reads will succeed

I'm not sure that clients need this level of details. From client's POV
0x2 and 0x3 are the same.

> 
> Actually, we should probably pick the bit values such that 0x0 means
> allocated and readable, as the most common state, since we also require
> that the command returns a single extent over the entire length with
> status 0 if the server can't provide any further details.
> 
> I'm not familiar enough with the SCSI "GET LBA STATUS" command to know
> if your command sanely matches to that one.

Extents returned by "GET LBA STATUS" in SCSI can have three possible
state: MAPPED/ANCHORED/DEALLOCATED. These are not bits and cannot be
combined together.

> 
> > +
> > +    2. Block dirtiness state
> > +
> > +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> > +    field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return
> 
> This overlaps with the bit value for NBD_FLAG_SEND_FUA in the previous
> patch.  Is that okay?  Or should we use a different bit value, on the
> grounds that some future extension may want to use both flags
> orthogonally within the same (possibly new) command?  Again, consistency
> in the spelling ('bit 1' in the previous patch, '0x1' here), would be nice.

I couldn't think of any use case. But, just in case, I'll change the
value of the flags so they don't overlap.

-- 
Pavel

> 
> > +    the dirtiness status of the device. The following dirtiness states
> > +    are defined for the command:
> > +
> > +      - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
> > +      - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.
> 
> Again, it looks like you are using these as two entire 16-bit status
> values, rather than as two separate bits (1<<0 and 1<<1).  Another way
> of expressing it is that a single boolean flag is defined, if clear, the
> extent is dirty, if set, the extent is clean.
> 
> > +
> > +    Generic NBD client implementation without knowledge of a particular NBD
> > +    server operation MUST NOT make any assumption on the meaning of the
> > +    NBD_STATE_DIRTY or NBD_STATE_CLEAN states.
> > +
> > +The server SHOULD return `EINVAL` if it receives a `GET_LBA_STATUS` request
> > +including one or more sectors beyond the size of the device.
> 
> As mentioned in the previous mail, should we also recommend an EINVAL if
> NBD_CMD_GET_LBA_STATUS was not negotiated in the export options but the
> client sends the command anyways; and/or a requirement that the client
> must not issue the command in that case?
> 
> > +
> >  ## About this file
> >  
> >  This file tries to document the NBD protocol as it is currently
> > 
> 
> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 





reply via email to

[Prev in Thread] Current Thread [Next in Thread]