qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extensio


From: John Snow
Subject: Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension
Date: Mon, 28 Nov 2016 18:15:39 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0

Hi Wouter,

Some of this mess may be partially my fault, but I have not been following the NBD extension proposals up until this point.

Are you familiar with the genesis behind this idea and what we are trying to accomplish in general?

We had the thought to propose an extension roughly similar to SCSI's 'get lba status', but the existing status bits there did not correlate semantically to what we are hoping to convey. There was some debate over if it would be an abuse of protocol to attempt to use them as such.

For a quick recap, get lba status appears to offer three canonical statuses:

0h: "LBA extent is mapped, [...] or has an unknown state"
1h: "[...] LBA extent is deallocated"
2h: "[...] LBA extent is anchored"

My interpretation of mapped was simply that it was physically allocated, and 'deallocated' was simply unallocated.

(I uh, am not actually clear on what anchored means exactly.)

Either way, we felt at the time that it would be wrong to propose an analogue command for NBD and then immediately abuse the existing semantics, hence a new command like -- but not identical to -- the SCSI one.


On 11/27/2016 02:17 PM, Wouter Verhelst wrote:
Hi Vladimir,

Quickly: the reason I haven't merged this yes is twofold:
- I wasn't thrilled with the proposal at the time. It felt a bit
  hackish, and bolted onto NBD so you could use it, but without defining
  everything in the NBD protocol. "We're reading some data, but it's not
  about you". That didn't feel right
- There were a number of questions still unanswered (you're answering a
  few below, so that's good).

For clarity, I have no objection whatsoever to adding more commands if
they're useful, but I would prefer that they're also useful with NBD on
its own, i.e., without requiring an initiation or correlation of some
state through another protocol or network connection or whatever. If
that's needed, that feels like I didn't do my job properly, if you get
my point.

On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
With the availability of sparse storage formats, it is often needed
to query status of a particular range and read only those blocks of
data that are actually present on the block device.

To provide such information, the patch adds the BLOCK_STATUS
extension with one new NBD_CMD_BLOCK_STATUS command, a new
structured reply chunk format, and a new transmission flag.

There exists a concept of data dirtiness, which is required
during, for example, incremental block device backup. To express
this concept via NBD protocol, this patch also adds a flag to
NBD_CMD_BLOCK_STATUS to request dirtiness information rather than
provisioning information; however, with the current proposal, data
dirtiness is only useful with additional coordination outside of
the NBD protocol (such as a way to start and stop the server from
tracking dirty sectors).  Future NBD extensions may add commands
to control dirtiness through NBD.

Since NBD protocol has no notion of block size, and to mimic SCSI
"GET LBA STATUS" command more closely, it has been chosen to return
a list of extents in the response of NBD_CMD_BLOCK_STATUS command,
instead of a bitmap.

CC: Pavel Borzenkov <address@hidden>
CC: Denis V. Lunev <address@hidden>
CC: Wouter Verhelst <address@hidden>
CC: Paolo Bonzini <address@hidden>
CC: Kevin Wolf <address@hidden>
CC: Stefan Hajnoczi <address@hidden>
Signed-off-by: Eric Blake <address@hidden>
Signed-off-by: Vladimir Sementsov-Ogievskiy <address@hidden>
---

v3:

Hi all. This is almost a resend of v2 (by Eric Blake), The only change is
removing the restriction, that sum of status descriptor lengths must be equal
to requested length. I.e., let's permit the server to replay with less data
than required if it wants.

Reasonable, yes. The length that the client requests should be a maximum (i.e.
"I'm interested in this range"), not an exact request.

Also, bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now
 NBD_FLAG_CAN_MULTI_CONN in master branch.

Right.

And, finally, I've rebased this onto current state of
extension-structured-reply branch (which itself should be rebased on
master IMHO).

Probably a good idea, given the above.

By this resend I just want to continue the diqussion, started about half
a year ago. Here is a summary of some questions and ideas from v2
diqussion:

1. Q: Synchronisation. Is such data (dirty/allocated) reliable?
   A: This all is for read-only disks, so the data is static and unchangeable.

I think we should declare that it's up to the client to ensure no other
writes happen without its knowledge. This may be because the client and
server communicate out of band about state changes, or because the
client somehow knows that it's the only writer, or whatever.

We can easily do that by declaring that the result of that command only
talks about *current* state, and that concurrent writes by different
clients may invalidate the state. This is true for NBD in general (i.e.,
concurrent read or write commands from other clients may confuse file
systems on top of NBD), so it doesn't change expectations in any way.


Agree with you here. "This was correct when I sent it, but it's up to you to ensure that nothing would have invalidated that in the meantime" is fine semantically to me.

Of course in our implementation, we intend to only export essentially read-only snapshots of data, so we don't personally expect to run into any of these kind of semantic problems.

2. Q: different granularities of dirty/allocated bitmaps. Any problems?
   A: 1: server replies with status descriptors of any size, granularity
         is hidden from the client
      2: dirty/allocated requests are separate and unrelated to each
         other, so their granularities are not intersecting

Not entirely sure anymore what this is about?


We have a concept of granularity for dirty tracking; consider it a block size.

I don't think it's necessarily relevant to NBD, except for the case of false positives. Consider the case of a 512 byte block that is assumed dirty simply because it's adjacent to actually dirty data.

That may have some meaning for how we write the NBD spec; i.e. the meaning of the status bit changes from "This is definitely modified" to "This was possibly modified," due to limitations in the management of this information by the server.

If we expose the granularity information via NBD, it at least makes it clear how fuzzy the results presented may be. Otherwise, it's not really required.

[Again, as seen below, a user-defined bit would be plenty sufficient...!]

3. Q: selecting of dirty bitmap to export
   A: several variants:
      1: id of bitmap is in flags field of request
          pros: - simple
          cons: - it's a hack. flags field is for other uses.
                - we'll have to map bitmap names to these "ids"
      2: introduce extended nbd requests with variable length and exploit this
         feature for BLOCK_STATUS command, specifying bitmap identifier.
         pros: - looks like a true way
         cons: - we have to create additional extension
               - possible we have to create a map,
                 {<QEMU bitmap name> <=> <NBD bitmap id>}
      3: exteranl tool should select which bitmap to export. So, in case of Qemu
         it should be something like qmp command block-export-dirty-bitmap.
         pros: - simple
               - we can extend it to behave like (2) later
         cons: - additional qmp command to implement (possibly, the lesser evil)
         note: Hmm, external tool can make chose between allocated/dirty data 
too,
               so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all.

Downside of 3, though, is that it moves the definition of what the
different states mean outside of the NBD protocol (i.e., the protocol
messages are not entirely defined anymore, and their meaning depends on
the clients and servers in use).

To avoid this, we should have a clear definition of what the reply means
*by default*, but then we can add a note that clients and servers can
possibly define other meanings out of band if they want to.


I had personally only ever considered #3; where a command would be issued to QEMU to begin offering NBD data for some point-in-time as associated with a particular reference/snapshot/backup/etc. This leaves it up to the out-of-band client to order up the right data.

That does appear to choose a meaningful name for the status bits a bit more difficult...

[...but reading ahead, a 'user defined' bit would fit the bill just fine.]

When Vladimir authored a "persistence" feature for bitmaps to be stored alongside QCOW2 files, we had difficulty describing exactly what a dirty bit meant for the data -- ultimately it is reliant on external information. The meaning is simply but ambiguously, "The data associated with this bit has been changed since the last time the bit was reset."

We don't record or stipulate the conditions for a reset.

(for our purposes, a reset would occur during a full or incremental backup.)

4. Q: Should not get_{allocated,dirty} be separate commands?
   cons: Two commands with almost same semantic and similar means?
   pros: However here is a good point of separating clearly defined and native
         for block devices GET_BLOCK_STATUS from user-driven and actually
         undefined data, called 'dirtyness'.

Yeah, having them separate commands might be a bad idea indeed.

5. Number of status descriptors, sent by server, should be restricted
   variants:
   1: just allow server to restrict this as it wants (which was done in v3)
   2: (not excluding 1). Client specifies somehow the maximum for number
      of descriptors.
      2.1: add command flag, which will request only one descriptor
           (otherwise, no restrictions from the client)
      2.2: again, introduce extended nbd requests, and add field to
           specify this maximum

I think having a flag which requests just one descriptor can be useful,
but I'm hesitant to add it unless it's actually going to be used; so in
other words, I'll leave the decision on that bit to you.

6. A: What to do with unspecified flags (in request/reply)?
   I think the normal variant is to make them reserved. (Server should
   return EINVAL if found unknown bits, client should consider replay
   with unknown bits as an error)

Right, probably best to do that, yes.

======

Also, an idea on 2-4:

    As we say, that dirtiness is unknown for NBD, and external tool
    should specify, manage and understand, which data is actually
    transmitted, why not just call it user_data and leave status field
    of reply chunk unspecified in this case?

    So, I propose one flag for NBD_CMD_BLOCK_STATUS:
    NBD_FLAG_STATUS_USER. If it is clear, than behaviour is defined by
    Eric's 'Block provisioning status' paragraph.  If it is set, we just
    leave status field for some external... protocol? Who knows, what is
    this user data.

Yes, this sounds like a reasonable approach.


I'd be pretty happy (personally) with some user defined bits. Leaves a lot of the ambiguousness of exactly what we're trying to convey away from the NBD spec, which is nice.

    Note: I'm not sure, that I like this (my) proposal. It's just an
    idea, may be someone like it.  And, I think, it represents what we
    are trying to do more honestly.

Indeed.

    Note2: the next step of generalization will be NBD_CMD_USER, with
    variable request size, structured reply and no definition :)

Well, er, no please, if we can avoid it :-)

Another idea, about backups themselves:

    Why do we need allocated/zero status for backup? IMHO we don't.

Well, I've been thinking so all along, but then I don't really know what
it is, in detail, that you want to do :-)

I can understand a "has this changed since time X" request, which the
"dirty" thing seems to want to be. Whether something is allocated is
just a special case of that.

Actually, come to think of that. What is the exact use case for this
thing? I understand you're trying to create incremental backups of
things, which would imply you don't write from the client that is
getting the block status thingies, right? If so, how about:


Essentially you can create a bitmap object in-memory in QEMU and then associate it with a drive. It records changes to the drive, and QEMU can be instructed to write out any changes since the last backup to disk, while clearing the bits of the bitmap.

It doesn't record a specific point in time, it's just implicit to the last time you cleared the bitmap -- usually the last incremental or full backup you've made.

So it does describe "blocks changed since time X," it's just that we don't really know exactly when time X was.

This is all well and dandy, except there is some desire from third parties to be able to ask QEMU about this dirty block information -- to be able to see this bitmap, more or less. We already use NBD for exporting data, and instead of inventing a new side-band, we decided that we wanted to use NBD to let external users get this information.

What exactly they do with that info is beyond QEMU's scope.

- NBD_OPT_GET_SNAPSHOTS (during negotiation): returns a list of
  snapshots. Not required, optional, includes a machine-readable form,
  not defined by NBD, which explains what the snapshot is about (e.g., a
  qemu json file). The "base" version of that is just "allocation
  status", and is implied (i.e., you don't need to run
  NBD_OPT_GET_SNAPSHOTS if you're not interested in anything but the
  allocation status).

We don't necessarily have /snapshots/ per se, but we do have some block device and one or more bitmaps describing deltas to one or more backups that we do not necessarily have access to.

e.g. a given bitmap may describe a delta to an off-site backup. We know the delta, but do not maintain any meaningful handle to the given off-site backup, including name, URI, or date.

QEMU leaves this association up to upper management, and provides only IDs to facilitate any additional correlation.

We could offer up descriptions of these bitmaps in response to such a command, but they're not ... quite snapshots. It is more the case that we use them to create snapshots.

- NBD_CMD_BLOCK_STATUS (during transmission), returns block descriptors
  which tell you what the status of a block of data is for each of the
  relevant snapshots that we know about.

Perhaps this is somewhat overengineered, but it does bring most of the
definition of what a snapshot is back into the NBD protocol, without
having to say "this could be anything", and without requiring
connectivity over two ports for this to be useful (e.g., you could store
the machine-readable form of the snapshot description into your backup
program and match what they mean with what you're interested in at
restore time, etc).

This wouldn't work if you're interested in new snapshots that get
created once we've already moved into transmission, but hey.

Thoughts?

    Full backup: just do structured read - it will show us, which chunks
    may be treaded as zeroes.

Right.

    Incremental backup: get dirty bitmap (somehow, for example through
    user-defined part of proposed command), than, for dirty blocks, read
    them through structured read, so information about zero/unallocated
    areas are here.

For me all the variants above are OK. Let's finally choose something.

v2:
v1 was: https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg05574.html

Since then, we've added the STRUCTURED_REPLY extension, which
necessitates a rather larger rebase; I've also changed things
to rename the command 'NBD_CMD_BLOCK_STATUS', changed the request
modes to be determined by boolean flags (rather than by fixed
values of the 16-bit flags field), changed the reply status fields
to be bitwise-or values (with a default of 0 always being sane),
and changed the descriptor layout to drop an offset but to include
a 32-bit status so that the descriptor is nicely 8-byte aligned
without padding.

 doc/proto.md | 155 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 154 insertions(+), 1 deletion(-)

[...]

I'll commit this in a minute into a separate branch called
"extension-blockstatus", under the understanding that changes are still
required, as per above (i.e., don't assume that just because there's a
branch I'm happy with the current result ;-)

Regards


Err, I hope I haven't confused everything to heck and back now....

Thanks,
--js



reply via email to

[Prev in Thread] Current Thread [Next in Thread]