Re: [PATCH v3 10/19] block: introduce fleecing block driver


From: Vladimir Sementsov-Ogievskiy
Subject: Re: [PATCH v3 10/19] block: introduce fleecing block driver
Date: Thu, 27 Jan 2022 18:28:53 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.2.0

21.01.2022 13:46, Vladimir Sementsov-Ogievskiy wrote:
20.01.2022 19:11, Hanna Reitz wrote:
On 22.12.21 18:40, Vladimir Sementsov-Ogievskiy wrote:
Introduce a new driver that works in pair with copy-before-write to
improve fleecing.

Without the fleecing driver, the old fleecing scheme looks as follows:

[guest]
   |
   |root
   v
[copy-before-write] -----> [temp.qcow2] <--- [nbd export]
   |                 target  |
   |file                     |backing
   v                         |
[active disk] <-------------+

With the fleecing driver, the new scheme is:

[guest]
   |
   |root
   v
[copy-before-write] -----> [fleecing] <--- [nbd export]
   |                 target  |    |
   |file                     |    |file
   v                         |    v
[active disk]<--source------+  [temp.img]

Benefits of new scheme:

1. Access control: if a remote client tries to read data that is not covered
    by the original dirty bitmap used at copy-before-write open, the client
    gets -EACCES.

2. Discard support: if the remote client does a DISCARD, this, in addition to
    discarding the data in temp.img, informs the block-copy process not to
    copy these clusters. The next read from the discarded area will return
    -EACCES. This is a significant thing: when the fleecing user has read data
    that was not yet copied to temp.img, we can avoid copying it on a further
    guest write.

3. Synchronisation between client reads and block-copy writes is more
    efficient: an intersecting block-copy write is no longer blocked during a
    client read.

4. We don't rely on the backing feature: the active disk should not be the
    backing of the temp image, so we avoid some permission-related
    difficulties, and the temp image is no longer required to support backing;
    it may be a simple raw image.

Note that for now nobody calls fleecing_drv_activate(), so the new driver is
actually unusable. That is the job of the following patch: support the
fleecing block driver in the copy-before-write filter driver.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
  qapi/block-core.json |  37 +++++-
  block/fleecing.h     |  16 +++
  block/fleecing-drv.c | 261 +++++++++++++++++++++++++++++++++++++++++++
  MAINTAINERS          |   1 +
  block/meson.build    |   1 +
  5 files changed, 315 insertions(+), 1 deletion(-)
  create mode 100644 block/fleecing-drv.c

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 6904daeacf..b47351dbac 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2917,13 +2917,14 @@
  # @blkreplay: Since 4.2
  # @compress: Since 5.0
  # @copy-before-write: Since 6.2
+# @fleecing: Since 7.0
  #
  # Since: 2.9
  ##
  { 'enum': 'BlockdevDriver',
    'data': [ 'blkdebug', 'blklogwrites', 'blkreplay', 'blkverify', 'bochs',
              'cloop', 'compress', 'copy-before-write', 'copy-on-read', 'dmg',
-            'file', 'ftp', 'ftps', 'gluster',
+            'file', 'fleecing', 'ftp', 'ftps', 'gluster',
              {'name': 'host_cdrom', 'if': 'HAVE_HOST_BLOCK_DEVICE' },
              {'name': 'host_device', 'if': 'HAVE_HOST_BLOCK_DEVICE' },
              'http', 'https', 'iscsi',
@@ -4181,6 +4182,39 @@
    'base': 'BlockdevOptionsGenericFormat',
    'data': { 'target': 'BlockdevRef', '*bitmap': 'BlockDirtyBitmap' } }
+##
+# @BlockdevOptionsFleecing:
+#
+# Driver that works in pair with copy-before-write filter to make a fleecing
+# scheme like this:
+#
+#    [guest]
+#      |
+#      |root
+#      v
+#    [copy-before-write] -----> [fleecing] <--- [nbd export]
+#      |                 target  |    |
+#      |file                     |    |file
+#      v                         |    v
+#    [active disk]<--source------+  [temp.img]

When generating docs, my sphinx doesn’t like this very much.  I don’t know 
exactly which part of it, but it complains with:

docs/../qapi/block-core.json:4190:Line block ends without a blank line.

(Line 4190 is the “@BlockdevOptionsFleecing:” line, but there is no warning if 
I remove this ASCII art.)

I usually disable docs building to not waste time.. But I should enable it at 
least once to check that I don't break it.


+#
+# The scheme works like this: on write, fleecing driver saves data to its
+# ``file`` child and remember that this data is in ``file`` child. On read
+# fleecing reads from ``file`` child if data is already stored to it and
+# otherwise it reads from ``source`` child.

I.e. it’s basically a COW format with the allocation bitmap stored as a block 
dirty bitmap.

+# In the same time, before each guest write, ``copy-before-write`` copies
+# corresponding old data  from ``active disk`` to ``fleecing`` node.
+# This way, ``fleecing`` node looks like a kind of snapshot for extenal
+# reader like NBD export.

So this description sounds like the driver is just a COW driver with an 
in-memory allocation bitmap.  But it’s actually specifically tuned for 
fleecing, because it interacts with the CBW node to prevent conflicts, and 
discard requests result in the respective areas becoming unreadable.

I find that important to mention, because if we don’t, then I’m wondering why 
this isn’t a generic “in-memory-cow” driver, and what makes it so useful for 
fleecing over any other COW driver.

(In fact, I’m asking myself all the time whether we can’t pull this driver 
apart into more generic nodes, like one in-memory-cow driver, and another 
driver managing the discard feature, and so on.  Could be done e.g. like this:


                 Guest -> copy-before-write --file--> fleecing-lock --file--> 
disk image
^        |                  ^
|      target               |
+-- cbw-child --+        |               backing
|           v                  |
NBD -> fleecing-discard --file--> in-memory-cow -----------+
                                         |
         file
           |
           v
       temp.img

Hmm, the ASCII art is broken for me.. Let me try to fix it:


                                     ┌──────────────────┐
                                     │       NBD        │
                                     └─┬────────────────┘
                                       │
                                       │ root
                                       ▼
    ┌──────────┐                     ┌──────────────────┐
    │  guest   │     ┌───────────────┤ fleecing-discard │
    └─┬────────┘     │ cbw-child     └─┬────────────────┘
      │              │                 │
      │ root         │                 │ file
      ▼              ▼                 ▼
    ┌──────────────────┐  target     ┌──────────────────┐
    │       CBW        ├────────────►│  in-memory-cow   │
    └─┬────────────────┘             └─┬───────────┬────┘
      │                                │           │
      │ file                           │           │ file
      ▼                                │           ▼
    ┌──────────────────┐     backing   │        ┌─────────────┐
    │  fleecing-lock   │◄──────────────┘        │ temp.img    │
    └─┬────────────────┘                        └─────────────┘
      │
      │ file
      ▼
    ┌──────────────────┐
    │   active-disk    │
    └──────────────────┘


I.e. fleecing-discard would handle discards (telling its cbw-child to drop 
those areas from the copy-bitmap, and forwarding discards to the in-memory-cow 
node)

, the in-memory-cow node would just be a generic implementation of COW (could 
be replaced by any other COW-implementing node, like qcow2),

Hmm, but then in-memory-cow should own the done_bitmap. But we want to use it 
for synchronization in upper layers..


and the fleecing-lock driver would prevent areas that are still being read from 
from being written to concurrently.

But we want to call fleecing_mark_done_and_wait_readers() exactly after the 
copy-before-write operation, so this call should be done in the CBW filter, not 
in fleecing-lock.

[*] upd after answering the last comment: or maybe we don't want to..


Problem is, of course, that’s very complicated, I haven’t thought this through, 
and it’s extremely questionable whether we really need this modularity.  Most 
likely not.

Yes, I'm trying to go with not too many filters.


I still feel compelled to think about such modularization, because the 
relationship between the CBW and the fleecing driver as laid out in this series 
doesn’t feel quite right to me.  They feel bolted together in a way that 
doesn’t fit in with the general design of the block layer where every node is 
basically self-contained.  I understand CBW and fleecing will need some 
communication, but I don’t (yet) like how in the next patch, the CBW driver 
looks for the fleecing driver and directly communicates with it through the 
FleecingState instead of going through the block layer, as we’d normally do 
when communicating between block nodes.

That’s why I’m trying to pick apart the functionality of the fleecing block 
driver into self-contained “atomic” nodes that perform its different 
functionalities, so that perhaps I can eventually put it back together and find 
out whether we can do better than `is_fleecing_drv(unfiltered_target)`.)

A big part of the problem is that we want to somehow bind together two filters. But 
we can't make each the child of the other, as that would be a loop. Maybe we should 
introduce a "non-child" relationship on the graph? One which would not participate 
in permission updates but only in aio-context management?

We may add a parameter for the CBW filter that points directly to the fleecing filter 
instead of "is_fleecing_drv(unfiltered_target)".. But it's just an extra argument which 
we can detect automatically.


+#
+# @source: node name of source node of fleecing scheme
+#
+# Since: 7.0
+##
+{ 'struct': 'BlockdevOptionsFleecing',
+  'base': 'BlockdevOptionsGenericFormat',
+  'data': { 'source': 'str' } }
+
  ##
  # @BlockdevOptions:
  #
@@ -4237,6 +4271,7 @@
        'copy-on-read':'BlockdevOptionsCor',
        'dmg':        'BlockdevOptionsGenericFormat',
        'file':       'BlockdevOptionsFile',
+      'fleecing':   'BlockdevOptionsFleecing',
        'ftp':        'BlockdevOptionsCurlFtp',
        'ftps':       'BlockdevOptionsCurlFtps',
        'gluster':    'BlockdevOptionsGluster',
diff --git a/block/fleecing.h b/block/fleecing.h
index fb7b2f86c4..75ad2f8b19 100644
--- a/block/fleecing.h
+++ b/block/fleecing.h
@@ -80,6 +80,9 @@
  #include "block/block-copy.h"
  #include "block/reqlist.h"
+
+/* fleecing.c */
+
  typedef struct FleecingState FleecingState;
  /*
@@ -132,4 +135,17 @@ void fleecing_discard(FleecingState *f, int64_t offset, int64_t bytes);
  void fleecing_mark_done_and_wait_readers(FleecingState *f, int64_t offset,
                                           int64_t bytes);
+
+/* fleecing-drv.c */
+
+/* Returns true if @bs->drv is fleecing block driver */
+bool is_fleecing_drv(BlockDriverState *bs);
+
+/*
+ * Normally FleecingState is created by copy-before-write filter. Then
+ * copy-before-write filter calls fleecing_drv_activate() to share FleecingState
+ * with fleecing block driver.
+ */
+void fleecing_drv_activate(BlockDriverState *bs, FleecingState *fleecing);
+
  #endif /* FLEECING_H */
diff --git a/block/fleecing-drv.c b/block/fleecing-drv.c
new file mode 100644
index 0000000000..202208bb03
--- /dev/null
+++ b/block/fleecing-drv.c
@@ -0,0 +1,261 @@
+/*
+ * fleecing block driver
+ *
+ * Copyright (c) 2021 Virtuozzo International GmbH.
+ *
+ * Author:
+ *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+
+#include "sysemu/block-backend.h"
+#include "qemu/cutils.h"
+#include "qapi/error.h"
+#include "block/block_int.h"
+#include "block/coroutines.h"
+#include "block/qdict.h"
+#include "block/block-copy.h"
+#include "block/reqlist.h"
+
+#include "block/copy-before-write.h"
+#include "block/fleecing.h"
+
+typedef struct BDRVFleecingState {
+    FleecingState *fleecing;
+    BdrvChild *source;
+} BDRVFleecingState;
+
+static coroutine_fn int fleecing_co_preadv_part(
+        BlockDriverState *bs, int64_t offset, int64_t bytes,
+        QEMUIOVector *qiov, size_t qiov_offset, BdrvRequestFlags flags)
+{
+    BDRVFleecingState *s = bs->opaque;
+    const BlockReq *req;
+    int ret;
+
+    if (!s->fleecing) {
+        /* fleecing_drv_activate() was not called */
+        return -EINVAL;

I'd rather treat a missing connection with a CBW driver as if we had an empty 
copy/access bitmap, and so return -EACCES in these places.

OK for me


+    }
+
+    /* TODO: upgrade to async loop using AioTask */
+    while (bytes) {
+        int64_t cur_bytes;
+
+        ret = fleecing_read_lock(s->fleecing, offset, bytes, &req, &cur_bytes);
+        if (ret < 0) {
+            return ret;
+        }
+
+        if (req) {
+            ret = bdrv_co_preadv_part(s->source, offset, cur_bytes,
+                                      qiov, qiov_offset, flags);
+            fleecing_read_unlock(s->fleecing, req);
+        } else {
+            ret = bdrv_co_preadv_part(bs->file, offset, cur_bytes,
+                                      qiov, qiov_offset, flags);
+        }
+        if (ret < 0) {
+            return ret;
+        }
+
+        bytes -= cur_bytes;
+        offset += cur_bytes;
+        qiov_offset += cur_bytes;
+    }
+
+    return 0;
+}
+
+static int coroutine_fn fleecing_co_block_status(BlockDriverState *bs,
+                                                 bool want_zero, int64_t offset,
+                                                 int64_t bytes, int64_t *pnum,
+                                                 int64_t *map,
+                                                 BlockDriverState **file)
+{
+    BDRVFleecingState *s = bs->opaque;
+    const BlockReq *req = NULL;
+    int ret;
+    int64_t cur_bytes;
+
+    if (!s->fleecing) {
+        /* fleecing_drv_activate() was not called */
+        return -EINVAL;
+    }
+
+    ret = fleecing_read_lock(s->fleecing, offset, bytes, &req, &cur_bytes);
+    if (ret < 0) {
+        return ret;
+    }
+
+    *pnum = cur_bytes;
+    *map = offset;
+
+    if (req) {
+        *file = s->source->bs;
+        fleecing_read_unlock(s->fleecing, req);
+    } else {
+        *file = bs->file->bs;
+    }
+
+    return ret;

Is ret == 0 the right return value here?

Hmm, yes, that looks strange; it should be some combination of flags.


+}
+
+static int coroutine_fn fleecing_co_pdiscard(BlockDriverState *bs,
+                                             int64_t offset, int64_t bytes)
+{
+    BDRVFleecingState *s = bs->opaque;
+    if (!s->fleecing) {
+        /* fleecing_drv_activate() was not called */
+        return -EINVAL;
+    }
+
+    fleecing_discard(s->fleecing, offset, bytes);
+
+    bdrv_co_pdiscard(bs->file, offset, bytes);
+
+    /*
+     * Ignore bdrv_co_pdiscard() result: fleecing_discard() succeeded, that
+     * means that next read from this area will fail with -EACCES. More correct
+     * to report success now.
+     */

I don’t know.  I’m asking myself why the caller in turn would care about the 
discard result (usually one doesn’t really care whether discarding succeeded or 
not), and I feel like if they care, they’d like to know that discarding the data 
from storage did fail.

Returning an error is OK too. Will change. Anyway, if an error is returned, the 
caller shouldn't rely on any assumptions.


+    return 0;
+}
+
+static int coroutine_fn fleecing_co_pwrite_zeroes(BlockDriverState *bs,
+        int64_t offset, int64_t bytes, BdrvRequestFlags flags)
+{
+    BDRVFleecingState *s = bs->opaque;
+    if (!s->fleecing) {
+        /* fleecing_drv_activate() was not called */
+        return -EINVAL;
+    }
+
+    /*
+     * TODO: implement cache, to have a chance to fleecing user to read and
+     * discard this data before actual writing to temporary image.
+     */

Is there a good reason why a cache shouldn’t be implemented as a separate block 
driver?

I don't remember. My last idea was just to implement all the features in a 
special fleecing driver. But you are right that if we see things that could be 
split into a separate small filter which makes sense by itself, it is _probably_ 
worth doing.. I'll think about it when preparing a new version, as it is hard to 
imagine the whole picture without trying to implement it.


+    return bdrv_co_pwrite_zeroes(bs->file, offset, bytes, flags);
+}
+
+static coroutine_fn int fleecing_co_pwritev(BlockDriverState *bs,
+                                            int64_t offset,
+                                            int64_t bytes,
+                                            QEMUIOVector *qiov,
+                                            BdrvRequestFlags flags)
+{
+    BDRVFleecingState *s = bs->opaque;
+    if (!s->fleecing) {
+        /* fleecing_drv_activate() was not called */
+        return -EINVAL;
+    }
+
+    /*
+     * TODO: implement cache, to have a chance to fleecing user to read and
+     * discard this data before actual writing to temporary image.
+     */
+    return bdrv_co_pwritev(bs->file, offset, bytes, qiov, flags);
+}
+
+
+static void fleecing_refresh_filename(BlockDriverState *bs)
+{
+    pstrcpy(bs->exact_filename, sizeof(bs->exact_filename),
+            bs->file->bs->filename);
+}
+
+static int fleecing_open(BlockDriverState *bs, QDict *options, int flags,
+                         Error **errp)
+{
+    BDRVFleecingState *s = bs->opaque;
+
+    bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds,
+                               BDRV_CHILD_DATA | BDRV_CHILD_PRIMARY,
+                               false, errp);
+    if (!bs->file) {
+        return -EINVAL;
+    }
+
+    s->source = bdrv_open_child(NULL, options, "source", bs, &child_of_bds,
+                               BDRV_CHILD_DATA, false, errp);
+    if (!s->source) {
+        return -EINVAL;
+    }
+
+    bs->total_sectors = bs->file->bs->total_sectors;
+
+    return 0;
+}
+
+static void fleecing_child_perm(BlockDriverState *bs, BdrvChild *c,
+                                BdrvChildRole role,
+                                BlockReopenQueue *reopen_queue,
+                                uint64_t perm, uint64_t shared,
+                                uint64_t *nperm, uint64_t *nshared)
+{
+    bdrv_default_perms(bs, c, role, reopen_queue, perm, shared, nperm, nshared);
+
+    if (role & BDRV_CHILD_PRIMARY) {
+        *nshared &= BLK_PERM_CONSISTENT_READ;
+    } else {
+        *nperm &= BLK_PERM_CONSISTENT_READ;
+
+        /*
+         * copy-before-write filter is responsible for source child and need
+         * write access to it.
+         */
+        *nshared |= BLK_PERM_WRITE;
+    }
+}
+
+BlockDriver bdrv_fleecing_drv = {
+    .format_name = "fleecing",
+    .instance_size = sizeof(BDRVFleecingState),
+
+    .bdrv_open                  = fleecing_open,
+
+    .bdrv_co_preadv_part        = fleecing_co_preadv_part,
+    .bdrv_co_pwritev            = fleecing_co_pwritev,
+    .bdrv_co_pwrite_zeroes      = fleecing_co_pwrite_zeroes,
+    .bdrv_co_pdiscard           = fleecing_co_pdiscard,
+    .bdrv_co_block_status       = fleecing_co_block_status,
+
+    .bdrv_refresh_filename      = fleecing_refresh_filename,
+
+    .bdrv_child_perm            = fleecing_child_perm,
+};
+
+bool is_fleecing_drv(BlockDriverState *bs)
+{
+    return bs && bs->drv == &bdrv_fleecing_drv;
+}

Besides the question whether the FleecingState should be part of CBW or the 
fleecing driver, I don’t like this very much.  As stated above, normally we go 
through the block layer to communicate between nodes, and this function for 
example prevents the possibility of having filters between CBW and the fleecing 
node.

Normally, I would expect a new BlockDriver method that the CBW driver would 
call to communicate with the fleecing driver.  Isn’t 
fleecing_mark_done_and_wait_readers() the only part where the CBW driver ever 
needs to tell the fleecing driver something?

Hm, actually, I wonder why we need fleecing_mark_done_and_wait_readers() to be 
called from CBW – can we not have the fleecing driver call this in its write 
implementations?  (It’s my understanding that the fleecing node is to be used 
read-only from the NBD export, besides discards.)

Interesting idea. That means we establish the guarantee: a successful write to 
the fleecing node is a point after which it will not touch this region in 
active-disk, and all in-flight reads have been awaited. Then we should propagate 
this guarantee to the block_copy() call.. Seems it should work. I'll try.
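
I.e. roughly something like this, on top of the pwritev handler from this patch
(untested sketch, with the -EACCES change discussed above):

static coroutine_fn int fleecing_co_pwritev(BlockDriverState *bs,
                                            int64_t offset,
                                            int64_t bytes,
                                            QEMUIOVector *qiov,
                                            BdrvRequestFlags flags)
{
    BDRVFleecingState *s = bs->opaque;
    int ret;

    if (!s->fleecing) {
        /* fleecing_drv_activate() was not called */
        return -EACCES;
    }

    /* Store the copied data in the temporary image first. */
    ret = bdrv_co_pwritev(bs->file, offset, bytes, qiov, flags);
    if (ret < 0) {
        return ret;
    }

    /*
     * The data is now readable from the fleecing node itself: publish it and
     * wait for in-flight fleecing reads of this area from active-disk, so that
     * the guest write may safely proceed afterwards.
     */
    fleecing_mark_done_and_wait_readers(s->fleecing, offset, bytes);

    return 0;
}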


Thanks a lot for reviewing, I now have enough material to work on v4. We'll see 
whether this all can become a bit more beautiful :)



OK, now I'm thinking. Let me think out loud.

First, about the RAM cache. We should keep it in mind. But the scheme should 
work well with the system cache used instead of a RAM cache.

Imagine the CBW write operation now:

CBW WRITE  (the copy-before-write operation, copying old unchanged data from the 
active disk to some target)

1. It should be guaranteed that the corresponding area in the "done" dirty 
bitmap is not dirty. And it should be guaranteed that we don't have intersecting 
parallel writes. This all holds for CBW writes. Should we check and assert it? 
Probably yes.

2. Do the write. We are safe, as the dirty bitmap is unset here, and all reads 
go to active-disk.

If we work only with the system cache, that's all we can do. If the write is fast, 
all is OK, and the data becomes available for reads almost immediately. But if the 
write triggers a real flush, a fleecing reader will have to read from active-disk 
during this write, even though we actually have the data in RAM. That's a bit 
inefficient. This may be solved by a ram-cache node, for which writes are always 
fast, but then we should call some "flush" operation for that region so that RAM 
usage doesn't grow endlessly.

3. The data is written, so make it available for readers:

  - mutex_lock
  - set bits in dirty bitmap
  - mutex_unlock

4. If ram-cache is in use, trigger a cache flush and wait until the cache size 
is normalized (we can't finish the cbw-write before that, otherwise RAM usage 
would grow indefinitely).

5. Before starting the actual guest write to active-disk, we should wait for all 
in-flight fleecing reads from active-disk in this area. So, wait on the reqlist.. 
[*]


Now think about the fleecing read:

FLEECING READ  (a read operation done by a fleecing user such as the NBD export)

* mutex_lock()

* check the bitmap:

if the data is available in the cache or underlying storage, we don't need any 
synchronization:

   * mutex_unlock()
   * do the read from the cache (or from the underlying storage through the
     cache)

else, we should read from active-disk, and we want a guarantee that active-disk 
will not change in this area during the read:

   * create a request in the reqlist
   * mutex_unlock()

   * do the read from active-disk

   * drop the request from the reqlist (the reqlist most probably should be
     protected by the same mutex as above... should it be a separate mutex? Or
     do we want to abuse the bitmap's mutex? I don't like to abuse anything)



OK, now let's think about how to spread all this functionality between nodes..

If we have one "fleecing" node that does everything, it's simple. It owns all 
the objects: mutex, dirty-bitmap, reqlist, ram-cache..


But what if we want to split it? The decision of where to read from and the 
creation of a request in the reqlist should be done under the mutex. So it 
should be in the in-ram-cow node. But this brings a kind of synchronization 
which is not needed for a generic in-ram-cow node.. Then, if we start to care 
about CBW/fleecing synchronization in this node, there is no reason not to do 
the [*] waiting here too. So that doesn't look like a generic in-ram-cow, but 
like a specific fleecing driver.. Which is a COW driver. But a rather specific 
one.

Could we split out the RAM cache? Seems we could. A write to it is always fast 
and may be done under the mutex. And after a write the cache size may exceed the 
maximum. So we need an API to wait until the cache size is normalized..

On the other hand, the simplest and most minimal implementation of a RAM cache 
is just a list of in-flight write requests inside the fleecing node + relying on 
the system cache. So the operations would look like this (a rough code sketch 
follows the two lists below):


CBW WRITE to FLEECING node

- mutex lock
- check that the corresponding bits in the bitmap are unset and that there are 
no intersecting write requests
- add the write request (together with the data buf, copied or stolen) to the 
write requests list
- set the corresponding bits in the bitmap
- mutex unlock
- great, starting from this point the data is already available for reads
- write the data to the underlying node (the system cache helps to almost never 
do a real write to disk)
- mutex lock
- drop the write request from the in-flight write requests list
- mutex unlock
- wait for in-flight read requests from active-disk in this area

READ from FLEECING node
- mutex lock
- if the data is in the in-flight write request list, just copy it, unlock the 
mutex and we are done
- else if the bitmap is dirty, unlock the mutex and read the data from the 
underlying temporary storage
- else
   - create an in-flight read request in the reqlist
   - mutex unlock
   - read from active-disk
   - mutex lock
   - drop the in-flight read request from the reqlist
   - mutex unlock
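
In code it would look roughly like this. Only a sketch: the FleecingState fields 
and the area_is_done()/set_done()/find_write_req() helpers are invented here for 
illustration, and the reqlist calls assume the API introduced earlier in this 
series:

/*
 * Sketch only, imagined in the context of block/fleecing.c.  Invented
 * FleecingState fields: CoMutex lock; QLIST_HEAD(, FleecingWriteReq)
 * write_reqs; BlockReqList frozen_read_reqs; BdrvChild *source (active-disk)
 * and *file (temp.img).  area_is_done(), set_done() and find_write_req() are
 * invented helpers over the "done" bitmap and the write_reqs list.  Error
 * handling and partial requests are ignored.
 */

typedef struct FleecingWriteReq {
    int64_t offset;
    int64_t bytes;
    uint8_t *buf;                        /* copied (or stolen) CBW data */
    QLIST_ENTRY(FleecingWriteReq) next;
} FleecingWriteReq;

/* CBW WRITE to the fleecing node */
static int coroutine_fn fleecing_cbw_write(FleecingState *f, int64_t offset,
                                           int64_t bytes, uint8_t *buf)
{
    FleecingWriteReq wreq = { .offset = offset, .bytes = bytes, .buf = buf };
    int ret;

    qemu_co_mutex_lock(&f->lock);
    /* CBW guarantees: the area is not copied yet, no intersecting CBW writes */
    assert(!area_is_done(f, offset, bytes));
    assert(!find_write_req(f, offset, bytes));
    QLIST_INSERT_HEAD(&f->write_reqs, &wreq, next);
    set_done(f, offset, bytes);          /* set bits in the "done" bitmap */
    qemu_co_mutex_unlock(&f->lock);

    /* From this point the data is already available for fleecing reads. */
    ret = bdrv_co_pwrite(f->file, offset, bytes, buf, 0);       /* temp.img */

    qemu_co_mutex_lock(&f->lock);
    QLIST_REMOVE(&wreq, next);
    /* Wait for in-flight fleecing reads from active-disk in this area. */
    reqlist_wait_all(&f->frozen_read_reqs, offset, bytes, &f->lock);
    qemu_co_mutex_unlock(&f->lock);

    return ret;
}

/* READ from the fleecing node (e.g. for the NBD export) */
static int coroutine_fn fleecing_read(FleecingState *f, int64_t offset,
                                      int64_t bytes, uint8_t *buf)
{
    FleecingWriteReq *wreq;
    BlockReq rreq;
    int ret;

    qemu_co_mutex_lock(&f->lock);

    wreq = find_write_req(f, offset, bytes);
    if (wreq) {
        /* The data still sits in an in-flight CBW write: just copy it. */
        memcpy(buf, wreq->buf + (offset - wreq->offset), bytes);
        qemu_co_mutex_unlock(&f->lock);
        return 0;
    }

    if (area_is_done(f, offset, bytes)) {
        /* Already copied: read it from the temporary storage. */
        qemu_co_mutex_unlock(&f->lock);
        return bdrv_co_pread(f->file, offset, bytes, buf, 0);
    }

    /* Not copied yet: read from active-disk under protection of the reqlist. */
    reqlist_init_req(&f->frozen_read_reqs, &rreq, offset, bytes);
    qemu_co_mutex_unlock(&f->lock);

    ret = bdrv_co_pread(f->source, offset, bytes, buf, 0);

    qemu_co_mutex_lock(&f->lock);
    reqlist_remove_req(&rreq);
    qemu_co_mutex_unlock(&f->lock);

    return ret;
}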


That's all about synchronization, plus a simple improvement that makes the data 
of in-flight writes available for reads.. Do we really need it? Or is the old 
synchronization based on serializing requests enough?


Another thing is the access/discard feature. It may simply be split out, so we 
finally have something like this:



                                   ┌─────────────────┐
                                   │ fleecing-access │
                                   └────┬───┬────────┘
                         cbw-child      │   │
                 ┌──────────────────────┘   │file
                 │                          │
   ┌─────────────▼───┐   target    ┌────────▼────────┐
   │ CBW             ├─────────────► fleecing-sync   │
   └────────┬────────┘             └──┬─────┬────────┘
            │                         │     │
      file  │    ┌────────────────────┘     │file
            │    │     source               │
   ┌────────▼────▼───┐             ┌────────▼────────┐
   │ active-disk     │             │ temp.img        │
   └─────────────────┘             └─────────────────┘


And in this scheme, the question becomes meaningful: is it worth the complexity? 
Or can we simply live with the old fleecing synchronization based on a qcow2 
temporary image and serializing requests, and then we have only one 
fleecing-access driver.

The only operation that fleecing-access would do on its cbw-child is some new 
special operation, like bdrv_discard_cbw, or a new flag for discard. We could 
support automatic pass-through of this new operation through filters, but I don't 
think it would be useful. So, the only reason to have two nodes is to have a 
cbw-child relation.. Whereas actually combining the discard feature and the check 
of the dirty bitmap on read into one fleecing driver seems reasonable: it becomes 
a complete fleecing driver, which has significant actions for all block 
operations: read, write, discard. And it makes sense.

So actually we want something like this:


                         cbw-friend
                 ┌──────────────────────────┐
                 │                          │
   ┌─────────────▼───┐   target    ┌────────┴────────┐
   │ CBW             ├─────────────► fleecing        │
   └────────┬────────┘             └──┬─────┬────────┘
            │                         │     │
      file  │    ┌────────────────────┘     │file
            │    │     source               │
   ┌────────▼────▼───┐             ┌────────▼────────┐
   │ active-disk     │             │ temp.img        │
   └─────────────────┘             └─────────────────┘


Here the cbw-friend link is used to do discard_cbw() operations. And keeping in 
mind that discard_cbw() is a very specific operation, there is no reason to 
support this relationship in the generic layer, so all we need is to simply keep 
a reference to the CBW node in the fleecing node, and that seems not so bad.
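
I.e. roughly something like this (a sketch; cbw_handle_fleecing_discard() is a 
made-up name for the CBW entry point that drops the area from the copy bitmap):

typedef struct BDRVFleecingState {
    FleecingState *fleecing;
    BdrvChild *source;
    BlockDriverState *cbw;  /* the "cbw-friend": a plain reference, not a child */
} BDRVFleecingState;

static int coroutine_fn fleecing_co_pdiscard(BlockDriverState *bs,
                                             int64_t offset, int64_t bytes)
{
    BDRVFleecingState *s = bs->opaque;

    /* Tell CBW to stop copying this area (made-up CBW entry point). */
    cbw_handle_fleecing_discard(s->cbw, offset, bytes);

    return bdrv_co_pdiscard(bs->file, offset, bytes);
}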

And this way, the question of whether the new synchronization and caching are 
worth the complexity goes away: there is no additional complexity for the user, 
we have only one additional node (which is needed for the discard functionality 
anyway) and a simple relationship. So for the user the whole difference is the 
link [fleecing] --source-> [active-disk] instead of 
[temp.qcow2] --backing-> [active-disk]. And we have a small improvement: a 
slightly more optimal synchronization scheme, reuse of data that is already in 
RAM, and the possibility to use a simple raw file as temp.img.


Hmm. But of course, such a relationship raises some questions.. What about 
aio-context switches? We have to hope that the "target" relationship handles 
that.

Interesting: looking at the last picture, I see that CBW and fleecing look like 
_one_ block node.. Why can't we merge them? Because guest-read and 
fleecing-client-read operations are different and should be handled differently. 
The same goes for guest-discard and fleecing-client-discard..

But we can implement such operations as separate bdrv_* operations or as flags 
for read and discard. And then the scheme will look like this:

                            ┌──────────────────┐
                            │  NBD export      │
                            └─────────┬────────┘
                                      │
                                      │root
                                      │
┌─────────────────┐         ┌─────────▼────────┐
│  Guest          │         │ fleecing-access  │
└───────┬─────────┘         └─────────┬────────┘
        │                             │
        │root                         │file
        │                             │
┌───────▼─────────────────────────────▼────────┐
│                  CBW                         │
└───────┬─────────────────────────────┬────────┘
        │                             │
        │file                         │target
        │                             │
┌───────▼─────────┐         ┌─────────▼────────┐
│   active-disk   │         │  temp.img        │
└─────────────────┘         └──────────────────┘


So, the whole logic is in the CBW driver. And the fleecing-related logic is 
realized as new bdrv handlers: .bdrv_co_fleecing_preadv() and 
.bdrv_co_fleecing_pdiscard().

And fleecing-access is a very simple driver that, on .bdrv_co_preadv(), just 
calls bdrv_co_fleecing_preadv(bs->file) and, on .bdrv_co_pdiscard(), calls 
bdrv_co_fleecing_pdiscard(bs->file).

This way the fleecing-access driver is fully independent of CBW, all relations 
in the graph are QEMU-native, and the scheme is very simple to understand.
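
Roughly, the whole fleecing-access driver could then look like this (a sketch 
only: the bdrv_co_fleecing_*() wrappers and the corresponding BlockDriver 
handlers don't exist yet, and .bdrv_open / .bdrv_child_perm are omitted):

static int coroutine_fn fleecing_access_co_preadv(BlockDriverState *bs,
                                                  int64_t offset, int64_t bytes,
                                                  QEMUIOVector *qiov,
                                                  BdrvRequestFlags flags)
{
    /* Proposed new wrapper that ends up in CBW's .bdrv_co_fleecing_preadv() */
    return bdrv_co_fleecing_preadv(bs->file, offset, bytes, qiov, flags);
}

static int coroutine_fn fleecing_access_co_pdiscard(BlockDriverState *bs,
                                                    int64_t offset,
                                                    int64_t bytes)
{
    /* Proposed new wrapper that ends up in CBW's .bdrv_co_fleecing_pdiscard() */
    return bdrv_co_fleecing_pdiscard(bs->file, offset, bytes);
}

BlockDriver bdrv_fleecing_access_drv = {
    .format_name            = "fleecing-access",
    .is_filter              = true,

    /* .bdrv_open and .bdrv_child_perm omitted in this sketch */
    .bdrv_co_preadv         = fleecing_access_co_preadv,
    .bdrv_co_pdiscard       = fleecing_access_co_pdiscard,
};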



Moreover, it reminds me of my old idea of implementing the possibility to read 
a qcow2 internal snapshot directly. It may look like this:

                             ┌──────────────────┐
                             │  NBD export      │
                             └─────────┬────────┘
                                       │
                                       │root
                                       │
 ┌─────────────────┐         ┌─────────▼────────┐
 │  Guest          │         │ fleecing-access  │
 └───────┬─────────┘         └─────────┬────────┘
         │                             │
         │root                         │file
         │                             │
 ┌───────▼─────────────────────────────▼────────┐
 │                qcow2 active disk             │
 └──────────────────────────────────────────────┘


For this to work, we implement .bdrv_co_fleecing_preadv() in qcow2, which 
directly reads data from the internal snapshot. This also means that it would be 
better to rename fleecing-access to snapshot-access, and the handlers to 
.bdrv_co_snapshot_preadv().. This way, the CBW driver just provides a kind of 
internal snapshot and the corresponding interface for it.

And at some point we can even implement a reverse-snapshot inside the qcow2 
driver, so that it works like the fleecing scheme, where CBW operations are done 
instead of COW operations, and the snapshot doesn't influence the disk 
fragmentation..


--
Best regards,
Vladimir


