qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v5 07/10] block: introduce preallocate filter


From: Vladimir Sementsov-Ogievskiy
Subject: Re: [PATCH v5 07/10] block: introduce preallocate filter
Date: Wed, 26 Aug 2020 14:29:45 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.11.0

26.08.2020 12:58, Max Reitz wrote:
On 26.08.20 11:15, Vladimir Sementsov-Ogievskiy wrote:
26.08.2020 11:52, Max Reitz wrote:
On 26.08.20 08:49, Vladimir Sementsov-Ogievskiy wrote:
25.08.2020 18:11, Max Reitz wrote:
On 21.08.20 16:11, Vladimir Sementsov-Ogievskiy wrote:
It's intended to be inserted between format and protocol nodes to
preallocate additional space (expanding protocol file) on writes
crossing EOF. It improves performance for file-systems with slow
allocation.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
    docs/system/qemu-block-drivers.rst.inc |  26 +++
    qapi/block-core.json                   |  20 +-
    block/preallocate.c                    | 291
+++++++++++++++++++++++++
    block/Makefile.objs                    |   1 +
    4 files changed, 337 insertions(+), 1 deletion(-)
    create mode 100644 block/preallocate.c

[...]

diff --git a/block/preallocate.c b/block/preallocate.c
new file mode 100644
index 0000000000..bdf54dbd2f
--- /dev/null
+++ b/block/preallocate.c
@@ -0,0 +1,291 @@
+/*
+ * preallocate filter driver
+ *
+ * The driver performs preallocate operation: it is injected above
+ * some node, and before each write over EOF it does additional
preallocating
+ * write-zeroes request.
+ *
+ * Copyright (c) 2020 Virtuozzo International GmbH.
+ *
+ * Author:
+ *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
+ *
+ * This program is free software; you can redistribute it and/or
modify
+ * it under the terms of the GNU General Public License as
published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see
<http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+
+#include "qapi/error.h"
+#include "qemu/module.h"
+#include "qemu/option.h"
+#include "qemu/units.h"
+#include "block/block_int.h"
+
+
+typedef struct BDRVPreallocateState {
+    int64_t prealloc_size;
+    int64_t prealloc_align;
+
+    /*
+     * Filter is started as not-active, so it doesn't do any
preallocations nor
+     * requires BLK_PERM_RESIZE on its child. This is needed to
create filter
+     * above another node-child and then do bdrv_replace_node to
insert the
+     * filter.

A bit weird, but seems fair.  It’s basically just a cache for whether
this node has a writer on it or not.

Apart from the weirdness, I don’t understand the “another node-child”.
Say you have “format” -> “proto”, and then you want to insert
“prealloc”.  Wouldn’t you blockdev-add prealloc,file=proto and then
blockdev-replace format.file=prealloc?

Yes something like this. Actually, I'm about inserting the filter
automatically from block layer code, but your reasoning is about same
thing and is better.

So what is that “other node-child”?

Hmm, just my misleading wording. At least '-' in wrong place. Just
"other node child", or "child of another node"..

OK.

+     *
+     * Filter becomes active the first time it detects that its
parents have
+     * BLK_PERM_RESIZE on it.
+     * Filter becomes active forever: it doesn't lose active status
if parents
+     * lose BLK_PERM_RESIZE, otherwise we'll not be able to shrink
the file on
+     * filter close.

Oh, the file is shrunk?  That wasn’t clear from the documentation.

Hm.  Seems like useful behavior.  I just wonder if keeping the
permission around indefinitely makes sense.  Why not just truncate it
when the permission is revoked?

How? Parent is closed earlier, so on close we will not have the
permission. So, we force-keep the permission up to our close().

I thought that preallocate_child_perm() would be invoked when the parent
is detached, so we could do the truncate there, before relinquishing
preallocate’s RESIZE permission.


Hmm, I can check it. I just a bit afraid of doing something serious like
truncation in .bdrv_child_perm() handler, which doesn't seem to imply
such usage.

I’m a bit conflicted.  On one hand, I share your concern.  On the other,
I find it completely reasonable that drivers might want to do I/O when
permissions change.

Usually, this is done as part of reopen, like in qcow2 when a drive
changes from R/W to RO and caches need to be flushed.  But I actually
think it makes more sense as part of the permission system, because of
course a reopen is not the only time when permissions can change.

So that would be another alternative, to implement
.bdrv_reopen_prepare(), and to check reopen_state->perm there.  If
RESIZE is about to go away, then we truncate.  We could keep track of
whether we did any preallocations after the last such truncate, and if
we did, do another truncate on close.

This would cover reopen at least.  Seems better than nothing, but, well...

Reopen will not cover the main case: bdrv_clase_all..


(One big problem I see with truncating in .bdrv_child_perm() is that
that function is in no way committing.  It’s just a request: “If your
parents need this of you, what do you need of your children?”
But I think that could be addressed by doing it in .bdrv_set_perm()
instead.  Of note is that file-posix actually does do I/O in its
raw_set_perm() function, in that it commits to file locks.)

will try


[...]

+static int coroutine_fn
+preallocate_co_truncate(BlockDriverState *bs, int64_t offset,
+                        bool exact, PreallocMode prealloc,
+                        BdrvRequestFlags flags, Error **errp)
+{
+    BDRVPreallocateState *s = bs->opaque;
+    int ret = bdrv_co_truncate(bs->file, offset, exact, prealloc,
flags, errp);
+
+    /* s->data_end may become negative here, which means unknown
data end */
+    s->data_end = bdrv_getlength(bs->file->bs);

What would be the problem with just setting data_end = offset?  We only
use data_end to truncate down on close, so basically repeat the
bdrv_co_truncate() call here, which seems correct.

We can use offset only if ret >= 0 && exact is true..

Well, on close, we ignore any errors from truncate() anyway.  So even if
we used exact=false here, it shouldn’t matter.

Here we handle truncate from user. If I understand "exact" correctly the
following is possible:

1. parent does truncate 1M -> 2M, exact=false
2. driver decides to truncate to 5M (exact condition is sutisfied)
3. but we remember s->data_end = 2M, and on close will shrink to 2M

In theory, yes; in practice, exact=false just means to ignore failures
due to unsupported sizes.  So in this example, the driver would have
resized to 5M because 2M is an impossible size, and thus the shrink on
close would just fail.

Doesn't seem a real problem.. So, we just consider the preallocation
done by driver as our preallocation to be dropped on close().

Could there be problems on user shrink?

1. parent does truncate 2M -> 1M, exact=false
2. driver decides to do notihng (why not)
3. on close() we'll shrink to 1M..

Again, seems no real problems.

Same as above, in practice the only difference between exact=false and
exact=true is how failures are reported.

But in case when bdrv_co_truncate failed, we should call bdrv_getlength
anyway, as we don't know the result of failed truncation. Or we can just
set s->data_end = -1, and not care too much about fail-scenarios.

So, probably we can do
s->data_end = ret < 0 : ret : offset;

But I just feel safer with additional bdrv_getlength().

Yes, let’s just leave it as you wrote it.

(It’s a bit silly of me to argue based on how drivers handle exact=false
in practice right now.  It can always change.  Also, I shouldn’t pretend
calling bdrv_getlength() would be an issue whatsoever, because it just
isn’t.)

Max



--
Best regards,
Vladimir



reply via email to

[Prev in Thread] Current Thread [Next in Thread]