
Re: [Qemu-block] [RFC v2] new, node-graph-based fleecing and backup


From: Vladimir Sementsov-Ogievskiy
Subject: Re: [Qemu-block] [RFC v2] new, node-graph-based fleecing and backup
Date: Tue, 21 Aug 2018 12:29:50 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.6.0

20.08.2018 21:30, Vladimir Sementsov-Ogievskiy wrote:
20.08.2018 20:25, Max Reitz wrote:
On 2018-08-20 16:49, Vladimir Sementsov-Ogievskiy wrote:
20.08.2018 16:32, Max Reitz wrote:
On 2018-08-20 11:42, Vladimir Sementsov-Ogievskiy wrote:
18.08.2018 00:50, Max Reitz wrote:
On 2018-08-14 19:01, Vladimir Sementsov-Ogievskiy wrote:
[...]

Proposal:

For fleecing we need two nodes:

1. fleecing hook. It's a filter which should be inserted on top of the active disk. Its main purpose is handling guest writes by a copy-on-write operation,
i.e. it's a substitution for the write-notifier in the backup job.

2. fleecing cache. It's a target node for COW operations by the fleecing hook. It also represents a point-in-time snapshot of the active disk for readers.
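
Something like the following standalone toy model may help to illustrate point 1 (plain C, not the actual QEMU driver; the cluster size, the array layout and the cow_bitmap handling are just assumptions made for this sketch):

    /* Toy model of the fleecing-hook write path: before a guest write
     * touches a cluster for the first time, the old contents are copied
     * to the fleecing cache, so readers of the cache keep seeing the
     * point-in-time state of the active disk. */
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define CLUSTER_SIZE  4096
    #define NB_CLUSTERS   8

    static uint8_t active_disk[NB_CLUSTERS][CLUSTER_SIZE];    /* guest-visible data */
    static uint8_t fleecing_cache[NB_CLUSTERS][CLUSTER_SIZE]; /* point-in-time copies */
    static uint8_t cow_bitmap[NB_CLUSTERS];                   /* 1 = cluster still needs a copy */

    static void hook_write(int cluster, const uint8_t *buf)
    {
        if (cow_bitmap[cluster]) {
            /* copy-before-write: save the old data first */
            memcpy(fleecing_cache[cluster], active_disk[cluster], CLUSTER_SIZE);
            cow_bitmap[cluster] = 0;
        }
        /* then let the guest write go through to the active disk */
        memcpy(active_disk[cluster], buf, CLUSTER_SIZE);
    }

    int main(void)
    {
        uint8_t newdata[CLUSTER_SIZE];

        memset(cow_bitmap, 1, sizeof(cow_bitmap)); /* as in the RFC: all clusters are "copy" */
        memset(newdata, 0xaa, sizeof(newdata));
        hook_write(3, newdata);
        /* the active disk has the new data, the cache keeps the old (zeroed) data */
        printf("cluster 3: disk=%02x cache=%02x\n",
               active_disk[3][0], fleecing_cache[3][0]);
        return 0;
    }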
It's not really COW, it's copy-before-write, isn't it? It's something
else entirely.  COW is about writing data to an overlay *instead* of
writing it to the backing file.  Ideally, you don't copy anything,
actually.  It's just a side effect that you need to copy things if your cluster size doesn't happen to match exactly what you're overwriting.
Hmm. I'm not against. But the COW term was already used in backup to
describe this.
Bad enough. :-)
So, we agree on the new "CBW" abbreviation? :)
It is already used for the USB mass-storage command block wrapper, but I
suppose that is sufficiently different not to cause much confusion. :-)

(Or at least that's the only other use I know of.)

[...]

2. We already have the fleecing scheme, where we have to create some subgraph
between nodes.
Yes, but how do the permissions work right now, and why wouldn't they
work with your schema?
Now it uses a backup job, with shared_perm = all for its source and target
nodes.
Uh-huh.

So the issue is...  Hm, what exactly?  The backup node probably doesn't
want to share WRITE for the source anymore, as there is no real point in
doing so.  And for the target, the only problem may be to share
CONSISTENT_READ.  It is OK to share that in the fleecing case, but in
other cases maybe it isn't.  But that's easy enough to distinguish in
the driver.

The main issue I could see is that the overlay (the fleecing target)
might not share write permissions on its backing file (the fleecing
source)...  But your diagram shows (and bdrv_format_default_perms() as
well) that this is not the case: when the overlay is writable, the
backing file may be written to, too.

Hm, actually the overlay may share write permission for clusters which are already saved in the overlay, or which are not needed (if we have a dirty bitmap for incremental backup). But we don't have such a permission kind, it does not look easy to implement, and it may be too expensive in per-operation overhead.


(ha, you can look at the picture in "[PATCH v2 0/3] block nodes
graph visualization")
:-)

3. If we move to a filter node instead of the write_notifier, the block job is not
actually needed for fleecing, and it is good to drop it from the
fleecing scheme, to simplify it, to make it more clear and transparent.
If that's possible, why not.  But again, I'm not sure whether that's
enough of a reason for the endeavour, because whether you start a block
job or do some graph manipulation yourself is not really a difference in
complexity.
not "or" but "and": in current fleecing scheme we do both graph
manipulations and block-job stat/cancel..
Hm!  Interesting.  I didn't know blockdev-backup didn't set the target's
backing file.  It makes sense, but I didn't think about it.

Well, still, my point was whether you do a blockdev-backup +
block-job-cancel, or a blockdev-add + blockdev-reopen + blockdev-reopen
+ blockdev-del...  If there is a difference, the former is going to be
simpler, probably.

(But if there are things you can't do with the current blockdev-backup,
then, well, that doesn't help you.)

Yes, I agree that there is no real benefit in complexity. I just think
that if we have a filter node which performs "CBW" operations, block-job
backup(sync=none) becomes effectively empty, it will do nothing.
On the code side, yes, that's true.

But it's mostly your call, since I suppose you'd be doing most of the work.

And finally, we will have a unified filter-node-based scheme for backup
and fleecing, modular and customisable.
[...]

Benefits, or, what can be done:

1. We can implement a special fleecing-cache filter driver, which will be a real cache: it will store some recently written clusters in RAM, it can have a backing (or file?) qcow2 child to flush some clusters to disk, etc. So, for each cluster of the active disk we will have the following characteristics:

- changed (changed in active disk since backup start)
- copy (we need this cluster for the fleecing user. For example, in the RFC patch all clusters are "copy": cow_bitmap is initialized to all ones. We can use some existing bitmap to initialize cow_bitmap, and it will provide "incremental"
fleecing (for use in incremental backup push or pull))
- cached in RAM
- cached in disk
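
For illustration only, such per-cluster state could be tracked with a small set of flags; the names below are invented for this sketch and are not from the RFC patch:

    /* Hypothetical per-cluster state of a fleecing-cache filter. */
    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    enum {
        FLEECE_CHANGED = 1 << 0, /* changed in the active disk since backup start */
        FLEECE_COPY    = 1 << 1, /* still needed by the fleecing user (cow_bitmap) */
        FLEECE_IN_RAM  = 1 << 2, /* old data currently held in the RAM cache */
        FLEECE_ON_DISK = 1 << 3, /* old data flushed to the disk cache (qcow2 child) */
    };

    typedef struct FleecingState {
        uint8_t *cluster_flags; /* one byte of flags per cluster */
        int64_t  nb_clusters;
    } FleecingState;

    /* A fleecing read must be served from the RAM or disk cache if the
     * cluster was already copied, and from the active disk otherwise. */
    static bool read_from_cache(const FleecingState *s, int64_t cluster)
    {
        return s->cluster_flags[cluster] & (FLEECE_IN_RAM | FLEECE_ON_DISK);
    }

    int main(void)
    {
        uint8_t flags[4] = { 0, FLEECE_COPY, FLEECE_COPY | FLEECE_IN_RAM, FLEECE_ON_DISK };
        FleecingState s = { .cluster_flags = flags, .nb_clusters = 4 };

        for (int64_t c = 0; c < s.nb_clusters; c++) {
            printf("cluster %d: read from %s\n", (int)c,
                   read_from_cache(&s, c) ? "cache" : "active disk");
        }
        return 0;
    }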
Would it be possible to implement such a filter driver that could just
be used as a backup target?
For internal backup we need a backup job anyway, and we will be able to
create different schemes.
One of my goals is the scheme where we store old data from CBW
operations into a local cache, when the
backup target is a remote, relatively slow NBD node. In this case, the cache
is the backup source, not the target.
Sorry, my question was badly worded.  My main point was whether you
could implement the filter driver in such a generic way that it wouldn't
depend on the fleecing-hook.
Yes, I want my filter nodes to be self-sufficient entities. However, it
may be more effective to have some shared data between them, for
example dirty bitmaps covering the drive's clusters, to know which
clusters are cached, which are changed, etc.
I suppose having global dirty bitmaps may make sense.

Judging from your answer and from the fact that you proposed calling the
filter node backup-filter and just using it for all backups, I suppose
the answer is "yes".  So that's good.

(Though I didn't quite understand why in your example the cache would be
the backup source, when the target is the slow node...)
The cache is a point-in-time view of the active disk (the actual source) for
fleecing. So, we can start a backup job to copy data from the cache to the target.
But wouldn't the cache need to be the immediate fleecing target for
this?  (And then you'd run another backup/mirror from it to copy the
whole disk to the real target.)

Yes, the cache is the immediate fleecing target.


On top of these characteristics we can implement the following features:

1. COR: we can cache clusters not only on writes but on reads too, if we have free space in the ram-cache (and if not, do not cache at all, don't write to the disk-cache). It may be done like bdrv_write(..., BDRV_REQ_UNNECESSARY)
You can do the same with backup by just putting a fast overlay between the source and the backup, if your source is so slow, and then do COR, i.e.:

slow source --> fast overlay --> COR node --> backup filter
How will we check ram-cache size to make COR optional in this scheme?
Yes, well, if you have a caching driver already, I suppose you can just
use that.

You could either write it a bit simpler to only cache on writes and then
put a COR node on top if desired; or you implement the read cache
functionality directly in the node, which may make it a bit more
complicated, but probably also faster.

(I guess you indeed want to go for faster when already writing a RAM
cache driver...)

(I don't really understand what BDRV_REQ_UNNECESSARY is supposed to do,
though.)
When we do "CBW", we _must_ save data before guest write, so, we write
this data to the cache (or directly to target, like in current approach).
When we do "COR", we _may_ save data to our ram-cache. It's safe to not
save data, as we can read it from active disk (data is not changed yet).
BDRV_REQ_UNNECESSARY is a proposed interface to write this unnecessary
data to the cache: if ram-cache is full, cache will skip this write.
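
A standalone toy model of that distinction (not the QEMU API; the cache size, the function name and the "unnecessary" parameter are assumptions made for this sketch):

    /* Toy cache write path: mandatory CBW data is always kept (spilling to a
     * slower disk-cache layer if needed), while "unnecessary" COR data is
     * simply dropped when the RAM cache is full. */
    #include <stdbool.h>
    #include <stdio.h>

    #define RAM_CACHE_CLUSTERS 4

    static int ram_cache_used;

    /* Returns true if the data ended up cached somewhere. */
    static bool cache_write(int cluster, bool unnecessary)
    {
        if (ram_cache_used >= RAM_CACHE_CLUSTERS) {
            if (unnecessary) {
                /* COR data: safe to skip, still readable from the active disk */
                printf("cluster %d dropped (RAM cache full)\n", cluster);
                return false;
            }
            /* CBW data: must be preserved somewhere, e.g. in a disk-cache child */
            printf("cluster %d spills to the disk cache\n", cluster);
            return true;
        }
        ram_cache_used++;
        printf("cluster %d cached in RAM\n", cluster);
        return true;
    }

    int main(void)
    {
        for (int c = 0; c < 8; c++) {
            cache_write(c, c % 2); /* odd clusters are COR ("unnecessary"), even are CBW */
        }
        return 0;
    }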
Hm, OK...  But deciding for each request how much priority it should get
in a potential cache node seems like an awful lot of work. Well, I
don't even know what kind of requests you would deem unnecessary.  If it
has something to do with the state of a dirty bitmap, then having global
dirty bitmaps might remove the need for such a request flag.

Yes, if we have some "shared fleecing object", accessible by the fleecing-hook filter, the fleecing-cache filter (and the backup job, if it is an internal backup), we don't need
such a flag.


[...]

Hm.  So what you want here is a special block driver or at least a
special interface that can give information to an outside tool, namely
the information you listed above.

If you want information about RAM-cached clusters, well, you can only
get that information from the RAM cache driver.  It probably would be
allocation information, do we have any way of getting that out?

It seems you can get all of that (zero information and allocation
information) over NBD.  Would that be enough?
It's the most generic and clean way, but I'm not sure that it will be
effective performance-wise.
Intuitively I'd agree, but I suppose if NBD is written right, such a
request should be very fast and the response basically just consists of
the allocation information, so I don't suspect it can be much faster
than that.

(Unless you want some form of interrupts.  I suppose NBD would be the
wrong interface, then.)

Yes, for external backup through NBD it's OK to get block status, but for internal backup it seems faster to access a shared fleecing object (or global bitmaps, etc.).

However, if we have some shared fleecing object, it's not a problem to export it as block-status metadata through an NBD export.


[...]

I need several features which are hard to implement using the current scheme.

1. The scheme where we have a local cache as the COW target and a slow remote
backup target.
How to do it now? Using two backups, one with sync=none... Not sure that
this is the right way.
If it works...

(I'd rather build simple building blocks that you can put together than
something complicated that works for a specific solution)
Exactly, I want to implement simple building blocks = filter nodes,
instead of implementing all the features in the backup job.
Good, good. :-)

3. Then, we'll need a possibility for backup(sync=none) to not COW clusters which are already copied to the backup, and so on.
Isn't that the same as 2?
We can use one bitmap for 2 and 3, and drop bits from it when the
external tool has read the corresponding cluster from the NBD fleecing export.
Oh, right, it needs to be modifiable from the outside.  I suppose that
would be possible in NBD, too.  (But I don't know exactly.)

I think it's natural to implement it through a discard operation on the fleecing-cache node: if the fleecing user discards something, it will not read it any more, and we can drop it from the cache and clear the bit in the shared bitmap.

Then we can improve it by creating a READ_ONCE flag for each READ command or for the whole connection, to discard data after each read. Or pass this flag to bdrv_read, to handle it in one command.
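
A standalone sketch of that discard path (the shared bitmap below stands in for a global dirty bitmap / shared fleecing state; the names and sizes are assumptions for illustration):

    /* Toy model: when the fleecing user discards a range it has already read,
     * the cache frees the data and clears the corresponding bits in the shared
     * bitmap, so the hook stops doing CBW for those clusters. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <stdio.h>

    #define NB_CLUSTERS 8

    static uint8_t copy_bitmap[NB_CLUSTERS]; /* shared with the fleecing hook */
    static void *cached_data[NB_CLUSTERS];   /* RAM-cache payload, if any */

    static void cache_discard(int64_t first_cluster, int64_t nb_clusters)
    {
        for (int64_t c = first_cluster; c < first_cluster + nb_clusters; c++) {
            free(cached_data[c]);  /* drop the cached old data, if present */
            cached_data[c] = NULL;
            copy_bitmap[c] = 0;    /* no further CBW needed for this cluster */
        }
    }

    int main(void)
    {
        for (int c = 0; c < NB_CLUSTERS; c++) {
            copy_bitmap[c] = 1;
        }
        cache_discard(2, 3); /* the fleecing user has finished clusters 2..4 */
        for (int c = 0; c < NB_CLUSTERS; c++) {
            printf("%d", copy_bitmap[c]);
        }
        printf("\n"); /* prints 11000111 */
        return 0;
    }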


[...]

I don't think that will be any simpler.

I mean, it would make blockdev-copy simpler, because we could
immediately replace backup by mirror, and then we just have mirror,
which would then automatically become blockdev-copy...

But it's not really going to be simpler, because whether you put the
copy-before-write logic into a dedicated block driver, or into the
backup filter driver, doesn't really make it simpler either way.  Well, adding a new driver always is a bit more complicated, so there's that.
What is the difference between a separate filter driver and the backup filter
driver?
I thought we already had a backup filter node, so you wouldn't have had
to create a new driver in that case.

But we don't, so there really is no difference.  Well, apart from being able to share state more easily when the driver is in the same file as the job.
But if we make it separate, it will be a separate "building block" to
be reused in different schemes.
Absolutely true.

It should not care about guest writes; it copies clusters from a kind of snapshot which does not change in time. This job should follow the recommendations
from the fleecing scheme [7].

What about the target?

We can use a separate node as the target and copy from the fleecing cache to the target. If we have only a ram-cache, it would be equal to the current approach (data is copied directly to the target, even on COW). If we have both ram- and disk-caches, it's a cool solution for a slow target: instead of making the guest wait for a long write to the backup target (when the ram-cache is full), we can write to the disk-cache, which is local
and fast.
Or you backup to a fast overlay over a slow target, and run a live
commit on the side.
I think it will lead to larger I/O overhead: all clusters will go through the overlay, not only the guest-written clusters which we did not have time
to copy.
Well, and it probably makes sense to have some form of RAM-cache driver.
Then that'd be your fast overlay.
But there is no reason to copy all the data through the cache: we need it
only for CBW.
Well, if there'd be a RAM-cache driver, you may use it for anything that
seems useful (I seem to remember there were some patches on the list
like three or four years ago...).

Anyway, I think it will be good if both schemes are possible.

Another option is to combine the fleecing cache and the target somehow (I haven't
really thought about this).

Finally, with one or two (three?) special filters we can implement all current fleecing/backup schemes in a unified and very configurable way, and add a lot more
cool features and possibilities.

What do you think?
I think adding a specific fleecing target filter makes sense because you gave many reasons for interesting new use cases that could emerge from that.

But I think adding a new fleecing-hook driver just means moving the
implementation from backup to that new driver.
But at the same time you say that it's OK to create a backup filter
(instead of the write_notifier) and make it insertable via QAPI? So, if I
implement it in block/backup, it's OK? Why not do it separately?
Because I thought we had it already.  But we don't.  So feel free to do
it separately. :-)
OK, that's good :). Then I'll try to reuse the filter in backup
instead of write-notifiers, and figure out whether we really need the internal
state of the backup block job or not.

Max

PS: In the background, I have unpublished work aimed at parallelizing the
backup job into several coroutines (like it is done for mirror and the qemu-img
clone cmd). And it's really hard. It creates queues of requests with
different priorities to handle CBW requests in a common pipeline; it's
mostly a rewrite of block/backup. If we split CBW from backup into a
separate filter node, backup becomes a very simple thing (copy clusters
from constant storage) and its parallelization becomes simpler.
If CBW is split from backup, maybe mirror could replace backup
immediately.  You'd fleece to a RAM cache target and then mirror from there.

Hmm, good option. It would be just one mirror iteration.
But then I'll need to teach mirror to copy clusters with some priorities, to avoid ram-cache overloading (and guest I/O hangs). It may be better to have a separate, simple (a lot simpler than mirror) block job for it, or use a backup. Anyway, it's a separate
building block; a performance comparison will show the better candidate.


(To be precise: The exact replacement would be an active mirror, so a
mirror with copy-mode=write-blocking, so it immediately writes the old
block to the target when it is changed in the source, and thus the RAM
cache could stay effectively empty.)

Hmm, or this way. So, actually, for such a thing we need a cache node which does absolutely nothing; the write will actually be handled by the mirror job. But in this case we can't control the size of the actual RAM cache: if the target is slow, we will accumulate unfinished bdrv_mirror_top_pwritev calls, which have allocated memory and are waiting in a queue to create a mirror coroutine.

Oh, sorry, no: active mirror copies data synchronously on write, so it really should be the same copy pattern as in backup.



I don't say we should throw backup away, but I have several ideas which may
alter the current approach. They may live in parallel with the current backup
path, or replace it in the future if they turn out to be more effective.
Thing is, contrary to the impression I've probably given, we do want to
throw away backup sooner or later.  We want a single block job
(blockdev-copy) that unifies mirror, backup, and commit.

(mirror already basically supersedes commit, with live commit just being
exactly mirror; the main problem is integrating backup.  But with a
fleecing node and a RAM cache target, that would suddenly be really
simple, I assume.)

((All that's missing is sync=top, where the mirror would need to not
only check its source (which would be the RAM cache), but also its
backing file; and sync=incremental, which just isn't there with mirror
at all.  OTOH, it may be possible to implement both modes simply in the
fleecing/backup node, so it only copies that respective data to the
target and the mirror simply sees nothing else.))

Good idea. If we have the fleecing-cache node as a "view" or "export", we can export only selected portions of data, marking the rest as unallocated. Or we need to share bitmaps (global bitmaps, shared fleecing state, etc.) with the block job.


Max





--
Best regards,
Vladimir



