From: Vladimir Sementsov-Ogievskiy
Subject: Re: [Qemu-block] [PATCH v5 11/11] block/backup: use backup-top instead of write notifiers
Date: Mon, 28 Jan 2019 11:29:23 +0000
18.01.2019 17:56, Max Reitz wrote:
> On 29.12.18 13:20, Vladimir Sementsov-Ogievskiy wrote:
>> Drop write notifiers and use filter node instead. Changes:
>>
>> 1. copy-before-writes now handled by filter node, so, drop all
>> is_write_notifier arguments.
>>
>> 2. we don't have intersecting requests, so their handling is dropped.
>> Instead, synchronization works as follows:
>> when backup or backup-top starts copying of some area it firstly
>> clears copy-bitmap bits, and nobody touches areas, not marked with
>> dirty bits in copy-bitmap, so there is no intersection. Also, backup
>> job copy operations are surrounded by bdrv region lock, which is
>> actually serializing request, to not interfer with guest writes and
>> not read changed data from source (before reading we clear
>> corresponding bit in copy-bitmap, so, this area is not more handled by
>> backup-top).
>>
>> 3. To sync with in-flight requests we now just drain hook node, we
>> don't need rw-lock.
>>
>> 4. After the whole backup loop (top, full, incremental modes), we need
>> to check for not copied clusters, which were under backup-top operation
>> and we skipped them, but backup-top operation failed, error returned to
>> the guest and dirty bits set back.
>>
>> 5. Don't create additional blk, use backup-top children for copy
>> operations.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <address@hidden>
>> ---
>> block/backup.c | 285 ++++++++++++++++++++++++++-----------------------
>> 1 file changed, 149 insertions(+), 136 deletions(-)
>>
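(As an aside, here is a rough Python model of the synchronization scheme from point 2 above, to make the "clear the bit first, so there is no intersection" idea concrete. This is not QEMU code; all names are invented for illustration.)

```python
# Toy model: both the backup job and the copy-before-write filter
# claim a cluster by clearing its dirty bit before touching it, so
# their copy operations can never intersect.

class CopyBitmap:
    def __init__(self, nb_clusters):
        self.dirty = set(range(nb_clusters))  # all clusters still need copying

    def test_and_clear(self, cluster):
        """Claim a cluster for copying; False if already copied."""
        if cluster in self.dirty:
            self.dirty.remove(cluster)
            return True
        return False

class BackupModel:
    def __init__(self, source):
        self.source = list(source)          # guest-visible data
        self.target = [None] * len(source)  # backup destination
        self.bitmap = CopyBitmap(len(source))

    def job_copy(self, cluster):
        # Backup job path: clear the bit first, then copy.
        if self.bitmap.test_and_clear(cluster):
            self.target[cluster] = self.source[cluster]

    def guest_write(self, cluster, data):
        # Filter (backup-top) path: copy-before-write for dirty
        # clusters, then let the guest write through.  Clean clusters
        # pass through untouched.
        if self.bitmap.test_and_clear(cluster):
            self.target[cluster] = self.source[cluster]
        self.source[cluster] = data
```

A guest write to a dirty cluster preserves the old data in the target; a later job pass over the same cluster sees the bit cleared and skips it.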
>> diff --git a/block/backup.c b/block/backup.c
>> index 88c0242b4e..e332909fb7 100644
>> --- a/block/backup.c
>> +++ b/block/backup.c
>
> [...]
>
>> @@ -300,21 +231,23 @@ static void backup_abort(Job *job)
>> static void backup_clean(Job *job)
>> {
>> BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
>> - assert(s->target);
>> - blk_unref(s->target);
>> +
>> + /* We must clean it to not crash in backup_drain. */
>> s->target = NULL;
>
> Why not set s->source to NULL along with it? It makes sense if you're
> going to drop the backup-top node because both of these are its children.
Agreed.
>
>>
>> if (s->copy_bitmap) {
>> hbitmap_free(s->copy_bitmap);
>> s->copy_bitmap = NULL;
>> }
>> +
>> + bdrv_backup_top_drop(s->backup_top);
>> }
>
> [...]
>
>> @@ -386,21 +319,45 @@ static int coroutine_fn
>> backup_run_incremental(BackupBlockJob *job)
>> bool error_is_read;
>> int64_t offset;
>> HBitmapIter hbi;
>> + void *lock = NULL;
>>
>> hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
>> - while ((offset = hbitmap_iter_next(&hbi)) != -1) {
>> + while (hbitmap_count(job->copy_bitmap)) {
>> + offset = hbitmap_iter_next(&hbi);
>> + if (offset == -1) {
>> + /*
>> + * we may have skipped some clusters, which were handled by
>> + * backup-top, but failed and finished by returning error to
>> + * the guest and set dirty bit back.
>> + */
>> + hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
>> + offset = hbitmap_iter_next(&hbi);
>> + assert(offset);
>
> I think you want to assert "offset >= 0".
>
>> + }
>> +
>> + lock = bdrv_co_try_lock(job->source, offset, job->cluster_size);
>> + /*
>> + * Dirty bit is set, which means that there are no in-flight
>> + * write requests on this area. We must succeed.
>> + */
>> + assert(lock);
>
> I'm not sure that is true right now, but more on that below in backup_run().
>
>> +
>> do {
>> if (yield_and_check(job)) {
>> + bdrv_co_unlock(lock);
>> return 0;
>> }
>> - ret = backup_do_cow(job, offset,
>> - job->cluster_size, &error_is_read, false);
>> + ret = backup_do_cow(job, offset, job->cluster_size,
>> &error_is_read);
>> if (ret < 0 && backup_error_action(job, error_is_read, -ret) ==
>> BLOCK_ERROR_ACTION_REPORT)
>> {
>> + bdrv_co_unlock(lock);
>> return ret;
>> }
>> } while (ret < 0);
>> +
>> + bdrv_co_unlock(lock);
>> + lock = NULL;
>
> This statement seems unnecessary here.
>
>> }
>>
>> return 0;
>
> [...]
>
>> @@ -447,26 +402,39 @@ static int coroutine_fn backup_run(Job *job, Error
>> **errp)
>> hbitmap_set(s->copy_bitmap, 0, s->len);
>> }
>>
>> - s->before_write.notify = backup_before_write_notify;
>> - bdrv_add_before_write_notifier(bs, &s->before_write);
>> -
>> if (s->sync_mode == MIRROR_SYNC_MODE_NONE) {
>> /* All bits are set in copy_bitmap to allow any cluster to be
>> copied.
>> * This does not actually require them to be copied. */
>> while (!job_is_cancelled(job)) {
>> - /* Yield until the job is cancelled. We just let our
>> before_write
>> - * notify callback service CoW requests. */
>> + /*
>> + * Yield until the job is cancelled. We just let our backup-top
>> + * fileter driver service CbW requests.
>
> *filter
>
>> + */
>> job_yield(job);
>> }
>> } else if (s->sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) {
>> ret = backup_run_incremental(s);
>> } else {
>
> [...]
>
>> @@ -505,8 +474,20 @@ static int coroutine_fn backup_run(Job *job, Error
>> **errp)
>> if (alloced < 0) {
>> ret = alloced;
>> } else {
>> + if (!hbitmap_get(s->copy_bitmap, offset)) {
>> + trace_backup_do_cow_skip(job, offset);
>> + continue; /* already copied */
>> + }
>> + if (!lock) {
>> + lock = bdrv_co_try_lock(s->source, offset,
>> s->cluster_size);
>> + /*
>> + * Dirty bit is set, which means that there are no
>> in-flight
>> + * write requests on this area. We must succeed.
>> + */
>> + assert(lock);
>
> What if I have a different parent node for the source that issues
> concurrent writes? This used to work fine because the before_write
> notifier would still work. After this patch, that would be broken
> because those writes would not cause a CbW.
But how could you have such a different parent node? After appending the filter,
there should be no such nodes. And I think that during backup it should be
forbidden to attach new parents to the source that bypass the filter, as that
definitely breaks what the filter does. The same applies to other block jobs
with their filters: if we have appended a filter, we don't want anyone else's
writes to bypass it.
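To illustrate why a parent bypassing the filter is fatal (a toy Python model, not QEMU code, all names invented): a direct write skips the copy-before-write step, so the job later backs up post-write data for a cluster whose dirty bit is still set, silently losing the point-in-time contents.

```python
# Illustrative only: a writer that bypasses the copy-before-write
# filter corrupts the backup of any still-dirty cluster.

source = ['old0', 'old1']
target = [None, None]
dirty = {0, 1}                  # clusters not yet backed up

def filtered_write(cluster, data):
    # backup-top path: copy old data away first if still dirty
    if cluster in dirty:
        dirty.discard(cluster)
        target[cluster] = source[cluster]
    source[cluster] = data

def bypassing_write(cluster, data):
    # hypothetical second parent writing directly to the source
    source[cluster] = data

def job_copy(cluster):
    # backup job path
    if cluster in dirty:
        dirty.discard(cluster)
        target[cluster] = source[cluster]

filtered_write(0, 'new0')       # target[0] preserves 'old0'
bypassing_write(1, 'new1')      # dirty bit for cluster 1 stays set...
job_copy(1)                     # ...so the job backs up 'new1', not 'old1'
```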
>
> That's not so bad because we just have to make sure that all writes go
> through the backup-top node. That would make this assertion valid
> again, too. But that means we cannot share PERM_WRITE; see [1].
But we don't share PERM_WRITE on the source in backup_top, only on the target.
>
>> + }
>> ret = backup_do_cow(s, offset, s->cluster_size,
>> - &error_is_read, false);
>> + &error_is_read);
>> }
>> if (ret < 0) {
>> /* Depending on error action, fail now or retry cluster */
>> @@ -516,17 +497,34 @@ static int coroutine_fn backup_run(Job *job, Error
>> **errp)
>> break;
>> } else {
>> offset -= s->cluster_size;
>> + retry = true;
>> continue;
>> }
>> }
>> }
>> + if (lock) {
>> + bdrv_co_unlock(lock);
>> + lock = NULL;
>> + }
>> + if (ret == 0 && !job_is_cancelled(job) &&
>> + hbitmap_count(s->copy_bitmap))
>> + {
>> + /*
>> + * we may have skipped some clusters, which were handled by
>> + * backup-top, but failed and finished by returning error to
>> + * the guest and set dirty bit back.
>
> So it's a matter of a race?
>
>> + */
>> + goto iteration;
>> + }
>
> Why not wrap everything in a do {} while (ret == 0 && !job_is...)
> instead? Because it would mean reindenting everything?
I don't remember, but I assume so. And this code is due to be refactored anyway:
I want FULL/TOP/INCREMENTAL to all go through (mostly) the same code path.
>
>> }
>>
>> - notifier_with_return_remove(&s->before_write);
>> + /* wait pending CBW operations in backup-top */
>> + bdrv_drain(s->backup_top);
>>
>> - /* wait until pending backup_do_cow() calls have completed */
>> - qemu_co_rwlock_wrlock(&s->flush_rwlock);
>> - qemu_co_rwlock_unlock(&s->flush_rwlock);
>> + backup_top_progress = bdrv_backup_top_progress(s->backup_top);
>> + job_progress_update(job, ret + backup_top_progress -
>
> Why the "ret"?
Oops, that looks like a copy-paste bug ("ret" is reasonable in backup_do_cow()).
>
>> + s->backup_top_progress);
>> + s->backup_top_progress = backup_top_progress;
>
> So the backup-top progress is ignored during basically all of the block
> job until it suddenly jumps to 100 % completion? That doesn't sound ideal.
>
> Or did you mean to put this into the for () loop of MODE_TOP/MODE_FULL?
> (And the while() loop of MODE_NONE)
It is done in backup_do_cow(), so FULL and TOP are covered.
But you are right that MODE_NONE seems to have a problem here. And just updating
it in the while loop would not work, as I doubt that job_yield() returns before
the job finishes or a user interaction such as pause/continue/cancel.
So now it looks better to call job_progress_update() from backup_top directly
and drop this hack.
>
>>
>> return ret;
>> }
>> @@ -563,6 +561,9 @@ BlockJob *backup_job_create(const char *job_id,
>> BlockDriverState *bs,
>> int ret;
>> int64_t cluster_size;
>> HBitmap *copy_bitmap = NULL;
>> + BlockDriverState *backup_top = NULL;
>> + uint64_t all_except_resize = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
>> + BLK_PERM_WRITE_UNCHANGED |
>> BLK_PERM_GRAPH_MOD;
>
> BLK_PERM_ALL & ~BLK_PERM_RESIZE?
>
> [1] But we probably do not want to share either PERM_WRITE or
> PERM_WRITE_UNCHANGED because during the duration of the backup,
> everything should go through the backup-top filter (not sure about
> PERM_WRITE_UNCHANGED right now). Or is that something that the filter
> node should enforce in backup_top_child_perm()?
It's not the shared permission of backup_top, it's the shared permission of the
block job's common.blk, which is used only for "job->len is fixed, so we can't
allow resize", so this part is not changed.
So yes, the problem you mean by [1] is about backup_top_child_perm(), where we
share PERM_WRITE. As the comment there describes, we must share this write
permission, otherwise we break guest writes. We share PERM_WRITE in backup_top
to force its target child to share PERM_WRITE on its backing file, since the
backing of the target is the source. But again, we share PERM_WRITE only on the
target, and it is shared in the current code too.
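(On Max's BLK_PERM_ALL & ~BLK_PERM_RESIZE suggestion: with the permission bits as defined in QEMU's include/block/block.h at the time, the explicit four-flag list is indeed the same mask. A quick model, mirroring those values:)

```python
# Permission bits mirroring QEMU's block.h definitions (2019 tree).
BLK_PERM_CONSISTENT_READ = 0x01
BLK_PERM_WRITE           = 0x02
BLK_PERM_WRITE_UNCHANGED = 0x04
BLK_PERM_RESIZE          = 0x08
BLK_PERM_GRAPH_MOD       = 0x10
BLK_PERM_ALL             = 0x1f

all_except_resize = (BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
                     BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD)

# Max's suggested spelling yields the identical mask:
assert all_except_resize == BLK_PERM_ALL & ~BLK_PERM_RESIZE
```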
>
>>
>> assert(bs);
>> assert(target);
>> @@ -655,25 +656,31 @@ BlockJob *backup_job_create(const char *job_id,
>> BlockDriverState *bs,
>>
>> copy_bitmap = hbitmap_alloc(len, ctz32(cluster_size));
>>
>> - /* job->len is fixed, so we can't allow resize */
>> - job = block_job_create(job_id, &backup_job_driver, txn, bs,
>> - BLK_PERM_CONSISTENT_READ,
>> - BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
>> - BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD,
>> - speed, creation_flags, cb, opaque, errp);
>> - if (!job) {
>> + /*
>> + * bdrv_get_device_name will not help to find device name starting from
>> + * @bs after backup-top append,
>
> Why not? Since backup-top is appended, shouldn't all parents of @bs be
> parents of @backup_top then? (Making bdrv_get_parent_name() return the
> same result)
bdrv_get_device_name() eventually goes through role->get_name, and only the root
role has this handler. After the append we'll have the backing role instead of root.
>
>> so let's calculate job_id before. Do
>> + * it in the same way like block_job_create
>> + */
>> + if (job_id == NULL && !(creation_flags & JOB_INTERNAL)) {
>> + job_id = bdrv_get_device_name(bs);
>> + }
>> +
>> + backup_top = bdrv_backup_top_append(bs, target, copy_bitmap, errp);
>> + if (!backup_top) {
>> goto error;
>> }
>>
>> - /* The target must match the source in size, so no resize here either */
>> - job->target = blk_new(BLK_PERM_WRITE,
>> - BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
>> - BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD);
>> - ret = blk_insert_bs(job->target, target, errp);
>> - if (ret < 0) {
>> + /* job->len is fixed, so we can't allow resize */
>> + job = block_job_create(job_id, &backup_job_driver, txn, bs, 0,
>
> Is there a reason you dropped PERM_CONSISTENT_READ here?
Because we don't use this blk for reads any more; we read through the backup_top child.
>
>> + all_except_resize, speed, creation_flags,
>> + cb, opaque, errp);
>> + if (!job) {
>> goto error;
>> }
>>
>> + job->source = backup_top->backing;
>> + job->target = ((BDRVBackupTopState *)backup_top->opaque)->target;
>
> This looks really ugly. I think as long as the block job performs
> writes itself, it should use its own BlockBackend.
They are not BlockBackends, they are BdrvChildren. It was Kevin's idea to reuse
the filter's children in the backup job:
https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg01017.html
Hmm, and this is also why I need PERM_WRITE in backup_top: to write to the target.
>
> Alternatively, I think it would make sense for the backup-top filter to
> offer functions that this job can use to issue writes to the target.
> Then the backup job would no longer need direct access to the target as
> a BdrvChild.
>
>> +
>> job->on_source_error = on_source_error;
>> job->on_target_error = on_target_error;
>> job->sync_mode = sync_mode;
--
Best regards,
Vladimir