From: Vladimir Sementsov-Ogievskiy
Subject: Re: [Qemu-block] [PATCH v5 11/11] block/backup: use backup-top instead of write notifiers
Date: Mon, 28 Jan 2019 11:29:23 +0000

18.01.2019 17:56, Max Reitz wrote:
> On 29.12.18 13:20, Vladimir Sementsov-Ogievskiy wrote:
>> Drop write notifiers and use filter node instead. Changes:
>>
>> 1. copy-before-writes now handled by filter node, so, drop all
>>     is_write_notifier arguments.
>>
>> 2. we don't have intersecting requests, so their handling is dropped.
>> Instead, synchronization works as follows:
>> when backup or backup-top starts copying of some area it firstly
>> clears copy-bitmap bits, and nobody touches areas, not marked with
>> dirty bits in copy-bitmap, so there is no intersection. Also, backup
>> job copy operations are surrounded by bdrv region lock, which is
>> actually serializing request, to not interfer with guest writes and
>> not read changed data from source (before reading we clear
>> corresponding bit in copy-bitmap, so, this area is not more handled by
>> backup-top).
>>
>> 3. To sync with in-flight requests we now just drain hook node, we
>> don't need rw-lock.
>>
>> 4. After the whole backup loop (top, full, incremental modes), we need
>> to check for not copied clusters, which were under backup-top operation
>> and we skipped them, but backup-top operation failed, error returned to
>> the guest and dirty bits set back.
>>
>> 5. Don't create additional blk, use backup-top children for copy
>> operations.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <address@hidden>
>> ---
>>   block/backup.c | 285 ++++++++++++++++++++++++++-----------------------
>>   1 file changed, 149 insertions(+), 136 deletions(-)
>>
>> diff --git a/block/backup.c b/block/backup.c
>> index 88c0242b4e..e332909fb7 100644
>> --- a/block/backup.c
>> +++ b/block/backup.c
> 
> [...]
> 
>> @@ -300,21 +231,23 @@ static void backup_abort(Job *job)
>>   static void backup_clean(Job *job)
>>   {
>>       BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
>> -    assert(s->target);
>> -    blk_unref(s->target);
>> +
>> +    /* We must clean it to not crash in backup_drain. */
>>       s->target = NULL;
> 
> Why not set s->source to NULL along with it?  It makes sense if you're
> going to drop the backup-top node because both of these are its children.

Agreed.

> 
>>   
>>       if (s->copy_bitmap) {
>>           hbitmap_free(s->copy_bitmap);
>>           s->copy_bitmap = NULL;
>>       }
>> +
>> +    bdrv_backup_top_drop(s->backup_top);
>>   }
> 
> [...]
> 
>> @@ -386,21 +319,45 @@ static int coroutine_fn 
>> backup_run_incremental(BackupBlockJob *job)
>>       bool error_is_read;
>>       int64_t offset;
>>       HBitmapIter hbi;
>> +    void *lock = NULL;
>>   
>>       hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
>> -    while ((offset = hbitmap_iter_next(&hbi)) != -1) {
>> +    while (hbitmap_count(job->copy_bitmap)) {
>> +        offset = hbitmap_iter_next(&hbi);
>> +        if (offset == -1) {
>> +            /*
>> +             * we may have skipped some clusters, which were handled by
>> +             * backup-top, but failed and finished by returning error to
>> +             * the guest and set dirty bit back.
>> +             */
>> +            hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
>> +            offset = hbitmap_iter_next(&hbi);
>> +            assert(offset);
> 
> I think you want to assert "offset >= 0".
> 
>> +        }
>> +
>> +        lock = bdrv_co_try_lock(job->source, offset, job->cluster_size);
>> +        /*
>> +         * Dirty bit is set, which means that there are no in-flight
>> +         * write requests on this area. We must succeed.
>> +         */
>> +        assert(lock);
> 
> I'm not sure that is true right now, but more on that below in backup_run().
> 
>> +
>>           do {
>>               if (yield_and_check(job)) {
>> +                bdrv_co_unlock(lock);
>>                   return 0;
>>               }
>> -            ret = backup_do_cow(job, offset,
>> -                                job->cluster_size, &error_is_read, false);
>> +            ret = backup_do_cow(job, offset, job->cluster_size, 
>> &error_is_read);
>>               if (ret < 0 && backup_error_action(job, error_is_read, -ret) ==
>>                              BLOCK_ERROR_ACTION_REPORT)
>>               {
>> +                bdrv_co_unlock(lock);
>>                   return ret;
>>               }
>>           } while (ret < 0);
>> +
>> +        bdrv_co_unlock(lock);
>> +        lock = NULL;
> 
> This statement seems unnecessary here.
> 
>>       }
>>   
>>       return 0;
> 
> [...]
> 
>> @@ -447,26 +402,39 @@ static int coroutine_fn backup_run(Job *job, Error 
>> **errp)
>>           hbitmap_set(s->copy_bitmap, 0, s->len);
>>       }
>>   
>> -    s->before_write.notify = backup_before_write_notify;
>> -    bdrv_add_before_write_notifier(bs, &s->before_write);
>> -
>>       if (s->sync_mode == MIRROR_SYNC_MODE_NONE) {
>>           /* All bits are set in copy_bitmap to allow any cluster to be 
>> copied.
>>            * This does not actually require them to be copied. */
>>           while (!job_is_cancelled(job)) {
>> -            /* Yield until the job is cancelled.  We just let our 
>> before_write
>> -             * notify callback service CoW requests. */
>> +            /*
>> +             * Yield until the job is cancelled.  We just let our backup-top
>> +             * fileter driver service CbW requests.
> 
> *filter
> 
>> +             */
>>               job_yield(job);
>>           }
>>       } else if (s->sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) {
>>           ret = backup_run_incremental(s);
>>       } else {
> 
> [...]
> 
>> @@ -505,8 +474,20 @@ static int coroutine_fn backup_run(Job *job, Error 
>> **errp)
>>               if (alloced < 0) {
>>                   ret = alloced;
>>               } else {
>> +                if (!hbitmap_get(s->copy_bitmap, offset)) {
>> +                    trace_backup_do_cow_skip(job, offset);
>> +                    continue; /* already copied */
>> +                }
>> +                if (!lock) {
>> +                    lock = bdrv_co_try_lock(s->source, offset, 
>> s->cluster_size);
>> +                    /*
>> +                     * Dirty bit is set, which means that there are no 
>> in-flight
>> +                     * write requests on this area. We must succeed.
>> +                     */
>> +                    assert(lock);
> 
> What if I have a different parent node for the source that issues
> concurrent writes?  This used to work fine because the before_write
> notifier would still work.  After this patch, that would be broken
> because those writes would not cause a CbW.

But how could you have such a different parent node? After appending the filter,
there should be no such nodes. And I think that during backup it should be
forbidden to attach new parents to the source that bypass the filter, as that
definitely breaks what the filter does. The same applies to other block jobs
with their filters: once we have appended a filter, we don't want anyone else
to write around it.

> 
> That's not so bad because we just have to make sure that all writes go
> through the backup-top node.  That would make this assertion valid
> again, too.  But that means we cannot share PERM_WRITE; see [1].

But we don't share PERM_WRITE on the source in backup_top, only on the target.

> 
>> +                }
>>                   ret = backup_do_cow(s, offset, s->cluster_size,
>> -                                    &error_is_read, false);
>> +                                    &error_is_read);
>>               }
>>               if (ret < 0) {
>>                   /* Depending on error action, fail now or retry cluster */
>> @@ -516,17 +497,34 @@ static int coroutine_fn backup_run(Job *job, Error 
>> **errp)
>>                       break;
>>                   } else {
>>                       offset -= s->cluster_size;
>> +                    retry = true;
>>                       continue;
>>                   }
>>               }
>>           }
>> +        if (lock) {
>> +            bdrv_co_unlock(lock);
>> +            lock = NULL;
>> +        }
>> +        if (ret == 0 && !job_is_cancelled(job) &&
>> +            hbitmap_count(s->copy_bitmap))
>> +        {
>> +            /*
>> +             * we may have skipped some clusters, which were handled by
>> +             * backup-top, but failed and finished by returning error to
>> +             * the guest and set dirty bit back.
> 
> So it's a matter of a race?
> 
>> +             */
>> +            goto iteration;
>> +        }
> 
> Why not wrap everything in a do {} while (ret == 0 && !job_is...)
> instead?  Because it would mean reindenting everything?

I don't remember, but I assume so, yes. This code is "to be refactored" anyway;
I want FULL/TOP/INCREMENTAL to all go through (mostly) the same code path.
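
[The restructuring Max suggests could look like the following. A Python sketch,
not the actual C: the goto back to the top of the loop becomes an outer loop
that repeats while the copy bitmap still has bits set, picking up clusters that
a failed backup-top CbW re-dirtied after the iterator had already passed them.]

```python
# Model of the do {} while (bits remain) shape instead of a goto:
# one full iterator pass per outer iteration, repeated until clean.

def run_incremental(copy_bitmap, do_cow):
    """copy_bitmap: set of dirty cluster indices; do_cow() may re-add bits."""
    while copy_bitmap:                       # repeat while bits remain
        for cluster in sorted(copy_bitmap):  # one full iterator pass
            do_cow(cluster)
    return 0

copied = []
bitmap = {0, 1, 2}
failed_once = {1}

def do_cow(cluster):
    bitmap.discard(cluster)         # claim: clear the dirty bit first
    if cluster in failed_once:      # simulate one failed CbW
        failed_once.discard(cluster)
        bitmap.add(cluster)         # dirty bit set back; next pass retries
        return
    copied.append(cluster)

run_incremental(bitmap, do_cow)
assert sorted(copied) == [0, 1, 2]  # cluster 1 was retried on the second pass
assert not bitmap
```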

> 
>>       }
>>   
>> -    notifier_with_return_remove(&s->before_write);
>> +    /* wait pending CBW operations in backup-top */
>> +    bdrv_drain(s->backup_top);
>>   
>> -    /* wait until pending backup_do_cow() calls have completed */
>> -    qemu_co_rwlock_wrlock(&s->flush_rwlock);
>> -    qemu_co_rwlock_unlock(&s->flush_rwlock);
>> +    backup_top_progress = bdrv_backup_top_progress(s->backup_top);
>> +    job_progress_update(job, ret + backup_top_progress -
> 
> Why the "ret"?

Oops, that looks like a copy-paste bug ("ret" is reasonable in backup_do_cow()).

> 
>> +                        s->backup_top_progress);
>> +    s->backup_top_progress = backup_top_progress;
> 
> So the backup-top progress is ignored during basically all of the block
> job until it suddenly jumps to 100 % completion?  That doesn't sound ideal.
> 
> Or did you mean to put this into the for () loop of MODE_TOP/MODE_FULL?
>   (And the while() loop of MODE_NONE)


It is done in backup_do_cow(), so FULL and TOP are covered.

But you are right that MODE_NONE has a problem here. And just updating it in
the while loop would not work either, as I doubt job_yield() returns before the
job finishes or there is user interaction like pause/continue/cancel.

So it now looks better to call job_progress_update() from backup_top directly
and drop this hack.
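
[The alternative proposed above could be sketched as follows. This is an
illustrative Python model, not qemu's API: instead of the job polling
bdrv_backup_top_progress() and computing a delta, the filter pushes progress to
the job each time a CbW copy completes, so MODE_NONE progress no longer jumps
only at the end.]

```python
# Model: the filter reports progress per completed CbW copy.

class Job:
    def __init__(self):
        self.progress = 0

    def progress_update(self, n_bytes):   # stands in for job_progress_update()
        self.progress += n_bytes

class BackupTopFilter:
    def __init__(self, job, cluster_size):
        self.job = job
        self.cluster_size = cluster_size

    def cbw(self, offset):
        # ... copy the old data at @offset to the target here ...
        self.job.progress_update(self.cluster_size)  # report immediately

job = Job()
flt = BackupTopFilter(job, cluster_size=65536)
flt.cbw(0)
flt.cbw(65536)
assert job.progress == 131072   # progress grows with each CbW, not at the end
```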

> 
>>   
>>       return ret;
>>   }
>> @@ -563,6 +561,9 @@ BlockJob *backup_job_create(const char *job_id, 
>> BlockDriverState *bs,
>>       int ret;
>>       int64_t cluster_size;
>>       HBitmap *copy_bitmap = NULL;
>> +    BlockDriverState *backup_top = NULL;
>> +    uint64_t all_except_resize = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
>> +                                 BLK_PERM_WRITE_UNCHANGED | 
>> BLK_PERM_GRAPH_MOD;
> 
> BLK_PERM_ALL & ~BLK_PERM_RESIZE?
> 
> [1] But we probably do not want to share either PERM_WRITE or
> PERM_WRITE_UNCHANGED because during the duration of the backup,
> everything should go through the backup-top filter (not sure about
> PERM_WRITE_UNCHANGED right now).  Or is that something that the filter
> node should enforce in backup_top_child_perm()?

This is not the shared permission of backup_top; it is the shared permission of
the block job's common.blk, which exists only because "job->len is fixed, so we
can't allow resize", so this part is unchanged.

So yes, the problem you mean by [1] is about backup_top_child_perm(), where we
share PERM_WRITE. As the comment there describes, we must share this write
permission, otherwise we break guest writes. We share PERM_WRITE in backup_top
to force its target child to share PERM_WRITE on its backing, as the backing of
the target is the source.

But again, we share PERM_WRITE only on the target, and it is shared in the
current code too.
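
[As a quick check that Max's suggested spelling denotes the same mask: assuming
the five BLK_PERM_* bits block.h defined at the time (the concrete values below
are illustrative; only set membership matters), BLK_PERM_ALL & ~BLK_PERM_RESIZE
equals the explicit four-bit OR in the patch.]

```python
# Model of the permission bits; check the two spellings are equal.
BLK_PERM_CONSISTENT_READ = 0x01
BLK_PERM_WRITE           = 0x02
BLK_PERM_WRITE_UNCHANGED = 0x04
BLK_PERM_RESIZE          = 0x08
BLK_PERM_GRAPH_MOD       = 0x10
BLK_PERM_ALL             = 0x1f   # union of all five bits

all_except_resize = (BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
                     BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD)

assert all_except_resize == BLK_PERM_ALL & ~BLK_PERM_RESIZE
```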

> 
>>   
>>       assert(bs);
>>       assert(target);
>> @@ -655,25 +656,31 @@ BlockJob *backup_job_create(const char *job_id, 
>> BlockDriverState *bs,
>>   
>>       copy_bitmap = hbitmap_alloc(len, ctz32(cluster_size));
>>   
>> -    /* job->len is fixed, so we can't allow resize */
>> -    job = block_job_create(job_id, &backup_job_driver, txn, bs,
>> -                           BLK_PERM_CONSISTENT_READ,
>> -                           BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
>> -                           BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD,
>> -                           speed, creation_flags, cb, opaque, errp);
>> -    if (!job) {
>> +    /*
>> +     * bdrv_get_device_name will not help to find device name starting from
>> +     * @bs after backup-top append,
> 
> Why not?  Since backup-top is appended, shouldn't all parents of @bs be
> parents of @backup_top then?  (Making bdrv_get_parent_name() return the
> same result)

bdrv_get_device_name() eventually goes through role->get_name, and only the
root role has this handler. After the append, we will have the backing role
instead of root.

> 
>>                                       so let's calculate job_id before. Do
>> +     * it in the same way like block_job_create
>> +     */
>> +    if (job_id == NULL && !(creation_flags & JOB_INTERNAL)) {
>> +        job_id = bdrv_get_device_name(bs);
>> +    }
>> +
>> +    backup_top = bdrv_backup_top_append(bs, target, copy_bitmap, errp);
>> +    if (!backup_top) {
>>           goto error;
>>       }
>>   
>> -    /* The target must match the source in size, so no resize here either */
>> -    job->target = blk_new(BLK_PERM_WRITE,
>> -                          BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
>> -                          BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD);
>> -    ret = blk_insert_bs(job->target, target, errp);
>> -    if (ret < 0) {
>> +    /* job->len is fixed, so we can't allow resize */
>> +    job = block_job_create(job_id, &backup_job_driver, txn, bs, 0,
> 
> Is there a reason you dropped PERM_CONSISTENT_READ here?

Because we don't use this blk for reads now; we read through the backup_top child.

> 
>> +                           all_except_resize, speed, creation_flags,
>> +                           cb, opaque, errp);
>> +    if (!job) {
>>           goto error;
>>       }
>>   
>> +    job->source = backup_top->backing;
>> +    job->target = ((BDRVBackupTopState *)backup_top->opaque)->target;
> 
> This looks really ugly.  I think as long as the block job performs
> writes itself, it should use its own BlockBackend.

They are not BlockBackends, they are BdrvChildren. It was Kevin's idea to reuse
the filter's children in the backup job:

https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg01017.html

Hmm, and this is also why I need PERM_WRITE in backup_top: to write to the target.

> 
> Alternatively, I think it would make sense for the backup-top filter to
> offer functions that this job can use to issue writes to the target.
> Then the backup job would no longer need direct access to the target as
> a BdrvChild.
> 
>> +
>>       job->on_source_error = on_source_error;
>>       job->on_target_error = on_target_error;
>>       job->sync_mode = sync_mode;



-- 
Best regards,
Vladimir
