[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH 13/14] migration: Use migrate_has_error() in close_return_pat
From: |
Fabiano Rosas |
Subject: |
Re: [PATCH 13/14] migration: Use migrate_has_error() in close_return_path_on_source() |
Date: |
Wed, 14 Feb 2024 13:00:17 -0300 |
Cédric Le Goater <clg@redhat.com> writes:
> On 2/8/24 14:57, Fabiano Rosas wrote:
>> Cédric Le Goater <clg@redhat.com> writes:
>>
>>> On 2/8/24 14:07, Fabiano Rosas wrote:
>>>> Cédric Le Goater <clg@redhat.com> writes:
>>>>
>>>>> close_return_path_on_source() retrieves the migration error from the
>>>>> the QEMUFile '->to_dst_file' to know if a shutdown is required. This
>>>>> shutdown is required to exit the return-path thread. However, in
>>>>> migrate_fd_cleanup(), '->to_dst_file' is cleaned up before calling
>>>>> close_return_path_on_source() and the shutdown is never performed,
>>>>> leaving the source and destination waiting for an event to occur.
>>>>>
>>>>> Avoid relying on '->to_dst_file' and use migrate_has_error() instead.
>>>>>
>>>>> Suggested-by: Peter Xu <peterx@redhat.com>
>>>>> Signed-off-by: Cédric Le Goater <clg@redhat.com>
>>>>> ---
>>>>> migration/migration.c | 3 +--
>>>>> 1 file changed, 1 insertion(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>>> index
>>>>> d5f705ceef4c925589aa49335969672c0d761fa2..5f55af3d7624750ca416c4177781241b3e291e5d
>>>>> 100644
>>>>> --- a/migration/migration.c
>>>>> +++ b/migration/migration.c
>>>>> @@ -2372,8 +2372,7 @@ static bool
>>>>> close_return_path_on_source(MigrationState *ms)
>>>>> * cause it to unblock if it's stuck waiting for the destination.
>>>>> */
>>>>> WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
>>>>> - if (ms->to_dst_file && ms->rp_state.from_dst_file &&
>>>>> - qemu_file_get_error(ms->to_dst_file)) {
>>>>> + if (migrate_has_error(ms) && ms->rp_state.from_dst_file) {
>>>>> qemu_file_shutdown(ms->rp_state.from_dst_file);
>>>>> }
>>>>> }
>>>>
>>>> Hm, maybe Peter can help defend this, but this assumes that every
>>>> function that takes an 'f' and sets the file error also sets
>>>> migrate_set_error(). I'm not sure we have determined that, have we?
>>>
>>> How could we check all the code path ? I agree it is difficult when
>>> looking at the code :/
>>
>> It would help if the thing wasn't called 'f' for the most part of the
>> code to begin with.
>>
>> Whenever there's a file error at to_dst_file there's the chance that the
>> rp_state.from_dst_file got stuck. So we cannot ignore the file error.
>>
>> Would it work if we checked it earlier during cleanup as you did
>> previously and then set the migration error?
>
> Do you mean doing something similar to what is done in
> source_return_path_thread() ?
>
> if (qemu_file_get_error(s->to_dst_file)) {
> qemu_file_get_error_obj(s->to_dst_file, &err);
> if (err) {
> migrate_set_error(ms, err);
> error_free(err);
> ...
>
> Yes. That would be safer I think.
Yes, something like that.
I wish we could make that return path cleanup more deterministic, but
currently it's just: "if something hangs, call shutdown()". We don't
have a way to detect a hang, we just look at the file error and hope it
works.
A crucial aspect here is that calling qemu_file_shutdown() itself sets
the file error. So there's not even a guarantee that an error is
actually an error.
>
>
> Nevertheless, I am struggling to understand how qemu_file_set_error()
> and migrate_set_error() fit together. I was expecting some kind of
> synchronization routine but there isn't it seems. Are they completely
> orthogonal ? when should we use these routines and when not ?
We're trying to phase out the QEMUFile usage altogether. One thing that
is getting in the way is this dependency on the qemu_file_*_error
functions.
While we're not there yet, a good pattern is to find a
qemu_file_set|get_error() pair and replace it with
migrate_set|has_error(). Unfortunately the return path does not fit in
this, because we don't have a matching qemu_file_set_error, it could be
anywhere. As I said above, we're using that error as a heuristic for: "a
recvmsg() might be hanging".
>
> My initial goal was to modify some of the memory handlers (log_global*)
> and migration handlers to propagate errors at the QMP level and them
> report to the management layer. This is growing in something bigger
> and currently, I don't find a good approach to the problem.
>
> The last two patches of this series try to fix the return-path thread
> termination. Let's keep that for after.
I'll try to figure that out. I see you provided a reproducer.
>
> Thanks,
>
> C.
[RFC PATCH 14/14] migration: Fix return-path thread exit, Cédric Le Goater, 2024/02/07
Re: [RFC PATCH 14/14] migration: Fix return-path thread exit, Fabiano Rosas, 2024/02/08