
Re: [PATCH for-6.1? v2 5/7] job: Add job_cancel_requested()


From: Max Reitz
Subject: Re: [PATCH for-6.1? v2 5/7] job: Add job_cancel_requested()
Date: Wed, 4 Aug 2021 16:15:03 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.11.0

On 04.08.21 12:34, Kevin Wolf wrote:
[ Peter, the question for you is at the end. ]

Am 04.08.2021 um 10:07 hat Max Reitz geschrieben:
On 03.08.21 16:25, Kevin Wolf wrote:
Am 26.07.2021 um 16:46 hat Max Reitz geschrieben:
Most callers of job_is_cancelled() actually want to know whether the job
is on its way to immediate termination.  For example, we refuse to pause
jobs that are cancelled; but this only makes sense for jobs that are
really actually cancelled.

A mirror job that is cancelled during READY with force=false should
absolutely be allowed to pause.  This "cancellation" (which is actually
a kind of completion) may take an indefinite amount of time, and so
should behave like any job during normal operation.  For example, with
on-target-error=stop, the job should stop on write errors.  (In
contrast, force-cancelled jobs should not get write errors, as they
should just terminate and not do further I/O.)

Therefore, redefine job_is_cancelled() to only return true for jobs that
are force-cancelled (which as of HEAD^ means any job that interprets the
cancellation request as a request for immediate termination), and add
job_cancel_requested() as the general variant, which returns true for any
jobs which have been requested to be cancelled, whether it be
immediately or after an arbitrarily long completion phase.

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
   include/qemu/job.h |  8 +++++++-
   block/mirror.c     | 10 ++++------
   job.c              |  7 ++++++-
   3 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 8aa90f7395..032edf3c5f 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -436,9 +436,15 @@ const char *job_type_str(const Job *job);
   /** Returns true if the job should not be visible to the management layer. */
   bool job_is_internal(Job *job);
-/** Returns whether the job is scheduled for cancellation. */
+/** Returns whether the job is being cancelled. */
   bool job_is_cancelled(Job *job);
+/**
+ * Returns whether the job is scheduled for cancellation (at an
+ * indefinite point).
+ */
+bool job_cancel_requested(Job *job);
I don't think non-force blockdev-cancel for mirror should actually be
considered cancellation, so what is the question that this function
answers?

"Is this a cancelled job, or a mirror block job that is supposed to
complete soon, but only if it doesn't switch over the users to the
target on completion"?
Well, technically yes, but it was more intended as “Has the user ever
invoked (block-)job-cancel on this job?”.
I understand this, but is this much more useful to know than "Has the
user ever called HMP 'change'?", if you know what I mean?

Hm.  Not really.  It’s still a crutch that shouldn’t be there ideally.

But I like this crutch for this series so I can get this batch done, and then worry about all the other bugs that keep popping up (and where job_cancel_requested() is a nice sign that something’s off).

Is this ever a reasonable question to ask, except maybe inside the
mirror implementation itself?
I asked myself the same for v3, but found two places in job.c where I
would like to keep it:

First, there’s an assertion in job_completed_txn_abort().  All jobs
other than @job have been force-cancelled, and so job_is_cancelled()
would be true for them.  As for @job itself, the function is mostly
called when the job’s return value is not 0, but a soft-cancelled
mirror does have a return value of 0 and so would not end up in that
function.
But job_cancel() invokes job_completed_txn_abort() if the job has been
deferred to the main loop, which mostly correlates with the job having
been completed (in which case the assertion is skipped), but not 100%
(there’s a small window between setting deferred_to_main_loop and the
job changing to a completed state).
So I’d prefer to keep the assertion as-is functionally, i.e. to only
check job->cancelled.
Well, you don't. It's still job_is_cancelled() after this patch.

No: I didn’t. O:)

For v3, I had absolutely planned to use job_cancel_requested(), and I wanted to put the above explanation into the commit message.

So the scenario you're concerned about is a job that has just finished
successfully (job->ret = 0) and then gets a cancel request?

Yes.

With force=false, I'm pretty sure the code is wrong anyway because
calling job_completed_txn_abort() is not the right response.

Absolutely possible, I just didn’t want to deal with this, too… :/

It should
return an error because you're trying to complete twice, possibly with
conflicting completion modes. Second best is just ignoring the cancel
request because we obviously already fulfilled the request of completing
the job (the completion mode might be different, though).

With force=true, arguably still letting the job fail is correct.
However, letting it fail involves more than just letting the transaction
fail. We would have to call job_update_rc() as well so that instead of
reporting success for the job, ECANCELED is returned and the job
transitions to JOB_STATUS_ABORTING (after which job_is_completed()
returns true).

So, just bugs to be fixed.

Yep.  The question is, when/where; in this series or later?

After this, I think we could even assert(job->ret != 0 ||
job->status == JOB_STATUS_PENDING) in job_completed_txn_abort().
ret == 0 can only happen when called from job_do_finalize(), when the
job is only failing because other jobs in the same transaction have
failed in their .prepare callback.

Sounds right.

Second, job_complete() refuses to let a job complete that has been
cancelled.  This function is basically only invoked by the user
(through qmp_block_job_complete()/qmp_job_complete(), or
job_complete_sync(), which comes from qemu-img), so I believe that it
should correspond to the external interface we have right now; i.e.,
if the user has invoked (block-)job-cancel at one point,
job_complete() should generally return an error.
True. But it should also return an error if the user has invoked
job-complete at some point. The distinction between complete and
non-force cancel doesn't make sense there.

Yes, that’s true. It’s just that making a double complete an error would be a change in behavior, and it would require another patch. Which is why I didn’t do it.

And cancelling with force=false should fail, too, when either job-cancel
or job-complete was called for the job before.

Yes.  At least after force=true, a subsequent force=false cancel is just a no-op, I believe.  (.force_cancel is never reset to false.)

I’d like to defer this for the design series that Vladimir is planning to write, though.

job_complete() is the only function outside of mirror that seems to use
it. But even there, it feels wrong to make a difference. Either we
accept redundant completion requests, or we don't. It doesn't really
matter how the job reconfigures the graph on completion. (Also, I feel
this should really have been part of the state machine, but I'm not sure
if we want to touch it now...)
Well, yes, I don’t think it makes a difference because I don’t think
anyone will first tell the job via block-job-cancel to complete
without pivoting, and then change their mind and call
block-job-complete after all.  (Not least because that’s an error
pre-series.)
Right, I'm just arguing that we shouldn't allow the opposite order
either. Currently I think we do, and it's buggy, as explained above.

I agree.

Also, I’m not even sure whether completing after a soft cancel request
works.  I don’t think any of our code accounts for such a case, so I’d
rather avoid allowing it if there’s no need to allow it anyway.
Yes, definitely avoid it. We should allow only one completion request
(be it with job-complete or block-job-cancel) and return an error for
all future completion requests for the same job.

We could in theory keep allowing redundant completion requests when the
completion mode doesn't conflict, but I don't see the point of that.

OK.  I personally think this need not be part of this series, though, so I’d like to defer it for now. O:)

(And since Vladimir is planning on turning soft cancel into a completion mode, I think we’re unlikely to forget about the problem.)

Unless libvirt can actually issue multiple completion requests (again,
this includes both (block-)job-complete and non-force block-job-cancel
for mirror) for the same block job - Peter, I hope it doesn't?

Kevin




