Re: [PATCH v4 1/3] job: take each job's lock individually in job_txn_apply


From: Stefan Reiter
Subject: Re: [PATCH v4 1/3] job: take each job's lock individually in job_txn_apply
Date: Thu, 2 Apr 2020 17:04:54 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.6.0

On 02/04/2020 14:33, Max Reitz wrote:
On 01.04.20 10:15, Stefan Reiter wrote:
All callers of job_txn_apply hold a single job's lock, but different
jobs within a transaction can have different contexts, thus we need to
lock each one individually before applying the callback function.

Similar to job_completed_txn_abort this also requires releasing the
caller's context before and reacquiring it after to avoid recursive
locks which might break AIO_WAIT_WHILE in the callback.

This also brings to light a different issue: when a callback function in
job_txn_apply moves its job to a different AIO context, job_exit will
try to release the wrong lock (now that we re-acquire the lock
correctly; previously it would just continue with the old lock, leaving
the job unlocked for the rest of the codepath back to job_exit). Fix
this by not caching the job's context in job_exit and adding a comment
about why this is done.

One test needed adapting: since it calls job_finalize directly, it needs
to acquire the correct context manually.
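
As a rough, hypothetical sketch (the actual tests/test-blockjob.c hunk is
not quoted here), the adaptation amounts to taking the job's context
around the direct call:

  /* Sketch only: acquire the job's AioContext before finalizing it
   * directly, mirroring what the real callers of job_finalize do now. */
  aio_context_acquire(job->aio_context);
  job_finalize(job, &error_abort);
  aio_context_release(job->aio_context);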

Signed-off-by: Stefan Reiter <address@hidden>
---
  job.c                 | 48 ++++++++++++++++++++++++++++++++++---------
  tests/test-blockjob.c |  2 ++
  2 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/job.c b/job.c
index 134a07b92e..5fbaaabf78 100644
--- a/job.c
+++ b/job.c
@@ -136,17 +136,36 @@ static void job_txn_del_job(Job *job)
      }
  }
-static int job_txn_apply(JobTxn *txn, int fn(Job *))
+static int job_txn_apply(Job *job, int fn(Job *))
  {
-    Job *job, *next;
+    AioContext *inner_ctx;
+    Job *other_job, *next;
+    JobTxn *txn = job->txn;
      int rc = 0;
-    QLIST_FOREACH_SAFE(job, &txn->jobs, txn_list, next) {
-        rc = fn(job);
+    /*
+     * Similar to job_completed_txn_abort, we take each job's lock before
+     * applying fn, but since we assume that outer_ctx is held by the caller,
+     * we need to release it here to avoid holding the lock twice - which would
+     * break AIO_WAIT_WHILE from within fn.
+     */
+    aio_context_release(job->aio_context);

Hm, is it safe to drop the lock here and then reacquire it later?  I.e.,
is the job in a consistent state in between?  I don’t know.  Looks like
it.  Maybe someone else knows better.


I would have said so too, but the iotest segfaults Kevin discovered (I can reproduce them after a dozen or so cycles) seem to be triggered by some kind of race, which I can only imagine is caused by it not being safe to drop the lock.

It can be resolved by doing a job_ref/unref in job_txn_apply, but that might only be fixing the symptoms and ignoring the underlying problem.
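
Roughly what I have in mind (untested sketch, not the posted patch; it only adds the ref/unref around the v4 logic quoted above):

  static int job_txn_apply(Job *job, int fn(Job *))
  {
      AioContext *inner_ctx;
      Job *other_job, *next;
      JobTxn *txn = job->txn;
      int rc = 0;

      /* Pin the job so it cannot go away while its context lock is dropped */
      job_ref(job);
      aio_context_release(job->aio_context);

      QLIST_FOREACH_SAFE(other_job, &txn->jobs, txn_list, next) {
          inner_ctx = other_job->aio_context;
          aio_context_acquire(inner_ctx);
          rc = fn(other_job);
          aio_context_release(inner_ctx);
          if (rc) {
              break;
          }
      }

      /* job->aio_context may have been changed by fn(), so re-read it */
      aio_context_acquire(job->aio_context);
      job_unref(job);

      return rc;
  }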

(I find the job code rather confusing.)


That seems to be common around here ;)

+
+    QLIST_FOREACH_SAFE(other_job, &txn->jobs, txn_list, next) {
+        inner_ctx = other_job->aio_context;
+        aio_context_acquire(inner_ctx);

Ignoring the fact that fn() can change a job's lock, one idea I had was to do a

  if (inner_ctx != job->aio_context) {
      aio_context_acquire(inner_ctx);
  }

instead of releasing the lock beforehand.
However, that gets complicated when the job's context does change in fn.
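
Spelled out, that would be something like this (sketch only; the result of the comparison has to be remembered, since job->aio_context may no longer be the same after fn()):

  QLIST_FOREACH_SAFE(other_job, &txn->jobs, txn_list, next) {
      inner_ctx = other_job->aio_context;
      /* remember the decision: job->aio_context may change inside fn() */
      bool locked = (inner_ctx != job->aio_context);

      if (locked) {
          aio_context_acquire(inner_ctx);
      }
      rc = fn(other_job);
      if (locked) {
          aio_context_release(inner_ctx);
      }
      if (rc) {
          break;
      }
  }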

+        rc = fn(other_job);
+        aio_context_release(inner_ctx);
          if (rc) {
              break;
          }
      }
+
+    /*
+     * Note that job->aio_context might have been changed by calling fn, so we
+     * can't use a local variable to cache it.
+     */
+    aio_context_acquire(job->aio_context);

But all callers of job_txn_apply() (except job_exit(), which you fix in
this patch) cache it.  Won’t they release the wrong context then?  Or is
a context change only possible when job_txn_apply() is called from
job_exit()?


You're right that it can probably change for other callers as well (though at least it doesn't seem to currently, since no other test cases fail?). Looking at the analysis Vladimir did [0], there are a few places where that would need fixing.

The issue is that all of these places would also need the job_ref/unref part AFAICT, which would make this a bit unwieldy...

[0] https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07867.html
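
For illustration, the fragile pattern in those callers is roughly this (hypothetical simplification, not code from the patch):

  AioContext *ctx = job->aio_context;   /* cached before the call */

  aio_context_acquire(ctx);
  job_txn_apply(job, fn);               /* fn may move the job elsewhere */
  aio_context_release(ctx);             /* now possibly the wrong context */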

Max




