
From: Pavel Butsykin
Subject: Re: [Qemu-block] [Qemu-devel] [PATCH v3] util/async: use atomic_mb_set in qemu_bh_cancel
Date: Wed, 8 Nov 2017 16:50:01 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0

On 08.11.2017 09:34, Sergio Lopez wrote:
Commit b7a745d added a qemu_bh_cancel call to the completion function
as an optimization to prevent it from unnecessarily rescheduling itself.

This completion function is scheduled from worker_thread, after setting
the state of a ThreadPoolElement to THREAD_DONE.


Great! We are seeing the same problem, and I was writing up my own fix
when I came across your patch :)

This was considered to be safe, as the completion function restarts the
loop just after the call to qemu_bh_cancel. But, under certain access
patterns and scheduling conditions, the loop may wrongly use a
pre-fetched elem->state value, reading it as THREAD_QUEUED, and ending
the completion function without having processed a pending TPE linked at
pool->head:

I'm not quite sure that a pre-fetch is involved in this issue, because
a pre-fetched read of an address should be invalidated by a write to
the same address on another core. In our case, the write
req->state = THREAD_DONE should invalidate the read req->state == THREAD_DONE.
I am inclined to think that this is a read being reordered with an
earlier write. That is a very real case on x86, and I don't see
anything here that would prevent it:

.text:000000000060E21E loc_60E21E: ; CODE XREF: .text:000000000060E2F4j
.text:000000000060E21E                 mov     rbx, [r12+98h]
.text:000000000060E226                 test    rbx, rbx
.text:000000000060E229                 jnz     short loc_60E238
.text:000000000060E22B                 jmp     short exit_0
.text:000000000060E22B ; ---------------------------------------------------------------------------
.text:000000000060E22D                 align 10h
.text:000000000060E230 loc_60E230: ; CODE XREF: .text:000000000060E240j
.text:000000000060E230                 test    rbp, rbp
.text:000000000060E233                 jz      short exit_0
.text:000000000060E235
.text:000000000060E235 loc_60E235: ; CODE XREF: .text:000000000060E289j
.text:000000000060E235                 mov     rbx, rbp
.text:000000000060E238
.text:000000000060E238 loc_60E238: ; CODE XREF: .text:000000000060E229j
.text:000000000060E238                 cmp     [rbx+ThreadPoolElement.state], 2 ; THREAD_DONE
.text:000000000060E23C                 mov     rbp, [rbx+ThreadPoolElement.all.link_next]
.text:000000000060E240                 jnz     short loc_60E230
.text:000000000060E242                 mov     r15d, [rbx+ThreadPoolElement.ret]
.text:000000000060E246                 mov     r13, [rbx+ThreadPoolElement.common.opaque]
.text:000000000060E24A                 nop
.text:000000000060E24B                 lea     rax, trace_events_enabled_count
.text:000000000060E252                 mov     eax, [rax]
.text:000000000060E254                 test    eax, eax
.text:000000000060E256                 mov     rax, rbp
.text:000000000060E259                 jnz     loc_60E2F9
 ...

.text:000000000060E2BC loc_60E2BC: ; CODE XREF: .text:000000000060E27Cj
.text:000000000060E2BC                 mov     rdi, [r12+8]
.text:000000000060E2C1                 call    qemu_bh_schedule
.text:000000000060E2C6                 mov     rdi, [r12]
.text:000000000060E2CA                 call    aio_context_release
.text:000000000060E2CF                 mov     esi, [rbx+44h]
.text:000000000060E2D2                 mov     rdi, [rbx+18h]
.text:000000000060E2D6                 call    qword ptr [rbx+10h]
.text:000000000060E2D9                 mov     rdi, [r12]
.text:000000000060E2DD                 call    aio_context_acquire
.text:000000000060E2E2                 mov     rdi, [r12+8]
.text:000000000060E2E7                 call    qemu_bh_cancel
.text:000000000060E2EC                 mov     rdi, rbx
.text:000000000060E2EF                 call    qemu_aio_unref
.text:000000000060E2F4                 jmp     loc_60E21E


The read (req->state == THREAD_DONE) can be reordered with the store to
bh->scheduled in qemu_bh_cancel(p->completion_bh), and then we get the same picture:

           worker thread             |            I/O thread
 ------------------------------------------------------------------------
                                     | reordered read req->state
  req->state = THREAD_DONE;          |
  qemu_bh_schedule(p->completion_bh) |
    bh->scheduled = 1;               |
                                     | qemu_bh_cancel(p->completion_bh)
                                     |   bh->scheduled = 0;
                                     | if (req->state == THREAD_DONE)
                                     |   // sees THREAD_QUEUED



For comparison, this is the picture from the patch description, which frames
the stale read as a speculative one:

           worker thread             |            I/O thread
 ------------------------------------------------------------------------
                                     | speculatively read req->state
  req->state = THREAD_DONE;          |
  qemu_bh_schedule(p->completion_bh) |
    bh->scheduled = 1;               |
                                     | qemu_bh_cancel(p->completion_bh)
                                     |   bh->scheduled = 0;
                                     | if (req->state == THREAD_DONE)
                                     |   // sees THREAD_QUEUED

The source of the misunderstanding was that qemu_bh_cancel is now being
used by the _consumer_ rather than the producer, and therefore now needs
to have acquire semantics just like e.g. aio_bh_poll.
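
In C11 terms (an analogy, not QEMU's actual atomic macros), a plain store
cannot carry acquire ordering on its own, so a consumer-side cancel has to
become a read-modify-write, or be followed by a full fence, to keep the
later elem->state load from moving above it. A minimal sketch of that idea,
with a hypothetical helper name:

#include <stdatomic.h>

/* Hypothetical consumer-side cancel (illustration only).  The
 * read-modify-write orders the later load of elem->state after the
 * clearing of scheduled; a plain relaxed store could not do that. */
static inline void bh_cancel_ordered(atomic_int *scheduled)
{
    atomic_exchange_explicit(scheduled, 0, memory_order_seq_cst);
    /* acq_rel on the exchange would already give the "acquire" part
     * argued for above; atomic_mb_set() uses the full-barrier xchg
     * form, as the excerpt from qemu/atomic.h below shows. */
}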

In some situations, if there are no other independent requests in the
same aio context that could eventually trigger the scheduling of the
completion function, the omitted TPE and all operations pending on it
will get stuck forever.

Signed-off-by: Sergio Lopez <address@hidden>
---
  util/async.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/util/async.c b/util/async.c
index 355af73ee7..0e1bd8780a 100644
--- a/util/async.c
+++ b/util/async.c
@@ -174,7 +174,7 @@ void qemu_bh_schedule(QEMUBH *bh)
   */
  void qemu_bh_cancel(QEMUBH *bh)
  {
-    bh->scheduled = 0;
+    atomic_mb_set(&bh->scheduled, 0);

But in the end, the patch looks correct. On x86, atomic_mb_set() is an xchg:

#if defined(__i386__) || defined(__x86_64__) || defined(__s390x__)
#define atomic_mb_set(ptr, i)  ((void)atomic_xchg(ptr, i))

Reads and writes cannot be reordered across locked instructions, so the
xchg should protect against this reordering.
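
Applied to the standalone model sketched earlier (still hypothetical code,
not the QEMU macro itself), the fix corresponds to turning the canceller's
plain store into an exchange; with that, the lost-wakeup combination, a
stale state read together with a cancelled BH, is no longer a permitted
outcome:

/* I/O-thread side of the model, with the fix applied: the exchange is
 * a full barrier (a locked xchg on x86), so the re-check of state can
 * no longer be satisfied ahead of the store that clears scheduled. */
static void *io_thread_fixed(void *arg)
{
    atomic_exchange_explicit(&scheduled, 0, memory_order_seq_cst);
    observed_state = atomic_load_explicit(&state, memory_order_relaxed);
    return NULL;
}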


  }
/* This func is async.The bottom half will do the delete action at the finial



