Re: [Qemu-block] intermittent failure in test-replication


From: Kevin Wolf
Subject: Re: [Qemu-block] intermittent failure in test-replication
Date: Thu, 27 Sep 2018 13:17:30 +0200
User-agent: Mutt/1.9.1 (2017-09-22)

Am 27.09.2018 um 12:08 hat Peter Maydell geschrieben:
> Hi; I seem to be getting an intermittent failure in
> tests/test-replication:
> 
> MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}
> gtester -k --verbose -m=quick tests/test-replication
> TEST: tests/test-replication... (pid=8465)
>   /replication/primary/read:                                           OK
>   /replication/primary/write:                                          OK
>   /replication/primary/start:                                          OK
>   /replication/primary/stop:                                           OK
>   /replication/primary/do_checkpoint:                                  OK
>   /replication/primary/get_error_all:                                  OK
>   /replication/secondary/read:                                         OK
>   /replication/secondary/write:                                        OK
>   /replication/secondary/start:                                        OK
>   /replication/secondary/stop:
> qemu: qemu_mutex_unlock_impl: Operation not permitted
> FAIL
> GTester: last random seed: R02S46f564a1d3ba1513490a19cb15805626
> (pid=8735)
>   /replication/secondary/do_checkpoint:                                OK
>   /replication/secondary/get_error_all:                                OK
> FAIL: tests/test-replication
> 
> This was on an x86-64 Linux system. I think the error means "thread
> tried to unlock a mutex it does not own"...
> 
> This is probably an "only happens when the system is under heavy load"
> issue and/or a race condition.

Indeed, I could only reproduce this while building QEMU at the same time.
Anyway, I got two failures, both with the same stack trace:

(gdb) bt
#0  0x00007f51c067c9fb in raise () from /lib64/libc.so.6
#1  0x00007f51c067e77d in abort () from /lib64/libc.so.6
#2  0x0000558c9d5dde7b in error_exit (err=<optimized out>, address@hidden 
<__func__.18373> "qemu_mutex_unlock_impl") at util/qemu-thread-posix.c:36
#3  0x0000558c9d6b5263 in qemu_mutex_unlock_impl (address@hidden, 
address@hidden "util/async.c", address@hidden) at util/qemu-thread-posix.c:96
#4  0x0000558c9d6b0565 in aio_context_release (address@hidden) at 
util/async.c:516
#5  0x0000558c9d5eb3da in job_completed_txn_abort (job=0x558c9f68e640) at 
job.c:738
#6  0x0000558c9d5eb227 in job_finish_sync (job=0x558c9f68e640, address@hidden 
<job_cancel_err>, address@hidden) at job.c:986
#7  0x0000558c9d5eb8ee in job_cancel_sync (job=<optimized out>) at job.c:941
#8  0x0000558c9d64d853 in replication_close (bs=<optimized out>) at 
block/replication.c:148
#9  0x0000558c9d5e5c9f in bdrv_close (bs=0x558c9f41b020) at block.c:3420
#10 bdrv_delete (bs=0x558c9f41b020) at block.c:3629
#11 bdrv_unref (bs=0x558c9f41b020) at block.c:4685
#12 0x0000558c9d62a3f3 in blk_remove_bs (address@hidden) at 
block/block-backend.c:783
#13 0x0000558c9d62a667 in blk_delete (blk=0x558c9f42a7c0) at 
block/block-backend.c:402
#14 blk_unref (blk=0x558c9f42a7c0) at block/block-backend.c:457
#15 0x0000558c9d5dfcea in test_secondary_stop () at tests/test-replication.c:478
#16 0x00007f51c1f13178 in g_test_run_suite_internal () from 
/lib64/libglib-2.0.so.0
#17 0x00007f51c1f1337b in g_test_run_suite_internal () from 
/lib64/libglib-2.0.so.0
#18 0x00007f51c1f1337b in g_test_run_suite_internal () from 
/lib64/libglib-2.0.so.0
#19 0x00007f51c1f13552 in g_test_run_suite () from /lib64/libglib-2.0.so.0
#20 0x00007f51c1f13571 in g_test_run () from /lib64/libglib-2.0.so.0
#21 0x0000558c9d5de31f in main (argc=<optimized out>, argv=<optimized out>) at 
tests/test-replication.c:581
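
For anyone wondering where "Operation not permitted" comes from:
aio_context_acquire()/aio_context_release() operate on a recursive pthread
mutex, and for that mutex type pthread_mutex_unlock() fails with EPERM when
the calling thread does not hold the mutex (including when it is not locked
at all); qemu_mutex_unlock_impl() turns any error into the abort in frames
#3/#2. A minimal standalone sketch of that behaviour (plain pthreads, not
QEMU code):

/* Sketch: unlocking a recursive mutex from a thread that does not hold it
 * returns EPERM ("Operation not permitted"). */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

static pthread_mutex_t lock;

static void *non_owner(void *arg)
{
    /* This thread never locked 'lock', so unlock must fail. */
    int r = pthread_mutex_unlock(&lock);
    printf("unlock from non-owner: %s\n", strerror(r));   /* EPERM */
    return NULL;
}

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_t t;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&lock, &attr);

    pthread_mutex_lock(&lock);              /* owned by the main thread */
    pthread_create(&t, NULL, non_owner, NULL);
    pthread_join(t, NULL);

    pthread_mutex_unlock(&lock);
    pthread_mutex_destroy(&lock);
    pthread_mutexattr_destroy(&attr);
    return 0;
}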

Paolo, I think this is an effect of being rather inconsistent with
respect to the locking of the main AioContext (and usually not
documenting which functions expect the lock to be held and which don't).

job_cancel_sync() expects the lock to be held, but nothing above it
actually acquired it. Should blk_unref() be called only with the lock
held or only without it?
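
To spell out the two contracts that are colliding here -- a hypothetical
sketch, with declarations roughly as in job.h and block-backend.h; the
locking comments are made up for illustration, and are exactly what is not
written down today:

typedef struct Job Job;
typedef struct BlockBackend BlockBackend;

/*
 * job_cancel_sync:
 * Synchronously cancel @job.
 *
 * Locking: called with the job's AioContext lock held;
 * job_completed_txn_abort() calls aio_context_release() on that context
 * (frames #5/#4 above), which is what aborts when the lock was never taken.
 */
int job_cancel_sync(Job *job);

/*
 * blk_unref:
 * Drop a reference to @blk; the last unref deletes it, which closes its
 * BlockDriverState, and for replication replication_close() then calls
 * job_cancel_sync() (frames #14..#7 above).
 *
 * Locking: undecided -- "caller holds the AioContext lock" or "caller must
 * not hold it"; current callers apparently do both.
 */
void blk_unref(BlockBackend *blk);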

We probably have both kinds of callers in the current code base, so at the
moment we can get aborts like this one; if we change blk_unref() to acquire
the lock itself, we may get deadlocks instead. So apart from deciding which
convention we want blk_unref() to follow, I'm afraid we'll also have to
audit its callers.

I'm inclined to say that you should hold the lock when you call
blk_unref(), which would technically make this a bug in the test case,
but I think other callers have the same bug.
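
If we settle on "caller holds the lock", the fix in the test case would look
something like the following untested sketch (aio_context_acquire/release,
blk_get_aio_context and blk_unref are the real QEMU functions; the wrapper
name is made up and not checked against the rest of the test):

#include "qemu/osdep.h"
#include "block/aio.h"
#include "sysemu/block-backend.h"

/* Untested sketch: take the AioContext lock around blk_unref() so that
 * job_cancel_sync() deep inside replication_close() finds it held and the
 * aio_context_release() in job_completed_txn_abort() is legal. */
static void teardown_secondary_blk(BlockBackend *blk)
{
    AioContext *ctx = blk_get_aio_context(blk);

    aio_context_acquire(ctx);
    blk_unref(blk);
    aio_context_release(ctx);
}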

Kevin


