Re: [PATCH v4 00/23] backup performance: block_status + async

From: Max Reitz
Subject: Re: [PATCH v4 00/23] backup performance: block_status + async
Date: Wed, 20 Jan 2021 16:53:26 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.5.0

On 20.01.21 15:44, Max Reitz wrote:
> On 20.01.21 15:34, Max Reitz wrote:
>> From a glance, it looks to me like two coroutines are created simultaneously in two threads, and so one thread sets up a special SIGUSR2 action, then another reverts SIGUSR2 to the default, and then the first one kills itself with SIGUSR2.
>>
>> Not sure what this has to do with backup, though it is interesting that backup_loop() runs in two threads.  So perhaps some AioContext problem.
>
> Oh, 256 runs two backups concurrently.  So it isn’t that interesting, but perhaps part of the problem still.  (I have no idea, still looking.)

So this is what I found out:

When coroutine-sigaltstack creates a new coroutine, it sets up a signal handler for SIGUSR2, then kills itself with SIGUSR2, and then uses the signal handler context (running on a sigaltstack) for the new coroutine. The signal handler returns after a sigsetjmp(), and only then is the old SIGUSR2 behavior restored.
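
For readers unfamiliar with the trick, here is a minimal sketch of that sequence. It is not QEMU’s code: the names (trampoline, create_context_on_altstack, ALT_STACK_SIZE) are made up for illustration, it uses raise() for the self-signal, and it only captures the context with sigsetjmp() without ever jumping back into it.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ALT_STACK_SIZE (64 * 1024)  /* hypothetical size for this sketch */

static sigjmp_buf new_context;                 /* context captured in the handler */
static volatile sig_atomic_t handler_ran;
static volatile uintptr_t handler_local_addr;  /* address of a local in the handler */

static void trampoline(int signo)
{
    int marker = 0;

    (void)signo;
    handler_local_addr = (uintptr_t)&marker;
    handler_ran = 1;
    if (sigsetjmp(new_context, 0) == 0) {
        return;  /* leave the handler; new_context still refers to this stack */
    }
    /* A real backend would arrive here via siglongjmp() and run the coroutine.
     * This sketch never performs that jump. */
}

/* Returns 1 if the handler ran on the alternate stack, 0 if not, -1 on error. */
static int create_context_on_altstack(void)
{
    stack_t ss, old_ss;
    struct sigaction sa, old_sa;
    char *stack = malloc(ALT_STACK_SIZE);
    int on_alt_stack;

    if (stack == NULL) {
        return -1;
    }

    /* 1. Register the future coroutine stack as the alternate signal stack. */
    ss.ss_sp = stack;
    ss.ss_size = ALT_STACK_SIZE;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, &old_ss) != 0) {
        free(stack);
        return -1;
    }

    /* 2. Install a SIGUSR2 handler that runs on that stack. This changes the
     *    disposition for the whole process, not just for this thread. */
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = trampoline;
    sa.sa_flags = SA_ONSTACK;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR2, &sa, &old_sa) != 0) {
        sigaltstack(&old_ss, NULL);
        free(stack);
        return -1;
    }

    /* 3. Signal ourselves; the handler runs before raise() returns. */
    handler_ran = 0;
    raise(SIGUSR2);

    /* 4. Restore the previous SIGUSR2 action and alternate stack. */
    sigaction(SIGUSR2, &old_sa, NULL);
    sigaltstack(&old_ss, NULL);

    on_alt_stack = handler_ran &&
                   handler_local_addr >= (uintptr_t)stack &&
                   handler_local_addr < (uintptr_t)stack + ALT_STACK_SIZE;
    free(stack);  /* a real backend keeps the stack for the coroutine's lifetime */
    return on_alt_stack;
}
```

Calling create_context_on_altstack() from a single thread should return 1. The window between steps 2 and 4 is where a second thread could interfere.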

What I fail to understand is how this is thread-safe. Setting up signal handlers is a process-wide action. When one thread changes what SIGUSR2 does, this will affect all threads immediately, so when two threads run coroutine-sigaltstack’s qemu_coroutine_new() concurrently, and one thread reverts to the default action before the other has SIGUSR2’ed itself, that later SIGUSR2 will kill the whole process.
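
To illustrate the process-wide part, here is a small sketch (hypothetical helper names, nothing taken from QEMU) in which one thread installs a SIGUSR2 handler, another thread resets it to the default, and the calling thread observes both changes through sigaction() queries. No signal is actually sent.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <string.h>

static void noop_handler(int signo)
{
    (void)signo;
}

/* Set the SIGUSR2 handler; returns 0 on success. */
static int set_usr2_handler(void (*handler)(int))
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR2, &sa, NULL);
}

/* Thread A: install a handler, as the first coroutine creator would. */
static void *install_handler(void *arg)
{
    (void)arg;
    return set_usr2_handler(noop_handler) == 0 ? NULL : (void *)1;
}

/* Thread B: revert to the default action, as the second creator would. */
static void *reset_to_default(void *arg)
{
    (void)arg;
    return set_usr2_handler(SIG_DFL) == 0 ? NULL : (void *)1;
}

/* Query the current disposition from the calling thread. */
static int usr2_is_default(void)
{
    struct sigaction cur;

    if (sigaction(SIGUSR2, NULL, &cur) != 0) {
        return -1;
    }
    return cur.sa_handler == SIG_DFL;
}

/* Run fn in a new thread and wait for it; returns 0 on success. */
static int run_in_thread(void *(*fn)(void *))
{
    pthread_t t;
    void *result;

    if (pthread_create(&t, NULL, fn, NULL) != 0) {
        return -1;
    }
    if (pthread_join(t, &result) != 0 || result != NULL) {
        return -1;
    }
    return 0;
}

/* Returns 1 if changes made in other threads are visible here, -1 on error. */
static int dispositions_are_process_wide(void)
{
    int seen_after_install, seen_after_reset;

    if (run_in_thread(install_handler) != 0) {
        return -1;
    }
    seen_after_install = (usr2_is_default() == 0);  /* thread A's handler */

    if (run_in_thread(reset_to_default) != 0) {
        return -1;
    }
    seen_after_reset = (usr2_is_default() == 1);    /* thread B's reset */

    return seen_after_install && seen_after_reset;
}
```

If thread A had been about to signal itself when thread B ran, that SIGUSR2 would have hit the default action, which terminates the process.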

(I suppose it gets even more interesting when one thread has set up the sigaltstack, then the other sets up its own sigaltstack, and then both kill themselves with SIGUSR2, so both coroutines get the same stack...)

I have no idea why this has never been hit before, but it makes sense why block-copy backup makes it apparent: It creates 64+x coroutines in a very short time span, and 256 makes it do so in two threads concurrently (thanks to launching two backups in two AioContexts in a transaction).

So... Looks to me like a bug in coroutine-sigaltstack. Not sure what to do now, though. I don’t think we can use block-copy for backup before that backend is fixed. (And that is assuming that it’s indeed coroutine-sigaltstack’s fault.)

I’ll try to add some locking, see what it does, and send a mail concerning coroutine-sigaltstack to qemu-devel.
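
As a rough idea of what such locking could look like (a sketch only, with hypothetical names, and not necessarily what a real fix would do): a single mutex serializes the whole install / signal / restore sequence, so no thread can change the SIGUSR2 action while another is between installing its handler and signalling itself.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <string.h>

#define ITERATIONS 50  /* per thread; kept small so the counter stays tiny */

/* Hypothetical lock guarding every change to the SIGUSR2 action. */
static pthread_mutex_t sigusr2_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile sig_atomic_t handler_calls;

static void count_handler(int signo)
{
    (void)signo;
    handler_calls++;  /* only ever runs while sigusr2_lock is held */
}

/* One guarded round: install handler, signal self, restore old action. */
static int guarded_signal_round(void)
{
    struct sigaction sa, old_sa;
    int rc = -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = count_handler;
    sigemptyset(&sa.sa_mask);

    pthread_mutex_lock(&sigusr2_lock);
    if (sigaction(SIGUSR2, &sa, &old_sa) == 0) {
        raise(SIGUSR2);  /* delivered to this thread before raise() returns */
        rc = sigaction(SIGUSR2, &old_sa, NULL);
    }
    pthread_mutex_unlock(&sigusr2_lock);
    return rc;
}

static void *worker(void *arg)
{
    int i;

    (void)arg;
    for (i = 0; i < ITERATIONS; i++) {
        if (guarded_signal_round() != 0) {
            return (void *)1;
        }
    }
    return NULL;
}

/* Runs two workers concurrently; returns the handler call count or -1. */
static int run_two_workers(void)
{
    pthread_t a, b;
    void *ra = NULL, *rb = NULL;

    handler_calls = 0;
    if (pthread_create(&a, NULL, worker, NULL) != 0) {
        return -1;
    }
    if (pthread_create(&b, NULL, worker, NULL) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, &ra);
    pthread_join(b, &rb);
    if (ra != NULL || rb != NULL) {
        return -1;
    }
    return (int)handler_calls;
}
```

With the lock in place, run_two_workers() should return 2 * ITERATIONS. Without it, one thread can restore the default action just before the other raises SIGUSR2, and the process dies.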
