
Re: [Qemu-devel] coroutine-ucontext broken for x86-32


From: Michael Tokarev
Subject: Re: [Qemu-devel] coroutine-ucontext broken for x86-32
Date: Wed, 09 May 2012 11:32:05 +0400
User-agent: Mozilla/5.0 (X11; Linux i686 on x86_64; rv:10.0.3) Gecko/20120329 Icedove/10.0.3

On 08.05.2012 23:35, Jan Kiszka wrote:
> Hi,
> 
> I hunted down a fairly subtle corruption of the VCPU thread signal mask
> in KVM mode when using the ucontext version of coroutines:
> 
> coroutine_new calls getcontext, makecontext, swapcontext. Those
> functions get/set also the signal mask of the caller. Unfortunately,
> they only use the sigprocmask syscall on i386, not the rt_sigprocmask
> version. So they do not properly save/restore the blocked RT signals,
> namely our SIG_IPI - it becomes unblocked this way. And this will sooner
> or later make the kernel actually deliver a SIG_IPI to our
> dummy_handler, and we miss a wakeup, which means losing control over
> VCPU thread - qemu hangs.
> 
> I was able to reproduce the issue very reliably with virtio-block
> enabled, 32-bit qemu userspace on a 64-bit host, using a 32-bit WinXP
> guest.

Jan, I have been trying to hunt down this very issue (well, FSVO "hunt",
since I still don't understand the qemu code as a whole) since around
0.15, which IIRC is when coroutines were introduced.  The symptom I
faced was a 32bit kvm process locking up when rebooting a windows guest.
The cause was lost/ignored interrupts, and for me it was enough to
suspend/resume (SIGSTOP/SIGCONT) the kvm process, or to attach a
debugger or strace to it, to get it going again.  It looked like a
corruption somewhere, and while bisecting I kept finding "unrelated"
commits -- like, eg, "switch qcow2 to coroutines" (I was using
-snapshot, so qcow2 was actually in use, but the commit itself was
innocent).  There are several discussions in the archives, a debian
bugreport about it and several IRC discussions, all with no outcome.
So at least now I can say that I am not the only one who sees the
issue, so it passes a reality check somehow... ;)

But the thing is: generally, almost no one cares about 32/64bit
"mixed" environments anymore.  I had a few users in Debian who
complained, and it has always been the same scenario: an old 32bit
install is moved to new hardware, then switched to a 64bit kernel
because of the large amount of memory, and the result is "something
not working".  My suggestion to them has always been "reinstall".
I use such a mixed environment myself on my development box
(and actually even on production machines @office), so I'm
one of the first to face issues in this area, and it sometimes
keeps me from doing other things -- eg, I can't debug some
other bug because qemu locks up due to this 32/64 thing.  I
learned to use a 64bit chroot for these things after all.

So I'm not sure if there's enough interest to hunt this.  It
must be something very simple, and it might pop up somewhere
else, but so far it - seemingly - only affects 32/64bit mixed
environments.

> Simple workaround:
> 
> diff --git a/main-loop.h b/main-loop.h
> index c06b8bc..dce1cd9 100644
> --- a/main-loop.h
> +++ b/main-loop.h
> @@ -25,11 +25,7 @@
>  #ifndef QEMU_MAIN_LOOP_H
>  #define QEMU_MAIN_LOOP_H 1
>  
> -#ifdef SIGRTMIN
> -#define SIG_IPI (SIGRTMIN+4)
> -#else
>  #define SIG_IPI SIGUSR1
> -#endif
>  
>  /**
>   * qemu_init_main_loop: Set up the process so that it can run the main loop.
> 
> 
> But maybe someone has a better idea, ie. something that addresses the
> issue at the root. Otherwise we would have to erect large warning signs:
> "Do not use RT signals! Coroutines will break them for you."
> 
> Michael, maybe this also relates to the issue you saw. I'm not able to
> reproduce any VAPIC problems after making Windows bootable by switching
> to SIGUSR1.

I'll try to verify it later today, I've unrelated urgent family
issues right now...

Thank you!

/mjt


