From: Hans de Goede
Subject: Re: [Qemu-devel] [PATCH] main-loop: Don't lock starve io-threads when main_loop_tlg has pending events
Date: Tue, 08 Oct 2013 22:16:50 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.0

Hi,

On 10/08/2013 10:01 PM, Alex Bligh wrote:

<snip>

>> The purpose of the 1 ns timeout is to cause os_host_main_loop_wait
>> to unlock the iothread; as $subject says, the problem I'm seeing
>> seems to be lock starvation, not cpu starvation.
>>
>> Note, as I already indicated, I'm in no way an expert in this; if
>> you and/or Paolo suspect cpu starvation may happen too, then bumping
>> the timeout to 250 us is fine with me too.
>>
>> If we go with 250 us, that does pose the question of whether we
>> should always keep a minimum timeout of 250 us when not non-blocking,
>> or only bump it to 250 us when main_loop_tlg already has expired
>> events and thus is causing a timeout of 0.
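
(For anyone reading along without the QEMU code handy: the timeout in
question is computed along these lines. This is a simplified sketch of
main_loop_wait() after the timers conversion, not the actual code; the
iohandler/slirp fill and dispatch steps are omitted and the millisecond
bookkeeping is collapsed into nanoseconds.)

/* Simplified sketch, not the actual QEMU main-loop.c code. */
int main_loop_wait(int nonblocking)
{
    int64_t timeout_ns;
    int ret;

    if (nonblocking) {
        timeout_ns = 0;
    } else {
        timeout_ns = -1;    /* block until an fd or timer fires */
    }

    /* If main_loop_tlg already has an expired timer this returns 0, so
     * the poll below becomes non-blocking even though the caller passed
     * nonblocking == 0. */
    timeout_ns = qemu_soonest_timeout(timeout_ns,
                                      timerlistgroup_deadline_ns(
                                          &main_loop_tlg));

    /* The "always keep a 250 us minimum when not non-blocking" variant
     * discussed above would be a clamp at this point, something like:
     * if (!nonblocking && timeout_ns >= 0) timeout_ns = MAX(timeout_ns, 250000);
     */

    ret = os_host_main_loop_wait(timeout_ns);

    /* ... run expired timers, dispatch fd handlers ... */
    return ret;
}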

> I am by no means an expert in the iothread bit, so let's pool our
> ignorance ... :-)
>
> Somewhere within that patch series (7b595f35 I think) I fixed up
> the spin counter bit, which made it slightly less yucky and work
> with milliseconds. I hope I didn't break it, but there seems to be
> something slightly odd about the use case here.
>
> If you are getting the spin error, this implies something is
> pretty much constantly polling os_host_main_loop_wait with a
> zero timeout. As you point out, this is going to be main_loop_wait,
> and almost certainly main_loop_wait called with nonblocking
> set to 1.

No, it is calling main_loop_wait with nonblocking set to 0, so
normally the lock would get released. But
timerlistgroup_deadline_ns(&main_loop_tlg) is returning 0,
causing timeout_ns to be 0, and this causes the lock to not get
released.
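
(The relevant part of os_host_main_loop_wait() is, in rough outline,
the following. This is a simplified sketch of the POSIX path, not the
verbatim code; the warning text and glib bookkeeping are trimmed. The
point is that the unlock/lock pair around the poll is skipped entirely
when the timeout is zero.)

/* Simplified sketch, not the verbatim QEMU main-loop.c code. */
static int os_host_main_loop_wait(int64_t timeout)
{
    static int spin_counter;
    int ret;

    glib_pollfds_fill(&timeout);   /* glib sources may lower the timeout */

    if (timeout == 0) {
        /* This is where the "spin" warning comes from: too many
         * consecutive zero-timeout polls while holding the lock. */
        spin_counter++;
    } else {
        spin_counter = 0;
    }

    if (timeout) {
        /* Only here does anyone else get a chance at the iothread lock. */
        qemu_mutex_unlock_iothread();
    }

    ret = qemu_poll_ns((GPollFD *)gpollfds->data, gpollfds->len, timeout);

    if (timeout) {
        qemu_mutex_lock_iothread();
    }

    glib_pollfds_poll();
    return ret;
}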

I'm quite sure this is what is happening, because once my
bisect pointed to the "aio / timers: Convert mainloop to use timeout"
commit as the culprit, I read that commit very carefully multiple
times, and that seemed like the only problem it could cause,
so I added a debug printf to test for that case, and it triggered.

What I believe is happening in my troublesome scenario is that one
thread is calling main_loop_wait(0) repeatedly, waiting for another
thread to do some work (*), but that other thread is not getting a
chance to do that work because the iothread never gets unlocked.

*) likely the spice-server thread which does a lot of work for
the qxl device
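
(To make that pattern concrete, here is a tiny stand-alone pthreads
illustration; it is not QEMU code and every name in it is made up. The
polling thread never drops the shared lock, so the worker it is in
effect waiting for can never run.)

/* Illustration only, not QEMU code. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool work_done;

static void *worker(void *arg)
{
    /* Stands in for the spice-server thread: it needs the big lock
     * before it can finish the work the main loop is waiting for. */
    (void)arg;
    pthread_mutex_lock(&big_lock);
    atomic_store(&work_done, true);
    pthread_mutex_unlock(&big_lock);
    return NULL;
}

int main(void)
{
    pthread_t tid;
    long polls = 0;

    pthread_mutex_lock(&big_lock);            /* "iothread" lock held */
    pthread_create(&tid, NULL, worker, NULL);

    /* Like calling main_loop_wait() over and over with a zero timeout:
     * the lock is never released, so the worker never gets to run.
     * Bounded so the demo terminates instead of spinning forever. */
    while (!atomic_load(&work_done) && polls < 50000000) {
        polls++;
    }
    printf("after %ld zero-timeout polls, work_done is still %d\n",
           polls, (int)atomic_load(&work_done));

    /* Dropping the lock, even briefly, is what lets the worker in. */
    pthread_mutex_unlock(&big_lock);
    pthread_join(tid, NULL);
    printf("after dropping the lock once, work_done = %d\n",
           (int)atomic_load(&work_done));
    return 0;
}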



> The comment at line 208 suggests that "the I/O thread is very busy
> or we are incorrectly busy waiting in the I/O thread". Do we know
> which is happening? Perhaps rather than give up the io_thread
> mutex on every call (which is in practice what a 1 nanosecond
> timeout does) we should give it up if we have not released
> it for X nanoseconds (maybe X=250us), or on every Y calls. I think
> someone other than me should consider the effect of dropping and
> reacquiring a mutex so frequently under heavy I/O load, but I'm not
> sure it's a great idea.
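
(A rough sketch of that "give it up every X nanoseconds or every Y
calls" idea, just to make it concrete; this is not a tested patch, and
the helper name and the thresholds are invented for the illustration.)

/* Sketch only: rate-limit how often a zero-timeout poll gives up the
 * iothread lock, instead of on every call or never. */
static int64_t last_drop_ns;
static int calls_since_drop;

static bool should_drop_iothread_lock(int64_t timeout)
{
    int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);

    if (timeout) {
        return true;                        /* blocking poll: always drop */
    }
    calls_since_drop++;
    if (now - last_drop_ns > 250 * 1000 ||  /* X: at least every 250 us */
        calls_since_drop > 1000) {          /* Y: or every 1000 calls   */
        last_drop_ns = now;
        calls_since_drop = 0;
        return true;
    }
    return false;
}

/* os_host_main_loop_wait() would then do:
 *
 *     bool drop = should_drop_iothread_lock(timeout);
 *     if (drop) {
 *         qemu_mutex_unlock_iothread();
 *     }
 *     ret = qemu_poll_ns((GPollFD *)gpollfds->data, gpollfds->len, timeout);
 *     if (drop) {
 *         qemu_mutex_lock_iothread();
 *     }
 */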

We only wait this short a time if there are timers which want to run
immediately; normally we would wait a lot longer.

> So on reflection you might be more right with 1 nanosecond than
> 250us as a timeout, but I wonder whether a strategy
> of just dropping the lock occasionally (and still using a zero
> timeout) might be better.

Paolo probably has some better insights on this, but he seems to
have called it a day for today, and I'm going to do the same :)

So let's continue this tomorrow.

Regards,

Hans



