How to tell if an emulated aarch64 CPU has stopped doing work?

We use QEMU (4.0.0, about to switch to 5.0.0) to test our aarch64 images, running in Linux containers on x86_64 hosts alongside other workloads.

We've recently run into issues where it looks like one of the four emulated CPUs sometimes stops making progress for ten or more seconds, and we're trying to characterize the problem. When this happens, the other emulated CPUs usually run just fine, though sometimes two will stall at the same time.

Any suggestions for how to tell if an emulated CPU stopped doing work?

Based on our experiments, the guest-visible clocks and cycle counters continue to run when a QEMU CPU thread is suspended, so it's hard to tell whether the emulation paused or our code is spinning with interrupts disabled (though evidence is mounting that it isn't spinning). We're adding a lot more instrumentation to our code, but maybe QEMU has some features that will help us out.

I tried to find a way to count the number of translation blocks (TBs) executed by each emulated core over time, but I didn't see a cheap way to do that with the plugin APIs.

We could maybe turn on instruction tracing, but this problem happens pretty rarely (<1%), we don't have a repro case yet, and we can't really afford the cost of slowing down every test run. There's a decent chance that this is caused by an overloaded host, but our host-side investigations haven't turned up anything concrete either.
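One cheap host-side check we could run alongside every test is to sample per-thread scheduler statistics for the QEMU process. This is only a rough sketch, and it rests on several assumptions: that each emulated CPU has its own host thread, that the host kernel exposes /proc/<pid>/task/<tid>/schedstat (time on CPU in ns, time waiting on a run queue in ns, number of timeslices), and that the 5% threshold, which is an arbitrary choice of mine, is a sensible cutoff. The function names are made up for illustration.

```python
import os
import time


def parse_schedstat(text):
    """Return (run_ns, wait_ns, timeslices) from one schedstat line."""
    run_ns, wait_ns, slices = (int(x) for x in text.split()[:3])
    return run_ns, wait_ns, slices


def read_threads(pid):
    """Map tid -> (thread name, run_ns, wait_ns) for every thread of pid."""
    out = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/comm") as f:
                name = f.read().strip()
            with open(f"{task_dir}/{tid}/schedstat") as f:
                run_ns, wait_ns, _ = parse_schedstat(f.read())
        except OSError:
            continue  # thread exited, or schedstat is not available
        out[int(tid)] = (name, run_ns, wait_ns)
    return out


def classify(prev, cur, interval_ns, frac=0.05):
    """Label what each thread mostly did between two samples.

    "running": accrued CPU time on the host
    "starved": runnable, but mostly waiting on a host run queue
    "blocked": neither, so it was asleep (lock, condition variable, I/O)
    The 5% default threshold is arbitrary.
    """
    labels = {}
    for tid, (_name, run_ns, wait_ns) in cur.items():
        if tid not in prev:
            continue  # new thread, no baseline yet
        d_run = run_ns - prev[tid][1]
        d_wait = wait_ns - prev[tid][2]
        if d_run >= frac * interval_ns:
            labels[tid] = "running"
        elif d_wait >= frac * interval_ns:
            labels[tid] = "starved"
        else:
            labels[tid] = "blocked"
    return labels


def watch(pid, interval_s=1.0, samples=10):
    """Print one label per thread per interval for a QEMU process."""
    prev = read_threads(pid)
    for _ in range(samples):
        time.sleep(interval_s)
        cur = read_threads(pid)
        labels = classify(prev, cur, int(interval_s * 1e9))
        stamp = time.strftime("%H:%M:%S")
        for tid, label in sorted(labels.items()):
            print(stamp, tid, cur[tid][0], label)
        prev = cur
```

Calling watch(qemu_pid) and lining the output up with guest-side timestamps would hopefully narrow things down. A CPU thread that stays "running" through a stall points at guest code spinning or the emulator itself being busy. "starved" points at an overloaded host. "blocked" would mean the thread was asleep, which presumably also covers a guest CPU that is legitimately idle, so that label is ambiguous on its own.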

Any advice?

--dbort

From:	Dave Bort
Subject:	How to tell if an emulated aarch64 CPU has stopped doing work?
Date:	Thu, 14 May 2020 18:14:00 -0700