qemu-discuss

[Qemu-discuss] Qemu VM random freeze


From: LiChuanyun
Subject: [Qemu-discuss] Qemu VM random freeze
Date: Thu, 10 Sep 2015 19:22:46 +0800

Hi, all

We have a newly installed OpenStack cluster. Every VM in the cluster randomly freezes for about 2~3 seconds every few minutes. When a VM freezes, pings from outside (from a physical machine or another VM) show a latency of a few seconds or time out (vs. <1 ms in the normal situation). Processes on the VM that do not use the network (e.g. one writing the date and time to a file each second) also stop working during those freezes, so the file has no lines for those seconds.
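For reference, a minimal sketch of the kind of in-guest check described above: one function appends a timestamp per second to a file, another reports the gaps. The function names, the 1-second interval and the 0.5-second tolerance are illustrative assumptions, not the exact script we ran.

```python
import time
from datetime import datetime


def log_heartbeat(path, count, interval=1.0):
    """Append one ISO timestamp per line to `path`, `count` times."""
    with open(path, "a") as f:
        for _ in range(count):
            f.write(datetime.now().isoformat() + "\n")
            f.flush()  # push each line out so a stall shows up as a gap
            time.sleep(interval)


def load_timestamps(path):
    """Read the log written by log_heartbeat back as epoch seconds."""
    with open(path) as f:
        return [datetime.fromisoformat(line.strip()).timestamp()
                for line in f if line.strip()]


def find_gaps(timestamps, interval=1.0, tolerance=0.5):
    """Return (start, gap_seconds) for each pair of consecutive
    timestamps further apart than interval + tolerance."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        if delta > interval + tolerance:
            gaps.append((prev, delta))
    return gaps
```

Running `log_heartbeat("/tmp/heartbeat.log", 600)` inside a guest and then `find_gaps(load_timestamps("/tmp/heartbeat.log"))` lists each freeze with its start time and length, which can be lined up against the ping timeouts seen from outside.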

On the compute node that runs the VM, we see high CPU usage (often near or over 100%) from the qemu process running the VM whenever it freezes. Inside the VM, however, CPU utilization stays low all the time. This suggests that either the CPU time goes to qemu for some busy work outside its vcpu threads, or it goes to the vcpu threads but they do not enter guest mode during that time.
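One way to tell these two cases apart on the host is to look at per-thread CPU counters of the qemu process. The sketch below assumes the Linux `/proc/<pid>/task/<tid>/stat` layout documented in proc(5), where field 14 is `utime`, field 15 is `stime` and field 43 is `guest_time` (time spent running a virtual CPU); the function names are hypothetical.

```python
import os


def parse_stat_line(line):
    """Parse one /proc/<pid>/task/<tid>/stat line (see proc(5))."""
    # comm is wrapped in parentheses and may itself contain spaces
    # or ')', so split on the last ')'.
    head, _, tail = line.rpartition(")")
    tid, _, comm = head.partition(" (")
    rest = tail.split()  # rest[0] is field 3 (state)
    return {
        "tid": int(tid),
        "comm": comm,
        "utime": int(rest[11]),       # field 14, includes guest time
        "stime": int(rest[12]),       # field 15
        "guest_time": int(rest[40]),  # field 43
    }


def thread_cpu_times(pid):
    """Return parsed stat entries for every thread of `pid`."""
    task_dir = "/proc/%d/task" % pid
    result = []
    for tid in sorted(os.listdir(task_dir), key=int):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as f:
                result.append(parse_stat_line(f.read()))
        except (FileNotFoundError, ProcessLookupError):
            continue  # thread exited between listdir and read
    return result
```

All values are in clock ticks (`os.sysconf("SC_CLK_TCK")` per second). Sampling `thread_cpu_times(qemu_pid)` twice across a freeze and subtracting shows which threads consumed the CPU: a thread whose `utime + stime` grows while its `guest_time` stays flat was busy on the host side rather than running guest code.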

We have a very similar OpenStack setup using the same versions of OpenStack/qemu/kvm/host OS/guest OS, which does not have such freezes. The only obvious difference is that the "random freezing" compute nodes are Huawei RH 2288H V2, while the "good" ones are some Dell servers (I can get that info if it is important). The CPUs are Xeon E5-2560 and Xeon 5405 respectively, with the former having more advanced virtualization support (VT-d and EPT).

The host OS is Ubuntu 14.04 LTS (kernel 3.13.0-32-generic), and the qemu version is 2.0. The guest OS does not seem to matter (the freeze happens on the few different guest OSes we have tried).

We have only a rough idea that it is related to some scheduling problem on the host that starves the vcpu threads. Other freezing problems reported online were solved by disabling kvm-clock, but we tried that and it did not help.

We lack a diagnostic method to identify the root cause. Could anyone suggest where we should start? Any "suspected fixes" are also welcome.

