I recently deployed a new hypervisor to host both Linux and Windows virtual machines. However, after getting everything set up, I am running into an issue where the virtual machines appear to "freeze" or "stutter" for a few seconds at random intervals.
2x Intel Xeon E5-2620 V2
128GB of RAM
8x 480GB Intel 530 SSDs (RAID 10)
2x 1Gbit NICs (on-motherboard) - bonded
Supermicro Motherboard - X9DRI-F
QEMU emulator version 2.0.0 (Debian 2.0.0+dfsg-2ubuntu1.2)
libvirtd (libvirt) 1.2.2
Dnsmasq version 2.68 (DHCP Server)
total used free shared buffers cached
Mem: 128910 18413 110496 2 134 6349
-/+ buffers/cache: 11929 116980
Swap: 61034 0 61034
top - 18:57:53 up 4:07, 1 user, load average: 5.85, 5.30, 5.28
Tasks: 372 total, 2 running, 370 sleeping, 0 stopped, 0 zombie
%Cpu(s): 3.3 us, 3.5 sy, 0.0 ni, 92.7 id, 0.5 wa, 0.1 hi, 0.0 si, 0.0 st
KiB Mem: 13200402+total, 18855240 used, 11314878+free, 137560 buffers
KiB Swap: 62499836 total, 0 used, 62499836 free. 6502164 cached Mem
Known affected guest OSes: CentOS 6.5, Windows Server 2012 R2
The issue & troubleshooting I have completed:
After a random period of time, I begin to see bursts of high latency and packet loss to the guest operating system. Watching the guest's VNC console during a burst, I have confirmed that the VNC output "freezes" until the burst passes. Once it passes, the machine acts as if nothing happened; from what I can tell, the guest isn't even aware that it froze or that any time passed during the event.
Example ping to the guest during the burst of latency and packet loss
64 bytes from x.x.x.x: icmp_seq=6285 ttl=48 time=54.956 ms
64 bytes from x.x.x.x: icmp_seq=6286 ttl=48 time=54.765 ms
64 bytes from x.x.x.x: icmp_seq=6287 ttl=48 time=54.725 ms
64 bytes from x.x.x.x: icmp_seq=6288 ttl=48 time=5091.305 ms
64 bytes from x.x.x.x: icmp_seq=6290 ttl=48 time=3090.609 ms
64 bytes from x.x.x.x: icmp_seq=6289 ttl=48 time=4091.357 ms
64 bytes from x.x.x.x: icmp_seq=6291 ttl=48 time=2090.073 ms
64 bytes from x.x.x.x: icmp_seq=6292 ttl=48 time=1088.983 ms
64 bytes from x.x.x.x: icmp_seq=6293 ttl=48 time=88.455 ms
64 bytes from x.x.x.x: icmp_seq=6294 ttl=48 time=52.370 ms
64 bytes from x.x.x.x: icmp_seq=6295 ttl=48 time=52.087 ms
64 bytes from x.x.x.x: icmp_seq=6296 ttl=48 time=54.872 ms
64 bytes from x.x.x.x: icmp_seq=6297 ttl=48 time=52.708 ms
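One way to read the burst above is to work out when each reply actually arrived. The sketch below assumes ping was run with its default one-second interval (the post does not say otherwise), so a probe's send time is its sequence number times one second; the function name and the `interval_ms` parameter are made up for illustration. Arrival time is then send time plus round-trip time:

```python
import re

# Matches the icmp_seq and time fields of a standard ping reply line.
PING_LINE = re.compile(r"icmp_seq=(\d+) .*time=([\d.]+) ms")


def reply_arrival_offsets(lines, interval_ms=1000.0):
    """Map each seq to the time (ms) its reply arrived, measured from the
    moment the lowest-numbered probe in `lines` was sent.

    Assumes one probe every `interval_ms` (ping's default is 1 second).
    """
    samples = []
    for line in lines:
        match = PING_LINE.search(line)
        if match:
            samples.append((int(match.group(1)), float(match.group(2))))
    if not samples:
        return {}
    first_seq = min(seq for seq, _ in samples)
    return {
        seq: (seq - first_seq) * interval_ms + rtt
        for seq, rtt in sorted(samples)
    }


# The slow replies from the example above, in the order they were printed.
burst = """\
64 bytes from x.x.x.x: icmp_seq=6288 ttl=48 time=5091.305 ms
64 bytes from x.x.x.x: icmp_seq=6290 ttl=48 time=3090.609 ms
64 bytes from x.x.x.x: icmp_seq=6289 ttl=48 time=4091.357 ms
64 bytes from x.x.x.x: icmp_seq=6291 ttl=48 time=2090.073 ms
64 bytes from x.x.x.x: icmp_seq=6292 ttl=48 time=1088.983 ms
64 bytes from x.x.x.x: icmp_seq=6293 ttl=48 time=88.455 ms
""".splitlines()

for seq, offset in reply_arrival_offsets(burst).items():
    print(f"seq={seq}: reply arrived ~{offset:.0f} ms after seq 6288 was sent")
```

Under that assumption, all six replies land within a few milliseconds of each other, roughly 5.09 seconds after seq 6288 was sent. That pattern looks less like packets being dropped on the network and more like the guest being paused for about five seconds and then answering everything that had queued up, which matches the frozen VNC output.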
Example outbound ping from the guest back during the same example interval above
64 bytes from x.x.x.x: icmp_seq=6261 ttl=48 time=53.488 ms
64 bytes from x.x.x.x: icmp_seq=6262 ttl=48 time=50.878 ms
64 bytes from x.x.x.x: icmp_seq=6263 ttl=48 time=52.926 ms
64 bytes from x.x.x.x: icmp_seq=6264 ttl=48 time=51.401 ms
64 bytes from x.x.x.x: icmp_seq=6265 ttl=48 time=54.259 ms
64 bytes from x.x.x.x: icmp_seq=6266 ttl=48 time=52.404 ms
64 bytes from x.x.x.x: icmp_seq=6267 ttl=48 time=55.412 ms
64 bytes from x.x.x.x: icmp_seq=6268 ttl=48 time=69.590 ms
64 bytes from x.x.x.x: icmp_seq=6269 ttl=48 time=54.899 ms
64 bytes from x.x.x.x: icmp_seq=6270 ttl=48 time=53.875 ms
64 bytes from x.x.x.x: icmp_seq=6271 ttl=48 time=52.909 ms
64 bytes from x.x.x.x: icmp_seq=6272 ttl=48 time=53.257 ms
64 bytes from x.x.x.x: icmp_seq=6273 ttl=48 time=53.671 ms
As you can see from the above examples, the guest never sees packet loss outbound during the event, but the inbound ping is erratic to say the least.
During the bursts of packet loss and high latency, pings to the hypervisor's own IP are perfect, without even a slight hiccup. I am able to keep an SSH connection open throughout the entire event, and when viewing something like "top" I see a constant stream of updates. In other words, the hypervisor itself never experiences an issue as far as I can tell.
During testing I also set up a Windows Server 2012 R2 guest and connected to its VNC console. I opened Task Manager so that, when the issue began, I could see whether the graphs "lurch" forward or just stop and start again.
When the event occurred (it took several hours of waiting), I brought up the VNC connection for the Windows guest VM and watched the Task Manager graphs. Each time there was a burst of packet loss and high latency I saw the same behavior as above: the VNC output would freeze and I couldn't send input to it either. The VNC connection itself remains connected and never times out.
After each event the Task Manager graphs would pick up right where they left off, as if nothing had happened. There isn't a "lurch" or "jump" forward like you would expect if you had simply lost the connection to the guest; when the guest resumes output via VNC, it's as if the time never passed.
The only thing I noted was when the output resumes there is a sudden spike to 100% CPU inside the guest.
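To check from inside a guest whether it notices the missing time, a small loop that samples a clock at a fixed interval and reports any oversized gaps could be left running. This is only a sketch: the function names, the 0.1 second interval and the tolerance are arbitrary choices for illustration, and whether a gap shows up at all depends on how the guest's clock behaves while it is frozen, which I have not established.

```python
import time


def find_stalls(timestamps, interval, tolerance=1.0):
    """Return (start, gap_seconds) for every pair of consecutive samples
    that are further apart than interval + tolerance."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > interval + tolerance:
            stalls.append((prev, gap))
    return stalls


def sample_clock(duration, interval, clock=time.monotonic, sleep=time.sleep):
    """Read `clock` every `interval` seconds until `duration` seconds of
    clock time have elapsed, and return the list of readings."""
    readings = [clock()]
    while readings[-1] - readings[0] < duration:
        sleep(interval)
        readings.append(clock())
    return readings


# Synthetic readings standing in for a real run: a 0.1 s loop with one
# five-second hole, roughly the size of the freezes described above.
example = [0.0, 0.1, 0.2, 5.2, 5.3]
for start, gap in find_stalls(example, interval=0.1, tolerance=0.5):
    print(f"gap of {gap:.1f} s after reading {start:.1f}")
```

For a live run inside the guest, something like `find_stalls(sample_clock(duration=3600, interval=0.1), interval=0.1, tolerance=0.5)` would cover an hour. If a freeze is visible over VNC but no gap is reported, the guest's clock stopped along with it; if a gap is reported, the guest's clock kept advancing while the guest made no progress.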
Once the bursts begin, they typically continue, worsening and lessening, until I apply one of the following temporary fixes:
1) Restart the guest via libvirt or virsh - Once the guest boots back up the issue is resolved for a length of time (random length, could be 5 minutes, could be 8 hours) and then the issue returns with symptoms identical to the above.
2) Restart the entire hypervisor - Same as restarting an individual guest, the issue is resolved for a time, but eventually returns.
- The issue occurs on both Ubuntu 12.04.4 LTS w/QEMU 1.0.0 and on Ubuntu 14.04.1 LTS w/QEMU 2.0.0
- The issue affects both Windows & Linux guest operating systems
- VirtIO is the default driver for all guests, however I have confirmed the issue also affects IDE, Realtek & E1000 when set for the guest VM.
- When the bursts of packet loss and high latency begin, they do NOT affect all the guest machines simultaneously. While the issue eventually affects every guest, it doesn't affect them all at the same time. One guest will be having issues while 5 others are smooth as glass. It's almost as if each guest has its own timer for when the event will begin.
Any insight into what I may be doing wrong or how to fix the above would be greatly appreciated!