

From: Gerhard Wiesinger
Subject: Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
Date: Tue, 03 Mar 2015 21:50:33 +0100
User-agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0

On 03.03.2015 14:18, Gerhard Wiesinger wrote:
On 03.03.2015 13:28, Gerhard Wiesinger wrote:
On 03.03.2015 10:12, Gerhard Wiesinger wrote:
On 02.03.2015 18:15, Gerhard Wiesinger wrote:
On 02.03.2015 16:52, Gerhard Wiesinger wrote:
On 02.03.2015 10:26, Paolo Bonzini wrote:

On 01/03/2015 11:36, Gerhard Wiesinger wrote:
So far it has happened only on the PostgreSQL database VM. The kernel is alive
(ping works well), but ssh is not working.
Console window: after entering one character at the login prompt, it then crashed:
[1438.384864] Out of memory: Kill process 10115 (pg_dump) score 112 or sacrifice child
[1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB, anon-rss: 162712kB, file-rss: 220kB
Can you get a vmcore or at least sysrq-t output?

Yes, next time it happens I can analyze it.

I think there are 2 problems:
1.) OOM (Out of Memory) problem with the low memory settings and kernel settings (see below)
2.) Instability problem which might depend on 1.)

What I've done so far (thanks to Andrey Korolyov for ideas and help):
a.) Updated machine type from pc-0.15 to pc-i440fx-2.2
virsh dumpxml database | grep "<type"
    <type arch='x86_64' machine='pc-0.15'>hvm</type>

virsh edit database
virsh dumpxml database | grep "<type"
    <type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>

SMBIOS is therefore updated from 2.4 to 2.8:
dmesg|grep -i SMBIOS
[    0.000000] SMBIOS 2.8 present.
b.) Switched to tsc clock, kernel parameters: clocksource=tsc nohz=off highres=off (see the sketch after this list)
c.) Changed overcommit to 1
echo "vm.overcommit_memory = 1" > /etc/sysctl.d/overcommit.conf
d.) Tried 1 VCPU instead of 2
e.) Installed 512MB vRAM instead of 384MB
f.) Prepared for sysrq and vmcore
echo "kernel.sysrq = 1" > /etc/sysctl.d/sysrq.conf
sysctl -w kernel.sysrq=1
virsh send-key database KEY_LEFTALT KEY_SYSRQ KEY_T
virsh dump domain-name /tmp/dumpfile
g.) Further ideas, not yet done: disable memory ballooning by blacklisting the balloon driver or removing the memballoon device from the virsh XML config (see the sketch after this list)
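
For steps b.) and g.), something like the following could be used on a Fedora guest (just a sketch, the exact commands and file names may differ from what was/will be actually used):
# guest: append the clocksource parameters to all installed kernels
grubby --update-kernel=ALL --args="clocksource=tsc nohz=off highres=off"
# guest: blacklist the virtio balloon driver (file name is arbitrary)
echo "blacklist virtio_balloon" > /etc/modprobe.d/virtio_balloon.conf
# host: or remove the balloon device from the domain XML (virsh edit database):
#     <memballoon model='none'/>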

Summary:
1.) 512MB, tsc timer, 1VCPU, vm.overcommit_memory = 1: no OOM problem, no crash
2.) 512MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM problem, no crash

3.) 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM problem, no crash

3b.) Happened again during the nightly backup with the same configuration as in 3.) (384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1, pc-i440fx-2.2): no OOM problem, ping ok, no other reaction, BUT CRASHED again


3c.) configuration 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1, pc-i440fx-2.2: OOM problem, no crash

postgres invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Free swap  = 905924kB
Total swap = 1081340kB
Out of memory: Kill process 19312 (pg_dump) score 142 or sacrifice child
Killed process 19312 (pg_dump) total-vm:384516kB, anon-rss:119260kB, file-rss:0kB

An OOM should not occur:
https://www.kernel.org/doc/gorman/html/understand/understand016.html
Is there enough swap space left (nr_swap_pages > 0)? If yes, it is not OOM.

Why does an OOM condition occur? Looks like a bug in the kernel?
Any ideas?

# Allocating 800MB, killed by OOM killer
./mallocsleep 805306368
Killed

Out of memory: Kill process 27160 (mallocsleep) score 525 or sacrifice child
Killed process 27160 (mallocsleep) total-vm:790588kB, anon-rss:214948kB, file-rss:0kB
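
For reference, mallocsleep is just a trivial test program, roughly like this sketch (allocate the given number of bytes, touch every page so the memory is really backed by anonymous pages, then sleep):

/* mallocsleep.c - sketch: allocate <bytes>, touch the memory, then sleep */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    size_t size;
    char *p;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <bytes>\n", argv[0]);
        return 1;
    }
    size = (size_t)strtoull(argv[1], NULL, 10);

    p = malloc(size);
    if (p == NULL) {
        perror("malloc");
        return 1;
    }
    /* touch every byte so the allocation shows up as anon-rss */
    memset(p, 0xa5, size);

    /* keep the allocation alive so the OOM behaviour can be observed */
    sleep(3600);

    free(p);
    return 0;
}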

free -m
              total        used        free      shared  buff/cache   available
Mem:            363          23         252          23          87         295
Swap:          1055         134         921

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1392
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1392
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited


# Machine is getting unresponsive and stalls for seconds, but never reaches more than 1055MB swap size (+ 384MB RAM)
vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free  buff  cache    si     so      bi     bo    in     cs us sy  id wa st
 0  0 136472 241196  1400  98544     4     57    1724     67   211    261  2  3  91  2  2
 0  0 136472 241228  1400  98540     0      0       0      0    30     48  0  0 100  0  0
 0  0 136472 241228  1408  98532     0      0       0     52    53     51  0  0  89 11  0
 0  0 136472 241224  1408  98540     0      0       0    112    44     92  0  0 100  0  0
 0  0 136472 241224  1408  98540     0      0       0      0    24     32  0  0 100  0  0
 0  0 136472 241352  1408  98540     0      0       0      0    31     44  0  1 100  0  0
 0  0 136472 241328  1408  98540     0      0       0     36    97    142  0  1  99  0  0
 0  0 136472 241364  1408  98540     0      0       0      0    22     30  0  0 100  0  0
 0  0 136472 241376  1416  98532     0      0       0     80    52     45  0  0  92  8  1
 1  0 136472   9236  1416  98548     0      0       8      0   762     55 11 23  66  0  0
 2  7 270496   3804   140  61172  1144 412268   15028 412340 92805 301836  1 49   1 27 22
 1 12 620320   4788   140  35240  1240 114864   96860 114976 46242  96395  1 26   0 61 12
 3 18 661436   4788   144  35568   508      0  167884      0  5605   8097  5 76   0 16  4
 3  4 661220   4288   144  34256   252      0  273684      0  7454   9777  3 71   0 19  7
 5 20 661024   4532   144  34772   320      0  238288      0  9452  12395  3 78   0 13  6
 6 19 660596   4592   144  35884   320      0  233160      8 12401  16798  5 67   0 12 15
 3 20 677268   4296   140  36816  2180  18200  444328  18332 19382  36234  8 67   0 11 14
 3 25 677208   4792   136  36044    68      0  524340     12 20637  26558  3 74   0 15  8
 2 21 687880   4964   136  38200   260  10784  311152  10884 17707  28941  4 78   0 12  5
 3 21 693808   4380   176  36860   136   6024  388932   6096 14576  22372  3 84   0  6  7
 3 27 693740   4432   152  38288    56  20736  419592  20744 23212  31219  4 87   0  7  2
 3 23 713696   4384   152  38172   796      0  481420     96 16498  27177  8 87   0  4  1
 3 27 713360   4116   152  38372  1844      0 1308552    296 25074  33901  5 85   0  9  1
 3 29 714628   4416   180  41992   256   2556  501832   2704 56498  76293  3 91   0  5  1
 3 29 714572   3860   172  41076   156      0  920736    152 12131  17339  5 94   0  0  0
 4 28 714396   5108   152  40124   212  10924  567648  11148 41901  56712  4 90   0  4  2
 3 30 725216   4060   136  40604   124      0  286384    156 21992  35505  5 91   0  2  3
 8 12 148836 230388   320  70888  5356      0   24304     52  9977  15084 17 75   0  5  3
 0  0 146692 271900   416  76680  2200      0    6592      0  1561   3198 10 10  78  2  1
 0  0 146584 271900   416  76892   152      0     184      0    75    139  0  0 100  0  1
 0  0 146488 271396   552  76980   128      0     264     36   124    230  0  1  98  1  0
 0  0 146372 271076   680  77196   124      0     252      8    79    167  0  0 100  0  0
 0  0 146312 270948   688  77332    64      0      64     80    61    102  0  0  97  3  1

What's wrong here?
Kernel Bug?


This all reminds me of the post here:
http://blog.nitrous.io/2014/03/10/stability-and-a-linux-oom-killer-bug.html
Last month, these outages began to happen more regularly but also very randomly. The symptoms were quite similar:
    CPU spiked to 100% utilization.
    Disk I/O spiked.
    Server became completely inaccessible via SSH, etc.
Logs show the Linux Out Of Memory (OOM) killer killing user processes that have hit their cgroup's memory limit shortly before the server froze. Host memory was not under pressure - it was close to fully utilized (which is normal) but there was a lot of unused swap.

Ciao,
Gerhard



