

From: Gerhard Wiesinger
Subject: Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
Date: Tue, 03 Mar 2015 21:50:33 +0100
User-agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0

On 03.03.2015 14:18, Gerhard Wiesinger wrote:
On 03.03.2015 13:28, Gerhard Wiesinger wrote:
On 03.03.2015 10:12, Gerhard Wiesinger wrote:
On 02.03.2015 18:15, Gerhard Wiesinger wrote:
On 02.03.2015 16:52, Gerhard Wiesinger wrote:
On 02.03.2015 10:26, Paolo Bonzini wrote:

On 01/03/2015 11:36, Gerhard Wiesinger wrote:
So far it has happened only on the PostgreSQL database VM. The kernel is alive
(ping works well), but ssh is not working.
Console window: after entering one character at the login prompt, it then crashed:
[1438.384864] Out of memory: Kill process 10115 (pg_dump) score 112 or sacrifice child
[1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB, anon-rss: 162712kB, file-rss: 220kB
Can you get a vmcore or at least sysrq-t output?

Yes, next time it happens I can analyze it.

I think there are 2 problems:
1.) OOM (Out of Memory) problem with the low memory settings and kernel settings (see below)
2.) Instability problem which might depend on 1.)

What I've done so far (thanks to Andrey Korolyov for ideas and help):
a.) Updated machine type from pc-0.15 to pc-i440fx-2.2
virsh dumpxml database | grep "<type"
    <type arch='x86_64' machine='pc-0.15'>hvm</type>

virsh edit database
virsh dumpxml database | grep "<type"
    <type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>

SMBIOS is therefore updated from 2.4 to 2.8:
dmesg|grep -i SMBIOS
[    0.000000] SMBIOS 2.8 present.
b.) Switched to tsc clock, kernel parameters: clocksource=tsc nohz=off highres=off (see the sketch after this list)
c.) Changed overcommit to 1
echo "vm.overcommit_memory = 1" > /etc/sysctl.d/overcommit.conf
d.) Tried 1 VCPU instead of 2
e.) Installed 512MB vRAM instead of 384MB
f.) Prepared for sysrq and vmcore
echo "kernel.sysrq = 1" > /etc/sysctl.d/sysrq.conf
sysctl -w kernel.sysrq=1
virsh send-key database KEY_LEFTALT KEY_SYSRQ KEY_T
virsh dump domain-name /tmp/dumpfile
g.) Further ideas, not yet done: disable memory ballooning by blacklisting the balloon driver or removing the memballoon device from the virsh XML config (see the sketch after this list)
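
For steps b.) and g.), something like the following could be used on a Fedora guest (just a sketch, the exact commands and file names may differ from what was/will be actually used):
# guest: append the clocksource parameters to all installed kernels
grubby --update-kernel=ALL --args="clocksource=tsc nohz=off highres=off"
# guest: blacklist the virtio balloon driver (file name is arbitrary)
echo "blacklist virtio_balloon" > /etc/modprobe.d/virtio_balloon.conf
# host: or remove the balloon device from the domain XML (virsh edit database):
#     <memballoon model='none'/>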

Summary:
1.) 512MB, tsc timer, 1VCPU, vm.overcommit_memory = 1: no OOM problem, no crash
2.) 512MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM problem, no crash

3.) 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM problem, no crash

3b.) Happened again during the nightly backup with the same configuration as in 3.) (384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1, pc-i440fx-2.2): no OOM problem, ping ok, no other reaction, BUT CRASHED again


3c.) configuration 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1, pc-i440fx-2.2: OOM problem, no crash

postgres invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Free swap  = 905924kB
Total swap = 1081340kB
Out of memory: Kill process 19312 (pg_dump) score 142 or sacrifice child
Killed process 19312 (pg_dump) total-vm:384516kB, anon-rss:119260kB, file-rss:0kB

An OOM should not occur:
https://www.kernel.org/doc/gorman/html/understand/understand016.html
Is there enough swap space left (nr_swap_pages > 0)? If yes, it is not OOM.

Why does an OOM condition occur? Looks like a bug in the kernel?
Any ideas?

# Allocating 800MB, killed by OOM killer
./mallocsleep 805306368
Killed

Out of memory: Kill process 27160 (mallocsleep) score 525 or sacrifice child
Killed process 27160 (mallocsleep) total-vm:790588kB, anon-rss:214948kB, file-rss:0kB
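
For reference, mallocsleep is just a trivial test program, roughly like this sketch (allocate the given number of bytes, touch every page so the memory is really backed by anonymous pages, then sleep):

/* mallocsleep.c - sketch: allocate <bytes>, touch the memory, then sleep */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    size_t size;
    char *p;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <bytes>\n", argv[0]);
        return 1;
    }
    size = (size_t)strtoull(argv[1], NULL, 10);

    p = malloc(size);
    if (p == NULL) {
        perror("malloc");
        return 1;
    }
    /* touch every byte so the allocation shows up as anon-rss */
    memset(p, 0xa5, size);

    /* keep the allocation alive so the OOM behaviour can be observed */
    sleep(3600);

    free(p);
    return 0;
}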

free -m
              total        used        free      shared  buff/cache   available
Mem:            363          23         252          23          87         295
Swap:          1055         134         921

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1392
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1392
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited


# Machine is getting unresponsive and stalls for seconds, but never reaches more than 1055MB swap size (+ 384MB RAM)
vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free  buff  cache    si     so      bi     bo    in     cs us sy  id wa st
 0  0 136472 241196  1400  98544     4     57    1724     67   211    261  2  3  91  2  2
 0  0 136472 241228  1400  98540     0      0       0      0    30     48  0  0 100  0  0
 0  0 136472 241228  1408  98532     0      0       0     52    53     51  0  0  89 11  0
 0  0 136472 241224  1408  98540     0      0       0    112    44     92  0  0 100  0  0
 0  0 136472 241224  1408  98540     0      0       0      0    24     32  0  0 100  0  0
 0  0 136472 241352  1408  98540     0      0       0      0    31     44  0  1 100  0  0
 0  0 136472 241328  1408  98540     0      0       0     36    97    142  0  1  99  0  0
 0  0 136472 241364  1408  98540     0      0       0      0    22     30  0  0 100  0  0
 0  0 136472 241376  1416  98532     0      0       0     80    52     45  0  0  92  8  1
 1  0 136472   9236  1416  98548     0      0       8      0   762     55 11 23  66  0  0
 2  7 270496   3804   140  61172  1144 412268   15028 412340 92805 301836  1 49   1 27 22
 1 12 620320   4788   140  35240  1240 114864   96860 114976 46242  96395  1 26   0 61 12
 3 18 661436   4788   144  35568   508      0  167884      0  5605   8097  5 76   0 16  4
 3  4 661220   4288   144  34256   252      0  273684      0  7454   9777  3 71   0 19  7
 5 20 661024   4532   144  34772   320      0  238288      0  9452  12395  3 78   0 13  6
 6 19 660596   4592   144  35884   320      0  233160      8 12401  16798  5 67   0 12 15
 3 20 677268   4296   140  36816  2180  18200  444328  18332 19382  36234  8 67   0 11 14
 3 25 677208   4792   136  36044    68      0  524340     12 20637  26558  3 74   0 15  8
 2 21 687880   4964   136  38200   260  10784  311152  10884 17707  28941  4 78   0 12  5
 3 21 693808   4380   176  36860   136   6024  388932   6096 14576  22372  3 84   0  6  7
 3 27 693740   4432   152  38288    56  20736  419592  20744 23212  31219  4 87   0  7  2
 3 23 713696   4384   152  38172   796      0  481420     96 16498  27177  8 87   0  4  1
 3 27 713360   4116   152  38372  1844      0 1308552    296 25074  33901  5 85   0  9  1
 3 29 714628   4416   180  41992   256   2556  501832   2704 56498  76293  3 91   0  5  1
 3 29 714572   3860   172  41076   156      0  920736    152 12131  17339  5 94   0  0  0
 4 28 714396   5108   152  40124   212  10924  567648  11148 41901  56712  4 90   0  4  2
 3 30 725216   4060   136  40604   124      0  286384    156 21992  35505  5 91   0  2  3
 8 12 148836 230388   320  70888  5356      0   24304     52  9977  15084 17 75   0  5  3
 0  0 146692 271900   416  76680  2200      0    6592      0  1561   3198 10 10  78  2  1
 0  0 146584 271900   416  76892   152      0     184      0    75    139  0  0 100  0  1
 0  0 146488 271396   552  76980   128      0     264     36   124    230  0  1  98  1  0
 0  0 146372 271076   680  77196   124      0     252      8    79    167  0  0 100  0  0
 0  0 146312 270948   688  77332    64      0      64     80    61    102  0  0  97  3  1

What's wrong here?
Kernel Bug?


This all reminds me of the post here:
http://blog.nitrous.io/2014/03/10/stability-and-a-linux-oom-killer-bug.html
Last month, these outages began to happen more regularly but also very randomly. The symptoms were quite similar:
    CPU spiked to 100% utilization.
    Disk I/O spiked.
    Server became completely inaccessible via SSH, etc.
Logs show the Linux Out Of Memory (OOM) killer killing user processes that have hit their cgroup's memory limit shortly before the server froze. Host memory was not under pressure - it was close to fully utilized (which is normal) but there was a lot of unused swap.

Ciao,
Gerhard



