On 03.03.2015 13:28, Gerhard Wiesinger wrote:
On 03.03.2015 10:12, Gerhard Wiesinger wrote:
On 02.03.2015 18:15, Gerhard Wiesinger wrote:
On 02.03.2015 16:52, Gerhard Wiesinger wrote:
On 02.03.2015 10:26, Paolo Bonzini wrote:
On 01/03/2015 11:36, Gerhard Wiesinger wrote:
So far it has happened only on the PostgreSQL database VM. The kernel is
alive (ping works well), but ssh is not working.
Console window: after entering one character at the login prompt, it
crashed:
[1438.384864] Out of memory: Kill process 10115 (pg_dump) score 112 or sacrifice child
[1438.384990] Killed process 10115 (pg_dump) total-vm:340548kB, anon-rss:162712kB, file-rss:220kB
Can you get a vmcore or at least sysrq-t output?
Yes, next time it happens I can analyze it.
I think there are 2 problems:
1.) OOM (Out of Memory) problem with the low memory settings and
kernel settings (see below)
2.) An instability problem, which might depend on 1.)
What I've done so far (thanks to Andrey Korolyov for ideas and help):
a.) Updated machine type from pc-0.15 to pc-i440fx-2.2
virsh dumpxml database | grep "<type"
<type arch='x86_64' machine='pc-0.15'>hvm</type>
virsh edit database
virsh dumpxml database | grep "<type"
<type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>
SMBIOS is therefore updated from 2.4 to 2.8:
dmesg|grep -i SMBIOS
[ 0.000000] SMBIOS 2.8 present.
b.) Switched to tsc clock, kernel parameters: clocksource=tsc
nohz=off highres=off
c.) Changed overcommit to 1
echo "vm.overcommit_memory = 1" > /etc/sysctl.d/overcommit.conf
d.) Tried 1 VCPU instead of 2
e.) Installed 512MB vRAM instead of 384MB
f.) Prepared for sysrq and vmcore
echo "kernel.sysrq = 1" > /etc/sysctl.d/sysrq.conf
sysctl -w kernel.sysrq=1
virsh send-key database KEY_LEFTALT KEY_SYSRQ KEY_T
virsh dump domain-name /tmp/dumpfile
g.) Further ideas, not yet done: disable memory ballooning by
blacklisting the balloon driver or removing it from the virsh XML config
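To confirm that the settings from b.), c.) and f.) are actually active inside the guest after a reboot, something like the following could be used. It is only a sketch: the function name show_vm_settings and its optional root-prefix argument are made up for illustration (the prefix exists only so the function can be pointed at a copy of /proc and /sys); the three paths are the standard procfs/sysfs locations.

```shell
#!/bin/sh
# Print the runtime values of the settings changed in steps b.), c.) and f.).
# $1 is an optional (hypothetical) root prefix, used only for testing against
# a copy of /proc and /sys; leave it empty to read the live system.
show_vm_settings() {
    root="${1:-}"
    for f in \
        proc/sys/vm/overcommit_memory \
        proc/sys/kernel/sysrq \
        sys/devices/system/clocksource/clocksource0/current_clocksource
    do
        if [ -r "$root/$f" ]; then
            printf '%s = %s\n' "/$f" "$(cat "$root/$f")"
        else
            printf '%s = (not readable)\n' "/$f"
        fi
    done
}
```

Calling show_vm_settings without arguments inside the VM prints one "path = value" line per setting.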
Summary:
1.) 512MB, tsc timer, 1VCPU, vm.overcommit_memory = 1: no OOM
problem, no crash
2.) 512MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM
problem, no crash
3.) 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM
problem, no crash
3b.) It happened again at the nightly backup with the same
configuration as in 3.) (384MB, kvm_clock, 2VCPU,
vm.overcommit_memory = 1, pc-i440fx-2.2): no OOM problem, ping ok, no
reaction, BUT CRASHED again
3c.) configuration 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1,
pc-i440fx-2.2: OOM problem, no crash
postgres invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Free swap = 905924kB
Total swap = 1081340kB
Out of memory: Kill process 19312 (pg_dump) score 142 or sacrifice child
Killed process 19312 (pg_dump) total-vm:384516kB, anon-rss:119260kB, file-rss:0kB
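To compare several of these incidents, the kill lines can be reduced to their numbers. A small sketch, assuming the messages are fed in unwrapped (one message per line, as dmesg prints them); the function name oom_kills is made up for illustration:

```shell
#!/bin/sh
# Read kernel log lines on stdin and print "pid name total_vm_kB anon_rss_kB"
# for every "Killed process" message. Lines that do not match are ignored.
oom_kills() {
    sed -n 's/.*Killed process \([0-9]*\) (\([^)]*\)) total-vm: *\([0-9]*\)kB, anon-rss: *\([0-9]*\)kB.*/\1 \2 \3 \4/p'
}
```

Typical use would be "dmesg | oom_kills".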
An OOM should not occur:
https://www.kernel.org/doc/gorman/html/understand/understand016.html
Is there enough swap space left (nr_swap_pages > 0)? If yes, not OOM
Why does an OOM condition occur? Looks like a bug in the kernel?
Any ideas?
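The swap condition quoted above can be approximated from userspace by looking at SwapFree in /proc/meminfo. This is only a rough stand-in for the in-kernel nr_swap_pages test, and the kernel's real decision takes more into account; the function names swap_free_kb and swap_left are made up for illustration.

```shell
#!/bin/sh
# Print the free swap in kB from a meminfo-style file.
# $1 is optional and defaults to /proc/meminfo.
swap_free_kb() {
    awk '$1 == "SwapFree:" { print $2 }' "${1:-/proc/meminfo}"
}

# Succeed (exit status 0) if at least some swap is still free.
swap_left() {
    kb=$(swap_free_kb "$1")
    [ -n "$kb" ] && [ "$kb" -gt 0 ]
}
```

For example, "swap_left && echo 'swap still available'" could be run in a loop while the backup is running.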
# Allocating 800MB, killed by OOM killer
./mallocsleep 805306368
Killed
Out of memory: Kill process 27160 (mallocsleep) score 525 or sacrifice child
Killed process 27160 (mallocsleep) total-vm:790588kB, anon-rss:214948kB, file-rss:0kB
free -m
              total        used        free      shared  buff/cache   available
Mem:            363          23         252          23          87         295
Swap:          1055         134         921
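For reference, the arithmetic behind the expectation that this allocation should fit: 805306368 bytes is 768 MiB, while the totals reported by free -m add up to 363 + 1055 = 1418 MiB. A small sketch of that comparison (the function name is made up, and it deliberately ignores memory used by the kernel and by other processes):

```shell
#!/bin/sh
# Does a request of $1 bytes fit into $2 MiB of RAM plus $3 MiB of swap?
# Prints both figures and returns 0 if the request is not larger than the sum.
fits_in_ram_plus_swap() {
    req_mib=$(( $1 / 1024 / 1024 ))
    avail_mib=$(( $2 + $3 ))
    echo "request=${req_mib}MiB available=${avail_mib}MiB"
    [ "$req_mib" -le "$avail_mib" ]
}

# Values from the mallocsleep run and the free -m totals.
fits_in_ram_plus_swap 805306368 363 1055
# prints: request=768MiB available=1418MiB
```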
ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1392
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 1392
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
# Machine becomes unresponsive and stalls for seconds, but never
uses more than the 1055MB of swap (+ 384MB RAM)
vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free buff cache   si     so      bi     bo    in     cs us sy  id wa st
 0  0 136472 241196 1400 98544    4     57    1724     67   211    261  2  3  91  2  2
 0  0 136472 241228 1400 98540    0      0       0      0    30     48  0  0 100  0  0
 0  0 136472 241228 1408 98532    0      0       0     52    53     51  0  0  89 11  0
 0  0 136472 241224 1408 98540    0      0       0    112    44     92  0  0 100  0  0
 0  0 136472 241224 1408 98540    0      0       0      0    24     32  0  0 100  0  0
 0  0 136472 241352 1408 98540    0      0       0      0    31     44  0  1 100  0  0
 0  0 136472 241328 1408 98540    0      0       0     36    97    142  0  1  99  0  0
 0  0 136472 241364 1408 98540    0      0       0      0    22     30  0  0 100  0  0
 0  0 136472 241376 1416 98532    0      0       0     80    52     45  0  0  92  8  1
 1  0 136472   9236 1416 98548    0      0       8      0   762     55 11 23  66  0  0
 2  7 270496   3804  140 61172 1144 412268   15028 412340 92805 301836  1 49   1 27 22
 1 12 620320   4788  140 35240 1240 114864   96860 114976 46242  96395  1 26   0 61 12
 3 18 661436   4788  144 35568  508      0  167884      0  5605   8097  5 76   0 16  4
 3  4 661220   4288  144 34256  252      0  273684      0  7454   9777  3 71   0 19  7
 5 20 661024   4532  144 34772  320      0  238288      0  9452  12395  3 78   0 13  6
 6 19 660596   4592  144 35884  320      0  233160      8 12401  16798  5 67   0 12 15
 3 20 677268   4296  140 36816 2180  18200  444328  18332 19382  36234  8 67   0 11 14
 3 25 677208   4792  136 36044   68      0  524340     12 20637  26558  3 74   0 15  8
 2 21 687880   4964  136 38200  260  10784  311152  10884 17707  28941  4 78   0 12  5
 3 21 693808   4380  176 36860  136   6024  388932   6096 14576  22372  3 84   0  6  7
 3 27 693740   4432  152 38288   56  20736  419592  20744 23212  31219  4 87   0  7  2
 3 23 713696   4384  152 38172  796      0  481420     96 16498  27177  8 87   0  4  1
 3 27 713360   4116  152 38372 1844      0 1308552    296 25074  33901  5 85   0  9  1
 3 29 714628   4416  180 41992  256   2556  501832   2704 56498  76293  3 91   0  5  1
 3 29 714572   3860  172 41076  156      0  920736    152 12131  17339  5 94   0  0  0
 4 28 714396   5108  152 40124  212  10924  567648  11148 41901  56712  4 90   0  4  2
 3 30 725216   4060  136 40604  124      0  286384    156 21992  35505  5 91   0  2  3
 8 12 148836 230388  320 70888 5356      0   24304     52  9977  15084 17 75   0  5  3
 0  0 146692 271900  416 76680 2200      0    6592      0  1561   3198 10 10  78  2  1
 0  0 146584 271900  416 76892  152      0     184      0    75    139  0  0 100  0  1
 0  0 146488 271396  552 76980  128      0     264     36   124    230  0  1  98  1  0
 0  0 146372 271076  680 77196  124      0     252      8    79    167  0  0 100  0  0
 0  0 146312 270948  688 77332   64      0      64     80    61    102  0  0  97  3  1
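In the capture above the largest swpd value is 725216 kB (about 708 MiB), well below the total swap of 1081340 kB. To pull that peak out of a longer capture automatically, a sketch along these lines could be used; the function name peak_swpd_kb is made up, and it assumes the 17-column vmstat layout shown above with each sample on a single line.

```shell
#!/bin/sh
# Read vmstat output on stdin and print the largest swpd value (kB) seen.
# Header lines are skipped by requiring a numeric first field and at least
# 17 columns; swpd is the third column.
peak_swpd_kb() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 17 { if ($3 + 0 > max) max = $3 + 0 }
         END { print max + 0 }'
}
```

For example, "vmstat 1 60 | peak_swpd_kb" samples for a minute and prints the peak.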
What's wrong here?
Kernel Bug?