|
From: | John Snow |
Subject: | Re: tests/vm infrastructure fails to notice that QEMU dying is a failure |
Date: | Fri, 17 Jul 2020 18:24:18 -0400 |
User-agent: | Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0 |
On 7/17/20 9:25 AM, Philippe Mathieu-Daudé wrote:
On 7/17/20 3:22 PM, Philippe Mathieu-Daudé wrote:Cc'ing John & Cleber. On 7/17/20 3:08 PM, Peter Maydell wrote:If you run vm-build-openbsd, our makefile/scripting infrastructure seems to fail to notice that death of the QEMU process that's running the VM should be a failure, and ends up allowing make to return a success condition. I have a script which runs a VM build which basically does: #!/bin/sh -e make -C "build" "vm-build-openbsd" J=8 V=1 echo "OK DONE openbsd" It just gave me this output (tail end of logfile). We're executing tests, and then the qemu-system-x86_64 that's running the OpenBSD VM gets a signal 9 (sigkill), for unclear reasons (oom killer??). The python scripting gets an exception, but doesn't exit with a failure status to make, which then thinks all is fine, exits success itself and allows the set -e script to proceed to print the "OK DONE" line...
The signal 9 is likely due to QEMU freezing on exit. The machine.py code waits about 3 seconds after trying to shut QEMU down gracefully before sending SIGKILL and cleaning up forcefully.
PASS 30 qos-test /arm/imx25-pdk/imx.i2c/i2c-bus/pca9552/pca9552-tests/tx-rx PASS 31 qos-test /arm/imx25-pdk/imx.i2c/i2c-bus/pca9552/pca9552-tests/rx-autoinc PASS 32 qos-test /arm/imx25-pdk/imx.i2c/i2c-bus/ds1338/ds1338-tests/tx-rx DEBUG:QMP.qemu-26462:>>> {'execute': 'quit'} DEBUG:QMP.qemu-26462:<<< {'timestamp': {'seconds': 1594984057, 'microseconds': 485197}, 'event': 'NIC_RX_FILTER_CHANGED', 'data': {'path': '/machine/peripheral-anon/device[0]/virtio-backend'}} DEBUG:QMP.qemu-26462:<<< {'timestamp': {'seconds': 1594985855, 'microseconds': 169552}, 'event': 'RTC_CHANGE', 'data': {'offset': 0}} DEBUG:QMP.qemu-26462:<<< {'timestamp': {'seconds': 1594987655, 'microseconds': 169187}, 'event': 'RTC_CHANGE', 'data': {'offset': 0}} DEBUG:QMP.qemu-26462:<<< {'timestamp': {'seconds': 1594989456, 'microseconds': 88866}, 'event': 'RTC_CHANGE', 'data': {'offset': 0}} DEBUG:QMP.qemu-26462:<<< {'return': {}} WARNING:qemu.machine:qemu received signal 9; command: "qemu-system-x86_64 -display none -vga none -chardev socket,id=mon,path=/var/tmp/qemu-26462-monitor.sock -mon chardev=mon,mode=control -machine pc -chardev socket,id=console,path=/var/tmp/qemu-26462-console.sock,server,nowait -serial chardev:console -nodefaults -m 4G -cpu max -netdev user,id=vnet,hostfwd=:127.0.0.1:0-:22 -device virtio-net-pci,netdev=vnet -vnc 127.0.0.1:0,to=20 -smp 8 -enable-kvm -drive file=/home/peter.maydell/.cache/qemu-vm/images/openbsd.img,snapshot=on,if=none,id=drive0,cache=writeback -device virtio-blk,drive=drive0,bootindex=0 -drive file=/home/peter.maydell/qemu-openbsd/build/vm-test-yzwn6xdc.tmp/data-993a1.tar,if=none,id=data-993a1,cache=writeback,format=raw -device virtio-blk,drive=data-993a1,serial=data-993a1,bootindex=1" Error in atexit._run_exitfuncs: Traceback (most recent call last): File "/home/peter.maydell/qemu-openbsd/tests/vm/../../python/qemu/machine.py", line 436, in _do_shutdown self._soft_shutdown(has_quit, timeout) File "/home/peter.maydell/qemu-openbsd/tests/vm/../../python/qemu/machine.py", line 419, in _soft_shutdown self._popen.wait(timeout=timeout) File "/usr/lib/python3.6/subprocess.py", line 1469, in wait raise TimeoutExpired(self.args, timeout)
This is where it times out waiting for the process to exit of its own volition. This is the "Inner" exception ...
subprocess.TimeoutExpired: Command '['qemu-system-x86_64', '-display', 'none', '-vga', 'none', '-chardev', 'socket,id=mon,path=/var/tmp/qemu-26462-monitor.sock', '-mon', 'chardev=mon,mode=control', '-machine', 'pc', '-chardev', 'socket,id=console,path=/var/tmp/qemu-26462-console.sock,server,nowait', '-serial', 'chardev:console', '-nodefaults', '-m', '4G', '-cpu', 'max', '-netdev', 'user,id=vnet,hostfwd=:127.0.0.1:0-:22', '-device', 'virtio-net-pci,netdev=vnet', '-vnc', '127.0.0.1:0,to=20', '-smp', '8', '-enable-kvm', '-drive', 'file=/home/peter.maydell/.cache/qemu-vm/images/openbsd.img,snapshot=on,if=none,id=drive0,cache=writeback', '-device', 'virtio-blk,drive=drive0,bootindex=0', '-drive', 'file=/home/peter.maydell/qemu-openbsd/build/vm-test-yzwn6xdc.tmp/data-993a1.tar,if=none,id=data-993a1,cache=writeback,format=raw', '-device', 'virtio-blk,drive=data-993a1,serial=data-993a1,bootindex=1']' timed out after 3 seconds The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/home/peter.maydell/qemu-openbsd/tests/vm/../../python/qemu/machine.py", line 466, in shutdown self._do_shutdown(has_quit, timeout=timeout) File "/home/peter.maydell/qemu-openbsd/tests/vm/../../python/qemu/machine.py", line 440, in _do_shutdown from exc qemu.machine.AbnormalShutdown: Could not perform graceful shutdown
And this is the "Outer" exception, the machine.py cleanup handler letting you know that while it did perform cleanup, QEMU was not shut down properly.
make: Leaving directory '/home/peter.maydell/qemu-openbsd/build' OK DONE openbsdJohn it seems due to the addition of the 'finally' clause in 193bf1c061 ("python/machine.py: split shutdown into hard and soft flavors").
That's surprising. I'll look into this Monday, but the "Finally: " should only perform some actions without suppressing the exception. You can see here it DIDN'T suppress it.
...but maybe we don't return non-zero if an exception gets raised during atexit()? Maybe that's what's happening here.
... ... Ah, guess what, that's exactly it. atexit suppresses exceptions. You just never noticed this problem before because machine.py used to suppress problems during shutdown instead of reporting them.
``` #!/usr/bin/env python3 import atexit from qemu import machine def handler(): raise machine.AbnormalShutdown("oh no") atexit.register(handler) ``` Oops, this returns a zero code. So:- basevm.py should likely be converted to a context manager object just like machine.py's QEMUMachine class was so cleanup can happen explicitly when the VM object goes out of scope.
(Callers would use: "with baseVM() as foo:" pattern, or, where that's not possible, call an explicit .shutdown().)
- The real problem, though: Why is QEMU hanging? It might need a longer timeout, or it might be having problems with the console socket again.
(CC Robert Foley who has been working on the Console socket draining problems. Maybe he has some insight here?)
--js
[Prev in Thread] | Current Thread | [Next in Thread] |