qemu-devel
Re: [Qemu-devel] [Qemu-block] Aborts in iotest 169


From: Kevin Wolf
Subject: Re: [Qemu-devel] [Qemu-block] Aborts in iotest 169
Date: Thu, 24 Jan 2019 11:45:01 +0100
User-agent: Mutt/1.10.1 (2018-07-13)

Am 24.01.2019 um 11:32 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 24.01.2019 13:15, Kevin Wolf wrote:
> > Am 24.01.2019 um 10:29 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >> 23.01.2019 18:48, Max Reitz wrote:
> >>> Hi,
> >>>
> >>> When running 169 in parallel (e.g. like so:
> >>>
> >>> $ while TEST_DIR=/tmp/t0 ./check -T -qcow2 169; do; done
> >>> $ while TEST_DIR=/tmp/t1 ./check -T -qcow2 169; do; done
> >>> $ while TEST_DIR=/tmp/t2 ./check -T -qcow2 169; do; done
> >>> $ while TEST_DIR=/tmp/t3 ./check -T -qcow2 169; do; done
> >>>
> >>> in four different shells), I get aborts:
> >>>
> >>> (Often I get segfaults, but that's because of
> >>> http://lists.nongnu.org/archive/html/qemu-devel/2018-12/msg05579.html --
> >>> feel free to apply the attached patch to make them go away)
> >>>
> >>>
> >>> WARNING:qemu:qemu received signal 6:
> >>> build/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64
> >>> -chardev socket,id=mon,path=/tmp/t0/tmpbX30XU/qemua-25745-monitor.sock
> >>> -mon chardev=mon,mode=control -display none -vga none -qtest
> >>> unix:path=/tmp/t0/qemua-25745-qtest.sock -machine accel=qtest
> >>> -nodefaults -machine accel=qtest -drive
> >>> if=virtio,id=drive0,file=/tmp/t0/disk_a,format=qcow2,cache=writeback
> >>> .................E..
> >>> ======================================================================
> >>> ERROR:
> >>> test_do_test_migration_resume_source_not_persistent__not_migbitmap
> >>> (__main__.TestDirtyBitmapMigration)
> >>> ----------------------------------------------------------------------
> >>> Traceback (most recent call last):
> >>>     File "169", line 206, in <lambda>
> >>>       setattr(klass, 'test_' + method + name, lambda self: mc(self))
> >>>     File "169", line 113, in do_test_migration_resume_source
> >>>       self.check_bitmap(self.vm_a, sha256)
> >>>     File "169", line 72, in check_bitmap
> >>>       node='drive0', name='bitmap0')
> >>>     File "tests/qemu-iotests/../../scripts/qemu.py", line 369, in qmp
> >>>       return self._qmp.cmd(cmd, args=qmp_args)
> >>>     File "tests/qemu-iotests/../../scripts/qmp/qmp.py", line 191, in cmd
> >>>       return self.cmd_obj(qmp_cmd)
> >>>     File "tests/qemu-iotests/../../scripts/qmp/qmp.py", line 174, in 
> >>> cmd_obj
> >>>       resp = self.__json_read()
> >>>     File "tests/qemu-iotests/../../scripts/qmp/qmp.py", line 82, in
> >>> __json_read
> >>>       data = self.__sockfile.readline()
> >>>     File "/usr/lib64/python2.7/socket.py", line 451, in readline
> >>>       data = self._sock.recv(self._rbufsize)
> >>> error: [Errno 104] Connection reset by peer
> >>>
> >>> ----------------------------------------------------------------------
> >>> Ran 20 tests
> >>>
> >>> FAILED (errors=1)
> >>>
> >>>
> >>> Or:
> >>>
> >>> WARNING:qemu:qemu received signal 6:
> >>> build/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64
> >>> -chardev socket,id=mon,path=/tmp/t3/tmp0pllWD/qemua-3445-monitor.sock
> >>> -mon chardev=mon,mode=control -display none -vga none -qtest
> >>> unix:path=/tmp/t3/qemua-3445-qtest.sock -machine accel=qtest -nodefaults
> >>> -machine accel=qtest -drive
> >>> if=virtio,id=drive0,file=/tmp/t3/disk_a,format=qcow2,cache=writeback
> >>> WARNING:qemu:qemu received signal 6:
> >>> build/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64
> >>> -chardev socket,id=mon,path=/tmp/t3/tmp0pllWD/qemua-3445-monitor.sock
> >>> -mon chardev=mon,mode=control -display none -vga none -qtest
> >>> unix:path=/tmp/t3/qemua-3445-qtest.sock -machine accel=qtest -nodefaults
> >>> -machine accel=qtest -drive
> >>> if=virtio,id=drive0,file=/tmp/t3/disk_a,format=qcow2,cache=writeback
> >>>
> >>> ...................F
> >>> ======================================================================
> >>> FAIL: test_do_test_migration_resume_source_persistent__not_migbitmap
> >>> (__main__.TestDirtyBitmapMigration)
> >>> ----------------------------------------------------------------------
> >>> Traceback (most recent call last):
> >>>     File "169", line 206, in <lambda>
> >>>       setattr(klass, 'test_' + method + name, lambda self: mc(self))
> >>>     File "169", line 125, in do_test_migration_resume_source
> >>>       self.assertEqual(log, '')
> >>> AssertionError: "qemu-system-x86_64: invalid runstate transition:
> >>> 'running' -> 'postmigrate'\n" != ''
> >>>
> >>> ----------------------------------------------------------------------
> >>> Ran 20 tests
> >>>
> >>> FAILED (failures=1)
> >>>
> >>>
> >>> The backtrace always goes like this:
> >>>
> >>> (gdb) bt
> >>> #0  0x00007f0acf5cc53f in raise () at /lib64/libc.so.6
> >>> #1  0x00007f0acf5b6895 in abort () at /lib64/libc.so.6
> >>> #2  0x000055a46ebbb1a6 in runstate_set (new_state=RUN_STATE_POSTMIGRATE)
> >>> at vl.c:742
> >>> #3  0x000055a46ebbb1a6 in runstate_set
> >>> (address@hidden) at vl.c:730
> >>> #4  0x000055a46ed39129 in migration_iteration_finish (s=0x55a4708be000)
> >>> at migration/migration.c:2972
> >>> #5  0x000055a46ed39129 in migration_thread
> >>> (address@hidden) at migration/migration.c:3130
> >>> #6  0x000055a46eea665a in qemu_thread_start (args=<optimized out>) at
> >>> util/qemu-thread-posix.c:502
> >>>
> >>>
> >>> #7  0x00007f0acf76258e in start_thread () at /lib64/libpthread.so.0
> >>> #8  0x00007f0acf6916a3 in clone () at /lib64/libc.so.6
> >>> (gdb) frame 2
> >>> #2  0x000055a46ebbb1a6 in runstate_set (new_state=RUN_STATE_POSTMIGRATE)
> >>> at vl.c:742
> >>> 742             abort();
> >>> (gdb) print current_run_state
> >>> $1 = RUN_STATE_RUNNING
> >>>
> >>>
> >>> Neither migration nor runstates are my strong suit, so I thought I'd
> >>> report it before diving into it.
> >>>
> >>> Max
> >>>
> >>
> >> [...]
> >> address@hidden:migrate_set_state new state active
> >> address@hidden:migration_thread_setup_complete
> >> address@hidden:migrate_transferred transferred 985298 time_spent 100 
> >> bandwidth 9852 max_size 2955894
> >> address@hidden:migration_bitmap_sync_start
> >> address@hidden:migration_bitmap_sync_end dirty_pages 32898
> >> address@hidden:migration_thread_low_pending 2048
> >> migration_completion
> >> address@hidden:runstate_set current_state 9 new_state 7
> >> address@hidden:migration_bitmap_sync_start
> >> address@hidden:migration_bitmap_sync_end dirty_pages 32898
> >> address@hidden:migrate_global_state_pre_save saved state: running
> >> [1] address@hidden:migrate_set_state new state completed
> >> address@hidden:handle_qmp_command mon 0x55aebef99830 req: {"execute": 
> >> "x-debug-block-dirty-bitmap-sha256", "arguments": {"name": "bitmap0", 
> >> "node": "drive0"}}
> >> address@hidden:migration_thread_after_loop
> >> [2] address@hidden:handle_qmp_command mon 0x55aebef99830 req: {"execute": 
> >> "cont"}
> >> address@hidden:runstate_set current_state 7 new_state 9
> >> qemu-system-x86_64: invalid runstate transition: 'running' -> 'postmigrate'
> >>
> >>
> >> Aha, so in the test we wait for MIGRATION COMPLETED [1], run qmp_cont [2]
> >> and go to RUNNING.
> >> Then the migration thread reaches migration_iteration_finish() and fails
> >> to transition to POSTMIGRATE.
> >> (Note that this test case is about resuming the source after migration.)
> >>
> >> So a fix for the test may be to additionally wait for POSTMIGRATE, not
> >> only for MIGRATION COMPLETED.
> >>
> >> But how do we fix QEMU so it doesn't crash? Maybe forbid some transitions
> >> (FINISH_MIGRATE -> RUNNING), or at least make qmp_cont error out if the
> >> runstate is FINISH_MIGRATE?
> > 
> > qmp_cont currently checks for RUN_STATE_INMIGRATE and gives it special
> > treatment. As a quick hack, doing the same for RUN_STATE_FINISH_MIGRATE
> > might fix the problem.
> > A better, still simple fix would be a whitelist of states in which
> > 'cont' is allowed, to prevent the user from messing with other
> > internal states.
> > 
> > However, like the problem that Dave reported a few days ago (migrating
> > twice leads to a crash because we try to inactivate already inactive
> > block devices), I think this is a symptom of a larger problem: we're
> > not taking the state machine seriously enough. One-off checks here and
> > there are unlikely to ever fully solve the problem of running a command
> > in a runstate in which it was never supposed to run.
> > 
> > I wonder whether the QAPI schema should have a field 'run-states' for
> > commands, and by default we would only include states where the VM has
> > ownership of its resources (e.g. images are activated) and which are not
> > temporary states that are automatically left, like finish-migrate.
> > 
> > Then the default for commands would be to be rejected in "unusual"
> > runstates where we're not expecting user intervention, and we would
> > have to explicitly allow them where they are in fact okay.
> > 
> > Instead of listing every obscure runstate, maybe we should really use
> > categories of runstates instead:
> > 
> > 1. Running
> > 2. Paused, owns all resources (like disk images)
> > 3. Paused, doesn't own some resources (source VM after migration
> >     completes, destination before migration completes)
> > 4. Paused temporarily for internal reasons (e.g. finish-migrate,
> >     restore-vm, save-vm)
> > 
> > Most commands should be okay with 1 and 2, but possibly not 3, and
> > almost never 4.
> 
> And then we need a mechanism to block state transitions while
> operations that don't support the new state are ongoing.

This essentially means not allowing those QMP commands that could cause
a state transition (primarily cont/stop, but probably some more).

State transitions that happen automatically should in theory always be
correct; everything else is a bug, and abort() is the right way to stop
it (like we already do).

Kevin


