From: Vladimir Sementsov-Ogievskiy
Subject: Re: [Qemu-devel] [PULL 21/35] block: fix QEMU crash with scsi-hd and drive_del
Date: Wed, 8 Aug 2018 17:32:23 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.6.0

08.08.2018 12:33, Vladimir Sementsov-Ogievskiy wrote:
07.08.2018 22:57, Eric Blake wrote:
On 08/06/2018 05:04 PM, Eric Blake wrote:
On 06/18/2018 11:44 AM, Kevin Wolf wrote:
From: Greg Kurz <address@hidden>

Removing a drive with drive_del while it is being used to run an I/O
intensive workload can cause QEMU to crash.

...


Test 83 sets up a client that intentionally disconnects at critical points in the NBD protocol exchange, to ensure that the server reacts sanely.

Rather, nbd-fault-injector.py is a server that disconnects at critical points, and the test checks the client's reaction.

I suspect that somewhere in the NBD code, the server detects the disconnect and somehow calls into blk_remove_bs() (although I could not quickly find the backtrace). Prior to this patch, the 'Connection closed' message resulted from other NBD coroutines getting a shot at the (now-closed) connection, while after this patch, the additional blk_drain() somehow tweaks things in a way that prevents the other NBD coroutines from printing the message.  If so, then the change in the 83 reference output is probably intentional, and we should update it.

It seems like this condition is racy, and that the race is more likely to be lost prior to this patch than after. It's a question of whether the client has time to start a request to the server prior to the server hanging up, as the message is generated during nbd_co_do_receive_one_chunk.  Here's a demonstration of the fact that things are racy:

$ git revert f45280cbf
$ make
$ cd tests/qemu-iotests
$ cat fault.txt
[inject-error "a"]
event=neg2
when=after
$ python nbd-fault-injector.py localhost:10809 ./fault.txt &
Listening on 127.0.0.1:10809
$ ../../qemu-io -f raw nbd://localhost:10809 -c 'r 0 512'
Closing connection on rule match inject-error "a"
Connection closed
read failed: Input/output error
$ python nbd-fault-injector.py localhost:10809 ./fault.txt &
Listening on 127.0.0.1:10809
$ ../../qemu-io -f raw nbd://localhost:10809
Closing connection on rule match inject-error "a"
qemu-io> r 0 512
read failed: Input/output error
qemu-io> q

So whether the read command is kicked off quickly (via -c) or slowly (by typing into qemu-io) determines whether the message appears.

What's more, in commit f140e300, we specifically called out in the commit message that maybe it was better to trace when we detect connection closed rather than log it to stdout, and in all cases in that commit, the additional 'Connection closed' messages do not add any information to the error message already displayed by the rest of the code.
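(If those messages did become trace points instead, the same information would still be available on demand. A minimal sketch, assuming a build with a trace backend enabled and using only a wildcard event pattern rather than any specific event name:

$ ../../qemu-io --trace 'nbd_*' -f raw nbd://localhost:10809 -c 'r 0 512'

which should show the matching NBD trace events alongside the usual qemu-io output, without cluttering the normal stdout that iotests compare against.)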

I don't know how much the proposed NBD reconnect code will change things in 3.1.  Meanwhile, we've missed any chance for 3.0 to fix test 83.


But I'm having a hard time convincing myself that this is the case, particularly since I'm not even sure how to easily debug the assumptions I made above.

I'm also very weak on the whole notion of what blk_drain() vs. blk_remove_bs() is really supposed to be doing, so I could easily be persuaded that the change in output is a regression instead of a fix.

At this point, I don't think we have a regression, merely a bad iotests reference output. The extra blk_drain() just adds more time before the NBD code can send out its first request, and thus makes it more likely that the fault injector has closed the connection before the read request is issued rather than after (the message only appears when the read beats the race). But the NBD code shouldn't be printing the error message in the first place, and 083 needs to be tweaked to remove the noisy lines added in f140e300 (not just the three lines that reliably differ because of this patch, but all other such lines caused by strategic server drops at other points in the NBD protocol).
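(As a rough sketch of one way to do that, not an existing helper: the usual iotests pattern would be to add a filter along these lines to tests/qemu-iotests/common.filter and pipe 083's output through it, so the racy messages never reach the reference output. The filter name and placement here are assumptions:

# hypothetical filter for common.filter; drops the racy message lines
_filter_nbd_connection_closed()
{
    sed -e '/^Connection closed$/d'
}

Alternatively, the lines could simply be dropped from 083.out once the messages themselves are removed or turned into traces.)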


OK, agreed, I'll do it in the reconnect series.



hmm, do what?

I was going to change these error messages into trace points, but now I'm not sure that it's a good idea. The errp returned from the function is generic, so why drop the more specific message from the logs? Fixing an iotest is not a good enough reason; it would be better to adjust the iotest itself a bit (just commit the changed output) and forget about it. Is the iotest itself racy? Did you see different output when actually running iotest 83, as opposed to testing by hand?
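(If we just commit the changed output, regenerating the reference file is mechanical; a sketch of the usual iotests workflow, assuming the harness leaves an 083.out.bad file on mismatch:

$ cd tests/qemu-iotests
$ ./check -nbd 083           # currently fails, leaving the new output in 083.out.bad
$ mv 083.out.bad 083.out     # adopt it as the new reference output
$ git add 083.out

That only helps, of course, if the remaining differences are deterministic and not themselves racy.)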

--
Best regards,
Vladimir



