qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none


From: Paolo Bonzini
Subject: Re: [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none and io=threads
Date: Wed, 16 May 2018 09:47:42 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.7.0

On 15/05/2018 23:25, Daniel Henrique Barboza wrote:
> This is the current status of this investigation. I decided to start a
> discussion here, see if someone can point me something that I overlooked
> or got it wrong, before I started changing the POSIX thread pool
> behavior to see if I can enforce one specific POSIX thread to do a
> read() if we had a write() done in the same fd. Any suggestions?

Copying from the bug:

> Unless we learn something new, my understanding is that we're dealing
> with a host side limitation/bug when calling pwritev() in a different
> thread than a following preadv(), using the same file descriptor
> opened with O_DIRECT and no WCE in the host side, the kernel can't
> grant data coherency, e.g:
> 
> - thread A executes a pwritev() writing dataA in the disk
> 
> - thread B executes a preadv() call to read the data, but this
> preadv() call isn't aware of the previous pwritev() call done in
> thread A, thus the guarantee of the preadv() call reading dataA isn't
> assured (as opposed to what is described in man 3 write)
> 
> - the physical disk, due to the heavy load of the stress test, didn't
> finish writing up dataA. Since the disk itself doesn't have any
> internal cache to rely on, the preadv() call goes in and read an old
> data that differs from dataA.

There is a problem in the reasoning of the third point: if the physical
disk hasn't yet finished writing up dataA, pwritev() shouldn't have
returned.  This could be a bug in the kernel, or even in the disk.  I
suspect the kernel because SCSI passthrough doesn't show the bug; SCSI
passthrough uses ioctl() which completes exactly when the disk tells
QEMU that the command is done---it cannot report completion too early.

(Another small problem in the third point is that the disk actually does
have a cache.  But the cache should be transparent, if it weren't the
bug would be in the disk firmware).

It has to be debugged and fixed in the kernel.  The thread pool is
just... a thread pool, and shouldn't be working around bugs, especially
as serious as these.

A more likely possibility: maybe the disk has 4K sectors and QEMU is
doing read-modify-write cycles to emulate 512 byte sectors?  In this
case, mismatches are not expected, since QEMU serializes RMW cycles, but
at least we would know that the bug would be in QEMU, and where.

Paolo



reply via email to

[Prev in Thread] Current Thread [Next in Thread]