qemu-devel

Re: [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none


From: Dr. David Alan Gilbert
Subject: Re: [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none and io=threads
Date: Wed, 16 May 2018 10:47:31 +0100
User-agent: Mutt/1.9.5 (2018-04-13)

* Daniel Henrique Barboza (address@hidden) wrote:
> Hi,
> 
> I've been working for the last two months on a miscompare issue that happens
> when using a RAID device and a SATA drive as scsi-hd (emulated SCSI) with
> cache=none and io=threads during a hardware stress test. I'll summarize it
> here as best I can without creating a great wall of text - Red Hat folks
> can check [1] for all the details.
> 
> Using the following setup:
> 
> - Host is a POWER9 RHEL 7.5-alt: kernel 4.14.0-49.1.1.el7a.ppc64le,
> qemu-kvm-ma 2.10.0-20.el7 (also reproducible with upstream QEMU)
> 
> - Guest is RHEL 7.5-alt using the same kernel as the host, using two storage
> disks (a 1.8 TB RAID and a 446 GB SATA drive) as follows:
> 
>     <disk type='block' device='disk'>
>       <driver name='qemu' type='raw' cache='none'/>
>       <source dev='/dev/disk/by-id/scsi-3600605b000a2c110ff0004053d84a61b'/>
>       <target dev='sdc' bus='scsi'/>
>       <alias name='scsi0-0-0-2'/>
>       <address type='drive' controller='0' bus='0' target='0' unit='2'/>
>     </disk>
> 
> Both block devices have WCE off in the host.
> 
> With this env, we found problems when running a stress test called HTX [2].
> At a given time (usually after 24+ hours of test) HTX finds a data
> miscompare in one of the devices. This is an example:
> 
> -------
> 
> Device name: /dev/sdb
> Total blocks: 0x74706daf, Block size: 0x200
> Rule file name: /usr/lpp/htx/rules/reg/hxestorage/default.hdd
> Number of Rulefile passes (cycle) completed: 0
> Stanza running: rule_6, Thread no.: 8
> Oper performed: wrc, Current seek type: SEQ
> LBA no. where IO started: 0x94fa
> Transfer size: 0x8400
> 
> Miscompare Summary:
> ===================
> LBA no. where miscomapre started:     0x94fa
> LBA no. where miscomapre ended:       0x94ff
> Miscompare start offset (in bytes):   0x8
> Miscomapre end offset (in bytes):     0xbff
> Miscompare size (in bytes):           0xbf8
> 
> Expected data (at miscomapre offset): 8c9aea5a736462000000000000007275
> Actual data (at miscomapre offset): 889aea5a736462000000000000007275

Are all the miscompares single bit errors like that one?
Is the test doing single bit manipulation or is that coming out of the
blue?
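
For the one sample quoted above, a quick way to see what changed is to XOR
the two hex strings. This is only a sketch in Python over the 16 bytes HTX
printed; bit_diff, flipped_bit_count and span_bytes are names I made up for
illustration, and it says nothing about the rest of the 0xbf8-byte region.

```python
# Sketch: compare the "Expected" and "Actual" hex strings from the HTX report.
# Only the 16 bytes printed in the report are examined.

def bit_diff(expected_hex, actual_hex):
    """Return (byte_index, xor_mask) for every byte that differs."""
    expected = bytes.fromhex(expected_hex)
    actual = bytes.fromhex(actual_hex)
    if len(expected) != len(actual):
        raise ValueError("samples must have the same length")
    return [(i, e ^ a) for i, (e, a) in enumerate(zip(expected, actual)) if e != a]

def flipped_bit_count(diffs):
    """Count the bits set across all XOR masks."""
    return sum(bin(mask).count("1") for _, mask in diffs)

def span_bytes(start_offset, end_offset):
    """Size of an inclusive byte range, as in the Miscompare Summary."""
    return end_offset - start_offset + 1

EXPECTED = "8c9aea5a736462000000000000007275"
ACTUAL = "889aea5a736462000000000000007275"

diffs = bit_diff(EXPECTED, ACTUAL)
print(diffs)                          # [(0, 4)]: byte 0 differs, mask 0x04
print(flipped_bit_count(diffs))       # 1
print(hex(span_bytes(0x8, 0xbff)))    # 0xbf8, matching the reported size
```

So in this sample 0x8c was read back as 0x88, a single cleared bit in the
first byte, and the reported size is consistent with the start/end offsets.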

Dave

> -----
> 
> 
> This means that the test executed a write at LBA 0x94fa and, after
> confirming that the write had completed, issued two reads at the same LBA to
> verify the written contents and found a mismatch.
> 
> 
> I've tested all sorts of configurations, varying disk vs LUN, cache modes and
> AIO. My findings are:
> 
> - using device='lun' instead of device='disk', I can't reproduce the issue,
> no matter what the other settings are;
> - using device='disk' but with cache='writethrough', the issue doesn't happen
> (I haven't checked other cache modes);
> - using device='disk', cache='none' and io='native', the issue doesn't happen.
> 
> 
> The issue seems to be tied to the combination device=disk + cache=none +
> io=threads. I've started digging into the SCSI layer all the way down to the
> block backend. With a shameful amount of logs I've discovered that, for the
> write on which the test finds a miscompare, in block/file-posix.c:
> 
> - when doing the write, handle_aiocb_rw_vector() returns success and pwritev()
> reports that all bytes were written
> - in both reads after the write, handle_aiocb_rw_vector() returns success and
> all bytes are read by preadv(). In both reads, the data read is different
> from the data written by the pwritev() that happened before
> 
> In the discussions at [1], Fam Zheng suggested a test in which we would take
> down the number of threads created in the POSIX thread pool from 64 to 1.
> The idea is to ensure that we're using the same thread to write and read.
> There was a suspicion that the kernel can't guarantee data coherency between
> different threads, even if using the same fd, when using pwritev() and
> preadv(). This would explain why the following reads in the same fd would
> fail to retrieve the same data that was written before. After doing this
> modification, the miscompare didn't reproduce.
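
A minimal sketch of that kind of cross-thread check, in Python rather than
QEMU's C, in case it helps to poke at this outside QEMU. Assumptions: it uses
a scratch temporary file without O_DIRECT, so it goes through the page cache
and is not expected to reproduce the bug by itself; cross_thread_check and the
two single-worker pools are made-up names, not anything from QEMU's thread
pool. It only shows the pattern: pwritev() on one thread, preadv() of the
same range on another thread, same fd.

```python
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor

SECTOR = 0x200  # block size reported by HTX

def _write(fd, payload, offset):
    # Runs on the writer thread; returns bytes written and the thread id.
    return os.pwritev(fd, [payload], offset), threading.get_ident()

def _read(fd, length, offset):
    # Runs on the reader thread; returns bytes read, the data and the thread id.
    buf = bytearray(length)
    nread = os.preadv(fd, [buf], offset)
    return nread, bytes(buf), threading.get_ident()

def cross_thread_check(rounds=32, start_lba=0x94fa, sectors=4):
    """Write on one thread, read back on another; count mismatching rounds."""
    mismatches = 0
    same_thread_rounds = 0
    # Two single-worker pools guarantee the write and the read never share a thread.
    with tempfile.TemporaryFile() as f, \
            ThreadPoolExecutor(max_workers=1) as writer, \
            ThreadPoolExecutor(max_workers=1) as reader:
        fd = f.fileno()
        for i in range(rounds):
            payload = bytes([i % 256]) * (sectors * SECTOR)
            offset = (start_lba + i * sectors) * SECTOR
            # Wait for the write to complete before issuing the read.
            written, wtid = writer.submit(_write, fd, payload, offset).result()
            nread, data, rtid = reader.submit(_read, fd, len(payload), offset).result()
            if written != len(payload) or nread != len(payload) or data != payload:
                mismatches += 1
            if wtid == rtid:
                same_thread_rounds += 1
    return mismatches, same_thread_rounds

if __name__ == "__main__":
    print(cross_thread_check())
```

To get closer to the failing setup you would have to open the real block
device with O_DIRECT and use aligned buffers, which this sketch deliberately
leaves out.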
> 
> After reverting the thread pool number change, I've made a couple of
> attempts at flushing before read() and flushing after write(). Both
> attempts failed - the miscompare appears in both scenarios. This reinforces
> the suspicion we have above - if data coherency can't be guaranteed between
> different threads, flushing in different threads wouldn't make a difference
> either. I've also tested a suggestion from Fam where I started the disks with
> "cache.direct=on,cache.no-flush=off" - bug still reproduces.
> 
> 
> This is the current status of this investigation. I decided to start a
> discussion here, to see if someone can point me to something that I overlooked
> or got wrong, before I start changing the POSIX thread pool behavior to
> see if I can force one specific POSIX thread to do a read() if we had a
> write() done on the same fd. Any suggestions?
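
To make sure we mean the same thing by pinning a read() to the thread that
did the write(): a rough Python sketch of an fd-affine pool. It is purely
illustrative. FdAffinePool and its lane selection (fd modulo the number of
workers) are hypothetical and are not how QEMU's thread pool is structured;
the point is only that every request for a given fd lands on the same worker.

```python
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor

class FdAffinePool:
    """Hypothetical pool that routes all I/O for one fd to one worker thread."""

    def __init__(self, workers=4):
        # One single-thread executor per lane; a lane is one fixed thread.
        self._lanes = [ThreadPoolExecutor(max_workers=1) for _ in range(workers)]

    def submit(self, fd, fn, *args):
        # The same fd always maps to the same lane, hence the same thread.
        lane = self._lanes[fd % len(self._lanes)]
        return lane.submit(fn, fd, *args)

    def shutdown(self):
        for lane in self._lanes:
            lane.shutdown(wait=True)

def _pwrite_tagged(fd, data, offset):
    return os.pwrite(fd, data, offset), threading.get_ident()

def _pread_tagged(fd, length, offset):
    return os.pread(fd, length, offset), threading.get_ident()

def demo():
    """Write then read on the same fd; report sizes, data and thread identity."""
    pool = FdAffinePool(workers=4)
    try:
        with tempfile.TemporaryFile() as f:
            fd = f.fileno()
            payload = b"\x8c" * 0x200
            written, wtid = pool.submit(fd, _pwrite_tagged, payload, 0).result()
            data, rtid = pool.submit(fd, _pread_tagged, len(payload), 0).result()
            return written == len(payload), data == payload, wtid == rtid
    finally:
        pool.shutdown()

if __name__ == "__main__":
    print(demo())
```

The obvious cost is that all I/O for one disk is serialized on one thread,
which is effectively the 64-to-1 experiment again on a per-fd basis.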
> 
> 
> 
> ps: it is worth mentioning that I was able to reproduce this same bug on a
> POWER8 system running Ubuntu 18.04. Given that the code we're dealing with
> doesn't have any arch-specific behavior, I wouldn't be surprised if this bug
> is also reproducible on other archs like x86.
> 
> 
> Thanks,
> 
> Daniel
> 
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1561017
> [2] https://github.com/open-power/HTX
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK
