Re: [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none


From:	Daniel Henrique Barboza
Subject:	Re: [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none and io=threads
Date:	Wed, 16 May 2018 18:40:36 -0300
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.7.0



On 05/16/2018 06:47 AM, Dr. David Alan Gilbert wrote:

* Daniel Henrique Barboza (address@hidden) wrote:

Hi,

I've spent the last two months working on a miscompare issue that happens
when using a RAID device and a SATA drive as scsi-hd (emulated SCSI) with
cache=none and io=threads during a hardware stress test. I'll summarize it
here as best I can without creating a great wall of text - Red Hat folks
can check [1] for all the details.

Using the following setup:

- Host is a POWER9 RHEL 7.5-alt: kernel 4.14.0-49.1.1.el7a.ppc64le,
qemu-kvm-ma 2.10.0-20.el7 (also reproducible with upstream QEMU)

- Guest is RHEL 7.5-alt using the same kernel as the host, using two storage
disks (a 1.8 TB RAID and a 446 GB SATA drive) as follows:

     <disk type='block' device='disk'>
       <driver name='qemu' type='raw' cache='none'/>
       <source dev='/dev/disk/by-id/scsi-3600605b000a2c110ff0004053d84a61b'/>
       <target dev='sdc' bus='scsi'/>
       <alias name='scsi0-0-0-2'/>
       <address type='drive' controller='0' bus='0' target='0' unit='2'/>
     </disk>

Both block devices have WCE off in the host.

With this environment, we found problems when running a stress test called
HTX [2]. At some point (usually after 24+ hours of testing) HTX finds a data
miscompare in one of the devices. This is an example:

-------

Device name: /dev/sdb
Total blocks: 0x74706daf, Block size: 0x200
Rule file name: /usr/lpp/htx/rules/reg/hxestorage/default.hdd
Number of Rulefile passes (cycle) completed: 0
Stanza running: rule_6, Thread no.: 8
Oper performed: wrc, Current seek type: SEQ
LBA no. where IO started: 0x94fa
Transfer size: 0x8400

Miscompare Summary:
===================
LBA no. where miscomapre started:     0x94fa
LBA no. where miscomapre ended:       0x94ff
Miscompare start offset (in bytes):   0x8
Miscomapre end offset (in bytes):     0xbff
Miscompare size (in bytes):           0xbf8

Expected data (at miscomapre offset): 8c9aea5a736462000000000000007275
Actual data (at miscomapre offset): 889aea5a736462000000000000007275

Are all the miscompares single bit errors like that one?

The miscompares differ in size. What is displayed here is the first
snippet of the miscompare data, but in this case the miscompare is 0xbf8
bytes in size.

I've seen cases where the miscompare has the same size as the data
written - the test initializes the disk with a known pattern (bbbbbbb for
example), then a miscompare happens and it finds out that the disk still
had the starting pattern.
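
For reference, the numbers in the report above can be cross-checked with a
few lines of Python. This is only a sketch that uses the values quoted in
the HTX output; the variable names are mine and are not part of HTX. It
shows that the 16 bytes displayed differ in a single bit of the first byte,
while the summary offsets describe a 0xbf8-byte region spread over 6
sectors, so the single-bit difference is a property of the displayed
snippet only.

```python
# Values copied from the HTX report quoted above.
expected = bytes.fromhex("8c9aea5a736462000000000000007275")
actual = bytes.fromhex("889aea5a736462000000000000007275")

# XOR byte by byte; any non-zero byte marks a position that differs.
diff = bytes(e ^ a for e, a in zip(expected, actual))
differing_bytes = [i for i, d in enumerate(diff) if d]
flipped_bits = sum(bin(d).count("1") for d in diff)

# Extent of the miscompare, from the start/end offsets in the summary.
start_offset, end_offset = 0x8, 0xBFF
miscompare_size = end_offset - start_offset + 1

# Sectors spanned by the reported LBA range, at the reported block size.
block_size = 0x200
first_lba, last_lba = 0x94FA, 0x94FF
sectors_spanned = last_lba - first_lba + 1

print(differing_bytes, flipped_bits)
print(hex(miscompare_size), sectors_spanned, hex(sectors_spanned * block_size))
```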

Is the test doing single bit manipulation or is that coming out of the
blue?

As far as I've read in the test suite code, it is writing several
sectors at once and then asserting that the contents were written.


Dave

-----


This means that the test executed a write at LBA 0x94fa and, after
confirming that the write was completed, issued 2 reads at the same LBA to
assert the written contents and found a mismatch.


I've tested all sorts of configurations across disk vs LUN, cache modes and
AIO. My findings are:

- using device='lun' instead of device='disk', I can't reproduce the issue,
no matter what the other settings are;
- using device='disk' but with cache='writethrough', the issue doesn't happen
(haven't checked other cache modes);
- using device='disk', cache='none' and io='native', the issue doesn't happen.


The issue seems to be tied to the combination device=disk + cache=none +
io=threads. I've started digging into the SCSI layer all the way down to the
block backend. With a shameful amount of logging I've discovered that, for the
write on which the test finds a miscompare, in block/file-posix.c:

- when doing the write, handle_aiocb_rw_vector() returns success and pwritev()
reports that all bytes were written
- in both reads after the write, handle_aiocb_rw_vector() returns success and
preadv() reports that all bytes were read. In both reads, the data read is
different from the data written by the pwritev() that happened before

In the discussions at [1], Fam Zheng suggested a test in which we would take
down the number of threads created in the POSIX thread pool from 64 to 1.
The idea is to ensure that we're using the same thread to write and read.
There was a suspicion that the kernel can't guarantee data coherency between
different threads, even if using the same fd, when using pwritev() and
preadv(). This would explain why the following reads in the same fd would
fail to retrieve the same data that was written before. After doing this
modification, the miscompare didn't reproduce.
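
To make the access pattern concrete, here is a minimal userspace sketch of
"write on one thread, then read on another thread through the same fd". It
is not QEMU code and is not expected to reproduce the bug: it uses a
temporary file without O_DIRECT (so unlike cache=none it goes through the
page cache), and the function name, file size and offset are assumptions
made for the illustration. Only os.pwritev() and os.preadv(), which wrap
the syscalls discussed above, come from the standard library.

```python
import os
import tempfile
import threading


def write_then_read_across_threads(path, offset, payload):
    """Write payload with pwritev() on one thread, then read it back with
    preadv() on a different thread through the same file descriptor."""
    fd = os.open(path, os.O_RDWR)
    result = {}

    def writer():
        result["written"] = os.pwritev(fd, [payload], offset)

    def reader():
        buf = bytearray(len(payload))
        result["read"] = os.preadv(fd, [buf], offset)
        result["data"] = bytes(buf)

    try:
        for target in (writer, reader):
            t = threading.Thread(target=target)
            t.start()
            t.join()  # the read is only issued after the write has completed
    finally:
        os.close(fd)
    return result


with tempfile.NamedTemporaryFile() as tmp:
    tmp.truncate(1 << 20)            # hypothetical 1 MiB scratch file
    tmp.flush()
    payload = bytes(range(256)) * 2  # one 0x200-byte sector
    res = write_then_read_across_threads(tmp.name, 0x200 * 8, payload)
    miscompare = res["data"] != payload
    print(res["written"], res["read"], miscompare)
```

On a healthy setup the read returns exactly what was written; the miscompare
described in this thread corresponds to the case where both calls report
success but the data differs.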

After reverting the thread pool size change, I made a couple of attempts at
flushing before read() and flushing after write(). Both attempts failed - the
miscompare appears in both scenarios. This reinforces the suspicion above -
if data coherency can't be guaranteed between different threads, flushing in
different threads wouldn't make a difference either. I've also tested a
suggestion from Fam where I started the disks with
"cache.direct=on,cache.no-flush=off" - bug still reproduces.


This is the current status of this investigation. I decided to start a
discussion here to see if someone can point out something that I overlooked
or got wrong, before I start changing the POSIX thread pool behavior to
see if I can force one specific POSIX thread to do a read() if we had a
write() done on the same fd. Any suggestions?
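
As an illustration of the idea of pinning all I/O for a given fd to one
thread, here is a rough sketch. It is not how QEMU's thread pool works and
not a proposed patch: PerFdWorkers and its methods are hypothetical names,
and it simply gives every fd its own single-threaded executor so that a
read submitted after a write runs on the same thread.

```python
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor


class PerFdWorkers:
    """Hypothetical helper: all requests for a given fd run on one thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._workers = {}

    def _worker_for(self, fd):
        # One single-threaded executor per fd, created on first use.
        with self._lock:
            if fd not in self._workers:
                self._workers[fd] = ThreadPoolExecutor(max_workers=1)
            return self._workers[fd]

    @staticmethod
    def _run(func, fd, buffers, offset):
        # Return the worker's thread id together with the syscall result.
        return threading.get_ident(), func(fd, buffers, offset)

    def pwritev(self, fd, buffers, offset):
        return self._worker_for(fd).submit(self._run, os.pwritev, fd, buffers, offset)

    def preadv(self, fd, buffers, offset):
        return self._worker_for(fd).submit(self._run, os.preadv, fd, buffers, offset)

    def shutdown(self):
        with self._lock:
            for worker in self._workers.values():
                worker.shutdown(wait=True)
            self._workers.clear()


pool = PerFdWorkers()
with tempfile.TemporaryFile() as tmp:
    fd = tmp.fileno()
    payload = b"\xbb" * 0x200
    buf = bytearray(len(payload))
    w_tid, written = pool.pwritev(fd, [payload], 0).result()
    r_tid, nread = pool.preadv(fd, [buf], 0).result()
    pool.shutdown()
    print(w_tid == r_tid, written, nread, bytes(buf) == payload)
```

The trade-off is that requests on the same fd are fully serialized, which is
close to what the single-thread experiment above did, only scoped per fd
instead of for the whole pool.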



ps: it is worth mentioning that I was able to reproduce this same bug on a
POWER8 system running Ubuntu 18.04. Given that the code we're dealing with
doesn't have any arch-specific behavior, I wouldn't be surprised if this bug
is also reproducible on other archs like x86.


Thanks,

Daniel

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1561017
[2] https://github.com/open-power/HTX

--
Dr. David Alan Gilbert / address@hidden / Manchester, UK
