From: Roman Penyaev
Subject: Re: [Qemu-devel] [RFC] virtio-blk: simple multithreaded MQ implementation for bdrv_raw
Date: Mon, 30 May 2016 13:59:47 +0200

On Sat, May 28, 2016 at 12:27 AM, Stefan Hajnoczi <address@hidden> wrote:
> On Fri, May 27, 2016 at 01:55:04PM +0200, Roman Pen wrote:
>> Hello, all.
>>
>> This is an RFC because this patch is mostly a quick attempt to get true
>> multithreaded multiqueue support for a block device with native AIO.
>> The goal is to squeeze everything possible out of a lockless IO path,
>> from the MQ block layer on the guest down to the MQ block device on the
>> host.
>>
>> To avoid any locks in the qemu backend and to avoid introducing thread
>> safety into the qemu block layer, I open the same backend device several
>> times, one device per MQ.  E.g. the following is the stack for a
>> virtio-blk device with num-queues=2:
>>
>>             VirtIOBlock
>>                /   \
>>      VirtQueue#0   VirtQueue#1
>>       IOThread#0    IOThread#1
>>          BH#0          BH#1
>>       Backend#0     Backend#1
>>                \   /
>>              /dev/null0
>>
>> To group all objects related to one vq, a new structure is introduced:
>>
>>     typedef struct VirtQueueCtx {
>>         BlockBackend *blk;
>>         struct VirtIOBlock *s;
>>         VirtQueue *vq;
>>         void *rq;
>>         QEMUBH *bh;
>>         QEMUBH *batch_notify_bh;
>>         IOThread *iothread;
>>         Notifier insert_notifier;
>>         Notifier remove_notifier;
>>         /* Operation blocker on BDS */
>>         Error *blocker;
>>     } VirtQueueCtx;
>>
>> And VirtIOBlock includes an array of these contexts:
>>
>>      typedef struct VirtIOBlock {
>>          VirtIODevice parent_obj;
>>     +    VirtQueueCtx mq[VIRTIO_QUEUE_MAX];
>>      ...
>>
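>> Not part of the patch, but to make the idea concrete, here is a
>> hypothetical standalone sketch of the same scheme in plain libaio +
>> pthreads (names below are made up for illustration):
>>
>>     /* mq-sketch.c: illustration only, not the patch itself.
>>      * One fd, one io_context_t and one "iothread" per queue, all on
>>      * the same backend device, so the submission path shares nothing.
>>      * Build: gcc -O2 mq-sketch.c -o mq-sketch -laio -lpthread
>>      */
>>     #define _GNU_SOURCE
>>     #include <fcntl.h>
>>     #include <libaio.h>
>>     #include <pthread.h>
>>     #include <stdio.h>
>>     #include <stdlib.h>
>>     #include <string.h>
>>     #include <unistd.h>
>>
>>     #define NR_QUEUES 2
>>
>>     /* rough standalone analogue of VirtQueueCtx */
>>     struct queue_ctx {
>>         int          fd;      /* backend, opened once per queue    */
>>         io_context_t aio;     /* private native AIO context        */
>>         pthread_t    thread;  /* the "iothread" serving this queue */
>>         int          idx;
>>     };
>>
>>     static void *queue_thread(void *opaque)
>>     {
>>         struct queue_ctx *q = opaque;
>>         struct iocb cb, *cbs[1] = { &cb };
>>         struct io_event ev;
>>         void *buf;
>>
>>         if (posix_memalign(&buf, 4096, 4096))  /* O_DIRECT alignment */
>>             return NULL;
>>
>>         /* submit one read and reap it, just to show the per-queue path */
>>         io_prep_pread(&cb, q->fd, buf, 4096, 0);
>>         if (io_submit(q->aio, 1, cbs) == 1 &&
>>             io_getevents(q->aio, 1, 1, &ev, NULL) == 1)
>>             printf("queue %d: read %ld bytes\n", q->idx, (long)ev.res);
>>
>>         free(buf);
>>         return NULL;
>>     }
>>
>>     int main(int argc, char **argv)
>>     {
>>         const char *path = argc > 1 ? argv[1] : "/dev/nullb0";
>>         struct queue_ctx q[NR_QUEUES];
>>         int i;
>>
>>         for (i = 0; i < NR_QUEUES; i++) {
>>             q[i].idx = i;
>>             /* same device opened per queue: Backend#0, Backend#1, ... */
>>             q[i].fd = open(path, O_RDONLY | O_DIRECT);
>>             memset(&q[i].aio, 0, sizeof(q[i].aio));
>>             if (q[i].fd < 0 || io_setup(64, &q[i].aio) < 0) {
>>                 fprintf(stderr, "failed to set up queue %d\n", i);
>>                 return 1;
>>             }
>>             pthread_create(&q[i].thread, NULL, queue_thread, &q[i]);
>>         }
>>         for (i = 0; i < NR_QUEUES; i++) {
>>             pthread_join(q[i].thread, NULL);
>>             io_destroy(q[i].aio);
>>             close(q[i].fd);
>>         }
>>         return 0;
>>     }
>>
>> The only point is that each queue owns its fd and AIO context end to
>> end, which is what the per-queue BlockBackend above is meant to give
>> inside qemu.
>>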
>> This patch is based on Stefan's series: "virtio-blk: multiqueue support",
>> with one minor difference: I reverted "virtio-blk: multiqueue batch
>> notify", which does not make a lot of sense when each VQ is handled by
>> its own iothread.
>>
>> The qemu configuration stays the same, i.e. specify num-queues=N and N
>> iothreads will be started on demand and N drives will be opened:
>>
>>     qemu -device virtio-blk-pci,num-queues=8
>>
>> My configuration is the following:
>>
>> host:
>>     Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz,
>>     8 CPUs,
>>     /dev/nullb0 as backend with the following parameters:
>>       $ cat /sys/module/null_blk/parameters/submit_queues
>>       8
>>       $ cat /sys/module/null_blk/parameters/irqmode
>>       1
>>
>> guest:
>>     8 VCPUs
>>
>> qemu:
>>     -object iothread,id=t0 \
>>     -drive if=none,id=d0,file=/dev/nullb0,format=raw,snapshot=off,cache=none,aio=native \
>>     -device virtio-blk-pci,num-queues=$N,iothread=t0,drive=d0,disable-modern=off,disable-legacy=on
>>
>>     where $N varies during the tests.
>>
>> fio:
>>     [global]
>>     description=Emulation of Storage Server Access Pattern
>>     bssplit=512/20:1k/16:2k/9:4k/12:8k/19:16k/10:32k/8:64k/4
>>     fadvise_hint=0
>>     rw=randrw:2
>>     direct=1
>>
>>     ioengine=libaio
>>     iodepth=64
>>     iodepth_batch_submit=64
>>     iodepth_batch_complete=64
>>     numjobs=8
>>     gtod_reduce=1
>>     group_reporting=1
>>
>>     time_based=1
>>     runtime=30
>>
>>     [job]
>>     filename=/dev/vda
>>
>> Results:
>>     num-queues   RD bw      WR bw
>>     ----------   -----      -----
>>
>>     * with 1 iothread *
>>
>>     1 thr 1 mq   1225MB/s   1221MB/s
>>     1 thr 2 mq   1559MB/s   1553MB/s
>>     1 thr 4 mq   1729MB/s   1725MB/s
>>     1 thr 8 mq   1660MB/s   1655MB/s
>>
>>     * with N iothreads *
>>
>>     2 thr 2 mq   1845MB/s   1842MB/s
>>     4 thr 4 mq   2187MB/s   2183MB/s
>>     8 thr 8 mq   1383MB/s   1378MB/s
>>
>> Obviously, 8 iothreads + 8 vcpu threads are too much for my machine
>> with 8 CPUs, but 4 iothreads show quite good results.
>
> Cool, thanks for trying this experiment and posting results.
>
> It's encouraging to see the improvement.  Did you use any CPU affinity
> settings to co-locate vcpu and iothreads onto host CPUs?

No, in these measurements I did not try to pin anything.
But the following are results with pinning, take a look:

8 VCPUs, 8 fio jobs
===========================================================

 o each fio job is pinned to a VCPU 1:1
 o VCPUs are not pinned
 o iothreads are not pinned

num queues   RD bw
----------   --------

* with 1 iothread *

1 thr 1 mq   1096MB/s
1 thr 2 mq   1602MB/s
1 thr 4 mq   1818MB/s
1 thr 8 mq   1860MB/s

* with N iothreads *

2 thr 2 mq   2008MB/s
4 thr 4 mq   2267MB/s
8 thr 8 mq   1388MB/s



8 VCPUs, 8 fio jobs
===============================================

 o each fio job is pinned to a VCPU 1:1
 o each VCPU is pinned to a CPU 1:1
 o each iothread is pinned to a CPU 1:1

affinity masks:
     CPUs   01234567
    VCPUs   XXXXXXXX

num queues   RD bw      iothreads affinity mask
----------   --------   -----------------------

* with 1 iothread *

1 thr 1 mq   997MB/s    X-------
1 thr 2 mq   1066MB/s   X-------
1 thr 4 mq   969MB/s    X-------
1 thr 8 mq   1050MB/s   X-------

* with N iothreads *

2 thr 2 mq   1597MB/s   XX------
4 thr 4 mq   1985MB/s   XXXX----
8 thr 8 mq   1230MB/s   XXXXXXXX



4 VCPUs, 4 fio jobs
===============================================

 o each fio job is pinned to a VCPU 1:1
 o VCPUs are not pinned
 o iothreads are not pinned

num queues   RD bw
----------   --------

* with 1 iothread *

1 thr 1 mq   1312MB/s
1 thr 2 mq   1445MB/s
1 thr 4 mq   1505MB/s

* with N iothreads *

2 thr 2 mq   1710MB/s
4 thr 4 mq   1590MB/s



4 VCPUs, 4 fio jobs
===============================================

 o each fio job is pinned to a VCPU 1:1
 o each VCPU is pinned to a CPU 1:1
 o each iothread is pinned to a CPU 1:1

affinity masks:
     CPUs   01234567
    VCPUs   XXXX----

num queues   RD bw      iothreads affinity mask
----------   --------   -----------------------

* with 1 iothread *

1 thr 1 mq   1230MB/s   ----X---
1 thr 2 mq   1357MB/s   ----X---
1 thr 4 mq   1430MB/s   ----X---

* with N iothreads *

2 thr 2 mq   1803MB/s   ----XX--
4 thr 4 mq   1673MB/s   ----XXXX



4 VCPUs, 4 fio jobs
===============================================

 o each fio job is pinned to a VCPU 1:1
 o each VCPU is pinned to CPUs 0-3
 o each iothread is pinned to CPUs 4-7

affinity masks:
     CPUs   01234567
    VCPUs   XXXX----

num queues   RD bw      iothreads affinity mask
----------   --------   -----------------------

* with 1 iothread *

1 thr 1 mq   1213MB/s   ----XXXX
1 thr 2 mq   1417MB/s   ----XXXX
1 thr 4 mq   1435MB/s   ----XXXX

* with N iothreads *

2 thr 2 mq   1792MB/s   ----XXXX
4 thr 4 mq   1667MB/s   ----XXXX


SUMMARY:

For 8 jobs the only thing I noticed that makes sense is fio job pinning.
On my machine with 8 CPUs there is no room left to optimize the
execution of 8 jobs.

For 4 jobs and 4 VCPUs I tried to pin VCPU threads and iothreads to
different CPUs: VCPUs go to CPUs 0-3, iothreads go to CPUs 4-7.  It
seems that brings something, but not that much.
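
For reference, here is a minimal sketch of the kind of per-thread CPU
pinning described above (illustrative only, not the exact tooling used
for these runs; thread ids of VCPUs and iothreads can be found e.g. via
QMP query-cpus / query-iothreads or under /proc/<pid>/task):

    /* pin.c: pin one thread (by tid) to one host CPU, roughly
     * what `taskset -pc <cpu> <tid>` does.
     * Build: gcc -O2 pin.c -o pin
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    int main(int argc, char **argv)
    {
        pid_t tid;
        int cpu;
        cpu_set_t set;

        if (argc != 3) {
            fprintf(stderr, "usage: %s <tid> <cpu>\n", argv[0]);
            return 1;
        }
        tid = (pid_t)atoi(argv[1]);
        cpu = atoi(argv[2]);

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);

        /* with a tid, sched_setaffinity() pins just that thread */
        if (sched_setaffinity(tid, sizeof(set), &set)) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned tid %d to CPU %d\n", (int)tid, cpu);
        return 0;
    }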

--
Roman


