From: Roman Penyaev
Subject: Re: [Qemu-devel] [RFC] virtio-blk: simple multithreaded MQ implementation for bdrv_raw
Date: Mon, 30 May 2016 13:59:47 +0200

On Sat, May 28, 2016 at 12:27 AM, Stefan Hajnoczi <address@hidden> wrote:
> On Fri, May 27, 2016 at 01:55:04PM +0200, Roman Pen wrote:
>> Hello, all.
>>
>> This is an RFC because this patch is mostly a quick attempt to get true
>> multithreaded multiqueue support for a block device with native AIO.
>> The goal is to squeeze everything possible out of a lockless IO path,
>> from the MQ block layer in the guest down to the MQ block layer on the
>> host.
>>
>> To avoid any locks in the qemu backend and to avoid introducing thread
>> safety into the qemu block layer, I open the same backend device several
>> times, one device per MQ (a rough sketch of the wiring follows the
>> structures below).  E.g. the following is the stack for a virtio-blk
>> device with num-queues=2:
>>
>>               VirtIOBlock
>>              /           \
>>    VirtQueue#0           VirtQueue#1
>>     IOThread#0           IOThread#1
>>           BH#0           BH#1
>>      Backend#0           Backend#1
>>              \           /
>>               /dev/nullb0
>>
>> To group all objects related to one vq, a new structure is introduced:
>>
>> typedef struct VirtQueueCtx {
>>     BlockBackend *blk;
>>     struct VirtIOBlock *s;
>>     VirtQueue *vq;
>>     void *rq;
>>     QEMUBH *bh;
>>     QEMUBH *batch_notify_bh;
>>     IOThread *iothread;
>>     Notifier insert_notifier;
>>     Notifier remove_notifier;
>>     /* Operation blocker on BDS */
>>     Error *blocker;
>> } VirtQueueCtx;
>>
>> And VirtIOBlock includes an array of these contexts:
>>
>> typedef struct VirtIOBlock {
>>     VirtIODevice parent_obj;
>> +   VirtQueueCtx mq[VIRTIO_QUEUE_MAX];
>>     ...
>>
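>> To make the wiring above concrete, here is a rough sketch of how one
>> such per-queue context could be set up (sketch only: the helper name
>> virtio_blk_setup_vq_ctx and its callback are made up, and the qemu API
>> signatures are only approximated from the 2.6-era tree):
>>
>> #include "hw/virtio/virtio-blk.h"
>> #include "sysemu/block-backend.h"
>> #include "sysemu/iothread.h"
>> #include "block/aio.h"
>> #include "qapi/error.h"
>>
>> /* Hypothetical per-queue bottom half, body not shown. */
>> static void virtio_blk_vq_bh(void *opaque);
>>
>> static void virtio_blk_setup_vq_ctx(VirtIOBlock *s, unsigned i,
>>                                     const char *filename, Error **errp)
>> {
>>     VirtQueueCtx *ctx = &s->mq[i];
>>     AioContext *aio;
>>
>>     /* The same host device is opened once per queue, so queues never
>>      * share a BlockBackend and no lock is needed between them.  The
>>      * flags roughly match cache=none,aio=native from the command line. */
>>     ctx->blk = blk_new_open(filename, NULL, NULL,
>>                             BDRV_O_RDWR | BDRV_O_NOCACHE | BDRV_O_NATIVE_AIO,
>>                             errp);
>>     if (!ctx->blk) {
>>         return;
>>     }
>>
>>     /* ctx->iothread is assumed to have been created already, one per
>>      * queue.  Binding the backend to that iothread's AioContext keeps
>>      * submission and completion of this queue on a single thread. */
>>     aio = iothread_get_aio_context(ctx->iothread);
>>     blk_set_aio_context(ctx->blk, aio);
>>
>>     /* Per-queue bottom halves run in the same AioContext as well. */
>>     ctx->bh = aio_bh_new(aio, virtio_blk_vq_bh, ctx);
>> }
>>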
>> This patch is based on Stefan's series: "virtio-blk: multiqueue support",
>> with a minor difference: I reverted "virtio-blk: multiqueue batch notify",
>> which does not make a lot of sense when each VQ is handled by its own
>> iothread.
>>
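>> For illustration, with one iothread per VQ the notification path
>> collapses to something per-queue like the sketch below, so there is
>> nothing left to batch across queues (virtio_notify() and
>> qemu_bh_schedule() are real qemu calls; the surrounding function names
>> are invented for the sketch):
>>
>> #include "hw/virtio/virtio.h"
>> #include "block/aio.h"
>>
>> /* Runs in the queue's own iothread via ctx->batch_notify_bh. */
>> static void virtio_blk_batch_notify_bh(void *opaque)
>> {
>>     VirtQueueCtx *ctx = opaque;
>>
>>     /* Notify the guest for this queue only; no cross-queue state. */
>>     virtio_notify(VIRTIO_DEVICE(ctx->s), ctx->vq);
>> }
>>
>> /* Hypothetical request-completion hook, also in the queue's iothread. */
>> static void virtio_blk_complete_one(VirtQueueCtx *ctx)
>> {
>>     /* Coalesce notifications, but only within this single queue. */
>>     qemu_bh_schedule(ctx->batch_notify_bh);
>> }
>>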
>> The qemu configuration stays the same, i.e. set num-queues=N, and N
>> iothreads will be started on demand and N drives will be opened:
>>
>> qemu -device virtio-blk-pci,num-queues=8
>>
>> My configuration is the following:
>>
>> host:
>> Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz,
>> 8 CPUs,
>> /dev/nullb0 as backend with the following parameters:
>> $ cat /sys/module/null_blk/parameters/submit_queues
>> 8
>> $ cat /sys/module/null_blk/parameters/irqmode
>> 1
>>
>> guest:
>> 8 VCPUs
>>
>> qemu:
>> -object iothread,id=t0 \
>> -drive if=none,id=d0,file=/dev/nullb0,format=raw,snapshot=off,cache=none,aio=native \
>> -device virtio-blk-pci,num-queues=$N,iothread=t0,drive=d0,disable-modern=off,disable-legacy=on
>>
>> where $N varies during the tests.
>>
>> fio:
>> [global]
>> description=Emulation of Storage Server Access Pattern
>> bssplit=512/20:1k/16:2k/9:4k/12:8k/19:16k/10:32k/8:64k/4
>> fadvise_hint=0
>> rw=randrw:2
>> direct=1
>>
>> ioengine=libaio
>> iodepth=64
>> iodepth_batch_submit=64
>> iodepth_batch_complete=64
>> numjobs=8
>> gtod_reduce=1
>> group_reporting=1
>>
>> time_based=1
>> runtime=30
>>
>> [job]
>> filename=/dev/vda
>>
>> Results:
>> num-queues    RD bw       WR bw
>> ----------    -----       -----
>>
>> * with 1 iothread *
>>
>> 1 thr 1 mq    1225MB/s    1221MB/s
>> 1 thr 2 mq    1559MB/s    1553MB/s
>> 1 thr 4 mq    1729MB/s    1725MB/s
>> 1 thr 8 mq    1660MB/s    1655MB/s
>>
>> * with N iothreads *
>>
>> 2 thr 2 mq    1845MB/s    1842MB/s
>> 4 thr 4 mq    2187MB/s    2183MB/s
>> 8 thr 8 mq    1383MB/s    1378MB/s
>>
>> Obviously, 8 iothreads + 8 vcpu threads are too much for my machine
>> with 8 CPUs, but 4 iothreads show quite a good result.
>
> Cool, thanks for trying this experiment and posting results.
>
> It's encouraging to see the improvement. Did you use any CPU affinity
> settings to co-locate vcpu and iothreads onto host CPUs?
No, in these measurements I did not try to pin anything.
But the following are results with pinning, take a look:

8 VCPUs, 8 fio jobs
===========================================================
 o each fio job is pinned to a VCPU 1:1
 o VCPUs are not pinned
 o iothreads are not pinned

num queues    RD bw
----------    --------

* with 1 iothread *

1 thr 1 mq    1096MB/s
1 thr 2 mq    1602MB/s
1 thr 4 mq    1818MB/s
1 thr 8 mq    1860MB/s

* with N iothreads *

2 thr 2 mq    2008MB/s
4 thr 4 mq    2267MB/s
8 thr 8 mq    1388MB/s

8 VCPUs, 8 fio jobs
===============================================
 o each fio job is pinned to a VCPU 1:1
 o each VCPU is pinned to a CPU 1:1
 o each iothread is pinned to a CPU 1:1

affinity masks:
   CPUs    01234567
   VCPUs   XXXXXXXX

num queues    RD bw       iothreads affinity mask
----------    --------    -----------------------

* with 1 iothread *

1 thr 1 mq    997MB/s     X-------
1 thr 2 mq    1066MB/s    X-------
1 thr 4 mq    969MB/s     X-------
1 thr 8 mq    1050MB/s    X-------

* with N iothreads *

2 thr 2 mq    1597MB/s    XX------
4 thr 4 mq    1985MB/s    XXXX----
8 thr 8 mq    1230MB/s    XXXXXXXX

4 VCPUs, 4 fio jobs
===============================================
 o each fio job is pinned to a VCPU 1:1
 o VCPUs are not pinned
 o iothreads are not pinned

num queues    RD bw
----------    --------

* with 1 iothread *

1 thr 1 mq    1312MB/s
1 thr 2 mq    1445MB/s
1 thr 4 mq    1505MB/s

* with N iothreads *

2 thr 2 mq    1710MB/s
4 thr 4 mq    1590MB/s

4 VCPUs, 4 fio jobs
===============================================
 o each fio job is pinned to a VCPU 1:1
 o each VCPU is pinned to a CPU 1:1
 o each iothread is pinned to a CPU 1:1

affinity masks:
   CPUs    01234567
   VCPUs   XXXX----

num queues    RD bw       iothreads affinity mask
----------    --------    -----------------------

* with 1 iothread *

1 thr 1 mq    1230MB/s    ----X---
1 thr 2 mq    1357MB/s    ----X---
1 thr 4 mq    1430MB/s    ----X---

* with N iothreads *

2 thr 2 mq    1803MB/s    ----XX--
4 thr 4 mq    1673MB/s    ----XXXX

4 VCPUs, 4 fio jobs
===============================================
 o each fio job is pinned to a VCPU 1:1
 o each VCPU is pinned to CPUs 0123
 o each iothread is pinned to CPUs 4567

affinity masks:
   CPUs    01234567
   VCPUs   XXXX----

num queues    RD bw       iothreads affinity mask
----------    --------    -----------------------

* with 1 iothread *

1 thr 1 mq    1213MB/s    ----XXXX
1 thr 2 mq    1417MB/s    ----XXXX
1 thr 4 mq    1435MB/s    ----XXXX

* with N iothreads *

2 thr 2 mq    1792MB/s    ----XXXX
4 thr 4 mq    1667MB/s    ----XXXX

SUMMARY:

For 8 jobs the only thing I noticed that makes sense is fio job pinning.
On my machine with 8 CPUs there is no room to optimize the execution of
8 jobs.

For 4 jobs and 4 VCPUs I tried to pin VCPU threads and iothreads to
different CPUs: VCPUs go to 0123, iothreads go to 4567.  That seems to
bring some improvement, but not that much.
--
Roman