From: Vincenzo Maffione
Subject: [Qemu-devel] [PATCH RFC] virtio: proposal to optimize accesses to VQs
Date: Mon, 14 Dec 2015 15:51:18 +0100

Hi,
  I am running performance experiments to test how QEMU behaves when the
guest is transmitting (short) network packets at very high packet rates,
say over 1 Mpps.
I run a netmap application in the guest to generate the load, but the
generator itself is not relevant to this discussion. The only important
fact is that the generator running in the guest is not the bottleneck:
its CPU utilization is low (about 20%).

Moreover, I am deliberately not using vhost-net to offload virtio-net
processing to the host kernel, because I want to run performance
unit-tests on the QEMU userspace virtio implementation
(hw/virtio/virtio.c).

In the most common benchmarks - e.g. netperf TCP_STREAM, TCP_RR,
UDP_STREAM, ..., with one end of the communication in the guest and the
other in the host, for instance over the simplest TAP networking setup -
the virtio-net adapter clearly outperforms the emulated e1000 adapter
(and all the other emulated devices). This was expected, given the
benefits of I/O paravirtualization.

However, I was surprised to find out that the situation changes drastically
at very high packet rates.

My measurements show that the emulated e1000 adapter is able to transmit
over 3.5 Mpps when the network backend is disconnected. I disconnect the
backend precisely to see at what packet rate the device emulation itself
becomes the bottleneck.

The same experiment, however, shows that virtio-net hits a bottleneck at
only 1 Mpps. After verifying that TX VQ kicks and TX VQ interrupts are
properly amortized/suppressed, I found that the bottleneck is partially
due to the way the code accesses the VQ in guest physical memory, since
each access involves an expensive address space translation. For each VQ
element processed, I counted over 15 such accesses, while e1000 does just
2 accesses to its rings.
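
To give an idea of the pattern, here is a toy model I wrote for
illustration - not the QEMU code: gpa_to_hva() merely stands in for
QEMU's address space translation machinery, and the field-by-field
loads mimic what the vring_avail_*/vring_desc_* helpers in
hw/virtio/virtio.c end up doing:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint8_t guest_mem[1 << 20];     /* fake guest "physical" memory */
static unsigned long n_translations;   /* how many translations we paid */

/* Stand-in for the guest-physical -> host address translation; the
 * only point here is that it runs once per load. */
static void *gpa_to_hva(uint64_t gpa)
{
    n_translations++;
    return &guest_mem[gpa];
}

static uint16_t ld16_phys(uint64_t gpa)
{
    uint16_t v;
    memcpy(&v, gpa_to_hva(gpa), sizeof(v));
    return v;
}

static uint32_t ld32_phys(uint64_t gpa)
{
    uint32_t v;
    memcpy(&v, gpa_to_hva(gpa), sizeof(v));
    return v;
}

static uint64_t ld64_phys(uint64_t gpa)
{
    uint64_t v;
    memcpy(&v, gpa_to_hva(gpa), sizeof(v));
    return v;
}

/* One element, accessed field by field: every load pays one
 * translation. Offsets follow the virtio vring layout (16-byte
 * descriptors; avail.ring[] starts at offset 4). */
static void pop_element_field_by_field(uint64_t avail_gpa,
                                       uint64_t desc_gpa, uint16_t idx)
{
    uint16_t head  = ld16_phys(avail_gpa + 4 + 2 * idx);   /* avail.ring[idx] */
    uint64_t addr  = ld64_phys(desc_gpa + 16 * head);      /* desc.addr  */
    uint32_t len   = ld32_phys(desc_gpa + 16 * head + 8);  /* desc.len   */
    uint16_t flags = ld16_phys(desc_gpa + 16 * head + 12); /* desc.flags */
    uint16_t next  = ld16_phys(desc_gpa + 16 * head + 14); /* desc.next  */
    (void)addr; (void)len; (void)flags; (void)next;
}

int main(void)
{
    pop_element_field_by_field(0x1000, 0x2000, 0);
    printf("translations for one element: %lu\n", n_translations);
    return 0;
}

And this is before counting the avail.idx and used ring accesses, the
event-idx updates and the chained/indirect descriptor walks, which is
how the real count climbs past 15 per element.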

This patch slightly rewrites the code to reduce the number of accesses,
since many of them seem unnecessary to me. With this reduction, the
bottleneck moves from 1 Mpps to 2 Mpps.
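
Roughly, the direction is the following - again a sketch on top of the
toy model above, not the actual patch: resolve the rings to host
pointers once (in QEMU terms, think of something like an
address_space_map() done up front and invalidated when the guest memory
map changes) and then read each descriptor with a single copy:

/* Continuing the toy model: the avail ring and the descriptor table
 * have already been translated to host pointers, so popping an
 * element costs two memcpy()s and zero translations. */
struct vring_desc_host {      /* same layout as the guest descriptor */
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

static void pop_element_cached(const uint8_t *avail_hva, /* mapped once */
                               const uint8_t *desc_hva,  /* mapped once */
                               uint16_t idx)
{
    uint16_t head;
    struct vring_desc_host d;

    memcpy(&head, avail_hva + 4 + 2 * idx, sizeof(head)); /* avail.ring[idx] */
    memcpy(&d, desc_hva + 16 * head, sizeof(d)); /* whole descriptor, 1 copy */
    (void)d;
    /* Endianness conversion of d.addr/d.len/... would go here; as
     * noted below, the patch does not handle that properly yet. */
}

The important part is the direction: paying the translation once per
ring (or at least once per batch of elements), instead of once per
field of every descriptor.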

The patch is not complete (e.g. it still does not properly handle
endianness, it is not clean, etc.). I just wanted to ask whether you
think the idea makes sense, and whether a proper patch in this direction
would be accepted.

Thanks,
  Vincenzo

Vincenzo Maffione (1):
  virtio: optimize access to guest physical memory

 hw/virtio/virtio.c | 118 +++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 88 insertions(+), 30 deletions(-)

-- 
2.6.3



