Re: [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only]

From: Eric Blake
Subject: Re: [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only]
Date: Fri, 10 Mar 2023 13:04:12 -0600
User-agent: NeoMutt/20220429

On Thu, Mar 09, 2023 at 11:39:42AM +0000, Richard W.M. Jones wrote:
> [ Patch series also available here, along with this cover letter and the
>   script used to generate test results:
>   https://gitlab.com/rwmjones/qemu/-/commits/2023-nbd-multi-conn-v1 ]
>
> This patch series adds multi-conn support to the NBD block driver in
> qemu.  It is only meant for discussion and testing because it has a
> number of obvious shortcomings (see "XXX" in commit messages and
> code).  If we decided this was a good idea, we can work on a better
> patch.

Overall, I'm in favor of this.  A longer term project might be to have
qemu's NBD client code call into libnbd instead of reimplementing
things itself, at which point having libnbd manage multi-conn under
the hood would be awesome, but as that's a much bigger effort, a
shorter-term task of having qemu itself handle parallel sockets seems
worthwhile.

>  - It works effectively for qemu client & nbdkit server, especially in
>    cases where the server does large, heavyweight requests.  This is
>    important for us because virt-v2v uses an nbdkit Python plugin and
>    various other heavyweight plugins (eg. plugins that access remote
>    servers for each request).
>  - It seems to make little or no difference with qemu + qemu-nbd
>    server.  I speculate that's because qemu-nbd doesn't support system
>    threads, so networking is bottlenecked through a single core.  Even
>    though there are coroutines handling different sockets, they must
>    all wait in turn to issue send(3) or recv(3) calls on the same
>    core.

Is the current work to teach qemu to do multi-queue (that is, spread
the I/O load for a single block device across multiple cores) going to
help here?  I haven't been following the multi-queue efforts closely
enough to know if the approach used in this series will play nicely,
or need even further overhaul.

>  - qemu-img unfortunately uses a single thread for all coroutines so
>    it suffers from a similar problem to qemu-nbd.  This change would
>    be much more effective if we could distribute coroutines across
>    threads.

qemu-img uses the same client code as qemu-nbd; any multi-queue
improvements that can spread the send()/recv() load of multiple
sockets across multiple cores will benefit both programs.

>  - For tests which are highly bottlenecked on disk I/O (eg. the large
>    local file test and null test) multi-conn doesn't make much
>    difference.

As long as it isn't adding too much penalty, that's okay.  If the
saturation is truly at the point of how fast disk requests can be
served, it doesn't matter if we can queue up more of those requests in
parallel across multiple NBD sockets.
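
To make that reasoning concrete, here is a toy model (my own
illustration, not something measured in this thread): the achievable
rate is capped by whichever of the disk or the aggregate NBD sockets is
slower.  The function name and all bandwidth figures are hypothetical.

```python
def effective_throughput(disk_mb_s, per_conn_mb_s, connections):
    """Toy model: the total rate is capped by the slower of disk and network."""
    # Assumption: sockets scale linearly, which real systems rarely do.
    network_mb_s = per_conn_mb_s * connections
    return min(disk_mb_s, network_mb_s)


# Hypothetical numbers, not measurements from the tests above.
for n in (1, 2, 4):
    rate = effective_throughput(disk_mb_s=500, per_conn_mb_s=400, connections=n)
    print(f"{n} connection(s): {rate} MB/s")
```

Once the disk term is the minimum, adding sockets changes nothing in
this model, which is consistent with the flat numbers in the local file
and null tests.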

>  - Multi-conn even with only 2 connections can make up for the
>    overhead of range requests, exceeding the performance of wget.

That alone is a rather cool result, and an argument in favor of
further developing this.
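
As a rough sketch of why parallel sockets can hide per-request
overhead: a download can be cut into fixed-size ranges that are dealt
out to the connections in turn, so several range requests are in flight
at once.  The helper names and sizes below are hypothetical and are not
taken from the patch series or from nbdkit.

```python
def split_ranges(size, chunk):
    """Return (offset, length) pairs covering [0, size) in chunk-sized pieces."""
    return [(off, min(chunk, size - off)) for off in range(0, size, chunk)]


def assign_round_robin(ranges, connections):
    """Map each range to a connection index, cycling through the connections."""
    plan = {i: [] for i in range(connections)}
    for n, byte_range in enumerate(ranges):
        plan[n % connections].append(byte_range)
    return plan


# Hypothetical tiny image: 10 bytes, 4-byte requests, 2 connections.
plan = assign_round_robin(split_ranges(size=10, chunk=4), connections=2)
print(plan)
```

With two connections, each socket only waits on half of the range
requests, which is one plausible reading of why 2 connections already
match wget in the curlremote test.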

>  - In the curlremote test, qemu-nbd is especially slow, for unknown
>    reasons.
>
> Integrity test (./multi-conn.pl integrity)
> ==========================================
> nbdkit-sparse-random-plugin
>   |                 ^
>   | nbd+unix        | nbd+unix
>   v                 |
>    qemu-img convert
> Reading from and writing the same data back to nbdkit sparse-random
> plugin checks that the data written is the same as the data read.
> This uses two Unix domain sockets, with or without multi-conn.  This
> test is mainly here to check we don't crash or corrupt data with this
> patch.
>   server          client        multi-conn
>   ---------------------------------------------------------------
>     nbdkit      qemu-img      [u/s]   9.07s   
>     nbdkit      qemu-img      1       9.05s   
>     nbdkit      qemu-img      2       9.02s   
>     nbdkit      qemu-img      4       8.98s   
> [u/s] = upstream qemu 7.2.0

How many of these timing numbers can be repeated with TLS in the mix?

> Curl local server test (./multi-conn.pl curlhttp)
> =================================================
> Localhost Apache serving a file over http
>                   |
>                   | http
>                   v
> nbdkit-curl-plugin   or   qemu-nbd
>                   |
>                   | nbd+unix
>                   v
> qemu-img convert   or   nbdcopy
> We download an image from a local web server through
> nbdkit-curl-plugin or qemu-nbd using the curl block driver, over NBD.
> The image is copied to /dev/null.
>   server          client        multi-conn
>   ---------------------------------------------------------------
>   qemu-nbd       nbdcopy      1       8.88s   
>   qemu-nbd       nbdcopy      2       8.64s   
>   qemu-nbd       nbdcopy      4       8.37s   
>   qemu-nbd      qemu-img      [u/s]   6.47s

Do we have any good feel for why qemu-img is faster than nbdcopy in
the baseline?  But improving that is orthogonal to this series.

>   qemu-nbd      qemu-img      1       6.56s   
>   qemu-nbd      qemu-img      2       6.63s   
>   qemu-nbd      qemu-img      4       6.50s   
>     nbdkit       nbdcopy      1       12.15s  

I'm assuming this is nbdkit with your recent in-progress patches to
have the curl plugin serve parallel requests.  But this is another place
where we can investigate why nbdkit is not as performant as qemu-nbd at
utilizing curl.

>     nbdkit       nbdcopy      2       7.05s   (72.36% better)
>     nbdkit       nbdcopy      4       3.54s   (242.90% better)

That one is impressive!

>     nbdkit      qemu-img      [u/s]   6.90s   
>     nbdkit      qemu-img      1       7.00s   

Minimal penalty for adding the code but not utilizing it...

>     nbdkit      qemu-img      2       3.85s   (79.15% better)
>     nbdkit      qemu-img      4       3.85s   (79.15% better)

...and definitely shows its worth.
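
For anyone checking the "(N% better)" and "(N% worse)" annotations:
they appear to be the ratio of the slower time to the faster time,
minus one.  That is my inference from the numbers, not something stated
in the cover letter, and the function name below is hypothetical.

```python
def relative_change(baseline_s, measured_s):
    """Describe measured_s relative to baseline_s as a '% better/worse' string."""
    if measured_s <= baseline_s:
        # Faster than baseline: how many percent more work per unit time.
        pct = (baseline_s / measured_s - 1) * 100
        return f"{pct:.2f}% better"
    # Slower than baseline: how many percent more time was needed.
    pct = (measured_s / baseline_s - 1) * 100
    return f"{pct:.2f}% worse"


# Timings copied from the quoted tables.
print(relative_change(58.19, 68.29))   # wget vs. nbdkit + upstream qemu-img
print(relative_change(12.15, 7.05))    # nbdkit + nbdcopy, 1 vs. 2 connections
```

The first case reproduces the quoted 17.36% exactly; the second lands a
few hundredths of a percent away from the quoted 72.36%, presumably
because the script computed from unrounded timings.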

> Curl local file test (./multi-conn.pl curlfile)
> ===============================================
> nbdkit-curl-plugin   using file:/// URI
>                   |
>                   | nbd+unix
>                   v
> qemu-img convert   or   nbdcopy
> We download from a file:/// URI.  This test is designed to exercise
> NBD and some curl internal paths without the overhead from an external
> server.  qemu-nbd doesn't support file:/// URIs so we cannot duplicate
> the test for qemu as server.
>   server          client        multi-conn
>   ---------------------------------------------------------------
>     nbdkit       nbdcopy      1       31.32s  
>     nbdkit       nbdcopy      2       20.29s  (54.38% better)
>     nbdkit       nbdcopy      4       13.22s  (136.91% better)
>     nbdkit      qemu-img      [u/s]   31.55s  

Here, the baseline is already comparable; both nbdcopy and qemu-img
are parsing the image off nbdkit in about the same amount of time.

>     nbdkit      qemu-img      1       31.70s  

And again, minimal penalty for having the new code in place but not
exploiting it.

>     nbdkit      qemu-img      2       21.60s  (46.07% better)
>     nbdkit      qemu-img      4       13.88s  (127.25% better)

Plus an obvious benefit when the parallel sockets matter.

> Curl remote server test (./multi-conn.pl curlremote)
> ====================================================
> nbdkit-curl-plugin   using http://remote/*.qcow2 URI
>          |
>          | nbd+unix
>          v
> qemu-img convert
> We download from a remote qcow2 file to a local raw file, converting
> between formats during copying.
> qemu-nbd   using http://remote/*.qcow2 URI
>     |
>     | nbd+unix
>     v
> qemu-img convert
> Similarly, replacing nbdkit with qemu-nbd (treating the remote file as
> if it is raw, so the conversion is still done by qemu-img).
> Additionally we compare downloading the file with wget (note this
> doesn't include the time for conversion, but that should only be a few
> seconds).
>   server          client        multi-conn
>   ---------------------------------------------------------------
>          -          wget      1       58.19s  
>     nbdkit      qemu-img      [u/s]   68.29s  (17.36% worse)
>     nbdkit      qemu-img      1       67.85s  (16.60% worse)
>     nbdkit      qemu-img      2       58.17s  

Comparable to wget on paper, but a win in practice (since the wget
step also has to add a post-download qemu-img local conversion step).

>     nbdkit      qemu-img      4       59.80s  
>     nbdkit      qemu-img      6       59.15s  
>     nbdkit      qemu-img      8       59.52s  
>   qemu-nbd      qemu-img      [u/s]   202.55s
>   qemu-nbd      qemu-img      1       204.61s 
>   qemu-nbd      qemu-img      2       196.73s 
>   qemu-nbd      qemu-img      4       179.53s (12.83% better)
>   qemu-nbd      qemu-img      6       181.70s (11.48% better)
>   qemu-nbd      qemu-img      8       181.05s (11.88% better)

Less dramatic results here, but still nothing horrible.

> Local file test (./multi-conn.pl file)
> ======================================
> qemu-nbd or nbdkit serving a large local file
>                   |
>                   | nbd+unix
>                   v
> qemu-img convert   or   nbdcopy
> We download a local file over NBD.  The image is copied to /dev/null.
>   server          client        multi-conn
>   ---------------------------------------------------------------
>   qemu-nbd       nbdcopy      1       15.50s  
>   qemu-nbd       nbdcopy      2       14.36s  
>   qemu-nbd       nbdcopy      4       14.32s  
>   qemu-nbd      qemu-img      [u/s]   10.16s  

Once again, we're seeing qemu-img baseline faster than nbdcopy as
client.  But throwing more sockets at either client does improve
performance, except for...

>   qemu-nbd      qemu-img      1       11.17s  (10.01% worse)

...this one looks bad.  Is it a case of this series adding more mutex
work (qemu-img is making parallel requests; each request then contends
for the mutex only to learn that it will be using the same NBD
connection)?  And your comments about smarter round-robin schemes mean
there may still be room to avoid this much of a penalty.
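
To illustrate the kind of contention I'm speculating about (this is a
sketch in Python, not the C code from the series, and the class and
connection names are hypothetical): a round-robin picker that takes a
lock on every request, plus a fast path that skips the lock when there
is only one connection to choose from.

```python
import threading


class ConnectionPicker:
    """Round-robin selection over a fixed list of connections."""

    def __init__(self, connections):
        self._connections = list(connections)
        self._next = 0
        self._lock = threading.Lock()

    def pick(self):
        # Fast path: with a single connection there is nothing to choose,
        # so parallel requests need not contend for the lock at all.
        if len(self._connections) == 1:
            return self._connections[0]
        with self._lock:
            conn = self._connections[self._next]
            self._next = (self._next + 1) % len(self._connections)
            return conn


picker = ConnectionPicker(["conn0", "conn1"])
print([picker.pick() for _ in range(4)])
```

Whether the multi-conn=1 slowdown really comes from a lock like this is
something only profiling the series can confirm.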

>   qemu-nbd      qemu-img      2       10.35s  
>   qemu-nbd      qemu-img      4       10.39s  
>     nbdkit       nbdcopy      1       9.10s   

This one is interesting: nbdkit as server performs better than
qemu-nbd.

>     nbdkit       nbdcopy      2       8.25s   
>     nbdkit       nbdcopy      4       8.60s   
>     nbdkit      qemu-img      [u/s]   8.64s   
>     nbdkit      qemu-img      1       9.38s   
>     nbdkit      qemu-img      2       8.69s   
>     nbdkit      qemu-img      4       8.87s   
>
> Null test (./multi-conn.pl null)
> ================================
> qemu-nbd with null-co driver  or  nbdkit-null-plugin + noextents filter
>                   |
>                   | nbd+unix
>                   v
> qemu-img convert   or   nbdcopy
> This is like the local file test above, but without needing a file.
> Instead all zeroes (fully allocated) are downloaded over NBD.

And I'm sure that if you allowed block status to show the holes, the
performance would be a lot faster, but that would be testing something
completely different ;)
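
A toy illustration of the difference (hypothetical extent list and
function name, not real NBD block status output): when the client
honors holes it only has to read the data extents, while hiding
extents forces it to read everything.

```python
def bytes_to_transfer(extents, honor_holes):
    """Sum the bytes a client must read.

    extents is a list of (length, is_hole) tuples.  When honor_holes is
    False, holes are read like ordinary data.
    """
    return sum(
        length
        for length, is_hole in extents
        if not (honor_holes and is_hole)
    )


# Hypothetical image: 100 bytes of data followed by a 900-byte hole.
extents = [(100, False), (900, True)]
print(bytes_to_transfer(extents, honor_holes=True))
print(bytes_to_transfer(extents, honor_holes=False))
```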

>   server          client        multi-conn
>   ---------------------------------------------------------------
>   qemu-nbd       nbdcopy      1       14.86s  
>   qemu-nbd       nbdcopy      2       17.08s  (14.90% worse)
>   qemu-nbd       nbdcopy      4       17.89s  (20.37% worse)

Oh, that's weird.  I wonder if qemu's null-co driver has some poor
mutex behavior when being hit by parallel I/O.  Seems like
investigating that can be separate from this series, though.

>   qemu-nbd      qemu-img      [u/s]   13.29s  

And another point where qemu-img is faster than nbdcopy as a
single-client baseline.

>   qemu-nbd      qemu-img      1       13.31s  
>   qemu-nbd      qemu-img      2       13.00s  
>   qemu-nbd      qemu-img      4       12.62s  
>     nbdkit       nbdcopy      1       15.06s  
>     nbdkit       nbdcopy      2       12.21s  (23.32% better)
>     nbdkit       nbdcopy      4       11.67s  (29.10% better)
>     nbdkit      qemu-img      [u/s]   17.13s  
>     nbdkit      qemu-img      1       17.11s  
>     nbdkit      qemu-img      2       16.82s  
>     nbdkit      qemu-img      4       18.81s  

Overall, I'm looking forward to seeing this go in (8.1 material; we're
too close to 8.0).

Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
